AMD-GFX Archive on lore.kernel.org
* [RFC v8 00/12] Fair DRM scheduler
@ 2025-09-03 10:18 Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests Tvrtko Ursulin
                   ` (11 more replies)
  0 siblings, 12 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Leo Liu, Matthew Brost, Philipp Stanner,
	Pierre-Eric Pelloux-Prayer, Michel Dänzer

As a summary, the new scheduling algorithm is inspired by the original Linux
CFS and so far no scheduling regressions have been found relative to FIFO.
There are improvements in fairness and in scheduling of interactive clients
when running in parallel with a heavy GPU load (for example, Pierre-Eric has
one viewperf medical test which shows a nice improvement with amdgpu).

At a high level, the main advantages of the series are:

 1. Scheduling quality - schedules better than FIFO, solves priority starvation.
 2. Code simplification - no more multiple run queues and multiple algorithms.
 3. Virtual GPU time based scheduling enables relatively simple addition
    of a scheduling cgroup controller in the future.
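
To illustrate point 3, here is a minimal sketch of the idea (illustrative
only, not the patch code; all names are invented for the example). Entities
are charged the GPU time their jobs consume and the entity with the least
accumulated virtual time runs next, which is also the natural hook for a
future cgroup controller to scale the charged time per group:

  /* Illustrative sketch only - not code from the series. */
  struct example_entity {
  	ktime_t vruntime;	/* accumulated virtual GPU time */
  };

  /* Charge the GPU time consumed by a finished job to its entity. */
  static void example_charge(struct example_entity *e, ktime_t runtime)
  {
  	e->vruntime = ktime_add(e->vruntime, runtime);
  }

  /* Select the entity with the least virtual time to run next. */
  static struct example_entity *
  example_pick(struct example_entity *e, unsigned int count)
  {
  	struct example_entity *best = NULL;
  	unsigned int i;

  	for (i = 0; i < count; i++)
  		if (!best || ktime_before(e[i].vruntime, best->vruntime))
  			best = &e[i];

  	return best;
  }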

There is a little bit more detailed write up on the motivation and results in
the form of a blog post which may be easier to read:
https://blogs.igalia.com/tursulin/fair-er-drm-gpu-scheduler/

The first patches add some unit tests which allow for easy evaluation of
scheduling behaviour against different client submission patterns. From there
onwards it is hopefully a natural progression of cleanups, enablers, adding the
fair policy, and finally removing FIFO and RR and simplifying the code base
since there is no more need for multiple run queues.

As a headline result I have tested three simultaneous clients on the Steam Deck:

One instance of a deferredmultisampling Vulkan demo running with low priority,
one normal priority instance of the same demo, and the Unigine Heaven benchmark.

With the FIFO scheduler we can see that the low priority client is completely
starved and the GPU time distribution between the other two clients is uneven:

https://people.igalia.com/tursulin/drm-sched-fair/fifo-starvation.png

Switching to the fair scheduler, GPU time distribution is almost equal and the
low priority client does get a small share of the GPU:

https://people.igalia.com/tursulin/drm-sched-fair/fair-no-starvation.png

Moving on to the synthetic submission patterns, these involve two simultaneous
clients and broadly cover the following categories:

 * Deep queue clients
 * Hogs versus interactive
 * Priority handling

Let's look at the results:

1. Two normal priority deep queue clients.

Each submits one second's worth of 8ms jobs (125 jobs each), as fast as it
can, with no dependencies etc. There is no difference in runtime between FIFO
and fair but the latter allows both clients to progress with work more evenly:

https://people.igalia.com/tursulin/drm-sched-fair/normal-normal.png

(X axis is time, Y is submitted queue depth; hence a lowering qd corresponds
  with work progress for both clients, tested with each scheduler separately.)

Round-robin is the same as fair here.

2. Same two clients but one is now low priority.

https://people.igalia.com/tursulin/drm-sched-fair/normal-low.png

The normal priority client is a solid line, the low priority one dotted. We can
see how FIFO completely starves the low priority client until the normal
priority one is fully done. Only then does the low priority client get any GPU
time.

In contrast, the fair scheduler allows some GPU time to the low priority client.

Here round-robin flavours are the same as FIFO (same starvation issue).

3. Same clients but now high versus normal priority.

Similar behaviour to the previous case, except normal is de-prioritised
relative to high a bit less than low was against normal.

https://people.igalia.com/tursulin/drm-sched-fair/high-normal.png

And again round-robin flavours are the same as FIFO.

4. Heavy load vs interactive client.

The heavy client generates a 75% GPU load in the form of 3x 2.5ms jobs followed
by a 2.5ms wait. The interactive client generates a 10% GPU load in the form of
a 1x 1ms job followed by a 9ms wait.
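(That is, 7.5ms of work in every 10ms period for the heavy client, and 1ms of
work in every 10ms period for the interactive one.)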

This simulates an interactive graphical client used on top of a relatively heavy
background load but no GPU oversubscription.

The graphs show the interactive client only and, from now on, instead of
looking at the client's queue depth, we look at its "fps".

https://people.igalia.com/tursulin/drm-sched-fair/4-heavy-vs-interactive.png

Here round-robin and round-robin rewritten on top of FIFO are best, while FIFO
is clearly the worst. The fair scheduler is much better than FIFO but not as
good as RR.

5. An even heavier load vs interactive client.

This one oversubscribes the GPU by submitting 4x 50ms jobs and waiting for
only one microsecond before repeating the cycle. The interactive client is the
same 10% one as above.

https://people.igalia.com/tursulin/drm-sched-fair/4-very-heavy-vs-interactive.png

Here FIFO is really bad, with fair more than twice as good. Round-robin
flavours are better still.

6. Low priority GPU hog versus heavy-interactive.

Low priority client: 3x 2.5ms jobs followed by a 0.5ms wait.
Interactive client: 1x 0.5ms job followed by a 10ms wait.

https://people.igalia.com/tursulin/drm-sched-fair/4-low-hog-vs-interactive.png

All schedulers appear to handle this equally well.

As before, I am looking for feedback, ideas for what other kinds of submission
scenarios to test, testing on different GPUs and of course reviews.

v2:
 * Fixed many rebase errors.
 * Added some new patches.
 * Dropped single shot dependency handling.

v3:
 * Added scheduling quality unit tests.
 * Refined a tiny bit by adding some fairness.
 * Dropped a few patches for now.

v4:
 * Replaced deadline with fair!
 * Refined scheduling quality unit tests.
 * Pulled one cleanup patch earlier.
 * Fixed "drm/sched: Avoid double re-lock on the job free path".

v5:
 * Rebase on top of latest upstream DRM scheduler changes.
 * Kerneldoc fixup.
 * Improve commit message justification for one patch. (Philipp)
 * Add comment in drm_sched_alloc_wq. (Christian)

v6:
 * Rebase for "drm/sched: De-clutter drm_sched_init" getting merged.
 * Avoid NULL rq dereference from a bad rebase. (Maira)
 * Added some kerneldoc throughout. (Maira)
 * Removed some lockdep annotations not belonging to one patch. (Maira)
 * Use dma_fence_is_signaled in "drm/sched: Avoid double re-lock on the job free path". (Maira, Philipp)

v7:
 * Rebase for some prep patches getting merged.
 * Dropped submit all ready jobs patch.
 * Fixed 64-bit division in unit tests.
 * Fixed some more rebase and patch re-ordering mistakes.
 * Preserve entity RR order when re-entering the queue.
 * Fine tuned the queue re-enter logic for better behaviour with interactive
   clients.
 * Removed some static inlines.
 * Added more kerneldoc.
 * Done some benchmarks in the round-robin scheduling modes.

v8:
 * Rebased for upstream changes.
 * Added assert for reverse numerical order of DRM_SCHED_PRIORITY enums.
 * Fixed head of rq priority updates.

Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Leo Liu <Leo.Liu@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Michel Dänzer <michel.daenzer@mailbox.org>

Tvrtko Ursulin (12):
  drm/sched: Add some scheduling quality unit tests
  drm/sched: Add some more scheduling quality unit tests
  drm/sched: Implement RR via FIFO
  drm/sched: Consolidate entity run queue management
  drm/sched: Move run queue related code into a separate file
  drm/sched: Free all finished jobs at once
  drm/sched: Account entity GPU time
  drm/sched: Remove idle entity from tree
  drm/sched: Add fair scheduling policy
  drm/sched: Break submission patterns with some randomness
  drm/sched: Remove FIFO and RR and simplify to a single run queue
  drm/sched: Embed run queue singleton into the scheduler

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  27 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h     |   8 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c   |   8 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c       |   8 +-
 drivers/gpu/drm/scheduler/Makefile            |   2 +-
 drivers/gpu/drm/scheduler/sched_entity.c      | 131 ++-
 drivers/gpu/drm/scheduler/sched_fence.c       |   2 +-
 drivers/gpu/drm/scheduler/sched_internal.h    |  85 +-
 drivers/gpu/drm/scheduler/sched_main.c        | 398 +--------
 drivers/gpu/drm/scheduler/sched_rq.c          | 351 ++++++++
 drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
 .../gpu/drm/scheduler/tests/tests_scheduler.c | 824 ++++++++++++++++++
 include/drm/gpu_scheduler.h                   |  30 +-
 15 files changed, 1410 insertions(+), 478 deletions(-)
 create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c
 create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c

-- 
2.48.0



* [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-23 13:01   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 02/12] drm/sched: Add some more " Tvrtko Ursulin
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner,
	Pierre-Eric Pelloux-Prayer

To make evaluating different scheduling policies easier (no need for
external benchmarks) and perfectly repeatable, let's add some synthetic
workloads built upon the mock scheduler unit test infrastructure.

The focus is on two parallel clients (two threads) submitting different job
patterns and logging their progress and some overall metrics. This is
repeated for scheduler credit limits of both 1 and 2.

Example test output:

  Normal and low:
                    pct1 cps1 qd1;  pct2 cps2 qd2
        +     0ms:   0     0    0;   0     0    0
        +   104ms: 100  1240  112; 100  1240  125
        +   209ms: 100     0   99; 100     0  125
        +   313ms: 100     0   86; 100     0  125
        +   419ms: 100     0   73; 100     0  125
        +   524ms: 100     0   60; 100     0  125
        +   628ms: 100     0   47; 100     0  125
        +   731ms: 100     0   34; 100     0  125
        +   836ms: 100     0   21; 100     0  125
        +   939ms: 100     0    8; 100     0  125
        +  1043ms:               ; 100     0  120
        +  1147ms:               ; 100     0  107
        +  1252ms:               ; 100     0   94
        +  1355ms:               ; 100     0   81
        +  1459ms:               ; 100     0   68
        +  1563ms:               ; 100     0   55
        +  1667ms:               ; 100     0   42
        +  1771ms:               ; 100     0   29
        +  1875ms:               ; 100     0   16
        +  1979ms:               ; 100     0    3
    0: prio=normal sync=0 elapsed_ms=1015ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,222,978 us latency_time(min,avg,max)=134,222,978 us
    1: prio=low sync=0 elapsed_ms=2009ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,215,806 us latency_time(min,avg,max)=134,215,806 us

There we have two clients represented in the two respective columns, with
their progress logged roughly every 100 milliseconds. The metrics are:

 - pct - Percentage progress of the job submit part
 - cps - Cycles per second
 - qd  - Queue depth - number of submitted unfinished jobs

The cycles per second metric stems from the fact that workload patterns
are a data-driven cycling sequence of:

 - Submit 1..N jobs
 - Wait for Nth job to finish (optional)
 - Sleep (optional)
 - Repeat from start
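
In code, one cycle looks roughly like the below, simplified from
drm_sched_client_work() in the patch (statistics and early exit omitted):

	for (i = 0; i < params.job_cnt; i++) {	/* Submit 1..N jobs */
		job = drm_mock_sched_job_new(test, entity);
		drm_mock_sched_job_set_duration_us(job, params.job_us);
		drm_mock_sched_job_submit(job);
	}

	if (params.sync)		/* Wait for Nth job (optional) */
		drm_mock_sched_job_wait_finished(job, MAX_SCHEDULE_TIMEOUT);

	if (params.wait_us)		/* Sleep (optional) */
		fsleep(params.wait_us);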

In this particular example we have a normal priority and a low priority
client, both spamming the scheduler with 8ms jobs with no sync and no
sleeping. Hence they build very deep queues and we can see how the low
priority client is completely starved until the normal one finishes.

Note that the PCT and CPS metrics are irrelevant for "unsync" clients
since they manage to complete all of their cycles instantaneously.

A different example would be:

  Heavy and interactive:
                    pct1 cps1 qd1;  pct2 cps2 qd2
        +     0ms:   0     0    0;   0     0    0
        +   106ms:   5    40    3;   5    40    0
        +   209ms:   9    40    0;   9    40    0
        +   314ms:  14    50    3;  14    50    0
        +   417ms:  18    40    0;  18    40    0
        +   522ms:  23    50    3;  23    50    0
        +   625ms:  27    40    0;  27    40    1
        +   729ms:  32    50    0;  32    50    0
        +   833ms:  36    40    1;  36    40    0
        +   937ms:  40    40    0;  40    40    0
        +  1041ms:  45    50    0;  45    50    0
        +  1146ms:  49    40    1;  49    40    1
        +  1249ms:  54    50    0;  54    50    0
        +  1353ms:  58    40    1;  58    40    0
        +  1457ms:  62    40    0;  62    40    1
        +  1561ms:  67    50    0;  67    50    0
        +  1665ms:  71    40    1;  71    40    0
        +  1772ms:  76    50    0;  76    50    0
        +  1877ms:  80    40    1;  80    40    0
        +  1981ms:  84    40    0;  84    40    0
        +  2085ms:  89    50    0;  89    50    0
        +  2189ms:  93    40    1;  93    40    0
        +  2293ms:  97    40    0;  97    40    1

In this case client one is submitting 3x 2.5ms jobs, waiting for the 3rd
and then sleeping for 2.5ms (in effect causing a 75% GPU load, minus the
overheads). The second client is submitting 1ms jobs, waiting for each to
finish and sleeping for 9ms (an effective 10% GPU load). Here we can see
the PCT and CPS reflecting real progress.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
 .../gpu/drm/scheduler/tests/tests_scheduler.c | 640 ++++++++++++++++++
 2 files changed, 642 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c

diff --git a/drivers/gpu/drm/scheduler/tests/Makefile b/drivers/gpu/drm/scheduler/tests/Makefile
index 5bf707bad373..9ec185fbbc15 100644
--- a/drivers/gpu/drm/scheduler/tests/Makefile
+++ b/drivers/gpu/drm/scheduler/tests/Makefile
@@ -2,6 +2,7 @@
 
 drm-sched-tests-y := \
         mock_scheduler.o \
-        tests_basic.o
+        tests_basic.o \
+        tests_scheduler.o
 
 obj-$(CONFIG_DRM_SCHED_KUNIT_TEST) += drm-sched-tests.o
diff --git a/drivers/gpu/drm/scheduler/tests/tests_scheduler.c b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
new file mode 100644
index 000000000000..15c898966ef0
--- /dev/null
+++ b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
@@ -0,0 +1,640 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Valve Corporation */
+
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/ktime.h>
+#include <linux/math64.h>
+
+#include "sched_tests.h"
+
+/*
+ * DRM scheduler scheduler tests exercise load balancing decisions ie. entity
+ * selection logic.
+ */
+
+static int drm_sched_scheduler_init(struct kunit *test)
+{
+	struct drm_mock_scheduler *sched;
+
+	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
+	sched->base.credit_limit = 1;
+
+	test->priv = sched;
+
+	return 0;
+}
+
+static int drm_sched_scheduler_init2(struct kunit *test)
+{
+	struct drm_mock_scheduler *sched;
+
+	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
+	sched->base.credit_limit = 2;
+
+	test->priv = sched;
+
+	return 0;
+}
+
+static void drm_sched_scheduler_exit(struct kunit *test)
+{
+	struct drm_mock_scheduler *sched = test->priv;
+
+	drm_mock_sched_fini(sched);
+}
+
+static void drm_sched_scheduler_queue_overhead(struct kunit *test)
+{
+	struct drm_mock_scheduler *sched = test->priv;
+	struct drm_mock_sched_entity *entity;
+	const unsigned int job_us = 1000;
+	const unsigned int jobs = 1000;
+	const unsigned int total_us = jobs * job_us;
+	struct drm_mock_sched_job *job, *first;
+	ktime_t start, end;
+	bool done;
+	int i;
+
+	/*
+	 * Deep queue job at a time processing (single credit).
+	 *
+	 * This measures the overhead of picking and processing a job at a time
+	 * by comparing the ideal total "GPU" time of all submitted jobs versus
+	 * the time actually taken.
+	 */
+
+	KUNIT_ASSERT_EQ(test, sched->base.credit_limit, 1);
+
+	entity = drm_mock_sched_entity_new(test,
+					   DRM_SCHED_PRIORITY_NORMAL,
+					   sched);
+
+	for (i = 0; i <= jobs; i++) {
+		job = drm_mock_sched_job_new(test, entity);
+		if (i == 0)
+			first = job; /* Extra first job blocks the queue */
+		else
+			drm_mock_sched_job_set_duration_us(job, job_us);
+		drm_mock_sched_job_submit(job);
+	}
+
+	done = drm_mock_sched_job_wait_scheduled(first, HZ);
+	KUNIT_ASSERT_TRUE(test, done);
+
+	start = ktime_get();
+	i = drm_mock_sched_advance(sched, 1); /* Release the queue */
+	KUNIT_ASSERT_EQ(test, i, 1);
+
+	done = drm_mock_sched_job_wait_finished(job,
+						usecs_to_jiffies(total_us) * 5);
+	end = ktime_get();
+	KUNIT_ASSERT_TRUE(test, done);
+
+	pr_info("Expected %uus, actual %lldus\n",
+		total_us,
+		ktime_to_us(ktime_sub(end, start)));
+
+	drm_mock_sched_entity_free(entity);
+}
+
+static void drm_sched_scheduler_ping_pong(struct kunit *test)
+{
+	struct drm_mock_sched_job *job, *first, *prev = NULL;
+	struct drm_mock_scheduler *sched = test->priv;
+	struct drm_mock_sched_entity *entity[2];
+	const unsigned int job_us = 1000;
+	const unsigned int jobs = 1000;
+	const unsigned int total_us = jobs * job_us;
+	ktime_t start, end;
+	bool done;
+	int i;
+
+	/*
+	 * Two entities in an inter-dependency chain.
+	 *
+	 * This measures the overhead of picking and processing a job at a time,
+	 * where each job depends on the previous one from the other
+	 * entity, by comparing the ideal total "GPU" time of all submitted jobs
+	 * versus the time actually taken.
+	 */
+
+	KUNIT_ASSERT_EQ(test, sched->base.credit_limit, 1);
+
+	for (i = 0; i < ARRAY_SIZE(entity); i++)
+		entity[i] = drm_mock_sched_entity_new(test,
+						      DRM_SCHED_PRIORITY_NORMAL,
+						      sched);
+
+	for (i = 0; i <= jobs; i++) {
+		job = drm_mock_sched_job_new(test, entity[i & 1]);
+		if (i == 0)
+			first = job; /* Extra first job blocks the queue */
+		else
+			drm_mock_sched_job_set_duration_us(job, job_us);
+		if (prev)
+			drm_sched_job_add_dependency(&job->base,
+						     dma_fence_get(&prev->base.s_fence->finished));
+		drm_mock_sched_job_submit(job);
+		prev = job;
+	}
+
+	done = drm_mock_sched_job_wait_scheduled(first, HZ);
+	KUNIT_ASSERT_TRUE(test, done);
+
+	start = ktime_get();
+	i = drm_mock_sched_advance(sched, 1); /* Release the queue */
+	KUNIT_ASSERT_EQ(test, i, 1);
+
+	done = drm_mock_sched_job_wait_finished(job,
+						usecs_to_jiffies(total_us) * 5);
+	end = ktime_get();
+	KUNIT_ASSERT_TRUE(test, done);
+
+	pr_info("Expected %uus, actual %lldus\n",
+		total_us,
+		ktime_to_us(ktime_sub(end, start)));
+
+	for (i = 0; i < ARRAY_SIZE(entity); i++)
+		drm_mock_sched_entity_free(entity[i]);
+}
+
+static struct kunit_case drm_sched_scheduler_overhead_tests[] = {
+	KUNIT_CASE_SLOW(drm_sched_scheduler_queue_overhead),
+	KUNIT_CASE_SLOW(drm_sched_scheduler_ping_pong),
+	{}
+};
+
+static struct kunit_suite drm_sched_scheduler_overhead = {
+	.name = "drm_sched_scheduler_overhead_tests",
+	.init = drm_sched_scheduler_init,
+	.exit = drm_sched_scheduler_exit,
+	.test_cases = drm_sched_scheduler_overhead_tests,
+};
+
+struct drm_sched_client_params {
+	enum drm_sched_priority priority;
+	unsigned int job_cnt;
+	unsigned int job_us;
+	unsigned int wait_us;
+	bool sync;
+};
+
+struct drm_sched_test_params {
+	const char *description;
+	struct drm_sched_client_params client[2];
+};
+
+static const struct drm_sched_test_params drm_sched_cases[] = {
+	{
+		.description = "Normal and normal",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+	},
+	{
+		.description = "Normal and low",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_LOW,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+	},
+	{
+		.description = "High and normal",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_HIGH,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+	},
+	{
+		.description = "High and low",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_HIGH,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_LOW,
+			.job_cnt = 1,
+			.job_us = 8000,
+			.wait_us = 0,
+			.sync = false,
+		},
+	},
+	{
+		.description = "50 and 50",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 1500,
+			.wait_us = 1500,
+			.sync = true,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 2500,
+			.wait_us = 2500,
+			.sync = true,
+		},
+	},
+	{
+		.description = "50 and 50 low",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 1500,
+			.wait_us = 1500,
+			.sync = true,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_LOW,
+			.job_cnt = 1,
+			.job_us = 2500,
+			.wait_us = 2500,
+			.sync = true,
+		},
+	},
+	{
+		.description = "50 high and 50",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_HIGH,
+			.job_cnt = 1,
+			.job_us = 1500,
+			.wait_us = 1500,
+			.sync = true,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 2500,
+			.wait_us = 2500,
+			.sync = true,
+		},
+	},
+	{
+		.description = "Low hog and interactive",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_LOW,
+			.job_cnt = 3,
+			.job_us = 2500,
+			.wait_us = 500,
+			.sync = false,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 500,
+			.wait_us = 10000,
+			.sync = true,
+		},
+	},
+	{
+		.description = "Heavy and interactive",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 3,
+			.job_us = 2500,
+			.wait_us = 2500,
+			.sync = true,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 1000,
+			.wait_us = 9000,
+			.sync = true,
+		},
+	},
+	{
+		.description = "Very heavy and interactive",
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 4,
+			.job_us = 50000,
+			.wait_us = 1,
+			.sync = true,
+		},
+		.client[1] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 1,
+			.job_us = 1000,
+			.wait_us = 9000,
+			.sync = true,
+		},
+	},
+};
+
+static void
+drm_sched_desc(const struct drm_sched_test_params *params, char *desc)
+{
+	strscpy(desc, params->description, KUNIT_PARAM_DESC_SIZE);
+}
+
+KUNIT_ARRAY_PARAM(drm_sched_scheduler_two_clients,
+		  drm_sched_cases,
+		  drm_sched_desc);
+
+struct test_client_stats {
+	unsigned int min_us;
+	unsigned int max_us;
+	unsigned long tot_us;
+};
+
+struct test_client {
+	struct kunit *test;
+
+	struct drm_mock_sched_entity	*entity;
+
+	struct kthread_worker	*worker;
+	struct kthread_work	work;
+
+	unsigned int id;
+	ktime_t duration;
+
+	struct drm_sched_client_params params;
+
+	ktime_t ideal_duration;
+	unsigned int cycles;
+	unsigned int cycle;
+	ktime_t	start;
+	ktime_t	end;
+	bool done;
+
+	struct test_client_stats cycle_time;
+	struct test_client_stats latency_time;
+};
+
+static void
+update_stats(struct test_client_stats *stats, unsigned int us)
+{
+	if (us > stats->max_us)
+		stats->max_us = us;
+	if (us < stats->min_us)
+		stats->min_us = us;
+	stats->tot_us += us;
+}
+
+static unsigned int
+get_stats_avg(struct test_client_stats *stats, unsigned int cycles)
+{
+	return div_u64(stats->tot_us, cycles);
+}
+
+static void drm_sched_client_work(struct kthread_work *work)
+{
+	struct test_client *client = container_of(work, typeof(*client), work);
+	const long sync_wait = MAX_SCHEDULE_TIMEOUT;
+	unsigned int cycle, work_us, period_us;
+	struct drm_mock_sched_job *job = NULL;
+
+	work_us = client->params.job_cnt * client->params.job_us;
+	period_us = work_us + client->params.wait_us;
+	client->cycles =
+		DIV_ROUND_UP((unsigned int)ktime_to_us(client->duration),
+			     period_us);
+	client->ideal_duration = us_to_ktime(client->cycles * period_us);
+
+	client->start = ktime_get();
+
+	for (cycle = 0; cycle < client->cycles; cycle++) {
+		unsigned int batch;
+		unsigned long us;
+		ktime_t t;
+
+		if (READ_ONCE(client->done))
+			break;
+
+		t = ktime_get();
+		for (batch = 0; batch < client->params.job_cnt; batch++) {
+			job = drm_mock_sched_job_new(client->test,
+						     client->entity);
+			drm_mock_sched_job_set_duration_us(job,
+							   client->params.job_us);
+			drm_mock_sched_job_submit(job);
+		}
+
+		if (client->params.sync)
+			drm_mock_sched_job_wait_finished(job, sync_wait);
+
+		t = ktime_sub(ktime_get(), t);
+		us = ktime_to_us(t);
+		update_stats(&client->cycle_time, us);
+		if (ktime_to_us(t) >= (long)work_us)
+			us = ktime_to_us(t) - work_us;
+		else if (WARN_ON_ONCE(client->params.sync))
+			us = 0;
+		update_stats(&client->latency_time, us);
+		WRITE_ONCE(client->cycle, cycle);
+
+		if (READ_ONCE(client->done))
+			break;
+
+		if (client->params.wait_us)
+			fsleep(client->params.wait_us);
+		else
+			cond_resched();
+	}
+
+	client->done = drm_mock_sched_job_wait_finished(job, sync_wait);
+	client->end = ktime_get();
+}
+
+static const char *prio_str(enum drm_sched_priority prio)
+{
+	switch (prio) {
+	case DRM_SCHED_PRIORITY_KERNEL:
+		return "kernel";
+	case DRM_SCHED_PRIORITY_LOW:
+		return "low";
+	case DRM_SCHED_PRIORITY_NORMAL:
+		return "normal";
+	case DRM_SCHED_PRIORITY_HIGH:
+		return "high";
+	default:
+		return "???";
+	}
+}
+
+static void drm_sched_scheduler_two_clients_test(struct kunit *test)
+{
+	const struct drm_sched_test_params *params = test->param_value;
+	struct drm_mock_scheduler *sched = test->priv;
+	struct test_client client[2] = { };
+	unsigned int prev_cycle[2] = { };
+	unsigned int i, j;
+	ktime_t start;
+
+	/*
+	 * Same job stream from two clients.
+	 */
+
+	for (i = 0; i < ARRAY_SIZE(client); i++)
+		client[i].entity =
+			drm_mock_sched_entity_new(test,
+						  params->client[i].priority,
+						  sched);
+
+	for (i = 0; i < ARRAY_SIZE(client); i++) {
+		client[i].test = test;
+		client[i].id = i;
+		client[i].duration = ms_to_ktime(1000);
+		client[i].params = params->client[i];
+		client[i].cycle_time.min_us = ~0U;
+		client[i].latency_time.min_us = ~0U;
+		client[i].worker =
+			kthread_create_worker(0, "%s-%u", __func__, i);
+		if (IS_ERR(client[i].worker)) {
+			for (j = 0; j < i; j++)
+				kthread_destroy_worker(client[j].worker);
+			KUNIT_FAIL(test, "Failed to create worker!\n");
+		}
+
+		kthread_init_work(&client[i].work, drm_sched_client_work);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(client); i++)
+		kthread_queue_work(client[i].worker, &client[i].work);
+
+	/*
+	 * The clients (workers) can be a mix of async (deep submission queue),
+	 * sync (one job at a time), or something in between. Therefore it is
+	 * difficult to display a single metric representing their progress.
+	 *
+	 * Each struct drm_sched_client_params describes the actual submission
+	 * pattern which happens in the following steps:
+	 *  1. Submit N jobs
+	 *  2. Wait for last submitted job to finish
+	 *  3. Sleep for U micro-seconds
+	 *  4. Goto 1. for C cycles
+	 *
+	 * Where number of cycles is calculated to match the target client
+	 * duration from the respective struct drm_sched_test_params.
+	 *
+	 * To assess scheduling behaviour what we output for both clients is:
+	 *  - pct: Percentage progress of the jobs submitted
+	 *  - cps: "Cycles" per second (where one cycle is one 1.-4. above)
+	 *  -  qd: Number of outstanding jobs in the client/entity
+	 */
+
+	start = ktime_get();
+	pr_info("%s:\n\t            pct1 cps1 qd1;  pct2 cps2 qd2\n",
+		params->description);
+	while (!READ_ONCE(client[0].done) || !READ_ONCE(client[1].done)) {
+		unsigned int pct[2], qd[2], cycle[2], cps[2];
+
+		for (i = 0; i < ARRAY_SIZE(client); i++) {
+			qd[i] = spsc_queue_count(&client[i].entity->base.job_queue);
+			cycle[i] = READ_ONCE(client[i].cycle);
+			cps[i] = DIV_ROUND_UP(1000 * (cycle[i] - prev_cycle[i]),
+					      100);
+			if (client[i].cycles)
+				pct[i] = DIV_ROUND_UP(100 * (1 + cycle[i]),
+						      client[i].cycles);
+			else
+				pct[i] = 0;
+			prev_cycle[i] = cycle[i];
+		}
+
+		if (READ_ONCE(client[0].done))
+			pr_info("\t+%6lldms:               ; %3u %5u %4u\n",
+				ktime_to_ms(ktime_sub(ktime_get(), start)),
+				pct[1], cps[1], qd[1]);
+		else if (READ_ONCE(client[1].done))
+			pr_info("\t+%6lldms: %3u %5u %4u;\n",
+				ktime_to_ms(ktime_sub(ktime_get(), start)),
+				pct[0], cps[0], qd[0]);
+		else
+			pr_info("\t+%6lldms: %3u %5u %4u; %3u %5u %4u\n",
+				ktime_to_ms(ktime_sub(ktime_get(), start)),
+				pct[0], cps[0], qd[0],
+				pct[1], cps[1], qd[1]);
+		msleep(100);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(client); i++) {
+		kthread_flush_work(&client[i].work);
+		kthread_destroy_worker(client[i].worker);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(client); i++)
+		KUNIT_ASSERT_TRUE(test, client[i].done);
+
+	for (i = 0; i < ARRAY_SIZE(client); i++) {
+		pr_info("    %u: prio=%s sync=%u elapsed_ms=%lldms (ideal_ms=%lldms) cycle_time(min,avg,max)=%u,%u,%u us latency_time(min,avg,max)=%u,%u,%u us",
+			i,
+			prio_str(params->client[i].priority),
+			params->client[i].sync,
+			ktime_to_ms(ktime_sub(client[i].end, client[i].start)),
+			ktime_to_ms(client[i].ideal_duration),
+			client[i].cycle_time.min_us,
+			get_stats_avg(&client[i].cycle_time, client[i].cycles),
+			client[i].cycle_time.max_us,
+			client[i].latency_time.min_us,
+			get_stats_avg(&client[i].latency_time, client[i].cycles),
+			client[i].latency_time.max_us);
+		drm_mock_sched_entity_free(client[i].entity);
+	}
+}
+
+static const struct kunit_attributes drm_sched_scheduler_two_clients_attr = {
+	.speed = KUNIT_SPEED_SLOW,
+};
+
+static struct kunit_case drm_sched_scheduler_two_clients_tests[] = {
+	KUNIT_CASE_PARAM_ATTR(drm_sched_scheduler_two_clients_test,
+			      drm_sched_scheduler_two_clients_gen_params,
+			      drm_sched_scheduler_two_clients_attr),
+	{}
+};
+
+static struct kunit_suite drm_sched_scheduler_two_clients1 = {
+	.name = "drm_sched_scheduler_two_clients_one_credit_tests",
+	.init = drm_sched_scheduler_init,
+	.exit = drm_sched_scheduler_exit,
+	.test_cases = drm_sched_scheduler_two_clients_tests,
+};
+
+static struct kunit_suite drm_sched_scheduler_two_clients2 = {
+	.name = "drm_sched_scheduler_two_clients_two_credits_tests",
+	.init = drm_sched_scheduler_init2,
+	.exit = drm_sched_scheduler_exit,
+	.test_cases = drm_sched_scheduler_two_clients_tests,
+};
+
+kunit_test_suites(&drm_sched_scheduler_overhead,
+		  &drm_sched_scheduler_two_clients1,
+		  &drm_sched_scheduler_two_clients2);
-- 
2.48.0



* [RFC v8 02/12] drm/sched: Add some more scheduling quality unit tests
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 03/12] drm/sched: Implement RR via FIFO Tvrtko Ursulin
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner,
	Pierre-Eric Pelloux-Prayer

This time round we explore the rate at which submitted job queues are
processed with multiple identical parallel clients.

Example test output:

3 clients:
        t               cycle:     min  avg max : ...
        +     0ms                   0    0    0 :   0   0   0
        +   102ms                   2    2    2 :   2   2   2
        +   208ms                   5    6    6 :   6   5   5
        +   310ms                   8    9    9 :   9   9   8
...
        +  2616ms                  82   83   83 :  83  83  82
        +  2717ms                  83   83   83 :  83  83  83
    avg_max_min_delta(x100)=60

Every 100ms for the duration of the test, the test logs how many jobs each
client has completed, prefixed by the minimum, average and maximum numbers
across all clients. When finished, the overall average delta between max and
min is output as a rough indicator of scheduling fairness.
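(In the example above, avg_max_min_delta(x100)=60 means the gap between the
slowest and fastest client averaged 0.6 completed cycles per 100ms sample.)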

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
---
 .../gpu/drm/scheduler/tests/tests_scheduler.c | 186 +++++++++++++++++-
 1 file changed, 185 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/tests/tests_scheduler.c b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
index 15c898966ef0..b90fdf6e605c 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
@@ -182,6 +182,7 @@ struct drm_sched_client_params {
 
 struct drm_sched_test_params {
 	const char *description;
+	unsigned int num_clients;
 	struct drm_sched_client_params client[2];
 };
 
@@ -635,6 +636,189 @@ static struct kunit_suite drm_sched_scheduler_two_clients2 = {
 	.test_cases = drm_sched_scheduler_two_clients_tests,
 };
 
+
+static const struct drm_sched_test_params drm_sched_many_cases[] = {
+	{
+		.description = "2 clients",
+		.num_clients = 2,
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 4,
+			.job_us = 1000,
+			.wait_us = 0,
+			.sync = true,
+		},
+	},
+	{
+		.description = "3 clients",
+		.num_clients = 3,
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 4,
+			.job_us = 1000,
+			.wait_us = 0,
+			.sync = true,
+		},
+	},
+	{
+		.description = "7 clients",
+		.num_clients = 7,
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 4,
+			.job_us = 1000,
+			.wait_us = 0,
+			.sync = true,
+		},
+	},
+	{
+		.description = "13 clients",
+		.num_clients = 13,
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 4,
+			.job_us = 1000,
+			.wait_us = 0,
+			.sync = true,
+		},
+	},
+	{
+		.description = "31 clients",
+		.num_clients = 31,
+		.client[0] = {
+			.priority = DRM_SCHED_PRIORITY_NORMAL,
+			.job_cnt = 2,
+			.job_us = 1000,
+			.wait_us = 0,
+			.sync = true,
+		},
+	},
+};
+
+KUNIT_ARRAY_PARAM(drm_sched_scheduler_many_clients,
+		  drm_sched_many_cases,
+		  drm_sched_desc);
+
+static void drm_sched_scheduler_many_clients_test(struct kunit *test)
+{
+	const struct drm_sched_test_params *params = test->param_value;
+	struct drm_mock_scheduler *sched = test->priv;
+	const unsigned int clients = params->num_clients;
+	unsigned int i, j, delta_total = 0, loops = 0;
+	struct test_client *client;
+	unsigned int *prev_cycle;
+	ktime_t start;
+	char *buf;
+
+	/*
+	 * Many clients with deep-ish async queues.
+	 */
+
+	buf = kunit_kmalloc(test, PAGE_SIZE, GFP_KERNEL);
+	client = kunit_kcalloc(test, clients, sizeof(*client), GFP_KERNEL);
+	prev_cycle = kunit_kcalloc(test, clients, sizeof(*prev_cycle),
+				   GFP_KERNEL);
+
+	for (i = 0; i < clients; i++)
+		client[i].entity =
+			drm_mock_sched_entity_new(test,
+						  DRM_SCHED_PRIORITY_NORMAL,
+						  sched);
+
+	for (i = 0; i < clients; i++) {
+		client[i].test = test;
+		client[i].id = i;
+		client[i].params = params->client[0];
+		client[i].duration = ms_to_ktime(1000 / clients);
+		client[i].cycle_time.min_us = ~0U;
+		client[i].latency_time.min_us = ~0U;
+		client[i].worker =
+			kthread_create_worker(0, "%s-%u", __func__, i);
+		if (IS_ERR(client[i].worker)) {
+			for (j = 0; j < i; j++)
+				kthread_destroy_worker(client[j].worker);
+			KUNIT_FAIL(test, "Failed to create worker!\n");
+		}
+
+		kthread_init_work(&client[i].work, drm_sched_client_work);
+	}
+
+	for (i = 0; i < clients; i++)
+		kthread_queue_work(client[i].worker, &client[i].work);
+
+	start = ktime_get();
+	pr_info("%u clients:\n\tt\t\tcycle:\t  min    avg    max : ...\n", clients);
+	for (;;) {
+		unsigned int min = ~0;
+		unsigned int max = 0;
+		unsigned int total = 0;
+		bool done = true;
+		char pbuf[16];
+
+		memset(buf, 0, PAGE_SIZE);
+		for (i = 0; i < clients; i++) {
+			unsigned int cycle, cycles;
+
+			cycle = READ_ONCE(client[i].cycle);
+			cycles = READ_ONCE(client[i].cycles);
+
+			snprintf(pbuf, sizeof(pbuf), " %3d", cycle);
+			strncat(buf, pbuf, PAGE_SIZE);
+
+			total += cycle;
+			if (cycle < min)
+				min = cycle;
+			if (cycle > max)
+				max = cycle;
+
+			if (!min || (cycle + 1) < cycles)
+				done = false;
+		}
+
+		loops++;
+		delta_total += max - min;
+
+		pr_info("\t+%6lldms\t\t  %3u  %3u  %3u :%s\n",
+			ktime_to_ms(ktime_sub(ktime_get(), start)),
+			min, DIV_ROUND_UP(total, clients), max, buf);
+
+		if (done)
+			break;
+
+		msleep(100);
+	}
+
+	pr_info("    avg_max_min_delta(x100)=%u\n",
+		loops ? DIV_ROUND_UP(delta_total * 100, loops) : 0);
+
+	for (i = 0; i < clients; i++) {
+		kthread_flush_work(&client[i].work);
+		kthread_destroy_worker(client[i].worker);
+	}
+
+	for (i = 0; i < clients; i++)
+		drm_mock_sched_entity_free(client[i].entity);
+}
+
+static const struct kunit_attributes drm_sched_scheduler_many_clients_attr = {
+	.speed = KUNIT_SPEED_SLOW,
+};
+
+static struct kunit_case drm_sched_scheduler_many_clients_tests[] = {
+	KUNIT_CASE_PARAM_ATTR(drm_sched_scheduler_many_clients_test,
+			      drm_sched_scheduler_many_clients_gen_params,
+			      drm_sched_scheduler_many_clients_attr),
+	{}
+};
+
+static struct kunit_suite drm_sched_scheduler_many_clients = {
+	.name = "drm_sched_scheduler_many_clients_tests",
+	.init = drm_sched_scheduler_init2,
+	.exit = drm_sched_scheduler_exit,
+	.test_cases = drm_sched_scheduler_many_clients_tests,
+};
+
 kunit_test_suites(&drm_sched_scheduler_overhead,
 		  &drm_sched_scheduler_two_clients1,
-		  &drm_sched_scheduler_two_clients2);
+		  &drm_sched_scheduler_two_clients2,
+		  &drm_sched_scheduler_many_clients);
-- 
2.48.0



* [RFC v8 03/12] drm/sched: Implement RR via FIFO
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 02/12] drm/sched: Add some more " Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 04/12] drm/sched: Consolidate entity run queue management Tvrtko Ursulin
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

Round-robin being the non-default policy, and it being unclear how much it is
used, we can notice that it can be implemented using the FIFO data structures
if we only invent a fake submit timestamp which is monotonically increasing
inside drm_sched_rq instances.

So instead of remembering which entity the scheduler worker picked last, we
can simply bump the picked one to the bottom of the tree, which ensures
round-robin behaviour between all entities with queued jobs.

If the picked job was the last from a given entity, we remember the assigned
fake timestamp and use it to re-insert the entity once it re-joins the queue.
This ensures the entity neither overtakes all already queued jobs, nor goes
last. Instead it keeps the position after the currently queued jobs and
before the ones which had not yet been queued at the point the entity left
the queue.

The advantage is that we can consolidate to a single code path and remove a
bunch of code. The downside is that round-robin mode now needs to take a lock
on the job pop path, but that should not be visible.
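
To illustrate the mechanism, here is a minimal userspace simulation of the
fake timestamp idea (illustrative only, not the kernel code; the array scan
stands in for the rb-tree leftmost lookup):

  #include <stdio.h>

  struct entity {
  	const char *name;
  	long long ts;	/* fake submit timestamp, lowest runs first */
  	int jobs;	/* jobs left in the entity queue */
  };

  int main(void)
  {
  	struct entity e[] = { { "A", 0, 3 }, { "B", 0, 3 }, { "C", 0, 3 } };
  	long long rr_ts = 0;	/* per run queue monotonic counter */
  	int remaining = 9;

  	while (remaining--) {
  		struct entity *pick = NULL;
  		int i;

  		/* Pick the queued entity with the smallest timestamp. */
  		for (i = 0; i < 3; i++)
  			if (e[i].jobs && (!pick || e[i].ts < pick->ts))
  				pick = &e[i];

  		printf("run one job from %s\n", pick->name);
  		pick->jobs--;

  		/* Bump the picked entity to the back of the queue. */
  		pick->ts = ++rr_ts;
  	}

  	return 0;
  }

Running it prints A, B, C, A, B, C, ... in other words round-robin between
all entities with queued jobs.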

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 51 ++++++++++------
 drivers/gpu/drm/scheduler/sched_main.c   | 76 ++----------------------
 include/drm/gpu_scheduler.h              | 16 +++--
 3 files changed, 51 insertions(+), 92 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 5a4697f636f2..4852006f2308 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -456,9 +456,24 @@ drm_sched_job_dependency(struct drm_sched_job *job,
 	return NULL;
 }
 
+static ktime_t
+drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
+{
+	ktime_t ts;
+
+	lockdep_assert_held(&entity->lock);
+	lockdep_assert_held(&rq->lock);
+
+	ts = ktime_add_ns(rq->rr_ts, 1);
+	entity->rr_ts = ts;
+	rq->rr_ts = ts;
+
+	return ts;
+}
+
 struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 {
-	struct drm_sched_job *sched_job;
+	struct drm_sched_job *sched_job, *next_job;
 
 	sched_job = drm_sched_entity_queue_peek(entity);
 	if (!sched_job)
@@ -491,21 +506,21 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 	 * Update the entity's location in the min heap according to
 	 * the timestamp of the next job, if any.
 	 */
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) {
-		struct drm_sched_job *next;
+	next_job = drm_sched_entity_queue_peek(entity);
+	if (next_job) {
+		struct drm_sched_rq *rq;
+		ktime_t ts;
 
-		next = drm_sched_entity_queue_peek(entity);
-		if (next) {
-			struct drm_sched_rq *rq;
-
-			spin_lock(&entity->lock);
-			rq = entity->rq;
-			spin_lock(&rq->lock);
-			drm_sched_rq_update_fifo_locked(entity, rq,
-							next->submit_ts);
-			spin_unlock(&rq->lock);
-			spin_unlock(&entity->lock);
-		}
+		spin_lock(&entity->lock);
+		rq = entity->rq;
+		spin_lock(&rq->lock);
+		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+			ts = next_job->submit_ts;
+		else
+			ts = drm_sched_rq_get_rr_ts(rq, entity);
+		drm_sched_rq_update_fifo_locked(entity, rq, ts);
+		spin_unlock(&rq->lock);
+		spin_unlock(&entity->lock);
 	}
 
 	/* Jobs and entities might have different lifecycles. Since we're
@@ -612,9 +627,9 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 
 		spin_lock(&rq->lock);
 		drm_sched_rq_add_entity(rq, entity);
-
-		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-			drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
+		if (drm_sched_policy == DRM_SCHED_POLICY_RR)
+			submit_ts = entity->rr_ts;
+		drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
 
 		spin_unlock(&rq->lock);
 		spin_unlock(&entity->lock);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 46119aacb809..1db0a4aa1d46 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -185,7 +185,6 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
 	spin_lock_init(&rq->lock);
 	INIT_LIST_HEAD(&rq->entities);
 	rq->rb_tree_root = RB_ROOT_CACHED;
-	rq->current_entity = NULL;
 	rq->sched = sched;
 }
 
@@ -231,74 +230,13 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 	atomic_dec(rq->sched->score);
 	list_del_init(&entity->list);
 
-	if (rq->current_entity == entity)
-		rq->current_entity = NULL;
-
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-		drm_sched_rq_remove_fifo_locked(entity, rq);
+	drm_sched_rq_remove_fifo_locked(entity, rq);
 
 	spin_unlock(&rq->lock);
 }
 
 /**
- * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
- *
- * @sched: the gpu scheduler
- * @rq: scheduler run queue to check.
- *
- * Try to find the next ready entity.
- *
- * Return an entity if one is found; return an error-pointer (!NULL) if an
- * entity was ready, but the scheduler had insufficient credits to accommodate
- * its job; return NULL, if no ready entity was found.
- */
-static struct drm_sched_entity *
-drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
-			      struct drm_sched_rq *rq)
-{
-	struct drm_sched_entity *entity;
-
-	spin_lock(&rq->lock);
-
-	entity = rq->current_entity;
-	if (entity) {
-		list_for_each_entry_continue(entity, &rq->entities, list) {
-			if (drm_sched_entity_is_ready(entity))
-				goto found;
-		}
-	}
-
-	list_for_each_entry(entity, &rq->entities, list) {
-		if (drm_sched_entity_is_ready(entity))
-			goto found;
-
-		if (entity == rq->current_entity)
-			break;
-	}
-
-	spin_unlock(&rq->lock);
-
-	return NULL;
-
-found:
-	if (!drm_sched_can_queue(sched, entity)) {
-		/*
-		 * If scheduler cannot take more jobs signal the caller to not
-		 * consider lower priority queues.
-		 */
-		entity = ERR_PTR(-ENOSPC);
-	} else {
-		rq->current_entity = entity;
-		reinit_completion(&entity->entity_idle);
-	}
-
-	spin_unlock(&rq->lock);
-
-	return entity;
-}
-
-/**
- * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
+ * drm_sched_rq_select_entity - Select an entity which provides a job to run
  *
  * @sched: the gpu scheduler
  * @rq: scheduler run queue to check.
@@ -310,8 +248,8 @@ drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
  * its job; return NULL, if no ready entity was found.
  */
 static struct drm_sched_entity *
-drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
-				struct drm_sched_rq *rq)
+drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
+			   struct drm_sched_rq *rq)
 {
 	struct rb_node *rb;
 
@@ -1093,15 +1031,13 @@ void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 static struct drm_sched_entity *
 drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 {
-	struct drm_sched_entity *entity;
+	struct drm_sched_entity *entity = NULL;
 	int i;
 
 	/* Start with the highest priority.
 	 */
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
-		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
-			drm_sched_rq_select_entity_fifo(sched, sched->sched_rq[i]) :
-			drm_sched_rq_select_entity_rr(sched, sched->sched_rq[i]);
+		entity = drm_sched_rq_select_entity(sched, sched->sched_rq[i]);
 		if (entity)
 			break;
 	}
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 323a505e6e6a..802a0060db55 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -94,7 +94,8 @@ struct drm_sched_entity {
 	 * @lock:
 	 *
 	 * Lock protecting the run-queue (@rq) to which this entity belongs,
-	 * @priority and the list of schedulers (@sched_list, @num_sched_list).
+	 * @priority, the list of schedulers (@sched_list, @num_sched_list) and
+	 * the @rr_ts field.
 	 */
 	spinlock_t			lock;
 
@@ -142,6 +143,13 @@ struct drm_sched_entity {
 	 */
 	enum drm_sched_priority         priority;
 
+	/**
+	 * @rr_ts:
+	 *
+	 * Fake timestamp of the last popped job from the entity.
+	 */
+	ktime_t				rr_ts;
+
 	/**
 	 * @job_queue: the list of jobs of this entity.
 	 */
@@ -239,8 +247,8 @@ struct drm_sched_entity {
  * struct drm_sched_rq - queue of entities to be scheduled.
  *
  * @sched: the scheduler to which this rq belongs to.
- * @lock: protects @entities, @rb_tree_root and @current_entity.
- * @current_entity: the entity which is to be scheduled.
+ * @lock: protects @entities, @rb_tree_root and @rr_ts.
+ * @rr_ts: monotonically incrementing fake timestamp for RR mode
  * @entities: list of the entities to be scheduled.
  * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
  *
@@ -253,7 +261,7 @@ struct drm_sched_rq {
 
 	spinlock_t			lock;
 	/* Following members are protected by the @lock: */
-	struct drm_sched_entity		*current_entity;
+	ktime_t				rr_ts;
 	struct list_head		entities;
 	struct rb_root_cached		rb_tree_root;
 };
-- 
2.48.0



* [RFC v8 04/12] drm/sched: Consolidate entity run queue management
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (2 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 03/12] drm/sched: Implement RR via FIFO Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-11 14:20   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 05/12] drm/sched: Move run queue related code into a separate file Tvrtko Ursulin
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

Move the code dealing with entities entering and exiting run queues to
helpers to logically separate it from jobs entering and exiting entities.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_entity.c   | 64 ++-------------
 drivers/gpu/drm/scheduler/sched_internal.h |  8 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 95 +++++++++++++++++++---
 3 files changed, 91 insertions(+), 76 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 4852006f2308..7a0a52ba87bf 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -456,24 +456,9 @@ drm_sched_job_dependency(struct drm_sched_job *job,
 	return NULL;
 }
 
-static ktime_t
-drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
-{
-	ktime_t ts;
-
-	lockdep_assert_held(&entity->lock);
-	lockdep_assert_held(&rq->lock);
-
-	ts = ktime_add_ns(rq->rr_ts, 1);
-	entity->rr_ts = ts;
-	rq->rr_ts = ts;
-
-	return ts;
-}
-
 struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 {
-	struct drm_sched_job *sched_job, *next_job;
+	struct drm_sched_job *sched_job;
 
 	sched_job = drm_sched_entity_queue_peek(entity);
 	if (!sched_job)
@@ -502,26 +487,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 
 	spsc_queue_pop(&entity->job_queue);
 
-	/*
-	 * Update the entity's location in the min heap according to
-	 * the timestamp of the next job, if any.
-	 */
-	next_job = drm_sched_entity_queue_peek(entity);
-	if (next_job) {
-		struct drm_sched_rq *rq;
-		ktime_t ts;
-
-		spin_lock(&entity->lock);
-		rq = entity->rq;
-		spin_lock(&rq->lock);
-		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-			ts = next_job->submit_ts;
-		else
-			ts = drm_sched_rq_get_rr_ts(rq, entity);
-		drm_sched_rq_update_fifo_locked(entity, rq, ts);
-		spin_unlock(&rq->lock);
-		spin_unlock(&entity->lock);
-	}
+	drm_sched_rq_pop_entity(entity);
 
 	/* Jobs and entities might have different lifecycles. Since we're
 	 * removing the job from the entities queue, set the jobs entity pointer
@@ -611,30 +577,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 	/* first job wakes up scheduler */
 	if (first) {
 		struct drm_gpu_scheduler *sched;
-		struct drm_sched_rq *rq;
 
-		/* Add the entity to the run queue */
-		spin_lock(&entity->lock);
-		if (entity->stopped) {
-			spin_unlock(&entity->lock);
-
-			DRM_ERROR("Trying to push to a killed entity\n");
-			return;
-		}
-
-		rq = entity->rq;
-		sched = rq->sched;
-
-		spin_lock(&rq->lock);
-		drm_sched_rq_add_entity(rq, entity);
-		if (drm_sched_policy == DRM_SCHED_POLICY_RR)
-			submit_ts = entity->rr_ts;
-		drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
-
-		spin_unlock(&rq->lock);
-		spin_unlock(&entity->lock);
-
-		drm_sched_wakeup(sched);
+		sched = drm_sched_rq_add_entity(entity, submit_ts);
+		if (sched)
+			drm_sched_wakeup(sched);
 	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index 7ea5a6736f98..8269c5392a82 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -12,13 +12,11 @@ extern int drm_sched_policy;
 
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 
-void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
-			     struct drm_sched_entity *entity);
+struct drm_gpu_scheduler *
+drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 				struct drm_sched_entity *entity);
-
-void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
-				     struct drm_sched_rq *rq, ktime_t ts);
+void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
 
 void drm_sched_entity_select_rq(struct drm_sched_entity *entity);
 struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1db0a4aa1d46..c53931e63458 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -151,9 +151,9 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
 	}
 }
 
-void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
-				     struct drm_sched_rq *rq,
-				     ktime_t ts)
+static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
+					    struct drm_sched_rq *rq,
+					    ktime_t ts)
 {
 	/*
 	 * Both locks need to be grabbed, one to protect from entity->rq change
@@ -191,22 +191,45 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
 /**
  * drm_sched_rq_add_entity - add an entity
  *
- * @rq: scheduler run queue
  * @entity: scheduler entity
+ * @ts: submission timestamp
  *
  * Adds a scheduler entity to the run queue.
+ *
+ * Returns a DRM scheduler pre-selected to handle this entity.
  */
-void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
-			     struct drm_sched_entity *entity)
+struct drm_gpu_scheduler *
+drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
 {
-	lockdep_assert_held(&entity->lock);
-	lockdep_assert_held(&rq->lock);
+	struct drm_gpu_scheduler *sched;
+	struct drm_sched_rq *rq;
 
-	if (!list_empty(&entity->list))
-		return;
+	/* Add the entity to the run queue */
+	spin_lock(&entity->lock);
+	if (entity->stopped) {
+		spin_unlock(&entity->lock);
 
-	atomic_inc(rq->sched->score);
-	list_add_tail(&entity->list, &rq->entities);
+		DRM_ERROR("Trying to push to a killed entity\n");
+		return NULL;
+	}
+
+	rq = entity->rq;
+	spin_lock(&rq->lock);
+	sched = rq->sched;
+
+	if (list_empty(&entity->list)) {
+		atomic_inc(sched->score);
+		list_add_tail(&entity->list, &rq->entities);
+	}
+
+	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
+		ts = entity->rr_ts;
+	drm_sched_rq_update_fifo_locked(entity, rq, ts);
+
+	spin_unlock(&rq->lock);
+	spin_unlock(&entity->lock);
+
+	return sched;
 }
 
 /**
@@ -235,6 +258,54 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 	spin_unlock(&rq->lock);
 }
 
+static ktime_t
+drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
+{
+	ktime_t ts;
+
+	lockdep_assert_held(&entity->lock);
+	lockdep_assert_held(&rq->lock);
+
+	ts = ktime_add_ns(rq->rr_ts, 1);
+	entity->rr_ts = ts;
+	rq->rr_ts = ts;
+
+	return ts;
+}
+
+/**
+ * drm_sched_rq_pop_entity - pops an entity
+ *
+ * @entity: scheduler entity
+ *
+ * To be called every time after a job is popped from the entity.
+ */
+void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
+{
+	struct drm_sched_job *next_job;
+	struct drm_sched_rq *rq;
+	ktime_t ts;
+
+	/*
+	 * Update the entity's location in the min heap according to
+	 * the timestamp of the next job, if any.
+	 */
+	next_job = drm_sched_entity_queue_peek(entity);
+	if (!next_job)
+		return;
+
+	spin_lock(&entity->lock);
+	rq = entity->rq;
+	spin_lock(&rq->lock);
+	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+		ts = next_job->submit_ts;
+	else
+		ts = drm_sched_rq_get_rr_ts(rq, entity);
+	drm_sched_rq_update_fifo_locked(entity, rq, ts);
+	spin_unlock(&rq->lock);
+	spin_unlock(&entity->lock);
+}
+
 /**
  * drm_sched_rq_select_entity - Select an entity which provides a job to run
  *
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 05/12] drm/sched: Move run queue related code into a separate file
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (3 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 04/12] drm/sched: Consolidate entity run queue management Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-24  7:54   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 06/12] drm/sched: Free all finished jobs at once Tvrtko Ursulin
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

Let's move all the code dealing with struct drm_sched_rq into a separate
compilation unit. The advantage is that sched_main.c is left with a
clearer set of responsibilities.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/Makefile         |   2 +-
 drivers/gpu/drm/scheduler/sched_internal.h |   7 +
 drivers/gpu/drm/scheduler/sched_main.c     | 218 +-------------------
 drivers/gpu/drm/scheduler/sched_rq.c       | 222 +++++++++++++++++++++
 4 files changed, 232 insertions(+), 217 deletions(-)
 create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c

diff --git a/drivers/gpu/drm/scheduler/Makefile b/drivers/gpu/drm/scheduler/Makefile
index 6e13e4c63e9d..74e75eff6df5 100644
--- a/drivers/gpu/drm/scheduler/Makefile
+++ b/drivers/gpu/drm/scheduler/Makefile
@@ -20,7 +20,7 @@
 # OTHER DEALINGS IN THE SOFTWARE.
 #
 #
-gpu-sched-y := sched_main.o sched_fence.o sched_entity.o
+gpu-sched-y := sched_main.o sched_fence.o sched_entity.o sched_rq.o
 
 obj-$(CONFIG_DRM_SCHED) += gpu-sched.o
 
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index 8269c5392a82..703ee48fbc58 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -10,8 +10,15 @@ extern int drm_sched_policy;
 #define DRM_SCHED_POLICY_RR    0
 #define DRM_SCHED_POLICY_FIFO  1
 
+bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
+			 struct drm_sched_entity *entity);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 
+void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_rq *rq);
+struct drm_sched_entity *
+drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
+			   struct drm_sched_rq *rq);
 struct drm_gpu_scheduler *
 drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c53931e63458..97bec2f86502 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -112,8 +112,8 @@ static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
  * Return true if we can push at least one more job from @entity, false
  * otherwise.
  */
-static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
-				struct drm_sched_entity *entity)
+bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
+			 struct drm_sched_entity *entity)
 {
 	struct drm_sched_job *s_job;
 
@@ -133,220 +133,6 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
 	return drm_sched_available_credits(sched) >= s_job->credits;
 }
 
-static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
-							    const struct rb_node *b)
-{
-	struct drm_sched_entity *ent_a =  rb_entry((a), struct drm_sched_entity, rb_tree_node);
-	struct drm_sched_entity *ent_b =  rb_entry((b), struct drm_sched_entity, rb_tree_node);
-
-	return ktime_before(ent_a->oldest_job_waiting, ent_b->oldest_job_waiting);
-}
-
-static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
-					    struct drm_sched_rq *rq)
-{
-	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
-		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
-		RB_CLEAR_NODE(&entity->rb_tree_node);
-	}
-}
-
-static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
-					    struct drm_sched_rq *rq,
-					    ktime_t ts)
-{
-	/*
-	 * Both locks need to be grabbed, one to protect from entity->rq change
-	 * for entity from within concurrent drm_sched_entity_select_rq and the
-	 * other to update the rb tree structure.
-	 */
-	lockdep_assert_held(&entity->lock);
-	lockdep_assert_held(&rq->lock);
-
-	drm_sched_rq_remove_fifo_locked(entity, rq);
-
-	entity->oldest_job_waiting = ts;
-
-	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
-		      drm_sched_entity_compare_before);
-}
-
-/**
- * drm_sched_rq_init - initialize a given run queue struct
- *
- * @sched: scheduler instance to associate with this run queue
- * @rq: scheduler run queue
- *
- * Initializes a scheduler runqueue.
- */
-static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
-			      struct drm_sched_rq *rq)
-{
-	spin_lock_init(&rq->lock);
-	INIT_LIST_HEAD(&rq->entities);
-	rq->rb_tree_root = RB_ROOT_CACHED;
-	rq->sched = sched;
-}
-
-/**
- * drm_sched_rq_add_entity - add an entity
- *
- * @entity: scheduler entity
- * @ts: submission timestamp
- *
- * Adds a scheduler entity to the run queue.
- *
- * Returns a DRM scheduler pre-selected to handle this entity.
- */
-struct drm_gpu_scheduler *
-drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
-{
-	struct drm_gpu_scheduler *sched;
-	struct drm_sched_rq *rq;
-
-	/* Add the entity to the run queue */
-	spin_lock(&entity->lock);
-	if (entity->stopped) {
-		spin_unlock(&entity->lock);
-
-		DRM_ERROR("Trying to push to a killed entity\n");
-		return NULL;
-	}
-
-	rq = entity->rq;
-	spin_lock(&rq->lock);
-	sched = rq->sched;
-
-	if (list_empty(&entity->list)) {
-		atomic_inc(sched->score);
-		list_add_tail(&entity->list, &rq->entities);
-	}
-
-	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
-		ts = entity->rr_ts;
-	drm_sched_rq_update_fifo_locked(entity, rq, ts);
-
-	spin_unlock(&rq->lock);
-	spin_unlock(&entity->lock);
-
-	return sched;
-}
-
-/**
- * drm_sched_rq_remove_entity - remove an entity
- *
- * @rq: scheduler run queue
- * @entity: scheduler entity
- *
- * Removes a scheduler entity from the run queue.
- */
-void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
-				struct drm_sched_entity *entity)
-{
-	lockdep_assert_held(&entity->lock);
-
-	if (list_empty(&entity->list))
-		return;
-
-	spin_lock(&rq->lock);
-
-	atomic_dec(rq->sched->score);
-	list_del_init(&entity->list);
-
-	drm_sched_rq_remove_fifo_locked(entity, rq);
-
-	spin_unlock(&rq->lock);
-}
-
-static ktime_t
-drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
-{
-	ktime_t ts;
-
-	lockdep_assert_held(&entity->lock);
-	lockdep_assert_held(&rq->lock);
-
-	ts = ktime_add_ns(rq->rr_ts, 1);
-	entity->rr_ts = ts;
-	rq->rr_ts = ts;
-
-	return ts;
-}
-
-/**
- * drm_sched_rq_pop_entity - pops an entity
- *
- * @entity: scheduler entity
- *
- * To be called every time after a job is popped from the entity.
- */
-void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
-{
-	struct drm_sched_job *next_job;
-	struct drm_sched_rq *rq;
-	ktime_t ts;
-
-	/*
-	 * Update the entity's location in the min heap according to
-	 * the timestamp of the next job, if any.
-	 */
-	next_job = drm_sched_entity_queue_peek(entity);
-	if (!next_job)
-		return;
-
-	spin_lock(&entity->lock);
-	rq = entity->rq;
-	spin_lock(&rq->lock);
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-		ts = next_job->submit_ts;
-	else
-		ts = drm_sched_rq_get_rr_ts(rq, entity);
-	drm_sched_rq_update_fifo_locked(entity, rq, ts);
-	spin_unlock(&rq->lock);
-	spin_unlock(&entity->lock);
-}
-
-/**
- * drm_sched_rq_select_entity - Select an entity which provides a job to run
- *
- * @sched: the gpu scheduler
- * @rq: scheduler run queue to check.
- *
- * Find oldest waiting ready entity.
- *
- * Return an entity if one is found; return an error-pointer (!NULL) if an
- * entity was ready, but the scheduler had insufficient credits to accommodate
- * its job; return NULL, if no ready entity was found.
- */
-static struct drm_sched_entity *
-drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
-			   struct drm_sched_rq *rq)
-{
-	struct rb_node *rb;
-
-	spin_lock(&rq->lock);
-	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
-		struct drm_sched_entity *entity;
-
-		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
-		if (drm_sched_entity_is_ready(entity)) {
-			/* If we can't queue yet, preserve the current entity in
-			 * terms of fairness.
-			 */
-			if (!drm_sched_can_queue(sched, entity)) {
-				spin_unlock(&rq->lock);
-				return ERR_PTR(-ENOSPC);
-			}
-
-			reinit_completion(&entity->entity_idle);
-			break;
-		}
-	}
-	spin_unlock(&rq->lock);
-
-	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
-}
-
 /**
  * drm_sched_run_job_queue - enqueue run-job work
  * @sched: scheduler instance
diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
new file mode 100644
index 000000000000..16d57461765e
--- /dev/null
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -0,0 +1,222 @@
+#include <linux/rbtree.h>
+
+#include <drm/drm_print.h>
+#include <drm/gpu_scheduler.h>
+
+#include "sched_internal.h"
+
+static __always_inline bool
+drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
+{
+	struct drm_sched_entity *ea =
+		rb_entry((a), struct drm_sched_entity, rb_tree_node);
+	struct drm_sched_entity *eb =
+		rb_entry((b), struct drm_sched_entity, rb_tree_node);
+
+	return ktime_before(ea->oldest_job_waiting, eb->oldest_job_waiting);
+}
+
+static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
+					    struct drm_sched_rq *rq)
+{
+	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
+		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
+		RB_CLEAR_NODE(&entity->rb_tree_node);
+	}
+}
+
+static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
+					    struct drm_sched_rq *rq,
+					    ktime_t ts)
+{
+	/*
+	 * Both locks need to be grabbed, one to protect from entity->rq change
+	 * for entity from within concurrent drm_sched_entity_select_rq and the
+	 * other to update the rb tree structure.
+	 */
+	lockdep_assert_held(&entity->lock);
+	lockdep_assert_held(&rq->lock);
+
+	drm_sched_rq_remove_fifo_locked(entity, rq);
+
+	entity->oldest_job_waiting = ts;
+
+	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
+		      drm_sched_entity_compare_before);
+}
+
+/**
+ * drm_sched_rq_init - initialize a given run queue struct
+ *
+ * @sched: scheduler instance to associate with this run queue
+ * @rq: scheduler run queue
+ *
+ * Initializes a scheduler runqueue.
+ */
+void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_rq *rq)
+{
+	spin_lock_init(&rq->lock);
+	INIT_LIST_HEAD(&rq->entities);
+	rq->rb_tree_root = RB_ROOT_CACHED;
+	rq->sched = sched;
+}
+
+/**
+ * drm_sched_rq_add_entity - add an entity
+ *
+ * @entity: scheduler entity
+ * @ts: submission timestamp
+ *
+ * Adds a scheduler entity to the run queue.
+ *
+ * Returns a DRM scheduler pre-selected to handle this entity.
+ */
+struct drm_gpu_scheduler *
+drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
+{
+	struct drm_gpu_scheduler *sched;
+	struct drm_sched_rq *rq;
+
+	/* Add the entity to the run queue */
+	spin_lock(&entity->lock);
+	if (entity->stopped) {
+		spin_unlock(&entity->lock);
+
+		DRM_ERROR("Trying to push to a killed entity\n");
+		return NULL;
+	}
+
+	rq = entity->rq;
+	spin_lock(&rq->lock);
+	sched = rq->sched;
+
+	if (list_empty(&entity->list)) {
+		atomic_inc(sched->score);
+		list_add_tail(&entity->list, &rq->entities);
+	}
+
+	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
+		ts = entity->rr_ts;
+	drm_sched_rq_update_fifo_locked(entity, rq, ts);
+
+	spin_unlock(&rq->lock);
+	spin_unlock(&entity->lock);
+
+	return sched;
+}
+
+/**
+ * drm_sched_rq_remove_entity - remove an entity
+ *
+ * @rq: scheduler run queue
+ * @entity: scheduler entity
+ *
+ * Removes a scheduler entity from the run queue.
+ */
+void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
+				struct drm_sched_entity *entity)
+{
+	lockdep_assert_held(&entity->lock);
+
+	if (list_empty(&entity->list))
+		return;
+
+	spin_lock(&rq->lock);
+
+	atomic_dec(rq->sched->score);
+	list_del_init(&entity->list);
+
+	drm_sched_rq_remove_fifo_locked(entity, rq);
+
+	spin_unlock(&rq->lock);
+}
+
+static ktime_t
+drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
+{
+	ktime_t ts;
+
+	lockdep_assert_held(&entity->lock);
+	lockdep_assert_held(&rq->lock);
+
+	ts = ktime_add_ns(rq->rr_ts, 1);
+	entity->rr_ts = ts;
+	rq->rr_ts = ts;
+
+	return ts;
+}
+
+/**
+ * drm_sched_rq_pop_entity - pops an entity
+ *
+ * @entity: scheduler entity
+ *
+ * To be called every time after a job is popped from the entity.
+ */
+void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
+{
+	struct drm_sched_job *next_job;
+	struct drm_sched_rq *rq;
+	ktime_t ts;
+
+	/*
+	 * Update the entity's location in the min heap according to
+	 * the timestamp of the next job, if any.
+	 */
+	next_job = drm_sched_entity_queue_peek(entity);
+	if (!next_job)
+		return;
+
+	spin_lock(&entity->lock);
+	rq = entity->rq;
+	spin_lock(&rq->lock);
+	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+		ts = next_job->submit_ts;
+	else
+		ts = drm_sched_rq_get_rr_ts(rq, entity);
+	drm_sched_rq_update_fifo_locked(entity, rq, ts);
+	spin_unlock(&rq->lock);
+	spin_unlock(&entity->lock);
+}
+
+/**
+ * drm_sched_rq_select_entity - Select an entity which provides a job to run
+ *
+ * @sched: the gpu scheduler
+ * @rq: scheduler run queue to check.
+ *
+ * Find oldest waiting ready entity.
+ *
+ * Return an entity if one is found; return an error-pointer (!NULL) if an
+ * entity was ready, but the scheduler had insufficient credits to accommodate
+ * its job; return NULL, if no ready entity was found.
+ */
+struct drm_sched_entity *
+drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
+			   struct drm_sched_rq *rq)
+{
+	struct rb_node *rb;
+
+	spin_lock(&rq->lock);
+	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
+		struct drm_sched_entity *entity;
+
+		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
+		if (drm_sched_entity_is_ready(entity)) {
+			/* If we can't queue yet, preserve the current entity in
+			 * terms of fairness.
+			 */
+			if (!drm_sched_can_queue(sched, entity)) {
+				spin_unlock(&rq->lock);
+				return ERR_PTR(-ENOSPC);
+			}
+
+			reinit_completion(&entity->entity_idle);
+			break;
+		}
+	}
+	spin_unlock(&rq->lock);
+
+	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
+}
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 06/12] drm/sched: Free all finished jobs at once
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (4 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 05/12] drm/sched: Move run queue related code into a separate file Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-24  8:07   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 07/12] drm/sched: Account entity GPU time Tvrtko Ursulin
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

To implement fair scheduling we will need as accurate a view as possible
into per-entity GPU time utilisation. Because sched fence execution times
are only adjusted for accuracy in the free worker, we need to process
completed jobs as soon as possible so the metric is as up to date as
possible when viewed from the submission side of things.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 97bec2f86502..9411676e772a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -906,7 +906,6 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
  * drm_sched_get_finished_job - fetch the next finished job to be destroyed
  *
  * @sched: scheduler instance
- * @have_more: are there more finished jobs on the list
  *
  * Informs the caller through @have_more whether there are more finished jobs
  * besides the returned one.
@@ -915,7 +914,7 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
  * ready for it to be destroyed.
  */
 static struct drm_sched_job *
-drm_sched_get_finished_job(struct drm_gpu_scheduler *sched, bool *have_more)
+drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_job *job, *next;
 
@@ -930,7 +929,6 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched, bool *have_more)
 		/* cancel this job's TO timer */
 		cancel_delayed_work(&sched->work_tdr);
 
-		*have_more = false;
 		next = list_first_entry_or_null(&sched->pending_list,
 						typeof(*next), list);
 		if (next) {
@@ -940,8 +938,6 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched, bool *have_more)
 				next->s_fence->scheduled.timestamp =
 					dma_fence_timestamp(&job->s_fence->finished);
 
-			*have_more = dma_fence_is_signaled(&next->s_fence->finished);
-
 			/* start TO timer for next job */
 			drm_sched_start_timeout(sched);
 		}
@@ -1000,14 +996,9 @@ static void drm_sched_free_job_work(struct work_struct *w)
 	struct drm_gpu_scheduler *sched =
 		container_of(w, struct drm_gpu_scheduler, work_free_job);
 	struct drm_sched_job *job;
-	bool have_more;
 
-	job = drm_sched_get_finished_job(sched, &have_more);
-	if (job) {
+	while ((job = drm_sched_get_finished_job(sched)))
 		sched->ops->free_job(job);
-		if (have_more)
-			drm_sched_run_free_queue(sched);
-	}
 
 	drm_sched_run_job_queue(sched);
 }
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 07/12] drm/sched: Account entity GPU time
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (5 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 06/12] drm/sched: Free all finished jobs at once Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-24  9:11   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 08/12] drm/sched: Remove idle entity from tree Tvrtko Ursulin
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

To implement fair scheduling we need a view into the GPU time consumed by
entities. The problem we have is that job and entity objects have
decoupled lifetimes, and by the point we have an accurate view of the GPU
time we can no longer link back to the entity.

Solve this by adding a lightweight entity stats object which is reference
counted by both the entity and the job and hence can safely be used from
either side.

With that, the only other thing we need is a helper which adds the job's
GPU time into the respective entity stats object, called once the
accurate GPU time has been calculated.
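
As a rough usage sketch of the resulting object lifetime (a simplified
pseudo-flow of the calls added by this patch, not verbatim kernel code):

	/* drm_sched_job_arm(): the job takes its own reference */
	job->entity_stats = drm_sched_entity_stats_get(entity->stats);

	/* free worker, once the finished fence timestamp is accurate */
	drm_sched_entity_stats_job_add_gpu_time(job);

	/* drm_sched_job_cleanup(): the job side drops its reference */
	drm_sched_entity_stats_put(job->entity_stats);

	/* drm_sched_entity_fini(): the entity side drops its reference */
	drm_sched_entity_stats_put(entity->stats);

Whichever side drops the last reference frees the object via
drm_sched_entity_stats_release(), so the accounting remains safe even
after the entity itself is destroyed.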

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_entity.c   | 39 +++++++++++++
 drivers/gpu/drm/scheduler/sched_internal.h | 66 ++++++++++++++++++++++
 drivers/gpu/drm/scheduler/sched_main.c     |  6 +-
 include/drm/gpu_scheduler.h                | 12 ++++
 4 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 7a0a52ba87bf..04ce8b7d436b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -32,6 +32,39 @@
 
 #include "gpu_scheduler_trace.h"
 
+
+/**
+ * drm_sched_entity_stats_release - Entity stats kref release function
+ *
+ * @kref: Entity stats embedded kref pointer
+ */
+void drm_sched_entity_stats_release(struct kref *kref)
+{
+	struct drm_sched_entity_stats *stats =
+		container_of(kref, typeof(*stats), kref);
+
+	kfree(stats);
+}
+
+/**
+ * drm_sched_entity_stats_alloc - Allocate a new struct drm_sched_entity_stats object
+ *
+ * Returns: Pointer to newly allocated struct drm_sched_entity_stats object.
+ */
+static struct drm_sched_entity_stats *drm_sched_entity_stats_alloc(void)
+{
+	struct drm_sched_entity_stats *stats;
+
+	stats = kzalloc(sizeof(*stats), GFP_KERNEL);
+	if (!stats)
+		return NULL;
+
+	kref_init(&stats->kref);
+	spin_lock_init(&stats->lock);
+
+	return stats;
+}
+
 /**
  * drm_sched_entity_init - Init a context entity used by scheduler when
  * submit to HW ring.
@@ -65,6 +98,11 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 		return -EINVAL;
 
 	memset(entity, 0, sizeof(struct drm_sched_entity));
+
+	entity->stats = drm_sched_entity_stats_alloc();
+	if (!entity->stats)
+		return -ENOMEM;
+
 	INIT_LIST_HEAD(&entity->list);
 	entity->rq = NULL;
 	entity->guilty = guilty;
@@ -338,6 +376,7 @@ void drm_sched_entity_fini(struct drm_sched_entity *entity)
 
 	dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
 	RCU_INIT_POINTER(entity->last_scheduled, NULL);
+	drm_sched_entity_stats_put(entity->stats);
 }
 EXPORT_SYMBOL(drm_sched_entity_fini);
 
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index 703ee48fbc58..27c8460a3601 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -3,6 +3,22 @@
 #ifndef _DRM_GPU_SCHEDULER_INTERNAL_H_
 #define _DRM_GPU_SCHEDULER_INTERNAL_H_
 
+#include <linux/ktime.h>
+#include <linux/kref.h>
+#include <linux/spinlock.h>
+
+/**
+ * struct drm_sched_entity_stats - execution stats for an entity.
+ *
+ * @kref: reference count for the object.
+ * @lock: lock guarding the @runtime updates.
+ * @runtime: time entity spent on the GPU.
+ */
+struct drm_sched_entity_stats {
+	struct kref	kref;
+	spinlock_t	lock;
+	ktime_t		runtime;
+};
 
 /* Used to choose between FIFO and RR job-scheduling */
 extern int drm_sched_policy;
@@ -93,4 +109,54 @@ drm_sched_entity_is_ready(struct drm_sched_entity *entity)
 	return true;
 }
 
+void drm_sched_entity_stats_release(struct kref *kref);
+
+/**
+ * drm_sched_entity_stats_get - Obtain a reference count on struct drm_sched_entity_stats object
+ *
+ * @stats: struct drm_sched_entity_stats pointer
+ *
+ * Returns: struct drm_sched_entity_stats pointer
+ */
+static inline struct drm_sched_entity_stats *
+drm_sched_entity_stats_get(struct drm_sched_entity_stats *stats)
+{
+	kref_get(&stats->kref);
+
+	return stats;
+}
+
+/**
+ * drm_sched_entity_stats_put - Release a reference count on struct drm_sched_entity_stats object
+ *
+ * @stats: struct drm_sched_entity_stats pointer
+ */
+static inline void
+drm_sched_entity_stats_put(struct drm_sched_entity_stats *stats)
+{
+	kref_put(&stats->kref, drm_sched_entity_stats_release);
+}
+
+/**
+ * drm_sched_entity_stats_job_add_gpu_time - Account job execution time to entity
+ *
+ * @job: Scheduler job to account.
+ *
+ * Accounts the execution time of @job to its respective entity stats object.
+ */
+static inline void
+drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
+{
+	struct drm_sched_entity_stats *stats = job->entity_stats;
+	struct drm_sched_fence *s_fence = job->s_fence;
+	ktime_t start, end;
+
+	start = dma_fence_timestamp(&s_fence->scheduled);
+	end = dma_fence_timestamp(&s_fence->finished);
+
+	spin_lock(&stats->lock);
+	stats->runtime = ktime_add(stats->runtime, ktime_sub(end, start));
+	spin_unlock(&stats->lock);
+}
+
 #endif
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9411676e772a..a5d7706efbea 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -658,6 +658,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
 
 	job->sched = sched;
 	job->s_priority = entity->priority;
+	job->entity_stats = drm_sched_entity_stats_get(entity->stats);
 
 	drm_sched_fence_init(job->s_fence, job->entity);
 }
@@ -846,6 +847,7 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
 		 * been called.
 		 */
 		dma_fence_put(&job->s_fence->finished);
+		drm_sched_entity_stats_put(job->entity_stats);
 	} else {
 		/* The job was aborted before it has been committed to be run;
 		 * notably, drm_sched_job_arm() has not been called.
@@ -997,8 +999,10 @@ static void drm_sched_free_job_work(struct work_struct *w)
 		container_of(w, struct drm_gpu_scheduler, work_free_job);
 	struct drm_sched_job *job;
 
-	while ((job = drm_sched_get_finished_job(sched)))
+	while ((job = drm_sched_get_finished_job(sched))) {
+		drm_sched_entity_stats_job_add_gpu_time(job);
 		sched->ops->free_job(job);
+	}
 
 	drm_sched_run_job_queue(sched);
 }
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 802a0060db55..f33c78473867 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -71,6 +71,8 @@ enum drm_sched_priority {
 	DRM_SCHED_PRIORITY_COUNT
 };
 
+struct drm_sched_entity_stats;
+
 /**
  * struct drm_sched_entity - A wrapper around a job queue (typically
  * attached to the DRM file_priv).
@@ -110,6 +112,11 @@ struct drm_sched_entity {
 	 */
 	struct drm_sched_rq		*rq;
 
+	/**
+	 * @stats: Stats object reference held by the entity and jobs.
+	 */
+	struct drm_sched_entity_stats	*stats;
+
 	/**
 	 * @sched_list:
 	 *
@@ -365,6 +372,11 @@ struct drm_sched_job {
 	struct drm_sched_fence		*s_fence;
 	struct drm_sched_entity         *entity;
 
+	/**
+	 * @entity_stats: Stats object reference held by the job and entity.
+	 */
+	struct drm_sched_entity_stats	*entity_stats;
+
 	enum drm_sched_priority		s_priority;
 	u32				credits;
 	/** @last_dependency: tracks @dependencies as they signal */
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 08/12] drm/sched: Remove idle entity from tree
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (6 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 07/12] drm/sched: Account entity GPU time Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-11 14:32   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 09/12] drm/sched: Add fair scheduling policy Tvrtko Ursulin
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

There is no need to keep entities with no jobs in the tree, so let's
remove them once the last job is consumed. This keeps the tree smaller,
which is nicer and more efficient given that entities are removed and
re-added on every popped job.

Apart from that, the upcoming fair scheduling algorithm will rely on the
tree only containing runnable entities.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/scheduler/sched_rq.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index 16d57461765e..67804815ca0d 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -19,6 +19,9 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
 static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
 					    struct drm_sched_rq *rq)
 {
+	lockdep_assert_held(&entity->lock);
+	lockdep_assert_held(&rq->lock);
+
 	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
 		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
 		RB_CLEAR_NODE(&entity->rb_tree_node);
@@ -158,24 +161,27 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
 {
 	struct drm_sched_job *next_job;
 	struct drm_sched_rq *rq;
-	ktime_t ts;
 
 	/*
 	 * Update the entity's location in the min heap according to
 	 * the timestamp of the next job, if any.
 	 */
+	spin_lock(&entity->lock);
+	rq = entity->rq;
+	spin_lock(&rq->lock);
 	next_job = drm_sched_entity_queue_peek(entity);
-	if (!next_job)
-		return;
+	if (next_job) {
+		ktime_t ts;
 
-	spin_lock(&entity->lock);
-	rq = entity->rq;
-	spin_lock(&rq->lock);
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-		ts = next_job->submit_ts;
-	else
-		ts = drm_sched_rq_get_rr_ts(rq, entity);
-	drm_sched_rq_update_fifo_locked(entity, rq, ts);
+		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+			ts = next_job->submit_ts;
+		else
+			ts = drm_sched_rq_get_rr_ts(rq, entity);
+
+		drm_sched_rq_update_fifo_locked(entity, rq, ts);
+	} else {
+		drm_sched_rq_remove_fifo_locked(entity, rq);
+	}
 	spin_unlock(&rq->lock);
 	spin_unlock(&entity->lock);
 }
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 09/12] drm/sched: Add fair scheduling policy
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (7 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 08/12] drm/sched: Remove idle entity from tree Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-24  9:38   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 10/12] drm/sched: Break submission patterns with some randomness Tvrtko Ursulin
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner,
	Pierre-Eric Pelloux-Prayer

The fair scheduling policy is built upon the same concepts as the well
known CFS kernel scheduler - the entity run queue is sorted by the
virtual GPU time consumed by entities, such that the entity with the
least vruntime runs first.

It is able to avoid total priority starvation, which is one of the
problems with FIFO, and it also eliminates the need for per-priority run
queues. It scales the actual GPU runtime by an exponential factor as the
priority decreases, so the virtual runtime for low priority entities
grows faster than for normal priority ones, pushing them further down the
runqueue order for the same real GPU time spent.
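
To make the scaling concrete, this is roughly how the accounting works,
based on the per-priority shift table introduced in this patch
(illustrative sketch, not verbatim kernel code):

	static const unsigned int shift[] = {
		[DRM_SCHED_PRIORITY_KERNEL] = 1,
		[DRM_SCHED_PRIORITY_HIGH]   = 2,
		[DRM_SCHED_PRIORITY_NORMAL] = 4,
		[DRM_SCHED_PRIORITY_LOW]    = 7,
	};

	/* Each nanosecond of GPU time costs 2^shift ns of virtual time */
	vruntime += runtime_ns << shift[entity->priority];

So 1 ms of real GPU time is charged as 2 ms of vruntime at kernel
priority, 16 ms at normal and 128 ms at low priority.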

Apart from this fundamental fairness, the fair policy is especially
strong in oversubscribed workloads, where it is able to give more GPU
time to short and bursty workloads running in parallel with GPU heavy
clients submitting deep job queues.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c   |  28 ++--
 drivers/gpu/drm/scheduler/sched_internal.h |   7 +-
 drivers/gpu/drm/scheduler/sched_main.c     |  14 +-
 drivers/gpu/drm/scheduler/sched_rq.c       | 147 ++++++++++++++++++++-
 include/drm/gpu_scheduler.h                |  10 +-
 5 files changed, 186 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 04ce8b7d436b..58f51875547a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -108,6 +108,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	entity->guilty = guilty;
 	entity->num_sched_list = num_sched_list;
 	entity->priority = priority;
+	entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
+			      DRM_SCHED_PRIORITY_KERNEL : priority;
 	/*
 	 * It's perfectly valid to initialize an entity without having a valid
 	 * scheduler attached. It's just not valid to use the scheduler before it
@@ -124,17 +126,23 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 		 */
 		pr_warn("%s: called with uninitialized scheduler\n", __func__);
 	} else if (num_sched_list) {
-		/* The "priority" of an entity cannot exceed the number of run-queues of a
-		 * scheduler. Protect against num_rqs being 0, by converting to signed. Choose
-		 * the lowest priority available.
+		enum drm_sched_priority p = entity->priority;
+
+		/*
+		 * The "priority" of an entity cannot exceed the number of
+		 * run-queues of a scheduler. Protect against num_rqs being 0,
+		 * by converting to signed. Choose the lowest priority
+		 * available.
 		 */
-		if (entity->priority >= sched_list[0]->num_rqs) {
-			dev_err(sched_list[0]->dev, "entity has out-of-bounds priority: %u. num_rqs: %u\n",
-				entity->priority, sched_list[0]->num_rqs);
-			entity->priority = max_t(s32, (s32) sched_list[0]->num_rqs - 1,
-						 (s32) DRM_SCHED_PRIORITY_KERNEL);
+		if (p >= sched_list[0]->num_user_rqs) {
+			dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
+				p, sched_list[0]->num_user_rqs);
+			p = max_t(s32,
+				 (s32)sched_list[0]->num_user_rqs - 1,
+				 (s32)DRM_SCHED_PRIORITY_KERNEL);
+			entity->priority = p;
 		}
-		entity->rq = sched_list[0]->sched_rq[entity->priority];
+		entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
 	}
 
 	init_completion(&entity->entity_idle);
@@ -567,7 +575,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 
 	spin_lock(&entity->lock);
 	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
-	rq = sched ? sched->sched_rq[entity->priority] : NULL;
+	rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
 	if (rq != entity->rq) {
 		drm_sched_rq_remove_entity(entity->rq, entity);
 		entity->rq = rq;
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index 27c8460a3601..125aba70eda6 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -18,13 +18,16 @@ struct drm_sched_entity_stats {
 	struct kref	kref;
 	spinlock_t	lock;
 	ktime_t		runtime;
+	ktime_t		prev_runtime;
+	u64		vruntime;
 };
 
 /* Used to choose between FIFO and RR job-scheduling */
 extern int drm_sched_policy;
 
-#define DRM_SCHED_POLICY_RR    0
-#define DRM_SCHED_POLICY_FIFO  1
+#define DRM_SCHED_POLICY_RR	0
+#define DRM_SCHED_POLICY_FIFO	1
+#define DRM_SCHED_POLICY_FAIR	2
 
 bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
 			 struct drm_sched_entity *entity);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a5d7706efbea..e7726095c19a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -84,13 +84,13 @@
 #define CREATE_TRACE_POINTS
 #include "gpu_scheduler_trace.h"
 
-int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
+int drm_sched_policy = DRM_SCHED_POLICY_FAIR;
 
 /**
  * DOC: sched_policy (int)
  * Used to override default entities scheduling policy in a run queue.
  */
-MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
+MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
 module_param_named(sched_policy, drm_sched_policy, int, 0444);
 
 static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
@@ -1132,11 +1132,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 		sched->own_submit_wq = true;
 	}
 
-	sched->sched_rq = kmalloc_array(args->num_rqs, sizeof(*sched->sched_rq),
+	sched->num_user_rqs = args->num_rqs;
+	sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
+			 args->num_rqs : 1;
+	sched->sched_rq = kmalloc_array(sched->num_rqs,
+					sizeof(*sched->sched_rq),
 					GFP_KERNEL | __GFP_ZERO);
 	if (!sched->sched_rq)
 		goto Out_check_own;
-	sched->num_rqs = args->num_rqs;
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
 		if (!sched->sched_rq[i])
@@ -1278,7 +1282,7 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
 	if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
 		atomic_inc(&bad->karma);
 
-		for (i = DRM_SCHED_PRIORITY_HIGH; i < sched->num_rqs; i++) {
+		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 			struct drm_sched_rq *rq = sched->sched_rq[i];
 
 			spin_lock(&rq->lock);
diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index 67804815ca0d..b0cf7d2143c8 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -16,6 +16,24 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
 	return ktime_before(ea->oldest_job_waiting, eb->oldest_job_waiting);
 }
 
+static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
+{
+	enum drm_sched_priority prio = -1;
+	struct rb_node *rb;
+
+	lockdep_assert_held(&rq->lock);
+
+	rb = rb_first_cached(&rq->rb_tree_root);
+	if (rb) {
+		struct drm_sched_entity *entity =
+			rb_entry(rb, typeof(*entity), rb_tree_node);
+
+		prio = entity->priority; /* Unlocked read */
+	}
+
+	rq->head_prio = prio;
+}
+
 static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
 					    struct drm_sched_rq *rq)
 {
@@ -25,6 +43,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
 	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
 		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
 		RB_CLEAR_NODE(&entity->rb_tree_node);
+		drm_sched_rq_update_prio(rq);
 	}
 }
 
@@ -46,6 +65,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
 
 	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
 		      drm_sched_entity_compare_before);
+	drm_sched_rq_update_prio(rq);
 }
 
 /**
@@ -63,6 +83,114 @@ void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
 	INIT_LIST_HEAD(&rq->entities);
 	rq->rb_tree_root = RB_ROOT_CACHED;
 	rq->sched = sched;
+	rq->head_prio = -1;
+}
+
+static ktime_t
+drm_sched_rq_get_min_vruntime(struct drm_sched_rq *rq)
+{
+	struct drm_sched_entity *entity;
+	struct rb_node *rb;
+
+	lockdep_assert_held(&rq->lock);
+
+	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
+		entity = rb_entry(rb, typeof(*entity), rb_tree_node);
+
+		return entity->stats->vruntime; /* Unlocked read */
+	}
+
+	return 0;
+}
+
+static void
+drm_sched_entity_save_vruntime(struct drm_sched_entity *entity,
+			       ktime_t min_vruntime)
+{
+	struct drm_sched_entity_stats *stats = entity->stats;
+	ktime_t vruntime;
+
+	spin_lock(&stats->lock);
+	vruntime = stats->vruntime;
+	if (min_vruntime && vruntime > min_vruntime)
+		vruntime = ktime_sub(vruntime, min_vruntime);
+	else
+		vruntime = 0;
+	stats->vruntime = vruntime;
+	spin_unlock(&stats->lock);
+}
+
+static ktime_t
+drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
+				  ktime_t min_vruntime,
+				  enum drm_sched_priority rq_prio)
+{
+	struct drm_sched_entity_stats *stats = entity->stats;
+	enum drm_sched_priority prio = entity->priority;
+	ktime_t vruntime;
+
+	BUILD_BUG_ON(DRM_SCHED_PRIORITY_NORMAL < DRM_SCHED_PRIORITY_HIGH);
+
+	spin_lock(&stats->lock);
+	vruntime = stats->vruntime;
+
+	/*
+	 * Special handling for entities which were picked from the top of the
+	 * queue and are now re-joining the top with another one already there.
+	 */
+	if (!vruntime && min_vruntime) {
+		if (prio > rq_prio) {
+			/*
+			 * Lower priority should not overtake higher when re-
+			 * joining at the top of the queue.
+			 */
+			vruntime = us_to_ktime(prio - rq_prio);
+		} else if (prio < rq_prio) {
+			/*
+			 * Higher priority can go first.
+			 */
+			vruntime = -us_to_ktime(rq_prio - prio);
+		}
+	}
+
+	/*
+	 * Restore saved relative position in the queue.
+	 */
+	vruntime = ktime_add(min_vruntime, vruntime);
+
+	stats->vruntime = vruntime;
+	spin_unlock(&stats->lock);
+
+	return vruntime;
+}
+
+static ktime_t drm_sched_entity_update_vruntime(struct drm_sched_entity *entity)
+{
+	static const unsigned int shift[] = {
+		[DRM_SCHED_PRIORITY_KERNEL] = 1,
+		[DRM_SCHED_PRIORITY_HIGH]   = 2,
+		[DRM_SCHED_PRIORITY_NORMAL] = 4,
+		[DRM_SCHED_PRIORITY_LOW]    = 7,
+	};
+	struct drm_sched_entity_stats *stats = entity->stats;
+	ktime_t runtime, prev;
+
+	spin_lock(&stats->lock);
+	prev = stats->prev_runtime;
+	runtime = stats->runtime;
+	stats->prev_runtime = runtime;
+	runtime = ktime_add_ns(stats->vruntime,
+			       ktime_to_ns(ktime_sub(runtime, prev)) <<
+			       shift[entity->priority]);
+	stats->vruntime = runtime;
+	spin_unlock(&stats->lock);
+
+	return runtime;
+}
+
+static ktime_t drm_sched_entity_get_job_ts(struct drm_sched_entity *entity)
+{
+	return drm_sched_entity_update_vruntime(entity);
 }
 
 /**
@@ -99,8 +227,14 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
 		list_add_tail(&entity->list, &rq->entities);
 	}
 
-	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
+	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
+		ts = drm_sched_rq_get_min_vruntime(rq);
+		ts = drm_sched_entity_restore_vruntime(entity, ts,
+						       rq->head_prio);
+	} else if (drm_sched_policy == DRM_SCHED_POLICY_RR) {
 		ts = entity->rr_ts;
+	}
+
 	drm_sched_rq_update_fifo_locked(entity, rq, ts);
 
 	spin_unlock(&rq->lock);
@@ -173,7 +307,9 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
 	if (next_job) {
 		ktime_t ts;
 
-		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
+			ts = drm_sched_entity_get_job_ts(entity);
+		else if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
 			ts = next_job->submit_ts;
 		else
 			ts = drm_sched_rq_get_rr_ts(rq, entity);
@@ -181,6 +317,13 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
 		drm_sched_rq_update_fifo_locked(entity, rq, ts);
 	} else {
 		drm_sched_rq_remove_fifo_locked(entity, rq);
+
+		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
+			ktime_t min_vruntime;
+
+			min_vruntime = drm_sched_rq_get_min_vruntime(rq);
+			drm_sched_entity_save_vruntime(entity, min_vruntime);
+		}
 	}
 	spin_unlock(&rq->lock);
 	spin_unlock(&entity->lock);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f33c78473867..327b75a052c7 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -150,6 +150,11 @@ struct drm_sched_entity {
 	 */
 	enum drm_sched_priority         priority;
 
+	/**
+	 * @rq_priority: Run-queue priority
+	 */
+	enum drm_sched_priority         rq_priority;
+
 	/**
 	 * @rr_ts:
 	 *
@@ -254,10 +259,11 @@ struct drm_sched_entity {
  * struct drm_sched_rq - queue of entities to be scheduled.
  *
  * @sched: the scheduler to which this rq belongs to.
- * @lock: protects @entities, @rb_tree_root and @rr_ts.
+ * @lock: protects @entities, @rb_tree_root, @rr_ts and @head_prio.
  * @rr_ts: monotonically incrementing fake timestamp for RR mode
  * @entities: list of the entities to be scheduled.
  * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
+ * @head_prio: priority of the top tree element
  *
  * Run queue is a set of entities scheduling command submissions for
  * one specific ring. It implements the scheduling policy that selects
@@ -271,6 +277,7 @@ struct drm_sched_rq {
 	ktime_t				rr_ts;
 	struct list_head		entities;
 	struct rb_root_cached		rb_tree_root;
+	enum drm_sched_priority		head_prio;
 };
 
 /**
@@ -597,6 +604,7 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	u32                             num_rqs;
+	u32                             num_user_rqs;
 	struct drm_sched_rq             **sched_rq;
 	wait_queue_head_t		job_scheduled;
 	atomic64_t			job_id_count;
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 10/12] drm/sched: Break submission patterns with some randomness
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (8 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 09/12] drm/sched: Add fair scheduling policy Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue Tvrtko Ursulin
  2025-09-03 10:18 ` [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler Tvrtko Ursulin
  11 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner,
	Pierre-Eric Pelloux-Prayer

GPUs generally don't implement preemption and the DRM scheduler
definitely does not support it at the front end scheduling level. This
means execution quanta can be quite long and are controlled by userspace,
the consequence of which is that picking the "wrong" entity to run can
have a larger negative effect than it would with a virtual runtime based
CPU scheduler.

Another important consideration is that rendering clients often have
shallow submission queues, meaning they will be entering and exiting the
scheduler's runnable queue often.

The relevant scenario here is what happens when an entity re-joins the
runnable queue with other entities already present. One cornerstone of
the virtual runtime algorithm is to let it re-join at the head and depend
on the virtual runtime accounting to sort out the order after an
execution quantum or two.

However, as explained above, this may not work fully reliably in the GPU
world. An entity could always end up overtaking the existing entities, or
never, depending on the submission order and rbtree equal key insertion
behaviour.

We can break this latching by adding some randomness for this specific
corner case.

If an entity is re-joining the runnable queue, was head of the queue the
last time it got picked, and there is an already queued different entity
of an equal scheduling priority, we can break the tie by randomly choosing
the execution order between the two.

For randomness we implement a simple driver global boolean which selects
whether the new entity will go first or not. Because the boolean is
global and shared between all the run queues and entities, its actual
effect can be loosely called random, under the assumption that it will
not always be the same entity re-joining the queue under these
circumstances.

Another way to look at this is that it is adding a little bit of limited
random round-robin behaviour to the fair scheduling algorithm.
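
A minimal sketch of the mechanism (simplified from the patch, with a
hypothetical helper name for illustration):

	static ktime_t tie_break(ktime_t min_vruntime)
	{
		static const int shuffle[2] = { -1, 1 };
		static bool r;
		ktime_t vruntime = shuffle[r]; /* +-1 ns nudge */

		r ^= 1;

		return ktime_add(min_vruntime, vruntime);
	}

Two entities of equal priority re-joining against the same queue head
will therefore alternate between being placed just before and just after
it.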

The net effect is a significant improvement in the scheduling unit tests
which check the scheduling quality for an interactive client running in
parallel with GPU hogs.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
---
 drivers/gpu/drm/scheduler/sched_rq.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index b0cf7d2143c8..c5be8a2c893d 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -150,6 +150,16 @@ drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
 			 * Higher priority can go first.
 			 */
 			vruntime = -us_to_ktime(rq_prio - prio);
+		} else {
+			static const int shuffle[2] = { -1, 1 };
+			static bool r = 0;
+
+			/*
+			 * For equal priority apply some randomness to break
+			 * latching caused by submission patterns.
+			 */
+			vruntime = shuffle[r];
+			r ^= 1;
 		}
 	}
 
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (9 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 10/12] drm/sched: Break submission patterns with some randomness Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-24 11:50   ` Philipp Stanner
  2025-09-03 10:18 ` [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler Tvrtko Ursulin
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

If the new fair policy is at least as good as FIFO, and we can afford to
remove round-robin, we can simplify the scheduler code by making the
scheduler to run queue relationship always 1:1 and removing some code.

Also, now that the FIFO policy is gone, the tree of entities is not a
FIFO tree any more, so rename it to just the tree.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  23 ++-
 drivers/gpu/drm/scheduler/sched_entity.c   |  29 +---
 drivers/gpu/drm/scheduler/sched_internal.h |   9 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 163 ++++++---------------
 drivers/gpu/drm/scheduler/sched_rq.c       |  60 ++------
 include/drm/gpu_scheduler.h                |  30 +---
 6 files changed, 78 insertions(+), 236 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index d020a890a0ea..bc07fd57310c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -434,25 +434,22 @@ drm_sched_entity_queue_pop(struct drm_sched_entity *entity)
 
 void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
 {
+	struct drm_sched_rq *rq = sched->rq;
+	struct drm_sched_entity *s_entity;
 	struct drm_sched_job *s_job;
-	struct drm_sched_entity *s_entity = NULL;
-	int i;
 
 	/* Signal all jobs not yet scheduled */
-	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
-		struct drm_sched_rq *rq = sched->sched_rq[i];
-		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list) {
-			while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
-				struct drm_sched_fence *s_fence = s_job->s_fence;
+	spin_lock(&rq->lock);
+	list_for_each_entry(s_entity, &rq->entities, list) {
+		while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
+			struct drm_sched_fence *s_fence = s_job->s_fence;
 
-				dma_fence_signal(&s_fence->scheduled);
-				dma_fence_set_error(&s_fence->finished, -EHWPOISON);
-				dma_fence_signal(&s_fence->finished);
-			}
+			dma_fence_signal(&s_fence->scheduled);
+			dma_fence_set_error(&s_fence->finished, -EHWPOISON);
+			dma_fence_signal(&s_fence->finished);
 		}
-		spin_unlock(&rq->lock);
 	}
+	spin_unlock(&rq->lock);
 
 	/* Signal all jobs already scheduled to HW */
 	list_for_each_entry(s_job, &sched->pending_list, list) {
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 58f51875547a..6dd30b85925b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -108,8 +108,6 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	entity->guilty = guilty;
 	entity->num_sched_list = num_sched_list;
 	entity->priority = priority;
-	entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
-			      DRM_SCHED_PRIORITY_KERNEL : priority;
 	/*
 	 * It's perfectly valid to initialize an entity without having a valid
 	 * scheduler attached. It's just not valid to use the scheduler before it
@@ -119,30 +117,14 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	RCU_INIT_POINTER(entity->last_scheduled, NULL);
 	RB_CLEAR_NODE(&entity->rb_tree_node);
 
-	if (num_sched_list && !sched_list[0]->sched_rq) {
+	if (num_sched_list && !sched_list[0]->rq) {
 		/* Since every entry covered by num_sched_list
 		 * should be non-NULL and therefore we warn drivers
 		 * not to do this and to fix their DRM calling order.
 		 */
 		pr_warn("%s: called with uninitialized scheduler\n", __func__);
 	} else if (num_sched_list) {
-		enum drm_sched_priority p = entity->priority;
-
-		/*
-		 * The "priority" of an entity cannot exceed the number of
-		 * run-queues of a scheduler. Protect against num_rqs being 0,
-		 * by converting to signed. Choose the lowest priority
-		 * available.
-		 */
-		if (p >= sched_list[0]->num_user_rqs) {
-			dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
-				p, sched_list[0]->num_user_rqs);
-			p = max_t(s32,
-				 (s32)sched_list[0]->num_user_rqs - 1,
-				 (s32)DRM_SCHED_PRIORITY_KERNEL);
-			entity->priority = p;
-		}
-		entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
+		entity->rq = sched_list[0]->rq;
 	}
 
 	init_completion(&entity->entity_idle);
@@ -575,7 +557,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 
 	spin_lock(&entity->lock);
 	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
-	rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
+	rq = sched ? sched->rq : NULL;
 	if (rq != entity->rq) {
 		drm_sched_rq_remove_entity(entity->rq, entity);
 		entity->rq = rq;
@@ -599,7 +581,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 {
 	struct drm_sched_entity *entity = sched_job->entity;
 	bool first;
-	ktime_t submit_ts;
 
 	trace_drm_sched_job_queue(sched_job, entity);
 
@@ -616,16 +597,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 	/*
 	 * After the sched_job is pushed into the entity queue, it may be
 	 * completed and freed up at any time. We can no longer access it.
-	 * Make sure to set the submit_ts first, to avoid a race.
 	 */
-	sched_job->submit_ts = submit_ts = ktime_get();
 	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
 
 	/* first job wakes up scheduler */
 	if (first) {
 		struct drm_gpu_scheduler *sched;
 
-		sched = drm_sched_rq_add_entity(entity, submit_ts);
+		sched = drm_sched_rq_add_entity(entity);
 		if (sched)
 			drm_sched_wakeup(sched);
 	}
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index 125aba70eda6..6e5ed721bb5b 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -22,13 +22,6 @@ struct drm_sched_entity_stats {
 	u64		vruntime;
 };
 
-/* Used to choose between FIFO and RR job-scheduling */
-extern int drm_sched_policy;
-
-#define DRM_SCHED_POLICY_RR	0
-#define DRM_SCHED_POLICY_FIFO	1
-#define DRM_SCHED_POLICY_FAIR	2
-
 bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
 			 struct drm_sched_entity *entity);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
@@ -39,7 +32,7 @@ struct drm_sched_entity *
 drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
 			   struct drm_sched_rq *rq);
 struct drm_gpu_scheduler *
-drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
+drm_sched_rq_add_entity(struct drm_sched_entity *entity);
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 				struct drm_sched_entity *entity);
 void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e7726095c19a..a9079bdb27d3 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -84,15 +84,6 @@
 #define CREATE_TRACE_POINTS
 #include "gpu_scheduler_trace.h"
 
-int drm_sched_policy = DRM_SCHED_POLICY_FAIR;
-
-/**
- * DOC: sched_policy (int)
- * Used to override default entities scheduling policy in a run queue.
- */
-MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
-module_param_named(sched_policy, drm_sched_policy, int, 0444);
-
 static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
 {
 	u32 credits;
@@ -876,34 +867,6 @@ void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 	drm_sched_run_job_queue(sched);
 }
 
-/**
- * drm_sched_select_entity - Select next entity to process
- *
- * @sched: scheduler instance
- *
- * Return an entity to process or NULL if none are found.
- *
- * Note, that we break out of the for-loop when "entity" is non-null, which can
- * also be an error-pointer--this assures we don't process lower priority
- * run-queues. See comments in the respectively called functions.
- */
-static struct drm_sched_entity *
-drm_sched_select_entity(struct drm_gpu_scheduler *sched)
-{
-	struct drm_sched_entity *entity = NULL;
-	int i;
-
-	/* Start with the highest priority.
-	 */
-	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
-		entity = drm_sched_rq_select_entity(sched, sched->sched_rq[i]);
-		if (entity)
-			break;
-	}
-
-	return IS_ERR(entity) ? NULL : entity;
-}
-
 /**
  * drm_sched_get_finished_job - fetch the next finished job to be destroyed
  *
@@ -1023,8 +986,8 @@ static void drm_sched_run_job_work(struct work_struct *w)
 	int r;
 
 	/* Find entity with a ready job */
-	entity = drm_sched_select_entity(sched);
-	if (!entity)
+	entity = drm_sched_rq_select_entity(sched, sched->rq);
+	if (IS_ERR_OR_NULL(entity))
 		return;	/* No more work */
 
 	sched_job = drm_sched_entity_pop_job(entity);
@@ -1095,8 +1058,6 @@ static struct workqueue_struct *drm_sched_alloc_wq(const char *name)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_args *args)
 {
-	int i;
-
 	sched->ops = args->ops;
 	sched->credit_limit = args->credit_limit;
 	sched->name = args->name;
@@ -1106,13 +1067,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	sched->score = args->score ? args->score : &sched->_score;
 	sched->dev = args->dev;
 
-	if (args->num_rqs > DRM_SCHED_PRIORITY_COUNT) {
-		/* This is a gross violation--tell drivers what the  problem is.
-		 */
-		dev_err(sched->dev, "%s: num_rqs cannot be greater than DRM_SCHED_PRIORITY_COUNT\n",
-			__func__);
-		return -EINVAL;
-	} else if (sched->sched_rq) {
+	if (sched->rq) {
 		/* Not an error, but warn anyway so drivers can
 		 * fine-tune their DRM calling order, and return all
 		 * is good.
@@ -1132,21 +1087,11 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 		sched->own_submit_wq = true;
 	}
 
-	sched->num_user_rqs = args->num_rqs;
-	sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
-			 args->num_rqs : 1;
-	sched->sched_rq = kmalloc_array(sched->num_rqs,
-					sizeof(*sched->sched_rq),
-					GFP_KERNEL | __GFP_ZERO);
-	if (!sched->sched_rq)
+	sched->rq = kmalloc(sizeof(*sched->rq), GFP_KERNEL | __GFP_ZERO);
+	if (!sched->rq)
 		goto Out_check_own;
 
-	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
-		sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
-		if (!sched->sched_rq[i])
-			goto Out_unroll;
-		drm_sched_rq_init(sched, sched->sched_rq[i]);
-	}
+	drm_sched_rq_init(sched, sched->rq);
 
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
@@ -1161,12 +1106,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 
 	sched->ready = true;
 	return 0;
-Out_unroll:
-	for (--i ; i >= DRM_SCHED_PRIORITY_KERNEL; i--)
-		kfree(sched->sched_rq[i]);
 
-	kfree(sched->sched_rq);
-	sched->sched_rq = NULL;
 Out_check_own:
 	if (sched->own_submit_wq)
 		destroy_workqueue(sched->submit_wq);
@@ -1202,41 +1142,35 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
+
+	struct drm_sched_rq *rq = sched->rq;
 	struct drm_sched_entity *s_entity;
-	int i;
 
 	drm_sched_wqueue_stop(sched);
 
-	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
-		struct drm_sched_rq *rq = sched->sched_rq[i];
-
-		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
-			/*
-			 * Prevents reinsertion and marks job_queue as idle,
-			 * it will be removed from the rq in drm_sched_entity_fini()
-			 * eventually
-			 *
-			 * FIXME:
-			 * This lacks the proper spin_lock(&s_entity->lock) and
-			 * is, therefore, a race condition. Most notably, it
-			 * can race with drm_sched_entity_push_job(). The lock
-			 * cannot be taken here, however, because this would
-			 * lead to lock inversion -> deadlock.
-			 *
-			 * The best solution probably is to enforce the life
-			 * time rule of all entities having to be torn down
-			 * before their scheduler. Then, however, locking could
-			 * be dropped alltogether from this function.
-			 *
-			 * For now, this remains a potential race in all
-			 * drivers that keep entities alive for longer than
-			 * the scheduler.
-			 */
-			s_entity->stopped = true;
-		spin_unlock(&rq->lock);
-		kfree(sched->sched_rq[i]);
-	}
+	spin_lock(&rq->lock);
+	list_for_each_entry(s_entity, &rq->entities, list)
+		/*
+		 * Prevents re-insertion and marks job_queue as idle,
+		 * it will be removed from the rq in drm_sched_entity_fini()
+		 * eventually.
+		 *
+		 * FIXME:
+		 * This lacks the proper spin_lock(&s_entity->lock) and is,
+		 * therefore, a race condition. Most notably, it can race with
+		 * drm_sched_entity_push_job(). The lock cannot be taken here,
+		 * however, because this would lead to lock inversion.
+		 *
+		 * The best solution probably is to enforce the life time rule
+		 * of all entities having to be torn down before their
+		 * scheduler. Then locking could be dropped altogether from this
+		 * function.
+		 *
+		 * For now, this remains a potential race in all drivers that
+		 * keep entities alive for longer than the scheduler.
+		 */
+		s_entity->stopped = true;
+	spin_unlock(&rq->lock);
 
 	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
 	wake_up_all(&sched->job_scheduled);
@@ -1251,8 +1185,8 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	if (sched->own_submit_wq)
 		destroy_workqueue(sched->submit_wq);
 	sched->ready = false;
-	kfree(sched->sched_rq);
-	sched->sched_rq = NULL;
+	kfree(sched->rq);
+	sched->rq = NULL;
 
 	if (!list_empty(&sched->pending_list))
 		dev_warn(sched->dev, "Tearing down scheduler while jobs are pending!\n");
@@ -1270,35 +1204,28 @@ EXPORT_SYMBOL(drm_sched_fini);
  */
 void drm_sched_increase_karma(struct drm_sched_job *bad)
 {
-	int i;
-	struct drm_sched_entity *tmp;
-	struct drm_sched_entity *entity;
 	struct drm_gpu_scheduler *sched = bad->sched;
+	struct drm_sched_entity *entity, *tmp;
+	struct drm_sched_rq *rq = sched->rq;
 
 	/* don't change @bad's karma if it's from KERNEL RQ,
 	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
 	 * corrupt but keep in mind that kernel jobs always considered good.
 	 */
-	if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
-		atomic_inc(&bad->karma);
+	if (bad->s_priority == DRM_SCHED_PRIORITY_KERNEL)
+		return;
 
-		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
-			struct drm_sched_rq *rq = sched->sched_rq[i];
+	atomic_inc(&bad->karma);
 
-			spin_lock(&rq->lock);
-			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
-				if (bad->s_fence->scheduled.context ==
-				    entity->fence_context) {
-					if (entity->guilty)
-						atomic_set(entity->guilty, 1);
-					break;
-				}
-			}
-			spin_unlock(&rq->lock);
-			if (&entity->list != &rq->entities)
-				break;
+	spin_lock(&rq->lock);
+	list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
+		if (bad->s_fence->scheduled.context == entity->fence_context) {
+			if (entity->guilty)
+				atomic_set(entity->guilty, 1);
+			break;
 		}
 	}
+	spin_unlock(&rq->lock);
 }
 EXPORT_SYMBOL(drm_sched_increase_karma);
 
diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index c5be8a2c893d..a9bc105221bf 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -34,7 +34,7 @@ static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
 	rq->head_prio = prio;
 }
 
-static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
+static void drm_sched_rq_remove_tree_locked(struct drm_sched_entity *entity,
 					    struct drm_sched_rq *rq)
 {
 	lockdep_assert_held(&entity->lock);
@@ -47,7 +47,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
 	}
 }
 
-static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
+static void drm_sched_rq_update_tree_locked(struct drm_sched_entity *entity,
 					    struct drm_sched_rq *rq,
 					    ktime_t ts)
 {
@@ -59,7 +59,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
 	lockdep_assert_held(&entity->lock);
 	lockdep_assert_held(&rq->lock);
 
-	drm_sched_rq_remove_fifo_locked(entity, rq);
+	drm_sched_rq_remove_tree_locked(entity, rq);
 
 	entity->oldest_job_waiting = ts;
 
@@ -207,17 +207,17 @@ static ktime_t drm_sched_entity_get_job_ts(struct drm_sched_entity *entity)
  * drm_sched_rq_add_entity - add an entity
  *
  * @entity: scheduler entity
- * @ts: submission timestamp
  *
  * Adds a scheduler entity to the run queue.
  *
  * Returns a DRM scheduler pre-selected to handle this entity.
  */
 struct drm_gpu_scheduler *
-drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
+drm_sched_rq_add_entity(struct drm_sched_entity *entity)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_rq *rq;
+	ktime_t ts;
 
 	/* Add the entity to the run queue */
 	spin_lock(&entity->lock);
@@ -237,15 +237,9 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
 		list_add_tail(&entity->list, &rq->entities);
 	}
 
-	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
-		ts = drm_sched_rq_get_min_vruntime(rq);
-		ts = drm_sched_entity_restore_vruntime(entity, ts,
-						       rq->head_prio);
-	} else if (drm_sched_policy == DRM_SCHED_POLICY_RR) {
-		ts = entity->rr_ts;
-	}
-
-	drm_sched_rq_update_fifo_locked(entity, rq, ts);
+	ts = drm_sched_rq_get_min_vruntime(rq);
+	ts = drm_sched_entity_restore_vruntime(entity, ts, rq->head_prio);
+	drm_sched_rq_update_tree_locked(entity, rq, ts);
 
 	spin_unlock(&rq->lock);
 	spin_unlock(&entity->lock);
@@ -274,26 +268,11 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 	atomic_dec(rq->sched->score);
 	list_del_init(&entity->list);
 
-	drm_sched_rq_remove_fifo_locked(entity, rq);
+	drm_sched_rq_remove_tree_locked(entity, rq);
 
 	spin_unlock(&rq->lock);
 }
 
-static ktime_t
-drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
-{
-	ktime_t ts;
-
-	lockdep_assert_held(&entity->lock);
-	lockdep_assert_held(&rq->lock);
-
-	ts = ktime_add_ns(rq->rr_ts, 1);
-	entity->rr_ts = ts;
-	rq->rr_ts = ts;
-
-	return ts;
-}
-
 /**
  * drm_sched_rq_pop_entity - pops an entity
  *
@@ -317,23 +296,14 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
 	if (next_job) {
 		ktime_t ts;
 
-		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
-			ts = drm_sched_entity_get_job_ts(entity);
-		else if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-			ts = next_job->submit_ts;
-		else
-			ts = drm_sched_rq_get_rr_ts(rq, entity);
-
-		drm_sched_rq_update_fifo_locked(entity, rq, ts);
+		ts = drm_sched_entity_get_job_ts(entity);
+		drm_sched_rq_update_tree_locked(entity, rq, ts);
 	} else {
-		drm_sched_rq_remove_fifo_locked(entity, rq);
+		ktime_t min_vruntime;
 
-		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
-			ktime_t min_vruntime;
-
-			min_vruntime = drm_sched_rq_get_min_vruntime(rq);
-			drm_sched_entity_save_vruntime(entity, min_vruntime);
-		}
+		drm_sched_rq_remove_tree_locked(entity, rq);
+		min_vruntime = drm_sched_rq_get_min_vruntime(rq);
+		drm_sched_entity_save_vruntime(entity, min_vruntime);
 	}
 	spin_unlock(&rq->lock);
 	spin_unlock(&entity->lock);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 327b75a052c7..c9b75a05d05c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -96,8 +96,7 @@ struct drm_sched_entity {
 	 * @lock:
 	 *
 	 * Lock protecting the run-queue (@rq) to which this entity belongs,
-	 * @priority, the list of schedulers (@sched_list, @num_sched_list) and
-	 * the @rr_ts field.
+	 * @priority and the list of schedulers (@sched_list, @num_sched_list).
 	 */
 	spinlock_t			lock;
 
@@ -150,18 +149,6 @@ struct drm_sched_entity {
 	 */
 	enum drm_sched_priority         priority;
 
-	/**
-	 * @rq_priority: Run-queue priority
-	 */
-	enum drm_sched_priority         rq_priority;
-
-	/**
-	 * @rr_ts:
-	 *
-	 * Fake timestamp of the last popped job from the entity.
-	 */
-	ktime_t				rr_ts;
-
 	/**
 	 * @job_queue: the list of jobs of this entity.
 	 */
@@ -259,8 +246,7 @@ struct drm_sched_entity {
  * struct drm_sched_rq - queue of entities to be scheduled.
  *
  * @sched: the scheduler to which this rq belongs to.
- * @lock: protects @entities, @rb_tree_root, @rr_ts and @head_prio.
- * @rr_ts: monotonically incrementing fake timestamp for RR mode
+ * @lock: protects @entities, @rb_tree_root and @head_prio.
  * @entities: list of the entities to be scheduled.
  * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
  * @head_prio: priority of the top tree element
@@ -274,7 +260,6 @@ struct drm_sched_rq {
 
 	spinlock_t			lock;
 	/* Following members are protected by the @lock: */
-	ktime_t				rr_ts;
 	struct list_head		entities;
 	struct rb_root_cached		rb_tree_root;
 	enum drm_sched_priority		head_prio;
@@ -360,13 +345,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * to schedule the job.
  */
 struct drm_sched_job {
-	/**
-	 * @submit_ts:
-	 *
-	 * When the job was pushed into the entity queue.
-	 */
-	ktime_t                         submit_ts;
-
 	/**
 	 * @sched:
 	 *
@@ -603,9 +581,7 @@ struct drm_gpu_scheduler {
 	atomic_t			credit_count;
 	long				timeout;
 	const char			*name;
-	u32                             num_rqs;
-	u32                             num_user_rqs;
-	struct drm_sched_rq             **sched_rq;
+	struct drm_sched_rq             *rq;
 	wait_queue_head_t		job_scheduled;
 	atomic64_t			job_id_count;
 	struct workqueue_struct		*submit_wq;
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler
  2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
                   ` (10 preceding siblings ...)
  2025-09-03 10:18 ` [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue Tvrtko Ursulin
@ 2025-09-03 10:18 ` Tvrtko Ursulin
  2025-09-24 12:01   ` Philipp Stanner
  11 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-03 10:18 UTC (permalink / raw)
  To: dri-devel
  Cc: amd-gfx, kernel-dev, Tvrtko Ursulin, Christian König,
	Danilo Krummrich, Matthew Brost, Philipp Stanner

Now that the run queue to scheduler relationship is always 1:1, we can
embed the run queue directly in the scheduler struct and save on some
allocation and error handling code.
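
A minimal sketch of the resulting pattern (illustration only, not taken
verbatim from the patch; rq_to_sched() is a hypothetical helper, the
patch open-codes container_of() at each call site):

	struct drm_gpu_scheduler {
		/* ... */
		struct drm_sched_rq	rq;	/* embedded, no longer a pointer */
	};

	static inline struct drm_gpu_scheduler *
	rq_to_sched(struct drm_sched_rq *rq)
	{
		/* Recover the owning scheduler from the embedded member. */
		return container_of(rq, struct drm_gpu_scheduler, rq);
	}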

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <phasta@kernel.org>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      |  6 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  6 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |  5 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h   |  8 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  8 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     |  8 +++---
 drivers/gpu/drm/scheduler/sched_entity.c    | 32 +++++++++------------
 drivers/gpu/drm/scheduler/sched_fence.c     |  2 +-
 drivers/gpu/drm/scheduler/sched_internal.h  |  6 ++--
 drivers/gpu/drm/scheduler/sched_main.c      | 31 ++++----------------
 drivers/gpu/drm/scheduler/sched_rq.c        | 18 ++++++------
 include/drm/gpu_scheduler.h                 |  4 +--
 12 files changed, 58 insertions(+), 76 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2ac9729e4c86..467e09e5567b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1131,7 +1131,8 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 	if (p->gang_size > 1 && !adev->vm_manager.concurrent_flush) {
 		for (i = 0; i < p->gang_size; ++i) {
 			struct drm_sched_entity *entity = p->entities[i];
-			struct drm_gpu_scheduler *sched = entity->rq->sched;
+			struct drm_gpu_scheduler *sched =
+				container_of(entity->rq, typeof(*sched), rq);
 			struct amdgpu_ring *ring = to_amdgpu_ring(sched);
 
 			if (amdgpu_vmid_uses_reserved(vm, ring->vm_hub))
@@ -1262,7 +1263,8 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
 			return r;
 	}
 
-	sched = p->gang_leader->base.entity->rq->sched;
+	sched = container_of(p->gang_leader->base.entity->rq, typeof(*sched),
+			     rq);
 	while ((fence = amdgpu_sync_get_fence(&p->sync))) {
 		struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index bc07fd57310c..cdfaf3eb736d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -341,7 +341,9 @@ static struct dma_fence *
 amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
 		      struct drm_sched_entity *s_entity)
 {
-	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
+	struct drm_gpu_scheduler *sched =
+		container_of(s_entity->rq, typeof(*sched), rq);
+	struct amdgpu_ring *ring = to_amdgpu_ring(sched);
 	struct amdgpu_job *job = to_amdgpu_job(sched_job);
 	struct dma_fence *fence;
 	int r;
@@ -434,7 +436,7 @@ drm_sched_entity_queue_pop(struct drm_sched_entity *entity)
 
 void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
 {
-	struct drm_sched_rq *rq = sched->rq;
+	struct drm_sched_rq *rq = &sched->rq;
 	struct drm_sched_entity *s_entity;
 	struct drm_sched_job *s_job;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index 4a6487eb6cb5..9530b5da3adc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -102,7 +102,10 @@ struct amdgpu_job {
 
 static inline struct amdgpu_ring *amdgpu_job_ring(struct amdgpu_job *job)
 {
-	return to_amdgpu_ring(job->base.entity->rq->sched);
+	struct drm_gpu_scheduler *sched =
+		container_of(job->base.entity->rq, typeof(*sched), rq);
+
+	return to_amdgpu_ring(sched);
 }
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index d13e64a69e25..85724ec6aaf8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -145,6 +145,7 @@ TRACE_EVENT(amdgpu_cs,
 		     struct amdgpu_ib *ib),
 	    TP_ARGS(p, job, ib),
 	    TP_STRUCT__entry(
+			     __field(struct drm_gpu_scheduler *, sched)
 			     __field(struct amdgpu_bo_list *, bo_list)
 			     __field(u32, ring)
 			     __field(u32, dw)
@@ -152,11 +153,14 @@ TRACE_EVENT(amdgpu_cs,
 			     ),
 
 	    TP_fast_assign(
+			   __entry->sched = container_of(job->base.entity->rq,
+							 typeof(*__entry->sched),
+							 rq);
 			   __entry->bo_list = p->bo_list;
-			   __entry->ring = to_amdgpu_ring(job->base.entity->rq->sched)->idx;
+			   __entry->ring = to_amdgpu_ring(__entry->sched)->idx;
 			   __entry->dw = ib->length_dw;
 			   __entry->fences = amdgpu_fence_count_emitted(
-				to_amdgpu_ring(job->base.entity->rq->sched));
+				to_amdgpu_ring(__entry->sched));
 			   ),
 	    TP_printk("bo_list=%p, ring=%u, dw=%u, fences=%u",
 		      __entry->bo_list, __entry->ring, __entry->dw,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 36805dcfa159..4ccd2e769799 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -106,13 +106,13 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
 static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 				 struct dma_fence **fence)
 {
+	struct drm_gpu_scheduler *sched =
+		container_of(p->vm->delayed.rq, typeof(*sched), rq);
+	struct amdgpu_ring *ring =
+		container_of(sched, struct amdgpu_ring, sched);
 	struct amdgpu_ib *ib = p->job->ibs;
-	struct amdgpu_ring *ring;
 	struct dma_fence *f;
 
-	ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
-			    sched);
-
 	WARN_ON(ib->length_dw == 0);
 	amdgpu_ring_pad_ib(ring, ib);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index 1083db8cea2e..be17635ac039 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -465,15 +465,15 @@ int amdgpu_xcp_open_device(struct amdgpu_device *adev,
 void amdgpu_xcp_release_sched(struct amdgpu_device *adev,
 				  struct amdgpu_ctx_entity *entity)
 {
-	struct drm_gpu_scheduler *sched;
-	struct amdgpu_ring *ring;
+	struct drm_gpu_scheduler *sched =
+		container_of(entity->entity.rq, typeof(*sched), rq);
 
 	if (!adev->xcp_mgr)
 		return;
 
-	sched = entity->entity.rq->sched;
 	if (drm_sched_wqueue_ready(sched)) {
-		ring = to_amdgpu_ring(entity->entity.rq->sched);
+		struct amdgpu_ring *ring = to_amdgpu_ring(sched);
+
 		atomic_dec(&adev->xcp_mgr->xcp[ring->xcp_id].ref_cnt);
 	}
 }
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 6dd30b85925b..ba290143c95d 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -114,19 +114,12 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	 * is initialized itself.
 	 */
 	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
+	if (num_sched_list) {
+		entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
+		entity->rq = &sched_list[0]->rq;
+	}
 	RCU_INIT_POINTER(entity->last_scheduled, NULL);
 	RB_CLEAR_NODE(&entity->rb_tree_node);
-
-	if (num_sched_list && !sched_list[0]->rq) {
-		/* Since every entry covered by num_sched_list
-		 * should be non-NULL and therefore we warn drivers
-		 * not to do this and to fix their DRM calling order.
-		 */
-		pr_warn("%s: called with uninitialized scheduler\n", __func__);
-	} else if (num_sched_list) {
-		entity->rq = sched_list[0]->rq;
-	}
-
 	init_completion(&entity->entity_idle);
 
 	/* We start in an idle state. */
@@ -312,7 +305,7 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 	if (!entity->rq)
 		return 0;
 
-	sched = entity->rq->sched;
+	sched = container_of(entity->rq, typeof(*sched), rq);
 	/*
 	 * The client will not queue more jobs during this fini - consume
 	 * existing queued ones, or discard them on SIGKILL.
@@ -393,10 +386,12 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
 {
 	struct drm_sched_entity *entity =
 		container_of(cb, struct drm_sched_entity, cb);
+	struct drm_gpu_scheduler *sched =
+		container_of(entity->rq, typeof(*sched), rq);
 
 	entity->dependency = NULL;
 	dma_fence_put(f);
-	drm_sched_wakeup(entity->rq->sched);
+	drm_sched_wakeup(sched);
 }
 
 /**
@@ -423,7 +418,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
 static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
 					       struct drm_sched_job *sched_job)
 {
-	struct drm_gpu_scheduler *sched = entity->rq->sched;
+	struct drm_gpu_scheduler *sched =
+		container_of(entity->rq, typeof(*sched), rq);
 	struct dma_fence *fence = entity->dependency;
 	struct drm_sched_fence *s_fence;
 
@@ -557,7 +553,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 
 	spin_lock(&entity->lock);
 	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
-	rq = sched ? sched->rq : NULL;
+	rq = sched ? &sched->rq : NULL;
 	if (rq != entity->rq) {
 		drm_sched_rq_remove_entity(entity->rq, entity);
 		entity->rq = rq;
@@ -580,6 +576,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 {
 	struct drm_sched_entity *entity = sched_job->entity;
+	struct drm_gpu_scheduler *sched =
+		container_of(entity->rq, typeof(*sched), rq);
 	bool first;
 
 	trace_drm_sched_job_queue(sched_job, entity);
@@ -591,7 +589,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		xa_for_each(&sched_job->dependencies, index, entry)
 			trace_drm_sched_job_add_dep(sched_job, entry);
 	}
-	atomic_inc(entity->rq->sched->score);
+	atomic_inc(sched->score);
 	WRITE_ONCE(entity->last_user, current->group_leader);
 
 	/*
@@ -602,8 +600,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 
 	/* first job wakes up scheduler */
 	if (first) {
-		struct drm_gpu_scheduler *sched;
-
 		sched = drm_sched_rq_add_entity(entity);
 		if (sched)
 			drm_sched_wakeup(sched);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index 9391d6f0dc01..da4f53a9ca35 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -227,7 +227,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
 {
 	unsigned seq;
 
-	fence->sched = entity->rq->sched;
+	fence->sched = container_of(entity->rq, typeof(*fence->sched), rq);
 	seq = atomic_inc_return(&entity->fence_seq);
 	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
 		       &fence->lock, entity->fence_context, seq);
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index 6e5ed721bb5b..409c9ab7ce8f 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -26,11 +26,9 @@ bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
 			 struct drm_sched_entity *entity);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 
-void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
-		       struct drm_sched_rq *rq);
+void drm_sched_rq_init(struct drm_gpu_scheduler *sched);
 struct drm_sched_entity *
-drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
-			   struct drm_sched_rq *rq);
+drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched);
 struct drm_gpu_scheduler *
 drm_sched_rq_add_entity(struct drm_sched_entity *entity);
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a9079bdb27d3..fe773154cdd4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -645,7 +645,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
 
 	BUG_ON(!entity);
 	drm_sched_entity_select_rq(entity);
-	sched = entity->rq->sched;
+	sched = container_of(entity->rq, typeof(*sched), rq);
 
 	job->sched = sched;
 	job->s_priority = entity->priority;
@@ -986,7 +986,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
 	int r;
 
 	/* Find entity with a ready job */
-	entity = drm_sched_rq_select_entity(sched, sched->rq);
+	entity = drm_sched_rq_select_entity(sched);
 	if (IS_ERR_OR_NULL(entity))
 		return;	/* No more work */
 
@@ -1067,15 +1067,6 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	sched->score = args->score ? args->score : &sched->_score;
 	sched->dev = args->dev;
 
-	if (sched->rq) {
-		/* Not an error, but warn anyway so drivers can
-		 * fine-tune their DRM calling order, and return all
-		 * is good.
-		 */
-		dev_warn(sched->dev, "%s: scheduler already initialized!\n", __func__);
-		return 0;
-	}
-
 	if (args->submit_wq) {
 		sched->submit_wq = args->submit_wq;
 		sched->own_submit_wq = false;
@@ -1087,11 +1078,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 		sched->own_submit_wq = true;
 	}
 
-	sched->rq = kmalloc(sizeof(*sched->rq), GFP_KERNEL | __GFP_ZERO);
-	if (!sched->rq)
-		goto Out_check_own;
-
-	drm_sched_rq_init(sched, sched->rq);
+	drm_sched_rq_init(sched);
 
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
@@ -1106,12 +1093,6 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 
 	sched->ready = true;
 	return 0;
-
-Out_check_own:
-	if (sched->own_submit_wq)
-		destroy_workqueue(sched->submit_wq);
-	dev_err(sched->dev, "%s: Failed to setup GPU scheduler--out of memory\n", __func__);
-	return -ENOMEM;
 }
 EXPORT_SYMBOL(drm_sched_init);
 
@@ -1143,7 +1124,7 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 
-	struct drm_sched_rq *rq = sched->rq;
+	struct drm_sched_rq *rq = &sched->rq;
 	struct drm_sched_entity *s_entity;
 
 	drm_sched_wqueue_stop(sched);
@@ -1185,8 +1166,6 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	if (sched->own_submit_wq)
 		destroy_workqueue(sched->submit_wq);
 	sched->ready = false;
-	kfree(sched->rq);
-	sched->rq = NULL;
 
 	if (!list_empty(&sched->pending_list))
 		dev_warn(sched->dev, "Tearing down scheduler while jobs are pending!\n");
@@ -1206,7 +1185,7 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
 {
 	struct drm_gpu_scheduler *sched = bad->sched;
 	struct drm_sched_entity *entity, *tmp;
-	struct drm_sched_rq *rq = sched->rq;
+	struct drm_sched_rq *rq = &sched->rq;
 
 	/* don't change @bad's karma if it's from KERNEL RQ,
 	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index a9bc105221bf..6088434a4ea4 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -72,17 +72,16 @@ static void drm_sched_rq_update_tree_locked(struct drm_sched_entity *entity,
  * drm_sched_rq_init - initialize a given run queue struct
  *
  * @sched: scheduler instance to associate with this run queue
- * @rq: scheduler run queue
  *
  * Initializes a scheduler runqueue.
  */
-void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
-		       struct drm_sched_rq *rq)
+void drm_sched_rq_init(struct drm_gpu_scheduler *sched)
 {
+	struct drm_sched_rq *rq = &sched->rq;
+
 	spin_lock_init(&rq->lock);
 	INIT_LIST_HEAD(&rq->entities);
 	rq->rb_tree_root = RB_ROOT_CACHED;
-	rq->sched = sched;
 	rq->head_prio = -1;
 }
 
@@ -229,8 +228,8 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity)
 	}
 
 	rq = entity->rq;
+	sched = container_of(rq, typeof(*sched), rq);
 	spin_lock(&rq->lock);
-	sched = rq->sched;
 
 	if (list_empty(&entity->list)) {
 		atomic_inc(sched->score);
@@ -258,6 +257,8 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity)
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 				struct drm_sched_entity *entity)
 {
+	struct drm_gpu_scheduler *sched = container_of(rq, typeof(*sched), rq);
+
 	lockdep_assert_held(&entity->lock);
 
 	if (list_empty(&entity->list))
@@ -265,7 +266,7 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 
 	spin_lock(&rq->lock);
 
-	atomic_dec(rq->sched->score);
+	atomic_dec(sched->score);
 	list_del_init(&entity->list);
 
 	drm_sched_rq_remove_tree_locked(entity, rq);
@@ -313,7 +314,6 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
  * drm_sched_rq_select_entity - Select an entity which provides a job to run
  *
  * @sched: the gpu scheduler
- * @rq: scheduler run queue to check.
  *
  * Find oldest waiting ready entity.
  *
@@ -322,9 +322,9 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
  * its job; return NULL, if no ready entity was found.
  */
 struct drm_sched_entity *
-drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
-			   struct drm_sched_rq *rq)
+drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched)
 {
+	struct drm_sched_rq *rq = &sched->rq;
 	struct rb_node *rb;
 
 	spin_lock(&rq->lock);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index c9b75a05d05c..7fbcd121a6d3 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -256,8 +256,6 @@ struct drm_sched_entity {
  * the next entity to emit commands from.
  */
 struct drm_sched_rq {
-	struct drm_gpu_scheduler	*sched;
-
 	spinlock_t			lock;
 	/* Following members are protected by the @lock: */
 	struct list_head		entities;
@@ -581,7 +579,7 @@ struct drm_gpu_scheduler {
 	atomic_t			credit_count;
 	long				timeout;
 	const char			*name;
-	struct drm_sched_rq             *rq;
+	struct drm_sched_rq             rq;
 	wait_queue_head_t		job_scheduled;
 	atomic64_t			job_id_count;
 	struct workqueue_struct		*submit_wq;
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC v8 04/12] drm/sched: Consolidate entity run queue management
  2025-09-03 10:18 ` [RFC v8 04/12] drm/sched: Consolidate entity run queue management Tvrtko Ursulin
@ 2025-09-11 14:20   ` Philipp Stanner
  2025-09-11 14:55     ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-11 14:20 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> Move the code dealing with entities entering and exiting run queues to
> helpers to logically separate it from jobs entering and exiting entities.

Sorry if I've asked this before, but does this strictly depend on the
preceding patches or could it be branched out?

P.

> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c   | 64 ++-------------
>  drivers/gpu/drm/scheduler/sched_internal.h |  8 +-
>  drivers/gpu/drm/scheduler/sched_main.c     | 95 +++++++++++++++++++---
>  3 files changed, 91 insertions(+), 76 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 4852006f2308..7a0a52ba87bf 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -456,24 +456,9 @@ drm_sched_job_dependency(struct drm_sched_job *job,
>  	return NULL;
>  }
>  
> -static ktime_t
> -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> -{
> -	ktime_t ts;
> -
> -	lockdep_assert_held(&entity->lock);
> -	lockdep_assert_held(&rq->lock);
> -
> -	ts = ktime_add_ns(rq->rr_ts, 1);
> -	entity->rr_ts = ts;
> -	rq->rr_ts = ts;
> -
> -	return ts;
> -}
> -
>  struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>  {
> -	struct drm_sched_job *sched_job, *next_job;
> +	struct drm_sched_job *sched_job;
>  
>  	sched_job = drm_sched_entity_queue_peek(entity);
>  	if (!sched_job)
> @@ -502,26 +487,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>  
>  	spsc_queue_pop(&entity->job_queue);
>  
> -	/*
> -	 * Update the entity's location in the min heap according to
> -	 * the timestamp of the next job, if any.
> -	 */
> -	next_job = drm_sched_entity_queue_peek(entity);
> -	if (next_job) {
> -		struct drm_sched_rq *rq;
> -		ktime_t ts;
> -
> -		spin_lock(&entity->lock);
> -		rq = entity->rq;
> -		spin_lock(&rq->lock);
> -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> -			ts = next_job->submit_ts;
> -		else
> -			ts = drm_sched_rq_get_rr_ts(rq, entity);
> -		drm_sched_rq_update_fifo_locked(entity, rq, ts);
> -		spin_unlock(&rq->lock);
> -		spin_unlock(&entity->lock);
> -	}
> +	drm_sched_rq_pop_entity(entity);
>  
>  	/* Jobs and entities might have different lifecycles. Since we're
>  	 * removing the job from the entities queue, set the jobs entity pointer
> @@ -611,30 +577,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  	/* first job wakes up scheduler */
>  	if (first) {
>  		struct drm_gpu_scheduler *sched;
> -		struct drm_sched_rq *rq;
>  
> -		/* Add the entity to the run queue */
> -		spin_lock(&entity->lock);
> -		if (entity->stopped) {
> -			spin_unlock(&entity->lock);
> -
> -			DRM_ERROR("Trying to push to a killed entity\n");
> -			return;
> -		}
> -
> -		rq = entity->rq;
> -		sched = rq->sched;
> -
> -		spin_lock(&rq->lock);
> -		drm_sched_rq_add_entity(rq, entity);
> -		if (drm_sched_policy == DRM_SCHED_POLICY_RR)
> -			submit_ts = entity->rr_ts;
> -		drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
> -
> -		spin_unlock(&rq->lock);
> -		spin_unlock(&entity->lock);
> -
> -		drm_sched_wakeup(sched);
> +		sched = drm_sched_rq_add_entity(entity, submit_ts);
> +		if (sched)
> +			drm_sched_wakeup(sched);
>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> index 7ea5a6736f98..8269c5392a82 100644
> --- a/drivers/gpu/drm/scheduler/sched_internal.h
> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> @@ -12,13 +12,11 @@ extern int drm_sched_policy;
>  
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>  
> -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
> -			     struct drm_sched_entity *entity);
> +struct drm_gpu_scheduler *
> +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
>  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  				struct drm_sched_entity *entity);
> -
> -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> -				     struct drm_sched_rq *rq, ktime_t ts);
> +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
>  
>  void drm_sched_entity_select_rq(struct drm_sched_entity *entity);
>  struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 1db0a4aa1d46..c53931e63458 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -151,9 +151,9 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>  	}
>  }
>  
> -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> -				     struct drm_sched_rq *rq,
> -				     ktime_t ts)
> +static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> +					    struct drm_sched_rq *rq,
> +					    ktime_t ts)
>  {
>  	/*
>  	 * Both locks need to be grabbed, one to protect from entity->rq change
> @@ -191,22 +191,45 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>  /**
>   * drm_sched_rq_add_entity - add an entity
>   *
> - * @rq: scheduler run queue
>   * @entity: scheduler entity
> + * @ts: submission timestamp
>   *
>   * Adds a scheduler entity to the run queue.
> + *
> + * Returns a DRM scheduler pre-selected to handle this entity.
>   */
> -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
> -			     struct drm_sched_entity *entity)
> +struct drm_gpu_scheduler *
> +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>  {
> -	lockdep_assert_held(&entity->lock);
> -	lockdep_assert_held(&rq->lock);
> +	struct drm_gpu_scheduler *sched;
> +	struct drm_sched_rq *rq;
>  
> -	if (!list_empty(&entity->list))
> -		return;
> +	/* Add the entity to the run queue */
> +	spin_lock(&entity->lock);
> +	if (entity->stopped) {
> +		spin_unlock(&entity->lock);
>  
> -	atomic_inc(rq->sched->score);
> -	list_add_tail(&entity->list, &rq->entities);
> +		DRM_ERROR("Trying to push to a killed entity\n");
> +		return NULL;
> +	}
> +
> +	rq = entity->rq;
> +	spin_lock(&rq->lock);
> +	sched = rq->sched;
> +
> +	if (list_empty(&entity->list)) {
> +		atomic_inc(sched->score);
> +		list_add_tail(&entity->list, &rq->entities);
> +	}
> +
> +	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
> +		ts = entity->rr_ts;
> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +
> +	spin_unlock(&rq->lock);
> +	spin_unlock(&entity->lock);
> +
> +	return sched;
>  }
>  
>  /**
> @@ -235,6 +258,54 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  	spin_unlock(&rq->lock);
>  }
>  
> +static ktime_t
> +drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> +{
> +	ktime_t ts;
> +
> +	lockdep_assert_held(&entity->lock);
> +	lockdep_assert_held(&rq->lock);
> +
> +	ts = ktime_add_ns(rq->rr_ts, 1);
> +	entity->rr_ts = ts;
> +	rq->rr_ts = ts;
> +
> +	return ts;
> +}
> +
> +/**
> + * drm_sched_rq_pop_entity - pops an entity
> + *
> + * @entity: scheduler entity
> + *
> + * To be called every time after a job is popped from the entity.
> + */
> +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *next_job;
> +	struct drm_sched_rq *rq;
> +	ktime_t ts;
> +
> +	/*
> +	 * Update the entity's location in the min heap according to
> +	 * the timestamp of the next job, if any.
> +	 */
> +	next_job = drm_sched_entity_queue_peek(entity);
> +	if (!next_job)
> +		return;
> +
> +	spin_lock(&entity->lock);
> +	rq = entity->rq;
> +	spin_lock(&rq->lock);
> +	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> +		ts = next_job->submit_ts;
> +	else
> +		ts = drm_sched_rq_get_rr_ts(rq, entity);
> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +	spin_unlock(&rq->lock);
> +	spin_unlock(&entity->lock);
> +}
> +
>  /**
>   * drm_sched_rq_select_entity - Select an entity which provides a job to run
>   *


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 08/12] drm/sched: Remove idle entity from tree
  2025-09-03 10:18 ` [RFC v8 08/12] drm/sched: Remove idle entity from tree Tvrtko Ursulin
@ 2025-09-11 14:32   ` Philipp Stanner
  2025-09-11 15:06     ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-11 14:32 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> There is no need to keep entities with no jobs in the tree, so let's remove
> them once the last job is consumed. This keeps the tree smaller, which is
> nicer and more efficient since entities are removed and re-added on every
> popped job.

This reads suspiciously as if it could be an independent patch, not
necessarily tied to this series. I see it depends on the _pop()
function you added.

I think you'd want to make it a bit more obvious that this is not so
much a general improvement as it is a preparation for follow-up work. Or
could it be made generic for the current in-tree scheduler?

> 
> Apart from that, the upcoming fair scheduling algorithm will rely on the
> tree only containing runnable entities.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_rq.c | 28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> index 16d57461765e..67804815ca0d 100644
> --- a/drivers/gpu/drm/scheduler/sched_rq.c
> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> @@ -19,6 +19,9 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
>  static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>  					    struct drm_sched_rq *rq)
>  {
> +	lockdep_assert_held(&entity->lock);
> +	lockdep_assert_held(&rq->lock);

The callers of drm_sched_rq_remove_fifo_locked() already have some
lockdep asserts, have you seen them? Are those here redundant /
additional?

And are they strictly related to this patch?

Maybe you want to investigate the other lockdep assertions and, if
there's room for improvement, address that in a dedicated patch.


P.

> +
>  	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
>  		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
>  		RB_CLEAR_NODE(&entity->rb_tree_node);
> @@ -158,24 +161,27 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>  {
>  	struct drm_sched_job *next_job;
>  	struct drm_sched_rq *rq;
> -	ktime_t ts;
>  
>  	/*
>  	 * Update the entity's location in the min heap according to
>  	 * the timestamp of the next job, if any.
>  	 */
> +	spin_lock(&entity->lock);
> +	rq = entity->rq;
> +	spin_lock(&rq->lock);
>  	next_job = drm_sched_entity_queue_peek(entity);
> -	if (!next_job)
> -		return;
> +	if (next_job) {
> +		ktime_t ts;
>  
> -	spin_lock(&entity->lock);
> -	rq = entity->rq;
> -	spin_lock(&rq->lock);
> -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> -		ts = next_job->submit_ts;
> -	else
> -		ts = drm_sched_rq_get_rr_ts(rq, entity);
> -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> +			ts = next_job->submit_ts;
> +		else
> +			ts = drm_sched_rq_get_rr_ts(rq, entity);
> +
> +		drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +	} else {
> +		drm_sched_rq_remove_fifo_locked(entity, rq);
> +	}
>  	spin_unlock(&rq->lock);
>  	spin_unlock(&entity->lock);
>  }


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 04/12] drm/sched: Consolidate entity run queue management
  2025-09-11 14:20   ` Philipp Stanner
@ 2025-09-11 14:55     ` Tvrtko Ursulin
  2025-09-16  7:41       ` Philipp Stanner
  0 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-11 14:55 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost


On 11/09/2025 15:20, Philipp Stanner wrote:
> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>> Move the code dealing with entities entering and exiting run queues to
>> helpers to logically separate it from jobs entering and exiting entities.
> 
> Sorry if I've asked this before, but does this strictly depend on the
> preceding patches or could it be branched out?

There is no fundamental dependency, so I could re-order and pull it ahead 
if you are certain that is what you prefer.

Regards,

Tvrtko

>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c   | 64 ++-------------
>>   drivers/gpu/drm/scheduler/sched_internal.h |  8 +-
>>   drivers/gpu/drm/scheduler/sched_main.c     | 95 +++++++++++++++++++---
>>   3 files changed, 91 insertions(+), 76 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 4852006f2308..7a0a52ba87bf 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -456,24 +456,9 @@ drm_sched_job_dependency(struct drm_sched_job *job,
>>   	return NULL;
>>   }
>>   
>> -static ktime_t
>> -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
>> -{
>> -	ktime_t ts;
>> -
>> -	lockdep_assert_held(&entity->lock);
>> -	lockdep_assert_held(&rq->lock);
>> -
>> -	ts = ktime_add_ns(rq->rr_ts, 1);
>> -	entity->rr_ts = ts;
>> -	rq->rr_ts = ts;
>> -
>> -	return ts;
>> -}
>> -
>>   struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>   {
>> -	struct drm_sched_job *sched_job, *next_job;
>> +	struct drm_sched_job *sched_job;
>>   
>>   	sched_job = drm_sched_entity_queue_peek(entity);
>>   	if (!sched_job)
>> @@ -502,26 +487,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>   
>>   	spsc_queue_pop(&entity->job_queue);
>>   
>> -	/*
>> -	 * Update the entity's location in the min heap according to
>> -	 * the timestamp of the next job, if any.
>> -	 */
>> -	next_job = drm_sched_entity_queue_peek(entity);
>> -	if (next_job) {
>> -		struct drm_sched_rq *rq;
>> -		ktime_t ts;
>> -
>> -		spin_lock(&entity->lock);
>> -		rq = entity->rq;
>> -		spin_lock(&rq->lock);
>> -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> -			ts = next_job->submit_ts;
>> -		else
>> -			ts = drm_sched_rq_get_rr_ts(rq, entity);
>> -		drm_sched_rq_update_fifo_locked(entity, rq, ts);
>> -		spin_unlock(&rq->lock);
>> -		spin_unlock(&entity->lock);
>> -	}
>> +	drm_sched_rq_pop_entity(entity);
>>   
>>   	/* Jobs and entities might have different lifecycles. Since we're
>>   	 * removing the job from the entities queue, set the jobs entity pointer
>> @@ -611,30 +577,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   	/* first job wakes up scheduler */
>>   	if (first) {
>>   		struct drm_gpu_scheduler *sched;
>> -		struct drm_sched_rq *rq;
>>   
>> -		/* Add the entity to the run queue */
>> -		spin_lock(&entity->lock);
>> -		if (entity->stopped) {
>> -			spin_unlock(&entity->lock);
>> -
>> -			DRM_ERROR("Trying to push to a killed entity\n");
>> -			return;
>> -		}
>> -
>> -		rq = entity->rq;
>> -		sched = rq->sched;
>> -
>> -		spin_lock(&rq->lock);
>> -		drm_sched_rq_add_entity(rq, entity);
>> -		if (drm_sched_policy == DRM_SCHED_POLICY_RR)
>> -			submit_ts = entity->rr_ts;
>> -		drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
>> -
>> -		spin_unlock(&rq->lock);
>> -		spin_unlock(&entity->lock);
>> -
>> -		drm_sched_wakeup(sched);
>> +		sched = drm_sched_rq_add_entity(entity, submit_ts);
>> +		if (sched)
>> +			drm_sched_wakeup(sched);
>>   	}
>>   }
>>   EXPORT_SYMBOL(drm_sched_entity_push_job);
>> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
>> index 7ea5a6736f98..8269c5392a82 100644
>> --- a/drivers/gpu/drm/scheduler/sched_internal.h
>> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
>> @@ -12,13 +12,11 @@ extern int drm_sched_policy;
>>   
>>   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>   
>> -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>> -			     struct drm_sched_entity *entity);
>> +struct drm_gpu_scheduler *
>> +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
>>   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>   				struct drm_sched_entity *entity);
>> -
>> -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>> -				     struct drm_sched_rq *rq, ktime_t ts);
>> +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
>>   
>>   void drm_sched_entity_select_rq(struct drm_sched_entity *entity);
>>   struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity);
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 1db0a4aa1d46..c53931e63458 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -151,9 +151,9 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>>   	}
>>   }
>>   
>> -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>> -				     struct drm_sched_rq *rq,
>> -				     ktime_t ts)
>> +static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>> +					    struct drm_sched_rq *rq,
>> +					    ktime_t ts)
>>   {
>>   	/*
>>   	 * Both locks need to be grabbed, one to protect from entity->rq change
>> @@ -191,22 +191,45 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>>   /**
>>    * drm_sched_rq_add_entity - add an entity
>>    *
>> - * @rq: scheduler run queue
>>    * @entity: scheduler entity
>> + * @ts: submission timestamp
>>    *
>>    * Adds a scheduler entity to the run queue.
>> + *
>> + * Returns a DRM scheduler pre-selected to handle this entity.
>>    */
>> -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>> -			     struct drm_sched_entity *entity)
>> +struct drm_gpu_scheduler *
>> +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>>   {
>> -	lockdep_assert_held(&entity->lock);
>> -	lockdep_assert_held(&rq->lock);
>> +	struct drm_gpu_scheduler *sched;
>> +	struct drm_sched_rq *rq;
>>   
>> -	if (!list_empty(&entity->list))
>> -		return;
>> +	/* Add the entity to the run queue */
>> +	spin_lock(&entity->lock);
>> +	if (entity->stopped) {
>> +		spin_unlock(&entity->lock);
>>   
>> -	atomic_inc(rq->sched->score);
>> -	list_add_tail(&entity->list, &rq->entities);
>> +		DRM_ERROR("Trying to push to a killed entity\n");
>> +		return NULL;
>> +	}
>> +
>> +	rq = entity->rq;
>> +	spin_lock(&rq->lock);
>> +	sched = rq->sched;
>> +
>> +	if (list_empty(&entity->list)) {
>> +		atomic_inc(sched->score);
>> +		list_add_tail(&entity->list, &rq->entities);
>> +	}
>> +
>> +	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
>> +		ts = entity->rr_ts;
>> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>> +
>> +	spin_unlock(&rq->lock);
>> +	spin_unlock(&entity->lock);
>> +
>> +	return sched;
>>   }
>>   
>>   /**
>> @@ -235,6 +258,54 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>   	spin_unlock(&rq->lock);
>>   }
>>   
>> +static ktime_t
>> +drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
>> +{
>> +	ktime_t ts;
>> +
>> +	lockdep_assert_held(&entity->lock);
>> +	lockdep_assert_held(&rq->lock);
>> +
>> +	ts = ktime_add_ns(rq->rr_ts, 1);
>> +	entity->rr_ts = ts;
>> +	rq->rr_ts = ts;
>> +
>> +	return ts;
>> +}
>> +
>> +/**
>> + * drm_sched_rq_pop_entity - pops an entity
>> + *
>> + * @entity: scheduler entity
>> + *
>> + * To be called every time after a job is popped from the entity.
>> + */
>> +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>> +{
>> +	struct drm_sched_job *next_job;
>> +	struct drm_sched_rq *rq;
>> +	ktime_t ts;
>> +
>> +	/*
>> +	 * Update the entity's location in the min heap according to
>> +	 * the timestamp of the next job, if any.
>> +	 */
>> +	next_job = drm_sched_entity_queue_peek(entity);
>> +	if (!next_job)
>> +		return;
>> +
>> +	spin_lock(&entity->lock);
>> +	rq = entity->rq;
>> +	spin_lock(&rq->lock);
>> +	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> +		ts = next_job->submit_ts;
>> +	else
>> +		ts = drm_sched_rq_get_rr_ts(rq, entity);
>> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>> +	spin_unlock(&rq->lock);
>> +	spin_unlock(&entity->lock);
>> +}
>> +
>>   /**
>>    * drm_sched_rq_select_entity - Select an entity which provides a job to run
>>    *
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 08/12] drm/sched: Remove idle entity from tree
  2025-09-11 14:32   ` Philipp Stanner
@ 2025-09-11 15:06     ` Tvrtko Ursulin
  2025-09-24  9:15       ` Philipp Stanner
  0 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-11 15:06 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost


On 11/09/2025 15:32, Philipp Stanner wrote:
> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>> There is no need to keep entities with no jobs in the tree, so let's remove
>> them once the last job is consumed. This keeps the tree smaller, which is
>> more efficient since entities are removed and re-added on every
>> popped job.
> 
> This reads suspiciously as if it could be an independent patch, not
> necessarily tied to this series. I see it depends on the _pop()
> function you added.
> 
> I think you'd want to make it a bit more obvious that this is not so
> much a general improvement as it is a preparation for followup work. Or
> could it be made generic for the current in-tree scheduler?

Both are true. There is currently no reason idle entities _need_ to be in 
the tree. Removing them keeps the rbtree smaller, which improves its 
O(log n) operations. But the fair scheduler also relies on it, see below...

>> Apart from that, the upcoming fair scheduling algorithm will rely on the
>> tree only containing runnable entities.

... ^^^ here.

>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>> ---
>>   drivers/gpu/drm/scheduler/sched_rq.c | 28 +++++++++++++++++-----------
>>   1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
>> index 16d57461765e..67804815ca0d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_rq.c
>> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
>> @@ -19,6 +19,9 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
>>   static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>>   					    struct drm_sched_rq *rq)
>>   {
>> +	lockdep_assert_held(&entity->lock);
>> +	lockdep_assert_held(&rq->lock);
> 
> The callers of drm_sched_rq_remove_fifo_locked() already have some
> lockdep asserts, have you seen them? Are those here redundant /
> additional?
> 
> And are they strictly related to this patch?

Yes, drm_sched_rq_pop_entity() is the new caller, which needs to take 
both locks on its own. So IMO it makes sense to add the asserts.
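
I.e. the new pop path does, in essence:

	spin_lock(&entity->lock);
	rq = entity->rq;
	spin_lock(&rq->lock);
	...
	drm_sched_rq_update_fifo_locked(entity, rq, ts);
	/* or, when there is no next job: */
	drm_sched_rq_remove_fifo_locked(entity, rq);
	...

so the asserts just document that locking contract at the helpers 
themselves.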

> Maybe you want to investigate the other lockdep assertions and, if
> there's room for improvement, address that in a dedicated patch.

They look okay to me. Are you seeing something is off?

Regards,

Tvrtko

>> +
>>   	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
>>   		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
>>   		RB_CLEAR_NODE(&entity->rb_tree_node);
>> @@ -158,24 +161,27 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>   {
>>   	struct drm_sched_job *next_job;
>>   	struct drm_sched_rq *rq;
>> -	ktime_t ts;
>>   
>>   	/*
>>   	 * Update the entity's location in the min heap according to
>>   	 * the timestamp of the next job, if any.
>>   	 */
>> +	spin_lock(&entity->lock);
>> +	rq = entity->rq;
>> +	spin_lock(&rq->lock);
>>   	next_job = drm_sched_entity_queue_peek(entity);
>> -	if (!next_job)
>> -		return;
>> +	if (next_job) {
>> +		ktime_t ts;
>>   
>> -	spin_lock(&entity->lock);
>> -	rq = entity->rq;
>> -	spin_lock(&rq->lock);
>> -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> -		ts = next_job->submit_ts;
>> -	else
>> -		ts = drm_sched_rq_get_rr_ts(rq, entity);
>> -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>> +		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> +			ts = next_job->submit_ts;
>> +		else
>> +			ts = drm_sched_rq_get_rr_ts(rq, entity);
>> +
>> +		drm_sched_rq_update_fifo_locked(entity, rq, ts);
>> +	} else {
>> +		drm_sched_rq_remove_fifo_locked(entity, rq);
>> +	}
>>   	spin_unlock(&rq->lock);
>>   	spin_unlock(&entity->lock);
>>   }
> 



* Re: [RFC v8 04/12] drm/sched: Consolidate entity run queue management
  2025-09-11 14:55     ` Tvrtko Ursulin
@ 2025-09-16  7:41       ` Philipp Stanner
  2025-09-16  9:04         ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-16  7:41 UTC (permalink / raw)
  To: Tvrtko Ursulin, phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost

On Thu, 2025-09-11 at 15:55 +0100, Tvrtko Ursulin wrote:
> 
> On 11/09/2025 15:20, Philipp Stanner wrote:
> > On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> > > Move the code dealing with entities entering and exiting run queues to
> > > helpers to logically separate it from jobs entering and exiting entities.
> > 
> > Sorry if I've asked this before, but does this strictly depend on the
> > preceding patches or could it be branched out?
> 
> There is no fundamental dependency so I could re-order and pull it ahead 
> if you are certain that is what you prefer?

Well, you know my opinion: If it's a general improvement not directly
necessary for a series, it should be sent separately.

For this patch, however, I'm not even sure whether it's really
improving the code base. The number of functions seems the same, just
with different names, and the code base gets even slightly larger.

Can you elaborate a bit on why you think this patch makes sense?

P.

> 
> Regards,
> 
> Tvrtko
> 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Danilo Krummrich <dakr@kernel.org>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Philipp Stanner <phasta@kernel.org>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_entity.c   | 64 ++-------------
> > >   drivers/gpu/drm/scheduler/sched_internal.h |  8 +-
> > >   drivers/gpu/drm/scheduler/sched_main.c     | 95 +++++++++++++++++++---
> > >   3 files changed, 91 insertions(+), 76 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index 4852006f2308..7a0a52ba87bf 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -456,24 +456,9 @@ drm_sched_job_dependency(struct drm_sched_job *job,
> > >   	return NULL;
> > >   }
> > >   
> > > -static ktime_t
> > > -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> > > -{
> > > -	ktime_t ts;
> > > -
> > > -	lockdep_assert_held(&entity->lock);
> > > -	lockdep_assert_held(&rq->lock);
> > > -
> > > -	ts = ktime_add_ns(rq->rr_ts, 1);
> > > -	entity->rr_ts = ts;
> > > -	rq->rr_ts = ts;
> > > -
> > > -	return ts;
> > > -}
> > > -
> > >   struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> > >   {
> > > -	struct drm_sched_job *sched_job, *next_job;
> > > +	struct drm_sched_job *sched_job;
> > >   
> > >   	sched_job = drm_sched_entity_queue_peek(entity);
> > >   	if (!sched_job)
> > > @@ -502,26 +487,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> > >   
> > >   	spsc_queue_pop(&entity->job_queue);
> > >   
> > > -	/*
> > > -	 * Update the entity's location in the min heap according to
> > > -	 * the timestamp of the next job, if any.
> > > -	 */
> > > -	next_job = drm_sched_entity_queue_peek(entity);
> > > -	if (next_job) {
> > > -		struct drm_sched_rq *rq;
> > > -		ktime_t ts;
> > > -
> > > -		spin_lock(&entity->lock);
> > > -		rq = entity->rq;
> > > -		spin_lock(&rq->lock);
> > > -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > -			ts = next_job->submit_ts;
> > > -		else
> > > -			ts = drm_sched_rq_get_rr_ts(rq, entity);
> > > -		drm_sched_rq_update_fifo_locked(entity, rq, ts);
> > > -		spin_unlock(&rq->lock);
> > > -		spin_unlock(&entity->lock);
> > > -	}
> > > +	drm_sched_rq_pop_entity(entity);
> > >   
> > >   	/* Jobs and entities might have different lifecycles. Since we're
> > >   	 * removing the job from the entities queue, set the jobs entity pointer
> > > @@ -611,30 +577,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >   	/* first job wakes up scheduler */
> > >   	if (first) {
> > >   		struct drm_gpu_scheduler *sched;
> > > -		struct drm_sched_rq *rq;
> > >   
> > > -		/* Add the entity to the run queue */
> > > -		spin_lock(&entity->lock);
> > > -		if (entity->stopped) {
> > > -			spin_unlock(&entity->lock);
> > > -
> > > -			DRM_ERROR("Trying to push to a killed entity\n");
> > > -			return;
> > > -		}
> > > -
> > > -		rq = entity->rq;
> > > -		sched = rq->sched;
> > > -
> > > -		spin_lock(&rq->lock);
> > > -		drm_sched_rq_add_entity(rq, entity);
> > > -		if (drm_sched_policy == DRM_SCHED_POLICY_RR)
> > > -			submit_ts = entity->rr_ts;
> > > -		drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
> > > -
> > > -		spin_unlock(&rq->lock);
> > > -		spin_unlock(&entity->lock);
> > > -
> > > -		drm_sched_wakeup(sched);
> > > +		sched = drm_sched_rq_add_entity(entity, submit_ts);
> > > +		if (sched)
> > > +			drm_sched_wakeup(sched);
> > >   	}
> > >   }
> > >   EXPORT_SYMBOL(drm_sched_entity_push_job);
> > > diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> > > index 7ea5a6736f98..8269c5392a82 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_internal.h
> > > +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> > > @@ -12,13 +12,11 @@ extern int drm_sched_policy;
> > >   
> > >   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > >   
> > > -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
> > > -			     struct drm_sched_entity *entity);
> > > +struct drm_gpu_scheduler *
> > > +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
> > >   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > >   				struct drm_sched_entity *entity);
> > > -
> > > -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > > -				     struct drm_sched_rq *rq, ktime_t ts);
> > > +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
> > >   
> > >   void drm_sched_entity_select_rq(struct drm_sched_entity *entity);
> > >   struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity);
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 1db0a4aa1d46..c53931e63458 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -151,9 +151,9 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
> > >   	}
> > >   }
> > >   
> > > -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > > -				     struct drm_sched_rq *rq,
> > > -				     ktime_t ts)
> > > +static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > > +					    struct drm_sched_rq *rq,
> > > +					    ktime_t ts)
> > >   {
> > >   	/*
> > >   	 * Both locks need to be grabbed, one to protect from entity->rq change
> > > @@ -191,22 +191,45 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> > >   /**
> > >    * drm_sched_rq_add_entity - add an entity
> > >    *
> > > - * @rq: scheduler run queue
> > >    * @entity: scheduler entity
> > > + * @ts: submission timestamp
> > >    *
> > >    * Adds a scheduler entity to the run queue.
> > > + *
> > > + * Returns a DRM scheduler pre-selected to handle this entity.
> > >    */
> > > -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
> > > -			     struct drm_sched_entity *entity)
> > > +struct drm_gpu_scheduler *
> > > +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
> > >   {
> > > -	lockdep_assert_held(&entity->lock);
> > > -	lockdep_assert_held(&rq->lock);
> > > +	struct drm_gpu_scheduler *sched;
> > > +	struct drm_sched_rq *rq;
> > >   
> > > -	if (!list_empty(&entity->list))
> > > -		return;
> > > +	/* Add the entity to the run queue */
> > > +	spin_lock(&entity->lock);
> > > +	if (entity->stopped) {
> > > +		spin_unlock(&entity->lock);
> > >   
> > > -	atomic_inc(rq->sched->score);
> > > -	list_add_tail(&entity->list, &rq->entities);
> > > +		DRM_ERROR("Trying to push to a killed entity\n");
> > > +		return NULL;
> > > +	}
> > > +
> > > +	rq = entity->rq;
> > > +	spin_lock(&rq->lock);
> > > +	sched = rq->sched;
> > > +
> > > +	if (list_empty(&entity->list)) {
> > > +		atomic_inc(sched->score);
> > > +		list_add_tail(&entity->list, &rq->entities);
> > > +	}
> > > +
> > > +	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
> > > +		ts = entity->rr_ts;
> > > +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> > > +
> > > +	spin_unlock(&rq->lock);
> > > +	spin_unlock(&entity->lock);
> > > +
> > > +	return sched;
> > >   }
> > >   
> > >   /**
> > > @@ -235,6 +258,54 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > >   	spin_unlock(&rq->lock);
> > >   }
> > >   
> > > +static ktime_t
> > > +drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> > > +{
> > > +	ktime_t ts;
> > > +
> > > +	lockdep_assert_held(&entity->lock);
> > > +	lockdep_assert_held(&rq->lock);
> > > +
> > > +	ts = ktime_add_ns(rq->rr_ts, 1);
> > > +	entity->rr_ts = ts;
> > > +	rq->rr_ts = ts;
> > > +
> > > +	return ts;
> > > +}
> > > +
> > > +/**
> > > + * drm_sched_rq_pop_entity - pops an entity
> > > + *
> > > + * @entity: scheduler entity
> > > + *
> > > + * To be called every time after a job is popped from the entity.
> > > + */
> > > +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> > > +{
> > > +	struct drm_sched_job *next_job;
> > > +	struct drm_sched_rq *rq;
> > > +	ktime_t ts;
> > > +
> > > +	/*
> > > +	 * Update the entity's location in the min heap according to
> > > +	 * the timestamp of the next job, if any.
> > > +	 */
> > > +	next_job = drm_sched_entity_queue_peek(entity);
> > > +	if (!next_job)
> > > +		return;
> > > +
> > > +	spin_lock(&entity->lock);
> > > +	rq = entity->rq;
> > > +	spin_lock(&rq->lock);
> > > +	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > +		ts = next_job->submit_ts;
> > > +	else
> > > +		ts = drm_sched_rq_get_rr_ts(rq, entity);
> > > +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> > > +	spin_unlock(&rq->lock);
> > > +	spin_unlock(&entity->lock);
> > > +}
> > > +
> > >   /**
> > >    * drm_sched_rq_select_entity - Select an entity which provides a job to run
> > >    *
> > 
> 



* Re: [RFC v8 04/12] drm/sched: Consolidate entity run queue management
  2025-09-16  7:41       ` Philipp Stanner
@ 2025-09-16  9:04         ` Tvrtko Ursulin
  0 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-16  9:04 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost


On 16/09/2025 08:41, Philipp Stanner wrote:
> On Thu, 2025-09-11 at 15:55 +0100, Tvrtko Ursulin wrote:
>>
>> On 11/09/2025 15:20, Philipp Stanner wrote:
>>> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>>>> Move the code dealing with entities entering and exiting run queues to
>>>> helpers to logically separate it from jobs entering and exiting entities.
>>>
>>> Sorry if I've asked this before, but does this strictly depend on the
>>> preceding patches or could it be branched out?
>>
>> There is no fundamental dependency so I could re-order and pull it ahead
>> if you are certain that is what you prefer?
> 
> Well, you know my opinion: If it's a general improvement not directly
> necessary for a series, it should be sent separately.
> 
> For this patch, however, I'm not even sure whether it's really
> improving the code base. The number of functions seems the same, just
> with different names, and the code base gets even slightly larger.

There is actually one new function (pop). But one previously exported 
function gets hidden as an implementation detail (the rbtree update).

> Can you elaborate a bit on why you think this patch makes sense?

Before the patch, sched_main.c implements 
drm_sched_rq_remove_entity() as an interface operating on run queues and 
used by the entity code. It handles both the entity list and the 
FIFO rbtree. All good there.

The other two operations on those data structures are add and pop.

But while the existing drm_sched_rq_add_entity() sounds, going by the 
name, analogous to drm_sched_rq_remove_entity(), in reality it isn't: it 
only handles the entity list part, while the FIFO rbtree handling is open 
coded in sched_entity.c.

Job pop is the same - rbtree handling is open coded in sched_entity.c.

The patch consolidates all of the entity run queue management into a 
self-contained API aligned with the above-mentioned existing 
drm_sched_rq_remove_entity().

Drm_sched_rq_add_entity() gets conceptually aligned with 
drm_sched_rq_remove_entity(), with the rbtree operations removed from 
sched_entity.c.

Drm_sched_rq_pop_entity() is added with the same effect.

The two new pieces of API are implemented next to 
drm_sched_rq_remove_entity() in sched_main.c. (Later I move all three, 
together with other bits relating to run queue management, to a new 
sched_rq.c.)
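
To put it concretely, the entity-facing run queue interface after this 
patch is just (quoting sched_internal.h from the diff):

	struct drm_gpu_scheduler *
	drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
	void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
					struct drm_sched_entity *entity);
	void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);

And the first-job push path in sched_entity.c collapses to roughly:

	sched = drm_sched_rq_add_entity(entity, submit_ts);
	if (sched)
		drm_sched_wakeup(sched);

with all of the list and rbtree manipulation hidden behind the rq API.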

Regards,

Tvrtko

>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Philipp Stanner <phasta@kernel.org>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_entity.c   | 64 ++-------------
>>>>    drivers/gpu/drm/scheduler/sched_internal.h |  8 +-
>>>>    drivers/gpu/drm/scheduler/sched_main.c     | 95 +++++++++++++++++++---
>>>>    3 files changed, 91 insertions(+), 76 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index 4852006f2308..7a0a52ba87bf 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -456,24 +456,9 @@ drm_sched_job_dependency(struct drm_sched_job *job,
>>>>    	return NULL;
>>>>    }
>>>>    
>>>> -static ktime_t
>>>> -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
>>>> -{
>>>> -	ktime_t ts;
>>>> -
>>>> -	lockdep_assert_held(&entity->lock);
>>>> -	lockdep_assert_held(&rq->lock);
>>>> -
>>>> -	ts = ktime_add_ns(rq->rr_ts, 1);
>>>> -	entity->rr_ts = ts;
>>>> -	rq->rr_ts = ts;
>>>> -
>>>> -	return ts;
>>>> -}
>>>> -
>>>>    struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>>    {
>>>> -	struct drm_sched_job *sched_job, *next_job;
>>>> +	struct drm_sched_job *sched_job;
>>>>    
>>>>    	sched_job = drm_sched_entity_queue_peek(entity);
>>>>    	if (!sched_job)
>>>> @@ -502,26 +487,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>>    
>>>>    	spsc_queue_pop(&entity->job_queue);
>>>>    
>>>> -	/*
>>>> -	 * Update the entity's location in the min heap according to
>>>> -	 * the timestamp of the next job, if any.
>>>> -	 */
>>>> -	next_job = drm_sched_entity_queue_peek(entity);
>>>> -	if (next_job) {
>>>> -		struct drm_sched_rq *rq;
>>>> -		ktime_t ts;
>>>> -
>>>> -		spin_lock(&entity->lock);
>>>> -		rq = entity->rq;
>>>> -		spin_lock(&rq->lock);
>>>> -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>> -			ts = next_job->submit_ts;
>>>> -		else
>>>> -			ts = drm_sched_rq_get_rr_ts(rq, entity);
>>>> -		drm_sched_rq_update_fifo_locked(entity, rq, ts);
>>>> -		spin_unlock(&rq->lock);
>>>> -		spin_unlock(&entity->lock);
>>>> -	}
>>>> +	drm_sched_rq_pop_entity(entity);
>>>>    
>>>>    	/* Jobs and entities might have different lifecycles. Since we're
>>>>    	 * removing the job from the entities queue, set the jobs entity pointer
>>>> @@ -611,30 +577,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>    	/* first job wakes up scheduler */
>>>>    	if (first) {
>>>>    		struct drm_gpu_scheduler *sched;
>>>> -		struct drm_sched_rq *rq;
>>>>    
>>>> -		/* Add the entity to the run queue */
>>>> -		spin_lock(&entity->lock);
>>>> -		if (entity->stopped) {
>>>> -			spin_unlock(&entity->lock);
>>>> -
>>>> -			DRM_ERROR("Trying to push to a killed entity\n");
>>>> -			return;
>>>> -		}
>>>> -
>>>> -		rq = entity->rq;
>>>> -		sched = rq->sched;
>>>> -
>>>> -		spin_lock(&rq->lock);
>>>> -		drm_sched_rq_add_entity(rq, entity);
>>>> -		if (drm_sched_policy == DRM_SCHED_POLICY_RR)
>>>> -			submit_ts = entity->rr_ts;
>>>> -		drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
>>>> -
>>>> -		spin_unlock(&rq->lock);
>>>> -		spin_unlock(&entity->lock);
>>>> -
>>>> -		drm_sched_wakeup(sched);
>>>> +		sched = drm_sched_rq_add_entity(entity, submit_ts);
>>>> +		if (sched)
>>>> +			drm_sched_wakeup(sched);
>>>>    	}
>>>>    }
>>>>    EXPORT_SYMBOL(drm_sched_entity_push_job);
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
>>>> index 7ea5a6736f98..8269c5392a82 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_internal.h
>>>> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
>>>> @@ -12,13 +12,11 @@ extern int drm_sched_policy;
>>>>    
>>>>    void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>>>    
>>>> -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>>>> -			     struct drm_sched_entity *entity);
>>>> +struct drm_gpu_scheduler *
>>>> +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
>>>>    void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>    				struct drm_sched_entity *entity);
>>>> -
>>>> -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>>>> -				     struct drm_sched_rq *rq, ktime_t ts);
>>>> +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
>>>>    
>>>>    void drm_sched_entity_select_rq(struct drm_sched_entity *entity);
>>>>    struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity);
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 1db0a4aa1d46..c53931e63458 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -151,9 +151,9 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>>>>    	}
>>>>    }
>>>>    
>>>> -void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>>>> -				     struct drm_sched_rq *rq,
>>>> -				     ktime_t ts)
>>>> +static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>>>> +					    struct drm_sched_rq *rq,
>>>> +					    ktime_t ts)
>>>>    {
>>>>    	/*
>>>>    	 * Both locks need to be grabbed, one to protect from entity->rq change
>>>> @@ -191,22 +191,45 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>>>>    /**
>>>>     * drm_sched_rq_add_entity - add an entity
>>>>     *
>>>> - * @rq: scheduler run queue
>>>>     * @entity: scheduler entity
>>>> + * @ts: submission timestamp
>>>>     *
>>>>     * Adds a scheduler entity to the run queue.
>>>> + *
>>>> + * Returns a DRM scheduler pre-selected to handle this entity.
>>>>     */
>>>> -void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>>>> -			     struct drm_sched_entity *entity)
>>>> +struct drm_gpu_scheduler *
>>>> +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>>>>    {
>>>> -	lockdep_assert_held(&entity->lock);
>>>> -	lockdep_assert_held(&rq->lock);
>>>> +	struct drm_gpu_scheduler *sched;
>>>> +	struct drm_sched_rq *rq;
>>>>    
>>>> -	if (!list_empty(&entity->list))
>>>> -		return;
>>>> +	/* Add the entity to the run queue */
>>>> +	spin_lock(&entity->lock);
>>>> +	if (entity->stopped) {
>>>> +		spin_unlock(&entity->lock);
>>>>    
>>>> -	atomic_inc(rq->sched->score);
>>>> -	list_add_tail(&entity->list, &rq->entities);
>>>> +		DRM_ERROR("Trying to push to a killed entity\n");
>>>> +		return NULL;
>>>> +	}
>>>> +
>>>> +	rq = entity->rq;
>>>> +	spin_lock(&rq->lock);
>>>> +	sched = rq->sched;
>>>> +
>>>> +	if (list_empty(&entity->list)) {
>>>> +		atomic_inc(sched->score);
>>>> +		list_add_tail(&entity->list, &rq->entities);
>>>> +	}
>>>> +
>>>> +	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
>>>> +		ts = entity->rr_ts;
>>>> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>>>> +
>>>> +	spin_unlock(&rq->lock);
>>>> +	spin_unlock(&entity->lock);
>>>> +
>>>> +	return sched;
>>>>    }
>>>>    
>>>>    /**
>>>> @@ -235,6 +258,54 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>    	spin_unlock(&rq->lock);
>>>>    }
>>>>    
>>>> +static ktime_t
>>>> +drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
>>>> +{
>>>> +	ktime_t ts;
>>>> +
>>>> +	lockdep_assert_held(&entity->lock);
>>>> +	lockdep_assert_held(&rq->lock);
>>>> +
>>>> +	ts = ktime_add_ns(rq->rr_ts, 1);
>>>> +	entity->rr_ts = ts;
>>>> +	rq->rr_ts = ts;
>>>> +
>>>> +	return ts;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_sched_rq_pop_entity - pops an entity
>>>> + *
>>>> + * @entity: scheduler entity
>>>> + *
>>>> + * To be called every time after a job is popped from the entity.
>>>> + */
>>>> +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>>> +{
>>>> +	struct drm_sched_job *next_job;
>>>> +	struct drm_sched_rq *rq;
>>>> +	ktime_t ts;
>>>> +
>>>> +	/*
>>>> +	 * Update the entity's location in the min heap according to
>>>> +	 * the timestamp of the next job, if any.
>>>> +	 */
>>>> +	next_job = drm_sched_entity_queue_peek(entity);
>>>> +	if (!next_job)
>>>> +		return;
>>>> +
>>>> +	spin_lock(&entity->lock);
>>>> +	rq = entity->rq;
>>>> +	spin_lock(&rq->lock);
>>>> +	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>> +		ts = next_job->submit_ts;
>>>> +	else
>>>> +		ts = drm_sched_rq_get_rr_ts(rq, entity);
>>>> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>>>> +	spin_unlock(&rq->lock);
>>>> +	spin_unlock(&entity->lock);
>>>> +}
>>>> +
>>>>    /**
>>>>     * drm_sched_rq_select_entity - Select an entity which provides a job to run
>>>>     *
>>>
>>
> 



* Re: [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests
  2025-09-03 10:18 ` [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests Tvrtko Ursulin
@ 2025-09-23 13:01   ` Philipp Stanner
  2025-09-25 11:41     ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-23 13:01 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner, Pierre-Eric Pelloux-Prayer

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> To make evaluating different scheduling policies easier (no need for
> external benchmarks) and perfectly repeatable, lets add some synthetic
> workloads built upon mock scheduler unit test infrastructure.
> 
> Focus is on two parallel clients (two threads) submitting different job
> patterns and logging their progress and some overall metrics. This is
> repeated for both scheduler credit limit 1 and 2.
> 
> Example test output:
> 
>   Normal and low:
>                     pct1 cps1 qd1;  pct2 cps2 qd2
>         +     0ms:   0     0    0;   0     0    0
>         +   104ms: 100  1240  112; 100  1240  125
>         +   209ms: 100     0   99; 100     0  125
>         +   313ms: 100     0   86; 100     0  125
>         +   419ms: 100     0   73; 100     0  125
>         +   524ms: 100     0   60; 100     0  125
>         +   628ms: 100     0   47; 100     0  125
>         +   731ms: 100     0   34; 100     0  125
>         +   836ms: 100     0   21; 100     0  125
>         +   939ms: 100     0    8; 100     0  125
>         +  1043ms:               ; 100     0  120
>         +  1147ms:               ; 100     0  107
>         +  1252ms:               ; 100     0   94
>         +  1355ms:               ; 100     0   81
>         +  1459ms:               ; 100     0   68
>         +  1563ms:               ; 100     0   55
>         +  1667ms:               ; 100     0   42
>         +  1771ms:               ; 100     0   29
>         +  1875ms:               ; 100     0   16
>         +  1979ms:               ; 100     0    3
>     0: prio=normal sync=0 elapsed_ms=1015ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,222,978 us latency_time(min,avg,max)=134,222,978
> us
>     1: prio=low sync=0 elapsed_ms=2009ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,215,806 us latency_time(min,avg,max)=134,215,806 us
> 
> There we have two clients represented in the two respective columns, with
> their progress logged roughly every 100 milliseconds. The metrics are:
> 
>  - pct - Percentage progress of the job submit part
>  - cps - Cycles per second
>  - qd  - Queue depth - number of submitted unfinished jobs
> 
> The cycles per second metric is inherent to the fact that workload
> patterns are a data driven cycling sequence of:
> 
>  - Submit 1..N jobs
>  - Wait for Nth job to finish (optional)
>  - Sleep (optional)
>  - Repeat from start
> 
> In this particular example we have a normal priority and a low priority
> clients both spamming the scheduler with 8ms jobs with no sync and no
> sleeping. Hence they build a very deep queues and we can see how the low
> priority client is completely starved until the normal finishes.
> 
> Note that the PCT and CPS metrics are irrelevant for "unsync" clients
> since they manage to complete all of their cycles instantaneously.
> 
> A different example would be:
> 
>   Heavy and interactive:
>                     pct1 cps1 qd1;  pct2 cps2 qd2
>         +     0ms:   0     0    0;   0     0    0
>         +   106ms:   5    40    3;   5    40    0
>         +   209ms:   9    40    0;   9    40    0
>         +   314ms:  14    50    3;  14    50    0
>         +   417ms:  18    40    0;  18    40    0
>         +   522ms:  23    50    3;  23    50    0
>         +   625ms:  27    40    0;  27    40    1
>         +   729ms:  32    50    0;  32    50    0
>         +   833ms:  36    40    1;  36    40    0
>         +   937ms:  40    40    0;  40    40    0
>         +  1041ms:  45    50    0;  45    50    0
>         +  1146ms:  49    40    1;  49    40    1
>         +  1249ms:  54    50    0;  54    50    0
>         +  1353ms:  58    40    1;  58    40    0
>         +  1457ms:  62    40    0;  62    40    1
>         +  1561ms:  67    50    0;  67    50    0
>         +  1665ms:  71    40    1;  71    40    0
>         +  1772ms:  76    50    0;  76    50    0
>         +  1877ms:  80    40    1;  80    40    0
>         +  1981ms:  84    40    0;  84    40    0
>         +  2085ms:  89    50    0;  89    50    0
>         +  2189ms:  93    40    1;  93    40    0
>         +  2293ms:  97    40    0;  97    40    1
> 
> In this case client one is submitting 3x 2.5ms jobs, waiting for the 3rd
> and then sleeping for 2.5ms (in effect causing 75% GPU load, minus the
> overheads). Second client is submitting 1ms jobs, waiting for each to
> finish and sleeping for 9ms (effective 10% GPU load). Here we can see
> the PCT and CPS reflecting real progress.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
> Acked-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
>  .../gpu/drm/scheduler/tests/tests_scheduler.c | 640 ++++++++++++++++++
>  2 files changed, 642 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c
> 
> diff --git a/drivers/gpu/drm/scheduler/tests/Makefile b/drivers/gpu/drm/scheduler/tests/Makefile
> index 5bf707bad373..9ec185fbbc15 100644
> --- a/drivers/gpu/drm/scheduler/tests/Makefile
> +++ b/drivers/gpu/drm/scheduler/tests/Makefile
> @@ -2,6 +2,7 @@
>  
>  drm-sched-tests-y := \
>          mock_scheduler.o \
> -        tests_basic.o
> +        tests_basic.o \
> +        tests_scheduler.o
>  
>  obj-$(CONFIG_DRM_SCHED_KUNIT_TEST) += drm-sched-tests.o
> diff --git a/drivers/gpu/drm/scheduler/tests/tests_scheduler.c b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
> new file mode 100644
> index 000000000000..15c898966ef0
> --- /dev/null
> +++ b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
> @@ -0,0 +1,640 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Valve Corporation */
> +
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/ktime.h>
> +#include <linux/math64.h>
> +
> +#include "sched_tests.h"
> +
> +/*
> + * DRM scheduler scheduling tests exercise load balancing decisions, i.e.
> + * entity selection logic.
> + */
> +
> +static int drm_sched_scheduler_init(struct kunit *test)
> +{
> +	struct drm_mock_scheduler *sched;
> +
> +	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
> +	sched->base.credit_limit = 1;
> +
> +	test->priv = sched;
> +
> +	return 0;
> +}
> +
> +static int drm_sched_scheduler_init2(struct kunit *test)

I'm not very convinced of the naming prefixes in this file. They're all
static, but people might find them through Bootlin or similar and be
confused.

In tests_basic.c we prefix all functions with drm_sched_basic.

So I'd propose to go for drm_sched_quality here.

> +{
> +	struct drm_mock_scheduler *sched;
> +
> +	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
> +	sched->base.credit_limit = 2;
> +
> +	test->priv = sched;
> +
> +	return 0;
> +}
> +
> +static void drm_sched_scheduler_exit(struct kunit *test)
> +{
> +	struct drm_mock_scheduler *sched = test->priv;
> +
> +	drm_mock_sched_fini(sched);
> +}
> +
> +static void drm_sched_scheduler_queue_overhead(struct kunit *test)

What queue's overhead is this, actually? Seems to me that it's more the
scheduler overhead for processing jobs in general.

> +{
> +	struct drm_mock_scheduler *sched = test->priv;
> +	struct drm_mock_sched_entity *entity;
> +	const unsigned int job_us = 1000;
> +	const unsigned int jobs = 1000;
> +	const unsigned int total_us = jobs * job_us;
> +	struct drm_mock_sched_job *job, *first;
> +	ktime_t start, end;
> +	bool done;
> +	int i;
> +
> +	/*
> +	 * Deep queue job at a time processing (single credit).

"Measure job-by-job processing overhead time"?

> +	 *
> +	 * This measures the overhead of picking and processing a job at a time
> +	 * by comparing the ideal total "GPU" time of all submitted jobs versus
> +	 * the time actually taken.
> +	 */
> +
> +	KUNIT_ASSERT_EQ(test, sched->base.credit_limit, 1);
> +
> +	entity = drm_mock_sched_entity_new(test,
> +					   DRM_SCHED_PRIORITY_NORMAL,
> +					   sched);
> +
> +	for (i = 0; i <= jobs; i++) {
> +		job = drm_mock_sched_job_new(test, entity);
> +		if (i == 0)
> +			first = job; /* Extra first job blocks the queue */

"Extra"?

> +		else
> +			drm_mock_sched_job_set_duration_us(job, job_us);
> +		drm_mock_sched_job_submit(job);
> +	}
> +
> +	done = drm_mock_sched_job_wait_scheduled(first, HZ);
> +	KUNIT_ASSERT_TRUE(test, done);
> +
> +	start = ktime_get();
> +	i = drm_mock_sched_advance(sched, 1); /* Release the queue */
> +	KUNIT_ASSERT_EQ(test, i, 1);
> +
> +	done = drm_mock_sched_job_wait_finished(job,
> +						usecs_to_jiffies(total_us) * 5);

Why 5?

> +	end = ktime_get();
> +	KUNIT_ASSERT_TRUE(test, done);
> +
> +	pr_info("Expected %uus, actual %lldus\n",
> +		total_us,
> +		ktime_to_us(ktime_sub(end, start)));
> +
> +	drm_mock_sched_entity_free(entity);
> +}
> +
> +static void drm_sched_scheduler_ping_pong(struct kunit *test)

Interesting idea for a test!

> +{
> +	struct drm_mock_sched_job *job, *first, *prev = NULL;
> +	struct drm_mock_scheduler *sched = test->priv;
> +	struct drm_mock_sched_entity *entity[2];

Setting a define saves you the ARRAY_SIZE. But I don't have a strong
opinion on that.

> +	const unsigned int job_us = 1000;
> +	const unsigned int jobs = 1000;
> +	const unsigned int total_us = jobs * job_us;
> +	ktime_t start, end;
> +	bool done;
> +	int i;
> +
> +	/*
> +	 * Two entities in an inter-dependency chain.
> +	 *
> +	 * This measures the overhead of picking and processing a job at a time,
> +	 * where each job depends on the previous one from the other
> +	 * entity, by comparing the ideal total "GPU" time of all submitted jobs
> +	 * versus the time actually taken.
> +	 */
> +
> +	KUNIT_ASSERT_EQ(test, sched->base.credit_limit, 1);
> +
> +	for (i = 0; i < ARRAY_SIZE(entity); i++)
> +		entity[i] = drm_mock_sched_entity_new(test,
> +						      DRM_SCHED_PRIORITY_NORMAL,
> +						      sched);
> +
> +	for (i = 0; i <= jobs; i++) {
> +		job = drm_mock_sched_job_new(test, entity[i & 1]);
> +		if (i == 0)
> +			first = job; /* Extra first job blocks the queue */
> +		else
> +			drm_mock_sched_job_set_duration_us(job, job_us);
> +		if (prev)
> +			drm_sched_job_add_dependency(&job->base,
> +						     dma_fence_get(&prev->base.s_fence->finished));
> +		drm_mock_sched_job_submit(job);
> +		prev = job;
> +	}
> +
> +	done = drm_mock_sched_job_wait_scheduled(first, HZ);
> +	KUNIT_ASSERT_TRUE(test, done);
> +
> +	start = ktime_get();
> +	i = drm_mock_sched_advance(sched, 1); /* Release the queue */
> +	KUNIT_ASSERT_EQ(test, i, 1);
> +
> +	done = drm_mock_sched_job_wait_finished(job,
> +						usecs_to_jiffies(total_us) * 5);

Same. I guess 5 is just "long enough to be finished for sure".


> +	end = ktime_get();
> +	KUNIT_ASSERT_TRUE(test, done);
> +
> +	pr_info("Expected %uus, actual %lldus\n",
> +		total_us,
> +		ktime_to_us(ktime_sub(end, start)));
> +
> +	for (i = 0; i < ARRAY_SIZE(entity); i++)
> +		drm_mock_sched_entity_free(entity[i]);
> +}
> +
> +static struct kunit_case drm_sched_scheduler_overhead_tests[] = {
> +	KUNIT_CASE_SLOW(drm_sched_scheduler_queue_overhead),
> +	KUNIT_CASE_SLOW(drm_sched_scheduler_ping_pong),
> +	{}
> +};
> +
> +static struct kunit_suite drm_sched_scheduler_overhead = {
> +	.name = "drm_sched_scheduler_overhead_tests",
> +	.init = drm_sched_scheduler_init,
> +	.exit = drm_sched_scheduler_exit,
> +	.test_cases = drm_sched_scheduler_overhead_tests,
> +};
> +
> +struct drm_sched_client_params {
> +	enum drm_sched_priority priority;
> +	unsigned int job_cnt;
> +	unsigned int job_us;
> +	unsigned int wait_us;
> +	bool sync;
> +};

Struct documentation would be neat. cnt is obvious.. job_us is the
"payload time" and "wait_us" is.. the time you wait until submitting a
job? And sync is whether you submit synchronously? Doesn't that then
imply wait_us == 0?

I'm reading as I write; I guess it gets clearer below. But you get the
point, a few words of docu would help new programmers / reviewers.
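
Something along these lines, maybe (just a sketch; I'm guessing the 
exact semantics from the code below):

	/**
	 * struct drm_sched_client_params - synthetic client behaviour
	 * @priority: scheduler priority used for the client's entity
	 * @job_cnt: number of jobs submitted per cycle
	 * @job_us: duration of a single job in microseconds
	 * @wait_us: microseconds to sleep at the end of each cycle
	 * @sync: wait for the last submitted job of a cycle before sleeping
	 */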

> +
> +struct drm_sched_test_params {
> +	const char *description;
> +	struct drm_sched_client_params client[2];
> +};
> +
> +static const struct drm_sched_test_params drm_sched_cases[] = {
> +	{
> +		.description = "Normal and normal",

s/normal/normal prio/

> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +	},
> +	{
> +		.description = "Normal and low",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_LOW,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +	},
> +	{
> +		.description = "High and normal",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_HIGH,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +	},
> +	{
> +		.description = "High and low",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_HIGH,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_LOW,
> +			.job_cnt = 1,
> +			.job_us = 8000,
> +			.wait_us = 0,
> +			.sync = false,
> +		},
> +	},
> +	{
> +		.description = "50 and 50",

Hm? 50 normal and 50 normal?

> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 1500,
> +			.wait_us = 1500,
> +			.sync = true,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 2500,
> +			.wait_us = 2500,
> +			.sync = true,
> +		},
> +	},
> +	{
> +		.description = "50 and 50 low",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 1500,
> +			.wait_us = 1500,
> +			.sync = true,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_LOW,
> +			.job_cnt = 1,
> +			.job_us = 2500,
> +			.wait_us = 2500,
> +			.sync = true,
> +		},
> +	},
> +	{
> +		.description = "50 high and 50",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_HIGH,
> +			.job_cnt = 1,
> +			.job_us = 1500,
> +			.wait_us = 1500,
> +			.sync = true,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 2500,
> +			.wait_us = 2500,
> +			.sync = true,
> +		},
> +	},
> +	{
> +		.description = "Low hog and interactive",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_LOW,
> +			.job_cnt = 3,
> +			.job_us = 2500,
> +			.wait_us = 500,
> +			.sync = false,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 500,
> +			.wait_us = 10000,
> +			.sync = true,
> +		},
> +	},
> +	{
> +		.description = "Heavy and interactive",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 3,
> +			.job_us = 2500,
> +			.wait_us = 2500,
> +			.sync = true,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 1000,
> +			.wait_us = 9000,
> +			.sync = true,
> +		},
> +	},
> +	{
> +		.description = "Very heavy and interactive",
> +		.client[0] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 4,
> +			.job_us = 50000,
> +			.wait_us = 1,
> +			.sync = true,
> +		},
> +		.client[1] = {
> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
> +			.job_cnt = 1,
> +			.job_us = 1000,
> +			.wait_us = 9000,
> +			.sync = true,
> +		},
> +	},
> +};
> +
> +static void
> +drm_sched_desc(const struct drm_sched_test_params *params, char *desc)
> +{
> +	strscpy(desc, params->description, KUNIT_PARAM_DESC_SIZE);
> +}
> +
> +KUNIT_ARRAY_PARAM(drm_sched_scheduler_two_clients,
> +		  drm_sched_cases,
> +		  drm_sched_desc);
> +
> +struct test_client_stats {
> +	unsigned int min_us;
> +	unsigned int max_us;
> +	unsigned long tot_us;
> +};
> +
> +struct test_client {
> +	struct kunit *test;
> +
> +	struct drm_mock_sched_entity	*entity;
> +
> +	struct kthread_worker	*worker;
> +	struct kthread_work	work;
> +
> +	unsigned int id;
> +	ktime_t duration;
> +
> +	struct drm_sched_client_params params;
> +
> +	ktime_t ideal_duration;
> +	unsigned int cycles;
> +	unsigned int cycle;
> +	ktime_t	start;
> +	ktime_t	end;
> +	bool done;
> +
> +	struct test_client_stats cycle_time;
> +	struct test_client_stats latency_time;
> +};

Same here, docu would be helpful. Especially regarding "cycles". It
seems to be "cycles per second", if I read the code below correctly.

> +
> +static void
> +update_stats(struct test_client_stats *stats, unsigned int us)
> +{
> +	if (us > stats->max_us)
> +		stats->max_us = us;
> +	if (us < stats->min_us)

Won't min_us be initialized to 0? So how could 'us' ever be smaller?

> +		stats->min_us = us;
> +	stats->tot_us += us;
> +}
> +
> +static unsigned int
> +get_stats_avg(struct test_client_stats *stats, unsigned int cycles)
> +{
> +	return div_u64(stats->tot_us, cycles);
> +}
> +
> +static void drm_sched_client_work(struct kthread_work *work)

Why's this called "client_work"? Is it representing the userspace
client?

> +{
> +	struct test_client *client = container_of(work, typeof(*client), work);
> +	const long sync_wait = MAX_SCHEDULE_TIMEOUT;
> +	unsigned int cycle, work_us, period_us;
> +	struct drm_mock_sched_job *job = NULL;
> +
> +	work_us = client->params.job_cnt * client->params.job_us;
> +	period_us = work_us + client->params.wait_us;
> +	client->cycles =
> +		DIV_ROUND_UP((unsigned int)ktime_to_us(client->duration),
> +			     period_us);
> +	client->ideal_duration = us_to_ktime(client->cycles * period_us);
> +
> +	client->start = ktime_get();
> +
> +	for (cycle = 0; cycle < client->cycles; cycle++) {
> +		unsigned int batch;
> +		unsigned long us;
> +		ktime_t t;

't' is used at a couple of places below, including when firing the
WARN_ON_ONCE. I think the name should reveal more clearly what time is
actually measured here, so that it gets more obvious for what
conditions you're checking below.

> +
> +		if (READ_ONCE(client->done))
> +			break;
> +
> +		t = ktime_get();
> +		for (batch = 0; batch < client->params.job_cnt; batch++) {
> +			job = drm_mock_sched_job_new(client->test,
> +						     client->entity);
> +			drm_mock_sched_job_set_duration_us(job,
> +							   client->params.job_us);
> +			drm_mock_sched_job_submit(job);
> +		}
> +
> +		if (client->params.sync)
> +			drm_mock_sched_job_wait_finished(job, sync_wait);
> +
> +		t = ktime_sub(ktime_get(), t);
> +		us = ktime_to_us(t);
> +		update_stats(&client->cycle_time, us);
> +		if (ktime_to_us(t) >= (long)work_us)
> +			us = ktime_to_us(t) - work_us;
> +		else if (WARN_ON_ONCE(client->params.sync))

See above. For me as a new reader it reads like this:

"if it did not take longer than it should have taken, and if this is
synchronous, then that's a bug"

> +			us = 0;
> +		update_stats(&client->latency_time, us);
> +		WRITE_ONCE(client->cycle, cycle);
> +
> +		if (READ_ONCE(client->done))
> +			break;
> +
> +		if (client->params.wait_us)
> +			fsleep(client->params.wait_us);
> +		else
> +			cond_resched();

This if-else definitely needs a comment explaining what is being done
and why. Why do you volunteer the CPU here? The function is about to
end anyways.

> +	}
> +
> +	client->done = drm_mock_sched_job_wait_finished(job, sync_wait);
> +	client->end = ktime_get();
> +}
> +
> +static const char *prio_str(enum drm_sched_priority prio)
> +{
> +	switch (prio) {
> +	case DRM_SCHED_PRIORITY_KERNEL:
> +		return "kernel";
> +	case DRM_SCHED_PRIORITY_LOW:
> +		return "low";
> +	case DRM_SCHED_PRIORITY_NORMAL:
> +		return "normal";
> +	case DRM_SCHED_PRIORITY_HIGH:
> +		return "high";
> +	default:
> +		return "???";

s/???/INVALID

> +	}
> +}
> +
> +static void drm_sched_scheduler_two_clients_test(struct kunit *test)
> +{
> +	const struct drm_sched_test_params *params = test->param_value;
> +	struct drm_mock_scheduler *sched = test->priv;
> +	struct test_client client[2] = { };
> +	unsigned int prev_cycle[2] = { };
> +	unsigned int i, j;
> +	ktime_t start;
> +
> +	/*
> +	 * Same job stream from from two clients.

Typo, "from from"

btw it's not super clear what's being tested here by the description /
names alone. You did test with two entities above in the ping-pong
example already. So I suppose a "client" is a party owning multiple
entities?

> +	 */
> +
> +	for (i = 0; i < ARRAY_SIZE(client); i++)
> +		client[i].entity =
> +			drm_mock_sched_entity_new(test,
> +						  params->client[i].priority,
> +						  sched);
> +
> +	for (i = 0; i < ARRAY_SIZE(client); i++) {
> +		client[i].test = test;
> +		client[i].id = i;
> +		client[i].duration = ms_to_ktime(1000);
> +		client[i].params = params->client[i];
> +		client[i].cycle_time.min_us = ~0U;
> +		client[i].latency_time.min_us = ~0U;
> +		client[i].worker =
> +			kthread_create_worker(0, "%s-%u", __func__, i);
> +		if (IS_ERR(client[i].worker)) {
> +			for (j = 0; j < i; j++)
> +				kthread_destroy_worker(client[j].worker);
> +			KUNIT_FAIL(test, "Failed to create worker!\n");
> +		}
> +
> +		kthread_init_work(&client[i].work, drm_sched_client_work);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(client); i++)
> +		kthread_queue_work(client[i].worker, &client[i].work);
> +
> +	/*
> +	 * The clients (workers) can be a mix of async (deep submission queue),
> +	 * sync (one job at a time), or something in between. Therefore it is
> +	 * difficult to display a single metric representing their progress.
> +	 *
> +	 * Each struct drm_sched_client_params describes the actual submission
> +	 * pattern which happens in the following steps:
> +	 *  1. Submit N jobs
> +	 *  2. Wait for last submitted job to finish
> +	 *  3. Sleep for U micro-seconds
> +	 *  4. Goto 1. for C cycles

OK, so this is simulating clients submitting in a burst-like pattern.
Could it make sense to name the test something with "burst"?

> +	 *
> +	 * Where number of cycles is calculated to match the target client
> +	 * duration from the respective struct drm_sched_test_params.
> +	 *
> +	 * To assess scheduling behaviour, what we output for both clients is:
> +	 *  - pct: Percentage progress of the jobs submitted
> +	 *  - cps: "Cycles" per second (where one cycle is one 1.-4. above)

"is one complete submission pattern from the list above"

> +	 *  -  qd: Number of outstanding jobs in the client/entity
> +	 */
> +
> +	start = ktime_get();
> +	pr_info("%s:\n\t            pct1 cps1 qd1;  pct2 cps2 qd2\n",
> +		params->description);
> +	while (!READ_ONCE(client[0].done) || !READ_ONCE(client[1].done)) {

You probably want to write in the documentation of struct test_client what
'done' does and who sets and reads it. Then it becomes more obvious why
a READ_ONCE is necessary.

> +		unsigned int pct[2], qd[2], cycle[2], cps[2];
> +
> +		for (i = 0; i < ARRAY_SIZE(client); i++) {
> +			qd[i] = spsc_queue_count(&client[i].entity->base.job_queue);
> +			cycle[i] = READ_ONCE(client[i].cycle);
> +			cps[i] = DIV_ROUND_UP(1000 * (cycle[i] - prev_cycle[i]),

Now you've lost me. All numbers so far always were microseconds (us).
To get the cycles per second, don't you have to multiply by 1,000,000?

> +					      100);
> +			if (client[i].cycles)
> +				pct[i] = DIV_ROUND_UP(100 * (1 + cycle[i]),

Why 100 (here and above) and why +1?


--

I want to go through the other patches this week, too. I guess I'll get
up to speed with the code better in general.


P.

> +						      client[i].cycles);
> +			else
> +				pct[i] = 0;
> +			prev_cycle[i] = cycle[i];
> +		}
> +
> +		if (READ_ONCE(client[0].done))
> +			pr_info("\t+%6lldms:               ; %3u %5u %4u\n",
> +				ktime_to_ms(ktime_sub(ktime_get(), start)),
> +				pct[1], cps[1], qd[1]);
> +		else if (READ_ONCE(client[1].done))
> +			pr_info("\t+%6lldms: %3u %5u %4u;\n",
> +				ktime_to_ms(ktime_sub(ktime_get(), start)),
> +				pct[0], cps[0], qd[0]);
> +		else
> +			pr_info("\t+%6lldms: %3u %5u %4u; %3u %5u %4u\n",
> +				ktime_to_ms(ktime_sub(ktime_get(), start)),
> +				pct[0], cps[0], qd[0],
> +				pct[1], cps[1], qd[1]);
> +		msleep(100);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(client); i++) {
> +		kthread_flush_work(&client[i].work);
> +		kthread_destroy_worker(client[i].worker);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(client); i++)
> +		KUNIT_ASSERT_TRUE(test, client[i].done);
> +
> +	for (i = 0; i < ARRAY_SIZE(client); i++) {
> +		pr_info("    %u: prio=%s sync=%u elapsed_ms=%lldms (ideal_ms=%lldms) cycle_time(min,avg,max)=%u,%u,%u us latency_time(min,avg,max)=%u,%u,%u us",
> +			i,
> +			prio_str(params->client[i].priority),
> +			params->client[i].sync,
> +			ktime_to_ms(ktime_sub(client[i].end, client[i].start)),
> +			ktime_to_ms(client[i].ideal_duration),
> +			client[i].cycle_time.min_us,
> +			get_stats_avg(&client[i].cycle_time, client[i].cycles),
> +			client[i].cycle_time.max_us,
> +			client[i].latency_time.min_us,
> +			get_stats_avg(&client[i].latency_time, client[i].cycles),
> +			client[i].latency_time.max_us);
> +		drm_mock_sched_entity_free(client[i].entity);
> +	}
> +}
> +
> +static const struct kunit_attributes drm_sched_scheduler_two_clients_attr = {
> +	.speed = KUNIT_SPEED_SLOW,
> +};
> +
> +static struct kunit_case drm_sched_scheduler_two_clients_tests[] = {
> +	KUNIT_CASE_PARAM_ATTR(drm_sched_scheduler_two_clients_test,
> +			      drm_sched_scheduler_two_clients_gen_params,
> +			      drm_sched_scheduler_two_clients_attr),
> +	{}
> +};
> +
> +static struct kunit_suite drm_sched_scheduler_two_clients1 = {
> +	.name = "drm_sched_scheduler_two_clients_one_credit_tests",
> +	.init = drm_sched_scheduler_init,
> +	.exit = drm_sched_scheduler_exit,
> +	.test_cases = drm_sched_scheduler_two_clients_tests,
> +};
> +
> +static struct kunit_suite drm_sched_scheduler_two_clients2 = {
> +	.name = "drm_sched_scheduler_two_clients_two_credits_tests",
> +	.init = drm_sched_scheduler_init2,
> +	.exit = drm_sched_scheduler_exit,
> +	.test_cases = drm_sched_scheduler_two_clients_tests,
> +};
> +
> +kunit_test_suites(&drm_sched_scheduler_overhead,
> +		  &drm_sched_scheduler_two_clients1,
> +		  &drm_sched_scheduler_two_clients2);



* Re: [RFC v8 05/12] drm/sched: Move run queue related code into a separate file
  2025-09-03 10:18 ` [RFC v8 05/12] drm/sched: Move run queue related code into a separate file Tvrtko Ursulin
@ 2025-09-24  7:54   ` Philipp Stanner
  0 siblings, 0 replies; 35+ messages in thread
From: Philipp Stanner @ 2025-09-24  7:54 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> Lets move all the code dealing with struct drm_sched_rq into a separate
> compilation unit. Advantage being sched_main.c is left with a clearer set
> of responsibilities.

It sounds reasonable to explicitly list those responsibilities. Maybe
even establish them in comments at the top of all C files (preferably in
a separate patch following this one).
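
E.g. something like this at the top of the new file (just a sketch):

	/*
	 * sched_rq.c - run queue management for the DRM GPU scheduler:
	 * run queue init, entity add / remove / pop bookkeeping and
	 * entity selection.
	 */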

P.

> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/Makefile         |   2 +-
>  drivers/gpu/drm/scheduler/sched_internal.h |   7 +
>  drivers/gpu/drm/scheduler/sched_main.c     | 218 +-------------------
>  drivers/gpu/drm/scheduler/sched_rq.c       | 222 +++++++++++++++++++++
>  4 files changed, 232 insertions(+), 217 deletions(-)
>  create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c
> 
> diff --git a/drivers/gpu/drm/scheduler/Makefile b/drivers/gpu/drm/scheduler/Makefile
> index 6e13e4c63e9d..74e75eff6df5 100644
> --- a/drivers/gpu/drm/scheduler/Makefile
> +++ b/drivers/gpu/drm/scheduler/Makefile
> @@ -20,7 +20,7 @@
>  # OTHER DEALINGS IN THE SOFTWARE.
>  #
>  #
> -gpu-sched-y := sched_main.o sched_fence.o sched_entity.o
> +gpu-sched-y := sched_main.o sched_fence.o sched_entity.o sched_rq.o
>  
>  obj-$(CONFIG_DRM_SCHED) += gpu-sched.o
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> index 8269c5392a82..703ee48fbc58 100644
> --- a/drivers/gpu/drm/scheduler/sched_internal.h
> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> @@ -10,8 +10,15 @@ extern int drm_sched_policy;
>  #define DRM_SCHED_POLICY_RR    0
>  #define DRM_SCHED_POLICY_FIFO  1
>  
> +bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> +			 struct drm_sched_entity *entity);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>  
> +void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> +		       struct drm_sched_rq *rq);
> +struct drm_sched_entity *
> +drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> +			   struct drm_sched_rq *rq);
>  struct drm_gpu_scheduler *
>  drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
>  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index c53931e63458..97bec2f86502 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -112,8 +112,8 @@ static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>   * Return true if we can push at least one more job from @entity, false
>   * otherwise.
>   */
> -static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> -				struct drm_sched_entity *entity)
> +bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> +			 struct drm_sched_entity *entity)
>  {
>  	struct drm_sched_job *s_job;
>  
> @@ -133,220 +133,6 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>  	return drm_sched_available_credits(sched) >= s_job->credits;
>  }
>  
> -static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
> -							    const struct rb_node *b)
> -{
> -	struct drm_sched_entity *ent_a =  rb_entry((a), struct drm_sched_entity, rb_tree_node);
> -	struct drm_sched_entity *ent_b =  rb_entry((b), struct drm_sched_entity, rb_tree_node);
> -
> -	return ktime_before(ent_a->oldest_job_waiting, ent_b->oldest_job_waiting);
> -}
> -
> -static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
> -					    struct drm_sched_rq *rq)
> -{
> -	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
> -		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
> -		RB_CLEAR_NODE(&entity->rb_tree_node);
> -	}
> -}
> -
> -static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> -					    struct drm_sched_rq *rq,
> -					    ktime_t ts)
> -{
> -	/*
> -	 * Both locks need to be grabbed, one to protect from entity->rq change
> -	 * for entity from within concurrent drm_sched_entity_select_rq and the
> -	 * other to update the rb tree structure.
> -	 */
> -	lockdep_assert_held(&entity->lock);
> -	lockdep_assert_held(&rq->lock);
> -
> -	drm_sched_rq_remove_fifo_locked(entity, rq);
> -
> -	entity->oldest_job_waiting = ts;
> -
> -	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> -		      drm_sched_entity_compare_before);
> -}
> -
> -/**
> - * drm_sched_rq_init - initialize a given run queue struct
> - *
> - * @sched: scheduler instance to associate with this run queue
> - * @rq: scheduler run queue
> - *
> - * Initializes a scheduler runqueue.
> - */
> -static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> -			      struct drm_sched_rq *rq)
> -{
> -	spin_lock_init(&rq->lock);
> -	INIT_LIST_HEAD(&rq->entities);
> -	rq->rb_tree_root = RB_ROOT_CACHED;
> -	rq->sched = sched;
> -}
> -
> -/**
> - * drm_sched_rq_add_entity - add an entity
> - *
> - * @entity: scheduler entity
> - * @ts: submission timestamp
> - *
> - * Adds a scheduler entity to the run queue.
> - *
> - * Returns a DRM scheduler pre-selected to handle this entity.
> - */
> -struct drm_gpu_scheduler *
> -drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
> -{
> -	struct drm_gpu_scheduler *sched;
> -	struct drm_sched_rq *rq;
> -
> -	/* Add the entity to the run queue */
> -	spin_lock(&entity->lock);
> -	if (entity->stopped) {
> -		spin_unlock(&entity->lock);
> -
> -		DRM_ERROR("Trying to push to a killed entity\n");
> -		return NULL;
> -	}
> -
> -	rq = entity->rq;
> -	spin_lock(&rq->lock);
> -	sched = rq->sched;
> -
> -	if (list_empty(&entity->list)) {
> -		atomic_inc(sched->score);
> -		list_add_tail(&entity->list, &rq->entities);
> -	}
> -
> -	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
> -		ts = entity->rr_ts;
> -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> -
> -	spin_unlock(&rq->lock);
> -	spin_unlock(&entity->lock);
> -
> -	return sched;
> -}
> -
> -/**
> - * drm_sched_rq_remove_entity - remove an entity
> - *
> - * @rq: scheduler run queue
> - * @entity: scheduler entity
> - *
> - * Removes a scheduler entity from the run queue.
> - */
> -void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> -				struct drm_sched_entity *entity)
> -{
> -	lockdep_assert_held(&entity->lock);
> -
> -	if (list_empty(&entity->list))
> -		return;
> -
> -	spin_lock(&rq->lock);
> -
> -	atomic_dec(rq->sched->score);
> -	list_del_init(&entity->list);
> -
> -	drm_sched_rq_remove_fifo_locked(entity, rq);
> -
> -	spin_unlock(&rq->lock);
> -}
> -
> -static ktime_t
> -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> -{
> -	ktime_t ts;
> -
> -	lockdep_assert_held(&entity->lock);
> -	lockdep_assert_held(&rq->lock);
> -
> -	ts = ktime_add_ns(rq->rr_ts, 1);
> -	entity->rr_ts = ts;
> -	rq->rr_ts = ts;
> -
> -	return ts;
> -}
> -
> -/**
> - * drm_sched_rq_pop_entity - pops an entity
> - *
> - * @entity: scheduler entity
> - *
> - * To be called every time after a job is popped from the entity.
> - */
> -void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> -{
> -	struct drm_sched_job *next_job;
> -	struct drm_sched_rq *rq;
> -	ktime_t ts;
> -
> -	/*
> -	 * Update the entity's location in the min heap according to
> -	 * the timestamp of the next job, if any.
> -	 */
> -	next_job = drm_sched_entity_queue_peek(entity);
> -	if (!next_job)
> -		return;
> -
> -	spin_lock(&entity->lock);
> -	rq = entity->rq;
> -	spin_lock(&rq->lock);
> -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> -		ts = next_job->submit_ts;
> -	else
> -		ts = drm_sched_rq_get_rr_ts(rq, entity);
> -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> -	spin_unlock(&rq->lock);
> -	spin_unlock(&entity->lock);
> -}
> -
> -/**
> - * drm_sched_rq_select_entity - Select an entity which provides a job to run
> - *
> - * @sched: the gpu scheduler
> - * @rq: scheduler run queue to check.
> - *
> - * Find oldest waiting ready entity.
> - *
> - * Return an entity if one is found; return an error-pointer (!NULL) if an
> - * entity was ready, but the scheduler had insufficient credits to accommodate
> - * its job; return NULL, if no ready entity was found.
> - */
> -static struct drm_sched_entity *
> -drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> -			   struct drm_sched_rq *rq)
> -{
> -	struct rb_node *rb;
> -
> -	spin_lock(&rq->lock);
> -	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
> -		struct drm_sched_entity *entity;
> -
> -		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> -		if (drm_sched_entity_is_ready(entity)) {
> -			/* If we can't queue yet, preserve the current entity in
> -			 * terms of fairness.
> -			 */
> -			if (!drm_sched_can_queue(sched, entity)) {
> -				spin_unlock(&rq->lock);
> -				return ERR_PTR(-ENOSPC);
> -			}
> -
> -			reinit_completion(&entity->entity_idle);
> -			break;
> -		}
> -	}
> -	spin_unlock(&rq->lock);
> -
> -	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
> -}
> -
>  /**
>   * drm_sched_run_job_queue - enqueue run-job work
>   * @sched: scheduler instance
> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> new file mode 100644
> index 000000000000..16d57461765e
> --- /dev/null
> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> @@ -0,0 +1,222 @@
> +#include <linux/rbtree.h>
> +
> +#include <drm/drm_print.h>
> +#include <drm/gpu_scheduler.h>
> +
> +#include "sched_internal.h"
> +
> +static __always_inline bool
> +drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
> +{
> +	struct drm_sched_entity *ea =
> +		rb_entry((a), struct drm_sched_entity, rb_tree_node);
> +	struct drm_sched_entity *eb =
> +		rb_entry((b), struct drm_sched_entity, rb_tree_node);
> +
> +	return ktime_before(ea->oldest_job_waiting, eb->oldest_job_waiting);
> +}
> +
> +static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
> +					    struct drm_sched_rq *rq)
> +{
> +	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
> +		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
> +		RB_CLEAR_NODE(&entity->rb_tree_node);
> +	}
> +}
> +
> +static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> +					    struct drm_sched_rq *rq,
> +					    ktime_t ts)
> +{
> +	/*
> +	 * Both locks need to be grabbed, one to protect from entity->rq change
> +	 * for entity from within concurrent drm_sched_entity_select_rq and the
> +	 * other to update the rb tree structure.
> +	 */
> +	lockdep_assert_held(&entity->lock);
> +	lockdep_assert_held(&rq->lock);
> +
> +	drm_sched_rq_remove_fifo_locked(entity, rq);
> +
> +	entity->oldest_job_waiting = ts;
> +
> +	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> +		      drm_sched_entity_compare_before);
> +}
> +
> +/**
> + * drm_sched_rq_init - initialize a given run queue struct
> + *
> + * @sched: scheduler instance to associate with this run queue
> + * @rq: scheduler run queue
> + *
> + * Initializes a scheduler runqueue.
> + */
> +void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> +		       struct drm_sched_rq *rq)
> +{
> +	spin_lock_init(&rq->lock);
> +	INIT_LIST_HEAD(&rq->entities);
> +	rq->rb_tree_root = RB_ROOT_CACHED;
> +	rq->sched = sched;
> +}
> +
> +/**
> + * drm_sched_rq_add_entity - add an entity
> + *
> + * @entity: scheduler entity
> + * @ts: submission timestamp
> + *
> + * Adds a scheduler entity to the run queue.
> + *
> + * Returns a DRM scheduler pre-selected to handle this entity.
> + */
> +struct drm_gpu_scheduler *
> +drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
> +{
> +	struct drm_gpu_scheduler *sched;
> +	struct drm_sched_rq *rq;
> +
> +	/* Add the entity to the run queue */
> +	spin_lock(&entity->lock);
> +	if (entity->stopped) {
> +		spin_unlock(&entity->lock);
> +
> +		DRM_ERROR("Trying to push to a killed entity\n");
> +		return NULL;
> +	}
> +
> +	rq = entity->rq;
> +	spin_lock(&rq->lock);
> +	sched = rq->sched;
> +
> +	if (list_empty(&entity->list)) {
> +		atomic_inc(sched->score);
> +		list_add_tail(&entity->list, &rq->entities);
> +	}
> +
> +	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
> +		ts = entity->rr_ts;
> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +
> +	spin_unlock(&rq->lock);
> +	spin_unlock(&entity->lock);
> +
> +	return sched;
> +}
> +
> +/**
> + * drm_sched_rq_remove_entity - remove an entity
> + *
> + * @rq: scheduler run queue
> + * @entity: scheduler entity
> + *
> + * Removes a scheduler entity from the run queue.
> + */
> +void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> +				struct drm_sched_entity *entity)
> +{
> +	lockdep_assert_held(&entity->lock);
> +
> +	if (list_empty(&entity->list))
> +		return;
> +
> +	spin_lock(&rq->lock);
> +
> +	atomic_dec(rq->sched->score);
> +	list_del_init(&entity->list);
> +
> +	drm_sched_rq_remove_fifo_locked(entity, rq);
> +
> +	spin_unlock(&rq->lock);
> +}
> +
> +static ktime_t
> +drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> +{
> +	ktime_t ts;
> +
> +	lockdep_assert_held(&entity->lock);
> +	lockdep_assert_held(&rq->lock);
> +
> +	ts = ktime_add_ns(rq->rr_ts, 1);
> +	entity->rr_ts = ts;
> +	rq->rr_ts = ts;
> +
> +	return ts;
> +}
> +
> +/**
> + * drm_sched_rq_pop_entity - pops an entity
> + *
> + * @entity: scheduler entity
> + *
> + * To be called every time after a job is popped from the entity.
> + */
> +void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *next_job;
> +	struct drm_sched_rq *rq;
> +	ktime_t ts;
> +
> +	/*
> +	 * Update the entity's location in the min heap according to
> +	 * the timestamp of the next job, if any.
> +	 */
> +	next_job = drm_sched_entity_queue_peek(entity);
> +	if (!next_job)
> +		return;
> +
> +	spin_lock(&entity->lock);
> +	rq = entity->rq;
> +	spin_lock(&rq->lock);
> +	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> +		ts = next_job->submit_ts;
> +	else
> +		ts = drm_sched_rq_get_rr_ts(rq, entity);
> +	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +	spin_unlock(&rq->lock);
> +	spin_unlock(&entity->lock);
> +}
> +
> +/**
> + * drm_sched_rq_select_entity - Select an entity which provides a job to run
> + *
> + * @sched: the gpu scheduler
> + * @rq: scheduler run queue to check.
> + *
> + * Find oldest waiting ready entity.
> + *
> + * Return an entity if one is found; return an error-pointer (!NULL) if an
> + * entity was ready, but the scheduler had insufficient credits to accommodate
> + * its job; return NULL, if no ready entity was found.
> + */
> +struct drm_sched_entity *
> +drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> +			   struct drm_sched_rq *rq)
> +{
> +	struct rb_node *rb;
> +
> +	spin_lock(&rq->lock);
> +	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
> +		struct drm_sched_entity *entity;
> +
> +		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> +		if (drm_sched_entity_is_ready(entity)) {
> +			/* If we can't queue yet, preserve the current entity in
> +			 * terms of fairness.
> +			 */
> +			if (!drm_sched_can_queue(sched, entity)) {
> +				spin_unlock(&rq->lock);
> +				return ERR_PTR(-ENOSPC);
> +			}
> +
> +			reinit_completion(&entity->entity_idle);
> +			break;
> +		}
> +	}
> +	spin_unlock(&rq->lock);
> +
> +	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
> +}


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 06/12] drm/sched: Free all finished jobs at once
  2025-09-03 10:18 ` [RFC v8 06/12] drm/sched: Free all finished jobs at once Tvrtko Ursulin
@ 2025-09-24  8:07   ` Philipp Stanner
  0 siblings, 0 replies; 35+ messages in thread
From: Philipp Stanner @ 2025-09-24  8:07 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> To implement fair scheduling we will need as accurate a view as possible
> into per entity GPU time utilisation. Because sched fence execution times
> are only adjusted for accuracy in the free worker, we need to process
> completed jobs as soon as possible so the metric is as up to date as
> possible when viewed from the submission side of things.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Looks like a good patch to me.

> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 13 ++-----------
>  1 file changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 97bec2f86502..9411676e772a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -906,7 +906,6 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>   * drm_sched_get_finished_job - fetch the next finished job to be destroyed
>   *
>   * @sched: scheduler instance
> - * @have_more: are there more finished jobs on the list
>   *
>   * Informs the caller through @have_more whether there are more finished jobs
>   * besides the returned one.
> @@ -915,7 +914,7 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>   * ready for it to be destroyed.
>   */
>  static struct drm_sched_job *
> -drm_sched_get_finished_job(struct drm_gpu_scheduler *sched, bool *have_more)
> +drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
>  {
>  	struct drm_sched_job *job, *next;
>  
> @@ -930,7 +929,6 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched, bool *have_more)
>  		/* cancel this job's TO timer */
>  		cancel_delayed_work(&sched->work_tdr);
>  
> -		*have_more = false;

That have_more mechanism was introduced by you, Tvrtko, in
36caa026b28a4, to avoid a double lock. AFAICS this patch does not
revert the advantage of that older improvement.

So if you think that's fine there are no objections from my side.

The only thing I am wondering about is whether
drm_sched_free_job_work() might be running unnecessarily often after
this patch. It's kicked off through drm_sched_job_done() in various
places. I tend to think that investigating optimizations there can be
left for future work, considering that it's running in a workqueue,
but please share your opinion on that if you have one.


P.

>  		next = list_first_entry_or_null(&sched->pending_list,
>  						typeof(*next), list);
>  		if (next) {
> @@ -940,8 +938,6 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched, bool *have_more)
>  				next->s_fence->scheduled.timestamp =
>  					dma_fence_timestamp(&job->s_fence->finished);
>  
> -			*have_more = dma_fence_is_signaled(&next->s_fence->finished);
> -
>  			/* start TO timer for next job */
>  			drm_sched_start_timeout(sched);
>  		}
> @@ -1000,14 +996,9 @@ static void drm_sched_free_job_work(struct work_struct *w)
>  	struct drm_gpu_scheduler *sched =
>  		container_of(w, struct drm_gpu_scheduler, work_free_job);
>  	struct drm_sched_job *job;
> -	bool have_more;
>  
> -	job = drm_sched_get_finished_job(sched, &have_more);
> -	if (job) {
> +	while ((job = drm_sched_get_finished_job(sched)))
>  		sched->ops->free_job(job);
> -		if (have_more)
> -			drm_sched_run_free_queue(sched);
> -	}
>  
>  	drm_sched_run_job_queue(sched);
>  }


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 07/12] drm/sched: Account entity GPU time
  2025-09-03 10:18 ` [RFC v8 07/12] drm/sched: Account entity GPU time Tvrtko Ursulin
@ 2025-09-24  9:11   ` Philipp Stanner
  2025-09-25 11:52     ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-24  9:11 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> To implement fair scheduling we need a view into the GPU time consumed by
> entities. The problem we have is that job and entity objects have decoupled
> lifetimes: at the point where we have an accurate view of the GPU time, we
> can no longer link back to the entity.
> 
> Solve this by adding a lightweight entity stats object which is reference
> counted by both the entity and the job, and hence can safely be used from
> either side.
> 
> With that, the only other thing we need is a helper for adding the job's
> GPU time into the respective entity stats object, called once the
> accurate GPU time has been calculated.

This in general also looks reasonable and clean to me. Some comments
below.

btw I've got many nit-comments for all patches, but I will hold those
back until a v1 is posted.

> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c   | 39 +++++++++++++
>  drivers/gpu/drm/scheduler/sched_internal.h | 66 ++++++++++++++++++++++
>  drivers/gpu/drm/scheduler/sched_main.c     |  6 +-
>  include/drm/gpu_scheduler.h                | 12 ++++
>  4 files changed, 122 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 7a0a52ba87bf..04ce8b7d436b 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -32,6 +32,39 @@
>  
>  #include "gpu_scheduler_trace.h"
>  
> +
> +/**
> + * drm_sched_entity_stats_release - Entity stats kref release function
> + *
> + * @kref: Entity stats embedded kref pointer
> + */
> +void drm_sched_entity_stats_release(struct kref *kref)
> +{
> +	struct drm_sched_entity_stats *stats =
> +		container_of(kref, typeof(*stats), kref);
> +
> +	kfree(stats);
> +}
> +
> +/**
> + * drm_sched_entity_stats_alloc - Allocate a new struct drm_sched_entity_stats object
> + *
> + * Returns: Pointer to newly allocated struct drm_sched_entity_stats object.
> + */
> +static struct drm_sched_entity_stats *drm_sched_entity_stats_alloc(void)
> +{
> +	struct drm_sched_entity_stats *stats;
> +
> +	stats = kzalloc(sizeof(*stats), GFP_KERNEL);
> +	if (!stats)
> +		return NULL;
> +
> +	kref_init(&stats->kref);
> +	spin_lock_init(&stats->lock);
> +
> +	return stats;
> +}
> +
>  /**
>   * drm_sched_entity_init - Init a context entity used by scheduler when
>   * submit to HW ring.
> @@ -65,6 +98,11 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  		return -EINVAL;
>  
>  	memset(entity, 0, sizeof(struct drm_sched_entity));
> +
> +	entity->stats = drm_sched_entity_stats_alloc();
> +	if (!entity->stats)
> +		return -ENOMEM;
> +
>  	INIT_LIST_HEAD(&entity->list);
>  	entity->rq = NULL;
>  	entity->guilty = guilty;
> @@ -338,6 +376,7 @@ void drm_sched_entity_fini(struct drm_sched_entity *entity)
>  
>  	dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
>  	RCU_INIT_POINTER(entity->last_scheduled, NULL);
> +	drm_sched_entity_stats_put(entity->stats);
>  }
>  EXPORT_SYMBOL(drm_sched_entity_fini);
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> index 703ee48fbc58..27c8460a3601 100644
> --- a/drivers/gpu/drm/scheduler/sched_internal.h
> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> @@ -3,6 +3,22 @@
>  #ifndef _DRM_GPU_SCHEDULER_INTERNAL_H_
>  #define _DRM_GPU_SCHEDULER_INTERNAL_H_
>  
> +#include <linux/ktime.h>
> +#include <linux/kref.h>
> +#include <linux/spinlock.h>
> +
> +/**
> + * struct drm_sched_entity_stats - execution stats for an entity.
> + *
> + * @kref: reference count for the object.
> + * @lock: lock guarding the @runtime updates.
> + * @runtime: time entity spent on the GPU.
> + */
> +struct drm_sched_entity_stats {

Cool docu.

Though, considering that awkward refcounting patterns are one of the
more relevant scheduler problems, I believe it makes sense to add a
sentence or two to the docu here about who takes references on this
object and for what reason.
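
E.g. something like (my wording):

  /*
   * One reference is held by the entity itself (dropped in
   * drm_sched_entity_fini()) and one by each armed job (dropped in
   * drm_sched_job_cleanup()), so the stats remain valid for whichever
   * side outlives the other.
   */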

> +	struct kref	kref;
> +	spinlock_t	lock;
> +	ktime_t		runtime;
> +};
>  
>  /* Used to choose between FIFO and RR job-scheduling */
>  extern int drm_sched_policy;
> @@ -93,4 +109,54 @@ drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>  	return true;
>  }
>  
> +void drm_sched_entity_stats_release(struct kref *kref);
> +
> +/**
> + * drm_sched_entity_stats_get - Obtain a reference count on struct drm_sched_entity_stats object
> + *
> + * @stats: struct drm_sched_entity_stats pointer
> + *
> + * Returns: struct drm_sched_entity_stats pointer
> + */
> +static inline struct drm_sched_entity_stats *
> +drm_sched_entity_stats_get(struct drm_sched_entity_stats *stats)
> +{
> +	kref_get(&stats->kref);
> +
> +	return stats;
> +}
> +
> +/**
> + * drm_sched_entity_stats_put - Release a reference count on struct drm_sched_entity_stats object
> + *
> + * @stats: struct drm_sched_entity_stats pointer
> + */
> +static inline void
> +drm_sched_entity_stats_put(struct drm_sched_entity_stats *stats)
> +{
> +	kref_put(&stats->kref, drm_sched_entity_stats_release);
> +}
> +
> +/**
> + * drm_sched_entity_stats_job_add_gpu_time - Account job execution time to entity
> + *
> + * @job: Scheduler job to account.
> + *
> + * Accounts the execution time of @job to its respective entity stats object.
> + */
> +static inline void
> +drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
> +{
> +	struct drm_sched_entity_stats *stats = job->entity_stats;
> +	struct drm_sched_fence *s_fence = job->s_fence;
> +	ktime_t start, end;
> +
> +	start = dma_fence_timestamp(&s_fence->scheduled);
> +	end = dma_fence_timestamp(&s_fence->finished);
> +
> +	spin_lock(&stats->lock);
> +	stats->runtime = ktime_add(stats->runtime, ktime_sub(end, start));
> +	spin_unlock(&stats->lock);
> +}
> +
>  #endif
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 9411676e772a..a5d7706efbea 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -658,6 +658,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>  
>  	job->sched = sched;
>  	job->s_priority = entity->priority;
> +	job->entity_stats = drm_sched_entity_stats_get(entity->stats);
>  
>  	drm_sched_fence_init(job->s_fence, job->entity);
>  }
> @@ -846,6 +847,7 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
>  		 * been called.
>  		 */
>  		dma_fence_put(&job->s_fence->finished);
> +		drm_sched_entity_stats_put(job->entity_stats);
>  	} else {
>  		/* The job was aborted before it has been committed to be run;
>  		 * notably, drm_sched_job_arm() has not been called.
> @@ -997,8 +999,10 @@ static void drm_sched_free_job_work(struct work_struct *w)
>  		container_of(w, struct drm_gpu_scheduler, work_free_job);
>  	struct drm_sched_job *job;
>  
> -	while ((job = drm_sched_get_finished_job(sched)))
> +	while ((job = drm_sched_get_finished_job(sched))) {
> +		drm_sched_entity_stats_job_add_gpu_time(job);
>  		sched->ops->free_job(job);
> +	}

The biggest question I'm wondering about is whether the reference
should be put in drm_sched_job_cleanup() or drm_sched_free_job_work().
The former is semantically more correct, probably, but the latter is
always controlled by the scheduler.

Hypothetically a driver could just not call drm_sched_job_cleanup() in
its free_job() callback, but then other stuff would leak as well and
everything would be broken anyway.

I should probably change the docu for drm_sched_job_cleanup(), which
currently says drivers "should" call it in free_job(), to say "must"
instead.


P.

>  
>  	drm_sched_run_job_queue(sched);
>  }
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 802a0060db55..f33c78473867 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -71,6 +71,8 @@ enum drm_sched_priority {
>  	DRM_SCHED_PRIORITY_COUNT
>  };
>  
> +struct drm_sched_entity_stats;
> +
>  /**
>   * struct drm_sched_entity - A wrapper around a job queue (typically
>   * attached to the DRM file_priv).
> @@ -110,6 +112,11 @@ struct drm_sched_entity {
>  	 */
>  	struct drm_sched_rq		*rq;
>  
> +	/**
> +	 * @stats: Stats object reference held by the entity and jobs.
> +	 */
> +	struct drm_sched_entity_stats	*stats;
> +
>  	/**
>  	 * @sched_list:
>  	 *
> @@ -365,6 +372,11 @@ struct drm_sched_job {
>  	struct drm_sched_fence		*s_fence;
>  	struct drm_sched_entity         *entity;
>  
> +	/**
> +	 * @entity_stats: Stats object reference held by the job and entity.
> +	 */
> +	struct drm_sched_entity_stats	*entity_stats;
> +
>  	enum drm_sched_priority		s_priority;
>  	u32				credits;
>  	/** @last_dependency: tracks @dependencies as they signal */


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 08/12] drm/sched: Remove idle entity from tree
  2025-09-11 15:06     ` Tvrtko Ursulin
@ 2025-09-24  9:15       ` Philipp Stanner
  2025-09-25 12:23         ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-24  9:15 UTC (permalink / raw)
  To: Tvrtko Ursulin, phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost

On Thu, 2025-09-11 at 16:06 +0100, Tvrtko Ursulin wrote:
> 
> On 11/09/2025 15:32, Philipp Stanner wrote:
> > On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> > > There is no need to keep entities with no jobs in the tree, so let's
> > > remove an entity once its last job is consumed. This keeps the tree
> > > smaller, which is nicer and more efficient, as entities are removed and
> > > re-added on every popped job.
> > 
> > This reads suspiciously as if it could be an independent patch, not
> > necessarily tied to this series. I see it depends on the _pop()
> > function you added.
> > 
> > I think you'd want to make it a bit more obvious that this is not so
> > much a general improvement as it is a preparation for followup work. Or
> > could it be made generic for the current in-tree scheduler?
> 
> Both are true. There is currently no reason idle entities _need_ to be in
> the tree. Removing them keeps the tree smaller and hence its O(log n)
> operations cheaper. But the fair scheduler also relies on it, see below...
> 
> > > Apart from that, the upcoming fair scheduling algorithm will rely on the
> > > tree only containing runnable entities.
> 
> ... ^^^ here.

Yes, I saw that.

I wanted to stress where I'm coming from: generic code improvements
should ideally be posted as separate patches, because that makes it
easier to review and quicker to merge (and easier to revert should a
problem be detected before the subsequent CFS series is merged).

So, can you submit this patch separately without too much effort? :)

> 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Danilo Krummrich <dakr@kernel.org>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Philipp Stanner <phasta@kernel.org>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_rq.c | 28 +++++++++++++++++-----------
> > >   1 file changed, 17 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> > > index 16d57461765e..67804815ca0d 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_rq.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> > > @@ -19,6 +19,9 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
> > >   static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
> > >   					    struct drm_sched_rq *rq)
> > >   {
> > > +	lockdep_assert_held(&entity->lock);
> > > +	lockdep_assert_held(&rq->lock);
> > 
> > The callers of drm_sched_rq_remove_fifo_locked() already have some
> > lockdep asserts, have you seen them? Are those here redundant /
> > additional?
> > 
> > And are they strictly related to this patch?
> 
> Yes, drm_sched_rq_pop_entity() is the new caller, which needs to take
> both locks on its own. So IMO it makes sense to add the asserts.
> 
> > Maybe you want to investigate the other lockdep assertions and, if
> > there's room for improvement, address that in a dedicated patch.
> 
> They look okay to me. Are you seeing something is off?

No, they do look correct. It's just that we have a bit of redundancy
then, but that's probably a good thing for robustness.

P.

> 
> Regards,
> 
> Tvrtko
> 
> > > +
> > >   	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
> > >   		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
> > >   		RB_CLEAR_NODE(&entity->rb_tree_node);
> > > @@ -158,24 +161,27 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> > >   {
> > >   	struct drm_sched_job *next_job;
> > >   	struct drm_sched_rq *rq;
> > > -	ktime_t ts;
> > >   
> > >   	/*
> > >   	 * Update the entity's location in the min heap according to
> > >   	 * the timestamp of the next job, if any.
> > >   	 */
> > > +	spin_lock(&entity->lock);
> > > +	rq = entity->rq;
> > > +	spin_lock(&rq->lock);
> > >   	next_job = drm_sched_entity_queue_peek(entity);
> > > -	if (!next_job)
> > > -		return;
> > > +	if (next_job) {
> > > +		ktime_t ts;
> > >   
> > > -	spin_lock(&entity->lock);
> > > -	rq = entity->rq;
> > > -	spin_lock(&rq->lock);
> > > -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > -		ts = next_job->submit_ts;
> > > -	else
> > > -		ts = drm_sched_rq_get_rr_ts(rq, entity);
> > > -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> > > +		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > +			ts = next_job->submit_ts;
> > > +		else
> > > +			ts = drm_sched_rq_get_rr_ts(rq, entity);
> > > +
> > > +		drm_sched_rq_update_fifo_locked(entity, rq, ts);
> > > +	} else {
> > > +		drm_sched_rq_remove_fifo_locked(entity, rq);
> > > +	}
> > >   	spin_unlock(&rq->lock);
> > >   	spin_unlock(&entity->lock);
> > >   }
> > 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 09/12] drm/sched: Add fair scheduling policy
  2025-09-03 10:18 ` [RFC v8 09/12] drm/sched: Add fair scheduling policy Tvrtko Ursulin
@ 2025-09-24  9:38   ` Philipp Stanner
  2025-09-25 12:34     ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-24  9:38 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner, Pierre-Eric Pelloux-Prayer

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> The fair scheduling policy is built upon the same concepts as the well
> known CFS kernel scheduler - the entity run queue is sorted by the virtual
> GPU time consumed by entities, such that the entity with the least
> vruntime runs first.
> 
> It is able to avoid total priority starvation, which is one of the
> problems with FIFO, and it also eliminates the need for per priority run
> queues.
> 

If that's the case, then why are the runqueues removed in the
subsequent patches?

> It scales the actual GPU runtime by an exponential factor as the priority
> decreases, so the virtual runtime for low priority entities grows faster
> than for normal priority ones, pushing them further down the runqueue
> order for the same real GPU time spent.
> 
> Apart from this fundamental fairness, fair policy is especially strong in
> oversubscription workloads where it is able to give more GPU time to short
> and bursty workloads when they are running in parallel with GPU heavy
> clients submitting deep job queues.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c   |  28 ++--
>  drivers/gpu/drm/scheduler/sched_internal.h |   7 +-
>  drivers/gpu/drm/scheduler/sched_main.c     |  14 +-
>  drivers/gpu/drm/scheduler/sched_rq.c       | 147 ++++++++++++++++++++-
>  include/drm/gpu_scheduler.h                |  10 +-
>  5 files changed, 186 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 04ce8b7d436b..58f51875547a 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -108,6 +108,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  	entity->guilty = guilty;
>  	entity->num_sched_list = num_sched_list;
>  	entity->priority = priority;
> +	entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
> +			      DRM_SCHED_PRIORITY_KERNEL : priority;

Hm, what do you do here? Why is CFS temporarily running with KERNEL
prio?

AFAIU CFS eliminates the need for runqueues because there will be no
priorities anymore.

But with this patch, it's for now just a new scheduling policy next to
FIFO and RR, and your plan is then to remove those policies
altogether.

Priorities will be kept for users, but they will then set the scaling
factor.

Correct?

I think I understand the series relatively well by now, but here at the
transition point some clarifications would be nice. Maybe the commit
message is what confuses me, since it says that runqueues should be
surplus in this patch already, which I don't think is true.
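
For my own understanding, the virtual time accounting then boils down
to something like this (my own sketch based on the shift table further
below, not the patch verbatim):

  /* Each real GPU nanosecond ages an entity by 2^shift virtual ns. */
  static const unsigned int shift[] = {
  	[DRM_SCHED_PRIORITY_KERNEL] = 1,
  	[DRM_SCHED_PRIORITY_HIGH]   = 2,
  	[DRM_SCHED_PRIORITY_NORMAL] = 4,
  	[DRM_SCHED_PRIORITY_LOW]    = 7,
  };

  vruntime += runtime_ns << shift[entity->priority];

So for the same real GPU time a LOW entity ages 2^7 / 2^4 = 8 times
faster than a NORMAL one and keeps sinking in the tree, while KERNEL
ages slowest and floats towards the head. Is that the "exponential
factor" the commit message refers to?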

>  	/*
>  	 * It's perfectly valid to initialize an entity without having a valid
>  	 * scheduler attached. It's just not valid to use the scheduler before it
> @@ -124,17 +126,23 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  		 */
>  		pr_warn("%s: called with uninitialized scheduler\n", __func__);
>  	} else if (num_sched_list) {
> -		/* The "priority" of an entity cannot exceed the number of run-queues of a
> -		 * scheduler. Protect against num_rqs being 0, by converting to signed. Choose
> -		 * the lowest priority available.
> +		enum drm_sched_priority p = entity->priority;
> +
> +		/*
> +		 * The "priority" of an entity cannot exceed the number of
> +		 * run-queues of a scheduler. Protect against num_rqs being 0,
> +		 * by converting to signed. Choose the lowest priority
> +		 * available.
>  		 */
> -		if (entity->priority >= sched_list[0]->num_rqs) {
> -			dev_err(sched_list[0]->dev, "entity has out-of-bounds priority: %u. num_rqs: %u\n",
> -				entity->priority, sched_list[0]->num_rqs);
> -			entity->priority = max_t(s32, (s32) sched_list[0]->num_rqs - 1,
> -						 (s32) DRM_SCHED_PRIORITY_KERNEL);
> +		if (p >= sched_list[0]->num_user_rqs) {
> +			dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
> +				p, sched_list[0]->num_user_rqs);
> +			p = max_t(s32,
> +				 (s32)sched_list[0]->num_user_rqs - 1,
> +				 (s32)DRM_SCHED_PRIORITY_KERNEL);
> +			entity->priority = p;
>  		}
> -		entity->rq = sched_list[0]->sched_rq[entity->priority];
> +		entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
>  	}
>  
>  	init_completion(&entity->entity_idle);
> @@ -567,7 +575,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>  
>  	spin_lock(&entity->lock);
>  	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> -	rq = sched ? sched->sched_rq[entity->priority] : NULL;
> +	rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
>  	if (rq != entity->rq) {
>  		drm_sched_rq_remove_entity(entity->rq, entity);
>  		entity->rq = rq;
> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> index 27c8460a3601..125aba70eda6 100644
> --- a/drivers/gpu/drm/scheduler/sched_internal.h
> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> @@ -18,13 +18,16 @@ struct drm_sched_entity_stats {
>  	struct kref	kref;
>  	spinlock_t	lock;
>  	ktime_t		runtime;
> +	ktime_t		prev_runtime;
> +	u64		vruntime;

These two lack new documentation.

>  };
>  
>  /* Used to choose between FIFO and RR job-scheduling */
>  extern int drm_sched_policy;
>  
> -#define DRM_SCHED_POLICY_RR    0
> -#define DRM_SCHED_POLICY_FIFO  1
> +#define DRM_SCHED_POLICY_RR	0
> +#define DRM_SCHED_POLICY_FIFO	1
> +#define DRM_SCHED_POLICY_FAIR	2

What's going on with the git diff here?

>  
>  bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>  			 struct drm_sched_entity *entity);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a5d7706efbea..e7726095c19a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -84,13 +84,13 @@
>  #define CREATE_TRACE_POINTS
>  #include "gpu_scheduler_trace.h"
>  
> -int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
> +int drm_sched_policy = DRM_SCHED_POLICY_FAIR;

If you want to do that, the commit message should state explicitly that
you are changing the default policy in this patch. That's an important
change.

>  
>  /**
>   * DOC: sched_policy (int)
>   * Used to override default entities scheduling policy in a run queue.
>   */
> -MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
> +MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
>  module_param_named(sched_policy, drm_sched_policy, int, 0444);
>  
>  static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> @@ -1132,11 +1132,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  		sched->own_submit_wq = true;
>  	}
>  
> -	sched->sched_rq = kmalloc_array(args->num_rqs, sizeof(*sched->sched_rq),
> +	sched->num_user_rqs = args->num_rqs;
> +	sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
> +			 args->num_rqs : 1;
> +	sched->sched_rq = kmalloc_array(sched->num_rqs,
> +					sizeof(*sched->sched_rq),
>  					GFP_KERNEL | __GFP_ZERO);
>  	if (!sched->sched_rq)
>  		goto Out_check_own;
> -	sched->num_rqs = args->num_rqs;
> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>  		sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
>  		if (!sched->sched_rq[i])
> @@ -1278,7 +1282,7 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>  	if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
>  		atomic_inc(&bad->karma);
>  
> -		for (i = DRM_SCHED_PRIORITY_HIGH; i < sched->num_rqs; i++) {
> +		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>  			struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>  			spin_lock(&rq->lock);
> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> index 67804815ca0d..b0cf7d2143c8 100644
> --- a/drivers/gpu/drm/scheduler/sched_rq.c
> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> @@ -16,6 +16,24 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
>  	return ktime_before(ea->oldest_job_waiting, eb->oldest_job_waiting);
>  }
>  
> +static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
> +{
> +	enum drm_sched_priority prio = -1;
> +	struct rb_node *rb;
> +
> +	lockdep_assert_held(&rq->lock);
> +
> +	rb = rb_first_cached(&rq->rb_tree_root);
> +	if (rb) {
> +		struct drm_sched_entity *entity =
> +			rb_entry(rb, typeof(*entity), rb_tree_node);
> +
> +		prio = entity->priority; /* Unlocked read */
> +	}
> +
> +	rq->head_prio = prio;
> +}
> +
>  static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>  					    struct drm_sched_rq *rq)
>  {
> @@ -25,6 +43,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>  	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
>  		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
>  		RB_CLEAR_NODE(&entity->rb_tree_node);
> +		drm_sched_rq_update_prio(rq);
>  	}
>  }
>  
> @@ -46,6 +65,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>  
>  	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
>  		      drm_sched_entity_compare_before);
> +	drm_sched_rq_update_prio(rq);
>  }
>  
>  /**
> @@ -63,6 +83,114 @@ void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>  	INIT_LIST_HEAD(&rq->entities);
>  	rq->rb_tree_root = RB_ROOT_CACHED;
>  	rq->sched = sched;
> +	rq->head_prio = -1;
> +}
> +
> +static ktime_t
> +drm_sched_rq_get_min_vruntime(struct drm_sched_rq *rq)
> +{
> +	struct drm_sched_entity *entity;
> +	struct rb_node *rb;
> +
> +	lockdep_assert_held(&rq->lock);
> +
> +	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
> +		entity = rb_entry(rb, typeof(*entity), rb_tree_node);
> +
> +		return entity->stats->vruntime; /* Unlocked read */
> +	}
> +
> +	return 0;
> +}
> +
> +static void
> +drm_sched_entity_save_vruntime(struct drm_sched_entity *entity,
> +			       ktime_t min_vruntime)
> +{
> +	struct drm_sched_entity_stats *stats = entity->stats;
> +	ktime_t vruntime;
> +
> +	spin_lock(&stats->lock);
> +	vruntime = stats->vruntime;
> +	if (min_vruntime && vruntime > min_vruntime)
> +		vruntime = ktime_sub(vruntime, min_vruntime);
> +	else
> +		vruntime = 0;
> +	stats->vruntime = vruntime;
> +	spin_unlock(&stats->lock);
> +}
> +
> +static ktime_t
> +drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
> +				  ktime_t min_vruntime,
> +				  enum drm_sched_priority rq_prio)
> +{
> +	struct drm_sched_entity_stats *stats = entity->stats;
> +	enum drm_sched_priority prio = entity->priority;
> +	ktime_t vruntime;
> +
> +	BUILD_BUG_ON(DRM_SCHED_PRIORITY_NORMAL < DRM_SCHED_PRIORITY_HIGH);
> +
> +	spin_lock(&stats->lock);
> +	vruntime = stats->vruntime;
> +
> +	/*
> +	 * Special handling for entities which were picked from the top of the
> +	 * queue and are now re-joining the top with another one already there.
> +	 */
> +	if (!vruntime && min_vruntime) {
> +		if (prio > rq_prio) {
> +			/*
> +			 * Lower priority should not overtake higher when re-
> +			 * joining at the top of the queue.
> +			 */
> +			vruntime = us_to_ktime(prio - rq_prio);
> +		} else if (prio < rq_prio) {
> +			/*
> +			 * Higher priority can go first.
> +			 */
> +			vruntime = -us_to_ktime(rq_prio - prio);
> +		}
> +	}
> +
> +	/*
> +	 * Restore saved relative position in the queue.
> +	 */
> +	vruntime = ktime_add(min_vruntime, vruntime);
> +
> +	stats->vruntime = vruntime;
> +	spin_unlock(&stats->lock);
> +
> +	return vruntime;
> +}
> +
> +static ktime_t drm_sched_entity_update_vruntime(struct drm_sched_entity *entity)
> +{
> +	static const unsigned int shift[] = {
> +		[DRM_SCHED_PRIORITY_KERNEL] = 1,
> +		[DRM_SCHED_PRIORITY_HIGH]   = 2,
> +		[DRM_SCHED_PRIORITY_NORMAL] = 4,
> +		[DRM_SCHED_PRIORITY_LOW]    = 7,
> +	};
> +	struct drm_sched_entity_stats *stats = entity->stats;
> +	ktime_t runtime, prev;
> +
> +	spin_lock(&stats->lock);
> +	prev = stats->prev_runtime;
> +	runtime = stats->runtime;
> +	stats->prev_runtime = runtime;
> +	runtime = ktime_add_ns(stats->vruntime,
> +			       ktime_to_ns(ktime_sub(runtime, prev)) <<
> +			       shift[entity->priority]);
> +	stats->vruntime = runtime;
> +	spin_unlock(&stats->lock);
> +
> +	return runtime;
> +}
> +
> +static ktime_t drm_sched_entity_get_job_ts(struct drm_sched_entity *entity)
> +{
> +	return drm_sched_entity_update_vruntime(entity);
>  }
>  
>  /**
> @@ -99,8 +227,14 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>  		list_add_tail(&entity->list, &rq->entities);
>  	}
>  
> -	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
> +	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
> +		ts = drm_sched_rq_get_min_vruntime(rq);
> +		ts = drm_sched_entity_restore_vruntime(entity, ts,
> +						       rq->head_prio);
> +	} else if (drm_sched_policy == DRM_SCHED_POLICY_RR) {
>  		ts = entity->rr_ts;
> +	}
> +
>  	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>  
>  	spin_unlock(&rq->lock);
> @@ -173,7 +307,9 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>  	if (next_job) {
>  		ktime_t ts;
>  
> -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> +		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
> +			ts = drm_sched_entity_get_job_ts(entity);
> +		else if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  			ts = next_job->submit_ts;
>  		else
>  			ts = drm_sched_rq_get_rr_ts(rq, entity);
> @@ -181,6 +317,13 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>  		drm_sched_rq_update_fifo_locked(entity, rq, ts);
>  	} else {
>  		drm_sched_rq_remove_fifo_locked(entity, rq);
> +
> +		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
> +			ktime_t min_vruntime;
> +
> +			min_vruntime = drm_sched_rq_get_min_vruntime(rq);
> +			drm_sched_entity_save_vruntime(entity, min_vruntime);
> +		}
>  	}
>  	spin_unlock(&rq->lock);
>  	spin_unlock(&entity->lock);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index f33c78473867..327b75a052c7 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -150,6 +150,11 @@ struct drm_sched_entity {
>  	 */
>  	enum drm_sched_priority         priority;
>  
> +	/**
> +	 * @rq_priority: Run-queue priority
> +	 */
> +	enum drm_sched_priority         rq_priority;
> +
>  	/**
>  	 * @rr_ts:
>  	 *
> @@ -254,10 +259,11 @@ struct drm_sched_entity {
>   * struct drm_sched_rq - queue of entities to be scheduled.
>   *
>   * @sched: the scheduler to which this rq belongs to.
> - * @lock: protects @entities, @rb_tree_root and @rr_ts.
> + * @lock: protects @entities, @rb_tree_root, @rr_ts and @head_prio.
>   * @rr_ts: monotonically incrementing fake timestamp for RR mode
>   * @entities: list of the entities to be scheduled.
>   * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
> + * @head_prio: priority of the top tree element
>   *
>   * Run queue is a set of entities scheduling command submissions for
>   * one specific ring. It implements the scheduling policy that selects
> @@ -271,6 +277,7 @@ struct drm_sched_rq {
>  	ktime_t				rr_ts;
>  	struct list_head		entities;
>  	struct rb_root_cached		rb_tree_root;
> +	enum drm_sched_priority		head_prio;
>  };
>  
>  /**
> @@ -597,6 +604,7 @@ struct drm_gpu_scheduler {
>  	long				timeout;
>  	const char			*name;
>  	u32                             num_rqs;
> +	u32                             num_user_rqs;

Documentation missing.


P.

>  	struct drm_sched_rq             **sched_rq;
>  	wait_queue_head_t		job_scheduled;
>  	atomic64_t			job_id_count;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue
  2025-09-03 10:18 ` [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue Tvrtko Ursulin
@ 2025-09-24 11:50   ` Philipp Stanner
  2025-09-25 12:56     ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-24 11:50 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> If the new fair policy is at least as good as FIFO and we can afford to
> remove round-robin, we can simplify the scheduler code by making the
> scheduler to run queue relationship always 1:1 and remove some code.
> 
> Also, now that the FIFO policy is gone, the tree of entities is not a
> FIFO tree any more, so rename it to just the tree.

I think that this patch should be the one that sets the default
scheduling policy to CFS.

Either that or a separate patch at an appropriate position before #11.

> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  23 ++-
>  drivers/gpu/drm/scheduler/sched_entity.c   |  29 +---
>  drivers/gpu/drm/scheduler/sched_internal.h |   9 +-
>  drivers/gpu/drm/scheduler/sched_main.c     | 163 ++++++---------------
>  drivers/gpu/drm/scheduler/sched_rq.c       |  60 ++------
>  include/drm/gpu_scheduler.h                |  30 +---
>  6 files changed, 78 insertions(+), 236 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index d020a890a0ea..bc07fd57310c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -434,25 +434,22 @@ drm_sched_entity_queue_pop(struct drm_sched_entity *entity)
>  
>  void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
>  {
> +	struct drm_sched_rq *rq = sched->rq;
> +	struct drm_sched_entity *s_entity;
>  	struct drm_sched_job *s_job;
> -	struct drm_sched_entity *s_entity = NULL;
> -	int i;
>  
>  	/* Signal all jobs not yet scheduled */
> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> -		struct drm_sched_rq *rq = sched->sched_rq[i];
> -		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list) {
> -			while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
> -				struct drm_sched_fence *s_fence = s_job->s_fence;
> +	spin_lock(&rq->lock);
> +	list_for_each_entry(s_entity, &rq->entities, list) {
> +		while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
> +			struct drm_sched_fence *s_fence = s_job->s_fence;
>  
> -				dma_fence_signal(&s_fence->scheduled);
> -				dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> -				dma_fence_signal(&s_fence->finished);
> -			}
> +			dma_fence_signal(&s_fence->scheduled);
> +			dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> +			dma_fence_signal(&s_fence->finished);
>  		}
> -		spin_unlock(&rq->lock);
>  	}
> +	spin_unlock(&rq->lock);

Yoah, well. Seems there is no easy fix to remove that from amdgpu,
since it's used in GPU recovery.

Fine from my side if it's mentioned and explained in the commit
message.

>  
>  	/* Signal all jobs already scheduled to HW */
>  	list_for_each_entry(s_job, &sched->pending_list, list) {
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 58f51875547a..6dd30b85925b 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -108,8 +108,6 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  	entity->guilty = guilty;
>  	entity->num_sched_list = num_sched_list;
>  	entity->priority = priority;
> -	entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
> -			      DRM_SCHED_PRIORITY_KERNEL : priority;
>  	/*
>  	 * It's perfectly valid to initialize an entity without having a valid
>  	 * scheduler attached. It's just not valid to use the scheduler before it
> @@ -119,30 +117,14 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  	RCU_INIT_POINTER(entity->last_scheduled, NULL);
>  	RB_CLEAR_NODE(&entity->rb_tree_node);
>  
> -	if (num_sched_list && !sched_list[0]->sched_rq) {
> +	if (num_sched_list && !sched_list[0]->rq) {
>  		/* Since every entry covered by num_sched_list
>  		 * should be non-NULL and therefore we warn drivers
>  		 * not to do this and to fix their DRM calling order.
>  		 */
>  		pr_warn("%s: called with uninitialized scheduler\n", __func__);
>  	} else if (num_sched_list) {
> -		enum drm_sched_priority p = entity->priority;
> -
> -		/*
> -		 * The "priority" of an entity cannot exceed the number of
> -		 * run-queues of a scheduler. Protect against num_rqs being 0,
> -		 * by converting to signed. Choose the lowest priority
> -		 * available.
> -		 */
> -		if (p >= sched_list[0]->num_user_rqs) {
> -			dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
> -				p, sched_list[0]->num_user_rqs);
> -			p = max_t(s32,
> -				 (s32)sched_list[0]->num_user_rqs - 1,
> -				 (s32)DRM_SCHED_PRIORITY_KERNEL);
> -			entity->priority = p;
> -		}
> -		entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
> +		entity->rq = sched_list[0]->rq;
>  	}
>  
>  	init_completion(&entity->entity_idle);
> @@ -575,7 +557,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>  
>  	spin_lock(&entity->lock);
>  	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> -	rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
> +	rq = sched ? sched->rq : NULL;
>  	if (rq != entity->rq) {
>  		drm_sched_rq_remove_entity(entity->rq, entity);
>  		entity->rq = rq;
> @@ -599,7 +581,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  {
>  	struct drm_sched_entity *entity = sched_job->entity;
>  	bool first;
> -	ktime_t submit_ts;
>  
>  	trace_drm_sched_job_queue(sched_job, entity);
>  
> @@ -616,16 +597,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  	/*
>  	 * After the sched_job is pushed into the entity queue, it may be
>  	 * completed and freed up at any time. We can no longer access it.
> -	 * Make sure to set the submit_ts first, to avoid a race.
>  	 */
> -	sched_job->submit_ts = submit_ts = ktime_get();

Can submit_ts be removed completely?

>  	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>  
>  	/* first job wakes up scheduler */
>  	if (first) {
>  		struct drm_gpu_scheduler *sched;
>  
> -		sched = drm_sched_rq_add_entity(entity, submit_ts);
> +		sched = drm_sched_rq_add_entity(entity);
>  		if (sched)
>  			drm_sched_wakeup(sched);
>  	}
> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> index 125aba70eda6..6e5ed721bb5b 100644
> --- a/drivers/gpu/drm/scheduler/sched_internal.h
> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> @@ -22,13 +22,6 @@ struct drm_sched_entity_stats {
>  	u64		vruntime;
>  };
>  
> -/* Used to choose between FIFO and RR job-scheduling */
> -extern int drm_sched_policy;
> -
> -#define DRM_SCHED_POLICY_RR	0
> -#define DRM_SCHED_POLICY_FIFO	1
> -#define DRM_SCHED_POLICY_FAIR	2
> -
>  bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>  			 struct drm_sched_entity *entity);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> @@ -39,7 +32,7 @@ struct drm_sched_entity *
>  drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
>  			   struct drm_sched_rq *rq);
>  struct drm_gpu_scheduler *
> -drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
> +drm_sched_rq_add_entity(struct drm_sched_entity *entity);
>  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  				struct drm_sched_entity *entity);
>  void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index e7726095c19a..a9079bdb27d3 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -84,15 +84,6 @@
>  #define CREATE_TRACE_POINTS
>  #include "gpu_scheduler_trace.h"
>  
> -int drm_sched_policy = DRM_SCHED_POLICY_FAIR;
> -
> -/**
> - * DOC: sched_policy (int)
> - * Used to override default entities scheduling policy in a run queue.
> - */
> -MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
> -module_param_named(sched_policy, drm_sched_policy, int, 0444);
> -

Anyone know what happens if we remove that param and some user still
tries to set it?

>  static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>  {
>  	u32 credits;
> @@ -876,34 +867,6 @@ void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>  	drm_sched_run_job_queue(sched);
>  }
>  
> -/**
> - * drm_sched_select_entity - Select next entity to process
> - *
> - * @sched: scheduler instance
> - *
> - * Return an entity to process or NULL if none are found.
> - *
> - * Note, that we break out of the for-loop when "entity" is non-null, which can
> - * also be an error-pointer--this assures we don't process lower priority
> - * run-queues. See comments in the respectively called functions.
> - */
> -static struct drm_sched_entity *
> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> -{
> -	struct drm_sched_entity *entity = NULL;
> -	int i;
> -
> -	/* Start with the highest priority.
> -	 */
> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> -		entity = drm_sched_rq_select_entity(sched, sched->sched_rq[i]);
> -		if (entity)
> -			break;
> -	}
> -
> -	return IS_ERR(entity) ? NULL : entity;
> -}
> -
>  /**
>   * drm_sched_get_finished_job - fetch the next finished job to be destroyed
>   *
> @@ -1023,8 +986,8 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  	int r;
>  
>  	/* Find entity with a ready job */
> -	entity = drm_sched_select_entity(sched);
> -	if (!entity)
> +	entity = drm_sched_rq_select_entity(sched, sched->rq);

It isn't clear to me why entity selection should take the rq as a
parameter, now that the scheduler only has one rq and, thus, only one
list of entities anyway.

Also, you pass both sched and sched->rq, which is redundant.

I think it would be clearer to continue using the name
drm_sched_select_entity().
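
Something like this trivial wrapper would keep the old name (an
untested sketch against this version of the series):

	static struct drm_sched_entity *
	drm_sched_select_entity(struct drm_gpu_scheduler *sched)
	{
		/* Scheduler and rq are 1:1 now, derive the rq internally. */
		return drm_sched_rq_select_entity(sched, sched->rq);
	}

That would also avoid passing the redundant second argument at the
call site.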


> +	if (IS_ERR_OR_NULL(entity))
>  		return;	/* No more work */
>  
>  	sched_job = drm_sched_entity_pop_job(entity);
> @@ -1095,8 +1058,6 @@ static struct workqueue_struct *drm_sched_alloc_wq(const char *name)
>   */
>  int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_args *args)
>  {
> -	int i;
> -
>  	sched->ops = args->ops;
>  	sched->credit_limit = args->credit_limit;
>  	sched->name = args->name;
> @@ -1106,13 +1067,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  	sched->score = args->score ? args->score : &sched->_score;
>  	sched->dev = args->dev;
>  
> -	if (args->num_rqs > DRM_SCHED_PRIORITY_COUNT) {
> -		/* This is a gross violation--tell drivers what the  problem is.
> -		 */
> -		dev_err(sched->dev, "%s: num_rqs cannot be greater than DRM_SCHED_PRIORITY_COUNT\n",
> -			__func__);
> -		return -EINVAL;
> -	} else if (sched->sched_rq) {

args->num_rqs is surplus now. What are we going to do with the drivers
that set it?

It should be removed from struct drm_sched_init_args and from the
drivers that use it. That way, those drivers will also notice this big
change coming up.
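
A quick way to enumerate the affected drivers (shell sketch, run from a
kernel tree):

	git grep -n 'num_rqs' drivers/gpu/drm/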

> +	if (sched->rq) {
>  		/* Not an error, but warn anyway so drivers can
>  		 * fine-tune their DRM calling order, and return all
>  		 * is good.
> @@ -1132,21 +1087,11 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  		sched->own_submit_wq = true;
>  	}
>  
> -	sched->num_user_rqs = args->num_rqs;
> -	sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
> -			 args->num_rqs : 1;
> -	sched->sched_rq = kmalloc_array(sched->num_rqs,
> -					sizeof(*sched->sched_rq),
> -					GFP_KERNEL | __GFP_ZERO);
> -	if (!sched->sched_rq)
> +	sched->rq = kmalloc(sizeof(*sched->rq), GFP_KERNEL | __GFP_ZERO);
> +	if (!sched->rq)
>  		goto Out_check_own;
>  
> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> -		sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
> -		if (!sched->sched_rq[i])
> -			goto Out_unroll;
> -		drm_sched_rq_init(sched, sched->sched_rq[i]);
> -	}
> +	drm_sched_rq_init(sched, sched->rq);

Same, redundant function parameter.

>  
>  	init_waitqueue_head(&sched->job_scheduled);
>  	INIT_LIST_HEAD(&sched->pending_list);
> @@ -1161,12 +1106,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  
>  	sched->ready = true;
>  	return 0;
> -Out_unroll:
> -	for (--i ; i >= DRM_SCHED_PRIORITY_KERNEL; i--)
> -		kfree(sched->sched_rq[i]);
>  
> -	kfree(sched->sched_rq);

I almost thought we had a memory leak now, but it seems fine since we
can no longer fail after the kmalloc() above.

> -	sched->sched_rq = NULL;
>  Out_check_own:
>  	if (sched->own_submit_wq)
>  		destroy_workqueue(sched->submit_wq);
> @@ -1202,41 +1142,35 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
> +
> +	struct drm_sched_rq *rq = sched->rq;
>  	struct drm_sched_entity *s_entity;
> -	int i;
>  
>  	drm_sched_wqueue_stop(sched);
>  
> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> -		struct drm_sched_rq *rq = sched->sched_rq[i];
> -
> -		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> -			/*
> -			 * Prevents reinsertion and marks job_queue as idle,
> -			 * it will be removed from the rq in drm_sched_entity_fini()
> -			 * eventually
> -			 *
> -			 * FIXME:
> -			 * This lacks the proper spin_lock(&s_entity->lock) and
> -			 * is, therefore, a race condition. Most notably, it
> -			 * can race with drm_sched_entity_push_job(). The lock
> -			 * cannot be taken here, however, because this would
> -			 * lead to lock inversion -> deadlock.
> -			 *
> -			 * The best solution probably is to enforce the life
> -			 * time rule of all entities having to be torn down
> -			 * before their scheduler. Then, however, locking could
> -			 * be dropped alltogether from this function.
> -			 *
> -			 * For now, this remains a potential race in all
> -			 * drivers that keep entities alive for longer than
> -			 * the scheduler.
> -			 */
> -			s_entity->stopped = true;
> -		spin_unlock(&rq->lock);
> -		kfree(sched->sched_rq[i]);
> -	}
> +	spin_lock(&rq->lock);
> +	list_for_each_entry(s_entity, &rq->entities, list)
> +		/*
> +		 * Prevents re-insertion and marks job_queue as idle,
> +		 * it will be removed from the rq in drm_sched_entity_fini()
> +		 * eventually.
> +		 *
> +		 * FIXME:
> +		 * This lacks the proper spin_lock(&s_entity->lock) and is,
> +		 * therefore, a race condition. Most notably, it can race with
> +		 * drm_sched_entity_push_job(). The lock cannot be taken here,
> +		 * however, because this would lead to lock inversion.
> +		 *
> +		 * The best solution probably is to enforce the life time rule
> +		 * of all entities having to be torn down before their
> +		 * scheduler. Then locking could be dropped altogether from this
> +		 * function.
> +		 *
> +		 * For now, this remains a potential race in all drivers that
> +		 * keep entities alive for longer than the scheduler.
> +		 */
> +		s_entity->stopped = true;
> +	spin_unlock(&rq->lock);
>  
>  	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>  	wake_up_all(&sched->job_scheduled);
> @@ -1251,8 +1185,8 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  	if (sched->own_submit_wq)
>  		destroy_workqueue(sched->submit_wq);
>  	sched->ready = false;
> -	kfree(sched->sched_rq);
> -	sched->sched_rq = NULL;
> +	kfree(sched->rq);

Renaming sched_rq to rq can be done, but it certainly inflates the git
diff. Such general naming improvements make patches a bit more
difficult to review, so I'd prefer to have those in separate patches.

> +	sched->rq = NULL;
>  
>  	if (!list_empty(&sched->pending_list))
>  		dev_warn(sched->dev, "Tearing down scheduler while jobs are pending!\n");
> @@ -1270,35 +1204,28 @@ EXPORT_SYMBOL(drm_sched_fini);
>   */
>  void drm_sched_increase_karma(struct drm_sched_job *bad)
>  {
> -	int i;
> -	struct drm_sched_entity *tmp;
> -	struct drm_sched_entity *entity;
>  	struct drm_gpu_scheduler *sched = bad->sched;
> +	struct drm_sched_entity *entity, *tmp;
> +	struct drm_sched_rq *rq = sched->rq;
>  
>  	/* don't change @bad's karma if it's from KERNEL RQ,
>  	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
>  	 * corrupt but keep in mind that kernel jobs always considered good.
>  	 */
> -	if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
> -		atomic_inc(&bad->karma);
> +	if (bad->s_priority == DRM_SCHED_PRIORITY_KERNEL)
> +		return;
>  
> -		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> -			struct drm_sched_rq *rq = sched->sched_rq[i];
> +	atomic_inc(&bad->karma);
>  
> -			spin_lock(&rq->lock);
> -			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> -				if (bad->s_fence->scheduled.context ==
> -				    entity->fence_context) {
> -					if (entity->guilty)
> -						atomic_set(entity->guilty, 1);
> -					break;
> -				}
> -			}
> -			spin_unlock(&rq->lock);
> -			if (&entity->list != &rq->entities)
> -				break;
> +	spin_lock(&rq->lock);
> +	list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> +		if (bad->s_fence->scheduled.context == entity->fence_context) {
> +			if (entity->guilty)
> +				atomic_set(entity->guilty, 1);
> +			break;
>  		}
>  	}
> +	spin_unlock(&rq->lock);
>  }
>  EXPORT_SYMBOL(drm_sched_increase_karma);
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> index c5be8a2c893d..a9bc105221bf 100644
> --- a/drivers/gpu/drm/scheduler/sched_rq.c
> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> @@ -34,7 +34,7 @@ static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
>  	rq->head_prio = prio;
>  }
>  
> -static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
> +static void drm_sched_rq_remove_tree_locked(struct drm_sched_entity *entity,
>  					    struct drm_sched_rq *rq)
>  {
>  	lockdep_assert_held(&entity->lock);
> @@ -47,7 +47,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>  	}
>  }
>  
> -static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> +static void drm_sched_rq_update_tree_locked(struct drm_sched_entity *entity,
>  					    struct drm_sched_rq *rq,
>  					    ktime_t ts)
>  {
> @@ -59,7 +59,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>  	lockdep_assert_held(&entity->lock);
>  	lockdep_assert_held(&rq->lock);
>  
> -	drm_sched_rq_remove_fifo_locked(entity, rq);
> +	drm_sched_rq_remove_tree_locked(entity, rq);
>  
>  	entity->oldest_job_waiting = ts;
>  
> @@ -207,17 +207,17 @@ static ktime_t drm_sched_entity_get_job_ts(struct drm_sched_entity *entity)
>   * drm_sched_rq_add_entity - add an entity
>   *
>   * @entity: scheduler entity
> - * @ts: submission timestamp
>   *
>   * Adds a scheduler entity to the run queue.
>   *
>   * Returns a DRM scheduler pre-selected to handle this entity.
>   */
>  struct drm_gpu_scheduler *
> -drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
> +drm_sched_rq_add_entity(struct drm_sched_entity *entity)
>  {
>  	struct drm_gpu_scheduler *sched;
>  	struct drm_sched_rq *rq;
> +	ktime_t ts;
>  
>  	/* Add the entity to the run queue */
>  	spin_lock(&entity->lock);
> @@ -237,15 +237,9 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>  		list_add_tail(&entity->list, &rq->entities);
>  	}
>  
> -	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
> -		ts = drm_sched_rq_get_min_vruntime(rq);
> -		ts = drm_sched_entity_restore_vruntime(entity, ts,
> -						       rq->head_prio);
> -	} else if (drm_sched_policy == DRM_SCHED_POLICY_RR) {
> -		ts = entity->rr_ts;
> -	}
> -
> -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +	ts = drm_sched_rq_get_min_vruntime(rq);
> +	ts = drm_sched_entity_restore_vruntime(entity, ts, rq->head_prio);
> +	drm_sched_rq_update_tree_locked(entity, rq, ts);
>  
>  	spin_unlock(&rq->lock);
>  	spin_unlock(&entity->lock);
> @@ -274,26 +268,11 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  	atomic_dec(rq->sched->score);
>  	list_del_init(&entity->list);
>  
> -	drm_sched_rq_remove_fifo_locked(entity, rq);
> +	drm_sched_rq_remove_tree_locked(entity, rq);
>  
>  	spin_unlock(&rq->lock);
>  }
>  
> -static ktime_t
> -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> -{
> -	ktime_t ts;
> -
> -	lockdep_assert_held(&entity->lock);
> -	lockdep_assert_held(&rq->lock);
> -
> -	ts = ktime_add_ns(rq->rr_ts, 1);
> -	entity->rr_ts = ts;
> -	rq->rr_ts = ts;
> -
> -	return ts;
> -}
> -
>  /**
>   * drm_sched_rq_pop_entity - pops an entity
>   *
> @@ -317,23 +296,14 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>  	if (next_job) {
>  		ktime_t ts;
>  
> -		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
> -			ts = drm_sched_entity_get_job_ts(entity);
> -		else if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> -			ts = next_job->submit_ts;
> -		else
> -			ts = drm_sched_rq_get_rr_ts(rq, entity);
> -
> -		drm_sched_rq_update_fifo_locked(entity, rq, ts);
> +		ts = drm_sched_entity_get_job_ts(entity);
> +		drm_sched_rq_update_tree_locked(entity, rq, ts);
>  	} else {
> -		drm_sched_rq_remove_fifo_locked(entity, rq);
> +		ktime_t min_vruntime;
>  
> -		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
> -			ktime_t min_vruntime;
> -
> -			min_vruntime = drm_sched_rq_get_min_vruntime(rq);
> -			drm_sched_entity_save_vruntime(entity, min_vruntime);
> -		}
> +		drm_sched_rq_remove_tree_locked(entity, rq);
> +		min_vruntime = drm_sched_rq_get_min_vruntime(rq);
> +		drm_sched_entity_save_vruntime(entity, min_vruntime);
>  	}
>  	spin_unlock(&rq->lock);
>  	spin_unlock(&entity->lock);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 327b75a052c7..c9b75a05d05c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -96,8 +96,7 @@ struct drm_sched_entity {
>  	 * @lock:
>  	 *
>  	 * Lock protecting the run-queue (@rq) to which this entity belongs,
> -	 * @priority, the list of schedulers (@sched_list, @num_sched_list) and
> -	 * the @rr_ts field.
> +	 * @priority and the list of schedulers (@sched_list, @num_sched_list).
>  	 */
>  	spinlock_t			lock;
>  
> @@ -150,18 +149,6 @@ struct drm_sched_entity {
>  	 */
>  	enum drm_sched_priority         priority;
>  
> -	/**
> -	 * @rq_priority: Run-queue priority
> -	 */
> -	enum drm_sched_priority         rq_priority;
> -
> -	/**
> -	 * @rr_ts:
> -	 *
> -	 * Fake timestamp of the last popped job from the entity.
> -	 */
> -	ktime_t				rr_ts;
> -
>  	/**
>  	 * @job_queue: the list of jobs of this entity.
>  	 */
> @@ -259,8 +246,7 @@ struct drm_sched_entity {
>   * struct drm_sched_rq - queue of entities to be scheduled.
>   *
>   * @sched: the scheduler to which this rq belongs to.
> - * @lock: protects @entities, @rb_tree_root, @rr_ts and @head_prio.
> - * @rr_ts: monotonically incrementing fake timestamp for RR mode
> + * @lock: protects @entities, @rb_tree_root and @head_prio.
>   * @entities: list of the entities to be scheduled.
>   * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
>   * @head_prio: priority of the top tree element
> @@ -274,7 +260,6 @@ struct drm_sched_rq {
>  
>  	spinlock_t			lock;
>  	/* Following members are protected by the @lock: */
> -	ktime_t				rr_ts;
>  	struct list_head		entities;
>  	struct rb_root_cached		rb_tree_root;
>  	enum drm_sched_priority		head_prio;
> @@ -360,13 +345,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>   * to schedule the job.
>   */
>  struct drm_sched_job {
> -	/**
> -	 * @submit_ts:
> -	 *
> -	 * When the job was pushed into the entity queue.
> -	 */
> -	ktime_t                         submit_ts;
> -
>  	/**
>  	 * @sched:
>  	 *
> @@ -603,9 +581,7 @@ struct drm_gpu_scheduler {
>  	atomic_t			credit_count;
>  	long				timeout;
>  	const char			*name;
> -	u32                             num_rqs;
> -	u32                             num_user_rqs;
> -	struct drm_sched_rq             **sched_rq;
> +	struct drm_sched_rq             *rq;

Docu must be updated


P.


>  	wait_queue_head_t		job_scheduled;
>  	atomic64_t			job_id_count;
>  	struct workqueue_struct		*submit_wq;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler
  2025-09-03 10:18 ` [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler Tvrtko Ursulin
@ 2025-09-24 12:01   ` Philipp Stanner
  2025-09-25 13:02     ` Tvrtko Ursulin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Stanner @ 2025-09-24 12:01 UTC (permalink / raw)
  To: Tvrtko Ursulin, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Philipp Stanner

On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> Now that the run queue to scheduler relationship is always 1:1 we can
> embed it (the run queue) directly in the scheduler struct and save on
> some allocation error handling code and such.

Looks reasonable to me.

What I suggest is to do general improvements such as variable renames
(see my comments on the previous patch) in a separate cleanup patch
following this one.

Few comments below

> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Philipp Stanner <phasta@kernel.org>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      |  6 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  6 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |  5 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h   |  8 ++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  8 +++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     |  8 +++---
>  drivers/gpu/drm/scheduler/sched_entity.c    | 32 +++++++++------------
>  drivers/gpu/drm/scheduler/sched_fence.c     |  2 +-
>  drivers/gpu/drm/scheduler/sched_internal.h  |  6 ++--
>  drivers/gpu/drm/scheduler/sched_main.c      | 31 ++++----------------
>  drivers/gpu/drm/scheduler/sched_rq.c        | 18 ++++++------
>  include/drm/gpu_scheduler.h                 |  4 +--
>  12 files changed, 58 insertions(+), 76 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 2ac9729e4c86..467e09e5567b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1131,7 +1131,8 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>  	if (p->gang_size > 1 && !adev->vm_manager.concurrent_flush) {
>  		for (i = 0; i < p->gang_size; ++i) {
>  			struct drm_sched_entity *entity = p->entities[i];
> -			struct drm_gpu_scheduler *sched = entity->rq->sched;
> +			struct drm_gpu_scheduler *sched =
> +				container_of(entity->rq, typeof(*sched), rq);
>  			struct amdgpu_ring *ring = to_amdgpu_ring(sched);
>  
>  			if (amdgpu_vmid_uses_reserved(vm, ring->vm_hub))
> @@ -1262,7 +1263,8 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>  			return r;
>  	}

Now that I think of it, are you compile-testing this for all DRM
drivers? Who knows which of them might be doing similar things to
amdgpu.
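
A quick smoke test (a sketch; builds just the DRM subtree of an
allmodconfig tree):

	make allmodconfig
	make -j$(nproc) drivers/gpu/drm/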

>  
> -	sched = p->gang_leader->base.entity->rq->sched;
> +	sched = container_of(p->gang_leader->base.entity->rq, typeof(*sched),
> +			     rq);
>  	while ((fence = amdgpu_sync_get_fence(&p->sync))) {
>  		struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index bc07fd57310c..cdfaf3eb736d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -341,7 +341,9 @@ static struct dma_fence *
>  amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
>  		      struct drm_sched_entity *s_entity)
>  {
> -	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
> +	struct drm_gpu_scheduler *sched =
> +		container_of(s_entity->rq, typeof(*sched), rq);
> +	struct amdgpu_ring *ring = to_amdgpu_ring(sched);
>  	struct amdgpu_job *job = to_amdgpu_job(sched_job);
>  	struct dma_fence *fence;
>  	int r;
> @@ -434,7 +436,7 @@ drm_sched_entity_queue_pop(struct drm_sched_entity *entity)
>  
>  void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
>  {
> -	struct drm_sched_rq *rq = sched->rq;
> +	struct drm_sched_rq *rq = &sched->rq;
>  	struct drm_sched_entity *s_entity;
>  	struct drm_sched_job *s_job;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index 4a6487eb6cb5..9530b5da3adc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -102,7 +102,10 @@ struct amdgpu_job {
>  
>  static inline struct amdgpu_ring *amdgpu_job_ring(struct amdgpu_job *job)
>  {
> -	return to_amdgpu_ring(job->base.entity->rq->sched);
> +	struct drm_gpu_scheduler *sched =
> +		container_of(job->base.entity->rq, typeof(*sched), rq);
> +
> +	return to_amdgpu_ring(sched);
>  }
>  
>  int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> index d13e64a69e25..85724ec6aaf8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> @@ -145,6 +145,7 @@ TRACE_EVENT(amdgpu_cs,
>  		     struct amdgpu_ib *ib),
>  	    TP_ARGS(p, job, ib),
>  	    TP_STRUCT__entry(
> +			     __field(struct drm_gpu_scheduler *, sched)
>  			     __field(struct amdgpu_bo_list *, bo_list)
>  			     __field(u32, ring)
>  			     __field(u32, dw)
> @@ -152,11 +153,14 @@ TRACE_EVENT(amdgpu_cs,
>  			     ),
>  
>  	    TP_fast_assign(
> +			   __entry->sched = container_of(job->base.entity->rq,
> +							 typeof(*__entry->sched),
> +							 rq);
>  			   __entry->bo_list = p->bo_list;
> -			   __entry->ring = to_amdgpu_ring(job->base.entity->rq->sched)->idx;
> +			   __entry->ring = to_amdgpu_ring(__entry->sched)->idx;
>  			   __entry->dw = ib->length_dw;
>  			   __entry->fences = amdgpu_fence_count_emitted(
> -				to_amdgpu_ring(job->base.entity->rq->sched));
> +				to_amdgpu_ring(__entry->sched));
>  			   ),
>  	    TP_printk("bo_list=%p, ring=%u, dw=%u, fences=%u",
>  		      __entry->bo_list, __entry->ring, __entry->dw,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 36805dcfa159..4ccd2e769799 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -106,13 +106,13 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>  static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  				 struct dma_fence **fence)
>  {
> +	struct drm_gpu_scheduler *sched =
> +		container_of(p->vm->delayed.rq, typeof(*sched), rq);
> +	struct amdgpu_ring *ring =
> +		container_of(sched, struct amdgpu_ring, sched);
>  	struct amdgpu_ib *ib = p->job->ibs;
> -	struct amdgpu_ring *ring;
>  	struct dma_fence *f;
>  
> -	ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
> -			    sched);
> -
>  	WARN_ON(ib->length_dw == 0);
>  	amdgpu_ring_pad_ib(ring, ib);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> index 1083db8cea2e..be17635ac039 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> @@ -465,15 +465,15 @@ int amdgpu_xcp_open_device(struct amdgpu_device *adev,
>  void amdgpu_xcp_release_sched(struct amdgpu_device *adev,
>  				  struct amdgpu_ctx_entity *entity)
>  {
> -	struct drm_gpu_scheduler *sched;
> -	struct amdgpu_ring *ring;
> +	struct drm_gpu_scheduler *sched =
> +		container_of(entity->entity.rq, typeof(*sched), rq);
>  
>  	if (!adev->xcp_mgr)
>  		return;
>  
> -	sched = entity->entity.rq->sched;
>  	if (drm_sched_wqueue_ready(sched)) {
> -		ring = to_amdgpu_ring(entity->entity.rq->sched);
> +		struct amdgpu_ring *ring = to_amdgpu_ring(sched);
> +
>  		atomic_dec(&adev->xcp_mgr->xcp[ring->xcp_id].ref_cnt);
>  	}
>  }
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 6dd30b85925b..ba290143c95d 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -114,19 +114,12 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>  	 * is initialized itself.
>  	 */
>  	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
> +	if (num_sched_list) {
> +		entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
> +		entity->rq = &sched_list[0]->rq;
> +	}
>  	RCU_INIT_POINTER(entity->last_scheduled, NULL);
>  	RB_CLEAR_NODE(&entity->rb_tree_node);
> -
> -	if (num_sched_list && !sched_list[0]->rq) {
> -		/* Since every entry covered by num_sched_list
> -		 * should be non-NULL and therefore we warn drivers
> -		 * not to do this and to fix their DRM calling order.
> -		 */
> -		pr_warn("%s: called with uninitialized scheduler\n", __func__);
> -	} else if (num_sched_list) {
> -		entity->rq = sched_list[0]->rq;
> -	}
> -
>  	init_completion(&entity->entity_idle);
>  
>  	/* We start in an idle state. */
> @@ -312,7 +305,7 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>  	if (!entity->rq)
>  		return 0;
>  
> -	sched = entity->rq->sched;
> +	sched = container_of(entity->rq, typeof(*sched), rq);
>  	/*
>  	 * The client will not queue more jobs during this fini - consume
>  	 * existing queued ones, or discard them on SIGKILL.
> @@ -393,10 +386,12 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  {
>  	struct drm_sched_entity *entity =
>  		container_of(cb, struct drm_sched_entity, cb);
> +	struct drm_gpu_scheduler *sched =
> +		container_of(entity->rq, typeof(*sched), rq);
>  
>  	entity->dependency = NULL;
>  	dma_fence_put(f);
> -	drm_sched_wakeup(entity->rq->sched);
> +	drm_sched_wakeup(sched);
>  }
>  
>  /**
> @@ -423,7 +418,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>  static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
>  					       struct drm_sched_job *sched_job)
>  {
> -	struct drm_gpu_scheduler *sched = entity->rq->sched;
> +	struct drm_gpu_scheduler *sched =
> +		container_of(entity->rq, typeof(*sched), rq);
>  	struct dma_fence *fence = entity->dependency;
>  	struct drm_sched_fence *s_fence;
>  
> @@ -557,7 +553,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>  
>  	spin_lock(&entity->lock);
>  	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> -	rq = sched ? sched->rq : NULL;
> +	rq = sched ? &sched->rq : NULL;
>  	if (rq != entity->rq) {
>  		drm_sched_rq_remove_entity(entity->rq, entity);
>  		entity->rq = rq;
> @@ -580,6 +576,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>  void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  {
>  	struct drm_sched_entity *entity = sched_job->entity;
> +	struct drm_gpu_scheduler *sched =
> +		container_of(entity->rq, typeof(*sched), rq);
>  	bool first;
>  
>  	trace_drm_sched_job_queue(sched_job, entity);
> @@ -591,7 +589,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		xa_for_each(&sched_job->dependencies, index, entry)
>  			trace_drm_sched_job_add_dep(sched_job, entry);
>  	}
> -	atomic_inc(entity->rq->sched->score);
> +	atomic_inc(sched->score);
>  	WRITE_ONCE(entity->last_user, current->group_leader);
>  
>  	/*
> @@ -602,8 +600,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  
>  	/* first job wakes up scheduler */
>  	if (first) {
> -		struct drm_gpu_scheduler *sched;
> -
>  		sched = drm_sched_rq_add_entity(entity);
>  		if (sched)
>  			drm_sched_wakeup(sched);
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> index 9391d6f0dc01..da4f53a9ca35 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -227,7 +227,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>  {
>  	unsigned seq;
>  
> -	fence->sched = entity->rq->sched;
> +	fence->sched = container_of(entity->rq, typeof(*fence->sched), rq);
>  	seq = atomic_inc_return(&entity->fence_seq);
>  	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>  		       &fence->lock, entity->fence_context, seq);
> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> index 6e5ed721bb5b..409c9ab7ce8f 100644
> --- a/drivers/gpu/drm/scheduler/sched_internal.h
> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> @@ -26,11 +26,9 @@ bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>  			 struct drm_sched_entity *entity);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>  
> -void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> -		       struct drm_sched_rq *rq);
> +void drm_sched_rq_init(struct drm_gpu_scheduler *sched);
>  struct drm_sched_entity *
> -drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> -			   struct drm_sched_rq *rq);
> +drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched);
>  struct drm_gpu_scheduler *
>  drm_sched_rq_add_entity(struct drm_sched_entity *entity);
>  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a9079bdb27d3..fe773154cdd4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -645,7 +645,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>  
>  	BUG_ON(!entity);
>  	drm_sched_entity_select_rq(entity);
> -	sched = entity->rq->sched;
> +	sched = container_of(entity->rq, typeof(*sched), rq);
>  
>  	job->sched = sched;
>  	job->s_priority = entity->priority;
> @@ -986,7 +986,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>  	int r;
>  
>  	/* Find entity with a ready job */
> -	entity = drm_sched_rq_select_entity(sched, sched->rq);
> +	entity = drm_sched_rq_select_entity(sched);
>  	if (IS_ERR_OR_NULL(entity))
>  		return;	/* No more work */
>  
> @@ -1067,15 +1067,6 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  	sched->score = args->score ? args->score : &sched->_score;
>  	sched->dev = args->dev;
>  
> -	if (sched->rq) {
> -		/* Not an error, but warn anyway so drivers can
> -		 * fine-tune their DRM calling order, and return all
> -		 * is good.
> -		 */
> -		dev_warn(sched->dev, "%s: scheduler already initialized!\n", __func__);
> -		return 0;
> -	}
> -
>  	if (args->submit_wq) {
>  		sched->submit_wq = args->submit_wq;
>  		sched->own_submit_wq = false;
> @@ -1087,11 +1078,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  		sched->own_submit_wq = true;
>  	}
>  
> -	sched->rq = kmalloc(sizeof(*sched->rq), GFP_KERNEL | __GFP_ZERO);
> -	if (!sched->rq)
> -		goto Out_check_own;
> -
> -	drm_sched_rq_init(sched, sched->rq);
> +	drm_sched_rq_init(sched);
>  
>  	init_waitqueue_head(&sched->job_scheduled);
>  	INIT_LIST_HEAD(&sched->pending_list);
> @@ -1106,12 +1093,6 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  
>  	sched->ready = true;
>  	return 0;
> -
> -Out_check_own:
> -	if (sched->own_submit_wq)
> -		destroy_workqueue(sched->submit_wq);
> -	dev_err(sched->dev, "%s: Failed to setup GPU scheduler--out of memory\n", __func__);
> -	return -ENOMEM;
>  }
>  EXPORT_SYMBOL(drm_sched_init);
>  
> @@ -1143,7 +1124,7 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
>  
> -	struct drm_sched_rq *rq = sched->rq;
> +	struct drm_sched_rq *rq = &sched->rq;
>  	struct drm_sched_entity *s_entity;
>  
>  	drm_sched_wqueue_stop(sched);
> @@ -1185,8 +1166,6 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  	if (sched->own_submit_wq)
>  		destroy_workqueue(sched->submit_wq);
>  	sched->ready = false;
> -	kfree(sched->rq);
> -	sched->rq = NULL;
>  
>  	if (!list_empty(&sched->pending_list))
>  		dev_warn(sched->dev, "Tearing down scheduler while jobs are pending!\n");
> @@ -1206,7 +1185,7 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>  {
>  	struct drm_gpu_scheduler *sched = bad->sched;
>  	struct drm_sched_entity *entity, *tmp;
> -	struct drm_sched_rq *rq = sched->rq;
> +	struct drm_sched_rq *rq = &sched->rq;
>  
>  	/* don't change @bad's karma if it's from KERNEL RQ,
>  	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> index a9bc105221bf..6088434a4ea4 100644
> --- a/drivers/gpu/drm/scheduler/sched_rq.c
> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> @@ -72,17 +72,16 @@ static void drm_sched_rq_update_tree_locked(struct drm_sched_entity *entity,
>   * drm_sched_rq_init - initialize a given run queue struct
>   *
>   * @sched: scheduler instance to associate with this run queue
> - * @rq: scheduler run queue
>   *
>   * Initializes a scheduler runqueue.
>   */
> -void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> -		       struct drm_sched_rq *rq)
> +void drm_sched_rq_init(struct drm_gpu_scheduler *sched)
>  {
> +	struct drm_sched_rq *rq = &sched->rq;
> +
>  	spin_lock_init(&rq->lock);
>  	INIT_LIST_HEAD(&rq->entities);
>  	rq->rb_tree_root = RB_ROOT_CACHED;
> -	rq->sched = sched;
>  	rq->head_prio = -1;
>  }
>  
> @@ -229,8 +228,8 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity)
>  	}
>  
>  	rq = entity->rq;
> +	sched = container_of(rq, typeof(*sched), rq);
>  	spin_lock(&rq->lock);
> -	sched = rq->sched;
>  
>  	if (list_empty(&entity->list)) {
>  		atomic_inc(sched->score);
> @@ -258,6 +257,8 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity)
>  void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  				struct drm_sched_entity *entity)
>  {
> +	struct drm_gpu_scheduler *sched = container_of(rq, typeof(*sched), rq);
> +
>  	lockdep_assert_held(&entity->lock);
>  
>  	if (list_empty(&entity->list))
> @@ -265,7 +266,7 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>  
>  	spin_lock(&rq->lock);
>  
> -	atomic_dec(rq->sched->score);
> +	atomic_dec(sched->score);
>  	list_del_init(&entity->list);
>  
>  	drm_sched_rq_remove_tree_locked(entity, rq);
> @@ -313,7 +314,6 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>   * drm_sched_rq_select_entity - Select an entity which provides a job to run
>   *
>   * @sched: the gpu scheduler
> - * @rq: scheduler run queue to check.
>   *
>   * Find oldest waiting ready entity.
>   *
> @@ -322,9 +322,9 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>   * its job; return NULL, if no ready entity was found.
>   */
>  struct drm_sched_entity *
> -drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> -			   struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched)
>  {
> +	struct drm_sched_rq *rq = &sched->rq;
>  	struct rb_node *rb;
>  
>  	spin_lock(&rq->lock);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index c9b75a05d05c..7fbcd121a6d3 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -256,8 +256,6 @@ struct drm_sched_entity {
>   * the next entity to emit commands from.
>   */
>  struct drm_sched_rq {
> -	struct drm_gpu_scheduler	*sched;
> -

Docu needs to be updated


--

That's all from my side for now. I guess we'll be discussing a bit on
the other threads and at XDC. But overall, this is a very nice series
that improves things greatly, and from my POV we can transition to a
v1; maybe after brainstorming with the other guys next week a bit.


Regards,
P.

>  	spinlock_t			lock;
>  	/* Following members are protected by the @lock: */
>  	struct list_head		entities;
> @@ -581,7 +579,7 @@ struct drm_gpu_scheduler {
>  	atomic_t			credit_count;
>  	long				timeout;
>  	const char			*name;
> -	struct drm_sched_rq             *rq;
> +	struct drm_sched_rq             rq;
>  	wait_queue_head_t		job_scheduled;
>  	atomic64_t			job_id_count;
>  	struct workqueue_struct		*submit_wq;


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests
  2025-09-23 13:01   ` Philipp Stanner
@ 2025-09-25 11:41     ` Tvrtko Ursulin
  0 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-25 11:41 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Pierre-Eric Pelloux-Prayer


On 23/09/2025 14:01, Philipp Stanner wrote:
> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>> To make evaluating different scheduling policies easier (no need for
>> external benchmarks) and perfectly repeatable, lets add some synthetic
>> workloads built upon mock scheduler unit test infrastructure.
>>
>> Focus is on two parallel clients (two threads) submitting different job
>> patterns and logging their progress and some overall metrics. This is
>> repeated for both scheduler credit limit 1 and 2.
>>
>> Example test output:
>>
>>    Normal and low:
>>                      pct1 cps1 qd1;  pct2 cps2 qd2
>>          +     0ms:   0     0    0;   0     0    0
>>          +   104ms: 100  1240  112; 100  1240  125
>>          +   209ms: 100     0   99; 100     0  125
>>          +   313ms: 100     0   86; 100     0  125
>>          +   419ms: 100     0   73; 100     0  125
>>          +   524ms: 100     0   60; 100     0  125
>>          +   628ms: 100     0   47; 100     0  125
>>          +   731ms: 100     0   34; 100     0  125
>>          +   836ms: 100     0   21; 100     0  125
>>          +   939ms: 100     0    8; 100     0  125
>>          +  1043ms:               ; 100     0  120
>>          +  1147ms:               ; 100     0  107
>>          +  1252ms:               ; 100     0   94
>>          +  1355ms:               ; 100     0   81
>>          +  1459ms:               ; 100     0   68
>>          +  1563ms:               ; 100     0   55
>>          +  1667ms:               ; 100     0   42
>>          +  1771ms:               ; 100     0   29
>>          +  1875ms:               ; 100     0   16
>>          +  1979ms:               ; 100     0    3
>>      0: prio=normal sync=0 elapsed_ms=1015ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,222,978 us latency_time(min,avg,max)=134,222,978
>> us
>>      1: prio=low sync=0 elapsed_ms=2009ms (ideal_ms=1000ms) cycle_time(min,avg,max)=134,215,806 us latency_time(min,avg,max)=134,215,806 us
>>
>> There we have two clients represented in the two respective columns, with
>> their progress logged roughly every 100 milliseconds. The metrics are:
>>
>>   - pct - Percentage progress of the job submit part
>>   - cps - Cycles per second
>>   - qd  - Queue depth - number of submitted unfinished jobs
>>
>> The cycles per second metric is inherent to the fact that workload
>> patterns are a data driven cycling sequence of:
>>
>>   - Submit 1..N jobs
>>   - Wait for Nth job to finish (optional)
>>   - Sleep (optional)
>>   - Repeat from start
>>
>> In this particular example we have a normal priority and a low priority
>> clients both spamming the scheduler with 8ms jobs with no sync and no
>> sleeping. Hence they build a very deep queues and we can see how the low
>> priority client is completely starved until the normal finishes.
>>
>> Note that the PCT and CPS metrics are irrelevant for "unsync" clients
>> since they manage to complete all of their cycles instantaneously.
>>
>> A different example would be:
>>
>>    Heavy and interactive:
>>                      pct1 cps1 qd1;  pct2 cps2 qd2
>>          +     0ms:   0     0    0;   0     0    0
>>          +   106ms:   5    40    3;   5    40    0
>>          +   209ms:   9    40    0;   9    40    0
>>          +   314ms:  14    50    3;  14    50    0
>>          +   417ms:  18    40    0;  18    40    0
>>          +   522ms:  23    50    3;  23    50    0
>>          +   625ms:  27    40    0;  27    40    1
>>          +   729ms:  32    50    0;  32    50    0
>>          +   833ms:  36    40    1;  36    40    0
>>          +   937ms:  40    40    0;  40    40    0
>>          +  1041ms:  45    50    0;  45    50    0
>>          +  1146ms:  49    40    1;  49    40    1
>>          +  1249ms:  54    50    0;  54    50    0
>>          +  1353ms:  58    40    1;  58    40    0
>>          +  1457ms:  62    40    0;  62    40    1
>>          +  1561ms:  67    50    0;  67    50    0
>>          +  1665ms:  71    40    1;  71    40    0
>>          +  1772ms:  76    50    0;  76    50    0
>>          +  1877ms:  80    40    1;  80    40    0
>>          +  1981ms:  84    40    0;  84    40    0
>>          +  2085ms:  89    50    0;  89    50    0
>>          +  2189ms:  93    40    1;  93    40    0
>>          +  2293ms:  97    40    0;  97    40    1
>>
>> In this case client one is submitting 3x 2.5ms jobs, waiting for the 3rd
>> and then sleeping for 2.5ms (in effect causing 75% GPU load, minus the
>> overheads). Second client is submitting 1ms jobs, waiting for each to
>> finish and sleeping for 9ms (effective 10% GPU load). Here we can see
>> the PCT and CPS reflecting real progress.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
>> Acked-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
>>   .../gpu/drm/scheduler/tests/tests_scheduler.c | 640 ++++++++++++++++++
>>   2 files changed, 642 insertions(+), 1 deletion(-)
>>   create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c
>>
>> diff --git a/drivers/gpu/drm/scheduler/tests/Makefile b/drivers/gpu/drm/scheduler/tests/Makefile
>> index 5bf707bad373..9ec185fbbc15 100644
>> --- a/drivers/gpu/drm/scheduler/tests/Makefile
>> +++ b/drivers/gpu/drm/scheduler/tests/Makefile
>> @@ -2,6 +2,7 @@
>>   
>>   drm-sched-tests-y := \
>>           mock_scheduler.o \
>> -        tests_basic.o
>> +        tests_basic.o \
>> +        tests_scheduler.o
>>   
>>   obj-$(CONFIG_DRM_SCHED_KUNIT_TEST) += drm-sched-tests.o
>> diff --git a/drivers/gpu/drm/scheduler/tests/tests_scheduler.c b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
>> new file mode 100644
>> index 000000000000..15c898966ef0
>> --- /dev/null
>> +++ b/drivers/gpu/drm/scheduler/tests/tests_scheduler.c
>> @@ -0,0 +1,640 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2025 Valve Corporation */
>> +
>> +#include <linux/delay.h>
>> +#include <linux/kthread.h>
>> +#include <linux/ktime.h>
>> +#include <linux/math64.h>
>> +
>> +#include "sched_tests.h"
>> +
>> +/*
>> + * DRM scheduler scheduler tests exercise load balancing decisions ie. entity
>> + * selection logic.
>> + */
>> +
>> +static int drm_sched_scheduler_init(struct kunit *test)
>> +{
>> +	struct drm_mock_scheduler *sched;
>> +
>> +	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
>> +	sched->base.credit_limit = 1;
>> +
>> +	test->priv = sched;
>> +
>> +	return 0;
>> +}
>> +
>> +static int drm_sched_scheduler_init2(struct kunit *test)
> 
> I'm not very convinced of the naming prefixes in this file. They're all
> static, but people might find them through Bootlin or similar and be
> confused.
> 
> in tests_basic.c we call all functions drm_sched_basic.
> 
> So I'd propose to go for drm_sched_quality here.

I don't think we should be that concerned about the names of private 
functions as long as they are reasonable. I do use "quality" in the 
patch title but I don't think that is the best term for describing the 
whole group.

If I *have* to churn I would suggest drm_sched_scheduling (and 
tests_scheduling). Because as you say, tests_basic -> drm_sched_basic_ 
and tests_scheduler -> drm_sched_scheduler_.

>> +{
>> +	struct drm_mock_scheduler *sched;
>> +
>> +	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
>> +	sched->base.credit_limit = 2;
>> +
>> +	test->priv = sched;
>> +
>> +	return 0;
>> +}
>> +
>> +static void drm_sched_scheduler_exit(struct kunit *test)
>> +{
>> +	struct drm_mock_scheduler *sched = test->priv;
>> +
>> +	drm_mock_sched_fini(sched);
>> +}
>> +
>> +static void drm_sched_scheduler_queue_overhead(struct kunit *test)
> 
> What queue's overhead is this, actually? Seems to me that it's more the
> scheduler overhead for processing jobs in general.

The queue of jobs submitted to the scheduler, and the overhead of the
scheduler processing that queue.
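
In other words, the difference between the logged "Expected" and
"actual" times, divided by the number of jobs, gives a rough per-job
overhead:

	overhead_per_job_us ~= (actual_us - expected_us) / jobs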

>> +{
>> +	struct drm_mock_scheduler *sched = test->priv;
>> +	struct drm_mock_sched_entity *entity;
>> +	const unsigned int job_us = 1000;
>> +	const unsigned int jobs = 1000;
>> +	const unsigned int total_us = jobs * job_us;
>> +	struct drm_mock_sched_job *job, *first;
>> +	ktime_t start, end;
>> +	bool done;
>> +	int i;
>> +
>> +	/*
>> +	 * Deep queue job at a time processing (single credit).
> 
> "Measure job-by-job processing overhead time"?

Almost literally written below.

> 
>> +	 *
>> +	 * This measures the overhead of picking and processing a job at a time
>> +	 * by comparing the ideal total "GPU" time of all submitted jobs versus
>> +	 * the time actually taken.
>> +	 */
>> +
>> +	KUNIT_ASSERT_EQ(test, sched->base.credit_limit, 1);
>> +
>> +	entity = drm_mock_sched_entity_new(test,
>> +					   DRM_SCHED_PRIORITY_NORMAL,
>> +					   sched);
>> +
>> +	for (i = 0; i <= jobs; i++) {
>> +		job = drm_mock_sched_job_new(test, entity);
>> +		if (i == 0)
>> +			first = job; /* Extra first job blocks the queue */
> 
> "Extra"?

Extra as in different from the rest of the queue: it has no duration
set, so it blocks everything behind it until the test releases it via
drm_mock_sched_advance(). It is not part of the job count specified in
the test prologue.

>> +		else
>> +			drm_mock_sched_job_set_duration_us(job, job_us);
>> +		drm_mock_sched_job_submit(job);
>> +	}
>> +
>> +	done = drm_mock_sched_job_wait_scheduled(first, HZ);
>> +	KUNIT_ASSERT_TRUE(test, done);
>> +
>> +	start = ktime_get();
>> +	i = drm_mock_sched_advance(sched, 1); /* Release the queue */
>> +	KUNIT_ASSERT_EQ(test, i, 1);
>> +
>> +	done = drm_mock_sched_job_wait_finished(job,
>> +						usecs_to_jiffies(total_us) * 5);
> 
> Why 5?

An arbitrary but long enough wait, since the test does not set a
pass-fail criterion; it just expects to finish and logs the result
(5x the ideal total time, i.e. five seconds here).

> 
>> +	end = ktime_get();
>> +	KUNIT_ASSERT_TRUE(test, done);
>> +
>> +	pr_info("Expected %uus, actual %lldus\n",
>> +		total_us,
>> +		ktime_to_us(ktime_sub(end, start)));
>> +
>> +	drm_mock_sched_entity_free(entity);
>> +}
>> +
>> +static void drm_sched_scheduler_ping_pong(struct kunit *test)
> 
> Interesting idea for a test!

:) It's really rudimentary. You should take a peek at the i915 tests we 
have written over the years.

>> +{
>> +	struct drm_mock_sched_job *job, *first, *prev = NULL;
>> +	struct drm_mock_scheduler *sched = test->priv;
>> +	struct drm_mock_sched_entity *entity[2];
> 
> Setting a define saves you the ARRAY_SIZE. But I don't have a strong
> opinion on that.
> 
>> +	const unsigned int job_us = 1000;
>> +	const unsigned int jobs = 1000;
>> +	const unsigned int total_us = jobs * job_us;
>> +	ktime_t start, end;
>> +	bool done;
>> +	int i;
>> +
>> +	/*
>> +	 * Two entities in an inter-dependency chain.
>> +	 *
>> +	 * This measures the overhead of picking and processing a job at a time,
>> +	 * where each job depends on the previous one from the different
>> +	 * entity, by comparing the ideal total "GPU" time of all submitted jobs
>> +	 * versus the time actually taken.
>> +	 */
>> +
>> +	KUNIT_ASSERT_EQ(test, sched->base.credit_limit, 1);
>> +
>> +	for (i = 0; i < ARRAY_SIZE(entity); i++)
>> +		entity[i] = drm_mock_sched_entity_new(test,
>> +						      DRM_SCHED_PRIORITY_NORMAL,
>> +						      sched);
>> +
>> +	for (i = 0; i <= jobs; i++) {
>> +		job = drm_mock_sched_job_new(test, entity[i & 1]);
>> +		if (i == 0)
>> +			first = job; /* Extra first job blocks the queue */
>> +		else
>> +			drm_mock_sched_job_set_duration_us(job, job_us);
>> +		if (prev)
>> +			drm_sched_job_add_dependency(&job->base,
>> +						     dma_fence_get(&prev->base.s_fence->finished));
>> +		drm_mock_sched_job_submit(job);
>> +		prev = job;
>> +	}
>> +
>> +	done = drm_mock_sched_job_wait_scheduled(first, HZ);
>> +	KUNIT_ASSERT_TRUE(test, done);
>> +
>> +	start = ktime_get();
>> +	i = drm_mock_sched_advance(sched, 1); /* Release the queue */
>> +	KUNIT_ASSERT_EQ(test, i, 1);
>> +
>> +	done = drm_mock_sched_job_wait_finished(job,
>> +						usecs_to_jiffies(total_us) * 5);
> 
> Same. I guess 5 is just "long enough to be finished for sure".
> 
> 
>> +	end = ktime_get();
>> +	KUNIT_ASSERT_TRUE(test, done);
>> +
>> +	pr_info("Expected %uus, actual %lldus\n",
>> +		total_us,
>> +		ktime_to_us(ktime_sub(end, start)));
>> +
>> +	for (i = 0; i < ARRAY_SIZE(entity); i++)
>> +		drm_mock_sched_entity_free(entity[i]);
>> +}
>> +
>> +static struct kunit_case drm_sched_scheduler_overhead_tests[] = {
>> +	KUNIT_CASE_SLOW(drm_sched_scheduler_queue_overhead),
>> +	KUNIT_CASE_SLOW(drm_sched_scheduler_ping_pong),
>> +	{}
>> +};
>> +
>> +static struct kunit_suite drm_sched_scheduler_overhead = {
>> +	.name = "drm_sched_scheduler_overhead_tests",
>> +	.init = drm_sched_scheduler_init,
>> +	.exit = drm_sched_scheduler_exit,
>> +	.test_cases = drm_sched_scheduler_overhead_tests,
>> +};
>> +
>> +struct drm_sched_client_params {
>> +	enum drm_sched_priority priority;
>> +	unsigned int job_cnt;
>> +	unsigned int job_us;
>> +	unsigned int wait_us;
>> +	bool sync;
>> +};
> 
> Struct documentation would be neat. cnt is obvious.. job_us is the
> "payload time" and "wait_us" is.. the time you wait until submitting a
> job? And sync is whether you submit synchronously? Doesn't that then
> imply wait_us == 0?

wait_us is the time the client waits before starting the next loop.
sync controls whether it waits for the previous loop to finish on the
"GPU". The two are independent and useful for simulating some basic
submission patterns.

As a side note, this builds upon my experience from IGT/gem_wsim^1. It
is i915-focused and not used much these days, but it was instrumental
for making sure various datacenter media transcode workloads run well.

Not least, AMD recently reported^2 that scheduler latencies can slow
down similar workloads on their hardware. Overall that means making
sure the scheduler has low overheads and handles things like ping-pong
queue transitions well is still relevant.

1)
https://cgit.freedesktop.org/drm/igt-gpu-tools/tree/benchmarks/wsim/README

2)
https://lore.kernel.org/amd-gfx/a950349d-69e1-4a7e-bd99-d9d6a1528533@amd.com/

> I'm reading as I write; I guess it gets clearer below. But you get the
> point, a few words of docu would help new programmers / reviewers.

Makes sense to document it here. Will add.
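
Something along these lines, perhaps (hypothetical wording, based on
the explanation above):

	struct drm_sched_client_params {
		enum drm_sched_priority priority; /* Entity priority. */
		unsigned int job_cnt;	/* Jobs submitted per cycle. */
		unsigned int job_us;	/* Simulated "GPU" time per job. */
		unsigned int wait_us;	/* Sleep before the next cycle. */
		bool sync;	/* Wait for the last job of the cycle. */
	};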

> 
>> +
>> +struct drm_sched_test_params {
>> +	const char *description;
>> +	struct drm_sched_client_params client[2];
>> +};
>> +
>> +static const struct drm_sched_test_params drm_sched_cases[] = {
>> +	{
>> +		.description = "Normal and normal",
> 
> s/normal/normal prio/

It would be very verbose when applied to all tests, for questionable
gain. Only scheduler experts will look at these tests, so I think it is
preferable to keep them readable.

Anyway we can bikeshed later, I'll skip this block for now.

>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +	},
>> +	{
>> +		.description = "Normal and low",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_LOW,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +	},
>> +	{
>> +		.description = "High and normal",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_HIGH,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +	},
>> +	{
>> +		.description = "High and low",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_HIGH,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_LOW,
>> +			.job_cnt = 1,
>> +			.job_us = 8000,
>> +			.wait_us = 0,
>> +			.sync = false,
>> +		},
>> +	},
>> +	{
>> +		.description = "50 and 50",
> 
> Hm? 50 normal and 50 normal?
> 
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 1500,
>> +			.wait_us = 1500,
>> +			.sync = true,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 2500,
>> +			.wait_us = 2500,
>> +			.sync = true,
>> +		},
>> +	},
>> +	{
>> +		.description = "50 and 50 low",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 1500,
>> +			.wait_us = 1500,
>> +			.sync = true,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_LOW,
>> +			.job_cnt = 1,
>> +			.job_us = 2500,
>> +			.wait_us = 2500,
>> +			.sync = true,
>> +		},
>> +	},
>> +	{
>> +		.description = "50 high and 50",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_HIGH,
>> +			.job_cnt = 1,
>> +			.job_us = 1500,
>> +			.wait_us = 1500,
>> +			.sync = true,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 2500,
>> +			.wait_us = 2500,
>> +			.sync = true,
>> +		},
>> +	},
>> +	{
>> +		.description = "Low hog and interactive",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_LOW,
>> +			.job_cnt = 3,
>> +			.job_us = 2500,
>> +			.wait_us = 500,
>> +			.sync = false,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 500,
>> +			.wait_us = 10000,
>> +			.sync = true,
>> +		},
>> +	},
>> +	{
>> +		.description = "Heavy and interactive",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 3,
>> +			.job_us = 2500,
>> +			.wait_us = 2500,
>> +			.sync = true,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 1000,
>> +			.wait_us = 9000,
>> +			.sync = true,
>> +		},
>> +	},
>> +	{
>> +		.description = "Very heavy and interactive",
>> +		.client[0] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 4,
>> +			.job_us = 50000,
>> +			.wait_us = 1,
>> +			.sync = true,
>> +		},
>> +		.client[1] = {
>> +			.priority = DRM_SCHED_PRIORITY_NORMAL,
>> +			.job_cnt = 1,
>> +			.job_us = 1000,
>> +			.wait_us = 9000,
>> +			.sync = true,
>> +		},
>> +	},
>> +};
>> +
>> +static void
>> +drm_sched_desc(const struct drm_sched_test_params *params, char *desc)
>> +{
>> +	strscpy(desc, params->description, KUNIT_PARAM_DESC_SIZE);
>> +}
>> +
>> +KUNIT_ARRAY_PARAM(drm_sched_scheduler_two_clients,
>> +		  drm_sched_cases,
>> +		  drm_sched_desc);
>> +
>> +struct test_client_stats {
>> +	unsigned int min_us;
>> +	unsigned int max_us;
>> +	unsigned long tot_us;
>> +};
>> +
>> +struct test_client {
>> +	struct kunit *test;
>> +
>> +	struct drm_mock_sched_entity	*entity;
>> +
>> +	struct kthread_worker	*worker;
>> +	struct kthread_work	work;
>> +
>> +	unsigned int id;
>> +	ktime_t duration;
>> +
>> +	struct drm_sched_client_params params;
>> +
>> +	ktime_t ideal_duration;
>> +	unsigned int cycles;
>> +	unsigned int cycle;
>> +	ktime_t	start;
>> +	ktime_t	end;
>> +	bool done;
>> +
>> +	struct test_client_stats cycle_time;
>> +	struct test_client_stats latency_time;
>> +};
> 
> Same here, docu would be helpful. Especially regarding "cycles". It
> seems to be "cycles per second", if I read the code below correctly.

No, cycles per second is a different metric emitted for each client.

I will document because it does make some sense to do it.

>> +
>> +static void
>> +update_stats(struct test_client_stats *stats, unsigned int us)
>> +{
>> +	if (us > stats->max_us)
>> +		stats->max_us = us;
>> +	if (us < stats->min_us)
> 
> Won't min_us be initialized to 0? So how could 'us' ever be smaller?

No, it is the usual pattern of initialising so that the first sample
always updates both bounds:

  max = 0
  min = ~0

> 
>> +		stats->min_us = us;
>> +	stats->tot_us += us;
>> +}
>> +
>> +static unsigned int
>> +get_stats_avg(struct test_client_stats *stats, unsigned int cycles)
>> +{
>> +	return div_u64(stats->tot_us, cycles);
>> +}
>> +
>> +static void drm_sched_client_work(struct kthread_work *work)
> 
> Why's this called "client_work"? Is it representing the userspace
> client?

Yep.

> 
>> +{
>> +	struct test_client *client = container_of(work, typeof(*client), work);
>> +	const long sync_wait = MAX_SCHEDULE_TIMEOUT;
>> +	unsigned int cycle, work_us, period_us;
>> +	struct drm_mock_sched_job *job = NULL;
>> +
>> +	work_us = client->params.job_cnt * client->params.job_us;
>> +	period_us = work_us + client->params.wait_us;
>> +	client->cycles =
>> +		DIV_ROUND_UP((unsigned int)ktime_to_us(client->duration),
>> +			     period_us);
>> +	client->ideal_duration = us_to_ktime(client->cycles * period_us);
>> +
>> +	client->start = ktime_get();
>> +
>> +	for (cycle = 0; cycle < client->cycles; cycle++) {
>> +		unsigned int batch;
>> +		unsigned long us;
>> +		ktime_t t;
> 
> 't' is used at a couple of places below, including when firing the
> WARN_ON_ONCE. I think the name should reveal more clearly what time is
> actually measured here, so that it gets more obvious for what
> conditions you're checking below.

Makes sense.

>> +
>> +		if (READ_ONCE(client->done))
>> +			break;
>> +
>> +		t = ktime_get();
>> +		for (batch = 0; batch < client->params.job_cnt; batch++) {
>> +			job = drm_mock_sched_job_new(client->test,
>> +						     client->entity);
>> +			drm_mock_sched_job_set_duration_us(job,
>> +							   client->params.job_us);
>> +			drm_mock_sched_job_submit(job);
>> +		}
>> +
>> +		if (client->params.sync)
>> +			drm_mock_sched_job_wait_finished(job, sync_wait);
>> +
>> +		t = ktime_sub(ktime_get(), t);
>> +		us = ktime_to_us(t);
>> +		update_stats(&client->cycle_time, us);
>> +		if (ktime_to_us(t) >= (long)work_us)
>> +			us = ktime_to_us(t) - work_us;
>> +		else if (WARN_ON_ONCE(client->params.sync))
> 
> See above. For me as a new reader it reads like this:
> 
> "if it did not take longer than it should have taken, and if this is
> synchronous, then that's a bug"

If the GPU job took less than it was configured to take, _and_ we waited 
for it to finish, then it is a bug in the mock scheduler which makes 
testing invalid and should be looked at. Will add a comment.
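
Something along these lines perhaps (wording is only a sketch):

	/*
	 * A sync client waits for its batch to finish, so the measured
	 * cycle time can never legitimately be shorter than the
	 * configured GPU time of the batch. If it is, the mock
	 * scheduler mis-executed the jobs and results from this run
	 * are invalid.
	 */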

> 
>> +			us = 0;
>> +		update_stats(&client->latency_time, us);
>> +		WRITE_ONCE(client->cycle, cycle);
>> +
>> +		if (READ_ONCE(client->done))
>> +			break;
>> +
>> +		if (client->params.wait_us)
>> +			fsleep(client->params.wait_us);
>> +		else
>> +			cond_resched();
> 
> This if-else definitely needs a comment explaining what is being done
> and why. Why do you volunteer the CPU here? The function is about to
> end anyways.

It is a loop so it may not end. cond_resched() is just so the thread 
does not completely hog the CPU when submitting a huge number of jobs on 
a slow machine with neither sync nor wait_us set. Will add a comment.
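
For example (comment text is only a suggestion):

	/*
	 * With neither sync nor wait_us set the loop would spin
	 * submitting jobs back to back, so between cycles yield the
	 * CPU to avoid hogging it on slow machines.
	 */
	if (client->params.wait_us)
		fsleep(client->params.wait_us);
	else
		cond_resched();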

>> +	}
>> +
>> +	client->done = drm_mock_sched_job_wait_finished(job, sync_wait);
>> +	client->end = ktime_get();
>> +}
>> +
>> +static const char *prio_str(enum drm_sched_priority prio)
>> +{
>> +	switch (prio) {
>> +	case DRM_SCHED_PRIORITY_KERNEL:
>> +		return "kernel";
>> +	case DRM_SCHED_PRIORITY_LOW:
>> +		return "low";
>> +	case DRM_SCHED_PRIORITY_NORMAL:
>> +		return "normal";
>> +	case DRM_SCHED_PRIORITY_HIGH:
>> +		return "high";
>> +	default:
>> +		return "???";
> 
> s/???/INVALID

:shrug:

> 
>> +	}
>> +}
>> +
>> +static void drm_sched_scheduler_two_clients_test(struct kunit *test)
>> +{
>> +	const struct drm_sched_test_params *params = test->param_value;
>> +	struct drm_mock_scheduler *sched = test->priv;
>> +	struct test_client client[2] = { };
>> +	unsigned int prev_cycle[2] = { };
>> +	unsigned int i, j;
>> +	ktime_t start;
>> +
>> +	/*
>> +	 * Same job stream from from two clients.
> 
> Typo, "from from"

Marking as TODO.

> btw it's not super clear what's being tested here by the description /
> names alone. You did test with two entities above in the ping-pong
> example already. So I suppose a "client" is a party owning multiple
> entities?

A client is simulating a userspace client submitting GPU work. Two 
clients means two of those running in parallel. See the comment a bit 
lower in this function. That one also explains the client work 
definition data structure a bit.

>> +	 */
>> +
>> +	for (i = 0; i < ARRAY_SIZE(client); i++)
>> +		client[i].entity =
>> +			drm_mock_sched_entity_new(test,
>> +						  params->client[i].priority,
>> +						  sched);
>> +
>> +	for (i = 0; i < ARRAY_SIZE(client); i++) {
>> +		client[i].test = test;
>> +		client[i].id = i;
>> +		client[i].duration = ms_to_ktime(1000);
>> +		client[i].params = params->client[i];
>> +		client[i].cycle_time.min_us = ~0U;
>> +		client[i].latency_time.min_us = ~0U;
>> +		client[i].worker =
>> +			kthread_create_worker(0, "%s-%u", __func__, i);
>> +		if (IS_ERR(client[i].worker)) {
>> +			for (j = 0; j < i; j++)
>> +				kthread_destroy_worker(client[j].worker);
>> +			KUNIT_FAIL(test, "Failed to create worker!\n");
>> +		}
>> +
>> +		kthread_init_work(&client[i].work, drm_sched_client_work);
>> +	}
>> +
>> +	for (i = 0; i < ARRAY_SIZE(client); i++)
>> +		kthread_queue_work(client[i].worker, &client[i].work);
>> +
>> +	/*
>> +	 * The clients (workers) can be a mix of async (deep submission queue),
>> +	 * sync (one job at a time), or something in between. Therefore it is
>> +	 * difficult to display a single metric representing their progress.
>> +	 *
>> +	 * Each struct drm_sched_client_params describes the actual submission
>> +	 * pattern which happens in the following steps:
>> +	 *  1. Submit N jobs
>> +	 *  2. Wait for last submitted job to finish
>> +	 *  3. Sleep for U micro-seconds
>> +	 *  4. Goto 1. for C cycles
> 
> OK, so this is simulating clients submitting in a burst-like pattern.
> Could it make sense to name the test something with "burst"?

No, it doesn't have to be a burst. The same loop executes all the 
workload patterns defined in drm_sched_cases[]. Each of those has a name 
and a submission pattern definition for two clients.

Take for example "Very heavy and interactive":

	{
		.description = "Very heavy and interactive",
		.client[0] = {
			.priority = DRM_SCHED_PRIORITY_NORMAL,
			.job_cnt = 4,
			.job_us = 50000,
			.wait_us = 1,
			.sync = true,
		},
		.client[1] = {
			.priority = DRM_SCHED_PRIORITY_NORMAL,
			.job_cnt = 1,
			.job_us = 1000,
			.wait_us = 9000,
			.sync = true,
		},
	},

That was roughly modeled on a medical imaging use case which Pierre-Eric 
saw working really badly with the current scheduler.

1)
Medical imaging software submits a series of very demanding jobs and 
does not wait much between them: 4x 50ms jobs, wait for them to finish, 
sleep 1us, repeat.

2)
You try to use some light GUI application in parallel: 1x 1ms job, wait 
for it to finish, sleep for 9ms (imagine that's waiting for the next vsync).

This workload runs horribly with the FIFO scheduler:

https://people.igalia.com/tursulin/drm-sched-fair/4-very-heavy-vs-interactive.png

The graph shows the "FPS" of the 2nd client (the lightweight 
interactive one) over the test runtime, for different scheduling algorithms.

>> +	 *
>> +	 * Where number of cycles is calculated to match the target client
>> +	 * duration from the respective struct drm_sched_test_params.
>> +	 *
>> +	 * To assess scheduling behaviour what we output for both clients is:
>> +	 *  - pct: Percentage progress of the jobs submitted
>> +	 *  - cps: "Cycles" per second (where one cycle is one 1.-4. above)
> 
> "is one complete submission pattern from the list above"

Ok.

>> +	 *  -  qd: Number of outstanding jobs in the client/entity
>> +	 */
>> +
>> +	start = ktime_get();
>> +	pr_info("%s:\n\t            pct1 cps1 qd1;  pct2 cps2 qd2\n",
>> +		params->description);
>> +	while (!READ_ONCE(client[0].done) || !READ_ONCE(client[1].done)) {
> 
> You probably want to write in the documentation of struct test_client what
> 'done' does and who sets and reads it. Then it becomes more obvious why
> a READ_ONCE is necessary.

Okay. I can even make it a helper so it will be even clearer.
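
A minimal sketch of what I have in mind (name is illustrative):

	/* @done is written by the client worker and polled by the test. */
	static bool client_done(struct test_client *client)
	{
		return READ_ONCE(client->done);
	}

	...

	while (!client_done(&client[0]) || !client_done(&client[1]))
		...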

> 
>> +		unsigned int pct[2], qd[2], cycle[2], cps[2];
>> +
>> +		for (i = 0; i < ARRAY_SIZE(client); i++) {
>> +			qd[i] = spsc_queue_count(&client[i].entity->base.job_queue);
>> +			cycle[i] = READ_ONCE(client[i].cycle);
>> +			cps[i] = DIV_ROUND_UP(1000 * (cycle[i] - prev_cycle[i]),
> 
> Now you've lost me. All numbers so far always were microseconds (us).
> To get the cycles per second, don't you have to multiply by 1,000,000?

The extra factor of 10 comes from the 10 Hz logging period: the loop 
below samples every 100ms, so cycles counted per 100ms get multiplied by 
ten to give cycles per second. I need to move the period into a local 
const, then it will be self-documenting.
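
Roughly like this (assuming the 100ms msleep() below stays as the 
logging period):

	const unsigned int log_period_ms = 100; /* 10 Hz logging */
	...
	cps[i] = DIV_ROUND_UP(MSEC_PER_SEC * (cycle[i] - prev_cycle[i]),
			      log_period_ms);
	...
	msleep(log_period_ms);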

> 
>> +					      100);
>> +			if (client[i].cycles)
>> +				pct[i] = DIV_ROUND_UP(100 * (1 + cycle[i]),
> 
> Why 100 (here and above) and why +1?

The 100 is to calculate a percentage value of current cycle / total 
cycles.

IIRC the +1 is to reach 100% when the last cycle is total cycles - 1. So 
just for prettier output.
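
For example, with client->cycles == 10 the final loop iteration stores 
cycle == 9, and DIV_ROUND_UP(100 * (1 + 9), 10) == 100, while without 
the +1 the log would top out at 90%.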

Regards,

Tvrtko

> 
> 
> --
> 
> I want to go through the other patches this week, too. I guess I'll get
> up to speed with the code better in general.
> 
> 
> P.
> 
>> +						      client[i].cycles);
>> +			else
>> +				pct[i] = 0;
>> +			prev_cycle[i] = cycle[i];
>> +		}
>> +
>> +		if (READ_ONCE(client[0].done))
>> +			pr_info("\t+%6lldms:               ; %3u %5u %4u\n",
>> +				ktime_to_ms(ktime_sub(ktime_get(), start)),
>> +				pct[1], cps[1], qd[1]);
>> +		else if (READ_ONCE(client[1].done))
>> +			pr_info("\t+%6lldms: %3u %5u %4u;\n",
>> +				ktime_to_ms(ktime_sub(ktime_get(), start)),
>> +				pct[0], cps[0], qd[0]);
>> +		else
>> +			pr_info("\t+%6lldms: %3u %5u %4u; %3u %5u %4u\n",
>> +				ktime_to_ms(ktime_sub(ktime_get(), start)),
>> +				pct[0], cps[0], qd[0],
>> +				pct[1], cps[1], qd[1]);
>> +		msleep(100);
>> +	}
>> +
>> +	for (i = 0; i < ARRAY_SIZE(client); i++) {
>> +		kthread_flush_work(&client[i].work);
>> +		kthread_destroy_worker(client[i].worker);
>> +	}
>> +
>> +	for (i = 0; i < ARRAY_SIZE(client); i++)
>> +		KUNIT_ASSERT_TRUE(test, client[i].done);
>> +
>> +	for (i = 0; i < ARRAY_SIZE(client); i++) {
>> +		pr_info("    %u: prio=%s sync=%u elapsed_ms=%lldms (ideal_ms=%lldms) cycle_time(min,avg,max)=%u,%u,%u us latency_time(min,avg,max)=%u,%u,%u us",
>> +			i,
>> +			prio_str(params->client[i].priority),
>> +			params->client[i].sync,
>> +			ktime_to_ms(ktime_sub(client[i].end, client[i].start)),
>> +			ktime_to_ms(client[i].ideal_duration),
>> +			client[i].cycle_time.min_us,
>> +			get_stats_avg(&client[i].cycle_time, client[i].cycles),
>> +			client[i].cycle_time.max_us,
>> +			client[i].latency_time.min_us,
>> +			get_stats_avg(&client[i].latency_time, client[i].cycles),
>> +			client[i].latency_time.max_us);
>> +		drm_mock_sched_entity_free(client[i].entity);
>> +	}
>> +}
>> +
>> +static const struct kunit_attributes drm_sched_scheduler_two_clients_attr = {
>> +	.speed = KUNIT_SPEED_SLOW,
>> +};
>> +
>> +static struct kunit_case drm_sched_scheduler_two_clients_tests[] = {
>> +	KUNIT_CASE_PARAM_ATTR(drm_sched_scheduler_two_clients_test,
>> +			      drm_sched_scheduler_two_clients_gen_params,
>> +			      drm_sched_scheduler_two_clients_attr),
>> +	{}
>> +};
>> +
>> +static struct kunit_suite drm_sched_scheduler_two_clients1 = {
>> +	.name = "drm_sched_scheduler_two_clients_one_credit_tests",
>> +	.init = drm_sched_scheduler_init,
>> +	.exit = drm_sched_scheduler_exit,
>> +	.test_cases = drm_sched_scheduler_two_clients_tests,
>> +};
>> +
>> +static struct kunit_suite drm_sched_scheduler_two_clients2 = {
>> +	.name = "drm_sched_scheduler_two_clients_two_credits_tests",
>> +	.init = drm_sched_scheduler_init2,
>> +	.exit = drm_sched_scheduler_exit,
>> +	.test_cases = drm_sched_scheduler_two_clients_tests,
>> +};
>> +
>> +kunit_test_suites(&drm_sched_scheduler_overhead,
>> +		  &drm_sched_scheduler_two_clients1,
>> +		  &drm_sched_scheduler_two_clients2);
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 07/12] drm/sched: Account entity GPU time
  2025-09-24  9:11   ` Philipp Stanner
@ 2025-09-25 11:52     ` Tvrtko Ursulin
  2025-09-26 12:30       ` Philipp Stanner
  0 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-25 11:52 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost


On 24/09/2025 10:11, Philipp Stanner wrote:
> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>> To implement fair scheduling we need a view into the GPU time consumed by
>> entities. The problem we have is that job and entity objects have decoupled
>> lifetimes, where at the point we have an accurate view of the GPU time, we
>> can no longer link back to the entity.
>>
>> Solve this by adding a lightweight entity stats object which is reference
>> counted by both the entity and the job and hence can safely be used from
>> either side.
>>
>> With that, the only other thing we need is to add a helper for adding the
>> job's GPU time into the respective entity stats object, and call it once
>> the accurate GPU time has been calculated.
> 
> This in general also looks reasonable and clean to me. Some comments
> below.

Thanks!

> btw I've got many nit-comments for all patches, but will wait with
> those until a v1 is posted.

Let's try and strike a balance please. :)

>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c   | 39 +++++++++++++
>>   drivers/gpu/drm/scheduler/sched_internal.h | 66 ++++++++++++++++++++++
>>   drivers/gpu/drm/scheduler/sched_main.c     |  6 +-
>>   include/drm/gpu_scheduler.h                | 12 ++++
>>   4 files changed, 122 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 7a0a52ba87bf..04ce8b7d436b 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -32,6 +32,39 @@
>>   
>>   #include "gpu_scheduler_trace.h"
>>   
>> +
>> +/**
>> + * drm_sched_entity_stats_release - Entity stats kref release function
>> + *
>> + * @kref: Entity stats embedded kref pointer
>> + */
>> +void drm_sched_entity_stats_release(struct kref *kref)
>> +{
>> +	struct drm_sched_entity_stats *stats =
>> +		container_of(kref, typeof(*stats), kref);
>> +
>> +	kfree(stats);
>> +}
>> +
>> +/**
>> + * drm_sched_entity_stats_alloc - Allocate a new struct drm_sched_entity_stats object
>> + *
>> + * Returns: Pointer to newly allocated struct drm_sched_entity_stats object.
>> + */
>> +static struct drm_sched_entity_stats *drm_sched_entity_stats_alloc(void)
>> +{
>> +	struct drm_sched_entity_stats *stats;
>> +
>> +	stats = kzalloc(sizeof(*stats), GFP_KERNEL);
>> +	if (!stats)
>> +		return NULL;
>> +
>> +	kref_init(&stats->kref);
>> +	spin_lock_init(&stats->lock);
>> +
>> +	return stats;
>> +}
>> +
>>   /**
>>    * drm_sched_entity_init - Init a context entity used by scheduler when
>>    * submit to HW ring.
>> @@ -65,6 +98,11 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>   		return -EINVAL;
>>   
>>   	memset(entity, 0, sizeof(struct drm_sched_entity));
>> +
>> +	entity->stats = drm_sched_entity_stats_alloc();
>> +	if (!entity->stats)
>> +		return -ENOMEM;
>> +
>>   	INIT_LIST_HEAD(&entity->list);
>>   	entity->rq = NULL;
>>   	entity->guilty = guilty;
>> @@ -338,6 +376,7 @@ void drm_sched_entity_fini(struct drm_sched_entity *entity)
>>   
>>   	dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
>>   	RCU_INIT_POINTER(entity->last_scheduled, NULL);
>> +	drm_sched_entity_stats_put(entity->stats);
>>   }
>>   EXPORT_SYMBOL(drm_sched_entity_fini);
>>   
>> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
>> index 703ee48fbc58..27c8460a3601 100644
>> --- a/drivers/gpu/drm/scheduler/sched_internal.h
>> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
>> @@ -3,6 +3,22 @@
>>   #ifndef _DRM_GPU_SCHEDULER_INTERNAL_H_
>>   #define _DRM_GPU_SCHEDULER_INTERNAL_H_
>>   
>> +#include <linux/ktime.h>
>> +#include <linux/kref.h>
>> +#include <linux/spinlock.h>
>> +
>> +/**
>> + * struct drm_sched_entity_stats - execution stats for an entity.
>> + *
>> + * @kref: reference count for the object.
>> + * @lock: lock guarding the @runtime updates.
>> + * @runtime: time entity spent on the GPU.
>> + */
>> +struct drm_sched_entity_stats {
> 
> Cool docu.
> 
> Though, considering that awkward refcounting patterns are one of the
> more relevant scheduler problems, I believe it makes sense to add to
> the docu here a sentence or two about who refcounts this object for
> what reason.

Good idea - it is often not completely easy to find the commit message 
from git blame.

> 
>> +	struct kref	kref;
>> +	spinlock_t	lock;
>> +	ktime_t		runtime;
>> +};
>>   
>>   /* Used to choose between FIFO and RR job-scheduling */
>>   extern int drm_sched_policy;
>> @@ -93,4 +109,54 @@ drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>   	return true;
>>   }
>>   
>> +void drm_sched_entity_stats_release(struct kref *kref);
>> +
>> +/**
>> + * drm_sched_entity_stats_get - Obtain a reference count on struct drm_sched_entity_stats object
>> + *
>> + * @stats: struct drm_sched_entity_stats pointer
>> + *
>> + * Returns: struct drm_sched_entity_stats pointer
>> + */
>> +static inline struct drm_sched_entity_stats *
>> +drm_sched_entity_stats_get(struct drm_sched_entity_stats *stats)
>> +{
>> +	kref_get(&stats->kref);
>> +
>> +	return stats;
>> +}
>> +
>> +/**
>> + * drm_sched_entity_stats_put - Release a reference count on struct drm_sched_entity_stats object
>> + *
>> + * @stats: struct drm_sched_entity_stats pointer
>> + */
>> +static inline void
>> +drm_sched_entity_stats_put(struct drm_sched_entity_stats *stats)
>> +{
>> +	kref_put(&stats->kref, drm_sched_entity_stats_release);
>> +}
>> +
>> +/**
>> + * drm_sched_entity_stats_job_add_gpu_time - Account job execution time to entity
>> + *
>> + * @job: Scheduler job to account.
>> + *
>> + * Accounts the execution time of @job to its respective entity stats object.
>> + */
>> +static inline void
>> +drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
>> +{
>> +	struct drm_sched_entity_stats *stats = job->entity_stats;
>> +	struct drm_sched_fence *s_fence = job->s_fence;
>> +	ktime_t start, end;
>> +
>> +	start = dma_fence_timestamp(&s_fence->scheduled);
>> +	end = dma_fence_timestamp(&s_fence->finished);
>> +
>> +	spin_lock(&stats->lock);
>> +	stats->runtime = ktime_add(stats->runtime, ktime_sub(end, start));
>> +	spin_unlock(&stats->lock);
>> +}
>> +
>>   #endif
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 9411676e772a..a5d7706efbea 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -658,6 +658,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>>   
>>   	job->sched = sched;
>>   	job->s_priority = entity->priority;
>> +	job->entity_stats = drm_sched_entity_stats_get(entity->stats);
>>   
>>   	drm_sched_fence_init(job->s_fence, job->entity);
>>   }
>> @@ -846,6 +847,7 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
>>   		 * been called.
>>   		 */
>>   		dma_fence_put(&job->s_fence->finished);
>> +		drm_sched_entity_stats_put(job->entity_stats);
>>   	} else {
>>   		/* The job was aborted before it has been committed to be run;
>>   		 * notably, drm_sched_job_arm() has not been called.
>> @@ -997,8 +999,10 @@ static void drm_sched_free_job_work(struct work_struct *w)
>>   		container_of(w, struct drm_gpu_scheduler, work_free_job);
>>   	struct drm_sched_job *job;
>>   
>> -	while ((job = drm_sched_get_finished_job(sched)))
>> +	while ((job = drm_sched_get_finished_job(sched))) {
>> +		drm_sched_entity_stats_job_add_gpu_time(job);
>>   		sched->ops->free_job(job);
>> +	}
> 
> The biggest question I'm wondering about is whether the reference
> should be put in drm_sched_job_cleanup() or drm_sched_free_job_work().
> The former is semantically more correct, probably, but the latter is
> always controlled by the scheduler.
> 
> Hypothetically a driver could just not call drm_sched_job_cleanup() in
> its free_job() callback – but then other stuff would leak as well and
> everything would be broken anyways.

I think drm_sched_entity_stats_put() from drm_sched_job_cleanup() makes 
sense because the latter is where other actions done by drm_sched_job_arm() 
are cleaned up. And drm_sched_job_arm() is where I placed 
drm_sched_entity_stats_get(). So it all looks nicely aligned to me.
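
In other words, just restating the pairing from the patch:

	drm_sched_entity_init()   -> drm_sched_entity_stats_alloc()
	drm_sched_job_arm()       -> drm_sched_entity_stats_get()
	drm_sched_job_cleanup()   -> drm_sched_entity_stats_put() (job ref)
	drm_sched_entity_fini()   -> drm_sched_entity_stats_put() (entity ref)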

> I probably should change the docu for drm_sched_job_cleanup(), which
> currently says drivers "should" call it in free_job(), instead of
> "must".

Yeah I agree s/should/must/ is in order.
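
Perhaps something along these lines (exact wording up to you):

 * Drivers must call this from their free_job() callback, otherwise the
 * scheduler fence, and with this series also the entity stats reference
 * taken in drm_sched_job_arm(), is leaked.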

Regards,

Tvrtko

>>   
>>   	drm_sched_run_job_queue(sched);
>>   }
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 802a0060db55..f33c78473867 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -71,6 +71,8 @@ enum drm_sched_priority {
>>   	DRM_SCHED_PRIORITY_COUNT
>>   };
>>   
>> +struct drm_sched_entity_stats;
>> +
>>   /**
>>    * struct drm_sched_entity - A wrapper around a job queue (typically
>>    * attached to the DRM file_priv).
>> @@ -110,6 +112,11 @@ struct drm_sched_entity {
>>   	 */
>>   	struct drm_sched_rq		*rq;
>>   
>> +	/**
>> +	 * @stats: Stats object reference held by the entity and jobs.
>> +	 */
>> +	struct drm_sched_entity_stats	*stats;
>> +
>>   	/**
>>   	 * @sched_list:
>>   	 *
>> @@ -365,6 +372,11 @@ struct drm_sched_job {
>>   	struct drm_sched_fence		*s_fence;
>>   	struct drm_sched_entity         *entity;
>>   
>> +	/**
>> +	 * @entity_stats: Stats object reference held by the job and entity.
>> +	 */
>> +	struct drm_sched_entity_stats	*entity_stats;
>> +
>>   	enum drm_sched_priority		s_priority;
>>   	u32				credits;
>>   	/** @last_dependency: tracks @dependencies as they signal */
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 08/12] drm/sched: Remove idle entity from tree
  2025-09-24  9:15       ` Philipp Stanner
@ 2025-09-25 12:23         ` Tvrtko Ursulin
  0 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-25 12:23 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost


On 24/09/2025 10:15, Philipp Stanner wrote:
> On Thu, 2025-09-11 at 16:06 +0100, Tvrtko Ursulin wrote:
>>
>> On 11/09/2025 15:32, Philipp Stanner wrote:
>>> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>>>> There is no need to keep entities with no jobs in the tree, so let's remove
>>>> them once the last job is consumed. This keeps the tree smaller, which is
>>>> nicer and more efficient, as entities are removed and re-added on every
>>>> popped job.
>>>
>>> This reads suspiciously as if it could be an independent patch, not
>>> necessarily tied to this series. I see it depends on the _pop()
>>> function you added.
>>>
>>> I think you'd want to make it a bit more obvious that this is not so
>>> much a general improvement as it is a preparation for followup work. Or
>>> could it be made generic for the current in-tree scheduler?
>>
>> Both is true. There is currently no reason idle entities _need_ to be in
>> the tree. Removing them would improve O(log n) on the rbtree. But also
>> fair scheduler relies on it, see below...
>>
>>>> Apart from that, the upcoming fair scheduling algorithm will rely on the
>>>> tree only containing runnable entities.
>>
>> ... ^^^ here.
> 
> Yes, I saw that.
> 
> I wanted to stress where I'm coming from: generic code improvements
> should ideally be posted as separate patches, because that makes it
> easier to review and quicker to merge (and easier to revert should a
> problem be detected before the subsequent CFS series is merged)
> 
> So, can you submit this patch separately without too much effort? :)

It actually started as the 2nd patch in the series back in Dec/24, but 
since no one was convinced of its worth on its own, I kept moving it 
later and later in the series. Up to the point where the series actually 
depends on it.

I think at this point it is better if it stays where it is, because 
without the fair scheduler the benefit is harder to justify. Yes, it 
reduces the rbtree rebalancing cost, but the importance of that only 
grows when there are a lot of open DRM nodes, or clients with a lot of 
entities. On the other hand it adds a spinlock to the job pop flow, and 
I did not measure the effect in either scenario. At least with the fair 
scheduler the benefit is clear.

Regards,

Tvrtko

>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Philipp Stanner <phasta@kernel.org>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_rq.c | 28 +++++++++++++++++-----------
>>>>    1 file changed, 17 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
>>>> index 16d57461765e..67804815ca0d 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_rq.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
>>>> @@ -19,6 +19,9 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
>>>>    static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>>>>    					    struct drm_sched_rq *rq)
>>>>    {
>>>> +	lockdep_assert_held(&entity->lock);
>>>> +	lockdep_assert_held(&rq->lock);
>>>
>>> The callers of drm_sched_rq_remove_fifo_locked() already have some
>>> lockdep asserts, have you seen them? Are those here redundant /
>>> additional?
>>>
>>> And are they strictly related to this patch?
>>
>> Yes, drm_sched_rq_pop_entity() is the new caller, which needs to take
>> both locks on its own. So IMO makes sense to add the asserts.
>>
>>> Maybe you want to investigate the other lockdep assertions and, if
>>> there's room for improvement, address that in a dedicated patch.
>>
>> They look okay to me. Are you seeing something is off?
> 
> No, they do look correct. It's just that we have a bit of redundancy
> then, but that's probably a good thing for robustness.
> 
> P.
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>> +
>>>>    	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
>>>>    		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
>>>>    		RB_CLEAR_NODE(&entity->rb_tree_node);
>>>> @@ -158,24 +161,27 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>>>    {
>>>>    	struct drm_sched_job *next_job;
>>>>    	struct drm_sched_rq *rq;
>>>> -	ktime_t ts;
>>>>    
>>>>    	/*
>>>>    	 * Update the entity's location in the min heap according to
>>>>    	 * the timestamp of the next job, if any.
>>>>    	 */
>>>> +	spin_lock(&entity->lock);
>>>> +	rq = entity->rq;
>>>> +	spin_lock(&rq->lock);
>>>>    	next_job = drm_sched_entity_queue_peek(entity);
>>>> -	if (!next_job)
>>>> -		return;
>>>> +	if (next_job) {
>>>> +		ktime_t ts;
>>>>    
>>>> -	spin_lock(&entity->lock);
>>>> -	rq = entity->rq;
>>>> -	spin_lock(&rq->lock);
>>>> -	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>> -		ts = next_job->submit_ts;
>>>> -	else
>>>> -		ts = drm_sched_rq_get_rr_ts(rq, entity);
>>>> -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>>>> +		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>>> +			ts = next_job->submit_ts;
>>>> +		else
>>>> +			ts = drm_sched_rq_get_rr_ts(rq, entity);
>>>> +
>>>> +		drm_sched_rq_update_fifo_locked(entity, rq, ts);
>>>> +	} else {
>>>> +		drm_sched_rq_remove_fifo_locked(entity, rq);
>>>> +	}
>>>>    	spin_unlock(&rq->lock);
>>>>    	spin_unlock(&entity->lock);
>>>>    }
>>>
>>
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 09/12] drm/sched: Add fair scheduling policy
  2025-09-24  9:38   ` Philipp Stanner
@ 2025-09-25 12:34     ` Tvrtko Ursulin
  0 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-25 12:34 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost, Pierre-Eric Pelloux-Prayer


On 24/09/2025 10:38, Philipp Stanner wrote:
> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>> Fair scheduling policy is built upon the same concepts as the well known
>> CFS kernel scheduler - the entity run queue is sorted by the virtual GPU
>> time consumed by entities, such that the entity with the least vruntime
>> runs first.
>>
>> It is able to avoid total priority starvation, which is one of the
>> problems with FIFO, and it also eliminates the need for per priority run
>> queues.
>>
> 
> If that's the case, then why are the runqueues removed in the
> subsequent patches?

I am not entirely sure what you mean here. Maybe you meant why they are 
removed _only_ in the subsequent patches? Only once FIFO and RR are 
removed from the code can we remove the multiple run queues.

>>   As it scales the actual GPU runtime by an exponential factor as
>> the priority decreases, the virtual runtime for low priority entities
>> grows faster than for normal priority ones, pushing them further down
>> the runqueue order for the same real GPU time spent.
>>
>> Apart from this fundamental fairness, fair policy is especially strong in
>> oversubscription workloads where it is able to give more GPU time to short
>> and bursty workloads when they are running in parallel with GPU heavy
>> clients submitting deep job queues.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c   |  28 ++--
>>   drivers/gpu/drm/scheduler/sched_internal.h |   7 +-
>>   drivers/gpu/drm/scheduler/sched_main.c     |  14 +-
>>   drivers/gpu/drm/scheduler/sched_rq.c       | 147 ++++++++++++++++++++-
>>   include/drm/gpu_scheduler.h                |  10 +-
>>   5 files changed, 186 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 04ce8b7d436b..58f51875547a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -108,6 +108,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>   	entity->guilty = guilty;
>>   	entity->num_sched_list = num_sched_list;
>>   	entity->priority = priority;
>> +	entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
>> +			      DRM_SCHED_PRIORITY_KERNEL : priority;
> 
> Hm, what do you do here? Why is CFS temporarily running with KERNEL
> prio?

At this point in the series we have two priorities associated with entities.

The original priority is the userspace priority and the new rq_priority 
is used to pick an actual run queue.

For FIFO and RR, priority == rq_priority. For FAIR, rq_priority is a 
fixed number. It can be any of the existing ones really; it is not about 
priority, just about selecting a run queue object to use.
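
Schematically (not the literal patch code):

	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
		entity->rq_priority = DRM_SCHED_PRIORITY_KERNEL; /* single rq */
	else
		entity->rq_priority = entity->priority; /* one rq per priority */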

> AFAIU CFS eliminates the need for runqueues because there will be no
> priorities anymore.
> 
> But with this patch, it's for now just a new scheduling policy next to
> FIFO and RR and your plan is to then to remove those policies
> altogether.
> 
> Priorities will be kept for users, but they will then set the scaling
> factor.
> 
> Correct?

Yes.
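
To put numbers on it, with the shifts this patch uses below (kernel = 1, 
high = 2, normal = 4, low = 7), one millisecond of real GPU time accrues:

	kernel: 1ms << 1 =   2ms of vruntime
	high:   1ms << 2 =   4ms
	normal: 1ms << 4 =  16ms
	low:    1ms << 7 = 128ms

So for the same vruntime growth a low priority entity receives one 
eighth of the GPU time of a normal priority one.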

> I think I understand the series relatively well by now, but here at the
> transition point some clarifications would be nice. Maybe the commit
> message is what confuses me, since it says that runqueues should be
> surplus in this patch already, which I don't think is true.

This policy does not need the per-priority run queues. Others still do. 
I think I will reword the commit message to say "does not need" rather 
than "eliminates".

>>   	/*
>>   	 * It's perfectly valid to initialize an entity without having a valid
>>   	 * scheduler attached. It's just not valid to use the scheduler before it
>> @@ -124,17 +126,23 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>   		 */
>>   		pr_warn("%s: called with uninitialized scheduler\n", __func__);
>>   	} else if (num_sched_list) {
>> -		/* The "priority" of an entity cannot exceed the number of run-queues of a
>> -		 * scheduler. Protect against num_rqs being 0, by converting to signed. Choose
>> -		 * the lowest priority available.
>> +		enum drm_sched_priority p = entity->priority;
>> +
>> +		/*
>> +		 * The "priority" of an entity cannot exceed the number of
>> +		 * run-queues of a scheduler. Protect against num_rqs being 0,
>> +		 * by converting to signed. Choose the lowest priority
>> +		 * available.
>>   		 */
>> -		if (entity->priority >= sched_list[0]->num_rqs) {
>> -			dev_err(sched_list[0]->dev, "entity has out-of-bounds priority: %u. num_rqs: %u\n",
>> -				entity->priority, sched_list[0]->num_rqs);
>> -			entity->priority = max_t(s32, (s32) sched_list[0]->num_rqs - 1,
>> -						 (s32) DRM_SCHED_PRIORITY_KERNEL);
>> +		if (p >= sched_list[0]->num_user_rqs) {
>> +			dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
>> +				p, sched_list[0]->num_user_rqs);
>> +			p = max_t(s32,
>> +				 (s32)sched_list[0]->num_user_rqs - 1,
>> +				 (s32)DRM_SCHED_PRIORITY_KERNEL);
>> +			entity->priority = p;
>>   		}
>> -		entity->rq = sched_list[0]->sched_rq[entity->priority];
>> +		entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
>>   	}
>>   
>>   	init_completion(&entity->entity_idle);
>> @@ -567,7 +575,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>   
>>   	spin_lock(&entity->lock);
>>   	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
>> -	rq = sched ? sched->sched_rq[entity->priority] : NULL;
>> +	rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
>>   	if (rq != entity->rq) {
>>   		drm_sched_rq_remove_entity(entity->rq, entity);
>>   		entity->rq = rq;
>> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
>> index 27c8460a3601..125aba70eda6 100644
>> --- a/drivers/gpu/drm/scheduler/sched_internal.h
>> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
>> @@ -18,13 +18,16 @@ struct drm_sched_entity_stats {
>>   	struct kref	kref;
>>   	spinlock_t	lock;
>>   	ktime_t		runtime;
>> +	ktime_t		prev_runtime;
>> +	u64		vruntime;
> 
> These two lack new documentation.

Ack.

>>   };
>>   
>>   /* Used to choose between FIFO and RR job-scheduling */
>>   extern int drm_sched_policy;
>>   
>> -#define DRM_SCHED_POLICY_RR    0
>> -#define DRM_SCHED_POLICY_FIFO  1
>> +#define DRM_SCHED_POLICY_RR	0
>> +#define DRM_SCHED_POLICY_FIFO	1
>> +#define DRM_SCHED_POLICY_FAIR	2
> 
> What's going on with the git diff here?

Looks like a tabs vs spaces snafu. Will fix.

>>   bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>>   			 struct drm_sched_entity *entity);
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a5d7706efbea..e7726095c19a 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -84,13 +84,13 @@
>>   #define CREATE_TRACE_POINTS
>>   #include "gpu_scheduler_trace.h"
>>   
>> -int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
>> +int drm_sched_policy = DRM_SCHED_POLICY_FAIR;
> 
> If you want to do that, the commit message should state explicitly that
> you are changing the default policy in this patch. That's an important
> change.

True. I will make it a separate patch.

>>   
>>   /**
>>    * DOC: sched_policy (int)
>>    * Used to override default entities scheduling policy in a run queue.
>>    */
>> -MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
>> +MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
>>   module_param_named(sched_policy, drm_sched_policy, int, 0444);
>>   
>>   static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>> @@ -1132,11 +1132,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>   		sched->own_submit_wq = true;
>>   	}
>>   
>> -	sched->sched_rq = kmalloc_array(args->num_rqs, sizeof(*sched->sched_rq),
>> +	sched->num_user_rqs = args->num_rqs;
>> +	sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
>> +			 args->num_rqs : 1;
>> +	sched->sched_rq = kmalloc_array(sched->num_rqs,
>> +					sizeof(*sched->sched_rq),
>>   					GFP_KERNEL | __GFP_ZERO);
>>   	if (!sched->sched_rq)
>>   		goto Out_check_own;
>> -	sched->num_rqs = args->num_rqs;
>> +
>>   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>   		sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
>>   		if (!sched->sched_rq[i])
>> @@ -1278,7 +1282,7 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>   	if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
>>   		atomic_inc(&bad->karma);
>>   
>> -		for (i = DRM_SCHED_PRIORITY_HIGH; i < sched->num_rqs; i++) {
>> +		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>   			struct drm_sched_rq *rq = sched->sched_rq[i];
>>   
>>   			spin_lock(&rq->lock);
>> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
>> index 67804815ca0d..b0cf7d2143c8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_rq.c
>> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
>> @@ -16,6 +16,24 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
>>   	return ktime_before(ea->oldest_job_waiting, eb->oldest_job_waiting);
>>   }
>>   
>> +static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
>> +{
>> +	enum drm_sched_priority prio = -1;
>> +	struct rb_node *rb;
>> +
>> +	lockdep_assert_held(&rq->lock);
>> +
>> +	rb = rb_first_cached(&rq->rb_tree_root);
>> +	if (rb) {
>> +		struct drm_sched_entity *entity =
>> +			rb_entry(rb, typeof(*entity), rb_tree_node);
>> +
>> +		prio = entity->priority; /* Unlocked read */
>> +	}
>> +
>> +	rq->head_prio = prio;
>> +}
>> +
>>   static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>>   					    struct drm_sched_rq *rq)
>>   {
>> @@ -25,6 +43,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>>   	if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
>>   		rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
>>   		RB_CLEAR_NODE(&entity->rb_tree_node);
>> +		drm_sched_rq_update_prio(rq);
>>   	}
>>   }
>>   
>> @@ -46,6 +65,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>>   
>>   	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
>>   		      drm_sched_entity_compare_before);
>> +	drm_sched_rq_update_prio(rq);
>>   }
>>   
>>   /**
>> @@ -63,6 +83,114 @@ void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>>   	INIT_LIST_HEAD(&rq->entities);
>>   	rq->rb_tree_root = RB_ROOT_CACHED;
>>   	rq->sched = sched;
>> +	rq->head_prio = -1;
>> +}
>> +
>> +static ktime_t
>> +drm_sched_rq_get_min_vruntime(struct drm_sched_rq *rq)
>> +{
>> +	struct drm_sched_entity *entity;
>> +	struct rb_node *rb;
>> +
>> +	lockdep_assert_held(&rq->lock);
>> +
>> +	for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
>> +		entity = rb_entry(rb, typeof(*entity), rb_tree_node);
>> +
>> +		return entity->stats->vruntime; /* Unlocked read */
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void
>> +drm_sched_entity_save_vruntime(struct drm_sched_entity *entity,
>> +			       ktime_t min_vruntime)
>> +{
>> +	struct drm_sched_entity_stats *stats = entity->stats;
>> +	ktime_t vruntime;
>> +
>> +	spin_lock(&stats->lock);
>> +	vruntime = stats->vruntime;
>> +	if (min_vruntime && vruntime > min_vruntime)
>> +		vruntime = ktime_sub(vruntime, min_vruntime);
>> +	else
>> +		vruntime = 0;
>> +	stats->vruntime = vruntime;
>> +	spin_unlock(&stats->lock);
>> +}
>> +
>> +static ktime_t
>> +drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
>> +				  ktime_t min_vruntime,
>> +				  enum drm_sched_priority rq_prio)
>> +{
>> +	struct drm_sched_entity_stats *stats = entity->stats;
>> +	enum drm_sched_priority prio = entity->priority;
>> +	ktime_t vruntime;
>> +
>> +	BUILD_BUG_ON(DRM_SCHED_PRIORITY_NORMAL < DRM_SCHED_PRIORITY_HIGH);
>> +
>> +	spin_lock(&stats->lock);
>> +	vruntime = stats->vruntime;
>> +
>> +	/*
>> +	 * Special handling for entities which were picked from the top of the
>> +	 * queue and are now re-joining the top with another one already there.
>> +	 */
>> +	if (!vruntime && min_vruntime) {
>> +		if (prio > rq_prio) {
>> +			/*
>> +			 * Lower priority should not overtake higher when re-
>> +			 * joining at the top of the queue.
>> +			 */
>> +			vruntime = us_to_ktime(prio - rq_prio);
>> +		} else if (prio < rq_prio) {
>> +			/*
>> +			 * Higher priority can go first.
>> +			 */
>> +			vruntime = -us_to_ktime(rq_prio - prio);
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Restore saved relative position in the queue.
>> +	 */
>> +	vruntime = ktime_add(min_vruntime, vruntime);
>> +
>> +	stats->vruntime = vruntime;
>> +	spin_unlock(&stats->lock);
>> +
>> +	return vruntime;
>> +}
>> +
>> +static ktime_t drm_sched_entity_update_vruntime(struct drm_sched_entity *entity)
>> +{
>> +	static const unsigned int shift[] = {
>> +		[DRM_SCHED_PRIORITY_KERNEL] = 1,
>> +		[DRM_SCHED_PRIORITY_HIGH]   = 2,
>> +		[DRM_SCHED_PRIORITY_NORMAL] = 4,
>> +		[DRM_SCHED_PRIORITY_LOW]    = 7,
>> +	};
>> +	struct drm_sched_entity_stats *stats = entity->stats;
>> +	ktime_t runtime, prev;
>> +
>> +	spin_lock(&stats->lock);
>> +	prev = stats->prev_runtime;
>> +	runtime = stats->runtime;
>> +	stats->prev_runtime = runtime;
>> +	runtime = ktime_add_ns(stats->vruntime,
>> +			       ktime_to_ns(ktime_sub(runtime, prev)) <<
>> +			       shift[entity->priority]);
>> +	stats->vruntime = runtime;
>> +	spin_unlock(&stats->lock);
>> +
>> +	return runtime;
>> +}
>> +
>> +static ktime_t drm_sched_entity_get_job_ts(struct drm_sched_entity *entity)
>> +{
>> +	return drm_sched_entity_update_vruntime(entity);
>>   }
>>   
>>   /**
>> @@ -99,8 +227,14 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>>   		list_add_tail(&entity->list, &rq->entities);
>>   	}
>>   
>> -	if (drm_sched_policy == DRM_SCHED_POLICY_RR)
>> +	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
>> +		ts = drm_sched_rq_get_min_vruntime(rq);
>> +		ts = drm_sched_entity_restore_vruntime(entity, ts,
>> +						       rq->head_prio);
>> +	} else if (drm_sched_policy == DRM_SCHED_POLICY_RR) {
>>   		ts = entity->rr_ts;
>> +	}
>> +
>>   	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>>   
>>   	spin_unlock(&rq->lock);
>> @@ -173,7 +307,9 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>   	if (next_job) {
>>   		ktime_t ts;
>>   
>> -		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> +		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
>> +			ts = drm_sched_entity_get_job_ts(entity);
>> +		else if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>>   			ts = next_job->submit_ts;
>>   		else
>>   			ts = drm_sched_rq_get_rr_ts(rq, entity);
>> @@ -181,6 +317,13 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>   		drm_sched_rq_update_fifo_locked(entity, rq, ts);
>>   	} else {
>>   		drm_sched_rq_remove_fifo_locked(entity, rq);
>> +
>> +		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
>> +			ktime_t min_vruntime;
>> +
>> +			min_vruntime = drm_sched_rq_get_min_vruntime(rq);
>> +			drm_sched_entity_save_vruntime(entity, min_vruntime);
>> +		}
>>   	}
>>   	spin_unlock(&rq->lock);
>>   	spin_unlock(&entity->lock);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index f33c78473867..327b75a052c7 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -150,6 +150,11 @@ struct drm_sched_entity {
>>   	 */
>>   	enum drm_sched_priority         priority;
>>   
>> +	/**
>> +	 * @rq_priority: Run-queue priority
>> +	 */
>> +	enum drm_sched_priority         rq_priority;
>> +
>>   	/**
>>   	 * @rr_ts:
>>   	 *
>> @@ -254,10 +259,11 @@ struct drm_sched_entity {
>>    * struct drm_sched_rq - queue of entities to be scheduled.
>>    *
>>    * @sched: the scheduler to which this rq belongs to.
>> - * @lock: protects @entities, @rb_tree_root and @rr_ts.
>> + * @lock: protects @entities, @rb_tree_root, @rr_ts and @head_prio.
>>    * @rr_ts: monotonically incrementing fake timestamp for RR mode
>>    * @entities: list of the entities to be scheduled.
>>    * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
>> + * @head_prio: priority of the top tree element
>>    *
>>    * Run queue is a set of entities scheduling command submissions for
>>    * one specific ring. It implements the scheduling policy that selects
>> @@ -271,6 +277,7 @@ struct drm_sched_rq {
>>   	ktime_t				rr_ts;
>>   	struct list_head		entities;
>>   	struct rb_root_cached		rb_tree_root;
>> +	enum drm_sched_priority		head_prio;
>>   };
>>   
>>   /**
>> @@ -597,6 +604,7 @@ struct drm_gpu_scheduler {
>>   	long				timeout;
>>   	const char			*name;
>>   	u32                             num_rqs;
>> +	u32                             num_user_rqs;
> 
> Documentation missing.

Ack.

Regards,

Tvrtko

> 
> 
> P.
> 
>>   	struct drm_sched_rq             **sched_rq;
>>   	wait_queue_head_t		job_scheduled;
>>   	atomic64_t			job_id_count;
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue
  2025-09-24 11:50   ` Philipp Stanner
@ 2025-09-25 12:56     ` Tvrtko Ursulin
  2025-09-26 13:32       ` Philipp Stanner
  0 siblings, 1 reply; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-25 12:56 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost


On 24/09/2025 12:50, Philipp Stanner wrote:
> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
>> If the new fair policy is at least as good as FIFO and we can afford to
>> remove round-robin, we can simplify the scheduler code by making the
>> scheduler-to-run-queue relationship always 1:1 and remove some code.
>>
>> Also, now that the FIFO policy is gone the tree of entities is not a FIFO
>> tree any more, so rename it to just the tree.
> 
> I think that this patch should be the one that sets the default
> scheduling policy to CFS.
> 
> Either that or a separate patch at an appropriate position before #11.

Yep, it will be separate.

>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  23 ++-
>>   drivers/gpu/drm/scheduler/sched_entity.c   |  29 +---
>>   drivers/gpu/drm/scheduler/sched_internal.h |   9 +-
>>   drivers/gpu/drm/scheduler/sched_main.c     | 163 ++++++---------------
>>   drivers/gpu/drm/scheduler/sched_rq.c       |  60 ++------
>>   include/drm/gpu_scheduler.h                |  30 +---
>>   6 files changed, 78 insertions(+), 236 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index d020a890a0ea..bc07fd57310c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -434,25 +434,22 @@ drm_sched_entity_queue_pop(struct drm_sched_entity *entity)
>>   
>>   void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
>>   {
>> +	struct drm_sched_rq *rq = sched->rq;
>> +	struct drm_sched_entity *s_entity;
>>   	struct drm_sched_job *s_job;
>> -	struct drm_sched_entity *s_entity = NULL;
>> -	int i;
>>   
>>   	/* Signal all jobs not yet scheduled */
>> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>> -		struct drm_sched_rq *rq = sched->sched_rq[i];
>> -		spin_lock(&rq->lock);
>> -		list_for_each_entry(s_entity, &rq->entities, list) {
>> -			while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
>> -				struct drm_sched_fence *s_fence = s_job->s_fence;
>> +	spin_lock(&rq->lock);
>> +	list_for_each_entry(s_entity, &rq->entities, list) {
>> +		while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
>> +			struct drm_sched_fence *s_fence = s_job->s_fence;
>>   
>> -				dma_fence_signal(&s_fence->scheduled);
>> -				dma_fence_set_error(&s_fence->finished, -EHWPOISON);
>> -				dma_fence_signal(&s_fence->finished);
>> -			}
>> +			dma_fence_signal(&s_fence->scheduled);
>> +			dma_fence_set_error(&s_fence->finished, -EHWPOISON);
>> +			dma_fence_signal(&s_fence->finished);
>>   		}
>> -		spin_unlock(&rq->lock);
>>   	}
>> +	spin_unlock(&rq->lock);
> 
> Yoah, well. Seems there is no easy fix to remove that from amdgpu,
> since it's used during GPU recovery.
> 
> Fine from my side if it's mentioned and explained in the commit
> message.

Yep. Note there is a "technical debt" TODO marker just above this 
function already.

>>   
>>   	/* Signal all jobs already scheduled to HW */
>>   	list_for_each_entry(s_job, &sched->pending_list, list) {
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 58f51875547a..6dd30b85925b 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -108,8 +108,6 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>   	entity->guilty = guilty;
>>   	entity->num_sched_list = num_sched_list;
>>   	entity->priority = priority;
>> -	entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
>> -			      DRM_SCHED_PRIORITY_KERNEL : priority;
>>   	/*
>>   	 * It's perfectly valid to initialize an entity without having a valid
>>   	 * scheduler attached. It's just not valid to use the scheduler before it
>> @@ -119,30 +117,14 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>   	RCU_INIT_POINTER(entity->last_scheduled, NULL);
>>   	RB_CLEAR_NODE(&entity->rb_tree_node);
>>   
>> -	if (num_sched_list && !sched_list[0]->sched_rq) {
>> +	if (num_sched_list && !sched_list[0]->rq) {
>>   		/* Since every entry covered by num_sched_list
>>   		 * should be non-NULL and therefore we warn drivers
>>   		 * not to do this and to fix their DRM calling order.
>>   		 */
>>   		pr_warn("%s: called with uninitialized scheduler\n", __func__);
>>   	} else if (num_sched_list) {
>> -		enum drm_sched_priority p = entity->priority;
>> -
>> -		/*
>> -		 * The "priority" of an entity cannot exceed the number of
>> -		 * run-queues of a scheduler. Protect against num_rqs being 0,
>> -		 * by converting to signed. Choose the lowest priority
>> -		 * available.
>> -		 */
>> -		if (p >= sched_list[0]->num_user_rqs) {
>> -			dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
>> -				p, sched_list[0]->num_user_rqs);
>> -			p = max_t(s32,
>> -				 (s32)sched_list[0]->num_user_rqs - 1,
>> -				 (s32)DRM_SCHED_PRIORITY_KERNEL);
>> -			entity->priority = p;
>> -		}
>> -		entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
>> +		entity->rq = sched_list[0]->rq;
>>   	}
>>   
>>   	init_completion(&entity->entity_idle);
>> @@ -575,7 +557,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>   
>>   	spin_lock(&entity->lock);
>>   	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
>> -	rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
>> +	rq = sched ? sched->rq : NULL;
>>   	if (rq != entity->rq) {
>>   		drm_sched_rq_remove_entity(entity->rq, entity);
>>   		entity->rq = rq;
>> @@ -599,7 +581,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   {
>>   	struct drm_sched_entity *entity = sched_job->entity;
>>   	bool first;
>> -	ktime_t submit_ts;
>>   
>>   	trace_drm_sched_job_queue(sched_job, entity);
>>   
>> @@ -616,16 +597,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   	/*
>>   	 * After the sched_job is pushed into the entity queue, it may be
>>   	 * completed and freed up at any time. We can no longer access it.
>> -	 * Make sure to set the submit_ts first, to avoid a race.
>>   	 */
>> -	sched_job->submit_ts = submit_ts = ktime_get();
> 
> Can submit_ts be removed completely?

It is. Or do you mean, is it safe? If there are no more users, why not?

> 
>>   	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>>   
>>   	/* first job wakes up scheduler */
>>   	if (first) {
>>   		struct drm_gpu_scheduler *sched;
>>   
>> -		sched = drm_sched_rq_add_entity(entity, submit_ts);
>> +		sched = drm_sched_rq_add_entity(entity);
>>   		if (sched)
>>   			drm_sched_wakeup(sched);
>>   	}
>> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
>> index 125aba70eda6..6e5ed721bb5b 100644
>> --- a/drivers/gpu/drm/scheduler/sched_internal.h
>> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
>> @@ -22,13 +22,6 @@ struct drm_sched_entity_stats {
>>   	u64		vruntime;
>>   };
>>   
>> -/* Used to choose between FIFO and RR job-scheduling */
>> -extern int drm_sched_policy;
>> -
>> -#define DRM_SCHED_POLICY_RR	0
>> -#define DRM_SCHED_POLICY_FIFO	1
>> -#define DRM_SCHED_POLICY_FAIR	2
>> -
>>   bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>>   			 struct drm_sched_entity *entity);
>>   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>> @@ -39,7 +32,7 @@ struct drm_sched_entity *
>>   drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
>>   			   struct drm_sched_rq *rq);
>>   struct drm_gpu_scheduler *
>> -drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
>> +drm_sched_rq_add_entity(struct drm_sched_entity *entity);
>>   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>   				struct drm_sched_entity *entity);
>>   void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index e7726095c19a..a9079bdb27d3 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -84,15 +84,6 @@
>>   #define CREATE_TRACE_POINTS
>>   #include "gpu_scheduler_trace.h"
>>   
>> -int drm_sched_policy = DRM_SCHED_POLICY_FAIR;
>> -
>> -/**
>> - * DOC: sched_policy (int)
>> - * Used to override default entities scheduling policy in a run queue.
>> - */
>> -MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
>> -module_param_named(sched_policy, drm_sched_policy, int, 0444);
>> -
> 
> Anyone know what happens if we remove that param and some user still
> tries to set it?

Just an "unknown parameter ignored" message.
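
I.e. module load still succeeds and the kernel logs something like this 
(message format quoted from memory, treat it as illustrative):

	gpu_sched: unknown parameter 'sched_policy' ignored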

>>   static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
>>   {
>>   	u32 credits;
>> @@ -876,34 +867,6 @@ void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>>   	drm_sched_run_job_queue(sched);
>>   }
>>   
>> -/**
>> - * drm_sched_select_entity - Select next entity to process
>> - *
>> - * @sched: scheduler instance
>> - *
>> - * Return an entity to process or NULL if none are found.
>> - *
>> - * Note, that we break out of the for-loop when "entity" is non-null, which can
>> - * also be an error-pointer--this assures we don't process lower priority
>> - * run-queues. See comments in the respectively called functions.
>> - */
>> -static struct drm_sched_entity *
>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>> -{
>> -	struct drm_sched_entity *entity = NULL;
>> -	int i;
>> -
>> -	/* Start with the highest priority.
>> -	 */
>> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>> -		entity = drm_sched_rq_select_entity(sched, sched->sched_rq[i]);
>> -		if (entity)
>> -			break;
>> -	}
>> -
>> -	return IS_ERR(entity) ? NULL : entity;
>> -}
>> -
>>   /**
>>    * drm_sched_get_finished_job - fetch the next finished job to be destroyed
>>    *
>> @@ -1023,8 +986,8 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>   	int r;
>>   
>>   	/* Find entity with a ready job */
>> -	entity = drm_sched_select_entity(sched);
>> -	if (!entity)
>> +	entity = drm_sched_rq_select_entity(sched, sched->rq);
> 
> It isn't clear to me why entity selection should be based on the rq as a
> parameter, now that the scheduler only has one rq and, thus, only one
> list of entities anyway.
> 
> Also, you pass both sched and sched->rq, which is redundant.

It is gone in the next patch. But it does look like I can move it to 
this one just as well. Will do.
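
I.e. the single-rq prototype from the next patch can land here already:

	struct drm_sched_entity *
	drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched);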

> I think it would be clearer to continue using the name
> drm_sched_select_entity().

Can do. Just don't complain later about what drm_sched_select_entity is
doing in sched_rq.c. :D

Yeah, I tried to sort and separate things, and I think I improved
matters, but it is still not perfect.

> 
>> +	if (IS_ERR_OR_NULL(entity))
>>   		return;	/* No more work */
>>   
>>   	sched_job = drm_sched_entity_pop_job(entity);
>> @@ -1095,8 +1058,6 @@ static struct workqueue_struct *drm_sched_alloc_wq(const char *name)
>>    */
>>   int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_args *args)
>>   {
>> -	int i;
>> -
>>   	sched->ops = args->ops;
>>   	sched->credit_limit = args->credit_limit;
>>   	sched->name = args->name;
>> @@ -1106,13 +1067,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>   	sched->score = args->score ? args->score : &sched->_score;
>>   	sched->dev = args->dev;
>>   
>> -	if (args->num_rqs > DRM_SCHED_PRIORITY_COUNT) {
>> -		/* This is a gross violation--tell drivers what the  problem is.
>> -		 */
>> -		dev_err(sched->dev, "%s: num_rqs cannot be greater than DRM_SCHED_PRIORITY_COUNT\n",
>> -			__func__);
>> -		return -EINVAL;
>> -	} else if (sched->sched_rq) {
> 
> args->num_rqs is surplus now. What are we going to do with the drivers
> that set it?
> 
> It should be removed from struct drm_sched_init_args and from the
> drivers using it. That way, those drivers will also notice this big
> change coming up.

Good point. I will add patches to the series to remove it from 
individual drivers.
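
For most drivers that should amount to just dropping the assignment from
the init args, something like this (sketch; ops, name and values are
made-up placeholders):

	const struct drm_sched_init_args args = {
		.ops          = &my_sched_ops,
		.credit_limit = 64,
		.timeout      = msecs_to_jiffies(10000),
		.name         = "my-ring",
		.dev          = dev,
		/* no more .num_rqs */
	};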

> 
>> +	if (sched->rq) {
>>   		/* Not an error, but warn anyway so drivers can
>>   		 * fine-tune their DRM calling order, and return all
>>   		 * is good.
>> @@ -1132,21 +1087,11 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>   		sched->own_submit_wq = true;
>>   	}
>>   
>> -	sched->num_user_rqs = args->num_rqs;
>> -	sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
>> -			 args->num_rqs : 1;
>> -	sched->sched_rq = kmalloc_array(sched->num_rqs,
>> -					sizeof(*sched->sched_rq),
>> -					GFP_KERNEL | __GFP_ZERO);
>> -	if (!sched->sched_rq)
>> +	sched->rq = kmalloc(sizeof(*sched->rq), GFP_KERNEL | __GFP_ZERO);
>> +	if (!sched->rq)
>>   		goto Out_check_own;
>>   
>> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>> -		sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
>> -		if (!sched->sched_rq[i])
>> -			goto Out_unroll;
>> -		drm_sched_rq_init(sched, sched->sched_rq[i]);
>> -	}
>> +	drm_sched_rq_init(sched, sched->rq);
> 
> Same, redundant function parameter.

In fact, perhaps this should be drm_sched_rq_init(rq, sched)? Because
that is what the function really does.
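
I.e. something like this, with the body otherwise unchanged:

	void drm_sched_rq_init(struct drm_sched_rq *rq,
			       struct drm_gpu_scheduler *sched)
	{
		spin_lock_init(&rq->lock);
		INIT_LIST_HEAD(&rq->entities);
		rq->rb_tree_root = RB_ROOT_CACHED;
		rq->sched = sched;
		rq->head_prio = -1;
	}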

> 
>>   
>>   	init_waitqueue_head(&sched->job_scheduled);
>>   	INIT_LIST_HEAD(&sched->pending_list);
>> @@ -1161,12 +1106,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>   
>>   	sched->ready = true;
>>   	return 0;
>> -Out_unroll:
>> -	for (--i ; i >= DRM_SCHED_PRIORITY_KERNEL; i--)
>> -		kfree(sched->sched_rq[i]);
>>   
>> -	kfree(sched->sched_rq);
> 
> Almost thought we had a memory leak now. But it seems fine since we can't
> fail anymore after the kmalloc() above.
> 
>> -	sched->sched_rq = NULL;
>>   Out_check_own:
>>   	if (sched->own_submit_wq)
>>   		destroy_workqueue(sched->submit_wq);
>> @@ -1202,41 +1142,35 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
>>    */
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>> +
>> +	struct drm_sched_rq *rq = sched->rq;
>>   	struct drm_sched_entity *s_entity;
>> -	int i;
>>   
>>   	drm_sched_wqueue_stop(sched);
>>   
>> -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>> -		struct drm_sched_rq *rq = sched->sched_rq[i];
>> -
>> -		spin_lock(&rq->lock);
>> -		list_for_each_entry(s_entity, &rq->entities, list)
>> -			/*
>> -			 * Prevents reinsertion and marks job_queue as idle,
>> -			 * it will be removed from the rq in drm_sched_entity_fini()
>> -			 * eventually
>> -			 *
>> -			 * FIXME:
>> -			 * This lacks the proper spin_lock(&s_entity->lock) and
>> -			 * is, therefore, a race condition. Most notably, it
>> -			 * can race with drm_sched_entity_push_job(). The lock
>> -			 * cannot be taken here, however, because this would
>> -			 * lead to lock inversion -> deadlock.
>> -			 *
>> -			 * The best solution probably is to enforce the life
>> -			 * time rule of all entities having to be torn down
>> -			 * before their scheduler. Then, however, locking could
>> -			 * be dropped alltogether from this function.
>> -			 *
>> -			 * For now, this remains a potential race in all
>> -			 * drivers that keep entities alive for longer than
>> -			 * the scheduler.
>> -			 */
>> -			s_entity->stopped = true;
>> -		spin_unlock(&rq->lock);
>> -		kfree(sched->sched_rq[i]);
>> -	}
>> +	spin_lock(&rq->lock);
>> +	list_for_each_entry(s_entity, &rq->entities, list)
>> +		/*
>> +		 * Prevents re-insertion and marks job_queue as idle,
>> +		 * it will be removed from the rq in drm_sched_entity_fini()
>> +		 * eventually.
>> +		 *
>> +		 * FIXME:
>> +		 * This lacks the proper spin_lock(&s_entity->lock) and is,
>> +		 * therefore, a race condition. Most notably, it can race with
>> +		 * drm_sched_entity_push_job(). The lock cannot be taken here,
>> +		 * however, because this would lead to lock inversion.
>> +		 *
>> +		 * The best solution probably is to enforce the life time rule
>> +		 * of all entities having to be torn down before their
>> +		 * scheduler. Then locking could be dropped altogether from this
>> +		 * function.
>> +		 *
>> +		 * For now, this remains a potential race in all drivers that
>> +		 * keep entities alive for longer than the scheduler.
>> +		 */
>> +		s_entity->stopped = true;
>> +	spin_unlock(&rq->lock);
>>   
>>   	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>>   	wake_up_all(&sched->job_scheduled);
>> @@ -1251,8 +1185,8 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   	if (sched->own_submit_wq)
>>   		destroy_workqueue(sched->submit_wq);
>>   	sched->ready = false;
>> -	kfree(sched->sched_rq);
>> -	sched->sched_rq = NULL;
>> +	kfree(sched->rq);
> 
> Renaming sched_rq to rq can be done, but it certainly increases the git
> diff. Such general naming improvements make patches a bit more
> difficult to review and I'd prefer to have those in separate patches.

I agree in principle, just that in this case the number of touched
lines would not change much, because the member changes from an array
of run queues to a single run queue anyway. This line could be the
only one which would stay out of the diff.

>> +	sched->rq = NULL;
>>   
>>   	if (!list_empty(&sched->pending_list))
>>   		dev_warn(sched->dev, "Tearing down scheduler while jobs are pending!\n");
>> @@ -1270,35 +1204,28 @@ EXPORT_SYMBOL(drm_sched_fini);
>>    */
>>   void drm_sched_increase_karma(struct drm_sched_job *bad)
>>   {
>> -	int i;
>> -	struct drm_sched_entity *tmp;
>> -	struct drm_sched_entity *entity;
>>   	struct drm_gpu_scheduler *sched = bad->sched;
>> +	struct drm_sched_entity *entity, *tmp;
>> +	struct drm_sched_rq *rq = sched->rq;
>>   
>>   	/* don't change @bad's karma if it's from KERNEL RQ,
>>   	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
>>   	 * corrupt but keep in mind that kernel jobs always considered good.
>>   	 */
>> -	if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
>> -		atomic_inc(&bad->karma);
>> +	if (bad->s_priority == DRM_SCHED_PRIORITY_KERNEL)
>> +		return;
>>   
>> -		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>> -			struct drm_sched_rq *rq = sched->sched_rq[i];
>> +	atomic_inc(&bad->karma);
>>   
>> -			spin_lock(&rq->lock);
>> -			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>> -				if (bad->s_fence->scheduled.context ==
>> -				    entity->fence_context) {
>> -					if (entity->guilty)
>> -						atomic_set(entity->guilty, 1);
>> -					break;
>> -				}
>> -			}
>> -			spin_unlock(&rq->lock);
>> -			if (&entity->list != &rq->entities)
>> -				break;
>> +	spin_lock(&rq->lock);
>> +	list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>> +		if (bad->s_fence->scheduled.context == entity->fence_context) {
>> +			if (entity->guilty)
>> +				atomic_set(entity->guilty, 1);
>> +			break;
>>   		}
>>   	}
>> +	spin_unlock(&rq->lock);
>>   }
>>   EXPORT_SYMBOL(drm_sched_increase_karma);
>>   
>> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
>> index c5be8a2c893d..a9bc105221bf 100644
>> --- a/drivers/gpu/drm/scheduler/sched_rq.c
>> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
>> @@ -34,7 +34,7 @@ static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
>>   	rq->head_prio = prio;
>>   }
>>   
>> -static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>> +static void drm_sched_rq_remove_tree_locked(struct drm_sched_entity *entity,
>>   					    struct drm_sched_rq *rq)
>>   {
>>   	lockdep_assert_held(&entity->lock);
>> @@ -47,7 +47,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
>>   	}
>>   }
>>   
>> -static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>> +static void drm_sched_rq_update_tree_locked(struct drm_sched_entity *entity,
>>   					    struct drm_sched_rq *rq,
>>   					    ktime_t ts)
>>   {
>> @@ -59,7 +59,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>>   	lockdep_assert_held(&entity->lock);
>>   	lockdep_assert_held(&rq->lock);
>>   
>> -	drm_sched_rq_remove_fifo_locked(entity, rq);
>> +	drm_sched_rq_remove_tree_locked(entity, rq);
>>   
>>   	entity->oldest_job_waiting = ts;
>>   
>> @@ -207,17 +207,17 @@ static ktime_t drm_sched_entity_get_job_ts(struct drm_sched_entity *entity)
>>    * drm_sched_rq_add_entity - add an entity
>>    *
>>    * @entity: scheduler entity
>> - * @ts: submission timestamp
>>    *
>>    * Adds a scheduler entity to the run queue.
>>    *
>>    * Returns a DRM scheduler pre-selected to handle this entity.
>>    */
>>   struct drm_gpu_scheduler *
>> -drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>> +drm_sched_rq_add_entity(struct drm_sched_entity *entity)
>>   {
>>   	struct drm_gpu_scheduler *sched;
>>   	struct drm_sched_rq *rq;
>> +	ktime_t ts;
>>   
>>   	/* Add the entity to the run queue */
>>   	spin_lock(&entity->lock);
>> @@ -237,15 +237,9 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
>>   		list_add_tail(&entity->list, &rq->entities);
>>   	}
>>   
>> -	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
>> -		ts = drm_sched_rq_get_min_vruntime(rq);
>> -		ts = drm_sched_entity_restore_vruntime(entity, ts,
>> -						       rq->head_prio);
>> -	} else if (drm_sched_policy == DRM_SCHED_POLICY_RR) {
>> -		ts = entity->rr_ts;
>> -	}
>> -
>> -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
>> +	ts = drm_sched_rq_get_min_vruntime(rq);
>> +	ts = drm_sched_entity_restore_vruntime(entity, ts, rq->head_prio);
>> +	drm_sched_rq_update_tree_locked(entity, rq, ts);
>>   
>>   	spin_unlock(&rq->lock);
>>   	spin_unlock(&entity->lock);
>> @@ -274,26 +268,11 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>   	atomic_dec(rq->sched->score);
>>   	list_del_init(&entity->list);
>>   
>> -	drm_sched_rq_remove_fifo_locked(entity, rq);
>> +	drm_sched_rq_remove_tree_locked(entity, rq);
>>   
>>   	spin_unlock(&rq->lock);
>>   }
>>   
>> -static ktime_t
>> -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
>> -{
>> -	ktime_t ts;
>> -
>> -	lockdep_assert_held(&entity->lock);
>> -	lockdep_assert_held(&rq->lock);
>> -
>> -	ts = ktime_add_ns(rq->rr_ts, 1);
>> -	entity->rr_ts = ts;
>> -	rq->rr_ts = ts;
>> -
>> -	return ts;
>> -}
>> -
>>   /**
>>    * drm_sched_rq_pop_entity - pops an entity
>>    *
>> @@ -317,23 +296,14 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>   	if (next_job) {
>>   		ktime_t ts;
>>   
>> -		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
>> -			ts = drm_sched_entity_get_job_ts(entity);
>> -		else if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>> -			ts = next_job->submit_ts;
>> -		else
>> -			ts = drm_sched_rq_get_rr_ts(rq, entity);
>> -
>> -		drm_sched_rq_update_fifo_locked(entity, rq, ts);
>> +		ts = drm_sched_entity_get_job_ts(entity);
>> +		drm_sched_rq_update_tree_locked(entity, rq, ts);
>>   	} else {
>> -		drm_sched_rq_remove_fifo_locked(entity, rq);
>> +		ktime_t min_vruntime;
>>   
>> -		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
>> -			ktime_t min_vruntime;
>> -
>> -			min_vruntime = drm_sched_rq_get_min_vruntime(rq);
>> -			drm_sched_entity_save_vruntime(entity, min_vruntime);
>> -		}
>> +		drm_sched_rq_remove_tree_locked(entity, rq);
>> +		min_vruntime = drm_sched_rq_get_min_vruntime(rq);
>> +		drm_sched_entity_save_vruntime(entity, min_vruntime);
>>   	}
>>   	spin_unlock(&rq->lock);
>>   	spin_unlock(&entity->lock);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 327b75a052c7..c9b75a05d05c 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -96,8 +96,7 @@ struct drm_sched_entity {
>>   	 * @lock:
>>   	 *
>>   	 * Lock protecting the run-queue (@rq) to which this entity belongs,
>> -	 * @priority, the list of schedulers (@sched_list, @num_sched_list) and
>> -	 * the @rr_ts field.
>> +	 * @priority and the list of schedulers (@sched_list, @num_sched_list).
>>   	 */
>>   	spinlock_t			lock;
>>   
>> @@ -150,18 +149,6 @@ struct drm_sched_entity {
>>   	 */
>>   	enum drm_sched_priority         priority;
>>   
>> -	/**
>> -	 * @rq_priority: Run-queue priority
>> -	 */
>> -	enum drm_sched_priority         rq_priority;
>> -
>> -	/**
>> -	 * @rr_ts:
>> -	 *
>> -	 * Fake timestamp of the last popped job from the entity.
>> -	 */
>> -	ktime_t				rr_ts;
>> -
>>   	/**
>>   	 * @job_queue: the list of jobs of this entity.
>>   	 */
>> @@ -259,8 +246,7 @@ struct drm_sched_entity {
>>    * struct drm_sched_rq - queue of entities to be scheduled.
>>    *
>>    * @sched: the scheduler to which this rq belongs to.
>> - * @lock: protects @entities, @rb_tree_root, @rr_ts and @head_prio.
>> - * @rr_ts: monotonically incrementing fake timestamp for RR mode
>> + * @lock: protects @entities, @rb_tree_root and @head_prio.
>>    * @entities: list of the entities to be scheduled.
>>    * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
>>    * @head_prio: priority of the top tree element
>> @@ -274,7 +260,6 @@ struct drm_sched_rq {
>>   
>>   	spinlock_t			lock;
>>   	/* Following members are protected by the @lock: */
>> -	ktime_t				rr_ts;
>>   	struct list_head		entities;
>>   	struct rb_root_cached		rb_tree_root;
>>   	enum drm_sched_priority		head_prio;
>> @@ -360,13 +345,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
>>    * to schedule the job.
>>    */
>>   struct drm_sched_job {
>> -	/**
>> -	 * @submit_ts:
>> -	 *
>> -	 * When the job was pushed into the entity queue.
>> -	 */
>> -	ktime_t                         submit_ts;
>> -
>>   	/**
>>   	 * @sched:
>>   	 *
>> @@ -603,9 +581,7 @@ struct drm_gpu_scheduler {
>>   	atomic_t			credit_count;
>>   	long				timeout;
>>   	const char			*name;
>> -	u32                             num_rqs;
>> -	u32                             num_user_rqs;
>> -	struct drm_sched_rq             **sched_rq;
>> +	struct drm_sched_rq             *rq;
> 
> Docu must be updated

You insisted the series must be an RFC and I tend not to polish RFCs to
the max. ;p

Will fix.

Regards,

Tvrtko


>>   	wait_queue_head_t		job_scheduled;
>>   	atomic64_t			job_id_count;
>>   	struct workqueue_struct		*submit_wq;
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler
  2025-09-24 12:01   ` Philipp Stanner
@ 2025-09-25 13:02     ` Tvrtko Ursulin
  0 siblings, 0 replies; 35+ messages in thread
From: Tvrtko Ursulin @ 2025-09-25 13:02 UTC (permalink / raw)
  To: phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost


On 24/09/2025 13:01, Philipp Stanner wrote:
> On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> > Now that the run-queue-to-scheduler relationship is always 1:1 we can
>> embed it (the run queue) directly in the scheduler struct and save on
>> some allocation error handling code and such.
> 
> Looks reasonable to me.
> 
> What I suggest is doing general improvements, like renaming variables
> (see my comments in the previous patch), in a separate cleanup patch
> following this one.

Hmm but there is no renaming in this patch. It changes the type so I 
don't see what to split out.

> A few comments below
> 
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <phasta@kernel.org>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      |  6 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  6 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |  5 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h   |  8 ++++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  8 +++---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     |  8 +++---
>>   drivers/gpu/drm/scheduler/sched_entity.c    | 32 +++++++++------------
>>   drivers/gpu/drm/scheduler/sched_fence.c     |  2 +-
>>   drivers/gpu/drm/scheduler/sched_internal.h  |  6 ++--
>>   drivers/gpu/drm/scheduler/sched_main.c      | 31 ++++----------------
>>   drivers/gpu/drm/scheduler/sched_rq.c        | 18 ++++++------
>>   include/drm/gpu_scheduler.h                 |  4 +--
>>   12 files changed, 58 insertions(+), 76 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 2ac9729e4c86..467e09e5567b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -1131,7 +1131,8 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
>>   	if (p->gang_size > 1 && !adev->vm_manager.concurrent_flush) {
>>   		for (i = 0; i < p->gang_size; ++i) {
>>   			struct drm_sched_entity *entity = p->entities[i];
>> -			struct drm_gpu_scheduler *sched = entity->rq->sched;
>> +			struct drm_gpu_scheduler *sched =
>> +				container_of(entity->rq, typeof(*sched), rq);
>>   			struct amdgpu_ring *ring = to_amdgpu_ring(sched);
>>   
>>   			if (amdgpu_vmid_uses_reserved(vm, ring->vm_hub))
>> @@ -1262,7 +1263,8 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
>>   			return r;
>>   	}
> 
> Now that I think of it, are you compiling this for all DRM drivers? Who
> knows who might do similar things to amdgpu.

Definitely not compiling for all. Amdgpu, xe and v3d are what I was
building more or less routinely. Maybe vmwgfx too, I think. And I was
grepping around. Might have missed something, of course.
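
(The grepping being mostly variations of "git grep -n 'rq->sched'" over
drivers/gpu/drm/ and include/drm/, to catch the remaining users.)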

>>   
>> -	sched = p->gang_leader->base.entity->rq->sched;
>> +	sched = container_of(p->gang_leader->base.entity->rq, typeof(*sched),
>> +			     rq);
>>   	while ((fence = amdgpu_sync_get_fence(&p->sync))) {
>>   		struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index bc07fd57310c..cdfaf3eb736d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -341,7 +341,9 @@ static struct dma_fence *
>>   amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
>>   		      struct drm_sched_entity *s_entity)
>>   {
>> -	struct amdgpu_ring *ring = to_amdgpu_ring(s_entity->rq->sched);
>> +	struct drm_gpu_scheduler *sched =
>> +		container_of(s_entity->rq, typeof(*sched), rq);
>> +	struct amdgpu_ring *ring = to_amdgpu_ring(sched);
>>   	struct amdgpu_job *job = to_amdgpu_job(sched_job);
>>   	struct dma_fence *fence;
>>   	int r;
>> @@ -434,7 +436,7 @@ drm_sched_entity_queue_pop(struct drm_sched_entity *entity)
>>   
>>   void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
>>   {
>> -	struct drm_sched_rq *rq = sched->rq;
>> +	struct drm_sched_rq *rq = &sched->rq;
>>   	struct drm_sched_entity *s_entity;
>>   	struct drm_sched_job *s_job;
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> index 4a6487eb6cb5..9530b5da3adc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> @@ -102,7 +102,10 @@ struct amdgpu_job {
>>   
>>   static inline struct amdgpu_ring *amdgpu_job_ring(struct amdgpu_job *job)
>>   {
>> -	return to_amdgpu_ring(job->base.entity->rq->sched);
>> +	struct drm_gpu_scheduler *sched =
>> +		container_of(job->base.entity->rq, typeof(*sched), rq);
>> +
>> +	return to_amdgpu_ring(sched);
>>   }
>>   
>>   int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> index d13e64a69e25..85724ec6aaf8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> @@ -145,6 +145,7 @@ TRACE_EVENT(amdgpu_cs,
>>   		     struct amdgpu_ib *ib),
>>   	    TP_ARGS(p, job, ib),
>>   	    TP_STRUCT__entry(
>> +			     __field(struct drm_gpu_scheduler *, sched)
>>   			     __field(struct amdgpu_bo_list *, bo_list)
>>   			     __field(u32, ring)
>>   			     __field(u32, dw)
>> @@ -152,11 +153,14 @@ TRACE_EVENT(amdgpu_cs,
>>   			     ),
>>   
>>   	    TP_fast_assign(
>> +			   __entry->sched = container_of(job->base.entity->rq,
>> +							 typeof(*__entry->sched),
>> +							 rq);
>>   			   __entry->bo_list = p->bo_list;
>> -			   __entry->ring = to_amdgpu_ring(job->base.entity->rq->sched)->idx;
>> +			   __entry->ring = to_amdgpu_ring(__entry->sched)->idx;
>>   			   __entry->dw = ib->length_dw;
>>   			   __entry->fences = amdgpu_fence_count_emitted(
>> -				to_amdgpu_ring(job->base.entity->rq->sched));
>> +				to_amdgpu_ring(__entry->sched));
>>   			   ),
>>   	    TP_printk("bo_list=%p, ring=%u, dw=%u, fences=%u",
>>   		      __entry->bo_list, __entry->ring, __entry->dw,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> index 36805dcfa159..4ccd2e769799 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> @@ -106,13 +106,13 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>>   static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>   				 struct dma_fence **fence)
>>   {
>> +	struct drm_gpu_scheduler *sched =
>> +		container_of(p->vm->delayed.rq, typeof(*sched), rq);
>> +	struct amdgpu_ring *ring =
>> +		container_of(sched, struct amdgpu_ring, sched);
>>   	struct amdgpu_ib *ib = p->job->ibs;
>> -	struct amdgpu_ring *ring;
>>   	struct dma_fence *f;
>>   
>> -	ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
>> -			    sched);
>> -
>>   	WARN_ON(ib->length_dw == 0);
>>   	amdgpu_ring_pad_ib(ring, ib);
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> index 1083db8cea2e..be17635ac039 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> @@ -465,15 +465,15 @@ int amdgpu_xcp_open_device(struct amdgpu_device *adev,
>>   void amdgpu_xcp_release_sched(struct amdgpu_device *adev,
>>   				  struct amdgpu_ctx_entity *entity)
>>   {
>> -	struct drm_gpu_scheduler *sched;
>> -	struct amdgpu_ring *ring;
>> +	struct drm_gpu_scheduler *sched =
>> +		container_of(entity->entity.rq, typeof(*sched), rq);
>>   
>>   	if (!adev->xcp_mgr)
>>   		return;
>>   
>> -	sched = entity->entity.rq->sched;
>>   	if (drm_sched_wqueue_ready(sched)) {
>> -		ring = to_amdgpu_ring(entity->entity.rq->sched);
>> +		struct amdgpu_ring *ring = to_amdgpu_ring(sched);
>> +
>>   		atomic_dec(&adev->xcp_mgr->xcp[ring->xcp_id].ref_cnt);
>>   	}
>>   }
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 6dd30b85925b..ba290143c95d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -114,19 +114,12 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>   	 * is initialized itself.
>>   	 */
>>   	entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
>> +	if (num_sched_list) {
>> +		entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
>> +		entity->rq = &sched_list[0]->rq;
>> +	}
>>   	RCU_INIT_POINTER(entity->last_scheduled, NULL);
>>   	RB_CLEAR_NODE(&entity->rb_tree_node);
>> -
>> -	if (num_sched_list && !sched_list[0]->rq) {
>> -		/* Since every entry covered by num_sched_list
>> -		 * should be non-NULL and therefore we warn drivers
>> -		 * not to do this and to fix their DRM calling order.
>> -		 */
>> -		pr_warn("%s: called with uninitialized scheduler\n", __func__);
>> -	} else if (num_sched_list) {
>> -		entity->rq = sched_list[0]->rq;
>> -	}
>> -
>>   	init_completion(&entity->entity_idle);
>>   
>>   	/* We start in an idle state. */
>> @@ -312,7 +305,7 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>>   	if (!entity->rq)
>>   		return 0;
>>   
>> -	sched = entity->rq->sched;
>> +	sched = container_of(entity->rq, typeof(*sched), rq);
>>   	/*
>>   	 * The client will not queue more jobs during this fini - consume
>>   	 * existing queued ones, or discard them on SIGKILL.
>> @@ -393,10 +386,12 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>   {
>>   	struct drm_sched_entity *entity =
>>   		container_of(cb, struct drm_sched_entity, cb);
>> +	struct drm_gpu_scheduler *sched =
>> +		container_of(entity->rq, typeof(*sched), rq);
>>   
>>   	entity->dependency = NULL;
>>   	dma_fence_put(f);
>> -	drm_sched_wakeup(entity->rq->sched);
>> +	drm_sched_wakeup(sched);
>>   }
>>   
>>   /**
>> @@ -423,7 +418,8 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>>   static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
>>   					       struct drm_sched_job *sched_job)
>>   {
>> -	struct drm_gpu_scheduler *sched = entity->rq->sched;
>> +	struct drm_gpu_scheduler *sched =
>> +		container_of(entity->rq, typeof(*sched), rq);
>>   	struct dma_fence *fence = entity->dependency;
>>   	struct drm_sched_fence *s_fence;
>>   
>> @@ -557,7 +553,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>   
>>   	spin_lock(&entity->lock);
>>   	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
>> -	rq = sched ? sched->rq : NULL;
>> +	rq = sched ? &sched->rq : NULL;
>>   	if (rq != entity->rq) {
>>   		drm_sched_rq_remove_entity(entity->rq, entity);
>>   		entity->rq = rq;
>> @@ -580,6 +576,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>   void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   {
>>   	struct drm_sched_entity *entity = sched_job->entity;
>> +	struct drm_gpu_scheduler *sched =
>> +		container_of(entity->rq, typeof(*sched), rq);
>>   	bool first;
>>   
>>   	trace_drm_sched_job_queue(sched_job, entity);
>> @@ -591,7 +589,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   		xa_for_each(&sched_job->dependencies, index, entry)
>>   			trace_drm_sched_job_add_dep(sched_job, entry);
>>   	}
>> -	atomic_inc(entity->rq->sched->score);
>> +	atomic_inc(sched->score);
>>   	WRITE_ONCE(entity->last_user, current->group_leader);
>>   
>>   	/*
>> @@ -602,8 +600,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   
>>   	/* first job wakes up scheduler */
>>   	if (first) {
>> -		struct drm_gpu_scheduler *sched;
>> -
>>   		sched = drm_sched_rq_add_entity(entity);
>>   		if (sched)
>>   			drm_sched_wakeup(sched);
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index 9391d6f0dc01..da4f53a9ca35 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -227,7 +227,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>>   {
>>   	unsigned seq;
>>   
>> -	fence->sched = entity->rq->sched;
>> +	fence->sched = container_of(entity->rq, typeof(*fence->sched), rq);
>>   	seq = atomic_inc_return(&entity->fence_seq);
>>   	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>>   		       &fence->lock, entity->fence_context, seq);
>> diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
>> index 6e5ed721bb5b..409c9ab7ce8f 100644
>> --- a/drivers/gpu/drm/scheduler/sched_internal.h
>> +++ b/drivers/gpu/drm/scheduler/sched_internal.h
>> @@ -26,11 +26,9 @@ bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
>>   			 struct drm_sched_entity *entity);
>>   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>   
>> -void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>> -		       struct drm_sched_rq *rq);
>> +void drm_sched_rq_init(struct drm_gpu_scheduler *sched);
>>   struct drm_sched_entity *
>> -drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
>> -			   struct drm_sched_rq *rq);
>> +drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched);
>>   struct drm_gpu_scheduler *
>>   drm_sched_rq_add_entity(struct drm_sched_entity *entity);
>>   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a9079bdb27d3..fe773154cdd4 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -645,7 +645,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>>   
>>   	BUG_ON(!entity);
>>   	drm_sched_entity_select_rq(entity);
>> -	sched = entity->rq->sched;
>> +	sched = container_of(entity->rq, typeof(*sched), rq);
>>   
>>   	job->sched = sched;
>>   	job->s_priority = entity->priority;
>> @@ -986,7 +986,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>   	int r;
>>   
>>   	/* Find entity with a ready job */
>> -	entity = drm_sched_rq_select_entity(sched, sched->rq);
>> +	entity = drm_sched_rq_select_entity(sched);
>>   	if (IS_ERR_OR_NULL(entity))
>>   		return;	/* No more work */
>>   
>> @@ -1067,15 +1067,6 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>   	sched->score = args->score ? args->score : &sched->_score;
>>   	sched->dev = args->dev;
>>   
>> -	if (sched->rq) {
>> -		/* Not an error, but warn anyway so drivers can
>> -		 * fine-tune their DRM calling order, and return all
>> -		 * is good.
>> -		 */
>> -		dev_warn(sched->dev, "%s: scheduler already initialized!\n", __func__);
>> -		return 0;
>> -	}
>> -
>>   	if (args->submit_wq) {
>>   		sched->submit_wq = args->submit_wq;
>>   		sched->own_submit_wq = false;
>> @@ -1087,11 +1078,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>   		sched->own_submit_wq = true;
>>   	}
>>   
>> -	sched->rq = kmalloc(sizeof(*sched->rq), GFP_KERNEL | __GFP_ZERO);
>> -	if (!sched->rq)
>> -		goto Out_check_own;
>> -
>> -	drm_sched_rq_init(sched, sched->rq);
>> +	drm_sched_rq_init(sched);
>>   
>>   	init_waitqueue_head(&sched->job_scheduled);
>>   	INIT_LIST_HEAD(&sched->pending_list);
>> @@ -1106,12 +1093,6 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>   
>>   	sched->ready = true;
>>   	return 0;
>> -
>> -Out_check_own:
>> -	if (sched->own_submit_wq)
>> -		destroy_workqueue(sched->submit_wq);
>> -	dev_err(sched->dev, "%s: Failed to setup GPU scheduler--out of memory\n", __func__);
>> -	return -ENOMEM;
>>   }
>>   EXPORT_SYMBOL(drm_sched_init);
>>   
>> @@ -1143,7 +1124,7 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>>   
>> -	struct drm_sched_rq *rq = sched->rq;
>> +	struct drm_sched_rq *rq = &sched->rq;
>>   	struct drm_sched_entity *s_entity;
>>   
>>   	drm_sched_wqueue_stop(sched);
>> @@ -1185,8 +1166,6 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   	if (sched->own_submit_wq)
>>   		destroy_workqueue(sched->submit_wq);
>>   	sched->ready = false;
>> -	kfree(sched->rq);
>> -	sched->rq = NULL;
>>   
>>   	if (!list_empty(&sched->pending_list))
>>   		dev_warn(sched->dev, "Tearing down scheduler while jobs are pending!\n");
>> @@ -1206,7 +1185,7 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>   {
>>   	struct drm_gpu_scheduler *sched = bad->sched;
>>   	struct drm_sched_entity *entity, *tmp;
>> -	struct drm_sched_rq *rq = sched->rq;
>> +	struct drm_sched_rq *rq = &sched->rq;
>>   
>>   	/* don't change @bad's karma if it's from KERNEL RQ,
>>   	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
>> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
>> index a9bc105221bf..6088434a4ea4 100644
>> --- a/drivers/gpu/drm/scheduler/sched_rq.c
>> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
>> @@ -72,17 +72,16 @@ static void drm_sched_rq_update_tree_locked(struct drm_sched_entity *entity,
>>    * drm_sched_rq_init - initialize a given run queue struct
>>    *
>>    * @sched: scheduler instance to associate with this run queue
>> - * @rq: scheduler run queue
>>    *
>>    * Initializes a scheduler runqueue.
>>    */
>> -void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>> -		       struct drm_sched_rq *rq)
>> +void drm_sched_rq_init(struct drm_gpu_scheduler *sched)
>>   {
>> +	struct drm_sched_rq *rq = &sched->rq;
>> +
>>   	spin_lock_init(&rq->lock);
>>   	INIT_LIST_HEAD(&rq->entities);
>>   	rq->rb_tree_root = RB_ROOT_CACHED;
>> -	rq->sched = sched;
>>   	rq->head_prio = -1;
>>   }
>>   
>> @@ -229,8 +228,8 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity)
>>   	}
>>   
>>   	rq = entity->rq;
>> +	sched = container_of(rq, typeof(*sched), rq);
>>   	spin_lock(&rq->lock);
>> -	sched = rq->sched;
>>   
>>   	if (list_empty(&entity->list)) {
>>   		atomic_inc(sched->score);
>> @@ -258,6 +257,8 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity)
>>   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>   				struct drm_sched_entity *entity)
>>   {
>> +	struct drm_gpu_scheduler *sched = container_of(rq, typeof(*sched), rq);
>> +
>>   	lockdep_assert_held(&entity->lock);
>>   
>>   	if (list_empty(&entity->list))
>> @@ -265,7 +266,7 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>   
>>   	spin_lock(&rq->lock);
>>   
>> -	atomic_dec(rq->sched->score);
>> +	atomic_dec(sched->score);
>>   	list_del_init(&entity->list);
>>   
>>   	drm_sched_rq_remove_tree_locked(entity, rq);
>> @@ -313,7 +314,6 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>    * drm_sched_rq_select_entity - Select an entity which provides a job to run
>>    *
>>    * @sched: the gpu scheduler
>> - * @rq: scheduler run queue to check.
>>    *
>>    * Find oldest waiting ready entity.
>>    *
>> @@ -322,9 +322,9 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
>>    * its job; return NULL, if no ready entity was found.
>>    */
>>   struct drm_sched_entity *
>> -drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
>> -			   struct drm_sched_rq *rq)
>> +drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched)
>>   {
>> +	struct drm_sched_rq *rq = &sched->rq;
>>   	struct rb_node *rb;
>>   
>>   	spin_lock(&rq->lock);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index c9b75a05d05c..7fbcd121a6d3 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -256,8 +256,6 @@ struct drm_sched_entity {
>>    * the next entity to emit commands from.
>>    */
>>   struct drm_sched_rq {
>> -	struct drm_gpu_scheduler	*sched;
>> -
> 
> Docu needs to be updated

Ack.

> That's all from my side for now. I guess we'll be discussing a bit on the
> other threads and on XDC. But overall, this is a very nice series that
> improves things greatly, and from my POV we can transition to a v1;
> maybe after brainstorming with the other guys next week a bit.

Thank you! I will work on all the items identified in this round in the 
background.

Regards,

Tvrtko

>>   	spinlock_t			lock;
>>   	/* Following members are protected by the @lock: */
>>   	struct list_head		entities;
>> @@ -581,7 +579,7 @@ struct drm_gpu_scheduler {
>>   	atomic_t			credit_count;
>>   	long				timeout;
>>   	const char			*name;
>> -	struct drm_sched_rq             *rq;
>> +	struct drm_sched_rq             rq;
>>   	wait_queue_head_t		job_scheduled;
>>   	atomic64_t			job_id_count;
>>   	struct workqueue_struct		*submit_wq;
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 07/12] drm/sched: Account entity GPU time
  2025-09-25 11:52     ` Tvrtko Ursulin
@ 2025-09-26 12:30       ` Philipp Stanner
  0 siblings, 0 replies; 35+ messages in thread
From: Philipp Stanner @ 2025-09-26 12:30 UTC (permalink / raw)
  To: Tvrtko Ursulin, phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost

On Thu, 2025-09-25 at 12:52 +0100, Tvrtko Ursulin wrote:
> 
> On 24/09/2025 10:11, Philipp Stanner wrote:
> > On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> > > To implement fair scheduling we need a view into the GPU time consumed by
> > > entities. The problem we have is that job and entity objects have
> > > decoupled lifetimes, so at the point where we have a view of the accurate
> > > GPU time we can no longer link back to the entity.
> > > 
> > > Solve this by adding a lightweight entity stats object which is reference
> > > counted by both the entity and the job, and hence can safely be used from
> > > either side.
> > > 
> > > With that, the only other thing we need is to add a helper for adding the
> > > job's GPU time into the respective entity stats object, and call it once
> > > the accurate GPU time has been calculated.
> > 
> > This in general also looks reasonable and clean to me. Some comments
> > below.
> 
> Thanks!
> 
> > btw I've got many nit-comments for all patches, but will hold off on
> > those until a v1 is posted.
> 
> Let's try and strike a balance, please. :)

-ENOTFULLYUNDERSTOOD
I suppose we'll try to strike a balance in v1.

> 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Danilo Krummrich <dakr@kernel.org>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Philipp Stanner <phasta@kernel.org>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_entity.c   | 39 +++++++++++++
> > >   drivers/gpu/drm/scheduler/sched_internal.h | 66 ++++++++++++++++++++++
> > >   drivers/gpu/drm/scheduler/sched_main.c     |  6 +-
> > >   include/drm/gpu_scheduler.h                | 12 ++++
> > >   4 files changed, 122 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index 7a0a52ba87bf..04ce8b7d436b 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -32,6 +32,39 @@
> > >   
> > >   #include "gpu_scheduler_trace.h"
> > >   
> > > +
> > > +/**
> > > + * drm_sched_entity_stats_release - Entity stats kref release function
> > > + *
> > > + * @kref: Entity stats embedded kref pointer
> > > + */
> > > +void drm_sched_entity_stats_release(struct kref *kref)
> > > +{
> > > +	struct drm_sched_entity_stats *stats =
> > > +		container_of(kref, typeof(*stats), kref);
> > > +
> > > +	kfree(stats);
> > > +}
> > > +
> > > +/**
> > > + * drm_sched_entity_stats_alloc - Allocate a new struct drm_sched_entity_stats object
> > > + *
> > > + * Returns: Pointer to newly allocated struct drm_sched_entity_stats object.
> > > + */
> > > +static struct drm_sched_entity_stats *drm_sched_entity_stats_alloc(void)
> > > +{
> > > +	struct drm_sched_entity_stats *stats;
> > > +
> > > +	stats = kzalloc(sizeof(*stats), GFP_KERNEL);
> > > +	if (!stats)
> > > +		return NULL;
> > > +
> > > +	kref_init(&stats->kref);
> > > +	spin_lock_init(&stats->lock);
> > > +
> > > +	return stats;
> > > +}
> > > +
> > >   /**
> > >    * drm_sched_entity_init - Init a context entity used by scheduler when
> > >    * submit to HW ring.
> > > @@ -65,6 +98,11 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> > >   		return -EINVAL;
> > >   
> > >   	memset(entity, 0, sizeof(struct drm_sched_entity));
> > > +
> > > +	entity->stats = drm_sched_entity_stats_alloc();
> > > +	if (!entity->stats)
> > > +		return -ENOMEM;
> > > +
> > >   	INIT_LIST_HEAD(&entity->list);
> > >   	entity->rq = NULL;
> > >   	entity->guilty = guilty;
> > > @@ -338,6 +376,7 @@ void drm_sched_entity_fini(struct drm_sched_entity *entity)
> > >   
> > >   	dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
> > >   	RCU_INIT_POINTER(entity->last_scheduled, NULL);
> > > +	drm_sched_entity_stats_put(entity->stats);
> > >   }
> > >   EXPORT_SYMBOL(drm_sched_entity_fini);
> > >   
> > > diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> > > index 703ee48fbc58..27c8460a3601 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_internal.h
> > > +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> > > @@ -3,6 +3,22 @@
> > >   #ifndef _DRM_GPU_SCHEDULER_INTERNAL_H_
> > >   #define _DRM_GPU_SCHEDULER_INTERNAL_H_
> > >   
> > > +#include <linux/ktime.h>
> > > +#include <linux/kref.h>
> > > +#include <linux/spinlock.h>
> > > +
> > > +/**
> > > + * struct drm_sched_entity_stats - execution stats for an entity.
> > > + *
> > > + * @kref: reference count for the object.
> > > + * @lock: lock guarding the @runtime updates.
> > > + * @runtime: time entity spent on the GPU.
> > > + */
> > > +struct drm_sched_entity_stats {
> > 
> > Cool docu.
> > 
> > Though, considering that awkward refcounting patterns are one of the
> > more relevant scheduler problems, I believe it makes sense to add to
> > the docu here a sentence or two about who refcounts this object for
> > what reason.
> 
> Good idea - it is often not that easy to find the relevant commit
> message via git blame.
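> 
> Something along these lines perhaps (rough sketch):
> 
> 	/**
> 	 * struct drm_sched_entity_stats - execution stats for an entity.
> 	 *
> 	 * @kref: reference count for the object.
> 	 * @lock: lock guarding the @runtime updates.
> 	 * @runtime: time entity spent on the GPU.
> 	 *
> 	 * The object is reference counted because job and entity lifetimes
> 	 * are decoupled: the entity holds a reference for as long as it
> 	 * exists and every armed job holds one too (taken in
> 	 * drm_sched_job_arm(), dropped in drm_sched_job_cleanup()), so a
> 	 * job's GPU time can be accounted even after its entity is gone.
> 	 */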
> 
> > 
> > > +	struct kref	kref;
> > > +	spinlock_t	lock;
> > > +	ktime_t		runtime;
> > > +};
> > >   
> > >   /* Used to choose between FIFO and RR job-scheduling */
> > >   extern int drm_sched_policy;
> > > @@ -93,4 +109,54 @@ drm_sched_entity_is_ready(struct drm_sched_entity *entity)
> > >   	return true;
> > >   }
> > >   
> > > +void drm_sched_entity_stats_release(struct kref *kref);
> > > +
> > > +/**
> > > + * drm_sched_entity_stats_get - Obtain a reference count on struct drm_sched_entity_stats object
> > > + *
> > > + * @stats: struct drm_sched_entity_stats pointer
> > > + *
> > > + * Returns: struct drm_sched_entity_stats pointer
> > > + */
> > > +static inline struct drm_sched_entity_stats *
> > > +drm_sched_entity_stats_get(struct drm_sched_entity_stats *stats)
> > > +{
> > > +	kref_get(&stats->kref);
> > > +
> > > +	return stats;
> > > +}
> > > +
> > > +/**
> > > + * drm_sched_entity_stats_put - Release a reference count on struct drm_sched_entity_stats object
> > > + *
> > > + * @stats: struct drm_sched_entity_stats pointer
> > > + */
> > > +static inline void
> > > +drm_sched_entity_stats_put(struct drm_sched_entity_stats *stats)
> > > +{
> > > +	kref_put(&stats->kref, drm_sched_entity_stats_release);
> > > +}
> > > +
> > > +/**
> > > + * drm_sched_entity_stats_job_add_gpu_time - Account job execution time to entity
> > > + *
> > > + * @job: Scheduler job to account.
> > > + *
> > > + * Accounts the execution time of @job to its respective entity stats object.
> > > + */
> > > +static inline void
> > > +drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
> > > +{
> > > +	struct drm_sched_entity_stats *stats = job->entity_stats;
> > > +	struct drm_sched_fence *s_fence = job->s_fence;
> > > +	ktime_t start, end;
> > > +
> > > +	start = dma_fence_timestamp(&s_fence->scheduled);
> > > +	end = dma_fence_timestamp(&s_fence->finished);
> > > +
> > > +	spin_lock(&stats->lock);
> > > +	stats->runtime = ktime_add(stats->runtime, ktime_sub(end, start));
> > > +	spin_unlock(&stats->lock);
> > > +}
> > > +
> > >   #endif
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 9411676e772a..a5d7706efbea 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -658,6 +658,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
> > >   
> > >   	job->sched = sched;
> > >   	job->s_priority = entity->priority;
> > > +	job->entity_stats = drm_sched_entity_stats_get(entity->stats);
> > >   
> > >   	drm_sched_fence_init(job->s_fence, job->entity);
> > >   }
> > > @@ -846,6 +847,7 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
> > >   		 * been called.
> > >   		 */
> > >   		dma_fence_put(&job->s_fence->finished);
> > > +		drm_sched_entity_stats_put(job->entity_stats);
> > >   	} else {
> > >   		/* The job was aborted before it has been committed to be run;
> > >   		 * notably, drm_sched_job_arm() has not been called.
> > > @@ -997,8 +999,10 @@ static void drm_sched_free_job_work(struct work_struct *w)
> > >   		container_of(w, struct drm_gpu_scheduler, work_free_job);
> > >   	struct drm_sched_job *job;
> > >   
> > > -	while ((job = drm_sched_get_finished_job(sched)))
> > > +	while ((job = drm_sched_get_finished_job(sched))) {
> > > +		drm_sched_entity_stats_job_add_gpu_time(job);
> > >   		sched->ops->free_job(job);
> > > +	}
> > 
> > The biggest question I'm wondering about is whether the reference
> > should be put in drm_sched_job_cleanup() or drm_sched_free_job_work().
> > The former is semantically more correct, probably, but the latter is
> > always controlled by the scheduler.
> > 
> > Hypothetically a driver could just not call drm_sched_job_cleanup() in
> > its free_job() callback – but then other stuff would leak as well and
> > everything would be broken anyway.
> 
> I think drm_sched_entity_stats_put() from drm_sched_job_cleanup() makes
> sense because the latter is where other actions done by drm_sched_job_arm()
> are cleaned up. And drm_sched_job_arm() is where I placed 
> drm_sched_entity_stats_get(). So that all looks nicely aligned to me.
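> 
> I.e. the pairing ends up as:
> 
> 	drm_sched_job_arm():     job->entity_stats = drm_sched_entity_stats_get(entity->stats);
> 	drm_sched_job_cleanup(): drm_sched_entity_stats_put(job->entity_stats);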

ACK.

> 
> > I probably should change the docu for drm_sched_job_cleanup(), which
> > currently says drivers "should" call it in free_job(), instead of
> > "must".
> 
> Yeah I agree s/should/must/ is in order.

Will do.


P.

> 
> Regards,
> 
> Tvrtko
> 
> > >   
> > >   	drm_sched_run_job_queue(sched);
> > >   }
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 802a0060db55..f33c78473867 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -71,6 +71,8 @@ enum drm_sched_priority {
> > >   	DRM_SCHED_PRIORITY_COUNT
> > >   };
> > >   
> > > +struct drm_sched_entity_stats;
> > > +
> > >   /**
> > >    * struct drm_sched_entity - A wrapper around a job queue (typically
> > >    * attached to the DRM file_priv).
> > > @@ -110,6 +112,11 @@ struct drm_sched_entity {
> > >   	 */
> > >   	struct drm_sched_rq		*rq;
> > >   
> > > +	/**
> > > +	 * @stats: Stats object reference held by the entity and jobs.
> > > +	 */
> > > +	struct drm_sched_entity_stats	*stats;
> > > +
> > >   	/**
> > >   	 * @sched_list:
> > >   	 *
> > > @@ -365,6 +372,11 @@ struct drm_sched_job {
> > >   	struct drm_sched_fence		*s_fence;
> > >   	struct drm_sched_entity         *entity;
> > >   
> > > +	/**
> > > +	 * @entity_stats: Stats object reference held by the job and entity.
> > > +	 */
> > > +	struct drm_sched_entity_stats	*entity_stats;
> > > +
> > >   	enum drm_sched_priority		s_priority;
> > >   	u32				credits;
> > >   	/** @last_dependency: tracks @dependencies as they signal */
> > 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue
  2025-09-25 12:56     ` Tvrtko Ursulin
@ 2025-09-26 13:32       ` Philipp Stanner
  0 siblings, 0 replies; 35+ messages in thread
From: Philipp Stanner @ 2025-09-26 13:32 UTC (permalink / raw)
  To: Tvrtko Ursulin, phasta, dri-devel
  Cc: amd-gfx, kernel-dev, Christian König, Danilo Krummrich,
	Matthew Brost

On Thu, 2025-09-25 at 13:56 +0100, Tvrtko Ursulin wrote:
> 
> On 24/09/2025 12:50, Philipp Stanner wrote:
> > On Wed, 2025-09-03 at 11:18 +0100, Tvrtko Ursulin wrote:
> > > If the new fair policy is at least as good as FIFO and we can afford to
> > > remove round-robin, we can simplify the scheduler code by making the
> > > scheduler-to-run-queue relationship always 1:1 and remove some code.
> > > 
> > > Also, now that the FIFO policy is gone the tree of entities is not a FIFO
> > > tree any more so rename it to just the tree.
> > 
> > I think that this patch should be the one that sets the default
> > scheduling policy to CFS.
> > 
> > Either that or a separate patch at an appropriate position before #11.
> 
> Yep, it will be separate.
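> 
> I.e. a one-liner along the lines of:
> 
> 	-int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
> 	+int drm_sched_policy = DRM_SCHED_POLICY_FAIR;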
> 
> > > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: Danilo Krummrich <dakr@kernel.org>
> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: Philipp Stanner <phasta@kernel.org>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  23 ++-
> > >   drivers/gpu/drm/scheduler/sched_entity.c   |  29 +---
> > >   drivers/gpu/drm/scheduler/sched_internal.h |   9 +-
> > >   drivers/gpu/drm/scheduler/sched_main.c     | 163 ++++++---------------
> > >   drivers/gpu/drm/scheduler/sched_rq.c       |  60 ++------
> > >   include/drm/gpu_scheduler.h                |  30 +---
> > >   6 files changed, 78 insertions(+), 236 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index d020a890a0ea..bc07fd57310c 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -434,25 +434,22 @@ drm_sched_entity_queue_pop(struct drm_sched_entity *entity)
> > >   
> > >   void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
> > >   {
> > > +	struct drm_sched_rq *rq = sched->rq;
> > > +	struct drm_sched_entity *s_entity;
> > >   	struct drm_sched_job *s_job;
> > > -	struct drm_sched_entity *s_entity = NULL;
> > > -	int i;
> > >   
> > >   	/* Signal all jobs not yet scheduled */
> > > -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > -		struct drm_sched_rq *rq = sched->sched_rq[i];
> > > -		spin_lock(&rq->lock);
> > > -		list_for_each_entry(s_entity, &rq->entities, list) {
> > > -			while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
> > > -				struct drm_sched_fence *s_fence = s_job->s_fence;
> > > +	spin_lock(&rq->lock);
> > > +	list_for_each_entry(s_entity, &rq->entities, list) {
> > > +		while ((s_job = drm_sched_entity_queue_pop(s_entity))) {
> > > +			struct drm_sched_fence *s_fence = s_job->s_fence;
> > >   
> > > -				dma_fence_signal(&s_fence->scheduled);
> > > -				dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> > > -				dma_fence_signal(&s_fence->finished);
> > > -			}
> > > +			dma_fence_signal(&s_fence->scheduled);
> > > +			dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> > > +			dma_fence_signal(&s_fence->finished);
> > >   		}
> > > -		spin_unlock(&rq->lock);
> > >   	}
> > > +	spin_unlock(&rq->lock);
> > 
> > Yoah, well. Seems there is no easy fix to remove that from amdgpu,
> > since it's used during GPU recovery.
> > 
> > Fine from my side if it's mentioned and explained in the commit
> > message.
> 
> Yep. Note there is a "technical debt" TODO marker just above this 
> function already.
> 
> > >   
> > >   	/* Signal all jobs already scheduled to HW */
> > >   	list_for_each_entry(s_job, &sched->pending_list, list) {
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index 58f51875547a..6dd30b85925b 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -108,8 +108,6 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> > >   	entity->guilty = guilty;
> > >   	entity->num_sched_list = num_sched_list;
> > >   	entity->priority = priority;
> > > -	entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
> > > -			      DRM_SCHED_PRIORITY_KERNEL : priority;
> > >   	/*
> > >   	 * It's perfectly valid to initialize an entity without having a valid
> > >   	 * scheduler attached. It's just not valid to use the scheduler before it
> > > @@ -119,30 +117,14 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> > >   	RCU_INIT_POINTER(entity->last_scheduled, NULL);
> > >   	RB_CLEAR_NODE(&entity->rb_tree_node);
> > >   
> > > -	if (num_sched_list && !sched_list[0]->sched_rq) {
> > > +	if (num_sched_list && !sched_list[0]->rq) {
> > >   		/* Since every entry covered by num_sched_list
> > >   		 * should be non-NULL and therefore we warn drivers
> > >   		 * not to do this and to fix their DRM calling order.
> > >   		 */
> > >   		pr_warn("%s: called with uninitialized scheduler\n", __func__);
> > >   	} else if (num_sched_list) {
> > > -		enum drm_sched_priority p = entity->priority;
> > > -
> > > -		/*
> > > -		 * The "priority" of an entity cannot exceed the number of
> > > -		 * run-queues of a scheduler. Protect against num_rqs being 0,
> > > -		 * by converting to signed. Choose the lowest priority
> > > -		 * available.
> > > -		 */
> > > -		if (p >= sched_list[0]->num_user_rqs) {
> > > -			dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
> > > -				p, sched_list[0]->num_user_rqs);
> > > -			p = max_t(s32,
> > > -				 (s32)sched_list[0]->num_user_rqs - 1,
> > > -				 (s32)DRM_SCHED_PRIORITY_KERNEL);
> > > -			entity->priority = p;
> > > -		}
> > > -		entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
> > > +		entity->rq = sched_list[0]->rq;
> > >   	}
> > >   
> > >   	init_completion(&entity->entity_idle);
> > > @@ -575,7 +557,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > >   
> > >   	spin_lock(&entity->lock);
> > >   	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> > > -	rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
> > > +	rq = sched ? sched->rq : NULL;
> > >   	if (rq != entity->rq) {
> > >   		drm_sched_rq_remove_entity(entity->rq, entity);
> > >   		entity->rq = rq;
> > > @@ -599,7 +581,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >   {
> > >   	struct drm_sched_entity *entity = sched_job->entity;
> > >   	bool first;
> > > -	ktime_t submit_ts;
> > >   
> > >   	trace_drm_sched_job_queue(sched_job, entity);
> > >   
> > > @@ -616,16 +597,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >   	/*
> > >   	 * After the sched_job is pushed into the entity queue, it may be
> > >   	 * completed and freed up at any time. We can no longer access it.
> > > -	 * Make sure to set the submit_ts first, to avoid a race.
> > >   	 */
> > > -	sched_job->submit_ts = submit_ts = ktime_get();
> > 
> > Can submit_ts be removed completely?
> 
> It is. Or do you mean, is it safe? If there are no more users, why not?

Ah right, I overlooked that. Very well.

> 
> > 
> > >   	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
> > >   
> > >   	/* first job wakes up scheduler */
> > >   	if (first) {
> > >   		struct drm_gpu_scheduler *sched;
> > >   
> > > -		sched = drm_sched_rq_add_entity(entity, submit_ts);
> > > +		sched = drm_sched_rq_add_entity(entity);
> > >   		if (sched)
> > >   			drm_sched_wakeup(sched);
> > >   	}
> > > diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
> > > index 125aba70eda6..6e5ed721bb5b 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_internal.h
> > > +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> > > @@ -22,13 +22,6 @@ struct drm_sched_entity_stats {
> > >   	u64		vruntime;
> > >   };
> > >   
> > > -/* Used to choose between FIFO and RR job-scheduling */
> > > -extern int drm_sched_policy;
> > > -
> > > -#define DRM_SCHED_POLICY_RR	0
> > > -#define DRM_SCHED_POLICY_FIFO	1
> > > -#define DRM_SCHED_POLICY_FAIR	2
> > > -
> > >   bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
> > >   			 struct drm_sched_entity *entity);
> > >   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > > @@ -39,7 +32,7 @@ struct drm_sched_entity *
> > >   drm_sched_rq_select_entity(struct drm_gpu_scheduler *sched,
> > >   			   struct drm_sched_rq *rq);
> > >   struct drm_gpu_scheduler *
> > > -drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts);
> > > +drm_sched_rq_add_entity(struct drm_sched_entity *entity);
> > >   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > >   				struct drm_sched_entity *entity);
> > >   void drm_sched_rq_pop_entity(struct drm_sched_entity *entity);
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index e7726095c19a..a9079bdb27d3 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -84,15 +84,6 @@
> > >   #define CREATE_TRACE_POINTS
> > >   #include "gpu_scheduler_trace.h"
> > >   
> > > -int drm_sched_policy = DRM_SCHED_POLICY_FAIR;
> > > -
> > > -/**
> > > - * DOC: sched_policy (int)
> > > - * Used to override default entities scheduling policy in a run queue.
> > > - */
> > > -MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
> > > -module_param_named(sched_policy, drm_sched_policy, int, 0444);
> > > -
> > 
> > Anyone know what happens if we remove that param and some user still
> > tries to set it?
> 
> Just an "unknown parameter ignored" message in the kernel log.
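> 
> For example, booting with the then-unknown parameter still set, e.g.
> 
>   gpu_sched.sched_policy=1
> 
> should just result in something along the lines of
> 
>   gpu_sched: unknown parameter 'sched_policy' ignored
> 
> in dmesg (wording from memory, it may vary between kernel versions).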
> 
> > >   static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
> > >   {
> > >   	u32 credits;
> > > @@ -876,34 +867,6 @@ void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
> > >   	drm_sched_run_job_queue(sched);
> > >   }
> > >   
> > > -/**
> > > - * drm_sched_select_entity - Select next entity to process
> > > - *
> > > - * @sched: scheduler instance
> > > - *
> > > - * Return an entity to process or NULL if none are found.
> > > - *
> > > - * Note, that we break out of the for-loop when "entity" is non-null, which can
> > > - * also be an error-pointer--this assures we don't process lower priority
> > > - * run-queues. See comments in the respectively called functions.
> > > - */
> > > -static struct drm_sched_entity *
> > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > -{
> > > -	struct drm_sched_entity *entity = NULL;
> > > -	int i;
> > > -
> > > -	/* Start with the highest priority.
> > > -	 */
> > > -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > -		entity = drm_sched_rq_select_entity(sched, sched->sched_rq[i]);
> > > -		if (entity)
> > > -			break;
> > > -	}
> > > -
> > > -	return IS_ERR(entity) ? NULL : entity;
> > > -}
> > > -
> > >   /**
> > >    * drm_sched_get_finished_job - fetch the next finished job to be destroyed
> > >    *
> > > @@ -1023,8 +986,8 @@ static void drm_sched_run_job_work(struct work_struct *w)
> > >   	int r;
> > >   
> > >   	/* Find entity with a ready job */
> > > -	entity = drm_sched_select_entity(sched);
> > > -	if (!entity)
> > > +	entity = drm_sched_rq_select_entity(sched, sched->rq);
> > 
> > It isn't clear to me why entity selection should be based on the rq as
> > a parameter, now that the scheduler only has one rq and, thus, only one
> > list of entities anyway.
> > 
> > Also, you pass both sched and sched->rq, which is redundant.
> 
> It is gone in the next patch. But it does look like I can move it to 
> this one just as well. Will do.
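> 
> I.e. the call site would then become (just sketching the shape, final
> naming aside per below):
> 
>   -	entity = drm_sched_rq_select_entity(sched, sched->rq);
>   +	entity = drm_sched_select_entity(sched);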
> 
> > I think it would be clearer to continue using the name
> > drm_sched_select_entity().
> 
> Can do. Just don't complain later about what drm_sched_select_entity is
> doing in sched_rq.c. :D
> 
> Yeah, I tried to sort and separate things, and I think I improved
> matters, but it is still not perfect.

Mhm. Perfect is the enemy of "good". Maybe even keeping the "rq" name
is fine then? Choose what feels best to you.

> 
> > 
> > > +	if (IS_ERR_OR_NULL(entity))
> > >   		return;	/* No more work */
> > >   
> > >   	sched_job = drm_sched_entity_pop_job(entity);
> > > @@ -1095,8 +1058,6 @@ static struct workqueue_struct *drm_sched_alloc_wq(const char *name)
> > >    */
> > >   int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_args *args)
> > >   {
> > > -	int i;
> > > -
> > >   	sched->ops = args->ops;
> > >   	sched->credit_limit = args->credit_limit;
> > >   	sched->name = args->name;
> > > @@ -1106,13 +1067,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> > >   	sched->score = args->score ? args->score : &sched->_score;
> > >   	sched->dev = args->dev;
> > >   
> > > -	if (args->num_rqs > DRM_SCHED_PRIORITY_COUNT) {
> > > -		/* This is a gross violation--tell drivers what the  problem is.
> > > -		 */
> > > -		dev_err(sched->dev, "%s: num_rqs cannot be greater than DRM_SCHED_PRIORITY_COUNT\n",
> > > -			__func__);
> > > -		return -EINVAL;
> > > -	} else if (sched->sched_rq) {
> > 
> > args->num_rqs is surplus now. What are we going to do with the drivers
> > that set it?
> > 
> > It should be removed from struct drm_sched_init_args and the using
> > drivers. That way, those drivers will also notice this big change
> > coming up.
> 
> Good point. I will add patches to the series to remove it from 
> individual drivers.
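> 
> Roughly, each driver then just loses a line like this from its struct
> drm_sched_init_args initializer (amdgpu shown, from memory):
> 
>   -		.num_rqs = DRM_SCHED_PRIORITY_COUNT,
> 
> plus the field itself goes away from the struct.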

Very good, that will be a big improvement.

> 
> > 
> > > +	if (sched->rq) {
> > >   		/* Not an error, but warn anyway so drivers can
> > >   		 * fine-tune their DRM calling order, and return all
> > >   		 * is good.
> > > @@ -1132,21 +1087,11 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> > >   		sched->own_submit_wq = true;
> > >   	}
> > >   
> > > -	sched->num_user_rqs = args->num_rqs;
> > > -	sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
> > > -			 args->num_rqs : 1;
> > > -	sched->sched_rq = kmalloc_array(sched->num_rqs,
> > > -					sizeof(*sched->sched_rq),
> > > -					GFP_KERNEL | __GFP_ZERO);
> > > -	if (!sched->sched_rq)
> > > +	sched->rq = kmalloc(sizeof(*sched->rq), GFP_KERNEL | __GFP_ZERO);
> > > +	if (!sched->rq)
> > >   		goto Out_check_own;
> > >   
> > > -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > -		sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
> > > -		if (!sched->sched_rq[i])
> > > -			goto Out_unroll;
> > > -		drm_sched_rq_init(sched, sched->sched_rq[i]);
> > > -	}
> > > +	drm_sched_rq_init(sched, sched->rq);
> > 
> > Same, redundant function parameter.
> 
> In fact, perhaps this should be drm_sched_rq_init(rq, sched)? Because
> that is what the function really does.

So your point is to reverse the parameters?

I'm not sure whether we can make the purpose any clearer.

But I do believe that we would break kernel conventions if we pass
something by reference and then pass a member of the referenced thing
by reference, too.

So drm_sched_rq_init(sched) would be canonical.

When the rq is embedded into sched, there's only one anyway. So you can
never end up with a run queue which doesn't belong to the passed
scheduler.
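
A minimal sketch of what I mean, with the body abridged from this
series (with an embedded rq it would be &sched->rq, of course):

	static void drm_sched_rq_init(struct drm_gpu_scheduler *sched)
	{
		struct drm_sched_rq *rq = sched->rq;

		spin_lock_init(&rq->lock);
		INIT_LIST_HEAD(&rq->entities);
		rq->rb_tree_root = RB_ROOT_CACHED;
		/* head_prio initialisation omitted for brevity */
		rq->sched = sched;
	}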

> 
> > 
> > >   
> > >   	init_waitqueue_head(&sched->job_scheduled);
> > >   	INIT_LIST_HEAD(&sched->pending_list);
> > > @@ -1161,12 +1106,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> > >   
> > >   	sched->ready = true;
> > >   	return 0;
> > > -Out_unroll:
> > > -	for (--i ; i >= DRM_SCHED_PRIORITY_KERNEL; i--)
> > > -		kfree(sched->sched_rq[i]);
> > >   
> > > -	kfree(sched->sched_rq);
> > 
> > I almost thought we had a memory leak now. But it seems fine, since we
> > can't fail anymore after the kmalloc() above.
> > 
> > > -	sched->sched_rq = NULL;
> > >   Out_check_own:
> > >   	if (sched->own_submit_wq)
> > >   		destroy_workqueue(sched->submit_wq);
> > > @@ -1202,41 +1142,35 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
> > >    */
> > >   void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > >   {
> > > +
> > > +	struct drm_sched_rq *rq = sched->rq;
> > >   	struct drm_sched_entity *s_entity;
> > > -	int i;
> > >   
> > >   	drm_sched_wqueue_stop(sched);
> > >   
> > > -	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > -		struct drm_sched_rq *rq = sched->sched_rq[i];
> > > -
> > > -		spin_lock(&rq->lock);
> > > -		list_for_each_entry(s_entity, &rq->entities, list)
> > > -			/*
> > > -			 * Prevents reinsertion and marks job_queue as idle,
> > > -			 * it will be removed from the rq in drm_sched_entity_fini()
> > > -			 * eventually
> > > -			 *
> > > -			 * FIXME:
> > > -			 * This lacks the proper spin_lock(&s_entity->lock) and
> > > -			 * is, therefore, a race condition. Most notably, it
> > > -			 * can race with drm_sched_entity_push_job(). The lock
> > > -			 * cannot be taken here, however, because this would
> > > -			 * lead to lock inversion -> deadlock.
> > > -			 *
> > > -			 * The best solution probably is to enforce the life
> > > -			 * time rule of all entities having to be torn down
> > > -			 * before their scheduler. Then, however, locking could
> > > -			 * be dropped alltogether from this function.
> > > -			 *
> > > -			 * For now, this remains a potential race in all
> > > -			 * drivers that keep entities alive for longer than
> > > -			 * the scheduler.
> > > -			 */
> > > -			s_entity->stopped = true;
> > > -		spin_unlock(&rq->lock);
> > > -		kfree(sched->sched_rq[i]);
> > > -	}
> > > +	spin_lock(&rq->lock);
> > > +	list_for_each_entry(s_entity, &rq->entities, list)
> > > +		/*
> > > +		 * Prevents re-insertion and marks job_queue as idle,
> > > +		 * it will be removed from the rq in drm_sched_entity_fini()
> > > +		 * eventually.
> > > +		 *
> > > +		 * FIXME:
> > > +		 * This lacks the proper spin_lock(&s_entity->lock) and is,
> > > +		 * therefore, a race condition. Most notably, it can race with
> > > +		 * drm_sched_entity_push_job(). The lock cannot be taken here,
> > > +		 * however, because this would lead to lock inversion.
> > > +		 *
> > > +		 * The best solution probably is to enforce the life time rule
> > > +		 * of all entities having to be torn down before their
> > > +		 * scheduler. Then locking could be dropped altogether from this
> > > +		 * function.
> > > +		 *
> > > +		 * For now, this remains a potential race in all drivers that
> > > +		 * keep entities alive for longer than the scheduler.
> > > +		 */
> > > +		s_entity->stopped = true;
> > > +	spin_unlock(&rq->lock);
> > >   
> > >   	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
> > >   	wake_up_all(&sched->job_scheduled);
> > > @@ -1251,8 +1185,8 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > >   	if (sched->own_submit_wq)
> > >   		destroy_workqueue(sched->submit_wq);
> > >   	sched->ready = false;
> > > -	kfree(sched->sched_rq);
> > > -	sched->sched_rq = NULL;
> > > +	kfree(sched->rq);
> > 
> > Renaming sched_rq to rq can be done, but it certainly increases the git
> > diff. Such general naming improvements make patches a bit more
> > difficult to review and I'd prefer to have those in separate patches.
> 
> I agree in principle, it's just that in this case the number of touched
> lines would not change much, because the member changes from an array of
> run queues to a single run queue anyway. This line could well be the
> only one which would stay out of the diff.

It's not a deal breaker for me. But if you have wider renaming
ambitions, then preferably in distinct patches :]

> 
> > > +	sched->rq = NULL;
> > >   
> > >   	if (!list_empty(&sched->pending_list))
> > >   		dev_warn(sched->dev, "Tearing down scheduler while jobs are pending!\n");
> > > @@ -1270,35 +1204,28 @@ EXPORT_SYMBOL(drm_sched_fini);
> > >    */
> > >   void drm_sched_increase_karma(struct drm_sched_job *bad)
> > >   {
> > > -	int i;
> > > -	struct drm_sched_entity *tmp;
> > > -	struct drm_sched_entity *entity;
> > >   	struct drm_gpu_scheduler *sched = bad->sched;
> > > +	struct drm_sched_entity *entity, *tmp;
> > > +	struct drm_sched_rq *rq = sched->rq;
> > >   
> > >   	/* don't change @bad's karma if it's from KERNEL RQ,
> > >   	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
> > >   	 * corrupt but keep in mind that kernel jobs always considered good.
> > >   	 */
> > > -	if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
> > > -		atomic_inc(&bad->karma);
> > > +	if (bad->s_priority == DRM_SCHED_PRIORITY_KERNEL)
> > > +		return;
> > >   
> > > -		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > -			struct drm_sched_rq *rq = sched->sched_rq[i];
> > > +	atomic_inc(&bad->karma);
> > >   
> > > -			spin_lock(&rq->lock);
> > > -			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> > > -				if (bad->s_fence->scheduled.context ==
> > > -				    entity->fence_context) {
> > > -					if (entity->guilty)
> > > -						atomic_set(entity->guilty, 1);
> > > -					break;
> > > -				}
> > > -			}
> > > -			spin_unlock(&rq->lock);
> > > -			if (&entity->list != &rq->entities)
> > > -				break;
> > > +	spin_lock(&rq->lock);
> > > +	list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> > > +		if (bad->s_fence->scheduled.context == entity->fence_context) {
> > > +			if (entity->guilty)
> > > +				atomic_set(entity->guilty, 1);
> > > +			break;
> > >   		}
> > >   	}
> > > +	spin_unlock(&rq->lock);
> > >   }
> > >   EXPORT_SYMBOL(drm_sched_increase_karma);
> > >   
> > > diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
> > > index c5be8a2c893d..a9bc105221bf 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_rq.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_rq.c
> > > @@ -34,7 +34,7 @@ static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
> > >   	rq->head_prio = prio;
> > >   }
> > >   
> > > -static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
> > > +static void drm_sched_rq_remove_tree_locked(struct drm_sched_entity *entity,
> > >   					    struct drm_sched_rq *rq)
> > >   {
> > >   	lockdep_assert_held(&entity->lock);
> > > @@ -47,7 +47,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
> > >   	}
> > >   }
> > >   
> > > -static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > > +static void drm_sched_rq_update_tree_locked(struct drm_sched_entity *entity,
> > >   					    struct drm_sched_rq *rq,
> > >   					    ktime_t ts)
> > >   {
> > > @@ -59,7 +59,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > >   	lockdep_assert_held(&entity->lock);
> > >   	lockdep_assert_held(&rq->lock);
> > >   
> > > -	drm_sched_rq_remove_fifo_locked(entity, rq);
> > > +	drm_sched_rq_remove_tree_locked(entity, rq);
> > >   
> > >   	entity->oldest_job_waiting = ts;
> > >   
> > > @@ -207,17 +207,17 @@ static ktime_t drm_sched_entity_get_job_ts(struct drm_sched_entity *entity)
> > >    * drm_sched_rq_add_entity - add an entity
> > >    *
> > >    * @entity: scheduler entity
> > > - * @ts: submission timestamp
> > >    *
> > >    * Adds a scheduler entity to the run queue.
> > >    *
> > >    * Returns a DRM scheduler pre-selected to handle this entity.
> > >    */
> > >   struct drm_gpu_scheduler *
> > > -drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
> > > +drm_sched_rq_add_entity(struct drm_sched_entity *entity)
> > >   {
> > >   	struct drm_gpu_scheduler *sched;
> > >   	struct drm_sched_rq *rq;
> > > +	ktime_t ts;
> > >   
> > >   	/* Add the entity to the run queue */
> > >   	spin_lock(&entity->lock);
> > > @@ -237,15 +237,9 @@ drm_sched_rq_add_entity(struct drm_sched_entity *entity, ktime_t ts)
> > >   		list_add_tail(&entity->list, &rq->entities);
> > >   	}
> > >   
> > > -	if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
> > > -		ts = drm_sched_rq_get_min_vruntime(rq);
> > > -		ts = drm_sched_entity_restore_vruntime(entity, ts,
> > > -						       rq->head_prio);
> > > -	} else if (drm_sched_policy == DRM_SCHED_POLICY_RR) {
> > > -		ts = entity->rr_ts;
> > > -	}
> > > -
> > > -	drm_sched_rq_update_fifo_locked(entity, rq, ts);
> > > +	ts = drm_sched_rq_get_min_vruntime(rq);
> > > +	ts = drm_sched_entity_restore_vruntime(entity, ts, rq->head_prio);
> > > +	drm_sched_rq_update_tree_locked(entity, rq, ts);
> > >   
> > >   	spin_unlock(&rq->lock);
> > >   	spin_unlock(&entity->lock);
> > > @@ -274,26 +268,11 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > >   	atomic_dec(rq->sched->score);
> > >   	list_del_init(&entity->list);
> > >   
> > > -	drm_sched_rq_remove_fifo_locked(entity, rq);
> > > +	drm_sched_rq_remove_tree_locked(entity, rq);
> > >   
> > >   	spin_unlock(&rq->lock);
> > >   }
> > >   
> > > -static ktime_t
> > > -drm_sched_rq_get_rr_ts(struct drm_sched_rq *rq, struct drm_sched_entity *entity)
> > > -{
> > > -	ktime_t ts;
> > > -
> > > -	lockdep_assert_held(&entity->lock);
> > > -	lockdep_assert_held(&rq->lock);
> > > -
> > > -	ts = ktime_add_ns(rq->rr_ts, 1);
> > > -	entity->rr_ts = ts;
> > > -	rq->rr_ts = ts;
> > > -
> > > -	return ts;
> > > -}
> > > -
> > >   /**
> > >    * drm_sched_rq_pop_entity - pops an entity
> > >    *
> > > @@ -317,23 +296,14 @@ void drm_sched_rq_pop_entity(struct drm_sched_entity *entity)
> > >   	if (next_job) {
> > >   		ktime_t ts;
> > >   
> > > -		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR)
> > > -			ts = drm_sched_entity_get_job_ts(entity);
> > > -		else if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > -			ts = next_job->submit_ts;
> > > -		else
> > > -			ts = drm_sched_rq_get_rr_ts(rq, entity);
> > > -
> > > -		drm_sched_rq_update_fifo_locked(entity, rq, ts);
> > > +		ts = drm_sched_entity_get_job_ts(entity);
> > > +		drm_sched_rq_update_tree_locked(entity, rq, ts);
> > >   	} else {
> > > -		drm_sched_rq_remove_fifo_locked(entity, rq);
> > > +		ktime_t min_vruntime;
> > >   
> > > -		if (drm_sched_policy == DRM_SCHED_POLICY_FAIR) {
> > > -			ktime_t min_vruntime;
> > > -
> > > -			min_vruntime = drm_sched_rq_get_min_vruntime(rq);
> > > -			drm_sched_entity_save_vruntime(entity, min_vruntime);
> > > -		}
> > > +		drm_sched_rq_remove_tree_locked(entity, rq);
> > > +		min_vruntime = drm_sched_rq_get_min_vruntime(rq);
> > > +		drm_sched_entity_save_vruntime(entity, min_vruntime);
> > >   	}
> > >   	spin_unlock(&rq->lock);
> > >   	spin_unlock(&entity->lock);
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 327b75a052c7..c9b75a05d05c 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -96,8 +96,7 @@ struct drm_sched_entity {
> > >   	 * @lock:
> > >   	 *
> > >   	 * Lock protecting the run-queue (@rq) to which this entity belongs,
> > > -	 * @priority, the list of schedulers (@sched_list, @num_sched_list) and
> > > -	 * the @rr_ts field.
> > > +	 * @priority and the list of schedulers (@sched_list, @num_sched_list).
> > >   	 */
> > >   	spinlock_t			lock;
> > >   
> > > @@ -150,18 +149,6 @@ struct drm_sched_entity {
> > >   	 */
> > >   	enum drm_sched_priority         priority;
> > >   
> > > -	/**
> > > -	 * @rq_priority: Run-queue priority
> > > -	 */
> > > -	enum drm_sched_priority         rq_priority;
> > > -
> > > -	/**
> > > -	 * @rr_ts:
> > > -	 *
> > > -	 * Fake timestamp of the last popped job from the entity.
> > > -	 */
> > > -	ktime_t				rr_ts;
> > > -
> > >   	/**
> > >   	 * @job_queue: the list of jobs of this entity.
> > >   	 */
> > > @@ -259,8 +246,7 @@ struct drm_sched_entity {
> > >    * struct drm_sched_rq - queue of entities to be scheduled.
> > >    *
> > >    * @sched: the scheduler to which this rq belongs to.
> > > - * @lock: protects @entities, @rb_tree_root, @rr_ts and @head_prio.
> > > - * @rr_ts: monotonically incrementing fake timestamp for RR mode
> > > + * @lock: protects @entities, @rb_tree_root and @head_prio.
> > >    * @entities: list of the entities to be scheduled.
> > >    * @rb_tree_root: root of time based priority queue of entities for FIFO scheduling
> > >    * @head_prio: priority of the top tree element
> > > @@ -274,7 +260,6 @@ struct drm_sched_rq {
> > >   
> > >   	spinlock_t			lock;
> > >   	/* Following members are protected by the @lock: */
> > > -	ktime_t				rr_ts;
> > >   	struct list_head		entities;
> > >   	struct rb_root_cached		rb_tree_root;
> > >   	enum drm_sched_priority		head_prio;
> > > @@ -360,13 +345,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
> > >    * to schedule the job.
> > >    */
> > >   struct drm_sched_job {
> > > -	/**
> > > -	 * @submit_ts:
> > > -	 *
> > > -	 * When the job was pushed into the entity queue.
> > > -	 */
> > > -	ktime_t                         submit_ts;
> > > -
> > >   	/**
> > >   	 * @sched:
> > >   	 *
> > > @@ -603,9 +581,7 @@ struct drm_gpu_scheduler {
> > >   	atomic_t			credit_count;
> > >   	long				timeout;
> > >   	const char			*name;
> > > -	u32                             num_rqs;
> > > -	u32                             num_user_rqs;
> > > -	struct drm_sched_rq             **sched_rq;
> > > +	struct drm_sched_rq             *rq;
> > 
> > Docu must be updated
> 
> You insisted the series must be an RFC, and I tend not to polish RFCs
> to the max. ;p

Ah well, we're doing a Schrödinger's review: this series is both RFC
and v1 simultaneously :D

But seriously, I think the reason I mentioned the documentation so
frequently is mostly that more documentation helps me understand the
RFC better, since I didn't write the code.

In the line above it's indeed trivial, of course. Don't bother yourself
too much, at least not before we talk about CFS with the others next
week.
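
(For the line above it would presumably just be something like

 * @rq: the single run queue serving all entities of this scheduler.

in the kernel-doc of struct drm_gpu_scheduler.)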


P.

> 
> Will fix.
> 
> Regards,
> 
> Tvrtko
> 
> 
> > >   	wait_queue_head_t		job_scheduled;
> > >   	atomic64_t			job_id_count;
> > >   	struct workqueue_struct		*submit_wq;
> > 
> 



Thread overview: 35+ messages:
2025-09-03 10:18 [RFC v8 00/12] Fair DRM scheduler Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 01/12] drm/sched: Add some scheduling quality unit tests Tvrtko Ursulin
2025-09-23 13:01   ` Philipp Stanner
2025-09-25 11:41     ` Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 02/12] drm/sched: Add some more " Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 03/12] drm/sched: Implement RR via FIFO Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 04/12] drm/sched: Consolidate entity run queue management Tvrtko Ursulin
2025-09-11 14:20   ` Philipp Stanner
2025-09-11 14:55     ` Tvrtko Ursulin
2025-09-16  7:41       ` Philipp Stanner
2025-09-16  9:04         ` Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 05/12] drm/sched: Move run queue related code into a separate file Tvrtko Ursulin
2025-09-24  7:54   ` Philipp Stanner
2025-09-03 10:18 ` [RFC v8 06/12] drm/sched: Free all finished jobs at once Tvrtko Ursulin
2025-09-24  8:07   ` Philipp Stanner
2025-09-03 10:18 ` [RFC v8 07/12] drm/sched: Account entity GPU time Tvrtko Ursulin
2025-09-24  9:11   ` Philipp Stanner
2025-09-25 11:52     ` Tvrtko Ursulin
2025-09-26 12:30       ` Philipp Stanner
2025-09-03 10:18 ` [RFC v8 08/12] drm/sched: Remove idle entity from tree Tvrtko Ursulin
2025-09-11 14:32   ` Philipp Stanner
2025-09-11 15:06     ` Tvrtko Ursulin
2025-09-24  9:15       ` Philipp Stanner
2025-09-25 12:23         ` Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 09/12] drm/sched: Add fair scheduling policy Tvrtko Ursulin
2025-09-24  9:38   ` Philipp Stanner
2025-09-25 12:34     ` Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 10/12] drm/sched: Break submission patterns with some randomness Tvrtko Ursulin
2025-09-03 10:18 ` [RFC v8 11/12] drm/sched: Remove FIFO and RR and simplify to a single run queue Tvrtko Ursulin
2025-09-24 11:50   ` Philipp Stanner
2025-09-25 12:56     ` Tvrtko Ursulin
2025-09-26 13:32       ` Philipp Stanner
2025-09-03 10:18 ` [RFC v8 12/12] drm/sched: Embed run queue singleton into the scheduler Tvrtko Ursulin
2025-09-24 12:01   ` Philipp Stanner
2025-09-25 13:02     ` Tvrtko Ursulin
