From: "Christian König" <christian.koenig@amd.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: ltuikov89@gmail.com, dri-devel@lists.freedesktop.org,
Thorsten Leemhuis <regressions@leemhuis.info>,
Mario Limonciello <mario.limonciello@amd.com>,
daniel@ffwll.ch, Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>,
airlied@gmail.com, intel-xe@lists.freedesktop.org,
Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
Date: Fri, 26 Jan 2024 11:32:57 +0100 [thread overview]
Message-ID: <0bef4c76-924f-442f-af9c-d701e640db41@amd.com> (raw)
In-Reply-To: <ZbKaqdu5Y/WNwWVX@DUT025-TGLU.fm.intel.com>
On 25.01.24 at 18:30, Matthew Brost wrote:
> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>
>> On 24.01.24 at 22:08, Matthew Brost wrote:
>>> All entities must be drained in the DRM scheduler run job worker to
>>> avoid the following case: an entity is found that is ready, but no job
>>> on that entity is ready, and the run job worker goes idle while other
>>> entities still have ready jobs. Draining all ready entities (i.e.
>>> looping over all ready entities) in the run job worker ensures all
>>> jobs that are ready will be scheduled.
>> That doesn't make sense. drm_sched_select_entity() only returns entities
>> which are "ready", i.e. have a job to run.
>>
> That is what I thought too, hence my original design, but it is not
> exactly true. Let me explain.
>
> drm_sched_select_entity() returns an entity with a non-empty spsc queue
> (a job in the queue) and no *current* waiting dependencies [1]. Dependencies
> for an entity can be added when drm_sched_entity_pop_job() is called [2][3],
> returning a NULL job. Thus we can get into a scenario where two entities
> A and B both have jobs and no current dependencies. A's job is waiting on
> B's job, entity A gets selected first, a dependency gets installed in
> drm_sched_entity_pop_job(), the run worker goes idle, and now we deadlock.
And here is the real problem: run work doesn't go idle at that moment.
drm_sched_run_job_work() should restart itself until there is either no
more space in the ring buffer or it can't find a ready entity any more.
At least that was the original design when this was all still driven by
a kthread.
It may well be that we messed this up when switching from the kthread
to a work item.
Regards,
Christian.
>
> The proper solution is to loop over all ready entities until one with a
> job is found via drm_sched_entity_pop_job() and then requeue the run
> job worker, or to loop over all entities until drm_sched_select_entity()
> returns NULL and then let the run job worker go idle. This is what the
> old threaded design did too [4]. Hope this clears everything up.
>
> Matt
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
> [2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
> [3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
> [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011
>
>> If that's not the case any more then you have broken something else.
>>
>> Regards,
>> Christian.
>>
>>> Cc: Thorsten Leemhuis <regressions@leemhuis.info>
>>> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>>> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
>>> Reported-and-tested-by: Mario Limonciello <mario.limonciello@amd.com>
>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
>>> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
>>> Reported-by: Vlastimil Babka <vbabka@suse.cz>
>>> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>> drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>>> 1 file changed, 7 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 550492a7a031..85f082396d42 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>> struct drm_sched_entity *entity;
>>> struct dma_fence *fence;
>>> struct drm_sched_fence *s_fence;
>>> - struct drm_sched_job *sched_job;
>>> + struct drm_sched_job *sched_job = NULL;
>>> int r;
>>> if (READ_ONCE(sched->pause_submit))
>>> return;
>>> - entity = drm_sched_select_entity(sched);
>>> + /* Find entity with a ready job */
>>> + while (!sched_job && (entity = drm_sched_select_entity(sched))) {
>>> + sched_job = drm_sched_entity_pop_job(entity);
>>> + if (!sched_job)
>>> + complete_all(&entity->entity_idle);
>>> + }
>>> if (!entity)
>>> - return;
>>> -
>>> - sched_job = drm_sched_entity_pop_job(entity);
>>> - if (!sched_job) {
>>> - complete_all(&entity->entity_idle);
>>> return; /* No more work */
>>> - }
>>> s_fence = sched_job->s_fence;