From: Danilo Krummrich <dakr@kernel.org>
To: "Maíra Canal" <mcanal@igalia.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>
Cc: phasta@kernel.org, Matthew Brost <matthew.brost@intel.com>,
Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
Maxime Ripard <mripard@kernel.org>,
Thomas Zimmermann <tzimmermann@suse.de>,
David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>,
Tvrtko Ursulin <tvrtko.ursulin@igalia.com>,
dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()
Date: Mon, 24 Feb 2025 15:43:49 +0100
Message-ID: <Z7yFpZMCFINhEht7@cassiopeiae>
In-Reply-To: <cfef8bd7-f335-4796-9d4f-93197bb3fc2d@igalia.com>
On Mon, Feb 24, 2025 at 10:29:26AM -0300, Maíra Canal wrote:
> On 20/02/25 12:28, Philipp Stanner wrote:
> > On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
> > > Would it be possible to add a comment that `run_job()` must check if
> > > `s_fence->finished.error` is different than 0? If you increase the
> > > karma
> > > of a job and don't check for `s_fence->finished.error`, you might run
> > > a
> > > cancelled job.
> >
> > s_fence->finished is only signaled and its error set once the hardware
> > fence got signaled; or when the entity is killed.
>
> If you have a timeout, increase the karma of that job with
> `drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
> latter will flag an error in the dma fence. If you don't check for it in
> `run_job()`, you will run the guilty job again.
Considering that drm_sched_resubmit_jobs() is deprecated, I don't think we need
to add this hint to the documentation; the drivers that are still using the API
hopefully got it right.
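For readers following along, the check being discussed can be sketched roughly
as below. This is a userspace mock in C, not kernel code: every mock_* type and
function is a hypothetical stand-in, and the rule that resubmission flags the
finished fence with -ECANCELED once a job's karma exceeds the hang limit is an
assumption based on the description quoted above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's types. */
struct mock_fence {
	int error;		/* 0, or a negative errno once flagged */
};

struct mock_job {
	int karma;			/* bumped each time the job is blamed for a hang */
	struct mock_fence finished;	/* stands in for s_fence->finished */
	struct mock_fence hw_fence;	/* stands in for the hardware fence */
	int times_run;			/* how often the job reached the "hardware" */
};

/* Stand-in for the idea behind drm_sched_increase_karma(): blame the job. */
static void mock_increase_karma(struct mock_job *job)
{
	job->karma++;
}

/*
 * The check under discussion: refuse to run a job whose finished fence
 * already carries an error, instead of pushing it to the hardware again.
 */
static struct mock_fence *mock_run_job(struct mock_job *job)
{
	if (job->finished.error)
		return NULL;

	job->times_run++;
	return &job->hw_fence;
}

/*
 * Stand-in for the idea behind drm_sched_resubmit_jobs() (assumed behavior):
 * flag a job whose karma exceeds the hang limit, then call run_job() anyway.
 */
static struct mock_fence *mock_resubmit_job(struct mock_job *job, int hang_limit)
{
	if (job->karma > hang_limit)
		job->finished.error = -ECANCELED;

	return mock_run_job(job);
}
```

In this mock, a job that was blamed once with a hang limit of 0 is flagged and
skipped on resubmission, while a job that was never blamed still runs. Without
the check in mock_run_job(), the flagged job would be run again.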
> I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
> currently fixing an issue in V3D with the GPU reset and we still use
> `drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
> `timeout_job()` and the information I commented here (which was crucial
> to fix the bug) wasn't available there.
Well, hopefully... :-)
>
> `drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
> new use in 2023
Yeah, that's a bit odd, since Xe relies on a firmware scheduler and uses a 1:1
scheduler-to-entity setup. I'm a bit surprised that Xe uses this function.
> for example. The commit that deprecated it just
> mentions AMD's case, but do we know if the function works as expected
> for the other users?
I read the comment [1] you're referring to differently. It says that
"Re-submitting jobs was a concept AMD came up as cheap way to implement recovery
after a job timeout".
It further explains that "there are many problem with the dma_fence
implementation and requirements. Either the implementation is risking deadlocks
with core memory management or violating documented implementation details of
the dma_fence object", which doesn't give any hint to me that the conceptual
issues are limited to amdgpu.
> For V3D, it does. Also, we need to make it clear which
> are the dma fence requirements that the function violates.
This I fully agree with; unfortunately, the comment does not explain what the
issue is at all.
While I do think I have a vague idea of what the potential issue with this
approach is, I think it would be way better to get Christian, as the expert on
DMA fence rules, to comment on this.
@Christian: Can you please shed some light on this?
>
> If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
> provide a common interface for job resubmission?
I wonder why this question did not come up when drm_sched_resubmit_jobs() was
deprecated two years ago. Or did it?
Anyway, let's shed some light on the difficulties with drm_sched_resubmit_jobs()
and then we can figure out how we can do better.
I think it would also be interesting to know how amdgpu handles jobs from
unrelated entities being discarded by not re-submitting them when a job from
another entity hangs the HW ring.
[1] https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com