Re: [Regression] drm/scheduler: track GPU active time per entity

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Danilo Krummrich <dakr@redhat.com>
To: "Luben Tuikov" <luben.tuikov@amd.com>,
	"Lucas Stach" <l.stach@pengutronix.de>,
	daniel@ffwll.ch, "Dave Airlie" <airlied@gmail.com>,
	"Bagas Sanjaya" <bagasdotme@gmail.com>,
	andrey.grodzovsky@amd.com,
	"Christian König" <christian.koenig@amd.com>,
	tvrtko.ursulin@linux.intel.com,
	"Matthew Brost" <matthew.brost@intel.com>,
	yuq825@gmail.com,
	"Boris Brezillon" <boris.brezillon@collabora.com>,
	lina@asahilina.net
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: Re: [Regression] drm/scheduler: track GPU active time per entity
Date: Wed, 5 Apr 2023 16:05:53 +0200	[thread overview]
Message-ID: <3004a2bf-e725-643e-82af-8a217784e796@redhat.com> (raw)
In-Reply-To: <8b28151c-f2db-af3f-8dff-87dd5d57417b@amd.com>

On 4/4/23 06:31, Luben Tuikov wrote:
> On 2023-03-28 04:54, Lucas Stach wrote:
>> Hi Danilo,
>>
>> Am Dienstag, dem 28.03.2023 um 02:57 +0200 schrieb Danilo Krummrich:
>>> Hi all,
>>>
>>> Commit df622729ddbf ("drm/scheduler: track GPU active time per entity")
>>> tries to track the accumulated time that a job was active on the GPU
>>> writing it to the entity through which the job was deployed to the
>>> scheduler originally. This is done within drm_sched_get_cleanup_job()
>>> which fetches a job from the schedulers pending_list.
>>>
>>> Doing this can result in a race condition where the entity is already
>>> freed, but the entity's newly added elapsed_ns field is still accessed
>>> once the job is fetched from the pending_list.
>>>
>>> After drm_sched_entity_destroy() being called it should be safe to free
>>> the structure that embeds the entity. However, a job originally handed
>>> over to the scheduler by this entity might still reside in the
>>> schedulers pending_list for cleanup after drm_sched_entity_destroy()
>>> already being called and the entity being freed. Hence, we can run into
>>> a UAF.
>>>
>> Sorry about that, I clearly didn't properly consider this case.
>>
>>> In my case it happened that a job, as explained above, was just picked
>>> from the schedulers pending_list after the entity was freed due to the
>>> client application exiting. Meanwhile this freed up memory was already
>>> allocated for a subsequent client applications job structure again.
>>> Hence, the new jobs memory got corrupted. Luckily, I was able to
>>> reproduce the same corruption over and over again by just using
>>> deqp-runner to run a specific set of VK test cases in parallel.
>>>
>>> Fixing this issue doesn't seem to be very straightforward though (unless
>>> I miss something), which is why I'm writing this mail instead of sending
>>> a fix directly.
>>>
>>> Spontaneously, I see three options to fix it:
>>>
>>> 1. Rather than embedding the entity into driver specific structures
>>> (e.g. tied to file_priv) we could allocate the entity separately and
>>> reference count it, such that it's only freed up once all jobs that were
>>> deployed through this entity are fetched from the schedulers pending list.
>>>
>> My vote is on this or something in similar vain for the long term. I
>> have some hope to be able to add a GPU scheduling algorithm with a bit
>> more fairness than the current one sometime in the future, which
>> requires execution time tracking on the entities.
> 
> Danilo,
> 
> Using kref is preferable, i.e. option 1 above.

I think the only real motivation for doing that would be for generically 
tracking job statistics within the entity a job was deployed through. If 
we all agree on tracking job statistics this way I am happy to prepare a 
patch for this option and drop this one: 
https://lore.kernel.org/all/20230331000622.4156-1-dakr@redhat.com/T/#u

Christian mentioned amdgpu tried something similar to what Lucas tried 
running into similar trouble, backed off and implemented it in another 
way - a driver specific way I guess?

> 
> Lucas, can you shed some light on,
> 
> 1. In what way the current FIFO scheduling is unfair, and
> 2. shed some details on this "scheduling algorithm with a bit
> more fairness than the current one"?
> 
> Regards,
> Luben
> 
>>
>>> 2. Somehow make sure drm_sched_entity_destroy() does block until all
>>> jobs deployed through this entity were fetched from the schedulers
>>> pending list. Though, I'm pretty sure that this is not really desirable.
>>>
>>> 3. Just revert the change and let drivers implement tracking of GPU
>>> active times themselves.
>>>
>> Given that we are already pretty late in the release cycle and etnaviv
>> being the only driver so far making use of the scheduler elapsed time
>> tracking I think the right short term solution is to either move the
>> tracking into etnaviv or just revert the change for now. I'll have a
>> look at this.
>>
>> Regards,
>> Lucas
>>
>>> In the case of just reverting the change I'd propose to also set a jobs
>>> entity pointer to NULL  once the job was taken from the entity, such
>>> that in case of a future issue we fail where the actual issue resides
>>> and to make it more obvious that the field shouldn't be used anymore
>>> after the job was taken from the entity.
>>>
>>> I'm happy to implement the solution we agree on. However, it might also
>>> make sense to revert the change until we have a solution in place. I'm
>>> also happy to send a revert with a proper description of the problem.
>>> Please let me know what you think.
>>>
>>> - Danilo
>>>
>>
>

next prev parent reply	other threads:[~2023-04-05 15:38 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-28  0:57 [Regression] drm/scheduler: track GPU active time per entity Danilo Krummrich
2023-03-28  8:54 ` Lucas Stach
2023-04-04  4:31   ` Luben Tuikov
2023-04-05 14:05     ` Danilo Krummrich [this message]
2023-04-05 16:09       ` Luben Tuikov
2023-04-05 16:23         ` Danilo Krummrich
2023-04-06  8:22         ` Christian König
2023-04-06  8:27           ` Daniel Vetter
2023-04-06  9:05             ` Asahi Lina
2023-04-06 10:09               ` Daniel Vetter
2023-04-06 12:21                 ` Asahi Lina
2023-04-06 13:37                   ` Daniel Vetter
2023-04-06 14:43                     ` Asahi Lina
2023-04-06 15:30                       ` Daniel Vetter
2023-04-06 17:11                         ` Asahi Lina
2023-04-06 17:42                           ` Daniel Vetter
2023-04-06 17:46                             ` Daniel Vetter
2023-04-06 10:45             ` Lucas Stach
2023-04-06 12:09               ` Daniel Vetter
2023-04-06 12:19                 ` Lucas Stach
2023-04-06 13:56                   ` Daniel Vetter
2023-04-06 14:21               ` Christian König
2023-04-06 15:24                 ` Lucas Stach
2023-04-06 15:33                   ` Christian König
2023-04-06 15:59                     ` Daniel Vetter
2023-04-06 16:23                       ` Daniel Vetter
2023-04-06 22:36               ` Luben Tuikov
2023-04-06 22:30           ` Luben Tuikov
2023-04-05 17:44     ` Lucas Stach
2023-04-05 20:44       ` Luben Tuikov
2023-04-06 15:58         ` Lucas Stach
2023-04-06 22:06           ` Luben Tuikov
     [not found] <e3e07282-136f-5239-96ae-4aeb2d3c95d7.ref@yahoo.de>
     [not found] ` <e3e07282-136f-5239-96ae-4aeb2d3c95d7@yahoo.de>
2023-03-28 10:29   ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3004a2bf-e725-643e-82af-8a217784e796@redhat.com \
    --to=dakr@redhat.com \
    --cc=airlied@gmail.com \
    --cc=andrey.grodzovsky@amd.com \
    --cc=bagasdotme@gmail.com \
    --cc=boris.brezillon@collabora.com \
    --cc=christian.koenig@amd.com \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=l.stach@pengutronix.de \
    --cc=lina@asahilina.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luben.tuikov@amd.com \
    --cc=matthew.brost@intel.com \
    --cc=tvrtko.ursulin@linux.intel.com \
    --cc=yuq825@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox