dri-devel.lists.freedesktop.org archive mirror
* [PATCH] drm/scheduler: Remove entity->rq NULL check
@ 2018-08-03 11:08 Christian König
  2018-08-06 14:13 ` Christian König
  0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-03 11:08 UTC (permalink / raw)
  To: dri-devel

That check is superfluous now.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 85908c7f913e..65078dd3c82c 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	if (first) {
 		/* Add the entity to the run queue */
 		spin_lock(&entity->rq_lock);
-		if (!entity->rq) {
-			DRM_ERROR("Trying to push to a killed entity\n");
-			spin_unlock(&entity->rq_lock);
-			return;
-		}
 		drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 		drm_sched_wakeup(entity->rq->sched);
-- 
2.14.1

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-03 11:08 [PATCH] drm/scheduler: Remove entity->rq NULL check Christian König
@ 2018-08-06 14:13 ` Christian König
  2018-08-06 14:18   ` Nayan Deshmukh
  0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-06 14:13 UTC (permalink / raw)
  To: dri-devel, Nayan Deshmukh, Grodzovsky, Andrey

Ping. Any objections to that?

Christian.

Am 03.08.2018 um 13:08 schrieb Christian König:
> That is superflous now.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>   1 file changed, 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 85908c7f913e..65078dd3c82c 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>   	if (first) {
>   		/* Add the entity to the run queue */
>   		spin_lock(&entity->rq_lock);
> -		if (!entity->rq) {
> -			DRM_ERROR("Trying to push to a killed entity\n");
> -			spin_unlock(&entity->rq_lock);
> -			return;
> -		}
>   		drm_sched_rq_add_entity(entity->rq, entity);
>   		spin_unlock(&entity->rq_lock);
>   		drm_sched_wakeup(entity->rq->sched);

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-06 14:13 ` Christian König
@ 2018-08-06 14:18   ` Nayan Deshmukh
  2018-08-09 19:49     ` Andrey Grodzovsky
  0 siblings, 1 reply; 11+ messages in thread
From: Nayan Deshmukh @ 2018-08-06 14:18 UTC (permalink / raw)
  To: ckoenig.leichtzumerken; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 1496 bytes --]

I forgot about this since we started discussing possible scenarios of
processes and threads.

In any case, this check is redundant. Acked-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>

Nayan

On Mon, Aug 6, 2018 at 7:43 PM Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> Ping. Any objections to that?
>
> Christian.
>
> Am 03.08.2018 um 13:08 schrieb Christian König:
> > That is superflous now.
> >
> > Signed-off-by: Christian König <christian.koenig@amd.com>
> > ---
> >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
> >   1 file changed, 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > index 85908c7f913e..65078dd3c82c 100644
> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job
> *sched_job,
> >       if (first) {
> >               /* Add the entity to the run queue */
> >               spin_lock(&entity->rq_lock);
> > -             if (!entity->rq) {
> > -                     DRM_ERROR("Trying to push to a killed entity\n");
> > -                     spin_unlock(&entity->rq_lock);
> > -                     return;
> > -             }
> >               drm_sched_rq_add_entity(entity->rq, entity);
> >               spin_unlock(&entity->rq_lock);
> >               drm_sched_wakeup(entity->rq->sched);
>
>

[-- Attachment #1.2: Type: text/html, Size: 2155 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-06 14:18   ` Nayan Deshmukh
@ 2018-08-09 19:49     ` Andrey Grodzovsky
  2018-08-10 13:27       ` Christian König
  0 siblings, 1 reply; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-09 19:49 UTC (permalink / raw)
  To: Nayan Deshmukh, ckoenig.leichtzumerken; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 2539 bytes --]

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>


But I still have questions about entity->last_user (didn't notice this
before) -

Looks to me like there is a race condition with its current usage. Let's say
process A was preempted after doing drm_sched_entity_flush->cmpxchg(...);
now process B, working on the same entity (forked), is inside
drm_sched_entity_push_job, writes its PID to entity->last_user and also
executes drm_sched_rq_add_entity. Now process A runs again and executes
drm_sched_rq_remove_entity, inadvertently causing process B's removal from
its scheduler rq.

Looks to me like instead we should lock together the entity->last_user
accesses and the adds/removals of the entity to the rq.

Andrey
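
Spelled out against the scheduler code quoted later in this thread, that
interleaving looks roughly like this (a paraphrased sketch, not a verbatim
excerpt of gpu_scheduler.c):

	/* process A (dying), in drm_sched_entity_flush(): */
	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
	/* ... A is preempted here; the local last_user still points at A ... */

	/* process B, in drm_sched_entity_push_job() on the same entity: */
	WRITE_ONCE(entity->last_user, current->group_leader);
	drm_sched_rq_add_entity(entity->rq, entity);	/* B expects to run */

	/* process A resumes; its cmpxchg result still satisfies the check: */
	if ((!last_user || last_user == current->group_leader) &&
	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
		drm_sched_rq_remove_entity(entity->rq, entity);	/* B is gone from the rq */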


On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
> I forgot about this since we started discussing possible scenarios of 
> processes and threads.
>
> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>
> Nayan
>
> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
> <ckoenig.leichtzumerken@gmail.com 
> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>
>     Ping. Any objections to that?
>
>     Christian.
>
>     Am 03.08.2018 um 13:08 schrieb Christian König:
>     > That is superflous now.
>     >
>     > Signed-off-by: Christian König <christian.koenig@amd.com
>     <mailto:christian.koenig@amd.com>>
>     > ---
>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>     >   1 file changed, 5 deletions(-)
>     >
>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>     > index 85908c7f913e..65078dd3c82c 100644
>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>     drm_sched_job *sched_job,
>     >       if (first) {
>     >               /* Add the entity to the run queue */
>     >               spin_lock(&entity->rq_lock);
>     > -             if (!entity->rq) {
>     > -                     DRM_ERROR("Trying to push to a killed
>     entity\n");
>     > -  spin_unlock(&entity->rq_lock);
>     > -                     return;
>     > -             }
>     >               drm_sched_rq_add_entity(entity->rq, entity);
>     >               spin_unlock(&entity->rq_lock);
>     >               drm_sched_wakeup(entity->rq->sched);
>


[-- Attachment #1.2: Type: text/html, Size: 4232 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-09 19:49     ` Andrey Grodzovsky
@ 2018-08-10 13:27       ` Christian König
  2018-08-10 13:44         ` Andrey Grodzovsky
  2018-08-13 16:43         ` Andrey Grodzovsky
  0 siblings, 2 replies; 11+ messages in thread
From: Christian König @ 2018-08-10 13:27 UTC (permalink / raw)
  To: Andrey Grodzovsky, Nayan Deshmukh; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 2766 bytes --]

Crap, yeah indeed that needs to be protected by some lock.

Going to prepare a patch for that,
Christian.

Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>
> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
>
> But I still  have questions about entity->last_user (didn't notice 
> this before) -
>
> Looks to me there is a race condition with it's current usage, let's 
> say process A was preempted after doing 
> drm_sched_entity_flush->cmpxchg(...)
>
> now process B working on same entity (forked) is inside 
> drm_sched_entity_push_job, he writes his PID to entity->last_user and also
>
> executes drm_sched_rq_add_entity. Now process A runs again and execute 
> drm_sched_rq_remove_entity inadvertently causing process B removal
>
> from it's scheduler rq.
>
> Looks to me like instead we should lock together entity->last_user 
> accesses and adds/removals of entity to the rq.
>
> Andrey
>
>
> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>> I forgot about this since we started discussing possible scenarios of 
>> processes and threads.
>>
>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>
>> Nayan
>>
>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>> <ckoenig.leichtzumerken@gmail.com 
>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>
>>     Ping. Any objections to that?
>>
>>     Christian.
>>
>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>     > That is superflous now.
>>     >
>>     > Signed-off-by: Christian König <christian.koenig@amd.com
>>     <mailto:christian.koenig@amd.com>>
>>     > ---
>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>     >   1 file changed, 5 deletions(-)
>>     >
>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>     > index 85908c7f913e..65078dd3c82c 100644
>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>     drm_sched_job *sched_job,
>>     >       if (first) {
>>     >               /* Add the entity to the run queue */
>>     >               spin_lock(&entity->rq_lock);
>>     > -             if (!entity->rq) {
>>     > -                     DRM_ERROR("Trying to push to a killed
>>     entity\n");
>>     > -  spin_unlock(&entity->rq_lock);
>>     > -                     return;
>>     > -             }
>>     >               drm_sched_rq_add_entity(entity->rq, entity);
>>     >               spin_unlock(&entity->rq_lock);
>>     >  drm_sched_wakeup(entity->rq->sched);
>>
>


[-- Attachment #1.2: Type: text/html, Size: 4928 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-10 13:27       ` Christian König
@ 2018-08-10 13:44         ` Andrey Grodzovsky
  2018-08-13 16:43         ` Andrey Grodzovsky
  1 sibling, 0 replies; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-10 13:44 UTC (permalink / raw)
  To: christian.koenig, Nayan Deshmukh; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 2941 bytes --]

I can take care of this.

Andrey


On 08/10/2018 09:27 AM, Christian König wrote:
> Crap, yeah indeed that needs to be protected by some lock.
>
> Going to prepare a patch for that,
> Christian.
>
> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>
>>
>> But I still  have questions about entity->last_user (didn't notice 
>> this before) -
>>
>> Looks to me there is a race condition with it's current usage, let's 
>> say process A was preempted after doing 
>> drm_sched_entity_flush->cmpxchg(...)
>>
>> now process B working on same entity (forked) is inside 
>> drm_sched_entity_push_job, he writes his PID to entity->last_user and 
>> also
>>
>> executes drm_sched_rq_add_entity. Now process A runs again and 
>> execute drm_sched_rq_remove_entity inadvertently causing process B 
>> removal
>>
>> from it's scheduler rq.
>>
>> Looks to me like instead we should lock together entity->last_user 
>> accesses and adds/removals of entity to the rq.
>>
>> Andrey
>>
>>
>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>> I forgot about this since we started discussing possible scenarios 
>>> of processes and threads.
>>>
>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>
>>> Nayan
>>>
>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>>> <ckoenig.leichtzumerken@gmail.com 
>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>
>>>     Ping. Any objections to that?
>>>
>>>     Christian.
>>>
>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>     > That is superflous now.
>>>     >
>>>     > Signed-off-by: Christian König <christian.koenig@amd.com
>>>     <mailto:christian.koenig@amd.com>>
>>>     > ---
>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>     >   1 file changed, 5 deletions(-)
>>>     >
>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>     drm_sched_job *sched_job,
>>>     >       if (first) {
>>>     >               /* Add the entity to the run queue */
>>>     >               spin_lock(&entity->rq_lock);
>>>     > -             if (!entity->rq) {
>>>     > -                     DRM_ERROR("Trying to push to a killed
>>>     entity\n");
>>>     > -  spin_unlock(&entity->rq_lock);
>>>     > -                     return;
>>>     > -             }
>>>     >               drm_sched_rq_add_entity(entity->rq, entity);
>>>     >               spin_unlock(&entity->rq_lock);
>>>     >  drm_sched_wakeup(entity->rq->sched);
>>>
>>
>


[-- Attachment #1.2: Type: text/html, Size: 5384 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-10 13:27       ` Christian König
  2018-08-10 13:44         ` Andrey Grodzovsky
@ 2018-08-13 16:43         ` Andrey Grodzovsky
  2018-08-14  7:05           ` Christian König
  1 sibling, 1 reply; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-13 16:43 UTC (permalink / raw)
  To: christian.koenig, Nayan Deshmukh; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 3383 bytes --]

Attached.

If the general idea in the patch is OK I can think of a test (and maybe
add it to the libdrm amdgpu tests) to actually simulate this scenario with
2 forked concurrent processes working on the same entity's job queue, one
dying while the other keeps pushing to the same queue. For now I only
tested it with a normal boot and running multiple glxgears concurrently -
which doesn't really test this code path, since I think each of them works
on its own FD.

Andrey
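
A minimal user-space skeleton of that scenario might look like the
following - only a sketch, not the libdrm test mentioned above; the render
node path is an assumption and the actual amdgpu command submission is
left as placeholder comments:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* one fd shared across fork() -> both processes end up on the same context/entity */
	int fd = open("/dev/dri/renderD128", O_RDWR);
	pid_t child;
	int i;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	child = fork();
	if (child == 0) {
		for (;;) {
			/* placeholder: push a job to the shared entity's queue here */
			usleep(1000);
		}
	}

	for (i = 0; i < 1000; i++) {
		/* placeholder: parent keeps pushing jobs to the same queue */
		usleep(1000);
		if (i == 500)
			kill(child, SIGKILL);	/* one process dies while the other keeps pushing */
	}

	waitpid(child, NULL, 0);
	close(fd);
	return 0;
}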


On 08/10/2018 09:27 AM, Christian König wrote:
> Crap, yeah indeed that needs to be protected by some lock.
>
> Going to prepare a patch for that,
> Christian.
>
> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>
>>
>> But I still  have questions about entity->last_user (didn't notice 
>> this before) -
>>
>> Looks to me there is a race condition with it's current usage, let's 
>> say process A was preempted after doing 
>> drm_sched_entity_flush->cmpxchg(...)
>>
>> now process B working on same entity (forked) is inside 
>> drm_sched_entity_push_job, he writes his PID to entity->last_user and 
>> also
>>
>> executes drm_sched_rq_add_entity. Now process A runs again and 
>> execute drm_sched_rq_remove_entity inadvertently causing process B 
>> removal
>>
>> from it's scheduler rq.
>>
>> Looks to me like instead we should lock together entity->last_user 
>> accesses and adds/removals of entity to the rq.
>>
>> Andrey
>>
>>
>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>> I forgot about this since we started discussing possible scenarios 
>>> of processes and threads.
>>>
>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>
>>> Nayan
>>>
>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>>> <ckoenig.leichtzumerken@gmail.com 
>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>
>>>     Ping. Any objections to that?
>>>
>>>     Christian.
>>>
>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>     > That is superflous now.
>>>     >
>>>     > Signed-off-by: Christian König <christian.koenig@amd.com
>>>     <mailto:christian.koenig@amd.com>>
>>>     > ---
>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>     >   1 file changed, 5 deletions(-)
>>>     >
>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>     drm_sched_job *sched_job,
>>>     >       if (first) {
>>>     >               /* Add the entity to the run queue */
>>>     >               spin_lock(&entity->rq_lock);
>>>     > -             if (!entity->rq) {
>>>     > -                     DRM_ERROR("Trying to push to a killed
>>>     entity\n");
>>>     > -  spin_unlock(&entity->rq_lock);
>>>     > -                     return;
>>>     > -             }
>>>     >               drm_sched_rq_add_entity(entity->rq, entity);
>>>     >               spin_unlock(&entity->rq_lock);
>>>     >  drm_sched_wakeup(entity->rq->sched);
>>>
>>
>


[-- Attachment #1.2: Type: text/html, Size: 5898 bytes --]

[-- Attachment #2: 0001-drm-scheduler-Fix-possible-race-condition.patch --]
[-- Type: text/x-patch, Size: 2701 bytes --]

>From c581a410e1b3f82de2bb14746d8484db2162f82c Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date: Mon, 13 Aug 2018 12:33:29 -0400
Subject: drm/scheduler: Fix possible race condition.

Problem:
The possible race scenario is as follows:
Process A was preempted after doing drm_sched_entity_flush->cmpxchg(...);
now process B, working on the same entity (forked), is inside
drm_sched_entity_push_job, writes its PID to entity->last_user and also
executes drm_sched_rq_add_entity. Now process A runs again and
executes drm_sched_rq_remove_entity, inadvertently causing process B's
removal from its scheduler rq.

Fix:
Lock together entity->last_user accesses and adds/removals
of entity to the rq.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index f566405..5649c3d 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -311,10 +311,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 
 
 	/* For killed process disable any more IBs enqueue right now */
+	spin_lock(&entity->rq_lock);
 	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
 	if ((!last_user || last_user == current->group_leader) &&
 	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
 		drm_sched_rq_remove_entity(entity->rq, entity);
+	spin_unlock(&entity->rq_lock);
 
 	return ret;
 }
@@ -596,17 +598,23 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 
 	trace_drm_sched_job(sched_job, entity);
 	atomic_inc(&entity->rq->sched->num_jobs);
-	WRITE_ONCE(entity->last_user, current->group_leader);
 	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
 
-	/* first job wakes up scheduler */
-	if (first) {
-		/* Add the entity to the run queue */
-		spin_lock(&entity->rq_lock);
+	/*
+	 * entity might not be attached to the rq because it's the first time
+	 * we use it or because another process removed the entity from the rq
+	 * during its termination in drm_sched_entity_flush. We then need to
+	 * add the entity back to the rq.
+	 */
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
 		drm_sched_rq_add_entity(entity->rq, entity);
-		spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->rq_lock);
+
+	/* first job wakes up scheduler */
+	if (first)
 		drm_sched_wakeup(entity->rq->sched);
-	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
 
-- 
2.7.4


[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-13 16:43         ` Andrey Grodzovsky
@ 2018-08-14  7:05           ` Christian König
  2018-08-14 15:17             ` Andrey Grodzovsky
  0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-14  7:05 UTC (permalink / raw)
  To: Andrey Grodzovsky, christian.koenig, Nayan Deshmukh
  Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 4356 bytes --]

I would rather like to avoid taking the lock in the hot path.

How about this:

	/* For killed process disable any more IBs enqueue right now */
	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
	if ((!last_user || last_user == current->group_leader) &&
	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
		grab_lock();
		drm_sched_rq_remove_entity(entity->rq, entity);
		if (READ_ONCE(entity->last_user) != NULL)
			drm_sched_rq_add_entity(entity->rq, entity);
		drop_lock();
	}

Christian.
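
With grab_lock()/drop_lock() read as the entity's existing rq_lock (an
assumption - the sketch above deliberately leaves the lock unnamed), the
tail of drm_sched_entity_flush() would become roughly:

	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
	if ((!last_user || last_user == current->group_leader) &&
	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
		spin_lock(&entity->rq_lock);
		drm_sched_rq_remove_entity(entity->rq, entity);
		if (READ_ONCE(entity->last_user) != NULL)
			drm_sched_rq_add_entity(entity->rq, entity);
		spin_unlock(&entity->rq_lock);
	}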

Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>
> Attached.
>
> If the general idea in the patch is OK I can think of a test (and 
> maybe add to libdrm amdgpu tests) to actually simulate this scenario 
> with 2 forked
>
> concurrent processes working on same entity's job queue when one is 
> dying while the other keeps pushing to the same queue. For now I only 
> tested it
>
> with normal boot and ruining multiple glxgears concurrently - which 
> doesn't really test this code path since i think each of them works on 
> it's own FD.
>
> Andrey
>
>
> On 08/10/2018 09:27 AM, Christian König wrote:
>> Crap, yeah indeed that needs to be protected by some lock.
>>
>> Going to prepare a patch for that,
>> Christian.
>>
>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>
>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>
>>>
>>> But I still  have questions about entity->last_user (didn't notice 
>>> this before) -
>>>
>>> Looks to me there is a race condition with it's current usage, let's 
>>> say process A was preempted after doing 
>>> drm_sched_entity_flush->cmpxchg(...)
>>>
>>> now process B working on same entity (forked) is inside 
>>> drm_sched_entity_push_job, he writes his PID to entity->last_user 
>>> and also
>>>
>>> executes drm_sched_rq_add_entity. Now process A runs again and 
>>> execute drm_sched_rq_remove_entity inadvertently causing process B 
>>> removal
>>>
>>> from it's scheduler rq.
>>>
>>> Looks to me like instead we should lock together entity->last_user 
>>> accesses and adds/removals of entity to the rq.
>>>
>>> Andrey
>>>
>>>
>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>> I forgot about this since we started discussing possible scenarios 
>>>> of processes and threads.
>>>>
>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>
>>>> Nayan
>>>>
>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>>>> <ckoenig.leichtzumerken@gmail.com 
>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>
>>>>     Ping. Any objections to that?
>>>>
>>>>     Christian.
>>>>
>>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>     > That is superflous now.
>>>>     >
>>>>     > Signed-off-by: Christian König <christian.koenig@amd.com
>>>>     <mailto:christian.koenig@amd.com>>
>>>>     > ---
>>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>     >   1 file changed, 5 deletions(-)
>>>>     >
>>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>     drm_sched_job *sched_job,
>>>>     >       if (first) {
>>>>     >               /* Add the entity to the run queue */
>>>>     >               spin_lock(&entity->rq_lock);
>>>>     > -             if (!entity->rq) {
>>>>     > -                     DRM_ERROR("Trying to push to a killed
>>>>     entity\n");
>>>>     > -  spin_unlock(&entity->rq_lock);
>>>>     > -                     return;
>>>>     > -             }
>>>>     >  drm_sched_rq_add_entity(entity->rq, entity);
>>>>     >               spin_unlock(&entity->rq_lock);
>>>>     >  drm_sched_wakeup(entity->rq->sched);
>>>>
>>>
>>
>
>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel


[-- Attachment #1.2: Type: text/html, Size: 7836 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-14  7:05           ` Christian König
@ 2018-08-14 15:17             ` Andrey Grodzovsky
  2018-08-14 15:26               ` Christian König
  0 siblings, 1 reply; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-14 15:17 UTC (permalink / raw)
  To: christian.koenig, Nayan Deshmukh; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 5544 bytes --]

I assume that this is the only code change and that no locks are taken in
drm_sched_entity_push_job -

What happens if process A runs drm_sched_entity_push_job after this code
was executed from the (dying) process B and there are still jobs in the
queue (the wait_event terminated prematurely)? The entity was already
removed from the rq, but bool 'first' in drm_sched_entity_push_job will be
false, so the entity will not be reinserted back into the rq entity list
and no wake-up trigger will happen for process A pushing a new job.


Another issue below -

Andrey


On 08/14/2018 03:05 AM, Christian König wrote:
> I would rather like to avoid taking the lock in the hot path.
>
> How about this:
>
>      /* For killed process disable any more IBs enqueue right now */
>     last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>      if ((!last_user || last_user == current->group_leader) &&
>          (current->flags & PF_EXITING) && (current->exit_code == 
> SIGKILL)) {
>         grab_lock();
>          drm_sched_rq_remove_entity(entity->rq, entity);
>         if (READ_ONCE(&entity->last_user) != NULL)

This condition is true because just now process A did
drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, current->group_leader);
and so the line below executed and the entity was reinserted into the rq.
Let's say also that the entity job queue is empty now. For process A bool
'first' will be true, and hence
drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity)
will also take place, causing double insertion of the entity into the rq list.

Andrey

> drm_sched_rq_add_entity(entity->rq, entity);
>         drop_lock();
>     }
>
> Christian.
>
> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>
>> Attached.
>>
>> If the general idea in the patch is OK I can think of a test (and 
>> maybe add to libdrm amdgpu tests) to actually simulate this scenario 
>> with 2 forked
>>
>> concurrent processes working on same entity's job queue when one is 
>> dying while the other keeps pushing to the same queue. For now I only 
>> tested it
>>
>> with normal boot and ruining multiple glxgears concurrently - which 
>> doesn't really test this code path since i think each of them works 
>> on it's own FD.
>>
>> Andrey
>>
>>
>> On 08/10/2018 09:27 AM, Christian König wrote:
>>> Crap, yeah indeed that needs to be protected by some lock.
>>>
>>> Going to prepare a patch for that,
>>> Christian.
>>>
>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>
>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>
>>>>
>>>> But I still  have questions about entity->last_user (didn't notice 
>>>> this before) -
>>>>
>>>> Looks to me there is a race condition with it's current usage, 
>>>> let's say process A was preempted after doing 
>>>> drm_sched_entity_flush->cmpxchg(...)
>>>>
>>>> now process B working on same entity (forked) is inside 
>>>> drm_sched_entity_push_job, he writes his PID to entity->last_user 
>>>> and also
>>>>
>>>> executes drm_sched_rq_add_entity. Now process A runs again and 
>>>> execute drm_sched_rq_remove_entity inadvertently causing process B 
>>>> removal
>>>>
>>>> from it's scheduler rq.
>>>>
>>>> Looks to me like instead we should lock together entity->last_user 
>>>> accesses and adds/removals of entity to the rq.
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>> I forgot about this since we started discussing possible scenarios 
>>>>> of processes and threads.
>>>>>
>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>>
>>>>> Nayan
>>>>>
>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>>>>> <ckoenig.leichtzumerken@gmail.com 
>>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>>
>>>>>     Ping. Any objections to that?
>>>>>
>>>>>     Christian.
>>>>>
>>>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>     > That is superflous now.
>>>>>     >
>>>>>     > Signed-off-by: Christian König <christian.koenig@amd.com
>>>>>     <mailto:christian.koenig@amd.com>>
>>>>>     > ---
>>>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>     >   1 file changed, 5 deletions(-)
>>>>>     >
>>>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>>     drm_sched_job *sched_job,
>>>>>     >       if (first) {
>>>>>     >               /* Add the entity to the run queue */
>>>>>     >               spin_lock(&entity->rq_lock);
>>>>>     > -             if (!entity->rq) {
>>>>>     > -                     DRM_ERROR("Trying to push to a killed
>>>>>     entity\n");
>>>>>     > -  spin_unlock(&entity->rq_lock);
>>>>>     > -                     return;
>>>>>     > -             }
>>>>>     >  drm_sched_rq_add_entity(entity->rq, entity);
>>>>>     >  spin_unlock(&entity->rq_lock);
>>>>>     >  drm_sched_wakeup(entity->rq->sched);
>>>>>
>>>>
>>>
>>
>>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>


[-- Attachment #1.2: Type: text/html, Size: 9844 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-14 15:17             ` Andrey Grodzovsky
@ 2018-08-14 15:26               ` Christian König
  2018-08-14 16:32                 ` Andrey Grodzovsky
  0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-14 15:26 UTC (permalink / raw)
  To: Andrey Grodzovsky, Nayan Deshmukh; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 6449 bytes --]

Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
>
> I assume that this is the only code change and no locks are taken in 
> drm_sched_entity_push_job -
>

What are you talking about? You surely now take locks in
drm_sched_entity_push_job():
> +    spin_lock(&entity->rq_lock);
> +    entity->last_user = current->group_leader;
> +    if (list_empty(&entity->list))

> What happens if process A runs drm_sched_entity_push_job after this 
> code was executed from the  (dying) process B and there
>
> are still jobs in the queue (the wait_event terminated prematurely), 
> the entity already removed from rq , but bool 'first' in 
> drm_sched_entity_push_job
>
> will return false and so the entity will not be reinserted back into 
> rq entity list and no wake up trigger will happen for process A 
> pushing a new job.
>

Thought about this as well, but in this case I would say: Shit happens!

The dying process did some command submission and because of this the 
entity was killed as well when the process died and that is legitimate.

>
> Another issue bellow -
>
> Andrey
>
>
> On 08/14/2018 03:05 AM, Christian König wrote:
>> I would rather like to avoid taking the lock in the hot path.
>>
>> How about this:
>>
>>      /* For killed process disable any more IBs enqueue right now */
>>     last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>>      if ((!last_user || last_user == current->group_leader) &&
>>          (current->flags & PF_EXITING) && (current->exit_code == 
>> SIGKILL)) {
>>         grab_lock();
>>          drm_sched_rq_remove_entity(entity->rq, entity);
>>         if (READ_ONCE(&entity->last_user) != NULL)
>
> This condition is true because just exactly now process A did 
> drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, 
> current->group_leader);
> and so the line bellow executed and entity reinserted into rq. Let's 
> say also that the entity job queue is empty now. For process A bool 
> 'first' will be true
> and hence also 
> drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity) 
> will take place causing double insertion of the entity queue into rq list.

Calling drm_sched_rq_add_entity() is harmless; it is protected against
double insertion.
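
That protection is the early bail-out on an entity that is already linked
into an rq; the helper in the gpu_scheduler.c of this period looks roughly
like the following (a paraphrase, not a verbatim quote from this thread):

	void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
				     struct drm_sched_entity *entity)
	{
		if (!list_empty(&entity->list))
			return;		/* already on a run queue, nothing to do */

		spin_lock(&rq->lock);
		list_add_tail(&entity->list, &rq->entities);
		spin_unlock(&rq->lock);
	}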

But thinking more about it, your idea of adding a killed or finished flag
becomes more and more appealing for consistent handling here.

Christian.

>
> Andrey
>
>> drm_sched_rq_add_entity(entity->rq, entity);
>>         drop_lock();
>>     }
>>
>> Christian.
>>
>> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>>
>>> Attached.
>>>
>>> If the general idea in the patch is OK I can think of a test (and 
>>> maybe add to libdrm amdgpu tests) to actually simulate this scenario 
>>> with 2 forked
>>>
>>> concurrent processes working on same entity's job queue when one is 
>>> dying while the other keeps pushing to the same queue. For now I 
>>> only tested it
>>>
>>> with normal boot and ruining multiple glxgears concurrently - which 
>>> doesn't really test this code path since i think each of them works 
>>> on it's own FD.
>>>
>>> Andrey
>>>
>>>
>>> On 08/10/2018 09:27 AM, Christian König wrote:
>>>> Crap, yeah indeed that needs to be protected by some lock.
>>>>
>>>> Going to prepare a patch for that,
>>>> Christian.
>>>>
>>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>>
>>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>
>>>>>
>>>>> But I still  have questions about entity->last_user (didn't notice 
>>>>> this before) -
>>>>>
>>>>> Looks to me there is a race condition with it's current usage, 
>>>>> let's say process A was preempted after doing 
>>>>> drm_sched_entity_flush->cmpxchg(...)
>>>>>
>>>>> now process B working on same entity (forked) is inside 
>>>>> drm_sched_entity_push_job, he writes his PID to entity->last_user 
>>>>> and also
>>>>>
>>>>> executes drm_sched_rq_add_entity. Now process A runs again and 
>>>>> execute drm_sched_rq_remove_entity inadvertently causing process B 
>>>>> removal
>>>>>
>>>>> from it's scheduler rq.
>>>>>
>>>>> Looks to me like instead we should lock together entity->last_user 
>>>>> accesses and adds/removals of entity to the rq.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>>> I forgot about this since we started discussing possible 
>>>>>> scenarios of processes and threads.
>>>>>>
>>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>>>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>>>
>>>>>> Nayan
>>>>>>
>>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>>>>>> <ckoenig.leichtzumerken@gmail.com 
>>>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>>>
>>>>>>     Ping. Any objections to that?
>>>>>>
>>>>>>     Christian.
>>>>>>
>>>>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>>     > That is superflous now.
>>>>>>     >
>>>>>>     > Signed-off-by: Christian König <christian.koenig@amd.com
>>>>>>     <mailto:christian.koenig@amd.com>>
>>>>>>     > ---
>>>>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>>     >   1 file changed, 5 deletions(-)
>>>>>>     >
>>>>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>>>     drm_sched_job *sched_job,
>>>>>>     >       if (first) {
>>>>>>     >               /* Add the entity to the run queue */
>>>>>>     >  spin_lock(&entity->rq_lock);
>>>>>>     > -             if (!entity->rq) {
>>>>>>     > -                     DRM_ERROR("Trying to push to a killed
>>>>>>     entity\n");
>>>>>>     > -  spin_unlock(&entity->rq_lock);
>>>>>>     > -                     return;
>>>>>>     > -             }
>>>>>>     >  drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>     >  spin_unlock(&entity->rq_lock);
>>>>>>     >  drm_sched_wakeup(entity->rq->sched);
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>


[-- Attachment #1.2: Type: text/html, Size: 11753 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
  2018-08-14 15:26               ` Christian König
@ 2018-08-14 16:32                 ` Andrey Grodzovsky
  0 siblings, 0 replies; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-14 16:32 UTC (permalink / raw)
  To: Christian König, Nayan Deshmukh; +Cc: Maling list - DRI developers


[-- Attachment #1.1: Type: text/plain, Size: 7157 bytes --]



On 08/14/2018 11:26 AM, Christian König wrote:
> Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
>>
>> I assume that this is the only code change and no locks are taken in 
>> drm_sched_entity_push_job -
>>
>
> What are you talking about? You surely now take looks in 
> drm_sched_entity_push_job():
>> +    spin_lock(&entity->rq_lock);
>> +    entity->last_user = current->group_leader;
>> +    if (list_empty(&entity->list))

Oh, so your code in drm_sched_entity_flush still relies on my code in 
drm_sched_entity_push_job, OK.

>
>> What happens if process A runs drm_sched_entity_push_job after this 
>> code was executed from the  (dying) process B and there
>>
>> are still jobs in the queue (the wait_event terminated prematurely), 
>> the entity already removed from rq , but bool 'first' in 
>> drm_sched_entity_push_job
>>
>> will return false and so the entity will not be reinserted back into 
>> rq entity list and no wake up trigger will happen for process A 
>> pushing a new job.
>>
>
> Thought about this as well, but in this case I would say: Shit happens!
>
> The dying process did some command submission and because of this the 
> entity was killed as well when the process died and that is legitimate.
>
>>
>> Another issue bellow -
>>
>> Andrey
>>
>>
>> On 08/14/2018 03:05 AM, Christian König wrote:
>>> I would rather like to avoid taking the lock in the hot path.
>>>
>>> How about this:
>>>
>>>      /* For killed process disable any more IBs enqueue right now */
>>>     last_user = cmpxchg(&entity->last_user, current->group_leader, 
>>> NULL);
>>>      if ((!last_user || last_user == current->group_leader) &&
>>>          (current->flags & PF_EXITING) && (current->exit_code == 
>>> SIGKILL)) {
>>>         grab_lock();
>>>          drm_sched_rq_remove_entity(entity->rq, entity);
>>>         if (READ_ONCE(&entity->last_user) != NULL)
>>
>> This condition is true because just exactly now process A did 
>> drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, 
>> current->group_leader);
>> and so the line bellow executed and entity reinserted into rq. Let's 
>> say also that the entity job queue is empty now. For process A bool 
>> 'first' will be true
>> and hence also 
>> drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, 
>> entity) will take place causing double insertion of the entity queue 
>> into rq list.
>
> Calling drm_sched_rq_add_entity() is harmless, it is protected against 
> double insertion.

Missed that one, right...
>
> But thinking more about it your idea of adding a killed or finished 
> flag becomes more and more appealing to have a consistent handling here.
>
> Christian.

So to be clear - you would like something like this:

Remove entity->last_user and add a 'stopped' flag to drm_sched_entity, set
it in drm_sched_entity_flush, and in drm_sched_entity_push_job check
'if (entity->stopped)' and, when true, just return some error back to the
user instead of pushing the job?

Andrey
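
A minimal sketch of what such a 'stopped' flag could look like (field
placement, locking and the error path are assumptions for illustration,
not taken from an actual patch):

	/* in struct drm_sched_entity */
	bool				stopped;	/* entity may not accept new jobs */

	/* drm_sched_entity_flush(): for a killed process, stop the entity for good */
	if ((current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
		spin_lock(&entity->rq_lock);
		entity->stopped = true;
		drm_sched_rq_remove_entity(entity->rq, entity);
		spin_unlock(&entity->rq_lock);
	}

	/* drm_sched_entity_push_job(): refuse work for a stopped entity */
	if (entity->stopped) {
		DRM_ERROR("Trying to push to a killed entity\n");
		return;		/* or propagate an error code back to the caller */
	}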

>
>>
>> Andrey
>>
>>> drm_sched_rq_add_entity(entity->rq, entity);
>>>         drop_lock();
>>>     }
>>>
>>> Christian.
>>>
>>> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>>>
>>>> Attached.
>>>>
>>>> If the general idea in the patch is OK I can think of a test (and 
>>>> maybe add to libdrm amdgpu tests) to actually simulate this 
>>>> scenario with 2 forked
>>>>
>>>> concurrent processes working on same entity's job queue when one is 
>>>> dying while the other keeps pushing to the same queue. For now I 
>>>> only tested it
>>>>
>>>> with normal boot and ruining multiple glxgears concurrently - which 
>>>> doesn't really test this code path since i think each of them works 
>>>> on it's own FD.
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 08/10/2018 09:27 AM, Christian König wrote:
>>>>> Crap, yeah indeed that needs to be protected by some lock.
>>>>>
>>>>> Going to prepare a patch for that,
>>>>> Christian.
>>>>>
>>>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>
>>>>>>
>>>>>> But I still  have questions about entity->last_user (didn't 
>>>>>> notice this before) -
>>>>>>
>>>>>> Looks to me there is a race condition with it's current usage, 
>>>>>> let's say process A was preempted after doing 
>>>>>> drm_sched_entity_flush->cmpxchg(...)
>>>>>>
>>>>>> now process B working on same entity (forked) is inside 
>>>>>> drm_sched_entity_push_job, he writes his PID to entity->last_user 
>>>>>> and also
>>>>>>
>>>>>> executes drm_sched_rq_add_entity. Now process A runs again and 
>>>>>> execute drm_sched_rq_remove_entity inadvertently causing process 
>>>>>> B removal
>>>>>>
>>>>>> from it's scheduler rq.
>>>>>>
>>>>>> Looks to me like instead we should lock together 
>>>>>> entity->last_user accesses and adds/removals of entity to the rq.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>>>> I forgot about this since we started discussing possible 
>>>>>>> scenarios of processes and threads.
>>>>>>>
>>>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>>>>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>>>>
>>>>>>> Nayan
>>>>>>>
>>>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>>>>>>> <ckoenig.leichtzumerken@gmail.com 
>>>>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>>>>
>>>>>>>     Ping. Any objections to that?
>>>>>>>
>>>>>>>     Christian.
>>>>>>>
>>>>>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>>>     > That is superflous now.
>>>>>>>     >
>>>>>>>     > Signed-off-by: Christian König <christian.koenig@amd.com
>>>>>>>     <mailto:christian.koenig@amd.com>>
>>>>>>>     > ---
>>>>>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>>>     >   1 file changed, 5 deletions(-)
>>>>>>>     >
>>>>>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>>>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>>>>     drm_sched_job *sched_job,
>>>>>>>     >       if (first) {
>>>>>>>     >               /* Add the entity to the run queue */
>>>>>>>     >  spin_lock(&entity->rq_lock);
>>>>>>>     > -             if (!entity->rq) {
>>>>>>>     > -                     DRM_ERROR("Trying to push to a
>>>>>>>     killed entity\n");
>>>>>>>     > -  spin_unlock(&entity->rq_lock);
>>>>>>>     > -                     return;
>>>>>>>     > -             }
>>>>>>>     >  drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>>     >  spin_unlock(&entity->rq_lock);
>>>>>>>     >  drm_sched_wakeup(entity->rq->sched);
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>>
>


[-- Attachment #1.2: Type: text/html, Size: 13531 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-08-14 16:32 UTC | newest]

Thread overview: 11+ messages
2018-08-03 11:08 [PATCH] drm/scheduler: Remove entity->rq NULL check Christian König
2018-08-06 14:13 ` Christian König
2018-08-06 14:18   ` Nayan Deshmukh
2018-08-09 19:49     ` Andrey Grodzovsky
2018-08-10 13:27       ` Christian König
2018-08-10 13:44         ` Andrey Grodzovsky
2018-08-13 16:43         ` Andrey Grodzovsky
2018-08-14  7:05           ` Christian König
2018-08-14 15:17             ` Andrey Grodzovsky
2018-08-14 15:26               ` Christian König
2018-08-14 16:32                 ` Andrey Grodzovsky
