* [PATCH] drm/scheduler: Remove entity->rq NULL check
@ 2018-08-03 11:08 Christian König
2018-08-06 14:13 ` Christian König
0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-03 11:08 UTC (permalink / raw)
To: dri-devel
That is superfluous now.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 85908c7f913e..65078dd3c82c 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	if (first) {
 		/* Add the entity to the run queue */
 		spin_lock(&entity->rq_lock);
-		if (!entity->rq) {
-			DRM_ERROR("Trying to push to a killed entity\n");
-			spin_unlock(&entity->rq_lock);
-			return;
-		}
 		drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 		drm_sched_wakeup(entity->rq->sched);
--
2.14.1
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-03 11:08 [PATCH] drm/scheduler: Remove entity->rq NULL check Christian König
@ 2018-08-06 14:13 ` Christian König
2018-08-06 14:18 ` Nayan Deshmukh
0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-06 14:13 UTC (permalink / raw)
To: dri-devel, Nayan Deshmukh, Grodzovsky, Andrey
Ping. Any objections to that?
Christian.
On 03.08.2018 at 13:08, Christian König wrote:
> That is superflous now.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
> drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 85908c7f913e..65078dd3c82c 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
> if (first) {
> /* Add the entity to the run queue */
> spin_lock(&entity->rq_lock);
> - if (!entity->rq) {
> - DRM_ERROR("Trying to push to a killed entity\n");
> - spin_unlock(&entity->rq_lock);
> - return;
> - }
> drm_sched_rq_add_entity(entity->rq, entity);
> spin_unlock(&entity->rq_lock);
> drm_sched_wakeup(entity->rq->sched);
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-06 14:13 ` Christian König
@ 2018-08-06 14:18 ` Nayan Deshmukh
2018-08-09 19:49 ` Andrey Grodzovsky
0 siblings, 1 reply; 11+ messages in thread
From: Nayan Deshmukh @ 2018-08-06 14:18 UTC (permalink / raw)
To: ckoenig.leichtzumerken; +Cc: Maling list - DRI developers
I forgot about this since we started discussing possible scenarios of
processes and threads.
In any case, this check is redundant. Acked-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>
Nayan
On Mon, Aug 6, 2018 at 7:43 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
> Ping. Any objections to that?
>
> Christian.
>
> Am 03.08.2018 um 13:08 schrieb Christian König:
> > That is superflous now.
> >
> > Signed-off-by: Christian König <christian.koenig@amd.com>
> > ---
> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
> > 1 file changed, 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > index 85908c7f913e..65078dd3c82c 100644
> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job
> *sched_job,
> > if (first) {
> > /* Add the entity to the run queue */
> > spin_lock(&entity->rq_lock);
> > - if (!entity->rq) {
> > - DRM_ERROR("Trying to push to a killed entity\n");
> > - spin_unlock(&entity->rq_lock);
> > - return;
> > - }
> > drm_sched_rq_add_entity(entity->rq, entity);
> > spin_unlock(&entity->rq_lock);
> > drm_sched_wakeup(entity->rq->sched);
>
>
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-06 14:18 ` Nayan Deshmukh
@ 2018-08-09 19:49 ` Andrey Grodzovsky
2018-08-10 13:27 ` Christian König
0 siblings, 1 reply; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-09 19:49 UTC (permalink / raw)
To: Nayan Deshmukh, ckoenig.leichtzumerken; +Cc: Maling list - DRI developers
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
But I still have questions about entity->last_user (I didn't notice this
before):

Looks to me like there is a race condition with its current usage. Let's say
process A was preempted after doing drm_sched_entity_flush->cmpxchg(...);
now process B, working on the same (forked) entity, is inside
drm_sched_entity_push_job: it writes its PID to entity->last_user and also
executes drm_sched_rq_add_entity. Now process A runs again and executes
drm_sched_rq_remove_entity, inadvertently causing process B's removal
from its scheduler rq.

Looks to me like instead we should lock entity->last_user accesses together
with adds/removals of the entity to the rq.
Andrey
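For reference, the two code paths involved look roughly like this (abridged
from the scheduler code quoted elsewhere in this thread; the exact
surrounding context is approximate):

        /* drm_sched_entity_flush(), run by the dying process A (abridged) */
        last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
        if ((!last_user || last_user == current->group_leader) &&
            (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
                drm_sched_rq_remove_entity(entity->rq, entity); /* can undo B's add below */

        /* drm_sched_entity_push_job(), run concurrently by forked process B (abridged) */
        WRITE_ONCE(entity->last_user, current->group_leader);
        first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
        if (first) {
                spin_lock(&entity->rq_lock);
                drm_sched_rq_add_entity(entity->rq, entity); /* undone again if A resumes last */
                spin_unlock(&entity->rq_lock);
                drm_sched_wakeup(entity->rq->sched);
        }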
On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
> I forgot about this since we started discussing possible scenarios of
> processes and threads.
>
> In any case, this check is redundant. Acked-by: Nayan Deshmukh
> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>
> Nayan
>
> On Mon, Aug 6, 2018 at 7:43 PM Christian König
> <ckoenig.leichtzumerken@gmail.com
> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>
> Ping. Any objections to that?
>
> Christian.
>
> Am 03.08.2018 um 13:08 schrieb Christian König:
> > That is superflous now.
> >
> > Signed-off-by: Christian König <christian.koenig@amd.com
> <mailto:christian.koenig@amd.com>>
> > ---
> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
> > 1 file changed, 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > index 85908c7f913e..65078dd3c82c 100644
> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
> drm_sched_job *sched_job,
> > if (first) {
> > /* Add the entity to the run queue */
> > spin_lock(&entity->rq_lock);
> > - if (!entity->rq) {
> > - DRM_ERROR("Trying to push to a killed
> entity\n");
> > - spin_unlock(&entity->rq_lock);
> > - return;
> > - }
> > drm_sched_rq_add_entity(entity->rq, entity);
> > spin_unlock(&entity->rq_lock);
> > drm_sched_wakeup(entity->rq->sched);
>
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-09 19:49 ` Andrey Grodzovsky
@ 2018-08-10 13:27 ` Christian König
2018-08-10 13:44 ` Andrey Grodzovsky
2018-08-13 16:43 ` Andrey Grodzovsky
0 siblings, 2 replies; 11+ messages in thread
From: Christian König @ 2018-08-10 13:27 UTC (permalink / raw)
To: Andrey Grodzovsky, Nayan Deshmukh; +Cc: Maling list - DRI developers
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
On 09.08.2018 at 21:49, Andrey Grodzovsky wrote:
>
> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
>
> But I still have questions about entity->last_user (didn't notice
> this before) -
>
> Looks to me there is a race condition with it's current usage, let's
> say process A was preempted after doing
> drm_sched_entity_flush->cmpxchg(...)
>
> now process B working on same entity (forked) is inside
> drm_sched_entity_push_job, he writes his PID to entity->last_user and also
>
> executes drm_sched_rq_add_entity. Now process A runs again and execute
> drm_sched_rq_remove_entity inadvertently causing process B removal
>
> from it's scheduler rq.
>
> Looks to me like instead we should lock together entity->last_user
> accesses and adds/removals of entity to the rq.
>
> Andrey
>
>
> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>> I forgot about this since we started discussing possible scenarios of
>> processes and threads.
>>
>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>
>> Nayan
>>
>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>> <ckoenig.leichtzumerken@gmail.com
>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>
>> Ping. Any objections to that?
>>
>> Christian.
>>
>> Am 03.08.2018 um 13:08 schrieb Christian König:
>> > That is superflous now.
>> >
>> > Signed-off-by: Christian König <christian.koenig@amd.com
>> <mailto:christian.koenig@amd.com>>
>> > ---
>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>> > 1 file changed, 5 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>> > index 85908c7f913e..65078dd3c82c 100644
>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>> drm_sched_job *sched_job,
>> > if (first) {
>> > /* Add the entity to the run queue */
>> > spin_lock(&entity->rq_lock);
>> > - if (!entity->rq) {
>> > - DRM_ERROR("Trying to push to a killed
>> entity\n");
>> > - spin_unlock(&entity->rq_lock);
>> > - return;
>> > - }
>> > drm_sched_rq_add_entity(entity->rq, entity);
>> > spin_unlock(&entity->rq_lock);
>> > drm_sched_wakeup(entity->rq->sched);
>>
>
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-10 13:27 ` Christian König
@ 2018-08-10 13:44 ` Andrey Grodzovsky
2018-08-13 16:43 ` Andrey Grodzovsky
1 sibling, 0 replies; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-10 13:44 UTC (permalink / raw)
To: christian.koenig, Nayan Deshmukh; +Cc: Maling list - DRI developers
I can take care of this.
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
> Crap, yeah indeed that needs to be protected by some lock.
>
> Going to prepare a patch for that,
> Christian.
>
> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>
>>
>> But I still have questions about entity->last_user (didn't notice
>> this before) -
>>
>> Looks to me there is a race condition with it's current usage, let's
>> say process A was preempted after doing
>> drm_sched_entity_flush->cmpxchg(...)
>>
>> now process B working on same entity (forked) is inside
>> drm_sched_entity_push_job, he writes his PID to entity->last_user and
>> also
>>
>> executes drm_sched_rq_add_entity. Now process A runs again and
>> execute drm_sched_rq_remove_entity inadvertently causing process B
>> removal
>>
>> from it's scheduler rq.
>>
>> Looks to me like instead we should lock together entity->last_user
>> accesses and adds/removals of entity to the rq.
>>
>> Andrey
>>
>>
>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>> I forgot about this since we started discussing possible scenarios
>>> of processes and threads.
>>>
>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>
>>> Nayan
>>>
>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com
>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>
>>> Ping. Any objections to that?
>>>
>>> Christian.
>>>
>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>> > That is superflous now.
>>> >
>>> > Signed-off-by: Christian König <christian.koenig@amd.com
>>> <mailto:christian.koenig@amd.com>>
>>> > ---
>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>> > 1 file changed, 5 deletions(-)
>>> >
>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> > index 85908c7f913e..65078dd3c82c 100644
>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>> drm_sched_job *sched_job,
>>> > if (first) {
>>> > /* Add the entity to the run queue */
>>> > spin_lock(&entity->rq_lock);
>>> > - if (!entity->rq) {
>>> > - DRM_ERROR("Trying to push to a killed
>>> entity\n");
>>> > - spin_unlock(&entity->rq_lock);
>>> > - return;
>>> > - }
>>> > drm_sched_rq_add_entity(entity->rq, entity);
>>> > spin_unlock(&entity->rq_lock);
>>> > drm_sched_wakeup(entity->rq->sched);
>>>
>>
>
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-10 13:27 ` Christian König
2018-08-10 13:44 ` Andrey Grodzovsky
@ 2018-08-13 16:43 ` Andrey Grodzovsky
2018-08-14 7:05 ` Christian König
1 sibling, 1 reply; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-13 16:43 UTC (permalink / raw)
To: christian.koenig, Nayan Deshmukh; +Cc: Maling list - DRI developers
Attached.
If the general idea in the patch is OK I can think of a test (and maybe
add it to the libdrm amdgpu tests) to actually simulate this scenario with
two forked concurrent processes working on the same entity's job queue,
where one is dying while the other keeps pushing to the same queue. For now
I only tested it with a normal boot and running multiple glxgears
concurrently, which doesn't really test this code path, since I think each
of them works on its own FD.
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
> Crap, yeah indeed that needs to be protected by some lock.
>
> Going to prepare a patch for that,
> Christian.
>
> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>
>>
>> But I still have questions about entity->last_user (didn't notice
>> this before) -
>>
>> Looks to me there is a race condition with it's current usage, let's
>> say process A was preempted after doing
>> drm_sched_entity_flush->cmpxchg(...)
>>
>> now process B working on same entity (forked) is inside
>> drm_sched_entity_push_job, he writes his PID to entity->last_user and
>> also
>>
>> executes drm_sched_rq_add_entity. Now process A runs again and
>> execute drm_sched_rq_remove_entity inadvertently causing process B
>> removal
>>
>> from it's scheduler rq.
>>
>> Looks to me like instead we should lock together entity->last_user
>> accesses and adds/removals of entity to the rq.
>>
>> Andrey
>>
>>
>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>> I forgot about this since we started discussing possible scenarios
>>> of processes and threads.
>>>
>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>
>>> Nayan
>>>
>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com
>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>
>>> Ping. Any objections to that?
>>>
>>> Christian.
>>>
>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>> > That is superflous now.
>>> >
>>> > Signed-off-by: Christian König <christian.koenig@amd.com
>>> <mailto:christian.koenig@amd.com>>
>>> > ---
>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>> > 1 file changed, 5 deletions(-)
>>> >
>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> > index 85908c7f913e..65078dd3c82c 100644
>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>> drm_sched_job *sched_job,
>>> > if (first) {
>>> > /* Add the entity to the run queue */
>>> > spin_lock(&entity->rq_lock);
>>> > - if (!entity->rq) {
>>> > - DRM_ERROR("Trying to push to a killed
>>> entity\n");
>>> > - spin_unlock(&entity->rq_lock);
>>> > - return;
>>> > - }
>>> > drm_sched_rq_add_entity(entity->rq, entity);
>>> > spin_unlock(&entity->rq_lock);
>>> > drm_sched_wakeup(entity->rq->sched);
>>>
>>
>
[-- Attachment #2: 0001-drm-scheduler-Fix-possible-race-condition.patch --]
[-- Type: text/x-patch, Size: 2701 bytes --]
From c581a410e1b3f82de2bb14746d8484db2162f82c Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Date: Mon, 13 Aug 2018 12:33:29 -0400
Subject: drm/scheduler: Fix possible race condition.
Problem:
The possible race scenario is as follows:
Process A was preempted after doing drm_sched_entity_flush->cmpxchg(...);
now process B, working on the same (forked) entity, is inside
drm_sched_entity_push_job, writes its PID to entity->last_user and also
executes drm_sched_rq_add_entity. Now process A runs again and
executes drm_sched_rq_remove_entity, inadvertently causing process B's removal
from its scheduler rq.

Fix:
Lock entity->last_user accesses together with adds/removals
of the entity to the rq.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index f566405..5649c3d 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -311,10 +311,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 	/* For killed process disable any more IBs enqueue right now */
+	spin_lock(&entity->rq_lock);
 	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
 	if ((!last_user || last_user == current->group_leader) &&
 	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
 		drm_sched_rq_remove_entity(entity->rq, entity);
+	spin_unlock(&entity->rq_lock);
 	return ret;
 }
@@ -596,17 +598,23 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	trace_drm_sched_job(sched_job, entity);
 	atomic_inc(&entity->rq->sched->num_jobs);
-	WRITE_ONCE(entity->last_user, current->group_leader);
 	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
-	/* first job wakes up scheduler */
-	if (first) {
-		/* Add the entity to the run queue */
-		spin_lock(&entity->rq_lock);
+	/*
+	 * entity might not be attached to rq because it's the first time we use it
+	 * or because another process removed the entity from the rq during
+	 * its termination in drm_sched_entity_flush. We then need to add
+	 * back the entity to rq
+	 */
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
 		drm_sched_rq_add_entity(entity->rq, entity);
-		spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->rq_lock);
+
+	/* first job wakes up scheduler */
+	if (first)
 		drm_sched_wakeup(entity->rq->sched);
-	}
 }
EXPORT_SYMBOL(drm_sched_entity_push_job);
--
2.7.4
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-13 16:43 ` Andrey Grodzovsky
@ 2018-08-14 7:05 ` Christian König
2018-08-14 15:17 ` Andrey Grodzovsky
0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-14 7:05 UTC (permalink / raw)
To: Andrey Grodzovsky, christian.koenig, Nayan Deshmukh
Cc: Maling list - DRI developers
I would rather like to avoid taking the lock in the hot path.
How about this:
        /* For killed process disable any more IBs enqueue right now */
        last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
        if ((!last_user || last_user == current->group_leader) &&
            (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
                grab_lock();
                drm_sched_rq_remove_entity(entity->rq, entity);
                if (READ_ONCE(&entity->last_user) != NULL)
                        drm_sched_rq_add_entity(entity->rq, entity);
                drop_lock();
        }
Christian.
On 13.08.2018 at 18:43, Andrey Grodzovsky wrote:
>
> Attached.
>
> If the general idea in the patch is OK I can think of a test (and
> maybe add to libdrm amdgpu tests) to actually simulate this scenario
> with 2 forked
>
> concurrent processes working on same entity's job queue when one is
> dying while the other keeps pushing to the same queue. For now I only
> tested it
>
> with normal boot and ruining multiple glxgears concurrently - which
> doesn't really test this code path since i think each of them works on
> it's own FD.
>
> Andrey
>
>
> On 08/10/2018 09:27 AM, Christian König wrote:
>> Crap, yeah indeed that needs to be protected by some lock.
>>
>> Going to prepare a patch for that,
>> Christian.
>>
>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>
>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>
>>>
>>> But I still have questions about entity->last_user (didn't notice
>>> this before) -
>>>
>>> Looks to me there is a race condition with it's current usage, let's
>>> say process A was preempted after doing
>>> drm_sched_entity_flush->cmpxchg(...)
>>>
>>> now process B working on same entity (forked) is inside
>>> drm_sched_entity_push_job, he writes his PID to entity->last_user
>>> and also
>>>
>>> executes drm_sched_rq_add_entity. Now process A runs again and
>>> execute drm_sched_rq_remove_entity inadvertently causing process B
>>> removal
>>>
>>> from it's scheduler rq.
>>>
>>> Looks to me like instead we should lock together entity->last_user
>>> accesses and adds/removals of entity to the rq.
>>>
>>> Andrey
>>>
>>>
>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>> I forgot about this since we started discussing possible scenarios
>>>> of processes and threads.
>>>>
>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>
>>>> Nayan
>>>>
>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>>> <ckoenig.leichtzumerken@gmail.com
>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>
>>>> Ping. Any objections to that?
>>>>
>>>> Christian.
>>>>
>>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>>> > That is superflous now.
>>>> >
>>>> > Signed-off-by: Christian König <christian.koenig@amd.com
>>>> <mailto:christian.koenig@amd.com>>
>>>> > ---
>>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>> > 1 file changed, 5 deletions(-)
>>>> >
>>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> > index 85908c7f913e..65078dd3c82c 100644
>>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>> drm_sched_job *sched_job,
>>>> > if (first) {
>>>> > /* Add the entity to the run queue */
>>>> > spin_lock(&entity->rq_lock);
>>>> > - if (!entity->rq) {
>>>> > - DRM_ERROR("Trying to push to a killed
>>>> entity\n");
>>>> > - spin_unlock(&entity->rq_lock);
>>>> > - return;
>>>> > - }
>>>> > drm_sched_rq_add_entity(entity->rq, entity);
>>>> > spin_unlock(&entity->rq_lock);
>>>> > drm_sched_wakeup(entity->rq->sched);
>>>>
>>>
>>
>
>
>
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-14 7:05 ` Christian König
@ 2018-08-14 15:17 ` Andrey Grodzovsky
2018-08-14 15:26 ` Christian König
0 siblings, 1 reply; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-14 15:17 UTC (permalink / raw)
To: christian.koenig, Nayan Deshmukh; +Cc: Maling list - DRI developers
I assume that this is the only code change and no locks are taken in
drm_sched_entity_push_job.

What happens if process A runs drm_sched_entity_push_job after this code
was executed from the (dying) process B and there are still jobs in the
queue (the wait_event terminated prematurely)? The entity has already been
removed from the rq, but bool 'first' in drm_sched_entity_push_job will
return false, so the entity will not be reinserted back into the rq entity
list and no wake-up will be triggered for process A pushing a new job.

Another issue below -
Andrey
On 08/14/2018 03:05 AM, Christian König wrote:
> I would rather like to avoid taking the lock in the hot path.
>
> How about this:
>
> /* For killed process disable any more IBs enqueue right now */
> last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
> if ((!last_user || last_user == current->group_leader) &&
> (current->flags & PF_EXITING) && (current->exit_code ==
> SIGKILL)) {
> grab_lock();
> drm_sched_rq_remove_entity(entity->rq, entity);
> if (READ_ONCE(&entity->last_user) != NULL)
This condition is true because just now process A did
drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, current->group_leader);
and so the line below executed and the entity was reinserted into the rq. Let's
say also that the entity's job queue is empty now. For process A bool 'first'
will be true, and hence
drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity)
will also take place, causing double insertion of the entity into the rq list.
Andrey
> drm_sched_rq_add_entity(entity->rq, entity);
> drop_lock();
> }
>
> Christian.
>
> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>
>> Attached.
>>
>> If the general idea in the patch is OK I can think of a test (and
>> maybe add to libdrm amdgpu tests) to actually simulate this scenario
>> with 2 forked
>>
>> concurrent processes working on same entity's job queue when one is
>> dying while the other keeps pushing to the same queue. For now I only
>> tested it
>>
>> with normal boot and ruining multiple glxgears concurrently - which
>> doesn't really test this code path since i think each of them works
>> on it's own FD.
>>
>> Andrey
>>
>>
>> On 08/10/2018 09:27 AM, Christian König wrote:
>>> Crap, yeah indeed that needs to be protected by some lock.
>>>
>>> Going to prepare a patch for that,
>>> Christian.
>>>
>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>
>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>
>>>>
>>>> But I still have questions about entity->last_user (didn't notice
>>>> this before) -
>>>>
>>>> Looks to me there is a race condition with it's current usage,
>>>> let's say process A was preempted after doing
>>>> drm_sched_entity_flush->cmpxchg(...)
>>>>
>>>> now process B working on same entity (forked) is inside
>>>> drm_sched_entity_push_job, he writes his PID to entity->last_user
>>>> and also
>>>>
>>>> executes drm_sched_rq_add_entity. Now process A runs again and
>>>> execute drm_sched_rq_remove_entity inadvertently causing process B
>>>> removal
>>>>
>>>> from it's scheduler rq.
>>>>
>>>> Looks to me like instead we should lock together entity->last_user
>>>> accesses and adds/removals of entity to the rq.
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>> I forgot about this since we started discussing possible scenarios
>>>>> of processes and threads.
>>>>>
>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>>
>>>>> Nayan
>>>>>
>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>>>> <ckoenig.leichtzumerken@gmail.com
>>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>>
>>>>> Ping. Any objections to that?
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>> > That is superflous now.
>>>>> >
>>>>> > Signed-off-by: Christian König <christian.koenig@amd.com
>>>>> <mailto:christian.koenig@amd.com>>
>>>>> > ---
>>>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>> > 1 file changed, 5 deletions(-)
>>>>> >
>>>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> > index 85908c7f913e..65078dd3c82c 100644
>>>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>> drm_sched_job *sched_job,
>>>>> > if (first) {
>>>>> > /* Add the entity to the run queue */
>>>>> > spin_lock(&entity->rq_lock);
>>>>> > - if (!entity->rq) {
>>>>> > - DRM_ERROR("Trying to push to a killed
>>>>> entity\n");
>>>>> > - spin_unlock(&entity->rq_lock);
>>>>> > - return;
>>>>> > - }
>>>>> > drm_sched_rq_add_entity(entity->rq, entity);
>>>>> > spin_unlock(&entity->rq_lock);
>>>>> > drm_sched_wakeup(entity->rq->sched);
>>>>>
>>>>
>>>
>>
>>
>>
>
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-14 15:17 ` Andrey Grodzovsky
@ 2018-08-14 15:26 ` Christian König
2018-08-14 16:32 ` Andrey Grodzovsky
0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2018-08-14 15:26 UTC (permalink / raw)
To: Andrey Grodzovsky, Nayan Deshmukh; +Cc: Maling list - DRI developers
On 14.08.2018 at 17:17, Andrey Grodzovsky wrote:
>
> I assume that this is the only code change and no locks are taken in
> drm_sched_entity_push_job -
>
What are you talking about? You surely now take locks in
drm_sched_entity_push_job():
> + spin_lock(&entity->rq_lock);
> + entity->last_user = current->group_leader;
> + if (list_empty(&entity->list))
> What happens if process A runs drm_sched_entity_push_job after this
> code was executed from the (dying) process B and there
>
> are still jobs in the queue (the wait_event terminated prematurely),
> the entity already removed from rq , but bool 'first' in
> drm_sched_entity_push_job
>
> will return false and so the entity will not be reinserted back into
> rq entity list and no wake up trigger will happen for process A
> pushing a new job.
>
Thought about this as well, but in this case I would say: Shit happens!
The dying process did some command submission and because of this the
entity was killed as well when the process died and that is legitimate.
>
> Another issue bellow -
>
> Andrey
>
>
> On 08/14/2018 03:05 AM, Christian König wrote:
>> I would rather like to avoid taking the lock in the hot path.
>>
>> How about this:
>>
>> /* For killed process disable any more IBs enqueue right now */
>> last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>> if ((!last_user || last_user == current->group_leader) &&
>> (current->flags & PF_EXITING) && (current->exit_code ==
>> SIGKILL)) {
>> grab_lock();
>> drm_sched_rq_remove_entity(entity->rq, entity);
>> if (READ_ONCE(&entity->last_user) != NULL)
>
> This condition is true because just exactly now process A did
> drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
> current->group_leader);
> and so the line bellow executed and entity reinserted into rq. Let's
> say also that the entity job queue is empty now. For process A bool
> 'first' will be true
> and hence also
> drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity)
> will take place causing double insertion of the entity queue into rq list.
Calling drm_sched_rq_add_entity() is harmless, it is protected against
double insertion.
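(For reference, a minimal sketch of the guard being referred to, assuming the
gpu_scheduler.c of that era; the exact body is approximate:)

        void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
                                     struct drm_sched_entity *entity)
        {
                /* approximate sketch: the early return makes a second add a no-op */
                if (!list_empty(&entity->list))
                        return;

                spin_lock(&rq->lock);
                list_add_tail(&entity->list, &rq->entities);
                spin_unlock(&rq->lock);
        }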
But thinking more about it, your idea of adding a killed or finished flag
becomes more and more appealing for consistent handling here.
Christian.
>
> Andrey
>
>> drm_sched_rq_add_entity(entity->rq, entity);
>> drop_lock();
>> }
>>
>> Christian.
>>
>> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>>
>>> Attached.
>>>
>>> If the general idea in the patch is OK I can think of a test (and
>>> maybe add to libdrm amdgpu tests) to actually simulate this scenario
>>> with 2 forked
>>>
>>> concurrent processes working on same entity's job queue when one is
>>> dying while the other keeps pushing to the same queue. For now I
>>> only tested it
>>>
>>> with normal boot and ruining multiple glxgears concurrently - which
>>> doesn't really test this code path since i think each of them works
>>> on it's own FD.
>>>
>>> Andrey
>>>
>>>
>>> On 08/10/2018 09:27 AM, Christian König wrote:
>>>> Crap, yeah indeed that needs to be protected by some lock.
>>>>
>>>> Going to prepare a patch for that,
>>>> Christian.
>>>>
>>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>>
>>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>
>>>>>
>>>>> But I still have questions about entity->last_user (didn't notice
>>>>> this before) -
>>>>>
>>>>> Looks to me there is a race condition with it's current usage,
>>>>> let's say process A was preempted after doing
>>>>> drm_sched_entity_flush->cmpxchg(...)
>>>>>
>>>>> now process B working on same entity (forked) is inside
>>>>> drm_sched_entity_push_job, he writes his PID to entity->last_user
>>>>> and also
>>>>>
>>>>> executes drm_sched_rq_add_entity. Now process A runs again and
>>>>> execute drm_sched_rq_remove_entity inadvertently causing process B
>>>>> removal
>>>>>
>>>>> from it's scheduler rq.
>>>>>
>>>>> Looks to me like instead we should lock together entity->last_user
>>>>> accesses and adds/removals of entity to the rq.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>>> I forgot about this since we started discussing possible
>>>>>> scenarios of processes and threads.
>>>>>>
>>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>>>
>>>>>> Nayan
>>>>>>
>>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>>>>> <ckoenig.leichtzumerken@gmail.com
>>>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>>>
>>>>>> Ping. Any objections to that?
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>> > That is superflous now.
>>>>>> >
>>>>>> > Signed-off-by: Christian König <christian.koenig@amd.com
>>>>>> <mailto:christian.koenig@amd.com>>
>>>>>> > ---
>>>>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>> > 1 file changed, 5 deletions(-)
>>>>>> >
>>>>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> > index 85908c7f913e..65078dd3c82c 100644
>>>>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>>> drm_sched_job *sched_job,
>>>>>> > if (first) {
>>>>>> > /* Add the entity to the run queue */
>>>>>> > spin_lock(&entity->rq_lock);
>>>>>> > - if (!entity->rq) {
>>>>>> > - DRM_ERROR("Trying to push to a killed
>>>>>> entity\n");
>>>>>> > - spin_unlock(&entity->rq_lock);
>>>>>> > - return;
>>>>>> > - }
>>>>>> > drm_sched_rq_add_entity(entity->rq, entity);
>>>>>> > spin_unlock(&entity->rq_lock);
>>>>>> > drm_sched_wakeup(entity->rq->sched);
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
* Re: [PATCH] drm/scheduler: Remove entity->rq NULL check
2018-08-14 15:26 ` Christian König
@ 2018-08-14 16:32 ` Andrey Grodzovsky
0 siblings, 0 replies; 11+ messages in thread
From: Andrey Grodzovsky @ 2018-08-14 16:32 UTC (permalink / raw)
To: Christian König, Nayan Deshmukh; +Cc: Maling list - DRI developers
On 08/14/2018 11:26 AM, Christian König wrote:
> Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
>>
>> I assume that this is the only code change and no locks are taken in
>> drm_sched_entity_push_job -
>>
>
> What are you talking about? You surely now take looks in
> drm_sched_entity_push_job():
>> + spin_lock(&entity->rq_lock);
>> + entity->last_user = current->group_leader;
>> + if (list_empty(&entity->list))
Oh, so your code in drm_sched_entity_flush still relies on my code in
drm_sched_entity_push_job, OK.
>
>> What happens if process A runs drm_sched_entity_push_job after this
>> code was executed from the (dying) process B and there
>>
>> are still jobs in the queue (the wait_event terminated prematurely),
>> the entity already removed from rq , but bool 'first' in
>> drm_sched_entity_push_job
>>
>> will return false and so the entity will not be reinserted back into
>> rq entity list and no wake up trigger will happen for process A
>> pushing a new job.
>>
>
> Thought about this as well, but in this case I would say: Shit happens!
>
> The dying process did some command submission and because of this the
> entity was killed as well when the process died and that is legitimate.
>
>>
>> Another issue bellow -
>>
>> Andrey
>>
>>
>> On 08/14/2018 03:05 AM, Christian König wrote:
>>> I would rather like to avoid taking the lock in the hot path.
>>>
>>> How about this:
>>>
>>> /* For killed process disable any more IBs enqueue right now */
>>> last_user = cmpxchg(&entity->last_user, current->group_leader,
>>> NULL);
>>> if ((!last_user || last_user == current->group_leader) &&
>>> (current->flags & PF_EXITING) && (current->exit_code ==
>>> SIGKILL)) {
>>> grab_lock();
>>> drm_sched_rq_remove_entity(entity->rq, entity);
>>> if (READ_ONCE(&entity->last_user) != NULL)
>>
>> This condition is true because just exactly now process A did
>> drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
>> current->group_leader);
>> and so the line bellow executed and entity reinserted into rq. Let's
>> say also that the entity job queue is empty now. For process A bool
>> 'first' will be true
>> and hence also
>> drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
>> entity) will take place causing double insertion of the entity queue
>> into rq list.
>
> Calling drm_sched_rq_add_entity() is harmless, it is protected against
> double insertion.
Missed that one, right...
>
> But thinking more about it your idea of adding a killed or finished
> flag becomes more and more appealing to have a consistent handling here.
>
> Christian.
So to be clear, you would like something like this: remove entity->last_user
and add a 'stopped' flag to drm_sched_entity that is set in
drm_sched_entity_flush, and in drm_sched_entity_push_job check for
'if (entity->stopped)' and, when true, just return some error back to the
user instead of pushing the job?
Andrey
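A minimal sketch of that idea, for illustration only (the member name
'stopped' and the error handling here are assumptions, not necessarily what
the final patch will look like):

        /* hypothetical: struct drm_sched_entity grows a "stopped" flag, protected by rq_lock */

        /* drm_sched_entity_flush(), killed-process path */
        if ((current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
                spin_lock(&entity->rq_lock);
                entity->stopped = true;
                drm_sched_rq_remove_entity(entity->rq, entity);
                spin_unlock(&entity->rq_lock);
        }

        /* drm_sched_entity_push_job() */
        spin_lock(&entity->rq_lock);
        if (entity->stopped) {
                spin_unlock(&entity->rq_lock);
                DRM_ERROR("Trying to push to a killed entity\n");
                return; /* or hand an error code back to the caller */
        }
        if (list_empty(&entity->list))
                drm_sched_rq_add_entity(entity->rq, entity);
        spin_unlock(&entity->rq_lock);

        first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
        if (first)
                drm_sched_wakeup(entity->rq->sched);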
>
>>
>> Andrey
>>
>>> drm_sched_rq_add_entity(entity->rq, entity);
>>> drop_lock();
>>> }
>>>
>>> Christian.
>>>
>>> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>>>
>>>> Attached.
>>>>
>>>> If the general idea in the patch is OK I can think of a test (and
>>>> maybe add to libdrm amdgpu tests) to actually simulate this
>>>> scenario with 2 forked
>>>>
>>>> concurrent processes working on same entity's job queue when one is
>>>> dying while the other keeps pushing to the same queue. For now I
>>>> only tested it
>>>>
>>>> with normal boot and ruining multiple glxgears concurrently - which
>>>> doesn't really test this code path since i think each of them works
>>>> on it's own FD.
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 08/10/2018 09:27 AM, Christian König wrote:
>>>>> Crap, yeah indeed that needs to be protected by some lock.
>>>>>
>>>>> Going to prepare a patch for that,
>>>>> Christian.
>>>>>
>>>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>
>>>>>>
>>>>>> But I still have questions about entity->last_user (didn't
>>>>>> notice this before) -
>>>>>>
>>>>>> Looks to me there is a race condition with it's current usage,
>>>>>> let's say process A was preempted after doing
>>>>>> drm_sched_entity_flush->cmpxchg(...)
>>>>>>
>>>>>> now process B working on same entity (forked) is inside
>>>>>> drm_sched_entity_push_job, he writes his PID to entity->last_user
>>>>>> and also
>>>>>>
>>>>>> executes drm_sched_rq_add_entity. Now process A runs again and
>>>>>> execute drm_sched_rq_remove_entity inadvertently causing process
>>>>>> B removal
>>>>>>
>>>>>> from it's scheduler rq.
>>>>>>
>>>>>> Looks to me like instead we should lock together
>>>>>> entity->last_user accesses and adds/removals of entity to the rq.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>>>> I forgot about this since we started discussing possible
>>>>>>> scenarios of processes and threads.
>>>>>>>
>>>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>>>>>> <nayan26deshmukh@gmail.com <mailto:nayan26deshmukh@gmail.com>>
>>>>>>>
>>>>>>> Nayan
>>>>>>>
>>>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>>>>>> <ckoenig.leichtzumerken@gmail.com
>>>>>>> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>>>>
>>>>>>> Ping. Any objections to that?
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>>> > That is superflous now.
>>>>>>> >
>>>>>>> > Signed-off-by: Christian König <christian.koenig@amd.com
>>>>>>> <mailto:christian.koenig@amd.com>>
>>>>>>> > ---
>>>>>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>>> > 1 file changed, 5 deletions(-)
>>>>>>> >
>>>>>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > index 85908c7f913e..65078dd3c82c 100644
>>>>>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>>>> drm_sched_job *sched_job,
>>>>>>> > if (first) {
>>>>>>> > /* Add the entity to the run queue */
>>>>>>> > spin_lock(&entity->rq_lock);
>>>>>>> > - if (!entity->rq) {
>>>>>>> > - DRM_ERROR("Trying to push to a
>>>>>>> killed entity\n");
>>>>>>> > - spin_unlock(&entity->rq_lock);
>>>>>>> > - return;
>>>>>>> > - }
>>>>>>> > drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>> > spin_unlock(&entity->rq_lock);
>>>>>>> > drm_sched_wakeup(entity->rq->sched);
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
Thread overview: 11+ messages (newest: 2018-08-14 16:32 UTC)
2018-08-03 11:08 [PATCH] drm/scheduler: Remove entity->rq NULL check Christian König
2018-08-06 14:13 ` Christian König
2018-08-06 14:18 ` Nayan Deshmukh
2018-08-09 19:49 ` Andrey Grodzovsky
2018-08-10 13:27 ` Christian König
2018-08-10 13:44 ` Andrey Grodzovsky
2018-08-13 16:43 ` Andrey Grodzovsky
2018-08-14 7:05 ` Christian König
2018-08-14 15:17 ` Andrey Grodzovsky
2018-08-14 15:26 ` Christian König
2018-08-14 16:32 ` Andrey Grodzovsky