* [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead
2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
@ 2024-10-29 15:29 ` Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 2/3] drm/panthor: Report group as timedout when we fail to properly suspend Boris Brezillon
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 15:29 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: Christopher Healy, dri-devel, kernel
Userspace can use GROUP_SUBMIT errors as a trigger to check the group
state and recreate the group if it became unusable. Make sure we
report an error when the group became unusable.
Changes in v3:
- None
Changes in v2:
- Add R-bs
Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
---
drivers/gpu/drm/panthor/panthor_sched.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index eb9f6635cc12..eda8fbb276b3 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3688,6 +3688,11 @@ panthor_job_create(struct panthor_file *pfile,
goto err_put_job;
}
+ if (!group_can_run(job->group)) {
+ ret = -EINVAL;
+ goto err_put_job;
+ }
+
if (job->queue_idx >= job->group->queue_count ||
!job->group->queues[job->queue_idx]) {
ret = -EINVAL;
--
2.46.2
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH v3 2/3] drm/panthor: Report group as timedout when we fail to properly suspend
2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead Boris Brezillon
@ 2024-10-29 15:29 ` Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 3/3] drm/panthor: Report innocent group kill Boris Brezillon
2024-10-30 15:41 ` [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
3 siblings, 0 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 15:29 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: Christopher Healy, dri-devel, kernel
If we don't do that, the group is considered usable by userspace, but
all further GROUP_SUBMIT will fail with -EINVAL.
Changes in v3:
- Add R-bs
Changes in v2:
- New patch
Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
---
drivers/gpu/drm/panthor/panthor_sched.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index eda8fbb276b3..ef4bec7ff9c7 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -602,10 +602,11 @@ struct panthor_group {
* @timedout: True when a timeout occurred on any of the queues owned by
* this group.
*
- * Timeouts can be reported by drm_sched or by the FW. In any case, any
- * timeout situation is unrecoverable, and the group becomes useless.
- * We simply wait for all references to be dropped so we can release the
- * group object.
+ * Timeouts can be reported by drm_sched or by the FW. If a reset is required,
+ * and the group can't be suspended, this also leads to a timeout. In any case,
+ * any timeout situation is unrecoverable, and the group becomes useless. We
+ * simply wait for all references to be dropped so we can release the group
+ * object.
*/
bool timedout;
@@ -2687,6 +2688,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
csgs_upd_ctx_init(&upd_ctx);
while (slot_mask) {
u32 csg_id = ffs(slot_mask) - 1;
+ struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+
+ /* We consider group suspension failures as fatal and flag the
+ * group as unusable by setting timedout=true.
+ */
+ csg_slot->group->timedout = true;
csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
CSG_STATE_TERMINATE,
--
2.46.2
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH v3 3/3] drm/panthor: Report innocent group kill
2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 2/3] drm/panthor: Report group as timedout when we fail to properly suspend Boris Brezillon
@ 2024-10-29 15:29 ` Boris Brezillon
2024-10-29 16:19 ` Boris Brezillon
2024-10-30 15:41 ` [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
3 siblings, 1 reply; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 15:29 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: Christopher Healy, dri-devel, kernel
Groups can be killed during a reset even though they did nothing wrong.
That usually happens when the FW is put in a bad state by other groups,
resulting in group suspension failures when the reset happens.
If we end up in that situation, flag the group innocent and report
innocence through a new DRM_PANTHOR_GROUP_STATE flag.
Bump the minor driver version to reflect the uAPI change.
Changes in v3:
- Actually report innocence to userspace
Changes in v2:
- New patch
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
include/uapi/drm/panthor_drm.h | 9 +++++++++
3 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index ac7e53f6e3f0..f1dff7e0173d 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
.desc = "Panthor DRM driver",
.date = "20230801",
.major = 1,
- .minor = 2,
+ .minor = 3,
.gem_create_object = panthor_gem_create_object,
.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index ef4bec7ff9c7..97ed5fe5a191 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -610,6 +610,16 @@ struct panthor_group {
*/
bool timedout;
+ /**
+ * @innocent: True when the group becomes unusable because the group suspension
+ * failed during a reset.
+ *
+ * Sometimes the FW was put in a bad state by other groups, causing the group
+ * suspension happening in the reset path to fail. In that case, we consider the
+ * group innocent.
+ */
+ bool innocent;
+
/**
* @syncobjs: Pool of per-queue synchronization objects.
*
@@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
u32 csg_id = ffs(slot_mask) - 1;
struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+ /* If the group was still usable before that point, we consider
+ * it innocent.
+ */
+ if (group_can_run(csg_slot->group))
+ csg_slot->group->innocent = true;
+
/* We consider group suspension failures as fatal and flag the
* group as unusable by setting timedout=true.
*/
@@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
get_state->fatal_queues = group->fatal_queues;
}
+ if (group->innocent)
+ get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
mutex_unlock(&sched->lock);
group_put(group);
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 87c9cb555dd1..b99763cbae48 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
* When a group ends up with this flag set, no jobs can be submitted to its queues.
*/
DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
+
+ /**
+ * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
+ * groups.
+ *
+ * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
+ * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
+ */
+ DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
};
/**
--
2.46.2
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH v3 3/3] drm/panthor: Report innocent group kill
2024-10-29 15:29 ` [PATCH v3 3/3] drm/panthor: Report innocent group kill Boris Brezillon
@ 2024-10-29 16:19 ` Boris Brezillon
2024-10-30 13:18 ` Steven Price
2024-10-30 13:53 ` Liviu Dudau
0 siblings, 2 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 16:19 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: Christopher Healy, dri-devel, kernel
On Tue, 29 Oct 2024 16:29:12 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:
> Groups can be killed during a reset even though they did nothing wrong.
> That usually happens when the FW is put in a bad state by other groups,
> resulting in group suspension failures when the reset happens.
>
> If we end up in that situation, flag the group innocent and report
> innocence through a new DRM_PANTHOR_GROUP_STATE flag.
>
> Bump the minor driver version to reflect the uAPI change.
>
> Changes in v3:
> - Actually report innocence to userspace
>
> Changes in v2:
> - New patch
>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
> drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
> include/uapi/drm/panthor_drm.h | 9 +++++++++
> 3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index ac7e53f6e3f0..f1dff7e0173d 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
Forgot to update the changelog with:
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1493,6 +1493,7 @@ static void panthor_debugfs_init(struct drm_minor
*minor)
* - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
* - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
* - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
+ * - 1.3 - adds DRM_PANTHOR_GROUP_STATE_INNOCENT flag
*/
I'll send a v4 addressing that, but I'll probably queue the first two
patches to drm-misc-fixes first.
> @@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
> .desc = "Panthor DRM driver",
> .date = "20230801",
> .major = 1,
> - .minor = 2,
> + .minor = 3,
>
> .gem_create_object = panthor_gem_create_object,
> .gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index ef4bec7ff9c7..97ed5fe5a191 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -610,6 +610,16 @@ struct panthor_group {
> */
> bool timedout;
>
> + /**
> + * @innocent: True when the group becomes unusable because the group suspension
> + * failed during a reset.
> + *
> + * Sometimes the FW was put in a bad state by other groups, causing the group
> + * suspension happening in the reset path to fail. In that case, we consider the
> + * group innocent.
> + */
> + bool innocent;
> +
> /**
> * @syncobjs: Pool of per-queue synchronization objects.
> *
> @@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> u32 csg_id = ffs(slot_mask) - 1;
> struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
>
> + /* If the group was still usable before that point, we consider
> + * it innocent.
> + */
> + if (group_can_run(csg_slot->group))
> + csg_slot->group->innocent = true;
> +
> /* We consider group suspension failures as fatal and flag the
> * group as unusable by setting timedout=true.
> */
> @@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
> get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
> get_state->fatal_queues = group->fatal_queues;
> }
> + if (group->innocent)
> + get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
> mutex_unlock(&sched->lock);
>
> group_put(group);
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 87c9cb555dd1..b99763cbae48 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
> * When a group ends up with this flag set, no jobs can be submitted to its queues.
> */
> DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> +
> + /**
> + * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
> + * groups.
> + *
> + * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
> + * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
> + */
> + DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
> };
>
> /**
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH v3 3/3] drm/panthor: Report innocent group kill
2024-10-29 16:19 ` Boris Brezillon
@ 2024-10-30 13:18 ` Steven Price
2024-10-30 13:53 ` Liviu Dudau
1 sibling, 0 replies; 8+ messages in thread
From: Steven Price @ 2024-10-30 13:18 UTC (permalink / raw)
To: Boris Brezillon, Liviu Dudau, Adrián Larumbe
Cc: Christopher Healy, dri-devel, kernel
On 29/10/2024 16:19, Boris Brezillon wrote:
> On Tue, 29 Oct 2024 16:29:12 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>
>> Groups can be killed during a reset even though they did nothing wrong.
>> That usually happens when the FW is put in a bad state by other groups,
>> resulting in group suspension failures when the reset happens.
>>
>> If we end up in that situation, flag the group innocent and report
>> innocence through a new DRM_PANTHOR_GROUP_STATE flag.
>>
>> Bump the minor driver version to reflect the uAPI change.
>>
>> Changes in v3:
>> - Actually report innocence to userspace
>>
>> Changes in v2:
>> - New patch
>>
>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
>> ---
>> drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
>> drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
>> include/uapi/drm/panthor_drm.h | 9 +++++++++
>> 3 files changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
>> index ac7e53f6e3f0..f1dff7e0173d 100644
>> --- a/drivers/gpu/drm/panthor/panthor_drv.c
>> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
>
> Forgot to update the changelog with:
>
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1493,6 +1493,7 @@ static void panthor_debugfs_init(struct drm_minor
> *minor)
> * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
> * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
> * - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
> + * - 1.3 - adds DRM_PANTHOR_GROUP_STATE_INNOCENT flag
> */
>
> I'll send a v4 addressing that, but I'll probably queue the first two
> patches to drm-misc-fixes first.
With the changelog update you can add:
Reviewed-by: Steven Price <steven.price@arm.com>
Thanks,
Steve
>> @@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
>> .desc = "Panthor DRM driver",
>> .date = "20230801",
>> .major = 1,
>> - .minor = 2,
>> + .minor = 3,
>>
>> .gem_create_object = panthor_gem_create_object,
>> .gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>> index ef4bec7ff9c7..97ed5fe5a191 100644
>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>> @@ -610,6 +610,16 @@ struct panthor_group {
>> */
>> bool timedout;
>>
>> + /**
>> + * @innocent: True when the group becomes unusable because the group suspension
>> + * failed during a reset.
>> + *
>> + * Sometimes the FW was put in a bad state by other groups, causing the group
>> + * suspension happening in the reset path to fail. In that case, we consider the
>> + * group innocent.
>> + */
>> + bool innocent;
>> +
>> /**
>> * @syncobjs: Pool of per-queue synchronization objects.
>> *
>> @@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
>> u32 csg_id = ffs(slot_mask) - 1;
>> struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
>>
>> + /* If the group was still usable before that point, we consider
>> + * it innocent.
>> + */
>> + if (group_can_run(csg_slot->group))
>> + csg_slot->group->innocent = true;
>> +
>> /* We consider group suspension failures as fatal and flag the
>> * group as unusable by setting timedout=true.
>> */
>> @@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
>> get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
>> get_state->fatal_queues = group->fatal_queues;
>> }
>> + if (group->innocent)
>> + get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
>> mutex_unlock(&sched->lock);
>>
>> group_put(group);
>> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
>> index 87c9cb555dd1..b99763cbae48 100644
>> --- a/include/uapi/drm/panthor_drm.h
>> +++ b/include/uapi/drm/panthor_drm.h
>> @@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
>> * When a group ends up with this flag set, no jobs can be submitted to its queues.
>> */
>> DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
>> +
>> + /**
>> + * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
>> + * groups.
>> + *
>> + * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
>> + * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
>> + */
>> + DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
>> };
>>
>> /**
>
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH v3 3/3] drm/panthor: Report innocent group kill
2024-10-29 16:19 ` Boris Brezillon
2024-10-30 13:18 ` Steven Price
@ 2024-10-30 13:53 ` Liviu Dudau
1 sibling, 0 replies; 8+ messages in thread
From: Liviu Dudau @ 2024-10-30 13:53 UTC (permalink / raw)
To: Boris Brezillon
Cc: Steven Price, Adrián Larumbe, Christopher Healy, dri-devel,
kernel
On Tue, Oct 29, 2024 at 05:19:33PM +0100, Boris Brezillon wrote:
> On Tue, 29 Oct 2024 16:29:12 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>
> > Groups can be killed during a reset even though they did nothing wrong.
> > That usually happens when the FW is put in a bad state by other groups,
> > resulting in group suspension failures when the reset happens.
> >
> > If we end up in that situation, flag the group innocent and report
> > innocence through a new DRM_PANTHOR_GROUP_STATE flag.
> >
> > Bump the minor driver version to reflect the uAPI change.
> >
> > Changes in v3:
> > - Actually report innocence to userspace
> >
> > Changes in v2:
> > - New patch
> >
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> > drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
> > drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
> > include/uapi/drm/panthor_drm.h | 9 +++++++++
> > 3 files changed, 28 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> > index ac7e53f6e3f0..f1dff7e0173d 100644
> > --- a/drivers/gpu/drm/panthor/panthor_drv.c
> > +++ b/drivers/gpu/drm/panthor/panthor_drv.c
>
> Forgot to update the changelog with:
>
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1493,6 +1493,7 @@ static void panthor_debugfs_init(struct drm_minor
> *minor)
> * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
> * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
> * - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
> + * - 1.3 - adds DRM_PANTHOR_GROUP_STATE_INNOCENT flag
> */
>
> I'll send a v4 addressing that, but I'll probably queue the first two
> patches to drm-misc-fixes first.
You can also add my R-b and push the whole series.
Best regards,
Liviu
>
> > @@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
> > .desc = "Panthor DRM driver",
> > .date = "20230801",
> > .major = 1,
> > - .minor = 2,
> > + .minor = 3,
> >
> > .gem_create_object = panthor_gem_create_object,
> > .gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index ef4bec7ff9c7..97ed5fe5a191 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -610,6 +610,16 @@ struct panthor_group {
> > */
> > bool timedout;
> >
> > + /**
> > + * @innocent: True when the group becomes unusable because the group suspension
> > + * failed during a reset.
> > + *
> > + * Sometimes the FW was put in a bad state by other groups, causing the group
> > + * suspension happening in the reset path to fail. In that case, we consider the
> > + * group innocent.
> > + */
> > + bool innocent;
> > +
> > /**
> > * @syncobjs: Pool of per-queue synchronization objects.
> > *
> > @@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> > u32 csg_id = ffs(slot_mask) - 1;
> > struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> >
> > + /* If the group was still usable before that point, we consider
> > + * it innocent.
> > + */
> > + if (group_can_run(csg_slot->group))
> > + csg_slot->group->innocent = true;
> > +
> > /* We consider group suspension failures as fatal and flag the
> > * group as unusable by setting timedout=true.
> > */
> > @@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
> > get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
> > get_state->fatal_queues = group->fatal_queues;
> > }
> > + if (group->innocent)
> > + get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
> > mutex_unlock(&sched->lock);
> >
> > group_put(group);
> > diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> > index 87c9cb555dd1..b99763cbae48 100644
> > --- a/include/uapi/drm/panthor_drm.h
> > +++ b/include/uapi/drm/panthor_drm.h
> > @@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
> > * When a group ends up with this flag set, no jobs can be submitted to its queues.
> > */
> > DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> > +
> > + /**
> > + * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
> > + * groups.
> > + *
> > + * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
> > + * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
> > + */
> > + DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
> > };
> >
> > /**
>
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 0/3] drm/panthor: Fix group state reporting
2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
` (2 preceding siblings ...)
2024-10-29 15:29 ` [PATCH v3 3/3] drm/panthor: Report innocent group kill Boris Brezillon
@ 2024-10-30 15:41 ` Boris Brezillon
3 siblings, 0 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-30 15:41 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: Christopher Healy, dri-devel, kernel
On Tue, 29 Oct 2024 16:29:09 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:
> Hello,
>
> What initially started as a simple fix to avoid queueing jobs to a group
> that can't be scheduled has turned into a series of three patches
> improving the group state reporting. Two of them are actual fixes, the
> last one is an improvement to properly report innocence/guiltiness of
> a group when a group becomes unusable. With this new
> GROUP_STATE_INNOCENT, we can distinguish between
> GL_INNOCENT_CONTEXT_RESET and GL_GUILT_CONTEXT_RESET (see this mesa
> MR for more details [1]).
>
> Regards,
>
> Boris
>
> [1]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31868
>
> Boris Brezillon (3):
> drm/panthor: Fail job creation when the group is dead
> drm/panthor: Report group as timedout when we fail to properly suspend
First two patches are queued to drm-misc-fixes.
> drm/panthor: Report innocent group kill
I'll send a v4 of the third one and wait for a backmerge of Linus' tree
in drm-misc-next before applying.
>
> drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
> drivers/gpu/drm/panthor/panthor_sched.c | 38 ++++++++++++++++++++++---
> include/uapi/drm/panthor_drm.h | 9 ++++++
> 3 files changed, 44 insertions(+), 5 deletions(-)
>
^ permalink raw reply [flat|nested] 8+ messages in thread