All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] drm/panthor: Fix group state reporting
@ 2024-10-29 15:29 Boris Brezillon
  2024-10-29 15:29 ` [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead Boris Brezillon
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 15:29 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: Christopher Healy, dri-devel, kernel

Hello,

What initially started as a simple fix to avoid queueing jobs to a group
that can't be scheduled has turned into a series of three patches
improving the group state reporting. Two of them are actual fixes, the
last one is an improvement to properly report innocence/guiltiness of
a group when a group becomes unusable. With this new
GROUP_STATE_INNOCENT, we can distinguish between
GL_INNOCENT_CONTEXT_RESET and GL_GUILT_CONTEXT_RESET (see this mesa
MR for more details [1]).

Regards,

Boris

[1]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31868

Boris Brezillon (3):
  drm/panthor: Fail job creation when the group is dead
  drm/panthor: Report group as timedout when we fail to properly suspend
  drm/panthor: Report innocent group kill

 drivers/gpu/drm/panthor/panthor_drv.c   |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c | 38 ++++++++++++++++++++++---
 include/uapi/drm/panthor_drm.h          |  9 ++++++
 3 files changed, 44 insertions(+), 5 deletions(-)

-- 
2.46.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead
  2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
@ 2024-10-29 15:29 ` Boris Brezillon
  2024-10-29 15:29 ` [PATCH v3 2/3] drm/panthor: Report group as timedout when we fail to properly suspend Boris Brezillon
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 15:29 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: Christopher Healy, dri-devel, kernel

Userspace can use GROUP_SUBMIT errors as a trigger to check the group
state and recreate the group if it became unusable. Make sure we
report an error when the group became unusable.

Changes in v3:
- None

Changes in v2:
- Add R-bs

Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
---
 drivers/gpu/drm/panthor/panthor_sched.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index eb9f6635cc12..eda8fbb276b3 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3688,6 +3688,11 @@ panthor_job_create(struct panthor_file *pfile,
 		goto err_put_job;
 	}
 
+	if (!group_can_run(job->group)) {
+		ret = -EINVAL;
+		goto err_put_job;
+	}
+
 	if (job->queue_idx >= job->group->queue_count ||
 	    !job->group->queues[job->queue_idx]) {
 		ret = -EINVAL;
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 2/3] drm/panthor: Report group as timedout when we fail to properly suspend
  2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
  2024-10-29 15:29 ` [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead Boris Brezillon
@ 2024-10-29 15:29 ` Boris Brezillon
  2024-10-29 15:29 ` [PATCH v3 3/3] drm/panthor: Report innocent group kill Boris Brezillon
  2024-10-30 15:41 ` [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
  3 siblings, 0 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 15:29 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: Christopher Healy, dri-devel, kernel

If we don't do that, the group is considered usable by userspace, but
all further GROUP_SUBMIT will fail with -EINVAL.

Changes in v3:
- Add R-bs

Changes in v2:
- New patch

Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
---
 drivers/gpu/drm/panthor/panthor_sched.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index eda8fbb276b3..ef4bec7ff9c7 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -602,10 +602,11 @@ struct panthor_group {
 	 * @timedout: True when a timeout occurred on any of the queues owned by
 	 * this group.
 	 *
-	 * Timeouts can be reported by drm_sched or by the FW. In any case, any
-	 * timeout situation is unrecoverable, and the group becomes useless.
-	 * We simply wait for all references to be dropped so we can release the
-	 * group object.
+	 * Timeouts can be reported by drm_sched or by the FW. If a reset is required,
+	 * and the group can't be suspended, this also leads to a timeout. In any case,
+	 * any timeout situation is unrecoverable, and the group becomes useless. We
+	 * simply wait for all references to be dropped so we can release the group
+	 * object.
 	 */
 	bool timedout;
 
@@ -2687,6 +2688,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
 		csgs_upd_ctx_init(&upd_ctx);
 		while (slot_mask) {
 			u32 csg_id = ffs(slot_mask) - 1;
+			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+
+			/* We consider group suspension failures as fatal and flag the
+			 * group as unusable by setting timedout=true.
+			 */
+			csg_slot->group->timedout = true;
 
 			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
 						CSG_STATE_TERMINATE,
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 3/3] drm/panthor: Report innocent group kill
  2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
  2024-10-29 15:29 ` [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead Boris Brezillon
  2024-10-29 15:29 ` [PATCH v3 2/3] drm/panthor: Report group as timedout when we fail to properly suspend Boris Brezillon
@ 2024-10-29 15:29 ` Boris Brezillon
  2024-10-29 16:19   ` Boris Brezillon
  2024-10-30 15:41 ` [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
  3 siblings, 1 reply; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 15:29 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: Christopher Healy, dri-devel, kernel

Groups can be killed during a reset even though they did nothing wrong.
That usually happens when the FW is put in a bad state by other groups,
resulting in group suspension failures when the reset happens.

If we end up in that situation, flag the group innocent and report
innocence through a new DRM_PANTHOR_GROUP_STATE flag.

Bump the minor driver version to reflect the uAPI change.

Changes in v3:
- Actually report innocence to userspace

Changes in v2:
- New patch

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_drv.c   |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
 include/uapi/drm/panthor_drm.h          |  9 +++++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index ac7e53f6e3f0..f1dff7e0173d 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
 	.desc = "Panthor DRM driver",
 	.date = "20230801",
 	.major = 1,
-	.minor = 2,
+	.minor = 3,
 
 	.gem_create_object = panthor_gem_create_object,
 	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index ef4bec7ff9c7..97ed5fe5a191 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -610,6 +610,16 @@ struct panthor_group {
 	 */
 	bool timedout;
 
+	/**
+	 * @innocent: True when the group becomes unusable because the group suspension
+	 * failed during a reset.
+	 *
+	 * Sometimes the FW was put in a bad state by other groups, causing the group
+	 * suspension happening in the reset path to fail. In that case, we consider the
+	 * group innocent.
+	 */
+	bool innocent;
+
 	/**
 	 * @syncobjs: Pool of per-queue synchronization objects.
 	 *
@@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
 			u32 csg_id = ffs(slot_mask) - 1;
 			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
 
+			/* If the group was still usable before that point, we consider
+			 * it innocent.
+			 */
+			if (group_can_run(csg_slot->group))
+				csg_slot->group->innocent = true;
+
 			/* We consider group suspension failures as fatal and flag the
 			 * group as unusable by setting timedout=true.
 			 */
@@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
 		get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
 		get_state->fatal_queues = group->fatal_queues;
 	}
+	if (group->innocent)
+		get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
 	mutex_unlock(&sched->lock);
 
 	group_put(group);
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 87c9cb555dd1..b99763cbae48 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
 	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
 	 */
 	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
+
+	/**
+	 * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
+	 * groups.
+	 *
+	 * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
+	 * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
+	 */
+	DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
 };
 
 /**
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 3/3] drm/panthor: Report innocent group kill
  2024-10-29 15:29 ` [PATCH v3 3/3] drm/panthor: Report innocent group kill Boris Brezillon
@ 2024-10-29 16:19   ` Boris Brezillon
  2024-10-30 13:18     ` Steven Price
  2024-10-30 13:53     ` Liviu Dudau
  0 siblings, 2 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-29 16:19 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: Christopher Healy, dri-devel, kernel

On Tue, 29 Oct 2024 16:29:12 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Groups can be killed during a reset even though they did nothing wrong.
> That usually happens when the FW is put in a bad state by other groups,
> resulting in group suspension failures when the reset happens.
> 
> If we end up in that situation, flag the group innocent and report
> innocence through a new DRM_PANTHOR_GROUP_STATE flag.
> 
> Bump the minor driver version to reflect the uAPI change.
> 
> Changes in v3:
> - Actually report innocence to userspace
> 
> Changes in v2:
> - New patch
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_drv.c   |  2 +-
>  drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
>  include/uapi/drm/panthor_drm.h          |  9 +++++++++
>  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index ac7e53f6e3f0..f1dff7e0173d 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c

Forgot to update the changelog with:

--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1493,6 +1493,7 @@ static void panthor_debugfs_init(struct drm_minor
*minor)
  * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
  * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
  *       - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
+ * - 1.3 - adds DRM_PANTHOR_GROUP_STATE_INNOCENT flag
  */

I'll send a v4 addressing that, but I'll probably queue the first two
patches to drm-misc-fixes first.

> @@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
>  	.desc = "Panthor DRM driver",
>  	.date = "20230801",
>  	.major = 1,
> -	.minor = 2,
> +	.minor = 3,
>  
>  	.gem_create_object = panthor_gem_create_object,
>  	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index ef4bec7ff9c7..97ed5fe5a191 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -610,6 +610,16 @@ struct panthor_group {
>  	 */
>  	bool timedout;
>  
> +	/**
> +	 * @innocent: True when the group becomes unusable because the group suspension
> +	 * failed during a reset.
> +	 *
> +	 * Sometimes the FW was put in a bad state by other groups, causing the group
> +	 * suspension happening in the reset path to fail. In that case, we consider the
> +	 * group innocent.
> +	 */
> +	bool innocent;
> +
>  	/**
>  	 * @syncobjs: Pool of per-queue synchronization objects.
>  	 *
> @@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
>  			u32 csg_id = ffs(slot_mask) - 1;
>  			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
>  
> +			/* If the group was still usable before that point, we consider
> +			 * it innocent.
> +			 */
> +			if (group_can_run(csg_slot->group))
> +				csg_slot->group->innocent = true;
> +
>  			/* We consider group suspension failures as fatal and flag the
>  			 * group as unusable by setting timedout=true.
>  			 */
> @@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
>  		get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
>  		get_state->fatal_queues = group->fatal_queues;
>  	}
> +	if (group->innocent)
> +		get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
>  	mutex_unlock(&sched->lock);
>  
>  	group_put(group);
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 87c9cb555dd1..b99763cbae48 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
>  	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
>  	 */
>  	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
> +	 * groups.
> +	 *
> +	 * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
> +	 * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
>  };
>  
>  /**


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 3/3] drm/panthor: Report innocent group kill
  2024-10-29 16:19   ` Boris Brezillon
@ 2024-10-30 13:18     ` Steven Price
  2024-10-30 13:53     ` Liviu Dudau
  1 sibling, 0 replies; 8+ messages in thread
From: Steven Price @ 2024-10-30 13:18 UTC (permalink / raw)
  To: Boris Brezillon, Liviu Dudau, Adrián Larumbe
  Cc: Christopher Healy, dri-devel, kernel

On 29/10/2024 16:19, Boris Brezillon wrote:
> On Tue, 29 Oct 2024 16:29:12 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
>> Groups can be killed during a reset even though they did nothing wrong.
>> That usually happens when the FW is put in a bad state by other groups,
>> resulting in group suspension failures when the reset happens.
>>
>> If we end up in that situation, flag the group innocent and report
>> innocence through a new DRM_PANTHOR_GROUP_STATE flag.
>>
>> Bump the minor driver version to reflect the uAPI change.
>>
>> Changes in v3:
>> - Actually report innocence to userspace
>>
>> Changes in v2:
>> - New patch
>>
>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
>> ---
>>  drivers/gpu/drm/panthor/panthor_drv.c   |  2 +-
>>  drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
>>  include/uapi/drm/panthor_drm.h          |  9 +++++++++
>>  3 files changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
>> index ac7e53f6e3f0..f1dff7e0173d 100644
>> --- a/drivers/gpu/drm/panthor/panthor_drv.c
>> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> 
> Forgot to update the changelog with:
> 
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1493,6 +1493,7 @@ static void panthor_debugfs_init(struct drm_minor
> *minor)
>   * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
>   * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
>   *       - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
> + * - 1.3 - adds DRM_PANTHOR_GROUP_STATE_INNOCENT flag
>   */
> 
> I'll send a v4 addressing that, but I'll probably queue the first two
> patches to drm-misc-fixes first.

With the changelog update you can add:

Reviewed-by: Steven Price <steven.price@arm.com>

Thanks,
Steve

>> @@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
>>  	.desc = "Panthor DRM driver",
>>  	.date = "20230801",
>>  	.major = 1,
>> -	.minor = 2,
>> +	.minor = 3,
>>  
>>  	.gem_create_object = panthor_gem_create_object,
>>  	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>> index ef4bec7ff9c7..97ed5fe5a191 100644
>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>> @@ -610,6 +610,16 @@ struct panthor_group {
>>  	 */
>>  	bool timedout;
>>  
>> +	/**
>> +	 * @innocent: True when the group becomes unusable because the group suspension
>> +	 * failed during a reset.
>> +	 *
>> +	 * Sometimes the FW was put in a bad state by other groups, causing the group
>> +	 * suspension happening in the reset path to fail. In that case, we consider the
>> +	 * group innocent.
>> +	 */
>> +	bool innocent;
>> +
>>  	/**
>>  	 * @syncobjs: Pool of per-queue synchronization objects.
>>  	 *
>> @@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
>>  			u32 csg_id = ffs(slot_mask) - 1;
>>  			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
>>  
>> +			/* If the group was still usable before that point, we consider
>> +			 * it innocent.
>> +			 */
>> +			if (group_can_run(csg_slot->group))
>> +				csg_slot->group->innocent = true;
>> +
>>  			/* We consider group suspension failures as fatal and flag the
>>  			 * group as unusable by setting timedout=true.
>>  			 */
>> @@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
>>  		get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
>>  		get_state->fatal_queues = group->fatal_queues;
>>  	}
>> +	if (group->innocent)
>> +		get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
>>  	mutex_unlock(&sched->lock);
>>  
>>  	group_put(group);
>> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
>> index 87c9cb555dd1..b99763cbae48 100644
>> --- a/include/uapi/drm/panthor_drm.h
>> +++ b/include/uapi/drm/panthor_drm.h
>> @@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
>>  	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
>>  	 */
>>  	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
>> +
>> +	/**
>> +	 * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
>> +	 * groups.
>> +	 *
>> +	 * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
>> +	 * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
>> +	 */
>> +	DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
>>  };
>>  
>>  /**
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 3/3] drm/panthor: Report innocent group kill
  2024-10-29 16:19   ` Boris Brezillon
  2024-10-30 13:18     ` Steven Price
@ 2024-10-30 13:53     ` Liviu Dudau
  1 sibling, 0 replies; 8+ messages in thread
From: Liviu Dudau @ 2024-10-30 13:53 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Steven Price, Adrián Larumbe, Christopher Healy, dri-devel,
	kernel

On Tue, Oct 29, 2024 at 05:19:33PM +0100, Boris Brezillon wrote:
> On Tue, 29 Oct 2024 16:29:12 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
> > Groups can be killed during a reset even though they did nothing wrong.
> > That usually happens when the FW is put in a bad state by other groups,
> > resulting in group suspension failures when the reset happens.
> > 
> > If we end up in that situation, flag the group innocent and report
> > innocence through a new DRM_PANTHOR_GROUP_STATE flag.
> > 
> > Bump the minor driver version to reflect the uAPI change.
> > 
> > Changes in v3:
> > - Actually report innocence to userspace
> > 
> > Changes in v2:
> > - New patch
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_drv.c   |  2 +-
> >  drivers/gpu/drm/panthor/panthor_sched.c | 18 ++++++++++++++++++
> >  include/uapi/drm/panthor_drm.h          |  9 +++++++++
> >  3 files changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> > index ac7e53f6e3f0..f1dff7e0173d 100644
> > --- a/drivers/gpu/drm/panthor/panthor_drv.c
> > +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> 
> Forgot to update the changelog with:
> 
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1493,6 +1493,7 @@ static void panthor_debugfs_init(struct drm_minor
> *minor)
>   * - 1.1 - adds DEV_QUERY_TIMESTAMP_INFO query
>   * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
>   *       - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
> + * - 1.3 - adds DRM_PANTHOR_GROUP_STATE_INNOCENT flag
>   */
> 
> I'll send a v4 addressing that, but I'll probably queue the first two
> patches to drm-misc-fixes first.

You can also add my R-b and push the whole series.

Best regards,
Liviu

> 
> > @@ -1507,7 +1507,7 @@ static const struct drm_driver panthor_drm_driver = {
> >  	.desc = "Panthor DRM driver",
> >  	.date = "20230801",
> >  	.major = 1,
> > -	.minor = 2,
> > +	.minor = 3,
> >  
> >  	.gem_create_object = panthor_gem_create_object,
> >  	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index ef4bec7ff9c7..97ed5fe5a191 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -610,6 +610,16 @@ struct panthor_group {
> >  	 */
> >  	bool timedout;
> >  
> > +	/**
> > +	 * @innocent: True when the group becomes unusable because the group suspension
> > +	 * failed during a reset.
> > +	 *
> > +	 * Sometimes the FW was put in a bad state by other groups, causing the group
> > +	 * suspension happening in the reset path to fail. In that case, we consider the
> > +	 * group innocent.
> > +	 */
> > +	bool innocent;
> > +
> >  	/**
> >  	 * @syncobjs: Pool of per-queue synchronization objects.
> >  	 *
> > @@ -2690,6 +2700,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> >  			u32 csg_id = ffs(slot_mask) - 1;
> >  			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> >  
> > +			/* If the group was still usable before that point, we consider
> > +			 * it innocent.
> > +			 */
> > +			if (group_can_run(csg_slot->group))
> > +				csg_slot->group->innocent = true;
> > +
> >  			/* We consider group suspension failures as fatal and flag the
> >  			 * group as unusable by setting timedout=true.
> >  			 */
> > @@ -3570,6 +3586,8 @@ int panthor_group_get_state(struct panthor_file *pfile,
> >  		get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
> >  		get_state->fatal_queues = group->fatal_queues;
> >  	}
> > +	if (group->innocent)
> > +		get_state->state |= DRM_PANTHOR_GROUP_STATE_INNOCENT;
> >  	mutex_unlock(&sched->lock);
> >  
> >  	group_put(group);
> > diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> > index 87c9cb555dd1..b99763cbae48 100644
> > --- a/include/uapi/drm/panthor_drm.h
> > +++ b/include/uapi/drm/panthor_drm.h
> > @@ -923,6 +923,15 @@ enum drm_panthor_group_state_flags {
> >  	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> >  	 */
> >  	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> > +
> > +	/**
> > +	 * @DRM_PANTHOR_GROUP_STATE_INNOCENT: Group was killed during a reset caused by other
> > +	 * groups.
> > +	 *
> > +	 * This flag can only be set if DRM_PANTHOR_GROUP_STATE_TIMEDOUT is set and
> > +	 * DRM_PANTHOR_GROUP_STATE_FATAL_FAULT is not.
> > +	 */
> > +	DRM_PANTHOR_GROUP_STATE_INNOCENT = 1 << 2,
> >  };
> >  
> >  /**
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 0/3] drm/panthor: Fix group state reporting
  2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
                   ` (2 preceding siblings ...)
  2024-10-29 15:29 ` [PATCH v3 3/3] drm/panthor: Report innocent group kill Boris Brezillon
@ 2024-10-30 15:41 ` Boris Brezillon
  3 siblings, 0 replies; 8+ messages in thread
From: Boris Brezillon @ 2024-10-30 15:41 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: Christopher Healy, dri-devel, kernel

On Tue, 29 Oct 2024 16:29:09 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Hello,
> 
> What initially started as a simple fix to avoid queueing jobs to a group
> that can't be scheduled has turned into a series of three patches
> improving the group state reporting. Two of them are actual fixes, the
> last one is an improvement to properly report innocence/guiltiness of
> a group when a group becomes unusable. With this new
> GROUP_STATE_INNOCENT, we can distinguish between
> GL_INNOCENT_CONTEXT_RESET and GL_GUILT_CONTEXT_RESET (see this mesa
> MR for more details [1]).
> 
> Regards,
> 
> Boris
> 
> [1]https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31868
> 
> Boris Brezillon (3):
>   drm/panthor: Fail job creation when the group is dead
>   drm/panthor: Report group as timedout when we fail to properly suspend

First two patches are queued to drm-misc-fixes.

>   drm/panthor: Report innocent group kill

I'll send a v4 of the third one and wait for a backmerge of Linus' tree
in drm-misc-next before applying.

> 
>  drivers/gpu/drm/panthor/panthor_drv.c   |  2 +-
>  drivers/gpu/drm/panthor/panthor_sched.c | 38 ++++++++++++++++++++++---
>  include/uapi/drm/panthor_drm.h          |  9 ++++++
>  3 files changed, 44 insertions(+), 5 deletions(-)
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-10-30 15:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-29 15:29 [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 1/3] drm/panthor: Fail job creation when the group is dead Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 2/3] drm/panthor: Report group as timedout when we fail to properly suspend Boris Brezillon
2024-10-29 15:29 ` [PATCH v3 3/3] drm/panthor: Report innocent group kill Boris Brezillon
2024-10-29 16:19   ` Boris Brezillon
2024-10-30 13:18     ` Steven Price
2024-10-30 13:53     ` Liviu Dudau
2024-10-30 15:41 ` [PATCH v3 0/3] drm/panthor: Fix group state reporting Boris Brezillon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.