AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Brost <matthew.brost@intel.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>
Cc: <amd-gfx@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>, <alexander.deucher@amd.com>,
	<jesse.zhang@amd.com>, <vitaly.prosyak@amd.com>
Subject: Re: [PATCH] drm/sched: add optional errno to drm_sched_start()
Date: Fri, 26 Jul 2024 12:30:29 +0000	[thread overview]
Message-ID: <ZqOW5Q778UCkh7V7@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <20240726075550.1511-1-christian.koenig@amd.com>

On Fri, Jul 26, 2024 at 09:55:50AM +0200, Christian König wrote:
> The current implementation of drm_sched_start uses a hardcoded
> -ECANCELED to dispose of a job when the parent/hw fence is NULL.
> This results in drm_sched_job_done being called with -ECANCELED for
> each job with a NULL parent in the pending list, making it difficult
> to distinguish between recovery methods, whether a queue reset or a
> full GPU reset was used.
> 
> To improve this, we first try a soft recovery for timeout jobs and
> use the error code -ENODATA. If soft recovery fails, we proceed with
> a queue reset, where the error code remains -ENODATA for the job.
> Finally, for a full GPU reset, we use error codes -ECANCELED or
> -ETIME. This patch adds an error code parameter to drm_sched_start,
> allowing us to differentiate between queue reset and GPU reset
> failures. This enables user mode and test applications to validate
> the expected correctness of the requested operation. After a
> successful queue reset, the only way to continue normal operation is
> to call drm_sched_job_done with the specific error code -ENODATA.
> 
> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
>     and amdgpu_device_unlock_reset_domain to allow user mode to track
>     the queue reset status and distinguish between queue reset and
>     GPU reset.
> v2: Christian suggested using the error codes -ENODATA for queue reset
>     and -ECANCELED or -ETIME for GPU reset, returned to
>     amdgpu_cs_wait_ioctl.
> v3: To meet the requirements, we introduce a new function
>     drm_sched_start_ex with an additional parameter to set
>     dma_fence_set_error, allowing us to handle the specific error
>     codes appropriately and dispose of bad jobs with the selected
>     error code depending on whether it was a queue reset or GPU reset.
> v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
>     which more accurately describes the function's purpose.
>     Additionally, it was recommended to add documentation details
>     about the new method.
> v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
> v6 (chk): rebase on upstream changes, cleanup the commit message,
>           drop the new function again and update all callers,
>           apply the errno also to scheduler fences with hw fences
> 

Seems responablie to me, but all the caller pass in an errno of zero to
drm_sched_start. Going to change in a follow up?

Anyways, LGTM but will leave RB for a user a user of this interface.
Acked-by: Matthew Brost <matthew.brost@intel.com>

> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          | 4 ++--
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c             | 4 ++--
>  drivers/gpu/drm/imagination/pvr_queue.c             | 4 ++--
>  drivers/gpu/drm/lima/lima_sched.c                   | 2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c             | 2 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c             | 2 +-
>  drivers/gpu/drm/panthor/panthor_mmu.c               | 2 +-
>  drivers/gpu/drm/scheduler/sched_main.c              | 7 ++++---
>  drivers/gpu/drm/v3d/v3d_sched.c                     | 2 +-
>  include/drm/gpu_scheduler.h                         | 2 +-
>  11 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> index 2320df51c914..18135d8235f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> @@ -300,7 +300,7 @@ static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool sus
>  			if (r)
>  				goto out;
>  		} else {
> -			drm_sched_start(&ring->sched);
> +			drm_sched_start(&ring->sched, 0);
>  		}
>  	}
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c186fdb198ad..861827deb03f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5862,7 +5862,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  			if (!amdgpu_ring_sched_ready(ring))
>  				continue;
>  
> -			drm_sched_start(&ring->sched);
> +			drm_sched_start(&ring->sched, 0);
>  		}
>  
>  		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
> @@ -6360,7 +6360,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  		if (!amdgpu_ring_sched_ready(ring))
>  			continue;
>  
> -		drm_sched_start(&ring->sched);
> +		drm_sched_start(&ring->sched, 0);
>  	}
>  
>  	amdgpu_device_unset_mp1_state(adev);
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index c53641aa146f..2c8666f8ec4a 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -72,12 +72,12 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
>  
>  	drm_sched_resubmit_jobs(&gpu->sched);
>  
> -	drm_sched_start(&gpu->sched);
> +	drm_sched_start(&gpu->sched, 0);
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  
>  out_no_timeout:
>  	/* restart scheduler after GPU is usable again */
> -	drm_sched_start(&gpu->sched);
> +	drm_sched_start(&gpu->sched, 0);
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>  
> diff --git a/drivers/gpu/drm/imagination/pvr_queue.c b/drivers/gpu/drm/imagination/pvr_queue.c
> index 20cb46012082..c4f08432882b 100644
> --- a/drivers/gpu/drm/imagination/pvr_queue.c
> +++ b/drivers/gpu/drm/imagination/pvr_queue.c
> @@ -782,7 +782,7 @@ static void pvr_queue_start(struct pvr_queue *queue)
>  		}
>  	}
>  
> -	drm_sched_start(&queue->scheduler);
> +	drm_sched_start(&queue->scheduler, 0);
>  }
>  
>  /**
> @@ -842,7 +842,7 @@ pvr_queue_timedout_job(struct drm_sched_job *s_job)
>  	}
>  	mutex_unlock(&pvr_dev->queues.lock);
>  
> -	drm_sched_start(sched);
> +	drm_sched_start(sched, 0);
>  
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 1a944edb6ddc..b40c90e97d7e 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -463,7 +463,7 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>  	lima_pm_idle(ldev);
>  
>  	drm_sched_resubmit_jobs(&pipe->base);
> -	drm_sched_start(&pipe->base);
> +	drm_sched_start(&pipe->base, 0);
>  
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index eb6c3f9a01f5..4412f2711fb5 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -379,7 +379,7 @@ nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>  	else
>  		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
>  
> -	drm_sched_start(sched);
> +	drm_sched_start(sched, 0);
>  
>  	return stat;
>  }
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index df49d37d0e7e..d140800606bf 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -727,7 +727,7 @@ panfrost_reset(struct panfrost_device *pfdev,
>  
>  	/* Restart the schedulers */
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
> -		drm_sched_start(&pfdev->js->queue[i].sched);
> +		drm_sched_start(&pfdev->js->queue[i].sched, 0);
>  
>  	/* Re-enable job interrupts now that everything has been restarted. */
>  	job_write(pfdev, JOB_INT_MASK,
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index d47972806d50..e630cdf47f99 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -827,7 +827,7 @@ static void panthor_vm_stop(struct panthor_vm *vm)
>  
>  static void panthor_vm_start(struct panthor_vm *vm)
>  {
> -	drm_sched_start(&vm->sched);
> +	drm_sched_start(&vm->sched, 0);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..f093616fe53c 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -674,9 +674,10 @@ EXPORT_SYMBOL(drm_sched_stop);
>   * drm_sched_start - recover jobs after a reset
>   *
>   * @sched: scheduler instance
> + * @errno: error to set on the pending fences
>   *
>   */
> -void drm_sched_start(struct drm_gpu_scheduler *sched)
> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
>  {
>  	struct drm_sched_job *s_job, *tmp;
>  
> @@ -691,13 +692,13 @@ void drm_sched_start(struct drm_gpu_scheduler *sched)
>  		atomic_add(s_job->credits, &sched->credit_count);
>  
>  		if (!fence) {
> -			drm_sched_job_done(s_job, -ECANCELED);
> +			drm_sched_job_done(s_job, errno ?: -ECANCELED);
>  			continue;
>  		}
>  
>  		if (dma_fence_add_callback(fence, &s_job->cb,
>  					   drm_sched_job_done_cb))
> -			drm_sched_job_done(s_job, fence->error);
> +			drm_sched_job_done(s_job, fence->error ?: errno);
>  	}
>  
>  	drm_sched_start_timeout_unlocked(sched);
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 42d4f4a2dba2..cac02284cd19 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -653,7 +653,7 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>  
>  	/* Unblock schedulers and restart their jobs. */
>  	for (q = 0; q < V3D_MAX_QUEUES; q++) {
> -		drm_sched_start(&v3d->queue[q].sched);
> +		drm_sched_start(&v3d->queue[q].sched, 0);
>  	}
>  
>  	mutex_unlock(&v3d->reset_lock);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index fe8edb917360..a8d19b10f9b8 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -579,7 +579,7 @@ bool drm_sched_wqueue_ready(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> -void drm_sched_start(struct drm_gpu_scheduler *sched);
> +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>  void drm_sched_increase_karma(struct drm_sched_job *bad);
>  void drm_sched_reset_karma(struct drm_sched_job *bad);
> -- 
> 2.34.1
> 

  reply	other threads:[~2024-07-26 12:31 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-26  7:55 [PATCH] drm/sched: add optional errno to drm_sched_start() Christian König
2024-07-26 12:30 ` Matthew Brost [this message]
2024-07-29 18:37   ` Christian König
2024-07-26 14:21 ` Daniel Vetter
2024-07-29 18:43   ` Christian König
2024-07-30  8:36     ` Daniel Vetter
2024-07-30 12:06       ` Christian König
2024-07-31 20:43         ` Daniel Vetter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZqOW5Q778UCkh7V7@DUT025-TGLU.fm.intel.com \
    --to=matthew.brost@intel.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=jesse.zhang@amd.com \
    --cc=vitaly.prosyak@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox