Re: [PATCH 1/4] drm/sched: add optional errno to drm_sched_start()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: Philipp Stanner <pstanner@redhat.com>
Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
	daniel.vetter@ffwll.ch, vitaly.prosyak@amd.com,
	dri-devel@lists.freedesktop.org, ltuikov89@gmail.com
Subject: Re: [PATCH 1/4] drm/sched: add optional errno to drm_sched_start()
Date: Mon, 2 Sep 2024 13:15:13 +0200	[thread overview]
Message-ID: <ZtWeQT_KToehsZ3o@phenom.ffwll.local> (raw)
In-Reply-To: <7db2f831c86b3d73d99b8cf006e11a3546fa38d6.camel@redhat.com>

On Mon, Sep 02, 2024 at 09:01:48AM +0200, Philipp Stanner wrote:
> On Fri, 2024-08-30 at 15:15 +0200, Christian König wrote:
> > Am 28.08.24 um 11:30 schrieb Philipp Stanner:
> > > On Mon, 2024-08-26 at 14:25 +0200, Christian König wrote:
> > > > The current implementation of drm_sched_start uses a hardcoded
> > > > -ECANCELED to dispose of a job when the parent/hw fence is NULL.
> > > > This results in drm_sched_job_done being called with -ECANCELED
> > > > for
> > > > each job with a NULL parent in the pending list, making it
> > > > difficult
> > > > to distinguish between recovery methods, whether a queue reset or
> > > > a
> > > > full GPU reset was used.
> > > > 
> > > > To improve this, we first try a soft recovery for timeout jobs
> > > > and
> > > > use the error code -ENODATA. If soft recovery fails, we proceed
> > > > with
> > > > a queue reset, where the error code remains -ENODATA for the job.
> > > > Finally, for a full GPU reset, we use error codes -ECANCELED or
> > > > -ETIME. This patch adds an error code parameter to
> > > > drm_sched_start,
> > > > allowing us to differentiate between queue reset and GPU reset
> > > > failures. This enables user mode and test applications to
> > > > validate
> > > > the expected correctness of the requested operation. After a
> > > > successful queue reset, the only way to continue normal operation
> > > > is
> > > > to call drm_sched_job_done with the specific error code -ENODATA.
> > > > 
> > > > v1: Initial implementation by Jesse utilized
> > > > amdgpu_device_lock_reset_domain
> > > >      and amdgpu_device_unlock_reset_domain to allow user mode to
> > > > track
> > > >      the queue reset status and distinguish between queue reset
> > > > and
> > > >      GPU reset.
> > > > v2: Christian suggested using the error codes -ENODATA for queue
> > > > reset
> > > >      and -ECANCELED or -ETIME for GPU reset, returned to
> > > >      amdgpu_cs_wait_ioctl.
> > > > v3: To meet the requirements, we introduce a new function
> > > >      drm_sched_start_ex with an additional parameter to set
> > > >      dma_fence_set_error, allowing us to handle the specific
> > > > error
> > > >      codes appropriately and dispose of bad jobs with the
> > > > selected
> > > >      error code depending on whether it was a queue reset or GPU
> > > > reset.
> > > > v4: Alex suggested using a new name,
> > > > drm_sched_start_with_recovery_error,
> > > >      which more accurately describes the function's purpose.
> > > >      Additionally, it was recommended to add documentation
> > > > details
> > > >      about the new method.
> > > > v5: Fixed declaration of new function
> > > > drm_sched_start_with_recovery_error.(Alex)
> > > > v6 (chk): rebase on upstream changes, cleanup the commit message,
> > > >            drop the new function again and update all callers,
> > > >            apply the errno also to scheduler fences with hw
> > > > fences
> > > > 
> > > > Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> > > > Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> > > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > > Cc: Alex Deucher <alexander.deucher@amd.com>
> > > > ---
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 2 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          | 4 ++--
> > > >   drivers/gpu/drm/etnaviv/etnaviv_sched.c             | 2 +-
> > > >   drivers/gpu/drm/imagination/pvr_queue.c             | 4 ++--
> > > >   drivers/gpu/drm/lima/lima_sched.c                   | 2 +-
> > > >   drivers/gpu/drm/nouveau/nouveau_sched.c             | 2 +-
> > > >   drivers/gpu/drm/panfrost/panfrost_job.c             | 2 +-
> > > >   drivers/gpu/drm/panthor/panthor_mmu.c               | 2 +-
> > > >   drivers/gpu/drm/scheduler/sched_main.c              | 7 ++++---
> > > >   drivers/gpu/drm/v3d/v3d_sched.c                     | 2 +-
> > > >   include/drm/gpu_scheduler.h                         | 2 +-
> > > >   11 files changed, 16 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> > > > index 2320df51c914..18135d8235f9 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> > > > @@ -300,7 +300,7 @@ static int
> > > > suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool
> > > > sus
> > > >   			if (r)
> > > >   				goto out;
> > > >   		} else {
> > > > -			drm_sched_start(&ring->sched);
> > > > +			drm_sched_start(&ring->sched, 0);
> > > >   		}
> > > >   	}
> > > >   
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 1cd7d355689c..5891312e44ea 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -5879,7 +5879,7 @@ int amdgpu_device_gpu_recover(struct
> > > > amdgpu_device *adev,
> > > >   			if (!amdgpu_ring_sched_ready(ring))
> > > >   				continue;
> > > >   
> > > > -			drm_sched_start(&ring->sched);
> > > > +			drm_sched_start(&ring->sched, 0);
> > > >   		}
> > > >   
> > > >   		if
> > > > (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) &&
> > > > !job_signaled)
> > > > @@ -6374,7 +6374,7 @@ void amdgpu_pci_resume(struct pci_dev
> > > > *pdev)
> > > >   		if (!amdgpu_ring_sched_ready(ring))
> > > >   			continue;
> > > >   
> > > > -		drm_sched_start(&ring->sched);
> > > > +		drm_sched_start(&ring->sched, 0);
> > > >   	}
> > > >   
> > > >   	amdgpu_device_unset_mp1_state(adev);
> > > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > > b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > > index ab9ca4824b62..23ced5896c7c 100644
> > > > --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > > +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> > > > @@ -72,7 +72,7 @@ static enum drm_gpu_sched_stat
> > > > etnaviv_sched_timedout_job(struct drm_sched_job
> > > >   
> > > >   	drm_sched_resubmit_jobs(&gpu->sched);
> > > >   
> > > > -	drm_sched_start(&gpu->sched);
> > > > +	drm_sched_start(&gpu->sched, 0);
> > > >   	return DRM_GPU_SCHED_STAT_NOMINAL;
> > > >   
> > > >   out_no_timeout:
> > > > diff --git a/drivers/gpu/drm/imagination/pvr_queue.c
> > > > b/drivers/gpu/drm/imagination/pvr_queue.c
> > > > index 20cb46012082..c4f08432882b 100644
> > > > --- a/drivers/gpu/drm/imagination/pvr_queue.c
> > > > +++ b/drivers/gpu/drm/imagination/pvr_queue.c
> > > > @@ -782,7 +782,7 @@ static void pvr_queue_start(struct pvr_queue
> > > > *queue)
> > > >   		}
> > > >   	}
> > > >   
> > > > -	drm_sched_start(&queue->scheduler);
> > > > +	drm_sched_start(&queue->scheduler, 0);
> > > >   }
> > > >   
> > > >   /**
> > > > @@ -842,7 +842,7 @@ pvr_queue_timedout_job(struct drm_sched_job
> > > > *s_job)
> > > >   	}
> > > >   	mutex_unlock(&pvr_dev->queues.lock);
> > > >   
> > > > -	drm_sched_start(sched);
> > > > +	drm_sched_start(sched, 0);
> > > >   
> > > >   	return DRM_GPU_SCHED_STAT_NOMINAL;
> > > >   }
> > > > diff --git a/drivers/gpu/drm/lima/lima_sched.c
> > > > b/drivers/gpu/drm/lima/lima_sched.c
> > > > index 1a944edb6ddc..b40c90e97d7e 100644
> > > > --- a/drivers/gpu/drm/lima/lima_sched.c
> > > > +++ b/drivers/gpu/drm/lima/lima_sched.c
> > > > @@ -463,7 +463,7 @@ static enum drm_gpu_sched_stat
> > > > lima_sched_timedout_job(struct drm_sched_job *job
> > > >   	lima_pm_idle(ldev);
> > > >   
> > > >   	drm_sched_resubmit_jobs(&pipe->base);
> > > > -	drm_sched_start(&pipe->base);
> > > > +	drm_sched_start(&pipe->base, 0);
> > > >   
> > > >   	return DRM_GPU_SCHED_STAT_NOMINAL;
> > > >   }
> > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > b/drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > index eb6c3f9a01f5..4412f2711fb5 100644
> > > > --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > @@ -379,7 +379,7 @@ nouveau_sched_timedout_job(struct
> > > > drm_sched_job
> > > > *sched_job)
> > > >   	else
> > > >   		NV_PRINTK(warn, job->cli, "Generic job
> > > > timeout.\n");
> > > >   
> > > > -	drm_sched_start(sched);
> > > > +	drm_sched_start(sched, 0);
> > > >   
> > > >   	return stat;
> > > >   }
> > > > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
> > > > b/drivers/gpu/drm/panfrost/panfrost_job.c
> > > > index df49d37d0e7e..d140800606bf 100644
> > > > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > > > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > > > @@ -727,7 +727,7 @@ panfrost_reset(struct panfrost_device *pfdev,
> > > >   
> > > >   	/* Restart the schedulers */
> > > >   	for (i = 0; i < NUM_JOB_SLOTS; i++)
> > > > -		drm_sched_start(&pfdev->js->queue[i].sched);
> > > > +		drm_sched_start(&pfdev->js->queue[i].sched, 0);
> > > >   
> > > >   	/* Re-enable job interrupts now that everything has been
> > > > restarted. */
> > > >   	job_write(pfdev, JOB_INT_MASK,
> > > > diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c
> > > > b/drivers/gpu/drm/panthor/panthor_mmu.c
> > > > index d47972806d50..e630cdf47f99 100644
> > > > --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> > > > +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> > > > @@ -827,7 +827,7 @@ static void panthor_vm_stop(struct panthor_vm
> > > > *vm)
> > > >   
> > > >   static void panthor_vm_start(struct panthor_vm *vm)
> > > >   {
> > > > -	drm_sched_start(&vm->sched);
> > > > +	drm_sched_start(&vm->sched, 0);
> > > >   }
> > > >   
> > > >   /**
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index ab53ab486fe6..f093616fe53c 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -674,9 +674,10 @@ EXPORT_SYMBOL(drm_sched_stop);
> > > >    * drm_sched_start - recover jobs after a reset
> > > >    *
> > > >    * @sched: scheduler instance
> > > > + * @errno: error to set on the pending fences
> > > >    *
> > > >    */
> > > > -void drm_sched_start(struct drm_gpu_scheduler *sched)
> > > > +void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> > > >   {
> > > >   	struct drm_sched_job *s_job, *tmp;
> > > >   
> > > > @@ -691,13 +692,13 @@ void drm_sched_start(struct
> > > > drm_gpu_scheduler
> > > > *sched)
> > > >   		atomic_add(s_job->credits, &sched-
> > > > >credit_count);
> > > >   
> > > >   		if (!fence) {
> > > > -			drm_sched_job_done(s_job, -ECANCELED);
> > > > +			drm_sched_job_done(s_job, errno ?: -
> > > > ECANCELED);
> > > >   			continue;
> > > >   		}
> > > >   
> > > >   		if (dma_fence_add_callback(fence, &s_job->cb,
> > > >   					  
> > > > drm_sched_job_done_cb))
> > > > -			drm_sched_job_done(s_job, fence->error);
> > > > +			drm_sched_job_done(s_job, fence->error
> > > > ?:
> > > > errno);
> > > >   	}
> > > >   
> > > >   	drm_sched_start_timeout_unlocked(sched);
> > > > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c
> > > > b/drivers/gpu/drm/v3d/v3d_sched.c
> > > > index fd29a00b233c..b6a89171824b 100644
> > > > --- a/drivers/gpu/drm/v3d/v3d_sched.c
> > > > +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> > > > @@ -663,7 +663,7 @@ v3d_gpu_reset_for_timeout(struct v3d_dev
> > > > *v3d,
> > > > struct drm_sched_job *sched_job)
> > > >   
> > > >   	/* Unblock schedulers and restart their jobs. */
> > > >   	for (q = 0; q < V3D_MAX_QUEUES; q++) {
> > > > -		drm_sched_start(&v3d->queue[q].sched);
> > > > +		drm_sched_start(&v3d->queue[q].sched, 0);
> > > >   	}
> > > >   
> > > >   	mutex_unlock(&v3d->reset_lock);
> > > > diff --git a/include/drm/gpu_scheduler.h
> > > > b/include/drm/gpu_scheduler.h
> > > > index fe8edb917360..a8d19b10f9b8 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -579,7 +579,7 @@ bool drm_sched_wqueue_ready(struct
> > > > drm_gpu_scheduler *sched);
> > > >   void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
> > > >   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
> > > >   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct
> > > > drm_sched_job *bad);
> > > > -void drm_sched_start(struct drm_gpu_scheduler *sched);
> > > > +void drm_sched_start(struct drm_gpu_scheduler *sched, int
> > > > errno);
> > > I personally only recently started using the scheduler and many
> > > things
> > > were quite confusing.
> > > 
> > > In particular, I thought it's not good to call this function
> > > drm_sched_start(), because that implies to new users / programmers
> > > that
> > > this function is supposed to start the scheduler.
> > > 
> > > Accordingly, drm_sched_stop() sounds as if it's intended to be used
> > > on
> > > scheduler teardown. "Alright, I'll stop the scheduler with
> > > drm_sched_stop(), then can safely call drm_sched_entity_destroy()
> > > and
> > > then tear it down completely through drm_sched_fini()".
> > > 
> > > Now the comments make it obvious that those start and stop
> > > functions
> > > are more intended for error recovery.
> > > 
> > > So my stance is that start should be called, e.g.,
> > > drm_sched_restart()
> > > or perhaps drm_sched_recover_start().
> > > 
> > > So since you're already touching all the lines where the function
> > > is
> > > used, this might be a good opportunity to improve the name, too.
> > > 
> > > If that's deemed desirable.
> > 
> > Yeah completely agree.
> > 
> > We also had people incorrectly thinking that they should call 
> > drm_sched_stop/start on suspend/resume resulting in a system which 
> > didn't come up again after resume.
> > 
> > How about we call it drm_sched_suspend_before_reset() and 
> > drm_sched_resume_after_reset()?
> 
> Yes, that sounds bullet-proof to me :)
> 
> One might also consider drm_sched_start()'s function summary "recover
> jobs after a reset". Maybe have a sentence about what "recover" means
> would help there, too.

Yeah, with that team drm_sched_reset_prepare/recover might also be a good
function pair. suspend/resume could be mixed up with runtime or system
suspend/resume, leading with reset feels better to me.
-Sima

> 
> 
> Regards,
> P.
> 
> 
> > 
> > Thanks,
> > Christian.
> > 
> > > 
> > > P.
> > > 
> > > 
> > > >   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> > > >   void drm_sched_increase_karma(struct drm_sched_job *bad);
> > > >   void drm_sched_reset_karma(struct drm_sched_job *bad);
> > 
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

next prev parent reply	other threads:[~2024-09-02 11:15 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-26 12:25 [PATCH 1/4] drm/sched: add optional errno to drm_sched_start() Christian König
2024-08-26 12:25 ` [PATCH 2/4] dma-buf: give examples of error codes to use Christian König
2024-08-26 12:25 ` [PATCH 3/4] drm/doc: Document submission error signaling Christian König
2024-08-26 12:25 ` [PATCH 4/4] drm/todos: add entry for drm_syncobj error handling Christian König
2024-08-27 17:38   ` Daniel Vetter
2024-08-28  9:30 ` [PATCH 1/4] drm/sched: add optional errno to drm_sched_start() Philipp Stanner
2024-08-30 13:15   ` Christian König
2024-09-02  7:01     ` Philipp Stanner
2024-09-02 11:15       ` Daniel Vetter [this message]
2024-08-29 13:06 ` Alex Deucher
2024-08-30  7:13 ` kernel test robot
2024-08-30  8:15 ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZtWeQT_KToehsZ3o@phenom.ffwll.local \
    --to=daniel.vetter@ffwll.ch \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=ltuikov89@gmail.com \
    --cc=pstanner@redhat.com \
    --cc=vitaly.prosyak@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.