Subject: Re: [PATCH] drm/sched: Extend and update documentation
From: Philipp Stanner
To: Philipp Stanner, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Jonathan Corbet, Matthew Brost,
	Danilo Krummrich, Christian König, Sumit Semwal
Cc: dri-devel@lists.freedesktop.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
	Christian König
Date: Thu, 24 Jul 2025 17:07:11 +0200
In-Reply-To: <20250724140121.70873-2-phasta@kernel.org>
References: <20250724140121.70873-2-phasta@kernel.org>

Two comments from myself to open up room for discussion:

On Thu, 2025-07-24 at 16:01 +0200, Philipp Stanner wrote:
> From: Philipp Stanner
> 
> The various objects used by the GPU scheduler and their memory
> lifetimes are currently not fully documented.
> 
> Add documentation describing the scheduler's objects. Improve the
> general documentation in a few other places.
> 
> Co-developed-by: Christian König
> Signed-off-by: Christian König
> Signed-off-by: Philipp Stanner
> ---
> The first draft for this docu was posted by Christian in late 2023, IIRC.
> 
> This is an updated version. Please review.
> 
> @Christian: As we agreed on months (a year?) ago, I kept your
> Signed-off-by. Just tell me if there's any issue or sth.
> ---
>  Documentation/gpu/drm-mm.rst           |  36 ++++
>  drivers/gpu/drm/scheduler/sched_main.c | 228 ++++++++++++++++++++++---
>  include/drm/gpu_scheduler.h            |   5 +-
>  3 files changed, 238 insertions(+), 31 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index d55751cad67c..95ee95fd987a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -556,12 +556,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  ------------
>  
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Flow Control
>  
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a550fd76bf0..2e7bc1e74186 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,48 +24,220 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> + * The GPU scheduler is shared infrastructure intended to help drivers manage
> + * command submission to their hardware.
>   *
> - * The organisation of the scheduler is the following:
> + * To do so, it offers a set of scheduling facilities that interact with the
> + * driver through callbacks which the latter can register.
>   *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.
> + * In particular, the scheduler takes care of:
> + *   - Ordering command submissions
> + *   - Signalling dma_fences, e.g., for finished commands
> + *   - Taking dependencies between command submissions into account
> + *   - Handling timeouts for command submissions
>   *
> - * The jobs in an entity are always scheduled in the order in which they were pushed.
> + * All callbacks the driver needs to implement are restricted by dma_fence
> + * signaling rules to guarantee deadlock-free forward progress. This especially
> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated.
>   *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Optional memory, for example for device core dumping or debugging, *must* be
> + * allocated with GFP_NOWAIT and appropriate error handling if that allocation
> + * fails. GFP_ATOMIC should only be used if absolutely necessary since dipping
> + * into the special atomic reserves is usually not justified for a GPU driver.
> + *
> + * Note especially the following about the scheduler's historic background,
> + * which has led to it playing a sort of double role today:
> + *
> + * In classic setups N ("hardware scheduling") entities share one scheduler,
> + * and the scheduler decides which job to pick from which entity and move it to
> + * the hardware ring next (that is: "scheduling").
> + *
> + * Many (especially newer) GPUs, however, can have an almost arbitrary number
> + * of hardware rings, and it's a firmware scheduler which actually decides which
> + * job will run next. In such setups, the GPU scheduler is still used (e.g., in
> + * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
> + * merely serves to queue and dequeue jobs and resolve dependencies. In such a
> + * scenario, it is recommended to have one scheduler per entity.
> + */
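
A thought while reading this: the Overview might be the right place for a
small example. Something like the following sketch of the recommended
one-scheduler-per-entity setup for firmware-scheduled hardware — all foo_*
names are made up, and I'm assuming the drm_sched_init_args-based
initialization here (callback bodies elided; rough sketches for run_job()
follow further down):

#include <drm/gpu_scheduler.h>

/* Hypothetical driver: one queue = one scheduler + one entity. */
struct foo_queue {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
};

static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job);
static enum drm_gpu_sched_stat foo_timedout_job(struct drm_sched_job *sched_job);
static void foo_free_job(struct drm_sched_job *sched_job);

static const struct drm_sched_backend_ops foo_sched_ops = {
	.run_job	= foo_run_job,		/* hand the job to the FW ring */
	.timedout_job	= foo_timedout_job,	/* HW fence didn't signal in time */
	.free_job	= foo_free_job,		/* scheduler is done with the job */
};

static int foo_queue_init(struct foo_queue *q, struct device *dev)
{
	const struct drm_sched_init_args args = {
		.ops		= &foo_sched_ops,
		.num_rqs	= DRM_SCHED_PRIORITY_COUNT,
		.credit_limit	= 64,	/* ring capacity, see Flow Control */
		.timeout	= HZ,
		.name		= "foo",
		.dev		= dev,
	};
	struct drm_gpu_scheduler *sched = &q->sched;
	int ret;

	ret = drm_sched_init(sched, &args);
	if (ret)
		return ret;

	ret = drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
				    &sched, 1, NULL);
	if (ret)
		drm_sched_fini(sched);

	return ret;
}

Not something for this patch necessarily; just wondering whether examples
belong in the DOC sections at all.
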
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object (&struct drm_sched_job) contains submission dependencies
> + * in the form of &struct dma_fence objects. Drivers can also implement an
> + * optional prepare_job callback which returns additional dependencies as
> + * dma_fence objects. It's important to note that this callback can't allocate
> + * memory or grab locks under which memory is allocated.
> + *
> + * Drivers should use this as the base class for an object which contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object needs to last at least from submitting it to
> + * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
> + * &struct drm_sched_backend_ops.free_job and, thereby, has indicated that it
> + * does not need the job anymore. Drivers can of course keep their job object
> + * alive for longer than that, but that's outside of the scope of the scheduler
> + * component.
> + *
> + * Job initialization is split into two stages:
> + *   1. drm_sched_job_init() which serves for basic preparation of a job.
> + *      Drivers don't have to be mindful of this function's consequences and
> + *      its effects can be reverted through drm_sched_job_cleanup().
> + *   2. drm_sched_job_arm() which irrevocably arms a job for execution. This
> + *      initializes the job's fences, and the job has to be submitted with
> + *      drm_sched_entity_push_job(). Once drm_sched_job_arm() has been called,
> + *      the job structure has to be valid until the scheduler has invoked
> + *      drm_sched_backend_ops.free_job().
> + *
> + * It's important to note that after arming a job, drivers must follow the
> + * dma_fence rules and can't easily allocate memory or take locks under which
> + * memory is allocated.
> + */
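
Same thought for the two-stage init, which is probably the part that trips
up new drivers most. A rough sketch of a submission path (foo_* names
hypothetical; note that the exact drm_sched_job_init() signature has
changed across releases — this assumes the credit-based four-argument
variant with one credit per job):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/* Hypothetical driver job, using &struct drm_sched_job as base class. */
struct foo_job {
	struct drm_sched_job base;
	/* ... everything run_job() will need, allocated before arming ... */
};

static int foo_submit(struct foo_queue *q, struct foo_job *fjob,
		      struct dma_fence *in_fence)
{
	int ret;

	/* Stage 1: basic preparation; revertible via drm_sched_job_cleanup(). */
	ret = drm_sched_job_init(&fjob->base, &q->entity, 1, NULL);
	if (ret)
		return ret;

	/* The scheduler consumes one fence reference as a dependency. */
	dma_fence_get(in_fence);
	ret = drm_sched_job_add_dependency(&fjob->base, in_fence);
	if (ret)
		goto err_cleanup;

	/*
	 * Stage 2: point of no return. From here on the dma_fence rules
	 * apply: no allocations, no locks under which memory is allocated.
	 */
	drm_sched_job_arm(&fjob->base);
	drm_sched_entity_push_job(&fjob->base);

	return 0;

err_cleanup:
	drm_sched_job_cleanup(&fjob->base);
	return ret;
}
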
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object (&struct drm_sched_entity) is a container for jobs which
> + * should execute sequentially. Drivers should create an entity for each
> + * individual context they maintain for command submissions which can run in
> + * parallel.
> + *
> + * The lifetime of the entity *should not* exceed the lifetime of the
> + * userspace process it was created for, and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush()
> + * callback. It is possible that an entity object is not alive anymore
> + * while jobs previously fetched from it are still running on the hardware.
> + *
> + * This is done because all results of a command submission should become
> + * visible externally even after a process exits. This is normal POSIX
> + * behavior for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with SIGKILL, the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + *
> + * All entities associated with a scheduler have to be torn down before that
> + * scheduler.
> + */
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a dma_fence provided by the driver through
> + * &struct drm_sched_backend_ops.run_job. The driver signals this fence once
> + * the hardware has completed the associated job.
> + *
> + * Drivers need to make sure that the normal dma_fence semantics are followed
> + * for this object. It's important to note that the memory for this object can
> + * *not* be allocated in &struct drm_sched_backend_ops.run_job since that would
> + * violate the requirements for the dma_fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal
> + * within a configurable amount of time.
> + *
> + * The lifetime of this object follows dma_fence refcounting rules. The
> + * scheduler takes ownership of the reference returned by the driver and
> + * drops it when it's not needed any more.
> + *
> + * See &struct drm_sched_backend_ops.run_job for precise refcounting rules.
> + */
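
And the hardware fence flow could be sketched roughly like this —
foo_ring_emit() and the foo_job layout are invented; the two points that
matter are that the fence already exists before arming and that run_job()
hands a reference over to the scheduler:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/interrupt.h>

struct foo_job {
	struct drm_sched_job base;
	struct dma_fence *hw_fence;	/* allocated at job creation time */
};

static struct dma_fence *foo_run_job(struct drm_sched_job *sched_job)
{
	struct foo_job *fjob = container_of(sched_job, struct foo_job, base);

	/* dma_fence rules apply: no allocations, the fence already exists. */
	foo_ring_emit(fjob);

	/*
	 * The scheduler takes ownership of this reference and drops it
	 * once it no longer needs the fence.
	 */
	return dma_fence_get(fjob->hw_fence);
}

/* Completion interrupt: the driver signals the hardware fence. */
static irqreturn_t foo_done_irq(int irq, void *data)
{
	struct foo_job *fjob = data;

	dma_fence_signal(fjob->hw_fence);
	return IRQ_HANDLED;
}
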
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
> + * time from pushing the job into the scheduler until the hardware has finished
> + * processing it. It is managed by the scheduler. The implementation provides
> + * dma_fence interfaces for signaling both the scheduling of a command
> + * submission and the finishing of its processing.
> + *
> + * The lifetime of this object also follows normal dma_fence refcounting rules.
> + */

The relic I'm most unsure about is this docu for the scheduler fence. I
know that some drivers access the s_fence, but I strongly suspect that
this is a) unnecessary and b) dangerous.

But the original draft from Christian hinted at that. So, @Christian,
this would be an opportunity to discuss the matter. Otherwise I'd drop
this docu section in v2. What users don't know, they cannot misuse.

> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself (&struct drm_gpu_scheduler) does the actual
> + * scheduling: it picks the next entity to run a job from and pushes that job
> + * onto the hardware. Both FIFO and RR selection algorithms are supported, with
> + * FIFO being the default and the recommended one.
> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler, the driver must ensure that all hardware
> + * processing involving this scheduler object has finished by calling, for
> + * example, disable_irq(). It is *not* sufficient to wait for the hardware
> + * fence here since this doesn't guarantee that all callback processing has
> + * finished.
> + *
> + * The run queue object (&struct drm_sched_rq) is a container for entities of a
> + * certain priority level. This object is internally managed by the scheduler
> + * and drivers must not touch it directly. The lifetime of a run queue is bound
> + * to the scheduler's lifetime.
> + *
> + * All entities associated with a scheduler must be torn down before it.
> + * Drivers should implement &struct drm_sched_backend_ops.cancel_job to prevent
> + * pending jobs (those that were pulled from an entity into the scheduler, but
> + * have not been completed by the hardware yet) from leaking.
>   */
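
The teardown ordering spelled out here might deserve a tiny sketch as
well — again reusing the imaginary foo_queue from above:

#include <drm/gpu_scheduler.h>

static void foo_queue_fini(struct foo_queue *q)
{
	/* Entities strictly first: flushes remaining jobs and detaches. */
	drm_sched_entity_destroy(&q->entity);

	/*
	 * Only now tear down the scheduler. With a cancel_job callback
	 * implemented, jobs still on the pending list get cancelled
	 * instead of leaking.
	 */
	drm_sched_fini(&q->sched);
}
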
>  
>  /**
>   * DOC: Flow Control
>   *
>   * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> - * in which the jobs fetched from scheduler entities are executed.
> + * at which jobs fetched from scheduler entities are executed.
>   *
> - * In this context the &drm_gpu_scheduler keeps track of a driver specified
> - * credit limit representing the capacity of this scheduler and a credit count;
> - * every &drm_sched_job carries a driver specified number of credits.
> + * In this context the &struct drm_gpu_scheduler keeps track of a driver-
> + * specified credit limit representing the capacity of this scheduler and a
> + * credit count; every &struct drm_sched_job carries a driver-specified number
> + * of credits.
>   *
> - * Once a job is executed (but not yet finished), the job's credits contribute
> - * to the scheduler's credit count until the job is finished. If by executing
> - * one more job the scheduler's credit count would exceed the scheduler's
> - * credit limit, the job won't be executed. Instead, the scheduler will wait
> - * until the credit count has decreased enough to not overflow its credit limit.
> - * This implies waiting for previously executed jobs.
> + * Once a job is being executed, the job's credits contribute to the
> + * scheduler's credit count until the job is finished. If by executing one more
> + * job the scheduler's credit count would exceed the scheduler's credit limit,
> + * the job won't be executed. Instead, the scheduler will wait until the credit
> + * count has decreased enough to not overflow its credit limit. This implies
> + * waiting for previously executed jobs.
>   */
>  
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors are signaled by using dma_fence_set_error() on the hardware fence
> + * object before signaling it with dma_fence_signal(). Errors are then bubbled
> + * up from the hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function, which can be used to cancel queued
> + * submissions in &struct drm_sched_backend_ops.run_job as well as to prevent
> + * pushing further ones into the entity in the driver's submission function.
> + *
> + * When the hardware fence doesn't signal within a configurable amount of time,
> + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
> + * then follow the procedure described in that callback's documentation.
> + *
> + * (TODO: The timeout handler should probably switch to using the hardware
> + * fence as parameter instead of the job. Otherwise the handling will always
> + * race between timing out and signaling the fence).

This TODO can probably be removed, too. The recently merged
DRM_GPU_SCHED_STAT_NO_HANG has solved this issue.

P.

> + *
> + * The scheduler also used to provide functionality for re-submitting jobs
> + * and, thereby, replacing the hardware fence during reset handling. This
> + * functionality is now deprecated. It has proven to be fundamentally racy
> + * and not compatible with dma_fence rules and shouldn't be used in new code.
> + *
> + * Additionally, there is the function drm_sched_increase_karma() which tries
> + * to find the entity which submitted a job and increases its 'karma' atomic
> + * variable to prevent resubmitting jobs from this entity. This has quite some
> + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
> + * function is discouraged.
> + *
> + * Drivers can still recreate the GPU state in case it should be lost during
> + * timeout handling *if* they can guarantee that forward progress will be made
> + * and this doesn't cause another timeout. But this is strongly hardware
> + * specific and out of the scope of the general GPU scheduler.
> + */
>  #include
>  #include
>  #include
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 323a505e6e6a..0f0687b7ae9c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
>  	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>  
>  	/**
> -	 * @timedout_job: Called when a job has taken too long to execute,
> -	 * to trigger GPU recovery.
> +	 * @timedout_job: Called when a hardware fence didn't signal within a
> +	 * configurable amount of time. Triggers GPU recovery.
>  	 *
>  	 * @sched_job: The job that has timed out
>  	 *
> @@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
>  	 * that timeout handlers are executed sequentially.
>  	 *
>  	 * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> -	 *
>  	 */
>  	enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
>  
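
One last illustration idea, for the error/timeout section above — how the
error plumbing looks from the driver side (foo_* placeholders as before):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/errno.h>

/*
 * Device side: set the error *before* signaling the hardware fence so
 * that it bubbles up to the scheduler fence.
 */
static void foo_job_fault(struct foo_job *fjob)
{
	dma_fence_set_error(fjob->hw_fence, -EIO);
	dma_fence_signal(fjob->hw_fence);
}

/* Submission side: refuse new work on a context that already failed. */
static int foo_submit_check(struct foo_queue *q)
{
	/* Error of the entity's last scheduled submission, 0 if none. */
	return drm_sched_entity_error(&q->entity);
}

Anyway — none of these sketches need to go into this patch; mainly
wondering whether we want examples in the DOC sections at all.

P.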