From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B4A83C48260 for ; Tue, 13 Feb 2024 17:37:52 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id EA13F10E0A4; Tue, 13 Feb 2024 17:37:51 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.b="Nr6zhF+h"; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by gabe.freedesktop.org (Postfix) with ESMTPS id F1D4810E08F for ; Tue, 13 Feb 2024 17:37:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1707845869; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=I6CePIDfsDcVvbRAJ1gcQ0gDJK1NiHScRn5OrLlnFnw=; b=Nr6zhF+h4U02/sEOzXNK5Qw8ewOukCtUWO4iXLqfiMqJaAUeqQ0YNmADjmFQrxgW21BhWr aciUlTlFpJwHb16R3eZs3XdLvgiJiFu+tzN8ZpzWwBf4MDgXILOMZboj9lEJLm+aMF3ntZ gGS8gz/22md6AAiH26S43r2WpBQ2lag= Received: from mail-ej1-f72.google.com (mail-ej1-f72.google.com [209.85.218.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-695-8Vz7fO5FPcGZEYetf7aqaA-1; Tue, 13 Feb 2024 12:37:47 -0500 X-MC-Unique: 8Vz7fO5FPcGZEYetf7aqaA-1 Received: by mail-ej1-f72.google.com with SMTP id a640c23a62f3a-a2c4e9cb449so311073366b.1 for ; Tue, 13 Feb 2024 09:37:47 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1707845866; x=1708450666; h=content-transfer-encoding:in-reply-to:organization:references:cc:to :from:content-language:subject:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=I6CePIDfsDcVvbRAJ1gcQ0gDJK1NiHScRn5OrLlnFnw=; b=H33NtOjADoqb2Eh+AXiXAmjcSSayNuWO2w52fl1/+73C+wgG1p3ePbvoJcs7WdiP/r QgkLmro1lZVIydFb+UhXUo552nUSkchk3AXafQ9T9XrCMxztCE8pk7LKBvkqOsmVCedJ 2B6oWMZp5IxjS1uAKBRCqdTFVVbPWIShOxtkdeO3XIW1xUW8s01orSG7MlPMIYgKGX9S znSa6tZgKslfxK8fQFXZ8rf15pgn57E+eD4vlq0CLr5y/GKx5SjUUKqKcXEzIx9Erfak dPVKWmzNc6mLRzKV2AnwTcXSGqCvGf/gUmz7TeC6/ae1H66lm2fy91deS8GoJmSAehi0 tOOw== X-Forwarded-Encrypted: i=1; AJvYcCV7asaq0VeXeSVdbrMkhqUaZSEy81jcbUs3vHx9pQf2I1hkmbZYQ11/eBXJuPORYdK5nM4J1hWIqDv5Zh3qP7g8wQhWn3xXHovi0ENdq8Hv X-Gm-Message-State: AOJu0YwF66/ybUbAO9gAyZefqjA0n1ApgEEHfY5gOW92579SslPkxn1d N3zSs/n4Pv/A70TRitqN2BtFyMkJ6Tq3NtXBLmj7U0Tm0A/JqZ9k9+fvOEWAabexk1SKq7wYeA9 XfpAfo7LDdKXMcHo6CgzJOJP2BnhTXWoSMSxuBEMakxhu8nBUoFAOhCi/Giw1LaVaUQ== X-Received: by 2002:aa7:db4f:0:b0:55f:4464:d044 with SMTP id n15-20020aa7db4f000000b0055f4464d044mr305571edt.6.1707845866328; Tue, 13 Feb 2024 09:37:46 -0800 (PST) X-Google-Smtp-Source: AGHT+IGtvJRGogEF8rUGzRcOwIN19jHR5qXH0OeLCVPqM0CDYoGGWj+yUFIRdlxf9ZYW0VSdWWKqKw== X-Received: by 2002:aa7:db4f:0:b0:55f:4464:d044 with SMTP id n15-20020aa7db4f000000b0055f4464d044mr305549edt.6.1707845865948; Tue, 13 Feb 2024 09:37:45 -0800 (PST) X-Forwarded-Encrypted: i=1; AJvYcCWVEMXOsIBtmPxaox1ubEb95Qj5cm042MaOjEgt50MQiPGG3+rbtrRhdj7jC0d7UQ39xq7TZCpHTyU45JxsztRb83seorjhOKFGxNIOCpGObMSkbgeYjkp6alnw+NdMd8aNwD3TPbzAXpfpGVd9DMNuyWW9Ihh329+++EjZ0alPxNAxQP7fFip0ltHwjBYdZ4MhQRZVWsnmP2xFpdNvJuwKc2xr3eGz2Nzci0O6tn60BUKWIKybI1a1Czbjl+Cqmp/AtT3iVWgQnbiPYhd/NGdiJnWZCutCub+MEnSKMd7C2s1k Received: from ?IPV6:2a02:810d:4b3f:ee94:abf:b8ff:feee:998b? ([2a02:810d:4b3f:ee94:abf:b8ff:feee:998b]) by smtp.gmail.com with ESMTPSA id c19-20020aa7c993000000b0056211a1635esm242219edt.47.2024.02.13.09.37.44 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 13 Feb 2024 09:37:45 -0800 (PST) Message-ID: Date: Tue, 13 Feb 2024 18:37:44 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 1/2] drm/scheduler: improve GPU scheduler documentation v2 From: Danilo Krummrich To: =?UTF-8?Q?Christian_K=C3=B6nig?= Cc: airlied@gmail.com, ltuikov89@gmail.com, matthew.brost@intel.com, boris.brezillon@collabora.com, daniel@ffwll.ch, alexander.deucher@amd.com, dri-devel@lists.freedesktop.org, =?UTF-8?Q?Christian_K=C3=B6nig?= References: <20231116141547.206695-1-christian.koenig@amd.com> Organization: RedHat In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Hi Christian, What's the status of this effort? Was there ever a follow-up? - Danilo On 11/16/23 23:23, Danilo Krummrich wrote: > Hi Christian, > > Thanks for sending an update of this patch! > > On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote: >> Start to improve the scheduler document. Especially document the >> lifetime of each of the objects as well as the restrictions around >> DMA-fence handling and userspace compatibility. >> >> v2: Some improvements suggested by Danilo, add section about error >> handling. >> >> Signed-off-by: Christian König >> --- >> Documentation/gpu/drm-mm.rst | 36 +++++ >> drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++---- >> 2 files changed, 188 insertions(+), 22 deletions(-) >> >> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst >> index acc5901ac840..112463fa9f3a 100644 >> --- a/Documentation/gpu/drm-mm.rst >> +++ b/Documentation/gpu/drm-mm.rst >> @@ -552,12 +552,48 @@ Overview >> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> :doc: Overview >> >> +Job Object >> +---------- >> + >> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> + :doc: Job Object >> + >> +Entity Object >> +------------- >> + >> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> + :doc: Entity Object >> + >> +Hardware Fence Object >> +--------------------- >> + >> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> + :doc: Hardware Fence Object >> + >> +Scheduler Fence Object >> +---------------------- >> + >> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> + :doc: Scheduler Fence Object >> + >> +Scheduler and Run Queue Objects >> +------------------------------- >> + >> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> + :doc: Scheduler and Run Queue Objects >> + >> Flow Control >> ------------ >> >> .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> :doc: Flow Control >> >> +Error and Timeout handling >> +-------------------------- >> + >> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c >> + :doc: Error and Timeout handling >> + >> Scheduler Function References >> ----------------------------- >> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> index 044a8c4875ba..026123497b0e 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -24,28 +24,122 @@ >> /** >> * DOC: Overview >> * >> - * The GPU scheduler provides entities which allow userspace to push jobs >> - * into software queues which are then scheduled on a hardware run queue. >> - * The software queues have a priority among them. The scheduler selects the entities >> - * from the run queue using a FIFO. The scheduler provides dependency handling >> - * features among jobs. The driver is supposed to provide callback functions for >> - * backend operations to the scheduler like submitting a job to hardware run queue, >> - * returning the dependencies of a job etc. >> - * >> - * The organisation of the scheduler is the following: >> - * >> - * 1. Each hw run queue has one scheduler >> - * 2. Each scheduler has multiple run queues with different priorities >> - * (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL) >> - * 3. Each scheduler run queue has a queue of entities to schedule >> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on >> - * the hardware. >> - * >> - * The jobs in a entity are always scheduled in the order that they were pushed. >> - * >> - * Note that once a job was taken from the entities queue and pushed to the >> - * hardware, i.e. the pending queue, the entity must not be referenced anymore >> - * through the jobs entity pointer. >> + * The GPU scheduler implements some logic to decide which command submission >> + * to push next to the hardware. Another major use case of the GPU scheduler >> + * is to enforce correct driver behavior around those command submissions. >> + * Because of this it's also used by drivers which don't need the actual >> + * scheduling functionality. > > This reads a bit like we're already right in the middle of the documentation. > I'd propose to start with something like "The DRM GPU scheduler serves as a > common component intended to help drivers to manage command submissions." And > then mention the different solutions the scheduler provides, e.g. ordering of > command submissions, dma-fence handling in the context of command submissions, > etc. > > Also, I think it would be good to give a rough overview of the topology of the > scheduler's components. And since you already mention that the component is > "also used by drivers which don't need the actual scheduling functionality", I'd > also mention the special case of a single entity and a single run-queue per > scheduler. > >> + * >> + * All callbacks the driver needs to implement are restricted by DMA-fence >> + * signaling rules to guarantee deadlock free forward progress. This especially >> + * means that for normal operation no memory can be allocated in a callback. >> + * All memory which is needed for pushing the job to the hardware must be >> + * allocated before arming a job. It also means that no locks can be taken >> + * under which memory might be allocated as well. > > I think that's good. Even though, with the recently merged workqueue patches, > drivers can actually create a setup where the free_job callback isn't part of > the fence signalling critical path anymore. But I agree with Sima that this is > probably too error prone to give drivers ideas. So, this paragraph is probably > good as it is. :-) > >> + * >> + * Memory which is optional to allocate, for example for device core dumping or >> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error >> + * handling taking if that allocation fails. GFP_ATOMIC should only be used if >> + * absolutely necessary since dipping into the special atomic reserves is >> + * usually not justified for a GPU driver. >> + */ >> + >> +/** >> + * DOC: Job Object >> + * >> + * The base job object contains submission dependencies in the form of DMA-fence >> + * objects. Drivers can also implement an optional prepare_job callback which >> + * returns additional dependencies as DMA-fence objects. It's important to note >> + * that this callback can't allocate memory or grab locks under which memory is >> + * allocated. >> + * >> + * Drivers should use this as base class for an object which contains the >> + * necessary state to push the command submission to the hardware. >> + * >> + * The lifetime of the job object should at least be from pushing it into the > > "should at least last from" > >> + * scheduler until the scheduler notes through the free callback that a job > > What about "until the free_job callback has been called and hence the scheduler > does not require the job object anymore."? > >> + * isn't needed any more. Drivers can of course keep their job object alive >> + * longer than that, but that's outside of the scope of the scheduler >> + * component. Job initialization is split into two parts, drm_sched_job_init() >> + * and drm_sched_job_arm(). It's important to note that after arming a job > > I suggest to add a brief comment on why job initialization is split up. > >> + * drivers must follow the DMA-fence rules and can't easily allocate memory >> + * or takes locks under which memory is allocated. >> + */ >> + >> +/** >> + * DOC: Entity Object >> + * >> + * The entity object which is a container for jobs which should execute >> + * sequentially. Drivers should create an entity for each individual context >> + * they maintain for command submissions which can run in parallel. >> + * >> + * The lifetime of the entity should *not* exceed the lifetime of the >> + * userspace process it was created for and drivers should call the >> + * drm_sched_entity_flush() function from their file_operations.flush >> + * callback. So it's possible that an entity object is not alive any > > "Note that it is possible..." > >> + * more while jobs from it are still running on the hardware. > > "while jobs previously fetched from this entity are still..." > >> + * >> + * Background is that for compatibility reasons with existing >> + * userspace all results of a command submission should become visible >> + * externally even after after a process exits. This is normal POSIX behavior >> + * for I/O operations. >> + * >> + * The problem with this approach is that GPU submissions contain executable >> + * shaders enabling processes to evade their termination by offloading work to >> + * the GPU. So when a process is terminated with a SIGKILL the entity object >> + * makes sure that jobs are freed without running them while still maintaining >> + * correct sequential order for signaling fences. >> + */ >> + >> +/** >> + * DOC: Hardware Fence Object >> + * >> + * The hardware fence object is a DMA-fence provided by the driver as result of >> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics >> + * are followed for this object. It's important to note that the memory for >> + * this object can *not* be allocated in the run_job callback since that would >> + * violate the requirements for the DMA-fence implementation. The scheduler >> + * maintains a timeout handler which triggers if this fence doesn't signal in >> + * a configurable time frame. >> + * >> + * The lifetime of this object follows DMA-fence ref-counting rules, the >> + * scheduler takes ownership of the reference returned by the driver and drops >> + * it when it's not needed any more. >> + */ >> + >> +/** >> + * DOC: Scheduler Fence Object >> + * >> + * The scheduler fence object which encapsulates the whole time from pushing >> + * the job into the scheduler until the hardware has finished processing it. >> + * This is internally managed by the scheduler, but drivers can grab additional >> + * reference to it after arming a job. The implementation provides DMA-fence >> + * interfaces for signaling both scheduling of a command submission as well as >> + * finishing of processing. >> + * >> + * The lifetime of this object also follows normal DMA-fence ref-counting >> + * rules. The finished fence is the one normally exposed outside of the >> + * scheduler, but the driver can grab references to both the scheduled as well >> + * as the finished fence when needed for pipe-lining optimizations. >> + */ >> + >> +/** >> + * DOC: Scheduler and Run Queue Objects >> + * >> + * The scheduler object itself does the actual work of selecting a job and >> + * pushing it to the hardware. Both FIFO and RR selection algorithm are >> + * supported, but FIFO is preferred for many use cases. > > I suggest to name the use cases FIFO scheduling is preferred for and why. > > If, instead, it's just a general recommendation, I also suggest to explain why. > >> + * >> + * The lifetime of the scheduler is managed by the driver using it. Before >> + * destroying the scheduler the driver must ensure that all hardware processing >> + * involving this scheduler object has finished by calling for example >> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here >> + * since this doesn't guarantee that all callback processing has finished. > > This is the part I'm most concerned about, since I feel like we leave drivers > "up in the air" entirely. Hence, I think here we need to be more verbose and > detailed about the options drivers have to ensure that. > > For instance, let's assume we have the single-entity-per-scheduler topology > because the driver only uses the GPU scheduler to feed a firmware scheduler with > dynamically allocated ring buffers. > > In this case the entity, scheduler and ring buffer are bound to the lifetime of > a userspace process. > > What do we expect the driver to do if the userspace process is killed? As you > mentioned, only waiting for the ring to be idle (which implies all HW fences > are signalled) is not enough. This doesn't guarantee all the free_job() > callbacks have been called yet and hence stopping the scheduler before the > pending_list is actually empty would leak the memory of the jobs on the > pending_list waiting to be freed. > > I already brought this up when we were discussing Matt's Xe inspired scheduler > patch series and it seems there was no interest to provide drivers with some > common mechanism that gurantees that the pending_list is empty. Hence, I really > think we should at least give recommendations how drivers should deal with that. > >> + * >> + * The run queue object is a container of entities for a certain priority >> + * level. This object is internally managed by the scheduler and drivers >> + * shouldn't touch them directly. The lifetime of run queues are bound to the >> + * schedulers lifetime. > > I think we should also mention that we support a variable number of run-queues > up to DRM_SCHED_PRIORITY_COUNT. Also there is this weird restriction on which > priorities a driver can use when choosing less than DRM_SCHED_PRIORITY_COUNT > run-queues. > > For instance, initializing the scheduler with a single run-queue, requires the > corresponding entities to pick DRM_SCHED_PRIORITY_MIN, otherwise we'll just > fault since the priority is also used as an array index into sched->sched_rq[]. > >> */ >> >> /** >> @@ -72,6 +166,42 @@ >> * limit. >> */ >> >> +/** >> + * DOC: Error and Timeout handling >> + * >> + * Errors schould be signaled by using dma_fence_set_error() on the hardware >> + * fence object before signaling it. Errors are then bubbled up from the >> + * hardware fence to the scheduler fence. >> + * >> + * The entity allows querying errors on the last run submission using the >> + * drm_sched_entity_error() function which can be used to cancel queued >> + * submissions in the run_job callback as well as preventing pushing further >> + * ones into the entity in the drivers submission function. >> + * >> + * When the hardware fence fails to signal in a configurable amount of time the >> + * timedout_job callback is issued. The driver should then follow the procedure >> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO: >> + * The timeout handler should probably switch to using the hardware fence as >> + * parameter instead of the job. Otherwise the handling will always race >> + * between timing out and signaling the fence). >> + * >> + * The scheduler also used to provided functionality for re-submitting jobs >> + * with replacing the hardware fence during reset handling. This functionality >> + * is now marked as deprecated. This has proven to be fundamentally racy and >> + * not compatible with DMA-fence rules and shouldn't be used in any new code. >> + * >> + * Additional there is the drm_sched_increase_karma() function which tries to > > "Additionally" > >> + * find the entity which submitted a job and increases it's 'karma' >> + * atomic variable to prevent re-submitting jobs from this entity. This has >> + * quite some overhead and re-submitting jobs is now marked as deprecated. So >> + * using this function is rather discouraged. >> + * >> + * Drivers can still re-create the GPU state should it be lost during timeout >> + * handling when they can guarantee that forward progress is made and this >> + * doesn't cause another timeout. But this is strongly hardware specific and >> + * out of the scope of the general GPU scheduler. >> + */ >> + >> #include >> #include >> #include >> -- >> 2.34.1 >>