From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EB89AC432BE for ; Tue, 31 Aug 2021 13:08:46 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 7EE0061051 for ; Tue, 31 Aug 2021 13:08:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 7EE0061051 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=ffwll.ch Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 2409789EAC; Tue, 31 Aug 2021 13:08:37 +0000 (UTC) Received: from mail-wr1-x429.google.com (mail-wr1-x429.google.com [IPv6:2a00:1450:4864:20::429]) by gabe.freedesktop.org (Postfix) with ESMTPS id DA5FC89EA6 for ; Tue, 31 Aug 2021 13:08:35 +0000 (UTC) Received: by mail-wr1-x429.google.com with SMTP id u9so27571490wrg.8 for ; Tue, 31 Aug 2021 06:08:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to; bh=wXDmcDBTd2H3ARcd/cZIFyo7FbtO0F5pLy17lY5r46w=; b=dyRIntM+MLpdVii3NsbghIQPginBP+TP3ERYG9rAaR1zz3MlwCM5H0qxn7iPVo05HJ hCq52tsAU5LvKP1zfMYKVMS8tFUkViKIHsn2zmibC7yQCsIeYl9XqpzGWSplpNl1LuzO ibE720/jboj7RTfuRBFVzygwcrT2L25kZr3LU= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=wXDmcDBTd2H3ARcd/cZIFyo7FbtO0F5pLy17lY5r46w=; b=fBW65au23YyKKyYbomKDMAHNYwhOOkXd1GbIFqndJUoUmqO4bQo9+oJa995tZ+eBbw 60NRaU0Uz4IngqYFwhrxddHYGyH24nPWRvkxixeSG1FaHDJskeSsCp8cyLEsM1g/YUXp JpTujhUDcqI49yUjwdJiOdfvpF9ijKtnMBIwW5YfPDTBslwrQrUg4cs8+bfXYQohiXxY wnLfdnFzLvqR+Ky7RbLsPhdHugkyvw9jxP0w36tG3JZF6PJBfr1iTODehKiE89unxPVw Tbp2U9hlqERs9DA5tbeT1g7pZr74ve6302nVzww2EA1sOto2S5NyTArkBPI0LD/WRIBF xFcg== X-Gm-Message-State: AOAM532bXyQmfzkZgmqhcFdCSd/T7RCUDNQekaTtLqI4RMlZsuD0NSb6 PwqyD3fZk7qWb20qHXDSXodkHA== X-Google-Smtp-Source: ABdhPJw8QFyFJzKObDdmwI2AO7gLMA38M2Tz1nNVPK5DOvxNOgTz2MF4AhkpMdxtz+VBrKXaMyoT5g== X-Received: by 2002:adf:9063:: with SMTP id h90mr31927611wrh.121.1630415314371; Tue, 31 Aug 2021 06:08:34 -0700 (PDT) Received: from phenom.ffwll.local ([2a02:168:57f4:0:efd0:b9e5:5ae6:c2fa]) by smtp.gmail.com with ESMTPSA id c14sm2402158wme.6.2021.08.31.06.08.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 31 Aug 2021 06:08:33 -0700 (PDT) Date: Tue, 31 Aug 2021 15:08:31 +0200 From: Daniel Vetter To: Christian =?iso-8859-1?Q?K=F6nig?= Cc: Andrey Grodzovsky , Christian =?iso-8859-1?Q?K=F6nig?= , "Liu, Monk" , "amd-gfx@lists.freedesktop.org" , "dri-devel@lists.freedesktop.org" Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3) Message-ID: References: <1629953731-14629-1-git-send-email-Monk.Liu@amd.com> <20419179-ee90-45aa-f4b8-b6bcb20a9c52@amd.com> <257202a9-a670-5b89-f98d-01855b6e41cf@amd.com> <9bb9b448-ea93-e8f9-818e-c6d27d4a8264@amd.com> <8171ae25-72b6-7400-6969-61885ca5094b@amd.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <8171ae25-72b6-7400-6969-61885ca5094b@amd.com> X-Operating-System: Linux phenom 5.10.0-8-amd64 X-BeenThere: amd-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion list for AMD gfx List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: amd-gfx-bounces@lists.freedesktop.org Sender: "amd-gfx" On Fri, Aug 27, 2021 at 08:30:32PM +0200, Christian König wrote: > Yeah, that's what I meant with that the start of processing a job is a bit > swampy defined. > > Jobs overload, but we simply don't have another good indicator that a job > started except that the previous one completed. > > It's still better than starting the timer when pushing the job to the ring > buffer, because that is completely off. Not sure this matters that much really in practice, and in some cases we might want to actually just reset it all if the built up backlog is way too long. For anything that really runs way too long you want preempt-ctx/revoke fences anyway, not end-of-cs fences with tdr. -Daniel > > Christian. > > Am 27.08.21 um 20:22 schrieb Andrey Grodzovsky: > > As I mentioned to Monk before - what about cases such as in this test - https://gitlab.freedesktop.org/mesa/drm/-/commit/bc21168fa924d3fc4a000492e861f50a1a135b25 > > > > Here you don't have serialized sequence where when jobs finishes > > processing and second starts, they execute together  concurrently - for > > those cases seems > > to me restarting the timer for the second job from scratch will let it > > hang much longer then allowed by TO value. > > > > Andrey > > > > On 2021-08-27 10:29 a.m., Christian König wrote: > > > I don't think that makes sense. > > > > > > See we don't want to start the time when the job is inserted into > > > the ring buffer, but rather when it starts processing. > > > > > > Starting processing is a bit swampy defined, but just starting the > > > timer when the previous job completes should be fine enough. > > > > > > Christian. > > > > > > Am 27.08.21 um 15:57 schrieb Andrey Grodzovsky: > > > > The TS represents the point in time when the job was inserted > > > > into the pending list. > > > > I don't think it matters when it actually starts to be > > > > processed, what matters is when this job was inserted into > > > > pending list because right at that point you arm the TO timer > > > > (when no other is running already) > > > > and so when the previous job completes and you cancel and rearm > > > > again you can use that TS from the next job in pending list to > > > > calculate how much time has actually left for it to run before > > > > TDR must be initiated > > > > and not just give it again full TO value to run even if it has > > > > already been running for a while. > > > > > > > > Also, i am not sure also about the assumption that what we > > > > measure is processing by HW, what we measure is from the moment > > > > it was scheduled to ring to the moment the job completed (EOP > > > > event). At least that what our TDR timer measures and so it > > > > makes sense to set the TS at this point. > > > > > > > > Andrey > > > > > > > > On 2021-08-27 3:20 a.m., Liu, Monk wrote: > > > > > [AMD Official Use Only] > > > > > > > > > > what is that 'ts' representing for ? it looks to me the > > > > > jiffies that it get scheduled to the ring,  but a job > > > > > scheduled to the ring doesn't represent it's being processed > > > > > by hw. > > > > > > > > > > Thanks > > > > > > > > > > ------------------------------------------ > > > > > Monk Liu | Cloud-GPU Core team > > > > > ------------------------------------------ > > > > > > > > > > -----Original Message----- > > > > > From: Grodzovsky, Andrey > > > > > Sent: Friday, August 27, 2021 4:14 AM > > > > > To: Liu, Monk ; > > > > > amd-gfx@lists.freedesktop.org; Koenig, Christian > > > > > > > > > > Cc: dri-devel@lists.freedesktop.org > > > > > Subject: Re: [PATCH] drm/sched: fix the bug of time out > > > > > calculation(v3) > > > > > > > > > > Attached quick patch for per job TTL calculation to make > > > > > more precises next timer expiration. It's on top of the > > > > > patch in this thread. Let me know if this makes sense. > > > > > > > > > > Andrey > > > > > > > > > > On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote: > > > > > > On 2021-08-26 12:55 a.m., Monk Liu wrote: > > > > > > > issue: > > > > > > > in cleanup_job the cancle_delayed_work will cancel a TO timer even > > > > > > > the its corresponding job is still running. > > > > > > > > > > > > > > fix: > > > > > > > do not cancel the timer in cleanup_job, instead do the cancelling > > > > > > > only when the heading job is signaled, and if there is a "next" job > > > > > > > we start_timeout again. > > > > > > > > > > > > > > v2: > > > > > > > further cleanup the logic, and do the TDR timer cancelling if the > > > > > > > signaled job is the last one in its scheduler. > > > > > > > > > > > > > > v3: > > > > > > > change the issue description > > > > > > > remove the cancel_delayed_work in the begining of the cleanup_job > > > > > > > recover the implement of drm_sched_job_begin. > > > > > > > > > > > > > > TODO: > > > > > > > 1)introduce pause/resume scheduler in job_timeout to serial the > > > > > > > handling of scheduler and job_timeout. > > > > > > > 2)drop the bad job's del and insert in scheduler due to above > > > > > > > serialization (no race issue anymore with the serialization) > > > > > > > > > > > > > > Signed-off-by: Monk Liu > > > > > > > --- > > > > > > >    drivers/gpu/drm/scheduler/sched_main.c | 25 > > > > > > > ++++++++++--------------- > > > > > > >    1 file changed, 10 insertions(+), 15 deletions(-) > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > index a2a9536..ecf8140 100644 > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > @@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct > > > > > > > drm_gpu_scheduler *sched) > > > > > > >    { > > > > > > >        struct drm_sched_job *job, *next; > > > > > > >    -    /* > > > > > > > -     * Don't destroy jobs while the timeout worker is running OR > > > > > > > thread > > > > > > > -     * is being parked and hence assumed to not touch pending_list > > > > > > > -     */ > > > > > > > -    if ((sched->timeout != MAX_SCHEDULE_TIMEOUT && > > > > > > > -        !cancel_delayed_work(&sched->work_tdr)) || > > > > > > > -        kthread_should_park()) > > > > > > > +    if (kthread_should_park()) > > > > > > >            return NULL; > > > > > > > > > > > > I actually don't see why we need to keep the above, on the other side > > > > > > (in drm_sched_stop) we won't touch the pending list > > > > > > anyway until sched > > > > > > thread came to full stop (kthread_park). If you do see a reason why > > > > > > this needed then a comment should be here i think. > > > > > > > > > > > > Andrey > > > > > > > > > > > > > > > > > > > spin_lock(&sched->job_list_lock); > > > > > > > @@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct > > > > > > > drm_gpu_scheduler *sched) > > > > > > >        if (job && dma_fence_is_signaled(&job->s_fence->finished)) { > > > > > > >            /* remove job from pending_list */ > > > > > > >            list_del_init(&job->list); > > > > > > > + > > > > > > > +        /* cancel this job's TO timer */ > > > > > > > +        cancel_delayed_work(&sched->work_tdr); > > > > > > >            /* make the scheduled timestamp more accurate */ > > > > > > >            next = list_first_entry_or_null(&sched->pending_list, > > > > > > >                            typeof(*next), list); > > > > > > > -        if (next) > > > > > > > + > > > > > > > +        if (next) { > > > > > > >                next->s_fence->scheduled.timestamp = > > > > > > > job->s_fence->finished.timestamp; > > > > > > > - > > > > > > > +            /* start TO timer for next job */ > > > > > > > +            drm_sched_start_timeout(sched); > > > > > > > +        } > > > > > > >        } else { > > > > > > >            job = NULL; > > > > > > > -        /* queue timeout for next job */ > > > > > > > -        drm_sched_start_timeout(sched); > > > > > > >        } > > > > > > >          spin_unlock(&sched->job_list_lock); > > > > > > > @@ -791,11 +789,8 @@ static int drm_sched_main(void *param) > > > > > > >                          (entity = > > > > > > > drm_sched_select_entity(sched))) || > > > > > > >                         kthread_should_stop()); > > > > > > >    -        if (cleanup_job) { > > > > > > > +        if (cleanup_job) > > > > > > >                sched->ops->free_job(cleanup_job); > > > > > > > -            /* queue timeout for next job */ > > > > > > > -            drm_sched_start_timeout(sched); > > > > > > > -        } > > > > > > >              if (!entity) > > > > > > >                continue; > > > > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch