Date: Tue, 29 Jun 2021 13:18:58 +0200
From: Boris Brezillon
To: Christian König
Subject: Re: [PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
Message-ID: <20210629131858.1a598182@collabora.com>
In-Reply-To: <5b619624-ca5d-6b9a-0600-f122a4d68c58@amd.com>
References: <20210629073510.2764391-1-boris.brezillon@collabora.com> <20210629073510.2764391-3-boris.brezillon@collabora.com> <5b619624-ca5d-6b9a-0600-f122a4d68c58@amd.com>
Cc: Emma Anholt, Tomeu Vizoso, dri-devel@lists.freedesktop.org, Steven Price, Rob Herring, Alyssa Rosenzweig, Alex Deucher, Qiang Yu, Robin Murphy

Hi Christian,

On Tue, 29 Jun 2021 13:03:58 +0200
Christian König wrote:

> On 29.06.21 at 09:34, Boris Brezillon wrote:
> > Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
> > reset. This leads to extra complexity when we need to synchronize timeout
> > works with the reset work. One solution to address that is to have an
> > ordered workqueue at the driver level that will be used by the different
> > schedulers to queue their timeout work. Thanks to the serialization
> > provided by the ordered workqueue we are guaranteed that timeout
> > handlers are executed sequentially, and can thus easily reset the GPU
> > from the timeout handler without extra synchronization.
>
> Well, we had already tried this and it didn't work as expected.
>
> The major problem is that you not only want to serialize the queue, but
> rather have a single reset for all queues.
>
> Otherwise you schedule multiple resets for each hardware queue. E.g. for
> your 3 hardware queues you would reset the GPU 3 times if all of them
> time out at the same time (which is rather likely).
>
> Using a single delayed work item doesn't work either because you then
> only have one timeout.
>
> What could be done is to cancel all delayed work items from all stopped
> schedulers.

drm_sched_stop() does that already, and since we call drm_sched_stop()
on all queues in the timeout handler, we end up with only one global
reset happening even if several queues report a timeout at the same
time.

Regards,

Boris
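
To make the argument above concrete, here is a toy Python model (not kernel code) of three schedulers sharing one ordered workqueue. Every name in it (ToyScheduler, ToyGpu, report_timeout, stop, timeout_handler) is made up for illustration. stop() only stands in for the part of drm_sched_stop() that cancels a scheduler's pending timeout work, and the FIFO deque is an assumed simplification of an ordered workqueue that runs one item at a time.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for one per-queue scheduler."""

    def __init__(self, index):
        self.index = index
        self.timeout_queued = False


class ToyGpu:
    """Toy model: N schedulers, one ordered (FIFO, one-at-a-time) workqueue."""

    def __init__(self, nqueues=3):
        self.queues = [ToyScheduler(i) for i in range(nqueues)]
        self.ordered_wq = deque()
        self.reset_count = 0

    def report_timeout(self, sched):
        # Queue this scheduler's timeout work unless it is already pending.
        if not sched.timeout_queued:
            sched.timeout_queued = True
            self.ordered_wq.append(sched)

    def stop(self, sched):
        # Models only the "cancel pending timeout work" part of stopping
        # a scheduler.
        if sched.timeout_queued:
            sched.timeout_queued = False
            self.ordered_wq.remove(sched)

    def timeout_handler(self, sched):
        # Stop every queue first, which cancels the other pending timeouts,
        # then do the single global reset.
        for s in self.queues:
            self.stop(s)
        self.reset_count += 1

    def run(self):
        # Ordered workqueue: handlers run strictly one after another.
        while self.ordered_wq:
            sched = self.ordered_wq.popleft()
            sched.timeout_queued = False
            self.timeout_handler(sched)


# All three queues time out before any handler has run.
gpu = ToyGpu()
for s in gpu.queues:
    gpu.report_timeout(s)
gpu.run()
print(gpu.reset_count)
```

In this model the first handler to run cancels the other two pending timeout items, so only one reset is counted. Timeouts that arrive after a reset has completed still trigger their own reset, which is the behaviour you want.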