Re: [PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

From: Boris Brezillon <boris.brezillon@collabora.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: Emma Anholt <emma@anholt.net>,
	Tomeu Vizoso <tomeu.vizoso@collabora.com>,
	dri-devel@lists.freedesktop.org,
	Steven Price <steven.price@arm.com>,
	Rob Herring <robh+dt@kernel.org>,
	Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Qiang Yu <yuq825@gmail.com>, Robin Murphy <robin.murphy@arm.com>
Subject: Re: [PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
Date: Tue, 29 Jun 2021 13:18:58 +0200
Message-ID: <20210629131858.1a598182@collabora.com>
In-Reply-To: <5b619624-ca5d-6b9a-0600-f122a4d68c58@amd.com>

Hi Christian,

On Tue, 29 Jun 2021 13:03:58 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 29.06.21 um 09:34 schrieb Boris Brezillon:
> > Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
> > reset. This leads to extra complexity when we need to synchronize timeout
> > works with the reset work. One solution to address that is to have an
> > ordered workqueue at the driver level that will be used by the different
> > schedulers to queue their timeout work. Thanks to the serialization
> > provided by the ordered workqueue we are guaranteed that timeout
> > handlers are executed sequentially, and can thus easily reset the GPU
> > from the timeout handler without extra synchronization.  
> 
> Well, we already tried this and it didn't work the way it was expected to.
> 
> The major problem is that you want not only to serialize the queues, but
> also to have a single reset for all of them.
> 
> Otherwise you schedule multiple resets for each hardware queue. E.g. for 
> your 3 hardware queues you would reset the GPU 3 times if all of them 
> time out at the same time (which is rather likely).
> 
> Using a single delayed work item doesn't work either because you then 
> only have one timeout.
> 
> What could be done is to cancel all delayed work items from all stopped 
> schedulers.

drm_sched_stop() does that already, and since we call drm_sched_stop()
on all queues in the timeout handler, we end up with only one global
reset happening even if several queues report a timeout at the same
time.
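
To illustrate the argument, here is a minimal userspace sketch of that
behaviour. It is not kernel code: fake_sched, fake_gpu, fake_sched_stop(),
fake_timeout_handler() and simulate_timeouts() are all hypothetical names
made up for this example, and the ordered workqueue is modelled as a plain
loop that runs pending handlers one at a time. It also assumes every
cancellation succeeds before the next handler starts, which is a
simplification of the real timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_QUEUES 3

/* Hypothetical stand-in for one scheduler; not struct drm_gpu_scheduler. */
struct fake_sched {
	bool timeout_pending;	/* models a queued delayed timeout work item */
};

struct fake_gpu {
	struct fake_sched queues[NUM_QUEUES];
	unsigned int reset_count;
};

/* Models only the effect discussed here: pending timeout work is cancelled. */
static void fake_sched_stop(struct fake_sched *sched)
{
	sched->timeout_pending = false;
}

/* Timeout handler: stop every queue, then do one global reset. */
static void fake_timeout_handler(struct fake_gpu *gpu)
{
	size_t i;

	for (i = 0; i < NUM_QUEUES; i++)
		fake_sched_stop(&gpu->queues[i]);

	gpu->reset_count++;
}

/*
 * Models an ordered workqueue: handlers that are still pending run
 * sequentially, in queue order. Returns the number of global resets.
 */
unsigned int simulate_timeouts(const bool timed_out[NUM_QUEUES])
{
	struct fake_gpu gpu = { 0 };
	size_t i;

	for (i = 0; i < NUM_QUEUES; i++)
		gpu.queues[i].timeout_pending = timed_out[i];

	for (i = 0; i < NUM_QUEUES; i++) {
		if (!gpu.queues[i].timeout_pending)
			continue;	/* never queued, or cancelled earlier */

		gpu.queues[i].timeout_pending = false;	/* work starts running */
		fake_timeout_handler(&gpu);
	}

	/* Nothing may be left pending once the queue has drained. */
	for (i = 0; i < NUM_QUEUES; i++)
		assert(!gpu.queues[i].timeout_pending);

	return gpu.reset_count;
}
```

With all three queues marked as timed out, the first handler cancels the
other two pending items, so simulate_timeouts() reports a single reset.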

Regards,

Boris
