From: Boris Brezillon <boris.brezillon@collabora.com>
To: "Adrián Larumbe" <adrian.larumbe@collabora.com>
Cc: Steven Price <steven.price@arm.com>,
Liviu Dudau <liviu.dudau@arm.com>,
dri-devel@lists.freedesktop.org, kernel@collabora.com
Subject: Re: [PATCH v2 4/5] drm/panthor: Be robust against resume failures
Date: Fri, 29 Nov 2024 15:44:48 +0100 [thread overview]
Message-ID: <20241129154448.21fcf641@collabora.com> (raw)
In-Reply-To: <yrcvsbykbiwcmar73zk2ffgfhqzgjppnsp2y4w3kascb3wvo76@dtaciivtaanx>
On Fri, 29 Nov 2024 13:59:13 +0000
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:
> Reviewed-by: Adrian Larumbe <adrian.larumbe@collabora.com>
>
> On 28.11.2024 12:02, Boris Brezillon wrote:
> > When the runtime PM resume callback returns an error, it puts the device
> > in a state where it can't be resumed anymore. Make sure we can recover
> > from such transient failures by calling pm_runtime_set_suspended()
> > explicitly after a pm_runtime_resume_and_get() failure.
> >
> > v2:
> > - Add a comment explaining potential races in
> > panthor_device_resume_and_get()
> >
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> > drivers/gpu/drm/panthor/panthor_device.c | 1 +
> > drivers/gpu/drm/panthor/panthor_device.h | 26 ++++++++++++++++++++++++
> > drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
> > drivers/gpu/drm/panthor/panthor_sched.c | 4 ++--
> > 4 files changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > index e3b22107b268..0362101ea896 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > @@ -500,6 +500,7 @@ int panthor_device_resume(struct device *dev)
> >
> > err_set_suspended:
> > atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > + atomic_set(&ptdev->pm.recovery_needed, 1);
>
> I think it might be the case that if PM resume fails, then ptdev->base.dev->power.runtime_error
> would be set to '1' and then you might use this state variable in panthor_device_resume_and_get()
> rather than encoding it explicity into the panthor driver struct?
So, there are two reasons for not using
ptdev->base.dev->power.runtime_error directly here:
1. I hate accessing subsystem's internal objects directly, and if
there's no helper to check if a runtime error is pending, I suspect
there's a good reason.
2. We need an atomic variable to ensure only one thread clears the
runtime_error (see the comment in panthor_device_resume_and_get()).
>
> > return ret;
> > }
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index 0e68f5a70d20..b6c4f25a5d6e 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -9,6 +9,7 @@
> > #include <linux/atomic.h>
> > #include <linux/io-pgtable.h>
> > #include <linux/regulator/consumer.h>
> > +#include <linux/pm_runtime.h>
> > #include <linux/sched.h>
> > #include <linux/spinlock.h>
> >
> > @@ -180,6 +181,9 @@ struct panthor_device {
> > * is suspended.
> > */
> > struct page *dummy_latest_flush;
> > +
> > + /** @recovery_needed: True when a resume attempt failed. */
> > + atomic_t recovery_needed;
> > } pm;
> >
> > /** @profile_mask: User-set profiling flags for job accounting. */
> > @@ -243,6 +247,28 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
> > int panthor_device_resume(struct device *dev);
> > int panthor_device_suspend(struct device *dev);
> >
> > +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> > +{
> > + int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +
> > + /* If the resume failed, we need to clear the runtime_error, which
> > + * can done by forcing the RPM state to suspended. If multiple
> > + * threads called panthor_device_resume_and_get(), we only want
> > + * one of them to update the state, hence the cmpxchg. Note that a
> > + * thread might enter panthor_device_resume_and_get() and call
> > + * pm_runtime_resume_and_get() after another thread had attempted
> > + * to resume and failed. This means we will end up with an error
> > + * without even attempting a resume ourselves. The only risk here
> > + * is to report an error when the second resume attempt might have
> > + * succeeded. Given resume errors are not expected, this is probably
> > + * something we can live with.
> > + */
> > + if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
> > + pm_runtime_set_suspended(ptdev->base.dev);
> > +
> > + return ret;
> > +}
> > +
> > enum drm_panthor_exception_type {
> > DRM_PANTHOR_EXCEPTION_OK = 0x00,
> > DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> > diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> > index 1498c97b4b85..b7a9adc918e3 100644
> > --- a/drivers/gpu/drm/panthor/panthor_drv.c
> > +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> > @@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
> > {
> > int ret;
> >
> > - ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > + ret = panthor_device_resume_and_get(ptdev);
> > if (ret)
> > return ret;
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index 97ed5fe5a191..77b184c3fb0c 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
> > if (!drm_dev_enter(&ptdev->base, &cookie))
> > return;
> >
> > - ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > + ret = panthor_device_resume_and_get(ptdev);
> > if (drm_WARN_ON(&ptdev->base, ret))
> > goto out_dev_exit;
> >
> > @@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
> > return dma_fence_get(job->done_fence);
> > }
> >
> > - ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > + ret = panthor_device_resume_and_get(ptdev);
> > if (drm_WARN_ON(&ptdev->base, ret))
> > return ERR_PTR(ret);
> >
> > --
> > 2.46.2
next prev parent reply other threads:[~2024-11-29 14:45 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-28 11:02 [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
2024-11-28 11:02 ` [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
2024-11-29 13:11 ` Adrián Larumbe
2024-11-28 11:02 ` [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
2024-11-29 13:14 ` Adrián Larumbe
2024-11-29 14:45 ` Boris Brezillon
2024-11-29 15:21 ` Steven Price
2024-11-28 11:02 ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
2024-11-29 13:46 ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend,resume}_device() failures Adrián Larumbe
2024-11-28 11:02 ` [PATCH v2 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
2024-11-29 13:59 ` Adrián Larumbe
2024-11-29 14:44 ` Boris Brezillon [this message]
2024-11-29 15:21 ` Steven Price
2024-11-28 11:02 ` [PATCH v2 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241129154448.21fcf641@collabora.com \
--to=boris.brezillon@collabora.com \
--cc=adrian.larumbe@collabora.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=kernel@collabora.com \
--cc=liviu.dudau@arm.com \
--cc=steven.price@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.