[PATCH] drm/i915: make sure power domains don't get disabled during error capture

public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed

* [PATCH] drm/i915: make sure power domains don't get disabled during error capture
@ 2013-11-28  8:04 Imre Deak
  2013-11-28 10:05 ` Daniel Vetter
  0 siblings, 1 reply; 5+ messages in thread
From: Imre Deak @ 2013-11-28  8:04 UTC (permalink / raw)
  To: intel-gfx

So far we had a race during error capturing where a power domain could
get disabled after we queried its state to be on. Fix this by protecting
the power domain state tracking changes with a lock and holding this
lock during error capturing.

Note that this still allows the case where we are halfway in enabling a
power domain and still report it as disabled. This isn't a problem as we
only want to avoid register accesses while the domain is off and that is
now guaranteed.

Signed-off-by: Imre Deak <imre.deak@intel.com>

[ Applies on top of "drm/i915: add intel_display_power_enabled_sw() for
use in atomic ctx" ]
---
 drivers/gpu/drm/i915/i915_drv.h      | 1 +
 drivers/gpu/drm/i915/intel_display.c | 5 +++++
 drivers/gpu/drm/i915/intel_pm.c      | 5 +++++
 3 files changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index cc06b9f..9e8d117 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -971,6 +971,7 @@ struct i915_power_domains {
 	int power_well_count;
 
 	struct mutex lock;
+	spinlock_t irq_lock;	/* protects domain_use_count */
 	int domain_use_count[POWER_DOMAIN_NUM];
 	struct i915_power_well *power_wells;
 };
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 464ae66..c7feee4 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -11368,6 +11368,7 @@ intel_display_capture_error_state(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	struct intel_display_error_state *error;
+	unsigned long flags;
 	int transcoders[] = {
 		TRANSCODER_A,
 		TRANSCODER_B,
@@ -11386,6 +11387,8 @@ intel_display_capture_error_state(struct drm_device *dev)
 	if (IS_HASWELL(dev) || IS_BROADWELL(dev))
 		error->power_well_driver = I915_READ(HSW_PWR_WELL_DRIVER);
 
+	spin_lock_irqsave(&dev_priv->power_domains.irq_lock, flags);
+
 	for_each_pipe(i) {
 		error->pipe[i].power_domain_on =
 			intel_display_power_enabled_sw(dev, POWER_DOMAIN_PIPE(i));
@@ -11441,6 +11444,8 @@ intel_display_capture_error_state(struct drm_device *dev)
 		error->transcoder[i].vsync = I915_READ(VSYNC(cpu_transcoder));
 	}
 
+	spin_unlock_irqrestore(&dev_priv->power_domains.irq_lock, flags);
+
 	return error;
 }
 
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 311199c..00f6b34 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -5769,7 +5769,9 @@ void intel_display_power_get(struct drm_device *dev,
 	for_each_power_well(i, power_well, BIT(domain), power_domains)
 		__intel_power_well_get(dev, power_well);
 
+	spin_lock_irq(&power_domains->irq_lock);
 	power_domains->domain_use_count[domain]++;
+	spin_unlock_irq(&power_domains->irq_lock);
 
 	mutex_unlock(&power_domains->lock);
 }
@@ -5787,7 +5789,9 @@ void intel_display_power_put(struct drm_device *dev,
 	mutex_lock(&power_domains->lock);
 
 	WARN_ON(!power_domains->domain_use_count[domain]);
+	spin_lock_irq(&power_domains->irq_lock);
 	power_domains->domain_use_count[domain]--;
+	spin_unlock_irq(&power_domains->irq_lock);
 
 	for_each_power_well_rev(i, power_well, BIT(domain), power_domains)
 		__intel_power_well_put(dev, power_well);
@@ -5871,6 +5875,7 @@ int intel_power_domains_init(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct i915_power_domains *power_domains = &dev_priv->power_domains;
 
+	spin_lock_init(&power_domains->irq_lock);
 	mutex_init(&power_domains->lock);
 
 	/*
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: make sure power domains don't get disabled during error capture
  2013-11-28  8:04 [PATCH] drm/i915: make sure power domains don't get disabled during error capture Imre Deak
@ 2013-11-28 10:05 ` Daniel Vetter
  2013-11-28 10:52   ` Imre Deak
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Vetter @ 2013-11-28 10:05 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx

On Thu, Nov 28, 2013 at 10:04:42AM +0200, Imre Deak wrote:
> So far we had a race during error capturing where a power domain could
> get disabled after we queried its state to be on. Fix this by protecting
> the power domain state tracking changes with a lock and holding this
> lock during error capturing.
> 
> Note that this still allows the case where we are halfway in enabling a
> power domain and still report it as disabled. This isn't a problem as we
> only want to avoid register accesses while the domain is off and that is
> now guaranteed.
> 
> Signed-off-by: Imre Deak <imre.deak@intel.com>
> 
> [ Applies on top of "drm/i915: add intel_display_power_enabled_sw() for
> use in atomic ctx" ]

Tbh I don't see the point for this - the error capture code is inherently
racy, we don't grab any of the other modeset locks. We also don't care and
furthermore we should try to grab as few locks as possible to increase the
chances that we can actually capture the error state successfully. Every
lock you add adds another depency and so more ways to end in tears and
deadlocks.

So if the racy error capture is the only reason I'll reject this.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_drv.h      | 1 +
>  drivers/gpu/drm/i915/intel_display.c | 5 +++++
>  drivers/gpu/drm/i915/intel_pm.c      | 5 +++++
>  3 files changed, 11 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index cc06b9f..9e8d117 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -971,6 +971,7 @@ struct i915_power_domains {
>  	int power_well_count;
>  
>  	struct mutex lock;
> +	spinlock_t irq_lock;	/* protects domain_use_count */
>  	int domain_use_count[POWER_DOMAIN_NUM];
>  	struct i915_power_well *power_wells;
>  };
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 464ae66..c7feee4 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -11368,6 +11368,7 @@ intel_display_capture_error_state(struct drm_device *dev)
>  {
>  	drm_i915_private_t *dev_priv = dev->dev_private;
>  	struct intel_display_error_state *error;
> +	unsigned long flags;
>  	int transcoders[] = {
>  		TRANSCODER_A,
>  		TRANSCODER_B,
> @@ -11386,6 +11387,8 @@ intel_display_capture_error_state(struct drm_device *dev)
>  	if (IS_HASWELL(dev) || IS_BROADWELL(dev))
>  		error->power_well_driver = I915_READ(HSW_PWR_WELL_DRIVER);
>  
> +	spin_lock_irqsave(&dev_priv->power_domains.irq_lock, flags);
> +
>  	for_each_pipe(i) {
>  		error->pipe[i].power_domain_on =
>  			intel_display_power_enabled_sw(dev, POWER_DOMAIN_PIPE(i));
> @@ -11441,6 +11444,8 @@ intel_display_capture_error_state(struct drm_device *dev)
>  		error->transcoder[i].vsync = I915_READ(VSYNC(cpu_transcoder));
>  	}
>  
> +	spin_unlock_irqrestore(&dev_priv->power_domains.irq_lock, flags);
> +
>  	return error;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index 311199c..00f6b34 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -5769,7 +5769,9 @@ void intel_display_power_get(struct drm_device *dev,
>  	for_each_power_well(i, power_well, BIT(domain), power_domains)
>  		__intel_power_well_get(dev, power_well);
>  
> +	spin_lock_irq(&power_domains->irq_lock);
>  	power_domains->domain_use_count[domain]++;
> +	spin_unlock_irq(&power_domains->irq_lock);
>  
>  	mutex_unlock(&power_domains->lock);
>  }
> @@ -5787,7 +5789,9 @@ void intel_display_power_put(struct drm_device *dev,
>  	mutex_lock(&power_domains->lock);
>  
>  	WARN_ON(!power_domains->domain_use_count[domain]);
> +	spin_lock_irq(&power_domains->irq_lock);
>  	power_domains->domain_use_count[domain]--;
> +	spin_unlock_irq(&power_domains->irq_lock);
>  
>  	for_each_power_well_rev(i, power_well, BIT(domain), power_domains)
>  		__intel_power_well_put(dev, power_well);
> @@ -5871,6 +5875,7 @@ int intel_power_domains_init(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct i915_power_domains *power_domains = &dev_priv->power_domains;
>  
> +	spin_lock_init(&power_domains->irq_lock);
>  	mutex_init(&power_domains->lock);
>  
>  	/*
> -- 
> 1.8.4
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: make sure power domains don't get disabled during error capture
  2013-11-28 10:05 ` Daniel Vetter
@ 2013-11-28 10:52   ` Imre Deak
  2013-11-28 13:34     ` Daniel Vetter
  0 siblings, 1 reply; 5+ messages in thread
From: Imre Deak @ 2013-11-28 10:52 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Thu, 2013-11-28 at 11:05 +0100, Daniel Vetter wrote:
> On Thu, Nov 28, 2013 at 10:04:42AM +0200, Imre Deak wrote:
> > So far we had a race during error capturing where a power domain could
> > get disabled after we queried its state to be on. Fix this by protecting
> > the power domain state tracking changes with a lock and holding this
> > lock during error capturing.
> > 
> > Note that this still allows the case where we are halfway in enabling a
> > power domain and still report it as disabled. This isn't a problem as we
> > only want to avoid register accesses while the domain is off and that is
> > now guaranteed.
> > 
> > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > 
> > [ Applies on top of "drm/i915: add intel_display_power_enabled_sw() for
> > use in atomic ctx" ]
> 
> Tbh I don't see the point for this - the error capture code is inherently
> racy, we don't grab any of the other modeset locks. We also don't care and
> furthermore we should try to grab as few locks as possible to increase the
> chances that we can actually capture the error state successfully. Every
> lock you add adds another depency and so more ways to end in tears and
> deadlocks.
>
> So if the racy error capture is the only reason I'll reject this.

Yes it is only the race this fixes, but I'd like to argue for its
usefulness. It fixes in particular the following two things:

1. avoiding unclaimed register accesses, which was the original goal of 
    
commit 9d1cb9147dbe45f6e94dc796518ecf67cb64b359
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date:   Fri Nov 1 13:32:08 2013 -0200

drm/i915: avoid unclaimed registers when capturing the error state

2. avoiding an inconsistent power on/off value in the error state wrt.
to the captured registers. Without the lock we can have an error state
showing the power was on, but register values captured with power off,
with some unpredictable values. This consistency check was the whole
point of adding the power on/off value to the error state.

I understand that adding complexity in general is risky, but in this
case the risk is minimal. The only place we take the lock is to adjust a
counter and to read HW registers, without any further nested locks, so
no chance for lock inversions.

--Imre

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: make sure power domains don't get disabled during error capture
  2013-11-28 10:52   ` Imre Deak
@ 2013-11-28 13:34     ` Daniel Vetter
  2013-11-28 13:53       ` Imre Deak
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Vetter @ 2013-11-28 13:34 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx

On Thu, Nov 28, 2013 at 12:52:07PM +0200, Imre Deak wrote:
> On Thu, 2013-11-28 at 11:05 +0100, Daniel Vetter wrote:
> > On Thu, Nov 28, 2013 at 10:04:42AM +0200, Imre Deak wrote:
> > > So far we had a race during error capturing where a power domain could
> > > get disabled after we queried its state to be on. Fix this by protecting
> > > the power domain state tracking changes with a lock and holding this
> > > lock during error capturing.
> > > 
> > > Note that this still allows the case where we are halfway in enabling a
> > > power domain and still report it as disabled. This isn't a problem as we
> > > only want to avoid register accesses while the domain is off and that is
> > > now guaranteed.
> > > 
> > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > 
> > > [ Applies on top of "drm/i915: add intel_display_power_enabled_sw() for
> > > use in atomic ctx" ]
> > 
> > Tbh I don't see the point for this - the error capture code is inherently
> > racy, we don't grab any of the other modeset locks. We also don't care and
> > furthermore we should try to grab as few locks as possible to increase the
> > chances that we can actually capture the error state successfully. Every
> > lock you add adds another depency and so more ways to end in tears and
> > deadlocks.
> >
> > So if the racy error capture is the only reason I'll reject this.
> 
> Yes it is only the race this fixes, but I'd like to argue for its
> usefulness. It fixes in particular the following two things:
> 
> 1. avoiding unclaimed register accesses, which was the original goal of 
>     
> commit 9d1cb9147dbe45f6e94dc796518ecf67cb64b359
> Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
> Date:   Fri Nov 1 13:32:08 2013 -0200
> 
> drm/i915: avoid unclaimed registers when capturing the error state

Hm, I've thought we won't hit any unclaimed register warnings if there's
no concurrent/racing other thread doing stuff to the power wells?

And as explained I don't care that much about racing hangcheck ...

> 2. avoiding an inconsistent power on/off value in the error state wrt.
> to the captured registers. Without the lock we can have an error state
> showing the power was on, but register values captured with power off,
> with some unpredictable values. This consistency check was the whole
> point of adding the power on/off value to the error state.
> 
> I understand that adding complexity in general is risky, but in this
> case the risk is minimal. The only place we take the lock is to adjust a
> counter and to read HW registers, without any further nested locks, so
> no chance for lock inversions.

The thing is that we already have tons of other similar races, e.g. we
could capture hilariously incosistent modeset state where the pfit is
disabled but pipesrc already set to something that would need upscaling.
And it's not a problem. So I still think we can just ignore those races,
even more we /should/ do so.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: make sure power domains don't get disabled during error capture
  2013-11-28 13:34     ` Daniel Vetter
@ 2013-11-28 13:53       ` Imre Deak
  0 siblings, 0 replies; 5+ messages in thread
From: Imre Deak @ 2013-11-28 13:53 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 3617 bytes --]

On Thu, 2013-11-28 at 14:34 +0100, Daniel Vetter wrote:
> On Thu, Nov 28, 2013 at 12:52:07PM +0200, Imre Deak wrote:
> > On Thu, 2013-11-28 at 11:05 +0100, Daniel Vetter wrote:
> > > On Thu, Nov 28, 2013 at 10:04:42AM +0200, Imre Deak wrote:
> > > > So far we had a race during error capturing where a power domain could
> > > > get disabled after we queried its state to be on. Fix this by protecting
> > > > the power domain state tracking changes with a lock and holding this
> > > > lock during error capturing.
> > > > 
> > > > Note that this still allows the case where we are halfway in enabling a
> > > > power domain and still report it as disabled. This isn't a problem as we
> > > > only want to avoid register accesses while the domain is off and that is
> > > > now guaranteed.
> > > > 
> > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > > 
> > > > [ Applies on top of "drm/i915: add intel_display_power_enabled_sw() for
> > > > use in atomic ctx" ]
> > > 
> > > Tbh I don't see the point for this - the error capture code is inherently
> > > racy, we don't grab any of the other modeset locks. We also don't care and
> > > furthermore we should try to grab as few locks as possible to increase the
> > > chances that we can actually capture the error state successfully. Every
> > > lock you add adds another depency and so more ways to end in tears and
> > > deadlocks.
> > >
> > > So if the racy error capture is the only reason I'll reject this.
> > 
> > Yes it is only the race this fixes, but I'd like to argue for its
> > usefulness. It fixes in particular the following two things:
> > 
> > 1. avoiding unclaimed register accesses, which was the original goal of 
> >     
> > commit 9d1cb9147dbe45f6e94dc796518ecf67cb64b359
> > Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
> > Date:   Fri Nov 1 13:32:08 2013 -0200
> > 
> > drm/i915: avoid unclaimed registers when capturing the error state
> 
> Hm, I've thought we won't hit any unclaimed register warnings if there's
> no concurrent/racing other thread doing stuff to the power wells?

We'll hit it if we read a register while its power well is off, but we
think it's on based on SW tracking. That's possible for example if you
get one of the ERROR_INTERRUPTS and call into the error capture code and
this happens while someone is disabling the power well.

> And as explained I don't care that much about racing hangcheck ...
>
> > 2. avoiding an inconsistent power on/off value in the error state wrt.
> > to the captured registers. Without the lock we can have an error state
> > showing the power was on, but register values captured with power off,
> > with some unpredictable values. This consistency check was the whole
> > point of adding the power on/off value to the error state.
> > 
> > I understand that adding complexity in general is risky, but in this
> > case the risk is minimal. The only place we take the lock is to adjust a
> > counter and to read HW registers, without any further nested locks, so
> > no chance for lock inversions.
> 
> The thing is that we already have tons of other similar races, e.g. we
> could capture hilariously incosistent modeset state where the pfit is
> disabled but pipesrc already set to something that would need upscaling.
> And it's not a problem. So I still think we can just ignore those races,
> even more we /should/ do so.

Ok, I wanted to reduce the uncertainty. But if we don't care about this
and having occasional unclaimed reg access reports are ok then I'm ok to
ignore this patch.

--Imre


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2013-11-28 13:53 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-28  8:04 [PATCH] drm/i915: make sure power domains don't get disabled during error capture Imre Deak
2013-11-28 10:05 ` Daniel Vetter
2013-11-28 10:52   ` Imre Deak
2013-11-28 13:34     ` Daniel Vetter
2013-11-28 13:53       ` Imre Deak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox