Linux Power Management development

Linux Power Management development
 help / color / mirror / Atom feed

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Huang Ying @ 2012-11-08  2:04 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Alan Stern, linux-kernel, linux-pm
In-Reply-To: <5312356.pH16Qh7ECV@vostro.rjw.lan>

On Thu, 2012-11-08 at 02:35 +0100, Rafael J. Wysocki wrote:
> On Thursday, November 08, 2012 09:15:08 AM Huang Ying wrote:
> > On Thu, 2012-11-08 at 00:09 +0100, Rafael J. Wysocki wrote:
> > > On Wednesday, November 07, 2012 11:51:15 PM Rafael J. Wysocki wrote:
> > > > On Wednesday, November 07, 2012 04:56:49 PM Alan Stern wrote:
> > > > > On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> > > > > 
> > > > > > > Right.  The reasoning behind my proposal goes like this: When there's
> > > > > > > no driver, the subsystem can let userspace directly control the
> > > > > > > device's power level through the power/control attribute.
> > > > > > 
> > > > > > Well, we might as well just leave the runtime PM of PCI devices enabled, even
> > > > > > if they have no drivers, but modify the PCI bus type's runtime PM callbacks
> > > > > > to ignore devices with no drivers.
> > > > > > 
> > > > > > IIRC the reason why we decided to disable runtime PM for PCI device with no
> > > > > > drivers was that some of them refused to work again after being put by the
> > > > > > core into D3.  By making the PCI bus type's runtime PM callbacks ignore them
> > > > > > we'd avoid this problem without modifying the core's behavior.
> > > > > 
> > > > > It comes down to a question of the parent.  If a driverless PCI device
> > > > > isn't being used, shouldn't its parent be allowed to go into runtime
> > > > > suspend?  As things stand now, we do allow it.
> > > > > 
> > > > > The problem is that we don't disallow it when the driverless device
> > > > > _is_ being used.
> > > > 
> > > > We can make it depend on what's there in the control file.  Let's say if that's
> > > > "on" (ie. the devices usage counter is not zero), we won't allow the parent
> > > > to be suspended.
> > > > 
> > > > So, as I said, why don't we keep the runtime PM of PCI devices always enabled,
> > > > regardless of whether or not there is a driver, and arrange things in such a
> > > > way that the device is automatically "suspended" if user space writes "auto"
> > > > to the control file.  IOW, I suppose we can do something like this:
> > > 
> > > It probably is better to treat the "no driver" case in a special way, though:
> > > 
> > > ---
> > >  drivers/pci/pci-driver.c |   45 +++++++++++++++++++++++++--------------------
> > >  drivers/pci/pci.c        |    2 ++
> > >  2 files changed, 27 insertions(+), 20 deletions(-)
> > > 
> > > Index: linux-pm/drivers/pci/pci-driver.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/pci/pci-driver.c
> > > +++ linux-pm/drivers/pci/pci-driver.c
> > > @@ -263,22 +263,17 @@ static long local_pci_probe(void *_ddi)
> > >  	/* The parent bridge must be in active state when probing */
> > >  	if (parent)
> > >  		pm_runtime_get_sync(parent);
> > > -	/* Unbound PCI devices are always set to disabled and suspended.
> > > -	 * During probe, the device is set to enabled and active and the
> > > -	 * usage count is incremented.  If the driver supports runtime PM,
> > > -	 * it should call pm_runtime_put_noidle() in its probe routine and
> > > +	/*
> > > +	 * During probe, the device is set to active and the usage count is
> > > +	 * incremented.  If the driver supports runtime PM, it should call
> > > +	 * pm_runtime_put_noidle() in its probe routine and
> > >  	 * pm_runtime_get_noresume() in its remove routine.
> > >  	 */
> > > -	pm_runtime_get_noresume(dev);
> > > -	pm_runtime_set_active(dev);
> > > -	pm_runtime_enable(dev);
> > > -
> > > +	pm_runtime_get_sync(dev);
> > >  	rc = ddi->drv->probe(ddi->dev, ddi->id);
> > > -	if (rc) {
> > > -		pm_runtime_disable(dev);
> > > -		pm_runtime_set_suspended(dev);
> > > -		pm_runtime_put_noidle(dev);
> > > -	}
> > > +	if (rc)
> > > +		pm_runtime_put_sync(dev);
> > > +
> > >  	if (parent)
> > >  		pm_runtime_put(parent);
> > >  	return rc;
> > > @@ -369,9 +364,7 @@ static int pci_device_remove(struct devi
> > >  	}
> > >  
> > >  	/* Undo the runtime PM settings in local_pci_probe() */
> > > -	pm_runtime_disable(dev);
> > > -	pm_runtime_set_suspended(dev);
> > > -	pm_runtime_put_noidle(dev);
> > > +	pm_runtime_put_sync(dev);
> > >  
> > >  	/*
> > >  	 * If the device is still on, set the power state as "unknown",
> > > @@ -998,10 +991,14 @@ static int pci_pm_restore(struct device
> > >  static int pci_pm_runtime_suspend(struct device *dev)
> > >  {
> > >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > > -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > > +	const struct dev_pm_ops *pm;
> > >  	pci_power_t prev = pci_dev->current_state;
> > >  	int error;
> > >  
> > > +	if (!dev->driver)
> > > +		return 0;
> > > +
> > > +	pm = dev->driver->pm;
> > >  	if (!pm || !pm->runtime_suspend)
> > >  		return -ENOSYS;
> > >  
> > > @@ -1035,8 +1032,12 @@ static int pci_pm_runtime_resume(struct
> > >  {
> > >  	int rc;
> > >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > > -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > > +	const struct dev_pm_ops *pm;
> > > +
> > > +	if (!dev->driver)
> > > +		return 0;
> > >  
> > > +	pm = dev->driver->pm;
> > >  	if (!pm || !pm->runtime_resume)
> > >  		return -ENOSYS;
> > >  
> > > @@ -1054,8 +1055,12 @@ static int pci_pm_runtime_resume(struct
> > >  
> > >  static int pci_pm_runtime_idle(struct device *dev)
> > >  {
> > > -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > > +	const struct dev_pm_ops *pm;
> > > +
> > > +	if (!dev->driver)
> > > +		goto out:
> > >  
> > > +	pm = dev->driver->pm;
> > >  	if (!pm)
> > >  		return -ENOSYS;
> > >  
> > > @@ -1065,8 +1070,8 @@ static int pci_pm_runtime_idle(struct de
> > >  			return ret;
> > >  	}
> > >  
> > > + out:
> > >  	pm_runtime_suspend(dev);
> > > -
> > >  	return 0;
> > >  }
> > >  
> > > Index: linux-pm/drivers/pci/pci.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/pci/pci.c
> > > +++ linux-pm/drivers/pci/pci.c
> > > @@ -1868,6 +1868,8 @@ void pci_pm_init(struct pci_dev *dev)
> > >  	u16 pmc;
> > >  
> > >  	pm_runtime_forbid(&dev->dev);
> > > +	pm_runtime_set_active(dev);
> > > +	pm_runtime_enable(&dev->dev);
> > >  	device_enable_async_suspend(&dev->dev);
> > >  	dev->wakeup_prepared = false;
> > >  
> > 
> > I think the patch can fix the issue in a better way.
> 
> I'm not sure what you mean.

I mean your patch can fix the driver-less VGA issue.  And it is better
than my original patch.  The following discussion is not about this
specific issue.  Just about PM core logic.

> > Do we still need to clarify state about disabled and forbidden?  When a
> > device is forbidden and the usage_count > 0,
> 
> "Forbidden" always means usage_count > 0.

Yes.

> > is it a good idea to allow to set device state to SUSPENDED if the device
> > is disabled?
> 
> No, it is not.  The status should always be ACTIVE as long as usage_count > 0.
> However, in some cases we actually would like to change the status to
> SUSPENDED when usage_count becomes equal to 0, because that means we can
> suspend (I mean really suspend) the parents of the devices in question
> (and we want to notify the parents in those cases).

So do you think Alan Stern's suggestion about forbidden and disabled is
the right way to go?

>	Make pm_runtime_set_suspended() fail if runtime PM is 
> 	forbidden.
> 
> 	Make pm_runtime_forbid() call pm_runtime_set_active()
> 	(and do a runtime resume of the parent) if disable_depth > 0.
>
>	Make the PCI runtime-idle routine call 
> 	pm_runtime_set_suspended() if disable_depth > 0.  Or maybe
> 	do this for all devices, in the runtime PM core.

In this way, we do not really need to call pm_runtime_set_suspended() in
fact.  Because if disabled and usage_count=0, device will be set to
SUSPENDED state automatically.

Best Regards,
Huang Ying



^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-08  1:35 UTC (permalink / raw)
  To: Huang Ying; +Cc: Alan Stern, linux-kernel, linux-pm
In-Reply-To: <1352337308.7176.28.camel@yhuang-dev>

On Thursday, November 08, 2012 09:15:08 AM Huang Ying wrote:
> On Thu, 2012-11-08 at 00:09 +0100, Rafael J. Wysocki wrote:
> > On Wednesday, November 07, 2012 11:51:15 PM Rafael J. Wysocki wrote:
> > > On Wednesday, November 07, 2012 04:56:49 PM Alan Stern wrote:
> > > > On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> > > > 
> > > > > > Right.  The reasoning behind my proposal goes like this: When there's
> > > > > > no driver, the subsystem can let userspace directly control the
> > > > > > device's power level through the power/control attribute.
> > > > > 
> > > > > Well, we might as well just leave the runtime PM of PCI devices enabled, even
> > > > > if they have no drivers, but modify the PCI bus type's runtime PM callbacks
> > > > > to ignore devices with no drivers.
> > > > > 
> > > > > IIRC the reason why we decided to disable runtime PM for PCI device with no
> > > > > drivers was that some of them refused to work again after being put by the
> > > > > core into D3.  By making the PCI bus type's runtime PM callbacks ignore them
> > > > > we'd avoid this problem without modifying the core's behavior.
> > > > 
> > > > It comes down to a question of the parent.  If a driverless PCI device
> > > > isn't being used, shouldn't its parent be allowed to go into runtime
> > > > suspend?  As things stand now, we do allow it.
> > > > 
> > > > The problem is that we don't disallow it when the driverless device
> > > > _is_ being used.
> > > 
> > > We can make it depend on what's there in the control file.  Let's say if that's
> > > "on" (ie. the devices usage counter is not zero), we won't allow the parent
> > > to be suspended.
> > > 
> > > So, as I said, why don't we keep the runtime PM of PCI devices always enabled,
> > > regardless of whether or not there is a driver, and arrange things in such a
> > > way that the device is automatically "suspended" if user space writes "auto"
> > > to the control file.  IOW, I suppose we can do something like this:
> > 
> > It probably is better to treat the "no driver" case in a special way, though:
> > 
> > ---
> >  drivers/pci/pci-driver.c |   45 +++++++++++++++++++++++++--------------------
> >  drivers/pci/pci.c        |    2 ++
> >  2 files changed, 27 insertions(+), 20 deletions(-)
> > 
> > Index: linux-pm/drivers/pci/pci-driver.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pci-driver.c
> > +++ linux-pm/drivers/pci/pci-driver.c
> > @@ -263,22 +263,17 @@ static long local_pci_probe(void *_ddi)
> >  	/* The parent bridge must be in active state when probing */
> >  	if (parent)
> >  		pm_runtime_get_sync(parent);
> > -	/* Unbound PCI devices are always set to disabled and suspended.
> > -	 * During probe, the device is set to enabled and active and the
> > -	 * usage count is incremented.  If the driver supports runtime PM,
> > -	 * it should call pm_runtime_put_noidle() in its probe routine and
> > +	/*
> > +	 * During probe, the device is set to active and the usage count is
> > +	 * incremented.  If the driver supports runtime PM, it should call
> > +	 * pm_runtime_put_noidle() in its probe routine and
> >  	 * pm_runtime_get_noresume() in its remove routine.
> >  	 */
> > -	pm_runtime_get_noresume(dev);
> > -	pm_runtime_set_active(dev);
> > -	pm_runtime_enable(dev);
> > -
> > +	pm_runtime_get_sync(dev);
> >  	rc = ddi->drv->probe(ddi->dev, ddi->id);
> > -	if (rc) {
> > -		pm_runtime_disable(dev);
> > -		pm_runtime_set_suspended(dev);
> > -		pm_runtime_put_noidle(dev);
> > -	}
> > +	if (rc)
> > +		pm_runtime_put_sync(dev);
> > +
> >  	if (parent)
> >  		pm_runtime_put(parent);
> >  	return rc;
> > @@ -369,9 +364,7 @@ static int pci_device_remove(struct devi
> >  	}
> >  
> >  	/* Undo the runtime PM settings in local_pci_probe() */
> > -	pm_runtime_disable(dev);
> > -	pm_runtime_set_suspended(dev);
> > -	pm_runtime_put_noidle(dev);
> > +	pm_runtime_put_sync(dev);
> >  
> >  	/*
> >  	 * If the device is still on, set the power state as "unknown",
> > @@ -998,10 +991,14 @@ static int pci_pm_restore(struct device
> >  static int pci_pm_runtime_suspend(struct device *dev)
> >  {
> >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > +	const struct dev_pm_ops *pm;
> >  	pci_power_t prev = pci_dev->current_state;
> >  	int error;
> >  
> > +	if (!dev->driver)
> > +		return 0;
> > +
> > +	pm = dev->driver->pm;
> >  	if (!pm || !pm->runtime_suspend)
> >  		return -ENOSYS;
> >  
> > @@ -1035,8 +1032,12 @@ static int pci_pm_runtime_resume(struct
> >  {
> >  	int rc;
> >  	struct pci_dev *pci_dev = to_pci_dev(dev);
> > -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > +	const struct dev_pm_ops *pm;
> > +
> > +	if (!dev->driver)
> > +		return 0;
> >  
> > +	pm = dev->driver->pm;
> >  	if (!pm || !pm->runtime_resume)
> >  		return -ENOSYS;
> >  
> > @@ -1054,8 +1055,12 @@ static int pci_pm_runtime_resume(struct
> >  
> >  static int pci_pm_runtime_idle(struct device *dev)
> >  {
> > -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > +	const struct dev_pm_ops *pm;
> > +
> > +	if (!dev->driver)
> > +		goto out:
> >  
> > +	pm = dev->driver->pm;
> >  	if (!pm)
> >  		return -ENOSYS;
> >  
> > @@ -1065,8 +1070,8 @@ static int pci_pm_runtime_idle(struct de
> >  			return ret;
> >  	}
> >  
> > + out:
> >  	pm_runtime_suspend(dev);
> > -
> >  	return 0;
> >  }
> >  
> > Index: linux-pm/drivers/pci/pci.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pci.c
> > +++ linux-pm/drivers/pci/pci.c
> > @@ -1868,6 +1868,8 @@ void pci_pm_init(struct pci_dev *dev)
> >  	u16 pmc;
> >  
> >  	pm_runtime_forbid(&dev->dev);
> > +	pm_runtime_set_active(dev);
> > +	pm_runtime_enable(&dev->dev);
> >  	device_enable_async_suspend(&dev->dev);
> >  	dev->wakeup_prepared = false;
> >  
> 
> I think the patch can fix the issue in a better way.

I'm not sure what you mean.

> Do we still need to clarify state about disabled and forbidden?  When a
> device is forbidden and the usage_count > 0,

"Forbidden" always means usage_count > 0.

> is it a good idea to allow to set device state to SUSPENDED if the device
> is disabled?

No, it is not.  The status should always be ACTIVE as long as usage_count > 0.
However, in some cases we actually would like to change the status to
SUSPENDED when usage_count becomes equal to 0, because that means we can
suspend (I mean really suspend) the parents of the devices in question
(and we want to notify the parents in those cases).

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Huang Ying @ 2012-11-08  1:15 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Alan Stern, linux-kernel, linux-pm
In-Reply-To: <38669905.umVnjWsO2W@vostro.rjw.lan>

On Thu, 2012-11-08 at 00:09 +0100, Rafael J. Wysocki wrote:
> On Wednesday, November 07, 2012 11:51:15 PM Rafael J. Wysocki wrote:
> > On Wednesday, November 07, 2012 04:56:49 PM Alan Stern wrote:
> > > On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> > > 
> > > > > Right.  The reasoning behind my proposal goes like this: When there's
> > > > > no driver, the subsystem can let userspace directly control the
> > > > > device's power level through the power/control attribute.
> > > > 
> > > > Well, we might as well just leave the runtime PM of PCI devices enabled, even
> > > > if they have no drivers, but modify the PCI bus type's runtime PM callbacks
> > > > to ignore devices with no drivers.
> > > > 
> > > > IIRC the reason why we decided to disable runtime PM for PCI device with no
> > > > drivers was that some of them refused to work again after being put by the
> > > > core into D3.  By making the PCI bus type's runtime PM callbacks ignore them
> > > > we'd avoid this problem without modifying the core's behavior.
> > > 
> > > It comes down to a question of the parent.  If a driverless PCI device
> > > isn't being used, shouldn't its parent be allowed to go into runtime
> > > suspend?  As things stand now, we do allow it.
> > > 
> > > The problem is that we don't disallow it when the driverless device
> > > _is_ being used.
> > 
> > We can make it depend on what's there in the control file.  Let's say if that's
> > "on" (ie. the devices usage counter is not zero), we won't allow the parent
> > to be suspended.
> > 
> > So, as I said, why don't we keep the runtime PM of PCI devices always enabled,
> > regardless of whether or not there is a driver, and arrange things in such a
> > way that the device is automatically "suspended" if user space writes "auto"
> > to the control file.  IOW, I suppose we can do something like this:
> 
> It probably is better to treat the "no driver" case in a special way, though:
> 
> ---
>  drivers/pci/pci-driver.c |   45 +++++++++++++++++++++++++--------------------
>  drivers/pci/pci.c        |    2 ++
>  2 files changed, 27 insertions(+), 20 deletions(-)
> 
> Index: linux-pm/drivers/pci/pci-driver.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci-driver.c
> +++ linux-pm/drivers/pci/pci-driver.c
> @@ -263,22 +263,17 @@ static long local_pci_probe(void *_ddi)
>  	/* The parent bridge must be in active state when probing */
>  	if (parent)
>  		pm_runtime_get_sync(parent);
> -	/* Unbound PCI devices are always set to disabled and suspended.
> -	 * During probe, the device is set to enabled and active and the
> -	 * usage count is incremented.  If the driver supports runtime PM,
> -	 * it should call pm_runtime_put_noidle() in its probe routine and
> +	/*
> +	 * During probe, the device is set to active and the usage count is
> +	 * incremented.  If the driver supports runtime PM, it should call
> +	 * pm_runtime_put_noidle() in its probe routine and
>  	 * pm_runtime_get_noresume() in its remove routine.
>  	 */
> -	pm_runtime_get_noresume(dev);
> -	pm_runtime_set_active(dev);
> -	pm_runtime_enable(dev);
> -
> +	pm_runtime_get_sync(dev);
>  	rc = ddi->drv->probe(ddi->dev, ddi->id);
> -	if (rc) {
> -		pm_runtime_disable(dev);
> -		pm_runtime_set_suspended(dev);
> -		pm_runtime_put_noidle(dev);
> -	}
> +	if (rc)
> +		pm_runtime_put_sync(dev);
> +
>  	if (parent)
>  		pm_runtime_put(parent);
>  	return rc;
> @@ -369,9 +364,7 @@ static int pci_device_remove(struct devi
>  	}
>  
>  	/* Undo the runtime PM settings in local_pci_probe() */
> -	pm_runtime_disable(dev);
> -	pm_runtime_set_suspended(dev);
> -	pm_runtime_put_noidle(dev);
> +	pm_runtime_put_sync(dev);
>  
>  	/*
>  	 * If the device is still on, set the power state as "unknown",
> @@ -998,10 +991,14 @@ static int pci_pm_restore(struct device
>  static int pci_pm_runtime_suspend(struct device *dev)
>  {
>  	struct pci_dev *pci_dev = to_pci_dev(dev);
> -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	const struct dev_pm_ops *pm;
>  	pci_power_t prev = pci_dev->current_state;
>  	int error;
>  
> +	if (!dev->driver)
> +		return 0;
> +
> +	pm = dev->driver->pm;
>  	if (!pm || !pm->runtime_suspend)
>  		return -ENOSYS;
>  
> @@ -1035,8 +1032,12 @@ static int pci_pm_runtime_resume(struct
>  {
>  	int rc;
>  	struct pci_dev *pci_dev = to_pci_dev(dev);
> -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	const struct dev_pm_ops *pm;
> +
> +	if (!dev->driver)
> +		return 0;
>  
> +	pm = dev->driver->pm;
>  	if (!pm || !pm->runtime_resume)
>  		return -ENOSYS;
>  
> @@ -1054,8 +1055,12 @@ static int pci_pm_runtime_resume(struct
>  
>  static int pci_pm_runtime_idle(struct device *dev)
>  {
> -	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	const struct dev_pm_ops *pm;
> +
> +	if (!dev->driver)
> +		goto out:
>  
> +	pm = dev->driver->pm;
>  	if (!pm)
>  		return -ENOSYS;
>  
> @@ -1065,8 +1070,8 @@ static int pci_pm_runtime_idle(struct de
>  			return ret;
>  	}
>  
> + out:
>  	pm_runtime_suspend(dev);
> -
>  	return 0;
>  }
>  
> Index: linux-pm/drivers/pci/pci.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci.c
> +++ linux-pm/drivers/pci/pci.c
> @@ -1868,6 +1868,8 @@ void pci_pm_init(struct pci_dev *dev)
>  	u16 pmc;
>  
>  	pm_runtime_forbid(&dev->dev);
> +	pm_runtime_set_active(dev);
> +	pm_runtime_enable(&dev->dev);
>  	device_enable_async_suspend(&dev->dev);
>  	dev->wakeup_prepared = false;
>  

I think the patch can fix the issue in a better way.

Do we still need to clarify state about disabled and forbidden?  When a
device is forbidden and the usage_count > 0, is it a good idea to allow
to set device state to SUSPENDED if the device is disabled?

Best Regards,
Huang Ying



^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-07 23:09 UTC (permalink / raw)
  To: Alan Stern; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <1748065.Php9fUYGsh@vostro.rjw.lan>

On Wednesday, November 07, 2012 11:51:15 PM Rafael J. Wysocki wrote:
> On Wednesday, November 07, 2012 04:56:49 PM Alan Stern wrote:
> > On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> > 
> > > > Right.  The reasoning behind my proposal goes like this: When there's
> > > > no driver, the subsystem can let userspace directly control the
> > > > device's power level through the power/control attribute.
> > > 
> > > Well, we might as well just leave the runtime PM of PCI devices enabled, even
> > > if they have no drivers, but modify the PCI bus type's runtime PM callbacks
> > > to ignore devices with no drivers.
> > > 
> > > IIRC the reason why we decided to disable runtime PM for PCI device with no
> > > drivers was that some of them refused to work again after being put by the
> > > core into D3.  By making the PCI bus type's runtime PM callbacks ignore them
> > > we'd avoid this problem without modifying the core's behavior.
> > 
> > It comes down to a question of the parent.  If a driverless PCI device
> > isn't being used, shouldn't its parent be allowed to go into runtime
> > suspend?  As things stand now, we do allow it.
> > 
> > The problem is that we don't disallow it when the driverless device
> > _is_ being used.
> 
> We can make it depend on what's there in the control file.  Let's say if that's
> "on" (ie. the devices usage counter is not zero), we won't allow the parent
> to be suspended.
> 
> So, as I said, why don't we keep the runtime PM of PCI devices always enabled,
> regardless of whether or not there is a driver, and arrange things in such a
> way that the device is automatically "suspended" if user space writes "auto"
> to the control file.  IOW, I suppose we can do something like this:

It probably is better to treat the "no driver" case in a special way, though:

---
 drivers/pci/pci-driver.c |   45 +++++++++++++++++++++++++--------------------
 drivers/pci/pci.c        |    2 ++
 2 files changed, 27 insertions(+), 20 deletions(-)

Index: linux-pm/drivers/pci/pci-driver.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-driver.c
+++ linux-pm/drivers/pci/pci-driver.c
@@ -263,22 +263,17 @@ static long local_pci_probe(void *_ddi)
 	/* The parent bridge must be in active state when probing */
 	if (parent)
 		pm_runtime_get_sync(parent);
-	/* Unbound PCI devices are always set to disabled and suspended.
-	 * During probe, the device is set to enabled and active and the
-	 * usage count is incremented.  If the driver supports runtime PM,
-	 * it should call pm_runtime_put_noidle() in its probe routine and
+	/*
+	 * During probe, the device is set to active and the usage count is
+	 * incremented.  If the driver supports runtime PM, it should call
+	 * pm_runtime_put_noidle() in its probe routine and
 	 * pm_runtime_get_noresume() in its remove routine.
 	 */
-	pm_runtime_get_noresume(dev);
-	pm_runtime_set_active(dev);
-	pm_runtime_enable(dev);
-
+	pm_runtime_get_sync(dev);
 	rc = ddi->drv->probe(ddi->dev, ddi->id);
-	if (rc) {
-		pm_runtime_disable(dev);
-		pm_runtime_set_suspended(dev);
-		pm_runtime_put_noidle(dev);
-	}
+	if (rc)
+		pm_runtime_put_sync(dev);
+
 	if (parent)
 		pm_runtime_put(parent);
 	return rc;
@@ -369,9 +364,7 @@ static int pci_device_remove(struct devi
 	}
 
 	/* Undo the runtime PM settings in local_pci_probe() */
-	pm_runtime_disable(dev);
-	pm_runtime_set_suspended(dev);
-	pm_runtime_put_noidle(dev);
+	pm_runtime_put_sync(dev);
 
 	/*
 	 * If the device is still on, set the power state as "unknown",
@@ -998,10 +991,14 @@ static int pci_pm_restore(struct device
 static int pci_pm_runtime_suspend(struct device *dev)
 {
 	struct pci_dev *pci_dev = to_pci_dev(dev);
-	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	const struct dev_pm_ops *pm;
 	pci_power_t prev = pci_dev->current_state;
 	int error;
 
+	if (!dev->driver)
+		return 0;
+
+	pm = dev->driver->pm;
 	if (!pm || !pm->runtime_suspend)
 		return -ENOSYS;
 
@@ -1035,8 +1032,12 @@ static int pci_pm_runtime_resume(struct
 {
 	int rc;
 	struct pci_dev *pci_dev = to_pci_dev(dev);
-	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	const struct dev_pm_ops *pm;
+
+	if (!dev->driver)
+		return 0;
 
+	pm = dev->driver->pm;
 	if (!pm || !pm->runtime_resume)
 		return -ENOSYS;
 
@@ -1054,8 +1055,12 @@ static int pci_pm_runtime_resume(struct
 
 static int pci_pm_runtime_idle(struct device *dev)
 {
-	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	const struct dev_pm_ops *pm;
+
+	if (!dev->driver)
+		goto out:
 
+	pm = dev->driver->pm;
 	if (!pm)
 		return -ENOSYS;
 
@@ -1065,8 +1070,8 @@ static int pci_pm_runtime_idle(struct de
 			return ret;
 	}
 
+ out:
 	pm_runtime_suspend(dev);
-
 	return 0;
 }
 
Index: linux-pm/drivers/pci/pci.c
===================================================================
--- linux-pm.orig/drivers/pci/pci.c
+++ linux-pm/drivers/pci/pci.c
@@ -1868,6 +1868,8 @@ void pci_pm_init(struct pci_dev *dev)
 	u16 pmc;
 
 	pm_runtime_forbid(&dev->dev);
+	pm_runtime_set_active(dev);
+	pm_runtime_enable(&dev->dev);
 	device_enable_async_suspend(&dev->dev);
 	dev->wakeup_prepared = false;
 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-07 22:51 UTC (permalink / raw)
  To: Alan Stern; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211071653140.1188-100000@iolanthe.rowland.org>

On Wednesday, November 07, 2012 04:56:49 PM Alan Stern wrote:
> On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> 
> > > Right.  The reasoning behind my proposal goes like this: When there's
> > > no driver, the subsystem can let userspace directly control the
> > > device's power level through the power/control attribute.
> > 
> > Well, we might as well just leave the runtime PM of PCI devices enabled, even
> > if they have no drivers, but modify the PCI bus type's runtime PM callbacks
> > to ignore devices with no drivers.
> > 
> > IIRC the reason why we decided to disable runtime PM for PCI device with no
> > drivers was that some of them refused to work again after being put by the
> > core into D3.  By making the PCI bus type's runtime PM callbacks ignore them
> > we'd avoid this problem without modifying the core's behavior.
> 
> It comes down to a question of the parent.  If a driverless PCI device
> isn't being used, shouldn't its parent be allowed to go into runtime
> suspend?  As things stand now, we do allow it.
> 
> The problem is that we don't disallow it when the driverless device
> _is_ being used.

We can make it depend on what's there in the control file.  Let's say if that's
"on" (ie. the devices usage counter is not zero), we won't allow the parent
to be suspended.

So, as I said, why don't we keep the runtime PM of PCI devices always enabled,
regardless of whether or not there is a driver, and arrange things in such a
way that the device is automatically "suspended" if user space writes "auto"
to the control file.  IOW, I suppose we can do something like this:

---
 drivers/pci/pci-driver.c |   34 ++++++++++++----------------------
 drivers/pci/pci.c        |    2 ++
 2 files changed, 14 insertions(+), 22 deletions(-)

Index: linux-pm/drivers/pci/pci-driver.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-driver.c
+++ linux-pm/drivers/pci/pci-driver.c
@@ -263,22 +263,17 @@ static long local_pci_probe(void *_ddi)
 	/* The parent bridge must be in active state when probing */
 	if (parent)
 		pm_runtime_get_sync(parent);
-	/* Unbound PCI devices are always set to disabled and suspended.
-	 * During probe, the device is set to enabled and active and the
-	 * usage count is incremented.  If the driver supports runtime PM,
-	 * it should call pm_runtime_put_noidle() in its probe routine and
+	/*
+	 * During probe, the device is set to active and the usage count is
+	 * incremented.  If the driver supports runtime PM, it should call
+	 * pm_runtime_put_noidle() in its probe routine and
 	 * pm_runtime_get_noresume() in its remove routine.
 	 */
-	pm_runtime_get_noresume(dev);
-	pm_runtime_set_active(dev);
-	pm_runtime_enable(dev);
-
+	pm_runtime_get_sync(dev);
 	rc = ddi->drv->probe(ddi->dev, ddi->id);
-	if (rc) {
-		pm_runtime_disable(dev);
-		pm_runtime_set_suspended(dev);
-		pm_runtime_put_noidle(dev);
-	}
+	if (rc)
+		pm_runtime_put_sync(dev);
+
 	if (parent)
 		pm_runtime_put(parent);
 	return rc;
@@ -369,9 +364,7 @@ static int pci_device_remove(struct devi
 	}
 
 	/* Undo the runtime PM settings in local_pci_probe() */
-	pm_runtime_disable(dev);
-	pm_runtime_set_suspended(dev);
-	pm_runtime_put_noidle(dev);
+	pm_runtime_put_sync(dev);
 
 	/*
 	 * If the device is still on, set the power state as "unknown",
@@ -1003,7 +996,7 @@ static int pci_pm_runtime_suspend(struct
 	int error;
 
 	if (!pm || !pm->runtime_suspend)
-		return -ENOSYS;
+		return 0;
 
 	pci_dev->no_d3cold = false;
 	error = pm->runtime_suspend(dev);
@@ -1038,7 +1031,7 @@ static int pci_pm_runtime_resume(struct
 	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 
 	if (!pm || !pm->runtime_resume)
-		return -ENOSYS;
+		return 0;
 
 	pci_restore_standard_config(pci_dev);
 	pci_fixup_device(pci_fixup_resume_early, pci_dev);
@@ -1056,10 +1049,7 @@ static int pci_pm_runtime_idle(struct de
 {
 	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 
-	if (!pm)
-		return -ENOSYS;
-
-	if (pm->runtime_idle) {
+	if (pm && pm->runtime_idle) {
 		int ret = pm->runtime_idle(dev);
 		if (ret)
 			return ret;
Index: linux-pm/drivers/pci/pci.c
===================================================================
--- linux-pm.orig/drivers/pci/pci.c
+++ linux-pm/drivers/pci/pci.c
@@ -1868,6 +1868,8 @@ void pci_pm_init(struct pci_dev *dev)
 	u16 pmc;
 
 	pm_runtime_forbid(&dev->dev);
+	pm_runtime_set_active(dev);
+	pm_runtime_enable(&dev->dev);
 	device_enable_async_suspend(&dev->dev);
 	dev->wakeup_prepared = false;
 



-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: Linux 3.7-rc3
From: Rafael J. Wysocki @ 2012-11-07 22:35 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Daniel Vetter, Linux Kernel Mailing List,
	Linux PM list, Greg Kroah-Hartman, David Airlie, Michal Hocko,
	Jiri Kosina, Ingo Molnar, Peter Zijlstra
In-Reply-To: <20121107222233.GG2541@linux.vnet.ibm.com>

On Wednesday, November 07, 2012 02:22:34 PM Paul E. McKenney wrote:
> On Fri, Nov 02, 2012 at 04:10:34PM -0700, Linus Torvalds wrote:
> > On Fri, Nov 2, 2012 at 3:43 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > >
> > > Well, not everything is rosy in the suspend land, though.  This is a
> > > failure to freeze khubd during the second in a row attempt to suspend to
> > > RAM (your current tree):
> > 
> > Ugh. So khubd is blocked in usb_start_wait_urb(), and apparently the
> > timeout for that block is longer than the freezing timeout.
> > 
> > There's a comment about why khubd needs to be freezable, but I wonder
> > if that whole thing isn't doing something wrong. Causing the suspend
> > to fail is definitely always the wrong thing.
> > 
> > Greg?
> > 
> > > [  125.780766] [ INFO: suspicious RCU usage. ]
> > > [  125.780804] 3.7.0-rc3+ #988 Not tainted
> > > [  125.780838] -------------------------------
> > > [  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
> > 
> > Heh. The RCU usage is from the debug printout from sched_show_task(),
> > so it's "related", but it's a totally independent issue.
> > 
> > It's apparently because we've not done a "rcu_read_lock()" around that
> > sequence, but I seriously doubt we care. But it's technically a real
> > bug - even if the fix might be to just not print out the parent pid
> > (or to just ignore the bug and turn the rcu dereference into an
> > ACCESS_ONCE() or something.
> > 
> > Ingo, Peter, any comments about that sched/core.c:4497 RCU usage?
> 
> Rafael, does the following patch fix that problem?

Well, the box I can reproduce that on is not with me now and it's
not readily reproducible anyway, so it may take some time to verify
I'm afraid.

Thanks,
Rafael


> ------------------------------------------------------------------------
> 
> sched: Mark RCU reader in sched_show_task()
> 
> When sched_show_task() is invoked from try_to_freeze_tasks(), there is
> no RCU read-side critical section, resulting in the following splat:
> 
> [  125.780730] ===============================
> [  125.780766] [ INFO: suspicious RCU usage. ]
> [  125.780804] 3.7.0-rc3+ #988 Not tainted
> [  125.780838] -------------------------------
> [  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
> [  125.780946]
> [  125.780946] other info that might help us debug this:
> [  125.780946]
> [  125.781031]
> [  125.781031] rcu_scheduler_active = 1, debug_locks = 0
> [  125.781087] 4 locks held by s2ram/4211:
> [  125.781120]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160
> [  125.781233]  #1:  (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160
> [  125.781339]  #2:  (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230
> [  125.781439]  #3:  (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0
> [  125.781543]
> [  125.781543] stack backtrace:
> [  125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988
> [  125.781632] Call Trace:
> [  125.781662]  [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140
> [  125.781719]  [<ffffffff8107cf21>] sched_show_task+0x121/0x180
> [  125.781770]  [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0
> [  125.781823]  [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80
> [  125.781876]  [<ffffffff81090b65>] pm_suspend+0x165/0x230
> [  125.781924]  [<ffffffff8108fa29>] state_store+0x99/0x100
> [  125.781975]  [<ffffffff812f5867>] kobj_attr_store+0x17/0x20
> [  125.782038]  [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160
> [  125.782091]  [<ffffffff811667a6>] vfs_write+0xc6/0x180
> [  125.782138]  [<ffffffff81166ada>] sys_write+0x5a/0xa0
> [  125.782185]  [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [  125.782242]  [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b
> 
> This commit therefore adds the needed RCU read-side critical section.
> 
> Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d4569e..36f2608 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4474,6 +4474,7 @@ static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
>  void sched_show_task(struct task_struct *p)
>  {
>  	unsigned long free = 0;
> +	int ppid;
>  	unsigned state;
>  
>  	state = p->state ? __ffs(p->state) + 1 : 0;
> @@ -4493,8 +4494,11 @@ void sched_show_task(struct task_struct *p)
>  #ifdef CONFIG_DEBUG_STACK_USAGE
>  	free = stack_not_used(p);
>  #endif
> +	rcu_read_lock();
> +	ppid = task_pid_nr(rcu_dereference(p->real_parent));
> +	rcu_read_unlock();
>  	printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free,
> -		task_pid_nr(p), task_pid_nr(rcu_dereference(p->real_parent)),
> +		task_pid_nr(p), ppid,
>  		(unsigned long)task_thread_info(p)->flags);
>  
>  	show_stack(p, NULL);
> 
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: Linux 3.7-rc3
From: Paul E. McKenney @ 2012-11-07 22:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Daniel Vetter, Linux Kernel Mailing List,
	Linux PM list, Greg Kroah-Hartman, David Airlie, Michal Hocko,
	Jiri Kosina, Ingo Molnar, Peter Zijlstra
In-Reply-To: <CA+55aFx+w2hFifGTr_u76D9d_sDKxKJBDd2L4p0GVk-RXCczMA@mail.gmail.com>

On Fri, Nov 02, 2012 at 04:10:34PM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2012 at 3:43 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >
> > Well, not everything is rosy in the suspend land, though.  This is a
> > failure to freeze khubd during the second in a row attempt to suspend to
> > RAM (your current tree):
> 
> Ugh. So khubd is blocked in usb_start_wait_urb(), and apparently the
> timeout for that block is longer than the freezing timeout.
> 
> There's a comment about why khubd needs to be freezable, but I wonder
> if that whole thing isn't doing something wrong. Causing the suspend
> to fail is definitely always the wrong thing.
> 
> Greg?
> 
> > [  125.780766] [ INFO: suspicious RCU usage. ]
> > [  125.780804] 3.7.0-rc3+ #988 Not tainted
> > [  125.780838] -------------------------------
> > [  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
> 
> Heh. The RCU usage is from the debug printout from sched_show_task(),
> so it's "related", but it's a totally independent issue.
> 
> It's apparently because we've not done a "rcu_read_lock()" around that
> sequence, but I seriously doubt we care. But it's technically a real
> bug - even if the fix might be to just not print out the parent pid
> (or to just ignore the bug and turn the rcu dereference into an
> ACCESS_ONCE() or something.
> 
> Ingo, Peter, any comments about that sched/core.c:4497 RCU usage?

Rafael, does the following patch fix that problem?

							Thanx, Paul

------------------------------------------------------------------------

sched: Mark RCU reader in sched_show_task()

When sched_show_task() is invoked from try_to_freeze_tasks(), there is
no RCU read-side critical section, resulting in the following splat:

[  125.780730] ===============================
[  125.780766] [ INFO: suspicious RCU usage. ]
[  125.780804] 3.7.0-rc3+ #988 Not tainted
[  125.780838] -------------------------------
[  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
[  125.780946]
[  125.780946] other info that might help us debug this:
[  125.780946]
[  125.781031]
[  125.781031] rcu_scheduler_active = 1, debug_locks = 0
[  125.781087] 4 locks held by s2ram/4211:
[  125.781120]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160
[  125.781233]  #1:  (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160
[  125.781339]  #2:  (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230
[  125.781439]  #3:  (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0
[  125.781543]
[  125.781543] stack backtrace:
[  125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988
[  125.781632] Call Trace:
[  125.781662]  [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140
[  125.781719]  [<ffffffff8107cf21>] sched_show_task+0x121/0x180
[  125.781770]  [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0
[  125.781823]  [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80
[  125.781876]  [<ffffffff81090b65>] pm_suspend+0x165/0x230
[  125.781924]  [<ffffffff8108fa29>] state_store+0x99/0x100
[  125.781975]  [<ffffffff812f5867>] kobj_attr_store+0x17/0x20
[  125.782038]  [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160
[  125.782091]  [<ffffffff811667a6>] vfs_write+0xc6/0x180
[  125.782138]  [<ffffffff81166ada>] sys_write+0x5a/0xa0
[  125.782185]  [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  125.782242]  [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b

This commit therefore adds the needed RCU read-side critical section.

Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d4569e..36f2608 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4474,6 +4474,7 @@ static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 void sched_show_task(struct task_struct *p)
 {
 	unsigned long free = 0;
+	int ppid;
 	unsigned state;
 
 	state = p->state ? __ffs(p->state) + 1 : 0;
@@ -4493,8 +4494,11 @@ void sched_show_task(struct task_struct *p)
 #ifdef CONFIG_DEBUG_STACK_USAGE
 	free = stack_not_used(p);
 #endif
+	rcu_read_lock();
+	ppid = task_pid_nr(rcu_dereference(p->real_parent));
+	rcu_read_unlock();
 	printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free,
-		task_pid_nr(p), task_pid_nr(rcu_dereference(p->real_parent)),
+		task_pid_nr(p), ppid,
 		(unsigned long)task_thread_info(p)->flags);
 
 	show_stack(p, NULL);


^ permalink raw reply related

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-07 21:56 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <5835880.VrlCHNcBeW@vostro.rjw.lan>

On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:

> > Right.  The reasoning behind my proposal goes like this: When there's
> > no driver, the subsystem can let userspace directly control the
> > device's power level through the power/control attribute.
> 
> Well, we might as well just leave the runtime PM of PCI devices enabled, even
> if they have no drivers, but modify the PCI bus type's runtime PM callbacks
> to ignore devices with no drivers.
> 
> IIRC the reason why we decided to disable runtime PM for PCI device with no
> drivers was that some of them refused to work again after being put by the
> core into D3.  By making the PCI bus type's runtime PM callbacks ignore them
> we'd avoid this problem without modifying the core's behavior.

It comes down to a question of the parent.  If a driverless PCI device
isn't being used, shouldn't its parent be allowed to go into runtime
suspend?  As things stand now, we do allow it.

The problem is that we don't disallow it when the driverless device
_is_ being used.

Alan Stern

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-07 21:44 UTC (permalink / raw)
  To: Alan Stern; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211071544110.1188-100000@iolanthe.rowland.org>

On Wednesday, November 07, 2012 03:47:02 PM Alan Stern wrote:
> On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> 
> > On Wednesday, November 07, 2012 12:17:02 PM Alan Stern wrote:
> > > On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> > > 
> > > > > The PCI subsystem assumes that 
> > > > > driverless devices are not in use, so they are disabled for runtime PM 
> > > > > and marked as suspended.  This is not appropriate for VGA devices, 
> > > > > which can indeed be used without a driver.
> > > > > 
> > > > > I'm not sure what the best solution is.  Maybe we should Ying's 
> > > > > proposal a step farther:
> > > > > 
> > > > > 	Make pm_runtime_set_suspended() fail if runtime PM is 
> > > > > 	forbidden.
> > > > >
> > > > > 	Make pm_runtime_forbid() call pm_runtime_set_active()
> > > > > 	(and do a runtime resume of the parent) if disable_depth > 0.
> > > > 
> > > > I'd prefer this one.
> > > 
> > > That wasn't meant to be a choice.  The first item is close to what the 
> > > original patch did; I was suggesting that we should adopt all three 
> > > items.
> > 
> > OK, I need to think about this a bit more, then.
> > 
> > The problem seems to be that our initial assumption, ie. that driverless
> > devices won't be in use, is not satisfied in the relevant case.  It may
> > not be satisfied in more cases like this, I suppose, but so far we  don't
> > really know.
> 
> Right.  The reasoning behind my proposal goes like this: When there's
> no driver, the subsystem can let userspace directly control the
> device's power level through the power/control attribute.

Well, we might as well just leave the runtime PM of PCI devices enabled, even
if they have no drivers, but modify the PCI bus type's runtime PM callbacks
to ignore devices with no drivers.

IIRC the reason why we decided to disable runtime PM for PCI device with no
drivers was that some of them refused to work again after being put by the
core into D3.  By making the PCI bus type's runtime PM callbacks ignore them
we'd avoid this problem without modifying the core's behavior.

Thanks,
Rafael

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH 1/4] cpufreq: OMAP: if an iva clock name is specified, load iva resources
From: Nishanth Menon @ 2012-11-07 21:07 UTC (permalink / raw)
  To: jemele
  Cc: Kevin Hilman, Rafael J. Wysocki, linux-omap, cpufreq, linux-pm,
	linux-kernel
In-Reply-To: <CALH_86ST1kF_csak_75Dk7dDupOqdCT=H11MKBeLsiyqi5vd_w@mail.gmail.com>

On 11:16-20121107, Joshua Emele wrote:
> On Wed, Nov 7, 2012 at 6:53 AM, Nishanth Menon <nm@ti.com> wrote:
> 
> > On 17:47-20121106, Joshua Emele wrote:
> > > +static int __cpuinit omap_iva_init(struct cpufreq_policy *policy)
> > > +{
> > > +     int result;
> > > +     if (!iva_clk_name) {
> > > +             pr_info("%s: iva unavailable\n", __func__);
> > > +             return 0;
> > > +     }
> > > +     iva_dev = omap_device_get_by_hwmod_name("iva");
> >
> > NAK. Reasons as follows:
> > a) cpufreq is purely meant for cpu, not IVA. we should instead be using
> > devfreq which is designed around the usage for co-processor
> > b) clubbing ARM's frequency decision is definitely NOT equal to IVA
> > frequency decision, not only is it wrong in terms of modularization, it
> > is wrong in terms of power benefits as well (not to mention weirdness
> > needed when you look at all OMAP SoC variants)
> > c) DVFS is not trivial around multiple co-processor transitions -> core
> > OPPs (as dependent OPP) needs to be addressed as well. ideally common
> > clock framework could take care of clock dependencies.
> It looks like you've done some work on an omap devfreq coprocessor driver (
> http://pastebin.pandaboard.org/index.php/view/85100576). Are there any
> plans to merge this driver?
It is an sample driver about how it could be done - this was not in any
form meant for merge. there are few angles to it:
a) ability for coprocessors to provide load and idle information back to
master processor
b) integrating it into regular existing framework.

Yes, the need is definitely identified, there are many different
solutions floating around in TI kernel forks, devfreq at the moment
seems to be the rationale choice in terms of a solution that is aligned
with community needs, so we are slowly building towards it.

Any contributions towards that is very appreciated ofcourse :). the
quick reference to latest code meant for 3.8 rc1 target could be found
here:
http://git.kernel.org/?p=linux/kernel/git/mzx/devfreq.git;a=summary

-- 
Regards,
Nishanth Menon

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-07 20:47 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <4585176.YTc0KTJLfO@vostro.rjw.lan>

On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:

> On Wednesday, November 07, 2012 12:17:02 PM Alan Stern wrote:
> > On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> > 
> > > > The PCI subsystem assumes that 
> > > > driverless devices are not in use, so they are disabled for runtime PM 
> > > > and marked as suspended.  This is not appropriate for VGA devices, 
> > > > which can indeed be used without a driver.
> > > > 
> > > > I'm not sure what the best solution is.  Maybe we should Ying's 
> > > > proposal a step farther:
> > > > 
> > > > 	Make pm_runtime_set_suspended() fail if runtime PM is 
> > > > 	forbidden.
> > > >
> > > > 	Make pm_runtime_forbid() call pm_runtime_set_active()
> > > > 	(and do a runtime resume of the parent) if disable_depth > 0.
> > > 
> > > I'd prefer this one.
> > 
> > That wasn't meant to be a choice.  The first item is close to what the 
> > original patch did; I was suggesting that we should adopt all three 
> > items.
> 
> OK, I need to think about this a bit more, then.
> 
> The problem seems to be that our initial assumption, ie. that driverless
> devices won't be in use, is not satisfied in the relevant case.  It may
> not be satisfied in more cases like this, I suppose, but so far we  don't
> really know.

Right.  The reasoning behind my proposal goes like this: When there's
no driver, the subsystem can let userspace directly control the
device's power level through the power/control attribute.

Alan Stern


^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-07 20:21 UTC (permalink / raw)
  To: Alan Stern; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211071214190.1175-100000@iolanthe.rowland.org>

On Wednesday, November 07, 2012 12:17:02 PM Alan Stern wrote:
> On Wed, 7 Nov 2012, Rafael J. Wysocki wrote:
> 
> > > The PCI subsystem assumes that 
> > > driverless devices are not in use, so they are disabled for runtime PM 
> > > and marked as suspended.  This is not appropriate for VGA devices, 
> > > which can indeed be used without a driver.
> > > 
> > > I'm not sure what the best solution is.  Maybe we should Ying's 
> > > proposal a step farther:
> > > 
> > > 	Make pm_runtime_set_suspended() fail if runtime PM is 
> > > 	forbidden.
> > >
> > > 	Make pm_runtime_forbid() call pm_runtime_set_active()
> > > 	(and do a runtime resume of the parent) if disable_depth > 0.
> > 
> > I'd prefer this one.
> 
> That wasn't meant to be a choice.  The first item is close to what the 
> original patch did; I was suggesting that we should adopt all three 
> items.

OK, I need to think about this a bit more, then.

The problem seems to be that our initial assumption, ie. that driverless
devices won't be in use, is not satisfied in the relevant case.  It may
not be satisfied in more cases like this, I suppose, but so far we  don't
really know.

Thanks,
Rafael

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
From: Srivatsa S. Bhat @ 2012-11-07 20:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, mgorman, mjg59, paulmck, maxime.coquelin, loic.pallardy,
	arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita,
	amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm,
	linux-mm, linux-kernel
In-Reply-To: <509985DE.8000508@linux.vnet.ibm.com>

On 11/07/2012 03:19 AM, Dave Hansen wrote:
> On 11/06/2012 11:53 AM, Srivatsa S. Bhat wrote:
>> This is the main change - we keep the pageblocks in region-sorted order,
>> where pageblocks belonging to region-0 come first, followed by those belonging
>> to region-1 and so on. But the pageblocks within a given region need *not* be
>> sorted, since we need them to be only region-sorted and not fully
>> address-sorted.
>>
>> This sorting is performed when adding pages back to the freelists, thus
>> avoiding any region-related overhead in the critical page allocation
>> paths.
> 
> It's probably _better_ to do it at free time than alloc, but it's still
> pretty bad to be doing a linear walk over a potentially 256-entry array
> holding the zone lock.  The overhead is going to show up somewhere.  How
> does this do with a kernel compile?  Looks like exit() when a process
> has a bunch of memory might get painful.
> 

As I mentioned in the cover-letter, kernbench numbers haven't shown any
observable performance degradation. On the contrary, (as unbelievable as it
may sound), they actually indicate a slight performance *improvement* with my
patchset! I'm trying to figure out what could be the reason behind that.

Going forward, we could try to optimize the sorting logic in the free()
part, but in any case, IMHO that's the right place to push the overhead to,
since the performance of free() is not expected to be _that_ critical (unlike
alloc()) for overall system performance.

Regards,
Srivatsa S. Bhat


^ permalink raw reply

* Re: [RFC PATCH 1/8] mm: Introduce memory regions data-structure to capture region boundaries within node
From: Srivatsa S. Bhat @ 2012-11-07 20:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, mgorman, mjg59, paulmck, maxime.coquelin, loic.pallardy,
	arjan, kmpark, kamezawa.hiroyu, lenb, rjw, gargankita,
	amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm,
	linux-mm, linux-kernel
In-Reply-To: <50999755.4000209@linux.vnet.ibm.com>

On 11/07/2012 04:33 AM, Dave Hansen wrote:
> On 11/06/2012 11:52 AM, Srivatsa S. Bhat wrote:
>> But of course, memory regions are sub-divisions *within* a node, so it makes
>> sense to keep the data-structures in the node's struct pglist_data. (Thus
>> this placement makes memory regions parallel to zones in that node).
> 
> I think it's pretty silly to create *ANOTHER* subdivision of memory
> separate from sparsemem.  One that doesn't handle large amounts of
> memory or scale with memory hotplug.  As it stands, you can only support
> 256*512MB=128GB of address space, which seems pretty puny.
> 
> This node_regions[]:
> 
>> @@ -687,6 +698,8 @@ typedef struct pglist_data {
>>  	struct zone node_zones[MAX_NR_ZONES];
>>  	struct zonelist node_zonelists[MAX_ZONELISTS];
>>  	int nr_zones;
>> +	struct node_mem_region node_regions[MAX_NR_REGIONS];
>> +	int nr_node_regions;
>>  #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
>>  	struct page *node_mem_map;
>>  #ifdef CONFIG_MEMCG
> 
> looks like it's indexed the same way regardless of which node it is in.
>  In other words, if there are two nodes, at least half of it is wasted,
> and 3/4 if there are four nodes.  That seems a bit suboptimal.
> 

You're right, I have not addressed that problem in this initial RFC. Thanks
for pointing it out! Going forward, we can surely optimize the way we deal
with memory regions on NUMA systems, using some of the sparsemem techniques.

> Could you remind us of the logic for leaving sparsemem out of the
> equation here?
> 

Nothing, its just that in this first RFC I was more focussed towards getting
the overall design right, in terms of having an acceptable way of tracking
pages belonging to different regions within the page allocator (freelists)
and using it to influence page allocation decisions. And also to compare
the merits of this approach over the previous "Hierarchy" design, in a broad
("big picture") sense.

I'll add the above point you raised in my todo-list and address it in
subsequent versions of the patchset.

Thank you very much for the quick feedback!

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH 6/6 v4] cpufreq, highbank: add support for highbank cpufreq
From: Rob Herring @ 2012-11-07 18:51 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Rafael J. Wysocki, devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cpufreq-u79uwXL29TY76Z2rM5mHXA, linux-pm-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1352313166-28980-7-git-send-email-mark.langsdorf-bsGFqQB8/DxBDgjK7y7TUQ@public.gmane.org>

On 11/07/2012 12:32 PM, Mark Langsdorf wrote:
> Highbank processors depend on the external ECME to perform voltage
> management based on a requested frequency. Communication between the
> highbank and ECME cores happens over the pl320 IPC channel.
> 
> Signed-off-by: Mark Langsdorf <mark.langsdorf-bsGFqQB8/DxBDgjK7y7TUQ@public.gmane.org>
> Cc: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org
> Cc: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
> ---
> Changes from v3
> 	None
> Changes from v2
> 	Changed transition latency binding in code to match documentation
> Changes from v1
>         Added highbank specific Kconfig changes
> 
>  .../bindings/cpufreq/highbank-cpufreq.txt          |  53 +++++
>  arch/arm/Kconfig                                   |   2 +
>  arch/arm/boot/dts/highbank.dts                     |  10 +
>  arch/arm/mach-highbank/Kconfig                     |   2 +
>  drivers/cpufreq/Kconfig.arm                        |  15 ++
>  drivers/cpufreq/Makefile                           |   1 +
>  drivers/cpufreq/highbank-cpufreq.c                 | 229 +++++++++++++++++++++
>  7 files changed, 312 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt
>  create mode 100644 drivers/cpufreq/highbank-cpufreq.c
> 
> diff --git a/Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt b/Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt
> new file mode 100644
> index 0000000..3ec2cec
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt
> @@ -0,0 +1,53 @@
> +Highbank cpufreq driver
> +
> +This is cpufreq driver for Calxeda ECX-1000 (highbank) processor. It is based
> +on the generic cpu0 driver and uses a similar format for bindings. Since
> +the EnergyCore Management Engine maintains the voltage based on the
> +frequency, the voltage component of the operating points can be set to any
> +arbitrary values.
> +
> +Both required properties listed below must be defined under node /cpus/cpu@0.
> +
> +Required properties:
> +- operating-points: Refer to Documentation/devicetree/bindings/power/opp.txt
> +  for details
> +- transition-latency: Specify the possible maximum transition latency for clock,
> +  in unit of nanoseconds.
> +
> +Examples:
> +
> +cpus {
> +	#address-cells = <1>;
> +	#size-cells = <0>;
> +
> +	cpu@0 {
> +		compatible = "arm,cortex-a9";
> +		reg = <0>;
> +		next-level-cache = <&L2>;
> +		operating-points = <
> +			/* kHz  ignored */
> +			790000  1000000
> +			396000  1000000
> +			198000  1000000
> +		>;
> +		transition-latency = <200000>;
> +	};
> +
> +	cpu@1 {
> +		compatible = "arm,cortex-a9";
> +		reg = <1>;
> +		next-level-cache = <&L2>;
> +	};
> +
> +	cpu@2 {
> +		compatible = "arm,cortex-a9";
> +		reg = <2>;
> +		next-level-cache = <&L2>;
> +	};
> +
> +	cpu@3 {
> +		compatible = "arm,cortex-a9";
> +		reg = <3>;
> +		next-level-cache = <&L2>;
> +	};
> +};
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index ade7e92..4ed0b7b 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -391,6 +391,8 @@ config ARCH_SIRF
>  	select PINCTRL
>  	select PINCTRL_SIRF
>  	select USE_OF
> +	select ARCH_HAS_CPUFREQ
> +	select ARCH_HAS_OPP

This hunk needs to be removed.

>  	help
>  	  Support for CSR SiRFprimaII/Marco/Polo platforms
>  
> diff --git a/arch/arm/boot/dts/highbank.dts b/arch/arm/boot/dts/highbank.dts
> index 0c6fc34..7c4c27d 100644
> --- a/arch/arm/boot/dts/highbank.dts
> +++ b/arch/arm/boot/dts/highbank.dts
> @@ -36,6 +36,16 @@
>  			next-level-cache = <&L2>;
>  			clocks = <&a9pll>;
>  			clock-names = "cpu";
> +			operating-points = <
> +				/* kHz    ignored */
> +				 1300000  1000000
> +				 1200000  1000000
> +				 1100000  1000000
> +				  800000  1000000
> +				  400000  1000000
> +				  200000  1000000
> +			>;
> +			transition-latency = <100000>;
>  		};
>  
>  		cpu@1 {
> diff --git a/arch/arm/mach-highbank/Kconfig b/arch/arm/mach-highbank/Kconfig
> index 0e1d0a4..ee83af6 100644
> --- a/arch/arm/mach-highbank/Kconfig
> +++ b/arch/arm/mach-highbank/Kconfig
> @@ -13,3 +13,5 @@ config ARCH_HIGHBANK
>  	select HAVE_SMP
>  	select SPARSE_IRQ
>  	select USE_OF
> +	select ARCH_HAS_CPUFREQ
> +	select ARCH_HAS_OPP

Sort these alphabetically.

> diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
> index 5961e64..bc3ef55 100644
> --- a/drivers/cpufreq/Kconfig.arm
> +++ b/drivers/cpufreq/Kconfig.arm
> @@ -76,3 +76,18 @@ config ARM_EXYNOS5250_CPUFREQ
>  	help
>  	  This adds the CPUFreq driver for Samsung EXYNOS5250
>  	  SoC.
> +
> +config ARM_HIGHBANK_CPUFREQ
> +       tristate "Calxeda Highbank-based"
> +       depends on ARCH_HIGHBANK
> +       select CPU_FREQ_TABLE
> +       select HAVE_CLK

ARCH_HIGHBANK already selects this.

> +       select PM_OPP
> +       select OF

And this.

> +       default m
> +       help
> +         This adds the CPUFreq driver for Calxeda Highbank SoC
> +         based boards.
> +
> +         If in doubt, say N.
> +
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index 1bc90e1..9e8f12a 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -50,6 +50,7 @@ obj-$(CONFIG_ARM_EXYNOS4210_CPUFREQ)	+= exynos4210-cpufreq.o
>  obj-$(CONFIG_ARM_EXYNOS4X12_CPUFREQ)	+= exynos4x12-cpufreq.o
>  obj-$(CONFIG_ARM_EXYNOS5250_CPUFREQ)	+= exynos5250-cpufreq.o
>  obj-$(CONFIG_ARM_OMAP2PLUS_CPUFREQ)     += omap-cpufreq.o
> +obj-$(CONFIG_ARM_HIGHBANK_CPUFREQ)	+= highbank-cpufreq.o
>  
>  ##################################################################################
>  # PowerPC platform drivers
> diff --git a/drivers/cpufreq/highbank-cpufreq.c b/drivers/cpufreq/highbank-cpufreq.c
> new file mode 100644
> index 0000000..a167073
> --- /dev/null
> +++ b/drivers/cpufreq/highbank-cpufreq.c
> @@ -0,0 +1,229 @@
> +/*
> + * Copyright (C) 2012 Calxeda, Inc.
> + *
> + * derived from cpufreq-cpu0 by Freescale Semiconductor
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
> +
> +#include <linux/clk.h>
> +#include <linux/cpu.h>
> +#include <linux/cpufreq.h>
> +#include <linux/err.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/opp.h>
> +#include <linux/slab.h>
> +#include <asm/pl320-ipc.h>
> +
> +#define HB_CPUFREQ_CHANGE_NOTE 0x80000001
> +
> +static unsigned int transition_latency;
> +
> +static struct device *cpu_dev;
> +static struct clk *cpu_clk;
> +static struct cpufreq_frequency_table *freq_table;
> +
> +static int hb_verify_speed(struct cpufreq_policy *policy)
> +{
> +	return cpufreq_frequency_table_verify(policy, freq_table);
> +}
> +
> +static unsigned int hb_get_speed(unsigned int cpu)
> +{
> +	return clk_get_rate(cpu_clk) / 1000;
> +}
> +
> +static int hb_voltage_change(unsigned int freq)
> +{
> +	int i;
> +	u32 msg[7];
> +
> +	msg[0] = HB_CPUFREQ_CHANGE_NOTE;
> +	msg[1] = freq / 1000;
> +	for (i = 2; i < 7; i++)
> +		msg[i] = 0;
> +
> +	return ipc_call_slow(msg);
> +}
> +
> +static int hb_set_target(struct cpufreq_policy *policy,
> +			   unsigned int target_freq, unsigned int relation)
> +{
> +	struct cpufreq_freqs freqs;
> +	unsigned long freq_Hz;
> +	unsigned int index, cpu;
> +	int ret;
> +
> +	ret = cpufreq_frequency_table_target(policy, freq_table, target_freq,
> +					     relation, &index);
> +	if (ret) {
> +		pr_err("failed to match target freqency %d: %d\n",
> +		       target_freq, ret);
> +		return ret;
> +	}
> +
> +	freq_Hz = clk_round_rate(cpu_clk, freq_table[index].frequency * 1000);
> +	if (freq_Hz < 0)
> +		freq_Hz = freq_table[index].frequency * 1000;
> +	freqs.new = freq_Hz / 1000;
> +	freqs.old = clk_get_rate(cpu_clk) / 1000;
> +
> +	if (freqs.old == freqs.new)
> +		return 0;
> +
> +	for_each_online_cpu(cpu) {
> +		freqs.cpu = cpu;
> +		cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
> +	}
> +
> +	pr_debug("%u MHz --> %u MHz\n", freqs.old / 1000, freqs.new / 1000);
> +
> +	/* scaling up?  scale voltage before frequency */
> +	if (freqs.new > freqs.old) {
> +		ret = hb_voltage_change(freqs.new);
> +		if (ret) {
> +			freqs.new = freqs.old;
> +			return -EAGAIN;
> +		}
> +	}
> +
> +	ret = clk_set_rate(cpu_clk, freqs.new * 1000);
> +	if (ret) {
> +		pr_err("failed to set clock rate: %d\n", ret);
> +		hb_voltage_change(freqs.old);
> +		return ret;
> +	}
> +
> +	/* scaling down?  scale voltage after frequency */
> +	if (freqs.new < freqs.old) {
> +		ret = hb_voltage_change(freqs.new);
> +		if (ret) {
> +			if (clk_set_rate(cpu_clk, freqs.old * 1000))
> +				pr_err("also failed to reset freq\n");
> +			freqs.new = freqs.old;
> +			return -EAGAIN;
> +		}
> +	}
> +
> +	for_each_online_cpu(cpu) {
> +		freqs.cpu = cpu;
> +		cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
> +	}
> +
> +	return 0;
> +}
> +
> +static int hb_cpufreq_init(struct cpufreq_policy *policy)
> +{
> +	int ret;
> +
> +	if (policy->cpu != 0)
> +		return -EINVAL;
> +
> +	ret = cpufreq_frequency_table_cpuinfo(policy, freq_table);
> +	if (ret) {
> +		pr_err("invalid frequency table: %d\n", ret);
> +		return ret;
> +	}
> +
> +	policy->cpuinfo.transition_latency = transition_latency;
> +	policy->cur = clk_get_rate(cpu_clk) / 1000;
> +
> +	policy->shared_type = CPUFREQ_SHARED_TYPE_ANY;
> +	cpumask_setall(policy->cpus);
> +
> +	cpufreq_frequency_table_get_attr(freq_table, policy->cpu);
> +
> +	return 0;
> +}
> +
> +static int hb_cpufreq_exit(struct cpufreq_policy *policy)
> +{
> +	cpufreq_frequency_table_put_attr(policy->cpu);
> +
> +	return 0;
> +}
> +
> +static struct freq_attr *hb_cpufreq_attr[] = {
> +	&cpufreq_freq_attr_scaling_available_freqs,
> +	NULL,
> +};
> +
> +static struct cpufreq_driver hb_cpufreq_driver = {
> +	.flags = CPUFREQ_STICKY,
> +	.verify = hb_verify_speed,
> +	.target = hb_set_target,
> +	.get = hb_get_speed,
> +	.init = hb_cpufreq_init,
> +	.exit = hb_cpufreq_exit,
> +	.name = "highbank-cpufreq",
> +	.attr = hb_cpufreq_attr,
> +};
> +
> +static int __devinit hb_cpufreq_driver_init(void)
> +{
> +	struct device_node *np;
> +	int ret;
> +
> +	np = of_find_node_by_path("/cpus/cpu@0");
> +	if (!np) {
> +		pr_err("failed to find highbank cpufreq node\n");
> +		return -ENOENT;
> +	}
> +
> +	cpu_dev = get_cpu_device(0);
> +	if (!cpu_dev) {
> +		pr_err("failed to get highbank cpufreq device\n");
> +		ret = -ENODEV;
> +		goto out_put_node;
> +	}
> +
> +	cpu_dev->of_node = np;
> +
> +	cpu_clk = clk_get(cpu_dev, NULL);
> +	if (IS_ERR(cpu_clk)) {
> +		ret = PTR_ERR(cpu_clk);
> +		pr_err("failed to get cpu0 clock: %d\n", ret);
> +		goto out_put_node;
> +	}
> +
> +	ret = of_init_opp_table(cpu_dev);
> +	if (ret) {
> +		pr_err("failed to init OPP table: %d\n", ret);
> +		goto out_put_node;
> +	}
> +
> +	ret = opp_init_cpufreq_table(cpu_dev, &freq_table);
> +	if (ret) {
> +		pr_err("failed to init cpufreq table: %d\n", ret);
> +		goto out_put_node;
> +	}
> +
> +	if (of_property_read_u32(np, "transition-latency", &transition_latency))
> +		transition_latency = CPUFREQ_ETERNAL;
> +
> +	ret = cpufreq_register_driver(&hb_cpufreq_driver);
> +	if (ret) {
> +		pr_err("failed register driver: %d\n", ret);
> +		goto out_free_table;
> +	}
> +
> +	of_node_put(np);
> +	return 0;
> +
> +out_free_table:
> +	opp_free_cpufreq_table(cpu_dev, &freq_table);
> +out_put_node:
> +	of_node_put(np);
> +	return ret;
> +}
> +late_initcall(hb_cpufreq_driver_init);
> +
> +MODULE_AUTHOR("Mark Langsdorf <mark.langsdorf-bsGFqQB8/DxBDgjK7y7TUQ@public.gmane.org>");
> +MODULE_DESCRIPTION("Calxeda Highbank cpufreq driver");
> +MODULE_LICENSE("GPL");
> 

^ permalink raw reply

* [PATCH 6/6 v4] cpufreq, highbank: add support for highbank cpufreq
From: Mark Langsdorf @ 2012-11-07 18:32 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cpufreq-u79uwXL29TY76Z2rM5mHXA, linux-pm-u79uwXL29TY76Z2rM5mHXA
  Cc: Rafael J. Wysocki, devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	Mark Langsdorf
In-Reply-To: <1352313166-28980-1-git-send-email-mark.langsdorf-bsGFqQB8/DxBDgjK7y7TUQ@public.gmane.org>

Highbank processors depend on the external ECME to perform voltage
management based on a requested frequency. Communication between the
highbank and ECME cores happens over the pl320 IPC channel.

Signed-off-by: Mark Langsdorf <mark.langsdorf-bsGFqQB8/DxBDgjK7y7TUQ@public.gmane.org>
Cc: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org
Cc: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
---
Changes from v3
	None
Changes from v2
	Changed transition latency binding in code to match documentation
Changes from v1
        Added highbank specific Kconfig changes

 .../bindings/cpufreq/highbank-cpufreq.txt          |  53 +++++
 arch/arm/Kconfig                                   |   2 +
 arch/arm/boot/dts/highbank.dts                     |  10 +
 arch/arm/mach-highbank/Kconfig                     |   2 +
 drivers/cpufreq/Kconfig.arm                        |  15 ++
 drivers/cpufreq/Makefile                           |   1 +
 drivers/cpufreq/highbank-cpufreq.c                 | 229 +++++++++++++++++++++
 7 files changed, 312 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt
 create mode 100644 drivers/cpufreq/highbank-cpufreq.c

diff --git a/Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt b/Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt
new file mode 100644
index 0000000..3ec2cec
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/highbank-cpufreq.txt
@@ -0,0 +1,53 @@
+Highbank cpufreq driver
+
+This is cpufreq driver for Calxeda ECX-1000 (highbank) processor. It is based
+on the generic cpu0 driver and uses a similar format for bindings. Since
+the EnergyCore Management Engine maintains the voltage based on the
+frequency, the voltage component of the operating points can be set to any
+arbitrary values.
+
+Both required properties listed below must be defined under node /cpus/cpu@0.
+
+Required properties:
+- operating-points: Refer to Documentation/devicetree/bindings/power/opp.txt
+  for details
+- transition-latency: Specify the possible maximum transition latency for clock,
+  in unit of nanoseconds.
+
+Examples:
+
+cpus {
+	#address-cells = <1>;
+	#size-cells = <0>;
+
+	cpu@0 {
+		compatible = "arm,cortex-a9";
+		reg = <0>;
+		next-level-cache = <&L2>;
+		operating-points = <
+			/* kHz  ignored */
+			790000  1000000
+			396000  1000000
+			198000  1000000
+		>;
+		transition-latency = <200000>;
+	};
+
+	cpu@1 {
+		compatible = "arm,cortex-a9";
+		reg = <1>;
+		next-level-cache = <&L2>;
+	};
+
+	cpu@2 {
+		compatible = "arm,cortex-a9";
+		reg = <2>;
+		next-level-cache = <&L2>;
+	};
+
+	cpu@3 {
+		compatible = "arm,cortex-a9";
+		reg = <3>;
+		next-level-cache = <&L2>;
+	};
+};
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ade7e92..4ed0b7b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -391,6 +391,8 @@ config ARCH_SIRF
 	select PINCTRL
 	select PINCTRL_SIRF
 	select USE_OF
+	select ARCH_HAS_CPUFREQ
+	select ARCH_HAS_OPP
 	help
 	  Support for CSR SiRFprimaII/Marco/Polo platforms
 
diff --git a/arch/arm/boot/dts/highbank.dts b/arch/arm/boot/dts/highbank.dts
index 0c6fc34..7c4c27d 100644
--- a/arch/arm/boot/dts/highbank.dts
+++ b/arch/arm/boot/dts/highbank.dts
@@ -36,6 +36,16 @@
 			next-level-cache = <&L2>;
 			clocks = <&a9pll>;
 			clock-names = "cpu";
+			operating-points = <
+				/* kHz    ignored */
+				 1300000  1000000
+				 1200000  1000000
+				 1100000  1000000
+				  800000  1000000
+				  400000  1000000
+				  200000  1000000
+			>;
+			transition-latency = <100000>;
 		};
 
 		cpu@1 {
diff --git a/arch/arm/mach-highbank/Kconfig b/arch/arm/mach-highbank/Kconfig
index 0e1d0a4..ee83af6 100644
--- a/arch/arm/mach-highbank/Kconfig
+++ b/arch/arm/mach-highbank/Kconfig
@@ -13,3 +13,5 @@ config ARCH_HIGHBANK
 	select HAVE_SMP
 	select SPARSE_IRQ
 	select USE_OF
+	select ARCH_HAS_CPUFREQ
+	select ARCH_HAS_OPP
diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index 5961e64..bc3ef55 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -76,3 +76,18 @@ config ARM_EXYNOS5250_CPUFREQ
 	help
 	  This adds the CPUFreq driver for Samsung EXYNOS5250
 	  SoC.
+
+config ARM_HIGHBANK_CPUFREQ
+       tristate "Calxeda Highbank-based"
+       depends on ARCH_HIGHBANK
+       select CPU_FREQ_TABLE
+       select HAVE_CLK
+       select PM_OPP
+       select OF
+       default m
+       help
+         This adds the CPUFreq driver for Calxeda Highbank SoC
+         based boards.
+
+         If in doubt, say N.
+
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 1bc90e1..9e8f12a 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_ARM_EXYNOS4210_CPUFREQ)	+= exynos4210-cpufreq.o
 obj-$(CONFIG_ARM_EXYNOS4X12_CPUFREQ)	+= exynos4x12-cpufreq.o
 obj-$(CONFIG_ARM_EXYNOS5250_CPUFREQ)	+= exynos5250-cpufreq.o
 obj-$(CONFIG_ARM_OMAP2PLUS_CPUFREQ)     += omap-cpufreq.o
+obj-$(CONFIG_ARM_HIGHBANK_CPUFREQ)	+= highbank-cpufreq.o
 
 ##################################################################################
 # PowerPC platform drivers
diff --git a/drivers/cpufreq/highbank-cpufreq.c b/drivers/cpufreq/highbank-cpufreq.c
new file mode 100644
index 0000000..a167073
--- /dev/null
+++ b/drivers/cpufreq/highbank-cpufreq.c
@@ -0,0 +1,229 @@
+/*
+ * Copyright (C) 2012 Calxeda, Inc.
+ *
+ * derived from cpufreq-cpu0 by Freescale Semiconductor
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/clk.h>
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/opp.h>
+#include <linux/slab.h>
+#include <asm/pl320-ipc.h>
+
+#define HB_CPUFREQ_CHANGE_NOTE 0x80000001
+
+static unsigned int transition_latency;
+
+static struct device *cpu_dev;
+static struct clk *cpu_clk;
+static struct cpufreq_frequency_table *freq_table;
+
+static int hb_verify_speed(struct cpufreq_policy *policy)
+{
+	return cpufreq_frequency_table_verify(policy, freq_table);
+}
+
+static unsigned int hb_get_speed(unsigned int cpu)
+{
+	return clk_get_rate(cpu_clk) / 1000;
+}
+
+static int hb_voltage_change(unsigned int freq)
+{
+	int i;
+	u32 msg[7];
+
+	msg[0] = HB_CPUFREQ_CHANGE_NOTE;
+	msg[1] = freq / 1000;
+	for (i = 2; i < 7; i++)
+		msg[i] = 0;
+
+	return ipc_call_slow(msg);
+}
+
+static int hb_set_target(struct cpufreq_policy *policy,
+			   unsigned int target_freq, unsigned int relation)
+{
+	struct cpufreq_freqs freqs;
+	unsigned long freq_Hz;
+	unsigned int index, cpu;
+	int ret;
+
+	ret = cpufreq_frequency_table_target(policy, freq_table, target_freq,
+					     relation, &index);
+	if (ret) {
+		pr_err("failed to match target freqency %d: %d\n",
+		       target_freq, ret);
+		return ret;
+	}
+
+	freq_Hz = clk_round_rate(cpu_clk, freq_table[index].frequency * 1000);
+	if (freq_Hz < 0)
+		freq_Hz = freq_table[index].frequency * 1000;
+	freqs.new = freq_Hz / 1000;
+	freqs.old = clk_get_rate(cpu_clk) / 1000;
+
+	if (freqs.old == freqs.new)
+		return 0;
+
+	for_each_online_cpu(cpu) {
+		freqs.cpu = cpu;
+		cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
+	}
+
+	pr_debug("%u MHz --> %u MHz\n", freqs.old / 1000, freqs.new / 1000);
+
+	/* scaling up?  scale voltage before frequency */
+	if (freqs.new > freqs.old) {
+		ret = hb_voltage_change(freqs.new);
+		if (ret) {
+			freqs.new = freqs.old;
+			return -EAGAIN;
+		}
+	}
+
+	ret = clk_set_rate(cpu_clk, freqs.new * 1000);
+	if (ret) {
+		pr_err("failed to set clock rate: %d\n", ret);
+		hb_voltage_change(freqs.old);
+		return ret;
+	}
+
+	/* scaling down?  scale voltage after frequency */
+	if (freqs.new < freqs.old) {
+		ret = hb_voltage_change(freqs.new);
+		if (ret) {
+			if (clk_set_rate(cpu_clk, freqs.old * 1000))
+				pr_err("also failed to reset freq\n");
+			freqs.new = freqs.old;
+			return -EAGAIN;
+		}
+	}
+
+	for_each_online_cpu(cpu) {
+		freqs.cpu = cpu;
+		cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
+	}
+
+	return 0;
+}
+
+static int hb_cpufreq_init(struct cpufreq_policy *policy)
+{
+	int ret;
+
+	if (policy->cpu != 0)
+		return -EINVAL;
+
+	ret = cpufreq_frequency_table_cpuinfo(policy, freq_table);
+	if (ret) {
+		pr_err("invalid frequency table: %d\n", ret);
+		return ret;
+	}
+
+	policy->cpuinfo.transition_latency = transition_latency;
+	policy->cur = clk_get_rate(cpu_clk) / 1000;
+
+	policy->shared_type = CPUFREQ_SHARED_TYPE_ANY;
+	cpumask_setall(policy->cpus);
+
+	cpufreq_frequency_table_get_attr(freq_table, policy->cpu);
+
+	return 0;
+}
+
+static int hb_cpufreq_exit(struct cpufreq_policy *policy)
+{
+	cpufreq_frequency_table_put_attr(policy->cpu);
+
+	return 0;
+}
+
+static struct freq_attr *hb_cpufreq_attr[] = {
+	&cpufreq_freq_attr_scaling_available_freqs,
+	NULL,
+};
+
+static struct cpufreq_driver hb_cpufreq_driver = {
+	.flags = CPUFREQ_STICKY,
+	.verify = hb_verify_speed,
+	.target = hb_set_target,
+	.get = hb_get_speed,
+	.init = hb_cpufreq_init,
+	.exit = hb_cpufreq_exit,
+	.name = "highbank-cpufreq",
+	.attr = hb_cpufreq_attr,
+};
+
+static int __devinit hb_cpufreq_driver_init(void)
+{
+	struct device_node *np;
+	int ret;
+
+	np = of_find_node_by_path("/cpus/cpu@0");
+	if (!np) {
+		pr_err("failed to find highbank cpufreq node\n");
+		return -ENOENT;
+	}
+
+	cpu_dev = get_cpu_device(0);
+	if (!cpu_dev) {
+		pr_err("failed to get highbank cpufreq device\n");
+		ret = -ENODEV;
+		goto out_put_node;
+	}
+
+	cpu_dev->of_node = np;
+
+	cpu_clk = clk_get(cpu_dev, NULL);
+	if (IS_ERR(cpu_clk)) {
+		ret = PTR_ERR(cpu_clk);
+		pr_err("failed to get cpu0 clock: %d\n", ret);
+		goto out_put_node;
+	}
+
+	ret = of_init_opp_table(cpu_dev);
+	if (ret) {
+		pr_err("failed to init OPP table: %d\n", ret);
+		goto out_put_node;
+	}
+
+	ret = opp_init_cpufreq_table(cpu_dev, &freq_table);
+	if (ret) {
+		pr_err("failed to init cpufreq table: %d\n", ret);
+		goto out_put_node;
+	}
+
+	if (of_property_read_u32(np, "transition-latency", &transition_latency))
+		transition_latency = CPUFREQ_ETERNAL;
+
+	ret = cpufreq_register_driver(&hb_cpufreq_driver);
+	if (ret) {
+		pr_err("failed register driver: %d\n", ret);
+		goto out_free_table;
+	}
+
+	of_node_put(np);
+	return 0;
+
+out_free_table:
+	opp_free_cpufreq_table(cpu_dev, &freq_table);
+out_put_node:
+	of_node_put(np);
+	return ret;
+}
+late_initcall(hb_cpufreq_driver_init);
+
+MODULE_AUTHOR("Mark Langsdorf <mark.langsdorf-bsGFqQB8/DxBDgjK7y7TUQ@public.gmane.org>");
+MODULE_DESCRIPTION("Calxeda Highbank cpufreq driver");
+MODULE_LICENSE("GPL");
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH 5/6 v4] power: export opp cpufreq functions
From: Mark Langsdorf @ 2012-11-07 18:32 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm; +Cc: Mark Langsdorf
In-Reply-To: <1352313166-28980-1-git-send-email-mark.langsdorf@calxeda.com>

These functions are needed to make the cpufreq-core0 and highbank-cpufreq
drivers loadable as modules.

Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Acked-by: Nishanth Menon <nm@ti.com>
---
Changes from v3
	includes linux/export.h instead of module.h
Changes from v2
	None.
Changes from v1
        Added Nishanth Menon's ack.
        Clarified the purpose of the change in the commit message.

 drivers/base/power/opp.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index d946864..4062ec3 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -23,6 +23,7 @@
 #include <linux/rcupdate.h>
 #include <linux/opp.h>
 #include <linux/of.h>
+#include <linux/export.h>
 
 /*
  * Internal data structure organization with the OPP layer library is as
@@ -643,6 +644,7 @@ int opp_init_cpufreq_table(struct device *dev,
 
 	return 0;
 }
+EXPORT_SYMBOL(opp_init_cpufreq_table);
 
 /**
  * opp_free_cpufreq_table() - free the cpufreq table
@@ -660,6 +662,7 @@ void opp_free_cpufreq_table(struct device *dev,
 	kfree(*table);
 	*table = NULL;
 }
+EXPORT_SYMBOL(opp_free_cpufreq_table);
 #endif		/* CONFIG_CPU_FREQ */
 
 /**
@@ -720,4 +723,5 @@ int of_init_opp_table(struct device *dev)
 
 	return 0;
 }
+EXPORT_SYMBOL(of_init_opp_table);
 #endif
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH 4/6 v4] arm highbank: add support for pl320 IPC
From: Mark Langsdorf @ 2012-11-07 18:32 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm; +Cc: Mark Langsdorf, Rob Herring
In-Reply-To: <1352313166-28980-1-git-send-email-mark.langsdorf@calxeda.com>

From: Rob Herring <rob.herring@calxeda.com>

The pl320 IPC allows for interprocessor communication between the highbank A9
and the EnergyCore Management Engine. The pl320 implements a straightforward
mailbox protocol.

Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
---
Changes from v3, v2
	None
Changes from v1
        Removed erroneous changes for cpufreq Kconfig

 arch/arm/include/asm/pl320-ipc.h                |  20 ++
 arch/arm/mach-highbank/Makefile                 |   2 +
 arch/arm/mach-highbank/include/mach/pl320-ipc.h |  20 ++
 arch/arm/mach-highbank/pl320-ipc.c              | 232 ++++++++++++++++++++++++
 4 files changed, 274 insertions(+)
 create mode 100644 arch/arm/include/asm/pl320-ipc.h
 create mode 100644 arch/arm/mach-highbank/include/mach/pl320-ipc.h
 create mode 100644 arch/arm/mach-highbank/pl320-ipc.c

diff --git a/arch/arm/include/asm/pl320-ipc.h b/arch/arm/include/asm/pl320-ipc.h
new file mode 100644
index 0000000..a0e58ee
--- /dev/null
+++ b/arch/arm/include/asm/pl320-ipc.h
@@ -0,0 +1,20 @@
+/*
+ * Copyright 2010 Calxeda, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+int ipc_call_fast(u32 *data);
+int ipc_call_slow(u32 *data);
+
+extern int pl320_ipc_register_notifier(struct notifier_block *nb);
+extern int pl320_ipc_unregister_notifier(struct notifier_block *nb);
diff --git a/arch/arm/mach-highbank/Makefile b/arch/arm/mach-highbank/Makefile
index 3ec8bdd..b894708 100644
--- a/arch/arm/mach-highbank/Makefile
+++ b/arch/arm/mach-highbank/Makefile
@@ -7,3 +7,5 @@ obj-$(CONFIG_DEBUG_HIGHBANK_UART)	+= lluart.o
 obj-$(CONFIG_SMP)			+= platsmp.o
 obj-$(CONFIG_HOTPLUG_CPU)		+= hotplug.o
 obj-$(CONFIG_PM_SLEEP)			+= pm.o
+
+obj-y					+= pl320-ipc.o
diff --git a/arch/arm/mach-highbank/include/mach/pl320-ipc.h b/arch/arm/mach-highbank/include/mach/pl320-ipc.h
new file mode 100644
index 0000000..a0e58ee
--- /dev/null
+++ b/arch/arm/mach-highbank/include/mach/pl320-ipc.h
@@ -0,0 +1,20 @@
+/*
+ * Copyright 2010 Calxeda, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+int ipc_call_fast(u32 *data);
+int ipc_call_slow(u32 *data);
+
+extern int pl320_ipc_register_notifier(struct notifier_block *nb);
+extern int pl320_ipc_unregister_notifier(struct notifier_block *nb);
diff --git a/arch/arm/mach-highbank/pl320-ipc.c b/arch/arm/mach-highbank/pl320-ipc.c
new file mode 100644
index 0000000..0eb92e4
--- /dev/null
+++ b/arch/arm/mach-highbank/pl320-ipc.c
@@ -0,0 +1,232 @@
+/*
+ * Copyright 2012 Calxeda, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#include <linux/types.h>
+#include <linux/err.h>
+#include <linux/delay.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/interrupt.h>
+#include <linux/completion.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/spinlock.h>
+#include <linux/device.h>
+#include <linux/amba/bus.h>
+
+#include <asm/pl320-ipc.h>
+
+#define IPCMxSOURCE(m)		((m) * 0x40)
+#define IPCMxDSET(m)		(((m) * 0x40) + 0x004)
+#define IPCMxDCLEAR(m)		(((m) * 0x40) + 0x008)
+#define IPCMxDSTATUS(m)		(((m) * 0x40) + 0x00C)
+#define IPCMxMODE(m)		(((m) * 0x40) + 0x010)
+#define IPCMxMSET(m)		(((m) * 0x40) + 0x014)
+#define IPCMxMCLEAR(m)		(((m) * 0x40) + 0x018)
+#define IPCMxMSTATUS(m)		(((m) * 0x40) + 0x01C)
+#define IPCMxSEND(m)		(((m) * 0x40) + 0x020)
+#define IPCMxDR(m, dr)		(((m) * 0x40) + ((dr) * 4) + 0x024)
+
+#define IPCMMIS(irq)		(((irq) * 8) + 0x800)
+#define IPCMRIS(irq)		(((irq) * 8) + 0x804)
+
+#define MBOX_MASK(n)		(1 << (n))
+#define IPC_FAST_MBOX		0
+#define IPC_SLOW_MBOX		1
+#define IPC_RX_MBOX		2
+
+#define CHAN_MASK(n)		(1 << (n))
+#define A9_SOURCE		1
+#define M3_SOURCE		0
+
+static void __iomem *ipc_base;
+static int ipc_irq;
+static DEFINE_SPINLOCK(ipc_m0_lock);
+static DEFINE_MUTEX(ipc_m1_lock);
+static DECLARE_COMPLETION(ipc_completion);
+static ATOMIC_NOTIFIER_HEAD(ipc_notifier);
+
+static inline void set_destination(int source, int mbox)
+{
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxDSET(mbox));
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxMSET(mbox));
+}
+
+static inline void clear_destination(int source, int mbox)
+{
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxDCLEAR(mbox));
+	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxMCLEAR(mbox));
+}
+
+static void __ipc_send(int mbox, u32 *data)
+{
+	int i;
+	for (i = 0; i < 7; i++)
+		__raw_writel(data[i], ipc_base + IPCMxDR(mbox, i));
+	__raw_writel(0x1, ipc_base + IPCMxSEND(mbox));
+}
+
+static u32 __ipc_rcv(int mbox, u32 *data)
+{
+	int i;
+	for (i = 0; i < 7; i++)
+		data[i] = __raw_readl(ipc_base + IPCMxDR(mbox, i));
+	return data[1];
+}
+
+/* non-blocking implementation from the A9 side, interrupt safe in theory */
+int ipc_call_fast(u32 *data)
+{
+	int timeout, ret;
+
+	spin_lock(&ipc_m0_lock);
+
+	__ipc_send(IPC_FAST_MBOX, data);
+
+	for (timeout = 500; timeout > 0; timeout--) {
+		if (__raw_readl(ipc_base + IPCMxSEND(IPC_FAST_MBOX)) == 0x2)
+			break;
+		udelay(100);
+	}
+	if (timeout == 0) {
+		ret = -ETIMEDOUT;
+		goto out;
+	}
+
+	ret = __ipc_rcv(IPC_FAST_MBOX, data);
+out:
+	__raw_writel(0, ipc_base + IPCMxSEND(IPC_FAST_MBOX));
+	spin_unlock(&ipc_m0_lock);
+	return ret;
+}
+EXPORT_SYMBOL(ipc_call_fast);
+
+/* blocking implmentation from the A9 side, not usuable in interrupts! */
+int ipc_call_slow(u32 *data)
+{
+	int ret;
+
+	mutex_lock(&ipc_m1_lock);
+
+	init_completion(&ipc_completion);
+	__ipc_send(IPC_SLOW_MBOX, data);
+	ret = wait_for_completion_timeout(&ipc_completion,
+					  msecs_to_jiffies(1000));
+	if (ret == 0) {
+		ret = -ETIMEDOUT;
+		goto out;
+	}
+
+	ret = __ipc_rcv(IPC_SLOW_MBOX, data);
+out:
+	mutex_unlock(&ipc_m1_lock);
+	return ret;
+}
+EXPORT_SYMBOL(ipc_call_slow);
+
+irqreturn_t ipc_handler(int irq, void *dev)
+{
+	u32 irq_stat;
+	u32 data[7];
+
+	irq_stat = __raw_readl(ipc_base + IPCMMIS(1));
+	if (irq_stat & MBOX_MASK(IPC_SLOW_MBOX)) {
+		__raw_writel(0, ipc_base + IPCMxSEND(IPC_SLOW_MBOX));
+		complete(&ipc_completion);
+	}
+	if (irq_stat & MBOX_MASK(IPC_RX_MBOX)) {
+		__ipc_rcv(IPC_RX_MBOX, data);
+		atomic_notifier_call_chain(&ipc_notifier, data[0], data + 1);
+		__raw_writel(2, ipc_base + IPCMxSEND(IPC_RX_MBOX));
+	}
+
+	return IRQ_HANDLED;
+}
+
+int pl320_ipc_register_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&ipc_notifier, nb);
+}
+
+int pl320_ipc_unregister_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&ipc_notifier, nb);
+}
+
+static int __devinit pl320_probe(struct amba_device *adev,
+				const struct amba_id *id)
+{
+	int ret;
+
+	ipc_base = ioremap(adev->res.start, resource_size(&adev->res));
+	if (ipc_base == NULL)
+		return -ENOMEM;
+
+	__raw_writel(0, ipc_base + IPCMxSEND(IPC_FAST_MBOX));
+	__raw_writel(0, ipc_base + IPCMxSEND(IPC_SLOW_MBOX));
+
+	ipc_irq = adev->irq[0];
+	ret = request_irq(ipc_irq, ipc_handler, 0, dev_name(&adev->dev), NULL);
+	if (ret < 0)
+		goto err;
+
+	/* Init fast mailbox */
+	__raw_writel(CHAN_MASK(A9_SOURCE),
+			ipc_base + IPCMxSOURCE(IPC_FAST_MBOX));
+	set_destination(M3_SOURCE, IPC_FAST_MBOX);
+
+	/* Init slow mailbox */
+	__raw_writel(CHAN_MASK(A9_SOURCE),
+			ipc_base + IPCMxSOURCE(IPC_SLOW_MBOX));
+	__raw_writel(CHAN_MASK(M3_SOURCE),
+			ipc_base + IPCMxDSET(IPC_SLOW_MBOX));
+	__raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
+		     ipc_base + IPCMxMSET(IPC_SLOW_MBOX));
+
+	/* Init receive mailbox */
+	__raw_writel(CHAN_MASK(M3_SOURCE),
+			ipc_base + IPCMxSOURCE(IPC_RX_MBOX));
+	__raw_writel(CHAN_MASK(A9_SOURCE),
+			ipc_base + IPCMxDSET(IPC_RX_MBOX));
+	__raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
+		     ipc_base + IPCMxMSET(IPC_RX_MBOX));
+
+	return 0;
+err:
+	iounmap(ipc_base);
+	return ret;
+}
+
+static struct amba_id pl320_ids[] = {
+	{
+		.id	= 0x00041320,
+		.mask	= 0x000fffff,
+	},
+	{ 0, 0 },
+};
+
+static struct amba_driver pl320_driver = {
+	.drv = {
+		.name	= "pl320",
+	},
+	.id_table	= pl320_ids,
+	.probe		= pl320_probe,
+};
+
+static int __init ipc_init(void)
+{
+	return amba_driver_register(&pl320_driver);
+}
+module_init(ipc_init);
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH 3/6 v4] cpufreq: tolerate inexact values when collecting stats
From: Mark Langsdorf @ 2012-11-07 18:32 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm; +Cc: Mark Langsdorf, MyungJoo Ham
In-Reply-To: <1352313166-28980-1-git-send-email-mark.langsdorf@calxeda.com>

When collecting stats, if a frequency doesn't match the table, go through
the table again with both the search frequency and table values shifted
left by 10 bits.

Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Cc: MyungJoo Ham <myungjoo.ham@gmail.com>
---
Changes from v3, v2
	None
Changes from v1:
        Implemented a simple round-up algorithm instead of the over/under
method that could cause errors on Intel processors with boost mode.

 drivers/cpufreq/cpufreq_stats.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c
index 3998316..30aee36 100644
--- a/drivers/cpufreq/cpufreq_stats.c
+++ b/drivers/cpufreq/cpufreq_stats.c
@@ -161,6 +161,9 @@ static int freq_table_get_index(struct cpufreq_stats *stat, unsigned int freq)
 	for (index = 0; index < stat->max_state; index++)
 		if (stat->freq_table[index] == freq)
 			return index;
+	for (index = 0; index < stat->max_state; index++)
+		if ((stat->freq_table[index] >> 10) == (freq >> 10))
+			return index;
 	return -1;
 }
 
@@ -251,6 +254,8 @@ static int cpufreq_stats_create_table(struct cpufreq_policy *policy,
 	spin_lock(&cpufreq_stats_lock);
 	stat->last_time = get_jiffies_64();
 	stat->last_index = freq_table_get_index(stat, policy->cur);
+	if (stat->last_index > stat->max_state)
+		stat->last_index = stat->max_state - 1;
 	spin_unlock(&cpufreq_stats_lock);
 	cpufreq_cpu_put(data);
 	return 0;
-- 
1.7.11.7


^ permalink raw reply related

* [PATCH 2/6 v4] clk, highbank: Prevent glitches in non-bypass reset mode
From: Mark Langsdorf @ 2012-11-07 18:32 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm; +Cc: Mark Langsdorf, Rob Herring, mturquette
In-Reply-To: <1352313166-28980-1-git-send-email-mark.langsdorf@calxeda.com>

The highbank clock will glitch with the current code if the
clock rate is reset without relocking the PLL. Program the PLL
correctly to preven glitches.

Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: mturquette@linaro.org
---
Changes from v3
	Changelog text and patch name now correspond to the actual patch
	was clk, highbank: remove non-bypass reset mode
Changes from v2
	None
Changes from v1:
        Removed erroneous reformating.

 drivers/clk/clk-highbank.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/clk/clk-highbank.c b/drivers/clk/clk-highbank.c
index 52fecad..3a0b723 100644
--- a/drivers/clk/clk-highbank.c
+++ b/drivers/clk/clk-highbank.c
@@ -182,8 +182,10 @@ static int clk_pll_set_rate(struct clk_hw *hwclk, unsigned long rate,
 		reg |= HB_PLL_EXT_ENA;
 		reg &= ~HB_PLL_EXT_BYPASS;
 	} else {
+		writel(reg | HB_PLL_EXT_BYPASS, hbclk->reg);
 		reg &= ~HB_PLL_DIVQ_MASK;
 		reg |= divq << HB_PLL_DIVQ_SHIFT;
+		writel(reg | HB_PLL_EXT_BYPASS, hbclk->reg);
 	}
 	writel(reg, hbclk->reg);
 
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH 1/6 v4] arm: use devicetree to get smp_twd clock
From: Mark Langsdorf @ 2012-11-07 18:32 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm; +Cc: Mark Langsdorf, Rob Herring
In-Reply-To: <1352313166-28980-1-git-send-email-mark.langsdorf@calxeda.com>

From: Rob Herring <rob.herring@calxeda.com>

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
---
Changes from v3
	No longer setting *clk to NULL in twd_get_clock()
Changes from v2
	Turned the check for the node pointer into an if-then-else statement.
	Removed the second, redundant clk_get_rate
Changes from v1
        None.

 arch/arm/kernel/smp_twd.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kernel/smp_twd.c b/arch/arm/kernel/smp_twd.c
index b22d700..af46b80 100644
--- a/arch/arm/kernel/smp_twd.c
+++ b/arch/arm/kernel/smp_twd.c
@@ -237,12 +237,15 @@ static irqreturn_t twd_handler(int irq, void *dev_id)
 	return IRQ_NONE;
 }
 
-static struct clk *twd_get_clock(void)
+static struct clk *twd_get_clock(struct device_node *np)
 {
 	struct clk *clk;
 	int err;
 
-	clk = clk_get_sys("smp_twd", NULL);
+	if (np)
+		clk = of_clk_get(np, 0);
+	else
+		clk = clk_get_sys("smp_twd", NULL);
 	if (IS_ERR(clk)) {
 		pr_err("smp_twd: clock not found: %d\n", (int)PTR_ERR(clk));
 		return clk;
@@ -263,6 +266,7 @@ static struct clk *twd_get_clock(void)
 		return ERR_PTR(err);
 	}
 
+	twd_timer_rate = clk_get_rate(clk);
 	return clk;
 }
 
@@ -273,12 +277,7 @@ static int __cpuinit twd_timer_setup(struct clock_event_device *clk)
 {
 	struct clock_event_device **this_cpu_clk;
 
-	if (!twd_clk)
-		twd_clk = twd_get_clock();
-
-	if (!IS_ERR_OR_NULL(twd_clk))
-		twd_timer_rate = clk_get_rate(twd_clk);
-	else
+	if (IS_ERR_OR_NULL(twd_clk))
 		twd_calibrate_rate();
 
 	__raw_writel(0, twd_base + TWD_TIMER_CONTROL);
@@ -349,6 +348,8 @@ int __init twd_local_timer_register(struct twd_local_timer *tlt)
 	if (!twd_base)
 		return -ENOMEM;
 
+	twd_clk = twd_get_clock(NULL);
+
 	return twd_local_timer_common_register();
 }
 
@@ -383,6 +384,8 @@ void __init twd_local_timer_of_register(void)
 		goto out;
 	}
 
+	twd_clk = twd_get_clock(np);
+
 	err = twd_local_timer_common_register();
 
 out:
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH 0/6 v4] cpufreq: add support for Calxeda ECX-1000 (highbank)
From: Mark Langsdorf @ 2012-11-07 18:32 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm
In-Reply-To: <1351631056-25938-1-git-send-email-mark.langsdorf@calxeda.com>

This patch series adds cpufreq support for the Calxeda
ECX-1000 (highbank) SoCs. The driver is based on the 
cpufreq-cpu0 driver. Because of the unique way that 
highbank uses the EnergyCore Management Engine to manage
voltages, it was not possible to use the cpufreq-cpu0 driver.

--Mark Langsdorf

^ permalink raw reply

* Re: [PATCH 0/6 v3] cpufreq: add support for Calxeda ECX-1000 (highbank)
From: Mark Langsdorf @ 2012-11-07 18:11 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
	cpufreq@vger.kernel.org
In-Reply-To: <1352233089-22586-1-git-send-email-mark.langsdorf@calxeda.com>

On 11/06/2012 02:18 PM, Mark Langsdorf wrote:
> This patch series adds cpufreq support for the Calxeda
> ECX-1000 (highbank) SoCs. The driver is based on the 
> cpufreq-cpu0 driver. Because of the unique way that 
> highbank uses the EnergyCore Management Engine to manage
> voltages, it was not possible to use the cpufreq-cpu0 driver.

I accidentally sent out old versions of the patches. Please ignore the
v3 series and look at the v4 instead.

--Mark Langsdorf


^ permalink raw reply

* Re: [PATCH 3/9] cgroup: implement generic child / descendant walk macros
From: Michal Hocko @ 2012-11-07 17:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20121107170118.GD2660-9pTldWuhBndy/B6EtB590w@public.gmane.org>

On Wed 07-11-12 09:01:18, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Nov 07, 2012 at 05:54:57PM +0100, Michal Hocko wrote:
> > > +struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
> > > +					  struct cgroup *cgroup)
> > > +{
> > > +	struct cgroup *next;
> > > +
> > > +	WARN_ON_ONCE(!rcu_read_lock_held());
> > > +
> > > +	/* if first iteration, pretend we just visited @cgroup */
> > > +	if (!pos) {
> > > +		if (list_empty(&cgroup->children))
> > > +			return NULL;
> > > +		pos = cgroup;
> > > +	}
> > 
> > Is there any specific reason why the root of the tree is excluded?
> > This is bit impractical because you have to special case the root
> > in the code.
> 
> Yeah, thought about including it but decided against it for two
> reasons.
> 
> * To be consistent with cgroup_for_each_children() - it's a bit weird
>   for descendants to include self when children don't.
> 
> * Iteration root is likely to require different treatment anyway.
>   e.g. for cgroup_freezer, the root is updated to the specified config
>   while all the descendants inherit config from its immediate parent.
>   They are different.

OK if there is a code which handles root differently then let's not
overcomplicate this. The naming should be clear that root needs a
special treatment.

I will continue with the review tomorrow.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 1/9 v2] cgroup: add cgroup_subsys->post_create()
From: Michal Hocko @ 2012-11-07 17:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	Glauber Costa
In-Reply-To: <20121107171508.GF2660-9pTldWuhBndy/B6EtB590w@public.gmane.org>

On Wed 07-11-12 09:15:08, Tejun Heo wrote:
[...]
> This patch adds ->post_create().  It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
> 
> When used in combination with the to-be-added generic descendant
> iterators, ->post_create() can be used to implement reliable state
> inheritance.  It will be explained with the descendant iterators.

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox