Linux Power Management development

Linux Power Management development
 help / color / mirror / Atom feed

* Re: [PATCH v3 4/5] PM / Domains: Add support for multi PM domains per device to genpd
From: Lucas Stach @ 2018-05-31 11:40 UTC (permalink / raw)
  To: Ulf Hansson, Rafael J . Wysocki, linux-pm
  Cc: Greg Kroah-Hartman, Jon Hunter, Geert Uytterhoeven, Todor Tomov,
	Rajendra Nayak, Viresh Kumar, Vincent Guittot, Kevin Hilman,
	linux-kernel, linux-arm-kernel, linux-tegra
In-Reply-To: <20180531105959.14843-5-ulf.hansson@linaro.org>

Hi Ulf,

Am Donnerstag, den 31.05.2018, 12:59 +0200 schrieb Ulf Hansson:
> To support devices being partitioned across multiple PM domains, let's
> begin with extending genpd to cope with these kind of configurations.
> 
> Therefore, add a new exported function genpd_dev_pm_attach_by_id(), which
> is similar to the existing genpd_dev_pm_attach(), but with the difference
> that it allows its callers to provide an index to the PM domain that it
> wants to attach.
> 
> Note that, genpd_dev_pm_attach_by_id() shall only be called by the driver
> core / PM core, similar to how the existing dev_pm_domain_attach() makes
> use of genpd_dev_pm_attach(). However, this is implemented by following
> changes on top.

by_id() APIs are not really intuitive to use for driver writers. Other
subsystems have solved this by providing a "-names" property to give
the phandles a bit more meaning and then providing a by_name API. I
would really appreciate if PM domains could move in the same direction.

Regards,
Lucas

> Because, only one PM domain can be attached per device, genpd needs to
> create a virtual device that it can attach/detach instead. More precisely,
> let the new function genpd_dev_pm_attach_by_id() register a virtual struct
> device via calling device_register(). Then let it attach this device to the
> corresponding PM domain, rather than the one that is provided by the
> caller. The actual attaching is done via re-using the existing genpd OF
> functions.
> 
> At successful attachment, genpd_dev_pm_attach_by_id() returns the created
> virtual device, which allows the caller to operate on it to deal with power
> management. Following changes on top, provides more details in this
> regards.
> 
> To deal with detaching of a PM domain for the multiple PM domains case,
> let's also extend the existing genpd_dev_pm_detach() function, to cover the
> cleanup of the created virtual device, via make it call device_unregister()
> on it. In this way, there is no need to introduce a new function to deal
> with detach for the multiple PM domain case, but instead the existing one
> is re-used.
> 
> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> Acked-by: Jon Hunter <jonathanh@nvidia.com>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>
> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  drivers/base/power/domain.c | 80 +++++++++++++++++++++++++++++++++++++
>  include/linux/pm_domain.h   |  8 ++++
>  2 files changed, 88 insertions(+)
> 
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index b1fcbf917974..4925af5c4cf0 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -2171,6 +2171,15 @@ struct generic_pm_domain *of_genpd_remove_last(struct device_node *np)
>  }
>  EXPORT_SYMBOL_GPL(of_genpd_remove_last);
>  
> +static void genpd_release_dev(struct device *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static struct bus_type genpd_bus_type = {
> +	.name		= "genpd",
> +};
> +
>  /**
>   * genpd_dev_pm_detach - Detach a device from its PM domain.
>   * @dev: Device to detach.
> @@ -2208,6 +2217,10 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)
>  
>  	/* Check if PM domain can be powered off after removing this device. */
>  	genpd_queue_power_off_work(pd);
> +
> +	/* Unregister the device if it was created by genpd. */
> +	if (dev->bus == &genpd_bus_type)
> +		device_unregister(dev);
>  }
>  
>  static void genpd_dev_pm_sync(struct device *dev)
> @@ -2298,6 +2311,67 @@ int genpd_dev_pm_attach(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(genpd_dev_pm_attach);
>  
> +/**
> + * genpd_dev_pm_attach_by_id - Associate a device with one of its PM domains.
> + * @dev: The device used to lookup the PM domain.
> + * @index: The index of the PM domain.
> + *
> + * Parse device's OF node to find a PM domain specifier at the provided @index.
> + * If such is found, creates a virtual device and attaches it to the retrieved
> + * pm_domain ops. To deal with detaching of the virtual device, the ->detach()
> + * callback in the struct dev_pm_domain are assigned to genpd_dev_pm_detach().
> + *
> + * Returns the created virtual device if successfully attached PM domain, NULL
> + * when the device don't need a PM domain, else an ERR_PTR() in case of
> + * failures. If a power-domain exists for the device, but cannot be found or
> + * turned on, then ERR_PTR(-EPROBE_DEFER) is returned to ensure that the device
> + * is not probed and to re-try again later.
> + */
> +struct device *genpd_dev_pm_attach_by_id(struct device *dev,
> +					 unsigned int index)
> +{
> +	struct device *genpd_dev;
> +	int num_domains;
> +	int ret;
> +
> +	if (!dev->of_node)
> +		return NULL;
> +
> +	/* Deal only with devices using multiple PM domains. */
> +	num_domains = of_count_phandle_with_args(dev->of_node, "power-domains",
> +						 "#power-domain-cells");
> +	if (num_domains < 2 || index >= num_domains)
> +		return NULL;
> +
> +	/* Allocate and register device on the genpd bus. */
> +	genpd_dev = kzalloc(sizeof(*genpd_dev), GFP_KERNEL);
> +	if (!genpd_dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	dev_set_name(genpd_dev, "genpd:%u:%s", index, dev_name(dev));
> +	genpd_dev->bus = &genpd_bus_type;
> +	genpd_dev->release = genpd_release_dev;
> +
> +	ret = device_register(genpd_dev);
> +	if (ret) {
> +		kfree(genpd_dev);
> +		return ERR_PTR(ret);
> +	}
> +
> +	/* Try to attach the device to the PM domain at the specified index. */
> +	ret = __genpd_dev_pm_attach(genpd_dev, dev->of_node, index);
> +	if (ret < 1) {
> +		device_unregister(genpd_dev);
> +		return ret ? ERR_PTR(ret) : NULL;
> +	}
> +
> +	pm_runtime_set_active(genpd_dev);
> +	pm_runtime_enable(genpd_dev);
> +
> +	return genpd_dev;
> +}
> +EXPORT_SYMBOL_GPL(genpd_dev_pm_attach_by_id);
> +
>  static const struct of_device_id idle_state_match[] = {
>  	{ .compatible = "domain-idle-state", },
>  	{ }
> @@ -2457,6 +2531,12 @@ unsigned int of_genpd_opp_to_performance_state(struct device *dev,
>  }
>  EXPORT_SYMBOL_GPL(of_genpd_opp_to_performance_state);
>  
> +static int __init genpd_bus_init(void)
> +{
> +	return bus_register(&genpd_bus_type);
> +}
> +core_initcall(genpd_bus_init);
> +
>  #endif /* CONFIG_PM_GENERIC_DOMAINS_OF */
>  
>  
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 42e0d649e653..82458e8e2e01 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -237,6 +237,8 @@ unsigned int of_genpd_opp_to_performance_state(struct device *dev,
>  				struct device_node *opp_node);
>  
>  int genpd_dev_pm_attach(struct device *dev);
> +struct device *genpd_dev_pm_attach_by_id(struct device *dev,
> +					 unsigned int index);
>  #else /* !CONFIG_PM_GENERIC_DOMAINS_OF */
>  static inline int of_genpd_add_provider_simple(struct device_node *np,
>  					struct generic_pm_domain *genpd)
> @@ -282,6 +284,12 @@ static inline int genpd_dev_pm_attach(struct device *dev)
>  	return 0;
>  }
>  
> +static inline struct device *genpd_dev_pm_attach_by_id(struct device *dev,
> +						       unsigned int index)
> +{
> +	return NULL;
> +}
> +
>  static inline
>  struct generic_pm_domain *of_genpd_remove_last(struct device_node *np)
>  {

^ permalink raw reply

* [PATCH] cpuidle:powernv: Make the snooze timeout dynamic.
From: Gautham R. Shenoy @ 2018-05-31 12:15 UTC (permalink / raw)
  To: Rafael J. Wysocki, Daniel Lezcano, Michael Ellerman,
	Stewart Smith, Michael Neuling, Vaidyanathan Srinivasan,
	Shilpasri G Bhat, Akshay Adiga, Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel, linux-pm, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

The commit 78eaa10f027c ("cpuidle: powernv/pseries: Auto-promotion of
snooze to deeper idle state") introduced a timeout for the snooze idle
state so that it could be eventually be promoted to a deeper idle
state. The snooze timeout value is static and set to the target
residency of the next idle state, which would train the cpuidle
governor to pick the next idle state eventually.

The unfortunate side-effect of this is that if the next idle state(s)
is disabled, the CPU will forever remain in snooze, despite the fact
that the system is completely idle, and other deeper idle states are
available.

This patch fixes the issue by dynamically setting the snooze timeout
to the target residency of the next enabled state on the device.

Before Patch
==================
POWER8 : Only nap disabled.
 $cpupower monitor sleep 30
sleep took 30.01297 seconds and exited with status 0
              |Idle_Stats
PKG |CORE|CPU | snoo | Nap  | Fast
   0|   8|   0| 96.41|  0.00|  0.00
   0|   8|   1| 96.43|  0.00|  0.00
   0|   8|   2| 96.47|  0.00|  0.00
   0|   8|   3| 96.35|  0.00|  0.00
   0|   8|   4| 96.37|  0.00|  0.00
   0|   8|   5| 96.37|  0.00|  0.00
   0|   8|   6| 96.47|  0.00|  0.00
   0|   8|   7| 96.47|  0.00|  0.00

POWER9: Shallow states (stop0lite, stop1lite, stop2lite, stop0, stop1,
stop2) disabled:
$cpupower monitor sleep 30
sleep took 30.05033 seconds and exited with status 0
              |Idle_Stats
PKG |CORE|CPU | snoo | stop | stop | stop | stop | stop | stop | stop | stop
   0|  16|   0| 89.79|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00
   0|  16|   1| 90.12|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00
   0|  16|   2| 90.21|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00
   0|  16|   3| 90.29|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00

After Patch
======================
POWER8 : Only nap disabled.
$ cpupower monitor sleep 30
sleep took 30.01200 seconds and exited with status 0
              |Idle_Stats
PKG |CORE|CPU | snoo | Nap  | Fast
   0|   8|   0| 16.58|  0.00| 77.21
   0|   8|   1| 18.42|  0.00| 75.38
   0|   8|   2|  4.70|  0.00| 94.09
   0|   8|   3| 17.06|  0.00| 81.73
   0|   8|   4|  3.06|  0.00| 95.73
   0|   8|   5|  7.00|  0.00| 96.80
   0|   8|   6|  1.00|  0.00| 98.79
   0|   8|   7|  5.62|  0.00| 94.17

POWER9: Shallow states (stop0lite, stop1lite, stop2lite, stop0, stop1,
stop2) disabled:

$cpupower monitor sleep 30
sleep took 30.02110 seconds and exited with status 0
              |Idle_Stats
PKG |CORE|CPU | snoo | stop | stop | stop | stop | stop | stop | stop | stop
   0|   0|   0|  0.69|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  9.39| 89.70
   0|   0|   1|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.05| 93.21
   0|   0|   2|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00| 89.93
   0|   0|   3|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00| 93.26

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 1a8234e..d29e4f0 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -43,9 +43,31 @@ struct stop_psscr_table {
 
 static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly;
 
-static u64 snooze_timeout __read_mostly;
+static u64 default_snooze_timeout __read_mostly;
 static bool snooze_timeout_en __read_mostly;
 
+static u64 get_snooze_timeout(struct cpuidle_device *dev,
+			      struct cpuidle_driver *drv,
+			      int index)
+{
+	int i;
+
+	if (unlikely(!snooze_timeout_en))
+		return default_snooze_timeout;
+
+	for (i = index + 1; i < drv->state_count; i++) {
+		struct cpuidle_state *s = &drv->states[i];
+		struct cpuidle_state_usage *su = &dev->states_usage[i];
+
+		if (s->disabled || su->disable)
+			continue;
+
+		return s->target_residency * tb_ticks_per_usec;
+	}
+
+	return default_snooze_timeout;
+}
+
 static int snooze_loop(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv,
 			int index)
@@ -56,7 +78,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 
 	local_irq_enable();
 
-	snooze_exit_time = get_tb() + snooze_timeout;
+	snooze_exit_time = get_tb() + get_snooze_timeout(dev, drv, index);
 	ppc64_runlatch_off();
 	HMT_very_low();
 	while (!need_resched()) {
@@ -465,11 +487,9 @@ static int powernv_idle_probe(void)
 		cpuidle_state_table = powernv_states;
 		/* Device tree can indicate more idle states */
 		max_idle_state = powernv_add_idle_states();
-		if (max_idle_state > 1) {
+		default_snooze_timeout = TICK_USEC * tb_ticks_per_usec;
+		if (max_idle_state > 1)
 			snooze_timeout_en = true;
-			snooze_timeout = powernv_states[1].target_residency *
-					 tb_ticks_per_usec;
-		}
  	} else
  		return -ENODEV;
 
-- 
1.9.4

^ permalink raw reply related

* Re: [PATCH v3 4/5] PM / Domains: Add support for multi PM domains per device to genpd
From: Jon Hunter @ 2018-05-31 12:47 UTC (permalink / raw)
  To: Lucas Stach, Ulf Hansson, Rafael J . Wysocki, linux-pm
  Cc: Greg Kroah-Hartman, Geert Uytterhoeven, Todor Tomov,
	Rajendra Nayak, Viresh Kumar, Vincent Guittot, Kevin Hilman,
	linux-kernel, linux-arm-kernel, linux-tegra
In-Reply-To: <5fc1d3ee51c8dbc264fb21edf33a879c7db4056b.camel@lynxeye.de>


On 31/05/18 12:40, Lucas Stach wrote:
> Hi Ulf,
> 
> Am Donnerstag, den 31.05.2018, 12:59 +0200 schrieb Ulf Hansson:
>> To support devices being partitioned across multiple PM domains, let's
>> begin with extending genpd to cope with these kind of configurations.
>>
>> Therefore, add a new exported function genpd_dev_pm_attach_by_id(), which
>> is similar to the existing genpd_dev_pm_attach(), but with the difference
>> that it allows its callers to provide an index to the PM domain that it
>> wants to attach.
>>
>> Note that, genpd_dev_pm_attach_by_id() shall only be called by the driver
>> core / PM core, similar to how the existing dev_pm_domain_attach() makes
>> use of genpd_dev_pm_attach(). However, this is implemented by following
>> changes on top.
> 
> by_id() APIs are not really intuitive to use for driver writers. Other
> subsystems have solved this by providing a "-names" property to give
> the phandles a bit more meaning and then providing a by_name API. I
> would really appreciate if PM domains could move in the same direction.

As discussed here [0], there are plans to add that.

Cheers
Jon

[0] https://patchwork.ozlabs.org/patch/921938/

-- 
nvpublic

^ permalink raw reply

* Re: [PATCH] cpuidle:powernv: Make the snooze timeout dynamic.
From: Balbir Singh @ 2018-05-31 14:51 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Rafael J. Wysocki, Daniel Lezcano, Michael Ellerman,
	Stewart Smith, Michael Neuling, Vaidyanathan Srinivasan,
	Shilpasri G Bhat, Akshay Adiga, Nicholas Piggin,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-kernel@vger.kernel.org, linux-pm
In-Reply-To: <1527768909-32637-1-git-send-email-ego@linux.vnet.ibm.com>

On Thu, May 31, 2018 at 10:15 PM, Gautham R. Shenoy
<ego@linux.vnet.ibm.com> wrote:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>
> The commit 78eaa10f027c ("cpuidle: powernv/pseries: Auto-promotion of
> snooze to deeper idle state") introduced a timeout for the snooze idle
> state so that it could be eventually be promoted to a deeper idle
> state. The snooze timeout value is static and set to the target
> residency of the next idle state, which would train the cpuidle
> governor to pick the next idle state eventually.
>
> The unfortunate side-effect of this is that if the next idle state(s)
> is disabled, the CPU will forever remain in snooze, despite the fact
> that the system is completely idle, and other deeper idle states are
> available.
>
> This patch fixes the issue by dynamically setting the snooze timeout
> to the target residency of the next enabled state on the device.
>
> Before Patch
> ==================
> POWER8 : Only nap disabled.
>  $cpupower monitor sleep 30
> sleep took 30.01297 seconds and exited with status 0
>               |Idle_Stats
> PKG |CORE|CPU | snoo | Nap  | Fast
>    0|   8|   0| 96.41|  0.00|  0.00
>    0|   8|   1| 96.43|  0.00|  0.00
>    0|   8|   2| 96.47|  0.00|  0.00
>    0|   8|   3| 96.35|  0.00|  0.00
>    0|   8|   4| 96.37|  0.00|  0.00
>    0|   8|   5| 96.37|  0.00|  0.00
>    0|   8|   6| 96.47|  0.00|  0.00
>    0|   8|   7| 96.47|  0.00|  0.00
>
> POWER9: Shallow states (stop0lite, stop1lite, stop2lite, stop0, stop1,
> stop2) disabled:
> $cpupower monitor sleep 30
> sleep took 30.05033 seconds and exited with status 0
>               |Idle_Stats
> PKG |CORE|CPU | snoo | stop | stop | stop | stop | stop | stop | stop | stop
>    0|  16|   0| 89.79|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00
>    0|  16|   1| 90.12|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00
>    0|  16|   2| 90.21|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00
>    0|  16|   3| 90.29|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00
>
> After Patch
> ======================
> POWER8 : Only nap disabled.
> $ cpupower monitor sleep 30
> sleep took 30.01200 seconds and exited with status 0
>               |Idle_Stats
> PKG |CORE|CPU | snoo | Nap  | Fast
>    0|   8|   0| 16.58|  0.00| 77.21
>    0|   8|   1| 18.42|  0.00| 75.38
>    0|   8|   2|  4.70|  0.00| 94.09
>    0|   8|   3| 17.06|  0.00| 81.73
>    0|   8|   4|  3.06|  0.00| 95.73
>    0|   8|   5|  7.00|  0.00| 96.80
>    0|   8|   6|  1.00|  0.00| 98.79
>    0|   8|   7|  5.62|  0.00| 94.17
>
> POWER9: Shallow states (stop0lite, stop1lite, stop2lite, stop0, stop1,
> stop2) disabled:
>
> $cpupower monitor sleep 30
> sleep took 30.02110 seconds and exited with status 0
>               |Idle_Stats
> PKG |CORE|CPU | snoo | stop | stop | stop | stop | stop | stop | stop | stop
>    0|   0|   0|  0.69|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  9.39| 89.70
>    0|   0|   1|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.05| 93.21
>    0|   0|   2|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00| 89.93
>    0|   0|   3|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00|  0.00| 93.26
>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 32 ++++++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 1a8234e..d29e4f0 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -43,9 +43,31 @@ struct stop_psscr_table {
>
>  static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly;
>
> -static u64 snooze_timeout __read_mostly;
> +static u64 default_snooze_timeout __read_mostly;
>  static bool snooze_timeout_en __read_mostly;
>
> +static u64 get_snooze_timeout(struct cpuidle_device *dev,
> +                             struct cpuidle_driver *drv,
> +                             int index)
> +{
> +       int i;
> +
> +       if (unlikely(!snooze_timeout_en))
> +               return default_snooze_timeout;
> +
> +       for (i = index + 1; i < drv->state_count; i++) {
> +               struct cpuidle_state *s = &drv->states[i];
> +               struct cpuidle_state_usage *su = &dev->states_usage[i];
> +
> +               if (s->disabled || su->disable)
> +                       continue;
> +
> +               return s->target_residency * tb_ticks_per_usec;

Can we ensure this is not prone to overflow?

Otherwise looks good

Reviewed-by: Balbir Singh <bsingharora@gmail.com>

^ permalink raw reply

* Re: [PATCH 10/11] dt-bindings: misc: add bindings for throttler
From: Rob Herring @ 2018-05-31 16:31 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Arnd Bergmann,
	Greg Kroah-Hartman, Mark Rutland, linux-pm, devicetree,
	linux-kernel, Brian Norris, Douglas Anderson
In-Reply-To: <20180525203043.249193-11-mka@chromium.org>

On Fri, May 25, 2018 at 01:30:42PM -0700, Matthias Kaehlcke wrote:

Commit msg?

> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> ---
>  .../devicetree/bindings/misc/throttler.txt    | 41 +++++++++++++++++++
>  1 file changed, 41 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/misc/throttler.txt
> 
> diff --git a/Documentation/devicetree/bindings/misc/throttler.txt b/Documentation/devicetree/bindings/misc/throttler.txt
> new file mode 100644
> index 000000000000..92f13e94451a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/misc/throttler.txt
> @@ -0,0 +1,41 @@
> +Throttler driver
> +
> +The Throttler is used for non-thermal throttling of system components like
> +CPUs or devfreq devices.

This all looks very Linux specific and not a h/w device. Perhaps you can 
add hint properties to OPP tables as to what entries can be used for 
throttling, but otherwise this doesn't belong in DT.

^ permalink raw reply

* Re: [PATCH 11/11] misc/throttler: Add Chrome OS EC throttler
From: Matthias Kaehlcke @ 2018-05-31 17:33 UTC (permalink / raw)
  To: Enric Balletbo Serra
  Cc: MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Arnd Bergmann,
	Greg Kroah-Hartman, Rob Herring, Mark Rutland, Linux PM list,
	devicetree@vger.kernel.org, linux-kernel, Brian Norris,
	Douglas Anderson
In-Reply-To: <CAFqH_53iMhvq+6NXoXu50sD=PuXBf-2_D3K1Mf+iggLeGSu_mg@mail.gmail.com>

Hi Enric,

thanks for the review!

I'll address your comments in the next revision.

On Thu, May 31, 2018 at 11:05:09AM +0200, Enric Balletbo Serra wrote:
> Hi Matthias,
> 
> The patch looks good to me, just three minor comments to be more
> coherent with other cros-ec drivers.
> 
> 2018-05-25 22:30 GMT+02:00 Matthias Kaehlcke <mka@chromium.org>:
> > The driver subscribes to throttling events from the Chrome OS
> > embedded controller and enables/disables system throttling based
> > on these events.
> >
> > Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> > ---
> >  drivers/misc/throttler/Kconfig             |  15 +++
> >  drivers/misc/throttler/Makefile            |   1 +
> >  drivers/misc/throttler/cros_ec_throttler.c | 122 +++++++++++++++++++++
> >  3 files changed, 138 insertions(+)
> >  create mode 100644 drivers/misc/throttler/cros_ec_throttler.c
> >
> > diff --git a/drivers/misc/throttler/Kconfig b/drivers/misc/throttler/Kconfig
> > index ef8388f6bc0a..652b6817b75c 100644
> > --- a/drivers/misc/throttler/Kconfig
> > +++ b/drivers/misc/throttler/Kconfig
> > @@ -11,3 +11,18 @@ menuconfig THROTTLER
> >           Note that you also need a event monitor module usually called
> >           *_throttler.
> >
> > +if THROTTLER
> > +
> > +config CROS_EC_THROTTLER
> > +       tristate "Throttler event monitor for the Chrome OS Embedded Controller"
> 
> s/Chrome OS/ChromeOS/ This is how other cros-ec drivers refer to the
> embedded controller, so will be good if you could maintain this
> denomination.
> 
> > +       default n
> > +       depends on MFD_CROS_EC
> > +       ---help---
> > +         This driver adds support to throttle the system in reaction to
> > +         Chrome OS EC events.
> > +
> 
> ditto
> 
> > +         To compile this driver as a module, choose M here: the
> > +         module will be called cros_ec_throttler.
> > +
> > +endif # THROTTLER
> > +
> > diff --git a/drivers/misc/throttler/Makefile b/drivers/misc/throttler/Makefile
> > index c8d920cee315..d9b2a77dabc9 100644
> > --- a/drivers/misc/throttler/Makefile
> > +++ b/drivers/misc/throttler/Makefile
> > @@ -1 +1,2 @@
> >  obj-$(CONFIG_THROTTLER)                += core.o
> > +obj-$(CONFIG_CROS_EC_THROTTLER)        += cros_ec_throttler.o
> > diff --git a/drivers/misc/throttler/cros_ec_throttler.c b/drivers/misc/throttler/cros_ec_throttler.c
> > new file mode 100644
> > index 000000000000..ea6bc002d49c
> > --- /dev/null
> > +++ b/drivers/misc/throttler/cros_ec_throttler.c
> > @@ -0,0 +1,122 @@
> > +/*
> > + * Driver for throttling triggered by EC events.
> > + *
> > + * Copyright (C) 2018 Google, Inc.
> > + *
> > + * This software is licensed under the terms of the GNU General Public
> > + * License version 2, as published by the Free Software Foundation, and
> > + * may be copied, distributed, and modified under those terms.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + */
> > +
> 
> Replace this for the SPDX header format
> 
> // SPDX-License-Identifier: GPL-2.0+
> // Driver for throttling triggered by EC events.
> //
> // Copyright (C) 2018 Google, Inc.
> // Author: Matthias Kaehlcke <mka@chromium.org>
> 
> > +#include <linux/kernel.h>
> > +#include <linux/mfd/cros_ec.h>
> > +#include <linux/module.h>
> > +#include <linux/notifier.h>
> > +#include <linux/of.h>
> > +#include <linux/of_platform.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/throttler.h>
> > +
> > +struct cros_ec_throttler {
> > +       struct cros_ec_device *ec;
> > +       struct throttler *throttler;
> > +       struct notifier_block nb;
> > +};
> > +
> > +static int cros_ec_throttler_event(struct notifier_block *nb,
> > +       unsigned long queued_during_suspend, void *_notify)
> > +{
> > +       struct cros_ec_throttler *cte =
> > +               container_of(nb, struct cros_ec_throttler, nb);
> 
> nit: Better add a define here instead of split the line like this ?
> 
> > +       u32 host_event;
> > +
> > +       host_event = cros_ec_get_host_event(cte->ec);
> > +       if (host_event & EC_HOST_EVENT_MASK(EC_HOST_EVENT_THROTTLE_START)) {
> > +               throttler_set_level(cte->throttler, 1);
> > +
> > +               return NOTIFY_OK;
> > +       } else if (host_event & EC_HOST_EVENT_MASK(EC_HOST_EVENT_THROTTLE_STOP)) {
> > +               throttler_set_level(cte->throttler, 0);
> > +
> > +               return NOTIFY_OK;
> > +       }
> > +
> > +       return NOTIFY_DONE;
> > +}
> > +
> > +static int cros_ec_throttler_probe(struct platform_device *pdev)
> > +{
> > +       struct cros_ec_throttler *cte;
> > +       struct device *dev = &pdev->dev;
> > +       struct device_node *np = pdev->dev.of_node;
> > +       int ret;
> > +
> > +       if (!np) {
> > +               /* should never happen */
> > +               return -EINVAL;
> > +       }
> > +
> > +       cte = devm_kzalloc(dev, sizeof(*cte), GFP_KERNEL);
> > +       if (!cte)
> > +               return -ENOMEM;
> > +
> > +       cte->ec = dev_get_drvdata(pdev->dev.parent);
> > +
> > +       cte->throttler = throttler_setup(dev);
> > +       if (IS_ERR(cte->throttler))
> > +               return PTR_ERR(cte->throttler);
> > +
> > +       dev_set_drvdata(dev, cte);
> > +
> > +       cte->nb.notifier_call = cros_ec_throttler_event;
> > +       ret = blocking_notifier_chain_register(&cte->ec->event_notifier,
> > +                                              &cte->nb);
> > +       if (ret < 0) {
> > +               dev_err(dev, "failed to register notifier\n");
> > +               throttler_teardown(cte->throttler);
> > +               return ret;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int cros_ec_throttler_remove(struct platform_device *pdev)
> > +{
> > +       struct cros_ec_throttler *cte = platform_get_drvdata(pdev);
> > +
> > +       blocking_notifier_chain_unregister(&cte->ec->event_notifier,
> > +                                          &cte->nb);
> > +
> > +       throttler_teardown(cte->throttler);
> > +
> > +       return 0;
> > +}
> > +
> > +#ifdef CONFIG_OF
> > +static const struct of_device_id cros_ec_throttler_of_match[] = {
> > +        { .compatible = "google,cros-ec-throttler" },
> > +        { /* sentinel */ }
> > +};
> > +MODULE_DEVICE_TABLE(of, cros_ec_throttler_of_match);
> > +#endif /* CONFIG_OF */
> > +
> > +static struct platform_driver cros_ec_throttler_driver = {
> > +       .driver = {
> > +               .name = "cros-ec-throttler",
> > +               .of_match_table = of_match_ptr(cros_ec_throttler_of_match),
> > +       },
> > +       .probe          = cros_ec_throttler_probe,
> > +       .remove         = cros_ec_throttler_remove,
> > +};
> > +
> > +module_platform_driver(cros_ec_throttler_driver);
> > +
> > +MODULE_LICENSE("GPL");
> > +MODULE_AUTHOR("Matthias Kaehlcke <mka@chromium.org>");
> > +MODULE_DESCRIPTION("Chrome OS EC Throttler");
> 
> s/Chrome OS/ChromeOS/
> 
> >
> 
> Reviewed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
> 
> Thanks,
>   Enric

^ permalink raw reply

* Re: [PATCH V4] powercap/drivers/idle_injection: Add an idle injection framework
From: Daniel Lezcano @ 2018-05-31 18:25 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: rjw, edubezval, kevin.wangtao, leo.yan, vincent.guittot,
	linux-kernel, javi.merino, rui.zhang, linux-pm, daniel.thompson
In-Reply-To: <20180529093148.b2cxb5lwks7ka2tn@vireshk-i7>

On 29/05/2018 11:31, Viresh Kumar wrote:
> Hi Daniel,
> 
> Thanks for yet another version :)
> 
> On 25-05-18, 11:49, Daniel Lezcano wrote:
>> +++ b/drivers/powercap/idle_injection.c
>> +static void idle_injection_wakeup(struct idle_injection_device *ii_dev)
>> +{
>> +	struct idle_injection_thread *iit;
>> +	int cpu;
>> +
>> +	for_each_cpu_and(cpu, ii_dev->cpumask, cpu_online_mask) {
>> +		iit = per_cpu_ptr(&idle_injection_thread, cpu);
>> +		iit->should_run = 1;
>> +		wake_up_process(iit->tsk);
>> +	}
>> +}
>> +
>> +/**
>> + * idle_injection_wakeup_fn - idle injection timer callback
>> + * @timer: a hrtimer structure
>> + *
>> + * This function is called when the idle injection timer expires which
>> + * will wake up the idle injection tasks and these ones, in turn, play
>> + * idle a specified amount of time.
>> + *
>> + * Return: HRTIMER_NORESTART.
>> + */
>> +static enum hrtimer_restart idle_injection_wakeup_fn(struct hrtimer *timer)
>> +{
>> +	struct idle_injection_device *ii_dev =
>> +		container_of(timer, struct idle_injection_device, timer);
>> +
>> +	idle_injection_wakeup(ii_dev);
>> +
>> +	return HRTIMER_NORESTART;
>> +}
>> +
>> +/**
>> + * idle_injection_fn - idle injection routine
>> + * @cpu: the CPU number the tasks belongs to
>> + *
>> + * The idle injection routine will stay idle the specified amount of
>> + * time
>> + */
>> +static void idle_injection_fn(unsigned int cpu)
>> +{
>> +	struct idle_injection_device *ii_dev;
>> +	struct idle_injection_thread *iit;
>> +	int run_duration_ms, idle_duration_ms;
>> +
>> +	ii_dev = per_cpu(idle_injection_device, cpu);
>> +
>> +	if (WARN_ON_ONCE(!ii_dev))
> 
> Yes, this is marked as "unlikely" and is kind of not that harmful, but
> I would suggest to just drop it and let the kernel crash if that
> serious of a bug is present in this code where ii_dev can be NULL
> here.

Ok.

>> +		return;
>> +
>> +	iit = per_cpu_ptr(&idle_injection_thread, cpu);
> 
> See, we don't check this one, why check only ii_dev ? :)
> 
>> +
>> +	/*
>> +	 * Boolean used by the smpboot main loop and used as a
>> +	 * flip-flop in this function
>> +	 */
>> +	iit->should_run = 0;
>> +
>> +	atomic_inc(&ii_dev->count);
>> +
>> +	idle_duration_ms = atomic_read(&ii_dev->idle_duration_ms);
>> +	if (idle_duration_ms)
>> +		play_idle(idle_duration_ms);
>> +
>> +	/*
>> +	 * The last CPU waking up is in charge of setting the timer. If
>> +	 * the CPU is hotplugged, the timer will move to another CPU
>> +	 * (which may not belong to the same cluster) but that is not a
>> +	 * problem as the timer will be set again by another CPU
>> +	 * belonging to the cluster. This mechanism is self adaptive.
>> +	 */
> 
> I am afraid that the above comment may not be completely true all the
> time. For a quad-core platform, it is possible for 3 CPUs (0,1,2) to
> run this function as soon as the kthread is woken up, but one of the
> CPUs (3) may be stuck servicing an IRQ, Deadline or RT activity.
> Because you do atomic_inc() also in this function (above) itself,
> below decrement may return a true value for the CPU2 and that will
> restart the hrtimer, while one of the CPUs never got a chance to
> increment count in the first place.
>
> The fix is simple though, do the increment in idle_injection_wakeup()
> and things should be fine then.

Ok.

>> +	if (!atomic_dec_and_test(&ii_dev->count))
>> +		return;
>> +
>> +	run_duration_ms = atomic_read(&ii_dev->run_duration_ms);
>> +	if (run_duration_ms) {
>> +		hrtimer_start(&ii_dev->timer, ms_to_ktime(run_duration_ms),
>> +			      HRTIMER_MODE_REL_PINNED);
>> +		return;
>> +	}
>> +
>> +	complete(&ii_dev->stop_complete);
> 
> So you call complete() after hrtimer is potentially restarted. This
> can happen if idle_injection_stop() is called right after the above
> atomic_read() has finished :)
> 
> IOW, this doesn't look safe now as well.

It is safe, we just missed a cycle and the stop will block until the
next cycle. I did it on purpose and for me it is correct.

>> +}
> 
>> +/**
>> + * idle_injection_stop - stops the idle injections
>> + * @ii_dev: a pointer to an idle injection_device structure
>> + *
>> + * The function stops the idle injection by resetting the idle and
>> + * running durations and wait for the threads to complete. If we are
>> + * in the process of injecting an idle cycle, then this will wait the
>> + * end of the cycle.
>> + */
>> +void idle_injection_stop(struct idle_injection_device *ii_dev)
>> +{
>> +	pr_debug("Stopping injecting idle cycles on CPUs '%*pbl'\n",
>> +		 cpumask_pr_args(ii_dev->cpumask));
>> +
>> +	init_completion(&ii_dev->stop_complete);
> 
> This looks completely Borken (yeah, broken :)). complete() may be
> running in parallel under spinlock and updating x->done while you just
> set it to 0 here without any locking in place. init_completion()
> should be used only once after ii_dev is allocated and I don't see
> that being done either, so that looks incorrect as well.

Ok.

>> +
>> +	idle_injection_set_duration(ii_dev, 0, 0);
>> +
>> +	wait_for_completion_interruptible(&ii_dev->stop_complete);
>> +}
>> +
>> +/**
>> + * idle_injection_setup - initialize the current task as a RT task
>> + * @cpu: the CPU number where the kthread is running on (not used)
>> + *
>> + * Called one time, this function is in charge of setting the task
>> + * scheduler parameters.
>> + */
>> +static void idle_injection_setup(unsigned int cpu)
>> +{
>> +	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
>> +
>> +	set_freezable();
>> +
>> +	sched_setscheduler(current, SCHED_FIFO, &param);
>> +}
>> +
>> +/**
>> + * idle_injection_should_run - function helper for the smpboot API
>> + * @cpu: the CPU number where the kthread is running on
>> + *
>> + * Return: a boolean telling if the thread can run.
>> + */
>> +static int idle_injection_should_run(unsigned int cpu)
>> +{
>> +	struct idle_injection_thread *iit =
>> +		per_cpu_ptr(&idle_injection_thread, cpu);
>> +
>> +	return iit->should_run;
>> +}
>> +
>> +static struct idle_injection_device *ii_dev_alloc(void)
>> +{
>> +	struct idle_injection_device *ii_dev;
>> +
>> +	ii_dev = kzalloc(sizeof(*ii_dev), GFP_KERNEL);
>> +	if (!ii_dev)
>> +		return NULL;
>> +
>> +	if (!alloc_cpumask_var(&ii_dev->cpumask, GFP_KERNEL)) {
>> +		kfree(ii_dev);
>> +		return NULL;
>> +	}
>> +
>> +	return ii_dev;
>> +}
>> +
>> +static void ii_dev_free(struct idle_injection_device *ii_dev)
>> +{
>> +	free_cpumask_var(ii_dev->cpumask);
>> +	kfree(ii_dev);
>> +}
>> +
>> +/**
>> + * idle_injection_register - idle injection init routine
>> + * @cpumask: the list of CPUs managed by the idle injection device
>> + *
>> + * This is the initialization function in charge of creating the
>> + * initializing of the timer and allocate the structures. It does not
>> + * starts the idle injection cycles.
>> + *
>> + * Return: NULL if an allocation fails.
>> + */
>> +struct idle_injection_device *idle_injection_register(struct cpumask *cpumask)
>> +{
>> +	struct idle_injection_device *ii_dev;
>> +	int cpu, cpu2;
>> +
>> +	ii_dev = ii_dev_alloc();
>> +	if (!ii_dev)
>> +		return NULL;
>> +
>> +	cpumask_copy(ii_dev->cpumask, cpumask);
>> +	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	ii_dev->timer.function = idle_injection_wakeup_fn;
>> +
>> +	for_each_cpu(cpu, ii_dev->cpumask) {
>> +
>> +		if (per_cpu(idle_injection_device, cpu)) {
> 
> Maybe add unlikely here ?

For this kind of init function and tight loop, there is no benefit of
adding the unlikely / likely. I can add it if you want, but it is useless.

>> +			pr_err("cpu%d is already registered\n", cpu);
>> +			goto out_rollback_per_cpu;
>> +		}
>> +
>> +		per_cpu(idle_injection_device, cpu) = ii_dev;
>> +	}
>> +
>> +	return ii_dev;
>> +
>> +out_rollback_per_cpu:
>> +	for_each_cpu(cpu2, ii_dev->cpumask) {
>> +		if (cpu == cpu2)
> 
> And is this really required? Perhaps this conditional is making it
> worse and just setting the per-cpu variable forcefully to NULL would
> be much faster :)

We want undo what was done, setting all variables to NULL resets the
values from a previous register and we don't want that.


> Maybe move this for-loop into iid_dev_free() and you can make this
> look simple then.
> 
>> +			break;
>> +
>> +		per_cpu(idle_injection_device, cpu2) = NULL;
>> +	}
>> +
>> +	ii_dev_free(ii_dev);
>> +
>> +	return NULL;
>> +}
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply

* Re: [PATCH 10/11] dt-bindings: misc: add bindings for throttler
From: Matthias Kaehlcke @ 2018-05-31 18:34 UTC (permalink / raw)
  To: Rob Herring
  Cc: MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Arnd Bergmann,
	Greg Kroah-Hartman, Mark Rutland, linux-pm, devicetree,
	linux-kernel, Brian Norris, Douglas Anderson
In-Reply-To: <20180531163159.GA4369@rob-hp-laptop>

Hi Rob,

On Thu, May 31, 2018 at 11:31:59AM -0500, Rob Herring wrote:
> On Fri, May 25, 2018 at 01:30:42PM -0700, Matthias Kaehlcke wrote:
> 
> Commit msg?

Will add some more info in the next revision.

> > Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> > ---
> >  .../devicetree/bindings/misc/throttler.txt    | 41 +++++++++++++++++++
> >  1 file changed, 41 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/misc/throttler.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/misc/throttler.txt b/Documentation/devicetree/bindings/misc/throttler.txt
> > new file mode 100644
> > index 000000000000..92f13e94451a
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/misc/throttler.txt
> > @@ -0,0 +1,41 @@
> > +Throttler driver
> > +
> > +The Throttler is used for non-thermal throttling of system components like
> > +CPUs or devfreq devices.
> 
> This all looks very Linux specific and not a h/w device. Perhaps you can 
> add hint properties to OPP tables as to what entries can be used for 
> throttling, but otherwise this doesn't belong in DT.

My idea is to allow multiple throttlers, which might operate on
different throttling devices or use different OPPs for the same
device. To support this a simple boolean hint that the OPP can be used
for throttling would not be sufficient.

What should work is a property with an array of phandles of the
throttlers that use a given OPP.

AFAIK it is currently not possible to enumerate the devfreq devices in
the system, so besides the info in the OPPs the throttler itself would
still need a phandle to the devfreq device(s) it uses.

I envision something like this:

gpu_opp_table: opp-table2 {
  compatible = "operating-points-v2";

  opp00 {
    opp-hz = /bits/ 64 <200000000>;
    opp-microvolt = <800000>;
  };
  opp01 {
    opp-hz = /bits/ 64 <297000000>;
    opp-microvolt = <800000>;
    opp-throttlers = <&cros_ec_throttler>;
  };
  ...
};

cros_ec_throttler: cros-ec-throttler {
  compatible = "google,cros-ec-throttler";

  devfreq-devices = <&gpu>;
};

Would this be acceptable?

Thanks

Matthias

^ permalink raw reply

* Re: [PATCH 10/11] dt-bindings: misc: add bindings for throttler
From: Rob Herring @ 2018-05-31 20:04 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Arnd Bergmann,
	Greg Kroah-Hartman, Mark Rutland, linux-pm, devicetree,
	linux-kernel@vger.kernel.org, Brian Norris, Douglas Anderson
In-Reply-To: <20180531183404.GB88063@google.com>

On Thu, May 31, 2018 at 1:34 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
> Hi Rob,
>
> On Thu, May 31, 2018 at 11:31:59AM -0500, Rob Herring wrote:
>> On Fri, May 25, 2018 at 01:30:42PM -0700, Matthias Kaehlcke wrote:
>>
>> Commit msg?
>
> Will add some more info in the next revision.
>
>> > Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
>> > ---
>> >  .../devicetree/bindings/misc/throttler.txt    | 41 +++++++++++++++++++
>> >  1 file changed, 41 insertions(+)
>> >  create mode 100644 Documentation/devicetree/bindings/misc/throttler.txt
>> >
>> > diff --git a/Documentation/devicetree/bindings/misc/throttler.txt b/Documentation/devicetree/bindings/misc/throttler.txt
>> > new file mode 100644
>> > index 000000000000..92f13e94451a
>> > --- /dev/null
>> > +++ b/Documentation/devicetree/bindings/misc/throttler.txt
>> > @@ -0,0 +1,41 @@
>> > +Throttler driver
>> > +
>> > +The Throttler is used for non-thermal throttling of system components like
>> > +CPUs or devfreq devices.
>>
>> This all looks very Linux specific and not a h/w device. Perhaps you can
>> add hint properties to OPP tables as to what entries can be used for
>> throttling, but otherwise this doesn't belong in DT.
>
> My idea is to allow multiple throttlers, which might operate on
> different throttling devices or use different OPPs for the same
> device. To support this a simple boolean hint that the OPP can be used
> for throttling would not be sufficient.
>
> What should work is a property with an array of phandles of the
> throttlers that use a given OPP.
>
> AFAIK it is currently not possible to enumerate the devfreq devices in
> the system, so besides the info in the OPPs the throttler itself would
> still need a phandle to the devfreq device(s) it uses.

Why don't you fix that OS problem instead of working around it in DT?

>
> I envision something like this:
>
> gpu_opp_table: opp-table2 {
>   compatible = "operating-points-v2";
>
>   opp00 {
>     opp-hz = /bits/ 64 <200000000>;
>     opp-microvolt = <800000>;
>   };
>   opp01 {
>     opp-hz = /bits/ 64 <297000000>;
>     opp-microvolt = <800000>;
>     opp-throttlers = <&cros_ec_throttler>;
>   };
>   ...
> };
>
> cros_ec_throttler: cros-ec-throttler {
>   compatible = "google,cros-ec-throttler";

Is this an actual h/w device?

>
>   devfreq-devices = <&gpu>;
> };
>
> Would this be acceptable?
>
> Thanks
>
> Matthias

^ permalink raw reply

* Re: [PATCH 10/11] dt-bindings: misc: add bindings for throttler
From: Matthias Kaehlcke @ 2018-05-31 21:10 UTC (permalink / raw)
  To: Rob Herring
  Cc: MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Arnd Bergmann,
	Greg Kroah-Hartman, Mark Rutland, linux-pm, devicetree,
	linux-kernel@vger.kernel.org, Brian Norris, Douglas Anderson
In-Reply-To: <CAL_JsqK_nFtuxhNbAn3yYueLKAxvwfnY=5M6zFdGrwwwPVrP5g@mail.gmail.com>

On Thu, May 31, 2018 at 03:04:18PM -0500, Rob Herring wrote:
> On Thu, May 31, 2018 at 1:34 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
> > Hi Rob,
> >
> > On Thu, May 31, 2018 at 11:31:59AM -0500, Rob Herring wrote:
> >> On Fri, May 25, 2018 at 01:30:42PM -0700, Matthias Kaehlcke wrote:
> >>
> >> Commit msg?
> >
> > Will add some more info in the next revision.
> >
> >> > Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> >> > ---
> >> >  .../devicetree/bindings/misc/throttler.txt    | 41 +++++++++++++++++++
> >> >  1 file changed, 41 insertions(+)
> >> >  create mode 100644 Documentation/devicetree/bindings/misc/throttler.txt
> >> >
> >> > diff --git a/Documentation/devicetree/bindings/misc/throttler.txt b/Documentation/devicetree/bindings/misc/throttler.txt
> >> > new file mode 100644
> >> > index 000000000000..92f13e94451a
> >> > --- /dev/null
> >> > +++ b/Documentation/devicetree/bindings/misc/throttler.txt
> >> > @@ -0,0 +1,41 @@
> >> > +Throttler driver
> >> > +
> >> > +The Throttler is used for non-thermal throttling of system components like
> >> > +CPUs or devfreq devices.
> >>
> >> This all looks very Linux specific and not a h/w device. Perhaps you can
> >> add hint properties to OPP tables as to what entries can be used for
> >> throttling, but otherwise this doesn't belong in DT.
> >
> > My idea is to allow multiple throttlers, which might operate on
> > different throttling devices or use different OPPs for the same
> > device. To support this a simple boolean hint that the OPP can be used
> > for throttling would not be sufficient.
> >
> > What should work is a property with an array of phandles of the
> > throttlers that use a given OPP.
> >
> > AFAIK it is currently not possible to enumerate the devfreq devices in
> > the system, so besides the info in the OPPs the throttler itself would
> > still need a phandle to the devfreq device(s) it uses.
> 
> Why don't you fix that OS problem instead of working around it in DT?

I can try, though it's not exclusively in my hands, depends on what
the devfreq maintainers think about it.

> > I envision something like this:
> >
> > gpu_opp_table: opp-table2 {
> >   compatible = "operating-points-v2";
> >
> >   opp00 {
> >     opp-hz = /bits/ 64 <200000000>;
> >     opp-microvolt = <800000>;
> >   };
> >   opp01 {
> >     opp-hz = /bits/ 64 <297000000>;
> >     opp-microvolt = <800000>;
> >     opp-throttlers = <&cros_ec_throttler>;
> >   };
> >   ...
> > };
> >
> > cros_ec_throttler: cros-ec-throttler {
> >   compatible = "google,cros-ec-throttler";
> 
> Is this an actual h/w device?

The Chrome OS Embedded Controller is a MCU that communicates with
Linux over SPI or I2C. The low-level communication with the EC is
handled by drivers/mfd/cros_ec* and then there are multiple drivers
representing 'remote devices' on top of this (rtc-cros-ec.c,
extcon-usbc-cros-ec.c, pwm-cros-ec.c, ...). cros_ec_throttler is a
similar 'device' that responds to signals from the EC with frequency
adjustments.

^ permalink raw reply

* Re: [PATCH 2/2] PM / devfreq: Generic cpufreq governor
From: Saravana Kannan @ 2018-05-31 21:53 UTC (permalink / raw)
  To: MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Rob Herring,
	Mark Rutland
  Cc: Rajendra Nayak, Amit Kucheria, linux-pm, devicetree, linux-kernel
In-Reply-To: <1526631889-5084-3-git-send-email-skannan@codeaurora.org>

Chanwoo,

Friendly reminder for a review. I've addressed your comments on v1 of 
this patch.

-Saravana


On 05/18/2018 01:24 AM, Saravana Kannan wrote:
> This devfreq governor is a generic implementation that can scale any
> devfreq device based on the current CPU frequency of all ONLINE CPUs. It
> allows for specifying CPU freq to devfreq mapping for specific devices.
> When such a mapping is not present, it defaults to scaling the device
> frequency in proportion to the CPU frequency.
>
> Change-Id: I7f786b9059435afe85b9ec8c504a4655731ee20e
> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
> ---
>   .../bindings/devfreq/devfreq-cpufreq.txt           |  53 ++
>   drivers/devfreq/Kconfig                            |   8 +
>   drivers/devfreq/Makefile                           |   1 +
>   drivers/devfreq/governor_cpufreq.c                 | 628 +++++++++++++++++++++
>   4 files changed, 690 insertions(+)
>   create mode 100644 Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt
>   create mode 100644 drivers/devfreq/governor_cpufreq.c
>
> diff --git a/Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt b/Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt
> new file mode 100644
> index 0000000..6537538
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt
> @@ -0,0 +1,53 @@
> +Devfreq CPUfreq governor
> +
> +devfreq-cpufreq is a parent device that contains one or more child devices.
> +Each child device provides CPU frequency to device frequency mapping for a
> +specific device. Examples of devices that could use this are: DDR, cache and
> +CCI.
> +
> +Parent device name shall be "devfreq-cpufreq".
> +
> +Required child device properties:
> +- cpu-to-dev-map, or cpu-to-dev-map-<X>:
> +			A list of tuples where each tuple consists of a
> +			CPU frequency (KHz) and the corresponding device
> +			frequency. CPU frequencies not listed in the table
> +			will use the device frequency that corresponds to the
> +			next rounded up CPU frequency.
> +			Use "cpu-to-dev-map" if all CPUs in the system should
> +			share same mapping.
> +			Use cpu-to-dev-map-<cpuid> to describe different
> +			mappings for different CPUs. The property should be
> +			listed only for the first CPU if multiple CPUs are
> +			synchronous.
> +- target-dev:		Phandle to device that this mapping applies to.
> +
> +Example:
> +	devfreq-cpufreq {
> +		cpubw-cpufreq {
> +			target-dev = <&cpubw>;
> +			cpu-to-dev-map =
> +				<  300000  1144 >,
> +				<  422400  2288 >,
> +				<  652800  3051 >,
> +				<  883200  5996 >,
> +				< 1190400  8056 >,
> +				< 1497600 10101 >,
> +				< 1728000 12145 >,
> +				< 2649600 16250 >;
> +		};
> +
> +		cache-cpufreq {
> +			target-dev = <&cache>;
> +			cpu-to-dev-map =
> +				<  300000  300000 >,
> +				<  422400  422400 >,
> +				<  652800  499200 >,
> +				<  883200  576000 >,
> +				<  960000  960000 >,
> +				< 1497600 1036800 >,
> +				< 1574400 1574400 >,
> +				< 1728000 1651200 >,
> +				< 2649600 1728000 >;
> +		};
> +	};
> diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig
> index 8503018..5eeb33f 100644
> --- a/drivers/devfreq/Kconfig
> +++ b/drivers/devfreq/Kconfig
> @@ -73,6 +73,14 @@ config DEVFREQ_GOV_PASSIVE
>   	  through sysfs entries. The passive governor recommends that
>   	  devfreq device uses the OPP table to get the frequency/voltage.
>   
> +config DEVFREQ_GOV_CPUFREQ
> +	tristate "CPUfreq"
> +	depends on CPU_FREQ
> +	help
> +	  Chooses frequency based on the online CPUs' current frequency and a
> +	  CPU frequency to device frequency mapping table(s). This governor
> +	  can be useful for controlling devices such as DDR, cache, CCI, etc.
> +
>   comment "DEVFREQ Drivers"
>   
>   config ARM_EXYNOS_BUS_DEVFREQ
> diff --git a/drivers/devfreq/Makefile b/drivers/devfreq/Makefile
> index f1cc8990..cafe7c2 100644
> --- a/drivers/devfreq/Makefile
> +++ b/drivers/devfreq/Makefile
> @@ -6,6 +6,7 @@ obj-$(CONFIG_DEVFREQ_GOV_PERFORMANCE)	+= governor_performance.o
>   obj-$(CONFIG_DEVFREQ_GOV_POWERSAVE)	+= governor_powersave.o
>   obj-$(CONFIG_DEVFREQ_GOV_USERSPACE)	+= governor_userspace.o
>   obj-$(CONFIG_DEVFREQ_GOV_PASSIVE)	+= governor_passive.o
> +obj-$(CONFIG_DEVFREQ_GOV_CPUFREQ)	+= governor_cpufreq.o
>   
>   # DEVFREQ Drivers
>   obj-$(CONFIG_ARM_EXYNOS_BUS_DEVFREQ)	+= exynos-bus.o
> diff --git a/drivers/devfreq/governor_cpufreq.c b/drivers/devfreq/governor_cpufreq.c
> new file mode 100644
> index 0000000..8f8ea26
> --- /dev/null
> +++ b/drivers/devfreq/governor_cpufreq.c
> @@ -0,0 +1,628 @@
> +/*
> + * Copyright (c) 2014-2015, The Linux Foundation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 and
> + * only version 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#define pr_fmt(fmt) "dev-cpufreq: " fmt
> +
> +#include <linux/devfreq.h>
> +#include <linux/cpu.h>
> +#include <linux/cpufreq.h>
> +#include <linux/cpumask.h>
> +#include <linux/slab.h>
> +#include <linux/platform_device.h>
> +#include <linux/of.h>
> +#include <linux/module.h>
> +#include "governor.h"
> +
> +struct cpu_state {
> +	unsigned int freq;
> +	unsigned int min_freq;
> +	unsigned int max_freq;
> +	bool on;
> +	unsigned int first_cpu;
> +};
> +static struct cpu_state *state[NR_CPUS];
> +static int cpufreq_cnt;
> +
> +struct freq_map {
> +	unsigned int cpu_khz;
> +	unsigned int target_freq;
> +};
> +
> +struct devfreq_node {
> +	struct devfreq *df;
> +	void *orig_data;
> +	struct device *dev;
> +	struct device_node *of_node;
> +	struct list_head list;
> +	struct freq_map **map;
> +	struct freq_map *common_map;
> +};
> +static LIST_HEAD(devfreq_list);
> +static DEFINE_MUTEX(state_lock);
> +static DEFINE_MUTEX(cpufreq_reg_lock);
> +
> +static void update_all_devfreqs(void)
> +{
> +	struct devfreq_node *node;
> +
> +	list_for_each_entry(node, &devfreq_list, list) {
> +		struct devfreq *df = node->df;
> +		if (!node->df)
> +			continue;
> +		mutex_lock(&df->lock);
> +		update_devfreq(df);
> +		mutex_unlock(&df->lock);
> +
> +	}
> +}
> +
> +static struct devfreq_node *find_devfreq_node(struct device *dev)
> +{
> +	struct devfreq_node *node;
> +
> +	list_for_each_entry(node, &devfreq_list, list)
> +		if (node->dev == dev || node->of_node == dev->of_node)
> +			return node;
> +
> +	return NULL;
> +}
> +
> +/* ==================== cpufreq part ==================== */
> +static void add_policy(struct cpufreq_policy *policy)
> +{
> +	struct cpu_state *new_state;
> +	unsigned int cpu, first_cpu;
> +
> +	if (state[policy->cpu]) {
> +		state[policy->cpu]->freq = policy->cur;
> +		state[policy->cpu]->on = true;
> +	} else {
> +		new_state = kzalloc(sizeof(struct cpu_state), GFP_KERNEL);
> +		if (!new_state)
> +			return;
> +
> +		first_cpu = cpumask_first(policy->related_cpus);
> +		new_state->first_cpu = first_cpu;
> +		new_state->freq = policy->cur;
> +		new_state->min_freq = policy->cpuinfo.min_freq;
> +		new_state->max_freq = policy->cpuinfo.max_freq;
> +		new_state->on = true;
> +
> +		for_each_cpu(cpu, policy->related_cpus)
> +			state[cpu] = new_state;
> +	}
> +}
> +
> +static int cpufreq_policy_notifier(struct notifier_block *nb,
> +		unsigned long event, void *data)
> +{
> +	struct cpufreq_policy *policy = data;
> +
> +	switch (event) {
> +	case CPUFREQ_CREATE_POLICY:
> +		mutex_lock(&state_lock);
> +		add_policy(policy);
> +		update_all_devfreqs();
> +		mutex_unlock(&state_lock);
> +		break;
> +
> +	case CPUFREQ_REMOVE_POLICY:
> +		mutex_lock(&state_lock);
> +		if (state[policy->cpu]) {
> +			state[policy->cpu]->on = false;
> +			update_all_devfreqs();
> +		}
> +		mutex_unlock(&state_lock);
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct notifier_block cpufreq_policy_nb = {
> +	.notifier_call = cpufreq_policy_notifier
> +};
> +
> +static int cpufreq_trans_notifier(struct notifier_block *nb,
> +		unsigned long event, void *data)
> +{
> +	struct cpufreq_freqs *freq = data;
> +	struct cpu_state *s;
> +
> +	if (event != CPUFREQ_POSTCHANGE)
> +		return 0;
> +
> +	mutex_lock(&state_lock);
> +
> +	s = state[freq->cpu];
> +	if (!s)
> +		goto out;
> +
> +	if (s->freq != freq->new) {
> +		s->freq = freq->new;
> +		update_all_devfreqs();
> +	}
> +
> +out:
> +	mutex_unlock(&state_lock);
> +	return 0;
> +}
> +
> +static struct notifier_block cpufreq_trans_nb = {
> +	.notifier_call = cpufreq_trans_notifier
> +};
> +
> +static int register_cpufreq(void)
> +{
> +	int ret = 0;
> +	unsigned int cpu;
> +	struct cpufreq_policy *policy;
> +
> +	mutex_lock(&cpufreq_reg_lock);
> +
> +	if (cpufreq_cnt)
> +		goto cnt_not_zero;
> +
> +	get_online_cpus();
> +	ret = cpufreq_register_notifier(&cpufreq_policy_nb,
> +				CPUFREQ_POLICY_NOTIFIER);
> +	if (ret)
> +		goto out;
> +
> +	ret = cpufreq_register_notifier(&cpufreq_trans_nb,
> +				CPUFREQ_TRANSITION_NOTIFIER);
> +	if (ret) {
> +		cpufreq_unregister_notifier(&cpufreq_policy_nb,
> +				CPUFREQ_POLICY_NOTIFIER);
> +		goto out;
> +	}
> +
> +	for_each_online_cpu(cpu) {
> +		policy = cpufreq_cpu_get(cpu);
> +		if (policy) {
> +			add_policy(policy);
> +			cpufreq_cpu_put(policy);
> +		}
> +	}
> +out:
> +	put_online_cpus();
> +cnt_not_zero:
> +	if (!ret)
> +		cpufreq_cnt++;
> +	mutex_unlock(&cpufreq_reg_lock);
> +	return ret;
> +}
> +
> +static int unregister_cpufreq(void)
> +{
> +	int ret = 0;
> +	int cpu;
> +
> +	mutex_lock(&cpufreq_reg_lock);
> +
> +	if (cpufreq_cnt > 1)
> +		goto out;
> +
> +	cpufreq_unregister_notifier(&cpufreq_policy_nb,
> +				CPUFREQ_POLICY_NOTIFIER);
> +	cpufreq_unregister_notifier(&cpufreq_trans_nb,
> +				CPUFREQ_TRANSITION_NOTIFIER);
> +
> +	for (cpu = ARRAY_SIZE(state) - 1; cpu >= 0; cpu--) {
> +		if (!state[cpu])
> +			continue;
> +		if (state[cpu]->first_cpu == cpu)
> +			kfree(state[cpu]);
> +		state[cpu] = NULL;
> +	}
> +
> +out:
> +	cpufreq_cnt--;
> +	mutex_unlock(&cpufreq_reg_lock);
> +	return ret;
> +}
> +
> +/* ==================== devfreq part ==================== */
> +
> +static unsigned int interpolate_freq(struct devfreq *df, unsigned int cpu)
> +{
> +	unsigned long *freq_table = df->profile->freq_table;
> +	unsigned int cpu_min = state[cpu]->min_freq;
> +	unsigned int cpu_max = state[cpu]->max_freq;
> +	unsigned int cpu_freq = state[cpu]->freq;
> +	unsigned int dev_min, dev_max, cpu_percent;
> +
> +	if (freq_table) {
> +		dev_min = freq_table[0];
> +		dev_max = freq_table[df->profile->max_state - 1];
> +	} else {
> +		if (df->max_freq <= df->min_freq)
> +			return 0;
> +		dev_min = df->min_freq;
> +		dev_max = df->max_freq;
> +	}
> +
> +	cpu_percent = ((cpu_freq - cpu_min) * 100) / (cpu_max - cpu_min);
> +	return dev_min + mult_frac(dev_max - dev_min, cpu_percent, 100);
> +}
> +
> +static unsigned int cpu_to_dev_freq(struct devfreq *df, unsigned int cpu)
> +{
> +	struct freq_map *map = NULL;
> +	unsigned int cpu_khz = 0, freq;
> +	struct devfreq_node *n = df->data;
> +
> +	if (!state[cpu] || !state[cpu]->on || state[cpu]->first_cpu != cpu) {
> +		freq = 0;
> +		goto out;
> +	}
> +
> +	if (n->common_map)
> +		map = n->common_map;
> +	else if (n->map)
> +		map = n->map[cpu];
> +
> +	cpu_khz = state[cpu]->freq;
> +
> +	if (!map) {
> +		freq = interpolate_freq(df, cpu);
> +		goto out;
> +	}
> +
> +	while (map->cpu_khz && map->cpu_khz < cpu_khz)
> +		map++;
> +	if (!map->cpu_khz)
> +		map--;
> +	freq = map->target_freq;
> +
> +out:
> +	dev_dbg(df->dev.parent, "CPU%u: %d -> dev: %u\n", cpu, cpu_khz, freq);
> +	return freq;
> +}
> +
> +static int devfreq_cpufreq_get_freq(struct devfreq *df,
> +					unsigned long *freq)
> +{
> +	unsigned int cpu, tgt_freq = 0;
> +	struct devfreq_node *node;
> +
> +	node = df->data;
> +	if (!node) {
> +		pr_err("Unable to find devfreq node!\n");
> +		return -ENODEV;
> +	}
> +
> +	for_each_possible_cpu(cpu)
> +		tgt_freq = max(tgt_freq, cpu_to_dev_freq(df, cpu));
> +
> +	*freq = tgt_freq;
> +	return 0;
> +}
> +
> +static unsigned int show_table(char *buf, unsigned int len,
> +				struct freq_map *map)
> +{
> +	unsigned int cnt = 0;
> +
> +	cnt += snprintf(buf + cnt, len - cnt, "CPU freq\tDevice freq\n");
> +
> +	while (map->cpu_khz && cnt < len) {
> +		cnt += snprintf(buf + cnt, len - cnt, "%8u\t%11u\n",
> +				map->cpu_khz, map->target_freq);
> +		map++;
> +	}
> +	if (cnt < len)
> +		cnt += snprintf(buf + cnt, len - cnt, "\n");
> +
> +	return cnt;
> +}
> +
> +static ssize_t show_map(struct device *dev, struct device_attribute *attr,
> +			char *buf)
> +{
> +	struct devfreq *df = to_devfreq(dev);
> +	struct devfreq_node *n = df->data;
> +	struct freq_map *map;
> +	unsigned int cnt = 0, cpu;
> +
> +	mutex_lock(&state_lock);
> +	if (n->common_map) {
> +		map = n->common_map;
> +		cnt += snprintf(buf + cnt, PAGE_SIZE - cnt,
> +				"Common table for all CPUs:\n");
> +		cnt += show_table(buf + cnt, PAGE_SIZE - cnt, map);
> +	} else if (n->map) {
> +		for_each_possible_cpu(cpu) {
> +			map = n->map[cpu];
> +			if (!map)
> +				continue;
> +			cnt += snprintf(buf + cnt, PAGE_SIZE - cnt,
> +					"CPU %u:\n", cpu);
> +			if (cnt >= PAGE_SIZE)
> +				break;
> +			cnt += show_table(buf + cnt, PAGE_SIZE - cnt, map);
> +			if (cnt >= PAGE_SIZE)
> +				break;
> +		}
> +	} else {
> +		cnt += snprintf(buf + cnt, PAGE_SIZE - cnt,
> +				"Device freq interpolated based on CPU freq\n");
> +	}
> +	mutex_unlock(&state_lock);
> +
> +	return cnt;
> +}
> +
> +static DEVICE_ATTR(freq_map, 0444, show_map, NULL);
> +static struct attribute *dev_attr[] = {
> +	&dev_attr_freq_map.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group dev_attr_group = {
> +	.name = "cpufreq",
> +	.attrs = dev_attr,
> +};
> +
> +static int devfreq_cpufreq_gov_start(struct devfreq *devfreq)
> +{
> +	int ret = 0;
> +	struct devfreq_node *node;
> +	bool alloc = false;
> +
> +	ret = register_cpufreq();
> +	if (ret)
> +		return ret;
> +
> +	ret = sysfs_create_group(&devfreq->dev.kobj, &dev_attr_group);
> +	if (ret) {
> +		unregister_cpufreq();
> +		return ret;
> +	}
> +
> +	mutex_lock(&state_lock);
> +
> +	node = find_devfreq_node(devfreq->dev.parent);
> +	if (node == NULL) {
> +		node = kzalloc(sizeof(struct devfreq_node), GFP_KERNEL);
> +		if (!node) {
> +			pr_err("Out of memory!\n");
> +			ret = -ENOMEM;
> +			goto alloc_fail;
> +		}
> +		alloc = true;
> +		node->dev = devfreq->dev.parent;
> +		list_add_tail(&node->list, &devfreq_list);
> +	}
> +	node->df = devfreq;
> +	node->orig_data = devfreq->data;
> +	devfreq->data = node;
> +
> +	mutex_lock(&devfreq->lock);
> +	ret = update_devfreq(devfreq);
> +	mutex_unlock(&devfreq->lock);
> +	if (ret) {
> +		pr_err("Freq update failed!\n");
> +		goto update_fail;
> +	}
> +
> +	mutex_unlock(&state_lock);
> +	return 0;
> +
> +update_fail:
> +	devfreq->data = node->orig_data;
> +	if (alloc) {
> +		list_del(&node->list);
> +		kfree(node);
> +	}
> +alloc_fail:
> +	mutex_unlock(&state_lock);
> +	sysfs_remove_group(&devfreq->dev.kobj, &dev_attr_group);
> +	unregister_cpufreq();
> +	return ret;
> +}
> +
> +static void devfreq_cpufreq_gov_stop(struct devfreq *devfreq)
> +{
> +	struct devfreq_node *node = devfreq->data;
> +
> +	mutex_lock(&state_lock);
> +	devfreq->data = node->orig_data;
> +	if (node->map || node->common_map) {
> +		node->df = NULL;
> +	} else {
> +		list_del(&node->list);
> +		kfree(node);
> +	}
> +	mutex_unlock(&state_lock);
> +
> +	sysfs_remove_group(&devfreq->dev.kobj, &dev_attr_group);
> +	unregister_cpufreq();
> +}
> +
> +static int devfreq_cpufreq_ev_handler(struct devfreq *devfreq,
> +					unsigned int event, void *data)
> +{
> +	int ret;
> +
> +	switch (event) {
> +	case DEVFREQ_GOV_START:
> +
> +		ret = devfreq_cpufreq_gov_start(devfreq);
> +		if (ret) {
> +			pr_err("Governor start failed!\n");
> +			return ret;
> +		}
> +		pr_debug("Enabled dev CPUfreq governor\n");
> +		break;
> +
> +	case DEVFREQ_GOV_STOP:
> +
> +		devfreq_cpufreq_gov_stop(devfreq);
> +		pr_debug("Disabled dev CPUfreq governor\n");
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct devfreq_governor devfreq_cpufreq = {
> +	.name = "cpufreq",
> +	.get_target_freq = devfreq_cpufreq_get_freq,
> +	.event_handler = devfreq_cpufreq_ev_handler,
> +};
> +
> +#define NUM_COLS	2
> +static struct freq_map *read_tbl(struct device_node *of_node, char *prop_name)
> +{
> +	int len, nf, i, j;
> +	u32 data;
> +	struct freq_map *tbl;
> +
> +	if (!of_find_property(of_node, prop_name, &len))
> +		return NULL;
> +	len /= sizeof(data);
> +
> +	if (len % NUM_COLS || len == 0)
> +		return NULL;
> +	nf = len / NUM_COLS;
> +
> +	tbl = kzalloc((nf + 1) * sizeof(*tbl), GFP_KERNEL);
> +	if (!tbl)
> +		return NULL;
> +
> +	for (i = 0, j = 0; i < nf; i++, j += 2) {
> +		of_property_read_u32_index(of_node, prop_name, j, &data);
> +		tbl[i].cpu_khz = data;
> +
> +		of_property_read_u32_index(of_node, prop_name, j + 1, &data);
> +		tbl[i].target_freq = data;
> +	}
> +	tbl[i].cpu_khz = 0;
> +
> +	return tbl;
> +}
> +
> +#define PROP_TARGET "target-dev"
> +#define PROP_TABLE "cpu-to-dev-map"
> +static int add_table_from_of(struct device_node *of_node)
> +{
> +	struct device_node *target_of_node;
> +	struct devfreq_node *node;
> +	struct freq_map *common_tbl;
> +	struct freq_map **tbl_list = NULL;
> +	static char prop_name[] = PROP_TABLE "-999999";
> +	int cpu, ret, cnt = 0, prop_sz = ARRAY_SIZE(prop_name);
> +
> +	target_of_node = of_parse_phandle(of_node, PROP_TARGET, 0);
> +	if (!target_of_node)
> +		return -EINVAL;
> +
> +	node = kzalloc(sizeof(struct devfreq_node), GFP_KERNEL);
> +	if (!node)
> +		return -ENOMEM;
> +
> +	common_tbl = read_tbl(of_node, PROP_TABLE);
> +	if (!common_tbl) {
> +		tbl_list = kzalloc(sizeof(*tbl_list) * NR_CPUS, GFP_KERNEL);
> +		if (!tbl_list) {
> +			ret = -ENOMEM;
> +			goto err_list;
> +		}
> +
> +		for_each_possible_cpu(cpu) {
> +			ret = snprintf(prop_name, prop_sz, "%s-%d",
> +					PROP_TABLE, cpu);
> +			if (ret >= prop_sz) {
> +				pr_warn("More CPUs than I can handle!\n");
> +				pr_warn("Skipping rest of the tables!\n");
> +				break;
> +			}
> +			tbl_list[cpu] = read_tbl(of_node, prop_name);
> +			if (tbl_list[cpu])
> +				cnt++;
> +		}
> +	}
> +	if (!common_tbl && !cnt) {
> +		ret = -EINVAL;
> +		goto err_tbl;
> +	}
> +
> +	mutex_lock(&state_lock);
> +	node->of_node = target_of_node;
> +	node->map = tbl_list;
> +	node->common_map = common_tbl;
> +	list_add_tail(&node->list, &devfreq_list);
> +	mutex_unlock(&state_lock);
> +
> +	return 0;
> +err_tbl:
> +	kfree(tbl_list);
> +err_list:
> +	kfree(node);
> +	return ret;
> +}
> +
> +static int __init devfreq_cpufreq_init(void)
> +{
> +	int ret;
> +	struct device_node *of_par, *of_child;
> +
> +	of_par = of_find_node_by_name(NULL, "devfreq-cpufreq");
> +	if (of_par) {
> +		for_each_child_of_node(of_par, of_child) {
> +			ret = add_table_from_of(of_child);
> +			if (ret)
> +				pr_err("Parsing %s failed!\n", of_child->name);
> +			else
> +				pr_debug("Parsed %s.\n", of_child->name);
> +		}
> +		of_node_put(of_par);
> +	} else {
> +		pr_info("No tables parsed from DT.\n");
> +	}
> +
> +	ret = devfreq_add_governor(&devfreq_cpufreq);
> +	if (ret) {
> +		pr_err("Governor add failed!\n");
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +subsys_initcall(devfreq_cpufreq_init);
> +
> +static void __exit devfreq_cpufreq_exit(void)
> +{
> +	int ret, cpu;
> +	struct devfreq_node *node, *tmp;
> +
> +	ret = devfreq_remove_governor(&devfreq_cpufreq);
> +	if (ret)
> +		pr_err("Governor remove failed!\n");
> +
> +	mutex_lock(&state_lock);
> +	list_for_each_entry_safe(node, tmp, &devfreq_list, list) {
> +		kfree(node->common_map);
> +		for_each_possible_cpu(cpu)
> +			kfree(node->map[cpu]);
> +		kfree(node->map);
> +		list_del(&node->list);
> +		kfree(node);
> +	}
> +	mutex_unlock(&state_lock);
> +}
> +module_exit(devfreq_cpufreq_exit);
> +
> +MODULE_DESCRIPTION("CPU freq based generic governor for devfreq devices");
> +MODULE_LICENSE("GPL v2");

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* [PATCH v3 0/2] ti-cpufreq: minor fixes/cleanups
From: Suman Anna @ 2018-05-31 22:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Dave Gerlach, Tero Kristo, linux-pm, linux-omap,
	linux-kernel, Suman Anna

Hi Rafael,

This is a repost of the v2 patches Ccing the proper linux-pm list.
There are no code changes, I have picked up Viresh's acks and
also added the stable kernel versions the first patch needs to be
applied to.

regards
Suman

v2: https://marc.info/?l=linux-omap&m=152268780921259&w=2
v1: https://patchwork.kernel.org/patch/10308925/

Suman Anna (2):
  cpufreq: ti-cpufreq: Fix an incorrect error return value
  cpufreq: ti-cpufreq: Use devres managed API in probe()

 drivers/cpufreq/ti-cpufreq.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [PATCH v3 1/2] cpufreq: ti-cpufreq: Fix an incorrect error return value
From: Suman Anna @ 2018-05-31 22:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Dave Gerlach, Tero Kristo, linux-pm, linux-omap,
	linux-kernel, Suman Anna, Zumeng Chen, stable
In-Reply-To: <20180531222144.26739-1-s-anna@ti.com>

Commit 05829d9431df ("cpufreq: ti-cpufreq: kfree opp_data when
failure") has fixed a memory leak in the failure path, however
the patch returned a positive value on get_cpu_device() failure
instead of the previous negative value. Fix this incorrect error
return value properly.

Fixes: 05829d9431df ("cpufreq: ti-cpufreq: kfree opp_data when failure")
Cc: Zumeng Chen <zumeng.chen@gmail.com>
Cc: <stable@vger.kernel.org> # v4.14+
Signed-off-by: Suman Anna <s-anna@ti.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/ti-cpufreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/ti-cpufreq.c b/drivers/cpufreq/ti-cpufreq.c
index 6ba709b6f095..896caba5dfe5 100644
--- a/drivers/cpufreq/ti-cpufreq.c
+++ b/drivers/cpufreq/ti-cpufreq.c
@@ -226,7 +226,7 @@ static int ti_cpufreq_probe(struct platform_device *pdev)
 	opp_data->cpu_dev = get_cpu_device(0);
 	if (!opp_data->cpu_dev) {
 		pr_err("%s: Failed to get device for CPU0\n", __func__);
-		ret = ENODEV;
+		ret = -ENODEV;
 		goto free_opp_data;
 	}
 
-- 
2.17.0

^ permalink raw reply related

* [PATCH v3 2/2] cpufreq: ti-cpufreq: Use devres managed API in probe()
From: Suman Anna @ 2018-05-31 22:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Dave Gerlach, Tero Kristo, linux-pm, linux-omap,
	linux-kernel, Suman Anna, Zumeng Chen
In-Reply-To: <20180531222144.26739-1-s-anna@ti.com>

The ti_cpufreq_probe() function uses regular kzalloc to allocate
the ti_cpufreq_data structure and kfree for freeing this memory
on failures. Simplify this code by using the devres managed
API.

Cc: Zumeng Chen <zumeng.chen@gmail.com>
Signed-off-by: Suman Anna <s-anna@ti.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/ti-cpufreq.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/ti-cpufreq.c b/drivers/cpufreq/ti-cpufreq.c
index 896caba5dfe5..3f0e2a14895a 100644
--- a/drivers/cpufreq/ti-cpufreq.c
+++ b/drivers/cpufreq/ti-cpufreq.c
@@ -217,7 +217,7 @@ static int ti_cpufreq_probe(struct platform_device *pdev)
 	if (!match)
 		return -ENODEV;
 
-	opp_data = kzalloc(sizeof(*opp_data), GFP_KERNEL);
+	opp_data = devm_kzalloc(&pdev->dev, sizeof(*opp_data), GFP_KERNEL);
 	if (!opp_data)
 		return -ENOMEM;
 
@@ -226,8 +226,7 @@ static int ti_cpufreq_probe(struct platform_device *pdev)
 	opp_data->cpu_dev = get_cpu_device(0);
 	if (!opp_data->cpu_dev) {
 		pr_err("%s: Failed to get device for CPU0\n", __func__);
-		ret = -ENODEV;
-		goto free_opp_data;
+		return -ENODEV;
 	}
 
 	opp_data->opp_node = dev_pm_opp_of_get_opp_desc_node(opp_data->cpu_dev);
@@ -285,8 +284,6 @@ static int ti_cpufreq_probe(struct platform_device *pdev)
 
 fail_put_node:
 	of_node_put(opp_data->opp_node);
-free_opp_data:
-	kfree(opp_data);
 
 	return ret;
 }
-- 
2.17.0

^ permalink raw reply related

* [RFC/RFT] [PATCH v3 0/4] Intel_pstate: HWP Dynamic performance boost
From: Srinivas Pandruvada @ 2018-05-31 22:51 UTC (permalink / raw)
  To: lenb, rjw, peterz, mgorman
  Cc: linux-pm, linux-kernel, juri.lelli, viresh.kumar,
	Srinivas Pandruvada

v3
- Removed atomic bit operation as suggested.
- Added description of contention with user space.
- Removed hwp cache, boost utililty function patch and merged with util callback
  patch. This way any value set is used somewhere.

Waiting for test results from Mel Gorman, who is the original reporter.

v2
This is a much simpler version than the previous one and only consider IO
boost, using the existing mechanism. There is no change in this series
beyond intel_pstate driver.

Once PeterZ finishes his work on frequency invariant, I will revisit
thread migration optimization in HWP mode.

Other changes:
- Gradual boost instead of single step as suggested by PeterZ.
- Cross CPU synchronization concerns identified by Rafael.
- Split the patch for HWP MSR value caching as suggested by PeterZ.

Not changed as suggested:
There is no architecture way to identify platform with Per-core
P-states, so still have to enable feature based on CPU model.

-----------
v1

This series tries to address some concern in performance particularly with IO
workloads (Reported by Mel Gorman), when HWP is using intel_pstate powersave
policy.

Background
HWP performance can be controlled by user space using sysfs interface for
max/min frequency limits and energy performance preference settings. Based on
workload characteristics these can be adjusted from user space. These limits
are not changed dynamically by kernel based on workload.

By default HWP defaults to energy performance preference value of 0x80 on
majority of platforms(Scale is 0-255, 0 is max performance and 255 is min).
This value offers best performance/watt and for majority of server workloads
performance doesn't suffer. Also users always have option to use performance
policy of intel_pstate, to get best performance. But user tend to run with
out of box configuration, which is powersave policy on most of the distros.

In some case it is possible to dynamically adjust performance, for example,
when a CPU is woken up due to IO completion or thread migrate to a new CPU. In
this case HWP algorithm will take some time to build utilization and ramp up
P-states. So this may results in lower performance for some IO workloads and
workloads which tend to migrate. The idea of this patch series is to
temporarily boost performance dynamically in these cases. This is only
applicable only when user is using powersave policy, not in performance policy.

Results on a Skylake server:

Benchmark                       Improvement %
----------------------------------------------------------------------
dbench                          50.36
thread IO bench (tiobench)      10.35
File IO                         9.81
sqlite                          15.76
X264 -104 cores                 9.75

Spec Power                      (Negligible impact 7382 Vs. 7378)
Idle Power                      No change observed
-----------------------------------------------------------------------

HWP brings in best performace/watt at EPP=0x80. Since we are boosting
EPP here to 0, the performance/watt drops upto 10%. So there is a power
penalty of these changes.

Also Mel Gorman provided test results on a prior patchset, which shows
benifits of this series.

Srinivas Pandruvada (4):
  cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
  cpufreq: intel_pstate: HWP boost performance on IO wakeup
  cpufreq: intel_pstate: New sysfs entry to control HWP boost
  cpufreq: intel_pstate: enable boost for SKX

 drivers/cpufreq/intel_pstate.c | 177 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 173 insertions(+), 4 deletions(-)

-- 
2.13.6

^ permalink raw reply

* [RFC/RFT] [PATCH v3 1/4] cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
From: Srinivas Pandruvada @ 2018-05-31 22:51 UTC (permalink / raw)
  To: lenb, rjw, peterz, mgorman
  Cc: linux-pm, linux-kernel, juri.lelli, viresh.kumar,
	Srinivas Pandruvada
In-Reply-To: <20180531225143.34270-1-srinivas.pandruvada@linux.intel.com>

Added two utility functions to HWP boost up gradually and boost down to
the default cached HWP request values.

Boost up:
Boost up updates HWP request minimum value in steps. This minimum value
can reach upto at HWP request maximum values depends on how frequently,
this boost up function is called. At max, boost up will take three steps
to reach the maximum, depending on the current HWP request levels and HWP
capabilities. For example, if the current settings are:
If P0 (Turbo max) = P1 (Guaranteed max) = min
        No boost at all.
If P0 (Turbo max) > P1 (Guaranteed max) = min
        Should result in one level boost only for P0.
If P0 (Turbo max) = P1 (Guaranteed max) > min
        Should result in two level boost:
                (min + p1)/2 and P1.
If P0 (Turbo max) > P1 (Guaranteed max) > min
        Should result in three level boost:
                (min + p1)/2, P1 and P0.
We don't set any level between P0 and P1 as there is no guarantee that
they will be honored.

Boost down:
After the system is idle for hold time of 3ms, the HWP request is reset
to the default value from HWP init or user modified one via sysfs.

Caching of HWP Request and Capabilities
Store the HWP request value last set using MSR_HWP_REQUEST and read
MSR_HWP_CAPABILITIES. This avoid reading of MSRs in the boost utility
functions.

These boost utility functions calculated limits are based on the latest
HWP request value, which can be modified by setpolicy() callback. So if
user space modifies the minimum perf value, that will be accounted for
every time the boost up is called. There will be case when there can be
contention with the user modified minimum perf, in that case user value
will gain precedence. For example just before HWP_REQUEST MSR is updated
from setpolicy() callback, the boost up function is called via scheduler
tick callback. Here the cached MSR value is already the latest and limits
are updated based on the latest user limits, but on return the MSR write
callback called from setpolicy() callback will update the HWP_REQUEST
value. This will be used till next time the boost up function is called.

In addition add a variable to control HWP dynamic boosting. When HWP
dynamic boost is active then set the HWP specific update util hook. The
contents in the utility hooks will be filled in the subsequent patches.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 99 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 95 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 17e566afbb41..80bf61ae4b1f 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -221,6 +221,9 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @hwp_req_cached:	Cached value of the last HWP Request MSR
+ * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
+ * @hwp_boost_min:	Last HWP boosted min performance
  *
  * This structure stores per CPU instance data for all CPUs.
  */
@@ -253,6 +256,9 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	u64 hwp_req_cached;
+	u64 hwp_cap_cached;
+	int hwp_boost_min;
 };
 
 static struct cpudata **all_cpu_data;
@@ -285,6 +291,7 @@ static struct pstate_funcs pstate_funcs __read_mostly;
 
 static int hwp_active __read_mostly;
 static bool per_cpu_limits __read_mostly;
+static bool hwp_boost __read_mostly;
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;
 
@@ -689,6 +696,7 @@ static void intel_pstate_get_hwp_max(unsigned int cpu, int *phy_max,
 	u64 cap;
 
 	rdmsrl_on_cpu(cpu, MSR_HWP_CAPABILITIES, &cap);
+	WRITE_ONCE(all_cpu_data[cpu]->hwp_cap_cached, cap);
 	if (global.no_turbo)
 		*current_max = HWP_GUARANTEED_PERF(cap);
 	else
@@ -763,6 +771,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
 		intel_pstate_set_epb(cpu, epp);
 	}
 skip_epp:
+	WRITE_ONCE(cpu_data->hwp_req_cached, value);
 	wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
 }
 
@@ -1381,6 +1390,81 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
 	intel_pstate_set_min_pstate(cpu);
 }
 
+/*
+ * Long hold time will keep high perf limits for long time,
+ * which negatively impacts perf/watt for some workloads,
+ * like specpower. 3ms is based on experiements on some
+ * workoads.
+ */
+static int hwp_boost_hold_time_ms = 3;
+
+static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
+{
+	u64 hwp_req = READ_ONCE(cpu->hwp_req_cached);
+	int max_limit = (hwp_req & 0xff00) >> 8;
+	int min_limit = (hwp_req & 0xff);
+	int boost_level1;
+
+	/*
+	 * Cases to consider (User changes via sysfs or boot time):
+	 * If, P0 (Turbo max) = P1 (Guaranteed max) = min:
+	 *	No boost, return.
+	 * If, P0 (Turbo max) > P1 (Guaranteed max) = min:
+	 *     Should result in one level boost only for P0.
+	 * If, P0 (Turbo max) = P1 (Guaranteed max) > min:
+	 *     Should result in two level boost:
+	 *         (min + p1)/2 and P1.
+	 * If, P0 (Turbo max) > P1 (Guaranteed max) > min:
+	 *     Should result in three level boost:
+	 *        (min + p1)/2, P1 and P0.
+	 */
+
+	/* If max and min are equal or already at max, nothing to boost */
+	if (max_limit == min_limit || cpu->hwp_boost_min >= max_limit)
+		return;
+
+	if (!cpu->hwp_boost_min)
+		cpu->hwp_boost_min = min_limit;
+
+	/* level at half way mark between min and guranteed */
+	boost_level1 = (HWP_GUARANTEED_PERF(cpu->hwp_cap_cached) + min_limit) >> 1;
+
+	if (cpu->hwp_boost_min < boost_level1)
+		cpu->hwp_boost_min = boost_level1;
+	else if (cpu->hwp_boost_min < HWP_GUARANTEED_PERF(cpu->hwp_cap_cached))
+		cpu->hwp_boost_min = HWP_GUARANTEED_PERF(cpu->hwp_cap_cached);
+	else if (cpu->hwp_boost_min == HWP_GUARANTEED_PERF(cpu->hwp_cap_cached) &&
+		 max_limit != HWP_GUARANTEED_PERF(cpu->hwp_cap_cached))
+		cpu->hwp_boost_min = max_limit;
+	else
+		return;
+
+	hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | cpu->hwp_boost_min;
+	wrmsrl(MSR_HWP_REQUEST, hwp_req);
+	cpu->last_update = cpu->sample.time;
+}
+
+static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
+{
+	if (cpu->hwp_boost_min) {
+		bool expired;
+
+		/* Check if we are idle for hold time to boost down */
+		expired = time_after64(cpu->sample.time, cpu->last_update +
+				       (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
+		if (expired) {
+			wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
+			cpu->hwp_boost_min = 0;
+		}
+	}
+	cpu->last_update = cpu->sample.time;
+}
+
+static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
+						u64 time, unsigned int flags)
+{
+}
+
 static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
@@ -1684,7 +1768,7 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
 	struct cpudata *cpu = all_cpu_data[cpu_num];
 
-	if (hwp_active)
+	if (hwp_active && !hwp_boost)
 		return;
 
 	if (cpu->update_util_set)
@@ -1692,8 +1776,12 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 
 	/* Prevent intel_pstate_update_util() from using stale data. */
 	cpu->sample.time = 0;
-	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
-				     intel_pstate_update_util);
+	if (hwp_active)
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     intel_pstate_update_util_hwp);
+	else
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     intel_pstate_update_util);
 	cpu->update_util_set = true;
 }
 
@@ -1805,8 +1893,11 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
 		intel_pstate_set_update_util_hook(policy->cpu);
 	}
 
-	if (hwp_active)
+	if (hwp_active) {
+		if (!hwp_boost)
+			intel_pstate_clear_update_util_hook(policy->cpu);
 		intel_pstate_hwp_set(policy->cpu);
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
-- 
2.13.6

^ permalink raw reply related

* [RFC/RFT] [PATCH v3 2/4] cpufreq: intel_pstate: HWP boost performance on IO wakeup
From: Srinivas Pandruvada @ 2018-05-31 22:51 UTC (permalink / raw)
  To: lenb, rjw, peterz, mgorman
  Cc: linux-pm, linux-kernel, juri.lelli, viresh.kumar,
	Srinivas Pandruvada
In-Reply-To: <20180531225143.34270-1-srinivas.pandruvada@linux.intel.com>

This change uses SCHED_CPUFREQ_IOWAIT flag to boost HWP performance.
Since SCHED_CPUFREQ_IOWAIT flag is set frequently, we don't start
boosting steps unless we see two consecutive flags in two ticks. This
avoids boosting due to IO because of regular system activities.

To avoid synchronization issues, the actual processing of the flag is
done on the local CPU callback.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 80bf61ae4b1f..2a14756d958e 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -223,6 +223,8 @@ struct global_params {
  *			operation
  * @hwp_req_cached:	Cached value of the last HWP Request MSR
  * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
+ * @last_io_update:	Last time when IO wake flag was set
+ * @sched_flags:	Store scheduler flags for possible cross CPU update
  * @hwp_boost_min:	Last HWP boosted min performance
  *
  * This structure stores per CPU instance data for all CPUs.
@@ -258,6 +260,8 @@ struct cpudata {
 	s16 epp_saved;
 	u64 hwp_req_cached;
 	u64 hwp_cap_cached;
+	u64 last_io_update;
+	unsigned int sched_flags;
 	int hwp_boost_min;
 };
 
@@ -1460,9 +1464,44 @@ static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
 	cpu->last_update = cpu->sample.time;
 }
 
+static inline void intel_pstate_update_util_hwp_local(struct cpudata *cpu,
+						      u64 time)
+{
+	cpu->sample.time = time;
+
+	if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
+		bool do_io = false;
+
+		cpu->sched_flags = 0;
+		/*
+		 * Set iowait_boost flag and update time. Since IO WAIT flag
+		 * is set all the time, we can't just conclude that there is
+		 * some IO bound activity is scheduled on this CPU with just
+		 * one occurrence. If we receive at least two in two
+		 * consecutive ticks, then we treat as boost candidate.
+		 */
+		if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC))
+			do_io = true;
+
+		cpu->last_io_update = time;
+
+		if (do_io)
+			intel_pstate_hwp_boost_up(cpu);
+
+	} else {
+		intel_pstate_hwp_boost_down(cpu);
+	}
+}
+
 static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
 						u64 time, unsigned int flags)
 {
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+
+	cpu->sched_flags |= flags;
+
+	if (smp_processor_id() == cpu->cpu)
+		intel_pstate_update_util_hwp_local(cpu, time);
 }
 
 static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
-- 
2.13.6

^ permalink raw reply related

* [RFC/RFT] [PATCH v3 3/4] cpufreq: intel_pstate: New sysfs entry to control HWP boost
From: Srinivas Pandruvada @ 2018-05-31 22:51 UTC (permalink / raw)
  To: lenb, rjw, peterz, mgorman
  Cc: linux-pm, linux-kernel, juri.lelli, viresh.kumar,
	Srinivas Pandruvada
In-Reply-To: <20180531225143.34270-1-srinivas.pandruvada@linux.intel.com>

A new attribute is added to intel_pstate sysfs to enable/disable
HWP dynamic performance boost.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 2a14756d958e..254e30a6c1e0 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1033,6 +1033,30 @@ static ssize_t store_min_perf_pct(struct kobject *a, struct attribute *b,
 	return count;
 }
 
+static ssize_t show_hwp_dynamic_boost(struct kobject *kobj,
+				struct attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", hwp_boost);
+}
+
+static ssize_t store_hwp_dynamic_boost(struct kobject *a, struct attribute *b,
+				       const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+
+	ret = kstrtouint(buf, 10, &input);
+	if (ret)
+		return ret;
+
+	mutex_lock(&intel_pstate_driver_lock);
+	hwp_boost = !!input;
+	intel_pstate_update_policies();
+	mutex_unlock(&intel_pstate_driver_lock);
+
+	return count;
+}
+
 show_one(max_perf_pct, max_perf_pct);
 show_one(min_perf_pct, min_perf_pct);
 
@@ -1042,6 +1066,7 @@ define_one_global_rw(max_perf_pct);
 define_one_global_rw(min_perf_pct);
 define_one_global_ro(turbo_pct);
 define_one_global_ro(num_pstates);
+define_one_global_rw(hwp_dynamic_boost);
 
 static struct attribute *intel_pstate_attributes[] = {
 	&status.attr,
@@ -1082,6 +1107,11 @@ static void __init intel_pstate_sysfs_expose_params(void)
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
+	if (hwp_active) {
+		rc = sysfs_create_file(intel_pstate_kobject,
+				       &hwp_dynamic_boost.attr);
+		WARN_ON(rc);
+	}
 }
 /************************** sysfs end ************************/
 
-- 
2.13.6

^ permalink raw reply related

* [RFC/RFT] [PATCH v3 4/4] cpufreq: intel_pstate: enable boost for SKX
From: Srinivas Pandruvada @ 2018-05-31 22:51 UTC (permalink / raw)
  To: lenb, rjw, peterz, mgorman
  Cc: linux-pm, linux-kernel, juri.lelli, viresh.kumar,
	Srinivas Pandruvada
In-Reply-To: <20180531225143.34270-1-srinivas.pandruvada@linux.intel.com>

Enable HWP boost on Skylake server platform.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 254e30a6c1e0..75c5a95bc0ac 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1794,6 +1794,11 @@ static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
 	{}
 };
 
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] __initconst = {
+	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+	{}
+};
+
 static int intel_pstate_init_cpu(unsigned int cpunum)
 {
 	struct cpudata *cpu;
@@ -1824,6 +1829,10 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
 			intel_pstate_disable_ee(cpunum);
 
 		intel_pstate_hwp_enable(cpu);
+
+		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
+		if (id)
+			hwp_boost = true;
 	}
 
 	intel_pstate_get_cpu_pstates(cpu);
-- 
2.13.6

^ permalink raw reply related

* [PATCH 1/2] power: supply: sbs-battery: don't assume MANUFACTURER_DATA formats
From: Brian Norris @ 2018-06-01  0:32 UTC (permalink / raw)
  To: Sebastian Reichel
  Cc: linux-kernel, Rob Herring, linux-pm, devicetree, Alexandru Stan,
	Guenter Roeck, Doug Anderson, Brian Norris

This driver was originally submitted for the TI BQ20Z75 battery IC
(commit a7640bfa10c5 ("power_supply: Add driver for TI BQ20Z75 gas gauge
IC")) and later renamed to express generic SBS support. While it's
mostly true that this driver implemented a standard SBS command set, it
takes liberties with the REG_MANUFACTURER_DATA register. This register
is specified in the SBS spec, but it doesn't make any mention of what
its actual contents are.

We've sort of noticed this optionality previously, with commit
17c6d3979e5b ("sbs-battery: make writes to ManufacturerAccess
optional"), where we found that some batteries NAK writes to this
register.

What this really means is that so far, we've just been lucky that most
batteries have either been compatible with the TI chip, or else at least
haven't reported highly-unexpected values.

For instance, one battery I have here seems to report either 0x0000 or
0x0100 to the MANUFACTURER_ACCESS_STATUS command -- while this seems to
match either Wake Up (bits[11:8] = 0000b) or Normal Discharge
(bits[11:8] = 0001b) status for the TI part [1], they don't seem to
actually correspond to real states (for instance, I never see 0101b =
Charge, even when charging).

On other batteries, I'm getting apparently random data in return, which
means that occasionally, we interpret this as "battery not present" or
"battery is not healthy".

All in all, it seems to be a really bad idea to make assumptions about
REG_MANUFACTURER_DATA, unless we already know what battery we're using.
Therefore, this patch reimplements the "present" and "health" checks to
the following on most SBS batteries:

1. HEALTH: report "unknown" -- I couldn't find a standard SBS command
   that gives us much useful here
2. PRESENT: just send a REG_STATUS command; if it succeeds, then the
   battery is present

Also, we stop sending MANUFACTURER_ACCESS_SLEEP to non-TI parts. I have
no proof that this is useful and supported.

If someone explicitly provided a 'ti,bq20z75' compatible property, then
we retain the existing TI command behaviors.

[1] http://www.ti.com/lit/er/sluu265a/sluu265a.pdf

Signed-off-by: Brian Norris <briannorris@chromium.org>
---
 drivers/power/supply/sbs-battery.c | 61 +++++++++++++++++++++++++-----
 1 file changed, 52 insertions(+), 9 deletions(-)

diff --git a/drivers/power/supply/sbs-battery.c b/drivers/power/supply/sbs-battery.c
index 83d7b4115857..e15d0ca4729d 100644
--- a/drivers/power/supply/sbs-battery.c
+++ b/drivers/power/supply/sbs-battery.c
@@ -23,6 +23,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/of.h>
+#include <linux/of_device.h>
 #include <linux/power/sbs-battery.h>
 #include <linux/power_supply.h>
 #include <linux/slab.h>
@@ -84,8 +85,9 @@ static const struct chip_data {
 	int min_value;
 	int max_value;
 } sbs_data[] = {
+	/* Manuf. data isn't directly useful as a POWER_SUPPLY_PROP */
 	[REG_MANUFACTURER_DATA] =
-		SBS_DATA(POWER_SUPPLY_PROP_PRESENT, 0x00, 0, 65535),
+		SBS_DATA(-1, 0x00, 0, 65535),
 	[REG_TEMPERATURE] =
 		SBS_DATA(POWER_SUPPLY_PROP_TEMP, 0x08, 0, 65535),
 	[REG_VOLTAGE] =
@@ -156,6 +158,9 @@ static enum power_supply_property sbs_properties[] = {
 	POWER_SUPPLY_PROP_MODEL_NAME
 };

+/* Supports special manufacturer commands from TI BQ20Z75 IC. */
+#define SBS_FLAGS_TI_BQ20Z75		BIT(0)
+
 struct sbs_info {
 	struct i2c_client		*client;
 	struct power_supply		*power_supply;
@@ -168,6 +173,7 @@ struct sbs_info {
 	u32				poll_retry_count;
 	struct delayed_work		work;
 	struct mutex			mode_lock;
+	u32				flags;
 };

 static char model_name[I2C_SMBUS_BLOCK_MAX + 1];
@@ -315,6 +321,31 @@ static int sbs_status_correct(struct i2c_client *client, int *intval)
 static int sbs_get_battery_presence_and_health(
 	struct i2c_client *client, enum power_supply_property psp,
 	union power_supply_propval *val)
+{
+	int ret;
+
+	switch (psp) {
+	case POWER_SUPPLY_PROP_PRESENT:
+		/* Dummy command; if it succeeds, battery is present. */
+		ret = sbs_read_word_data(client, sbs_data[REG_STATUS].addr);
+		if (ret < 0)
+			val->intval = 0; /* battery disconnected */
+		else
+			val->intval = 1; /* battery present */
+		return 0;
+	case POWER_SUPPLY_PROP_HEALTH:
+		/* SBS spec doesn't have a general health command. */
+		val->intval = POWER_SUPPLY_HEALTH_UNKNOWN;
+		return 0;
+	default:
+		dev_err(&client->dev, "unexpected prop %d\n", psp);
+		return -EINVAL;
+	}
+}
+
+static int sbs_get_ti_battery_presence_and_health(
+	struct i2c_client *client, enum power_supply_property psp,
+	union power_supply_propval *val)
 {
 	s32 ret;

@@ -600,7 +631,12 @@ static int sbs_get_property(struct power_supply *psy,
 	switch (psp) {
 	case POWER_SUPPLY_PROP_PRESENT:
 	case POWER_SUPPLY_PROP_HEALTH:
-		ret = sbs_get_battery_presence_and_health(client, psp, val);
+		if (client->flags & SBS_FLAGS_TI_BQ20Z75)
+			ret = sbs_get_ti_battery_presence_and_health(client,
+								     psp, val);
+		else
+			ret = sbs_get_battery_presence_and_health(client, psp,
+								  val);
 		if (psp == POWER_SUPPLY_PROP_PRESENT)
 			return 0;
 		break;
@@ -806,6 +842,7 @@ static int sbs_probe(struct i2c_client *client,
 	if (!chip)
 		return -ENOMEM;

+	chip->flags = (u32)(uintptr_t)of_device_get_match_data(&client->dev);
 	chip->client = client;
 	chip->enable_detection = false;
 	psy_cfg.of_node = client->dev.of_node;
@@ -915,12 +952,15 @@ static int sbs_suspend(struct device *dev)
 	if (chip->poll_time > 0)
 		cancel_delayed_work_sync(&chip->work);

-	/*
-	 * Write to manufacturer access with sleep command.
-	 * Support is manufacturer dependend, so ignore errors.
-	 */
-	sbs_write_word_data(client, sbs_data[REG_MANUFACTURER_DATA].addr,
-		MANUFACTURER_ACCESS_SLEEP);
+	if (chip->flags & SBS_FLAGS_TI_BQ20Z75) {
+		/*
+		 * Write to manufacturer access with sleep command.
+		 * Support is manufacturer dependent, so ignore errors.
+		 */
+		sbs_write_word_data(client,
+				    sbs_data[REG_MANUFACTURER_DATA].addr,
+				    MANUFACTURER_ACCESS_SLEEP);
+	}

 	return 0;
 }
@@ -941,7 +981,10 @@ MODULE_DEVICE_TABLE(i2c, sbs_id);

 static const struct of_device_id sbs_dt_ids[] = {
 	{ .compatible = "sbs,sbs-battery" },
-	{ .compatible = "ti,bq20z75" },
+	{
+		.compatible = "ti,bq20z75",
+		.data = (void *)SBS_FLAGS_TI_BQ20Z75,
+	},
 	{ }
 };
 MODULE_DEVICE_TABLE(of, sbs_dt_ids);
-- 
2.17.0.921.gf22659ad46-goog

^ permalink raw reply related

* [PATCH 2/2] dt-bindings: power: sbs-battery: re-document "ti,bq20z75"
From: Brian Norris @ 2018-06-01  0:32 UTC (permalink / raw)
  To: Sebastian Reichel
  Cc: linux-kernel, Rob Herring, linux-pm, devicetree, Alexandru Stan,
	Guenter Roeck, Doug Anderson, Brian Norris, Rhyland Klein
In-Reply-To: <20180601003252.42810-1-briannorris@chromium.org>

This compatible property was documented before the driver was renamed to
"SBS" (see commit e57f1b68c406 ("devicetree-bindings: Propagate
bq20z75->sbs rename to dt bindings")). The driver has continued to
support this property as an alternative to "sbs,sbs-battery", and
because we've noticed there are some lingering TI specifics (in the
manufacturer-specific portion of the SBS spec), we'd like to start using
this property again to differentiate.

Cc: Rhyland Klein <rklein@nvidia.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
---
 .../devicetree/bindings/power/supply/sbs_sbs-battery.txt        | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/power/supply/sbs_sbs-battery.txt b/Documentation/devicetree/bindings/power/supply/sbs_sbs-battery.txt
index c40e8926facf..a7a9c3366f82 100644
--- a/Documentation/devicetree/bindings/power/supply/sbs_sbs-battery.txt
+++ b/Documentation/devicetree/bindings/power/supply/sbs_sbs-battery.txt
@@ -2,7 +2,7 @@ SBS sbs-battery
 ~~~~~~~~~~
 
 Required properties :
- - compatible : "sbs,sbs-battery"
+ - compatible : "sbs,sbs-battery" or "ti,bq20z75"
 
 Optional properties :
  - sbs,i2c-retry-count : The number of times to retry i2c transactions on i2c
-- 
2.17.0.921.gf22659ad46-goog

^ permalink raw reply related

* Re: [PATCH 1/2] power: supply: sbs-battery: don't assume MANUFACTURER_DATA formats
From: Guenter Roeck @ 2018-06-01  1:20 UTC (permalink / raw)
  To: Brian Norris, Sebastian Reichel
  Cc: linux-kernel, Rob Herring, linux-pm, devicetree, Alexandru Stan,
	Doug Anderson
In-Reply-To: <20180601003252.42810-1-briannorris@chromium.org>

Hi Brian,

On 05/31/2018 05:32 PM, Brian Norris wrote:
> This driver was originally submitted for the TI BQ20Z75 battery IC
> (commit a7640bfa10c5 ("power_supply: Add driver for TI BQ20Z75 gas gauge
> IC")) and later renamed to express generic SBS support. While it's
> mostly true that this driver implemented a standard SBS command set, it
> takes liberties with the REG_MANUFACTURER_DATA register. This register
> is specified in the SBS spec, but it doesn't make any mention of what
> its actual contents are.
> 
> We've sort of noticed this optionality previously, with commit
> 17c6d3979e5b ("sbs-battery: make writes to ManufacturerAccess
> optional"), where we found that some batteries NAK writes to this
> register.
> 
> What this really means is that so far, we've just been lucky that most
> batteries have either been compatible with the TI chip, or else at least
> haven't reported highly-unexpected values.
> 
> For instance, one battery I have here seems to report either 0x0000 or
> 0x0100 to the MANUFACTURER_ACCESS_STATUS command -- while this seems to
> match either Wake Up (bits[11:8] = 0000b) or Normal Discharge
> (bits[11:8] = 0001b) status for the TI part [1], they don't seem to
> actually correspond to real states (for instance, I never see 0101b =
> Charge, even when charging).
> 
> On other batteries, I'm getting apparently random data in return, which
> means that occasionally, we interpret this as "battery not present" or
> "battery is not healthy".
> 
> All in all, it seems to be a really bad idea to make assumptions about
> REG_MANUFACTURER_DATA, unless we already know what battery we're using.
> Therefore, this patch reimplements the "present" and "health" checks to
> the following on most SBS batteries:
> 
> 1. HEALTH: report "unknown" -- I couldn't find a standard SBS command
>     that gives us much useful here
> 2. PRESENT: just send a REG_STATUS command; if it succeeds, then the
>     battery is present
> 
> Also, we stop sending MANUFACTURER_ACCESS_SLEEP to non-TI parts. I have
> no proof that this is useful and supported.
> 
> If someone explicitly provided a 'ti,bq20z75' compatible property, then
> we retain the existing TI command behaviors.
> 
> [1] http://www.ti.com/lit/er/sluu265a/sluu265a.pdf
> 

Excellent idea. Couple of comments below.

Thanks,
Guenter

> Signed-off-by: Brian Norris <briannorris@chromium.org>
> ---
>   drivers/power/supply/sbs-battery.c | 61 +++++++++++++++++++++++++-----
>   1 file changed, 52 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/power/supply/sbs-battery.c b/drivers/power/supply/sbs-battery.c
> index 83d7b4115857..e15d0ca4729d 100644
> --- a/drivers/power/supply/sbs-battery.c
> +++ b/drivers/power/supply/sbs-battery.c
> @@ -23,6 +23,7 @@
>   #include <linux/kernel.h>
>   #include <linux/module.h>
>   #include <linux/of.h>
> +#include <linux/of_device.h>
>   #include <linux/power/sbs-battery.h>
>   #include <linux/power_supply.h>
>   #include <linux/slab.h>
> @@ -84,8 +85,9 @@ static const struct chip_data {
>   	int min_value;
>   	int max_value;
>   } sbs_data[] = {
> +	/* Manuf. data isn't directly useful as a POWER_SUPPLY_PROP */
>   	[REG_MANUFACTURER_DATA] =
> -		SBS_DATA(POWER_SUPPLY_PROP_PRESENT, 0x00, 0, 65535),
> +		SBS_DATA(-1, 0x00, 0, 65535),

I don't think this change is necessary. For example, POWER_SUPPLY_PROP_SERIAL_NUMBER
is also not directly accessed, yet the REG_SERIAL_NUMBER array entry includes
POWER_SUPPLY_PROP_SERIAL_NUMBER.

>   	[REG_TEMPERATURE] =
>   		SBS_DATA(POWER_SUPPLY_PROP_TEMP, 0x08, 0, 65535),
>   	[REG_VOLTAGE] =
> @@ -156,6 +158,9 @@ static enum power_supply_property sbs_properties[] = {
>   	POWER_SUPPLY_PROP_MODEL_NAME
>   };
>   
> +/* Supports special manufacturer commands from TI BQ20Z75 IC. */
> +#define SBS_FLAGS_TI_BQ20Z75		BIT(0)
> +
>   struct sbs_info {
>   	struct i2c_client		*client;
>   	struct power_supply		*power_supply;
> @@ -168,6 +173,7 @@ struct sbs_info {
>   	u32				poll_retry_count;
>   	struct delayed_work		work;
>   	struct mutex			mode_lock;
> +	u32				flags;
>   };
>   
>   static char model_name[I2C_SMBUS_BLOCK_MAX + 1];
> @@ -315,6 +321,31 @@ static int sbs_status_correct(struct i2c_client *client, int *intval)
>   static int sbs_get_battery_presence_and_health(
>   	struct i2c_client *client, enum power_supply_property psp,
>   	union power_supply_propval *val)
> +{
> +	int ret;
> +
> +	switch (psp) {
> +	case POWER_SUPPLY_PROP_PRESENT:
> +		/* Dummy command; if it succeeds, battery is present. */
> +		ret = sbs_read_word_data(client, sbs_data[REG_STATUS].addr);
> +		if (ret < 0)
> +			val->intval = 0; /* battery disconnected */

I don't know the relevance, but the original code returns the error
in this situation.

> +		else
> +			val->intval = 1; /* battery present */
> +		return 0;
> +	case POWER_SUPPLY_PROP_HEALTH:
> +		/* SBS spec doesn't have a general health command. */
> +		val->intval = POWER_SUPPLY_HEALTH_UNKNOWN;
> +		return 0;
> +	default:
> +		dev_err(&client->dev, "unexpected prop %d\n", psp);
> +		return -EINVAL;

Personally I would prefer if/else here, for the simple reason that the default case
will never be executed and just wastes a bit of code space.

> +	}
> +}
> +
> +static int sbs_get_ti_battery_presence_and_health(
> +	struct i2c_client *client, enum power_supply_property psp,
> +	union power_supply_propval *val)
>   {
>   	s32 ret;
>   
> @@ -600,7 +631,12 @@ static int sbs_get_property(struct power_supply *psy,
>   	switch (psp) {
>   	case POWER_SUPPLY_PROP_PRESENT:
>   	case POWER_SUPPLY_PROP_HEALTH:
> -		ret = sbs_get_battery_presence_and_health(client, psp, val);
> +		if (client->flags & SBS_FLAGS_TI_BQ20Z75)
> +			ret = sbs_get_ti_battery_presence_and_health(client,
> +								     psp, val);
> +		else
> +			ret = sbs_get_battery_presence_and_health(client, psp,
> +								  val);
>   		if (psp == POWER_SUPPLY_PROP_PRESENT)
>   			return 0;
>   		break;
> @@ -806,6 +842,7 @@ static int sbs_probe(struct i2c_client *client,
>   	if (!chip)
>   		return -ENOMEM;
>   
> +	chip->flags = (u32)(uintptr_t)of_device_get_match_data(&client->dev);
>   	chip->client = client;
>   	chip->enable_detection = false;
>   	psy_cfg.of_node = client->dev.of_node;
> @@ -915,12 +952,15 @@ static int sbs_suspend(struct device *dev)
>   	if (chip->poll_time > 0)
>   		cancel_delayed_work_sync(&chip->work);
>   
> -	/*
> -	 * Write to manufacturer access with sleep command.
> -	 * Support is manufacturer dependend, so ignore errors.
> -	 */
> -	sbs_write_word_data(client, sbs_data[REG_MANUFACTURER_DATA].addr,
> -		MANUFACTURER_ACCESS_SLEEP);
> +	if (chip->flags & SBS_FLAGS_TI_BQ20Z75) {
> +		/*
> +		 * Write to manufacturer access with sleep command.
> +		 * Support is manufacturer dependent, so ignore errors.
> +		 */
> +		sbs_write_word_data(client,
> +				    sbs_data[REG_MANUFACTURER_DATA].addr,
> +				    MANUFACTURER_ACCESS_SLEEP);
> +	}
>   
>   	return 0;
>   }
> @@ -941,7 +981,10 @@ MODULE_DEVICE_TABLE(i2c, sbs_id);
>   
>   static const struct of_device_id sbs_dt_ids[] = {
>   	{ .compatible = "sbs,sbs-battery" },
> -	{ .compatible = "ti,bq20z75" },
> +	{
> +		.compatible = "ti,bq20z75",
> +		.data = (void *)SBS_FLAGS_TI_BQ20Z75,
> +	},
>   	{ }
>   };
>   MODULE_DEVICE_TABLE(of, sbs_dt_ids);
> 

^ permalink raw reply

* Re: [PATCH] cpuidle:powernv: Make the snooze timeout dynamic.
From: Gautham R Shenoy @ 2018-06-01  4:54 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Stewart Smith, Gautham R. Shenoy, Michael Neuling, linux-pm,
	Daniel Lezcano, Rafael J. Wysocki, linux-kernel@vger.kernel.org,
	Nicholas Piggin, Shilpasri G Bhat,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Akshay Adiga
In-Reply-To: <CAKTCnz=WhC_vamKPZruEqc04WKpD5ZqZm0zTcCtagRHib6OZPw@mail.gmail.com>

Hi Balbir,

Thanks for reviewing the patch!

On Fri, Jun 01, 2018 at 12:51:05AM +1000, Balbir Singh wrote:
> On Thu, May 31, 2018 at 10:15 PM, Gautham R. Shenoy

[..snip..]
> >
> > +static u64 get_snooze_timeout(struct cpuidle_device *dev,
> > +                             struct cpuidle_driver *drv,
> > +                             int index)
> > +{
> > +       int i;
> > +
> > +       if (unlikely(!snooze_timeout_en))
> > +               return default_snooze_timeout;
> > +
> > +       for (i = index + 1; i < drv->state_count; i++) {
> > +               struct cpuidle_state *s = &drv->states[i];
> > +               struct cpuidle_state_usage *su = &dev->states_usage[i];
> > +
> > +               if (s->disabled || su->disable)
> > +                       continue;
> > +
> > +               return s->target_residency * tb_ticks_per_usec;
> 
> Can we ensure this is not prone to overflow?

s->target_residency is an "unsigned int" so can take a maximum value
of UINT_MAX. tb_ticks_per_usec is an "unsigned long" with a value in
the range of 100-1000. The return value is a u64. The product of
s->target_residency and tb_ticks_per_usec should be contained in u64.

Is there a potential case of overflow that I am missing ?

> 
> Otherwise looks good
> 
> Reviewed-by: Balbir Singh <bsingharora@gmail.com>

--
Thanks and Regards
gautham.

^ permalink raw reply

* [PATCH 0/4] lib/vsprintf: Remove atomic-unsafe support for printk format %pCr
From: Geert Uytterhoeven @ 2018-06-01  9:28 UTC (permalink / raw)
  To: Jia-Ju Bai, Jonathan Corbet, Michael Turquette, Stephen Boyd,
	Zhang Rui, Eduardo Valentin, Eric Anholt, Stefan Wahren,
	Greg Kroah-Hartman
  Cc: Sergey Senozhatsky, Petr Mladek, Linus Torvalds, Steven Rostedt,
	linux-doc, linux-clk, linux-pm, linux-serial, linux-arm-kernel,
	linux-renesas-soc, linux-kernel, Geert Uytterhoeven

	Hi all,

"%pCr" formats the current rate of a clock, and calls clk_get_rate().
The latter obtains a mutex, hence it must not be called from atomic
context.  As vsprintf() (and e.g. printk()) must be callable from any
context, it's better to remove support for this (rarely-used) format.

This patch series:
  - Changes all existing users of "%pCr" to print the result of
    clk_get_rate() directly, which is safe as they all do this in task
    context only,
  - Removes support for the "%pCr" printk format.

Note that any remaining out-of-tree users will start seeing the clock's
name printed instead of its rate.

Thanks for your comments!

Geert Uytterhoeven (4):
  clk: renesas: cpg-mssr: Stop using printk format %pCr
  thermal: bcm2835: Stop using printk format %pCr
  serial: sh-sci: Stop using printk format %pCr
  lib/vsprintf: Remove atomic-unsafe support for %pCr

 Documentation/core-api/printk-formats.rst  | 3 +--
 drivers/clk/renesas/renesas-cpg-mssr.c     | 9 +++++----
 drivers/thermal/broadcom/bcm2835_thermal.c | 4 ++--
 drivers/tty/serial/sh-sci.c                | 4 ++--
 lib/vsprintf.c                             | 3 ---
 5 files changed, 10 insertions(+), 13 deletions(-)

-- 
2.7.4

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply

* [PATCH 1/4] clk: renesas: cpg-mssr: Stop using printk format %pCr
From: Geert Uytterhoeven @ 2018-06-01  9:28 UTC (permalink / raw)
  To: Jia-Ju Bai, Jonathan Corbet, Michael Turquette, Stephen Boyd,
	Zhang Rui, Eduardo Valentin, Eric Anholt, Stefan Wahren,
	Greg Kroah-Hartman
  Cc: Petr Mladek, Sergey Senozhatsky, Geert Uytterhoeven, linux-doc,
	linux-pm, linux-kernel, Steven Rostedt, linux-renesas-soc,
	linux-serial, Linus Torvalds, linux-clk, linux-arm-kernel
In-Reply-To: <1527845302-12159-1-git-send-email-geert+renesas@glider.be>

Printk format "%pCr" will be removed soon, as clk_get_rate() must not be
called in atomic context.

Replace it by open-coding the operation.  This is safe here, as the code
runs in task context.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
---
 drivers/clk/renesas/renesas-cpg-mssr.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/clk/renesas/renesas-cpg-mssr.c b/drivers/clk/renesas/renesas-cpg-mssr.c
index 49e510691eeeab07..f4b013e9352d9efc 100644
--- a/drivers/clk/renesas/renesas-cpg-mssr.c
+++ b/drivers/clk/renesas/renesas-cpg-mssr.c
@@ -258,8 +258,9 @@ struct clk *cpg_mssr_clk_src_twocell_get(struct of_phandle_args *clkspec,
 		dev_err(dev, "Cannot get %s clock %u: %ld", type, clkidx,
 		       PTR_ERR(clk));
 	else
-		dev_dbg(dev, "clock (%u, %u) is %pC at %pCr Hz\n",
-			clkspec->args[0], clkspec->args[1], clk, clk);
+		dev_dbg(dev, "clock (%u, %u) is %pC at %lu Hz\n",
+			clkspec->args[0], clkspec->args[1], clk,
+			clk_get_rate(clk));
 	return clk;
 }
 
@@ -326,7 +327,7 @@ static void __init cpg_mssr_register_core_clk(const struct cpg_core_clk *core,
 	if (IS_ERR_OR_NULL(clk))
 		goto fail;
 
-	dev_dbg(dev, "Core clock %pC at %pCr Hz\n", clk, clk);
+	dev_dbg(dev, "Core clock %pC at %lu Hz\n", clk, clk_get_rate(clk));
 	priv->clks[id] = clk;
 	return;
 
@@ -392,7 +393,7 @@ static void __init cpg_mssr_register_mod_clk(const struct mssr_mod_clk *mod,
 	if (IS_ERR(clk))
 		goto fail;
 
-	dev_dbg(dev, "Module clock %pC at %pCr Hz\n", clk, clk);
+	dev_dbg(dev, "Module clock %pC at %lu Hz\n", clk, clk_get_rate(clk));
 	priv->clks[id] = clk;
 	priv->smstpcr_saved[clock->index / 32].mask |= BIT(clock->index % 32);
 	return;
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox