Linux Power Management development

* Re: [PATCH] PM / Qos: Ensure device not in RPM_SUSPENDED when pm qos flags request functions are invoked in the pm core.
From: Lan Tianyu @ 2012-11-11 12:08 UTC
  To: Rafael J. Wysocki; +Cc: stern, linux-pm, linux-acpi
In-Reply-To: <5096852C.3000707@intel.com>

On 2012/11/4 23:09, Lan Tianyu wrote:
> On 2012/11/3 4:11, Rafael J. Wysocki wrote:
>>>>>    }
>>>>> > >>   EXPORT_SYMBOL_GPL(dev_pm_qos_expose_flags);
>>>>> > >>@@ -645,7 +649,9 @@ void dev_pm_qos_hide_flags(struct device *dev)
>>>>> > >>   {
>>>>> > >>       if (dev->power.qos && dev->power.qos->flags_req) {
>>>>> > >>           pm_qos_sysfs_remove_flags(dev);
>>>>> > >>+        pm_runtime_get_sync(dev);
>>>>> > >>           __dev_pm_qos_drop_user_request(dev, DEV_PM_QOS_FLAGS);
>>>>> > >>+        pm_runtime_put(dev);
>>>> > >
>>>> > >I'm not sure if these two are necessary.  If we remove a request,
>>>> > >then what happens worst case is that some flags will be cleared
>>>> > >effectively which means fewer restrictions on the next sleep state.
>>>> > >Then, it shouldn't hurt that the current sleep state is more
>>>> > >restricted.
>>> >
>>> >But this means the device can be put into a lower power state (power off).
>>> >So why not do that? That can save more power, right?
>> Correct.  On the other hand, though, if the device already is in the
>> deepest low-power state available, we will resume it unnecessarily this
>> way.  Which may not be a big deal, however, and since we do that in other
>> cases, we may as well do it here.
> Yeah. This seems not very reasonable. But we can optimize this
> later. As I suggested previously, we could add a notifier for flags and
> let the device driver or bus driver determine when the device should be
> resumed. You said in another email that you would remove all notifiers in
> the pm qos so that some functions can be invoked in interrupt context. I
> have a thought: check the context before calling the notifier chain. In
> interrupt context, do not call the chain directly; trigger a work queue
> (or some other mechanism) to do that instead. Otherwise, call the chain
> directly. Does this make sense? :)
>
Hi Rafael:
	Do you have any opinions on this?

>>
>> Thanks,
>> Rafael
>
>


-- 
Best Regards
Tianyu Lan
linux kernel enabling team
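
The proposal in the message above (call the notifier chain directly in process context, defer it when in interrupt context) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct qos_notifier_model`, `qos_notify()`, `qos_flush_pending()`, `record_listener()` and the `in_irq` argument are hypothetical names, not kernel APIs. In the kernel, the context test and the deferral would presumably be built on something like in_interrupt() and schedule_work().

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

typedef void (*qos_notifier_fn)(int new_flags);

/* Hypothetical model: one listener plus a small queue of deferred events. */
struct qos_notifier_model {
	qos_notifier_fn fn;          /* single listener, for brevity */
	int pending[MAX_PENDING];    /* values queued from "interrupt" context */
	size_t npending;
};

/* Example listener used to observe deliveries. */
static int listener_calls;
static int listener_last;

static void record_listener(int new_flags)
{
	listener_calls++;
	listener_last = new_flags;
}

/*
 * Call the listener directly when the caller may sleep; otherwise queue
 * the value so a worker can deliver it later. Returns false if the queue
 * is full and the event had to be dropped.
 */
static bool qos_notify(struct qos_notifier_model *m, int new_flags, bool in_irq)
{
	if (!in_irq) {
		m->fn(new_flags);
		return true;
	}
	if (m->npending == MAX_PENDING)
		return false;
	m->pending[m->npending++] = new_flags;
	return true;
}

/* Worker side: deliver everything queued, in order, from process context. */
static size_t qos_flush_pending(struct qos_notifier_model *m)
{
	size_t delivered = m->npending;

	for (size_t i = 0; i < m->npending; i++)
		m->fn(m->pending[i]);
	m->npending = 0;
	return delivered;
}
```

One consequence visible even in this model: listeners see deferred events later than the flag change itself, so any caller relying on the notifier having run before qos_notify() returns would need a different mechanism.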

* Re: [PATCH V2] PM / Qos: Ensure device not in RPM_SUSPENDED when pm qos flags request functions are invoked in the pm core.
From: Rafael J. Wysocki @ 2012-11-10 22:08 UTC
  To: Lan Tianyu; +Cc: stern, linux-pm, linux-acpi
In-Reply-To: <1352344448-9971-1-git-send-email-tianyu.lan@intel.com>

On Thursday, November 08, 2012 11:14:08 AM Lan Tianyu wrote:
> dev_pm_qos_add_request(), dev_pm_qos_update_request() and
> dev_pm_qos_remove_request() should not be invoked for pm qos flags
> while the device is in RPM_SUSPENDED. Add pm_runtime_get_sync() and
> pm_runtime_put() around these functions in the pm core to ensure the
> device is not in RPM_SUSPENDED.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>

Applied to linux-pm.git/linux-next.

Thanks,
Rafael


> ---
> Change since v1:
> 	Remove unnecessary pm_runtime_get_sync() and pm_runtime_put()
> around dev_pm_qos_update_flags().
> ---
>  drivers/base/power/qos.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/power/qos.c b/drivers/base/power/qos.c
> index 081db2d..fdc3894 100644
> --- a/drivers/base/power/qos.c
> +++ b/drivers/base/power/qos.c
> @@ -633,15 +633,18 @@ int dev_pm_qos_expose_flags(struct device *dev, s32 val)
>  	if (!req)
>  		return -ENOMEM;
>  
> +	pm_runtime_get_sync(dev);
>  	ret = dev_pm_qos_add_request(dev, req, DEV_PM_QOS_FLAGS, val);
>  	if (ret < 0)
> -		return ret;
> +		goto fail;
>  
>  	dev->power.qos->flags_req = req;
>  	ret = pm_qos_sysfs_add_flags(dev);
>  	if (ret)
>  		__dev_pm_qos_drop_user_request(dev, DEV_PM_QOS_FLAGS);
>  
> +fail:
> +	pm_runtime_put(dev);
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(dev_pm_qos_expose_flags);
> @@ -654,7 +657,9 @@ void dev_pm_qos_hide_flags(struct device *dev)
>  {
>  	if (dev->power.qos && dev->power.qos->flags_req) {
>  		pm_qos_sysfs_remove_flags(dev);
> +		pm_runtime_get_sync(dev);
>  		__dev_pm_qos_drop_user_request(dev, DEV_PM_QOS_FLAGS);
> +		pm_runtime_put(dev);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(dev_pm_qos_hide_flags);
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
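
The point of the `goto fail` change in the patch above is that every exit path after pm_runtime_get_sync() must reach the matching pm_runtime_put(). A minimal userspace sketch of that shape follows. It is a model only: `struct pm_model_dev`, `model_get_sync()`, `model_put()` and `model_expose_flags()` are hypothetical stand-ins, and the `add_fails` / `sysfs_fails` parameters exist purely to force the two error paths.

```c
#include <errno.h>

/* Hypothetical stand-in for a device with a runtime PM usage counter. */
struct pm_model_dev {
	int usage_count;   /* models the runtime PM usage counter */
	int has_request;   /* 1 once the flags request is registered */
};

static void model_get_sync(struct pm_model_dev *dev)
{
	dev->usage_count++;
}

static void model_put(struct pm_model_dev *dev)
{
	dev->usage_count--;
}

/*
 * Same control flow as the patched dev_pm_qos_expose_flags(): take a
 * reference first, and leave through the single "fail" label so the
 * reference is dropped on success and on every error path.
 */
static int model_expose_flags(struct pm_model_dev *dev, int add_fails,
			      int sysfs_fails)
{
	int ret;

	model_get_sync(dev);
	ret = add_fails ? -EINVAL : 0;      /* models adding the request */
	if (ret < 0)
		goto fail;

	dev->has_request = 1;
	ret = sysfs_fails ? -ENOMEM : 0;    /* models adding the sysfs file */
	if (ret)
		dev->has_request = 0;       /* models dropping the request */

fail:
	model_put(dev);
	return ret;
}
```

With an early `return ret;` in place of `goto fail;`, the first error path would leave `usage_count` at 1, which in the real code would keep the device from being runtime suspended again.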

* Re: [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Glauber Costa @ 2012-11-10  1:35 UTC
  To: Tejun Heo
  Cc: Daniel Wagner, lizefan-hv44wF8Li93QT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20121109172211.GB2711-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On 11/09/2012 06:22 PM, Tejun Heo wrote:
> Hey, Daniel.
> 
> On Fri, Nov 09, 2012 at 12:09:38PM +0100, Daniel Wagner wrote:
>> On 08.11.2012 20:07, Tejun Heo wrote:> Subject: cgroup: add
>> cgroup_subsys->post_create()
>>>
>>> Currently, there's no way for a controller to find out whether a new
>>> cgroup finished all ->create() allocations successfully and is
>>> considered "live" by cgroup.
>>
>> I'd like to add hierarchy support to net_prio and the first thing to
>> do is to get rid of get_prioidx(). It looks like it would be nice to
> 
> Ooh, I'm already working on it.  I *think* I should be able to post
> the patches later today or early next week.
> 
>> be able to use use_id and post_create() for this but as I read the
>> code this might not work because the netdev might access resources
>> allocated between create() and post_create(). So my question is
>> would it make sense to move
>>
>> cgroup_create():
>>
>> 		if (ss->use_id) {
>> 			err = alloc_css_id(ss, parent, cgrp);
>> 			if (err)
>> 				goto err_destroy;
>> 		}
>>
>> part before create() or add some protection between create() and
>> post_create() callback in net_prio. I have a patch but I see
>> I could drop it completely if post_create() is there.
> 
> Glauber had a similar question about css_id and I need to think
> more about it but currently I think I want to phase out css IDs.  It's
> an id of the wrong thing (CSSes don't need IDs, cgroups do) and
> unnecessarily duplicates its own hierarchy when the hierarchy of
> cgroups already exists.  Once memcontrol moves away from walking using
> css_ids, I *think* I'll try to kill it.

May I suggest doing something similar to what the scheduler does? I
had some code in the past that reused that code - but basically
duplicated it. If you want, I can try getting a version of that into
kernel/cgroup.c that would serve as a general walker.

I like that walker a lot, because it walks in a sane order. memcg
basically walks in a random, weird order, which makes hierarchical
computation of anything quite hard.

> 
> I'll add cgroup ID (no hierarchy funnies, just a single ida allocated
> number) so that it can be used for cgroup indexing.  Glauber, that
> should solve your problem too, right?
> 

Actually I went with a totally orthogonal solution. I am now using per
kmem-limited ids. Because they are not tied to the cgroup creation
workflow, I can allocate whenever it is more convenient.

I ended up liking this solution because it will do better in scenarios
where most of the memcgs are not kmem limited. So it had an edge here,
and also got rid of the create/post_create problem by breaking the
dependency.

But of course, if cgroups would gain some kind of sane indexing, it
could shift the balance towards reusing it.
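
To illustrate why walk order matters for hierarchical computation, here is a minimal sketch of a parent-before-child (pre-order) walk. It is not the scheduler's walker and not memcg code; `struct group_node` and `propagate_limits()` are hypothetical. The assumption is only that each group's effective value depends on its parent's, so the parent must be visited first.

```c
#include <stddef.h>

/* Hypothetical group node using first-child / next-sibling links. */
struct group_node {
	long limit;                      /* limit configured on this group */
	long effective;                  /* min(limit, parent's effective) */
	struct group_node *first_child;
	struct group_node *next_sibling;
};

/*
 * Pre-order walk: a node's effective limit is computed before any of
 * its children are visited, so each child can rely on its parent's
 * result in a single pass.
 */
static void propagate_limits(struct group_node *node, long parent_effective)
{
	for (; node; node = node->next_sibling) {
		node->effective = node->limit < parent_effective ?
				  node->limit : parent_effective;
		propagate_limits(node->first_child, node->effective);
	}
}
```

With an arbitrary visiting order, a child could be processed before its parent's value is known, and the computation would need extra passes or an upward walk per node.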

* [GIT PULL] PCI updates for v3.7
From: Bjorn Helgaas @ 2012-11-09 17:47 UTC
  To: Linus Torvalds
  Cc: linux-pci, linux-kernel, Huang Ying, Taku Izumi, Linux PM list

Hi Linus,

Here are some fixes for v3.7.  Three are related to D3cold support,
and the last portdrv one is a fix for the PCIe capability accessor
rework we merged for v3.7-rc1.

Bjorn


The following changes since commit 8f0d8163b50e01f398b14bcd4dc039ac5ab18d64:

  Linux 3.7-rc3 (2012-10-28 12:24:48 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git
tags/3.7-pci-fixes

for you to fetch changes up to ff8e59bc4ec3f31789a47dce9b6fe44bd7bc5fcc:

  PCI/portdrv: Don't create hotplug slots unless port supports hotplug
(2012-11-05 16:59:59 -0700)

----------------------------------------------------------------
PCI updates for v3.7:

  Power management
      PCI/PM: Fix proc config reg access for D3cold and bridge suspending
      PCI/PM: Resume device before shutdown
      PCI/PM: Fix deadlock when unbinding device if parent in D3cold
  Hotplug
      PCI/portdrv: Don't create hotplug slots unless port supports hotplug

----------------------------------------------------------------
Bjorn Helgaas (1):
      Merge branch 'pci/huang-d3cold-fixes' into for-linus

Huang Ying (3):
      PCI/PM: Fix deadlock when unbinding device if parent in D3cold
      PCI/PM: Resume device before shutdown
      PCI/PM: Fix proc config reg access for D3cold and bridge suspending

Taku Izumi (1):
      PCI/portdrv: Don't create hotplug slots unless port supports hotplug

 drivers/pci/bus.c                  |  3 ---
 drivers/pci/pci-driver.c           | 12 ++----------
 drivers/pci/pci-sysfs.c            | 34 ----------------------------------
 drivers/pci/pci.c                  | 32 ++++++++++++++++++++++++++++++++
 drivers/pci/pci.h                  |  2 ++
 drivers/pci/pcie/aer/aerdrv_core.c | 20 ++++++++++++++++----
 drivers/pci/pcie/portdrv_core.c    |  3 ++-
 drivers/pci/proc.c                 |  8 ++++++++
 8 files changed, 62 insertions(+), 52 deletions(-)

* Re: [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Tejun Heo @ 2012-11-09 17:22 UTC
  To: Daniel Wagner
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	Glauber Costa
In-Reply-To: <509CE472.9040504-kQCPcA+X3s7YtjvyW6yDsg@public.gmane.org>

Hey, Daniel.

On Fri, Nov 09, 2012 at 12:09:38PM +0100, Daniel Wagner wrote:
> On 08.11.2012 20:07, Tejun Heo wrote:> Subject: cgroup: add
> cgroup_subsys->post_create()
> >
> > Currently, there's no way for a controller to find out whether a new
> > cgroup finished all ->create() allocations successfully and is
> > considered "live" by cgroup.
> 
> I'd like to add hierarchy support to net_prio and the first thing to
> do is to get rid of get_prioidx(). It looks like it would be nice to

Ooh, I'm already working on it.  I *think* I should be able to post
the patches later today or early next week.

> be able to use use_id and post_create() for this but as I read the
> code this might not work because the netdev might access resources
> allocated between create() and post_create(). So my question is
> would it make sense to move
> 
> cgroup_create():
> 
> 		if (ss->use_id) {
> 			err = alloc_css_id(ss, parent, cgrp);
> 			if (err)
> 				goto err_destroy;
> 		}
> 
> part before create() or add some protection between create() and
> post_create() callback in net_prio. I have a patch but I see
> I could drop it completely if post_create() is there.

Glauber had a similar question about css_id and I need to think
more about it but currently I think I want to phase out css IDs.  It's
an id of the wrong thing (CSSes don't need IDs, cgroups do) and
unnecessarily duplicates its own hierarchy when the hierarchy of
cgroups already exists.  Once memcontrol moves away from walking using
css_ids, I *think* I'll try to kill it.

I'll add cgroup ID (no hierarchy funnies, just a single ida allocated
number) so that it can be used for cgroup indexing.  Glauber, that
should solve your problem too, right?

Thanks.

-- 
tejun
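
The "single ida allocated number" mentioned above is a flat ID with no hierarchy attached. A minimal userspace sketch of that kind of allocator follows, assuming a fixed pool of 64 IDs and a smallest-free-ID policy. `struct id_pool`, `id_pool_alloc()` and `id_pool_free()` are hypothetical names for illustration; this is not the kernel's IDA implementation.

```c
#include <stdint.h>

#define MODEL_MAX_IDS 64

/* Hypothetical flat ID pool: bit N set means ID N is in use. */
struct id_pool {
	uint64_t used;
};

/* Return the smallest free ID, or -1 if the pool is exhausted. */
static int id_pool_alloc(struct id_pool *p)
{
	for (int id = 0; id < MODEL_MAX_IDS; id++) {
		uint64_t bit = (uint64_t)1 << id;

		if (!(p->used & bit)) {
			p->used |= bit;
			return id;
		}
	}
	return -1;
}

/* Release an ID so a later allocation can reuse it. */
static void id_pool_free(struct id_pool *p, int id)
{
	if (id >= 0 && id < MODEL_MAX_IDS)
		p->used &= ~((uint64_t)1 << id);
}
```

Because the IDs stay small and dense, they can be used directly as array indexes, which is the "cgroup indexing" use case described in the message.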

* Re: [PATCHSET cgroup/for-3.8] cgroup_freezer: implement proper hierarchy support
From: Tejun Heo @ 2012-11-09 17:15 UTC
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	rjw-KKrjLPT3xs0
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1351931915-1701-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On Sat, Nov 03, 2012 at 01:38:26AM -0700, Tejun Heo wrote:
> Hello,
> 
> This patchset implements proper hierarchy support for cgroup_freezer as
> discussed in "[RFC] cgroup TODOs"[1].

Applied to cgroup/for-3.8.  Rafael, I applied the cgroup_freezer
changes there too as there already are and will be more dependencies
between cgroup_freezer and cgroup changes, and the cgroup_freezer
changes don't really affect the rest of the freezer.  If you'd like
them to be routed differently, please let me know.

Thanks.

-- 
tejun

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-11-09 16:52 UTC
  To: Dave Hansen
  Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck,
	maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu,
	lenb, rjw, gargankita, amit.kachhap, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <509D32C2.2090104@linux.vnet.ibm.com>

On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
>> On 11/09/2012 09:43 PM, Dave Hansen wrote:
>>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>>>> my other email to Dave.
>>>>
>>>> https://lkml.org/lkml/2012/11/7/428
>>>>
>>>> I don't think I can dismiss it as an experimental error, because I am seeing
>>>> those results consistently.. I'm trying to find out what's behind that.
>>>
>>> The only numbers in that link are in the date. :)  Let's see the
>>> numbers, please.
>>>
>>
>> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
>> want it to look ridiculous if it later turned out to be really an error in the
>> experiment ;) But since I have seen it happening consistently I think I can
>> post the numbers here with some non-zero confidence.
>>
>>> If you really have performance improvement to the memory allocator (or
>>> something else) here, then surely it can be pared out of your patches
>>> and merged quickly by itself.  Those kinds of optimizations are hard to
>>> come by!
>>>
>>
>> :-)
>>
>> Anyway, here it goes:
>>
>> Test setup:
>> ----------
>> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
>> patchset might not handle NUMA properly). Mem region size = 512 MB.
>>
> 
> For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
> was much smaller, but nevertheless, this patchset performed better. I wouldn't
> vouch that my patchset handles NUMA correctly, but here are the numbers from
> that run anyway (at least to show that I really found the results to be
> repeatable):
> 
> Kernbench log for Vanilla 3.7-rc3
> =================================
> Kernel: 3.7.0-rc3-vanilla-numa-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 589.058 (0.596171)
> User Time 7461.26 (1.69702)
> System Time 1072.03 (1.54704)
> Percent CPU 1448.2 (1.30384)
> Context Switches 2.14322e+06 (4042.97)
> Sleeps 1847230 (2614.96)
> 
> Kernbench log for Vanilla 3.7-rc3
> =================================

Oops, that title must have been "for sorted-buddy patchset" of course..

> Kernel: 3.7.0-rc3-sorted-buddy-numa-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 577.182 (0.713772)
> User Time 7315.43 (3.87226)
> System Time 1043 (1.12855)
> Percent CPU 1447.6 (2.19089)
> Context Switches 2117022 (3810.15)
> Sleeps 1.82966e+06 (4149.82)
> 
> 

Regards,
Srivatsa S. Bhat

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: SrinivasPandruvada @ 2012-11-09 16:48 UTC
  To: linux-pm
In-Reply-To: <20121108180257.GC8218@suse.de>

I like this implementation and think it is valuable.
I am experimenting with it on one of our hardware platforms. This type of
partitioning does help in saving power. We believe we can save up to 1 W per
DIMM with the help of some HW/BIOS changes. We are only talking about
content-preserving memory states, so we don't have to be 100% correct.
In my experiments, I tried two methods:
- Similar to the approach suggested by Mel Gorman. I have a special sticky
migrate type like CMA.
- Buddy buckets: Buddies are organized into memory-region-aware buckets.
During allocation it prefers higher order buckets. I made sure that my change
has no effect if there are no power-saving memory DIMMs. The advantage
of these buckets is that I can keep the memory for related task groups in
close proximity by direct hashing to a bucket. The free list is organized as a
two-dimensional array with bucket and migrate type for each order.

In both methods, reclaim is currently intended to be done through a sysfs
interface, similar to memory compaction for a node, allowing user space to
initiate reclaim.

Thanks,
Srinivas Pandruvada
Open Source Technology Center,
Intel Corp.
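
The "buddy buckets" layout described above (per order, a two-dimensional array indexed by bucket and migrate type, with task groups hashed directly to a bucket) might look roughly like the following userspace sketch. Everything here is an assumption made for illustration: the `MODEL_*` sizes, `struct model_zone`, `bucket_for_group()` and `pick_bucket()` are hypothetical, free lists are reduced to counters, and the fallback policy (scan from the highest-numbered bucket down) is only one possible reading of the description.

```c
#define MODEL_MAX_ORDER   11
#define MODEL_NR_BUCKETS  4
#define MODEL_NR_MIGRATE  3

/* Per-order free area: one counter per (bucket, migrate type) pair. */
struct model_free_area {
	unsigned long nr_free[MODEL_NR_BUCKETS][MODEL_NR_MIGRATE];
};

struct model_zone {
	struct model_free_area free_area[MODEL_MAX_ORDER];
};

/* Direct hashing of a task group to a bucket (modulo, for simplicity). */
static unsigned int bucket_for_group(unsigned int task_group_id)
{
	return task_group_id % MODEL_NR_BUCKETS;
}

/*
 * Pick a bucket that has free blocks of the given order and migrate
 * type: the preferred bucket first, then the highest-numbered bucket
 * that has any. Returns -1 if nothing is available or an index is out
 * of range.
 */
static int pick_bucket(const struct model_zone *z, unsigned int order,
		       unsigned int migratetype, unsigned int preferred)
{
	if (order >= MODEL_MAX_ORDER || migratetype >= MODEL_NR_MIGRATE ||
	    preferred >= MODEL_NR_BUCKETS)
		return -1;
	if (z->free_area[order].nr_free[preferred][migratetype] > 0)
		return (int)preferred;
	for (int b = MODEL_NR_BUCKETS - 1; b >= 0; b--)
		if (z->free_area[order].nr_free[b][migratetype] > 0)
			return b;
	return -1;
}
```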



* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-11-09 16:43 UTC
  To: Dave Hansen
  Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck,
	maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu,
	lenb, rjw, gargankita, amit.kachhap, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <509D3088.2060507@linux.vnet.ibm.com>

On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 09:43 PM, Dave Hansen wrote:
>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>>> my other email to Dave.
>>>
>>> https://lkml.org/lkml/2012/11/7/428
>>>
>>> I don't think I can dismiss it as an experimental error, because I am seeing
>>> those results consistently.. I'm trying to find out what's behind that.
>>
>> The only numbers in that link are in the date. :)  Let's see the
>> numbers, please.
>>
> 
> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
> want it to look ridiculous if it later turned out to be really an error in the
> experiment ;) But since I have seen it happening consistently I think I can
> post the numbers here with some non-zero confidence.
> 
>> If you really have performance improvement to the memory allocator (or
>> something else) here, then surely it can be pared out of your patches
>> and merged quickly by itself.  Those kinds of optimizations are hard to
>> come by!
>>
> 
> :-)
> 
> Anyway, here it goes:
> 
> Test setup:
> ----------
> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
> patchset might not handle NUMA properly). Mem region size = 512 MB.
> 

For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
was much smaller, but nevertheless, this patchset performed better. I wouldn't
vouch that my patchset handles NUMA correctly, but here are the numbers from
that run anyway (at least to show that I really found the results to be
repeatable):

Kernbench log for Vanilla 3.7-rc3
=================================
Kernel: 3.7.0-rc3-vanilla-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 589.058 (0.596171)
User Time 7461.26 (1.69702)
System Time 1072.03 (1.54704)
Percent CPU 1448.2 (1.30384)
Context Switches 2.14322e+06 (4042.97)
Sleeps 1847230 (2614.96)

Kernbench log for Vanilla 3.7-rc3
=================================
Kernel: 3.7.0-rc3-sorted-buddy-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 577.182 (0.713772)
User Time 7315.43 (3.87226)
System Time 1043 (1.12855)
Percent CPU 1447.6 (2.19089)
Context Switches 2117022 (3810.15)
Sleeps 1.82966e+06 (4149.82)


Regards,
Srivatsa S. Bhat
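
As a rough sanity check on the CONFIG_NUMA=y elapsed times above (589.058 s vs. 577.182 s), the gap can be expressed in multiples of the larger reported standard deviation. The helper below is a hypothetical illustration, not a significance test; it only shows that the difference is large relative to the run-to-run spread that kernbench reported.

```c
/*
 * Difference between two means, in units of the larger of the two
 * reported standard deviations. A crude repeatability indicator only.
 */
static double gap_in_stddevs(double base_mean, double base_sd,
			     double new_mean, double new_sd)
{
	double sd = base_sd > new_sd ? base_sd : new_sd;

	return (base_mean - new_mean) / sd;
}
```

For the figures above, gap_in_stddevs(589.058, 0.596171, 577.182, 0.713772) comes out at roughly 16.6, while the relative reduction in elapsed time is about 2%.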

> Kernbench log for Vanilla 3.7-rc3
> =================================
> 
> Kernel: 3.7.0-rc3-vanilla-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 650.742 (2.49774)
> User Time 8213.08 (17.6347)
> System Time 1273.91 (6.00643)
> Percent CPU 1457.4 (3.64692)
> Context Switches 2250203 (3846.61)
> Sleeps 1.8781e+06 (5310.33)
> 
> Kernbench log for this sorted-buddy patchset
> ============================================
> 
> Kernel: 3.7.0-rc3-sorted-buddy-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 591.696 (0.660969)
> User Time 7511.97 (1.08313)
> System Time 1062.99 (1.1109)
> Percent CPU 1448.6 (1.94936)
> Context Switches 2.1496e+06 (3507.12)
> Sleeps 1.84305e+06 (3092.67)
> 

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-09 16:41 UTC
  To: Huang Ying; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <1352428604.7176.103.camel@yhuang-dev>

On Fri, 9 Nov 2012, Huang Ying wrote:

> On Thu, 2012-11-08 at 12:07 -0500, Alan Stern wrote:
> > On Thu, 8 Nov 2012, Rafael J. Wysocki wrote:
> > 
> > > > > > is it a good idea to allow to set device state to SUSPENDED if the device
> > > > > > is disabled?
> > > > > 
> > > > > No, it is not.  The status should always be ACTIVE as long as usage_count > 0.
> > 
> > That isn't strictly true, because pm_runtime_get_noresume violates this
> > rule.  What the PM core actually does is prevent a transition from the
> > ACTIVE state to the SUSPENDING/SUSPENDED state if usage_count > 0,
> > _provided_ runtime PM is enabled.  There's no such restriction when it
> > is disabled.
> 
> Usage count may not be an issue for the end user.  But "on" in "control"
> sysfs file + SUSPENDED can be confusing for the end user.  Maybe we need
> to check dev->power.runtime_auto in pm_runtime_set_suspended().

You are confusing the issue by raising two separate (though related)
questions.

The first question: How should the PCI subsystem prevent the parents of 
driverless VGA devices from being runtime suspended while userspace is 
accessing them?

The second question: Should the PM core allow devices that are disabled
for runtime PM to be in the SUSPENDED state when
dev->power.runtime_auto is clear?

Assuming we don't want to allow this, there's a third question: Should
pm_runtime_allow call pm_runtime_set_suspended if the device is
disabled?

Alan Stern
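
The rule described above (the core refuses the ACTIVE to SUSPENDED transition while usage_count > 0, but only when runtime PM is enabled; a manual status change while disabled has no such restriction) can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: `struct model_rpm_dev`, `model_runtime_suspend()` and `model_set_suspended()` are hypothetical, and the error codes are illustrative rather than a statement of what the kernel returns.

```c
#include <errno.h>
#include <stdbool.h>

enum model_rpm_status { MODEL_RPM_ACTIVE, MODEL_RPM_SUSPENDED };

/* Hypothetical model of the runtime PM fields discussed in the thread. */
struct model_rpm_dev {
	enum model_rpm_status status;
	int usage_count;
	bool enabled;    /* runtime PM enabled for this device */
};

/* Suspend driven by the core: refused while the device is in use. */
static int model_runtime_suspend(struct model_rpm_dev *dev)
{
	if (!dev->enabled)
		return -EACCES;
	if (dev->usage_count > 0)
		return -EAGAIN;
	dev->status = MODEL_RPM_SUSPENDED;
	return 0;
}

/*
 * Manual status change: only allowed while runtime PM is disabled, and
 * usage_count is deliberately not consulted, which is the gap under
 * discussion.
 */
static int model_set_suspended(struct model_rpm_dev *dev)
{
	if (dev->enabled)
		return -EAGAIN;
	dev->status = MODEL_RPM_SUSPENDED;
	return 0;
}
```

In this model a disabled device can end up SUSPENDED with usage_count > 0, which is the combination the second and third questions above are about.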

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-11-09 16:34 UTC
  To: Dave Hansen
  Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck,
	maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu,
	lenb, rjw, gargankita, amit.kachhap, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <509D2B9B.4090305@linux.vnet.ibm.com>

On 11/09/2012 09:43 PM, Dave Hansen wrote:
> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>> my other email to Dave.
>>
>> https://lkml.org/lkml/2012/11/7/428
>>
>> I don't think I can dismiss it as an experimental error, because I am seeing
>> those results consistently.. I'm trying to find out what's behind that.
> 
> The only numbers in that link are in the date. :)  Let's see the
> numbers, please.
> 

Sure :) The reason I didn't post the numbers very eagerly was that I didn't
want it to look ridiculous if it later turned out to be really an error in the
experiment ;) But since I have seen it happening consistently I think I can
post the numbers here with some non-zero confidence.

> If you really have performance improvement to the memory allocator (or
> something else) here, then surely it can be pared out of your patches
> and merged quickly by itself.  Those kinds of optimizations are hard to
> come by!
> 

:-)

Anyway, here it goes:

Test setup:
----------
x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
patchset might not handle NUMA properly). Mem region size = 512 MB.

Kernbench log for Vanilla 3.7-rc3
=================================

Kernel: 3.7.0-rc3-vanilla-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 650.742 (2.49774)
User Time 8213.08 (17.6347)
System Time 1273.91 (6.00643)
Percent CPU 1457.4 (3.64692)
Context Switches 2250203 (3846.61)
Sleeps 1.8781e+06 (5310.33)

Kernbench log for this sorted-buddy patchset
============================================

Kernel: 3.7.0-rc3-sorted-buddy-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 591.696 (0.660969)
User Time 7511.97 (1.08313)
System Time 1062.99 (1.1109)
Percent CPU 1448.6 (1.94936)
Context Switches 2.1496e+06 (3507.12)
Sleeps 1.84305e+06 (3092.67)

Regards,
Srivatsa S. Bhat
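
For readers who want the size of the difference as a percentage, the relative reduction can be computed from the figures posted above. The helper name `pct_reduction()` is hypothetical; the inputs are the kernbench numbers from this message.

```c
/* Relative reduction of `patched` versus `baseline`, in percent. */
static double pct_reduction(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

With the CONFIG_NUMA=n numbers, elapsed time drops from 650.742 s to 591.696 s, a reduction of about 9.1%, and system time drops from 1273.91 s to 1062.99 s, about 16.6%.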


* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Dave Hansen @ 2012-11-09 16:13 UTC
  To: Srivatsa S. Bhat
  Cc: Mel Gorman, Vaidyanathan Srinivasan, akpm, mjg59, paulmck,
	maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu,
	lenb, rjw, gargankita, amit.kachhap, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <509D200F.2000908@linux.vnet.ibm.com>

On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
> FWIW, kernbench is actually (and surprisingly) showing a slight performance
> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
> my other email to Dave.
> 
> https://lkml.org/lkml/2012/11/7/428
> 
> I don't think I can dismiss it as an experimental error, because I am seeing
> those results consistently.. I'm trying to find out what's behind that.

The only numbers in that link are in the date. :)  Let's see the
numbers, please.

If you really have performance improvement to the memory allocator (or
something else) here, then surely it can be pared out of your patches
and merged quickly by itself.  Those kinds of optimizations are hard to
come by!

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Arjan van de Ven @ 2012-11-09 15:34 UTC
  To: svaidy
  Cc: Mel Gorman, Srivatsa S. Bhat, akpm, mjg59, paulmck, dave,
	maxime.coquelin, loic.pallardy, kmpark, kamezawa.hiroyu, lenb,
	rjw, gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar,
	linux-pm, linux-mm, linux-kernel
In-Reply-To: <20121109051247.GA499@dirshya.in.ibm.com>

On 11/8/2012 9:14 PM, Vaidyanathan Srinivasan wrote:
> * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]:
> 
>> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
>>> ------------------------------------------------------------
> 
> Hi Mel,
> 
> Thanks for detailed review and comments.  The goal of this patch
> series is to brainstorm on ideas that enable Linux VM to record and
> exploit memory region boundaries.
> 
> The first approach that we had last year (hierarchy) has more runtime
> overhead.  This approach of sorted-buddy was one of the alternative
> discussed earlier and we are trying to find out if simple requirements
> of biasing memory allocations can be achieved with this approach.
> 
> Smart reclaim based on this approach is a key piece we still need to
> design.  Ideas from compaction will certainly help.

Reclaim may be needed for the embedded use case,
but at least we are also looking at memory power savings that come from content-preserving power states.
For that, Linux should *statistically* not be actively using (i.e. reading from or writing to) a percentage of memory...
and statistical clustering is quite sufficient for that.

(For example, if you don't use a DIMM for a certain amount of time,
the link and other pieces can go to a lower power state,
even on today's server systems.
In a many-DIMM system, if each app is, on a per-app basis,
preferring one DIMM for its allocations, the process scheduler will
naturally help us keep the other DIMMs "dark".)

If you have to actually free the memory, it is a much, much harder problem,
increasingly so if the region you MUST free is quite large.

If one solution can solve both cases, great, but let's not block both
because one of the cases is hard...
(and please let's not use moving or freeing of pages as a solution for at least the
content-preserving case)

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-11-09 15:23 UTC
  To: Mel Gorman
  Cc: Vaidyanathan Srinivasan, akpm, mjg59, paulmck, dave,
	maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu,
	lenb, rjw, gargankita, amit.kachhap, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <509D185D.8070307@linux.vnet.ibm.com>

On 11/09/2012 08:21 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 02:30 PM, Mel Gorman wrote:
>> On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
>>> * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]:
[...]
>>>>> Short description of the "Sorted-buddy" design:
>>>>> -----------------------------------------------
>>>>>
>>>>> In this design, the memory region boundaries are captured in a parallel
>>>>> data-structure instead of fitting regions between nodes and zones in the
>>>>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>>>>> zones' freelists in region-sorted-order and thus do page allocation in the
>>>>> order of increasing memory regions.
>>>>
>>>> Implying that this sorting has to happen in either the alloc or free
>>>> fast path.
>>>
>>> Yes, in the free path. This optimization can actually be delayed in
>>> the free fast path and completely avoided if our memory is full and we
>>> are doing direct reclaim during allocations.
>>>
>>
>> Hurting the free fast path is a bad idea as there are workloads that depend
>> on it (buffer allocation and free) even though many workloads do *not*
>> notice it because the bulk of the cost is incurred at exit time. As
>> memory low power usage has many caveats (may be impossible if a page
>> table is allocated in the region for example) but CPU usage has less
>> restrictions it is more important that the CPU usage be kept low.
>>
>> That means, little or no modification to the fastpath. Sorting or linear
>> searches should be minimised or avoided.
>>
> 
> Right. For example, in the previous "hierarchy" design[1], there was no overhead
> in any of the fast paths. Because it split up the zones themselves, so that
> they fit on memory region boundaries. But that design had other problems, like
> zone fragmentation (too many zones).. which kind of outweighed the benefit
> obtained from zero overhead in the fast-paths. So one of the suggested
> alternatives during that review[2], was to explore modifying the buddy allocator
> to be aware of memory region boundaries, which this "sorted-buddy" design
> implements.
> 
> [1]. http://lwn.net/Articles/445045/
>      http://thread.gmane.org/gmane.linux.kernel.mm/63840
>      http://thread.gmane.org/gmane.linux.kernel.mm/89202
> 
> [2]. http://article.gmane.org/gmane.linux.power-management.general/24862
>      http://article.gmane.org/gmane.linux.power-management.general/25061
>      http://article.gmane.org/gmane.linux.kernel.mm/64689 
> 
> In this patchset, I have tried to minimize the overhead on the fastpaths.
> For example, I have used a special 'next_region' data-structure to keep the
> alloc path fast. Also, in the free path, we don't need to keep the free
> lists fully address sorted; having them region-sorted is sufficient. Of course
> we could explore more ways of avoiding overhead in the fast paths, or even a
> different design that promises to be much better overall. I'm all ears for
> any suggestions :-)
> 

FWIW, kernbench is actually (and surprisingly) showing a slight performance
*improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
my other email to Dave.

https://lkml.org/lkml/2012/11/7/428

I don't think I can dismiss it as an experimental error, because I am seeing
those results consistently.. I'm trying to find out what's behind that.

Regards,
Srivatsa S. Bhat

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-11-09 14:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vaidyanathan Srinivasan, akpm, mjg59, paulmck, dave,
	maxime.coquelin, loic.pallardy, arjan, kmpark, kamezawa.hiroyu,
	lenb, rjw, gargankita, amit.kachhap, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <20121109090052.GF8218@suse.de>

On 11/09/2012 02:30 PM, Mel Gorman wrote:
> On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
>> * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]:
>>
[...]
>>> How much power is saved?
>>
>> On embedded platform the savings could be around 5% as discussed in
>> the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935
>>
>> On larger servers with large amounts of memory the savings could be
>> more.  We do not yet have all the pieces together to evaluate.
>>
> 
> Ok, it's something to keep an eye on because if memory power savings
> require large amounts of CPU (for smart placement or migration) or more
> disk accesses (due to reclaim) then the savings will be offset by
> increased power usage elsewhere.
> 

True.

>>>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
>>>> the firmware can expose information regarding the boundaries of such memory
>>>> power management domains to the OS in a standard way.
>>>>
>>>
>>> I'm not familiar with the ACPI spec but is there support for parsing of
>>> MPST and interpreting the associated ACPI events? For example, if ACPI
>>> fires an event indicating that a memory power node is to enter a low
>>> state then presumably the OS should actively migrate pages away -- even
>>> if it's going into a state where the contents are still refreshed
>>> as exiting that state could take a long time.
>>>
>>> I did not look closely at the patchset at all because it looked like the
>>> actual support to use it and measure the benefit is missing.
>>
>> Correct.  The platform interface part is not included in this patch
>> set mainly because there is not much design required there.  Each
>> platform can have code to collect the memory region boundaries from
>> BIOS/firmware and load it into the Linux VM.  The goal of this patch
>> is to brainstorm on the idea of how core VM should use the region
>> information.
>>  
> 
> Ok. It does mean that the patches should not be merged until there is
> some platform support that can take advantage of them.
>

That's right, but the development of the VM algorithms and the platform
support for different platforms can go on in parallel. And once we have all
the pieces designed, we can fit them together and merge them.
 
>>>> How can Linux VM help memory power savings?
>>>>
>>>> o Consolidate memory allocations and/or references such that they are
>>>> not spread across the entire memory address space.  Basically area of memory
>>>> that is not being referenced, can reside in low power state.
>>>>
>>>
>>> Which the series does not appear to do.
>>
>> Correct.  We need to design the correct reclaim strategy for this to
>> work.  However having buddy list sorted by region address could get us
>> one step closer to shaping the allocations.
>>
> 
> If you reclaim, it means that the information is going to disk and will
> have to be refaulted in sooner rather than later. If you concentrate on
> reclaiming low memory regions and memory is almost full, it will lead to
> a situation where you almost always reclaim newer pages and increase
> faulting. You will save a few milliwatts on memory and lose way more
> than that on increased disk traffic and CPU usage.
> 

Yes, we should ensure that our reclaim strategy won't back-fire like that.
We definitely need to depend on LRU ordering for reclaim for the most part,
but try to opportunistically reclaim from within the required region boundaries
while doing that. We definitely need to think more about this...

But the point of making the free lists sorted region-wise in this patchset
was to exploit the shaping of page allocations the way we want (ie.,
constrained to a smaller number of regions).

>>>> o Support targeted memory reclaim, where certain areas of memory that can be
>>>> easily freed can be offlined, allowing those areas of memory to be put into
>>>> lower power states.
>>>>
>>>
>>> Which the series does not appear to do judging from this;
>>>
>>>   include/linux/mm.h     |   38 +++++++
>>>   include/linux/mmzone.h |   52 +++++++++
>>>   mm/compaction.c        |    8 +
>>>   mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
>>>   mm/vmstat.c            |   59 ++++++++++-
>>>
>>> This does not appear to be doing anything with reclaim and not enough with
>>> compaction to indicate that the series actively manages memory placement
>>> in response to ACPI events.
>>
>> Correct.  Evaluating different ideas for reclaim will be next step
>> before getting into the platform interface parts.
>>
[...]
>>
>> This patch is roughly based on the idea that ACPI MPST will give us
>> memory region boundaries.  It is not designed to implement all options
>> defined in the spec. 
> 
> Ok, but as it is the only potential consumer of this interface that you
> mentioned then it should at least be able to handle it. The spec talks about
> overlapping memory regions where the regions potentially have different
> power states. This is pretty damn remarkable and hard to see how it could
> be interpreted in a sensible way but it forces your implementation to take
> it into account.
>

Well, sorry for not mentioning in the cover-letter, but the VM algorithms for
memory power management could benefit other platforms too, like ARM, not just
ACPI-based systems. Last year, Amit had evaluated them on Samsung boards with
a simplistic layout for memory regions, based on the Samsung exynos board's
configuration.

http://article.gmane.org/gmane.linux.kernel.mm/65935
 
>> We have taken the general case where regions do not
>> overlap, while the memory addresses themselves can be discontinuous.
>>
> 
> Why is that the general case? You referred to the ACPI spec where it is not
> the case and no other examples.
> 

ARM is another example, where we could describe the memory regions in a simple
manner with respect to the Samsung exynos board.

So the idea behind this patchset was to start by assuming a simplistic layout
for memory regions and focussing on the design of the VM algorithms, and
evaluating how this "sorted-buddy" design would perform in comparison to the
previous "hierarchy" design that was explored last year.

But of course, you are absolutely right in pointing out that, to make all this
consumable, we need to revisit this with a focus on the layout of memory
regions themselves, so that all interested platforms can make use of it
effectively.

[...]

>>>> Short description of the "Sorted-buddy" design:
>>>> -----------------------------------------------
>>>>
>>>> In this design, the memory region boundaries are captured in a parallel
>>>> data-structure instead of fitting regions between nodes and zones in the
>>>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>>>> zones' freelists in region-sorted-order and thus do page allocation in the
>>>> order of increasing memory regions.
>>>
>>> Implying that this sorting has to happen in either the alloc or free
>>> fast path.
>>
>> Yes, in the free path. This optimization can actually be delayed in
>> the free fast path and completely avoided if our memory is full and we
>> are doing direct reclaim during allocations.
>>
> 
> Hurting the free fast path is a bad idea as there are workloads that depend
> on it (buffer allocation and free) even though many workloads do *not*
> notice it because the bulk of the cost is incurred at exit time. As
> memory low power usage has many caveats (may be impossible if a page
> table is allocated in the region for example) but CPU usage has fewer
> restrictions it is more important that the CPU usage be kept low.
> 
> That means, little or no modification to the fastpath. Sorting or linear
> searches should be minimised or avoided.
> 

Right. For example, in the previous "hierarchy" design[1], there was no overhead
in any of the fast paths. Because it split up the zones themselves, so that
they fit on memory region boundaries. But that design had other problems, like
zone fragmentation (too many zones).. which kind of outweighed the benefit
obtained from zero overhead in the fast-paths. So one of the suggested
alternatives during that review[2], was to explore modifying the buddy allocator
to be aware of memory region boundaries, which this "sorted-buddy" design
implements.

[1]. http://lwn.net/Articles/445045/
     http://thread.gmane.org/gmane.linux.kernel.mm/63840
     http://thread.gmane.org/gmane.linux.kernel.mm/89202

[2]. http://article.gmane.org/gmane.linux.power-management.general/24862
     http://article.gmane.org/gmane.linux.power-management.general/25061
     http://article.gmane.org/gmane.linux.kernel.mm/64689 

In this patchset, I have tried to minimize the overhead on the fastpaths.
For example, I have used a special 'next_region' data-structure to keep the
alloc path fast. Also, in the free path, we don't need to keep the free
lists fully address sorted; having them region-sorted is sufficient. Of course
we could explore more ways of avoiding overhead in the fast paths, or even a
different design that promises to be much better overall. I'm all ears for
any suggestions :-)
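
To illustrate what "region-sorted but not address-sorted" means for the free
path, here is a minimal user-space sketch. It is not code from the patchset:
struct free_page, struct region_freelist and both helpers are hypothetical
names, and the linear walk in freelist_add() is used only for clarity (a real
implementation would need something cheaper to keep the fast path fast).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a free page; not the kernel's struct page. */
struct free_page {
	unsigned long pfn;		/* page frame number */
	int region;			/* id of the memory region holding the page */
	struct free_page *next;
};

/* Hypothetical free list kept in region order, not in address order. */
struct region_freelist {
	struct free_page *head;
};

/*
 * Free path: place the page behind the last entry whose region id is
 * less than or equal to its own. Pages within one region stay in
 * arrival order, so no address comparison is ever made.
 */
static void freelist_add(struct region_freelist *fl, struct free_page *page)
{
	struct free_page **link = &fl->head;

	assert(page != NULL);

	while (*link && (*link)->region <= page->region)
		link = &(*link)->next;

	page->next = *link;
	*link = page;
}

/* Alloc path: the head always belongs to the lowest-numbered region. */
static struct free_page *freelist_pop(struct region_freelist *fl)
{
	struct free_page *page = fl->head;

	if (page) {
		fl->head = page->next;
		page->next = NULL;
	}
	return page;
}
```

With this ordering, allocations drain the lowest-numbered region first, which
is what lets higher-numbered regions stay unreferenced for longer.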

>> At this point we want to look at overheads of having region
>> infrastructure in VM and how does that trade off in terms of
>> requirements that we can meet.
>>
>> The first goal is to have memory allocations fill as few regions as
>> possible when system's memory usage is significantly lower. 
> 
> While it's a reasonable starting objective, the fast path overhead is very
> unfortunate and such a strategy can be easily defeated by running something
> metadata intensive (like find over the entire system) while a large memory
> user starts at the same time to spread kernel and user space allocations
> throughout the address space. This will spread the allocations throughout
> the address space and persist even after the two processes exit due to
> the page cache usage from the metadata intensive workload.
> 
> Basically, it'll only work as long as the system is idle or never uses
> much memory during the lifetime of the system.
> 

Well, page cache usage could definitely come in the way of memory power
management. Probably having a separate driver shrink the page cache
(depending on how aggressive we want to get with respect to power-management)
is the way to go?

Regards,
Srivatsa S. Bhat


^ permalink raw reply

* Re: [PATCH V3 3/5] Thermal: Remove the cooling_cpufreq_list.
From: Hongbo Zhang @ 2012-11-09 11:54 UTC (permalink / raw)
  To: Zhang Rui
  Cc: linaro-dev, linux-kernel, linux-pm, amit.kachhap, patches,
	linaro-kernel, STEricsson_nomadik_linux, kernel, hongbo.zhang
In-Reply-To: <1352271296.2137.35.camel@rzhang1-mobl4>

On 7 November 2012 14:54, Zhang Rui <rui.zhang@intel.com> wrote:
> On Tue, 2012-10-30 at 17:48 +0100, hongbo.zhang wrote:
>> From: "hongbo.zhang" <hongbo.zhang@linaro.com>
>>
> >> The problem with using this list is that the cpufreq_get_max_state callback is
> >> called when thermal_cooling_device_register registers the cooling device, but
>> this list isn't ready at this moment. What's more, there is no need to maintain
>> such a list, we can get cpufreq_cooling_device instance by the private
>> thermal_cooling_device.devdata.
>>
>> Signed-off-by: hongbo.zhang <hongbo.zhang@linaro.com>
>> Reviewed-by: Francesco Lavra <francescolavra.fl@gmail.com>
>> Reviewed-by: Amit Daniel Kachhap <amit.kachhap@linaro.org>
>
> applied to thermal-next.
Thanks.
I have sent the other updated 4/5 5/5 patches with Reviewed-by added
in a new thread, please have a look there.

>
> thanks,
> rui
>
>> ---
>>  drivers/thermal/cpu_cooling.c | 91 +++++++++----------------------------------
>>  1 file changed, 19 insertions(+), 72 deletions(-)
>>
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index bfd62b7..392d57d 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -58,8 +58,9 @@ struct cpufreq_cooling_device {
>>  };
>>  static LIST_HEAD(cooling_cpufreq_list);
>>  static DEFINE_IDR(cpufreq_idr);
>> +static DEFINE_MUTEX(cooling_cpufreq_lock);
>>
>> -static struct mutex cooling_cpufreq_lock;
>> +static unsigned int cpufreq_dev_count;
>>
>>  /* notify_table passes value to the CPUFREQ_ADJUST callback function. */
>>  #define NOTIFY_INVALID NULL
>> @@ -240,28 +241,18 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>>  static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
>>                                unsigned long *state)
>>  {
>> -     int ret = -EINVAL, i = 0;
>> -     struct cpufreq_cooling_device *cpufreq_device;
>> -     struct cpumask *maskPtr;
>> +     struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>> +     struct cpumask *maskPtr = &cpufreq_device->allowed_cpus;
>>       unsigned int cpu;
>>       struct cpufreq_frequency_table *table;
>>       unsigned long count = 0;
>> +     int i = 0;
>>
>> -     mutex_lock(&cooling_cpufreq_lock);
>> -     list_for_each_entry(cpufreq_device, &cooling_cpufreq_list, node) {
>> -             if (cpufreq_device && cpufreq_device->cool_dev == cdev)
>> -                     break;
>> -     }
>> -     if (cpufreq_device == NULL)
>> -             goto return_get_max_state;
>> -
>> -     maskPtr = &cpufreq_device->allowed_cpus;
>>       cpu = cpumask_any(maskPtr);
>>       table = cpufreq_frequency_get_table(cpu);
>>       if (!table) {
>>               *state = 0;
>> -             ret = 0;
>> -             goto return_get_max_state;
>> +             return 0;
>>       }
>>
>>       for (i = 0; (table[i].frequency != CPUFREQ_TABLE_END); i++) {
>> @@ -272,12 +263,10 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
>>
>>       if (count > 0) {
>>               *state = --count;
>> -             ret = 0;
>> +             return 0;
>>       }
>>
>> -return_get_max_state:
>> -     mutex_unlock(&cooling_cpufreq_lock);
>> -     return ret;
>> +     return -EINVAL;
>>  }
>>
>>  /**
>> @@ -288,20 +277,10 @@ return_get_max_state:
>>  static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
>>                                unsigned long *state)
>>  {
>> -     int ret = -EINVAL;
>> -     struct cpufreq_cooling_device *cpufreq_device;
>> +     struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>>
>> -     mutex_lock(&cooling_cpufreq_lock);
>> -     list_for_each_entry(cpufreq_device, &cooling_cpufreq_list, node) {
>> -             if (cpufreq_device && cpufreq_device->cool_dev == cdev) {
>> -                     *state = cpufreq_device->cpufreq_state;
>> -                     ret = 0;
>> -                     break;
>> -             }
>> -     }
>> -     mutex_unlock(&cooling_cpufreq_lock);
>> -
>> -     return ret;
>> +     *state = cpufreq_device->cpufreq_state;
>> +     return 0;
>>  }
>>
>>  /**
>> @@ -312,22 +291,9 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
>>  static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>>                                unsigned long state)
>>  {
>> -     int ret = -EINVAL;
>> -     struct cpufreq_cooling_device *cpufreq_device;
>> +     struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>>
>> -     mutex_lock(&cooling_cpufreq_lock);
>> -     list_for_each_entry(cpufreq_device, &cooling_cpufreq_list, node) {
>> -             if (cpufreq_device && cpufreq_device->cool_dev == cdev) {
>> -                     ret = 0;
>> -                     break;
>> -             }
>> -     }
>> -     if (!ret)
>> -             ret = cpufreq_apply_cooling(cpufreq_device, state);
>> -
>> -     mutex_unlock(&cooling_cpufreq_lock);
>> -
>> -     return ret;
>> +     return cpufreq_apply_cooling(cpufreq_device, state);
>>  }
>>
>>  /* Bind cpufreq callbacks to thermal cooling device ops */
>> @@ -351,14 +317,11 @@ struct thermal_cooling_device *cpufreq_cooling_register(
>>  {
>>       struct thermal_cooling_device *cool_dev;
>>       struct cpufreq_cooling_device *cpufreq_dev = NULL;
>> -     unsigned int cpufreq_dev_count = 0, min = 0, max = 0;
>> +     unsigned int min = 0, max = 0;
>>       char dev_name[THERMAL_NAME_LENGTH];
>>       int ret = 0, i;
>>       struct cpufreq_policy policy;
>>
>> -     list_for_each_entry(cpufreq_dev, &cooling_cpufreq_list, node)
>> -             cpufreq_dev_count++;
>> -
>>       /*Verify that all the clip cpus have same freq_min, freq_max limit*/
>>       for_each_cpu(i, clip_cpus) {
>>               /*continue if cpufreq policy not found and not return error*/
>> @@ -380,9 +343,6 @@ struct thermal_cooling_device *cpufreq_cooling_register(
>>
>>       cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus);
>>
>> -     if (cpufreq_dev_count == 0)
>> -             mutex_init(&cooling_cpufreq_lock);
>> -
>>       ret = get_idr(&cpufreq_idr, &cpufreq_dev->id);
>>       if (ret) {
>>               kfree(cpufreq_dev);
>> @@ -401,12 +361,12 @@ struct thermal_cooling_device *cpufreq_cooling_register(
>>       cpufreq_dev->cool_dev = cool_dev;
>>       cpufreq_dev->cpufreq_state = 0;
>>       mutex_lock(&cooling_cpufreq_lock);
>> -     list_add_tail(&cpufreq_dev->node, &cooling_cpufreq_list);
>>
>>       /* Register the notifier for first cpufreq cooling device */
>>       if (cpufreq_dev_count == 0)
>>               cpufreq_register_notifier(&thermal_cpufreq_notifier_block,
>>                                               CPUFREQ_POLICY_NOTIFIER);
>> +     cpufreq_dev_count++;
>>
>>       mutex_unlock(&cooling_cpufreq_lock);
>>       return cool_dev;
>> @@ -419,33 +379,20 @@ EXPORT_SYMBOL(cpufreq_cooling_register);
>>   */
>>  void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>>  {
>> -     struct cpufreq_cooling_device *cpufreq_dev = NULL;
>> -     unsigned int cpufreq_dev_count = 0;
>> +     struct cpufreq_cooling_device *cpufreq_dev = cdev->devdata;
>>
>>       mutex_lock(&cooling_cpufreq_lock);
>> -     list_for_each_entry(cpufreq_dev, &cooling_cpufreq_list, node) {
>> -             if (cpufreq_dev && cpufreq_dev->cool_dev == cdev)
>> -                     break;
>> -             cpufreq_dev_count++;
>> -     }
>> -
>> -     if (!cpufreq_dev || cpufreq_dev->cool_dev != cdev) {
>> -             mutex_unlock(&cooling_cpufreq_lock);
>> -             return;
>> -     }
>> -
>> -     list_del(&cpufreq_dev->node);
>> +     cpufreq_dev_count--;
>>
>>       /* Unregister the notifier for the last cpufreq cooling device */
>> -     if (cpufreq_dev_count == 1) {
>> +     if (cpufreq_dev_count == 0) {
>>               cpufreq_unregister_notifier(&thermal_cpufreq_notifier_block,
>>                                       CPUFREQ_POLICY_NOTIFIER);
>>       }
>>       mutex_unlock(&cooling_cpufreq_lock);
>> +
>>       thermal_cooling_device_unregister(cpufreq_dev->cool_dev);
>>       release_idr(&cpufreq_idr, cpufreq_dev->id);
>> -     if (cpufreq_dev_count == 1)
>> -             mutex_destroy(&cooling_cpufreq_lock);
>>       kfree(cpufreq_dev);
>>  }
>>  EXPORT_SYMBOL(cpufreq_cooling_unregister);
>
>
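
The core idea of the quoted patch, reaching the driver's private object
through the devdata pointer instead of searching a global list, can be shown
in isolation. The sketch below is a simplified user-space model: struct
cooling_dev, struct cpufreq_cooling and cooling_register() are hypothetical
stand-ins, not the thermal framework's real types or API.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct thermal_cooling_device. */
struct cooling_dev {
	void *devdata;			/* private pointer given at registration */
};

/* Hypothetical, simplified stand-in for struct cpufreq_cooling_device. */
struct cpufreq_cooling {
	unsigned long cpufreq_state;
	struct cooling_dev *cool_dev;
};

/* Registration stores the driver's private object in devdata. */
static void cooling_register(struct cooling_dev *cdev,
			     struct cpufreq_cooling *priv)
{
	cdev->devdata = priv;
	priv->cool_dev = cdev;
	priv->cpufreq_state = 0;
}

/*
 * Callback: the private object comes straight from cdev->devdata, so
 * there is no list walk and no lock taken just to locate it. This also
 * works when the callback runs during registration, before any list
 * entry could have been added.
 */
static int get_cur_state(struct cooling_dev *cdev, unsigned long *state)
{
	struct cpufreq_cooling *priv = cdev->devdata;

	*state = priv->cpufreq_state;
	return 0;
}
```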

^ permalink raw reply

* Re: [PATCH V5 0/2] Upstream ST-Ericsson thermal driver
From: Hongbo Zhang @ 2012-11-09 11:39 UTC (permalink / raw)
  To: linux-acpi, rui.zhang
  Cc: linux-kernel, linux-pm, amit.kachhap, patches, linaro-dev,
	linaro-kernel, STEricsson_nomadik_linux, kernel, hongbo.zhang
In-Reply-To: <1352460548-3494-1-git-send-email-hongbo.zhang@linaro.com>

Hi Rui Zhang,
Please have a look at this patch set.
Since the previous 1/5, 2/5, 3/5 have been accepted, I'd like to send
these last two updated patches with Reviewed-by added in this new
thread.

Thanks.

On 9 November 2012 19:29, hongbo.zhang <hongbo.zhang@linaro.org> wrote:
> From: "hongbo.zhang" <hongbo.zhang@linaro.com>
>
> V4->V5 Changes:
>
> In patch "Thermal: Add ST-Ericsson DB8500 thermal driver":
>  - use flush_work instead of flush_work_sync since the latter is deprecated now.
>  - parameter trip_points of db8500_thermal_match_cdev is renamed to trip_point;
>  - re-order operations in function db8500_thermal_update_config;
>
> hongbo.zhang (2):
>   Thermal: Add ST-Ericsson DB8500 thermal driver.
>   Thermal: Add ST-Ericsson DB8500 thermal properties and platform data.
>
>  .../devicetree/bindings/thermal/db8500-thermal.txt |  44 ++
>  arch/arm/boot/dts/dbx5x0.dtsi                      |  14 +
>  arch/arm/boot/dts/snowball.dts                     |  31 ++
>  arch/arm/configs/u8500_defconfig                   |   2 +
>  arch/arm/mach-ux500/board-mop500.c                 |  64 +++
>  drivers/thermal/Kconfig                            |  20 +
>  drivers/thermal/Makefile                           |   2 +
>  drivers/thermal/db8500_cpufreq_cooling.c           | 108 +++++
>  drivers/thermal/db8500_thermal.c                   | 531 +++++++++++++++++++++
>  include/linux/platform_data/db8500_thermal.h       |  38 ++
>  10 files changed, 854 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/thermal/db8500-thermal.txt
>  create mode 100644 drivers/thermal/db8500_cpufreq_cooling.c
>  create mode 100644 drivers/thermal/db8500_thermal.c
>  create mode 100644 include/linux/platform_data/db8500_thermal.h
>
> --
> 1.7.11.3
>

^ permalink raw reply

* [PATCH V5 0/2] Upstream ST-Ericsson thermal driver
From: hongbo.zhang @ 2012-11-09 11:29 UTC (permalink / raw)
  To: linux-acpi, rui.zhang
  Cc: linux-kernel, linux-pm, amit.kachhap, patches, linaro-dev,
	linaro-kernel, STEricsson_nomadik_linux, kernel, hongbo.zhang

From: "hongbo.zhang" <hongbo.zhang@linaro.com>

V4->V5 Changes:

In patch "Thermal: Add ST-Ericsson DB8500 thermal driver":
 - use flush_work instead of flush_work_sync since the latter is deprecated now.
 - parameter trip_points of db8500_thermal_match_cdev is renamed to trip_point;
 - re-order operations in function db8500_thermal_update_config;

hongbo.zhang (2):
  Thermal: Add ST-Ericsson DB8500 thermal driver.
  Thermal: Add ST-Ericsson DB8500 thermal properties and platform data.

 .../devicetree/bindings/thermal/db8500-thermal.txt |  44 ++
 arch/arm/boot/dts/dbx5x0.dtsi                      |  14 +
 arch/arm/boot/dts/snowball.dts                     |  31 ++
 arch/arm/configs/u8500_defconfig                   |   2 +
 arch/arm/mach-ux500/board-mop500.c                 |  64 +++
 drivers/thermal/Kconfig                            |  20 +
 drivers/thermal/Makefile                           |   2 +
 drivers/thermal/db8500_cpufreq_cooling.c           | 108 +++++
 drivers/thermal/db8500_thermal.c                   | 531 +++++++++++++++++++++
 include/linux/platform_data/db8500_thermal.h       |  38 ++
 10 files changed, 854 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/thermal/db8500-thermal.txt
 create mode 100644 drivers/thermal/db8500_cpufreq_cooling.c
 create mode 100644 drivers/thermal/db8500_thermal.c
 create mode 100644 include/linux/platform_data/db8500_thermal.h

-- 
1.7.11.3


^ permalink raw reply

* [PATCH V5 2/2] Thermal: Add ST-Ericsson DB8500 thermal properties and platform data.
From: hongbo.zhang @ 2012-11-09 11:29 UTC (permalink / raw)
  To: linux-acpi, rui.zhang
  Cc: linux-kernel, linux-pm, amit.kachhap, patches, linaro-dev,
	linaro-kernel, STEricsson_nomadik_linux, kernel, hongbo.zhang
In-Reply-To: <1352460548-3494-1-git-send-email-hongbo.zhang@linaro.com>

From: "hongbo.zhang" <hongbo.zhang@linaro.com>

This patch adds device tree properties for ST-Ericsson DB8500 thermal driver,
also adds the platform data to support old-style (non-device-tree) booting.

Signed-off-by: hongbo.zhang <hongbo.zhang@linaro.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 arch/arm/boot/dts/dbx5x0.dtsi      | 14 +++++++++
 arch/arm/boot/dts/snowball.dts     | 31 ++++++++++++++++++
 arch/arm/configs/u8500_defconfig   |  2 ++
 arch/arm/mach-ux500/board-mop500.c | 64 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 111 insertions(+)

diff --git a/arch/arm/boot/dts/dbx5x0.dtsi b/arch/arm/boot/dts/dbx5x0.dtsi
index 4b0e0ca..731086b 100644
--- a/arch/arm/boot/dts/dbx5x0.dtsi
+++ b/arch/arm/boot/dts/dbx5x0.dtsi
@@ -203,6 +203,14 @@
 				reg = <0x80157450 0xC>;
 			};
 
+			thermal@801573c0 {
+				compatible = "stericsson,db8500-thermal";
+				reg = <0x801573c0 0x40>;
+				interrupts = <21 0x4>, <22 0x4>;
+				interrupt-names = "IRQ_HOTMON_LOW", "IRQ_HOTMON_HIGH";
+				status = "disabled";
+			 };
+
 			db8500-prcmu-regulators {
 				compatible = "stericsson,db8500-prcmu-regulator";
 
@@ -660,5 +668,11 @@
 			ranges = <0 0x50000000 0x4000000>;
 			status = "disabled";
 		};
+
+		cpufreq-cooling {
+			compatible = "stericsson,db8500-cpufreq-cooling";
+			status = "disabled";
+		 };
+
 	};
 };
diff --git a/arch/arm/boot/dts/snowball.dts b/arch/arm/boot/dts/snowball.dts
index 702c0ba..c6f85f0 100644
--- a/arch/arm/boot/dts/snowball.dts
+++ b/arch/arm/boot/dts/snowball.dts
@@ -99,6 +99,33 @@
 			status = "okay";
 		};
 
+		prcmu@80157000 {
+			thermal@801573c0 {
+				num-trips = <4>;
+
+				trip0-temp = <70000>;
+				trip0-type = "active";
+				trip0-cdev-num = <1>;
+				trip0-cdev-name0 = "thermal-cpufreq-0";
+
+				trip1-temp = <75000>;
+				trip1-type = "active";
+				trip1-cdev-num = <1>;
+				trip1-cdev-name0 = "thermal-cpufreq-0";
+
+				trip2-temp = <80000>;
+				trip2-type = "active";
+				trip2-cdev-num = <1>;
+				trip2-cdev-name0 = "thermal-cpufreq-0";
+
+				trip3-temp = <85000>;
+				trip3-type = "critical";
+				trip3-cdev-num = <0>;
+
+				status = "okay";
+			 };
+		};
+
 		external-bus@50000000 {
 			status = "okay";
 
@@ -183,5 +210,9 @@
 				reg = <0x33>;
 			};
 		};
+
+		cpufreq-cooling {
+			status = "okay";
+		};
 	};
 };
diff --git a/arch/arm/configs/u8500_defconfig b/arch/arm/configs/u8500_defconfig
index da68454..250625d 100644
--- a/arch/arm/configs/u8500_defconfig
+++ b/arch/arm/configs/u8500_defconfig
@@ -69,6 +69,8 @@ CONFIG_GPIO_TC3589X=y
 CONFIG_POWER_SUPPLY=y
 CONFIG_AB8500_BM=y
 CONFIG_AB8500_BATTERY_THERM_ON_BATCTRL=y
+CONFIG_THERMAL=y
+CONFIG_CPU_THERMAL=y
 CONFIG_MFD_STMPE=y
 CONFIG_MFD_TC3589X=y
 CONFIG_AB5500_CORE=y
diff --git a/arch/arm/mach-ux500/board-mop500.c b/arch/arm/mach-ux500/board-mop500.c
index 416d436..b03216b 100644
--- a/arch/arm/mach-ux500/board-mop500.c
+++ b/arch/arm/mach-ux500/board-mop500.c
@@ -16,6 +16,7 @@
 #include <linux/io.h>
 #include <linux/i2c.h>
 #include <linux/platform_data/i2c-nomadik.h>
+#include <linux/platform_data/db8500_thermal.h>
 #include <linux/gpio.h>
 #include <linux/amba/bus.h>
 #include <linux/amba/pl022.h>
@@ -229,6 +230,67 @@ static struct ab8500_platform_data ab8500_platdata = {
 };
 
 /*
+ * Thermal Sensor
+ */
+
+static struct resource db8500_thsens_resources[] = {
+	{
+		.name = "IRQ_HOTMON_LOW",
+		.start  = IRQ_PRCMU_HOTMON_LOW,
+		.end    = IRQ_PRCMU_HOTMON_LOW,
+		.flags  = IORESOURCE_IRQ,
+	},
+	{
+		.name = "IRQ_HOTMON_HIGH",
+		.start  = IRQ_PRCMU_HOTMON_HIGH,
+		.end    = IRQ_PRCMU_HOTMON_HIGH,
+		.flags  = IORESOURCE_IRQ,
+	},
+};
+
+static struct db8500_thsens_platform_data db8500_thsens_data = {
+	.trip_points[0] = {
+		.temp = 70000,
+		.type = THERMAL_TRIP_ACTIVE,
+		.cdev_name = {
+			[0] = "thermal-cpufreq-0",
+		},
+	},
+	.trip_points[1] = {
+		.temp = 75000,
+		.type = THERMAL_TRIP_ACTIVE,
+		.cdev_name = {
+			[0] = "thermal-cpufreq-0",
+		},
+	},
+	.trip_points[2] = {
+		.temp = 80000,
+		.type = THERMAL_TRIP_ACTIVE,
+		.cdev_name = {
+			[0] = "thermal-cpufreq-0",
+		},
+	},
+	.trip_points[3] = {
+		.temp = 85000,
+		.type = THERMAL_TRIP_CRITICAL,
+	},
+	.num_trips = 4,
+};
+
+static struct platform_device u8500_thsens_device = {
+	.name           = "db8500-thermal",
+	.resource       = db8500_thsens_resources,
+	.num_resources  = ARRAY_SIZE(db8500_thsens_resources),
+	.dev	= {
+		.platform_data	= &db8500_thsens_data,
+	},
+};
+
+static struct platform_device u8500_cpufreq_cooling_device = {
+	.name           = "db8500-cpufreq-cooling",
+};
+
+/*
  * TPS61052
  */
 
@@ -583,6 +645,8 @@ static struct platform_device *snowball_platform_devs[] __initdata = {
 	&snowball_key_dev,
 	&snowball_sbnet_dev,
 	&snowball_gpio_en_3v3_regulator_dev,
+	&u8500_thsens_device,
+	&u8500_cpufreq_cooling_device,
 };
 
 static void __init mop500_init_machine(void)
-- 
1.7.11.3
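
The trip-point tables in the patch above (70000, 75000, 80000 as "active" and
85000 as "critical") rely on the temperatures being in ascending order. The
sketch below shows how such a table might be searched. It is illustrative
only: struct trip_point and highest_trip_reached() are hypothetical names, not
part of the DB8500 driver, and the values are assumed to be millidegrees
Celsius.

```c
#include <stddef.h>

/* Hypothetical trip types, mirroring the two kinds used in the patch. */
enum trip_type { TRIP_ACTIVE, TRIP_CRITICAL };

/* Hypothetical trip point entry; temp is assumed to be millidegrees C. */
struct trip_point {
	unsigned long temp;
	enum trip_type type;
};

/*
 * Return the index of the highest trip point whose temperature has been
 * reached, or -1 if the reading is below every trip point. The table
 * must be sorted in ascending order for the early exit to be valid.
 */
static int highest_trip_reached(const struct trip_point *trips, size_t num,
				unsigned long temp)
{
	int idx = -1;
	size_t i;

	for (i = 0; i < num && trips[i].temp <= temp; i++)
		idx = (int)i;

	return idx;
}
```

A caller could then inspect the type of the returned entry to decide between
throttling (active) and emergency handling (critical).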


^ permalink raw reply related

* [PATCH V5 1/2] Thermal: Add ST-Ericsson DB8500 thermal driver.
From: hongbo.zhang @ 2012-11-09 11:29 UTC (permalink / raw)
  To: linux-acpi, rui.zhang
  Cc: linux-kernel, linux-pm, amit.kachhap, patches, linaro-dev,
	linaro-kernel, STEricsson_nomadik_linux, kernel, hongbo.zhang
In-Reply-To: <1352460548-3494-1-git-send-email-hongbo.zhang@linaro.com>

From: "hongbo.zhang" <hongbo.zhang@linaro.com>

This driver is based on the thermal management framework in thermal_sys.c. A
thermal zone device is created with the trip points to which cooling devices
can be bound, the current cooling device is cpufreq, e.g. CPU frequency is
clipped down to cool the CPU, and other cooling devices can be added and bound
to the trip points dynamically.  The platform specific PRCMU interrupts are
used to activate thermal updates when trip points are reached.

Signed-off-by: hongbo.zhang <hongbo.zhang@linaro.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Francesco Lavra <francescolavra.fl@gmail.com>
---
 .../devicetree/bindings/thermal/db8500-thermal.txt |  44 ++
 drivers/thermal/Kconfig                            |  20 +
 drivers/thermal/Makefile                           |   2 +
 drivers/thermal/db8500_cpufreq_cooling.c           | 108 +++++
 drivers/thermal/db8500_thermal.c                   | 531 +++++++++++++++++++++
 include/linux/platform_data/db8500_thermal.h       |  38 ++
 6 files changed, 743 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/thermal/db8500-thermal.txt
 create mode 100644 drivers/thermal/db8500_cpufreq_cooling.c
 create mode 100644 drivers/thermal/db8500_thermal.c
 create mode 100644 include/linux/platform_data/db8500_thermal.h

diff --git a/Documentation/devicetree/bindings/thermal/db8500-thermal.txt b/Documentation/devicetree/bindings/thermal/db8500-thermal.txt
new file mode 100644
index 0000000..2e1c06f
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/db8500-thermal.txt
@@ -0,0 +1,44 @@
+* ST-Ericsson DB8500 Thermal
+
+** Thermal node properties:
+
+- compatible : "stericsson,db8500-thermal";
+- reg : address range of the thermal sensor registers;
+- interrupts : interrupts generated from PRCMU;
+- interrupt-names : "IRQ_HOTMON_LOW" and "IRQ_HOTMON_HIGH";
+- num-trips : number of total trip points, this is required, set it 0 if none,
+  if greater than 0, the following properties must be defined;
+- tripN-temp : temperature of trip point N, should be in ascending order;
+- tripN-type : type of trip point N, should be one of "active" "passive" "hot"
+  "critical";
+- tripN-cdev-num : number of the cooling devices which can be bound to trip
+  point N, this is required if trip point N is defined, set it 0 if none,
+  otherwise the following cooling device names must be defined;
+- tripN-cdev-nameM : name of the No. M cooling device of trip point N;
+
+Usually num-trips and the tripN-* properties are defined separately, in board-specific dts files.
+
+Example:
+thermal@801573c0 {
+	compatible = "stericsson,db8500-thermal";
+	reg = <0x801573c0 0x40>;
+	interrupts = <21 0x4>, <22 0x4>;
+	interrupt-names = "IRQ_HOTMON_LOW", "IRQ_HOTMON_HIGH";
+
+	num-trips = <3>;
+
+	trip0-temp = <75000>;
+	trip0-type = "active";
+	trip0-cdev-num = <1>;
+	trip0-cdev-name0 = "thermal-cpufreq-0";
+
+	trip1-temp = <80000>;
+	trip1-type = "active";
+	trip1-cdev-num = <2>;
+	trip1-cdev-name0 = "thermal-cpufreq-0";
+	trip1-cdev-name1 = "thermal-fan";
+
+	trip2-temp = <85000>;
+	trip2-type = "critical";
+	trip2-cdev-num = <0>;
+};
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index e1cb6bd..54c8fd0 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -31,6 +31,26 @@ config CPU_THERMAL
 	  and not the ACPI interface.
 	  If you want this support, you should say Y here.
 
+config DB8500_THERMAL
+	bool "DB8500 thermal management"
+	depends on THERMAL
+	default y
+	help
+	  Adds DB8500 thermal management implementation according to the thermal
+	  management framework. A thermal zone with several trip points will be
+	  created. Cooling devices can be bound to the trip points to cool this
+	  thermal zone when trip points are reached.
+
+config DB8500_CPUFREQ_COOLING
+	tristate "DB8500 cpufreq cooling"
+	depends on CPU_THERMAL
+	default y
+	help
+	  Adds DB8500 cpufreq cooling devices, and these cooling devices can be
+	  bound to thermal zone trip points. When a trip point is reached, the
+	  bound cpufreq cooling device becomes active and lowers the CPU
+	  frequency to cool down the CPU.
+
 config SPEAR_THERMAL
 	bool "SPEAr thermal sensor driver"
 	depends on THERMAL
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 885550d..c7a8dab 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -7,3 +7,5 @@ obj-$(CONFIG_CPU_THERMAL)		+= cpu_cooling.o
 obj-$(CONFIG_SPEAR_THERMAL)		+= spear_thermal.o
 obj-$(CONFIG_RCAR_THERMAL)	+= rcar_thermal.o
 obj-$(CONFIG_EXYNOS_THERMAL)		+= exynos_thermal.o
+obj-$(CONFIG_DB8500_THERMAL)		+= db8500_thermal.o
+obj-$(CONFIG_DB8500_CPUFREQ_COOLING)	+= db8500_cpufreq_cooling.o
diff --git a/drivers/thermal/db8500_cpufreq_cooling.c b/drivers/thermal/db8500_cpufreq_cooling.c
new file mode 100644
index 0000000..4cf8e72
--- /dev/null
+++ b/drivers/thermal/db8500_cpufreq_cooling.c
@@ -0,0 +1,108 @@
+/*
+ * db8500_cpufreq_cooling.c - DB8500 cpufreq works as cooling device.
+ *
+ * Copyright (C) 2012 ST-Ericsson
+ * Copyright (C) 2012 Linaro Ltd.
+ *
+ * Author: Hongbo Zhang <hongbo.zhang@linaro.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/cpu_cooling.h>
+#include <linux/cpufreq.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+
+static int db8500_cpufreq_cooling_probe(struct platform_device *pdev)
+{
+	struct thermal_cooling_device *cdev;
+	struct cpumask mask_val;
+
+	/* make sure cpufreq driver has been initialized */
+	if (!cpufreq_frequency_get_table(0))
+		return -EPROBE_DEFER;
+
+	cpumask_copy(&mask_val, cpumask_of(0));
+	cdev = cpufreq_cooling_register(&mask_val);
+
+	if (IS_ERR(cdev)) {
+		dev_err(&pdev->dev, "Failed to register cooling device\n");
+		return PTR_ERR(cdev);
+	}
+
+	platform_set_drvdata(pdev, cdev);
+
+	dev_info(&pdev->dev, "Cooling device registered: %s\n", cdev->type);
+
+	return 0;
+}
+
+static int db8500_cpufreq_cooling_remove(struct platform_device *pdev)
+{
+	struct thermal_cooling_device *cdev = platform_get_drvdata(pdev);
+
+	cpufreq_cooling_unregister(cdev);
+
+	return 0;
+}
+
+static int db8500_cpufreq_cooling_suspend(struct platform_device *pdev,
+		pm_message_t state)
+{
+	return -ENOSYS;
+}
+
+static int db8500_cpufreq_cooling_resume(struct platform_device *pdev)
+{
+	return -ENOSYS;
+}
+
+#ifdef CONFIG_OF
+static const struct of_device_id db8500_cpufreq_cooling_match[] = {
+	{ .compatible = "stericsson,db8500-cpufreq-cooling" },
+	{},
+};
+#else
+#define db8500_cpufreq_cooling_match NULL
+#endif
+
+static struct platform_driver db8500_cpufreq_cooling_driver = {
+	.driver = {
+		.owner = THIS_MODULE,
+		.name = "db8500-cpufreq-cooling",
+		.of_match_table = db8500_cpufreq_cooling_match,
+	},
+	.probe = db8500_cpufreq_cooling_probe,
+	.suspend = db8500_cpufreq_cooling_suspend,
+	.resume = db8500_cpufreq_cooling_resume,
+	.remove = db8500_cpufreq_cooling_remove,
+};
+
+static int __init db8500_cpufreq_cooling_init(void)
+{
+	return platform_driver_register(&db8500_cpufreq_cooling_driver);
+}
+
+static void __exit db8500_cpufreq_cooling_exit(void)
+{
+	platform_driver_unregister(&db8500_cpufreq_cooling_driver);
+}
+
+/* Should be later than db8500_cpufreq_register */
+late_initcall(db8500_cpufreq_cooling_init);
+module_exit(db8500_cpufreq_cooling_exit);
+
+MODULE_AUTHOR("Hongbo Zhang <hongbo.zhang@stericsson.com>");
+MODULE_DESCRIPTION("DB8500 cpufreq cooling driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/thermal/db8500_thermal.c b/drivers/thermal/db8500_thermal.c
new file mode 100644
index 0000000..94ec358
--- /dev/null
+++ b/drivers/thermal/db8500_thermal.c
@@ -0,0 +1,531 @@
+/*
+ * db8500_thermal.c - DB8500 Thermal Management Implementation
+ *
+ * Copyright (C) 2012 ST-Ericsson
+ * Copyright (C) 2012 Linaro Ltd.
+ *
+ * Author: Hongbo Zhang <hongbo.zhang@linaro.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/cpu_cooling.h>
+#include <linux/interrupt.h>
+#include <linux/mfd/dbx500-prcmu.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/platform_data/db8500_thermal.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/thermal.h>
+
+#define PRCMU_DEFAULT_MEASURE_TIME	0xFFF
+#define PRCMU_DEFAULT_LOW_TEMP		0
+
+struct db8500_thermal_zone {
+	struct thermal_zone_device *therm_dev;
+	struct mutex th_lock;
+	struct work_struct therm_work;
+	struct db8500_thsens_platform_data *trip_tab;
+	enum thermal_device_mode mode;
+	enum thermal_trend trend;
+	unsigned long cur_temp_pseudo;
+	unsigned int cur_index;
+};
+
+/* Local function to check if thermal zone matches cooling devices */
+static int db8500_thermal_match_cdev(struct thermal_cooling_device *cdev,
+		struct db8500_trip_point *trip_point)
+{
+	int i;
+
+	if (!strlen(cdev->type))
+		return -EINVAL;
+
+	for (i = 0; i < COOLING_DEV_MAX; i++) {
+		if (!strcmp(trip_point->cdev_name[i], cdev->type))
+			return 0;
+	}
+
+	return -ENODEV;
+}
+
+/* Callback to bind cooling device to thermal zone */
+static int db8500_cdev_bind(struct thermal_zone_device *thermal,
+		struct thermal_cooling_device *cdev)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+	unsigned long max_state, upper, lower;
+	int i, ret = -EINVAL;
+
+	cdev->ops->get_max_state(cdev, &max_state);
+
+	for (i = 0; i < ptrips->num_trips; i++) {
+		if (db8500_thermal_match_cdev(cdev, &ptrips->trip_points[i]))
+			continue;
+
+		upper = lower = i > max_state ? max_state : i;
+
+		ret = thermal_zone_bind_cooling_device(thermal, i, cdev,
+			upper, lower);
+
+		dev_info(&cdev->device, "%s bind to %d: %d-%s\n", cdev->type,
+			i, ret, ret ? "fail" : "succeed");
+	}
+
+	return ret;
+}
+
+/* Callback to unbind cooling device from thermal zone */
+static int db8500_cdev_unbind(struct thermal_zone_device *thermal,
+		struct thermal_cooling_device *cdev)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+	int i, ret = -EINVAL;
+
+	for (i = 0; i < ptrips->num_trips; i++) {
+		if (db8500_thermal_match_cdev(cdev, &ptrips->trip_points[i]))
+			continue;
+
+		ret = thermal_zone_unbind_cooling_device(thermal, i, cdev);
+
+		dev_info(&cdev->device, "%s unbind from %d: %s\n", cdev->type,
+			i, ret ? "fail" : "succeed");
+	}
+
+	return ret;
+}
+
+/* Callback to get current temperature */
+static int db8500_sys_get_temp(struct thermal_zone_device *thermal,
+		unsigned long *temp)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+
+	/*
+	 * TODO: There is no PRCMU interface to get temperature data currently,
+	 * so a pseudo temperature is returned. It works for the thermal
+	 * framework and will be fixed when the PRCMU interface is available.
+	 */
+	*temp = pzone->cur_temp_pseudo;
+
+	return 0;
+}
+
+/* Callback to get temperature changing trend */
+static int db8500_sys_get_trend(struct thermal_zone_device *thermal,
+		int trip, enum thermal_trend *trend)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+
+	*trend = pzone->trend;
+
+	return 0;
+}
+
+/* Callback to get thermal zone mode */
+static int db8500_sys_get_mode(struct thermal_zone_device *thermal,
+		enum thermal_device_mode *mode)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+
+	mutex_lock(&pzone->th_lock);
+	*mode = pzone->mode;
+	mutex_unlock(&pzone->th_lock);
+
+	return 0;
+}
+
+/* Callback to set thermal zone mode */
+static int db8500_sys_set_mode(struct thermal_zone_device *thermal,
+		enum thermal_device_mode mode)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+
+	mutex_lock(&pzone->th_lock);
+
+	pzone->mode = mode;
+	if (mode == THERMAL_DEVICE_ENABLED)
+		schedule_work(&pzone->therm_work);
+
+	mutex_unlock(&pzone->th_lock);
+
+	return 0;
+}
+
+/* Callback to get trip point type */
+static int db8500_sys_get_trip_type(struct thermal_zone_device *thermal,
+		int trip, enum thermal_trip_type *type)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+
+	if (trip >= ptrips->num_trips)
+		return -EINVAL;
+
+	*type = ptrips->trip_points[trip].type;
+
+	return 0;
+}
+
+/* Callback to get trip point temperature */
+static int db8500_sys_get_trip_temp(struct thermal_zone_device *thermal,
+		int trip, unsigned long *temp)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+
+	if (trip >= ptrips->num_trips)
+		return -EINVAL;
+
+	*temp = ptrips->trip_points[trip].temp;
+
+	return 0;
+}
+
+/* Callback to get critical trip point temperature */
+static int db8500_sys_get_crit_temp(struct thermal_zone_device *thermal,
+		unsigned long *temp)
+{
+	struct db8500_thermal_zone *pzone = thermal->devdata;
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+	int i;
+
+	for (i = ptrips->num_trips - 1; i >= 0; i--) {
+		if (ptrips->trip_points[i].type == THERMAL_TRIP_CRITICAL) {
+			*temp = ptrips->trip_points[i].temp;
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static struct thermal_zone_device_ops thdev_ops = {
+	.bind = db8500_cdev_bind,
+	.unbind = db8500_cdev_unbind,
+	.get_temp = db8500_sys_get_temp,
+	.get_trend = db8500_sys_get_trend,
+	.get_mode = db8500_sys_get_mode,
+	.set_mode = db8500_sys_set_mode,
+	.get_trip_type = db8500_sys_get_trip_type,
+	.get_trip_temp = db8500_sys_get_trip_temp,
+	.get_crit_temp = db8500_sys_get_crit_temp,
+};
+
+static void db8500_thermal_update_config(struct db8500_thermal_zone *pzone,
+		unsigned int idx, enum thermal_trend trend,
+		unsigned long next_low, unsigned long next_high)
+{
+	prcmu_stop_temp_sense();
+
+	pzone->cur_index = idx;
+	pzone->cur_temp_pseudo = (next_low + next_high)/2;
+	pzone->trend = trend;
+
+	prcmu_config_hotmon((u8)(next_low/1000), (u8)(next_high/1000));
+	prcmu_start_temp_sense(PRCMU_DEFAULT_MEASURE_TIME);
+}
+
+static irqreturn_t prcmu_low_irq_handler(int irq, void *irq_data)
+{
+	struct db8500_thermal_zone *pzone = irq_data;
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+	unsigned int idx = pzone->cur_index;
+	unsigned long next_low, next_high;
+
+	if (unlikely(idx == 0))
+		/* Meaningless for thermal management, ignoring it */
+		return IRQ_HANDLED;
+
+	if (idx == 1) {
+		next_high = ptrips->trip_points[0].temp;
+		next_low = PRCMU_DEFAULT_LOW_TEMP;
+	} else {
+		next_high = ptrips->trip_points[idx-1].temp;
+		next_low = ptrips->trip_points[idx-2].temp;
+	}
+	idx -= 1;
+
+	db8500_thermal_update_config(pzone, idx, THERMAL_TREND_DROPPING,
+		next_low, next_high);
+
+	dev_dbg(&pzone->therm_dev->device,
+		"PRCMU set max %ld, min %ld\n", next_high, next_low);
+
+	schedule_work(&pzone->therm_work);
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t prcmu_high_irq_handler(int irq, void *irq_data)
+{
+	struct db8500_thermal_zone *pzone = irq_data;
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+	unsigned int idx = pzone->cur_index;
+	unsigned long next_low, next_high;
+
+	if (idx < ptrips->num_trips - 1) {
+		next_high = ptrips->trip_points[idx+1].temp;
+		next_low = ptrips->trip_points[idx].temp;
+		idx += 1;
+
+		db8500_thermal_update_config(pzone, idx, THERMAL_TREND_RAISING,
+			next_low, next_high);
+
+		dev_dbg(&pzone->therm_dev->device,
+			"PRCMU set max %ld, min %ld\n", next_high, next_low);
+	} else if (idx == ptrips->num_trips - 1)
+		pzone->cur_temp_pseudo = ptrips->trip_points[idx].temp + 1;
+
+	schedule_work(&pzone->therm_work);
+
+	return IRQ_HANDLED;
+}
+
+static void db8500_thermal_work(struct work_struct *work)
+{
+	enum thermal_device_mode cur_mode;
+	struct db8500_thermal_zone *pzone;
+
+	pzone = container_of(work, struct db8500_thermal_zone, therm_work);
+
+	mutex_lock(&pzone->th_lock);
+	cur_mode = pzone->mode;
+	mutex_unlock(&pzone->th_lock);
+
+	if (cur_mode == THERMAL_DEVICE_DISABLED)
+		return;
+
+	thermal_zone_device_update(pzone->therm_dev);
+	dev_dbg(&pzone->therm_dev->device, "thermal work finished.\n");
+}
+
+#ifdef CONFIG_OF
+static struct db8500_thsens_platform_data*
+		db8500_thermal_parse_dt(struct platform_device *pdev)
+{
+	struct db8500_thsens_platform_data *ptrips;
+	struct device_node *np = pdev->dev.of_node;
+	char prop_name[32];
+	const char *tmp_str;
+	u32 tmp_data;
+	int i, j;
+
+	ptrips = devm_kzalloc(&pdev->dev, sizeof(*ptrips), GFP_KERNEL);
+	if (!ptrips)
+		return NULL;
+
+	if (of_property_read_u32(np, "num-trips", &tmp_data))
+		goto err_parse_dt;
+
+	if (tmp_data > THERMAL_MAX_TRIPS)
+		goto err_parse_dt;
+
+	ptrips->num_trips = tmp_data;
+
+	for (i = 0; i < ptrips->num_trips; i++) {
+		sprintf(prop_name, "trip%d-temp", i);
+		if (of_property_read_u32(np, prop_name, &tmp_data))
+			goto err_parse_dt;
+
+		ptrips->trip_points[i].temp = tmp_data;
+
+		sprintf(prop_name, "trip%d-type", i);
+		if (of_property_read_string(np, prop_name, &tmp_str))
+			goto err_parse_dt;
+
+		if (!strcmp(tmp_str, "active"))
+			ptrips->trip_points[i].type = THERMAL_TRIP_ACTIVE;
+		else if (!strcmp(tmp_str, "passive"))
+			ptrips->trip_points[i].type = THERMAL_TRIP_PASSIVE;
+		else if (!strcmp(tmp_str, "hot"))
+			ptrips->trip_points[i].type = THERMAL_TRIP_HOT;
+		else if (!strcmp(tmp_str, "critical"))
+			ptrips->trip_points[i].type = THERMAL_TRIP_CRITICAL;
+		else
+			goto err_parse_dt;
+
+		sprintf(prop_name, "trip%d-cdev-num", i);
+		if (of_property_read_u32(np, prop_name, &tmp_data))
+			goto err_parse_dt;
+
+		if (tmp_data > COOLING_DEV_MAX)
+			goto err_parse_dt;
+
+		for (j = 0; j < tmp_data; j++) {
+			sprintf(prop_name, "trip%d-cdev-name%d", i, j);
+			if (of_property_read_string(np, prop_name, &tmp_str))
+				goto err_parse_dt;
+
+			if (strlen(tmp_str) >= THERMAL_NAME_LENGTH)
+				goto err_parse_dt;
+
+			strcpy(ptrips->trip_points[i].cdev_name[j], tmp_str);
+		}
+	}
+	return ptrips;
+
+err_parse_dt:
+	dev_err(&pdev->dev, "Parsing device tree data error.\n");
+	return NULL;
+}
+#else
+static inline struct db8500_thsens_platform_data*
+		db8500_thermal_parse_dt(struct platform_device *pdev)
+{
+	return NULL;
+}
+#endif
+
+static int db8500_thermal_probe(struct platform_device *pdev)
+{
+	struct db8500_thermal_zone *pzone = NULL;
+	struct db8500_thsens_platform_data *ptrips = NULL;
+	struct device_node *np = pdev->dev.of_node;
+	int low_irq, high_irq, ret = 0;
+	unsigned long dft_low, dft_high;
+
+	if (np)
+		ptrips = db8500_thermal_parse_dt(pdev);
+	else
+		ptrips = dev_get_platdata(&pdev->dev);
+
+	if (!ptrips)
+		return -EINVAL;
+
+	pzone = devm_kzalloc(&pdev->dev, sizeof(*pzone), GFP_KERNEL);
+	if (!pzone)
+		return -ENOMEM;
+
+	mutex_init(&pzone->th_lock);
+	mutex_lock(&pzone->th_lock);
+
+	pzone->mode = THERMAL_DEVICE_DISABLED;
+	pzone->trip_tab = ptrips;
+
+	INIT_WORK(&pzone->therm_work, db8500_thermal_work);
+
+	low_irq = platform_get_irq_byname(pdev, "IRQ_HOTMON_LOW");
+	if (low_irq < 0) {
+		dev_err(&pdev->dev, "Get IRQ_HOTMON_LOW failed.\n");
+		return low_irq;
+	}
+
+	ret = devm_request_threaded_irq(&pdev->dev, low_irq, NULL,
+		prcmu_low_irq_handler, IRQF_NO_SUSPEND | IRQF_ONESHOT,
+		"dbx500_temp_low", pzone);
+	if (ret < 0) {
+		dev_err(&pdev->dev, "Failed to allocate temp low irq.\n");
+		return ret;
+	}
+
+	high_irq = platform_get_irq_byname(pdev, "IRQ_HOTMON_HIGH");
+	if (high_irq < 0) {
+		dev_err(&pdev->dev, "Get IRQ_HOTMON_HIGH failed.\n");
+		return high_irq;
+	}
+
+	ret = devm_request_threaded_irq(&pdev->dev, high_irq, NULL,
+		prcmu_high_irq_handler, IRQF_NO_SUSPEND | IRQF_ONESHOT,
+		"dbx500_temp_high", pzone);
+	if (ret < 0) {
+		dev_err(&pdev->dev, "Failed to allocate temp high irq.\n");
+		return ret;
+	}
+
+	pzone->therm_dev = thermal_zone_device_register("db8500_thermal_zone",
+		ptrips->num_trips, 0, pzone, &thdev_ops, 0, 0);
+
+	if (IS_ERR(pzone->therm_dev)) {
+		dev_err(&pdev->dev, "Register thermal zone device failed.\n");
+		return PTR_ERR(pzone->therm_dev);
+	}
+	dev_info(&pdev->dev, "Thermal zone device registered.\n");
+
+	dft_low = PRCMU_DEFAULT_LOW_TEMP;
+	dft_high = ptrips->trip_points[0].temp;
+
+	db8500_thermal_update_config(pzone, 0, THERMAL_TREND_STABLE,
+		dft_low, dft_high);
+
+	platform_set_drvdata(pdev, pzone);
+	pzone->mode = THERMAL_DEVICE_ENABLED;
+	mutex_unlock(&pzone->th_lock);
+
+	return 0;
+}
+
+static int db8500_thermal_remove(struct platform_device *pdev)
+{
+	struct db8500_thermal_zone *pzone = platform_get_drvdata(pdev);
+
+	thermal_zone_device_unregister(pzone->therm_dev);
+	cancel_work_sync(&pzone->therm_work);
+	mutex_destroy(&pzone->th_lock);
+
+	return 0;
+}
+
+static int db8500_thermal_suspend(struct platform_device *pdev,
+		pm_message_t state)
+{
+	struct db8500_thermal_zone *pzone = platform_get_drvdata(pdev);
+
+	flush_work(&pzone->therm_work);
+	prcmu_stop_temp_sense();
+
+	return 0;
+}
+
+static int db8500_thermal_resume(struct platform_device *pdev)
+{
+	struct db8500_thermal_zone *pzone = platform_get_drvdata(pdev);
+	struct db8500_thsens_platform_data *ptrips = pzone->trip_tab;
+	unsigned long dft_low, dft_high;
+
+	dft_low = PRCMU_DEFAULT_LOW_TEMP;
+	dft_high = ptrips->trip_points[0].temp;
+
+	db8500_thermal_update_config(pzone, 0, THERMAL_TREND_STABLE,
+		dft_low, dft_high);
+
+	return 0;
+}
+
+#ifdef CONFIG_OF
+static const struct of_device_id db8500_thermal_match[] = {
+	{ .compatible = "stericsson,db8500-thermal" },
+	{},
+};
+#else
+#define db8500_thermal_match NULL
+#endif
+
+static struct platform_driver db8500_thermal_driver = {
+	.driver = {
+		.owner = THIS_MODULE,
+		.name = "db8500-thermal",
+		.of_match_table = db8500_thermal_match,
+	},
+	.probe = db8500_thermal_probe,
+	.suspend = db8500_thermal_suspend,
+	.resume = db8500_thermal_resume,
+	.remove = db8500_thermal_remove,
+};
+
+module_platform_driver(db8500_thermal_driver);
+
+MODULE_AUTHOR("Hongbo Zhang <hongbo.zhang@stericsson.com>");
+MODULE_DESCRIPTION("DB8500 thermal driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/platform_data/db8500_thermal.h b/include/linux/platform_data/db8500_thermal.h
new file mode 100644
index 0000000..3bf6090
--- /dev/null
+++ b/include/linux/platform_data/db8500_thermal.h
@@ -0,0 +1,38 @@
+/*
+ * db8500_thermal.h - DB8500 Thermal Management Implementation
+ *
+ * Copyright (C) 2012 ST-Ericsson
+ * Copyright (C) 2012 Linaro Ltd.
+ *
+ * Author: Hongbo Zhang <hongbo.zhang@linaro.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _DB8500_THERMAL_H_
+#define _DB8500_THERMAL_H_
+
+#include <linux/thermal.h>
+
+#define COOLING_DEV_MAX 8
+
+struct db8500_trip_point {
+	unsigned long temp;
+	enum thermal_trip_type type;
+	char cdev_name[COOLING_DEV_MAX][THERMAL_NAME_LENGTH];
+};
+
+struct db8500_thsens_platform_data {
+	struct db8500_trip_point trip_points[THERMAL_MAX_TRIPS];
+	int num_trips;
+};
+
+#endif /* _DB8500_THERMAL_H_ */
-- 
1.7.11.3


^ permalink raw reply related
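
The interrupt handlers in db8500_thermal.c above move a low/high window across the trip table and report the midpoint of that window as a pseudo temperature. The following is only a user-space sketch of that bookkeeping, not driver code: struct toy_zone, the toy_* functions and the trip values are hypothetical stand-ins for struct db8500_thermal_zone, the two PRCMU handlers and the platform data, and the PRCMU calls are left out entirely.

```c
#include <assert.h>

/* Illustrative values only: three trips, in millidegrees Celsius. */
#define NUM_TRIPS		3
#define DEFAULT_LOW_TEMP	0UL

static const unsigned long trip_temp[NUM_TRIPS] = { 75000, 80000, 85000 };

/* Hypothetical stand-in for struct db8500_thermal_zone. */
struct toy_zone {
	unsigned int cur_index;		/* high limit is trip_temp[cur_index] */
	unsigned long low, high;	/* window that would go to the sensor */
	unsigned long cur_temp_pseudo;	/* midpoint reported as temperature */
};

/* Counterpart of db8500_thermal_update_config(), minus the PRCMU calls. */
static void toy_set_window(struct toy_zone *z, unsigned int idx,
			   unsigned long low, unsigned long high)
{
	assert(low <= high);
	z->cur_index = idx;
	z->low = low;
	z->high = high;
	z->cur_temp_pseudo = (low + high) / 2;
}

/* Initial state, as set up at the end of probe and in resume. */
static void toy_zone_init(struct toy_zone *z)
{
	toy_set_window(z, 0, DEFAULT_LOW_TEMP, trip_temp[0]);
}

/* High limit crossed: slide the window up by one trip if possible. */
static void toy_high_event(struct toy_zone *z)
{
	unsigned int idx = z->cur_index;

	if (idx < NUM_TRIPS - 1)
		toy_set_window(z, idx + 1, trip_temp[idx], trip_temp[idx + 1]);
	else
		/* Already at the last trip: report just above it. */
		z->cur_temp_pseudo = trip_temp[idx] + 1;
}

/* Low limit crossed: slide the window down by one trip if possible. */
static void toy_low_event(struct toy_zone *z)
{
	unsigned int idx = z->cur_index;

	if (idx == 0)
		return;		/* nothing below the first window */

	if (idx == 1)
		toy_set_window(z, 0, DEFAULT_LOW_TEMP, trip_temp[0]);
	else
		toy_set_window(z, idx - 1, trip_temp[idx - 2],
			       trip_temp[idx - 1]);
}
```

Calling toy_zone_init() and then toy_high_event() repeatedly walks the window from 0..75000 up to 80000..85000. Once the last trip is reached, only the pseudo temperature changes, which appears to be how the critical trip gets reported to the thermal framework in the patch.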

* Re: [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Daniel Wagner @ 2012-11-09 11:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	Glauber Costa
In-Reply-To: <20121108190715.GD9672-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

Hi Tejun,

On 08.11.2012 20:07, Tejun Heo wrote:
 > Subject: cgroup: add cgroup_subsys->post_create()
 >
 > Currently, there's no way for a controller to find out whether a new
 > cgroup finished all ->create() allocations successfully and is
 > considered "live" by cgroup.

I'd like to add hierarchy support to net_prio, and the first thing to
do is to get rid of get_prioidx(). It would be nice to be able to use
use_id and post_create() for this, but as I read the code this might
not work, because the netdev might access resources allocated between
create() and post_create(). So my question is: would it make sense to
move

cgroup_create():

		if (ss->use_id) {
			err = alloc_css_id(ss, parent, cgrp);
			if (err)
				goto err_destroy;
		}

part before create(), or to add some protection between the create() and
post_create() callbacks in net_prio? I have a patch, but I see
I could drop it completely if post_create() is there.

cheers,
daniel


 From 84fbbdf0dc5d3460389e39a00a3ee553ee55b563 Mon Sep 17 00:00:00 2001
From: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
Date: Thu, 8 Nov 2012 17:17:21 +0100
Subject: [PATCH] cgroups: net_prio: Use IDR library to assign netprio index

get_prioidx() allocated a new id whenever it was called. put_prioidx()
gave an id back. get_prioidx() could then reallocate the id later on.
That is exactly what the IDR library offers, so let's use it.

Signed-off-by: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
---
  net/core/netprio_cgroup.c | 51 +++++++++------------------------------------------
  1 file changed, 9 insertions(+), 42 deletions(-)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 847c02b..3c1b612 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -27,10 +27,7 @@

  #include <linux/fdtable.h>

-#define PRIOIDX_SZ 128
-
-static unsigned long prioidx_map[PRIOIDX_SZ];
-static DEFINE_SPINLOCK(prioidx_map_lock);
+static DEFINE_IDA(netprio_ida);
  static atomic_t max_prioidx = ATOMIC_INIT(0);

  static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
@@ -39,34 +36,6 @@ static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgr
  			    struct cgroup_netprio_state, css);
  }

-static int get_prioidx(u32 *prio)
-{
-	unsigned long flags;
-	u32 prioidx;
-
-	spin_lock_irqsave(&prioidx_map_lock, flags);
-	prioidx = find_first_zero_bit(prioidx_map, sizeof(unsigned long) * PRIOIDX_SZ);
-	if (prioidx == sizeof(unsigned long) * PRIOIDX_SZ) {
-		spin_unlock_irqrestore(&prioidx_map_lock, flags);
-		return -ENOSPC;
-	}
-	set_bit(prioidx, prioidx_map);
-	if (atomic_read(&max_prioidx) < prioidx)
-		atomic_set(&max_prioidx, prioidx);
-	spin_unlock_irqrestore(&prioidx_map_lock, flags);
-	*prio = prioidx;
-	return 0;
-}
-
-static void put_prioidx(u32 idx)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&prioidx_map_lock, flags);
-	clear_bit(idx, prioidx_map);
-	spin_unlock_irqrestore(&prioidx_map_lock, flags);
-}
-
  static int extend_netdev_table(struct net_device *dev, u32 new_len)
  {
  	size_t new_size = sizeof(struct netprio_map) +
@@ -120,9 +89,9 @@ static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
  	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
  		goto out;

-	ret = get_prioidx(&cs->prioidx);
-	if (ret < 0) {
-		pr_warn("No space in priority index array\n");
+	ret = ida_simple_get(&netprio_ida, 0, 0, GFP_KERNEL);
+	cs->prioidx = ret;
+	if (ret < 0) {
  		goto out;
  	}

@@ -146,7 +115,7 @@ static void cgrp_destroy(struct cgroup *cgrp)
  			map->priomap[cs->prioidx] = 0;
  	}
  	rtnl_unlock();
-	put_prioidx(cs->prioidx);
+	ida_simple_remove(&netprio_ida, cs->prioidx);
  	kfree(cs);
  }

@@ -284,12 +253,10 @@ struct cgroup_subsys net_prio_subsys = {
  	.module		= THIS_MODULE,

  	/*
-	 * net_prio has artificial limit on the number of cgroups and
-	 * disallows nesting making it impossible to co-mount it with other
-	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
-	 * limit and properly nest configuration such that children follow
-	 * their parents' configurations by default and are allowed to
-	 * override and remove the following.
+	 * net_prio disallows nesting, making it impossible to co-mount it
+	 * with other hierarchical subsystems.  Properly nest configuration
+	 * such that children follow their parents' configurations by
+	 * default and are allowed to override and remove the following.
  	 */
  	.broken_hierarchy = true,
  };
-- 
1.7.11.7

^ permalink raw reply related
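
The commit message above describes the allocator contract: hand out an id, take it back, and let a later request reuse it. As a rough user-space illustration of that contract only (this is not the kernel's IDA implementation; struct toy_ida, toy_ida_get(), toy_ida_remove() and the 128-id limit are all made up for the sketch), a lowest-free-id bitmap allocator could look like this:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical fixed capacity; the real IDA grows on demand. */
#define TOY_IDS		128
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS	((TOY_IDS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct toy_ida {
	unsigned long map[TOY_WORDS];	/* one bit per id, set = in use */
};

/*
 * Return the lowest unused id and mark it used, or -1 if none is left.
 * The result is a signed int so that the error case stays distinguishable;
 * callers should test it before storing it in an unsigned field.
 */
static int toy_ida_get(struct toy_ida *ida)
{
	size_t id;

	for (id = 0; id < TOY_IDS; id++) {
		unsigned long *word = &ida->map[id / BITS_PER_WORD];
		unsigned long bit = 1UL << (id % BITS_PER_WORD);

		if (!(*word & bit)) {
			*word |= bit;
			return (int)id;
		}
	}

	return -1;
}

/* Give an id back so that a later toy_ida_get() can hand it out again. */
static void toy_ida_remove(struct toy_ida *ida, int id)
{
	assert(id >= 0 && id < TOY_IDS);
	ida->map[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
}
```

With a zero-initialized struct toy_ida, three calls to toy_ida_get() yield 0, 1 and 2, and after toy_ida_remove(&ida, 1) the next call hands out 1 again.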

* Re: [PATCH 3/9 v2] cgroup: implement generic child / descendant walk macros
From: Li Zefan @ 2012-11-09  9:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: rjw-KKrjLPT3xs0, linux-pm-u79uwXL29TY76Z2rM5mHXA,
	fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20121108175946.GA9672-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On 2012/11/9 1:59, Tejun Heo wrote:
> Currently, cgroup doesn't provide any generic helper for walking a
> given cgroup's children or descendants.  This patch adds the following
> three macros.
> 
> * cgroup_for_each_child() - walk immediate children of a cgroup.
> 
> * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
>   in pre-order tree traversal.
> 
> * cgroup_for_each_descendant_post() - visit all descendants of a
>   cgroup in post-order tree traversal.
> 
> All three only require the user to hold RCU read lock during
> traversal.  Verifying that each iterated cgroup is online is the
> responsibility of the user.  When used with proper synchronization,
> cgroup_for_each_descendant_pre() can be used to propagate state
> updates to descendants in reliable way.  See comments for details.
> 
> v2: s/config/state/ in commit message and comments per Michal.  More
>     documentation on synchronization rules.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-c2VE3kR2zFovtab9mdV7tw@public.gmane.org>
> Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

I don't see anything wrong with the comment on cgroup_next_descendant_pre().

Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

^ permalink raw reply
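
For readers unfamiliar with the two traversal orders named in the quoted description, here is a small stand-alone sketch. It assumes a plain first-child/next-sibling tree and does no RCU or locking at all; struct node, add_child(), next_descendant_pre() and next_descendant_post() are hypothetical names for the illustration and are not the cgroup helpers from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node; the real code walks struct cgroup lists. */
struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
	int id;
};

/* Append child at the tail of parent's children list. */
static void add_child(struct node *parent, struct node *child)
{
	struct node **link = &parent->first_child;

	assert(parent && child);
	while (*link)
		link = &(*link)->next_sibling;
	child->parent = parent;
	child->next_sibling = NULL;
	*link = child;
}

/*
 * Pre-order: a node is visited before its children.  Pass pos == NULL to
 * start; root itself is never returned.
 */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	if (!pos)
		pos = root;

	/* Descend first. */
	if (pos->first_child)
		return pos->first_child;

	/* Otherwise climb until some ancestor has an unvisited sibling. */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}

	return NULL;
}

static struct node *leftmost_leaf(struct node *pos)
{
	while (pos->first_child)
		pos = pos->first_child;
	return pos;
}

/* Post-order: a node is visited after all of its children. */
static struct node *next_descendant_post(struct node *pos, struct node *root)
{
	if (!pos) {
		struct node *first = leftmost_leaf(root);

		return first != root ? first : NULL;
	}

	/* Finish the next sibling's subtree before returning to the parent. */
	if (pos->next_sibling)
		return leftmost_leaf(pos->next_sibling);

	return pos->parent != root ? pos->parent : NULL;
}
```

Pre-order suits pushing state from parents down to children, since a parent is always seen before its descendants; post-order suits teardown or aggregation, where children have to be handled first.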

* Re: [PATCH 2/9] cgroup: Use rculist ops for cgroup->children
From: Li Zefan @ 2012-11-09  9:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: rjw-KKrjLPT3xs0, linux-pm-u79uwXL29TY76Z2rM5mHXA,
	fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1351931915-1701-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On 2012/11/3 16:38, Tejun Heo wrote:
> Use RCU safe list operations for cgroup->children.  This will be used
> to implement cgroup children / descendant walking which can be used by
> controllers.
> 
> Note that cgroup_create() now puts a new cgroup at the end of the
> ->children list instead of head.  This isn't strictly necessary but is
> done so that the iteration order is more conventional.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

^ permalink raw reply

* Re: [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Li Zefan @ 2012-11-09  9:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mhocko, rjw, containers, cgroups, linux-kernel, linux-pm,
	fweisbec, Glauber Costa
In-Reply-To: <20121108190715.GD9672@htj.dyndns.org>

On 2012/11/9 3:07, Tejun Heo wrote:
> Subject: cgroup: add cgroup_subsys->post_create()
> 
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocations successfully and is
> considered "live" by cgroup.
> 
> This becomes a problem later when we add generic descendants walking
> to cgroup which can be used by controllers as controllers don't have a
> synchronization point where it can synchronize against new cgroups
> appearing in such walks.
> 
> This patch adds ->post_create().  It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
> 
> When used in combination with the to-be-added generic descendant
> iterators, ->post_create() can be used to implement reliable state
> inheritance.  It will be explained with the descendant iterators.
> 
> v2: Added a paragraph about its future use w/ descendant iterators per
>     Michal.
> 
> v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
>     Fixed.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Glauber Costa <glommer@parallels.com>

Acked-by: Li Zefan <lizefan@huawei.com>
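
As a rough illustration of the ordering the quoted description promises (all
->create() calls succeed, the cgroup is linked, then ->post_create() runs),
here is a minimal userspace sketch.  It is not the kernel implementation:
struct controller, struct group, group_create() and the demo_* callbacks are
hypothetical names, and the "frozen" flag is just an assumed example of state
a controller might want to inherit from the parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct group;

/* Hypothetical controller callbacks, loosely modelled on cgroup_subsys. */
struct controller {
	int  (*create)(struct group *g);	/* allocate per-group state */
	void (*post_create)(struct group *g);	/* runs once g is linked */
};

struct group {
	struct group *parent;
	bool linked;	/* true once visible in the hierarchy */
	bool frozen;	/* example state inherited from the parent */
};

static int demo_create(struct group *g)
{
	g->frozen = false;	/* default state; no inheritance yet */
	return 0;
}

static void demo_post_create(struct group *g)
{
	/* By contract the group is already linked when this runs. */
	assert(g->linked);
	if (g->parent)
		g->frozen = g->parent->frozen;
}

static const struct controller demo_ctrl = {
	.create = demo_create,
	.post_create = demo_post_create,
};

/* Create a group: every create() first, then link, then post_create(). */
static int group_create(struct group *g, struct group *parent,
			const struct controller *ctrls, size_t n)
{
	size_t i;

	g->parent = parent;
	g->linked = false;
	for (i = 0; i < n; i++) {
		int err = ctrls[i].create(g);

		if (err)
			return err;	/* post_create() is never reached */
	}
	g->linked = true;
	for (i = 0; i < n; i++)
		if (ctrls[i].post_create)
			ctrls[i].post_create(g);
	return 0;
}
```

Calling group_create() for a child whose parent has frozen set leaves the
child frozen as well, because demo_post_create() copies the parent's state
only after the child is linked.  Real controllers would also need locking
against concurrent walkers, which this sketch leaves out.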


^ permalink raw reply

* Re: [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Li Zefan @ 2012-11-09  9:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mhocko-AlSwsSmVLrQ, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	Glauber Costa
In-Reply-To: <20121108190715.GD9672-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On 2012/11/9 3:07, Tejun Heo wrote:
> Subject: cgroup: add cgroup_subsys->post_create()
> 
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocations successfully and is
> considered "live" by cgroup.
> 
> This becomes a problem later when we add generic descendants walking
> to cgroup which can be used by controllers, as controllers don't have a
> synchronization point where they can synchronize against new cgroups
> appearing in such walks.
> 
> This patch adds ->post_create().  It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
> 
> When used in combination with the to-be-added generic descendant
> iterators, ->post_create() can be used to implement reliable state
> inheritance.  It will be explained with the descendant iterators.
> 
> v2: Added a paragraph about its future use w/ descendant iterators per
>     Michal.
> 
> v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
>     Fixed.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Cc: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

^ permalink raw reply
