Linux Power Management development

Linux Power Management development
 help / color / mirror / Atom feed

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: wujianguo @ 2012-12-04 10:51 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	gargankita, amit.kachhap, svaidy, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <20121106195026.6941.24662.stgit@srivatsabhat.in.ibm.com>

Hi Srivatsa,

I applied this patchset, and run genload(from LTP) test: numactl --membind=1 ./genload -m 100,
then got a "general protection fault", and system was going to reboot.

If I revert [RFC PATCH 7/8], and run this test again, genload will be killed due to OOM,
but the system is OK, no coredump.

ps: node1 has 8G memory.

[ 3647.020666] general protection fault: 0000 [#1] SMP
[ 3647.026232] Modules linked in: edd cpufreq_conservative cpufreq_userspace cpu
freq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_
intel ixgbe ipv6 i7core_edac igb iTCO_wdt i2c_i801 iTCO_vendor_support ioatdma e
dac_core tpm_tis joydev lpc_ich i2c_core microcode mfd_core rtc_cmos pcspkr sr_m
od tpm sg dca hid_generic mdio tpm_bios cdrom button ext3 jbd mbcache usbhid hid
 uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hw
mon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_
piix libata megaraid_sas scsi_mod
[ 3647.084565] CPU 19
[ 3647.086709] Pid: 33708, comm: genload Not tainted 3.7.0-rc7-mem-region+ #11 Q
CI QSSC-S4R/QSSC-S4R
[ 3647.096799] RIP: 0010:[<ffffffff8110979c>]  [<ffffffff8110979c>] add_to_freel
ist+0x8c/0x100
[ 3647.106125] RSP: 0000:ffff880a7f6c3e58  EFLAGS: 00010086
[ 3647.112042] RAX: dead000000200200 RBX: 0000000000000001 RCX: 0000000000000000

[ 3647.119990] RDX: ffffea001211a3a0 RSI: ffffea001211ffa0 RDI: 0000000000000001

[ 3647.127936] RBP: ffff880a7f6c3e58 R08: ffff88067ff6d240 R09: ffff88067ff6b180

[ 3647.135884] R10: 0000000000000002 R11: 0000000000000001 R12: 00000000000007fe

[ 3647.143831] R13: 0000000000000001 R14: 0000000000000001 R15: ffffea001211ff80

[ 3647.151778] FS:  00007f0b2a674700(0000) GS:ffff880a7f6c0000(0000) knlGS:00000
00000000000
[ 3647.160790] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3647.167188] CR2: 00007f0b1a000000 CR3: 0000000484723000 CR4: 00000000000007e0

[ 3647.175136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 3647.183083] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[ 3647.191030] Process genload (pid: 33708, threadinfo ffff8806852bc000, task ff
ff880688288000)
[ 3647.200428] Stack:
[ 3647.202667]  ffff880a7f6c3f08 ffffffff8110e9c0 ffff88067ff66100 0000000000000
7fe
[ 3647.210954]  ffff880a7f6d5bb0 0000000000000030 0000000000002030 ffff88067ff66
168
[ 3647.219244]  0000000000000002 ffff880a7f6d5b78 0000000e88288000 ffff88067ff66
100
[ 3647.227530] Call Trace:
[ 3647.230252]  <IRQ>
[ 3647.232394]  [<ffffffff8110e9c0>] free_pcppages_bulk+0x350/0x450
[ 3647.239297]  [<ffffffff8110f0d0>] ? drain_pages+0xd0/0xd0
[ 3647.245313]  [<ffffffff8110f0c3>] drain_pages+0xc3/0xd0
[ 3647.251135]  [<ffffffff8110f0e6>] drain_local_pages+0x16/0x20
[ 3647.257540]  [<ffffffff810a3bce>] generic_smp_call_function_interrupt+0xae/0x
260
[ 3647.265783]  [<ffffffff810282c7>] smp_call_function_interrupt+0x27/0x40
[ 3647.273156]  [<ffffffff8147f272>] call_function_interrupt+0x72/0x80
[ 3647.280136]  <EOI>
[ 3647.282278]  [<ffffffff81077936>] ? mutex_spin_on_owner+0x76/0xa0
[ 3647.289292]  [<ffffffff81473116>] __mutex_lock_slowpath+0x66/0x180
[ 3647.296181]  [<ffffffff8113afe7>] ? try_to_unmap_one+0x277/0x440
[ 3647.302872]  [<ffffffff81472b93>] mutex_lock+0x23/0x40
[ 3647.308595]  [<ffffffff8113b657>] rmap_walk+0x137/0x240
[ 3647.314417]  [<ffffffff8115c230>] ? get_page+0x40/0x40
[ 3647.320133]  [<ffffffff8115d036>] move_to_new_page+0xb6/0x110
[ 3647.326526]  [<ffffffff8115d452>] __unmap_and_move+0x192/0x230
[ 3647.333023]  [<ffffffff8115d612>] unmap_and_move+0x122/0x140
[ 3647.339328]  [<ffffffff8115d6c9>] migrate_pages+0x99/0x150
[ 3647.345433]  [<ffffffff81129f10>] ? isolate_freepages+0x220/0x220
[ 3647.352220]  [<ffffffff8112ace2>] compact_zone+0x2f2/0x5d0
[ 3647.358332]  [<ffffffff8112b4a0>] try_to_compact_pages+0x180/0x240
[ 3647.365218]  [<ffffffff8110f1e7>] __alloc_pages_direct_compact+0x97/0x200
[ 3647.372780]  [<ffffffff810a45a3>] ? on_each_cpu_mask+0x63/0xb0
[ 3647.379279]  [<ffffffff8110f84f>] __alloc_pages_slowpath+0x4ff/0x780
[ 3647.386349]  [<ffffffff8110fbf1>] __alloc_pages_nodemask+0x121/0x180
[ 3647.393430]  [<ffffffff811500d6>] alloc_pages_vma+0xd6/0x170
[ 3647.399737]  [<ffffffff81162198>] do_huge_pmd_anonymous_page+0x148/0x210
[ 3647.407203]  [<ffffffff81132f6b>] handle_mm_fault+0x33b/0x340
[ 3647.413609]  [<ffffffff814799d3>] __do_page_fault+0x2a3/0x4e0
[ 3647.420017]  [<ffffffff8126316a>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3647.427290]  [<ffffffff81479c1e>] do_page_fault+0xe/0x10
[ 3647.433208]  [<ffffffff81475f68>] page_fault+0x28/0x30
[ 3647.438921] Code: 8d 78 01 48 89 f8 48 c1 e0 04 49 8d 04 00 48 8b 50 08 48 83
 40 10 01 48 85 d2 74 1b 48 8b 42 08 48 89 72 08 48 89 16 48 89 46 08 <48> 89 30
 c9 c3 0f 1f 80 00 00 00 00 4d 3b 00 74 4b 83 e9 01 79
[ 3647.460607] RIP  [<ffffffff8110979c>] add_to_freelist+0x8c/0x100
[ 3647.467308]  RSP <ffff880a7f6c3e58>
[    0.000000] Linux version 3.7.0-rc7-mem-region+ (root@linux-intel) (gcc versi
on 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #11 SMP Tue Dec 4 15:23
:15 CST 2012
.

Thanks,
Jianguo Wu

On 2012-11-7 3:52, Srivatsa S. Bhat wrote:
> Hi,
> 
> This is an alternative design for Memory Power Management, developed based on
> some of the suggestions[1] received during the review of the earlier patchset
> ("Hierarchy" design) on Memory Power Management[2]. This alters the buddy-lists
> to keep them region-sorted, and is hence identified as the "Sorted-buddy" design.
> 
> One of the key aspects of this design is that it avoids the zone-fragmentation
> problem that was present in the earlier design[3].
> 
> 
> Quick overview of Memory Power Management and Memory Regions:
> ------------------------------------------------------------
> 
> Today memory subsystems are offer a wide range of capabilities for managing
> memory power consumption. As a quick example, if a block of memory is not
> referenced for a threshold amount of time, the memory controller can decide to
> put that chunk into a low-power content-preserving state. And the next
> reference to that memory chunk would bring it back to full power for read/write.
> With this capability in place, it becomes important for the OS to understand
> the boundaries of such power-manageable chunks of memory and to ensure that
> references are consolidated to a minimum number of such memory power management
> domains.
> 
> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> the firmware can expose information regarding the boundaries of such memory
> power management domains to the OS in a standard way.
> 
> How can Linux VM help memory power savings?
> 
> o Consolidate memory allocations and/or references such that they are
> not spread across the entire memory address space.  Basically area of memory
> that is not being referenced, can reside in low power state.
> 
> o Support targeted memory reclaim, where certain areas of memory that can be
> easily freed can be offlined, allowing those areas of memory to be put into
> lower power states.
> 
> Memory Regions:
> ---------------
> 
> "Memory Regions" is a way of capturing the boundaries of power-managable
> chunks of memory, within the MM subsystem.
> 
> 
> Short description of the "Sorted-buddy" design:
> -----------------------------------------------
> 
> In this design, the memory region boundaries are captured in a parallel
> data-structure instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions. (The freelists need not be fully
> address-sorted, they just need to be region-sorted. Patch 6 explains this
> in more detail).
> 
> The idea is to do page allocation in increasing order of memory regions
> (within a zone) and perform page reclaim in the reverse order, as illustrated
> below.
> 
> ---------------------------- Increasing region number---------------------->
> 
> Direction of allocation--->                         <---Direction of reclaim
> 
> 
> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> lies in the page-free path and not the page-allocation path and hence the
> critical page allocation paths remain fast. Moreover, the heart of the page
> allocation algorithm itself remains largely unchanged, and the region-related
> data-structures are optimized to avoid unnecessary updates during the
> page-allocator's runtime.
> 
> Advantages of this design:
> --------------------------
> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
>    hence we avoid its associated problems (like too many zones, extra page
>    reclaim threads, question of choosing watermarks etc).
>    [This is an advantage over the "Hierarchy" design]
> 
> 2. Performance overhead is expected to be low: Since we retain the simplicity
>    of the algorithm in the page allocation path, page allocation can
>    potentially remain as fast as it would be without memory regions. The
>    overhead is pushed to the page-freeing paths which are not that critical.
> 
> 
> Results:
> =======
> 
> Test setup:
> -----------
> This patchset applies cleanly on top of 3.7-rc3.
> 
> x86 dual-socket quad core HT-enabled machine booted with mem=8G
> Memory region size = 512 MB
> 
> Functional testing:
> -------------------
> 
> Ran pagetest, a simple C program that allocates and touches a required number
> of pages.
> 
> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
> of allocations from pagetest.
> 
> 	     Present pages   |	Free pages at various allocations        |
> 			     |  start	|  512 MB  |  1024 MB | 2048 MB  |
>   Region 0      16	     |   0      |    0     |     0    |    0     |
>   Region 1      131072       |  87219   |  8066    |   7892   |  7387    |
>   Region 2      131072       | 131072   |  79036   |     0    |    0     |
>   Region 3      131072       | 131072   | 131072   |   79061  |    0     |
>   Region 4      131072       | 131072   | 131072   |  131072  |    0     |
>   Region 5      131072       | 131072   | 131072   |  131072  |  79051   |
>   Region 6      131072       | 131072   | 131072   |  131072  |  131072  |
>   Region 7      131072       | 131072   | 131072   |  131072  |  131072  |
>   Region 8      131056       | 105475   | 105472   |  105472  |  105472  |
> 
> This shows that page allocation occurs in the order of increasing region
> numbers, as intended in this design.
> 
> Performance impact:
> -------------------
> 
> Kernbench results didn't show much of a difference between the performance
> of vanilla 3.7-rc3 and this patchset.
> 
> 
> Todos:
> =====
> 
> 1. Memory-region aware page-reclamation:
> ----------------------------------------
> 
> We would like to do page reclaim in the reverse order of page allocation
> within a zone, ie., in the order of decreasing region numbers.
> To achieve that, while scanning lru pages to reclaim, we could potentially
> look for pages belonging to higher regions (considering region boundaries)
> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
> reclaim candidates.
> 
> 2. Compile-time exclusion of Memory Power Management, and extending the
> support to also work with other features such as Mem cgroups, kexec etc.
> 
> References:
> ----------
> 
> [1]. Review comments suggesting modifying the buddy allocator to be aware of
>      memory regions:
>      http://article.gmane.org/gmane.linux.power-management.general/24862
>      http://article.gmane.org/gmane.linux.power-management.general/25061
>      http://article.gmane.org/gmane.linux.kernel.mm/64689
> 
> [2]. Patch series that implemented the node-region-zone hierarchy design:
>      http://lwn.net/Articles/445045/
>      http://thread.gmane.org/gmane.linux.kernel.mm/63840
> 
>      Summary of the discussion on that patchset:
>      http://article.gmane.org/gmane.linux.power-management.general/25061
> 
>      Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
>      http://thread.gmane.org/gmane.linux.kernel.mm/89202
> 
> [3]. Disadvantages of having memory regions in the hierarchy between nodes and
>      zones:
>      http://article.gmane.org/gmane.linux.kernel.mm/63849
> 
> [4]. Estimate of potential power savings on Samsung exynos board
>      http://article.gmane.org/gmane.linux.kernel.mm/65935
> 
> [5]. ACPI 5.0 and MPST support
>      http://www.acpi.info/spec.htm
>      Section 5.2.21 Memory Power State Table (MPST)
> 
>  Srivatsa S. Bhat (8):
>       mm: Introduce memory regions data-structure to capture region boundaries within node
>       mm: Initialize node memory regions during boot
>       mm: Introduce and initialize zone memory regions
>       mm: Add helpers to retrieve node region and zone region for a given page
>       mm: Add data-structures to describe memory regions within the zones' freelists
>       mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
>       mm: Add an optimized version of del_from_freelist to keep page allocation fast
>       mm: Print memory region statistics to understand the buddy allocator behavior
> 
> 
>   include/linux/mm.h     |   38 +++++++
>  include/linux/mmzone.h |   52 +++++++++
>  mm/compaction.c        |    8 +
>  mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
>  mm/vmstat.c            |   59 ++++++++++-
>  5 files changed, 390 insertions(+), 30 deletions(-)
> 
> 
> Thanks,
> Srivatsa S. Bhat
> IBM Linux Technology Center
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply

* [PATCH 2/2] cpufreq: db8500: set CPUFREQ_CONST_LOOPS
From: Fabio Baltieri @ 2012-12-04 10:10 UTC (permalink / raw)
  To: John Stultz, Arnd Bergmann, Thomas Gleixner, Rafael J. Wysocki
  Cc: cpufreq, linux-pm, Linus Walleij, linux-arm-kernel, linux-kernel,
	Fabio Baltieri
In-Reply-To: <1354615845-2758-1-git-send-email-fabio.baltieri@linaro.org>

As ux500 is being converted to timer based delay loops, and the timer
used is not depending on CPUs clock frequency, set cpufreq_driver flag
CPUFREQ_CONST_LOOPS to prevent cpufreq rescaling loops_for_jiffies.

Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---
 drivers/cpufreq/db8500-cpufreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/db8500-cpufreq.c b/drivers/cpufreq/db8500-cpufreq.c
index 74b830b..e758891 100644
--- a/drivers/cpufreq/db8500-cpufreq.c
+++ b/drivers/cpufreq/db8500-cpufreq.c
@@ -150,7 +150,7 @@ static int __cpuinit db8500_cpufreq_init(struct cpufreq_policy *policy)
 }
 
 static struct cpufreq_driver db8500_cpufreq_driver = {
-	.flags  = CPUFREQ_STICKY,
+	.flags  = CPUFREQ_STICKY | CPUFREQ_CONST_LOOPS,
 	.verify = db8500_cpufreq_verify_speed,
 	.target = db8500_cpufreq_target,
 	.get    = db8500_cpufreq_getspeed,
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH 1/2] clocksource: nomadik-mtu: support timer-based delay
From: Fabio Baltieri @ 2012-12-04 10:10 UTC (permalink / raw)
  To: John Stultz, Arnd Bergmann, Thomas Gleixner, Rafael J. Wysocki
  Cc: linux-pm, Fabio Baltieri, Linus Walleij, linux-kernel, cpufreq,
	linux-arm-kernel
In-Reply-To: <1354615845-2758-1-git-send-email-fabio.baltieri@linaro.org>

This patch adds support to use Nomadik MTU for timer-based delay.

Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---
 drivers/clocksource/nomadik-mtu.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/clocksource/nomadik-mtu.c b/drivers/clocksource/nomadik-mtu.c
index 8914c3c..5f3c8db 100644
--- a/drivers/clocksource/nomadik-mtu.c
+++ b/drivers/clocksource/nomadik-mtu.c
@@ -15,6 +15,7 @@
 #include <linux/clocksource.h>
 #include <linux/clk.h>
 #include <linux/jiffies.h>
+#include <linux/delay.h>
 #include <linux/err.h>
 #include <linux/platform_data/clocksource-nomadik-mtu.h>
 #include <asm/mach/time.h>
@@ -64,6 +65,7 @@ static void __iomem *mtu_base;
 static bool clkevt_periodic;
 static u32 clk_prescale;
 static u32 nmdk_cycle;		/* write-once */
+static struct delay_timer mtu_delay_timer;
 
 #ifdef CONFIG_NOMADIK_MTU_SCHED_CLOCK
 /*
@@ -80,6 +82,11 @@ static u32 notrace nomadik_read_sched_clock(void)
 }
 #endif
 
+static unsigned long nmdk_timer_read_current_timer(void)
+{
+	return ~readl_relaxed(mtu_base + MTU_VAL(0));
+}
+
 /* Clockevent device: use one-shot mode */
 static int nmdk_clkevt_next(unsigned long evt, struct clock_event_device *ev)
 {
@@ -227,4 +234,8 @@ void __init nmdk_timer_init(void __iomem *base, int irq)
 	setup_irq(irq, &nmdk_timer_irq);
 	nmdk_clkevt.cpumask = cpumask_of(0);
 	clockevents_config_and_register(&nmdk_clkevt, rate, 2, 0xffffffffU);
+
+	mtu_delay_timer.read_current_timer = &nmdk_timer_read_current_timer;
+	mtu_delay_timer.freq = rate;
+	register_current_timer_delay(&mtu_delay_timer);
 }
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH 0/2] clocksource: nomadik-mtu: support timer-based delay
From: Fabio Baltieri @ 2012-12-04 10:10 UTC (permalink / raw)
  To: John Stultz, Arnd Bergmann, Thomas Gleixner, Rafael J. Wysocki
  Cc: cpufreq, linux-pm, Linus Walleij, linux-arm-kernel, linux-kernel,
	Fabio Baltieri

Hi all,

this implements timer-based delay support for nomadik and ux500
platforms, using the MTU as time source, and marks the u8500 cpufreq
driver as CPUFREQ_CONST_LOOPS accordingly.

The patches are based on Arnd's arm-soc/ux500/mtu-clk branch, as that
contains latest MTU driver developments, including a driver move/rename,
but I can rebase if necessary.

Thanks,
Fabio

Fabio Baltieri (2):
  clocksource: nomadik-mtu: support timer-based delay
  cpufreq: db8500: set CPUFREQ_CONST_LOOPS

 drivers/clocksource/nomadik-mtu.c | 11 +++++++++++
 drivers/cpufreq/db8500-cpufreq.c  |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

-- 
1.7.12.1

^ permalink raw reply

* RE: [PATCH RFC] PM/Devfreq: Add Exynos5-bus devfreq driver for Exynos5250.
From: Jonghwan Choi @ 2012-12-04  9:36 UTC (permalink / raw)
  To: 'Abhilash Kesavan', myungjoo.ham, kyungmin.park, rjw,
	linux-kernel, linux-pm
  Cc: kgene.kim
In-Reply-To: <1354264973-11214-1-git-send-email-a.kesavan@samsung.com>


Hi Abhilash Kesavan.

I compiled in 3.7-rc8
I got a compile error & warning.

Compile error.

CC      drivers/devfreq/exynos5_ppmu.o
drivers/devfreq/exynos5_ppmu.c:56:14: error: 'S5P_VA_PPMU_DDR_C' undeclared
here (not in a function)
drivers/devfreq/exynos5_ppmu.c:59:14: error: 'S5P_VA_PPMU_DDR_R1' undeclared
here (not in a function)
drivers/devfreq/exynos5_ppmu.c:62:13: error: 'S5P_VA_PPMU_DDR_L' undeclared
here (not in a function)
drivers/devfreq/exynos5_ppmu.c:65:13: error: 'S5P_VA_PPMU_RIGHT' undeclared
here (not in a function)
drivers/devfreq/exynos5_ppmu.c:68:14: error: 'S5P_VA_PPMU_CPU' undeclared
here (not in a function)
make[2]: *** [drivers/devfreq/exynos5_ppmu.o] Error 1



> +	data->devfreq = devfreq_add_device(dev,
> &exynos5_devfreq_int_profile,
> +					   "simple_ondemand",
> +					   &exynos5_int_ondemand_data);

drivers/devfreq/exynos5_bus.c: In function 'exynos5_busfreq_int_probe':
drivers/devfreq/exynos5_bus.c:388:9: warning: passing argument 3 of
'devfreq_add_device' from incompatible pointer type [enabled by default]
include/linux/devfreq.h:170:24: note: expected 'struct devfreq_governor
const *' but argument is of type 'char *'

include/linux/devfreq.h
extern struct devfreq *devfreq_add_device(struct device *dev,
                                  struct devfreq_dev_profile *profile,
                                  const struct devfreq_governor *governor,
                                  void *data);
Need change?


> +
> +	if (IS_ERR(data->devfreq)) {
> +		err = PTR_ERR(data->devfreq);
> +		goto err_devfreq_add;
> +	}


> +static int exynos5_busfreq_int_target(struct device *dev, unsigned long
> *_freq,
> +			      u32 flags)
> +{
> +	int err = 0;
> +	struct platform_device *pdev = container_of(dev, struct
> platform_device,
...

> +
> +	err = clk_set_rate(data->int_clk, freq * 1000);

How to change the INT clock? I think you have to change
arch/arm/mach-exynos/clock-exynos5250.c
(If you want to control via clock framework.)

Like this.

static struct clk exynos5_int_clk = {
        .name           = "int_clk",
        .id             = -1,
};

static struct clk_ops exynos5_clk_int_ops = {
        .get_rate = exynos5_clk_int_get_rate,
        .set_rate = exynos5_clk_int_set_rate
};
exynos5_int_clk.ops = &exynos5_clk_int_ops;

> +
> +	if (err)
> +		goto out;
> +


Thanks~

Best Regards.



> -----Original Message-----
> From: Abhilash Kesavan [mailto:a.kesavan@samsung.com]
> Sent: Friday, November 30, 2012 5:43 PM
> To: myungjoo.ham@samsung.com; kyungmin.park@samsung.com; rjw@sisk.pl;
> linux-kernel@vger.kernel.org; linux-pm@vger.kernel.org
> Cc: kgene.kim@samsung.com; jhbird.choi@samsung.com
> Subject: [PATCH RFC] PM/Devfreq: Add Exynos5-bus devfreq driver for
> Exynos5250.
> 
> Exynos5-bus device devfreq driver monitors PPMU counters and
> adjusts operating frequencies and voltages with OPP. ASV should
> be used to provide appropriate voltages as per the speed group
> of the SoC rather than using a constant 1.025V.
> 
> Signed-off-by: Abhilash Kesavan <a.kesavan@samsung.com>
> Cc: Jonghwan Choi <jhbird.choi@samsung.com>
> Cc: Kukjin Kim <kgene.kim@samsung.com>
> ---
> This code is based on Jonghwan Choi's <jhbird.choi@samsung.com> devfreq
> work
> for Exynos5250. This requires corresponding machine specific changes which
> will be posted once the driver is reviewed.
> 
>  drivers/devfreq/Kconfig        |   10 +
>  drivers/devfreq/Makefile       |    1 +
>  drivers/devfreq/exynos5_bus.c  |  595
> ++++++++++++++++++++++++++++++++++++++++
>  drivers/devfreq/exynos5_ppmu.c |  395 ++++++++++++++++++++++++++
>  drivers/devfreq/exynos5_ppmu.h |   26 ++
>  drivers/devfreq/exynos_ppmu.c  |   56 ++++
>  drivers/devfreq/exynos_ppmu.h  |   79 ++++++
>  7 files changed, 1162 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/devfreq/exynos5_bus.c
>  create mode 100644 drivers/devfreq/exynos5_ppmu.c
>  create mode 100644 drivers/devfreq/exynos5_ppmu.h
>  create mode 100644 drivers/devfreq/exynos_ppmu.c
>  create mode 100644 drivers/devfreq/exynos_ppmu.h
> 
> diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig
> index 0f079be..1560d0d 100644
> --- a/drivers/devfreq/Kconfig
> +++ b/drivers/devfreq/Kconfig
> @@ -78,4 +78,14 @@ config ARM_EXYNOS4_BUS_DEVFREQ
>  	  To operate with optimal voltages, ASV support is required
>  	  (CONFIG_EXYNOS_ASV).
> 
> +config ARM_EXYNOS5_BUS_DEVFREQ
> +	bool "ARM Exynos5250 Bus DEVFREQ Driver"
> +	depends on SOC_EXYNOS5250
> +	select ARCH_HAS_OPP
> +	select DEVFREQ_GOV_SIMPLE_ONDEMAND
> +	help
> +	  This adds the DEVFREQ driver for Exynos5250 bus interface
> (vdd_int).
> +	  It reads PPMU counters of memory controllers and adjusts the
> +	  operating frequencies and voltages with OPP support.
> +
>  endif # PM_DEVFREQ
> diff --git a/drivers/devfreq/Makefile b/drivers/devfreq/Makefile
> index 8c46423..6259aa4 100644
> --- a/drivers/devfreq/Makefile
> +++ b/drivers/devfreq/Makefile
> @@ -6,3 +6,4 @@ obj-$(CONFIG_DEVFREQ_GOV_USERSPACE)	+=
> governor_userspace.o
> 
>  # DEVFREQ Drivers
>  obj-$(CONFIG_ARM_EXYNOS4_BUS_DEVFREQ)	+= exynos4_bus.o
> +obj-$(CONFIG_ARM_EXYNOS5_BUS_DEVFREQ)	+= exynos_ppmu.o
exynos5_ppmu.o
> exynos5_bus.o
> diff --git a/drivers/devfreq/exynos5_bus.c b/drivers/devfreq/exynos5_bus.c
> new file mode 100644
> index 0000000..8ced376
> --- /dev/null
> +++ b/drivers/devfreq/exynos5_bus.c
> @@ -0,0 +1,595 @@
> +/*
> + * Copyright (c) 2012 Samsung Electronics Co., Ltd.
> + *		http://www.samsung.com/
> + *
> + * EXYNOS5 INT clock frequency scaling support using DEVFREQ framework
> + * Based on work done by Jonghwan Choi <jhbird.choi@samsung.com>
> + * Support for only EXYNOS5250 is present.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + */
> +
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/opp.h>
> +#include <linux/clk.h>
> +#include <linux/platform_device.h>
> +#include <linux/regulator/consumer.h>
> +
> +#include "exynos_ppmu.h"
> +#include "exynos5_ppmu.h"
> +#include "governor.h"
> +
> +#define MAX_SAFEVOLT			1100000 /* 1.10V */
> +/* Assume that the bus is saturated if the utilization is 25% */
> +#define INT_BUS_SATURATION_RATIO	25
> +#define EXYNOS5_BUS_INT_POLL_TIME	msecs_to_jiffies(100)
> +
> +enum int_level_idx {
> +	LV_0,
> +	LV_1,
> +	LV_2,
> +	LV_3,
> +	LV_4,
> +	_LV_END
> +};
> +
> +struct busfreq_data_int {
> +	struct device *dev;
> +	struct devfreq *devfreq;
> +	bool disabled;
> +	struct regulator *vdd_int;
> +	unsigned long curr_freq;
> +	unsigned long curr_volt;
> +	unsigned long suspend_freq;
> +	struct mutex lock;
> +	struct clk *int_clk;
> +	struct exynos5_ppmu_handle *ppmu;
> +	struct delayed_work work;
> +	int busy;
> +};
> +
> +struct exynos5_bus_int_handle {
> +	struct list_head node;
> +	struct delayed_work work;
> +	bool boost;
> +	bool poll;
> +	unsigned long min;
> +};
> +
> +struct int_bus_opp_table {
> +	unsigned int idx;
> +	unsigned long clk;
> +	unsigned long volt;
> +};
> +
> +static struct int_bus_opp_table exynos5_int_opp_table[] = {
> +	{LV_0, 266000, 1025000},
> +	{LV_1, 200000, 1025000},
> +	{LV_2, 160000, 1025000},
> +	{LV_3, 133000, 1025000},
> +	{LV_4, 100000, 1025000},
> +	{0, 0, 0},
> +};
> +
> +static struct busfreq_data_int *exynos5_bus_int_data;
> +static DEFINE_MUTEX(exynos5_bus_int_data_lock);
> +static LIST_HEAD(exynos5_bus_int_requests);
> +static DEFINE_MUTEX(exynos5_bus_int_requests_lock);
> +
> +static int exynos5_int_setvolt(struct busfreq_data_int *data,
> +		unsigned long volt)
> +{
> +	return regulator_set_voltage(data->vdd_int, volt, MAX_SAFEVOLT);
> +}
> +
> +static int exynos5_busfreq_int_target(struct device *dev, unsigned long
> *_freq,
> +			      u32 flags)
> +{
> +	int err = 0;
> +	struct platform_device *pdev = container_of(dev, struct
> platform_device,
...

> +
> +	err = clk_set_rate(data->int_clk, freq * 1000);
> +
> +	if (err)
> +		goto out;
> +




> +	if (old_freq > freq)
> +		err = exynos5_int_setvolt(data, volt);
> +	if (err)
> +		goto out;
> +
> +	data->curr_freq = freq;
> +out:
> +	mutex_unlock(&data->lock);
> +	return err;
> +}
> +
> +static int exynos5_int_get_dev_status(struct device *dev,
> +				      struct devfreq_dev_status *stat)
> +{
> +	struct platform_device *pdev = container_of(dev, struct
> platform_device,
> +						    dev);
> +	struct busfreq_data_int *data = platform_get_drvdata(pdev);
> +
> +	stat->current_frequency = data->curr_freq;
> +	stat->busy_time = data->busy;
> +	stat->total_time = 100;
> +
> +	return 0;
> +}
> +
> +static void exynos5_int_poll_start(struct busfreq_data_int *data)
> +{
> +	struct exynos5_bus_int_handle *handle;
> +
> +	mutex_lock(&exynos5_bus_int_requests_lock);
> +	list_for_each_entry(handle, &exynos5_bus_int_requests, node) {
> +		if (handle->poll) {
> +			schedule_delayed_work(&data->work,
> +					EXYNOS5_BUS_INT_POLL_TIME);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&exynos5_bus_int_requests_lock);
> +}
> +
> +static void exynos5_int_poll_stop(struct busfreq_data_int *data)
> +{
> +	cancel_delayed_work_sync(&data->work);
> +}
> +
> +static void exynos5_int_update(struct busfreq_data_int *data)
> +{
> +	struct exynos5_bus_int_handle *handle;
> +	int ret = 0;
> +	bool poll = false;
> +
> +	mutex_lock(&exynos5_bus_int_requests_lock);
> +	list_for_each_entry(handle, &exynos5_bus_int_requests, node) {
> +		if (handle->poll) {
> +			poll = true;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&exynos5_bus_int_requests_lock);
> +
> +	ret = exynos5_ppmu_get_busy(data->ppmu, PPMU_SET_RIGHT);
> +	if (ret >= 0 && poll)
> +		data->busy = ret;
> +	else
> +		data->busy = 0;
> +
> +	if (ret >= 0 || !poll) {
> +		mutex_lock(&data->devfreq->lock);
> +		update_devfreq(data->devfreq);
> +		mutex_unlock(&data->devfreq->lock);
> +	}
> +}
> +
> +static void exynos5_int_poll_work(struct work_struct *work)
> +{
> +	struct delayed_work *dwork;
> +	struct busfreq_data_int *data;
> +
> +	dwork = to_delayed_work(work);
> +	data = container_of(dwork, struct busfreq_data_int, work);
> +
> +	exynos5_int_update(data);
> +	exynos5_int_poll_start(data);
> +}
> +
> +static void exynos5_int_cancel_boost(struct work_struct *work)
> +{
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct exynos5_bus_int_handle *handle;
> +
> +	handle = container_of(dwork, struct exynos5_bus_int_handle, work);
> +	handle->boost = false;
> +}
> +
> +struct exynos5_bus_int_handle *exynos5_bus_int_get(unsigned long
min_freq,
> +		bool poll)
> +{
> +	struct exynos5_bus_int_handle *handle;
> +
> +	handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> +	if (!handle)
> +		return NULL;
> +
> +	handle->min = min_freq;
> +	/* If polling, boost the frequency for the first poll cycle */
> +	handle->boost = poll;
> +	handle->poll = poll;
> +	INIT_DELAYED_WORK(&handle->work, exynos5_int_cancel_boost);
> +
> +	mutex_lock(&exynos5_bus_int_requests_lock);
> +	list_add_tail(&handle->node, &exynos5_bus_int_requests);
> +	mutex_unlock(&exynos5_bus_int_requests_lock);
> +
> +	mutex_lock(&exynos5_bus_int_data_lock);
> +
> +	if (exynos5_bus_int_data) {
> +		cancel_delayed_work_sync(&exynos5_bus_int_data->work);
> +		exynos5_int_update(exynos5_bus_int_data);
> +		exynos5_int_poll_start(exynos5_bus_int_data);
> +	}
> +
> +	mutex_unlock(&exynos5_bus_int_data_lock);
> +
> +	if (handle->boost)
> +		schedule_delayed_work(&handle->work,
> EXYNOS5_BUS_INT_POLL_TIME);
> +
> +	return handle;
> +}
> +
> +static void exynos5_int_exit(struct device *dev)
> +{
> +	struct platform_device *pdev = container_of(dev, struct
> platform_device,
> +						    dev);
> +	struct busfreq_data_int *data = platform_get_drvdata(pdev);
> +
> +	devfreq_unregister_opp_notifier(dev, data->devfreq);
> +}
> +
> +static struct devfreq_dev_profile exynos5_devfreq_int_profile = {
> +	.initial_freq		= 160000,
> +	.polling_ms		= 0,
> +	.target			= exynos5_busfreq_int_target,
> +	.get_dev_status		= exynos5_int_get_dev_status,
> +	.exit			= exynos5_int_exit,
> +};
> +
> +static int exynos5250_init_int_tables(struct busfreq_data_int *data)
> +{
> +	int i, err = 0;
> +
> +	for (i = LV_0; i < _LV_END; i++) {
> +		err = opp_add(data->dev, exynos5_int_opp_table[i].clk,
> +				exynos5_int_opp_table[i].volt);
> +		if (err) {
> +			dev_err(data->dev, "Cannot add opp entries.\n");
> +			return err;
> +		}
> +	}
> +
> +	return 0;
> +}
> +static struct devfreq_simple_ondemand_data exynos5_int_ondemand_data = {
> +	.downdifferential = 2,
> +	.upthreshold = INT_BUS_SATURATION_RATIO,
> +};
> +
> +static __devinit int exynos5_busfreq_int_probe(struct platform_device
> *pdev)
> +{
> +	struct busfreq_data_int *data;
> +	struct opp *opp;
> +	struct device *dev = &pdev->dev;
> +	unsigned long initial_freq;
> +	unsigned long initial_volt;
> +	int err = 0;
> +
> +	data = devm_kzalloc(&pdev->dev, sizeof(struct busfreq_data_int),
> +				GFP_KERNEL);
> +	if (data == NULL) {
> +		dev_err(dev, "Cannot allocate memory.\n");
> +		return -ENOMEM;
> +	}
> +
> +	data->dev = dev;
> +	mutex_init(&data->lock);
> +
> +	err = exynos5250_init_int_tables(data);
> +	if (err)
> +		goto err_regulator;
> +
> +	data->vdd_int = regulator_get(dev, "vdd_int");
> +	if (IS_ERR(data->vdd_int)) {
> +		dev_err(dev, "Cannot get the regulator \"vdd_int\"\n");
> +		err = PTR_ERR(data->vdd_int);
> +		goto err_regulator;
> +	}
> +
> +	data->int_clk = clk_get(dev, "int_clk");
> +	if (IS_ERR(data->int_clk)) {
> +		dev_err(dev, "Cannot get clock \"int_clk\"\n");
> +		err = PTR_ERR(data->int_clk);
> +		goto err_clock;
> +	}
> +
> +	rcu_read_lock();
> +	opp = opp_find_freq_floor(dev,
> +			&exynos5_devfreq_int_profile.initial_freq);
> +	if (IS_ERR(opp)) {
> +		rcu_read_unlock();
> +		dev_err(dev, "Invalid initial frequency %lu kHz.\n",
> +		       exynos5_devfreq_int_profile.initial_freq);
> +		err = PTR_ERR(opp);
> +		goto err_opp_add;
> +	}
> +	initial_freq = opp_get_freq(opp);
> +	initial_volt = opp_get_voltage(opp);
> +	rcu_read_unlock();
> +	data->curr_freq = initial_freq;
> +	data->curr_volt = initial_volt;
> +
> +	err = clk_set_rate(data->int_clk, initial_freq * 1000);
> +	if (err) {
> +		dev_err(dev, "Failed to set initial frequency\n");
> +		goto err_opp_add;
> +	}
> +
> +	err = exynos5_int_setvolt(data, initial_volt);
> +	if (err)
> +		goto err_opp_add;
> +
> +	platform_set_drvdata(pdev, data);
> +
> +	data->ppmu = exynos5_ppmu_get();
> +	if (!data->ppmu)
> +		goto err_ppmu_get;
> +
> +	INIT_DELAYED_WORK(&data->work, exynos5_int_poll_work);
> +
> +	data->devfreq = devfreq_add_device(dev,
> &exynos5_devfreq_int_profile,
> +					   "simple_ondemand",
> +					   &exynos5_int_ondemand_data);
> +
> +	if (IS_ERR(data->devfreq)) {
> +		err = PTR_ERR(data->devfreq);
> +		goto err_devfreq_add;
> +	}
> +
> +	devfreq_register_opp_notifier(dev, data->devfreq);
> +
> +	mutex_lock(&exynos5_bus_int_data_lock);
> +	exynos5_bus_int_data = data;
> +	mutex_unlock(&exynos5_bus_int_data_lock);
> +
> +	exynos5_int_update(data);
> +	exynos5_int_poll_start(data);
> +
> +	return 0;
> +
> +err_devfreq_add:
> +	devfreq_remove_device(data->devfreq);
> +err_ppmu_get:
> +	platform_set_drvdata(pdev, NULL);
> +err_opp_add:
> +	clk_put(data->int_clk);
> +err_clock:
> +	regulator_put(data->vdd_int);
> +err_regulator:
> +	return err;
> +}
> +
> +static __devexit int exynos5_busfreq_int_remove(struct platform_device
> *pdev)
> +{
> +	struct busfreq_data_int *data = platform_get_drvdata(pdev);
> +
> +	mutex_lock(&exynos5_bus_int_data_lock);
> +	exynos5_bus_int_data = NULL;
> +	mutex_unlock(&exynos5_bus_int_data_lock);
> +
> +	devfreq_remove_device(data->devfreq);
> +	exynos5_int_poll_stop(data);
> +	regulator_put(data->vdd_int);
> +	clk_put(data->int_clk);
> +	platform_set_drvdata(pdev, NULL);
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static int exynos5_busfreq_int_suspend(struct device *dev)
> +{
> +	struct platform_device *pdev = to_platform_device(dev);
> +	struct opp *max_opp;
> +	struct opp *opp;
> +	unsigned long maxfreq = ULONG_MAX;
> +	unsigned long volt;
> +	unsigned long freq;
> +	int err = 0;
> +	struct busfreq_data_int *data = platform_get_drvdata(pdev);
> +
> +	exynos5_int_poll_stop(data);
> +	/*
> +	 * Set the frequency to the maximum enabled frequency, but set the
> +	 * voltage to the maximum possible voltage in case the bootloader
> +	 * sets the frequency to maximum during resume.
> +	 */
> +	mutex_lock(&data->lock);
> +
> +	data->disabled = true;
> +
> +	rcu_read_lock();
> +	max_opp = opp_find_freq_floor(data->dev, &maxfreq);
> +	if (IS_ERR(max_opp)) {
> +		rcu_read_unlock();
> +		err = PTR_ERR(max_opp);
> +		goto unlock;
> +	}
> +
> +	maxfreq = ULONG_MAX;
> +	if (data->devfreq->max_freq)
> +		maxfreq = data->devfreq->max_freq;
> +
> +	opp = opp_find_freq_floor(data->dev, &maxfreq);
> +	if (IS_ERR(opp)) {
> +		rcu_read_unlock();
> +		err = PTR_ERR(opp);
> +		goto unlock;
> +	}
> +
> +	freq = opp_get_freq(opp);
> +	volt = opp_get_voltage(max_opp);
> +	rcu_read_unlock();
> +
> +	err = exynos5_int_setvolt(data, volt);
> +	if (err)
> +		goto unlock;
> +
> +	err = clk_set_rate(data->int_clk, freq * 1000);
> +	if (err)
> +		goto unlock;
> +
> +	data->suspend_freq = freq;
> +
> +unlock:
> +	mutex_unlock(&data->lock);
> +	return err;
> +}
> +
> +static int exynos5_busfreq_int_resume_noirq(struct device *dev)
> +{
> +	struct platform_device *pdev = to_platform_device(dev);
> +	int err = 0;
> +	struct busfreq_data_int *data = platform_get_drvdata(pdev);
> +
> +	/*
> +	 * Set the frequency to the maximum enabled frequency in case the
> +	 * bootloader raised it during resume.
> +	 */
> +	mutex_lock(&data->lock);
> +
> +	err = clk_set_rate(data->int_clk, data->suspend_freq * 1000);
> +	if (err)
> +		goto unlock;
> +
> +unlock:
> +	mutex_unlock(&data->lock);
> +	return err;
> +}
> +
> +static int exynos5_busfreq_int_resume(struct device *dev)
> +{
> +	struct platform_device *pdev = to_platform_device(dev);
> +	struct busfreq_data_int *data = platform_get_drvdata(pdev);
> +	int err = 0;
> +
> +	/* Restore the frequency and voltage to the values pre-suspend */
> +	mutex_lock(&data->lock);
> +
> +	data->disabled = false;
> +
> +	err = clk_set_rate(data->int_clk, data->curr_freq * 1000);
> +	if (err)
> +		goto unlock;
> +
> +	err = exynos5_int_setvolt(data, data->curr_volt);
> +	if (err)
> +		goto unlock;
> +
> +unlock:
> +	mutex_unlock(&data->lock);
> +
> +	if (!err)
> +		exynos5_int_poll_start(data);
> +
> +	return err;
> +}
> +#endif
> +
> +static const struct dev_pm_ops exynos5_busfreq_int_pm = {
> +	.suspend = exynos5_busfreq_int_suspend,
> +	.resume_noirq = exynos5_busfreq_int_resume_noirq,
> +	.resume = exynos5_busfreq_int_resume,
> +};
> +
> +/* platform device pointer for exynos5 devfreq device. */
> +static struct platform_device *exynos5_devfreq_pdev;
> +
> +static struct platform_driver exynos5_busfreq_int_driver = {
> +	.probe		= exynos5_busfreq_int_probe,
> +	.remove		= __devexit_p(exynos5_busfreq_int_remove),
> +	.driver		= {
> +		.name		= "exynos5-bus-int",
> +		.owner		= THIS_MODULE,
> +		.pm		= &exynos5_busfreq_int_pm,
> +	},
> +};
> +
> +static int __init exynos5_busfreq_int_init(void)
> +{
> +	int ret;
> +
> +	ret = platform_driver_register(&exynos5_busfreq_int_driver);
> +	if (ret < 0)
> +		goto out;
> +
> +	exynos5_devfreq_pdev =
> +		platform_device_register_simple("exynos5-bus-int", -1, NULL,
> 0);
> +	if (IS_ERR_OR_NULL(exynos5_devfreq_pdev)) {
> +		ret = PTR_ERR(exynos5_devfreq_pdev);
> +		goto out1;
> +	}
> +
> +	return 0;
> +out1:
> +	platform_driver_unregister(&exynos5_busfreq_int_driver);
> +out:
> +	return ret;
> +}
> +late_initcall(exynos5_busfreq_int_init);
> +
> +static void __exit exynos5_busfreq_int_exit(void)
> +{
> +	platform_device_unregister(exynos5_devfreq_pdev);
> +	platform_driver_unregister(&exynos5_busfreq_int_driver);
> +}
> +module_exit(exynos5_busfreq_int_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("EXYNOS5 busfreq driver with devfreq framework");
> diff --git a/drivers/devfreq/exynos5_ppmu.c
> b/drivers/devfreq/exynos5_ppmu.c
> new file mode 100644
> index 0000000..9752098
> --- /dev/null
> +++ b/drivers/devfreq/exynos5_ppmu.c
> @@ -0,0 +1,395 @@
> +/*
> + * Copyright (c) 2012 Samsung Electronics Co., Ltd.
> + *		http://www.samsung.com/
> + *
> + * EXYNOS5 PPMU support
> + * Support for only EXYNOS5250 is present.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/debugfs.h>
> +#include <linux/hrtimer.h>
> +#include <linux/io.h>
> +#include <linux/slab.h>
> +
> +#include <mach/map.h>
> +
> +#include "exynos_ppmu.h"
> +#include "exynos5_ppmu.h"
> +
> +#define FIXED_POINT_OFFSET 8
> +#define FIXED_POINT_MASK ((1 << FIXED_POINT_OFFSET) - 1)
> +
> +enum exynos5_ppmu_list {
> +	PPMU_DDR_C,
> +	PPMU_DDR_R1,
> +	PPMU_DDR_L,
> +	PPMU_RIGHT,
> +	PPMU_CPU,
> +	PPMU_END,
> +};
> +
> +struct exynos5_ppmu_handle {
> +	struct list_head node;
> +	struct exynos_ppmu ppmu[PPMU_END];
> +};
> +
> +static DEFINE_SPINLOCK(exynos5_ppmu_lock);
> +static LIST_HEAD(exynos5_ppmu_handle_list);
> +static struct exynos5_ppmu_handle *exynos5_ppmu_trace_handle;
> +
> +static const char *exynos5_ppmu_name[PPMU_END] = {
> +	[PPMU_DDR_C]	= "DDR_C",
> +	[PPMU_DDR_R1]	= "DDR_R1",
> +	[PPMU_DDR_L]	= "DDR_L",
> +	[PPMU_RIGHT]	= "RIGHT",
> +	[PPMU_CPU]	= "CPU",
> +};
> +
> +static struct exynos_ppmu ppmu[PPMU_END] = {
> +	[PPMU_DDR_C] = {
> +		.hw_base = S5P_VA_PPMU_DDR_C,
> +	},
> +	[PPMU_DDR_R1] = {
> +		.hw_base = S5P_VA_PPMU_DDR_R1,
> +	},
> +	[PPMU_DDR_L] = {
> +		.hw_base = S5P_VA_PPMU_DDR_L,
> +	},
> +	[PPMU_RIGHT] = {
> +		.hw_base = S5P_VA_PPMU_RIGHT,
> +	},
> +	[PPMU_CPU] = {
> +		.hw_base = S5P_VA_PPMU_CPU,
> +	},
> +};
> +
> +static void exynos5_ppmu_reset(struct exynos_ppmu *ppmu)
> +{
> +	unsigned long flags;
> +
> +	void __iomem *ppmu_base = ppmu->hw_base;
> +
> +	/* Reset PPMU */
> +	exynos_ppmu_reset(ppmu_base);
> +
> +	/* Set PPMU Event */
> +	ppmu->event[PPMU_PMNCNT0] = RD_DATA_COUNT;
> +	exynos_ppmu_setevent(ppmu_base, PPMU_PMNCNT0,
> +			ppmu->event[PPMU_PMNCNT0]);
> +	ppmu->event[PPMU_PMCCNT1] = WR_DATA_COUNT;
> +	exynos_ppmu_setevent(ppmu_base, PPMU_PMCCNT1,
> +			ppmu->event[PPMU_PMCCNT1]);
> +	ppmu->event[PPMU_PMNCNT3] = RDWR_DATA_COUNT;
> +	exynos_ppmu_setevent(ppmu_base, PPMU_PMNCNT3,
> +			ppmu->event[PPMU_PMNCNT3]);
> +
> +	local_irq_save(flags);
> +	ppmu->reset_time = ktime_get();
> +	/* Start PPMU */
> +	exynos_ppmu_start(ppmu_base);
> +	local_irq_restore(flags);
> +}
> +
> +static void exynos5_ppmu_read(struct exynos_ppmu *ppmu)
> +{
> +	int j;
> +	unsigned long flags;
> +	ktime_t read_time;
> +	ktime_t t;
> +	u32 reg;
> +
> +	void __iomem *ppmu_base = ppmu->hw_base;
> +
> +	local_irq_save(flags);
> +	read_time = ktime_get();
> +	/* Stop PPMU */
> +	exynos_ppmu_stop(ppmu_base);
> +	local_irq_restore(flags);
> +
> +	/* Update local data from PPMU */
> +	ppmu->ccnt = __raw_readl(ppmu_base + PPMU_CCNT);
> +	reg = __raw_readl(ppmu_base + PPMU_FLAG);
> +	ppmu->ccnt_overflow = reg & PPMU_CCNT_OVERFLOW;
> +
> +	for (j = PPMU_PMNCNT0; j < PPMU_PMNCNT_MAX; j++) {
> +		if (ppmu->event[j] == 0)
> +			ppmu->count[j] = 0;
> +		else
> +			ppmu->count[j] = exynos_ppmu_read(ppmu_base, j);
> +	}
> +	t = ktime_sub(read_time, ppmu->reset_time);
> +	ppmu->ns = ktime_to_ns(t);
> +}
> +
> +static void exynos5_ppmu_add(struct exynos_ppmu *to, struct exynos_ppmu
> *from)
> +{
> +	int i;
> +	int j;
> +
> +	for (i = 0; i < PPMU_END; i++) {
> +		for (j = PPMU_PMNCNT0; j < PPMU_PMNCNT_MAX; j++)
> +			to[i].count[j] += from[i].count[j];
> +
> +		to[i].ccnt += from[i].ccnt;
> +		if (to[i].ccnt < from[i].ccnt)
> +			to[i].ccnt_overflow = true;
> +
> +		to[i].ns += from[i].ns;
> +
> +		if (from[i].ccnt_overflow)
> +			to[i].ccnt_overflow = true;
> +	}
> +}
> +
> +static void exynos5_ppmu_handle_clear(struct exynos5_ppmu_handle *handle)
> +{
> +	memset(&handle->ppmu, 0, sizeof(struct exynos_ppmu) * PPMU_END);
> +}
> +
> +static void exynos5_ppmu_update(void)
> +{
> +	int i;
> +	struct exynos5_ppmu_handle *handle;
> +
> +	for (i = 0; i < PPMU_END; i++) {
> +		exynos5_ppmu_read(&ppmu[i]);
> +		exynos5_ppmu_reset(&ppmu[i]);
> +	}
> +
> +	list_for_each_entry(handle, &exynos5_ppmu_handle_list, node)
> +		exynos5_ppmu_add(handle->ppmu, ppmu);
> +}
> +
> +static int exynos5_ppmu_get_filter(enum exynos5_ppmu_sets filter,
> +	enum exynos5_ppmu_list *start, enum exynos5_ppmu_list *end)
> +{
> +	switch (filter) {
> +	case PPMU_SET_DDR:
> +		*start = PPMU_DDR_C;
> +		*end = PPMU_DDR_L;
> +		break;
> +	case PPMU_SET_RIGHT:
> +		*start = PPMU_RIGHT;
> +		*end = PPMU_RIGHT;
> +		break;
> +	case PPMU_SET_CPU:
> +		*start = PPMU_CPU;
> +		*end = PPMU_CPU;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +int exynos5_ppmu_get_busy(struct exynos5_ppmu_handle *handle,
> +	enum exynos5_ppmu_sets filter)
> +{
> +	unsigned long flags;
> +	int i;
> +	int busy = 0;
> +	int temp;
> +	enum exynos5_ppmu_list start;
> +	enum exynos5_ppmu_list end;
> +	int ret;
> +
> +	ret = exynos5_ppmu_get_filter(filter, &start, &end);
> +	if (ret < 0)
> +		return ret;
> +
> +	spin_lock_irqsave(&exynos5_ppmu_lock, flags);
> +
> +	exynos5_ppmu_update();
> +
> +	for (i = start; i <= end; i++) {
> +		if (handle->ppmu[i].ccnt_overflow) {
> +			busy = -EOVERFLOW;
> +			break;
> +		}
> +		temp = handle->ppmu[i].count[PPMU_PMNCNT3] * 100;
> +		if (handle->ppmu[i].ccnt > 0)
> +			temp /= handle->ppmu[i].ccnt;
> +		if (temp > busy)
> +			busy = temp;
> +	}
> +
> +	exynos5_ppmu_handle_clear(handle);
> +
> +	spin_unlock_irqrestore(&exynos5_ppmu_lock, flags);
> +
> +	return busy;
> +}
> +
> +static void exynos5_ppmu_put(struct exynos5_ppmu_handle *handle)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&exynos5_ppmu_lock, flags);
> +
> +	list_del(&handle->node);
> +
> +	spin_unlock_irqrestore(&exynos5_ppmu_lock, flags);
> +
> +	kfree(handle);
> +}
> +
> +struct exynos5_ppmu_handle *exynos5_ppmu_get(void)
> +{
> +	struct exynos5_ppmu_handle *handle;
> +	unsigned long flags;
> +
> +	handle = kzalloc(sizeof(struct exynos5_ppmu_handle), GFP_KERNEL);
> +	if (!handle)
> +		return NULL;
> +
> +	spin_lock_irqsave(&exynos5_ppmu_lock, flags);
> +
> +	exynos5_ppmu_update();
> +	list_add_tail(&handle->node, &exynos5_ppmu_handle_list);
> +
> +	spin_unlock_irqrestore(&exynos5_ppmu_lock, flags);
> +
> +	return handle;
> +}
> +
> +static int exynos5_ppmu_trace_init(void)
> +{
> +	exynos5_ppmu_trace_handle = exynos5_ppmu_get();
> +	return 0;
> +}
> +late_initcall(exynos5_ppmu_trace_init);
> +
> +static void exynos5_ppmu_debug_compute(struct exynos_ppmu *ppmu,
> +	enum ppmu_counter i, u32 *sat, u32 *freq, u32 *bw)
> +{
> +	u64 ns = ppmu->ns;
> +	u32 busy = ppmu->count[i];
> +	u32 total = ppmu->ccnt;
> +
> +	u64 s;
> +	u64 f;
> +	u64 b;
> +
> +	s = (u64)busy * 100 * (1 << FIXED_POINT_OFFSET);
> +	s += total / 2;
> +	do_div(s, total);
> +
> +	f = (u64)total * 1000 * (1 << FIXED_POINT_OFFSET);
> +	f += ns / 2;
> +	f = div64_u64(f, ns);
> +
> +	b = (u64)busy * (128 / 8) * 1000 * (1 << FIXED_POINT_OFFSET);
> +	b += ns / 2;
> +	b = div64_u64(b, ns);
> +
> +	*sat = s;
> +	*freq = f;
> +	*bw = b;
> +}
> +
> +static void exynos5_ppmu_debug_show_one_counter(struct seq_file *s,
> +	const char *name, const char *type, struct exynos_ppmu *ppmu,
> +	enum ppmu_counter i, u32 *bw_total)
> +{
> +	u32 sat;
> +	u32 freq;
> +	u32 bw;
> +
> +	exynos5_ppmu_debug_compute(ppmu, i, &sat, &freq, &bw);
> +
> +	seq_printf(s, "%-10s %-10s %4u.%02u MBps %3u.%02u MHz %2u.%02u%%\n",
> +		name, type,
> +		bw >> FIXED_POINT_OFFSET,
> +		(bw & FIXED_POINT_MASK) * 100 / (1 << FIXED_POINT_OFFSET),
> +		freq >> FIXED_POINT_OFFSET,
> +		(freq & FIXED_POINT_MASK) * 100 / (1 << FIXED_POINT_OFFSET),
> +		sat >> FIXED_POINT_OFFSET,
> +		(sat & FIXED_POINT_MASK) * 100 / (1 << FIXED_POINT_OFFSET));
> +
> +	*bw_total += bw;
> +}
> +
> +static void exynos5_ppmu_debug_show_one(struct seq_file *s,
> +	const char *name, struct exynos_ppmu *ppmu,
> +	u32 *bw_total)
> +{
> +	exynos5_ppmu_debug_show_one_counter(s, name, "read+write",
> +		ppmu, PPMU_PMNCNT3, &bw_total[PPMU_PMNCNT3]);
> +	exynos5_ppmu_debug_show_one_counter(s, "", "read",
> +		ppmu, PPMU_PMNCNT0, &bw_total[PPMU_PMNCNT0]);
> +	exynos5_ppmu_debug_show_one_counter(s, "", "write",
> +		ppmu, PPMU_PMCCNT1, &bw_total[PPMU_PMCCNT1]);
> +
> +}
> +
> +static int exynos5_ppmu_debug_show(struct seq_file *s, void *d)
> +{
> +	int i;
> +	u32 bw_total[PPMU_PMNCNT_MAX];
> +	struct exynos5_ppmu_handle *handle;
> +	unsigned long flags;
> +
> +	memset(bw_total, 0, sizeof(bw_total));
> +
> +	handle = exynos5_ppmu_get();
> +	msleep(100);
> +
> +	spin_lock_irqsave(&exynos5_ppmu_lock, flags);
> +
> +	exynos5_ppmu_update();
> +
> +	for (i = 0; i < PPMU_CPU; i++)
> +		exynos5_ppmu_debug_show_one(s, exynos5_ppmu_name[i],
> +				&handle->ppmu[i], bw_total);
> +
> +	seq_printf(s, "%-10s %-10s %4u.%02u MBps\n", "total", "read+write",
> +		bw_total[PPMU_PMNCNT3] >> FIXED_POINT_OFFSET,
> +		(bw_total[PPMU_PMNCNT3] & FIXED_POINT_MASK) *
> +				100 / (1 << FIXED_POINT_OFFSET));
> +	seq_printf(s, "%-10s %-10s %4u.%02u MBps\n", "", "read",
> +		bw_total[PPMU_PMNCNT0] >> FIXED_POINT_OFFSET,
> +		(bw_total[PPMU_PMNCNT0] & FIXED_POINT_MASK) *
> +				100 / (1 << FIXED_POINT_OFFSET));
> +	seq_printf(s, "%-10s %-10s %4u.%02u MBps\n", "", "write",
> +		bw_total[PPMU_PMCCNT1] >> FIXED_POINT_OFFSET,
> +		(bw_total[PPMU_PMCCNT1] & FIXED_POINT_MASK) *
> +				100 / (1 << FIXED_POINT_OFFSET));
> +
> +	seq_printf(s, "\n");
> +
> +	exynos5_ppmu_debug_show_one(s, exynos5_ppmu_name[PPMU_CPU],
> +			&ppmu[PPMU_CPU], bw_total);
> +
> +	spin_unlock_irqrestore(&exynos5_ppmu_lock, flags);
> +
> +	exynos5_ppmu_put(handle);
> +
> +	return 0;
> +}
> +
> +static int exynos5_ppmu_debug_open(struct inode *inode, struct file
*file)
> +{
> +	return single_open(file, exynos5_ppmu_debug_show, inode->i_private);
> +}
> +
> +static const struct file_operations exynos5_ppmu_debug_fops = {
> +	.open		= exynos5_ppmu_debug_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static int __init exynos5_ppmu_debug_init(void)
> +{
> +	debugfs_create_file("exynos5_bus", S_IRUGO, NULL, NULL,
> +		&exynos5_ppmu_debug_fops);
> +	return 0;
> +}
> +late_initcall(exynos5_ppmu_debug_init);
> diff --git a/drivers/devfreq/exynos5_ppmu.h
> b/drivers/devfreq/exynos5_ppmu.h
> new file mode 100644
> index 0000000..24d2ec4
> --- /dev/null
> +++ b/drivers/devfreq/exynos5_ppmu.h
> @@ -0,0 +1,26 @@
> +/*
> + * Copyright (c) 2012 Samsung Electronics Co., Ltd.
> + *		http://www.samsung.com/
> + *
> + * EXYNOS5 PPMU header
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> +*/
> +
> +#ifndef __DEVFREQ_EXYNOS5_PPMU_H
> +#define __DEVFREQ_EXYNOS5_PPMU_H __FILE__
> +
> +enum exynos5_ppmu_sets {
> +	PPMU_SET_DDR,
> +	PPMU_SET_RIGHT,
> +	PPMU_SET_CPU,
> +};
> +
> +struct exynos5_ppmu_handle *exynos5_ppmu_get(void);
> +int exynos5_ppmu_get_busy(struct exynos5_ppmu_handle *handle,
> +	enum exynos5_ppmu_sets filter);
> +
> +#endif /* __DEVFREQ_EXYNOS5_PPMU_H */
> +
> diff --git a/drivers/devfreq/exynos_ppmu.c b/drivers/devfreq/exynos_ppmu.c
> new file mode 100644
> index 0000000..85fc5ac
> --- /dev/null
> +++ b/drivers/devfreq/exynos_ppmu.c
> @@ -0,0 +1,56 @@
> +/*
> + * Copyright (c) 2012 Samsung Electronics Co., Ltd.
> + *		http://www.samsung.com/
> + *
> + * EXYNOS - PPMU support
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/io.h>
> +
> +#include "exynos_ppmu.h"
> +
> +void exynos_ppmu_reset(void __iomem *ppmu_base)
> +{
> +	__raw_writel(PPMU_CYCLE_RESET | PPMU_COUNTER_RESET, ppmu_base);
> +	__raw_writel(PPMU_ENABLE_CYCLE  |
> +		     PPMU_ENABLE_COUNT0 |
> +		     PPMU_ENABLE_COUNT1 |
> +		     PPMU_ENABLE_COUNT2 |
> +		     PPMU_ENABLE_COUNT3,
> +		     ppmu_base + PPMU_CNTENS);
> +}
> +
> +void exynos_ppmu_setevent(void __iomem *ppmu_base, unsigned int ch,
> +			unsigned int evt)
> +{
> +	__raw_writel(evt, ppmu_base + PPMU_BEVTSEL(ch));
> +}
> +
> +void exynos_ppmu_start(void __iomem *ppmu_base)
> +{
> +	__raw_writel(PPMU_ENABLE, ppmu_base);
> +}
> +
> +void exynos_ppmu_stop(void __iomem *ppmu_base)
> +{
> +	__raw_writel(PPMU_DISABLE, ppmu_base);
> +}
> +
> +unsigned int exynos_ppmu_read(void __iomem *ppmu_base, unsigned int ch)
> +{
> +	unsigned int total;
> +
> +	if (ch == PPMU_PMNCNT3)
> +		total = ((__raw_readl(ppmu_base + PMCNT_OFFSET(ch)) << 8) |
> +			  __raw_readl(ppmu_base + PMCNT_OFFSET(ch + 1)));
> +	else
> +		total = __raw_readl(ppmu_base + PMCNT_OFFSET(ch));
> +
> +	return total;
> +}
> diff --git a/drivers/devfreq/exynos_ppmu.h b/drivers/devfreq/exynos_ppmu.h
> new file mode 100644
> index 0000000..b46d31b
> --- /dev/null
> +++ b/drivers/devfreq/exynos_ppmu.h
> @@ -0,0 +1,79 @@
> +/*
> + * Copyright (c) 2012 Samsung Electronics Co., Ltd.
> + *		http://www.samsung.com/
> + *
> + * EXYNOS PPMU header
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> +*/
> +
> +#ifndef __DEVFREQ_EXYNOS_PPMU_H
> +#define __DEVFREQ_EXYNOS_PPMU_H __FILE__
> +
> +#include <linux/ktime.h>
> +
> +/* For PPMU Control */
> +#define PPMU_ENABLE             BIT(0)
> +#define PPMU_DISABLE            0x0
> +#define PPMU_CYCLE_RESET        BIT(1)
> +#define PPMU_COUNTER_RESET      BIT(2)
> +
> +#define PPMU_ENABLE_COUNT0      BIT(0)
> +#define PPMU_ENABLE_COUNT1      BIT(1)
> +#define PPMU_ENABLE_COUNT2      BIT(2)
> +#define PPMU_ENABLE_COUNT3      BIT(3)
> +#define PPMU_ENABLE_CYCLE       BIT(31)
> +
> +#define PPMU_CNTENS		0x10
> +#define PPMU_FLAG		0x50
> +#define PPMU_CCNT_OVERFLOW	BIT(31)
> +#define PPMU_CCNT		0x100
> +
> +#define PPMU_PMCNT0		0x110
> +#define PPMU_PMCNT_OFFSET	0x10
> +#define PMCNT_OFFSET(x)		(PPMU_PMCNT0 + (PPMU_PMCNT_OFFSET *
x))
> +
> +#define PPMU_BEVT0SEL		0x1000
> +#define PPMU_BEVTSEL_OFFSET	0x100
> +#define PPMU_BEVTSEL(x)		(PPMU_BEVT0SEL + (ch *
> PPMU_BEVTSEL_OFFSET))
> +
> +/* For Event Selection */
> +#define RD_DATA_COUNT		0x5
> +#define WR_DATA_COUNT		0x6
> +#define RDWR_DATA_COUNT		0x7
> +
> +enum ppmu_counter {
> +	PPMU_PMNCNT0,
> +	PPMU_PMCCNT1,
> +	PPMU_PMNCNT2,
> +	PPMU_PMNCNT3,
> +	PPMU_PMNCNT_MAX,
> +};
> +
> +struct bus_opp_table {
> +	unsigned int idx;
> +	unsigned long clk;
> +	unsigned long volt;
> +};
> +
> +struct exynos_ppmu {
> +	void __iomem *hw_base;
> +	unsigned int ccnt;
> +	unsigned int event[PPMU_PMNCNT_MAX];
> +	unsigned int count[PPMU_PMNCNT_MAX];
> +	unsigned long long ns;
> +	ktime_t reset_time;
> +	bool ccnt_overflow;
> +	bool count_overflow[PPMU_PMNCNT_MAX];
> +};
> +
> +void exynos_ppmu_reset(void __iomem *ppmu_base);
> +void exynos_ppmu_setevent(void __iomem *ppmu_base, unsigned int ch,
> +			unsigned int evt);
> +void exynos_ppmu_start(void __iomem *ppmu_base);
> +void exynos_ppmu_stop(void __iomem *ppmu_base);
> +unsigned int exynos_ppmu_read(void __iomem *ppmu_base, unsigned int ch);
> +#endif /* __DEVFREQ_EXYNOS_PPMU_H */
> +
> --
> 1.6.6.1


^ permalink raw reply

* [RFC PATCH 10/10] cpu: No more __stop_machine() in _cpu_down()
From: Srivatsa S. Bhat @ 2012-12-04  8:56 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

From: Paul E. McKenney <paul.mckenney@linaro.org>

The _cpu_down() function invoked as part of the CPU-hotplug offlining
process currently invokes __stop_machine(), which is slow and inflicts
substantial real-time latencies on the entire system.  This patch
substitutes stop_cpus() for __stop_machine() in order to improve
both performance and real-time latency.

This is currently unsafe, because there are a number of uses of
preempt_disable() that are intended to block CPU-hotplug offlining.
These will be fixed by using get/put_online_cpus_stable_atomic(), but in the
meantime, this commit is one way to help locate them.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ Srivatsa: Refer to get/put_online_cpus_stable_atomic() in the changelog ]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/cpu.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aaf2393..59592e5 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -352,15 +352,20 @@ static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
 	int err, cpu = (long)(param->hcpu);
+	unsigned long flags;
 
 	prepare_cpu_take_down(cpu);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
+	local_irq_save(flags);
 	err = __cpu_disable();
-	if (err < 0)
+	if (err < 0) {
+		local_irq_restore(flags);
 		return err;
+	}
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
+	local_irq_restore(flags);
 	return 0;
 }
 
@@ -393,7 +398,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	}
 	smpboot_park_threads(cpu);
 
-	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
+	err = stop_cpus(cpumask_of(cpu), take_cpu_down, &tcd_param);
 	if (err) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
 		smpboot_unpark_threads(cpu);


^ permalink raw reply related

* [RFC PATCH 04/10] sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq()
From: Srivatsa S. Bhat @ 2012-12-04  8:54 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

With stop_machine() gone from the CPU offline path, we can't depend on
preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_stable_atomic() APIs to prevent CPUs from going
offline, while invoking from atomic context.

Scheduler functions such as try_to_wake_up() and select_task_rq() (and even
select_fallback_rq()) deal with picking new CPUs to run tasks. It would be
better if they picked those CPUs from the set of CPUs that are going to
remain online. So use the cpu_online_stable_mask while making those decisions.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |   20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..ef6ada4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1103,6 +1103,10 @@ EXPORT_SYMBOL_GPL(kick_process);
 #ifdef CONFIG_SMP
 /*
  * ->cpus_allowed is protected by both rq->lock and p->pi_lock
+ *
+ *  Must be called under get/put_online_cpus_stable_atomic() or
+ *  equivalent, to avoid CPUs from going offline from underneath
+ *  us.
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
@@ -1112,7 +1116,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 
 	/* Look for allowed, online CPU in same node. */
 	for_each_cpu(dest_cpu, nodemask) {
-		if (!cpu_online(dest_cpu))
+		if (!cpu_online_stable(dest_cpu))
 			continue;
 		if (!cpu_active(dest_cpu))
 			continue;
@@ -1123,7 +1127,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 	for (;;) {
 		/* Any allowed, online CPU? */
 		for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
-			if (!cpu_online(dest_cpu))
+			if (!cpu_online_stable(dest_cpu))
 				continue;
 			if (!cpu_active(dest_cpu))
 				continue;
@@ -1166,6 +1170,9 @@ out:
 
 /*
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
+ *
+ * Must be called under get/put_online_cpus_stable_atomic(), to prevent
+ * CPUs from going offline from underneath us.
  */
 static inline
 int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
@@ -1183,7 +1190,7 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 	 *   not worry about this generic constraint ]
 	 */
 	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
-		     !cpu_online(cpu)))
+		     !cpu_online_stable(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
 	return cpu;
@@ -1406,6 +1413,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	int cpu, success = 0;
 
 	smp_wmb();
+	get_online_cpus_stable_atomic();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -1446,6 +1454,7 @@ stat:
 	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	put_online_cpus_stable_atomic();
 
 	return success;
 }
@@ -1624,6 +1633,7 @@ void wake_up_new_task(struct task_struct *p)
 	unsigned long flags;
 	struct rq *rq;
 
+	get_online_cpus_stable_atomic();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_SMP
 	/*
@@ -1644,6 +1654,7 @@ void wake_up_new_task(struct task_struct *p)
 		p->sched_class->task_woken(rq, p);
 #endif
 	task_rq_unlock(rq, p, &flags);
+	put_online_cpus_stable_atomic();
 }
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -2541,6 +2552,7 @@ void sched_exec(void)
 	unsigned long flags;
 	int dest_cpu;
 
+	get_online_cpus_stable_atomic();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
@@ -2550,11 +2562,13 @@ void sched_exec(void)
 		struct migration_arg arg = { p, dest_cpu };
 
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		put_online_cpus_stable_atomic();
 		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
 		return;
 	}
 unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	put_online_cpus_stable_atomic();
 }
 
 #endif


^ permalink raw reply related

* [RFC PATCH 03/10] smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly
From: Srivatsa S. Bhat @ 2012-12-04  8:54 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

With stop_machine() gone from the CPU offline path, we can't depend on
preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_stable_atomic() APIs to prevent CPUs from going
offline, while invoking from atomic context.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/smp.c |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 581727c..ea81293 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -688,12 +688,12 @@ int on_each_cpu(void (*func) (void *info), void *info, int wait)
 	unsigned long flags;
 	int ret = 0;
 
-	preempt_disable();
+	get_online_cpus_stable_atomic();
 	ret = smp_call_function(func, info, wait);
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
-	preempt_enable();
+	put_online_cpus_stable_atomic();
 	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
@@ -715,7 +715,11 @@ EXPORT_SYMBOL(on_each_cpu);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 			void *info, bool wait)
 {
-	int cpu = get_cpu();
+	int cpu;
+
+	get_online_cpus_stable_atomic();
+
+	cpu = smp_processor_id();
 
 	smp_call_function_many(mask, func, info, wait);
 	if (cpumask_test_cpu(cpu, mask)) {
@@ -723,7 +727,7 @@ void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 		func(info);
 		local_irq_enable();
 	}
-	put_cpu();
+	put_online_cpus_stable_atomic();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
 
@@ -748,8 +752,10 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * The function might sleep if the GFP flags indicates a non
  * atomic allocation is allowed.
  *
- * Preemption is disabled to protect against CPUs going offline but not online.
- * CPUs going online during the call will not be seen or sent an IPI.
+ * We use get/put_online_cpus_stable_atomic() to have a stable online mask
+ * to work with, whose CPUs won't go offline in-between our operation.
+ * And we will skip those CPUs which have already begun their offline journey.
+ * CPUs coming online during the call will not be seen or sent an IPI.
  *
  * You must not call this function with disabled interrupts or
  * from a hardware interrupt handler or from a bottom half handler.
@@ -764,26 +770,26 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 	might_sleep_if(gfp_flags & __GFP_WAIT);
 
 	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
-		preempt_disable();
-		for_each_online_cpu(cpu)
+		get_online_cpus_stable_atomic();
+		for_each_online_cpu_stable(cpu)
 			if (cond_func(cpu, info))
 				cpumask_set_cpu(cpu, cpus);
 		on_each_cpu_mask(cpus, func, info, wait);
-		preempt_enable();
+		put_online_cpus_stable_atomic();
 		free_cpumask_var(cpus);
 	} else {
 		/*
 		 * No free cpumask, bother. No matter, we'll
 		 * just have to IPI them one by one.
 		 */
-		preempt_disable();
-		for_each_online_cpu(cpu)
+		get_online_cpus_stable_atomic();
+		for_each_online_cpu_stable(cpu)
 			if (cond_func(cpu, info)) {
 				ret = smp_call_function_single(cpu, func,
 								info, wait);
 				WARN_ON_ONCE(!ret);
 			}
-		preempt_enable();
+		put_online_cpus_stable_atomic();
 	}
 }
 EXPORT_SYMBOL(on_each_cpu_cond);


^ permalink raw reply related

* [RFC PATCH 09/10] KVM: VMX: fix unsyc vmcs status when cpu is going down
From: Srivatsa S. Bhat @ 2012-12-04  8:56 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

When vcpu is scheduled to the different cpu, it should sent IPI to
the cpu which the vcpu ran to clear the vcpu->vmcs

It is safe since cpu-offline path can not concurrently run with
other cpu. After implementing stop_machine()-free, smp_call_function_sing
will return -ENXIO immediately when the target cpu is going down, that
means, we can load the vmcs but it is not cleared on target cpu and corrupt
per_cpu(loaded_vmcss_on_cpu, cpu). This bug can be triggered under this case:

# general protection fault: 0000 [#1] PREEMPT SMP

[......]

Call Trace:
 [<ffffffffa052980f>] kvm_arch_hardware_disable+0x1f/0x50 [kvm]
 [<ffffffffa050ef43>] hardware_disable_nolock+0x33/0x40 [kvm]
 [<ffffffffa050efa3>] kvm_cpu_hotplug+0x53/0xb0 [kvm]
 [<ffffffff81548b1d>] notifier_call_chain+0x4d/0x70
 [<ffffffff81517fe0>] ? spp_getpage+0xb0/0xb0
 [<ffffffff8108459e>] __raw_notifier_call_chain+0xe/0x10
 [<ffffffff810599f0>] __cpu_notify+0x20/0x40
 [<ffffffff8151802e>] take_cpu_down+0x4e/0x90
 [<ffffffff810d184b>] cpu_stopper_thread+0xdb/0x1d0
 [<ffffffff8108b3ce>] ? finish_task_switch+0x4e/0xe0
 [<ffffffff815438d0>] ? __schedule+0x460/0x740
 [<ffffffff810d1770>] ? cpu_stop_signal_done+0x40/0x40
 [<ffffffff8107de30>] kthread+0xc0/0xd0
 [<ffffffff8107dd70>] ? flush_kthread_worker+0xb0/0xb0
 [<ffffffff8154cc6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8107dd70>] ? flush_kthread_worker+0xb0/0xb0

Fix it by waiting for target vcpu to clear the vmcs

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/kvm/vmx.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 95e502b..4fb4e51 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1019,8 +1019,15 @@ static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 	int cpu = loaded_vmcs->cpu;
 
 	if (cpu != -1)
-		smp_call_function_single(cpu,
-			 __loaded_vmcs_clear, loaded_vmcs, 1);
+		if (smp_call_function_single(cpu,
+			 __loaded_vmcs_clear, loaded_vmcs, 1))
+
+			/*
+			 * The target cpu is going down, we should
+			 * wait for it to clear the vmcs status.
+			 */
+			while (ACCESS_ONCE(loaded_vmcs->cpu) != -1)
+				cpu_relax();
 }
 
 static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)

^ permalink raw reply related

* [RFC PATCH 02/10] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly
From: Srivatsa S. Bhat @ 2012-12-04  8:54 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

From: Michael Wang <wangyun@linux.vnet.ibm.com>

With stop_machine() gone from the CPU offline path, we can't depend on
preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_stable_atomic() APIs to prevent CPUs from going
offline, while invoking from atomic context.

[ Michael: Designed the synchronization for the IPI case ]
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
[ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/smp.c |   54 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 33 insertions(+), 21 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 29dd40a..581727c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -310,7 +310,8 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 	 * prevent preemption and reschedule on another processor,
 	 * as well as CPU removal
 	 */
-	this_cpu = get_cpu();
+	get_online_cpus_stable_atomic();
+	this_cpu = smp_processor_id();
 
 	/*
 	 * Can deadlock when called with interrupts disabled.
@@ -326,7 +327,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 		func(info);
 		local_irq_restore(flags);
 	} else {
-		if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
+		if ((unsigned)cpu < nr_cpu_ids && cpu_online_stable(cpu)) {
 			struct call_single_data *data = &d;
 
 			if (!wait)
@@ -342,7 +343,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 		}
 	}
 
-	put_cpu();
+	put_online_cpus_stable_atomic();
 
 	return err;
 }
@@ -371,8 +372,10 @@ int smp_call_function_any(const struct cpumask *mask,
 	const struct cpumask *nodemask;
 	int ret;
 
+	get_online_cpus_stable_atomic();
 	/* Try for same CPU (cheapest) */
-	cpu = get_cpu();
+	cpu = smp_processor_id();
+
 	if (cpumask_test_cpu(cpu, mask))
 		goto call;
 
@@ -380,15 +383,15 @@ int smp_call_function_any(const struct cpumask *mask,
 	nodemask = cpumask_of_node(cpu_to_node(cpu));
 	for (cpu = cpumask_first_and(nodemask, mask); cpu < nr_cpu_ids;
 	     cpu = cpumask_next_and(cpu, nodemask, mask)) {
-		if (cpu_online(cpu))
+		if (cpu_online_stable(cpu))
 			goto call;
 	}
 
 	/* Any online will do: smp_call_function_single handles nr_cpu_ids. */
-	cpu = cpumask_any_and(mask, cpu_online_mask);
+	cpu = cpumask_any_and(mask, cpu_online_stable_mask);
 call:
 	ret = smp_call_function_single(cpu, func, info, wait);
-	put_cpu();
+	put_online_cpus_stable_atomic();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
@@ -409,14 +412,17 @@ void __smp_call_function_single(int cpu, struct call_single_data *data,
 	unsigned int this_cpu;
 	unsigned long flags;
 
-	this_cpu = get_cpu();
+	get_online_cpus_stable_atomic();
+
+	this_cpu = smp_processor_id();
+
 	/*
 	 * Can deadlock when called with interrupts disabled.
 	 * We allow cpu's that are not yet online though, as no one else can
 	 * send smp call function interrupt to this cpu and as such deadlocks
 	 * can't happen.
 	 */
-	WARN_ON_ONCE(cpu_online(smp_processor_id()) && wait && irqs_disabled()
+	WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
 		     && !oops_in_progress);
 
 	if (cpu == this_cpu) {
@@ -427,7 +433,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data,
 		csd_lock(data);
 		generic_exec_single(cpu, data, wait);
 	}
-	put_cpu();
+	put_online_cpus_stable_atomic();
 }
 
 /**
@@ -451,6 +457,8 @@ void smp_call_function_many(const struct cpumask *mask,
 	unsigned long flags;
 	int refs, cpu, next_cpu, this_cpu = smp_processor_id();
 
+	get_online_cpus_stable_atomic();
+
 	/*
 	 * Can deadlock when called with interrupts disabled.
 	 * We allow cpu's that are not yet online though, as no one else can
@@ -461,23 +469,24 @@ void smp_call_function_many(const struct cpumask *mask,
 		     && !oops_in_progress && !early_boot_irqs_disabled);
 
 	/* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
-	cpu = cpumask_first_and(mask, cpu_online_mask);
+	cpu = cpumask_first_and(mask, cpu_online_stable_mask);
 	if (cpu == this_cpu)
-		cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+		cpu = cpumask_next_and(cpu, mask, cpu_online_stable_mask);
 
 	/* No online cpus?  We're done. */
 	if (cpu >= nr_cpu_ids)
-		return;
+		goto out_unlock;
 
 	/* Do we have another CPU which isn't us? */
-	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+	next_cpu = cpumask_next_and(cpu, mask, cpu_online_stable_mask);
 	if (next_cpu == this_cpu)
-		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
+		next_cpu = cpumask_next_and(next_cpu, mask,
+						cpu_online_stable_mask);
 
 	/* Fastpath: do that cpu by itself. */
 	if (next_cpu >= nr_cpu_ids) {
 		smp_call_function_single(cpu, func, info, wait);
-		return;
+		goto out_unlock;
 	}
 
 	data = &__get_cpu_var(cfd_data);
@@ -516,14 +525,14 @@ void smp_call_function_many(const struct cpumask *mask,
 	smp_wmb();
 
 	/* We rely on the "and" being processed before the store */
-	cpumask_and(data->cpumask, mask, cpu_online_mask);
+	cpumask_and(data->cpumask, mask, cpu_online_stable_mask);
 	cpumask_clear_cpu(this_cpu, data->cpumask);
 	refs = cpumask_weight(data->cpumask);
 
 	/* Some callers race with other cpus changing the passed mask */
 	if (unlikely(!refs)) {
 		csd_unlock(&data->csd);
-		return;
+		goto out_unlock;
 	}
 
 	raw_spin_lock_irqsave(&call_function.lock, flags);
@@ -554,6 +563,9 @@ void smp_call_function_many(const struct cpumask *mask,
 	/* Optionally wait for the CPUs to complete */
 	if (wait)
 		csd_lock_wait(&data->csd);
+
+out_unlock:
+	put_online_cpus_stable_atomic();
 }
 EXPORT_SYMBOL(smp_call_function_many);
 
@@ -574,9 +586,9 @@ EXPORT_SYMBOL(smp_call_function_many);
  */
 int smp_call_function(smp_call_func_t func, void *info, int wait)
 {
-	preempt_disable();
-	smp_call_function_many(cpu_online_mask, func, info, wait);
-	preempt_enable();
+	get_online_cpus_stable_atomic();
+	smp_call_function_many(cpu_online_stable_mask, func, info, wait);
+	put_online_cpus_stable_atomic();
 
 	return 0;
 }


^ permalink raw reply related

* [RFC PATCH 08/10] KVM: VMX: fix memory order between loading vmcs and clearing vmcs
From: Srivatsa S. Bhat @ 2012-12-04  8:55 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

vmcs->cpu indicates whether it exists on the target cpu, -1 means the vmcs
does not exist on any vcpu

If vcpu load vmcs with vmcs.cpu = -1, it can be directly added to cpu's percpu
list. The list can be corrupted if the cpu prefetch the vmcs's list before
reading vmcs->cpu. Meanwhile, we should remove vmcs from the list before
making vmcs->vcpu == -1 be visible

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/kvm/vmx.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9dc562a..95e502b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1002,6 +1002,15 @@ static void __loaded_vmcs_clear(void *arg)
 	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
 	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
+
+	/*
+	 * we should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link
+	 * is before setting loaded_vmcs->vcpu to -1 which is done in
+	 * loaded_vmcs_init. Otherwise, other cpu can see vcpu = -1 fist
+	 * then adds the vmcs into percpu list before it is deleted.
+	 */
+	smp_wmb();
+
 	loaded_vmcs_init(loaded_vmcs);
 }
 
@@ -1537,6 +1546,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
+
+		/*
+		 * Read loaded_vmcs->cpu should be before fetching
+		 * loaded_vmcs->loaded_vmcss_on_cpu_link.
+		 * See the comments in __loaded_vmcs_clear().
+		 */
+		smp_rmb();
+
 		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
 			 &per_cpu(loaded_vmcss_on_cpu, cpu));
 		local_irq_enable();

^ permalink raw reply related

* [RFC PATCH 01/10] CPU hotplug: Introduce "stable" cpu online mask, for atomic hotplug readers
From: Srivatsa S. Bhat @ 2012-12-04  8:53 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

From: Michael Wang <wangyun@linux.vnet.ibm.com>

There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them as "atomic
hotplug readers" (atomic because they run in atomic contexts).

Often, these atomic hotplug readers have a simple need : they want the cpu
online mask that they work with (inside their critical section), to be
stable, i.e., it should be guaranteed that CPUs in that mask won't go
offline during the critical section. An important point here is that they
don't really care if such a "stable" mask is a subset of the actual
cpu_online_mask.

The intent of this patch is to provide such a "stable" cpu online mask
for that class of atomic hotplug readers.

Fundamental idea behind the design:
-----------------------------------

Simply put, have a separate mask called the stable cpu online mask; and
at the hotplug writer (cpu_down()), note down the CPU that is about to go
offline, and remove it from the stable cpu online mask. Then, feel free
to take that CPU offline, since the atomic hotplug readers won't see it
from now on. Also, don't start any new cpu_down() operations until all
existing atomic hotplug readers have completed (because they might still
be working with the old value of the stable online mask).

Some important design requirements and considerations:
-----------------------------------------------------

1. The atomic hotplug readers should ideally *never* wait for the hotplug
   writer (cpu_down()) for *anything*. Because, these atomic hotplug readers
   can be in very hot-paths like interrupt handling/IPI and hence, if they
   have to wait for an ongoing cpu_down() to complete, it would pretty much
   introduce the same performance/latency problems as stop_machine().

2. Any synchronization at the atomic hotplug readers side must be highly
   scalable - avoid global locks/counters etc. Because, these paths currently
   use the extremely fast preempt_disable(); our replacement to
   preempt_disable() should not become ridiculously costly.

3. preempt_disable() was recursive. The replacement should also be recursive.

Implementation of the design:
----------------------------

Atomic hotplug reader side:

We use per-cpu counters to mark the presence of atomic hotplug readers.
A reader would increment its per-cpu counter and continue, without waiting
for anything. And no locks are used in this path. Together, these satisfy
all the 3 requirements mentioned above.

The hotplug writer uses (reads) the per-cpu counters of all CPUs in order to
ensure that all existing atomic hotplug readers have completed. Only after
that, it will proceed to actually take the CPU offline.

[ Michael: Designed the synchronization for the IPI case ]
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
[ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h     |    4 +
 include/linux/cpumask.h |    5 ++
 kernel/cpu.c            |  129 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..c64b6ed 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_stable_atomic(void);
+extern void put_online_cpus_stable_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_stable_atomic()	do { } while (0)
+#define put_online_cpus_stable_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 0325602..2148e60 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -78,6 +78,7 @@ extern int nr_cpu_ids;
 
 extern const struct cpumask *const cpu_possible_mask;
 extern const struct cpumask *const cpu_online_mask;
+extern const struct cpumask *const cpu_online_stable_mask;
 extern const struct cpumask *const cpu_present_mask;
 extern const struct cpumask *const cpu_active_mask;
 
@@ -87,6 +88,7 @@ extern const struct cpumask *const cpu_active_mask;
 #define num_present_cpus()	cpumask_weight(cpu_present_mask)
 #define num_active_cpus()	cpumask_weight(cpu_active_mask)
 #define cpu_online(cpu)		cpumask_test_cpu((cpu), cpu_online_mask)
+#define cpu_online_stable(cpu)	cpumask_test_cpu((cpu), cpu_online_stable_mask)
 #define cpu_possible(cpu)	cpumask_test_cpu((cpu), cpu_possible_mask)
 #define cpu_present(cpu)	cpumask_test_cpu((cpu), cpu_present_mask)
 #define cpu_active(cpu)		cpumask_test_cpu((cpu), cpu_active_mask)
@@ -705,12 +707,15 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
 
 #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
 #define for_each_online_cpu(cpu)   for_each_cpu((cpu), cpu_online_mask)
+#define for_each_online_cpu_stable(cpu)					  \
+				for_each_cpu((cpu), cpu_online_stable_mask)
 #define for_each_present_cpu(cpu)  for_each_cpu((cpu), cpu_present_mask)
 
 /* Wrappers for arch boot code to manipulate normally-constant masks */
 void set_cpu_possible(unsigned int cpu, bool possible);
 void set_cpu_present(unsigned int cpu, bool present);
 void set_cpu_online(unsigned int cpu, bool online);
+void set_cpu_online_stable(unsigned int cpu, bool online);
 void set_cpu_active(unsigned int cpu, bool active);
 void init_cpu_present(const struct cpumask *src);
 void init_cpu_possible(const struct cpumask *src);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..aaf2393 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,6 +49,73 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+/*
+ * Per-cpu counter to mark the presence of active atomic hotplug readers
+ * (those that run in atomic context and want to prevent CPUs from going
+ * offline).
+ */
+static DEFINE_PER_CPU(int, hotplug_reader_refcount);
+
+/*
+ * Hotplug readers (those that want to prevent CPUs from going offline)
+ * sometimes run from atomic contexts, and hence can't use
+ * get/put_online_cpus() because they can sleep. And often-times, all
+ * they really want is a stable (unchanging) online mask to work with, which
+ * could be a subset of the actual cpu_online_mask, but with a guarantee
+ * that all the CPUs in that stable mask stay online throughout the
+ * hotplug-read-side critical section.
+ *
+ * In such cases, these atomic hotplug readers can use the pair
+ * get/put_online_cpus_stable_atomic() around their critical section to
+ * ensure that the stable mask 'cpu_online_stable_mask' remains unaltered
+ * throughout that critical section. And of course, they should only use
+ * the stable mask in their critical section, and not the actual
+ * cpu_online_mask!
+ *
+ * Eg:
+ *
+ * Atomic hotplug read-side critical section:
+ * -----------------------------------------
+ *
+ * get_online_cpus_stable_atomic();
+ *
+ * for_each_online_cpu_stable(cpu) {
+ *         ... Do something...
+ * }
+ *    ...
+ *
+ * if (cpu_online_stable(other_cpu))
+ *         do_something();
+ *
+ * put_online_cpus_stable_atomic();
+ *
+ * You can call this function recursively.
+ */
+void get_online_cpus_stable_atomic(void)
+{
+	preempt_disable();
+	this_cpu_inc(hotplug_reader_refcount);
+
+	/*
+	 * Prevent reordering of writes to hotplug_reader_refcount and
+	 * reads from cpu_online_stable_mask.
+	 */
+	smp_mb();
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
+
+void put_online_cpus_stable_atomic(void)
+{
+	/*
+	 * Prevent reordering of reads from cpu_online_stable_mask and
+	 * writes to hotplug_reader_refcount.
+	 */
+	smp_mb();
+	this_cpu_dec(hotplug_reader_refcount);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
+
 static struct {
 	struct task_struct *active_writer;
 	struct mutex lock; /* Synchronizes accesses to refcount, */
@@ -237,6 +304,44 @@ static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+/*
+ * We want all atomic hotplug readers to refer to the new value of
+ * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
+ * to complete. Any new atomic hotplug readers will see the updated mask
+ * and hence pose no problems.
+ */
+static void sync_hotplug_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu) {
+		while (per_cpu(hotplug_reader_refcount, cpu))
+			cpu_relax();
+	}
+}
+
+/*
+ * We are serious about taking this CPU down. So clear the CPU from the
+ * stable online mask.
+ */
+static void prepare_cpu_take_down(unsigned int cpu)
+{
+	set_cpu_online_stable(cpu, false);
+
+	/*
+	 * Prevent reordering of writes to cpu_online_stable_mask and reads
+	 * from hotplug_reader_refcount.
+	 */
+	smp_mb();
+
+	/*
+	 * Wait for all active hotplug readers to complete, to ensure that
+	 * there are no critical sections still referring to the old stable
+	 * online mask.
+	 */
+	sync_hotplug_readers();
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,7 +351,9 @@ struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	int err, cpu = (long)(param->hcpu);
+
+	prepare_cpu_take_down(cpu);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
@@ -670,6 +777,11 @@ static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
 const struct cpumask *const cpu_online_mask = to_cpumask(cpu_online_bits);
 EXPORT_SYMBOL(cpu_online_mask);
 
+static DECLARE_BITMAP(cpu_online_stable_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_online_stable_mask =
+					to_cpumask(cpu_online_stable_bits);
+EXPORT_SYMBOL(cpu_online_stable_mask);
+
 static DECLARE_BITMAP(cpu_present_bits, CONFIG_NR_CPUS) __read_mostly;
 const struct cpumask *const cpu_present_mask = to_cpumask(cpu_present_bits);
 EXPORT_SYMBOL(cpu_present_mask);
@@ -694,12 +806,26 @@ void set_cpu_present(unsigned int cpu, bool present)
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_present_bits));
 }
 
+void set_cpu_online_stable(unsigned int cpu, bool online)
+{
+	if (online)
+		cpumask_set_cpu(cpu, to_cpumask(cpu_online_stable_bits));
+	else
+		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_stable_bits));
+}
+
 void set_cpu_online(unsigned int cpu, bool online)
 {
 	if (online)
 		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
 	else
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
+
+	/*
+	 * Any changes to the online mask must also be propagated to the
+	 * stable online mask.
+	 */
+	set_cpu_online_stable(cpu, online);
 }
 
 void set_cpu_active(unsigned int cpu, bool active)
@@ -723,4 +849,5 @@ void init_cpu_possible(const struct cpumask *src)
 void init_cpu_online(const struct cpumask *src)
 {
 	cpumask_copy(to_cpumask(cpu_online_bits), src);
+	cpumask_copy(to_cpumask(cpu_online_stable_bits), src);
 }


^ permalink raw reply related

* [RFC PATCH 07/10] KVM: VMX: fix invalid cpu passed to smp_call_function_single
From: Srivatsa S. Bhat @ 2012-12-04  8:55 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

In loaded_vmcs_clear, loaded_vmcs->cpu is the fist parameter passed to
smp_call_function_single, if the target cpu is downing (doing cpu hot remove),
loaded_vmcs->cpu can become -1 then -1 is passed to smp_call_function_single

It can be triggered when vcpu is being destroyed, loaded_vmcs_clear is called
in the preemptionable context

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/kvm/vmx.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f858159..9dc562a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1007,9 +1007,11 @@ static void __loaded_vmcs_clear(void *arg)
 
 static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 {
-	if (loaded_vmcs->cpu != -1)
-		smp_call_function_single(
-			loaded_vmcs->cpu, __loaded_vmcs_clear, loaded_vmcs, 1);
+	int cpu = loaded_vmcs->cpu;
+
+	if (cpu != -1)
+		smp_call_function_single(cpu,
+			 __loaded_vmcs_clear, loaded_vmcs, 1);
 }
 
 static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)

^ permalink raw reply related

* [RFC PATCH 06/10] yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly
From: Srivatsa S. Bhat @ 2012-12-04  8:55 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

With stop_machine() gone from the CPU offline path, we can't depend on
local_irq_save() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_stable_atomic() APIs to prevent CPUs from
going offline, while invoking from atomic context. And use the stable
online mask while interacting with other CPUs.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c1a5d93..d0b159b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4312,6 +4312,7 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	unsigned long flags;
 	bool yielded = 0;
 
+	get_online_cpus_stable_atomic();
 	local_irq_save(flags);
 	rq = this_rq();
 
@@ -4339,13 +4340,14 @@ again:
 		 * Make p's CPU reschedule; pick_next_entity takes care of
 		 * fairness.
 		 */
-		if (preempt && rq != p_rq)
+		if (preempt && rq != p_rq && cpu_online_stable(task_cpu(p)))
 			resched_task(p_rq->curr);
 	}
 
 out:
 	double_rq_unlock(rq, p_rq);
 	local_irq_restore(flags);
+	put_online_cpus_stable_atomic();
 
 	if (yielded)
 		schedule();

^ permalink raw reply related

* [RFC PATCH 05/10] kick_process(), cpu-hotplug: Prevent offlining of target CPU properly
From: Srivatsa S. Bhat @ 2012-12-04  8:55 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085149.25919.29920.stgit@srivatsabhat.in.ibm.com>

With stop_machine() gone from the CPU offline path, we can't depend on
preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_stable_atomic() APIs to prevent CPUs from going
offline, while invoking from atomic context.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef6ada4..c1a5d93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1091,11 +1091,11 @@ void kick_process(struct task_struct *p)
 {
 	int cpu;
 
-	preempt_disable();
+	get_online_cpus_stable_atomic();
 	cpu = task_cpu(p);
 	if ((cpu != smp_processor_id()) && task_curr(p))
 		smp_send_reschedule(cpu);
-	preempt_enable();
+	put_online_cpus_stable_atomic();
 }
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */

^ permalink raw reply related

* [RFC PATCH 00/10] CPU hotplug: stop_machine()-free CPU hotplug
From: Srivatsa S. Bhat @ 2012-12-04  8:53 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot
  Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

Hi,

This patchset removes CPU hotplug's dependence on stop_machine() from the CPU
offline path and provides an alternative (set of APIs) to preempt_disable() to
prevent CPUs from going offline, which can be invoked from atomic context.

This is an RFC patchset with only a few call-sites of preempt_disable()
converted to the new APIs for now, and the main goal is to get feedback on the
design of the new atomic APIs and see if it serves as a viable replacement for
stop_machine()-free CPU hotplug.

Overview of the patches:
-----------------------

Patch 1 introduces the new APIs that can be used from atomic context, to
prevent CPUs from going offline.

Patches 2 to 6 convert various call-sites to use the new APIs.

Patches 7, 8 and 9 fix a KVM issue that comes into picture when we remove
stop_machine() from the CPU hotplug path. (Actually, patches 7 and 8 are
already in the kvm tree. Patch 9 is the fix we need, but I preserved the
other 2 as well so that the patches can apply easily without external
dependencies).

Patch 10 is the one which actually removes stop_machine() from the CPU
offline path.

Comments and suggestions welcome!

--
 Michael Wang (2):
      CPU hotplug: Introduce "stable" cpu online mask, for atomic hotplug readers
      smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly

Paul E. McKenney (1):
      cpu: No more __stop_machine() in _cpu_down()

Srivatsa S. Bhat (4):
      smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly
      sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq()
      kick_process(), cpu-hotplug: Prevent offlining of target CPU properly
      yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly

Xiao Guangrong (3):
      KVM: VMX: fix invalid cpu passed to smp_call_function_single
      KVM: VMX: fix memory order between loading vmcs and clearing vmcs
      KVM: VMX: fix unsyc vmcs status when cpu is going down

  arch/x86/kvm/vmx.c      |   32 ++++++++++-
 include/linux/cpu.h     |    4 +
 include/linux/cpumask.h |    5 ++
 kernel/cpu.c            |  138 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c     |   28 +++++++---
 kernel/smp.c            |   84 +++++++++++++++++------------
 6 files changed, 246 insertions(+), 45 deletions(-)

Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center

^ permalink raw reply

* RE: [RFC PATCH 4/7] Thermal: Add Thermal_trip sysfs node
From: R, Durgadoss @ 2012-12-04  8:41 UTC (permalink / raw)
  To: Hongbo Zhang
  Cc: Zhang, Rui, linux-pm@vger.kernel.org, wni@nvidia.com,
	eduardo.valentin@ti.com, amit.kachhap@linaro.org,
	sachin.kamat@linaro.org
In-Reply-To: <CAJLyvQyd=RTZx0GF6_SvPFSnqfPgh93vSvZERXMF50TzEYyobw@mail.gmail.com>


> -----Original Message-----
> From: Hongbo Zhang [mailto:hongbo.zhang@linaro.org]
> Sent: Tuesday, December 04, 2012 2:01 PM
> To: R, Durgadoss
> Cc: Zhang, Rui; linux-pm@vger.kernel.org; wni@nvidia.com;
> eduardo.valentin@ti.com; amit.kachhap@linaro.org;
> sachin.kamat@linaro.org
> Subject: Re: [RFC PATCH 4/7] Thermal: Add Thermal_trip sysfs node
> 
> On 17 November 2012 18:45, Durgadoss R <durgadoss.r@intel.com> wrote:
> > This patch adds a thermal_trip directory under
> > /sys/class/thermal/zoneX. This directory contains
> > the trip point values for sensors bound to this
> > zone.
> >
> > Signed-off-by: Durgadoss R <durgadoss.r@intel.com>
> > ---
> >  drivers/thermal/thermal_sys.c |  244
> ++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/thermal.h       |   32 ++++++
> >  2 files changed, 274 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/thermal/thermal_sys.c b/drivers/thermal/thermal_sys.c
> > index 70e5f5a..011588a 100644
> > --- a/drivers/thermal/thermal_sys.c
> > +++ b/drivers/thermal/thermal_sys.c
> > @@ -480,9 +480,15 @@ static void remove_sensor_from_zone(struct
> thermal_zone *tz,
> >
> >         sysfs_remove_link(&tz->device.kobj, kobject_name(&ts-
> >device.kobj));
> >
> > +       /* Delete this sensor's trip Kobject */
> > +       kobject_del(tz->kobj_trip[indx]);
> > +
> >         /* Shift the entries in the tz->sensors array */
> > -       for (j = indx; j < MAX_SENSORS_PER_ZONE - 1; j++)
> > +       for (j = indx; j < MAX_SENSORS_PER_ZONE - 1; j++) {
> >                 tz->sensors[j] = tz->sensors[j + 1];
> > +               tz->trip[j] = tz->trip[j + 1];
> > +               tz->kobj_trip[j] = tz->kobj_trip[j + 1];
> > +       }
> >
> >         tz->sensor_indx--;
> >  }
> > @@ -505,6 +511,41 @@ static void remove_cdev_from_zone(struct
> thermal_zone *tz,
> >         tz->cdev_indx--;
> >  }
> >
> > +static struct thermal_zone *get_zone_by_trip_kobj(struct kobject *kobj)
> > +{
> > +       struct thermal_zone *pos;
> > +       struct thermal_zone *tz = NULL;
> > +
> > +       if (!kobj)
> > +               return NULL;
> > +
> > +       mutex_lock(&tz_list_lock);
> > +       for_each_thermal_zone(pos) {
> > +               if (pos->kobj_thermal_trip == kobj) {
> > +                       tz = pos;
> > +                       break;
> > +               }
> > +       }
> > +       mutex_unlock(&tz_list_lock);
> > +       return tz;
> > +}
> > +
> > +static struct thermal_sensor *get_sensor_by_kobj_name(const char
> *name)
> > +{
> > +       struct thermal_sensor *pos;
> > +       struct thermal_sensor *ts = NULL;
> > +
> > +       mutex_lock(&ts_list_lock);
> > +       for_each_thermal_sensor(pos) {
> > +               if (!strcmp(kobject_name(&pos->device.kobj), name)) {
> > +                       ts = pos;
> > +                       break;
> > +               }
> > +       }
> > +       mutex_unlock(&ts_list_lock);
> > +       return ts;
> > +}
> > +
> >  /* sys I/F for thermal zone */
> >
> >  #define to_thermal_zone(_dev) \
> > @@ -859,6 +900,113 @@ policy_show(struct device *dev, struct
> device_attribute *devattr, char *buf)
> >         return sprintf(buf, "%s\n", tz->governor->name);
> >  }
> >
> > +static ssize_t
> > +active_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> > +{
> > +       int i, indx, ret = 0;
> > +       struct thermal_sensor *ts;
> > +       struct thermal_zone *tz;
> > +
> > +       ts = get_sensor_by_kobj_name(kobject_name(kobj));
> > +       if (!ts)
> > +               return -EINVAL;
> > +
> > +       tz = get_zone_by_trip_kobj(kobj->parent);
> > +       if (!tz)
> > +               return -EINVAL;
> > +
> > +       indx = get_sensor_indx(tz, ts);
> > +       if (indx < 0)
> > +               return indx;
> > +
> > +       if (tz->trip[indx]->num_active_trips <= 0)
> > +               return sprintf(buf, "<Not available>\n");
> > +
> > +       ret += sprintf(buf, "0x%x", tz->trip[indx]->active_trip_mask);
> > +       for (i = 0; i < tz->trip[indx]->num_active_trips; i++) {
> > +               ret += sprintf(buf + ret, " %d",
> > +                       tz->trip[indx]->active_trips[i]);
> > +       }
> > +
> > +       ret += sprintf(buf + ret, "\n");
> > +       return ret;
> > +}
> > +
> > +static ssize_t
> > +ptrip_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> Just remind you it is better to use passive_show instead of ptrip_show
> when you remove the unused old codes.

Yes, you told the reason also :-)
I did that because we have another function named passive_show.
So, when we remove the old code, we will change this name also :-)

Thanks,
Durga

^ permalink raw reply

* Re: [RFC PATCH 4/7] Thermal: Add Thermal_trip sysfs node
From: Hongbo Zhang @ 2012-12-04  8:30 UTC (permalink / raw)
  To: Durgadoss R
  Cc: rui.zhang, linux-pm, wni, eduardo.valentin, amit.kachhap,
	sachin.kamat
In-Reply-To: <1353149158-19102-5-git-send-email-durgadoss.r@intel.com>

On 17 November 2012 18:45, Durgadoss R <durgadoss.r@intel.com> wrote:
> This patch adds a thermal_trip directory under
> /sys/class/thermal/zoneX. This directory contains
> the trip point values for sensors bound to this
> zone.
>
> Signed-off-by: Durgadoss R <durgadoss.r@intel.com>
> ---
>  drivers/thermal/thermal_sys.c |  244 ++++++++++++++++++++++++++++++++++++++++-
>  include/linux/thermal.h       |   32 ++++++
>  2 files changed, 274 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/thermal/thermal_sys.c b/drivers/thermal/thermal_sys.c
> index 70e5f5a..011588a 100644
> --- a/drivers/thermal/thermal_sys.c
> +++ b/drivers/thermal/thermal_sys.c
> @@ -480,9 +480,15 @@ static void remove_sensor_from_zone(struct thermal_zone *tz,
>
>         sysfs_remove_link(&tz->device.kobj, kobject_name(&ts->device.kobj));
>
> +       /* Delete this sensor's trip Kobject */
> +       kobject_del(tz->kobj_trip[indx]);
> +
>         /* Shift the entries in the tz->sensors array */
> -       for (j = indx; j < MAX_SENSORS_PER_ZONE - 1; j++)
> +       for (j = indx; j < MAX_SENSORS_PER_ZONE - 1; j++) {
>                 tz->sensors[j] = tz->sensors[j + 1];
> +               tz->trip[j] = tz->trip[j + 1];
> +               tz->kobj_trip[j] = tz->kobj_trip[j + 1];
> +       }
>
>         tz->sensor_indx--;
>  }
> @@ -505,6 +511,41 @@ static void remove_cdev_from_zone(struct thermal_zone *tz,
>         tz->cdev_indx--;
>  }
>
> +static struct thermal_zone *get_zone_by_trip_kobj(struct kobject *kobj)
> +{
> +       struct thermal_zone *pos;
> +       struct thermal_zone *tz = NULL;
> +
> +       if (!kobj)
> +               return NULL;
> +
> +       mutex_lock(&tz_list_lock);
> +       for_each_thermal_zone(pos) {
> +               if (pos->kobj_thermal_trip == kobj) {
> +                       tz = pos;
> +                       break;
> +               }
> +       }
> +       mutex_unlock(&tz_list_lock);
> +       return tz;
> +}
> +
> +static struct thermal_sensor *get_sensor_by_kobj_name(const char *name)
> +{
> +       struct thermal_sensor *pos;
> +       struct thermal_sensor *ts = NULL;
> +
> +       mutex_lock(&ts_list_lock);
> +       for_each_thermal_sensor(pos) {
> +               if (!strcmp(kobject_name(&pos->device.kobj), name)) {
> +                       ts = pos;
> +                       break;
> +               }
> +       }
> +       mutex_unlock(&ts_list_lock);
> +       return ts;
> +}
> +
>  /* sys I/F for thermal zone */
>
>  #define to_thermal_zone(_dev) \
> @@ -859,6 +900,113 @@ policy_show(struct device *dev, struct device_attribute *devattr, char *buf)
>         return sprintf(buf, "%s\n", tz->governor->name);
>  }
>
> +static ssize_t
> +active_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +       int i, indx, ret = 0;
> +       struct thermal_sensor *ts;
> +       struct thermal_zone *tz;
> +
> +       ts = get_sensor_by_kobj_name(kobject_name(kobj));
> +       if (!ts)
> +               return -EINVAL;
> +
> +       tz = get_zone_by_trip_kobj(kobj->parent);
> +       if (!tz)
> +               return -EINVAL;
> +
> +       indx = get_sensor_indx(tz, ts);
> +       if (indx < 0)
> +               return indx;
> +
> +       if (tz->trip[indx]->num_active_trips <= 0)
> +               return sprintf(buf, "<Not available>\n");
> +
> +       ret += sprintf(buf, "0x%x", tz->trip[indx]->active_trip_mask);
> +       for (i = 0; i < tz->trip[indx]->num_active_trips; i++) {
> +               ret += sprintf(buf + ret, " %d",
> +                       tz->trip[indx]->active_trips[i]);
> +       }
> +
> +       ret += sprintf(buf + ret, "\n");
> +       return ret;
> +}
> +
> +static ssize_t
> +ptrip_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
Just remind you it is better to use passive_show instead of ptrip_show
when you remove the unused old codes.

> +{
> +       int i, indx, ret = 0;
> +       struct thermal_sensor *ts;
> +       struct thermal_zone *tz;
> +
> +       ts = get_sensor_by_kobj_name(kobject_name(kobj));
> +       if (!ts)
> +               return -EINVAL;
> +
> +       tz = get_zone_by_trip_kobj(kobj->parent);
> +       if (!tz)
> +               return -EINVAL;
> +
> +       indx = get_sensor_indx(tz, ts);
> +       if (indx < 0)
> +               return indx;
> +
> +       if (tz->trip[indx]->num_passive_trips <= 0)
> +               return sprintf(buf, "<Not available>\n");
> +
> +       for (i = 0; i < tz->trip[indx]->num_passive_trips; i++) {
> +               ret += sprintf(buf + ret, "%d ",
> +                       tz->trip[indx]->passive_trips[i]);
> +       }
> +
> +       ret += sprintf(buf + ret, "\n");
> +       return ret;
> +}
> +
> +static ssize_t
> +hot_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +       int indx;
> +       struct thermal_sensor *ts;
> +       struct thermal_zone *tz;
> +
> +       ts = get_sensor_by_kobj_name(kobject_name(kobj));
> +       if (!ts)
> +               return -EINVAL;
> +
> +       tz = get_zone_by_trip_kobj(kobj->parent);
> +       if (!tz)
> +               return -EINVAL;
> +
> +       indx = get_sensor_indx(tz, ts);
> +       if (indx < 0)
> +               return indx;
> +
> +       return sprintf(buf, "%d\n", tz->trip[indx]->hot);
> +}
> +
> +static ssize_t
> +critical_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +       int indx;
> +       struct thermal_sensor *ts;
> +       struct thermal_zone *tz;
> +
> +       ts = get_sensor_by_kobj_name(kobject_name(kobj));
> +       if (!ts)
> +               return -EINVAL;
> +
> +       tz = get_zone_by_trip_kobj(kobj->parent);
> +       if (!tz)
> +               return -EINVAL;
> +
> +       indx = get_sensor_indx(tz, ts);
> +       if (indx < 0)
> +               return indx;
> +
> +       return sprintf(buf, "%d\n", tz->trip[indx]->crit);
> +}
> +
>  static DEVICE_ATTR(type, 0444, type_show, NULL);
>  static DEVICE_ATTR(temp, 0444, temp_show, NULL);
>  static DEVICE_ATTR(mode, 0644, mode_show, mode_store);
> @@ -869,7 +1017,26 @@ static DEVICE_ATTR(policy, S_IRUGO | S_IWUSR, policy_show, policy_store);
>  static DEVICE_ATTR(sensor_name, 0444, sensor_name_show, NULL);
>  static DEVICE_ATTR(temp_input, 0444, sensor_temp_show, NULL);
>
> -static DEVICE_ATTR(zone_name, 0444, zone_name_show, NULL);
> +/* Thermal zone attributes */
> +static DEVICE_ATTR(zone_name, S_IRUGO, zone_name_show, NULL);
> +
> +/* Thermal trip attributes */
> +static struct kobj_attribute active_attr = __ATTR_RO(active);
> +static struct kobj_attribute passive_attr = __ATTR_RO(ptrip);
> +static struct kobj_attribute hot_attr = __ATTR_RO(hot);
> +static struct kobj_attribute crit_attr = __ATTR_RO(critical);
> +
> +static struct attribute *trip_attrs[] = {
> +                       &active_attr.attr,
> +                       &passive_attr.attr,
> +                       &hot_attr.attr,
> +                       &crit_attr.attr,
> +                       NULL,
> +                       };
> +
> +static struct attribute_group trip_attr_group = {
> +                       .attrs = trip_attrs,
> +                       };
>
>  /* sys I/F for cooling device */
>  #define to_cooling_device(_dev)        \
> @@ -1694,12 +1861,19 @@ struct thermal_zone *create_thermal_zone(const char *name, void *devdata)
>         if (ret)
>                 goto exit_unregister;
>
> +       tz->kobj_thermal_trip = kobject_create_and_add("thermal_trip",
> +                                       &tz->device.kobj);
> +       if (!tz->kobj_thermal_trip)
> +               goto exit_name;
> +
>         /* Add this zone to the global list of thermal zones */
>         mutex_lock(&tz_list_lock);
>         list_add_tail(&tz->node, &thermal_zone_list);
>         mutex_unlock(&tz_list_lock);
>         return tz;
>
> +exit_name:
> +       device_remove_file(&tz->device, &dev_attr_zone_name);
>  exit_unregister:
>         device_unregister(&tz->device);
>  exit_idr:
> @@ -1713,6 +1887,7 @@ EXPORT_SYMBOL(create_thermal_zone);
>  void remove_thermal_zone(struct thermal_zone *tz)
>  {
>         struct thermal_zone *pos, *next;
> +       int i;
>         bool found = false;
>
>         if (!tz)
> @@ -1733,6 +1908,29 @@ void remove_thermal_zone(struct thermal_zone *tz)
>
>         device_remove_file(&tz->device, &dev_attr_zone_name);
>
> +       /* Just for ease of usage */
> +       i = tz->sensor_indx;
> +
> +       while (--i >= 0) {
> +               sysfs_remove_link(&tz->device.kobj,
> +                               kobject_name(&tz->sensors[i]->device.kobj));
> +               if (tz->kobj_trip[i])
> +                       kobject_del(tz->kobj_trip[i]);
> +       }
> +
> +       if (tz->kobj_thermal_trip)
> +               kobject_del(tz->kobj_thermal_trip);
> +
> +       /* Release the cdevs attached to this zone */
> +       i = tz->cdev_indx;
> +
> +       while (--i >= 0) {
> +               sysfs_remove_link(&tz->device.kobj,
> +                               kobject_name(&tz->cdevs[i]->device.kobj));
> +       }
> +
> +       tz->sensor_indx = tz->cdev_indx = 0;
> +
>         release_idr(&thermal_zone_idr, &thermal_idr_lock, tz->id);
>         idr_destroy(&tz->idr);
>
> @@ -1864,6 +2062,48 @@ int add_cdev_to_zone(struct thermal_zone *tz,
>  }
>  EXPORT_SYMBOL(add_cdev_to_zone);
>
> +int add_sensor_trip_info(struct thermal_zone *tz, const char *sensor_name,
> +                       struct thermal_trip_point *trip)
> +{
> +       int indx, ret = -EINVAL;
> +       struct thermal_sensor *ts = get_sensor_by_name(sensor_name);
> +
> +       if (!tz || !ts || !trip)
> +               return ret;
> +
> +       mutex_lock(&tz_list_lock);
> +
> +       indx = get_sensor_indx(tz, ts);
> +       if (indx < 0)
> +               goto exit_indx;
> +
> +       tz->kobj_trip[indx] = kobject_create_and_add(
> +                                       kobject_name(&ts->device.kobj),
> +                                       tz->kobj_thermal_trip);
> +       if (!tz->kobj_trip[indx]) {
> +               ret = -ENOMEM;
> +               goto exit_indx;
> +       }
> +
> +       ret = sysfs_create_group(tz->kobj_trip[indx], &trip_attr_group);
> +       if (ret) {
> +               dev_err(&tz->device, "sysfs_create_group failed:%d\n", ret);
> +               goto exit_kobj;
> +       }
> +
> +       tz->trip[indx] = trip;
> +       mutex_unlock(&tz_list_lock);
> +       return 0;
> +
> +exit_kobj:
> +       kobject_del(tz->kobj_trip[indx]);
> +       tz->kobj_trip[indx] = NULL;
> +exit_indx:
> +       mutex_unlock(&tz_list_lock);
> +       return ret;
> +}
> +EXPORT_SYMBOL(add_sensor_trip_info);
> +
>  /**
>   * thermal_sensor_register - register a new thermal sensor
>   * @name:      name of the thermal sensor
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 2913910..7b2fe35 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -159,6 +159,30 @@ struct thermal_attr {
>         char name[THERMAL_NAME_LENGTH];
>  };
>
> +/*
> + * This structure defines the trip points for a sensor.
> + * The actual values for these trip points come from
> + * platform characterization. The thermal governors
> + * (either kernel or user space) may take appropriate
> + * actions when the sensors reach these trip points.
> + * See Documentation/thermal/sysfs-api2.txt for more details.
> + *
> + * As of now, For a particular sensor, we support:
> + * a) 1 hot trip point
> + * b) 1 critical trip point
> + * c) 'n' passive trip points
> + * d) 'm' active trip points
> + */
> +struct thermal_trip_point {
> +       int hot;
> +       int crit;
> +       int num_passive_trips;
> +       int *passive_trips;
> +       int num_active_trips;
> +       int *active_trips;
> +       int active_trip_mask;
> +};
> +
>  struct thermal_sensor {
>         char name[THERMAL_NAME_LENGTH];
>         int id;
> @@ -215,6 +239,11 @@ struct thermal_zone {
>         /* cdev level information */
>         int cdev_indx; /* index into 'cdevs' array */
>         struct thermal_cooling_device *cdevs[MAX_CDEVS_PER_ZONE];
> +
> +       /* Thermal trip information */
> +       struct thermal_trip_point *trip[MAX_SENSORS_PER_ZONE];
> +       struct kobject *kobj_trip[MAX_SENSORS_PER_ZONE];
> +       struct kobject *kobj_thermal_trip;
>  };
>
>  /* Structure that holds thermal governor information */
> @@ -298,6 +327,9 @@ int add_sensor_to_zone_by_name(struct thermal_zone *, const char *);
>  int add_cdev_to_zone(struct thermal_zone *, struct thermal_cooling_device *);
>  int add_cdev_to_zone_by_name(struct thermal_zone *, const char *);
>
> +int add_sensor_trip_info(struct thermal_zone *, const char *,
> +                       struct thermal_trip_point *);
> +
>  #ifdef CONFIG_NET
>  extern int thermal_generate_netlink_event(u32 orig, enum events event);
>  #else
> --
> 1.7.9.5
>

^ permalink raw reply

* Re: [RFC PATCH 2/8] mm: Initialize node memory regions during boot
From: wujianguo @ 2012-12-04  8:25 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	gargankita, amit.kachhap, svaidy, thomas.abraham,
	santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <20121106195241.6941.43309.stgit@srivatsabhat.in.ibm.com>

Hi Srivatsa,

I got following compile waring:

WARNING: vmlinux.o(.text+0x10b320): Section mismatch in reference from the function init_zone_memory_regions() to the function .meminit.text:__absent_pages_in_range()
The function init_zone_memory_regions() references
the function __meminit __absent_pages_in_range().
This is often because init_zone_memory_regions lacks a __meminit
annotation or the annotation of __absent_pages_in_range is wrong.

WARNING: vmlinux.o(.text+0x10b457): Section mismatch in reference from the function init_node_memory_regions() to the function .meminit.text:__absent_pages_in_range()
The function init_node_memory_regions() references
the function __meminit __absent_pages_in_range().
This is often because init_node_memory_regions lacks a __meminit
annotation or the annotation of __absent_pages_in_range is wrong.

I think should add *__paginginit* to the following three functions:
init_memory_regions()
init_node_memory_regions()
init_zone_memory_regions()

Thanks,
Jianguo wu

On 2012-11-7 3:52, Srivatsa S. Bhat wrote:
> Initialize the node's memory regions structures with the information about
> the region-boundaries, at boot time.
> 
> Based-on-patch-by: Ankita Garg <gargankita@gmail.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
> 
>  include/linux/mm.h |    4 ++++
>  mm/page_alloc.c    |   35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa06804..19c4fb0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -657,6 +657,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
>  #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
>  
> +/* Hard-code memory regions size to be 512 MB for now. */
> +#define MEM_REGION_SHIFT	(29 - PAGE_SHIFT)
> +#define MEM_REGION_SIZE		(1UL << MEM_REGION_SHIFT)
> +
>  static inline enum zone_type page_zonenum(const struct page *page)
>  {
>  	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bb90971..709e3c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4560,6 +4560,40 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
>  #endif /* CONFIG_FLAT_NODE_MEM_MAP */
>  }
>  
> +void init_node_memory_regions(struct pglist_data *pgdat)
> +{
> +	int nid = pgdat->node_id;
> +	unsigned long start_pfn = pgdat->node_start_pfn;
> +	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> +	unsigned long i, absent;
> +	int idx;
> +	struct node_mem_region *region;
> +
> +	for (i = start_pfn, idx = 0; i < end_pfn;
> +				i += region->spanned_pages, idx++) {
> +
> +		region = &pgdat->node_regions[idx];
> +
> +		if (i + MEM_REGION_SIZE <= end_pfn) {
> +			region->start_pfn = i;
> +			region->spanned_pages = MEM_REGION_SIZE;
> +		} else {
> +			region->start_pfn = i;
> +			region->spanned_pages = end_pfn - i;
> +		}
> +
> +		absent = __absent_pages_in_range(nid, region->start_pfn,
> +						 region->start_pfn +
> +						 region->spanned_pages);
> +
> +		region->present_pages = region->spanned_pages - absent;
> +		region->idx = idx;
> +		region->node = nid;
> +		region->pgdat = pgdat;
> +		pgdat->nr_node_regions++;
> +	}
> +}
> +
>  void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>  		unsigned long node_start_pfn, unsigned long *zholes_size)
>  {
> @@ -4581,6 +4615,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>  #endif
>  
>  	free_area_init_core(pgdat, zones_size, zholes_size);
> +	init_node_memory_regions(pgdat);
>  }
>  
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply

* Re: [PATCH v9 06/10] ata: zpodd: check zero power ready status
From: Aaron Lu @ 2012-12-04  5:04 UTC (permalink / raw)
  To: Jeff Garzik, Tejun Heo, James Bottomley
  Cc: Rafael J. Wysocki, linux-pm, Alan Stern, Jeff Wu, Aaron Lu,
	linux-ide, linux-scsi, linux-acpi
In-Reply-To: <50BCF5F8.2030802@pobox.com>

On 12/04/2012 02:56 AM, Jeff Garzik wrote:
> On 12/03/2012 11:23 AM, Tejun Heo wrote:
>> Hello, James.
>>
>> On Mon, Dec 03, 2012 at 08:25:43AM +0000, James Bottomley wrote:
>>>> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
>>>> index e65c62e..1756151 100644
>>>> --- a/include/scsi/scsi_device.h
>>>> +++ b/include/scsi/scsi_device.h
>>>> @@ -160,6 +160,7 @@ struct scsi_device {
>>>>   	unsigned can_power_off:1; /* Device supports runtime power off */
>>>>   	unsigned wce_default_on:1;	/* Cache is ON by default */
>>>>   	unsigned no_dif:1;	/* T10 PI (DIF) should be disabled */
>>>> +	unsigned event_driven:1; /* No need to poll the device */
>>>>
>>>>   	DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */
>>>>   	struct list_head event_list;	/* asserted events */
>>>
>>> Yes, but if we can get away with doing that, it should be in genhd
>>> because it's completely generic.
>>>
>>> I was imagining we'd have to fake the reply to the test unit ready or
>>> some other commands, which is why it would need to be in sr.c
>>>
>>> The check events code is Tejun's baby, if he's OK with it then just do
>>> it in genhd.c
>>
>> The problem here is there's no easy to reach genhd from libata (or the
>> other way around) without going through sr.  I think we're gonna have
>> to have something in sr one way or the other.
> 
>   ...which is precisely as I said when v1 of this ZPODD patchset appeared.
> 
> sr modifications cannot be avoided.

So I'm gonna use the above code to silence the poll when ODD is powered
off. I suppose everybody is OK with this, right? James, please let me
know if you disagree.

Thanks,
Aaron

> 
> 	Jeff
> 
> 
> 
> 


^ permalink raw reply

* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Andy Green @ 2012-12-04  4:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Alan Stern, Greg Kroah-Hartman, Lan Tianyu, Sarah Sharp,
	Rafael J. Wysocki, linux-pm-u79uwXL29TY76Z2rM5mHXA, Oliver Neukum,
	linux-omap-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, Roger Quadros, Felipe Balbi
In-Reply-To: <CACVXFVO-Xktswog9Zx16zo-pmx9fTh0F3BYC-3q6Zn2SPCqdGg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 04/12/12 10:39, the mail apparently from Ming Lei included:
> On Mon, Dec 3, 2012 at 12:52 PM, Andy Green <andy.green-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>>> +static void ehci_hub_power_off(struct power_controller *pc, struct
>>>>> device
>>>>> *dev)
>>>>> +{
>>>>> +       gpio_set_value(GPIO_HUB_NRESET, 0);
>>>>> +       gpio_set_value(GPIO_HUB_POWER, 0);
>>>>> +}
>>>>> +
>>>>> +static struct usb_port_power_switch_data root_hub_port_data = {
>>>>> +       .hub_tier       = 0,
>>>>> +       .port_number = 1,
>>>>> +       .type = USB_PORT_CONNECT_TYPE_HARD_WIRED,
>>>>> +};
>>>>> +
>>>>> +static struct usb_port_power_switch_data smsc_hub_port_data = {
>>>>> +       .hub_tier       = 1,
>>>>> +       .port_number = 1,
>>>>> +       .type = USB_PORT_CONNECT_TYPE_HARD_WIRED,
>>>>> +};
>>>>> +
>>>>> +static struct power_controller pc = {
>>>>> +       .name = "omap_hub_eth_pc",
>>>>> +       .count = ATOMIC_INIT(0),
>>>>> +       .power_on = ehci_hub_power_on,
>>>>> +       .power_off = ehci_hub_power_off,
>>>>> +};
>>>>> +
>>>>> +static inline int omap_ehci_hub_port(struct device *dev)
>>>>> +{
>>>>> +       /* we expect dev->parent points to ehcd controller */
>>>>> +       if (dev->parent && !strcmp(dev_name(dev->parent),
>>>>> "ehci-omap.0"))
>>>>> +               return 1;
>>>>> +       return 0;
>>>>> +}
>>>>> +
>>>>> +static inline int dev_pc_match(struct device *dev)
>>>>> +{
>>>>> +       struct device *anc;
>>>>> +       int ret = 0;
>>>>> +
>>>>> +       if (likely(strcmp(dev_name(dev), "port1")))
>>>>> +               goto exit;
>>>>> +
>>>>> +       if (dev->parent && (anc = dev->parent->parent)) {
>>>>> +               if (omap_ehci_hub_port(anc)) {
>>>>> +                       ret = 1;
>>>>> +                       goto exit;
>>>>> +               }
>>>>> +
>>>>> +               /* is it port of lan95xx hub? */
>>>>> +               if ((anc = anc->parent) && omap_ehci_hub_port(anc)) {
>>>>> +                       ret = 2;
>>>>> +                       goto exit;
>>>>> +               }
>>>>> +       }
>>>>> +exit:
>>>>> +       return ret;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Notifications of device registration
>>>>> + */
>>>>> +static int device_notify(struct notifier_block *nb, unsigned long
>>>>> action,
>>>>> void *data)
>>>>> +{
>>>>> +       struct device *dev = data;
>>>>> +       int ret;
>>>>> +
>>>>> +       switch (action) {
>>>>> +       case DEV_NOTIFY_ADD_DEVICE:
>>>>> +               ret = dev_pc_match(dev);
>>>>> +               if (likely(!ret))
>>>>> +                       goto exit;
>>>>> +               if (ret == 1)
>>>>> +                       dev_pc_bind(&pc, dev, &root_hub_port_data,
>>>>> sizeof(root_hub_port_data));
>>>>> +               else
>>>>> +                       dev_pc_bind(&pc, dev, &smsc_hub_port_data,
>>>>> sizeof(smsc_hub_port_data));
>>>>> +               break;
>>>>> +
>>>>> +       case DEV_NOTIFY_DEL_DEVICE:
>>>>> +               break;
>>>>> +       }
>>>>> +exit:
>>>>> +       return 0;
>>>>> +}
>>>>> +
>>>>> +static struct notifier_block usb_port_nb = {
>>>>> +       .notifier_call = device_notify,
>>>>> +};
>>>>> +
>>>>
>>>>
>>>>
>>>> Some thoughts on trying to make this functionality specific to power only
>>>> and ehci hub port only:
>>>>
>>>>    - Quite a few boards have smsc95xx... they're all going to carry these
>>>> additions in the board file?  Surely you'll have to generalize the pieces
>>>
>>>
>>> All things are board dependent because we are discussing peripheral
>>> device(not builtin SoC devices), so it is proper to put it in the board
>>> file.
>>> If some boards want to share it, we can put it in a single .c file and
>>> let board file include it.
>>
>>
>> Where would the .c file go... SMSC is not platform, or even arch specific
>> chip (eg, iMX or MIPS or even x86 boards can have it), so not
>> arch/arm/mach-xxxx or arch/arm/plat-xxx or arch/arm.  I guess it would go in
>> drivers/usb or drivers/net.
>
> How does drivers/usb or drivers/net know the SMSC is used on beagle or
> panda? Different power control approach is used in the two boards even
> same SMSC chip is shipped in the two boards.

You mention you're going to "put it in a single .c file and let the 
board file include it", I am just wondering where that .c file will 
live.  It seems it would have to live down drivers/ somewhere so MIPS, 
x86 etc that might have an SMSC chip onboard can also use it if they want.

>> Push in ARM-Land is towards single kernels which support many platforms, a
>> good case in point is omap2plus_defconfg which wants to allow to support all
>> OMAP2/3/4/5 platforms in one kernel.  If you "include" that C file over and
>> over in board files, it's not very nice for that.  They anyway want to
>> eliminate board files for the single kernel binary reason, and just have one
>> load of config come in by dtb.
>
> I only mean it is reasonable to put the power control code into board
> file because
> each board may take different power control approach even same SMSC chip
> is used. I understand DT only describes the difference of the board via device
> node, and I am not sure if the current DT is enough to convert the board file
> into data as far as the problem is concerned.

No the approach with DT is to provide a dummy SoC "board file" with all 
data provided by dtb.  This is already established, see 
./arch/arm/mach-omap2/board-generic.c for example.  You can see there's 
nothing in it other than minimum dt match business.

People with new boards are getting pointed at that and told to sort it 
out in dtb and disallowed from creating new board files.

So whatever support or helper scheme you're adding needs a story about 
how dtb can express it (in dt language, "has bindings for it") or it 
can't be used on new boards.  That's not a minor ding any more either.

>> So it guides you towards having static helper code once in drivers/ for the
>> scenarios you support... if that's where you end up smsc is less good a
>> target for a helper than to have helpers for classes of object like
>> regulator and clk, that you can combine and reuse on all sorts of target
>> devices, which is device_asset approach.
>>
>> It also guides you to having the special platform sauce be something that
>> can go into the dtb: per-board code can't.  That's why device_asset stuff
>> only places asset structs in the board file and is removing code from there.
>>
>>
>>>> that perform device_path business out of the omap4panda board file at
>>>> least.
>>>> At that point the path matching code becomes generic
>>>> end-of-the-device-path
>>>> matching code.
>>>
>>>
>>> Looks Alan has mentioned, there might be no generic way, and any device's
>>> name change in the path may make the way fragile.
>>
>>
>> What you have provided is no less fragile if you allow "port1" and the
>> ehci-omap.0 to be set from the outside.
>>
>> Unless someone NAKs it for sure already (if you're already sure you're going
>> to, please do so to avoid wasting time), I'll issue a try#2 of my code later
>> which demonstrates what I mean.  At least I guess it's useful for
>> comparative purposes.
>>
>>
>>>>    - How could these literals like "port1" etc be nicely provided by
>>>> Device
>>>> Tree?  In ARM-land there's pressure to eventually eliminate board files
>>>> completely and pass in everything from dtb.  device_asset can neatly grow
>>>> DT
>>>> bindings in a generic way, since the footprint in the board file is some
>>>
>>>
>>> IMO, it isn't necessary to expose these assets to device or users, from
>>> the
>>> view of device or user, which only cares two actions(poweron and poweroff)
>>> about the discussed problem. Also it should be better to put these assets
>>> into device resource list, instead of introducing them in 'struct device'.
>>
>>
>>  From the point of view of allowing it to be reused on different boards /
>> platforms / arches, you are going to have to do something better than have
>> literals in the code.
>>
>>
>>>> regulators that already have dt bindings and some device_asset structs.
>>>> Similarly there's pressure for magic code to service a board (rather than
>>>> SoC) to go elsewhere than the board file.
>>>
>>>
>>> Looks you associate these assets with ehci-omap device, which mightn't be
>>> enough, because we need to control port's power for supporting port
>>> power off mechanism. Do you have generic way to associate these assets
>>> with usb port device and let port use it generally?
>>
>>
>> Yes, you need a parent device pointer (ehci host controller in this case)
>> and the path rhs, but only stuff that is defined by usb stack code.  Needing
>> a parent pointer is OK because this stuff only has meaning for hardwired
>> assets on the platform, so the parent device will always be known as a
>> platform_device at boot time anyway.
>
> parent device pointer may work on the panda problem, but may not work
> on other case if one hardwired device is powered by another power domain.
>
> So it is not a general solution on usb port power off.

I am talking about, well, "ancestor" pointer only to filter descendant 
children.  Not for any power control directly.

It gets us away from caring about what the device path looked like prior 
to that host controller, and since we have confidence it's exactly the 
host controller we intended by knowing its platform_device, we can 
ignore everything between than at the right-hand side path fragment 
identifying the child.  So it's a strong way to reduce fragility on the 
device_path stuff.

>> The code I'll provide will work the same in sdio or other bus case, just use
>> mmc host controller pointer and the sdio device name the same way.
>>
>>
>>>>    - Shouldn't this take care of enabling and disabling the ULPI PHY clock
>>>> on
>>>> Panda too?  There's no purpose leaving it running if the one thing the
>>>> ULPI
>>>> PHY is connected to is depowered, and when you do power it, on Panda you
>>>> will reset the ULPI PHY at the same time anyway (smsc reset and ULPI PHY
>>>> reset are connected together on Panda).  Then you can eliminate
>>>> omap4_ehci_init() in the board file.
>>>
>>>
>>> OK, we can include the ULPI PHY clock things easily in ->power_on() and
>>> ->power_off() of 'power controller'
>>
>>
>> Yes if the ARM people will accept establishing more code in board files that
>> doesn't have a DT story.
>
> As I explained above, it is reasonable to put the power control code in board
> file, but current DT could convert these board file into device node?

DT has lots of bindings now, you can describe regulators and things in 
there fully.  The point is whatever scheme is workable for this must be 
able to get soaked up using DT too to be acceptable.  Taking the 
approach you're going to drop literal strings in code in the board file, 
or throw code at the board file at all, will no longer fly.

I have not provided a DT solution for my approach yet either, but I have 
taken care that the only footprint in the boardfile version of it are 
*structs*.  That means translating them to dtb content by adding a 
binding for the asset stuff for example, will be a clean task.

That's also why back at the beginning of this discussion I gave dts 
version of the regulator structs I was introducing in that patch... it's 
proof that the boardfile footprint of that scheme is ready to go into dtb.

-Andy

-- 
Andy Green | TI Landing Team Leader
Linaro.org │ Open source software for ARM SoCs | Follow Linaro
http://facebook.com/pages/Linaro/155974581091106  - 
http://twitter.com/#!/linaroorg - http://linaro.org/linaro-blog
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Andy Green @ 2012-12-04  3:40 UTC (permalink / raw)
  To: Alan Stern
  Cc: Ming Lei, Greg Kroah-Hartman, Lan Tianyu, Sarah Sharp,
	Rafael J. Wysocki, linux-pm, Oliver Neukum, linux-omap, linux-usb,
	Roger Quadros, Felipe Balbi
In-Reply-To: <Pine.LNX.4.44L0.1212031154460.1309-100000@iolanthe.rowland.org>

On 04/12/12 01:09, the mail apparently from Alan Stern included:
> On Mon, 3 Dec 2012, Andy Green wrote:
>
>> Unless someone NAKs it for sure already (if you're already sure you're
>> going to, please do so to avoid wasting time), I'll issue a try#2 of my
>> code later which demonstrates what I mean.  At least I guess it's useful
>> for comparative purposes.
>
> Before you go writing a whole lot more code, we should discuss the
> basics a bit more clearly.  There are several unsettled issues here:

>       1. Should the LAN95xx stuff be associated with the ehci-omap.0's
> 	driver or with the hub port?  The port would be more flexible,
> 	offering the ability to turn the power off and on while the
> 	system is running without affecting anything else.  But the
> 	port code is currently in flux, which could cause this new
> 	addition to be delayed.

I think associating ULPI PHY reset and SMSC power and reset with the hub 
port power state is good.  Then, you could have the driver in a device 
with multiple onboard USB devices, and individually control whether 
they're eating power or not.  In the asset case, you'd associate mux 
assets with ehci-omap.0.

Yesterday I studied the hub port code and have a couple of patches, one 
normalizes the hub port device to have a stub driver.

The other then puts hub port power state signalling into runtime_pm 
handlers in the hub port device.  Until now, actually there's no code in 
hub.c to switch off a port.

Assuming that's not insane, what should the user interface to disable a 
port power look like, something in sysfs?  Until now it doesn't seem to 
exist.

> 	(On the other hand, since the LAN95xx is the only thing
> 	connected to the root hub, it could be powered off and on by
> 	unbinding the ehci-omap.0 device from its driver and rebinding
> 	it.)

We shouldn't get to tied up with Panda case, this will be there for all 
cases like PCs etc.  It should work well if there are multiple ports 
with onboard assets.

>       2. If we do choose the port, do we want to identify it by matching
> 	against a device name string or by matching a sequence of port
> 	numbers (in this case, a length-1 sequence)?  The port numbers
> 	are fixed by the board design, whereas the device name strings
> 	might  get changed in the future.  On the other hand, the port
> 	numbers apply only to USB whereas device names can be used by
> 	any subsystem.

USB device names contain the port information.  The matching scheme I 
have currently just uses the right-hand side of the path information and 
nothing that is not defined by the USB subsystem.  It uses a 
platform_device ancestor to restrict matches to descendants of the right 
host controller.  So unlike try#1 the names are as stable as the 
subsystem code alone, however stable that is, it's not exposed to 
changes from anywhere else.  As you mention it's then workable on any 
dynamically probed bus.

>       3. Should the matching mechanism go into the device core or into
> 	the USB port code?  (This is related to the previous question.)

Currently I am experimenting with having the asset pointer in struct 
device, but migrating the events into runtime_resume and 
runtime_suspend.  If it works out that has advantages that assets follow 
not just the logical device existence but the pm state of the device 
closely.

It also allows leveraging assets directly to the hub port runtime_pm 
state, so they follow enable state of the port without any additional code.

>       4. Should this be implemented simply as a regulator (or a pair of
> 	regulators)?  Or should it be generalized to some sort of PM
> 	domain thing?  The generic_pm_domain structure defined in
> 	include/linux/pm_domain.h seems like overkill, but maybe it's
> 	the most appropriate thing to use.

They should be regulators for that I think.  But it's only part the 
problem since clocks and mux are also going to be commonly associated 
with device power state, and indeed are in Panda case.

I realize restricting the scope is desirable to get something done, but 
I'm not sure supporting regulators only is enough of the job.

> Until we decide on the answers to these questions, there's no point
> writing detailed patches.

I agree, however I can't get insight into and confidence about what to 
do without studying and meddling with the code.  Some throwaway code is 
probably a reasonable price.

-Andy

-- 
Andy Green | TI Landing Team Leader
Linaro.org │ Open source software for ARM SoCs | Follow Linaro
http://facebook.com/pages/Linaro/155974581091106  - 
http://twitter.com/#!/linaroorg - http://linaro.org/linaro-blog
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Ming Lei @ 2012-12-04  3:06 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andy Green, Greg Kroah-Hartman, Lan Tianyu, Sarah Sharp,
	Rafael J. Wysocki, linux-pm, Oliver Neukum, linux-omap, linux-usb,
	Roger Quadros, Felipe Balbi
In-Reply-To: <Pine.LNX.4.44L0.1212031154460.1309-100000@iolanthe.rowland.org>

On Tue, Dec 4, 2012 at 1:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Mon, 3 Dec 2012, Andy Green wrote:
>
>> Unless someone NAKs it for sure already (if you're already sure you're
>> going to, please do so to avoid wasting time), I'll issue a try#2 of my
>> code later which demonstrates what I mean.  At least I guess it's useful
>> for comparative purposes.
>
> Before you go writing a whole lot more code, we should discuss the
> basics a bit more clearly.  There are several unsettled issues here:
>
>      1. Should the LAN95xx stuff be associated with the ehci-omap.0's
>         driver or with the hub port?  The port would be more flexible,

IMO, it shouldn't be associated with ehci-omap controller driver
because the LAN95xx is a external USB device and should be
nothing to do with ehci, but it is reasonable to be associated with
the hub port because it takes a out of band power control method.

>         offering the ability to turn the power off and on while the
>         system is running without affecting anything else.  But the
>         port code is currently in flux, which could cause this new
>         addition to be delayed.
>
>         (On the other hand, since the LAN95xx is the only thing
>         connected to the root hub, it could be powered off and on by
>         unbinding the ehci-omap.0 device from its driver and rebinding
>         it.)
>
>      2. If we do choose the port, do we want to identify it by matching
>         against a device name string or by matching a sequence of port
>         numbers (in this case, a length-1 sequence)?  The port numbers
>         are fixed by the board design, whereas the device name strings
>         might  get changed in the future.  On the other hand, the port
>         numbers apply only to USB whereas device names can be used by
>         any subsystem.

Alos, the same ehci-omap driver and same LAN95xx chip is used in
beagle board and panda board with different power control
approach, does port driver can distinguish these two cases?
Port device is a general device(not platform device), how does
port driver get platform/board dependent info?

>
>      3. Should the matching mechanism go into the device core or into
>         the USB port code?  (This is related to the previous question.)
>
>      4. Should this be implemented simply as a regulator (or a pair of
>         regulators)?  Or should it be generalized to some sort of PM
>         domain thing?  The generic_pm_domain structure defined in
>         include/linux/pm_domain.h seems like overkill, but maybe it's
>         the most appropriate thing to use.

Not only regulators involved, clock or other things might be involved too.
Also the same power domain might be shared with more than one port,
so it is better to introduce power domain to the problem. Looks
generic_pm_domain is overkill, so I introduced power controller which
only focuses on power on/off and being shared by multiple devices.


Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Ming Lei @ 2012-12-04  2:39 UTC (permalink / raw)
  To: Andy Green
  Cc: Alan Stern, Greg Kroah-Hartman, Lan Tianyu, Sarah Sharp,
	Rafael J. Wysocki, linux-pm, Oliver Neukum, linux-omap, linux-usb,
	Roger Quadros, Felipe Balbi
In-Reply-To: <50BC3003.80608@linaro.org>

On Mon, Dec 3, 2012 at 12:52 PM, Andy Green <andy.green@linaro.org> wrote:
>>>> +static void ehci_hub_power_off(struct power_controller *pc, struct
>>>> device
>>>> *dev)
>>>> +{
>>>> +       gpio_set_value(GPIO_HUB_NRESET, 0);
>>>> +       gpio_set_value(GPIO_HUB_POWER, 0);
>>>> +}
>>>> +
>>>> +static struct usb_port_power_switch_data root_hub_port_data = {
>>>> +       .hub_tier       = 0,
>>>> +       .port_number = 1,
>>>> +       .type = USB_PORT_CONNECT_TYPE_HARD_WIRED,
>>>> +};
>>>> +
>>>> +static struct usb_port_power_switch_data smsc_hub_port_data = {
>>>> +       .hub_tier       = 1,
>>>> +       .port_number = 1,
>>>> +       .type = USB_PORT_CONNECT_TYPE_HARD_WIRED,
>>>> +};
>>>> +
>>>> +static struct power_controller pc = {
>>>> +       .name = "omap_hub_eth_pc",
>>>> +       .count = ATOMIC_INIT(0),
>>>> +       .power_on = ehci_hub_power_on,
>>>> +       .power_off = ehci_hub_power_off,
>>>> +};
>>>> +
>>>> +static inline int omap_ehci_hub_port(struct device *dev)
>>>> +{
>>>> +       /* we expect dev->parent points to ehcd controller */
>>>> +       if (dev->parent && !strcmp(dev_name(dev->parent),
>>>> "ehci-omap.0"))
>>>> +               return 1;
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +static inline int dev_pc_match(struct device *dev)
>>>> +{
>>>> +       struct device *anc;
>>>> +       int ret = 0;
>>>> +
>>>> +       if (likely(strcmp(dev_name(dev), "port1")))
>>>> +               goto exit;
>>>> +
>>>> +       if (dev->parent && (anc = dev->parent->parent)) {
>>>> +               if (omap_ehci_hub_port(anc)) {
>>>> +                       ret = 1;
>>>> +                       goto exit;
>>>> +               }
>>>> +
>>>> +               /* is it port of lan95xx hub? */
>>>> +               if ((anc = anc->parent) && omap_ehci_hub_port(anc)) {
>>>> +                       ret = 2;
>>>> +                       goto exit;
>>>> +               }
>>>> +       }
>>>> +exit:
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Notifications of device registration
>>>> + */
>>>> +static int device_notify(struct notifier_block *nb, unsigned long
>>>> action,
>>>> void *data)
>>>> +{
>>>> +       struct device *dev = data;
>>>> +       int ret;
>>>> +
>>>> +       switch (action) {
>>>> +       case DEV_NOTIFY_ADD_DEVICE:
>>>> +               ret = dev_pc_match(dev);
>>>> +               if (likely(!ret))
>>>> +                       goto exit;
>>>> +               if (ret == 1)
>>>> +                       dev_pc_bind(&pc, dev, &root_hub_port_data,
>>>> sizeof(root_hub_port_data));
>>>> +               else
>>>> +                       dev_pc_bind(&pc, dev, &smsc_hub_port_data,
>>>> sizeof(smsc_hub_port_data));
>>>> +               break;
>>>> +
>>>> +       case DEV_NOTIFY_DEL_DEVICE:
>>>> +               break;
>>>> +       }
>>>> +exit:
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +static struct notifier_block usb_port_nb = {
>>>> +       .notifier_call = device_notify,
>>>> +};
>>>> +
>>>
>>>
>>>
>>> Some thoughts on trying to make this functionality specific to power only
>>> and ehci hub port only:
>>>
>>>   - Quite a few boards have smsc95xx... they're all going to carry these
>>> additions in the board file?  Surely you'll have to generalize the pieces
>>
>>
>> All things are board dependent because we are discussing peripheral
>> device(not builtin SoC devices), so it is proper to put it in the board
>> file.
>> If some boards want to share it, we can put it in a single .c file and
>> let board file include it.
>
>
> Where would the .c file go... SMSC is not platform, or even arch specific
> chip (eg, iMX or MIPS or even x86 boards can have it), so not
> arch/arm/mach-xxxx or arch/arm/plat-xxx or arch/arm.  I guess it would go in
> drivers/usb or drivers/net.

How does drivers/usb or drivers/net know the SMSC is used on beagle or
panda? Different power control approach is used in the two boards even
same SMSC chip is shipped in the two boards.

>
> Push in ARM-Land is towards single kernels which support many platforms, a
> good case in point is omap2plus_defconfg which wants to allow to support all
> OMAP2/3/4/5 platforms in one kernel.  If you "include" that C file over and
> over in board files, it's not very nice for that.  They anyway want to
> eliminate board files for the single kernel binary reason, and just have one
> load of config come in by dtb.

I only mean it is reasonable to put the power control code into board
file because
each board may take different power control approach even same SMSC chip
is used. I understand DT only describes the difference of the board via device
node, and I am not sure if the current DT is enough to convert the board file
into data as far as the problem is concerned.

>
> So it guides you towards having static helper code once in drivers/ for the
> scenarios you support... if that's where you end up smsc is less good a
> target for a helper than to have helpers for classes of object like
> regulator and clk, that you can combine and reuse on all sorts of target
> devices, which is device_asset approach.
>
> It also guides you to having the special platform sauce be something that
> can go into the dtb: per-board code can't.  That's why device_asset stuff
> only places asset structs in the board file and is removing code from there.
>
>
>>> that perform device_path business out of the omap4panda board file at
>>> least.
>>> At that point the path matching code becomes generic
>>> end-of-the-device-path
>>> matching code.
>>
>>
>> Looks Alan has mentioned, there might be no generic way, and any device's
>> name change in the path may make the way fragile.
>
>
> What you have provided is no less fragile if you allow "port1" and the
> ehci-omap.0 to be set from the outside.
>
> Unless someone NAKs it for sure already (if you're already sure you're going
> to, please do so to avoid wasting time), I'll issue a try#2 of my code later
> which demonstrates what I mean.  At least I guess it's useful for
> comparative purposes.
>
>
>>>   - How could these literals like "port1" etc be nicely provided by
>>> Device
>>> Tree?  In ARM-land there's pressure to eventually eliminate board files
>>> completely and pass in everything from dtb.  device_asset can neatly grow
>>> DT
>>> bindings in a generic way, since the footprint in the board file is some
>>
>>
>> IMO, it isn't necessary to expose these assets to device or users, from
>> the
>> view of device or user, which only cares two actions(poweron and poweroff)
>> about the discussed problem. Also it should be better to put these assets
>> into device resource list, instead of introducing them in 'struct device'.
>
>
> From the point of view of allowing it to be reused on different boards /
> platforms / arches, you are going to have to do something better than have
> literals in the code.
>
>
>>> regulators that already have dt bindings and some device_asset structs.
>>> Similarly there's pressure for magic code to service a board (rather than
>>> SoC) to go elsewhere than the board file.
>>
>>
>> Looks you associate these assets with ehci-omap device, which mightn't be
>> enough, because we need to control port's power for supporting port
>> power off mechanism. Do you have generic way to associate these assets
>> with usb port device and let port use it generally?
>
>
> Yes, you need a parent device pointer (ehci host controller in this case)
> and the path rhs, but only stuff that is defined by usb stack code.  Needing
> a parent pointer is OK because this stuff only has meaning for hardwired
> assets on the platform, so the parent device will always be known as a
> platform_device at boot time anyway.

parent device pointer may work on the panda problem, but may not work
on other case if one hardwired device is powered by another power domain.

So it is not a general solution on usb port power off.

> The code I'll provide will work the same in sdio or other bus case, just use
> mmc host controller pointer and the sdio device name the same way.
>
>
>>>   - Shouldn't this take care of enabling and disabling the ULPI PHY clock
>>> on
>>> Panda too?  There's no purpose leaving it running if the one thing the
>>> ULPI
>>> PHY is connected to is depowered, and when you do power it, on Panda you
>>> will reset the ULPI PHY at the same time anyway (smsc reset and ULPI PHY
>>> reset are connected together on Panda).  Then you can eliminate
>>> omap4_ehci_init() in the board file.
>>
>>
>> OK, we can include the ULPI PHY clock things easily in ->power_on() and
>> ->power_off() of 'power controller'
>
>
> Yes if the ARM people will accept establishing more code in board files that
> doesn't have a DT story.

As I explained above, it is reasonable to put the power control code in board
file, but current DT could convert these board file into device node?

Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH v9 06/10] ata: zpodd: check zero power ready status
From: Jeff Garzik @ 2012-12-03 18:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Aaron Lu, Rafael J. Wysocki, linux-pm,
	Alan Stern, Jeff Wu, Aaron Lu, linux-ide, linux-scsi, linux-acpi
In-Reply-To: <20121203162323.GB19802@htj.dyndns.org>

On 12/03/2012 11:23 AM, Tejun Heo wrote:
> Hello, James.
>
> On Mon, Dec 03, 2012 at 08:25:43AM +0000, James Bottomley wrote:
>>> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
>>> index e65c62e..1756151 100644
>>> --- a/include/scsi/scsi_device.h
>>> +++ b/include/scsi/scsi_device.h
>>> @@ -160,6 +160,7 @@ struct scsi_device {
>>>   	unsigned can_power_off:1; /* Device supports runtime power off */
>>>   	unsigned wce_default_on:1;	/* Cache is ON by default */
>>>   	unsigned no_dif:1;	/* T10 PI (DIF) should be disabled */
>>> +	unsigned event_driven:1; /* No need to poll the device */
>>>
>>>   	DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */
>>>   	struct list_head event_list;	/* asserted events */
>>
>> Yes, but if we can get away with doing that, it should be in genhd
>> because it's completely generic.
>>
>> I was imagining we'd have to fake the reply to the test unit ready or
>> some other commands, which is why it would need to be in sr.c
>>
>> The check events code is Tejun's baby, if he's OK with it then just do
>> it in genhd.c
>
> The problem here is there's no easy to reach genhd from libata (or the
> other way around) without going through sr.  I think we're gonna have
> to have something in sr one way or the other.

  ...which is precisely as I said when v1 of this ZPODD patchset appeared.

sr modifications cannot be avoided.

	Jeff





^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox