Linux Power Management development
* [PATCH] cpufreq: ondemand: update sampling rate only on right CPUs
From: Fabio Baltieri @ 2012-11-26 18:10 UTC
  To: Rafael J. Wysocki, cpufreq, linux-pm; +Cc: linux-kernel, Fabio Baltieri

Fix cpufreq_gov_ondemand to skip CPUs where another governor is used.

The bug presents itself as a NULL pointer dereference in the mutex_lock()
call, and can be reproduced on an SMP machine by setting the default
governor to anything other than ondemand, setting a single CPU's governor
to ondemand, then changing the sampling rate by writing to:

> /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate

Backtrace:

Nov 26 17:36:54 balto kernel: [  839.585241] BUG: unable to handle kernel NULL pointer dereference at           (null)
Nov 26 17:36:54 balto kernel: [  839.585311] IP: [<ffffffff8174e082>] __mutex_lock_slowpath+0xb2/0x170
[snip]
Nov 26 17:36:54 balto kernel: [  839.587005] Call Trace:
Nov 26 17:36:54 balto kernel: [  839.587030]  [<ffffffff8174da82>] mutex_lock+0x22/0x40
Nov 26 17:36:54 balto kernel: [  839.587067]  [<ffffffff81610b8f>] store_sampling_rate+0xbf/0x150
Nov 26 17:36:54 balto kernel: [  839.587110]  [<ffffffff81031e9c>] ?  __do_page_fault+0x1cc/0x4c0
Nov 26 17:36:54 balto kernel: [  839.587153]  [<ffffffff813309bf>] kobj_attr_store+0xf/0x20
Nov 26 17:36:54 balto kernel: [  839.587192]  [<ffffffff811bb62d>] sysfs_write_file+0xcd/0x140
Nov 26 17:36:54 balto kernel: [  839.587234]  [<ffffffff8114c12c>] vfs_write+0xac/0x180
Nov 26 17:36:54 balto kernel: [  839.587271]  [<ffffffff8114c472>] sys_write+0x52/0xa0
Nov 26 17:36:54 balto kernel: [  839.587306]  [<ffffffff810321ce>] ?  do_page_fault+0xe/0x10
Nov 26 17:36:54 balto kernel: [  839.587345]  [<ffffffff81751202>] system_call_fastpath+0x16/0x1b

Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---

Hi Rafael,

this is based on a clean linux-pm linux-next branch (i.e. not with my other
patch-set applied), so expect a context conflict if both are applied.

Thanks,
Fabio

 drivers/cpufreq/cpufreq_ondemand.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index cca3e9f..7731f7c 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -40,6 +40,10 @@
 static struct dbs_data od_dbs_data;
 static DEFINE_PER_CPU(struct od_cpu_dbs_info_s, od_cpu_dbs_info);
 
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND
+static struct cpufreq_governor cpufreq_gov_ondemand;
+#endif
+
 static struct od_dbs_tuners od_tuners = {
 	.up_threshold = DEF_FREQUENCY_UP_THRESHOLD,
 	.sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR,
@@ -279,6 +283,10 @@ static void update_sampling_rate(unsigned int new_rate)
 		policy = cpufreq_cpu_get(cpu);
 		if (!policy)
 			continue;
+		if (policy->governor != &cpufreq_gov_ondemand) {
+			cpufreq_cpu_put(policy);
+			continue;
+		}
 		dbs_info = &per_cpu(od_cpu_dbs_info, policy->cpu);
 		cpufreq_cpu_put(policy);
 
-- 
1.7.12.1
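
Note for readers: the effect of the added governor check can be modeled in
user space. The sketch below is not kernel code. The types, the NR_CPUS
value, the start_ondemand() helper and the check_governor switch are all
simplified stand-ins invented for illustration, and a NULL lock pointer
stands in for per-CPU state that ondemand never initialised.

```c
/*
 * User-space model of the check added to update_sampling_rate().
 * Not kernel code: every type and name below is a simplified stand-in.
 */
#include <stddef.h>

#define NR_CPUS 4

struct governor { const char *name; };

struct policy { const struct governor *governor; };

/* Per-CPU state; .lock is only set up once ondemand starts on a CPU. */
struct cpu_dbs_info {
	int *lock;		/* stand-in for the per-CPU mutex */
	unsigned int rate;
};

const struct governor gov_ondemand = { "ondemand" };
const struct governor gov_performance = { "performance" };

struct policy policies[NR_CPUS];
struct cpu_dbs_info dbs_info[NR_CPUS];

/* Start ondemand on one CPU: attach the governor and set up its lock. */
void start_ondemand(int cpu, int *lock)
{
	policies[cpu].governor = &gov_ondemand;
	dbs_info[cpu].lock = lock;
}

/*
 * Apply new_rate to the CPUs and return how many were updated.
 * Returns -1 if it would have touched a lock that was never set up,
 * which models the point where the real kernel oopses.
 */
int update_sampling_rate(unsigned int new_rate, int check_governor)
{
	int cpu, updated = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!policies[cpu].governor)
			continue;	/* models cpufreq_cpu_get() returning NULL */
		if (check_governor && policies[cpu].governor != &gov_ondemand)
			continue;	/* models the check added by the patch */
		if (!dbs_info[cpu].lock)
			return -1;	/* models the NULL dereference */
		*dbs_info[cpu].lock += 1;	/* stand-in for lock/unlock */
		dbs_info[cpu].rate = new_rate;
		updated++;
	}
	return updated;
}
```

With CPU 0 left on the performance governor and CPU 1 started on ondemand,
update_sampling_rate(20000, 0) returns -1 in this model, while
update_sampling_rate(20000, 1) skips CPU 0 and updates only CPU 1.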

* PROBLEM: resuming from suspend-to-ram hangs
From: Adrian Byszuk @ 2012-11-26 17:14 UTC
  To: linux-pm

Hello almighty Linux kernel developers.

My laptop (Asus F3F-AP075) can't wake up from suspend-to-RAM; it seems
to hang completely while resuming. Oddly, the HDD LED stays lit the whole
time the system is hung.
To be more precise: s2ram used to work on this notebook, but it has
stopped working. Some time ago I replaced the typical 2.5" hard drive with
an OCZ Vertex4 SSD (and installed a new distribution on it). I think
that's when it stopped working (however, I don't suspend often,
so it could also be pure coincidence).
This problem is 100% reproducible on both the Debian 3.2 kernel and vanilla
3.6.x kernels.

Following Documentation/power/s2ram.txt, I got something like this
from dmesg:
[    0.253496]   Magic number: 0:56:890
[    0.253501]   hash matches drivers/base/power/main.c:571
[    0.253542] pci 0000:00:1f.3: hash matches

and the PCI device 00:1f.3 is:
00:1f.3 SMBus: Intel Corporation N10/ICH 7 Family SMBus Controller (rev 02)
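
Note for readers: when the dmesg output is long, the device named on a
"hash matches" line can be picked out mechanically. The sketch below assumes
the line format shown above; parse_hash_match() and struct pci_addr are
hypothetical names introduced here, not part of any kernel or pciutils
interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for a PCI address such as 0000:00:1f.3. */
struct pci_addr { unsigned int domain, bus, dev, fn; };

/*
 * Return 1 and fill *out if line looks like
 * "[ timestamp] pci DDDD:BB:DD.F: hash matches", otherwise return 0.
 */
int parse_hash_match(const char *line, struct pci_addr *out)
{
	const char *p = strstr(line, "pci ");

	/* Require both the "pci " prefix and the trailing marker. */
	if (!p || !strstr(p, ": hash matches"))
		return 0;

	/* All four address fields are hexadecimal. */
	return sscanf(p, "pci %x:%x:%x.%x",
		      &out->domain, &out->bus, &out->dev, &out->fn) == 4;
}
```

Fed the third dmesg line above, this yields domain 0, bus 0, device 0x1f,
function 3. The "hash matches drivers/base/power/main.c:571" line is
rejected because it names a source location rather than a device.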

I'm willing to debug this myself, but I need some help. Could someone
tell me where to start looking?

Regards,
Adrian

# sh scripts/ver_linux:

Linux adrian-lap 3.6.7 #5 SMP PREEMPT Mon Nov 26 14:19:09 CET 2012 x86_64 GNU/Linux

Gnu C                  4.7
Gnu make               3.81
binutils               2.22
util-linux             ver_linux: line 23: fdformat: command not found
mount                  support
module-init-tools      found
Linux C Library        2.13
Dynamic linker (ldd)   2.13
Procps                 3.3.3
Kbd                    1.15.3
Sh-utils               8.13
Modules Loaded         ablk_helper cryptd aes_x86_64 aes_generic rfcomm
bnep pci_stub vboxpci vboxnetadp vboxnetflt binfmt_misc vboxdrv uinput
nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc btusb bluetooth loop
firewire_sbp2 hid_generic usbhid hid stkwebcam videodev media joydev
arc4 8139too iTCO_wdt iTCO_vendor_support uhci_hcd iwl3945 iwlegacy
mac80211 lpc_ich snd_hda_codec_si3054 8139cp sdhci_pci mii sdhci
firewire_ohci firewire_core crc_itu_t mfd_core snd_hda_codec_idt
ehci_hcd mmc_core usbcore r592 memstick psmouse rng_core serio_raw
coretemp pcspkr i2c_i801 usb_common snd_hda_intel snd_hda_codec cfg80211
snd_hwdep snd_pcm snd_page_alloc snd_seq snd_seq_device snd_timer snd
soundcore battery ac asus_laptop sparse_keymap rfkill input_polldev
evdev acpi_cpufreq mperf processor ext4 crc16 jbd2 mbcache ata_generic
microcode i915 thermal video button i2c_algo_bit drm_kms_helper drm
i2c_core thermal_sys

# cat /proc/cpuinfo:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         T5500  @ 1.66GHz
stepping        : 6
microcode       : 0xd1
cpu MHz         : 996.000
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips        : 3324.92
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         T5500  @ 1.66GHz
stepping        : 6
microcode       : 0xd1
cpu MHz         : 996.000
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips        : 3324.92
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

# cat /proc/modules:
ablk_helper 12572 0 - Live 0xffffffffa078e000
cryptd 14559 1 ablk_helper, Live 0xffffffffa0785000
aes_x86_64 16796 1 - Live 0xffffffffa077f000
aes_generic 37122 1 aes_x86_64, Live 0xffffffffa0774000
rfcomm 37038 14 - Live 0xffffffffa0769000
bnep 17345 2 - Live 0xffffffffa05d7000
pci_stub 12429 1 - Live 0xffffffffa04c1000
vboxpci 23040 0 - Live 0xffffffffa05d0000 (O)
vboxnetadp 25443 0 - Live 0xffffffffa05c8000 (O)
vboxnetflt 23221 0 - Live 0xffffffffa05c1000 (O)
binfmt_misc 16974 1 - Live 0xffffffffa075f000
vboxdrv 217495 3 vboxpci,vboxnetadp,vboxnetflt, Live 0xffffffffa058a000 (O)
uinput 17361 1 - Live 0xffffffffa0514000
nfsd 236550 2 - Live 0xffffffffa054f000
auth_rpcgss 37102 1 nfsd, Live 0xffffffffa0544000
nfs_acl 12511 1 nfsd, Live 0xffffffffa04bc000
nfs 122995 0 - Live 0xffffffffa051c000
lockd 70976 2 nfsd,nfs, Live 0xffffffffa0501000
fscache 40076 1 nfs, Live 0xffffffffa04f6000
sunrpc 189709 6 nfsd,auth_rpcgss,nfs_acl,nfs,lockd, Live 0xffffffffa04c6000
btusb 17300 0 - Live 0xffffffffa053e000
bluetooth 187463 24 rfcomm,bnep,btusb, Live 0xffffffffa048d000
loop 26516 0 - Live 0xffffffffa0481000
firewire_sbp2 21745 0 - Live 0xffffffffa0414000
hid_generic 12385 0 - Live 0xffffffffa046f000
usbhid 44452 0 - Live 0xffffffffa0475000
hid 93518 2 hid_generic,usbhid, Live 0xffffffffa0442000
stkwebcam 26790 0 - Live 0xffffffffa040c000
videodev 99739 1 stkwebcam, Live 0xffffffffa0428000
media 18109 1 videodev, Live 0xffffffffa03ee000
joydev 17108 0 - Live 0xffffffffa0406000
arc4 12536 2 - Live 0xffffffffa034c000
8139too 26327 0 - Live 0xffffffffa03fe000
iTCO_wdt 12831 0 - Live 0xffffffffa0307000
iTCO_vendor_support 12649 1 iTCO_wdt, Live 0xffffffffa02b1000
uhci_hcd 30814 0 - Live 0xffffffffa03f5000
iwl3945 54386 0 - Live 0xffffffffa03c6000
iwlegacy 54678 1 iwl3945, Live 0xffffffffa033d000
mac80211 423307 2 iwl3945,iwlegacy, Live 0xffffffffa035d000
lpc_ich 16665 0 - Live 0xffffffffa0281000
snd_hda_codec_si3054 12758 1 - Live 0xffffffffa0302000
8139cp 26423 0 - Live 0xffffffffa0351000
sdhci_pci 17754 0 - Live 0xffffffffa0333000
mii 12675 2 8139too,8139cp, Live 0xffffffffa02e3000
sdhci 30597 1 sdhci_pci, Live 0xffffffffa02f9000
firewire_ohci 35291 0 - Live 0xffffffffa02eb000
firewire_core 56645 2 firewire_sbp2,firewire_ohci, Live 0xffffffffa02d4000
crc_itu_t 12347 1 firewire_core, Live 0xffffffffa02ca000
mfd_core 12601 1 lpc_ich, Live 0xffffffffa02cf000
snd_hda_codec_idt 61647 1 - Live 0xffffffffa045e000
ehci_hcd 48637 0 - Live 0xffffffffa041b000
mmc_core 84816 2 sdhci_pci,sdhci, Live 0xffffffffa03d8000
usbcore 148405 5 btusb,usbhid,stkwebcam,uhci_hcd,ehci_hcd, Live 0xffffffffa030d000
r592 17323 0 - Live 0xffffffffa02a2000
memstick 13696 1 r592, Live 0xffffffffa028b000
psmouse 76669 0 - Live 0xffffffffa02b6000
rng_core 12614 0 - Live 0xffffffffa02ac000
serio_raw 12849 0 - Live 0xffffffffa029d000
coretemp 12854 0 - Live 0xffffffffa0294000
pcspkr 12595 0 - Live 0xffffffffa0250000
i2c_i801 16963 0 - Live 0xffffffffa0255000
usb_common 12354 1 usbcore, Live 0xffffffffa01ea000
snd_hda_intel 30388 3 - Live 0xffffffffa0278000
snd_hda_codec 99628 3 snd_hda_codec_si3054,snd_hda_codec_idt,snd_hda_intel, Live 0xffffffffa025e000
cfg80211 178277 3 iwl3945,iwlegacy,mac80211, Live 0xffffffffa0223000
snd_hwdep 13148 1 snd_hda_codec, Live 0xffffffffa01e5000
snd_pcm 75529 3 snd_hda_codec_si3054,snd_hda_intel,snd_hda_codec, Live 0xffffffffa020f000
snd_page_alloc 17065 2 snd_hda_intel,snd_pcm, Live 0xffffffffa0205000
snd_seq 52897 0 - Live 0xffffffffa01f7000
snd_seq_device 13132 1 snd_seq, Live 0xffffffffa01bd000
snd_timer 26605 2 snd_pcm,snd_seq, Live 0xffffffffa01ef000
snd 60877 15 snd_hda_codec_si3054,snd_hda_codec_idt,snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_seq,snd_seq_device,snd_timer, Live 0xffffffffa01d5000
soundcore 13026 1 snd, Live 0xffffffffa01cc000
battery 13109 0 - Live 0xffffffffa01c7000
ac 12624 0 - Live 0xffffffffa01c2000
asus_laptop 22742 0 - Live 0xffffffffa01b6000
sparse_keymap 12760 1 asus_laptop, Live 0xffffffffa00f5000
rfkill 18902 5 bluetooth,cfg80211,asus_laptop, Live 0xffffffffa01b0000
input_polldev 12906 1 asus_laptop, Live 0xffffffffa01a2000
evdev 17406 20 - Live 0xffffffffa0198000
acpi_cpufreq 12893 0 - Live 0xffffffffa018f000
mperf 12411 1 acpi_cpufreq, Live 0xffffffffa01ab000
processor 28098 3 acpi_cpufreq, Live 0xffffffffa00ed000
ext4 434100 3 - Live 0xffffffffa0124000
crc16 12343 2 bluetooth,ext4, Live 0xffffffffa00fb000
jbd2 82043 1 ext4, Live 0xffffffffa010e000
mbcache 12985 1 ext4, Live 0xffffffffa00e8000
ata_generic 12490 0 - Live 0xffffffffa0109000
microcode 17625 0 - Live 0xffffffffa005c000
i915 487264 3 - Live 0xffffffffa0070000
thermal 17383 0 - Live 0xffffffffa0103000
video 17630 1 i915, Live 0xffffffffa0056000
button 12930 1 i915, Live 0xffffffffa0012000
i2c_algo_bit 12751 1 i915, Live 0xffffffffa004e000
drm_kms_helper 35294 1 i915, Live 0xffffffffa0066000
drm 217040 4 i915,drm_kms_helper, Live 0xffffffffa0018000
i2c_core 23855 6 videodev,i2c_i801,i915,i2c_algo_bit,drm_kms_helper,drm, Live 0xffffffffa000b000
thermal_sys 22203 3 processor,thermal,video, Live 0xffffffffa0000000

# cat /proc/ioports:
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0062-0062 : EC data
  0064-0064 : keyboard
  0066-0066 : EC cmd
  0070-0071 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
  0170-0177 : 0000:00:1f.2
    0170-0177 : ata_piix
  01f0-01f7 : 0000:00:1f.2
    01f0-01f7 : ata_piix
  0250-0253 : pnp 00:0a
  0256-025f : pnp 00:0a
  0376-0376 : 0000:00:1f.2
    0376-0376 : ata_piix
  03c0-03df : vga+
  03f6-03f6 : 0000:00:1f.2
    03f6-03f6 : ata_piix
  0400-041f : 0000:00:1f.3
  0480-04bf : pnp 00:08
  04d0-04d1 : pnp 00:08
  0800-087f : pnp 00:08
    0800-0803 : ACPI PM1a_EVT_BLK
    0804-0805 : ACPI PM1a_CNT_BLK
    0808-080b : ACPI PM_TMR
    0810-0815 : ACPI CPU throttle
    0820-0820 : ACPI PM2_CNT_BLK
    0828-082f : ACPI GPE0_BLK
    0830-0833 : iTCO_wdt
      0830-0833 : iTCO_wdt
    0860-087f : iTCO_wdt
      0860-087f : iTCO_wdt
0cf8-0cff : PCI conf1
0d00-ffff : PCI Bus 0000:00
  1000-1fff : PCI Bus 0000:01
  2000-2fff : PCI Bus 0000:02
  c000-cfff : PCI Bus 0000:03
  d000-dfff : PCI Bus 0000:05
    d800-d8ff : 0000:05:00.0
      d800-d8ff : 8139too
  e400-e41f : 0000:00:1d.3
    e400-e41f : uhci_hcd
  e480-e49f : 0000:00:1d.2
    e480-e49f : uhci_hcd
  e800-e81f : 0000:00:1d.1
    e800-e81f : uhci_hcd
  e880-e89f : 0000:00:1d.0
    e880-e89f : uhci_hcd
  ec00-ec07 : 0000:00:02.0
  ffa0-ffaf : 0000:00:1f.2
    ffa0-ffaf : ata_piix

# cat /proc/iomem
00000000-0000ffff : reserved
00010000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c7fff : Video ROM
000d0000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
  000f0000-000fffff : System ROM
00100000-b77affff : System RAM
  01000000-01441d42 : Kernel code
  01441d43-01888b7f : Kernel data
  0191d000-019eafff : Kernel bss
b77b0000-b77bdfff : ACPI Tables
b77be000-b77fffff : ACPI Non-volatile Storage
b7800000-ffffffff : PCI Bus 0000:00
  b7800000-b79fffff : PCI Bus 0000:01
  b7a00000-b7bfffff : PCI Bus 0000:01
  b7c00000-b7dfffff : PCI Bus 0000:02
  bdf00000-bfefffff : PCI Bus 0000:03
  d0000000-dfffffff : 0000:00:02.0
  e0000000-e3ffffff : PCI MMCONFIG 0000 [bus 00-3f]
    e0000000-e3ffffff : pnp 00:0b
  fdf00000-fdffffff : PCI Bus 0000:02
    fdfff000-fdffffff : 0000:02:00.0
      fdfff000-fdffffff : iwl3945
  fe000000-fe7fffff : PCI Bus 0000:03
  fe800000-fe8fffff : PCI Bus 0000:05
    fe8fe800-fe8fe8ff : 0000:05:01.3
      fe8fe800-fe8fe8ff : r592
    fe8fec00-fe8fecff : 0000:05:01.2
      fe8fec00-fe8fecff : mmc1
    fe8ff000-fe8ff7ff : 0000:05:01.0
      fe8ff000-fe8ff7ff : firewire_ohci
    fe8ff800-fe8ff8ff : 0000:05:01.1
      fe8ff800-fe8ff8ff : mmc0
    fe8ffc00-fe8ffcff : 0000:05:00.0
      fe8ffc00-fe8ffcff : 8139too
  fea80000-feafffff : 0000:00:02.1
  feb3bc00-feb3bfff : 0000:00:1d.7
    feb3bc00-feb3bfff : ehci_hcd
  feb3c000-feb3ffff : 0000:00:1b.0
    feb3c000-feb3ffff : ICH HD audio
  feb40000-feb7ffff : 0000:00:02.0
  feb80000-febfffff : 0000:00:02.0
  fec00000-fec003ff : IOAPIC 0
  fec10000-fec17fff : pnp 00:0a
  fec18000-fec1ffff : pnp 00:0a
  fec20000-fec27fff : pnp 00:0a
  fed00000-fed003ff : HPET 0
  fed13000-fed19fff : pnp 00:01
  fed1c000-fed1ffff : pnp 00:08
    fed1f410-fed1f414 : iTCO_wdt
      fed1f410-fed1f414 : iTCO_wdt
  fed20000-fed3ffff : pnp 00:08
  fed45000-fed89fff : pnp 00:08
  fee00000-fee00fff : Local APIC
    fee00000-fee00fff : reserved
      fee00000-fee00fff : pnp 00:0a
  ffb80000-ffffffff : reserved
    fff00000-ffffffff : pnp 00:08

# lspci -vvv:
00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML
and 945GT Express Memory Controller Hub (rev 03)
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort+ >SERR- <PERR- INTx-
        Latency: 0
        Capabilities: [e0] Vendor Specific Information: Len=09 <?>
        Kernel driver in use: agpgart-intel

00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS,
943/940GML Express Integrated Graphics Controller (rev 03) (prog-if 00
[VGA controller])
        Subsystem: ASUSTeK Computer Inc. Device 1252
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at feb80000 (32-bit, non-prefetchable) [size=512K]
        Region 1: I/O ports at ec00 [size=8]
        Region 2: Memory at d0000000 (32-bit, prefetchable) [size=256M]
        Region 3: Memory at feb40000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
                Address: 00000000  Data: 0000
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: i915

00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME,
943/940GML Express Integrated Graphics Controller (rev 03)
        Subsystem: ASUSTeK Computer Inc. Device 1252
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Region 0: Memory at fea80000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

00:1b.0 Audio device: Intel Corporation N10/ICH 7 Family High Definition
Audio Controller (rev 02)
        Subsystem: ASUSTeK Computer Inc. Device 1253
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 43
        Region 0: Memory at feb3c000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0300c  Data: 41c1
        Capabilities: [70] Express (v1) Root Complex Integrated
Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
<64ns, L1 <1us
                        ExtTag- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+
TransPend-
                LnkCap: Port #0, Speed unknown, Width x0, ASPM unknown,
Latency L0 <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk-
DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
                VC1:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable+ ID=1 ArbSelect=Fixed TC/VC=80
                        Status: NegoPending- InProgress-
        Capabilities: [130 v1] Root Complex Link
                Desc:   PortNumber=0f ComponentID=02 EltType=Config
                Link0:  Desc:   TargetPort=00 TargetComponent=02
AssocRCRB- LinkType=MemMapped LinkValid+
                        Addr:   00000000fed1c000
        Kernel driver in use: snd_hda_intel

00:1c.0 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port
1 (rev 02) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00001000-00001fff
        Memory behind bridge: b7800000-b79fffff
        Prefetchable memory behind bridge: 00000000b7a00000-00000000b7bfffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Express (v1) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
                        ExtTag- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+
TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1,
Latency L0 <1us, L1 <4us
                        ClockPM- Surprise- LLActRep+ BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled+ Retrain-
CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+
Surprise+
                        Slot #2, PowerLimit 10.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt-
HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power-
Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt-
PresDet- Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal-
PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0300c  Data: 4191
        Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 1317
        Capabilities: [a0] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
                VC1:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable- ID=0 ArbSelect=Fixed TC/VC=00
                        Status: NegoPending- InProgress-
        Capabilities: [180 v1] Root Complex Link
                Desc:   PortNumber=01 ComponentID=02 EltType=Config
                Link0:  Desc:   TargetPort=00 TargetComponent=02
AssocRCRB- LinkType=MemMapped LinkValid+
                        Addr:   00000000fed1c001
        Kernel driver in use: pcieport

00:1c.1 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port
2 (rev 02) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: fdf00000-fdffffff
        Prefetchable memory behind bridge: 00000000b7c00000-00000000b7dfffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Express (v1) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
                        ExtTag- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+
TransPend-
                LnkCap: Port #2, Speed 2.5GT/s, Width x1, ASPM L0s L1,
Latency L0 <256ns, L1 <4us
                        ClockPM- Surprise- LLActRep+ BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
DLActive+ BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+
Surprise+
                        Slot #3, PowerLimit 10.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt-
HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power-
Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt-
PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal-
PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0300c  Data: 41a1
        Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 1317
        Capabilities: [a0] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
                VC1:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable- ID=0 ArbSelect=Fixed TC/VC=00
                        Status: NegoPending- InProgress-
        Capabilities: [180 v1] Root Complex Link
                Desc:   PortNumber=02 ComponentID=02 EltType=Config
                Link0:  Desc:   TargetPort=00 TargetComponent=02
AssocRCRB- LinkType=MemMapped LinkValid+
                        Addr:   00000000fed1c001
        Kernel driver in use: pcieport

00:1c.2 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port
3 (rev 02) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Bus: primary=00, secondary=03, subordinate=04, sec-latency=0
        I/O behind bridge: 0000c000-0000cfff
        Memory behind bridge: fe000000-fe7fffff
        Prefetchable memory behind bridge: 00000000bdf00000-00000000bfefffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Express (v1) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
                        ExtTag- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+
TransPend-
                LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1,
Latency L0 <256ns, L1 <4us
                        ClockPM- Surprise- LLActRep+ BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+
Surprise+
                        Slot #4, PowerLimit 10.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt-
HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power-
Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt-
PresDet- Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal-
PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0300c  Data: 41b1
        Capabilities: [90] Subsystem: ASUSTeK Computer Inc. Device 1317
        Capabilities: [a0] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed+ WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
                VC1:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128-
WRR256-
                        Ctrl:   Enable- ID=0 ArbSelect=Fixed TC/VC=00
                        Status: NegoPending- InProgress-
        Capabilities: [180 v1] Root Complex Link
                Desc:   PortNumber=03 ComponentID=02 EltType=Config
                Link0:  Desc:   TargetPort=00 TargetComponent=02
AssocRCRB- LinkType=MemMapped LinkValid+
                        Addr:   00000000fed1c001
        Kernel driver in use: pcieport

00:1d.0 USB controller: Intel Corporation N10/ICH 7 Family USB UHCI
Controller #1 (rev 02) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 23
        Region 4: I/O ports at e880 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.1 USB controller: Intel Corporation N10/ICH 7 Family USB UHCI
Controller #2 (rev 02) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 19
        Region 4: I/O ports at e800 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.2 USB controller: Intel Corporation N10/ICH 7 Family USB UHCI
Controller #3 (rev 02) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin C routed to IRQ 18
        Region 4: I/O ports at e480 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.3 USB controller: Intel Corporation N10/ICH 7 Family USB UHCI
Controller #4 (rev 02) (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin D routed to IRQ 16
        Region 4: I/O ports at e400 [size=32]
        Kernel driver in use: uhci_hcd

00:1d.7 USB controller: Intel Corporation N10/ICH 7 Family USB2 EHCI
Controller (rev 02) (prog-if 20 [EHCI])
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 23
        Region 0: Memory at feb3bc00 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Debug port: BAR=1 offset=00a0
        Kernel driver in use: ehci_hcd

00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e2)
(prog-if 01 [Subtractive decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Bus: primary=00, secondary=05, subordinate=05, sec-latency=32
        I/O behind bridge: 0000d000-0000dfff
        Memory behind bridge: fe800000-fe8fffff
        Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
        Secondary status: 66MHz- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Subsystem: ASUSTeK Computer Inc. Device 1317

00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface
Bridge (rev 02)
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Capabilities: [e0] Vendor Specific Information: Len=0c <?>
        Kernel driver in use: lpc_ich

00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7-M Family)
SATA Controller [IDE mode] (rev 02) (prog-if 80 [Master])
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx+
        Latency: 0
        Interrupt: pin B routed to IRQ 19
        Region 0: I/O ports at 01f0 [size=8]
        Region 1: I/O ports at 03f4 [size=1]
        Region 2: I/O ports at 0170 [size=8]
        Region 3: I/O ports at 0374 [size=1]
        Region 4: I/O ports at ffa0 [size=16]
        Capabilities: [70] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: ata_piix

00:1f.3 SMBus: Intel Corporation N10/ICH 7 Family SMBus Controller (rev 02)
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin B routed to IRQ 19
        Region 4: I/O ports at 0400 [size=32]

02:00.0 Network controller: Intel Corporation PRO/Wireless 3945ABG
[Golan] Network Connection (rev 02)
        Subsystem: Intel Corporation PRO/Wireless 3945ABG Network Connection
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 44
        Region 0: Memory at fdfff000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0300c  Data: 41d1
        Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
<512ns, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+
TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1,
Latency L0 <128ns, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
                AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap-
ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-18-de-ff-ff-c8-c4-c6
        Kernel driver in use: iwl3945

05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
        Subsystem: ASUSTeK Computer Inc. L8400B or L3C/S notebook
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64 (8000ns min, 16000ns max)
        Interrupt: pin A routed to IRQ 18
        Region 0: I/O ports at d800 [size=256]
        Region 1: Memory at fe8ffc00 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: 8139too

05:01.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller
(prog-if 10 [OHCI])
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32 (500ns min, 1000ns max)
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at fe8ff000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME+
        Kernel driver in use: firewire_ohci

05:01.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro
Host Adapter (rev 19)
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64
        Interrupt: pin B routed to IRQ 17
        Region 0: Memory at fe8ff800 (32-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
        Kernel driver in use: sdhci-pci

05:01.2 System peripheral: Ricoh Co Ltd R5C843 MMC Host Controller (rev 01)
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 17
        Region 0: Memory at fe8fec00 (32-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
        Kernel driver in use: sdhci-pci

05:01.3 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host
Adapter (rev 0a)
        Subsystem: ASUSTeK Computer Inc. Device 1317
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin B routed to IRQ 17
        Region 0: Memory at fe8fe800 (32-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
        Kernel driver in use: r592


* [PATCH 4/5] cpufreq: conservative: call dbs_check_cpu only when necessary
From: Fabio Baltieri @ 2012-11-26 16:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, cpufreq, linux-pm
  Cc: Rickard Andersson, Vincent Guittot, Linus Walleij, Lee Jones,
	linux-kernel, Fabio Baltieri
In-Reply-To: <1353947996-26723-1-git-send-email-fabio.baltieri@linaro.org>

Modify the conservative governor's timer so that it does not resample
CPU utilization if it was recently sampled from another SW coordinated
core.
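
As an illustration only, the "recently sampled" decision can be modelled
in plain userspace C. This is a sketch, not the kernel code: the struct
and function names (leader_state, should_sample) are hypothetical, and
time is assumed to be passed in as microseconds rather than read with
ktime_get(). A sample counts as recent when less than half a sampling
period has passed since the last one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the leader CPU's shared state. */
struct leader_state {
	int64_t last_sample_us;	/* models cdbs.time_stamp, in microseconds */
};

/*
 * Return true when the caller should take a new load sample.
 * The sample is skipped if less than half a sampling period has passed
 * since the last one; otherwise the time stamp is advanced to "now".
 */
static bool should_sample(struct leader_state *leader, int64_t now_us,
			  unsigned int sampling_rate_us)
{
	int64_t delta_us = now_us - leader->last_sample_us;

	if (delta_us < (int64_t)(sampling_rate_us / 2))
		return false;

	leader->last_sample_us = now_us;
	return true;
}
```

In this model the timer is still re-armed whether or not a sample is
taken; only the load evaluation is skipped.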

Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---
 drivers/cpufreq/cpufreq_conservative.c | 47 +++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index b9d7f14..5d8e894 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -111,22 +111,57 @@ static void cs_check_cpu(int cpu, unsigned int load)
 	}
 }
 
-static void cs_dbs_timer(struct work_struct *work)
+static void cs_timer_update(struct cs_cpu_dbs_info_s *dbs_info, bool sample,
+			    struct delayed_work *dw)
 {
-	struct cs_cpu_dbs_info_s *dbs_info = container_of(work,
-			struct cs_cpu_dbs_info_s, cdbs.work.work);
 	unsigned int cpu = dbs_info->cdbs.cpu;
 	int delay = delay_for_sampling_rate(cs_tuners.sampling_rate);
 
+	if (sample)
+		dbs_check_cpu(&cs_dbs_data, cpu);
+
+	schedule_delayed_work_on(smp_processor_id(), dw, delay);
+}
+
+static void cs_timer_coordinated(struct cs_cpu_dbs_info_s *dbs_info_local,
+				 struct delayed_work *dw)
+{
+	struct cs_cpu_dbs_info_s *dbs_info;
+	ktime_t time_now;
+	s64 delta_us;
+	bool sample = true;
+
+	/* use leader CPU's dbs_info */
+	dbs_info = &per_cpu(cs_cpu_dbs_info, dbs_info_local->cdbs.cpu);
 	mutex_lock(&dbs_info->cdbs.timer_mutex);
 
-	dbs_check_cpu(&cs_dbs_data, cpu);
+	time_now = ktime_get();
+	delta_us = ktime_us_delta(time_now, dbs_info->cdbs.time_stamp);
 
-	schedule_delayed_work_on(smp_processor_id(), &dbs_info->cdbs.work,
-			delay);
+	/* Do nothing if we recently have sampled */
+	if (delta_us < (s64)(cs_tuners.sampling_rate / 2))
+		sample = false;
+	else
+		dbs_info->cdbs.time_stamp = time_now;
+
+	cs_timer_update(dbs_info, sample, dw);
 	mutex_unlock(&dbs_info->cdbs.timer_mutex);
 }
 
+static void cs_dbs_timer(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct cs_cpu_dbs_info_s *dbs_info = container_of(work,
+			struct cs_cpu_dbs_info_s, cdbs.work.work);
+
+	if (dbs_sw_coordinated_cpus(&dbs_info->cdbs)) {
+		cs_timer_coordinated(dbs_info, dw);
+	} else {
+		mutex_lock(&dbs_info->cdbs.timer_mutex);
+		cs_timer_update(dbs_info, true, dw);
+		mutex_unlock(&dbs_info->cdbs.timer_mutex);
+	}
+}
 static int dbs_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
 		void *data)
 {
-- 
1.7.12.1



* [PATCH 5/5] cpufreq: ondemand: use all CPUs in update_sampling_rate
From: Fabio Baltieri @ 2012-11-26 16:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, cpufreq, linux-pm
  Cc: Rickard Andersson, Vincent Guittot, Linus Walleij, Lee Jones,
	linux-kernel, Fabio Baltieri
In-Reply-To: <1353947996-26723-1-git-send-email-fabio.baltieri@linaro.org>

Modify update_sampling_rate() to check the do_dbs_timer delayed work of
every CPU and, if necessary, reschedule it immediately.

This is required in case of software coordinated CPUs, as we now have a
separate delayed work for each CPU.
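
A rough userspace sketch of the idea follows. It assumes that a pending
work item is pulled forward only when the new rate would make the next
sample due earlier than the queued expiry; that condition is not visible
in the hunks below and is stated here as an assumption. The names
cpu_work and apply_new_rate are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU record: when the pending sampling work fires. */
struct cpu_work {
	int      pending;	/* non-zero if work is queued */
	uint64_t expires_us;	/* absolute expiry time in microseconds */
};

/*
 * Walk every CPU's own work item. If the new rate would make the next
 * sample due earlier than the currently queued expiry, pull the expiry
 * forward. Returns the number of work items that were rescheduled.
 */
static size_t apply_new_rate(struct cpu_work *works, size_t ncpus,
			     uint64_t now_us, uint64_t new_rate_us)
{
	size_t rescheduled = 0;
	uint64_t next = now_us + new_rate_us;

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		struct cpu_work *w = &works[cpu];

		if (!w->pending)
			continue;	/* nothing queued on this CPU */
		if (next < w->expires_us) {
			w->expires_us = next;
			rescheduled++;
		}
	}
	return rescheduled;
}
```

The point of the sketch is the loop over every CPU's own record, rather
than over one record per policy.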

Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---
 drivers/cpufreq/cpufreq_ondemand.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index e07593e..e34bb2b 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -318,7 +318,7 @@ static void update_sampling_rate(unsigned int new_rate)
 		policy = cpufreq_cpu_get(cpu);
 		if (!policy)
 			continue;
-		dbs_info = &per_cpu(od_cpu_dbs_info, policy->cpu);
+		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cpufreq_cpu_put(policy);
 
 		mutex_lock(&dbs_info->cdbs.timer_mutex);
@@ -337,8 +337,7 @@ static void update_sampling_rate(unsigned int new_rate)
 			cancel_delayed_work_sync(&dbs_info->cdbs.work);
 			mutex_lock(&dbs_info->cdbs.timer_mutex);
 
-			schedule_delayed_work_on(dbs_info->cdbs.cpu,
-					&dbs_info->cdbs.work,
+			schedule_delayed_work_on(cpu, &dbs_info->cdbs.work,
 					usecs_to_jiffies(new_rate));
 
 		}
-- 
1.7.12.1


* [PATCH 3/5] cpufreq: ondemand: call dbs_check_cpu only when necessary
From: Fabio Baltieri @ 2012-11-26 16:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, cpufreq, linux-pm
  Cc: Rickard Andersson, Vincent Guittot, Linus Walleij, Lee Jones,
	linux-kernel, Fabio Baltieri
In-Reply-To: <1353947996-26723-1-git-send-email-fabio.baltieri@linaro.org>

Modify the ondemand governor's timer so that it does not resample CPU
utilization if it was recently sampled from another SW coordinated core.

Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---
 drivers/cpufreq/cpufreq_governor.c |  3 ++
 drivers/cpufreq/cpufreq_governor.h |  1 +
 drivers/cpufreq/cpufreq_ondemand.c | 58 +++++++++++++++++++++++++++++++-------
 3 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index b7c1f89..7e322e5 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -341,6 +341,9 @@ second_time:
 		mutex_unlock(&dbs_data->mutex);
 
 		if (dbs_sw_coordinated_cpus(cpu_cdbs)) {
+			/* Initiate timer time stamp */
+			cpu_cdbs->time_stamp = ktime_get();
+
 			for_each_cpu(j, policy->cpus) {
 				struct cpu_dbs_common_info *j_cdbs;
 
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index 5bf6fb8..aaf073d 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -82,6 +82,7 @@ struct cpu_dbs_common_info {
 	 * the governor or limits.
 	 */
 	struct mutex timer_mutex;
+	ktime_t time_stamp;
 };
 
 struct od_cpu_dbs_info_s {
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index fe6e47c..e07593e 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -212,23 +212,23 @@ static void od_check_cpu(int cpu, unsigned int load_freq)
 	}
 }
 
-static void od_dbs_timer(struct work_struct *work)
+static void od_timer_update(struct od_cpu_dbs_info_s *dbs_info, bool sample,
+			    struct delayed_work *dw)
 {
-	struct od_cpu_dbs_info_s *dbs_info =
-		container_of(work, struct od_cpu_dbs_info_s, cdbs.work.work);
 	unsigned int cpu = dbs_info->cdbs.cpu;
 	int delay, sample_type = dbs_info->sample_type;
 
-	mutex_lock(&dbs_info->cdbs.timer_mutex);
-
 	/* Common NORMAL_SAMPLE setup */
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	if (sample_type == OD_SUB_SAMPLE) {
 		delay = dbs_info->freq_lo_jiffies;
-		__cpufreq_driver_target(dbs_info->cdbs.cur_policy,
-			dbs_info->freq_lo, CPUFREQ_RELATION_H);
+		if (sample)
+			__cpufreq_driver_target(dbs_info->cdbs.cur_policy,
+						dbs_info->freq_lo,
+						CPUFREQ_RELATION_H);
 	} else {
-		dbs_check_cpu(&od_dbs_data, cpu);
+		if (sample)
+			dbs_check_cpu(&od_dbs_data, cpu);
 		if (dbs_info->freq_lo) {
 			/* Setup timer for SUB_SAMPLE */
 			dbs_info->sample_type = OD_SUB_SAMPLE;
@@ -239,11 +239,49 @@ static void od_dbs_timer(struct work_struct *work)
 		}
 	}
 
-	schedule_delayed_work_on(smp_processor_id(), &dbs_info->cdbs.work,
-			delay);
+	schedule_delayed_work_on(smp_processor_id(), dw, delay);
+}
+
+static void od_timer_coordinated(struct od_cpu_dbs_info_s *dbs_info_local,
+				 struct delayed_work *dw)
+{
+	struct od_cpu_dbs_info_s *dbs_info;
+	ktime_t time_now;
+	s64 delta_us;
+	bool sample = true;
+
+	/* use leader CPU's dbs_info */
+	dbs_info = &per_cpu(od_cpu_dbs_info, dbs_info_local->cdbs.cpu);
+	mutex_lock(&dbs_info->cdbs.timer_mutex);
+
+	time_now = ktime_get();
+	delta_us = ktime_us_delta(time_now, dbs_info->cdbs.time_stamp);
+
+	/* Do nothing if we recently have sampled */
+	if (delta_us < (s64)(od_tuners.sampling_rate / 2))
+		sample = false;
+	else
+		dbs_info->cdbs.time_stamp = time_now;
+
+	od_timer_update(dbs_info, sample, dw);
 	mutex_unlock(&dbs_info->cdbs.timer_mutex);
 }
 
+static void od_dbs_timer(struct work_struct *work)
+{
+	struct delayed_work *dw = to_delayed_work(work);
+	struct od_cpu_dbs_info_s *dbs_info =
+		container_of(work, struct od_cpu_dbs_info_s, cdbs.work.work);
+
+	if (dbs_sw_coordinated_cpus(&dbs_info->cdbs)) {
+		od_timer_coordinated(dbs_info, dw);
+	} else {
+		mutex_lock(&dbs_info->cdbs.timer_mutex);
+		od_timer_update(dbs_info, true, dw);
+		mutex_unlock(&dbs_info->cdbs.timer_mutex);
+	}
+}
+
 /************************** sysfs interface ************************/
 
 static ssize_t show_sampling_rate_min(struct kobject *kobj,
-- 
1.7.12.1


* [PATCH 2/5] cpufreq: start/stop cpufreq timers on cpu hotplug
From: Fabio Baltieri @ 2012-11-26 16:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, cpufreq, linux-pm
  Cc: Rickard Andersson, Vincent Guittot, Linus Walleij, Lee Jones,
	linux-kernel, Fabio Baltieri
In-Reply-To: <1353947996-26723-1-git-send-email-fabio.baltieri@linaro.org>

Add a CPU notifier to start and stop individual core timers on CPU
hotplug events when running on CPUs with SW coordinated frequency.
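
The mapping from hotplug events to timer actions can be sketched in
userspace C as below. This is only a model: the enum values and the
cpu_timer struct are hypothetical stand-ins for the kernel's notifier
actions and per-CPU delayed work.

```c
#include <stdbool.h>

/* Hypothetical event names standing in for the kernel notifier actions. */
enum hotplug_event {
	EV_ONLINE,
	EV_DOWN_FAILED,
	EV_DOWN_PREPARE,
	EV_OTHER,
};

/* Hypothetical per-CPU timer state. */
struct cpu_timer {
	bool armed;
};

/* Start or stop one CPU's sampling timer in response to a hotplug event. */
static void handle_hotplug(struct cpu_timer *t, enum hotplug_event ev)
{
	switch (ev) {
	case EV_ONLINE:
	case EV_DOWN_FAILED:	/* offlining was aborted, the CPU stays up */
		t->armed = true;
		break;
	case EV_DOWN_PREPARE:	/* the CPU is about to go away */
		t->armed = false;
		break;
	default:
		break;		/* unrelated events leave the timer alone */
	}
}
```

Treating a failed offline like an online event restores the timer that
was stopped when the offline attempt began.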

Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---
 drivers/cpufreq/cpufreq_governor.c | 51 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index a00f02d..b7c1f89 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -25,9 +25,12 @@
 #include <linux/tick.h>
 #include <linux/types.h>
 #include <linux/workqueue.h>
+#include <linux/cpu.h>
 
 #include "cpufreq_governor.h"
 
+static DEFINE_PER_CPU(struct dbs_data *, cpu_cur_dbs);
+
 static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
 {
 	u64 idle_time;
@@ -193,6 +196,46 @@ static inline void dbs_timer_exit(struct cpu_dbs_common_info *cdbs)
 	cancel_delayed_work_sync(&cdbs->work);
 }
 
+static int __cpuinit cpu_callback(struct notifier_block *nfb,
+		unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct device *cpu_dev = get_cpu_device(cpu);
+	struct dbs_data *dbs_data = per_cpu(cpu_cur_dbs, cpu);
+	struct cpu_dbs_common_info *cpu_cdbs = dbs_data->get_cpu_cdbs(cpu);
+	unsigned int sampling_rate;
+
+	if (dbs_data->governor == GOV_CONSERVATIVE) {
+		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
+		sampling_rate = cs_tuners->sampling_rate;
+	} else {
+		struct od_dbs_tuners *od_tuners = dbs_data->tuners;
+		sampling_rate = od_tuners->sampling_rate;
+	}
+
+	if (cpu_dev) {
+		switch (action) {
+		case CPU_ONLINE:
+		case CPU_ONLINE_FROZEN:
+		case CPU_DOWN_FAILED:
+		case CPU_DOWN_FAILED_FROZEN:
+			dbs_timer_init(dbs_data, cpu_cdbs,
+					sampling_rate, cpu);
+			break;
+		case CPU_DOWN_PREPARE:
+		case CPU_DOWN_PREPARE_FROZEN:
+			dbs_timer_exit(cpu_cdbs);
+			break;
+		}
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __refdata ondemand_cpu_notifier = {
+	.notifier_call = cpu_callback,
+};
+
 int cpufreq_governor_dbs(struct dbs_data *dbs_data,
 		struct cpufreq_policy *policy, unsigned int event)
 {
@@ -304,7 +347,11 @@ second_time:
 				j_cdbs = dbs_data->get_cpu_cdbs(j);
 				dbs_timer_init(dbs_data, j_cdbs,
 					       *sampling_rate, j);
+
+				per_cpu(cpu_cur_dbs, j) = dbs_data;
 			}
+
+			register_hotcpu_notifier(&ondemand_cpu_notifier);
 		} else {
 			dbs_timer_init(dbs_data, cpu_cdbs, *sampling_rate, cpu);
 		}
@@ -315,11 +362,15 @@ second_time:
 			cs_dbs_info->enable = 0;
 
 		if (dbs_sw_coordinated_cpus(cpu_cdbs)) {
+			unregister_hotcpu_notifier(&ondemand_cpu_notifier);
+
 			for_each_cpu(j, policy->cpus) {
 				struct cpu_dbs_common_info *j_cdbs;
 
 				j_cdbs = dbs_data->get_cpu_cdbs(j);
 				dbs_timer_exit(j_cdbs);
+
+				per_cpu(cpu_cur_dbs, j) = NULL;
 			}
 		} else {
 			dbs_timer_exit(cpu_cdbs);
-- 
1.7.12.1


* [PATCH 1/5] cpufreq: handle SW coordinated CPUs
From: Fabio Baltieri @ 2012-11-26 16:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, cpufreq, linux-pm
  Cc: Rickard Andersson, Vincent Guittot, Linus Walleij, Lee Jones,
	linux-kernel, Fabio Baltieri
In-Reply-To: <1353947996-26723-1-git-send-email-fabio.baltieri@linaro.org>

From: Rickard Andersson <rickard.andersson@stericsson.com>

This patch fixes a bug that occurred when we had load on a secondary CPU
and the primary CPU was sleeping. Only one sampling timer was spawned
and it was spawned as a deferred timer on the primary CPU, so when a
secondary CPU had a change in load this was not detected by the cpufreq
governor (both ondemand and conservative).

This patch makes sure that deferred timers are run on all CPUs in the
case of software controlled CPUs that run on the same frequency.
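
For illustration, the decision of which CPUs get their own timer can be
modelled in userspace C. The sketch assumes a policy's CPUs fit in a
64-bit mask; cpu_mask_t, mask_weight, sw_coordinated and cpus_to_arm are
hypothetical names, with mask_weight playing the role that
cpumask_weight() plays in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit i set means CPU i belongs to the policy. */
typedef uint64_t cpu_mask_t;

/* Count the CPUs in the mask. */
static unsigned int mask_weight(cpu_mask_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* A policy is treated as SW coordinated when it spans more than one CPU. */
static bool sw_coordinated(cpu_mask_t policy_cpus)
{
	return mask_weight(policy_cpus) > 1;
}

/*
 * Return the mask of CPUs that get their own sampling timer:
 * every CPU of a coordinated policy, otherwise only the policy's CPU.
 */
static cpu_mask_t cpus_to_arm(cpu_mask_t policy_cpus, unsigned int policy_cpu)
{
	if (sw_coordinated(policy_cpus))
		return policy_cpus;
	return (cpu_mask_t)1 << policy_cpu;
}
```

With one timer per CPU, load on a secondary CPU is noticed even while
the primary CPU sleeps and its deferred timer does not fire.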

Signed-off-by: Rickard Andersson <rickard.andersson@stericsson.com>
Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org>
---
 drivers/cpufreq/cpufreq_conservative.c |  3 +-
 drivers/cpufreq/cpufreq_governor.c     | 52 ++++++++++++++++++++++++++++++----
 drivers/cpufreq/cpufreq_governor.h     |  1 +
 drivers/cpufreq/cpufreq_ondemand.c     |  3 +-
 4 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 64ef737..b9d7f14 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -122,7 +122,8 @@ static void cs_dbs_timer(struct work_struct *work)
 
 	dbs_check_cpu(&cs_dbs_data, cpu);
 
-	schedule_delayed_work_on(cpu, &dbs_info->cdbs.work, delay);
+	schedule_delayed_work_on(smp_processor_id(), &dbs_info->cdbs.work,
+			delay);
 	mutex_unlock(&dbs_info->cdbs.timer_mutex);
 }
 
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 6c5f1d3..a00f02d 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -161,13 +161,31 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 }
 EXPORT_SYMBOL_GPL(dbs_check_cpu);
 
+bool dbs_sw_coordinated_cpus(struct cpu_dbs_common_info *cdbs)
+{
+	struct cpufreq_policy *policy = cdbs->cur_policy;
+
+	return cpumask_weight(policy->cpus) > 1;
+}
+EXPORT_SYMBOL_GPL(dbs_sw_coordinated_cpus);
+
 static inline void dbs_timer_init(struct dbs_data *dbs_data,
-		struct cpu_dbs_common_info *cdbs, unsigned int sampling_rate)
+				  struct cpu_dbs_common_info *cdbs,
+				  unsigned int sampling_rate,
+				  int cpu)
 {
 	int delay = delay_for_sampling_rate(sampling_rate);
+	struct cpu_dbs_common_info *cdbs_local = dbs_data->get_cpu_cdbs(cpu);
+	struct od_cpu_dbs_info_s *od_dbs_info;
+
+	cancel_delayed_work_sync(&cdbs_local->work);
+
+	if (dbs_data->governor == GOV_ONDEMAND) {
+		od_dbs_info = dbs_data->get_cpu_dbs_info_s(cpu);
+		od_dbs_info->sample_type = OD_NORMAL_SAMPLE;
+	}
 
-	INIT_DEFERRABLE_WORK(&cdbs->work, dbs_data->gov_dbs_timer);
-	schedule_delayed_work_on(cdbs->cpu, &cdbs->work, delay);
+	schedule_delayed_work_on(cpu, &cdbs_local->work, delay);
 }
 
 static inline void dbs_timer_exit(struct cpu_dbs_common_info *cdbs)
@@ -217,6 +235,10 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data,
 			if (ignore_nice)
 				j_cdbs->prev_cpu_nice =
 					kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+
+			mutex_init(&j_cdbs->timer_mutex);
+			INIT_DEFERRABLE_WORK(&j_cdbs->work,
+					     dbs_data->gov_dbs_timer);
 		}
 
 		/*
@@ -275,15 +297,33 @@ second_time:
 		}
 		mutex_unlock(&dbs_data->mutex);
 
-		mutex_init(&cpu_cdbs->timer_mutex);
-		dbs_timer_init(dbs_data, cpu_cdbs, *sampling_rate);
+		if (dbs_sw_coordinated_cpus(cpu_cdbs)) {
+			for_each_cpu(j, policy->cpus) {
+				struct cpu_dbs_common_info *j_cdbs;
+
+				j_cdbs = dbs_data->get_cpu_cdbs(j);
+				dbs_timer_init(dbs_data, j_cdbs,
+					       *sampling_rate, j);
+			}
+		} else {
+			dbs_timer_init(dbs_data, cpu_cdbs, *sampling_rate, cpu);
+		}
 		break;
 
 	case CPUFREQ_GOV_STOP:
 		if (dbs_data->governor == GOV_CONSERVATIVE)
 			cs_dbs_info->enable = 0;
 
-		dbs_timer_exit(cpu_cdbs);
+		if (dbs_sw_coordinated_cpus(cpu_cdbs)) {
+			for_each_cpu(j, policy->cpus) {
+				struct cpu_dbs_common_info *j_cdbs;
+
+				j_cdbs = dbs_data->get_cpu_cdbs(j);
+				dbs_timer_exit(j_cdbs);
+			}
+		} else {
+			dbs_timer_exit(cpu_cdbs);
+		}
 
 		mutex_lock(&dbs_data->mutex);
 		mutex_destroy(&cpu_cdbs->timer_mutex);
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index f661654..5bf6fb8 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -171,6 +171,7 @@ static inline int delay_for_sampling_rate(unsigned int sampling_rate)
 
 u64 get_cpu_idle_time(unsigned int cpu, u64 *wall);
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu);
+bool dbs_sw_coordinated_cpus(struct cpu_dbs_common_info *cdbs);
 int cpufreq_governor_dbs(struct dbs_data *dbs_data,
 		struct cpufreq_policy *policy, unsigned int event);
 #endif /* _CPUFREQ_GOVERNER_H */
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index cca3e9f..fe6e47c 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -239,7 +239,8 @@ static void od_dbs_timer(struct work_struct *work)
 		}
 	}
 
-	schedule_delayed_work_on(cpu, &dbs_info->cdbs.work, delay);
+	schedule_delayed_work_on(smp_processor_id(), &dbs_info->cdbs.work,
+			delay);
 	mutex_unlock(&dbs_info->cdbs.timer_mutex);
 }
 
-- 
1.7.12.1


* [PATCH v5 0/5] cpufreq: handle SW coordinated CPUs
From: Fabio Baltieri @ 2012-11-26 16:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, cpufreq, linux-pm
  Cc: Rickard Andersson, Vincent Guittot, Linus Walleij, Lee Jones,
	linux-kernel, Fabio Baltieri

Hello Rafael,

this patchset is a new version of the cpufreq SW coordinated CPU bug fix.
It is basically v4 rebased on linux-pm's linux-next tree, split into 5
patches for readability, and with the bug also fixed in the "conservative"
governor.

Regards,
Fabio


Changes:
v5
- rebased/reimplemented on linux-next
- split hotplug code into a separate patch
- split ondemand/conservative specific fixes into separate patches
v4
- moved update_sampling_rate code to a separate patch
- simplified dbs_sw_coordinated_cpus
- reworked timer handling for code readability, now:
  * do_dbs_timer() [-> dbs_timer_coordinated()] -> dbs_timer_update()
- simplified cpu_callback cases
v3
- original submission


Fabio Baltieri (4):
  cpufreq: start/stop cpufreq timers on cpu hotplug
  cpufreq: ondemand: call dbs_check_cpu only when necessary
  cpufreq: conservative: call dbs_check_cpu only when necessary
  cpufreq: ondemand: use all CPUs in update_sampling_rate

Rickard Andersson (1):
  cpufreq: handle SW coordinated CPUs

 drivers/cpufreq/cpufreq_conservative.c |  46 ++++++++++++--
 drivers/cpufreq/cpufreq_governor.c     | 106 +++++++++++++++++++++++++++++++--
 drivers/cpufreq/cpufreq_governor.h     |   2 +
 drivers/cpufreq/cpufreq_ondemand.c     |  62 +++++++++++++++----
 4 files changed, 193 insertions(+), 23 deletions(-)

-- 
1.7.12.1


* Re: [PATCH v9 06/10] ata: zpodd: check zero power ready status
From: Alan Stern @ 2012-11-26 16:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Aaron Lu, Rafael J. Wysocki, Tejun Heo, Jeff Garzik, Jeff Wu,
	Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi
In-Reply-To: <1353935850.2523.52.camel@dabdike>

On Mon, 26 Nov 2012, James Bottomley wrote:

> I'm also curious about driving sleep from autopm, since mode page timers
> don't control the sleep transition.

Is it feasible to do this the other way around?  That is, to drive 
runtime suspend by noticing when the device decides to put itself into 
a low-power state?

Alan Stern



* Re: [alsa-devel] [PATCH] usb: add USB_QUIRK_RESET_RESUME for M-Audio 49
From: Alan Stern @ 2012-11-26 16:12 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Clemens Ladisch, Takashi Iwai, Jonathan Nieder,
	Steffen Müller, alsa-devel, Olivier MATZ, linux-pm,
	linux-usb, stable, David Banks, Ralf Lang
In-Reply-To: <9473152.Oq5ztpgF62@linux-lqwf.site>

On Mon, 26 Nov 2012, Oliver Neukum wrote:

> On Monday 26 November 2012 14:43:13 Clemens Ladisch wrote:
> 
> > > If it has to be running, the easiest fix would be the patch like
> > > below.  This will turn off the autopm essentially, but better than
> > > breakage.
> > >
> > > @@ -2074,6 +2077,8 @@ static void snd_usbmidi_input_start_ep(struct snd_usb_midi_in_endpoint* ep)
> > >
> > > +	ep->autopm_reference =
> > > +		usb_autopm_get_interface(ep->umidi->iface) >= 0;
> > 
> > usb_autopm_get_interface() cannot be called from the USB probe callback.
> 
> You can use usb_autopm_get_interface_no_resume()
> During probe() the device is known to not be suspended.

In fact, as far as I know you _can_ use usb_autopm_get_interface() from
within a probe routine.  Is there some problem with it I'm not aware of?

Alan Stern


* Re: [PATCHv9 1/3] Runtime Interpreted Power Sequences
From: Grant Likely @ 2012-11-26 15:34 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Alexandre Courbot, linux-fbdev@vger.kernel.org, Stephen Warren,
	Arnd Bergmann, linux-pm@vger.kernel.org,
	devicetree-discuss@lists.ozlabs.org, Mark Brown, Mark Zhang,
	Rob Herring, linux-kernel@vger.kernel.org, Alex Courbot,
	Anton Vorontsov, linux-tegra@vger.kernel.org, David Woodhouse,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <20121122214021.GA14771@avionic-0098.adnet.avionic-design.de>

On Thu, 22 Nov 2012 22:40:21 +0100, Thierry Reding <thierry.reding@avionic-design.de> wrote:
> On Thu, Nov 22, 2012 at 01:39:41PM +0000, Grant Likely wrote:
> [...]
> > I do think that each sequence should be contained within a single
> > property, but I'm open to other suggestions.
> 
> IIRC a very early prototype did implement something like that. However
> because of the resource issues this had to be string based, so that the
> sequences looked somewhat like (Alex, correct me if I'm wrong):
> 
> 	power-on = <"REGULATOR", "power", 1, "GPIO", "enable", 1>;
> 
> Instead we could possibly have something like:
> 
> 	power-on = <0 &reg 1,
> 		    1 &gpio 42 0 1>;

Yes, that would work, although I still think it would be a good idea to
split the used resources off into the gpios/pwms/regs/etc properties.

> Where the first cell in each entry defines the type (0 = regulator, 1 =
> GPIO) and the rest would be a regular OF specifier for the given type of
> resource along with some defined parameter such as enable/disable,
> voltage, delay in ms, ... I don't know if that sounds any better. It
> looks sort of cryptic but it is more "in the spirit of" DT, right Grant?

It is still kind of a ham-handed approach, but it does fit better with
existing conventions than the hierarchy of nodes does.

g.


* Re: [PULL REQUEST for Rafael] PM / devfreq: rebased on pm-devfreq
From: Rafael J. Wysocki @ 2012-11-26 15:34 UTC (permalink / raw)
  To: linux-pm; +Cc: MyungJoo Ham, linux-kernel, myungjoo.ham
In-Reply-To: <1353925903-19557-1-git-send-email-myungjoo.ham@samsung.com>

On Monday, November 26, 2012 07:31:43 PM MyungJoo Ham wrote:
> 
> Rafael,
> 
> Here goes the pull request rebased on your pm-devfreq branch.
> Sorry for all the confusion and mess invoked.
> 
> 
> Cheers,
> MyungJoo
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> The following changes since commit 1a1357ea176670867f347419c3345e2becc07338:
> 
>   PM / devfreq: make devfreq_class static (2012-11-15 00:35:06 +0100)
> 
> are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/mzx/devfreq.git tags/pull_req_20121126

OK, pulled, but I'll merge it to my linux-next branch tomorrow.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 3/6 v4] cpufreq: tolerate inexact values when collecting stats
From: Rafael J. Wysocki @ 2012-11-26 15:25 UTC (permalink / raw)
  To: linux-pm
  Cc: Mark Langsdorf, Borislav Petkov, linux-kernel@vger.kernel.org,
	cpufreq@vger.kernel.org, MyungJoo Ham
In-Reply-To: <50B3754A.50804@calxeda.com>

On Monday, November 26, 2012 07:57:30 AM Mark Langsdorf wrote:
> On 11/24/2012 04:05 AM, Rafael J. Wysocki wrote:
> > On Saturday, November 17, 2012 03:50:48 PM Borislav Petkov wrote:
> >> On Tue, Nov 13, 2012 at 02:13:38PM -0500, Mark Langsdorf wrote:
> >>> Although cpufreq_driver has a flag field, no part of cpufreq_driver
> >>> is directly passed to the cpufreq_stat code. Only cpufreq_policy
> >>> is. It's cleaner to do passes of the while loop than to copy the
> >>> cpufreq_driver.flag field into cpufreq_policy and then store it again
> >>> in cpufreq_stats.
> >>
> >> That maybe so but this newly added loop which is only Calxeda-relevant
> >> is called in cpufreq_stat_notifier_trans, which is the frequency change
> >> notifier call, AFAICT.
> 
> Drivers only go through the loop if they can't find an exact frequency.
> So every driver that isn't Calxeda shouldn't see the issue.
> 
> >> So you probably need to find a slick way of detecting calxeda hw
> >> somewhere along the init path of cpufreq_stats_init and set a
> >> hw-specific flag instead of adding that cost to each driver.
> > 
> > Mark, I suppose you'd like me to take this series for v3.8, but the above
> > comment from Boris has to be addressed for that.
> 
> I think I'd rather drop this particular patch and not have cpufreq_stat
> support for Highbank. Redesigning it to meet Boris' requirements is
> going to take more time than I currently have available.
> 
> Would it be acceptable to drop this patch and fix the issues with
> patches 4 and 6 to get the series in?

Yes, it would, but please resubmit ASAP.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* [PATCH v3 3/3] PM: Introduce Intel PowerClamp Driver
From: Jacob Pan @ 2012-11-26 15:19 UTC (permalink / raw)
  To: Linux PM, LKML
  Cc: Peter Zijlstra, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven, Paul McKenney, Jacob Pan
In-Reply-To: <1353943192-20298-1-git-send-email-jacob.jun.pan@linux.intel.com>

Intel PowerClamp driver performs synchronized idle injection across
all online CPUs. The goal is to maintain a given package level C-state
ratio.

Compared to other throttling methods that already exist in the kernel,
such as ACPI PAD (taking CPUs offline) and clock modulation, this is often
more efficient in terms of performance per watt.

Please refer to Documentation/thermal/intel_powerclamp.txt for more details.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
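For reference, the feedback step in powerclamp_adjust_controls() boils
down to two small calculations. The sketch below is user-space C under
stated assumptions, not driver code: the residency counters are taken to
advance at the TSC rate (the driver implies this by dividing one delta by
the other), the function names are hypothetical, and a zero TSC delta
returns 0 here whereas the driver keeps the previous ratio.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: percentage of the sampling window that the package
 * spent in C-states, computed from residency-counter and TSC deltas.
 * Assumes both counters tick at the same rate.
 */
static unsigned int residency_pct(uint64_t msr_last, uint64_t msr_now,
				  uint64_t tsc_last, uint64_t tsc_now)
{
	uint64_t tsc_delta = tsc_now - tsc_last;

	if (!tsc_delta)
		return 0;	/* sketch only: the driver keeps the old value */

	return (unsigned int)(100 * (msr_now - msr_last) / tsc_delta);
}

/*
 * Hypothetical helper: injection is skipped once the measured ratio
 * reaches the target plus a small guard band (1 + target/20).
 */
static bool should_skip_injection(unsigned int target_ratio,
				  unsigned int current_ratio)
{
	unsigned int guard = 1 + target_ratio / 20;

	return target_ratio + guard <= current_ratio;
}
```

With a 25% target the guard band is 2, so injection is skipped once the
measured package residency reaches 27%.
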
 Documentation/thermal/intel_powerclamp.txt |  307 +++++++++++
 drivers/thermal/Kconfig                    |   10 +
 drivers/thermal/Makefile                   |    1 +
 drivers/thermal/intel_powerclamp.c         |  766 ++++++++++++++++++++++++++++
 4 files changed, 1084 insertions(+)
 create mode 100644 Documentation/thermal/intel_powerclamp.txt
 create mode 100644 drivers/thermal/intel_powerclamp.c
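
The alignment that clamp_thread() performs with roundup(jiffies, interval)
can be tried outside the kernel. Below is a minimal user-space sketch, not
driver code: jiffies is modeled as a plain counter passed in by the caller,
and the names clamp_interval() and next_wakeup() are made up for
illustration. It assumes target_ratio + compensation is nonzero, as the
driver's division does.

```c
/* Same rounding as the kernel's roundup() macro for positive values. */
#define ROUNDUP(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

/*
 * Hypothetical helper: number of jiffies between idle injections, given
 * the injected idle duration and the idle percentage to enforce
 * (target ratio plus compensation).
 */
static unsigned long clamp_interval(unsigned long duration_jiffies,
				    unsigned int target_ratio,
				    unsigned int compensation)
{
	return duration_jiffies * 100 / (target_ratio + compensation);
}

/*
 * Hypothetical helper: next aligned wakeup time. Any thread computing
 * this with the same interval lands on the same multiple of it, no
 * matter when within the interval it asks.
 */
static unsigned long next_wakeup(unsigned long now, unsigned long interval)
{
	return ROUNDUP(now, interval);
}
```

With 6 jiffies of injected idle and a 25% target, the interval is 24
jiffies, so threads that look at jiffies 1000 and 1005 both aim for 1008,
which is what keeps the per-CPU idle periods overlapping.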

diff --git a/Documentation/thermal/intel_powerclamp.txt b/Documentation/thermal/intel_powerclamp.txt
new file mode 100644
index 0000000..332de4a
--- /dev/null
+++ b/Documentation/thermal/intel_powerclamp.txt
@@ -0,0 +1,307 @@
+			 =======================
+			 INTEL POWERCLAMP DRIVER
+			 =======================
+By: Arjan van de Ven <arjan@linux.intel.com>
+    Jacob Pan <jacob.jun.pan@linux.intel.com>
+
+Contents:
+	(*) Introduction
+	    - Goals and Objectives
+
+	(*) Theory of Operation
+	    - Idle Injection
+	    - Calibration
+
+	(*) Performance Analysis
+	    - Effectiveness and Limitations
+	    - Power vs Performance
+	    - Scalability
+	    - Calibration
+	    - Comparison with Alternative Techniques
+
+	(*) Usage and Interfaces
+	    - Generic Thermal Layer (sysfs)
+	    - Kernel APIs (TBD)
+
+============
+INTRODUCTION
+============
+
+Consider the situation where a system’s power consumption must be
+reduced at runtime, due to power budget, thermal constraint, or noise
+level, and where active cooling is not preferred. Software managed
+passive power reduction must be performed to prevent the hardware
+actions that are designed for catastrophic scenarios.
+
+Currently, P-states, T-states (clock modulation), and CPU offlining
+are used for CPU throttling.
+
+On Intel CPUs, C-states provide effective power reduction, but so far
+they’re only used opportunistically, based on workload. With the
+development of intel_powerclamp driver, the method of synchronizing
+idle injection across all online CPU threads was introduced. The goal
+is to achieve forced and controllable C-state residency.
+
+Tests and analysis have been performed in the areas of power,
+performance, scalability, and user experience. In many cases, a clear
+advantage is shown over taking the CPU offline or modulating the CPU clock.
+
+
+===================
+THEORY OF OPERATION
+===================
+
+Idle Injection
+--------------
+
+On modern Intel processors (Nehalem or later), package level C-state
+residency is available in MSRs, thus also available to the kernel.
+
+These MSRs are:
+      #define MSR_PKG_C2_RESIDENCY	0x60D
+      #define MSR_PKG_C3_RESIDENCY	0x3F8
+      #define MSR_PKG_C6_RESIDENCY	0x3F9
+      #define MSR_PKG_C7_RESIDENCY	0x3FA
+
+If the kernel can also inject idle time to the system, then a
+closed-loop control system can be established that manages package
+level C-state. The intel_powerclamp driver is conceived as such a
+control system, where the target set point is a user-selected idle
+ratio (based on power reduction), and the error is the difference
+between the actual package level C-state residency ratio and the target idle
+ratio.
+
+Injection is controlled by high priority kernel threads, spawned for
+each online CPU.
+
+These kernel threads, with SCHED_FIFO class, are created to perform
+clamping actions of controlled duty ratio and duration. Each per-CPU
+thread synchronizes its idle time and duration, based on the rounding
+of jiffies, so accumulated errors can be prevented to avoid a jittery
+effect. Threads are also bound to the CPU such that they cannot be
+migrated, unless the CPU is taken offline. In this case, threads
+belonging to the offlined CPUs will be terminated immediately.
+
+Running as SCHED_FIFO at a relatively high priority also allows this
+scheme to work for both preemptable and non-preemptable kernels.
+Alignment of idle time around jiffies ensures scalability for HZ
+values. This effect can be better visualized using a Perf timechart.
+The following diagram shows the behavior of kernel thread
+kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
+for a given "duration", then relinquishes the CPU to other tasks,
+until the next time interval.
+
+The NOHZ schedule tick is disabled during idle time, but interrupts
+are not masked. Tests show that the extra wakeups from scheduler tick
+have a dramatic impact on the effectiveness of the powerclamp driver
+on large scale systems (Westmere system with 80 processors).
+
+CPU0
+		  ____________          ____________
+kidle_inject/0   |   sleep    |  mwait |  sleep     |
+	_________|            |________|            |_______
+			       duration
+CPU1
+		  ____________          ____________
+kidle_inject/1   |   sleep    |  mwait |  sleep     |
+	_________|            |________|            |_______
+			      ^
+			      |
+			      |
+			      roundup(jiffies, interval)
+
+Only one CPU is allowed to collect statistics and update global
+control parameters. This CPU is referred to as the controlling CPU in
+this document. The controlling CPU is elected at runtime, with a
+policy that favors BSP, taking into account the possibility of a CPU
+hot-plug.
+
+In terms of dynamics of the idle control system, package level idle
+time is considered largely as a non-causal system where its behavior
+cannot be based on the past or current input. Therefore, the
+intel_powerclamp driver attempts to enforce the desired idle time
+instantly as given input (target idle ratio). After injection,
+powerclamp monitors the actual idle for a given time window and adjusts
+the next injection accordingly to avoid over/under correction.
+
+When used in a causal control system, such as a temperature control,
+it is up to the user of this driver to implement algorithms where
+past samples and outputs are included in the feedback. For example, a
+PID-based thermal controller can use the powerclamp driver to
+maintain a desired target temperature, based on integral and
+derivative gains of the past samples.
+
+
+
+Calibration
+-----------
+During scalability testing, it is observed that synchronized actions
+among CPUs become challenging as the number of cores grows. This is
+also true for the ability of a system to enter package level C-states.
+
+To make sure the intel_powerclamp driver scales well, online
+calibration is implemented. The goals for doing such a calibration
+are:
+
+a) determine the effective range of idle injection ratio
+b) determine the amount of compensation needed at each target ratio
+
+Compensation to each target ratio consists of two parts:
+
+        a) steady state error compensation
+	This is to offset the error occurring when the system can
+	enter idle without extra wakeups (such as external interrupts).
+
+	b) dynamic error compensation
+	When an excessive amount of wakeups occurs during idle, an
+	additional idle ratio can be added to quiet interrupts, by
+	slowing down CPU activities.
+
+A debugfs file is provided for the user to examine compensation
+progress and results, such as on a Westmere system.
+[jacob@nex01 ~]$ cat
+/sys/kernel/debug/intel_powerclamp/powerclamp_calib
+controlling cpu: 0
+pct confidence steady dynamic (compensation)
+0	0	0	0
+1	1	0	0
+2	1	1	0
+3	3	1	0
+4	3	1	0
+5	3	1	0
+6	3	1	0
+7	3	1	0
+8	3	1	0
+...
+30	3	2	0
+31	3	2	0
+32	3	1	0
+33	3	2	0
+34	3	1	0
+35	3	2	0
+36	3	1	0
+37	3	2	0
+38	3	1	0
+39	3	2	0
+40	3	3	0
+41	3	1	0
+42	3	2	0
+43	3	1	0
+44	3	1	0
+45	3	2	0
+46	3	3	0
+47	3	0	0
+48	3	2	0
+49	3	3	0
+
+Calibration occurs during runtime. No offline method is available.
+Steady state compensation is used only when confidence levels of all
+adjacent ratios have reached a satisfactory level. A confidence level
+is accumulated based on clean data collected at runtime. Data
+collected during a period without extra interrupts is considered
+clean.
+
+To compensate for an excessive number of wakeups during idle, additional
+idle time is injected when such a condition is detected. Currently,
+we have a simple algorithm to double the injection ratio. A possible
+enhancement might be to throttle the offending IRQ, such as delaying
+EOI for level triggered interrupts. But it is a challenge to be
+non-intrusive to the scheduler or the IRQ core code.
+
+
+CPU Online/Offline
+------------------
+Per-CPU kernel threads are started/stopped upon receiving
+notifications of CPU hotplug activities. The intel_powerclamp driver
+keeps track of clamping kernel threads, even after they are migrated
+to other CPUs, after a CPU offline event.
+
+
+=====================
+Performance Analysis
+=====================
+This section describes the general performance data collected on
+multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).
+
+Effectiveness and Limitations
+-----------------------------
+The maximum idle injection ratio is capped at 50 percent. As
+mentioned earlier, since interrupts are allowed during forced idle
+time, excessive interrupts could reduce effectiveness. The extreme
+case would be doing a ping -f to generate a flood of network
+interrupts without much CPU acknowledgement. In this
+case, little can be done from the idle injection threads. In most
+normal cases, such as scp a large file, applications can be throttled
+by the powerclamp driver, since slowing down the CPU also slows down
+network protocol processing, which in turn reduces interrupts.
+
+When control parameters are changed at runtime by the controlling CPU,
+it may take an additional period for the rest of the CPUs to catch up
+with the changes. During this time, idle injection is out of sync,
+so the package cannot enter C-states at the expected ratio. But
+this effect is minor, since in most cases the target ratio is
+updated much less frequently than the idle injection
+frequency.
+
+Scalability
+-----------
+Tests also show a minor, but measurable, difference between the 4P/8P
+Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
+More compensation is needed on Westmere for the same amount of
+target idle ratio. The compensation also increases as the idle ratio
+gets larger. This growing error is what makes the calibration
+code necessary.
+
+On the IVB 8P system, compared to an offline CPU, powerclamp can
+achieve up to 40% better performance per watt. (measured by a spin
+counter summed over per CPU counting threads spawned for all running
+CPUs).
+
+====================
+Usage and Interfaces
+====================
+The powerclamp driver is registered to the generic thermal layer as a
+cooling device. Currently, it’s not bound to any thermal zones.
+
+jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
+cur_state:0
+max_state:50
+type:intel_powerclamp
+
+Example usage:
+- To inject 25% idle time
+$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state
+"
+
+If the system is not busy and has more than 25% idle time already,
+then the powerclamp driver will not start idle injection. Using Top
+will not show idle injection kernel threads.
+
+If the system is busy (spin test below) and has less than 25% natural
+idle time, powerclamp kernel threads will do idle injection, which
+appear running to the scheduler. But the overall system idle is still
+reflected. In this example, 24.1% idle is shown. This helps the
+system admin or user determine the cause of slowdown, when a
+powerclamp driver is in action.
+
+
+Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
+Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
+Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
+Swap:  4087804k total,        0k used,  4087804k free,   945336k cached
+
+  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
+ 3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
+ 3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
+ 3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
+ 3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
+ 3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
+ 2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
+ 1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
+ 2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz
+
+Tests have shown that by using the powerclamp driver as a cooling
+device, a PID based userspace thermal controller can manage to
+control CPU temperature effectively, when no other thermal influence
+is added. For example, an UltraBook user can compile the kernel under
+a certain temperature (below most active trip points).
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index e1cb6bd..4d99c4b 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -55,3 +55,13 @@ config EXYNOS_THERMAL
 	help
 	  If you say yes here you get support for TMU (Thermal Managment
 	  Unit) on SAMSUNG EXYNOS series of SoC.
+
+config INTEL_POWERCLAMP
+	tristate "Intel PowerClamp idle injection driver"
+	depends on THERMAL
+	depends on X86
+	depends on CPU_SUP_INTEL
+	help
+	  Enable this to build the Intel PowerClamp idle injection driver. It
+	  enforces idle time, which results in more package C-state residency.
+	  The user interface is exposed via the generic thermal framework.
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 885550d..03e4479 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_CPU_THERMAL)		+= cpu_cooling.o
 obj-$(CONFIG_SPEAR_THERMAL)		+= spear_thermal.o
 obj-$(CONFIG_RCAR_THERMAL)	+= rcar_thermal.o
 obj-$(CONFIG_EXYNOS_THERMAL)		+= exynos_thermal.o
+obj-$(CONFIG_INTEL_POWERCLAMP)	+= intel_powerclamp.o
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
new file mode 100644
index 0000000..878d489
--- /dev/null
+++ b/drivers/thermal/intel_powerclamp.c
@@ -0,0 +1,766 @@
+/*
+ * intel_powerclamp.c - package c-state idle injection
+ *
+ * Copyright (c) 2012, Intel Corporation.
+ *
+ * Authors:
+ *     Arjan van de Ven <arjan@linux.intel.com>
+ *     Jacob Pan <jacob.jun.pan@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ *
+ *	TODO:
+ *           1. better handle wakeup from external interrupts, currently a fixed
+ *              compensation is added to clamping duration when excessive amount
+ *              of wakeups are observed during idle time. the reason is that in
+ *              case of external interrupts without need for ack, clamping down
+ *              cpu in non-irq context does not reduce irq. for majority of the
+ *              cases, clamping down cpu does help reduce irq as well, we should
+ *              be able to differentiate the two cases and give a quantitative
+ *              solution for the irqs that we can control. perhaps based on
+ *              get_cpu_iowait_time_us()
+ *
+ *	     2. synchronization with other hw blocks
+ *
+ *
+ */
+
+/* #define DEBUG */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/cpu.h>
+#include <linux/thermal.h>
+#include <linux/slab.h>
+#include <linux/tick.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/nmi.h>
+
+#include <asm/msr.h>
+#include <asm/mwait.h>
+#include <asm/cpu_device_id.h>
+#include <asm/idle.h>
+#include <asm/hardirq.h>
+
+#define MSR_PKG_C2_RESIDENCY		0x60D
+#define MSR_PKG_C3_RESIDENCY		0x3F8
+#define MSR_PKG_C6_RESIDENCY		0x3F9
+#define MSR_PKG_C7_RESIDENCY		0x3FA
+
+#define MAX_TARGET_RATIO (50)
+/* For each undisturbed clamping period (no extra wake ups during idle time),
+ * we increment the confidence counter for the given target ratio.
+ * CONFIDENCE_OK defines the level where runtime calibration results are
+ * valid.
+ */
+#define CONFIDENCE_OK (3)
+/* Default idle injection duration, driver adjusts sleep time to meet target
+ * idle ratio. Similar to frequency modulation.
+ */
+#define DEFAULT_DURATION_JIFFIES (6)
+
+static unsigned int target_mwait;
+static struct dentry *debug_dir;
+
+/* user selected target */
+static unsigned int set_target_ratio;
+static unsigned int current_ratio;
+static bool should_skip;
+static bool reduce_irq;
+static atomic_t idle_wakeup_counter;
+static unsigned int control_cpu; /* The cpu assigned to collect stat and update
+				  * control parameters. default to BSP but BSP
+				  * can be offlined.
+				  */
+static int clamping;
+
+
+static struct task_struct __percpu **powerclamp_thread;
+static struct thermal_cooling_device *cooling_dev;
+static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
+					   * clamping thread
+					   */
+static int duration;
+module_param(duration, int, 0600);
+MODULE_PARM_DESC(duration, "forced idle time for each attempt in msec.");
+
+static unsigned int pkg_cstate_ratio_cur;
+static unsigned int window_size;
+
+struct powerclamp_calibration_data {
+	unsigned long confidence;  /* used for calibration, basically a counter
+				    * gets incremented each time a clamping
+				    * period is completed without extra wakeups;
+				    * once that counter has reached a given level,
+				    * compensation is deemed usable.
+				    */
+	unsigned long steady_comp; /* steady state compensation used when
+				    * no extra wakeups occurred.
+				    */
+	unsigned long dynamic_comp; /* compensate excessive wakeup from idle
+				     * mostly from external interrupts.
+				     */
+};
+
+static struct powerclamp_calibration_data cal_data[MAX_TARGET_RATIO];
+
+static int window_size_set(const char *arg, const struct kernel_param *kp)
+{
+	int ret = 0;
+	unsigned long new_window_size;
+
+	ret = kstrtoul(arg, 10, &new_window_size);
+	if (ret)
+		goto exit_win;
+	if (new_window_size > 10 || new_window_size < 2) {
+		pr_err("Invalid window size %lu, between 2-10\n",
+			new_window_size);
+		return -EINVAL;
+	}
+
+	window_size = new_window_size;
+	smp_mb();
+
+exit_win:
+
+	return ret;
+}
+static struct kernel_param_ops window_size_ops = {
+	.set = window_size_set,
+	.get = param_get_int,
+};
+
+module_param_cb(window_size, &window_size_ops, &window_size, 0644);
+MODULE_PARM_DESC(window_size, "sliding window in number of clamping cycles\n"
+	"\tpowerclamp controls idle ratio within this window. larger\n"
+	"\twindow size results in slower response time but more smooth\n"
+	"\tclamping results. default to 2.");
+
+static void find_target_mwait(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+	unsigned int highest_cstate = 0;
+	unsigned int highest_subcstate = 0;
+	int i;
+
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
+	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+		return;
+
+	edx >>= MWAIT_SUBSTATE_SIZE;
+	for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
+		if (edx & MWAIT_SUBSTATE_MASK) {
+			highest_cstate = i;
+			highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
+		}
+	}
+	target_mwait = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
+		(highest_subcstate - 1);
+
+}
+
+static u64 pkg_state_counter(void)
+{
+	u64 val;
+	u64 count = 0;
+
+	static bool skip_c2;
+	static bool skip_c3;
+	static bool skip_c6;
+	static bool skip_c7;
+
+	if (!skip_c2) {
+		if (!rdmsrl_safe(MSR_PKG_C2_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c2 = true;
+	}
+
+	if (!skip_c3) {
+		if (!rdmsrl_safe(MSR_PKG_C3_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c3 = true;
+	}
+
+	if (!skip_c6) {
+		if (!rdmsrl_safe(MSR_PKG_C6_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c6 = true;
+	}
+
+	if (!skip_c7) {
+		if (!rdmsrl_safe(MSR_PKG_C7_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c7 = true;
+	}
+
+	return count;
+}
+
+static void noop_timer(unsigned long foo)
+{
+	/* empty... just the fact that we get the interrupt wakes us up */
+}
+
+static unsigned int get_compensation(int ratio)
+{
+	unsigned int comp = 0;
+
+	/* we only use compensation if all adjacent ones are good */
+	if (ratio == 1 &&
+		cal_data[ratio].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio + 1].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio + 2].confidence >= CONFIDENCE_OK) {
+		comp = (cal_data[ratio].steady_comp +
+			cal_data[ratio + 1].steady_comp +
+			cal_data[ratio + 2].steady_comp) / 3;
+	} else if (ratio == MAX_TARGET_RATIO - 1 &&
+		cal_data[ratio].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio - 1].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio - 2].confidence >= CONFIDENCE_OK) {
+		comp = (cal_data[ratio].steady_comp +
+			cal_data[ratio - 1].steady_comp +
+			cal_data[ratio - 2].steady_comp) / 3;
+	} else if (cal_data[ratio].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio - 1].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio + 1].confidence >= CONFIDENCE_OK) {
+		comp = (cal_data[ratio].steady_comp +
+			cal_data[ratio - 1].steady_comp +
+			cal_data[ratio + 1].steady_comp) / 3;
+	}
+
+	/* REVISIT: simple penalty of double idle injection */
+	if (reduce_irq)
+		comp = ratio;
+	/* do not exceed limit */
+	if (comp + ratio >= MAX_TARGET_RATIO)
+		comp = MAX_TARGET_RATIO - ratio - 1;
+
+	return comp;
+}
+
+static void adjust_compensation(int target_ratio, unsigned int win)
+{
+	int delta;
+
+	/*
+	 * adjust compensations only if confidence level has not been reached;
+	 * if there were too many wakeups during the last idle injection
+	 * period, we cannot trust the data for compensation.
+	 */
+	if (cal_data[target_ratio].confidence >= CONFIDENCE_OK ||
+		atomic_read(&idle_wakeup_counter) >
+		win * num_online_cpus())
+		return;
+
+	delta = set_target_ratio - current_ratio;
+	/* filter out bad data */
+	if (delta >= 0 && delta <= (1+target_ratio/10)) {
+		if (cal_data[target_ratio].steady_comp)
+			cal_data[target_ratio].steady_comp =
+				roundup(delta+
+					cal_data[target_ratio].steady_comp,
+					2)/2;
+		else
+			cal_data[target_ratio].steady_comp = delta;
+		cal_data[target_ratio].confidence++;
+	}
+}
+
+static bool powerclamp_adjust_controls(unsigned int target_ratio,
+				unsigned int guard, unsigned int win)
+{
+	static u64 msr_last, tsc_last;
+	u64 msr_now, tsc_now;
+
+	/* check result for the last window */
+	msr_now = pkg_state_counter();
+	rdtscll(tsc_now);
+
+	/* calculate pkg cstate vs tsc ratio */
+	if (!msr_last || !tsc_last)
+		current_ratio = 1;
+	else if (tsc_now-tsc_last)
+		current_ratio = 100*(msr_now-msr_last)/
+			(tsc_now-tsc_last);
+
+	/* update record */
+	msr_last = msr_now;
+	tsc_last = tsc_now;
+
+	adjust_compensation(target_ratio, win);
+	/*
+	 * too many external interrupts, set flag such
+	 * that we can take measure later.
+	 */
+	reduce_irq = atomic_read(&idle_wakeup_counter) >=
+		2 * win * num_online_cpus();
+
+	atomic_set(&idle_wakeup_counter, 0);
+	/* if we are above target+guard, skip */
+	return set_target_ratio + guard <= current_ratio;
+}
+
+static int clamp_thread(void *arg)
+{
+	int cpunr = (unsigned long)arg;
+	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
+	static const struct sched_param param = {
+		.sched_priority = MAX_USER_RT_PRIO/2,
+	};
+	unsigned int count = 0;
+	unsigned int target_ratio;
+
+	set_bit(cpunr, cpu_clamping_mask);
+	set_freezable();
+	init_timer_on_stack(&wakeup_timer);
+	sched_setscheduler(current, SCHED_FIFO, &param);
+
+	while (clamping && !kthread_should_stop() && cpu_online(cpunr)) {
+		int sleeptime;
+		unsigned long target_jiffies;
+		unsigned int guard;
+		unsigned int compensation = 0;
+		int interval; /* jiffies to sleep for each attempt */
+		unsigned int duration_jiffies = msecs_to_jiffies(duration);
+		unsigned int window_size_now;
+
+		try_to_freeze();
+		/*
+		 * make sure user selected ratio does not take effect until
+		 * the next round. adjust target_ratio if user has changed
+		 * target such that we can converge quickly.
+		 */
+		target_ratio = set_target_ratio;
+		guard = 1 + target_ratio/20;
+		window_size_now = window_size;
+		count++;
+
+		/*
+		 * systems may have different ability to enter package level
+		 * c-states, thus we need to compensate the injected idle ratio
+		 * to achieve the actual target reported by the HW.
+		 */
+		compensation = get_compensation(target_ratio);
+		interval = duration_jiffies*100/(target_ratio+compensation);
+
+		/* align idle time */
+		target_jiffies = roundup(jiffies, interval);
+		sleeptime = target_jiffies - jiffies;
+		if (sleeptime <= 0)
+			sleeptime = 1;
+		schedule_timeout_interruptible(sleeptime);
+		/*
+		 * only elected controlling cpu can collect stats and update
+		 * control parameters.
+		 */
+		if (cpunr == control_cpu && !(count%window_size_now)) {
+			should_skip =
+				powerclamp_adjust_controls(target_ratio,
+							guard, window_size_now);
+			smp_mb();
+		}
+
+		if (should_skip)
+			continue;
+
+		target_jiffies = jiffies + duration_jiffies;
+		mod_timer(&wakeup_timer, target_jiffies);
+		if (unlikely(local_softirq_pending()))
+			continue;
+		/*
+		 * stop tick sched during idle time, interrupts are still
+		 * allowed. thus jiffies are updated properly.
+		 */
+		preempt_disable();
+		tick_nohz_idle_enter();
+		/* mwait until target jiffies is reached */
+		while (time_before(jiffies, target_jiffies)) {
+			unsigned long ecx = 1;
+			unsigned long eax = target_mwait;
+
+			/*
+			 * REVISIT: may call enter_idle() to notify drivers who
+			 * can save power during cpu idle. same for exit_idle()
+			 */
+			local_touch_nmi();
+			stop_critical_timings();
+			__monitor((void *)&current_thread_info()->flags, 0, 0);
+			cpu_relax(); /* allow HT sibling to run */
+			__mwait(eax, ecx);
+			start_critical_timings();
+			atomic_inc(&idle_wakeup_counter);
+		}
+		tick_nohz_idle_exit();
+		preempt_enable_no_resched();
+	}
+	del_timer_sync(&wakeup_timer);
+	clear_bit(cpunr, cpu_clamping_mask);
+
+	return 0;
+}
+
+/*
+ * 1 HZ polling while clamping is active, useful for userspace
+ * to monitor actual idle ratio.
+ */
+static void poll_pkg_cstate(struct work_struct *dummy);
+static DECLARE_DELAYED_WORK(poll_pkg_cstate_work, poll_pkg_cstate);
+static void poll_pkg_cstate(struct work_struct *dummy)
+{
+	static u64 msr_last;
+	static u64 tsc_last;
+	static unsigned long jiffies_last;
+
+	u64 msr_now;
+	unsigned long jiffies_now;
+	u64 tsc_now;
+
+	msr_now = pkg_state_counter();
+	rdtscll(tsc_now);
+	jiffies_now = jiffies;
+
+	/* calculate pkg cstate vs tsc ratio */
+	if (!msr_last || !tsc_last)
+		pkg_cstate_ratio_cur = 1;
+	else {
+		if (tsc_now - tsc_last)
+			pkg_cstate_ratio_cur = 100 * (msr_now - msr_last)/
+				(tsc_now - tsc_last);
+	}
+
+	/* update record */
+	msr_last = msr_now;
+	jiffies_last = jiffies_now;
+	tsc_last = tsc_now;
+
+	if (clamping)
+		schedule_delayed_work(&poll_pkg_cstate_work, HZ);
+}
+
+static int start_power_clamp(void)
+{
+	unsigned long cpu;
+	struct task_struct *thread;
+
+	/* check if pkg cstate counter is completely 0, abort in this case */
+	if (!pkg_state_counter()) {
+		pr_err("pkg cstate counter not functional, abort\n");
+		return -EINVAL;
+	}
+
+	if (set_target_ratio > MAX_TARGET_RATIO)
+		set_target_ratio = MAX_TARGET_RATIO;
+
+	/* prevent cpu hotplug */
+	get_online_cpus();
+
+	/* prefer BSP */
+	control_cpu = 0;
+	if (!cpu_online(control_cpu))
+		control_cpu = smp_processor_id();
+
+	clamping = 1;
+	schedule_delayed_work(&poll_pkg_cstate_work, 0);
+
+	/* start one thread per online cpu */
+	for_each_online_cpu(cpu) {
+		struct task_struct **p =
+			per_cpu_ptr(powerclamp_thread, cpu);
+
+		thread = kthread_create_on_node(clamp_thread,
+						(void *) cpu,
+						cpu_to_node(cpu),
+						"kidle_inject/%lu", cpu);
+		/* bind to cpu here */
+		if (likely(!IS_ERR(thread))) {
+			kthread_bind(thread, cpu);
+			wake_up_process(thread);
+			*p = thread;
+		}
+
+	}
+	put_online_cpus();
+
+	return 0;
+}
+
+static void end_power_clamp(void)
+{
+	int i;
+	struct task_struct *thread;
+
+	clamping = 0;
+	/*
+	 * make clamping visible to other cpus and give per cpu clamping threads
+	 * some time to exit; any thread still alive is stopped below.
+	 */
+	smp_mb();
+	msleep(20);
+	if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
+		for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
+			pr_debug("clamping thread for cpu %d alive, kill\n", i);
+			thread = *per_cpu_ptr(powerclamp_thread, i);
+			kthread_stop(thread);
+		}
+	}
+}
+
+static int powerclamp_cpu_callback(struct notifier_block *nfb,
+				unsigned long action, void *hcpu)
+{
+	unsigned long cpu = (unsigned long)hcpu;
+	struct task_struct *thread;
+	struct task_struct **percpu_thread =
+		per_cpu_ptr(powerclamp_thread, cpu);
+
+	if (!clamping)
+		goto exit_ok;
+
+	switch (action) {
+	case CPU_ONLINE:
+		thread = kthread_create_on_node(clamp_thread,
+						(void *) cpu,
+						cpu_to_node(cpu),
+						"kidle_inject/%lu", cpu);
+		if (likely(!IS_ERR(thread))) {
+			kthread_bind(thread, cpu);
+			wake_up_process(thread);
+			*percpu_thread = thread;
+		}
+		/* prefer BSP as controlling CPU */
+		if (cpu == 0) {
+			control_cpu = 0;
+			smp_mb();
+		}
+		break;
+	case CPU_DEAD:
+		if (test_bit(cpu, cpu_clamping_mask)) {
+			pr_err("cpu %lu dead but powerclamping thread is not\n",
+				cpu);
+			kthread_stop(*percpu_thread);
+		}
+		if (cpu == control_cpu) {
+			control_cpu = smp_processor_id();
+			smp_mb();
+		}
+	}
+
+exit_ok:
+	return NOTIFY_OK;
+}
+
+static struct notifier_block powerclamp_cpu_notifier = {
+	.notifier_call = powerclamp_cpu_callback,
+};
+
+static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
+				 unsigned long *state)
+{
+	*state = MAX_TARGET_RATIO;
+
+	return 0;
+}
+
+static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
+				 unsigned long *state)
+{
+	if (clamping)
+		*state = pkg_cstate_ratio_cur;
+	else
+		/* to save power, do not poll idle ratio while not clamping */
+		*state = -1; /* indicates invalid state */
+
+	return 0;
+}
+
+static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
+				 unsigned long new_target_ratio)
+{
+	int ret = 0;
+
+	if (new_target_ratio >= MAX_TARGET_RATIO)
+		new_target_ratio = MAX_TARGET_RATIO - 1;
+
+	if (set_target_ratio == 0 && new_target_ratio > 0) {
+		pr_info("Start idle injection to reduce power\n");
+		set_target_ratio = new_target_ratio;
+		ret = start_power_clamp();
+		goto exit_set;
+	} else	if (set_target_ratio > 0 && new_target_ratio == 0) {
+		pr_info("Stop forced idle injection\n");
+		set_target_ratio = 0;
+		end_power_clamp();
+	} else	/* adjust currently running */ {
+		set_target_ratio = new_target_ratio;
+		/* make new set_target_ratio visible to other cpus */
+		smp_mb();
+	}
+
+exit_set:
+	return ret;
+}
+
+/* bind to generic thermal layer as cooling device */
+static struct thermal_cooling_device_ops powerclamp_cooling_ops = {
+	.get_max_state = powerclamp_get_max_state,
+	.get_cur_state = powerclamp_get_cur_state,
+	.set_cur_state = powerclamp_set_cur_state,
+};
+
+/* runs on Nehalem and later */
+static const struct x86_cpu_id intel_powerclamp_ids[] = {
+	{ X86_VENDOR_INTEL, 6, 0x1a},
+	{ X86_VENDOR_INTEL, 6, 0x1c},
+	{ X86_VENDOR_INTEL, 6, 0x1e},
+	{ X86_VENDOR_INTEL, 6, 0x1f},
+	{ X86_VENDOR_INTEL, 6, 0x25},
+	{ X86_VENDOR_INTEL, 6, 0x26},
+	{ X86_VENDOR_INTEL, 6, 0x2a},
+	{ X86_VENDOR_INTEL, 6, 0x2c},
+	{ X86_VENDOR_INTEL, 6, 0x2d},
+	{ X86_VENDOR_INTEL, 6, 0x2e},
+	{ X86_VENDOR_INTEL, 6, 0x2f},
+	{ X86_VENDOR_INTEL, 6, 0x3a},
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, intel_powerclamp_ids);
+
+static int powerclamp_probe(void)
+{
+	if (!x86_match_cpu(intel_powerclamp_ids)) {
+		pr_err("Intel powerclamp does not run on family %d model %d\n",
+				boot_cpu_data.x86, boot_cpu_data.x86_model);
+		return -ENODEV;
+	}
+	if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+		!boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ||
+		!boot_cpu_has(X86_FEATURE_MWAIT) ||
+		!boot_cpu_has(X86_FEATURE_ARAT))
+		return -ENODEV;
+
+	/* find the deepest mwait value */
+	find_target_mwait();
+
+	return 0;
+}
+
+static int powerclamp_debug_show(struct seq_file *m, void *unused)
+{
+	int i = 0;
+
+	seq_printf(m, "controlling cpu: %d\n", control_cpu);
+	seq_printf(m, "pct confidence steady dynamic (compensation)\n");
+	for (i = 0; i < MAX_TARGET_RATIO; i++) {
+		seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
+			i,
+			cal_data[i].confidence,
+			cal_data[i].steady_comp,
+			cal_data[i].dynamic_comp);
+	}
+
+	return 0;
+}
+
+static int powerclamp_debug_open(struct inode *inode,
+			struct file *file)
+{
+	return single_open(file, powerclamp_debug_show, inode->i_private);
+}
+
+static const struct file_operations powerclamp_debug_fops = {
+	.open		= powerclamp_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.owner		= THIS_MODULE,
+};
+
+static inline void powerclamp_create_debug_files(void)
+{
+	debug_dir = debugfs_create_dir("intel_powerclamp", NULL);
+	if (!debug_dir)
+		return;
+
+	if (!debugfs_create_file("powerclamp_calib", S_IRUGO, debug_dir,
+					cal_data, &powerclamp_debug_fops))
+		goto file_error;
+
+	return;
+
+file_error:
+	debugfs_remove_recursive(debug_dir);
+}
+
+static int powerclamp_init(void)
+{
+	int retval;
+	int bitmap_size;
+
+	bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
+	cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
+	if (!cpu_clamping_mask)
+		return -ENOMEM;
+
+	/* probe cpu features and ids here */
+	retval = powerclamp_probe();
+	if (retval)
+		return retval;
+	/* set default limit, may be adjusted at runtime based on feedback */
+	window_size = 2;
+	register_hotcpu_notifier(&powerclamp_cpu_notifier);
+	powerclamp_thread = alloc_percpu(struct task_struct *);
+	cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
+						&powerclamp_cooling_ops);
+	if (IS_ERR(cooling_dev))
+		return -ENODEV;
+
+	if (!duration)
+		duration = jiffies_to_msecs(DEFAULT_DURATION_JIFFIES);
+	powerclamp_create_debug_files();
+
+	return 0;
+}
+module_init(powerclamp_init);
+
+static void powerclamp_exit(void)
+{
+	unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
+	end_power_clamp();
+	free_percpu(powerclamp_thread);
+	thermal_cooling_device_unregister(cooling_dev);
+	kfree(cpu_clamping_mask);
+
+	cancel_delayed_work_sync(&poll_pkg_cstate_work);
+	debugfs_remove_recursive(debug_dir);
+}
+module_exit(powerclamp_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Arjan van de Ven <arjan@linux.intel.com>");
+MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@linux.intel.com>");
+MODULE_DESCRIPTION("Package Level C-state Idle Injection for Intel CPUs");
-- 
1.7.9.5

* [PATCH v3 2/3] x86/nmi: export local_touch_nmi() symbol for modules
From: Jacob Pan @ 2012-11-26 15:19 UTC (permalink / raw)
  To: Linux PM, LKML
  Cc: Peter Zijlstra, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven, Paul McKenney, Jacob Pan
In-Reply-To: <1353943192-20298-1-git-send-email-jacob.jun.pan@linux.intel.com>

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/kernel/nmi.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index f84f5c5..6030805 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -509,3 +509,4 @@ void local_touch_nmi(void)
 {
 	__this_cpu_write(last_nmi_rip, 0);
 }
+EXPORT_SYMBOL_GPL(local_touch_nmi);
-- 
1.7.9.5

* [PATCH v3 1/3] tick: export nohz tick idle symbols for module use
From: Jacob Pan @ 2012-11-26 15:19 UTC (permalink / raw)
  To: Linux PM, LKML
  Cc: Peter Zijlstra, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven, Paul McKenney, Jacob Pan
In-Reply-To: <1353943192-20298-1-git-send-email-jacob.jun.pan@linux.intel.com>

Allow drivers such as intel_powerclamp to use these apis for
turning on/off ticks during idle.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 kernel/time/tick-sched.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..7c38f08 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -510,6 +510,7 @@ void tick_nohz_idle_enter(void)
 
 	local_irq_enable();
 }
+EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
 
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -634,6 +635,7 @@ void tick_nohz_idle_exit(void)
 
 	local_irq_enable();
 }
+EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
 
 static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
 {
-- 
1.7.9.5

* [PATCH v2 3/3] PM: Introduce Intel PowerClamp Driver
From: Jacob Pan @ 2012-11-26 14:37 UTC (permalink / raw)
  To: Linux PM, LKML
  Cc: Peter Zijlstra, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven, Paul McKenney, Jacob Pan
In-Reply-To: <1353940633-20084-1-git-send-email-jacob.jun.pan@linux.intel.com>

Intel PowerClamp driver performs synchronized idle injection across
all online CPUs. The goal is to maintain a given package level C-state
ratio.
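
As a rough illustration of that goal (a userspace model, not code from
this patch), the measured ratio is the package C-state residency delta
divided by the TSC delta, and injection is skipped once the measured
ratio reaches the target plus a small guard band. The function names
below are hypothetical; the arithmetic follows
powerclamp_adjust_controls() and clamp_thread() in the diff:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the ratio arithmetic; not part of the patch.
 * pkg_idle_ratio() and should_skip_injection() are hypothetical names.
 */

/* Percentage of TSC cycles spent in package C-states between two samples. */
static unsigned int pkg_idle_ratio(uint64_t msr_last, uint64_t msr_now,
				   uint64_t tsc_last, uint64_t tsc_now)
{
	assert(tsc_now > tsc_last);	/* caller must supply a non-empty window */
	return (unsigned int)(100 * (msr_now - msr_last) / (tsc_now - tsc_last));
}

/* Skip the next injection when measured idle already meets target + guard. */
static bool should_skip_injection(unsigned int target_ratio,
				  unsigned int current_ratio)
{
	unsigned int guard = 1 + target_ratio / 20;

	return target_ratio + guard <= current_ratio;
}
```

For a 25% target the guard is 2, so injection is skipped once the
measured residency reaches 27%.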

Compared to other throttling methods that already exist in the kernel,
such as ACPI PAD (taking CPUs offline) and clock modulation, this is often
more efficient in terms of performance per watt.

Please refer to Documentation/thermal/intel_powerclamp.txt for more details.
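
For a quick feel of the timing before reading that document, the sketch
below models in userspace how clamp_thread() derives the injection
interval from the target ratio and compensation, and how sleep time is
aligned to an interval boundary so that all CPUs idle together. It is
only an approximation under the assumption that jiffies arithmetic can
be treated as plain integers; the function names are hypothetical:

```c
#include <assert.h>

/*
 * Userspace model of the timing arithmetic; not part of the patch.
 * Function names are hypothetical.
 */

/* Jiffies between the starts of consecutive injections. */
static unsigned long injection_interval(unsigned long duration_jiffies,
					unsigned int target_ratio,
					unsigned int compensation)
{
	assert(target_ratio + compensation > 0);
	return duration_jiffies * 100 / (target_ratio + compensation);
}

/* Jiffies to sleep so that every CPU wakes on the same interval boundary. */
static long aligned_sleeptime(unsigned long now, unsigned long interval)
{
	unsigned long target;
	long sleeptime;

	assert(interval > 0);
	target = (now + interval - 1) / interval * interval;	/* roundup() */
	sleeptime = (long)(target - now);
	return sleeptime > 0 ? sleeptime : 1;
}
```

With a duration of 6 jiffies (DEFAULT_DURATION_JIFFIES) and a 25%
target, the interval is 24 jiffies, so each CPU idles for 6 of every 24
jiffies; 5 points of compensation shorten the interval to 20 jiffies.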

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 Documentation/thermal/intel_powerclamp.txt |  307 +++++++++++
 drivers/thermal/Kconfig                    |   10 +
 drivers/thermal/Makefile                   |    1 +
 drivers/thermal/intel_powerclamp.c         |  766 ++++++++++++++++++++++++++++
 4 files changed, 1084 insertions(+)
 create mode 100644 Documentation/thermal/intel_powerclamp.txt
 create mode 100644 drivers/thermal/intel_powerclamp.c

diff --git a/Documentation/thermal/intel_powerclamp.txt b/Documentation/thermal/intel_powerclamp.txt
new file mode 100644
index 0000000..332de4a
--- /dev/null
+++ b/Documentation/thermal/intel_powerclamp.txt
@@ -0,0 +1,307 @@
+			 =======================
+			 INTEL POWERCLAMP DRIVER
+			 =======================
+By: Arjan van de Ven <arjan@linux.intel.com>
+    Jacob Pan <jacob.jun.pan@linux.intel.com>
+
+Contents:
+	(*) Introduction
+	    - Goals and Objectives
+
+	(*) Theory of Operation
+	    - Idle Injection
+	    - Calibration
+
+	(*) Performance Analysis
+	    - Effectiveness and Limitations
+	    - Power vs Performance
+	    - Scalability
+	    - Calibration
+	    - Comparison with Alternative Techniques
+
+	(*) Usage and Interfaces
+	    - Generic Thermal Layer (sysfs)
+	    - Kernel APIs (TBD)
+
+============
+INTRODUCTION
+============
+
+Consider the situation where a system’s power consumption must be
+reduced at runtime, due to power budget, thermal constraint, or noise
+level, and where active cooling is not preferred. Software managed
+passive power reduction must be performed to prevent the hardware
+actions that are designed for catastrophic scenarios.
+
+Currently, P-states, T-states (clock modulation), and CPU offlining
+are used for CPU throttling.
+
+On Intel CPUs, C-states provide effective power reduction, but so far
+they’re only used opportunistically, based on workload. The
+intel_powerclamp driver introduces a method of synchronizing idle
+injection across all online CPU threads. The goal is to achieve
+forced and controllable C-state residency.
+
+Tests and analysis have been done in the areas of power, performance,
+scalability, and user experience. In many cases, a clear advantage is
+shown over taking the CPU offline or modulating the CPU clock.
+
+
+===================
+THEORY OF OPERATION
+===================
+
+Idle Injection
+--------------
+
+On modern Intel processors (Nehalem or later), package level C-state
+residency is available in MSRs, thus also available to the kernel.
+
+These MSRs are:
+      #define MSR_PKG_C2_RESIDENCY	0x60D
+      #define MSR_PKG_C3_RESIDENCY	0x3F8
+      #define MSR_PKG_C6_RESIDENCY	0x3F9
+      #define MSR_PKG_C7_RESIDENCY	0x3FA
+
+If the kernel can also inject idle time to the system, then a
+closed-loop control system can be established that manages package
+level C-state. The intel_powerclamp driver is conceived as such a
+control system, where the target set point is a user-selected idle
+ratio (based on power reduction), and the error is the difference
+between the actual package level C-state residency ratio and the target idle
+ratio.
+
+Injection is controlled by high priority kernel threads, spawned for
+each online CPU.
+
+These kernel threads, with SCHED_FIFO class, are created to perform
+clamping actions of controlled duty ratio and duration. Each per-CPU
+thread synchronizes its idle time and duration, based on the rounding
+of jiffies, so accumulated errors can be prevented to avoid a jittery
+effect. Threads are also bound to the CPU such that they cannot be
+migrated, unless the CPU is taken offline. In this case, threads
+belonging to the offlined CPUs will be terminated immediately.
+
+Running as SCHED_FIFO at a relatively high priority also allows this
+scheme to work for both preemptible and non-preemptible kernels.
+Alignment of idle time around jiffies ensures scalability for HZ
+values. This effect can be better visualized using a Perf timechart.
+The following diagram shows the behavior of kernel thread
+kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
+for a given "duration", then relinquishes the CPU to other tasks,
+until the next time interval.
+
+The NOHZ schedule tick is disabled during idle time, but interrupts
+are not masked. Tests show that the extra wakeups from scheduler tick
+have a dramatic impact on the effectiveness of the powerclamp driver
+on large scale systems (Westmere system with 80 processors).
+
+CPU0
+		  ____________          ____________
+kidle_inject/0   |   sleep    |  mwait |  sleep     |
+	_________|            |________|            |_______
+			       duration
+CPU1
+		  ____________          ____________
+kidle_inject/1   |   sleep    |  mwait |  sleep     |
+	_________|            |________|            |_______
+			      ^
+			      |
+			      |
+			      roundup(jiffies, interval)
+
+Only one CPU is allowed to collect statistics and update global
+control parameters. This CPU is referred to as the controlling CPU in
+this document. The controlling CPU is elected at runtime, with a
+policy that favors BSP, taking into account the possibility of a CPU
+hot-plug.
+
+In terms of dynamics of the idle control system, package level idle
+time is considered largely as a non-causal system where its behavior
+cannot be based on the past or current input. Therefore, the
+intel_powerclamp driver attempts to enforce the desired idle time
+instantly as given input (target idle ratio). After injection,
+powerclamp monitors the actual idle for a given time window and adjusts
+the next injection accordingly to avoid over/under correction.
+
+When used in a causal control system, such as a temperature control,
+it is up to the user of this driver to implement algorithms where
+past samples and outputs are included in the feedback. For example, a
+PID-based thermal controller can use the powerclamp driver to
+maintain a desired target temperature, based on integral and
+derivative gains of the past samples.
+
+
+
+Calibration
+-----------
+During scalability testing, it is observed that synchronized actions
+among CPUs become challenging as the number of cores grows. This is
+also true for the ability of a system to enter package level C-states.
+
+To make sure the intel_powerclamp driver scales well, online
+calibration is implemented. The goals for doing such a calibration
+are:
+
+a) determine the effective range of idle injection ratio
+b) determine the amount of compensation needed at each target ratio
+
+Compensation to each target ratio consists of two parts:
+
+	a) steady state error compensation
+	This is to offset the error occurring when the system can
+	enter idle without extra wakeups (such as external interrupts).
+
+	b) dynamic error compensation
+	When an excessive amount of wakeups occurs during idle, an
+	additional idle ratio can be added to quiet interrupts, by
+	slowing down CPU activities.
+
+A debugfs file is provided for the user to examine compensation
+progress and results. Example output from a Westmere system:
+
+[jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
+controlling cpu: 0
+pct confidence steady dynamic (compensation)
+0	0	0	0
+1	1	0	0
+2	1	1	0
+3	3	1	0
+4	3	1	0
+5	3	1	0
+6	3	1	0
+7	3	1	0
+8	3	1	0
+...
+30	3	2	0
+31	3	2	0
+32	3	1	0
+33	3	2	0
+34	3	1	0
+35	3	2	0
+36	3	1	0
+37	3	2	0
+38	3	1	0
+39	3	2	0
+40	3	3	0
+41	3	1	0
+42	3	2	0
+43	3	1	0
+44	3	1	0
+45	3	2	0
+46	3	3	0
+47	3	0	0
+48	3	2	0
+49	3	3	0
+
+Calibration occurs during runtime. No offline method is available.
+Steady state compensation is used only when confidence levels of all
+adjacent ratios have reached a satisfactory level. A confidence level
+is accumulated based on clean data collected at runtime. Data
+collected during a period without extra interrupts is considered
+clean.
+
+To compensate for an excessive number of wakeups during idle, additional
+idle time is injected when such a condition is detected. Currently,
+we have a simple algorithm to double the injection ratio. A possible
+enhancement might be to throttle the offending IRQ, such as delaying
+EOI for level triggered interrupts. But it is a challenge to be
+non-intrusive to the scheduler or the IRQ core code.
+
+
+CPU Online/Offline
+------------------
+Per-CPU kernel threads are started/stopped upon receiving
+notifications of CPU hotplug activities. The intel_powerclamp driver
+keeps track of clamping kernel threads, even after they are migrated
+to other CPUs, after a CPU offline event.
+
+
+====================
+PERFORMANCE ANALYSIS
+====================
+This section describes the general performance data collected on
+multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).
+
+Effectiveness and Limitations
+-----------------------------
+The maximum idle injection ratio is capped at 50 percent. As
+mentioned earlier, since interrupts are allowed during forced idle
+time, excessive interrupts could reduce effectiveness. The extreme
+case would be doing a ping -f to generate flooded network interrupts
+without much CPU acknowledgement. In this case, little can be done
+from the idle injection threads. In most normal cases, such as
+copying a large file with scp, applications can be throttled by the
+powerclamp driver, since slowing down the CPU also slows down
+network protocol processing, which in turn reduces interrupts.
+
+When the controlling CPU changes control parameters at runtime, it
+may take an additional period for the rest of the CPUs to catch up
+with the changes. During this time, idle injection is out of sync,
+so the package cannot enter C-states at the expected ratio. But
+this effect is minor, because in most cases the target ratio is
+updated much less frequently than the idle injection
+frequency.
+
+Scalability
+-----------
+Tests also show a minor, but measurable, difference between the 4P/8P
+Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
+More compensation is needed on Westmere for the same target idle
+ratio. The compensation also increases as the idle ratio gets
+larger. These differences are why the calibration code is
+needed.
+
+On the IVB 8P system, compared to taking a CPU offline, powerclamp
+can achieve up to 40% better performance per watt (measured by a spin
+counter summed over per-CPU counting threads spawned for all running
+CPUs).
+
+====================
+USAGE AND INTERFACES
+====================
+The powerclamp driver is registered to the generic thermal layer as a
+cooling device. Currently, it’s not bound to any thermal zones.
+
+jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
+cur_state:0
+max_state:50
+type:intel_powerclamp
+
+Example usage:
+- To inject 25% idle time
+$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"
+
+
+If the system is not busy and has more than 25% idle time already,
+then the powerclamp driver will not start idle injection. In that
+case, top will not show idle injection kernel threads.
+
+If the system is busy (spin test below) and has less than 25% natural
+idle time, powerclamp kernel threads will do idle injection, and they
+appear as running to the scheduler. But the overall system idle is
+still reflected. In this example, 24.1% idle is shown. This helps the
+system admin or user determine the cause of a slowdown when the
+powerclamp driver is in action.
+
+
+Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
+Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
+Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
+Swap:  4087804k total,        0k used,  4087804k free,   945336k cached
+
+  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
+ 3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
+ 3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
+ 3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
+ 3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
+ 3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
+ 2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
+ 1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
+ 2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz
+
+Tests have shown that by using the powerclamp driver as a cooling
+device, a PID-based userspace thermal controller can manage to
+control CPU temperature effectively, when no other thermal influence
+is added. For example, an UltraBook user can compile the kernel while
+staying under a certain temperature (below most active trip points).
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index e1cb6bd..4d99c4b 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -55,3 +55,13 @@ config EXYNOS_THERMAL
 	help
 	  If you say yes here you get support for TMU (Thermal Managment
 	  Unit) on SAMSUNG EXYNOS series of SoC.
+
+config INTEL_POWERCLAMP
+	tristate "Intel PowerClamp idle injection driver"
+	depends on THERMAL
+	depends on X86
+	depends on CPU_SUP_INTEL
+	help
+	  Enable this to build the Intel PowerClamp idle injection driver. It
+	  enforces idle time, which results in more package C-state residency.
+	  The user interface is exposed via the generic thermal framework.
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 885550d..03e4479 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_CPU_THERMAL)		+= cpu_cooling.o
 obj-$(CONFIG_SPEAR_THERMAL)		+= spear_thermal.o
 obj-$(CONFIG_RCAR_THERMAL)	+= rcar_thermal.o
 obj-$(CONFIG_EXYNOS_THERMAL)		+= exynos_thermal.o
+obj-$(CONFIG_INTEL_POWERCLAMP)	+= intel_powerclamp.o
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
new file mode 100644
index 0000000..cc9fa17
--- /dev/null
+++ b/drivers/thermal/intel_powerclamp.c
@@ -0,0 +1,766 @@
+/*
+ * intel_powerclamp.c - package c-state idle injection
+ *
+ * Copyright (c) 2012, Intel Corporation.
+ *
+ * Authors:
+ *     Arjan van de Ven <arjan@linux.intel.com>
+ *     Jacob Pan <jacob.jun.pan@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ *
+ *	TODO:
+ *           1. better handle wakeup from external interrupts, currently a fixed
+ *              compensation is added to clamping duration when excessive amount
+ *              of wakeups are observed during idle time. the reason is that in
+ *              case of external interrupts without need for ack, clamping down
+ *              cpu in non-irq context does not reduce irq. for majority of the
+ *              cases, clamping down cpu does help reduce irq as well, we should
+ *              be able to differentiate the two cases and give a quantitative
+ *              solution for the irqs that we can control. perhaps based on
+ *              get_cpu_iowait_time_us()
+ *
+ *	     2. synchronization with other hw blocks
+ *
+ *
+ */
+
+/* #define DEBUG */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/cpu.h>
+#include <linux/thermal.h>
+#include <linux/slab.h>
+#include <linux/tick.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/nmi.h>
+
+#include <asm/msr.h>
+#include <asm/mwait.h>
+#include <asm/cpu_device_id.h>
+#include <asm/idle.h>
+#include <asm/hardirq.h>
+
+#define MSR_PKG_C2_RESIDENCY		0x60D
+#define MSR_PKG_C3_RESIDENCY		0x3F8
+#define MSR_PKG_C6_RESIDENCY		0x3F9
+#define MSR_PKG_C7_RESIDENCY		0x3FA
+
+#define MAX_TARGET_RATIO (50)
+/* For each undisturbed clamping period (no extra wake ups during idle time),
+ * we increment the confidence counter for the given target ratio.
+ * CONFIDENCE_OK defines the level where runtime calibration results are
+ * valid.
+ */
+#define CONFIDENCE_OK (3)
+/* Default idle injection duration; the driver adjusts sleep time to meet
+ * the target idle ratio. Similar to frequency modulation.
+ */
+#define DEFAULT_DURATION_JIFFIES (6)
+
+static unsigned int target_mwait;
+static struct dentry *debug_dir;
+
+/* user selected target */
+static unsigned int set_target_ratio;
+static unsigned int current_ratio;
+static bool should_skip;
+static bool reduce_irq;
+static atomic_t idle_wakeup_counter;
+static unsigned int control_cpu; /* The cpu assigned to collect stat and update
+				  * control parameters. default to BSP but BSP
+				  * can be offlined.
+				  */
+static int clamping;
+
+
+static struct task_struct __percpu **powerclamp_thread;
+static struct thermal_cooling_device *cooling_dev;
+static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
+					   * clamping thread
+					   */
+static int duration;
+module_param(duration, int, 0600);
+MODULE_PARM_DESC(duration, "forced idle time for each attempt in msec.");
+
+static unsigned int pkg_cstate_ratio_cur;
+static unsigned int window_size;
+
+struct powerclamp_calibration_data {
+	unsigned long confidence;  /* used for calibration, basically a counter
+				    * that gets incremented each time a clamping
+				    * period is completed without extra wakeups.
+				    * once that counter reaches the given level,
+				    * compensation is deemed usable.
+				    */
+	unsigned long steady_comp; /* steady state compensation used when
+				    * no extra wakeups occurred.
+				    */
+	unsigned long dynamic_comp; /* compensate excessive wakeup from idle
+				     * mostly from external interrupts.
+				     */
+};
+
+static struct powerclamp_calibration_data cal_data[MAX_TARGET_RATIO];
+
+static int window_size_set(const char *arg, const struct kernel_param *kp)
+{
+	int ret = 0;
+	unsigned long new_window_size;
+
+	ret = kstrtoul(arg, 10, &new_window_size);
+	if (ret)
+		goto exit_win;
+	if (new_window_size >= 10 || new_window_size < 2) {
+		pr_err("PowerClamp: invalid window size %lu, between 2-10\n",
+			new_window_size);
+		ret = -EINVAL;
+		/* do not store a value that was just rejected */
+		goto exit_win;
+	}
+
+	window_size = new_window_size;
+	smp_mb();
+
+exit_win:
+
+	return ret;
+}
+static struct kernel_param_ops window_size_ops = {
+	.set = window_size_set,
+	.get = param_get_int,
+};
+
+module_param_cb(window_size, &window_size_ops, &window_size, 0644);
+MODULE_PARM_DESC(window_size, "sliding window in number of clamping cycles\n"
+	"\tpowerclamp controls idle ratio within this window. larger\n"
+	"\twindow size results in slower response time but more smooth\n"
+	"\tclamping results. default to 2.");
+
+static void find_target_mwait(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+	unsigned int highest_cstate = 0;
+	unsigned int highest_subcstate = 0;
+	int i;
+
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
+	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+		return;
+
+	edx >>= MWAIT_SUBSTATE_SIZE;
+	for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
+		if (edx & MWAIT_SUBSTATE_MASK) {
+			highest_cstate = i;
+			highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
+		}
+	}
+	target_mwait = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
+		(highest_subcstate - 1);
+
+}
+
+static u64 pkg_state_counter(void)
+{
+	u64 val;
+	u64 count = 0;
+
+	static int skip_c2;
+	static int skip_c3;
+	static int skip_c6;
+	static int skip_c7;
+
+	if (!skip_c2) {
+		if (!rdmsrl_safe(MSR_PKG_C2_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c2 = 1;
+	}
+
+	if (!skip_c3) {
+		if (!rdmsrl_safe(MSR_PKG_C3_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c3 = 1;
+	}
+
+	if (!skip_c6) {
+		if (!rdmsrl_safe(MSR_PKG_C6_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c6 = 1;
+	}
+
+	if (!skip_c7) {
+		if (!rdmsrl_safe(MSR_PKG_C7_RESIDENCY, &val))
+			count += val;
+		else
+			skip_c7 = 1;
+	}
+
+	return count;
+}
+
+static void noop_timer(unsigned long foo)
+{
+	/* empty... just the fact that we get the interrupt wakes us up */
+}
+
+static unsigned int get_compensation(int ratio)
+{
+	unsigned int comp = 0;
+
+	/* we only use compensation if all adjacent ones are good */
+	if (ratio == 1 &&
+		cal_data[ratio].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio + 1].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio + 2].confidence >= CONFIDENCE_OK) {
+		comp = (cal_data[ratio].steady_comp +
+			cal_data[ratio + 1].steady_comp +
+			cal_data[ratio + 2].steady_comp) / 3;
+	} else if (ratio == MAX_TARGET_RATIO - 1 &&
+		cal_data[ratio].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio - 1].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio - 2].confidence >= CONFIDENCE_OK) {
+		comp = (cal_data[ratio].steady_comp +
+			cal_data[ratio - 1].steady_comp +
+			cal_data[ratio - 2].steady_comp) / 3;
+	} else if (cal_data[ratio].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio - 1].confidence >= CONFIDENCE_OK &&
+		cal_data[ratio + 1].confidence >= CONFIDENCE_OK) {
+		comp = (cal_data[ratio].steady_comp +
+			cal_data[ratio - 1].steady_comp +
+			cal_data[ratio + 1].steady_comp) / 3;
+	}
+
+	/* REVISIT: simple penalty of double idle injection */
+	if (reduce_irq)
+		comp = ratio;
+	/* do not exceed limit */
+	if (comp + ratio >= MAX_TARGET_RATIO)
+		comp = MAX_TARGET_RATIO - ratio - 1;
+
+	return comp;
+}
+
+static void adjust_compensation(int target_ratio, unsigned int win)
+{
+	int delta;
+
+	/*
+	 * adjust compensations only if the confidence level has not been
+	 * reached yet. if there were too many wakeups during the last idle
+	 * injection period, we cannot trust the data for compensation.
+	 */
+	if (cal_data[target_ratio].confidence >= CONFIDENCE_OK ||
+		atomic_read(&idle_wakeup_counter) >
+		win * num_online_cpus())
+		return;
+
+	delta = set_target_ratio - current_ratio;
+	/* filter out bad data */
+	if (delta >= 0 && delta <= (1+target_ratio/10)) {
+		if (cal_data[target_ratio].steady_comp)
+			cal_data[target_ratio].steady_comp =
+				roundup(delta+
+					cal_data[target_ratio].steady_comp,
+					2)/2;
+		else
+			cal_data[target_ratio].steady_comp = delta;
+		cal_data[target_ratio].confidence++;
+	}
+}
+
+static bool powerclamp_adjust_controls(unsigned int target_ratio,
+				unsigned int guard, unsigned int win)
+{
+	static u64 msr_last, tsc_last;
+	u64 msr_now, tsc_now;
+
+	/* check result for the last window */
+	msr_now = pkg_state_counter();
+	rdtscll(tsc_now);
+
+	/* calculate pkg cstate vs tsc ratio */
+	if (!msr_last || !tsc_last)
+		current_ratio = 1;
+	else if (tsc_now-tsc_last)
+		current_ratio = 100*(msr_now-msr_last)/
+			(tsc_now-tsc_last);
+
+	/* update record */
+	msr_last = msr_now;
+	tsc_last = tsc_now;
+
+	adjust_compensation(target_ratio, win);
+	/*
+	 * too many external interrupts, set flag such
+	 * that we can take measures later.
+	 */
+	reduce_irq = atomic_read(&idle_wakeup_counter) >=
+		2 * win * num_online_cpus();
+
+	atomic_set(&idle_wakeup_counter, 0);
+	/* if we are above target+guard, skip */
+	return set_target_ratio + guard <= current_ratio;
+}
+
+static int clamp_thread(void *arg)
+{
+	int cpunr = (unsigned long)arg;
+	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
+	static const struct sched_param param = {
+		.sched_priority = MAX_USER_RT_PRIO/2,
+	};
+	unsigned int count = 0;
+	unsigned int target_ratio;
+
+	set_bit(cpunr, cpu_clamping_mask);
+	set_freezable();
+	init_timer_on_stack(&wakeup_timer);
+	sched_setscheduler(current, SCHED_FIFO, &param);
+
+	while (clamping && !kthread_should_stop() && cpu_online(cpunr)) {
+		int sleeptime;
+		unsigned long target_jiffies;
+		unsigned int guard;
+		unsigned int compensation = 0;
+		int interval; /* jiffies to sleep for each attempt */
+		unsigned int duration_jiffies = msecs_to_jiffies(duration);
+		unsigned int window_size_now;
+
+		try_to_freeze();
+		/*
+		 * make sure user selected ratio does not take effect until
+		 * the next round. adjust target_ratio if user has changed
+		 * target such that we can converge quickly.
+		 */
+		target_ratio = set_target_ratio;
+		guard = 1 + target_ratio/20;
+		window_size_now = window_size;
+		count++;
+
+		/*
+		 * systems may have different ability to enter package level
+		 * c-states, thus we need to compensate the injected idle ratio
+		 * to achieve the actual target reported by the HW.
+		 */
+		compensation = get_compensation(target_ratio);
+		interval = duration_jiffies*100/(target_ratio+compensation);
+
+		/* align idle time */
+		target_jiffies = roundup(jiffies, interval);
+		sleeptime = target_jiffies - jiffies;
+		if (sleeptime <= 0)
+			sleeptime = 1;
+		schedule_timeout_interruptible(sleeptime);
+		/*
+		 * only elected controlling cpu can collect stats and update
+		 * control parameters.
+		 */
+		if (cpunr == control_cpu && !(count%window_size_now)) {
+			should_skip =
+				powerclamp_adjust_controls(target_ratio,
+							guard, window_size_now);
+			smp_mb();
+		}
+
+		if (should_skip)
+			continue;
+
+		target_jiffies = jiffies + duration_jiffies;
+		mod_timer(&wakeup_timer, target_jiffies);
+		if (unlikely(local_softirq_pending()))
+			continue;
+		/*
+		 * stop tick sched during idle time, interrupts are still
+		 * allowed. thus jiffies are updated properly.
+		 */
+		preempt_disable();
+		tick_nohz_idle_enter();
+		/* mwait until target jiffies is reached */
+		while (time_before(jiffies, target_jiffies)) {
+			unsigned long ecx = 1;
+			unsigned long eax = target_mwait;
+
+			/*
+			 * REVISIT: may call enter_idle() to notify drivers who
+			 * can save power during cpu idle. same for exit_idle()
+			 */
+			local_touch_nmi();
+			stop_critical_timings();
+			__monitor((void *)&current_thread_info()->flags, 0, 0);
+			cpu_relax(); /* allow HT sibling to run */
+			__mwait(eax, ecx);
+			start_critical_timings();
+			atomic_inc(&idle_wakeup_counter);
+		}
+		tick_nohz_idle_exit();
+		preempt_enable_no_resched();
+	}
+	del_timer_sync(&wakeup_timer);
+	clear_bit(cpunr, cpu_clamping_mask);
+
+	return 0;
+}
+
+/*
+ * 1 HZ polling while clamping is active, useful for userspace
+ * to monitor actual idle ratio.
+ */
+static void poll_pkg_cstate(struct work_struct *dummy);
+static DECLARE_DELAYED_WORK(poll_pkg_cstate_work, poll_pkg_cstate);
+static void poll_pkg_cstate(struct work_struct *dummy)
+{
+	static u64 msr_last;
+	static u64 tsc_last;
+	static unsigned long jiffies_last;
+
+	u64 msr_now;
+	unsigned long jiffies_now;
+	u64 tsc_now;
+
+	msr_now = pkg_state_counter();
+	rdtscll(tsc_now);
+	jiffies_now = jiffies;
+
+	/* calculate pkg cstate vs tsc ratio */
+	if (!msr_last || !tsc_last)
+		pkg_cstate_ratio_cur = 1;
+	else {
+		if (tsc_now - tsc_last)
+			pkg_cstate_ratio_cur = 100 * (msr_now - msr_last)/
+				(tsc_now - tsc_last);
+	}
+
+	/* update record */
+	msr_last = msr_now;
+	jiffies_last = jiffies_now;
+	tsc_last = tsc_now;
+
+	if (clamping)
+		schedule_delayed_work(&poll_pkg_cstate_work, HZ);
+}
+
+static int start_power_clamp(void)
+{
+	unsigned long cpu;
+	struct task_struct *thread;
+
+	/* check if pkg cstate counter is completely 0, abort in this case */
+	if (!pkg_state_counter()) {
+		pr_err("pkg cstate counter not functional, abort\n");
+		return -EINVAL;
+	}
+
+	if (set_target_ratio > MAX_TARGET_RATIO)
+		set_target_ratio = MAX_TARGET_RATIO;
+
+	/* prevent cpu hotplug */
+	get_online_cpus();
+
+	/* prefer BSP */
+	control_cpu = 0;
+	if (!cpu_online(control_cpu))
+		control_cpu = smp_processor_id();
+
+	clamping = 1;
+	schedule_delayed_work(&poll_pkg_cstate_work, 0);
+
+	/* start one thread per online cpu */
+	for_each_online_cpu(cpu) {
+		struct task_struct **p =
+			per_cpu_ptr(powerclamp_thread, cpu);
+
+		thread = kthread_create_on_node(clamp_thread,
+						(void *) cpu,
+						cpu_to_node(cpu),
+						"kidle_inject/%lu", cpu);
+		/* bind to cpu here */
+		if (likely(!IS_ERR(thread))) {
+			kthread_bind(thread, cpu);
+			wake_up_process(thread);
+			*p = thread;
+		}
+
+	}
+	put_online_cpus();
+
+	return 0;
+}
+
+static void end_power_clamp(void)
+{
+	int i;
+	struct task_struct *thread;
+
+	clamping = 0;
+	/*
+	 * make clamping visible to other cpus and give per cpu clamping threads
+	 * some time to exit; threads still alive afterwards are stopped below.
+	 */
+	smp_mb();
+	msleep(20);
+	if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
+		for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
+			pr_debug("clamping thread for cpu %d alive, kill\n", i);
+			thread = *per_cpu_ptr(powerclamp_thread, i);
+			kthread_stop(thread);
+		}
+	}
+}
+
+static int powerclamp_cpu_callback(struct notifier_block *nfb,
+				unsigned long action, void *hcpu)
+{
+	unsigned long cpu = (unsigned long)hcpu;
+	struct task_struct *thread;
+	struct task_struct **percpu_thread =
+		per_cpu_ptr(powerclamp_thread, cpu);
+
+	if (!clamping)
+		goto exit_ok;
+
+	switch (action) {
+	case CPU_ONLINE:
+		thread = kthread_create_on_node(clamp_thread,
+						(void *) cpu,
+						cpu_to_node(cpu),
+						"kidle_inject/%lu", cpu);
+		if (likely(!IS_ERR(thread))) {
+			kthread_bind(thread, cpu);
+			wake_up_process(thread);
+			*percpu_thread = thread;
+		}
+		/* prefer BSP as controlling CPU */
+		if (cpu == 0) {
+			control_cpu = 0;
+			smp_mb();
+		}
+		break;
+	case CPU_DEAD:
+		if (test_bit(cpu, cpu_clamping_mask)) {
+			pr_err("cpu %lu dead but powerclamping thread is not\n",
+				cpu);
+			kthread_stop(*percpu_thread);
+		}
+		if (cpu == control_cpu) {
+			control_cpu = smp_processor_id();
+			smp_mb();
+		}
+	}
+
+exit_ok:
+	return NOTIFY_OK;
+}
+
+static struct notifier_block powerclamp_cpu_notifier = {
+	.notifier_call = powerclamp_cpu_callback,
+};
+
+static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
+				 unsigned long *state)
+{
+	*state = MAX_TARGET_RATIO;
+
+	return 0;
+}
+
+static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
+				 unsigned long *state)
+{
+	if (clamping)
+		*state = pkg_cstate_ratio_cur;
+	else
+		/* to save power, do not poll idle ratio while not clamping */
+		*state = -1; /* indicates invalid state */
+
+	return 0;
+}
+
+static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
+				 unsigned long new_target_ratio)
+{
+	int ret = 0;
+
+	if (new_target_ratio >= MAX_TARGET_RATIO)
+		new_target_ratio = MAX_TARGET_RATIO - 1;
+
+	if (set_target_ratio == 0 && new_target_ratio > 0) {
+		pr_info("Start idle injection to reduce power\n");
+		set_target_ratio = new_target_ratio;
+		ret = start_power_clamp();
+		goto exit_set;
+	} else	if (set_target_ratio > 0 && new_target_ratio == 0) {
+		pr_info("Stop forced idle injection\n");
+		set_target_ratio = 0;
+		end_power_clamp();
+	} else	/* adjust currently running */ {
+		set_target_ratio = new_target_ratio;
+		/* make new set_target_ratio visible to other cpus */
+		smp_mb();
+	}
+
+exit_set:
+	return ret;
+}
+
+/* bind to generic thermal layer as cooling device */
+static struct thermal_cooling_device_ops powerclamp_cooling_ops = {
+	.get_max_state = powerclamp_get_max_state,
+	.get_cur_state = powerclamp_get_cur_state,
+	.set_cur_state = powerclamp_set_cur_state,
+};
+
+/* runs on Nehalem and later */
+static const struct x86_cpu_id intel_powerclamp_ids[] = {
+	{ X86_VENDOR_INTEL, 6, 0x1a},
+	{ X86_VENDOR_INTEL, 6, 0x1c},
+	{ X86_VENDOR_INTEL, 6, 0x1e},
+	{ X86_VENDOR_INTEL, 6, 0x1f},
+	{ X86_VENDOR_INTEL, 6, 0x25},
+	{ X86_VENDOR_INTEL, 6, 0x26},
+	{ X86_VENDOR_INTEL, 6, 0x2a},
+	{ X86_VENDOR_INTEL, 6, 0x2c},
+	{ X86_VENDOR_INTEL, 6, 0x2d},
+	{ X86_VENDOR_INTEL, 6, 0x2e},
+	{ X86_VENDOR_INTEL, 6, 0x2f},
+	{ X86_VENDOR_INTEL, 6, 0x3a},
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, intel_powerclamp_ids);
+
+static int powerclamp_probe(void)
+{
+	if (!x86_match_cpu(intel_powerclamp_ids)) {
+		pr_err("Intel powerclamp does not run on family %d model %d\n",
+				boot_cpu_data.x86, boot_cpu_data.x86_model);
+		return -ENODEV;
+	}
+	if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+		!boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ||
+		!boot_cpu_has(X86_FEATURE_MWAIT) ||
+		!boot_cpu_has(X86_FEATURE_ARAT))
+		return -ENODEV;
+
+	/* find the deepest mwait value */
+	find_target_mwait();
+
+	return 0;
+}
+
+static int powerclamp_debug_show(struct seq_file *m, void *unused)
+{
+	int i = 0;
+
+	seq_printf(m, "controlling cpu: %d\n", control_cpu);
+	seq_printf(m, "pct confidence steady dynamic (compensation)\n");
+	for (i = 0; i < MAX_TARGET_RATIO; i++) {
+		seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
+			i,
+			cal_data[i].confidence,
+			cal_data[i].steady_comp,
+			cal_data[i].dynamic_comp);
+	}
+
+	return 0;
+}
+
+static int powerclamp_debug_open(struct inode *inode,
+			struct file *file)
+{
+	return single_open(file, powerclamp_debug_show, inode->i_private);
+}
+
+static const struct file_operations powerclamp_debug_fops = {
+	.open		= powerclamp_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.owner		= THIS_MODULE,
+};
+
+static inline void powerclamp_create_debug_files(void)
+{
+	debug_dir = debugfs_create_dir("intel_powerclamp", NULL);
+	if (!debug_dir)
+		return;
+
+	if (!debugfs_create_file("powerclamp_calib", S_IRUGO, debug_dir,
+					cal_data, &powerclamp_debug_fops))
+		goto file_error;
+
+	return;
+
+file_error:
+	debugfs_remove_recursive(debug_dir);
+}
+
+static int powerclamp_init(void)
+{
+	int retval;
+	int bitmap_size;
+
+	bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
+	cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
+	if (!cpu_clamping_mask)
+		return -ENOMEM;
+
+	/* probe cpu features and ids here */
+	retval = powerclamp_probe();
+	if (retval) {
+		/* do not leak the bitmap allocated above */
+		kfree(cpu_clamping_mask);
+		return retval;
+	}
+	/* set default limit, may be adjusted during runtime based on feedback */
+	window_size = 2;
+	register_hotcpu_notifier(&powerclamp_cpu_notifier);
+	powerclamp_thread = alloc_percpu(struct task_struct *);
+	cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
+						&powerclamp_cooling_ops);
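+	if (IS_ERR(cooling_dev)) {
+		/* undo the registrations and allocations made so far */
+		unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
+		free_percpu(powerclamp_thread);
+		kfree(cpu_clamping_mask);
+		return -ENODEV;
+	}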
+
+	if (!duration)
+		duration = jiffies_to_msecs(DEFAULT_DURATION_JIFFIES);
+	powerclamp_create_debug_files();
+
+	return 0;
+}
+module_init(powerclamp_init);
+
+static void powerclamp_exit(void)
+{
+	unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
+	end_power_clamp();
+	free_percpu(powerclamp_thread);
+	thermal_cooling_device_unregister(cooling_dev);
+	kfree(cpu_clamping_mask);
+
+	cancel_delayed_work_sync(&poll_pkg_cstate_work);
+	debugfs_remove_recursive(debug_dir);
+}
+module_exit(powerclamp_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Arjan van de Ven <arjan@linux.intel.com>");
+MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@linux.intel.com>");
+MODULE_DESCRIPTION("Package Level C-state Idle Injection for Intel CPUs");
-- 
1.7.9.5
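
For readers following the feedback loop in powerclamp_adjust_controls() and poll_pkg_cstate() above: the measured idle ratio is the package C-state residency delta expressed as a percentage of the TSC delta. Below is a minimal userspace sketch of that arithmetic, assuming the residency counters tick at the TSC rate. The name pkg_idle_ratio() and its fallback parameter are illustrative only and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage of elapsed TSC cycles spent in package C-states between two
 * samples. Hypothetical helper: mirrors the "100 * msr delta / tsc delta"
 * expression in the patch, but takes the samples as plain arguments.
 * Returns `fallback` when no TSC cycles elapsed, so there is no division
 * by zero.
 */
unsigned int pkg_idle_ratio(uint64_t msr_last, uint64_t msr_now,
			    uint64_t tsc_last, uint64_t tsc_now,
			    unsigned int fallback)
{
	uint64_t tsc_delta;

	/* the sketch assumes a monotonic, non-wrapping TSC */
	assert(tsc_now >= tsc_last);

	tsc_delta = tsc_now - tsc_last;
	if (!tsc_delta)
		return fallback;

	return (unsigned int)(100 * (msr_now - msr_last) / tsc_delta);
}
```

With residency samples of 1000 and 1250 taken 1000 TSC cycles apart, the helper reports 25, i.e. the package was idle for a quarter of the window.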

^ permalink raw reply related

* [PATCH v2 2/3] x86/nmi: export local_touch_nmi() symbol for modules
From: Jacob Pan @ 2012-11-26 14:37 UTC (permalink / raw)
  To: Linux PM, LKML
  Cc: Peter Zijlstra, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven, Paul McKenney, Jacob Pan
In-Reply-To: <1353940633-20084-1-git-send-email-jacob.jun.pan@linux.intel.com>

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/kernel/nmi.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index f84f5c5..6030805 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -509,3 +509,4 @@ void local_touch_nmi(void)
 {
 	__this_cpu_write(last_nmi_rip, 0);
 }
+EXPORT_SYMBOL_GPL(local_touch_nmi);
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 1/3] tick: export nohz tick idle symbols for module use
From: Jacob Pan @ 2012-11-26 14:37 UTC (permalink / raw)
  To: Linux PM, LKML
  Cc: Peter Zijlstra, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven, Paul McKenney, Jacob Pan
In-Reply-To: <1353940633-20084-1-git-send-email-jacob.jun.pan@linux.intel.com>

Allow drivers such as intel_powerclamp to use these APIs for
turning ticks on/off during idle.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 kernel/time/tick-sched.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..7c38f08 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -510,6 +510,7 @@ void tick_nohz_idle_enter(void)
 
 	local_irq_enable();
 }
+EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
 
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -634,6 +635,7 @@ void tick_nohz_idle_exit(void)
 
 	local_irq_enable();
 }
+EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
 
 static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
 {
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 0/3]  pm: Intel powerclamp driver
From: Jacob Pan @ 2012-11-26 14:37 UTC (permalink / raw)
  To: Linux PM, LKML
  Cc: Peter Zijlstra, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven, Paul McKenney, Jacob Pan

Minor syntax change since v1.

We have done some experiments with idle injection on Intel platforms.
The idea is to use the increasingly power efficient package level
C-states for power capping and passive thermal control.

Documentation is included in the patch to explain the theory of
operation, performance implication, calibration, scalability, and user
interface. Please refer to the following file for more details.

Documentation/thermal/intel_powerclamp.txt
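
As a rough illustration of the duty-cycle arithmetic used by the clamping thread in patch 3/3: each round injects a fixed idle duration, and the period between rounds is stretched so that the idle share matches the target ratio plus compensation. The sketch below is a userspace rendering under that reading. clamp_interval() and clamp_sleeptime() are hypothetical names, and the zero-percent guard is an addition of this sketch rather than something the driver does.

```c
#include <assert.h>

/*
 * Length of one clamping period in jiffies, such that `duration_jiffies`
 * of idle time makes up (target_ratio + compensation) percent of it.
 * Returns 0 when the requested percentage is 0 (guard added for the
 * sketch; the driver only runs this with a non-zero target).
 */
unsigned long clamp_interval(unsigned int duration_jiffies,
			     unsigned int target_ratio,
			     unsigned int compensation)
{
	unsigned int pct = target_ratio + compensation;

	if (!pct)
		return 0;

	return duration_jiffies * 100UL / pct;
}

/*
 * Jiffies to sleep so that the next idle injection starts on a multiple
 * of `interval`, which keeps all per-cpu threads aligned. Sleeps at
 * least one jiffy, as the thread in the patch does.
 */
unsigned long clamp_sleeptime(unsigned long now, unsigned long interval)
{
	unsigned long target, sleep;

	assert(interval > 0);

	/* round `now` up to the next multiple of `interval` */
	target = ((now + interval - 1) / interval) * interval;
	sleep = target - now;

	return sleep ? sleep : 1;
}
```

For example, 6 jiffies of forced idle with a 25% target and 5% compensation gives a 20-jiffy period; a thread waking at jiffy 1003 would then sleep 17 jiffies to start its next injection at jiffy 1020.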

Arjan van de Ven created the original idea and driver. I have been
refining the driver in the hope that it can be useful beyond a proof
of concept.

Jacob Pan (3):
  tick: export nohz tick idle symbols for module use
  x86/nmi: export local_touch_nmi() symbol for modules
  PM: Introduce Intel PowerClamp Driver

 Documentation/thermal/intel_powerclamp.txt |  307 +++++++++++
 arch/x86/kernel/nmi.c                      |    1 +
 drivers/thermal/Kconfig                    |   10 +
 drivers/thermal/Makefile                   |    1 +
 drivers/thermal/intel_powerclamp.c         |  766 ++++++++++++++++++++++++++++
 kernel/time/tick-sched.c                   |    2 +
 6 files changed, 1087 insertions(+)
 create mode 100644 Documentation/thermal/intel_powerclamp.txt
 create mode 100644 drivers/thermal/intel_powerclamp.c

-- 
1.7.9.5

^ permalink raw reply

* Re: [alsa-devel] [PATCH] usb: add USB_QUIRK_RESET_RESUME for M-Audio 49
From: Takashi Iwai @ 2012-11-26 13:58 UTC (permalink / raw)
  To: Clemens Ladisch
  Cc: Jonathan Nieder, Steffen Müller, alsa-devel, Olivier MATZ,
	linux-pm, linux-usb, Oliver Neukum, stable, David Banks,
	Ralf Lang
In-Reply-To: <50B371F1.70508@ladisch.de>

At Mon, 26 Nov 2012 14:43:13 +0100,
Clemens Ladisch wrote:
> 
> Takashi Iwai wrote:
> > Clemens Ladisch wrote:
> >> Takashi Iwai wrote:
> >>> Clemens Ladisch wrote:
> >>>> I'm working on a fix that adds proper power management for input ports,
> >>>> but this requires the driver to be reorganized a little ...
> >>>
> >>> Doesn't a simple patch like below work?
> >>
> >>> +static int substream_open(struct snd_rawmidi_substream *substream, int open)
> >>>  {
> >>> +	if (open && umidi->opened++ == 0) {
> >>> +		err = usb_autopm_get_interface(umidi->iface);
> >>>
> >>>  static int snd_usbmidi_input_open(struct snd_rawmidi_substream *substream)
> >>>  {
> >>> +	return substream_open(substream, 1);
> >>
> >> No, because the input URBs are submitted before the userspace device is
> >> opened.
> >
> > Ah, right.  What's the reason of submitting input urbs for the all
> > time from the beginning?  For loopback?
> 
> For not needing to count open input ports.

So we've kept the URBs running all the time just to avoid refcounting?
That's bad.

> > If it has to be running, the easiest fix would be the patch like
> > below.  This will turn off the autopm essentially, but better than
> > breakage.
> >
> > @@ -2074,6 +2077,8 @@ static void snd_usbmidi_input_start_ep(struct snd_usb_midi_in_endpoint* ep)
> >
> > +	ep->autopm_reference =
> > +		usb_autopm_get_interface(ep->umidi->iface) >= 0;
> 
> usb_autopm_get_interface() cannot be called from the USB probe callback.

OK, it can call no_resume version by passing an argument.

But judging from your comment above, we should indeed fix the
free-running MIDI input URBs first, not only for autopm but in general.


Takashi

^ permalink raw reply

* Re: [PATCH 3/6 v4] cpufreq: tolerate inexact values when collecting stats
From: Mark Langsdorf @ 2012-11-26 13:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org,
	cpufreq@vger.kernel.org, linux-pm@vger.kernel.org, MyungJoo Ham
In-Reply-To: <3756540.1nyvqzWibd@vostro.rjw.lan>

On 11/24/2012 04:05 AM, Rafael J. Wysocki wrote:
> On Saturday, November 17, 2012 03:50:48 PM Borislav Petkov wrote:
>> On Tue, Nov 13, 2012 at 02:13:38PM -0500, Mark Langsdorf wrote:
>>> Although cpufreq_driver has a flag field, no part of cpufreq_driver
>>> is directly passed to the cpufreq_stat code. Only cpufreq_policy
>>> is. It's cleaner to do passes of the while loop than to copy the
>>> cpufreq_driver.flag field into cpufreq_policy and then store it again
>>> in cpufreq_stats.
>>
>> That may be so but this newly added loop which is only Calxeda-relevant
>> is called in cpufreq_stat_notifier_trans, which is the frequency change
>> notifier call, AFAICT.

Drivers only go through the loop if they can't find an exact frequency.
So every driver that isn't Calxeda shouldn't see the issue.
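
To make the cost argument concrete, here is a minimal sketch of a two-pass lookup of the kind described above: an exact-match pass first, and a tolerant pass that only runs when the exact pass fails. This is an illustration under assumptions, not the code from the patch under review; freq_index_tolerant() and its tolerance parameter are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of `freq` in `table`, or -1 if nothing matches.
 * Hypothetical helper: frequencies and tolerance share one unit
 * (for example kHz).
 */
int freq_index_tolerant(const unsigned int *table, size_t n,
			unsigned int freq, unsigned int tolerance)
{
	size_t i;

	assert(table != NULL || n == 0);

	/* fast path: exact match, the only pass most drivers ever take */
	for (i = 0; i < n; i++)
		if (table[i] == freq)
			return (int)i;

	/* slow path: reached only when the reported frequency is inexact */
	for (i = 0; i < n; i++) {
		unsigned int diff = table[i] > freq ? table[i] - freq
						    : freq - table[i];

		if (diff <= tolerance)
			return (int)i;
	}

	return -1;
}
```

A driver whose hardware always reports table values returns from the first loop, so the second loop adds cost only on a miss.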

>> So you probably need to find a slick way of detecting calxeda hw
>> somewhere along the init path of cpufreq_stats_init and set a
>> hw-specific flag instead of adding that cost to each driver.
> 
> Mark, I suppose you'd like me to take this series for v3.8, but the above
> comment from Boris has to be addressed for that.

I think I'd rather drop this particular patch and not have cpufreq_stat
support for Highbank. Redesigning it to meet Boris' requirements is
going to take more time than I currently have available.

Would it be acceptable to drop this patch and fix the issues with
patches 4 and 6 to get the series in?

--Mark Langsdorf
Calxeda, Inc.


^ permalink raw reply

* Re: [alsa-devel] [PATCH] usb: add USB_QUIRK_RESET_RESUME for M-Audio 49
From: Oliver Neukum @ 2012-11-26 13:54 UTC (permalink / raw)
  To: Clemens Ladisch
  Cc: Takashi Iwai, Jonathan Nieder, Steffen Müller, alsa-devel,
	Olivier MATZ, linux-pm, linux-usb, stable, David Banks, Ralf Lang
In-Reply-To: <50B371F1.70508@ladisch.de>

On Monday 26 November 2012 14:43:13 Clemens Ladisch wrote:

> > If it has to be running, the easiest fix would be the patch like
> > below.  This will turn off the autopm essentially, but better than
> > breakage.
> >
> > @@ -2074,6 +2077,8 @@ static void snd_usbmidi_input_start_ep(struct snd_usb_midi_in_endpoint* ep)
> >
> > +	ep->autopm_reference =
> > +		usb_autopm_get_interface(ep->umidi->iface) >= 0;
> 
> usb_autopm_get_interface() cannot be called from the USB probe callback.

You can use usb_autopm_get_interface_no_resume().
During probe() the device is known not to be suspended.

	Regards
		Oliver

^ permalink raw reply

* Re: [alsa-devel] [PATCH] usb: add USB_QUIRK_RESET_RESUME for M-Audio 49
From: Clemens Ladisch @ 2012-11-26 13:43 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: Jonathan Nieder, Steffen Müller,
	alsa-devel-K7yf7f+aM1XWsZ/bQMPhNw, Olivier MATZ,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	Oliver Neukum, stable-u79uwXL29TY76Z2rM5mHXA, David Banks,
	Ralf Lang
In-Reply-To: <s5hwqx8h5w1.wl%tiwai-l3A5Bk7waGM@public.gmane.org>

Takashi Iwai wrote:
> Clemens Ladisch wrote:
>> Takashi Iwai wrote:
>>> Clemens Ladisch wrote:
>>>> I'm working on a fix that adds proper power management for input ports,
>>>> but this requires the driver to be reorganized a little ...
>>>
>>> Doesn't a simple patch like below work?
>>
>>> +static int substream_open(struct snd_rawmidi_substream *substream, int open)
>>>  {
>>> +	if (open && umidi->opened++ == 0) {
>>> +		err = usb_autopm_get_interface(umidi->iface);
>>>
>>>  static int snd_usbmidi_input_open(struct snd_rawmidi_substream *substream)
>>>  {
>>> +	return substream_open(substream, 1);
>>
>> No, because the input URBs are submitted before the userspace device is
>> opened.
>
> Ah, right.  What's the reason of submitting input urbs for the all
> time from the beginning?  For loopback?

For not needing to count open input ports.

> If it has to be running, the easiest fix would be the patch like
> below.  This will turn off the autopm essentially, but better than
> breakage.
>
> @@ -2074,6 +2077,8 @@ static void snd_usbmidi_input_start_ep(struct snd_usb_midi_in_endpoint* ep)
>
> +	ep->autopm_reference =
> +		usb_autopm_get_interface(ep->umidi->iface) >= 0;

usb_autopm_get_interface() cannot be called from the USB probe callback.


Regards,
Clemens

^ permalink raw reply

* Re: [PATCH v9 06/10] ata: zpodd: check zero power ready status
From: James Bottomley @ 2012-11-26 13:17 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Rafael J. Wysocki, Tejun Heo, Jeff Garzik, Alan Stern, Jeff Wu,
	Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi
In-Reply-To: <50B327FB.3020701@intel.com>

On Mon, 2012-11-26 at 16:27 +0800, Aaron Lu wrote:
> Well, ZPREADY is not a power state that we can program the ODD to
> enter (figure 234 and table 323 of the SPEC). It serves more as
> information provided by the ODD to the host, so that the host does not
> need to do TUR and then examine the sense code to decide if zero power
> ready status is satisfied, but can simply query the ODD whether its
> current power state is ZPREADY. So it's not that we program the device
> to go into ZPREADY power state and the ODD's power will be omitted.
> 
> The benefit of a ZPREADY capable ODD is that, when we need to decide if
> the ODD is in a zero power ready status, we can simply query the ODD by
> issuing a GET_EVENT_STATUS_NOTIFICATION and check the returned power
> class events; if it is in ZPREADY power state, then we can omit the
> power. To support ZPREADY, we just need some change to the zpready
> function, which currently uses the sense code to check ZP ready status.
> 
> So this is my understanding of ZPREADY, and I don't see it as a totally
> different thing from ZPODD; it just changes how the host senses the
> zero power ready status. But if I am wrong, please kindly let me know,
> thanks.

My understanding is that a ZPREADY device may be capable of internal
power down, meaning it doesn't necessarily need the host to omit the
power.  It depends on what the difference between Sleep and Off is
(they're deliberately left as implementation defined in the standard, Ch
16, but the conditions of sleep are pretty onerous, so it sounds like
most of the mechanics are powered down).

However, if you want to work it similarly to ZPODD, then the timeout
automatically transitions the device to ZPREADY, the device issues an
event, and we trap the event at the low level and omit power.

I'm also curious about driving sleep from autopm, since mode page timers
don't control the sleep transition.

James



^ permalink raw reply

