linux-pm.vger.kernel.org archive mirror
* [PATCH 0/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
@ 2016-04-12 18:06 Akshay Adiga
  2016-04-12 18:06 ` [PATCH 1/2] cpufreq: powernv: Remove flag use-case of policy->driver_data Akshay Adiga
  2016-04-12 18:06 ` [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate Akshay Adiga
  0 siblings, 2 replies; 8+ messages in thread
From: Akshay Adiga @ 2016-04-12 18:06 UTC (permalink / raw)
  To: rjw, viresh.kumar, linux-pm, linuxppc-dev; +Cc: ego, Akshay Adiga

The frequency transition latency from pmin to pmax is observed to be on the
order of a few milliseconds, so sudden frequency ramp-up requests usually
pay a performance penalty while the transition completes.

This patch set addresses the problem by using a chip-level entity called the
"global pstate". The global pstate manages elements shared across the
dependent core chiplets; typically, the shared element is the voltage
setting. By holding the global pstate higher than the local pstate for some
time (~5 seconds), subsequent ramp-ups can be made faster.

Patch (1/2) removes the flag use of cpufreq_policy->driver_data, so that the
field can be used for tracking global pstates.

Patch (2/2) adds code for global pstate management.
- Iozone results with this patch set show improvements in almost all cases.
- A YCSB workload on redis with various target operations per second shows
  better MaxLatency with this patch.

Akshay Adiga (1):
  powernv: Ramp-down global pstate slower than local-pstate

Shilpasri G Bhat (1):
  cpufreq: powernv: Remove flag use-case of policy->driver_data

 drivers/cpufreq/powernv-cpufreq.c | 250 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 242 insertions(+), 8 deletions(-)

-- 
2.5.5


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] cpufreq: powernv: Remove flag use-case of policy->driver_data
  2016-04-12 18:06 [PATCH 0/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate Akshay Adiga
@ 2016-04-12 18:06 ` Akshay Adiga
  2016-04-12 18:06 ` [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate Akshay Adiga
  1 sibling, 0 replies; 8+ messages in thread
From: Akshay Adiga @ 2016-04-12 18:06 UTC (permalink / raw)
  To: rjw, viresh.kumar, linux-pm, linuxppc-dev
  Cc: ego, Shilpasri G Bhat, Akshay Adiga

From: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>

Commit 1b0289848d5d ("cpufreq: powernv: Add sysfs attributes to show
throttle stats") used policy->driver_data as a flag for one-time creation
of the throttle sysfs files. Instead, use kernfs_find_and_get() to check
whether the attribute already exists. This is required because
policy->driver_data is used for other purposes in a later patch.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
---
 drivers/cpufreq/powernv-cpufreq.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 39ac78c..e2e2219 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -455,13 +455,15 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
 static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 {
 	int base, i;
+	struct kernfs_node *kn;
 
 	base = cpu_first_thread_sibling(policy->cpu);
 
 	for (i = 0; i < threads_per_core; i++)
 		cpumask_set_cpu(base + i, policy->cpus);
 
-	if (!policy->driver_data) {
+	kn = kernfs_find_and_get(policy->kobj.sd, throttle_attr_grp.name);
+	if (!kn) {
 		int ret;
 
 		ret = sysfs_create_group(&policy->kobj, &throttle_attr_grp);
@@ -470,11 +472,8 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 				policy->cpu);
 			return ret;
 		}
-		/*
-		 * policy->driver_data is used as a flag for one-time
-		 * creation of throttle sysfs files.
-		 */
-		policy->driver_data = policy;
+	} else {
+		kernfs_put(kn);
 	}
 	return cpufreq_table_validate_and_show(policy, powernv_freqs);
 }
-- 
2.5.5



* [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
  2016-04-12 18:06 [PATCH 0/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate Akshay Adiga
  2016-04-12 18:06 ` [PATCH 1/2] cpufreq: powernv: Remove flag use-case of policy->driver_data Akshay Adiga
@ 2016-04-12 18:06 ` Akshay Adiga
  2016-04-13  5:03   ` Viresh Kumar
  2016-04-14  5:40   ` Balbir Singh
  1 sibling, 2 replies; 8+ messages in thread
From: Akshay Adiga @ 2016-04-12 18:06 UTC (permalink / raw)
  To: rjw, viresh.kumar, linux-pm, linuxppc-dev; +Cc: ego, Akshay Adiga

This patch brings down the global pstate at a slower rate than the local
pstate. The frequency transition latency from pmin to pmax is observed to
be on the order of a few milliseconds, which imposes a performance penalty
during sudden frequency ramp-ups. Holding the global pstate higher than the
local pstate therefore makes subsequent ramp-ups faster.

A per-policy structure is maintained to keep track of the global and local
pstate changes. The global pstate is brought down using a parabolic
equation. The ramp-down time to pmin is set to ~5 seconds (5120 ms). To
make sure that the global pstate is dropped at regular intervals, a timer
is queued every 2 seconds, which eventually brings the pstate down to the
local pstate.

Iozone results show a fairly consistent performance boost.
YCSB on redis shows improved max latencies in most cases.

Iozone write/rewrite tests were run with file sizes of 200704 KB and
401408 KB at various record sizes. The following table shows I/O
operations per second with and without the patch.

Iozone results (in ops/sec, mean over 3 iterations)
----------------------------------------------------------------------
filesize-recordsize-IOtype	with patch	without patch	% change
----------------------------------------------------------------------
200704-1-SeqWrite		1616532 	1615425 	0.06
200704-1-Rewrite		2423195  	2303130 	5.21
200704-2-SeqWrite		1628577 	1602620 	1.61
200704-2-Rewrite		2428264  	2312154 	5.02
200704-4-SeqWrite		1617605 	1617182 	0.02
200704-4-Rewrite		2430524  	2351238 	3.37
200704-8-SeqWrite		1629478 	1600436 	1.81
200704-8-Rewrite		2415308  	2298136 	5.09
200704-16-SeqWrite		1619632 	1618250 	0.08
200704-16-Rewrite		2396650 	2352591 	1.87
200704-32-SeqWrite		1632544 	1598083 	2.15
200704-32-Rewrite		2425119 	2329743 	4.09
200704-64-SeqWrite		1617812 	1617235 	0.03
200704-64-Rewrite		2402021 	2321080 	3.48
200704-128-SeqWrite		1631998 	1600256 	1.98
200704-128-Rewrite		2422389		2304954 	5.09
200704-256-SeqWrite		1617065		1616962 	0.00
200704-256-Rewrite		2432539 	2301980 	5.67
200704-512-SeqWrite		1632599 	1598656 	2.12
200704-512-Rewrite		2429270		2323676 	4.54
200704-1024-SeqWrite		1618758		1616156 	0.16
200704-1024-Rewrite		2431631		2315889 	4.99
401408-1-SeqWrite		1631479  	1608132 	1.45
401408-1-Rewrite		2501550  	2459409 	1.71
401408-2-SeqWrite		1617095 	1626069 	-0.55
401408-2-Rewrite		2507557  	2443621 	2.61
401408-4-SeqWrite		1629601 	1611869		1.10
401408-4-Rewrite		2505909  	2462098 	1.77
401408-8-SeqWrite  		1617110 	1626968 	-0.60
401408-8-Rewrite		2512244  	2456827 	2.25
401408-16-SeqWrite		1632609 	1609603 	1.42
401408-16-Rewrite		2500792 	2451405 	2.01
401408-32-SeqWrite		1619294 	1628167 	-0.54
401408-32-Rewrite		2510115		2451292 	2.39
401408-64-SeqWrite		1632709		1603746 	1.80
401408-64-Rewrite		2506692 	2433186 	3.02
401408-128-SeqWrite		1619284		1627461 	-0.50
401408-128-Rewrite		2518698 	2453361 	2.66
401408-256-SeqWrite		1634022 	1610681 	1.44
401408-256-Rewrite		2509987 	2446328 	2.60
401408-512-SeqWrite 		1617524 	1628016 	-0.64
401408-512-Rewrite		2504409 	2442899 	2.51
401408-1024-SeqWrite		1629812 	1611566 	1.13
401408-1024-Rewrite 		2507620		 2442968 	2.64

Tested with the YCSB workloada over redis with 1 million records and
1 million operations. Each test was carried out with a fixed target
operations-per-second rate and with persistence disabled.

Max latency (in us, mean over 5 iterations)
------------------------------------------------------------
op/s	Operation	with patch	without patch	% change
------------------------------------------------------------
15000	Read		61480.6		50261.4		22.32
15000	cleanup		215.2		293.6		-26.70
15000	update		25666.2		25163.8		2.00

25000	Read		32626.2		89525.4		-63.56
25000	cleanup		292.2		263.0		11.10
25000	update		32293.4		90255.0		-64.22

35000	Read		34783.0		33119.0		5.02
35000	cleanup		321.2		395.8		-18.8
35000	update		36047.0		38747.8		-6.97

40000	Read		38562.2		42357.4		-8.96
40000	cleanup		371.8		384.6		-3.33
40000	update		27861.4		41547.8		-32.94

45000	Read		42271.0		88120.6		-52.03
45000	cleanup		263.6		383.0		-31.17
45000	update		29755.8		81359.0		-63.43

(test without target op/s)
47659	Read		83061.4		136440.6	-39.12
47659	cleanup		195.8		193.8		1.03
47659	update		73429.4		124971.8	-41.24

Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpufreq/powernv-cpufreq.c | 239 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 237 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index e2e2219..288fa10 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -36,12 +36,58 @@
 #include <asm/reg.h>
 #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
 #include <asm/opal.h>
+#include <linux/timer.h>
 
 #define POWERNV_MAX_PSTATES	256
 #define PMSR_PSAFE_ENABLE	(1UL << 30)
 #define PMSR_SPR_EM_DISABLE	(1UL << 31)
 #define PMSR_MAX(x)		((x >> 32) & 0xFF)
 
+/*
+ * Quadratic equation which gives the percentage rampdown for time elapsed in
+ * milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME ( milliseconds )
+ * This equation approximates to y = -4e-6 x^2
+ *
+ * At 0 seconds x=0000 ramp_down_percent=0
+ * At MAX_RAMP_DOWN_TIME x=5120 ramp_down_percent=100
+ */
+#define MAX_RAMP_DOWN_TIME				5120
+#define ramp_down_percent(time)		((time * time)>>18)
+
+/*Interval after which the timer is queued to bring down global pstate*/
+#define GPSTATE_TIMER_INTERVAL				2000
+/*
+ * global_pstate_info :
+ *	per policy data structure to maintain history of global pstates
+ *
+ * @highest_lpstate : the local pstate from which we are ramping down
+ * @elapsed_time : time in ms spent in ramping down from highest_lpstate
+ * @last_sampled_time : time from boot in ms when global pstates were last set
+ * @last_lpstate , last_gpstate : last set values for local and global pstates
+ * @timer : is used for ramping down if cpu goes idle for a long time with
+ *	global pstate held high
+ * @gpstate_lock : a spinlock to maintain synchronization between routines
+ *	called by the timer handler and governer's target_index calls
+ */
+struct global_pstate_info {
+	int highest_lpstate;
+	unsigned int elapsed_time;
+	unsigned int last_sampled_time;
+	int last_lpstate;
+	int last_gpstate;
+	spinlock_t gpstate_lock;
+	struct timer_list timer;
+};
+
+/*
+ * While resetting we don't want "timer" fields to be set to zero as we
+ * may lose track of timer and will not be able to cleanly remove it
+ */
+#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
+					sizeof(struct global_pstate_info)-\
+					sizeof(struct timer_list)-\
+					sizeof(spinlock_t))
+
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
 static bool rebooting, throttled, occ_reset;
 
@@ -285,6 +331,7 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val)
 struct powernv_smp_call_data {
 	unsigned int freq;
 	int pstate_id;
+	int gpstate_id;
 };
 
 /*
@@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
 	unsigned long val;
 	unsigned long pstate_ul =
 		((struct powernv_smp_call_data *) freq_data)->pstate_id;
+	unsigned long gpstate_ul =
+		((struct powernv_smp_call_data *) freq_data)->gpstate_id;
 
 	val = get_pmspr(SPRN_PMCR);
 	val = val & 0x0000FFFFFFFFFFFFULL;
 
 	pstate_ul = pstate_ul & 0xFF;
+	gpstate_ul = gpstate_ul & 0xFF;
 
 	/* Set both global(bits 56..63) and local(bits 48..55) PStates */
-	val = val | (pstate_ul << 56) | (pstate_ul << 48);
+	val = val | (gpstate_ul << 56) | (pstate_ul << 48);
 
 	pr_debug("Setting cpu %d pmcr to %016lX\n",
 			raw_smp_processor_id(), val);
@@ -425,6 +475,109 @@ next:
 }
 
 /*
+ * calcuate_global_pstate:
+ *
+ * @elapsed_time : elapsed time in milliseconds
+ * @local_pstate : new local pstate
+ * @highest_lpstate : pstate from which its ramping down
+ *
+ * Finds the appropriate global pstate based on the pstate from which its
+ * ramping down and the time elapsed in ramping down. It follows a quadratic
+ * equation which ensures that it reaches ramping down to pmin in 5sec.
+ */
+inline int calculate_global_pstate(unsigned int elapsed_time,
+		int highest_lpstate, int local_pstate)
+{
+	int pstate_diff;
+
+	/*
+	 * Using ramp_down_percent we get the percentage of rampdown
+	 * that we are expecting to be dropping. Difference between
+	 * highest_lpstate and powernv_pstate_info.min will give a absolute
+	 * number of how many pstates we will drop eventually by the end of
+	 * 5 seconds, then just scale it get the number pstates to be dropped.
+	 */
+	pstate_diff =  ((int)ramp_down_percent(elapsed_time) *
+			(highest_lpstate - powernv_pstate_info.min))/100;
+
+	/* Ensure that global pstate is >= to local pstate */
+	if (highest_lpstate - pstate_diff < local_pstate)
+		return local_pstate;
+	else
+		return (highest_lpstate - pstate_diff);
+}
+
+inline int queue_gpstate_timer(struct global_pstate_info *gpstates)
+{
+	unsigned int timer_interval;
+
+	/* Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But
+	 * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
+	 * Set timer such that it fires exactly at MAX_RAMP_DOWN_TIME
+	 * seconds of ramp down time.
+	 */
+	if ((gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL)
+							 > MAX_RAMP_DOWN_TIME)
+		timer_interval = MAX_RAMP_DOWN_TIME - gpstates->elapsed_time;
+	else
+		timer_interval = GPSTATE_TIMER_INTERVAL;
+
+	return  mod_timer_pinned(&(gpstates->timer), jiffies +
+				msecs_to_jiffies(timer_interval));
+}
+/*
+ * gpstate_timer_handler
+ *
+ * @data: pointer to cpufreq_policy on which timer was queued
+ *
+ * This handler brings down the global pstate closer to the local pstate
+ * according quadratic equation. Queues a new timer if it is still not equal
+ * to local pstate
+ */
+void gpstate_timer_handler(unsigned long data)
+{
+	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
+	struct global_pstate_info *gpstates = (struct global_pstate_info *)
+							policy->driver_data;
+	unsigned int time_diff = jiffies_to_msecs(jiffies)
+					- gpstates->last_sampled_time;
+	struct powernv_smp_call_data freq_data;
+	int ret;
+
+	ret = spin_trylock(&gpstates->gpstate_lock);
+	if (!ret)
+		return;
+
+	gpstates->last_sampled_time += time_diff;
+	gpstates->elapsed_time += time_diff;
+	freq_data.pstate_id = gpstates->last_lpstate;
+	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
+			(gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
+		freq_data.gpstate_id = freq_data.pstate_id;
+		reset_gpstates(policy);
+		gpstates->highest_lpstate = freq_data.pstate_id;
+	} else {
+		freq_data.gpstate_id = calculate_global_pstate(
+			gpstates->elapsed_time, gpstates->highest_lpstate,
+			freq_data.pstate_id);
+	}
+
+	/* If local pstate is equal to global pstate, rampdown is over
+	 * So timer is not required to be queued.
+	 */
+	if (freq_data.gpstate_id != freq_data.pstate_id)
+		ret = queue_gpstate_timer(gpstates);
+
+	gpstates->last_gpstate = freq_data.gpstate_id;
+	gpstates->last_lpstate = freq_data.pstate_id;
+
+	/* Timer may get migrated to a different cpu on cpu hot unplug */
+	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
+	spin_unlock(&gpstates->gpstate_lock);
+}
+
+
+/*
  * powernv_cpufreq_target_index: Sets the frequency corresponding to
  * the cpufreq table entry indexed by new_index on the cpus in the
  * mask policy->cpus
@@ -432,23 +585,88 @@ next:
 static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
 					unsigned int new_index)
 {
+	int ret;
 	struct powernv_smp_call_data freq_data;
-
+	unsigned int cur_msec;
+	unsigned long flags;
+	struct global_pstate_info *gpstates = (struct global_pstate_info *)
+						policy->driver_data;
 	if (unlikely(rebooting) && new_index != get_nominal_index())
 		return 0;
 
 	if (!throttled)
 		powernv_cpufreq_throttle_check(NULL);
 
+	cur_msec = jiffies_to_msecs(get_jiffies_64());
+
+	/*spinlock taken*/
+	spin_lock_irqsave(&gpstates->gpstate_lock, flags);
 	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
 
+	/*First time call */
+	if (!gpstates->last_sampled_time) {
+		freq_data.gpstate_id = freq_data.pstate_id;
+		gpstates->highest_lpstate = freq_data.pstate_id;
+		goto gpstates_done;
+	}
+
+	/*Ramp down*/
+	if (gpstates->last_gpstate > freq_data.pstate_id) {
+		gpstates->elapsed_time += cur_msec -
+					gpstates->last_sampled_time;
+		/* If its has been ramping down for more than 5seconds
+		 * we should be reseting all global pstate related data.
+		 * Set it equal to local pstate to start fresh.
+		 */
+		if (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME) {
+			freq_data.gpstate_id = freq_data.pstate_id;
+			reset_gpstates(policy);
+			gpstates->highest_lpstate = freq_data.pstate_id;
+			freq_data.gpstate_id = freq_data.pstate_id;
+		} else {
+		/* elaspsed_time is less than 5 seconds, continue to rampdown*/
+			freq_data.gpstate_id = calculate_global_pstate(
+			gpstates->elapsed_time,
+			gpstates->highest_lpstate, freq_data.pstate_id);
+
+		}
+
+	} else {
+		/*Ramp up*/
+		reset_gpstates(policy);
+		gpstates->highest_lpstate = freq_data.pstate_id;
+		freq_data.gpstate_id = freq_data.pstate_id;
+	}
+
+	/* If local pstate is equal to global pstate, rampdown is over
+	 * So timer is not required to be queued.
+	 */
+	if (freq_data.gpstate_id != freq_data.pstate_id)
+		ret = queue_gpstate_timer(gpstates);
+gpstates_done:
+	gpstates->last_sampled_time = cur_msec;
+	gpstates->last_gpstate = freq_data.gpstate_id;
+	gpstates->last_lpstate = freq_data.pstate_id;
+
 	/*
 	 * Use smp_call_function to send IPI and execute the
 	 * mtspr on target CPU.  We could do that without IPI
 	 * if current CPU is within policy->cpus (core)
 	 */
 	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
+	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
+	return 0;
+}
 
+static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+	int base;
+	struct global_pstate_info *gpstates = (struct global_pstate_info *)
+						policy->driver_data;
+	base = cpu_first_thread_sibling(policy->cpu);
+	del_timer_sync(&gpstates->timer);
+	kfree(policy->driver_data);
+	pr_info("freed driver_data cpu %d\n", base);
 	return 0;
 }
 
@@ -456,6 +674,7 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 {
 	int base, i;
 	struct kernfs_node *kn;
+	struct global_pstate_info *gpstates;
 
 	base = cpu_first_thread_sibling(policy->cpu);
 
@@ -475,6 +694,21 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	} else {
 		kernfs_put(kn);
 	}
+	gpstates =  kzalloc(sizeof(struct global_pstate_info), GFP_KERNEL);
+	if (!gpstates) {
+		pr_err("Could not allocate global_pstate_info\n");
+		return -ENOMEM;
+	}
+	policy->driver_data = gpstates;
+
+	/* initialize timer */
+	init_timer_deferrable(&gpstates->timer);
+	gpstates->timer.data = (unsigned long) policy;
+	gpstates->timer.function = gpstate_timer_handler;
+	gpstates->timer.expires = jiffies +
+				msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
+
+	pr_info("Added global_pstate_info & timer for %d cpu\n", base);
 	return cpufreq_table_validate_and_show(policy, powernv_freqs);
 }
 
@@ -612,6 +846,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 	.name		= "powernv-cpufreq",
 	.flags		= CPUFREQ_CONST_LOOPS,
 	.init		= powernv_cpufreq_cpu_init,
+	.exit		= powernv_cpufreq_cpu_exit,
 	.verify		= cpufreq_generic_frequency_table_verify,
 	.target_index	= powernv_cpufreq_target_index,
 	.get		= powernv_cpufreq_get,
-- 
2.5.5



* Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
  2016-04-12 18:06 ` [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate Akshay Adiga
@ 2016-04-13  5:03   ` Viresh Kumar
  2016-04-13 17:57     ` Akshay Adiga
  2016-04-14  5:40   ` Balbir Singh
  1 sibling, 1 reply; 8+ messages in thread
From: Viresh Kumar @ 2016-04-13  5:03 UTC (permalink / raw)
  To: Akshay Adiga; +Cc: rjw, linux-pm, linuxppc-dev, ego

Comments mostly on the coding standards which you have *not* followed.

Also, please run checkpatch --strict next time you send patches
upstream.

On 12-04-16, 23:36, Akshay Adiga wrote:
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> +#define MAX_RAMP_DOWN_TIME				5120
> +#define ramp_down_percent(time)		((time * time)>>18)

You need spaces around >>

> +
> +/*Interval after which the timer is queued to bring down global pstate*/

Bad comment style, should be /* ... */

> +#define GPSTATE_TIMER_INTERVAL				2000
> +/*
> + * global_pstate_info :

This is bad as well.. See how doc style comments are written at
separate places.

> + *	per policy data structure to maintain history of global pstates
> + *
> + * @highest_lpstate : the local pstate from which we are ramping down

No space required before :

> + * @elapsed_time : time in ms spent in ramping down from highest_lpstate
> + * @last_sampled_time : time from boot in ms when global pstates were last set
> + * @last_lpstate , last_gpstate : last set values for local and global pstates
> + * @timer : is used for ramping down if cpu goes idle for a long time with
> + *	global pstate held high
> + * @gpstate_lock : a spinlock to maintain synchronization between routines
> + *	called by the timer handler and governer's target_index calls
> + */
> +struct global_pstate_info {
> +	int highest_lpstate;
> +	unsigned int elapsed_time;
> +	unsigned int last_sampled_time;
> +	int last_lpstate;
> +	int last_gpstate;
> +	spinlock_t gpstate_lock;
> +	struct timer_list timer;
> +};
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
> +					sizeof(struct global_pstate_info)-\
> +					sizeof(struct timer_list)-\
> +					sizeof(spinlock_t))

That's super *ugly*. Why don't you create a simple routine which will
set the 5 integer variables to 0 in a straight forward way ?

> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>  	unsigned long val;
>  	unsigned long pstate_ul =
>  		((struct powernv_smp_call_data *) freq_data)->pstate_id;
> +	unsigned long gpstate_ul =
> +		((struct powernv_smp_call_data *) freq_data)->gpstate_id;

Remove these unnecessary casts and do:

struct powernv_smp_call_data *freq_data = data; //Name func arg as data

And then use freq_data->*.

>  
>  	val = get_pmspr(SPRN_PMCR);
>  	val = val & 0x0000FFFFFFFFFFFFULL;
>  
>  	pstate_ul = pstate_ul & 0xFF;
> +	gpstate_ul = gpstate_ul & 0xFF;
>  
>  	/* Set both global(bits 56..63) and local(bits 48..55) PStates */
> -	val = val | (pstate_ul << 56) | (pstate_ul << 48);
> +	val = val | (gpstate_ul << 56) | (pstate_ul << 48);

>  /*

You need /** here. See comments everywhere please, they are mostly
done in a wrong way.

> + * calcuate_global_pstate:
> + *
> + * @elapsed_time : elapsed time in milliseconds
> + * @local_pstate : new local pstate
> + * @highest_lpstate : pstate from which its ramping down
> + *
> + * Finds the appropriate global pstate based on the pstate from which its
> + * ramping down and the time elapsed in ramping down. It follows a quadratic
> + * equation which ensures that it reaches ramping down to pmin in 5sec.
> + */
> +inline int calculate_global_pstate(unsigned int elapsed_time,
> +		int highest_lpstate, int local_pstate)

Not aligned properly, checkpatch --strict will tell you what to do.

> +{
> +	int pstate_diff;
> +
> +	/*
> +	 * Using ramp_down_percent we get the percentage of rampdown
> +	 * that we are expecting to be dropping. Difference between
> +	 * highest_lpstate and powernv_pstate_info.min will give a absolute
> +	 * number of how many pstates we will drop eventually by the end of
> +	 * 5 seconds, then just scale it get the number pstates to be dropped.
> +	 */
> +	pstate_diff =  ((int)ramp_down_percent(elapsed_time) *
> +			(highest_lpstate - powernv_pstate_info.min))/100;
> +
> +	/* Ensure that global pstate is >= to local pstate */
> +	if (highest_lpstate - pstate_diff < local_pstate)
> +		return local_pstate;
> +	else
> +		return (highest_lpstate - pstate_diff);

Unnecessary ().

> +}
> +
> +inline int queue_gpstate_timer(struct global_pstate_info *gpstates)

Looks like many of the function you wrote now should be marked
'static' as well.

> +{
> +	unsigned int timer_interval;
> +
> +	/* Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But

Bad style again:

/*
 * ...
 * ...
 */

> +	 * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
> +	 * Set timer such that it fires exactly at MAX_RAMP_DOWN_TIME
> +	 * seconds of ramp down time.
> +	 */
> +	if ((gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL)
> +							 > MAX_RAMP_DOWN_TIME)

Align '>' right below the second (

> +		timer_interval = MAX_RAMP_DOWN_TIME - gpstates->elapsed_time;
> +	else
> +		timer_interval = GPSTATE_TIMER_INTERVAL;
> +
> +	return  mod_timer_pinned(&(gpstates->timer), jiffies +

() not required.

> +				msecs_to_jiffies(timer_interval));
> +}

blank line required.

> +/*
> + * gpstate_timer_handler
> + *
> + * @data: pointer to cpufreq_policy on which timer was queued
> + *
> + * This handler brings down the global pstate closer to the local pstate
> + * according quadratic equation. Queues a new timer if it is still not equal
> + * to local pstate
> + */
> +void gpstate_timer_handler(unsigned long data)
> +{
> +	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;

no need to cast.

> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
> +							policy->driver_data;

same here.

> +	unsigned int time_diff = jiffies_to_msecs(jiffies)
> +					- gpstates->last_sampled_time;
> +	struct powernv_smp_call_data freq_data;
> +	int ret;
> +
> +	ret = spin_trylock(&gpstates->gpstate_lock);

no need of 'ret' for just this, simply do: if (!spin_trylock())...

> +	if (!ret)
> +		return;
> +
> +	gpstates->last_sampled_time += time_diff;
> +	gpstates->elapsed_time += time_diff;
> +	freq_data.pstate_id = gpstates->last_lpstate;
> +	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
> +			(gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +		reset_gpstates(policy);
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +	} else {
> +		freq_data.gpstate_id = calculate_global_pstate(

You can't break a line after ( of a function call :)

Let it go beyond 80 columns if it has to.

> +			gpstates->elapsed_time, gpstates->highest_lpstate,
> +			freq_data.pstate_id);
> +	}
> +
> +	/* If local pstate is equal to global pstate, rampdown is over

Bad style again.

> +	 * So timer is not required to be queued.
> +	 */
> +	if (freq_data.gpstate_id != freq_data.pstate_id)
> +		ret = queue_gpstate_timer(gpstates);

ret not used.

> +
> +	gpstates->last_gpstate = freq_data.gpstate_id;
> +	gpstates->last_lpstate = freq_data.pstate_id;
> +
> +	/* Timer may get migrated to a different cpu on cpu hot unplug */
> +	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> +	spin_unlock(&gpstates->gpstate_lock);
> +}
> +
> +

Remove extra blank line.

> +/*
>   * powernv_cpufreq_target_index: Sets the frequency corresponding to
>   * the cpufreq table entry indexed by new_index on the cpus in the
>   * mask policy->cpus
> @@ -432,23 +585,88 @@ next:
>  static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>  					unsigned int new_index)
>  {
> +	int ret;
>  	struct powernv_smp_call_data freq_data;
> -
> +	unsigned int cur_msec;
> +	unsigned long flags;
> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
> +						policy->driver_data;
>  	if (unlikely(rebooting) && new_index != get_nominal_index())
>  		return 0;
>  
>  	if (!throttled)
>  		powernv_cpufreq_throttle_check(NULL);
>  
> +	cur_msec = jiffies_to_msecs(get_jiffies_64());
> +
> +	/*spinlock taken*/

Drop these useless comments, we know what you are doing.

> +	spin_lock_irqsave(&gpstates->gpstate_lock, flags);
>  	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>  
> +	/*First time call */
> +	if (!gpstates->last_sampled_time) {
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +		goto gpstates_done;
> +	}
> +
> +	/*Ramp down*/

Rather explain what you are doing and how you are doing it here.

> +	if (gpstates->last_gpstate > freq_data.pstate_id) {
> +		gpstates->elapsed_time += cur_msec -
> +					gpstates->last_sampled_time;
> +		/* If its has been ramping down for more than 5seconds
> +		 * we should be reseting all global pstate related data.
> +		 * Set it equal to local pstate to start fresh.
> +		 */
> +		if (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME) {
> +			freq_data.gpstate_id = freq_data.pstate_id;
> +			reset_gpstates(policy);
> +			gpstates->highest_lpstate = freq_data.pstate_id;
> +			freq_data.gpstate_id = freq_data.pstate_id;
> +		} else {
> +		/* elaspsed_time is less than 5 seconds, continue to rampdown*/
> +			freq_data.gpstate_id = calculate_global_pstate(
> +			gpstates->elapsed_time,
> +			gpstates->highest_lpstate, freq_data.pstate_id);
> +
> +		}
> +

remove blank line.

> +	} else {
> +		/*Ramp up*/
> +		reset_gpstates(policy);
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +	}
> +
> +	/* If local pstate is equal to global pstate, rampdown is over
> +	 * So timer is not required to be queued.
> +	 */
> +	if (freq_data.gpstate_id != freq_data.pstate_id)
> +		ret = queue_gpstate_timer(gpstates);

add a blank line here

> +gpstates_done:
> +	gpstates->last_sampled_time = cur_msec;
> +	gpstates->last_gpstate = freq_data.gpstate_id;
> +	gpstates->last_lpstate = freq_data.pstate_id;
> +
>  	/*
>  	 * Use smp_call_function to send IPI and execute the
>  	 * mtspr on target CPU.  We could do that without IPI
>  	 * if current CPU is within policy->cpus (core)
>  	 */
>  	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> +	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
> +	return 0;
> +}
>  
> +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)

Add this after the init() routine.

> +{
> +	int base;
> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)

Unnecessary cast.

> +						policy->driver_data;
> +	base = cpu_first_thread_sibling(policy->cpu);
> +	del_timer_sync(&gpstates->timer);
> +	kfree(policy->driver_data);
> +	pr_info("freed driver_data cpu %d\n", base);

may be a blank line here.

>  	return 0;
>  }
>  
> @@ -456,6 +674,7 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  {
>  	int base, i;
>  	struct kernfs_node *kn;
> +	struct global_pstate_info *gpstates;
>  
>  	base = cpu_first_thread_sibling(policy->cpu);
>  
> @@ -475,6 +694,21 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  	} else {
>  		kernfs_put(kn);
>  	}

blank line here.

> +	gpstates =  kzalloc(sizeof(struct global_pstate_info), GFP_KERNEL);

sizeof(*gpstates).

> +	if (!gpstates) {
> +		pr_err("Could not allocate global_pstate_info\n");

The print is not required.

> +		return -ENOMEM;
> +	}

blank line here

> +	policy->driver_data = gpstates;
> +
> +	/* initialize timer */
> +	init_timer_deferrable(&gpstates->timer);
> +	gpstates->timer.data = (unsigned long) policy;
> +	gpstates->timer.function = gpstate_timer_handler;
> +	gpstates->timer.expires = jiffies +
> +				msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
> +
> +	pr_info("Added global_pstate_info & timer for %d cpu\n", base);
>  	return cpufreq_table_validate_and_show(policy, powernv_freqs);

Who will free gpstates if this fails?

>  }
>  
> @@ -612,6 +846,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>  	.name		= "powernv-cpufreq",
>  	.flags		= CPUFREQ_CONST_LOOPS,
>  	.init		= powernv_cpufreq_cpu_init,
> +	.exit		= powernv_cpufreq_cpu_exit,
>  	.verify		= cpufreq_generic_frequency_table_verify,
>  	.target_index	= powernv_cpufreq_target_index,
>  	.get		= powernv_cpufreq_get,
> -- 
> 2.5.5

-- 
viresh

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
  2016-04-13  5:03   ` Viresh Kumar
@ 2016-04-13 17:57     ` Akshay Adiga
  2016-04-14  2:19       ` Viresh Kumar
  0 siblings, 1 reply; 8+ messages in thread
From: Akshay Adiga @ 2016-04-13 17:57 UTC (permalink / raw)
  To: Viresh Kumar; +Cc: linuxppc-dev, ego, rjw, linux-pm

Hi Viresh,

Thanks for reviewing in detail.
I will address all the comments related to coding standards in my next patch.

On 04/13/2016 10:33 AM, Viresh Kumar wrote:

> Comments mostly on the coding standards which you have *not* followed.
>
> Also, please run checkpatch --strict next time you send patches
> upstream.

Thanks for pointing out the --strict option; I was not aware of it. I will
run checkpatch --strict on the next versions.

> On 12-04-16, 23:36, Akshay Adiga wrote:
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
> +					sizeof(struct global_pstate_info)-\
> +					sizeof(struct timer_list)-\
> +					sizeof(spinlock_t))
> That's super *ugly*. Why don't you create a simple routine which will
> set the 5 integer variables to 0 in a straight forward way ?

Yeah, I will create a routine.

>> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>>   	unsigned long val;
>>   	unsigned long pstate_ul =
>>   		((struct powernv_smp_call_data *) freq_data)->pstate_id;
>> +	unsigned long gpstate_ul =
>> +		((struct powernv_smp_call_data *) freq_data)->gpstate_id;
> Remove these unnecessary casts and do:
>
> struct powernv_smp_call_data *freq_data = data; //Name func arg as data
>
> And then use freq_data->*.

Ok. Will do that.

>> +/*
>> + * gpstate_timer_handler
>> + *
>> + * @data: pointer to cpufreq_policy on which timer was queued
>> + *
>> + * This handler brings down the global pstate closer to the local pstate
>> + * according quadratic equation. Queues a new timer if it is still not equal
>> + * to local pstate
>> + */
>> +void gpstate_timer_handler(unsigned long data)
>> +{
>> +	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
> no need to cast.

Maybe I do need a cast here, because data is unsigned long (unlike other places, where it is a void *).
Building without the cast throws a warning.

>> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
>> +	struct powernv_smp_call_data freq_data;
>> +	int ret;
>> +
>> +	ret = spin_trylock(&gpstates->gpstate_lock);
> no need of 'ret' for just this, simply do: if (!spin_trylock())...

Sure will do that.

>> +	if (!ret)
>> +		return;
>> +
>> +	gpstates->last_sampled_time += time_diff;
>> +	gpstates->elapsed_time += time_diff;
>> +	freq_data.pstate_id = gpstates->last_lpstate;
>> +	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
>> +			(gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
>> +		freq_data.gpstate_id = freq_data.pstate_id;
>> +		reset_gpstates(policy);
>> +		gpstates->highest_lpstate = freq_data.pstate_id;
>> +	} else {
>> +		freq_data.gpstate_id = calculate_global_pstate(
> You can't break a line after ( of a function call :)
>
> Let it go beyond 80 columns if it has to.

Maybe I will try to get it inside 80 columns by using a temporary variable
instead of freq_data.gpstate_id.

>> +			gpstates->elapsed_time, gpstates->highest_lpstate,
>> +			freq_data.pstate_id);
>> +	}
>> +
>> +	/* If local pstate is equal to global pstate, rampdown is over
> Bad style again.
>
>> +	 * So timer is not required to be queued.
>> +	 */
>> +	if (freq_data.gpstate_id != freq_data.pstate_id)
>> +		ret = queue_gpstate_timer(gpstates);
> ret not used.

Should I make it void instead of returning int? I cannot do much even if it fails, except for notifying.

>> +gpstates_done:
>> +	gpstates->last_sampled_time = cur_msec;
>> +	gpstates->last_gpstate = freq_data.gpstate_id;
>> +	gpstates->last_lpstate = freq_data.pstate_id;
>> +
>>   	/*
>>   	 * Use smp_call_function to send IPI and execute the
>>   	 * mtspr on target CPU.  We could do that without IPI
>>   	 * if current CPU is within policy->cpus (core)
>>   	 */
>>   	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
>> +	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
>> +	return 0;
>> +}
>>   
>> +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> Add this after the init() routine.

Ok will do it.

>> +	policy->driver_data = gpstates;
>> +
>> +	/* initialize timer */
>> +	init_timer_deferrable(&gpstates->timer);
>> +	gpstates->timer.data = (unsigned long) policy;
>> +	gpstates->timer.function = gpstate_timer_handler;
>> +	gpstates->timer.expires = jiffies +
>> +				msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
>> +
>> +	pr_info("Added global_pstate_info & timer for %d cpu\n", base);
>>   	return cpufreq_table_validate_and_show(policy, powernv_freqs);
> Who will free gpstates if this fails ?

Thanks for pointing out. Will fix in v2.

Regards
Akshay Adiga



* Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
  2016-04-13 17:57     ` Akshay Adiga
@ 2016-04-14  2:19       ` Viresh Kumar
  0 siblings, 0 replies; 8+ messages in thread
From: Viresh Kumar @ 2016-04-14  2:19 UTC (permalink / raw)
  To: Akshay Adiga; +Cc: rjw, linux-pm, linuxppc-dev, ego

On 13-04-16, 23:27, Akshay Adiga wrote:
> On 04/13/2016 10:33 AM, Viresh Kumar wrote:
> >>+void gpstate_timer_handler(unsigned long data)
> >>+{
> >>+	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
> >no need to cast.
> 
> Maybe I do need a cast here, because data is unsigned long (unlike other places, where it is a void *).
> Building without the cast throws a warning.

My bad, yeah :(

> >>+	if (freq_data.gpstate_id != freq_data.pstate_id)
> >>+		ret = queue_gpstate_timer(gpstates);
> >ret not used.
> 
> Should I make it void instead of returning int? I cannot do much even if it fails, except for notifying.

Sure.

-- 
viresh


* Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
  2016-04-12 18:06 ` [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate Akshay Adiga
  2016-04-13  5:03   ` Viresh Kumar
@ 2016-04-14  5:40   ` Balbir Singh
  2016-04-14 15:54     ` Akshay Adiga
  1 sibling, 1 reply; 8+ messages in thread
From: Balbir Singh @ 2016-04-14  5:40 UTC (permalink / raw)
  To: Akshay Adiga, rjw, viresh.kumar, linux-pm, linuxppc-dev; +Cc: ego



On 13/04/16 04:06, Akshay Adiga wrote:
> This patch brings down global pstate at a slower rate than the local
> pstate. As the frequency transition latency from pmin to pmax is
> observed to be in few millisecond granurality. It takes a performance
> penalty during sudden frequency rampup. Hence by holding global pstates
> higher than local pstate makes the subsequent rampups faster.

What domains does local and global refer to?

> 
> A global per policy structure is maintained to keep track of the global
> and local pstate changes. The global pstate is brought down using a
> parabolic equation. The ramp down time to pmin is set to 6 seconds. To
> make sure that the global pstates are dropped at regular interval , a
> timer is queued for every 2 seconds, which eventually brings the pstate
> down to local pstate.
> 
> Iozone results show fairly consistent performance boost.
> YCSB on redis shows improved Max latencies in most cases.
> 
> Iozone write/rewite test were made with filesizes 200704Kb and 401408Kb with
> different record sizes . The following table shows IOoperations/sec with and
> without patch.
> 
> Iozone Results ( in op/sec) ( mean over 3 iterations )
> ------------------------------------
> file size-	   		with     	without
> recordsize-IOtype      		patch    	patch 		 % change
> ----------------------------------------------------------------------
> 200704-1-SeqWrite		1616532 	1615425 	0.06
> 200704-1-Rewrite		2423195  	2303130 	5.21
> 200704-2-SeqWrite		1628577 	1602620 	1.61
> 200704-2-Rewrite		2428264  	2312154 	5.02
> 200704-4-SeqWrite		1617605 	1617182 	0.02
> 200704-4-Rewrite		2430524  	2351238 	3.37
> 200704-8-SeqWrite		1629478 	1600436 	1.81
> 200704-8-Rewrite		2415308  	2298136 	5.09
> 200704-16-SeqWrite		1619632 	1618250 	0.08
> 200704-16-Rewrite		2396650 	2352591 	1.87
> 200704-32-SeqWrite		1632544 	1598083 	2.15
> 200704-32-Rewrite		2425119 	2329743 	4.09
> 200704-64-SeqWrite		1617812 	1617235 	0.03
> 200704-64-Rewrite		2402021 	2321080 	3.48
> 200704-128-SeqWrite		1631998 	1600256 	1.98
> 200704-128-Rewrite		2422389		2304954 	5.09
> 200704-256 SeqWrite		1617065		1616962 	0.00
> 200704-256-Rewrite		2432539 	2301980 	5.67
> 200704-512-SeqWrite		1632599 	1598656 	2.12
> 200704-512-Rewrite		2429270		2323676 	4.54
> 200704-1024-SeqWrite		1618758		1616156 	0.16
> 200704-1024-Rewrite		2431631		2315889 	4.99
> 401408-1-SeqWrite		1631479  	1608132 	1.45
> 401408-1-Rewrite		2501550  	2459409 	1.71
> 401408-2-SeqWrite		1617095 	1626069 	-0.55
> 401408-2-Rewrite		2507557  	2443621 	2.61
> 401408-4-SeqWrite		1629601 	1611869		1.10
> 401408-4-Rewrite		2505909  	2462098 	1.77
> 401408-8-SeqWrite  		1617110 	1626968 	-0.60
> 401408-8-Rewrite		2512244  	2456827 	2.25
> 401408-16-SeqWrite		1632609 	1609603 	1.42
> 401408-16-Rewrite		2500792 	2451405 	2.01
> 401408-32-SeqWrite		1619294 	1628167 	-0.54
> 401408-32-Rewrite		2510115		2451292 	2.39
> 401408-64-SeqWrite		1632709		1603746 	1.80
> 401408-64-Rewrite		2506692 	2433186 	3.02
> 401408-128-SeqWrite		1619284		1627461 	-0.50
> 401408-128-Rewrite		2518698 	2453361 	2.66
> 401408-256-SeqWrite		1634022 	1610681 	1.44
> 401408-256-Rewrite		2509987 	2446328 	2.60
> 401408-512-SeqWrite 		1617524 	1628016 	-0.64
> 401408-512-Rewrite		2504409 	2442899 	2.51
> 401408-1024-SeqWrite		1629812 	1611566 	1.13
> 401408-1024-Rewrite 		2507620		 2442968 	2.64
> 
> Tested with YCSB workloada over redis for 1 million records and 1 million
> operation. Each test was carried out with target operations per second and
> persistence disabled. 
> 
> Max-latency (in us)( mean over 5 iterations )
> -----------------------------------------------------------
> op/s	Operation	with patch	without patch	%change
> ------------------------------------------------------------
> 15000	Read		61480.6		50261.4		22.32
> 15000	cleanup		215.2		293.6		-26.70
> 15000	update		25666.2		25163.8		2.00
> 
> 25000	Read		32626.2		89525.4		-63.56
> 25000	cleanup		292.2		263.0		11.10
> 25000	update		32293.4		90255.0		-64.22
> 
> 35000	Read		34783.0		33119.0		5.02
> 35000	cleanup		321.2		395.8		-18.8
> 35000	update		36047.0		38747.8		-6.97
> 
> 40000	Read		38562.2		42357.4		-8.96
> 40000	cleanup		371.8		384.6		-3.33
> 40000	update		27861.4		41547.8		-32.94
> 
> 45000	Read		42271.0		88120.6		-52.03
> 45000	cleanup		263.6		383.0		-31.17
> 45000	update		29755.8		81359.0		-63.43
> 
> (test without target op/s)
> 47659	Read		83061.4		136440.6	-39.12
> 47659	cleanup		195.8		193.8		1.03
> 47659	update		73429.4		124971.8	-41.24
> 
> Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 239 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 237 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index e2e2219..288fa10 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -36,12 +36,58 @@
>  #include <asm/reg.h>
>  #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
>  #include <asm/opal.h>
> +#include <linux/timer.h>
>  
>  #define POWERNV_MAX_PSTATES	256
>  #define PMSR_PSAFE_ENABLE	(1UL << 30)
>  #define PMSR_SPR_EM_DISABLE	(1UL << 31)
>  #define PMSR_MAX(x)		((x >> 32) & 0xFF)
>  
> +/*
> + * Quadratic equation which gives the percentage rampdown for time elapsed in
> + * milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME ( milliseconds )
> + * This equation approximates to y = -4e-6 x^2

Thanks for documenting this, but I think it will also be good to explain why we 
use y = -4 e-6*x^2 as opposed to any other magic numbers.

> + *
> + * At 0 seconds x=0000 ramp_down_percent=0
> + * At MAX_RAMP_DOWN_TIME x=5120 ramp_down_percent=100
> + */
> +#define MAX_RAMP_DOWN_TIME				5120
> +#define ramp_down_percent(time)		((time * time)>>18)
> +
> +/*Interval after which the timer is queued to bring down global pstate*/
> +#define GPSTATE_TIMER_INTERVAL				2000
> +/*
> + * global_pstate_info :
> + *	per policy data structure to maintain history of global pstates
> + *
> + * @highest_lpstate : the local pstate from which we are ramping down
> + * @elapsed_time : time in ms spent in ramping down from highest_lpstate
> + * @last_sampled_time : time from boot in ms when global pstates were last set
> + * @last_lpstate , last_gpstate : last set values for local and global pstates
> + * @timer : is used for ramping down if cpu goes idle for a long time with
> + *	global pstate held high
> + * @gpstate_lock : a spinlock to maintain synchronization between routines
> + *	called by the timer handler and governer's target_index calls
> + */
> +struct global_pstate_info {
> +	int highest_lpstate;
> +	unsigned int elapsed_time;
> +	unsigned int last_sampled_time;
> +	int last_lpstate;
> +	int last_gpstate;
> +	spinlock_t gpstate_lock;
> +	struct timer_list timer;
> +};
> +
> +/*
> + * While resetting we don't want "timer" fields to be set to zero as we
> + * may lose track of timer and will not be able to cleanly remove it
> + */
> +#define reset_gpstates(policy)   memset(policy->driver_data, 0,\
> +					sizeof(struct global_pstate_info)-\
> +					sizeof(struct timer_list)-\
> +					sizeof(spinlock_t))
> +
>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>  static bool rebooting, throttled, occ_reset;
>  
> @@ -285,6 +331,7 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val)
>  struct powernv_smp_call_data {
>  	unsigned int freq;
>  	int pstate_id;
> +	int gpstate_id;
>  };
>  
>  /*
> @@ -348,14 +395,17 @@ static void set_pstate(void *freq_data)
>  	unsigned long val;
>  	unsigned long pstate_ul =
>  		((struct powernv_smp_call_data *) freq_data)->pstate_id;
> +	unsigned long gpstate_ul =
> +		((struct powernv_smp_call_data *) freq_data)->gpstate_id;
>  
>  	val = get_pmspr(SPRN_PMCR);
>  	val = val & 0x0000FFFFFFFFFFFFULL;
>  
>  	pstate_ul = pstate_ul & 0xFF;
> +	gpstate_ul = gpstate_ul & 0xFF;
>  
>  	/* Set both global(bits 56..63) and local(bits 48..55) PStates */
> -	val = val | (pstate_ul << 56) | (pstate_ul << 48);
> +	val = val | (gpstate_ul << 56) | (pstate_ul << 48);
>  
>  	pr_debug("Setting cpu %d pmcr to %016lX\n",
>  			raw_smp_processor_id(), val);
> @@ -425,6 +475,109 @@ next:
>  }
>  
>  /*
> + * calcuate_global_pstate:
> + *
> + * @elapsed_time : elapsed time in milliseconds
> + * @local_pstate : new local pstate
> + * @highest_lpstate : pstate from which its ramping down
> + *
> + * Finds the appropriate global pstate based on the pstate from which its
> + * ramping down and the time elapsed in ramping down. It follows a quadratic
> + * equation which ensures that it reaches ramping down to pmin in 5sec.
> + */
> +inline int calculate_global_pstate(unsigned int elapsed_time,
> +		int highest_lpstate, int local_pstate)
> +{
> +	int pstate_diff;
> +
> +	/*
> +	 * Using ramp_down_percent we get the percentage of rampdown
> +	 * that we are expecting to be dropping. Difference between
> +	 * highest_lpstate and powernv_pstate_info.min will give a absolute
> +	 * number of how many pstates we will drop eventually by the end of
> +	 * 5 seconds, then just scale it get the number pstates to be dropped.
> +	 */
> +	pstate_diff =  ((int)ramp_down_percent(elapsed_time) *
> +			(highest_lpstate - powernv_pstate_info.min))/100;
> +
> +	/* Ensure that global pstate is >= to local pstate */
> +	if (highest_lpstate - pstate_diff < local_pstate)
> +		return local_pstate;
> +	else
> +		return (highest_lpstate - pstate_diff);
> +}
> +
> +inline int queue_gpstate_timer(struct global_pstate_info *gpstates)
> +{
> +	unsigned int timer_interval;
> +
> +	/* Setting up timer to fire after GPSTATE_TIMER_INTERVAL ms, But
> +	 * if it exceeds MAX_RAMP_DOWN_TIME ms for ramp down time.
> +	 * Set timer such that it fires exactly at MAX_RAMP_DOWN_TIME
> +	 * seconds of ramp down time.
> +	 */
> +	if ((gpstates->elapsed_time + GPSTATE_TIMER_INTERVAL)
> +							 > MAX_RAMP_DOWN_TIME)
> +		timer_interval = MAX_RAMP_DOWN_TIME - gpstates->elapsed_time;
> +	else
> +		timer_interval = GPSTATE_TIMER_INTERVAL;
> +
> +	return  mod_timer_pinned(&(gpstates->timer), jiffies +
> +				msecs_to_jiffies(timer_interval));
> +}
> +/*
> + * gpstate_timer_handler
> + *
> + * @data: pointer to cpufreq_policy on which timer was queued
> + *
> + * This handler brings down the global pstate closer to the local pstate
> + * according quadratic equation. Queues a new timer if it is still not equal
> + * to local pstate
> + */
> +void gpstate_timer_handler(unsigned long data)
> +{
> +	struct cpufreq_policy *policy = (struct cpufreq_policy *) data;
> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
> +							policy->driver_data;
> +	unsigned int time_diff = jiffies_to_msecs(jiffies)
> +					- gpstates->last_sampled_time;
> +	struct powernv_smp_call_data freq_data;
> +	int ret;
> +
> +	ret = spin_trylock(&gpstates->gpstate_lock);
> +	if (!ret)
> +		return;
> +
> +	gpstates->last_sampled_time += time_diff;
> +	gpstates->elapsed_time += time_diff;
> +	freq_data.pstate_id = gpstates->last_lpstate;
> +	if ((gpstates->last_gpstate == freq_data.pstate_id) ||
> +			(gpstates->elapsed_time > MAX_RAMP_DOWN_TIME)) {
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +		reset_gpstates(policy);
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +	} else {
> +		freq_data.gpstate_id = calculate_global_pstate(
> +			gpstates->elapsed_time, gpstates->highest_lpstate,
> +			freq_data.pstate_id);
> +	}
> +
> +	/* If local pstate is equal to global pstate, rampdown is over
> +	 * So timer is not required to be queued.
> +	 */
> +	if (freq_data.gpstate_id != freq_data.pstate_id)
> +		ret = queue_gpstate_timer(gpstates);
> +
> +	gpstates->last_gpstate = freq_data.gpstate_id;
> +	gpstates->last_lpstate = freq_data.pstate_id;
> +
> +	/* Timer may get migrated to a different cpu on cpu hot unplug */
> +	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> +	spin_unlock(&gpstates->gpstate_lock);
> +}
> +
> +
> +/*
>   * powernv_cpufreq_target_index: Sets the frequency corresponding to
>   * the cpufreq table entry indexed by new_index on the cpus in the
>   * mask policy->cpus
> @@ -432,23 +585,88 @@ next:
>  static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>  					unsigned int new_index)
>  {
> +	int ret;
>  	struct powernv_smp_call_data freq_data;
> -
> +	unsigned int cur_msec;
> +	unsigned long flags;
> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
> +						policy->driver_data;
>  	if (unlikely(rebooting) && new_index != get_nominal_index())
>  		return 0;
>  
>  	if (!throttled)
>  		powernv_cpufreq_throttle_check(NULL);
>  
> +	cur_msec = jiffies_to_msecs(get_jiffies_64());
> +
> +	/*spinlock taken*/
> +	spin_lock_irqsave(&gpstates->gpstate_lock, flags);
>  	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>  
> +	/*First time call */
> +	if (!gpstates->last_sampled_time) {
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +		goto gpstates_done;
> +	}
> +
> +	/*Ramp down*/
> +	if (gpstates->last_gpstate > freq_data.pstate_id) {
> +		gpstates->elapsed_time += cur_msec -
> +					gpstates->last_sampled_time;
> +		/* If its has been ramping down for more than 5seconds
> +		 * we should be reseting all global pstate related data.
> +		 * Set it equal to local pstate to start fresh.
> +		 */
> +		if (gpstates->elapsed_time > MAX_RAMP_DOWN_TIME) {
> +			freq_data.gpstate_id = freq_data.pstate_id;
> +			reset_gpstates(policy);
> +			gpstates->highest_lpstate = freq_data.pstate_id;
> +			freq_data.gpstate_id = freq_data.pstate_id;
> +		} else {
> +		/* elaspsed_time is less than 5 seconds, continue to rampdown*/
> +			freq_data.gpstate_id = calculate_global_pstate(
> +			gpstates->elapsed_time,
> +			gpstates->highest_lpstate, freq_data.pstate_id);
> +
> +		}
> +
> +	} else {
> +		/*Ramp up*/
> +		reset_gpstates(policy);
> +		gpstates->highest_lpstate = freq_data.pstate_id;
> +		freq_data.gpstate_id = freq_data.pstate_id;
> +	}
> +
> +	/* If local pstate is equal to global pstate, rampdown is over
> +	 * So timer is not required to be queued.
> +	 */
> +	if (freq_data.gpstate_id != freq_data.pstate_id)
> +		ret = queue_gpstate_timer(gpstates);
> +gpstates_done:
> +	gpstates->last_sampled_time = cur_msec;
> +	gpstates->last_gpstate = freq_data.gpstate_id;
> +	gpstates->last_lpstate = freq_data.pstate_id;
> +
>  	/*
>  	 * Use smp_call_function to send IPI and execute the
>  	 * mtspr on target CPU.  We could do that without IPI
>  	 * if current CPU is within policy->cpus (core)
>  	 */
>  	smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> +	spin_unlock_irqrestore(&gpstates->gpstate_lock, flags);
> +	return 0;
> +}
>  
> +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +	int base;
> +	struct global_pstate_info *gpstates = (struct global_pstate_info *)
> +						policy->driver_data;
> +	base = cpu_first_thread_sibling(policy->cpu);
> +	del_timer_sync(&gpstates->timer);
> +	kfree(policy->driver_data);
> +	pr_info("freed driver_data cpu %d\n", base);
>  	return 0;
>  }
>  
> @@ -456,6 +674,7 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  {
>  	int base, i;
>  	struct kernfs_node *kn;
> +	struct global_pstate_info *gpstates;
>  
>  	base = cpu_first_thread_sibling(policy->cpu);
>  
> @@ -475,6 +694,21 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  	} else {
>  		kernfs_put(kn);
>  	}
> +	gpstates =  kzalloc(sizeof(struct global_pstate_info), GFP_KERNEL);
> +	if (!gpstates) {
> +		pr_err("Could not allocate global_pstate_info\n");
> +		return -ENOMEM;
> +	}
> +	policy->driver_data = gpstates;
> +
> +	/* initialize timer */
> +	init_timer_deferrable(&gpstates->timer);
> +	gpstates->timer.data = (unsigned long) policy;
> +	gpstates->timer.function = gpstate_timer_handler;
> +	gpstates->timer.expires = jiffies +
> +				msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
> +
> +	pr_info("Added global_pstate_info & timer for %d cpu\n", base);
>  	return cpufreq_table_validate_and_show(policy, powernv_freqs);
>  }
>  
> @@ -612,6 +846,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>  	.name		= "powernv-cpufreq",
>  	.flags		= CPUFREQ_CONST_LOOPS,
>  	.init		= powernv_cpufreq_cpu_init,
> +	.exit		= powernv_cpufreq_cpu_exit,
>  	.verify		= cpufreq_generic_frequency_table_verify,
>  	.target_index	= powernv_cpufreq_target_index,
>  	.get		= powernv_cpufreq_get,
> 


Balbir Singh


* Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate
  2016-04-14  5:40   ` Balbir Singh
@ 2016-04-14 15:54     ` Akshay Adiga
  0 siblings, 0 replies; 8+ messages in thread
From: Akshay Adiga @ 2016-04-14 15:54 UTC (permalink / raw)
  To: Balbir Singh, rjw, viresh.kumar, linux-pm, linuxppc-dev; +Cc: ego

Hi Balbir,

On 04/14/2016 11:10 AM, Balbir Singh wrote:

> On 13/04/16 04:06, Akshay Adiga wrote:
>> This patch brings down global pstate at a slower rate than the local
>> pstate. As the frequency transition latency from pmin to pmax is
>> observed to be in few millisecond granurality. It takes a performance
>> penalty during sudden frequency rampup. Hence by holding global pstates
>> higher than local pstate makes the subsequent rampups faster.
> What domains does local and global refer to?

The *global pstate* is a chip-level entity, so the global entity (voltage)
is managed across several cores. With a DCM (Dual Chip Module), it is more of
a socket-level entity, and hence the voltage is managed across both chips.

The *local pstate* is a core-level entity, so the local entity (frequency)
is managed across the threads of a core.

I will include this in the commit message. Thanks.

>
>> +/*
>> + * Quadratic equation which gives the percentage rampdown for time elapsed in
>> + * milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME ( milliseconds )
>> + * This equation approximates to y = -4e-6 x^2
> Thanks for documenting this, but I think it will also be good to explain why we
> use y = -4 e-6*x^2 as opposed to any other magic numbers.

Well, it empirically worked out best this way.

On an idle system we want the global pstate to ramp down from the max value
to the min over a span of ~5 seconds. We also want to ramp down slowly at
first and rapidly later on, hence the quadratic equation.

I will try to make the comment more informative.

Regards

Akshay Adiga


