From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Doug Smythies" Subject: RE: [PATCH] cpuidle: Allow menu governor to enter deeper sleep states after some time Date: Thu, 19 Oct 2017 17:17:42 -0700 Message-ID: <000101d34938$da740870$8f5c1950$@net> References: aiVvdxj7gLueDaiVwdfXxs Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Return-path: Received: from cmta16.telus.net ([209.171.16.89]:45683 "EHLO cmta16.telus.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751406AbdJTARt (ORCPT ); Thu, 19 Oct 2017 20:17:49 -0400 In-Reply-To: aiVvdxj7gLueDaiVwdfXxs Content-Language: en-ca Sender: linux-pm-owner@vger.kernel.org List-Id: linux-pm@vger.kernel.org To: 'Thomas Ilsche' Cc: =?utf-8?Q?'Marcus_H=C3=A4hnel'?= , 'Daniel Hackenberg' , =?utf-8?Q?'Robert_Sch=C3=B6ne'?= , mario.bielert@tu-dresden.de, "'Rafael J. Wysocki'" , 'Alex Shi' , 'Ingo Molnar' , 'Rik van Riel' , 'Daniel Lezcano' , 'Nicholas Piggin' , linux-pm@vger.kernel.org, 'Len Brown' Hi Thomas, Note 1: I think it is just a coincidence that Len Brown also replied to this old e-mail earlier today. Note 2: Rafael referred me to this e-mail about a week ago, after I was complaining about excessive energy consumption due to unused CPUs not going idle when they should. My example use case was a 100% load on one CPU. Note 3: The patch does not work for me as it was sent. See notes in-line further below. On 2017.07.27 05:50 Thomas Ilsche wrote: > On several contemporary Intel systems, we have observed that the idle > power consumption is sometimes significantly too high, e.g. 105 W = instead > of 74 W for several seconds. On my test system, I do not observe this magnitude of excessive idle = energy. While I do observe occurrences of excessive time spent in idle state 0, = they do not add up to significant energy waste. > We tracked this issue down to patterns that confuse the heuristic of = the > menu idle governor. If the governor observes several consecutive short > sleeps, it does expect another short sleep duration despite no = immediate > timers, sending the CPU into a shallow sleep state. On a dyntick-idle > kernel, there are no means for the core to enter a more efficient = sleep > state if the prediction was wrong. Ironically this is particularly > problematic if some cores of the system have very infrequent activity, = i.e. > many cores and optimized configuration for low idle power. If I use your "powernightmare" method (or idle-virus as Len called it) I get a drastic package power consumption difference between fallback_timer_enabled being enabled or disabled. It seems mainly due to changes in time spent in idle state 2. There was also a contribution from idle state 0 on one CPU. This was a surprise to me as my thinking = was that wasted energy was dominated by extra time in idle state 0. Example data (kernel 4.14-rc5 + this patch): Average package power with fallback timer disabled: ~8.06 watts Average package power with fallback timer enabled: ~3.95 watts > The effect, cause, triggers, mitigation technique and verification = thereof > are described in detail in the a paper that is pending publication = [1]. The paper is very interesting. Thank you. > Fixing the heuristic to make a better prediction does not seem to be > generally feasible. The following patch addresses this issue by = setting a > timer such that when the an expected immediate wake-up fails to = appear, the > CPU is woken up to go into a deeper sleep state. If the heuristic was > right, the timer is simply canceled. > > Please consider the patch a draft for discussion. We need to address: > > * Avoid programming the fallback timer when the deepest sleep state = is > already chosen. > * Determine good default values for the introduced configurables. = This > is difficult due to the large variety of system configurations = affected > by the menu governor. Nevertheless we believe this can be done such > that many systems benefit without significant degradation in any = case. > * Document (or remove) the sysfs entries. > > But first, I would like to invite comments if this is going in the = right > direction, or if this should be addressed in a different way. The problem I am struggling with is the patch makes no difference for my example use case of 100% load on one CPU, others idle. The wasted power is entirely from idle state 0, and idle state 0 times remain about the same with or without the fallback timer. If I merely disable idle state 0, things are great. However, just disabling idle state 0 does not help when "powermightmares" are present. Example data (kernel 4.14-rc5 + this patch): Average package power with fallback timer disabled: ~27.2 watts Average package power with fallback timer enabled: ~28 watts Average package power with state 0 disabled, fallback timer disabled: = ~23.9 watts Average package power with state 0 disabled, fallback timer enabled: = ~23.9 watts Question: Could this be because I made a mistake re-basing the patch to = kernel 4.14-rc5? Answer: Perhaps. I am unfamiliar with this area of the code. > > [1] = https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powe= rnightmares.pdf > > Signed-off-by: Thomas Ilsche > Signed-off-by: Marcus H=C3=A4hnel > --- > > diff --git a/drivers/cpuidle/governors/menu.c = b/drivers/cpuidle/governors/menu.c > index 61b64c2b2cb8..45bbeb362809 100644 > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -22,6 +22,7 @@ > #include > #include > #include > +#include > =20 > /* > * Please note when changing the tuning values: > @@ -130,6 +131,10 @@ struct menu_device { > unsigned int correction_factor[BUCKETS]; > unsigned int intervals[INTERVALS]; > int interval_ptr; > + > + struct hrtimer fallback_timer; > + int have_timer; > + unsigned int disregard_past; > }; > =20 > =20 > @@ -190,6 +195,85 @@ static inline int performance_multiplier(unsigned = long nr_iowaiters, unsigned lo > =20 > static DEFINE_PER_CPU(struct menu_device, menu_devices); > =20 > +static unsigned int fallback_timer_disregard_past =3D 1; > +static unsigned int diff_threshold_bits =3D 8; > +static unsigned int fallback_timer_enabled; Shouldn't fallback_timer_enabled be initialized? On my system it defaults to 0, or disabled, and due to another problem, see further down, this patch wasn't working. > +static unsigned int fallback_timer_interval_us =3D 10000; > + > +#define MENU_ATTR_RW(name, var, range_min, range_max, wfun) \ > + static ssize_t show_##name(struct device *dev, \ > + struct device_attribute *attr, char *buf) \ > + { \ > + return snprintf(buf, 12, "%i\n", var); \ > + } \ > + static ssize_t store_##name(struct device *dev, \ > + struct device_attribute *attr, \ > + const char *buf, size_t count) \ > + { \ > + unsigned int tmp; \ > + int ret =3D kstrtouint(buf, 10, &tmp); \ > + if (ret !=3D 1) \ Shouldn't this be: + if (ret !=3D 0) \ (or maybe just " if (ret) \". I did (and therefore tested) the previous. ? Anyway, it doesn't work unless I make this change. > + return -EINVAL; \ > + if (tmp > range_max || tmp < range_min) { \ > + pr_warn("value outside of valid range [%u, %u]\n", \ > + range_min, range_max); \ > + return -EINVAL; \ > + } \ > + var =3D tmp; \ > + wfun \ > + return count; \ > + } \ > + static DEVICE_ATTR(fallback_timer_##name, 0644, \ > + show_##name, store_##name) > + > +MENU_ATTR_RW(threshold_bits, diff_threshold_bits, 0, 31, {}); > + > +MENU_ATTR_RW(enable, fallback_timer_enabled, 0, 1, { > + int i; > + > + for_each_possible_cpu(i) { > + struct menu_device *data =3D per_cpu_ptr(&menu_devices, i); > + > + if (!fallback_timer_enabled) { > + data->have_timer =3D 0; > + hrtimer_cancel(&(data->fallback_timer)); > + } > + } }); > + > +MENU_ATTR_RW(interval_us, fallback_timer_interval_us, 1, 1000000, = {}); > + > +MENU_ATTR_RW(disregard_past, fallback_timer_disregard_past, 0, 1, { > + int i; > + > + for_each_possible_cpu(i) { > + struct menu_device *data =3D per_cpu_ptr(&menu_devices, i); > + > + data->disregard_past =3D 0; > + } }); > + > +static struct attribute *menu_attrs[] =3D { > + &dev_attr_fallback_timer_threshold_bits.attr, > + &dev_attr_fallback_timer_enable.attr, > + &dev_attr_fallback_timer_interval_us.attr, > + &dev_attr_fallback_timer_disregard_past.attr, > + NULL > +}; > + > +static struct attribute_group menu_attr_group =3D { > + .attrs =3D menu_attrs, > + .name =3D "cpuidle_menu", > + }; > + > +int menu_add_interface(struct device *dev) > +{ > + return sysfs_create_group(&dev->kobj, &menu_attr_group); > +} > + > +void menu_remove_interface(struct device *dev) > +{ > + sysfs_remove_group(&dev->kobj, &menu_attr_group); > +} > + > static void menu_update(struct cpuidle_driver *drv, struct = cpuidle_device *dev); > =20 > /* > @@ -275,6 +359,14 @@ static unsigned int get_typical_interval(struct = menu_device *data) > goto again; > } > =20 > +static enum hrtimer_restart fallback_timer_fun(struct hrtimer *tmr) > +{ > + struct menu_device *mdata =3D this_cpu_ptr(&menu_devices); > + > + mdata->disregard_past =3D 1; > + return HRTIMER_NORESTART; > +} > + > /** > * menu_select - selects the next idle state to enter > * @drv: cpuidle driver containing state data > @@ -293,6 +385,11 @@ static int menu_select(struct cpuidle_driver = *drv, struct cpuidle_device *dev) > unsigned long nr_iowaiters, cpu_load; > int resume_latency =3D dev_pm_qos_raw_read_value(device); > =20 > + if (fallback_timer_enabled && data->have_timer) { > + data->have_timer =3D 0; > + hrtimer_cancel(&(data->fallback_timer)); > + } > + > if (data->needs_update) { > menu_update(drv, dev); > data->needs_update =3D 0; > @@ -322,7 +419,32 @@ static int menu_select(struct cpuidle_driver = *drv, struct cpuidle_device *dev) > RESOLUTION * DECAY); > =20 > expected_interval =3D get_typical_interval(data); > - expected_interval =3D min(expected_interval, data->next_timer_us); > + > + if (fallback_timer_enabled && fallback_timer_disregard_past > + && data->disregard_past) { > + expected_interval =3D data->next_timer_us; > + // Only disregard the past once! Then try again > + data->disregard_past =3D 0; > + } else { > + if (fallback_timer_enabled > + && expected_interval < (data->next_timer_us >> = diff_threshold_bits) > + && data->next_timer_us > fallback_timer_interval_us * 2) { > + /* > + * Program the fallback timer if the gap between the > + * expected interval by heuristic and the next regular > + * timer are too far apart. > + * However, only do this when we didn't just wakup from > + * a timer and are told to disregard the heuristic > + */ > + ktime_t interval =3D > + ktime_set(0, fallback_timer_interval_us * 1000); > + > + hrtimer_start(&(data->fallback_timer), interval, > + HRTIMER_MODE_REL_PINNED); > + data->have_timer =3D 1; > + } > + expected_interval =3D min(expected_interval, data->next_timer_us); > + } > =20 > if (CPUIDLE_DRIVER_STATE_START > 0) { > struct cpuidle_state *s =3D = &drv->states[CPUIDLE_DRIVER_STATE_START]; > @@ -398,6 +520,11 @@ static void menu_reflect(struct cpuidle_device = *dev, int index) > { > struct menu_device *data =3D this_cpu_ptr(&menu_devices); > =20 > + if (fallback_timer_enabled && data->have_timer) { > + data->have_timer =3D 0; > + hrtimer_cancel(&data->fallback_timer); > + } > + > data->last_state_idx =3D index; > data->needs_update =3D 1; > } > @@ -486,6 +613,10 @@ static int menu_enable_device(struct = cpuidle_driver *drv, > =20 > memset(data, 0, sizeof(struct menu_device)); > =20 > + hrtimer_init(&(data->fallback_timer), > + CLOCK_REALTIME, HRTIMER_MODE_REL_PINNED); > + data->fallback_timer.function =3D fallback_timer_fun; > + > /* > * if the correction factor is 0 (eg first time init or cpu hotplug > * etc), we actually want to start out with a unity factor. > @@ -509,6 +640,10 @@ static struct cpuidle_governor menu_governor =3D = { > */ > static int __init init_menu(void) > { > + int ret =3D menu_add_interface(cpu_subsys.dev_root); > + > + if (ret) > + return ret; > return cpuidle_register_governor(&menu_governor); > } =20