* [Bug 217931] amd-pstate lacks crucial features: CPU frequency and boost control
From: bugzilla-daemon @ 2024-04-06 1:39 UTC (permalink / raw)
To: linux-pm
In-Reply-To: <bug-217931-137361@https.bugzilla.kernel.org/>
https://bugzilla.kernel.org/show_bug.cgi?id=217931
--- Comment #57 from Perry Yuan(AMD) (Perry.Yuan@amd.com) ---
Hello,
Thank you for your message. I am currently out of the office from afternoon
today to 4/8 for Chinese QingMing Festival. response will be delayed.
Best wishes,
Perry Yuan
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply
* [Bug 217931] amd-pstate lacks crucial features: CPU frequency and boost control
From: bugzilla-daemon @ 2024-04-06 1:38 UTC (permalink / raw)
To: linux-pm
In-Reply-To: <bug-217931-137361@https.bugzilla.kernel.org/>
https://bugzilla.kernel.org/show_bug.cgi?id=217931
--- Comment #56 from Artem S. Tashkinov (aros@gmx.com) ---
Here's my report for 6.8.4 and Ryzen 7 7840HS:
1. Setting scaling_max_freq works, cool, but you cannot go below 544MHz. You
can set 400MHz but the CPU will continue to jump to 544MHz, not a big deal.
2. Setting scaling_min_freq sort of works, the CPU _prefers_ to stay at this
specified frequency but occasionally drops to 400MHz. Then, the CPU doesn't
totally respect it, i.e. for values below 1.4GHz it stays around 1397MHz. The
relationship is not linear, setting 4GHz results in the CPU preferring 3766Mhz.
Or 2GHz -> 1982MHz. Not a big deal. Unlikely anyone would want to set the
lowest frequency.
3. /sys/devices/system/cpu/amd_pstate/cpb_boost is missing altogether:
# find /sys -iname '*boost*'
[nothing]
So, my only remaining question is where's boost support? Or it's not supported
for all Zen CPUs?
Thanks a lot for your work regardless!
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply
* [PATCH 3/3] hwmon: (coretemp) Use a model-specific bitmask to read registers
From: Ricardo Neri @ 2024-04-06 1:04 UTC (permalink / raw)
To: Rafael J. Wysocki, Zhang Rui, Jean Delvare, Guenter Roeck
Cc: Srinivas Pandruvada, Lukasz Luba, Daniel Lezcano, linux-pm,
linux-hwmon, linux-kernel, Ricardo Neri
In-Reply-To: <20240406010416.4821-1-ricardo.neri-calderon@linux.intel.com>
The Intel Software Development manual defines states the temperature
digital readout as the bits [22:16] of the IA32_[PACKAGE]_THERM_STATUS
registers. In recent processor, however, the range is [23:16]. Use a
model-specific bitmask to extract the temperature readout correctly.
Instead of re-implementing model checks, extract the correct bitmask
using the intel_tcc library. Add an 'imply' weak reverse dependency on
CONFIG_INTEL_TCC. This captures the dependency and lets user to unselect
them if they are so inclined. In such case, the bitmask used for the
digital readout is [22:16] as specified in the Intel Software Developer's
manual.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: linux-hwmon@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v6.7+
---
drivers/hwmon/Kconfig | 1 +
drivers/hwmon/coretemp.c | 6 +++++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index 83945397b6eb..11d72b3009bf 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -847,6 +847,7 @@ config SENSORS_I5500
config SENSORS_CORETEMP
tristate "Intel Core/Core2/Atom temperature sensor"
depends on X86
+ imply INTEL_TCC
help
If you say yes here you get support for the temperature
sensor inside your CPU. Most of the family 6 CPUs
diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index 616bd1a5b864..5632e1b1dfb1 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -17,6 +17,7 @@
#include <linux/sysfs.h>
#include <linux/hwmon-sysfs.h>
#include <linux/err.h>
+#include <linux/intel_tcc.h>
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/platform_device.h>
@@ -404,6 +405,8 @@ static ssize_t show_temp(struct device *dev,
tjmax = get_tjmax(tdata, dev);
/* Check whether the time interval has elapsed */
if (time_after(jiffies, tdata->last_updated + HZ)) {
+ u32 mask = intel_tcc_get_temp_mask(is_pkg_temp_data(tdata));
+
rdmsr_on_cpu(tdata->cpu, tdata->status_reg, &eax, &edx);
/*
* Ignore the valid bit. In all observed cases the register
@@ -411,7 +414,7 @@ static ssize_t show_temp(struct device *dev,
* Return it instead of reporting an error which doesn't
* really help at all.
*/
- tdata->temp = tjmax - ((eax >> 16) & 0x7f) * 1000;
+ tdata->temp = tjmax - ((eax >> 16) & mask) * 1000;
tdata->last_updated = jiffies;
}
@@ -838,4 +841,5 @@ module_exit(coretemp_exit)
MODULE_AUTHOR("Rudolf Marek <r.marek@assembler.cz>");
MODULE_DESCRIPTION("Intel Core temperature monitor");
+MODULE_IMPORT_NS(INTEL_TCC);
MODULE_LICENSE("GPL");
--
2.34.1
^ permalink raw reply related
* [PATCH 1/3] thermal: intel: intel_tcc: Add model checks for temperature registers
From: Ricardo Neri @ 2024-04-06 1:04 UTC (permalink / raw)
To: Rafael J. Wysocki, Zhang Rui, Jean Delvare, Guenter Roeck
Cc: Srinivas Pandruvada, Lukasz Luba, Daniel Lezcano, linux-pm,
linux-hwmon, linux-kernel, Ricardo Neri
In-Reply-To: <20240406010416.4821-1-ricardo.neri-calderon@linux.intel.com>
The register MSR_TEMPERATURE_TARGET is not architectural. Its fields may be
defined differently for each processor model. TCC_OFFSET is an example of
such case.
Despite being specified as architectural, the registers IA32_[PACKAGE]_
THERM_STATUS have become model-specific: in recent processors, the
digital temperature readout uses bits [23:16] whereas the Intel Software
Developer's manual specifies bits [22:16].
Create an array of processor models and their bitmasks for TCC_OFFSET and
the digital temperature readout fields. Do not include recent processors.
Instead, use the bitmasks of these recent processors as default.
Use these model-specific bitmasks when reading TCC_OFFSET or the
temperature sensors.
Initialize a model-specific data structure during subsys_initcall() to
have it ready when thermal drivers are loaded.
Expose the new interfaces get_tcc_offset_mask() and
intel_tcc_get_temp_mask(). The hwmon/coretemp and the intel_tcc_cooling
drivers need to use model-specific bitmasks. Include stubs that reflect
minimum support for !CONFIG_INTEL_TCC.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: linux-hwmon@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v6.7+
---
drivers/thermal/intel/intel_tcc.c | 177 +++++++++++++++++++++++++++++-
include/linux/intel_tcc.h | 8 ++
2 files changed, 180 insertions(+), 5 deletions(-)
diff --git a/drivers/thermal/intel/intel_tcc.c b/drivers/thermal/intel/intel_tcc.c
index 5e8b7f34b395..1f595f174ab0 100644
--- a/drivers/thermal/intel/intel_tcc.c
+++ b/drivers/thermal/intel/intel_tcc.c
@@ -6,8 +6,170 @@
#include <linux/errno.h>
#include <linux/intel_tcc.h>
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
#include <asm/msr.h>
+/**
+ * struct temp_masks - Bitmasks for temperature readings
+ * @tcc_offset: TCC offset in MSR_TEMPERATURE_TARGET
+ * @digital_readout: Digital readout in MSR_IA32_THERM_STATUS
+ * @pkg_digital_readout: Digital readout in MSR_IA32_PACKAGE_THERM_STATUS
+ *
+ * Bitmasks to extract the fields of the MSR_TEMPERATURE and IA32_[PACKAGE]_
+ * THERM_STATUS registers for different processor models.
+ *
+ * The bitmask of TjMax is not included in this structure. It is always 0xff.
+ */
+struct temp_masks {
+ u32 tcc_offset;
+ u32 digital_readout;
+ u32 pkg_digital_readout;
+};
+
+#define TCC_FAM6_MODEL_TEMP_MASKS(model, _tcc_offset, _digital_readout, \
+ _pkg_digital_readout) \
+ static const struct temp_masks temp_##model __initconst = { \
+ .tcc_offset = _tcc_offset, \
+ .digital_readout = _digital_readout, \
+ .pkg_digital_readout = _pkg_digital_readout \
+ }
+
+TCC_FAM6_MODEL_TEMP_MASKS(nehalem, 0, 0x7f, 0x7f);
+TCC_FAM6_MODEL_TEMP_MASKS(haswell_x, 0xf, 0x7f, 0x7f);
+TCC_FAM6_MODEL_TEMP_MASKS(broadwell, 0x3f, 0x7f, 0x7f);
+TCC_FAM6_MODEL_TEMP_MASKS(goldmont, 0x7f, 0x7f, 0x7f);
+TCC_FAM6_MODEL_TEMP_MASKS(tigerlake, 0x3f, 0xff, 0xff);
+TCC_FAM6_MODEL_TEMP_MASKS(sapphirerapids, 0x3f, 0x7f, 0xff);
+
+/* Use these masks for processors not included in @tcc_cpu_ids. */
+static struct temp_masks intel_tcc_temp_masks __ro_after_init = {
+ .tcc_offset = 0x7f,
+ .digital_readout = 0xff,
+ .pkg_digital_readout = 0xff,
+};
+
+static const struct x86_cpu_id intel_tcc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(CORE_YONAH, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(CORE2_MEROM, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(CORE2_MEROM_L, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(CORE2_PENRYN, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(CORE2_DUNNINGTON, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(NEHALEM, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(NEHALEM_G, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(NEHALEM_EP, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(NEHALEM_EX, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(WESTMERE, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(WESTMERE_EP, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(WESTMERE_EX, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE_X, &temp_haswell_x),
+ X86_MATCH_INTEL_FAM6_MODEL(HASWELL, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, &temp_haswell_x),
+ X86_MATCH_INTEL_FAM6_MODEL(HASWELL_L, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(HASWELL_G, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(BROADWELL, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_G, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, &temp_haswell_x),
+ X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_D, &temp_haswell_x),
+ X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_L, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &temp_haswell_x),
+ X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE_L, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(CANNONLAKE_L, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_NNPI, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ROCKETLAKE, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, &temp_sapphirerapids),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, &temp_sapphirerapids),
+ X86_MATCH_INTEL_FAM6_MODEL(LAKEFIELD, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_L, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_P, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_S, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_BONNELL, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_BONNELL_MID, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_SALTWELL, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_SALTWELL_MID, &temp_nehalem),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT_D, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_SILVERMONT_MID, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT_MID, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_AIRMONT_NP, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT, &temp_goldmont),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_D, &temp_goldmont),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_PLUS, &temp_goldmont),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_D, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_L, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(ATOM_GRACEMONT, &temp_tigerlake),
+ X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &temp_broadwell),
+ X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &temp_broadwell),
+ {}
+};
+
+static int __init intel_tcc_init(void)
+{
+ const struct x86_cpu_id *id;
+
+ id = x86_match_cpu(intel_tcc_cpu_ids);
+ if (id)
+ memcpy(&intel_tcc_temp_masks, (const void *)id->driver_data,
+ sizeof(intel_tcc_temp_masks));
+
+ return 0;
+}
+/*
+ * Use subsys_initcall to ensure temperature bitmasks are initialized before
+ * the drivers that use this library.
+ */
+subsys_initcall(intel_tcc_init);
+
+/**
+ * get_tcc_offset_mask() - Returns the model-specific bitmask for TCC offset
+ *
+ * Get the model-specific bitmask to extract TCC_OFFSET from the MSR_TEMPERATURE_
+ * TARGET register. If the mask is 0, it means the processor does not support TCC offset.
+ *
+ * Return: The model-specific bitmask for TCC offset.
+ */
+u32 get_tcc_offset_mask(void)
+{
+ return intel_tcc_temp_masks.tcc_offset;
+}
+EXPORT_SYMBOL_NS(get_tcc_offset_mask, INTEL_TCC);
+
+/**
+ * intel_tcc_get_temp_mask() - Returns the model-specific bitmask for temperature
+ *
+ * @pkg: true: Package Thermal Sensor. false: Core Thermal Sensor.
+ *
+ * Get the model-specific bitmask to extract the temperature reading from the
+ * MSR_IA32_[PACKAGE]_THERM_STATUS register.
+ *
+ * Callers must check if the thermal status registers are supported.
+ *
+ * Return: The model-specific bitmask for temperature reading
+ */
+u32 intel_tcc_get_temp_mask(bool pkg)
+{
+ return pkg ? intel_tcc_temp_masks.pkg_digital_readout :
+ intel_tcc_temp_masks.digital_readout;
+}
+EXPORT_SYMBOL_NS(intel_tcc_get_temp_mask, INTEL_TCC);
+
/**
* intel_tcc_get_tjmax() - returns the default TCC activation Temperature
* @cpu: cpu that the MSR should be run on, nagative value means any cpu.
@@ -56,7 +218,7 @@ int intel_tcc_get_offset(int cpu)
if (err)
return err;
- return (low >> 24) & 0x3f;
+ return (low >> 24) & intel_tcc_temp_masks.tcc_offset;
}
EXPORT_SYMBOL_NS_GPL(intel_tcc_get_offset, INTEL_TCC);
@@ -76,7 +238,10 @@ int intel_tcc_set_offset(int cpu, int offset)
u32 low, high;
int err;
- if (offset < 0 || offset > 0x3f)
+ if (!intel_tcc_temp_masks.tcc_offset)
+ return -ENODEV;
+
+ if (offset < 0 || offset > intel_tcc_temp_masks.tcc_offset)
return -EINVAL;
if (cpu < 0)
@@ -90,7 +255,7 @@ int intel_tcc_set_offset(int cpu, int offset)
if (low & BIT(31))
return -EPERM;
- low &= ~(0x3f << 24);
+ low &= ~(intel_tcc_temp_masks.tcc_offset << 24);
low |= offset << 24;
if (cpu < 0)
@@ -113,8 +278,8 @@ EXPORT_SYMBOL_NS_GPL(intel_tcc_set_offset, INTEL_TCC);
*/
int intel_tcc_get_temp(int cpu, int *temp, bool pkg)
{
- u32 low, high;
u32 msr = pkg ? MSR_IA32_PACKAGE_THERM_STATUS : MSR_IA32_THERM_STATUS;
+ u32 low, high, mask;
int tjmax, err;
tjmax = intel_tcc_get_tjmax(cpu);
@@ -132,7 +297,9 @@ int intel_tcc_get_temp(int cpu, int *temp, bool pkg)
if (!(low & BIT(31)))
return -ENODATA;
- *temp = tjmax - ((low >> 16) & 0x7f);
+ mask = intel_tcc_get_temp_mask(pkg);
+
+ *temp = tjmax - ((low >> 16) & mask);
return 0;
}
diff --git a/include/linux/intel_tcc.h b/include/linux/intel_tcc.h
index 8ff8eabb4a98..e281cf06aeab 100644
--- a/include/linux/intel_tcc.h
+++ b/include/linux/intel_tcc.h
@@ -14,5 +14,13 @@ int intel_tcc_get_tjmax(int cpu);
int intel_tcc_get_offset(int cpu);
int intel_tcc_set_offset(int cpu, int offset);
int intel_tcc_get_temp(int cpu, int *temp, bool pkg);
+#ifdef CONFIG_INTEL_TCC
+u32 get_tcc_offset_mask(void);
+u32 intel_tcc_get_temp_mask(bool pkg);
+#else
+static inline u32 get_tcc_offset_mask(void) { return 0; }
+/* Use the architectural bitmask of the temperature readout. No model checks. */
+static inline u32 intel_tcc_get_temp_mask(bool pkg) { return 0x7f; }
+#endif
#endif /* __INTEL_TCC_H__ */
--
2.34.1
^ permalink raw reply related
* [PATCH 2/3] thermal: intel: intel_tcc_cooling: Use a model-specific bitmask for TCC offset
From: Ricardo Neri @ 2024-04-06 1:04 UTC (permalink / raw)
To: Rafael J. Wysocki, Zhang Rui, Jean Delvare, Guenter Roeck
Cc: Srinivas Pandruvada, Lukasz Luba, Daniel Lezcano, linux-pm,
linux-hwmon, linux-kernel, Ricardo Neri
In-Reply-To: <20240406010416.4821-1-ricardo.neri-calderon@linux.intel.com>
The TCC offset field in the register MSR_TEMPERATURE_TARGET is not
architectural. The TCC library provides a model-specific bitmask. Use it to
determine the maximum TCC offset.
Suggested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: linux-hwmon@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v6.7+
---
drivers/thermal/intel/intel_tcc_cooling.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/thermal/intel/intel_tcc_cooling.c b/drivers/thermal/intel/intel_tcc_cooling.c
index 6c392147e6d1..308946853cdd 100644
--- a/drivers/thermal/intel/intel_tcc_cooling.c
+++ b/drivers/thermal/intel/intel_tcc_cooling.c
@@ -20,7 +20,7 @@ static struct thermal_cooling_device *tcc_cdev;
static int tcc_get_max_state(struct thermal_cooling_device *cdev, unsigned long
*state)
{
- *state = 0x3f;
+ *state = get_tcc_offset_mask();
return 0;
}
--
2.34.1
^ permalink raw reply related
* [PATCH 0/3] drivers: thermal/hwmon: intel: Use model-specific bitmasks for temperature registers
From: Ricardo Neri @ 2024-04-06 1:04 UTC (permalink / raw)
To: Rafael J. Wysocki, Zhang Rui, Jean Delvare, Guenter Roeck
Cc: Srinivas Pandruvada, Lukasz Luba, Daniel Lezcano, linux-pm,
linux-hwmon, linux-kernel, Ricardo Neri
Hi,
We have been treating the register MSR_TEMPERATURE_TARGET as a
architectural. It is model-specific.
The registers IA32_[PACKAGE]_THERM_STATUS are architectural. However, they
have become model-specific: in recent processors the temperature readout
occupies bits [23:16] whereas the Intel Software Developer's manual
specifies that it uses [22:16].
Using incorrect bitmasks leads to incorrect temperature readings and
writes to reserved register bits. For instance, temperatures below ~-27C
(depending on the value of TjMax) would be read incorrectly if only the
bits [22:16] of IA32_THERM_[PACKAGE]_STATUS are used.
Update the intel_tcc library to use model-specific bitmasks. Also update
the hwmon/coretemp and intel_tcc_cooling drivers drivers to use the model
checking utilities of intel_tcc.
Updating hwmon/coretemp to use the intel_tcc library required to add a
weak reverse dependency on CONFIG_INTEL_TCC. The less attractive
alternative would be to duplicate the model checking functionality of
intel_tcc in hwmon/coretemp.
I have tested these patches on Alder Lake, Meteor Lake, and Grand Ridge
systems by looking at the temp*_input sysfs files of hwmon.
These patches apply cleanly on top of the `testing` branches of the
linux-pm and hwmon repositories.
Thanks and BR,
Ricardo
Ricardo Neri (3):
thermal: intel: intel_tcc: Add model checks for temperature registers
thermal: intel: intel_tcc_cooling: Use a model-specific bitmask for
TCC offset
hwmon: (coretemp) Use a model-specific bitmask to read registers
drivers/hwmon/Kconfig | 1 +
drivers/hwmon/coretemp.c | 6 +-
drivers/thermal/intel/intel_tcc.c | 177 +++++++++++++++++++++-
drivers/thermal/intel/intel_tcc_cooling.c | 2 +-
include/linux/intel_tcc.h | 8 +
5 files changed, 187 insertions(+), 7 deletions(-)
--
2.34.1
^ permalink raw reply
* Re: [PATCH v4 7/8] cpuidle/poll_state: replace cpu_relax with smp_cond_load_relaxed
From: Ankur Arora @ 2024-04-05 23:14 UTC (permalink / raw)
To: Okanovic, Haris
Cc: mihai.carabas@oracle.com, kvm@vger.kernel.org,
joao.m.martins@oracle.com, dianders@chromium.org,
ankur.a.arora@oracle.com, mic@digikod.net, pmladek@suse.com,
wanpengli@tencent.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
daniel.lezcano@linaro.org, arnd@arndb.de, will@kernel.org,
hpa@zytor.com, peterz@infradead.org, npiggin@gmail.com,
vkuznets@redhat.com, bp@alien8.de, linux-pm@vger.kernel.org,
rafael@kernel.org, rick.p.edgecombe@intel.com,
juerg.haefliger@canonical.com, x86@kernel.org,
linux-arm-kernel@lists.infradead.org
In-Reply-To: <7f3e540ad30f40ae51f1abda24b1bea2c8b648ea.camel@amazon.com>
Okanovic, Haris <harisokn@amazon.com> writes:
> On Thu, 2024-02-15 at 09:41 +0200, Mihai Carabas wrote:
>> cpu_relax on ARM64 does a simple "yield". Thus we replace it with
>> smp_cond_load_relaxed which basically does a "wfe".
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
>> ---
>> drivers/cpuidle/poll_state.c | 15 ++++++++++-----
>> 1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> index 9b6d90a72601..1e45be906e72 100644
>> --- a/drivers/cpuidle/poll_state.c
>> +++ b/drivers/cpuidle/poll_state.c
>> @@ -13,6 +13,7 @@
>> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index)
>> {
>> + unsigned long ret;
>> u64 time_start;
>>
>> time_start = local_clock_noinstr();
>> @@ -26,12 +27,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>>
>> limit = cpuidle_poll_time(drv, dev);
>>
>> - while (!need_resched()) {
>> - cpu_relax();
>> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> - continue;
>> -
>> + for (;;) {
>> loop_count = 0;
>> +
>> + ret = smp_cond_load_relaxed(¤t_thread_info()->flags,
>> + VAL & _TIF_NEED_RESCHED ||
>> + loop_count++ >= POLL_IDLE_RELAX_COUNT);
>
> Is it necessary to repeat this 200 times with a wfe poll?
The POLL_IDLE_RELAX_COUNT is there because on x86 each cpu_relax()
iteration is much shorter.
With WFE, it makes less sense.
> Does kvm not implement a timeout period?
Not yet, but it does become more useful after a WFE haltpoll is
available on ARM64.
Haltpoll does have a timeout, which you should be able to tune via
/sys/module/haltpoll/parameters/ but that, of course, won't help here.
> Could you make it configurable? This patch improves certain workloads
> on AWS Graviton instances as well, but blocks up to 6ms in 200 * 30us
> increments before going to wfi, which is a bit excessive.
Yeah, this looks like a problem. We could solve it by making it an
architectural parameter. Though I worry about ARM platforms with
much smaller default timeouts.
The other possibility is using WFET in the primitive, but then we
have that dependency and that's a bigger change.
Will address this in the next version.
Thanks for pointing this out.
--
ankur
^ permalink raw reply
* Re: [PATCH v4 7/8] cpuidle/poll_state: replace cpu_relax with smp_cond_load_relaxed
From: Okanovic, Haris @ 2024-04-05 21:51 UTC (permalink / raw)
To: mihai.carabas@oracle.com
Cc: kvm@vger.kernel.org, joao.m.martins@oracle.com,
dianders@chromium.org, ankur.a.arora@oracle.com, mic@digikod.net,
pmladek@suse.com, wanpengli@tencent.com,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
catalin.marinas@arm.com, mingo@redhat.com, pbonzini@redhat.com,
tglx@linutronix.de, daniel.lezcano@linaro.org, arnd@arndb.de,
will@kernel.org, hpa@zytor.com, peterz@infradead.org,
npiggin@gmail.com, vkuznets@redhat.com, bp@alien8.de,
linux-pm@vger.kernel.org, rafael@kernel.org,
rick.p.edgecombe@intel.com, juerg.haefliger@canonical.com,
x86@kernel.org, linux-arm-kernel@lists.infradead.org
In-Reply-To: <1707982910-27680-8-git-send-email-mihai.carabas@oracle.com>
On Thu, 2024-02-15 at 09:41 +0200, Mihai Carabas wrote:
> cpu_relax on ARM64 does a simple "yield". Thus we replace it with
> smp_cond_load_relaxed which basically does a "wfe".
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
> ---
> drivers/cpuidle/poll_state.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..1e45be906e72 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -13,6 +13,7 @@
> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> + unsigned long ret;
> u64 time_start;
>
> time_start = local_clock_noinstr();
> @@ -26,12 +27,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>
> limit = cpuidle_poll_time(drv, dev);
>
> - while (!need_resched()) {
> - cpu_relax();
> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> - continue;
> -
> + for (;;) {
> loop_count = 0;
> +
> + ret = smp_cond_load_relaxed(¤t_thread_info()->flags,
> + VAL & _TIF_NEED_RESCHED ||
> + loop_count++ >= POLL_IDLE_RELAX_COUNT);
Is it necessary to repeat this 200 times with a wfe poll? Does kvm not
implement a timeout period?
Could you make it configurable? This patch improves certain workloads
on AWS Graviton instances as well, but blocks up to 6ms in 200 * 30us
increments before going to wfi, which is a bit excessive.
> +
> + if (!(ret & _TIF_NEED_RESCHED))
> + break;
> +
> if (local_clock_noinstr() - time_start > limit) {
> dev->poll_time_limit = true;
> break;
^ permalink raw reply
* Re: [PATCH v5 3/5] clk: qcom: common: Add interconnect clocks support
From: Stephen Boyd @ 2024-04-05 21:25 UTC (permalink / raw)
To: Varadarajan Narayanan
Cc: andersson, conor+dt, devicetree, djakov, dmitry.baryshkov,
konrad.dybcio, krzysztof.kozlowski+dt, linux-arm-msm, linux-clk,
linux-kernel, linux-pm, mturquette, quic_anusha, robh
In-Reply-To: <ZgaceJT2FMsQVoPa@hu-varada-blr.qualcomm.com>
Quoting Varadarajan Narayanan (2024-03-29 03:48:24)
> On Thu, Mar 28, 2024 at 02:54:52PM -0700, Stephen Boyd wrote:
> > Quoting Varadarajan Narayanan (2024-03-28 00:59:34)
> > > diff --git a/drivers/clk/qcom/common.c b/drivers/clk/qcom/common.c
> > > index 75f09e6e057e..9fa271812373 100644
> > > --- a/drivers/clk/qcom/common.c
> > > +++ b/drivers/clk/qcom/common.c
> > > @@ -234,6 +236,41 @@ static struct clk_hw *qcom_cc_clk_hw_get(struct of_phandle_args *clkspec,
> > > return cc->rclks[idx] ? &cc->rclks[idx]->hw : NULL;
> > > }
> > >
> > > +#if IS_ENABLED(CONFIG_INTERCONNECT_CLK)
> > > +static int qcom_cc_icc_register(struct device *dev,
> > > + const struct qcom_cc_desc *desc)
> > > +{
> > > + struct icc_clk_data *icd;
> > > + int i;
> > > +
> > > + if (!desc->icc_hws)
> > > + return 0;
> > > +
> > > + icd = devm_kcalloc(dev, desc->num_icc_hws, sizeof(*icd), GFP_KERNEL);
> > > + if (!icd)
> > > + return -ENOMEM;
> > > +
> > > + for (i = 0; i < desc->num_icc_hws; i++) {
> > > + icd[i].clk = devm_clk_hw_get_clk(dev, desc->icc_hws[i], "qcom");
> >
> > Make the con_id "icc" instead please, so we know the consumer is
> > icc_clk.
>
> Ok.
>
> > Even better would be for the icc_clk device itself to be the
> > one requesting with devm_clk_hw_get_clk() so that we associate the clk
> > handle with the consumer device. It would also help us make it so that
> > drivers defer probe until their clk isn't an orphan.
>
> Not sure if I understand the comments correctly.
>
> In one of the previous patches, had
> icd[i].clk = clks[noc_clks[i]]->hw.clk;
>
> This was said to be error prone since the clock would not be
> ref counted. Hence used devm_clk_hw_get_clk before doing
> icc_clk_register.
>
> Now, are you suggesting to use the direct clock pointer
> and do a devm_clk_hw_get_clk from the consumer driver?
> This will take care of the refcounting. However, we will
> have to add these clock entries to the consumer DT node.
> Is this ok?
Why do they need to be added to the consumer DT node? Why can't the
icc_clk device driver (icc_clk_driver?) use struct clk_hw instead of
struct clk in struct icc_clk_data? The answer cannot be that the icc_clk
driver cannot be changed.
> > > diff --git a/drivers/clk/qcom/common.h b/drivers/clk/qcom/common.h
> > > index 9c8f7b798d9f..d8ac26d83f3c 100644
> > > --- a/drivers/clk/qcom/common.h
> > > +++ b/drivers/clk/qcom/common.h
> > > @@ -29,6 +29,9 @@ struct qcom_cc_desc {
> > > size_t num_gdscs;
> > > struct clk_hw **clk_hws;
> > > size_t num_clk_hws;
> > > + struct clk_hw **icc_hws;
> > > + size_t num_icc_hws;
> > > + unsigned int first_id;
> >
> > 'first_id' is gross.
>
> will change it to 'icc_id'.
That's not what I meant :) The whole concept of having to pick some
random number is bad. At the least, hide that in the icc_clk driver so
that we don't have to put this in every clk provider that is also an
interconnect provider.
^ permalink raw reply
* Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
From: Ankur Arora @ 2024-04-05 20:22 UTC (permalink / raw)
To: Okanovic, Haris
Cc: ankur.a.arora@oracle.com, linux-assembly@vger.kernel.org,
peterz@infradead.org, mark.rutland@arm.com,
linux-kernel@vger.kernel.org, Blake, Geoff, Silver, Brian,
linux-pm@vger.kernel.org, Saidi, Ali
In-Reply-To: <be986f715a9351fdecebd76c76626d6333b5a9f3.camel@amazon.com>
Okanovic, Haris <harisokn@amazon.com> writes:
> On Tue, 2024-04-02 at 16:17 -0700, Ankur Arora wrote:
>> CAUTION: This email originated from outside of the organization. Do
>> not click links or open attachments unless you can confirm the sender
>> and know the content is safe.
>>
>>
>>
>> Mark Rutland <mark.rutland@arm.com> writes:
>>
>> > On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
>> > > An arm64 cpuidle driver with two states: (1) First polls for new
>> > > runable
>> > > tasks up to 100 us (by default) before (2) a wfi idle and awoken
>> > > by
>> > > interrupt (the current arm64 behavior). It allows CPUs to return
>> > > from
>> > > idle more quickly by avoiding the longer interrupt wakeup path,
>> > > which
>> > > may require EL1/EL2 transition in certain VM scenarios.
>> >
>> > Please start off with an explanation of the problem you're trying
>> > to solve
>> > (which IIUC is to wake up more quickly in certain cases), before
>> > describing the
>> > solution. That makes it *significantly* easier for people to review
>> > this, since
>> > once you have the problem statement in mind it's much easier to
>> > understand how
>> > the solution space follows from that.
>> >
>> > > Poll duration is optionally configured at load time via the
>> > > poll_limit
>> > > module parameter.
>> >
>> > Why should this be a configurable parameter?
>> >
>> > (note, at this point you haven't introduced any of the data below,
>> > so the
>> > trade-off isn't clear to anyone).
>> >
>> > > The default 100 us duration was experimentally chosen, by
>> > > measuring QPS
>> > > (queries per sec) of the MLPerf bert inference benchmark, which
>> > > seems
>> > > particularly susceptible to this change; see procedure below. 100
>> > > us is
>> > > the inflection point where QPS stopped growing in a range of
>> > > tested
>> > > values. All results are from AWS m7g.16xlarge instances
>> > > (Graviton3 SoC)
>> > > with dedicated tenancy (dedicated hardware).
>> > >
>> > > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us |
>> > > > 300us |
>> > > > 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 |
>> > > > 6.06 |
>> > >
>> > > Perf's scheduler benchmarks also improve with a range of
>> > > poll_limit
>> > > values >= 10 us. Higher limits produce near identical results
>> > > within a
>> > > 3% noise margin. The following tables are `perf bench sched`
>> > > results,
>> > > run times in seconds.
>> > >
>> > > `perf bench sched messaging -l 80000`
>> > > > AWS instance | SoC | Before | After | % Change |
>> > > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none |
>> > > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none |
>> > > > c6g.metal | Graviton2 | 17.621 | 16.744 | none |
>> > > > c7g.metal | Graviton3 | 13.430 | 13.404 | none |
>> > >
>> > > `perf bench sched pipe -l 2500000`
>> > > > AWS instance | SoC | Before | After | % Change |
>> > > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% |
>> > > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% |
>> > > > c6g.metal | Graviton2 | 17.609 | 15.170 | -14% |
>> > > > c7g.metal | Graviton3 | 14.103 | 12.304 | -13% |
>> > >
>> > > `perf bench sched seccomp-notify -l 2500000`
>> > > > AWS instance | SoC | Before | After | % Change |
>> > > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% |
>> > > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% |
>> > > > c6g.metal | Graviton2 | 15.717 | 13.536 | -14% |
>> > > > c7g.metal | Graviton3 | 13.301 | 11.491 | -14% |
>> >
>> > Ok, so perf numbers for a busy workload go up.
>> >
>> > What happens for idle state residency on a mostly idle system?
>> >
>> > > Steps to run MLPerf bert inference on Ubuntu 22.04:
>> > > sudo apt install build-essential python3 python3-pip
>> > > pip install "pybind11[global]" tensorflow transformers
>> > > export TF_ENABLE_ONEDNN_OPTS=1
>> > > export DNNL_DEFAULT_FPMATH_MODE=BF16
>> > > git clone https://github.com/mlcommons/inference.git --recursive
>> > > cd inference
>> > > git checkout v2.0
>> > > cd loadgen
>> > > CFLAGS="-std=c++14" python3 setup.py bdist_wheel
>> > > pip install dist/*.whl
>> > > cd ../language/bert
>> > > make setup
>> > > python3 run.py --backend=tf --scenario=SingleStream
>> > >
>> > > Suggested-by: Ali Saidi <alisaidi@amazon.com>
>> > > Reviewed-by: Ali Saidi <alisaidi@amazon.com>
>> > > Reviewed-by: Geoff Blake <blakgeof@amazon.com>
>> > > Cc: Brian Silver <silverbr@amazon.com>
>> > > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
>> > > ---
>> > > drivers/cpuidle/Kconfig.arm | 13 ++
>> > > drivers/cpuidle/Makefile | 1 +
>> > > drivers/cpuidle/cpuidle-arm-polling.c | 171
>> > > ++++++++++++++++++++++++++
>> > > 3 files changed, 185 insertions(+)
>> > > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
>> > >
>> > > diff --git a/drivers/cpuidle/Kconfig.arm
>> > > b/drivers/cpuidle/Kconfig.arm
>> > > index a1ee475d180d..484666dda38d 100644
>> > > --- a/drivers/cpuidle/Kconfig.arm
>> > > +++ b/drivers/cpuidle/Kconfig.arm
>> > > @@ -14,6 +14,19 @@ config ARM_CPUIDLE
>> > > initialized by calling the CPU operations init idle hook
>> > > provided by architecture code.
>> > >
>> > > +config ARM_POLL_CPUIDLE
>> > > + bool "ARM64 CPU idle Driver with polling"
>> > > + depends on ARM64
>> > > + depends on ARM_ARCH_TIMER_EVTSTREAM
>> > > + select CPU_IDLE_MULTIPLE_DRIVERS
>> > > + help
>> > > + Select this to enable a polling cpuidle driver for ARM64:
>> > > + The first state polls TIF_NEED_RESCHED for best latency on
>> > > short
>> > > + sleep intervals. The second state falls back to
>> > > arch_cpu_idle() to
>> > > + wait for interrupt. This is can be helpful in workloads
>> > > that
>> > > + frequently block/wake at short intervals or VMs where
>> > > wakeup IPIs
>> > > + are more expensive.
>> >
>> > Why is this a separate driver rather than an optional feature in
>> > the existing
>> > driver?
>> >
>> > The fact that this duplicates a bunch of code indicates to me that
>> > this should
>> > not be a separate driver.
>>
>> Also, the cpuidle-haltpoll driver is meant to do something quite
>> similar.
>> That driver polls adaptively based on the haltpoll governor's tuning
>> of
>> the polling period.
>>
>> However, cpuidle-haltpoll is currently x86 only. Mihai (also from
>> Oracle)
>> posted patches [1] adding support for ARM64.
>>
>> Haris, could you take a look at it and see if it does what you are
>> looking for? The polling path in the linked version also uses
>> smp_cond_load_relaxed() so even the mechanisms for both of these
>> are fairly similar.
>
> Hi Ankur,
>
> I agree, except for that small bug in exit condition, your haltpoll
> changes fundamentally do the same thing:
Yup. Will address that bug and a few other things in the next version.
>> @ int __cpuidle poll_idle(...
>> - if (!(ret & _TIF_NEED_RESCHED))
>> + if (ret & _TIF_NEED_RESCHE
>
> I'll follow up with another patch for AWS Graviton when your team is
> finished.
>
> Do you have a rough ETA of when your changes will land in master?
That I guess would be determined by the maintainers, but I should be
able to send it out the coming week.
Thanks
Ankur
>>
>> (I'll be sending out the next version shortly. Happy to Cc you if you
>> would like to try that out.)
>
> Yes, please do!
>
> Thanks,
> Haris Okanovic
>
>>
>> Thanks
>> Ankur
>>
>> [1]
>> https://lore.kernel.org/lkml/1707982910-27680-1-git-send-email-mihai.carabas@oracle.com/
>>
>> >
>> > > +
>> > > config ARM_PSCI_CPUIDLE
>> > > bool "PSCI CPU idle Driver"
>> > > depends on ARM_PSCI_FW
>> > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
>> > > index d103342b7cfc..23c21422792d 100644
>> > > --- a/drivers/cpuidle/Makefile
>> > > +++ b/drivers/cpuidle/Makefile
>> > > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) +=
>> > > cpuidle-ux500.o
>> > > obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
>> > > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
>> > > obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
>> > > +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-
>> > > polling.o
>> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
>> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-
>> > > domain.o
>> > > obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
>> > > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c
>> > > b/drivers/cpuidle/cpuidle-arm-polling.c
>> > > new file mode 100644
>> > > index 000000000000..bca128568114
>> > > --- /dev/null
>> > > +++ b/drivers/cpuidle/cpuidle-arm-polling.c
>> > > @@ -0,0 +1,171 @@
>> > > +// SPDX-License-Identifier: GPL-2.0
>> > > +/*
>> > > + * ARM64 CPU idle driver using wfe polling
>> > > + *
>> > > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights
>> > > reserved.
>> > > + *
>> > > + * Authors:
>> > > + * Haris Okanovic <harisokn@amazon.com>
>> > > + * Brian Silver <silverbr@amazon.com>
>> > > + *
>> > > + * Based on cpuidle-arm.c
>> > > + * Copyright (C) 2014 ARM Ltd.
>> > > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
>> > > + */
>> > > +
>> > > +#include <linux/cpu.h>
>> > > +#include <linux/cpu_cooling.h>
>> > > +#include <linux/cpuidle.h>
>> > > +#include <linux/sched/clock.h>
>> > > +
>> > > +#include <asm/cpuidle.h>
>> > > +#include <asm/readex.h>
>> > > +
>> > > +#include "dt_idle_states.h"
>> > > +
>> > > +/* Max duration of the wfe() poll loop in us, before
>> > > transitioning to
>> > > + * arch_cpu_idle()/wfi() sleep.
>> > > + */
>> >
>> > /*
>> > * Comments should have the leading '/*' on a separate line.
>> > * See
>> > https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
>> > */
>> >
>> > > +#define DEFAULT_POLL_LIMIT_US 100
>> > > +static unsigned int poll_limit __read_mostly =
>> > > DEFAULT_POLL_LIMIT_US;
>> > > +
>> > > +/*
>> > > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule
>> > > is
>> > > + * needed or timeout
>> > > + */
>> > > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device
>> > > *dev,
>> > > + struct cpuidle_driver *drv, int idx)
>> > > +{
>> > > + u64 time_start, time_limit;
>> > > +
>> > > + time_start = local_clock();
>> > > + dev->poll_time_limit = false;
>> > > +
>> > > + local_irq_enable();
>> >
>> > Why enable IRQs here? We don't do that in the regular cpuidle-arm
>> > driver, nor
>> > the cpuidle-psci driver, and there's no explanation here or in the
>> > commit message.
>> >
>> > How does this interact with RCU? Is that still watching or are we
>> > in an
>> > extended quiescent state? For PSCI idle states we enter an EQS, and
>> > it seems
>> > like we probably should here...
>> >
>> > > +
>> > > + if (current_set_polling_and_test())
>> > > + goto end;
>> > > +
>> > > + time_limit = cpuidle_poll_time(drv, dev);
>> > > +
>> > > + do {
>> > > + // exclusive read arms the monitor for wfe
>> > > + if (__READ_ONCE_EX(current_thread_info()->flags) &
>> > > _TIF_NEED_RESCHED)
>> > > + goto end;
>> > > +
>> > > + // may exit prematurely, see
>> > > ARM_ARCH_TIMER_EVTSTREAM
>> > > + wfe();
>> > > + } while (local_clock() - time_start < time_limit);
>> >
>> > .. and if the EVTSTREAM is disabled, we'll sit in WFE forever
>> > rather than
>> > entering a deeper idle state, which doesn't seem desirable.
>> >
>> > It's worth noting that now that we have WFET, we'll probably want
>> > to disable
>> > the EVTSTREAM by default at some point, at least in some
>> > configurations, since
>> > that'll be able to sit in a WFE state for longer while also
>> > reliably waking up
>> > when required.
>> >
>> > I suspect we want something like an smp_load_acquire_timeout() here
>> > to do the
>> > wait in arch code (allowing us to use WFET), and enabling this
>> > state will
>> > depend on either having WFET or EVTSTREAM.
>> >
>> > > +
>> > > + dev->poll_time_limit = true;
>> > > +
>> > > +end:
>> > > + current_clr_polling();
>> > > + return idx;
>> > > +}
>> > > +
>> > > +/*
>> > > + * arm_idle_wfi - Places cpu in lower power state until
>> > > interrupt,
>> > > + * a fallback to polling
>> > > + */
>> > > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
>> > > + struct cpuidle_driver *drv, int idx)
>> > > +{
>> > > + if (current_clr_polling_and_test()) {
>> > > + local_irq_enable();
>> > > + return idx;
>> > > + }
>> >
>> > Same as above, why enable IRQs here?
>> >
>> > > + arch_cpu_idle();
>> > > + return idx;
>> >
>> > .. and if we need to enable IRQs in the other cases above, why do
>> > we *not*
>> > need to enable them here?
>> >
>> > > +}
>> > > +
>> > > +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
>> > > + .name = "arm_poll_idle",
>> > > + .owner = THIS_MODULE,
>> > > + .states = {
>> > > + {
>> > > + .enter = arm_idle_wfe_poll,
>> > > + .exit_latency = 0,
>> > > + .target_residency = 0,
>> > > + .exit_latency_ns = 0,
>> > > + .power_usage = UINT_MAX,
>> > > + .flags =
>> > > CPUIDLE_FLAG_POLLING,
>> > > + .name = "WFE",
>> > > + .desc = "ARM WFE",
>> > > + },
>> > > + {
>> > > + .enter = arm_idle_wfi,
>> > > + .exit_latency =
>> > > DEFAULT_POLL_LIMIT_US,
>> > > + .target_residency =
>> > > DEFAULT_POLL_LIMIT_US,
>> > > + .power_usage = UINT_MAX,
>> > > + .name = "WFI",
>> > > + .desc = "ARM WFI",
>> > > + },
>> > > + },
>> > > + .state_count = 2,
>> > > +};
>> >
>> > How does this interact with the existing driver?
>> >
>> > How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
>> >
>> > > +
>> > > +/*
>> > > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver
>> > > for one cpu
>> > > + */
>> > > +static int __init arm_poll_init_cpu(int cpu)
>> > > +{
>> > > + int ret;
>> > > + struct cpuidle_driver *drv;
>> > > +
>> > > + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv),
>> > > GFP_KERNEL);
>> > > + if (!drv)
>> > > + return -ENOMEM;
>> > > +
>> > > + drv->cpumask = (struct cpumask *)cpumask_of(cpu);
>> > > + drv->states[1].exit_latency = poll_limit;
>> > > + drv->states[1].target_residency = poll_limit;
>> > > +
>> > > + ret = cpuidle_register(drv, NULL);
>> > > + if (ret) {
>> > > + pr_err("failed to register driver: %d, cpu %d\n",
>> > > ret, cpu);
>> > > + goto out_kfree_drv;
>> > > + }
>> > > +
>> > > + pr_info("registered driver cpu %d\n", cpu);
>> >
>> > This does not need to be printed for each CPU.
>> >
>> > Mark.
>> >
>> > > +
>> > > + cpuidle_cooling_register(drv);
>> > > +
>> > > + return 0;
>> > > +
>> > > +out_kfree_drv:
>> > > + kfree(drv);
>> > > + return ret;
>> > > +}
>> > > +
>> > > +/*
>> > > + * arm_poll_init - Initializes arm cpuidle polling driver
>> > > + */
>> > > +static int __init arm_poll_init(void)
>> > > +{
>> > > + int cpu, ret;
>> > > + struct cpuidle_driver *drv;
>> > > + struct cpuidle_device *dev;
>> > > +
>> > > + for_each_possible_cpu(cpu) {
>> > > + ret = arm_poll_init_cpu(cpu);
>> > > + if (ret)
>> > > + goto out_fail;
>> > > + }
>> > > +
>> > > + return 0;
>> > > +
>> > > +out_fail:
>> > > + pr_info("de-register all");
>> > > + while (--cpu >= 0) {
>> > > + dev = per_cpu(cpuidle_devices, cpu);
>> > > + drv = cpuidle_get_cpu_driver(dev);
>> > > + cpuidle_unregister(drv);
>> > > + kfree(drv);
>> > > + }
>> > > +
>> > > + return ret;
>> > > +}
>> > > +
>> > > +module_param(poll_limit, uint, 0444);
>> > > +device_initcall(arm_poll_init);
>> > > --
>> > > 2.34.1
>> > >
>> > >
>>
>>
>> --
>> ankur
--
ankur
^ permalink raw reply
* Re: [GIT PULL] Power management fixes for v6.9-rc3
From: pr-tracker-bot @ 2024-04-05 20:10 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linus Torvalds, Linux PM, Linux Kernel Mailing List
In-Reply-To: <CAJZ5v0gVqOUuFfJ8um+_F_ubU4QsWYneyQsFBHu9rG10-rJYEg@mail.gmail.com>
The pull request you sent on Fri, 5 Apr 2024 20:42:04 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git pm-6.9-rc3
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/2f9fd9e439706c615b77c23d70184ddefa7fb9e0
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [GIT PULL] Thermal control fixes for v6.9-rc3
From: pr-tracker-bot @ 2024-04-05 20:10 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Linux PM, ACPI Devel Maling List,
Linux Kernel Mailing List, Daniel Lezcano
In-Reply-To: <CAJZ5v0hHLvQWGDtwk=yXv_bJxZH2KTW2SWSUopdRDHHQp0sqow@mail.gmail.com>
The pull request you sent on Fri, 5 Apr 2024 20:41:04 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git thermal-6.9-rc3
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b21defcb52c67c5580ddba8b9823820bccccf97e
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
From: Okanovic, Haris @ 2024-04-05 20:05 UTC (permalink / raw)
To: mark.rutland@arm.com
Cc: Silver, Brian, peterz@infradead.org, linux-pm@vger.kernel.org,
Saidi, Ali, linux-kernel@vger.kernel.org,
linux-assembly@vger.kernel.org, Blake, Geoff
In-Reply-To: <Zgw--fHBH9kEQsi0@FVFF77S0Q05N>
On Tue, 2024-04-02 at 18:23 +0100, Mark Rutland wrote:
> CAUTION: This email originated from outside of the organization. Do
> not click links or open attachments unless you can confirm the sender
> and know the content is safe.
>
>
>
> On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
> > An arm64 cpuidle driver with two states: (1) First polls for new
> > runable
> > tasks up to 100 us (by default) before (2) a wfi idle and awoken by
> > interrupt (the current arm64 behavior). It allows CPUs to return
> > from
> > idle more quickly by avoiding the longer interrupt wakeup path,
> > which
> > may require EL1/EL2 transition in certain VM scenarios.
>
> Please start off with an explanation of the problem you're trying to
> solve
> (which IIUC is to wake up more quickly in certain cases), before
> describing the
> solution. That makes it *significantly* easier for people to review
> this, since
> once you have the problem statement in mind it's much easier to
> understand how
> the solution space follows from that.
Hi Mark,
I'll try to clarify:
This change improves performance of certain multithreaded workloads
that rapidly cycles through blocked/running states, by avoiding a
relatively costly IPI wakeup path.
One example is Tensorflow as demonstated by the benchmark shared
earlier. It generates 250k reschedule IPIs per sec on AWS m7g.16xlarge
(Graviton3 SoC) in Linux 6.9.0-rc2, which drops to 120k/sec with this
change and a ~3% gain in throughput.
Some Java workloads have a similar pattern. For example, Renissance
Reactors benchmark (Reactors.IO framework app) drops from 370k to 80k
IPIs and sees a ~25% latency drop.
Renissance Reactors repro steps on Ubuntu 22.04 on AWS:
sudo apt-get install java-common
wget
https://corretto.aws/downloads/latest/amazon-corretto-17-aarch64-linux-jdk.deb
sudo dpkg -i amazon-corretto-17-aarch64-linux-jdk.deb
wget
https://github.com/renaissance-benchmarks/renaissance/releases/download/v0.15.0/renaissance-gpl-0.15.0.jar
java -jar renaissance-gpl-0.15.0.jar reactors —repetitions 1
>
> > Poll duration is optionally configured at load time via the
> > poll_limit
> > module parameter.
>
> Why should this be a configurable parameter?
>
> (note, at this point you haven't introduced any of the data below, so
> the
> trade-off isn't clear to anyone).
Some applications could improve with higher poll periods. For example,
Tensorflow throughput improves ~3% at a 100us+ poll limit and ~1% at
10us limit. A configurable knob allows users to experiment more easily.
There’s also no performance penatly for having the option. New values
are patched during load and all other code paths remain the same.
>
> > The default 100 us duration was experimentally chosen, by measuring
> > QPS
> > (queries per sec) of the MLPerf bert inference benchmark, which
> > seems
> > particularly susceptible to this change; see procedure below. 100
> > us is
> > the inflection point where QPS stopped growing in a range of tested
> > values. All results are from AWS m7g.16xlarge instances (Graviton3
> > SoC)
> > with dedicated tenancy (dedicated hardware).
> >
> > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us |
> > > 300us |
> > > 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 |
> > > 6.06 |
> >
> > Perf's scheduler benchmarks also improve with a range of poll_limit
> > values >= 10 us. Higher limits produce near identical results
> > within a
> > 3% noise margin. The following tables are `perf bench sched`
> > results,
> > run times in seconds.
> >
> > `perf bench sched messaging -l 80000`
> > > AWS instance | SoC | Before | After | % Change |
> > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none |
> > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none |
> > > c6g.metal | Graviton2 | 17.621 | 16.744 | none |
> > > c7g.metal | Graviton3 | 13.430 | 13.404 | none |
> >
> > `perf bench sched pipe -l 2500000`
> > > AWS instance | SoC | Before | After | % Change |
> > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% |
> > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% |
> > > c6g.metal | Graviton2 | 17.609 | 15.170 | -14% |
> > > c7g.metal | Graviton3 | 14.103 | 12.304 | -13% |
> >
> > `perf bench sched seccomp-notify -l 2500000`
> > > AWS instance | SoC | Before | After | % Change |
> > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% |
> > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% |
> > > c6g.metal | Graviton2 | 15.717 | 13.536 | -14% |
> > > c7g.metal | Graviton3 | 13.301 | 11.491 | -14% |
>
> Ok, so perf numbers for a busy workload go up.
>
> What happens for idle state residency on a mostly idle system?
It decreases by about 0.005% on AWS m7g.16xlarge (Graviton3 SoC).
>
> > Steps to run MLPerf bert inference on Ubuntu 22.04:
> > sudo apt install build-essential python3 python3-pip
> > pip install "pybind11[global]" tensorflow transformers
> > export TF_ENABLE_ONEDNN_OPTS=1
> > export DNNL_DEFAULT_FPMATH_MODE=BF16
> > git clone https://github.com/mlcommons/inference.git --recursive
> > cd inference
> > git checkout v2.0
> > cd loadgen
> > CFLAGS="-std=c++14" python3 setup.py bdist_wheel
> > pip install dist/*.whl
> > cd ../language/bert
> > make setup
> > python3 run.py --backend=tf --scenario=SingleStream
> >
> > Suggested-by: Ali Saidi <alisaidi@amazon.com>
> > Reviewed-by: Ali Saidi <alisaidi@amazon.com>
> > Reviewed-by: Geoff Blake <blakgeof@amazon.com>
> > Cc: Brian Silver <silverbr@amazon.com>
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> > drivers/cpuidle/Kconfig.arm | 13 ++
> > drivers/cpuidle/Makefile | 1 +
> > drivers/cpuidle/cpuidle-arm-polling.c | 171
> > ++++++++++++++++++++++++++
> > 3 files changed, 185 insertions(+)
> > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
> >
> > diff --git a/drivers/cpuidle/Kconfig.arm
> > b/drivers/cpuidle/Kconfig.arm
> > index a1ee475d180d..484666dda38d 100644
> > --- a/drivers/cpuidle/Kconfig.arm
> > +++ b/drivers/cpuidle/Kconfig.arm
> > @@ -14,6 +14,19 @@ config ARM_CPUIDLE
> > initialized by calling the CPU operations init idle hook
> > provided by architecture code.
> >
> > +config ARM_POLL_CPUIDLE
> > + bool "ARM64 CPU idle Driver with polling"
> > + depends on ARM64
> > + depends on ARM_ARCH_TIMER_EVTSTREAM
> > + select CPU_IDLE_MULTIPLE_DRIVERS
> > + help
> > + Select this to enable a polling cpuidle driver for ARM64:
> > + The first state polls TIF_NEED_RESCHED for best latency on
> > short
> > + sleep intervals. The second state falls back to
> > arch_cpu_idle() to
> > + wait for interrupt. This is can be helpful in workloads
> > that
> > + frequently block/wake at short intervals or VMs where
> > wakeup IPIs
> > + are more expensive.
>
> Why is this a separate driver rather than an optional feature in the
> existing
> driver?
>
> The fact that this duplicates a bunch of code indicates to me that
> this should
> not be a separate driver.
I'll explore using Ankur Arora’s haltpoll changes on AWS Graviton; see
related thread.
>
> > +
> > config ARM_PSCI_CPUIDLE
> > bool "PSCI CPU idle Driver"
> > depends on ARM_PSCI_FW
> > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> > index d103342b7cfc..23c21422792d 100644
> > --- a/drivers/cpuidle/Makefile
> > +++ b/drivers/cpuidle/Makefile
> > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) +=
> > cpuidle-ux500.o
> > obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
> > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
> > obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
> > +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-
> > polling.o
> > obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
> > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-
> > domain.o
> > obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
> > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c
> > b/drivers/cpuidle/cpuidle-arm-polling.c
> > new file mode 100644
> > index 000000000000..bca128568114
> > --- /dev/null
> > +++ b/drivers/cpuidle/cpuidle-arm-polling.c
> > @@ -0,0 +1,171 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * ARM64 CPU idle driver using wfe polling
> > + *
> > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights
> > reserved.
> > + *
> > + * Authors:
> > + * Haris Okanovic <harisokn@amazon.com>
> > + * Brian Silver <silverbr@amazon.com>
> > + *
> > + * Based on cpuidle-arm.c
> > + * Copyright (C) 2014 ARM Ltd.
> > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > + */
> > +
> > +#include <linux/cpu.h>
> > +#include <linux/cpu_cooling.h>
> > +#include <linux/cpuidle.h>
> > +#include <linux/sched/clock.h>
> > +
> > +#include <asm/cpuidle.h>
> > +#include <asm/readex.h>
> > +
> > +#include "dt_idle_states.h"
> > +
> > +/* Max duration of the wfe() poll loop in us, before transitioning
> > to
> > + * arch_cpu_idle()/wfi() sleep.
> > + */
>
> /*
> * Comments should have the leading '/*' on a separate line.
> * See
> https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
> */
>
> > +#define DEFAULT_POLL_LIMIT_US 100
> > +static unsigned int poll_limit __read_mostly =
> > DEFAULT_POLL_LIMIT_US;
> > +
> > +/*
> > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule is
> > + * needed or timeout
> > + */
> > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
> > + struct cpuidle_driver *drv, int idx)
> > +{
> > + u64 time_start, time_limit;
> > +
> > + time_start = local_clock();
> > + dev->poll_time_limit = false;
> > +
> > + local_irq_enable();
>
> Why enable IRQs here? We don't do that in the regular cpuidle-arm
> driver, nor
> the cpuidle-psci driver, and there's no explanation here or in the
> commit message.
>
> How does this interact with RCU? Is that still watching or are we in
> an
> extended quiescent state? For PSCI idle states we enter an EQS, and
> it seems
> like we probably should here...
I'll circle back to your code comments (here and elsewhere) if we
proceede with this patch instead of Ankur's.
>
> > +
> > + if (current_set_polling_and_test())
> > + goto end;
> > +
> > + time_limit = cpuidle_poll_time(drv, dev);
> > +
> > + do {
> > + // exclusive read arms the monitor for wfe
> > + if (__READ_ONCE_EX(current_thread_info()->flags) &
> > _TIF_NEED_RESCHED)
> > + goto end;
> > +
> > + // may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
> > + wfe();
> > + } while (local_clock() - time_start < time_limit);
>
> ... and if the EVTSTREAM is disabled, we'll sit in WFE forever rather
> than
> entering a deeper idle state, which doesn't seem desirable.
I took a depdnency on ARM_ARCH_TIMER_EVTSTREAM in Kconfig.arm to ensure
event stream is enabled.
>
> It's worth noting that now that we have WFET, we'll probably want to
> disable
> the EVTSTREAM by default at some point, at least in some
> configurations, since
> that'll be able to sit in a WFE state for longer while also reliably
> waking up
> when required.
>
> I suspect we want something like an smp_load_acquire_timeout() here
> to do the
> wait in arch code (allowing us to use WFET), and enabling this state
> will
> depend on either having WFET or EVTSTREAM.
>
> > +
> > + dev->poll_time_limit = true;
> > +
> > +end:
> > + current_clr_polling();
> > + return idx;
> > +}
> > +
> > +/*
> > + * arm_idle_wfi - Places cpu in lower power state until interrupt,
> > + * a fallback to polling
> > + */
> > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
> > + struct cpuidle_driver *drv, int idx)
> > +{
> > + if (current_clr_polling_and_test()) {
> > + local_irq_enable();
> > + return idx;
> > + }
>
> Same as above, why enable IRQs here?
>
> > + arch_cpu_idle();
> > + return idx;
>
> ... and if we need to enable IRQs in the other cases above, why do we
> *not*
> need to enable them here?
>
> > +}
> > +
> > +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
> > + .name = "arm_poll_idle",
> > + .owner = THIS_MODULE,
> > + .states = {
> > + {
> > + .enter = arm_idle_wfe_poll,
> > + .exit_latency = 0,
> > + .target_residency = 0,
> > + .exit_latency_ns = 0,
> > + .power_usage = UINT_MAX,
> > + .flags =
> > CPUIDLE_FLAG_POLLING,
> > + .name = "WFE",
> > + .desc = "ARM WFE",
> > + },
> > + {
> > + .enter = arm_idle_wfi,
> > + .exit_latency =
> > DEFAULT_POLL_LIMIT_US,
> > + .target_residency =
> > DEFAULT_POLL_LIMIT_US,
> > + .power_usage = UINT_MAX,
> > + .name = "WFI",
> > + .desc = "ARM WFI",
> > + },
> > + },
> > + .state_count = 2,
> > +};
>
> How does this interact with the existing driver?
>
> How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
>
> > +
> > +/*
> > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver for
> > one cpu
> > + */
> > +static int __init arm_poll_init_cpu(int cpu)
> > +{
> > + int ret;
> > + struct cpuidle_driver *drv;
> > +
> > + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv),
> > GFP_KERNEL);
> > + if (!drv)
> > + return -ENOMEM;
> > +
> > + drv->cpumask = (struct cpumask *)cpumask_of(cpu);
> > + drv->states[1].exit_latency = poll_limit;
> > + drv->states[1].target_residency = poll_limit;
> > +
> > + ret = cpuidle_register(drv, NULL);
> > + if (ret) {
> > + pr_err("failed to register driver: %d, cpu %d\n",
> > ret, cpu);
> > + goto out_kfree_drv;
> > + }
> > +
> > + pr_info("registered driver cpu %d\n", cpu);
>
> This does not need to be printed for each CPU.
>
> Mark.
>
> > +
> > + cpuidle_cooling_register(drv);
> > +
> > + return 0;
> > +
> > +out_kfree_drv:
> > + kfree(drv);
> > + return ret;
> > +}
> > +
> > +/*
> > + * arm_poll_init - Initializes arm cpuidle polling driver
> > + */
> > +static int __init arm_poll_init(void)
> > +{
> > + int cpu, ret;
> > + struct cpuidle_driver *drv;
> > + struct cpuidle_device *dev;
> > +
> > + for_each_possible_cpu(cpu) {
> > + ret = arm_poll_init_cpu(cpu);
> > + if (ret)
> > + goto out_fail;
> > + }
> > +
> > + return 0;
> > +
> > +out_fail:
> > + pr_info("de-register all");
> > + while (--cpu >= 0) {
> > + dev = per_cpu(cpuidle_devices, cpu);
> > + drv = cpuidle_get_cpu_driver(dev);
> > + cpuidle_unregister(drv);
> > + kfree(drv);
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +module_param(poll_limit, uint, 0444);
> > +device_initcall(arm_poll_init);
> > --
> > 2.34.1
> >
> >
Thanks,
Haris Okanovic
^ permalink raw reply
* Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
From: Okanovic, Haris @ 2024-04-05 19:36 UTC (permalink / raw)
To: ankur.a.arora@oracle.com
Cc: linux-assembly@vger.kernel.org, peterz@infradead.org,
mark.rutland@arm.com, linux-kernel@vger.kernel.org, Blake, Geoff,
Silver, Brian, linux-pm@vger.kernel.org, Saidi, Ali
In-Reply-To: <87a5mb5p8p.fsf@oracle.com>
On Tue, 2024-04-02 at 16:17 -0700, Ankur Arora wrote:
> CAUTION: This email originated from outside of the organization. Do
> not click links or open attachments unless you can confirm the sender
> and know the content is safe.
>
>
>
> Mark Rutland <mark.rutland@arm.com> writes:
>
> > On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
> > > An arm64 cpuidle driver with two states: (1) First polls for new
> > > runable
> > > tasks up to 100 us (by default) before (2) a wfi idle and awoken
> > > by
> > > interrupt (the current arm64 behavior). It allows CPUs to return
> > > from
> > > idle more quickly by avoiding the longer interrupt wakeup path,
> > > which
> > > may require EL1/EL2 transition in certain VM scenarios.
> >
> > Please start off with an explanation of the problem you're trying
> > to solve
> > (which IIUC is to wake up more quickly in certain cases), before
> > describing the
> > solution. That makes it *significantly* easier for people to review
> > this, since
> > once you have the problem statement in mind it's much easier to
> > understand how
> > the solution space follows from that.
> >
> > > Poll duration is optionally configured at load time via the
> > > poll_limit
> > > module parameter.
> >
> > Why should this be a configurable parameter?
> >
> > (note, at this point you haven't introduced any of the data below,
> > so the
> > trade-off isn't clear to anyone).
> >
> > > The default 100 us duration was experimentally chosen, by
> > > measuring QPS
> > > (queries per sec) of the MLPerf bert inference benchmark, which
> > > seems
> > > particularly susceptible to this change; see procedure below. 100
> > > us is
> > > the inflection point where QPS stopped growing in a range of
> > > tested
> > > values. All results are from AWS m7g.16xlarge instances
> > > (Graviton3 SoC)
> > > with dedicated tenancy (dedicated hardware).
> > >
> > > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us |
> > > > 300us |
> > > > 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 |
> > > > 6.06 |
> > >
> > > Perf's scheduler benchmarks also improve with a range of
> > > poll_limit
> > > values >= 10 us. Higher limits produce near identical results
> > > within a
> > > 3% noise margin. The following tables are `perf bench sched`
> > > results,
> > > run times in seconds.
> > >
> > > `perf bench sched messaging -l 80000`
> > > > AWS instance | SoC | Before | After | % Change |
> > > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none |
> > > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none |
> > > > c6g.metal | Graviton2 | 17.621 | 16.744 | none |
> > > > c7g.metal | Graviton3 | 13.430 | 13.404 | none |
> > >
> > > `perf bench sched pipe -l 2500000`
> > > > AWS instance | SoC | Before | After | % Change |
> > > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% |
> > > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% |
> > > > c6g.metal | Graviton2 | 17.609 | 15.170 | -14% |
> > > > c7g.metal | Graviton3 | 14.103 | 12.304 | -13% |
> > >
> > > `perf bench sched seccomp-notify -l 2500000`
> > > > AWS instance | SoC | Before | After | % Change |
> > > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% |
> > > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% |
> > > > c6g.metal | Graviton2 | 15.717 | 13.536 | -14% |
> > > > c7g.metal | Graviton3 | 13.301 | 11.491 | -14% |
> >
> > Ok, so perf numbers for a busy workload go up.
> >
> > What happens for idle state residency on a mostly idle system?
> >
> > > Steps to run MLPerf bert inference on Ubuntu 22.04:
> > > sudo apt install build-essential python3 python3-pip
> > > pip install "pybind11[global]" tensorflow transformers
> > > export TF_ENABLE_ONEDNN_OPTS=1
> > > export DNNL_DEFAULT_FPMATH_MODE=BF16
> > > git clone https://github.com/mlcommons/inference.git --recursive
> > > cd inference
> > > git checkout v2.0
> > > cd loadgen
> > > CFLAGS="-std=c++14" python3 setup.py bdist_wheel
> > > pip install dist/*.whl
> > > cd ../language/bert
> > > make setup
> > > python3 run.py --backend=tf --scenario=SingleStream
> > >
> > > Suggested-by: Ali Saidi <alisaidi@amazon.com>
> > > Reviewed-by: Ali Saidi <alisaidi@amazon.com>
> > > Reviewed-by: Geoff Blake <blakgeof@amazon.com>
> > > Cc: Brian Silver <silverbr@amazon.com>
> > > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > > ---
> > > drivers/cpuidle/Kconfig.arm | 13 ++
> > > drivers/cpuidle/Makefile | 1 +
> > > drivers/cpuidle/cpuidle-arm-polling.c | 171
> > > ++++++++++++++++++++++++++
> > > 3 files changed, 185 insertions(+)
> > > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
> > >
> > > diff --git a/drivers/cpuidle/Kconfig.arm
> > > b/drivers/cpuidle/Kconfig.arm
> > > index a1ee475d180d..484666dda38d 100644
> > > --- a/drivers/cpuidle/Kconfig.arm
> > > +++ b/drivers/cpuidle/Kconfig.arm
> > > @@ -14,6 +14,19 @@ config ARM_CPUIDLE
> > > initialized by calling the CPU operations init idle hook
> > > provided by architecture code.
> > >
> > > +config ARM_POLL_CPUIDLE
> > > + bool "ARM64 CPU idle Driver with polling"
> > > + depends on ARM64
> > > + depends on ARM_ARCH_TIMER_EVTSTREAM
> > > + select CPU_IDLE_MULTIPLE_DRIVERS
> > > + help
> > > + Select this to enable a polling cpuidle driver for ARM64:
> > > + The first state polls TIF_NEED_RESCHED for best latency on
> > > short
> > > + sleep intervals. The second state falls back to
> > > arch_cpu_idle() to
> > > + wait for interrupt. This is can be helpful in workloads
> > > that
> > > + frequently block/wake at short intervals or VMs where
> > > wakeup IPIs
> > > + are more expensive.
> >
> > Why is this a separate driver rather than an optional feature in
> > the existing
> > driver?
> >
> > The fact that this duplicates a bunch of code indicates to me that
> > this should
> > not be a separate driver.
>
> Also, the cpuidle-haltpoll driver is meant to do something quite
> similar.
> That driver polls adaptively based on the haltpoll governor's tuning
> of
> the polling period.
>
> However, cpuidle-haltpoll is currently x86 only. Mihai (also from
> Oracle)
> posted patches [1] adding support for ARM64.
>
> Haris, could you take a look at it and see if it does what you are
> looking for? The polling path in the linked version also uses
> smp_cond_load_relaxed() so even the mechanisms for both of these
> are fairly similar.
Hi Ankur,
I agree, except for that small bug in exit condition, your haltpoll
changes fundamentally do the same thing:
> @ int __cpuidle poll_idle(...
> - if (!(ret & _TIF_NEED_RESCHED))
> + if (ret & _TIF_NEED_RESCHE
I'll follow up with another patch for AWS Graviton when your team is
finished.
Do you have a rough ETA of when your changes will land in master?
>
> (I'll be sending out the next version shortly. Happy to Cc you if you
> would like to try that out.)
Yes, please do!
Thanks,
Haris Okanovic
>
> Thanks
> Ankur
>
> [1]
> https://lore.kernel.org/lkml/1707982910-27680-1-git-send-email-mihai.carabas@oracle.com/
>
> >
> > > +
> > > config ARM_PSCI_CPUIDLE
> > > bool "PSCI CPU idle Driver"
> > > depends on ARM_PSCI_FW
> > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> > > index d103342b7cfc..23c21422792d 100644
> > > --- a/drivers/cpuidle/Makefile
> > > +++ b/drivers/cpuidle/Makefile
> > > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) +=
> > > cpuidle-ux500.o
> > > obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
> > > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
> > > obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
> > > +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-
> > > polling.o
> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-
> > > domain.o
> > > obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
> > > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c
> > > b/drivers/cpuidle/cpuidle-arm-polling.c
> > > new file mode 100644
> > > index 000000000000..bca128568114
> > > --- /dev/null
> > > +++ b/drivers/cpuidle/cpuidle-arm-polling.c
> > > @@ -0,0 +1,171 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * ARM64 CPU idle driver using wfe polling
> > > + *
> > > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights
> > > reserved.
> > > + *
> > > + * Authors:
> > > + * Haris Okanovic <harisokn@amazon.com>
> > > + * Brian Silver <silverbr@amazon.com>
> > > + *
> > > + * Based on cpuidle-arm.c
> > > + * Copyright (C) 2014 ARM Ltd.
> > > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > > + */
> > > +
> > > +#include <linux/cpu.h>
> > > +#include <linux/cpu_cooling.h>
> > > +#include <linux/cpuidle.h>
> > > +#include <linux/sched/clock.h>
> > > +
> > > +#include <asm/cpuidle.h>
> > > +#include <asm/readex.h>
> > > +
> > > +#include "dt_idle_states.h"
> > > +
> > > +/* Max duration of the wfe() poll loop in us, before
> > > transitioning to
> > > + * arch_cpu_idle()/wfi() sleep.
> > > + */
> >
> > /*
> > * Comments should have the leading '/*' on a separate line.
> > * See
> > https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
> > */
> >
> > > +#define DEFAULT_POLL_LIMIT_US 100
> > > +static unsigned int poll_limit __read_mostly =
> > > DEFAULT_POLL_LIMIT_US;
> > > +
> > > +/*
> > > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule
> > > is
> > > + * needed or timeout
> > > + */
> > > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device
> > > *dev,
> > > + struct cpuidle_driver *drv, int idx)
> > > +{
> > > + u64 time_start, time_limit;
> > > +
> > > + time_start = local_clock();
> > > + dev->poll_time_limit = false;
> > > +
> > > + local_irq_enable();
> >
> > Why enable IRQs here? We don't do that in the regular cpuidle-arm
> > driver, nor
> > the cpuidle-psci driver, and there's no explanation here or in the
> > commit message.
> >
> > How does this interact with RCU? Is that still watching or are we
> > in an
> > extended quiescent state? For PSCI idle states we enter an EQS, and
> > it seems
> > like we probably should here...
> >
> > > +
> > > + if (current_set_polling_and_test())
> > > + goto end;
> > > +
> > > + time_limit = cpuidle_poll_time(drv, dev);
> > > +
> > > + do {
> > > + // exclusive read arms the monitor for wfe
> > > + if (__READ_ONCE_EX(current_thread_info()->flags) &
> > > _TIF_NEED_RESCHED)
> > > + goto end;
> > > +
> > > + // may exit prematurely, see
> > > ARM_ARCH_TIMER_EVTSTREAM
> > > + wfe();
> > > + } while (local_clock() - time_start < time_limit);
> >
> > .. and if the EVTSTREAM is disabled, we'll sit in WFE forever
> > rather than
> > entering a deeper idle state, which doesn't seem desirable.
> >
> > It's worth noting that now that we have WFET, we'll probably want
> > to disable
> > the EVTSTREAM by default at some point, at least in some
> > configurations, since
> > that'll be able to sit in a WFE state for longer while also
> > reliably waking up
> > when required.
> >
> > I suspect we want something like an smp_load_acquire_timeout() here
> > to do the
> > wait in arch code (allowing us to use WFET), and enabling this
> > state will
> > depend on either having WFET or EVTSTREAM.
> >
> > > +
> > > + dev->poll_time_limit = true;
> > > +
> > > +end:
> > > + current_clr_polling();
> > > + return idx;
> > > +}
> > > +
> > > +/*
> > > + * arm_idle_wfi - Places cpu in lower power state until
> > > interrupt,
> > > + * a fallback to polling
> > > + */
> > > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
> > > + struct cpuidle_driver *drv, int idx)
> > > +{
> > > + if (current_clr_polling_and_test()) {
> > > + local_irq_enable();
> > > + return idx;
> > > + }
> >
> > Same as above, why enable IRQs here?
> >
> > > + arch_cpu_idle();
> > > + return idx;
> >
> > .. and if we need to enable IRQs in the other cases above, why do
> > we *not*
> > need to enable them here?
> >
> > > +}
> > > +
> > > +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
> > > + .name = "arm_poll_idle",
> > > + .owner = THIS_MODULE,
> > > + .states = {
> > > + {
> > > + .enter = arm_idle_wfe_poll,
> > > + .exit_latency = 0,
> > > + .target_residency = 0,
> > > + .exit_latency_ns = 0,
> > > + .power_usage = UINT_MAX,
> > > + .flags =
> > > CPUIDLE_FLAG_POLLING,
> > > + .name = "WFE",
> > > + .desc = "ARM WFE",
> > > + },
> > > + {
> > > + .enter = arm_idle_wfi,
> > > + .exit_latency =
> > > DEFAULT_POLL_LIMIT_US,
> > > + .target_residency =
> > > DEFAULT_POLL_LIMIT_US,
> > > + .power_usage = UINT_MAX,
> > > + .name = "WFI",
> > > + .desc = "ARM WFI",
> > > + },
> > > + },
> > > + .state_count = 2,
> > > +};
> >
> > How does this interact with the existing driver?
> >
> > How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
> >
> > > +
> > > +/*
> > > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver
> > > for one cpu
> > > + */
> > > +static int __init arm_poll_init_cpu(int cpu)
> > > +{
> > > + int ret;
> > > + struct cpuidle_driver *drv;
> > > +
> > > + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv),
> > > GFP_KERNEL);
> > > + if (!drv)
> > > + return -ENOMEM;
> > > +
> > > + drv->cpumask = (struct cpumask *)cpumask_of(cpu);
> > > + drv->states[1].exit_latency = poll_limit;
> > > + drv->states[1].target_residency = poll_limit;
> > > +
> > > + ret = cpuidle_register(drv, NULL);
> > > + if (ret) {
> > > + pr_err("failed to register driver: %d, cpu %d\n",
> > > ret, cpu);
> > > + goto out_kfree_drv;
> > > + }
> > > +
> > > + pr_info("registered driver cpu %d\n", cpu);
> >
> > This does not need to be printed for each CPU.
> >
> > Mark.
> >
> > > +
> > > + cpuidle_cooling_register(drv);
> > > +
> > > + return 0;
> > > +
> > > +out_kfree_drv:
> > > + kfree(drv);
> > > + return ret;
> > > +}
> > > +
> > > +/*
> > > + * arm_poll_init - Initializes arm cpuidle polling driver
> > > + */
> > > +static int __init arm_poll_init(void)
> > > +{
> > > + int cpu, ret;
> > > + struct cpuidle_driver *drv;
> > > + struct cpuidle_device *dev;
> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + ret = arm_poll_init_cpu(cpu);
> > > + if (ret)
> > > + goto out_fail;
> > > + }
> > > +
> > > + return 0;
> > > +
> > > +out_fail:
> > > + pr_info("de-register all");
> > > + while (--cpu >= 0) {
> > > + dev = per_cpu(cpuidle_devices, cpu);
> > > + drv = cpuidle_get_cpu_driver(dev);
> > > + cpuidle_unregister(drv);
> > > + kfree(drv);
> > > + }
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +module_param(poll_limit, uint, 0444);
> > > +device_initcall(arm_poll_init);
> > > --
> > > 2.34.1
> > >
> > >
>
>
> --
> ankur
^ permalink raw reply
* [PATCH 1/1] pm: Take advantage of %ps to simplify debug output
From: Len Brown @ 2024-04-05 19:12 UTC (permalink / raw)
To: linux-pm; +Cc: Len Brown
From: Len Brown <len.brown@intel.com>
initcall_debug previous and new output:
...PM: calling pci_pm_suspend+0x0/0x1b0 @ 3233, parent: pci0000:00
...PM: calling pci_pm_suspend @ 3233, parent: pci0000:00
Signed-off-by: Len Brown <len.brown@intel.com>
---
drivers/base/power/main.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index f85f3515c258..de581c33095f 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -209,7 +209,7 @@ static ktime_t initcall_debug_start(struct device *dev, void *cb)
if (!pm_print_times_enabled)
return 0;
- dev_info(dev, "calling %pS @ %i, parent: %s\n", cb,
+ dev_info(dev, "calling %ps @ %i, parent: %s\n", cb,
task_pid_nr(current),
dev->parent ? dev_name(dev->parent) : "none");
return ktime_get();
@@ -224,7 +224,7 @@ static void initcall_debug_report(struct device *dev, ktime_t calltime,
return;
rettime = ktime_get();
- dev_info(dev, "%pS returned %d after %Ld usecs\n", cb, error,
+ dev_info(dev, "%ps returned %d after %Ld usecs\n", cb, error,
(unsigned long long)ktime_us_delta(rettime, calltime));
}
@@ -1963,7 +1963,7 @@ EXPORT_SYMBOL_GPL(dpm_suspend_start);
void __suspend_report_result(const char *function, struct device *dev, void *fn, int ret)
{
if (ret)
- dev_err(dev, "%s(): %pS returns %d\n", function, fn, ret);
+ dev_err(dev, "%s(): %ps returns %d\n", function, fn, ret);
}
EXPORT_SYMBOL_GPL(__suspend_report_result);
--
2.40.1
^ permalink raw reply related
* [GIT PULL] Power management fixes for v6.9-rc3
From: Rafael J. Wysocki @ 2024-04-05 18:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux PM, Linux Kernel Mailing List
Hi Linus,
Please pull from the tag
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
pm-6.9-rc3
with top-most commit 8130b05c559d1aa83d0c8971b422ba0da18ef24a
PM: EM: fix wrong utilization estimation in em_cpu_energy()
on top of commit 39cd87c4eb2b893354f3b850f916353f2658ae6f
Linux 6.9-rc2
to receive a power management fix for 6.9-rc3.
This fixes a recent Energy Model change that went against a recent
scheduler change made independently (Vincent Guittot).
Thanks!
---------------
Vincent Guittot (1):
PM: EM: fix wrong utilization estimation in em_cpu_energy()
---------------
include/linux/energy_model.h | 1 -
1 file changed, 1 deletion(-)
^ permalink raw reply
* [GIT PULL] Thermal control fixes for v6.9-rc3
From: Rafael J. Wysocki @ 2024-04-05 18:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux PM, ACPI Devel Maling List, Linux Kernel Mailing List,
Daniel Lezcano
Hi Linus,
Please pull from the tag
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
thermal-6.9-rc3
with top-most commit 6f824c9fccd494319988fa529601923edf5caacb
Merge branch 'acpi-thermal'
on top of commit 39cd87c4eb2b893354f3b850f916353f2658ae6f
Linux 6.9-rc2
to receive thermal control fixes for 6.9-rc3.
These fix two power allocator thermal governor issues and an ACPI
thermal driver regression that all were introduced during the 6.8
development cycle.
Specifics:
- Allow the power allocator thermal governor to bind to a thermal zone
without cooling devices and/or without trip points (Nikita Travkin).
- Make the ACPI thermal driver register a tripless thermal zone when
it cannot find any usable trip points instead of returning an error
from acpi_thermal_add() (Stephen Horvath).
Thanks!
---------------
Nikita Travkin (2):
thermal: gov_power_allocator: Allow binding without cooling devices
thermal: gov_power_allocator: Allow binding without trip points
Stephen Horvath (1):
ACPI: thermal: Register thermal zones without valid trip points
---------------
drivers/acpi/thermal.c | 22 ++++++++++------------
drivers/thermal/gov_power_allocator.c | 14 +++++---------
2 files changed, 15 insertions(+), 21 deletions(-)
^ permalink raw reply
* Re: [PATCH] PM: s2idle: Make sure CPUs will wakeup directly on resume
From: Thomas Gleixner @ 2024-04-05 17:52 UTC (permalink / raw)
To: Anna-Maria Behnsen, linux-kernel
Cc: Rafael J . Wysocki, Pavel Machek, Len Brown, Ulf Hansson,
linux-pm, Frederic Weisbecker, x86, Anna-Maria Behnsen,
Mario Limonciello, stable
In-Reply-To: <20240405083410.4896-1-anna-maria@linutronix.de>
On Fri, Apr 05 2024 at 10:34, Anna-Maria Behnsen wrote:
> s2idle works like a regular suspend with freezing processes and freezing
> devices. All CPUs except the control CPU go into idle. Once this is
> completed the control CPU kicks all other CPUs out of idle, so that they
> reenter the idle loop and then enter s2idle state. The control CPU then
> issues an swait() on the suspend state and therefore enters the idle loop
> as well.
>
> Due to being kicked out of idle, the other CPUs leave their NOHZ states,
> which means the tick is active and the corresponding hrtimer is programmed
> to the next jiffie.
>
> On entering s2idle the CPUs shut down their local clockevent device to
> prevent wakeups. The last CPU which enters s2idle shuts down its local
> clockevent and freezes timekeeping.
>
> On resume, one of the CPUs receives the wakeup interrupt, unfreezes
> timekeeping and its local clockevent and starts the resume process. At that
> point all other CPUs are still in s2idle with their clockevents switched
> off. They only resume when they are kicked by another CPU or after resuming
> devices and then receiving a device interrupt.
>
> That means there is no guarantee that all CPUs will wakeup directly on
> resume. As the consequence there is no guarantee that timers which are
s/As the/As a/
> queued on those CPUs and should expire directly after resume, are
> handled. Also timer list timers which are remotely queued to one of those
> CPUs after resume will not result in a reporgramming IPI as the tick is
s/reporgramming/reprogamming/
> active. A queue hrtimer will also not result in a reprogramming IPI because
s/A queue/Queueing a/
> the first hrtimer event is already in the past.
>
> The recent introduction of the timer pull model (7ee988770326 ("timers:
> Implement the hierarchical pull model")) amplifies this problem, if the
> current migrator is one of the non woken up CPUs. When a non pinned timer
> list timer is queued and the queueing CPU goes idle, it relies on the still
> suspended migrator CPU to expire the timer which will happen by chance.
>
> The problem existis since commit 8d89835b0467 ("PM: suspend: Do not pause
> cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
> in turn invoked a wakeup for all idle CPUs was moved to a later point in
> the resume process. This might not be reached or reached very late because
> it waits on a timer of a still suspended CPU.
>
> Address this by kicking all CPUs out of idle after the control CPU returns
> from swait() so that they resume their timers and restore consistent system
> state.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
> Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
> Tested-by: Mario Limonciello <mario.limonciello@amd.com>
> Cc: stable@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
^ permalink raw reply
* Re: [CfP] LPC 2024 - Power Management and Thermal Control Micro-Conference
From: Dhruva Gole @ 2024-04-05 16:43 UTC (permalink / raw)
To: Ulf Hansson
Cc: Rafael J. Wysocki, Linux PM, Nikunj Kela, Kevin Hilman,
Doug Anderson, nm, vigneshr, vibhore
In-Reply-To: <CAPDyKFrUKZ+h37YEhkkLj9ffPtV_gn5_8_dqUrSGNsYn7T_ozw@mail.gmail.com>
Hi all,
On Apr 04, 2024 at 13:31:26 +0200, Ulf Hansson wrote:
> + Dhruva, Nikunj, Kevin, Doug (I probably forgot some of the last
> years attendants, please loop them if you know)
>
> On Mon, 25 Mar 2024 at 12:37, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > Hi Everyone,
> >
> > This year the Linux Plumbers Conference will be held in Vienna,
> > Austria, on Wednesday 18th, Thursday 19th and Friday 20th of
> > September (http://linuxplumbersconf.org/).
> >
> > As it has been a tradition for the last few years, I'm planning to
> > submit a proposal for a Power Management and Thermal Control
> > Micro-Conference to the LPC and I'm looking for topics that people
> > would like to discuss in that MC.
>
> Thanks for continuing to drive this! If you need some additional help
> around this, feel free to reach out to me.
>
> >
> > The LPC is focused on work in progress and future developments, so any
> > work that has been completed already is not a suitable topic for it,
> > but if there's anything you'd like to do or you are planning or
> > considering in the power management and thermal control space in
> > Linux, please let me know.
>
> We had a BoF last year around "supporting multiple low power states
> for system wide suspend".
>
> Not so much progress has been made I believe? (Except for some local
> hacks that I am working on).
A small update on TI end, we have been working on how to support
multiple low power modes in heterogeneous SoC architecture.
TI team will present Low Power Modes ARCH and how we have tried to solve
the issue in K3 SoCs..
Please feel free to join the session on on April 10th 8:00AM CST
https://ti.webex.com/ti/j.php?MTID=m433b5e0c9b1b2edc4296a988c320a2a9
>
> If possible, I think it would be a good opportunity to continue the
> discussions from last year, assuming people are still interested in
> this topic.
We would love to garner inputs and reviews from the community on what
they think about this. Perhaps we can look at scaling this as a standard
or something on similar lines maybe. It is still in it's early stages and hence
looking for alignment with what the community thinks/expects.
Perhaps the community can continue from this talk and discuss further at
LPC at the PM microconference how things can be made more scalable/
improved if there seem to be gaps to be addressd.
--
Best regards,
Dhruva Gole <d-gole@ti.com>
^ permalink raw reply
* Re: [PATCH RFC 02/11] dt-bindings: riscv: Add Sdtrig optional CSRs existence on DT
From: Andrew Jones @ 2024-04-05 15:59 UTC (permalink / raw)
To: Conor Dooley
Cc: Max Hsu, Conor Dooley, Rob Herring, Krzysztof Kozlowski,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Rafael J. Wysocki,
Pavel Machek, Anup Patel, Atish Patra, Paolo Bonzini, Shuah Khan,
Palmer Dabbelt, linux-riscv, devicetree, linux-kernel, linux-pm,
kvm, kvm-riscv, linux-kselftest
In-Reply-To: <20240329-affidavit-anatomist-1118a12c3e60@wendy>
On Fri, Mar 29, 2024 at 10:31:10AM +0000, Conor Dooley wrote:
> On Fri, Mar 29, 2024 at 05:26:18PM +0800, Max Hsu wrote:
> > The mcontext/hcontext/scontext CSRs are optional in the Sdtrig extension,
> > to prevent RW operations to the missing CSRs, which will cause
> > illegal instructions.
> >
> > As a solution, we have proposed the dt format for these CSRs.
>
> As I mentioned in your other patch, I amn't sure what the actual value
> is in being told about "sdtrig" itself if so many of the CSRs are
> optional. I think we should define pseudo extensions that represent
> usable subsets that are allowed by riscv,isa-extensions, such as
> those you describe here: sdtrig + mcontext, sdtrig + scontext and
> sdtrig + hcontext. Probably also for strig + mscontext. What
> additional value does having a debug child node give us that makes
> it worth having over something like the above?
Yeah, Sdtrig, which doesn't tell you what you get, isn't nice at all.
I wonder if we can start with requiring Sdtrig to be accompanied by
Ssstrict in order to enable the context CSRs, i.e.
Sdtrig - support without optional CSRs
Sdtrig+Ssstrict - probe for optional CSRs, support what's found
If there are platforms with Sdtrig and optional CSRs, but not Ssstrict,
then maybe the optional CSRs can be detected in some vendor-specific way,
where the decision as to whether or not that vendor-specific way is
acceptable is handled case-by-case.
Thanks,
drew
>
> Thanks,
> Conor.
>
> >
> > Signed-off-by: Max Hsu <max.hsu@sifive.com>
> > ---
> > Documentation/devicetree/bindings/riscv/cpus.yaml | 18 ++++++++++++++++++
> > 1 file changed, 18 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/riscv/cpus.yaml b/Documentation/devicetree/bindings/riscv/cpus.yaml
> > index d87dd50f1a4b..c713a48c5025 100644
> > --- a/Documentation/devicetree/bindings/riscv/cpus.yaml
> > +++ b/Documentation/devicetree/bindings/riscv/cpus.yaml
> > @@ -137,6 +137,24 @@ properties:
> > DMIPS/MHz, relative to highest capacity-dmips-mhz
> > in the system.
> >
> > + debug:
> > + type: object
> > + properties:
> > + compatible:
> > + const: riscv,debug-v1.0.0
> > + trigger-module:
> > + type: object
> > + description: |
> > + An indication set of optional CSR existence from
> > + riscv-debug-spec Sdtrig extension
> > + properties:
> > + mcontext-present:
> > + type: boolean
> > + hcontext-present:
> > + type: boolean
> > + scontext-present:
> > + type: boolean
> > +
> > anyOf:
> > - required:
> > - riscv,isa
> >
> > --
> > 2.43.2
> >
> --
> kvm-riscv mailing list
> kvm-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kvm-riscv
^ permalink raw reply
* Re: [PATCH v6 05/16] dt-bindings: net: wireless: describe the ath12k PCI module
From: Jeff Johnson @ 2024-04-05 15:44 UTC (permalink / raw)
To: Bartosz Golaszewski, Kalle Valo
Cc: Marcel Holtmann, Luiz Augusto von Dentz, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson, Konrad Dybcio,
Liam Girdwood, Mark Brown, Catalin Marinas, Will Deacon,
Bjorn Helgaas, Saravana Kannan, Geert Uytterhoeven, Arnd Bergmann,
Neil Armstrong, Marek Szyprowski, Alex Elder, Srini Kandagatla,
Greg Kroah-Hartman, Abel Vesa, Manivannan Sadhasivam,
Lukas Wunner, Dmitry Baryshkov, linux-bluetooth, netdev,
devicetree, linux-kernel, linux-wireless, linux-arm-msm,
linux-arm-kernel, linux-pci, linux-pm, Bartosz Golaszewski
In-Reply-To: <CAMRc=MdzhGxLNcNLWvRfqr0S9pey-iw964=AcYx_yDXgyDDjMA@mail.gmail.com>
On 4/5/2024 2:52 AM, Bartosz Golaszewski wrote:
> In addition to what Krzysztof already said about you seamingly
> confusing the maintenance of the driver vs maintenance of the
> device-tree bindings (IOW: structured hardware description) and in
> response to your question: I don't see any functional change to any
> dt-bindings neither from you nor from Jeff. Are you convinced you can
> maintain and properly review any changes?
Speaking just for myself, I know Bartosz is far more capable in this regard
than I am, so I'd expect him to be listed as a maintainer before me.
/jeff
^ permalink raw reply
* [regression] opp issues on devices with multiple power-domains
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-04-05 15:21 UTC (permalink / raw)
To: Viresh Kumar
Cc: Linux PM, LKML, Linux kernel regressions list, Nishanth Menon,
Stephen Boyd
Hi, Thorsten here, the Linux kernel's regression tracker.
Viresh Kumar, I noticed a report about a regression in
bugzilla.kernel.org that appears to be caused by a change of yours
(e37440e7e2c276 ("OPP: Call dev_pm_opp_set_opp() for required OPPs")
[v6.8-rc1]). As many (most?) kernel developers don't keep an eye on it,
I decided to forward it by mail.
Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218682 :
> Created attachment 306086 [details]
> Dmesg from xiaomi-onclite
>
> Devices with multiple power-domains has issues with required_opps is
> pointing to wrong opp table because it lazily just uses dt node to find it.
> But every genpd device has its own table.
>
> The issue was fixed downstream with this commit:
> https://github.com/msm8953-mainline/linux/commit/6afec1ea2cf1cdb968d646c45d1d1a80f6cb5fb2
See the ticket for more details. Note, you have to use bugzilla to reach
the reporter, as I sadly[1] can not CCed them in mails like this.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
[1] because bugzilla.kernel.org tells users upon registration their
"email address will never be displayed to logged out users"
P.S.: let me use this mail to also add the report to the list of tracked
regressions to ensure it's doesn't fall through the cracks:
#regzbot introduced: e37440e7e2c2760475d60c5556b59c8880a7fd63
#regzbot title: OPP: issues on devices with multiple power-domains
#regzbot from: Barnabás Czémán
#regzbot duplicate: https://bugzilla.kernel.org/show_bug.cgi?id=218682
#regzbot ignore-activity
^ permalink raw reply
* Re: iMX8M Mini suspend/resume hanging on imx8m_blk_ctrl_power_on()
From: vitor @ 2024-04-05 15:06 UTC (permalink / raw)
To: linux-pm, imx, linux-arm-kernel, linux-kernel
Cc: vitor.soares, ulf.hansson, shawnguo, s.hauer, kernel, festevam,
rafael, geert+renesas, peng.fan, linus.walleij, u.kleine-koenig,
marex
In-Reply-To: <fccbb040330a706a4f7b34875db1d896a0bf81c8.camel@gmail.com>
Hi,
On Thu, 2024-04-04 at 16:53 +0100, vitor wrote:
> Greetings,
>
> I'm trying to suspend/resume our Verdin iMX8M Mini with VPU IP using
> the latest 6.9.0-rc2 Kernel. While the system can suspend without
> issues, it hangs on the resume routine. After some investigation, I
> can
> see the Kernel hanging on imx8m_blk_ctrl_power_on()[1] while resuming
> the hantro-vpu power domain.
>
> Any hint about that?
>
> [1]
> https://elixir.bootlin.com/linux/v6.9-rc2/source/drivers/pmdomain/imx
> /imx8m-blk-ctrl.c#L101
>
Looking at other child nodes of the pgc node, pgc_vpu_[g1|g2|h1] seems
to be nested into pgc_vpumix.
After applying the following changes to imx8mm.dtsi, the suspend/resume
is working.
@@ -739,16 +739,19 @@ pgc_vpumix: power-domain@6 {
pgc_vpu_g1: power-domain@7 {
#power-domain-cells = <0>;
reg = <IMX8MM_POWER_DOMAIN_VPUG1>;
+ power-domains = <&pgc_vpumix>;
};
pgc_vpu_g2: power-domain@8 {
#power-domain-cells = <0>;
reg = <IMX8MM_POWER_DOMAIN_VPUG2>;
+ power-domains = <&pgc_vpumix>;
};
pgc_vpu_h1: power-domain@9 {
#power-domain-cells = <0>;
reg = <IMX8MM_POWER_DOMAIN_VPUH1>;
+ power-domains = <&pgc_vpumix>;
};
I will prepare the patch to send in the next couple of days.
Regards,
Vitor Soares
^ permalink raw reply
* Re: [regression] 6.8.1 and 6.7.12: fails to hibernate with pm_runtime_force_suspend+0x0/0x120 returns -16
From: Martin Steigerwald @ 2024-04-05 14:38 UTC (permalink / raw)
To: Greg KH; +Cc: linux-pm, regressions, linux-kernel
In-Reply-To: <2024040512-defective-wreckage-f202@gregkh>
Greg KH - 05.04.24, 15:34:30 CEST:
> > I think ThinkPad T14 AMD Gen 1 or similar systems with Linux are not
> > that rare. I am surprised this is not hitting more people. Well maybe
> > it does and no one reported it here.
> >
> > Well let's see what happens when 6.7.12 hits distribution kernels.
> >
> > Anyway, I have an actionable git bisect between 6.7.11 and 6.7.12,
> > just not today. Maybe beginning of next week.
>
> 6.7.y is end-of-life, I wouldn't worry too much about it, there's
> nothing we can do about it anymore, sorry.
Ah, I see. Thanks.
Still its a good short cut for bisection. Bisecting between 6.7.11 and
6.7.12 should get much quicker results than bisecting between 6.7 and 6.8
and as its the same error message in 6.8.1/6.8.4 and 6.7.12 it might be
the same (backported) patch causing the issue.
--
Martin
^ permalink raw reply
* Re: [regression] 6.8.1 and 6.7.12: fails to hibernate with pm_runtime_force_suspend+0x0/0x120 returns -16
From: Greg KH @ 2024-04-05 13:34 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-pm, regressions, linux-kernel
In-Reply-To: <12414153.O9o76ZdvQC@lichtvoll.de>
On Fri, Apr 05, 2024 at 02:11:11PM +0200, Martin Steigerwald wrote:
> Martin Steigerwald - 05.04.24, 13:54:34 CEST:
> > > Not doing it today or probably the weekend, but now I have some
> > > actionable git bisect plan without bisecting between major kernel
> > > releases which as I have been told between 6.7 an 6.8-rc1 can lead to
> > > non working modeset graphics on AMD Ryzen in between.
> > >
> > > It seems now I only need to git bisect between 6.7.11 and 6.7.12 to
> > > find the patch that breaks hibernation on 6.8.1 as well. However
> > > first I will briefly check whether 6.8.4 hibernates okay.
> >
> > 6.8.4 is still affected and fails hibernation with the same error
> > message.
>
> Both kernels also fail to reboot the machine. Runit says:
>
> - runit: system reboot
>
> And then nothing anymore.
>
> It is not just hibernation.
>
> I think ThinkPad T14 AMD Gen 1 or similar systems with Linux are not that
> rare. I am surprised this is not hitting more people. Well maybe it does
> and no one reported it here.
>
> Well let's see what happens when 6.7.12 hits distribution kernels.
>
> Anyway, I have an actionable git bisect between 6.7.11 and 6.7.12, just
> not today. Maybe beginning of next week.
6.7.y is end-of-life, I wouldn't worry too much about it, there's
nothing we can do about it anymore, sorry.
greg k-h
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox