Linux Power Management development
 help / color / mirror / Atom feed
* [PATCH v3 3/6] thermal: core: Rewrite comments in handle_thermal_trip()
From: Rafael J. Wysocki @ 2024-04-02 18:59 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Srinivas Pandruvada, Daniel Lezcano, Lukasz Luba,
	AngeloGioacchino Del Regno
In-Reply-To: <4558251.LvFx2qVVIh@kreacher>

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make the comments regarding trip crossing and threshold updates in
handle_thermal_trip() slightly more clear.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3: New patch

---
 drivers/thermal/thermal_core.c |   26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -368,6 +368,13 @@ static void handle_thermal_trip(struct t
 	if (trip->temperature == THERMAL_TEMP_INVALID)
 		return;
 
+	/*
+	 * If the trip temperature or hysteresis has been updated recently,
+	 * the threshold needs to be computed again using the new values.
+	 * However, its initial value still reflects the old ones and that
+	 * is what needs to be compared with the previous zone temperature
+	 * to decide which action to take.
+	 */
 	if (tz->last_temperature == THERMAL_TEMP_INVALID) {
 		/* Initialization. */
 		td->threshold = trip->temperature;
@@ -375,11 +382,9 @@ static void handle_thermal_trip(struct t
 			td->threshold -= trip->hysteresis;
 	} else if (tz->last_temperature < td->threshold) {
 		/*
-		 * The trip threshold is equal to the trip temperature, unless
-		 * the latter has changed in the meantime.  In either case,
-		 * the trip is crossed if the current zone temperature is at
-		 * least equal to its temperature, but otherwise ensure that
-		 * the threshold and the trip temperature will be equal.
+		 * There is no mitigation under way, so it needs to be started
+		 * if the zone temperature exceeds the trip one.  The new
+		 * threshold is then set to the low temperature of the trip.
 		 */
 		if (tz->temperature >= trip->temperature) {
 			thermal_notify_tz_trip_up(tz, trip);
@@ -390,14 +395,9 @@ static void handle_thermal_trip(struct t
 		}
 	} else {
 		/*
-		 * The previous zone temperature was above or equal to the trip
-		 * threshold, which would be equal to the "low temperature" of
-		 * the trip (its temperature minus its hysteresis), unless the
-		 * trip temperature or hysteresis had changed.  In either case,
-		 * the trip is crossed if the current zone temperature is below
-		 * the low temperature of the trip, but otherwise ensure that
-		 * the trip threshold will be equal to the low temperature of
-		 * the trip.
+		 * Mitigation is under way, so it needs to stop if the zone
+		 * temperature falls below the low temperature of the trip.
+		 * In that case, the trip temperature becomes the new threshold.
 		 */
 		if (tz->temperature < trip->temperature - trip->hysteresis) {
 			thermal_notify_tz_trip_down(tz, trip);




^ permalink raw reply

* [PATCH v3 4/6] thermal: core: Send trip crossing notifications at init time if needed
From: Rafael J. Wysocki @ 2024-04-02 19:02 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Srinivas Pandruvada, Daniel Lezcano, Lukasz Luba,
	AngeloGioacchino Del Regno
In-Reply-To: <4558251.LvFx2qVVIh@kreacher>

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If a trip point is already exceeded by the zone temperature at the
initialization time, no trip crossing notification is send regarding
this even though mitigation should be started then.

Address this by rearranging the code in handle_thermal_trip() to
send a trip crossing notification for trip points already exceeded
by the zone temperature initially which also allows to reduce its
size by using the observation that the initialization and regular
trip crossing on the way up become the same case then.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3: New patch

---
 drivers/thermal/thermal_core.c |   37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -364,6 +364,7 @@ static void handle_thermal_trip(struct t
 				struct thermal_trip_desc *td)
 {
 	const struct thermal_trip *trip = &td->trip;
+	int old_threshold;
 
 	if (trip->temperature == THERMAL_TEMP_INVALID)
 		return;
@@ -375,25 +376,11 @@ static void handle_thermal_trip(struct t
 	 * is what needs to be compared with the previous zone temperature
 	 * to decide which action to take.
 	 */
-	if (tz->last_temperature == THERMAL_TEMP_INVALID) {
-		/* Initialization. */
-		td->threshold = trip->temperature;
-		if (tz->temperature >= td->threshold)
-			td->threshold -= trip->hysteresis;
-	} else if (tz->last_temperature < td->threshold) {
-		/*
-		 * There is no mitigation under way, so it needs to be started
-		 * if the zone temperature exceeds the trip one.  The new
-		 * threshold is then set to the low temperature of the trip.
-		 */
-		if (tz->temperature >= trip->temperature) {
-			thermal_notify_tz_trip_up(tz, trip);
-			thermal_debug_tz_trip_up(tz, trip);
-			td->threshold = trip->temperature - trip->hysteresis;
-		} else {
-			td->threshold = trip->temperature;
-		}
-	} else {
+	old_threshold = td->threshold;
+	td->threshold = trip->temperature;
+
+	if (tz->last_temperature >= old_threshold &&
+	    tz->last_temperature != THERMAL_TEMP_INVALID) {
 		/*
 		 * Mitigation is under way, so it needs to stop if the zone
 		 * temperature falls below the low temperature of the trip.
@@ -402,10 +389,18 @@ static void handle_thermal_trip(struct t
 		if (tz->temperature < trip->temperature - trip->hysteresis) {
 			thermal_notify_tz_trip_down(tz, trip);
 			thermal_debug_tz_trip_down(tz, trip);
-			td->threshold = trip->temperature;
 		} else {
-			td->threshold = trip->temperature - trip->hysteresis;
+			td->threshold -= trip->hysteresis;
 		}
+	} else if (tz->temperature >= trip->temperature) {
+		/*
+		 * There is no mitigation under way, so it needs to be started
+		 * if the zone temperature exceeds the trip one.  The new
+		 * threshold is then set to the low temperature of the trip.
+		 */
+		thermal_notify_tz_trip_up(tz, trip);
+		thermal_debug_tz_trip_up(tz, trip);
+		td->threshold -= trip->hysteresis;
 	}
 
 	if (trip->type == THERMAL_TRIP_CRITICAL || trip->type == THERMAL_TRIP_HOT)




^ permalink raw reply

* [PATCH v3 5/6] thermal: core: Sort trip point crossing notifications by temperature
From: Rafael J. Wysocki @ 2024-04-02 19:03 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Srinivas Pandruvada, Daniel Lezcano, Lukasz Luba,
	AngeloGioacchino Del Regno
In-Reply-To: <4558251.LvFx2qVVIh@kreacher>

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If multiple trip points are crossed in one go and the trips table in
the thermal zone device object is not sorted, the corresponding trip
point crossing notifications sent to user space will not be ordered
either.

Moreover, if the trips table is sorted by trip temperature in ascending
order, the trip crossing notifications on the way up will be sent in that
order too, but the trip crossing notifications on the way down will be
sent in the reverse order.

This is generally confusing and it is better to make the kernel send the
notifications in the order of growing (on the way up) or falling (on the
way down) trip temperature.

To achieve that, instead of sending a trip crossing notification and
recording a trip crossing event in the statistics right away from
handle_thermal_trip(), put the trip in question on a list that will be
sorted by __thermal_zone_device_update() after processing all of the trips
and before sending the notifications and recording trip crossing events.

Link: https://lore.kernel.org/linux-pm/20240306085428.88011-1-daniel.lezcano@linaro.org/
Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org> 
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3: Rebase, update changelog, add notify_temp to struct thermal_trip_desc

v1 -> v2: New patch

---
 drivers/thermal/thermal_core.c |   41 +++++++++++++++++++++++++++++++++++------
 drivers/thermal/thermal_core.h |    2 ++
 2 files changed, 37 insertions(+), 6 deletions(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/kdev_t.h>
 #include <linux/idr.h>
+#include <linux/list_sort.h>
 #include <linux/thermal.h>
 #include <linux/reboot.h>
 #include <linux/string.h>
@@ -361,7 +362,9 @@ static void handle_critical_trips(struct
 }
 
 static void handle_thermal_trip(struct thermal_zone_device *tz,
-				struct thermal_trip_desc *td)
+				struct thermal_trip_desc *td,
+				struct list_head *way_up_list,
+				struct list_head *way_down_list)
 {
 	const struct thermal_trip *trip = &td->trip;
 	int old_threshold;
@@ -387,8 +390,8 @@ static void handle_thermal_trip(struct t
 		 * In that case, the trip temperature becomes the new threshold.
 		 */
 		if (tz->temperature < trip->temperature - trip->hysteresis) {
-			thermal_notify_tz_trip_down(tz, trip);
-			thermal_debug_tz_trip_down(tz, trip);
+			list_add(&td->notify_list_node, way_down_list);
+			td->notify_temp = trip->temperature - trip->hysteresis;
 		} else {
 			td->threshold -= trip->hysteresis;
 		}
@@ -398,8 +401,8 @@ static void handle_thermal_trip(struct t
 		 * if the zone temperature exceeds the trip one.  The new
 		 * threshold is then set to the low temperature of the trip.
 		 */
-		thermal_notify_tz_trip_up(tz, trip);
-		thermal_debug_tz_trip_up(tz, trip);
+		list_add_tail(&td->notify_list_node, way_up_list);
+		td->notify_temp = trip->temperature;
 		td->threshold -= trip->hysteresis;
 	}
 
@@ -452,10 +455,24 @@ static void thermal_zone_device_init(str
 		pos->initialized = false;
 }
 
+static int thermal_trip_notify_cmp(void *ascending, const struct list_head *a,
+				   const struct list_head *b)
+{
+	struct thermal_trip_desc *tda = container_of(a, struct thermal_trip_desc,
+						     notify_list_node);
+	struct thermal_trip_desc *tdb = container_of(b, struct thermal_trip_desc,
+						     notify_list_node);
+	int ret = tdb->notify_temp - tda->notify_temp;
+
+	return ascending ? ret : -ret;
+}
+
 void __thermal_zone_device_update(struct thermal_zone_device *tz,
 				  enum thermal_notify_event event)
 {
 	struct thermal_trip_desc *td;
+	LIST_HEAD(way_down_list);
+	LIST_HEAD(way_up_list);
 
 	if (tz->suspended)
 		return;
@@ -470,7 +487,19 @@ void __thermal_zone_device_update(struct
 	tz->notify_event = event;
 
 	for_each_trip_desc(tz, td)
-		handle_thermal_trip(tz, td);
+		handle_thermal_trip(tz, td, &way_up_list, &way_down_list);
+
+	list_sort(&way_up_list, &way_up_list, thermal_trip_notify_cmp);
+	list_for_each_entry(td, &way_up_list, notify_list_node) {
+		thermal_notify_tz_trip_up(tz, &td->trip);
+		thermal_debug_tz_trip_up(tz, &td->trip);
+	}
+
+	list_sort(NULL, &way_down_list, thermal_trip_notify_cmp);
+	list_for_each_entry(td, &way_down_list, notify_list_node) {
+		thermal_notify_tz_trip_down(tz, &td->trip);
+		thermal_debug_tz_trip_down(tz, &td->trip);
+	}
 
 	monitor_thermal_zone(tz);
 }
Index: linux-pm/drivers/thermal/thermal_core.h
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.h
+++ linux-pm/drivers/thermal/thermal_core.h
@@ -17,6 +17,8 @@
 
 struct thermal_trip_desc {
 	struct thermal_trip trip;
+	struct list_head notify_list_node;
+	int notify_temp;
 	int threshold;
 };
 




^ permalink raw reply

* [PATCH v3 0/6] thermal: More separation between the core and drivers
From: Rafael J. Wysocki @ 2024-04-02 18:54 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Srinivas Pandruvada, Daniel Lezcano, Lukasz Luba,
	AngeloGioacchino Del Regno

Hi Everyone,

This is an update of

https://lore.kernel.org/linux-pm/4558384.LvFx2qVVIh@kreacher/

and

https://lore.kernel.org/linux-pm/2331888.ElGaqSPkdT@kreacher/

which rebases the first patch on top of 6.9-rc2, adds 3 patches and adjusts
the third patch from v2.

The original description of the first two patches still applies:

> Patch [1/2] is based on the observation that the threshold field in struct
> thermal_trip really should be core-internal and to make that happen it
> introduces a wrapper structure around struct thermal_trip for internal
> use in the core.
> 
> Patch [2/2] moves the definition of the new structure and the struct
> thermal_zone_device one to a local header file in the core to enforce
> more separation between the core and drivers.
> 
> The patches are not expected to introduce any observable differences in
> behavior, so please let me know if you see any of that.

Note that these patches were first sent before the merge window and have not
really changed since then (except for a minor rebase of the first patch in
this series).  Moreover, no comments regarding the merit of these patches
have been made shared, so if this continues, I will be considering them as
good to go by the end of this week.

Patch [3/6] is a rewrite of comments regarding trip crossing and threshold
computations.

Patch [4/6] updates the trip crossing detection code to consolidate the
threshold initialization with trip crossing on the way up.

Patch [5/6] ([3/3] in v2) adds a mechanism to sort notifications and debug
calls taking place during one invocation of __thermal_zone_device_update() so
they always go in temperature order.

Patch [6/6] relocates the critical and trip point handling to avoid a
redundant temperature check.

The series applies on top of 6.9-rc2 and I'm planning to create a test
branch containing it.

Thanks!




^ permalink raw reply

* [PATCH v3 1/6] thermal: core: Move threshold out of struct thermal_trip
From: Rafael J. Wysocki @ 2024-04-02 18:56 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Srinivas Pandruvada, Daniel Lezcano, Lukasz Luba,
	AngeloGioacchino Del Regno
In-Reply-To: <4558251.LvFx2qVVIh@kreacher>

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The threshold field in struct thermal_trip is only used internally by
the thermal core and it is better to prevent drivers from misusing it.
It also takes some space unnecessarily in the trip tables passed by
drivers to the core during thermal zone registration.

For this reason, introduce struct thermal_trip_desc as a wrapper around
struct thermal_trip, move the threshold field directly into it and make
the thermal core store struct thermal_trip_desc objects in the internal
thermal zone trip tables.  Adjust all of the code using trip tables in
the thermal core accordingly.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3: Rebase on top of 6.9-rc2, minor changelog update.

v1 -> v2: No changes.

---
 drivers/thermal/gov_fair_share.c      |    7 +++--
 drivers/thermal/gov_power_allocator.c |    6 ++--
 drivers/thermal/thermal_core.c        |   46 +++++++++++++++++++---------------
 drivers/thermal/thermal_core.h        |    7 +++--
 drivers/thermal/thermal_debugfs.c     |    6 ++--
 drivers/thermal/thermal_helpers.c     |    8 +++--
 drivers/thermal/thermal_netlink.c     |    6 ++--
 drivers/thermal/thermal_sysfs.c       |   20 +++++++-------
 drivers/thermal/thermal_trip.c        |   15 +++++------
 include/linux/thermal.h               |    9 ++++--
 10 files changed, 78 insertions(+), 52 deletions(-)

Index: linux-pm/include/linux/thermal.h
===================================================================
--- linux-pm.orig/include/linux/thermal.h
+++ linux-pm/include/linux/thermal.h
@@ -61,7 +61,6 @@ enum thermal_notify_event {
  * struct thermal_trip - representation of a point in temperature domain
  * @temperature: temperature value in miliCelsius
  * @hysteresis: relative hysteresis in miliCelsius
- * @threshold: trip crossing notification threshold miliCelsius
  * @type: trip point type
  * @priv: pointer to driver data associated with this trip
  * @flags: flags representing binary properties of the trip
@@ -69,12 +68,16 @@ enum thermal_notify_event {
 struct thermal_trip {
 	int temperature;
 	int hysteresis;
-	int threshold;
 	enum thermal_trip_type type;
 	u8 flags;
 	void *priv;
 };
 
+struct thermal_trip_desc {
+	struct thermal_trip trip;
+	int threshold;
+};
+
 #define THERMAL_TRIP_FLAG_RW_TEMP	BIT(0)
 #define THERMAL_TRIP_FLAG_RW_HYST	BIT(1)
 
@@ -203,7 +206,7 @@ struct thermal_zone_device {
 #ifdef CONFIG_THERMAL_DEBUGFS
 	struct thermal_debugfs *debugfs;
 #endif
-	struct thermal_trip trips[] __counted_by(num_trips);
+	struct thermal_trip_desc trips[] __counted_by(num_trips);
 };
 
 /**
Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -361,17 +361,19 @@ static void handle_critical_trips(struct
 }
 
 static void handle_thermal_trip(struct thermal_zone_device *tz,
-				struct thermal_trip *trip)
+				struct thermal_trip_desc *td)
 {
+	const struct thermal_trip *trip = &td->trip;
+
 	if (trip->temperature == THERMAL_TEMP_INVALID)
 		return;
 
 	if (tz->last_temperature == THERMAL_TEMP_INVALID) {
 		/* Initialization. */
-		trip->threshold = trip->temperature;
-		if (tz->temperature >= trip->threshold)
-			trip->threshold -= trip->hysteresis;
-	} else if (tz->last_temperature < trip->threshold) {
+		td->threshold = trip->temperature;
+		if (tz->temperature >= td->threshold)
+			td->threshold -= trip->hysteresis;
+	} else if (tz->last_temperature < td->threshold) {
 		/*
 		 * The trip threshold is equal to the trip temperature, unless
 		 * the latter has changed in the meantime.  In either case,
@@ -382,9 +384,9 @@ static void handle_thermal_trip(struct t
 		if (tz->temperature >= trip->temperature) {
 			thermal_notify_tz_trip_up(tz, trip);
 			thermal_debug_tz_trip_up(tz, trip);
-			trip->threshold = trip->temperature - trip->hysteresis;
+			td->threshold = trip->temperature - trip->hysteresis;
 		} else {
-			trip->threshold = trip->temperature;
+			td->threshold = trip->temperature;
 		}
 	} else {
 		/*
@@ -400,9 +402,9 @@ static void handle_thermal_trip(struct t
 		if (tz->temperature < trip->temperature - trip->hysteresis) {
 			thermal_notify_tz_trip_down(tz, trip);
 			thermal_debug_tz_trip_down(tz, trip);
-			trip->threshold = trip->temperature;
+			td->threshold = trip->temperature;
 		} else {
-			trip->threshold = trip->temperature - trip->hysteresis;
+			td->threshold = trip->temperature - trip->hysteresis;
 		}
 	}
 
@@ -458,7 +460,7 @@ static void thermal_zone_device_init(str
 void __thermal_zone_device_update(struct thermal_zone_device *tz,
 				  enum thermal_notify_event event)
 {
-	struct thermal_trip *trip;
+	struct thermal_trip_desc *td;
 
 	if (tz->suspended)
 		return;
@@ -472,8 +474,8 @@ void __thermal_zone_device_update(struct
 
 	tz->notify_event = event;
 
-	for_each_trip(tz, trip)
-		handle_thermal_trip(tz, trip);
+	for_each_trip_desc(tz, td)
+		handle_thermal_trip(tz, td);
 
 	monitor_thermal_zone(tz);
 }
@@ -766,7 +768,7 @@ int thermal_zone_bind_cooling_device(str
 	if (trip_index < 0 || trip_index >= tz->num_trips)
 		return -EINVAL;
 
-	return thermal_bind_cdev_to_trip(tz, &tz->trips[trip_index], cdev,
+	return thermal_bind_cdev_to_trip(tz, &tz->trips[trip_index].trip, cdev,
 					 upper, lower, weight);
 }
 EXPORT_SYMBOL_GPL(thermal_zone_bind_cooling_device);
@@ -825,7 +827,7 @@ int thermal_zone_unbind_cooling_device(s
 	if (trip_index < 0 || trip_index >= tz->num_trips)
 		return -EINVAL;
 
-	return thermal_unbind_cdev_from_trip(tz, &tz->trips[trip_index], cdev);
+	return thermal_unbind_cdev_from_trip(tz, &tz->trips[trip_index].trip, cdev);
 }
 EXPORT_SYMBOL_GPL(thermal_zone_unbind_cooling_device);
 
@@ -1221,16 +1223,19 @@ static void thermal_set_delay_jiffies(un
 
 int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
 {
-	int i, ret = -EINVAL;
+	const struct thermal_trip_desc *td;
+	int ret = -EINVAL;
 
 	if (tz->ops.get_crit_temp)
 		return tz->ops.get_crit_temp(tz, temp);
 
 	mutex_lock(&tz->lock);
 
-	for (i = 0; i < tz->num_trips; i++) {
-		if (tz->trips[i].type == THERMAL_TRIP_CRITICAL) {
-			*temp = tz->trips[i].temperature;
+	for_each_trip_desc(tz, td) {
+		const struct thermal_trip *trip = &td->trip;
+
+		if (trip->type == THERMAL_TRIP_CRITICAL) {
+			*temp = trip->temperature;
 			ret = 0;
 			break;
 		}
@@ -1274,7 +1279,9 @@ thermal_zone_device_register_with_trips(
 					const struct thermal_zone_params *tzp,
 					int passive_delay, int polling_delay)
 {
+	const struct thermal_trip *trip = trips;
 	struct thermal_zone_device *tz;
+	struct thermal_trip_desc *td;
 	int id;
 	int result;
 	struct thermal_governor *governor;
@@ -1339,7 +1346,8 @@ thermal_zone_device_register_with_trips(
 	tz->device.class = thermal_class;
 	tz->devdata = devdata;
 	tz->num_trips = num_trips;
-	memcpy(tz->trips, trips, num_trips * sizeof(*trips));
+	for_each_trip_desc(tz, td)
+		td->trip = *trip++;
 
 	thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
 	thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
Index: linux-pm/drivers/thermal/thermal_core.h
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.h
+++ linux-pm/drivers/thermal/thermal_core.h
@@ -120,8 +120,11 @@ void thermal_governor_update_tz(struct t
 				enum thermal_notify_event reason);
 
 /* Helpers */
-#define for_each_trip(__tz, __trip)	\
-	for (__trip = __tz->trips; __trip - __tz->trips < __tz->num_trips; __trip++)
+#define for_each_trip_desc(__tz, __td)	\
+	for (__td = __tz->trips; __td - __tz->trips < __tz->num_trips; __td++)
+
+#define trip_to_trip_desc(__trip)	\
+	container_of(__trip, struct thermal_trip_desc, trip)
 
 void __thermal_zone_set_trips(struct thermal_zone_device *tz);
 int thermal_zone_trip_id(const struct thermal_zone_device *tz,
Index: linux-pm/drivers/thermal/thermal_sysfs.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_sysfs.c
+++ linux-pm/drivers/thermal/thermal_sysfs.c
@@ -88,7 +88,7 @@ trip_point_type_show(struct device *dev,
 	if (sscanf(attr->attr.name, "trip_point_%d_type", &trip_id) != 1)
 		return -EINVAL;
 
-	switch (tz->trips[trip_id].type) {
+	switch (tz->trips[trip_id].trip.type) {
 	case THERMAL_TRIP_CRITICAL:
 		return sprintf(buf, "critical\n");
 	case THERMAL_TRIP_HOT:
@@ -120,7 +120,7 @@ trip_point_temp_store(struct device *dev
 
 	mutex_lock(&tz->lock);
 
-	trip = &tz->trips[trip_id];
+	trip = &tz->trips[trip_id].trip;
 
 	if (temp != trip->temperature) {
 		if (tz->ops.set_trip_temp) {
@@ -150,7 +150,7 @@ trip_point_temp_show(struct device *dev,
 	if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip_id) != 1)
 		return -EINVAL;
 
-	return sprintf(buf, "%d\n", tz->trips[trip_id].temperature);
+	return sprintf(buf, "%d\n", tz->trips[trip_id].trip.temperature);
 }
 
 static ssize_t
@@ -171,7 +171,7 @@ trip_point_hyst_store(struct device *dev
 
 	mutex_lock(&tz->lock);
 
-	trip = &tz->trips[trip_id];
+	trip = &tz->trips[trip_id].trip;
 
 	if (hyst != trip->hysteresis) {
 		trip->hysteresis = hyst;
@@ -194,7 +194,7 @@ trip_point_hyst_show(struct device *dev,
 	if (sscanf(attr->attr.name, "trip_point_%d_hyst", &trip_id) != 1)
 		return -EINVAL;
 
-	return sprintf(buf, "%d\n", tz->trips[trip_id].hysteresis);
+	return sprintf(buf, "%d\n", tz->trips[trip_id].trip.hysteresis);
 }
 
 static ssize_t
@@ -393,7 +393,7 @@ static const struct attribute_group *the
  */
 static int create_trip_attrs(struct thermal_zone_device *tz)
 {
-	const struct thermal_trip *trip;
+	const struct thermal_trip_desc *td;
 	struct attribute **attrs;
 
 	/* This function works only for zones with at least one trip */
@@ -429,8 +429,8 @@ static int create_trip_attrs(struct ther
 		return -ENOMEM;
 	}
 
-	for_each_trip(tz, trip) {
-		int indx = thermal_zone_trip_id(tz, trip);
+	for_each_trip_desc(tz, td) {
+		int indx = thermal_zone_trip_id(tz, &td->trip);
 
 		/* create trip type attribute */
 		snprintf(tz->trip_type_attrs[indx].name, THERMAL_NAME_LENGTH,
@@ -452,7 +452,7 @@ static int create_trip_attrs(struct ther
 						tz->trip_temp_attrs[indx].name;
 		tz->trip_temp_attrs[indx].attr.attr.mode = S_IRUGO;
 		tz->trip_temp_attrs[indx].attr.show = trip_point_temp_show;
-		if (trip->flags & THERMAL_TRIP_FLAG_RW_TEMP) {
+		if (td->trip.flags & THERMAL_TRIP_FLAG_RW_TEMP) {
 			tz->trip_temp_attrs[indx].attr.attr.mode |= S_IWUSR;
 			tz->trip_temp_attrs[indx].attr.store =
 							trip_point_temp_store;
@@ -467,7 +467,7 @@ static int create_trip_attrs(struct ther
 					tz->trip_hyst_attrs[indx].name;
 		tz->trip_hyst_attrs[indx].attr.attr.mode = S_IRUGO;
 		tz->trip_hyst_attrs[indx].attr.show = trip_point_hyst_show;
-		if (trip->flags & THERMAL_TRIP_FLAG_RW_HYST) {
+		if (td->trip.flags & THERMAL_TRIP_FLAG_RW_HYST) {
 			tz->trip_hyst_attrs[indx].attr.attr.mode |= S_IWUSR;
 			tz->trip_hyst_attrs[indx].attr.store =
 					trip_point_hyst_store;
Index: linux-pm/drivers/thermal/thermal_debugfs.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_debugfs.c
+++ linux-pm/drivers/thermal/thermal_debugfs.c
@@ -744,7 +744,7 @@ static void tze_seq_stop(struct seq_file
 static int tze_seq_show(struct seq_file *s, void *v)
 {
 	struct thermal_zone_device *tz = s->private;
-	struct thermal_trip *trip;
+	struct thermal_trip_desc *td;
 	struct tz_episode *tze;
 	const char *type;
 	int trip_id;
@@ -757,7 +757,9 @@ static int tze_seq_show(struct seq_file
 
 	seq_printf(s, "| trip |     type | temp(°mC) | hyst(°mC) |  duration  |  avg(°mC) |  min(°mC) |  max(°mC) |\n");
 
-	for_each_trip(tz, trip) {
+	for_each_trip_desc(tz, td) {
+		const struct thermal_trip *trip = &td->trip;
+
 		/*
 		 * There is no possible mitigation happening at the
 		 * critical trip point, so the stats will be always
Index: linux-pm/drivers/thermal/thermal_netlink.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_netlink.c
+++ linux-pm/drivers/thermal/thermal_netlink.c
@@ -442,7 +442,7 @@ out_cancel_nest:
 static int thermal_genl_cmd_tz_get_trip(struct param *p)
 {
 	struct sk_buff *msg = p->msg;
-	const struct thermal_trip *trip;
+	const struct thermal_trip_desc *td;
 	struct thermal_zone_device *tz;
 	struct nlattr *start_trip;
 	int id;
@@ -462,7 +462,9 @@ static int thermal_genl_cmd_tz_get_trip(
 
 	mutex_lock(&tz->lock);
 
-	for_each_trip(tz, trip) {
+	for_each_trip_desc(tz, td) {
+		const struct thermal_trip *trip = &td->trip;
+
 		if (nla_put_u32(msg, THERMAL_GENL_ATTR_TZ_TRIP_ID,
 				thermal_zone_trip_id(tz, trip)) ||
 		    nla_put_u32(msg, THERMAL_GENL_ATTR_TZ_TRIP_TYPE, trip->type) ||
Index: linux-pm/drivers/thermal/thermal_trip.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_trip.c
+++ linux-pm/drivers/thermal/thermal_trip.c
@@ -13,11 +13,11 @@ int for_each_thermal_trip(struct thermal
 			  int (*cb)(struct thermal_trip *, void *),
 			  void *data)
 {
-	struct thermal_trip *trip;
+	struct thermal_trip_desc *td;
 	int ret;
 
-	for_each_trip(tz, trip) {
-		ret = cb(trip, data);
+	for_each_trip_desc(tz, td) {
+		ret = cb(&td->trip, data);
 		if (ret)
 			return ret;
 	}
@@ -63,7 +63,7 @@ EXPORT_SYMBOL_GPL(thermal_zone_get_num_t
  */
 void __thermal_zone_set_trips(struct thermal_zone_device *tz)
 {
-	const struct thermal_trip *trip;
+	const struct thermal_trip_desc *td;
 	int low = -INT_MAX, high = INT_MAX;
 	int ret;
 
@@ -72,7 +72,8 @@ void __thermal_zone_set_trips(struct the
 	if (!tz->ops.set_trips)
 		return;
 
-	for_each_trip(tz, trip) {
+	for_each_trip_desc(tz, td) {
+		const struct thermal_trip *trip = &td->trip;
 		int trip_low;
 
 		trip_low = trip->temperature - trip->hysteresis;
@@ -110,7 +111,7 @@ int __thermal_zone_get_trip(struct therm
 	if (!tz || trip_id < 0 || trip_id >= tz->num_trips || !trip)
 		return -EINVAL;
 
-	*trip = tz->trips[trip_id];
+	*trip = tz->trips[trip_id].trip;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__thermal_zone_get_trip);
@@ -135,7 +136,7 @@ int thermal_zone_trip_id(const struct th
 	 * Assume the trip to be located within the bounds of the thermal
 	 * zone's trips[] table.
 	 */
-	return trip - tz->trips;
+	return trip_to_trip_desc(trip) - tz->trips;
 }
 void thermal_zone_trip_updated(struct thermal_zone_device *tz,
 			       const struct thermal_trip *trip)
Index: linux-pm/drivers/thermal/gov_fair_share.c
===================================================================
--- linux-pm.orig/drivers/thermal/gov_fair_share.c
+++ linux-pm/drivers/thermal/gov_fair_share.c
@@ -17,10 +17,13 @@
 
 static int get_trip_level(struct thermal_zone_device *tz)
 {
-	const struct thermal_trip *trip, *level_trip = NULL;
+	const struct thermal_trip *level_trip = NULL;
+	const struct thermal_trip_desc *td;
 	int trip_level = -1;
 
-	for_each_trip(tz, trip) {
+	for_each_trip_desc(tz, td) {
+		const struct thermal_trip *trip = &td->trip;
+
 		if (trip->temperature >= tz->temperature)
 			continue;
 
Index: linux-pm/drivers/thermal/gov_power_allocator.c
===================================================================
--- linux-pm.orig/drivers/thermal/gov_power_allocator.c
+++ linux-pm/drivers/thermal/gov_power_allocator.c
@@ -496,9 +496,11 @@ static void get_governor_trips(struct th
 	const struct thermal_trip *first_passive = NULL;
 	const struct thermal_trip *last_passive = NULL;
 	const struct thermal_trip *last_active = NULL;
-	const struct thermal_trip *trip;
+	const struct thermal_trip_desc *td;
+
+	for_each_trip_desc(tz, td) {
+		const struct thermal_trip *trip = &td->trip;
 
-	for_each_trip(tz, trip) {
 		switch (trip->type) {
 		case THERMAL_TRIP_PASSIVE:
 			if (!first_passive) {
Index: linux-pm/drivers/thermal/thermal_helpers.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_helpers.c
+++ linux-pm/drivers/thermal/thermal_helpers.c
@@ -50,7 +50,7 @@ get_thermal_instance(struct thermal_zone
 	mutex_lock(&tz->lock);
 	mutex_lock(&cdev->lock);
 
-	trip = &tz->trips[trip_index];
+	trip = &tz->trips[trip_index].trip;
 
 	list_for_each_entry(pos, &tz->thermal_instances, tz_node) {
 		if (pos->tz == tz && pos->trip == trip && pos->cdev == cdev) {
@@ -82,7 +82,7 @@ EXPORT_SYMBOL(get_thermal_instance);
  */
 int __thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
 {
-	const struct thermal_trip *trip;
+	const struct thermal_trip_desc *td;
 	int crit_temp = INT_MAX;
 	int ret = -EINVAL;
 
@@ -91,7 +91,9 @@ int __thermal_zone_get_temp(struct therm
 	ret = tz->ops.get_temp(tz, temp);
 
 	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz->emul_temperature) {
-		for_each_trip(tz, trip) {
+		for_each_trip_desc(tz, td) {
+			const struct thermal_trip *trip = &td->trip;
+
 			if (trip->type == THERMAL_TRIP_CRITICAL) {
 				crit_temp = trip->temperature;
 				break;




^ permalink raw reply

* [PATCH v3 6/6] thermal: core: Relocate critical and hot trip handling
From: Rafael J. Wysocki @ 2024-04-02 19:04 UTC (permalink / raw)
  To: Linux PM
  Cc: LKML, Srinivas Pandruvada, Daniel Lezcano, Lukasz Luba,
	AngeloGioacchino Del Regno
In-Reply-To: <4558251.LvFx2qVVIh@kreacher>

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Modify handle_thermal_trip() to call handle_critical_trips() only after
finding that the trip temperature has been crossed on the way up and
remove the redundant temperature check from the latter.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/thermal/thermal_core.c |   15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -349,10 +349,6 @@ void thermal_zone_device_critical_reboot
 static void handle_critical_trips(struct thermal_zone_device *tz,
 				  const struct thermal_trip *trip)
 {
-	/* If we have not crossed the trip_temp, we do not care. */
-	if (trip->temperature <= 0 || tz->temperature < trip->temperature)
-		return;
-
 	trace_thermal_zone_trip(tz, thermal_zone_trip_id(tz, trip), trip->type);
 
 	if (trip->type == THERMAL_TRIP_CRITICAL)
@@ -404,12 +400,15 @@ static void handle_thermal_trip(struct t
 		list_add_tail(&td->notify_list_node, way_up_list);
 		td->notify_temp = trip->temperature;
 		td->threshold -= trip->hysteresis;
+
+		if (trip->type == THERMAL_TRIP_CRITICAL ||
+		    trip->type == THERMAL_TRIP_HOT) {
+			handle_critical_trips(tz, trip);
+			return;
+		}
 	}
 
-	if (trip->type == THERMAL_TRIP_CRITICAL || trip->type == THERMAL_TRIP_HOT)
-		handle_critical_trips(tz, trip);
-	else
-		handle_non_critical_trips(tz, trip);
+	handle_non_critical_trips(tz, trip);
 }
 
 static void update_temperature(struct thermal_zone_device *tz)




^ permalink raw reply

* RE: [PATCH] tools/power/turbostat: Fix uncore frequency file string
From: Ernst, Justin @ 2024-04-02 18:57 UTC (permalink / raw)
  To: Len Brown; +Cc: linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CAJvTdK=R+XQZ4Vov8iXGiMADShgrwSoDL8-Jqfhii7YruRLDsg@mail.gmail.com>

> From: Len Brown <lenb@kernel.org>
> Sent: Tuesday, April 2, 2024 12:46 PM
> 
> Thanks for the patch, Justin,
> 
> Looks like the probe part of this was already fixed in my git tree, so
> I lopped off that hunk and kept your 1st hunk.
> 
> Let me know if it works, or if I screwed it up.

It works! I tested on a 16-socket system with "..., package_10_die_00/, package_11_die_00/, etc" directories present.

Thanks for the link to your latest tree. It made it very easy to build and test your patch.

Cheers,
-Justin

> 
> latest is in this tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git/
> 
> thanks,
> -Len
> 
> On Fri, Mar 8, 2024 at 3:50 PM Justin Ernst <justin.ernst@hpe.com> wrote:
> >
> > Running turbostat on a 16 socket HPE Scale-up Compute 3200 (SapphireRapids) fails with:
> > turbostat: /sys/devices/system/cpu/intel_uncore_frequency/package_010_die_00/current_freq_khz: open
> failed: No such file or directory
> >
> > We observe the sysfs uncore frequency directories named:
> > ...
> > package_09_die_00/
> > package_10_die_00/
> > package_11_die_00/
> > ...
> > package_15_die_00/
> >
> > The culprit is an incorrect sprintf format string "package_0%d_die_0%d" used
> > with each instance of reading uncore frequency files. uncore-frequency-common.c
> > creates the sysfs directory with the format "package_%02d_die_%02d". Once the
> > package value reaches double digits, the formats diverge.
> >
> > Change each instance of "package_0%d_die_0%d" to "package_%02d_die_%02d".
> >
> > Signed-off-by: Justin Ernst <justin.ernst@hpe.com>
> > ---
> >  tools/power/x86/turbostat/turbostat.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> > index 7a334377f92b..2a15a23cb726 100644
> > --- a/tools/power/x86/turbostat/turbostat.c
> > +++ b/tools/power/x86/turbostat/turbostat.c
> > @@ -2599,7 +2599,7 @@ unsigned long long get_uncore_mhz(int package, int die)
> >  {
> >         char path[128];
> >
> > -       sprintf(path,
> "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/current_freq_khz", package,
> > +       sprintf(path,
> "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/current_freq_khz", package,
> >                 die);
> >
> >         return (snapshot_sysfs_counter(path) / 1000);
> > @@ -4589,20 +4589,20 @@ static void probe_intel_uncore_frequency(void)
> >                 for (j = 0; j < topo.num_die; ++j) {
> >                         int k, l;
> >
> > -                       sprintf(path,
> "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/min_freq_khz",
> > +                       sprintf(path,
> "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/min_freq_khz",
> >                                 i, j);
> >                         k = read_sysfs_int(path);
> > -                       sprintf(path,
> "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/max_freq_khz",
> > +                       sprintf(path,
> "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/max_freq_khz",
> >                                 i, j);
> >                         l = read_sysfs_int(path);
> >                         fprintf(outf, "Uncore Frequency pkg%d die%d: %d - %d MHz ", i, j, k / 1000,
> l / 1000);
> >
> >                         sprintf(path,
> > -
> "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/initial_min_freq_khz",
> > +
> "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/initial_min_freq_khz",
> >                                 i, j);
> >                         k = read_sysfs_int(path);
> >                         sprintf(path,
> > -
> "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/initial_max_freq_khz",
> > +
> "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/initial_max_freq_khz",
> >                                 i, j);
> >                         l = read_sysfs_int(path);
> >                         fprintf(outf, "(%d - %d MHz)\n", k / 1000, l / 1000);
> > --
> > 2.26.2
> >
> 
> 
> --
> Len Brown, Intel

^ permalink raw reply

* Re: [PATCH] tools/power/turbostat: Fix uncore frequency file string
From: Len Brown @ 2024-04-02 17:45 UTC (permalink / raw)
  To: Justin Ernst; +Cc: linux-pm, linux-kernel
In-Reply-To: <20240308204957.2007212-1-justin.ernst@hpe.com>

Thanks for the patch, Justin,

Looks like the probe part of this was already fixed in my git tree, so
I lopped off that hunk and kept your 1st hunk.

Let me know if it works, or if I screwed it up.

latest is in this tree:
https://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git/

thanks,
-Len

On Fri, Mar 8, 2024 at 3:50 PM Justin Ernst <justin.ernst@hpe.com> wrote:
>
> Running turbostat on a 16 socket HPE Scale-up Compute 3200 (SapphireRapids) fails with:
> turbostat: /sys/devices/system/cpu/intel_uncore_frequency/package_010_die_00/current_freq_khz: open failed: No such file or directory
>
> We observe the sysfs uncore frequency directories named:
> ...
> package_09_die_00/
> package_10_die_00/
> package_11_die_00/
> ...
> package_15_die_00/
>
> The culprit is an incorrect sprintf format string "package_0%d_die_0%d" used
> with each instance of reading uncore frequency files. uncore-frequency-common.c
> creates the sysfs directory with the format "package_%02d_die_%02d". Once the
> package value reaches double digits, the formats diverge.
>
> Change each instance of "package_0%d_die_0%d" to "package_%02d_die_%02d".
>
> Signed-off-by: Justin Ernst <justin.ernst@hpe.com>
> ---
>  tools/power/x86/turbostat/turbostat.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> index 7a334377f92b..2a15a23cb726 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -2599,7 +2599,7 @@ unsigned long long get_uncore_mhz(int package, int die)
>  {
>         char path[128];
>
> -       sprintf(path, "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/current_freq_khz", package,
> +       sprintf(path, "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/current_freq_khz", package,
>                 die);
>
>         return (snapshot_sysfs_counter(path) / 1000);
> @@ -4589,20 +4589,20 @@ static void probe_intel_uncore_frequency(void)
>                 for (j = 0; j < topo.num_die; ++j) {
>                         int k, l;
>
> -                       sprintf(path, "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/min_freq_khz",
> +                       sprintf(path, "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/min_freq_khz",
>                                 i, j);
>                         k = read_sysfs_int(path);
> -                       sprintf(path, "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/max_freq_khz",
> +                       sprintf(path, "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/max_freq_khz",
>                                 i, j);
>                         l = read_sysfs_int(path);
>                         fprintf(outf, "Uncore Frequency pkg%d die%d: %d - %d MHz ", i, j, k / 1000, l / 1000);
>
>                         sprintf(path,
> -                               "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/initial_min_freq_khz",
> +                               "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/initial_min_freq_khz",
>                                 i, j);
>                         k = read_sysfs_int(path);
>                         sprintf(path,
> -                               "/sys/devices/system/cpu/intel_uncore_frequency/package_0%d_die_0%d/initial_max_freq_khz",
> +                               "/sys/devices/system/cpu/intel_uncore_frequency/package_%02d_die_%02d/initial_max_freq_khz",
>                                 i, j);
>                         l = read_sysfs_int(path);
>                         fprintf(outf, "(%d - %d MHz)\n", k / 1000, l / 1000);
> --
> 2.26.2
>


-- 
Len Brown, Intel

^ permalink raw reply

* Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
From: Mark Rutland @ 2024-04-02 17:23 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: linux-kernel, linux-pm, linux-assembly, peterz, Ali Saidi,
	Geoff Blake, Brian Silver
In-Reply-To: <20240402014706.3969151-3-harisokn@amazon.com>

On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
> An arm64 cpuidle driver with two states: (1) First polls for new runable
> tasks up to 100 us (by default) before (2) a wfi idle and awoken by
> interrupt (the current arm64 behavior). It allows CPUs to return from
> idle more quickly by avoiding the longer interrupt wakeup path, which
> may require EL1/EL2 transition in certain VM scenarios.

Please start off with an explanation of the problem you're trying to solve
(which IIUC is to wake up more quickly in certain cases), before describing the
solution. That makes it *significantly* easier for people to review this, since
once you have the problem statement in mind it's much easier to understand how
the solution space follows from that.

> Poll duration is optionally configured at load time via the poll_limit
> module parameter.

Why should this be a configurable parameter?

(note, at this point you haven't introduced any of the data below, so the
trade-off isn't clear to anyone).

> The default 100 us duration was experimentally chosen, by measuring QPS
> (queries per sec) of the MLPerf bert inference benchmark, which seems
> particularly susceptible to this change; see procedure below. 100 us is
> the inflection point where QPS stopped growing in a range of tested
> values. All results are from AWS m7g.16xlarge instances (Graviton3 SoC)
> with dedicated tenancy (dedicated hardware).
> 
> | before | 10us  | 25us | 50us | 100us | 125us | 150us | 200us | 300us |
> | 5.87   | 5.91  | 5.96 | 6.01 | 6.06  | 6.07  | 6.06  | 6.06  | 6.06  |
> 
> Perf's scheduler benchmarks also improve with a range of poll_limit
> values >= 10 us. Higher limits produce near identical results within a
> 3% noise margin. The following tables are `perf bench sched` results,
> run times in seconds.
> 
> `perf bench sched messaging -l 80000`
> | AWS instance  | SoC       | Before | After  | % Change |
> | c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none     |
> | c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none     |
> | c6g.metal     | Graviton2 | 17.621 | 16.744 | none     |
> | c7g.metal     | Graviton3 | 13.430 | 13.404 | none     |
> 
> `perf bench sched pipe -l 2500000`
> | AWS instance  | SoC       | Before | After  | % Change |
> | c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50%     |
> | c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34%     |
> | c6g.metal     | Graviton2 | 17.609 | 15.170 | -14%     |
> | c7g.metal     | Graviton3 | 14.103 | 12.304 | -13%     |
> 
> `perf bench sched seccomp-notify -l 2500000`
> | AWS instance  | SoC       | Before | After  | % Change |
> | c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52%     |
> | c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33%     |
> | c6g.metal     | Graviton2 | 15.717 | 13.536 | -14%     |
> | c7g.metal     | Graviton3 | 13.301 | 11.491 | -14%     |

Ok, so perf numbers for a busy workload go up.

What happens for idle state residency on a mostly idle system?

> Steps to run MLPerf bert inference on Ubuntu 22.04:
>  sudo apt install build-essential python3 python3-pip
>  pip install "pybind11[global]" tensorflow  transformers
>  export TF_ENABLE_ONEDNN_OPTS=1
>  export DNNL_DEFAULT_FPMATH_MODE=BF16
>  git clone https://github.com/mlcommons/inference.git --recursive
>  cd inference
>  git checkout v2.0
>  cd loadgen
>  CFLAGS="-std=c++14" python3 setup.py bdist_wheel
>  pip install dist/*.whl
>  cd ../language/bert
>  make setup
>  python3 run.py --backend=tf --scenario=SingleStream
> 
> Suggested-by: Ali Saidi <alisaidi@amazon.com>
> Reviewed-by: Ali Saidi <alisaidi@amazon.com>
> Reviewed-by: Geoff Blake <blakgeof@amazon.com>
> Cc: Brian Silver <silverbr@amazon.com>
> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
>  drivers/cpuidle/Kconfig.arm           |  13 ++
>  drivers/cpuidle/Makefile              |   1 +
>  drivers/cpuidle/cpuidle-arm-polling.c | 171 ++++++++++++++++++++++++++
>  3 files changed, 185 insertions(+)
>  create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
> 
> diff --git a/drivers/cpuidle/Kconfig.arm b/drivers/cpuidle/Kconfig.arm
> index a1ee475d180d..484666dda38d 100644
> --- a/drivers/cpuidle/Kconfig.arm
> +++ b/drivers/cpuidle/Kconfig.arm
> @@ -14,6 +14,19 @@ config ARM_CPUIDLE
>  	  initialized by calling the CPU operations init idle hook
>  	  provided by architecture code.
>  
> +config ARM_POLL_CPUIDLE
> +	bool "ARM64 CPU idle Driver with polling"
> +	depends on ARM64
> +	depends on ARM_ARCH_TIMER_EVTSTREAM
> +	select CPU_IDLE_MULTIPLE_DRIVERS
> +	help
> +	  Select this to enable a polling cpuidle driver for ARM64:
> +	  The first state polls TIF_NEED_RESCHED for best latency on short
> +	  sleep intervals. The second state falls back to arch_cpu_idle() to
> +	  wait for interrupt. This is can be helpful in workloads that
> +	  frequently block/wake at short intervals or VMs where wakeup IPIs
> +	  are more expensive.

Why is this a separate driver rather than an optional feature in the existing
driver?

The fact that this duplicates a bunch of code indicates to me that this should
not be a separate driver.

> +
>  config ARM_PSCI_CPUIDLE
>  	bool "PSCI CPU idle Driver"
>  	depends on ARM_PSCI_FW
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index d103342b7cfc..23c21422792d 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE)         += cpuidle-ux500.o
>  obj-$(CONFIG_ARM_AT91_CPUIDLE)          += cpuidle-at91.o
>  obj-$(CONFIG_ARM_EXYNOS_CPUIDLE)        += cpuidle-exynos.o
>  obj-$(CONFIG_ARM_CPUIDLE)		+= cpuidle-arm.o
> +obj-$(CONFIG_ARM_POLL_CPUIDLE)		+= cpuidle-arm-polling.o
>  obj-$(CONFIG_ARM_PSCI_CPUIDLE)		+= cpuidle-psci.o
>  obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN)	+= cpuidle-psci-domain.o
>  obj-$(CONFIG_ARM_TEGRA_CPUIDLE)		+= cpuidle-tegra.o
> diff --git a/drivers/cpuidle/cpuidle-arm-polling.c b/drivers/cpuidle/cpuidle-arm-polling.c
> new file mode 100644
> index 000000000000..bca128568114
> --- /dev/null
> +++ b/drivers/cpuidle/cpuidle-arm-polling.c
> @@ -0,0 +1,171 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * ARM64 CPU idle driver using wfe polling
> + *
> + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights reserved.
> + *
> + * Authors:
> + *   Haris Okanovic <harisokn@amazon.com>
> + *   Brian Silver <silverbr@amazon.com>
> + *
> + * Based on cpuidle-arm.c
> + * Copyright (C) 2014 ARM Ltd.
> + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> + */
> +
> +#include <linux/cpu.h>
> +#include <linux/cpu_cooling.h>
> +#include <linux/cpuidle.h>
> +#include <linux/sched/clock.h>
> +
> +#include <asm/cpuidle.h>
> +#include <asm/readex.h>
> +
> +#include "dt_idle_states.h"
> +
> +/* Max duration of the wfe() poll loop in us, before transitioning to
> + * arch_cpu_idle()/wfi() sleep.
> + */

/*
 * Comments should have the leading '/*' on a separate line.
 * See https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
 */

> +#define DEFAULT_POLL_LIMIT_US 100
> +static unsigned int poll_limit __read_mostly = DEFAULT_POLL_LIMIT_US;
> +
> +/*
> + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule is
> + * needed or timeout
> + */
> +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
> +				struct cpuidle_driver *drv, int idx)
> +{
> +	u64 time_start, time_limit;
> +
> +	time_start = local_clock();
> +	dev->poll_time_limit = false;
> +
> +	local_irq_enable();

Why enable IRQs here? We don't do that in the regular cpuidle-arm driver, nor
the cpuidle-psci driver, and there's no explanation here or in the commit message.

How does this interact with RCU? Is that still watching or are we in an
extended quiescent state? For PSCI idle states we enter an EQS, and it seems
like we probably should here...

> +
> +	if (current_set_polling_and_test())
> +		goto end;
> +
> +	time_limit = cpuidle_poll_time(drv, dev);
> +
> +	do {
> +		// exclusive read arms the monitor for wfe
> +		if (__READ_ONCE_EX(current_thread_info()->flags) & _TIF_NEED_RESCHED)
> +			goto end;
> +
> +		// may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
> +		wfe();
> +	} while (local_clock() - time_start < time_limit);

... and if the EVTSTREAM is disabled, we'll sit in WFE forever rather than
entering a deeper idle state, which doesn't seem desirable.

It's worth noting that now that we have WFET, we'll probably want to disable
the EVTSTREAM by default at some point, at least in some configurations, since
that'll be able to sit in a WFE state for longer while also reliably waking up
when required.

I suspect we want something like an smp_load_acquire_timeout() here to do the
wait in arch code (allowing us to use WFET), and enabling this state will
depend on either having WFET or EVTSTREAM.

> +
> +	dev->poll_time_limit = true;
> +
> +end:
> +	current_clr_polling();
> +	return idx;
> +}
> +
> +/*
> + * arm_idle_wfi - Places cpu in lower power state until interrupt,
> + * a fallback to polling
> + */
> +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
> +				struct cpuidle_driver *drv, int idx)
> +{
> +	if (current_clr_polling_and_test()) {
> +		local_irq_enable();
> +		return idx;
> +	}

Same as above, why enable IRQs here?

> +	arch_cpu_idle();
> +	return idx;

... and if we need to enable IRQs in the other cases above, why do we *not*
need to enable them here?

> +}
> +
> +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
> +	.name = "arm_poll_idle",
> +	.owner = THIS_MODULE,
> +	.states = {
> +		{
> +			.enter			= arm_idle_wfe_poll,
> +			.exit_latency		= 0,
> +			.target_residency	= 0,
> +			.exit_latency_ns	= 0,
> +			.power_usage		= UINT_MAX,
> +			.flags			= CPUIDLE_FLAG_POLLING,
> +			.name			= "WFE",
> +			.desc			= "ARM WFE",
> +		},
> +		{
> +			.enter			= arm_idle_wfi,
> +			.exit_latency		= DEFAULT_POLL_LIMIT_US,
> +			.target_residency	= DEFAULT_POLL_LIMIT_US,
> +			.power_usage		= UINT_MAX,
> +			.name			= "WFI",
> +			.desc			= "ARM WFI",
> +		},
> +	},
> +	.state_count = 2,
> +};

How does this interact with the existing driver?

How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?

> +
> +/*
> + * arm_poll_init_cpu - Initializes arm cpuidle polling driver for one cpu
> + */
> +static int __init arm_poll_init_cpu(int cpu)
> +{
> +	int ret;
> +	struct cpuidle_driver *drv;
> +
> +	drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->cpumask = (struct cpumask *)cpumask_of(cpu);
> +	drv->states[1].exit_latency = poll_limit;
> +	drv->states[1].target_residency = poll_limit;
> +
> +	ret = cpuidle_register(drv, NULL);
> +	if (ret) {
> +		pr_err("failed to register driver: %d, cpu %d\n", ret, cpu);
> +		goto out_kfree_drv;
> +	}
> +
> +	pr_info("registered driver cpu %d\n", cpu);

This does not need to be printed for each CPU.

Mark.

> +
> +	cpuidle_cooling_register(drv);
> +
> +	return 0;
> +
> +out_kfree_drv:
> +	kfree(drv);
> +	return ret;
> +}
> +
> +/*
> + * arm_poll_init - Initializes arm cpuidle polling driver
> + */
> +static int __init arm_poll_init(void)
> +{
> +	int cpu, ret;
> +	struct cpuidle_driver *drv;
> +	struct cpuidle_device *dev;
> +
> +	for_each_possible_cpu(cpu) {
> +		ret = arm_poll_init_cpu(cpu);
> +		if (ret)
> +			goto out_fail;
> +	}
> +
> +	return 0;
> +
> +out_fail:
> +	pr_info("de-register all");
> +	while (--cpu >= 0) {
> +		dev = per_cpu(cpuidle_devices, cpu);
> +		drv = cpuidle_get_cpu_driver(dev);
> +		cpuidle_unregister(drv);
> +		kfree(drv);
> +	}
> +
> +	return ret;
> +}
> +
> +module_param(poll_limit, uint, 0444);
> +device_initcall(arm_poll_init);
> -- 
> 2.34.1
> 
> 

^ permalink raw reply

* Re: [PATCH 2/3] arm64: add __READ_ONCE_EX()
From: Mark Rutland @ 2024-04-02 16:48 UTC (permalink / raw)
  To: Haris Okanovic; +Cc: linux-kernel, linux-pm, linux-assembly, peterz
In-Reply-To: <20240402014706.3969151-2-harisokn@amazon.com>

On Mon, Apr 01, 2024 at 08:47:05PM -0500, Haris Okanovic wrote:
> Perform an exclusive load, which atomically loads a word and arms the
> execusive monitor to enable wfe() polling of an address.
> 
> Adding this macro in preparation for an arm64 cpuidle driver which
> supports a wfe() based polling state.
> 
> https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
> 
> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
>  arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>  create mode 100644 arch/arm64/include/asm/readex.h
> 
> diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> new file mode 100644
> index 000000000000..51963c3107e1
> --- /dev/null
> +++ b/arch/arm64/include/asm/readex.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Based on arch/arm64/include/asm/rwonce.h
> + *
> + * Copyright (C) 2020 Google LLC.
> + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> + */
> +
> +#ifndef __ASM_READEX_H
> +#define __ASM_READEX_H
> +
> +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> +
> +#define __READ_ONCE_EX(x)						\
> +({									\
> +	typeof(&(x)) __x = &(x);					\
> +	int atomic = 1;							\
> +	union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u;	\
> +	switch (sizeof(x)) {						\
> +	case 1:								\
> +		asm volatile(__LOAD_EX(b, %w0, %1)			\
> +			: "=r" (*(__u8 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 2:								\
> +		asm volatile(__LOAD_EX(h, %w0, %1)			\
> +			: "=r" (*(__u16 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 4:								\
> +		asm volatile(__LOAD_EX(, %w0, %1)			\
> +			: "=r" (*(__u32 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 8:								\
> +		asm volatile(__LOAD_EX(, %0, %1)			\
> +			: "=r" (*(__u64 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	default:							\
> +		atomic = 0;						\
> +	}								\
> +	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> +})

Why can't you use the existing smp_cond_load_relaxed() or
smp_cond_load_acquire()?

I don't believe this is necessary.

Mark.

^ permalink raw reply

* Re: [PATCH v2 12/15] dt-bindings: thermal: mediatek: Add LVTS thermal controller definition for MT8188
From: Rob Herring @ 2024-04-02 16:23 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Nicolas Pitre, Daniel Lezcano, AngeloGioacchino Del Regno,
	devicetree, linux-pm, linux-mediatek
In-Reply-To: <20240402032729.2736685-13-nico@fluxnic.net>


On Mon, 01 Apr 2024 23:25:46 -0400, Nicolas Pitre wrote:
> From: Nicolas Pitre <npitre@baylibre.com>
> 
> Add LVTS thermal controller definition for MT8188.
> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> ---
>  .../bindings/thermal/mediatek,lvts-thermal.yaml  |  4 ++++
>  .../dt-bindings/thermal/mediatek,lvts-thermal.h  | 16 ++++++++++++++++
>  2 files changed, 20 insertions(+)
> 

Reviewed-by: Rob Herring <robh@kernel.org>


^ permalink raw reply

* [PATCH v4 4/4] soc: samsung: exynos-asv: Update Energy Model after adjusting voltage
From: Lukasz Luba @ 2024-04-02 15:58 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, linux-arm-kernel, sboyd, nm,
	linux-samsung-soc, daniel.lezcano, viresh.kumar,
	krzysztof.kozlowski, alim.akhtar, m.szyprowski, mhiramat
In-Reply-To: <20240402155822.505491-1-lukasz.luba@arm.com>

When the voltage for OPPs is adjusted there is a need to also update
Energy Model framework. The EM data contains power values which depend
on voltage values. The EM structure is used for thermal (IPA governor)
and in scheduler task placement (EAS) so it should reflect the real HW
model as best as possible to operate properly.

Based on data on Exynos5422 ASV tables the maximum power difference might
be ~29%. An Odroid-XU4 (with a random sample SoC in this chip lottery)
showed power difference for some OPPs ~20%. Therefore, it's worth to
update the EM.

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/soc/samsung/exynos-asv.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/soc/samsung/exynos-asv.c b/drivers/soc/samsung/exynos-asv.c
index d60af8acc3916..d6d003e3a81ab 100644
--- a/drivers/soc/samsung/exynos-asv.c
+++ b/drivers/soc/samsung/exynos-asv.c
@@ -11,6 +11,7 @@
 
 #include <linux/cpu.h>
 #include <linux/device.h>
+#include <linux/energy_model.h>
 #include <linux/errno.h>
 #include <linux/of.h>
 #include <linux/pm_opp.h>
@@ -97,9 +98,17 @@ static int exynos_asv_update_opps(struct exynos_asv *asv)
 			last_opp_table = opp_table;
 
 			ret = exynos_asv_update_cpu_opps(asv, cpu);
-			if (ret < 0)
+			if (!ret) {
+				/*
+				 * When the voltage for OPPs could be changed,
+				 * make sure to update the EM power values, to
+				 * reflect the reality and not use stale data.
+				 */
+				em_dev_update_chip_binning(cpu);
+			} else {
 				dev_err(asv->dev, "Couldn't udate OPPs for cpu%d\n",
 					cpuid);
+			}
 		}
 
 		dev_pm_opp_put_opp_table(opp_table);
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 3/4] PM: EM: Add em_dev_update_chip_binning()
From: Lukasz Luba @ 2024-04-02 15:58 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, linux-arm-kernel, sboyd, nm,
	linux-samsung-soc, daniel.lezcano, viresh.kumar,
	krzysztof.kozlowski, alim.akhtar, m.szyprowski, mhiramat
In-Reply-To: <20240402155822.505491-1-lukasz.luba@arm.com>

Add a function which allows to modify easily the EM after the new voltage
information is available. The device drivers for the chip can adjust
the voltage values after setup. The voltage for the same frequency in OPP
can be different due to chip binning. The voltage impacts the power usage
and the EM power values can be updated to reflect that.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h |  5 ++++
 kernel/power/energy_model.c  | 48 ++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 770755df852f1..d30d67c2f07cf 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -172,6 +172,7 @@ struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);
 void em_table_free(struct em_perf_table __rcu *table);
 int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
 			 int nr_states);
+int em_dev_update_chip_binning(struct device *dev);
 
 /**
  * em_pd_get_efficient_state() - Get an efficient performance state from the EM
@@ -387,6 +388,10 @@ int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
 {
 	return -EINVAL;
 }
+static inline int em_dev_update_chip_binning(struct device *dev)
+{
+	return -EINVAL;
+}
 #endif
 
 #endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 6960dd7393b2d..927cc55ba0b3d 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -808,3 +808,51 @@ static void em_update_workfn(struct work_struct *work)
 {
 	em_check_capacity_update();
 }
+
+/**
+ * em_dev_update_chip_binning() - Update Energy Model after the new voltage
+ *				information is present in the OPPs.
+ * @dev		: Device for which the Energy Model has to be updated.
+ *
+ * This function allows to update easily the EM with new values available in
+ * the OPP framework and DT. It can be used after the chip has been properly
+ * verified by device drivers and the voltages adjusted for the 'chip binning'.
+ */
+int em_dev_update_chip_binning(struct device *dev)
+{
+	struct em_perf_table __rcu *em_table;
+	struct em_perf_domain *pd;
+	int i, ret;
+
+	if (IS_ERR_OR_NULL(dev))
+		return -EINVAL;
+
+	pd = em_pd_get(dev);
+	if (!pd) {
+		dev_warn(dev, "Couldn't find Energy Model\n");
+		return -EINVAL;
+	}
+
+	em_table = em_table_dup(pd);
+	if (!em_table) {
+		dev_warn(dev, "EM: allocation failed\n");
+		return -ENOMEM;
+	}
+
+	/* Update power values which might change due to new voltage in OPPs */
+	for (i = 0; i < pd->nr_perf_states; i++) {
+		unsigned long freq = em_table->state[i].frequency;
+		unsigned long power;
+
+		ret = dev_pm_opp_calc_power(dev, &power, &freq);
+		if (ret) {
+			em_table_free(em_table);
+			return ret;
+		}
+
+		em_table->state[i].power = power;
+	}
+
+	return em_recalc_and_update(dev, pd, em_table);
+}
+EXPORT_SYMBOL_GPL(em_dev_update_chip_binning);
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 2/4] PM: EM: Refactor em_adjust_new_capacity()
From: Lukasz Luba @ 2024-04-02 15:58 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, linux-arm-kernel, sboyd, nm,
	linux-samsung-soc, daniel.lezcano, viresh.kumar,
	krzysztof.kozlowski, alim.akhtar, m.szyprowski, mhiramat
In-Reply-To: <20240402155822.505491-1-lukasz.luba@arm.com>

Extract em_table_dup() and em_recalc_and_update() from
em_adjust_new_capacity(). Both functions will be later reused by the
'update EM due to chip binning' functionality.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 58 +++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 19 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 9e1c9aa399ea9..6960dd7393b2d 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -674,23 +674,15 @@ void em_dev_unregister_perf_domain(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain);
 
-/*
- * Adjustment of CPU performance values after boot, when all CPUs capacites
- * are correctly calculated.
- */
-static void em_adjust_new_capacity(struct device *dev,
-				   struct em_perf_domain *pd,
-				   u64 max_cap)
+static struct em_perf_table __rcu *em_table_dup(struct em_perf_domain *pd)
 {
 	struct em_perf_table __rcu *em_table;
 	struct em_perf_state *ps, *new_ps;
-	int ret, ps_size;
+	int ps_size;
 
 	em_table = em_table_alloc(pd);
-	if (!em_table) {
-		dev_warn(dev, "EM: allocation failed\n");
-		return;
-	}
+	if (!em_table)
+		return NULL;
 
 	new_ps = em_table->state;
 
@@ -702,24 +694,52 @@ static void em_adjust_new_capacity(struct device *dev,
 
 	rcu_read_unlock();
 
-	em_init_performance(dev, pd, new_ps, pd->nr_perf_states);
-	ret = em_compute_costs(dev, new_ps, NULL, pd->nr_perf_states,
+	return em_table;
+}
+
+static int em_recalc_and_update(struct device *dev, struct em_perf_domain *pd,
+				struct em_perf_table __rcu *em_table)
+{
+	int ret;
+
+	ret = em_compute_costs(dev, em_table->state, NULL, pd->nr_perf_states,
 			       pd->flags);
-	if (ret) {
-		dev_warn(dev, "EM: compute costs failed\n");
-		return;
-	}
+	if (ret)
+		goto free_em_table;
 
 	ret = em_dev_update_perf_domain(dev, em_table);
 	if (ret)
-		dev_warn(dev, "EM: update failed %d\n", ret);
+		goto free_em_table;
 
 	/*
 	 * This is one-time-update, so give up the ownership in this updater.
 	 * The EM framework has incremented the usage counter and from now
 	 * will keep the reference (then free the memory when needed).
 	 */
+free_em_table:
 	em_table_free(em_table);
+	return ret;
+}
+
+/*
+ * Adjustment of CPU performance values after boot, when all CPUs capacites
+ * are correctly calculated.
+ */
+static void em_adjust_new_capacity(struct device *dev,
+				   struct em_perf_domain *pd,
+				   u64 max_cap)
+{
+	struct em_perf_table __rcu *em_table;
+
+	em_table = em_table_dup(pd);
+	if (!em_table) {
+		dev_warn(dev, "EM: allocation failed\n");
+		return;
+	}
+
+	em_init_performance(dev, pd, em_table->state, pd->nr_perf_states);
+
+	em_recalc_and_update(dev, pd, em_table);
 }
 
 static void em_check_capacity_update(void)
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 1/4] OPP: OF: Export dev_opp_pm_calc_power() for usage from EM
From: Lukasz Luba @ 2024-04-02 15:58 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, linux-arm-kernel, sboyd, nm,
	linux-samsung-soc, daniel.lezcano, viresh.kumar,
	krzysztof.kozlowski, alim.akhtar, m.szyprowski, mhiramat
In-Reply-To: <20240402155822.505491-1-lukasz.luba@arm.com>

There are device drivers which can modify voltage values for OPPs. It
could be due to the chip binning and those drivers have specific chip
knowledge about it. This adjustment can happen after Energy Model is
registered, thus EM can have stale data about power.

Export dev_opp_pm_calc_power() which can be used by Energy Model to
calculate new power with the new voltage for OPPs.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/opp/of.c       | 17 ++++++++++++-----
 include/linux/pm_opp.h |  8 ++++++++
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/opp/of.c b/drivers/opp/of.c
index f9f0b22bccbb4..282eb5966fd03 100644
--- a/drivers/opp/of.c
+++ b/drivers/opp/of.c
@@ -1494,20 +1494,26 @@ _get_dt_power(struct device *dev, unsigned long *uW, unsigned long *kHz)
 	return 0;
 }
 
-/*
- * Callback function provided to the Energy Model framework upon registration.
+/**
+ * dev_pm_opp_calc_power() - Calculate power value for device with EM
+ * @dev		: Device for which an Energy Model has to be registered
+ * @uW		: New power value that is calculated
+ * @kHz		: Frequency for which the new power is calculated
+ *
  * This computes the power estimated by @dev at @kHz if it is the frequency
  * of an existing OPP, or at the frequency of the first OPP above @kHz otherwise
  * (see dev_pm_opp_find_freq_ceil()). This function updates @kHz to the ceiled
  * frequency and @uW to the associated power. The power is estimated as
  * P = C * V^2 * f with C being the device's capacitance and V and f
  * respectively the voltage and frequency of the OPP.
+ * It is also used as a callback function provided to the Energy Model
+ * framework upon registration.
  *
  * Returns -EINVAL if the power calculation failed because of missing
  * parameters, 0 otherwise.
  */
-static int __maybe_unused _get_power(struct device *dev, unsigned long *uW,
-				     unsigned long *kHz)
+int dev_pm_opp_calc_power(struct device *dev, unsigned long *uW,
+			  unsigned long *kHz)
 {
 	struct dev_pm_opp *opp;
 	struct device_node *np;
@@ -1544,6 +1550,7 @@ static int __maybe_unused _get_power(struct device *dev, unsigned long *uW,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(dev_pm_opp_calc_power);
 
 static bool _of_has_opp_microwatt_property(struct device *dev)
 {
@@ -1619,7 +1626,7 @@ int dev_pm_opp_of_register_em(struct device *dev, struct cpumask *cpus)
 		goto failed;
 	}
 
-	EM_SET_ACTIVE_POWER_CB(em_cb, _get_power);
+	EM_SET_ACTIVE_POWER_CB(em_cb, dev_pm_opp_calc_power);
 
 register_em:
 	ret = em_dev_register_perf_domain(dev, nr_opp, &em_cb, cpus, true);
diff --git a/include/linux/pm_opp.h b/include/linux/pm_opp.h
index 065a47382302c..31370deb9905f 100644
--- a/include/linux/pm_opp.h
+++ b/include/linux/pm_opp.h
@@ -476,6 +476,8 @@ struct device_node *dev_pm_opp_get_of_node(struct dev_pm_opp *opp);
 int of_get_required_opp_performance_state(struct device_node *np, int index);
 int dev_pm_opp_of_find_icc_paths(struct device *dev, struct opp_table *opp_table);
 int dev_pm_opp_of_register_em(struct device *dev, struct cpumask *cpus);
+int dev_pm_opp_calc_power(struct device *dev, unsigned long *uW,
+			  unsigned long *kHz);
 static inline void dev_pm_opp_of_unregister_em(struct device *dev)
 {
 	em_dev_unregister_perf_domain(dev);
@@ -539,6 +541,12 @@ static inline void dev_pm_opp_of_unregister_em(struct device *dev)
 {
 }
 
+static inline int dev_pm_opp_calc_power(struct device *dev, unsigned long *uW,
+			  unsigned long *kHz)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int of_get_required_opp_performance_state(struct device_node *np, int index)
 {
 	return -EOPNOTSUPP;
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 0/4] Update Energy Model after chip binning adjusted voltages
From: Lukasz Luba @ 2024-04-02 15:58 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, linux-arm-kernel, sboyd, nm,
	linux-samsung-soc, daniel.lezcano, viresh.kumar,
	krzysztof.kozlowski, alim.akhtar, m.szyprowski, mhiramat

Hi all,

This is a follow-up patch aiming to add EM modification due to chip binning.
The first RFC and the discussion can be found here [1].

It uses Exynos chip driver code as a 1st user. The EM framework has been
extended to handle this use case easily, when the voltage has been changed
after setup. On my Odroid-xu4 in some OPPs I can observe ~20% power difference.
According to that data in driver tables it could be up to ~29%.

This chip binning is applicable to a lot of SoCs, so the EM framework should
make it easy to update. It uses the existing OPP and DT information to
re-calculate the new power values.

It has dependency on Exynos SoC driver tree.

Changes:
v4:
- added asterisk in the comment section (test robot)
- change the patch 2/4 header name and use 'Refactor'
v3:
- updated header description patch 2/4 (Dietmar)
- removed 2 sentences from comment and adjusted in patch 3/4 (Dietmar)
- patch 4/4 re-phrased code comment (Dietmar)
- collected tags (Krzysztof, Viresh)
v2:
- removed 'ret' from error message which wasn't initialized (Christian)
v1:
- exported the OPP calculation function from the OPP/OF so it can be
  used from EM fwk (Viresh)
- refactored EM updating function to re-use common code
- added new EM function which can be used by chip device drivers which
  modify the voltage in OPPs
RFC is at [1]

Regards,
Lukasz Luba

[1] https://lore.kernel.org/lkml/20231220110339.1065505-1-lukasz.luba@arm.com/

Lukasz Luba (4):
  OPP: OF: Export dev_opp_pm_calc_power() for usage from EM
  PM: EM: Refactor em_adjust_new_capacity()
  PM: EM: Add em_dev_update_chip_binning()
  soc: samsung: exynos-asv: Update Energy Model after adjusting voltage

 drivers/opp/of.c                 |  17 +++--
 drivers/soc/samsung/exynos-asv.c |  11 +++-
 include/linux/energy_model.h     |   5 ++
 include/linux/pm_opp.h           |   8 +++
 kernel/power/energy_model.c      | 106 +++++++++++++++++++++++++------
 5 files changed, 122 insertions(+), 25 deletions(-)

-- 
2.25.1


^ permalink raw reply

* Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous
From: Waiman Long @ 2024-04-02 15:30 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Thomas Gleixner,
	Peter Zijlstra, Rafael J. Wysocki, Len Brown, Pavel Machek,
	Shuah Khan, linux-kernel, cgroups, linux-pm, linux-kselftest,
	Frederic Weisbecker, Paul E. McKenney, Ingo Molnar,
	Valentin Schneider, Anna-Maria Behnsen, Alex Shi, Vincent Guittot,
	Barry Song
In-Reply-To: <kce74bx6aafxfuw5yovaschym4ze4kommfk74eq5totojytest@mdxnfvl2kdol>


On 4/2/24 10:13, Michal Koutný wrote:
> Hello Waiman.
>
> (I have no opinion on the overall locking reworks, only the bits about
> v1 migrations caught my attention.)
>
> On Mon, Apr 01, 2024 at 10:58:57AM -0400, Waiman Long <longman@redhat.com> wrote:
> ...
>> @@ -4383,12 +4377,20 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
>>   	/*
>>   	 * Move tasks to the nearest ancestor with execution resources,
>>   	 * This is full cgroup operation which will also call back into
>> -	 * cpuset. Should be done outside any lock.
>> +	 * cpuset. Execute it asynchronously using workqueue.
>                     ...to avoid deadlock on cpus_read_lock
>
> Is this the reason?
> Also, what happens with the tasks in the window till the migration
> happens?
> Is it handled gracefully that their cpu is gone?

Yes, there is a potential that a cpus_read_lock() may be called leading 
to deadlock. So unless we reverse the current cgroup_mutex --> 
cpu_hotplug_lock ordering, it is not safe to call 
cgroup_transfer_tasks() directly.


>
>
>> -	if (is_empty) {
>> -		mutex_unlock(&cpuset_mutex);
>> -		remove_tasks_in_empty_cpuset(cs);
>> -		mutex_lock(&cpuset_mutex);
>> +	if (is_empty && css_tryget_online(&cs->css)) {
>> +		struct cpuset_remove_tasks_struct *s;
>> +
>> +		s = kzalloc(sizeof(*s), GFP_KERNEL);
> Is there a benefit of having a work for each cpuset?
> Instead of traversing whole top_cpuset once in the async work.

We could do that too. It's just that we have the repeat the iteration 
process once the workfn is invoked, but that has the advantage of not 
needing to do memory allocation. I am OK with either way. Let's see what 
other folks think about that.

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH v6 2/6] interconnect: icc-clk: Remove tristate from INTERCONNECT_CLK
From: Konrad Dybcio @ 2024-04-02 14:49 UTC (permalink / raw)
  To: Dmitry Baryshkov, Varadarajan Narayanan
  Cc: andersson, mturquette, sboyd, robh, krzysztof.kozlowski+dt,
	conor+dt, djakov, quic_anusha, linux-arm-msm, linux-clk,
	devicetree, linux-kernel, linux-pm, kernel test robot
In-Reply-To: <CAA8EJppyuagb5zkP4LCjjJwConw3mw3iS-+dO7YB01=7-waRTw@mail.gmail.com>

On 2.04.2024 12:39 PM, Dmitry Baryshkov wrote:
> On Tue, 2 Apr 2024 at 13:34, Varadarajan Narayanan
> <quic_varada@quicinc.com> wrote:
>>
>> drivers/clk/qcom/common.c uses devm_icc_clk_register under
>> IS_ENABLED(CONFIG_INTERCONNECT_CLK). However, in kernel bot
>> random config build test, with the following combination
>>
>>         CONFIG_COMMON_CLK_QCOM=y
>>                 and
>>         CONFIG_INTERCONNECT_CLK=m
>>
>> the following error is seen as devm_icc_clk_register is in a
>> module and being referenced from vmlinux.
>>
>>         powerpc64-linux-ld: drivers/clk/qcom/common.o: in function `qcom_cc_really_probe':
>>         >> common.c:(.text+0x980): undefined reference to `devm_icc_clk_register'
>>
>> Hence, ensure INTERCONNECT_CLK is not selected as a module.
> 
> NAK. Please use `depends on INTERCONNECT_CLK || !INTERCONNECT_CLK` in
> your Kconfig dependencies.

Should icc-clk ever be built as a module? It really seems like it should be
a part of the core framework.. And dependency management would be easier

Konrad

^ permalink raw reply

* Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous
From: Michal Koutný @ 2024-04-02 14:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Thomas Gleixner,
	Peter Zijlstra, Rafael J. Wysocki, Len Brown, Pavel Machek,
	Shuah Khan, linux-kernel, cgroups, linux-pm, linux-kselftest,
	Frederic Weisbecker, Paul E. McKenney, Ingo Molnar,
	Valentin Schneider, Anna-Maria Behnsen, Alex Shi, Vincent Guittot,
	Barry Song
In-Reply-To: <20240401145858.2656598-2-longman@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

Hello Waiman.

(I have no opinion on the overall locking reworks, only the bits about
v1 migrations caught my attention.)

On Mon, Apr 01, 2024 at 10:58:57AM -0400, Waiman Long <longman@redhat.com> wrote:
...
> @@ -4383,12 +4377,20 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
>  	/*
>  	 * Move tasks to the nearest ancestor with execution resources,
>  	 * This is full cgroup operation which will also call back into
> -	 * cpuset. Should be done outside any lock.
> +	 * cpuset. Execute it asynchronously using workqueue.

                   ...to avoid deadlock on cpus_read_lock

Is this the reason?
Also, what happens with the tasks in the window till the migration
happens?
Is it handled gracefully that their cpu is gone?


> -	if (is_empty) {
> -		mutex_unlock(&cpuset_mutex);
> -		remove_tasks_in_empty_cpuset(cs);
> -		mutex_lock(&cpuset_mutex);
> +	if (is_empty && css_tryget_online(&cs->css)) {
> +		struct cpuset_remove_tasks_struct *s;
> +
> +		s = kzalloc(sizeof(*s), GFP_KERNEL);

Is there a benefit of having a work for each cpuset?
Instead of traversing whole top_cpuset once in the async work.


Thanks,
Michal


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v6 3/6] interconnect: icc-clk: Add devm_icc_clk_register
From: Dmitry Baryshkov @ 2024-04-02 12:12 UTC (permalink / raw)
  To: Varadarajan Narayanan
  Cc: andersson, konrad.dybcio, mturquette, sboyd, robh,
	krzysztof.kozlowski+dt, conor+dt, djakov, quic_anusha,
	linux-arm-msm, linux-clk, devicetree, linux-kernel, linux-pm
In-Reply-To: <ZgvqkhF2mTG82Rx2@hu-varada-blr.qualcomm.com>

On Tue, 2 Apr 2024 at 14:23, Varadarajan Narayanan
<quic_varada@quicinc.com> wrote:
>
> On Tue, Apr 02, 2024 at 02:16:56PM +0300, Dmitry Baryshkov wrote:
> > On Tue, 2 Apr 2024 at 14:02, Varadarajan Narayanan
> > <quic_varada@quicinc.com> wrote:
> > >
> > > On Tue, Apr 02, 2024 at 01:48:08PM +0300, Dmitry Baryshkov wrote:
> > > > On Tue, 2 Apr 2024 at 13:40, Dmitry Baryshkov
> > > > <dmitry.baryshkov@linaro.org> wrote:
> > > > >
> > > > > On Tue, 2 Apr 2024 at 13:34, Varadarajan Narayanan
> > > > > <quic_varada@quicinc.com> wrote:
> > > > > >
> > > > > > Wrap icc_clk_register to create devm_icc_clk_register to be
> > > > > > able to release the resources properly.
> > > > > >
> > > > > > Signed-off-by: Varadarajan Narayanan <quic_varada@quicinc.com>
> > > > > > ---
> > > > > > v5: Introduced devm_icc_clk_register
> > > > > > ---
> > > > > >  drivers/interconnect/icc-clk.c   | 29 +++++++++++++++++++++++++++++
> > > > > >  include/linux/interconnect-clk.h |  4 ++++
> > > > > >  2 files changed, 33 insertions(+)
> > > > >
> > > > > Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> > > >
> > > > Wait. Actually,
> > > >
> > > > Unreviewed-by: me
> > > >
> > > > Please return int from devm_icc_clk_register instead of returning the pointer.
> > >
> > > Wouldn't returning int break the general assumption that
> > > devm_foo(), returns the same type as foo(). For example
> > > devm_clk_hw_get_clk and clk_hw_get_clk return struct clk *?
> >
> > Not always. The only reason to return icc_provider was to make it
> > possible to destroy it. With devres-managed function you don't have to
> > do anything.
>
> Ok. Will change as follows
>
>         return prov; -> return PTR_ERR_OR_ZERO(prov);
>

I think the code might become simpler if you first allocate the ICC
provider and then just 'return devm_add_action_or_reset(dev,
your_icc_clk_release, provider)'


-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: [PATCH 1/1] platform/x86/intel/hid: Don't wake on 5-button releases
From: Hans de Goede @ 2024-04-02 11:36 UTC (permalink / raw)
  To: David McFarland, linux-pm, Ilpo Järvinen,
	platform-driver-x86@vger.kernel.org
  Cc: Rafael J . Wysocki, Enrik Berkhan
In-Reply-To: <20240318191153.6978-2-corngood@gmail.com>

Hi David,

Thank you for your patch.

I became aware of this patch because the discussion on the regressions
list. Next time for patches to files under drivers/platform/x86/
at a minimum please Cc: platform-driver-x86@vger.kernel.org and
preferably also send the patch directly to me and Ilpo as shown
by get_maintainer.pl:

[hans@shalem linux]$ scripts/get_maintainer.pl -f drivers/platform/x86/intel/vbtn.c
AceLan Kao <acelan.kao@canonical.com> (maintainer:INTEL VIRTUAL BUTTON DRIVER)
Hans de Goede <hdegoede@redhat.com> (maintainer:X86 PLATFORM DRIVERS)
"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com> (maintainer:X86 PLATFORM DRIVERS)
platform-driver-x86@vger.kernel.org (open list:INTEL VIRTUAL BUTTON DRIVER)
linux-kernel@vger.kernel.org (open list)

Please make sure you are sending this to the right people for v2.

On 3/18/24 8:11 PM, David McFarland wrote:
> If, for example, the power button is configured to suspend, holding it
> and releasing it after the machine has suspended, will wake the machine.
> 
> Also on some machines, power button release events are sent during
> hibernation, even if the button wasn't used to hibernate the machine.
> This causes hibernation to be aborted.
> 

As discussed by Thorsten this needs a fixes tag, to help with backporting
it to relevant stable kernels. Also for v2 please don't forget to add
Enrik's Tested-by from elsewhere in thread:

Tested-by: Enrik Berkhan <Enrik.Berkhan@inka.de>



> Signed-off-by: David McFarland <corngood@gmail.com>
> ---
>  drivers/platform/x86/intel/hid.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/platform/x86/intel/hid.c b/drivers/platform/x86/intel/hid.c
> index 7457ca2b27a6..707de9895965 100644
> --- a/drivers/platform/x86/intel/hid.c
> +++ b/drivers/platform/x86/intel/hid.c
> @@ -504,6 +504,7 @@ static void notify_handler(acpi_handle handle, u32 event, void *context)
>  	struct platform_device *device = context;
>  	struct intel_hid_priv *priv = dev_get_drvdata(&device->dev);
>  	unsigned long long ev_index;
> +	struct key_entry *ke;
>  	int err;
>  
>  	/*
> @@ -545,11 +546,16 @@ static void notify_handler(acpi_handle handle, u32 event, void *context)
>  		if (event == 0xc0 || !priv->array)
>  			return;
>  
> -		if (!sparse_keymap_entry_from_scancode(priv->array, event)) {
> +		ke = sparse_keymap_entry_from_scancode(priv->array, event);
> +

I would prefer for there to be no empty line between the "ke =" assignment
and the "if (!ke)".

Otherwise the patch looks good to me, so for v3
you can add:

Reviewed-by: Hans de Goede <hdegoede@redhat.com>

After fixing this + adding the Fixes and Tested-by tags.

Regards,

Hans




> +		if (!ke) {
>  			dev_info(&device->dev, "unknown event 0x%x\n", event);
>  			return;
>  		}
>  
> +		if (ke->type == KE_IGNORE)
> +			return;
> +
>  wakeup:
>  		pm_wakeup_hard_event(&device->dev);
>  



^ permalink raw reply

* Re: [PATCH v6 3/6] interconnect: icc-clk: Add devm_icc_clk_register
From: Varadarajan Narayanan @ 2024-04-02 11:22 UTC (permalink / raw)
  To: Dmitry Baryshkov
  Cc: andersson, konrad.dybcio, mturquette, sboyd, robh,
	krzysztof.kozlowski+dt, conor+dt, djakov, quic_anusha,
	linux-arm-msm, linux-clk, devicetree, linux-kernel, linux-pm
In-Reply-To: <CAA8EJpp2dgy0DcLoUuo6gz-8ee0RRwJ_mvCLGDbdvF-gVhREFg@mail.gmail.com>

On Tue, Apr 02, 2024 at 02:16:56PM +0300, Dmitry Baryshkov wrote:
> On Tue, 2 Apr 2024 at 14:02, Varadarajan Narayanan
> <quic_varada@quicinc.com> wrote:
> >
> > On Tue, Apr 02, 2024 at 01:48:08PM +0300, Dmitry Baryshkov wrote:
> > > On Tue, 2 Apr 2024 at 13:40, Dmitry Baryshkov
> > > <dmitry.baryshkov@linaro.org> wrote:
> > > >
> > > > On Tue, 2 Apr 2024 at 13:34, Varadarajan Narayanan
> > > > <quic_varada@quicinc.com> wrote:
> > > > >
> > > > > Wrap icc_clk_register to create devm_icc_clk_register to be
> > > > > able to release the resources properly.
> > > > >
> > > > > Signed-off-by: Varadarajan Narayanan <quic_varada@quicinc.com>
> > > > > ---
> > > > > v5: Introduced devm_icc_clk_register
> > > > > ---
> > > > >  drivers/interconnect/icc-clk.c   | 29 +++++++++++++++++++++++++++++
> > > > >  include/linux/interconnect-clk.h |  4 ++++
> > > > >  2 files changed, 33 insertions(+)
> > > >
> > > > Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> > >
> > > Wait. Actually,
> > >
> > > Unreviewed-by: me
> > >
> > > Please return int from devm_icc_clk_register instead of returning the pointer.
> >
> > Wouldn't returning int break the general assumption that
> > devm_foo(), returns the same type as foo(). For example
> > devm_clk_hw_get_clk and clk_hw_get_clk return struct clk *?
>
> Not always. The only reason to return icc_provider was to make it
> possible to destroy it. With devres-managed function you don't have to
> do anything.

Ok. Will change as follows

	return prov; -> return PTR_ERR_OR_ZERO(prov);

Thanks
Varada

^ permalink raw reply

* Re: [PATCH v6 3/6] interconnect: icc-clk: Add devm_icc_clk_register
From: Dmitry Baryshkov @ 2024-04-02 11:16 UTC (permalink / raw)
  To: Varadarajan Narayanan
  Cc: andersson, konrad.dybcio, mturquette, sboyd, robh,
	krzysztof.kozlowski+dt, conor+dt, djakov, quic_anusha,
	linux-arm-msm, linux-clk, devicetree, linux-kernel, linux-pm
In-Reply-To: <ZgvlrbvvPNA6HRiL@hu-varada-blr.qualcomm.com>

On Tue, 2 Apr 2024 at 14:02, Varadarajan Narayanan
<quic_varada@quicinc.com> wrote:
>
> On Tue, Apr 02, 2024 at 01:48:08PM +0300, Dmitry Baryshkov wrote:
> > On Tue, 2 Apr 2024 at 13:40, Dmitry Baryshkov
> > <dmitry.baryshkov@linaro.org> wrote:
> > >
> > > On Tue, 2 Apr 2024 at 13:34, Varadarajan Narayanan
> > > <quic_varada@quicinc.com> wrote:
> > > >
> > > > Wrap icc_clk_register to create devm_icc_clk_register to be
> > > > able to release the resources properly.
> > > >
> > > > Signed-off-by: Varadarajan Narayanan <quic_varada@quicinc.com>
> > > > ---
> > > > v5: Introduced devm_icc_clk_register
> > > > ---
> > > >  drivers/interconnect/icc-clk.c   | 29 +++++++++++++++++++++++++++++
> > > >  include/linux/interconnect-clk.h |  4 ++++
> > > >  2 files changed, 33 insertions(+)
> > >
> > > Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> >
> > Wait. Actually,
> >
> > Unreviewed-by: me
> >
> > Please return int from devm_icc_clk_register instead of returning the pointer.
>
> Wouldn't returning int break the general assumption that
> devm_foo(), returns the same type as foo(). For example
> devm_clk_hw_get_clk and clk_hw_get_clk return struct clk *?

Not always. The only reason to return icc_provider was to make it
possible to destroy it. With devres-managed function you don't have to
do anything.


-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: [PATCH v6 4/6] clk: qcom: common: Add interconnect clocks support
From: Varadarajan Narayanan @ 2024-04-02 11:05 UTC (permalink / raw)
  To: Dmitry Baryshkov
  Cc: andersson, konrad.dybcio, mturquette, sboyd, robh,
	krzysztof.kozlowski+dt, conor+dt, djakov, quic_anusha,
	linux-arm-msm, linux-clk, devicetree, linux-kernel, linux-pm
In-Reply-To: <CAA8EJprP0m53B=g7jafAkfcqAQP4kE2ZvtxPXEe4s7ALjFXGSQ@mail.gmail.com>

On Tue, Apr 02, 2024 at 01:48:14PM +0300, Dmitry Baryshkov wrote:
> On Tue, 2 Apr 2024 at 13:34, Varadarajan Narayanan
> <quic_varada@quicinc.com> wrote:
> >
> > Unlike MSM platforms that manage NoC related clocks and scaling
> > from RPM, IPQ SoCs dont involve RPM in managing NoC related
> > clocks and there is no NoC scaling.
> >
> > However, there is a requirement to enable some NoC interface
> > clocks for accessing the peripheral controllers present on
> > these NoCs. Though exposing these as normal clocks would work,
> > having a minimalistic interconnect driver to handle these clocks
> > would make it consistent with other Qualcomm platforms resulting
> > in common code paths. This is similar to msm8996-cbf's usage of
> > icc-clk framework.
> >
> > Signed-off-by: Varadarajan Narayanan <quic_varada@quicinc.com>
> > ---
> > v6: first_id -> icc_first_node_id
> >     Remove clock get so that the peripheral that uses the clock
> >     can do the clock get
> > v5: Split changes in common.c to separate patch
> >     Fix error handling
> >     Use devm_icc_clk_register instead of icc_clk_register
> > v4: Use clk_hw instead of indices
> >     Do icc register in qcom_cc_probe() call stream
> >     Add icc clock info to qcom_cc_desc structure
> > v3: Use indexed identifiers here to avoid confusion
> >     Fix error messages and move to common.c
> > v2: Move DTS to separate patch
> >     Update commit log
> >     Auto select CONFIG_INTERCONNECT & CONFIG_INTERCONNECT_CLK to fix build error
> > ---
> >  drivers/clk/qcom/common.c | 38 +++++++++++++++++++++++++++++++++++++-
> >  drivers/clk/qcom/common.h |  3 +++
> >  2 files changed, 40 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/clk/qcom/common.c b/drivers/clk/qcom/common.c
> > index 75f09e6e057e..d5c008048994 100644
> > --- a/drivers/clk/qcom/common.c
> > +++ b/drivers/clk/qcom/common.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/regmap.h>
> >  #include <linux/platform_device.h>
> >  #include <linux/clk-provider.h>
> > +#include <linux/interconnect-clk.h>
> >  #include <linux/reset-controller.h>
> >  #include <linux/of.h>
> >
> > @@ -234,6 +235,41 @@ static struct clk_hw *qcom_cc_clk_hw_get(struct of_phandle_args *clkspec,
> >         return cc->rclks[idx] ? &cc->rclks[idx]->hw : NULL;
> >  }
> >
> > +static int qcom_cc_icc_register(struct device *dev,
> > +                               const struct qcom_cc_desc *desc)
> > +{
> > +       struct icc_clk_data *icd;
> > +       int i;
> > +
> > +       if (!IS_ENABLED(CONFIG_INTERCONNECT_CLK))
> > +               return 0;
> > +
> > +       if (!desc->icc_hws)
> > +               return 0;
> > +
> > +       icd = devm_kcalloc(dev, desc->num_icc_hws, sizeof(*icd), GFP_KERNEL);
> > +       if (!icd)
> > +               return -ENOMEM;
> > +
> > +       for (i = 0; i < desc->num_icc_hws; i++) {
> > +               /*
> > +                * get_clk will be done by the peripheral device using this
> > +                * clock with devm_clk_hw_get_clk() so that we can associate
> > +                * the clk handle with the consumer device. It would also help
> > +                * us make it so that drivers defer probe until their
> > +                * clk isn't an orphan.
>
> How the clock instance returned to the peripheral driver is supposed
> to correspond to the clock instance used by the icc-clk?
> > +                */
> > +               icd[i].clk = desc->icc_hws[i]->clk;
>
> You again are abusing clk_hw->clk. Please don't do that.

Ok, will clk_get in both the places.

Thanks
Varada

> > +               if (!icd[i].clk)
> > +                       return dev_err_probe(dev, -ENOENT,
> > +                                            "(%d) clock entry is null\n", i);
> > +               icd[i].name = clk_hw_get_name(desc->icc_hws[i]);
> > +       }
> > +
> > +       return PTR_ERR_OR_ZERO(devm_icc_clk_register(dev, desc->icc_first_node_id,
> > +                                                    desc->num_icc_hws, icd));
> > +}
> > +
> >  int qcom_cc_really_probe(struct platform_device *pdev,
> >                          const struct qcom_cc_desc *desc, struct regmap *regmap)
> >  {
> > @@ -303,7 +339,7 @@ int qcom_cc_really_probe(struct platform_device *pdev,
> >         if (ret)
> >                 return ret;
> >
> > -       return 0;
> > +       return qcom_cc_icc_register(dev, desc);
> >  }
> >  EXPORT_SYMBOL_GPL(qcom_cc_really_probe);
> >
> > diff --git a/drivers/clk/qcom/common.h b/drivers/clk/qcom/common.h
> > index 9c8f7b798d9f..9058ffd46260 100644
> > --- a/drivers/clk/qcom/common.h
> > +++ b/drivers/clk/qcom/common.h
> > @@ -29,6 +29,9 @@ struct qcom_cc_desc {
> >         size_t num_gdscs;
> >         struct clk_hw **clk_hws;
> >         size_t num_clk_hws;
> > +       struct clk_hw **icc_hws;
> > +       size_t num_icc_hws;
> > +       unsigned int icc_first_node_id;
> >  };
> >
> >  /**
> > --
> > 2.34.1
> >
>
>
> --
> With best wishes
>
> Dmitry

^ permalink raw reply

* Re: [PATCH v6 3/6] interconnect: icc-clk: Add devm_icc_clk_register
From: Varadarajan Narayanan @ 2024-04-02 11:02 UTC (permalink / raw)
  To: Dmitry Baryshkov
  Cc: andersson, konrad.dybcio, mturquette, sboyd, robh,
	krzysztof.kozlowski+dt, conor+dt, djakov, quic_anusha,
	linux-arm-msm, linux-clk, devicetree, linux-kernel, linux-pm
In-Reply-To: <CAA8EJpo=TMhu+Te+JE0cQzmjLOTDPi-Vv-h5Bch0Wfr_7iVi2w@mail.gmail.com>

On Tue, Apr 02, 2024 at 01:48:08PM +0300, Dmitry Baryshkov wrote:
> On Tue, 2 Apr 2024 at 13:40, Dmitry Baryshkov
> <dmitry.baryshkov@linaro.org> wrote:
> >
> > On Tue, 2 Apr 2024 at 13:34, Varadarajan Narayanan
> > <quic_varada@quicinc.com> wrote:
> > >
> > > Wrap icc_clk_register to create devm_icc_clk_register to be
> > > able to release the resources properly.
> > >
> > > Signed-off-by: Varadarajan Narayanan <quic_varada@quicinc.com>
> > > ---
> > > v5: Introduced devm_icc_clk_register
> > > ---
> > >  drivers/interconnect/icc-clk.c   | 29 +++++++++++++++++++++++++++++
> > >  include/linux/interconnect-clk.h |  4 ++++
> > >  2 files changed, 33 insertions(+)
> >
> > Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
>
> Wait. Actually,
>
> Unreviewed-by: me
>
> Please return int from devm_icc_clk_register instead of returning the pointer.

Wouldn't returning int break the general assumption that
devm_foo(), returns the same type as foo(). For example
devm_clk_hw_get_clk and clk_hw_get_clk return struct clk *?

Thanks
Varada

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox