* [PATCH v2 bpf-next 1/2] cpuidle: Implement BPF extensible cpuidle governor class
2025-09-01 13:56 [PATCHSET V2 bpf-next 0/2] cpuidle, bpf: Introduce BPF-ext cpuidle governor policy via struct_ops Lin Yikai
@ 2025-09-01 13:56 ` Lin Yikai
2025-09-07 6:15 ` kernel test robot
2025-09-01 13:56 ` [PATCH v2 bpf-next 2/2] selftests/bpf: Add selftests for cpuidle_gov_ext Lin Yikai
2025-09-01 16:36 ` [PATCHSET V2 bpf-next 0/2] cpuidle, bpf: Introduce BPF-ext cpuidle governor policy via struct_ops Rafael J. Wysocki
2 siblings, 1 reply; 8+ messages in thread
From: Lin Yikai @ 2025-09-01 13:56 UTC (permalink / raw)
To: Christian Loehle, Song Liu, Rafael J . Wysocki, Daniel Lezcano,
Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, linux-kselftest, linux-pm, bpf,
Tejun Heo, Lin Yikai
Cc: zhaofuyu
The BPF cpuidle ext governor registers at postcore_initcall()
but remains disabled by default because its priority "rating" is set to the low value of 1.
Activation requires raising its "rating" above that of the other governors from within BPF.
Core Components:
1.**struct cpuidle_gov_ext_ops** – BPF-overridable operations:
- ops.enable()/ops.disable(): per-CPU enable/disable callbacks
- ops.select(): CPU idle-state selection logic
- ops.set_stop_tick(): scheduler tick management after state selection
- ops.reflect(): feedback about the previously entered idle state
- ops.init()/ops.exit(): initialization and cleanup
2.**Critical kfuncs for kernel state access**:
- bpf_cpuidle_ext_gov_update_rating():
Activate the ext governor by raising its rating; must be called from ops.init()
- bpf_cpuidle_ext_gov_latency_req(): get a CPU's idle-state latency constraint
- bpf_tick_nohz_get_sleep_length(): get the expected CPU sleep duration in tickless mode
Signed-off-by: Lin Yikai <yikai.lin@vivo.com>
---
drivers/cpuidle/Kconfig | 12 +
drivers/cpuidle/governors/Makefile | 1 +
drivers/cpuidle/governors/ext.c | 537 +++++++++++++++++++++++++++++
3 files changed, 550 insertions(+)
create mode 100644 drivers/cpuidle/governors/ext.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index cac5997dca50..4f2eac531b0b 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -44,6 +44,18 @@ config CPU_IDLE_GOV_HALTPOLL
Some virtualized workloads benefit from using it.
+config CPU_IDLE_GOV_EXT
+ bool "bpf cpuidle ext governor"
+ depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
+ default y
+ help
+ This implements a cpuidle governor whose idle-state
+ selection policy can be customized by a BPF program
+ without modifying kernel code.
+
+ This benefits scenarios where the cpuidle policy needs
+ to be tailored to user-space requirements.
+
config DT_IDLE_STATES
bool
diff --git a/drivers/cpuidle/governors/Makefile b/drivers/cpuidle/governors/Makefile
index 63abb5393a4d..cd5eaf9f275f 100644
--- a/drivers/cpuidle/governors/Makefile
+++ b/drivers/cpuidle/governors/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o
obj-$(CONFIG_CPU_IDLE_GOV_HALTPOLL) += haltpoll.o
+obj-$(CONFIG_CPU_IDLE_GOV_EXT) += ext.o
diff --git a/drivers/cpuidle/governors/ext.c b/drivers/cpuidle/governors/ext.c
new file mode 100644
index 000000000000..9968ae482899
--- /dev/null
+++ b/drivers/cpuidle/governors/ext.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ext.c - the cpuidle ext governor used by BPF
+ *
+ * Copyright (C) Yikai Lin <yikai.lin@vivo.com>
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/cpuidle.h>
+#include <linux/percpu.h>
+#include <linux/ktime.h>
+#include <linux/cpumask.h>
+#include <linux/slab.h>
+#include <linux/sysfs.h>
+#include <linux/tick.h>
+
+#define EXT_GOV_NAME "ext"
+
+/********************************************************************************
+ * Helpers that can be called from the BPF cpuidle gov.
+ */
+#include <linux/btf_ids.h>
+#include <linux/btf.h>
+
+#include "../cpuidle.h"
+
+static struct cpuidle_governor *cpuidle_last_governor;
+
+/**
+ * restore_cpuidle_last_governor - restore the previous governor after the BPF ext governor exits.
+ */
+static void restore_cpuidle_last_governor(void)
+{
+ bool enabled = false;
+
+ if (cpuidle_curr_governor)
+ enabled = !strncasecmp(cpuidle_curr_governor->name, EXT_GOV_NAME, CPUIDLE_NAME_LEN);
+
+ mutex_lock(&cpuidle_lock);
+ if (enabled && cpuidle_last_governor)
+ if (cpuidle_switch_governor(cpuidle_last_governor))
+ cpuidle_last_governor = NULL;
+ mutex_unlock(&cpuidle_lock);
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_cpuidle_ext_gov_update_rating - update rating of bpf cpuidle ext governor.
+ * @rating: target rating
+ *
+ * The BPF cpuidle ext governor is registered by default
+ * but remains inactive due to its default @rating being set to 1
+ * which is significantly lower than that of other governors.
+ *
+ * To activate it, adjust @rating to a higher value within the BPF program.
+ *
+ * This function should be called from ops.init().
+ */
+__bpf_kfunc int bpf_cpuidle_ext_gov_update_rating(unsigned int rating)
+{
+ int ret = -EINVAL;
+ struct cpuidle_governor *ext_gov;
+
+ ext_gov = cpuidle_find_governor(EXT_GOV_NAME);
+ if (!ext_gov) {
+ ret = -ENOENT;
+ goto clean_up;
+ }
+ mutex_lock(&cpuidle_lock);
+ if (!cpuidle_curr_governor || cpuidle_curr_governor->rating < rating) {
+ cpuidle_last_governor = cpuidle_curr_governor;
+ ret = cpuidle_switch_governor(ext_gov);
+ }
+ mutex_unlock(&cpuidle_lock);
+
+clean_up:
+ return ret;
+}
+
+/**
+ * bpf_cpuidle_ext_gov_latency_req - get target cpu's latency constraint
+ * @cpu: Target CPU
+ *
+ * The BPF program may require this info.
+ */
+__bpf_kfunc s64 bpf_cpuidle_ext_gov_latency_req(unsigned int cpu)
+{
+ return cpuidle_governor_latency_req(cpu);
+}
+
+/**
+ * bpf_tick_nohz_get_sleep_length - return the expected length of the current sleep
+ *
+ * The BPF program may require this info.
+ */
+__bpf_kfunc s64 bpf_tick_nohz_get_sleep_length(void)
+{
+ ktime_t delta_tick;
+
+ return (s64)tick_nohz_get_sleep_length(&delta_tick);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(cpuidle_ext_gov_kfuncs)
+BTF_ID_FLAGS(func, bpf_cpuidle_ext_gov_update_rating)
+BTF_ID_FLAGS(func, bpf_cpuidle_ext_gov_latency_req)
+BTF_ID_FLAGS(func, bpf_tick_nohz_get_sleep_length)
+BTF_KFUNCS_END(cpuidle_ext_gov_kfuncs)
+
+static const struct btf_kfunc_id_set cpuidle_ext_gov_kfuncs_set = {
+ .owner = THIS_MODULE,
+ .set = &cpuidle_ext_gov_kfuncs,
+};
+
+static int cpuidle_gov_kfuncs_init(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &cpuidle_ext_gov_kfuncs_set);
+}
+
+/********************************************************************************
+ * bpf_struct_ops plumbing.
+ */
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+
+#define CPUIDLE_GOV_EXT_NAME_LEN 128
+enum ops_enable_state {
+ OPS_ENABLED,
+ OPS_DISABLED,
+};
+
+static const struct btf_type *cpuidle_device_type;
+static u32 cpuidle_device_type_id;
+static struct cpuidle_gov_ext_ops *ops;
+
+static DEFINE_MUTEX(ops_mutex);
+DEFINE_STATIC_KEY_FALSE(ops_enabled_key);
+static atomic_t ops_enable_state_var = ATOMIC_INIT(OPS_DISABLED);
+
+struct cpuidle_gov_ext_ops {
+ /**
+ * enable - cpuidle ext governor enable
+ * @drv: cpuidle driver containing state data.
+ * @dev: target cpu
+ */
+ int (*enable)(struct cpuidle_driver *drv, struct cpuidle_device *dev);
+
+ /**
+ * disable - cpuidle ext governor disable
+ * @drv: cpuidle driver containing state data.
+ * @dev: target cpu
+ */
+ void (*disable)(struct cpuidle_driver *drv, struct cpuidle_device *dev);
+
+ /**
+ * select - select the next CPU idle state to enter
+ * @drv: cpuidle driver containing state data.
+ * @dev: target cpu
+ */
+ int (*select)(struct cpuidle_driver *drv, struct cpuidle_device *dev);
+
+ /**
+ * set_stop_tick - decide whether to stop the scheduler tick;
+ * called automatically after a cpuidle state has been selected
+ */
+ bool (*set_stop_tick)(void);
+
+ /**
+ * reflect - give the governor an opportunity to reflect on the outcome
+ * @dev: target cpu
+ * @index: last idle state which the target cpu entered
+ */
+ void (*reflect)(struct cpuidle_device *dev, int index);
+
+ /**
+ * init - Initialize the BPF cpuidle governor
+ */
+ int (*init)(void);
+
+ /**
+ * exit - Clean up after the BPF cpuidle governor
+ */
+ void (*exit)(void);
+
+ /**
+ * name - BPF cpuidle governor name
+ */
+ char name[CPUIDLE_GOV_EXT_NAME_LEN];
+};
+
+static enum ops_enable_state get_ops_enable_state(void)
+{
+ return atomic_read(&ops_enable_state_var);
+}
+
+static enum ops_enable_state
+set_ops_enable_state(enum ops_enable_state to)
+{
+ return atomic_xchg(&ops_enable_state_var, to);
+}
+
+static int enable_stub(struct cpuidle_driver *drv, struct cpuidle_device *dev) { return 0; }
+static void disable_stub(struct cpuidle_driver *drv, struct cpuidle_device *dev) {}
+static int select_stub(struct cpuidle_driver *drv, struct cpuidle_device *dev) { return 0; }
+static bool set_stop_tick_stub(void) { return false; }
+static void reflect_stub(struct cpuidle_device *dev, int index) {}
+static int init_stub(void) { return 0; }
+static void exit_stub(void) {}
+
+static struct cpuidle_gov_ext_ops __bpf_ops_cpuidle_gov_ext_ops = {
+ .enable = enable_stub,
+ .disable = disable_stub,
+ .select = select_stub,
+ .set_stop_tick = set_stop_tick_stub,
+ .reflect = reflect_stub,
+ .init = init_stub,
+ .exit = exit_stub,
+};
+
+static int ext_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg, int off,
+ int size)
+{
+ const struct btf_type *t;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (t == cpuidle_device_type) {
+ for (int i = 0; i < CPUIDLE_STATE_MAX; i++) {
+ size_t base_offset = offsetof(struct cpuidle_device, states_usage[i]);
+
+ if (off >= base_offset + offsetof(struct cpuidle_state_usage, disable) &&
+ off + size <= base_offset + offsetofend(struct cpuidle_state_usage, disable)) {
+ return SCALAR_VALUE;
+ }
+ }
+ }
+
+ return -EACCES;
+}
+
+static const struct bpf_verifier_ops ops_verifier = {
+ .get_func_proto = bpf_base_func_proto,
+ .is_valid_access = btf_ctx_access,
+ .btf_struct_access = ext_btf_struct_access,
+};
+
+static void ops_disable(void)
+{
+ restore_cpuidle_last_governor();
+ WARN_ON_ONCE(set_ops_enable_state(OPS_DISABLED) != OPS_ENABLED);
+ static_branch_disable(&ops_enabled_key);
+ if (ops->exit)
+ ops->exit();
+ ops = NULL;
+}
+
+static void ops_unreg(void *kdata, struct bpf_link *link)
+{
+ mutex_lock(&ops_mutex);
+ ops_disable();
+ mutex_unlock(&ops_mutex);
+}
+
+static int ops_reg(void *kdata, struct bpf_link *link)
+{
+ mutex_lock(&ops_mutex);
+ if (get_ops_enable_state() != OPS_DISABLED) {
+ mutex_unlock(&ops_mutex);
+ return -EEXIST;
+ }
+ /*
+ * Set ops, call ops.init(), and set enable state flag
+ */
+ ops = (struct cpuidle_gov_ext_ops *)kdata;
+ if (ops->init && ops->init()) {
+ ops_disable();
+ mutex_unlock(&ops_mutex);
+ return -EINVAL;
+ }
+ WARN_ON_ONCE(set_ops_enable_state(OPS_ENABLED) != OPS_DISABLED);
+ static_branch_enable(&ops_enabled_key);
+
+ mutex_unlock(&ops_mutex);
+ return 0;
+}
+
+static int ops_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct cpuidle_gov_ext_ops, enable):
+ case offsetof(struct cpuidle_gov_ext_ops, disable):
+ case offsetof(struct cpuidle_gov_ext_ops, select):
+ case offsetof(struct cpuidle_gov_ext_ops, set_stop_tick):
+ case offsetof(struct cpuidle_gov_ext_ops, reflect):
+ case offsetof(struct cpuidle_gov_ext_ops, init):
+ case offsetof(struct cpuidle_gov_ext_ops, exit):
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int ops_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct cpuidle_gov_ext_ops *uops = udata;
+ struct cpuidle_gov_ext_ops *ops = kdata;
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+ int ret;
+
+ switch (moff) {
+ case offsetof(struct cpuidle_gov_ext_ops, name):
+ ret = bpf_obj_name_cpy(ops->name, uops->name,
+ sizeof(ops->name));
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ return -EINVAL;
+ return 1;
+ }
+ return 0;
+}
+
+static int ops_init(struct btf *btf)
+{
+ s32 type_id;
+
+ type_id = btf_find_by_name_kind(btf, "cpuidle_device", BTF_KIND_STRUCT);
+ if (type_id < 0)
+ return -EINVAL;
+ cpuidle_device_type = btf_type_by_id(btf, type_id);
+ cpuidle_device_type_id = type_id;
+
+ return 0;
+}
+
+static int ops_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ /*
+ * Updating an actively-loaded BPF cpuidle governor is not supported.
+ */
+ return -EOPNOTSUPP;
+}
+
+static int ops_validate(void *kdata)
+{
+ return 0;
+}
+
+static struct bpf_struct_ops bpf_cpuidle_gov_ext_ops = {
+ .verifier_ops = &ops_verifier,
+ .reg = ops_reg,
+ .unreg = ops_unreg,
+ .check_member = ops_check_member,
+ .init_member = ops_init_member,
+ .init = ops_init,
+ .update = ops_update,
+ .validate = ops_validate,
+ .name = "cpuidle_gov_ext_ops",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &__bpf_ops_cpuidle_gov_ext_ops
+};
+
+/********************************************************************************
+ * default cpuidle ext governor implementations
+ */
+#define ALPHA_SCALE 100
+#define FIT_FACTOR 90
+
+struct cpuidle_gov_ext {
+ int cpu;
+ int last_idx;
+ u64 last_duration;
+ u64 next_pred;
+};
+
+DEFINE_PER_CPU(struct cpuidle_gov_ext, cpuidle_gov_ext_data);
+
+static void update_predict_duration(struct cpuidle_gov_ext *data,
+ struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ int idx;
+ struct cpuidle_state *target;
+
+ if (!data || !drv || !dev)
+ return;
+ idx = data->last_idx;
+ data->last_duration = dev->last_residency_ns;
+ if (idx > 0) {
+ target = &drv->states[idx];
+ if (data->last_duration > target->exit_latency)
+ data->last_duration -= target->exit_latency;
+ }
+ data->next_pred = data->last_duration;
+}
+
+static void ext_reflect_dfl(struct cpuidle_device *dev, int index)
+{
+ struct cpuidle_gov_ext *data = this_cpu_ptr(&cpuidle_gov_ext_data);
+
+ if (!data)
+ return;
+ data->last_idx = index;
+}
+
+static int ext_select_dfl(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *stop_tick)
+{
+ int i, selected;
+ struct cpuidle_gov_ext *data;
+ ktime_t delta_tick;
+ s64 delta = tick_nohz_get_sleep_length(&delta_tick);
+ s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
+
+ data = this_cpu_ptr(&cpuidle_gov_ext_data);
+ if (!data)
+ return 0;
+
+ /*
+ * We aim to achieve function redefinition through BPF ops.select(),
+ * so we do not use complex algorithm here.
+ */
+ update_predict_duration(data, drv, dev);
+ for (i = drv->state_count - 1; i > 0; i--) {
+ struct cpuidle_state *s = &drv->states[i];
+ struct cpuidle_state_usage *su = &dev->states_usage[i];
+
+ if (su->disable)
+ continue;
+
+ if (latency_req < s->exit_latency_ns)
+ continue;
+
+ if (delta < s->target_residency_ns)
+ continue;
+
+ if (data->next_pred / FIT_FACTOR * ALPHA_SCALE < s->target_residency_ns)
+ continue;
+ break;
+ }
+ selected = i;
+ return selected;
+}
+
+static int ext_enable_dfl(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ struct cpuidle_gov_ext *data = &per_cpu(cpuidle_gov_ext_data, dev->cpu);
+
+ memset(data, 0, sizeof(struct cpuidle_gov_ext));
+ data->cpu = dev->cpu;
+ return 0;
+}
+
+static void ext_disable_dfl(struct cpuidle_driver *drv, struct cpuidle_device *dev) { }
+
+/********************************************************************************
+ * Register and init cpuidle governor
+ */
+static int ext_enable(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ if (static_branch_likely(&ops_enabled_key))
+ return ops->enable(drv, dev);
+ return ext_enable_dfl(drv, dev);
+}
+
+static void ext_disable(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ if (static_branch_likely(&ops_enabled_key))
+ return ops->disable(drv, dev);
+ return ext_disable_dfl(drv, dev);
+}
+
+static int ext_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *stop_tick)
+{
+ int state = 0;
+
+ if (static_branch_likely(&ops_enabled_key)) {
+ state = ops->select(drv, dev);
+ *stop_tick = ops->set_stop_tick();
+ } else {
+ state = ext_select_dfl(drv, dev, stop_tick);
+ }
+ return state;
+}
+
+static void ext_reflect(struct cpuidle_device *dev, int index)
+{
+ if (static_branch_likely(&ops_enabled_key))
+ ops->reflect(dev, index);
+ ext_reflect_dfl(dev, index);
+}
+
+static struct cpuidle_governor ext_governor = {
+ .name = EXT_GOV_NAME,
+ .rating = 1,
+ .enable = ext_enable,
+ .disable = ext_disable,
+ .select = ext_select,
+ .reflect = ext_reflect,
+};
+
+static int __init init_ext(void)
+{
+ int ret;
+
+ ret = cpuidle_register_governor(&ext_governor);
+ if (ret)
+ return ret;
+
+ ret = register_bpf_struct_ops(&bpf_cpuidle_gov_ext_ops, cpuidle_gov_ext_ops);
+ if (ret) {
+ pr_err("bpf_cpuidle_gov_ext_ops register fail: %d\n", ret);
+ return ret;
+ }
+
+ ret = cpuidle_gov_kfuncs_init();
+ if (ret) {
+ pr_err("bpf cpuidle_gov_kfuncs_init register fail: %d\n", ret);
+ return ret;
+ }
+
+ return ret;
+}
+
+postcore_initcall(init_ext);
+MODULE_LICENSE("GPL");
--
2.43.0
^ permalink raw reply related [flat|nested] 8+ messages in thread

* [PATCH v2 bpf-next 2/2] selftests/bpf: Add selftests for cpuidle_gov_ext
2025-09-01 13:56 [PATCHSET V2 bpf-next 0/2] cpuidle, bpf: Introduce BPF-ext cpuidle governor policy via struct_ops Lin Yikai
2025-09-01 13:56 ` [PATCH v2 bpf-next 1/2] cpuidle: Implement BPF extensible cpuidle governor class Lin Yikai
@ 2025-09-01 13:56 ` Lin Yikai
2025-09-03 0:38 ` Alexei Starovoitov
2025-09-01 16:36 ` [PATCHSET V2 bpf-next 0/2] cpuidle, bpf: Introduce BPF-ext cpuidle governor policy via struct_ops Rafael J. Wysocki
2 siblings, 1 reply; 8+ messages in thread
From: Lin Yikai @ 2025-09-01 13:56 UTC (permalink / raw)
To: Christian Loehle, Song Liu, Rafael J . Wysocki, Daniel Lezcano,
Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, linux-kselftest, linux-pm, bpf,
Tejun Heo, Lin Yikai
Cc: zhaofuyu
Add a test to verify the cpuidle ext governor's load, attach, and kfuncs.
This patch also provides a simple demonstration of `cpuidle_gov_ext_ops` usage:
- In `ops.init()`, the "rating" value is set to 60 - significantly exceeding other governors' ratings - to activate `cpuidle_gov_ext`.
- For specific scenarios (e.g., screen-off music playback on mobile devices), "expect_deeper" can be enabled to transition to deeper idle states.
This implementation serves as a foundation, not a final solution;
cpuidle strategies optimized for various usage scenarios remain to be explored.
Test Results
------------
:~/workplace/bpf/x86/submit/bpf_next/tools/testing/selftests/bpf$ make -j4
:$ sudo ./test_progs -t test_cpuidle_gov_ext
#449 test_cpuidle_gov_ext: OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
Additionally, the kernel log shows:
$sudo cat /dev/kmsg
6,911,10997439785,-; cpuidle: using governor ext
6,913,11010384887,-; cpuidle: using governor menu
After `cpuidle_gov_ext` exits, the system will restore the previous governor.
Signed-off-by: Lin Yikai <yikai.lin@vivo.com>
---
.../bpf/prog_tests/test_cpuidle_gov_ext.c | 28 +++
.../selftests/bpf/progs/cpuidle_common.h | 13 ++
.../selftests/bpf/progs/cpuidle_gov_ext.c | 200 ++++++++++++++++++
3 files changed, 241 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c
create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_common.h
create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c b/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c
new file mode 100644
index 000000000000..8b35771ada44
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_cpuidle_gov_ext.c - test cpuidle governor ext's load, attach and kfuncs
+ *
+ * Copyright (C) Yikai Lin <yikai.lin@vivo.com>
+ */
+
+#include <test_progs.h>
+#include "cpuidle_gov_ext.skel.h"
+
+void test_test_cpuidle_gov_ext(void)
+{
+ struct cpuidle_gov_ext *skel;
+ int err;
+
+ skel = cpuidle_gov_ext__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "cpuidle_gov_ext__open_and_load"))
+ return;
+
+ skel->bss->expect_deeper = 1;
+ err = cpuidle_gov_ext__attach(skel);
+ if (!ASSERT_OK(err, "cpuidle_gov_ext__attach"))
+ goto cleanup;
+
+cleanup:
+ cpuidle_gov_ext__destroy(skel);
+}
+
diff --git a/tools/testing/selftests/bpf/progs/cpuidle_common.h b/tools/testing/selftests/bpf/progs/cpuidle_common.h
new file mode 100644
index 000000000000..95402974c53e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cpuidle_common.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) Yikai Lin <yikai.lin@vivo.com>
+ */
+
+#ifndef _CPUIDLE_COMMON_H
+#define _CPUIDLE_COMMON_H
+
+int bpf_cpuidle_ext_gov_update_rating(unsigned int rating) __ksym __weak;
+s64 bpf_cpuidle_ext_gov_latency_req(unsigned int cpu) __ksym __weak;
+s64 bpf_tick_nohz_get_sleep_length(void) __ksym __weak;
+
+#endif /* _CPUIDLE_COMMON_H */
diff --git a/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c b/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
new file mode 100644
index 000000000000..66c437243270
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * cpuidle_gov_ext.c - test to use cpuidle governor ext by bpf
+ *
+ * Copyright (C) Yikai Lin <yikai.lin@vivo.com>
+ */
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+#include "bpf_misc.h"
+#include "cpuidle_common.h"
+
+char LICENSE[] SEC("license") = "GPL";
+
+#define ALPHA 10
+#define ALPHA_SCALE 100
+#define FIT_FACTOR 90
+
+#ifndef max
+#define max(a, b) ((a) > (b) ? (a) : (b))
+#endif
+#ifndef min
+#define min(a, b) ((a) < (b) ? (a) : (b))
+#endif
+
+/*
+ * For some low-power scenarios,
+ * such as the screen-off scenario on mobile devices
+ * (which will be determined by the user-space BPF program),
+ * we aim to choose a deeper state.
+ * At this point, we somewhat disregard the impact on CPU performance.
+ */
+int expect_deeper = 0;
+
+struct cpuidle_gov_data {
+ int cpu;
+ int last_idx;
+ u64 last_pred;
+ u64 last_duration;
+ u64 next_pred;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct cpuidle_gov_data);
+} cpuidle_gov_data_map SEC(".maps");
+
+static u64 calculate_ewma(u64 last, u64 new, u32 alpha, u32 alpha_scale)
+{
+ return (alpha * new + (alpha_scale - alpha) * last) / alpha_scale;
+}
+
+static void update_predict_duration(struct cpuidle_gov_data *data,
+ struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ int idx;
+ struct cpuidle_state target;
+
+ if (!data || !drv || !dev)
+ return;
+ idx = data->last_idx;
+ data->last_duration = dev->last_residency_ns;
+ if (idx > 0) {
+ bpf_core_read(&target, sizeof(target), &drv->states[idx]);
+ if (data->last_duration > target.exit_latency)
+ data->last_duration -= target.exit_latency;
+ }
+ data->last_pred = data->next_pred;
+ data->next_pred = calculate_ewma(data->next_pred,
+ data->last_duration, ALPHA, ALPHA_SCALE);
+}
+
+/* Enable the cpuidle governor */
+SEC("struct_ops.s/enable")
+int BPF_PROG(bpf_cpuidle_enable, struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ u32 key = 0;
+ struct cpuidle_gov_data *data;
+
+ bpf_printk("cpuidle_gov_ext: enabled");
+ data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu);
+ if (!data)
+ return 0;
+
+ __builtin_memset(data, 0, sizeof(struct cpuidle_gov_data));
+ data->cpu = dev->cpu;
+ return 0;
+}
+
+/* Disable the cpuidle governor */
+SEC("struct_ops.s/disable")
+void BPF_PROG(bpf_cpuidle_disable, struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ bpf_printk("cpuidle_gov_ext: disabled");
+}
+
+/* Select the next idle state */
+SEC("struct_ops.s/select")
+int BPF_PROG(bpf_cpuidle_select, struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ u32 key = 0;
+ s64 delta, latency_req, residency_ns;
+ int i;
+ unsigned long long disable;
+ struct cpuidle_gov_data *data;
+ struct cpuidle_state *cs;
+
+ data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu);
+ if (!data) {
+ bpf_printk("cpuidle_gov_ext: [%s] cpuidle_gov_data_map is NULL\n", __func__);
+ return 0;
+ }
+ latency_req = bpf_cpuidle_ext_gov_latency_req(dev->cpu);
+ delta = bpf_tick_nohz_get_sleep_length();
+
+ update_predict_duration(data, drv, dev);
+ for (i = ARRAY_SIZE(drv->states)-1; i > 0; i--) {
+ if (i >= drv->state_count)
+ continue;
+ cs = &drv->states[i];
+ disable = dev->states_usage[i].disable;
+ if (disable)
+ continue;
+ if (latency_req < cs->exit_latency_ns)
+ continue;
+
+ if (delta < cs->target_residency_ns)
+ continue;
+
+ if (data->next_pred / FIT_FACTOR * ALPHA_SCALE < cs->target_residency_ns)
+ continue;
+
+ break;
+ }
+ residency_ns = drv->states[i].target_residency_ns;
+ if (expect_deeper &&
+ i <= drv->state_count-2 &&
+ !dev->states_usage[i+1].disable &&
+ data->last_pred >= residency_ns &&
+ data->next_pred < residency_ns &&
+ data->next_pred / FIT_FACTOR * ALPHA_SCALE >= residency_ns &&
+ data->next_pred / FIT_FACTOR * ALPHA_SCALE >= data->last_duration &&
+ delta > residency_ns) {
+ i++;
+ }
+
+ return i;
+}
+
+/* Enable or disable the scheduler tick after selecting a cpuidle state */
+SEC("struct_ops.s/set_stop_tick")
+bool BPF_PROG(bpf_cpuidle_set_stop_tick)
+{
+ return false;
+}
+
+/* Reflect function called after entering an idle state */
+SEC("struct_ops.s/reflect")
+void BPF_PROG(bpf_cpuidle_reflect, struct cpuidle_device *dev, int index)
+{
+ u32 key = 0;
+ struct cpuidle_gov_data *data;
+
+ data = bpf_map_lookup_percpu_elem(&cpuidle_gov_data_map, &key, dev->cpu);
+ if (!data) {
+ bpf_printk("cpuidle_gov_ext: [%s] cpuidle_gov_data_map is NULL\n", __func__);
+ return;
+ }
+ data->last_idx = index;
+}
+
+/* Initialize the BPF cpuidle governor */
+SEC("struct_ops.s/init")
+int BPF_PROG(bpf_cpuidle_init)
+{
+ return bpf_cpuidle_ext_gov_update_rating(60);
+}
+
+/* Cleanup after the BPF cpuidle governor */
+SEC("struct_ops.s/exit")
+void BPF_PROG(bpf_cpuidle_exit) { }
+
+/* Struct_ops linkage for cpuidle governor */
+SEC(".struct_ops.link")
+struct cpuidle_gov_ext_ops ops = {
+ .enable = (void *)bpf_cpuidle_enable,
+ .disable = (void *)bpf_cpuidle_disable,
+ .select = (void *)bpf_cpuidle_select,
+ .set_stop_tick = (void *)bpf_cpuidle_set_stop_tick,
+ .reflect = (void *)bpf_cpuidle_reflect,
+ .init = (void *)bpf_cpuidle_init,
+ .exit = (void *)bpf_cpuidle_exit,
+ .name = "BPF_cpuidle_gov"
+};
--
2.43.0
^ permalink raw reply related [flat|nested] 8+ messages in thread

* Re: [PATCHSET V2 bpf-next 0/2] cpuidle, bpf: Introduce BPF-ext cpuidle governor policy via struct_ops
2025-09-01 13:56 [PATCHSET V2 bpf-next 0/2] cpuidle, bpf: Introduce BPF-ext cpuidle governor policy via struct_ops Lin Yikai
2025-09-01 13:56 ` [PATCH v2 bpf-next 1/2] cpuidle: Implement BPF extensible cpuidle governor class Lin Yikai
2025-09-01 13:56 ` [PATCH v2 bpf-next 2/2] selftests/bpf: Add selftests for cpuidle_gov_ext Lin Yikai
@ 2025-09-01 16:36 ` Rafael J. Wysocki
2025-09-02 8:03 ` yikai.lin
2 siblings, 1 reply; 8+ messages in thread
From: Rafael J. Wysocki @ 2025-09-01 16:36 UTC (permalink / raw)
To: Lin Yikai
Cc: Christian Loehle, Song Liu, Rafael J . Wysocki, Daniel Lezcano,
Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, linux-kselftest, linux-pm, bpf,
Tejun Heo, zhaofuyu
On Mon, Sep 1, 2025 at 3:56 PM Lin Yikai <yikai.lin@vivo.com> wrote:
>
> Changes in v2:
> - Optimized the logic in descriptions. (Song Liu)
> - Created a new header file to declare kfuncs for future extensions included by other files. (Christian Loehle)
> - Fixed some logical issues in the code. (Christian Loehle)
>
> Reference:
> [1] https://lore.kernel.org/bpf/20250829101137.9507-1-yikai.lin@vivo.com/
>
> Summary
> ----------
> Hi, everyone,
> This patch set introduces an extensible cpuidle governor framework
> using BPF struct_ops, enabling dynamic implementation of idle-state selection policies
> via BPF programs.
>
> Motivation
> ----------
> As is well-known, CPUs support multiple idle states (e.g., C0, C1, C2, ...),
> where deeper states reduce power consumption but result in longer wakeup latency,
> potentially affecting performance.
> Existing generic cpuidle governors operate effectively in common scenarios
> but exhibit suboptimal behavior in specific Android phone's use cases.
>
> Our testing reveals that during low-utilization scenarios
> (e.g., screen-off background tasks like music playback with CPU utilization <10%),
> the C0 state occupies ~50% of idle time, causing significant energy inefficiency.
I gather that this is based on measurements taken on ARM-based
platforms because for x86 it is demonstrably not true.
> Reducing C0 to ≤20% could yield ≥5% power savings on mobile phones.
>
> To address this, we expect:
> 1.Dynamic governor switching to power-saved policies for low cpu utilization scenarios (e.g., screen-off mode)
> 2.Dynamic switching to alternate governors for high-performance scenarios (e.g., gaming)
All of this can be done without using BPF at all, so why use it here?
^ permalink raw reply [flat|nested] 8+ messages in thread