* [PATCHSET sched_ext/for-6.11] sched_ext: Integrate with schedutil
@ 2024-06-19 3:12 Tejun Heo
2024-06-19 3:12 ` [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() Tejun Heo
2024-06-19 3:12 ` [PATCH 2/2] sched_ext: Add cpuperf support Tejun Heo
From: Tejun Heo @ 2024-06-19 3:12 UTC (permalink / raw)
To: rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz
sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to the
highest performance level due to RT tasks running on the CPUs.
This patchset integrates sched_ext with schedutil so that the BPF scheduler
can steer CPU frequencies when the schedutil governor is in use. The
implementation is straightforward: sugov_get_util() is updated to add the
utilization metric provided by the BPF scheduler when sched_ext is enabled.
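Conceptually, the sugov_get_util() change looks like the following sketch
(the exact form, including the fully-switched case, is in patch 2):

	util = cpu_util_cfs_boost(sg_cpu->cpu) + scx_cpuperf_target(sg_cpu->cpu);
	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);

Here scx_cpuperf_target() returns the target set by the BPF scheduler via
scx_bpf_cpuperf_set(), or 0 when sched_ext is disabled.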
This patchset is on top of sched_ext/for-6.11 (a5db7817af78 ("sched_ext: Add
selftests")) and contains the following two patches:
0001-cpufreq_schedutil-Refactor-sugov_cpu_is_busy.patch
0002-sched_ext-Add-cpuperf-support.patch
and is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-cpuperf
diffstat follows. Thanks.
kernel/sched/cpufreq_schedutil.c | 50 +++++++++-------
kernel/sched/ext.c | 83 ++++++++++++++++++++++++++
kernel/sched/ext.h | 9 ++
kernel/sched/sched.h | 1
tools/sched_ext/include/scx/common.bpf.h | 3
tools/sched_ext/scx_qmap.bpf.c | 142 +++++++++++++++++++++++++++++++++++++++++++++-
tools/sched_ext/scx_qmap.c | 8 ++
7 files changed, 270 insertions(+), 26 deletions(-)
--
tejun
* [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 3:12 [PATCHSET sched_ext/for-6.11] sched_ext: Integrate with schedutil Tejun Heo
@ 2024-06-19 3:12 ` Tejun Heo
2024-06-19 14:07 ` Christian Loehle
2024-06-19 3:12 ` [PATCH 2/2] sched_ext: Add cpuperf support Tejun Heo
From: Tejun Heo @ 2024-06-19 3:12 UTC (permalink / raw)
To: rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
Tejun Heo, David Vernet, Rafael J . Wysocki
sugov_cpu_is_busy() is used to avoid decreasing the performance level while
the CPU is busy and is called by sugov_update_single_freq() and
sugov_update_single_perf(). Both callers repeat the same pattern: first test
for uclamp and then for busyness. Let's refactor so that the tests aren't
repeated.
The new helper is named sugov_hold_freq() and tests both the uclamp exception
and CPU busyness. No functional changes. This will make adding more exception
conditions easier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
---
kernel/sched/cpufreq_schedutil.c | 38 +++++++++++++++-----------------
1 file changed, 18 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..972b7dd65af2 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -325,16 +325,27 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
}
#ifdef CONFIG_NO_HZ_COMMON
-static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
{
- unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
- bool ret = idle_calls == sg_cpu->saved_idle_calls;
+ unsigned long idle_calls;
+ bool ret;
+
+ /* if capped by uclamp_max, always update to be in compliance */
+ if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
+ return false;
+
+ /*
+ * Maintain the frequency if the CPU has not been idle recently, as
+ * reduction is likely to be premature.
+ */
+ idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
+ ret = idle_calls == sg_cpu->saved_idle_calls;
sg_cpu->saved_idle_calls = idle_calls;
return ret;
}
#else
-static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; }
#endif /* CONFIG_NO_HZ_COMMON */
/*
@@ -382,14 +393,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
return;
next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
- /*
- * Do not reduce the frequency if the CPU has not been idle
- * recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
- */
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
+
+ if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
!sg_policy->need_freq_update) {
next_f = sg_policy->next_freq;
@@ -436,14 +441,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
return;
- /*
- * Do not reduce the target performance level if the CPU has not been
- * idle recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
- */
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+ if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;
cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
--
2.45.2
* [PATCH 2/2] sched_ext: Add cpuperf support
2024-06-19 3:12 [PATCHSET sched_ext/for-6.11] sched_ext: Integrate with schedutil Tejun Heo
2024-06-19 3:12 ` [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() Tejun Heo
@ 2024-06-19 3:12 ` Tejun Heo
2024-06-19 14:07 ` Christian Loehle
From: Tejun Heo @ 2024-06-19 3:12 UTC (permalink / raw)
To: rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
Tejun Heo, David Vernet, Rafael J . Wysocki
sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to the
highest performance level due to RT tasks running on the CPUs.
Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:
- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
different CPUs in the system.
- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
relative to its max performance.
- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
This gives the BPF scheduler direct control over CPU performance settings.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling the frequency-hold heuristic, as it may
not apply well to sched_ext schedulers, which may have a much weaker
connection between tasks and their current / last CPU.
With cpuperf support added, there is no reason to block uclamp. Enable it
while at it.
A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.
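For a sense of the BPF-side usage, here is a minimal hypothetical sketch
(distinct from the scx_qmap demo below) which drives the cpuperf target of a
task's CPU from its weight on every tick:

	void BPF_STRUCT_OPS(example_tick, struct task_struct *p)
	{
		/* scale the task weight (100 for nice-0) into [0, SCX_CPUPERF_ONE] */
		u32 perf = p->scx.weight * SCX_CPUPERF_ONE / 100;

		if (perf > SCX_CPUPERF_ONE)
			perf = SCX_CPUPERF_ONE;

		scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), perf);
	}

example_tick() and the weight-to-perf mapping are made up for illustration;
scx_qmap's qmap_tick() below does something similar using a running average
of weights.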
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
---
kernel/sched/cpufreq_schedutil.c | 12 +-
kernel/sched/ext.c | 83 ++++++++++++-
kernel/sched/ext.h | 9 ++
kernel/sched/sched.h | 1 +
tools/sched_ext/include/scx/common.bpf.h | 3 +
tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++-
tools/sched_ext/scx_qmap.c | 8 ++
7 files changed, 252 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 972b7dd65af2..12174c0137a5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{
- unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+ unsigned long min, max;
+ unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
+ scx_cpuperf_target(sg_cpu->cpu);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
@@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
unsigned long idle_calls;
bool ret;
+ /*
+ * The heuristics in this function is for the fair class. For SCX, the
+ * performance target comes directly from the BPF scheduler. Let's just
+ * follow it.
+ */
+ if (scx_switched_all())
+ return false;
+
/* if capped by uclamp_max, always update to be in compliance */
if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
return false;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f814e84ceeb3..04fb0eeee5ec 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -16,6 +16,8 @@ enum scx_consts {
SCX_EXIT_BT_LEN = 64,
SCX_EXIT_MSG_LEN = 1024,
SCX_EXIT_DUMP_DFL_LEN = 32768,
+
+ SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
};
enum scx_exit_kind {
@@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
.update_curr = update_curr_scx,
#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 0,
+ .uclamp_enabled = 1,
#endif
};
@@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
struct scx_task_iter sti;
struct task_struct *p;
unsigned long timeout;
- int i, ret;
+ int i, cpu, ret;
mutex_lock(&scx_ops_enable_mutex);
@@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
atomic_long_set(&scx_nr_rejected, 0);
+ for_each_possible_cpu(cpu)
+ cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
+
/*
* Keep CPUs stable during enable so that the BPF scheduler can track
* online CPUs by watching ->on/offline_cpu() after ->init().
@@ -5835,6 +5840,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
ops_dump_flush();
}
+/**
+ * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the maximum relative capacity of @cpu in relation to the most
+ * performant CPU in the system. The return value is in the range [1,
+ * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
+{
+ if (ops_cpu_valid(cpu, NULL))
+ return arch_scale_cpu_capacity(cpu);
+ else
+ return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the current relative performance of @cpu in relation to its maximum.
+ * The return value is in the range [1, %SCX_CPUPERF_ONE].
+ *
+ * The current performance level of a CPU in relation to the maximum performance
+ * available in the system can be calculated as follows:
+ *
+ * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
+ *
+ * The result is in the range [1, %SCX_CPUPERF_ONE].
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
+{
+ if (ops_cpu_valid(cpu, NULL))
+ return arch_scale_freq_capacity(cpu);
+ else
+ return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
+ * @cpu: CPU of interest
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @flags: %SCX_CPUPERF_* flags
+ *
+ * Set the target performance level of @cpu to @perf. @perf is in linear
+ * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
+ * schedutil cpufreq governor chooses the target frequency.
+ *
+ * The actual performance level chosen, CPU grouping, and the overhead and
+ * latency of the operations are dependent on the hardware and cpufreq driver in
+ * use. Consult hardware and cpufreq documentation for more information. The
+ * current performance level can be monitored using scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
+{
+ if (unlikely(perf > SCX_CPUPERF_ONE)) {
+ scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
+ return;
+ }
+
+ if (ops_cpu_valid(cpu, NULL)) {
+ struct rq *rq = cpu_rq(cpu);
+
+ rq->scx.cpuperf_target = perf;
+
+ rcu_read_lock_sched_notrace();
+ cpufreq_update_util(cpu_rq(cpu), 0);
+ rcu_read_unlock_sched_notrace();
+ }
+}
+
/**
* scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
*
@@ -6045,6 +6121,9 @@ BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index c41d742b5d62..c7ac33c47b68 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -48,6 +48,14 @@ int scx_check_setscheduler(struct task_struct *p, int policy);
bool task_should_scx(struct task_struct *p);
void init_sched_ext_class(void);
+static inline u32 scx_cpuperf_target(s32 cpu)
+{
+ if (scx_enabled())
+ return cpu_rq(cpu)->scx.cpuperf_target;
+ else
+ return 0;
+}
+
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
class++;
@@ -89,6 +97,7 @@ static inline void scx_pre_fork(struct task_struct *p) {}
static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
static inline void scx_rq_activate(struct rq *rq) {}
static inline void scx_rq_deactivate(struct rq *rq) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c0d6e42c99cc..d3912cf3c3b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -743,6 +743,7 @@ struct scx_rq {
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
u32 flags;
+ u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */
bool cpu_released;
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_kick_if_idle;
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 3fa87084cf17..dbbda0e35c5d 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -42,6 +42,9 @@ void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak;
+u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
+u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
+void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index c75c70d6a8eb..b1d0b09c966e 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -68,6 +68,18 @@ struct {
},
};
+/*
+ * If enabled, the CPU performance target is set according to the queue index
+ * per the following table.
+ */
+static const u32 qidx_to_cpuperf_target[] = {
+ [0] = SCX_CPUPERF_ONE * 0 / 4,
+ [1] = SCX_CPUPERF_ONE * 1 / 4,
+ [2] = SCX_CPUPERF_ONE * 2 / 4,
+ [3] = SCX_CPUPERF_ONE * 3 / 4,
+ [4] = SCX_CPUPERF_ONE * 4 / 4,
+};
+
/*
* Per-queue sequence numbers to implement core-sched ordering.
*
@@ -95,6 +107,8 @@ struct {
struct cpu_ctx {
u64 dsp_idx; /* dispatch index */
u64 dsp_cnt; /* remaining count */
+ u32 avg_weight;
+ u32 cpuperf_target;
};
struct {
@@ -107,6 +121,8 @@ struct {
/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
u64 nr_core_sched_execed;
+u32 cpuperf_min, cpuperf_avg, cpuperf_max;
+u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -313,6 +329,29 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
}
}
+void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
+{
+ struct cpu_ctx *cpuc;
+ u32 zero = 0;
+ int idx;
+
+ if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
+ scx_bpf_error("failed to look up cpu_ctx");
+ return;
+ }
+
+ /*
+ * Use the running avg of weights to select the target cpuperf level.
+ * This is a demonstration of the cpuperf feature rather than a
+ * practical strategy to regulate CPU frequency.
+ */
+ cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
+ idx = weight_to_idx(cpuc->avg_weight);
+ cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
+
+ scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+}
+
/*
* The distance from the head of the queue scaled by the weight of the queue.
* The lower the number, the older the task and the higher the priority.
@@ -422,8 +461,9 @@ void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle
if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
return;
- scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu",
- cpuc->dsp_idx, cpuc->dsp_cnt);
+ scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
+ cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
+ cpuc->cpuperf_target);
}
void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
@@ -492,11 +532,106 @@ void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
print_cpus();
}
+struct monitor_timer {
+ struct bpf_timer timer;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct monitor_timer);
+} monitor_timer SEC(".maps");
+
+/*
+ * Print out the min, avg and max performance levels of CPUs every second to
+ * demonstrate the cpuperf interface.
+ */
+static void monitor_cpuperf(void)
+{
+ u32 zero = 0, nr_cpu_ids;
+ u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
+ u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
+ const struct cpumask *online;
+ int i, nr_online_cpus = 0;
+
+ nr_cpu_ids = scx_bpf_nr_cpu_ids();
+ online = scx_bpf_get_online_cpumask();
+
+ bpf_for(i, 0, nr_cpu_ids) {
+ struct cpu_ctx *cpuc;
+ u32 cap, cur;
+
+ if (!bpf_cpumask_test_cpu(i, online))
+ continue;
+ nr_online_cpus++;
+
+ /* collect the capacity and current cpuperf */
+ cap = scx_bpf_cpuperf_cap(i);
+ cur = scx_bpf_cpuperf_cur(i);
+
+ cur_min = cur < cur_min ? cur : cur_min;
+ cur_max = cur > cur_max ? cur : cur_max;
+
+ /*
+ * $cur is relative to $cap. Scale it down accordingly so that
+ * it's in the same scale as other CPUs and $cur_sum/$cap_sum
+ * makes sense.
+ */
+ cur_sum += cur * cap / SCX_CPUPERF_ONE;
+ cap_sum += cap;
+
+ if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
+ scx_bpf_error("failed to look up cpu_ctx");
+ goto out;
+ }
+
+ /* collect target */
+ cur = cpuc->cpuperf_target;
+ target_sum += cur;
+ target_min = cur < target_min ? cur : target_min;
+ target_max = cur > target_max ? cur : target_max;
+ }
+
+ cpuperf_min = cur_min;
+ cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
+ cpuperf_max = cur_max;
+
+ cpuperf_target_min = target_min;
+ cpuperf_target_avg = target_sum / nr_online_cpus;
+ cpuperf_target_max = target_max;
+out:
+ scx_bpf_put_cpumask(online);
+}
+
+static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+ monitor_cpuperf();
+
+ bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+ return 0;
+}
+
s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
+ u32 key = 0;
+ struct bpf_timer *timer;
+ s32 ret;
+
print_cpus();
- return scx_bpf_create_dsq(SHARED_DSQ, -1);
+ ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+ if (ret)
+ return ret;
+
+ timer = bpf_map_lookup_elem(&monitor_timer, &key);
+ if (!timer)
+ return -ESRCH;
+
+ bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, monitor_timerfn);
+
+ return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
}
void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
@@ -509,6 +644,7 @@ SCX_OPS_DEFINE(qmap_ops,
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
+ .tick = (void *)qmap_tick,
.core_sched_before = (void *)qmap_core_sched_before,
.cpu_release = (void *)qmap_cpu_release,
.init_task = (void *)qmap_init_task,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index bc36ec4f88a7..4d41c0cb1dab 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -116,6 +116,14 @@ int main(int argc, char **argv)
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
skel->bss->nr_core_sched_execed);
+ if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+ printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
+ skel->bss->cpuperf_min,
+ skel->bss->cpuperf_avg,
+ skel->bss->cpuperf_max,
+ skel->bss->cpuperf_target_min,
+ skel->bss->cpuperf_target_avg,
+ skel->bss->cpuperf_target_max);
fflush(stdout);
sleep(1);
}
--
2.45.2
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 3:12 ` [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() Tejun Heo
@ 2024-06-19 14:07 ` Christian Loehle
2024-06-19 18:57 ` Tejun Heo
2024-06-19 18:45 ` Rafael J. Wysocki
2024-06-21 22:38 ` Tejun Heo
From: Christian Loehle @ 2024-06-19 14:07 UTC (permalink / raw)
To: Tejun Heo, rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
David Vernet, Rafael J . Wysocki
On 6/19/24 04:12, Tejun Heo wrote:
> sugov_cpu_is_busy() is used to avoid decreasing performance level while the
> CPU is busy and called by sugov_update_single_freq() and
> sugov_update_single_perf(). Both callers repeat the same pattern to first
> test for uclamp and then the business. Let's refactor so that the tests
> aren't repeated.
>
> The new helper is named sugov_hold_freq() and tests both the uclamp
> exception and CPU business. No functional changes. This will make adding
> more exception conditions easier.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
> kernel/sched/cpufreq_schedutil.c | 38 +++++++++++++++-----------------
> 1 file changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..972b7dd65af2 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -325,16 +325,27 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> }
>
> #ifdef CONFIG_NO_HZ_COMMON
> -static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> {
> - unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> - bool ret = idle_calls == sg_cpu->saved_idle_calls;
> + unsigned long idle_calls;
> + bool ret;
> +
> + /* if capped by uclamp_max, always update to be in compliance */
> + if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
> + return false;
> +
> + /*
> + * Maintain the frequency if the CPU has not been idle recently, as
> + * reduction is likely to be premature.
> + */
> + idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> + ret = idle_calls == sg_cpu->saved_idle_calls;
>
> sg_cpu->saved_idle_calls = idle_calls;
> return ret;
> }
> #else
> -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; }
> #endif /* CONFIG_NO_HZ_COMMON */
>
> /*
> @@ -382,14 +393,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> return;
>
> next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
> - /*
> - * Do not reduce the frequency if the CPU has not been idle
> - * recently, as the reduction is likely to be premature then.
> - *
> - * Except when the rq is capped by uclamp_max.
> - */
> - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
> - sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
> +
> + if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
> !sg_policy->need_freq_update) {
> next_f = sg_policy->next_freq;
>
Not necessarily related to your changes, but in case you're touching this
again, maybe sugov_hold_freq() could be the last condition?
And do we want something like
#ifdef CONFIG_NO_HZ_COMMON
else
sg_cpu->saved_idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
#endif
here?
> @@ -436,14 +441,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
> return;
>
> - /*
> - * Do not reduce the target performance level if the CPU has not been
> - * idle recently, as the reduction is likely to be premature then.
> - *
> - * Except when the rq is capped by uclamp_max.
> - */
> - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
> - sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
> + if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util)
> sg_cpu->util = prev_util;
>
> cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
FWIW
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-06-19 3:12 ` [PATCH 2/2] sched_ext: Add cpuperf support Tejun Heo
@ 2024-06-19 14:07 ` Christian Loehle
2024-06-19 19:19 ` Tejun Heo
2024-06-19 19:51 ` [PATCH v2 " Tejun Heo
From: Christian Loehle @ 2024-06-19 14:07 UTC (permalink / raw)
To: Tejun Heo, rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
David Vernet, Rafael J . Wysocki
On 6/19/24 04:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
>
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
>
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
> different CPUs in the system.
>
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
> relative to its max performance.
>
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
>
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
>
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
>
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
> kernel/sched/cpufreq_schedutil.c | 12 +-
> kernel/sched/ext.c | 83 ++++++++++++-
> kernel/sched/ext.h | 9 ++
> kernel/sched/sched.h | 1 +
> tools/sched_ext/include/scx/common.bpf.h | 3 +
> tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++-
> tools/sched_ext/scx_qmap.c | 8 ++
> 7 files changed, 252 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + unsigned long min, max;
> + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> + scx_cpuperf_target(sg_cpu->cpu);
What does cpu_util_cfs_boost() contain if scx is active?
NIT: reverse xmas tree ordering for the variable declarations
>
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> unsigned long idle_calls;
> bool ret;
>
> + /*
> + * The heuristics in this function is for the fair class. For SCX, the
> + * performance target comes directly from the BPF scheduler. Let's just
> + * follow it.
> + */
> + if (scx_switched_all())
> + return false;
> +
> [SNIP]
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 3:12 ` [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() Tejun Heo
2024-06-19 14:07 ` Christian Loehle
@ 2024-06-19 18:45 ` Rafael J. Wysocki
2024-06-19 19:53 ` Tejun Heo
2024-06-21 22:38 ` Tejun Heo
From: Rafael J. Wysocki @ 2024-06-19 18:45 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Wed, Jun 19, 2024 at 5:13 AM Tejun Heo <tj@kernel.org> wrote:
>
> sugov_cpu_is_busy() is used to avoid decreasing performance level while the
> CPU is busy and called by sugov_update_single_freq() and
> sugov_update_single_perf(). Both callers repeat the same pattern to first
> test for uclamp and then the business. Let's refactor so that the tests
> aren't repeated.
>
> The new helper is named sugov_hold_freq() and tests both the uclamp
> exception and CPU business. No functional changes. This will make adding
> more exception conditions easier.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
for this particular change.
> ---
> kernel/sched/cpufreq_schedutil.c | 38 +++++++++++++++-----------------
> 1 file changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..972b7dd65af2 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -325,16 +325,27 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> }
>
> #ifdef CONFIG_NO_HZ_COMMON
> -static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> {
> - unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> - bool ret = idle_calls == sg_cpu->saved_idle_calls;
> + unsigned long idle_calls;
> + bool ret;
> +
> + /* if capped by uclamp_max, always update to be in compliance */
> + if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
> + return false;
> +
> + /*
> + * Maintain the frequency if the CPU has not been idle recently, as
> + * reduction is likely to be premature.
> + */
> + idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> + ret = idle_calls == sg_cpu->saved_idle_calls;
>
> sg_cpu->saved_idle_calls = idle_calls;
> return ret;
> }
> #else
> -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; }
> #endif /* CONFIG_NO_HZ_COMMON */
>
> /*
> @@ -382,14 +393,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> return;
>
> next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
> - /*
> - * Do not reduce the frequency if the CPU has not been idle
> - * recently, as the reduction is likely to be premature then.
> - *
> - * Except when the rq is capped by uclamp_max.
> - */
> - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
> - sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
> +
> + if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
> !sg_policy->need_freq_update) {
> next_f = sg_policy->next_freq;
>
> @@ -436,14 +441,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
> return;
>
> - /*
> - * Do not reduce the target performance level if the CPU has not been
> - * idle recently, as the reduction is likely to be premature then.
> - *
> - * Except when the rq is capped by uclamp_max.
> - */
> - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
> - sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
> + if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util)
> sg_cpu->util = prev_util;
>
> cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> --
> 2.45.2
>
>
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 14:07 ` Christian Loehle
@ 2024-06-19 18:57 ` Tejun Heo
2024-06-19 19:07 ` Tejun Heo
From: Tejun Heo @ 2024-06-19 18:57 UTC (permalink / raw)
To: Christian Loehle
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello, Christian.
On Wed, Jun 19, 2024 at 03:07:32PM +0100, Christian Loehle wrote:
> > + if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
> > !sg_policy->need_freq_update) {
> > next_f = sg_policy->next_freq;
> >
>
> Not necessarily related to your changes, but in case you're touching this
> again, maybe sugov_hold_freq() could be the last condition?
I'll update the patch so that sugov_hold_freq() is the last condition.
> And do we want something like
> #ifdef CONFIG_NO_HZ_COMMON
> else
> sg_cpu->saved_idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> #endif
> here?
I have no idea, but if something like the above is necessary, it'd probably
fit better in the #else definition of sugov_hold_freq(), or just move the
#ifdef inside the function body so that the common part is outside?
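Something like the following, just as an untested sketch:

static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
{
	/* if capped by uclamp_max, always update to be in compliance */
	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
		return false;

#ifdef CONFIG_NO_HZ_COMMON
	{
		unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
		bool busy = idle_calls == sg_cpu->saved_idle_calls;

		sg_cpu->saved_idle_calls = idle_calls;
		return busy;
	}
#else
	return false;
#endif
}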
Thanks.
--
tejun
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 18:57 ` Tejun Heo
@ 2024-06-19 19:07 ` Tejun Heo
2024-06-20 9:40 ` Christian Loehle
From: Tejun Heo @ 2024-06-19 19:07 UTC (permalink / raw)
To: Christian Loehle
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Wed, Jun 19, 2024 at 08:57:56AM -1000, Tejun Heo wrote:
> Hello, Christian.
>
> On Wed, Jun 19, 2024 at 03:07:32PM +0100, Christian Loehle wrote:
> > > + if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
> > > !sg_policy->need_freq_update) {
> > > next_f = sg_policy->next_freq;
> > >
> >
> > Not necessarily related to your changes, but in case you're touching this
> > again, maybe sugov_hold_freq() could be the last condition?
>
> I'll update the patch so that sugov_hold_freq() is the last condition.
Oh, looking at the code again, this would lead to a behavior change, right?
It changes the period over which non-idleness is measured. Maybe that's okay
but it seems out-of-scope for a refactoring patch. I'll leave it as-is.
> > And do we want something like
> > #ifdef CONFIG_NO_HZ_COMMON
> > else
> > sg_cpu->saved_idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> > #endif
> > here?
>
> I have no idea but if something like the above is necessary, it'd probably
> fit better in the #else definition of sugof_hold_freq() or just move the
> #ifdef inside the function body so that the common part is outside?
and ->saved_idle_calls isn't even defined if !NO_HZ_COMMON and is only used
to determine whether to hold frequency, so the above doesn't seem necessary
either.
Thanks.
--
tejun
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-06-19 14:07 ` Christian Loehle
@ 2024-06-19 19:19 ` Tejun Heo
From: Tejun Heo @ 2024-06-19 19:19 UTC (permalink / raw)
To: Christian Loehle
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello,
On Wed, Jun 19, 2024 at 03:07:51PM +0100, Christian Loehle wrote:
> > static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > {
> > - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > + unsigned long min, max;
> > + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> > + scx_cpuperf_target(sg_cpu->cpu);
>
> What does cpu_util_cfs_boost() contain if scx is active?
Hmm... when SCX is in partial mode, cpu_util_cfs_boost() reports the util
metric from the cfs side as usual but, yeah, when scx_switched_all(), this
could be stale. I'll update.
> NIT: reverse xmas
Ah, let me just make the initial assignment a separate statement. Reverse
xmas tree ordering is awkward with broken lines.
Thanks.
--
tejun
* [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-06-19 3:12 ` [PATCH 2/2] sched_ext: Add cpuperf support Tejun Heo
2024-06-19 14:07 ` Christian Loehle
@ 2024-06-19 19:51 ` Tejun Heo
2024-06-21 22:39 ` Tejun Heo
2024-07-05 12:41 ` Vincent Guittot
2024-07-02 10:23 ` [PATCH " Hongyan Xia
2024-07-24 23:45 ` Qais Yousef
From: Tejun Heo @ 2024-06-19 19:51 UTC (permalink / raw)
To: rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
David Vernet, Rafael J . Wysocki
sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to the
highest performance level due to RT tasks running on the CPUs.
Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:
- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
different CPUs in the system.
- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
relative to its max performance.
- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
This gives the BPF scheduler direct control over CPU performance settings.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling the frequency-hold heuristic, as it may
not apply well to sched_ext schedulers, which may have a much weaker
connection between tasks and their current / last CPU.
With cpuperf support added, there is no reason to block uclamp. Enable it
while at it.
A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.
v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
to avoid factoring in a stale util metric. (Christian)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Christian Loehle <christian.loehle@arm.com>
---
kernel/sched/cpufreq_schedutil.c | 12 ++
kernel/sched/ext.c | 83 +++++++++++++++++-
kernel/sched/ext.h | 9 +
kernel/sched/sched.h | 1
tools/sched_ext/include/scx/common.bpf.h | 3
tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++++++++++-
tools/sched_ext/scx_qmap.c | 8 +
7 files changed, 252 insertions(+), 6 deletions(-)
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -197,8 +197,10 @@ unsigned long sugov_effective_cpu_perf(i
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{
- unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+ unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
+ if (!scx_switched_all())
+ util += cpu_util_cfs_boost(sg_cpu->cpu);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
sg_cpu->bw_min = min;
@@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov
unsigned long idle_calls;
bool ret;
+ /*
+ * The heuristics in this function is for the fair class. For SCX, the
+ * performance target comes directly from the BPF scheduler. Let's just
+ * follow it.
+ */
+ if (scx_switched_all())
+ return false;
+
/* if capped by uclamp_max, always update to be in compliance */
if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
return false;
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -16,6 +16,8 @@ enum scx_consts {
SCX_EXIT_BT_LEN = 64,
SCX_EXIT_MSG_LEN = 1024,
SCX_EXIT_DUMP_DFL_LEN = 32768,
+
+ SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
};
enum scx_exit_kind {
@@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
.update_curr = update_curr_scx,
#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 0,
+ .uclamp_enabled = 1,
#endif
};
@@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_e
struct scx_task_iter sti;
struct task_struct *p;
unsigned long timeout;
- int i, ret;
+ int i, cpu, ret;
mutex_lock(&scx_ops_enable_mutex);
@@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_e
atomic_long_set(&scx_nr_rejected, 0);
+ for_each_possible_cpu(cpu)
+ cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
+
/*
* Keep CPUs stable during enable so that the BPF scheduler can track
* online CPUs by watching ->on/offline_cpu() after ->init().
@@ -5836,6 +5841,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char
}
/**
+ * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the maximum relative capacity of @cpu in relation to the most
+ * performant CPU in the system. The return value is in the range [1,
+ * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
+{
+ if (ops_cpu_valid(cpu, NULL))
+ return arch_scale_cpu_capacity(cpu);
+ else
+ return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the current relative performance of @cpu in relation to its maximum.
+ * The return value is in the range [1, %SCX_CPUPERF_ONE].
+ *
+ * The current performance level of a CPU in relation to the maximum performance
+ * available in the system can be calculated as follows:
+ *
+ * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
+ *
+ * The result is in the range [1, %SCX_CPUPERF_ONE].
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
+{
+ if (ops_cpu_valid(cpu, NULL))
+ return arch_scale_freq_capacity(cpu);
+ else
+ return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
+ * @cpu: CPU of interest
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @flags: %SCX_CPUPERF_* flags
+ *
+ * Set the target performance level of @cpu to @perf. @perf is in linear
+ * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
+ * schedutil cpufreq governor chooses the target frequency.
+ *
+ * The actual performance level chosen, CPU grouping, and the overhead and
+ * latency of the operations are dependent on the hardware and cpufreq driver in
+ * use. Consult hardware and cpufreq documentation for more information. The
+ * current performance level can be monitored using scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
+{
+ if (unlikely(perf > SCX_CPUPERF_ONE)) {
+ scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
+ return;
+ }
+
+ if (ops_cpu_valid(cpu, NULL)) {
+ struct rq *rq = cpu_rq(cpu);
+
+ rq->scx.cpuperf_target = perf;
+
+ rcu_read_lock_sched_notrace();
+ cpufreq_update_util(cpu_rq(cpu), 0);
+ rcu_read_unlock_sched_notrace();
+ }
+}
+
+/**
* scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
*
* All valid CPU IDs in the system are smaller than the returned value.
@@ -6045,6 +6121,9 @@ BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -48,6 +48,14 @@ int scx_check_setscheduler(struct task_s
bool task_should_scx(struct task_struct *p);
void init_sched_ext_class(void);
+static inline u32 scx_cpuperf_target(s32 cpu)
+{
+ if (scx_enabled())
+ return cpu_rq(cpu)->scx.cpuperf_target;
+ else
+ return 0;
+}
+
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
class++;
@@ -89,6 +97,7 @@ static inline void scx_pre_fork(struct t
static inline int scx_fork(struct task_struct *p) { return 0; }
static inline void scx_post_fork(struct task_struct *p) {}
static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
static inline void scx_rq_activate(struct rq *rq) {}
static inline void scx_rq_deactivate(struct rq *rq) {}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -743,6 +743,7 @@ struct scx_rq {
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
u32 flags;
+ u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */
bool cpu_released;
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_kick_if_idle;
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -42,6 +42,9 @@ void scx_bpf_destroy_dsq(u64 dsq_id) __k
void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak;
+u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
+u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
+void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -69,6 +69,18 @@ struct {
};
/*
+ * If enabled, the CPU performance target is set according to the queue index
+ * per the following table.
+ */
+static const u32 qidx_to_cpuperf_target[] = {
+ [0] = SCX_CPUPERF_ONE * 0 / 4,
+ [1] = SCX_CPUPERF_ONE * 1 / 4,
+ [2] = SCX_CPUPERF_ONE * 2 / 4,
+ [3] = SCX_CPUPERF_ONE * 3 / 4,
+ [4] = SCX_CPUPERF_ONE * 4 / 4,
+};
+
+/*
* Per-queue sequence numbers to implement core-sched ordering.
*
* Tail seq is assigned to each queued task and incremented. Head seq tracks the
@@ -95,6 +107,8 @@ struct {
struct cpu_ctx {
u64 dsp_idx; /* dispatch index */
u64 dsp_cnt; /* remaining count */
+ u32 avg_weight;
+ u32 cpuperf_target;
};
struct {
@@ -107,6 +121,8 @@ struct {
/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
u64 nr_core_sched_execed;
+u32 cpuperf_min, cpuperf_avg, cpuperf_max;
+u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
@@ -313,6 +329,29 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 c
}
}
+void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
+{
+ struct cpu_ctx *cpuc;
+ u32 zero = 0;
+ int idx;
+
+ if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
+ scx_bpf_error("failed to look up cpu_ctx");
+ return;
+ }
+
+ /*
+ * Use the running avg of weights to select the target cpuperf level.
+ * This is a demonstration of the cpuperf feature rather than a
+ * practical strategy to regulate CPU frequency.
+ */
+ cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
+ idx = weight_to_idx(cpuc->avg_weight);
+ cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
+
+ scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+}
+
/*
* The distance from the head of the queue scaled by the weight of the queue.
* The lower the number, the older the task and the higher the priority.
@@ -422,8 +461,9 @@ void BPF_STRUCT_OPS(qmap_dump_cpu, struc
if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
return;
- scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu",
- cpuc->dsp_idx, cpuc->dsp_cnt);
+ scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
+ cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
+ cpuc->cpuperf_target);
}
void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
@@ -492,11 +532,106 @@ void BPF_STRUCT_OPS(qmap_cpu_offline, s3
print_cpus();
}
+struct monitor_timer {
+ struct bpf_timer timer;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct monitor_timer);
+} monitor_timer SEC(".maps");
+
+/*
+ * Print out the min, avg and max performance levels of CPUs every second to
+ * demonstrate the cpuperf interface.
+ */
+static void monitor_cpuperf(void)
+{
+ u32 zero = 0, nr_cpu_ids;
+ u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
+ u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
+ const struct cpumask *online;
+ int i, nr_online_cpus = 0;
+
+ nr_cpu_ids = scx_bpf_nr_cpu_ids();
+ online = scx_bpf_get_online_cpumask();
+
+ bpf_for(i, 0, nr_cpu_ids) {
+ struct cpu_ctx *cpuc;
+ u32 cap, cur;
+
+ if (!bpf_cpumask_test_cpu(i, online))
+ continue;
+ nr_online_cpus++;
+
+ /* collect the capacity and current cpuperf */
+ cap = scx_bpf_cpuperf_cap(i);
+ cur = scx_bpf_cpuperf_cur(i);
+
+ cur_min = cur < cur_min ? cur : cur_min;
+ cur_max = cur > cur_max ? cur : cur_max;
+
+ /*
+ * $cur is relative to $cap. Scale it down accordingly so that
+ * it's in the same scale as other CPUs and $cur_sum/$cap_sum
+ * makes sense.
+ */
+ cur_sum += cur * cap / SCX_CPUPERF_ONE;
+ cap_sum += cap;
+
+ if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
+ scx_bpf_error("failed to look up cpu_ctx");
+ goto out;
+ }
+
+ /* collect target */
+ cur = cpuc->cpuperf_target;
+ target_sum += cur;
+ target_min = cur < target_min ? cur : target_min;
+ target_max = cur > target_max ? cur : target_max;
+ }
+
+ cpuperf_min = cur_min;
+ cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
+ cpuperf_max = cur_max;
+
+ cpuperf_target_min = target_min;
+ cpuperf_target_avg = target_sum / nr_online_cpus;
+ cpuperf_target_max = target_max;
+out:
+ scx_bpf_put_cpumask(online);
+}
+
+static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+ monitor_cpuperf();
+
+ bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+ return 0;
+}
+
s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
+ u32 key = 0;
+ struct bpf_timer *timer;
+ s32 ret;
+
print_cpus();
- return scx_bpf_create_dsq(SHARED_DSQ, -1);
+ ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+ if (ret)
+ return ret;
+
+ timer = bpf_map_lookup_elem(&monitor_timer, &key);
+ if (!timer)
+ return -ESRCH;
+
+ bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, monitor_timerfn);
+
+ return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
}
void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
@@ -509,6 +644,7 @@ SCX_OPS_DEFINE(qmap_ops,
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
+ .tick = (void *)qmap_tick,
.core_sched_before = (void *)qmap_core_sched_before,
.cpu_release = (void *)qmap_cpu_release,
.init_task = (void *)qmap_init_task,
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -116,6 +116,14 @@ int main(int argc, char **argv)
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
skel->bss->nr_core_sched_execed);
+ if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+ printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
+ skel->bss->cpuperf_min,
+ skel->bss->cpuperf_avg,
+ skel->bss->cpuperf_max,
+ skel->bss->cpuperf_target_min,
+ skel->bss->cpuperf_target_avg,
+ skel->bss->cpuperf_target_max);
fflush(stdout);
sleep(1);
}
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 18:45 ` Rafael J. Wysocki
@ 2024-06-19 19:53 ` Tejun Heo
2024-06-20 17:57 ` Rafael J. Wysocki
From: Tejun Heo @ 2024-06-19 19:53 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: viresh.kumar, linux-pm, void, linux-kernel, kernel-team, mingo,
peterz, David Vernet, Rafael J . Wysocki
Hello, Rafael.
On Wed, Jun 19, 2024 at 08:45:42PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jun 19, 2024 at 5:13 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > sugov_cpu_is_busy() is used to avoid decreasing performance level while the
> > CPU is busy and called by sugov_update_single_freq() and
> > sugov_update_single_perf(). Both callers repeat the same pattern to first
> > test for uclamp and then the business. Let's refactor so that the tests
> > aren't repeated.
> >
> > The new helper is named sugov_hold_freq() and tests both the uclamp
> > exception and CPU business. No functional changes. This will make adding
> > more exception conditions easier.
> >
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reviewed-by: David Vernet <dvernet@meta.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Viresh Kumar <viresh.kumar@linaro.org>
>
> Acked-by: Rafael J. Wysocki <rafael@kernel.org>
>
> for this particular change.
If the cpufreq_schedutil part of the second patch looks good to you, would
it be okay to route it together with this patch through the sched_ext tree?
Thanks.
--
tejun
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 19:07 ` Tejun Heo
@ 2024-06-20 9:40 ` Christian Loehle
From: Christian Loehle @ 2024-06-20 9:40 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On 6/19/24 20:07, Tejun Heo wrote:
> On Wed, Jun 19, 2024 at 08:57:56AM -1000, Tejun Heo wrote:
>> Hello, Christian.
>>
>> On Wed, Jun 19, 2024 at 03:07:32PM +0100, Christian Loehle wrote:
>>>> + if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
>>>> !sg_policy->need_freq_update) {
>>>> next_f = sg_policy->next_freq;
>>>>
>>>
>>> Not necessarily related to your changes, but in case you're touching this
>>> again, maybe sugov_hold_freq() could be the last condition?
>>
>> I'll update the patch so that sugov_hold_freq() is the last condition.
>
> Oh, looking at the code again, this would lead to behavior change, right? It
> changes the period over which non-idleness is measured. Maybe that's okay
> but seems out-of-scope for a refactoring patch. I'll leave it as-is.
It does prevent idle_calls from being updated in some cases but, see below,
I don't think those cases are that deliberate anyway.
>
>>> And do we want something like
>>> #ifdef CONFIG_NO_HZ_COMMON
>>> else
>>> sg_cpu->saved_idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
>>> #endif
>>> here?
>>
>> I have no idea but if something like the above is necessary, it'd probably
>> fit better in the #else definition of sugof_hold_freq() or just move the
>> #ifdef inside the function body so that the common part is outside?
>
> and ->saved_idle_calls isn't even defined if !NO_HZ_COMMON and is only used
> to determine whether to hold frequency, so the above doesn't seem necessary
> either.
Reading that code again, it seems like the right thing to do.
Anyway, feel free to ignore; I might pick it up myself.
I'll think it through some more; the question mark was more of an open
question to anyone reading this.
Kind Regards,
Christian
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 19:53 ` Tejun Heo
@ 2024-06-20 17:57 ` Rafael J. Wysocki
2024-06-20 18:08 ` Tejun Heo
From: Rafael J. Wysocki @ 2024-06-20 17:57 UTC (permalink / raw)
To: Tejun Heo
Cc: Rafael J. Wysocki, viresh.kumar, linux-pm, void, linux-kernel,
kernel-team, mingo, peterz, David Vernet, Rafael J . Wysocki
Hi Tejun,
On Wed, Jun 19, 2024 at 9:53 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Rafael.
>
> On Wed, Jun 19, 2024 at 08:45:42PM +0200, Rafael J. Wysocki wrote:
> > On Wed, Jun 19, 2024 at 5:13 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > sugov_cpu_is_busy() is used to avoid decreasing performance level while the
> > > CPU is busy and called by sugov_update_single_freq() and
> > > sugov_update_single_perf(). Both callers repeat the same pattern to first
> > > test for uclamp and then the business. Let's refactor so that the tests
> > > aren't repeated.
> > >
> > > The new helper is named sugov_hold_freq() and tests both the uclamp
> > > exception and CPU business. No functional changes. This will make adding
> > > more exception conditions easier.
> > >
> > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > Reviewed-by: David Vernet <dvernet@meta.com>
> > > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> >
> > Acked-by: Rafael J. Wysocki <rafael@kernel.org>
> >
> > for this particular change.
>
> If the cpufreq_schedutil part of the second patch looks good to you, would
> it be okay to route together with this patch through the sched_ext tree?
Please feel free to pick up the $subject patch (with my ACK).
As for the [2/2], I don't think I'm sufficiently familiar with the
scx_* stuff to make any comments on it, either way.
Cheers, Rafael
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-20 17:57 ` Rafael J. Wysocki
@ 2024-06-20 18:08 ` Tejun Heo
0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2024-06-20 18:08 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: viresh.kumar, linux-pm, void, linux-kernel, kernel-team, mingo,
peterz, David Vernet, Rafael J . Wysocki
Hello, Rafael.
On Thu, Jun 20, 2024 at 07:57:47PM +0200, Rafael J. Wysocki wrote:
> Please feel free to pick up the $subject patch (with my ACK).
Will do. Thanks.
> As for the [2/2], I don't think I'm sufficiently familiar with the
> scx_* stuff to make any comments on it, either way.
Yeah, as long as the cpufreq part isn't objectionable to you, it's all good.
Thank you.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy()
2024-06-19 3:12 ` [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() Tejun Heo
2024-06-19 14:07 ` Christian Loehle
2024-06-19 18:45 ` Rafael J. Wysocki
@ 2024-06-21 22:38 ` Tejun Heo
2 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2024-06-21 22:38 UTC (permalink / raw)
To: rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
David Vernet, Rafael J . Wysocki
On Tue, Jun 18, 2024 at 05:12:02PM -1000, Tejun Heo wrote:
> sugov_cpu_is_busy() is used to avoid decreasing performance level while the
> CPU is busy and called by sugov_update_single_freq() and
> sugov_update_single_perf(). Both callers repeat the same pattern to first
> test for uclamp and then the business. Let's refactor so that the tests
> aren't repeated.
>
> The new helper is named sugov_hold_freq() and tests both the uclamp
> exception and CPU business. No functional changes. This will make adding
> more exception conditions easier.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
Applied to sched_ext/for-6.11.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-06-19 19:51 ` [PATCH v2 " Tejun Heo
@ 2024-06-21 22:39 ` Tejun Heo
2024-07-05 12:41 ` Vincent Guittot
1 sibling, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2024-06-21 22:39 UTC (permalink / raw)
To: rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
David Vernet, Rafael J . Wysocki
On Wed, Jun 19, 2024 at 09:51:39AM -1000, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
>
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
>
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
> different CPUs in the system.
>
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
> relative to its max performance.
>
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
>
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
>
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
>
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
>
> v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
> to avoid factoring in stale util metric. (Christian)
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Cc: Christian Loehle <christian.loehle@arm.com>
Applied to sched_ext/for-6.11.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-06-19 3:12 ` [PATCH 2/2] sched_ext: Add cpuperf support Tejun Heo
2024-06-19 14:07 ` Christian Loehle
2024-06-19 19:51 ` [PATCH v2 " Tejun Heo
@ 2024-07-02 10:23 ` Hongyan Xia
2024-07-02 16:37 ` Tejun Heo
2024-07-24 23:45 ` Qais Yousef
3 siblings, 1 reply; 36+ messages in thread
From: Hongyan Xia @ 2024-07-02 10:23 UTC (permalink / raw)
To: Tejun Heo, rafael, viresh.kumar
Cc: linux-pm, void, linux-kernel, kernel-team, mingo, peterz,
David Vernet, Rafael J . Wysocki
On 19/06/2024 04:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
>
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
>
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
> different CPUs in the system.
>
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
> relative to its max performance.
>
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
>
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
>
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
>
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
> kernel/sched/cpufreq_schedutil.c | 12 +-
> kernel/sched/ext.c | 83 ++++++++++++-
> kernel/sched/ext.h | 9 ++
> kernel/sched/sched.h | 1 +
> tools/sched_ext/include/scx/common.bpf.h | 3 +
> tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++-
> tools/sched_ext/scx_qmap.c | 8 ++
> 7 files changed, 252 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + unsigned long min, max;
> + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> + scx_cpuperf_target(sg_cpu->cpu);
>
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> unsigned long idle_calls;
> bool ret;
>
> + /*
> + * The heuristics in this function is for the fair class. For SCX, the
> + * performance target comes directly from the BPF scheduler. Let's just
> + * follow it.
> + */
> + if (scx_switched_all())
> + return false;
> +
> /* if capped by uclamp_max, always update to be in compliance */
> if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
> return false;
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f814e84ceeb3..04fb0eeee5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
> SCX_EXIT_BT_LEN = 64,
> SCX_EXIT_MSG_LEN = 1024,
> SCX_EXIT_DUMP_DFL_LEN = 32768,
> +
> + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
> };
>
> enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
> .update_curr = update_curr_scx,
>
> #ifdef CONFIG_UCLAMP_TASK
> - .uclamp_enabled = 0,
> + .uclamp_enabled = 1,
> #endif
> };
>
Hi. I know this is a bit late, but the implications of this change here
can be quite interesting.
With this patch but without switching this knob from 0 to 1, this series
gives me the perfect opportunity to implement a custom uclamp within
sched_ext on top of the cpufreq support added. I think this would be
what some vendors looking at sched_ext would also want. But, if
.uclamp_enabled == 1, then the mainline uclamp implementation is in
effect regardless of what ext scheduler is loaded. In fact,
uclamp_{inc,dec}() are called before the {enqueue,dequeue}_task() hooks, so
there's no easy way to circumvent it.
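For reference, the enqueue side in kernel/sched/core.c is roughly the
following (heavily simplified sketch), which is why a loaded scheduler cannot
simply opt out once .uclamp_enabled is set:

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
	/* simplified: uclamp accounting happens before the class hook */
	uclamp_rq_inc(rq, p);
	p->sched_class->enqueue_task(rq, p, flags);
}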
What would be really nice is to have cpufreq support in sched_ext but
not force uclamp_enabled. But, I also think there will be people who are
happy with the current uclamp implementation and want to just reuse it.
The best thing is to let the loaded scheduler decide, somehow, which I
don't know if there's an easy way to do this yet.
> [...]
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-07-02 10:23 ` [PATCH " Hongyan Xia
@ 2024-07-02 16:37 ` Tejun Heo
2024-07-02 17:12 ` Hongyan Xia
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2024-07-02 16:37 UTC (permalink / raw)
To: Hongyan Xia
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello, Hongyan.
On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote:
> What would be really nice is to have cpufreq support in sched_ext but not
> force uclamp_enabled. But, I also think there will be people who are happy
> with the current uclamp implementation and want to just reuse it. The best
> thing is to let the loaded scheduler decide, somehow, which I don't know if
> there's an easy way to do this yet.
I don't know much about uclamp but, at least from the sched_ext side, it's
trivial to add an ops flag for it, and because we know that no tasks are on the
ext class before the BPF scheduler is loaded, as long as we switch the
uclamp_enabled value while the BPF scheduler is not loaded, the uclamp
buckets should stay balanced. AFAICS, the only core change we need to make
is moving the uclamp_enabled bool outside sched_class so that it can be
changed at runtime. Is that the case, or am I missing something?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-07-02 16:37 ` Tejun Heo
@ 2024-07-02 17:12 ` Hongyan Xia
2024-07-02 17:56 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Hongyan Xia @ 2024-07-02 17:12 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On 02/07/2024 17:37, Tejun Heo wrote:
> Hello, Hongyan.
>
> On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote:
>> What would be really nice is to have cpufreq support in sched_ext but not
>> force uclamp_enabled. But, I also think there will be people who are happy
>> with the current uclamp implementation and want to just reuse it. The best
>> thing is to let the loaded scheduler decide, somehow, which I don't know if
>> there's an easy way to do this yet.
>
> I don't know much about uclamp but, at least from the sched_ext side, it's
> trivial to add an ops flag for it, and because we know that no tasks are on the
> ext class before the BPF scheduler is loaded, as long as we switch the
> uclamp_enabled value while the BPF scheduler is not loaded, the uclamp
> buckets should stay balanced. AFAICS, the only core change we need to make
> is moving the uclamp_enabled bool outside sched_class so that it can be
> changed at runtime. Is that the case, or am I missing something?
>
Pretty much. Just to clarify what I meant, it would be fantastic if, for
ext, sched_class->uclamp_enabled were decided the moment we load the
custom scheduler rather than being globally enabled all the time for all ext
schedulers, in case the custom scheduler wants to ignore uclamp or has
its own uclamp implementation. During ext_ops->init(), it would be great
if the loaded scheduler could decide whether its
sched_class->uclamp_enabled should be enabled.
However, sched_class->uclamp_enabled is just a normal struct variable,
so I cannot immediately see a clean way to let the loaded scheduler
program this field. We might be able to expose a function from the
kernel side to write sched_class->uclamp_enabled during ext_ops->init(),
although that looks a bit messy.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-07-02 17:12 ` Hongyan Xia
@ 2024-07-02 17:56 ` Tejun Heo
2024-07-02 20:41 ` Hongyan Xia
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2024-07-02 17:56 UTC (permalink / raw)
To: Hongyan Xia
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello,
So, maybe something like this. It's not the prettiest but avoids adding
indirect calls to fair and rt while allowing sched_ext to report what the
BPF scheduler wants. Only compile tested. Would something like this work for
the use cases you have in mind?
Thanks.
Index: work/kernel/sched/core.c
===================================================================
--- work.orig/kernel/sched/core.c
+++ work/kernel/sched/core.c
@@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru
}
}
+bool sched_uclamp_enabled(void)
+{
+ return true;
+}
+
+static bool class_supports_uclamp(const struct sched_class *class)
+{
+ if (likely(class->uclamp_enabled == sched_uclamp_enabled))
+ return true;
+ if (!class->uclamp_enabled)
+ return false;
+ return class->uclamp_enabled();
+}
+
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
enum uclamp_id clamp_id;
@@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct
if (!static_branch_unlikely(&sched_uclamp_used))
return;
- if (unlikely(!p->sched_class->uclamp_enabled))
+ if (!class_supports_uclamp(p->sched_class))
return;
for_each_clamp_id(clamp_id)
@@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct
if (!static_branch_unlikely(&sched_uclamp_used))
return;
- if (unlikely(!p->sched_class->uclamp_enabled))
+ if (!class_supports_uclamp(p->sched_class))
return;
for_each_clamp_id(clamp_id)
Index: work/kernel/sched/ext.c
===================================================================
--- work.orig/kernel/sched/ext.c
+++ work/kernel/sched/ext.c
@@ -116,10 +116,17 @@ enum scx_ops_flags {
*/
SCX_OPS_SWITCH_PARTIAL = 1LLU << 3,
+ /*
+ * Disable built-in uclamp support. Can be useful when the BPF scheduler
+ * wants to implement custom uclamp support.
+ */
+ SCX_OPS_DISABLE_UCLAMP = 1LLU << 4,
+
SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
SCX_OPS_ENQ_LAST |
SCX_OPS_ENQ_EXITING |
- SCX_OPS_SWITCH_PARTIAL,
+ SCX_OPS_SWITCH_PARTIAL |
+ SCX_OPS_DISABLE_UCLAMP,
};
/* argument container for ops.init_task() */
@@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq
static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
+#ifdef CONFIG_UCLAMP_TASK
+static bool uclamp_enabled_scx(void)
+{
+ return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP);
+}
+#endif
+
int scx_check_setscheduler(struct task_struct *p, int policy)
{
lockdep_assert_rq_held(task_rq(p));
@@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = {
.update_curr = update_curr_scx,
#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
+ .uclamp_enabled = uclamp_enabled_scx,
#endif
};
Index: work/kernel/sched/fair.c
===================================================================
--- work.orig/kernel/sched/fair.c
+++ work/kernel/sched/fair.c
@@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = {
.task_is_throttled = task_is_throttled_fair,
#endif
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
+ SCHED_CLASS_UCLAMP_ENABLED
};
#ifdef CONFIG_SCHED_DEBUG
Index: work/kernel/sched/rt.c
===================================================================
--- work.orig/kernel/sched/rt.c
+++ work/kernel/sched/rt.c
@@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = {
.task_is_throttled = task_is_throttled_rt,
#endif
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
+ SCHED_CLASS_UCLAMP_ENABLED
};
#ifdef CONFIG_RT_GROUP_SCHED
Index: work/kernel/sched/sched.h
===================================================================
--- work.orig/kernel/sched/sched.h
+++ work/kernel/sched/sched.h
@@ -2339,11 +2339,6 @@ struct affinity_context {
extern s64 update_curr_common(struct rq *rq);
struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
- int uclamp_enabled;
-#endif
-
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
@@ -2405,8 +2400,21 @@ struct sched_class {
#ifdef CONFIG_SCHED_CORE
int (*task_is_throttled)(struct task_struct *p, int cpu);
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ bool (*uclamp_enabled)(void);
+#endif
};
+#ifdef CONFIG_UCLAMP_TASK
+bool sched_uclamp_enabled(void);
+
+#define SCHED_CLASS_UCLAMP_ENABLED \
+ .uclamp_enabled = sched_uclamp_enabled,
+#else
+#define SCHED_CLASS_UCLAMP_ENABLED
+#endif
+
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->curr != prev);
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-07-02 17:56 ` Tejun Heo
@ 2024-07-02 20:41 ` Hongyan Xia
2024-07-02 21:12 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Hongyan Xia @ 2024-07-02 20:41 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On 02/07/2024 18:56, Tejun Heo wrote:
> Hello,
>
> So, maybe something like this. It's not the prettiest but avoids adding
> indirect calls to fair and rt while allowing sched_ext to report what the
> BPF scheduler wants. Only compile tested. Would something like this work for
> the use cases you have in mind?
>
> Thanks.
>
> Index: work/kernel/sched/core.c
> ===================================================================
> --- work.orig/kernel/sched/core.c
> +++ work/kernel/sched/core.c
> @@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru
> }
> }
>
> +bool sched_uclamp_enabled(void)
> +{
> + return true;
> +}
> +
> +static bool class_supports_uclamp(const struct sched_class *class)
> +{
> + if (likely(class->uclamp_enabled == sched_uclamp_enabled))
> + return true;
> + if (!class->uclamp_enabled)
> + return false;
> + return class->uclamp_enabled();
> +}
> +
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> {
> enum uclamp_id clamp_id;
> @@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct
> if (!static_branch_unlikely(&sched_uclamp_used))
> return;
>
> - if (unlikely(!p->sched_class->uclamp_enabled))
> + if (!class_supports_uclamp(p->sched_class))
> return;
>
> for_each_clamp_id(clamp_id)
> @@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct
> if (!static_branch_unlikely(&sched_uclamp_used))
> return;
>
> - if (unlikely(!p->sched_class->uclamp_enabled))
> + if (!class_supports_uclamp(p->sched_class))
> return;
>
> for_each_clamp_id(clamp_id)
> Index: work/kernel/sched/ext.c
> ===================================================================
> --- work.orig/kernel/sched/ext.c
> +++ work/kernel/sched/ext.c
> @@ -116,10 +116,17 @@ enum scx_ops_flags {
> */
> SCX_OPS_SWITCH_PARTIAL = 1LLU << 3,
>
> + /*
> + * Disable built-in uclamp support. Can be useful when the BPF scheduler
> + * wants to implement custom uclamp support.
> + */
> + SCX_OPS_DISABLE_UCLAMP = 1LLU << 4,
> +
> SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE |
> SCX_OPS_ENQ_LAST |
> SCX_OPS_ENQ_EXITING |
> - SCX_OPS_SWITCH_PARTIAL,
> + SCX_OPS_SWITCH_PARTIAL |
> + SCX_OPS_DISABLE_UCLAMP,
> };
>
> /* argument container for ops.init_task() */
> @@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq
> static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
> static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
>
> +#ifdef CONFIG_UCLAMP_TASK
> +static bool uclamp_enabled_scx(void)
> +{
> + return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP);
> +}
> +#endif
> +
> int scx_check_setscheduler(struct task_struct *p, int policy)
> {
> lockdep_assert_rq_held(task_rq(p));
> @@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = {
> .update_curr = update_curr_scx,
>
> #ifdef CONFIG_UCLAMP_TASK
> - .uclamp_enabled = 1,
> + .uclamp_enabled = uclamp_enabled_scx,
> #endif
> };
>
> Index: work/kernel/sched/fair.c
> ===================================================================
> --- work.orig/kernel/sched/fair.c
> +++ work/kernel/sched/fair.c
> @@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = {
> .task_is_throttled = task_is_throttled_fair,
> #endif
>
> -#ifdef CONFIG_UCLAMP_TASK
> - .uclamp_enabled = 1,
> -#endif
> + SCHED_CLASS_UCLAMP_ENABLED
> };
>
> #ifdef CONFIG_SCHED_DEBUG
> Index: work/kernel/sched/rt.c
> ===================================================================
> --- work.orig/kernel/sched/rt.c
> +++ work/kernel/sched/rt.c
> @@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = {
> .task_is_throttled = task_is_throttled_rt,
> #endif
>
> -#ifdef CONFIG_UCLAMP_TASK
> - .uclamp_enabled = 1,
> -#endif
> + SCHED_CLASS_UCLAMP_ENABLED
> };
>
> #ifdef CONFIG_RT_GROUP_SCHED
> Index: work/kernel/sched/sched.h
> ===================================================================
> --- work.orig/kernel/sched/sched.h
> +++ work/kernel/sched/sched.h
> @@ -2339,11 +2339,6 @@ struct affinity_context {
> extern s64 update_curr_common(struct rq *rq);
>
> struct sched_class {
> -
> -#ifdef CONFIG_UCLAMP_TASK
> - int uclamp_enabled;
> -#endif
> -
> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*yield_task) (struct rq *rq);
> @@ -2405,8 +2400,21 @@ struct sched_class {
> #ifdef CONFIG_SCHED_CORE
> int (*task_is_throttled)(struct task_struct *p, int cpu);
> #endif
> +
> +#ifdef CONFIG_UCLAMP_TASK
> + bool (*uclamp_enabled)(void);
> +#endif
> };
>
> +#ifdef CONFIG_UCLAMP_TASK
> +bool sched_uclamp_enabled(void);
> +
> +#define SCHED_CLASS_UCLAMP_ENABLED \
> + .uclamp_enabled = sched_uclamp_enabled,
> +#else
> +#define SCHED_CLASS_UCLAMP_ENABLED
> +#endif
> +
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> {
> WARN_ON_ONCE(rq->curr != prev);
Looks good to me!
Actually, if we are okay with changing the sched_class struct and
touching the code of other classes, I wonder if a cleaner solution is
just to completely remove sched_class->uclamp_enabled and let each class
decide what to do in enqueue and dequeue, so instead of
uclamp_inc/dec();
p->sched_class->enqueue/dequeue_task();
we can just
p->sched_class->enqueue/dequeue_task();
do_uclamp_inside_each_class();
and we export uclamp_inc/dec() functions from core.c to RT, fair and
ext. For RT and fair, just
enqueue/dequeue_task_fair/rt();
uclamp_inc/dec();
For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a
scheduler wants to reuse the current uclamp implementation, just call
these. If not, skip them and implement its own.
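As a rough sketch of the fair side under that scheme (assuming
uclamp_rq_{inc,dec}() get exported from core.c; not actual code):

static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	/* ... existing fair enqueue logic ... */

	uclamp_rq_inc(rq, p);	/* fair opts in to the built-in uclamp */
}

static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	uclamp_rq_dec(rq, p);	/* mirrored on dequeue */

	/* ... existing fair dequeue logic ... */
}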
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-07-02 20:41 ` Hongyan Xia
@ 2024-07-02 21:12 ` Tejun Heo
0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2024-07-02 21:12 UTC (permalink / raw)
To: Hongyan Xia
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello,
On Tue, Jul 02, 2024 at 09:41:30PM +0100, Hongyan Xia wrote:
...
> Actually, if we are okay with changing the sched_class struct and touching
> the code of other classes, I wonder if a cleaner solution is just to
> completely remove sched_class->uclamp_enabled and let each class decide what
> to do in enqueue and dequeue, so instead of
>
> uclamp_inc/dec();
> p->sched_class->enqueue/dequeue_task();
>
> we can just
>
> p->sched_class->enqueue/dequeue_task();
> do_uclamp_inside_each_class();
>
> and we export uclamp_inc/dec() functions from core.c to RT, fair and ext.
> For RT and fair, just
>
> enqueue/dequeue_task_fair/rt();
> uclamp_inc/dec();
>
> For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a
> scheduler wants to reuse the current uclamp implementation, just call these.
> If not, skip them and implement its own.
That does sound a lot better. Mind writing up a patchset?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-06-19 19:51 ` [PATCH v2 " Tejun Heo
2024-06-21 22:39 ` Tejun Heo
@ 2024-07-05 12:41 ` Vincent Guittot
2024-07-05 18:22 ` Tejun Heo
1 sibling, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2024-07-05 12:41 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Wed, 19 Jun 2024 at 21:52, Tejun Heo <tj@kernel.org> wrote:
>
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
>
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
>
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
> different CPUs in the system.
>
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
> relative to its max performance.
>
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
>
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
>
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
>
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
>
> v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
> to avoid factoring in stale util metric. (Christian)
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Cc: Christian Loehle <christian.loehle@arm.com>
> ---
> kernel/sched/cpufreq_schedutil.c | 12 ++
> kernel/sched/ext.c | 83 +++++++++++++++++-
> kernel/sched/ext.h | 9 +
> kernel/sched/sched.h | 1
> tools/sched_ext/include/scx/common.bpf.h | 3
> tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++++++++++-
> tools/sched_ext/scx_qmap.c | 8 +
> 7 files changed, 252 insertions(+), 6 deletions(-)
>
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,8 +197,10 @@ unsigned long sugov_effective_cpu_perf(i
>
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
>
> + if (!scx_switched_all())
> + util += cpu_util_cfs_boost(sg_cpu->cpu);
I don't see the need for this. If fair is not used, this returns zero
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> util = max(util, boost);
> sg_cpu->bw_min = min;
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov
> unsigned long idle_calls;
> bool ret;
>
> + /*
> + * The heuristics in this function is for the fair class. For SCX, the
> + * performance target comes directly from the BPF scheduler. Let's just
> + * follow it.
> + */
> + if (scx_switched_all())
> + return false;
> +
> /* if capped by uclamp_max, always update to be in compliance */
> if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
> return false;
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
> SCX_EXIT_BT_LEN = 64,
> SCX_EXIT_MSG_LEN = 1024,
> SCX_EXIT_DUMP_DFL_LEN = 32768,
> +
> + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
> };
>
> enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
> .update_curr = update_curr_scx,
>
> #ifdef CONFIG_UCLAMP_TASK
> - .uclamp_enabled = 0,
> + .uclamp_enabled = 1,
> #endif
> };
>
> @@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_e
> struct scx_task_iter sti;
> struct task_struct *p;
> unsigned long timeout;
> - int i, ret;
> + int i, cpu, ret;
>
> mutex_lock(&scx_ops_enable_mutex);
>
> @@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_e
>
> atomic_long_set(&scx_nr_rejected, 0);
>
> + for_each_possible_cpu(cpu)
> + cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
> +
> /*
> * Keep CPUs stable during enable so that the BPF scheduler can track
> * online CPUs by watching ->on/offline_cpu() after ->init().
> @@ -5836,6 +5841,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char
> }
>
> /**
> + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the maximum relative capacity of @cpu in relation to the most
> + * performant CPU in the system. The return value is in the range [1,
> + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
> +{
> + if (ops_cpu_valid(cpu, NULL))
> + return arch_scale_cpu_capacity(cpu);
> + else
> + return SCX_CPUPERF_ONE;
> +}
> +
> +/**
> + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the current relative performance of @cpu in relation to its maximum.
> + * The return value is in the range [1, %SCX_CPUPERF_ONE].
> + *
> + * The current performance level of a CPU in relation to the maximum performance
> + * available in the system can be calculated as follows:
> + *
> + * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
> + *
> + * The result is in the range [1, %SCX_CPUPERF_ONE].
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
> +{
> + if (ops_cpu_valid(cpu, NULL))
> + return arch_scale_freq_capacity(cpu);
> + else
> + return SCX_CPUPERF_ONE;
> +}
> +
> +/**
> + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
> + * @cpu: CPU of interest
> + * @perf: target performance level [0, %SCX_CPUPERF_ONE]
> + * @flags: %SCX_CPUPERF_* flags
> + *
> + * Set the target performance level of @cpu to @perf. @perf is in linear
> + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
> + * schedutil cpufreq governor chooses the target frequency.
> + *
> + * The actual performance level chosen, CPU grouping, and the overhead and
> + * latency of the operations are dependent on the hardware and cpufreq driver in
> + * use. Consult hardware and cpufreq documentation for more information. The
> + * current performance level can be monitored using scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
> +{
> + if (unlikely(perf > SCX_CPUPERF_ONE)) {
> + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
> + return;
> + }
> +
> + if (ops_cpu_valid(cpu, NULL)) {
> + struct rq *rq = cpu_rq(cpu);
> +
> + rq->scx.cpuperf_target = perf;
> +
> + rcu_read_lock_sched_notrace();
> + cpufreq_update_util(cpu_rq(cpu), 0);
> + rcu_read_unlock_sched_notrace();
> + }
> +}
> +
> +/**
> * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
> *
> * All valid CPU IDs in the system are smaller than the returned value.
> @@ -6045,6 +6121,9 @@ BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
> BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
> +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
> +BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
> BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
> BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
> BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
> --- a/kernel/sched/ext.h
> +++ b/kernel/sched/ext.h
> @@ -48,6 +48,14 @@ int scx_check_setscheduler(struct task_s
> bool task_should_scx(struct task_struct *p);
> void init_sched_ext_class(void);
>
> +static inline u32 scx_cpuperf_target(s32 cpu)
> +{
> + if (scx_enabled())
> + return cpu_rq(cpu)->scx.cpuperf_target;
> + else
> + return 0;
> +}
> +
> static inline const struct sched_class *next_active_class(const struct sched_class *class)
> {
> class++;
> @@ -89,6 +97,7 @@ static inline void scx_pre_fork(struct t
> static inline int scx_fork(struct task_struct *p) { return 0; }
> static inline void scx_post_fork(struct task_struct *p) {}
> static inline void scx_cancel_fork(struct task_struct *p) {}
> +static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
> static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
> static inline void scx_rq_activate(struct rq *rq) {}
> static inline void scx_rq_deactivate(struct rq *rq) {}
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -743,6 +743,7 @@ struct scx_rq {
> u64 extra_enq_flags; /* see move_task_to_local_dsq() */
> u32 nr_running;
> u32 flags;
> + u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */
> bool cpu_released;
> cpumask_var_t cpus_to_kick;
> cpumask_var_t cpus_to_kick_if_idle;
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -42,6 +42,9 @@ void scx_bpf_destroy_dsq(u64 dsq_id) __k
> void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
> void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
> void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak;
> +u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
> +u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
> +void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
> u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
> const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
> const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -69,6 +69,18 @@ struct {
> };
>
> /*
> + * If enabled, CPU performance target is set according to the queue index
> + * according to the following table.
> + */
> +static const u32 qidx_to_cpuperf_target[] = {
> + [0] = SCX_CPUPERF_ONE * 0 / 4,
> + [1] = SCX_CPUPERF_ONE * 1 / 4,
> + [2] = SCX_CPUPERF_ONE * 2 / 4,
> + [3] = SCX_CPUPERF_ONE * 3 / 4,
> + [4] = SCX_CPUPERF_ONE * 4 / 4,
> +};
> +
> +/*
> * Per-queue sequence numbers to implement core-sched ordering.
> *
> * Tail seq is assigned to each queued task and incremented. Head seq tracks the
> @@ -95,6 +107,8 @@ struct {
> struct cpu_ctx {
> u64 dsp_idx; /* dispatch index */
> u64 dsp_cnt; /* remaining count */
> + u32 avg_weight;
> + u32 cpuperf_target;
> };
>
> struct {
> @@ -107,6 +121,8 @@ struct {
> /* Statistics */
> u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
> u64 nr_core_sched_execed;
> +u32 cpuperf_min, cpuperf_avg, cpuperf_max;
> +u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
>
> s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
> s32 prev_cpu, u64 wake_flags)
> @@ -313,6 +329,29 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 c
> }
> }
>
> +void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
> +{
> + struct cpu_ctx *cpuc;
> + u32 zero = 0;
> + int idx;
> +
> + if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
> + scx_bpf_error("failed to look up cpu_ctx");
> + return;
> + }
> +
> + /*
> + * Use the running avg of weights to select the target cpuperf level.
> + * This is a demonstration of the cpuperf feature rather than a
> + * practical strategy to regulate CPU frequency.
> + */
> + cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
> + idx = weight_to_idx(cpuc->avg_weight);
> + cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
> +
> + scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
> +}
> +
> /*
> * The distance from the head of the queue scaled by the weight of the queue.
> * The lower the number, the older the task and the higher the priority.
> @@ -422,8 +461,9 @@ void BPF_STRUCT_OPS(qmap_dump_cpu, struc
> if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
> return;
>
> - scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu",
> - cpuc->dsp_idx, cpuc->dsp_cnt);
> + scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
> + cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
> + cpuc->cpuperf_target);
> }
>
> void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
> @@ -492,11 +532,106 @@ void BPF_STRUCT_OPS(qmap_cpu_offline, s3
> print_cpus();
> }
>
> +struct monitor_timer {
> + struct bpf_timer timer;
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY);
> + __uint(max_entries, 1);
> + __type(key, u32);
> + __type(value, struct monitor_timer);
> +} monitor_timer SEC(".maps");
> +
> +/*
> + * Print out the min, avg and max performance levels of CPUs every second to
> + * demonstrate the cpuperf interface.
> + */
> +static void monitor_cpuperf(void)
> +{
> + u32 zero = 0, nr_cpu_ids;
> + u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
> + u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
> + const struct cpumask *online;
> + int i, nr_online_cpus = 0;
> +
> + nr_cpu_ids = scx_bpf_nr_cpu_ids();
> + online = scx_bpf_get_online_cpumask();
> +
> + bpf_for(i, 0, nr_cpu_ids) {
> + struct cpu_ctx *cpuc;
> + u32 cap, cur;
> +
> + if (!bpf_cpumask_test_cpu(i, online))
> + continue;
> + nr_online_cpus++;
> +
> + /* collect the capacity and current cpuperf */
> + cap = scx_bpf_cpuperf_cap(i);
> + cur = scx_bpf_cpuperf_cur(i);
> +
> + cur_min = cur < cur_min ? cur : cur_min;
> + cur_max = cur > cur_max ? cur : cur_max;
> +
> + /*
> + * $cur is relative to $cap. Scale it down accordingly so that
> + * it's in the same scale as other CPUs and $cur_sum/$cap_sum
> + * makes sense.
> + */
> + cur_sum += cur * cap / SCX_CPUPERF_ONE;
> + cap_sum += cap;
> +
> + if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
> + scx_bpf_error("failed to look up cpu_ctx");
> + goto out;
> + }
> +
> + /* collect target */
> + cur = cpuc->cpuperf_target;
> + target_sum += cur;
> + target_min = cur < target_min ? cur : target_min;
> + target_max = cur > target_max ? cur : target_max;
> + }
> +
> + cpuperf_min = cur_min;
> + cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
> + cpuperf_max = cur_max;
> +
> + cpuperf_target_min = target_min;
> + cpuperf_target_avg = target_sum / nr_online_cpus;
> + cpuperf_target_max = target_max;
> +out:
> + scx_bpf_put_cpumask(online);
> +}
> +
> +static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
> +{
> + monitor_cpuperf();
> +
> + bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
> + return 0;
> +}
> +
> s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
> {
> + u32 key = 0;
> + struct bpf_timer *timer;
> + s32 ret;
> +
> print_cpus();
>
> - return scx_bpf_create_dsq(SHARED_DSQ, -1);
> + ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
> + if (ret)
> + return ret;
> +
> + timer = bpf_map_lookup_elem(&monitor_timer, &key);
> + if (!timer)
> + return -ESRCH;
> +
> + bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
> + bpf_timer_set_callback(timer, monitor_timerfn);
> +
> + return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
> }
>
> void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
> @@ -509,6 +644,7 @@ SCX_OPS_DEFINE(qmap_ops,
> .enqueue = (void *)qmap_enqueue,
> .dequeue = (void *)qmap_dequeue,
> .dispatch = (void *)qmap_dispatch,
> + .tick = (void *)qmap_tick,
> .core_sched_before = (void *)qmap_core_sched_before,
> .cpu_release = (void *)qmap_cpu_release,
> .init_task = (void *)qmap_init_task,
> --- a/tools/sched_ext/scx_qmap.c
> +++ b/tools/sched_ext/scx_qmap.c
> @@ -116,6 +116,14 @@ int main(int argc, char **argv)
> nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
> skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
> skel->bss->nr_core_sched_execed);
> + if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
> + printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
> + skel->bss->cpuperf_min,
> + skel->bss->cpuperf_avg,
> + skel->bss->cpuperf_max,
> + skel->bss->cpuperf_target_min,
> + skel->bss->cpuperf_target_avg,
> + skel->bss->cpuperf_target_max);
> fflush(stdout);
> sleep(1);
> }
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-05 12:41 ` Vincent Guittot
@ 2024-07-05 18:22 ` Tejun Heo
2024-07-06 9:01 ` Vincent Guittot
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2024-07-05 18:22 UTC (permalink / raw)
To: Vincent Guittot
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello, Vincent.
On Fri, Jul 05, 2024 at 02:41:41PM +0200, Vincent Guittot wrote:
> > static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > {
> > - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > + unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
> >
> > + if (!scx_switched_all())
> > + util += cpu_util_cfs_boost(sg_cpu->cpu);
>
> I don't see the need for this. If fair is not used, this returns zero
There's scx_enabled() and scx_switched_all(). The former is set when some
tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
some tasks may be on sched_ext but other tasks may be on fair, the condition
is scx_enabled() && !scx_switched_all(). So, the above if statement
condition is true for all cases that tasks may be on CFS (sched_ext is
disabled or is enabled in partial mode).
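To spell it out:

/*
 * !scx_enabled()                       : no BPF scheduler loaded, all tasks on fair/rt
 * scx_enabled() && !scx_switched_all() : partial mode, some tasks may still be on fair
 * scx_switched_all()                   : every task is on sched_ext
 *
 * so checking !scx_switched_all() covers every case in which the CFS
 * utilization can be meaningful.
 */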
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-05 18:22 ` Tejun Heo
@ 2024-07-06 9:01 ` Vincent Guittot
2024-07-07 1:44 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2024-07-06 9:01 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Fri, 5 Jul 2024 at 20:22, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Fri, Jul 05, 2024 at 02:41:41PM +0200, Vincent Guittot wrote:
> > > static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> > > {
> > > - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > > + unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
> > >
> > > + if (!scx_switched_all())
> > > + util += cpu_util_cfs_boost(sg_cpu->cpu);
> >
> > I don't see the need for this. If fair is not used, this returns zero
>
> There's scx_enabled() and scx_switched_all(). The former is set when some
> tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
> some tasks may be on sched_ext but other tasks may be on fair, the condition
> is scx_enabled() && !scx_switched_all(). So, the above if statement
> condition is true for all cases that tasks may be on CFS (sched_ext is
> disabled or is enabled in partial mode).
My point is that if there is no fair task, cpu_util_cfs_boost() will
already return 0 so there is no need to add a sched_ext if statement
there
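For reference, cpu_util_cfs_boost() is just a thin wrapper around the cfs_rq
utilization, roughly:

static inline unsigned long cpu_util_cfs_boost(int cpu)
{
	/* cfs_rq utilization with the runnable boost applied; expected to
	 * decay to zero once no fair tasks remain on the CPU */
	return cpu_util(cpu, NULL, -1, 1);
}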
Vincent
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-06 9:01 ` Vincent Guittot
@ 2024-07-07 1:44 ` Tejun Heo
2024-07-08 6:37 ` Vincent Guittot
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2024-07-07 1:44 UTC (permalink / raw)
To: Vincent Guittot
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello,
On Sat, Jul 06, 2024 at 11:01:20AM +0200, Vincent Guittot wrote:
> > There's scx_enabled() and scx_switched_all(). The former is set when some
> > tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
> > some tasks may be on sched_ext but other tasks may be on fair, the condition
> > is scx_enabled() && !scx_switched_all(). So, the above if statement
> > condition is true for all cases that tasks may be on CFS (sched_ext is
> > disabled or is enabled in partial mode).
>
> My point is that if there is no fair task, cpu_util_cfs_boost() will
> already return 0 so there is no need to add a sched_ext if statement
> there
I see, but scx_switched_all() is a static key while cpu_util_cfs_boost()
isn't necessarily trivial. I can remove the conditional but wouldn't it make
more sense to keep it?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-07 1:44 ` Tejun Heo
@ 2024-07-08 6:37 ` Vincent Guittot
2024-07-08 18:20 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2024-07-08 6:37 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Sun, 7 Jul 2024 at 03:44, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Sat, Jul 06, 2024 at 11:01:20AM +0200, Vincent Guittot wrote:
> > > There's scx_enabled() and scx_switched_all(). The former is set when some
> > > tasks may be on sched_ext. The latter when all tasks are on sched_ext. When
> > > some tasks may be on sched_ext but other tasks may be on fair, the condition
> > > is scx_enabled() && !scx_switched_all(). So, the above if statement
> > > condition is true for all cases that tasks may be on CFS (sched_ext is
> > > disabled or is enabled in partial mode).
> >
> > My point is that if there is no fair task, cpu_util_cfs_boost() will
> > already return 0 so there is no need to add a sched_ext if statement
> > there
>
> I see, but scx_switched_all() is a static key while cpu_util_cfs_boost()
> isn't necessarily trivial. I can remove the conditional but wouldn't it make
> more sense to keep it?
I prefer to minimize (if not remove) sched_ext-related calls in the
fair path so we can easily rework it if needed. And this will also
ensure that all fair tasks are cleanly removed when they are all
switched to sched_ext.
Thanks
Vincent
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-08 6:37 ` Vincent Guittot
@ 2024-07-08 18:20 ` Tejun Heo
2024-07-08 19:51 ` Vincent Guittot
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2024-07-08 18:20 UTC (permalink / raw)
To: Vincent Guittot
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello, Vincent.
On Mon, Jul 08, 2024 at 08:37:06AM +0200, Vincent Guittot wrote:
> I prefer to minimize (if not remove) sched_ext-related calls in the
> fair path so we can easily rework it if needed. And this will also
> ensure that all fair tasks are cleanly removed when they are all
> switched to sched_ext.
Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
result will most likely be cpufreq sometimes picking a higher freq than
requested, which won't be the easiest to notice. Would you be against adding
WARN_ON_ONCE(scx_switched_all && !util) too?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-08 18:20 ` Tejun Heo
@ 2024-07-08 19:51 ` Vincent Guittot
2024-07-08 21:08 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2024-07-08 19:51 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Mon, 8 Jul 2024 at 20:20, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Mon, Jul 08, 2024 at 08:37:06AM +0200, Vincent Guittot wrote:
> > I prefer to minimize (if not remove) sched_ext-related calls in the
> > fair path so we can easily rework it if needed. And this will also
> > ensure that all fair tasks are cleanly removed when they are all
> > switched to sched_ext.
>
> Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> result will most likely be cpufreq sometimes picking a higher freq than
> requested, which won't be the easiest to notice. Would you be against adding
> WARN_ON_ONCE(scx_switched_all && !util) too?
A WARN_ON_ONCE to detect misbehavior would be ok
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-08 19:51 ` Vincent Guittot
@ 2024-07-08 21:08 ` Tejun Heo
2024-07-09 13:36 ` Vincent Guittot
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2024-07-08 21:08 UTC (permalink / raw)
To: Vincent Guittot
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello, Vincent.
On Mon, Jul 08, 2024 at 09:51:08PM +0200, Vincent Guittot wrote:
> > Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> > result will most likely be cpufreq sometimes picking a higher freq than
> > requested, which won't be the easiest to notice. Would you be against adding
> > WARN_ON_ONCE(scx_switched_all && !util) too?
>
> A WARN_ON_ONCE to detect misbehavior would be ok
I tried this and it's a bit problematic. Migrating out all the tasks does
bring the numbers pretty close to zero, but the math doesn't work out exactly
and it often leaves 1 in the averages. While the fair class is in use, they
would decay quickly through __update_blocked_fair(); however, when all tasks
are switched to sched_ext, that function doesn't get called and the
remaining small value never decays.
Now, the value being really low, it doesn't really matter but it's an
unnecessary complication. I can make sched_ext keep calling
__update_blocked_fair() in addition to update_other_load_avgs() to decay
fair's averages but that seems a lot more complicated than having one
scx_switched_all() test.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-08 21:08 ` Tejun Heo
@ 2024-07-09 13:36 ` Vincent Guittot
2024-07-09 16:43 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2024-07-09 13:36 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Mon, 8 Jul 2024 at 23:09, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Vincent.
>
> On Mon, Jul 08, 2024 at 09:51:08PM +0200, Vincent Guittot wrote:
> > > Unless we add a WARN_ON_ONCE, if it doesn't behave as expected, the end
> > > result will most likely be cpufreq sometimes picking a higher freq than
> > > requested, which won't be the easiest to notice. Would you be against adding
> > > WARN_ON_ONCE(scx_switched_all && !util) too?
> >
> > A WARN_ON_ONCE to detect misbehavior would be ok
>
> I tried this and it's a bit problematic. Migrating out all the tasks does
> bring the numbers pretty close to zero, but the math doesn't work out exactly
> and it often leaves 1 in the averages. While the fair class is in use, they
Hmm, interesting. Such a remaining small value could be expected for
load_avg but not for util_avg, which is normally a direct propagation.
Do you have a particular sequence in mind?
> would decay quickly through __update_blocked_fair(); however, when all tasks
> are switched to sched_ext, that function doesn't get called and the
> remaining small value never decays.
>
> Now, the value being really low, it doesn't really matter but it's an
> unnecessary complication. I can make sched_ext keep calling
> __update_blocked_fair() in addition to update_other_load_avgs() to decay
> fair's averages but that seems a lot more complicated than having one
> scx_switched_all() test.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-09 13:36 ` Vincent Guittot
@ 2024-07-09 16:43 ` Tejun Heo
2024-07-12 10:12 ` Vincent Guittot
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2024-07-09 16:43 UTC (permalink / raw)
To: Vincent Guittot
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello,
On Tue, Jul 09, 2024 at 03:36:34PM +0200, Vincent Guittot wrote:
> > I tried this and it's a bit problematic. Migrating out all the tasks does
> > bring the numbers pretty close to zero, but the math doesn't work out exactly
> > and it often leaves 1 in the averages. While the fair class is in use, they
>
> Hmm, interesting. Such a remaining small value could be expected for
> load_avg but not for util_avg, which is normally a direct propagation.
> Do you have a particular sequence in mind?
Oh, I thought it was a byproduct of the decay calculations not exactly matching
up between the sum and the components, but I haven't really checked. It's
really easy to reproduce. Just boot a kernel with sched_ext enabled (with
some instrumentation added to monitor the util calculation), run some
stress workload to be sure, and then run a sched_ext scheduler (make -C
tools/sched_ext && tools/sched_ext/build/bin/scx_simple).
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-09 16:43 ` Tejun Heo
@ 2024-07-12 10:12 ` Vincent Guittot
2024-07-12 17:10 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Vincent Guittot @ 2024-07-12 10:12 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
On Tue, 9 Jul 2024 at 18:43, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jul 09, 2024 at 03:36:34PM +0200, Vincent Guittot wrote:
> > > I tried this and it's a bit problematic. Migrating out all the tasks do
> > > bring the numbers pretty close to zero but the math doesn't work out exactly
> > > and it often leaves 1 in the averages. While the fair class is in use, they
> >
> > hmm interesting, such remaining small value could be expected for
> > load_avg but not with util_avg which is normally a direct propagation.
> > Do you have a sequence in particular ?
>
> Oh, I thought it was a byproduct of the decay calculations not exactly matching
> up between the sum and the components, but I haven't really checked. It's
> really easy to reproduce. Just boot a kernel with sched_ext enabled (with
> some instrumentation added to monitor the util calculation), run some
> stress workload to be sure, and then run a sched_ext scheduler (make -C
> tools/sched_ext && tools/sched_ext/build/bin/scx_simple).
I failed to set up my dev system to reproduce your use case in time,
and I'm going to be away for the coming weeks, so I suppose you should
move forward and I will look at this when I'm back at my dev system.
It seems that "make -C tools/sched_ext ARCH=arm64 LLVM=-16" doesn't
use clang-16 everywhere like the rest of the kernel build, which triggers
errors on my system:
make -C <path-to-linux>/linux/tools/sched_ext ARCH=arm64
LOCALVERSION=+ LLVM=-16
O=<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext
...
clang-16 -g -O0 -fPIC -std=gnu89 -Wbad-function-cast
-Wdeclaration-after-statement -Wformat-security -Wformat-y2k
-Winit-self -Wmissing-declarations -Wmissing-prototypes
-Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked
-Wredundant-decls -Wstrict-prototypes -Wswitch-default -Wswitch-enum
-Wundef -Wwrite-strings -Wformat -Wno-type-limits -Wshadow
-Wno-switch-enum -Werror -Wall
-I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/
-I<path-to-linux>/linux/tools/include
-I<path-to-linux>/linux/tools/include/uapi -fvisibility=hidden
-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 \
--shared -Wl,-soname,libbpf.so.1 \
-Wl,--version-script=libbpf.map
<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/sharedobjs/libbpf-in.o
-lelf -lz -o <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/libbpf.so.1.5.0
...
clang -g -D__TARGET_ARCH_arm64 -mlittle-endian
-I<path-to-linux>/linux/tools/sched_ext/include
-I<path-to-linux>/linux/tools/sched_ext/include/bpf-compat
-I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/include
-I<path-to-linux>/linux/tools/include/uapi -I../../include -idirafter
/usr/lib/llvm-14/lib/clang/14.0.0/include -idirafter
/usr/local/include -idirafter /usr/include/x86_64-linux-gnu -idirafter
/usr/include -Wall -Wno-compare-distinct-pointer-types -O2 -mcpu=v3
-target bpf -c scx_simple.bpf.c -o
<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/sched_ext/scx_simple.bpf.o
In file included from scx_simple.bpf.c:23:
<path-to-linux>/linux/tools/sched_ext/include/scx/common.bpf.h:27:17:
error: use of undeclared identifier 'SCX_DSQ_FLAG_BUILTIN'
_Static_assert(SCX_DSQ_FLAG_BUILTIN,
^
...
fatal error: too many errors emitted, stopping now [-ferror-limit=]
5 warnings and 20 errors generated.
Vincent
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 2/2] sched_ext: Add cpuperf support
2024-07-12 10:12 ` Vincent Guittot
@ 2024-07-12 17:10 ` Tejun Heo
0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2024-07-12 17:10 UTC (permalink / raw)
To: Vincent Guittot
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki
Hello,
On Fri, Jul 12, 2024 at 12:12:32PM +0200, Vincent Guittot wrote:
...
> I failed to set up my dev system to reproduce your use case in time,
> and I'm going to be away for the coming weeks, so I suppose you should
> move forward and I will look at this when I'm back at my dev system.
Thankfully, this should be pretty easy to fix up however we want afterwards.
> It seems that "make -C tools/sched_ext ARCH=arm64 LLVM=-16" doesn't
> use clang-16 everywhere like the rest of the kernel build, which triggers
> errors on my system:
Hmm... there is llvm prefix/suffix handling in the Makefile. I wonder what's
broken.
> make -C <path-to-linux>/linux/tools/sched_ext ARCH=arm64
> LOCALVERSION=+ LLVM=-16
> O=<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext
> ...
> clang-16 -g -O0 -fPIC -std=gnu89 -Wbad-function-cast
> -Wdeclaration-after-statement -Wformat-security -Wformat-y2k
> -Winit-self -Wmissing-declarations -Wmissing-prototypes
> -Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked
> -Wredundant-decls -Wstrict-prototypes -Wswitch-default -Wswitch-enum
> -Wundef -Wwrite-strings -Wformat -Wno-type-limits -Wshadow
> -Wno-switch-enum -Werror -Wall
> -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/
> -I<path-to-linux>/linux/tools/include
> -I<path-to-linux>/linux/tools/include/uapi -fvisibility=hidden
> -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 \
> --shared -Wl,-soname,libbpf.so.1 \
> -Wl,--version-script=libbpf.map
> <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/sharedobjs/libbpf-in.o
> -lelf -lz -o <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/libbpf/libbpf.so.1.5.0
So, this is the regular arm target building.
> clang -g -D__TARGET_ARCH_arm64 -mlittle-endian
> -I<path-to-linux>/linux/tools/sched_ext/include
> -I<path-to-linux>/linux/tools/sched_ext/include/bpf-compat
> -I<path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/include
> -I<path-to-linux>/linux/tools/include/uapi -I../../include -idirafter
> /usr/lib/llvm-14/lib/clang/14.0.0/include -idirafter
> /usr/local/include -idirafter /usr/include/x86_64-linux-gnu -idirafter
> /usr/include -Wall -Wno-compare-distinct-pointer-types -O2 -mcpu=v3
> -target bpf -c scx_simple.bpf.c -o
> <path-to-linux>/out/kernel/arm64-llvm/tools/sched_ext/build/obj/sched_ext/scx_simple.bpf.o
> In file included from scx_simple.bpf.c:23:
> <path-to-linux>/linux/tools/sched_ext/include/scx/common.bpf.h:27:17:
> error: use of undeclared identifier 'SCX_DSQ_FLAG_BUILTIN'
> _Static_assert(SCX_DSQ_FLAG_BUILTIN,
> ^
> fatal error: too many errors emitted, stopping now [-ferror-limit=]
> 5 warnings and 20 errors generated.
This is BPF.
The Makefile is mostly copied from other existing BPF Makefiles under tools,
so I don't quite understand why things are set up this way but
CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
is what's used to build regular targets, while
$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
is what's used to build BPF targets. It's not too out there to use a
different compiler for BPF targets, so maybe that's why? I'll ask the BPF folks.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-06-19 3:12 ` [PATCH 2/2] sched_ext: Add cpuperf support Tejun Heo
` (2 preceding siblings ...)
2024-07-02 10:23 ` [PATCH " Hongyan Xia
@ 2024-07-24 23:45 ` Qais Yousef
2024-07-31 1:05 ` Tejun Heo
3 siblings, 1 reply; 36+ messages in thread
From: Qais Yousef @ 2024-07-24 23:45 UTC (permalink / raw)
To: Tejun Heo
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki, Linus Torvalds
On 06/18/24 17:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
Have you tried to investigate why that is? By default, RT tasks run at max frequency.
The only way to prevent them from doing that is by using uclamp:
https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default
If that's not the cause, then it's likely something else is broken.
>
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
>
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
> different CPUs in the system.
>
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
> relative to its max performance.
>
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
What is exactly the problem you're seeing? You shouldn't need to set
performance directly. Are you trying to fix a problem, or add a new feature?
>
> This gives direct control over CPU performance setting to the BPF scheduler.
Why would we need to do that? schedutil is supposed to operate on the utilization
signal. Overriding it with custom unknown changes makes it a random governor whose
behavior depends on which BPF sched_ext scheduler is currently loaded. This makes
bug reports and debugging problems a lot harder.
I do hope, by the way, that loading an external scheduler causes the kernel to
be tainted. With these random changes, it's hard to know whether a problem is in
the kernel or in an external out-of-tree entity. Out-of-tree modules taint the
kernel, and so should loading sched_ext.
It should not cause spurious reports, nor prevent us from changing the code
without worrying about breaking out of tree code.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
>
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
>
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
> kernel/sched/cpufreq_schedutil.c | 12 +-
> kernel/sched/ext.c | 83 ++++++++++++-
> kernel/sched/ext.h | 9 ++
> kernel/sched/sched.h | 1 +
> tools/sched_ext/include/scx/common.bpf.h | 3 +
> tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++-
> tools/sched_ext/scx_qmap.c | 8 ++
> 7 files changed, 252 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + unsigned long min, max;
> + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> + scx_cpuperf_target(sg_cpu->cpu);
>
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> unsigned long idle_calls;
> bool ret;
>
> + /*
> + * The heuristics in this function is for the fair class. For SCX, the
> + * performance target comes directly from the BPF scheduler. Let's just
> + * follow it.
> + */
> + if (scx_switched_all())
> + return false;
Why do you need to totally override it? What problems did you find with the current
util value, and what have you done to try to fix it first rather than override it
completely?
> +
> /* if capped by uclamp_max, always update to be in compliance */
> if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
> return false;
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f814e84ceeb3..04fb0eeee5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
> SCX_EXIT_BT_LEN = 64,
> SCX_EXIT_MSG_LEN = 1024,
> SCX_EXIT_DUMP_DFL_LEN = 32768,
> +
> + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
> };
>
> enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
> .update_curr = update_curr_scx,
>
> #ifdef CONFIG_UCLAMP_TASK
> - .uclamp_enabled = 0,
> + .uclamp_enabled = 1,
> #endif
> };
>
> @@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
> struct scx_task_iter sti;
> struct task_struct *p;
> unsigned long timeout;
> - int i, ret;
> + int i, cpu, ret;
>
> mutex_lock(&scx_ops_enable_mutex);
>
> @@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>
> atomic_long_set(&scx_nr_rejected, 0);
>
> + for_each_possible_cpu(cpu)
> + cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
> +
> /*
> * Keep CPUs stable during enable so that the BPF scheduler can track
> * online CPUs by watching ->on/offline_cpu() after ->init().
> @@ -5835,6 +5840,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
> ops_dump_flush();
> }
>
> +/**
> + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the maximum relative capacity of @cpu in relation to the most
> + * performant CPU in the system. The return value is in the range [1,
> + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
> +{
> + if (ops_cpu_valid(cpu, NULL))
> + return arch_scale_cpu_capacity(cpu);
> + else
> + return SCX_CPUPERF_ONE;
> +}
Hmm. This is tricky. It looks fine, but I worry about changing how we want to
handle capacities in the future and then being tied down forever with out of
tree sched_ext not being able to load.
How are we going to protect against such potential changes? Just make it a NOP?
A bit hypothetical but so far these are considered internal scheduler details
that could change anytime with no consequence. With external code attaching to
this info, changing them will become a lot harder, as there are external
dependencies that will fail to load or work properly. And what is the regression rule in this
case?
You should make all functions return an error to future proof them against
suddenly disappearing.
> +
> +/**
> + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the current relative performance of @cpu in relation to its maximum.
> + * The return value is in the range [1, %SCX_CPUPERF_ONE].
> + *
> + * The current performance level of a CPU in relation to the maximum performance
> + * available in the system can be calculated as follows:
> + *
> + * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
> + *
> + * The result is in the range [1, %SCX_CPUPERF_ONE].
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
> +{
> + if (ops_cpu_valid(cpu, NULL))
> + return arch_scale_freq_capacity(cpu);
> + else
> + return SCX_CPUPERF_ONE;
> +}
> +
> +/**
> + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
> + * @cpu: CPU of interest
> + * @perf: target performance level [0, %SCX_CPUPERF_ONE]
> + * @flags: %SCX_CPUPERF_* flags
> + *
> + * Set the target performance level of @cpu to @perf. @perf is in linear
> + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
> + * schedutil cpufreq governor chooses the target frequency.
> + *
> + * The actual performance level chosen, CPU grouping, and the overhead and
> + * latency of the operations are dependent on the hardware and cpufreq driver in
> + * use. Consult hardware and cpufreq documentation for more information. The
> + * current performance level can be monitored using scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
> +{
> + if (unlikely(perf > SCX_CPUPERF_ONE)) {
> + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
> + return;
> + }
> +
> + if (ops_cpu_valid(cpu, NULL)) {
> + struct rq *rq = cpu_rq(cpu);
> +
> + rq->scx.cpuperf_target = perf;
> +
> + rcu_read_lock_sched_notrace();
> + cpufreq_update_util(cpu_rq(cpu), 0);
> + rcu_read_unlock_sched_notrace();
> + }
> +}
Is the problem that you break how the util signal works in sched_ext? Or do you
want the fine-grained control? We expect user applications to use uclamp to set
their perf requirements. And sched_ext should not break the util signal, no? If it
does and there's a good reason for it, then it is not compatible with schedutil,
which, as the name indicates, operates on the util signal as defined by PELT.
You can always use min_freq/max_freq in sysfs to force min and max frequencies
without hacking the governor. I don't advise it though, and I'd recommend trying
to be compatible with schedutil as-is rather than modifying it. Consistency is
key.
Thanks
--
Qais Yousef
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/2] sched_ext: Add cpuperf support
2024-07-24 23:45 ` Qais Yousef
@ 2024-07-31 1:05 ` Tejun Heo
0 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2024-07-31 1:05 UTC (permalink / raw)
To: Qais Yousef
Cc: rafael, viresh.kumar, linux-pm, void, linux-kernel, kernel-team,
mingo, peterz, David Vernet, Rafael J . Wysocki, Linus Torvalds
Hello, Qais.
On Thu, Jul 25, 2024 at 12:45:27AM +0100, Qais Yousef wrote:
> On 06/18/24 17:12, Tejun Heo wrote:
> > sched_ext currently does not integrate with schedutil. When schedutil is the
> > governor, frequencies are left unregulated and usually get stuck close to
> > the highest performance level from running RT tasks.
>
> Have you tried to investigate why that is? By default, RT tasks run at max frequency.
> The only way to prevent them from doing that is by using uclamp:
>
> https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default
>
> If that's not the cause, then it's likely something else is broken.
Nothing is necessarily broken. SCX just wasn't providing any signal to
schedutil before this patch, so schedutil ends up just going by the
occasional signals from RT tasks.
...
> What is exactly the problem you're seeing? You shouldn't need to set
> performance directly. Are you trying to fix a problem, or add a new feature?
I'm having a hard time following the question. When active, the BPF
scheduler is the only one that can tell how much each CPU is being used, and
the straightforward way to integrate with schedutil is to indicate what the
current CPU utilization level is, which is the signal that schedutil takes
from each scheduler class.
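For example, a minimal sketch of that kind of integration (illustrative only,
not the scx_qmap toy from the patch; the ops.tick() usage and the
estimate_cpu_util() helper below are made-up placeholders):

#include <scx/common.bpf.h>

/* hypothetical estimate of how busy @cpu has recently been, 0..SCX_CPUPERF_ONE */
static u32 estimate_cpu_util(s32 cpu)
{
        /* a real scheduler would derive this from its own bookkeeping */
        return SCX_CPUPERF_ONE / 2;
}

void BPF_STRUCT_OPS(sketch_tick, struct task_struct *p)
{
        s32 cpu = bpf_get_smp_processor_id();
        u32 util = estimate_cpu_util(cpu);

        /* clamp and hand the per-CPU performance target to schedutil */
        if (util > SCX_CPUPERF_ONE)
                util = SCX_CPUPERF_ONE;
        scx_bpf_cpuperf_set(cpu, util);
}

The point is just that the estimate originates in the BPF scheduler; schedutil
then folds it in the same way it folds in the utilization reported by the
other scheduler classes.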
> > This gives direct control over CPU performance setting to the BPF scheduler.
>
> Why would we need to do that? schedutil is supposed to operate on the utilization
Because nobody else knows? No one *can* know. Schedutil is driven by the
utilization signal from the scheduler. Here, the scheduler is implemented in
BPF. Nothing else knows how much workload each CPU is going to get. Note
that the kernel side does not have any meaningful information re. task <->
CPU relationship. That can change on every dispatch. Only the BPF scheduler
itself knows how the CPUs are going to be used.
> signal. Overriding it with custom unknown changes makes it a random governor whose
> behavior depends on which BPF sched_ext scheduler is currently loaded. This makes
> bug reports and debugging problems a lot harder.
If schedutil is misbehaving while an SCX scheduler is loaded, it's the BPF
scheduler's fault. There isn't much else to it.
> I do hope, by the way, that loading an external scheduler causes the kernel to
> be tainted. With these random changes, it's hard to know whether a problem is in
> the kernel or in an external out-of-tree entity. Out-of-tree modules taint the
> kernel, and so should loading sched_ext.
Out of tree modules can cause corruptions which can have lasting impacts on
the system even after the module is unloaded. That's why we keep the
persistent taint flags - to remember that the current misbehavior has a
reasonable chance of being impacted by what happened to the system
beforehand. Kernel bugs aside, SCX schedulers shouldn't leave persistent
impacts on the system. In those cases, we usually mark the dumps which SCX
already does.
> It should not cause spurious reports, nor prevent us from changing the code
> without worrying about breaking out of tree code.
Oh yeah, we can and will make compatibility breaking changes. Not
willy-nilly, hopefully.
...
> > + /*
> > + * The heuristics in this function is for the fair class. For SCX, the
> > + * performance target comes directly from the BPF scheduler. Let's just
> > + * follow it.
> > + */
> > + if (scx_switched_all())
> > + return false;
>
> Why do you need to totally override it? What problems did you find with the current
> util value, and what have you done to try to fix it first rather than override it
> completely?
Because it's way cleaner this way. Otherwise, we need to keep calling into
update_blocked_averages() so that the fair class's util metrics can decay
(they often don't decay completely due to math inaccuracies, but Vincent was
looking at that). When switched_all(), the fair class is not used at all, and
continuing to call it just to decay the utilization metrics to zero seems kinda silly.
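As a toy illustration of the suspected rounding issue (this is not the actual
PELT arithmetic, just a standalone sketch of how a sum and its independently
decayed components can end up disagreeing by 1):

#include <stdio.h>

int main(void)
{
        /* two tasks each contributing 3 units of util, plus the rq-wide sum */
        unsigned int task_a = 3, task_b = 3;
        unsigned int rq_sum = task_a + task_b;          /* 6 */

        /* one decay period; pretend decay halves values with integer division */
        task_a /= 2;                                    /* 3 -> 1 */
        task_b /= 2;                                    /* 3 -> 1 */
        rq_sum /= 2;                                    /* 6 -> 3 */

        /* both tasks migrate away: remove their decayed contributions */
        rq_sum -= task_a + task_b;                      /* 3 - 2 = 1 */

        printf("residual rq_sum = %u\n", rq_sum);       /* not 0 */
        return 0;
}

While the fair class keeps running, __update_blocked_fair() eventually decays
such a leftover away; once everything is switched to SCX, nothing touches it.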
...
> > +/**
> > + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> > + * @cpu: CPU of interest
> > + *
> > + * Return the maximum relative capacity of @cpu in relation to the most
> > + * performant CPU in the system. The return value is in the range [1,
> > + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
> > + */
> > +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
> > +{
> > + if (ops_cpu_valid(cpu, NULL))
> > + return arch_scale_cpu_capacity(cpu);
> > + else
> > + return SCX_CPUPERF_ONE;
> > +}
>
> Hmm. This is tricky. It looks fine, but I worry about changing how we want to
> handle capacities in the future and then being tied down forever with out of
> tree sched_ext not being able to load.
>
> How are we going to protect against such potential changes? Just make it a NOP?
So, if it's a necessary change, we just break the API and update the
out-of-tree schedulers accordingly. With BPF's CO-RE and supporting
features, we can handle quite a bit of backward compatibility - ie. it's
cumbersome but usually possible to provide BPF-side helpers that allow
updated BPF schedulers to be loaded in both old and new kernels, so it
usually isn't *that* painful.
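For example, the usual shape of such a compat helper (a sketch only, not code
from this series; it assumes a libbpf recent enough to provide
bpf_ksym_exists()):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* declare the (possibly missing) kfunc weak so the program still loads */
void scx_bpf_cpuperf_set(u32 cpu, u32 perf) __ksym __weak;

static inline void cpuperf_set_compat(u32 cpu, u32 perf)
{
        /* bpf_ksym_exists() is false when the running kernel lacks the kfunc */
        if (bpf_ksym_exists(scx_bpf_cpuperf_set))
                scx_bpf_cpuperf_set(cpu, perf);
        /* on an older kernel, silently skip the frequency hint */
}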
> A bit hypothetical but so far these are considered internal scheduler details
> that could change anytime with no consequence. With external code attaching to
> this info, changing them will become a lot harder, as there are external
> dependencies that will fail to load or work properly. And what is the regression rule in this
> case?
This is not different from any other BPF hooks and has been discussed many
times. As I wrote before, we don't want to change for no reason but we
definitely can change things if necessary.
> You should make all functions return an error to future proof them against
> suddenly disappearing.
Really, no need to be that draconian.
...
> > +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
> > +{
> > + if (unlikely(perf > SCX_CPUPERF_ONE)) {
> > + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
> > + return;
> > + }
> > +
> > + if (ops_cpu_valid(cpu, NULL)) {
> > + struct rq *rq = cpu_rq(cpu);
> > +
> > + rq->scx.cpuperf_target = perf;
> > +
> > + rcu_read_lock_sched_notrace();
> > + cpufreq_update_util(cpu_rq(cpu), 0);
> > + rcu_read_unlock_sched_notrace();
> > + }
> > +}
>
> Is the problem that you break how the util signal works in sched_ext? Or do you
> want the fine-grained control? We expect user applications to use uclamp to set
> their perf requirements. And sched_ext should not break the util signal, no? If it
> does and there's a good reason for it, then it is not compatible with schedutil,
> which, as the name indicates, operates on the util signal as defined by PELT.
>
> You can always use min_freq/max_freq in sysfs to force min and max frequencies
> without hacking the governor. I don't advise it though, and I'd recommend trying
> to be compatible with schedutil as-is rather than modifying it. Consistency is
> key.
It's not hacking the governor and uclamp should work the same way on top.
The BPF scheduler is the only thing that can tell how and how much a given
CPU is going to be used. There is no way for the kernel to project per-CPU
utilization without asking the BPF scheduler.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread
Thread overview: 36+ messages
2024-06-19 3:12 [PATCHSET sched_ext/for-6.11] sched_ext: Integrate with schedutil Tejun Heo
2024-06-19 3:12 ` [PATCH 1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() Tejun Heo
2024-06-19 14:07 ` Christian Loehle
2024-06-19 18:57 ` Tejun Heo
2024-06-19 19:07 ` Tejun Heo
2024-06-20 9:40 ` Christian Loehle
2024-06-19 18:45 ` Rafael J. Wysocki
2024-06-19 19:53 ` Tejun Heo
2024-06-20 17:57 ` Rafael J. Wysocki
2024-06-20 18:08 ` Tejun Heo
2024-06-21 22:38 ` Tejun Heo
2024-06-19 3:12 ` [PATCH 2/2] sched_ext: Add cpuperf support Tejun Heo
2024-06-19 14:07 ` Christian Loehle
2024-06-19 19:19 ` Tejun Heo
2024-06-19 19:51 ` [PATCH v2 " Tejun Heo
2024-06-21 22:39 ` Tejun Heo
2024-07-05 12:41 ` Vincent Guittot
2024-07-05 18:22 ` Tejun Heo
2024-07-06 9:01 ` Vincent Guittot
2024-07-07 1:44 ` Tejun Heo
2024-07-08 6:37 ` Vincent Guittot
2024-07-08 18:20 ` Tejun Heo
2024-07-08 19:51 ` Vincent Guittot
2024-07-08 21:08 ` Tejun Heo
2024-07-09 13:36 ` Vincent Guittot
2024-07-09 16:43 ` Tejun Heo
2024-07-12 10:12 ` Vincent Guittot
2024-07-12 17:10 ` Tejun Heo
2024-07-02 10:23 ` [PATCH " Hongyan Xia
2024-07-02 16:37 ` Tejun Heo
2024-07-02 17:12 ` Hongyan Xia
2024-07-02 17:56 ` Tejun Heo
2024-07-02 20:41 ` Hongyan Xia
2024-07-02 21:12 ` Tejun Heo
2024-07-24 23:45 ` Qais Yousef
2024-07-31 1:05 ` Tejun Heo