public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/11] sched_ext: Implement core event counters
@ 2025-01-26 10:16 Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 01/11] sched_ext: Implement event counter infrastructure Changwoo Min
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

The sched_ext core often has to override the BPF scheduler's decisions,
and such events can be interesting to the scheduler but are not easily visible.

This patchset aims to address such a problem in the following manner:
  - Introduce an infrastructure to collect such events in a scalable and
    extensible way and to expose the collected events to the BPF scheduler
    in a compatible way.
  - Define seven events to be collected.
  - Modify an scx scheduler to demonstrate the usage of the new BPF APIs.

ChangeLog: v1 -> v2
  - Rename scx_event_stat and scx_bpf_event_stat() to scx_event_stats and
    scx_bpf_event_stats().
  - Rename event names following the convention of $COMPONENT_$EVENT.
  - Rename event_stats to event_stats_cpu.
  - Drop the enum scx_event_kind and related macros.
  - Revise scx_add_event() to use this_cpu_add().
  - Add __scx_add_event() to use __this_cpu_add().
  - Move the event counter resetting code to the loading of a BPF scheduler.
  - The bypass-related event is further categorized into three events:
    BYPASS_ACTIVATE, BYPASS_DISPATCH, and BYPASS_DURATION.
  - Revise SELECT_CPU_FALLBACK to capture the case where the chosen CPU is
    not allowed.
  - Move is_cpu_allowed() from core.c to sched.h to use in the
    SELECT_CPU_FALLBACK code.

Changwoo Min (11):
  sched_ext: Implement event counter infrastructure
  sched: Move is_cpu_allowed() to the header
  sched_ext: Add an event, SELECT_CPU_FALLBACK
  sched_ext: Add an event, DISPATCH_LOCAL_DSQ_OFFLINE
  sched_ext: Add an event, DISPATCH_KEEP_LAST
  sched_ext: Add an event, ENQ_SKIP_EXITING
  sched_ext: Add an event, BYPASS_ACTIVATE
  sched_ext: Add an event, BYPASS_DISPATCH
  sched_ext: Add an event, BYPASS_DURATION
  sched_ext: Add scx_bpf_event_stats() and scx_read_event() for BPF
    schedulers
  sched_ext: Print core event count in scx_central scheduler

 kernel/sched/core.c                      |  30 ----
 kernel/sched/ext.c                       | 177 ++++++++++++++++++++++-
 kernel/sched/sched.h                     |  30 ++++
 tools/sched_ext/include/scx/common.bpf.h |   4 +
 tools/sched_ext/scx_central.bpf.c        |  21 +++
 5 files changed, 231 insertions(+), 31 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 01/11] sched_ext: Implement event counter infrastructure
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-27 20:09   ` Tejun Heo
  2025-01-26 10:16 ` [PATCH v2 02/11] sched: Move is_cpu_allowed() to the header Changwoo Min
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Collect the statistics of specific types of behavior in the sched_ext core,
which are not easily visible but still interesting to an scx scheduler.

An event type is defined in 'struct scx_event_stats.' When an event occurs,
its counter is accumulated using 'scx_add_event()' and '__scx_add_event()'
to per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_event_stats()'
aggregates all the per-CPU counters and exposes the system-wide counters.

For convenience and readability of the code, 'scx_agg_event()' and
'scx_dump_event()' are provided.

The collected events can be observed after a BPF scheduler is unloaded;
the per-CPU 'struct scx_event_stats' are reset when a new BPF scheduler
is loaded.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 103 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7081c7be5f62..dbf659d593a3 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1458,6 +1458,64 @@ static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
 	return p;
 }
 
+/*
+ * Collection of event counters. Event types are placed in descending order.
+ */
+struct scx_event_stats {
+};
+
+/*
+ * The event counter is organized by a per-CPU variable to minimize the
+ * accounting overhead without synchronization. A system-wide view on the
+ * event counter is constructed when requested by scx_bpf_event_stats().
+ */
+static DEFINE_PER_CPU(struct scx_event_stats, event_stats_cpu);
+
+/**
+ * scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of times the event occurred
+ *
+ * This can be used when preemption is not disabled.
+ */
+#define scx_add_event(name, cnt) do {						\
+	this_cpu_add(event_stats_cpu.name, cnt);				\
+} while(0)
+
+/**
+ * __scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of times the event occurred
+ *
+ * This should be used only when preemption is disabled.
+ */
+#define __scx_add_event(name, cnt) do {						\
+	__this_cpu_add(event_stats_cpu.name, cnt);				\
+} while(0)
+
+/**
+ * scx_agg_event - Aggregate an event counter 'kind' from 'src_e' to 'dst_e'
+ * @dst_e: destination event stats
+ * @src_e: source event stats
+ * @kind: a kind of event to be aggregated
+ */
+#define scx_agg_event(dst_e, src_e, kind) do {					\
+	(dst_e)->kind += READ_ONCE((src_e)->kind);				\
+} while(0)
+
+/**
+ * scx_dump_event - Dump an event 'kind' in 'events' to 's'
+ * @s: output seq_buf
+ * @events: event stats
+ * @kind: a kind of event to dump
+ */
+#define scx_dump_event(s, events, kind) do {					\
+	dump_line(&(s), "%30s: %16llu", #kind, (events)->kind);			\
+} while (0)
+
+
+static void scx_bpf_event_stats(struct scx_event_stats *events, size_t events__sz);
+
 static enum scx_ops_enable_state scx_ops_enable_state(void)
 {
 	return atomic_read(&scx_ops_enable_state_var);
@@ -5310,6 +5368,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 		.at_jiffies = jiffies,
 	};
 	struct seq_buf s;
+	struct scx_event_stats events;
 	unsigned long flags;
 	char *buf;
 	int cpu;
@@ -5418,6 +5477,12 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 		rq_unlock(rq, &rf);
 	}
 
+	dump_newline(&s);
+	dump_line(&s, "Event counters");
+	dump_line(&s, "--------------");
+
+	scx_bpf_event_stats(&events, sizeof(events));
+
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
 		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
 		       trunc_marker, sizeof(trunc_marker));
@@ -5525,6 +5590,15 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 
 	mutex_lock(&scx_ops_enable_mutex);
 
+	/*
+	 * Clear event counters so a new scx scheduler gets
+	 * fresh event counter values.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct scx_event_stats *e = per_cpu_ptr(&event_stats_cpu, cpu);
+		memset(e, 0, sizeof(*e));
+	}
+
 	if (!scx_ops_helper) {
 		WRITE_ONCE(scx_ops_helper,
 			   scx_create_rt_helper("sched_ext_ops_helper"));
@@ -7721,6 +7795,34 @@ __bpf_kfunc u64 scx_bpf_now(void)
 	return clock;
 }
 
+/**
+ * scx_bpf_event_stats - Copy the system-wide event counters to a BPF program
+ * @events: output buffer from a BPF program
+ * @events__sz: @events len, must end in '__sz' for the verifier
+ */
+__bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
+				     size_t events__sz)
+{
+	struct scx_event_stats e_sys, *e_cpu;
+	int cpu;
+
+	/* Aggregate per-CPU event counters into the system-wide counters. */
+	memset(&e_sys, 0, sizeof(e_sys));
+	for_each_possible_cpu(cpu) {
+		e_cpu = per_cpu_ptr(&event_stats_cpu, cpu);
+	}
+
+	/*
+	 * We cannot entirely trust a BPF-provided size since a BPF program
+	 * might be compiled against a different vmlinux.h, in which
+	 * scx_event_stats may be larger (a newer vmlinux.h) or smaller
+	 * (an older vmlinux.h). Hence, we use the smaller size to avoid
+	 * memory corruption.
+	 */
+	events__sz = min(events__sz, sizeof(*events));
+	memcpy(events, &e_sys, events__sz);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -7753,6 +7855,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
 BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
 #endif
 BTF_ID_FLAGS(func, scx_bpf_now)
+BTF_ID_FLAGS(func, scx_bpf_event_stats, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
-- 
2.48.1



* [PATCH v2 02/11] sched: Move is_cpu_allowed() to the header
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 01/11] sched_ext: Implement event counter infrastructure Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 03/11] sched_ext: Add an event, SELECT_CPU_FALLBACK Changwoo Min
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

is_cpu_allowed() can be used by specific scheduling policies
(e.g., sched_ext), so move it to the header.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/core.c  | 30 ------------------------------
 kernel/sched/sched.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 901170708e2a..eaaee182999c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2427,36 +2427,6 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
 	return rq->nr_pinned;
 }
 
-/*
- * Per-CPU kthreads are allowed to run on !active && online CPUs, see
- * __set_cpus_allowed_ptr() and select_fallback_rq().
- */
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
-{
-	/* When not in the task's cpumask, no point in looking further. */
-	if (!task_allowed_on_cpu(p, cpu))
-		return false;
-
-	/* migrate_disabled() must be allowed to finish. */
-	if (is_migration_disabled(p))
-		return cpu_online(cpu);
-
-	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu);
-
-	/* KTHREAD_IS_PER_CPU is always allowed. */
-	if (kthread_is_per_cpu(p))
-		return cpu_online(cpu);
-
-	/* Regular kernel threads don't get to stay during offline. */
-	if (cpu_dying(cpu))
-		return false;
-
-	/* But are allowed during online. */
-	return cpu_online(cpu);
-}
-
 /*
  * This is how migration works:
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38e0e323dda2..30c401d940f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2675,6 +2675,36 @@ static inline cpumask_t *alloc_user_cpus_ptr(int node)
 
 #endif /* !CONFIG_SMP */
 
+/*
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
+ * __set_cpus_allowed_ptr() and select_fallback_rq().
+ */
+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+{
+	/* When not in the task's cpumask, no point in looking further. */
+	if (!task_allowed_on_cpu(p, cpu))
+		return false;
+
+	/* migrate_disabled() must be allowed to finish. */
+	if (is_migration_disabled(p))
+		return cpu_online(cpu);
+
+	/* Non kernel threads are not allowed during either online or offline. */
+	if (!(p->flags & PF_KTHREAD))
+		return cpu_active(cpu);
+
+	/* KTHREAD_IS_PER_CPU is always allowed. */
+	if (kthread_is_per_cpu(p))
+		return cpu_online(cpu);
+
+	/* Regular kernel threads don't get to stay during offline. */
+	if (cpu_dying(cpu))
+		return false;
+
+	/* But are allowed during online. */
+	return cpu_online(cpu);
+}
+
 #ifdef CONFIG_CPU_IDLE
 
 static inline void idle_set_state(struct rq *rq,
-- 
2.48.1



* [PATCH v2 03/11] sched_ext: Add an event, SELECT_CPU_FALLBACK
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 01/11] sched_ext: Implement event counter infrastructure Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 02/11] sched: Move is_cpu_allowed() to the header Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-27 20:01   ` Tejun Heo
  2025-01-26 10:16 ` [PATCH v2 04/11] sched_ext: Add an event, DISPATCH_LOCAL_DSQ_OFFLINE Changwoo Min
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, SELECT_CPU_FALLBACK, which represents how many times
ops.select_cpu() returns a CPU that the task can't use.

__scx_add_event() is used since the caller holds p->pi_lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index dbf659d593a3..727b1f8b623e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1462,6 +1462,11 @@ static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
  * Collection of event counters. Event types are placed in descending order.
  */
 struct scx_event_stats {
+	/*
+	 * If ops.select_cpu() returns a CPU which can't be used by the task,
+	 * the core scheduler code silently picks a fallback CPU.
+	 */
+	u64		SELECT_CPU_FALLBACK;
 };
 
 /*
@@ -3663,6 +3668,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 		cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
 					   select_cpu, p, prev_cpu, wake_flags);
 		*ddsp_taskp = NULL;
+
+		if (unlikely(!is_cpu_allowed(p, cpu)))
+			__scx_add_event(SELECT_CPU_FALLBACK, 1);
+
 		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
 			return cpu;
 		else
@@ -5482,6 +5491,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 	dump_line(&s, "--------------");
 
 	scx_bpf_event_stats(&events, sizeof(events));
+	scx_dump_event(s, &events, SELECT_CPU_FALLBACK);
 
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
 		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
@@ -7810,6 +7820,7 @@ __bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
 	memset(&e_sys, 0, sizeof(e_sys));
 	for_each_possible_cpu(cpu) {
 		e_cpu = per_cpu_ptr(&event_stats_cpu, cpu);
+		scx_agg_event(&e_sys, e_cpu, SELECT_CPU_FALLBACK);
 	}
 
 	/*
-- 
2.48.1



* [PATCH v2 04/11] sched_ext: Add an event, DISPATCH_LOCAL_DSQ_OFFLINE
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (2 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 03/11] sched_ext: Add an event, SELECT_CPU_FALLBACK Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 05/11] sched_ext: Add an event, DISPATCH_KEEP_LAST Changwoo Min
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, DISPATCH_LOCAL_DSQ_OFFLINE, which represents how many
times a BPF scheduler tries to dispatch to an offlined local DSQ.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 727b1f8b623e..38cf7ea6c385 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1467,6 +1467,12 @@ struct scx_event_stats {
 	 * the core scheduler code silently picks a fallback CPU.
 	 */
 	u64		SELECT_CPU_FALLBACK;
+
+	/*
+	 * When dispatching to a local DSQ, the CPU may have gone offline in
+	 * the meantime. In this case, the task is bounced to the global DSQ.
+	 */
+	u64		DISPATCH_LOCAL_DSQ_OFFLINE;
 };
 
 /*
@@ -2654,6 +2660,7 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
 	if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) {
 		dispatch_enqueue(find_global_dsq(p), p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
+		__scx_add_event(DISPATCH_LOCAL_DSQ_OFFLINE, 1);
 		return;
 	}
 
@@ -5492,6 +5499,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 
 	scx_bpf_event_stats(&events, sizeof(events));
 	scx_dump_event(s, &events, SELECT_CPU_FALLBACK);
+	scx_dump_event(s, &events, DISPATCH_LOCAL_DSQ_OFFLINE);
 
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
 		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
@@ -7821,6 +7829,7 @@ __bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
 	for_each_possible_cpu(cpu) {
 		e_cpu = per_cpu_ptr(&event_stats_cpu, cpu);
 		scx_agg_event(&e_sys, e_cpu, SELECT_CPU_FALLBACK);
+		scx_agg_event(&e_sys, e_cpu, DISPATCH_LOCAL_DSQ_OFFLINE);
 	}
 
 	/*
-- 
2.48.1



* [PATCH v2 05/11] sched_ext: Add an event, DISPATCH_KEEP_LAST
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (3 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 04/11] sched_ext: Add an event, DISPATCH_LOCAL_DSQ_OFFLINE Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 06/11] sched_ext: Add an event, ENQ_SKIP_EXITING Changwoo Min
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, DISPATCH_KEEP_LAST, which represents how many times
a task continues to run without ops.enqueue() when SCX_OPS_ENQ_LAST
is not set.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 38cf7ea6c385..d3a1ef58202c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1473,6 +1473,12 @@ struct scx_event_stats {
 	 * the meantime. In this case, the task is bounced to the global DSQ.
 	 */
 	u64		DISPATCH_LOCAL_DSQ_OFFLINE;
+
+	/*
+	 * If SCX_OPS_ENQ_LAST is not set, the number of times that a task
+	 * continued to run because there were no other tasks on the CPU.
+	 */
+	u64		DISPATCH_KEEP_LAST;
 };
 
 /*
@@ -2915,6 +2921,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	if (prev_on_rq && (!static_branch_unlikely(&scx_ops_enq_last) ||
 	     scx_rq_bypassing(rq))) {
 		rq->scx.flags |= SCX_RQ_BAL_KEEP;
+		__scx_add_event(DISPATCH_KEEP_LAST, 1);
 		goto has_tasks;
 	}
 	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
@@ -5500,6 +5507,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 	scx_bpf_event_stats(&events, sizeof(events));
 	scx_dump_event(s, &events, SELECT_CPU_FALLBACK);
 	scx_dump_event(s, &events, DISPATCH_LOCAL_DSQ_OFFLINE);
+	scx_dump_event(s, &events, DISPATCH_KEEP_LAST);
 
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
 		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
@@ -7830,6 +7838,7 @@ __bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
 		e_cpu = per_cpu_ptr(&event_stats_cpu, cpu);
 		scx_agg_event(&e_sys, e_cpu, SELECT_CPU_FALLBACK);
 		scx_agg_event(&e_sys, e_cpu, DISPATCH_LOCAL_DSQ_OFFLINE);
+		scx_agg_event(&e_sys, e_cpu, DISPATCH_KEEP_LAST);
 	}
 
 	/*
-- 
2.48.1



* [PATCH v2 06/11] sched_ext: Add an event, ENQ_SKIP_EXITING
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (4 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 05/11] sched_ext: Add an event, DISPATCH_KEEP_LAST Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 07/11] sched_ext: Add an event, BYPASS_ACTIVATE Changwoo Min
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, ENQ_SKIP_EXITING, which represents how many times an
exiting task is dispatched to a local DSQ, skipping ops.enqueue(), when
SCX_OPS_ENQ_EXITING is not set.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d3a1ef58202c..6dc86bc06596 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1479,6 +1479,12 @@ struct scx_event_stats {
 	 * continued to run because there were no other tasks on the CPU.
 	 */
 	u64		DISPATCH_KEEP_LAST;
+
+	/*
+	 * If SCX_OPS_ENQ_EXITING is not set, the number of times that a task
+	 * is dispatched to a local DSQ when exiting.
+	 */
+	u64		ENQ_SKIP_EXITING;
 };
 
 /*
@@ -2086,8 +2092,10 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 
 	/* see %SCX_OPS_ENQ_EXITING */
 	if (!static_branch_unlikely(&scx_ops_enq_exiting) &&
-	    unlikely(p->flags & PF_EXITING))
+	    unlikely(p->flags & PF_EXITING)) {
+		__scx_add_event(ENQ_SKIP_EXITING, 1);
 		goto local;
+	}
 
 	if (!SCX_HAS_OP(enqueue))
 		goto global;
@@ -5508,6 +5516,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 	scx_dump_event(s, &events, SELECT_CPU_FALLBACK);
 	scx_dump_event(s, &events, DISPATCH_LOCAL_DSQ_OFFLINE);
 	scx_dump_event(s, &events, DISPATCH_KEEP_LAST);
+	scx_dump_event(s, &events, ENQ_SKIP_EXITING);
 
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
 		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
-- 
2.48.1



* [PATCH v2 07/11] sched_ext: Add an event, BYPASS_ACTIVATE
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (5 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 06/11] sched_ext: Add an event, ENQ_SKIP_EXITING Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-27 20:03   ` Tejun Heo
  2025-01-26 10:16 ` [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH Changwoo Min
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, BYPASS_ACTIVATE, which represents how many times
the bypass mode has been triggered.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6dc86bc06596..bb363809e484 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1485,6 +1485,11 @@ struct scx_event_stats {
 	 * is dispatched to a local DSQ when exiting.
 	 */
 	u64		ENQ_SKIP_EXITING;
+
+	/*
+	 * The number of times the bypassing mode has been activated.
+	 */
+	u64		BYPASS_ACTIVATE;
 };
 
 /*
@@ -3707,6 +3712,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 			p->scx.slice = SCX_SLICE_DFL;
 			p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL;
 		}
+
 		return cpu;
 	}
 }
@@ -4922,6 +4928,7 @@ static void scx_ops_bypass(bool bypass)
 		WARN_ON_ONCE(scx_ops_bypass_depth <= 0);
 		if (scx_ops_bypass_depth != 1)
 			goto unlock;
+		scx_add_event(BYPASS_ACTIVATE, 1);
 	} else {
 		scx_ops_bypass_depth--;
 		WARN_ON_ONCE(scx_ops_bypass_depth < 0);
@@ -5517,6 +5524,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 	scx_dump_event(s, &events, DISPATCH_LOCAL_DSQ_OFFLINE);
 	scx_dump_event(s, &events, DISPATCH_KEEP_LAST);
 	scx_dump_event(s, &events, ENQ_SKIP_EXITING);
+	scx_dump_event(s, &events, BYPASS_ACTIVATE);
 
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
 		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
@@ -7848,6 +7856,7 @@ __bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
 		scx_agg_event(&e_sys, e_cpu, SELECT_CPU_FALLBACK);
 		scx_agg_event(&e_sys, e_cpu, DISPATCH_LOCAL_DSQ_OFFLINE);
 		scx_agg_event(&e_sys, e_cpu, DISPATCH_KEEP_LAST);
+		scx_agg_event(&e_sys, e_cpu, BYPASS_ACTIVATE);
 	}
 
 	/*
-- 
2.48.1



* [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (6 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 07/11] sched_ext: Add an event, BYPASS_ACTIVATE Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-27 20:05   ` Tejun Heo
  2025-01-26 10:16 ` [PATCH v2 09/11] sched_ext: Add an event, BYPASS_DURATION Changwoo Min
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, BYPASS_DISPATCH, which represents how many tasks
have been dispatched in the bypass mode.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index bb363809e484..9a680774c1fc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1486,6 +1486,11 @@ struct scx_event_stats {
 	 */
 	u64		ENQ_SKIP_EXITING;
 
+	/*
+	 * The number of tasks dispatched in the bypassing mode.
+	 */
+	u64		BYPASS_DISPATCH;
+
 	/*
 	 * The number of times the bypassing mode has been activated.
 	 */
@@ -2941,6 +2946,8 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	return false;
 
 has_tasks:
+	if (scx_rq_bypassing(rq))
+		__scx_add_event(BYPASS_DISPATCH, 1);
 	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
 	return true;
 }
@@ -5524,6 +5531,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 	scx_dump_event(s, &events, DISPATCH_LOCAL_DSQ_OFFLINE);
 	scx_dump_event(s, &events, DISPATCH_KEEP_LAST);
 	scx_dump_event(s, &events, ENQ_SKIP_EXITING);
+	scx_dump_event(s, &events, BYPASS_DISPATCH);
 	scx_dump_event(s, &events, BYPASS_ACTIVATE);
 
 	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
@@ -7856,6 +7864,7 @@ __bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
 		scx_agg_event(&e_sys, e_cpu, SELECT_CPU_FALLBACK);
 		scx_agg_event(&e_sys, e_cpu, DISPATCH_LOCAL_DSQ_OFFLINE);
 		scx_agg_event(&e_sys, e_cpu, DISPATCH_KEEP_LAST);
+		scx_agg_event(&e_sys, e_cpu, BYPASS_DISPATCH);
 		scx_agg_event(&e_sys, e_cpu, BYPASS_ACTIVATE);
 	}
 
-- 
2.48.1



* [PATCH v2 09/11] sched_ext: Add an event, BYPASS_DURATION
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (7 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 10/11] sched_ext: Add scx_bpf_event_stats() and scx_read_event() for BPF schedulers Changwoo Min
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Add a core event, BYPASS_DURATION, which represents the total duration
of bypass modes in nanoseconds.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9a680774c1fc..c71c89cfd1c8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1486,6 +1486,11 @@ struct scx_event_stats {
 	 */
 	u64		ENQ_SKIP_EXITING;
 
+	/*
+	 * The total duration of bypass modes in nanoseconds.
+	 */
+	u64		BYPASS_DURATION;
+
 	/*
 	 * The number of tasks dispatched in the bypassing mode.
 	 */
@@ -1547,6 +1552,12 @@ static DEFINE_PER_CPU(struct scx_event_stats, event_stats_cpu);
 } while (0)
 
 
+/*
+ * The last time the bypass mode started.
+ * This is used to measure BYPASS_DURATION.
+ */
+static unsigned long scx_bypass_timestamp;
+
 static void scx_bpf_event_stats(struct scx_event_stats *events, size_t events__sz);
 
 static enum scx_ops_enable_state scx_ops_enable_state(void)
@@ -4935,12 +4946,15 @@ static void scx_ops_bypass(bool bypass)
 		WARN_ON_ONCE(scx_ops_bypass_depth <= 0);
 		if (scx_ops_bypass_depth != 1)
 			goto unlock;
+		scx_bypass_timestamp = ktime_get_ns();
 		scx_add_event(BYPASS_ACTIVATE, 1);
 	} else {
 		scx_ops_bypass_depth--;
 		WARN_ON_ONCE(scx_ops_bypass_depth < 0);
 		if (scx_ops_bypass_depth != 0)
 			goto unlock;
+		scx_add_event(BYPASS_DURATION,
+			      ktime_get_ns() - scx_bypass_timestamp);
 	}
 
 	atomic_inc(&scx_ops_breather_depth);
@@ -5531,6 +5545,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 	scx_dump_event(s, &events, DISPATCH_LOCAL_DSQ_OFFLINE);
 	scx_dump_event(s, &events, DISPATCH_KEEP_LAST);
 	scx_dump_event(s, &events, ENQ_SKIP_EXITING);
+	scx_dump_event(s, &events, BYPASS_DURATION);
 	scx_dump_event(s, &events, BYPASS_DISPATCH);
 	scx_dump_event(s, &events, BYPASS_ACTIVATE);
 
@@ -7864,6 +7879,7 @@ __bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
 		scx_agg_event(&e_sys, e_cpu, SELECT_CPU_FALLBACK);
 		scx_agg_event(&e_sys, e_cpu, DISPATCH_LOCAL_DSQ_OFFLINE);
 		scx_agg_event(&e_sys, e_cpu, DISPATCH_KEEP_LAST);
+		scx_agg_event(&e_sys, e_cpu, BYPASS_DURATION);
 		scx_agg_event(&e_sys, e_cpu, BYPASS_DISPATCH);
 		scx_agg_event(&e_sys, e_cpu, BYPASS_ACTIVATE);
 	}
-- 
2.48.1



* [PATCH v2 10/11] sched_ext: Add scx_bpf_event_stats() and scx_read_event() for BPF schedulers
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (8 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 09/11] sched_ext: Add an event, BYPASS_DURATION Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-26 10:16 ` [PATCH v2 11/11] sched_ext: Print core event count in scx_central scheduler Changwoo Min
  2025-01-27 20:10 ` [PATCH v2 00/11] sched_ext: Implement core event counters Tejun Heo
  11 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

scx_bpf_event_stats() is added to the common BPF header so a BPF
scheduler can use it. Also, scx_read_event() is added to read an event
counter in a compatible way.
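
A BPF scheduler might then consume the counters roughly as in the
following fragment (illustrative only; print_events() is a made-up
helper, and the available fields come from the running kernel's
vmlinux.h):

```
/* Illustrative BPF-side usage sketch; not part of the patch. */
struct scx_event_stats events;

static void print_events(void) /* hypothetical helper */
{
	scx_bpf_event_stats(&events, sizeof(events));

	/*
	 * scx_read_event() evaluates to 0 when the running kernel's
	 * struct scx_event_stats lacks the field, so the scheduler
	 * keeps working across kernel versions.
	 */
	bpf_printk("SELECT_CPU_FALLBACK %llu",
		   scx_read_event(&events, SELECT_CPU_FALLBACK));
	bpf_printk("BYPASS_DURATION %llu",
		   scx_read_event(&events, BYPASS_DURATION));
}
```

The size-capped memcpy() on the kernel side and the
bpf_core_field_exists() check on the BPF side together make the struct
safely extensible in both directions.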

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 tools/sched_ext/include/scx/common.bpf.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index f3e15e9efa76..71db653c8eb2 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -78,6 +78,10 @@ struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
 struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym __weak;
 u64 scx_bpf_now(void) __ksym __weak;
 
+void scx_bpf_event_stats(struct scx_event_stats *events, size_t events__sz) __ksym __weak;
+#define scx_read_event(e, name)							\
+	(bpf_core_field_exists((e)->name) ? (e)->name : 0)
+
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
  * within bpf_for_each() loops.
-- 
2.48.1



* [PATCH v2 11/11] sched_ext: Print core event count in scx_central scheduler
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (9 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 10/11] sched_ext: Add scx_bpf_event_stats() and scx_read_event() for BPF schedulers Changwoo Min
@ 2025-01-26 10:16 ` Changwoo Min
  2025-01-27 20:08   ` Tejun Heo
  2025-01-27 20:10 ` [PATCH v2 00/11] sched_ext: Implement core event counters Tejun Heo
  11 siblings, 1 reply; 26+ messages in thread
From: Changwoo Min @ 2025-01-26 10:16 UTC (permalink / raw)
  To: tj, void, arighi; +Cc: kernel-dev, linux-kernel, Changwoo Min

Modify the scx_central scheduler to print the core event counters
every second.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 tools/sched_ext/scx_central.bpf.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c
index 50bc1737c167..96b3c08132ea 100644
--- a/tools/sched_ext/scx_central.bpf.c
+++ b/tools/sched_ext/scx_central.bpf.c
@@ -256,6 +256,7 @@ static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
 	u64 now = scx_bpf_now();
 	u64 nr_to_kick = nr_queued;
 	s32 i, curr_cpu;
+	struct scx_event_stats events;
 
 	curr_cpu = bpf_get_smp_processor_id();
 	if (timer_pinned && (curr_cpu != central_cpu)) {
@@ -291,6 +292,26 @@ static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
 
 	bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN);
 	__sync_fetch_and_add(&nr_timers, 1);
+
+	/* print event counters every second */
+	if (nr_timers % 1000 == 0) {
+		scx_bpf_event_stats(&events, sizeof(events));
+
+		bpf_printk("%30s: %llu\n", "SELECT_CPU_FALLBACK",
+			   scx_read_event(&events, SELECT_CPU_FALLBACK));
+		bpf_printk("%30s: %llu\n", "DISPATCH_LOCAL_DSQ_OFFLINE",
+			   scx_read_event(&events, DISPATCH_LOCAL_DSQ_OFFLINE));
+		bpf_printk("%30s: %llu\n", "DISPATCH_KEEP_LAST",
+			   scx_read_event(&events, DISPATCH_KEEP_LAST));
+		bpf_printk("%30s: %llu\n", "ENQ_SKIP_EXITING",
+			   scx_read_event(&events, ENQ_SKIP_EXITING));
+		bpf_printk("%30s: %llu\n", "BYPASS_DURATION",
+			   scx_read_event(&events, BYPASS_DURATION));
+		bpf_printk("%30s: %llu\n", "BYPASS_DISPATCH",
+			   scx_read_event(&events, BYPASS_DISPATCH));
+		bpf_printk("%30s: %llu\n", "BYPASS_ACTIVATE",
+			   scx_read_event(&events, BYPASS_ACTIVATE));
+	}
 	return 0;
 }
 
-- 
2.48.1



* Re: [PATCH v2 03/11] sched_ext: Add an event, SELECT_CPU_FALLBACK
  2025-01-26 10:16 ` [PATCH v2 03/11] sched_ext: Add an event, SELECT_CPU_FALLBACK Changwoo Min
@ 2025-01-27 20:01   ` Tejun Heo
  2025-01-30  6:22     ` Changwoo Min
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-01-27 20:01 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On Sun, Jan 26, 2025 at 07:16:06PM +0900, Changwoo Min wrote:
...
>  struct scx_event_stats {
> +	/*
> +	 * If ops.select_cpu() returns a CPU which can't be used by the task,
> +	 * the core scheduler code silently picks a fallback CPU.
> +	 */
> +	u64		SELECT_CPU_FALLBACK;

As C has one global namespace, we're gonna have to prefix these. Hopefully,
something not too long. SCX_EV_?

> @@ -3663,6 +3668,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>  		cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
>  					   select_cpu, p, prev_cpu, wake_flags);
>  		*ddsp_taskp = NULL;
> +
> +		if (unlikely(!is_cpu_allowed(p, cpu)))
> +			__scx_add_event(SELECT_CPU_FALLBACK, 1);

This is trying to guess what select_task_rq() is going to do and then count
that as an event, which doesn't seem great. Can't we just record the picked
CPU here into a field in p->scx and then compare that against cpu_of(rq) in
enqueue_task_scx()? That way, we can reliably count events where CPU
selection was modified by the fallback logic regardless of why or how that
happens.

Thanks.

-- 
tejun


* Re: [PATCH v2 07/11] sched_ext: Add an event, BYPASS_ACTIVATE
  2025-01-26 10:16 ` [PATCH v2 07/11] sched_ext: Add an event, BYPASS_ACTIVATE Changwoo Min
@ 2025-01-27 20:03   ` Tejun Heo
  2025-01-30  6:34     ` Changwoo Min
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-01-27 20:03 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Sun, Jan 26, 2025 at 07:16:10PM +0900, Changwoo Min wrote:
...
> @@ -3707,6 +3712,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>  			p->scx.slice = SCX_SLICE_DFL;
>  			p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL;
>  		}
> +
>  		return cpu;
>  	}
>  }

Stray change?

Thanks.

-- 
tejun


* Re: [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH
  2025-01-26 10:16 ` [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH Changwoo Min
@ 2025-01-27 20:05   ` Tejun Heo
  2025-01-30  6:26     ` Changwoo Min
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-01-27 20:05 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Sun, Jan 26, 2025 at 07:16:11PM +0900, Changwoo Min wrote:
...
>  has_tasks:
> +	if (scx_rq_bypassing(rq))
> +		__scx_add_event(BYPASS_DISPATCH, 1);

Can we do this at the sources even if that's a bit more code?

Thanks.

-- 
tejun


* Re: [PATCH v2 11/11] sched_ext: Print core event count in scx_central scheduler
  2025-01-26 10:16 ` [PATCH v2 11/11] sched_ext: Print core event count in scx_central scheduler Changwoo Min
@ 2025-01-27 20:08   ` Tejun Heo
  2025-01-30  6:28     ` Changwoo Min
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-01-27 20:08 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Sun, Jan 26, 2025 at 07:16:14PM +0900, Changwoo Min wrote:
> Modify the scx_central scheduler to print the core event counter
> every second.

Can you add it to scx_qmap too? That's used as the dumping ground for all
the niche features, so it'd be nice to have an example there too.

Thanks.

-- 
tejun


* Re: [PATCH v2 01/11] sched_ext: Implement event counter infrastructure
  2025-01-26 10:16 ` [PATCH v2 01/11] sched_ext: Implement event counter infrastructure Changwoo Min
@ 2025-01-27 20:09   ` Tejun Heo
  2025-01-30  6:12     ` Changwoo Min
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-01-27 20:09 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On Sun, Jan 26, 2025 at 07:16:04PM +0900, Changwoo Min wrote:
...
> +/*
> + * scx_bpf_event_stats - Get the system-wide event counters
> + * @events: output buffer from a BPF program
> + * @events__sz: @events len, must end in '__sz' for the verifier
> + */
> +__bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
> +				     size_t events__sz)

scx_bpf_events()?

Thanks.

-- 
tejun


* Re: [PATCH v2 00/11] sched_ext: Implement core event counters
  2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
                   ` (10 preceding siblings ...)
  2025-01-26 10:16 ` [PATCH v2 11/11] sched_ext: Print core event count in scx_central scheduler Changwoo Min
@ 2025-01-27 20:10 ` Tejun Heo
  2025-01-30  6:28   ` Changwoo Min
  11 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-01-27 20:10 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Sun, Jan 26, 2025 at 07:16:03PM +0900, Changwoo Min wrote:
> The sched_ext core often has to override the BPF scheduler decisions,
> and some events could be interesting but not easily visible.

Generally looks good to me. Left several comments which should be
straightforward to address.

Thanks.

-- 
tejun


* Re: [PATCH v2 01/11] sched_ext: Implement event counter infrastructure
  2025-01-27 20:09   ` Tejun Heo
@ 2025-01-30  6:12     ` Changwoo Min
  0 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-30  6:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 28. 05:09, Tejun Heo wrote:
> Hello,
> 
> On Sun, Jan 26, 2025 at 07:16:04PM +0900, Changwoo Min wrote:
> ...
>> +/*
>> + * scx_bpf_event_stats - Get the system-wide event counters
>> + * @events: output buffer from a BPF program
>> + * @events__sz: @events len, must end in '__sz' for the verifier
>> + */
>> +__bpf_kfunc void scx_bpf_event_stats(struct scx_event_stats *events,
>> +				     size_t events__sz)
> 
> scx_bpf_events()?

That sounds simpler. I will change it in the next version.

Regards,
Changwoo Min


* Re: [PATCH v2 03/11] sched_ext: Add an event, SELECT_CPU_FALLBACK
  2025-01-27 20:01   ` Tejun Heo
@ 2025-01-30  6:22     ` Changwoo Min
  0 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-30  6:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 28. 05:01, Tejun Heo wrote:
> Hello,
> 
> On Sun, Jan 26, 2025 at 07:16:06PM +0900, Changwoo Min wrote:
> ...
>>   struct scx_event_stats {
>> +	/*
>> +	 * If ops.select_cpu() returns a CPU which can't be used by the task,
>> +	 * the core scheduler code silently picks a fallback CPU.
>> +	 */
>> +	u64		SELECT_CPU_FALLBACK;
> 
> As C has one global namespace, we're gonna have to prefix these. Hopefully,
> something not too long. SCX_EV_?

Sure, I will add SCX_EV_ as a prefix for all events.

> 
>> @@ -3663,6 +3668,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>>   		cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
>>   					   select_cpu, p, prev_cpu, wake_flags);
>>   		*ddsp_taskp = NULL;
>> +
>> +		if (unlikely(!is_cpu_allowed(p, cpu)))
>> +			__scx_add_event(SELECT_CPU_FALLBACK, 1);
> 
> This is trying to guess what select_task_rq() is going to do and then count
> that as an event, which doesn't seem great. Can't we just record the picked
> CPU here into a field in p->scx and then compare that against cpu_of(rq) in
> enqueue_task_scx()? That way, we can reliably count events where CPU
> selection was modified by the fallback logic regardless of why or how that
> happens.

That sounds good. I will add a field (say, selected_cpu?) to struct 
sched_ext_entity.

Regards,
Changwoo Min


* Re: [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH
  2025-01-27 20:05   ` Tejun Heo
@ 2025-01-30  6:26     ` Changwoo Min
  2025-01-30 12:25       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Changwoo Min @ 2025-01-30  6:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 28. 05:05, Tejun Heo wrote:
> On Sun, Jan 26, 2025 at 07:16:11PM +0900, Changwoo Min wrote:
> ...
>>   has_tasks:
>> +	if (scx_rq_bypassing(rq))
>> +		__scx_add_event(BYPASS_DISPATCH, 1);
> 
> Can we do this at the sources even if that's a bit more code?

Sure. I will remove scx_add_event() and __scx_add_event() and use
this_cpu_add() and __this_cpu_add() directly.

Regards,
Changwoo Min


* Re: [PATCH v2 11/11] sched_ext: Print core event count in scx_central scheduler
  2025-01-27 20:08   ` Tejun Heo
@ 2025-01-30  6:28     ` Changwoo Min
  0 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-30  6:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 28. 05:08, Tejun Heo wrote:
> On Sun, Jan 26, 2025 at 07:16:14PM +0900, Changwoo Min wrote:
>> Modify the scx_central scheduler to print the core event counter
>> every second.
> 
> Can you add it to scx_qmap too? That's used as the dumping ground for all
> the niche features, so it'd be nice to have an example there too.

Sure, I will add the event dump feature to monitor_timerfn() in scx_qmap.

Regards,
Changwoo Min


* Re: [PATCH v2 00/11] sched_ext: Implement core event counters
  2025-01-27 20:10 ` [PATCH v2 00/11] sched_ext: Implement core event counters Tejun Heo
@ 2025-01-30  6:28   ` Changwoo Min
  0 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-30  6:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 28. 05:10, Tejun Heo wrote:
> On Sun, Jan 26, 2025 at 07:16:03PM +0900, Changwoo Min wrote:
>> The sched_ext core often has to override the BPF scheduler decisions,
>> and some events could be interesting but not easily visible.
> 
> Generally looks good to me. Left several comments which should be
> straightforward to address.

Thank you for the review! I will cook another version in a day or two.

Regards,
Changwoo Min


* Re: [PATCH v2 07/11] sched_ext: Add an event, BYPASS_ACTIVATE
  2025-01-27 20:03   ` Tejun Heo
@ 2025-01-30  6:34     ` Changwoo Min
  0 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-30  6:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 28. 05:03, Tejun Heo wrote:
> On Sun, Jan 26, 2025 at 07:16:10PM +0900, Changwoo Min wrote:
> ...
>> @@ -3707,6 +3712,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>>   			p->scx.slice = SCX_SLICE_DFL;
>>   			p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL;
>>   		}
>> +
>>   		return cpu;
>>   	}
>>   }
> 
> Stray change?

My bad. Will fix it.

Regards,
Changwoo Min



* Re: [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH
  2025-01-30  6:26     ` Changwoo Min
@ 2025-01-30 12:25       ` Tejun Heo
  2025-01-31  1:13         ` Changwoo Min
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-01-30 12:25 UTC (permalink / raw)
  To: Changwoo Min; +Cc: void, arighi, kernel-dev, linux-kernel

On Thu, Jan 30, 2025 at 03:26:16PM +0900, Changwoo Min wrote:
> Hello,
> 
> On 25. 1. 28. 05:05, Tejun Heo wrote:
> > On Sun, Jan 26, 2025 at 07:16:11PM +0900, Changwoo Min wrote:
> > ...
> > >   has_tasks:
> > > +	if (scx_rq_bypassing(rq))
> > > +		__scx_add_event(BYPASS_DISPATCH, 1);
> > 
> > Can we do this at the sources even if that's a bit more code?
> 
> Sure. I will remove scx_add_event() and __scx_add_event() and will use
> this_cpu_add() and __this_cpu_add() directly.

Oh, that's not what I meant. I meant that in the code quoted above, the stat
is being incremented in a spot where most code paths converge and then the
specific stat condition is being tested again. It'd be better to update the
stat where the condition is detected initially.

Thanks.

-- 
tejun


* Re: [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH
  2025-01-30 12:25       ` Tejun Heo
@ 2025-01-31  1:13         ` Changwoo Min
  0 siblings, 0 replies; 26+ messages in thread
From: Changwoo Min @ 2025-01-31  1:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, kernel-dev, linux-kernel

Hello,

On 25. 1. 30. 21:25, Tejun Heo wrote:
> On Thu, Jan 30, 2025 at 03:26:16PM +0900, Changwoo Min wrote:
>> Hello,
>>
>> On 25. 1. 28. 05:05, Tejun Heo wrote:
>>> On Sun, Jan 26, 2025 at 07:16:11PM +0900, Changwoo Min wrote:
>>> ...
>>>>    has_tasks:
>>>> +	if (scx_rq_bypassing(rq))
>>>> +		__scx_add_event(BYPASS_DISPATCH, 1);
>>>
>>> Can we do this at the sources even if that's a bit more code?
>>
>> Sure. I will remove scx_add_event() and __scx_add_event() and will use
>> this_cpu_add() and __this_cpu_add() directly.
> 
> Oh, that's not what I meant. I meant that in the code quoted above, the stat
> is being incremented in a spot where most code paths converge and then the
> specific stat condition is being tested again. It'd be better to update the
> stat where the condition is detected initially.

Aha, I got it. Thanks for the clarification.

Regards,
Changwoo Min


end of thread, other threads:[~2025-01-31  1:13 UTC | newest]

Thread overview: 26+ messages
2025-01-26 10:16 [PATCH v2 00/11] sched_ext: Implement core event counters Changwoo Min
2025-01-26 10:16 ` [PATCH v2 01/11] sched_ext: Implement event counter infrastructure Changwoo Min
2025-01-27 20:09   ` Tejun Heo
2025-01-30  6:12     ` Changwoo Min
2025-01-26 10:16 ` [PATCH v2 02/11] sched: Move is_cpu_allowed() to the header Changwoo Min
2025-01-26 10:16 ` [PATCH v2 03/11] sched_ext: Add an event, SELECT_CPU_FALLBACK Changwoo Min
2025-01-27 20:01   ` Tejun Heo
2025-01-30  6:22     ` Changwoo Min
2025-01-26 10:16 ` [PATCH v2 04/11] sched_ext: Add an event, DISPATCH_LOCAL_DSQ_OFFLINE Changwoo Min
2025-01-26 10:16 ` [PATCH v2 05/11] sched_ext: Add an event, DISPATCH_KEEP_LAST Changwoo Min
2025-01-26 10:16 ` [PATCH v2 06/11] sched_ext: Add an event, ENQ_SKIP_EXITING Changwoo Min
2025-01-26 10:16 ` [PATCH v2 07/11] sched_ext: Add an event, BYPASS_ACTIVATE Changwoo Min
2025-01-27 20:03   ` Tejun Heo
2025-01-30  6:34     ` Changwoo Min
2025-01-26 10:16 ` [PATCH v2 08/11] sched_ext: Add an event, BYPASS_DISPATCH Changwoo Min
2025-01-27 20:05   ` Tejun Heo
2025-01-30  6:26     ` Changwoo Min
2025-01-30 12:25       ` Tejun Heo
2025-01-31  1:13         ` Changwoo Min
2025-01-26 10:16 ` [PATCH v2 09/11] sched_ext: Add an event, BYPASS_DURATION Changwoo Min
2025-01-26 10:16 ` [PATCH v2 10/11] sched_ext: Add scx_bpf_event_stats() and scx_read_event() for BPF schedulers Changwoo Min
2025-01-26 10:16 ` [PATCH v2 11/11] sched_ext: Print core event count in scx_central scheduler Changwoo Min
2025-01-27 20:08   ` Tejun Heo
2025-01-30  6:28     ` Changwoo Min
2025-01-27 20:10 ` [PATCH v2 00/11] sched_ext: Implement core event counters Tejun Heo
2025-01-30  6:28   ` Changwoo Min
