* [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
@ 2024-12-09 6:15 Changwoo Min
2024-12-09 6:15 ` [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
` (6 more replies)
0 siblings, 7 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-09 6:15 UTC (permalink / raw)
To: tj, void, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel
Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
and scx_flash) frequently call bpf_ktime_get_ns() for tracking tasks' runtime
properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
timestamp counter (TSC). However, reading a hardware TSC is not
performant on some hardware platforms, degrading IPC.
This patchset addresses the performance problem of reading hardware TSC
by leveraging the rq clock in the scheduler core, introducing a
scx_bpf_now_ns() function for BPF schedulers. Whenever the rq clock
is fresh and valid, scx_bpf_now_ns() provides the rq clock, which is
already updated by the scheduler core (update_rq_clock), so it can reduce
reading the hardware TSC.
When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
so a subsequent scx_bpf_now_ns() call gets the fresh sched_clock for the caller.
In addition, scx_bpf_now_ns() guarantees the clock is monotonically
non-decreasing for the same CPU, so the clock cannot go backward
in the same CPU.
Using scx_bpf_now_ns() reduces the number of hardware TSC reads
by 40-70% (65% for scx_lavd, 58% for scx_bpfland, and 43% for scx_rusty)
for the following benchmark:
perf bench -f simple sched messaging -t -g 20 -l 6000
The patchset begins by managing the status of the rq clock in the scheduler
core, then implementing scx_bpf_now_ns(), and finally applying it to the
BPF schedulers.
ChangeLog v3 -> v4:
- Separate the code relocation related to scx_enabled() into a
separate patch.
- Remove scx_rq_clock_stale() after (or before) ops.running() and
ops.update_idle() calls
- Rename scx_bpf_clock_get_ns() to scx_bpf_now_ns() and revise it to
address the comments
- Move the per-CPU variable holding a prev clock into scx_rq
(rq->scx.prev_clock)
- Add a comment describing when the clock could go backward in
scx_bpf_now_ns()
- Rebase the code to the tip of Tejun's sched_ext repo (for-next
branch)
ChangeLog v2 -> v3:
- To avoid unnecessarily modifying cache lines, scx_rq_clock_update()
and scx_rq_clock_stale() update the clock and flags only when a
sched_ext scheduler is enabled.
ChangeLog v1 -> v2:
- Rename SCX_RQ_CLK_UPDATED to SCX_RQ_CLK_VALID to denote the validity
of an rq clock clearly.
- Rearrange the clock and flags fields in struct scx_rq to make sure
they are in the same cacheline to minimize the cache misses
- Add an additional explanation to the commit message in the 2/5 patch
describing when the rq clock will be reused with an example.
- Fix typos
- Rebase the code to the tip of Tejun's sched_ext repo
Changwoo Min (6):
sched_ext: Relocate scx_enabled() related code
sched_ext: Implement scx_rq_clock_update/stale()
sched_ext: Manage the validity of scx_rq_clock
sched_ext: Implement scx_bpf_now_ns()
sched_ext: Add scx_bpf_now_ns() for BPF scheduler
sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now_ns()
kernel/sched/core.c | 6 +-
kernel/sched/ext.c | 73 ++++++++++++++++++++++++
kernel/sched/sched.h | 52 ++++++++++++-----
tools/sched_ext/include/scx/common.bpf.h | 1 +
tools/sched_ext/include/scx/compat.bpf.h | 5 ++
tools/sched_ext/scx_central.bpf.c | 4 +-
tools/sched_ext/scx_flatcg.bpf.c | 2 +-
7 files changed, 124 insertions(+), 19 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
@ 2024-12-09 6:15 ` Changwoo Min
2024-12-11 7:27 ` Tejun Heo
2024-12-09 6:15 ` [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale() Changwoo Min
` (5 subsequent siblings)
6 siblings, 1 reply; 18+ messages in thread
From: Changwoo Min @ 2024-12-09 6:15 UTC (permalink / raw)
To: tj, void, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel
scx_enabled() will be used in scx_rq_clock_update/stale() in the
following patch, so relocate the scx_enabled() related code to the
proper location.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
kernel/sched/sched.h | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..440ecedf871b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1717,6 +1717,19 @@ struct rq_flags {
extern struct balance_callback balance_push_callback;
+#ifdef CONFIG_SCHED_CLASS_EXT
+extern const struct sched_class ext_sched_class;
+
+DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled); /* SCX BPF scheduler loaded */
+DECLARE_STATIC_KEY_FALSE(__scx_switched_all); /* all fair class tasks on SCX */
+
+#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
+#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)
+#else /* !CONFIG_SCHED_CLASS_EXT */
+#define scx_enabled() false
+#define scx_switched_all() false
+#endif /* !CONFIG_SCHED_CLASS_EXT */
+
/*
* Lockdep annotation that avoids accidental unlocks; it's like a
* sticky/continuous lockdep_assert_held().
@@ -2505,19 +2518,6 @@ extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;
-#ifdef CONFIG_SCHED_CLASS_EXT
-extern const struct sched_class ext_sched_class;
-
-DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled); /* SCX BPF scheduler loaded */
-DECLARE_STATIC_KEY_FALSE(__scx_switched_all); /* all fair class tasks on SCX */
-
-#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
-#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)
-#else /* !CONFIG_SCHED_CLASS_EXT */
-#define scx_enabled() false
-#define scx_switched_all() false
-#endif /* !CONFIG_SCHED_CLASS_EXT */
-
/*
* Iterate only active classes. SCX can take over all fair tasks or be
* completely disabled. If the former, skip fair. If the latter, skip SCX.
--
2.47.1
* [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale()
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
2024-12-09 6:15 ` [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
@ 2024-12-09 6:15 ` Changwoo Min
2024-12-09 9:40 ` Andrea Righi
2024-12-11 7:43 ` Tejun Heo
2024-12-09 6:15 ` [PATCH v4 3/6] sched_ext: Manage the validity of scx_rq_clock Changwoo Min
` (4 subsequent siblings)
6 siblings, 2 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-09 6:15 UTC (permalink / raw)
To: tj, void, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel
scx_rq_clock_update() and scx_rq_clock_stale() manage the status of an
rq clock when sched_ext is enabled. scx_rq_clock_update() caches the rq
clock in memory and marks its status valid. scx_rq_clock_stale()
invalidates the current rq clock so that the stale cached value is not used.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
kernel/sched/sched.h | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 440ecedf871b..7e71d8685fcc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -754,6 +754,7 @@ enum scx_rq_flags {
SCX_RQ_BAL_PENDING = 1 << 2, /* balance hasn't run yet */
SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
SCX_RQ_BYPASSING = 1 << 4,
+ SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
SCX_RQ_IN_WAKEUP = 1 << 16,
SCX_RQ_IN_BALANCE = 1 << 17,
@@ -766,9 +767,11 @@ struct scx_rq {
unsigned long ops_qseq;
u64 extra_enq_flags; /* see move_task_to_local_dsq() */
u32 nr_running;
- u32 flags;
u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */
bool cpu_released;
+ u32 flags;
+ u64 clock; /* current per-rq clock -- see scx_bpf_now_ns() */
+ u64 prev_clock; /* previous per-rq clock -- see scx_bpf_now_ns() */
cpumask_var_t cpus_to_kick;
cpumask_var_t cpus_to_kick_if_idle;
cpumask_var_t cpus_to_preempt;
@@ -1725,9 +1728,28 @@ DECLARE_STATIC_KEY_FALSE(__scx_switched_all); /* all fair class tasks on SCX */
#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)
+
+static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
+{
+ if (scx_enabled()) {
+ rq->scx.prev_clock = rq->scx.clock;
+ rq->scx.clock = clock;
+ rq->scx.flags |= SCX_RQ_CLK_VALID;
+ }
+}
+
+static inline void scx_rq_clock_stale(struct rq *rq)
+{
+ if (scx_enabled())
+ rq->scx.flags &= ~SCX_RQ_CLK_VALID;
+}
+
#else /* !CONFIG_SCHED_CLASS_EXT */
#define scx_enabled() false
#define scx_switched_all() false
+
+static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
+static inline void scx_rq_clock_stale(struct rq *rq) {}
#endif /* !CONFIG_SCHED_CLASS_EXT */
/*
--
2.47.1
* [PATCH v4 3/6] sched_ext: Manage the validity of scx_rq_clock
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
2024-12-09 6:15 ` [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
2024-12-09 6:15 ` [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale() Changwoo Min
@ 2024-12-09 6:15 ` Changwoo Min
2024-12-09 6:15 ` [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns() Changwoo Min
` (3 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-09 6:15 UTC (permalink / raw)
To: tj, void, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel
An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock().
Let's suppose the following timeline in the scheduler core:
T1. rq_lock(rq)
T2. update_rq_clock(rq)
T3. a sched_ext BPF operation
T4. rq_unlock(rq)
T5. a sched_ext BPF operation
T6. rq_lock(rq)
T7. update_rq_clock(rq)
For [T2, T4), we consider the rq clock valid (SCX_RQ_CLK_VALID is
set), so scx_bpf_now_ns() calls during [T2, T4) (including T3) will
return the rq clock updated at T2. For the duration [T4, T7), when a BPF
scheduler can still call scx_bpf_now_ns() (T5), we consider the rq clock
invalid (SCX_RQ_CLK_VALID is unset at T4). So a scx_bpf_now_ns() call
at T5 will return a fresh clock value, obtained by calling
sched_clock() internally.
One example of calling scx_bpf_now_ns() while the rq clock is invalid
(as at T5) is in scx_central [1]. The scx_central scheduler uses a BPF
timer for preemptive scheduling. Every msec, the timer callback checks
whether the currently running tasks have exceeded their timeslices. At the
beginning of the BPF timer callback (central_timerfn in scx_central.bpf.c),
scx_central reads the current time. When the BPF timer callback runs, the
rq clock could be invalid, just as at T5. In this case, scx_bpf_now_ns()
returns a fresh clock value rather than the old one (from T2).
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
kernel/sched/core.c | 6 +++++-
kernel/sched/sched.h | 2 +-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a519..ab8015c8cab4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -789,6 +789,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
void update_rq_clock(struct rq *rq)
{
s64 delta;
+ u64 clock;
lockdep_assert_rq_held(rq);
@@ -800,11 +801,14 @@ void update_rq_clock(struct rq *rq)
SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
rq->clock_update_flags |= RQCF_UPDATED;
#endif
+ clock = sched_clock_cpu(cpu_of(rq));
+ scx_rq_clock_update(rq, clock);
- delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+ delta = clock - rq->clock;
if (delta < 0)
return;
rq->clock += delta;
+
update_rq_clock_task(rq, delta);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e71d8685fcc..b9d7c638ec60 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1781,7 +1781,7 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
if (rq->clock_update_flags > RQCF_ACT_SKIP)
rf->clock_update_flags = RQCF_UPDATED;
#endif
-
+ scx_rq_clock_stale(rq);
lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
}
--
2.47.1
* [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
` (2 preceding siblings ...)
2024-12-09 6:15 ` [PATCH v4 3/6] sched_ext: Manage the validity of scx_rq_clock Changwoo Min
@ 2024-12-09 6:15 ` Changwoo Min
2024-12-11 8:14 ` Tejun Heo
2024-12-11 9:32 ` Peter Zijlstra
2024-12-09 6:15 ` [PATCH v4 5/6] sched_ext: Add scx_bpf_now_ns() for BPF scheduler Changwoo Min
` (2 subsequent siblings)
6 siblings, 2 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-09 6:15 UTC (permalink / raw)
To: tj, void, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel
Returns a high-performance monotonically non-decreasing clock for the current
CPU. The clock returned is in nanoseconds.
It provides the following properties:
1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
to account for execution time and track tasks' runtime properties.
Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
eventually reads a hardware timestamp counter -- is neither performant nor
scalable. scx_bpf_now_ns() aims to provide a high-performance clock by
using the rq clock in the scheduler core whenever possible.
2) High enough resolution for the BPF scheduler use cases: In most BPF
scheduler use cases, the required clock resolution is lower than the most
accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now_ns() basically
uses the rq clock in the scheduler core whenever it is valid. It considers
that the rq clock is valid from the time the rq clock is updated
(update_rq_clock) until the rq is unlocked (rq_unpin_lock).
3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
guarantees the clock never goes backward when compared on the same
CPU. On the other hand, when comparing clocks across different CPUs,
there is no such guarantee -- the clock can go backward. The clock is
monotonically *non-decreasing*, so two scx_bpf_now_ns() calls on the
same CPU may return the same value while the same rq clock remains
valid.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
kernel/sched/ext.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 71342f3719c1..f0476d5dd6f5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7601,6 +7601,78 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
}
#endif
+/**
+ * scx_bpf_now_ns - Returns a high-performance monotonically non-decreasing
+ * clock for the current CPU. The clock returned is in nanoseconds.
+ *
+ * It provides the following properties:
+ *
+ * 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
+ * to account for execution time and track tasks' runtime properties.
+ * Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
+ * eventually reads a hardware timestamp counter -- is neither performant nor
+ * scalable. scx_bpf_now_ns() aims to provide a high-performance clock by
+ * using the rq clock in the scheduler core whenever possible.
+ *
+ * 2) High enough resolution for the BPF scheduler use cases: In most BPF
+ * scheduler use cases, the required clock resolution is lower than the most
+ * accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now_ns() basically
+ * uses the rq clock in the scheduler core whenever it is valid. It considers
+ * that the rq clock is valid from the time the rq clock is updated
+ * (update_rq_clock) until the rq is unlocked (rq_unpin_lock).
+ *
+ * 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
+ * guarantees the clock never goes backward when compared on the same
+ * CPU. On the other hand, when comparing clocks across different CPUs,
+ * there is no such guarantee -- the clock can go backward. The clock is
+ * monotonically *non-decreasing*, so two scx_bpf_now_ns() calls on the
+ * same CPU may return the same value while the same rq clock remains
+ * valid.
+ */
+__bpf_kfunc u64 scx_bpf_now_ns(void)
+{
+ struct rq *rq;
+ u64 clock;
+
+ preempt_disable();
+
+ /*
+ * If the rq clock is valid, use the cached rq clock
+ * whenever the clock does not go backward.
+ */
+ rq = this_rq();
+ clock = rq->scx.clock;
+
+ if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
+ (rq->scx.prev_clock >= clock)) {
+ /*
+ * If the rq clock is invalid or goes backward,
+ * start a new rq clock period with a fresh sched_clock_cpu().
+ *
+ * The cached rq clock can go backward because there is a
+ * race with a timer interrupt. Suppose that a timer interrupt
+ * occurred while running scx_bpf_now_ns() *after* reading the
+ * rq clock and *before* evaluating the if condition. The timer
+ * interrupt will eventually call a BPF scheduler's ops.tick(),
+ * and the BPF scheduler can call scx_bpf_now_ns(). Since the
+ * scheduler core updates the rq clock before calling
+ * ops.tick(), that scx_bpf_now_ns() call will get the fresh
+ * clock. After the timer interrupt is handled, the interrupted
+ * scx_bpf_now_ns() resumes and evaluates the if condition.
+ * In this case, the clock, which was read before
+ * the timer interrupt, will be the same as rq->scx.prev_clock.
+ * When such a case is detected, start a new rq clock period
+ * with a fresh sched_clock_cpu().
+ */
+ clock = sched_clock_cpu(cpu_of(rq));
+ scx_rq_clock_update(rq, clock);
+ }
+
+ preempt_enable();
+
+ return clock;
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -7632,6 +7704,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
#ifdef CONFIG_CGROUP_SCHED
BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
#endif
+BTF_ID_FLAGS(func, scx_bpf_now_ns)
BTF_KFUNCS_END(scx_kfunc_ids_any)
static const struct btf_kfunc_id_set scx_kfunc_set_any = {
--
2.47.1
* [PATCH v4 5/6] sched_ext: Add scx_bpf_now_ns() for BPF scheduler
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
` (3 preceding siblings ...)
2024-12-09 6:15 ` [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns() Changwoo Min
@ 2024-12-09 6:15 ` Changwoo Min
2024-12-09 6:15 ` [PATCH v4 6/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now_ns() Changwoo Min
2024-12-09 9:51 ` [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Andrea Righi
6 siblings, 0 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-09 6:15 UTC (permalink / raw)
To: tj, void, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel
scx_bpf_now_ns() is added to the header files so that BPF schedulers
can use it.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
tools/sched_ext/include/scx/common.bpf.h | 1 +
tools/sched_ext/include/scx/compat.bpf.h | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 625f5b046776..d8fd58d8c032 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -72,6 +72,7 @@ bool scx_bpf_task_running(const struct task_struct *p) __ksym;
s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym __weak;
+u64 scx_bpf_now_ns(void) __ksym __weak;
/*
* Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index d56520100a26..2f16f30ae741 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -125,6 +125,11 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
false; \
})
+#define scx_bpf_now_ns() \
+ (bpf_ksym_exists(scx_bpf_now_ns) ? \
+ scx_bpf_now_ns() : \
+ bpf_ktime_get_ns())
+
/*
* Define sched_ext_ops. This may be expanded to define multiple variants for
* backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
--
2.47.1
* [PATCH v4 6/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now_ns()
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
` (4 preceding siblings ...)
2024-12-09 6:15 ` [PATCH v4 5/6] sched_ext: Add scx_bpf_now_ns() for BPF scheduler Changwoo Min
@ 2024-12-09 6:15 ` Changwoo Min
2024-12-09 9:51 ` [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Andrea Righi
6 siblings, 0 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-09 6:15 UTC (permalink / raw)
To: tj, void, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel
In the BPF schedulers that use bpf_ktime_get_ns() -- scx_central and
scx_flatcg -- replace the bpf_ktime_get_ns() calls with scx_bpf_now_ns().
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
tools/sched_ext/scx_central.bpf.c | 4 ++--
tools/sched_ext/scx_flatcg.bpf.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c
index e6fad6211f6c..e64b7a429126 100644
--- a/tools/sched_ext/scx_central.bpf.c
+++ b/tools/sched_ext/scx_central.bpf.c
@@ -245,7 +245,7 @@ void BPF_STRUCT_OPS(central_running, struct task_struct *p)
s32 cpu = scx_bpf_task_cpu(p);
u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
if (started_at)
- *started_at = bpf_ktime_get_ns() ?: 1; /* 0 indicates idle */
+ *started_at = scx_bpf_now_ns() ?: 1; /* 0 indicates idle */
}
void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
@@ -258,7 +258,7 @@ void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
{
- u64 now = bpf_ktime_get_ns();
+ u64 now = scx_bpf_now_ns();
u64 nr_to_kick = nr_queued;
s32 i, curr_cpu;
diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index 4e3afcd260bf..568581b15c44 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -734,7 +734,7 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
struct fcg_cpu_ctx *cpuc;
struct fcg_cgrp_ctx *cgc;
struct cgroup *cgrp;
- u64 now = bpf_ktime_get_ns();
+ u64 now = scx_bpf_now_ns();
bool picked_next = false;
cpuc = find_cpu_ctx();
--
2.47.1
* Re: [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale()
2024-12-09 6:15 ` [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale() Changwoo Min
@ 2024-12-09 9:40 ` Andrea Righi
2024-12-10 7:21 ` Changwoo Min
2024-12-11 7:43 ` Tejun Heo
1 sibling, 1 reply; 18+ messages in thread
From: Andrea Righi @ 2024-12-09 9:40 UTC (permalink / raw)
To: Changwoo Min; +Cc: tj, void, mingo, peterz, changwoo, kernel-dev, linux-kernel
On Mon, Dec 09, 2024 at 03:15:27PM +0900, Changwoo Min wrote:
> scx_rq_clock_update() and scx_rq_clock_stale() manage the status of an
> rq clock when sched_ext is enabled. scx_rq_clock_update() caches the rq
> clock in memory and marks its status valid. scx_rq_clock_stale()
> invalidates the current rq clock so that the stale cached value is not used.
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
> kernel/sched/sched.h | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 440ecedf871b..7e71d8685fcc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -754,6 +754,7 @@ enum scx_rq_flags {
> SCX_RQ_BAL_PENDING = 1 << 2, /* balance hasn't run yet */
> SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
> SCX_RQ_BYPASSING = 1 << 4,
> + SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
>
> SCX_RQ_IN_WAKEUP = 1 << 16,
> SCX_RQ_IN_BALANCE = 1 << 17,
> @@ -766,9 +767,11 @@ struct scx_rq {
> unsigned long ops_qseq;
> u64 extra_enq_flags; /* see move_task_to_local_dsq() */
> u32 nr_running;
> - u32 flags;
> u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */
> bool cpu_released;
> + u32 flags;
> + u64 clock; /* current per-rq clock -- see scx_bpf_now_ns() */
> + u64 prev_clock; /* previous per-rq clock -- see scx_bpf_now_ns() */
Since we're reordering this struct, we may want to move cpu_released all
the way to the bottom to get rid of the 3-byte hole (and still have
flags, clock and prev_clock in the same cacheline).
> cpumask_var_t cpus_to_kick;
> cpumask_var_t cpus_to_kick_if_idle;
> cpumask_var_t cpus_to_preempt;
> @@ -1725,9 +1728,28 @@ DECLARE_STATIC_KEY_FALSE(__scx_switched_all); /* all fair class tasks on SCX */
>
> #define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
> #define scx_switched_all() static_branch_unlikely(&__scx_switched_all)
> +
> +static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
> +{
> + if (scx_enabled()) {
> + rq->scx.prev_clock = rq->scx.clock;
> + rq->scx.clock = clock;
> + rq->scx.flags |= SCX_RQ_CLK_VALID;
> + }
> +}
Nit, this is just personal preference (feel free to ignore it):
if (!scx_enabled())
return;
rq->scx.prev_clock = rq->scx.clock;
rq->scx.clock = clock;
rq->scx.flags |= SCX_RQ_CLK_VALID;
> +
> +static inline void scx_rq_clock_stale(struct rq *rq)
> +{
> + if (scx_enabled())
> + rq->scx.flags &= ~SCX_RQ_CLK_VALID;
> +}
I'm wondering if we need to invalidate the clock on all rqs when we call
scx_ops_enable() to prevent getting stale information from a previous
scx scheduler.
Probably it's not an issue, since scx_ops_disable_workfn() should make
sure that all tasks go through rq_unpin_lock() before unloading
the current scheduler, but maybe it would be helpful to add a comment
about this scenario in scx_bpf_now_ns() (PATCH 4/6)?
> +
> #else /* !CONFIG_SCHED_CLASS_EXT */
> #define scx_enabled() false
> #define scx_switched_all() false
> +
> +static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
> +static inline void scx_rq_clock_stale(struct rq *rq) {}
> #endif /* !CONFIG_SCHED_CLASS_EXT */
>
> /*
> --
> 2.47.1
>
Thanks,
-Andrea
* Re: [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
` (5 preceding siblings ...)
2024-12-09 6:15 ` [PATCH v4 6/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now_ns() Changwoo Min
@ 2024-12-09 9:51 ` Andrea Righi
6 siblings, 0 replies; 18+ messages in thread
From: Andrea Righi @ 2024-12-09 9:51 UTC (permalink / raw)
To: Changwoo Min; +Cc: tj, void, mingo, peterz, changwoo, kernel-dev, linux-kernel
On Mon, Dec 09, 2024 at 03:15:25PM +0900, Changwoo Min wrote:
> Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
> and scx_flash) frequently call bpf_ktime_get_ns() for tracking tasks' runtime
> properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
> timestamp counter (TSC). However, reading a hardware TSC is not
> performant on some hardware platforms, degrading IPC.
>
> This patchset addresses the performance problem of reading hardware TSC
> by leveraging the rq clock in the scheduler core, introducing a
> scx_bpf_now_ns() function for BPF schedulers. Whenever the rq clock
> is fresh and valid, scx_bpf_now_ns() provides the rq clock, which is
> already updated by the scheduler core (update_rq_clock), so it can reduce
> reading the hardware TSC.
>
> When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
> so a subsequent scx_bpf_now_ns() call gets the fresh sched_clock for the caller.
>
> In addition, scx_bpf_now_ns() guarantees the clock is monotonically
> non-decreasing for the same CPU, so the clock cannot go backward
> in the same CPU.
>
> Using scx_bpf_now_ns() reduces the number of hardware TSC reads
> by 40-70% (65% for scx_lavd, 58% for scx_bpfland, and 43% for scx_rusty)
> for the following benchmark:
>
> perf bench -f simple sched messaging -t -g 20 -l 6000
>
> The patchset begins by managing the status of the rq clock in the scheduler
> core, then implementing scx_bpf_now_ns(), and finally applying it to the
> BPF schedulers.
I left a few comments, but overall it looks good to me. I also ran some
tests with this applied and scx_bpfland modified to use the new
scx_bpf_now_ns(); no issues to report, therefore:
Acked-by: Andrea Righi <arighi@nvidia.com>
>
> ChangeLog v3 -> v4:
> - Separate the code relocation related to scx_enabled() into a
> separate patch.
> - Remove scx_rq_clock_stale() after (or before) ops.running() and
> ops.update_idle() calls
> - Rename scx_bpf_clock_get_ns() to scx_bpf_now_ns() and revise it to
> address the comments
> - Move the per-CPU variable holding a prev clock into scx_rq
> (rq->scx.prev_clock)
> - Add a comment describing when the clock could go backward in
> scx_bpf_now_ns()
> - Rebase the code to the tip of Tejun's sched_ext repo (for-next
> branch)
>
> ChangeLog v2 -> v3:
> - To avoid unnecessarily modifying cache lines, scx_rq_clock_update()
> and scx_rq_clock_stale() update the clock and flags only when a
> sched_ext scheduler is enabled.
>
> ChangeLog v1 -> v2:
> - Rename SCX_RQ_CLK_UPDATED to SCX_RQ_CLK_VALID to denote the validity
> of an rq clock clearly.
> - Rearrange the clock and flags fields in struct scx_rq to make sure
> they are in the same cacheline to minimize the cache misses
> - Add an additional explanation to the commit message in the 2/5 patch
> describing when the rq clock will be reused with an example.
> - Fix typos
> - Rebase the code to the tip of Tejun's sched_ext repo
>
> Changwoo Min (6):
> sched_ext: Relocate scx_enabled() related code
> sched_ext: Implement scx_rq_clock_update/stale()
> sched_ext: Manage the validity of scx_rq_clock
> sched_ext: Implement scx_bpf_now_ns()
> sched_ext: Add scx_bpf_now_ns() for BPF scheduler
> sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now_ns()
>
> kernel/sched/core.c | 6 +-
> kernel/sched/ext.c | 73 ++++++++++++++++++++++++
> kernel/sched/sched.h | 52 ++++++++++++-----
> tools/sched_ext/include/scx/common.bpf.h | 1 +
> tools/sched_ext/include/scx/compat.bpf.h | 5 ++
> tools/sched_ext/scx_central.bpf.c | 4 +-
> tools/sched_ext/scx_flatcg.bpf.c | 2 +-
> 7 files changed, 124 insertions(+), 19 deletions(-)
>
> --
> 2.47.1
>
-Andrea
* Re: [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale()
2024-12-09 9:40 ` Andrea Righi
@ 2024-12-10 7:21 ` Changwoo Min
0 siblings, 0 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-10 7:21 UTC (permalink / raw)
To: Andrea Righi; +Cc: tj, void, mingo, peterz, kernel-dev, linux-kernel
Hello Andrea,
Thank you for the review.
On 24. 12. 9. 18:40, Andrea Righi wrote:
>> @@ -766,9 +767,11 @@ struct scx_rq {
>> unsigned long ops_qseq;
>> u64 extra_enq_flags; /* see move_task_to_local_dsq() */
>> u32 nr_running;
>> - u32 flags;
>> u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */
>> bool cpu_released;
>> + u32 flags;
>> + u64 clock; /* current per-rq clock -- see scx_bpf_now_ns() */
>> + u64 prev_clock; /* previous per-rq clock -- see scx_bpf_now_ns() */
>
> Since we're reordering this struct, we may want to move cpu_released all
> the way to the bottom to get rid of the 3-bytes hole (and still have
> flags, clock and prev_clock in the same cacheline).
We'd better keep the layout as it is. Moving cpu_released to the
end of the struct would create a 4-byte hole between flags and
clock, plus 7 bytes of padding at the end after cpu_released.
I double-checked the two layouts using pahole.
> Nit, this is just personal preference (feel free to ignore it):
>
> if (!scx_enabled())
> return;
> rq->scx.prev_clock = rq->scx.clock;
> rq->scx.clock = clock;
> rq->scx.flags |= SCX_RQ_CLK_VALID;
>
That's prettier. I will change it as you suggested.
> I'm wondering if we need to invalidate the clock on all rqs when we call
> scx_ops_enable() to prevent getting stale information from a previous
> scx scheduler.
>
> Probably it's not an issue, since scx_ops_disable_workfn() should make
> sure that all tasks are going through rq_unpin_lock() before unloading
> the current scheduler, maybe it could be helpful to add comment about
> this scenario in scx_bpf_now_ns() (PATCH 4/6)?
That's a good catch. In theory, there is a possibility that
an scx_rq is not invalidated when unloading the sched_ext
scheduler. Since scx_ops_disable_workfn() iterates over all the
sched_ext tasks, an rq would not be invalidated if there is no
scx task on it. I will add code that iterates over and
invalidates all scx_rqs in scx_ops_disable_workfn() in the next
version.
Thank you again!
Changwoo Min
* Re: [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code
2024-12-09 6:15 ` [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
@ 2024-12-11 7:27 ` Tejun Heo
2024-12-11 7:37 ` Tejun Heo
0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2024-12-11 7:27 UTC (permalink / raw)
To: Changwoo Min; +Cc: void, mingo, peterz, changwoo, kernel-dev, linux-kernel
On Mon, Dec 09, 2024 at 03:15:26PM +0900, Changwoo Min wrote:
> scx_enabled() will be used in scx_rq_clock_update/stale() in the
> following patch, so relocate the scx_enabled() related code to the
> proper location.
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
> kernel/sched/sched.h | 26 +++++++++++++-------------
> 1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 76f5f53a645f..440ecedf871b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1717,6 +1717,19 @@ struct rq_flags {
>
> extern struct balance_callback balance_push_callback;
>
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +extern const struct sched_class ext_sched_class;
> +
> +DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled); /* SCX BPF scheduler loaded */
> +DECLARE_STATIC_KEY_FALSE(__scx_switched_all); /* all fair class tasks on SCX */
> +
> +#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled)
> +#define scx_switched_all() static_branch_unlikely(&__scx_switched_all)
> +#else /* !CONFIG_SCHED_CLASS_EXT */
> +#define scx_enabled() false
> +#define scx_switched_all() false
> +#endif /* !CONFIG_SCHED_CLASS_EXT */
I wonder whether a better place for the above is include/linux/sched/ext.
Thanks.
--
tejun
* Re: [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code
2024-12-11 7:27 ` Tejun Heo
@ 2024-12-11 7:37 ` Tejun Heo
0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2024-12-11 7:37 UTC (permalink / raw)
To: Changwoo Min; +Cc: void, mingo, peterz, changwoo, kernel-dev, linux-kernel
On Tue, Dec 10, 2024 at 09:27:53PM -1000, Tejun Heo wrote:
> I wonder whether a better place for the above is include/linux/sched/ext.
Please ignore. Stuff added by the later patches would be weird to expose
there. Might as well keep them together.
Thanks.
--
tejun
* Re: [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale()
2024-12-09 6:15 ` [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale() Changwoo Min
2024-12-09 9:40 ` Andrea Righi
@ 2024-12-11 7:43 ` Tejun Heo
2024-12-13 1:16 ` Changwoo Min
1 sibling, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2024-12-11 7:43 UTC (permalink / raw)
To: Changwoo Min; +Cc: void, mingo, peterz, changwoo, kernel-dev, linux-kernel
On Mon, Dec 09, 2024 at 03:15:27PM +0900, Changwoo Min wrote:
...
> +static inline void scx_rq_clock_stale(struct rq *rq)
Would scx_rq_clock_expire() or scx_rq_clock_invalidate() be a better name?
Also, I'd roll this patch into the next one.
Thanks.
--
tejun
* Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
2024-12-09 6:15 ` [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns() Changwoo Min
@ 2024-12-11 8:14 ` Tejun Heo
2024-12-13 1:41 ` Changwoo Min
2024-12-11 9:32 ` Peter Zijlstra
1 sibling, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2024-12-11 8:14 UTC (permalink / raw)
To: Changwoo Min; +Cc: void, mingo, peterz, changwoo, kernel-dev, linux-kernel
Hello,
I'd roll the preceding two patches into this one.
On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:
...
> 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
> guarantees the clock never goes backward when comparing them in the same
> CPU. On the other hand, when comparing clocks in different CPUs, there
> is no such guarantee -- the clock can go backward. It provides a
> monotonically *non-decreasing* clock so that it would provide the same
> clock values in two different scx_bpf_now_ns() calls in the same CPU
> during the same period of when the rq clock is valid.
We probably should provide helpers to calculate deltas between timestamps
and use them consistently in SCX scheds. e.g. ops.runnable() and
ops.running() can run on different CPUs and it'd be useful and common to
calculate the delta between the two points in time.
...
> +__bpf_kfunc u64 scx_bpf_now_ns(void)
> +{
> + struct rq *rq;
> + u64 clock;
> +
> + preempt_disable();
> +
> + /*
> + * If the rq clock is valid, use the cached rq clock
> + * whenever the clock does not go backward.
> + */
> + rq = this_rq();
> + clock = rq->scx.clock;
> +
> + if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
> + (rq->scx.prev_clock >= clock)) {
The clocks usually start at zero but it'd still be a good idea to use
time_after64() and friends when comparing the ordering between timestamps.
> + /*
> + * If the rq clock is invalid or goes backward,
> + * start a new rq clock period with a fresh sched_clock_cpu().
> + *
> + * The cached rq clock can go backward because there is a
> + * race with a timer interrupt. Suppose that a timer interrupt
This is not limited to timer interrupts, right? This kfunc can be called
from anywhere including tracepoints for code running in IRQ.
> + * occurred while running scx_bpf_now_ns() *after* reading the
> + * rq clock and *before* comparing the if condition. The timer
> + * interrupt will eventually call a BPF scheduler's ops.tick(),
> + * and the BPF scheduler can call scx_bpf_now_ns(). Since the
> + * scheduler core updates the rq clock before calling
> + * ops.tick(), the scx_bpf_now_ns() call will get the fresh
> + * clock. After handling the timer interrupt, the interrupted
This might be easier to explain with a two-column table explaining what each
party is doing in what order.
> + * scx_bpf_now_ns() will be resumed, so the if condition will
> + * be compared. In this case, the clock, which was read before
> + * the timer interrupt, will be the same as rq->scx.prev_clock.
> + * When such a case is detected, start a new rq clock period
> + * with a fresh sched_clock_cpu().
> + */
> + clock = sched_clock_cpu(cpu_of(rq));
> + scx_rq_clock_update(rq, clock);
Hmmm... what happens if e.g. a timer ends up performing multiple operations
each going through rq pin/unpin?
Thanks.
--
tejun
* Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
2024-12-09 6:15 ` [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns() Changwoo Min
2024-12-11 8:14 ` Tejun Heo
@ 2024-12-11 9:32 ` Peter Zijlstra
2024-12-13 2:01 ` Changwoo Min
1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2024-12-11 9:32 UTC (permalink / raw)
To: Changwoo Min; +Cc: tj, void, mingo, changwoo, kernel-dev, linux-kernel
On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:
> +__bpf_kfunc u64 scx_bpf_now_ns(void)
> +{
> + struct rq *rq;
> + u64 clock;
> +
> + preempt_disable();
> +
> + /*
> + * If the rq clock is valid, use the cached rq clock
> + * whenever the clock does not go backward.
> + */
> + rq = this_rq();
> + clock = rq->scx.clock;
> +
> + if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
> + (rq->scx.prev_clock >= clock)) {
As TJ said, it's best to consider that the clock can wrap.
> + /*
> + * If the rq clock is invalid or goes backward,
> + * start a new rq clock period with a fresh sched_clock_cpu().
> + *
> + * The cached rq clock can go backward because there is a
> + * race with a timer interrupt. Suppose that a timer interrupt
> + * occurred while running scx_bpf_now_ns() *after* reading the
> + * rq clock and *before* comparing the if condition. The timer
> + * interrupt will eventually call a BPF scheduler's ops.tick(),
> + * and the BPF scheduler can call scx_bpf_now_ns(). Since the
> + * scheduler core updates the rq clock before calling
> + * ops.tick(), the scx_bpf_now_ns() call will get the fresh
> + * clock. After handling the timer interrupt, the interrupted
> + * scx_bpf_now_ns() will be resumed, so the if condition will
> + * be compared. In this case, the clock, which was read before
> + * the timer interrupt, will be the same as rq->scx.prev_clock.
> + * When such a case is detected, start a new rq clock period
> + * with a fresh sched_clock_cpu().
This has a wall-of-text problem; use paragraphs?
> + */
> + clock = sched_clock_cpu(cpu_of(rq));
> + scx_rq_clock_update(rq, clock);
Doesn't this set the VALID bit again? How is using this outside of
RQ-lock and setting VALID a good idea?
> + }
> +
> + preempt_enable();
> +
> + return clock;
> +}
> +
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(scx_kfunc_ids_any)
> @@ -7632,6 +7704,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
> #ifdef CONFIG_CGROUP_SCHED
> BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
> #endif
> +BTF_ID_FLAGS(func, scx_bpf_now_ns)
> BTF_KFUNCS_END(scx_kfunc_ids_any)
>
> static const struct btf_kfunc_id_set scx_kfunc_set_any = {
> --
> 2.47.1
>
* Re: [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale()
2024-12-11 7:43 ` Tejun Heo
@ 2024-12-13 1:16 ` Changwoo Min
0 siblings, 0 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-13 1:16 UTC (permalink / raw)
To: Tejun Heo, Changwoo Min; +Cc: void, mingo, peterz, kernel-dev, linux-kernel
Hello,
On 24. 12. 11. 16:43, Tejun Heo wrote:
> On Mon, Dec 09, 2024 at 03:15:27PM +0900, Changwoo Min wrote:
> ...
>> +static inline void scx_rq_clock_stale(struct rq *rq)
>
> Would scx_rq_clock_expire() or scx_rq_clock_invalidate() be a better name?
> Also, I'd roll this patch into the next one.
Thank you for the suggestion. I will change it to
scx_rq_clock_invalidate() in the next version.
Regards,
Changwoo Min
* Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
2024-12-11 8:14 ` Tejun Heo
@ 2024-12-13 1:41 ` Changwoo Min
0 siblings, 0 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-13 1:41 UTC (permalink / raw)
To: Tejun Heo; +Cc: void, mingo, peterz, kernel-dev, linux-kernel
Hello,
On 24. 12. 11. 17:14, Tejun Heo wrote:
> Hello,
>
> I'd roll the preceding two patches into this one.
Sure. I will merge patches 2, 3, 4 into one.
> On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:
> ...
>> 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
>> guarantees the clock never goes backward when comparing them in the same
>> CPU. On the other hand, when comparing clocks in different CPUs, there
>> is no such guarantee -- the clock can go backward. It provides a
>> monotonically *non-decreasing* clock so that it would provide the same
>> clock values in two different scx_bpf_now_ns() calls in the same CPU
>> during the same period of when the rq clock is valid.
>
> We probably should provide helpers to calculate deltas between timestamps
> and use them consistently in SCX scheds. e.g. ops.runnable() and
> ops.running() can run on different CPUs and it'd be useful and common to
> calculate the delta between the two points in time.
If I understand correctly, it should be something similar to
jiffies_delta_to_msecs(). Regarding the API name, what about
scx_time_delta(s64 time_delta) and/or scx_time_diff(u64 time_a,
u64 time_b)?
>> + if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
>> + (rq->scx.prev_clock >= clock)) {
>
> The clocks usually start at zero but it'd still be a good idea to use
> time_after64() and friends when comparing the ordering between timestamps.
Sure. I will update the code as suggested.
>
>> + /*
>> + * If the rq clock is invalid or goes backward,
>> + * start a new rq clock period with a fresh sched_clock_cpu().
>> + *
>> + * The cached rq clock can go backward because there is a
>> + * race with a timer interrupt. Suppose that a timer interrupt
>
> This is not limited to timer interrupts, right? This kfunc can be called
> from anywhere including tracepoints for code running in IRQ.
Yup, you are right. I will update the comments.
>
>> + * occurred while running scx_bpf_now_ns() *after* reading the
>> + * rq clock and *before* comparing the if condition. The timer
>> + * interrupt will eventually call a BPF scheduler's ops.tick(),
>> + * and the BPF scheduler can call scx_bpf_now_ns(). Since the
>> + * scheduler core updates the rq clock before calling
>> + * ops.tick(), the scx_bpf_now_ns() call will get the fresh
>> + * clock. After handling the timer interrupt, the interrupted
>
> This might be easier to explain with a two-column table explaining what each
> party is doing in what order.
I will beautify the text for readability.
>
>> + * scx_bpf_now_ns() will be resumed, so the if condition will
>> + * be compared. In this case, the clock, which was read before
>> + * the timer interrupt, will be the same as rq->scx.prev_clock.
>> + * When such a case is detected, start a new rq clock period
>> + * with a fresh sched_clock_cpu().
>> + */
>> + clock = sched_clock_cpu(cpu_of(rq));
>> + scx_rq_clock_update(rq, clock);
>
> Hmmm... what happens if e.g. a timer ends up performing multiple operations
> each going through rq pin/unpin?
That should be okay. After multiple rq pin/unpin operations, the
resumed scx_bpf_now_ns() will find that prev_clock is
greater than (not equal to) the current clock, so it will get a
fresh clock.
Thanks!
Changwoo Min
* Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
2024-12-11 9:32 ` Peter Zijlstra
@ 2024-12-13 2:01 ` Changwoo Min
0 siblings, 0 replies; 18+ messages in thread
From: Changwoo Min @ 2024-12-13 2:01 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: tj, void, mingo, kernel-dev, linux-kernel
Hello,
Thank you for the review!
On 24. 12. 11. 18:32, Peter Zijlstra wrote:
> On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:
>> + if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
>> + (rq->scx.prev_clock >= clock)) {
>
> As TJ said, it's best to consider that the clock can wrap.
I will update it as Tejun suggested.
>
>> + /*
>> + * If the rq clock is invalid or goes backward,
>> + * start a new rq clock period with a fresh sched_clock_cpu().
>> + *
>> + * The cached rq clock can go backward because there is a
>> + * race with a timer interrupt. Suppose that a timer interrupt
>> + * occurred while running scx_bpf_now_ns() *after* reading the
>> + * rq clock and *before* comparing the if condition. The timer
>> + * interrupt will eventually call a BPF scheduler's ops.tick(),
>> + * and the BPF scheduler can call scx_bpf_now_ns(). Since the
>> + * scheduler core updates the rq clock before calling
>> + * ops.tick(), the scx_bpf_now_ns() call will get the fresh
>> + * clock. After handling the timer interrupt, the interrupted
>> + * scx_bpf_now_ns() will be resumed, so the if condition will
>> + * be compared. In this case, the clock, which was read before
>> + * the timer interrupt, will be the same as rq->scx.prev_clock.
>> + * When such a case is detected, start a new rq clock period
>> + * with a fresh sched_clock_cpu().
>
> This has a wall-of-text problem; use paragraphs?
I will improve the presentation using multiple paragraphs
and a time chart.
>> + clock = sched_clock_cpu(cpu_of(rq));
>> + scx_rq_clock_update(rq, clock);
> Doesn't this set the VALID bit again? How is using this outside of
> RQ-lock and setting VALID a good idea?
You are right. The current implementation sets the VALID bit, so
the clock can be reused until the next update_rq_clock(). Another
approach would be not to set the VALID flag, so a fresh clock is
read every time until the next update_rq_clock(). Considering
the clock usage of the scx schedulers, both would be almost the
same in the number of sched_clock_cpu() calls. But the second
approach -- not setting the VALID flag outside of the rq lock --
would be more predictable. I will double-check the difference in
sched_clock_cpu() calls, and if they are similar, I will change
the code to not set the VALID flag.
Regards,
Changwoo Min
end of thread, other threads:[~2024-12-13 2:02 UTC | newest]
Thread overview: 18+ messages
2024-12-09 6:15 [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
2024-12-09 6:15 ` [PATCH v4 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
2024-12-11 7:27 ` Tejun Heo
2024-12-11 7:37 ` Tejun Heo
2024-12-09 6:15 ` [PATCH v4 2/6] sched_ext: Implement scx_rq_clock_update/stale() Changwoo Min
2024-12-09 9:40 ` Andrea Righi
2024-12-10 7:21 ` Changwoo Min
2024-12-11 7:43 ` Tejun Heo
2024-12-13 1:16 ` Changwoo Min
2024-12-09 6:15 ` [PATCH v4 3/6] sched_ext: Manage the validity of scx_rq_clock Changwoo Min
2024-12-09 6:15 ` [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns() Changwoo Min
2024-12-11 8:14 ` Tejun Heo
2024-12-13 1:41 ` Changwoo Min
2024-12-11 9:32 ` Peter Zijlstra
2024-12-13 2:01 ` Changwoo Min
2024-12-09 6:15 ` [PATCH v4 5/6] sched_ext: Add scx_bpf_now_ns() for BPF scheduler Changwoo Min
2024-12-09 6:15 ` [PATCH v4 6/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now_ns() Changwoo Min
2024-12-09 9:51 ` [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Andrea Righi