public inbox for linux-kernel@vger.kernel.org
* [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
@ 2024-12-30  9:56 Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Changwoo Min @ 2024-12-30  9:56 UTC (permalink / raw)
  To: tj, void, arighi, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel

Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
and scx_flash) frequently call bpf_ktime_get_ns() to track tasks' runtime
properties. Where supported, bpf_ktime_get_ns() eventually reads a hardware
timestamp counter (TSC). However, reading a hardware TSC is not
performant on some hardware platforms, degrading IPC.

This patchset addresses the performance problem of reading the hardware
TSC by leveraging the rq clock in the scheduler core, introducing a
scx_bpf_now() function for BPF schedulers. Whenever the rq clock is
fresh and valid, scx_bpf_now() returns the rq clock, which has already
been updated by the scheduler core (update_rq_clock), so it reduces the
number of hardware TSC reads.

When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
so a subsequent scx_bpf_now() call gets a fresh sched_clock value for the
caller.

In addition, scx_bpf_now() guarantees the clock is monotonically
non-decreasing on the same CPU, so the clock cannot go backward on the
same CPU.

Using scx_bpf_now() reduces the number of hardware TSC reads
by 50-80% (76% for scx_lavd, 82% for scx_bpfland, and 51% for scx_rusty)
for the following benchmark:

    perf bench -f simple sched messaging -t -g 20 -l 6000
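
For illustration, below is a minimal sketch (not part of this series; the
callback names, the task-storage map, and the started_at/runtime_ns fields
are hypothetical) of how a BPF scheduler might account per-task runtime
with scx_bpf_now() instead of bpf_ktime_get_ns():

    struct task_ctx {
            u64 started_at;
            u64 runtime_ns;
    };

    struct {
            __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, struct task_ctx);
    } task_ctx_stor SEC(".maps");

    /* ops.running(): called with the rq lock held, so the cached rq clock is used */
    void BPF_STRUCT_OPS(example_running, struct task_struct *p)
    {
            struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);

            if (tctx)
                    tctx->started_at = scx_bpf_now();
    }

    /* ops.stopping(): accumulate the time the task just spent on the CPU */
    void BPF_STRUCT_OPS(example_stopping, struct task_struct *p, bool runnable)
    {
            struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);

            if (tctx)
                    tctx->runtime_ns += scx_bpf_now() - tctx->started_at;
    }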

ChangeLog v6 -> v7:
  - Do not update rq->scx.clock when the rq clock is invalidated.
  - Rename scx_bpf_now_ns() to scx_bpf_now().
  - Rename vtime_*() helper functions to time_*().
  - Fix a typo.

ChangeLog v5 -> v6:
  - Drop prev_clock because a race between a process context and an interrupt
    context (e.g., timer interrupt) is not observable from a caller (discussed
    with Tejun offline), so it is not necessary to check prev_clock.

ChangeLog v4 -> v5:
  - Merge patches 2, 3, and 4 into one for readability.
  - Use a time helper (time_after_eq64) for time comparison in scx_bpf_now_ns().
  - Do not validate the rq clock outside the rq critical section for
    more predictable behavior.
  - Improve the comment at scx_bpf_now_ns() for readability.
  - Rename scx_rq_clock_stale() to scx_rq_clock_invalidate().
  - Invalidate all the rq clocks upon unloading to prevent getting outdated
    rq clocks from a previous scx scheduler.
  - Use READ/WRITE_ONCE() when accessing rq->scx.{clock, prev_clock, flags}
    to properly handle concurrent accesses from an interrupt context.
  - Add time helpers for BPF schedulers.
  - Update the rdtsc reduction numbers in the cover letter with the latest
    scx_scheds (1.0.7) and the updated scx_bpf_now_ns().

ChangeLog v3 -> v4:
  - Separate the code relocation related to scx_enabled() into a
    separate patch.
  - Remove scx_rq_clock_stale() after (or before) ops.running() and
    ops.update_idle() calls
  - Rename scx_bpf_clock_get_ns() into scx_bpf_now_ns() and revise it to
    address the comments
  - Move the per-CPU variable holding a prev clock into scx_rq
    (rq->scx.prev_clock)
  - Add a comment describing when the clock could go backward in
    scx_bpf_now_ns()
  - Rebase the code to the tip of Tejun's sched_ext repo (for-next
    branch)

ChangeLog v2 -> v3:
  - To avoid unnecessarily modifying cache lines, scx_rq_clock_update()
    and scx_rq_clock_stale() update the clock and flags only when a
    sched_ext scheduler is enabled.

ChangeLog v1 -> v2:
  - Rename SCX_RQ_CLK_UPDATED to SCX_RQ_CLK_VALID to denote the validity
    of an rq clock clearly.
  - Rearrange the clock and flags fields in struct scx_rq to make sure
    they are in the same cacheline to minimize cache misses.
  - Add an additional explanation to the commit message in the 2/5 patch
    describing when the rq clock will be reused with an example.
  - Fix typos
  - Rebase the code to the tip of Tejun's sched_ext repo

Changwoo Min (6):
  sched_ext: Relocate scx_enabled() related code
  sched_ext: Implement scx_bpf_now()
  sched_ext: Add scx_bpf_now() for BPF scheduler
  sched_ext: Add time helpers for BPF schedulers
  sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
  sched_ext: Use time helpers in BPF schedulers

 kernel/sched/core.c                      |  6 +-
 kernel/sched/ext.c                       | 74 +++++++++++++++++-
 kernel/sched/sched.h                     | 51 +++++++++----
 tools/sched_ext/include/scx/common.bpf.h | 95 ++++++++++++++++++++++++
 tools/sched_ext/include/scx/compat.bpf.h |  5 ++
 tools/sched_ext/scx_central.bpf.c        | 11 +--
 tools/sched_ext/scx_flatcg.bpf.c         | 23 +++---
 tools/sched_ext/scx_simple.bpf.c         |  9 +--
 8 files changed, 228 insertions(+), 46 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v7 1/6] sched_ext: Relocate scx_enabled() related code
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
@ 2024-12-30  9:56 ` Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 2/6] sched_ext: Implement scx_bpf_now() Changwoo Min
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Changwoo Min @ 2024-12-30  9:56 UTC (permalink / raw)
  To: tj, void, arighi, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel

scx_enabled() will be used in scx_rq_clock_update/invalidate()
in the following patch, so relocate the scx_enabled() related code
earlier in sched.h, before its new users.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/sched.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..440ecedf871b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1717,6 +1717,19 @@ struct rq_flags {
 
 extern struct balance_callback balance_push_callback;
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+extern const struct sched_class ext_sched_class;
+
+DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);	/* SCX BPF scheduler loaded */
+DECLARE_STATIC_KEY_FALSE(__scx_switched_all);	/* all fair class tasks on SCX */
+
+#define scx_enabled()		static_branch_unlikely(&__scx_ops_enabled)
+#define scx_switched_all()	static_branch_unlikely(&__scx_switched_all)
+#else /* !CONFIG_SCHED_CLASS_EXT */
+#define scx_enabled()		false
+#define scx_switched_all()	false
+#endif /* !CONFIG_SCHED_CLASS_EXT */
+
 /*
  * Lockdep annotation that avoids accidental unlocks; it's like a
  * sticky/continuous lockdep_assert_held().
@@ -2505,19 +2518,6 @@ extern const struct sched_class rt_sched_class;
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
 
-#ifdef CONFIG_SCHED_CLASS_EXT
-extern const struct sched_class ext_sched_class;
-
-DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);	/* SCX BPF scheduler loaded */
-DECLARE_STATIC_KEY_FALSE(__scx_switched_all);	/* all fair class tasks on SCX */
-
-#define scx_enabled()		static_branch_unlikely(&__scx_ops_enabled)
-#define scx_switched_all()	static_branch_unlikely(&__scx_switched_all)
-#else /* !CONFIG_SCHED_CLASS_EXT */
-#define scx_enabled()		false
-#define scx_switched_all()	false
-#endif /* !CONFIG_SCHED_CLASS_EXT */
-
 /*
  * Iterate only active classes. SCX can take over all fair tasks or be
  * completely disabled. If the former, skip fair. If the latter, skip SCX.
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 2/6] sched_ext: Implement scx_bpf_now()
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
@ 2024-12-30  9:56 ` Changwoo Min
  2025-01-08  8:50   ` Peter Zijlstra
  2024-12-30  9:56 ` [PATCH v7 3/6] sched_ext: Add scx_bpf_now() for BPF scheduler Changwoo Min
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Changwoo Min @ 2024-12-30  9:56 UTC (permalink / raw)
  To: tj, void, arighi, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel

Returns a high-performance monotonically non-decreasing clock for the current
CPU. The clock returned is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
 to account for execution time and track tasks' runtime properties.
 Unfortunately, on some hardware platforms, bpf_ktime_get_ns() -- which
 eventually reads a hardware timestamp counter -- is neither performant nor
 scalable. scx_bpf_now() aims to provide a high-performance clock by
 using the rq clock in the scheduler core whenever possible.

2) High enough resolution for the BPF scheduler use cases: In most BPF
 scheduler use cases, the required clock resolution is lower than the most
 accurate hardware clock (e.g., rdtsc on x86). scx_bpf_now() basically
 uses the rq clock in the scheduler core whenever it is valid. It considers
 that the rq clock is valid from the time the rq clock is updated
 (update_rq_clock) until the rq is unlocked (rq_unpin_lock).

3) Monotonically non-decreasing clock on the same CPU: scx_bpf_now()
 guarantees the clock never goes backward when comparing values read on
 the same CPU. When comparing clocks read on different CPUs, there is no
 such guarantee -- the clock can go backward. Because the clock is
 monotonically *non-decreasing* rather than strictly increasing, two
 scx_bpf_now() calls on the same CPU can return the same value while the
 same rq clock remains valid.

An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock().

Let's suppose the following timeline in the scheduler core:

   T1. rq_lock(rq)
   T2. update_rq_clock(rq)
   T3. a sched_ext BPF operation
   T4. rq_unlock(rq)
   T5. a sched_ext BPF operation
   T6. rq_lock(rq)
   T7. update_rq_clock(rq)

For [T2, T4), we consider the rq clock valid (SCX_RQ_CLK_VALID is set),
so scx_bpf_now() calls during [T2, T4) (including T3) return the rq
clock updated at T2. For the duration [T4, T7), during which a BPF
scheduler can still call scx_bpf_now() (T5), we consider the rq clock
invalid (SCX_RQ_CLK_VALID is cleared at T4), so a scx_bpf_now() call
at T5 returns a fresh clock value by calling sched_clock_cpu()
internally. Also, to prevent returning outdated rq clocks from a
previously loaded scx scheduler, all the rq clocks are invalidated when
a BPF scheduler is unloaded.

One example of calling scx_bpf_now() while the rq clock is invalid
(like T5) is in scx_central [1]. The scx_central scheduler uses a BPF
timer for preemptive scheduling. Every msec, the timer callback checks
whether the currently running tasks have exceeded their timeslice. At the
beginning of the BPF timer callback (central_timerfn in scx_central.bpf.c),
scx_central reads the current time. When the BPF timer callback runs, the
rq clock can be invalid, just as at T5. In this case, scx_bpf_now()
returns a fresh clock value rather than the old one (T2).

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c
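
For illustration, a simplified, hypothetical sketch in the spirit of
central_timerfn (not the actual scx_central code; only the scx_bpf_now()
call itself comes from this patch) could look like:

   static int example_timerfn(void *map, int *key, struct bpf_timer *timer)
   {
           /*
            * The callback runs outside any rq-lock critical section, so
            * SCX_RQ_CLK_VALID is typically clear and scx_bpf_now() falls
            * back to sched_clock_cpu() for a fresh value.
            */
           u64 now = scx_bpf_now();

           /* ... compare 'now' against each CPU's start timestamp ... */

           bpf_timer_start(timer, 1000000 /* 1 msec */, 0);
           return 0;
   }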

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/core.c  |  6 +++-
 kernel/sched/ext.c   | 74 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h | 25 +++++++++++++--
 3 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a519..ab8015c8cab4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -789,6 +789,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
+	u64 clock;
 
 	lockdep_assert_rq_held(rq);
 
@@ -800,11 +801,14 @@ void update_rq_clock(struct rq *rq)
 		SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
 	rq->clock_update_flags |= RQCF_UPDATED;
 #endif
+	clock = sched_clock_cpu(cpu_of(rq));
+	scx_rq_clock_update(rq, clock);
 
-	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+	delta = clock - rq->clock;
 	if (delta < 0)
 		return;
 	rq->clock += delta;
+
 	update_rq_clock_task(rq, delta);
 }
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 16de619ff2f9..2a830c8a00bd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4913,7 +4913,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
 	struct task_struct *p;
 	struct rhashtable_iter rht_iter;
 	struct scx_dispatch_q *dsq;
-	int i, kind;
+	int i, kind, cpu;
 
 	kind = atomic_read(&scx_exit_kind);
 	while (true) {
@@ -4996,6 +4996,15 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
 	scx_task_iter_stop(&sti);
 	percpu_up_write(&scx_fork_rwsem);
 
+	/*
+	 * Invalidate all the rq clocks to prevent getting outdated
+	 * rq clocks from a previous scx scheduler.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+		scx_rq_clock_invalidate(rq);
+	}
+
 	/* no task is on scx, turn off all the switches and flush in-progress calls */
 	static_branch_disable(&__scx_ops_enabled);
 	for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
@@ -7603,6 +7612,68 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
 }
 #endif
 
+/**
+ * scx_bpf_now - Returns a high-performance monotonically non-decreasing
+ * clock for the current CPU. The clock returned is in nanoseconds.
+ *
+ * It provides the following properties:
+ *
+ * 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
+ *  to account for execution time and track tasks' runtime properties.
+ *  Unfortunately, on some hardware platforms, bpf_ktime_get_ns() -- which
+ *  eventually reads a hardware timestamp counter -- is neither performant nor
+ *  scalable. scx_bpf_now() aims to provide a high-performance clock by
+ *  using the rq clock in the scheduler core whenever possible.
+ *
+ * 2) High enough resolution for the BPF scheduler use cases: In most BPF
+ *  scheduler use cases, the required clock resolution is lower than the most
+ *  accurate hardware clock (e.g., rdtsc on x86). scx_bpf_now() basically
+ *  uses the rq clock in the scheduler core whenever it is valid. It considers
+ *  that the rq clock is valid from the time the rq clock is updated
+ *  (update_rq_clock) until the rq is unlocked (rq_unpin_lock).
+ *
+ * 3) Monotonically non-decreasing clock on the same CPU: scx_bpf_now()
+ *  guarantees the clock never goes backward when comparing values read on
+ *  the same CPU. When comparing clocks read on different CPUs, there is no
+ *  such guarantee -- the clock can go backward. Because the clock is
+ *  monotonically *non-decreasing* rather than strictly increasing, two
+ *  scx_bpf_now() calls on the same CPU can return the same value while the
+ *  same rq clock remains valid.
+ */
+__bpf_kfunc u64 scx_bpf_now(void)
+{
+	struct rq *rq;
+	u64 clock;
+
+	preempt_disable();
+
+	rq = this_rq();
+	if (READ_ONCE(rq->scx.flags) & SCX_RQ_CLK_VALID) {
+		/*
+		 * If the rq clock is valid, use the cached rq clock.
+		 *
+		 * Note that scx_bpf_now() is re-entrant between a process
+		 * context and an interrupt context (e.g., timer interrupt).
+		 * However, we don't need to consider the race between them
+		 * because such race is not observable from a caller.
+		 */
+		clock = READ_ONCE(rq->scx.clock);
+	} else {
+		/*
+		 * Otherwise, return a fresh rq clock.
+		 *
+		 * The rq clock is updated outside of the rq lock.
+		 * In this case, keep the updated rq clock invalid so the next
+		 * kfunc call outside the rq lock gets a fresh rq clock.
+		 */
+		clock = sched_clock_cpu(cpu_of(rq));
+	}
+
+	preempt_enable();
+
+	return clock;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -7634,6 +7705,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
 #ifdef CONFIG_CGROUP_SCHED
 BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
 #endif
+BTF_ID_FLAGS(func, scx_bpf_now)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 440ecedf871b..e40fdc3e5f22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -754,6 +754,7 @@ enum scx_rq_flags {
 	SCX_RQ_BAL_PENDING	= 1 << 2, /* balance hasn't run yet */
 	SCX_RQ_BAL_KEEP		= 1 << 3, /* balance decided to keep current */
 	SCX_RQ_BYPASSING	= 1 << 4,
+	SCX_RQ_CLK_VALID	= 1 << 5, /* RQ clock is fresh and valid */
 
 	SCX_RQ_IN_WAKEUP	= 1 << 16,
 	SCX_RQ_IN_BALANCE	= 1 << 17,
@@ -766,9 +767,10 @@ struct scx_rq {
 	unsigned long		ops_qseq;
 	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
 	u32			nr_running;
-	u32			flags;
 	u32			cpuperf_target;		/* [0, SCHED_CAPACITY_SCALE] */
 	bool			cpu_released;
+	u32			flags;
+	u64			clock;			/* current per-rq clock -- see scx_bpf_now() */
 	cpumask_var_t		cpus_to_kick;
 	cpumask_var_t		cpus_to_kick_if_idle;
 	cpumask_var_t		cpus_to_preempt;
@@ -1725,9 +1727,28 @@ DECLARE_STATIC_KEY_FALSE(__scx_switched_all);	/* all fair class tasks on SCX */
 
 #define scx_enabled()		static_branch_unlikely(&__scx_ops_enabled)
 #define scx_switched_all()	static_branch_unlikely(&__scx_switched_all)
+
+static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
+{
+	if (!scx_enabled())
+		return;
+	WRITE_ONCE(rq->scx.clock, clock);
+	WRITE_ONCE(rq->scx.flags, rq->scx.flags | SCX_RQ_CLK_VALID);
+}
+
+static inline void scx_rq_clock_invalidate(struct rq *rq)
+{
+	if (!scx_enabled())
+		return;
+	WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_VALID);
+}
+
 #else /* !CONFIG_SCHED_CLASS_EXT */
 #define scx_enabled()		false
 #define scx_switched_all()	false
+
+static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
+static inline void scx_rq_clock_invalidate(struct rq *rq) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 
 /*
@@ -1759,7 +1780,7 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 	if (rq->clock_update_flags > RQCF_ACT_SKIP)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
-
+	scx_rq_clock_invalidate(rq);
 	lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 3/6] sched_ext: Add scx_bpf_now() for BPF scheduler
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 2/6] sched_ext: Implement scx_bpf_now() Changwoo Min
@ 2024-12-30  9:56 ` Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 4/6] sched_ext: Add time helpers for BPF schedulers Changwoo Min
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Changwoo Min @ 2024-12-30  9:56 UTC (permalink / raw)
  To: tj, void, arighi, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel

Add scx_bpf_now() to the header files so that BPF schedulers can
use it.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
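[ Hedged usage note, not part of the patch: with the compat macro added
  to compat.bpf.h below, scheduler code can call scx_bpf_now()
  unconditionally; on kernels that lack the kfunc, the macro falls back
  to bpf_ktime_get_ns(). The ops.tick() callback below is a hypothetical
  sketch of such a call site.

    void BPF_STRUCT_OPS(example_tick, struct task_struct *p)
    {
            /* resolves to the kfunc when present, else bpf_ktime_get_ns() */
            u64 now = scx_bpf_now();

            bpf_printk("pid %d tick at %llu ns", p->pid, now);
    }
]
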
 tools/sched_ext/include/scx/common.bpf.h | 1 +
 tools/sched_ext/include/scx/compat.bpf.h | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 858ba1f438f6..5c9517190713 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -76,6 +76,7 @@ bool scx_bpf_task_running(const struct task_struct *p) __ksym;
 s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
 struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
 struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym __weak;
+u64 scx_bpf_now(void) __ksym __weak;
 
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index d56520100a26..50e1499ae093 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -125,6 +125,11 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
 	false;									\
 })
 
+#define scx_bpf_now()								\
+	(bpf_ksym_exists(scx_bpf_now) ?						\
+	 scx_bpf_now() :							\
+	 bpf_ktime_get_ns())
+
 /*
  * Define sched_ext_ops. This may be expanded to define multiple variants for
  * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 4/6] sched_ext: Add time helpers for BPF schedulers
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
                   ` (2 preceding siblings ...)
  2024-12-30  9:56 ` [PATCH v7 3/6] sched_ext: Add scx_bpf_now() for BPF scheduler Changwoo Min
@ 2024-12-30  9:56 ` Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 5/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now() Changwoo Min
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Changwoo Min @ 2024-12-30  9:56 UTC (permalink / raw)
  To: tj, void, arighi, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel

The following functions are added for BPF schedulers:
- time_delta(after, before)
- time_after(a, b)
- time_before(a, b)
- time_after_eq(a, b)
- time_before_eq(a, b)
- time_in_range(a, b, c)
- time_in_range_open(a, b, c)
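
A minimal usage sketch (hypothetical helper; started_at and slice_ns are
illustrative and not part of this patch) of the wraparound-safe comparison
these helpers provide:

    /* true once the slice that began at @started_at has been used up */
    static bool slice_expired(u64 started_at, u64 slice_ns)
    {
            u64 now = scx_bpf_now();

            return time_after_eq(now, started_at + slice_ns);
    }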

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 tools/sched_ext/include/scx/common.bpf.h | 94 ++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 5c9517190713..f3e15e9efa76 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -408,6 +408,100 @@ static __always_inline const struct cpumask *cast_mask(struct bpf_cpumask *mask)
 void bpf_rcu_read_lock(void) __ksym;
 void bpf_rcu_read_unlock(void) __ksym;
 
+/*
+ * Time helpers, most of which are from jiffies.h.
+ */
+
+/**
+ * time_delta - Calculate the delta between new and old time stamp
+ * @after: first comparable as u64
+ * @before: second comparable as u64
+ *
+ * Return: the time difference, which is >= 0
+ */
+static inline s64 time_delta(u64 after, u64 before)
+{
+	return (s64)(after - before) > 0 ? (s64)(after - before) : 0;
+}
+
+/**
+ * time_after - returns true if the time a is after time b.
+ * @a: first comparable as u64
+ * @b: second comparable as u64
+ *
+ * Do this with "<0" and ">=0" to only test the sign of the result. A
+ * good compiler would generate better code (and a really good compiler
+ * wouldn't care). Gcc is currently neither.
+ *
+ * Return: %true if time a is after time b, otherwise %false.
+ */
+static inline bool time_after(u64 a, u64 b)
+{
+	 return (s64)(b - a) < 0;
+}
+
+/**
+ * time_before - returns true if the time a is before time b.
+ * @a: first comparable as u64
+ * @b: second comparable as u64
+ *
+ * Return: %true if time a is before time b, otherwise %false.
+ */
+static inline bool time_before(u64 a, u64 b)
+{
+	return time_after(b, a);
+}
+
+/**
+ * time_after_eq - returns true if the time a is after or the same as time b.
+ * @a: first comparable as u64
+ * @b: second comparable as u64
+ *
+ * Return: %true if time a is after or the same as time b, otherwise %false.
+ */
+static inline bool time_after_eq(u64 a, u64 b)
+{
+	 return (s64)(a - b) >= 0;
+}
+
+/**
+ * time_before_eq - returns true if the time a is before or the same as time b.
+ * @a: first comparable as u64
+ * @b: second comparable as u64
+ *
+ * Return: %true if time a is before or the same as time b, otherwise %false.
+ */
+static inline bool time_before_eq(u64 a, u64 b)
+{
+	return time_after_eq(b, a);
+}
+
+/**
+ * time_in_range - Calculate whether a is in the range of [b, c].
+ * @a: time to test
+ * @b: beginning of the range
+ * @c: end of the range
+ *
+ * Return: %true if time a is in the range [b, c], otherwise %false.
+ */
+static inline bool time_in_range(u64 a, u64 b, u64 c)
+{
+	return time_after_eq(a, b) && time_before_eq(a, c);
+}
+
+/**
+ * time_in_range_open - Calculate whether a is in the range of [b, c).
+ * @a: time to test
+ * @b: beginning of the range
+ * @c: end of the range
+ *
+ * Return: %true if time a is in the range [b, c), otherwise %false.
+ */
+static inline bool time_in_range_open(u64 a, u64 b, u64 c)
+{
+	return time_after_eq(a, b) && time_before(a, c);
+}
+
 
 /*
  * Other helpers
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 5/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
                   ` (3 preceding siblings ...)
  2024-12-30  9:56 ` [PATCH v7 4/6] sched_ext: Add time helpers for BPF schedulers Changwoo Min
@ 2024-12-30  9:56 ` Changwoo Min
  2024-12-30  9:56 ` [PATCH v7 6/6] sched_ext: Use time helpers in BPF schedulers Changwoo Min
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Changwoo Min @ 2024-12-30  9:56 UTC (permalink / raw)
  To: tj, void, arighi, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel

In the BPF schedulers that use bpf_ktime_get_ns() -- scx_central and
scx_flatcg -- replace the bpf_ktime_get_ns() calls with scx_bpf_now().

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 tools/sched_ext/scx_central.bpf.c | 4 ++--
 tools/sched_ext/scx_flatcg.bpf.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c
index 2907df78241e..4239034ad593 100644
--- a/tools/sched_ext/scx_central.bpf.c
+++ b/tools/sched_ext/scx_central.bpf.c
@@ -245,7 +245,7 @@ void BPF_STRUCT_OPS(central_running, struct task_struct *p)
 	s32 cpu = scx_bpf_task_cpu(p);
 	u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
 	if (started_at)
-		*started_at = bpf_ktime_get_ns() ?: 1;	/* 0 indicates idle */
+		*started_at = scx_bpf_now() ?: 1;	/* 0 indicates idle */
 }
 
 void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
@@ -258,7 +258,7 @@ void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
 
 static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
 {
-	u64 now = bpf_ktime_get_ns();
+	u64 now = scx_bpf_now();
 	u64 nr_to_kick = nr_queued;
 	s32 i, curr_cpu;
 
diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index 3dbfa82883be..5f588963fb2f 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -734,7 +734,7 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
 	struct fcg_cpu_ctx *cpuc;
 	struct fcg_cgrp_ctx *cgc;
 	struct cgroup *cgrp;
-	u64 now = bpf_ktime_get_ns();
+	u64 now = scx_bpf_now();
 	bool picked_next = false;
 
 	cpuc = find_cpu_ctx();
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 6/6] sched_ext: Use time helpers in BPF schedulers
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
                   ` (4 preceding siblings ...)
  2024-12-30  9:56 ` [PATCH v7 5/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now() Changwoo Min
@ 2024-12-30  9:56 ` Changwoo Min
  2025-01-02 18:54 ` [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Andrea Righi
  2025-01-03 22:16 ` Tejun Heo
  7 siblings, 0 replies; 15+ messages in thread
From: Changwoo Min @ 2024-12-30  9:56 UTC (permalink / raw)
  To: tj, void, arighi, mingo, peterz; +Cc: changwoo, kernel-dev, linux-kernel

Modify the BPF schedulers to use the time helpers defined in common.bpf.h.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 tools/sched_ext/scx_central.bpf.c |  7 +------
 tools/sched_ext/scx_flatcg.bpf.c  | 21 ++++++++-------------
 tools/sched_ext/scx_simple.bpf.c  |  9 ++-------
 3 files changed, 11 insertions(+), 26 deletions(-)

diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c
index 4239034ad593..50bc1737c167 100644
--- a/tools/sched_ext/scx_central.bpf.c
+++ b/tools/sched_ext/scx_central.bpf.c
@@ -87,11 +87,6 @@ struct {
 	__type(value, struct central_timer);
 } central_timer SEC(".maps");
 
-static bool vtime_before(u64 a, u64 b)
-{
-	return (s64)(a - b) < 0;
-}
-
 s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
@@ -279,7 +274,7 @@ static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
 		/* kick iff the current one exhausted its slice */
 		started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
 		if (started_at && *started_at &&
-		    vtime_before(now, *started_at + slice_ns))
+		    time_before(now, *started_at + slice_ns))
 			continue;
 
 		/* and there's something pending */
diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c
index 5f588963fb2f..2c720e3ecad5 100644
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -137,11 +137,6 @@ static u64 div_round_up(u64 dividend, u64 divisor)
 	return (dividend + divisor - 1) / divisor;
 }
 
-static bool vtime_before(u64 a, u64 b)
-{
-	return (s64)(a - b) < 0;
-}
-
 static bool cgv_node_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
 {
 	struct cgv_node *cgc_a, *cgc_b;
@@ -271,7 +266,7 @@ static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
 	 */
 	max_budget = (cgrp_slice_ns * nr_cpus * cgc->hweight) /
 		(2 * FCG_HWEIGHT_ONE);
-	if (vtime_before(cvtime, cvtime_now - max_budget))
+	if (time_before(cvtime, cvtime_now - max_budget))
 		cvtime = cvtime_now - max_budget;
 
 	cgv_node->cvtime = cvtime;
@@ -401,7 +396,7 @@ void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags)
 		 * Limit the amount of budget that an idling task can accumulate
 		 * to one slice.
 		 */
-		if (vtime_before(tvtime, cgc->tvtime_now - SCX_SLICE_DFL))
+		if (time_before(tvtime, cgc->tvtime_now - SCX_SLICE_DFL))
 			tvtime = cgc->tvtime_now - SCX_SLICE_DFL;
 
 		scx_bpf_dsq_insert_vtime(p, cgrp->kn->id, SCX_SLICE_DFL,
@@ -535,7 +530,7 @@ void BPF_STRUCT_OPS(fcg_running, struct task_struct *p)
 		 * from multiple CPUs and thus racy. Any error should be
 		 * contained and temporary. Let's just live with it.
 		 */
-		if (vtime_before(cgc->tvtime_now, p->scx.dsq_vtime))
+		if (time_before(cgc->tvtime_now, p->scx.dsq_vtime))
 			cgc->tvtime_now = p->scx.dsq_vtime;
 	}
 	bpf_cgroup_release(cgrp);
@@ -645,7 +640,7 @@ static bool try_pick_next_cgroup(u64 *cgidp)
 	cgv_node = container_of(rb_node, struct cgv_node, rb_node);
 	cgid = cgv_node->cgid;
 
-	if (vtime_before(cvtime_now, cgv_node->cvtime))
+	if (time_before(cvtime_now, cgv_node->cvtime))
 		cvtime_now = cgv_node->cvtime;
 
 	/*
@@ -744,7 +739,7 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
 	if (!cpuc->cur_cgid)
 		goto pick_next_cgroup;
 
-	if (vtime_before(now, cpuc->cur_at + cgrp_slice_ns)) {
+	if (time_before(now, cpuc->cur_at + cgrp_slice_ns)) {
 		if (scx_bpf_dsq_move_to_local(cpuc->cur_cgid)) {
 			stat_inc(FCG_STAT_CNS_KEEP);
 			return;
@@ -920,14 +915,14 @@ void BPF_STRUCT_OPS(fcg_cgroup_move, struct task_struct *p,
 		    struct cgroup *from, struct cgroup *to)
 {
 	struct fcg_cgrp_ctx *from_cgc, *to_cgc;
-	s64 vtime_delta;
+	s64 delta;
 
 	/* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */
 	if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to)))
 		return;
 
-	vtime_delta = p->scx.dsq_vtime - from_cgc->tvtime_now;
-	p->scx.dsq_vtime = to_cgc->tvtime_now + vtime_delta;
+	delta = time_delta(p->scx.dsq_vtime, from_cgc->tvtime_now);
+	p->scx.dsq_vtime = to_cgc->tvtime_now + delta;
 }
 
 s32 BPF_STRUCT_OPS_SLEEPABLE(fcg_init)
diff --git a/tools/sched_ext/scx_simple.bpf.c b/tools/sched_ext/scx_simple.bpf.c
index 31f915b286c6..e6de99dba7db 100644
--- a/tools/sched_ext/scx_simple.bpf.c
+++ b/tools/sched_ext/scx_simple.bpf.c
@@ -52,11 +52,6 @@ static void stat_inc(u32 idx)
 		(*cnt_p)++;
 }
 
-static inline bool vtime_before(u64 a, u64 b)
-{
-	return (s64)(a - b) < 0;
-}
-
 s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
 {
 	bool is_idle = false;
@@ -84,7 +79,7 @@ void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
 		 * Limit the amount of budget that an idling task can accumulate
 		 * to one slice.
 		 */
-		if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
+		if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
 			vtime = vtime_now - SCX_SLICE_DFL;
 
 		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
@@ -108,7 +103,7 @@ void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
 	 * thus racy. Any error should be contained and temporary. Let's just
 	 * live with it.
 	 */
-	if (vtime_before(vtime_now, p->scx.dsq_vtime))
+	if (time_before(vtime_now, p->scx.dsq_vtime))
 		vtime_now = p->scx.dsq_vtime;
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
                   ` (5 preceding siblings ...)
  2024-12-30  9:56 ` [PATCH v7 6/6] sched_ext: Use time helpers in BPF schedulers Changwoo Min
@ 2025-01-02 18:54 ` Andrea Righi
  2025-01-03 22:16 ` Tejun Heo
  7 siblings, 0 replies; 15+ messages in thread
From: Andrea Righi @ 2025-01-02 18:54 UTC (permalink / raw)
  To: Changwoo Min; +Cc: tj, void, mingo, peterz, changwoo, kernel-dev, linux-kernel

Hi Changwoo,

On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
> and scx_flash) frequently call bpf_ktime_get_ns() for tracking tasks' runtime
> properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
> timestamp counter (TSC). However, reading a hardware TSC is not
> performant in some hardware platforms, degrading IPC.
> 
> This patchset addresses the performance problem of reading hardware TSC
> by leveraging the rq clock in the scheduler core, introducing a
> scx_bpf_now() function for BPF schedulers. Whenever the rq clock
> is fresh and valid, scx_bpf_now() provides the rq clock, which is
> already updated by the scheduler core (update_rq_clock), so it can reduce
> reading the hardware TSC.
> 
> When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
> so a subsequent scx_bpf_now() call gets the fresh sched_clock for the caller.
> 
> In addition, scx_bpf_now() guarantees the clock is monotonically
> non-decreasing for the same CPU, so the clock cannot go backward
> in the same CPU.
> 
> Using scx_bpf_now() reduces the number of reading hardware TSC
> by 50-80% (76% for scx_lavd, 82% for scx_bpfland, and 51% for scx_rusty)
> for the following benchmark:
> 
>     perf bench -f simple sched messaging -t -g 20 -l 6000

Looks good to me, I also tested it and I have nothing to report, therefore:

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
  2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
                   ` (6 preceding siblings ...)
  2025-01-02 18:54 ` [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Andrea Righi
@ 2025-01-03 22:16 ` Tejun Heo
  2025-01-07 19:46   ` Tejun Heo
  7 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2025-01-03 22:16 UTC (permalink / raw)
  To: Changwoo Min
  Cc: void, arighi, mingo, peterz, changwoo, kernel-dev, linux-kernel

On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
> and scx_flash) frequently call bpf_ktime_get_ns() for tracking tasks' runtime
> properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
> timestamp counter (TSC). However, reading a hardware TSC is not
> performant in some hardware platforms, degrading IPC.
> 
> This patchset addresses the performance problem of reading hardware TSC
> by leveraging the rq clock in the scheduler core, introducing a
> scx_bpf_now() function for BPF schedulers. Whenever the rq clock
> is fresh and valid, scx_bpf_now() provides the rq clock, which is
> already updated by the scheduler core (update_rq_clock), so it can reduce
> reading the hardware TSC.
> 
> When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
> so a subsequent scx_bpf_now() call gets the fresh sched_clock for the caller.
> 
> In addition, scx_bpf_now() guarantees the clock is monotonically
> non-decreasing for the same CPU, so the clock cannot go backward
> in the same CPU.
> 
> Using scx_bpf_now() reduces the number of reading hardware TSC
> by 50-80% (76% for scx_lavd, 82% for scx_bpfland, and 51% for scx_rusty)
> for the following benchmark:

The patch series generally look good to me. Peter, if things look okay to
you, I'll apply the series to sched_ext/for-6.14.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
  2025-01-03 22:16 ` Tejun Heo
@ 2025-01-07 19:46   ` Tejun Heo
  2025-01-07 19:53     ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2025-01-07 19:46 UTC (permalink / raw)
  To: Changwoo Min
  Cc: void, arighi, mingo, peterz, changwoo, kernel-dev, linux-kernel

Hello,

On Fri, Jan 03, 2025 at 12:16:41PM -1000, Tejun Heo wrote:
> On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> > Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
> > and scx_flash) frequently call bpf_ktime_get_ns() for tracking tasks' runtime
> > properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
> > timestamp counter (TSC). However, reading a hardware TSC is not
> > performant in some hardware platforms, degrading IPC.
> > 
> > This patchset addresses the performance problem of reading hardware TSC
> > by leveraging the rq clock in the scheduler core, introducing a
> > scx_bpf_now() function for BPF schedulers. Whenever the rq clock
> > is fresh and valid, scx_bpf_now() provides the rq clock, which is
> > already updated by the scheduler core (update_rq_clock), so it can reduce
> > reading the hardware TSC.
> > 
> > When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
> > so a subsequent scx_bpf_now() call gets the fresh sched_clock for the caller.
> > 
> > In addition, scx_bpf_now() guarantees the clock is monotonically
> > non-decreasing for the same CPU, so the clock cannot go backward
> > in the same CPU.
> > 
> > Using scx_bpf_now() reduces the number of reading hardware TSC
> > by 50-80% (76% for scx_lavd, 82% for scx_bpfland, and 51% for scx_rusty)
> > for the following benchmark:
> 
> The patch series generally look good to me. Peter, if things look okay to
> you, I'll apply the series to sched_ext/for-6.14.

Applying to sched_ext/for-6.14. Please holler if there are concerns.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
  2025-01-07 19:46   ` Tejun Heo
@ 2025-01-07 19:53     ` Peter Zijlstra
  2025-01-07 19:55       ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2025-01-07 19:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Changwoo Min, void, arighi, mingo, changwoo, kernel-dev,
	linux-kernel

On Tue, Jan 07, 2025 at 09:46:48AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jan 03, 2025 at 12:16:41PM -1000, Tejun Heo wrote:
> > On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> > > Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
> > > and scx_flash) frequently call bpf_ktime_get_ns() for tracking tasks' runtime
> > > properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
> > > timestamp counter (TSC). However, reading a hardware TSC is not
> > > performant in some hardware platforms, degrading IPC.
> > > 
> > > This patchset addresses the performance problem of reading hardware TSC
> > > by leveraging the rq clock in the scheduler core, introducing a
> > > scx_bpf_now() function for BPF schedulers. Whenever the rq clock
> > > is fresh and valid, scx_bpf_now() provides the rq clock, which is
> > > already updated by the scheduler core (update_rq_clock), so it can reduce
> > > reading the hardware TSC.
> > > 
> > > When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
> > > so a subsequent scx_bpf_now() call gets the fresh sched_clock for the caller.
> > > 
> > > In addition, scx_bpf_now() guarantees the clock is monotonically
> > > non-decreasing for the same CPU, so the clock cannot go backward
> > > in the same CPU.
> > > 
> > > Using scx_bpf_now() reduces the number of reading hardware TSC
> > > by 50-80% (76% for scx_lavd, 82% for scx_bpfland, and 51% for scx_rusty)
> > > for the following benchmark:
> > 
> > The patch series generally look good to me. Peter, if things look okay to
> > you, I'll apply the series to sched_ext/for-6.14.
> 
> Applying to sched_ext/for-6.14. Please holler if there are concerns.

Urgh, I'll try and have a look, but have a hate for everybody who's been
working through x-mas :/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
  2025-01-07 19:53     ` Peter Zijlstra
@ 2025-01-07 19:55       ` Tejun Heo
  2025-01-08  8:51         ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2025-01-07 19:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Changwoo Min, void, arighi, mingo, changwoo, kernel-dev,
	linux-kernel

Hello,

On Tue, Jan 07, 2025 at 08:53:47PM +0100, Peter Zijlstra wrote:
> > > The patch series generally look good to me. Peter, if things look okay to
> > > you, I'll apply the series to sched_ext/for-6.14.
> > 
> > Applying to sched_ext/for-6.14. Please holler if there are concerns.
> 
> Urgh, I'll try and have a look, but have a hate for everybody who's been
> working through x-mas :/

Ah, I'm just back too. FWIW, the patchset is a lot simpler now, so hopefully
there shouldn't be too much to object to. I'll hold off until you can take a
look.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 2/6] sched_ext: Implement scx_bpf_now()
  2024-12-30  9:56 ` [PATCH v7 2/6] sched_ext: Implement scx_bpf_now() Changwoo Min
@ 2025-01-08  8:50   ` Peter Zijlstra
  2025-01-08 16:04     ` Changwoo Min
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2025-01-08  8:50 UTC (permalink / raw)
  To: Changwoo Min; +Cc: tj, void, arighi, mingo, changwoo, kernel-dev, linux-kernel


> +__bpf_kfunc u64 scx_bpf_now(void)
> +{
> +	struct rq *rq;
> +	u64 clock;
> +
> +	preempt_disable();
> +
> +	rq = this_rq();
> +	if (READ_ONCE(rq->scx.flags) & SCX_RQ_CLK_VALID) {
> +		/*
> +		 * If the rq clock is valid, use the cached rq clock.
> +		 *
> +		 * Note that scx_bpf_now() is re-entrant between a process
> +		 * context and an interrupt context (e.g., timer interrupt).
> +		 * However, we don't need to consider the race between them
> +		 * because such race is not observable from a caller.
> +		 */
> +		clock = READ_ONCE(rq->scx.clock);
> +	} else {
> +		/*
> +		 * Otherwise, return a fresh rq clock.
> +		 *
> +		 * The rq clock is updated outside of the rq lock.
> +		 * In this case, keep the updated rq clock invalid so the next
> +		 * kfunc call outside the rq lock gets a fresh rq clock.
> +		 */
> +		clock = sched_clock_cpu(cpu_of(rq));
> +	}
> +
> +	preempt_enable();
> +
> +	return clock;
> +}

> +static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
> +{
> +	if (!scx_enabled())
> +		return;
> +	WRITE_ONCE(rq->scx.clock, clock);
> +	WRITE_ONCE(rq->scx.flags, rq->scx.flags | SCX_RQ_CLK_VALID);
> +}


AFAICT it is possible to be used like:


  CPU0					CPU1

  lock(rq1->lock);
  ...
  scx_rq_clock_update(...);		scx_bpf_now();
  ...
  unlock(rq->lock);


Which then enables the following ordering problem:

  CPU0					CPU1

  WRITE_ONCE(rq->scx.clock, clock);	if (rq->scx.flags & VALID)
  WRITE_ONCE(rq->scx.flags, VALID);	  return rq->scx.clock;

Where it then becomes possible to observe VALID before clock is written.

That is, I rather think you need:

> +static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
> +{
> +	if (!scx_enabled())
> +		return;
> +	WRITE_ONCE(rq->scx.clock, clock);
> +	smp_store_release(&rq->scx.flags, rq->scx.flags | SCX_RQ_CLK_VALID);
> +}

and:

	if (smp_load_acquire(&rq->scx.flags) & SCX_RQ_CLK_VALID) {
> +		/*
> +		 * If the rq clock is valid, use the cached rq clock.
> +		 *
> +		 * Note that scx_bpf_now() is re-entrant between a process
> +		 * context and an interrupt context (e.g., timer interrupt).
> +		 * However, we don't need to consider the race between them
> +		 * because such race is not observable from a caller.
> +		 */
> +		clock = READ_ONCE(rq->scx.clock);

Such that if you observe VALID, you must then also observe the clock.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
  2025-01-07 19:55       ` Tejun Heo
@ 2025-01-08  8:51         ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2025-01-08  8:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Changwoo Min, void, arighi, mingo, changwoo, kernel-dev,
	linux-kernel

On Tue, Jan 07, 2025 at 09:55:16AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jan 07, 2025 at 08:53:47PM +0100, Peter Zijlstra wrote:
> > > > The patch series generally look good to me. Peter, if things look okay to
> > > > you, I'll apply the series to sched_ext/for-6.14.
> > > 
> > > Applying to sched_ext/for-6.14. Please holler if there are concerns.
> > 
> > Urgh, I'll try and have a look, but have a hate for everybody who's been
> > working through x-mas :/
> 
> Ah, I'm just back too. FWIW, the patchset is a lot simpler now, so hopefully
> there shouldn't be too much to object to. I'll hold off until you can take a
> look.

Thanks! Code does appear simpler now. But I think there's an ordering
issue, I've replied to the patch in question.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 2/6] sched_ext: Implement scx_bpf_now()
  2025-01-08  8:50   ` Peter Zijlstra
@ 2025-01-08 16:04     ` Changwoo Min
  0 siblings, 0 replies; 15+ messages in thread
From: Changwoo Min @ 2025-01-08 16:04 UTC (permalink / raw)
  To: Peter Zijlstra, Changwoo Min
  Cc: tj, void, arighi, mingo, kernel-dev, linux-kernel

Hello,

On 25. 1. 8. 17:50, Peter Zijlstra wrote:
> 
> That is, I rather think you need:
> 
>> +static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
>> +{
>> +	if (!scx_enabled())
>> +		return;
>> +	WRITE_ONCE(rq->scx.clock, clock);
>> +	smp_store_release(&rq->scx.flags, rq->scx.flags | SCX_RQ_CLK_VALID);
>> +}
> 
> and:
> 
> 	if (smp_load_acquire(&rq->scx.flags) & SCX_RQ_CLK_VALID) {
>> +		/*
>> +		 * If the rq clock is valid, use the cached rq clock.
>> +		 *
>> +		 * Note that scx_bpf_now() is re-entrant between a process
>> +		 * context and an interrupt context (e.g., timer interrupt).
>> +		 * However, we don't need to consider the race between them
>> +		 * because such race is not observable from a caller.
>> +		 */
>> +		clock = READ_ONCE(rq->scx.clock);
> 
> Such that if you ovbserve VALID, you must then also observe the clock.

Thank you for catching this out! I will change it as suggested in
the next version.

Regards,
Changwoo Min

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-01-08 16:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-30  9:56 [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Changwoo Min
2024-12-30  9:56 ` [PATCH v7 1/6] sched_ext: Relocate scx_enabled() related code Changwoo Min
2024-12-30  9:56 ` [PATCH v7 2/6] sched_ext: Implement scx_bpf_now() Changwoo Min
2025-01-08  8:50   ` Peter Zijlstra
2025-01-08 16:04     ` Changwoo Min
2024-12-30  9:56 ` [PATCH v7 3/6] sched_ext: Add scx_bpf_now() for BPF scheduler Changwoo Min
2024-12-30  9:56 ` [PATCH v7 4/6] sched_ext: Add time helpers for BPF schedulers Changwoo Min
2024-12-30  9:56 ` [PATCH v7 5/6] sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now() Changwoo Min
2024-12-30  9:56 ` [PATCH v7 6/6] sched_ext: Use time helpers in BPF schedulers Changwoo Min
2025-01-02 18:54 ` [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock Andrea Righi
2025-01-03 22:16 ` Tejun Heo
2025-01-07 19:46   ` Tejun Heo
2025-01-07 19:53     ` Peter Zijlstra
2025-01-07 19:55       ` Tejun Heo
2025-01-08  8:51         ` Peter Zijlstra
