* [GIT PULL] sched_ext: Initial pull request for v6.11
@ 2024-07-15 22:32 Tejun Heo
2024-07-23 16:33 ` Peter Zijlstra
2024-07-25 1:19 ` Qais Yousef
0 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2024-07-15 22:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, David Vernet, Ingo Molnar, Peter Zijlstra,
Alexei Starovoitov
NOTE: I couldn't get git-request-pull to generate the correct diffstat. The
diffstat at the end is generated against the merged base and should
match the diff when sched_ext-for-6.11 is pulled after tip/sched/core
and bpf/for-next.
The following changes since commit d329605287020c3d1c3b0dadc63d8208e7251382:
sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks (2024-07-04 15:59:52 +0200)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/ tags/sched_ext-for-6.11
for you to fetch changes up to 8bb30798fd6ee79e4041a32ca85b9f70345d8671:
sched_ext: Fixes incorrect type in bpf_scx_init() (2024-07-14 18:10:10 -1000)
----------------------------------------------------------------
sched_ext: Initial pull request for v6.11
This is the initial pull request of sched_ext. The v7 patchset
(https://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org) is
applied on top of tip/sched/core + bpf/for-next as of Jun 18th.
tip/sched/core 793a62823d1c ("sched/core: Drop spinlocks on contention iff kernel is preemptible")
bpf/for-next f6afdaf72af7 ("Merge branch 'bpf-support-resilient-split-btf'")
Since then, the following changes:
- cpuperf support, which was a part of the v6 patchset, was posted
  separately and then applied after reviews.
- tip/sched/core is pulled in to resolve the conflicts between
e83edbf88f18 ("sched: Add sched_class->reweight_task()")
d32960528702 ("sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks")
The former makes reweight_task() a sched_class operation for sched_ext
while the latter fixes a bug around idle weight handling in the same area.
The conflicts are nontrivial, though not that complicated either.
- Improved integration with sched core.
- DSQ iterator, which was a part of the v6 patchset, was posted separately.
  The iterator itself was applied after a couple of revisions. The associated
  selective consumption kfunc could use further improvements and is still
  being worked on.
- The BPF scheduler couldn't directly dispatch to the local DSQ of another
CPU using a SCX_DSQ_LOCAL_ON verdict. This caused difficulties around
handling non-wakeup enqueues. Updated so that SCX_DSQ_LOCAL_ON can be used
in the enqueue path too.
- Various fixes and improvements.
As the branch is based on top of tip/sched/core + bpf/for-next, please merge
after both are applied.
----------------------------------------------------------------
Aboorva Devarajan (1):
sched_ext: Documentation: Remove mentions of scx_bpf_switch_all
Andrea Righi (2):
sched_ext: fix typo in set_weight() description
sched_ext: add CONFIG_DEBUG_INFO_BTF dependency
Colin Ian King (1):
sched_ext: Fix spelling mistake: "intead" -> "instead"
David Vernet (6):
sched_ext: Implement runnable task stall watchdog
sched_ext: Print sched_ext info when dumping stack
sched_ext: Implement SCX_KICK_WAIT
sched_ext: Implement sched_ext_ops.cpu_acquire/release()
sched_ext: Add selftests
sched_ext: Make scx_bpf_cpuperf_set() @cpu arg signed
Hongyan Xia (1):
sched/ext: Add BPF function to fetch rq
Jiapeng Chong (1):
sched_ext: Fixes incorrect type in bpf_scx_init()
Tejun Heo (48):
sched: Restructure sched_class order sanity checks in sched_init()
sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork()
sched: Add sched_class->reweight_task()
sched: Add sched_class->switching_to() and expose check_class_changing/changed()
sched: Factor out cgroup weight conversion functions
sched: Factor out update_other_load_avgs() from __update_blocked_others()
sched: Add normal_policy()
sched_ext: Add boilerplate for extensible scheduler class
sched_ext: Implement BPF extensible scheduler class
sched_ext: Add scx_simple and scx_example_qmap example schedulers
sched_ext: Add sysrq-S which disables the BPF scheduler
sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT
sched_ext: Print debug dump after an error exit
tools/sched_ext: Add scx_show_state.py
sched_ext: Implement scx_bpf_kick_cpu() and task preemption support
sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU
sched_ext: Make watchdog handle ops.dispatch() looping stall
sched_ext: Add task state tracking operations
sched_ext: Implement tickless support
sched_ext: Track tasks that are subjects of the in-flight SCX operation
sched_ext: Implement sched_ext_ops.cpu_online/offline()
sched_ext: Bypass BPF scheduler while PM events are in progress
sched_ext: Implement core-sched support
sched_ext: Add vtime-ordered priority queue to dispatch_q's
sched_ext: Documentation: scheduler: Document extensible scheduler class
sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class()
cpufreq_schedutil: Refactor sugov_cpu_is_busy()
sched_ext: Add cpuperf support
sched_ext: Drop tools_clean target from the top-level Makefile
sched_ext: Swap argument positions in kcalloc() call to avoid compiler warning
Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11
sched, sched_ext: Simplify dl_prio() case handling in sched_fork()
sched_ext: Account for idle policy when setting p->scx.weight in scx_ops_enable_task()
sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect
sched_ext: Minor cleanups in kernel/sched/ext.h
sched, sched_ext: Open code for_balance_class_range()
sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h
sched_ext: Take out ->priq and ->flags from scx_dsq_node
sched_ext: Implement DSQ iterator
sched_ext/scx_qmap: Add an example usage of DSQ iterator
sched_ext: Reimplement scx_bpf_reenqueue_local()
sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated
sched: Move struct balance_callback definition upward
sched_ext: Open-code task_linked_on_dsq()
sched_ext: Unpin and repin rq lock from balance_scx()
sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP
sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues
Documentation/scheduler/index.rst | 1
Documentation/scheduler/sched-ext.rst | 316 +
MAINTAINERS | 13
drivers/tty/sysrq.c | 1
include/asm-generic/vmlinux.lds.h | 1
include/linux/cgroup.h | 4
include/linux/sched.h | 5
include/linux/sched/ext.h | 206
include/linux/sched/task.h | 3
include/trace/events/sched_ext.h | 32
include/uapi/linux/sched.h | 1
init/init_task.c | 12
kernel/Kconfig.preempt | 26
kernel/fork.c | 17
kernel/sched/build_policy.c | 11
kernel/sched/core.c | 192
kernel/sched/cpufreq_schedutil.c | 50
kernel/sched/debug.c | 3
kernel/sched/ext.c | 6537 ++++++++++++++++++++++++++
kernel/sched/ext.h | 69
kernel/sched/fair.c | 22
kernel/sched/idle.c | 2
kernel/sched/sched.h | 158
kernel/sched/syscalls.c | 26
lib/dump_stack.c | 1
tools/Makefile | 10
tools/sched_ext/.gitignore | 2
tools/sched_ext/Makefile | 246
tools/sched_ext/README.md | 258 +
tools/sched_ext/include/bpf-compat/gnu/stubs.h | 11
tools/sched_ext/include/scx/common.bpf.h | 401 +
tools/sched_ext/include/scx/common.h | 75
tools/sched_ext/include/scx/compat.bpf.h | 28
tools/sched_ext/include/scx/compat.h | 186
tools/sched_ext/include/scx/user_exit_info.h | 111
tools/sched_ext/scx_central.bpf.c | 361 +
tools/sched_ext/scx_central.c | 135
tools/sched_ext/scx_qmap.bpf.c | 706 ++
tools/sched_ext/scx_qmap.c | 144
tools/sched_ext/scx_show_state.py | 39
tools/sched_ext/scx_simple.bpf.c | 156
tools/sched_ext/scx_simple.c | 107
tools/testing/selftests/sched_ext/.gitignore | 6
tools/testing/selftests/sched_ext/Makefile | 218
tools/testing/selftests/sched_ext/config | 9
tools/testing/selftests/sched_ext/create_dsq.bpf.c | 58
tools/testing/selftests/sched_ext/create_dsq.c | 57
tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c | 42
tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c | 57
tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c | 39
tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c | 56
tools/testing/selftests/sched_ext/dsp_local_on.bpf.c | 65
tools/testing/selftests/sched_ext/dsp_local_on.c | 58
tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c | 21
tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c | 60
tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c | 43
tools/testing/selftests/sched_ext/enq_select_cpu_fails.c | 61
tools/testing/selftests/sched_ext/exit.bpf.c | 84
tools/testing/selftests/sched_ext/exit.c | 55
tools/testing/selftests/sched_ext/exit_test.h | 20
tools/testing/selftests/sched_ext/hotplug.bpf.c | 61
tools/testing/selftests/sched_ext/hotplug.c | 168
tools/testing/selftests/sched_ext/hotplug_test.h | 15
tools/testing/selftests/sched_ext/init_enable_count.bpf.c | 53
tools/testing/selftests/sched_ext/init_enable_count.c | 166
tools/testing/selftests/sched_ext/maximal.bpf.c | 132
tools/testing/selftests/sched_ext/maximal.c | 51
tools/testing/selftests/sched_ext/maybe_null.bpf.c | 36
tools/testing/selftests/sched_ext/maybe_null.c | 49
tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c | 25
tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c | 28
tools/testing/selftests/sched_ext/minimal.bpf.c | 21
tools/testing/selftests/sched_ext/minimal.c | 58
tools/testing/selftests/sched_ext/prog_run.bpf.c | 32
tools/testing/selftests/sched_ext/prog_run.c | 78
tools/testing/selftests/sched_ext/reload_loop.c | 75
tools/testing/selftests/sched_ext/runner.c | 201
tools/testing/selftests/sched_ext/scx_test.h | 131
tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c | 40
tools/testing/selftests/sched_ext/select_cpu_dfl.c | 72
tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c | 89
tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c | 72
tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c | 41
tools/testing/selftests/sched_ext/select_cpu_dispatch.c | 70
tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c | 37
tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c | 56
tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c | 38
tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c | 56
tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c | 92
tools/testing/selftests/sched_ext/select_cpu_vtime.c | 59
tools/testing/selftests/sched_ext/test_example.c | 49
tools/testing/selftests/sched_ext/util.c | 71
tools/testing/selftests/sched_ext/util.h | 13
93 files changed, 13834 insertions(+), 95 deletions(-)
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-15 22:32 [GIT PULL] sched_ext: Initial pull request for v6.11 Tejun Heo
@ 2024-07-23 16:33 ` Peter Zijlstra
2024-07-23 19:34 ` Tejun Heo
2024-08-02 12:20 ` Peter Zijlstra
2024-07-25 1:19 ` Qais Yousef
1 sibling, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2024-07-23 16:33 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Taken from a git diff of your tree against Linus' tree... I've not yet
read everything, but I figured I should share these bits.
---
> @@ -1255,11 +1263,14 @@ bool sched_can_stop_tick(struct rq *rq)
> return true;
>
> /*
> - * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
> - * if there's more than one we need the tick for involuntary
> - * preemption.
> + * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks
> + * left. For CFS, if there's more than one we need the tick for
> + * involuntary preemption. For SCX, ask.
> */
> - if (rq->nr_running > 1)
> + if (!scx_switched_all() && rq->nr_running > 1)
> + return false;
> +
> + if (scx_enabled() && !scx_can_stop_tick(rq))
> return false;
>
Doesn't that boil down to something like:
if (scx_switched_all())
return scx_can_stop_tick(rq);
if (rq->nr_running > 1)
return false;
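As an aside, the difference between the merged check and the suggested simplification can be pinned down with a small userspace model (plain C, not kernel code; the three flags are stand-ins for scx_switched_all(), scx_enabled() and scx_can_stop_tick(), and returning true means "these checks allow stopping the tick"). The two variants only disagree in the partially-enabled, nr_running <= 1 case:

```c
#include <assert.h>
#include <stdbool.h>

/* Merged variant, as quoted from the tree. */
static bool original(bool switched_all, bool enabled, bool can_stop,
		     int nr_running)
{
	if (!switched_all && nr_running > 1)
		return false;
	if (enabled && !can_stop)
		return false;
	return true;	/* fall through to the remaining checks */
}

/* Suggested simplification. */
static bool suggested(bool switched_all, bool enabled, bool can_stop,
		      int nr_running)
{
	(void)enabled;	/* only consulted via switched_all here */
	if (switched_all)
		return can_stop;
	if (nr_running > 1)
		return false;
	return true;	/* fall through to the remaining checks */
}
```

The lone disagreement is the partially-enabled rq with one runnable task and a BPF scheduler that says "don't stop": original() keeps the tick, suggested() does not.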
> @@ -5773,7 +5827,19 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> struct rq_flags *rf)
> {
> #ifdef CONFIG_SMP
> + const struct sched_class *start_class = prev->sched_class;
> const struct sched_class *class;
> +
> +#ifdef CONFIG_SCHED_CLASS_EXT
> + /*
> + * SCX requires a balance() call before every pick_next_task() including
> + * when waking up from SCHED_IDLE. If @start_class is below SCX, start
> + * from SCX instead.
> + */
> + if (sched_class_above(&ext_sched_class, start_class))
> + start_class = &ext_sched_class;
if (scx_enabled() && ...)
?
> +#endif
> +
> /*
> * We must do the balancing pass before put_prev_task(), such
> * that when we release the rq->lock the task is in the same
> @@ -5782,7 +5848,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> * We can terminate the balance pass as soon as we know there is
> * a runnable task of @class priority or higher.
> */
> - for_class_range(class, prev->sched_class, &idle_sched_class) {
> + for_active_class_range(class, start_class, &idle_sched_class) {
> if (class->balance(rq, prev, rf))
> break;
Don't you need fixing balance_fair() here? It has:
if (rq->nr_running)
return 1;
Which would return true and terminate the iteration even if there are
only scx tasks left.
> }
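For illustration only, not the actual fix: modeling the rq with two counters makes the failure mode concrete. The quoted check claims a runnable fair task whenever anything at all is runnable, while a hypothetical fair-only counter (named cfs_nr_running here purely for the sketch) would let the balance pass continue down to SCX:

```c
#include <assert.h>

/* Quoted behavior: any runnable task at all, including SCX-only tasks,
 * terminates the balance iteration at the fair class. */
static int balance_fair_current(int nr_running, int cfs_nr_running)
{
	(void)cfs_nr_running;
	return nr_running != 0;
}

/* Sketch of a fix (hypothetical): only fair-class tasks count, so an
 * SCX-only rq lets the iteration proceed to ext's balance(). */
static int balance_fair_sketch(int nr_running, int cfs_nr_running)
{
	(void)nr_running;
	return cfs_nr_running != 0;
}
```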
> @@ -5800,6 +5866,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> const struct sched_class *class;
> struct task_struct *p;
>
> + if (scx_enabled())
> + goto restart;
> +
> /*
> * Optimization: we know that if all tasks are in the fair class we can
> * call that function directly, but only if the @prev task wasn't of a
> @@ -5840,10 +5909,15 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> if (prev->dl_server)
> prev->dl_server = NULL;
>
> - for_each_class(class) {
> + for_each_active_class(class) {
> p = class->pick_next_task(rq);
> - if (p)
> + if (p) {
> + const struct sched_class *prev_class = prev->sched_class;
> +
> + if (class != prev_class && prev_class->switch_class)
> + prev_class->switch_class(rq, p);
> return p;
> + }
> }
So I really don't like this one.. at the very least it would need a comment
explaining why it only needs calling here and not every time a put_prev_task()
and set_next_task() pair cross a class boundary -- which would be the
consistent thing to do.
Now, IIRC, you need a class call that indicates you're about to lose the CPU
so that you can kick the task to another CPU or somesuch. And last time I got
it backwards and suggested adding an argument to pick_next_task(), but what
about put_prev_task()?
Can't we universally push put_prev_task() after the pick loop? Then we get
something like:
next = pick_task();
if (next != prev) {
put_prev_task(rq, prev, next->class != prev->class);
set_next_task(rq, next);
}
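A toy userspace model of that shape (ints standing in for tasks and classes; none of this is the real API) shows how the class-crossing bit falls out for free once put_prev_task() runs after the pick:

```c
#include <assert.h>

/* Records what the last put_prev_task() was told about class crossing. */
static int last_change_class = -1;

static void put_prev_task(int prev_class, int next_class)
{
	last_change_class = (prev_class != next_class);
}

/* Pick first, then one put/set pair that already knows the next task,
 * so "did we cross a class boundary?" is exact. Returns 1 on a switch. */
static int schedule_once(int prev, int prev_class, int next, int next_class)
{
	if (next == prev)
		return 0;	/* no put/set pair needed */
	put_prev_task(prev_class, next_class);
	/* set_next_task(next) would follow here */
	return 1;
}
```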
I have patches for most of this for fair (in my eevdf queue), and I
think the others are doable, after all, this is more or less what we do
for SCHED_CORE already.
/me went off hacking for a bit
I've done this; find the results at: queue.git sched/prep
I've also rebased the sched_ext tree on top of that with the below delta, which
you can find at: queue.git sched/scx
This led me to discover that you're passing the task of the other class into
the bpf stuff -- I don't think that is a sane thing to do. You get preempted,
it doesn't matter from which higher class or by which specific task, a policy
must not care about that. So I kinda bodged it, but I really think this should
be taken out.
I also found you have some terrible !SMP hack in there, which I've
broken; I've disabled it for now. This needs a proper fix, and not
something ugly like what you did.
Anyway, it seems to build, boot and pass the sched_ext selftest:
PASSED: 21
SKIPPED: 0
FAILED: 0
(albeit with a fair amount of console noise -- is that expected?)
Also, why does that thing hard depend on DEBUG_INFO_BTF? (at least having
that option enabled no longer explodes build times like it used to)
Also should we /CONFIG_SCHED_CLASS_EXT/CONFIG_SCHED_BPF/ ? Then
something like: grep BPF foo-build/.config
more easily shows what's what.
I suppose I'll look more at all this in the coming days ... *sigh*
---
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index da9cac6b6cc2a..6375a648d3bf5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2015,11 +2015,11 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags)
}
}
-static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags)
+static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags)
{
if (!(p->scx.flags & SCX_TASK_QUEUED)) {
WARN_ON_ONCE(task_runnable(p));
- return;
+ return true;
}
ops_dequeue(p, deq_flags);
@@ -2054,6 +2054,8 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
sub_nr_running(rq, 1);
dispatch_dequeue(rq, p);
+
+ return true;
}
static void yield_task_scx(struct rq *rq)
@@ -2702,6 +2704,18 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
*/
update_other_load_avgs(rq);
}
+
+ if (!first)
+ return;
+
+ if (unlikely(!p->scx.slice)) {
+ if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
+ printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
+ p->comm, p->pid);
+ scx_warned_zero_slice = true;
+ }
+ p->scx.slice = SCX_SLICE_DFL;
+ }
}
static void process_ddsp_deferred_locals(struct rq *rq)
@@ -2727,10 +2741,15 @@ static void process_ddsp_deferred_locals(struct rq *rq)
}
}
-static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
+static void switch_class_scx(struct rq *rq);
+
+static void put_prev_task_scx(struct rq *rq, struct task_struct *p, bool change_class)
{
-#ifndef CONFIG_SMP
+#if 0 // ndef CONFIG_SMP
/*
+ * XXX broken, put_prev no longer comes after balance but now comes
+ * after pick.
+ *
* UP workaround.
*
* Because SCX may transfer tasks across CPUs during dispatch, dispatch
@@ -2774,7 +2793,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
p->scx.flags &= ~SCX_TASK_BAL_KEEP;
set_task_runnable(rq, p);
dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
- return;
+ goto out;
}
if (p->scx.flags & SCX_TASK_QUEUED) {
@@ -2788,7 +2807,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
*/
if (p->scx.slice && !scx_ops_bypassing()) {
dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
- return;
+ goto out;
}
/*
@@ -2805,6 +2824,14 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
else
do_enqueue_task(rq, p, 0, -1);
}
+
+out:
+ /*
+ * XXX switch_class was called after put, if this can be called earlier
+ * the goto can be avoided.
+ */
+ if (change_class)
+ switch_class_scx(rq);
}
static struct task_struct *first_local_task(struct rq *rq)
@@ -2813,34 +2840,6 @@ static struct task_struct *first_local_task(struct rq *rq)
struct task_struct, scx.dsq_list.node);
}
-static struct task_struct *pick_next_task_scx(struct rq *rq)
-{
- struct task_struct *p;
-
-#ifndef CONFIG_SMP
- /* UP workaround - see the comment at the head of put_prev_task_scx() */
- if (unlikely(rq->curr->sched_class != &ext_sched_class))
- balance_one(rq, rq->curr, true);
-#endif
-
- p = first_local_task(rq);
- if (!p)
- return NULL;
-
- set_next_task_scx(rq, p, true);
-
- if (unlikely(!p->scx.slice)) {
- if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
- printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
- p->comm, p->pid);
- scx_warned_zero_slice = true;
- }
- p->scx.slice = SCX_SLICE_DFL;
- }
-
- return p;
-}
-
#ifdef CONFIG_SCHED_CORE
/**
* scx_prio_less - Task ordering for core-sched
@@ -2874,6 +2873,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
else
return time_after64(a->scx.core_sched_at, b->scx.core_sched_at);
}
+#endif /* CONFIG_SCHED_CORE */
/**
* pick_task_scx - Pick a candidate task for core-sched
@@ -2917,28 +2917,9 @@ static struct task_struct *pick_task_scx(struct rq *rq)
return first; /* this may be %NULL */
}
-#endif /* CONFIG_SCHED_CORE */
-static enum scx_cpu_preempt_reason
-preempt_reason_from_class(const struct sched_class *class)
+static void switch_class_scx(struct rq *rq)
{
-#ifdef CONFIG_SMP
- if (class == &stop_sched_class)
- return SCX_CPU_PREEMPT_STOP;
-#endif
- if (class == &dl_sched_class)
- return SCX_CPU_PREEMPT_DL;
- if (class == &rt_sched_class)
- return SCX_CPU_PREEMPT_RT;
- return SCX_CPU_PREEMPT_UNKNOWN;
-}
-
-static void switch_class_scx(struct rq *rq, struct task_struct *next)
-{
- const struct sched_class *next_class = next->sched_class;
-
- if (!scx_enabled())
- return;
#ifdef CONFIG_SMP
/*
* Pairs with the smp_load_acquire() issued by a CPU in
@@ -2950,15 +2931,6 @@ static void switch_class_scx(struct rq *rq, struct task_struct *next)
if (!static_branch_unlikely(&scx_ops_cpu_preempt))
return;
- /*
- * The callback is conceptually meant to convey that the CPU is no
- * longer under the control of SCX. Therefore, don't invoke the callback
- * if the next class is below SCX (in which case the BPF scheduler has
- * actively decided not to schedule any tasks on the CPU).
- */
- if (sched_class_above(&ext_sched_class, next_class))
- return;
-
/*
* At this point we know that SCX was preempted by a higher priority
* sched_class, so invoke the ->cpu_release() callback if we have not
@@ -2971,8 +2943,8 @@ static void switch_class_scx(struct rq *rq, struct task_struct *next)
if (!rq->scx.cpu_released) {
if (SCX_HAS_OP(cpu_release)) {
struct scx_cpu_release_args args = {
- .reason = preempt_reason_from_class(next_class),
- .task = next,
+ .reason = SCX_CPU_PREEMPT_UNKNOWN,
+ .task = NULL,
};
SCX_CALL_OP(SCX_KF_CPU_RELEASE,
@@ -3678,13 +3650,12 @@ DEFINE_SCHED_CLASS(ext) = {
.wakeup_preempt = wakeup_preempt_scx,
+ .pick_task = pick_task_scx,
.pick_next_task = pick_next_task_scx,
.put_prev_task = put_prev_task_scx,
.set_next_task = set_next_task_scx,
- .switch_class = switch_class_scx,
-
#ifdef CONFIG_SMP
.balance = balance_scx,
.select_task_rq = select_task_rq_scx,
@@ -3695,10 +3666,6 @@ DEFINE_SCHED_CLASS(ext) = {
.rq_offline = rq_offline_scx,
#endif
-#ifdef CONFIG_SCHED_CORE
- .pick_task = pick_task_scx,
-#endif
-
.task_tick = task_tick_scx,
.switching_to = switching_to_scx,
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-23 16:33 ` Peter Zijlstra
@ 2024-07-23 19:34 ` Tejun Heo
2024-07-24 8:52 ` Peter Zijlstra
2024-08-02 12:20 ` Peter Zijlstra
1 sibling, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-07-23 19:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello, Peter.
On Tue, Jul 23, 2024 at 06:33:58PM +0200, Peter Zijlstra wrote:
> > /*
> > - * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
> > - * if there's more than one we need the tick for involuntary
> > - * preemption.
> > + * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks
> > + * left. For CFS, if there's more than one we need the tick for
> > + * involuntary preemption. For SCX, ask.
> > */
> > - if (rq->nr_running > 1)
> > + if (!scx_switched_all() && rq->nr_running > 1)
> > + return false;
> > +
> > + if (scx_enabled() && !scx_can_stop_tick(rq))
> > return false;
> >
>
> Doesn't that boil down to something like:
>
> if (scx_switched_all())
> return scx_can_stop_tick(rq);
>
> if (rq->nr_running > 1)
> return false;
Yeah, given that the rq->nr_running == 1 && partially-enabled condition is
unlikely to matter and can easily lead to misdetermination by the BPF
scheduler, the suggested code is better and cleaner. I'll update.
> > @@ -5773,7 +5827,19 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> > struct rq_flags *rf)
> > {
> > #ifdef CONFIG_SMP
> > + const struct sched_class *start_class = prev->sched_class;
> > const struct sched_class *class;
> > +
> > +#ifdef CONFIG_SCHED_CLASS_EXT
> > + /*
> > + * SCX requires a balance() call before every pick_next_task() including
> > + * when waking up from SCHED_IDLE. If @start_class is below SCX, start
> > + * from SCX instead.
> > + */
> > + if (sched_class_above(&ext_sched_class, start_class))
> > + start_class = &ext_sched_class;
>
> if (scx_enabled() && ...)
>
> ?
Will add.
> > @@ -5782,7 +5848,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> > * We can terminate the balance pass as soon as we know there is
> > * a runnable task of @class priority or higher.
> > */
> > - for_class_range(class, prev->sched_class, &idle_sched_class) {
> > + for_active_class_range(class, start_class, &idle_sched_class) {
> > if (class->balance(rq, prev, rf))
> > break;
>
> Don't you need fixing balance_fair() here? It has:
>
> if (rq->nr_running)
> return 1;
>
> Which would return true and terminate the iteration even if there are
> only scx tasks left.
Right, that will lead to issues in partial mode. I think it hasn't been
noticed because tasks were being consumed through otherwise empty CPUs. Will
fix.
...
> > + for_each_active_class(class) {
> > p = class->pick_next_task(rq);
> > - if (p)
> > + if (p) {
> > + const struct sched_class *prev_class = prev->sched_class;
> > +
> > + if (class != prev_class && prev_class->switch_class)
> > + prev_class->switch_class(rq, p);
> > return p;
> > + }
> > }
>
> So I really don't like this one.. at the very least it would need a comment
> explaining why it only needs calling here and not every time a put_prev_task()
> and set_next_task() pair cross a class boundary -- which would be the
> consistent thing to do.
>
> Now, IIRC, you need a class call that indicates you're about to lose the CPU
> so that you can kick the task to another CPU or somesuch. And last time I got
> it backwards and suggested adding an argument to pick_next_task(), but what
> about put_prev_task()?
>
> Can't we universally push put_prev_task() after the pick loop? Then we get
> something like:
Yeah, the problem with put_prev_task() was that it ran before the next task
was picked, so we couldn't know what the next class would be.
> next = pick_task();
> if (next != prev) {
> put_prev_task(rq, prev, next->class != prev->class);
> set_next_task(rq, next);
> }
>
> I have patches for most of this for fair (in my eevdf queue), and I
> think the others are doable, after all, this is more or less what we do
> for SCHED_CORE already.
>
> /me went off hacking for a bit
>
> I've done this; find the results at: queue.git sched/prep
>
> I've also rebased the sched_ext tree on top of that with the below delta, which
> you can find at: queue.git sched/scx
Hmm... took a brief look, but I think I'm missing something. Doesn't
put_prev_task() need to take place before pick_task() so that the previously
running task can be considered when pick_task() is picking the next task to
run?
> This led me to discover that you're passing the task of the other class into
> the bpf stuff -- I don't think that is a sane thing to do. You get preempted,
> it doesn't matter from which higher class or by which specific task, a policy
> must not care about that. So I kinda bodged it, but I really think this should
> be taken out.
At least for visibility, I think it makes sense. One can attach extra BPF
progs to track the preemption events, but it can be useful for the scheduler
itself to be able to collect when and why it's getting preempted. Also, in
more controlled environments, which RT task is preempting the task can be
a useful, if not always reliable, hint as to whether it'd be better to
bounce to another CPU right away. That said, we can drop it, e.g. if
it makes the implementation unnecessarily complicated.
> I also found you have some terrible !SMP hack in there, which I've
> broken, I've disabled it for now. This needs a proper fix, and not
> something ugly like you did.
Yeah, this came up before. On UP, SCX either needs to call the balance
callback as that's where the whole dispatch logic is called from (which
makes sense for it as dispatching often involves balancing operations), or
SCX itself needs to call it directly in a matching sequence. Just enabling
the balance callback on UP && SCX would be the cleanest.
> Anyway, it seems to build, boot and pass the sched_ext selftest:
>
> PASSED: 21
> SKIPPED: 0
> FAILED: 0
>
> (albeit with a fair amount of console noise -- is that expected?)
Yeah, the selftests trigger a lot of error conditions so that's expected.
> Also, why does that thing hard depend on DEBUG_INFO_BTF? (at least having
> that option enabled no longer explodes build times like it used to)
I think that's necessary for building the schedulers, at least. We didn't
have the dependency earlier and people were getting confused.
> Also should we /CONFIG_SCHED_CLASS_EXT/CONFIG_SCHED_BPF/ ? Then
> something like: grep BPF foo-build/.config
> more easily shows what's what.
I don't know. Isn't that too inconsistent with the other classes? And down
the line there may be BPF-related additions to the scheduler.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-23 19:34 ` Tejun Heo
@ 2024-07-24 8:52 ` Peter Zijlstra
2024-07-24 17:38 ` David Vernet
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Peter Zijlstra @ 2024-07-24 8:52 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Tue, Jul 23, 2024 at 09:34:13AM -1000, Tejun Heo wrote:
> > > + for_each_active_class(class) {
> > > p = class->pick_next_task(rq);
> > > - if (p)
> > > + if (p) {
> > > + const struct sched_class *prev_class = prev->sched_class;
> > > +
> > > + if (class != prev_class && prev_class->switch_class)
> > > + prev_class->switch_class(rq, p);
> > > return p;
> > > + }
> > > }
> >
> > So I really don't like this one.. at the very least it would need a comment
> > explaining why it only needs calling here and not every time a put_prev_task()
> > and set_next_task() pair cross a class boundary -- which would be the
> > consistent thing to do.
> >
> > Now, IIRC, you need a class call that indicates you're about to lose the CPU
> > so that you can kick the task to another CPU or somesuch. And last time I got
> > it backwards and suggested adding an argument to pick_next_task(), but what
> > about put_prev_task()?
> >
> > Can't we universally push put_prev_task() after the pick loop? Then we get
> > something like:
>
> Yeah, the problem with put_prev_task() was that it was before the next task
> was picked, so we couldn't know what the next class should be.
Right. But I don't think it needs to stay that way.
> > next = pick_task();
> > if (next != prev) {
> > put_prev_task(rq, prev, next->class != prev->class);
> > set_next_task(rq, next);
> > }
> >
> > I have patches for most of this for fair (in my eevdf queue), and I
> > think the others are doable, after all, this is more or less what we do
> > for SCHED_CORE already.
> >
> > /me went off hacking for a bit
> >
> > I've done this; find the results at: queue.git sched/prep
> >
> > I've also rebased the sched_ext tree on top of that with the below delta, which
> > you can find at: queue.git sched/scx
>
> Hmm... took a brief look, but I think I'm missing something. Doesn't
> put_prev_task() need to take place before pick_task() so that the previously
> running task can be considered when pick_task() is picking the next task to
> run?
So pick_task() came from the SCHED_CORE crud, which does a remote pick
and as such isn't able to do a put -- remote is still running its
current etc.
So pick_task() *SHOULD* already consider its current and pick it if it is a
better candidate than whatever is on the queue.
If we have a pick_task() that doesn't do that, it's a pre-existing bug
and needs fixing anyhow.
> > This led me to discover that you're passing the task of the other class into
> > the bpf stuff -- I don't think that is a sane thing to do. You get preempted,
> > it doesn't matter from which higher class or by which specific task, a policy
> > must not care about that. So I kinda bodged it, but I really think this should
> > be taken out.
>
> At least for visibility, I think it makes sense. One can attach extra BPF
> progs to track the preemption events but it can be useful for the scheduler
> itself to be able to collect when and why it's getting preempted. Also, in
> more controlled environments, which RT task is preempting the task can be
> a useful, though not always reliable, hint as to whether it'd be better to
> bounce to another CPU right away or not. That said, we can drop it e.g. if
> it makes the implementation unnecessarily complicated.
It shouldn't be too hard to 'fix', it just feels wrong to hand the next
task into the prev class for something like this.
> > I also found you have some terrible !SMP hack in there, which I've
> > broken, I've disabled it for now. This needs a proper fix, and not
> > something ugly like you did.
>
> Yeah, this came up before. On UP, SCX either needs to call the balance
> callback as that's where the whole dispatch logic is called from (which
> makes sense for it as dispatching often involves balancing operations), or
> SCX itself needs to call it directly in a matching sequence. Just enabling
> balance callback on UP && SCX would be the cleanest.
Or make scx hard depend on SMP? Are there really still people using !SMP
-- and I suppose more importantly, do we care?
I mean, they could always run an SMP kernel on their UP potato if they
*really* feel they need this.
> > Anyway, it seems to build, boot and pass the sched_ext selftest:
> >
> > PASSED: 21
> > SKIPPED: 0
> > FAILED: 0
> >
> > (albeit with a fair amount of console noise -- is that expected?)
>
> Yeah, the selftests trigger a lot of error conditions so that's expected.
>
> > Also, why does that thing hard depend on DEBUG_BTF? (at least having
> > that option enabled no longer explodes build times like it used to)
>
> That's necessary for building the schedulers, at least, I think. We didn't
> have that earlier and people were getting confused.
It's what made it difficult for me to even build this stuff, simple
things like:
cd foo-build; ../scripts/config --enable CONFIG_SCHED_CLASS_EXT; cd -
don't work (nor error out on the lack of dependencies), and my build
scripts were force disabling the BTF crud -- and thus silently also the
scx thing.
It looks like I can change my build scripts, because the reason I did
that was that BTF made the build times explode and that is no longer the
case.
> > Also should we /CONFIG_SCHED_CLASS_EXT/CONFIG_SCHED_BPF/ ? Then
> > something like: grep BPF foo-build/.config
> > more easily shows what's what.
>
I don't know. Isn't that too inconsistent with the other classes? And down
the line, there may be BPF related additions to the scheduler.
We can always rename crap later if needed. Maybe I'm the weird one doing
grep on .config, dunno. I also edit raw diff files and people think
that's weird too.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-24 8:52 ` Peter Zijlstra
@ 2024-07-24 17:38 ` David Vernet
2024-07-31 1:36 ` Tejun Heo
2024-08-06 19:56 ` Tejun Heo
2 siblings, 0 replies; 33+ messages in thread
From: David Vernet @ 2024-07-24 17:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tejun Heo, Linus Torvalds, linux-kernel, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Wed, Jul 24, 2024 at 10:52:21AM +0200, Peter Zijlstra wrote:
[...]
> > > Also, why does that thing hard depend on DEBUG_BTF? (at least having
> > > that option enabled no longer explodes build times like it used to)
> >
> > That's necessary for building the schedulers, at least, I think. We didn't
> > have that earlier and people were getting confused.
>
> It's what made it difficult for me to even build this stuff, simple
> things like:
>
> cd foo-build; ../scripts/config --enable CONFIG_SCHED_CLASS_EXT; cd -
>
> don't work (nor error out on the lack of dependencies), and my build
> scripts were force disabling the BTF crud -- and thus silently also the
> scx thing.
>
> It looks like I can change my build scripts, because the reason I did
> that was that BTF made the build times explode and that is no longer the
> case.
Yeah so to clarify the situation here -- BTF is a hard requirement for
sched_ext because BPF requires it in order to call kfuncs [0] (note that
the APIs have changed significantly since that article was written). The
reason it's required, among other things, is so that the verifier can
check that a BPF prog is safely invoking a kernel function with the
proper types.
[0]: https://lwn.net/Articles/856005/
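For context, the dependency David describes surfaces in Kconfig roughly like this. This is a paraphrase from memory, not a verbatim copy of the kernel's Kconfig (and if I recall correctly, the option Peter calls DEBUG_BTF is spelled DEBUG_INFO_BTF there), so treat the exact symbol list as an assumption:

```kconfig
# Sketch of the relevant dependency chain (paraphrased, not verbatim):
config SCHED_CLASS_EXT
	bool "Extensible Scheduling Class"
	depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
```

This is why `scripts/config --enable CONFIG_SCHED_CLASS_EXT` silently does nothing when the BPF/BTF options are off: the symbol is simply unavailable, and Kconfig drops it without an error.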
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-15 22:32 [GIT PULL] sched_ext: Initial pull request for v6.11 Tejun Heo
2024-07-23 16:33 ` Peter Zijlstra
@ 2024-07-25 1:19 ` Qais Yousef
2024-07-30 9:04 ` Peter Zijlstra
` (2 more replies)
1 sibling, 3 replies; 33+ messages in thread
From: Qais Yousef @ 2024-07-25 1:19 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Peter Zijlstra, Alexei Starovoitov, Vincent Guittot
On 07/15/24 12:32, Tejun Heo wrote:
> NOTE: I couldn't get git-request-pull to generate the correct diffstat. The
> diffstat at the end is generated against the merged base and should
> match the diff when sched_ext-for-6.11 is pulled after tip/sched/core
> and bpf/for-next.
>
> The following changes since commit d329605287020c3d1c3b0dadc63d8208e7251382:
>
> sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks (2024-07-04 15:59:52 +0200)
>
> are available in the Git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/ tags/sched_ext-for-6.11
>
> for you to fetch changes up to 8bb30798fd6ee79e4041a32ca85b9f70345d8671:
>
> sched_ext: Fixes incorrect type in bpf_scx_init() (2024-07-14 18:10:10 -1000)
>
> ----------------------------------------------------------------
> sched_ext: Initial pull request for v6.11
>
> This is the initial pull request of sched_ext. The v7 patchset
> (https://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org) is
> applied on top of tip/sched/core + bpf/for-next as of Jun 18th.
>
> tip/sched/core 793a62823d1c ("sched/core: Drop spinlocks on contention iff kernel is preemptible")
> bpf/for-next f6afdaf72af7 ("Merge branch 'bpf-support-resilient-split-btf'")
>
> Since then, the following changes:
>
> - cpuperf support which was a part of the v6 patchset was posted separately
> and then applied after reviews.
I just reviewed this and I think you're going in the wrong direction here.
I don't think the current level of review was sufficient and we're rushing
things to get them into 6.11.
https://lore.kernel.org/lkml/20240724234527.6m43t36puktdwn2g@airbuntu/
We really shouldn't change how schedutil works. The governor is supposed to
behave in a certain way, and we need to ensure consistency. I think you should
look at how you make your scheduler compatible with it. Adding hooks to say
apply this perf value that I want is a recipe for randomness.
Generally I do have big concerns about sched_ext being loaded causing spurious
bug reports, as it changes the behavior of the scheduler and the kernel is not
trustworthy when a sched_ext scheduler is loaded. Like out-of-tree modules, it
should cause the kernel to be tainted. Something I asked for a few years back
when Gushchin sent the first proposal
https://lwn.net/Articles/873244/
https://lore.kernel.org/lkml/20211011163852.s4pq45rs2j3qhdwl@e107158-lin.cambridge.arm.com/
How can we trust bug and regression reports when out-of-tree code was loaded
that intrusively changes the way the kernel behaves? This must be marked as
a kernel TAINT, otherwise we're doomed trying to fix out-of-tree code.
And there's another general problem of regression reports caused by out-of-tree
code failing to load as the scheduler evolves. We need to continue to be able
to change our code freely without worrying about breaking out-of-tree code.
What is the regression rule? We don't want to be limited in making in-kernel
changes because out-of-tree code will now fail, either to load or to run as
intended. How is the current code designed to fail safe when the external
scheduler is no longer compatible with the existing kernel and *they* need to
rewrite their code, pretty much the way it goes for out-of-tree modules now?
Thanks
--
Qais Yousef
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-25 1:19 ` Qais Yousef
@ 2024-07-30 9:04 ` Peter Zijlstra
2024-07-31 1:11 ` Tejun Heo
2024-07-31 1:22 ` Tejun Heo
2024-08-01 2:50 ` Russell Haley
2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-07-30 9:04 UTC (permalink / raw)
To: Qais Yousef
Cc: Tejun Heo, Linus Torvalds, linux-kernel, David Vernet,
Ingo Molnar, Alexei Starovoitov, Vincent Guittot
On Thu, Jul 25, 2024 at 02:19:07AM +0100, Qais Yousef wrote:
> We really shouldn't change how schedutil works. The governor is supposed to
> behave in a certain way, and we need to ensure consistency. I think you should
> look at how you make your scheduler compatible with it. Adding hooks to say
> apply this perf value that I want is a recipe for randomness.
That would be this part right?
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..e683e5d08daa 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,8 +197,10 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
>
> + if (!scx_switched_all())
> + util += cpu_util_cfs_boost(sg_cpu->cpu);
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> util = max(util, boost);
> sg_cpu->bw_min = min;
> @@ -325,16 +327,35 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> }
>
> #ifdef CONFIG_NO_HZ_COMMON
> -static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> {
> - unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> - bool ret = idle_calls == sg_cpu->saved_idle_calls;
> + unsigned long idle_calls;
> + bool ret;
> +
> + /*
> + * The heuristics in this function is for the fair class. For SCX, the
> + * performance target comes directly from the BPF scheduler. Let's just
> + * follow it.
> + */
> + if (scx_switched_all())
> + return false;
This one does seem really weird. It makes schedutil behave significantly
differently from the BPF pov depending on whether you have this partial mode
on or not.
So I would really like this to be reconsidered as I agree with Qais,
things should be consistent.
> + /* if capped by uclamp_max, always update to be in compliance */
> + if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
> + return false;
> +
> + /*
> + * Maintain the frequency if the CPU has not been idle recently, as
> + * reduction is likely to be premature.
> + */
> + idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> + ret = idle_calls == sg_cpu->saved_idle_calls;
>
> sg_cpu->saved_idle_calls = idle_calls;
> return ret;
> }
> #else
> -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; }
> #endif /* CONFIG_NO_HZ_COMMON */
>
> /*
> @@ -382,14 +403,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> return;
>
> next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
> - /*
> - * Do not reduce the frequency if the CPU has not been idle
> - * recently, as the reduction is likely to be premature then.
> - *
> - * Except when the rq is capped by uclamp_max.
> - */
> - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
> - sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
> +
> + if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
> !sg_policy->need_freq_update) {
> next_f = sg_policy->next_freq;
>
> @@ -436,14 +451,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
> return;
>
> - /*
> - * Do not reduce the target performance level if the CPU has not been
> - * idle recently, as the reduction is likely to be premature then.
> - *
> - * Except when the rq is capped by uclamp_max.
> - */
> - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
> - sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
> + if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util)
> sg_cpu->util = prev_util;
>
> cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-30 9:04 ` Peter Zijlstra
@ 2024-07-31 1:11 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2024-07-31 1:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Qais Yousef, Linus Torvalds, linux-kernel, David Vernet,
Ingo Molnar, Alexei Starovoitov, Vincent Guittot
Hello,
On Tue, Jul 30, 2024 at 11:04:43AM +0200, Peter Zijlstra wrote:
> > #ifdef CONFIG_NO_HZ_COMMON
> > -static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> > {
> > - unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> > - bool ret = idle_calls == sg_cpu->saved_idle_calls;
> > + unsigned long idle_calls;
> > + bool ret;
> > +
> > + /*
> > + * The heuristics in this function is for the fair class. For SCX, the
> > + * performance target comes directly from the BPF scheduler. Let's just
> > + * follow it.
> > + */
> > + if (scx_switched_all())
> > + return false;
>
> This one does seem really weird. It makes schedutil behave significantly
> different from the BPF pov depending on if you have this partial mode on
> or not.
>
> So I would really like this to be reconsidered as I agree with Qais,
> things should be consistent.
I replied in the other thread and Vincent raised it too. To reiterate, when
switched_all(), if we want to keep accumulating the util signal from the fair
class, we need to keep calling fair's update_blocked_averages() so that the
value can decay. We can, but it seems silly to keep calling it just to decay
the value to zero when we know it's becoming and staying zero.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-25 1:19 ` Qais Yousef
2024-07-30 9:04 ` Peter Zijlstra
@ 2024-07-31 1:22 ` Tejun Heo
2024-08-01 13:17 ` Qais Yousef
2024-08-01 2:50 ` Russell Haley
2 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-07-31 1:22 UTC (permalink / raw)
To: Qais Yousef
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Peter Zijlstra, Alexei Starovoitov, Vincent Guittot
Hello,
On Thu, Jul 25, 2024 at 02:19:07AM +0100, Qais Yousef wrote:
> We really shouldn't change how schedutil works. The governor is supposed to
> behave in a certain way, and we need to ensure consistency. I think you should
> look at how you make your scheduler compatible with it. Adding hooks to say
> apply this perf value that I want is a recipe for randomness.
You made the same point in another thread, so let's discuss it there but
it's not changing the relationship between schedutil and sched class.
schedutil collects utility signals from sched classes and then translates
that to cpufreq operations. For SCX scheds, the only way to get such util
signals is asking the BPF scheduler. Nobody else knows. It's loading a
completely new scheduler after all.
> Generally I do have big concerns about sched_ext being loaded causing spurious
> bug report as it changes the behavior of the scheduler and the kernel is not
> trusted when sched_ext scheduler is loaded. Like out-of-tree modules, it should
> cause the kernel to be tainted. Something I asked for few years back when
> Gushchin sent the first proposal
>
> How can we trust bug and regression report when out-of-tree code was loaded
> that intrusively changes the way the kernel behaves? This must be marked as
> a kernel TAINT otherwise we're doomed trying to fix out of tree code.
You raised this in the other thread too but I don't think taint fits the bill
here. Taints are useful when the impact is persistent, so that we can know
that a later failure may have been caused by an earlier thing which might
not be around anymore. A SCX scheduler is not supposed to leave any
persistent impact on the system. If it's loaded, we can see it's loaded in
oops dumps and other places. If it's not, it shouldn't really be a factor.
> And there's another general problem of regression reports due to failure to
> load code due to changes to how the scheduler evolves. We need to continue to
> be able to change our code freely without worrying about breaking out-of-tree
> code. What is the regression rule? We don't want to be limited to be able to
> make in-kernel changes because out-of-tree code will fail now; either to load
> or to run as intended. How is the current code designed to handle failsafe when
> the external scheduler is no longer compatible with existing kernel and *they*
> need to rewrite their code, pretty much the way it goes for out-of-tree modules
> now?
It's the same as other BPF hooks. We don't want to break willy-nilly but we
can definitely break backward compatibility if necessary. This has been
discussed to death and I don't think we can add much by litigating the case
again.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-24 8:52 ` Peter Zijlstra
2024-07-24 17:38 ` David Vernet
@ 2024-07-31 1:36 ` Tejun Heo
2024-08-02 11:10 ` Peter Zijlstra
2024-08-06 21:10 ` Peter Zijlstra
2024-08-06 19:56 ` Tejun Heo
2 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2024-07-31 1:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello,
On Wed, Jul 24, 2024 at 10:52:21AM +0200, Peter Zijlstra wrote:
...
> So pick_task() came from the SCHED_CORE crud, which does a remote pick
> and as such isn't able to do a put -- remote is still running its
> current etc.
>
> So pick_task() *SHOULD* already be considering its current and pick
> that if it is a better candidate than whatever is on the queue.
>
> If we have a pick_task() that doesn't do that, it's a pre-existing bug
> and needs fixing anyhow.
Right, I don't think it affects SCX in any significant way. Either way
should be fine.
...
> > Yeah, this came up before. On UP, SCX either needs to call the balance
> > callback as that's where the whole dispatch logic is called from (which
> > makes sense for it as dispatching often involves balancing operations), or
> > SCX itself needs to call it directly in a matching sequence. Just enabling
> > balance callback on UP && SCX would be the cleanest.
>
> Or make scx hard depend on SMP? Are there really still people using !SMP
> -- and I suppose more importantly, do we care?
>
> I mean, they could always run an SMP kernel on their UP potato if they
> *really* feel they need this.
Maybe, but at the same time, it's also just some isolated cruft that enables
UP support, so both sides of the scale seem similarly light-weight? I lean
towards "why not support it?" but don't feel particularly strongly about it.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-25 1:19 ` Qais Yousef
2024-07-30 9:04 ` Peter Zijlstra
2024-07-31 1:22 ` Tejun Heo
@ 2024-08-01 2:50 ` Russell Haley
2024-08-01 15:52 ` Qais Yousef
2 siblings, 1 reply; 33+ messages in thread
From: Russell Haley @ 2024-08-01 2:50 UTC (permalink / raw)
To: qyousef
Cc: ast, linux-kernel, mingo, peterz, tj, torvalds, vincent.guittot,
void
> We really shouldn't change how schedutil works. The governor is supposed to
> behave in a certain way, and we need to ensure consistency. I think you should
> look at how you make your scheduler compatible with it. Adding hooks to say
> apply this perf value that I want is a recipe for randomness.
If schedutil's behavior is perfect as-is, then why does cpu.uclamp.max
not work with values between 81-100%, which is the part of the CPU
frequency range where one pays the least in performance per Joule saved?
Why does cpu.uclamp.min have to be set all the way up and down the
cgroup hierarchy, from root to leaf, to actually affect frequency
selection? Why is sugov notorious for harming video encoding
performance[1], which is a CPU-saturating workload? Why do intel_pstate
and amd-pstate both bypass it on modern hardware?
It appears that without Android's very deeply integrated userspace
uclamp controls telling sugov what to do, its native behavior is less
than awe-inspiring. Furthermore, uclamp doesn't work especially well on
systems that violate the big.LITTLE assumption that only clamping << max
saves meaningful energy[2]. Non-Android users widely scorn sugov when
they become aware of it. Web forums are full of suggestions to switch to
perfgov, or to switch to "conservative" or disable turbo for those who
want efficiency.
That said, given how long the PELT time constant is, a bpf scheduler
that wanted to override sugov could probably cooperate with a userspace
daemon to set min and max uclamps to the same value to control frequency
selection without too much overhead, as long as it doesn't mind the
81-100% hole.
[1] https://www.phoronix.com/review/schedutil-quirky-2023
[2] Does that still hold on high-end Android devices with one or two
hot-rodded prime cores?
Thanks,
--
Russell Haley
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-31 1:22 ` Tejun Heo
@ 2024-08-01 13:17 ` Qais Yousef
2024-08-01 16:36 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: Qais Yousef @ 2024-08-01 13:17 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Peter Zijlstra, Alexei Starovoitov, Vincent Guittot
On 07/30/24 15:22, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 25, 2024 at 02:19:07AM +0100, Qais Yousef wrote:
> > We really shouldn't change how schedutil works. The governor is supposed to
> > behave in a certain way, and we need to ensure consistency. I think you should
> > look at how you make your scheduler compatible with it. Adding hooks to say
> > apply this perf value that I want is a recipe for randomness.
>
> You made the same point in another thread, so let's discuss it there but
Don't you think it's a bit rushed to include this part in the pull request?
> it's not changing the relationship between schedutil and sched class.
> schedutil collects utility signals from sched classes and then translates
> that to cpufreq operations. For SCX scheds, the only way to get such util
> signals is asking the BPF scheduler. Nobody else knows. It's loading a
> completely new scheduler after all.
But you're effectively making schedutil a userspace governor. If SCX wants to
define its own util signal, wouldn't it be more appropriate to pair it with
the userspace governor instead? It makes more sense to pair a userspace
scheduler with a userspace governor than to alter schedutil's behavior.
>
> > Generally I do have big concerns about sched_ext being loaded causing spurious
> > bug report as it changes the behavior of the scheduler and the kernel is not
> > trusted when sched_ext scheduler is loaded. Like out-of-tree modules, it should
> > cause the kernel to be tainted. Something I asked for few years back when
> > Gushchin sent the first proposal
> >
> > How can we trust bug and regression report when out-of-tree code was loaded
> > that intrusively changes the way the kernel behaves? This must be marked as
> > a kernel TAINT otherwise we're doomed trying to fix out of tree code.
>
> You raised in the other thread too but I don't think taint fits the bill
> here. Taints are useful when the impact is persistent so that we can know
That's not how I read it. It's supposed to be for things that alter the kernel
spec/functionality and make it not trustworthy. We already have a taint flag
for overriding ACPI tables. Out-of-tree modules can have lots of power to alter
things in a way that makes the kernel generally not trustworthy. Given how
intrusively the scheduler behavior can be altered with no control, I think
a taint flag to showcase it is important. Not only for us, but also for app
developers, as you don't know what people will decide to do, which can end up
causing apps to misbehave weirdly on some systems that load specific scheduler
extensions. I think both of us (kernel and app developers) want to know when
something that can cause such misbehavior has been loaded into the kernel.
> that a later failure may have been caused by an earlier thing which might
> not be around anymore. A SCX scheduler is not supposed to leave any
> persistent impact on the system. If it's loaded, we can see it's loaded in
> oops dumps and other places. If it's not, it shouldn't really be factor.
>
> > And there's another general problem of regression reports due to failure to
> > load code due to changes to how the scheduler evolves. We need to continue to
> > be able to change our code freely without worrying about breaking out-of-tree
> > code. What is the regression rule? We don't want to be limited to be able to
> > make in-kernel changes because out-of-tree code will fail now; either to load
> > or to run as intended. How is the current code designed to handle failsafe when
> > the external scheduler is no longer compatible with existing kernel and *they*
> > need to rewrite their code, pretty much the way it goes for out-of-tree modules
> > now?
>
> It's the same as other BPF hooks. We don't want to break willy-nilly but we
> can definitely break backward compatibility if necessary. This has been
> discussed to death and I don't think we can add much by litigating the case
> again.
Was this discussion on the list? I haven't seen it. Assuming the details were
discussed with the maintainers and Linus and there's an agreement in place,
that's good to know. If not, then clarity before-the-fact is better than
after-the-fact. I think the boundaries are very hazy, and regressions are one
of the major reasons that hold up the speed of scheduler development. It is
very easy to break some configuration/workload/system unwittingly. Adding more
constraints, ones that are actually harder to deal with, to the mix will make
our life exponentially more difficult.
Thanks
--
Qais Yousef
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-01 2:50 ` Russell Haley
@ 2024-08-01 15:52 ` Qais Yousef
0 siblings, 0 replies; 33+ messages in thread
From: Qais Yousef @ 2024-08-01 15:52 UTC (permalink / raw)
To: Russell Haley
Cc: ast, linux-kernel, mingo, peterz, tj, torvalds, vincent.guittot,
void
On 07/31/24 21:50, Russell Haley wrote:
> > We really shouldn't change how schedutil works. The governor is supposed to
> > behave in a certain way, and we need to ensure consistency. I think you should
> > look at how you make your scheduler compatible with it. Adding hooks to say
> > apply this perf value that I want is a recipe for randomness.
>
> If schedutil's behavior is perfect as-is, then why does cpu.uclamp.max
> not work with values between 81-100%, which is the part of the CPU
> frequency range where one pays the least in performance per Joule saved?
I think you're referring to this problem
https://lore.kernel.org/lkml/20230820210640.585311-1-qyousef@layalina.io/
which lead to Vincent schedutil rework to how estimation and constraints are
applied
9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
Do you see the problem with this applied?
> Why does cpu.uclamp.min have to be set all the way up and down the
> cgroup hierarchy, from root to leaf, to actually affect frequency
I'm not aware of this problem.
This has nothing to do with schedutil. It's probably a bug in uclamp
aggregation.
Which kernel version are you on? Is this reproducible on mainline?
> selection? Why is sugov notorious for harming video encoding
> performance[1], which is a CPU-saturating workload? Why do intel_pstate
Without analysing the workload on that particular system, it's hard to know.
But I am aware of issues with rate_limit_us being high. I already proposed an
enhancement and am pursuing further improvement [1]. rate_limit_us defaulted
to 10ms on many systems, which gives a slow reaction time for workloads that
cause the utilization signal to go up and down.
If the workload is CPU-saturated with no idle time at all, then we should run
at max frequency except for some initial rampup delay. If that's not the case
it'd be good to know. But we have no info to tell. My suspicion is that it's
bursty and goes up and down.
I will give a talk at LPC about issues with how the util signal ramps up, and
I had some patches to discuss overall response time issues in general [2].
I am also trying to consolidate how cpufreq updates are driven by the
scheduler to better reflect the state of the CPU [3] and hopefully pave the
way for better handling of perf constraints like uclamp and iowait boost (if
we ever move to a per-task iowait boost) [4].
There are a lot of teething issues to consider though. So yeah, it's not
perfect and needs lots of improvements, and there's effort to improve it. I'm
interested to learn about your problems, so please feel free to start a
separate thread on them. We can work to address them at least.
[1] https://lore.kernel.org/lkml/20240728192659.58115-1-qyousef@layalina.io/
[2] https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/
[3] https://lore.kernel.org/lkml/20240728184551.42133-1-qyousef@layalina.io/
[4] https://lore.kernel.org/lkml/20231208015242.385103-1-qyousef@layalina.io/
> and amd-pstate both bypass it on modern hardware?
I did try to raise it in the past but no one has carried out the work.
Volunteers are welcome. It's on my todo, but way down the list.
>
> It appears that without Android's very deeply integrated userspace
> uclamp controls telling sugov what to do, it's native behavior is less
Nothing is deeply integrated about a syscall to change uclamp. But there are
more folks in that world interested in creating libraries and pushing to make
it work. I sadly don't see similar interest for desktop and servers, except
for Asahi Linux in one of their blog posts. But I'm not sure if they plan
something more sophisticated than what they did.
Generally we don't create these userspace libraries, and there have been
several suggestions for us to start a library to help kick off that effort. But
none of us have the time. If there are folks interested to start this work, I'd
be happy to provide guidance.
> than awe-inspring. Futhermore, uclamp doesn't work especially well on
> systems that violate the big.LITTLE assumption that only clamping << max
I'm not sure what you mean here, but the power curve is different for every
system.
I'd expect users to try to set uclamp_max as low as possible without impacting
their perf. But there's a known problem with uclamp_max due to aggregation that
affects its effectiveness, which has been a topic of discussion in OSPM and LPC
several times. My proposal in [4] should address it, but I have to fix more
concerns/problems first. There are proposals for different approaches floating
around too.
> saves meaningful energy[2]. Non-Android users widely scorn sugov when
> they become aware of it. Web forums are full of suggestions to switch to
> perfgov, or to switch to "conservative" or disable turbo for those who
> want efficiency.
It'd be great if people came forward with their problems and helped with
analysing them instead.
>
> That said, given how long the PELT time constant is, a bpf scheduler
> that wanted to override sugov could probably cooperate with a userspace
> daemon to set min and max uclamps to the same value to control frequency
> selection without too much overhead, as long as it doesn't mind the
> 81-100% hole.
Yes. They could do that. I'm surprised no one has done a generic uclamp daemon
to auto-tune certain apps to improve their perf and efficiency. I started some
effort in the past, but sadly dropped it as I didn't have the time.
I generally think the approach of adding more QoS knobs in the kernel, then
using BPF to create app- or system-specific tuners to set them, is a more
viable option in the long run. I think BPF could hook into a user app to get
cues about bits it might struggle with and need help auto-tuning somehow, e.g. a
certain operation that requires boosting or causes an unnecessary freq spike.
>
> [1] https://www.phoronix.com/review/schedutil-quirky-2023
>
> [2] Does that still hold on high-end Android devices with one or two
> hot-rodded prime cores?
>
> Thanks,
>
> --
> Russell Haley
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-01 13:17 ` Qais Yousef
@ 2024-08-01 16:36 ` Tejun Heo
2024-08-05 1:44 ` Qais Yousef
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-08-01 16:36 UTC (permalink / raw)
To: Qais Yousef
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Peter Zijlstra, Alexei Starovoitov, Vincent Guittot
Hello,
On Thu, Aug 01, 2024 at 02:17:35PM +0100, Qais Yousef wrote:
> > You made the same point in another thread, so let's discuss it there but
>
> Don't you think it's a bit rushed to include this part in the pull request?
Not really. It seems pretty straightforward to me.
> > it's not changing the relationship between schedutil and sched class.
> > schedutil collects utility signals from sched classes and then translates
> > that to cpufreq operations. For SCX scheds, the only way to get such util
> > signals is asking the BPF scheduler. Nobody else knows. It's loading a
> > completely new scheduler after all.
>
> But you're effectively making schedutil a userspace governor. If SCX wants to
> define its own util signal, wouldn't it be more appropriate to pair it with
> user space governor instead? It makes more sense to pair userspace scheduler
> with userspace governor than alter schedutil behavior.
The *scheduler* itself is defined from userspace. I have a hard time
following why a utilization signal coming from that scheduler is all that
surprising. If the user or the scheduler implementation wants to pair it up with
a userspace governor, they can do that. I don't want to make that decision for
developers who are implementing their own schedulers.
...
> That's not how I read it. It's supposed to be for things that alter the kernel
> spec/functionality and make it not trustworthy. We already have a taint flag
> for overriding ACPI tables. Out of tree modules can have lots of power to alter
> things in a way that makes the kernel generally not trustworthy. Given how
> intrusively the scheduler behavior can be altered with no control, I think
> a taint flag to showcase it is important. Not only for us, but also for app
> developers, as you don't know what people will decide to do that can end up
> causing apps to misbehave weirdly on some systems that load specific scheduler
> extensions. I think both of us (kernel and app developers) want to know that
> something that can cause this misbehavior was loaded into the kernel.
We of course want to make sure that developers and users can tell what
they're running on. However, this doesn't really align with why taint flags
were added and how they are usually used, and it's unclear how the use of a
taint flag would improve the situation on top of the existing visibility
mechanisms (in the sysfs and oops messages). Does this mean loading any BPF
program should taint the kernel? How about changing sysctls?
> > It's the same as other BPF hooks. We don't want to break willy-nilly but we
> > can definitely break backward compatibility if necessary. This has been
> > discussed to death and I don't think we can add much by litigating the case
> > again.
>
> Was this discussion on the list? I haven't seen it. Assuming the details were
> discussed with the maintainers and Linus and there's an agreement in place,
> that's good to know. If not, then a clarity before-the-fact is better than
> after-the-fact. I think the boundaries are very hazy and regressions are one of
> the major reasons that hold up the speed of scheduler development. It is very
> easy to break some configuration/workload/system unwittingly. Adding more
> constraints that are actually harder to deal with to the mix will make our life
> exponentially more difficult.
I wasn't a first party in the discussions and don't have good pointers.
However, I know that the subject has been discussed to the moon and back a
few times and the conclusion is pretty clear at this point - after all, the
multiple ecosystems around BPF have been operating this way for quite a
while now. Maybe BPF folks have better pointers?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-31 1:36 ` Tejun Heo
@ 2024-08-02 11:10 ` Peter Zijlstra
2024-08-02 16:09 ` Tejun Heo
2024-08-06 21:10 ` Peter Zijlstra
1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-02 11:10 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Tue, Jul 30, 2024 at 03:36:27PM -1000, Tejun Heo wrote:
> > Or make scx hard depend on SMP? Are there really still people using !SMP
> > -- and I suppose more importantly, do we care?
> >
> > I mean, they could always run an SMP kernel on their UP potato if they
> > *really* feel they need this.
>
> Maybe, but at the same time, it's also just some isolated cruft that enables
> UP support, so both sides of the scale seem similarly light-weight? I lean
> towards "why not support it?" but don't feel particularly strongly about it.
So you're basically looking for something like this?
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a9f655025607..69ec02a28117 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5772,7 +5772,6 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
struct rq_flags *rf)
{
-#ifdef CONFIG_SMP
const struct sched_class *class;
/*
* We must do the balancing pass before put_prev_task(), such
@@ -5783,10 +5782,9 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
* a runnable task of @class priority or higher.
*/
for_class_range(class, prev->sched_class, &idle_sched_class) {
- if (class->balance(rq, prev, rf))
+ if (class->balance && class->balance(rq, prev, rf))
break;
}
-#endif
put_prev_task(rq, prev);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c36cc680361..40f3dc436f4f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2296,8 +2296,8 @@ struct sched_class {
void (*put_prev_task)(struct rq *rq, struct task_struct *p);
void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
-#ifdef CONFIG_SMP
int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
struct task_struct * (*pick_task)(struct rq *rq);
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-23 16:33 ` Peter Zijlstra
2024-07-23 19:34 ` Tejun Heo
@ 2024-08-02 12:20 ` Peter Zijlstra
2024-08-02 18:47 ` Tejun Heo
1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-02 12:20 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
A few more..
> +static bool scx_switching_all;
> +DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
> + WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
> + if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
> + static_branch_enable(&__scx_switched_all);
> + static_branch_disable(&__scx_switched_all);
> + WRITE_ONCE(scx_switching_all, false);
FYI the static_key contains a variable you can read if you want, see
static_key_count()/static_key_enabled(). No need to mirror the state.
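Concretely, the suggestion amounts to dropping the mirror variable and reading
the key's own state; a non-compilable sketch of the replacement read side:

```c
/* instead of:  if (READ_ONCE(scx_switching_all)) ...  */
static bool scx_switched_all(void)
{
	/* static_key_enabled() reads the count stored in the key itself */
	return static_key_enabled(&__scx_switched_all);
}
```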
> +static struct task_struct *
> +scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead)
> +{
> + struct task_struct *p;
> +retry:
> + scx_task_iter_rq_unlock(iter);
> +
> + while ((p = scx_task_iter_next(iter))) {
> + /*
> + * is_idle_task() tests %PF_IDLE which may not be set for CPUs
> + * which haven't yet been onlined. Test sched_class directly.
> + */
> + if (p->sched_class != &idle_sched_class)
> + break;
This isn't quite the same; please look at play_idle_precise() in
drivers/powercap/idle_inject.c.
That is, there are PF_IDLE tasks that are not idle_sched_class.
> + }
> + if (!p)
> + return NULL;
> +
> + iter->rq = task_rq_lock(p, &iter->rf);
> + iter->locked = p;
> +
> + /*
> + * If we see %TASK_DEAD, @p already disabled preemption, is about to do
> + * the final __schedule(), won't ever need to be scheduled again and can
> + * thus be safely ignored. If we don't see %TASK_DEAD, @p can't enter
> + * the final __schedule() while we're locking its rq and thus will stay
> + * alive until the rq is unlocked.
> + */
> + if (!include_dead && READ_ONCE(p->__state) == TASK_DEAD)
> + goto retry;
> +
> + return p;
> +}
> +static void update_curr_scx(struct rq *rq)
> +{
> + struct task_struct *curr = rq->curr;
> + u64 now = rq_clock_task(rq);
> + u64 delta_exec;
> +
> + if (time_before_eq64(now, curr->se.exec_start))
> + return;
> +
> + delta_exec = now - curr->se.exec_start;
> + curr->se.exec_start = now;
> + curr->se.sum_exec_runtime += delta_exec;
> + account_group_exec_runtime(curr, delta_exec);
> + cgroup_account_cputime(curr, delta_exec);
Could you please use update_curr_common() here?
This helps keep the accounting in one place. For instance, see this
patch:
https://lkml.kernel.org/r/20240727105031.053611186@infradead.org
That adds a sum_exec_runtime variant that is scaled by DVFS and
capacity.
You should be able to make the function:
u64 delta_exec = update_curr_common(rq);
> + struct task_struct *curr = rq->curr;
> +
> + if (curr->scx.slice != SCX_SLICE_INF) {
> + curr->scx.slice -= min(curr->scx.slice, delta_exec);
> + if (!curr->scx.slice)
> + touch_core_sched(rq, curr);
> + }
> +}
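Combining Peter's suggestion with the quoted slice handling, the function would
collapse to roughly the following (a non-compilable kernel sketch; the exact
return convention of update_curr_common() is assumed from the referenced patch):

```c
static void update_curr_scx(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	/* update_curr_common() does the exec_start/sum_exec_runtime and
	 * group/cgroup accounting in one shared place. */
	s64 delta_exec = update_curr_common(rq);

	if (delta_exec <= 0)
		return;

	if (curr->scx.slice != SCX_SLICE_INF) {
		curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec);
		if (!curr->scx.slice)
			touch_core_sched(rq, curr);
	}
}
```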
> +static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
> + u64 enq_flags)
> +{
> + struct rq *task_rq;
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * If dequeue got to @p while we were trying to lock both rq's, it'd
> + * have cleared @p->scx.holding_cpu to -1. While other cpus may have
> + * updated it to different values afterwards, as this operation can't be
> + * preempted or recurse, @p->scx.holding_cpu can never become
> + * raw_smp_processor_id() again before we're done. Thus, we can tell
> + * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
> + * still raw_smp_processor_id().
> + *
> + * See dispatch_dequeue() for the counterpart.
> + */
> + if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()))
> + return false;
> +
> + /* @p->rq couldn't have changed if we're still the holding cpu */
> + task_rq = task_rq(p);
> + lockdep_assert_rq_held(task_rq);
> +
> + WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr));
> + deactivate_task(task_rq, p, 0);
> + set_task_cpu(p, cpu_of(rq));
> + p->scx.sticky_cpu = cpu_of(rq);
(this *could* live in ->migrate_task_rq(), but yeah, you only have this
one site, so meh)
> +
> + /*
> + * We want to pass scx-specific enq_flags but activate_task() will
> + * truncate the upper 32 bit. As we own @rq, we can pass them through
> + * @rq->scx.extra_enq_flags instead.
> + */
> + WARN_ON_ONCE(rq->scx.extra_enq_flags);
> + rq->scx.extra_enq_flags = enq_flags;
eeew.. it's not just activate_task(), it's the whole callchain having
'int' flags. That said, we should be having plenty of free bits there, no?
> + activate_task(rq, p, 0);
> + rq->scx.extra_enq_flags = 0;
> +
> + return true;
> +}
> +static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
> + struct task_struct *p, struct rq *task_rq)
> +{
> + bool moved = false;
> +
> + lockdep_assert_held(&dsq->lock); /* released on return */
> +
> + /*
> + * @dsq is locked and @p is on a remote rq. @p is currently protected by
> + * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
> + * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
> + * rq lock or fail, do a little dancing from our side. See
> + * move_task_to_local_dsq().
> + */
> + WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> + task_unlink_from_dsq(p, dsq);
> + dsq_mod_nr(dsq, -1);
> + p->scx.holding_cpu = raw_smp_processor_id();
> + raw_spin_unlock(&dsq->lock);
> +
> + double_lock_balance(rq, task_rq);
> +
> + moved = move_task_to_local_dsq(rq, p, 0);
> +
> + double_unlock_balance(rq, task_rq);
> +
> + return moved;
> +}
I've gotta ask, why are you using the double_lock_balance() pattern
instead of the one in move_queued_task() that does:
lock src
dequeue src, task
set_task_cpu task, dst
unlock src
lock dst
enqueue dst, task
unlock dst
> +/*
> + * Similar to kernel/sched/core.c::is_cpu_allowed() but we're testing whether @p
> + * can be pulled to @rq.
> + */
> +static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq)
> +{
> + int cpu = cpu_of(rq);
> +
> + if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> + return false;
> + if (unlikely(is_migration_disabled(p)))
> + return false;
> + if (!(p->flags & PF_KTHREAD) && unlikely(!task_cpu_possible(cpu, p)))
> + return false;
> + if (!scx_rq_online(rq))
> + return false;
> + return true;
> +}
I'm a little confused, is_cpu_allowed() is used for that same purpose
no?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-02 11:10 ` Peter Zijlstra
@ 2024-08-02 16:09 ` Tejun Heo
2024-08-02 17:37 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-08-02 16:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello,
On Fri, Aug 02, 2024 at 01:10:18PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2024 at 03:36:27PM -1000, Tejun Heo wrote:
...
> > Maybe, but at the same time, it's also just some isolated cruft that enables
> > UP support, so both sides of the scale seem similarly light-weight? I lean
> > towards "why not support it?" but don't feel particularly strongly about it.
>
> So you're basically looking for something like this?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a9f655025607..69ec02a28117 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5772,7 +5772,6 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
> static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> struct rq_flags *rf)
> {
> -#ifdef CONFIG_SMP
> const struct sched_class *class;
> /*
> * We must do the balancing pass before put_prev_task(), such
> @@ -5783,10 +5782,9 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> * a runnable task of @class priority or higher.
> */
> for_class_range(class, prev->sched_class, &idle_sched_class) {
> - if (class->balance(rq, prev, rf))
> + if (class->balance && class->balance(rq, prev, rf))
> break;
> }
> -#endif
>
> put_prev_task(rq, prev);
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4c36cc680361..40f3dc436f4f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2296,8 +2296,8 @@ struct sched_class {
> void (*put_prev_task)(struct rq *rq, struct task_struct *p);
> void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
>
> -#ifdef CONFIG_SMP
> int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
> +#ifdef CONFIG_SMP
> int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
>
> struct task_struct * (*pick_task)(struct rq *rq);
Yes, exactly. This is how it was implemented in the first RFC patchset. If
you're okay with the above, I'd love to go back to it.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-02 16:09 ` Tejun Heo
@ 2024-08-02 17:37 ` Peter Zijlstra
0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-02 17:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Fri, Aug 02, 2024 at 06:09:03AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 02, 2024 at 01:10:18PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2024 at 03:36:27PM -1000, Tejun Heo wrote:
> ...
> > > Maybe, but at the same time, it's also just some isolated cruft that enables
> > > UP support, so both sides of the scale seem similarly light-weight? I lean
> > > towards "why not support it?" but don't feel particularly strongly about it.
> >
> > So you're basically looking for something like this?
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a9f655025607..69ec02a28117 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5772,7 +5772,6 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
> > static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> > struct rq_flags *rf)
> > {
> > -#ifdef CONFIG_SMP
> > const struct sched_class *class;
> > /*
> > * We must do the balancing pass before put_prev_task(), such
> > @@ -5783,10 +5782,9 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> > * a runnable task of @class priority or higher.
> > */
> > for_class_range(class, prev->sched_class, &idle_sched_class) {
> > - if (class->balance(rq, prev, rf))
> > + if (class->balance && class->balance(rq, prev, rf))
> > break;
> > }
> > -#endif
> >
> > put_prev_task(rq, prev);
> > }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 4c36cc680361..40f3dc436f4f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2296,8 +2296,8 @@ struct sched_class {
> > void (*put_prev_task)(struct rq *rq, struct task_struct *p);
> > void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
> >
> > -#ifdef CONFIG_SMP
> > int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
> > +#ifdef CONFIG_SMP
> > int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
> >
> > struct task_struct * (*pick_task)(struct rq *rq);
>
> Yes, exactly. This is how it was implemented in the first RFC patchset. If
> you're okay with the above, I'd love to go back to it.
Yeah, I don't find I can make myself care for SMP=n much. So let's just
do that.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-02 12:20 ` Peter Zijlstra
@ 2024-08-02 18:47 ` Tejun Heo
2024-08-06 8:27 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-08-02 18:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello,
On Fri, Aug 02, 2024 at 02:20:34PM +0200, Peter Zijlstra wrote:
>
> A few more..
>
> > +static bool scx_switching_all;
> > +DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
>
> > + WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
> > + if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
> > + static_branch_enable(&__scx_switched_all);
>
> > + static_branch_disable(&__scx_switched_all);
> > + WRITE_ONCE(scx_switching_all, false);
>
> FYI the static_key contains a variable you can read if you want, see
> static_key_count()/static_key_enabled(). No need to mirror the state.
They go off together but they don't go on together in scx_ops_enable().
switching_all goes on first, which makes new tasks go on SCX, all existing
tasks are switched, and then __scx_switched_all goes on. I can make
switching_all a static_key too, but the only time it's read is during fork
and while switching scheduler class, so it doesn't really seem worth it.
> > +static struct task_struct *
> > +scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead)
> > +{
> > + struct task_struct *p;
> > +retry:
> > + scx_task_iter_rq_unlock(iter);
> > +
> > + while ((p = scx_task_iter_next(iter))) {
> > + /*
> > + * is_idle_task() tests %PF_IDLE which may not be set for CPUs
> > + * which haven't yet been onlined. Test sched_class directly.
> > + */
> > + if (p->sched_class != &idle_sched_class)
> > + break;
>
> This isn't quite the same; please look at play_idle_precise() in
> drivers/powercap/idle_inject.c.
>
> That is, there are PF_IDLE tasks that are not idle_sched_class.
Thanks for pointing that out. scx_task_iter is used to move tasks in and
out of SCX when the BPF scheduler is attached and detached, respectively. The
requirements are:
1. __setscheduler_prio() should be able to determine the sched_class for the
task.
2. scx_ops_init_task() must be called on all tasks which can potentially
switch to SCX. This invokes ops.init_task() if implemented.
The swappers are problematic for both 1) and 2). __setscheduler_prio() won't
preserve the idle_sched_class, and it's really easy to confuse the BPF
schedulers with swappers as they aren't normal tasks (e.g. they have the
same PID). Besides, they are never going to be scheduled by SCX anyway.
Looking at idle_inject, they seem to be regular RT threads which temporarily
set PF_IDLE. AFAICS, this should be completely fine. The iterator just
really needs to avoid the swappers.
I'll add more comments.
...
> > +static void update_curr_scx(struct rq *rq)
> > +{
> > + struct task_struct *curr = rq->curr;
> > + u64 now = rq_clock_task(rq);
> > + u64 delta_exec;
> > +
> > + if (time_before_eq64(now, curr->se.exec_start))
> > + return;
> > +
> > + delta_exec = now - curr->se.exec_start;
> > + curr->se.exec_start = now;
> > + curr->se.sum_exec_runtime += delta_exec;
> > + account_group_exec_runtime(curr, delta_exec);
> > + cgroup_account_cputime(curr, delta_exec);
>
> Could you please use update_curr_common() here?
>
> This helps keep the accounting in one place. For instance, see this
> patch:
>
> https://lkml.kernel.org/r/20240727105031.053611186@infradead.org
>
> That adds a sum_exec_runtime variant that is scaled by DVFS and
> capacity.
>
> You should be able to make the function:
>
> u64 delta_exec = update_curr_common(rq);
Will do.
...
> > + /*
> > + * We want to pass scx-specific enq_flags but activate_task() will
> > + * truncate the upper 32 bit. As we own @rq, we can pass them through
> > + * @rq->scx.extra_enq_flags instead.
> > + */
> > + WARN_ON_ONCE(rq->scx.extra_enq_flags);
> > + rq->scx.extra_enq_flags = enq_flags;
>
> eeew.. it's not just activate_task(), it's the whole callchain having
> 'int' flags. That said, we should be having plenty of free bits there, no?
Yeah, this isn't pretty, but here are the rationales:
- These extra enqueue flags are only necessary for SCX. For other
sched_classes, these flags exist to communicate between the sched core and
the sched_classes. SCX also needs to communicate from the SCX core itself to
the BPF scheduler, which creates the need for extra flags.
- While possible, it's pretty cumbersome to support backward compatibility
when flag values change. Unless done proactively, it can silently and
subtly break forward compatibility.
- As the SCX flags aren't interesting to anyone else, it makes sense to put them
in the upper 32 bits. This way, they don't get in the way and the chance of
needing to shift them around is lower.
...
> > +static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
> > + struct task_struct *p, struct rq *task_rq)
> > +{
> > + bool moved = false;
> > +
> > + lockdep_assert_held(&dsq->lock); /* released on return */
> > +
> > + /*
> > + * @dsq is locked and @p is on a remote rq. @p is currently protected by
> > + * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
> > + * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
> > + * rq lock or fail, do a little dancing from our side. See
> > + * move_task_to_local_dsq().
> > + */
> > + WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> > + task_unlink_from_dsq(p, dsq);
> > + dsq_mod_nr(dsq, -1);
> > + p->scx.holding_cpu = raw_smp_processor_id();
> > + raw_spin_unlock(&dsq->lock);
> > +
> > + double_lock_balance(rq, task_rq);
> > +
> > + moved = move_task_to_local_dsq(rq, p, 0);
> > +
> > + double_unlock_balance(rq, task_rq);
> > +
> > + return moved;
> > +}
>
> I've gotta ask, why are you using the double_lock_balance() pattern
> instead of the one in move_queued_task() that does:
>
> lock src
> dequeue src, task
> set_task_cpu task, dst
> unlock src
>
> lock dst
> enqueue dst, task
> unlock dst
When !CONFIG_PREEMPTION, double_lock_balance() seems cheaper than unlocking
and locking unconditionally. Because SCX schedulers can do a lot more hot
migrations, I thought it'd be better to go that way. I haven't actually
measured anything tho, so I could be wrong.
> > +/*
> > + * Similar to kernel/sched/core.c::is_cpu_allowed() but we're testing whether @p
> > + * can be pulled to @rq.
> > + */
> > +static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq)
> > +{
> > + int cpu = cpu_of(rq);
> > +
> > + if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> > + return false;
> > + if (unlikely(is_migration_disabled(p)))
> > + return false;
> > + if (!(p->flags & PF_KTHREAD) && unlikely(!task_cpu_possible(cpu, p)))
> > + return false;
> > + if (!scx_rq_online(rq))
> > + return false;
> > + return true;
> > +}
>
> I'm a little confused, is_cpu_allowed() is used for that same purpose
> no?
Yeah, this is a bit subtle. There are two differences:
- IIUC, during migration, is_cpu_allowed() has to say "yes" to the
destination CPU as that may be the CPU that the task is being migrated to.
However, task_can_run_on_remote_rq() is asking something different - it's
asking "can I initiate migrating the task to this rq?", which it shouldn't
while migration is disabled.
- The BPF scheduler is bypassed while the rq is offline. This doesn't hinder
other migrations. It just makes the CPU state testing a non-factor for
"can the BPF scheduler pull this task to this rq?".
I suppose the function can be rewritten to be something like:
	if (!is_cpu_allowed(p, cpu))
		return false;
	if (unlikely(is_migration_disabled(p)))
		return false;
	if (!scx_rq_online(rq))
		return false;
but I'm not sure it's necessarily better. I'll add more comments explaining
what's going on.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-01 16:36 ` Tejun Heo
@ 2024-08-05 1:44 ` Qais Yousef
0 siblings, 0 replies; 33+ messages in thread
From: Qais Yousef @ 2024-08-05 1:44 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Peter Zijlstra, Alexei Starovoitov, Vincent Guittot
On 08/01/24 06:36, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 01, 2024 at 02:17:35PM +0100, Qais Yousef wrote:
> > > You made the same point in another thread, so let's discuss it there but
> >
> > Don't you think it's a bit rushed to include this part in the pull request?
>
> Not really. It seems pretty straightforward to me.
>
> > > it's not changing the relationship between schedutil and sched class.
> > > schedutil collects utility signals from sched classes and then translates
> > > that to cpufreq operations. For SCX scheds, the only way to get such util
> > > signals is asking the BPF scheduler. Nobody else knows. It's loading a
> > > completely new scheduler after all.
> >
> > But you're effectively making schedutil a userspace governor. If SCX wants to
> > define its own util signal, wouldn't it be more appropriate to pair it with
> > user space governor instead? It makes more sense to pair userspace scheduler
> > with userspace governor than alter schedutil behavior.
>
> The *scheduler* itself is defined from userspace. I have a hard time
> following why a utilization signal coming from that scheduler is all that
> surprising. If the user or the scheduler implementation wants to pair it up with
> a userspace governor, they can do that. I don't want to make that decision for
> developers who are implementing their own schedulers.
But schedutil is based on the PELT signal. Capacity values, RT pressure, irq
pressure, and DL bandwidth are all based on that. And adding them together is
based on the fact that they're all the same signal. I don't see how mixing and
matching can be compatible.
And we have uclamp for user space to influence the decisions based on PELT
already. I don't see the need for another way to influence the decision.
Is it not desired to reuse the util signal as-is? Or is there a problem that
prevents you from reusing it?
>
> ...
> > That's not how I read it. It's supposed to be for things that alter the kernel
> > spec/functionality and make it not trustworthy. We already have a taint flag
> > for overriding ACPI tables. Out of tree modules can have lots of power to alter
> > things in a way that makes the kernel generally not trustworthy. Given how
> > intrusively the scheduler behavior can be altered with no control, I think
> > a taint flag to showcase it is important. Not only for us, but also for app
> > developers, as you don't know what people will decide to do that can end up
> > causing apps to misbehave weirdly on some systems that load specific scheduler
> > extensions. I think both of us (kernel and app developers) want to know that
> > something that can cause this misbehavior was loaded into the kernel.
>
> We of course want to make sure that developers and users can tell what
> they're running on. However, this doesn't really align with why taint flags
> were added and how they are usually used, and it's unclear how the use of a
> taint flag would improve the situation on top of the existing visibility
> mechanisms (in the sysfs and oops messages). Does this mean loading any BPF
> program should taint the kernel? How about changing sysctls?
The difference here is that you're overriding decisions, not just hooking. It's
like live patching the kernel or using fault injection. There's a very visible
side effect.
A BPF program can qualify for tainting when it leads to changing the control
flow. The particular case here is not a passive observer case, but an active
overrider, and of a critical functionality. That's why I think it should be
treated like an external module.
sysctls don't change the control flow in a way that is decided outside of the
kernel.
The schedutil problem is an example of how there's a visible side effect. What
if the loaded scheduler decided to ignore uclamp hints for task placement, or
potentially any new or existing hint in sched_attr? Or if the system is
HMP with a loaded Energy Model, but the loaded scheduler doesn't support
Energy-Aware Scheduling?
IIUC, one of the goals of sched_ext is not to have to keep everything happy,
in favour of optimizing for a specific system and workload without being dragged
down by all the other things that can get in the way. So the inherent
breakage, AFAIU, is by design.
And once this is in we will lose all control over what people will do with it.
>
> > > It's the same as other BPF hooks. We don't want to break willy-nilly but we
> > > can definitely break backward compatibility if necessary. This has been
> > > discussed to death and I don't think we can add much by litigating the case
> > > again.
> >
> > Was this discussion on the list? I haven't seen it. Assuming the details were
> > discussed with the maintainers and Linus and there's an agreement in place,
> > that's good to know. If not, then clarity before the fact is better than
> > after the fact. I think the boundaries are very hazy, and regressions are one of
> > the major reasons that hold up the speed of scheduler development. It is very
> > easy to break some configuration/workload/system unwittingly. Adding more
> > constraints that are actually harder to deal with to the mix will make our life
> > exponentially more difficult.
>
> I wasn't a first party in the discussions and don't have good pointers.
> However, I know that the subject has been discussed to the moon and back a
> few times and the conclusion is pretty clear at this point - after all, the
> multiple ecosystems around BPF have been operating this way for quite a
> while now. Maybe BPF folks have better pointers?
Fair enough. I think I just want to highlight that the fragility extends to
failure to load as well as things suddenly ceasing to behave as intended after
a kernel upgrade. If we all agree that wouldn't constitute a regression that
can impact in-kernel development and Linus is onboard with that then it's all
good.
Thanks!
--
Qais Yousef
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-02 18:47 ` Tejun Heo
@ 2024-08-06 8:27 ` Peter Zijlstra
2024-08-06 19:17 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-06 8:27 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Fri, Aug 02, 2024 at 08:47:51AM -1000, Tejun Heo wrote:
> > > +static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
> > > + struct task_struct *p, struct rq *task_rq)
> > > +{
> > > + bool moved = false;
> > > +
> > > + lockdep_assert_held(&dsq->lock); /* released on return */
> > > +
> > > + /*
> > > + * @dsq is locked and @p is on a remote rq. @p is currently protected by
> > > + * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
> > > + * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
> > > + * rq lock or fail, do a little dancing from our side. See
> > > + * move_task_to_local_dsq().
> > > + */
> > > + WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> > > + task_unlink_from_dsq(p, dsq);
> > > + dsq_mod_nr(dsq, -1);
> > > + p->scx.holding_cpu = raw_smp_processor_id();
> > > + raw_spin_unlock(&dsq->lock);
> > > +
> > > + double_lock_balance(rq, task_rq);
> > > +
> > > + moved = move_task_to_local_dsq(rq, p, 0);
> > > +
> > > + double_unlock_balance(rq, task_rq);
> > > +
> > > + return moved;
> > > +}
> >
> > I've gotta ask, why are you using the double_lock_balance() pattern
> > instead of the one in move_queued_task() that does:
> >
> > lock src
> > dequeue src, task
> > set_task_cpu task, dst
> > unlock src
> >
> > lock dst
> > enqueue dst, task
> > unlock dst
>
> When !CONFIG_PREEMPTION, double_lock_balance() seems cheaper than unlocking
> and locking unconditionally. Because SCX schedulers can do a lot more hot
> migrations, I thought it'd be better to go that way. I haven't actually
> measured anything tho, so I could be wrong.
So I think the theory is something like this.
If you take a spinlock, your wait-time W is N times the hold-time H,
where the hold-time is the avg/max (depending on your analysis goals) time
you hold the lock for, and N is the contention level, i.e. the number of
waiters.
Now, when you nest locks, your hold-time increases by the wait-time
of the nested lock. In this case, since it's the 'same' lock, your
hold-time gets a recursive wait-time term, that is: H' = H + N*H.
This blows up your wait-time, which makes contention worse, because what
was W = N*H then becomes W = N*(N*H).
Anyway, at the time we saw great benefits from moving away from the
double-lock thing, it might be worth looking into when/if you see
significant lock contention; because obviously if the locks are not
contended it all doesn't matter.
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-06 8:27 ` Peter Zijlstra
@ 2024-08-06 19:17 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2024-08-06 19:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello,
On Tue, Aug 06, 2024 at 10:27:16AM +0200, Peter Zijlstra wrote:
...
> > When !CONFIG_PREEMPTION, double_lock_balance() seems cheaper than unlocking
> > and locking unconditionally. Because SCX schedulers can do a lot more hot
> > migrations, I thought it'd be better to go that way. I haven't actually
> > measured anything tho, so I could be wrong.
>
> So I think the theory is something like this.
>
> If you take a spinlock, your wait-time W is N times the hold-time H,
> where the hold-time is avg/max (depending on your analysis goals) time
> you hold the lock for, and N is the contention level or number of
> waiters etc.
>
> Now, when you nest locks, your hold-time increases by the wait-time
> of the nested lock. In this case, since it's the 'same' lock, your
> hold-time gets a recursive wait-time term, that is: H'=H+N*H.
>
> This blows up your wait-time, which makes contention worse. Because what
> was W=N*H then becomes W=N*(N*H).
Thanks for the explanation. Much appreciated.
> Anyway, at the time we saw great benefits from moving away from the
> double-lock thing, it might be worth looking into when/if you see
> significant lock contention; because obviously if the locks are not
> contended it all doesn't matter.
I think we *may* have seen this in action on a NUMA machine running a
scheduler which doesn't have topology awareness and thus was migrating tasks
across the boundary frequently. I'll see if I can reproduce it and whether
getting rid of the double locking improves the situation.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-24 8:52 ` Peter Zijlstra
2024-07-24 17:38 ` David Vernet
2024-07-31 1:36 ` Tejun Heo
@ 2024-08-06 19:56 ` Tejun Heo
2024-08-06 20:18 ` Peter Zijlstra
2 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-08-06 19:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello, Peter.
On Wed, Jul 24, 2024 at 10:52:21AM +0200, Peter Zijlstra wrote:
...
> So pick_task() came from the SCHED_CORE crud, which does a remote pick
> and as such isn't able to do a put -- remote is still running its
> current etc.
>
> So pick_task() *SHOULD* already be considering its current and pick
> that if it is a better candidate than whatever is on the queue.
>
> If we have a pick_task() that doesn't do that, it's a pre-existing bug
> and needs fixing anyhow.
I haven't applied the patch to make balance_fair() test
sched_fair_runnable() instead of rq->nr_running yet but after that I think
all review points that you raised should be addressed except for the above
pick_task() conversion. If you want to go ahead with this change, please do
so. I'll pull in tip/sched/core and update accordingly. AFAICS, it doesn't
make any significant difference to SCX one way or the other and I think
updating shouldn't be too painful.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-06 19:56 ` Tejun Heo
@ 2024-08-06 20:18 ` Peter Zijlstra
2024-08-06 20:20 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-06 20:18 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Tue, Aug 06, 2024 at 09:56:36AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Wed, Jul 24, 2024 at 10:52:21AM +0200, Peter Zijlstra wrote:
> ...
> > So pick_task() came from the SCHED_CORE crud, which does a remote pick
> > and as such isn't able to do a put -- remote is still running its
> > current etc.
> >
> > So pick_task() *SHOULD* already be considering its current and pick
> > that if it is a better candidate than whatever is on the queue.
> >
> > If we have a pick_task() that doesn't do that, it's a pre-existing bug
> > and needs fixing anyhow.
>
> I haven't applied the patch to make balance_fair() test
> sched_fair_runnable() instead of rq->nr_running yet but after that I think
> all review points that you raised should be addressed except for the above
> pick_task() conversion. If you want to go ahead with this change, please do
> so. I'll pull in tip/sched/core and update accordingly. AFAICS, it doesn't
> make any significant difference to SCX one way or the other and I think
> updating shouldn't be too painful.
OK. So my plan was to finish reading the for-6.11 pull diff, then merge
that EEVDF patch set I sent out, then post those patches I had in
sched/prep that rearrange the put_prev thing, merge those, and then ask
you to rebase the whole lot on top of that, after which I'll pull your
branch.
And while I'm behind my own schedule, I think we can still get all that
sorted.
Anyway, I'll go stick this patch somewhere so it doesn't get lost.
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-06 20:18 ` Peter Zijlstra
@ 2024-08-06 20:20 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2024-08-06 20:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello, Peter.
On Tue, Aug 06, 2024 at 10:18:45PM +0200, Peter Zijlstra wrote:
...
> OK. So my plan was to finish reading the for 6.11 pull diff, and then
> merge that eevdf patch-set I send out. Post those patches I had in
> sched/prep that re-arrange the put_prev thing. Then merge those, and
> then ask you to rebase the whole lot on top of that, after which I'll
> pull your branch.
Oh... I'm planning to keep the tree history and just send the pull request
to Linus. I can just follow the core changes and update accordingly as
necessary.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-07-31 1:36 ` Tejun Heo
2024-08-02 11:10 ` Peter Zijlstra
@ 2024-08-06 21:10 ` Peter Zijlstra
2024-08-06 21:34 ` Tejun Heo
1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-06 21:10 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Tue, Jul 30, 2024 at 03:36:27PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 24, 2024 at 10:52:21AM +0200, Peter Zijlstra wrote:
> ...
> > So pick_task() came from the SCHED_CORE crud, which does a remote pick
> > and as such isn't able to do a put -- remote is still running its
> > current etc.
> >
> > So pick_task() *SHOULD* already be considering its current and pick
> > that if it is a better candidate than whatever is on the queue.
> >
> > If we have a pick_task() that doesn't do that, it's a pre-existing bug
> > and needs fixing anyhow.
>
> Right, I don't think it affects SCX in any significant way. Either way
> should be fine.
So I just looked at things. And considering we currently want to have:
pick_next_task := pick_task() + set_next_task(.first = true)
and want to, with those other patches moving put_prev_task() around, get
to fully making pick_next_task() optional, it looks to me you're not
quite there yet. Notably:
> +static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
> +{
> + if (p->scx.flags & SCX_TASK_QUEUED) {
> + /*
> + * Core-sched might decide to execute @p before it is
> + * dispatched. Call ops_dequeue() to notify the BPF scheduler.
> + */
> + ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC);
> + dispatch_dequeue(rq, p);
> + }
> +
> + p->se.exec_start = rq_clock_task(rq);
> +
> + /* see dequeue_task_scx() on why we skip when !QUEUED */
> + if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
> + SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
> +
> + clr_task_runnable(p, true);
> +
> + /*
> + * @p is getting newly scheduled or got kicked after someone updated its
> + * slice. Refresh whether tick can be stopped. See scx_can_stop_tick().
> + */
> + if ((p->scx.slice == SCX_SLICE_INF) !=
> + (bool)(rq->scx.flags & SCX_RQ_CAN_STOP_TICK)) {
> + if (p->scx.slice == SCX_SLICE_INF)
> + rq->scx.flags |= SCX_RQ_CAN_STOP_TICK;
> + else
> + rq->scx.flags &= ~SCX_RQ_CAN_STOP_TICK;
> +
> + sched_update_tick_dependency(rq);
> +
> + /*
> + * For now, let's refresh the load_avgs just when transitioning
> + * in and out of nohz. In the future, we might want to add a
> + * mechanism which calls the following periodically on
> + * tick-stopped CPUs.
> + */
> + update_other_load_avgs(rq);
> + }
> +}
> +static struct task_struct *pick_next_task_scx(struct rq *rq)
> +{
> + struct task_struct *p;
> +
> +#ifndef CONFIG_SMP
> + /* UP workaround - see the comment at the head of put_prev_task_scx() */
> + if (unlikely(rq->curr->sched_class != &ext_sched_class))
> + balance_one(rq, rq->curr, true);
> +#endif
(should already be gone in your latest branch)
> +
> + p = first_local_task(rq);
> + if (!p)
> + return NULL;
> +
> + set_next_task_scx(rq, p, true);
> +
> + if (unlikely(!p->scx.slice)) {
> + if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
> + printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
> + p->comm, p->pid);
> + scx_warned_zero_slice = true;
> + }
> + p->scx.slice = SCX_SLICE_DFL;
> + }
This condition should probably move to set_next_task_scx(.first = true).
> +
> + return p;
> +}
> +/**
> + * pick_task_scx - Pick a candidate task for core-sched
> + * @rq: rq to pick the candidate task from
> + *
> + * Core-sched calls this function on each SMT sibling to determine the next
> + * tasks to run on the SMT siblings. balance_one() has been called on all
> + * siblings and put_prev_task_scx() has been called only for the current CPU.
> + *
> + * As put_prev_task_scx() hasn't been called on remote CPUs, we can't just look
> + * at the first task in the local dsq. @rq->curr has to be considered explicitly
> + * to mimic %SCX_TASK_BAL_KEEP.
> + */
> +static struct task_struct *pick_task_scx(struct rq *rq)
> +{
> + struct task_struct *curr = rq->curr;
> + struct task_struct *first = first_local_task(rq);
> +
> + if (curr->scx.flags & SCX_TASK_QUEUED) {
> + /* is curr the only runnable task? */
> + if (!first)
> + return curr;
> +
> + /*
> + * Does curr trump first? We can always go by core_sched_at for
> + * this comparison as it represents global FIFO ordering when
> + * the default core-sched ordering is used and local-DSQ FIFO
> + * ordering otherwise.
> + *
> + * We can have a task with an earlier timestamp on the DSQ. For
> + * example, when a current task is preempted by a sibling
> + * picking a different cookie, the task would be requeued at the
> + * head of the local DSQ with an earlier timestamp than the
> + * core-sched picked next task. Besides, the BPF scheduler may
> + * dispatch any tasks to the local DSQ anytime.
> + */
> + if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> + first->scx.core_sched_at))
> + return curr;
> + }
And the above condition seems a little core_sched specific. Is that
suitable for the primary pick function?
> +
> + return first; /* this may be %NULL */
> +}
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-06 21:10 ` Peter Zijlstra
@ 2024-08-06 21:34 ` Tejun Heo
2024-08-06 21:55 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-08-06 21:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello, Peter.
On Tue, Aug 06, 2024 at 11:10:02PM +0200, Peter Zijlstra wrote:
...
> > Right, I don't think it affects SCX in any significant way. Either way
> > should be fine.
>
> So I just looked at things. And considering we currently want to have:
>
> pick_next_task := pick_task() + set_next_task(.first = true)
>
> and want to, with those other patches moving put_prev_task() around, get
> to fully making pick_next_task() optional, it looks to me you're not
> quite there yet. Notably:
Oh yes, the code definitely needs updating. I just meant that the needed
changes are unlikely to be invasive.
...
> > + p = first_local_task(rq);
> > + if (!p)
> > + return NULL;
> > +
> > + set_next_task_scx(rq, p, true);
> > +
> > + if (unlikely(!p->scx.slice)) {
> > + if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
> > + printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
> > + p->comm, p->pid);
> > + scx_warned_zero_slice = true;
> > + }
> > + p->scx.slice = SCX_SLICE_DFL;
> > + }
>
> This condition should probably move to set_next_task_scx(.first = true).
Sure.
...
> > +static struct task_struct *pick_task_scx(struct rq *rq)
> > +{
> > + struct task_struct *curr = rq->curr;
> > + struct task_struct *first = first_local_task(rq);
> > +
> > + if (curr->scx.flags & SCX_TASK_QUEUED) {
> > + /* is curr the only runnable task? */
> > + if (!first)
> > + return curr;
> > +
> > + /*
> > + * Does curr trump first? We can always go by core_sched_at for
> > + * this comparison as it represents global FIFO ordering when
> > + * the default core-sched ordering is used and local-DSQ FIFO
> > + * ordering otherwise.
> > + *
> > + * We can have a task with an earlier timestamp on the DSQ. For
> > + * example, when a current task is preempted by a sibling
> > + * picking a different cookie, the task would be requeued at the
> > + * head of the local DSQ with an earlier timestamp than the
> > + * core-sched picked next task. Besides, the BPF scheduler may
> > + * dispatch any tasks to the local DSQ anytime.
> > + */
> > + if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> > + first->scx.core_sched_at))
> > + return curr;
> > + }
>
> And the above condition seems a little core_sched specific. Is that
> suitable for the primary pick function?
Would there be any distinction between pick_task() being called for regular
and core sched paths?
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-06 21:34 ` Tejun Heo
@ 2024-08-06 21:55 ` Peter Zijlstra
2024-08-06 22:09 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-06 21:55 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Tue, Aug 06, 2024 at 11:34:19AM -1000, Tejun Heo wrote:
> > > +static struct task_struct *pick_task_scx(struct rq *rq)
> > > +{
> > > + struct task_struct *curr = rq->curr;
> > > + struct task_struct *first = first_local_task(rq);
> > > +
> > > + if (curr->scx.flags & SCX_TASK_QUEUED) {
> > > + /* is curr the only runnable task? */
> > > + if (!first)
> > > + return curr;
> > > +
> > > + /*
> > > + * Does curr trump first? We can always go by core_sched_at for
> > > + * this comparison as it represents global FIFO ordering when
> > > + * the default core-sched ordering is used and local-DSQ FIFO
> > > + * ordering otherwise.
> > > + *
> > > + * We can have a task with an earlier timestamp on the DSQ. For
> > > + * example, when a current task is preempted by a sibling
> > > + * picking a different cookie, the task would be requeued at the
> > > + * head of the local DSQ with an earlier timestamp than the
> > > + * core-sched picked next task. Besides, the BPF scheduler may
> > > + * dispatch any tasks to the local DSQ anytime.
> > > + */
> > > + if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> > > + first->scx.core_sched_at))
> > > + return curr;
> > > + }
> >
> > And the above condition seems a little core_sched specific. Is that
> > suitable for the primary pick function?
>
> Would there be any distinction between pick_task() being called for regular
> and core sched paths?
There currently is not -- but if you need that, we can definitely add a
boolean argument or something. But I think it would be good if a policy
can inherently know if curr is the better pick.
ISTR you having two queue types, one FIFO and one vtime-ordered; for
both, I think it should be possible to determine order, right?
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-06 21:55 ` Peter Zijlstra
@ 2024-08-06 22:09 ` Tejun Heo
2024-08-10 20:45 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-08-06 22:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello,
On Tue, Aug 06, 2024 at 11:55:35PM +0200, Peter Zijlstra wrote:
...
> > > And the above condition seems a little core_sched specific. Is that
> > > suitable for the primary pick function?
> >
> > Would there be any distinction between pick_task() being called for regular
> > and core sched paths?
>
> There currently is not -- but if you need that, we can definitely add a
> boolean argument or something. But I think it would be good if a policy
Yeah, SCX might need that.
> can inherently know if curr is the better pick.
> ISTR you having two queue types, one FIFO and one vtime ordered, for
> both I think it should be possible to determine order, right?
It is tricky because the kernel part can't make assumptions about whether
two tasks are even on the same timeline. In the usual scheduling path, this
isn't a problem as the decision is made by the BPF scheduler from balance()
- if it wants to keep running the current task, it doesn't dispatch a new
one. Otherwise, it dispatches the next task.
For core sched, once dispatched, global (or rather core-wide) ordering is
established according to the wallclock dispatch order. I don't think always
following that order will break anything for scheds that support core-sched
but it is an unnecessary thing to do in the non-core-sched path.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-06 22:09 ` Tejun Heo
@ 2024-08-10 20:45 ` Peter Zijlstra
2024-08-13 19:14 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-10 20:45 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Tue, Aug 06, 2024 at 12:09:47PM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 06, 2024 at 11:55:35PM +0200, Peter Zijlstra wrote:
> ...
> > > > And the above condition seems a little core_sched specific. Is that
> > > > suitable for the primary pick function?
> > >
> > > Would there be any distinction between pick_task() being called for regular
> > > and core sched paths?
> >
> > There currently is not -- but if you need that, we can definitely add a
> > boolean argument or something. But I think it would be good if a policy
>
> Yeah, SCX might need that.
Right, the patch is trivial, of course.
> > can inherently know if curr is the better pick.
> > ISTR you having two queue types, one FIFO and one vtime ordered, for
> > both I think it should be possible to determine order, right?
>
> It is tricky because the kernel part can't make assumptions about whether
> two tasks are even on the same timeline. In the usual scheduling path, this
> isn't a problem as the decision is made by the BPF scheduler from balance()
> - if it wants to keep running the current task, it doesn't dispatch a new
> one. Otherwise, it dispatches the next task.
But I have a question: don't you clear scx.slice when a task needs to
be preempted? That is, why isn't that condition sufficient to determine
whether curr has precedence over the first queued task? If curr is still
queued and its slice is non-zero, take curr.
Am I missing something obvious?
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-10 20:45 ` Peter Zijlstra
@ 2024-08-13 19:14 ` Tejun Heo
2024-08-13 22:53 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2024-08-13 19:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello, Peter.
On Sat, Aug 10, 2024 at 10:45:42PM +0200, Peter Zijlstra wrote:
...
> > It is tricky because the kernel part can't make assumptions about whether
> > two tasks are even on the same timeline. In the usual scheduling path, this
> > isn't a problem as the decision is made by the BPF scheduler from balance()
> > - if it wants to keep running the current task, it doesn't dispatch a new
> > one. Otherwise, it dispatches the next task.
>
> But I have a question.. don't you clear scx.slice when a task needs to
> be preempted? That is, why isn't that condition sufficient to determine
> if curr has precedence over the first queued? If curr and it is still
> queued and its slice is non-zero, take curr.
scx.slice is used a bit differently from other sched classes, mostly because
there are two layers - the SCX core and the BPF scheduler itself. The BPF
scheduler uses scx.slice to tell the SCX core "don't bother asking about
this task until its current slice has been exhausted" - ie. it's a way to
offload things like tick handling and preemption by higher-priority sched
classes to the SCX core. When scx.slice expires, the BPF scheduler's
dispatch() is called, which can then decide whether to replenish the slice
of the current task or whether something else should run, and so on.
Thanks.
--
tejun
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-13 19:14 ` Tejun Heo
@ 2024-08-13 22:53 ` Peter Zijlstra
2024-08-21 23:08 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2024-08-13 22:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
On Tue, Aug 13, 2024 at 09:14:18AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Sat, Aug 10, 2024 at 10:45:42PM +0200, Peter Zijlstra wrote:
> ...
> > > It is tricky because the kernel part can't make assumptions about whether
> > > two tasks are even on the same timeline. In the usual scheduling path, this
> > > isn't a problem as the decision is made by the BPF scheduler from balance()
> > > - if it wants to keep running the current task, it doesn't dispatch a new
> > > one. Otherwise, it dispatches the next task.
> >
> > But I have a question: don't you clear scx.slice when a task needs to
> > be preempted? That is, why isn't that condition sufficient to determine
> > whether curr has precedence over the first queued task? If curr is still
> > queued and its slice is non-zero, take curr.
>
> scx.slice is used a bit different from other sched classes mostly because
> there are two layers - the SCX core and the BPF scheduler itself. The BPF
> scheduler uses scx.slice to tell the SCX core to "don't bother asking about
> it until the current slice has been exhausted" - ie. it's a way to offload
> things like tick handling and preemption by higher priority sched classes to
> SCX core. When scx.slice expires, the BPF scheduler's dispatch() is called
> which can then decide whether to replenish the slice of the current task or
> something else should run and so on.
Right, but can't we flip that on its head and state that when scx.slice
is non-zero, we should pick current and not bother asking for what's
next?
* Re: [GIT PULL] sched_ext: Initial pull request for v6.11
2024-08-13 22:53 ` Peter Zijlstra
@ 2024-08-21 23:08 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2024-08-21 23:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, linux-kernel, David Vernet, Ingo Molnar,
Alexei Starovoitov, Thomas Gleixner
Hello,
On Wed, Aug 14, 2024 at 12:53:07AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 13, 2024 at 09:14:18AM -1000, Tejun Heo wrote:
> > Hello, Peter.
> >
> > On Sat, Aug 10, 2024 at 10:45:42PM +0200, Peter Zijlstra wrote:
> > ...
> > > > It is tricky because the kernel part can't make assumptions about whether
> > > > two tasks are even on the same timeline. In the usual scheduling path, this
> > > > isn't a problem as the decision is made by the BPF scheduler from balance()
> > > > - if it wants to keep running the current task, it doesn't dispatch a new
> > > > one. Otherwise, it dispatches the next task.
> > >
> > > But I have a question.. don't you clear scx.slice when a task needs to
> > > be preempted? That is, why isn't that condition sufficient to determine
> > > if curr has precedence over the first queued? If curr and it is still
> > > queued and its slice is non-zero, take curr.
> >
> > scx.slice is used a bit different from other sched classes mostly because
> > there are two layers - the SCX core and the BPF scheduler itself. The BPF
> > scheduler uses scx.slice to tell the SCX core to "don't bother asking about
> > it until the current slice has been exhausted" - ie. it's a way to offload
> > things like tick handling and preemption by higher priority sched classes to
> > SCX core. When scx.slice expires, the BPF scheduler's dispatch() is called
> > which can then decide whether to replenish the slice of the current task or
> > something else should run and so on.
>
> Right, but can't we flip that on its head and state that when scx.slice
> is non-zero, we should pick current and not bother asking for what's
> next?
I ended up resolving this differently. The existing code distinguishes the
regular and core-sched paths because the current task gets enqueued when
balance_scx() decides to keep running it. With the pending put_prev_task
update, it becomes a lot more straightforward to not queue current when
deciding to keep running it, which makes regular and core-sched paths
identical.
Thanks.
--
tejun
end of thread, other threads:[~2024-08-21 23:08 UTC | newest]
Thread overview: 33+ messages
2024-07-15 22:32 [GIT PULL] sched_ext: Initial pull request for v6.11 Tejun Heo
2024-07-23 16:33 ` Peter Zijlstra
2024-07-23 19:34 ` Tejun Heo
2024-07-24 8:52 ` Peter Zijlstra
2024-07-24 17:38 ` David Vernet
2024-07-31 1:36 ` Tejun Heo
2024-08-02 11:10 ` Peter Zijlstra
2024-08-02 16:09 ` Tejun Heo
2024-08-02 17:37 ` Peter Zijlstra
2024-08-06 21:10 ` Peter Zijlstra
2024-08-06 21:34 ` Tejun Heo
2024-08-06 21:55 ` Peter Zijlstra
2024-08-06 22:09 ` Tejun Heo
2024-08-10 20:45 ` Peter Zijlstra
2024-08-13 19:14 ` Tejun Heo
2024-08-13 22:53 ` Peter Zijlstra
2024-08-21 23:08 ` Tejun Heo
2024-08-06 19:56 ` Tejun Heo
2024-08-06 20:18 ` Peter Zijlstra
2024-08-06 20:20 ` Tejun Heo
2024-08-02 12:20 ` Peter Zijlstra
2024-08-02 18:47 ` Tejun Heo
2024-08-06 8:27 ` Peter Zijlstra
2024-08-06 19:17 ` Tejun Heo
2024-07-25 1:19 ` Qais Yousef
2024-07-30 9:04 ` Peter Zijlstra
2024-07-31 1:11 ` Tejun Heo
2024-07-31 1:22 ` Tejun Heo
2024-08-01 13:17 ` Qais Yousef
2024-08-01 16:36 ` Tejun Heo
2024-08-05 1:44 ` Qais Yousef
2024-08-01 2:50 ` Russell Haley
2024-08-01 15:52 ` Qais Yousef