* [PATCH 0/4] sched/eevdf: Optimize reweight and pick
@ 2023-11-07 9:05 Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
This patchset makes the following contributions:
	[1/4] Fixes vruntime not being adjusted when an entity is
	      re-weighted at a !0-lag point.
	[2/4] Optimizes out the fallback search on @best_left, which
	      doubles the cost in the worst case.
	[3/4] Enables O(1) fastpath picking based on the deadline-sorted,
	      leftmost-cached rbtree.
	[4/4] Statistics for patch 3, not intended for upstream.
All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a Dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
p0: baseline, tip/master 1187c0b3a6c2
p1: p0 + patch(1)
p3: p0 + patch(1~3)
hackbench
=========
case load p0% (std%) p1% ( std%) p3% ( std%)
process-pipe group-1 1.00 ( 2.49) -1.73 ( 2.64) -3.77 ( 0.91)
process-pipe group-2 1.00 ( 5.23) +5.51 ( 2.32) -3.41 ( 4.28)
process-pipe group-4 1.00 ( 5.30) +3.53 ( 5.46) +6.51 ( 1.44)
process-pipe group-8 1.00 ( 1.36) -1.85 ( 2.22) -3.57 ( 1.06)
process-sockets group-1 1.00 ( 2.29) -2.39 ( 2.66) -2.39 ( 1.86)
process-sockets group-2 1.00 ( 3.46) +0.46 ( 1.85) +1.19 ( 2.08)
process-sockets group-4 1.00 ( 1.43) -1.98 ( 2.78) +4.52 ( 8.68)
process-sockets group-8 1.00 ( 0.95) -1.60 ( 0.94) +2.78 ( 2.14)
threads-pipe group-1 1.00 ( 1.92) +5.33 ( 1.54) +3.47 ( 1.09)
threads-pipe group-2 1.00 ( 0.64) +0.51 ( 2.31) +2.91 ( 0.43)
threads-pipe group-4 1.00 ( 3.03) -2.91 ( 2.31) +1.83 ( 1.65)
threads-pipe group-8 1.00 ( 2.55) +1.89 ( 3.04) -1.29 ( 2.32)
threads-sockets group-1 1.00 ( 0.71) +0.83 ( 0.52) -0.42 ( 0.52)
threads-sockets group-2 1.00 ( 2.48) -2.52 ( 1.20) -3.27 ( 0.59)
threads-sockets group-4 1.00 ( 1.96) +2.67 ( 2.34) +3.74 ( 1.18)
threads-sockets group-8 1.00 ( 1.09) -2.30 ( 0.51) +3.07 ( 0.62)
netperf
=======
case load p0% (std%) p1% ( std%) p3% ( std%)
TCP_RR thread-24 1.00 ( 2.48) -2.15 ( 2.38) +0.17 ( 1.95)
TCP_RR thread-48 1.00 ( 0.73) -1.59 ( 0.51) +0.47 ( 0.93)
TCP_RR thread-72 1.00 ( 1.04) -1.26 ( 1.03) -0.09 ( 1.13)
TCP_RR thread-96 1.00 ( 29.36) +70.41 ( 14.86) +17.88 ( 37.24)
TCP_RR thread-192 1.00 ( 28.29) -1.30 ( 34.03) -2.00 ( 30.63)
TCP_STREAM thread-24 1.00 ( 1.57) +0.38 ( 1.90) +0.20 ( 1.72)
TCP_STREAM thread-48 1.00 ( 0.08) -0.29 ( 0.07) +0.15 ( 0.12)
TCP_STREAM thread-72 1.00 ( 0.01) -0.00 ( 0.00) +0.00 ( 0.00)
TCP_STREAM thread-96 1.00 ( 0.76) +0.16 ( 0.65) +0.30 ( 0.47)
TCP_STREAM thread-192 1.00 ( 0.65) +0.23 ( 0.46) +0.25 ( 0.49)
UDP_RR thread-24 1.00 ( 1.74) -1.26 ( 2.41) +0.81 ( 3.02)
UDP_RR thread-48 1.00 ( 0.56) -0.40 ( 16.72) -0.98 ( 0.36)
UDP_RR thread-72 1.00 ( 0.84) -0.70 ( 0.66) -0.27 ( 0.88)
UDP_RR thread-96 1.00 ( 1.24) -0.44 ( 1.01) -0.99 ( 8.99)
UDP_RR thread-192 1.00 ( 28.02) -0.42 ( 31.59) -1.80 ( 26.23)
UDP_STREAM thread-24 1.00 (100.05) +0.31 (100.04) +0.32 (100.06)
UDP_STREAM thread-48 1.00 (104.35) +1.22 (105.14) +1.65 (104.10)
UDP_STREAM thread-72 1.00 (100.69) +1.28 (100.63) -0.17 (100.49)
UDP_STREAM thread-96 1.00 ( 99.63) +0.33 ( 99.51) -0.25 ( 99.53)
UDP_STREAM thread-192 1.00 (100.57) +2.00 (107.01) -1.21 ( 99.51)
tbench
======
case load p0% (std%) p1% ( std%) p3% ( std%)
loopback thread-24 1.00 ( 0.49) -1.47 ( 0.94) +0.08 ( 0.75)
loopback thread-48 1.00 ( 0.42) -0.04 ( 0.53) -0.06 ( 0.34)
loopback thread-72 1.00 ( 7.10) -3.33 ( 2.98) -5.06 ( 0.31)
loopback thread-96 1.00 ( 0.80) +2.65 ( 0.80) -0.68 ( 1.30)
loopback thread-192 1.00 ( 1.24) +1.21 ( 0.73) -1.78 ( 0.22)
schbench
========
case load p0% (std%) p1% ( std%) p3% ( std%)
normal mthread-1 1.00 ( 5.83) -2.83 ( 1.46) +1.24 ( 2.51)
normal mthread-2 1.00 ( 4.45) +8.94 ( 7.81) +14.24 ( 7.44)
normal mthread-4 1.00 ( 2.73) +2.53 ( 4.31) +12.44 ( 5.99)
normal mthread-8 1.00 ( 0.15) +0.21 ( 0.13) -0.34 ( 0.11)
These benchmarks show no obvious regressions.
Comments are appreciated, thanks!
Abel Wu (4):
sched/eevdf: Fix vruntime adjustment on reweight
sched/eevdf: Sort the rbtree by virtual deadline
sched/eevdf: O(1) fastpath for task selection
sched/stats: branch statistics for pick_eevdf
include/linux/sched.h | 2 +-
kernel/sched/debug.c | 11 +-
kernel/sched/fair.c | 412 +++++++++++++++++++++++++-----------------
kernel/sched/sched.h | 6 +
kernel/sched/stats.c | 6 +-
5 files changed, 263 insertions(+), 174 deletions(-)
--
2.37.3
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
` (3 more replies)
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
` (2 subsequent siblings)
3 siblings, 4 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
The vruntime of an (on_rq && !0-lag) entity needs to be adjusted
when it gets re-weighted, and the calculations can be simplified
based on the fact that re-weighting does not change the weighted
average vruntime of all the entities. Please check the proofs in
the comments.
But adjusting vruntime can also change the entity's position in the
RB-tree, hence requiring a costly re-queue to fix things up. This
might be avoided by deferring the adjustment until the entity
actually leaves the tree (dequeue/pick), but that would negatively
affect task selection and is probably not good enough either.
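The reweight math above can be sanity-checked with a minimal userspace sketch (not kernel code; the helper names and the use of plain int64_t instead of s64/u64 are hypothetical stand-ins, and exact division is assumed):

```c
#include <stdint.h>

/*
 * Userspace model of the reweight formulas derived in the comments:
 *   v' = V - (V - v) * w / w'   (vruntime; keeps lag = (V - v)*w invariant)
 *   d' = V + (d - V) * w / w'   (virtual deadline; rescaled slope)
 * V, v, d, w, w' mirror the symbols used in the proof. None of these
 * helpers exist in the kernel.
 */
static int64_t reweight_vruntime(int64_t V, int64_t v,
				 int64_t w, int64_t w_new)
{
	int64_t vlag = V - v;

	return V - vlag * w / w_new;
}

static int64_t reweight_deadline(int64_t V, int64_t d,
				 int64_t w, int64_t w_new)
{
	int64_t vslice = d - V;

	return V + vslice * w / w_new;
}
```

For example, with V=100, v=80, w=2, w'=4: the lag (V - v)*w = 40 before reweight, v' = 90, and (V - v')*w' = 40 afterwards, so lag is preserved while the relative deadline shrinks with the heavier weight.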
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 151 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 128 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8767988242ee..b00d09a9b601 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
#endif
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ unsigned long weight)
+{
+ unsigned long old_weight = se->load.weight;
+ u64 avruntime = avg_vruntime(cfs_rq);
+ s64 vlag, vslice;
+
+ /*
+ * VRUNTIME
+ * ========
+ *
+ * COROLLARY #1: The virtual runtime of the entity needs to be
+ * adjusted if re-weight at !0-lag point.
+ *
+ * Proof: For contradiction assume this is not true, so we can
+ * re-weight without changing vruntime at !0-lag point.
+ *
+ * Weight VRuntime Avg-VRuntime
+ * before w v V
+ * after w' v' V'
+ *
+ * Since lag needs to be preserved through re-weight:
+ *
+ * lag = (V - v)*w = (V'- v')*w', where v = v'
+ * ==> V' = (V - v)*w/w' + v (1)
+ *
+ * Let W be the total weight of the entities before reweight,
+ * since V' is the new weighted average of entities:
+ *
+ * V' = (WV + w'v - wv) / (W + w' - w) (2)
+ *
+ * by using (1) & (2) we obtain:
+ *
+ * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+ * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+ * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+ * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+ *
+ * Since we are doing at !0-lag point which means V != v, we
+ * can simplify (3):
+ *
+ * ==> W / (W + w' - w) = w / w'
+ * ==> Ww' = Ww + ww' - ww
+ * ==> W * (w' - w) = w * (w' - w)
+ * ==> W = w (re-weight indicates w' != w)
+ *
+ * So the cfs_rq contains only one entity, hence the vruntime of
+ * the entity @v should always be equal to the cfs_rq's weighted
+ * average vruntime @V, which means we will always re-weight
+ * at the 0-lag point, breaching the assumption. Proof completed.
+ *
+ *
+ * COROLLARY #2: Re-weight does NOT affect weighted average
+ * vruntime of all the entities.
+ *
+ * Proof: According to corollary #1, Eq. (1) should be:
+ *
+ * (V - v)*w = (V' - v')*w'
+ * ==> v' = V' - (V - v)*w/w' (4)
+ *
+ * According to the weighted average formula, we have:
+ *
+ * V' = (WV - wv + w'v') / (W - w + w')
+ * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+ * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+ * = (WV + w'V' - Vw) / (W - w + w')
+ *
+ * ==> V'*(W - w + w') = WV + w'V' - Vw
+ * ==> V' * (W - w) = (W - w) * V (5)
+ *
+ * If the entity is the only one in the cfs_rq, then reweight
+ * always occurs at 0-lag point, so V won't change. Or else
+ * there are other entities, hence W != w, then Eq. (5) turns
+ * into V' = V. So V won't change in either case, proof done.
+ *
+ *
+ * So according to corollary #1 & #2, the effect of re-weight
+ * on vruntime should be:
+ *
+ * v' = V' - (V - v) * w / w' (4)
+ * = V - (V - v) * w / w'
+ * = V - vl * w / w'
+ * = V - vl'
+ */
+ if (avruntime != se->vruntime) {
+ vlag = (s64)(avruntime - se->vruntime);
+ vlag = div_s64(vlag * old_weight, weight);
+ se->vruntime = avruntime - vlag;
+ }
+
+ /*
+ * DEADLINE
+ * ========
+ *
+ * When the weight changes, the virtual time slope changes and
+ * we should adjust the relative virtual deadline accordingly.
+ *
+ * d' = v' + (d - v)*w/w'
+ * = V' - (V - v)*w/w' + (d - v)*w/w'
+ * = V - (V - v)*w/w' + (d - v)*w/w'
+ * = V + (d - V)*w/w'
+ */
+ vslice = (s64)(se->deadline - avruntime);
+ vslice = div_s64(vslice * old_weight, weight);
+ se->deadline = avruntime + vslice;
+}
+
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
- unsigned long old_weight = se->load.weight;
+ bool curr = cfs_rq->curr == se;
if (se->on_rq) {
/* commit outstanding execution time */
- if (cfs_rq->curr == se)
+ if (curr)
update_curr(cfs_rq);
else
- avg_vruntime_sub(cfs_rq, se);
+ __dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);
- update_load_set(&se->load, weight);
-
if (!se->on_rq) {
/*
* Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
* we need to scale se->vlag when w_i changes.
*/
- se->vlag = div_s64(se->vlag * old_weight, weight);
+ se->vlag = div_s64(se->vlag * se->load.weight, weight);
} else {
- s64 deadline = se->deadline - se->vruntime;
- /*
- * When the weight changes, the virtual time slope changes and
- * we should adjust the relative virtual deadline accordingly.
- */
- deadline = div_s64(deadline * old_weight, weight);
- se->deadline = se->vruntime + deadline;
- if (se != cfs_rq->curr)
- min_deadline_cb_propagate(&se->run_node, NULL);
+ reweight_eevdf(cfs_rq, se, weight);
}
+ update_load_set(&se->load, weight);
+
#ifdef CONFIG_SMP
do {
u32 divider = get_pelt_divider(&se->avg);
@@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
- if (cfs_rq->curr != se)
- avg_vruntime_add(cfs_rq, se);
+ if (!curr) {
+ /*
+ * The entity's vruntime has been adjusted, so let's check
+ * whether the rq-wide min_vruntime needs to be updated too. Since
+ * the calculations above require a stable min_vruntime rather
+ * than an up-to-date one, we do the update at the end of the
+ * reweight process.
+ */
+ __enqueue_entity(cfs_rq, se);
+ update_min_vruntime(cfs_rq);
+ }
}
}
@@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
#ifndef CONFIG_SMP
shares = READ_ONCE(gcfs_rq->tg->shares);
-
- if (likely(se->load.weight == shares))
- return;
#else
- shares = calc_group_shares(gcfs_rq);
+ shares = calc_group_shares(gcfs_rq);
#endif
-
- reweight_entity(cfs_rq_of(se), se, shares);
+ if (unlikely(se->load.weight != shares))
+ reweight_entity(cfs_rq_of(se), se, shares);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
--
2.37.3
* [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
2023-11-07 11:03 ` Peter Zijlstra
2023-11-07 23:26 ` Benjamin Segall
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
2023-11-07 9:05 ` [PATCH 4/4] sched/stats: branch statistics for pick_eevdf Abel Wu
3 siblings, 2 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst-case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in the next patch.
This patch also removes the now-unused max_vruntime() and adjusts
the position of some functions.
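The augmented-tree invariant this relies on can be sketched in plain userspace C (a hypothetical toy struct, not the kernel's rb_node machinery): each node caches the minimum vruntime of its subtree, so a whole subtree whose cached min_vruntime is not eligible can be pruned in one comparison.

```c
#include <stdint.h>

/* Hypothetical stand-in for sched_entity + its rb_node */
struct toy_node {
	uint64_t vruntime;
	uint64_t min_vruntime;	/* min vruntime over this subtree */
	struct toy_node *left, *right;
};

/* se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime) */
static void toy_min_vruntime_update(struct toy_node *n)
{
	uint64_t m = n->vruntime;

	if (n->left && (int64_t)(n->left->min_vruntime - m) < 0)
		m = n->left->min_vruntime;
	if (n->right && (int64_t)(n->right->min_vruntime - m) < 0)
		m = n->right->min_vruntime;
	n->min_vruntime = m;
}

/* Build a 3-node tree, propagate bottom-up, return the root's cached min */
static uint64_t toy_demo_root_min(void)
{
	struct toy_node l = { .vruntime = 5 };
	struct toy_node r = { .vruntime = 20 };
	struct toy_node root = { .vruntime = 10, .left = &l, .right = &r };

	toy_min_vruntime_update(&l);
	toy_min_vruntime_update(&r);
	toy_min_vruntime_update(&root);
	return root.min_vruntime;
}
```

In the real patch this propagation is done by the RB_DECLARE_CALLBACKS() machinery on rotations and insert/erase, rather than by hand as here.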
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
include/linux/sched.h | 2 +-
kernel/sched/debug.c | 11 +-
kernel/sched/fair.c | 239 +++++++++++++++++-------------------------
kernel/sched/sched.h | 1 +
4 files changed, 107 insertions(+), 146 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 556ad71b532b..815832a0e7eb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -553,7 +553,7 @@ struct sched_entity {
struct load_weight load;
struct rb_node run_node;
u64 deadline;
- u64 min_deadline;
+ u64 min_vruntime;
struct list_head group_node;
unsigned int on_rq;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4580a450700e..168eecc209b4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -628,8 +628,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
- s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
- struct sched_entity *last, *first;
+ s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+ struct sched_entity *last, *first, *root;
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
@@ -644,15 +644,20 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SPLIT_NS(cfs_rq->exec_clock));
raw_spin_rq_lock_irqsave(rq, flags);
+ root = __pick_root_entity(cfs_rq);
+ if (root)
+ left_vruntime = root->min_vruntime;
first = __pick_first_entity(cfs_rq);
if (first)
- left_vruntime = first->vruntime;
+ left_deadline = first->deadline;
last = __pick_last_entity(cfs_rq);
if (last)
right_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
raw_spin_rq_unlock_irqrestore(rq, flags);
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
+ SPLIT_NS(left_deadline));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_vruntime",
SPLIT_NS(left_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b00d09a9b601..459487bf8824 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -530,15 +530,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
* Scheduling class tree data structure manipulation methods:
*/
-static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
-{
- s64 delta = (s64)(vruntime - max_vruntime);
- if (delta > 0)
- max_vruntime = vruntime;
-
- return max_vruntime;
-}
-
static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
{
s64 delta = (s64)(vruntime - min_vruntime);
@@ -551,7 +542,11 @@ static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
static inline bool entity_before(const struct sched_entity *a,
const struct sched_entity *b)
{
- return (s64)(a->vruntime - b->vruntime) < 0;
+ /*
+ * Tiebreak on vruntime seems unnecessary since it can
+ * hardly happen.
+ */
+ return (s64)(a->deadline - b->deadline) < 0;
}
static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -720,7 +715,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Note: using 'avg_vruntime() > se->vruntime' is inacurate due
* to the loss in precision caused by the division.
*/
-int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
struct sched_entity *curr = cfs_rq->curr;
s64 avg = cfs_rq->avg_vruntime;
@@ -733,16 +728,19 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
load += weight;
}
- return avg >= entity_key(cfs_rq, se) * load;
+ return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+}
+
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ return vruntime_eligible(cfs_rq, se->vruntime);
}
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
u64 min_vruntime = cfs_rq->min_vruntime;
- /*
- * open coded max_vruntime() to allow updating avg_vruntime
- */
s64 delta = (s64)(vruntime - min_vruntime);
+
if (delta > 0) {
avg_vruntime_update(cfs_rq, delta);
min_vruntime = vruntime;
@@ -752,9 +750,8 @@ static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
- struct sched_entity *se = __pick_first_entity(cfs_rq);
+ struct sched_entity *se = __pick_root_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
-
u64 vruntime = cfs_rq->min_vruntime;
if (curr) {
@@ -766,9 +763,9 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
if (se) {
if (!curr)
- vruntime = se->vruntime;
+ vruntime = se->min_vruntime;
else
- vruntime = min_vruntime(vruntime, se->vruntime);
+ vruntime = min_vruntime(vruntime, se->min_vruntime);
}
/* ensure we never gain time by being placed backwards. */
@@ -781,34 +778,34 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
return entity_before(__node_2_se(a), __node_2_se(b));
}
-#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+#define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
-static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node *node)
{
if (node) {
struct sched_entity *rse = __node_2_se(node);
- if (deadline_gt(min_deadline, se, rse))
- se->min_deadline = rse->min_deadline;
+ if (vruntime_gt(min_vruntime, se, rse))
+ se->min_vruntime = rse->min_vruntime;
}
}
/*
- * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
*/
-static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
{
- u64 old_min_deadline = se->min_deadline;
+ u64 old_min_vruntime = se->min_vruntime;
struct rb_node *node = &se->run_node;
- se->min_deadline = se->deadline;
- __update_min_deadline(se, node->rb_right);
- __update_min_deadline(se, node->rb_left);
+ se->min_vruntime = se->vruntime;
+ __min_vruntime_update(se, node->rb_right);
+ __min_vruntime_update(se, node->rb_left);
- return se->min_deadline == old_min_deadline;
+ return se->min_vruntime == old_min_vruntime;
}
-RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
- run_node, min_deadline, min_deadline_update);
+RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
+ run_node, min_vruntime, min_vruntime_update);
/*
* Enqueue an entity into the rb-tree:
@@ -816,18 +813,28 @@ RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
avg_vruntime_add(cfs_rq, se);
- se->min_deadline = se->deadline;
+ se->min_vruntime = se->vruntime;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
- __entity_less, &min_deadline_cb);
+ __entity_less, &min_vruntime_cb);
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
- &min_deadline_cb);
+ &min_vruntime_cb);
avg_vruntime_sub(cfs_rq, se);
}
+struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
+{
+ struct rb_node *root = cfs_rq->tasks_timeline.rb_root.rb_node;
+
+ if (!root)
+ return NULL;
+
+ return __node_2_se(root);
+}
+
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
@@ -838,6 +845,35 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
return __node_2_se(left);
}
+#ifdef CONFIG_SCHED_DEBUG
+struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
+{
+ struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
+
+ if (!last)
+ return NULL;
+
+ return __node_2_se(last);
+}
+
+/**************************************************************
+ * Scheduling class statistics methods:
+ */
+#ifdef CONFIG_SMP
+int sched_update_scaling(void)
+{
+ unsigned int factor = get_update_sysctl_factor();
+
+#define WRT_SYSCTL(name) \
+ (normalized_sysctl_##name = sysctl_##name / (factor))
+ WRT_SYSCTL(sched_base_slice);
+#undef WRT_SYSCTL
+
+ return 0;
+}
+#endif
+#endif
+
/*
* Earliest Eligible Virtual Deadline First
*
@@ -850,23 +886,28 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
* with the earliest virtual deadline.
*
* We can do this in O(log n) time due to an augmented RB-tree. The
- * tree keeps the entries sorted on service, but also functions as a
- * heap based on the deadline by keeping:
+ * tree keeps the entries sorted on deadline, but also functions as a
+ * heap based on the vruntime by keeping:
*
- * se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ * se->min_vruntime = min(se->vruntime, se->{left,right}->min_vruntime)
*
- * Which allows an EDF like search on (sub)trees.
+ * Which allows tree pruning through eligibility.
*/
-static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
- struct sched_entity *best_left = NULL;
+
+ /*
+ * We can safely skip eligibility check if there is only one entity
+ * in this cfs_rq, saving some cycles.
+ */
+ if (cfs_rq->nr_running == 1)
+ return curr && curr->on_rq ? curr : __node_2_se(node);
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
- best = curr;
/*
* Once selected, run a task until it either becomes non-eligible or
@@ -875,126 +916,40 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
return curr;
+ /* Heap search for the EEVD entity */
while (node) {
struct sched_entity *se = __node_2_se(node);
+ struct rb_node *left = node->rb_left;
/*
- * If this entity is not eligible, try the left subtree.
+ * Eligible entities in left subtree are always better
+ * choices, since they have earlier deadlines.
*/
- if (!entity_eligible(cfs_rq, se)) {
- node = node->rb_left;
+ if (left && vruntime_eligible(cfs_rq,
+ __node_2_se(left)->min_vruntime)) {
+ node = left;
continue;
}
/*
- * Now we heap search eligible trees for the best (min_)deadline
+ * The left subtree either is empty or has no eligible
+ * entity, so check the current node since it is the one
+ * with earliest deadline that might be eligible.
*/
- if (!best || deadline_gt(deadline, best, se))
+ if (entity_eligible(cfs_rq, se)) {
best = se;
-
- /*
- * Every se in a left branch is eligible, keep track of the
- * branch with the best min_deadline
- */
- if (node->rb_left) {
- struct sched_entity *left = __node_2_se(node->rb_left);
-
- if (!best_left || deadline_gt(min_deadline, best_left, left))
- best_left = left;
-
- /*
- * min_deadline is in the left branch. rb_left and all
- * descendants are eligible, so immediately switch to the second
- * loop.
- */
- if (left->min_deadline == se->min_deadline)
- break;
- }
-
- /* min_deadline is at this node, no need to look right */
- if (se->deadline == se->min_deadline)
break;
-
- /* else min_deadline is in the right branch. */
- node = node->rb_right;
- }
-
- /*
- * We ran into an eligible node which is itself the best.
- * (Or nr_running == 0 and both are NULL)
- */
- if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
- return best;
-
- /*
- * Now best_left and all of its children are eligible, and we are just
- * looking for deadline == min_deadline
- */
- node = &best_left->run_node;
- while (node) {
- struct sched_entity *se = __node_2_se(node);
-
- /* min_deadline is the current node */
- if (se->deadline == se->min_deadline)
- return se;
-
- /* min_deadline is in the left branch */
- if (node->rb_left &&
- __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
- node = node->rb_left;
- continue;
}
- /* else min_deadline is in the right branch */
node = node->rb_right;
}
- return NULL;
-}
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
-{
- struct sched_entity *se = __pick_eevdf(cfs_rq);
+ if (!best || (curr && entity_before(curr, best)))
+ best = curr;
- if (!se) {
- struct sched_entity *left = __pick_first_entity(cfs_rq);
- if (left) {
- pr_err("EEVDF scheduling fail, picking leftmost\n");
- return left;
- }
- }
-
- return se;
+ return best;
}
-#ifdef CONFIG_SCHED_DEBUG
-struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
-{
- struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
-
- if (!last)
- return NULL;
-
- return __node_2_se(last);
-}
-
-/**************************************************************
- * Scheduling class statistics methods:
- */
-#ifdef CONFIG_SMP
-int sched_update_scaling(void)
-{
- unsigned int factor = get_update_sysctl_factor();
-
-#define WRT_SYSCTL(name) \
- (normalized_sysctl_##name = sysctl_##name / (factor))
- WRT_SYSCTL(sched_base_slice);
-#undef WRT_SYSCTL
-
- return 0;
-}
-#endif
-#endif
-
static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..539c7e763f15 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2822,6 +2822,7 @@ DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
double_rq_lock(_T->lock, _T->lock2),
double_rq_unlock(_T->lock, _T->lock2))
+extern struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
--
2.37.3
* [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
2023-11-07 10:12 ` Abel Wu
2023-11-07 9:05 ` [PATCH 4/4] sched/stats: branch statistics for pick_eevdf Abel Wu
3 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
Since the RB-tree is now sorted by deadline, let's first try the
leftmost entity, which has the earliest virtual deadline. I've done
some benchmarks to evaluate its effectiveness.
All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
hackbench: process/thread + pipe/socket + 1/2/4/8 groups
netperf: TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
tbench: loopback 24/48/72/96/192 threads
schbench: 1/2/4/8 mthreads
direct: cfs_rq has only one entity
parity: RUN_TO_PARITY
fast: O(1) fastpath
slow: heap search
(%) direct parity fast slow
hackbench 92.95 2.02 4.91 0.12
netperf 68.08 6.60 24.18 1.14
tbench 67.55 11.22 20.61 0.62
schbench 69.91 2.65 25.73 1.71
The above results indicate that this fastpath really makes task
selection more efficient.
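The shape of the fastpath can be illustrated with a userspace sketch (the toy_* names are hypothetical, an array sorted by deadline stands in for the rbtree, and the linear scan stands in for the heap search): if the entity with the earliest deadline is eligible, it is the answer immediately.

```c
#include <stdint.h>
#include <stddef.h>

struct toy_entity {
	int64_t vruntime;
	int64_t deadline;
};

/* Eligible iff vruntime is no later than the (given) average vruntime V */
static int toy_eligible(const struct toy_entity *se, int64_t V)
{
	return se->vruntime <= V;
}

/*
 * es[] is sorted by deadline, mimicking the rbtree order.
 * Returns the index of the picked entity, or n if none is eligible.
 */
static size_t toy_pick(const struct toy_entity *es, size_t n, int64_t V)
{
	size_t i;

	if (n && toy_eligible(&es[0], V))
		return 0;	/* O(1) fastpath: leftmost is eligible */

	for (i = 1; i < n; i++)	/* slowpath, stands in for the heap search */
		if (toy_eligible(&es[i], V))
			return i;
	return n;
}

static size_t toy_demo_fast(void)
{
	const struct toy_entity es[] = { { 1, 10 }, { 5, 20 } };

	return toy_pick(es, 2, 3);	/* leftmost eligible */
}

static size_t toy_demo_slow(void)
{
	const struct toy_entity es[] = { { 5, 10 }, { 2, 20 } };

	return toy_pick(es, 2, 3);	/* leftmost ineligible, scan on */
}
```

The branch statistics above suggest the first test succeeds for the large majority of picks, which is why hoisting it ahead of the tree walk pays off.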
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 459487bf8824..a1fdd0c7a051 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,6 +896,7 @@ int sched_update_scaling(void)
static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+ struct sched_entity *se = __pick_first_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
@@ -904,7 +905,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
* in this cfs_rq, saving some cycles.
*/
if (cfs_rq->nr_running == 1)
- return curr && curr->on_rq ? curr : __node_2_se(node);
+ return curr && curr->on_rq ? curr : se;
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -916,9 +917,14 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
return curr;
+ /* Pick the leftmost entity if it's eligible */
+ if (se && entity_eligible(cfs_rq, se)) {
+ best = se;
+ goto found;
+ }
+
/* Heap search for the EEVD entity */
while (node) {
- struct sched_entity *se = __node_2_se(node);
struct rb_node *left = node->rb_left;
/*
@@ -931,6 +937,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
continue;
}
+ se = __node_2_se(node);
+
/*
* The left subtree either is empty or has no eligible
* entity, so check the current node since it is the one
@@ -943,7 +951,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
node = node->rb_right;
}
-
+found:
if (!best || (curr && entity_before(curr, best)))
best = curr;
--
2.37.3
* [PATCH 4/4] sched/stats: branch statistics for pick_eevdf
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
` (2 preceding siblings ...)
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
3 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
For tracing purposes only; not intended for upstream.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 12 ++++++++++--
kernel/sched/sched.h | 5 +++++
kernel/sched/stats.c | 6 ++++--
3 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1fdd0c7a051..9b07e9437526 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -899,13 +899,16 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
struct sched_entity *se = __pick_first_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
+ struct rq *rq = rq_of(cfs_rq);
/*
* We can safely skip eligibility check if there is only one entity
* in this cfs_rq, saving some cycles.
*/
- if (cfs_rq->nr_running == 1)
+ if (cfs_rq->nr_running == 1) {
+ schedstat_inc(rq->pick_direct);
return curr && curr->on_rq ? curr : se;
+ }
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -914,15 +917,20 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
* Once selected, run a task until it either becomes non-eligible or
* until it gets a new slice. See the HACK in set_next_entity().
*/
- if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+ if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline) {
+ schedstat_inc(rq->pick_parity);
return curr;
+ }
/* Pick the leftmost entity if it's eligible */
if (se && entity_eligible(cfs_rq, se)) {
+ schedstat_inc(rq->pick_fast);
best = se;
goto found;
}
+ schedstat_inc(rq->pick_slow);
+
/* Heap search for the EEVD entity */
while (node) {
struct rb_node *left = node->rb_left;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 539c7e763f15..85a79990a698 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1105,6 +1105,11 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ unsigned int pick_direct;
+ unsigned int pick_parity;
+ unsigned int pick_fast;
+ unsigned int pick_slow;
#endif
#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..4b862c798989 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->pick_direct, rq->pick_parity,
+ rq->pick_fast, rq->pick_slow);
seq_printf(seq, "\n");
--
2.37.3
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
@ 2023-11-07 9:52 ` Peter Zijlstra
2023-11-14 21:57 ` [tip: sched/urgent] " tip-bot2 for Abel Wu
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-11-07 9:52 UTC (permalink / raw)
To: Abel Wu
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On Tue, Nov 07, 2023 at 05:05:07PM +0800, Abel Wu wrote:
> vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
> it gets re-weighted, and the calculations can be simplified based
> on the fact that re-weight won't change the w-average of all the
> entities. Please check the proofs in comments.
>
> But adjusting vruntime can also cause position change in RB-tree
> hence require re-queue to fix up which might be costly. This might
> be avoided by deferring adjustment to the time the entity actually
> leaves tree (dequeue/pick), but that will negatively affect task
> selection and probably not good enough either.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Very good, thanks!
It's a bit sad we have to muck about with the tree now, but alas.
If only Google and FB could agree on what to do with this cgroup
nonsense, then maybe we could get rid of all this.
* Re: [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
@ 2023-11-07 10:12 ` Abel Wu
2023-11-07 10:42 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2023-11-07 10:12 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 11/7/23 5:05 PM, Abel Wu Wrote:
> Since the RB-tree is now sorted by deadline, let's first try the
> leftmost entity which has the earliest virtual deadline. I've done
> some benchmarks to see its effectiveness.
>
> All the benchmarks are done inside a normal cpu cgroup in a clean
> environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
> Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
>
> hackbench: process/thread + pipe/socket + 1/2/4/8 groups
> netperf: TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
> tbench: loopback 24/48/72/96/192 threads
> schbench: 1/2/4/8 mthreads
>
> direct: cfs_rq has only one entity
> parity: RUN_TO_PARITY
> fast: O(1) fastpath
> slow: heap search
>
> (%) direct parity fast slow
> hackbench 92.95 2.02 4.91 0.12
> netperf 68.08 6.60 24.18 1.14
> tbench 67.55 11.22 20.61 0.62
> schbench 69.91 2.65 25.73 1.71
>
> The above results indicate that this fastpath really makes task
> selection more efficient.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
> kernel/sched/fair.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 459487bf8824..a1fdd0c7a051 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -896,6 +896,7 @@ int sched_update_scaling(void)
> static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> {
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> + struct sched_entity *se = __pick_first_entity(cfs_rq);
> struct sched_entity *curr = cfs_rq->curr;
> struct sched_entity *best = NULL;
>
> @@ -904,7 +905,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> * in this cfs_rq, saving some cycles.
> */
> if (cfs_rq->nr_running == 1)
> - return curr && curr->on_rq ? curr : __node_2_se(node);
> + return curr && curr->on_rq ? curr : se;
Maybe we can reduce memory footprint on curr by:
return se ? se : curr;
>
> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> curr = NULL;
> @@ -916,9 +917,14 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> return curr;
>
> + /* Pick the leftmost entity if it's eligible */
> + if (se && entity_eligible(cfs_rq, se)) {
> + best = se;
> + goto found;
> + }
> +
> /* Heap search for the EEVD entity */
> while (node) {
> - struct sched_entity *se = __node_2_se(node);
> struct rb_node *left = node->rb_left;
>
> /*
> @@ -931,6 +937,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> continue;
> }
>
> + se = __node_2_se(node);
> +
> /*
> * The left subtree either is empty or has no eligible
> * entity, so check the current node since it is the one
> @@ -943,7 +951,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>
> node = node->rb_right;
> }
> -
> +found:
> if (!best || (curr && entity_before(curr, best)))
> best = curr;
>
* Re: [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection
2023-11-07 10:12 ` Abel Wu
@ 2023-11-07 10:42 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-11-07 10:42 UTC (permalink / raw)
To: Abel Wu
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On Tue, Nov 07, 2023 at 06:12:49PM +0800, Abel Wu wrote:
> > @@ -904,7 +905,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > * in this cfs_rq, saving some cycles.
> > */
> > if (cfs_rq->nr_running == 1)
> > - return curr && curr->on_rq ? curr : __node_2_se(node);
> > + return curr && curr->on_rq ? curr : se;
>
> Maybe we can reduce memory footprint on curr by:
>
> return se ? se : curr;
Irrespective, I think that logic makes more sense. If we know we have
but one task and the tree has a task, it must be that task, otherwise,
current must be it.
Anyway, I was still staring at the previous patch, flipping the tree
around like that is clever. Yes I suppose that ought to work just fine.
* Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
@ 2023-11-07 11:03 ` Peter Zijlstra
2023-11-07 11:06 ` Abel Wu
2023-11-07 23:26 ` Benjamin Segall
1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-11-07 11:03 UTC (permalink / raw)
To: Abel Wu
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On Tue, Nov 07, 2023 at 05:05:08PM +0800, Abel Wu wrote:
> Sort the task timeline by virtual deadline and keep the min_vruntime
> in the augmented tree, so we can avoid doubling the worst case cost
> and make full use of the cached leftmost node to enable O(1) fastpath
> picking in next patch.
Another very good patch, just a note:
> This patch also cleans up the unused max_vruntime() and adjusts pos
> for some functions.
There's this thing about saying 'this patch' in Changelogs: by
definition the Changelog is about 'this patch', so saying it is a bit
redundant.
It even gets mentioned in the Documentation on submitting patches
somewhere.
But what I care about more is this patch doing extra unrelated things;
those should be split out.
* Re: Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 11:03 ` Peter Zijlstra
@ 2023-11-07 11:06 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 11:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On 11/7/23 7:03 PM, Peter Zijlstra Wrote:
> On Tue, Nov 07, 2023 at 05:05:08PM +0800, Abel Wu wrote:
>
>> Sort the task timeline by virtual deadline and keep the min_vruntime
>> in the augmented tree, so we can avoid doubling the worst case cost
>> and make full use of the cached leftmost node to enable O(1) fastpath
>> picking in next patch.
>
> Another very good patch, just a note:
>
>> This patch also cleans up the unused max_vruntime() and adjusts pos
>> for some functions.
>
> There's this thing about saying 'this patch' in Changelogs, per
> definition the Changelog is about 'this patch' so saying it is a bit
> redundant.
I see, thanks.
>
> It even gets mentioned in the Documentation on submitting patches
> somewhere.
>
> But what I care about more is this patch doing extra unrelated things,
> this should be split out.
I will fix this in next version.
Thanks,
Abel
* Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
2023-11-07 11:03 ` Peter Zijlstra
@ 2023-11-07 23:26 ` Benjamin Segall
2023-11-08 8:51 ` Abel Wu
1 sibling, 1 reply; 25+ messages in thread
From: Benjamin Segall @ 2023-11-07 23:26 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
Abel Wu <wuyun.abel@bytedance.com> writes:
> Sort the task timeline by virtual deadline and keep the min_vruntime
> in the augmented tree, so we can avoid doubling the worst case cost
> and make full use of the cached leftmost node to enable O(1) fastpath
> picking in next patch.
>
> This patch also cleans up the unused max_vruntime() and adjusts pos
> for some functions.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
I've run this through my tester and it agrees that it does fulfil the
EEVDF pick (though this implementation is trivial enough that that's
fairly obvious just by reading the code, which is a nice bonus upgrade).
And it makes sense that this would help for performance, and the
fastpath seems likely to trigger most of the time for even better
results.
* Re: Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 23:26 ` Benjamin Segall
@ 2023-11-08 8:51 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-08 8:51 UTC (permalink / raw)
To: Benjamin Segall
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 11/8/23 7:26 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
>
>> Sort the task timeline by virtual deadline and keep the min_vruntime
>> in the augmented tree, so we can avoid doubling the worst case cost
>> and make full use of the cached leftmost node to enable O(1) fastpath
>> picking in next patch.
>>
>> This patch also cleans up the unused max_vruntime() and adjusts pos
>> for some functions.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>
> I've run this through my tester and it agrees that it does fulfil the
> EEVDF pick (though this implementation is trivial enough that that's
> fairly obvious just by reading the code, which is a nice bonus upgrade).
>
> And it makes sense that this would help for performance, and the
> fastpath seems likely to trigger most of the time for even better
> results.
Thanks, Benjamin!
* [tip: sched/urgent] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
@ 2023-11-14 21:57 ` tip-bot2 for Abel Wu
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
2024-02-29 9:24 ` Tianchen Ding
3 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Abel Wu @ 2023-11-14 21:57 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Abel Wu, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: eab03c23c2a162085b13200d7942fc5a00b5ccc8
Gitweb: https://git.kernel.org/tip/eab03c23c2a162085b13200d7942fc5a00b5ccc8
Author: Abel Wu <wuyun.abel@bytedance.com>
AuthorDate: Tue, 07 Nov 2023 17:05:07 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 14 Nov 2023 22:27:00 +01:00
sched/eevdf: Fix vruntime adjustment on reweight
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.
But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
---
kernel/sched/fair.c | 151 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 128 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2048138..025d909 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
#endif
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ unsigned long weight)
+{
+ unsigned long old_weight = se->load.weight;
+ u64 avruntime = avg_vruntime(cfs_rq);
+ s64 vlag, vslice;
+
+ /*
+ * VRUNTIME
+ * ========
+ *
+ * COROLLARY #1: The virtual runtime of the entity needs to be
+ * adjusted if re-weight at !0-lag point.
+ *
+ * Proof: For contradiction assume this is not true, so we can
+ * re-weight without changing vruntime at !0-lag point.
+ *
+ * Weight VRuntime Avg-VRuntime
+ * before w v V
+ * after w' v' V'
+ *
+ * Since lag needs to be preserved through re-weight:
+ *
+ * lag = (V - v)*w = (V'- v')*w', where v = v'
+ * ==> V' = (V - v)*w/w' + v (1)
+ *
+ * Let W be the total weight of the entities before reweight,
+ * since V' is the new weighted average of entities:
+ *
+ * V' = (WV + w'v - wv) / (W + w' - w) (2)
+ *
+ * by using (1) & (2) we obtain:
+ *
+ * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+ * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+ * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+ * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+ *
+ * Since we are doing at !0-lag point which means V != v, we
+ * can simplify (3):
+ *
+ * ==> W / (W + w' - w) = w / w'
+ * ==> Ww' = Ww + ww' - ww
+ * ==> W * (w' - w) = w * (w' - w)
+ * ==> W = w (re-weight indicates w' != w)
+ *
+ * So the cfs_rq contains only one entity, hence vruntime of
+ * the entity @v should always equal to the cfs_rq's weighted
+ * average vruntime @V, which means we will always re-weight
+ * at 0-lag point, thus breach assumption. Proof completed.
+ *
+ *
+ * COROLLARY #2: Re-weight does NOT affect weighted average
+ * vruntime of all the entities.
+ *
+ * Proof: According to corollary #1, Eq. (1) should be:
+ *
+ * (V - v)*w = (V' - v')*w'
+ * ==> v' = V' - (V - v)*w/w' (4)
+ *
+ * According to the weighted average formula, we have:
+ *
+ * V' = (WV - wv + w'v') / (W - w + w')
+ * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+ * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+ * = (WV + w'V' - Vw) / (W - w + w')
+ *
+ * ==> V'*(W - w + w') = WV + w'V' - Vw
+ * ==> V' * (W - w) = (W - w) * V (5)
+ *
+ * If the entity is the only one in the cfs_rq, then reweight
+ * always occurs at 0-lag point, so V won't change. Or else
+ * there are other entities, hence W != w, then Eq. (5) turns
+ * into V' = V. So V won't change in either case, proof done.
+ *
+ *
+ * So according to corollary #1 & #2, the effect of re-weight
+ * on vruntime should be:
+ *
+ * v' = V' - (V - v) * w / w' (4)
+ * = V - (V - v) * w / w'
+ * = V - vl * w / w'
+ * = V - vl'
+ */
+ if (avruntime != se->vruntime) {
+ vlag = (s64)(avruntime - se->vruntime);
+ vlag = div_s64(vlag * old_weight, weight);
+ se->vruntime = avruntime - vlag;
+ }
+
+ /*
+ * DEADLINE
+ * ========
+ *
+ * When the weight changes, the virtual time slope changes and
+ * we should adjust the relative virtual deadline accordingly.
+ *
+ * d' = v' + (d - v)*w/w'
+ * = V' - (V - v)*w/w' + (d - v)*w/w'
+ * = V - (V - v)*w/w' + (d - v)*w/w'
+ * = V + (d - V)*w/w'
+ */
+ vslice = (s64)(se->deadline - avruntime);
+ vslice = div_s64(vslice * old_weight, weight);
+ se->deadline = avruntime + vslice;
+}
+
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
- unsigned long old_weight = se->load.weight;
+ bool curr = cfs_rq->curr == se;
if (se->on_rq) {
/* commit outstanding execution time */
- if (cfs_rq->curr == se)
+ if (curr)
update_curr(cfs_rq);
else
- avg_vruntime_sub(cfs_rq, se);
+ __dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);
- update_load_set(&se->load, weight);
-
if (!se->on_rq) {
/*
* Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
* we need to scale se->vlag when w_i changes.
*/
- se->vlag = div_s64(se->vlag * old_weight, weight);
+ se->vlag = div_s64(se->vlag * se->load.weight, weight);
} else {
- s64 deadline = se->deadline - se->vruntime;
- /*
- * When the weight changes, the virtual time slope changes and
- * we should adjust the relative virtual deadline accordingly.
- */
- deadline = div_s64(deadline * old_weight, weight);
- se->deadline = se->vruntime + deadline;
- if (se != cfs_rq->curr)
- min_deadline_cb_propagate(&se->run_node, NULL);
+ reweight_eevdf(cfs_rq, se, weight);
}
+ update_load_set(&se->load, weight);
+
#ifdef CONFIG_SMP
do {
u32 divider = get_pelt_divider(&se->avg);
@@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
- if (cfs_rq->curr != se)
- avg_vruntime_add(cfs_rq, se);
+ if (!curr) {
+ /*
+ * The entity's vruntime has been adjusted, so let's check
+ * whether the rq-wide min_vruntime needs updated too. Since
+ * the calculations above require stable min_vruntime rather
+ * than up-to-date one, we do the update at the end of the
+ * reweight process.
+ */
+ __enqueue_entity(cfs_rq, se);
+ update_min_vruntime(cfs_rq);
+ }
}
}
@@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
#ifndef CONFIG_SMP
shares = READ_ONCE(gcfs_rq->tg->shares);
-
- if (likely(se->load.weight == shares))
- return;
#else
- shares = calc_group_shares(gcfs_rq);
+ shares = calc_group_shares(gcfs_rq);
#endif
-
- reweight_entity(cfs_rq_of(se), se, shares);
+ if (unlikely(se->load.weight != shares))
+ reweight_entity(cfs_rq_of(se), se, shares);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
2023-11-14 21:57 ` [tip: sched/urgent] " tip-bot2 for Abel Wu
@ 2023-11-15 15:36 ` Yiwei Lin
2023-11-16 4:48 ` Abel Wu
2024-02-29 9:24 ` Tianchen Ding
3 siblings, 1 reply; 25+ messages in thread
From: Yiwei Lin @ 2023-11-15 15:36 UTC (permalink / raw)
To: Abel Wu
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/7/23 17:05, Abel Wu wrote:
> vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
> it gets re-weighted, and the calculations can be simplified based
> on the fact that re-weight won't change the w-average of all the
> entities. Please check the proofs in comments.
>
> But adjusting vruntime can also cause position change in RB-tree
> hence require re-queue to fix up which might be costly. This might
> be avoided by deferring adjustment to the time the entity actually
> leaves tree (dequeue/pick), but that will negatively affect task
> selection and probably not good enough either.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
> kernel/sched/fair.c | 151 +++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 128 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8767988242ee..b00d09a9b601 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3666,41 +3666,140 @@ static inline void
> dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
> #endif
>
> +static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> + unsigned long weight)
> +{
> + unsigned long old_weight = se->load.weight;
> + u64 avruntime = avg_vruntime(cfs_rq);
> + s64 vlag, vslice;
> +
> + /*
> + * VRUNTIME
> + * ========
> + *
> + * COROLLARY #1: The virtual runtime of the entity needs to be
> + * adjusted if re-weight at !0-lag point.
> + *
> + * Proof: For contradiction assume this is not true, so we can
> + * re-weight without changing vruntime at !0-lag point.
> + *
> + * Weight VRuntime Avg-VRuntime
> + * before w v V
> + * after w' v' V'
> + *
> + * Since lag needs to be preserved through re-weight:
> + *
> + * lag = (V - v)*w = (V'- v')*w', where v = v'
> + * ==> V' = (V - v)*w/w' + v (1)
> + *
> + * Let W be the total weight of the entities before reweight,
> + * since V' is the new weighted average of entities:
> + *
> + * V' = (WV + w'v - wv) / (W + w' - w) (2)
> + *
> + * by using (1) & (2) we obtain:
> + *
> + * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
> + * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
> + * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
> + * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
> + *
> + * Since we are doing at !0-lag point which means V != v, we
> + * can simplify (3):
> + *
> + * ==> W / (W + w' - w) = w / w'
> + * ==> Ww' = Ww + ww' - ww
> + * ==> W * (w' - w) = w * (w' - w)
> + * ==> W = w (re-weight indicates w' != w)
> + *
> + * So the cfs_rq contains only one entity, hence vruntime of
> + * the entity @v should always equal to the cfs_rq's weighted
> + * average vruntime @V, which means we will always re-weight
> + * at 0-lag point, thus breach assumption. Proof completed.
> + *
> + *
> + * COROLLARY #2: Re-weight does NOT affect weighted average
> + * vruntime of all the entities.
> + *
> + * Proof: According to corollary #1, Eq. (1) should be:
> + *
> + * (V - v)*w = (V' - v')*w'
> + * ==> v' = V' - (V - v)*w/w' (4)
> + *
> + * According to the weighted average formula, we have:
> + *
> + * V' = (WV - wv + w'v') / (W - w + w')
> + * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
> + * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
> + * = (WV + w'V' - Vw) / (W - w + w')
> + *
> + * ==> V'*(W - w + w') = WV + w'V' - Vw
> + * ==> V' * (W - w) = (W - w) * V (5)
> + *
> + * If the entity is the only one in the cfs_rq, then reweight
> + * always occurs at 0-lag point, so V won't change. Or else
> + * there are other entities, hence W != w, then Eq. (5) turns
> + * into V' = V. So V won't change in either case, proof done.
> + *
> + *
> + * So according to corollary #1 & #2, the effect of re-weight
> + * on vruntime should be:
> + *
> + * v' = V' - (V - v) * w / w' (4)
> + * = V - (V - v) * w / w'
> + * = V - vl * w / w'
> + * = V - vl'
> + */
> + if (avruntime != se->vruntime) {
> + vlag = (s64)(avruntime - se->vruntime);
> + vlag = div_s64(vlag * old_weight, weight);
> + se->vruntime = avruntime - vlag;
> + }
> +
> + /*
> + * DEADLINE
> + * ========
> + *
> + * When the weight changes, the virtual time slope changes and
> + * we should adjust the relative virtual deadline accordingly.
> + *
> + * d' = v' + (d - v)*w/w'
> + * = V' - (V - v)*w/w' + (d - v)*w/w'
> + * = V - (V - v)*w/w' + (d - v)*w/w'
> + * = V + (d - V)*w/w'
> + */
> + vslice = (s64)(se->deadline - avruntime);
> + vslice = div_s64(vslice * old_weight, weight);
> + se->deadline = avruntime + vslice;
> +}
> +
> static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> unsigned long weight)
> {
> - unsigned long old_weight = se->load.weight;
> + bool curr = cfs_rq->curr == se;
>
> if (se->on_rq) {
> /* commit outstanding execution time */
> - if (cfs_rq->curr == se)
> + if (curr)
> update_curr(cfs_rq);
> else
> - avg_vruntime_sub(cfs_rq, se);
> + __dequeue_entity(cfs_rq, se);
> update_load_sub(&cfs_rq->load, se->load.weight);
> }
> dequeue_load_avg(cfs_rq, se);
>
> - update_load_set(&se->load, weight);
> -
> if (!se->on_rq) {
> /*
> * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
> * we need to scale se->vlag when w_i changes.
> */
> - se->vlag = div_s64(se->vlag * old_weight, weight);
> + se->vlag = div_s64(se->vlag * se->load.weight, weight);
> } else {
> - s64 deadline = se->deadline - se->vruntime;
> - /*
> - * When the weight changes, the virtual time slope changes and
> - * we should adjust the relative virtual deadline accordingly.
> - */
> - deadline = div_s64(deadline * old_weight, weight);
> - se->deadline = se->vruntime + deadline;
> - if (se != cfs_rq->curr)
> - min_deadline_cb_propagate(&se->run_node, NULL);
> + reweight_eevdf(cfs_rq, se, weight);
> }
>
> + update_load_set(&se->load, weight);
> +
> #ifdef CONFIG_SMP
> do {
> u32 divider = get_pelt_divider(&se->avg);
> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> enqueue_load_avg(cfs_rq, se);
> if (se->on_rq) {
> update_load_add(&cfs_rq->load, se->load.weight);
> - if (cfs_rq->curr != se)
> - avg_vruntime_add(cfs_rq, se);
> + if (!curr) {
> + /*
> + * The entity's vruntime has been adjusted, so let's check
> + * whether the rq-wide min_vruntime needs updated too. Since
> + * the calculations above require stable min_vruntime rather
> + * than up-to-date one, we do the update at the end of the
> + * reweight process.
> + */
> + __enqueue_entity(cfs_rq, se);
> + update_min_vruntime(cfs_rq);
> + }
> }
> }
Sorry if I am asking a stupid question... It looks like
reweight_entity() may change the weight of the cfs_rq->curr entity,
but we never call update_min_vruntime() when reweighting it. Is
there any reason we can skip update_min_vruntime() in this case?
>
> @@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
>
> #ifndef CONFIG_SMP
> shares = READ_ONCE(gcfs_rq->tg->shares);
> -
> - if (likely(se->load.weight == shares))
> - return;
> #else
> - shares = calc_group_shares(gcfs_rq);
> + shares = calc_group_shares(gcfs_rq);
> #endif
> -
> - reweight_entity(cfs_rq_of(se), se, shares);
> + if (unlikely(se->load.weight != shares))
> + reweight_entity(cfs_rq_of(se), se, shares);
> }
>
> #else /* CONFIG_FAIR_GROUP_SCHED */
Thanks,
Yiwei Lin
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
@ 2023-11-16 4:48 ` Abel Wu
2023-11-16 4:59 ` Yiwei Lin
2023-11-16 5:07 ` Abel Wu
0 siblings, 2 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-16 4:48 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>
>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> enqueue_load_avg(cfs_rq, se);
>> if (se->on_rq) {
>> update_load_add(&cfs_rq->load, se->load.weight);
>> - if (cfs_rq->curr != se)
>> - avg_vruntime_add(cfs_rq, se);
>> + if (!curr) {
>> + /*
>> + * The entity's vruntime has been adjusted, so let's check
>> + * whether the rq-wide min_vruntime needs updated too. Since
>> + * the calculations above require stable min_vruntime rather
>> + * than up-to-date one, we do the update at the end of the
>> + * reweight process.
>> + */
>> + __enqueue_entity(cfs_rq, se);
>> + update_min_vruntime(cfs_rq);
>> + }
>> }
>> }
> Sorry if I am asking a stupid question... It looks like reweight_entity() may change the weight of the cfs_rq->curr entity, but we never call update_min_vruntime() when reweighting it. Is there any reason we can skip update_min_vruntime() in this case?
No, you are right!
Thanks!
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 4:48 ` Abel Wu
@ 2023-11-16 4:59 ` Yiwei Lin
2023-11-16 9:01 ` Abel Wu
2023-11-16 5:07 ` Abel Wu
1 sibling, 1 reply; 25+ messages in thread
From: Yiwei Lin @ 2023-11-16 4:59 UTC (permalink / raw)
To: Abel Wu
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 12:48, Abel Wu wrote:
> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>
>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> enqueue_load_avg(cfs_rq, se);
>>> if (se->on_rq) {
>>> update_load_add(&cfs_rq->load, se->load.weight);
>>> - if (cfs_rq->curr != se)
>>> - avg_vruntime_add(cfs_rq, se);
>>> + if (!curr) {
>>> + /*
>>> + * The entity's vruntime has been adjusted, so let's check
>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>> + * the calculations above require stable min_vruntime rather
>>> + * than up-to-date one, we do the update at the end of the
>>> + * reweight process.
>>> + */
>>> + __enqueue_entity(cfs_rq, se);
>>> + update_min_vruntime(cfs_rq);
>>> + }
>>> }
>>> }
>> Sorry if I am asking stupid question...... It looks like
>> reweight_entity() may have chance to change the weight of
>> cfs_rq->curr entity, but we'll never update_min_vruntime() when
>> reweighting it. Is there any reason that we can skip the
>> update_min_vruntime() for this case?
>
> No, you are right!
>
> Thanks!
> Abel
Thank you! I'll take the responsibility to fix this.
Yiwei Lin
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 4:48 ` Abel Wu
2023-11-16 4:59 ` Yiwei Lin
@ 2023-11-16 5:07 ` Abel Wu
2023-11-16 6:51 ` Yiwei Lin
1 sibling, 1 reply; 25+ messages in thread
From: Abel Wu @ 2023-11-16 5:07 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 12:48 PM, Abel Wu Wrote:
> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>
>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> enqueue_load_avg(cfs_rq, se);
>>> if (se->on_rq) {
>>> update_load_add(&cfs_rq->load, se->load.weight);
>>> - if (cfs_rq->curr != se)
>>> - avg_vruntime_add(cfs_rq, se);
>>> + if (!curr) {
>>> + /*
>>> + * The entity's vruntime has been adjusted, so let's check
>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>> + * the calculations above require stable min_vruntime rather
>>> + * than up-to-date one, we do the update at the end of the
>>> + * reweight process.
>>> + */
>>> + __enqueue_entity(cfs_rq, se);
>>> + update_min_vruntime(cfs_rq);
>>> + }
>>> }
>>> }
>> Sorry if I am asking stupid question...... It looks like reweight_entity() may have chance to change the weight of cfs_rq->curr entity, but we'll never update_min_vruntime() when reweighting it. Is there any reason that we can skip the update_min_vruntime() for this case?
>
> No, you are right!
I intended to update_min_vruntime() if se->on_rq, no matter whether
it is curr or not, just as you suggested. But on second thought I
wonder whether it is necessary to update *NOW*, since we will always
update_curr() before making any change to the cfs_rq. Thoughts?
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 5:07 ` Abel Wu
@ 2023-11-16 6:51 ` Yiwei Lin
2023-11-16 7:11 ` Abel Wu
0 siblings, 1 reply; 25+ messages in thread
From: Yiwei Lin @ 2023-11-16 6:51 UTC (permalink / raw)
To: Abel Wu
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 13:07, Abel Wu wrote:
> On 11/16/23 12:48 PM, Abel Wu Wrote:
>> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>>
>>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>>> enqueue_load_avg(cfs_rq, se);
>>>> if (se->on_rq) {
>>>> update_load_add(&cfs_rq->load, se->load.weight);
>>>> - if (cfs_rq->curr != se)
>>>> - avg_vruntime_add(cfs_rq, se);
>>>> + if (!curr) {
>>>> + /*
>>>> + * The entity's vruntime has been adjusted, so let's check
>>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>>> + * the calculations above require stable min_vruntime rather
>>>> + * than up-to-date one, we do the update at the end of the
>>>> + * reweight process.
>>>> + */
>>>> + __enqueue_entity(cfs_rq, se);
>>>> + update_min_vruntime(cfs_rq);
>>>> + }
>>>> }
>>>> }
>>> Sorry if I am asking stupid question...... It looks like
>>> reweight_entity() may have chance to change the weight of
>>> cfs_rq->curr entity, but we'll never update_min_vruntime() when
>>> reweighting it. Is there any reason that we can skip the
>>> update_min_vruntime() for this case?
>>
>> No, you are right!
>
> I was intended to update_min_vruntime() if se->on_rq and no matter
> it is curr or not, just as you suggested. But after a second thought
> I wonder if it is necessary to update *NOW*, since we will always
> update_curr() before making any change to cfs_rq. Thoughts?
I missed the fact that we update_min_vruntime() every time we
update_curr(). Given that, we can indeed wait until we need the
correct min_vruntime and call update_min_vruntime() then. The only
concern I came up with is that sched_debug may not reflect an
accurate min_vruntime in time, but that may not be a big problem.
Going further, I have another thought: we could remove the
update_min_vruntime() here in reweight_entity() entirely to save
more time. My reasoning is that min_vruntime is no longer used to
normalize vruntime as it was in CFS, so we will always have run
update_curr() for the latest min_vruntime before using it. The
update_min_vruntime() in dequeue_entity() could also be removed for
the same reason, i.e. just do update_min_vruntime() in update_curr()
to simplify things. What do you think?
Thanks,
Yiwei Lin
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 6:51 ` Yiwei Lin
@ 2023-11-16 7:11 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-16 7:11 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 2:51 PM, Yiwei Lin Wrote:
>
> On 11/16/23 13:07, Abel Wu wrote:
>> On 11/16/23 12:48 PM, Abel Wu Wrote:
>>> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>>>
>>>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>>>> enqueue_load_avg(cfs_rq, se);
>>>>> if (se->on_rq) {
>>>>> update_load_add(&cfs_rq->load, se->load.weight);
>>>>> - if (cfs_rq->curr != se)
>>>>> - avg_vruntime_add(cfs_rq, se);
>>>>> + if (!curr) {
>>>>> + /*
>>>>> + * The entity's vruntime has been adjusted, so let's check
>>>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>>>> + * the calculations above require stable min_vruntime rather
>>>>> + * than up-to-date one, we do the update at the end of the
>>>>> + * reweight process.
>>>>> + */
>>>>> + __enqueue_entity(cfs_rq, se);
>>>>> + update_min_vruntime(cfs_rq);
>>>>> + }
>>>>> }
>>>>> }
>>>> Sorry if I am asking stupid question...... It looks like reweight_entity() may have chance to change the weight of cfs_rq->curr entity, but we'll never update_min_vruntime() when reweighting it. Is there any reason that we can skip the update_min_vruntime() for this case?
>>>
>>> No, you are right!
>>
>> I was intended to update_min_vruntime() if se->on_rq and no matter
>> it is curr or not, just as you suggested. But after a second thought
>> I wonder if it is necessary to update *NOW*, since we will always
>> update_curr() before making any change to cfs_rq. Thoughts?
> I lost the fact that we'll update_min_vruntime() every time we update_curr(). Because of this fact, we can indeed wait until we need the correct min_vruntime and update_min_vruntime() then. The only consideration that I came up with is that the sched_debug may not be able to reflect the accurate min_vruntime in time. But this may not be a big problem.
>
> Further, I have another advanced thought we can remove the update_min_vruntime() here in the reweight_entity() directly to save more time. The reason that I think this is because min_vruntime is not for normalization of vruntime as before which is required on CFS, so we will always update_curr() for the latest min_vruntime before using it. Also, the update_min_vruntime() in dequeue_entity() may also be removed as the reason, i.e. just do update_min_vruntime() in update_curr() to simplify. What do you think?
Yes, this is exactly what I am thinking about too. Since task
placement now adopts a lag-based solution which is independent of
min_vruntime, and since min_vruntime is only used as a base offset
for calculating avg_vruntime (in order to avoid overflow), we can
probably update it in a more relaxed way, e.g. on ticks. If the
relaxed update works, there still seems to be work to be done first:
1) the priority of core pick when core scheduling is enabled needs
to change to a deadline-based solution;
2) we need to make sure nothing overflows in NOHZ_FULL mode.
Just some first thoughts that came to mind :)
Thanks,
Abel
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 4:59 ` Yiwei Lin
@ 2023-11-16 9:01 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-16 9:01 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 12:59 PM, Yiwei Lin Wrote:
>
> On 11/16/23 12:48, Abel Wu wrote:
>> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>>
>>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>>> enqueue_load_avg(cfs_rq, se);
>>>> if (se->on_rq) {
>>>> update_load_add(&cfs_rq->load, se->load.weight);
>>>> - if (cfs_rq->curr != se)
>>>> - avg_vruntime_add(cfs_rq, se);
>>>> + if (!curr) {
>>>> + /*
>>>> + * The entity's vruntime has been adjusted, so let's check
>>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>>> + * the calculations above require stable min_vruntime rather
>>>> + * than up-to-date one, we do the update at the end of the
>>>> + * reweight process.
>>>> + */
>>>> + __enqueue_entity(cfs_rq, se);
>>>> + update_min_vruntime(cfs_rq);
>>>> + }
>>>> }
>>>> }
>>> Sorry if I am asking stupid question...... It looks like reweight_entity() may have chance to change the weight of cfs_rq->curr entity, but we'll never update_min_vruntime() when reweighting it. Is there any reason that we can skip the update_min_vruntime() for this case?
>>
>> No, you are right!
>>
>> Thanks!
>> Abel
> Thank you! I'll take the responsibility to fix this.
That would be appreciated. I suggest fixing this before we
do further optimizations.
Thanks,
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
` (2 preceding siblings ...)
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
@ 2024-02-29 9:24 ` Tianchen Ding
2024-02-29 14:25 ` Abel Wu
3 siblings, 1 reply; 25+ messages in thread
From: Tianchen Ding @ 2024-02-29 9:24 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
Hi Abel,
I'm not so familiar with eevdf and am still learning, so I have some
questions about this patch.
1. You proved that V will not change during reweight (COROLLARY #2).
However, according to the original paper, the new V will be:
V' = V + lag(j)/(W - w_j) - lag(j)/(W - w_j + w'_j)
So V' in the paper does change (when lag is not zero).
Is the V in the Linux code slightly different from the one in the paper?
2. I printed some logs around reweight_entity(), mainly to print V by
calling avg_vruntime(cfs_rq). I found your algorithm only keeps V
unchanged during reweight_eevdf(), but not during reweight_entity().
In detail:
If curr is true (i.e., cfs_rq->curr == se), we directly run
reweight_eevdf(), and V is not changed.
If curr is false, we have __dequeue_entity() -> reweight_eevdf() ->
__enqueue_entity(), and V ends up changed by the dequeue and enqueue.
So the result of reweight_entity() is affected by "cfs_rq->curr == se";
is this expected?
Thanks!
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-02-29 9:24 ` Tianchen Ding
@ 2024-02-29 14:25 ` Abel Wu
2024-03-01 6:41 ` Tianchen Ding
0 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2024-02-29 14:25 UTC (permalink / raw)
To: Tianchen Ding
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
Hi Tianchen,
On 2/29/24 5:24 PM, Tianchen Ding Wrote:
> Hi Abel:
>
> I'm not so familiar with eevdf and still learning. Here I've some questions about this patch.
>
> 1. You did proof that V will not change during reweight (COROLLARY #2). However, according to the original paper, the new V will be:
> V' = V + lag(j)/(W - w_j) - lag(j)/(W - w_j + w'_j)
> So the V' in paper will change (when lag is not zero).
> Is the V in Linux code slightly different from the paper?
Good catch. And to the best of my knowledge, the answer is YES. The
above equation in the paper, Eq. (20), is based on the assumption
that:
"once client 3 leaves, the remaining two clients will
proportionally support the eventual loss or gain in the
service time" -- Page 10
"by updating the virtual time according to Eq. (18,19) we
ensure that the sum over the lags of all active clients
is always zero" -- Page 11
But in Peter's implementation, it is the competitors in the new group
that client 3 later joins who actually absorb the effect. So when
client 3 leaves the competition with !0-lag in Linux, the rq's
sum(lag_i) is no longer zero.
>
> 2. I print some log around reweight_entity(), mainly want to print V by calling avg_vruntime(cfs_rq). I found your algorithm only keeps the V unchanged during reweight_eevdf(), but not reweight_entity().
>
> In detail:
> If curr is true (i.e., cfs_rq->curr == se), we will directly run reweight_eevdf(), and the V is not changed.
> If curr is false, we will have __dequeue_entity() -> reweight_eevdf() -> __enqueue_entity(). The V is finally changed due to dequeue and enqueue. So the result of reweight_entity() will be impacted by "cfs_rq->curr == se", is this expected?
Good catch again! It smells like a bug. Since this @se is still on_rq,
it should be taken into account when calculating avg_vruntime(), but
in fact it isn't, because __dequeue_entity() will have removed its share.
And I seem to have spotted another bug, although not related to this
problem: we actually need to call update_curr() unconditionally if curr
is available, because we need to commit curr's outstanding runtime to
ensure the result of avg_vruntime() is up to date.
Thanks & BR,
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-02-29 14:25 ` Abel Wu
@ 2024-03-01 6:41 ` Tianchen Ding
2024-03-01 8:30 ` Abel Wu
0 siblings, 1 reply; 25+ messages in thread
From: Tianchen Ding @ 2024-03-01 6:41 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 2024/2/29 22:25, Abel Wu wrote:
> Good catch. And to the best of my knowledge, the answer is YES. The
> above Equation in the paper, which is Eq. (20), is based on the
> assumption that:
>
> "once client 3 leaves, the remaining two clients will
> proportionally support the eventual loss or gain in the
> service time" -- Page 10
>
> "by updating the virtual time according to Eq. (18,19) we
> ensure that the sum over the lags of all active clients
> is always zero" -- Page 11
>
> But in Peter's implementation, it is the competitors in the new group
> that client 3 later joins in who actually support the effect. So when
> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
> is no longer zero.
>
I have a different opinion. According to the comments above
avg_vruntime_add(), V is calculated exactly so that sum(lag_i) = 0;
this is guaranteed by the math.
Actually I printed some logs in enqueue_entity() and dequeue_entity() to verify this:
[ 293.261236] before dequeue: V=2525278131 W=3072 v=2526243139 w=1024 lag_sum=0
[ 293.261237] after dequeue: V=2524795627 W=2048 v=2526243139 w=1024 lag_sum=0
[ 293.262286] before enqueue: V=2525319064 W=2048 v=2526766576 w=1024 lag_sum=0
[ 293.262287] after enqueue: V=2525801568 W=3072 v=2526766576 w=1024 lag_sum=0
For the first 2 lines, we have 2524795627 = 2525278131 + (2525278131 - 2526243139) * 1024 / 2048.
Which is Eq. (18)
For the last 2 lines, we have 2525801568 = 2525319064 - (2525319064 - 2526766576) * 1024 / 3072.
Which is Eq. (19)
So whether client 3 leaves or joins the competition with !0-lag in Linux, V is handled properly.
> Good catch again! It smells like a bug. Since this @se is still on_rq,
> it should be taken into consideration when calculating avg_runtime(),
> but in fact it isn't because __dequeue_entity() will remove its share.
>
> And I seem to spot another bug, although not relate to this problem,
> that we actually need to call update_curr() unconditionally if curr is
> available, because we need to commit curr's outstanding runtime to
> ensure the result of avg_runtime() is up to date.
>
I've tried recording avg_vruntime before __dequeue_entity() and passing
it to reweight_eevdf(). That fixes the issue: V stays the same during
the whole reweight_entity().
I could send these two bugfix patches (one for this bug and one you
suggested about update_curr). But before doing so, I still want to dig
out the answer to my first question.
Hi Peter, would you please provide any information?
Thanks.
My rough logging code:
(Note: lag_sum may output a negative value, with absolute value less
than W. This is OK because my lag_sum calculation is not fully accurate,
due to the sign flips in avg_vruntime().)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..9306c1bbd472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5260,6 +5260,62 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
static inline bool cfs_bandwidth_used(void);
+static int rbtree_all(const void *key, const struct rb_node *node)
+{
+ return 0;
+}
+
+static s64 get_lag_sum(struct cfs_rq *cfs_rq)
+{
+ u64 avruntime = avg_vruntime(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+ struct rb_node *node;
+ s64 lag_sum = 0;
+
+ rb_for_each(node, 0, &cfs_rq->tasks_timeline.rb_root, rbtree_all) {
+ struct sched_entity *se = __node_2_se(node);
+
+ if (se->on_rq)
+ lag_sum += (avruntime - se->vruntime) * scale_load_down(se->load.weight);
+ }
+
+ if (curr && curr->on_rq) {
+ lag_sum += (avruntime - curr->vruntime) * scale_load_down(curr->load.weight);
+ }
+
+ return lag_sum;
+}
+
+static void print_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se, bool before, bool enqueue)
+{
+ if (cpu_of(rq_of(cfs_rq)))
+ return; // avoid too many printings.
+
+ long load = cfs_rq->avg_load;
+ struct sched_entity *curr = cfs_rq->curr;
+ unsigned long weight = scale_load_down(se->load.weight);
+ s64 lag_sum = get_lag_sum(cfs_rq);
+ u64 avruntime = avg_vruntime(cfs_rq);
+
+ if (curr && curr->on_rq) {
+ unsigned long curr_weight = scale_load_down(curr->load.weight);
+ load += curr_weight;
+ }
+
+ if (before) {
+ if (enqueue)
+ printk("before enqueue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ else
+ printk("before dequeue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ }
+ else {
+ if (enqueue)
+ printk("after enqueue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ else
+ printk("after dequeue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ }
+}
+
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5307,9 +5363,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
check_schedstat_required();
update_stats_enqueue_fair(cfs_rq, se, flags);
+ print_eevdf(cfs_rq, se, true, true);
if (!curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;
+ print_eevdf(cfs_rq, se, false, true);
if (cfs_rq->nr_running == 1) {
check_enqueue_throttle(cfs_rq);
@@ -5347,6 +5405,7 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5377,9 +5436,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
clear_buddies(cfs_rq, se);
update_entity_lag(cfs_rq, se);
+ print_eevdf(cfs_rq, se, true, false);
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
+ print_eevdf(cfs_rq, se, false, false);
account_entity_dequeue(cfs_rq, se);
/* return excess runtime on last dequeue */
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-03-01 6:41 ` Tianchen Ding
@ 2024-03-01 8:30 ` Abel Wu
2024-03-01 10:04 ` Tianchen Ding
0 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2024-03-01 8:30 UTC (permalink / raw)
To: Tianchen Ding
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 3/1/24 2:41 PM, Tianchen Ding Wrote:
> On 2024/2/29 22:25, Abel Wu wrote:
>> Good catch. And to the best of my knowledge, the answer is YES. The
>> above Equation in the paper, which is Eq. (20), is based on the
>> assumption that:
>>
>> "once client 3 leaves, the remaining two clients will
>> proportionally support the eventual loss or gain in the
>> service time" -- Page 10
>>
>> "by updating the virtual time according to Eq. (18,19) we
>> ensure that the sum over the lags of all active clients
>> is always zero" -- Page 11
>>
>> But in Peter's implementation, it is the competitors in the new group
>> that client 3 later joins in who actually support the effect. So when
>> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
>> is no longer zero.
>>
>
> I've different opinions. According to the comments above avg_vruntime_add(), V
> is calculated exactly to satisfy sum(lag_i)=0. This is guaranteed by math.
Yes, you are right. I mixed another fairness issue up with this. What
I was thinking is that, considering multiple competition groups (e.g.
runqueues), the latency bound could be violated, that is, someone
could starve a bit: an entity, even with positive lag, could become
less competitive if migrated to a more competitive group.
Staring at Eq. (20) again, what if we do a fake reweight, i.e. let
the client leave and rejoin at the same time without changing weight?
IMHO it should have no effect, but according to Eq. (20) V will
change to:
V' = V + lag(j)/(W - w_j) - lag(j)/W != V
Have I missed anything?
>
> Actually I print some logs in enqueue_entity() and dequeue_entity() to verify this:
>
> [ 293.261236] before dequeue: V=2525278131 W=3072 v=2526243139 w=1024 lag_sum=0
> [ 293.261237] after dequeue: V=2524795627 W=2048 v=2526243139 w=1024 lag_sum=0
> [ 293.262286] before enqueue: V=2525319064 W=2048 v=2526766576 w=1024 lag_sum=0
> [ 293.262287] after enqueue: V=2525801568 W=3072 v=2526766576 w=1024 lag_sum=0
>
> For the first 2 lines, we have 2524795627 = 2525278131 + (2525278131 - 2526243139) * 1024 / 2048.
> Which is Eq. (18)
>
> For the last 2 lines, we have 2525801568 = 2525319064 - (2525319064 - 2526766576) * 1024 / 3072.
> Which is Eq. (19)
>
> So whatever client 3 leave or join competition with !0-lag in Linux, V is handled properly.
>
>> Good catch again! It smells like a bug. Since this @se is still on_rq,
>> it should be taken into consideration when calculating avg_runtime(),
>> but in fact it isn't because __dequeue_entity() will remove its share.
>>
>> And I seem to spot another bug, although not relate to this problem,
>> that we actually need to call update_curr() unconditionally if curr is
>> available, because we need to commit curr's outstanding runtime to
>> ensure the result of avg_runtime() is up to date.
>>
>
> I've tried to record avg_vruntime before __dequeue_entity() and pass it to
> reweight_eevdf(). Then the issue is fixed. The V keeps the same during the whole
> reweight_entity().
>
> I could send these two bugfix patches (one for this bug and one you sugguested
That would be appreciated!
Thanks,
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-03-01 8:30 ` Abel Wu
@ 2024-03-01 10:04 ` Tianchen Ding
0 siblings, 0 replies; 25+ messages in thread
From: Tianchen Ding @ 2024-03-01 10:04 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 2024/3/1 16:30, Abel Wu wrote:
> On 3/1/24 2:41 PM, Tianchen Ding Wrote:
>> On 2024/2/29 22:25, Abel Wu wrote:
>>> Good catch. And to the best of my knowledge, the answer is YES. The
>>> above Equation in the paper, which is Eq. (20), is based on the
>>> assumption that:
>>>
>>> "once client 3 leaves, the remaining two clients will
>>> proportionally support the eventual loss or gain in the
>>> service time" -- Page 10
>>>
>>> "by updating the virtual time according to Eq. (18,19) we
>>> ensure that the sum over the lags of all active clients
>>> is always zero" -- Page 11
>>>
>>> But in Peter's implementation, it is the competitors in the new group
>>> that client 3 later joins in who actually support the effect. So when
>>> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
>>> is no longer zero.
>>>
>>
>> I've different opinions. According to the comments above avg_vruntime_add(), V
>> is calculated exactly to satisfy sum(lag_i)=0. This is guaranteed by math.
>
> Yes, you are right. I mixed another fairness issue with this. What I
> was thinking is that considering multiple competition groups (e.g.
> runqueues), the latency bound could be violated, that is someone could
> starve a bit. Say one entity even with positive lag could become less
> competitive if migrated to a higher competitive group.
>
> Staring at Eq. (20) again, what if we do a fake reweight? I mean let
> the client leave and rejoin at the same time without changing weight?
> IMHO it should have no effects, but according to Eq. (20) the V will
> change to:
>
> V' = V + lag(j)/(W - w_j) - lag(j)/W != V
>
> Have I missed anything?
>
Good point! I had never noticed this conflict.
I tried modifying reweight_entity() to run dequeue_entity() -> adjust
se->vlag -> enqueue_entity(), and I found V does not change.
The difference is that, when doing enqueue_entity(), Peter enlarges the
lag in place_entity(), because part of the lag would otherwise evaporate
after the enqueue.
In order to keep the same lag after enqueue, during place_entity() the
new lag(i) is enlarged by (W + w_i)/W.
So Eq. (20) should be:
V' = V + lag(j)/(W - w_j) - lag'(j)/(W - w_j + w'_j)
lag'(j) = lag(j) * (W - w_j + w'_j)/(W - w_j)
and thus
V' = V + lag(j)/(W - w_j) - lag(j) * (W - w_j + w'_j)/(W - w_j)/(W - w_j + w'_j) = V
So COROLLARY #2 is correct.
end of thread, other threads: [~2024-03-01 10:04 UTC | newest]
Thread overview: 25+ messages
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
2023-11-14 21:57 ` [tip: sched/urgent] " tip-bot2 for Abel Wu
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
2023-11-16 4:48 ` Abel Wu
2023-11-16 4:59 ` Yiwei Lin
2023-11-16 9:01 ` Abel Wu
2023-11-16 5:07 ` Abel Wu
2023-11-16 6:51 ` Yiwei Lin
2023-11-16 7:11 ` Abel Wu
2024-02-29 9:24 ` Tianchen Ding
2024-02-29 14:25 ` Abel Wu
2024-03-01 6:41 ` Tianchen Ding
2024-03-01 8:30 ` Abel Wu
2024-03-01 10:04 ` Tianchen Ding
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
2023-11-07 11:03 ` Peter Zijlstra
2023-11-07 11:06 ` Abel Wu
2023-11-07 23:26 ` Benjamin Segall
2023-11-08 8:51 ` Abel Wu
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
2023-11-07 10:12 ` Abel Wu
2023-11-07 10:42 ` Peter Zijlstra
2023-11-07 9:05 ` [PATCH 4/4] sched/stats: branch statistics for pick_eevdf Abel Wu