* [PATCH 0/4] sched/eevdf: Optimize reweight and pick
@ 2023-11-07 9:05 Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
This patchset makes the following contributions:
	[1/4] Fixes vruntime not being adjusted when an entity is
	      re-weighted at a !0-lag point.
	[2/4] Optimizes out the fallback search on @best_left, which
	      doubles the cost in the worst case.
	[3/4] Enables O(1) fastpath picking based on the deadline-sorted,
	      leftmost-cached rbtree.
	[4/4] Statistics for patch 3, not intended for upstream.
All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a Dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
p0: baseline, tip/master 1187c0b3a6c2
p1: p0 + patch(1)
p3: p0 + patch(1~3)
hackbench
=========
case load p0% (std%) p1% ( std%) p3% ( std%)
process-pipe group-1 1.00 ( 2.49) -1.73 ( 2.64) -3.77 ( 0.91)
process-pipe group-2 1.00 ( 5.23) +5.51 ( 2.32) -3.41 ( 4.28)
process-pipe group-4 1.00 ( 5.30) +3.53 ( 5.46) +6.51 ( 1.44)
process-pipe group-8 1.00 ( 1.36) -1.85 ( 2.22) -3.57 ( 1.06)
process-sockets group-1 1.00 ( 2.29) -2.39 ( 2.66) -2.39 ( 1.86)
process-sockets group-2 1.00 ( 3.46) +0.46 ( 1.85) +1.19 ( 2.08)
process-sockets group-4 1.00 ( 1.43) -1.98 ( 2.78) +4.52 ( 8.68)
process-sockets group-8 1.00 ( 0.95) -1.60 ( 0.94) +2.78 ( 2.14)
threads-pipe group-1 1.00 ( 1.92) +5.33 ( 1.54) +3.47 ( 1.09)
threads-pipe group-2 1.00 ( 0.64) +0.51 ( 2.31) +2.91 ( 0.43)
threads-pipe group-4 1.00 ( 3.03) -2.91 ( 2.31) +1.83 ( 1.65)
threads-pipe group-8 1.00 ( 2.55) +1.89 ( 3.04) -1.29 ( 2.32)
threads-sockets group-1 1.00 ( 0.71) +0.83 ( 0.52) -0.42 ( 0.52)
threads-sockets group-2 1.00 ( 2.48) -2.52 ( 1.20) -3.27 ( 0.59)
threads-sockets group-4 1.00 ( 1.96) +2.67 ( 2.34) +3.74 ( 1.18)
threads-sockets group-8 1.00 ( 1.09) -2.30 ( 0.51) +3.07 ( 0.62)
netperf
=======
case load p0% (std%) p1% ( std%) p3% ( std%)
TCP_RR thread-24 1.00 ( 2.48) -2.15 ( 2.38) +0.17 ( 1.95)
TCP_RR thread-48 1.00 ( 0.73) -1.59 ( 0.51) +0.47 ( 0.93)
TCP_RR thread-72 1.00 ( 1.04) -1.26 ( 1.03) -0.09 ( 1.13)
TCP_RR thread-96 1.00 ( 29.36) +70.41 ( 14.86) +17.88 ( 37.24)
TCP_RR thread-192 1.00 ( 28.29) -1.30 ( 34.03) -2.00 ( 30.63)
TCP_STREAM thread-24 1.00 ( 1.57) +0.38 ( 1.90) +0.20 ( 1.72)
TCP_STREAM thread-48 1.00 ( 0.08) -0.29 ( 0.07) +0.15 ( 0.12)
TCP_STREAM thread-72 1.00 ( 0.01) -0.00 ( 0.00) +0.00 ( 0.00)
TCP_STREAM thread-96 1.00 ( 0.76) +0.16 ( 0.65) +0.30 ( 0.47)
TCP_STREAM thread-192 1.00 ( 0.65) +0.23 ( 0.46) +0.25 ( 0.49)
UDP_RR thread-24 1.00 ( 1.74) -1.26 ( 2.41) +0.81 ( 3.02)
UDP_RR thread-48 1.00 ( 0.56) -0.40 ( 16.72) -0.98 ( 0.36)
UDP_RR thread-72 1.00 ( 0.84) -0.70 ( 0.66) -0.27 ( 0.88)
UDP_RR thread-96 1.00 ( 1.24) -0.44 ( 1.01) -0.99 ( 8.99)
UDP_RR thread-192 1.00 ( 28.02) -0.42 ( 31.59) -1.80 ( 26.23)
UDP_STREAM thread-24 1.00 (100.05) +0.31 (100.04) +0.32 (100.06)
UDP_STREAM thread-48 1.00 (104.35) +1.22 (105.14) +1.65 (104.10)
UDP_STREAM thread-72 1.00 (100.69) +1.28 (100.63) -0.17 (100.49)
UDP_STREAM thread-96 1.00 ( 99.63) +0.33 ( 99.51) -0.25 ( 99.53)
UDP_STREAM thread-192 1.00 (100.57) +2.00 (107.01) -1.21 ( 99.51)
tbench
======
case load p0% (std%) p1% ( std%) p3% ( std%)
loopback thread-24 1.00 ( 0.49) -1.47 ( 0.94) +0.08 ( 0.75)
loopback thread-48 1.00 ( 0.42) -0.04 ( 0.53) -0.06 ( 0.34)
loopback thread-72 1.00 ( 7.10) -3.33 ( 2.98) -5.06 ( 0.31)
loopback thread-96 1.00 ( 0.80) +2.65 ( 0.80) -0.68 ( 1.30)
loopback thread-192 1.00 ( 1.24) +1.21 ( 0.73) -1.78 ( 0.22)
schbench
========
case load p0% (std%) p1% ( std%) p3% ( std%)
normal mthread-1 1.00 ( 5.83) -2.83 ( 1.46) +1.24 ( 2.51)
normal mthread-2 1.00 ( 4.45) +8.94 ( 7.81) +14.24 ( 7.44)
normal mthread-4 1.00 ( 2.73) +2.53 ( 4.31) +12.44 ( 5.99)
normal mthread-8 1.00 ( 0.15) +0.21 ( 0.13) -0.34 ( 0.11)
These benchmarks show no obvious regressions.
Comments are appreciated, thanks!
Abel Wu (4):
sched/eevdf: Fix vruntime adjustment on reweight
sched/eevdf: Sort the rbtree by virtual deadline
sched/eevdf: O(1) fastpath for task selection
sched/stats: branch statistics for pick_eevdf
include/linux/sched.h | 2 +-
kernel/sched/debug.c | 11 +-
kernel/sched/fair.c | 412 +++++++++++++++++++++++++-----------------
kernel/sched/sched.h | 6 +
kernel/sched/stats.c | 6 +-
5 files changed, 263 insertions(+), 174 deletions(-)
--
2.37.3
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
` (3 more replies)
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
` (2 subsequent siblings)
3 siblings, 4 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
The vruntime of an (on_rq && !0-lag) entity needs to be adjusted
when it gets re-weighted, and the calculations can be simplified
based on the fact that re-weighting does not change the weighted
average vruntime of all the entities. Please check the proofs in
the comments.
But adjusting vruntime can also change the entity's position in the
RB-tree, hence requiring a costly re-queue to fix things up. This
might be avoided by deferring the adjustment until the entity
actually leaves the tree (dequeue/pick), but that would negatively
affect task selection and is probably not good enough either.
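The reweight math above can be sanity-checked with a minimal userspace sketch (not kernel code; the helper names and the use of plain int64_t instead of s64/u64 are hypothetical stand-ins, and exact division is assumed):

```c
#include <stdint.h>

/*
 * Userspace model of the reweight formulas derived in the comments:
 *   v' = V - (V - v) * w / w'   (vruntime; keeps lag = (V - v)*w invariant)
 *   d' = V + (d - V) * w / w'   (virtual deadline; rescaled slope)
 * V, v, d, w, w' mirror the symbols used in the proof. None of these
 * helpers exist in the kernel.
 */
static int64_t reweight_vruntime(int64_t V, int64_t v,
				 int64_t w, int64_t w_new)
{
	int64_t vlag = V - v;

	return V - vlag * w / w_new;
}

static int64_t reweight_deadline(int64_t V, int64_t d,
				 int64_t w, int64_t w_new)
{
	int64_t vslice = d - V;

	return V + vslice * w / w_new;
}
```

For example, with V=100, v=80, w=2, w'=4: the lag (V - v)*w = 40 before reweight, v' = 90, and (V - v')*w' = 40 afterwards, so lag is preserved while the relative deadline shrinks with the heavier weight.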
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 151 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 128 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8767988242ee..b00d09a9b601 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
#endif
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ unsigned long weight)
+{
+ unsigned long old_weight = se->load.weight;
+ u64 avruntime = avg_vruntime(cfs_rq);
+ s64 vlag, vslice;
+
+ /*
+ * VRUNTIME
+ * ========
+ *
+ * COROLLARY #1: The virtual runtime of the entity needs to be
+ * adjusted if re-weight at !0-lag point.
+ *
+ * Proof: For contradiction assume this is not true, so we can
+ * re-weight without changing vruntime at !0-lag point.
+ *
+ * Weight VRuntime Avg-VRuntime
+ * before w v V
+ * after w' v' V'
+ *
+ * Since lag needs to be preserved through re-weight:
+ *
+ * lag = (V - v)*w = (V'- v')*w', where v = v'
+ * ==> V' = (V - v)*w/w' + v (1)
+ *
+ * Let W be the total weight of the entities before reweight,
+ * since V' is the new weighted average of entities:
+ *
+ * V' = (WV + w'v - wv) / (W + w' - w) (2)
+ *
+ * by using (1) & (2) we obtain:
+ *
+ * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+ * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+ * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+ * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+ *
+ * Since we are doing at !0-lag point which means V != v, we
+ * can simplify (3):
+ *
+ * ==> W / (W + w' - w) = w / w'
+ * ==> Ww' = Ww + ww' - ww
+ * ==> W * (w' - w) = w * (w' - w)
+ * ==> W = w (re-weight indicates w' != w)
+ *
+ * So the cfs_rq contains only one entity, hence the vruntime of
+ * the entity @v should always be equal to the cfs_rq's weighted
+ * average vruntime @V, which means we will always re-weight
+ * at the 0-lag point, breaching the assumption. Proof completed.
+ *
+ *
+ * COROLLARY #2: Re-weight does NOT affect weighted average
+ * vruntime of all the entities.
+ *
+ * Proof: According to corollary #1, Eq. (1) should be:
+ *
+ * (V - v)*w = (V' - v')*w'
+ * ==> v' = V' - (V - v)*w/w' (4)
+ *
+ * According to the weighted average formula, we have:
+ *
+ * V' = (WV - wv + w'v') / (W - w + w')
+ * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+ * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+ * = (WV + w'V' - Vw) / (W - w + w')
+ *
+ * ==> V'*(W - w + w') = WV + w'V' - Vw
+ * ==> V' * (W - w) = (W - w) * V (5)
+ *
+ * If the entity is the only one in the cfs_rq, then reweight
+ * always occurs at 0-lag point, so V won't change. Or else
+ * there are other entities, hence W != w, then Eq. (5) turns
+ * into V' = V. So V won't change in either case, proof done.
+ *
+ *
+ * So according to corollary #1 & #2, the effect of re-weight
+ * on vruntime should be:
+ *
+ * v' = V' - (V - v) * w / w' (4)
+ * = V - (V - v) * w / w'
+ * = V - vl * w / w'
+ * = V - vl'
+ */
+ if (avruntime != se->vruntime) {
+ vlag = (s64)(avruntime - se->vruntime);
+ vlag = div_s64(vlag * old_weight, weight);
+ se->vruntime = avruntime - vlag;
+ }
+
+ /*
+ * DEADLINE
+ * ========
+ *
+ * When the weight changes, the virtual time slope changes and
+ * we should adjust the relative virtual deadline accordingly.
+ *
+ * d' = v' + (d - v)*w/w'
+ * = V' - (V - v)*w/w' + (d - v)*w/w'
+ * = V - (V - v)*w/w' + (d - v)*w/w'
+ * = V + (d - V)*w/w'
+ */
+ vslice = (s64)(se->deadline - avruntime);
+ vslice = div_s64(vslice * old_weight, weight);
+ se->deadline = avruntime + vslice;
+}
+
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
- unsigned long old_weight = se->load.weight;
+ bool curr = cfs_rq->curr == se;
if (se->on_rq) {
/* commit outstanding execution time */
- if (cfs_rq->curr == se)
+ if (curr)
update_curr(cfs_rq);
else
- avg_vruntime_sub(cfs_rq, se);
+ __dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);
- update_load_set(&se->load, weight);
-
if (!se->on_rq) {
/*
* Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
* we need to scale se->vlag when w_i changes.
*/
- se->vlag = div_s64(se->vlag * old_weight, weight);
+ se->vlag = div_s64(se->vlag * se->load.weight, weight);
} else {
- s64 deadline = se->deadline - se->vruntime;
- /*
- * When the weight changes, the virtual time slope changes and
- * we should adjust the relative virtual deadline accordingly.
- */
- deadline = div_s64(deadline * old_weight, weight);
- se->deadline = se->vruntime + deadline;
- if (se != cfs_rq->curr)
- min_deadline_cb_propagate(&se->run_node, NULL);
+ reweight_eevdf(cfs_rq, se, weight);
}
+ update_load_set(&se->load, weight);
+
#ifdef CONFIG_SMP
do {
u32 divider = get_pelt_divider(&se->avg);
@@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
- if (cfs_rq->curr != se)
- avg_vruntime_add(cfs_rq, se);
+ if (!curr) {
+ /*
+ * The entity's vruntime has been adjusted, so let's check
+ * whether the rq-wide min_vruntime needs to be updated too. Since
+ * the calculations above require a stable min_vruntime rather
+ * than an up-to-date one, we do the update at the end of the
+ * reweight process.
+ */
+ __enqueue_entity(cfs_rq, se);
+ update_min_vruntime(cfs_rq);
+ }
}
}
@@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
#ifndef CONFIG_SMP
shares = READ_ONCE(gcfs_rq->tg->shares);
-
- if (likely(se->load.weight == shares))
- return;
#else
- shares = calc_group_shares(gcfs_rq);
+ shares = calc_group_shares(gcfs_rq);
#endif
-
- reweight_entity(cfs_rq_of(se), se, shares);
+ if (unlikely(se->load.weight != shares))
+ reweight_entity(cfs_rq_of(se), se, shares);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
--
2.37.3
* [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
2023-11-07 11:03 ` Peter Zijlstra
2023-11-07 23:26 ` Benjamin Segall
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
2023-11-07 9:05 ` [PATCH 4/4] sched/stats: branch statistics for pick_eevdf Abel Wu
3 siblings, 2 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst-case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in the next patch.
This patch also removes the now-unused max_vruntime() and adjusts
the position of some functions.
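The augmented-tree invariant this relies on can be sketched in plain userspace C (a hypothetical toy struct, not the kernel's rb_node machinery): each node caches the minimum vruntime of its subtree, so a whole subtree whose cached min_vruntime is not eligible can be pruned in one comparison.

```c
#include <stdint.h>

/* Hypothetical stand-in for sched_entity + its rb_node */
struct toy_node {
	uint64_t vruntime;
	uint64_t min_vruntime;	/* min vruntime over this subtree */
	struct toy_node *left, *right;
};

/* se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime) */
static void toy_min_vruntime_update(struct toy_node *n)
{
	uint64_t m = n->vruntime;

	if (n->left && (int64_t)(n->left->min_vruntime - m) < 0)
		m = n->left->min_vruntime;
	if (n->right && (int64_t)(n->right->min_vruntime - m) < 0)
		m = n->right->min_vruntime;
	n->min_vruntime = m;
}

/* Build a 3-node tree, propagate bottom-up, return the root's cached min */
static uint64_t toy_demo_root_min(void)
{
	struct toy_node l = { .vruntime = 5 };
	struct toy_node r = { .vruntime = 20 };
	struct toy_node root = { .vruntime = 10, .left = &l, .right = &r };

	toy_min_vruntime_update(&l);
	toy_min_vruntime_update(&r);
	toy_min_vruntime_update(&root);
	return root.min_vruntime;
}
```

In the real patch this propagation is done by the RB_DECLARE_CALLBACKS() machinery on rotations and insert/erase, rather than by hand as here.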
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
include/linux/sched.h | 2 +-
kernel/sched/debug.c | 11 +-
kernel/sched/fair.c | 239 +++++++++++++++++-------------------------
kernel/sched/sched.h | 1 +
4 files changed, 107 insertions(+), 146 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 556ad71b532b..815832a0e7eb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -553,7 +553,7 @@ struct sched_entity {
struct load_weight load;
struct rb_node run_node;
u64 deadline;
- u64 min_deadline;
+ u64 min_vruntime;
struct list_head group_node;
unsigned int on_rq;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4580a450700e..168eecc209b4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -628,8 +628,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
- s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
- struct sched_entity *last, *first;
+ s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+ struct sched_entity *last, *first, *root;
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
@@ -644,15 +644,20 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SPLIT_NS(cfs_rq->exec_clock));
raw_spin_rq_lock_irqsave(rq, flags);
+ root = __pick_root_entity(cfs_rq);
+ if (root)
+ left_vruntime = root->min_vruntime;
first = __pick_first_entity(cfs_rq);
if (first)
- left_vruntime = first->vruntime;
+ left_deadline = first->deadline;
last = __pick_last_entity(cfs_rq);
if (last)
right_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
raw_spin_rq_unlock_irqrestore(rq, flags);
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
+ SPLIT_NS(left_deadline));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_vruntime",
SPLIT_NS(left_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b00d09a9b601..459487bf8824 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -530,15 +530,6 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
* Scheduling class tree data structure manipulation methods:
*/
-static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
-{
- s64 delta = (s64)(vruntime - max_vruntime);
- if (delta > 0)
- max_vruntime = vruntime;
-
- return max_vruntime;
-}
-
static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
{
s64 delta = (s64)(vruntime - min_vruntime);
@@ -551,7 +542,11 @@ static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
static inline bool entity_before(const struct sched_entity *a,
const struct sched_entity *b)
{
- return (s64)(a->vruntime - b->vruntime) < 0;
+ /*
+ * Tiebreak on vruntime seems unnecessary since it can
+ * hardly happen.
+ */
+ return (s64)(a->deadline - b->deadline) < 0;
}
static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -720,7 +715,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Note: using 'avg_vruntime() > se->vruntime' is inacurate due
* to the loss in precision caused by the division.
*/
-int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
struct sched_entity *curr = cfs_rq->curr;
s64 avg = cfs_rq->avg_vruntime;
@@ -733,16 +728,19 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
load += weight;
}
- return avg >= entity_key(cfs_rq, se) * load;
+ return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+}
+
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ return vruntime_eligible(cfs_rq, se->vruntime);
}
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
u64 min_vruntime = cfs_rq->min_vruntime;
- /*
- * open coded max_vruntime() to allow updating avg_vruntime
- */
s64 delta = (s64)(vruntime - min_vruntime);
+
if (delta > 0) {
avg_vruntime_update(cfs_rq, delta);
min_vruntime = vruntime;
@@ -752,9 +750,8 @@ static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
- struct sched_entity *se = __pick_first_entity(cfs_rq);
+ struct sched_entity *se = __pick_root_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
-
u64 vruntime = cfs_rq->min_vruntime;
if (curr) {
@@ -766,9 +763,9 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
if (se) {
if (!curr)
- vruntime = se->vruntime;
+ vruntime = se->min_vruntime;
else
- vruntime = min_vruntime(vruntime, se->vruntime);
+ vruntime = min_vruntime(vruntime, se->min_vruntime);
}
/* ensure we never gain time by being placed backwards. */
@@ -781,34 +778,34 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
return entity_before(__node_2_se(a), __node_2_se(b));
}
-#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+#define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
-static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node *node)
{
if (node) {
struct sched_entity *rse = __node_2_se(node);
- if (deadline_gt(min_deadline, se, rse))
- se->min_deadline = rse->min_deadline;
+ if (vruntime_gt(min_vruntime, se, rse))
+ se->min_vruntime = rse->min_vruntime;
}
}
/*
- * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
*/
-static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
{
- u64 old_min_deadline = se->min_deadline;
+ u64 old_min_vruntime = se->min_vruntime;
struct rb_node *node = &se->run_node;
- se->min_deadline = se->deadline;
- __update_min_deadline(se, node->rb_right);
- __update_min_deadline(se, node->rb_left);
+ se->min_vruntime = se->vruntime;
+ __min_vruntime_update(se, node->rb_right);
+ __min_vruntime_update(se, node->rb_left);
- return se->min_deadline == old_min_deadline;
+ return se->min_vruntime == old_min_vruntime;
}
-RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
- run_node, min_deadline, min_deadline_update);
+RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
+ run_node, min_vruntime, min_vruntime_update);
/*
* Enqueue an entity into the rb-tree:
@@ -816,18 +813,28 @@ RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
avg_vruntime_add(cfs_rq, se);
- se->min_deadline = se->deadline;
+ se->min_vruntime = se->vruntime;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
- __entity_less, &min_deadline_cb);
+ __entity_less, &min_vruntime_cb);
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
- &min_deadline_cb);
+ &min_vruntime_cb);
avg_vruntime_sub(cfs_rq, se);
}
+struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
+{
+ struct rb_node *root = cfs_rq->tasks_timeline.rb_root.rb_node;
+
+ if (!root)
+ return NULL;
+
+ return __node_2_se(root);
+}
+
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
@@ -838,6 +845,35 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
return __node_2_se(left);
}
+#ifdef CONFIG_SCHED_DEBUG
+struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
+{
+ struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
+
+ if (!last)
+ return NULL;
+
+ return __node_2_se(last);
+}
+
+/**************************************************************
+ * Scheduling class statistics methods:
+ */
+#ifdef CONFIG_SMP
+int sched_update_scaling(void)
+{
+ unsigned int factor = get_update_sysctl_factor();
+
+#define WRT_SYSCTL(name) \
+ (normalized_sysctl_##name = sysctl_##name / (factor))
+ WRT_SYSCTL(sched_base_slice);
+#undef WRT_SYSCTL
+
+ return 0;
+}
+#endif
+#endif
+
/*
* Earliest Eligible Virtual Deadline First
*
@@ -850,23 +886,28 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
* with the earliest virtual deadline.
*
* We can do this in O(log n) time due to an augmented RB-tree. The
- * tree keeps the entries sorted on service, but also functions as a
- * heap based on the deadline by keeping:
+ * tree keeps the entries sorted on deadline, but also functions as a
+ * heap based on the vruntime by keeping:
*
- * se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ * se->min_vruntime = min(se->vruntime, se->{left,right}->min_vruntime)
*
- * Which allows an EDF like search on (sub)trees.
+ * Which allows tree pruning through eligibility.
*/
-static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
- struct sched_entity *best_left = NULL;
+
+ /*
+ * We can safely skip eligibility check if there is only one entity
+ * in this cfs_rq, saving some cycles.
+ */
+ if (cfs_rq->nr_running == 1)
+ return curr && curr->on_rq ? curr : __node_2_se(node);
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
- best = curr;
/*
* Once selected, run a task until it either becomes non-eligible or
@@ -875,126 +916,40 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
return curr;
+ /* Heap search for the EEVD entity */
while (node) {
struct sched_entity *se = __node_2_se(node);
+ struct rb_node *left = node->rb_left;
/*
- * If this entity is not eligible, try the left subtree.
+ * Eligible entities in left subtree are always better
+ * choices, since they have earlier deadlines.
*/
- if (!entity_eligible(cfs_rq, se)) {
- node = node->rb_left;
+ if (left && vruntime_eligible(cfs_rq,
+ __node_2_se(left)->min_vruntime)) {
+ node = left;
continue;
}
/*
- * Now we heap search eligible trees for the best (min_)deadline
+ * The left subtree either is empty or has no eligible
+ * entity, so check the current node since it is the one
+ * with earliest deadline that might be eligible.
*/
- if (!best || deadline_gt(deadline, best, se))
+ if (entity_eligible(cfs_rq, se)) {
best = se;
-
- /*
- * Every se in a left branch is eligible, keep track of the
- * branch with the best min_deadline
- */
- if (node->rb_left) {
- struct sched_entity *left = __node_2_se(node->rb_left);
-
- if (!best_left || deadline_gt(min_deadline, best_left, left))
- best_left = left;
-
- /*
- * min_deadline is in the left branch. rb_left and all
- * descendants are eligible, so immediately switch to the second
- * loop.
- */
- if (left->min_deadline == se->min_deadline)
- break;
- }
-
- /* min_deadline is at this node, no need to look right */
- if (se->deadline == se->min_deadline)
break;
-
- /* else min_deadline is in the right branch. */
- node = node->rb_right;
- }
-
- /*
- * We ran into an eligible node which is itself the best.
- * (Or nr_running == 0 and both are NULL)
- */
- if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
- return best;
-
- /*
- * Now best_left and all of its children are eligible, and we are just
- * looking for deadline == min_deadline
- */
- node = &best_left->run_node;
- while (node) {
- struct sched_entity *se = __node_2_se(node);
-
- /* min_deadline is the current node */
- if (se->deadline == se->min_deadline)
- return se;
-
- /* min_deadline is in the left branch */
- if (node->rb_left &&
- __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
- node = node->rb_left;
- continue;
}
- /* else min_deadline is in the right branch */
node = node->rb_right;
}
- return NULL;
-}
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
-{
- struct sched_entity *se = __pick_eevdf(cfs_rq);
+ if (!best || (curr && entity_before(curr, best)))
+ best = curr;
- if (!se) {
- struct sched_entity *left = __pick_first_entity(cfs_rq);
- if (left) {
- pr_err("EEVDF scheduling fail, picking leftmost\n");
- return left;
- }
- }
-
- return se;
+ return best;
}
-#ifdef CONFIG_SCHED_DEBUG
-struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
-{
- struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
-
- if (!last)
- return NULL;
-
- return __node_2_se(last);
-}
-
-/**************************************************************
- * Scheduling class statistics methods:
- */
-#ifdef CONFIG_SMP
-int sched_update_scaling(void)
-{
- unsigned int factor = get_update_sysctl_factor();
-
-#define WRT_SYSCTL(name) \
- (normalized_sysctl_##name = sysctl_##name / (factor))
- WRT_SYSCTL(sched_base_slice);
-#undef WRT_SYSCTL
-
- return 0;
-}
-#endif
-#endif
-
static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..539c7e763f15 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2822,6 +2822,7 @@ DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
double_rq_lock(_T->lock, _T->lock2),
double_rq_unlock(_T->lock, _T->lock2))
+extern struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
--
2.37.3
* [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
2023-11-07 10:12 ` Abel Wu
2023-11-07 9:05 ` [PATCH 4/4] sched/stats: branch statistics for pick_eevdf Abel Wu
3 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
Since the RB-tree is now sorted by deadline, let's first try the
leftmost entity, which has the earliest virtual deadline. I've done
some benchmarks to evaluate its effectiveness.
All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
hackbench: process/thread + pipe/socket + 1/2/4/8 groups
netperf: TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
tbench: loopback 24/48/72/96/192 threads
schbench: 1/2/4/8 mthreads
direct: cfs_rq has only one entity
parity: RUN_TO_PARITY
fast: O(1) fastpath
slow: heap search
(%) direct parity fast slow
hackbench 92.95 2.02 4.91 0.12
netperf 68.08 6.60 24.18 1.14
tbench 67.55 11.22 20.61 0.62
schbench 69.91 2.65 25.73 1.71
The above results indicate that this fastpath really makes task
selection more efficient.
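The shape of the fastpath can be illustrated with a userspace sketch (the toy_* names are hypothetical, an array sorted by deadline stands in for the rbtree, and the linear scan stands in for the heap search): if the entity with the earliest deadline is eligible, it is the answer immediately.

```c
#include <stdint.h>
#include <stddef.h>

struct toy_entity {
	int64_t vruntime;
	int64_t deadline;
};

/* Eligible iff vruntime is no later than the (given) average vruntime V */
static int toy_eligible(const struct toy_entity *se, int64_t V)
{
	return se->vruntime <= V;
}

/*
 * es[] is sorted by deadline, mimicking the rbtree order.
 * Returns the index of the picked entity, or n if none is eligible.
 */
static size_t toy_pick(const struct toy_entity *es, size_t n, int64_t V)
{
	size_t i;

	if (n && toy_eligible(&es[0], V))
		return 0;	/* O(1) fastpath: leftmost is eligible */

	for (i = 1; i < n; i++)	/* slowpath, stands in for the heap search */
		if (toy_eligible(&es[i], V))
			return i;
	return n;
}

static size_t toy_demo_fast(void)
{
	const struct toy_entity es[] = { { 1, 10 }, { 5, 20 } };

	return toy_pick(es, 2, 3);	/* leftmost eligible */
}

static size_t toy_demo_slow(void)
{
	const struct toy_entity es[] = { { 5, 10 }, { 2, 20 } };

	return toy_pick(es, 2, 3);	/* leftmost ineligible, scan on */
}
```

The branch statistics above suggest the first test succeeds for the large majority of picks, which is why hoisting it ahead of the tree walk pays off.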
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 459487bf8824..a1fdd0c7a051 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,6 +896,7 @@ int sched_update_scaling(void)
static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+ struct sched_entity *se = __pick_first_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
@@ -904,7 +905,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
* in this cfs_rq, saving some cycles.
*/
if (cfs_rq->nr_running == 1)
- return curr && curr->on_rq ? curr : __node_2_se(node);
+ return curr && curr->on_rq ? curr : se;
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -916,9 +917,14 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
return curr;
+ /* Pick the leftmost entity if it's eligible */
+ if (se && entity_eligible(cfs_rq, se)) {
+ best = se;
+ goto found;
+ }
+
/* Heap search for the EEVD entity */
while (node) {
- struct sched_entity *se = __node_2_se(node);
struct rb_node *left = node->rb_left;
/*
@@ -931,6 +937,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
continue;
}
+ se = __node_2_se(node);
+
/*
* The left subtree either is empty or has no eligible
* entity, so check the current node since it is the one
@@ -943,7 +951,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
node = node->rb_right;
}
-
+found:
if (!best || (curr && entity_before(curr, best)))
best = curr;
--
2.37.3
* [PATCH 4/4] sched/stats: branch statistics for pick_eevdf
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
` (2 preceding siblings ...)
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
@ 2023-11-07 9:05 ` Abel Wu
3 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 9:05 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Abel Wu
For tracing purposes only; not intended for upstream.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 12 ++++++++++--
kernel/sched/sched.h | 5 +++++
kernel/sched/stats.c | 6 ++++--
3 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1fdd0c7a051..9b07e9437526 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -899,13 +899,16 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
struct sched_entity *se = __pick_first_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;
+ struct rq *rq = rq_of(cfs_rq);
/*
* We can safely skip eligibility check if there is only one entity
* in this cfs_rq, saving some cycles.
*/
- if (cfs_rq->nr_running == 1)
+ if (cfs_rq->nr_running == 1) {
+ schedstat_inc(rq->pick_direct);
return curr && curr->on_rq ? curr : se;
+ }
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -914,15 +917,20 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
* Once selected, run a task until it either becomes non-eligible or
* until it gets a new slice. See the HACK in set_next_entity().
*/
- if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+ if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline) {
+ schedstat_inc(rq->pick_parity);
return curr;
+ }
/* Pick the leftmost entity if it's eligible */
if (se && entity_eligible(cfs_rq, se)) {
+ schedstat_inc(rq->pick_fast);
best = se;
goto found;
}
+ schedstat_inc(rq->pick_slow);
+
/* Heap search for the EEVD entity */
while (node) {
struct rb_node *left = node->rb_left;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 539c7e763f15..85a79990a698 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1105,6 +1105,11 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ unsigned int pick_direct;
+ unsigned int pick_parity;
+ unsigned int pick_fast;
+ unsigned int pick_slow;
#endif
#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..4b862c798989 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->pick_direct, rq->pick_parity,
+ rq->pick_fast, rq->pick_slow);
seq_printf(seq, "\n");
--
2.37.3
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
@ 2023-11-07 9:52 ` Peter Zijlstra
2023-11-14 21:57 ` [tip: sched/urgent] " tip-bot2 for Abel Wu
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-11-07 9:52 UTC (permalink / raw)
To: Abel Wu
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On Tue, Nov 07, 2023 at 05:05:07PM +0800, Abel Wu wrote:
> vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
> it gets re-weighted, and the calculations can be simplified based
> on the fact that re-weight won't change the w-average of all the
> entities. Please check the proofs in comments.
>
> But adjusting vruntime can also cause position change in RB-tree
> hence require re-queue to fix up which might be costly. This might
> be avoided by deferring adjustment to the time the entity actually
> leaves tree (dequeue/pick), but that will negatively affect task
> selection and probably not good enough either.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Very good, thanks!
It's a bit sad we have to muck about with the tree now, but alas.
If only Google and FB could agree on what to do with this cgroup
nonsense, then maybe we could get rid of all this.
* Re: [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
@ 2023-11-07 10:12 ` Abel Wu
2023-11-07 10:42 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2023-11-07 10:12 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 11/7/23 5:05 PM, Abel Wu Wrote:
> Since the RB-tree is now sorted by deadline, let's first try the
> leftmost entity which has the earliest virtual deadline. I've done
> some benchmarks to see its effectiveness.
>
> All the benchmarks are done inside a normal cpu cgroup in a clean
> environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
> Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
>
> hackbench: process/thread + pipe/socket + 1/2/4/8 groups
> netperf: TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
> tbench: loopback 24/48/72/96/192 threads
> schbench: 1/2/4/8 mthreads
>
> direct: cfs_rq has only one entity
> parity: RUN_TO_PARITY
> fast: O(1) fastpath
> slow: heap search
>
> (%) direct parity fast slow
> hackbench 92.95 2.02 4.91 0.12
> netperf 68.08 6.60 24.18 1.14
> tbench 67.55 11.22 20.61 0.62
> schbench 69.91 2.65 25.73 1.71
>
> The above results indicate that this fastpath really makes task
> selection more efficient.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
> kernel/sched/fair.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 459487bf8824..a1fdd0c7a051 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -896,6 +896,7 @@ int sched_update_scaling(void)
> static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> {
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> + struct sched_entity *se = __pick_first_entity(cfs_rq);
> struct sched_entity *curr = cfs_rq->curr;
> struct sched_entity *best = NULL;
>
> @@ -904,7 +905,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> * in this cfs_rq, saving some cycles.
> */
> if (cfs_rq->nr_running == 1)
> - return curr && curr->on_rq ? curr : __node_2_se(node);
> + return curr && curr->on_rq ? curr : se;
Maybe we can reduce memory footprint on curr by:
return se ? se : curr;
>
> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> curr = NULL;
> @@ -916,9 +917,14 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> return curr;
>
> + /* Pick the leftmost entity if it's eligible */
> + if (se && entity_eligible(cfs_rq, se)) {
> + best = se;
> + goto found;
> + }
> +
> /* Heap search for the EEVD entity */
> while (node) {
> - struct sched_entity *se = __node_2_se(node);
> struct rb_node *left = node->rb_left;
>
> /*
> @@ -931,6 +937,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> continue;
> }
>
> + se = __node_2_se(node);
> +
> /*
> * The left subtree either is empty or has no eligible
> * entity, so check the current node since it is the one
> @@ -943,7 +951,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>
> node = node->rb_right;
> }
> -
> +found:
> if (!best || (curr && entity_before(curr, best)))
> best = curr;
>
* Re: [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection
2023-11-07 10:12 ` Abel Wu
@ 2023-11-07 10:42 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-11-07 10:42 UTC (permalink / raw)
To: Abel Wu
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On Tue, Nov 07, 2023 at 06:12:49PM +0800, Abel Wu wrote:
> > @@ -904,7 +905,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > * in this cfs_rq, saving some cycles.
> > */
> > if (cfs_rq->nr_running == 1)
> > - return curr && curr->on_rq ? curr : __node_2_se(node);
> > + return curr && curr->on_rq ? curr : se;
>
> Maybe we can reduce memory footprint on curr by:
>
> return se ? se : curr;
Irrespective, I think that logic makes more sense. If we know we have
but one task and the tree has a task, it must be that task, otherwise,
current must be it.
Anyway, I was still staring at the previous patch, flipping the tree
around like that is clever. Yes I suppose that ought to work just fine.
* Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
@ 2023-11-07 11:03 ` Peter Zijlstra
2023-11-07 11:06 ` Abel Wu
2023-11-07 23:26 ` Benjamin Segall
1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-11-07 11:03 UTC (permalink / raw)
To: Abel Wu
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On Tue, Nov 07, 2023 at 05:05:08PM +0800, Abel Wu wrote:
> Sort the task timeline by virtual deadline and keep the min_vruntime
> in the augmented tree, so we can avoid doubling the worst case cost
> and make full use of the cached leftmost node to enable O(1) fastpath
> picking in next patch.
Another very good patch, just a note:
> This patch also cleans up the unused max_vruntime() and adjusts pos
> for some functions.
There's this thing about saying 'this patch' in Changelogs: by
definition the Changelog is about 'this patch', so saying it is a bit
redundant.
It even gets mentioned in the Documentation on submitting patches
somewhere.
But what I care about more is this patch doing extra unrelated things;
those should be split out.
* Re: Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 11:03 ` Peter Zijlstra
@ 2023-11-07 11:06 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-07 11:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Benjamin Segall, Chen Yu,
Daniel Jordan, Gautham R . Shenoy, Joel Fernandes,
K Prateek Nayak, Mike Galbraith, Qais Yousef, Tim Chen,
Yicong Yang, Youssef Esmat, linux-kernel
On 11/7/23 7:03 PM, Peter Zijlstra Wrote:
> On Tue, Nov 07, 2023 at 05:05:08PM +0800, Abel Wu wrote:
>
>> Sort the task timeline by virtual deadline and keep the min_vruntime
>> in the augmented tree, so we can avoid doubling the worst case cost
>> and make full use of the cached leftmost node to enable O(1) fastpath
>> picking in next patch.
>
> Another very good patch, just a note:
>
>> This patch also cleans up the unused max_vruntime() and adjusts pos
>> for some functions.
>
> There's this thing about saying 'this patch' in Changelogs, per
> definition the Changelog is about 'this patch' so saying it is a bit
> redundant.
I see, thanks.
>
> It even gets mentioned in the Documentation on submitting patches
> somewhere.
>
> But what I care about more is this patch doing extra unrelated things,
> this should be split out.
I will fix this in next version.
Thanks,
Abel
* Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
2023-11-07 11:03 ` Peter Zijlstra
@ 2023-11-07 23:26 ` Benjamin Segall
2023-11-08 8:51 ` Abel Wu
1 sibling, 1 reply; 25+ messages in thread
From: Benjamin Segall @ 2023-11-07 23:26 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
Abel Wu <wuyun.abel@bytedance.com> writes:
> Sort the task timeline by virtual deadline and keep the min_vruntime
> in the augmented tree, so we can avoid doubling the worst case cost
> and make full use of the cached leftmost node to enable O(1) fastpath
> picking in next patch.
>
> This patch also cleans up the unused max_vruntime() and adjusts pos
> for some functions.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
I've run this through my tester and it agrees that it does fulfil the
EEVDF pick (though this implementation is trivial enough that that's
fairly obvious just by reading the code, which is a nice bonus upgrade).
And it makes sense that this would help for performance, and the
fastpath seems likely to trigger most of the time for even better
results.
* Re: Re: [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline
2023-11-07 23:26 ` Benjamin Segall
@ 2023-11-08 8:51 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-08 8:51 UTC (permalink / raw)
To: Benjamin Segall
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Barry Song, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 11/8/23 7:26 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
>
>> Sort the task timeline by virtual deadline and keep the min_vruntime
>> in the augmented tree, so we can avoid doubling the worst case cost
>> and make full use of the cached leftmost node to enable O(1) fastpath
>> picking in next patch.
>>
>> This patch also cleans up the unused max_vruntime() and adjusts pos
>> for some functions.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>
> I've run this through my tester and it agrees that it does fulfil the
> EEVDF pick (though this implementation is trivial enough that that's
> fairly obvious just by reading the code, which is a nice bonus upgrade).
>
> And it makes sense that this would help for performance, and the
> fastpath seems likely to trigger most of the time for even better
> results.
Thanks, Benjamin!
* [tip: sched/urgent] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
@ 2023-11-14 21:57 ` tip-bot2 for Abel Wu
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
2024-02-29 9:24 ` Tianchen Ding
3 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Abel Wu @ 2023-11-14 21:57 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Abel Wu, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: eab03c23c2a162085b13200d7942fc5a00b5ccc8
Gitweb: https://git.kernel.org/tip/eab03c23c2a162085b13200d7942fc5a00b5ccc8
Author: Abel Wu <wuyun.abel@bytedance.com>
AuthorDate: Tue, 07 Nov 2023 17:05:07 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 14 Nov 2023 22:27:00 +01:00
sched/eevdf: Fix vruntime adjustment on reweight
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.
But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
---
kernel/sched/fair.c | 151 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 128 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2048138..025d909 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
#endif
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ unsigned long weight)
+{
+ unsigned long old_weight = se->load.weight;
+ u64 avruntime = avg_vruntime(cfs_rq);
+ s64 vlag, vslice;
+
+ /*
+ * VRUNTIME
+ * ========
+ *
+ * COROLLARY #1: The virtual runtime of the entity needs to be
+ * adjusted if re-weight at !0-lag point.
+ *
+ * Proof: For contradiction assume this is not true, so we can
+ * re-weight without changing vruntime at !0-lag point.
+ *
+ * Weight VRuntime Avg-VRuntime
+ * before w v V
+ * after w' v' V'
+ *
+ * Since lag needs to be preserved through re-weight:
+ *
+ * lag = (V - v)*w = (V'- v')*w', where v = v'
+ * ==> V' = (V - v)*w/w' + v (1)
+ *
+ * Let W be the total weight of the entities before reweight,
+ * since V' is the new weighted average of entities:
+ *
+ * V' = (WV + w'v - wv) / (W + w' - w) (2)
+ *
+ * by using (1) & (2) we obtain:
+ *
+ * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+ * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+ * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+ * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+ *
+ * Since we are doing at !0-lag point which means V != v, we
+ * can simplify (3):
+ *
+ * ==> W / (W + w' - w) = w / w'
+ * ==> Ww' = Ww + ww' - ww
+ * ==> W * (w' - w) = w * (w' - w)
+ * ==> W = w (re-weight indicates w' != w)
+ *
+ * So the cfs_rq contains only one entity, hence vruntime of
+ * the entity @v should always equal to the cfs_rq's weighted
+ * average vruntime @V, which means we will always re-weight
+ * at 0-lag point, thus breach assumption. Proof completed.
+ *
+ *
+ * COROLLARY #2: Re-weight does NOT affect weighted average
+ * vruntime of all the entities.
+ *
+ * Proof: According to corollary #1, Eq. (1) should be:
+ *
+ * (V - v)*w = (V' - v')*w'
+ * ==> v' = V' - (V - v)*w/w' (4)
+ *
+ * According to the weighted average formula, we have:
+ *
+ * V' = (WV - wv + w'v') / (W - w + w')
+ * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+ * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+ * = (WV + w'V' - Vw) / (W - w + w')
+ *
+ * ==> V'*(W - w + w') = WV + w'V' - Vw
+ * ==> V' * (W - w) = (W - w) * V (5)
+ *
+ * If the entity is the only one in the cfs_rq, then reweight
+ * always occurs at 0-lag point, so V won't change. Or else
+ * there are other entities, hence W != w, then Eq. (5) turns
+ * into V' = V. So V won't change in either case, proof done.
+ *
+ *
+ * So according to corollary #1 & #2, the effect of re-weight
+ * on vruntime should be:
+ *
+ * v' = V' - (V - v) * w / w' (4)
+ * = V - (V - v) * w / w'
+ * = V - vl * w / w'
+ * = V - vl'
+ */
+ if (avruntime != se->vruntime) {
+ vlag = (s64)(avruntime - se->vruntime);
+ vlag = div_s64(vlag * old_weight, weight);
+ se->vruntime = avruntime - vlag;
+ }
+
+ /*
+ * DEADLINE
+ * ========
+ *
+ * When the weight changes, the virtual time slope changes and
+ * we should adjust the relative virtual deadline accordingly.
+ *
+ * d' = v' + (d - v)*w/w'
+ * = V' - (V - v)*w/w' + (d - v)*w/w'
+ * = V - (V - v)*w/w' + (d - v)*w/w'
+ * = V + (d - V)*w/w'
+ */
+ vslice = (s64)(se->deadline - avruntime);
+ vslice = div_s64(vslice * old_weight, weight);
+ se->deadline = avruntime + vslice;
+}
+
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
- unsigned long old_weight = se->load.weight;
+ bool curr = cfs_rq->curr == se;
if (se->on_rq) {
/* commit outstanding execution time */
- if (cfs_rq->curr == se)
+ if (curr)
update_curr(cfs_rq);
else
- avg_vruntime_sub(cfs_rq, se);
+ __dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);
- update_load_set(&se->load, weight);
-
if (!se->on_rq) {
/*
* Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
* we need to scale se->vlag when w_i changes.
*/
- se->vlag = div_s64(se->vlag * old_weight, weight);
+ se->vlag = div_s64(se->vlag * se->load.weight, weight);
} else {
- s64 deadline = se->deadline - se->vruntime;
- /*
- * When the weight changes, the virtual time slope changes and
- * we should adjust the relative virtual deadline accordingly.
- */
- deadline = div_s64(deadline * old_weight, weight);
- se->deadline = se->vruntime + deadline;
- if (se != cfs_rq->curr)
- min_deadline_cb_propagate(&se->run_node, NULL);
+ reweight_eevdf(cfs_rq, se, weight);
}
+ update_load_set(&se->load, weight);
+
#ifdef CONFIG_SMP
do {
u32 divider = get_pelt_divider(&se->avg);
@@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
- if (cfs_rq->curr != se)
- avg_vruntime_add(cfs_rq, se);
+ if (!curr) {
+ /*
+ * The entity's vruntime has been adjusted, so let's check
+ * whether the rq-wide min_vruntime needs updated too. Since
+ * the calculations above require stable min_vruntime rather
+ * than up-to-date one, we do the update at the end of the
+ * reweight process.
+ */
+ __enqueue_entity(cfs_rq, se);
+ update_min_vruntime(cfs_rq);
+ }
}
}
@@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
#ifndef CONFIG_SMP
shares = READ_ONCE(gcfs_rq->tg->shares);
-
- if (likely(se->load.weight == shares))
- return;
#else
- shares = calc_group_shares(gcfs_rq);
+ shares = calc_group_shares(gcfs_rq);
#endif
-
- reweight_entity(cfs_rq_of(se), se, shares);
+ if (unlikely(se->load.weight != shares))
+ reweight_entity(cfs_rq_of(se), se, shares);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
2023-11-14 21:57 ` [tip: sched/urgent] " tip-bot2 for Abel Wu
@ 2023-11-15 15:36 ` Yiwei Lin
2023-11-16 4:48 ` Abel Wu
2024-02-29 9:24 ` Tianchen Ding
3 siblings, 1 reply; 25+ messages in thread
From: Yiwei Lin @ 2023-11-15 15:36 UTC (permalink / raw)
To: Abel Wu
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/7/23 17:05, Abel Wu wrote:
> vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
> it gets re-weighted, and the calculations can be simplified based
> on the fact that re-weight won't change the w-average of all the
> entities. Please check the proofs in comments.
>
> But adjusting vruntime can also cause position change in RB-tree
> hence require re-queue to fix up which might be costly. This might
> be avoided by deferring adjustment to the time the entity actually
> leaves tree (dequeue/pick), but that will negatively affect task
> selection and probably not good enough either.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
> kernel/sched/fair.c | 151 +++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 128 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8767988242ee..b00d09a9b601 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3666,41 +3666,140 @@ static inline void
> dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
> #endif
>
> +static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> + unsigned long weight)
> +{
> + unsigned long old_weight = se->load.weight;
> + u64 avruntime = avg_vruntime(cfs_rq);
> + s64 vlag, vslice;
> +
> + /*
> + * VRUNTIME
> + * ========
> + *
> + * COROLLARY #1: The virtual runtime of the entity needs to be
> + * adjusted if re-weight at !0-lag point.
> + *
> + * Proof: For contradiction assume this is not true, so we can
> + * re-weight without changing vruntime at !0-lag point.
> + *
> + * Weight VRuntime Avg-VRuntime
> + * before w v V
> + * after w' v' V'
> + *
> + * Since lag needs to be preserved through re-weight:
> + *
> + * lag = (V - v)*w = (V'- v')*w', where v = v'
> + * ==> V' = (V - v)*w/w' + v (1)
> + *
> + * Let W be the total weight of the entities before reweight,
> + * since V' is the new weighted average of entities:
> + *
> + * V' = (WV + w'v - wv) / (W + w' - w) (2)
> + *
> + * by using (1) & (2) we obtain:
> + *
> + * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
> + * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
> + * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
> + * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
> + *
> + * Since we are doing at !0-lag point which means V != v, we
> + * can simplify (3):
> + *
> + * ==> W / (W + w' - w) = w / w'
> + * ==> Ww' = Ww + ww' - ww
> + * ==> W * (w' - w) = w * (w' - w)
> + * ==> W = w (re-weight indicates w' != w)
> + *
> + * So the cfs_rq contains only one entity, hence vruntime of
> + * the entity @v should always equal to the cfs_rq's weighted
> + * average vruntime @V, which means we will always re-weight
> + * at 0-lag point, thus breach assumption. Proof completed.
> + *
> + *
> + * COROLLARY #2: Re-weight does NOT affect weighted average
> + * vruntime of all the entities.
> + *
> + * Proof: According to corollary #1, Eq. (1) should be:
> + *
> + * (V - v)*w = (V' - v')*w'
> + * ==> v' = V' - (V - v)*w/w' (4)
> + *
> + * According to the weighted average formula, we have:
> + *
> + * V' = (WV - wv + w'v') / (W - w + w')
> + * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
> + * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
> + * = (WV + w'V' - Vw) / (W - w + w')
> + *
> + * ==> V'*(W - w + w') = WV + w'V' - Vw
> + * ==> V' * (W - w) = (W - w) * V (5)
> + *
> + * If the entity is the only one in the cfs_rq, then reweight
> + * always occurs at 0-lag point, so V won't change. Or else
> + * there are other entities, hence W != w, then Eq. (5) turns
> + * into V' = V. So V won't change in either case, proof done.
> + *
> + *
> + * So according to corollary #1 & #2, the effect of re-weight
> + * on vruntime should be:
> + *
> + * v' = V' - (V - v) * w / w' (4)
> + * = V - (V - v) * w / w'
> + * = V - vl * w / w'
> + * = V - vl'
> + */
> + if (avruntime != se->vruntime) {
> + vlag = (s64)(avruntime - se->vruntime);
> + vlag = div_s64(vlag * old_weight, weight);
> + se->vruntime = avruntime - vlag;
> + }
> +
> + /*
> + * DEADLINE
> + * ========
> + *
> + * When the weight changes, the virtual time slope changes and
> + * we should adjust the relative virtual deadline accordingly.
> + *
> + * d' = v' + (d - v)*w/w'
> + * = V' - (V - v)*w/w' + (d - v)*w/w'
> + * = V - (V - v)*w/w' + (d - v)*w/w'
> + * = V + (d - V)*w/w'
> + */
> + vslice = (s64)(se->deadline - avruntime);
> + vslice = div_s64(vslice * old_weight, weight);
> + se->deadline = avruntime + vslice;
> +}
> +
> static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> unsigned long weight)
> {
> - unsigned long old_weight = se->load.weight;
> + bool curr = cfs_rq->curr == se;
>
> if (se->on_rq) {
> /* commit outstanding execution time */
> - if (cfs_rq->curr == se)
> + if (curr)
> update_curr(cfs_rq);
> else
> - avg_vruntime_sub(cfs_rq, se);
> + __dequeue_entity(cfs_rq, se);
> update_load_sub(&cfs_rq->load, se->load.weight);
> }
> dequeue_load_avg(cfs_rq, se);
>
> - update_load_set(&se->load, weight);
> -
> if (!se->on_rq) {
> /*
> * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
> * we need to scale se->vlag when w_i changes.
> */
> - se->vlag = div_s64(se->vlag * old_weight, weight);
> + se->vlag = div_s64(se->vlag * se->load.weight, weight);
> } else {
> - s64 deadline = se->deadline - se->vruntime;
> - /*
> - * When the weight changes, the virtual time slope changes and
> - * we should adjust the relative virtual deadline accordingly.
> - */
> - deadline = div_s64(deadline * old_weight, weight);
> - se->deadline = se->vruntime + deadline;
> - if (se != cfs_rq->curr)
> - min_deadline_cb_propagate(&se->run_node, NULL);
> + reweight_eevdf(cfs_rq, se, weight);
> }
>
> + update_load_set(&se->load, weight);
> +
> #ifdef CONFIG_SMP
> do {
> u32 divider = get_pelt_divider(&se->avg);
> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> enqueue_load_avg(cfs_rq, se);
> if (se->on_rq) {
> update_load_add(&cfs_rq->load, se->load.weight);
> - if (cfs_rq->curr != se)
> - avg_vruntime_add(cfs_rq, se);
> + if (!curr) {
> + /*
> + * The entity's vruntime has been adjusted, so let's check
> + * whether the rq-wide min_vruntime needs updated too. Since
> + * the calculations above require stable min_vruntime rather
> + * than up-to-date one, we do the update at the end of the
> + * reweight process.
> + */
> + __enqueue_entity(cfs_rq, se);
> + update_min_vruntime(cfs_rq);
> + }
> }
> }
Sorry if I am asking a stupid question... It looks like
reweight_entity() may change the weight of the cfs_rq->curr entity,
but we never call update_min_vruntime() when reweighting it. Is
there any reason we can skip update_min_vruntime() in this case?
>
> @@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)
>
> #ifndef CONFIG_SMP
> shares = READ_ONCE(gcfs_rq->tg->shares);
> -
> - if (likely(se->load.weight == shares))
> - return;
> #else
> - shares = calc_group_shares(gcfs_rq);
> + shares = calc_group_shares(gcfs_rq);
> #endif
> -
> - reweight_entity(cfs_rq_of(se), se, shares);
> + if (unlikely(se->load.weight != shares))
> + reweight_entity(cfs_rq_of(se), se, shares);
> }
>
> #else /* CONFIG_FAIR_GROUP_SCHED */
Thanks,
Yiwei Lin
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
@ 2023-11-16 4:48 ` Abel Wu
2023-11-16 4:59 ` Yiwei Lin
2023-11-16 5:07 ` Abel Wu
0 siblings, 2 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-16 4:48 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>
>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> enqueue_load_avg(cfs_rq, se);
>> if (se->on_rq) {
>> update_load_add(&cfs_rq->load, se->load.weight);
>> - if (cfs_rq->curr != se)
>> - avg_vruntime_add(cfs_rq, se);
>> + if (!curr) {
>> + /*
>> + * The entity's vruntime has been adjusted, so let's check
>> + * whether the rq-wide min_vruntime needs updated too. Since
>> + * the calculations above require stable min_vruntime rather
>> + * than up-to-date one, we do the update at the end of the
>> + * reweight process.
>> + */
>> + __enqueue_entity(cfs_rq, se);
>> + update_min_vruntime(cfs_rq);
>> + }
>> }
>> }
> Sorry if I am asking a stupid question... It looks like reweight_entity() may change the weight of the cfs_rq->curr entity, but we never call update_min_vruntime() when reweighting it. Is there any reason we can skip update_min_vruntime() in this case?
No, you are right!
Thanks!
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 4:48 ` Abel Wu
@ 2023-11-16 4:59 ` Yiwei Lin
2023-11-16 9:01 ` Abel Wu
2023-11-16 5:07 ` Abel Wu
1 sibling, 1 reply; 25+ messages in thread
From: Yiwei Lin @ 2023-11-16 4:59 UTC (permalink / raw)
To: Abel Wu
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 12:48, Abel Wu wrote:
> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>
>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> enqueue_load_avg(cfs_rq, se);
>>> if (se->on_rq) {
>>> update_load_add(&cfs_rq->load, se->load.weight);
>>> - if (cfs_rq->curr != se)
>>> - avg_vruntime_add(cfs_rq, se);
>>> + if (!curr) {
>>> + /*
>>> + * The entity's vruntime has been adjusted, so let's check
>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>> + * the calculations above require stable min_vruntime rather
>>> + * than up-to-date one, we do the update at the end of the
>>> + * reweight process.
>>> + */
>>> + __enqueue_entity(cfs_rq, se);
>>> + update_min_vruntime(cfs_rq);
>>> + }
>>> }
>>> }
>> Sorry if I am asking stupid question...... It looks like
>> reweight_entity() may have chance to change the weight of
>> cfs_rq->curr entity, but we'll never update_min_vruntime() when
>> reweighting it. Is there any reason that we can skip the
>> update_min_vruntime() for this case?
>
> No, you are right!
>
> Thanks!
> Abel
Thank you! I'll take the responsibility to fix this.
Yiwei Lin
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 4:48 ` Abel Wu
2023-11-16 4:59 ` Yiwei Lin
@ 2023-11-16 5:07 ` Abel Wu
2023-11-16 6:51 ` Yiwei Lin
1 sibling, 1 reply; 25+ messages in thread
From: Abel Wu @ 2023-11-16 5:07 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 12:48 PM, Abel Wu Wrote:
> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>
>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> enqueue_load_avg(cfs_rq, se);
>>> if (se->on_rq) {
>>> update_load_add(&cfs_rq->load, se->load.weight);
>>> - if (cfs_rq->curr != se)
>>> - avg_vruntime_add(cfs_rq, se);
>>> + if (!curr) {
>>> + /*
>>> + * The entity's vruntime has been adjusted, so let's check
>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>> + * the calculations above require stable min_vruntime rather
>>> + * than up-to-date one, we do the update at the end of the
>>> + * reweight process.
>>> + */
>>> + __enqueue_entity(cfs_rq, se);
>>> + update_min_vruntime(cfs_rq);
>>> + }
>>> }
>>> }
>> Sorry if I am asking stupid question...... It looks like reweight_entity() may have chance to change the weight of cfs_rq->curr entity, but we'll never update_min_vruntime() when reweighting it. Is there any reason that we can skip the update_min_vruntime() for this case?
>
> No, you are right!
I intended to update_min_vruntime() if se->on_rq, no matter whether
it is curr or not, just as you suggested. But on second thought I
wonder whether it is necessary to update *NOW*, since we will always
update_curr() before making any change to the cfs_rq. Thoughts?
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 5:07 ` Abel Wu
@ 2023-11-16 6:51 ` Yiwei Lin
2023-11-16 7:11 ` Abel Wu
0 siblings, 1 reply; 25+ messages in thread
From: Yiwei Lin @ 2023-11-16 6:51 UTC (permalink / raw)
To: Abel Wu
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 13:07, Abel Wu wrote:
> On 11/16/23 12:48 PM, Abel Wu Wrote:
>> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>>
>>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>>> enqueue_load_avg(cfs_rq, se);
>>>> if (se->on_rq) {
>>>> update_load_add(&cfs_rq->load, se->load.weight);
>>>> - if (cfs_rq->curr != se)
>>>> - avg_vruntime_add(cfs_rq, se);
>>>> + if (!curr) {
>>>> + /*
>>>> + * The entity's vruntime has been adjusted, so let's check
>>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>>> + * the calculations above require stable min_vruntime rather
>>>> + * than up-to-date one, we do the update at the end of the
>>>> + * reweight process.
>>>> + */
>>>> + __enqueue_entity(cfs_rq, se);
>>>> + update_min_vruntime(cfs_rq);
>>>> + }
>>>> }
>>>> }
>>> Sorry if I am asking stupid question...... It looks like
>>> reweight_entity() may have chance to change the weight of
>>> cfs_rq->curr entity, but we'll never update_min_vruntime() when
>>> reweighting it. Is there any reason that we can skip the
>>> update_min_vruntime() for this case?
>>
>> No, you are right!
>
> I was intended to update_min_vruntime() if se->on_rq and no matter
> it is curr or not, just as you suggested. But after a second thought
> I wonder if it is necessary to update *NOW*, since we will always
> update_curr() before making any change to cfs_rq. Thoughts?
I missed the fact that we update_min_vruntime() every time we
update_curr(). Given that, we can indeed wait until we need the
correct min_vruntime and call update_min_vruntime() then. The only
concern I came up with is that sched_debug may not reflect an
accurate min_vruntime in time, but that may not be a big problem.
Going further, I have another thought: we could remove the
update_min_vruntime() here in reweight_entity() entirely to save
more time. My reasoning is that min_vruntime is no longer used to
normalize vruntime as it was in CFS, so we will always have run
update_curr() for the latest min_vruntime before using it. The
update_min_vruntime() in dequeue_entity() could also be removed for
the same reason, i.e. just do update_min_vruntime() in update_curr()
to simplify things. What do you think?
Thanks,
Yiwei Lin
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 6:51 ` Yiwei Lin
@ 2023-11-16 7:11 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-16 7:11 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 2:51 PM, Yiwei Lin Wrote:
>
> On 11/16/23 13:07, Abel Wu wrote:
>> On 11/16/23 12:48 PM, Abel Wu Wrote:
>>> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>>>
>>>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>>>> enqueue_load_avg(cfs_rq, se);
>>>>> if (se->on_rq) {
>>>>> update_load_add(&cfs_rq->load, se->load.weight);
>>>>> - if (cfs_rq->curr != se)
>>>>> - avg_vruntime_add(cfs_rq, se);
>>>>> + if (!curr) {
>>>>> + /*
>>>>> + * The entity's vruntime has been adjusted, so let's check
>>>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>>>> + * the calculations above require stable min_vruntime rather
>>>>> + * than up-to-date one, we do the update at the end of the
>>>>> + * reweight process.
>>>>> + */
>>>>> + __enqueue_entity(cfs_rq, se);
>>>>> + update_min_vruntime(cfs_rq);
>>>>> + }
>>>>> }
>>>>> }
>>>> Sorry if I am asking stupid question...... It looks like reweight_entity() may have chance to change the weight of cfs_rq->curr entity, but we'll never update_min_vruntime() when reweighting it. Is there any reason that we can skip the update_min_vruntime() for this case?
>>>
>>> No, you are right!
>>
>> I was intended to update_min_vruntime() if se->on_rq and no matter
>> it is curr or not, just as you suggested. But after a second thought
>> I wonder if it is necessary to update *NOW*, since we will always
>> update_curr() before making any change to cfs_rq. Thoughts?
> I lost the fact that we'll update_min_vruntime() every time we update_curr(). Because of this fact, we can indeed wait until we need the correct min_vruntime and update_min_vruntime() then. The only consideration that I came up with is that the sched_debug may not be able to reflect the accurate min_vruntime in time. But this may not be a big problem.
>
> Further, I have another advanced thought we can remove the update_min_vruntime() here in the reweight_entity() directly to save more time. The reason that I think this is because min_vruntime is not for normalization of vruntime as before which is required on CFS, so we will always update_curr() for the latest min_vruntime before using it. Also, the update_min_vruntime() in dequeue_entity() may also be removed as the reason, i.e. just do update_min_vruntime() in update_curr() to simplify. What do you think?
Yes, this is exactly what I am thinking about too. Since task
placement now adopts a lag-based solution which is independent of
min_vruntime, and since min_vruntime is only used as a base offset
for calculating avg_vruntime (in order to avoid overflow), we can
probably update it in a more relaxed way, e.g. on ticks. If the
relaxed update works, there still seems to be work to be done first:
1) the priority of core pick when core scheduling is enabled needs
to change to a deadline-based solution;
2) we need to make sure nothing overflows in NOHZ_FULL mode.
Just some first thoughts that came to mind :)
Thanks,
Abel
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-16 4:59 ` Yiwei Lin
@ 2023-11-16 9:01 ` Abel Wu
0 siblings, 0 replies; 25+ messages in thread
From: Abel Wu @ 2023-11-16 9:01 UTC (permalink / raw)
To: Yiwei Lin
Cc: Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider
On 11/16/23 12:59 PM, Yiwei Lin Wrote:
>
> On 11/16/23 12:48, Abel Wu wrote:
>> On 11/15/23 11:36 PM, Yiwei Lin Wrote:
>>>
>>>> @@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>>> enqueue_load_avg(cfs_rq, se);
>>>> if (se->on_rq) {
>>>> update_load_add(&cfs_rq->load, se->load.weight);
>>>> - if (cfs_rq->curr != se)
>>>> - avg_vruntime_add(cfs_rq, se);
>>>> + if (!curr) {
>>>> + /*
>>>> + * The entity's vruntime has been adjusted, so let's check
>>>> + * whether the rq-wide min_vruntime needs updated too. Since
>>>> + * the calculations above require stable min_vruntime rather
>>>> + * than up-to-date one, we do the update at the end of the
>>>> + * reweight process.
>>>> + */
>>>> + __enqueue_entity(cfs_rq, se);
>>>> + update_min_vruntime(cfs_rq);
>>>> + }
>>>> }
>>>> }
>>> Sorry if I am asking stupid question...... It looks like reweight_entity() may have chance to change the weight of cfs_rq->curr entity, but we'll never update_min_vruntime() when reweighting it. Is there any reason that we can skip the update_min_vruntime() for this case?
>>
>> No, you are right!
>>
>> Thanks!
>> Abel
> Thank you! I'll take the responsibility to fix this.
That would be appreciated. I suggest fixing this before we
do further optimizations.
Thanks,
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
` (2 preceding siblings ...)
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
@ 2024-02-29 9:24 ` Tianchen Ding
2024-02-29 14:25 ` Abel Wu
3 siblings, 1 reply; 25+ messages in thread
From: Tianchen Ding @ 2024-02-29 9:24 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
Hi Abel,
I'm not so familiar with eevdf and am still learning, so I have some
questions about this patch.
1. You proved that V will not change during reweight (COROLLARY #2).
However, according to the original paper, the new V will be:
V' = V + lag(j)/(W - w_j) - lag(j)/(W - w_j + w'_j)
So V' in the paper does change (when lag is not zero).
Is the V in the Linux code slightly different from the one in the paper?
2. I printed some logs around reweight_entity(), mainly to print V by
calling avg_vruntime(cfs_rq). I found your algorithm only keeps V
unchanged during reweight_eevdf(), but not during reweight_entity().
In detail:
If curr is true (i.e., cfs_rq->curr == se), we directly run
reweight_eevdf(), and V is not changed.
If curr is false, we have __dequeue_entity() -> reweight_eevdf() ->
__enqueue_entity(), and V ends up changed by the dequeue and enqueue.
So the result of reweight_entity() is affected by "cfs_rq->curr == se";
is this expected?
Thanks!
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-02-29 9:24 ` Tianchen Ding
@ 2024-02-29 14:25 ` Abel Wu
2024-03-01 6:41 ` Tianchen Ding
0 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2024-02-29 14:25 UTC (permalink / raw)
To: Tianchen Ding
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
Hi Tianchen,
On 2/29/24 5:24 PM, Tianchen Ding Wrote:
> Hi Abel:
>
> I'm not so familiar with eevdf and still learning. Here I've some questions about this patch.
>
> 1. You did proof that V will not change during reweight (COROLLARY #2). However, according to the original paper, the new V will be:
> V' = V + lag(j)/(W - w_j) - lag(j)/(W - w_j + w'_j)
> So the V' in paper will change (when lag is not zero).
> Is the V in Linux code slightly different from the paper?
Good catch. And to the best of my knowledge, the answer is YES. The
above equation in the paper, Eq. (20), is based on the assumption
that:
"once client 3 leaves, the remaining two clients will
proportionally support the eventual loss or gain in the
service time" -- Page 10
"by updating the virtual time according to Eq. (18,19) we
ensure that the sum over the lags of all active clients
is always zero" -- Page 11
But in Peter's implementation, it is the competitors in the new group
that client 3 later joins who actually absorb the effect. So when
client 3 leaves the competition with !0-lag in Linux, the rq's
sum(lag_i) is no longer zero.
>
> 2. I print some log around reweight_entity(), mainly want to print V by calling avg_vruntime(cfs_rq). I found your algorithm only keeps the V unchanged during reweight_eevdf(), but not reweight_entity().
>
> In detail:
> If curr is true (i.e., cfs_rq->curr == se), we will directly run reweight_eevdf(), and the V is not changed.
> If curr is false, we will have __dequeue_entity() -> reweight_eevdf() -> __enqueue_entity(). The V is finally changed due to dequeue and enqueue. So the result of reweight_entity() will be impacted by "cfs_rq->curr == se", is this expected?
Good catch again! It smells like a bug. Since this @se is still on_rq,
it should be taken into account when calculating avg_vruntime(), but
in fact it isn't, because __dequeue_entity() will have removed its share.
And I seem to have spotted another bug, although not related to this
problem: we actually need to call update_curr() unconditionally if curr
is available, because we need to commit curr's outstanding runtime to
ensure the result of avg_vruntime() is up to date.
Thanks & BR,
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-02-29 14:25 ` Abel Wu
@ 2024-03-01 6:41 ` Tianchen Ding
2024-03-01 8:30 ` Abel Wu
0 siblings, 1 reply; 25+ messages in thread
From: Tianchen Ding @ 2024-03-01 6:41 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 2024/2/29 22:25, Abel Wu wrote:
> Good catch. And to the best of my knowledge, the answer is YES. The
> above Equation in the paper, which is Eq. (20), is based on the
> assumption that:
>
> "once client 3 leaves, the remaining two clients will
> proportionally support the eventual loss or gain in the
> service time" -- Page 10
>
> "by updating the virtual time according to Eq. (18,19) we
> ensure that the sum over the lags of all active clients
> is always zero" -- Page 11
>
> But in Peter's implementation, it is the competitors in the new group
> that client 3 later joins in who actually support the effect. So when
> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
> is no longer zero.
>
I have a different opinion. According to the comments above
avg_vruntime_add(), V is calculated exactly so that sum(lag_i) = 0;
this is guaranteed by the math.
Actually I printed some logs in enqueue_entity() and dequeue_entity() to verify this:
[ 293.261236] before dequeue: V=2525278131 W=3072 v=2526243139 w=1024 lag_sum=0
[ 293.261237] after dequeue: V=2524795627 W=2048 v=2526243139 w=1024 lag_sum=0
[ 293.262286] before enqueue: V=2525319064 W=2048 v=2526766576 w=1024 lag_sum=0
[ 293.262287] after enqueue: V=2525801568 W=3072 v=2526766576 w=1024 lag_sum=0
For the first 2 lines, we have 2524795627 = 2525278131 + (2525278131 - 2526243139) * 1024 / 2048.
Which is Eq. (18)
For the last 2 lines, we have 2525801568 = 2525319064 - (2525319064 - 2526766576) * 1024 / 3072.
Which is Eq. (19)
So whether client 3 leaves or joins the competition with !0-lag in Linux, V is handled properly.
> Good catch again! It smells like a bug. Since this @se is still on_rq,
> it should be taken into consideration when calculating avg_runtime(),
> but in fact it isn't because __dequeue_entity() will remove its share.
>
> And I seem to spot another bug, although not relate to this problem,
> that we actually need to call update_curr() unconditionally if curr is
> available, because we need to commit curr's outstanding runtime to
> ensure the result of avg_runtime() is up to date.
>
I've tried recording avg_vruntime before __dequeue_entity() and passing
it to reweight_eevdf(). That fixes the issue: V stays the same during
the whole reweight_entity().
I could send these two bugfix patches (one for this bug and one you
suggested about update_curr). But before doing so, I still want to dig
out the answer to my first question.
Hi Peter, would you please provide any information?
Thanks.
My rough logging code:
(Note: lag_sum may output a negative value, with absolute value less
than W. This is OK because my lag_sum calculation is not fully accurate,
due to the sign flips in avg_vruntime().)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..9306c1bbd472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5260,6 +5260,62 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
static inline bool cfs_bandwidth_used(void);
+static int rbtree_all(const void *key, const struct rb_node *node)
+{
+ return 0;
+}
+
+static s64 get_lag_sum(struct cfs_rq *cfs_rq)
+{
+ u64 avruntime = avg_vruntime(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+ struct rb_node *node;
+ s64 lag_sum = 0;
+
+ rb_for_each(node, 0, &cfs_rq->tasks_timeline.rb_root, rbtree_all) {
+ struct sched_entity *se = __node_2_se(node);
+
+ if (se->on_rq)
+ lag_sum += (avruntime - se->vruntime) * scale_load_down(se->load.weight);
+ }
+
+ if (curr && curr->on_rq) {
+ lag_sum += (avruntime - curr->vruntime) * scale_load_down(curr->load.weight);
+ }
+
+ return lag_sum;
+}
+
+static void print_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se, bool before, bool enqueue)
+{
+ if (cpu_of(rq_of(cfs_rq)))
+ return; // avoid too many printings.
+
+ long load = cfs_rq->avg_load;
+ struct sched_entity *curr = cfs_rq->curr;
+ unsigned long weight = scale_load_down(se->load.weight);
+ s64 lag_sum = get_lag_sum(cfs_rq);
+ u64 avruntime = avg_vruntime(cfs_rq);
+
+ if (curr && curr->on_rq) {
+ unsigned long curr_weight = scale_load_down(curr->load.weight);
+ load += curr_weight;
+ }
+
+ if (before) {
+ if (enqueue)
+ printk("before enqueue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ else
+ printk("before dequeue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ }
+ else {
+ if (enqueue)
+ printk("after enqueue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ else
+ printk("after dequeue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ }
+}
+
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5307,9 +5363,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
check_schedstat_required();
update_stats_enqueue_fair(cfs_rq, se, flags);
+ print_eevdf(cfs_rq, se, true, true);
if (!curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;
+ print_eevdf(cfs_rq, se, false, true);
if (cfs_rq->nr_running == 1) {
check_enqueue_throttle(cfs_rq);
@@ -5347,6 +5405,7 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5377,9 +5436,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
clear_buddies(cfs_rq, se);
update_entity_lag(cfs_rq, se);
+ print_eevdf(cfs_rq, se, true, false);
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
+ print_eevdf(cfs_rq, se, false, false);
account_entity_dequeue(cfs_rq, se);
/* return excess runtime on last dequeue */
* Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-03-01 6:41 ` Tianchen Ding
@ 2024-03-01 8:30 ` Abel Wu
2024-03-01 10:04 ` Tianchen Ding
0 siblings, 1 reply; 25+ messages in thread
From: Abel Wu @ 2024-03-01 8:30 UTC (permalink / raw)
To: Tianchen Ding
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 3/1/24 2:41 PM, Tianchen Ding Wrote:
> On 2024/2/29 22:25, Abel Wu wrote:
>> Good catch. And to the best of my knowledge, the answer is YES. The
>> above Equation in the paper, which is Eq. (20), is based on the
>> assumption that:
>>
>> "once client 3 leaves, the remaining two clients will
>> proportionally support the eventual loss or gain in the
>> service time" -- Page 10
>>
>> "by updating the virtual time according to Eq. (18,19) we
>> ensure that the sum over the lags of all active clients
>> is always zero" -- Page 11
>>
>> But in Peter's implementation, it is the competitors in the new group
>> that client 3 later joins in who actually support the effect. So when
>> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
>> is no longer zero.
>>
>
> I've different opinions. According to the comments above avg_vruntime_add(), V
> is calculated exactly to satisfy sum(lag_i)=0. This is guaranteed by math.
Yes, you are right. I mixed another fairness issue up with this. What
I was thinking is that, considering multiple competition groups (e.g.
runqueues), the latency bound could be violated, that is, someone
could starve a bit: an entity, even with positive lag, could become
less competitive if migrated to a more competitive group.
Staring at Eq. (20) again, what if we do a fake reweight, i.e. let
the client leave and rejoin at the same time without changing weight?
IMHO it should have no effect, but according to Eq. (20) V will
change to:
V' = V + lag(j)/(W - w_j) - lag(j)/W != V
Have I missed anything?
>
> Actually I print some logs in enqueue_entity() and dequeue_entity() to verify this:
>
> [ 293.261236] before dequeue: V=2525278131 W=3072 v=2526243139 w=1024 lag_sum=0
> [ 293.261237] after dequeue: V=2524795627 W=2048 v=2526243139 w=1024 lag_sum=0
> [ 293.262286] before enqueue: V=2525319064 W=2048 v=2526766576 w=1024 lag_sum=0
> [ 293.262287] after enqueue: V=2525801568 W=3072 v=2526766576 w=1024 lag_sum=0
>
> For the first 2 lines, we have 2524795627 = 2525278131 + (2525278131 - 2526243139) * 1024 / 2048.
> Which is Eq. (18)
>
> For the last 2 lines, we have 2525801568 = 2525319064 - (2525319064 - 2526766576) * 1024 / 3072.
> Which is Eq. (19)
>
> So whatever client 3 leave or join competition with !0-lag in Linux, V is handled properly.
>
>> Good catch again! It smells like a bug. Since this @se is still on_rq,
>> it should be taken into consideration when calculating avg_runtime(),
>> but in fact it isn't because __dequeue_entity() will remove its share.
>>
>> And I seem to spot another bug, although not relate to this problem,
>> that we actually need to call update_curr() unconditionally if curr is
>> available, because we need to commit curr's outstanding runtime to
>> ensure the result of avg_runtime() is up to date.
>>
>
> I've tried to record avg_vruntime before __dequeue_entity() and pass it to
> reweight_eevdf(). Then the issue is fixed. The V keeps the same during the whole
> reweight_entity().
>
> I could send these two bugfix patches (one for this bug and one you sugguested
That would be appreciated!
Thanks,
Abel
* Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight
2024-03-01 8:30 ` Abel Wu
@ 2024-03-01 10:04 ` Tianchen Ding
0 siblings, 0 replies; 25+ messages in thread
From: Tianchen Ding @ 2024-03-01 10:04 UTC (permalink / raw)
To: Abel Wu
Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Valentin Schneider,
Barry Song, Benjamin Segall, Chen Yu, Daniel Jordan,
Gautham R . Shenoy, Joel Fernandes, K Prateek Nayak,
Mike Galbraith, Qais Yousef, Tim Chen, Yicong Yang, Youssef Esmat,
linux-kernel
On 2024/3/1 16:30, Abel Wu wrote:
> On 3/1/24 2:41 PM, Tianchen Ding Wrote:
>> On 2024/2/29 22:25, Abel Wu wrote:
>>> Good catch. And to the best of my knowledge, the answer is YES. The
>>> above Equation in the paper, which is Eq. (20), is based on the
>>> assumption that:
>>>
>>> "once client 3 leaves, the remaining two clients will
>>> proportionally support the eventual loss or gain in the
>>> service time" -- Page 10
>>>
>>> "by updating the virtual time according to Eq. (18,19) we
>>> ensure that the sum over the lags of all active clients
>>> is always zero" -- Page 11
>>>
>>> But in Peter's implementation, it is the competitors in the new group
>>> that client 3 later joins in who actually support the effect. So when
>>> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
>>> is no longer zero.
>>>
>>
>> I've different opinions. According to the comments above avg_vruntime_add(), V
>> is calculated exactly to satisfy sum(lag_i)=0. This is guaranteed by math.
>
> Yes, you are right. I mixed another fairness issue with this. What I
> was thinking is that considering multiple competition groups (e.g.
> runqueues), the latency bound could be violated, that is someone could
> starve a bit. Say one entity even with positive lag could become less
> competitive if migrated to a higher competitive group.
>
> Staring at Eq. (20) again, what if we do a fake reweight? I mean let
> the client leave and rejoin at the same time without changing weight?
> IMHO it should have no effects, but according to Eq. (20) the V will
> change to:
>
> V' = V + lag(j)/(W - w_j) - lag(j)/W != V
>
> Have I missed anything?
>
Good point! I had never noticed this conflict.
I tried modifying reweight_entity() to run dequeue_entity() -> adjust
se->vlag -> enqueue_entity(), and I found V does not change.
The difference is that, when doing enqueue_entity(), Peter enlarges the
lag in place_entity(), because part of the lag would otherwise evaporate
after the enqueue.
In order to keep the same lag after enqueue, during place_entity() the
new lag(i) is enlarged by (W + w_i)/W.
So Eq. (20) should be:
V' = V + lag(j)/(W - w_j) - lag'(j)/(W - w_j + w'_j)
lag'(j) = lag(j) * (W - w_j + w'_j)/(W - w_j)
and thus
V' = V + lag(j)/(W - w_j) - lag(j) * (W - w_j + w'_j)/(W - w_j)/(W - w_j + w'_j) = V
So COROLLARY #2 is correct.
end of thread, other threads: [~2024-03-01 10:04 UTC | newest]
Thread overview: 25+ messages
2023-11-07 9:05 [PATCH 0/4] sched/eevdf: Optimize reweight and pick Abel Wu
2023-11-07 9:05 ` [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight Abel Wu
2023-11-07 9:52 ` Peter Zijlstra
2023-11-14 21:57 ` [tip: sched/urgent] " tip-bot2 for Abel Wu
2023-11-15 15:36 ` [PATCH 1/4] " Yiwei Lin
2023-11-16 4:48 ` Abel Wu
2023-11-16 4:59 ` Yiwei Lin
2023-11-16 9:01 ` Abel Wu
2023-11-16 5:07 ` Abel Wu
2023-11-16 6:51 ` Yiwei Lin
2023-11-16 7:11 ` Abel Wu
2024-02-29 9:24 ` Tianchen Ding
2024-02-29 14:25 ` Abel Wu
2024-03-01 6:41 ` Tianchen Ding
2024-03-01 8:30 ` Abel Wu
2024-03-01 10:04 ` Tianchen Ding
2023-11-07 9:05 ` [PATCH 2/4] sched/eevdf: Sort the rbtree by virtual deadline Abel Wu
2023-11-07 11:03 ` Peter Zijlstra
2023-11-07 11:06 ` Abel Wu
2023-11-07 23:26 ` Benjamin Segall
2023-11-08 8:51 ` Abel Wu
2023-11-07 9:05 ` [PATCH 3/4] sched/eevdf: O(1) fastpath for task selection Abel Wu
2023-11-07 10:12 ` Abel Wu
2023-11-07 10:42 ` Peter Zijlstra
2023-11-07 9:05 ` [PATCH 4/4] sched/stats: branch statistics for pick_eevdf Abel Wu