* [RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance
@ 2021-02-25 16:27 Michal Koutný
2021-02-25 17:44 ` kernel test robot
2021-03-08 8:18 ` Dietmar Eggemann
0 siblings, 2 replies; 3+ messages in thread
From: Michal Koutný @ 2021-02-25 16:27 UTC (permalink / raw)
To: Vincent Guittot
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, linux-kernel
leaf_cfs_rq_list should contain cfs_rqs that have runnable entities in
them. When we're operating under a throttled hierarchy we always update
the leaf_cfs_rq_list in order no to break list_add_leaf_cfs_rq invariant
of adding complete branches.
This caused troubles when an entity became runnable (enqueue_entity)
under a throttled hierarchy (see commit b34cb07dde7c ("sched/fair: Fix
enqueue_task_fair() warning some more")). While it proved well, this
new change ignores updating leaf_cfs_rq_list when we're operating under
the throttled hierarchy and defers the leaf_cfs_rq_list update to the
point when whole hiearchy is unthrottled (tg_unthrottle_up).
The code is now simpler and leaf_cfs_rq_list contains only cfs_rqs that
are truly runnable.
Why is this RFC?
- Primarily, I'm not sure I interpreted the purpose of leaf_cfs_rq_list
right. The removal of throttled cfs_rqs from it would exclude them
from __update_blocked_fair() calculation and I can't see past it now.
If it's wrong assumption, I'd like this to help clarify what the
proper definition of leaf_cfs_rq_list would be.
- Additionally, I didn't check thoroughly for corner cases when
se->on_rq => cfs_rq_of(se)->on_list wouldn't hold, so the patch
certainly isn't finished.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
kernel/sched/fair.c | 41 ++++++-----------------------------------
kernel/sched/sched.h | 4 +---
2 files changed, 7 insertions(+), 38 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..634ba6637824 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4250,10 +4250,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
/*
* When bandwidth control is enabled, cfs might have been removed
- * because of a parent been throttled but cfs->nr_running > 1. Try to
- * add it unconditionnally.
+ * because of a parent been throttled. We'll add it later (with
+ * complete branch up to se->on_rq/cfs_eq->on_list) in
+ * tg_unthrottle_up() and unthrottle_cfs_rq().
*/
- if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
+ if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
if (cfs_rq->nr_running == 1)
@@ -4859,6 +4860,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
for_each_sched_entity(se) {
if (se->on_rq)
break;
+ // XXX: se->on_rq implies cfs_rq_of(se)->on_list (unless throttled_hierarchy)
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
@@ -4896,17 +4898,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
add_nr_running(rq, task_delta);
unthrottle_throttle:
- /*
- * The cfs_rq_throttled() breaks in the above iteration can result in
- * incomplete leaf list maintenance, resulting in triggering the
- * assertion below.
- */
- for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
-
- if (list_add_leaf_cfs_rq(cfs_rq))
- break;
- }
assert_list_leaf_cfs_rq(rq);
@@ -5518,6 +5509,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
for_each_sched_entity(se) {
if (se->on_rq)
break;
+ // XXX: se->on_rq implies cfs_rq_of(se)->on_list (unless throttled_hierarchy)
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
@@ -5544,13 +5536,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;
-
- /*
- * One parent has been throttled and cfs_rq removed from the
- * list. Add it back to not break the leaf list.
- */
- if (throttled_hierarchy(cfs_rq))
- list_add_leaf_cfs_rq(cfs_rq);
}
/* At this point se is NULL and we are at root level*/
@@ -5574,20 +5559,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_overutilized_status(rq);
enqueue_throttle:
- if (cfs_bandwidth_used()) {
- /*
- * When bandwidth control is enabled; the cfs_rq_throttled()
- * breaks in the above iteration can result in incomplete
- * leaf list maintenance, resulting in triggering the assertion
- * below.
- */
- for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
-
- if (list_add_leaf_cfs_rq(cfs_rq))
- break;
- }
- }
assert_list_leaf_cfs_rq(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb09988451a0..f674d88920da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -573,9 +573,7 @@ struct cfs_rq {
struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
/*
- * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
- * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
- * (like users, containers etc.)
+ * leaf cfs_rqs are those that hold runnable entities.
*
* leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
* This list is used during load balance.
--
2.30.1
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance
2021-02-25 16:27 [RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance Michal Koutný
@ 2021-02-25 17:44 ` kernel test robot
2021-03-08 8:18 ` Dietmar Eggemann
1 sibling, 0 replies; 3+ messages in thread
From: kernel test robot @ 2021-02-25 17:44 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 8905 bytes --]
Hi "Michal,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on tip/sched/core]
[also build test ERROR on v5.11 next-20210225]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Michal-Koutn/sched-Simplify-leaf_cfs_rq_list-maintenance/20210226-003049
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git c5e6fc08feb2b88dc5dac2f3c817e1c2a4cafda4
config: sparc64-randconfig-s032-20210225 (attached as .config)
compiler: sparc64-linux-gcc (GCC) 9.3.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.3-229-g60c1f270-dirty
# https://github.com/0day-ci/linux/commit/e566cc27ca5c69d5199dd3b9ad07dfc4bdc35eac
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Michal-Koutn/sched-Simplify-leaf_cfs_rq_list-maintenance/20210226-003049
git checkout e566cc27ca5c69d5199dd3b9ad07dfc4bdc35eac
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=sparc64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
In file included from arch/sparc/include/asm/bug.h:6,
from include/linux/bug.h:5,
from include/linux/thread_info.h:12,
from arch/sparc/include/asm/current.h:15,
from include/linux/sched.h:12,
from kernel/sched/sched.h:5,
from kernel/sched/fair.c:23:
kernel/sched/fair.c: In function 'enqueue_entity':
>> kernel/sched/fair.c:4250:34: error: implicit declaration of function 'throttled_hierarchy' [-Werror=implicit-function-declaration]
4250 | if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
| ^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
58 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
| ^~~~
kernel/sched/fair.c:4250:2: note: in expansion of macro 'if'
4250 | if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
| ^~
kernel/sched/fair.c: At top level:
>> kernel/sched/fair.c:5361:19: error: static declaration of 'throttled_hierarchy' follows non-static declaration
5361 | static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
| ^~~~~~~~~~~~~~~~~~~
In file included from arch/sparc/include/asm/bug.h:6,
from include/linux/bug.h:5,
from include/linux/thread_info.h:12,
from arch/sparc/include/asm/current.h:15,
from include/linux/sched.h:12,
from kernel/sched/sched.h:5,
from kernel/sched/fair.c:23:
kernel/sched/fair.c:4250:34: note: previous implicit declaration of 'throttled_hierarchy' was here
4250 | if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
| ^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
58 | #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
| ^~~~
kernel/sched/fair.c:4250:2: note: in expansion of macro 'if'
4250 | if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
| ^~
kernel/sched/fair.c:5372:6: warning: no previous prototype for 'init_cfs_bandwidth' [-Wmissing-prototypes]
5372 | void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
| ^~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:11160:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
11160 | void free_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:11162:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
11162 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:11167:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
11167 | void online_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:11169:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
11169 | void unregister_fair_sched_group(struct task_group *tg) { }
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for FRAME_POINTER
Depends on DEBUG_KERNEL && (M68K || UML || SUPERH) || ARCH_WANT_FRAME_POINTERS || MCOUNT
Selected by
- LOCKDEP && DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT && !MIPS && !PPC && !ARM && !S390 && !MICROBLAZE && !ARC && !X86
vim +/throttled_hierarchy +4250 kernel/sched/fair.c
4166
4167 /*
4168 * MIGRATION
4169 *
4170 * dequeue
4171 * update_curr()
4172 * update_min_vruntime()
4173 * vruntime -= min_vruntime
4174 *
4175 * enqueue
4176 * update_curr()
4177 * update_min_vruntime()
4178 * vruntime += min_vruntime
4179 *
4180 * this way the vruntime transition between RQs is done when both
4181 * min_vruntime are up-to-date.
4182 *
4183 * WAKEUP (remote)
4184 *
4185 * ->migrate_task_rq_fair() (p->state == TASK_WAKING)
4186 * vruntime -= min_vruntime
4187 *
4188 * enqueue
4189 * update_curr()
4190 * update_min_vruntime()
4191 * vruntime += min_vruntime
4192 *
4193 * this way we don't have the most up-to-date min_vruntime on the originating
4194 * CPU and an up-to-date min_vruntime on the destination CPU.
4195 */
4196
4197 static void
4198 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
4199 {
4200 bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
4201 bool curr = cfs_rq->curr == se;
4202
4203 /*
4204 * If we're the current task, we must renormalise before calling
4205 * update_curr().
4206 */
4207 if (renorm && curr)
4208 se->vruntime += cfs_rq->min_vruntime;
4209
4210 update_curr(cfs_rq);
4211
4212 /*
4213 * Otherwise, renormalise after, such that we're placed at the current
4214 * moment in time, instead of some random moment in the past. Being
4215 * placed in the past could significantly boost this task to the
4216 * fairness detriment of existing tasks.
4217 */
4218 if (renorm && !curr)
4219 se->vruntime += cfs_rq->min_vruntime;
4220
4221 /*
4222 * When enqueuing a sched_entity, we must:
4223 * - Update loads to have both entity and cfs_rq synced with now.
4224 * - Add its load to cfs_rq->runnable_avg
4225 * - For group_entity, update its weight to reflect the new share of
4226 * its group cfs_rq
4227 * - Add its new weight to cfs_rq->load.weight
4228 */
4229 update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
4230 se_update_runnable(se);
4231 update_cfs_group(se);
4232 account_entity_enqueue(cfs_rq, se);
4233
4234 if (flags & ENQUEUE_WAKEUP)
4235 place_entity(cfs_rq, se, 0);
4236
4237 check_schedstat_required();
4238 update_stats_enqueue(cfs_rq, se, flags);
4239 check_spread(cfs_rq, se);
4240 if (!curr)
4241 __enqueue_entity(cfs_rq, se);
4242 se->on_rq = 1;
4243
4244 /*
4245 * When bandwidth control is enabled, cfs might have been removed
4246 * because of a parent been throttled. We'll add it later (with
4247 * complete branch up to se->on_rq/cfs_eq->on_list) in
4248 * tg_unthrottle_up() and unthrottle_cfs_rq().
4249 */
> 4250 if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
4251 list_add_leaf_cfs_rq(cfs_rq);
4252
4253 if (cfs_rq->nr_running == 1)
4254 check_enqueue_throttle(cfs_rq);
4255 }
4256
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 23663 bytes --]
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance
2021-02-25 16:27 [RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance Michal Koutný
2021-02-25 17:44 ` kernel test robot
@ 2021-03-08 8:18 ` Dietmar Eggemann
1 sibling, 0 replies; 3+ messages in thread
From: Dietmar Eggemann @ 2021-03-08 8:18 UTC (permalink / raw)
To: Michal Koutný, Vincent Guittot
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, linux-kernel
Hi Michal,
On 25/02/2021 17:27, Michal Koutný wrote:
> leaf_cfs_rq_list should contain cfs_rqs that have runnable entities in
> them. When we're operating under a throttled hierarchy we always update
> the leaf_cfs_rq_list in order no to break list_add_leaf_cfs_rq invariant
> of adding complete branches.
So this patch replaces in enqueue_entity() the unconditional
list_add_leaf_cfs_rq(cfs_rq) in case cfs->nr_running > 1
(parent had been throttled)
- if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
^^^^^^^^^^^^^^^^^^^^^^^
with
+ if (cfs_rq->nr_running == 1 && !throttled_hierarchy(cfs_rq))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and removes the leaf_cfs_rq_list maintenance code after the
xxx_throttle label in enqueue_task_fair() and unthrottle_cfs_rq
from commit f6783319737f ("sched/fair: Fix insertion in
rq->leaf_cfs_rq_list") and fe61468b2cb ("sched/fair: Fix
enqueue_task_fair warning").
> This caused troubles when an entity became runnable (enqueue_entity)
> under a throttled hierarchy (see commit b34cb07dde7c ("sched/fair: Fix
> enqueue_task_fair() warning some more")). While it proved well, this
> new change ignores updating leaf_cfs_rq_list when we're operating under
> the throttled hierarchy and defers the leaf_cfs_rq_list update to the
> point when whole hiearchy is unthrottled (tg_unthrottle_up).
IMHO, f6783319737f gives the explanation why throttled cfs_rq's have to
be added to rq->leaf_cfs_rq_list.
IIRC, fe61468b2cb was fixing a use case in which a cfs-rq with
on_list=0 and nr_running > 1 within the cgroup hierarchy wasn't
added back to rq->leaf_cfs_rq_list:
https://lkml.kernel.org/r/15252de5-9a2d-19ae-607a-594ee88d1ba1@de.ibm.com
In enqueue_task_fair() just before the assert_list_leaf_cfs_rq(rq),
Iterate through the se heriarchy of p=[CPU 2/KVM 2662] in case the
assert will hit:
...
CPU23 path=/machine.slice/machine-test.slice/machine-qemu\x2d18\x2dtest10.scope/vcpuX on_list=1 nr_running=1 throttled=0 p=[CPU 2/KVM 2662]
CPU23 path=/machine.slice/machine-test.slice/machine-qemu\x2d18\x2dtest10.scope on_list=0 nr_running=3 throttled=0 p=[CPU 2/KVM 2662]
^^^^^^^^^ ^^^^^^^^^^^^
CPU23 path=/machine.slice/machine-test.slice on_list=1 nr_running=1 throttled=1 p=[CPU 2/KVM 2662]
CPU23 path=/machine.slice on_list=1 nr_running=0 throttled=0 p=[CPU 2/KVM 2662]
CPU23 path=/ on_list=1 nr_running=1 throttled=0 p=[CPU 2/KVM 2662]
...
> The code is now simpler and leaf_cfs_rq_list contains only cfs_rqs that
> are truly runnable.
>
> Why is this RFC?
> - Primarily, I'm not sure I interpreted the purpose of leaf_cfs_rq_list
> right. The removal of throttled cfs_rqs from it would exclude them
> from __update_blocked_fair() calculation and I can't see past it now.
The leaf_cfs_rq_list should contain all cfs_rqs with
cfs_rq->avg.load/runnable/util_avg != 0 so that in case there are no
runnable entities on them anymore this (blocked) load
cfs_rq->avg.xxx_avg can be decayed. IMHO. the "leaf_" is misleading
here since it can also contain non-leaf cfs_rqs.
> If it's wrong assumption, I'd like this to help clarify what the
> proper definition of leaf_cfs_rq_list would be.
> - Additionally, I didn't check thoroughly for corner cases when
> se->on_rq => cfs_rq_of(se)->on_list wouldn't hold, so the patch
> certainly isn't finished.
[...]
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2021-03-08 8:18 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-02-25 16:27 [RFC PATCH] sched: Simplify leaf_cfs_rq_list maintenance Michal Koutný
2021-02-25 17:44 ` kernel test robot
2021-03-08 8:18 ` Dietmar Eggemann
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.