* [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice
@ 2024-10-28 6:33 Tianchen Ding
2024-10-28 6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
2024-10-28 6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
0 siblings, 2 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-28 6:33 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo
The 1st patch is a minor fix for the existing cgroup propagation.
The 2nd patch is the main part and the RFC. If the design is OK, I'll send
another patch later with documentation for the new cgroup interface.
Thanks.
Tianchen Ding (2):
sched/eevdf: Force propagating min_slice of cfs_rq when a task
changing slice
sched/eevdf: Introduce a cgroup interface for slice
kernel/sched/core.c | 34 +++++++++++++++++++++++++++++
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 3 +++
3 files changed, 83 insertions(+), 5 deletions(-)
--
2.39.3
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-10-28 6:33 [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
@ 2024-10-28 6:33 ` Tianchen Ding
2024-10-30 8:18 ` kernel test robot
2024-10-31 9:48 ` [PATCH v2] " Tianchen Ding
2024-10-28 6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
1 sibling, 2 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-28 6:33 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo
When a task changes its slice and its cgroup se is already on_rq, the cgroup
se will not be enqueued again, and hence root->min_slice remains unchanged.
Force propagating it when the se doesn't need to be enqueued (or dequeued) to
ensure the se hierarchy always gets the latest min_slice.
Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6512258dc71f..7dc90a6e6e26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7017,6 +7017,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
se->slice = slice;
+ min_vruntime_cb_propagate(&se->run_node, NULL);
slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running++;
@@ -7141,6 +7142,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
update_cfs_group(se);
se->slice = slice;
+ min_vruntime_cb_propagate(&se->run_node, NULL);
slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running -= h_nr_running;
--
2.39.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-28 6:33 [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
2024-10-28 6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
@ 2024-10-28 6:33 ` Tianchen Ding
2024-10-28 17:37 ` Tejun Heo
` (2 more replies)
1 sibling, 3 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-28 6:33 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo
Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
according to their name styles. The unit is always microseconds.
A cgroup with shorter slice can preempt others more easily. This could be
useful in container scenarios.
By default, cpu.fair_slice is 0, which means the slice of se is
calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
will overwrite se->slice with the customized value.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
CC Tejun, do we need (and reuse) this slice interface for sched_ext?
---
kernel/sched/core.c | 34 ++++++++++++++++++++++++++++++
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 3 +++
3 files changed, 81 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 114adac5a9c8..8d57b7d88d18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9690,6 +9690,24 @@ static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
}
#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_fair_slice_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ u64 fair_slice_us = css_tg(css)->slice;
+
+ do_div(fair_slice_us, NSEC_PER_USEC);
+
+ return fair_slice_us;
+}
+
+static int cpu_fair_slice_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 fair_slice_us)
+{
+ return sched_group_set_slice(css_tg(css), fair_slice_us);
+}
+#endif
+
static struct cftype cpu_legacy_files[] = {
#ifdef CONFIG_GROUP_SCHED_WEIGHT
{
@@ -9703,6 +9721,14 @@ static struct cftype cpu_legacy_files[] = {
.write_s64 = cpu_idle_write_s64,
},
#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "fair_slice_us",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_fair_slice_read_u64,
+ .write_u64 = cpu_fair_slice_write_u64,
+ },
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "cfs_quota_us",
@@ -9943,6 +9969,14 @@ static struct cftype cpu_files[] = {
.write_s64 = cpu_idle_write_s64,
},
#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "fair_slice",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_fair_slice_read_u64,
+ .write_u64 = cpu_fair_slice_write_u64,
+ },
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "max",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7dc90a6e6e26..694dc0655719 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -797,6 +797,11 @@ static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
return min_slice;
}
+static inline u64 cfs_rq_slice(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->tg->slice ? : cfs_rq_min_slice(cfs_rq);
+}
+
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
{
return entity_before(__node_2_se(a), __node_2_se(b));
@@ -6994,7 +6999,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se->custom_slice = 1;
}
enqueue_entity(cfs_rq, se, flags);
- slice = cfs_rq_min_slice(cfs_rq);
+ slice = cfs_rq_slice(cfs_rq);
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -7018,7 +7023,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se->slice = slice;
min_vruntime_cb_propagate(&se->run_node, NULL);
- slice = cfs_rq_min_slice(cfs_rq);
+ slice = cfs_rq_slice(cfs_rq);
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -7093,7 +7098,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
idle_h_nr_running = task_has_idle_policy(p);
} else {
cfs_rq = group_cfs_rq(se);
- slice = cfs_rq_min_slice(cfs_rq);
+ slice = cfs_rq_slice(cfs_rq);
}
for_each_sched_entity(se) {
@@ -7118,7 +7123,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
- slice = cfs_rq_min_slice(cfs_rq);
+ slice = cfs_rq_slice(cfs_rq);
/* Avoid re-evaluating load for this entity: */
se = parent_entity(se);
@@ -7143,7 +7148,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
se->slice = slice;
min_vruntime_cb_propagate(&se->run_node, NULL);
- slice = cfs_rq_min_slice(cfs_rq);
+ slice = cfs_rq_slice(cfs_rq);
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
@@ -13535,6 +13540,40 @@ int sched_group_set_idle(struct task_group *tg, long idle)
return 0;
}
+int sched_group_set_slice(struct task_group *tg, u64 fair_slice_us)
+{
+ u64 slice = 0;
+ int i;
+
+ if (fair_slice_us > U64_MAX / NSEC_PER_USEC)
+ return -EINVAL;
+
+ if (fair_slice_us) {
+ slice = clamp_t(u64, fair_slice_us * NSEC_PER_USEC,
+ NSEC_PER_MSEC / 10, /* HZ = 1000 * 10 */
+ NSEC_PER_MSEC * 100); /* HZ = 100 / 10 */
+ }
+
+ if (slice == tg->slice)
+ return 0;
+
+ tg->slice = slice;
+
+ for_each_possible_cpu(i) {
+ struct sched_entity *se = tg->se[i];
+ struct rq *rq = cpu_rq(i);
+
+ guard(rq_lock_irqsave)(rq);
+ for_each_sched_entity(se) {
+ se->custom_slice = 1;
+ se->slice = cfs_rq_slice(group_cfs_rq(se));
+ min_vruntime_cb_propagate(&se->run_node, NULL);
+ }
+ }
+
+ return 0;
+}
+
#endif /* CONFIG_FAIR_GROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b139016cbd9..e02f8715bc04 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -443,6 +443,7 @@ struct task_group {
/* runqueue "owned" by this group on each CPU */
struct cfs_rq **cfs_rq;
unsigned long shares;
+ u64 slice;
#ifdef CONFIG_SMP
/*
* load_avg can be heavily contended at clock tick time, so put
@@ -574,6 +575,8 @@ extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
extern int sched_group_set_idle(struct task_group *tg, long idle);
+extern int sched_group_set_slice(struct task_group *tg, u64 fair_slice_us);
+
#ifdef CONFIG_SMP
extern void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next);
--
2.39.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-28 6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
@ 2024-10-28 17:37 ` Tejun Heo
2024-10-29 2:07 ` Tianchen Ding
[not found] ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
2024-10-30 11:00 ` Peter Zijlstra
2 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-10-28 17:37 UTC (permalink / raw)
To: Tianchen Ding
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider
Hello,
On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
> according to their name styles. The unit is always microseconds.
>
> A cgroup with shorter slice can preempt others more easily. This could be
> useful in container scenarios.
>
> By default, cpu.fair_slice is 0, which means the slice of se is
> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
> will overwrite se->slice with the customized value.
Provided that this tunable is necessary, wouldn't it be more useful to
figure out what per-task interface would look like first? Maybe there are
cases where per-cgroup slice config makes sense but that sounds
significantly less useful than being able to configure it per-task.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-28 17:37 ` Tejun Heo
@ 2024-10-29 2:07 ` Tianchen Ding
2024-10-29 6:18 ` Tejun Heo
0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-10-29 2:07 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider
On 2024/10/29 01:37, Tejun Heo wrote:
> Hello,
>
> On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
>> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
>> according to their name styles. The unit is always microseconds.
>>
>> A cgroup with shorter slice can preempt others more easily. This could be
>> useful in container scenarios.
>>
>> By default, cpu.fair_slice is 0, which means the slice of se is
>> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
>> will overwrite se->slice with the customized value.
>
> Provided that this tunable is necessary, wouldn't it be more useful to
> figure out what per-task interface would look like first? Maybe there are
> cases where per-cgroup slice config makes sense but that sounds
> significantly less useful than being able to configure it per-task.
>
> Thanks.
>
For EEVDF, a per-task interface has been introduced in commit 857b158dc5e8
("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion").
So this patch is trying to introduce a cgroup level interface.
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
[not found] ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
@ 2024-10-29 4:26 ` 解 咏梅
0 siblings, 0 replies; 21+ messages in thread
From: 解 咏梅 @ 2024-10-29 4:26 UTC (permalink / raw)
To: Tianchen Ding, linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo
Changed to plain text to send to linux-kernel@vger.kernel.org.
Sorry :(
________________________________________
From: 解 咏梅 <xieym_ict@hotmail.com>
Sent: October 29, 2024 12:10
To: Tianchen Ding <dtcccc@linux.alibaba.com>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>; Tejun Heo <tj@kernel.org>
Subject: Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
Interesting!
So, the se's position in the RB tree (aka se->deadline) is determined by 2 factors:
- its own total weighted runtime
- the tg's slice if defined, or the queued cfs_rq's slice (the latter considers all ses beneath the cfs_rq, so it's a dynamic slice)
As I understand it, this patch proposes a static slice for the task cgroup. It might be useful in container colocation scenarios.
It's hard to detect the major product process in a rich container scenario, because user applications are varied and diverse.
Regards,
Yongmei.
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-29 2:07 ` Tianchen Ding
@ 2024-10-29 6:18 ` Tejun Heo
2024-10-29 6:49 ` Tianchen Ding
0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-10-29 6:18 UTC (permalink / raw)
To: Tianchen Ding
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider
Hello,
On Tue, Oct 29, 2024 at 10:07:36AM +0800, Tianchen Ding wrote:
....
> For eevdf, per-task interface has been introduced in commit 857b158dc5e8
> ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice
> suggestion")
I see.
> So this patch is trying to introduce a cgroup level interface.
If I'm reading the code correctly, the property can be set per task and is
inherited when forking unless RESET_ON_FORK is set. I'm not sure the cgroup
interface adds all that much:
- There's no inherent hierarchical or grouping behavior. I don't think it
makes sense for cgroup config to override per-thread configs.
- For cgroup-wide config, setting it in the seed process of the cgroup would
suffice in most cases. Changing it afterwards is more awkward but not
hugely so. If racing against forks is a concern, you can either use the
freezer or iterate until no new tasks are seen.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-29 6:18 ` Tejun Heo
@ 2024-10-29 6:49 ` Tianchen Ding
2024-10-29 20:39 ` Tejun Heo
0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-10-29 6:49 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider
On 2024/10/29 14:18, Tejun Heo wrote:
>> So this patch is trying to introduce a cgroup level interface.
>
> If I'm reading the code correctly, the property can be set per task and is
> inherited when forking unless RESET_ON_FORK is set. I'm not sure the cgroup
> interface adds all that much:
>
> - There's no inherent hierarchical or grouping behavior. I don't think it
> makes sense for cgroup config to override per-thread configs.
>
> - For cgroup-wide config, setting it in the seed process of the cgroup would
> suffice in most cases. Changing it afterwards is more awkward but not
> hugely so. If racing against forks is a concern, you can either use the
> freezer or iterate until no new tasks are seen.
>
> Thanks.
>
However, we may want to set and keep different slices for processes inside the
same cgroup.
For example, in a rich container scenario (as Yongmei mentioned), the
administrator can decide the cpu resources of a container: its weight
(cpu.weight), scope (cpuset.cpus), bandwidth (cpu.max), and also the **slice
and preempt priority** (cpu.fair_slice in this patch).
At the same time, the user may want to manage his own processes inside the
container. He may want to set a customized value (sched_attr::sched_runtime)
for each process, and the administrator should not overwrite the user's own
config.
So cpu.fair_slice is for preemption competition across cgroups at the same
level, while sched_attr::sched_runtime can be used for processes inside the
same cgroup. (a bit like cpu.weight vs task nice)
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-29 6:49 ` Tianchen Ding
@ 2024-10-29 20:39 ` Tejun Heo
0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2024-10-29 20:39 UTC (permalink / raw)
To: Tianchen Ding
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider
Hello,
On Tue, Oct 29, 2024 at 02:49:51PM +0800, Tianchen Ding wrote:
...
> At the same time, the user may want to decide his processes inside the
> container. He may want to set customized value (sched_attr::sched_runtime)
> for each process, and administrator should not overwrite the user's own
> config.
>
> So cpu.fair_slice is for preempt competition across cgroups in the samle
> level, while sched_attr::sched_runtime can be used for processes inside the
> same cgroup. (a bit like cpu.weight vs task NICE)
I see. It's setting the slice for the task_groups. I'm not sure how much we
want to codify the current recursive behavior into a fixed interface. Besides,
it's not sustainable to keep adding scheduler tunables to the cgroup interface.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-10-28 6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
@ 2024-10-30 8:18 ` kernel test robot
2024-10-30 9:11 ` Tianchen Ding
2024-10-31 9:48 ` [PATCH v2] " Tianchen Ding
1 sibling, 1 reply; 21+ messages in thread
From: kernel test robot @ 2024-10-30 8:18 UTC (permalink / raw)
To: Tianchen Ding
Cc: oe-lkp, lkp, linux-kernel, aubrey.li, yu.c.chen, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tejun Heo, oliver.sang
Hello,
kernel test robot noticed "BUG:KASAN:slab-use-after-free_in_enqueue_task_fair" on:
commit: e9b718a38463470cc388aaa3ff50f12bbe8c4279 ("[PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice")
url: https://github.com/intel-lab-lkp/linux/commits/Tianchen-Ding/sched-eevdf-Force-propagating-min_slice-of-cfs_rq-when-a-task-changing-slice/20241028-143410
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d1fb8a78b2ff1fe4e9478c75b4fbec588a73c1b0
patch link: https://lore.kernel.org/all/20241028063313.8039-2-dtcccc@linux.alibaba.com/
patch subject: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
in testcase: trinity
version: trinity-x86_64-ba2360ed-1_20240923
with following parameters:
runtime: 600s
config: x86_64-randconfig-014-20241028
compiler: clang-19
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+------------------------------------------------------------------------------+------------+------------+
| | d1fb8a78b2 | e9b718a384 |
+------------------------------------------------------------------------------+------------+------------+
| BUG:KASAN:slab-use-after-free_in_enqueue_task_fair | 0 | 4 |
| BUG:KASAN:slab-use-after-free_in_dequeue_entities | 0 | 1 |
+------------------------------------------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202410301535.14e0855c-lkp@intel.com
[ 117.822447][ T468] BUG: KASAN: slab-use-after-free in enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
[ 117.825270][ T468] Read of size 8 at addr ffff8881678c1c30 by task trinity-main/468
[ 117.826451][ T468]
[ 117.826941][ T468] CPU: 0 UID: 65534 PID: 468 Comm: trinity-main Not tainted 6.12.0-rc4-00025-ge9b718a38463 #1
[ 117.828330][ T468] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 117.829779][ T468] Call Trace:
[ 117.830339][ T468] <TASK>
[ 117.830865][ T468] dump_stack_lvl (lib/dump_stack.c:122)
[ 117.831554][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
[ 117.832327][ T468] print_report (mm/kasan/report.c:378)
[ 117.833021][ T468] ? __virt_addr_valid (include/linux/rcupdate.h:337 include/linux/rcupdate.h:941 include/linux/mmzone.h:2043 arch/x86/mm/physaddr.c:65)
[ 117.833768][ T468] ? __virt_addr_valid (arch/x86/include/asm/preempt.h:103 include/linux/rcupdate.h:964 include/linux/mmzone.h:2053 arch/x86/mm/physaddr.c:65)
[ 117.834510][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
[ 117.835304][ T468] ? kasan_complete_mode_report_info (mm/kasan/report_generic.c:179)
[ 117.836192][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
[ 117.836990][ T468] kasan_report (mm/kasan/report.c:603)
[ 117.837670][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
[ 117.838449][ T468] __asan_report_load8_noabort (mm/kasan/report_generic.c:381)
[ 117.839276][ T468] enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
[ 117.840026][ T468] enqueue_task (kernel/sched/core.c:2027)
[ 117.840701][ T468] activate_task (kernel/sched/core.c:2069)
[ 117.841383][ T468] wake_up_new_task (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/sched.h:185 kernel/sched/core.c:4829)
[ 117.842104][ T468] kernel_clone (kernel/fork.c:2818)
[ 117.842786][ T468] __x64_sys_clone (kernel/fork.c:2927)
[ 117.843490][ T468] x64_sys_call (kbuild/obj/consumer/x86_64-randconfig-014-20241028/./arch/x86/include/generated/asm/syscalls_64.h:161)
[ 117.844198][ T468] do_syscall_64 (arch/x86/entry/common.c:?)
[ 117.844861][ T468] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:467)
[ 117.845742][ T468] ? syscall_exit_to_user_mode (kernel/entry/common.c:221)
[ 117.846571][ T468] ? do_syscall_64 (arch/x86/entry/common.c:102)
[ 117.848374][ T468] ? irqentry_exit_to_user_mode (kernel/entry/common.c:234)
[ 117.849183][ T468] ? irqentry_exit (kernel/entry/common.c:367)
[ 117.849897][ T468] ? exc_page_fault (arch/x86/mm/fault.c:1543)
[ 117.850589][ T468] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 117.851487][ T468] RIP: 0033:0x7f97dcccc293
[ 117.852155][ T468] Code: 00 00 00 00 00 66 90 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00
All code
========
0: 00 00 add %al,(%rax)
2: 00 00 add %al,(%rax)
4: 00 66 90 add %ah,-0x70(%rsi)
7: 64 48 8b 04 25 10 00 mov %fs:0x10,%rax
e: 00 00
10: 45 31 c0 xor %r8d,%r8d
13: 31 d2 xor %edx,%edx
15: 31 f6 xor %esi,%esi
17: bf 11 00 20 01 mov $0x1200011,%edi
1c: 4c 8d 90 d0 02 00 00 lea 0x2d0(%rax),%r10
23: b8 38 00 00 00 mov $0x38,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 35 ja 0x67
32: 89 c2 mov %eax,%edx
34: 85 c0 test %eax,%eax
36: 75 2c jne 0x64
38: 64 fs
39: 48 rex.W
3a: 8b .byte 0x8b
3b: 04 25 add $0x25,%al
3d: 10 00 adc %al,(%rax)
...
Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 35 ja 0x3d
8: 89 c2 mov %eax,%edx
a: 85 c0 test %eax,%eax
c: 75 2c jne 0x3a
e: 64 fs
f: 48 rex.W
10: 8b .byte 0x8b
11: 04 25 add $0x25,%al
13: 10 00 adc %al,(%rax)
...
[ 117.854567][ T468] RSP: 002b:00007fff02a6b648 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 117.855822][ T468] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f97dcccc293
[ 117.856993][ T468] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 117.858049][ T468] RBP: 0000000000000000 R08: 0000000000000000 R09: 7fffffffffffffff
[ 117.859328][ T468] R10: 00007f97dcbf5a10 R11: 0000000000000246 R12: 0000000000000001
[ 117.860342][ T468] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[ 117.861176][ T468] </TASK>
[ 117.861691][ T468]
[ 117.862096][ T468] Allocated by task 902:
[ 117.862719][ T468] kasan_save_track (mm/kasan/common.c:48 mm/kasan/common.c:68)
[ 117.863560][ T468] kasan_save_alloc_info (mm/kasan/generic.c:566)
[ 117.864398][ T468] __kasan_kmalloc (mm/kasan/common.c:398)
[ 117.865193][ T468] __kmalloc_cache_node_noprof (mm/slub.c:4308)
[ 117.866038][ T468] alloc_fair_sched_group (include/linux/slab.h:? kernel/sched/fair.c:13312)
[ 117.866871][ T468] sched_create_group (kernel/sched/core.c:8853)
[ 117.867588][ T468] sched_autogroup_create_attach (include/linux/err.h:67 kernel/sched/autogroup.c:93 kernel/sched/autogroup.c:194)
[ 117.868413][ T468] ksys_setsid (kernel/sys.c:?)
[ 117.869079][ T468] __ia32_sys_setsid (kernel/sys.c:1269)
[ 117.869767][ T468] x64_sys_call (kbuild/obj/consumer/x86_64-randconfig-014-20241028/./arch/x86/include/generated/asm/syscalls_64.h:161)
[ 117.870453][ T468] do_syscall_64 (arch/x86/entry/common.c:?)
[ 117.871156][ T468] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 117.872032][ T468]
[ 117.872468][ T468] Freed by task 243:
[ 117.873823][ T468] kasan_save_track (mm/kasan/common.c:48 mm/kasan/common.c:68)
[ 117.874518][ T468] kasan_save_free_info (mm/kasan/generic.c:582)
[ 117.875278][ T468] __kasan_slab_free (mm/kasan/common.c:271)
[ 117.875923][ T468] kfree (mm/slub.c:4579)
[ 117.876526][ T468] free_fair_sched_group (kernel/sched/fair.c:13278)
[ 117.877340][ T468] sched_free_group (kernel/sched/core.c:8823)
[ 117.878034][ T468] sched_free_group_rcu (kernel/sched/core.c:8831)
[ 117.878758][ T468] rcu_core (kernel/rcu/tree.c:?)
[ 117.879467][ T468] rcu_core_si (kernel/rcu/tree.c:2841)
[ 117.880104][ T468] handle_softirqs (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:555)
[ 117.880824][ T468] __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639)
[ 117.881526][ T468] irq_exit_rcu (kernel/softirq.c:651)
[ 117.882185][ T468] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049)
[ 117.883046][ T468] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702)
[ 117.883877][ T468]
[ 117.884299][ T468] The buggy address belongs to the object at ffff8881678c1c00
[ 117.884299][ T468] which belongs to the cache kmalloc-rnd-07-1k of size 1024
[ 117.886238][ T468] The buggy address is located 48 bytes inside of
[ 117.886238][ T468] freed 1024-byte region [ffff8881678c1c00, ffff8881678c2000)
[ 117.888235][ T468]
[ 117.888665][ T468] The buggy address belongs to the physical page:
[ 117.889558][ T468] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1678c0
[ 117.890880][ T468] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 117.892051][ T468] memcg:ffff88814cf29cc1
[ 117.892690][ T468] flags: 0x8000000000000040(head|zone=2)
[ 117.893467][ T468] page_type: f5(slab)
[ 117.894072][ T468] raw: 8000000000000040 ffff888100059b40 ffffea000479e610 ffff88810005a828
[ 117.896842][ T468] raw: 0000000000000000 00000000000a000a 00000001f5000000 ffff88814cf29cc1
[ 117.898038][ T468] head: 8000000000000040 ffff888100059b40 ffffea000479e610 ffff88810005a828
[ 117.899301][ T468] head: 0000000000000000 00000000000a000a 00000001f5000000 ffff88814cf29cc1
[ 117.900519][ T468] head: 8000000000000003 ffffea00059e3001 ffffffffffffffff 0000000000000000
[ 117.901731][ T468] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[ 117.903000][ T468] page dumped because: kasan: bad access detected
[ 117.903859][ T468] page_owner tracks the page as allocated
[ 117.904643][ T468] page last allocated via order 3, migratetype Unmovable, gfp_mask 0x252800(GFP_NOWAIT|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 194, tgid 194 ((sh)), ts 46703728016, free_ts 0
[ 117.906952][ T468] post_alloc_hook (include/linux/page_owner.h:?)
[ 117.907672][ T468] prep_new_page (mm/page_alloc.c:1547)
[ 117.908338][ T468] get_page_from_freelist (mm/page_alloc.c:?)
[ 117.909140][ T468] __alloc_pages_noprof (mm/page_alloc.c:4733)
[ 117.909885][ T468] new_slab (mm/slub.c:?)
[ 117.910539][ T468] ___slab_alloc (mm/slub.c:3819)
[ 117.911291][ T468] __slab_alloc (mm/slub.c:3910)
[ 117.911959][ T468] __kmalloc_cache_node_noprof (mm/slub.c:3961)
[ 117.912775][ T468] alloc_fair_sched_group (include/linux/slab.h:? kernel/sched/fair.c:13312)
[ 117.913582][ T468] sched_create_group (kernel/sched/core.c:8853)
[ 117.914420][ T468] sched_autogroup_create_attach (include/linux/err.h:67 kernel/sched/autogroup.c:93 kernel/sched/autogroup.c:194)
[ 117.915352][ T468] ksys_setsid (kernel/sys.c:?)
[ 117.916031][ T468] __ia32_sys_setsid (kernel/sys.c:1269)
[ 117.916779][ T468] x64_sys_call (kbuild/obj/consumer/x86_64-randconfig-014-20241028/./arch/x86/include/generated/asm/syscalls_64.h:161)
[ 117.917631][ T468] do_syscall_64 (arch/x86/entry/common.c:?)
[ 117.918407][ T468] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 117.919324][ T468] page_owner free stack trace missing
[ 117.920090][ T468]
[ 117.920515][ T468] Memory state around the buggy address:
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20241030/202410301535.14e0855c-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-10-30 8:18 ` kernel test robot
@ 2024-10-30 9:11 ` Tianchen Ding
0 siblings, 0 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-30 9:11 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, aubrey.li, yu.c.chen, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tejun Heo
On 2024/10/30 16:18, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed "BUG:KASAN:slab-use-after-free_in_enqueue_task_fair" on:
>
> commit: e9b718a38463470cc388aaa3ff50f12bbe8c4279 ("[PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice")
> url: https://github.com/intel-lab-lkp/linux/commits/Tianchen-Ding/sched-eevdf-Force-propagating-min_slice-of-cfs_rq-when-a-task-changing-slice/20241028-143410
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d1fb8a78b2ff1fe4e9478c75b4fbec588a73c1b0
> patch link: https://lore.kernel.org/all/20241028063313.8039-2-dtcccc@linux.alibaba.com/
> patch subject: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
>
> in testcase: trinity
> version: trinity-x86_64-ba2360ed-1_20240923
> with following parameters:
>
> runtime: 600s
>
>
>
> config: x86_64-randconfig-014-20241028
> compiler: clang-19
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> +------------------------------------------------------------------------------+------------+------------+
> | | d1fb8a78b2 | e9b718a384 |
> +------------------------------------------------------------------------------+------------+------------+
> | BUG:KASAN:slab-use-after-free_in_enqueue_task_fair | 0 | 4 |
> | BUG:KASAN:slab-use-after-free_in_dequeue_entities | 0 | 1 |
> +------------------------------------------------------------------------------+------------+------------+
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202410301535.14e0855c-lkp@intel.com
>
>
> [ 117.822447][ T468] BUG: KASAN: slab-use-after-free in enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
> [ 117.825270][ T468] Read of size 8 at addr ffff8881678c1c30 by task trinity-main/468
> [ 117.826451][ T468]
> [ 117.826941][ T468] CPU: 0 UID: 65534 PID: 468 Comm: trinity-main Not tainted 6.12.0-rc4-00025-ge9b718a38463 #1
> [ 117.828330][ T468] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 117.829779][ T468] Call Trace:
> [ 117.830339][ T468] <TASK>
> [ 117.830865][ T468] dump_stack_lvl (lib/dump_stack.c:122)
> [ 117.831554][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
> [ 117.832327][ T468] print_report (mm/kasan/report.c:378)
> [ 117.833021][ T468] ? __virt_addr_valid (include/linux/rcupdate.h:337 include/linux/rcupdate.h:941 include/linux/mmzone.h:2043 arch/x86/mm/physaddr.c:65)
> [ 117.833768][ T468] ? __virt_addr_valid (arch/x86/include/asm/preempt.h:103 include/linux/rcupdate.h:964 include/linux/mmzone.h:2053 arch/x86/mm/physaddr.c:65)
> [ 117.834510][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
> [ 117.835304][ T468] ? kasan_complete_mode_report_info (mm/kasan/report_generic.c:179)
> [ 117.836192][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
> [ 117.836990][ T468] kasan_report (mm/kasan/report.c:603)
> [ 117.837670][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
> [ 117.838449][ T468] __asan_report_load8_noabort (mm/kasan/report_generic.c:381)
> [ 117.839276][ T468] enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020)
> [ 117.840026][ T468] enqueue_task (kernel/sched/core.c:2027)
> [ 117.840701][ T468] activate_task (kernel/sched/core.c:2069)
> [ ... rest of the KASAN report snipped; identical to the log quoted in full above ... ]
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20241030/202410301535.14e0855c-lkp@intel.com
>
>
Hmm... I should add a check for whether the se node is actually on the rb tree.
Thanks for the report.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-28 6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
2024-10-28 17:37 ` Tejun Heo
[not found] ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
@ 2024-10-30 11:00 ` Peter Zijlstra
2024-10-30 14:54 ` Tianchen Ding
2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2024-10-30 11:00 UTC (permalink / raw)
To: Tianchen Ding
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo
On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
> according to their name styles. The unit is always microseconds.
>
> A cgroup with shorter slice can preempt others more easily. This could be
> useful in container scenarios.
>
> By default, cpu.fair_slice is 0, which means the slice of se is
> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
> will overwrite se->slice with the customized value.
So I'm not sure I like to expose this, like this.
The thing is, this is really specific to the way we schedule the cgroup
mess, fully hierarchical. If you want to collapse all this, like one of
those bpf schedulers does, then you cannot do this.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
2024-10-30 11:00 ` Peter Zijlstra
@ 2024-10-30 14:54 ` Tianchen Ding
0 siblings, 0 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-30 14:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo
On 2024/10/30 19:00, Peter Zijlstra wrote:
> On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
>> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
>> according to their name styles. The unit is always microseconds.
>>
>> A cgroup with shorter slice can preempt others more easily. This could be
>> useful in container scenarios.
>>
>> By default, cpu.fair_slice is 0, which means the slice of se is
>> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
>> will overwrite se->slice with the customized value.
>
> So I'm not sure I like to expose this, like this.
>
> The thing is, this is really specific to the way we schedule the cgroup
> mess, fully hierarchical. If you want to collapse all this, like one of
> those bpf schedulers does, then you cannot do this.
Yes, "slice" is an absolute value and may not fit the hierarchical cgroup model...
There may not be a perfect solution :(
Anyway, I'll send a v2 of the 1st patch later, which fixes an existing issue.
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-10-28 6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
2024-10-30 8:18 ` kernel test robot
@ 2024-10-31 9:48 ` Tianchen Ding
2024-11-12 3:25 ` Tianchen Ding
1 sibling, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-10-31 9:48 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
When a task changes its slice and its cgroup se is already on_rq, the cgroup
se will not be enqueued again, and hence root->min_slice remains
unchanged.
Force propagating it when the se doesn't need to be enqueued (or dequeued).
This ensures the se hierarchy always gets the latest min_slice.
Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v2:
Add a check that se->run_node is actually on the rb tree. Thanks to the
kernel test robot.
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6512258dc71f..ffac002a8807 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7017,6 +7017,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
se->slice = slice;
+ if (se != cfs_rq->curr)
+ min_vruntime_cb_propagate(&se->run_node, NULL);
slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running++;
@@ -7141,6 +7143,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
update_cfs_group(se);
se->slice = slice;
+ if (se != cfs_rq->curr)
+ min_vruntime_cb_propagate(&se->run_node, NULL);
slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running -= h_nr_running;
--
2.39.3
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-10-31 9:48 ` [PATCH v2] " Tianchen Ding
@ 2024-11-12 3:25 ` Tianchen Ding
2024-11-13 11:50 ` Re: " 解 咏梅
0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-12 3:25 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
On 2024/10/31 17:48, Tianchen Ding wrote:
> When a task changes its slice and its cgroup se is already on_rq, the cgroup
> se will not be enqueued again, and hence root->min_slice remains
> unchanged.
>
> Force propagating it when the se doesn't need to be enqueued (or dequeued).
> This ensures the se hierarchy always gets the latest min_slice.
>
> Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
ping for this fix.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-11-12 3:25 ` Tianchen Ding
@ 2024-11-13 11:50 ` 解 咏梅
2024-11-14 2:45 ` Tianchen Ding
0 siblings, 1 reply; 21+ messages in thread
From: 解 咏梅 @ 2024-11-13 11:50 UTC (permalink / raw)
To: Tianchen Ding, linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
This is a similar problem to commits d2929762 and 8dafa9d0, but this time the heap integrity is broken by the min_slice attribute.
Commit eab03c23c fixed that case by explicitly calling __dequeue_entity and __enqueue_entity in reweight_entity.
But it's a rare case; it only happens when adjusting a task's slice by setting up a scheduler attribute.
Regards,
Yongmei.
________________________________________
From: Tianchen Ding <dtcccc@linux.alibaba.com>
Sent: November 12, 2024 11:25
To: linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
On 2024/10/31 17:48, Tianchen Ding wrote:
> When a task changes its slice and its cgroup se is already on_rq, the cgroup
> se will not be enqueued again, and hence root->min_slice remains
> unchanged.
>
> Force propagating it when the se doesn't need to be enqueued (or dequeued).
> This ensures the se hierarchy always gets the latest min_slice.
>
> Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
ping for this fix.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-11-13 11:50 ` Re: " 解 咏梅
@ 2024-11-14 2:45 ` Tianchen Ding
2024-11-14 6:06 ` Re: " 解 咏梅
0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-14 2:45 UTC (permalink / raw)
To: 解 咏梅
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel@vger.kernel.org
On 2024/11/13 19:50, 解 咏梅 wrote:
> This is a similar problem to commits d2929762 and 8dafa9d0, but this time the heap integrity is broken by the min_slice attribute.
> Commit eab03c23c fixed that case by explicitly calling __dequeue_entity and __enqueue_entity in reweight_entity.
>
> But it's a rare case; it only happens when adjusting a task's slice by setting up a scheduler attribute.
>
It's not rare. Since it's in the common enqueue/dequeue path, wakeup/sleep may
also trigger this issue.
>
> Regards,
> Yongmei.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-11-14 2:45 ` Tianchen Ding
@ 2024-11-14 6:06 ` 解 咏梅
2024-11-14 6:36 ` Tianchen Ding
0 siblings, 1 reply; 21+ messages in thread
From: 解 咏梅 @ 2024-11-14 6:06 UTC (permalink / raw)
To: Tianchen Ding
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel@vger.kernel.org
Let's analyze it case by case :P
Say cgroup A has 3 tasks: task A, task B and task C.
1) Assign task A's slice to 0.1 ms; task B and task C both have the default slice (0.75 ms).
2) Task A is picked by __schedule() as the next task. Because task A is still on the rq,
the cfs_rq hierarchy doesn't have to change the cfs_rq's min_slice; it is reported up to the root cgroup.
3) Task A is preempted by another task but is still runnable. It will be requeued to cgroup A's cfs_rq; similar to case 2.
4) Task A is dequeued because it blocks; task A's se will be retained in cgroup A's cfs_rq until it reaches the 0-lag state.
4.1 Before 0-lag, I guess it's similar to case 2.
The logic is based on the cfs_rq's avg_runtime; it is assumed task A won't be picked as the next task before it reaches the 0-lag state.
If my understanding is wrong, please correct me. Thanks.
4.2 After it reaches the 0-lag state, if it's picked by pick_task_fair(), it will ultimately be removed from cgroup A's cfs_rq:
pick_next_entity() -> dequeue_entities(DEQUEUE_SLEEP | DEQUEUE_DELAYED) -> __dequeue_entity() (task A)
So cgroup A's cfs_rq min_slice will be re-calculated, and the cfs_rq hierarchy will update its own min_slice bottom-up.
4.3 After it reaches the 0-lag state, it is woken up. Because the current __schedule() splits the block/sleep path from the migration path, only the migration path calls deactivate(), so p->on_rq is still 1 and ttwu_runnable() will just call requeue_delayed_entity() for it; similar to case 2.
I think only case 1 has such a problem.
Regards,
Yongmei.
________________________________________
From: Tianchen Ding <dtcccc@linux.alibaba.com>
Sent: November 14, 2024 10:45
To: 解 咏梅 <xieym_ict@hotmail.com>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Subject: Re: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
On 2024/11/13 19:50, 解 咏梅 wrote:
> This is a similar problem to commits d2929762 and 8dafa9d0, but this time the heap integrity is broken by the min_slice attribute.
> Commit eab03c23c fixed that case by explicitly calling __dequeue_entity and __enqueue_entity in reweight_entity.
>
> But it's a rare case; it only happens when adjusting a task's slice by setting up a scheduler attribute.
>
It's not rare. Since it's in the common enqueue/dequeue path, wakeup/sleep may
also trigger this issue.
>
> Regards,
> Yongmei.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-11-14 6:06 ` Re: " 解 咏梅
@ 2024-11-14 6:36 ` Tianchen Ding
[not found] ` <ME0P300MB041447EBB0A17918745695898E5B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-14 6:36 UTC (permalink / raw)
To: 解 咏梅
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel@vger.kernel.org
On 2024/11/14 14:06, 解 咏梅 wrote:
> Let analyze it case by case:P
>
> say cgroup A has 3 tasks: task A, task B, task C
>
> 1) assign taskA's slice to 0.1 ms, task B, tack C, task C all have the default slice (0.75ms)
>
> 2) task A is picked by __schedule as next task, because task A is still on rq,
> so the cfs_rq hierarchical doesn't have to change cfs_rq's min_slice, it will report it to the root cgroup
>
> 3) task A is preempted by other task, it's still runnable. it will be requeued cgroup A's cfs_rq. similar as case 2
>
> 4) task A is dequeued since it's blocked; task A's se will be retained in cgroup A's cfs_rq until it reaches the 0-lag state.
> 4.1 Before 0-lag, I guess it's similar to case 2.
> The logic is based on the cfs_rq's avg_vruntime; it's supposed that task A won't be picked as the next task before it reaches the 0-lag state.
> If my understanding is wrong, please correct me. Thanks.
> 4.2 After it has reached the 0-lag state, if it's picked by pick_task_fair, it will ultimately be removed from cgroup A's cfs_rq:
> pick_next_entity() -> dequeue_entities(DEQUEUE_SLEEP | DEQUEUE_DELAYED) -> __dequeue_entity(task A)
> so cgroup A's cfs_rq min_slice will be re-calculated, and the cfs_rq hierarchy will modify its own min_slice bottom-up.
> 4.3 After it has reached the 0-lag state, it will be woken up. Because the current __schedule() splits the block/sleep path from the migration path, and only the migration path calls deactivate, p->on_rq is still 1, so ttwu_runnable() will just call requeue_delayed_entity() for it. Similar to case 2.
>
> I think only case 1 has such a problem.
>
> Regards,
> Yongmei.
>
I think you misunderstood the case. We're not talking about the DELAY_DEQUEUE
feature. We're simply talking about enqueue (waking up) and dequeue (sleeping).
For convenience, let's turn DELAY_DEQUEUE off.
Consider the following cgroup hierarchy on one cpu:
root_cgroup
|
------------------------
| |
cgroup_A(curr) other_cgroups...
|
--------------
| |
any_se(curr) cgroup_B(runnable)
|
------------
| |
task_A(sleep) task_B(runnable)
Assume task_A has a smaller slice (0.1ms) and all other tasks have the default
slice (0.75ms).
Because task_A is sleeping, it is not actually on the tree.
Now task_A is woken up and enqueued to cgroup_B, so the slice of cgroup_B is
updated to 0.1ms. This is OK.
However, since cgroup_B is already on_rq, it cannot be "enqueued" again to
cgroup_A; the code runs into the bottom half (the second
for_each_sched_entity loop in enqueue_task_fair()).
So the slice of cgroup_A is not updated; it is still 0.75ms.
Thanks.
* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
[not found] ` <ME0P300MB041447EBB0A17918745695898E5B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
@ 2024-11-14 7:47 ` Tianchen Ding
2024-11-14 13:44 ` 回复: " 解 咏梅
0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-14 7:47 UTC (permalink / raw)
To: 解 咏梅
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel@vger.kernel.org
On 2024/11/14 15:33, 解 咏梅 wrote:
> delayed dequeue is necessary for EEVDF to maintain lag. The task's relative vruntime is
> not necessary any more in the migration path.
>
>
> it is not a tuning option.
>
> regards,
> Yongmei
I don't know why you focus so much on DELAY_DEQUEUE; it is not related to the case I
explained.
The case is about the cgroup hierarchy, and the task_A in my case is already blocked
and *out of the rq*.
I'm talking about its enqueue path when it is woken up.
* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
2024-11-14 7:47 ` Tianchen Ding
@ 2024-11-14 13:44 ` 解 咏梅
0 siblings, 0 replies; 21+ messages in thread
From: 解 咏梅 @ 2024-11-14 13:44 UTC (permalink / raw)
To: Tianchen Ding
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel@vger.kernel.org
Sorry, I cannot switch to plain text mode on my cellphone.
Since commit 5e963f2bd4, there's no CFS any more, as I understand it.
There have been a lot of changes. There's no preemption check on every scheduler tick any more; Peter introduced an hrtimer to mark when the current slice is used up, so the preemption check only happens in the wakeup path.
This is the mark for disabling preemption:
curr->vlag == curr->deadline means no preemption.
Since your first patch for "Force propagating min_slice of cfs_rq", I have read the source and found there are a lot of new things in EEVDF.
Regards,
Yongmei.
________________________________________
From: Tianchen Ding <dtcccc@linux.alibaba.com>
Sent: November 14, 2024 15:47
To: 解 咏梅 <xieym_ict@hotmail.com>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
On 2024/11/14 15:33, 解 咏梅 wrote:
> delayed dequeue is necessary for EEVDF to maintain lag. The task's relative vruntime is
> not necessary any more in the migration path.
>
>
> it is not a tuning option.
>
> regards,
> Yongmei
I don't know why you focus so much on DELAY_DEQUEUE; it is not related to the case I
explained.
The case is about the cgroup hierarchy, and the task_A in my case is already blocked
and *out of the rq*.
I'm talking about its enqueue path when it is woken up.
end of thread, other threads:[~2024-11-14 13:44 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-28 6:33 [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
2024-10-28 6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
2024-10-30 8:18 ` kernel test robot
2024-10-30 9:11 ` Tianchen Ding
2024-10-31 9:48 ` [PATCH v2] " Tianchen Ding
2024-11-12 3:25 ` Tianchen Ding
2024-11-13 11:50 ` Re: " 解 咏梅
2024-11-14 2:45 ` Tianchen Ding
2024-11-14 6:06 ` Re: " 解 咏梅
2024-11-14 6:36 ` Tianchen Ding
[not found] ` <ME0P300MB041447EBB0A17918745695898E5B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
2024-11-14 7:47 ` Tianchen Ding
2024-11-14 13:44 ` Re: " 解 咏梅
2024-10-28 6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
2024-10-28 17:37 ` Tejun Heo
2024-10-29 2:07 ` Tianchen Ding
2024-10-29 6:18 ` Tejun Heo
2024-10-29 6:49 ` Tianchen Ding
2024-10-29 20:39 ` Tejun Heo
[not found] ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
2024-10-29 4:26 ` Re: " 解 咏梅
2024-10-30 11:00 ` Peter Zijlstra
2024-10-30 14:54 ` Tianchen Ding