public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice
@ 2024-10-28  6:33 Tianchen Ding
  2024-10-28  6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
  2024-10-28  6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
  0 siblings, 2 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-28  6:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo

The 1st patch is a minor fix for the existing cgroup propagation.

The 2nd patch is the main part and the RFC. If the design is OK, I'll send
another patch later documenting the new cgroup interface.

Thanks.

Tianchen Ding (2):
  sched/eevdf: Force propagating min_slice of cfs_rq when a task
    changing slice
  sched/eevdf: Introduce a cgroup interface for slice

 kernel/sched/core.c  | 34 +++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 51 +++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  3 +++
 3 files changed, 83 insertions(+), 5 deletions(-)

-- 
2.39.3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-10-28  6:33 [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
@ 2024-10-28  6:33 ` Tianchen Ding
  2024-10-30  8:18   ` kernel test robot
  2024-10-31  9:48   ` [PATCH v2] " Tianchen Ding
  2024-10-28  6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
  1 sibling, 2 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-28  6:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo

When a task changes its slice and its cgroup se is already on_rq, the
cgroup se will not be enqueued again, and hence root->min_slice remains
unchanged.

Force propagation when the se doesn't need to be enqueued (or dequeued),
ensuring the se hierarchy always gets the latest min_slice.
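The stale-cache problem can be illustrated with a small Python model (a sketch only: `Node` and `propagate` are hypothetical stand-ins for the augmented rb-tree and min_vruntime_cb_propagate(), not kernel APIs):

```python
# Toy model (plain Python, not kernel code): each node caches the minimum
# slice over its subtree, like the augmented rb-tree behind
# cfs_rq_min_slice(). Names here are illustrative only.
class Node:
    def __init__(self, slice_ns, parent=None):
        self.slice = slice_ns          # this entity's own slice
        self.min_slice = slice_ns      # cached minimum over the subtree
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

def propagate(node):
    # Analogue of min_vruntime_cb_propagate(): refresh cached minima upward.
    while node is not None:
        node.min_slice = min([node.slice] +
                             [c.min_slice for c in node.children])
        node = node.parent

root = Node(3_000_000)                  # values in nanoseconds
group = Node(3_000_000, parent=root)
task = Node(3_000_000, parent=group)

# A queued task shortens its slice; the cgroup se is already on_rq and is
# not re-enqueued, so without an explicit propagation the cache goes stale:
task.slice = 100_000
assert root.min_slice == 3_000_000      # stale value still cached
propagate(task)                         # the fix: force propagation
assert root.min_slice == 100_000
```

This is why the patch adds the min_vruntime_cb_propagate() calls on the slice-update paths where no enqueue/dequeue happens.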

Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6512258dc71f..7dc90a6e6e26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7017,6 +7017,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 
 		se->slice = slice;
+		min_vruntime_cb_propagate(&se->run_node, NULL);
 		slice = cfs_rq_min_slice(cfs_rq);
 
 		cfs_rq->h_nr_running++;
@@ -7141,6 +7142,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		update_cfs_group(se);
 
 		se->slice = slice;
+		min_vruntime_cb_propagate(&se->run_node, NULL);
 		slice = cfs_rq_min_slice(cfs_rq);
 
 		cfs_rq->h_nr_running -= h_nr_running;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-28  6:33 [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
  2024-10-28  6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
@ 2024-10-28  6:33 ` Tianchen Ding
  2024-10-28 17:37   ` Tejun Heo
                     ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-28  6:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo

Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1,
following their respective naming styles. The unit is always microseconds.

A cgroup with a shorter slice can preempt others more easily. This could be
useful in container scenarios.

By default, cpu.fair_slice is 0, which means the slice of the se is
calculated from the min_slice of its cfs_rq. If cpu.fair_slice is set, it
overrides se->slice with the customized value.
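The semantics can be sketched with a small Python model (toy code, not kernel code; it mirrors cfs_rq_slice() and the clamp in sched_group_set_slice() from the patch below, with function names chosen here for illustration):

```python
NSEC_PER_USEC = 1_000
NSEC_PER_MSEC = 1_000_000

def cfs_rq_slice(tg_slice_ns, min_slice_ns):
    # Mirrors `cfs_rq->tg->slice ?: cfs_rq_min_slice(cfs_rq)`:
    # 0 means "no cgroup override"; fall back to the hierarchy minimum.
    return tg_slice_ns or min_slice_ns

def set_group_slice(fair_slice_us):
    # Mirrors the clamp in sched_group_set_slice(): nonzero values are
    # bounded to [0.1 ms, 100 ms]; 0 clears the override. (The kernel
    # additionally rejects values that would overflow u64 nanoseconds.)
    if fair_slice_us == 0:
        return 0
    ns = fair_slice_us * NSEC_PER_USEC
    return max(NSEC_PER_MSEC // 10, min(ns, 100 * NSEC_PER_MSEC))

assert cfs_rq_slice(0, 700_000) == 700_000              # default: min_slice
assert cfs_rq_slice(3 * NSEC_PER_MSEC, 700_000) == 3 * NSEC_PER_MSEC
assert set_group_slice(10) == NSEC_PER_MSEC // 10       # clamped up to 0.1 ms
assert set_group_slice(500_000) == 100 * NSEC_PER_MSEC  # clamped down to 100 ms
```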

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
CC Tejun, do we need (and reuse) this slice interface for sched_ext?
---
 kernel/sched/core.c  | 34 ++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 49 +++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  3 +++
 3 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 114adac5a9c8..8d57b7d88d18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9690,6 +9690,24 @@ static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
 }
 #endif
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_fair_slice_read_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cft)
+{
+	u64 fair_slice_us = css_tg(css)->slice;
+
+	do_div(fair_slice_us, NSEC_PER_USEC);
+
+	return fair_slice_us;
+}
+
+static int cpu_fair_slice_write_u64(struct cgroup_subsys_state *css,
+				    struct cftype *cftype, u64 fair_slice_us)
+{
+	return sched_group_set_slice(css_tg(css), fair_slice_us);
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_GROUP_SCHED_WEIGHT
 	{
@@ -9703,6 +9721,14 @@ static struct cftype cpu_legacy_files[] = {
 		.write_s64 = cpu_idle_write_s64,
 	},
 #endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "fair_slice_us",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_fair_slice_read_u64,
+		.write_u64 = cpu_fair_slice_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "cfs_quota_us",
@@ -9943,6 +9969,14 @@ static struct cftype cpu_files[] = {
 		.write_s64 = cpu_idle_write_s64,
 	},
 #endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "fair_slice",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_fair_slice_read_u64,
+		.write_u64 = cpu_fair_slice_write_u64,
+	},
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.name = "max",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7dc90a6e6e26..694dc0655719 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -797,6 +797,11 @@ static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
 	return min_slice;
 }
 
+static inline u64 cfs_rq_slice(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->tg->slice ? : cfs_rq_min_slice(cfs_rq);
+}
+
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
 {
 	return entity_before(__node_2_se(a), __node_2_se(b));
@@ -6994,7 +6999,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			se->custom_slice = 1;
 		}
 		enqueue_entity(cfs_rq, se, flags);
-		slice = cfs_rq_min_slice(cfs_rq);
+		slice = cfs_rq_slice(cfs_rq);
 
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -7018,7 +7023,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		se->slice = slice;
 		min_vruntime_cb_propagate(&se->run_node, NULL);
-		slice = cfs_rq_min_slice(cfs_rq);
+		slice = cfs_rq_slice(cfs_rq);
 
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -7093,7 +7098,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		idle_h_nr_running = task_has_idle_policy(p);
 	} else {
 		cfs_rq = group_cfs_rq(se);
-		slice = cfs_rq_min_slice(cfs_rq);
+		slice = cfs_rq_slice(cfs_rq);
 	}
 
 	for_each_sched_entity(se) {
@@ -7118,7 +7123,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
-			slice = cfs_rq_min_slice(cfs_rq);
+			slice = cfs_rq_slice(cfs_rq);
 
 			/* Avoid re-evaluating load for this entity: */
 			se = parent_entity(se);
@@ -7143,7 +7148,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		se->slice = slice;
 		min_vruntime_cb_propagate(&se->run_node, NULL);
-		slice = cfs_rq_min_slice(cfs_rq);
+		slice = cfs_rq_slice(cfs_rq);
 
 		cfs_rq->h_nr_running -= h_nr_running;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
@@ -13535,6 +13540,40 @@ int sched_group_set_idle(struct task_group *tg, long idle)
 	return 0;
 }
 
+int sched_group_set_slice(struct task_group *tg, u64 fair_slice_us)
+{
+	u64 slice = 0;
+	int i;
+
+	if (fair_slice_us > U64_MAX / NSEC_PER_USEC)
+		return -EINVAL;
+
+	if (fair_slice_us) {
+		slice = clamp_t(u64, fair_slice_us * NSEC_PER_USEC,
+				NSEC_PER_MSEC / 10,	/* HZ = 1000 * 10 */
+				NSEC_PER_MSEC * 100);	/* HZ = 100 / 10 */
+	}
+
+	if (slice == tg->slice)
+		return 0;
+
+	tg->slice = slice;
+
+	for_each_possible_cpu(i) {
+		struct sched_entity *se = tg->se[i];
+		struct rq *rq = cpu_rq(i);
+
+		guard(rq_lock_irqsave)(rq);
+		for_each_sched_entity(se) {
+			se->custom_slice = 1;
+			se->slice = cfs_rq_slice(group_cfs_rq(se));
+			min_vruntime_cb_propagate(&se->run_node, NULL);
+		}
+	}
+
+	return 0;
+}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b139016cbd9..e02f8715bc04 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -443,6 +443,7 @@ struct task_group {
 	/* runqueue "owned" by this group on each CPU */
 	struct cfs_rq		**cfs_rq;
 	unsigned long		shares;
+	u64			slice;
 #ifdef	CONFIG_SMP
 	/*
 	 * load_avg can be heavily contended at clock tick time, so put
@@ -574,6 +575,8 @@ extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern int sched_group_set_idle(struct task_group *tg, long idle);
 
+extern int sched_group_set_slice(struct task_group *tg, u64 fair_slice_us);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-28  6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
@ 2024-10-28 17:37   ` Tejun Heo
  2024-10-29  2:07     ` Tianchen Ding
       [not found]   ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
  2024-10-30 11:00   ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-10-28 17:37 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

Hello,

On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
> according to their name styles. The unit is always microseconds.
> 
> A cgroup with shorter slice can preempt others more easily. This could be
> useful in container scenarios.
> 
> By default, cpu.fair_slice is 0, which means the slice of se is
> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
> will overwrite se->slice with the customized value.

Provided that this tunable is necessary, wouldn't it be more useful to
figure out what a per-task interface would look like first? Maybe there are
cases where a per-cgroup slice config makes sense, but that sounds
significantly less useful than being able to configure it per-task.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-28 17:37   ` Tejun Heo
@ 2024-10-29  2:07     ` Tianchen Ding
  2024-10-29  6:18       ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-10-29  2:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

On 2024/10/29 01:37, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
>> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
>> according to their name styles. The unit is always microseconds.
>>
>> A cgroup with shorter slice can preempt others more easily. This could be
>> useful in container scenarios.
>>
>> By default, cpu.fair_slice is 0, which means the slice of se is
>> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
>> will overwrite se->slice with the customized value.
> 
> Provided that this tunable is necessary, wouldn't it be more useful to
> figure out what per-task interface would look like first? Maybe there are
> cases where per-cgroup slice config makes sense but that sounds
> significantly less useful than being able to configure it per-task.
> 
> Thanks.
> 

For EEVDF, a per-task interface was introduced in commit 857b158dc5e8 
("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")

So this patch introduces a cgroup-level interface.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
       [not found]   ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
@ 2024-10-29  4:26     ` 解 咏梅
  0 siblings, 0 replies; 21+ messages in thread
From: 解 咏梅 @ 2024-10-29  4:26 UTC (permalink / raw)
  To: Tianchen Ding, linux-kernel@vger.kernel.org
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo

Changed to plain text to send to linux-kernel@vger.kernel.org.

Sorry :(

________________________________________
From: 解 咏梅 <xieym_ict@hotmail.com>
Sent: October 29, 2024 12:10
To: Tianchen Ding <dtcccc@linux.alibaba.com>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>; Tejun Heo <tj@kernel.org>
Subject: Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
 
Interesting!
So an se's position in the RB tree (aka se->deadline) is determined by 2 factors:
- its own total weighted runtime
- the tg's slice if defined, or the queued cfs_rq's slice (the latter considers all ses beneath the cfs_rq, so it's a dynamic slice)

In my understanding, this patch proposes a static slice for a task cgroup. It might be useful in container colocation scenarios.
It's hard to detect the major product process in a rich container scenario, because user applications are varied and diverse.


Regards,
Yongmei.



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-29  2:07     ` Tianchen Ding
@ 2024-10-29  6:18       ` Tejun Heo
  2024-10-29  6:49         ` Tianchen Ding
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-10-29  6:18 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

Hello,

On Tue, Oct 29, 2024 at 10:07:36AM +0800, Tianchen Ding wrote:
....
> For eevdf, per-task interface has been introduced in commit 857b158dc5e8
> ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice
> suggestion")

I see.

> So This patch is trying to introduce a cgroup level interface.

If I'm reading the code correctly, the property can be set per task and is
inherited when forking unless RESET_ON_FORK is set. I'm not sure the cgroup
interface adds all that much:

- There's no inherent hierarchical or grouping behavior. I don't think it
  makes sense for cgroup config to override per-thread configs.

- For cgroup-wide config, setting it in the seed process of the cgroup would
  suffice in most cases. Changing it afterwards is more awkward but not
  hugely so. If racing against forks is a concern, you can either use the
  freezer or iterate until no new tasks are seen.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-29  6:18       ` Tejun Heo
@ 2024-10-29  6:49         ` Tianchen Ding
  2024-10-29 20:39           ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-10-29  6:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

On 2024/10/29 14:18, Tejun Heo wrote:
>> So This patch is trying to introduce a cgroup level interface.
> 
> If I'm reading the code correctly, the property can be set per task and is
> inherited when forking unless RESET_ON_FORK is set. I'm not sure the cgroup
> interface adds all that much:
> 
> - There's no inherent hierarchical or grouping behavior. I don't think it
>    makes sense for cgroup config to override per-thread configs.
> 
> - For cgroup-wide config, setting it in the seed process of the cgroup would
>    suffice in most cases. Changing it afterwards is more awkward but not
>    hugely so. If racing against forks is a concern, you can either use the
>    freezer or iterate until no new tasks are seen.
> 
> Thanks.
> 

However, we may want to set and keep different slices for processes inside the 
same cgroup.

For example, in a rich container scenario (as Yongmei mentioned), the 
administrator can decide the cpu resources of a container: its 
weight (cpu.weight), scope (cpuset.cpus), bandwidth (cpu.max), and also the 
**slice and preempt priority** (cpu.fair_slice in this patch).

At the same time, the user may want to manage his own processes inside the 
container. He may want to set a customized value (sched_attr::sched_runtime) 
for each process, and the administrator should not overwrite the user's own 
config.

So cpu.fair_slice is for preempt competition across cgroups at the same level, 
while sched_attr::sched_runtime can be used for processes inside the same 
cgroup. (a bit like cpu.weight vs task nice)

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-29  6:49         ` Tianchen Ding
@ 2024-10-29 20:39           ` Tejun Heo
  0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2024-10-29 20:39 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

Hello,

On Tue, Oct 29, 2024 at 02:49:51PM +0800, Tianchen Ding wrote:
...
> At the same time, the user may want to decide his processes inside the
> container. He may want to set customized value (sched_attr::sched_runtime)
> for each process, and administrator should not overwrite the user's own
> config.
> 
> So cpu.fair_slice is for preempt competition across cgroups in the samle
> level, while sched_attr::sched_runtime can be used for processes inside the
> same cgroup. (a bit like cpu.weight vs task NICE)

I see. It's setting the slice for the task_groups. I'm not sure how much we
want to codify the current recursive behavior into a fixed interface. Besides,
it's not sustainable to keep adding scheduler tunables to the cgroup interface.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-10-28  6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
@ 2024-10-30  8:18   ` kernel test robot
  2024-10-30  9:11     ` Tianchen Ding
  2024-10-31  9:48   ` [PATCH v2] " Tianchen Ding
  1 sibling, 1 reply; 21+ messages in thread
From: kernel test robot @ 2024-10-30  8:18 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: oe-lkp, lkp, linux-kernel, aubrey.li, yu.c.chen, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, oliver.sang



Hello,

kernel test robot noticed "BUG:KASAN:slab-use-after-free_in_enqueue_task_fair" on:

commit: e9b718a38463470cc388aaa3ff50f12bbe8c4279 ("[PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice")
url: https://github.com/intel-lab-lkp/linux/commits/Tianchen-Ding/sched-eevdf-Force-propagating-min_slice-of-cfs_rq-when-a-task-changing-slice/20241028-143410
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d1fb8a78b2ff1fe4e9478c75b4fbec588a73c1b0
patch link: https://lore.kernel.org/all/20241028063313.8039-2-dtcccc@linux.alibaba.com/
patch subject: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice

in testcase: trinity
version: trinity-x86_64-ba2360ed-1_20240923
with following parameters:

	runtime: 600s



config: x86_64-randconfig-014-20241028
compiler: clang-19
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+------------------------------------------------------------------------------+------------+------------+
|                                                                              | d1fb8a78b2 | e9b718a384 |
+------------------------------------------------------------------------------+------------+------------+
| BUG:KASAN:slab-use-after-free_in_enqueue_task_fair                           | 0          | 4          |
| BUG:KASAN:slab-use-after-free_in_dequeue_entities                            | 0          | 1          |
+------------------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202410301535.14e0855c-lkp@intel.com


[ 117.822447][ T468] BUG: KASAN: slab-use-after-free in enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020) 
[  117.825270][  T468] Read of size 8 at addr ffff8881678c1c30 by task trinity-main/468
[  117.826451][  T468]
[  117.826941][  T468] CPU: 0 UID: 65534 PID: 468 Comm: trinity-main Not tainted 6.12.0-rc4-00025-ge9b718a38463 #1
[  117.828330][  T468] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  117.829779][  T468] Call Trace:
[  117.830339][  T468]  <TASK>
[ 117.830865][ T468] dump_stack_lvl (lib/dump_stack.c:122) 
[ 117.831554][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020) 
[ 117.832327][ T468] print_report (mm/kasan/report.c:378) 
[ 117.833021][ T468] ? __virt_addr_valid (include/linux/rcupdate.h:337 include/linux/rcupdate.h:941 include/linux/mmzone.h:2043 arch/x86/mm/physaddr.c:65) 
[ 117.833768][ T468] ? __virt_addr_valid (arch/x86/include/asm/preempt.h:103 include/linux/rcupdate.h:964 include/linux/mmzone.h:2053 arch/x86/mm/physaddr.c:65) 
[ 117.834510][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020) 
[ 117.835304][ T468] ? kasan_complete_mode_report_info (mm/kasan/report_generic.c:179) 
[ 117.836192][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020) 
[ 117.836990][ T468] kasan_report (mm/kasan/report.c:603) 
[ 117.837670][ T468] ? enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020) 
[ 117.838449][ T468] __asan_report_load8_noabort (mm/kasan/report_generic.c:381) 
[ 117.839276][ T468] enqueue_task_fair (kernel/sched/fair.c:831 kernel/sched/fair.c:846 kernel/sched/fair.c:7020) 
[ 117.840026][ T468] enqueue_task (kernel/sched/core.c:2027) 
[ 117.840701][ T468] activate_task (kernel/sched/core.c:2069) 
[ 117.841383][ T468] wake_up_new_task (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/sched.h:185 kernel/sched/core.c:4829) 
[ 117.842104][ T468] kernel_clone (kernel/fork.c:2818) 
[ 117.842786][ T468] __x64_sys_clone (kernel/fork.c:2927) 
[ 117.843490][ T468] x64_sys_call (kbuild/obj/consumer/x86_64-randconfig-014-20241028/./arch/x86/include/generated/asm/syscalls_64.h:161) 
[ 117.844198][ T468] do_syscall_64 (arch/x86/entry/common.c:?) 
[ 117.844861][ T468] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:467) 
[ 117.845742][ T468] ? syscall_exit_to_user_mode (kernel/entry/common.c:221) 
[ 117.846571][ T468] ? do_syscall_64 (arch/x86/entry/common.c:102) 
[ 117.848374][ T468] ? irqentry_exit_to_user_mode (kernel/entry/common.c:234) 
[ 117.849183][ T468] ? irqentry_exit (kernel/entry/common.c:367) 
[ 117.849897][ T468] ? exc_page_fault (arch/x86/mm/fault.c:1543) 
[ 117.850589][ T468] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[  117.851487][  T468] RIP: 0033:0x7f97dcccc293
[ 117.852155][ T468] Code: 00 00 00 00 00 66 90 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	00 00                	add    %al,(%rax)
   4:	00 66 90             	add    %ah,-0x70(%rsi)
   7:	64 48 8b 04 25 10 00 	mov    %fs:0x10,%rax
   e:	00 00 
  10:	45 31 c0             	xor    %r8d,%r8d
  13:	31 d2                	xor    %edx,%edx
  15:	31 f6                	xor    %esi,%esi
  17:	bf 11 00 20 01       	mov    $0x1200011,%edi
  1c:	4c 8d 90 d0 02 00 00 	lea    0x2d0(%rax),%r10
  23:	b8 38 00 00 00       	mov    $0x38,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 35                	ja     0x67
  32:	89 c2                	mov    %eax,%edx
  34:	85 c0                	test   %eax,%eax
  36:	75 2c                	jne    0x64
  38:	64                   	fs
  39:	48                   	rex.W
  3a:	8b                   	.byte 0x8b
  3b:	04 25                	add    $0x25,%al
  3d:	10 00                	adc    %al,(%rax)
	...

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 35                	ja     0x3d
   8:	89 c2                	mov    %eax,%edx
   a:	85 c0                	test   %eax,%eax
   c:	75 2c                	jne    0x3a
   e:	64                   	fs
   f:	48                   	rex.W
  10:	8b                   	.byte 0x8b
  11:	04 25                	add    $0x25,%al
  13:	10 00                	adc    %al,(%rax)
	...
[  117.854567][  T468] RSP: 002b:00007fff02a6b648 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  117.855822][  T468] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f97dcccc293
[  117.856993][  T468] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  117.858049][  T468] RBP: 0000000000000000 R08: 0000000000000000 R09: 7fffffffffffffff
[  117.859328][  T468] R10: 00007f97dcbf5a10 R11: 0000000000000246 R12: 0000000000000001
[  117.860342][  T468] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[  117.861176][  T468]  </TASK>
[  117.861691][  T468]
[  117.862096][  T468] Allocated by task 902:
[ 117.862719][ T468] kasan_save_track (mm/kasan/common.c:48 mm/kasan/common.c:68) 
[ 117.863560][ T468] kasan_save_alloc_info (mm/kasan/generic.c:566) 
[ 117.864398][ T468] __kasan_kmalloc (mm/kasan/common.c:398) 
[ 117.865193][ T468] __kmalloc_cache_node_noprof (mm/slub.c:4308) 
[ 117.866038][ T468] alloc_fair_sched_group (include/linux/slab.h:? kernel/sched/fair.c:13312) 
[ 117.866871][ T468] sched_create_group (kernel/sched/core.c:8853) 
[ 117.867588][ T468] sched_autogroup_create_attach (include/linux/err.h:67 kernel/sched/autogroup.c:93 kernel/sched/autogroup.c:194) 
[ 117.868413][ T468] ksys_setsid (kernel/sys.c:?) 
[ 117.869079][ T468] __ia32_sys_setsid (kernel/sys.c:1269) 
[ 117.869767][ T468] x64_sys_call (kbuild/obj/consumer/x86_64-randconfig-014-20241028/./arch/x86/include/generated/asm/syscalls_64.h:161) 
[ 117.870453][ T468] do_syscall_64 (arch/x86/entry/common.c:?) 
[ 117.871156][ T468] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[  117.872032][  T468]
[  117.872468][  T468] Freed by task 243:
[ 117.873823][ T468] kasan_save_track (mm/kasan/common.c:48 mm/kasan/common.c:68) 
[ 117.874518][ T468] kasan_save_free_info (mm/kasan/generic.c:582) 
[ 117.875278][ T468] __kasan_slab_free (mm/kasan/common.c:271) 
[ 117.875923][ T468] kfree (mm/slub.c:4579) 
[ 117.876526][ T468] free_fair_sched_group (kernel/sched/fair.c:13278) 
[ 117.877340][ T468] sched_free_group (kernel/sched/core.c:8823) 
[ 117.878034][ T468] sched_free_group_rcu (kernel/sched/core.c:8831) 
[ 117.878758][ T468] rcu_core (kernel/rcu/tree.c:?) 
[ 117.879467][ T468] rcu_core_si (kernel/rcu/tree.c:2841) 
[ 117.880104][ T468] handle_softirqs (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:555) 
[ 117.880824][ T468] __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639) 
[ 117.881526][ T468] irq_exit_rcu (kernel/softirq.c:651) 
[ 117.882185][ T468] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049) 
[ 117.883046][ T468] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) 
[  117.883877][  T468]
[  117.884299][  T468] The buggy address belongs to the object at ffff8881678c1c00
[  117.884299][  T468]  which belongs to the cache kmalloc-rnd-07-1k of size 1024
[  117.886238][  T468] The buggy address is located 48 bytes inside of
[  117.886238][  T468]  freed 1024-byte region [ffff8881678c1c00, ffff8881678c2000)
[  117.888235][  T468]
[  117.888665][  T468] The buggy address belongs to the physical page:
[  117.889558][  T468] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1678c0
[  117.890880][  T468] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  117.892051][  T468] memcg:ffff88814cf29cc1
[  117.892690][  T468] flags: 0x8000000000000040(head|zone=2)
[  117.893467][  T468] page_type: f5(slab)
[  117.894072][  T468] raw: 8000000000000040 ffff888100059b40 ffffea000479e610 ffff88810005a828
[  117.896842][  T468] raw: 0000000000000000 00000000000a000a 00000001f5000000 ffff88814cf29cc1
[  117.898038][  T468] head: 8000000000000040 ffff888100059b40 ffffea000479e610 ffff88810005a828
[  117.899301][  T468] head: 0000000000000000 00000000000a000a 00000001f5000000 ffff88814cf29cc1
[  117.900519][  T468] head: 8000000000000003 ffffea00059e3001 ffffffffffffffff 0000000000000000
[  117.901731][  T468] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  117.903000][  T468] page dumped because: kasan: bad access detected
[  117.903859][  T468] page_owner tracks the page as allocated
[  117.904643][  T468] page last allocated via order 3, migratetype Unmovable, gfp_mask 0x252800(GFP_NOWAIT|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 194, tgid 194 ((sh)), ts 46703728016, free_ts 0
[ 117.906952][ T468] post_alloc_hook (include/linux/page_owner.h:?) 
[ 117.907672][ T468] prep_new_page (mm/page_alloc.c:1547) 
[ 117.908338][ T468] get_page_from_freelist (mm/page_alloc.c:?) 
[ 117.909140][ T468] __alloc_pages_noprof (mm/page_alloc.c:4733) 
[ 117.909885][ T468] new_slab (mm/slub.c:?) 
[ 117.910539][ T468] ___slab_alloc (mm/slub.c:3819) 
[ 117.911291][ T468] __slab_alloc (mm/slub.c:3910) 
[ 117.911959][ T468] __kmalloc_cache_node_noprof (mm/slub.c:3961) 
[ 117.912775][ T468] alloc_fair_sched_group (include/linux/slab.h:? kernel/sched/fair.c:13312) 
[ 117.913582][ T468] sched_create_group (kernel/sched/core.c:8853) 
[ 117.914420][ T468] sched_autogroup_create_attach (include/linux/err.h:67 kernel/sched/autogroup.c:93 kernel/sched/autogroup.c:194) 
[ 117.915352][ T468] ksys_setsid (kernel/sys.c:?) 
[ 117.916031][ T468] __ia32_sys_setsid (kernel/sys.c:1269) 
[ 117.916779][ T468] x64_sys_call (kbuild/obj/consumer/x86_64-randconfig-014-20241028/./arch/x86/include/generated/asm/syscalls_64.h:161) 
[ 117.917631][ T468] do_syscall_64 (arch/x86/entry/common.c:?) 
[ 117.918407][ T468] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[  117.919324][  T468] page_owner free stack trace missing
[  117.920090][  T468]
[  117.920515][  T468] Memory state around the buggy address:


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20241030/202410301535.14e0855c-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-10-30  8:18   ` kernel test robot
@ 2024-10-30  9:11     ` Tianchen Ding
  0 siblings, 0 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-30  9:11 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, aubrey.li, yu.c.chen, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo

On 2024/10/30 16:18, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "BUG:KASAN:slab-use-after-free_in_enqueue_task_fair" on:
> 
> commit: e9b718a38463470cc388aaa3ff50f12bbe8c4279 ("[PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice")
> url: https://github.com/intel-lab-lkp/linux/commits/Tianchen-Ding/sched-eevdf-Force-propagating-min_slice-of-cfs_rq-when-a-task-changing-slice/20241028-143410
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git d1fb8a78b2ff1fe4e9478c75b4fbec588a73c1b0
> patch link: https://lore.kernel.org/all/20241028063313.8039-2-dtcccc@linux.alibaba.com/
> patch subject: [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
> 
> in testcase: trinity
> version: trinity-x86_64-ba2360ed-1_20240923
> with following parameters:
> 
> 	runtime: 600s
> 
> 
> 
> config: x86_64-randconfig-014-20241028
> compiler: clang-19
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> +------------------------------------------------------------------------------+------------+------------+
> |                                                                              | d1fb8a78b2 | e9b718a384 |
> +------------------------------------------------------------------------------+------------+------------+
> | BUG:KASAN:slab-use-after-free_in_enqueue_task_fair                           | 0          | 4          |
> | BUG:KASAN:slab-use-after-free_in_dequeue_entities                            | 0          | 1          |
> +------------------------------------------------------------------------------+------------+------------+
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202410301535.14e0855c-lkp@intel.com
> 
> 
> [... KASAN report, code dump, and reproduction materials snipped; quoted in full earlier in this thread ...]
> 
> 

Hmm... We should add a check for whether the se node is actually on the rb tree.

Thanks for the report.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-28  6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
  2024-10-28 17:37   ` Tejun Heo
       [not found]   ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
@ 2024-10-30 11:00   ` Peter Zijlstra
  2024-10-30 14:54     ` Tianchen Ding
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2024-10-30 11:00 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo

On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
> according to their name styles. The unit is always microseconds.
> 
> A cgroup with shorter slice can preempt others more easily. This could be
> useful in container scenarios.
> 
> By default, cpu.fair_slice is 0, which means the slice of se is
> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
> will overwrite se->slice with the customized value.

So I'm not sure I like to expose this, like this.

The thing is, this is really specific to the way we schedule the cgroup
mess, fully hierarchical. If you want to collapse all this, like one of
those bpf schedulers does, then you cannot do this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice
  2024-10-30 11:00   ` Peter Zijlstra
@ 2024-10-30 14:54     ` Tianchen Ding
  0 siblings, 0 replies; 21+ messages in thread
From: Tianchen Ding @ 2024-10-30 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo

On 2024/10/30 19:00, Peter Zijlstra wrote:
> On Mon, Oct 28, 2024 at 02:33:13PM +0800, Tianchen Ding wrote:
>> Introduce "cpu.fair_slice" for cgroup v2 and "cpu.fair_slice_us" for v1
>> according to their name styles. The unit is always microseconds.
>>
>> A cgroup with shorter slice can preempt others more easily. This could be
>> useful in container scenarios.
>>
>> By default, cpu.fair_slice is 0, which means the slice of se is
>> calculated by min_slice from its cfs_rq. If cpu.fair_slice is set, it
>> will overwrite se->slice with the customized value.
> 
> So I'm not sure I like to expose this, like this.
> 
> The thing is, this is really specific to the way we schedule the cgroup
> mess, fully hierarchical. If you want to collapse all this, like one of
> those bpf schedulers does, then you cannot do this.

Yes, "slice" is an absolute value and may not fit the hierarchical cgroup model...
There may not be a perfect solution :(

Anyway, I'll later send v2 for the 1st patch which fixes an existing issue.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-10-28  6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
  2024-10-30  8:18   ` kernel test robot
@ 2024-10-31  9:48   ` Tianchen Ding
  2024-11-12  3:25     ` Tianchen Ding
  1 sibling, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-10-31  9:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

When a task changes slice and its cgroup se is already on_rq, the cgroup
se will not be enqueued again, and hence root->min_slice remains
unchanged.

Force propagating it when the se doesn't need to be enqueued (or dequeued),
ensuring the se hierarchy always gets the latest min_slice.

Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v2:
Add a check about effectiveness of se->run_node. Thanks to the kernel
test robot.
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6512258dc71f..ffac002a8807 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7017,6 +7017,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 
 		se->slice = slice;
+		if (se != cfs_rq->curr)
+			min_vruntime_cb_propagate(&se->run_node, NULL);
 		slice = cfs_rq_min_slice(cfs_rq);
 
 		cfs_rq->h_nr_running++;
@@ -7141,6 +7143,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		update_cfs_group(se);
 
 		se->slice = slice;
+		if (se != cfs_rq->curr)
+			min_vruntime_cb_propagate(&se->run_node, NULL);
 		slice = cfs_rq_min_slice(cfs_rq);
 
 		cfs_rq->h_nr_running -= h_nr_running;
-- 
2.39.3
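As a rough illustration of the invariant this patch restores (a toy Python model, not kernel code; the names are made up): the scheduler's rb-tree caches a per-subtree minimum slice, so updating an on-tree entity's slice in place leaves stale minimums in its ancestors until they are explicitly re-propagated, which is what the min_vruntime_cb_propagate() calls in the patch do.

```python
# Toy model of a tree with a cached per-subtree minimum. Not kernel code:
# Node/propagate() are illustrative stand-ins for the augmented rb-tree
# and min_vruntime_cb_propagate().

class Node:
    def __init__(self, slice_, parent=None):
        self.slice = slice_
        self.parent = parent
        self.children = []
        self.min_slice = slice_  # cached minimum slice of this subtree

def propagate(node):
    """Recompute the cached minimum from node up to the root."""
    while node is not None:
        node.min_slice = min([node.slice] +
                             [c.min_slice for c in node.children])
        node = node.parent

root = Node(75)
child = Node(75, parent=root)
root.children.append(child)
propagate(child)
assert root.min_slice == 75

# An entity already on the tree shortens its slice in place...
child.slice = 10
# ...without propagation the root still reports the stale minimum:
assert root.min_slice == 75
# The fix: propagate after the in-place update.
propagate(child)
assert root.min_slice == 10
```

The same shape of bug applies whenever an augmented-tree key is mutated while the node stays linked: the node itself is correct, but every cached aggregate above it is stale.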


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-10-31  9:48   ` [PATCH v2] " Tianchen Ding
@ 2024-11-12  3:25     ` Tianchen Ding
  2024-11-13 11:50       ` Re: " 解 咏梅
  0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-12  3:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On 2024/10/31 17:48, Tianchen Ding wrote:
> When a task changes slice and its cgroup se is already on_rq, the cgroup
> se will not be enqueued again, and hence root->min_slice remains
> unchanged.
> 
> Force propagating it when the se doesn't need to be enqueued (or dequeued),
> ensuring the se hierarchy always gets the latest min_slice.
> 
> Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>

ping for this fix.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-11-12  3:25     ` Tianchen Ding
@ 2024-11-13 11:50       ` 解 咏梅
  2024-11-14  2:45         ` Tianchen Ding
  0 siblings, 1 reply; 21+ messages in thread
From: 解 咏梅 @ 2024-11-13 11:50 UTC (permalink / raw)
  To: Tianchen Ding, linux-kernel@vger.kernel.org
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

This is a similar problem to commits d2929762 & 8dafa9d0, but this time the heap integrity is corrupted by the min_slice attribute.
Commit eab03c23c fixed that by explicitly calling __dequeue_entity() and __enqueue_entity() in reweight_entity().

But it's a rare case; it only happens when a task's slice is adjusted by setting up scheduler attributes.
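The remove-and-reinsert strategy mentioned above can be sketched with a toy min-heap model (plain Python, not the kernel's augmented rb-tree; the names are illustrative): changing an entry's key is only safe if the entry is taken out and put back, the analogue of __dequeue_entity() followed by __enqueue_entity() in reweight_entity().

```python
import heapq

# Toy model: keep a min-heap of (slice, name) pairs consistent when one
# entry's slice changes. The safe pattern is remove-then-reinsert; mutating
# the key of an entry that stays in the structure would silently break the
# heap invariant.

def change_slice(entries, name, new_slice):
    # Remove the stale entry, reinsert it with the new key, and restore
    # the heap invariant.
    items = [(s, n) for (s, n) in entries if n != name]
    items.append((new_slice, name))
    heapq.heapify(items)
    return items

heap = [(75, "A"), (75, "B"), (75, "C")]
heapq.heapify(heap)
heap = change_slice(heap, "A", 10)
assert heap[0] == (10, "A")  # the cached minimum is correct again
```

The alternative taken by the patch under discussion is to mutate in place and then explicitly re-propagate the cached minimum; both approaches restore the same invariant.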


Regards,
Yongmei.


________________________________________
From: Tianchen Ding <dtcccc@linux.alibaba.com>
Sent: 2024-11-12 11:25
To: linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
 
On 2024/10/31 17:48, Tianchen Ding wrote:
> When a task changes slice and its cgroup se is already on_rq, the cgroup
> se will not be enqueued again, and hence root->min_slice remains
> unchanged.
>
> Force propagating it when the se doesn't need to be enqueued (or dequeued),
> ensuring the se hierarchy always gets the latest min_slice.
>
> Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>

ping for this fix.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-11-13 11:50       ` Re: " 解 咏梅
@ 2024-11-14  2:45         ` Tianchen Ding
  2024-11-14  6:06           ` Re: " 解 咏梅
  0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-14  2:45 UTC (permalink / raw)
  To: 解 咏梅
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel@vger.kernel.org

On 2024/11/13 19:50, 解 咏梅 wrote:
> This is a similar problem to commits d2929762 & 8dafa9d0, but this time the heap integrity is corrupted by the min_slice attribute.
> Commit eab03c23c fixed that by explicitly calling __dequeue_entity() and __enqueue_entity() in reweight_entity().
> 
> But it's a rare case; it only happens when a task's slice is adjusted by setting up scheduler attributes.
> 

It's not rare. Since it's in the common enqueue/dequeue path, wakeup/sleep may also
trigger this issue.

> 
> Regards,
> Yongmei.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-11-14  2:45         ` Tianchen Ding
@ 2024-11-14  6:06           ` 解 咏梅
  2024-11-14  6:36             ` Tianchen Ding
  0 siblings, 1 reply; 21+ messages in thread
From: 解 咏梅 @ 2024-11-14  6:06 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel@vger.kernel.org

Let's analyze it case by case :P

Say cgroup A has 3 tasks: task A, task B, task C.

1) Assign task A's slice to 0.1 ms; task B and task C both have the default slice (0.75 ms).

2) task A is picked by __schedule() as the next task. Because task A is still on the rq,
the cfs_rq hierarchy doesn't have to change any cfs_rq's min_slice; it is already reported up to the root cgroup.

3) task A is preempted by another task but is still runnable; it will be requeued on cgroup A's cfs_rq. Similar to case 2.

4) task A blocks; its se will be retained on cgroup A's cfs_rq until it reaches the 0-lag state.
4.1 Before 0-lag, I guess it's similar to case 2.
     The logic is based on the cfs_rq's avg_runtime; it is supposed that task A won't be picked as the next task before it reaches the 0-lag state.
     If my understanding is wrong, please correct me. Thanks.
4.2 After it reaches the 0-lag state, if it's picked by pick_task_fair(), it will ultimately be removed from cgroup A's cfs_rq:
     pick_next_entity() -> dequeue_entities(DEQUEUE_SLEEP | DEQUEUE_DELAYED) -> __dequeue_entity() (task A)
    So cgroup A's cfs_rq min_slice will be recalculated, and the cfs_rq hierarchy will update each level's min_slice bottom up.
4.3 After it reaches the 0-lag state and is woken up: the current __schedule() splits the block/sleep path from the migration path, and only the migration path calls deactivate, so p->on_rq is still 1 and ttwu_runnable() just calls requeue_delayed_entity() for it. Similar to case 2.

I think only case 1 has such a problem.
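[Editor's sketch] The bottom-up recalculation described in case 4.2 can be sketched as a toy user-space model. The `struct grp` layout, the slot counts, and the slice values (in arbitrary units, here 750 vs 100 to mirror 0.75 ms vs 0.1 ms) are hypothetical simplifications, not the kernel's actual structures:

```c
#include <assert.h>

#define NSLOTS 8
#define SLICE_MAX (~0UL)

/* Toy model of a cfs_rq (hypothetical, much simplified from the kernel):
 * queued entities are either leaf tasks (a raw slice value) or child
 * groups, which contribute their own min_slice upward. */
struct grp {
	struct grp *parent;
	unsigned long task_slice[NSLOTS]; /* slices of queued leaf tasks */
	int nr_tasks;
	struct grp *child[NSLOTS];        /* queued child groups */
	int nr_children;
	unsigned long min_slice;
};

/* min over everything currently queued on this cfs_rq */
static unsigned long compute_min(struct grp *g)
{
	unsigned long m = SLICE_MAX;

	for (int i = 0; i < g->nr_tasks; i++)
		if (g->task_slice[i] < m)
			m = g->task_slice[i];
	for (int i = 0; i < g->nr_children; i++)
		if (g->child[i]->min_slice < m)
			m = g->child[i]->min_slice;
	return m;
}

/* Bottom-up recalculation after an entity is queued or dequeued, as in
 * case 4.2: each level recomputes its min_slice and passes it upward. */
static void propagate_up(struct grp *g)
{
	for (; g; g = g->parent)
		g->min_slice = compute_min(g);
}
```

In this model, dequeueing the small-slice task from cgroup A and re-running `propagate_up()` restores the default min_slice at the root, which is the bottom-up update that case 4.2 relies on.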

Regards,
Yongmei.

________________________________________
From: Tianchen Ding <dtcccc@linux.alibaba.com>
Sent: November 14, 2024, 10:45
To: 解 咏梅 <xieym_ict@hotmail.com>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Subject: Re: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
 
On 2024/11/13 19:50, 解 咏梅 wrote:
> Similar problem as commits d2929762 and 8dafa9d0, but this time the heap integrity is corrupted by the min_slice attr.
> Commit eab03c23c fixed it by explicitly calling __dequeue_entity and __enqueue_entity in reweight_entity.
>
> But it's a rare case; it only happens when adjusting a task's slice via the scheduler attributes.
>

It's not rare. Since it's in the common enqueue/dequeue path, wakeup/sleep may also
trigger this issue.

>
> Regards,
> Yongmei.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-11-14  6:06           ` Re: " 解 咏梅
@ 2024-11-14  6:36             ` Tianchen Ding
       [not found]               ` <ME0P300MB041447EBB0A17918745695898E5B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
  0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-14  6:36 UTC (permalink / raw)
  To: 解 咏梅
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel@vger.kernel.org

On 2024/11/14 14:06, 解 咏梅 wrote:
> Let's analyze it case by case :P
> 
> Say cgroup A has 3 tasks: task A, task B, task C.
> 
> 1) Assign task A's slice to 0.1 ms; task B and task C both have the default slice (0.75 ms).
> 
> 2) task A is picked by __schedule() as the next task. Because task A is still on the rq,
> the cfs_rq hierarchy doesn't have to change any cfs_rq's min_slice; it is already reported up to the root cgroup.
> 
> 3) task A is preempted by another task but is still runnable; it will be requeued on cgroup A's cfs_rq. Similar to case 2.
> 
> 4) task A blocks; its se will be retained on cgroup A's cfs_rq until it reaches the 0-lag state.
> 4.1 Before 0-lag, I guess it's similar to case 2.
>       The logic is based on the cfs_rq's avg_runtime; it is supposed that task A won't be picked as the next task before it reaches the 0-lag state.
>       If my understanding is wrong, please correct me. Thanks.
> 4.2 After it reaches the 0-lag state, if it's picked by pick_task_fair(), it will ultimately be removed from cgroup A's cfs_rq:
>       pick_next_entity() -> dequeue_entities(DEQUEUE_SLEEP | DEQUEUE_DELAYED) -> __dequeue_entity() (task A)
>      So cgroup A's cfs_rq min_slice will be recalculated, and the cfs_rq hierarchy will update each level's min_slice bottom up.
> 4.3 After it reaches the 0-lag state and is woken up: the current __schedule() splits the block/sleep path from the migration path, and only the migration path calls deactivate, so p->on_rq is still 1 and ttwu_runnable() just calls requeue_delayed_entity() for it. Similar to case 2.
> 
> I think only case 1 has such a problem.
> 
> Regards,
> Yongmei.
> 

I think you misunderstood the case. We're not talking about the DELAY_DEQUEUE 
feature; we're simply talking about enqueue (waking up) and dequeue (sleeping).
For convenience, let's turn DELAY_DEQUEUE off.

Consider the following cgroup hierarchy on one cpu:


                     root_cgroup
                         |
              ------------------------
              |                      |
          cgroup_A(curr)      other_cgroups...
              |
        --------------
        |            |
    any_se(curr)   cgroup_B(runnable)
                     |
                ------------
                |          |
           task_A(sleep)  task_B(runnable)

Assume task_A has a smaller slice(0.1ms) and all other tasks have default 
slice(0.75ms).

Because task_A is sleeping, it is not actually on the tree.

Now task_A is woken up. It is enqueued to cgroup_B, so the slice of cgroup_B is 
updated to 0.1 ms. This is ok.

However, since cgroup_B is already on_rq, it cannot be "enqueued" again to 
cgroup_A; the code falls through to the bottom half (the second 
for_each_sched_entity loop in enqueue_task_fair).

So the slice of cgroup_A is not updated. It is still 0.75 ms.
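[Editor's sketch] The failure mode can be sketched as a toy model of the first for_each_sched_entity loop in enqueue_task_fair(). The `struct entity` layout, the `rq_min_slice` field, and the slice values are hypothetical illustrations, not the kernel's real data structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sched_entity: parent is the group se one level up (NULL at the
 * root), and rq_min_slice stands in for the min_slice of the cfs_rq
 * this entity is queued on (hypothetical simplification). */
struct entity {
	struct entity *parent;
	bool on_rq;
	unsigned long rq_min_slice;
};

/* Mimics the first for_each_sched_entity loop in enqueue_task_fair():
 * each level enqueues se on its parent's cfs_rq, folding "slice" into
 * the parent's min_slice, and stops as soon as an se is already on_rq,
 * so every min_slice above that point goes stale. */
static void enqueue_task(struct entity *se, unsigned long slice)
{
	for (; se->parent; se = se->parent) {
		if (se->on_rq)
			return;		/* stops here: parents stay stale */
		se->on_rq = true;
		if (slice < se->parent->rq_min_slice)
			se->parent->rq_min_slice = slice;
		slice = se->parent->rq_min_slice;
	}
}
```

Running this with the hierarchy above (task_A sleeping under an already-on_rq cgroup_B) updates cgroup_B's min_slice to 0.1 ms but leaves cgroup_A's at 0.75 ms, which is exactly the staleness the patch forces a propagation for.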

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
       [not found]               ` <ME0P300MB041447EBB0A17918745695898E5B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
@ 2024-11-14  7:47                 ` Tianchen Ding
  2024-11-14 13:44                   ` Re: " 解 咏梅
  0 siblings, 1 reply; 21+ messages in thread
From: Tianchen Ding @ 2024-11-14  7:47 UTC (permalink / raw)
  To: 解 咏梅
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel@vger.kernel.org

On 2024/11/14 15:33, 解 咏梅 wrote:
> Delayed dequeue is necessary for EEVDF to maintain lag. Paw’s relative vruntime is 
> not necessary anymore in the migration path.
> 
> 
> it is not a tuning option.
> 
> regards,
> Yongmei

I don't know why you focus so much on DELAY_DEQUEUE; it is not related to the 
case I explained.

The case is about the cgroup hierarchy, and the task_A in my case is already 
blocked and *out of the rq*.

I'm talking about its enqueue path when it is woken up.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
  2024-11-14  7:47                 ` Tianchen Ding
@ 2024-11-14 13:44                   ` 解 咏梅
  0 siblings, 0 replies; 21+ messages in thread
From: 解 咏梅 @ 2024-11-14 13:44 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel@vger.kernel.org

Sorry, I cannot switch to plain-text mode on my cellphone.

Since commit 5e963f2bd4, there's no CFS anymore, as I understand it.

There's too much change. There's no longer a preempt check on every scheduler tick; Peter introduced an hrtimer to mark when the last slice is used up. The preempt check only happens in the wakeup path.

This is the marker for disabling preemption:
curr->vlag == curr->deadline means no preemption.
Since your first patch for "Force propagating min_slice of cfs_rq", I have read the source and found there are too many new things in EEVDF.

Regards,
Yongmei.

________________________________________
From: Tianchen Ding <dtcccc@linux.alibaba.com>
Sent: November 14, 2024, 15:47
To: 解 咏梅 <xieym_ict@hotmail.com>
Cc: Ingo Molnar <mingo@redhat.com>; Peter Zijlstra <peterz@infradead.org>; Juri Lelli <juri.lelli@redhat.com>; Vincent Guittot <vincent.guittot@linaro.org>; Dietmar Eggemann <dietmar.eggemann@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Ben Segall <bsegall@google.com>; Mel Gorman <mgorman@suse.de>; Valentin Schneider <vschneid@redhat.com>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice
 
On 2024/11/14 15:33, 解 咏梅 wrote:
> Delayed dequeue is necessary for EEVDF to maintain lag. Paw’s relative vruntime is
> not necessary anymore in the migration path.
>
>
> it is not a tuning option.
>
> regards,
> Yongmei

I don't know why you focus so much on DELAY_DEQUEUE; it is not related to the
case I explained.

The case is about the cgroup hierarchy, and the task_A in my case is already
blocked and *out of the rq*.

I'm talking about its enqueue path when it is woken up.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-11-14 13:44 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-28  6:33 [RFC PATCH 0/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
2024-10-28  6:33 ` [PATCH] sched/eevdf: Force propagating min_slice of cfs_rq when a task changing slice Tianchen Ding
2024-10-30  8:18   ` kernel test robot
2024-10-30  9:11     ` Tianchen Ding
2024-10-31  9:48   ` [PATCH v2] " Tianchen Ding
2024-11-12  3:25     ` Tianchen Ding
2024-11-13 11:50       ` Re: " 解 咏梅
2024-11-14  2:45         ` Tianchen Ding
2024-11-14  6:06           ` Re: " 解 咏梅
2024-11-14  6:36             ` Tianchen Ding
     [not found]               ` <ME0P300MB041447EBB0A17918745695898E5B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
2024-11-14  7:47                 ` Tianchen Ding
2024-11-14 13:44                   ` Re: " 解 咏梅
2024-10-28  6:33 ` [RFC PATCH 2/2] sched/eevdf: Introduce a cgroup interface for slice Tianchen Ding
2024-10-28 17:37   ` Tejun Heo
2024-10-29  2:07     ` Tianchen Ding
2024-10-29  6:18       ` Tejun Heo
2024-10-29  6:49         ` Tianchen Ding
2024-10-29 20:39           ` Tejun Heo
     [not found]   ` <ME0P300MB0414F63E895B2F343EE740258E4B2@ME0P300MB0414.AUSP300.PROD.OUTLOOK.COM>
2024-10-29  4:26     ` Re: " 解 咏梅
2024-10-30 11:00   ` Peter Zijlstra
2024-10-30 14:54     ` Tianchen Ding

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox