* [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks @ 2024-06-25 7:21 Chen Yu 2024-06-25 7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu 2024-06-25 7:22 ` [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu 0 siblings, 2 replies; 17+ messages in thread From: Chen Yu @ 2024-06-25 7:21 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel, Chen Yu Hi, This is a respin of the previous work to inhit task migration[1]. Many workloads suffer from high Cache-to-Cache latency on large system. Especially when different tasks access the same cache line, which brings false sharing. This patch set leverages the WF_SYNC flag, and inhits task wakeup migration for short duration ones. This can help reduce the chance to trigger the cache false sharing. The main concern of this proposal in the last discussion is that, it could break idle-cpu-first task wakeup strategy. Usually an idle CPU can beat other non-idle CPU in terms of latency. Based on this, the new proposal takes the CPU number into consideration, if it is a low-core-count system, the bar to inhit the task migration is much higher or even impossible. Vice versa. Meanwhile, only when there is no idle Core in the system, this inhiting task migration will take effect. According to the test on 4 different platforms, it has shown good improvement for Client/Server workloads, like netperf, tbench. And not saw any performance regression on my 8 CPUs laptop. Please refer to PATCH 2/2 for detail. Comments and tests would be appreciated. [1] https://lore.kernel.org/lkml/cover.1682661027.git.yu.c.chen@intel.com/ Chen Yu (2): sched/fair: Record the average duration of a task sched/fair: Enhance sync wakeup for short duration tasks include/linux/sched.h | 3 ++ kernel/sched/core.c | 2 ++ kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++--- kernel/sched/features.h | 1 + 4 files changed, 75 insertions(+), 5 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] sched/fair: Record the average duration of a task 2024-06-25 7:21 [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu @ 2024-06-25 7:22 ` Chen Yu 2024-06-26 4:21 ` Mike Galbraith 2024-06-25 7:22 ` [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu 1 sibling, 1 reply; 17+ messages in thread From: Chen Yu @ 2024-06-25 7:22 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel, Chen Yu Record the average duration of a task, as there is a requirement to leverage this information for better task placement. At first thought the (p->se.sum_exec_runtime / p->nvcsw) can be used to measure the task duration. However, the history long past was factored too heavily in such a formula. Ideally, the old activity should decay and not affect the current status too much. Although something based on PELT can be used, se.util_avg might not be appropriate to describe the task duration: Task p1 and task p2 are doing frequent ping-pong scheduling on one CPU, both p1 and p2 have a short duration, but the util_avg of each task can be up to 50%, which is inconsistent with the short task duration. Here's an example to show what the average duration is. Suppose on CPUx, task p1 and p2 run alternatively: --------------------> time | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks | ^ ^ ^ |_____________| |_____________________________________| ^ | p1 dequeued p1's duration is (1 + 0.5)ms. Because if p2 does not preempt p1, p1 can run 1.5ms. This reflects the nature of a task: how long it wishes to run at most. Suggested-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> --- include/linux/sched.h | 3 +++ kernel/sched/core.c | 2 ++ kernel/sched/fair.c | 12 ++++++++++++ 3 files changed, 17 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 90691d99027e..78747d3954fd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1339,6 +1339,9 @@ struct task_struct { struct callback_head cid_work; #endif + u64 prev_sleep_sum_runtime; + u64 duration_avg; + struct tlbflush_unmap_batch tlb_ubc; /* Cache last used pipe for splice(): */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0935f9d4bb7b..7399c4143528 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->migration_pending = NULL; #endif init_sched_mm_cid(p); + p->prev_sleep_sum_runtime = 0; + p->duration_avg = 0; } DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 41b58387023d..445877069fbf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6833,6 +6833,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) static void set_next_buddy(struct sched_entity *se); +static inline void dur_avg_update(struct task_struct *p) +{ + u64 dur; + + dur = p->se.sum_exec_runtime - p->prev_sleep_sum_runtime; + p->prev_sleep_sum_runtime = p->se.sum_exec_runtime; + update_avg(&p->duration_avg, dur); +} + /* * The dequeue_task method is called before nr_running is * decreased. We remove the task from the rbtree and @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) dequeue_throttle: util_est_update(&rq->cfs, p, task_sleep); + if (task_sleep) + dur_avg_update(p); + hrtick_update(rq); } -- 2.25.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-06-25 7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu @ 2024-06-26 4:21 ` Mike Galbraith 2024-06-30 13:09 ` Chen Yu 0 siblings, 1 reply; 17+ messages in thread From: Mike Galbraith @ 2024-06-26 4:21 UTC (permalink / raw) To: Chen Yu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot Cc: Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 0935f9d4bb7b..7399c4143528 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > p->migration_pending = NULL; > #endif > init_sched_mm_cid(p); > + p->prev_sleep_sum_runtime = 0; > + p->duration_avg = 0; > } Beginning life biased toward stacking? > DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 41b58387023d..445877069fbf 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > dequeue_throttle: > util_est_update(&rq->cfs, p, task_sleep); > + if (task_sleep) > + dur_avg_update(p); > + > hrtick_update(rq); > } > That qualifier looks a bit dangerous. Microbench components tend to have only one behavior, but the real world goes through all kinds of nutty gyrations, intentional and otherwise. The heuristics in the next patch seem to exhibit a healthy level of paranoia, but these bits could perhaps use a tad more. Bad experiences springs to mind when I stare at that - sleepers going hog, hogs meet sleeping lock contention, preemption, sync hint not meaning much... -Mike ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-06-26 4:21 ` Mike Galbraith @ 2024-06-30 13:09 ` Chen Yu 2024-07-01 6:57 ` Mike Galbraith 2024-08-05 4:38 ` Madadi Vineeth Reddy 0 siblings, 2 replies; 17+ messages in thread From: Chen Yu @ 2024-06-30 13:09 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel Hi Mike, Thanks for your time and giving the insights. On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote: > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote: > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 0935f9d4bb7b..7399c4143528 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > > p->migration_pending = NULL; > > #endif > > init_sched_mm_cid(p); > > + p->prev_sleep_sum_runtime = 0; > > + p->duration_avg = 0; > > } > > Beginning life biased toward stacking? > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid task stacking in the beginning. > > DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 41b58387023d..445877069fbf 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > > > dequeue_throttle: > > util_est_update(&rq->cfs, p, task_sleep); > > + if (task_sleep) > > + dur_avg_update(p); > > + > > hrtick_update(rq); > > } > > > > That qualifier looks a bit dangerous. Microbench components tend to > have only one behavior, but the real world goes through all kinds of > nutty gyrations, intentional and otherwise. > Understand. Unfortunately I don't have access to production environment so I have to rely on microbenchmarks and a OLTP to check the result. I get feedback from Abel that the former version of this patch brought benefit to some short tasks like Redis in the production environment[1]. https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/ I can launch a combination of microbenchmarks in parallel to check the impact. > The heuristics in the next patch seem to exhibit a healthy level of > paranoia, but these bits could perhaps use a tad more. Bad experiences > springs to mind when I stare at that - sleepers going hog, hogs meet > sleeping lock contention, preemption, sync hint not meaning much... > I see. If I understand correctly, the scenario mentioned above could bring a false positive of 'short task', which causes task stacking. If the sleeper task: 1. is preempted frequently. This should not be a problem, because the task duration is unlikely to be shorten by preemption, and the short_task() is unlikely to return true. Because the task duration will span the time from the task is scheduled in, to finally scheduled out due to sleeping(dequeue_task_fair()), but not by preemption. This time duration should be long enough to not trigger the 'short task' case. But since there is a delay queue mechanism under development, calculating the duration in dequeue_task_fair() in current patch might not be a proper method anymore. 2. meets sleeping lock contention. This would be a false positive 'short task', which bring unexpected task stacking. So consider 1 and 2, I'm thinking of moving the calculating of task duration from dequeue_task_fair() to wait_woken(). The reason to update the task's duration in wait_woken() rather than dequeue_task_fair() is that, the former is aware of the scenario that the task is waiting for the real resource, rather than blocking on a random sleeping lock. And the wait_woken() is widely used by the driver to indicate that task is waiting for the resource. For example, the netperf calltrace: schedule_timeout+222 wait_woken+84 sk_wait_data+378 tcp_recvmsg_locked+507 tcp_recvmsg+115 inet_recvmsg+90 sock_recvmsg+150 In the future, if there is requirement other scenario could also invoke the newly introduced update_cur_duration() when needed. For example, the pipe_read() could use it if the task is going to sleep due to the empty pipe buffer. I change the code as below, may I have your suggestion on this? thanks, Chenyu diff --git a/include/linux/sched.h b/include/linux/sched.h index 90691d99027e..78747d3954fd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1339,6 +1339,9 @@ struct task_struct { struct callback_head cid_work; #endif + u64 prev_sleep_sum_runtime; + u64 duration_avg; + struct tlbflush_unmap_batch tlb_ubc; /* Cache last used pipe for splice(): */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0935f9d4bb7b..7399c4143528 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->migration_pending = NULL; #endif init_sched_mm_cid(p); + p->prev_sleep_sum_runtime = 0; + p->duration_avg = 0; } DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 41b58387023d..bbeba36d0145 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -744,6 +744,19 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se) return vruntime_eligible(cfs_rq, se->vruntime); } +void update_curr_duration(void) +{ + struct sched_entity *curr = ¤t->se; + unsigned long flags; + u64 dur; + + local_irq_save(flags); + dur = curr->sum_exec_runtime - current->prev_sleep_sum_runtime; + current->prev_sleep_sum_runtime = curr->sum_exec_runtime; + update_avg(¤t->duration_avg, dur); + local_irq_restore(flags); +} + static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime) { u64 min_vruntime = cfs_rq->min_vruntime; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 62fd8bc6fd08..7beb604ca76b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3574,6 +3574,7 @@ static inline void init_sched_mm_cid(struct task_struct *t) { } extern u64 avg_vruntime(struct cfs_rq *cfs_rq); extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se); +extern void update_curr_duration(void); #ifdef CONFIG_RT_MUTEXES diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index 51e38f5f4701..a0004cc7454f 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -419,8 +419,10 @@ long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout) * or woken_wake_function() sees our store to current->state. */ set_current_state(mode); /* A */ - if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park()) + if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park()) { + update_curr_duration(); timeout = schedule_timeout(timeout); + } __set_current_state(TASK_RUNNING); /* -- 2.25.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-06-30 13:09 ` Chen Yu @ 2024-07-01 6:57 ` Mike Galbraith 2024-07-01 14:57 ` Chen Yu 2024-08-05 4:38 ` Madadi Vineeth Reddy 1 sibling, 1 reply; 17+ messages in thread From: Mike Galbraith @ 2024-07-01 6:57 UTC (permalink / raw) To: Chen Yu Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote: > Hi Mike, > > Thanks for your time and giving the insights. > > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote: > > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote: > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > index 0935f9d4bb7b..7399c4143528 100644 > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > > > p->migration_pending = NULL; > > > #endif > > > init_sched_mm_cid(p); > > > + p->prev_sleep_sum_runtime = 0; > > > + p->duration_avg = 0; > > > } > > > > Beginning life biased toward stacking? > > > > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid > task stacking in the beginning. Or something, definitely. > > > DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 41b58387023d..445877069fbf 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > > > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > > > > > dequeue_throttle: > > > util_est_update(&rq->cfs, p, task_sleep); > > > + if (task_sleep) > > > + dur_avg_update(p); > > > + > > > hrtick_update(rq); > > > } > > > > > > > That qualifier looks a bit dangerous. Microbench components tend to > > have only one behavior, but the real world goes through all kinds of > > nutty gyrations, intentional and otherwise. > > > > Understand. Unfortunately I don't have access to production environment > so I have to rely on microbenchmarks and a OLTP to check the result. I > get feedback from Abel that the former version of this patch brought > benefit to some short tasks like Redis in the production environment[1]. > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/ Here's hoping you get a lot more. > I can launch a combination of microbenchmarks in parallel to check the impact. > > > The heuristics in the next patch seem to exhibit a healthy level of > > paranoia, but these bits could perhaps use a tad more. Bad experiences > > springs to mind when I stare at that - sleepers going hog, hogs meet > > sleeping lock contention, preemption, sync hint not meaning much... > > > > I see. If I understand correctly, the scenario mentioned above > could bring a false positive of 'short task', which causes > task stacking. Yeah. > If the sleeper task: > 1. is preempted frequently. This should not be a problem, because > the task duration is unlikely to be shorten by preemption, and > the short_task() is unlikely to return true. Ah, but it not being a problem would beg the question why bother? There are consequences either way, the dark side is just scarier. > 2. meets sleeping lock contention. This would be a false positive 'short task', > which bring unexpected task stacking. Yeah. > So consider 1 and 2, I'm thinking of moving the calculating of task duration from > dequeue_task_fair() to wait_woken(). The reason to update the task's duration in > wait_woken() rather than dequeue_task_fair() is that, the former is aware of the > scenario that the task is waiting for the real resource, rather than blocking > on a random sleeping lock. And the wait_woken() is widely used by the driver to > indicate that task is waiting for the resource. For example, the netperf calltrace: > > schedule_timeout+222 > wait_woken+84 > sk_wait_data+378 > tcp_recvmsg_locked+507 > tcp_recvmsg+115 > inet_recvmsg+90 > sock_recvmsg+150 > > In the future, if there is requirement other scenario could also invoke the newly > introduced update_cur_duration() when needed. For example, the pipe_read() could > use it if the task is going to sleep due to the empty pipe buffer. I change the > code as below, may I have your suggestion on this? I don't have any suggestions that will help plug the holes, heck, I squabbled in this arena quite a bit some years ago, and did not win. Frankly I don't think the scheduler has the information necessary to do so, it'll always be a case of this will likely do less harm than good, but will certainly leave at least an occasional mark. Just take a look at the high speed ping-pong thing (not a benchmark, that's a box full of tape measures, rather silly, but..). TCP_RR IS 1:1, has as short a duration as network stack plus scheduler can possibly make it, and is nearly synchronous to boot, two halves of a whole, the ONLY thing you can certainly safely stack.. but a shared L2 box still takes a wee hit when you do so. 1:N or M:N tasks can approach its wakeup frequency range, and there's nothing you can do about the very same cache to cache latency you're trying to duck, it just is what it is, and is considered perfectly fine as it is. That's a bit of a red flag, but worse is the lack of knowledge wrt what tasks are actually up to at any given time. We rashly presume that tasks waking one another implies a 1:1 relationship, we routinely call them buddies and generally get away with it.. but during any overlap they can be doing anything including N way data share, and regardless of what that is and section size, needless stacking flushes concurrency, injecting service latency in its place, cost unknown. Intentional stacking can be jokingly equated to injecting just a wee bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-) -Mike ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-01 6:57 ` Mike Galbraith @ 2024-07-01 14:57 ` Chen Yu 2024-07-02 5:08 ` Mike Galbraith 2024-07-03 8:34 ` Raghavendra K T 0 siblings, 2 replies; 17+ messages in thread From: Chen Yu @ 2024-07-01 14:57 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel, Raghavendra K T Hi Mike, On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote: > On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote: > > Hi Mike, > > > > Thanks for your time and giving the insights. > > > > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote: > > > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote: > > > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > > index 0935f9d4bb7b..7399c4143528 100644 > > > > --- a/kernel/sched/core.c > > > > +++ b/kernel/sched/core.c > > > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > > > > p->migration_pending = NULL; > > > > #endif > > > > init_sched_mm_cid(p); > > > > + p->prev_sleep_sum_runtime = 0; > > > > + p->duration_avg = 0; > > > > } > > > > > > Beginning life biased toward stacking? > > > > > > > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid > > task stacking in the beginning. > > Or something, definitely. > > > > > DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index 41b58387023d..445877069fbf 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > > > > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > > > > > > > dequeue_throttle: > > > > util_est_update(&rq->cfs, p, task_sleep); > > > > + if (task_sleep) > > > > + dur_avg_update(p); > > > > + > > > > hrtick_update(rq); > > > > } > > > > > > > > > > That qualifier looks a bit dangerous. Microbench components tend to > > > have only one behavior, but the real world goes through all kinds of > > > nutty gyrations, intentional and otherwise. > > > > > > > Understand. Unfortunately I don't have access to production environment > > so I have to rely on microbenchmarks and a OLTP to check the result. I > > get feedback from Abel that the former version of this patch brought > > benefit to some short tasks like Redis in the production environment[1]. > > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/ > > Here's hoping you get a lot more. > We recently received a simulated benchmark for Meta. I'll conduct some tests to see the results. > > So consider 1 and 2, I'm thinking of moving the calculating of task duration from > > dequeue_task_fair() to wait_woken(). The reason to update the task's duration in > > wait_woken() rather than dequeue_task_fair() is that, the former is aware of the > > scenario that the task is waiting for the real resource, rather than blocking > > on a random sleeping lock. And the wait_woken() is widely used by the driver to > > indicate that task is waiting for the resource. For example, the netperf calltrace: > > > > schedule_timeout+222 > > wait_woken+84 > > sk_wait_data+378 > > tcp_recvmsg_locked+507 > > tcp_recvmsg+115 > > inet_recvmsg+90 > > sock_recvmsg+150 > > > > In the future, if there is requirement other scenario could also invoke the newly > > introduced update_cur_duration() when needed. For example, the pipe_read() could > > use it if the task is going to sleep due to the empty pipe buffer. I change the > > code as below, may I have your suggestion on this? > > I don't have any suggestions that will help plug the holes, heck, I > squabbled in this arena quite a bit some years ago, and did not win. > Frankly I don't think the scheduler has the information necessary to do > so, it'll always be a case of this will likely do less harm than good, > but will certainly leave at least an occasional mark. > I agree. Unlike bug fixing, this kind of change usually involves trade-offs. The attempt is to not do harm for most cases, and bring benefit for some cases. Regarding the necessary information, non-scheduler components might have better knowledge than the scheduler: 1. If the hint comes from user space, it could be something like sched_attr::sync_wakeup. It indicates that the task prefers sync-wakeup and is tolerant of task stacking. However, I'm unsure how this could be accepted by the community if we want to change the user space interface. What is the criteria to accept such change? Would the requirement from different production environment be considered as the endorsement? 2. If the hint comes from other components in the kernel, it could be the driver or others. seccomp in the current kernel code, enforces waking the wakee on the current CPU via the WF_CURRENT_CPU flag. However, WF_CURRENT_CPU seems a bit aggressive for ordinary tasks. So, wait_woken() could be used when needed (by the network component, for example) to indicate a possible cache/resource sensitivity of the wakee, and together with the task duration, to decide if the wakee can be placed on the current CPU. > Just take a look at the high speed ping-pong thing (not a benchmark, > that's a box full of tape measures, rather silly, but..). TCP_RR IS > 1:1, has as short a duration as network stack plus scheduler can > possibly make it, and is nearly synchronous to boot, two halves of a > whole, the ONLY thing you can certainly safely stack.. I agree, this is a limited scenario. > but a shared L2 box still takes a wee hit when you do so. According to a test conducted last month on a system with 500+ CPUs where 4 CPUs share the same L2 cache, around 20% improvement was noticed (though not as much as on the non-L2 shared platform). I haven't delved into the details yet, but my understanding is that L1 cache-to-cache latency within the L2 domain might also matter on large servers (which I need to investigate further). > 1:N or M:N > tasks can approach its wakeup frequency range, and there's nothing you can do > about the very same cache to cache latency you're trying to duck, it > just is what it is, and is considered perfectly fine as it is. That's > a bit of a red flag, but worse is the lack of knowledge wrt what tasks > are actually up to at any given time. We rashly presume that tasks > waking one another implies a 1:1 relationship, we routinely call them > buddies and generally get away with it.. but during any overlap they > can be doing anything including N way data share, and regardless of > what that is and section size, needless stacking flushes concurrency, > injecting service latency in its place, cost unknown. > I believe this is a generic issue that the current scheduler faces, where it attempts to predict the task's behavior based on its runtime. For instance, task_hot() checks the task runtime to predict whether the task is cache-hot, regardless of what the task does during its time slice. This is also the case with WF_SYNC, which provides the scheduler with a hint to wake up on the current CPU to potentially benefit from cache locality. A thought occurred to me that one possible method to determine if the waker and wakee share data could be to leverage the NUMA balance's numa_group data structure. As numa balance periodically scans the task's VMA space and groups tasks accessing the same physical page into one numa_group, we can infer that if the waker and wakee are within the same numa_group, they are likely to share data, and it might be appropriate to place the wakee on top of the waker. CC Raghavendra here in case he has any insights. > Intentional stacking can be jokingly equated to injecting just a wee > bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-) > I fully understand your concern, and this analogy is very interesting. We will conduct additional tests and share the data/analysis later. thanks, Chenyu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-01 14:57 ` Chen Yu @ 2024-07-02 5:08 ` Mike Galbraith 2024-07-03 8:34 ` Raghavendra K T 1 sibling, 0 replies; 17+ messages in thread From: Mike Galbraith @ 2024-07-02 5:08 UTC (permalink / raw) To: Chen Yu Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel, Raghavendra K T On Mon, 2024-07-01 at 22:57 +0800, Chen Yu wrote: > > Just take a look at the high speed ping-pong thing (not a benchmark, > > that's a box full of tape measures, rather silly, but..). TCP_RR IS > > 1:1, has as short a duration as network stack plus scheduler can > > possibly make it, and is nearly synchronous to boot, two halves of a > > whole, the ONLY thing you can certainly safely stack.. > > I agree, this is a limited scenario. > > > but a shared L2 box still takes a wee hit when you do so. > > According to a test conducted last month on a system with 500+ CPUs where 4 CPUs > share the same L2 cache, around 20% improvement was noticed (though not as much > as on the non-L2 shared platform). This dinky box doesn't have 500 cores, but it's.. aw, adorable :) rpi4:/root # ONLY=TCP_RR netperf.sh TCP_RR-1 unbound Avg: 31754 Sum: 31754 TCP_RR-1 stacked Avg: 26625 Sum: 26625 TCP_RR-1 cross-core Avg: 32325 Sum: 32325 rpi4:/root # tbench.sh 1 30 2>&1|grep Throughput Throughput 139.024 MB/sec 1 clients 1 procs max_latency=1.116 ms rpi4:/root # taskset -c 3 tbench.sh 1 30 2>&1|grep Throughput Throughput 116.765 MB/sec 1 clients 1 procs max_latency=0.340 ms rpi4:/root # This little box running its stock 6.6.33 distro kernel pulls out a cross-core win for both maximally synchronous TCP_RR and the a bit lesser so but still pretty close tbench. The numbers mean little though, one propagation speed is lovely, but were there more, I'd be as stuck with them as I am with rpi4's one-speed (all ahead slow) gearbox. -Mike ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-01 14:57 ` Chen Yu 2024-07-02 5:08 ` Mike Galbraith @ 2024-07-03 8:34 ` Raghavendra K T 2024-07-03 11:57 ` Mike Galbraith 2024-07-03 13:12 ` Chen Yu 1 sibling, 2 replies; 17+ messages in thread From: Raghavendra K T @ 2024-07-03 8:34 UTC (permalink / raw) To: Chen Yu, Mike Galbraith Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On 7/1/2024 8:27 PM, Chen Yu wrote: > Hi Mike, > > On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote: >> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote: >>> Hi Mike, >>> >>> Thanks for your time and giving the insights. > > According to a test conducted last month on a system with 500+ CPUs where 4 CPUs > share the same L2 cache, around 20% improvement was noticed (though not as much > as on the non-L2 shared platform). I haven't delved into the details yet, but my > understanding is that L1 cache-to-cache latency within the L2 domain might also > matter on large servers (which I need to investigate further). > >> 1:N or M:N >> tasks can approach its wakeup frequency range, and there's nothing you can do >> about the very same cache to cache latency you're trying to duck, it >> just is what it is, and is considered perfectly fine as it is. That's >> a bit of a red flag, but worse is the lack of knowledge wrt what tasks >> are actually up to at any given time. We rashly presume that tasks >> waking one another implies a 1:1 relationship, we routinely call them >> buddies and generally get away with it.. but during any overlap they >> can be doing anything including N way data share, and regardless of >> what that is and section size, needless stacking flushes concurrency, >> injecting service latency in its place, cost unknown. >> > > I believe this is a generic issue that the current scheduler faces, where > it attempts to predict the task's behavior based on its runtime. For instance, > task_hot() checks the task runtime to predict whether the task is cache-hot, > regardless of what the task does during its time slice. This is also the case > with WF_SYNC, which provides the scheduler with a hint to wake up on the current > CPU to potentially benefit from cache locality. > > A thought occurred to me that one possible method to determine if the waker > and wakee share data could be to leverage the NUMA balance's numa_group data structure. > As numa balance periodically scans the task's VMA space and groups tasks accessing > the same physical page into one numa_group, we can infer that if the waker and wakee > are within the same numa_group, they are likely to share data, and it might be > appropriate to place the wakee on top of the waker. > > CC Raghavendra here in case he has any insights. > Agree with your thought here, So I imagine two possible things to explore here. 1) Use task1, task2 numa_group and check if they belong to same numa_group, also check if there is a possibility of M:N relationship by checking if t1/t2->numa_group->nr_tasks > 1 etc 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2 (threads) possibly interested in same VMA. Latter one looks to be practically difficult because we don't want to sweep across VMAs perhaps.. > thanks, > Chenyu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-03 8:34 ` Raghavendra K T @ 2024-07-03 11:57 ` Mike Galbraith 2024-07-03 13:23 ` Chen Yu 2024-07-03 13:40 ` Raghavendra K T 2024-07-03 13:12 ` Chen Yu 1 sibling, 2 replies; 17+ messages in thread From: Mike Galbraith @ 2024-07-03 11:57 UTC (permalink / raw) To: Raghavendra K T, Chen Yu Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote: > > > On 7/1/2024 8:27 PM, Chen Yu wrote: > > > > A thought occurred to me that one possible method to determine if the waker > > and wakee share data could be to leverage the NUMA balance's numa_group data structure. > > As numa balance periodically scans the task's VMA space and groups tasks accessing > > the same physical page into one numa_group, we can infer that if the waker and wakee > > are within the same numa_group, they are likely to share data, and it might be > > appropriate to place the wakee on top of the waker. > > > > CC Raghavendra here in case he has any insights. > > > > Agree with your thought here, > > So I imagine two possible things to explore here. > > 1) Use task1, task2 numa_group and check if they belong to same > numa_group, also check if there is a possibility of M:N relationship > by checking if t1/t2->numa_group->nr_tasks > 1 etc > > 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2 > (threads) possibly interested in same VMA. > Latter one looks to be practically difficult because we don't want to > sweep across VMAs perhaps.. Oooh dear.. as soon as you mention threads, the question of who's wheelhouse is this in springs to mind, ie should the kernel be overriding userspace by targeting bits of threaded programs for forced serialization? Bah, think I'll just bugger off and let you guys have a go at making this stacking business do less harm than good. -Mike ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-03 11:57 ` Mike Galbraith @ 2024-07-03 13:23 ` Chen Yu 2024-07-03 13:45 ` Mike Galbraith 2024-07-03 13:40 ` Raghavendra K T 1 sibling, 1 reply; 17+ messages in thread From: Chen Yu @ 2024-07-03 13:23 UTC (permalink / raw) To: Mike Galbraith Cc: Raghavendra K T, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel Hi Mike, On 2024-07-03 at 13:57:19 +0200, Mike Galbraith wrote: > On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote: > > > > > > On 7/1/2024 8:27 PM, Chen Yu wrote: > > > > > > A thought occurred to me that one possible method to determine if the waker > > > and wakee share data could be to leverage the NUMA balance's numa_group data structure. > > > As numa balance periodically scans the task's VMA space and groups tasks accessing > > > the same physical page into one numa_group, we can infer that if the waker and wakee > > > are within the same numa_group, they are likely to share data, and it might be > > > appropriate to place the wakee on top of the waker. > > > > > > CC Raghavendra here in case he has any insights. > > > > > > > Agree with your thought here, > > > > So I imagine two possible things to explore here. > > > > 1) Use task1, task2 numa_group and check if they belong to same > > numa_group, also check if there is a possibility of M:N relationship > > by checking if t1/t2->numa_group->nr_tasks > 1 etc > > > > 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2 > > (threads) possibly interested in same VMA. > > Latter one looks to be practically difficult because we don't want to > > sweep across VMAs perhaps.. > > Oooh dear.. as soon as you mention threads, the question of who's > wheelhouse is this in springs to mind, ie should the kernel be > overriding userspace by targeting bits of threaded programs for forced > serialization? > Do you mean the threads within the same process should run in parallel as much as possible, regardless of sharing the same data, because the thread is designed to do so? If so then we should probably skip the 1:1 task stacking if the waker and the wakee are in the same process. thanks, Chenyu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-03 13:23 ` Chen Yu @ 2024-07-03 13:45 ` Mike Galbraith 0 siblings, 0 replies; 17+ messages in thread From: Mike Galbraith @ 2024-07-03 13:45 UTC (permalink / raw) To: Chen Yu Cc: Raghavendra K T, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On Wed, 2024-07-03 at 21:23 +0800, Chen Yu wrote: > On 2024-07-03 at 13:57:19 +0200, Mike Galbraith wrote: > > > > Oooh dear.. as soon as you mention threads, the question of who's > > wheelhouse is this in springs to mind, ie should the kernel be > > overriding userspace by targeting bits of threaded programs for forced > > serialization? > > Do you mean the threads within the same process should run in parallel > as much as possible, regardless of sharing the same data, because the thread > is designed to do so? If so then we should probably skip the 1:1 task stacking > if the waker and the wakee are in the same process. That's seems like the obviously correct answer.. except for that leaving virtually the entire issue you're squabbling with lying on the floor. Grr, how annoying. -Mike ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-03 11:57 ` Mike Galbraith 2024-07-03 13:23 ` Chen Yu @ 2024-07-03 13:40 ` Raghavendra K T 1 sibling, 0 replies; 17+ messages in thread From: Raghavendra K T @ 2024-07-03 13:40 UTC (permalink / raw) To: Mike Galbraith, Chen Yu Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On 7/3/2024 5:27 PM, Mike Galbraith wrote: > On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote: >> >> >> On 7/1/2024 8:27 PM, Chen Yu wrote: >>> >>> A thought occurred to me that one possible method to determine if the waker >>> and wakee share data could be to leverage the NUMA balance's numa_group data structure. >>> As numa balance periodically scans the task's VMA space and groups tasks accessing >>> the same physical page into one numa_group, we can infer that if the waker and wakee >>> are within the same numa_group, they are likely to share data, and it might be >>> appropriate to place the wakee on top of the waker. >>> >>> CC Raghavendra here in case he has any insights. >>> >> >> Agree with your thought here, >> >> So I imagine two possible things to explore here. >> >> 1) Use task1, task2 numa_group and check if they belong to same >> numa_group, also check if there is a possibility of M:N relationship >> by checking if t1/t2->numa_group->nr_tasks > 1 etc >> >> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2 >> (threads) possibly interested in same VMA. >> Latter one looks to be practically difficult because we don't want to >> sweep across VMAs perhaps.. > > Oooh dear.. as soon as you mention threads, the question of who's > wheelhouse is this in springs to mind, ie should the kernel be > overriding userspace by targeting bits of threaded programs for forced > serialization? > Yes.. There is no ROI on this option (mentioned only for completeness). also we are not looking beyond process. Rather than "Practically difficult" I should have rephrased as Practically not an option. > Bah, think I'll just bugger off and let you guys have a go at making > this stacking business do less harm than good. > > -Mike ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-03 8:34 ` Raghavendra K T 2024-07-03 11:57 ` Mike Galbraith @ 2024-07-03 13:12 ` Chen Yu 2024-07-03 13:46 ` Raghavendra K T 1 sibling, 1 reply; 17+ messages in thread From: Chen Yu @ 2024-07-03 13:12 UTC (permalink / raw) To: Raghavendra K T Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote: > > > On 7/1/2024 8:27 PM, Chen Yu wrote: > > Hi Mike, > > > > On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote: > > > On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote: > > > > Hi Mike, > > > > > > > > Thanks for your time and giving the insights. > > > > According to a test conducted last month on a system with 500+ CPUs where 4 CPUs > > share the same L2 cache, around 20% improvement was noticed (though not as much > > as on the non-L2 shared platform). I haven't delved into the details yet, but my > > understanding is that L1 cache-to-cache latency within the L2 domain might also > > matter on large servers (which I need to investigate further). > > > > > 1:N or M:N > > > tasks can approach its wakeup frequency range, and there's nothing you can do > > > about the very same cache to cache latency you're trying to duck, it > > > just is what it is, and is considered perfectly fine as it is. That's > > > a bit of a red flag, but worse is the lack of knowledge wrt what tasks > > > are actually up to at any given time. We rashly presume that tasks > > > waking one another implies a 1:1 relationship, we routinely call them > > > buddies and generally get away with it.. but during any overlap they > > > can be doing anything including N way data share, and regardless of > > > what that is and section size, needless stacking flushes concurrency, > > > injecting service latency in its place, cost unknown. > > > > > > > I believe this is a generic issue that the current scheduler faces, where > > it attempts to predict the task's behavior based on its runtime. For instance, > > task_hot() checks the task runtime to predict whether the task is cache-hot, > > regardless of what the task does during its time slice. This is also the case > > with WF_SYNC, which provides the scheduler with a hint to wake up on the current > > CPU to potentially benefit from cache locality. > > > > A thought occurred to me that one possible method to determine if the waker > > and wakee share data could be to leverage the NUMA balance's numa_group data structure. > > As numa balance periodically scans the task's VMA space and groups tasks accessing > > the same physical page into one numa_group, we can infer that if the waker and wakee > > are within the same numa_group, they are likely to share data, and it might be > > appropriate to place the wakee on top of the waker. > > > > CC Raghavendra here in case he has any insights. > > > > Agree with your thought here, > Thanks for taking a look at this, Raghavendra. > So I imagine two possible things to explore here. > > 1) Use task1, task2 numa_group and check if they belong to same > numa_group, also check if there is a possibility of M:N relationship > by checking if t1/t2->numa_group->nr_tasks > 1 etc > I see, so do you mean if it is M:N rather than 1:1, we should avoid the task been woken up on current CPU to avoid task stacking? > 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2 > (threads) possibly interested in same VMA. > Latter one looks to be practically difficult because we don't want to > sweep across VMAs perhaps.. > Yeah, we might have to scan the whole VMAs to gather the PID information which might be a little costly(or maybe subset of the VMAs). And the pids_active[] is for threads rather than process, stacking the threads seem to not be good enough(per Mike's comments) Anyway, I'm preparing for some full tests to see if there is overall benefit from current version. Later let's investigate if numa balance information could help here. thanks, Chenyu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-07-03 13:12 ` Chen Yu @ 2024-07-03 13:46 ` Raghavendra K T 0 siblings, 0 replies; 17+ messages in thread From: Raghavendra K T @ 2024-07-03 13:46 UTC (permalink / raw) To: Chen Yu Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel On 7/3/2024 6:42 PM, Chen Yu wrote: > On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote: >> >> >> On 7/1/2024 8:27 PM, Chen Yu wrote: >>> Hi Mike, >>> >>> On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote: >>>> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote: >>>>> Hi Mike, >>>>> >>>>> Thanks for your time and giving the insights. >>> >>> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs >>> share the same L2 cache, around 20% improvement was noticed (though not as much >>> as on the non-L2 shared platform). I haven't delved into the details yet, but my >>> understanding is that L1 cache-to-cache latency within the L2 domain might also >>> matter on large servers (which I need to investigate further). >>> >>>> 1:N or M:N >>>> tasks can approach its wakeup frequency range, and there's nothing you can do >>>> about the very same cache to cache latency you're trying to duck, it >>>> just is what it is, and is considered perfectly fine as it is. That's >>>> a bit of a red flag, but worse is the lack of knowledge wrt what tasks >>>> are actually up to at any given time. We rashly presume that tasks >>>> waking one another implies a 1:1 relationship, we routinely call them >>>> buddies and generally get away with it.. but during any overlap they >>>> can be doing anything including N way data share, and regardless of >>>> what that is and section size, needless stacking flushes concurrency, >>>> injecting service latency in its place, cost unknown. >>>> >>> >>> I believe this is a generic issue that the current scheduler faces, where >>> it attempts to predict the task's behavior based on its runtime. For instance, >>> task_hot() checks the task runtime to predict whether the task is cache-hot, >>> regardless of what the task does during its time slice. This is also the case >>> with WF_SYNC, which provides the scheduler with a hint to wake up on the current >>> CPU to potentially benefit from cache locality. >>> >>> A thought occurred to me that one possible method to determine if the waker >>> and wakee share data could be to leverage the NUMA balance's numa_group data structure. >>> As numa balance periodically scans the task's VMA space and groups tasks accessing >>> the same physical page into one numa_group, we can infer that if the waker and wakee >>> are within the same numa_group, they are likely to share data, and it might be >>> appropriate to place the wakee on top of the waker. >>> >>> CC Raghavendra here in case he has any insights. >>> >> >> Agree with your thought here, >> > > Thanks for taking a look at this, Raghavendra. > >> So I imagine two possible things to explore here. >> >> 1) Use task1, task2 numa_group and check if they belong to same >> numa_group, also check if there is a possibility of M:N relationship >> by checking if t1/t2->numa_group->nr_tasks > 1 etc >> > > I see, so do you mean if it is M:N rather than 1:1, we should avoid the > task been woken up on current CPU to avoid task stacking? Not sure actually, perhaps depends on usecase. But atleast gives an idea on the relationship. Problem here is we only know that they belong to same numa_group, But we cannot deduce actual relationship (we only know it is > 1) (1:N or M:N is not known). > >> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2 >> (threads) possibly interested in same VMA. >> Latter one looks to be practically difficult because we don't want to >> sweep across VMAs perhaps.. >> > > Yeah, we might have to scan the whole VMAs to gather the PID information which > might be a little costly(or maybe subset of the VMAs). And the pids_active[] is > for threads rather than process, stacking the threads seem to not be good > enough(per Mike's comments) > > Anyway, I'm preparing for some full tests to see if there is overall benefit > from current version. Later let's investigate if numa balance information could > help here. > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-06-30 13:09 ` Chen Yu 2024-07-01 6:57 ` Mike Galbraith @ 2024-08-05 4:38 ` Madadi Vineeth Reddy 2024-08-05 7:22 ` Chen Yu 1 sibling, 1 reply; 17+ messages in thread From: Madadi Vineeth Reddy @ 2024-08-05 4:38 UTC (permalink / raw) To: Chen Yu Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel, Mike Galbraith, Madadi Vineeth Reddy Hi Chen Yu, On 30/06/24 18:39, Chen Yu wrote: > Hi Mike, > > Thanks for your time and giving the insights. > > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote: >> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote: >>> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>> index 0935f9d4bb7b..7399c4143528 100644 >>> --- a/kernel/sched/core.c >>> +++ b/kernel/sched/core.c >>> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) >>> p->migration_pending = NULL; >>> #endif >>> init_sched_mm_cid(p); >>> + p->prev_sleep_sum_runtime = 0; >>> + p->duration_avg = 0; >>> } >> >> Beginning life biased toward stacking? >> > > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid > task stacking in the beginning. > >>> DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 41b58387023d..445877069fbf 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> >>> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) >>> >>> dequeue_throttle: >>> util_est_update(&rq->cfs, p, task_sleep); >>> + if (task_sleep) >>> + dur_avg_update(p); >>> + >>> hrtick_update(rq); >>> } >>> >> >> That qualifier looks a bit dangerous. Microbench components tend to >> have only one behavior, but the real world goes through all kinds of >> nutty gyrations, intentional and otherwise. >> > > Understand. Unfortunately I don't have access to production environment > so I have to rely on microbenchmarks and a OLTP to check the result. I > get feedback from Abel that the former version of this patch brought > benefit to some short tasks like Redis in the production environment[1]. > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/ Since the discussion was about real-life workload performance, I ran the DayTrader workload with three users and three instances. The results show no performance regression, and a 1% performance gain was observed, which is within the standard deviation. Thanks and Regards Madadi Vineeth Reddy > > I can launch a combination of microbenchmarks in parallel to check the impact. > >> The heuristics in the next patch seem to exhibit a healthy level of >> paranoia, but these bits could perhaps use a tad more. Bad experiences >> springs to mind when I stare at that - sleepers going hog, hogs meet >> sleeping lock contention, preemption, sync hint not meaning much... ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] sched/fair: Record the average duration of a task 2024-08-05 4:38 ` Madadi Vineeth Reddy @ 2024-08-05 7:22 ` Chen Yu 0 siblings, 0 replies; 17+ messages in thread From: Chen Yu @ 2024-08-05 7:22 UTC (permalink / raw) To: Madadi Vineeth Reddy Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel, Mike Galbraith, Madadi Vineeth Reddy On 2024-08-05 at 10:08:35 +0530, Madadi Vineeth Reddy wrote: > Hi Chen Yu, > > On 30/06/24 18:39, Chen Yu wrote: > > Hi Mike, > > > > Thanks for your time and giving the insights. > > > > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote: > >> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote: > >>> > >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c > >>> index 0935f9d4bb7b..7399c4143528 100644 > >>> --- a/kernel/sched/core.c > >>> +++ b/kernel/sched/core.c > >>> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > >>> p->migration_pending = NULL; > >>> #endif > >>> init_sched_mm_cid(p); > >>> + p->prev_sleep_sum_runtime = 0; > >>> + p->duration_avg = 0; > >>> } > >> > >> Beginning life biased toward stacking? > >> > > > > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid > > task stacking in the beginning. > > > >>> DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); > >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >>> index 41b58387023d..445877069fbf 100644 > >>> --- a/kernel/sched/fair.c > >>> +++ b/kernel/sched/fair.c > >>> > >>> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > >>> > >>> dequeue_throttle: > >>> util_est_update(&rq->cfs, p, task_sleep); > >>> + if (task_sleep) > >>> + dur_avg_update(p); > >>> + > >>> hrtick_update(rq); > >>> } > >>> > >> > >> That qualifier looks a bit dangerous. Microbench components tend to > >> have only one behavior, but the real world goes through all kinds of > >> nutty gyrations, intentional and otherwise. > >> > > > > Understand. Unfortunately I don't have access to production environment > > so I have to rely on microbenchmarks and a OLTP to check the result. I > > get feedback from Abel that the former version of this patch brought > > benefit to some short tasks like Redis in the production environment[1]. > > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/ > > Since the discussion was about real-life workload performance, I ran the DayTrader > workload with three users and three instances. The results show no performance > regression, and a 1% performance gain was observed, which is within the standard > deviation. > Thank you Madadi for your test, let me respin this version and send it out with the fixes after talked with Mike. thanks, Chenyu ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks 2024-06-25 7:21 [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu 2024-06-25 7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu @ 2024-06-25 7:22 ` Chen Yu 1 sibling, 0 replies; 17+ messages in thread From: Chen Yu @ 2024-06-25 7:22 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu, linux-kernel, Chen Yu [Problem Statement] On platforms where there are many CPUs, one bottleneck is the high Cache-to-Cache latency. This issue is exacerbated when the tasks sharing data are running on different CPUs: When the tasks access different part of the same cache, false sharing happens. One example is the network client/server workload with small packages. A simple example: On a system with 240 CPUs, 2 sockets, taskset -c 2 netserver taskset -c 1 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 Trans Rate per sec: 83528.11 taskset -c 2 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 Trans Rate per sec: 134504.35 [Problem Analysis] TL;DR When netperf and nerserver are running on difference cores, the cache false sharing on the TCP/IP stack hurts the performance. As long as the netperf and netserver are on the same system, and within the same network namespace, this issue exists. Detail With the help of perf topdown, when netperf and netserver are both on CPU2: 28.1 % tma_backend_bound 13.7 % tma_memory_bound 3.3 % tma_l2_bound 9.3 % tma_l1_bound When netperf is on CPU1, netserver is on CPU2: 30.5 % tma_backend_bound 16.8 % tma_memory_bound 11.0 % tma_l1_bound 32.4 % tma_l3_bound 59.5 % tma_contested_accesses <---- 11.1 % tma_data_sharing The contested_accesses has increased a lot when netperf and netserver are on different CPUs. Contested accesses occur when data written by one thread is read by another thread on a different core. This indicates the cache false sharing. Use perf c2c to figure out the place where cache false sharing happens. top 2 offenders: ----- HITM ----- ------- Store Refs ------ --------- Data address ------- RmtHitm LclHitm L1 Hit L1 Miss Offset Node 0.00% 55.17% 0.00% 0.00% 0x1c <---- read 0.00% 0.00% 20.00% 0.00% 0x1f <---- write To be more specific, there are frequent read/write on the same cache line in the: struct tcp_sock { new cache line ... u16 tcp_header_len; <----- read u8 scaling_ratio; u8 chrono_type : 2, <---- write repair : 1, tcp_usec_ts : 1, is_sack_reneg:1, is_cwnd_limited:1; < ---- write new cache line u32 copied_seq; <----- write u32 rcv_tstamp; <---- write u32 snd_wl1; <---- write ... u32 urg_seq; <--- read Re-arranging the layout of struct tcp_sock could become a seesaw. As the variables mentioned above are frequently accessed by different path of TCP/IP stack. Propose a more generic solution: 1. if the waker and the wakee are both short duration tasks, 2. if the wakeup is WF_SYNC, 3. if there is no idle Core in the system, 4. if the waker and the wakee wake up each other, Wake up the wakee on the same CPU as waker. N.B. The bar to regard the task as a short duration one depends on the number of CPUs. Normally we don't want to enable this wakeup feature on desktop or mobile. Because the overhead of Cache-to-Cache latency is negligible on small systems. [Benchmark results] Tested on 4 platforms, significant throughput improvement on tbench, netperf, stress-ng, will-it-scale, and latency reduced of lmbench. Platform1, 240 CPUs, 2 sockets Intel(R) Xeon(R) ======================================================================== netperf ======= case load baseline(std%) compare%( std%) TCP_RR 60-threads 1.00 ( 1.04) -0.03 ( 1.27) TCP_RR 120-threads 1.00 ( 2.31) -0.09 ( 2.46) TCP_RR 180-threads 1.00 ( 1.77) +0.93 ( 1.16) TCP_RR 240-threads 1.00 ( 9.39) +190.13 ( 3.66) TCP_RR 300-threads 1.00 ( 45.28) +120.07 ( 19.41) TCP_RR 360-threads 1.00 ( 20.13) +0.27 ( 30.57) TCP_RR 420-threads 1.00 ( 30.85) +13.39 ( 46.38) UDP_RR 60-threads 1.00 ( 11.86) -0.29 ( 2.66) UDP_RR 120-threads 1.00 ( 16.28) +0.42 ( 13.41) UDP_RR 180-threads 1.00 ( 15.34) +0.31 ( 17.45) UDP_RR 240-threads 1.00 ( 16.27) -0.36 ( 18.78) UDP_RR 300-threads 1.00 ( 20.42) -2.54 ( 32.42) UDP_RR 360-threads 1.00 ( 31.59) +0.28 ( 35.66) UDP_RR 420-threads 1.00 ( 30.44) -0.27 ( 37.12) tbench ====== case load baseline(std%) compare%( std%) loopback 60-threads 1.00 ( 0.27) +0.04 ( 0.11) loopback 120-threads 1.00 ( 0.65) -1.01 ( 0.41) loopback 180-threads 1.00 ( 0.42) +62.05 ( 26.22) loopback 240-threads 1.00 ( 30.43) +77.61 ( 15.27) hackbench ========= case load baseline(std%) compare%( std%) process-pipe 1-groups 1.00 ( 6.92) +4.70 ( 5.85) process-pipe 2-groups 1.00 ( 6.45) +7.66 ( 2.39) process-pipe 4-groups 1.00 ( 2.82) -1.82 ( 1.47) schbench ======== No noticeable difference of 99.0th wakeup/request latency, 50.0th RPS percentiles. schbench -m 2 -r 100 baseline sis_sync Wakeup Latencies 99.0th usec 27 25 Request Latencies 99.0th usec 15376 15376 RPS percentiles 50.0th 16608 16608 Platform2, 48 CPUs 2 sockets Intel(R) Xeon(R) CPU E5-2697 ======================================================================== lmbench3: lmbench3.PIPE.latency.us 33.8% improvement lmbench3: lmbench3.AF_UNIX.sock.stream.latency.us 30.6% improvement Platform3: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480 ======================================================================= stress-ng: stress-ng.vm-rw.ops_per_sec 250.8% improvement will-it-scale: will-it-scale.per_process_ops 42.1% improvement Suggested-by: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> --- kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++---- kernel/sched/features.h | 1 + 2 files changed, 58 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 445877069fbf..d749397249ca 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1003,7 +1003,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) #include "pelt.h" #ifdef CONFIG_SMP -static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu); +static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, int sync); static unsigned long task_h_load(struct task_struct *p); static unsigned long capacity_of(int cpu); @@ -7410,12 +7410,55 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd #endif /* CONFIG_SCHED_SMT */ +/* + * threshold of the short duration task: + * sysctl_sched_migration_cost * llc_weight^2 / 256^2 + * + * threshold + * LLC_WEIGHT=8 0.5 usec + * LLC_WEIGHT=16 2 usec + * LLC_WEIGHT=32 8 usec + * LLC_WEIGHT=64 31 usec + * LLC_WEIGHT=128 125 usec + * LLC_WEIGHT=256 500 usec + */ +static int short_task(struct task_struct *p, int llc) +{ + return ((p->duration_avg << 16) < + (sysctl_sched_migration_cost * llc * llc)); +} + +static int mutual_wakeup(struct task_struct *p, int target) +{ + int llc_weight; + + if (!sched_feat(SIS_SYNC)) + return 0; + + if (target != smp_processor_id()) + return 0; + + if (this_rq()->nr_running > 1) + return 0; + + llc_weight = per_cpu(sd_llc_size, target); + + if (!short_task(p, llc_weight) || + !short_task(current, llc_weight)) + return 0; + + if (current->last_wakee != p || p->last_wakee != current) + return 0; + + return 1; +} /* * Scan the LLC domain for idle CPUs; this is dynamically regulated by * comparing the average scan cost (tracked in sd->avg_scan_cost) against the * average idle time for this rq (as found in rq->avg_idle). */ -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target) +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target, + int sync) { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); int i, cpu, idle_cpu = -1, nr = INT_MAX; @@ -7458,6 +7501,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool } } + /* + * The Cache-to-Cache latency could be large on big system. + * Before trying to find a compelete idle CPU than the current one, + * give the current CPU another chance if waker and wakee are mutually + * waking up each other. + */ + if (!has_idle_core && sync && mutual_wakeup(p, target)) + return target; + for_each_cpu_wrap(cpu, cpus, target + 1) { if (has_idle_core) { i = select_idle_core(p, cpu, cpus, &idle_cpu); @@ -7550,7 +7602,7 @@ static inline bool asym_fits_cpu(unsigned long util, /* * Try and locate an idle core/thread in the LLC cache domain. */ -static int select_idle_sibling(struct task_struct *p, int prev, int target) +static int select_idle_sibling(struct task_struct *p, int prev, int target, int sync) { bool has_idle_core = false; struct sched_domain *sd; @@ -7659,7 +7711,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) } } - i = select_idle_cpu(p, sd, has_idle_core, target); + i = select_idle_cpu(p, sd, has_idle_core, target, sync); if ((unsigned)i < nr_cpumask_bits) return i; @@ -8259,7 +8311,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag); } else if (wake_flags & WF_TTWU) { /* XXX always ? */ /* Fast path */ - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); + new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, sync); } rcu_read_unlock(); diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 143f55df890b..7e5968d01dcb 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true) * When doing wakeups, attempt to limit superfluous scans of the LLC domain. */ SCHED_FEAT(SIS_UTIL, true) +SCHED_FEAT(SIS_SYNC, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.25.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-08-05 7:22 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-06-25 7:21 [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu 2024-06-25 7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu 2024-06-26 4:21 ` Mike Galbraith 2024-06-30 13:09 ` Chen Yu 2024-07-01 6:57 ` Mike Galbraith 2024-07-01 14:57 ` Chen Yu 2024-07-02 5:08 ` Mike Galbraith 2024-07-03 8:34 ` Raghavendra K T 2024-07-03 11:57 ` Mike Galbraith 2024-07-03 13:23 ` Chen Yu 2024-07-03 13:45 ` Mike Galbraith 2024-07-03 13:40 ` Raghavendra K T 2024-07-03 13:12 ` Chen Yu 2024-07-03 13:46 ` Raghavendra K T 2024-08-05 4:38 ` Madadi Vineeth Reddy 2024-08-05 7:22 ` Chen Yu 2024-06-25 7:22 ` [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).