[PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks
@ 2024-06-25  7:21 Chen Yu
  2024-06-25  7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu
  2024-06-25  7:22 ` [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu
  0 siblings, 2 replies; 17+ messages in thread
From: Chen Yu @ 2024-06-25  7:21 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak,
	Gautham R . Shenoy, Chen Yu, linux-kernel, Chen Yu

Hi,

This is a respin of the previous work to inhit task migration[1].

Many workloads suffer from high Cache-to-Cache latency on large system. Especially
when different tasks access the same cache line, which brings false sharing.

This patch set leverages the WF_SYNC flag, and inhits task wakeup migration
for short duration ones. This can help reduce the chance to trigger the
cache false sharing.

The main concern of this proposal in the last discussion is that, it could
break idle-cpu-first task wakeup strategy. Usually an idle CPU can beat other
non-idle CPU in terms of latency. Based on this, the new proposal takes the
CPU number into consideration, if it is a low-core-count system, the bar to
inhit the task migration is much higher or even impossible. Vice versa. Meanwhile,
only when there is no idle Core in the system, this inhiting task migration will take
effect.

According to the test on 4 different platforms, it has shown good improvement for
Client/Server workloads, like netperf, tbench. And not saw any performance regression
on my 8 CPUs laptop. Please refer to PATCH 2/2 for detail.

Comments and tests would be appreciated.

[1] https://lore.kernel.org/lkml/cover.1682661027.git.yu.c.chen@intel.com/

Chen Yu (2):
  sched/fair: Record the average duration of a task
  sched/fair: Enhance sync wakeup for short duration tasks

 include/linux/sched.h   |  3 ++
 kernel/sched/core.c     |  2 ++
 kernel/sched/fair.c     | 74 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 4 files changed, 75 insertions(+), 5 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-06-25  7:21 [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu
@ 2024-06-25  7:22 ` Chen Yu
  2024-06-26  4:21   ` Mike Galbraith
  2024-06-25  7:22 ` [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu
  1 sibling, 1 reply; 17+ messages in thread
From: Chen Yu @ 2024-06-25  7:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak,
	Gautham R . Shenoy, Chen Yu, linux-kernel, Chen Yu

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first thought the (p->se.sum_exec_runtime / p->nvcsw)
can be used to measure the task duration. However, the
history long past was factored too heavily in such a formula.
Ideally, the old activity should decay and not affect
the current status too much.

Although something based on PELT can be used, se.util_avg might
not be appropriate to describe the task duration:
Task p1 and task p2 are doing frequent ping-pong scheduling on
one CPU, both p1 and p2 have a short duration, but the util_avg
of each task can be up to 50%, which is inconsistent with the
short task duration.

Here's an example to show what the average duration is. Suppose
on CPUx, task p1 and p2 run alternatively:

 --------------------> time

 | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
               ^               ^                                     ^
 |_____________|               |_____________________________________|
                                                                     ^
                                                                     |
                                                                  p1 dequeued

p1's duration is (1 + 0.5)ms. Because if p2 does not preempt p1, p1 can run 1.5ms.
This reflects the nature of a task: how long it wishes to run at most.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 12 ++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif
 
+	u64				prev_sleep_sum_runtime;
+	u64				duration_avg;
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	p->prev_sleep_sum_runtime = 0;
+	p->duration_avg = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..445877069fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6833,6 +6833,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p)
+{
+	u64 dur;
+
+	dur = p->se.sum_exec_runtime - p->prev_sleep_sum_runtime;
+	p->prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->duration_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	if (task_sleep)
+		dur_avg_update(p);
+
 	hrtick_update(rq);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks
  2024-06-25  7:21 [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu
  2024-06-25  7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu
@ 2024-06-25  7:22 ` Chen Yu
  1 sibling, 0 replies; 17+ messages in thread
From: Chen Yu @ 2024-06-25  7:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak,
	Gautham R . Shenoy, Chen Yu, linux-kernel, Chen Yu

[Problem Statement]
On platforms where there are many CPUs, one bottleneck is the high
Cache-to-Cache latency. This issue is exacerbated when the tasks sharing
data are running on different CPUs: When the tasks access different
part of the same cache, false sharing happens. One example is the network
client/server workload with small packages. A simple example:

On a system with 240 CPUs, 2 sockets,
taskset -c 2 netserver

taskset -c 1 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100
Trans Rate per sec: 83528.11

taskset -c 2 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100
Trans Rate per sec: 134504.35

[Problem Analysis]

TL;DR
When netperf and nerserver are running on difference cores, the cache
false sharing on the TCP/IP stack hurts the performance. As long as the
netperf and netserver are on the same system, and within the same
network namespace, this issue exists.

Detail
With the help of perf topdown, when netperf and netserver are both on CPU2:
28.1 %  tma_backend_bound
	13.7 %  tma_memory_bound
		3.3 %  tma_l2_bound
		9.3 %  tma_l1_bound

When netperf is on CPU1, netserver is on CPU2:
30.5 %  tma_backend_bound
	16.8 %  tma_memory_bound
		11.0 %  tma_l1_bound
		32.4 %  tma_l3_bound
			59.5 %  tma_contested_accesses  <----
			11.1 %  tma_data_sharing

The contested_accesses has increased a lot when netperf and netserver
are on different CPUs. Contested accesses occur when data written by one
thread is read by another thread on a different core. This indicates the
cache false sharing.

Use perf c2c to figure out the place where cache false sharing happens.
top 2 offenders:

----- HITM -----  ------- Store Refs ------  --------- Data address -------
RmtHitm  LclHitm   L1 Hit  L1 Miss                     Offset  Node
0.00%    55.17%    0.00%    0.00%                      0x1c         <---- read
0.00%    0.00%     20.00%   0.00%                      0x1f         <---- write

To be more specific, there are frequent read/write on the same cache line
in the:
struct tcp_sock {

	new cache line
	...
	u16     tcp_header_len; <----- read
        u8      scaling_ratio;
        u8      chrono_type : 2, <---- write
                repair      : 1,
                tcp_usec_ts : 1,
                is_sack_reneg:1,
                is_cwnd_limited:1; < ---- write

	new cache line

	u32     copied_seq;   <----- write
	u32     rcv_tstamp;   <---- write
	u32     snd_wl1;   <---- write
	...
	u32     urg_seq;   <--- read

Re-arranging the layout of struct tcp_sock could become a seesaw. As the variables
mentioned above are frequently accessed by different path of TCP/IP stack.

Propose a more generic solution:
1. if the waker and the wakee are both short duration tasks,
2. if the wakeup is WF_SYNC,
3. if there is no idle Core in the system,
4. if the waker and the wakee wake up each other,

Wake up the wakee on the same CPU as waker.

N.B. The bar to regard the task as a short duration one depends on the number
of CPUs. Normally we don't want to enable this wakeup feature on desktop or mobile.
Because the overhead of Cache-to-Cache latency is negligible on small systems.

[Benchmark results]
Tested on 4 platforms, significant throughput improvement on tbench, netperf,
stress-ng, will-it-scale, and latency reduced of lmbench.

Platform1, 240 CPUs, 2 sockets Intel(R) Xeon(R)
========================================================================
netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	60-threads	 1.00 (  1.04)	 -0.03 (  1.27)
TCP_RR          	120-threads	 1.00 (  2.31)	 -0.09 (  2.46)
TCP_RR          	180-threads	 1.00 (  1.77)	 +0.93 (  1.16)
TCP_RR          	240-threads	 1.00 (  9.39)	+190.13 (  3.66)
TCP_RR          	300-threads	 1.00 ( 45.28)	+120.07 ( 19.41)
TCP_RR          	360-threads	 1.00 ( 20.13)	 +0.27 ( 30.57)
TCP_RR          	420-threads	 1.00 ( 30.85)	+13.39 ( 46.38)
UDP_RR          	60-threads	 1.00 ( 11.86)	 -0.29 (  2.66)
UDP_RR          	120-threads	 1.00 ( 16.28)	 +0.42 ( 13.41)
UDP_RR          	180-threads	 1.00 ( 15.34)	 +0.31 ( 17.45)
UDP_RR          	240-threads	 1.00 ( 16.27)	 -0.36 ( 18.78)
UDP_RR          	300-threads	 1.00 ( 20.42)	 -2.54 ( 32.42)
UDP_RR          	360-threads	 1.00 ( 31.59)	 +0.28 ( 35.66)
UDP_RR          	420-threads	 1.00 ( 30.44)	 -0.27 ( 37.12)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	60-threads	 1.00 (  0.27)	 +0.04 (  0.11)
loopback        	120-threads	 1.00 (  0.65)	 -1.01 (  0.41)
loopback        	180-threads	 1.00 (  0.42)	+62.05 ( 26.22)
loopback        	240-threads	 1.00 ( 30.43)	+77.61 ( 15.27)

hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  6.92)	 +4.70 (  5.85)
process-pipe    	2-groups	 1.00 (  6.45)	 +7.66 (  2.39)
process-pipe    	4-groups	 1.00 (  2.82)	 -1.82 (  1.47)

schbench
========
No noticeable difference of 99.0th wakeup/request latency, 50.0th RPS percentiles.
schbench -m 2 -r 100

                                  baseline                       sis_sync

Wakeup Latencies 99.0th usec            27                            25
Request Latencies 99.0th usec        15376                         15376
RPS percentiles 50.0th               16608                         16608

Platform2, 48 CPUs 2 sockets Intel(R) Xeon(R) CPU E5-2697
========================================================================
lmbench3: lmbench3.PIPE.latency.us 33.8% improvement
lmbench3: lmbench3.AF_UNIX.sock.stream.latency.us 30.6% improvement

Platform3: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480
=======================================================================
stress-ng: stress-ng.vm-rw.ops_per_sec 250.8% improvement
will-it-scale: will-it-scale.per_process_ops 42.1% improvement

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c     | 62 +++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |  1 +
 2 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 445877069fbf..d749397249ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1003,7 +1003,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, int sync);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -7410,12 +7410,55 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+/*
+ *    threshold of the short duration task:
+ *    sysctl_sched_migration_cost * llc_weight^2 / 256^2
+ *
+ *                            threshold
+ *  LLC_WEIGHT=8               0.5 usec
+ *  LLC_WEIGHT=16              2 usec
+ *  LLC_WEIGHT=32              8 usec
+ *  LLC_WEIGHT=64              31 usec
+ *  LLC_WEIGHT=128             125 usec
+ *  LLC_WEIGHT=256             500 usec
+ */
+static int short_task(struct task_struct *p, int llc)
+{
+	return ((p->duration_avg << 16) <
+	    (sysctl_sched_migration_cost * llc * llc));
+}
+
+static int mutual_wakeup(struct task_struct *p, int target)
+{
+	int llc_weight;
+
+	if (!sched_feat(SIS_SYNC))
+		return 0;
+
+	if (target != smp_processor_id())
+		return 0;
+
+	if (this_rq()->nr_running > 1)
+		return 0;
+
+	llc_weight = per_cpu(sd_llc_size, target);
+
+	if (!short_task(p, llc_weight) ||
+	    !short_task(current, llc_weight))
+		return 0;
+
+	if (current->last_wakee != p || p->last_wakee != current)
+		return 0;
+
+	return 1;
+}
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target,
+			   int sync)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
@@ -7458,6 +7501,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	/*
+	 * The Cache-to-Cache latency could be large on big system.
+	 * Before trying to find a compelete idle CPU than the current one,
+	 * give the current CPU another chance if waker and wakee are mutually
+	 * waking up each other.
+	 */
+	if (!has_idle_core && sync && mutual_wakeup(p, target))
+		return target;
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -7550,7 +7602,7 @@ static inline bool asym_fits_cpu(unsigned long util,
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, int sync)
 {
 	bool has_idle_core = false;
 	struct sched_domain *sd;
@@ -7659,7 +7711,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
-	i = select_idle_cpu(p, sd, has_idle_core, target);
+	i = select_idle_cpu(p, sd, has_idle_core, target, sync);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
@@ -8259,7 +8311,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, sync);
 	}
 	rcu_read_unlock();
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..7e5968d01dcb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SYNC, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-06-25  7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu
@ 2024-06-26  4:21   ` Mike Galbraith
  2024-06-30 13:09     ` Chen Yu
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Galbraith @ 2024-06-26  4:21 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel

On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..7399c4143528 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>         p->migration_pending = NULL;
>  #endif
>         init_sched_mm_cid(p);
> +       p->prev_sleep_sum_runtime = 0;
> +       p->duration_avg = 0;
>  }

Beginning life biased toward stacking?
  
>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41b58387023d..445877069fbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
>
> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  dequeue_throttle:
>         util_est_update(&rq->cfs, p, task_sleep);
> +       if (task_sleep)
> +               dur_avg_update(p);
> +
>         hrtick_update(rq);
>  }
>

That qualifier looks a bit dangerous.  Microbench components tend to
have only one behavior, but the real world goes through all kinds of
nutty gyrations, intentional and otherwise.

The heuristics in the next patch seem to exhibit a healthy level of
paranoia, but these bits could perhaps use a tad more.  Bad experiences
springs to mind when I stare at that - sleepers going hog, hogs meet
sleeping lock contention, preemption, sync hint not meaning much...

	-Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-06-26  4:21   ` Mike Galbraith
@ 2024-06-30 13:09     ` Chen Yu
  2024-07-01  6:57       ` Mike Galbraith
  2024-08-05  4:38       ` Madadi Vineeth Reddy
  0 siblings, 2 replies; 17+ messages in thread
From: Chen Yu @ 2024-06-30 13:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel

Hi Mike,

Thanks for your time and giving the insights.

On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0935f9d4bb7b..7399c4143528 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> >         p->migration_pending = NULL;
> >  #endif
> >         init_sched_mm_cid(p);
> > +       p->prev_sleep_sum_runtime = 0;
> > +       p->duration_avg = 0;
> >  }
> 
> Beginning life biased toward stacking?
>

OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
task stacking in the beginning.

> >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 41b58387023d..445877069fbf 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> >
> > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  dequeue_throttle:
> >         util_est_update(&rq->cfs, p, task_sleep);
> > +       if (task_sleep)
> > +               dur_avg_update(p);
> > +
> >         hrtick_update(rq);
> >  }
> >
> 
> That qualifier looks a bit dangerous.  Microbench components tend to
> have only one behavior, but the real world goes through all kinds of
> nutty gyrations, intentional and otherwise.
>

Understand. Unfortunately I don't have access to production environment
so I have to rely on microbenchmarks and a OLTP to check the result. I
get feedback from Abel that the former version of this patch brought
benefit to some short tasks like Redis in the production environment[1].
https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

I can launch a combination of microbenchmarks in parallel to check the impact.

> The heuristics in the next patch seem to exhibit a healthy level of
> paranoia, but these bits could perhaps use a tad more.  Bad experiences
> springs to mind when I stare at that - sleepers going hog, hogs meet
> sleeping lock contention, preemption, sync hint not meaning much...
>

I see. If I understand correctly, the scenario mentioned above
could bring a false positive of 'short task', which causes
task stacking.

If the sleeper task:
1. is preempted frequently. This should not be a problem, because
   the task duration is unlikely to be shorten by preemption, and
   the short_task() is unlikely to return true.
   Because the task duration will span the time from the task is
   scheduled in, to finally scheduled out due to sleeping(dequeue_task_fair()),
   but not by preemption.
   This time duration should be long enough to not trigger the 'short task'
   case. But since there is a delay queue mechanism under development,
   calculating the duration in dequeue_task_fair() in current patch might not
   be a proper method anymore.

2. meets sleeping lock contention. This would be a false positive 'short task',
   which bring unexpected task stacking.

So consider 1 and 2, I'm thinking of moving the calculating of task duration from
dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
scenario that the task is waiting for the real resource, rather than blocking
on a random sleeping lock. And the wait_woken() is widely used by the driver to
indicate that task is waiting for the resource. For example, the netperf calltrace:

    schedule_timeout+222
    wait_woken+84
    sk_wait_data+378
    tcp_recvmsg_locked+507
    tcp_recvmsg+115
    inet_recvmsg+90
    sock_recvmsg+150

In the future, if there is requirement other scenario could also invoke the newly
introduced update_cur_duration() when needed. For example, the pipe_read() could
use it if the task is going to sleep due to the empty pipe buffer. I change the
code as below, may I have your suggestion on this?

thanks,
Chenyu

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif

+	u64				prev_sleep_sum_runtime;
+	u64				duration_avg;
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	p->prev_sleep_sum_runtime = 0;
+	p->duration_avg = 0;
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..bbeba36d0145 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -744,6 +744,19 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }

+void update_curr_duration(void)
+{
+	struct sched_entity *curr = &current->se;
+	unsigned long flags;
+	u64 dur;
+
+	local_irq_save(flags);
+	dur = curr->sum_exec_runtime - current->prev_sleep_sum_runtime;
+	current->prev_sleep_sum_runtime = curr->sum_exec_runtime;
+	update_avg(&current->duration_avg, dur);
+	local_irq_restore(flags);
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62fd8bc6fd08..7beb604ca76b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3574,6 +3574,7 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }

 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+extern void update_curr_duration(void);

 #ifdef CONFIG_RT_MUTEXES

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..a0004cc7454f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -419,8 +419,10 @@ long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout)
 	 * or woken_wake_function() sees our store to current->state.
 	 */
 	set_current_state(mode); /* A */
-	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park())
+	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park()) {
+		update_curr_duration();
 		timeout = schedule_timeout(timeout);
+	}
 	__set_current_state(TASK_RUNNING);

 	/*
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-06-30 13:09     ` Chen Yu
@ 2024-07-01  6:57       ` Mike Galbraith
  2024-07-01 14:57         ` Chen Yu
  2024-08-05  4:38       ` Madadi Vineeth Reddy
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Galbraith @ 2024-07-01  6:57 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel

On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> Hi Mike,
>
> Thanks for your time and giving the insights.
>
> On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0935f9d4bb7b..7399c4143528 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > >         p->migration_pending = NULL;
> > >  #endif
> > >         init_sched_mm_cid(p);
> > > +       p->prev_sleep_sum_runtime = 0;
> > > +       p->duration_avg = 0;
> > >  }
> >
> > Beginning life biased toward stacking?
> >
>
> OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> task stacking in the beginning.

Or something, definitely.
   
> > >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 41b58387023d..445877069fbf 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > >
> > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  
> > >  dequeue_throttle:
> > >         util_est_update(&rq->cfs, p, task_sleep);
> > > +       if (task_sleep)
> > > +               dur_avg_update(p);
> > > +
> > >         hrtick_update(rq);
> > >  }
> > >
> >
> > That qualifier looks a bit dangerous.  Microbench components tend to
> > have only one behavior, but the real world goes through all kinds of
> > nutty gyrations, intentional and otherwise.
> >
>
> Understand. Unfortunately I don't have access to production environment
> so I have to rely on microbenchmarks and a OLTP to check the result. I
> get feedback from Abel that the former version of this patch brought
> benefit to some short tasks like Redis in the production environment[1].
> https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

Here's hoping you get a lot more.

> I can launch a combination of microbenchmarks in parallel to check the impact.
>
> > The heuristics in the next patch seem to exhibit a healthy level of
> > paranoia, but these bits could perhaps use a tad more.  Bad experiences
> > springs to mind when I stare at that - sleepers going hog, hogs meet
> > sleeping lock contention, preemption, sync hint not meaning much...
> >
>
> I see. If I understand correctly, the scenario mentioned above
> could bring a false positive of 'short task', which causes
> task stacking.

Yeah.

> If the sleeper task:
> 1. is preempted frequently. This should not be a problem, because
>    the task duration is unlikely to be shorten by preemption, and
>    the short_task() is unlikely to return true.

Ah, but it not being a problem would beg the question why bother?
There are consequences either way, the dark side is just scarier.

> 2. meets sleeping lock contention. This would be a false positive 'short task',
>    which bring unexpected task stacking.

Yeah.

> So consider 1 and 2, I'm thinking of moving the calculating of task duration from
> dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
> wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
> scenario that the task is waiting for the real resource, rather than blocking
> on a random sleeping lock. And the wait_woken() is widely used by the driver to
> indicate that task is waiting for the resource. For example, the netperf calltrace:
>
>     schedule_timeout+222
>     wait_woken+84
>     sk_wait_data+378
>     tcp_recvmsg_locked+507
>     tcp_recvmsg+115
>     inet_recvmsg+90
>     sock_recvmsg+150
>
> In the future, if there is requirement other scenario could also invoke the newly
> introduced update_cur_duration() when needed. For example, the pipe_read() could
> use it if the task is going to sleep due to the empty pipe buffer. I change the
> code as below, may I have your suggestion on this?

I don't have any suggestions that will help plug the holes, heck, I
squabbled in this arena quite a bit some years ago, and did not win.
Frankly I don't think the scheduler has the information necessary to do
so, it'll always be a case of this will likely do less harm than good,
but will certainly leave at least an occasional mark.

Just take a look at the high speed ping-pong thing (not a benchmark,
that's a box full of tape measures, rather silly, but..).  TCP_RR IS
1:1, has as short a duration as network stack plus scheduler can
possibly make it, and is nearly synchronous to boot, two halves of a
whole, the ONLY thing you can certainly safely stack.. but a shared L2
box still takes a wee hit when you do so.  1:N or M:N tasks can
approach its wakeup frequency range, and there's nothing you can do
about the very same cache to cache latency you're trying to duck, it
just is what it is, and is considered perfectly fine as it is.  That's
a bit of a red flag, but worse is the lack of knowledge wrt what tasks
are actually up to at any given time.  We rashly presume that tasks
waking one another implies a 1:1 relationship, we routinely call them
buddies and generally get away with it.. but during any overlap they
can be doing anything including N way data share, and regardless of
what that is and section size, needless stacking flushes concurrency,
injecting service latency in its place, cost unknown.

Intentional stacking can be jokingly equated to injecting just a wee
bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-)

	-Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-01  6:57       ` Mike Galbraith
@ 2024-07-01 14:57         ` Chen Yu
  2024-07-02  5:08           ` Mike Galbraith
  2024-07-03  8:34           ` Raghavendra K T
  0 siblings, 2 replies; 17+ messages in thread
From: Chen Yu @ 2024-07-01 14:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel, Raghavendra K T

Hi Mike,

On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> > Hi Mike,
> >
> > Thanks for your time and giving the insights.
> >
> > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> > > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> > > >
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 0935f9d4bb7b..7399c4143528 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > > >         p->migration_pending = NULL;
> > > >  #endif
> > > >         init_sched_mm_cid(p);
> > > > +       p->prev_sleep_sum_runtime = 0;
> > > > +       p->duration_avg = 0;
> > > >  }
> > >
> > > Beginning life biased toward stacking?
> > >
> >
> > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> > task stacking in the beginning.
> 
> Or something, definitely.
>    
> > > >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 41b58387023d..445877069fbf 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > >
> > > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  
> > > >  dequeue_throttle:
> > > >         util_est_update(&rq->cfs, p, task_sleep);
> > > > +       if (task_sleep)
> > > > +               dur_avg_update(p);
> > > > +
> > > >         hrtick_update(rq);
> > > >  }
> > > >
> > >
> > > That qualifier looks a bit dangerous.  Microbench components tend to
> > > have only one behavior, but the real world goes through all kinds of
> > > nutty gyrations, intentional and otherwise.
> > >
> >
> > Understand. Unfortunately I don't have access to production environment
> > so I have to rely on microbenchmarks and a OLTP to check the result. I
> > get feedback from Abel that the former version of this patch brought
> > benefit to some short tasks like Redis in the production environment[1].
> > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/
> 
> Here's hoping you get a lot more.
>

We recently received a simulated benchmark for Meta. I'll conduct some tests
to see the results.

> > So consider 1 and 2, I'm thinking of moving the calculating of task duration from
> > dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
> > wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
> > scenario that the task is waiting for the real resource, rather than blocking
> > on a random sleeping lock. And the wait_woken() is widely used by the driver to
> > indicate that task is waiting for the resource. For example, the netperf calltrace:
> >
> >     schedule_timeout+222
> >     wait_woken+84
> >     sk_wait_data+378
> >     tcp_recvmsg_locked+507
> >     tcp_recvmsg+115
> >     inet_recvmsg+90
> >     sock_recvmsg+150
> >
> > In the future, if there is requirement other scenario could also invoke the newly
> > introduced update_cur_duration() when needed. For example, the pipe_read() could
> > use it if the task is going to sleep due to the empty pipe buffer. I change the
> > code as below, may I have your suggestion on this?
> 
> I don't have any suggestions that will help plug the holes, heck, I
> squabbled in this arena quite a bit some years ago, and did not win.
> Frankly I don't think the scheduler has the information necessary to do
> so, it'll always be a case of this will likely do less harm than good,
> but will certainly leave at least an occasional mark.
>

I agree. Unlike bug fixing, this kind of change usually involves trade-offs.
The attempt is to not do harm for most cases, and bring benefit for some cases.

Regarding the necessary information, non-scheduler components might have
better knowledge than the scheduler:

1. If the hint comes from user space, it could be something like
   sched_attr::sync_wakeup.
   It indicates that the task prefers sync-wakeup and is tolerant of task
   stacking. However, I'm unsure how this could be accepted by the community
   if we want to change the user space interface. What is the criteria to
   accept such change? Would the requirement from different production
   environment be considered as the endorsement?

2. If the hint comes from other components in the kernel, it could be the
   driver or others. seccomp in the current kernel code, enforces waking the
   wakee on the current CPU via the WF_CURRENT_CPU flag. However, WF_CURRENT_CPU
   seems a bit aggressive for ordinary tasks. So, wait_woken() could be used when
   needed (by the network component, for example) to indicate a possible
   cache/resource sensitivity of the wakee, and together with the task duration,
   to decide if the wakee can be placed on the current CPU.

> Just take a look at the high speed ping-pong thing (not a benchmark,
> that's a box full of tape measures, rather silly, but..).  TCP_RR IS
> 1:1, has as short a duration as network stack plus scheduler can
> possibly make it, and is nearly synchronous to boot, two halves of a
> whole, the ONLY thing you can certainly safely stack..

I agree, this is a limited scenario. 

> but a shared L2 box still takes a wee hit when you do so.

According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
share the same L2 cache, around 20% improvement was noticed (though not as much
as on the non-L2 shared platform). I haven't delved into the details yet, but my
understanding is that L1 cache-to-cache latency within the L2 domain might also
matter on large servers (which I need to investigate further).

> 1:N or M:N
> tasks can approach its wakeup frequency range, and there's nothing you can do
> about the very same cache to cache latency you're trying to duck, it
> just is what it is, and is considered perfectly fine as it is.  That's
> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
> are actually up to at any given time.  We rashly presume that tasks
> waking one another implies a 1:1 relationship, we routinely call them
> buddies and generally get away with it.. but during any overlap they
> can be doing anything including N way data share, and regardless of
> what that is and section size, needless stacking flushes concurrency,
> injecting service latency in its place, cost unknown.
>

I believe this is a generic issue that the current scheduler faces, where
it attempts to predict the task's behavior based on its runtime. For instance,
task_hot() checks the task runtime to predict whether the task is cache-hot,
regardless of what the task does during its time slice. This is also the case
with WF_SYNC, which provides the scheduler with a hint to wake up on the current
CPU to potentially benefit from cache locality.

A thought occurred to me that one possible method to determine if the waker
and wakee share data could be to leverage the NUMA balance's numa_group data structure.
As numa balance periodically scans the task's VMA space and groups tasks accessing
the same physical page into one numa_group, we can infer that if the waker and wakee
are within the same numa_group, they are likely to share data, and it might be
appropriate to place the wakee on top of the waker.

CC Raghavendra here in case he has any insights.

> Intentional stacking can be jokingly equated to injecting just a wee
> bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-)
>

I fully understand your concern, and this analogy is very interesting.
We will conduct additional tests and share the data/analysis later.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-01 14:57         ` Chen Yu
@ 2024-07-02  5:08           ` Mike Galbraith
  2024-07-03  8:34           ` Raghavendra K T
  1 sibling, 0 replies; 17+ messages in thread
From: Mike Galbraith @ 2024-07-02  5:08 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel, Raghavendra K T

On Mon, 2024-07-01 at 22:57 +0800, Chen Yu wrote:
> > Just take a look at the high speed ping-pong thing (not a benchmark,
> > that's a box full of tape measures, rather silly, but..).  TCP_RR IS
> > 1:1, has as short a duration as network stack plus scheduler can
> > possibly make it, and is nearly synchronous to boot, two halves of a
> > whole, the ONLY thing you can certainly safely stack..
>
> I agree, this is a limited scenario.
>
> > but a shared L2 box still takes a wee hit when you do so.
>
> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
> share the same L2 cache, around 20% improvement was noticed (though not as much
> as on the non-L2 shared platform).

This dinky box doesn't have 500 cores, but it's.. aw, adorable :)

rpi4:/root # ONLY=TCP_RR netperf.sh
TCP_RR-1        unbound    Avg:  31754  Sum:    31754
TCP_RR-1        stacked    Avg:  26625  Sum:    26625
TCP_RR-1        cross-core Avg:  32325  Sum:    32325

rpi4:/root # tbench.sh 1 30 2>&1|grep Throughput
Throughput 139.024 MB/sec  1 clients  1 procs  max_latency=1.116 ms
rpi4:/root # taskset -c 3 tbench.sh 1 30 2>&1|grep Throughput
Throughput 116.765 MB/sec  1 clients  1 procs  max_latency=0.340 ms
rpi4:/root #

This little box running its stock 6.6.33 distro kernel pulls out a
cross-core win for both maximally synchronous TCP_RR and the a bit
lesser so but still pretty close tbench.  The numbers mean little
though, one propagation speed is lovely, but were there more, I'd be as
stuck with them as I am with rpi4's one-speed (all ahead slow) gearbox.

	-Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-01 14:57         ` Chen Yu
  2024-07-02  5:08           ` Mike Galbraith
@ 2024-07-03  8:34           ` Raghavendra K T
  2024-07-03 11:57             ` Mike Galbraith
  2024-07-03 13:12             ` Chen Yu
  1 sibling, 2 replies; 17+ messages in thread
From: Raghavendra K T @ 2024-07-03  8:34 UTC (permalink / raw)
  To: Chen Yu, Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel



On 7/1/2024 8:27 PM, Chen Yu wrote:
> Hi Mike,
> 
> On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
>> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
>>> Hi Mike,
>>>
>>> Thanks for your time and giving the insights.
> 
> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
> share the same L2 cache, around 20% improvement was noticed (though not as much
> as on the non-L2 shared platform). I haven't delved into the details yet, but my
> understanding is that L1 cache-to-cache latency within the L2 domain might also
> matter on large servers (which I need to investigate further).
> 
>> 1:N or M:N
>> tasks can approach its wakeup frequency range, and there's nothing you can do
>> about the very same cache to cache latency you're trying to duck, it
>> just is what it is, and is considered perfectly fine as it is.  That's
>> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
>> are actually up to at any given time.  We rashly presume that tasks
>> waking one another implies a 1:1 relationship, we routinely call them
>> buddies and generally get away with it.. but during any overlap they
>> can be doing anything including N way data share, and regardless of
>> what that is and section size, needless stacking flushes concurrency,
>> injecting service latency in its place, cost unknown.
>>
> 
> I believe this is a generic issue that the current scheduler faces, where
> it attempts to predict the task's behavior based on its runtime. For instance,
> task_hot() checks the task runtime to predict whether the task is cache-hot,
> regardless of what the task does during its time slice. This is also the case
> with WF_SYNC, which provides the scheduler with a hint to wake up on the current
> CPU to potentially benefit from cache locality.
> 
> A thought occurred to me that one possible method to determine if the waker
> and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> As numa balance periodically scans the task's VMA space and groups tasks accessing
> the same physical page into one numa_group, we can infer that if the waker and wakee
> are within the same numa_group, they are likely to share data, and it might be
> appropriate to place the wakee on top of the waker.
> 
> CC Raghavendra here in case he has any insights.
> 

Agree with your thought here,

So I imagine two possible things to explore here.

1) Use task1, task2 numa_group and check if they belong to same
numa_group, also check if there is a possibility of M:N relationship
by checking if t1/t2->numa_group->nr_tasks > 1 etc

2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
(threads) possibly interested in same VMA.
Latter one looks to be practically difficult because we don't want to
sweep across VMAs perhaps..

> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-03  8:34           ` Raghavendra K T
@ 2024-07-03 11:57             ` Mike Galbraith
  2024-07-03 13:23               ` Chen Yu
  2024-07-03 13:40               ` Raghavendra K T
  2024-07-03 13:12             ` Chen Yu
  1 sibling, 2 replies; 17+ messages in thread
From: Mike Galbraith @ 2024-07-03 11:57 UTC (permalink / raw)
  To: Raghavendra K T, Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel

On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote:
>
>
> On 7/1/2024 8:27 PM, Chen Yu wrote:
> >
> > A thought occurred to me that one possible method to determine if the waker
> > and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> > As numa balance periodically scans the task's VMA space and groups tasks accessing
> > the same physical page into one numa_group, we can infer that if the waker and wakee
> > are within the same numa_group, they are likely to share data, and it might be
> > appropriate to place the wakee on top of the waker.
> >
> > CC Raghavendra here in case he has any insights.
> >
>
> Agree with your thought here,
>
> So I imagine two possible things to explore here.
>
> 1) Use task1, task2 numa_group and check if they belong to same
> numa_group, also check if there is a possibility of M:N relationship
> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>
> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
> (threads) possibly interested in same VMA.
> Latter one looks to be practically difficult because we don't want to
> sweep across VMAs perhaps..

Oooh dear.. as soon as you mention threads, the question of who's
wheelhouse is this in springs to mind, ie should the kernel be
overriding userspace by targeting bits of threaded programs for forced
serialization?

Bah, think I'll just bugger off and let you guys have a go at making
this stacking business do less harm than good.

	-Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-03  8:34           ` Raghavendra K T
  2024-07-03 11:57             ` Mike Galbraith
@ 2024-07-03 13:12             ` Chen Yu
  2024-07-03 13:46               ` Raghavendra K T
  1 sibling, 1 reply; 17+ messages in thread
From: Chen Yu @ 2024-07-03 13:12 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak,
	Gautham R . Shenoy, Chen Yu, linux-kernel

On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote:
> 
> 
> On 7/1/2024 8:27 PM, Chen Yu wrote:
> > Hi Mike,
> > 
> > On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
> > > On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> > > > Hi Mike,
> > > > 
> > > > Thanks for your time and giving the insights.
> > 
> > According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
> > share the same L2 cache, around 20% improvement was noticed (though not as much
> > as on the non-L2 shared platform). I haven't delved into the details yet, but my
> > understanding is that L1 cache-to-cache latency within the L2 domain might also
> > matter on large servers (which I need to investigate further).
> > 
> > > 1:N or M:N
> > > tasks can approach its wakeup frequency range, and there's nothing you can do
> > > about the very same cache to cache latency you're trying to duck, it
> > > just is what it is, and is considered perfectly fine as it is.  That's
> > > a bit of a red flag, but worse is the lack of knowledge wrt what tasks
> > > are actually up to at any given time.  We rashly presume that tasks
> > > waking one another implies a 1:1 relationship, we routinely call them
> > > buddies and generally get away with it.. but during any overlap they
> > > can be doing anything including N way data share, and regardless of
> > > what that is and section size, needless stacking flushes concurrency,
> > > injecting service latency in its place, cost unknown.
> > > 
> > 
> > I believe this is a generic issue that the current scheduler faces, where
> > it attempts to predict the task's behavior based on its runtime. For instance,
> > task_hot() checks the task runtime to predict whether the task is cache-hot,
> > regardless of what the task does during its time slice. This is also the case
> > with WF_SYNC, which provides the scheduler with a hint to wake up on the current
> > CPU to potentially benefit from cache locality.
> > 
> > A thought occurred to me that one possible method to determine if the waker
> > and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> > As numa balance periodically scans the task's VMA space and groups tasks accessing
> > the same physical page into one numa_group, we can infer that if the waker and wakee
> > are within the same numa_group, they are likely to share data, and it might be
> > appropriate to place the wakee on top of the waker.
> > 
> > CC Raghavendra here in case he has any insights.
> > 
> 
> Agree with your thought here,
>

Thanks for taking a look at this, Raghavendra.

> So I imagine two possible things to explore here.
> 
> 1) Use task1, task2 numa_group and check if they belong to same
> numa_group, also check if there is a possibility of M:N relationship
> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>

I see, so do you mean if it is M:N rather than 1:1, we should avoid the
task been woken up on current CPU to avoid task stacking?
 
> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
> (threads) possibly interested in same VMA.
> Latter one looks to be practically difficult because we don't want to
> sweep across VMAs perhaps..
>

Yeah, we might have to scan the whole VMAs to gather the PID information which
might be a little costly(or maybe subset of the VMAs). And the pids_active[] is
for threads rather than process, stacking the threads seem to not be good
enough(per Mike's comments)

Anyway, I'm preparing for some full tests to see if there is overall benefit
from current version. Later let's investigate if numa balance information could
help here.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-03 11:57             ` Mike Galbraith
@ 2024-07-03 13:23               ` Chen Yu
  2024-07-03 13:45                 ` Mike Galbraith
  2024-07-03 13:40               ` Raghavendra K T
  1 sibling, 1 reply; 17+ messages in thread
From: Chen Yu @ 2024-07-03 13:23 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Raghavendra K T, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak,
	Gautham R . Shenoy, Chen Yu, linux-kernel

Hi Mike,

On 2024-07-03 at 13:57:19 +0200, Mike Galbraith wrote:
> On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote:
> >
> >
> > On 7/1/2024 8:27 PM, Chen Yu wrote:
> > >
> > > A thought occurred to me that one possible method to determine if the waker
> > > and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> > > As numa balance periodically scans the task's VMA space and groups tasks accessing
> > > the same physical page into one numa_group, we can infer that if the waker and wakee
> > > are within the same numa_group, they are likely to share data, and it might be
> > > appropriate to place the wakee on top of the waker.
> > >
> > > CC Raghavendra here in case he has any insights.
> > >
> >
> > Agree with your thought here,
> >
> > So I imagine two possible things to explore here.
> >
> > 1) Use task1, task2 numa_group and check if they belong to same
> > numa_group, also check if there is a possibility of M:N relationship
> > by checking if t1/t2->numa_group->nr_tasks > 1 etc
> >
> > 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
> > (threads) possibly interested in same VMA.
> > Latter one looks to be practically difficult because we don't want to
> > sweep across VMAs perhaps..
> 
> Oooh dear.. as soon as you mention threads, the question of who's
> wheelhouse is this in springs to mind, ie should the kernel be
> overriding userspace by targeting bits of threaded programs for forced
> serialization?
>

Do you mean the threads within the same process should run in parallel
as much as possible, regardless of sharing the same data, because the thread
is designed to do so? If so then we should probably skip the 1:1 task stacking
if the waker and the wakee are in the same process.
 
thanks,
Chenyu 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-03 11:57             ` Mike Galbraith
  2024-07-03 13:23               ` Chen Yu
@ 2024-07-03 13:40               ` Raghavendra K T
  1 sibling, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2024-07-03 13:40 UTC (permalink / raw)
  To: Mike Galbraith, Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel



On 7/3/2024 5:27 PM, Mike Galbraith wrote:
> On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote:
>>
>>
>> On 7/1/2024 8:27 PM, Chen Yu wrote:
>>>
>>> A thought occurred to me that one possible method to determine if the waker
>>> and wakee share data could be to leverage the NUMA balance's numa_group data structure.
>>> As numa balance periodically scans the task's VMA space and groups tasks accessing
>>> the same physical page into one numa_group, we can infer that if the waker and wakee
>>> are within the same numa_group, they are likely to share data, and it might be
>>> appropriate to place the wakee on top of the waker.
>>>
>>> CC Raghavendra here in case he has any insights.
>>>
>>
>> Agree with your thought here,
>>
>> So I imagine two possible things to explore here.
>>
>> 1) Use task1, task2 numa_group and check if they belong to same
>> numa_group, also check if there is a possibility of M:N relationship
>> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>>
>> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
>> (threads) possibly interested in same VMA.
>> Latter one looks to be practically difficult because we don't want to
>> sweep across VMAs perhaps..
> 
> Oooh dear.. as soon as you mention threads, the question of who's
> wheelhouse is this in springs to mind, ie should the kernel be
> overriding userspace by targeting bits of threaded programs for forced
> serialization?
> 

Yes.. There is no ROI on this option (mentioned only for completeness).
also we are not looking beyond process. Rather than "Practically
  difficult" I should have rephrased as Practically not an option.

> Bah, think I'll just bugger off and let you guys have a go at making
> this stacking business do less harm than good.
> 
> 	-Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-03 13:23               ` Chen Yu
@ 2024-07-03 13:45                 ` Mike Galbraith
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Galbraith @ 2024-07-03 13:45 UTC (permalink / raw)
  To: Chen Yu
  Cc: Raghavendra K T, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak,
	Gautham R . Shenoy, Chen Yu, linux-kernel

On Wed, 2024-07-03 at 21:23 +0800, Chen Yu wrote:
> On 2024-07-03 at 13:57:19 +0200, Mike Galbraith wrote:
> >
> > Oooh dear.. as soon as you mention threads, the question of who's
> > wheelhouse is this in springs to mind, ie should the kernel be
> > overriding userspace by targeting bits of threaded programs for forced
> > serialization?
>
> Do you mean the threads within the same process should run in parallel
> as much as possible, regardless of sharing the same data, because the thread
> is designed to do so? If so then we should probably skip the 1:1 task stacking
> if the waker and the wakee are in the same process.

That's seems like the obviously correct answer.. except for that
leaving virtually the entire issue you're squabbling with lying on the
floor.  Grr, how annoying.

	-Mike

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-07-03 13:12             ` Chen Yu
@ 2024-07-03 13:46               ` Raghavendra K T
  0 siblings, 0 replies; 17+ messages in thread
From: Raghavendra K T @ 2024-07-03 13:46 UTC (permalink / raw)
  To: Chen Yu
  Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Tim Chen, Yujie Liu, K Prateek Nayak,
	Gautham R . Shenoy, Chen Yu, linux-kernel



On 7/3/2024 6:42 PM, Chen Yu wrote:
> On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote:
>>
>>
>> On 7/1/2024 8:27 PM, Chen Yu wrote:
>>> Hi Mike,
>>>
>>> On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
>>>> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for your time and giving the insights.
>>>
>>> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
>>> share the same L2 cache, around 20% improvement was noticed (though not as much
>>> as on the non-L2 shared platform). I haven't delved into the details yet, but my
>>> understanding is that L1 cache-to-cache latency within the L2 domain might also
>>> matter on large servers (which I need to investigate further).
>>>
>>>> 1:N or M:N
>>>> tasks can approach its wakeup frequency range, and there's nothing you can do
>>>> about the very same cache to cache latency you're trying to duck, it
>>>> just is what it is, and is considered perfectly fine as it is.  That's
>>>> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
>>>> are actually up to at any given time.  We rashly presume that tasks
>>>> waking one another implies a 1:1 relationship, we routinely call them
>>>> buddies and generally get away with it.. but during any overlap they
>>>> can be doing anything including N way data share, and regardless of
>>>> what that is and section size, needless stacking flushes concurrency,
>>>> injecting service latency in its place, cost unknown.
>>>>
>>>
>>> I believe this is a generic issue that the current scheduler faces, where
>>> it attempts to predict the task's behavior based on its runtime. For instance,
>>> task_hot() checks the task runtime to predict whether the task is cache-hot,
>>> regardless of what the task does during its time slice. This is also the case
>>> with WF_SYNC, which provides the scheduler with a hint to wake up on the current
>>> CPU to potentially benefit from cache locality.
>>>
>>> A thought occurred to me that one possible method to determine if the waker
>>> and wakee share data could be to leverage the NUMA balance's numa_group data structure.
>>> As numa balance periodically scans the task's VMA space and groups tasks accessing
>>> the same physical page into one numa_group, we can infer that if the waker and wakee
>>> are within the same numa_group, they are likely to share data, and it might be
>>> appropriate to place the wakee on top of the waker.
>>>
>>> CC Raghavendra here in case he has any insights.
>>>
>>
>> Agree with your thought here,
>>
> 
> Thanks for taking a look at this, Raghavendra.
> 
>> So I imagine two possible things to explore here.
>>
>> 1) Use task1, task2 numa_group and check if they belong to same
>> numa_group, also check if there is a possibility of M:N relationship
>> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>>
> 
> I see, so do you mean if it is M:N rather than 1:1, we should avoid the
> task been woken up on current CPU to avoid task stacking?

Not sure actually, perhaps depends on usecase. But atleast gives an idea
on the relationship.
Problem here is we only know that they belong to same numa_group,
But we cannot deduce actual relationship (we only know it is > 1)
(1:N or M:N is not known).

>   
>> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
>> (threads) possibly interested in same VMA.
>> Latter one looks to be practically difficult because we don't want to
>> sweep across VMAs perhaps..
>>
> 
> Yeah, we might have to scan the whole VMAs to gather the PID information which
> might be a little costly(or maybe subset of the VMAs). And the pids_active[] is
> for threads rather than process, stacking the threads seem to not be good
> enough(per Mike's comments)
> 
> Anyway, I'm preparing for some full tests to see if there is overall benefit
> from current version. Later let's investigate if numa balance information could
> help here.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-06-30 13:09     ` Chen Yu
  2024-07-01  6:57       ` Mike Galbraith
@ 2024-08-05  4:38       ` Madadi Vineeth Reddy
  2024-08-05  7:22         ` Chen Yu
  1 sibling, 1 reply; 17+ messages in thread
From: Madadi Vineeth Reddy @ 2024-08-05  4:38 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel, Mike Galbraith, Madadi Vineeth Reddy

Hi Chen Yu,

On 30/06/24 18:39, Chen Yu wrote:
> Hi Mike,
> 
> Thanks for your time and giving the insights.
> 
> On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
>> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 0935f9d4bb7b..7399c4143528 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>>         p->migration_pending = NULL;
>>>  #endif
>>>         init_sched_mm_cid(p);
>>> +       p->prev_sleep_sum_runtime = 0;
>>> +       p->duration_avg = 0;
>>>  }
>>
>> Beginning life biased toward stacking?
>>
> 
> OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> task stacking in the beginning.
>    
>>>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 41b58387023d..445877069fbf 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>>
>>> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>  
>>>  dequeue_throttle:
>>>         util_est_update(&rq->cfs, p, task_sleep);
>>> +       if (task_sleep)
>>> +               dur_avg_update(p);
>>> +
>>>         hrtick_update(rq);
>>>  }
>>>
>>
>> That qualifier looks a bit dangerous.  Microbench components tend to
>> have only one behavior, but the real world goes through all kinds of
>> nutty gyrations, intentional and otherwise.
>>
> 
> Understand. Unfortunately I don't have access to production environment
> so I have to rely on microbenchmarks and a OLTP to check the result. I
> get feedback from Abel that the former version of this patch brought
> benefit to some short tasks like Redis in the production environment[1].
> https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

Since the discussion was about real-life workload performance, I ran the DayTrader
workload with three users and three instances. The results show no performance
regression, and a 1% performance gain was observed, which is within the standard
deviation.

Thanks and Regards
Madadi Vineeth Reddy

> 
> I can launch a combination of microbenchmarks in parallel to check the impact.
> 
>> The heuristics in the next patch seem to exhibit a healthy level of
>> paranoia, but these bits could perhaps use a tad more.  Bad experiences
>> springs to mind when I stare at that - sleepers going hog, hogs meet
>> sleeping lock contention, preemption, sync hint not meaning much...


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the average duration of a task
  2024-08-05  4:38       ` Madadi Vineeth Reddy
@ 2024-08-05  7:22         ` Chen Yu
  0 siblings, 0 replies; 17+ messages in thread
From: Chen Yu @ 2024-08-05  7:22 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Tim Chen, Yujie Liu, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	linux-kernel, Mike Galbraith, Madadi Vineeth Reddy

On 2024-08-05 at 10:08:35 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 30/06/24 18:39, Chen Yu wrote:
> > Hi Mike,
> > 
> > Thanks for your time and giving the insights.
> > 
> > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> >> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> >>>
> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>> index 0935f9d4bb7b..7399c4143528 100644
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> >>>         p->migration_pending = NULL;
> >>>  #endif
> >>>         init_sched_mm_cid(p);
> >>> +       p->prev_sleep_sum_runtime = 0;
> >>> +       p->duration_avg = 0;
> >>>  }
> >>
> >> Beginning life biased toward stacking?
> >>
> > 
> > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> > task stacking in the beginning.
> >    
> >>>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 41b58387023d..445877069fbf 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>>
> >>> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>  
> >>>  dequeue_throttle:
> >>>         util_est_update(&rq->cfs, p, task_sleep);
> >>> +       if (task_sleep)
> >>> +               dur_avg_update(p);
> >>> +
> >>>         hrtick_update(rq);
> >>>  }
> >>>
> >>
> >> That qualifier looks a bit dangerous.  Microbench components tend to
> >> have only one behavior, but the real world goes through all kinds of
> >> nutty gyrations, intentional and otherwise.
> >>
> > 
> > Understand. Unfortunately I don't have access to production environment
> > so I have to rely on microbenchmarks and a OLTP to check the result. I
> > get feedback from Abel that the former version of this patch brought
> > benefit to some short tasks like Redis in the production environment[1].
> > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/
> 
> Since the discussion was about real-life workload performance, I ran the DayTrader
> workload with three users and three instances. The results show no performance
> regression, and a 1% performance gain was observed, which is within the standard
> deviation.
>

Thank you Madadi for your test, let me respin this version and send it out with the
fixes after talked with Mike.

thanks,
Chenyu 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-08-05  7:22 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-25  7:21 [PATCH 0/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu
2024-06-25  7:22 ` [PATCH 1/2] sched/fair: Record the average duration of a task Chen Yu
2024-06-26  4:21   ` Mike Galbraith
2024-06-30 13:09     ` Chen Yu
2024-07-01  6:57       ` Mike Galbraith
2024-07-01 14:57         ` Chen Yu
2024-07-02  5:08           ` Mike Galbraith
2024-07-03  8:34           ` Raghavendra K T
2024-07-03 11:57             ` Mike Galbraith
2024-07-03 13:23               ` Chen Yu
2024-07-03 13:45                 ` Mike Galbraith
2024-07-03 13:40               ` Raghavendra K T
2024-07-03 13:12             ` Chen Yu
2024-07-03 13:46               ` Raghavendra K T
2024-08-05  4:38       ` Madadi Vineeth Reddy
2024-08-05  7:22         ` Chen Yu
2024-06-25  7:22 ` [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks Chen Yu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).