linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-07-27 10:27 Peter Zijlstra
@ 2024-11-28 10:32 ` Marcel Ziswiler
  2024-11-28 10:58   ` Peter Zijlstra
  2024-12-10  8:45   ` Luis Machado
  0 siblings, 2 replies; 51+ messages in thread
From: Marcel Ziswiler @ 2024-11-28 10:32 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel
  Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault

Hi all,

On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote:
> Hi all,
> 
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree forever, it seems, and people have been
> testing them and sending fixes.
> 
> I've spent the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
> 
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
> 
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
> 
> 
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
> 
>  - split up the huge delay-dequeue patch
>  - tested/fixed cfs-bandwidth
>  - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>  - SCHED_BATCH is equivalent to RESPECT_SLICE
>  - propagate min_slice up cgroups
>  - CLOCK_THREAD_DVFS_ID

We found that the following 7 commits from this patch set crash in enqueue_dl_entity():

54a58a787791 sched/fair: Implement DELAY_ZERO
152e11f6df29 sched/fair: Implement delayed dequeue
e1459a50ba31 sched: Teach dequeue_task() about special task states
a1c446611e31 sched,freezer: Mark TASK_FROZEN special
781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue

Resulting in the following crash dump (this is running v6.12.1):

[   14.652856] sched: DL replenish lagged too much
[   16.572706] ------------[ cut here ]------------
[   16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c
[   16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_triggered_buffer kfifo_buf rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm hantro_vpu v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2mem rockchipdrm drm_gpuvm drm_exec drm_shmem_helper analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec drm_display_helper snd_soc_audio_graph_card snd_soc_simple_card_utils drm_dma_helper drm_kms_helper cfg80211 rfkill pci_endpoint_test drm backlight dm_mod dax
[   16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15
[   16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024
[   16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   16.580273] pc : enqueue_dl_entity+0x46c/0x55c
[   16.580661] lr : dl_server_start+0x44/0x12c
[   16.581028] sp : ffff80008002bc00
[   16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000
[   16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009
[   16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0
[   16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000
[   16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698
[   16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000
[   16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009
[   16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440
[   16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[   16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0
[   16.587540] Call trace:
[   16.587754]  enqueue_dl_entity+0x46c/0x55c
[   16.588113]  dl_server_start+0x44/0x12c
[   16.588449]  enqueue_task_fair+0x124/0x49c
[   16.588807]  enqueue_task+0x3c/0xe0
[   16.589113]  ttwu_do_activate.isra.0+0x6c/0x208
[   16.589511]  try_to_wake_up+0x1d0/0x61c
[   16.589847]  wake_up_process+0x18/0x24
[   16.590175]  kick_pool+0x84/0x150
[   16.590467]  __queue_work+0x2f4/0x544
[   16.590788]  delayed_work_timer_fn+0x1c/0x28
[   16.591161]  call_timer_fn+0x34/0x1ac
[   16.591481]  __run_timer_base+0x20c/0x314
[   16.591832]  run_timer_softirq+0x3c/0x78
[   16.592176]  handle_softirqs+0x124/0x35c
[   16.592520]  __do_softirq+0x14/0x20
[   16.592827]  ____do_softirq+0x10/0x1c
[   16.593148]  call_on_irq_stack+0x24/0x4c
[   16.593490]  do_softirq_own_stack+0x1c/0x2c
[   16.593857]  irq_exit_rcu+0x8c/0xc0
[   16.594163]  el0_interrupt+0x48/0xbc
[   16.594477]  __el0_irq_handler_common+0x18/0x24
[   16.594874]  el0t_64_irq_handler+0x10/0x1c
[   16.595232]  el0t_64_irq+0x190/0x194
[   16.595545] ---[ end trace 0000000000000000 ]---
[   16.595950] ------------[ cut here ]------------

It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1].

We are running the exact same scheduler stress test both on Intel NUC and Radxa ROCK 5B board farms.
While we have so far not seen this on amd64, it crashes consistently and reproducibly on aarch64.

We haven't had time to put together a non-proprietary reproduction case yet, but I wanted to report our
current findings and ask for feedback/suggestions from the community.

Thanks!

Cheers

Marcel

[1] https://hastebin.skyra.pw/hoqesigaye.yaml

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
@ 2024-11-28 10:58   ` Peter Zijlstra
  2024-11-28 11:37     ` Marcel Ziswiler
  2024-12-10  8:45   ` Luis Machado
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-28 10:58 UTC (permalink / raw)
  To: Marcel Ziswiler
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
	wuyun.abel, youssefesmat, tglx, efault

On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote:

> Resulting in the following crash dump (this is running v6.12.1):
> 
> [   14.652856] sched: DL replenish lagged too much
> [   16.572706] ------------[ cut here ]------------
> [   16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c

> [   16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15
> [   16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024
> [   16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   16.580273] pc : enqueue_dl_entity+0x46c/0x55c
> [   16.580661] lr : dl_server_start+0x44/0x12c
> [   16.581028] sp : ffff80008002bc00
> [   16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000
> [   16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009
> [   16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0
> [   16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000
> [   16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698
> [   16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000
> [   16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009
> [   16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440
> [   16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [   16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0
> [   16.587540] Call trace:
> [   16.587754]  enqueue_dl_entity+0x46c/0x55c
> [   16.588113]  dl_server_start+0x44/0x12c
> [   16.588449]  enqueue_task_fair+0x124/0x49c
> [   16.588807]  enqueue_task+0x3c/0xe0
> [   16.589113]  ttwu_do_activate.isra.0+0x6c/0x208
> [   16.589511]  try_to_wake_up+0x1d0/0x61c
> [   16.589847]  wake_up_process+0x18/0x24
> [   16.590175]  kick_pool+0x84/0x150
> [   16.590467]  __queue_work+0x2f4/0x544
> [   16.590788]  delayed_work_timer_fn+0x1c/0x28
> [   16.591161]  call_timer_fn+0x34/0x1ac
> [   16.591481]  __run_timer_base+0x20c/0x314
> [   16.591832]  run_timer_softirq+0x3c/0x78
> [   16.592176]  handle_softirqs+0x124/0x35c
> [   16.592520]  __do_softirq+0x14/0x20
> [   16.592827]  ____do_softirq+0x10/0x1c
> [   16.593148]  call_on_irq_stack+0x24/0x4c
> [   16.593490]  do_softirq_own_stack+0x1c/0x2c
> [   16.593857]  irq_exit_rcu+0x8c/0xc0
> [   16.594163]  el0_interrupt+0x48/0xbc
> [   16.594477]  __el0_irq_handler_common+0x18/0x24
> [   16.594874]  el0t_64_irq_handler+0x10/0x1c
> [   16.595232]  el0t_64_irq+0x190/0x194
> [   16.595545] ---[ end trace 0000000000000000 ]---
> [   16.595950] ------------[ cut here ]------------
> 
> It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1].

Right, I've had a number of these reports, but so far we've not yet
managed to figure out how it's all happening.

> We are running the exact same scheduler stress test both on Intel NUCs
> as well as Radxa ROCK 5B board farms. While so far we have not seen
> this on amd64, it crashes consistently/reproducibly on aarch64.

Oooh, that's something. So far the few reports have not been (easily)
reproducible. If this is readily reproducible on arm64 that would
help a lot. Juri, do you have access to an arm64 test box?

A very long shot:

 https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com

doesn't help, does it?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-28 10:58   ` Peter Zijlstra
@ 2024-11-28 11:37     ` Marcel Ziswiler
  2024-11-29  9:08       ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Marcel Ziswiler @ 2024-11-28 11:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
	wuyun.abel, youssefesmat, tglx, efault

On Thu, 2024-11-28 at 11:58 +0100, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote:
> 
> > Resulting in the following crash dump (this is running v6.12.1):

[snip]

> > It looks like it is trying to enqueue an already queued deadline task. Full serial console log available
> > [1].
> 
> Right, I've had a number of these reports, but so far we've not yet
> managed to figure out how it's all happening.
> 
> > We are running the exact same scheduler stress test both on Intel NUCs
> > as well as Radxa ROCK 5B board farms. While so far we have not seen
> > this on amd64, it crashes consistently/reproducibly on aarch64.
> 
> Oooh, that's something. So far the few reports have not been (easily)
> reproducible. If this is readily reproducible on arm64 that would
> help a lot. Juri, do you have access to an arm64 test box?

As mentioned above, our scheduler stress test is not yet open source, but Codethink is eager to share
anything that helps resolve this.

> A very long shot:
> 
>  https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
> 
> doesn't help, does it?

No, still the same with this on top of today's -next.

Thanks!

Cheers

Marcel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-28 11:37     ` Marcel Ziswiler
@ 2024-11-29  9:08       ` Peter Zijlstra
  2024-12-02 18:46         ` Marcel Ziswiler
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-29  9:08 UTC (permalink / raw)
  To: Marcel Ziswiler
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
	wuyun.abel, youssefesmat, tglx, efault

On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:

> > Oooh, that's something. So far the few reports have not been (easily)
> > reproducible. If this is readily reproducible on arm64 that would
> > help a lot. Juri, do you have access to an arm64 test box?
> 
> As mentioned above, our scheduler stress test is not yet open source, but Codethink is eager to share
> anything that helps resolve this.

I was hoping you could perhaps share a binary with Juri privately or
with RHT (same difference etc), such that he can poke at it too.

Anyway, if you don't mind a bit of back and forth, would you mind adding
the below patch to your kernel and doing:

(all assuming your kernel has ftrace enabled)

  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
  echo 1 > /proc/sys/kernel/traceoff_on_warning

running your test to failure and then dumping the trace into a file
like:

  cat /sys/kernel/debug/tracing/trace > ~/trace

Then compress the file (bzip2 or whatever is popular these days) and
send it my way along with a dmesg dump (private is fine -- these things
tend to be large-ish).

Hopefully, this will give us a little clue as to where the double
enqueue happens.
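
For completeness, the procedure above could be scripted more or less like this (just a sketch; the helper names are made up, and the tracefs and sysctl paths are variables only so the helpers can be dry-run outside a live box):

```shell
#!/bin/sh
# Sketch of the capture procedure described above.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"
TRACEOFF="${TRACEOFF:-/proc/sys/kernel/traceoff_on_warning}"

capture_setup() {
    # Record a stack trace with every event, and freeze the ring
    # buffer when the WARN fires so the evidence isn't overwritten.
    echo 1 > "$TRACEFS/options/stacktrace"
    echo 1 > "$TRACEOFF"
}

capture_dump() {
    # After running the test to failure: snapshot the buffer and
    # compress it for mailing.
    out="$1"
    cat "$TRACEFS/trace" > "$out"
    gzip -9 "$out"    # or xz/zstd; produces "$out.gz"
}
```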

---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..b9cd9b40a19f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 	scoped_guard (rq_lock, rq) {
 		struct rq_flags *rf = &scope.rf;
 
+		if (dl_se == &rq->fair_server) {
+			trace_printk("timer fair server %d throttled %d\n",
+				     cpu_of(rq), dl_se->dl_throttled);
+		}
+
 		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
 			return HRTIMER_NORESTART;
 
@@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		rq_lock(rq, &rf);
 	}
 
+	if (dl_se == &rq->fair_server)
+		trace_printk("inactive fair server %d\n", cpu_of(rq));
+
 	sched_clock_tick();
 	update_rq_clock(rq);
 
@@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("enqueue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}
 
 	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
 
@@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("dequeue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}
 
 	if (RB_EMPTY_NODE(&dl_se->rb_node))
 		return;

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-29  9:08       ` Peter Zijlstra
@ 2024-12-02 18:46         ` Marcel Ziswiler
  2024-12-09  9:49           ` Peter Zijlstra
  2024-12-10 16:13           ` Steven Rostedt
  0 siblings, 2 replies; 51+ messages in thread
From: Marcel Ziswiler @ 2024-12-02 18:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
	wuyun.abel, youssefesmat, tglx, efault

Sorry for my late reply; I was traveling back from Manchester to Switzerland, but I am all settled in again.

On Fri, 2024-11-29 at 10:08 +0100, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:
> 
> > > Oooh, that's something. So far the few reports have not been (easily)
> > > reproducible. If this is readily reproducible on arm64 that would
> > > help a lot. Juri, do you have access to an arm64 test box?
> > 
> > As mentioned above, our scheduler stress test is not yet open source, but Codethink is eager to share
> > anything that helps resolve this.
> 
> I was hoping you could perhaps share a binary with Juri privately or
> with RHT (same difference etc), such that he can poke at it too.

Sure, there is nothing secret about it; we just have not gotten around to open-sourcing all parts of it yet.

The UEFI aarch64 embedded Linux image I am using may be found here [1], plus a matching bmap file should you
fancy using that [2]. The SSH key may help when interacting with the system (e.g. that is how I trigger the
failure, as the console is quite busy with tracing) [3]. However, that image was built by CI and does not yet
contain a kernel with the patch below applied. I manually dumped the kernel config, compiled v6.12.1 with your
patch applied, and deployed it (to /lib/modules, /usr/lib/kernel, etc.) for the dump I provide below.

> Anyway, if you don't mind a bit of back and forth, 

Sure.

> would you mind adding
> the below patch to your kernel and doing:
> 
> (all assuming your kernel has ftrace enabled)
> 
>   echo 1 > /sys/kernel/debug/tracing/options/stacktrace
>   echo 1 > /proc/sys/kernel/traceoff_on_warning
> 
> running your test to failure and then dumping the trace into a file
> like:
> 
>   cat /sys/kernel/debug/tracing/trace > ~/trace

Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
buffer any longer. So I did the following on the serial console terminal instead:

tail -f /sys/kernel/debug/tracing/trace

Not sure whether there is a better way to go about this. Plus, even though we run the serial console at 1.5
megabaud, I am not fully sure it was able to keep up with logging what you are looking for.
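
One alternative I may try next (untested so far) is streaming trace_pipe instead, since it is a blocking, consuming read and should drop less than repeatedly re-reading the trace file; roughly (the helper name is made up):

```shell
#!/bin/sh
# Hypothetical helper: stream ftrace events as they are produced.
# trace_pipe is a consuming, blocking read, so unlike `tail -f` on the
# `trace` file it never re-reads or skips whole snapshots.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

stream_trace() {
    # $1: destination file (ideally somewhere that survives the WARN,
    # though nothing survives a hard hang or panic).
    cat "$TRACEFS/trace_pipe" > "$1"
}
```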

> Then compress the file (bzip2 or whatever is popular these days)

xz or zstd (;-p)

> and
> send it my way along with a dmesg dump (private is fine -- these things
> tend to be large-ish).

As mentioned before, there is nothing secret about it. Please find it here [4].

> Hopefully, this will give us a little clue as to where the double
> enqueue happens.

Yes, and do not hesitate to ask for any additional information; we are happy to help. Thanks!

> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d9d5a702f1a6..b9cd9b40a19f 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>  	scoped_guard (rq_lock, rq) {
>  		struct rq_flags *rf = &scope.rf;
>  
> +		if (dl_se == &rq->fair_server) {
> +			trace_printk("timer fair server %d throttled %d\n",
> +				     cpu_of(rq), dl_se->dl_throttled);
> +		}
> +
>  		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
>  			return HRTIMER_NORESTART;
>  
> @@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
>  		rq_lock(rq, &rf);
>  	}
>  
> +	if (dl_se == &rq->fair_server)
> +		trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
>  	sched_clock_tick();
>  	update_rq_clock(rq);
>  
> @@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("enqueue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>  
>  	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>  
> @@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("dequeue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>  
>  	if (RB_EMPTY_NODE(&dl_se->rb_node))
>  		return;

[1] https://drive.codethink.co.uk/s/N8CQipaNNN45gYM

[2] https://drive.codethink.co.uk/s/mpcPawXpCjPL8D3

[3] https://drive.codethink.co.uk/s/8RjHNTQQRpYgaLc

[4] https://drive.codethink.co.uk/s/MWtzWjLDtdD3E5i

Cheers

Marcel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-02 18:46         ` Marcel Ziswiler
@ 2024-12-09  9:49           ` Peter Zijlstra
  2024-12-10 16:05             ` Marcel Ziswiler
  2024-12-10 16:13           ` Steven Rostedt
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-12-09  9:49 UTC (permalink / raw)
  To: Marcel Ziswiler
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
	wuyun.abel, youssefesmat, tglx, efault


Sorry for the delay, I got laid low by snot monsters :/

On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:

> Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
> buffer any longer. So I did the following instead on the serial console terminal:
> 
> tail -f /sys/kernel/debug/tracing/trace
> 
> Not sure whether there is any better way to go about this. Plus, even though we run the serial console at 1.5
> megabaud, I am not fully sure whether it was able to keep up logging what you are looking for.

Ah, that is unfortunate. There is an ftrace_dump_on_oops option that
might be of help. And yes, dumping trace buffers over 1.5-megabaud serial
lines is tedious -- been there, done that, got a t-shirt and all that.
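
For what it's worth, that option can be flipped at runtime via sysctl or set on the kernel command line; a sketch (the helper name is made up, and the path is a variable only so it can be tried out unprivileged):

```shell
#!/bin/sh
# Sketch: dump the ftrace ring buffer to the console on an oops, so the
# trace survives even when the machine is too dead to read tracefs.
DUMP_SYSCTL="${DUMP_SYSCTL:-/proc/sys/kernel/ftrace_dump_on_oops}"

enable_ftrace_dump_on_oops() {
    echo 1 > "$DUMP_SYSCTL"
}
# At boot time, the equivalent is adding `ftrace_dump_on_oops` to the
# kernel command line.
```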

Still, let me see if perhaps making that WARN in enqueue_dl_entity()
return makes the whole thing less fatal.

I've included the traceoff_on_warning and ftrace_dump in the code, so
all you really need to still do is enable the stacktrace option.

   echo 1 > /sys/kernel/debug/tracing/options/stacktrace

> Yes, and do not hesitate to ask for any additional information et. al. we are happy to help. Thanks!

Could I bother you to try again with the below patch?

There are two new hunks vs the previous patch. The hunk in
enqueue_dl_entity() (the very last bit) will stop tracing and dump the
buffers when that condition is hit, in addition to aborting the
double enqueue, hopefully leaving the system in a slightly better state.

The other new hunk is the one for dl_server_stop() (the second hunk).
While going over the code last week, I found that this might be a
possible hole leading to the observed double enqueue, so fingers crossed.

---

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 33b4646f8b24..bd1df7612482 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 	scoped_guard (rq_lock, rq) {
 		struct rq_flags *rf = &scope.rf;
 
+		if (dl_se == &rq->fair_server) {
+			trace_printk("timer fair server %d throttled %d\n",
+				     cpu_of(rq), dl_se->dl_throttled);
+		}
+
 		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
 			return HRTIMER_NORESTART;
 
@@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 
 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
+	if (current->dl_server == dl_se) {
+		struct rq *rq = rq_of_dl_se(dl_se);
+		trace_printk("stop fair server %d\n", cpu_of(rq));
+		current->dl_server = NULL;
+	}
+
 	if (!dl_se->dl_runtime)
 		return;
 
@@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		rq_lock(rq, &rf);
 	}
 
+	if (dl_se == &rq->fair_server)
+		trace_printk("inactive fair server %d\n", cpu_of(rq));
+
 	sched_clock_tick();
 	update_rq_clock(rq);
 
@@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("enqueue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}
 
 	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
 
@@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("dequeue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}
 
 	if (RB_EMPTY_NODE(&dl_se->rb_node))
 		return;
@@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 static void
 enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
-	WARN_ON_ONCE(on_dl_rq(dl_se));
+	if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
+		tracing_off();
+		ftrace_dump(DUMP_ALL);
+		return;
+	}
 
 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
 

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
  2024-11-28 10:58   ` Peter Zijlstra
@ 2024-12-10  8:45   ` Luis Machado
  1 sibling, 0 replies; 51+ messages in thread
From: Luis Machado @ 2024-12-10  8:45 UTC (permalink / raw)
  To: Marcel Ziswiler, Peter Zijlstra, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel
  Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault

On 11/28/24 10:32, Marcel Ziswiler wrote:
> Hi all,
> 
> On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote:
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF patches.
>> They've been sitting in my git tree forever, it seems, and people have been
>> testing them and sending fixes.
>>
>> I've spent the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>>  - split up the huge delay-dequeue patch
>>  - tested/fixed cfs-bandwidth
>>  - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>>  - SCHED_BATCH is equivalent to RESPECT_SLICE
>>  - propagate min_slice up cgroups
>>  - CLOCK_THREAD_DVFS_ID
> 
> We found that the following 7 commits from this patch set crash in enqueue_dl_entity():
> 
> 54a58a787791 sched/fair: Implement DELAY_ZERO
> 152e11f6df29 sched/fair: Implement delayed dequeue
> e1459a50ba31 sched: Teach dequeue_task() about special task states
> a1c446611e31 sched,freezer: Mark TASK_FROZEN special
> 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
> f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
> 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
> 
> Resulting in the following crash dump (this is running v6.12.1):
> 
> [   14.652856] sched: DL replenish lagged too much
> [   16.572706] ------------[ cut here ]------------
> [   16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c
> [   16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan
> phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_trigg
> ered_buffer kfifo_buf rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm
> hantro_vpu v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2me
> m rockchipdrm drm_gpuvm drm_exec drm_shmem_helper analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec
> drm_display_helper snd_soc_audio_graph_card snd_soc_simple_card_utils
>  drm_dma_helper drm_kms_helper cfg80211 rfkill pci_endpoint_test drm backlight dm_mod dax
> [   16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15
> [   16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024
> [   16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   16.580273] pc : enqueue_dl_entity+0x46c/0x55c
> [   16.580661] lr : dl_server_start+0x44/0x12c
> [   16.581028] sp : ffff80008002bc00
> [   16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000
> [   16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009
> [   16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0
> [   16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000
> [   16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698
> [   16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000
> [   16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009
> [   16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440
> [   16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [   16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0
> [   16.587540] Call trace:
> [   16.587754]  enqueue_dl_entity+0x46c/0x55c
> [   16.588113]  dl_server_start+0x44/0x12c
> [   16.588449]  enqueue_task_fair+0x124/0x49c
> [   16.588807]  enqueue_task+0x3c/0xe0
> [   16.589113]  ttwu_do_activate.isra.0+0x6c/0x208
> [   16.589511]  try_to_wake_up+0x1d0/0x61c
> [   16.589847]  wake_up_process+0x18/0x24
> [   16.590175]  kick_pool+0x84/0x150
> [   16.590467]  __queue_work+0x2f4/0x544
> [   16.590788]  delayed_work_timer_fn+0x1c/0x28
> [   16.591161]  call_timer_fn+0x34/0x1ac
> [   16.591481]  __run_timer_base+0x20c/0x314
> [   16.591832]  run_timer_softirq+0x3c/0x78
> [   16.592176]  handle_softirqs+0x124/0x35c
> [   16.592520]  __do_softirq+0x14/0x20
> [   16.592827]  ____do_softirq+0x10/0x1c
> [   16.593148]  call_on_irq_stack+0x24/0x4c
> [   16.593490]  do_softirq_own_stack+0x1c/0x2c
> [   16.593857]  irq_exit_rcu+0x8c/0xc0
> [   16.594163]  el0_interrupt+0x48/0xbc
> [   16.594477]  __el0_irq_handler_common+0x18/0x24
> [   16.594874]  el0t_64_irq_handler+0x10/0x1c
> [   16.595232]  el0t_64_irq+0x190/0x194
> [   16.595545] ---[ end trace 0000000000000000 ]---
> [   16.595950] ------------[ cut here ]------------


A random data point, but I also had some difficulty making things boot on Android when trying
Vincent's nr_running accounting series, due to a very similar crash/stack trace. What I saw went a bit
further and actually crashed within task_contending, called from enqueue_dl_entity, possibly in one of
the inlined functions.

Even though the kernel was 6.8 and the patches were backports, it seems awfully similar to the above.

> 
> It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1].
> 
> We are running the exact same scheduler stress test both on Intel NUCs as well as Radxa ROCK 5B board farms.
> While so far we have not seen this on amd64, it crashes consistently/reproducibly on aarch64.
> 
> We haven't had time to put together a non-proprietary reproduction case yet, but I wanted to report our
> current findings and ask for feedback/suggestions from the community.
> 
> Thanks!
> 
> Cheers
> 
> Marcel
> 
> [1] https://hastebin.skyra.pw/hoqesigaye.yaml
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-09  9:49           ` Peter Zijlstra
@ 2024-12-10 16:05             ` Marcel Ziswiler
  0 siblings, 0 replies; 51+ messages in thread
From: Marcel Ziswiler @ 2024-12-10 16:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
	wuyun.abel, youssefesmat, tglx, efault

On Mon, 2024-12-09 at 10:49 +0100, Peter Zijlstra wrote:
> 
> Sorry for the delay, I got laid low by snot monsters :/
> 
> On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:
> 
> > Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
> > buffer any longer. So I did the following instead on the serial console terminal:
> > 
> > tail -f /sys/kernel/debug/tracing/trace
> > 
> > Not sure whether there is any better way to go about this. Plus, even though we run the serial console at 1.5
> > megabaud, I am not fully sure whether it was able to keep up logging what you are looking for.
> 
> Ah, that is unfortunate. There is a ftrace_dump_on_oops option that
> might be of help. And yes, dumping trace buffers over 1m5 serial lines
> is tedious -- been there done that, got a t-shirt and all that.
> 
> Still, let me see if perhaps making that WARN in enqueue_dl_entity()
> return makes the whole thing less fatal.
> 
> I've included the traceoff_on_warning and ftrace_dump in the code, so
> all you really need to still do is enable the stacktrace option.
> 
>    echo 1 > /sys/kernel/debug/tracing/options/stacktrace
> 
> > Yes, and do not hesitate to ask for any additional information etc.; we are happy to help. Thanks!
> 
> Could I bother you to try again with the below patch?

Sure, here you go.

https://drive.codethink.co.uk/s/HniZCtccDBMHpAK

> There are two new hunks vs the previous one, the hunk in
> enqueue_dl_entity() (the very last bit) will stop tracing and dump the
> buffers when that condition is hit in addition to then aborting the
> double enqueue, hopefully leaving the system in a slightly better state.
> 
> The other new hunk is the one for dl_server_stop() (second hunk), while
> going over the code last week, I found that this might be a possible
> hole leading to the observed double enqueue, so fingers crossed.
> 
> ---
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 33b4646f8b24..bd1df7612482 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>  	scoped_guard (rq_lock, rq) {
>  		struct rq_flags *rf = &scope.rf;
>  
> +		if (dl_se == &rq->fair_server) {
> +			trace_printk("timer fair server %d throttled %d\n",
> +				     cpu_of(rq), dl_se->dl_throttled);
> +		}
> +
>  		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
>  			return HRTIMER_NORESTART;
>  
> @@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>  
>  void dl_server_stop(struct sched_dl_entity *dl_se)
>  {
> +	if (current->dl_server == dl_se) {
> +		struct rq *rq = rq_of_dl_se(dl_se);
> +		trace_printk("stop fair server %d\n", cpu_of(rq));
> +		current->dl_server = NULL;
> +	}
> +
>  	if (!dl_se->dl_runtime)
>  		return;
>  
> @@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
>  		rq_lock(rq, &rf);
>  	}
>  
> +	if (dl_se == &rq->fair_server)
> +		trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
>  	sched_clock_tick();
>  	update_rq_clock(rq);
>  
> @@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("enqueue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>  
>  	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>  
> @@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("dequeue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>  
>  	if (RB_EMPTY_NODE(&dl_se->rb_node))
>  		return;
> @@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  static void
>  enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
>  {
> -	WARN_ON_ONCE(on_dl_rq(dl_se));
> +	if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
> +		tracing_off();
> +		ftrace_dump(DUMP_ALL);
> +		return;
> +	}
>  
>  	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);


* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-02 18:46         ` Marcel Ziswiler
  2024-12-09  9:49           ` Peter Zijlstra
@ 2024-12-10 16:13           ` Steven Rostedt
  1 sibling, 0 replies; 51+ messages in thread
From: Steven Rostedt @ 2024-12-10 16:13 UTC (permalink / raw)
  To: Marcel Ziswiler
  Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, vschneid, linux-kernel,
	kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault

On Mon, 02 Dec 2024 19:46:21 +0100
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:

> Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
> buffer any longer. So I did the following instead on the serial console terminal:
> 
> tail -f /sys/kernel/debug/tracing/trace
> 
> Not sure whether there is any better way to go about this. Plus, even though we run the serial console at 1.5
> megabaud, I am not fully sure whether it was able to keep up logging what you are looking for.

If the memory of the machine is persistent (it is on several of my
machines) you can use the persistent ring buffer.

Add to the kernel command line (or enable bootconfig that attaches a
command line to the kernel if you can't change the parameters):

  reserve_mem=20M:12M:trace trace_instance=boot_map^traceoff^traceprintk@trace

The above will create a "boot_map" instance with tracing off on boot
and trace_printk() going to it. Start tracing:

 trace-cmd start -B boot_map -p nop

Or replace "-p nop" with any events or tracers you want, including
function tracing. Then, after a crash:

  trace-cmd show -B boot_map

If the memory is persistent and you don't use KASLR (you may want to also
add nokaslr if arm64 supports KASLR and you use it), you should get
the trace right up to the crash.

See Documentation/trace/debugging.rst for more details.

-- Steve


* [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
@ 2024-12-29 22:51 Doug Smythies
  2025-01-06 11:57 ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2024-12-29 22:51 UTC (permalink / raw)
  To: peterz; +Cc: Doug Smythies, linux-kernel, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 2976 bytes --]

Hi Peter,

I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
An edited summary (actual log attached):

	 	82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
b10	bad	fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
b13	bad	54a58a787791 sched/fair: Implement DELAY_ZERO
	skip	152e11f6df29 sched/fair: Implement delayed dequeue
	skip	e1459a50ba31 sched: Teach dequeue_task() about special task states
	skip	a1c446611e31 sched,freezer: Mark TASK_FROZEN special
	skip	781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
	skip	f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
	skip	2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
b12	good	e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
		dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
b11	good	abc158c82ae5 sched: Prepare generic code for delayed dequeue
		e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()

Where "bN" is just my assigned kernel name for each bisection step.

In the linux-kernel email archives I found a thread that isolated these same commits.
It was from late November / early December:

https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174

An example of the turbostat manifestation of the issue:

doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
[sudo] password for doug:
Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
99.76   4800    4104    12304   73      80.08
99.76   4800    4104    12047   73      80.23
99.76   4800    879     12157   73      11.40
99.76   4800    26667   84214   72      557.23
99.76   4800    4104    12036   72      79.39

Where TSC_MHz was reported as 879, there was a big gap in time.
Like 4.7 seconds instead of 1.
Where TSC_MHz was reported as 26667, there was not a big gap in time.

It happens for about 5% of the samples + or - a lot.
It only happens when the workload is almost exactly 100%.
More load, it doesn't occur.
Less load, it doesn't occur. Although, I did get this once:

Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
91.46   4800    4104    11348   73      103.98
91.46   4800    4104    11353   73      103.89
91.50   4800    3903    11339   73      98.16
91.43   4800    4271    12001   73      108.52
91.45   4800    4148    11481   73      105.13
91.46   4800    4104    11341   73      103.96
91.46   4800    4104    11348   73      103.99

So, it might just be much less probable and less severe.

It happens over many different types of workload that I have tried.

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores. 2 thread per core, 12 CPUs.
OS: Ubuntu 24.04.1 LTS (server, no GUI)

... Doug


[-- Attachment #2: bisect-log.txt --]
[-- Type: text/plain, Size: 5985 bytes --]

doug@s19:~/kernel/linux$ git bisect bad
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
781773e3b68031bd001c0c18aa72e8470c225ebd
a1c446611e31ca5363d4db51e398271da1dce0af
e1459a50ba31831efdfc35278023d959e4ba775b
f12e148892ede8d9ee82bcd3e469e6d01fc077ac
152e11f6df293e816a6a37c69757033cdc72667d
2e0199df252a536a03f4cb0810324dff523d1e79
54a58a78779169f9c92a51facf6de7ce94962328
We cannot bisect more!

doug@s19:~/kernel/linux$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [98f7e32f20d28ec452afb208f9cffc08448a2652] Linux 6.11
git bisect good 98f7e32f20d28ec452afb208f9cffc08448a2652
# status: waiting for bad commit, 1 good commit known
# bad: [9852d85ec9d492ebef56dc5f229416c925758edc] Linux 6.12-rc1
git bisect bad 9852d85ec9d492ebef56dc5f229416c925758edc
# good: [176000734ee2978121fde22a954eb1eabb204329] Merge tag 'ata-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
git bisect good 176000734ee2978121fde22a954eb1eabb204329
# bad: [d0359e4ca0f26aaf3118124dfb562e3b3dca1c06] Merge tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad d0359e4ca0f26aaf3118124dfb562e3b3dca1c06
# bad: [171754c3808214d4fd8843eab584599a429deb52] Merge tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
git bisect bad 171754c3808214d4fd8843eab584599a429deb52
# good: [e55ef65510a401862b902dc979441ea10ae25c61] Merge tag 'amd-drm-next-6.12-2024-08-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good e55ef65510a401862b902dc979441ea10ae25c61
# good: [32bd3eb5fbab954e68adba8c0b6a43cf03605c93] Merge tag 'drm-intel-gt-next-2024-09-06' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
git bisect good 32bd3eb5fbab954e68adba8c0b6a43cf03605c93
# good: [726e2d0cf2bbc14e3bf38491cddda1a56fe18663] Merge tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping
git bisect good 726e2d0cf2bbc14e3bf38491cddda1a56fe18663
# good: [839c4f596f898edc424070dc8b517381572f8502] Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect good 839c4f596f898edc424070dc8b517381572f8502
# bad: [bd9bbc96e8356886971317f57994247ca491dbf1] sched: Rework dl_server
git bisect bad bd9bbc96e8356886971317f57994247ca491dbf1
# good: [863ccdbb918a77e3f011571f943020bf7f0b114b] sched: Allow sched_class::dequeue_task() to fail
git bisect good 863ccdbb918a77e3f011571f943020bf7f0b114b
# bad: [fc1892becd5672f52329a75c73117b60ac7841b7] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
git bisect bad fc1892becd5672f52329a75c73117b60ac7841b7
# skip: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
git bisect skip 2e0199df252a536a03f4cb0810324dff523d1e79
# skip: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
git bisect skip f12e148892ede8d9ee82bcd3e469e6d01fc077ac
# skip: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
git bisect skip e1459a50ba31831efdfc35278023d959e4ba775b
# skip: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
git bisect skip 781773e3b68031bd001c0c18aa72e8470c225ebd
# good: [abc158c82ae555078aa5dd2d8407c3df0f868904] sched: Prepare generic code for delayed dequeue
git bisect good abc158c82ae555078aa5dd2d8407c3df0f868904
# skip: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
git bisect skip a1c446611e31ca5363d4db51e398271da1dce0af
# good: [e28b5f8bda01720b5ce8456b48cf4b963f9a80a1] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
git bisect good e28b5f8bda01720b5ce8456b48cf4b963f9a80a1
# skip: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
git bisect skip 152e11f6df293e816a6a37c69757033cdc72667d
# bad: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
git bisect bad 54a58a78779169f9c92a51facf6de7ce94962328
# only skipped commits left to test
# possible first bad commit: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
# possible first bad commit: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
# possible first bad commit: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
# possible first bad commit: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
# possible first bad commit: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
# possible first bad commit: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
# possible first bad commit: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue

doug@s19:~/kernel/linux$ git log --oneline | grep -B 2 -A 10 54a58a78779
	 	82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
b10	bad	fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
b13	bad	54a58a787791 sched/fair: Implement DELAY_ZERO
	skip	152e11f6df29 sched/fair: Implement delayed dequeue
	skip	e1459a50ba31 sched: Teach dequeue_task() about special task states
	skip	a1c446611e31 sched,freezer: Mark TASK_FROZEN special
	skip	781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
	skip	f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
	skip	2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
b12	good	e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
		dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
b11	good	abc158c82ae5 sched: Prepare generic code for delayed dequeue
		e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()


* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-29 22:51 [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
@ 2025-01-06 11:57 ` Peter Zijlstra
  2025-01-06 15:01   ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-06 11:57 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Sun, Dec 29, 2024 at 02:51:43PM -0800, Doug Smythies wrote:
> Hi Peter,
> 
> I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
> After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
> An edited summary (actual log attached):
> 
> 	 	82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
> b10	bad	fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
> b13	bad	54a58a787791 sched/fair: Implement DELAY_ZERO
> 	skip	152e11f6df29 sched/fair: Implement delayed dequeue
> 	skip	e1459a50ba31 sched: Teach dequeue_task() about special task states
> 	skip	a1c446611e31 sched,freezer: Mark TASK_FROZEN special
> 	skip	781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
> 	skip	f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
> 	skip	2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
> b12	good	e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
> 		dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
> b11	good	abc158c82ae5 sched: Prepare generic code for delayed dequeue
> 		e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
> 
> Where "bN" is just my assigned kernel name for each bisection step.
> 
> In the linux-kernel email archives I found a thread that isolated these same commits.
> It was from late November / early December:
> 
> https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174
> 
> An example of the turbostat manifestation of the issue:
> 
> doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> [sudo] password for doug:
> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
> 99.76   4800    4104    12304   73      80.08
> 99.76   4800    4104    12047   73      80.23
> 99.76   4800    879     12157   73      11.40
> 99.76   4800    26667   84214   72      557.23
> 99.76   4800    4104    12036   72      79.39
> 
> Where TSC_MHz was reported as 879, there was a big gap in time.
> Like 4.7 seconds instead of 1.
> Where TSC_MHz was reported as 26667, there was not a big gap in time.
> 
> It happens for about 5% of the samples + or - a lot.
> It only happens when the workload is almost exactly 100%.
> More load, it doesn't occur.
> Less load, it doesn't occur. Although, I did get this once:
> 
> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
> 91.46   4800    4104    11348   73      103.98
> 91.46   4800    4104    11353   73      103.89
> 91.50   4800    3903    11339   73      98.16
> 91.43   4800    4271    12001   73      108.52
> 91.45   4800    4148    11481   73      105.13
> 91.46   4800    4104    11341   73      103.96
> 91.46   4800    4104    11348   73      103.99
> 
> So, it might just be much less probable and less severe.
> 
> It happens over many different types of workload that I have tried.

In private email you've communicated it happens due to
sched_setaffinity() sometimes taking multiple seconds.

I'm trying to reproduce by starting a bash 'while :; do :; done' spinner
for each CPU, but so far am not able to reproduce.

What is the easiest 100% load you're seeing this with?


* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 11:57 ` Peter Zijlstra
@ 2025-01-06 15:01   ` Doug Smythies
  2025-01-06 16:59     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-06 15:01 UTC (permalink / raw)
  To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.06 03:58 Peter Zijlstra wrote:
>On Sun, Dec 29, 2024 at 02:51:43PM -0800, Doug Smythies wrote:
>> Hi Peter,
>> 
>> I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
>> After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
>> An edited summary (actual log attached):
>> 
>> 	 	82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
>> b10	bad	fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
>> b13	bad	54a58a787791 sched/fair: Implement DELAY_ZERO
>> 	skip	152e11f6df29 sched/fair: Implement delayed dequeue
>> 	skip	e1459a50ba31 sched: Teach dequeue_task() about special task states
>> 	skip	a1c446611e31 sched,freezer: Mark TASK_FROZEN special
>> 	skip	781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
>> 	skip	f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
>> 	skip	2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
>> b12	good	e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
>> 		dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
>> b11	good	abc158c82ae5 sched: Prepare generic code for delayed dequeue
>> 		e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
>> 
>> Where "bN" is just my assigned kernel name for each bisection step.
>> 
>> In the linux-kernel email archives I found a thread that isolated these same commits.
>> It was from late November / early December:
>> 
>> https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174
>> 
>> An example of the turbostat manifestation of the issue:
>> 
>> doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>> [sudo] password for doug:
>> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
>> 99.76   4800    4104    12304   73      80.08
>> 99.76   4800    4104    12047   73      80.23
>> 99.76   4800    879     12157   73      11.40
>> 99.76   4800    26667   84214   72      557.23
>> 99.76   4800    4104    12036   72      79.39
>> 
>> Where TSC_MHz was reported as 879, there was a big gap in time.
>> Like 4.7 seconds instead of 1.
>> Where TSC_MHz was reported as 26667, there was not a big gap in time.
>> 
>> It happens for about 5% of the samples + or - a lot.
>> It only happens when the workload is almost exactly 100%.
>> More load, it doesn't occur.
>> Less load, it doesn't occur. Although, I did get this once:
>> 
>> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
>> 91.46   4800    4104    11348   73      103.98
>> 91.46   4800    4104    11353   73      103.89
>> 91.50   4800    3903    11339   73      98.16
>> 91.43   4800    4271    12001   73      108.52
>> 91.45   4800    4148    11481   73      105.13
>> 91.46   4800    4104    11341   73      103.96
>> 91.46   4800    4104    11348   73      103.99
>> 
>> So, it might just be much less probable and less severe.
>> 
>> It happens over many different types of workload that I have tried.
>
> In private email you've communicated it happens due to
> sched_setaffinity() sometimes taking multiple seconds.
>
> I'm trying to reproduce by starting a bash 'while :; do :; done' spinner
> for each CPU, but so far am not able to reproduce.

I have also been trying to reproduce the issue without using turbostat.
No success.

The other thing to note is that my test computer is otherwise very
very idle with no GUI and few services.

>
> What is the easiest 100% load you're seeing this with?

Lately, and specifically to be able to tell others, I have been using:

yes > /dev/null &

On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
I run 12 of those work loads.

... Doug




* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 15:01   ` Doug Smythies
@ 2025-01-06 16:59     ` Peter Zijlstra
  2025-01-06 17:04       ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-06 16:59 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:

> > What is the easiest 100% load you're seeing this with?
> 
> Lately, and specifically to be able to tell others, I have been using:
> 
> yes > /dev/null &
> 
> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> I run 12 of those work loads.

On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
do:

for ((i=0; i<40; i++)) ; do yes > /dev/null & done
tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1

But no so far, nada :-( I've tried with full preemption and voluntary,
HZ=1000.






* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 16:59     ` Peter Zijlstra
@ 2025-01-06 17:04       ` Peter Zijlstra
  2025-01-06 17:14         ` Peter Zijlstra
  2025-01-06 22:28         ` Doug Smythies
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-06 17:04 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
> 
> > > What is the easiest 100% load you're seeing this with?
> > 
> > Lately, and specifically to be able to tell others, I have been using:
> > 
> > yes > /dev/null &
> > 
> > On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> > I run 12 of those work loads.
> 
> On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
> do:
> 
> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> 
> But no so far, nada :-( I've tried with full preemption and voluntary,
> HZ=1000.
> 

And just as I send this, I see these happen:

100.00  3100    2793    40302   71      195.22
100.00  3100    2618    40459   72      183.58
100.00  3100    2993    46215   71      209.21
100.00  3100    2789    40467   71      195.19
99.92   3100    2798    40589   71      195.76
100.00  3100    2793    40397   72      195.46
...
100.00  3100    2844    41906   71      199.43
100.00  3100    2779    40468   71      194.51
99.96   3100    2320    40933   71      163.23
100.00  3100    3529    61823   72      245.70
100.00  3100    2793    40493   72      195.45
100.00  3100    2793    40462   72      195.56

They look like funny little blips. Nowhere near as bad as you had
though.


* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 17:04       ` Peter Zijlstra
@ 2025-01-06 17:14         ` Peter Zijlstra
  2025-01-07  1:24           ` Doug Smythies
  2025-01-06 22:28         ` Doug Smythies
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-06 17:14 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 06:04:55PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
> > 
> > > > What is the easiest 100% load you're seeing this with?
> > > 
> > > Lately, and specifically to be able to tell others, I have been using:
> > > 
> > > yes > /dev/null &
> > > 
> > > On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> > > I run 12 of those work loads.
> > 
> > On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
> > do:
> > 
> > for ((i=0; i<40; i++)) ; do yes > /dev/null & done
> > tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> > 
> > But no so far, nada :-( I've tried with full preemption and voluntary,
> > HZ=1000.
> > 
> 
> And just as I send this, I see these happen:
> 
> 100.00  3100    2793    40302   71      195.22
> 100.00  3100    2618    40459   72      183.58
> 100.00  3100    2993    46215   71      209.21
> 100.00  3100    2789    40467   71      195.19
> 99.92   3100    2798    40589   71      195.76
> 100.00  3100    2793    40397   72      195.46
> ...
> 100.00  3100    2844    41906   71      199.43
> 100.00  3100    2779    40468   71      194.51
> 99.96   3100    2320    40933   71      163.23
> 100.00  3100    3529    61823   72      245.70
> 100.00  3100    2793    40493   72      195.45
> 100.00  3100    2793    40462   72      195.56
> 
> They look like funny little blips. Nowhere near as bad as you had
> though.

Anyway, given you've confirmed disabling DELAY_DEQUEUE fixes things,
could you perhaps try the below hackery for me? It's a bit of a wild
guess, but throw stuff at the wall, see what sticks, etc.

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..fa4b9891f93a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3019,7 +3019,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	} else {
 
 		if (!is_migration_disabled(p)) {
-			if (task_on_rq_queued(p))
+			if (task_on_rq_queued(p) && !p->se.sched_delayed)
 				rq = move_queued_task(rq, rf, p, dest_cpu);
 
 			if (!pending->stop_pending) {
@@ -3776,28 +3776,30 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
+	CLASS(__task_rq_lock, rq_guard)(p);
+	struct rq *rq = rq_guard.rq;
 
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
-			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
-			 */
-			wakeup_preempt(rq, p, wake_flags);
+	if (!task_on_rq_queued(p))
+		return 0;
+
+	update_rq_clock(rq);
+	if (p->se.sched_delayed) {
+		int queue_flags = ENQUEUE_NOCLOCK | ENQUEUE_DELAYED;
+		if (!is_cpu_allowed(p, cpu_of(rq))) {
+			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+			return 0;
 		}
-		ttwu_do_wakeup(p);
-		ret = 1;
+		enqueue_task(rq, p, queue_flags);
 	}
-	__task_rq_unlock(rq, &rf);
-
-	return ret;
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
+	}
+	ttwu_do_wakeup(p);
+	return 1;
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65fa64845d9f..b4c1f6c06c18 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1793,6 +1793,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+		    __task_rq_unlock(_T->rq, &_T->rf),
+		    struct rq *rq; struct rq_flags rf)
+
 DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
 		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
 		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),


* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 17:04       ` Peter Zijlstra
  2025-01-06 17:14         ` Peter Zijlstra
@ 2025-01-06 22:28         ` Doug Smythies
  2025-01-07 11:26           ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-06 22:28 UTC (permalink / raw)
  To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

[-- Attachment #1: Type: text/plain, Size: 4909 bytes --]

On 2025.01.06 09:05 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>> 
>>>> What is the easiest 100% load you're seeing this with?
>>> 
>>> Lately, and specifically to be able to tell others, I have been using:
>>> 
>>> yes > /dev/null &
>>> 
>>> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
>>> I run 12 of those work loads.
>> 
>> On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
>> do:
>> 
>> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
>> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>> 
>> But not so far, nada :-( I've tried with full preemption and voluntary,
>> HZ=1000.

My HZ=1000 also. And: CONFIG_NO_HZ_FULL=y
 
>
> And just as I send this, I see these happen:
>
> 100.00  3100    2793    40302   71      195.22
> 100.00  3100    2618    40459   72      183.58
> 100.00  3100    2993    46215   71      209.21
> 100.00  3100    2789    40467   71      195.19
> 99.92   3100    2798    40589   71      195.76
> 100.00  3100    2793    40397   72      195.46
> ...
> 100.00  3100    2844    41906   71      199.43
> 100.00  3100    2779    40468   71      194.51
> 99.96   3100    2320    40933   71      163.23
> 100.00  3100    3529    61823   72      245.70
> 100.00  3100    2793    40493   72      195.45
> 100.00  3100    2793    40462   72      195.56
>
> They look like funny little blips. Nowhere near as bad as you had
> though.

Yes, I get a lot of the lesser magnitude ones.

The large magnitude ones are very much a function of what else is running.
If I just add a 0.5% load at a 73 hertz work/sleep frequency, then over a 2 hour and
31 minute test I got a maximum interval time of 1.68 seconds.
Without that small perturbation I got tons of interval times of 7 seconds,
meaning the regular 1 second interval plus 6 seconds for the CPU migration.
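The 0.5% load at 73 hertz work/sleep pattern can be approximated with a small sketch (illustrative Python, not the actual load generator used here; spin-loop granularity and sleep jitter make the real duty cycle only approximate):

```python
import time

def duty_cycle(load_fraction, freq_hz):
    """Split one work/sleep period into a busy slice and a sleep slice."""
    period = 1.0 / freq_hz
    busy = load_fraction * period
    return busy, period - busy

def run(load_fraction=0.005, freq_hz=73.0, cycles=73):
    """Generate roughly load_fraction CPU load at freq_hz for `cycles` periods."""
    busy_s, sleep_s = duty_cycle(load_fraction, freq_hz)
    for _ in range(cycles):
        t_end = time.perf_counter() + busy_s
        while time.perf_counter() < t_end:   # spin for the busy slice
            pass
        time.sleep(sleep_s)                  # sleep out the rest of the period
```

With the defaults, each ~13.7 ms period contains about 68 µs of spinning, which is enough to perturb the scheduler's idea of an otherwise fully saturated machine.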

Since I can not seem to function without making a graph, some example graphs
are attached.

By the way, and to make it easier to go away while tests run, I am now using this
turbostat command:

doug@s19:~/kernel/linux/tools/power/x86/turbostat$ sudo ./turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval
1 | grep "^[1-9]"
6005701 1736201357.014741       -       99.76   12034
177731  1736201386.221771       -       99.76   12034
6003699 1736201393.226740       -       99.76   14167
6003704 1736201422.253743       -       99.76   12040
6005700 1736201447.278740       -       99.76   12030
311699  1736201475.816740       -       99.76   12033

Which will show when a CPU migration took over 10 milliseconds.
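The grep trick works because turbostat prints the usec column first, right-justified with %5ld: values under 10000 µs get space-padded, so only rows of 10 ms or more begin with a digit (and the %7ld widening below raises that threshold to one second). The same selection can be written out explicitly; a sketch, with the field layout assumed from the turbostat code quoted in this thread:

```python
def long_migrations(lines, threshold_us=10_000):
    """Keep turbostat rows whose leading usec field is >= threshold_us.

    Mimics `grep "^[1-9]"` on output where usec is a right-justified
    first column: small values are space-padded, large ones start the
    line with a nonzero digit.
    """
    out = []
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # header or malformed row
        if int(fields[0]) >= threshold_us:
            out.append(line)
    return out
```

Unlike the grep, the explicit threshold does not depend on the printf field width.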
If you want to go further, for example to only display ones that took
over a second and to include the target CPU, then patch turbostat:

doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 58a487c225a7..f8a73cc8fbfc 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
                struct timeval tv;

                timersub(&t->tv_end, &t->tv_begin, &tv);
-               outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
+               outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
        }

        /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
@@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
        int i;
        int status;

+       gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
+
        if (cpu_migrate(cpu)) {
                fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
                return -1;
        }

-       gettimeofday(&t->tv_begin, (struct timezone *)NULL);
+//     gettimeofday(&t->tv_begin, (struct timezone *)NULL);

        if (first_counter_read)
                get_apic_id(t);

Example output:

sudo ./turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 | grep "^[1-9]"
1330709 1736202049.987740       -       99.76   12040
1330603 1736202049.987740       11      99.76   1003
6008710 1736202068.008741       -       99.76   12030
6008601 1736202068.008741       11      99.76   1003
2003709 1736202120.936740       -       99.76   12028
2003603 1736202120.936740       11      99.76   1002
6005710 1736202140.956741       -       99.76   12028
6005604 1736202140.956741       11      99.76   1002

In this short example all captures were for the CPU 5 to 11 migration.
2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.

I'll try, and report on, your test patch from the other email later.


[-- Attachment #2: turbostat-sampling-issue-seconds.png --]
[-- Type: image/png, Size: 32233 bytes --]

[-- Attachment #3: turbostat-sampling-issue-seconds-detail-a.png --]
[-- Type: image/png, Size: 42780 bytes --]

[-- Attachment #4: turbostat-sampling-issue-seconds-detail-b.png --]
[-- Type: image/png, Size: 87699 bytes --]


* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 17:14         ` Peter Zijlstra
@ 2025-01-07  1:24           ` Doug Smythies
  2025-01-07 10:49             ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-07  1:24 UTC (permalink / raw)
  To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.06 09:14 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 06:04:55PM +0100, Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
>>> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>>>
>>>>> What is the easiest 100% load you're seeing this with?
>>>>
>>>> Lately, and specifically to be able to tell others, I have been using:
>>>>
>>>> yes > /dev/null &
>>>>
>>>> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
>>>> I run 12 of those work loads.
>>>
>>> On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
>>> do:
>>>
>>> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
>>> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>>>
>>> But not so far, nada :-( I've tried with full preemption and voluntary,
>>> HZ=1000.
>>>
>>
>> And just as I send this, I see these happen:
>>
>> 100.00  3100    2793    40302   71      195.22
>> 100.00  3100    2618    40459   72      183.58
>> 100.00  3100    2993    46215   71      209.21
>> 100.00  3100    2789    40467   71      195.19
>> 99.92   3100    2798    40589   71      195.76
>> 100.00  3100    2793    40397   72      195.46
>> ...
>> 100.00  3100    2844    41906   71      199.43
>> 100.00  3100    2779    40468   71      194.51
>> 99.96   3100    2320    40933   71      163.23
>> 100.00  3100    3529    61823   72      245.70
>> 100.00  3100    2793    40493   72      195.45
>> 100.00  3100    2793    40462   72      195.56
>>
>> They look like funny little blips. Nowhere near as bad as you had
>> though.
>
> Anyway, given you've confirmed disabling DELAY_DEQUEUE fixes things,
> could you perhaps try the below hackery for me? Its a bit of a wild
> guess, but throw stuff at wall, see what sticks etc..
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 84902936a620..fa4b9891f93a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3019,7 +3019,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>       } else {
>
>               if (!is_migration_disabled(p)) {
> -                     if (task_on_rq_queued(p))
> +                     if (task_on_rq_queued(p) && !p->se.sched_delayed)
>                               rq = move_queued_task(rq, rf, p, dest_cpu);
>
>                       if (!pending->stop_pending) {
> @@ -3776,28 +3776,30 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>   */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> -     struct rq_flags rf;
> -     struct rq *rq;
> -     int ret = 0;
> +     CLASS(__task_rq_lock, rq_guard)(p);
> +     struct rq *rq = rq_guard.rq;
>
> -     rq = __task_rq_lock(p, &rf);
> -     if (task_on_rq_queued(p)) {
> -             update_rq_clock(rq);
> -             if (p->se.sched_delayed)
> -                     enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -             if (!task_on_cpu(rq, p)) {
> -                     /*
> -                      * When on_rq && !on_cpu the task is preempted, see if
> -                      * it should preempt the task that is current now.
> -                      */
> -                     wakeup_preempt(rq, p, wake_flags);
> +     if (!task_on_rq_queued(p))
> +             return 0;
> +
> +     update_rq_clock(rq);
> +     if (p->se.sched_delayed) {
> +             int queue_flags = ENQUEUE_NOCLOCK | ENQUEUE_DELAYED;
> +             if (!is_cpu_allowed(p, cpu_of(rq))) {
> +                     dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +                     return 0;
>               }
> -             ttwu_do_wakeup(p);
> -             ret = 1;
> +             enqueue_task(rq, p, queue_flags);
>       }
> -     __task_rq_unlock(rq, &rf);
> -
> -     return ret;
> +     if (!task_on_cpu(rq, p)) {
> +             /*
> +              * When on_rq && !on_cpu the task is preempted, see if
> +              * it should preempt the task that is current now.
> +              */
> +             wakeup_preempt(rq, p, wake_flags);
> +     }
> +     ttwu_do_wakeup(p);
> +     return 1;
>  }
>
>  #ifdef CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 65fa64845d9f..b4c1f6c06c18 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1793,6 +1793,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>       raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> +                 _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> +                 __task_rq_unlock(_T->rq, &_T->rf),
> +                 struct rq *rq; struct rq_flags rf)
> +
>  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
>                   _T->rq = task_rq_lock(_T->lock, &_T->rf),
>                   task_rq_unlock(_T->rq, _T->lock, &_T->rf),

I tried the patch on top of kernel 6.13-rc6.
It did not fix the issue.

I used my patched version of turbostat as per the previous email,
so that I could see which CPU and the CPU migration time.
CPU migration times >= 10 milliseconds are listed.
Results:

doug@s19:~$ date
Mon Jan  6 04:37:58 PM PST 2025
doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval
1 | grep -v \- | grep -e "^[1-9]" -e "^ [1-9]" -e "^  [1-9]"
usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
  16599 1736210307.324843       11      99.76   1004
6003601 1736210314.329844       11      99.76   1018
1164604 1736210330.509843       11      99.76   1003
6003604 1736210347.524844       11      99.76   1005
  23602 1736210369.570843       11      99.76   1003
 161680 1736210384.748843       7       99.76   1002
5750600 1736210398.507843       11      99.76   1005
6003607 1736210478.587844       11      99.76   1002
 210645 1736210479.799843       3       99.76   7017
  22602 1736210495.838843       11      99.76   1002
6003390 1736210520.861844       11      99.76   1002
 108627 1736210534.984843       10      99.76   1002
  23604 1736210570.047843       11      99.76   1003
6004604 1736210600.076843       11      99.76   1003
1895606 1736210606.977843       11      99.76   1002
3110603 1736210745.226843       11      99.76   1003
6003606 1736210765.244844       11      99.76   1002
6003605 1736210785.262843       11      99.76   1002
 401642 1736210847.732843       9       99.76   1002
6003604 1736210891.781843       11      99.76   1003
6003607 1736210914.802844       11      99.76   1002
6003605 1736210945.831843       11      99.76   1002
5579609 1736210968.428848       11      99.76   1002
6003600 1736210975.433844       11      99.76   6585
  93623 1736210985.537843       10      99.76   1003
5005605 1736210994.547843       11      99.76   1003
2654601 1736211029.244843       11      99.76   1004
  17604 1736211057.290843       11      99.76   1003
  23598 1736211077.334843       11      99.76   1006
 114671 1736211079.451843       2       99.76   1003
6003603 1736211105.475843       11      99.76   1002
^Cdoug@s19:~$ date
Mon Jan  6 04:52:18 PM PST 2025
doug@s19:~$ uname -a
Linux s19 6.13.0-rc6-peterz #1320 SMP PREEMPT_DYNAMIC Mon Jan  6 16:25:39 PST 2025 x86_64 x86_64 x86_64 GNU/Linux






* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-07  1:24           ` Doug Smythies
@ 2025-01-07 10:49             ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-07 10:49 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 05:24:44PM -0800, Doug Smythies wrote:
> I tried the patch on top of kernel 6.13-rc6.
> It did not fix the issue.

Oh well, it was a long shot anyway.

I'll try and make it reproduce again.


* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 22:28         ` Doug Smythies
@ 2025-01-07 11:26           ` Peter Zijlstra
  2025-01-07 15:04             ` Doug Smythies
  2025-01-07 19:23             ` Peter Zijlstra
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-07 11:26 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote:

> Which will show when a CPU migration took over 10 milliseconds.
> If you want to go further, for example to only display ones that took
> over a second and to include the target CPU, then patch turbostat:
> 
> doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff
> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> index 58a487c225a7..f8a73cc8fbfc 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
>                 struct timeval tv;
> 
>                 timersub(&t->tv_end, &t->tv_begin, &tv);
> -               outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
> +               outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
>         }
> 
>         /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
> @@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
>         int i;
>         int status;
> 
> +       gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
> +
>         if (cpu_migrate(cpu)) {
>                 fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
>                 return -1;
>         }
> 
> -       gettimeofday(&t->tv_begin, (struct timezone *)NULL);
> +//     gettimeofday(&t->tv_begin, (struct timezone *)NULL);
> 
>         if (first_counter_read)
>                 get_apic_id(t);
> 
> 

So I've taken the second node offline, running with 10 cores (20
threads) now.

usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
 106783 1736248404.951438       -       100.00  20119
     46 1736248404.844701       0       100.00  1005
     41 1736248404.844742       20      100.00  1007
     42 1736248404.844784       1       100.00  1005
     40 1736248404.844824       21      100.00  1006
     41 1736248404.844865       2       100.00  1005
     40 1736248404.844905       22      100.00  1006
     41 1736248404.844946       3       100.00  1006
     40 1736248404.844986       23      100.00  1005
     41 1736248404.845027       4       100.00  1005
     40 1736248404.845067       24      100.00  1006
     41 1736248404.845108       5       100.00  1011
     40 1736248404.845149       25      100.00  1005
     41 1736248404.845190       6       100.00  1005
     40 1736248404.845230       26      100.00  1005
     42 1736248404.845272       7       100.00  1007
     41 1736248404.845313       27      100.00  1005
     41 1736248404.845355       8       100.00  1005
     42 1736248404.845397       28      100.00  1006
     46 1736248404.845443       9       100.00  1009
 105995 1736248404.951438       29      100.00  1005

Is by far the worst I've had in the past few minutes playing with this.

If I get a blip (>10000) then it is always on the last CPU, are you
seeing the same thing?

> In this short example all captures were for the CPU 5 to 11 migration.
> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.

This seems to suggest you are, always on CPU 11.

Weird!

Anyway, let me see if I can capture a trace of this..


* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-07 11:26           ` Peter Zijlstra
@ 2025-01-07 15:04             ` Doug Smythies
  2025-01-07 16:25               ` Doug Smythies
  2025-01-07 19:23             ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-07 15:04 UTC (permalink / raw)
  To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.07 03:26 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote:
>
>> Which will show when a CPU migration took over 10 milliseconds.
>> If you want to go further, for example to only display ones that took
>> over a second and to include the target CPU, then patch turbostat:
>> 
>> doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff
>> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
>> index 58a487c225a7..f8a73cc8fbfc 100644
>> --- a/tools/power/x86/turbostat/turbostat.c
>> +++ b/tools/power/x86/turbostat/turbostat.c
>> @@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
>>                 struct timeval tv;
>> 
>>                 timersub(&t->tv_end, &t->tv_begin, &tv);
>> -               outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
>> +               outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
>>         }
>> 
>>         /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
>> @@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
>>         int i;
>>         int status;
>> 
>> +       gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
>> +
>>         if (cpu_migrate(cpu)) {
>>                 fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
>>                 return -1;
>>         }
>> 
>> -       gettimeofday(&t->tv_begin, (struct timezone *)NULL);
>> +//     gettimeofday(&t->tv_begin, (struct timezone *)NULL);
>> 
>>         if (first_counter_read)
>>                 get_apic_id(t);
>> 
>> 
>
> So I've taken the second node offline, running with 10 cores (20
> threads) now.
>
> usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
> 106783 1736248404.951438       -       100.00  20119
>     46 1736248404.844701       0       100.00  1005
>     41 1736248404.844742       20      100.00  1007
>     42 1736248404.844784       1       100.00  1005
>     40 1736248404.844824       21      100.00  1006
>     41 1736248404.844865       2       100.00  1005
>     40 1736248404.844905       22      100.00  1006
>     41 1736248404.844946       3       100.00  1006
>     40 1736248404.844986       23      100.00  1005
>     41 1736248404.845027       4       100.00  1005
>     40 1736248404.845067       24      100.00  1006
>     41 1736248404.845108       5       100.00  1011
>     40 1736248404.845149       25      100.00  1005
>     41 1736248404.845190       6       100.00  1005
>     40 1736248404.845230       26      100.00  1005
>     42 1736248404.845272       7       100.00  1007
>     41 1736248404.845313       27      100.00  1005
>     41 1736248404.845355       8       100.00  1005
>     42 1736248404.845397       28      100.00  1006
>     46 1736248404.845443       9       100.00  1009
> 105995 1736248404.951438       29      100.00  1005
>
> Is by far the worst I've had in the past few minutes playing with this.
>
> If I get a blip (>10000) then it is always on the last CPU, are you
> seeing the same thing?

More or less, yes. The very long migrations are dominated by the
CPU 5 to CPU 11 migration.

Here is data from yesterday for other CPUs:
usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
605706 1736224605.542844       0       99.76   1922
  10001 1736224605.561844       1       99.76   1922
  10999 1736224605.572843       7       99.76   1923
  11001 1736224605.583844       2       99.76   1925
  11000 1736224605.606844       4       99.76   1924
  10999 1736224605.617843       10      99.76   1923
 105001 1736224605.722844       5       99.76   1922
 465657 1736224608.190843       8       99.76   1002
 494000 1736224608.684843       3       99.76   1003
 395674 1736224610.081843       7       99.76   1964
  19679 1736224617.108843       7       99.76   1003
  37709 1736224636.633845       0       99.76   1003
  65641 1736224689.796843       9       99.76   1003
 406631 1736224693.206843       4       99.76   1002
 105622 1736225026.238843       10      99.76   1003
 409622 1736225053.673843       10      99.76   1003
  16706 1736225302.149847       0       99.76   1820
  10000 1736225302.185846       4       99.76   1825
  19663 1736225317.249844       7       99.76   1012
>
>> In this short example all captures were for the CPU 5 to 11 migration.
>> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.
>
> This seems to suggest you are, always on CPU 11.
>
> Weird!

Yes, weird. I think, but am not certain, that the CPU sequence in turbostat's
per-interval loop is:

Wake on highest numbered CPU (11 in my case)
Do a bunch of work that can be done without MSR reads.
For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case)
  Do the CPU specific work
Finish the intervals work and printing and such on CPU 11.
Sleep for the interval time (we have been using 1 second)

Without any proof, I was thinking the CPU 11 dominance
for the long migration issue was due to the other bits of
work done on that CPU.
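That topological iteration order (cores first, hyperthread siblings interleaved) can be reproduced with a tiny sketch; it assumes the common numbering where thread t of core c is CPU t*ncores + c, which matches the i5-10600K layout above but is not universal:

```python
def topo_order(num_cores, threads_per_core):
    """CPU visit order: iterate cores, and within each core its sibling
    threads, assuming thread t of core c is numbered t*num_cores + c."""
    return [t * num_cores + c
            for c in range(num_cores)
            for t in range(threads_per_core)]
```

topo_order(6, 2) gives 0,6,1,7,2,8,3,9,4,10,5,11, matching the sequence above: CPU 11 is always visited last, just before the end-of-interval work.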

> Anyway, let me see if I can capture a trace of this..




* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-07 15:04             ` Doug Smythies
@ 2025-01-07 16:25               ` Doug Smythies
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-07 16:25 UTC (permalink / raw)
  To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.07 07:04 Doug Smythies wrote:
> On 2025.01.07 03:26 Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote:

>> If I get a blip (>10000) then it is always on the last CPU, are you
>> seeing the same thing?
>
> More or less, yes. The very long migrations are dominated by the
> CPU 5 to CPU 11 migration.
>>
>>> In this short example all captures were for the CPU 5 to 11 migration.
>>> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.
>>
>> This seems to suggest you are, always on CPU 11.
>>
>> Weird!
>
> Yes, weird. I think, but am not certain, that the CPU sequence in turbostat's
> per-interval loop is:
>
> Wake on highest numbered CPU (11 in my case)
> Do a bunch of work that can be done without MSR reads.
> For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case)
>  Do the CPU specific work
> Finish the intervals work and printing and such on CPU 11.
> Sleep for the interval time (we have been using 1 second)
>
> Without any proof, I was thinking the CPU 11 dominance
> for the long migration issue was due to the other bits of
> work done on that CPU.

To test this theory I hacked turbostat to migrate to CPU 3
after the CPU-specific work loop.
So now the per interval workflow is:

Wake on CPU 3
Do a bunch of work that can be done without MSR reads.
For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case)
 Do the CPU specific work
Migrate to CPU 3
Finish the intervals work and printing and such on CPU 3.
Sleep for the interval time

And now I get:

usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
  12646 1736266361.533240       3       99.76   1005
6004653 1736266384.555240       3       99.76   1006
6004653 1736266393.563240       3       99.76   1004
6005648 1736266400.570240       3       99.76   7019
6005653 1736266432.602240       3       99.76   1005
6003656 1736266479.652242       3       99.76   1004
  15636 1736266501.690240       3       99.76   1005
4948651 1736266528.661240       3       99.76   1004
 521672 1736266534.192240       2       99.76   1002
1117651 1736266585.360239       3       99.76   1004
6003652 1736266592.365240       3       99.76   2123
3526648 1736266612.909240       3       99.76   1004
6003650 1736266632.927240       3       99.76   1005
 396623 1736266636.327239       10      99.76   1002
6003654 1736266660.349240       3       99.76   1005
6003653 1736266682.369239       3       99.76   1006
6003653 1736266703.388240       3       99.76   1004
 514673 1736266718.918240       2       99.76   1003
  14652 1736266725.940240       3       99.76   1004
6003653 1736266745.958240       3       99.76   1004
6003653 1736266767.978240       3       99.76   1006
6003652 1736266794.002240       3       99.76   1006
6003653 1736266815.021240       3       99.76   1004
2496651 1736266841.542239       3       99.76   1007
6003647 1736266848.547240       3       99.76   3504  <<< 8 minutes 7 seconds elapsed




* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-07 11:26           ` Peter Zijlstra
  2025-01-07 15:04             ` Doug Smythies
@ 2025-01-07 19:23             ` Peter Zijlstra
  2025-01-08  5:15               ` Doug Smythies
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-07 19:23 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Tue, Jan 07, 2025 at 12:26:07PM +0100, Peter Zijlstra wrote:

> So I've taken the second node offline, running with 10 cores (20
> threads) now.
> 
> usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
>  106783 1736248404.951438       -       100.00  20119
>      46 1736248404.844701       0       100.00  1005
>      41 1736248404.844742       20      100.00  1007
>      42 1736248404.844784       1       100.00  1005
>      40 1736248404.844824       21      100.00  1006
>      41 1736248404.844865       2       100.00  1005
>      40 1736248404.844905       22      100.00  1006
>      41 1736248404.844946       3       100.00  1006
>      40 1736248404.844986       23      100.00  1005
>      41 1736248404.845027       4       100.00  1005
>      40 1736248404.845067       24      100.00  1006
>      41 1736248404.845108       5       100.00  1011
>      40 1736248404.845149       25      100.00  1005
>      41 1736248404.845190       6       100.00  1005
>      40 1736248404.845230       26      100.00  1005
>      42 1736248404.845272       7       100.00  1007
>      41 1736248404.845313       27      100.00  1005
>      41 1736248404.845355       8       100.00  1005
>      42 1736248404.845397       28      100.00  1006
>      46 1736248404.845443       9       100.00  1009
>  105995 1736248404.951438       29      100.00  1005
> 
> Is by far the worst I've had in the past few minutes playing with this.

Much also depends on how much (or if at all) cgroups are used.

What exact cgroup config are you having? /sys/kernel/debug/sched/debug
should be able to tell you.


* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-07 19:23             ` Peter Zijlstra
@ 2025-01-08  5:15               ` Doug Smythies
  2025-01-08 13:12                 ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-08  5:15 UTC (permalink / raw)
  To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.07 11:24 Peter Zijlstra wrote:
> On Tue, Jan 07, 2025 at 12:26:07PM +0100, Peter Zijlstra wrote:

>> So I've taken the second node offline, running with 10 cores (20
>> threads) now.
>> 
>> usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
>>  106783 1736248404.951438       -       100.00  20119
>>      46 1736248404.844701       0       100.00  1005
>>      41 1736248404.844742       20      100.00  1007
>>      42 1736248404.844784       1       100.00  1005
>>      40 1736248404.844824       21      100.00  1006
>>      41 1736248404.844865       2       100.00  1005
>>      40 1736248404.844905       22      100.00  1006
>>      41 1736248404.844946       3       100.00  1006
>>      40 1736248404.844986       23      100.00  1005
>>      41 1736248404.845027       4       100.00  1005
>>      40 1736248404.845067       24      100.00  1006
>>      41 1736248404.845108       5       100.00  1011
>>      40 1736248404.845149       25      100.00  1005
>>      41 1736248404.845190       6       100.00  1005
>>      40 1736248404.845230       26      100.00  1005
>>      42 1736248404.845272       7       100.00  1007
>>      41 1736248404.845313       27      100.00  1005
>>      41 1736248404.845355       8       100.00  1005
>>      42 1736248404.845397       28      100.00  1006
>>      46 1736248404.845443       9       100.00  1009
>>  105995 1736248404.951438       29      100.00  1005
>> 
>> Is by far the worst I've had in the past few minutes playing with this.
>
> Much also depends on how much (or if at all) cgroups are used.
>
> What exact cgroup config are you having? /sys/kernel/debug/sched/debug
> should be able to tell you.

I do not know.
I'll capture the above output, compress it, and send it to you.

I did also boot with systemd.unified_cgroup_hierarchy=0
and it made no difference.




* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-08  5:15               ` Doug Smythies
@ 2025-01-08 13:12                 ` Peter Zijlstra
  2025-01-08 15:48                   ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-08 13:12 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot


I failed to realize the follow up email was private, so duplicating that
here again, but also new content :-)

On Tue, Jan 07, 2025 at 09:15:59PM -0800, Doug Smythies wrote:
> On 2025.01.07 11:24 Peter Zijlstra wrote:

> > What exact cgroup config are you having? /sys/kernel/debug/sched/debug
> > should be able to tell you.
> 
> I do not know.
> I'll capture the above output, compress it, and send it to you.
> 
> I did also boot with systemd.unified_cgroup_hierarchy=0
> and it made no difference.

I think you need: "cgroup_disable=cpu noautogroup" to fully disable all
the cpu-cgroup muck. Anyway:

$ zcat cgroup2.txt.gz | grep -e yes -e turbo | awk '{print $2 "\t" $16}'
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
turbostat       /autogroup-286
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
yes     /user.slice/user-1000.slice/session-1.scope
turbostat       /autogroup-286

That matches the scenario where I could reproduce, two competing groups.

I'm seeing wild vruntime divergence when this happens -- this is
definitely wonky. Basically the turbostat group gets starved for a
while, while the yes group catches up.

It looks like reweight_entity() is shooting out the cgroup entity to the
right.

So it builds up some negative lag (received surplus service) and then,
because turbostat goes to sleep for a second, its cgroup's share gets
truncated to 2 and it shoots the cgroup entity out waaaaaaaay far.

Thing is, waking up *should* fix that up again, but that doesn't appear
to happen, leaving us up a creek.

/me noodles a bit....

Does this help?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0e58e51801f..daa62cfa3092 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7000,6 +7063,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	if (flags & ENQUEUE_DELAYED) {
 		requeue_delayed_entity(se);
+		se = se->parent;
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+			update_load_avg(cfs_rq, se, UPDATE_TG);
+			se_update_runnable(se);
+			update_cfs_group(se);
+		}
 		return;
 	}
 


* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-08 13:12                 ` Peter Zijlstra
@ 2025-01-08 15:48                   ` Doug Smythies
  2025-01-09 10:59                     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-08 15:48 UTC (permalink / raw)
  To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.08 05:12 Peter Zijlstra wrote:
> On Tue, Jan 07, 2025 at 09:15:59PM -0800, Doug Smythies wrote:
>> On 2025.07.11:24 Peter Zijlstra wrote:
>
>>> What exact cgroup config are you having? /sys/kernel/debug/sched/debug
>>> should be able to tell you.
>>
>> I do not know.
>> I'll capture the above output, compress it, and send it to you.
>>
>> I did also boot with systemd.unified_cgroup_hierarchy=0
>> and it made no difference.
>
> I think you need: "cgroup_disable=cpu noautogroup" to fully disable all
> the cpu-cgroup muck. Anyway:
>
> $ zcat cgroup2.txt.gz | grep -e yes -e turbo | awk '{print $2 "\t" $16}'
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> turbostat       /autogroup-286
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> yes     /user.slice/user-1000.slice/session-1.scope
> turbostat       /autogroup-286
>
> That matches the scenario where I could reproduce, two competing groups.
>
> I'm seeing wild vruntime divergence when this happens -- this is
> definitely wonky. Basically the turbostat group gets starved for a
> while, while the yes group catches up.
>
> It looks like reweight_entity() is shooting out the cgroup entity to the
> right.
>
> So it builds up some negative lag (received surplus service) and then,
> because turbostat goes to sleep for a second, its cgroup's share gets
> truncated to 2 and it shoots the cgroup entity out waaaaaaaay far.
>
> Thing is, waking up *should* fix that up again, but that doesn't appear
> to happen, leaving us up a creek.
>
> /me noodles a bit....
>
> Does this help?

Sorry, but no it did not help.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c0e58e51801f..daa62cfa3092 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7000,6 +7063,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>       if (flags & ENQUEUE_DELAYED) {
>               requeue_delayed_entity(se);
> +             se = se->parent;
> +             for_each_sched_entity(se) {
> +                     cfs_rq = cfs_rq_of(se);
> +                     update_load_avg(cfs_rq, se, UPDATE_TG);
> +                     se_update_runnable(se);
> +                     update_cfs_group(se);
> +             }
>               return;
>       }




* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-08 15:48                   ` Doug Smythies
@ 2025-01-09 10:59                     ` Peter Zijlstra
  2025-01-09 12:18                       ` [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag tip-bot2 for Peter Zijlstra
  2025-01-10  5:09                       ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-09 10:59 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-kernel, vincent.guittot, Ingo Molnar, wuyun.abel

On Wed, Jan 08, 2025 at 07:48:29AM -0800, Doug Smythies wrote:

> > Does this help?
> 
> Sorry, but no it did not help.

Mooo :-(

OK, new day, new chances though.

I noticed this in my traces today:

       turbostat-1222    [006] d..2.   311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
       turbostat-1222    [006] d..2.   311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
                               { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }

Which is a weight transition: 1048576 -> 2 -> 1048576.

One would expect the lag to shoot out *AND* come back, notably:

  -1123*1048576/2 = -588775424
  -588775424*2/1048576 = -1123

Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.

This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.

And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation of the new vruntime using the new lag
is broken, for it does not consider the logic set out in place_entity().

With the below patch, I now see things like:
    
    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }

Which isn't perfect yet, but much closer.

Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0e58e51801f..b9575db5ecfe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -689,21 +689,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
  *
  * XXX could add max_slice to the augmented data to track this.
  */
-static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	s64 vlag, limit;
 
-	vlag = avruntime - se->vruntime;
-	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
-
-	return clamp(vlag, -limit, limit);
-}
-
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
 	SCHED_WARN_ON(!se->on_rq);
 
-	se->vlag = entity_lag(avg_vruntime(cfs_rq), se);
+	vlag = avg_vruntime(cfs_rq) - se->vruntime;
+	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+
+	se->vlag = clamp(vlag, -limit, limit);
 }
 
 /*
@@ -3770,137 +3765,32 @@ static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
-static void reweight_eevdf(struct sched_entity *se, u64 avruntime,
-			   unsigned long weight)
-{
-	unsigned long old_weight = se->load.weight;
-	s64 vlag, vslice;
-
-	/*
-	 * VRUNTIME
-	 * --------
-	 *
-	 * COROLLARY #1: The virtual runtime of the entity needs to be
-	 * adjusted if re-weight at !0-lag point.
-	 *
-	 * Proof: For contradiction assume this is not true, so we can
-	 * re-weight without changing vruntime at !0-lag point.
-	 *
-	 *             Weight	VRuntime   Avg-VRuntime
-	 *     before    w          v            V
-	 *      after    w'         v'           V'
-	 *
-	 * Since lag needs to be preserved through re-weight:
-	 *
-	 *	lag = (V - v)*w = (V'- v')*w', where v = v'
-	 *	==>	V' = (V - v)*w/w' + v		(1)
-	 *
-	 * Let W be the total weight of the entities before reweight,
-	 * since V' is the new weighted average of entities:
-	 *
-	 *	V' = (WV + w'v - wv) / (W + w' - w)	(2)
-	 *
-	 * by using (1) & (2) we obtain:
-	 *
-	 *	(WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
-	 *	==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
-	 *	==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
-	 *	==>	(V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
-	 *
-	 * Since we are doing at !0-lag point which means V != v, we
-	 * can simplify (3):
-	 *
-	 *	==>	W / (W + w' - w) = w / w'
-	 *	==>	Ww' = Ww + ww' - ww
-	 *	==>	W * (w' - w) = w * (w' - w)
-	 *	==>	W = w	(re-weight indicates w' != w)
-	 *
-	 * So the cfs_rq contains only one entity, hence vruntime of
-	 * the entity @v should always equal to the cfs_rq's weighted
-	 * average vruntime @V, which means we will always re-weight
-	 * at 0-lag point, thus breach assumption. Proof completed.
-	 *
-	 *
-	 * COROLLARY #2: Re-weight does NOT affect weighted average
-	 * vruntime of all the entities.
-	 *
-	 * Proof: According to corollary #1, Eq. (1) should be:
-	 *
-	 *	(V - v)*w = (V' - v')*w'
-	 *	==>    v' = V' - (V - v)*w/w'		(4)
-	 *
-	 * According to the weighted average formula, we have:
-	 *
-	 *	V' = (WV - wv + w'v') / (W - w + w')
-	 *	   = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
-	 *	   = (WV - wv + w'V' - Vw + wv) / (W - w + w')
-	 *	   = (WV + w'V' - Vw) / (W - w + w')
-	 *
-	 *	==>  V'*(W - w + w') = WV + w'V' - Vw
-	 *	==>	V' * (W - w) = (W - w) * V	(5)
-	 *
-	 * If the entity is the only one in the cfs_rq, then reweight
-	 * always occurs at 0-lag point, so V won't change. Or else
-	 * there are other entities, hence W != w, then Eq. (5) turns
-	 * into V' = V. So V won't change in either case, proof done.
-	 *
-	 *
-	 * So according to corollary #1 & #2, the effect of re-weight
-	 * on vruntime should be:
-	 *
-	 *	v' = V' - (V - v) * w / w'		(4)
-	 *	   = V  - (V - v) * w / w'
-	 *	   = V  - vl * w / w'
-	 *	   = V  - vl'
-	 */
-	if (avruntime != se->vruntime) {
-		vlag = entity_lag(avruntime, se);
-		vlag = div_s64(vlag * old_weight, weight);
-		se->vruntime = avruntime - vlag;
-	}
-
-	/*
-	 * DEADLINE
-	 * --------
-	 *
-	 * When the weight changes, the virtual time slope changes and
-	 * we should adjust the relative virtual deadline accordingly.
-	 *
-	 *	d' = v' + (d - v)*w/w'
-	 *	   = V' - (V - v)*w/w' + (d - v)*w/w'
-	 *	   = V  - (V - v)*w/w' + (d - v)*w/w'
-	 *	   = V  + (d - V)*w/w'
-	 */
-	vslice = (s64)(se->deadline - avruntime);
-	vslice = div_s64(vslice * old_weight, weight);
-	se->deadline = avruntime + vslice;
-}
+static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
 	bool curr = cfs_rq->curr == se;
-	u64 avruntime;
 
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		update_curr(cfs_rq);
-		avruntime = avg_vruntime(cfs_rq);
+		update_entity_lag(cfs_rq, se);
+		se->deadline -= se->vruntime;
+		se->rel_deadline = 1;
 		if (!curr)
 			__dequeue_entity(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	if (se->on_rq) {
-		reweight_eevdf(se, avruntime, weight);
-	} else {
-		/*
-		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
-		 * we need to scale se->vlag when w_i changes.
-		 */
-		se->vlag = div_s64(se->vlag * se->load.weight, weight);
-	}
+	/*
+	 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
+	 * we need to scale se->vlag when w_i changes.
+	 */
+	se->vlag = div_s64(se->vlag * se->load.weight, weight);
+	if (se->rel_deadline)
+		se->deadline = div_s64(se->deadline * se->load.weight, weight);
 
 	update_load_set(&se->load, weight);
 
@@ -3915,6 +3805,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
+		place_entity(cfs_rq, se, 0);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
 
@@ -5355,7 +5246,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	se->vruntime = vruntime - lag;
 
-	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
+	if (se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;
 		return;


* [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-01-09 10:59                     ` Peter Zijlstra
@ 2025-01-09 12:18                       ` tip-bot2 for Peter Zijlstra
  2025-04-17  9:56                         ` Alexander Egorenkov
  2025-01-10  5:09                       ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
  1 sibling, 1 reply; 51+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-01-09 12:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Doug Smythies, Ingo Molnar, Peter Zijlstra (Intel), x86,
	linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     6d71a9c6160479899ee744d2c6d6602a191deb1f
Gitweb:        https://git.kernel.org/tip/6d71a9c6160479899ee744d2c6d6602a191deb1f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 09 Jan 2025 11:59:59 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 09 Jan 2025 12:55:27 +01:00

sched/fair: Fix EEVDF entity placement bug causing scheduling lag

I noticed this in my traces today:

       turbostat-1222    [006] d..2.   311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
       turbostat-1222    [006] d..2.   311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
                               { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }

Which is a weight transition: 1048576 -> 2 -> 1048576.

One would expect the lag to shoot out *AND* come back, notably:

  -1123*1048576/2 = -588775424
  -588775424*2/1048576 = -1123

Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.

This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.

And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation of the new vruntime using the new lag
is broken, for it does not consider the logic set out in place_entity().

With the below patch, I now see things like:

    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }

Which isn't perfect yet, but much closer.

Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c | 145 +++++--------------------------------------
 1 file changed, 18 insertions(+), 127 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38..eeed8e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -689,21 +689,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
  *
  * XXX could add max_slice to the augmented data to track this.
  */
-static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	s64 vlag, limit;
 
-	vlag = avruntime - se->vruntime;
-	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
-
-	return clamp(vlag, -limit, limit);
-}
-
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
 	SCHED_WARN_ON(!se->on_rq);
 
-	se->vlag = entity_lag(avg_vruntime(cfs_rq), se);
+	vlag = avg_vruntime(cfs_rq) - se->vruntime;
+	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+
+	se->vlag = clamp(vlag, -limit, limit);
 }
 
 /*
@@ -3774,137 +3769,32 @@ static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
-static void reweight_eevdf(struct sched_entity *se, u64 avruntime,
-			   unsigned long weight)
-{
-	unsigned long old_weight = se->load.weight;
-	s64 vlag, vslice;
-
-	/*
-	 * VRUNTIME
-	 * --------
-	 *
-	 * COROLLARY #1: The virtual runtime of the entity needs to be
-	 * adjusted if re-weight at !0-lag point.
-	 *
-	 * Proof: For contradiction assume this is not true, so we can
-	 * re-weight without changing vruntime at !0-lag point.
-	 *
-	 *             Weight	VRuntime   Avg-VRuntime
-	 *     before    w          v            V
-	 *      after    w'         v'           V'
-	 *
-	 * Since lag needs to be preserved through re-weight:
-	 *
-	 *	lag = (V - v)*w = (V'- v')*w', where v = v'
-	 *	==>	V' = (V - v)*w/w' + v		(1)
-	 *
-	 * Let W be the total weight of the entities before reweight,
-	 * since V' is the new weighted average of entities:
-	 *
-	 *	V' = (WV + w'v - wv) / (W + w' - w)	(2)
-	 *
-	 * by using (1) & (2) we obtain:
-	 *
-	 *	(WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
-	 *	==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
-	 *	==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
-	 *	==>	(V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
-	 *
-	 * Since we are doing at !0-lag point which means V != v, we
-	 * can simplify (3):
-	 *
-	 *	==>	W / (W + w' - w) = w / w'
-	 *	==>	Ww' = Ww + ww' - ww
-	 *	==>	W * (w' - w) = w * (w' - w)
-	 *	==>	W = w	(re-weight indicates w' != w)
-	 *
-	 * So the cfs_rq contains only one entity, hence vruntime of
-	 * the entity @v should always equal to the cfs_rq's weighted
-	 * average vruntime @V, which means we will always re-weight
-	 * at 0-lag point, thus breach assumption. Proof completed.
-	 *
-	 *
-	 * COROLLARY #2: Re-weight does NOT affect weighted average
-	 * vruntime of all the entities.
-	 *
-	 * Proof: According to corollary #1, Eq. (1) should be:
-	 *
-	 *	(V - v)*w = (V' - v')*w'
-	 *	==>    v' = V' - (V - v)*w/w'		(4)
-	 *
-	 * According to the weighted average formula, we have:
-	 *
-	 *	V' = (WV - wv + w'v') / (W - w + w')
-	 *	   = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
-	 *	   = (WV - wv + w'V' - Vw + wv) / (W - w + w')
-	 *	   = (WV + w'V' - Vw) / (W - w + w')
-	 *
-	 *	==>  V'*(W - w + w') = WV + w'V' - Vw
-	 *	==>	V' * (W - w) = (W - w) * V	(5)
-	 *
-	 * If the entity is the only one in the cfs_rq, then reweight
-	 * always occurs at 0-lag point, so V won't change. Or else
-	 * there are other entities, hence W != w, then Eq. (5) turns
-	 * into V' = V. So V won't change in either case, proof done.
-	 *
-	 *
-	 * So according to corollary #1 & #2, the effect of re-weight
-	 * on vruntime should be:
-	 *
-	 *	v' = V' - (V - v) * w / w'		(4)
-	 *	   = V  - (V - v) * w / w'
-	 *	   = V  - vl * w / w'
-	 *	   = V  - vl'
-	 */
-	if (avruntime != se->vruntime) {
-		vlag = entity_lag(avruntime, se);
-		vlag = div_s64(vlag * old_weight, weight);
-		se->vruntime = avruntime - vlag;
-	}
-
-	/*
-	 * DEADLINE
-	 * --------
-	 *
-	 * When the weight changes, the virtual time slope changes and
-	 * we should adjust the relative virtual deadline accordingly.
-	 *
-	 *	d' = v' + (d - v)*w/w'
-	 *	   = V' - (V - v)*w/w' + (d - v)*w/w'
-	 *	   = V  - (V - v)*w/w' + (d - v)*w/w'
-	 *	   = V  + (d - V)*w/w'
-	 */
-	vslice = (s64)(se->deadline - avruntime);
-	vslice = div_s64(vslice * old_weight, weight);
-	se->deadline = avruntime + vslice;
-}
+static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
 	bool curr = cfs_rq->curr == se;
-	u64 avruntime;
 
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		update_curr(cfs_rq);
-		avruntime = avg_vruntime(cfs_rq);
+		update_entity_lag(cfs_rq, se);
+		se->deadline -= se->vruntime;
+		se->rel_deadline = 1;
 		if (!curr)
 			__dequeue_entity(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	if (se->on_rq) {
-		reweight_eevdf(se, avruntime, weight);
-	} else {
-		/*
-		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
-		 * we need to scale se->vlag when w_i changes.
-		 */
-		se->vlag = div_s64(se->vlag * se->load.weight, weight);
-	}
+	/*
+	 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
+	 * we need to scale se->vlag when w_i changes.
+	 */
+	se->vlag = div_s64(se->vlag * se->load.weight, weight);
+	if (se->rel_deadline)
+		se->deadline = div_s64(se->deadline * se->load.weight, weight);
 
 	update_load_set(&se->load, weight);
 
@@ -3919,6 +3809,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
+		place_entity(cfs_rq, se, 0);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
 
@@ -5359,7 +5250,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	se->vruntime = vruntime - lag;
 
-	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
+	if (se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;
 		return;


* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-09 10:59                     ` Peter Zijlstra
  2025-01-09 12:18                       ` [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag tip-bot2 for Peter Zijlstra
@ 2025-01-10  5:09                       ` Doug Smythies
  2025-01-10 11:57                         ` Peter Zijlstra
  2025-01-12 19:59                         ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
  1 sibling, 2 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-10  5:09 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

[-- Attachment #1: Type: text/plain, Size: 3138 bytes --]

Hi Peter,

Thanks for all your hard work on this.

On 2025.01.09 03:00 Peter Zijlstra wrote:

...

> This made me have a very hard look at reweight_entity(), and
> specifically the ->on_rq case, which is more prominent with
> DELAY_DEQUEUE.
>
> And indeed, it is all sorts of broken. While the computation of the new
> lag is correct, the computation for the new vruntime, using the new lag
> is broken for it does not consider the logic set out in place_entity().
>
> With the below patch, I now see things like:
>    
>    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
>                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
>                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
>    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
>                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
>                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
> 
> Which isn't perfect yet, but much closer.

Agreed.
I tested the patch. Attached is a repeat of a graph I had sent before, with a different y-axis scale and the old data deleted.
It still compares to the "b12" kernel (the last good one in the kernel bisection).
It was a 2 hour and 31 minute test, and the maximum CPU migration time was 24 milliseconds,
versus 6 seconds without the patch.

I left things running for many hours and will let it continue overnight.
There seems to have been an issue at one spot in time:

usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
488994  1736476550.732222       -       99.76   12889
488520  1736476550.732222       11      99.76   1012
960999  1736476552.694222       -       99.76   17922
960587  1736476552.694222       11      99.76   1493
914999  1736476554.610222       -       99.76   23579
914597  1736476554.610222       11      99.76   1962
809999  1736476556.421222       -       99.76   23134
809598  1736476556.421222       11      99.76   1917
770998  1736476558.193221       -       99.76   21757
770603  1736476558.193221       11      99.76   1811
726999  1736476559.921222       -       99.76   21294
726600  1736476559.921222       11      99.76   1772
686998  1736476561.609221       -       99.76   20801
686600  1736476561.609221       11      99.76   1731
650998  1736476563.261221       -       99.76   20280
650601  1736476563.261221       11      99.76   1688
610998  1736476564.873221       -       99.76   19857
610606  1736476564.873221       11      99.76   1653

I had one of these the other day also, but they were all 6 seconds.
It's like a burst of problematic data. I have the data somewhere,
and can try to find it tomorrow.

>
> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

...

[-- Attachment #2: turbostat-sampling-issue-fixed-seconds.png --]
[-- Type: image/png, Size: 62449 bytes --]


* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-10  5:09                       ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
@ 2025-01-10 11:57                         ` Peter Zijlstra
  2025-01-12 23:14                           ` Doug Smythies
  2025-01-13 12:58                           ` [tip: sched/urgent] sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE tip-bot2 for Peter Zijlstra
  2025-01-12 19:59                         ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
  1 sibling, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-10 11:57 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote:
> Hi Peter,
> 
> Thanks for all your hard work on this.
> 
> On 2025.01.09 03:00 Peter Zijlstra wrote:
> 
> ...
> 
> > This made me have a very hard look at reweight_entity(), and
> > specifically the ->on_rq case, which is more prominent with
> > DELAY_DEQUEUE.
> >
> > And indeed, it is all sorts of broken. While the computation of the new
> > lag is correct, the computation for the new vruntime, using the new lag
> > is broken for it does not consider the logic set out in place_entity().
> >
> > With the below patch, I now see things like:
> >    
> >    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> >                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
> >                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
> >    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> >                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
> >                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
> > 
> > Which isn't perfect yet, but much closer.
> 
> Agreed.
> I tested the patch. Attached is a repeat of a graph I had sent before, with a different y-axis scale and the old data deleted.
> It still compares to the "b12" kernel (the last good one in the kernel bisection).
> It was a 2 hour and 31 minute test, and the maximum CPU migration time was 24 milliseconds,
> versus 6 seconds without the patch.

Progress!

> I left things running for many hours and will let it continue overnight.
> There seems to have been an issue at one spot in time:

Right, so by happy accident I also left mine running overnight and I
think I caught one of those weird spikes. It took a bit of staring to
figure out what went wrong this time, but what I now think is the thing
that sets off the chain of fail is a combination of DELAY_DEQUEUE and
the cgroup reweight stuff.

Specifically, when a cgroup's CPU queue becomes empty,
calc_group_shares() will drop its weight to the floor.

Now, normally, when a queue goes empty, it gets dequeued from its
parent and its weight is immaterial. However, with DELAY_DEQUEUE the
thing will stick around for a while -- at a very low weight.

What makes this esp. troublesome is that even for an active cgroup (like
the 'yes' group) the per-cpu weight will be relatively small (on the
order of 1/nr_cpus). Worse, the avg_vruntime math uses load_scale_down() because u64
just isn't all that big :/

(if only all 64bit machines could do 128bit divisions as cheaply as x86_64)

This means that on a 16 CPU machine, the weight of a 'normal' all busy
queue will be 64, and the weight of an empty queue will be 2, which means
the effect of the ginormous lag on the avg_vruntime is fairly
significant, pushing it wildly off balance and affecting placement of
new tasks.

Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue
competition as the entity was.

As such, we should not adjust the weight of an empty queue.

I've started a new run, and some 15 minutes of runtime show nothing
interesting, I'll have to let it run for a while again.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..93644b3983d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long shares;
 
-	if (!gcfs_rq)
+	/*
+	 * When a group becomes empty, preserve its weight. This matters for
+	 * DELAY_DEQUEUE.
+	 */
+	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
 	if (throttled_hierarchy(gcfs_rq))

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-10  5:09                       ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
  2025-01-10 11:57                         ` Peter Zijlstra
@ 2025-01-12 19:59                         ` Doug Smythies
  1 sibling, 0 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-12 19:59 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

Hi Peter,

While we have moved on from this branch of this email thread,
just for completeness, I'll give the additional data from the overnight
test. There is also an observation, which I will continue
in the next email.
 
On 2025.01.09 21:09 Doug Smythies wrote:
> On 2025.01.09 03:00 Peter Zijlstra wrote:
>
> ...
>
>> This made me have a very hard look at reweight_entity(), and
>> specifically the ->on_rq case, which is more prominent with
>> DELAY_DEQUEUE.
>>
>> And indeed, it is all sorts of broken. While the computation of the new
>> lag is correct, the computation for the new vruntime, using the new lag
>> is broken for it does not consider the logic set out in place_entity().
>>
>> With the below patch, I now see things like:
>>    
>>    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
>>                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475
> } ->
>>                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline:
6427157349203
> }
>>    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
>>                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
> 6316614641111 } ->
>>                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline:
4874220535650
> }
>> 
>> Which isn't perfect yet, but much closer.
>
> Agreed.
> I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted.
> It still compares to the "b12" kernel (the last good one in the kernel bisection).
> It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds,
> versus 6 seconds without the patch.
>
> I left things running for many hours and will let it continue overnight.
> There seems to have been an issue at one spot in time:
>
> usec    Time_Of_Day_Seconds     CPU     Busy%   IRQ
> 488994  1736476550.732222       -       99.76   12889
> 488520  1736476550.732222       11      99.76   1012
> 960999  1736476552.694222       -       99.76   17922
> 960587  1736476552.694222       11      99.76   1493
> 914999  1736476554.610222       -       99.76   23579
> 914597  1736476554.610222       11      99.76   1962
> 809999  1736476556.421222       -       99.76   23134
> 809598  1736476556.421222       11      99.76   1917
> 770998  1736476558.193221       -       99.76   21757
> 770603  1736476558.193221       11      99.76   1811
> 726999  1736476559.921222       -       99.76   21294
> 726600  1736476559.921222       11      99.76   1772
> 686998  1736476561.609221       -       99.76   20801
> 686600  1736476561.609221       11      99.76   1731
> 650998  1736476563.261221       -       99.76   20280
> 650601  1736476563.261221       11      99.76   1688
> 610998  1736476564.873221       -       99.76   19857
> 610606  1736476564.873221       11      99.76   1653

The test was continued overnight yielding this additional information:

868008  1736496040.956236       -       99.76   12668
867542  1736496040.956222       5       99.76   1046
5950010 1736496047.907233       -       99.76   22459
5949592 1736496047.907222       5       99.76   1871
5791008 1736496054.699232       -       99.76   83625
5790605 1736496054.699222       5       99.76   6957

1962999 1736502192.036227       -       99.76   12896
1962528 1736502192.036227       11      99.76   1030
434858  1736502193.472086       -       99.76   35824
434387  1736502193.472086       11      99.76   2965

Or 2 more continuous bursts, and a 5.9 second sample.

Observation: there is no tens-of-milliseconds data.
Based on the graph, which is basically the same test
done in an ever so slightly different way, there should be
a lot of such data.

Rather than re-attach the same graph, I'll present the
same data as a histogram:

First the b12 kernel (the last good one in the kernel bisection):

Time          Occurrences
1.000000, 3282
1.001000, 1826
1.002000, 227
1.003000, 1852
1.004000, 1036
1.005000, 731
1.006000, 75
1.007000, 30
1.008000, 9
1.009000, 2
1.010000, 1
1.011000, 1

Total: 9072 : Total >= 10 mSec: 2 ( 0.02 percent)

Second: Kernel 6.13-rc6+this one patch

Time          Occurrences
1.000000, 1274
1.001000, 1474
1.002000, 512
1.003000, 3201
1.004000, 849
1.005000, 593
1.006000, 246
1.007000, 104
1.008000, 36
1.009000, 15
1.010000, 19
1.011000, 16
1.012000, 11
1.013000, 27
1.014000, 26
1.015000, 35
1.016000, 105
1.017000, 85
1.018000, 135
1.019000, 283
1.020000, 17
1.021000, 4
1.022000, 3
1.023000, 1

Total: 9071 : Total >= 10 mSec: 767 ( 8.46 percent) 

Where, and for example, this line:

1.005000, 593

means that there were 593 occurrences of turbostat interval times
between 1.005 and 1.005999 seconds.

So, I would expect to see that reflected in the overnight test, but don't.
It would appear that the slightly different way of doing the test
affects the probability of occurrence significantly.

I'll continue in a reply to your patch 2 email from Friday (Jan 10th). 

>
> I had one of these the other day also, but they were all 6 seconds.
> It's like a burst of problematic data. I have the data somewhere,
> and can try to find it tomorrow.
>>
>> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> ...



^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-10 11:57                         ` Peter Zijlstra
@ 2025-01-12 23:14                           ` Doug Smythies
  2025-01-13 11:03                             ` Peter Zijlstra
  2025-01-13 11:05                             ` Peter Zijlstra
  2025-01-13 12:58                           ` [tip: sched/urgent] sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE tip-bot2 for Peter Zijlstra
  1 sibling, 2 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-12 23:14 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

[-- Attachment #1: Type: text/plain, Size: 7364 bytes --]

On 2025.01.10 03:57 Peter Zijlstra wrote:
> On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote:
>> On 2025.01.09 03:00 Peter Zijlstra wrote:
>> 
>> ...
>> 
>>> This made me have a very hard look at reweight_entity(), and
>>> specifically the ->on_rq case, which is more prominent with
>>> DELAY_DEQUEUE.
>>>
>>> And indeed, it is all sorts of broken. While the computation of the new
>>> lag is correct, the computation for the new vruntime, using the new lag
>>> is broken for it does not consider the logic set out in place_entity().
>>>
>>> With the below patch, I now see things like:
>>>    
>>>    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
>>>                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline:
4860516552475
>> } ->
>>>                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline:
6427157349203
>> }
>>>    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
>>>                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
>> 6316614641111 } ->
>>>                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline:
4874220535650
>> }
>>> 
>>> Which isn't perfect yet, but much closer.
>> 
>> Agreed.
>> I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted.
>> It still compares to the "b12" kernel (the last good one in the kernel bisection).
>> It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds,
>> versus 6 seconds without the patch.
>
> Progress!
>
>> I left things running for many hours and will let it continue overnight.
>> There seems to have been an issue at one spot in time:
>
> Right, so by happy accident I also left mine running overnight and I
> think I caught one of those weird spikes. It took a bit of staring to
> figure out what went wrong this time, but what I now think is the thing
> that sets off the chain of fail is a combination of DELAY_DEQUEUE and
> the cgroup reweight stuff.
>
> Specifically, when a cgroup's CPU queue becomes empty,
> calc_group_shares() will drop its weight to the floor.
>
> Now, normally, when a queue goes empty, it gets dequeued from its
> parent and its weight is immaterial. However, with DELAY_DEQUEUE the
> thing will stick around for a while -- at a very low weight.
>
> What makes this esp. troublesome is that even for an active cgroup (like
> the 'yes' group) the per-cpu weight will be relatively small (~1/nr_cpus
> like). Worse, the avg_vruntime thing uses scale_load_down() because u64
> just isn't all that big :/
>
> (if only all 64bit machines could do 128bit divisions as cheaply as x86_64)
>
> This means that on a 16 CPU machine, the weight of a 'normal' all busy
> queue will be 64, and the weight of an empty queue will be 2, which means
> the effect of the ginormous lag on the avg_vruntime is fairly
> significant, pushing it wildly off balance and affecting placement of
> new tasks.
>
> Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue
> competition as the entity was.
>
> As such, we should not adjust the weight of an empty queue.
>
> I've started a new run, and some 15 minutes of runtime show nothing
> interesting, I'll have to let it run for a while again.

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e9ca38512de..93644b3983d4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
> 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> 	long shares;
> 
> -	if (!gcfs_rq)
> +	/*
> +	 * When a group becomes empty, preserve its weight. This matters for
> +	 * DELAY_DEQUEUE.
> +	 */
> +	if (!gcfs_rq || !gcfs_rq->load.weight)
> 		return;
> 
> 	if (throttled_hierarchy(gcfs_rq))

I tested the above patch on top of the previous patch.
Multiple tests and multiple methods over many hours and 
I never got any hit at all for a detected CPU migration greater than or
equal to 10 milliseconds.
Which is good news.

The test I have been running to create some of the graphs I have been
attaching is a little different, using turbostat with different options:

turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1

And with this test I get intervals over 1 second by over 10 milliseconds.
(I referred to this observation in the previous email.).
I attach a new version of a previous graph, with the data from this "patch 2" added.
Also, below are the same results presented as histograms
(some repeated from my previous email):

First the b12 kernel (the last good one in the kernel bisection):

Time          Occurrences
1.000000, 3282
1.001000, 1826
1.002000, 227
1.003000, 1852
1.004000, 1036
1.005000, 731
1.006000, 75
1.007000, 30
1.008000, 9
1.009000, 2
1.010000, 1
1.011000, 1

Total: 9072 : Total >= 10 mSec: 2 ( 0.02 percent)

Second: Kernel 6.13-rc6+the first patch

Time          Occurrences
1.000000, 1274
1.001000, 1474
1.002000, 512
1.003000, 3201
1.004000, 849
1.005000, 593
1.006000, 246
1.007000, 104
1.008000, 36
1.009000, 15
1.010000, 19
1.011000, 16
1.012000, 11
1.013000, 27
1.014000, 26
1.015000, 35
1.016000, 105
1.017000, 85
1.018000, 135
1.019000, 283
1.020000, 17
1.021000, 4
1.022000, 3
1.023000, 1

Total: 9071 : Total >= 10 mSec: 767 ( 8.46 percent)

Third: Kernel 6.13-rc6+the first patch+the above patch:

Time          Occurrences
1.000000, 2034
1.001000, 2108
1.002000, 2030
1.003000, 2492
1.004000, 216
1.005000, 109
1.006000, 23
1.007000, 8
1.008000, 3
1.009000, 9
1.010000, 1
1.011000, 2
1.012000, 2
1.014000, 3
1.015000, 10
1.016000, 19
1.017000, 1
1.018000, 1

Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)

Where, and for example, this line:

1.016000, 19

means that there were 19 occurrences of turbostat interval times
between 1.016 and 1.016999 seconds.

As mentioned earlier, I haven't been able to obtain any detailed
information as to where any extra delay is occurring, in particular
if it is related to CPU migration or not.

I am assuming the system under test is perturbed just enough
by the slight difference in the test that the longer times do
not occur.

As a side note, and just for informational purposes:
The changes do have an effect on performance.
Using 40 pairs of ping-pong token passing rings running 30 million
loops I get:

Where:
nodelay = NO_DELAY_DEQUEUE
delay = DELAY_DEQUEUE
noauto = "cgroup_disable=cpu noautogroup" on grub command line
auto = nothing cgroup related on the grub command line.
All tests done with the CPU frequency scaling governor = performance.
(lesser numbers are better)

noauto	nodelay	test 1	16.24 uSec/loop	reference test
noauto	nodelay	test 2	16.22 uSec/loop 	-0.15%
noauto	delay	test 1	15.41 uSec/loop	-5.16%
auto	nodelay	test 1	24.31 uSec/loop	+49.88%
auto	nodelay	test 2	24.31 uSec/loop	+49.24%
auto	delay	test 1	21.42 uSec/loop	+21.42%
auto	delay	test 2	21.71 uSec/loop	+21.71%

A graph is also attached showing the same results, but for each
of the 40 pairs. It reveals differences in relative execution times
from first to last ping-pong pair.


[-- Attachment #2: turbostat-sampling-issue-fixed--2-seconds.png --]
[-- Type: image/png, Size: 78967 bytes --]

[-- Attachment #3: 40-ping-pairs-compare.png --]
[-- Type: image/png, Size: 54905 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-12 23:14                           ` Doug Smythies
@ 2025-01-13 11:03                             ` Peter Zijlstra
  2025-01-14 10:58                               ` Peter Zijlstra
  2025-01-13 11:05                             ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-13 11:03 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:

> I tested the above patch on top of the previous patch.

That was indeed the intention.

> Multiple tests and multiple methods over many hours and 
> I never got any hit at all for a detected CPU migration greater than or
> equal to 10 milliseconds.
> Which is good news.

Right, my current trace threshold is set at 100ms, and I've let it run
with both patches applied over the entire weekend, and so far nothing.

So definitely progress.

> The test I have been running to create some of the graphs I have been
> attaching is a little different, using turbostat with different options:
> 
> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
> 
> And with this test I get intervals over 1 second by over 10 milliseconds.
> (I referred to this observation in the previous email.).

OK, almost but not quite there it seems.

> Third: Kernel 6.13-rc6+the first patch+the above patch:
> 
> 1.000000, 2034
> 1.001000, 2108
> 1.002000, 2030
> 1.003000, 2492
> 1.004000, 216
> 1.005000, 109
> 1.006000, 23
> 1.007000, 8
> 1.008000, 3
> 1.009000, 9
> 1.010000, 1
> 1.011000, 2
> 1.012000, 2
> 1.014000, 3
> 1.015000, 10
> 1.016000, 19
> 1.017000, 1
> 1.018000, 1
> 
> Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)
> 
> Where, and for example, this line:
> 
> 1.016000, 19
> 
> means that there were 19 occurrences of turbostat interval times
> between 1.016 and 1.016999 seconds.

OK, let me lower my threshold to 10ms and change the turbostat
invocation -- see if I can catch me some wabbits :-)

FWIW, I'm using the below hackery to catch them wabbits.

---
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 1b69caa87480..61ff330e068b 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -149,6 +149,12 @@ SYSCALL_DEFINE2(gettimeofday, struct __kernel_old_timeval __user *, tv,
 			return -EFAULT;
 	}
 	if (unlikely(tz != NULL)) {
+		if (tz == (void*)1) {
+			trace_printk("WHOOPSIE!\n");
+			tracing_off();
+			return 0;
+		}
+
 		if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
 			return -EFAULT;
 	}
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 58a487c225a7..baeac7388be2 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -67,6 +67,7 @@
 #include <stdbool.h>
 #include <assert.h>
 #include <linux/kernel.h>
+#include <sys/syscall.h>
 
 #define UNUSED(x) (void)(x)
 
@@ -2704,7 +2705,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
 		struct timeval tv;
 
 		timersub(&t->tv_end, &t->tv_begin, &tv);
-		outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
+		outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
 	}
 
 	/* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
@@ -4570,12 +4571,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 	int i;
 	int status;
 
+	gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
+
 	if (cpu_migrate(cpu)) {
 		fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
 		return -1;
 	}
 
-	gettimeofday(&t->tv_begin, (struct timezone *)NULL);
+//	gettimeofday(&t->tv_begin, (struct timezone *)NULL);
 
 	if (first_counter_read)
 		get_apic_id(t);
@@ -4730,6 +4733,15 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 done:
 	gettimeofday(&t->tv_end, (struct timezone *)NULL);
 
+	{
+		struct timeval tv;
+		u64 delta;
+		timersub(&t->tv_end, &t->tv_begin, &tv);
+		delta = tv.tv_sec * 1000000 + tv.tv_usec;
+		if (delta > 100000)
+			syscall(__NR_gettimeofday, &tv, (void*)1);
+	}
+
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-12 23:14                           ` Doug Smythies
  2025-01-13 11:03                             ` Peter Zijlstra
@ 2025-01-13 11:05                             ` Peter Zijlstra
  2025-01-13 16:01                               ` Doug Smythies
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-13 11:05 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:

> The test I have been running to create some of the graphs I have been
> attaching is a little different, using turbostat with different options:
> 
> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
> 
> And with this test I get intervals over 1 second by over 10 milliseconds.

> First the b12 kernel (the last good one in the kernel bisection):
> 
> Time          Occurrences
> 1.000000, 3282
> 1.001000, 1826
> 1.002000, 227
> 1.003000, 1852
> 1.004000, 1036
> 1.005000, 731
> 1.006000, 75
> 1.007000, 30
> 1.008000, 9
> 1.009000, 2
> 1.010000, 1
> 1.011000, 1

You're creating these Time values from the consecutive
Time_Of_Day_Seconds data using a script? Let me go check the turbostat
code to see if my hackery is still invoked, even if not displayed.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* [tip: sched/urgent] sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
  2025-01-10 11:57                         ` Peter Zijlstra
  2025-01-12 23:14                           ` Doug Smythies
@ 2025-01-13 12:58                           ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-01-13 12:58 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     66951e4860d3c688bfa550ea4a19635b57e00eca
Gitweb:        https://git.kernel.org/tip/66951e4860d3c688bfa550ea4a19635b57e00eca
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 13 Jan 2025 13:50:11 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 13 Jan 2025 13:50:56 +01:00

sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE

Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.

However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.

As such, don't adjust the weight for empty group entities and let them compete
at their original weight.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eeed8e3..2695843 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3956,7 +3956,11 @@ static void update_cfs_group(struct sched_entity *se)
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long shares;
 
-	if (!gcfs_rq)
+	/*
+	 * When a group becomes empty, preserve its weight. This matters for
+	 * DELAY_DEQUEUE.
+	 */
+	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
 	if (throttled_hierarchy(gcfs_rq))

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-13 11:05                             ` Peter Zijlstra
@ 2025-01-13 16:01                               ` Doug Smythies
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-13 16:01 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

On 2025.01.13 03:06 Peter Zijlstra wrote:
> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>
>> The test I have been running to create some of the graphs I have been
>> attaching is a little different, using turbostat with different options:
>> 
>> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
>> 
>> And with this test I get intervals over 1 second by over 10 milliseconds.
>
>> First the b12 kernel (the last good one in the kernel bisection):
>> 
>> Time          Occurrences
>> 1.000000, 3282
>> 1.001000, 1826
>> 1.002000, 227
>> 1.003000, 1852
>> 1.004000, 1036
>> 1.005000, 731
>> 1.006000, 75
>> 1.007000, 30
>> 1.008000, 9
>> 1.009000, 2
>> 1.010000, 1
>> 1.011000, 1
>
> You're creating these Time values from the consecutive
> Time_Of_Day_Seconds data using a script? Let me go check the turbostat
> code to see if my hackery is still invoked, even if not displayed.

Yes, sort of.
I put the output into a spreadsheet and add a column calculating
the time difference between samples.

The histogram is created by a simple C program run against that extracted column.

Anyway, I finally did get some useful information. Examples:

Samp	uSec	Time of day		Delta T		Freq	TSC	IRQ	TMP	PWR
4086	4548	1736734595.017487	1.007149935	4800	4104	12128	73	107.52		
6222	4059	1736736736.520998	1.009660006	4800	4098	12124	74	106.73		
6263	400	1736736777.699340	1.023000002	4800	4104	12345	73	106.51		

The summary histogram line for that capture is:

Total: 9079 : Total >= 10 mSec: 128 ( 1.41 percent)

The maximum uSec was 4548 and there are only about 20 (counted manually) greater
than 1 millisecond (i.e. all good).

The command used was:

turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1

Anyway, there is never any long time within the turbostat per-interval execution.
Any extra time seems to be outside of the main loop.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-13 11:03                             ` Peter Zijlstra
@ 2025-01-14 10:58                               ` Peter Zijlstra
  2025-01-14 15:15                                 ` Doug Smythies
  2025-01-19  0:09                                 ` Doug Smythies
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-14 10:58 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:

> > means that there were 19 occurrences of turbostat interval times
> > between 1.016 and 1.016999 seconds.
> 
> OK, let me lower my threshold to 10ms and change the turbostat
> invocation -- see if I can catch me some wabbits :-)

I've had it run overnight and have not caught a single >10ms event :-(

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-14 10:58                               ` Peter Zijlstra
@ 2025-01-14 15:15                                 ` Doug Smythies
  2025-01-15  2:08                                   ` Len Brown
  2025-01-19  0:09                                 ` Doug Smythies
  1 sibling, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-14 15:15 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

On 2025.01.14 02:59 Peter Zijlstra wrote:
> On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
>> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:

>>> means that there were 19 occurrences of turbostat interval times
>>> between 1.016 and 1.016999 seconds.
>> 
>> OK, let me lower my threshold to 10ms and change the turbostat
>> invocation -- see if I can catch me some wabbits :-)
>
> I've had it run overnight and have not caught a single >10ms event :-(

O.K. thanks for trying.

For me, the probability of occurrence does vary:

After moving gettimeofday calls back to their original places
in the code I recompiled turbostat, and got:

Total: 9082 : Total >= 10 mSec: 24 ( 0.26 percent)

And 2 previous tests had significant differences
in probability of occurrence:

Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)
Total: 9079 : Total >= 10 mSec: 128 ( 1.41 percent)

Whenever I try to obtain more information by eliminating
the --Summary directive in turbostat, I never get
a >= 10 mSec hit.

I tried looking into the sleep lengths by themselves without
using turbostat, and never saw a requested 1 second sleep
overrun its requested length by >= 1 mSec.

Regardless, the 2 patches seem to have solved the up to
6 seconds extra time between samples issue. The most
I have seen in all this testing has been 23 milliseconds.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-14 15:15                                 ` Doug Smythies
@ 2025-01-15  2:08                                   ` Len Brown
  2025-01-15 16:47                                     ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Len Brown @ 2025-01-15  2:08 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Peter Zijlstra, linux-kernel, vincent.guittot, Ingo Molnar,
	wuyun.abel

Doug,
Your attention to detail and persistence has once again found a tricky
underlying bug -- kudos!

Re: turbostat behaviour

Yes, TSC_MHz -- "the measured rate of the TSC during an interval", is
printed as a sanity check.  If there are any irregularities in it, as
you noticed, then something very strange in the hardware or software
is going wrong (and the actual turbostat results will likely not be
reliable).

Yes, the "usec" column measures how long it takes to migrate to a CPU
and collect stats there.  So if you are hunting down a glitch in
migration, this column alone will show it.  "usec" on the
summary row is the time from the first migration to after the
last -- excluding the sysfs/procfs time that is consumed on the last
CPU.  So migration delays will also be reflected there.

Note: we have a patch queued which changes the "usec" on the Summary
row to *include* the sysfs/procfs time on the last CPU.  (The per-cpu
"usec" values are unchanged.)  This is because we've noticed some
really weird delays in doing things like reading /proc/interrupts and
we want to be able to easily do A/B comparisons by simply including or
excluding counters.

Also FYI, the scheme of migrating to each CPU so that collecting stats
there will be "local" isn't scaling so well on very large systems, and
I'm about to take a close look at it.  In yogini we used a different
scheme, where a thread is bound to each CPU, so they can collect in
parallel; and we may be moving to something like that.

cheers,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-15  2:08                                   ` Len Brown
@ 2025-01-15 16:47                                     ` Doug Smythies
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-15 16:47 UTC (permalink / raw)
  To: 'Len Brown'
  Cc: 'Peter Zijlstra', linux-kernel, vincent.guittot,
	'Ingo Molnar', wuyun.abel, Doug Smythies

Hi Len,

Thank you for chiming in on this thread.

On 2025.01.14 18:09 Len Brown wrote:
> Doug,
> Your attention to detail and persistence has once again found a tricky
> underlying bug -- kudos!
>
> Re: turbostat behaviour
>
> Yes, TSC_MHz -- "the measured rate of the TSC during an interval", is
> printed as a sanity check.  If there are any irregularities in it, as
> you noticed, then something very strange in the hardware or software
> is going wrong (and the actual turbostat results will likely not be
> reliable).

While I use turbostat almost every day, I am embarrassed to admit
that until this investigation I did not know about the ability to
add the "Time_Of_Day_Seconds" and "usec" columns.
They have been incredibly useful.
Early on, and until I discovered those two "show" options, I was
using the sanity check calculation of TSC_MHz to reveal the anomaly.

> Yes, the "usec" column measures how long it takes to migrate to a CPU
> and collect stats there.  So if you are hunting down a glitch in
> migration all you need is this column to see it.  "usec" on the
> summary row is the difference between the 1st migration and after the
> last -- excluding the sysfs/procfs time that is consumed on the last
> CPU.  So migration delays will also be reflected there.

On a per CPU basis, it excludes the actual CPU migration step.
Peter and I made a modification to turbostat to have the per-CPU
"usec" column focus just on the CPU migration time. [1]

> Note: we have a patch queued which changes the "usec" on the Summary
> row to *include* the sysfs/procfs time on the last CPU.

I did not realise that it was just for the last "sysfs/procfs" time.
I'll take a closer look, and wonder if that can explain why I have been
unable to catch the lingering >= 10 mSec stuff.

>  (The per-cpu
> "usec" values are unchanged.)  This is because we've noticed some
> really weird delays in doing things like reading /proc/interrupts and
> we want to be able to easily do A/B comparisons by simply including or
> excluding counters.

Yes, I saw the patch email on the linux-pm email list and have included it
in my local turbostat for about a week now.

> Also FYI, the scheme of migrating to each CPU so that collecting stats
> there will be "local" isn't scaling so well on very large systems, and
> I'm about to take a close look at it.  In yogini we used a different
> scheme, where a thread is bound to each CPU, so they can collect in
> parallel; and we may be moving to something like that.
>
> cheers,
> Len Brown, Intel Open Source Technology Center

[1] https://lore.kernel.org/lkml/001b01db608a$56d3dc40$047b94c0$@telus.net/

... Doug



^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-14 10:58                               ` Peter Zijlstra
  2025-01-14 15:15                                 ` Doug Smythies
@ 2025-01-19  0:09                                 ` Doug Smythies
  2025-01-20  3:55                                   ` Doug Smythies
  2025-01-21  8:49                                   ` Peter Zijlstra
  1 sibling, 2 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-19  0:09 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

Hi Peter,

An update.

On 2025.01.14 02:59 Peter Zijlstra wrote:
> On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
>> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:

>>> means that there were 19 occurrences of turbostat interval times
>>> between 1.016 and 1.016999 seconds.
>> 
>> OK, let me lower my threshold to 10ms and change the turbostat
>> invocation -- see if I can catch me some wabbits :-)
>
> I've had it run overnight and have not caught a single >10ms event :-(

Okay, so both you and I have many many hours of testing and
never see >= 10ms in that area of the turbostat code anymore.

The lingering >= 10ms (but I have never seen more than 25 ms)
is outside of that timing. As previously reported, I thought it might
be in the sampling interval sleep step, but I did a bunch of testing
and it doesn't appear to be there. That leaves:

delta_platform(&platform_counters_even, &platform_counters_odd);
compute_average(ODD_COUNTERS);
format_all_counters(ODD_COUNTERS);
flush_output_stdout();

I modified your tracing trigger thing in turbostat to this:

doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff turbostat.c
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 58a487c225a7..777efb64a754 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -67,6 +67,7 @@
 #include <stdbool.h>
 #include <assert.h>
 #include <linux/kernel.h>
+#include <sys/syscall.h>

 #define UNUSED(x) (void)(x)

@@ -2704,7 +2705,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
                struct timeval tv;

                timersub(&t->tv_end, &t->tv_begin, &tv);
-               outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
+               outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
        }

        /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
@@ -2713,6 +2714,11 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data

        interval_float = t->tv_delta.tv_sec + t->tv_delta.tv_usec / 1000000.0;

+       double requested_interval = (double) interval_tv.tv_sec + (double) interval_tv.tv_usec / 1000000.0;
+
+       if(interval_float >= (requested_interval + 0.01))  /* was the last interval over by more than 10 mSec? */
+               syscall(__NR_gettimeofday, &tv_delta, (void*)1);
+
        tsc = t->tsc * tsc_tweak;

        /* topo columns, print blanks on 1st (average) line */
@@ -4570,12 +4576,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
        int i;
        int status;

+       gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
+
        if (cpu_migrate(cpu)) {
                fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
                return -1;
        }

-       gettimeofday(&t->tv_begin, (struct timezone *)NULL);
+//     gettimeofday(&t->tv_begin, (struct timezone *)NULL);

        if (first_counter_read)
                get_apic_id(t);

And so that I could correlate the trace times to my graph times,
I also did not turn off tracing upon a hit:

doug@s19:~/kernel/linux$ git diff kernel/time/time.c
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 1b69caa87480..fb84915159cc 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -149,6 +149,12 @@ SYSCALL_DEFINE2(gettimeofday, struct __kernel_old_timeval __user *, tv,
                        return -EFAULT;
        }
        if (unlikely(tz != NULL)) {
+               if (tz == (void*)1) {
+                       trace_printk("WHOOPSIE!\n");
+//                     tracing_off();
+                       return 0;
+               }
+
                if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
                        return -EFAULT;
        }

I ran a test for about 1 hour and 28 minutes.
The data in the trace correlates with the turbostat line-by-line
TOD differentials. The trace contains:

turbostat-1370    [011] .....   751.738151: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....   760.763184: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1362.788298: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1365.815332: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1366.836340: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1367.856355: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1368.867365: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1373.893423: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1374.910439: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1377.928469: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1378.941483: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1379.959490: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1382.982525: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1385.005548: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1386.019561: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1387.030572: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1398.097683: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1620.752963: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1621.772969: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1622.788972: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1697.022098: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1703.071104: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1704.088103: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1705.105107: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1706.116106: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370    [011] .....  1707.126107: __x64_sys_gettimeofday: WHOOPSIE!

Going back to some old test data from when the CPU migration in turbostat
often took up to 6 seconds: if I subtract that migration time from the
measured interval time, I get a lot of samples between 10 and 23 ms.

I am saying there were 2 different issues. The 2nd was hidden by the 1st
because its magnitude was about 260 times less.

I do not know if my trace is of any use. I'll compress it and send it to you only, off list.
My trace is as per this older email:

https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m453062b267551ff4786d33a2eb5f326f92241e96



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-19  0:09                                 ` Doug Smythies
@ 2025-01-20  3:55                                   ` Doug Smythies
  2025-01-21 11:06                                     ` Peter Zijlstra
  2025-01-21  8:49                                   ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-20  3:55 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

Hi Peter,

I now know that the 2nd issue I mentioned yesterday
is a completely separate issue. I would have to do
a new kernel bisection to isolate it and then start
a new thread with whomever about it.

On 2025.01.18 16:09 Doug Smythies wrote:

> Hi Peter,
> 
> An update.
>
> On 2025.01.14 02:59 Peter Zijlstra wrote:
>> On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
>>> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>
>>>> means that there were 19 occurrences of turbostat interval times
>>>> between 1.016 and 1.016999 seconds.
>>> 
>>> OK, let me lower my threshold to 10ms and change the turbostat
>>> invocation -- see if I can catch me some wabbits :-)
>>
>> I've had it run overnight and have not caught a single >10ms event :-(
>
> Okay, so both you and I have many many hours of testing and
> never see >= 10ms in that area of the turbostat code anymore.
>
> The lingering >= 10ms (but I have never seen more than 25 ms)
> is outside of that timing.

... snip ...

> I am saying there were 2 different issues. The 2nd was hidden by the 1st
> because its magnitude was about 260 times less.

The first issue was solved by your two commits of this thread and now in
Kernel 6.13:

66951e4860d3 sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
6d71a9c61604 sched/fair: Fix EEVDF entity placement bug causing scheduling lag

The second issue is not present in my original kernel bisection for the first
bad kernel and must have been introduced later on.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-19  0:09                                 ` Doug Smythies
  2025-01-20  3:55                                   ` Doug Smythies
@ 2025-01-21  8:49                                   ` Peter Zijlstra
  2025-01-21 11:21                                     ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-21  8:49 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Sat, Jan 18, 2025 at 04:09:02PM -0800, Doug Smythies wrote:
> Hi Peter,
> 
> An update.
> 
> On 2025.01.14 02:59 Peter Zijlstra wrote:
> > On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
> >> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
> 
> >>> means that there were 19 occurrences of turbostat interval times
> >>> between 1.016 and 1.016999 seconds.
> >> 
> >> OK, let me lower my threshold to 10ms and change the turbostat
> >> invocation -- see if I can catch me some wabbits :-)
> >
> > I've had it run overnight and have not caught a single >10ms event :-(
> 
> Okay, so both you and I have many many hours of testing and
> never see >= 10ms in that area of the turbostat code anymore.

Hehe, yeah, I actually had it run for 3 solid days in the end.

> The lingering >= 10ms (but I have never seen more than 25 ms)
> is outside of that timing. As previously reported, I thought it might
> be in the sampling interval sleep step, but I did a bunch of testing
> and it doesn't appear to be there. That leaves:
> 
> delta_platform(&platform_counters_even, &platform_counters_odd);
> compute_average(ODD_COUNTERS);
> format_all_counters(ODD_COUNTERS);
> flush_output_stdout();
> 
> I modified your tracing trigger thing in turbostat to this:

Shiny!

What turbostat invocation do I use? I think the last I had was:

  tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1

I've started a new run of yes-vs-turbostat with the modified trigger
condition. Let's see what pops out.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-20  3:55                                   ` Doug Smythies
@ 2025-01-21 11:06                                     ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-21 11:06 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Sun, Jan 19, 2025 at 07:55:56PM -0800, Doug Smythies wrote:
> Hi Peter,
> 
> I now know that the 2nd issue I mentioned yesterday
> is a completely separate issue. I would have to do
> a new kernel bisection to isolate it and then start
> a new thread with whomever about it.

Yeah, let's not make things complicated and just carry on with the fun
and games as is :-)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-21  8:49                                   ` Peter Zijlstra
@ 2025-01-21 11:21                                     ` Peter Zijlstra
  2025-01-21 15:58                                       ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-21 11:21 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote:

> > I modified your tracing trigger thing in turbostat to this:
> 
> Shiny!
> 
> What turbostat invocation do I use? I think the last I had was:
> 
>   tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1
> 
> I've started a new run of yes-vs-turbostat with the modified trigger
> condition. Let's see what pops out.

Ok, I have a trace.

So I see turbostat wake up on CPU 15, do its migration round 0-15 and
when its back at 15 it prints the WHOOPSIE.

(trimmed trace):

             yes-1169    [015] dNh4.  4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015
             yes-1169    [015] d..2.  4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100
    migration/15-158     [015] d..3.  4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0
     migration/0-20      [000] d..3.  4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1
     migration/1-116     [001] d..3.  4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2
     migration/2-25      [002] d..3.  4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3
     migration/3-122     [003] d..3.  4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4
     migration/4-31      [004] d..3.  4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5
     migration/5-128     [005] d..3.  4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6
     migration/6-37      [006] d..3.  4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7
     migration/7-134     [007] d..3.  4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8
     migration/8-43      [008] d..3.  4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9
     migration/9-140     [009] d..3.  4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10
    migration/10-49      [010] d..3.  4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11
    migration/11-146     [011] d..3.  4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12
    migration/12-55      [012] d..3.  4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13
    migration/13-152     [013] d..3.  4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14
    migration/14-62      [014] d..3.  4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15
             yes-1169    [015] d..2.  4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100
       turbostat-1185    [015] .....  4238.262189: __x64_sys_gettimeofday: WHOOPSIE!

The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430
or 430us, which doesn't seem excessive to me.

Let me go read this turbostat code to figure out what exactly the
trigger condition signifies. Because I'm not seeing anything weird here.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-21 11:21                                     ` Peter Zijlstra
@ 2025-01-21 15:58                                       ` Doug Smythies
  2025-01-24  4:34                                         ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-21 15:58 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

On 2025.01.21 03:22 Peter Zijlstra wrote:
>On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote:
>
>>> I modified your tracing trigger thing in turbostat to this:
>> 
>> Shiny!
>> 
>> What turbostat invocation do I use? I think the last I had was:
>> 
>>   tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1
>> 
>> I've started a new run of yes-vs-turbostat with the modified trigger
>> condition. Let's see what pops out.
>
> Ok, I have a trace.o
>
> So I see turbostat wake up on CPU 15, do its migration round 0-15 and
> when its back at 15 it prints the WHOOPSIE.
>
> (trimmed trace):
>
>             yes-1169    [015] dNh4.  4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015
>             yes-1169    [015] d..2.  4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100
>    migration/15-158     [015] d..3.  4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0
>    migration/0-20      [000] d..3.  4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1
>     migration/1-116     [001] d..3.  4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2
>     migration/2-25      [002] d..3.  4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3
>     migration/3-122     [003] d..3.  4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4
>     migration/4-31      [004] d..3.  4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5
>     migration/5-128     [005] d..3.  4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6
>     migration/6-37      [006] d..3.  4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7
>     migration/7-134     [007] d..3.  4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8
>     migration/8-43      [008] d..3.  4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9
>     migration/9-140     [009] d..3.  4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10
>    migration/10-49      [010] d..3.  4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11
>    migration/11-146     [011] d..3.  4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12
>    migration/12-55      [012] d..3.  4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13
>    migration/13-152     [013] d..3.  4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14
>    migration/14-62      [014] d..3.  4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15
>             yes-1169    [015] d..2.  4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100
>       turbostat-1185    [015] .....  4238.262189: __x64_sys_gettimeofday: WHOOPSIE!
>
> The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430
> or 430us, which doesn't seem excessive to me.
>
> Let me go read this turbostat code to figure out what exactly the
> trigger condition signifies. Because I'm not seeing anything weird here.

I think the anomaly would have been about 1 second earlier, on CPU 15,
and before entering sleep,
but after the previous call to the time of day stuff.

Somewhere in this code:

                delta_platform(&platform_counters_even, &platform_counters_odd);
                compute_average(ODD_COUNTERS);
                format_all_counters(ODD_COUNTERS);
                flush_output_stdout();

Please know that I ran a couple of tests yesterday for a total of about 8 hours
and never got a measured interval time >= 10 mSec.
I was using kernel 6.13, which includes your 2 patches, and I tried a slight
modification to the turbostat command:

sudo ./turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1 --out /dev/shm/turbo.log

That allowed me to acquire more than my ssh session history limit of about 9000 lines (seconds) and also eliminated ssh
communications.
It was on purpose that I used RAM to write the log file to.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-21 15:58                                       ` Doug Smythies
@ 2025-01-24  4:34                                         ` Doug Smythies
  2025-01-24 11:04                                           ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-01-24  4:34 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
	Doug Smythies

On 2025.01.21 Doug Smythies wrote:
> On 2025.01.21 03:22 Peter Zijlstra wrote:
>> On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote:
>>
>>>> I modified your tracing trigger thing in turbostat to this:
>>> 
>>> Shiny!
>>> 
>>> What turbostat invocation do I use? I think the last I had was:
>>> 
>>>   tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1
>>> 
>>> I've started a new run of yes-vs-turbostat with the modified trigger
>>> condition. Let's see what pops out.
>>
>> Ok, I have a trace.o
>>
>> So I see turbostat wake up on CPU 15, do its migration round 0-15 and
>> when its back at 15 it prints the WHOOPSIE.
>>
>> (trimmed trace):
>>
>>             yes-1169    [015] dNh4.  4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015
>>             yes-1169    [015] d..2.  4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100
>>    migration/15-158     [015] d..3.  4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0
>>    migration/0-20      [000] d..3.  4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1
>>     migration/1-116     [001] d..3.  4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2
>>     migration/2-25      [002] d..3.  4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3
>>     migration/3-122     [003] d..3.  4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4
>>     migration/4-31      [004] d..3.  4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5
>>     migration/5-128     [005] d..3.  4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6
>>     migration/6-37      [006] d..3.  4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7
>>     migration/7-134     [007] d..3.  4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8
>>     migration/8-43      [008] d..3.  4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9
>>     migration/9-140     [009] d..3.  4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10
>>    migration/10-49      [010] d..3.  4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11
>>    migration/11-146     [011] d..3.  4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12
>>    migration/12-55      [012] d..3.  4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13
>>    migration/13-152     [013] d..3.  4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14
>>    migration/14-62      [014] d..3.  4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15
>>             yes-1169    [015] d..2.  4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100
>>       turbostat-1185    [015] .....  4238.262189: __x64_sys_gettimeofday: WHOOPSIE!
>>
>> The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430
>> or 430us, which doesn't seem excessive to me.
>>
>> Let me go read this turbostat code to figure out what exactly the
>> trigger condition signifies. Because I'm not seeing anything weird here.
>
> I think the anomaly would have been about 1 second ago, on CPU 15,
> and before entering sleep.
> But after the previous call to the time of day stuff.
>
> Somewhere in this code:
>
>  delta_platform(&platform_counters_even, &platform_counters_odd);
>  compute_average(ODD_COUNTERS);
>  format_all_counters(ODD_COUNTERS);
>  flush_output_stdout();
>
> Please know that I ran a couple of tests yesterday for a total of about 8 hours
> and never got a measured interval time >= 10 mSec.
> I was using kernel 6.13, which includes your 2 patches, and I tried a slight
> modification to the turbostat command:
>
> sudo ./turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1 --out /dev/shm/turbo.log
>
> That allowed me to acquire more than my ssh session history limit of about 9000 lines (seconds) and also eliminated ssh
> communications.
> It was on purpose that I used RAM to write the log file to.

I have run more tests over the last couple of days, totalling over 30 hours.
I simply do not get a measured interval time >= 10mSec using kernel 6.13.
The previous work was kernel 6.13-rc6 + the 2 patches + the tracing stuff.
I never tried kernel 6.13-rc7. 



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-24  4:34                                         ` Doug Smythies
@ 2025-01-24 11:04                                           ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-24 11:04 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Thu, Jan 23, 2025 at 08:34:57PM -0800, Doug Smythies wrote:

> I have run more tests over the last couple of days, totalling over 30 hours.
> I simply do not get a measured interval time >= 10mSec using kernel 6.13.
> The previous work was kernel 6.13-rc6 + the 2 patches + the tracing stuff.
> I never tried kernel 6.13-rc7. 

OK, lets close this for now then. Feel free to contact me again if you
find anything else.

Thanks for all the useful input!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-01-09 12:18                       ` [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag tip-bot2 for Peter Zijlstra
@ 2025-04-17  9:56                         ` Alexander Egorenkov
  2025-04-22  5:40                           ` ll"RE: " Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Egorenkov @ 2025-04-17  9:56 UTC (permalink / raw)
  To: tip-bot2; +Cc: dsmythies, linux-kernel, linux-tip-commits, mingo, peterz, x86


Hi Peter,

after this change, we are seeing big latencies when trying to execute a
simple command via SSH on a Fedora 41 s390x remote system that is under
stress.

I was able to bisect the problem to this commit.

The problem is easy to reproduce with stress-ng running on the remote
system, which is otherwise idle, and concurrent SSH connect attempts
from a local system to the remote one.

stress-ng (on remote system)
----------------------------

$ cpus=$(nproc)
$ stress-ng --cpu $((cpus * 2)) --matrix 50 --mq 50 --aggressive --brk 2
            --stack 2 --bigheap 2 --userfaultfd 0 --perf -t 5m

SSH connect attempts (from local to remote system)
--------------------------------------------------

$ ssh_options=(
        -o UserKnownHostsFile=/dev/null
        -o StrictHostKeyChecking=no
        -o LogLevel=ERROR
        -o ConnectTimeout=10
        -o TCPKeepAlive=yes
        -o ServerAliveInterval=10
        -o PreferredAuthentications=publickey
        -o PubkeyAuthentication=yes
        -o BatchMode=yes
        -o ForwardX11=no
        -A
)

$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done

========
My tests
========

commit v6.12
------------

$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done

ssh "${ssh_options[@]}" ciuser@a8345039 true  0.01s user 0.00s system 1% cpu 0.919 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 9% cpu 0.068 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 8% cpu 0.069 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 6% cpu 0.092 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 6% cpu 0.097 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 5% cpu 0.109 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 7% cpu 0.083 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 7% cpu 0.079 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 11% cpu 0.054 total

commit 6d71a9c6160479899ee744d2c6d6602a191deb1f
-----------------------------------------------

$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done

ssh "${ssh_options[@]}" ciuser@a8345034 true  0.01s user 0.00s system 0% cpu 33.379 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 1.206 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 2.388 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 9% cpu 0.055 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 2.376 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 2% cpu 0.243 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 11% cpu 0.049 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 2.563 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 8% cpu 0.065 total

Thank you
Regards
Alex


* RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-04-17  9:56                         ` Alexander Egorenkov
@ 2025-04-22  5:40                           ` Doug Smythies
  2025-04-24  7:56                             ` Alexander Egorenkov
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-04-22  5:40 UTC (permalink / raw)
  To: 'Alexander Egorenkov', tip-bot2
  Cc: linux-kernel, linux-tip-commits, mingo, peterz, x86,
	Doug Smythies

On 2025.04.17 02:57 Alexander Egorenkov wrote:

> Hi Peter,
>
> after this change, we are seeing big latencies when trying to execute a
> simple command per SSH on a Fedora 41 s390x remote system which is under
> stress.
>
> I was able to bisect the problem to this commit.
>
> The problem is easy to reproduce with stress-ng executed on the remote
> system which is otherwise unoccupied and concurrent SSH connect attempts
> from a local system to the remote one.
>
> stress-ng (on remote system)
> ----------------------------
>
> $ cpus=$(nproc)
> $ stress-ng --cpu $((cpus * 2)) --matrix 50 --mq 50 --aggressive --brk 2
>            --stack 2 --bigheap 2 --userfaultfd 0 --perf -t 5m

That is a very, very stressful test. It crashes within a few seconds on my test computer
with a "Segmentation fault (core dumped)" message.
If I back it off to this:

$ stress-ng --cpu 24 --matrix 50 --mq 50 --aggressive --brk 2 --stack 2 --bigheap 2 -t 300m

It runs, but still makes a great many entries in /var/log/kern.log as the OOM killer runs, etc.
I am suggesting it is not a reasonable test workload.

Anyway, I used turbostat the same way I was using it back in January for this work, and did observe
longer-than-requested intervals.
I took 1427 samples and got 10 where the interval was more than 1 second longer than requested.
The worst was 7.5 seconds longer than requested.

I rechecked the 100% workload used in January (12X "yes > /dev/null") and it was fine:
3551 samples, and the actual interval was never more than 10 milliseconds longer than requested.
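For reference, the kind of interval check described above can be approximated without turbostat. This is a rough sketch only; the interval and sample count here are illustrative, not the settings used in the tests above:

```shell
#!/bin/sh
# Rough stand-in for the turbostat interval check: request a fixed sleep
# interval and report how much longer each wakeup actually took.
# interval and samples are illustrative values, not from this thread.
interval=1      # requested interval in seconds
samples=3
i=0
while [ "$i" -lt "$samples" ]; do
    start=$(date +%s%N)
    sleep "$interval"
    end=$(date +%s%N)
    # overrun in milliseconds beyond the requested interval
    overrun_ms=$(( (end - start) / 1000000 - interval * 1000 ))
    echo "sample $i: overrun ${overrun_ms} ms"
    i=$((i + 1))
done
```

Under heavy scheduling lag, the reported overrun grows well beyond the scheduler's usual wakeup latency.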

Kernel 6.15-rc2
Turbostat version: 2025.04.06
Turbostat sample interval: 2 seconds.
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (12 CPU, 6 cores)

... Doug




* Re: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-04-22  5:40                           ` RE: " Doug Smythies
@ 2025-04-24  7:56                             ` Alexander Egorenkov
  2025-04-26 15:09                               ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Egorenkov @ 2025-04-24  7:56 UTC (permalink / raw)
  To: Doug Smythies, tip-bot2
  Cc: linux-kernel, linux-tip-commits, mingo, peterz, x86,
	Doug Smythies

Hi all,

> That is a very very stressful test. It crashes within a few seconds on my test computer,
> with a " Segmentation fault (core dumped)" message.

Yes, this is an artificial test I came up with to demonstrate the
problem we see with another, more realistic test that I can hardly
use here for demonstration purposes. But it reveals the exact
same problem we have with our CI tests on s390x test systems.

Let me explain briefly how it happens.

Basically, we have a test system where we execute a test suite and
simultaneously monitor it from another system via simple SSH logins
(invoked approximately every 15 seconds) to check whether the test
system is still online, and we automatically trigger a dump if it
remains unresponsive for 5 minutes straight. To keep the monitoring
robust, we limit every such SSH login to 10 seconds, because we have
had situations where SSH hung for a long time due to various problems
with networking, the test system itself, etc.
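A minimal sketch of such a monitoring loop might look like this. The host name, probe command, and dump action are assumptions for illustration; the actual CI script is not shown in this thread:

```shell
#!/bin/sh
# Hypothetical version of the monitor described above: probe the test
# system every 15 s, give each SSH login at most 10 s, and trigger a
# dump after 5 minutes (20 consecutive failures) of unresponsiveness.
host=root@test-system     # placeholder target, not a real system
fail_limit=20             # 20 failures * 15 s interval = 5 minutes
fails=0
while true; do
    if timeout 10 ssh -o BatchMode=yes "$host" true 2>/dev/null; then
        fails=0
    else
        fails=$((fails + 1))
        if [ "$fails" -ge "$fail_limit" ]; then
            echo "test system unresponsive for 5 minutes, triggering dump"
            fails=0       # the real script would collect a dump here
        fi
    fi
    sleep 15
done
```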

And since the commit "sched/fair: Fix EEVDF entity placement bug causing
scheduling lag" we regularly see SSH logins (limited to 10 s) failing for
5 minutes straight; not a single SSH login succeeds. This happens regularly
with test suites that compile software with GCC and use all CPUs
at 100%. Before the commit, an SSH login took under 1 second.
I cannot judge whether the problem really lies in this commit, or it is
just an accumulated effect of multiple ones.

FYI:
One such system where it happens regularly has 7 cores (5.2 GHz, SMT 2x, 14 CPUs)
and 8 GB of main memory with 20 GB of swap.

Thanks
Regards
Alex


* RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-04-24  7:56                             ` Alexander Egorenkov
@ 2025-04-26 15:09                               ` Doug Smythies
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Smythies @ 2025-04-26 15:09 UTC (permalink / raw)
  To: 'Alexander Egorenkov', peterz
  Cc: linux-kernel, mingo, x86, Doug Smythies

Hi Alexander,

Thank you for your reply.
Note that I have adjusted the address list for this email: I don't know whether bots can receive emails, and Peter was not on the
"To" line and might not have noticed this thread.
@Peter: I will forward you the other emails off-list, in case you missed them. I apologise if you did see them but haven't had time
to get to them.

Also note that I know nothing about the scheduler and was only on the original email because I had a "Reported-by" tag.

On 2025.04.24 00:57 Alexander Egorenkov wrote:

> Hi all,

[Doug wrote]
>> That is a very very stressful test. It crashes within a few seconds on my test computer,
>> with a " Segmentation fault (core dumped)" message.
>
> Yes, this is an artificial test i came up with to demonstrate the
> problem we have with another realistic test which i can hardly
> use here for the sake of demonstration. But it reveals the exact
> same problem we have with our CI test on s390x test systems.
>
> Let me explain shortly how it happens.
>
> Basically, we have a test system where we execute a test suite and
> simultaneously monitor this system on another system via simple SSH
> logins (approximately invoked every 15 seconds) whether the test system
> is still online and dump automatically if it remains unresponsive for
> 5m straight. We limit every such SSH login to 10 seconds because
> we had situations where SSH sometimes hanged for a long time due to
> various problems with networking, test system itself etc., just to make
> our monitoring robust.
>
> And since the commit "sched/fair: Fix EEVDF entity placement bug causing
> scheduling lag" we regularly see SSH logins (limited to 10s) failing for
> 5m straight, not a single SSH login succeeds. This happens regularly
> with test suites which compile software with GCC and use all CPUs
> at 100%. Before the commit, a SSH login required under 1 second.
> I cannot judge whether the problem really in this commit, or it is just an
> accumulated effect after multiple ones.
>
> FYI:
> One such system where it happens regularly has 7 cores (5.2Ghz SMT 2x, 14 cpus)
> and 8G of main memory with 20G of swap.
>
> Thanks
> Regards
> Alex

Thanks for the explanation.
I have recreated your situation with a workflow that, while it stresses the CPUs,
doesn't make any entries in /var/log/kern.log and /var/log/syslog.
Under the same conditions, I have confirmed that the ssh login lag doesn't occur
with kernel 6.12, but does with kernel 6.13.

My workflow is stuff I have used for many years and wrote myself.
Basically, I create a huge queue of running tasks, with each doing a little work
and then sleeping for a short period. I have 2 methods to achieve a similar overall
workflow; one shows the issue and one does not. I can also create a huge
queue by just increasing the number of "yes" tasks to a ridiculous number, but
that does not show your ssh login lag issue.
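A minimal sketch of that kind of workload, scaled far down: many tasks that each do a little work and then sleep briefly. The task count, work size, and sleep time below are illustrative; the actual scripts are Doug's own and are not shown in this thread:

```shell
#!/bin/sh
# Spawn many background tasks that each alternate between a little
# busy work and a short sleep, building up a large run queue of
# mostly-sleeping tasks. Scaled down here; the thread reports runs
# with roughly 20,000 tasks.
ntasks=50
rounds=20
i=0
while [ "$i" -lt "$ntasks" ]; do
    (
        r=0
        while [ "$r" -lt "$rounds" ]; do
            j=0
            while [ "$j" -lt 500 ]; do j=$((j + 1)); done   # a little work
            sleep 0.01                                      # a short sleep
            r=$((r + 1))
        done
    ) &
    i=$((i + 1))
done
wait
echo "all $ntasks tasks finished"
```

With the task count raised far enough, the load average climbs into the thousands while almost every task is asleep at any given instant, which is the shape of workload that exposed the ssh login lag.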

Anyway, for the workflow that does show your issue, I had a load average of
about 19,500 (20,000 tasks), and ssh login times ranged from 10 to 38 seconds,
with an average of about 13 seconds. ssh login times using kernel 6.12 were
negligible.

... Doug





end of thread, other threads:[~2025-04-26 15:09 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-29 22:51 [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
2025-01-06 11:57 ` Peter Zijlstra
2025-01-06 15:01   ` Doug Smythies
2025-01-06 16:59     ` Peter Zijlstra
2025-01-06 17:04       ` Peter Zijlstra
2025-01-06 17:14         ` Peter Zijlstra
2025-01-07  1:24           ` Doug Smythies
2025-01-07 10:49             ` Peter Zijlstra
2025-01-06 22:28         ` Doug Smythies
2025-01-07 11:26           ` Peter Zijlstra
2025-01-07 15:04             ` Doug Smythies
2025-01-07 16:25               ` Doug Smythies
2025-01-07 19:23             ` Peter Zijlstra
2025-01-08  5:15               ` Doug Smythies
2025-01-08 13:12                 ` Peter Zijlstra
2025-01-08 15:48                   ` Doug Smythies
2025-01-09 10:59                     ` Peter Zijlstra
2025-01-09 12:18                       ` [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag tip-bot2 for Peter Zijlstra
2025-04-17  9:56                         ` Alexander Egorenkov
2025-04-22  5:40                           ` RE: " Doug Smythies
2025-04-24  7:56                             ` Alexander Egorenkov
2025-04-26 15:09                               ` Doug Smythies
2025-01-10  5:09                       ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
2025-01-10 11:57                         ` Peter Zijlstra
2025-01-12 23:14                           ` Doug Smythies
2025-01-13 11:03                             ` Peter Zijlstra
2025-01-14 10:58                               ` Peter Zijlstra
2025-01-14 15:15                                 ` Doug Smythies
2025-01-15  2:08                                   ` Len Brown
2025-01-15 16:47                                     ` Doug Smythies
2025-01-19  0:09                                 ` Doug Smythies
2025-01-20  3:55                                   ` Doug Smythies
2025-01-21 11:06                                     ` Peter Zijlstra
2025-01-21  8:49                                   ` Peter Zijlstra
2025-01-21 11:21                                     ` Peter Zijlstra
2025-01-21 15:58                                       ` Doug Smythies
2025-01-24  4:34                                         ` Doug Smythies
2025-01-24 11:04                                           ` Peter Zijlstra
2025-01-13 11:05                             ` Peter Zijlstra
2025-01-13 16:01                               ` Doug Smythies
2025-01-13 12:58                           ` [tip: sched/urgent] sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE tip-bot2 for Peter Zijlstra
2025-01-12 19:59                         ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
  -- strict thread matches above, loose matches on Subject: below --
2024-07-27 10:27 Peter Zijlstra
2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
2024-11-28 10:58   ` Peter Zijlstra
2024-11-28 11:37     ` Marcel Ziswiler
2024-11-29  9:08       ` Peter Zijlstra
2024-12-02 18:46         ` Marcel Ziswiler
2024-12-09  9:49           ` Peter Zijlstra
2024-12-10 16:05             ` Marcel Ziswiler
2024-12-10 16:13           ` Steven Rostedt
2024-12-10  8:45   ` Luis Machado

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).