* [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
@ 2024-12-29 22:51 Doug Smythies
  2025-01-06 11:57 ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread

From: Doug Smythies @ 2024-12-29 22:51 UTC (permalink / raw)
To: peterz; +Cc: Doug Smythies, linux-kernel, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 2976 bytes --]

Hi Peter,

I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
An edited summary (actual log attached):

         82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
b10 bad  fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
b13 bad  54a58a787791 sched/fair: Implement DELAY_ZERO
    skip 152e11f6df29 sched/fair: Implement delayed dequeue
    skip e1459a50ba31 sched: Teach dequeue_task() about special task states
    skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
    skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
    skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
    skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
         dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
         e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()

Where "bN" is just my assigned kernel name for each bisection step.

In the linux-kernel email archives I found a thread that isolated these same commits.
It was from late November / early December:

https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174

An example of the turbostat manifestation of the issue:

doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
[sudo] password for doug:
Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
99.76   4800    4104    12304   73      80.08
99.76   4800    4104    12047   73      80.23
99.76   4800    879     12157   73      11.40
99.76   4800    26667   84214   72      557.23
99.76   4800    4104    12036   72      79.39

Where TSC_MHz was reported as 879, there was a big gap in time.
Like 4.7 seconds instead of 1.
Where TSC_MHz was reported as 26667, there was not a big gap in time.

It happens for about 5% of the samples + or - a lot.
It only happens when the workload is almost exactly 100%.
More load, it doesn't occur.
Less load, it doesn't occur. Although, I did get this once:

Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
91.46   4800    4104    11348   73      103.98
91.46   4800    4104    11353   73      103.89
91.50   4800    3903    11339   73      98.16
91.43   4800    4271    12001   73      108.52
91.45   4800    4148    11481   73      105.13
91.46   4800    4104    11341   73      103.96
91.46   4800    4104    11348   73      103.99

So, it might just be much less probable and less severe.

It happens over many different types of workload that I have tried.

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores, 2 threads per core, 12 CPUs.
OS: Ubuntu 24.04.1 LTS (server, no GUI)

... Doug

[-- Attachment #2: bisect-log.txt --]
[-- Type: text/plain, Size: 5985 bytes --]

doug@s19:~/kernel/linux$ git bisect bad
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
781773e3b68031bd001c0c18aa72e8470c225ebd
a1c446611e31ca5363d4db51e398271da1dce0af
e1459a50ba31831efdfc35278023d959e4ba775b
f12e148892ede8d9ee82bcd3e469e6d01fc077ac
152e11f6df293e816a6a37c69757033cdc72667d
2e0199df252a536a03f4cb0810324dff523d1e79
54a58a78779169f9c92a51facf6de7ce94962328
We cannot bisect more!
doug@s19:~/kernel/linux$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [98f7e32f20d28ec452afb208f9cffc08448a2652] Linux 6.11
git bisect good 98f7e32f20d28ec452afb208f9cffc08448a2652
# status: waiting for bad commit, 1 good commit known
# bad: [9852d85ec9d492ebef56dc5f229416c925758edc] Linux 6.12-rc1
git bisect bad 9852d85ec9d492ebef56dc5f229416c925758edc
# good: [176000734ee2978121fde22a954eb1eabb204329] Merge tag 'ata-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
git bisect good 176000734ee2978121fde22a954eb1eabb204329
# bad: [d0359e4ca0f26aaf3118124dfb562e3b3dca1c06] Merge tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad d0359e4ca0f26aaf3118124dfb562e3b3dca1c06
# bad: [171754c3808214d4fd8843eab584599a429deb52] Merge tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
git bisect bad 171754c3808214d4fd8843eab584599a429deb52
# good: [e55ef65510a401862b902dc979441ea10ae25c61] Merge tag 'amd-drm-next-6.12-2024-08-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good e55ef65510a401862b902dc979441ea10ae25c61
# good: [32bd3eb5fbab954e68adba8c0b6a43cf03605c93] Merge tag 'drm-intel-gt-next-2024-09-06' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
git bisect good 32bd3eb5fbab954e68adba8c0b6a43cf03605c93
# good: [726e2d0cf2bbc14e3bf38491cddda1a56fe18663] Merge tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping
git bisect good 726e2d0cf2bbc14e3bf38491cddda1a56fe18663
# good: [839c4f596f898edc424070dc8b517381572f8502] Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect good 839c4f596f898edc424070dc8b517381572f8502
# bad: [bd9bbc96e8356886971317f57994247ca491dbf1] sched: Rework dl_server
git bisect bad bd9bbc96e8356886971317f57994247ca491dbf1
# good: [863ccdbb918a77e3f011571f943020bf7f0b114b] sched: Allow sched_class::dequeue_task() to fail
git bisect good 863ccdbb918a77e3f011571f943020bf7f0b114b
# bad: [fc1892becd5672f52329a75c73117b60ac7841b7] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
git bisect bad fc1892becd5672f52329a75c73117b60ac7841b7
# skip: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
git bisect skip 2e0199df252a536a03f4cb0810324dff523d1e79
# skip: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
git bisect skip f12e148892ede8d9ee82bcd3e469e6d01fc077ac
# skip: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
git bisect skip e1459a50ba31831efdfc35278023d959e4ba775b
# skip: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
git bisect skip 781773e3b68031bd001c0c18aa72e8470c225ebd
# good: [abc158c82ae555078aa5dd2d8407c3df0f868904] sched: Prepare generic code for delayed dequeue
git bisect good abc158c82ae555078aa5dd2d8407c3df0f868904
# skip: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
git bisect skip a1c446611e31ca5363d4db51e398271da1dce0af
# good: [e28b5f8bda01720b5ce8456b48cf4b963f9a80a1] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
git bisect good e28b5f8bda01720b5ce8456b48cf4b963f9a80a1
# skip: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
git bisect skip 152e11f6df293e816a6a37c69757033cdc72667d
# bad: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
git bisect bad 54a58a78779169f9c92a51facf6de7ce94962328
# only skipped commits left to test
# possible first bad commit: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
# possible first bad commit: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
# possible first bad commit: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
# possible first bad commit: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
# possible first bad commit: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
# possible first bad commit: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
# possible first bad commit: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue

doug@s19:~/kernel/linux$ git log --oneline | grep -B 2 -A 10 54a58a78779
         82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
b10 bad  fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
b13 bad  54a58a787791 sched/fair: Implement DELAY_ZERO
    skip 152e11f6df29 sched/fair: Implement delayed dequeue
    skip e1459a50ba31 sched: Teach dequeue_task() about special task states
    skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
    skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
    skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
    skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
         dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
         e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()

^ permalink raw reply	[flat|nested] 51+ messages in thread
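[Editorial aside: the two anomalous TSC_MHz readings in the report above are roughly consistent with a multi-second stall landing between two of turbostat's counter reads. The stall lengths used below (4.67 s and 6.5 s) are assumptions chosen to match the reported numbers, not measurements from the thread:]

```shell
# Nominal TSC on the i5-10600K reads as ~4104 MHz. If one sampling interval
# silently stretches to ~4.67 s, roughly one second's worth of TSC ticks
# averaged over it reads low; if ~6.5 s of ticks land in a nominal 1 s
# interval, it reads high.
awk -v tsc=4104 'BEGIN {
        printf "stretched interval:  %.0f MHz\n", tsc / 4.67   # matches the reported 879
        printf "compressed interval: %.0f MHz\n", tsc * 6.5    # close to the reported 26667
}'
```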
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-29 22:51 [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies
@ 2025-01-06 11:57 ` Peter Zijlstra
  2025-01-06 15:01   ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread

From: Peter Zijlstra @ 2025-01-06 11:57 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Sun, Dec 29, 2024 at 02:51:43PM -0800, Doug Smythies wrote:
> Hi Peter,
>
> I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
> After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
> An edited summary (actual log attached):
>
>          82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
> b10 bad  fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
> b13 bad  54a58a787791 sched/fair: Implement DELAY_ZERO
>     skip 152e11f6df29 sched/fair: Implement delayed dequeue
>     skip e1459a50ba31 sched: Teach dequeue_task() about special task states
>     skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
>     skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
>     skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
>     skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
> b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
>          dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
> b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
>          e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
>
> Where "bN" is just my assigned kernel name for each bisection step.
>
> In the linux-kernel email archives I found a thread that isolated these same commits.
> It was from late November / early December:
>
> https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174
>
> An example of the turbostat manifestation of the issue:
>
> doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> [sudo] password for doug:
> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
> 99.76   4800    4104    12304   73      80.08
> 99.76   4800    4104    12047   73      80.23
> 99.76   4800    879     12157   73      11.40
> 99.76   4800    26667   84214   72      557.23
> 99.76   4800    4104    12036   72      79.39
>
> Where TSC_MHz was reported as 879, there was a big gap in time.
> Like 4.7 seconds instead of 1.
> Where TSC_MHz was reported as 26667, there was not a big gap in time.
>
> It happens for about 5% of the samples + or - a lot.
> It only happens when the workload is almost exactly 100%.
> More load, it doesn't occur.
> Less load, it doesn't occur. Although, I did get this once:
>
> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
> 91.46   4800    4104    11348   73      103.98
> 91.46   4800    4104    11353   73      103.89
> 91.50   4800    3903    11339   73      98.16
> 91.43   4800    4271    12001   73      108.52
> 91.45   4800    4148    11481   73      105.13
> 91.46   4800    4104    11341   73      103.96
> 91.46   4800    4104    11348   73      103.99
>
> So, it might just be much less probable and less severe.
>
> It happens over many different types of workload that I have tried.

In private email you've communicated it happens due to
sched_setaffinity() sometimes taking multiple seconds.

I'm trying to reproduce by starting a bash 'while :; do :; done' spinner
for each CPU, but so far am not able to reproduce.

What is the easiest 100% load you're seeing this with?

^ permalink raw reply	[flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 11:57 ` Peter Zijlstra
@ 2025-01-06 15:01   ` Doug Smythies
  2025-01-06 16:59     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread

From: Doug Smythies @ 2025-01-06 15:01 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.06 03:58 Peter Zijlstra wrote:
> On Sun, Dec 29, 2024 at 02:51:43PM -0800, Doug Smythies wrote:
>> Hi Peter,
>>
>> I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
>> After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
>> An edited summary (actual log attached):
>>
>>          82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
>> b10 bad  fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
>> b13 bad  54a58a787791 sched/fair: Implement DELAY_ZERO
>>     skip 152e11f6df29 sched/fair: Implement delayed dequeue
>>     skip e1459a50ba31 sched: Teach dequeue_task() about special task states
>>     skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
>>     skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
>>     skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
>>     skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
>> b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
>>          dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
>> b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
>>          e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
>>
>> Where "bN" is just my assigned kernel name for each bisection step.
>>
>> In the linux-kernel email archives I found a thread that isolated these same commits.
>> It was from late November / early December:
>>
>> https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174
>>
>> An example of the turbostat manifestation of the issue:
>>
>> doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>> [sudo] password for doug:
>> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
>> 99.76   4800    4104    12304   73      80.08
>> 99.76   4800    4104    12047   73      80.23
>> 99.76   4800    879     12157   73      11.40
>> 99.76   4800    26667   84214   72      557.23
>> 99.76   4800    4104    12036   72      79.39
>>
>> Where TSC_MHz was reported as 879, there was a big gap in time.
>> Like 4.7 seconds instead of 1.
>> Where TSC_MHz was reported as 26667, there was not a big gap in time.
>>
>> It happens for about 5% of the samples + or - a lot.
>> It only happens when the workload is almost exactly 100%.
>> More load, it doesn't occur.
>> Less load, it doesn't occur. Although, I did get this once:
>>
>> Busy%   Bzy_MHz TSC_MHz IRQ     PkgTmp  PkgWatt
>> 91.46   4800    4104    11348   73      103.98
>> 91.46   4800    4104    11353   73      103.89
>> 91.50   4800    3903    11339   73      98.16
>> 91.43   4800    4271    12001   73      108.52
>> 91.45   4800    4148    11481   73      105.13
>> 91.46   4800    4104    11341   73      103.96
>> 91.46   4800    4104    11348   73      103.99
>>
>> So, it might just be much less probable and less severe.
>>
>> It happens over many different types of workload that I have tried.
>
> In private email you've communicated it happens due to
> sched_setaffinity() sometimes taking multiple seconds.
>
> I'm trying to reproduce by starting a bash 'while :; do :; done' spinner
> for each CPU, but so far am not able to reproduce.

I have also been trying to reproduce the issue without using turbostat.
No success.
The other thing to note is that my test computer is otherwise very very
idle, with no GUI and few services.

>
> What is the easiest 100% load you're seeing this with?

Lately, and specifically to be able to tell others, I have been using:

yes > /dev/null &

On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
I run 12 of those work loads.

... Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread
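[Editorial aside: the recipe above can be wrapped in a small script. The CPU-count detection and the cleanup at the end are additions of this sketch, not part of the original recipe:]

```shell
#!/bin/bash
# Spawn one 100% busy load per CPU, as in the 'yes > /dev/null &' recipe above.
ncpus=$(nproc)
pids=()
for ((i = 0; i < ncpus; i++)); do
        yes > /dev/null &
        pids+=($!)
done
echo "spawned ${#pids[@]} spinners on ${ncpus} CPUs"

# ... run turbostat here to watch for anomalous samples ...

kill "${pids[@]}"                      # clean up the spinners when done
wait "${pids[@]}" 2>/dev/null || true
```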
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 15:01 ` Doug Smythies
@ 2025-01-06 16:59   ` Peter Zijlstra
  2025-01-06 17:04     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread

From: Peter Zijlstra @ 2025-01-06 16:59 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:

> > What is the easiest 100% load you're seeing this with?
>
> Lately, and specifically to be able to tell others, I have been using:
>
> yes > /dev/null &
>
> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> I run 12 of those work loads.

On my headless ivb-ep, 2 sockets, 10 cores each and 2 threads per core, I
do:

for ((i=0; i<40; i++)) ; do yes > /dev/null & done
tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1

But no so far, nada :-( I've tried with full preemption and voluntary,
HZ=1000.

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 16:59 ` Peter Zijlstra
@ 2025-01-06 17:04   ` Peter Zijlstra
  2025-01-06 17:14     ` Peter Zijlstra
  2025-01-06 22:28     ` Doug Smythies
  0 siblings, 2 replies; 51+ messages in thread

From: Peter Zijlstra @ 2025-01-06 17:04 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>
> > > What is the easiest 100% load you're seeing this with?
> >
> > Lately, and specifically to be able to tell others, I have been using:
> >
> > yes > /dev/null &
> >
> > On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> > I run 12 of those work loads.
>
> On my headless ivb-ep, 2 sockets, 10 cores each and 2 threads per core, I
> do:
>
> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>
> But no so far, nada :-( I've tried with full preemption and voluntary,
> HZ=1000.

And just as I send this, I see these happen:

100.00  3100    2793    40302   71      195.22
100.00  3100    2618    40459   72      183.58
100.00  3100    2993    46215   71      209.21
100.00  3100    2789    40467   71      195.19
 99.92  3100    2798    40589   71      195.76
100.00  3100    2793    40397   72      195.46
...
100.00  3100    2844    41906   71      199.43
100.00  3100    2779    40468   71      194.51
 99.96  3100    2320    40933   71      163.23
100.00  3100    3529    61823   72      245.70
100.00  3100    2793    40493   72      195.45
100.00  3100    2793    40462   72      195.56

They look like funny little blips. Nowhere near as bad as you had
though.

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 17:04 ` Peter Zijlstra
@ 2025-01-06 17:14   ` Peter Zijlstra
  2025-01-07  1:24     ` Doug Smythies
  2025-01-06 22:28   ` Doug Smythies
  1 sibling, 1 reply; 51+ messages in thread

From: Peter Zijlstra @ 2025-01-06 17:14 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 06:04:55PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
> >
> > > > What is the easiest 100% load you're seeing this with?
> > >
> > > Lately, and specifically to be able to tell others, I have been using:
> > >
> > > yes > /dev/null &
> > >
> > > On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> > > I run 12 of those work loads.
> >
> > On my headless ivb-ep, 2 sockets, 10 cores each and 2 threads per core, I
> > do:
> >
> > for ((i=0; i<40; i++)) ; do yes > /dev/null & done
> > tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> >
> > But no so far, nada :-( I've tried with full preemption and voluntary,
> > HZ=1000.
>
> And just as I send this, I see these happen:
>
> 100.00  3100    2793    40302   71      195.22
> 100.00  3100    2618    40459   72      183.58
> 100.00  3100    2993    46215   71      209.21
> 100.00  3100    2789    40467   71      195.19
>  99.92  3100    2798    40589   71      195.76
> 100.00  3100    2793    40397   72      195.46
> ...
> 100.00  3100    2844    41906   71      199.43
> 100.00  3100    2779    40468   71      194.51
>  99.96  3100    2320    40933   71      163.23
> 100.00  3100    3529    61823   72      245.70
> 100.00  3100    2793    40493   72      195.45
> 100.00  3100    2793    40462   72      195.56
>
> They look like funny little blips. Nowhere near as bad as you had
> though.

Anyway, given you've confirmed disabling DELAY_DEQUEUE fixes things,
could you perhaps try the below hackery for me? It's a bit of a wild
guess, but throw stuff at wall, see what sticks etc..
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..fa4b9891f93a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3019,7 +3019,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	} else {
 
 		if (!is_migration_disabled(p)) {
-			if (task_on_rq_queued(p))
+			if (task_on_rq_queued(p) && !p->se.sched_delayed)
 				rq = move_queued_task(rq, rf, p, dest_cpu);
 
 			if (!pending->stop_pending) {
@@ -3776,28 +3776,30 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
+	CLASS(__task_rq_lock, rq_guard)(p);
+	struct rq *rq = rq_guard.rq;
 
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
-			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
-			 */
-			wakeup_preempt(rq, p, wake_flags);
+	if (!task_on_rq_queued(p))
+		return 0;
+
+	update_rq_clock(rq);
+	if (p->se.sched_delayed) {
+		int queue_flags = ENQUEUE_NOCLOCK | ENQUEUE_DELAYED;
+		if (!is_cpu_allowed(p, cpu_of(rq))) {
+			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+			return 0;
 		}
-		ttwu_do_wakeup(p);
-		ret = 1;
+		enqueue_task(rq, p, queue_flags);
 	}
-	__task_rq_unlock(rq, &rf);
-
-	return ret;
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
+	}
+	ttwu_do_wakeup(p);
+	return 1;
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65fa64845d9f..b4c1f6c06c18 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1793,6 +1793,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+		    __task_rq_unlock(_T->rq, &_T->rf),
+		    struct rq *rq; struct rq_flags rf)
+
 DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),

^ permalink raw reply related	[flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 17:14 ` Peter Zijlstra
@ 2025-01-07  1:24   ` Doug Smythies
  2025-01-07 10:49     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread

From: Doug Smythies @ 2025-01-07 1:24 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

On 2025.01.06 09:14 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 06:04:55PM +0100, Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
>>> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>>>
>>>>> What is the easiest 100% load you're seeing this with?
>>>>
>>>> Lately, and specifically to be able to tell others, I have been using:
>>>>
>>>> yes > /dev/null &
>>>>
>>>> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
>>>> I run 12 of those work loads.
>>>
>>> On my headless ivb-ep, 2 sockets, 10 cores each and 2 threads per core, I
>>> do:
>>>
>>> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
>>> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>>>
>>> But no so far, nada :-( I've tried with full preemption and voluntary,
>>> HZ=1000.
>>>
>>
>> And just as I send this, I see these happen:
>>
>> 100.00  3100    2793    40302   71      195.22
>> 100.00  3100    2618    40459   72      183.58
>> 100.00  3100    2993    46215   71      209.21
>> 100.00  3100    2789    40467   71      195.19
>>  99.92  3100    2798    40589   71      195.76
>> 100.00  3100    2793    40397   72      195.46
>> ...
>> 100.00  3100    2844    41906   71      199.43
>> 100.00  3100    2779    40468   71      194.51
>>  99.96  3100    2320    40933   71      163.23
>> 100.00  3100    3529    61823   72      245.70
>> 100.00  3100    2793    40493   72      195.45
>> 100.00  3100    2793    40462   72      195.56
>>
>> They look like funny little blips. Nowhere near as bad as you had
>> though.
>
> Anyway, given you've confirmed disabling DELAY_DEQUEUE fixes things,
> could you perhaps try the below hackery for me? It's a bit of a wild
> guess, but throw stuff at wall, see what sticks etc..
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 84902936a620..fa4b9891f93a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3019,7 +3019,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> 	} else {
>
> 		if (!is_migration_disabled(p)) {
> -			if (task_on_rq_queued(p))
> +			if (task_on_rq_queued(p) && !p->se.sched_delayed)
> 				rq = move_queued_task(rq, rf, p, dest_cpu);
>
> 			if (!pending->stop_pending) {
> @@ -3776,28 +3776,30 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> -	struct rq_flags rf;
> -	struct rq *rq;
> -	int ret = 0;
> +	CLASS(__task_rq_lock, rq_guard)(p);
> +	struct rq *rq = rq_guard.rq;
>
> -	rq = __task_rq_lock(p, &rf);
> -	if (task_on_rq_queued(p)) {
> -		update_rq_clock(rq);
> -		if (p->se.sched_delayed)
> -			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -		if (!task_on_cpu(rq, p)) {
> -			/*
> -			 * When on_rq && !on_cpu the task is preempted, see if
> -			 * it should preempt the task that is current now.
> -			 */
> -			wakeup_preempt(rq, p, wake_flags);
> +	if (!task_on_rq_queued(p))
> +		return 0;
> +
> +	update_rq_clock(rq);
> +	if (p->se.sched_delayed) {
> +		int queue_flags = ENQUEUE_NOCLOCK | ENQUEUE_DELAYED;
> +		if (!is_cpu_allowed(p, cpu_of(rq))) {
> +			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +			return 0;
> 		}
> -		ttwu_do_wakeup(p);
> -		ret = 1;
> +		enqueue_task(rq, p, queue_flags);
> 	}
> -	__task_rq_unlock(rq, &rf);
> -
> -	return ret;
> +	if (!task_on_cpu(rq, p)) {
> +		/*
> +		 * When on_rq && !on_cpu the task is preempted, see if
> +		 * it should preempt the task that is current now.
> +		 */
> +		wakeup_preempt(rq, p, wake_flags);
> +	}
> +	ttwu_do_wakeup(p);
> +	return 1;
> }
>
> #ifdef CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 65fa64845d9f..b4c1f6c06c18 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1793,6 +1793,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> +		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> +		    __task_rq_unlock(_T->rq, &_T->rf),
> +		    struct rq *rq; struct rq_flags rf)
> +
> DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> 		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
> 		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),

I tried the patch on top of kernel 6.13-rc6.
It did not fix the issue.
I used my patched version of turbostat as per the previous email,
so that I could see which CPU and the CPU migration time.
CPU migration times >= 10 milliseconds are listed.
Results:

doug@s19:~$ date
Mon Jan  6 04:37:58 PM PST 2025
doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 | grep -v \- | grep -e "^[1-9]" -e "^ [1-9]" -e "^  [1-9]"
   usec	Time_Of_Day_Seconds	CPU	Busy%	IRQ
  16599	1736210307.324843	11	99.76	1004
6003601	1736210314.329844	11	99.76	1018
1164604	1736210330.509843	11	99.76	1003
6003604	1736210347.524844	11	99.76	1005
  23602	1736210369.570843	11	99.76	1003
 161680	1736210384.748843	7	99.76	1002
5750600	1736210398.507843	11	99.76	1005
6003607	1736210478.587844	11	99.76	1002
 210645	1736210479.799843	3	99.76	7017
  22602	1736210495.838843	11	99.76	1002
6003390	1736210520.861844	11	99.76	1002
 108627	1736210534.984843	10	99.76	1002
  23604	1736210570.047843	11	99.76	1003
6004604	1736210600.076843	11	99.76	1003
1895606	1736210606.977843	11	99.76	1002
3110603	1736210745.226843	11	99.76	1003
6003606	1736210765.244844	11	99.76	1002
6003605	1736210785.262843	11	99.76	1002
 401642	1736210847.732843	9	99.76	1002
6003604	1736210891.781843	11	99.76	1003
6003607	1736210914.802844	11	99.76	1002
6003605	1736210945.831843	11	99.76	1002
5579609	1736210968.428848	11	99.76	1002
6003600	1736210975.433844	11	99.76	6585
  93623	1736210985.537843	10	99.76	1003
5005605	1736210994.547843	11	99.76	1003
2654601	1736211029.244843	11	99.76	1004
  17604	1736211057.290843	11	99.76	1003
  23598	1736211077.334843	11	99.76	1006
 114671	1736211079.451843	2	99.76	1003
6003603	1736211105.475843	11	99.76	1002
^Cdoug@s19:~$ date
Mon Jan  6 04:52:18 PM PST 2025
doug@s19:~$ uname -a
Linux s19 6.13.0-rc6-peterz #1320 SMP PREEMPT_DYNAMIC Mon Jan  6 16:25:39 PST 2025 x86_64 x86_64 x86_64 GNU/Linux

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-07  1:24 ` Doug Smythies
@ 2025-01-07 10:49   ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread

From: Peter Zijlstra @ 2025-01-07 10:49 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot

On Mon, Jan 06, 2025 at 05:24:44PM -0800, Doug Smythies wrote:
> I tried the patch on top of kernel 6.13-rc6.
> It did not fix the issue.

Oh well, it was a long shot anyway. I'll try and make it reproduce
again.

^ permalink raw reply	[flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-06 17:04 ` Peter Zijlstra
  2025-01-06 17:14   ` Peter Zijlstra
@ 2025-01-06 22:28   ` Doug Smythies
  2025-01-07 11:26     ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread

From: Doug Smythies @ 2025-01-06 22:28 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies

[-- Attachment #1: Type: text/plain, Size: 4909 bytes --]

On 2025.01.06 09:05 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>>
>>>> What is the easiest 100% load you're seeing this with?
>>>
>>> Lately, and specifically to be able to tell others, I have been using:
>>>
>>> yes > /dev/null &
>>>
>>> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
>>> I run 12 of those work loads.
>>
>> On my headless ivb-ep, 2 sockets, 10 cores each and 2 threads per core, I
>> do:
>>
>> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
>> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>>
>> But no so far, nada :-( I've tried with full preemption and voluntary,
>> HZ=1000.

My HZ=1000 also. And:
CONFIG_NO_HZ_FULL=y

> And just as I send this, I see these happen:
>
> 100.00  3100    2793    40302   71      195.22
> 100.00  3100    2618    40459   72      183.58
> 100.00  3100    2993    46215   71      209.21
> 100.00  3100    2789    40467   71      195.19
>  99.92  3100    2798    40589   71      195.76
> 100.00  3100    2793    40397   72      195.46
> ...
> 100.00  3100    2844    41906   71      199.43
> 100.00  3100    2779    40468   71      194.51
>  99.96  3100    2320    40933   71      163.23
> 100.00  3100    3529    61823   72      245.70
> 100.00  3100    2793    40493   72      195.45
> 100.00  3100    2793    40462   72      195.56
>
> They look like funny little blips. Nowhere near as bad as you had
> though.

Yes, I get a lot of the lesser magnitude ones.
The large magnitude ones are very much a function of what else is running.
If I just add a 0.5% load at 73 hertz work/sleep frequency, then over a 2 hour and 31 minute test I got a maximum interval time of 1.68 seconds. Without that small perturbation I got tons of interval times of 7 seconds, meaning the regular 1 second interval plus 6 seconds for the CPU migration. Since I cannot seem to function without making a graph, some example graphs are attached. By the way, and to make it easier to go away while tests run, I am now using this turbostat command: doug@s19:~/kernel/linux/tools/power/x86/turbostat$ sudo ./turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 | grep "^[1-9]" 6005701 1736201357.014741 - 99.76 12034 177731 1736201386.221771 - 99.76 12034 6003699 1736201393.226740 - 99.76 14167 6003704 1736201422.253743 - 99.76 12040 6005700 1736201447.278740 - 99.76 12030 311699 1736201475.816740 - 99.76 12033 Which will show when a CPU migration took over 10 milliseconds. If you want to go further, for example to only display ones that took over a second and to include the target CPU, then patch turbostat: doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index 58a487c225a7..f8a73cc8fbfc 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data struct timeval tv; timersub(&t->tv_end, &t->tv_begin, &tv); - outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec); + outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec); } /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */ @@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p) int i; int status; + gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */ + if (cpu_migrate(cpu)) { fprintf(outf, "%s: Could not migrate to CPU 
%d\n", __func__, cpu); return -1; } - gettimeofday(&t->tv_begin, (struct timezone *)NULL); +// gettimeofday(&t->tv_begin, (struct timezone *)NULL); if (first_counter_read) get_apic_id(t); Example output: sudo ./turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 | grep "^[1-9]" 1330709 1736202049.987740 - 99.76 12040 1330603 1736202049.987740 11 99.76 1003 6008710 1736202068.008741 - 99.76 12030 6008601 1736202068.008741 11 99.76 1003 2003709 1736202120.936740 - 99.76 12028 2003603 1736202120.936740 11 99.76 1002 6005710 1736202140.956741 - 99.76 12028 6005604 1736202140.956741 11 99.76 1002 In this short example all captures were for the CPU 5 to 11 migration. 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds. I'll try, and report on, your test patch from the other email later. [-- Attachment #2: turbostat-sampling-issue-seconds.png --] [-- Type: image/png, Size: 32233 bytes --] [-- Attachment #3: turbostat-sampling-issue-seconds-detail-a.png --] [-- Type: image/png, Size: 42780 bytes --] [-- Attachment #4: turbostat-sampling-issue-seconds-detail-b.png --] [-- Type: image/png, Size: 87699 bytes --] ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-06 22:28 ` Doug Smythies @ 2025-01-07 11:26 ` Peter Zijlstra 2025-01-07 15:04 ` Doug Smythies 2025-01-07 19:23 ` Peter Zijlstra 0 siblings, 2 replies; 51+ messages in thread From: Peter Zijlstra @ 2025-01-07 11:26 UTC (permalink / raw) To: Doug Smythies; +Cc: linux-kernel, vincent.guittot On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote: > Which will show when a CPU migration took over 10 milliseconds. > If you want to go further, for example to only display ones that took > over a second and to include the target CPU, then patch turbostat: > > doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c > index 58a487c225a7..f8a73cc8fbfc 100644 > --- a/tools/power/x86/turbostat/turbostat.c > +++ b/tools/power/x86/turbostat/turbostat.c > @@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data > struct timeval tv; > > timersub(&t->tv_end, &t->tv_begin, &tv); > - outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec); > + outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec); > } > > /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */ > @@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p) > int i; > int status; > > + gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */ > + > if (cpu_migrate(cpu)) { > fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu); > return -1; > } > > - gettimeofday(&t->tv_begin, (struct timezone *)NULL); > +// gettimeofday(&t->tv_begin, (struct timezone *)NULL); > > if (first_counter_read) > get_apic_id(t); > > So I've taken the second node offline, running with 10 cores (20 threads) now. 
usec Time_Of_Day_Seconds CPU Busy% IRQ 106783 1736248404.951438 - 100.00 20119 46 1736248404.844701 0 100.00 1005 41 1736248404.844742 20 100.00 1007 42 1736248404.844784 1 100.00 1005 40 1736248404.844824 21 100.00 1006 41 1736248404.844865 2 100.00 1005 40 1736248404.844905 22 100.00 1006 41 1736248404.844946 3 100.00 1006 40 1736248404.844986 23 100.00 1005 41 1736248404.845027 4 100.00 1005 40 1736248404.845067 24 100.00 1006 41 1736248404.845108 5 100.00 1011 40 1736248404.845149 25 100.00 1005 41 1736248404.845190 6 100.00 1005 40 1736248404.845230 26 100.00 1005 42 1736248404.845272 7 100.00 1007 41 1736248404.845313 27 100.00 1005 41 1736248404.845355 8 100.00 1005 42 1736248404.845397 28 100.00 1006 46 1736248404.845443 9 100.00 1009 105995 1736248404.951438 29 100.00 1005 Is by far the worst I've had in the past few minutes playing with this. If I get a blimp (>10000) then it is always on the last CPU, are you seeing the same thing? > In this short example all captures were for the CPU 5 to 11 migration. > 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds. This seems to suggest you are, always on CPU 11. Weird! Anyway, let me see if I can capture a trace of this.. ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-07 11:26 ` Peter Zijlstra @ 2025-01-07 15:04 ` Doug Smythies 2025-01-07 16:25 ` Doug Smythies 2025-01-07 19:23 ` Peter Zijlstra 1 sibling, 1 reply; 51+ messages in thread From: Doug Smythies @ 2025-01-07 15:04 UTC (permalink / raw) To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies On 2025.01.07 03:26 Peter Zijlstra wrote: > On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote: > >> Which will show when a CPU migration took over 10 milliseconds. >> If you want to go further, for example to only display ones that took >> over a second and to include the target CPU, then patch turbostat: >> >> doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff >> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c >> index 58a487c225a7..f8a73cc8fbfc 100644 >> --- a/tools/power/x86/turbostat/turbostat.c >> +++ b/tools/power/x86/turbostat/turbostat.c >> @@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data >> struct timeval tv; >> >> timersub(&t->tv_end, &t->tv_begin, &tv); >> - outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec); >> + outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec); >> } >> >> /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */ >> @@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p) >> int i; >> int status; >> >> + gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */ >> + >> if (cpu_migrate(cpu)) { >> fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu); >> return -1; >> } >> >> - gettimeofday(&t->tv_begin, (struct timezone *)NULL); >> +// gettimeofday(&t->tv_begin, (struct timezone *)NULL); >> >> if (first_counter_read) >> get_apic_id(t); >> >> > > So I've taken the second node offline, running with 10 cores (20 > threads) now. 
> > usec Time_Of_Day_Seconds CPU Busy% IRQ > 106783 1736248404.951438 - 100.00 20119 > 46 1736248404.844701 0 100.00 1005 > 41 1736248404.844742 20 100.00 1007 > 42 1736248404.844784 1 100.00 1005 > 40 1736248404.844824 21 100.00 1006 > 41 1736248404.844865 2 100.00 1005 > 40 1736248404.844905 22 100.00 1006 > 41 1736248404.844946 3 100.00 1006 > 40 1736248404.844986 23 100.00 1005 > 41 1736248404.845027 4 100.00 1005 > 40 1736248404.845067 24 100.00 1006 > 41 1736248404.845108 5 100.00 1011 > 40 1736248404.845149 25 100.00 1005 > 41 1736248404.845190 6 100.00 1005 > 40 1736248404.845230 26 100.00 1005 > 42 1736248404.845272 7 100.00 1007 > 41 1736248404.845313 27 100.00 1005 > 41 1736248404.845355 8 100.00 1005 > 42 1736248404.845397 28 100.00 1006 > 46 1736248404.845443 9 100.00 1009 > 105995 1736248404.951438 29 100.00 1005 > > Is by far the worst I've had in the past few minutes playing with this. > > If I get a blimp (>10000) then it is always on the last CPU, are you > seeing the same thing? More or less, yes. The very long migrations are dominated by the CPU 5 to CPU 11 migration. Here is data from yesterday for other CPUs: usec Time_Of_Day_Seconds CPU Busy% IRQ 605706 1736224605.542844 0 99.76 1922 10001 1736224605.561844 1 99.76 1922 10999 1736224605.572843 7 99.76 1923 11001 1736224605.583844 2 99.76 1925 11000 1736224605.606844 4 99.76 1924 10999 1736224605.617843 10 99.76 1923 105001 1736224605.722844 5 99.76 1922 465657 1736224608.190843 8 99.76 1002 494000 1736224608.684843 3 99.76 1003 395674 1736224610.081843 7 99.76 1964 19679 1736224617.108843 7 99.76 1003 37709 1736224636.633845 0 99.76 1003 65641 1736224689.796843 9 99.76 1003 406631 1736224693.206843 4 99.76 1002 105622 1736225026.238843 10 99.76 1003 409622 1736225053.673843 10 99.76 1003 16706 1736225302.149847 0 99.76 1820 10000 1736225302.185846 4 99.76 1825 19663 1736225317.249844 7 99.76 1012 > >> In this short example all captures were for the CPU 5 to 11 migration. 
>> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds. > > This seems to suggest you are, always on CPU 11. > > Weird! Yes, weird. I think, but am not certain, the CPU sequence in turbostat per interval loop is: Wake on highest numbered CPU (11 in my case) Do a bunch of work that can be done without MSR reads. For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case) Do the CPU specific work Finish the intervals work and printing and such on CPU 11. Sleep for the interval time (we have been using 1 second) Without any proof, I was thinking the CPU 11 dominance for the long migration issue was due to the other bits of work done on that CPU. > Anyway, let me see if I can capture a trace of this.. ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-07 15:04 ` Doug Smythies @ 2025-01-07 16:25 ` Doug Smythies 0 siblings, 0 replies; 51+ messages in thread From: Doug Smythies @ 2025-01-07 16:25 UTC (permalink / raw) To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies On 2025.01.07 07:04 Doug Smythies wrote: > On 2025.01.07 03:26 Peter Zijlstra wrote: >> On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote: >> If I get a blimp (>10000) then it is always on the last CPU, are you >> seeing the same thing? > > More or less, yes. The very long migrations are dominated by the > CPU 5 to CPU 11 migration. >> >>> In this short example all captures were for the CPU 5 to 11 migration. >>> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds. >> >> This seems to suggest you are, always on CPU 11. >> >> Weird! > > Yes, weird. I think, but am not certain, the CPU sequence in turbostat > per interval loop is: > > Wake on highest numbered CPU (11 in my case) > Do a bunch of work that can be done without MSR reads. > For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case) > Do the CPU specific work > Finish the intervals work and printing and such on CPU 11. > Sleep for the interval time (we have been using 1 second) > > Without any proof, I was thinking the CPU 11 dominance > for the long migration issue was due to the other bits of > work done on that CPU. To test this theory I hacked turbostat to migrate to CPU 3 After the CPU specific work loop. So now the per interval workflow is: Wake on CPU 3 Do a bunch of work that can be done without MSR reads. For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case) Do the CPU specific work Migrate to CPU 3 Finish the intervals work and printing and such on CPU 3. 
Sleep for the interval time And now I get: usec Time_Of_Day_Seconds CPU Busy% IRQ 12646 1736266361.533240 3 99.76 1005 6004653 1736266384.555240 3 99.76 1006 6004653 1736266393.563240 3 99.76 1004 6005648 1736266400.570240 3 99.76 7019 6005653 1736266432.602240 3 99.76 1005 6003656 1736266479.652242 3 99.76 1004 15636 1736266501.690240 3 99.76 1005 4948651 1736266528.661240 3 99.76 1004 521672 1736266534.192240 2 99.76 1002 1117651 1736266585.360239 3 99.76 1004 6003652 1736266592.365240 3 99.76 2123 3526648 1736266612.909240 3 99.76 1004 6003650 1736266632.927240 3 99.76 1005 396623 1736266636.327239 10 99.76 1002 6003654 1736266660.349240 3 99.76 1005 6003653 1736266682.369239 3 99.76 1006 6003653 1736266703.388240 3 99.76 1004 514673 1736266718.918240 2 99.76 1003 14652 1736266725.940240 3 99.76 1004 6003653 1736266745.958240 3 99.76 1004 6003653 1736266767.978240 3 99.76 1006 6003652 1736266794.002240 3 99.76 1006 6003653 1736266815.021240 3 99.76 1004 2496651 1736266841.542239 3 99.76 1007 6003647 1736266848.547240 3 99.76 3504 <<< 8 minutes 7 seconds elapsed ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-07 11:26 ` Peter Zijlstra 2025-01-07 15:04 ` Doug Smythies @ 2025-01-07 19:23 ` Peter Zijlstra 2025-01-08 5:15 ` Doug Smythies 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2025-01-07 19:23 UTC (permalink / raw) To: Doug Smythies; +Cc: linux-kernel, vincent.guittot On Tue, Jan 07, 2025 at 12:26:07PM +0100, Peter Zijlstra wrote: > So I've taken the second node offline, running with 10 cores (20 > threads) now. > > usec Time_Of_Day_Seconds CPU Busy% IRQ > 106783 1736248404.951438 - 100.00 20119 > 46 1736248404.844701 0 100.00 1005 > 41 1736248404.844742 20 100.00 1007 > 42 1736248404.844784 1 100.00 1005 > 40 1736248404.844824 21 100.00 1006 > 41 1736248404.844865 2 100.00 1005 > 40 1736248404.844905 22 100.00 1006 > 41 1736248404.844946 3 100.00 1006 > 40 1736248404.844986 23 100.00 1005 > 41 1736248404.845027 4 100.00 1005 > 40 1736248404.845067 24 100.00 1006 > 41 1736248404.845108 5 100.00 1011 > 40 1736248404.845149 25 100.00 1005 > 41 1736248404.845190 6 100.00 1005 > 40 1736248404.845230 26 100.00 1005 > 42 1736248404.845272 7 100.00 1007 > 41 1736248404.845313 27 100.00 1005 > 41 1736248404.845355 8 100.00 1005 > 42 1736248404.845397 28 100.00 1006 > 46 1736248404.845443 9 100.00 1009 > 105995 1736248404.951438 29 100.00 1005 > > Is by far the worst I've had in the past few minutes playing with this. Much also depends on how much (or if at all) cgroups are used. What exact cgroup config are you having? /sys/kernel/debug/sched/debug should be able to tell you. ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-07 19:23 ` Peter Zijlstra @ 2025-01-08 5:15 ` Doug Smythies 2025-01-08 13:12 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Doug Smythies @ 2025-01-08 5:15 UTC (permalink / raw) To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies On 2025.01.07 11:24 Peter Zijlstra wrote: > On Tue, Jan 07, 2025 at 12:26:07PM +0100, Peter Zijlstra wrote: >> So I've taken the second node offline, running with 10 cores (20 >> threads) now. >> >> usec Time_Of_Day_Seconds CPU Busy% IRQ >> 106783 1736248404.951438 - 100.00 20119 >> 46 1736248404.844701 0 100.00 1005 >> 41 1736248404.844742 20 100.00 1007 >> 42 1736248404.844784 1 100.00 1005 >> 40 1736248404.844824 21 100.00 1006 >> 41 1736248404.844865 2 100.00 1005 >> 40 1736248404.844905 22 100.00 1006 >> 41 1736248404.844946 3 100.00 1006 >> 40 1736248404.844986 23 100.00 1005 >> 41 1736248404.845027 4 100.00 1005 >> 40 1736248404.845067 24 100.00 1006 >> 41 1736248404.845108 5 100.00 1011 >> 40 1736248404.845149 25 100.00 1005 >> 41 1736248404.845190 6 100.00 1005 >> 40 1736248404.845230 26 100.00 1005 >> 42 1736248404.845272 7 100.00 1007 >> 41 1736248404.845313 27 100.00 1005 >> 41 1736248404.845355 8 100.00 1005 >> 42 1736248404.845397 28 100.00 1006 >> 46 1736248404.845443 9 100.00 1009 >> 105995 1736248404.951438 29 100.00 1005 >> >> Is by far the worst I've had in the past few minutes playing with this. > > Much also depends on how much (or if at all) cgroups are used. > > What exact cgroup config are you having? /sys/kernel/debug/sched/debug > should be able to tell you. I do not know. I'll capture the above output, compress it, and send it to you. I did also boot with systemd.unified_cgroup_hierarchy=0 and it made no difference. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-08 5:15 ` Doug Smythies @ 2025-01-08 13:12 ` Peter Zijlstra 2025-01-08 15:48 ` Doug Smythies 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2025-01-08 13:12 UTC (permalink / raw) To: Doug Smythies; +Cc: linux-kernel, vincent.guittot I failed to realize the follow-up email was private, so duplicating that here again, but also new content :-) On Tue, Jan 07, 2025 at 09:15:59PM -0800, Doug Smythies wrote: > On 2025.01.07 11:24 Peter Zijlstra wrote: > > What exact cgroup config are you having? /sys/kernel/debug/sched/debug > > should be able to tell you. > > I do not know. > I'll capture the above output, compress it, and send it to you. > > I did also boot with systemd.unified_cgroup_hierarchy=0 > and it made no difference. I think you need: "cgroup_disable=cpu noautogroup" to fully disable all the cpu-cgroup muck. Anyway: $ zcat cgroup2.txt.gz | grep -e yes -e turbo | awk '{print $2 "\t" $16}' yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope turbostat /autogroup-286 yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope yes /user.slice/user-1000.slice/session-1.scope turbostat /autogroup-286 That matches the scenario where I could reproduce, two competing groups. I'm seeing wild vruntime divergence when this happens -- this is definitely wonky. Basically the turbostat group gets starved for a while while the yes group catches up. It looks like reweight_entity() is shooting out the cgroup entity to the right. 
So it builds up some negative lag (received surplus service) and then because turbostat goes to sleep for a second, its cgroup's share gets truncated to 2 and it shoots the cgroup entity out waaaaaaaay far. Thing is, waking up *should* fix that up again, but that doesn't appear to happen, leaving us up a creek. /me noodles a bit.... Does this help? diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c0e58e51801f..daa62cfa3092 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7000,6 +7063,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (flags & ENQUEUE_DELAYED) { requeue_delayed_entity(se); + se = se->parent; + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + update_load_avg(cfs_rq, se, UPDATE_TG); + se_update_runnable(se); + update_cfs_group(se); + } return; } ^ permalink raw reply related [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-08 13:12 ` Peter Zijlstra @ 2025-01-08 15:48 ` Doug Smythies 2025-01-09 10:59 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Doug Smythies @ 2025-01-08 15:48 UTC (permalink / raw) To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies On 2025.01.08 05:12 Peter Zijlstra wrote: > On Tue, Jan 07, 2025 at 09:15:59PM -0800, Doug Smythies wrote: >> On 2025.07.11:24 Peter Zijlstra wrote: > >>> What exact cgroup config are you having? /sys/kernel/debug/sched/debug >>> should be able to tell you. >> >> I do not know. >> I'll capture the above output, compress it, and send it to you. >> >> I did also boot with systemd.unified_cgroup_hierarchy=0 >> and it made no difference. > > I think you need: "cgroup_disable=cpu noautogroup" to fully disable all > the cpu-cgroup muck. Anyway: > > $ zcat cgroup2.txt.gz | grep -e yes -e turbo | awk '{print $2 "\t" $16}' > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > turbostat /autogroup-286 > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > yes /user.slice/user-1000.slice/session-1.scope > turbostat /autogroup-286 > > That matches the scenario where I could reproduce, two competing groups. > > I'm seeing wild vruntime divergence when this happens -- this is > definitely wonky. Basically the turbostat groups gets starved for a > while while the yes group catches up. > > It looks like reweight_entity() is shooting out the cgroup entity to the > right. 
> > So it builds up some negative lag (received surplus service) and then > because turbostat goes sleep for a second, it's cgroup's share gets > truncated to 2 and it shoots the cgroup entity out waaaaaaaay far. > > Thing is, waking up *should* fix that up again, but that doesn't appear > to happen, leaving us up a creek. > > /me noodles a bit.... > > Does this help? Sorry, but no it did not help. > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c0e58e51801f..daa62cfa3092 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7000,6 +7063,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > if (flags & ENQUEUE_DELAYED) { > requeue_delayed_entity(se); > + se = se->parent; > + for_each_sched_entity(se) { > + cfs_rq = cfs_rq_of(se); > + update_load_avg(cfs_rq, se, UPDATE_TG); > + se_update_runnable(se); > + update_cfs_group(se); > + } > return; > } ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-08 15:48 ` Doug Smythies @ 2025-01-09 10:59 ` Peter Zijlstra 2025-01-09 12:18 ` [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag tip-bot2 for Peter Zijlstra 2025-01-10 5:09 ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies 0 siblings, 2 replies; 51+ messages in thread From: Peter Zijlstra @ 2025-01-09 10:59 UTC (permalink / raw) To: Doug Smythies; +Cc: linux-kernel, vincent.guittot, Ingo Molnar, wuyun.abel On Wed, Jan 08, 2025 at 07:48:29AM -0800, Doug Smythies wrote: > > Does this help? > > Sorry, but no it did not help. Mooo :-( OK, new day, new chances though. I noticed this in my traces today: turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } -> { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } -> { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 } Which is a weight transition: 1048576 -> 2 -> 1048576. One would expect the lag to shoot out *AND* come back, notably: -1123*1048576/2 = -588775424 -588775424*2/1048576 = -1123 Except the trace shows it is all off. Worse, subsequent cycles shoot it out further and further. This made me have a very hard look at reweight_entity(), and specifically the ->on_rq case, which is more prominent with DELAY_DEQUEUE. And indeed, it is all sorts of broken. While the computation of the new lag is correct, the computation for the new vruntime, using the new lag is broken for it does not consider the logic set out in place_entity(). 
With the below patch, I now see things like: migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 } migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 } Which isn't perfect yet, but much closer. Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c0e58e51801f..b9575db5ecfe 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -689,21 +689,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq) * * XXX could add max_slice to the augmented data to track this. 
*/ -static s64 entity_lag(u64 avruntime, struct sched_entity *se) +static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) { s64 vlag, limit; - vlag = avruntime - se->vruntime; - limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); - - return clamp(vlag, -limit, limit); -} - -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) -{ SCHED_WARN_ON(!se->on_rq); - se->vlag = entity_lag(avg_vruntime(cfs_rq), se); + vlag = avg_vruntime(cfs_rq) - se->vruntime; + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); + + se->vlag = clamp(vlag, -limit, limit); } /* @@ -3770,137 +3765,32 @@ static inline void dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { } #endif -static void reweight_eevdf(struct sched_entity *se, u64 avruntime, - unsigned long weight) -{ - unsigned long old_weight = se->load.weight; - s64 vlag, vslice; - - /* - * VRUNTIME - * -------- - * - * COROLLARY #1: The virtual runtime of the entity needs to be - * adjusted if re-weight at !0-lag point. - * - * Proof: For contradiction assume this is not true, so we can - * re-weight without changing vruntime at !0-lag point. 
- * - * Weight VRuntime Avg-VRuntime - * before w v V - * after w' v' V' - * - * Since lag needs to be preserved through re-weight: - * - * lag = (V - v)*w = (V'- v')*w', where v = v' - * ==> V' = (V - v)*w/w' + v (1) - * - * Let W be the total weight of the entities before reweight, - * since V' is the new weighted average of entities: - * - * V' = (WV + w'v - wv) / (W + w' - w) (2) - * - * by using (1) & (2) we obtain: - * - * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v - * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v - * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v - * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3) - * - * Since we are doing at !0-lag point which means V != v, we - * can simplify (3): - * - * ==> W / (W + w' - w) = w / w' - * ==> Ww' = Ww + ww' - ww - * ==> W * (w' - w) = w * (w' - w) - * ==> W = w (re-weight indicates w' != w) - * - * So the cfs_rq contains only one entity, hence vruntime of - * the entity @v should always equal to the cfs_rq's weighted - * average vruntime @V, which means we will always re-weight - * at 0-lag point, thus breach assumption. Proof completed. - * - * - * COROLLARY #2: Re-weight does NOT affect weighted average - * vruntime of all the entities. - * - * Proof: According to corollary #1, Eq. (1) should be: - * - * (V - v)*w = (V' - v')*w' - * ==> v' = V' - (V - v)*w/w' (4) - * - * According to the weighted average formula, we have: - * - * V' = (WV - wv + w'v') / (W - w + w') - * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w') - * = (WV - wv + w'V' - Vw + wv) / (W - w + w') - * = (WV + w'V' - Vw) / (W - w + w') - * - * ==> V'*(W - w + w') = WV + w'V' - Vw - * ==> V' * (W - w) = (W - w) * V (5) - * - * If the entity is the only one in the cfs_rq, then reweight - * always occurs at 0-lag point, so V won't change. Or else - * there are other entities, hence W != w, then Eq. (5) turns - * into V' = V. So V won't change in either case, proof done. 
- * - * - * So according to corollary #1 & #2, the effect of re-weight - * on vruntime should be: - * - * v' = V' - (V - v) * w / w' (4) - * = V - (V - v) * w / w' - * = V - vl * w / w' - * = V - vl' - */ - if (avruntime != se->vruntime) { - vlag = entity_lag(avruntime, se); - vlag = div_s64(vlag * old_weight, weight); - se->vruntime = avruntime - vlag; - } - - /* - * DEADLINE - * -------- - * - * When the weight changes, the virtual time slope changes and - * we should adjust the relative virtual deadline accordingly. - * - * d' = v' + (d - v)*w/w' - * = V' - (V - v)*w/w' + (d - v)*w/w' - * = V - (V - v)*w/w' + (d - v)*w/w' - * = V + (d - V)*w/w' - */ - vslice = (s64)(se->deadline - avruntime); - vslice = div_s64(vslice * old_weight, weight); - se->deadline = avruntime + vslice; -} +static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags); static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, unsigned long weight) { bool curr = cfs_rq->curr == se; - u64 avruntime; if (se->on_rq) { /* commit outstanding execution time */ update_curr(cfs_rq); - avruntime = avg_vruntime(cfs_rq); + update_entity_lag(cfs_rq, se); + se->deadline -= se->vruntime; + se->rel_deadline = 1; if (!curr) __dequeue_entity(cfs_rq, se); update_load_sub(&cfs_rq->load, se->load.weight); } dequeue_load_avg(cfs_rq, se); - if (se->on_rq) { - reweight_eevdf(se, avruntime, weight); - } else { - /* - * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i), - * we need to scale se->vlag when w_i changes. - */ - se->vlag = div_s64(se->vlag * se->load.weight, weight); - } + /* + * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i), + * we need to scale se->vlag when w_i changes. 
+ */ + se->vlag = div_s64(se->vlag * se->load.weight, weight); + if (se->rel_deadline) + se->deadline = div_s64(se->deadline * se->load.weight, weight); update_load_set(&se->load, weight); @@ -3915,6 +3805,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, enqueue_load_avg(cfs_rq, se); if (se->on_rq) { update_load_add(&cfs_rq->load, se->load.weight); + place_entity(cfs_rq, se, 0); if (!curr) __enqueue_entity(cfs_rq, se); @@ -5355,7 +5246,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) se->vruntime = vruntime - lag; - if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) { + if (se->rel_deadline) { se->deadline += se->vruntime; se->rel_deadline = 0; return; ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag 2025-01-09 10:59 ` Peter Zijlstra @ 2025-01-09 12:18 ` tip-bot2 for Peter Zijlstra 2025-04-17 9:56 ` Alexander Egorenkov 2025-01-10 5:09 ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies 1 sibling, 1 reply; 51+ messages in thread From: tip-bot2 for Peter Zijlstra @ 2025-01-09 12:18 UTC (permalink / raw) To: linux-tip-commits Cc: Doug Smythies, Ingo Molnar, Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the sched/urgent branch of tip: Commit-ID: 6d71a9c6160479899ee744d2c6d6602a191deb1f Gitweb: https://git.kernel.org/tip/6d71a9c6160479899ee744d2c6d6602a191deb1f Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 09 Jan 2025 11:59:59 +01:00 Committer: Ingo Molnar <mingo@kernel.org> CommitterDate: Thu, 09 Jan 2025 12:55:27 +01:00 sched/fair: Fix EEVDF entity placement bug causing scheduling lag I noticed this in my traces today: turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } -> { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } -> { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 } Which is a weight transition: 1048576 -> 2 -> 1048576. One would expect the lag to shoot out *AND* come back, notably: -1123*1048576/2 = -588775424 -588775424*2/1048576 = -1123 Except the trace shows it is all off. Worse, subsequent cycles shoot it out further and further. 
This made me have a very hard look at reweight_entity(), and specifically the ->on_rq case, which is more prominent with DELAY_DEQUEUE. And indeed, it is all sorts of broken. While the computation of the new lag is correct, the computation for the new vruntime, using the new lag is broken for it does not consider the logic set out in place_entity(). With the below patch, I now see things like: migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 } migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 } Which isn't perfect yet, but much closer. Reported-by: Doug Smythies <dsmythies@telus.net> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net --- kernel/sched/fair.c | 145 +++++-------------------------------------- 1 file changed, 18 insertions(+), 127 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3e9ca38..eeed8e3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -689,21 +689,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq) * * XXX could add max_slice to the augmented data to track this. 
*/ -static s64 entity_lag(u64 avruntime, struct sched_entity *se) +static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) { s64 vlag, limit; - vlag = avruntime - se->vruntime; - limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); - - return clamp(vlag, -limit, limit); -} - -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) -{ SCHED_WARN_ON(!se->on_rq); - se->vlag = entity_lag(avg_vruntime(cfs_rq), se); + vlag = avg_vruntime(cfs_rq) - se->vruntime; + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); + + se->vlag = clamp(vlag, -limit, limit); } /* @@ -3774,137 +3769,32 @@ static inline void dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { } #endif -static void reweight_eevdf(struct sched_entity *se, u64 avruntime, - unsigned long weight) -{ - unsigned long old_weight = se->load.weight; - s64 vlag, vslice; - - /* - * VRUNTIME - * -------- - * - * COROLLARY #1: The virtual runtime of the entity needs to be - * adjusted if re-weight at !0-lag point. - * - * Proof: For contradiction assume this is not true, so we can - * re-weight without changing vruntime at !0-lag point. 
- * - * Weight VRuntime Avg-VRuntime - * before w v V - * after w' v' V' - * - * Since lag needs to be preserved through re-weight: - * - * lag = (V - v)*w = (V'- v')*w', where v = v' - * ==> V' = (V - v)*w/w' + v (1) - * - * Let W be the total weight of the entities before reweight, - * since V' is the new weighted average of entities: - * - * V' = (WV + w'v - wv) / (W + w' - w) (2) - * - * by using (1) & (2) we obtain: - * - * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v - * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v - * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v - * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3) - * - * Since we are doing at !0-lag point which means V != v, we - * can simplify (3): - * - * ==> W / (W + w' - w) = w / w' - * ==> Ww' = Ww + ww' - ww - * ==> W * (w' - w) = w * (w' - w) - * ==> W = w (re-weight indicates w' != w) - * - * So the cfs_rq contains only one entity, hence vruntime of - * the entity @v should always equal to the cfs_rq's weighted - * average vruntime @V, which means we will always re-weight - * at 0-lag point, thus breach assumption. Proof completed. - * - * - * COROLLARY #2: Re-weight does NOT affect weighted average - * vruntime of all the entities. - * - * Proof: According to corollary #1, Eq. (1) should be: - * - * (V - v)*w = (V' - v')*w' - * ==> v' = V' - (V - v)*w/w' (4) - * - * According to the weighted average formula, we have: - * - * V' = (WV - wv + w'v') / (W - w + w') - * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w') - * = (WV - wv + w'V' - Vw + wv) / (W - w + w') - * = (WV + w'V' - Vw) / (W - w + w') - * - * ==> V'*(W - w + w') = WV + w'V' - Vw - * ==> V' * (W - w) = (W - w) * V (5) - * - * If the entity is the only one in the cfs_rq, then reweight - * always occurs at 0-lag point, so V won't change. Or else - * there are other entities, hence W != w, then Eq. (5) turns - * into V' = V. So V won't change in either case, proof done. 
- * - * - * So according to corollary #1 & #2, the effect of re-weight - * on vruntime should be: - * - * v' = V' - (V - v) * w / w' (4) - * = V - (V - v) * w / w' - * = V - vl * w / w' - * = V - vl' - */ - if (avruntime != se->vruntime) { - vlag = entity_lag(avruntime, se); - vlag = div_s64(vlag * old_weight, weight); - se->vruntime = avruntime - vlag; - } - - /* - * DEADLINE - * -------- - * - * When the weight changes, the virtual time slope changes and - * we should adjust the relative virtual deadline accordingly. - * - * d' = v' + (d - v)*w/w' - * = V' - (V - v)*w/w' + (d - v)*w/w' - * = V - (V - v)*w/w' + (d - v)*w/w' - * = V + (d - V)*w/w' - */ - vslice = (s64)(se->deadline - avruntime); - vslice = div_s64(vslice * old_weight, weight); - se->deadline = avruntime + vslice; -} +static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags); static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, unsigned long weight) { bool curr = cfs_rq->curr == se; - u64 avruntime; if (se->on_rq) { /* commit outstanding execution time */ update_curr(cfs_rq); - avruntime = avg_vruntime(cfs_rq); + update_entity_lag(cfs_rq, se); + se->deadline -= se->vruntime; + se->rel_deadline = 1; if (!curr) __dequeue_entity(cfs_rq, se); update_load_sub(&cfs_rq->load, se->load.weight); } dequeue_load_avg(cfs_rq, se); - if (se->on_rq) { - reweight_eevdf(se, avruntime, weight); - } else { - /* - * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i), - * we need to scale se->vlag when w_i changes. - */ - se->vlag = div_s64(se->vlag * se->load.weight, weight); - } + /* + * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i), + * we need to scale se->vlag when w_i changes. 
+ */ + se->vlag = div_s64(se->vlag * se->load.weight, weight); + if (se->rel_deadline) + se->deadline = div_s64(se->deadline * se->load.weight, weight); update_load_set(&se->load, weight); @@ -3919,6 +3809,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, enqueue_load_avg(cfs_rq, se); if (se->on_rq) { update_load_add(&cfs_rq->load, se->load.weight); + place_entity(cfs_rq, se, 0); if (!curr) __enqueue_entity(cfs_rq, se); @@ -5359,7 +5250,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) se->vruntime = vruntime - lag; - if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) { + if (se->rel_deadline) { se->deadline += se->vruntime; se->rel_deadline = 0; return; ^ permalink raw reply related [flat|nested] 51+ messages in thread
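The new scheme in reweight_entity() can be summarized arithmetically: take the lag and the relative deadline at the old weight, scale both by w/w' so that the weighted lag w*vlag is preserved, then re-place the entity. A simplified Python model follows — it ignores the clamp in update_entity_lag() and the extra lag scaling place_entity() applies, and the dict fields merely mirror the kernel struct for illustration:

```python
def div_s64(dividend, divisor):
    # Signed division truncating toward zero, like the kernel's div_s64().
    q = abs(dividend) // abs(divisor)
    return -q if (dividend < 0) != (divisor < 0) else q

def reweight(se, new_weight, avg_vruntime):
    # Simplified model of the patched reweight_entity(): record the lag and a
    # relative deadline, scale both by w/w', then re-place the entity the way
    # place_entity() would for the se->rel_deadline case.
    se["vlag"] = avg_vruntime - se["vruntime"]       # update_entity_lag()
    rel_deadline = se["deadline"] - se["vruntime"]   # se->rel_deadline = 1
    se["vlag"] = div_s64(se["vlag"] * se["weight"], new_weight)
    rel_deadline = div_s64(rel_deadline * se["weight"], new_weight)
    se["weight"] = new_weight
    se["vruntime"] = avg_vruntime - se["vlag"]       # place_entity()
    se["deadline"] = se["vruntime"] + rel_deadline

# Numbers from the first trace line: weight 1048576 -> 2 at lag -1123.
se = {"weight": 1048576, "vruntime": 3184159640194,
      "deadline": 3184162621107, "vlag": 0}
V = 3184159639071
reweight(se, 2, V)
print(se["vlag"])  # -588775424
```

With the trace's numbers, the 1048576 -> 2 reweight turns a -1123 lag into -588775424 while keeping w*vlag constant, which is exactly the excursion computed by hand in the commit message.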
* Re: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-01-09 12:18 ` [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag tip-bot2 for Peter Zijlstra
@ 2025-04-17 9:56 ` Alexander Egorenkov
  2025-04-22 5:40   ` ll"RE: " Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Egorenkov @ 2025-04-17 9:56 UTC (permalink / raw)
To: tip-bot2; +Cc: dsmythies, linux-kernel, linux-tip-commits, mingo, peterz, x86

Hi Peter,

after this change, we are seeing big latencies when trying to execute a
simple command via SSH on a Fedora 41 s390x remote system which is under
stress.

I was able to bisect the problem to this commit.

The problem is easy to reproduce with stress-ng executed on the remote
system, which is otherwise unoccupied, and concurrent SSH connect attempts
from a local system to the remote one.

stress-ng (on remote system)
----------------------------

$ cpus=$(nproc)
$ stress-ng --cpu $((cpus * 2)) --matrix 50 --mq 50 --aggressive --brk 2 --stack 2 --bigheap 2 --userfaultfd 0 --perf -t 5m

SSH connect attempts (from local to remote system)
--------------------------------------------------

$ ssh_options=(
    -o UserKnownHostsFile=/dev/null
    -o StrictHostKeyChecking=no
    -o LogLevel=ERROR
    -o ConnectTimeout=10
    -o TCPKeepAlive=yes
    -o ServerAliveInterval=10
    -o PreferredAuthentications=publickey
    -o PubkeyAuthentication=yes
    -o BatchMode=yes
    -o ForwardX11=no
    -A
)
$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done

========
My tests
========

commit v6.12
------------

$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.01s user 0.00s system 1% cpu 0.919 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 9% cpu 0.068 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 8% cpu 0.069 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 6% cpu 0.092 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 6% cpu 0.097 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 5% cpu 0.109 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 7% cpu 0.083 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 7% cpu 0.079 total
ssh "${ssh_options[@]}" ciuser@a8345039 true  0.00s user 0.00s system 11% cpu 0.054 total

commit 6d71a9c6160479899ee744d2c6d6602a191deb1f
-----------------------------------------------

$ while true; do time ssh "${ssh_options[@]}" root@remote-system true; sleep 2; done
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.01s user 0.00s system 0% cpu 33.379 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 1.206 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 2.388 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 9% cpu 0.055 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 2.376 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 2% cpu 0.243 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 11% cpu 0.049 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 0% cpu 2.563 total
ssh "${ssh_options[@]}" ciuser@a8345034 true  0.00s user 0.00s system 8% cpu 0.065 total

Thank you
Regards
Alex

^ permalink raw reply	[flat|nested] 51+ messages in thread
* ll"RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-04-17 9:56 ` Alexander Egorenkov
@ 2025-04-22 5:40 ` Doug Smythies
  2025-04-24 7:56   ` Alexander Egorenkov
  0 siblings, 1 reply; 51+ messages in thread
From: Doug Smythies @ 2025-04-22 5:40 UTC (permalink / raw)
To: 'Alexander Egorenkov', tip-bot2
Cc: linux-kernel, linux-tip-commits, mingo, peterz, x86, Doug Smythies

On 2025.04.17 02:57 Alexander Egorenkov wrote:
> Hi Peter,
>
> after this change, we are seeing big latencies when trying to execute a
> simple command per SSH on a Fedora 41 s390x remote system which is under
> stress.
>
> I was able to bisect the problem to this commit.
>
> The problem is easy to reproduce with stress-ng executed on the remote
> system which is otherwise unoccupied and concurrent SSH connect attempts
> from a local system to the remote one.
>
> stress-ng (on remote system)
> ----------------------------
>
> $ cpus=$(nproc)
> $ stress-ng --cpu $((cpus * 2)) --matrix 50 --mq 50 --aggressive --brk 2
> --stack 2 --bigheap 2 --userfaultfd 0 --perf -t 5m

That is a very very stressful test. It crashes within a few seconds on my
test computer, with a "Segmentation fault (core dumped)" message.

If I back it off to this:

$ stress-ng --cpu 24 --matrix 50 --mq 50 --aggressive --brk 2 --stack 2 --bigheap 2 -t 300m

it runs, but still makes a great many entries in /var/log/kern.log as the
oom killer runs etc. I am suggesting it is not a reasonable test workload.

Anyway, I used turbostat the same way I was using it back in January for
this work, and did observe longer than requested intervals. I took 1427
samples and got 10 where the interval time was more than 1 second longer
than requested. The worst was 7.5 seconds longer than requested.

I rechecked the 100% workload used in January (12X "yes > /dev/null") and
it was fine: 3551 samples, and the actual interval was never more than 10
milliseconds longer than requested.
Kernel 6.15-rc2
Turbostat version: 2025.04.06
Turbostat sample interval: 2 seconds.
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (12 CPU, 6 cores)

... Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: ll"RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
  2025-04-22 5:40 ` ll"RE: " Doug Smythies
@ 2025-04-24 7:56 ` Alexander Egorenkov
  2025-04-26 15:09   ` Doug Smythies
  0 siblings, 1 reply; 51+ messages in thread
From: Alexander Egorenkov @ 2025-04-24 7:56 UTC (permalink / raw)
To: Doug Smythies, tip-bot2
Cc: linux-kernel, linux-tip-commits, mingo, peterz, x86, Doug Smythies

Hi all,

> That is a very very stressful test. It crashes within a few seconds on my test computer,
> with a "Segmentation fault (core dumped)" message.

Yes, this is an artificial test I came up with to demonstrate the problem
we have with another, more realistic test which I can hardly use here for
the sake of demonstration. But it reveals the exact same problem we have
with our CI test on s390x test systems.

Let me explain shortly how it happens.

Basically, we have a test system where we execute a test suite and
simultaneously monitor this system from another system via simple SSH
logins (invoked approximately every 15 seconds) to check whether the test
system is still online, and automatically dump it if it remains
unresponsive for 5m straight. We limit every such SSH login to 10 seconds,
just to make our monitoring robust, because we had situations where SSH
sometimes hung for a long time due to various problems with networking,
the test system itself, etc.

And since the commit "sched/fair: Fix EEVDF entity placement bug causing
scheduling lag" we regularly see SSH logins (limited to 10s) failing for
5m straight; not a single SSH login succeeds. This happens regularly with
test suites which compile software with GCC and use all CPUs at 100%.
Before the commit, an SSH login required under 1 second. I cannot judge
whether the problem is really in this commit, or whether it is just an
accumulated effect after multiple ones.

FYI:
One such system where it happens regularly has 7 cores (5.2GHz, SMT 2x,
14 CPUs) and 8G of main memory with 20G of swap.
Thanks Regards Alex ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag 2025-04-24 7:56 ` Alexander Egorenkov @ 2025-04-26 15:09 ` Doug Smythies 0 siblings, 0 replies; 51+ messages in thread From: Doug Smythies @ 2025-04-26 15:09 UTC (permalink / raw) To: 'Alexander Egorenkov', peterz Cc: linux-kernel, mingo, x86, Doug Smythies Hi Alexander, Thank you for your reply. Note that I have adjusted the address list for this email, because I don't know if bots can get emails, and Peter was not on the "To" line, and might not have noticed this thread. @Peter : Off-list I will forward you the other emails, in case you missed them. I apologise if you did see them but haven't had time to get to them or whatever. Also note that I know nothing about the scheduler and was only on the original email because I had a "Reported-by" tag. On 2025.04.24 00:57 Alexander Egorenkov wrote: > Hi all, [Doug wrote] >> That is a very very stressful test. It crashes within a few seconds on my test computer, >> with a " Segmentation fault (core dumped)" message. > > Yes, this is an artificial test i came up with to demonstrate the > problem we have with another realistic test which i can hardly > use here for the sake of demonstration. But it reveals the exact > same problem we have with our CI test on s390x test systems. > > Let me explain shortly how it happens. > > Basically, we have a test system where we execute a test suite and > simultaneously monitor this system on another system via simple SSH > logins (approximately invoked every 15 seconds) whether the test system > is still online and dump automatically if it remains unresponsive for > 5m straight. We limit every such SSH login to 10 seconds because > we had situations where SSH sometimes hanged for a long time due to > various problems with networking, test system itself etc., just to make > our monitoring robust. 
>
> And since the commit "sched/fair: Fix EEVDF entity placement bug causing
> scheduling lag" we regularly see SSH logins (limited to 10s) failing for
> 5m straight, not a single SSH login succeeds. This happens regularly
> with test suites which compile software with GCC and use all CPUs
> at 100%. Before the commit, a SSH login required under 1 second.
> I cannot judge whether the problem really in this commit, or it is just an
> accumulated effect after multiple ones.
>
> FYI:
> One such system where it happens regularly has 7 cores (5.2Ghz SMT 2x, 14 cpus)
> and 8G of main memory with 20G of swap.
>
> Thanks
> Regards
> Alex

Thanks for the explanation.

I have recreated your situation with a workflow that, while it stresses
the CPUs, doesn't make any entries in /var/log/kern.log or
/var/log/syslog. Under the same conditions, I have confirmed that the ssh
login lag doesn't occur with kernel 6.12, but does with kernel 6.13.

My workflow is stuff I have used for many years and wrote myself.
Basically, I create a huge queue of running tasks, with each doing a
little work and then sleeping for a short period. I have 2 methods to
achieve a similar overall workflow, and one shows the issue and one does
not. I can also create a huge queue by just increasing the number of
"yes" tasks to a ridiculous number, but that does not show your ssh login
lag issue.

Anyway, for the workflow that does show your issue, I had a load average
of about 19,500 (20,000 tasks) and ssh login times ranged from 38 down to
10 seconds, with an average of about 13 seconds. ssh login times using
kernel 6.12 were negligible.

... Doug

^ permalink raw reply	[flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-09 10:59 ` Peter Zijlstra 2025-01-09 12:18 ` [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag tip-bot2 for Peter Zijlstra @ 2025-01-10 5:09 ` Doug Smythies 2025-01-10 11:57 ` Peter Zijlstra 2025-01-12 19:59 ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies 1 sibling, 2 replies; 51+ messages in thread From: Doug Smythies @ 2025-01-10 5:09 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies [-- Attachment #1: Type: text/plain, Size: 3138 bytes --] Hi Peter, Thanks for all your hard work on this. On 2025.01.09 03:00 Peter Zijlstra wrote: ... > This made me have a very hard look at reweight_entity(), and > specifically the ->on_rq case, which is more prominent with > DELAY_DEQUEUE. > > And indeed, it is all sorts of broken. While the computation of the new > lag is correct, the computation for the new vruntime, using the new lag > is broken for it does not consider the logic set out in place_entity(). > > With the below patch, I now see things like: > > migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) > { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -> > { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 } > migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) > { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -> > { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 } > > Which isn't perfect yet, but much closer. Agreed. I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted. 
It still compares to the "b12" kernel (the last good one in the kernel
bisection). It was a 2 hour and 31 minute duration test, and the maximum
CPU migration time was 24 milliseconds, versus 6 seconds without the
patch. I left things running for many hours and will let it continue
overnight. There seems to have been an issue at one spot in time:

  usec	Time_Of_Day_Seconds	CPU	Busy%	IRQ
488994	1736476550.732222	-	99.76	12889
488520	1736476550.732222	11	99.76	1012
960999	1736476552.694222	-	99.76	17922
960587	1736476552.694222	11	99.76	1493
914999	1736476554.610222	-	99.76	23579
914597	1736476554.610222	11	99.76	1962
809999	1736476556.421222	-	99.76	23134
809598	1736476556.421222	11	99.76	1917
770998	1736476558.193221	-	99.76	21757
770603	1736476558.193221	11	99.76	1811
726999	1736476559.921222	-	99.76	21294
726600	1736476559.921222	11	99.76	1772
686998	1736476561.609221	-	99.76	20801
686600	1736476561.609221	11	99.76	1731
650998	1736476563.261221	-	99.76	20280
650601	1736476563.261221	11	99.76	1688
610998	1736476564.873221	-	99.76	19857
610606	1736476564.873221	11	99.76	1653

I had one of these the other day also, but they were all 6 seconds.
It's like a burst of problematic data. I have the data somewhere, and can
try to find it tomorrow.

>
> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

...

[-- Attachment #2: turbostat-sampling-issue-fixed-seconds.png --]
[-- Type: image/png, Size: 62449 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread
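The overshoot in Doug's table can be cross-checked from the Time_Of_Day_Seconds column. A small sketch, with the summary-row timestamps copied from the table:

```python
timestamps = [
    1736476550.732222, 1736476552.694222, 1736476554.610222,
    1736476556.421222, 1736476558.193221, 1736476559.921222,
    1736476561.609221, 1736476563.261221, 1736476564.873221,
]

# Deltas between consecutive samples of a requested 1-second interval;
# anything well over 1.0 s is overshoot.
deltas = [round(b - a, 6) for a, b in zip(timestamps, timestamps[1:])]
print(deltas)
```

Each delta exceeds the requested 1 second by roughly the value in the usec column (to within a couple of milliseconds), and the overshoot decays smoothly across the burst rather than stopping abruptly.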
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-10 5:09 ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies @ 2025-01-10 11:57 ` Peter Zijlstra 2025-01-12 23:14 ` Doug Smythies 2025-01-13 12:58 ` [tip: sched/urgent] sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE tip-bot2 for Peter Zijlstra 2025-01-12 19:59 ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies 1 sibling, 2 replies; 51+ messages in thread From: Peter Zijlstra @ 2025-01-10 11:57 UTC (permalink / raw) To: Doug Smythies Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote: > Hi Peter, > > Thanks for all your hard work on this. > > On 2025.01.09 03:00 Peter Zijlstra wrote: > > ... > > > This made me have a very hard look at reweight_entity(), and > > specifically the ->on_rq case, which is more prominent with > > DELAY_DEQUEUE. > > > > And indeed, it is all sorts of broken. While the computation of the new > > lag is correct, the computation for the new vruntime, using the new lag > > is broken for it does not consider the logic set out in place_entity(). > > > > With the below patch, I now see things like: > > > > migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) > > { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 > } -> > > { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 > } > > migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) > > { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: > 6316614641111 } -> > > { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 > } > > > > Which isn't perfect yet, but much closer. > > Agreed. > I tested the patch. 
Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted. > It still compares to the "b12" kernel (the last good one in the kernel bisection). > It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds, > verses 6 seconds without the patch. Progress! > I left things running for many hours and will let it continue overnight. > There seems to have been an issue at one spot in time: Right, so by happy accident I also left mine running overnight and I think I caught one of those weird spikes. It took a bit of staring to figure out what went wrong this time, but what I now think is the thing that sets off the chain of fail is a combination of DELAY_DEQUEUE and the cgroup reweight stuff. Specifically, when a cgroup's CPU queue becomes empty, calc_group_shares() will drop its weight to the floor. Now, normally, when a queue goes empty, it gets dequeued from its parent and its weight is immaterial. However, with DELAY_DEQUEUE the thing will stick around for a while -- at a very low weight. What makes this esp. troublesome is that even for an active cgroup (like the 'yes' group) the per-cpu weight will be relatively small (~1/nr_cpus like). Worse, the avg_vruntime thing uses load_scale_down() because u64 just isn't all that big :/ (if only all 64bit machines could do 128bit divisions as cheaply as x86_64) This means that on a 16 CPU machine, the weight of a 'normal' all busy queue will be 64, and the weight of an empty queue will be 2, which means the effect of the ginormous lag on the avg_vruntime is fairly significant, pushing it wildly off balance and affecting placement of new tasks. Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue competition as the entity was. As such, we should not adjust the weight of an empty queue. I've started a new run, and some 15 minutes of runtime show nothing interesting, I'll have to let it run for a while again. 
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..93644b3983d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long shares;
 
-	if (!gcfs_rq)
+	/*
+	 * When a group becomes empty, preserve its weight. This matters for
+	 * DELAY_DEQUEUE.
+	 */
+	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
 	if (throttled_hierarchy(gcfs_rq))

^ permalink raw reply related	[flat|nested] 51+ messages in thread
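The mechanism Peter describes can be sketched with a toy model: a group entity's shares are roughly its queue's fraction of the task group's load, floored at a minimum, and the fix simply skips the reweight while the queue is empty. MIN_SHARES matches the kernel constant, but the shares arithmetic here is deliberately simplified — the real calc_group_shares() in kernel/sched/fair.c is more involved:

```python
MIN_SHARES = 2

def calc_group_shares(tg_shares, grq_load, tg_load):
    # Simplified: the per-CPU queue's share of the task group's total load,
    # floored at MIN_SHARES. An empty queue (grq_load == 0) hits the floor.
    if tg_load == 0:
        return MIN_SHARES
    return max(MIN_SHARES, tg_shares * grq_load // tg_load)

def update_cfs_group(se, grq, tg, skip_empty):
    # The fix: preserve the previous weight while the queue is empty, so a
    # DELAY_DEQUEUE'd entity keeps competing as it was.
    if skip_empty and grq["load"] == 0:
        return
    se["weight"] = calc_group_shares(tg["shares"], grq["load"], tg["load"])

tg = {"shares": 1024, "load": 16 * 1024}   # e.g. 16 equally busy CPUs
grq = {"load": 1024}
se = {"weight": None}
update_cfs_group(se, grq, tg, skip_empty=True)
print(se["weight"])                        # 64: ~1/nr_cpus of the shares

grq["load"] = 0                            # queue empties, entity delayed
old = se["weight"]
update_cfs_group(se, grq, tg, skip_empty=False)
floored = se["weight"]                     # without the fix: drops to 2
se["weight"] = old
update_cfs_group(se, grq, tg, skip_empty=True)
print(floored, se["weight"])               # 2 64
```

With 16 busy CPUs this reproduces the 64-versus-2 weights Peter quotes, which is why a delay-dequeued empty queue carrying a large lag can distort avg_vruntime so badly.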
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-10 11:57 ` Peter Zijlstra @ 2025-01-12 23:14 ` Doug Smythies 2025-01-13 11:03 ` Peter Zijlstra 2025-01-13 11:05 ` Peter Zijlstra 2025-01-13 12:58 ` [tip: sched/urgent] sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE tip-bot2 for Peter Zijlstra 1 sibling, 2 replies; 51+ messages in thread From: Doug Smythies @ 2025-01-12 23:14 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies [-- Attachment #1: Type: text/plain, Size: 7364 bytes --] On 2025.01.10 03:57 Peter Zijlstra wrote: > On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote: >> On 2025.01.09 03:00 Peter Zijlstra wrote: >> >> ... >> >>> This made me have a very hard look at reweight_entity(), and >>> specifically the ->on_rq case, which is more prominent with >>> DELAY_DEQUEUE. >>> >>> And indeed, it is all sorts of broken. While the computation of the new >>> lag is correct, the computation for the new vruntime, using the new lag >>> is broken for it does not consider the logic set out in place_entity(). >>> >>> With the below patch, I now see things like: >>> >>> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) >>> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 >> } -> >>> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 >> } >>> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) >>> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: >> 6316614641111 } -> >>> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 >> } >>> >>> Which isn't perfect yet, but much closer. >> >> Agreed. >> I tested the patch. 
Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted. >> It still compares to the "b12" kernel (the last good one in the kernel bisection). >> It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds, >> verses 6 seconds without the patch. > > Progress! > >> I left things running for many hours and will let it continue overnight. >> There seems to have been an issue at one spot in time: > > Right, so by happy accident I also left mine running overnight and I > think I caught one of those weird spikes. It took a bit of staring to > figure out what went wrong this time, but what I now think is the thing > that sets off the chain of fail is a combination of DELAY_DEQUEUE and > the cgroup reweight stuff. > > Specifically, when a cgroup's CPU queue becomes empty, > calc_group_shares() will drop its weight to the floor. > > Now, normally, when a queue goes empty, it gets dequeued from its > parent and its weight is immaterial. However, with DELAY_DEQUEUE the > thing will stick around for a while -- at a very low weight. > > What makes this esp. troublesome is that even for an active cgroup (like > the 'yes' group) the per-cpu weight will be relatively small (~1/nr_cpus > like). Worse, the avg_vruntime thing uses load_scale_down() because u64 > just isn't all that big :/ > > (if only all 64bit machines could do 128bit divisions as cheaply as x86_64) > > This means that on a 16 CPU machine, the weight of a 'normal' all busy > queue will be 64, and the weight of an empty queue will be 2, which means > the effect of the ginormous lag on the avg_vruntime is fairly > significant, pushing it wildly off balance and affecting placement of > new tasks. > > Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue > competition as the entity was. > > As such, we should not adjust the weight of an empty queue. 
> > I've started a new run, and some 15 minutes of runtime show nothing > interesting, I'll have to let it run for a while again. > --- > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 3e9ca38512de..93644b3983d4 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se) > struct cfs_rq *gcfs_rq = group_cfs_rq(se); > long shares; > > - if (!gcfs_rq) > + /* > + * When a group becomes empty, preserve its weight. This matters for > + * DELAY_DEQUEUE. > + */ > + if (!gcfs_rq || !gcfs_rq->load.weight) > return; > > if (throttled_hierarchy(gcfs_rq)) I tested the above patch on top of the previous patch. Multiple tests and multiple methods over many hours and I never got any hit at all for a detected CPU migration greater than or equal to 10 milliseconds. Which is good news. The test I have been running to create some of the graphs I have been attaching is a little different, using turbostat with different options: turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1 And with this test I get intervals over 1 second by over 10 milliseconds. (I referred to this observation in the previous email.). I attach a new version of a previous graph, with the data from this "patch 2" added. 
Also, below are the same results presented as histograms (some repeated
from my previous email):

First, the b12 kernel (the last good one in the kernel bisection):

Time      Occurrences
1.000000, 3282
1.001000, 1826
1.002000, 227
1.003000, 1852
1.004000, 1036
1.005000, 731
1.006000, 75
1.007000, 30
1.008000, 9
1.009000, 2
1.010000, 1
1.011000, 1

Total: 9072 : Total >= 10 mSec: 2 ( 0.02 percent)

Second, kernel 6.13-rc6 + the first patch:

Time      Occurrences
1.000000, 1274
1.001000, 1474
1.002000, 512
1.003000, 3201
1.004000, 849
1.005000, 593
1.006000, 246
1.007000, 104
1.008000, 36
1.009000, 15
1.010000, 19
1.011000, 16
1.012000, 11
1.013000, 27
1.014000, 26
1.015000, 35
1.016000, 105
1.017000, 85
1.018000, 135
1.019000, 283
1.020000, 17
1.021000, 4
1.022000, 3
1.023000, 1

Total: 9071 : Total >= 10 mSec: 767 ( 8.46 percent)

Third, kernel 6.13-rc6 + the first patch + the above patch:

1.000000, 2034
1.001000, 2108
1.002000, 2030
1.003000, 2492
1.004000, 216
1.005000, 109
1.006000, 23
1.007000, 8
1.008000, 3
1.009000, 9
1.010000, 1
1.011000, 2
1.012000, 2
1.014000, 3
1.015000, 10
1.016000, 19
1.017000, 1
1.018000, 1

Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)

Where, and for example, this line:

1.016000, 19

means that there were 19 occurrences of turbostat interval times between
1.016 and 1.016999 seconds.

As mentioned earlier, I haven't been able to obtain any detailed
information as to where any extra delay is occurring, in particular
whether it is related to CPU migration or not. I am assuming the system
under test is perturbed just enough by the slight difference in the test
that the longer times do not occur.

As a side note, and just for informational purposes: the changes do have
an effect on performance. Using 40 pairs of ping-pong token passing rings
running 30 million loops I get results as below.

Where:
  nodelay = NO_DELAY_DEQUEUE
  delay   = DELAY_DEQUEUE
  noauto  = "cgroup_disable=cpu noautogroup" on the grub command line
  auto    = nothing cgroup related on the grub command line.
All tests done with the CPU frequency scaling governor = performance. (lesser numbers are better) noauto nodelay test 1 16.24 uSec/loop reference test noauto nodelay test 2 16.22 uSec/loop -0.15% noauto delay test 1 15.41 uSec/loop -5.16% auto nodelay test 1 24.31 uSec/loop +49.88% auto nodelay test 2 24.31 uSec/loop +49.24% auto delay test 1 21.42 uSec/loop +21.42% auto delay test 2 21.71 uSec/loop +21.71% A graph is also attached showing the same results, but for each of the 40 pairs. Reveals differences in relative executions times from first to last ping-pong pair. [-- Attachment #2: turbostat-sampling-issue-fixed--2-seconds.png --] [-- Type: image/png, Size: 78967 bytes --] [-- Attachment #3: 40-ping-pairs-compare.png --] [-- Type: image/png, Size: 54905 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-12 23:14 ` Doug Smythies @ 2025-01-13 11:03 ` Peter Zijlstra 2025-01-14 10:58 ` Peter Zijlstra 2025-01-13 11:05 ` Peter Zijlstra 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2025-01-13 11:03 UTC (permalink / raw) To: Doug Smythies Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote: > I tested the above patch on top of the previous patch. That was indeed the intention. > Multiple tests and multiple methods over many hours and > I never got any hit at all for a detected CPU migration greater than or > equal to 10 milliseconds. > Which is good news. Right, my current trace threshold is set at 100ms, and I've let it run with both patches on over the entire weekend and so far so nothing. So definitely progress. > The test I have been running to create some of the graphs I have been > attaching is a little different, using turbostat with different options: > > turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1 > > And with this test I get intervals over 1 second by over 10 milliseconds. > (I referred to this observation in the previous email.). OK, almost but not quite there it seems. > Third: Kernel 6.13-rc6+the first patch+the above patch: > > 1.000000, 2034 > 1.001000, 2108 > 1.002000, 2030 > 1.003000, 2492 > 1.004000, 216 > 1.005000, 109 > 1.006000, 23 > 1.007000, 8 > 1.008000, 3 > 1.009000, 9 > 1.010000, 1 > 1.011000, 2 > 1.012000, 2 > 1.014000, 3 > 1.015000, 10 > 1.016000, 19 > 1.017000, 1 > 1.018000, 1 > > Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent) > > Where, and for example, this line: > > 1.016000, 19 > > means that there were 19 occurrences of turbostat interval times > between 1.016 and 1.016999 seconds. 
OK, let me lower my threshold to 10ms and change the turbostat invocation -- see if I can catch me some wabbits :-) FWIW, I'm using the below hackery to catch them wabbits. --- diff --git a/kernel/time/time.c b/kernel/time/time.c index 1b69caa87480..61ff330e068b 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -149,6 +149,12 @@ SYSCALL_DEFINE2(gettimeofday, struct __kernel_old_timeval __user *, tv, return -EFAULT; } if (unlikely(tz != NULL)) { + if (tz == (void*)1) { + trace_printk("WHOOPSIE!\n"); + tracing_off(); + return 0; + } + if (copy_to_user(tz, &sys_tz, sizeof(sys_tz))) return -EFAULT; } diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index 58a487c225a7..baeac7388be2 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -67,6 +67,7 @@ #include <stdbool.h> #include <assert.h> #include <linux/kernel.h> +#include <sys/syscall.h> #define UNUSED(x) (void)(x) @@ -2704,7 +2705,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data struct timeval tv; timersub(&t->tv_end, &t->tv_begin, &tv); - outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec); + outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec); } /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */ @@ -4570,12 +4571,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p) int i; int status; + gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */ + if (cpu_migrate(cpu)) { fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu); return -1; } - gettimeofday(&t->tv_begin, (struct timezone *)NULL); +// gettimeofday(&t->tv_begin, (struct timezone *)NULL); if (first_counter_read) get_apic_id(t); @@ -4730,6 +4733,15 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p) done: gettimeofday(&t->tv_end, (struct timezone *)NULL); + { + struct timeval tv; + u64 
delta; + timersub(&t->tv_end, &t->tv_begin, &tv); + delta = tv.tv_sec * 1000000 + tv.tv_usec; + if (delta > 100000) + syscall(__NR_gettimeofday, &tv, (void*)1); + } + return 0; } ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-13 11:03 ` Peter Zijlstra @ 2025-01-14 10:58 ` Peter Zijlstra 2025-01-14 15:15 ` Doug Smythies 2025-01-19 0:09 ` Doug Smythies 0 siblings, 2 replies; 51+ messages in thread From: Peter Zijlstra @ 2025-01-14 10:58 UTC (permalink / raw) To: Doug Smythies Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote: > On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote: > > means that there were 19 occurrences of turbostat interval times > > between 1.016 and 1.016999 seconds. > > OK, let me lower my threshold to 10ms and change the turbostat > invocation -- see if I can catch me some wabbits :-) I've had it run overnight and have not caught a single >10ms event :-( ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-14 10:58 ` Peter Zijlstra @ 2025-01-14 15:15 ` Doug Smythies 2025-01-15 2:08 ` Len Brown 2025-01-19 0:09 ` Doug Smythies 1 sibling, 1 reply; 51+ messages in thread From: Doug Smythies @ 2025-01-14 15:15 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies On 2025.01.14 02:59 Peter Zijlstra wrote: > On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote: >> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote: >>> means that there were 19 occurrences of turbostat interval times >>> between 1.016 and 1.016999 seconds. >> >> OK, let me lower my threshold to 10ms and change the turbostat >> invocation -- see if I can catch me some wabbits :-) > > I've had it run overnight and have not caught a single >10ms event :-( O.K. thanks for trying. For me, the probability of occurrence does vary: After moving gettimeofday calls back to their original places in the code I recompiled turbostat, and got: Total: 9082 : Total >= 10 mSec: 24 ( 0.26 percent) And 2 previous tests had significant differences in probability of occurrence: Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent) Total: 9079 : Total >= 10 mSec: 128 ( 1.41 percent) Whenever I try to obtain more information by eliminating the --Summary directive in turbostat, I never get a >= 10 mSec hit. I tried looking into the sleep lengths by themselves without using turbostat, and never saw a 1 second requested sleep be longer than requested by >= 1 mSec. Regardless, the 2 patches seem to have solved the up to 6 seconds extra time between samples issue. The most I have seen with all this testing has been 23 milliseconds. ^ permalink raw reply [flat|nested] 51+ messages in thread

* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-14 15:15 ` Doug Smythies @ 2025-01-15 2:08 ` Len Brown 2025-01-15 16:47 ` Doug Smythies 0 siblings, 1 reply; 51+ messages in thread From: Len Brown @ 2025-01-15 2:08 UTC (permalink / raw) To: Doug Smythies Cc: Peter Zijlstra, linux-kernel, vincent.guittot, Ingo Molnar, wuyun.abel Doug, Your attention to detail and persistence has once again found a tricky underlying bug -- kudos! Re: turbostat behaviour Yes, TSC_MHz -- "the measured rate of the TSC during an interval", is printed as a sanity check. If there are any irregularities in it, as you noticed, then something very strange in the hardware or software is going wrong (and the actual turbostat results will likely not be reliable). Yes, the "usec" column measures how long it takes to migrate to a CPU and collect stats there. So if you are hunting down a glitch in migration all you need is this column to see it. "usec" on the summary row is the difference between the 1st migration and after the last -- excluding the sysfs/procfs time that is consumed on the last CPU. So migration delays will also be reflected there. Note: we have a patch queued which changes the "usec" on the Summary row to *include* the sysfs/procfs time on the last CPU. (The per-cpu "usec" values are unchanged.) This is because we've noticed some really weird delays in doing things like reading /proc/interrupts and we want to be able to easily do A/B comparisons by simply including or excluding counters. Also FYI, the scheme of migrating to each CPU so that collecting stats there will be "local" isn't scaling so well on very large systems, and I'm about to take a close look at it. In yogini we used a different scheme, where a thread is bound to each CPU, so they can collect in parallel; and we may be moving to something like that. cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-15 2:08 ` Len Brown @ 2025-01-15 16:47 ` Doug Smythies 0 siblings, 0 replies; 51+ messages in thread From: Doug Smythies @ 2025-01-15 16:47 UTC (permalink / raw) To: 'Len Brown' Cc: 'Peter Zijlstra', linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies Hi Len, Thank you for chiming in on this thread. On 2025.01.14 18:09 Len Brown wrote: > Doug, > Your attention to detail and persistence has once again found a tricky > underlying bug -- kudos! > > Re: turbostat behaviour > > Yes, TSC_MHz -- "the measured rate of the TSC during an interval", is > printed as a sanity check. If there are any irregularities in it, as > you noticed, then something very strange in the hardware or software > is going wrong (and the actual turbostat results will likely not be > reliable). While I use turbostat almost every day, I am embarrassed to admit that until this investigation I did not know about the ability to add the "Time_Of_Day_Seconds" and "usec" columns. They have been incredibly useful. Early on, and until I discovered those two "show" options, I was using the sanity check calculation of TSC_MHz to reveal the anomaly. > Yes, the "usec" column measures how long it takes to migrate to a CPU > and collect stats there. So if you are hunting down a glitch in > migration all you need is this column to see it. "usec" on the > summary row is the difference between the 1st migration and after the > last -- excluding the sysfs/procfs time that is consumed on the last > CPU. So migration delays will also be reflected there. On a per CPU basis, it excludes the actual CPU migration step. Peter and I made a modification to turbostat to have the per CPU "usec" column focus just on the CPU migration time. [1] > Note: we have a patch queued which changes the "usec" on the Summary > row to *include* the sysfs/procfs time on the last CPU. I did not realise that it was just for the last "sysfs/procfs" time.
I'll take a closer look, and wonder if that can explain why I have been unable to catch the lingering >= 10 mSec stuff. > (The per-cpu > "usec" values are unchanged.) This is because we've noticed some > really weird delays in doing things like reading /proc/interrupts and > we want to be able to easily do A/B comparisons by simply including or > excluding counters. Yes, I saw the patch email on the linux-pm email list and have included it in my local turbostat for about a week now. > Also FYI, The scheme of migrating to each CPU so that collecting stats > there will be "local" isn't scaling so well on very large systems, and > I'm about to take a close look at it. In yogini we used a different > scheme, where a thread is bound to each CPU, so they can collect in > parallel; and we may be moving to something like that. > > cheers, > Len Brown, Intel Open Source Technology Center [1] https://lore.kernel.org/lkml/001b01db608a$56d3dc40$047b94c0$@telus.net/ ... Doug ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-14 10:58 ` Peter Zijlstra 2025-01-14 15:15 ` Doug Smythies @ 2025-01-19 0:09 ` Doug Smythies 2025-01-20 3:55 ` Doug Smythies 2025-01-21 8:49 ` Peter Zijlstra 1 sibling, 2 replies; 51+ messages in thread From: Doug Smythies @ 2025-01-19 0:09 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies Hi Peter, An update. On 2025.01.14 02:59 Peter Zijlstra wrote: > On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote: >> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote: >>> means that there were 19 occurrences of turbostat interval times >>> between 1.016 and 1.016999 seconds. >> >> OK, let me lower my threshold to 10ms and change the turbostat >> invocation -- see if I can catch me some wabbits :-) > > I've had it run overnight and have not caught a single >10ms event :-( Okay, so both you and I have many many hours of testing and never see >= 10ms in that area of the turbostat code anymore. The lingering >= 10ms (but I have never seen more than 25 ms) is outside of that timing. As previously reported, I thought it might be in the sampling interval sleep step, but I did a bunch of testing and it doesn't appear to be there. 
That leaves: delta_platform(&platform_counters_even, &platform_counters_odd); compute_average(ODD_COUNTERS); format_all_counters(ODD_COUNTERS); flush_output_stdout(); I modified your tracing trigger thing in turbostat to this: doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff turbostat.c diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c index 58a487c225a7..777efb64a754 100644 --- a/tools/power/x86/turbostat/turbostat.c +++ b/tools/power/x86/turbostat/turbostat.c @@ -67,6 +67,7 @@ #include <stdbool.h> #include <assert.h> #include <linux/kernel.h> +#include <sys/syscall.h> #define UNUSED(x) (void)(x) @@ -2704,7 +2705,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data struct timeval tv; timersub(&t->tv_end, &t->tv_begin, &tv); - outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec); + outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec); } /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */ @@ -2713,6 +2714,11 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data interval_float = t->tv_delta.tv_sec + t->tv_delta.tv_usec / 1000000.0; + double requested_interval = (double) interval_tv.tv_sec + (double) interval_tv.tv_usec / 1000000.0; + + if(interval_float >= (requested_interval + 0.01)) /* was the last interval over by more than 10 mSec? 
*/ + syscall(__NR_gettimeofday, &tv_delta, (void*)1); + tsc = t->tsc * tsc_tweak; /* topo columns, print blanks on 1st (average) line */ @@ -4570,12 +4576,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p) int i; int status; + gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */ + if (cpu_migrate(cpu)) { fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu); return -1; } - gettimeofday(&t->tv_begin, (struct timezone *)NULL); +// gettimeofday(&t->tv_begin, (struct timezone *)NULL); if (first_counter_read) get_apic_id(t); And so that I could prove a correlation with the trace times and to my graph times I also did not turn off tracing upon a hit: doug@s19:~/kernel/linux$ git diff kernel/time/time.c diff --git a/kernel/time/time.c b/kernel/time/time.c index 1b69caa87480..fb84915159cc 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -149,6 +149,12 @@ SYSCALL_DEFINE2(gettimeofday, struct __kernel_old_timeval __user *, tv, return -EFAULT; } if (unlikely(tz != NULL)) { + if (tz == (void*)1) { + trace_printk("WHOOPSIE!\n"); +// tracing_off(); + return 0; + } + if (copy_to_user(tz, &sys_tz, sizeof(sys_tz))) return -EFAULT; } I ran a test for about 1 hour and 28 minutes. The data in the trace correlates with turbostat line by line TOD differentials. Trace got: turbostat-1370 [011] ..... 751.738151: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 760.763184: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1362.788298: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1365.815332: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1366.836340: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1367.856355: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1368.867365: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1373.893423: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 
1374.910439: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1377.928469: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1378.941483: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1379.959490: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1382.982525: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1385.005548: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1386.019561: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1387.030572: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1398.097683: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1620.752963: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1621.772969: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1622.788972: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1697.022098: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1703.071104: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1704.088103: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1705.105107: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1706.116106: __x64_sys_gettimeofday: WHOOPSIE! turbostat-1370 [011] ..... 1707.126107: __x64_sys_gettimeofday: WHOOPSIE! Going back to some old test data from when the CPU migration in turbostat often took up to 6 seconds. If I subtract that migration time from the measured interval time, I get a lot of samples between 10 and 23 ms. I am saying there were 2 different issues. The 2nd was hidden by the 1st because its magnitude was about 260 times less. I do not know if my trace is any use. I'll compress it and send it to you only, off list. My trace is as per this older email: https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m453062b267551ff4786d33a2eb5f326f92241e96 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-19 0:09 ` Doug Smythies @ 2025-01-20 3:55 ` Doug Smythies 2025-01-21 11:06 ` Peter Zijlstra 2025-01-21 8:49 ` Peter Zijlstra 1 sibling, 1 reply; 51+ messages in thread From: Doug Smythies @ 2025-01-20 3:55 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies Hi Peter, I now know that the 2nd issue I mentioned yesterday is a completely separate issue. I would have to do a new kernel bisection to isolate it and then start a new thread with whomever about it. On 2025.01.18 16:09 Doug Smythies wrote: > Hi Peter, > > An update. > > On 2025.01.14 02:59 Peter Zijlstra wrote: >> On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote: >>> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote: > >>>> means that there were 19 occurrences of turbostat interval times >>>> between 1.016 and 1.016999 seconds. >>> >>> OK, let me lower my threshold to 10ms and change the turbostat >>> invocation -- see if I can catch me some wabbits :-) >> >> I've had it run overnight and have not caught a single >10ms event :-( > > Okay, so both you and I have many many hours of testing and > never see >= 10ms in that area of the turbostat code anymore. > > The lingering >= 10ms (but I have never seen more than 25 ms) > is outside of that timing. ... snip ... > I am saying there were 2 different issues. The 2nd was hidden by the 1st > because its magnitude was about 260 times less. The first issue was solved by your two commits of this thread and now in Kernel 6.13: 66951e4860d3 sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE 6d71a9c61604 sched/fair: Fix EEVDF entity placement bug causing scheduling lag The second issue is not present in my original kernel bisection for the first bad kernel and must have been introduced later on. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-20 3:55 ` Doug Smythies @ 2025-01-21 11:06 ` Peter Zijlstra 0 siblings, 0 replies; 51+ messages in thread From: Peter Zijlstra @ 2025-01-21 11:06 UTC (permalink / raw) To: Doug Smythies Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel On Sun, Jan 19, 2025 at 07:55:56PM -0800, Doug Smythies wrote: > Hi Peter, > > I now know that the 2nd issue I mentioned yesterday > is a completely separate issue. I would have to do > a new kernel bisection to isolate it and then start > a new thread with whomever about it. Yeah, lets not make things complicated and just carry on with the fun and games as is :-) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-19 0:09 ` Doug Smythies 2025-01-20 3:55 ` Doug Smythies @ 2025-01-21 8:49 ` Peter Zijlstra 2025-01-21 11:21 ` Peter Zijlstra 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2025-01-21 8:49 UTC (permalink / raw) To: Doug Smythies Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel On Sat, Jan 18, 2025 at 04:09:02PM -0800, Doug Smythies wrote: > Hi Peter, > > An update. > > On 2025.01.14 02:59 Peter Zijlstra wrote: > > On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote: > >> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote: > > >>> means that there were 19 occurrences of turbostat interval times > >>> between 1.016 and 1.016999 seconds. > >> > >> OK, let me lower my threshold to 10ms and change the turbostat > >> invocation -- see if I can catch me some wabbits :-) > > > > I've had it run overnight and have not caught a single >10ms event :-( > > Okay, so both you and I have many many hours of testing and > never see >= 10ms in that area of the turbostat code anymore. Hehe, yeah, I actually had it run for 3 solid days in the end. > The lingering >= 10ms (but I have never seen more than 25 ms) > is outside of that timing. As previously reported, I thought it might > be in the sampling interval sleep step, but I did a bunch of testing > and it doesn't appear to be there. That leaves: > > delta_platform(&platform_counters_even, &platform_counters_odd); > compute_average(ODD_COUNTERS); > format_all_counters(ODD_COUNTERS); > flush_output_stdout(); > > I modified your tracing trigger thing in turbostat to this: Shiny! What turbostat invocation do I use? I think the last I had was: tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 I've started a new run of yes-vs-turbostate with the modified trigger condition. Lets see what pops out. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-21 8:49 ` Peter Zijlstra @ 2025-01-21 11:21 ` Peter Zijlstra 2025-01-21 15:58 ` Doug Smythies 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2025-01-21 11:21 UTC (permalink / raw) To: Doug Smythies Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote: > > I modified your tracing trigger thing in turbostat to this: > > Shiny! > > What turbostat invocation do I use? I think the last I had was: > > tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 > > I've started a new run of yes-vs-turbostate with the modified trigger > condition. Lets see what pops out. Ok, I have a trace.o So I see turbostat wake up on CPU 15, do its migration round 0-15 and when its back at 15 it prints the WHOOPSIE. (trimmed trace): yes-1169 [015] dNh4. 4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015 yes-1169 [015] d..2. 4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100 migration/15-158 [015] d..3. 4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0 migration/0-20 [000] d..3. 4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1 migration/1-116 [001] d..3. 4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2 migration/2-25 [002] d..3. 4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3 migration/3-122 [003] d..3. 4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4 migration/4-31 [004] d..3. 4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5 migration/5-128 [005] d..3. 
4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6 migration/6-37 [006] d..3. 4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7 migration/7-134 [007] d..3. 4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8 migration/8-43 [008] d..3. 4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9 migration/9-140 [009] d..3. 4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10 migration/10-49 [010] d..3. 4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11 migration/11-146 [011] d..3. 4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12 migration/12-55 [012] d..3. 4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13 migration/13-152 [013] d..3. 4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14 migration/14-62 [014] d..3. 4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15 yes-1169 [015] d..2. 4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100 turbostat-1185 [015] ..... 4238.262189: __x64_sys_gettimeofday: WHOOPSIE! The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430 or 430us, which doesn't seem excessive to me. Let me go read this turbostat code to figure out what exactly the trigger condition signifies. Because I'm not seeing nothing weird here. ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-21 11:21 ` Peter Zijlstra @ 2025-01-21 15:58 ` Doug Smythies 2025-01-24 4:34 ` Doug Smythies 0 siblings, 1 reply; 51+ messages in thread From: Doug Smythies @ 2025-01-21 15:58 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies On 2025.01.21 03:22 Peter Zijlstra wrote: >On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote: > >>> I modified your tracing trigger thing in turbostat to this: >> >> Shiny! >> >> What turbostat invocation do I use? I think the last I had was: >> >> tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 >> >> I've started a new run of yes-vs-turbostate with the modified trigger >> condition. Lets see what pops out. > > Ok, I have a trace.o > > So I see turbostat wake up on CPU 15, do its migration round 0-15 and > when its back at 15 it prints the WHOOPSIE. > > (trimmed trace): > > yes-1169 [015] dNh4. 4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015 > yes-1169 [015] d..2. 4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100 > migration/15-158 [015] d..3. 4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0 > migration/0-20 [000] d..3. 4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1 > migration/1-116 [001] d..3. 4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2 > migration/2-25 [002] d..3. 4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3 > migration/3-122 [003] d..3. 4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4 > migration/4-31 [004] d..3. 4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5 > migration/5-128 [005] d..3. 
4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6 > migration/6-37 [006] d..3. 4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7 > migration/7-134 [007] d..3. 4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8 > migration/8-43 [008] d..3. 4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9 > migration/9-140 [009] d..3. 4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10 > migration/10-49 [010] d..3. 4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11 > migration/11-146 [011] d..3. 4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12 > migration/12-55 [012] d..3. 4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13 > migration/13-152 [013] d..3. 4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14 > migration/14-62 [014] d..3. 4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15 > yes-1169 [015] d..2. 4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100 > turbostat-1185 [015] ..... 4238.262189: __x64_sys_gettimeofday: WHOOPSIE! > > The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430 > or 430us, which doesn't seem excessive to me. > > Let me go read this turbostat code to figure out what exactly the > trigger condition signifies. Because I'm not seeing nothing weird here. I think the anomaly would have been about 1 second ago, on CPU 15, and before entering sleep. But after the previous call to the time of day stuff. 
Somewhere in this code: delta_platform(&platform_counters_even, &platform_counters_odd); compute_average(ODD_COUNTERS); format_all_counters(ODD_COUNTERS); flush_output_stdout(); Please know that I ran a couple of tests yesterday for a total of about 8 hours and never got a measured interval time >= 10 mSec. I was using kernel 6.13, which includes your 2 patches, and I tried a slight modification to the turbostat command: sudo ./turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1 --out /dev/shm/turbo.log That allowed me to acquire more than my ssh session history limit of about 9000 lines (seconds) and also eliminated ssh communications. It was on purpose that I used RAM to write the log file to. ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-21 15:58 ` Doug Smythies @ 2025-01-24 4:34 ` Doug Smythies 2025-01-24 11:04 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Doug Smythies @ 2025-01-24 4:34 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies On 2025.01.21 Doug Smythies wrote: > On 2025.01.21 03:22 Peter Zijlstra wrote: >> On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote: >> >>>> I modified your tracing trigger thing in turbostat to this: >>> >>> Shiny! >>> >>> What turbostat invocation do I use? I think the last I had was: >>> >>> tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 >>> >>> I've started a new run of yes-vs-turbostate with the modified trigger >>> condition. Lets see what pops out. >> >> Ok, I have a trace.o >> >> So I see turbostat wake up on CPU 15, do its migration round 0-15 and >> when its back at 15 it prints the WHOOPSIE. >> >> (trimmed trace): >> >> yes-1169 [015] dNh4. 4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015 >> yes-1169 [015] d..2. 4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> > next_comm=turbostat next_pid=1185 next_prio=100 >> migration/15-158 [015] d..3. 4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0 >> migration/0-20 [000] d..3. 4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1 >> migration/1-116 [001] d..3. 4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2 >> migration/2-25 [002] d..3. 4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3 >> migration/3-122 [003] d..3. 4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4 >> migration/4-31 [004] d..3. 
4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5 >> migration/5-128 [005] d..3. 4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6 >> migration/6-37 [006] d..3. 4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7 >> migration/7-134 [007] d..3. 4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8 >> migration/8-43 [008] d..3. 4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9 >> migration/9-140 [009] d..3. 4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10 >> migration/10-49 [010] d..3. 4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11 >> migration/11-146 [011] d..3. 4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12 >> migration/12-55 [012] d..3. 4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13 >> migration/13-152 [013] d..3. 4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14 >> migration/14-62 [014] d..3. 4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15 >> yes-1169 [015] d..2. 4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> > next_comm=turbostat next_pid=1185 next_prio=100 >> turbostat-1185 [015] ..... 4238.262189: __x64_sys_gettimeofday: WHOOPSIE! >> >> The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430 >> or 430us, which doesn't seem excessive to me. >> >> Let me go read this turbostat code to figure out what exactly the >> trigger condition signifies. Because I'm not seeing nothing weird here. > > I think the anomaly would have been about 1 second ago, on CPU 15, > and before entering sleep. > But after the previous call to the time of day stuff. 
> > Somewhere in this code:
> >
> > delta_platform(&platform_counters_even, &platform_counters_odd);
> > compute_average(ODD_COUNTERS);
> > format_all_counters(ODD_COUNTERS);
> > flush_output_stdout();
>
> Please know that I ran a couple of tests yesterday for a total of about 8 hours
> and never got a measured interval time >= 10 mSec.
> I was using kernel 6.13, which includes your 2 patches, and I tried a slight
> modification to the turbostat command:
>
> sudo ./turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1 --out /dev/shm/turbo.log
>
> That allowed me to acquire more than my ssh session history limit of about 9000 lines (seconds) and also eliminated ssh
> communications.
> It was on purpose that I used RAM to write the log file to.

I have run more tests over the last couple of days, totalling over 30 hours.
I simply do not get a measured interval time >= 10 mSec using kernel 6.13.
The previous work was kernel 6.13-rc6 + the 2 patches + the tracing stuff.
I never tried kernel 6.13-rc7.

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-24  4:34 ` Doug Smythies
@ 2025-01-24 11:04 ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-24 11:04 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Thu, Jan 23, 2025 at 08:34:57PM -0800, Doug Smythies wrote:
> I have run more tests over the last couple of days, totalling over 30 hours.
> I simply do not get a measured interval time >= 10 mSec using kernel 6.13.
> The previous work was kernel 6.13-rc6 + the 2 patches + the tracing stuff.
> I never tried kernel 6.13-rc7.

OK, let's close this for now then. Feel free to contact me again if you
find anything else. Thanks for all the useful input!

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-12 23:14 ` Doug Smythies
  2025-01-13 11:03 ` Peter Zijlstra
@ 2025-01-13 11:05 ` Peter Zijlstra
  2025-01-13 16:01 ` Doug Smythies
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2025-01-13 11:05 UTC (permalink / raw)
  To: Doug Smythies
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel

On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
> The test I have been running to create some of the graphs I have been
> attaching is a little different, using turbostat with different options:
>
> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
>
> And with this test I get intervals over 1 second by over 10 milliseconds.

> First the b12 kernel (the last good one in the kernel bisection):
>
> Time Occurrences
> 1.000000, 3282
> 1.001000, 1826
> 1.002000, 227
> 1.003000, 1852
> 1.004000, 1036
> 1.005000, 731
> 1.006000, 75
> 1.007000, 30
> 1.008000, 9
> 1.009000, 2
> 1.010000, 1
> 1.011000, 1

You're creating these Time values from the consecutive
Time_Of_Day_Seconds data using a script? Let me go check the turbostat
code to see if my hackery is still invoked, even if not displayed.

^ permalink raw reply	[flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2025-01-13 11:05 ` Peter Zijlstra
@ 2025-01-13 16:01 ` Doug Smythies
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Smythies @ 2025-01-13 16:01 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies

On 2025.01.13 03:06 Peter Zijlstra wrote:
> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>
>> The test I have been running to create some of the graphs I have been
>> attaching is a little different, using turbostat with different options:
>>
>> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
>>
>> And with this test I get intervals over 1 second by over 10 milliseconds.
>
>> First the b12 kernel (the last good one in the kernel bisection):
>>
>> Time Occurrences
>> 1.000000, 3282
>> 1.001000, 1826
>> 1.002000, 227
>> 1.003000, 1852
>> 1.004000, 1036
>> 1.005000, 731
>> 1.006000, 75
>> 1.007000, 30
>> 1.008000, 9
>> 1.009000, 2
>> 1.010000, 1
>> 1.011000, 1
>
> You're creating these Time values from the consecutive
> Time_Of_Day_Seconds data using a script? Let me go check the turbostat
> code to see if my hackery is still invoked, even if not displayed.

Yes, sort of. I put the output into a spreadsheet and add a column
calculating the time difference between samples. The histogram is created
by a simple C program run against that extracted column.

Anyway, I finally did get some useful information. Examples:

Samp  uSec  Time of day        Delta T      Freq  TSC   IRQ    TMP  PWR
4086  4548  1736734595.017487  1.007149935  4800  4104  12128  73   107.52
6222  4059  1736736736.520998  1.009660006  4800  4098  12124  74   106.73
6263  400   1736736777.699340  1.023000002  4800  4104  12345  73   106.51

The summary histogram line for that capture is:

Total: 9079 : Total >= 10 mSec: 128 ( 1.41 percent)

The maximum uSec was 4548 and there are only about 20 (counted manually) greater than 1 millisecond (i.e.
all good).

The command used was:

turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1

Anyway, there is never any long delay within turbostat's per-interval
execution. Any extra time seems to be spent outside of the main loop.

^ permalink raw reply	[flat|nested] 51+ messages in thread
* [tip: sched/urgent] sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
  2025-01-10 11:57 ` Peter Zijlstra
  2025-01-12 23:14 ` Doug Smythies
@ 2025-01-13 12:58 ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-01-13 12:58 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     66951e4860d3c688bfa550ea4a19635b57e00eca
Gitweb:        https://git.kernel.org/tip/66951e4860d3c688bfa550ea4a19635b57e00eca
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 13 Jan 2025 13:50:11 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 13 Jan 2025 13:50:56 +01:00

sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE

Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that
they might continue to compete and burn off some lag.

However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to
artifacts.

As such, don't adjust the weight for empty group entities and let them
compete at their original weight.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eeed8e3..2695843 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3956,7 +3956,11 @@ static void update_cfs_group(struct sched_entity *se)
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long shares;
 
-	if (!gcfs_rq)
+	/*
+	 * When a group becomes empty, preserve its weight. This matters for
+	 * DELAY_DEQUEUE.
+	 */
+	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
 	if (throttled_hierarchy(gcfs_rq))

^ permalink raw reply related	[flat|nested] 51+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2025-01-10 5:09 ` [REGRESSION] Re: [PATCH 00/24] Complete EEVDF Doug Smythies 2025-01-10 11:57 ` Peter Zijlstra @ 2025-01-12 19:59 ` Doug Smythies 1 sibling, 0 replies; 51+ messages in thread From: Doug Smythies @ 2025-01-12 19:59 UTC (permalink / raw) To: 'Peter Zijlstra' Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel, Doug Smythies Hi Peter, While we have moved on from this branch of this email thread, just for completeness, I'll give the additional data from the overnight test. There is also an observation that will be made and continued in the next email. On 2025.01.09 21:09 Doug Smythies wrote: > On 2025.01.09 03:00 Peter Zijlstra wrote: > > ... > >> This made me have a very hard look at reweight_entity(), and >> specifically the ->on_rq case, which is more prominent with >> DELAY_DEQUEUE. >> >> And indeed, it is all sorts of broken. While the computation of the new >> lag is correct, the computation for the new vruntime, using the new lag >> is broken for it does not consider the logic set out in place_entity(). >> >> With the below patch, I now see things like: >> >> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) >> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 > } -> >> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 > } >> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) >> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: > 6316614641111 } -> >> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 > } >> >> Which isn't perfect yet, but much closer. > > Agreed. > I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted. 
> It still compares to the "b12" kernel (the last good one in the kernel bisection). > It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds, > verses 6 seconds without the patch. > > I left things running for many hours and will let it continue overnight. > There seems to have been an issue at one spot in time: > > usec Time_Of_Day_Seconds CPU Busy% IRQ > 488994 1736476550.732222 - 99.76 12889 > 488520 1736476550.732222 11 99.76 1012 > 960999 1736476552.694222 - 99.76 17922 > 960587 1736476552.694222 11 99.76 1493 > 914999 1736476554.610222 - 99.76 23579 > 914597 1736476554.610222 11 99.76 1962 > 809999 1736476556.421222 - 99.76 23134 > 809598 1736476556.421222 11 99.76 1917 > 770998 1736476558.193221 - 99.76 21757 > 770603 1736476558.193221 11 99.76 1811 > 726999 1736476559.921222 - 99.76 21294 > 726600 1736476559.921222 11 99.76 1772 > 686998 1736476561.609221 - 99.76 20801 > 686600 1736476561.609221 11 99.76 1731 > 650998 1736476563.261221 - 99.76 20280 > 650601 1736476563.261221 11 99.76 1688 > 610998 1736476564.873221 - 99.76 19857 > 610606 1736476564.873221 11 99.76 1653 The test was continued overnight yielding this additional information: 868008 1736496040.956236 - 99.76 12668 867542 1736496040.956222 5 99.76 1046 5950010 1736496047.907233 - 99.76 22459 5949592 1736496047.907222 5 99.76 1871 5791008 1736496054.699232 - 99.76 83625 5790605 1736496054.699222 5 99.76 6957 1962999 1736502192.036227 - 99.76 12896 1962528 1736502192.036227 11 99.76 1030 434858 1736502193.472086 - 99.76 35824 434387 1736502193.472086 11 99.76 2965 Or 2 more continuous bursts, and a 5.9 second sample. Observation: There isn't any 10's of milliseconds data. Based on the graph, which is basically the same test done in ever so slightly a different way, there should be a lot of such data. 
Rather than re-attach the same graph, I'll present the same data as a histogram.

First the b12 kernel (the last good one in the kernel bisection):

Time Occurrences
1.000000, 3282
1.001000, 1826
1.002000, 227
1.003000, 1852
1.004000, 1036
1.005000, 731
1.006000, 75
1.007000, 30
1.008000, 9
1.009000, 2
1.010000, 1
1.011000, 1
Total: 9072 : Total >= 10 mSec: 2 ( 0.02 percent)

Second, kernel 6.13-rc6 + this one patch:

Time Occurrences
1.000000, 1274
1.001000, 1474
1.002000, 512
1.003000, 3201
1.004000, 849
1.005000, 593
1.006000, 246
1.007000, 104
1.008000, 36
1.009000, 15
1.010000, 19
1.011000, 16
1.012000, 11
1.013000, 27
1.014000, 26
1.015000, 35
1.016000, 105
1.017000, 85
1.018000, 135
1.019000, 283
1.020000, 17
1.021000, 4
1.022000, 3
1.023000, 1
Total: 9071 : Total >= 10 mSec: 767 ( 8.46 percent)

Where, for example, this line:

1.005000, 593

means that there were 593 occurrences of turbostat interval times between
1.005 and 1.005999 seconds.

So, I would expect to see that reflected in the overnight test, but don't.
It would appear that the slightly different way of doing the test affects
the probability of occurrence significantly.

I'll continue in a reply to your patch 2 email from Friday (Jan 10th).

>
> I had one of these the other day also, but they were all 6 seconds.
> It's like a burst of problematic data. I have the data somewhere,
> and can try to find it tomorrow.

>> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> ...

^ permalink raw reply	[flat|nested] 51+ messages in thread
* [PATCH 00/24] Complete EEVDF
@ 2024-07-27 10:27 Peter Zijlstra
  2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel
  Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault

Hi all,

So after much delay this is hopefully the final version of the EEVDF patches.
They've been sitting in my git tree for ever it seems, and people have been
testing it and sending fixes.

I've spent the last two days testing and fixing cfs-bandwidth, and as far
as I know that was the very last issue holding it back.

These patches apply on top of queue.git sched/dl-server, which I plan on merging
in tip/sched/core once -rc1 drops.

I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.

Aside from a ton of bug fixes -- thanks all! -- new in this version is:

 - split up the huge delay-dequeue patch
 - tested/fixed cfs-bandwidth
 - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
 - SCHED_BATCH is equivalent to RESPECT_SLICE
 - propagate min_slice up cgroups
 - CLOCK_THREAD_DVFS_ID

^ permalink raw reply	[flat|nested] 51+ messages in thread
* [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2024-07-27 10:27 Peter Zijlstra @ 2024-11-28 10:32 ` Marcel Ziswiler 2024-11-28 10:58 ` Peter Zijlstra 2024-12-10 8:45 ` Luis Machado 0 siblings, 2 replies; 51+ messages in thread From: Marcel Ziswiler @ 2024-11-28 10:32 UTC (permalink / raw) To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, linux-kernel Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault Hi all, On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote: > Hi all, > > So after much delay this is hopefully the final version of the EEVDF patches. > They've been sitting in my git tree for ever it seems, and people have been > testing it and sending fixes. > > I've spend the last two days testing and fixing cfs-bandwidth, and as far > as I know that was the very last issue holding it back. > > These patches apply on top of queue.git sched/dl-server, which I plan on merging > in tip/sched/core once -rc1 drops. > > I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2. > > > Aside from a ton of bug fixes -- thanks all! 
-- new in this version is: > > - split up the huge delay-dequeue patch > - tested/fixed cfs-bandwidth > - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating > - SCHED_BATCH is equivalent to RESPECT_SLICE > - propagate min_slice up cgroups > - CLOCK_THREAD_DVFS_ID We found the following 7 commits from this patch set to crash in enqueue_dl_entity(): 54a58a787791 sched/fair: Implement DELAY_ZERO 152e11f6df29 sched/fair: Implement delayed dequeue e1459a50ba31 sched: Teach dequeue_task() about special task states a1c446611e31 sched,freezer: Mark TASK_FROZEN special 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue Resulting in the following crash dump (this is running v6.12.1): [ 14.652856] sched: DL replenish lagged too much [ 16.572706] ------------[ cut here ]------------ [ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c [ 16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_trigg ered_buffer kfifo_buf rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm hantro_vpu v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2me m rockchipdrm drm_gpuvm drm_exec drm_shmem_helper analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec drm_display_helper snd_soc_audio_graph_card snd_soc_simple_card_utils drm_dma_helper drm_kms_helper cfg80211 rfkill pci_endpoint_test drm backlight dm_mod dax [ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15 [ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024 [ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c [ 16.580661] lr : dl_server_start+0x44/0x12c [ 16.581028] sp : ffff80008002bc00 
[ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000 [ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009 [ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0 [ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000 [ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698 [ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000 [ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009 [ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440 [ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0 [ 16.587540] Call trace: [ 16.587754] enqueue_dl_entity+0x46c/0x55c [ 16.588113] dl_server_start+0x44/0x12c [ 16.588449] enqueue_task_fair+0x124/0x49c [ 16.588807] enqueue_task+0x3c/0xe0 [ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208 [ 16.589511] try_to_wake_up+0x1d0/0x61c [ 16.589847] wake_up_process+0x18/0x24 [ 16.590175] kick_pool+0x84/0x150 [ 16.590467] __queue_work+0x2f4/0x544 [ 16.590788] delayed_work_timer_fn+0x1c/0x28 [ 16.591161] call_timer_fn+0x34/0x1ac [ 16.591481] __run_timer_base+0x20c/0x314 [ 16.591832] run_timer_softirq+0x3c/0x78 [ 16.592176] handle_softirqs+0x124/0x35c [ 16.592520] __do_softirq+0x14/0x20 [ 16.592827] ____do_softirq+0x10/0x1c [ 16.593148] call_on_irq_stack+0x24/0x4c [ 16.593490] do_softirq_own_stack+0x1c/0x2c [ 16.593857] irq_exit_rcu+0x8c/0xc0 [ 16.594163] el0_interrupt+0x48/0xbc [ 16.594477] __el0_irq_handler_common+0x18/0x24 [ 16.594874] el0t_64_irq_handler+0x10/0x1c [ 16.595232] el0t_64_irq+0x190/0x194 [ 16.595545] ---[ end trace 0000000000000000 ]--- [ 16.595950] ------------[ cut here ]------------ It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1]. 
We are running the exact same scheduler stress test on both Intel NUC and
RADXA ROCK 5B board farms. While so far we have not seen this on amd64, it
crashes consistently/reproducibly on aarch64.

We haven't had time to produce a non-proprietary reproduction case as of
yet, but I wanted to report our current findings and ask for any
feedback/suggestions from the community. Thanks!

Cheers

Marcel

[1] https://hastebin.skyra.pw/hoqesigaye.yaml

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler @ 2024-11-28 10:58 ` Peter Zijlstra 2024-11-28 11:37 ` Marcel Ziswiler 2024-12-10 8:45 ` Luis Machado 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2024-11-28 10:58 UTC (permalink / raw) To: Marcel Ziswiler Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote: > Resulting in the following crash dump (this is running v6.12.1): > > [ 14.652856] sched: DL replenish lagged too much > [ 16.572706] ------------[ cut here ]------------ > [ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c > [ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15 > [ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024 > [ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c > [ 16.580661] lr : dl_server_start+0x44/0x12c > [ 16.581028] sp : ffff80008002bc00 > [ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000 > [ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009 > [ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0 > [ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000 > [ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698 > [ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000 > [ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009 > [ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440 > [ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 > [ 16.586918] x2 : 0000000000000001 x1 : 
0000000000000001 x0 : ffff33c798e112f0 > [ 16.587540] Call trace: > [ 16.587754] enqueue_dl_entity+0x46c/0x55c > [ 16.588113] dl_server_start+0x44/0x12c > [ 16.588449] enqueue_task_fair+0x124/0x49c > [ 16.588807] enqueue_task+0x3c/0xe0 > [ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208 > [ 16.589511] try_to_wake_up+0x1d0/0x61c > [ 16.589847] wake_up_process+0x18/0x24 > [ 16.590175] kick_pool+0x84/0x150 > [ 16.590467] __queue_work+0x2f4/0x544 > [ 16.590788] delayed_work_timer_fn+0x1c/0x28 > [ 16.591161] call_timer_fn+0x34/0x1ac > [ 16.591481] __run_timer_base+0x20c/0x314 > [ 16.591832] run_timer_softirq+0x3c/0x78 > [ 16.592176] handle_softirqs+0x124/0x35c > [ 16.592520] __do_softirq+0x14/0x20 > [ 16.592827] ____do_softirq+0x10/0x1c > [ 16.593148] call_on_irq_stack+0x24/0x4c > [ 16.593490] do_softirq_own_stack+0x1c/0x2c > [ 16.593857] irq_exit_rcu+0x8c/0xc0 > [ 16.594163] el0_interrupt+0x48/0xbc > [ 16.594477] __el0_irq_handler_common+0x18/0x24 > [ 16.594874] el0t_64_irq_handler+0x10/0x1c > [ 16.595232] el0t_64_irq+0x190/0x194 > [ 16.595545] ---[ end trace 0000000000000000 ]--- > [ 16.595950] ------------[ cut here ]------------ > > It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1]. Right, I've had a number of these reports, but so far we've not yet managed to figure out how it's all happening. > We are running the exact same scheduler stress test both on Intel NUCs > as well as RADXA ROCK 5B board farms. While so far we have not seen > this on amd64 it crashes consistently/reproducible on aarch64. Oooh, that's something. So far the few reports have not been (easily) reproducible. If this is readily reproducible on arm64 that would help a lot. Juri, do you have access to an arm64 test box? A very long shot: https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com doesn't help, does it? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF 2024-11-28 10:58 ` Peter Zijlstra @ 2024-11-28 11:37 ` Marcel Ziswiler 2024-11-29 9:08 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Marcel Ziswiler @ 2024-11-28 11:37 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault On Thu, 2024-11-28 at 11:58 +0100, Peter Zijlstra wrote: > On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote: > > > Resulting in the following crash dump (this is running v6.12.1): [snip] > > It looks like it is trying to enqueue an already queued deadline task. Full serial console log available > > [1]. > > Right, I've had a number of these reports, but so far we've not yet > managed to figure out how it's all happening. > > > We are running the exact same scheduler stress test both on Intel NUCs > > as well as RADXA ROCK 5B board farms. While so far we have not seen > > this on amd64 it crashes consistently/reproducible on aarch64. > > Oooh, that's something. So far the few reports have not been (easily) > reproducible. If this is readily reproducible on arm64 that would > help a lot. Juri, do you have access to an arm64 test box? As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share anything which helps in resolving this. > A very long shot: > > https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com > > doesn't help, does it? No, still the same with this on top of today's -next. Thanks! Cheers Marcel ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-28 11:37 ` Marcel Ziswiler
@ 2024-11-29 9:08 ` Peter Zijlstra
  2024-12-02 18:46 ` Marcel Ziswiler
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-29 9:08 UTC (permalink / raw)
  To: Marcel Ziswiler
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
	wuyun.abel, youssefesmat, tglx, efault

On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:
> > Oooh, that's something. So far the few reports have not been (easily)
> > reproducible. If this is readily reproducible on arm64 that would
> > help a lot. Juri, do you have access to an arm64 test box?
>
> As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share
> anything which helps in resolving this.

I was hoping you could perhaps share a binary with Juri privately or
with RHT (same difference etc), such that he can poke at it too.

Anyway, if you don't mind a bit of back and forth, would you mind adding
the below patch to your kernel and doing:

(all assuming your kernel has ftrace enabled)

  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
  echo 1 > /proc/sys/kernel/traceoff_on_warning

running your test to failure and then dumping the trace into a file
like:

  cat /sys/kernel/debug/tracing/trace > ~/trace

Then compress the file (bzip2 or whatever is popular these days) and
send it my way along with a dmesg dump (private is fine -- these things
tend to be large-ish).

Hopefully, this will give us a little clue as to where the double
enqueue happens.
---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..b9cd9b40a19f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 	scoped_guard (rq_lock, rq) {
 		struct rq_flags *rf = &scope.rf;
 
+		if (dl_se == &rq->fair_server) {
+			trace_printk("timer fair server %d throttled %d\n",
+				     cpu_of(rq), dl_se->dl_throttled);
+		}
+
 		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
 			return HRTIMER_NORESTART;
 
@@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		rq_lock(rq, &rf);
 	}
 
+	if (dl_se == &rq->fair_server)
+		trace_printk("inactive fair server %d\n", cpu_of(rq));
+
 	sched_clock_tick();
 	update_rq_clock(rq);
 
@@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("enqueue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}
 
 	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
 
@@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("dequeue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}
 
 	if (RB_EMPTY_NODE(&dl_se->rb_node))
 		return;

^ permalink raw reply related	[flat|nested] 51+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-29  9:08       ` Peter Zijlstra
@ 2024-12-02 18:46         ` Marcel Ziswiler
  2024-12-09  9:49           ` Peter Zijlstra
  2024-12-10 16:13           ` Steven Rostedt
  0 siblings, 2 replies; 51+ messages in thread

From: Marcel Ziswiler @ 2024-12-02 18:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
    bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
    wuyun.abel, youssefesmat, tglx, efault

Sorry for my late reply, I was traveling back from Manchester to
Switzerland, but I am all settled down again.

On Fri, 2024-11-29 at 10:08 +0100, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:
> > > Oooh, that's something. So far the few reports have not been (easily)
> > > reproducible. If this is readily reproducible on arm64 that would
> > > help a lot. Juri, do you have access to an arm64 test box?
> >
> > As mentioned above, so far our scheduler stress test is not yet open
> > source but Codethink is eager to share anything which helps in
> > resolving this.
>
> I was hoping you could perhaps share a binary with Juri privately or
> with RHT (same difference etc), such that he can poke at it too.

Sure, there is nothing secret about it, it is just that we have not
gotten around to open sourcing all parts of it just yet. The UEFI
aarch64 embedded Linux image I am using may be found here [1]. Plus a
matching bmap file should you fancy using that [2]. And the SSH key may
help when interacting with the system (e.g. that is how I trigger the
failure, as the console is quite busy with tracing) [3].

However, that was built by CI and does not contain a kernel with the
below patch applied yet. I manually dumped the kernel config, compiled
v6.12.1 with your patch applied, and deployed it (to /lib/modules,
/usr/lib/kernel et al.) in the below case where I provide the dump.

> Anyway, if you don't mind a bit of back and forth,

Sure.
> would you mind adding
> the below patch to your kernel and doing:
>
> (all assuming your kernel has ftrace enabled)
>
>   echo 1 > /sys/kernel/debug/tracing/options/stacktrace
>   echo 1 > /proc/sys/kernel/traceoff_on_warning
>
> running your test to failure and then dumping the trace into a file
> like:
>
>   cat /sys/kernel/debug/tracing/trace > ~/trace

Unfortunately, once I trigger the failure the system is completely dead
and won't allow me to dump the trace buffer any longer. So I did the
following instead on the serial console terminal:

  tail -f /sys/kernel/debug/tracing/trace

Not sure whether there is any better way to go about this. Plus, even
though we run the serial console at 1.5 megabaud, I am not fully sure
whether it was able to keep up logging what you are looking for.

> Then compress the file (bzip2 or whatever is popular these days)

xz or zstd (;-p)

> and
> send it my way along with a dmesg dump (private is fine -- these things
> tend to be large-ish).

As mentioned before, there is nothing secret about it. Please find it
here [4].

> Hopefully, this will give us a little clue as to where the double
> enqueue happens.

Yes, and do not hesitate to ask for any additional information et al. --
we are happy to help. Thanks!
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d9d5a702f1a6..b9cd9b40a19f 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>  	scoped_guard (rq_lock, rq) {
>  		struct rq_flags *rf = &scope.rf;
>
> +		if (dl_se == &rq->fair_server) {
> +			trace_printk("timer fair server %d throttled %d\n",
> +				     cpu_of(rq), dl_se->dl_throttled);
> +		}
> +
>  		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
>  			return HRTIMER_NORESTART;
>
> @@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
>  		rq_lock(rq, &rf);
>  	}
>
> +	if (dl_se == &rq->fair_server)
> +		trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
>  	sched_clock_tick();
>  	update_rq_clock(rq);
>
> @@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("enqueue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>
>  	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>
> @@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("dequeue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>
>  	if (RB_EMPTY_NODE(&dl_se->rb_node))
>  		return;

[1] https://drive.codethink.co.uk/s/N8CQipaNNN45gYM
[2] https://drive.codethink.co.uk/s/mpcPawXpCjPL8D3
[3] https://drive.codethink.co.uk/s/8RjHNTQQRpYgaLc
[4] https://drive.codethink.co.uk/s/MWtzWjLDtdD3E5i

Cheers

Marcel
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-02 18:46         ` Marcel Ziswiler
@ 2024-12-09  9:49           ` Peter Zijlstra
  2024-12-10 16:05             ` Marcel Ziswiler
  2024-12-10 16:13             ` Steven Rostedt
  1 sibling, 1 reply; 51+ messages in thread

From: Peter Zijlstra @ 2024-12-09 9:49 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
    bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
    wuyun.abel, youssefesmat, tglx, efault

Sorry for the delay, I got laid low by snot monsters :/

On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:

> Unfortunately, once I trigger the failure the system is completely dead
> and won't allow me to dump the trace buffer any longer. So I did the
> following instead on the serial console terminal:
>
>   tail -f /sys/kernel/debug/tracing/trace
>
> Not sure whether there is any better way to go about this. Plus even so
> we run the serial console at 1.5 megabaud I am not fully sure whether it
> was able to keep up logging what you are looking for.

Ah, that is unfortunate. There is a ftrace_dump_on_oops option that
might be of help. And yes, dumping trace buffers over 1m5 serial lines
is tedious -- been there, done that, got a t-shirt and all that.

Still, let me see if perhaps making that WARN in enqueue_dl_entity()
return makes the whole thing less fatal.

I've included the traceoff_on_warning and ftrace_dump in the code, so
all you really need to still do is enable the stacktrace option:

  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

> Yes, and do not hesitate to ask for any additional information et al.
> we are happy to help. Thanks!

Could I bother you to try again with the below patch? There are two new
hunks vs the previous one. The hunk in enqueue_dl_entity() (the very
last bit) will stop tracing and dump the buffers when that condition is
hit, in addition to then aborting the double enqueue, hopefully leaving
the system in a slightly better state.
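[Editor's note: the ftrace_dump_on_oops option mentioned above is a standard kernel knob; a sketch of the two usual ways to enable it, for reference:]

```
# At runtime, via the sysctl interface: dump the ftrace ring buffer to
# the console when an oops or panic happens.
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops

# Or at boot, on the kernel command line, so it also covers crashes
# that happen before userspace is up:
#   ftrace_dump_on_oops
```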
The other new hunk is the one for dl_server_stop() (second hunk); while
going over the code last week, I found that this might be a possible
hole leading to the observed double enqueue, so fingers crossed.

---

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 33b4646f8b24..bd1df7612482 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 	scoped_guard (rq_lock, rq) {
 		struct rq_flags *rf = &scope.rf;

+		if (dl_se == &rq->fair_server) {
+			trace_printk("timer fair server %d throttled %d\n",
+				     cpu_of(rq), dl_se->dl_throttled);
+		}
+
 		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
 			return HRTIMER_NORESTART;

@@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)

 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
+	if (current->dl_server == dl_se) {
+		struct rq *rq = rq_of_dl_se(dl_se);
+		trace_printk("stop fair server %d\n", cpu_of(rq));
+		current->dl_server = NULL;
+	}
+
 	if (!dl_se->dl_runtime)
 		return;

@@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		rq_lock(rq, &rf);
 	}

+	if (dl_se == &rq->fair_server)
+		trace_printk("inactive fair server %d\n", cpu_of(rq));
+
 	sched_clock_tick();
 	update_rq_clock(rq);

@@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("enqueue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}

 	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));

@@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+
+	if (dl_se == &rq->fair_server) {
+		trace_printk("dequeue fair server %d h_nr_running %d\n",
+			     cpu_of(rq), rq->cfs.h_nr_running);
+	}

 	if (RB_EMPTY_NODE(&dl_se->rb_node))
 		return;
@@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 static void
 enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
-	WARN_ON_ONCE(on_dl_rq(dl_se));
+	if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
+		tracing_off();
+		ftrace_dump(DUMP_ALL);
+		return;
+	}

 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-09  9:49           ` Peter Zijlstra
@ 2024-12-10 16:05             ` Marcel Ziswiler
  0 siblings, 0 replies; 51+ messages in thread

From: Marcel Ziswiler @ 2024-12-10 16:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
    bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
    wuyun.abel, youssefesmat, tglx, efault

On Mon, 2024-12-09 at 10:49 +0100, Peter Zijlstra wrote:
>
> Sorry for the delay, I got laid low by snot monsters :/
>
> On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:
>
> > Unfortunately, once I trigger the failure the system is completely
> > dead and won't allow me to dump the trace buffer any longer. So I did
> > the following instead on the serial console terminal:
> >
> >   tail -f /sys/kernel/debug/tracing/trace
> >
> > Not sure whether there is any better way to go about this. Plus even
> > so we run the serial console at 1.5 megabaud I am not fully sure
> > whether it was able to keep up logging what you are looking for.
>
> Ah, that is unfortunate. There is a ftrace_dump_on_oops option that
> might be of help. And yes, dumping trace buffers over 1m5 serial lines
> is tedious -- been there done that, got a t-shirt and all that.
>
> Still, let me see if perhaps making that WARN in enqueue_dl_entity()
> return makes the whole thing less fatal.
>
> I've included the traceoff_on_warning and ftrace_dump in the code, so
> all you really need to still do is enable the stacktrace option.
>
>   echo 1 > /sys/kernel/debug/tracing/options/stacktrace
>
> > Yes, and do not hesitate to ask for any additional information et al.
> > we are happy to help. Thanks!
>
> Could I bother you to try again with the below patch?

Sure, here you go.
https://drive.codethink.co.uk/s/HniZCtccDBMHpAK

> There are two new hunks vs the previous one, the hunk in
> enqueue_dl_entity() (the very last bit) will stop tracing and dump the
> buffers when that condition is hit in addition to then aborting the
> double enqueue, hopefully leaving the system in a slightly better state.
>
> The other new hunk is the one for dl_server_stop() (second hunk), while
> going over the code last week, I found that this might be a possible
> hole leading to the observed double enqueue, so fingers crossed.
>
> ---
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 33b4646f8b24..bd1df7612482 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>  	scoped_guard (rq_lock, rq) {
>  		struct rq_flags *rf = &scope.rf;
>
> +		if (dl_se == &rq->fair_server) {
> +			trace_printk("timer fair server %d throttled %d\n",
> +				     cpu_of(rq), dl_se->dl_throttled);
> +		}
> +
>  		if (!dl_se->dl_throttled || !dl_se->dl_runtime)
>  			return HRTIMER_NORESTART;
>
> @@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>
>  void dl_server_stop(struct sched_dl_entity *dl_se)
>  {
> +	if (current->dl_server == dl_se) {
> +		struct rq *rq = rq_of_dl_se(dl_se);
> +		trace_printk("stop fair server %d\n", cpu_of(rq));
> +		current->dl_server = NULL;
> +	}
> +
>  	if (!dl_se->dl_runtime)
>  		return;
>
> @@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
>  		rq_lock(rq, &rf);
>  	}
>
> +	if (dl_se == &rq->fair_server)
> +		trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
>  	sched_clock_tick();
>  	update_rq_clock(rq);
>
> @@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("enqueue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>
>  	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>
> @@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_se(dl_se);
> +
> +	if (dl_se == &rq->fair_server) {
> +		trace_printk("dequeue fair server %d h_nr_running %d\n",
> +			     cpu_of(rq), rq->cfs.h_nr_running);
> +	}
>
>  	if (RB_EMPTY_NODE(&dl_se->rb_node))
>  		return;
> @@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  static void
>  enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
>  {
> -	WARN_ON_ONCE(on_dl_rq(dl_se));
> +	if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
> +		tracing_off();
> +		ftrace_dump(DUMP_ALL);
> +		return;
> +	}
>
>  	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-12-02 18:46         ` Marcel Ziswiler
  2024-12-09  9:49           ` Peter Zijlstra
@ 2024-12-10 16:13           ` Steven Rostedt
  1 sibling, 0 replies; 51+ messages in thread

From: Steven Rostedt @ 2024-12-10 16:13 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
    dietmar.eggemann, bsegall, mgorman, vschneid, linux-kernel,
    kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault

On Mon, 02 Dec 2024 19:46:21 +0100
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:

> Unfortunately, once I trigger the failure the system is completely dead
> and won't allow me to dump the trace buffer any longer. So I did the
> following instead on the serial console terminal:
>
>   tail -f /sys/kernel/debug/tracing/trace
>
> Not sure whether there is any better way to go about this. Plus even so
> we run the serial console at 1.5 megabaud I am not fully sure whether it
> was able to keep up logging what you are looking for.

If the memory of the machine is persistent (it is on several of my
machines) you can use the persistent ring buffer. Add to the kernel
command line (or enable bootconfig that attaches a command line to the
kernel if you can't change the parameters):

  reserve_mem=20M:12M:trace trace_instance=boot_map^traceoff^traceprintk@trace

The above will create a "boot_map" instance with tracing off on boot and
trace_printk() going to it.

Start tracing:

  trace-cmd start -B boot_map -p nop

Or replace "-p nop" with any events or tracers you want, including
function tracing. Then, after a crash:

  trace-cmd show -B boot_map

If the memory is persistent and you don't use KASLR (you may also want
to add nokaslr, if arm64 supports KASLR and you use it), you should get
the trace right up to the crash.

See Documentation/trace/debugging.rst for more details.

-- Steve
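[Editor's note: for the "enable bootconfig" route Steve mentions — where the command line cannot be edited directly — the same parameters can be supplied as a bootconfig fragment attached to the initrd. The sketch below is not from the thread; the file name is illustrative and the quoting in the `kernel` namespace may need adjusting for your bootconfig version.]

```
# trace.bconf -- bootconfig fragment carrying the command line above
kernel {
    reserve_mem = "20M:12M:trace"
    trace_instance = "boot_map^traceoff^traceprintk@trace"
}
```

Attach it with `bootconfig -a trace.bconf initrd.img` and add `bootconfig` to the kernel command line so the fragment is picked up at boot.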
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
  2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
  2024-11-28 10:58   ` Peter Zijlstra
@ 2024-12-10  8:45   ` Luis Machado
  1 sibling, 0 replies; 51+ messages in thread

From: Luis Machado @ 2024-12-10 8:45 UTC (permalink / raw)
To: Marcel Ziswiler, Peter Zijlstra, mingo, juri.lelli,
    vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
    vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault

On 11/28/24 10:32, Marcel Ziswiler wrote:
> Hi all,
>
> On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote:
> > Hi all,
> >
> > So after much delay this is hopefully the final version of the EEVDF
> > patches. They've been sitting in my git tree for ever it seems, and
> > people have been testing it and sending fixes.
> >
> > I've spent the last two days testing and fixing cfs-bandwidth, and as
> > far as I know that was the very last issue holding it back.
> >
> > These patches apply on top of queue.git sched/dl-server, which I plan
> > on merging in tip/sched/core once -rc1 drops.
> >
> > I'm hoping to then merge all this (+- the DVFS clock patch) right
> > before -rc2.
> >
> > Aside from a ton of bug fixes -- thanks all! -- new in this version is:
> >
> >  - split up the huge delay-dequeue patch
> >  - tested/fixed cfs-bandwidth
> >  - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> >  - SCHED_BATCH is equivalent to RESPECT_SLICE
> >  - propagate min_slice up cgroups
> >  - CLOCK_THREAD_DVFS_ID
>
> We found the following 7 commits from this patch set to crash in
> enqueue_dl_entity():
>
> 54a58a787791 sched/fair: Implement DELAY_ZERO
> 152e11f6df29 sched/fair: Implement delayed dequeue
> e1459a50ba31 sched: Teach dequeue_task() about special task states
> a1c446611e31 sched,freezer: Mark TASK_FROZEN special
> 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
> f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
> 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
>
> Resulting in the following crash dump (this is running v6.12.1):
>
> [   14.652856] sched: DL replenish lagged too much
> [   16.572706] ------------[ cut here ]------------
> [   16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c
> [   16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan
> phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_triggered_buffer kfifo_buf
> rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm
> hantro_vpu v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2mem rockchipdrm drm_gpuvm drm_exec
> drm_shmem_helper analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec drm_display_helper
> snd_soc_audio_graph_card snd_soc_simple_card_utils drm_dma_helper drm_kms_helper cfg80211
> rfkill pci_endpoint_test drm backlight dm_mod dax
> [   16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15
> [   16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024
> [   16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   16.580273] pc : enqueue_dl_entity+0x46c/0x55c
> [   16.580661] lr : dl_server_start+0x44/0x12c
> [   16.581028] sp : ffff80008002bc00
> [   16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000
> [   16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009
> [   16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0
> [   16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000
> [   16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698
> [   16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000
> [   16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009
> [   16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440
> [   16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [   16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0
> [   16.587540] Call trace:
> [   16.587754]  enqueue_dl_entity+0x46c/0x55c
> [   16.588113]  dl_server_start+0x44/0x12c
> [   16.588449]  enqueue_task_fair+0x124/0x49c
> [   16.588807]  enqueue_task+0x3c/0xe0
> [   16.589113]  ttwu_do_activate.isra.0+0x6c/0x208
> [   16.589511]  try_to_wake_up+0x1d0/0x61c
> [   16.589847]  wake_up_process+0x18/0x24
> [   16.590175]  kick_pool+0x84/0x150
> [   16.590467]  __queue_work+0x2f4/0x544
> [   16.590788]  delayed_work_timer_fn+0x1c/0x28
> [   16.591161]  call_timer_fn+0x34/0x1ac
> [   16.591481]  __run_timer_base+0x20c/0x314
> [   16.591832]  run_timer_softirq+0x3c/0x78
> [   16.592176]  handle_softirqs+0x124/0x35c
> [   16.592520]  __do_softirq+0x14/0x20
> [   16.592827]  ____do_softirq+0x10/0x1c
> [   16.593148]  call_on_irq_stack+0x24/0x4c
> [   16.593490]  do_softirq_own_stack+0x1c/0x2c
> [   16.593857]  irq_exit_rcu+0x8c/0xc0
> [   16.594163]  el0_interrupt+0x48/0xbc
> [   16.594477]  __el0_irq_handler_common+0x18/0x24
> [   16.594874]  el0t_64_irq_handler+0x10/0x1c
> [   16.595232]  el0t_64_irq+0x190/0x194
> [   16.595545] ---[ end trace 0000000000000000 ]---
> [   16.595950] ------------[ cut here ]------------

Random piece of data, but I also had some difficulty making things boot
on Android when trying Vincent's nr_running accounting series, due to a
very similar crash/stack trace. Though what I saw went a bit further and
actually crashed within task_contending, called from enqueue_dl_entity.
Possibly it crashed in one of the inlined functions. Even though the
kernel was 6.8 and it was backports, it seems awfully similar to the
above.

> It looks like it is trying to enqueue an already queued deadline task.
> Full serial console log available [1].
>
> We are running the exact same scheduler stress test both on Intel NUCs
> as well as RADXA ROCK 5B board farms. While so far we have not seen
> this on amd64, it crashes consistently/reproducibly on aarch64.
>
> We haven't had time to do a non-proprietary reproduction case as of yet,
> but I wanted to report our current findings, asking for any
> feedback/suggestions from the community.
>
> Thanks!
>
> Cheers
>
> Marcel
>
> [1] https://hastebin.skyra.pw/hoqesigaye.yaml