* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 3:44 ` Florian Fainelli
@ 2024-03-14 5:12 ` Boqun Feng
2024-03-14 6:33 ` Boqun Feng
2024-03-14 9:11 ` Thomas Gleixner
2024-03-14 10:41 ` Frederic Weisbecker
2 siblings, 1 reply; 47+ messages in thread
From: Boqun Feng @ 2024-03-14 5:12 UTC (permalink / raw)
To: Florian Fainelli
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, tglx, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
>
>
> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> > On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
> > > On 3/13/24 14:59, Russell King (Oracle) wrote:
> > > > On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
> > > > > I will try to provide multiple answers for the sake of everyone having the
> > > > > same context. Responding to Linus' specifically and his suggestion to use
> > > > > "initcall_debug", this is what it gave me:
> > > > >
> > > > > [ 6.970669] ata1: SATA link down (SStatus 0 SControl 300)
> > > > > [ 166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
> > > > > [ 166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
> > > > > [ 166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
> > > > > [ 166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
> > > > > [ 166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
> > > > >
> > > > > Also got another occurrence happening resuming from suspend to DRAM with:
> > > > >
> > > > > [ 22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
> > > > > platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
> > > > > [ 181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
> > > > > platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
> > > > >
> > > > > and also with the PCIe root complex driver:
> > > > >
> > > > > [ 18.266279] brcm-pcie f0460000.pcie: PM: calling
> > > > > brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
> > > > > [ 177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
> > > > > [ 177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
> > > > > [ 177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
> > > > > returned 0 after 159190939 usecs
> > > > > [ 177.457257] pcieport 0000:00:00.0: PM: calling
> > > > > pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
> > > > >
> > > > > Surprisingly those drivers are consistently reproducing the failures I am
> > > > > seeing so at least this gave me a clue as to where the problem is.
> > > > >
> > > > > There were no changes to drivers/net/ethernet/broadcom/genet/, the two
> > > > > changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
> > > > > the read_poll_timeout() conversion is correct, we properly break out of the
> > > > > loop. The initial delay looked like a good culprit for a little while, but
> > > > > it is not used on the affected platforms because instead we provide a
> > > > > callback and we have an interrupt to signal the completion of a MDIO
> > > > > operation, therefore unimac_mdio_poll() is not used at all. Finally
> > > > > drivers/memory/brcmstb_dpfe.c also received a single change which is not
> > > > > functional here (.remove function change do return void).
> > > > >
> > > > > I went back to a manual bisection and this time I believe that I have a more
> > > > > plausible candidate with:
> > > > >
> > > > > 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
> > > > > hierarchical pull model")
> > > >
> > > > I haven't understood the code there yet, and how it would interact with
> > > > arch code, but one thing that immediately jumps out to me is this:
> > > >
> > > > " As long as a CPU is busy it expires both local and global timers. When a
> > > > CPU goes idle it arms for the first expiring local timer."
> > > >
> > > > So are local timers "armed" when they are enqueued while the cpu is
> > > > "busy" during initialisation, and will they expire, and will that
> > > > expiry be delivered in a timely manner?
> > > >
> > > > If not, this commit is basically broken, and would be the cause of the
> > > > issue you are seeing. For the mdio case, we're talking about 2ms
> > > > polling. For the dpfe case, it looks like we're talking about 1ms
> > > > sleeps. I'm guessing that these end up being local timers.
> > > >
> > > > Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
> > > > for link up every 5ms - if the link was down and we msleep(5) I wonder
> > > > if that's triggering the same issue.
> > > >
> > > > Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
> > > > say. I would imagine that the same hardware timer driver is being used
> > > > (may be worth checking DT.) The same should be true for the interrupt
> > > > driver as well. There's been no changes in that code.
> > >
> > > I just had it happen with ARM64 I was plagued by:
> > >
> > > https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
> > >
> > > and my earlier bisections somehow did not have ARM64 fail, so I thought it
> > > was immune but it fails with about the same failure rate as ARM 32-bit.
> >
> > Can you please boot with:
> >
> > trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
> >
> > And add the following and give us the resulting output in dmesg?
>
> Here are two logs from two different systems that exposed the problem on
> boot:
>
I found a pattern here, maybe related.
> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
[ 163.244130] kworker/-31 3D.... 44007us : timer_start: timer=b089b886 function=delayed_work_timer_fn expires=4294672340 [timeout=5000] bucket_expiry=4294672384 cpu =3 idx=192 flags=D|I
^^^ this timer was supposed to expired after around 5000 jiffies (i.e. 5
second, since HZ=1000), but it expired way late (160 seconds later).
[ 163.261034] kworker/-31 3d.... 44012us : timer_start: timer=394b309f function=delayed_work_timer_fn expires=4294787991 [timeout=120651] bucket_expiry=4294791168 c pu=3 idx=277 flags=I
...
[ 221.630578] <idle>-0 1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
[ 221.643475] <idle>-0 0d.s.. 162361292us : timer_cancel: timer=95703ccd
[ 221.650896] <idle>-0 0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587
baseclk and now has a huge delta.
...
[ 222.394471] <idle>-0 3dns.. 162499420us : timer_cancel: timer=b089b886
[ 222.401893] kworker/-47 1d.... 162499420us : timer_start: timer=11162242 function=delayed_work_timer_fn expires=4294830997 [timeout=1201] bucket_expiry=4294831040 cpu=1 idx=175 flags=D|P|I
[ 222.419317] <idle>-0 3dns.. 162499421us : timer_expire_entry: timer=b089b886 function=delayed_work_timer_fn now=4294829796 baseclk=4294672384
and the same pattern for the following:
> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
[ 162.855009] kworker/-31 3D.... 34849us : timer_start: timer=c48c9b47 function=delayed_work_timer_fn expires=4294672330 [timeout=5000] bucket_expiry=4294672384 cpu =3 idx=192 flags=D|I
[ 162.871903] kworker/-31 3d.... 34854us : timer_start: timer=c1ab7696 function=delayed_work_timer_fn expires=4294787991 [timeout=120661] bucket_expiry=4294791168 c pu=3 idx=277 flags=I
...
[ 218.724285] <idle>-0 2..s.. 2938198us : timer_expire_entry: timer=f3c19512 function=process_timeout now=4294670234 baseclk=4294670234
[ 218.737175] <idle>-0 0d.s.. 162011087us : timer_cancel: timer=829432ce
[ 218.744589] <idle>-0 0d.s.. 162011088us : timer_expire_entry: timer=829432ce function=delayed_work_timer_fn now=4294829306 baseclk=4294670400
...
[ 219.171101] <idle>-0 3dns.. 162025222us : timer_cancel: timer=c48c9b47
[ 219.178516] <idle>-0 3dns.. 162025222us : timer_expire_entry: timer=c48c9b47 function=delayed_work_timer_fn now=4294829321 baseclk=4294672384
Regards,
Boqun
>
> thanks!
> --
> Florian
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 5:12 ` Boqun Feng
@ 2024-03-14 6:33 ` Boqun Feng
2024-03-14 9:32 ` Thomas Gleixner
0 siblings, 1 reply; 47+ messages in thread
From: Boqun Feng @ 2024-03-14 6:33 UTC (permalink / raw)
To: Florian Fainelli
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, tglx, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On Wed, Mar 13, 2024 at 10:12:10PM -0700, Boqun Feng wrote:
> On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
> >
> >
> > On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> > > On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
> > > > On 3/13/24 14:59, Russell King (Oracle) wrote:
> > > > > On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
> > > > > > I will try to provide multiple answers for the sake of everyone having the
> > > > > > same context. Responding to Linus' specifically and his suggestion to use
> > > > > > "initcall_debug", this is what it gave me:
> > > > > >
> > > > > > [ 6.970669] ata1: SATA link down (SStatus 0 SControl 300)
> > > > > > [ 166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
> > > > > > [ 166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
> > > > > > [ 166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
> > > > > > [ 166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
> > > > > > [ 166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
> > > > > >
> > > > > > Also got another occurrence happening resuming from suspend to DRAM with:
> > > > > >
> > > > > > [ 22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
> > > > > > platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
> > > > > > [ 181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
> > > > > > platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
> > > > > >
> > > > > > and also with the PCIe root complex driver:
> > > > > >
> > > > > > [ 18.266279] brcm-pcie f0460000.pcie: PM: calling
> > > > > > brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
> > > > > > [ 177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
> > > > > > [ 177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
> > > > > > [ 177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
> > > > > > returned 0 after 159190939 usecs
> > > > > > [ 177.457257] pcieport 0000:00:00.0: PM: calling
> > > > > > pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
> > > > > >
> > > > > > Surprisingly those drivers are consistently reproducing the failures I am
> > > > > > seeing so at least this gave me a clue as to where the problem is.
> > > > > >
> > > > > > There were no changes to drivers/net/ethernet/broadcom/genet/, the two
> > > > > > changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
> > > > > > the read_poll_timeout() conversion is correct, we properly break out of the
> > > > > > loop. The initial delay looked like a good culprit for a little while, but
> > > > > > it is not used on the affected platforms because instead we provide a
> > > > > > callback and we have an interrupt to signal the completion of a MDIO
> > > > > > operation, therefore unimac_mdio_poll() is not used at all. Finally
> > > > > > drivers/memory/brcmstb_dpfe.c also received a single change which is not
> > > > > > functional here (.remove function change do return void).
> > > > > >
> > > > > > I went back to a manual bisection and this time I believe that I have a more
> > > > > > plausible candidate with:
> > > > > >
> > > > > > 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
> > > > > > hierarchical pull model")
> > > > >
> > > > > I haven't understood the code there yet, and how it would interact with
> > > > > arch code, but one thing that immediately jumps out to me is this:
> > > > >
> > > > > " As long as a CPU is busy it expires both local and global timers. When a
> > > > > CPU goes idle it arms for the first expiring local timer."
> > > > >
> > > > > So are local timers "armed" when they are enqueued while the cpu is
> > > > > "busy" during initialisation, and will they expire, and will that
> > > > > expiry be delivered in a timely manner?
> > > > >
> > > > > If not, this commit is basically broken, and would be the cause of the
> > > > > issue you are seeing. For the mdio case, we're talking about 2ms
> > > > > polling. For the dpfe case, it looks like we're talking about 1ms
> > > > > sleeps. I'm guessing that these end up being local timers.
> > > > >
> > > > > Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
> > > > > for link up every 5ms - if the link was down and we msleep(5) I wonder
> > > > > if that's triggering the same issue.
> > > > >
> > > > > Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
> > > > > say. I would imagine that the same hardware timer driver is being used
> > > > > (may be worth checking DT.) The same should be true for the interrupt
> > > > > driver as well. There's been no changes in that code.
> > > >
> > > > I just had it happen with ARM64 I was plagued by:
> > > >
> > > > https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
> > > >
> > > > and my earlier bisections somehow did not have ARM64 fail, so I thought it
> > > > was immune but it fails with about the same failure rate as ARM 32-bit.
> > >
> > > Can you please boot with:
> > >
> > > trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
> > >
> > > And add the following and give us the resulting output in dmesg?
> >
> > Here are two logs from two different systems that exposed the problem on
> > boot:
> >
>
> I found a pattern here, maybe related.
>
> > https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
>
> [ 163.244130] kworker/-31 3D.... 44007us : timer_start: timer=b089b886 function=delayed_work_timer_fn expires=4294672340 [timeout=5000] bucket_expiry=4294672384 cpu =3 idx=192 flags=D|I
>
> ^^^ this timer was supposed to expired after around 5000 jiffies (i.e. 5
> second, since HZ=1000), but it expired way late (160 seconds later).
>
Hmm.. this is more a noise since it's a deferreable timer...
> [ 163.261034] kworker/-31 3d.... 44012us : timer_start: timer=394b309f function=delayed_work_timer_fn expires=4294787991 [timeout=120651] bucket_expiry=4294791168 c pu=3 idx=277 flags=I
> ...
> [ 221.630578] <idle>-0 1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
> [ 221.643475] <idle>-0 0d.s.. 162361292us : timer_cancel: timer=95703ccd
> [ 221.650896] <idle>-0 0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587
>
but here:
[ 221.555265] kworker/-44 0d.... 3279414us : timer_start: timer=95703ccd function=process_timeout expires=4294670586 [timeout=10] bucket_expiry=4294670587 cpu=0 idx =59 flags=
this is a normal timer.
[ 221.571298] rcu_sche-15 3d.... 3279417us : timer_start: timer=7e541f87 function=process_timeout expires=4294670579 [timeout=3] bucket_expiry=4294670580 cpu=3 idx= 52 flags=
[ 221.587241] <idle>-0 1d.s.. 3283405us : timer_cancel: timer=7e541f87
[ 221.594488] <idle>-0 1..s.. 3283407us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670580 baseclk=4294670580
[ 221.607388] rcu_sche-15 3d.... 3283416us : timer_start: timer=7e541f87 function=process_timeout expires=4294670583 [timeout=3] bucket_expiry=4294670584 cpu=3 idx= 56 flags=
[ 221.623331] <idle>-0 1d.s.. 3287404us : timer_cancel: timer=7e541f87
[ 221.630578] <idle>-0 1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
[ 221.643475] <idle>-0 0d.s.. 162361292us : timer_cancel: timer=95703ccd
[ 221.650896] <idle>-0 0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587
which got fired here.
[ 221.663967] <idle>-0 0dns.. 162361296us : timer_cancel: timer=d03eaa1d
[ 221.671388] <idle>-0 0.ns.. 162361297us : timer_expire_entry: timer=d03eaa1d function=process_timeout now=4294829657 baseclk=4294670856
And looks to me CPU 0 is the tick_do_timer_cpu CPU, since it's the first
one that got timers after a long wait and was doing a few catch-ups. Now
the problem is why CPU 0 didn't program its hardware timer to expire at
4294670587? I.e. the earliest timer.
Regards,
Boqun
[...]
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 6:33 ` Boqun Feng
@ 2024-03-14 9:32 ` Thomas Gleixner
0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2024-03-14 9:32 UTC (permalink / raw)
To: Boqun Feng, Florian Fainelli
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Wed, Mar 13 2024 at 23:33, Boqun Feng wrote:
> but here:
>
> [ 221.555265] kworker/-44 0d.... 3279414us : timer_start: timer=95703ccd function=process_timeout expires=4294670586 [timeout=10] bucket_expiry=4294670587 cpu=0 idx =59 flags=
>
> this is a normal timer.
>
> [ 221.571298] rcu_sche-15 3d.... 3279417us : timer_start: timer=7e541f87 function=process_timeout expires=4294670579 [timeout=3] bucket_expiry=4294670580 cpu=3 idx= 52 flags=
> [ 221.587241] <idle>-0 1d.s.. 3283405us : timer_cancel: timer=7e541f87
> [ 221.594488] <idle>-0 1..s.. 3283407us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670580 baseclk=4294670580
> [ 221.607388] rcu_sche-15 3d.... 3283416us : timer_start: timer=7e541f87 function=process_timeout expires=4294670583 [timeout=3] bucket_expiry=4294670584 cpu=3 idx= 56 flags=
> [ 221.623331] <idle>-0 1d.s.. 3287404us : timer_cancel: timer=7e541f87
> [ 221.630578] <idle>-0 1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
> [ 221.643475] <idle>-0 0d.s.. 162361292us : timer_cancel: timer=95703ccd
> [ 221.650896] <idle>-0 0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587
>
> which got fired here.
>
> [ 221.663967] <idle>-0 0dns.. 162361296us : timer_cancel: timer=d03eaa1d
> [ 221.671388] <idle>-0 0.ns.. 162361297us : timer_expire_entry: timer=d03eaa1d function=process_timeout now=4294829657 baseclk=4294670856
>
> And looks to me CPU 0 is the tick_do_timer_cpu CPU, since it's the first
> one that got timers after a long wait and was doing a few catch-ups. Now
> the problem is why CPU 0 didn't program its hardware timer to expire at
> 4294670587? I.e. the earliest timer.
That's not the problem. It's not clear whether CPU0 was the last one
going idle. It doesn't look so:
> [ 221.623331] <idle>-0 1d.s.. 3287404us : timer_cancel: timer=7e541f87
> [ 221.630578] <idle>-0 1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
The timer in question is supposed to wake up 2 ticks later, so CPU1
should have armed it. But without the migration trace points enabled
this is all guess work.
Thanks,
tglx
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 3:44 ` Florian Fainelli
2024-03-14 5:12 ` Boqun Feng
@ 2024-03-14 9:11 ` Thomas Gleixner
2024-03-14 10:41 ` Frederic Weisbecker
2 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2024-03-14 9:11 UTC (permalink / raw)
To: Florian Fainelli, Frederic Weisbecker
Cc: Russell King (Oracle), Joel Fernandes, Boqun Feng,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Wed, Mar 13 2024 at 20:44, Florian Fainelli wrote:
> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
>> Can you please boot with:
>>
>> trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
>>
>> And add the following and give us the resulting output in dmesg?
>
> Here are two logs from two different systems that exposed the problem on
> boot:
>
> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
[ 0.003042] Failed to enable trace event: timer_migration
The event option should be:
trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
Thanks,
tglx
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 3:44 ` Florian Fainelli
2024-03-14 5:12 ` Boqun Feng
2024-03-14 9:11 ` Thomas Gleixner
@ 2024-03-14 10:41 ` Frederic Weisbecker
2024-03-14 18:35 ` Florian Fainelli
2 siblings, 1 reply; 47+ messages in thread
From: Frederic Weisbecker @ 2024-03-14 10:41 UTC (permalink / raw)
To: Florian Fainelli
Cc: Russell King (Oracle), Joel Fernandes, Boqun Feng,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, tglx, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
>
>
> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> > On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
> > > On 3/13/24 14:59, Russell King (Oracle) wrote:
> > > > On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
> > > > > I will try to provide multiple answers for the sake of everyone having the
> > > > > same context. Responding to Linus' specifically and his suggestion to use
> > > > > "initcall_debug", this is what it gave me:
> > > > >
> > > > > [ 6.970669] ata1: SATA link down (SStatus 0 SControl 300)
> > > > > [ 166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
> > > > > [ 166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
> > > > > [ 166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
> > > > > [ 166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
> > > > > [ 166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
> > > > >
> > > > > Also got another occurrence happening resuming from suspend to DRAM with:
> > > > >
> > > > > [ 22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
> > > > > platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
> > > > > [ 181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
> > > > > platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
> > > > >
> > > > > and also with the PCIe root complex driver:
> > > > >
> > > > > [ 18.266279] brcm-pcie f0460000.pcie: PM: calling
> > > > > brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
> > > > > [ 177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
> > > > > [ 177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
> > > > > [ 177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
> > > > > returned 0 after 159190939 usecs
> > > > > [ 177.457257] pcieport 0000:00:00.0: PM: calling
> > > > > pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
> > > > >
> > > > > Surprisingly those drivers are consistently reproducing the failures I am
> > > > > seeing so at least this gave me a clue as to where the problem is.
> > > > >
> > > > > There were no changes to drivers/net/ethernet/broadcom/genet/, the two
> > > > > changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
> > > > > the read_poll_timeout() conversion is correct, we properly break out of the
> > > > > loop. The initial delay looked like a good culprit for a little while, but
> > > > > it is not used on the affected platforms because instead we provide a
> > > > > callback and we have an interrupt to signal the completion of a MDIO
> > > > > operation, therefore unimac_mdio_poll() is not used at all. Finally
> > > > > drivers/memory/brcmstb_dpfe.c also received a single change which is not
> > > > > functional here (.remove function change do return void).
> > > > >
> > > > > I went back to a manual bisection and this time I believe that I have a more
> > > > > plausible candidate with:
> > > > >
> > > > > 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
> > > > > hierarchical pull model")
> > > >
> > > > I haven't understood the code there yet, and how it would interact with
> > > > arch code, but one thing that immediately jumps out to me is this:
> > > >
> > > > " As long as a CPU is busy it expires both local and global timers. When a
> > > > CPU goes idle it arms for the first expiring local timer."
> > > >
> > > > So are local timers "armed" when they are enqueued while the cpu is
> > > > "busy" during initialisation, and will they expire, and will that
> > > > expiry be delivered in a timely manner?
> > > >
> > > > If not, this commit is basically broken, and would be the cause of the
> > > > issue you are seeing. For the mdio case, we're talking about 2ms
> > > > polling. For the dpfe case, it looks like we're talking about 1ms
> > > > sleeps. I'm guessing that these end up being local timers.
> > > >
> > > > Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
> > > > for link up every 5ms - if the link was down and we msleep(5) I wonder
> > > > if that's triggering the same issue.
> > > >
> > > > Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
> > > > say. I would imagine that the same hardware timer driver is being used
> > > > (may be worth checking DT.) The same should be true for the interrupt
> > > > driver as well. There's been no changes in that code.
> > >
> > > I just had it happen with ARM64 I was plagued by:
> > >
> > > https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
> > >
> > > and my earlier bisections somehow did not have ARM64 fail, so I thought it
> > > was immune but it fails with about the same failure rate as ARM 32-bit.
> >
> > Can you please boot with:
> >
> > trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
> >
> > And add the following and give us the resulting output in dmesg?
>
> Here are two logs from two different systems that exposed the problem on
> boot:
>
> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
events. My fault, can you retry with this syntax?
trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
Though it's fairly possible that timer migration is not enabled at this point
as it's a late initcall. But we better not miss its traces otherwise.
> thanks!
> --
> Florian
>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 10:41 ` Frederic Weisbecker
@ 2024-03-14 18:35 ` Florian Fainelli
2024-03-14 18:51 ` Boqun Feng
0 siblings, 1 reply; 47+ messages in thread
From: Florian Fainelli @ 2024-03-14 18:35 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Russell King (Oracle), Joel Fernandes, Boqun Feng,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, tglx, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On 3/14/24 03:41, Frederic Weisbecker wrote:
> On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
>>
>>
>> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
>>> On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
>>>> On 3/13/24 14:59, Russell King (Oracle) wrote:
>>>>> On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
>>>>>> I will try to provide multiple answers for the sake of everyone having the
>>>>>> same context. Responding to Linus' specifically and his suggestion to use
>>>>>> "initcall_debug", this is what it gave me:
>>>>>>
>>>>>> [ 6.970669] ata1: SATA link down (SStatus 0 SControl 300)
>>>>>> [ 166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
>>>>>> [ 166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
>>>>>> [ 166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
>>>>>> [ 166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
>>>>>> [ 166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
>>>>>>
>>>>>> Also got another occurrence happening resuming from suspend to DRAM with:
>>>>>>
>>>>>> [ 22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
>>>>>> platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
>>>>>> [ 181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
>>>>>> platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
>>>>>>
>>>>>> and also with the PCIe root complex driver:
>>>>>>
>>>>>> [ 18.266279] brcm-pcie f0460000.pcie: PM: calling
>>>>>> brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
>>>>>> [ 177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
>>>>>> [ 177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
>>>>>> [ 177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
>>>>>> returned 0 after 159190939 usecs
>>>>>> [ 177.457257] pcieport 0000:00:00.0: PM: calling
>>>>>> pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
>>>>>>
>>>>>> Surprisingly those drivers are consistently reproducing the failures I am
>>>>>> seeing so at least this gave me a clue as to where the problem is.
>>>>>>
>>>>>> There were no changes to drivers/net/ethernet/broadcom/genet/, the two
>>>>>> changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
>>>>>> the read_poll_timeout() conversion is correct, we properly break out of the
>>>>>> loop. The initial delay looked like a good culprit for a little while, but
>>>>>> it is not used on the affected platforms because instead we provide a
>>>>>> callback and we have an interrupt to signal the completion of a MDIO
>>>>>> operation, therefore unimac_mdio_poll() is not used at all. Finally
>>>>>> drivers/memory/brcmstb_dpfe.c also received a single change which is not
>>>>>> functional here (.remove function change do return void).
>>>>>>
>>>>>> I went back to a manual bisection and this time I believe that I have a more
>>>>>> plausible candidate with:
>>>>>>
>>>>>> 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
>>>>>> hierarchical pull model")
>>>>>
>>>>> I haven't understood the code there yet, and how it would interact with
>>>>> arch code, but one thing that immediately jumps out to me is this:
>>>>>
>>>>> " As long as a CPU is busy it expires both local and global timers. When a
>>>>> CPU goes idle it arms for the first expiring local timer."
>>>>>
>>>>> So are local timers "armed" when they are enqueued while the cpu is
>>>>> "busy" during initialisation, and will they expire, and will that
>>>>> expiry be delivered in a timely manner?
>>>>>
>>>>> If not, this commit is basically broken, and would be the cause of the
>>>>> issue you are seeing. For the mdio case, we're talking about 2ms
>>>>> polling. For the dpfe case, it looks like we're talking about 1ms
>>>>> sleeps. I'm guessing that these end up being local timers.
>>>>>
>>>>> Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
>>>>> for link up every 5ms - if the link was down and we msleep(5) I wonder
>>>>> if that's triggering the same issue.
>>>>>
>>>>> Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
>>>>> say. I would imagine that the same hardware timer driver is being used
>>>>> (may be worth checking DT.) The same should be true for the interrupt
>>>>> driver as well. There's been no changes in that code.
>>>>
>>>> I just had it happen with ARM64 I was plagued by:
>>>>
>>>> https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
>>>>
>>>> and my earlier bisections somehow did not have ARM64 fail, so I thought it
>>>> was immune but it fails with about the same failure rate as ARM 32-bit.
>>>
>>> Can you please boot with:
>>>
>>> trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
>>>
>>> And add the following and give us the resulting output in dmesg?
>>
>> Here are two logs from two different systems that exposed the problem on
>> boot:
>>
>> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
>> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
>
> Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
> events. My fault, can you retry with this syntax?
>
> trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
>
> Though it's fairly possible that timer migration is not enabled at this point
> as it's a late initcall. But we better not miss its traces otherwise.
Here is another log with timer_migration:
https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
--
Florian
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 18:35 ` Florian Fainelli
@ 2024-03-14 18:51 ` Boqun Feng
2024-03-14 19:09 ` Florian Fainelli
0 siblings, 1 reply; 47+ messages in thread
From: Boqun Feng @ 2024-03-14 18:51 UTC (permalink / raw)
To: Florian Fainelli
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, tglx, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On Thu, Mar 14, 2024 at 11:35:23AM -0700, Florian Fainelli wrote:
> On 3/14/24 03:41, Frederic Weisbecker wrote:
> > On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
> > >
> > >
> > > On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> > > > On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
> > > > > On 3/13/24 14:59, Russell King (Oracle) wrote:
> > > > > > On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
> > > > > > > I will try to provide multiple answers for the sake of everyone having the
> > > > > > > same context. Responding to Linus' specifically and his suggestion to use
> > > > > > > "initcall_debug", this is what it gave me:
> > > > > > >
> > > > > > > [ 6.970669] ata1: SATA link down (SStatus 0 SControl 300)
> > > > > > > [ 166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
> > > > > > > [ 166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
> > > > > > > [ 166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
> > > > > > > [ 166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
> > > > > > > [ 166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
> > > > > > >
> > > > > > > Also got another occurrence happening resuming from suspend to DRAM with:
> > > > > > >
> > > > > > > [ 22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
> > > > > > > platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
> > > > > > > [ 181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
> > > > > > > platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
> > > > > > >
> > > > > > > and also with the PCIe root complex driver:
> > > > > > >
> > > > > > > [ 18.266279] brcm-pcie f0460000.pcie: PM: calling
> > > > > > > brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
> > > > > > > [ 177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
> > > > > > > [ 177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
> > > > > > > [ 177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
> > > > > > > returned 0 after 159190939 usecs
> > > > > > > [ 177.457257] pcieport 0000:00:00.0: PM: calling
> > > > > > > pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
> > > > > > >
> > > > > > > Surprisingly those drivers are consistently reproducing the failures I am
> > > > > > > seeing so at least this gave me a clue as to where the problem is.
> > > > > > >
> > > > > > > There were no changes to drivers/net/ethernet/broadcom/genet/, the two
> > > > > > > changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
> > > > > > > the read_poll_timeout() conversion is correct, we properly break out of the
> > > > > > > loop. The initial delay looked like a good culprit for a little while, but
> > > > > > > it is not used on the affected platforms because instead we provide a
> > > > > > > callback and we have an interrupt to signal the completion of a MDIO
> > > > > > > operation, therefore unimac_mdio_poll() is not used at all. Finally
> > > > > > > drivers/memory/brcmstb_dpfe.c also received a single change which is not
> > > > > > > functional here (.remove function change do return void).
> > > > > > >
> > > > > > > I went back to a manual bisection and this time I believe that I have a more
> > > > > > > plausible candidate with:
> > > > > > >
> > > > > > > 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
> > > > > > > hierarchical pull model")
> > > > > >
> > > > > > I haven't understood the code there yet, and how it would interact with
> > > > > > arch code, but one thing that immediately jumps out to me is this:
> > > > > >
> > > > > > " As long as a CPU is busy it expires both local and global timers. When a
> > > > > > CPU goes idle it arms for the first expiring local timer."
> > > > > >
> > > > > > So are local timers "armed" when they are enqueued while the cpu is
> > > > > > "busy" during initialisation, and will they expire, and will that
> > > > > > expiry be delivered in a timely manner?
> > > > > >
> > > > > > If not, this commit is basically broken, and would be the cause of the
> > > > > > issue you are seeing. For the mdio case, we're talking about 2ms
> > > > > > polling. For the dpfe case, it looks like we're talking about 1ms
> > > > > > sleeps. I'm guessing that these end up being local timers.
> > > > > >
> > > > > > Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
> > > > > > for link up every 5ms - if the link was down and we msleep(5) I wonder
> > > > > > if that's triggering the same issue.
> > > > > >
> > > > > > Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
> > > > > > say. I would imagine that the same hardware timer driver is being used
> > > > > > (may be worth checking DT.) The same should be true for the interrupt
> > > > > > driver as well. There's been no changes in that code.
> > > > >
> > > > > I just had it happen with ARM64 I was plagued by:
> > > > >
> > > > > https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
> > > > >
> > > > > and my earlier bisections somehow did not have ARM64 fail, so I thought it
> > > > > was immune but it fails with about the same failure rate as ARM 32-bit.
> > > >
> > > > Can you please boot with:
> > > >
> > > > trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
> > > >
> > > > And add the following and give us the resulting output in dmesg?
> > >
> > > Here are two logs from two different systems that exposed the problem on
> > > boot:
> > >
> > > https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
> > > https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
> >
> > Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
> > events. My fault, can you retry with this syntax?
> >
> > trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
> >
> > Though it's fairly possible that timer migration is not enabled at this point
> > as it's a late initcall. But we better not miss its traces otherwise.
>
> Here is another log with timer_migration:
>
> https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
FWIW, the trace point is still not enabled:
[ 0.000000] Failed to enable trace event: timer_migration:*
you need this commit in master:
36e40df35d2c "timer_migration: Add tracepoints"
, which is one commit later than 7ee988770326 AFAICT
Regards,
Boqun
> --
> Florian
>
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 18:51 ` Boqun Feng
@ 2024-03-14 19:09 ` Florian Fainelli
2024-03-14 20:45 ` Thomas Gleixner
0 siblings, 1 reply; 47+ messages in thread
From: Florian Fainelli @ 2024-03-14 19:09 UTC (permalink / raw)
To: Boqun Feng
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, tglx, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On 3/14/24 11:51, Boqun Feng wrote:
> On Thu, Mar 14, 2024 at 11:35:23AM -0700, Florian Fainelli wrote:
>> On 3/14/24 03:41, Frederic Weisbecker wrote:
>>> On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
>>>>
>>>>
>>>> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
>>>>> On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
>>>>>> On 3/13/24 14:59, Russell King (Oracle) wrote:
>>>>>>> On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
>>>>>>>> I will try to provide multiple answers for the sake of everyone having the
>>>>>>>> same context. Responding to Linus' specifically and his suggestion to use
>>>>>>>> "initcall_debug", this is what it gave me:
>>>>>>>>
>>>>>>>> [ 6.970669] ata1: SATA link down (SStatus 0 SControl 300)
>>>>>>>> [ 166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
>>>>>>>> [ 166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
>>>>>>>> [ 166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
>>>>>>>> [ 166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
>>>>>>>> [ 166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
>>>>>>>>
>>>>>>>> Also got another occurrence happening resuming from suspend to DRAM with:
>>>>>>>>
>>>>>>>> [ 22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
>>>>>>>> platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
>>>>>>>> [ 181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
>>>>>>>> platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
>>>>>>>>
>>>>>>>> and also with the PCIe root complex driver:
>>>>>>>>
>>>>>>>> [ 18.266279] brcm-pcie f0460000.pcie: PM: calling
>>>>>>>> brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
>>>>>>>> [ 177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
>>>>>>>> [ 177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
>>>>>>>> [ 177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
>>>>>>>> returned 0 after 159190939 usecs
>>>>>>>> [ 177.457257] pcieport 0000:00:00.0: PM: calling
>>>>>>>> pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
>>>>>>>>
>>>>>>>> Surprisingly those drivers are consistently reproducing the failures I am
>>>>>>>> seeing so at least this gave me a clue as to where the problem is.
>>>>>>>>
>>>>>>>> There were no changes to drivers/net/ethernet/broadcom/genet/, the two
>>>>>>>> changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
>>>>>>>> the read_poll_timeout() conversion is correct, we properly break out of the
>>>>>>>> loop. The initial delay looked like a good culprit for a little while, but
>>>>>>>> it is not used on the affected platforms because instead we provide a
>>>>>>>> callback and we have an interrupt to signal the completion of a MDIO
>>>>>>>> operation, therefore unimac_mdio_poll() is not used at all. Finally
>>>>>>>> drivers/memory/brcmstb_dpfe.c also received a single change which is not
>>>>>>>> functional here (.remove function change do return void).
>>>>>>>>
>>>>>>>> I went back to a manual bisection and this time I believe that I have a more
>>>>>>>> plausible candidate with:
>>>>>>>>
>>>>>>>> 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
>>>>>>>> hierarchical pull model")
>>>>>>>
>>>>>>> I haven't understood the code there yet, and how it would interact with
>>>>>>> arch code, but one thing that immediately jumps out to me is this:
>>>>>>>
>>>>>>> " As long as a CPU is busy it expires both local and global timers. When a
>>>>>>> CPU goes idle it arms for the first expiring local timer."
>>>>>>>
>>>>>>> So are local timers "armed" when they are enqueued while the cpu is
>>>>>>> "busy" during initialisation, and will they expire, and will that
>>>>>>> expiry be delivered in a timely manner?
>>>>>>>
>>>>>>> If not, this commit is basically broken, and would be the cause of the
>>>>>>> issue you are seeing. For the mdio case, we're talking about 2ms
>>>>>>> polling. For the dpfe case, it looks like we're talking about 1ms
>>>>>>> sleeps. I'm guessing that these end up being local timers.
>>>>>>>
>>>>>>> Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
>>>>>>> for link up every 5ms - if the link was down and we msleep(5) I wonder
>>>>>>> if that's triggering the same issue.
>>>>>>>
>>>>>>> Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
>>>>>>> say. I would imagine that the same hardware timer driver is being used
>>>>>>> (may be worth checking DT.) The same should be true for the interrupt
>>>>>>> driver as well. There's been no changes in that code.
>>>>>>
>>>>>> I just had it happen with ARM64 I was plagued by:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
>>>>>>
>>>>>> and my earlier bisections somehow did not have ARM64 fail, so I thought it
>>>>>> was immune but it fails with about the same failure rate as ARM 32-bit.
>>>>>
>>>>> Can you please boot with:
>>>>>
>>>>> trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
>>>>>
>>>>> And add the following and give us the resulting output in dmesg?
>>>>
>>>> Here are two logs from two different systems that exposed the problem on
>>>> boot:
>>>>
>>>> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
>>>> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
>>>
>>> Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
>>> events. My fault, can you retry with this syntax?
>>>
>>> trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
>>>
>>> Though it's fairly possible that timer migration is not enabled at this point
>>> as it's a late initcall. But we better not miss its traces otherwise.
>>
>> Here is another log with timer_migration:
>>
>> https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
>
> FWIW, the trace point is still not enabled:
>
> [ 0.000000] Failed to enable trace event: timer_migration:*
>
> you need this commit in master:
>
> 36e40df35d2c "timer_migration: Add tracepoints"
>
> , which is one commit later than 7ee988770326 AFAICT
Argh, thanks Boqun, here is a new capture:
https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
this one is does include the tmigr events. Thanks!
--
Florian
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 19:09 ` Florian Fainelli
@ 2024-03-14 20:45 ` Thomas Gleixner
2024-03-14 21:21 ` Thomas Gleixner
0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2024-03-14 20:45 UTC (permalink / raw)
To: Florian Fainelli, Boqun Feng
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
> On 3/14/24 11:51, Boqun Feng wrote:
>>>> trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
>>>>
>>>> Though it's fairly possible that timer migration is not enabled at this point
>>>> as it's a late initcall. But we better not miss its traces otherwise.
>>>
>>> Here is another log with timer_migration:
>>>
>>> https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
>>
>> FWIW, the trace point is still not enabled:
>>
>> [ 0.000000] Failed to enable trace event: timer_migration:*
>>
>> you need this commit in master:
>>
>> 36e40df35d2c "timer_migration: Add tracepoints"
>>
>> , which is one commit later than 7ee988770326 AFAICT
>
> Argh, thanks Boqun, here is a new capture:
>
> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>
> this one is does include the tmigr events. Thanks!
You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 20:45 ` Thomas Gleixner
@ 2024-03-14 21:21 ` Thomas Gleixner
2024-03-14 21:53 ` Florian Fainelli
` (2 more replies)
0 siblings, 3 replies; 47+ messages in thread
From: Thomas Gleixner @ 2024-03-14 21:21 UTC (permalink / raw)
To: Florian Fainelli, Boqun Feng
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
>> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>>
>> this one is does include the tmigr events. Thanks!
>
> You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
So from the above trace it's clear where it goes south:
[ 236.318158] <idle>-0 3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
[ 236.326526] <idle>-0 3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
[ 236.357809] <idle>-0 3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
[ 236.377222] <idle>-0 0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
[ 236.387765] <idle>-0 0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
[ 236.401526] <idle>-0 0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
[ 236.420940] <idle>-0 0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
[ 236.434874] <idle>-0 0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
[ 236.446896] <idle>-0 3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
[ 236.460916] <idle>-0 3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
[ 236.473721] <idle>-0 3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
CPU3 is the last active CPU and goes idle. So it should take care of the
pending events, but it does not.
This is the next trace entry where CPU0 magically comes back to life.
[ 236.487393] <idle>-0 0d.s.. 162001359us : timer_cancel: timer=8c725d84
8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
explain the fail. I haven't yet spotted where this goes into lala land.
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 21:21 ` Thomas Gleixner
@ 2024-03-14 21:53 ` Florian Fainelli
2024-03-14 22:51 ` Thomas Gleixner
2024-03-14 21:58 ` Thomas Gleixner
2024-03-14 22:05 ` Boqun Feng
2 siblings, 1 reply; 47+ messages in thread
From: Florian Fainelli @ 2024-03-14 21:53 UTC (permalink / raw)
To: Thomas Gleixner, Boqun Feng
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On 3/14/24 14:21, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
>> On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
>>> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>>>
>>> this one is does include the tmigr events. Thanks!
>>
>> You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
>
> So from the above trace it's clear where it goes south:
>
> [ 236.318158] <idle>-0 3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> [ 236.326526] <idle>-0 3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [ 236.357809] <idle>-0 3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
>
> [ 236.377222] <idle>-0 0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [ 236.387765] <idle>-0 0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
>
> [ 236.401526] <idle>-0 0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> [ 236.420940] <idle>-0 0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> [ 236.434874] <idle>-0 0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
>
> [ 236.446896] <idle>-0 3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [ 236.460916] <idle>-0 3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [ 236.473721] <idle>-0 3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
>
> CPU3 is the last active CPU and goes idle. So it should take care of the
> pending events, but it does not.
>
> This is the next trace entry where CPU0 magically comes back to life.
>
> [ 236.487393] <idle>-0 0d.s.. 162001359us : timer_cancel: timer=8c725d84
>
> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
> explain the fail. I haven't yet spotted where this goes into lala land.
It was a lot harder to generate the same issue on cold boot against
8ca1836769d758e4fbf5851bb81e181c52193f5d, however, just like against
36e40df35d2c1891fe58241640c7c95de4aa739b, it would happen resuming from
suspend to DRAM whereby the CPU core(s) lost their power and had to be
re-initialized. Eventually I got a cold boot log:
https://gist.github.com/ffainelli/b5684585c78518a5492cbbf1c7dce16e
Does the consistent ~159s mean anything?
--
Florian
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 21:53 ` Florian Fainelli
@ 2024-03-14 22:51 ` Thomas Gleixner
0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2024-03-14 22:51 UTC (permalink / raw)
To: Florian Fainelli, Boqun Feng
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Thu, Mar 14 2024 at 14:53, Florian Fainelli wrote:
> On 3/14/24 14:21, Thomas Gleixner wrote:
>> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
>> explain the fail. I haven't yet spotted where this goes into lala land.
>
> It was a lot harder to generate the same issue on cold boot against
> 8ca1836769d758e4fbf5851bb81e181c52193f5d,
That's good as it points into the exactly right direction as far as I
can tell from the data we have, but I might be wrong at the end.
> however, just like against 36e40df35d2c1891fe58241640c7c95de4aa739b,
> it would happen resuming from suspend to DRAM whereby the CPU core(s)
> lost their power and had to be re-initialized. Eventually I got a cold
> boot log:
>
> https://gist.github.com/ffainelli/b5684585c78518a5492cbbf1c7dce16e
The picture is similar to the one before:
<idle>-0 2d.... 3016627us : tmigr_update_events: child=00000000 group=6b74d49d group_lvl=0 child_active=0 group_active=1 nextevt=3023000000 next_expiry=3023000000 child_evt_expiry=0 child_evtcpu=0
<idle>-0 2d.... 3016628us : tmigr_group_set_cpu_inactive: group=6b74d49d lvl=0 numa=0 active=1 migrator=1 parent=00000000 childmask=4
<idle>-0 2d.... 3016629us : tmigr_cpu_idle: cpu=2 parent=6b74d49d nextevt=3023000000 wakeup=9223372036854775807
<idle>-0 0d.... 3016684us : tmigr_group_set_cpu_inactive: group=6b74d49d lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=1
<idle>-0 0d.... 3016685us : tmigr_cpu_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
<idle>-0 0d.... 3024623us : tmigr_cpu_new_timer_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
<idle>-0 1d.s.. 162080532us : timer_cancel: timer=2e281df7
Just a different CPU this time.
The first expiring timer:
kcompact-42 1d.... 2552891us : timer_start: timer=2e281df7 function=process_timeout expires=4294670348 [timeout=500] bucket_expiry=4294670352 cpu=1 idx=66 flags=
Last expiry before going south:
<idle>-0 1..s.. 3006620us : timer_expire_entry: timer=6f47b280 function=process_timeout now=4294670302 baseclk=4294670302
4294670352 - 4294670302 = 50
3006620us + 50*1000us = 3056620us
So the previous observation of hitting the exact point of the last CPU
going idle does not matter.
What really bothers me is:
<idle>-0 2d.... 3016629us : tmigr_cpu_idle: cpu=2 parent=6b74d49d nextevt=3023000000 wakeup=9223372036854775807
which has an event between these events:
<idle>-0 0d.... 3016685us : tmigr_cpu_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
<idle>-0 0d.... 3024623us : tmigr_cpu_new_timer_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
But that event is before the next expiring timer. Confused, but Frederic
just told me on IRC he's on to something.
> Does the consistent ~159s mean anything?
I don't think so. It might be the limitation of the clockevent device,
the maximum sleep time restriction or some other unrelated event (device
interrupt, hrtimer) which happens to pop up after this time for some
reason.
But it's definitely not relevant to the problem itself. It's just the
thing which brings the machine back to life. Otherwise it might sit
there forever.
Thanks,
tglx
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 21:21 ` Thomas Gleixner
2024-03-14 21:53 ` Florian Fainelli
@ 2024-03-14 21:58 ` Thomas Gleixner
2024-03-14 22:05 ` Boqun Feng
2 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2024-03-14 21:58 UTC (permalink / raw)
To: Florian Fainelli, Boqun Feng
Cc: Frederic Weisbecker, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Thu, Mar 14 2024 at 22:21, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
>> On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
>>> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>>>
>>> this one is does include the tmigr events. Thanks!
>>
>> You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
>
> So from the above trace it's clear where it goes south:
>
> [ 236.318158] <idle>-0 3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> [ 236.326526] <idle>-0 3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [ 236.357809] <idle>-0 3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
>
> [ 236.377222] <idle>-0 0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [ 236.387765] <idle>-0 0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
>
> [ 236.401526] <idle>-0 0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> [ 236.420940] <idle>-0 0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> [ 236.434874] <idle>-0 0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
>
> [ 236.446896] <idle>-0 3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [ 236.460916] <idle>-0 3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [ 236.473721] <idle>-0 3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
>
> CPU3 is the last active CPU and goes idle. So it should take care of the
> pending events, but it does not.
>
> This is the next trace entry where CPU0 magically comes back to life.
>
> [ 236.487393] <idle>-0 0d.s.. 162001359us : timer_cancel: timer=8c725d84
>
> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
> explain the fail. I haven't yet spotted where this goes into lala land.
The first expiring pending global timer is this one:
[ 236.112089] kworker/-1198 1d.... 2923481us : timer_start: timer=ffd09a61 function=process_timeout expires=4294670230 [timeout=10] bucket_expiry=4294670231 cpu=1 idx=23 flags=
Bucket expiry is the relevant number: 4294670231 (jiffies)
The last timer expiry happened about 1ms before the fail:
[ 236.344917] <idle>-0 3..s.. 2928468us : timer_expire_entry: timer=2f09af03 function=process_timeout now=4294670225 baseclk=4294670225
Jiffies at that point: 4294670225
AFAICT this is a HZ=1000 configuration.
4294670231 - 4294670225 = 6
So the relevant wakeup should be around 2934468us
which is damned close to the point where the last active CPU vanishes into idle:
> [ 236.446896] <idle>-0 3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [ 236.460916] <idle>-0 3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [ 236.473721] <idle>-0 3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
Unless someone beats me to it, I'm going to continue staring at that
code tomorrow as my brain has started to melt by now.
Thanks,
tglx
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 21:21 ` Thomas Gleixner
2024-03-14 21:53 ` Florian Fainelli
2024-03-14 21:58 ` Thomas Gleixner
@ 2024-03-14 22:05 ` Boqun Feng
2024-03-14 22:10 ` Boqun Feng
2024-03-15 1:14 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Frederic Weisbecker
2 siblings, 2 replies; 47+ messages in thread
From: Boqun Feng @ 2024-03-14 22:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Fainelli, Frederic Weisbecker, Russell King (Oracle),
Joel Fernandes, Anna-Maria Behnsen, Linus Torvalds, linux-kernel,
kernel-team, paulmck, mingo, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On Thu, Mar 14, 2024 at 10:21:21PM +0100, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
> > On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
> >> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
> >>
> >> this one is does include the tmigr events. Thanks!
> >
> > You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
>
> So from the above trace it's clear where it goes south:
>
> [ 236.318158] <idle>-0 3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> [ 236.326526] <idle>-0 3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [ 236.357809] <idle>-0 3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
>
> [ 236.377222] <idle>-0 0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [ 236.387765] <idle>-0 0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
>
> [ 236.401526] <idle>-0 0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> [ 236.420940] <idle>-0 0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> [ 236.434874] <idle>-0 0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
>
> [ 236.446896] <idle>-0 3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [ 236.460916] <idle>-0 3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [ 236.473721] <idle>-0 3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
>
> CPU3 is the last active CPU and goes idle. So it should take care of the
> pending events, but it does not.
>
I notice CPU3 didn't have its own non-deferrable timer queued (local or
global), so could the following happen?
timer_base_try_to_set_idle():
__get_next_timer_interrupt():
fetch_next_timer_interrupt():
// nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
// tevt->local == tevt->gloabl = KTIME_MAX
timer_use_tmigr():
tmigr_cpu_deactivate():
__tmigr_cpu_deactivate():
// tmc->cpuevt.ignore untouched still == true
walk_groups(&tmigr_inactive_up, ...):
tmigr_inactive_up():
data->remote = true;
tmigr_update_events():
if (child) { // child is NULL
...
} else {
first_childevt = evt = data->evt;
if (evt->ignore && !remote)
return true; // no remote tick is picked.
...
}
Regards,
Boqun
> This is the next trace entry where CPU0 magically comes back to life.
>
> [ 236.487393] <idle>-0 0d.s.. 162001359us : timer_cancel: timer=8c725d84
>
> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
> explain the fail. I haven't yet spotted where this goes into lala land.
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
2024-03-14 22:05 ` Boqun Feng
@ 2024-03-14 22:10 ` Boqun Feng
2024-03-15 1:14 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Frederic Weisbecker
1 sibling, 0 replies; 47+ messages in thread
From: Boqun Feng @ 2024-03-14 22:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Fainelli, Frederic Weisbecker, Russell King (Oracle),
Joel Fernandes, Anna-Maria Behnsen, Linus Torvalds, linux-kernel,
kernel-team, paulmck, mingo, rcu, neeraj.upadhyay, urezki,
qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1, rostedt,
Justin Chen
On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
> On Thu, Mar 14, 2024 at 10:21:21PM +0100, Thomas Gleixner wrote:
> > On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
> > > On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
> > >> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
> > >>
> > >> this one is does include the tmigr events. Thanks!
> > >
> > > You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
> >
> > So from the above trace it's clear where it goes south:
> >
> > [ 236.318158] <idle>-0 3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> > [ 236.326526] <idle>-0 3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> > [ 236.357809] <idle>-0 3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> >
> > [ 236.377222] <idle>-0 0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> > [ 236.387765] <idle>-0 0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
> >
> > [ 236.401526] <idle>-0 0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> > [ 236.420940] <idle>-0 0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> > [ 236.434874] <idle>-0 0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
> >
> > [ 236.446896] <idle>-0 3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> > [ 236.460916] <idle>-0 3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> > [ 236.473721] <idle>-0 3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> >
> > CPU3 is the last active CPU and goes idle. So it should take care of the
> > pending events, but it does not.
> >
>
> I notice CPU3 didn't have its own non-deferrable timer queued (local or
> global), so could the following happen?
>
> timer_base_try_to_set_idle():
> __get_next_timer_interrupt():
> fetch_next_timer_interrupt():
> // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
> // tevt->local == tevt->gloabl = KTIME_MAX
> timer_use_tmigr():
> tmigr_cpu_deactivate():
> __tmigr_cpu_deactivate():
> // tmc->cpuevt.ignore untouched still == true
> walk_groups(&tmigr_inactive_up, ...):
> tmigr_inactive_up():
> data->remote = true;
this line is
data->remote = false;
in the real code.
> tmigr_update_events():
> if (child) { // child is NULL
> ...
> } else {
> first_childevt = evt = data->evt;
>
> if (evt->ignore && !remote)
> return true; // no remote tick is picked.
> ...
> }
>
> Regards,
> Boqun
>
> > This is the next trace entry where CPU0 magically comes back to life.
> >
> > [ 236.487393] <idle>-0 0d.s.. 162001359us : timer_cancel: timer=8c725d84
> >
> > 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
> > explain the fail. I haven't yet spotted where this goes into lala land.
^ permalink raw reply [flat|nested] 47+ messages in thread* [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
2024-03-14 22:05 ` Boqun Feng
2024-03-14 22:10 ` Boqun Feng
@ 2024-03-15 1:14 ` Frederic Weisbecker
2024-03-15 1:20 ` Frederic Weisbecker
` (3 more replies)
1 sibling, 4 replies; 47+ messages in thread
From: Frederic Weisbecker @ 2024-03-15 1:14 UTC (permalink / raw)
To: Boqun Feng, Florian Fainelli
Cc: Thomas Gleixner, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
> I notice CPU3 didn't have its own non-deferrable timer queued (local or
> global), so could the following happen?
>
> timer_base_try_to_set_idle():
> __get_next_timer_interrupt():
> fetch_next_timer_interrupt():
> // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
> // tevt->local == tevt->gloabl = KTIME_MAX
> timer_use_tmigr():
> tmigr_cpu_deactivate():
> __tmigr_cpu_deactivate():
> // tmc->cpuevt.ignore untouched still == true
> walk_groups(&tmigr_inactive_up, ...):
> tmigr_inactive_up():
> data->remote = true;
> tmigr_update_events():
> if (child) { // child is NULL
> ...
> } else {
> first_childevt = evt = data->evt;
>
> if (evt->ignore && !remote)
> return true; // no remote tick is picked.
> ...
> }
Nice catch! Florian can you try the following?
From b0e335371ed758f68bf4f501246298c98a615b04 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <frederic@kernel.org>
Date: Fri, 15 Mar 2024 00:21:01 +0100
Subject: [PATCH] timer/migration: Remove buggy early return on deactivation
When a CPU enters into idle and deactivates itself from the timer
migration hierarchy without any global timer of its own to propagate,
the group event of that CPU is set to "ignore" and tmigr_update_events()
accordingly performs an early return without considering timers queued
by other CPUs.
If the hierarchy has a single level, and the CPU is the last one to
enter idle, it will ignore others' global timers, as in the following
layout:
[GRP0:0]
migrator = 0
active = 0
nextevt = T0i
/ \
0 1
active (T0i) idle (T1)
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
[GRP0:0]
migrator = NONE
active = NONE
nextevt = T0i
/ \
0 1
idle (T0i) idle (T1)
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
unhandled.
This isn't proper to single level hierarchy though. A similar issue,
although slightly different, may arise on multi-level:
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = 0 migrator = NONE
active = 0 active = NONE
nextevt = T0i nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T0i nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1. The change only propagated up to 1st
level so far.
[GRP1:0]
migrator = NONE
active = NONE
nextevt = T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T0i nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
2) The change now propagates up to the top. tmigr_update_events() finds
that the child event is ignored and thus removes it. The top level next
event is now T2 which is returned to CPU 0 as its next effective expiry
to take account for as the global idle migrator. However T1 has been
ignored along the way, leaving it unhandled.
Fix those issues with removing the buggy related early return. Ignored
child events must not prevent from evaluating the other events within
the same group.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/time/timer_migration.c | 20 --------------------
1 file changed, 20 deletions(-)
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 8f49b6b96dfd..611cd904f035 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,26 +751,6 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
first_childevt = evt = data->evt;
- /*
- * Walking the hierarchy is required in any case when a
- * remote expiry was done before. This ensures to not lose
- * already queued events in non active groups (see section
- * "Required event and timerqueue update after a remote
- * expiry" in the documentation at the top).
- *
- * The two call sites which are executed without a remote expiry
- * before, are not prevented from propagating changes through
- * the hierarchy by the return:
- * - When entering this path by tmigr_new_timer(), @evt->ignore
- * is never set.
- * - tmigr_inactive_up() takes care of the propagation by
- * itself and ignores the return value. But an immediate
- * return is required because nothing has to be done in this
- * level as the event could be ignored.
- */
- if (evt->ignore && !remote)
- return true;
-
raw_spin_lock(&group->lock);
childstate.state = 0;
--
2.34.1
^ permalink raw reply related [flat|nested] 47+ messages in thread* Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
2024-03-15 1:14 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Frederic Weisbecker
@ 2024-03-15 1:20 ` Frederic Weisbecker
2024-03-15 13:44 ` Florian Fainelli
` (2 subsequent siblings)
3 siblings, 0 replies; 47+ messages in thread
From: Frederic Weisbecker @ 2024-03-15 1:20 UTC (permalink / raw)
To: Boqun Feng, Florian Fainelli
Cc: Thomas Gleixner, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On Fri, Mar 15, 2024 at 02:14:47AM +0100, Frederic Weisbecker wrote:
> This isn't proper to single level hierarchy though. A similar issue,
> although slightly different, may arise on multi-level:
>
> [GRP1:0]
> migrator = GRP0:0
> active = GRP0:0
> nextevt = T0:0i, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = 0 migrator = NONE
> active = 0 active = NONE
> nextevt = T0i nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
The "idle" below "0" above should be "active"...
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
2024-03-15 1:14 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Frederic Weisbecker
2024-03-15 1:20 ` Frederic Weisbecker
@ 2024-03-15 13:44 ` Florian Fainelli
2024-03-16 19:06 ` [tip: timers/urgent] timer/migration: Remove buggy early return on deactivation tip-bot2 for Frederic Weisbecker
2024-03-26 16:41 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Anna-Maria Behnsen
3 siblings, 0 replies; 47+ messages in thread
From: Florian Fainelli @ 2024-03-15 13:44 UTC (permalink / raw)
To: Frederic Weisbecker, Boqun Feng
Cc: Thomas Gleixner, Russell King (Oracle), Joel Fernandes,
Anna-Maria Behnsen, Linus Torvalds, linux-kernel, kernel-team,
paulmck, mingo, rcu, neeraj.upadhyay, urezki, qiang.zhang1211,
bigeasy, chenzhongjin, yangjihong1, rostedt, Justin Chen
On 3/14/2024 6:14 PM, Frederic Weisbecker wrote:
> On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
>> I notice CPU3 didn't have its own non-deferrable timer queued (local or
>> global), so could the following happen?
>>
>> timer_base_try_to_set_idle():
>> __get_next_timer_interrupt():
>> fetch_next_timer_interrupt():
>> // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
>> // tevt->local == tevt->gloabl = KTIME_MAX
>> timer_use_tmigr():
>> tmigr_cpu_deactivate():
>> __tmigr_cpu_deactivate():
>> // tmc->cpuevt.ignore untouched still == true
>> walk_groups(&tmigr_inactive_up, ...):
>> tmigr_inactive_up():
>> data->remote = true;
>> tmigr_update_events():
>> if (child) { // child is NULL
>> ...
>> } else {
>> first_childevt = evt = data->evt;
>>
>> if (evt->ignore && !remote)
>> return true; // no remote tick is picked.
>> ...
>> }
>
> Nice catch! Florian can you try the following?
>
> From b0e335371ed758f68bf4f501246298c98a615b04 Mon Sep 17 00:00:00 2001
> From: Frederic Weisbecker <frederic@kernel.org>
> Date: Fri, 15 Mar 2024 00:21:01 +0100
> Subject: [PATCH] timer/migration: Remove buggy early return on deactivation
>
> When a CPU enters into idle and deactivates itself from the timer
> migration hierarchy without any global timer of its own to propagate,
> the group event of that CPU is set to "ignore" and tmigr_update_events()
> accordingly performs an early return without considering timers queued
> by other CPUs.
>
> If the hierarchy has a single level, and the CPU is the last one to
> enter idle, it will ignore others' global timers, as in the following
> layout:
>
> [GRP0:0]
> migrator = 0
> active = 0
> nextevt = T0i
> / \
> 0 1
> active (T0i) idle (T1)
>
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
>
> [GRP0:0]
> migrator = NONE
> active = NONE
> nextevt = T0i
> / \
> 0 1
> idle (T0i) idle (T1)
>
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
> unhandled.
>
> This isn't proper to single level hierarchy though. A similar issue,
> although slightly different, may arise on multi-level:
>
> [GRP1:0]
> migrator = GRP0:0
> active = GRP0:0
> nextevt = T0:0i, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = 0 migrator = NONE
> active = 0 active = NONE
> nextevt = T0i nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
> CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
>
> [GRP1:0]
> migrator = GRP0:0
> active = GRP0:0
> nextevt = T0:0i, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = NONE migrator = NONE
> active = NONE active = NONE
> nextevt = T0i nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1. The change only propagated up to 1st
> level so far.
>
> [GRP1:0]
> migrator = NONE
> active = NONE
> nextevt = T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = NONE migrator = NONE
> active = NONE active = NONE
> nextevt = T0i nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 2) The change now propagates up to the top. tmigr_update_events() finds
> that the child event is ignored and thus removes it. The top level next
> event is now T2 which is returned to CPU 0 as its next effective expiry
> to take account for as the global idle migrator. However T1 has been
> ignored along the way, leaving it unhandled.
>
> Fix those issues with removing the buggy related early return. Ignored
> child events must not prevent from evaluating the other events within
> the same group.
>
> Reported-by: Boqun Feng <boqun.feng@gmail.com>
> Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Gave it a good 20 cold boots and suspend to DRAM cycles and none of
those triggered the reported behavior:
Reported-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Thanks for the quick fix Frederic!
--
Florian
^ permalink raw reply [flat|nested] 47+ messages in thread* [tip: timers/urgent] timer/migration: Remove buggy early return on deactivation
2024-03-15 1:14 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Frederic Weisbecker
2024-03-15 1:20 ` Frederic Weisbecker
2024-03-15 13:44 ` Florian Fainelli
@ 2024-03-16 19:06 ` tip-bot2 for Frederic Weisbecker
2024-03-26 16:41 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Anna-Maria Behnsen
3 siblings, 0 replies; 47+ messages in thread
From: tip-bot2 for Frederic Weisbecker @ 2024-03-16 19:06 UTC (permalink / raw)
To: linux-tip-commits
Cc: Boqun Feng, Florian Fainelli, Thomas Gleixner,
Frederic Weisbecker, Florian Fainelli, x86, linux-kernel
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 4b6f4c5a67c07417bf29d896c76f513a4be07516
Gitweb: https://git.kernel.org/tip/4b6f4c5a67c07417bf29d896c76f513a4be07516
Author: Frederic Weisbecker <frederic@kernel.org>
AuthorDate: Fri, 15 Mar 2024 02:14:47 +01:00
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 16 Mar 2024 19:55:46 +01:00
timer/migration: Remove buggy early return on deactivation
When a CPU enters into idle and deactivates itself from the timer
migration hierarchy without any global timer of its own to propagate,
the group event of that CPU is set to "ignore" and tmigr_update_events()
accordingly performs an early return without considering timers queued
by other CPUs.
If the hierarchy has a single level, and the CPU is the last one to
enter idle, it will ignore others' global timers, as in the following
layout:
[GRP0:0]
migrator = 0
active = 0
nextevt = T0i
/ \
0 1
active (T0i) idle (T1)
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
[GRP0:0]
migrator = NONE
active = NONE
nextevt = T0i
/ \
0 1
idle (T0i) idle (T1)
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
unhandled.
This isn't proper to single level hierarchy though. A similar issue,
although slightly different, may arise on multi-level:
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = 0 migrator = NONE
active = 0 active = NONE
nextevt = T0i nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
active idle idle idle
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T0i nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1. The change only propagated up to 1st
level so far.
[GRP1:0]
migrator = NONE
active = NONE
nextevt = T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T0i nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
2) The change now propagates up to the top. tmigr_update_events() finds
that the child event is ignored and thus removes it. The top level next
event is now T2 which is returned to CPU 0 as its next effective expiry
to take account for as the global idle migrator. However T1 has been
ignored along the way, leaving it unhandled.
Fix those issues with removing the buggy related early return. Ignored
child events must not prevent from evaluating the other events within
the same group.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen
---
kernel/time/timer_migration.c | 20 --------------------
1 file changed, 20 deletions(-)
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 8f49b6b..611cd90 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,26 +751,6 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
first_childevt = evt = data->evt;
- /*
- * Walking the hierarchy is required in any case when a
- * remote expiry was done before. This ensures to not lose
- * already queued events in non active groups (see section
- * "Required event and timerqueue update after a remote
- * expiry" in the documentation at the top).
- *
- * The two call sites which are executed without a remote expiry
- * before, are not prevented from propagating changes through
- * the hierarchy by the return:
- * - When entering this path by tmigr_new_timer(), @evt->ignore
- * is never set.
- * - tmigr_inactive_up() takes care of the propagation by
- * itself and ignores the return value. But an immediate
- * return is required because nothing has to be done in this
- * level as the event could be ignored.
- */
- if (evt->ignore && !remote)
- return true;
-
raw_spin_lock(&group->lock);
childstate.state = 0;
^ permalink raw reply related [flat|nested] 47+ messages in thread* Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
2024-03-15 1:14 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Frederic Weisbecker
` (2 preceding siblings ...)
2024-03-16 19:06 ` [tip: timers/urgent] timer/migration: Remove buggy early return on deactivation tip-bot2 for Frederic Weisbecker
@ 2024-03-26 16:41 ` Anna-Maria Behnsen
2024-03-26 17:18 ` Frederic Weisbecker
3 siblings, 1 reply; 47+ messages in thread
From: Anna-Maria Behnsen @ 2024-03-26 16:41 UTC (permalink / raw)
To: Frederic Weisbecker, Boqun Feng, Florian Fainelli
Cc: Thomas Gleixner, Russell King (Oracle), Joel Fernandes,
Linus Torvalds, linux-kernel, kernel-team, paulmck, mingo, rcu,
neeraj.upadhyay, urezki, qiang.zhang1211, bigeasy, chenzhongjin,
yangjihong1, rostedt, Justin Chen
Hi Frederic,
I'm sorry sending my concerns late, but I was on sick leave. Keep in
mind, it is definitely possible that my brain is not yet in the timer
migration hierarchy mood after the sick leave :) so please correct me
whenever I'm wrong.
Frederic Weisbecker <frederic@kernel.org> writes:
> On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
>> I notice CPU3 didn't have its own non-deferrable timer queued (local or
>> global), so could the following happen?
>>
>> timer_base_try_to_set_idle():
>> __get_next_timer_interrupt():
>> fetch_next_timer_interrupt():
>> // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
>> // tevt->local == tevt->gloabl = KTIME_MAX
>> timer_use_tmigr():
>> tmigr_cpu_deactivate():
>> __tmigr_cpu_deactivate():
>> // tmc->cpuevt.ignore untouched still == true
>> walk_groups(&tmigr_inactive_up, ...):
>> tmigr_inactive_up():
>> data->remote = true;
>> tmigr_update_events():
>> if (child) { // child is NULL
>> ...
>> } else {
>> first_childevt = evt = data->evt;
>>
>> if (evt->ignore && !remote)
>> return true; // no remote tick is picked.
>> ...
>> }
>
> Nice catch! Florian can you try the following?
>
> From b0e335371ed758f68bf4f501246298c98a615b04 Mon Sep 17 00:00:00 2001
> From: Frederic Weisbecker <frederic@kernel.org>
> Date: Fri, 15 Mar 2024 00:21:01 +0100
> Subject: [PATCH] timer/migration: Remove buggy early return on deactivation
>
> When a CPU enters into idle and deactivates itself from the timer
> migration hierarchy without any global timer of its own to propagate,
> the group event of that CPU is set to "ignore" and tmigr_update_events()
> accordingly performs an early return without considering timers queued
> by other CPUs.
>
> If the hierarchy has a single level, and the CPU is the last one to
> enter idle, it will ignore others' global timers, as in the following
> layout:
>
> [GRP0:0]
> migrator = 0
> active = 0
> nextevt = T0i
> / \
> 0 1
> active (T0i) idle (T1)
>
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
>
> [GRP0:0]
> migrator = NONE
> active = NONE
> nextevt = T0i
> / \
> 0 1
> idle (T0i) idle (T1)
>
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
> unhandled.
This is broken - indeed.
> This isn't proper to single level hierarchy though. A similar issue,
> although slightly different, may arise on multi-level:
>
> [GRP1:0]
> migrator = GRP0:0
> active = GRP0:0
> nextevt = T0:0i, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = 0 migrator = NONE
> active = 0 active = NONE
> nextevt = T0i nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
> CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
>
> [GRP1:0]
> migrator = GRP0:0
> active = GRP0:0
> nextevt = T0:0i, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = NONE migrator = NONE
> active = NONE active = NONE
> nextevt = T0i nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1. The change only propagated up to 1st
> level so far.
Right. T0 doesn't has to be enqueued into the timer queue of GRP0:0 as
this timer could be ignored. So nothing changes directly in GRP0:0.
> [GRP1:0]
> migrator = NONE
> active = NONE
> nextevt = T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = NONE migrator = NONE
> active = NONE active = NONE
> nextevt = T0i nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 2) The change now propagates up to the top. tmigr_update_events() finds
> that the child event is ignored and thus removes it. The top level next
> event is now T2 which is returned to CPU 0 as its next effective expiry
> to take account for as the global idle migrator. However T1 has been
> ignored along the way, leaving it unhandled.
Now propagation goes on as GRP0:0 is completely idle. When executing
tmigr_update_events() in the next step of walking the hierarchy via
tmigr_inactive_up(), the arguments for tmigr_update_events() are set in
the following way:
group = GRP1:0
child = GRP0:0
Then at the begin of tmigr_update_events() the group event of child is
updated - so all ignored events are removed (T0i), and the
child->groupevt and child->next_expiry is updated with T1. This
reevaluated child->groupevt is then queued/updated in the GRP1:0
timerqueue.
So T1 will be handled!
As there is no parent, the top level group event is updated (see goto
label "check_toplvl") and T1 will be still the first event.
> Fix those issues with removing the buggy related early return. Ignored
> child events must not prevent from evaluating the other events within
> the same group.
I would prefere to keep this early return but skip it, when there is
!group->parent (only a single level in hierarchy).
Then it would prevent taking the group lock and making some random
event updates which are done nevertheless on the next iteration of the
hierarchy walk.
Thanks,
Anna-Maria
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
2024-03-26 16:41 ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Anna-Maria Behnsen
@ 2024-03-26 17:18 ` Frederic Weisbecker
2024-04-04 16:50 ` [PATCH] timers/migration: Return early on deactivation Anna-Maria Behnsen
0 siblings, 1 reply; 47+ messages in thread
From: Frederic Weisbecker @ 2024-03-26 17:18 UTC (permalink / raw)
To: Anna-Maria Behnsen
Cc: Boqun Feng, Florian Fainelli, Thomas Gleixner,
Russell King (Oracle), Joel Fernandes, Linus Torvalds,
linux-kernel, kernel-team, paulmck, mingo, rcu, neeraj.upadhyay,
urezki, qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1,
rostedt, Justin Chen
Le Tue, Mar 26, 2024 at 05:41:00PM +0100, Anna-Maria Behnsen a écrit :
> Frederic Weisbecker <frederic@kernel.org> writes:
> Now propagation goes on as GRP0:0 is completely idle. When executing
> tmigr_update_events() in the next step of walking the hierarchy via
> tmigr_inactive_up(), the arguments for tmigr_update_events() are set in
> the following way:
>
> group = GRP1:0
> child = GRP0:0
>
> Then at the begin of tmigr_update_events() the group event of child is
> updated - so all ignored events are removed (T0i), and the
> child->groupevt and child->next_expiry is updated with T1. This
> reevaluated child->groupevt is then queued/updated in the GRP1:0
> timerqueue.
>
> So T1 will be handled!
>
> As there is no parent, the top level group event is updated (see goto
> label "check_toplvl") and T1 will be still the first event.
Bah! Good point, I got confused there...
>
> > Fix those issues with removing the buggy related early return. Ignored
> > child events must not prevent from evaluating the other events within
> > the same group.
>
> I would prefere to keep this early return but skip it, when there is
> !group->parent (only a single level in hierarchy).
>
> Then it would prevent taking the group lock and making some random
> event updates which are done nevertheless on the next iteration of the
> hierarchy walk.
Ok sounds like a good plan!
Thanks.
>
> Thanks,
>
> Anna-Maria
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH] timers/migration: Return early on deactivation
2024-03-26 17:18 ` Frederic Weisbecker
@ 2024-04-04 16:50 ` Anna-Maria Behnsen
2024-04-04 22:19 ` Frederic Weisbecker
0 siblings, 1 reply; 47+ messages in thread
From: Anna-Maria Behnsen @ 2024-04-04 16:50 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Boqun Feng, Florian Fainelli, Thomas Gleixner,
Russell King (Oracle), Joel Fernandes, Linus Torvalds,
linux-kernel, kernel-team, paulmck, mingo, rcu, neeraj.upadhyay,
urezki, qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1,
rostedt, Justin Chen
Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.
But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = 0 migrator = NONE
active = 0 active = NONE
nextevt = T0i, T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
active idle idle idle
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :
lock(GRP0:0->lock);
timerqueue_del(GRP0:0, T0i);
unlock(GRP0:0->lock);
[GRP1:0]
migrator = NONE
active = NONE
nextevt = T0:0, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):
lock(GRP0:0->lock);
lock(GRP1:0->lock);
evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
in GRP0:0
... update GRP1:0 group event and timerqueue ...
unlock(GRP1:0->lock);
unlock(GRP0:0->lock);
So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.
Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group.
Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
kernel/time/timer_migration.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,6 +751,31 @@ bool tmigr_update_events(struct tmigr_gr
first_childevt = evt = data->evt;
+ /*
+ * Walking the hierarchy is required in any case when a
+ * remote expiry was done before. This ensures to not lose
+ * already queued events in non active groups (see section
+ * "Required event and timerqueue update after a remote
+ * expiry" in the documentation at the top).
+ *
+ * The two call sites which are executed without a remote expiry
+ * before, are not prevented from propagating changes through
+ * the hierarchy by the return:
+ * - When entering this path by tmigr_new_timer(), @evt->ignore
+ * is never set.
+ * - tmigr_inactive_up() takes care of the propagation by
+ * itself and ignores the return value. But an immediate
+ * return is required because nothing has to be done in this
+ * level as the event could be ignored.
+ *
+ * But, if the hierarchy has only a single level so @group is
+ * the top level group, make sure first event information of the
+ * group is updated properly and also handled properly, so skip
+ * this fast return path.
+ */
+ if (evt->ignore && !remote && group->parent)
+ return true;
+
raw_spin_lock(&group->lock);
childstate.state = 0;
^ permalink raw reply [flat|nested] 47+ messages in thread* Re: [PATCH] timers/migration: Return early on deactivation
2024-04-04 16:50 ` [PATCH] timers/migration: Return early on deactivation Anna-Maria Behnsen
@ 2024-04-04 22:19 ` Frederic Weisbecker
2024-04-05 8:53 ` [PATCH v2] " Anna-Maria Behnsen
0 siblings, 1 reply; 47+ messages in thread
From: Frederic Weisbecker @ 2024-04-04 22:19 UTC (permalink / raw)
To: Anna-Maria Behnsen
Cc: Boqun Feng, Florian Fainelli, Thomas Gleixner,
Russell King (Oracle), Joel Fernandes, Linus Torvalds,
linux-kernel, kernel-team, paulmck, mingo, rcu, neeraj.upadhyay,
urezki, qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1,
rostedt, Justin Chen
Le Thu, Apr 04, 2024 at 06:50:26PM +0200, Anna-Maria Behnsen a écrit :
> Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
> deactivation") removed the logic to return early in tmigr_update_events()
> on deactivation. With this the problem with a not properly updated first
> global event in a hierarchy containing only a single group was fixed.
>
> But when having a look at this code path with a hierarchy with more than a
> single level, now unnecessary work is done (example is partially copied
> from the message of the commit mentioned above):
>
> [GRP1:0]
> migrator = GRP0:0
> active = GRP0:0
> nextevt = T0:0i, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = 0 migrator = NONE
> active = 0 active = NONE
> nextevt = T0i, T1 nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> active idle idle idle
>
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
> CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
>
> [GRP1:0]
> migrator = GRP0:0
> active = GRP0:0
> nextevt = T0:0i, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = NONE migrator = NONE
> active = NONE active = NONE
> nextevt = T1 nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". Without this
> early return the following steps happen in tmigr_update_events() when
> child = null and group = GRP0:0 :
>
> lock(GRP0:0->lock);
> timerqueue_del(GRP0:0, T0i);
> unlock(GRP0:0->lock);
>
>
> [GRP1:0]
> migrator = NONE
> active = NONE
> nextevt = T0:0, T0:1
> / \
> [GRP0:0] [GRP0:1]
> migrator = NONE migrator = NONE
> active = NONE active = NONE
> nextevt = T1 nextevt = T2
> / \ / \
> 0 (T0i) 1 (T1) 2 (T2) 3
> idle idle idle idle
>
> 2) The change now propagates up to the top. Then tmigr_update_events()
> updates the group event of GRP0:0 and executes the following steps
> (child = GRP0:0 and group = GRP0:0):
>
> lock(GRP0:0->lock);
> lock(GRP1:0->lock);
> evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
> in GRP0:0
> ... update GRP1:0 group event and timerqueue ...
> unlock(GRP1:0->lock);
> unlock(GRP0:0->lock);
>
> So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
> the timerqueue is redundand as this is done nevertheless in 2) when
> tmigr_next_groupevt(GRP0:0) is executed.
>
> Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
> deactivation") and add a condition into return path to skip the return
> only, when hierarchy contains a single group.
>
> Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Just some comment nits:
> ---
> kernel/time/timer_migration.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -751,6 +751,31 @@ bool tmigr_update_events(struct tmigr_gr
>
> first_childevt = evt = data->evt;
>
> + /*
> + * Walking the hierarchy is required in any case when a
> + * remote expiry was done before. This ensures to not lose
> + * already queued events in non active groups (see section
> + * "Required event and timerqueue update after a remote
> + * expiry" in the documentation at the top).
> + *
> + * The two call sites which are executed without a remote expiry
> + * before, are not prevented from propagating changes through
> + * the hierarchy by the return:
> + * - When entering this path by tmigr_new_timer(), @evt->ignore
> + * is never set.
> + * - tmigr_inactive_up() takes care of the propagation by
> + * itself and ignores the return value. But an immediate
> + * return is required because nothing has to be done in this
> + * level as the event could be ignored.
It's not exactly required, it's an optimization. How about:
"""
But an immediate return is possible if there is a parent, sparing group
locking at this level, because the upper walking call to the parent will
take care about removing this event from within the group and update
next_expiry accordingly.
"""
> + *
> + * But, if the hierarchy has only a single level so @group is
> + * the top level group, make sure first event information of the
> + * group is updated properly and also handled properly, so skip
> + * this fast return path.
"""
However if there is no parent, ie: the hierarchy has only a single level so
@group is the top level group, make sure the first event information of the
group is updated properly and also handled properly, so skip
this fast return path.
"""
Thanks.
> + */
> + if (evt->ignore && !remote && group->parent)
> + return true;
> +
> raw_spin_lock(&group->lock);
>
> childstate.state = 0;
^ permalink raw reply [flat|nested] 47+ messages in thread* [PATCH v2] timers/migration: Return early on deactivation
2024-04-04 22:19 ` Frederic Weisbecker
@ 2024-04-05 8:53 ` Anna-Maria Behnsen
2024-04-05 9:11 ` [tip: timers/urgent] " tip-bot2 for Anna-Maria Behnsen
0 siblings, 1 reply; 47+ messages in thread
From: Anna-Maria Behnsen @ 2024-04-05 8:53 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Boqun Feng, Florian Fainelli, Thomas Gleixner,
Russell King (Oracle), Joel Fernandes, Linus Torvalds,
linux-kernel, kernel-team, paulmck, mingo, rcu, neeraj.upadhyay,
urezki, qiang.zhang1211, bigeasy, chenzhongjin, yangjihong1,
rostedt, Justin Chen
Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.
But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = 0 migrator = NONE
active = 0 active = NONE
nextevt = T0i, T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
active idle idle idle
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :
lock(GRP0:0->lock);
timerqueue_del(GRP0:0, T0i);
unlock(GRP0:0->lock);
[GRP1:0]
migrator = NONE
active = NONE
nextevt = T0:0, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):
lock(GRP0:0->lock);
lock(GRP1:0->lock);
evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
in GRP0:0
... update GRP1:0 group event and timerqueue ...
unlock(GRP1:0->lock);
unlock(GRP0:0->lock);
So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.
Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group. Adapt comments accordingly.
Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/time/timer_migration.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,6 +751,33 @@ bool tmigr_update_events(struct tmigr_gr
first_childevt = evt = data->evt;
+ /*
+ * Walking the hierarchy is required in any case when a
+ * remote expiry was done before. This ensures to not lose
+ * already queued events in non active groups (see section
+ * "Required event and timerqueue update after a remote
+ * expiry" in the documentation at the top).
+ *
+ * The two call sites which are executed without a remote expiry
+ * before, are not prevented from propagating changes through
+ * the hierarchy by the return:
+ * - When entering this path by tmigr_new_timer(), @evt->ignore
+ * is never set.
+ * - tmigr_inactive_up() takes care of the propagation by
+ * itself and ignores the return value. But an immediate
+ * return is possible if there is a parent, sparing group
+ * locking at this level, because the upper walking call to
+ * the parent will take care about removing this event from
+ * within the group and update next_expiry accordingly.
+ *
+ * However if there is no parent, ie: the hierarchy has only a
+ * single level so @group is the top level group, make sure the
+ * first event information of the group is updated properly and
+ * also handled properly, so skip this fast return path.
+ */
+ if (evt->ignore && !remote && group->parent)
+ return true;
+
raw_spin_lock(&group->lock);
childstate.state = 0;
^ permalink raw reply [flat|nested] 47+ messages in thread* [tip: timers/urgent] timers/migration: Return early on deactivation
2024-04-05 8:53 ` [PATCH v2] " Anna-Maria Behnsen
@ 2024-04-05 9:11 ` tip-bot2 for Anna-Maria Behnsen
0 siblings, 0 replies; 47+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-04-05 9:11 UTC (permalink / raw)
To: linux-tip-commits
Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, x86,
linux-kernel
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 7a96a84bfbee96871bb16c70ee3e93d564e190f4
Gitweb: https://git.kernel.org/tip/7a96a84bfbee96871bb16c70ee3e93d564e190f4
Author: Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate: Fri, 05 Apr 2024 10:53:21 +02:00
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 05 Apr 2024 11:05:16 +02:00
timers/migration: Return early on deactivation
Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.
But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = 0 migrator = NONE
active = 0 active = NONE
nextevt = T0i, T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
active idle idle idle
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :
lock(GRP0:0->lock);
timerqueue_del(GRP0:0, T0i);
unlock(GRP0:0->lock);
[GRP1:0]
migrator = NONE
active = NONE
nextevt = T0:0, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):
lock(GRP0:0->lock);
lock(GRP1:0->lock);
evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
in GRP0:0
... update GRP1:0 group event and timerqueue ...
unlock(GRP1:0->lock);
unlock(GRP0:0->lock);
So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.
Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group. Adapt comments accordingly.
Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus
---
kernel/time/timer_migration.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index e3075e4..ccba875 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,6 +751,33 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
first_childevt = evt = data->evt;
+ /*
+ * Walking the hierarchy is required in any case when a
+ * remote expiry was done before. This ensures to not lose
+ * already queued events in non active groups (see section
+ * "Required event and timerqueue update after a remote
+ * expiry" in the documentation at the top).
+ *
+ * The two call sites which are executed without a remote expiry
+ * before, are not prevented from propagating changes through
+ * the hierarchy by the return:
+ * - When entering this path by tmigr_new_timer(), @evt->ignore
+ * is never set.
+ * - tmigr_inactive_up() takes care of the propagation by
+ * itself and ignores the return value. But an immediate
+ * return is possible if there is a parent, sparing group
+ * locking at this level, because the upper walking call to
+ * the parent will take care about removing this event from
+ * within the group and update next_expiry accordingly.
+ *
+ * However if there is no parent, ie: the hierarchy has only a
+ * single level so @group is the top level group, make sure the
+ * first event information of the group is updated properly and
+ * also handled properly, so skip this fast return path.
+ */
+ if (evt->ignore && !remote && group->parent)
+ return true;
+
raw_spin_lock(&group->lock);
childstate.state = 0;
^ permalink raw reply related [flat|nested] 47+ messages in thread