All of lore.kernel.org
 help / color / mirror / Atom feed
From: Florian Fainelli <f.fainelli@gmail.com>
To: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	Joel Fernandes <joel@joelfernandes.org>,
	Anna-Maria Behnsen <anna-maria@linutronix.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, kernel-team@meta.com,
	paulmck@kernel.org, mingo@kernel.org, tglx@linutronix.de,
	rcu@vger.kernel.org, neeraj.upadhyay@amd.com, urezki@gmail.com,
	qiang.zhang1211@gmail.com, bigeasy@linutronix.de,
	chenzhongjin@huawei.com, yangjihong1@huawei.com,
	rostedt@goodmis.org, Justin Chen <justin.chen@broadcom.com>
Subject: Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]
Date: Thu, 14 Mar 2024 12:09:41 -0700	[thread overview]
Message-ID: <de038bee-cecd-4e76-b0f4-5822b68e439d@gmail.com> (raw)
In-Reply-To: <ZfNHNvzpqf8DOZd8@boqun-archlinux>

On 3/14/24 11:51, Boqun Feng wrote:
> On Thu, Mar 14, 2024 at 11:35:23AM -0700, Florian Fainelli wrote:
>> On 3/14/24 03:41, Frederic Weisbecker wrote:
>>> On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
>>>>
>>>>
>>>> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
>>>>> On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
>>>>>> On 3/13/24 14:59, Russell King (Oracle) wrote:
>>>>>>> On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
>>>>>>>> I will try to provide multiple answers for the sake of everyone having the
>>>>>>>> same context. Responding to Linus' specifically and his suggestion to use
>>>>>>>> "initcall_debug", this is what it gave me:
>>>>>>>>
>>>>>>>> [    6.970669] ata1: SATA link down (SStatus 0 SControl 300)
>>>>>>>> [  166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
>>>>>>>> [  166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
>>>>>>>> [  166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
>>>>>>>> [  166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
>>>>>>>> [  166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
>>>>>>>>
>>>>>>>> Also got another occurrence happening resuming from suspend to DRAM with:
>>>>>>>>
>>>>>>>> [   22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
>>>>>>>> platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
>>>>>>>> [  181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
>>>>>>>> platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
>>>>>>>>
>>>>>>>> and also with the PCIe root complex driver:
>>>>>>>>
>>>>>>>> [   18.266279] brcm-pcie f0460000.pcie: PM: calling
>>>>>>>> brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
>>>>>>>> [  177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
>>>>>>>> [  177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
>>>>>>>> [  177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
>>>>>>>> returned 0 after 159190939 usecs
>>>>>>>> [  177.457257] pcieport 0000:00:00.0: PM: calling
>>>>>>>> pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
>>>>>>>>
>>>>>>>> Surprisingly those drivers are consistently reproducing the failures I am
>>>>>>>> seeing so at least this gave me a clue as to where the problem is.
>>>>>>>>
>>>>>>>> There were no changes to drivers/net/ethernet/broadcom/genet/, the two
>>>>>>>> changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct, especially
>>>>>>>> the read_poll_timeout() conversion is correct, we properly break out of the
>>>>>>>> loop. The initial delay looked like a good culprit for a little while, but
>>>>>>>> it is not used on the affected platforms because instead we provide a
>>>>>>>> callback and we have an interrupt to signal the completion of a MDIO
>>>>>>>> operation, therefore unimac_mdio_poll() is not used at all. Finally
>>>>>>>> drivers/memory/brcmstb_dpfe.c also received a single change which is not
>>>>>>>> functional here (.remove function change do return void).
>>>>>>>>
>>>>>>>> I went back to a manual bisection and this time I believe that I have a more
>>>>>>>> plausible candidate with:
>>>>>>>>
>>>>>>>> 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
>>>>>>>> hierarchical pull model")
>>>>>>>
>>>>>>> I haven't understood the code there yet, and how it would interact with
>>>>>>> arch code, but one thing that immediately jumps out to me is this:
>>>>>>>
>>>>>>> "    As long as a CPU is busy it expires both local and global timers. When a
>>>>>>>         CPU goes idle it arms for the first expiring local timer."
>>>>>>>
>>>>>>> So are local timers "armed" when they are enqueued while the cpu is
>>>>>>> "busy" during initialisation, and will they expire, and will that
>>>>>>> expiry be delivered in a timely manner?
>>>>>>>
>>>>>>> If not, this commit is basically broken, and would be the cause of the
>>>>>>> issue you are seeing. For the mdio case, we're talking about 2ms
>>>>>>> polling. For the dpfe case, it looks like we're talking about 1ms
>>>>>>> sleeps. I'm guessing that these end up being local timers.
>>>>>>>
>>>>>>> Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
>>>>>>> for link up every 5ms - if the link was down and we msleep(5) I wonder
>>>>>>> if that's triggering the same issue.
>>>>>>>
>>>>>>> Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
>>>>>>> say. I would imagine that the same hardware timer driver is being used
>>>>>>> (may be worth checking DT.) The same should be true for the interrupt
>>>>>>> driver as well. There's been no changes in that code.
>>>>>>
>>>>>> I just had it happen with ARM64 I was plagued by:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
>>>>>>
>>>>>> and my earlier bisections somehow did not have ARM64 fail, so I thought it
>>>>>> was immune but it fails with about the same failure rate as ARM 32-bit.
>>>>>
>>>>> Can you please boot with:
>>>>>
>>>>>        trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
>>>>>
>>>>> And add the following and give us the resulting output in dmesg?
>>>>
>>>> Here are two logs from two different systems that exposed the problem on
>>>> boot:
>>>>
>>>> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
>>>> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0
>>>
>>> Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
>>> events. My fault, can you retry with this syntax?
>>>
>>> 	trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
>>>
>>> Though it's fairly possible that timer migration is not enabled at this point
>>> as it's a late initcall. But we better not miss its traces otherwise.
>>
>> Here is another log with timer_migration:
>>
>> https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
> 
> FWIW, the trace point is still not enabled:
> 
> [    0.000000] Failed to enable trace event: timer_migration:*
> 
> you need this commit in master:
> 
> 	36e40df35d2c "timer_migration: Add tracepoints"
> 
> , which is one commit later than 7ee988770326 AFAICT

Argh, thanks Boqun, here is a new capture:

https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f

this one is does include the tmigr events. Thanks!
-- 
Florian


  reply	other threads:[~2024-03-14 19:09 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-08 17:15 [GIT PULL] RCU changes for v6.9 Boqun Feng
2024-03-11 19:43 ` pr-tracker-bot
2024-03-12 20:32 ` Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9] Florian Fainelli
2024-03-12 21:01   ` Frederic Weisbecker
2024-03-12 21:15     ` Paul E. McKenney
2024-03-12 21:35       ` Florian Fainelli
2024-03-12 22:05     ` Florian Fainelli
2024-03-12 21:07   ` Boqun Feng
2024-03-12 21:34     ` Florian Fainelli
2024-03-12 21:44       ` Linus Torvalds
2024-03-12 23:48         ` Boqun Feng
2024-03-13 16:01           ` Joel Fernandes
2024-03-13 21:30             ` Florian Fainelli
2024-03-13 21:59               ` Russell King (Oracle)
2024-03-13 22:04                 ` Florian Fainelli
2024-03-13 22:49                   ` Russell King (Oracle)
2024-03-13 23:29                     ` Florian Fainelli
2024-03-14  1:15                       ` Linus Torvalds
2024-03-14  1:22                         ` Florian Fainelli
2024-03-13 22:52                   ` Frederic Weisbecker
2024-03-14  3:44                     ` Florian Fainelli
2024-03-14  5:12                       ` Boqun Feng
2024-03-14  6:33                         ` Boqun Feng
2024-03-14  9:32                           ` Thomas Gleixner
2024-03-14  9:11                       ` Thomas Gleixner
2024-03-14 10:41                       ` Frederic Weisbecker
2024-03-14 18:35                         ` Florian Fainelli
2024-03-14 18:51                           ` Boqun Feng
2024-03-14 19:09                             ` Florian Fainelli [this message]
2024-03-14 20:45                               ` Thomas Gleixner
2024-03-14 21:21                                 ` Thomas Gleixner
2024-03-14 21:53                                   ` Florian Fainelli
2024-03-14 22:51                                     ` Thomas Gleixner
2024-03-14 21:58                                   ` Thomas Gleixner
2024-03-14 22:05                                   ` Boqun Feng
2024-03-14 22:10                                     ` Boqun Feng
2024-03-15  1:14                                     ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Frederic Weisbecker
2024-03-15  1:20                                       ` Frederic Weisbecker
2024-03-15 13:44                                       ` Florian Fainelli
2024-03-16 19:06                                       ` [tip: timers/urgent] timer/migration: Remove buggy early return on deactivation tip-bot2 for Frederic Weisbecker
2024-03-26 16:41                                       ` [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]] Anna-Maria Behnsen
2024-03-26 17:18                                         ` Frederic Weisbecker
2024-04-04 16:50                                           ` [PATCH] timers/migration: Return early on deactivation Anna-Maria Behnsen
2024-04-04 22:19                                             ` Frederic Weisbecker
2024-04-05  8:53                                               ` [PATCH v2] " Anna-Maria Behnsen
2024-04-05  9:11                                                 ` [tip: timers/urgent] " tip-bot2 for Anna-Maria Behnsen
2024-03-14  9:03                 ` Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9] Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=de038bee-cecd-4e76-b0f4-5822b68e439d@gmail.com \
    --to=f.fainelli@gmail.com \
    --cc=anna-maria@linutronix.de \
    --cc=bigeasy@linutronix.de \
    --cc=boqun.feng@gmail.com \
    --cc=chenzhongjin@huawei.com \
    --cc=frederic@kernel.org \
    --cc=joel@joelfernandes.org \
    --cc=justin.chen@broadcom.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=mingo@kernel.org \
    --cc=neeraj.upadhyay@amd.com \
    --cc=paulmck@kernel.org \
    --cc=qiang.zhang1211@gmail.com \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=urezki@gmail.com \
    --cc=yangjihong1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.