arm64: csdlock at early boot due to slow serial (?)

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* arm64: csdlock at early boot due to slow serial (?)
@ 2025-07-02 17:10 Breno Leitao
  2025-07-02 17:20 ` Leo Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-02 17:10 UTC (permalink / raw)
  To: cov, rmk+kernel, mark.rutland, catalin.marinas, linux-serial
  Cc: rmikey, linux-arm-kernel, usamaarif642, leo.yan, linux-kernel,
	paulmck

Hello,

I'm observing two unusual behaviors during the boot process on my SBSA
ARM machine, with upstream kernel (6.16-rc4):

1) A 9-second pause during early boot:

	[    0.000000] ACPI: SPCR: console: pl011,mmio32,0xc280000,115200
	[    0.420120] Serial: AMBA PL011 UART driver
	[    0.875263] printk: console [ttyAMA0] enabled
	[    9.848263] ACPI: PCI Root Bridge [PCI2] (domain 0002 [bus 00-ff])

2) Occasional CSD lock during early boot:

Intermittently, I encounter a CSD lock. Diagnosing this was challenging, but
after enabling PSEUDO NMI, I was able to capture the stack trace:

	printk: console [ttyAMA0] enabled
	smp: csd: Detected non-responsive CSD lock (#1) on CPU#0, waiting 5001000000 ns for CPU#02 do_nothing (kernel/smp.c:1058)
	smp:     csd: CSD lock (#1) unresponsive.
	Sending NMI from CPU 0 to CPUs 2:
	....
	pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P)
	nbcon_emit_next_record (kernel/printk/nbcon.c:1030)
	__nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1498)
	__nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1541 kernel/printk/nbcon.c:1593)
	nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1610)
	vprintk_emit (kernel/printk/printk.c:2429)

On reviewing the amba-pl011.c code, I noticed that each message being flushed
causes the following loop to iterate roughly 20,000 times:

          while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
                  cpu_relax();

Tracing this, I found that flushing early boot messages is taking a significant
amount of time. For example, trace_printk() output from that function shows:

       swapper/0-1       [000] dN...     9.695941: pl011_console_write_atomic: "[    0.928995] printk: console [ttyAMA0] enabled"
											|
       											-> This is trace_printk of wctxt->outbuf

At timestamp 9.69 seconds, the serial console is still flushing messages from
0.92 seconds, indicating that the initial 9-second gap is spent looping in
cpu_relax()—about 20,000 times per message, which is clearly suboptimal.

Further debugging revealed the following sequence with the pl011 registers:

	1) uart_console_write()
	2) REG_FR has BUSY | RXFE | TXFF for a while (~1k cpu_relax())
	3) RXFE and TXFF are cleaned, and BUSY stay on for another 17k-19k cpu_relax()

Michael has reported a hardware issue where the BUSY bit could get
stuck (see commit d8a4995bcea1: "tty: pl011: Work around QDF2400 E44 stuck BUSY
bit"), which is very similar. TXFE goes down, but BUSY is(?) still stuck for long.

If I am having the same hardware issue, I suppose I need to change that logic
to exist the cpu_relax() loop by checking when Transmit FIFO Empty (TXFE) is 0
instead of BUSY.

Anyway, any one familar with this weird behaviour?

Thanks
--breno

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-02 17:10 arm64: csdlock at early boot due to slow serial (?) Breno Leitao
@ 2025-07-02 17:20 ` Leo Yan
  2025-07-03 10:02   ` Breno Leitao
  2025-07-03 10:28 ` Mark Rutland
  2025-08-06 11:44 ` Nirmoy Das
  2 siblings, 1 reply; 17+ messages in thread
From: Leo Yan @ 2025-07-02 17:20 UTC (permalink / raw)
  To: Breno Leitao
  Cc: cov, rmk+kernel, mark.rutland, catalin.marinas, linux-serial,
	rmikey, linux-arm-kernel, usamaarif642, linux-kernel, paulmck

Hi Breno,

On Wed, Jul 02, 2025 at 10:10:21AM -0700, Breno Leitao wrote:

[...]

> Further debugging revealed the following sequence with the pl011 registers:
> 
> 	1) uart_console_write()
> 	2) REG_FR has BUSY | RXFE | TXFF for a while (~1k cpu_relax())
> 	3) RXFE and TXFF are cleaned, and BUSY stay on for another 17k-19k cpu_relax()
> 
> Michael has reported a hardware issue where the BUSY bit could get
> stuck (see commit d8a4995bcea1: "tty: pl011: Work around QDF2400 E44 stuck BUSY
> bit"), which is very similar. TXFE goes down, but BUSY is(?) still stuck for long.
> 
> If I am having the same hardware issue, I suppose I need to change that logic
> to exist the cpu_relax() loop by checking when Transmit FIFO Empty (TXFE) is 0
> instead of BUSY.
> 
> Anyway, any one familar with this weird behaviour?

To be clear, I am not familiar with pl011 driver.

For the first step, could you confirm the UART port is only used by
Linux kernel?

In some cases, if normal world and secure world share the same UART
port, it can cause the UART state machine malfunction and long wait.

Thanks,
Leo


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-02 17:20 ` Leo Yan
@ 2025-07-03 10:02   ` Breno Leitao
  2025-07-03 11:45     ` Leo Yan
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-03 10:02 UTC (permalink / raw)
  To: Leo Yan
  Cc: cov, rmk+kernel, mark.rutland, catalin.marinas, linux-serial,
	rmikey, linux-arm-kernel, usamaarif642, linux-kernel, paulmck

On Wed, Jul 02, 2025 at 06:20:36PM +0100, Leo Yan wrote:
> Hi Breno,
> 
> On Wed, Jul 02, 2025 at 10:10:21AM -0700, Breno Leitao wrote:
> 
> [...]
> 
> > Further debugging revealed the following sequence with the pl011 registers:
> > 
> > 	1) uart_console_write()
> > 	2) REG_FR has BUSY | RXFE | TXFF for a while (~1k cpu_relax())
> > 	3) RXFE and TXFF are cleaned, and BUSY stay on for another 17k-19k cpu_relax()
> > 
> > Michael has reported a hardware issue where the BUSY bit could get
> > stuck (see commit d8a4995bcea1: "tty: pl011: Work around QDF2400 E44 stuck BUSY
> > bit"), which is very similar. TXFE goes down, but BUSY is(?) still stuck for long.
> > 
> > If I am having the same hardware issue, I suppose I need to change that logic
> > to exist the cpu_relax() loop by checking when Transmit FIFO Empty (TXFE) is 0
> > instead of BUSY.
> > 
> > Anyway, any one familar with this weird behaviour?
> 
> To be clear, I am not familiar with pl011 driver.
> 
> For the first step, could you confirm the UART port is only used by
> Linux kernel?
> 
> In some cases, if normal world and secure world share the same UART
> port, it can cause the UART state machine malfunction and long wait.

I don't know how to check it for sure, but, looking at the serial
console output, I don't see anything else using the UART. The only
output I see on the console at that time is coming from linux kernel.

Would you recommend any additional check?

Thanks for the reply,
--breno


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-02 17:10 arm64: csdlock at early boot due to slow serial (?) Breno Leitao
  2025-07-02 17:20 ` Leo Yan
@ 2025-07-03 10:28 ` Mark Rutland
  2025-07-03 14:13   ` Breno Leitao
  2025-08-06 11:44 ` Nirmoy Das
  2 siblings, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2025-07-03 10:28 UTC (permalink / raw)
  To: Breno Leitao
  Cc: cov, rmk+kernel, catalin.marinas, linux-serial, rmikey,
	linux-arm-kernel, usamaarif642, leo.yan, linux-kernel, paulmck

Hi Breno,

On Wed, Jul 02, 2025 at 10:10:21AM -0700, Breno Leitao wrote:
> Hello,
> 
> I'm observing two unusual behaviors during the boot process on my SBSA
> ARM machine, with upstream kernel (6.16-rc4):

Can you say which SoC in particular that is? Knowing that would help to
identify whether there's some known erratum, clocking issue, etc.

Likewise that might imply more folk to add to Cc.

[...]

> At timestamp 9.69 seconds, the serial console is still flushing messages from
> 0.92 seconds, indicating that the initial 9-second gap is spent looping in
> cpu_relax()—about 20,000 times per message, which is clearly suboptimal.
> 
> Further debugging revealed the following sequence with the pl011 registers:
> 
> 	1) uart_console_write()
> 	2) REG_FR has BUSY | RXFE | TXFF for a while (~1k cpu_relax())
> 	3) RXFE and TXFF are cleaned, and BUSY stay on for another 17k-19k cpu_relax()
> 
> Michael has reported a hardware issue where the BUSY bit could get
> stuck (see commit d8a4995bcea1: "tty: pl011: Work around QDF2400 E44 stuck BUSY
> bit"), which is very similar. TXFE goes down, but BUSY is(?) still stuck for long.

Looking at the commit message, that was an issue with the a "custom
(non-PrimeCell) implementation of the SBSA UART" present on QDF400. I
assume that was soemthing that Qualcomm Datacenter Technologies designed
themselves.

It's possible that your SoC has a similar issue with whatever IP block
is being used as the UART, but the issue in that commit certainly
doesn't apply to most PL011 / SBSA-UART implementations.

> If I am having the same hardware issue, I suppose I need to change that logic
> to exist the cpu_relax() loop by checking when Transmit FIFO Empty (TXFE) is 0
> instead of BUSY.

If you have the same issue, then applying the same workaround makes
sense. As above, it's not clear that this is necessarily the same issue.
Any more detail that you can share regarding the SoC would be very
helpful.

Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-03 10:02   ` Breno Leitao
@ 2025-07-03 11:45     ` Leo Yan
  2025-07-03 15:01       ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Leo Yan @ 2025-07-03 11:45 UTC (permalink / raw)
  To: Breno Leitao
  Cc: cov, rmk+kernel, mark.rutland, catalin.marinas, linux-serial,
	rmikey, linux-arm-kernel, usamaarif642, linux-kernel, paulmck

On Thu, Jul 03, 2025 at 03:02:39AM -0700, Breno Leitao wrote:

[...]

> > In some cases, if normal world and secure world share the same UART
> > port, it can cause the UART state machine malfunction and long wait.
> 
> I don't know how to check it for sure, but, looking at the serial
> console output, I don't see anything else using the UART. The only
> output I see on the console at that time is coming from linux kernel.
> 
> Would you recommend any additional check?

I have no experience in the driver, I should avoid any noise. But two
things in my head for quick trying.

- First, you could try earlycon mode, e.g., add option in kernel
  command line:
  
    earlycon=pl011,mmio32,0xc280000
    earlycon=pl011,0xc280000

  This would be possible to give more printing logs. If earlycon works
  well, then the issue might be caused by later init code (clock, or
  UART driver itself).

- Try another UART port if this is possible.

Just curious, you mentioned you are working on the mainline kernel. If
you can confirm a workable kernel version, then this info can help to
rule out hardware issue and we can review the delta changes.

Thanks,
Leo


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-03 10:28 ` Mark Rutland
@ 2025-07-03 14:13   ` Breno Leitao
  2025-07-03 16:31     ` Mark Rutland
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-03 14:13 UTC (permalink / raw)
  To: Mark Rutland
  Cc: cov, rmk+kernel, catalin.marinas, linux-serial, rmikey,
	linux-arm-kernel, usamaarif642, leo.yan, linux-kernel, paulmck

On Thu, Jul 03, 2025 at 11:28:50AM +0100, Mark Rutland wrote:
> On Wed, Jul 02, 2025 at 10:10:21AM -0700, Breno Leitao wrote:
> > I'm observing two unusual behaviors during the boot process on my SBSA
> > ARM machine, with upstream kernel (6.16-rc4):
> 
> Can you say which SoC in particular that is? Knowing that would help to
> identify whether there's some known erratum, clocking issue, etc.

This is custom made rack mounted machine based on Grace CPU. Here are
some info about the hardware:

	# lscpu:
		Vendor ID:                   ARM
		  Model name:                Neoverse-V2
		    Model:                   0
		    Thread(s) per core:      1
		    Core(s) per socket:      72
		    Socket(s):               1
		    Stepping:                r0p0

	# /proc/cpuinfo
		processor	: 71
		BogoMIPS	: 2000.00
		Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
		CPU implementer	: 0x41
		CPU architecture: 8
		CPU variant	: 0x0
		CPU part	: 0xd4f
		CPU revision	: 0

	# lshw
	    description: Rack Mount Chassis
	    product: <Internal name>
	    vendor: Quanta
	    version: <Internal name>
	    width: 64 bits
	    capabilities: smbios-3.6.0 dmi-3.6.0 smp sve_default_vector_length tagged_addr_disabled
	    configuration: boot=normal chassis=rackmount family=Default string sku=Default string uuid=...

How do I find the SoC exactly?

> Likewise that might imply more folk to add to Cc.
> 
> [...]
> 
> > At timestamp 9.69 seconds, the serial console is still flushing messages from
> > 0.92 seconds, indicating that the initial 9-second gap is spent looping in
> > cpu_relax()-about 20,000 times per message, which is clearly suboptimal.
> > 
> > Further debugging revealed the following sequence with the pl011 registers:
> > 
> > 	1) uart_console_write()
> > 	2) REG_FR has BUSY | RXFE | TXFF for a while (~1k cpu_relax())
> > 	3) RXFE and TXFF are cleaned, and BUSY stay on for another 17k-19k cpu_relax()
> > 
> > Michael has reported a hardware issue where the BUSY bit could get
> > stuck (see commit d8a4995bcea1: "tty: pl011: Work around QDF2400 E44 stuck BUSY
> > bit"), which is very similar. TXFE goes down, but BUSY is(?) still stuck for long.
> 
> Looking at the commit message, that was an issue with the a "custom
> (non-PrimeCell) implementation of the SBSA UART" present on QDF400. I
> assume that was soemthing that Qualcomm Datacenter Technologies designed
> themselves.
> 
> It's possible that your SoC has a similar issue with whatever IP block
> is being used as the UART, but the issue in that commit certainly
> doesn't apply to most PL011 / SBSA-UART implementations.

That makes total sense. Decoding SPCR I see the following:

	# iasl -d spcr.dat
	Intel ACPI Component Architecture
	ASL+ Optimizing Compiler/Disassembler version 20210604
	Copyright (c) 2000 - 2021 Intel Corporation

	File appears to be binary: found 56 non-ASCII characters, disassembling
	Binary file appears to be a valid ACPI table, disassembling
	Input file spcr.dat, Length 0x50 (80) bytes
	ACPI: SPCR 0x0000000000000000 000050 (v02 NVIDIA A M I    00000001 ARMH 00010000)
	Acpi Data Table [SPCR] decoded
	Formatted output:  spcr.dsl - 2624 bytes

Thanks,
--breno


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-03 11:45     ` Leo Yan
@ 2025-07-03 15:01       ` Breno Leitao
  2025-07-03 16:17         ` Leo Yan
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-03 15:01 UTC (permalink / raw)
  To: Leo Yan
  Cc: cov, rmk+kernel, mark.rutland, catalin.marinas, linux-serial,
	rmikey, linux-arm-kernel, usamaarif642, linux-kernel, paulmck

Hello Leo,

On Thu, Jul 03, 2025 at 12:45:57PM +0100, Leo Yan wrote:
> On Thu, Jul 03, 2025 at 03:02:39AM -0700, Breno Leitao wrote:
> > > In some cases, if normal world and secure world share the same UART
> > > port, it can cause the UART state machine malfunction and long wait.
> > 
> > I don't know how to check it for sure, but, looking at the serial
> > console output, I don't see anything else using the UART. The only
> > output I see on the console at that time is coming from linux kernel.
> > 
> > Would you recommend any additional check?
> 
> I have no experience in the driver, I should avoid any noise. But two
> things in my head for quick trying.
> 
> - First, you could try earlycon mode, e.g., add option in kernel
>   command line:
>   
>     earlycon=pl011,mmio32,0xc280000
>     earlycon=pl011,0xc280000
> 
>   This would be possible to give more printing logs. If earlycon works
>   well, then the issue might be caused by later init code (clock, or
>   UART driver itself).

That is an interesting experiment, thanks for the heads-up.

I got `earlycon=pl011,mmio32,0xc280000` working, the machine boots fine,
and I see different behaviours.

Booting up the system the following lines (as examples):

	[    0.000000] earlycon: pl11 at MMIO32 0x000000000c280000 (options '115200')
	[    0.000000] printk: legacy bootconsole [pl11] enabled
	[    0.493205] smp: Brought up 1 node, 72 CPUs
	[    4.232864] SMP: Total of 72 processors activated.
	[    4.242788] CPU: All CPU(s) started at EL2
	[    6.318137] ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA
	[    6.978245] netpoll: netconsole: local port 6666
	[    7.234429] Run /init as init process

Also, I have a trace_printk() in pl011_console_write_atomic, and it only
start later in the code. More precisely at:

       swapper/0-1       [000] dN.1.     6.334581: pl011_console_write_atomic: Printing 49 [    6.334573] printk: console [ttyAMA0] enabled

I understand that the early boot messages got sent by some other
function? (Or maybe the tracing framework was not initialized?)

Anyway, pl011_console_write_atomic() continues to call ~20K times cpu_relax()
waiting for the BUSY bit to disappear, as before.

For comparison, these are the same line as above, when earlycon is not
set:

	[    0.154009] smp: Brought up 1 node, 72 CPUs
	[    0.154012] SMP: Total of 72 processors activated.
	[    0.154013] CPU: All CPU(s) started at EL2
	[    0.466949] ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA
	[    0.466967] printk: console [ttyAMA0] enabled
	[    0.466949] ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA
	[    5.513307] smp: csd: Detected non-responsive CSD lock (#1) on CPU#0, waiting 5001000000 ns for CPU#05 do_nothing+0x0/0x10(0x0).
	[   10.719305] netpoll: netconsole: local port 6666
	[   10.961798] Run /init as init process


From what I see, using earlycon boots the kernel faster. Here are some
ideas I got:

	1) Earlycon enables the message to be printed earlier, thus, it smooth
	out the serial console

	2) without earlycon, the messages are printed at once, with interrupt
	disabled
	 2.a) This causes CSD locks
	 2.b) CSD locks caused an additional 3 seconds delay

	3) pl011_console_write_atomic is slow on both cases

Thanks for the good suggestion,
--breno


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-03 15:01       ` Breno Leitao
@ 2025-07-03 16:17         ` Leo Yan
  0 siblings, 0 replies; 17+ messages in thread
From: Leo Yan @ 2025-07-03 16:17 UTC (permalink / raw)
  To: Breno Leitao
  Cc: cov, rmk+kernel, mark.rutland, catalin.marinas, linux-serial,
	rmikey, linux-arm-kernel, usamaarif642, linux-kernel, paulmck

Hi Breno,

On Thu, Jul 03, 2025 at 04:01:54PM +0100, Breno Leitao wrote:

[...]

> That is an interesting experiment, thanks for the heads-up.
> 
> I got `earlycon=pl011,mmio32,0xc280000` working, the machine boots fine,
> and I see different behaviours.
> 
> Booting up the system the following lines (as examples):
> 
> 	[    0.000000] earlycon: pl11 at MMIO32 0x000000000c280000 (options '115200')
> 	[    0.000000] printk: legacy bootconsole [pl11] enabled
> 	[    0.493205] smp: Brought up 1 node, 72 CPUs
> 	[    4.232864] SMP: Total of 72 processors activated.
> 	[    4.242788] CPU: All CPU(s) started at EL2
> 	[    6.318137] ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA
> 	[    6.978245] netpoll: netconsole: local port 6666
> 	[    7.234429] Run /init as init process
> 
> Also, I have a trace_printk() in pl011_console_write_atomic, and it only
> start later in the code. More precisely at:
> 
>        swapper/0-1       [000] dN.1.     6.334581: pl011_console_write_atomic: Printing 49 [    6.334573] printk: console [ttyAMA0] enabled
> 
> I understand that the early boot messages got sent by some other
> function? (Or maybe the tracing framework was not initialized?)
> 
> Anyway, pl011_console_write_atomic() continues to call ~20K times cpu_relax()
> waiting for the BUSY bit to disappear, as before.

If earlycon works correctly, it means the UART clock has been prepared
before the kernel boots.

I noticed that pl011_console_write_atomic() enables and disables the
uap->clk clock dynamically. The clock is a SoC implementation. If the
UART clock is already properly enabled for earlycon, we can assume it
doesn't need to be managed again in the UART driver.

So, perhaps it's worth an experiment by commenting out the clock
operations in the UART driver?

Thanks,
Leo


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-03 14:13   ` Breno Leitao
@ 2025-07-03 16:31     ` Mark Rutland
  2025-07-08 14:00       ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2025-07-03 16:31 UTC (permalink / raw)
  To: Breno Leitao
  Cc: cov, rmk+kernel, catalin.marinas, linux-serial, rmikey,
	linux-arm-kernel, usamaarif642, leo.yan, linux-kernel, paulmck,
	Ankit Agrawal, Besar Wicaksono

On Thu, Jul 03, 2025 at 03:13:26PM +0100, Breno Leitao wrote:
> On Thu, Jul 03, 2025 at 11:28:50AM +0100, Mark Rutland wrote:
> > On Wed, Jul 02, 2025 at 10:10:21AM -0700, Breno Leitao wrote:
> > > I'm observing two unusual behaviors during the boot process on my SBSA
> > > ARM machine, with upstream kernel (6.16-rc4):
> > 
> > Can you say which SoC in particular that is? Knowing that would help to
> > identify whether there's some known erratum, clocking issue, etc.
> 
> This is custom made rack mounted machine based on Grace CPU. Here are
> some info about the hardware:
> 
> 	# lscpu:
> 		Vendor ID:                   ARM
> 		  Model name:                Neoverse-V2
> 		    Model:                   0
> 		    Thread(s) per core:      1
> 		    Core(s) per socket:      72
> 		    Socket(s):               1
> 		    Stepping:                r0p0
> 
> 	# /proc/cpuinfo
> 		processor	: 71
> 		BogoMIPS	: 2000.00
> 		Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
> 		CPU implementer	: 0x41
> 		CPU architecture: 8
> 		CPU variant	: 0x0
> 		CPU part	: 0xd4f
> 		CPU revision	: 0
> 
> 	# lshw
> 	    description: Rack Mount Chassis
> 	    product: <Internal name>
> 	    vendor: Quanta
> 	    version: <Internal name>
> 	    width: 64 bits
> 	    capabilities: smbios-3.6.0 dmi-3.6.0 smp sve_default_vector_length tagged_addr_disabled
> 	    configuration: boot=normal chassis=rackmount family=Default string sku=Default string uuid=...
> 
> How do I find the SoC exactly?

From what you've told me above, the SoC is Nvidia Grace; what they call
the CPU is the whole SoC.

> > Likewise that might imply more folk to add to Cc.

I've added Ankit and Besar, since they've both worked on some system
level bits on Grace, and might have an idea.

Ankit, Besar, are you aware of any UART issues on Grace (as described in
Breno's messages below), or do you know of anyone who might have an
idea?

Thanks,
Mark.

> > [...]
> > 
> > > At timestamp 9.69 seconds, the serial console is still flushing messages from
> > > 0.92 seconds, indicating that the initial 9-second gap is spent looping in
> > > cpu_relax()-about 20,000 times per message, which is clearly suboptimal.
> > > 
> > > Further debugging revealed the following sequence with the pl011 registers:
> > > 
> > > 	1) uart_console_write()
> > > 	2) REG_FR has BUSY | RXFE | TXFF for a while (~1k cpu_relax())
> > > 	3) RXFE and TXFF are cleaned, and BUSY stay on for another 17k-19k cpu_relax()
> > > 
> > > Michael has reported a hardware issue where the BUSY bit could get
> > > stuck (see commit d8a4995bcea1: "tty: pl011: Work around QDF2400 E44 stuck BUSY
> > > bit"), which is very similar. TXFE goes down, but BUSY is(?) still stuck for long.
> > 
> > Looking at the commit message, that was an issue with the a "custom
> > (non-PrimeCell) implementation of the SBSA UART" present on QDF400. I
> > assume that was soemthing that Qualcomm Datacenter Technologies designed
> > themselves.
> > 
> > It's possible that your SoC has a similar issue with whatever IP block
> > is being used as the UART, but the issue in that commit certainly
> > doesn't apply to most PL011 / SBSA-UART implementations.
> 
> That makes total sense. Decoding SPCR I see the following:
> 
> 	# iasl -d spcr.dat
> 	Intel ACPI Component Architecture
> 	ASL+ Optimizing Compiler/Disassembler version 20210604
> 	Copyright (c) 2000 - 2021 Intel Corporation
> 
> 	File appears to be binary: found 56 non-ASCII characters, disassembling
> 	Binary file appears to be a valid ACPI table, disassembling
> 	Input file spcr.dat, Length 0x50 (80) bytes
> 	ACPI: SPCR 0x0000000000000000 000050 (v02 NVIDIA A M I    00000001 ARMH 00010000)
> 	Acpi Data Table [SPCR] decoded
> 	Formatted output:  spcr.dsl - 2624 bytes
> 
> Thanks,
> --breno


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-03 16:31     ` Mark Rutland
@ 2025-07-08 14:00       ` Breno Leitao
  2025-07-09 14:23         ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-08 14:00 UTC (permalink / raw)
  To: Mark Rutland, ankita, bwicaksono
  Cc: rmk+kernel, catalin.marinas, linux-serial, rmikey,
	linux-arm-kernel, usamaarif642, leo.yan, linux-kernel, paulmck,
	Ankit Agrawal, Besar Wicaksono

On Thu, Jul 03, 2025 at 05:31:09PM +0100, Mark Rutland wrote:

> > How do I find the SoC exactly?
> 
> >From what you've told me above, the SoC is Nvidia Grace; what they call
> the CPU is the whole SoC.
> 
> > > Likewise that might imply more folk to add to Cc.
> 
> I've added Ankit and Besar, since they've both worked on some system
> level bits on Grace, and might have an idea.
> 
> Ankit, Besar, are you aware of any UART issues on Grace (as described in
> Breno's messages below), or do you know of anyone who might have an
> idea?

Here is more information I got about this problem. TL;DR: While the
machine is booting, it is throttled by the UART speed, while having IRQ
disabled.

Here is my line of investigation:

1) when the kernel start to flush the messages, there are ~700 messages
   in the buffer that will be flushed using
   pl011_console_write_atomic()

	[    0.653673] printk: console [ttyAMA0] enabled
	[    9.239225] ACPI: PCI Root Bridge [PCI2] (domain 0002 [bus 00-ff])

 a) pl011_console_write_atomic() is called  ~700 times. Each message
 have around 50 chars.

 b) For each char, it needs to wait for it to be sent before proceeding.

3) pl011_console_write_atomic() have two big while()/cpu_relax() blocks.
   One per character and one per message.

    a) the first one is in pl011_console_putchar(), where it checks if the
    FIFO is full before writing a single character. This is done one per
    character.

          while (pl011_read(uap, REG_FR) & UART01x_FR_TXFF)
                  cpu_relax();
	 
    b) After the line is written, then it hits the second one, in
    pl011_console_write_atomic(), where it waits until the controller is
    not busy anymore:

              while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
                  cpu_relax();
	

4) The controller has FIFO enabled, but, it doesn't help much, because
   the CPU is waiting for the 700 * 50 chars to be transmitted.

5) At this time we just have one single CPU online (?)

6) This is done with IRQ disabled, which causes a CSD lock to trigger,
   given that the CPU has IRQ disabled


   a) Following Paul's suggestion, I added
   a `lockdep_assert_irqs_enabled()` at the entrance of
   pl011_console_write_atomic(), and that is what I see.

	hardirqs last  enabled at (256267): [<ffff800080364980>] vprintk_store+0x6f8/0x820
	hardirqs last disabled at (256268): [<ffff80008036e1c4>] __nbcon_atomic_flush_pending+0x2c4/0x3e8
	softirqs last  enabled at (255640): [<ffff8000801bb020>] handle_softirqs+0x910/0xe78
	softirqs last disabled at (255635): [<ffff800080010938>] __do_softirq+0x18/0x20

  b) This is the full stack in how we are calling cpu_relax() with irq
  disabled million times:

	WARNING: CPU: 0 PID: 1 at drivers/tty/serial/amba-pl011.c:2545 pl011_console_write_atomic (drivers/tty/serial/amba-pl011.c:2545)
	Modules linked in:
	pstate: 234003c9 (nzCv DAIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
	pc : pl011_console_write_atomic (drivers/tty/serial/amba-pl011.c:2545)
	lr : pl011_console_write_atomic (drivers/tty/serial/amba-pl011.c:2545)
	sp : ffff80008980dca0
	pmr: 000000f0
	x29: ffff80008980dca0 x28: ffff0000aebcc080 x27: dfff800000000000
	x26: 1fffe00015d7985a x25: 0000000000000000 x24: 0000000000000000
	x23: ffff0000a1800000 x22: ffff800087e6d7f8 x21: 1fffe00015d7985b
	x20: ffff80008980dee0 x19: 0000000000000003 x18: 00000000ffffffff
	x17: 30313478305b2030 x16: 3030303230303030 x15: 3078302055504320
	x14: 6c61636973796870 x13: 0a2a2a2065676173 x12: ffff700011301b8b
	x11: 1ffff00011301b8a x10: ffff700011301b8a x9 : ffff800081eafdb4
	x8 : 0000000000000004 x7 : 0000000000000003 x6 : 0000000000000000
	x5 : 1ffff00011301be2 x4 : 0000000000000018 x3 : 0000000000000000
	x2 : 1ffff00010fcdb57 x1 : 0000000000000000 x0 : 0000000000000001
	Call trace:
	pl011_console_write_atomic (drivers/tty/serial/amba-pl011.c:2545) (P)
	nbcon_emit_next_record (kernel/printk/nbcon.c:1026)
	__nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1498)
	__nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1541 kernel/printk/nbcon.c:1593)
	nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1610)
	vprintk_emit (kernel/printk/printk.c:2429)
	vprintk_default (kernel/printk/printk.c:2466)
	vprintk (kernel/printk/printk_safe.c:83)
	_printk (kernel/printk/printk.c:2470)
	__warn (kernel/panic.c:768)
	report_bug (lib/bug.c:197 lib/bug.c:215)
	bug_handler (arch/arm64/kernel/traps.c:1002)
	call_break_hook (arch/arm64/kernel/debug-monitors.c:319)
	brk_handler (arch/arm64/kernel/debug-monitors.c:325)
	do_debug_exception (arch/arm64/mm/fault.c:1002)
	el1_dbg (arch/arm64/kernel/entry-common.c:514)
	el1h_64_sync_handler (arch/arm64/kernel/entry-common.c:567)
	el1h_64_sync (arch/arm64/kernel/entry.S:595)
	pl011_console_write_atomic (drivers/tty/serial/amba-pl011.c:2524) (P)
	nbcon_emit_next_record (kernel/printk/nbcon.c:1026)
	__nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1498)
	__nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1541 kernel/printk/nbcon.c:1593)
	nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1610)
	vprintk_emit (kernel/printk/printk.c:2429)
	vprintk_default (kernel/printk/printk.c:2466)
	vprintk (kernel/printk/printk_safe.c:83)
	_printk (kernel/printk/printk.c:2470)
	register_console (kernel/printk/printk.c:4126 (discriminator 9))
	serial_core_register_port (drivers/tty/serial/serial_core.c:2637 drivers/tty/serial/serial_core.c:3157 drivers/tty/serial/serial_core.c:3388)
	serial_ctrl_register_port (drivers/tty/serial/serial_ctrl.c:42)
	uart_add_one_port (drivers/tty/serial/serial_port.c:144)
	pl011_register_port (drivers/tty/serial/amba-pl011.c:2867)
	sbsa_uart_probe (drivers/tty/serial/amba-pl011.c:3041)
	platform_probe (drivers/base/platform.c:1405)
	really_probe (drivers/base/dd.c:579 drivers/base/dd.c:657)
	__driver_probe_device (drivers/base/dd.c:799)
	driver_probe_device (drivers/base/dd.c:829)
	__device_attach_driver (drivers/base/dd.c:958)
	bus_for_each_drv (drivers/base/bus.c:462)
	__device_attach (drivers/base/dd.c:1031)
	device_initial_probe (drivers/base/dd.c:1079)
	bus_probe_device (drivers/base/bus.c:537)
	device_add (drivers/base/core.c:3699)
	platform_device_add (drivers/base/platform.c:716)
	platform_device_register_full (drivers/base/platform.c:845)
	acpi_create_platform_device (./include/linux/err.h:70 drivers/acpi/acpi_platform.c:178)
	acpi_bus_attach (./include/linux/acpi.h:738 drivers/acpi/scan.c:2206 drivers/acpi/scan.c:2198 drivers/acpi/scan.c:2316)
	acpi_dev_for_one_check (drivers/acpi/bus.c:1146)
	device_for_each_child (drivers/base/core.c:4021)
	acpi_dev_for_each_child (drivers/acpi/bus.c:1151)
	acpi_bus_attach (drivers/acpi/scan.c:2323)
	acpi_dev_for_one_check (drivers/acpi/bus.c:1146)
	device_for_each_child (drivers/base/core.c:4021)
	acpi_dev_for_each_child (drivers/acpi/bus.c:1151)
	acpi_bus_attach (drivers/acpi/scan.c:2323)
	acpi_bus_scan (drivers/acpi/scan.c:2533 drivers/acpi/scan.c:2606)
	acpi_scan_init (drivers/acpi/scan.c:2739)
	acpi_init (drivers/acpi/bus.c:1470)
	do_one_initcall (init/main.c:1274)
	kernel_init_freeable (init/main.c:1335 init/main.c:1352 init/main.c:1371 init/main.c:1584)
	kernel_init (init/main.c:1476)
	ret_from_fork (arch/arm64/kernel/entry.S:863)
	irq event stamp: 256268
	hardirqs last enabled at (256267): vprintk_store (kernel/printk/printk.c:2356 (discriminator 1))
	hardirqs last disabled at (256268): __nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1539 kernel/printk/nbcon.c:1593)
	softirqs last enabled at (255640): handle_softirqs (./arch/arm64/include/asm/current.h:19 ./arch/arm64/include/asm/preempt.h:13 kernel/softirq.c:426 kernel/softirq.c:607)
	softirqs last disabled at (255635): __do_softirq (kernel/softirq.c:614)
	---[ end trace 0000000000000000 ]---



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-08 14:00       ` Breno Leitao
@ 2025-07-09 14:23         ` Breno Leitao
  2025-07-10 13:35           ` Leo Yan
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-09 14:23 UTC (permalink / raw)
  To: Mark Rutland, ankita, bwicaksono
  Cc: rmk+kernel, catalin.marinas, linux-serial, rmikey,
	linux-arm-kernel, usamaarif642, leo.yan, linux-kernel, paulmck

On Tue, Jul 08, 2025 at 07:00:45AM -0700, Breno Leitao wrote:
> On Thu, Jul 03, 2025 at 05:31:09PM +0100, Mark Rutland wrote:
> 
> Here is more information I got about this problem. TL;DR: While the
> machine is booting, it is throttled by the UART speed, while having IRQ
> disabled.

quick update: I've identified a solution that significantly improves the
situation. I've found that the serial issue was heavily affecting boot
time, which is unleashed now.

After applying the following fix, the boot speed has improved
dramatically. It's the fastest I've seen, and the CSD lockups are gone.

If no concerns raise in the next days, I will send it officially to the
serial maintainers.

Author: Breno Leitao <leitao@debian.org>
Date:   Wed Jul 9 05:57:06 2025 -0700

    serial: amba-pl011: Fix boot performance by switching to console_initcall()

    Replace arch_initcall() with console_initcall() for PL011 driver initialization
    to resolve severe boot performance issues.

    The current arch_initcall() registration causes the console to initialize
    before the printk subsystem is ready, forcing the driver into atomic mode
    during early boot. This results in:

    - 5-8 second boot delay while ~700 boot messages are processed
    - System freeze with IRQs disabled during message output
    - Each character transmitted synchronously with cpu_relax() polling

    This is what is driving the driver to atomic mode in the early boot:

      static inline void printk_get_console_flush_type(struct console_flush_type *ft)
      {
            ....
            if (printk_kthreads_running)
                    ft->nbcon_offload = true;

    The atomic path processes each character individually through
    pl011_console_putchar(), waiting for UART transmission completion
    before proceeding. With only one CPU online during early boot,
    this creates a bottleneck where the system spends excessive time
    in interrupt-disabled state.

    Here is how the code looks like:

      1) disable interrupt
      2) for each of these 700 messages, call pl011_console_write_atomic()
      3) for each character in the message, calls pl011_console_putchar(),
         which waits for the character to be transmitted
      4) once all the line is transmitted, wait for the UART to be idle
      5) re-enable interrupt

    Here is the code representation of the above:

            pl011_console_write_atomic() {
                    ...
                    // For each char in the message
                    pl011_console_putchar() {
                            while (pl011_read(uap, REG_FR) & UART01x_FR_TXFF)
                                    cpu_relax();
                    }
                    while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
                            cpu_relax();

    Using console_initcall() ensures proper initialization order,
    allowing the printk subsystem to use threaded output instead
    of atomic mode, eliminating the performance bottleneck.

    Performance improvement: 16x faster kernel boot time at my GRACE SoC
    machine.

      - Before: 10.08s to reach init process
      - After: 0.62s to reach init process

    Here are more timing details, collected from Linus' upstream, where the
    only different is this patch:

    Linus upstream:
      [    0.616203] printk: legacy console [netcon_ext0] enabled
      [    0.627469] Run /init as init process
      [    0.837477] loop: module loaded
      [    8.354803] Adding 134199360k swap on /swapvol/swapfile.

    With this patch:
      [    0.305109] ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA
      [   10.081742] Run /init as init process
      [   13.288717] loop: module loaded
      [   22.919934] Adding 134199168k swap on /swapvol/swapfile.

    Link: https://lore.kernel.org/all/aGVn%2FSnOvwWewkOW@gmail.com/ [1]

    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 22939841b1de..0cf251365825 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -3116,7 +3116,7 @@ static void __exit pl011_exit(void)
  * While this can be a module, if builtin it's most likely the console
  * So let's leave module_exit but move module_init to an earlier place
  */
-arch_initcall(pl011_init);
+console_initcall(pl011_init);
 module_exit(pl011_exit);

 MODULE_AUTHOR("ARM Ltd/Deep Blue Solutions Ltd");







^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-09 14:23         ` Breno Leitao
@ 2025-07-10 13:35           ` Leo Yan
  2025-07-10 17:30             ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Leo Yan @ 2025-07-10 13:35 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Mark Rutland, ankita, bwicaksono, rmk+kernel, catalin.marinas,
	linux-serial, rmikey, linux-arm-kernel, usamaarif642,
	linux-kernel, paulmck

Hi Breno,

On Wed, Jul 09, 2025 at 07:23:44AM -0700, Breno Leitao wrote:
> On Tue, Jul 08, 2025 at 07:00:45AM -0700, Breno Leitao wrote:
> > On Thu, Jul 03, 2025 at 05:31:09PM +0100, Mark Rutland wrote:
> > 
> > Here is more information I got about this problem. TL;DR: While the
> > machine is booting, it is throttled by the UART speed, while having IRQ
> > disabled.
> 
> quick update: I've identified a solution that significantly improves the
> situation. I've found that the serial issue was heavily affecting boot
> time, which is unleashed now.
> 
> After applying the following fix, the boot speed has improved
> dramatically. It's the fastest I've seen, and the CSD lockups are gone.

Thanks for trying to fix the issue.

> If no concerns raise in the next days, I will send it officially to the
> serial maintainers.

I am not an expert on the PL011 driver, however, I do have concerns
after review the change. Please see my comments below.

> Author: Breno Leitao <leitao@debian.org>
> Date:   Wed Jul 9 05:57:06 2025 -0700
> 
>     serial: amba-pl011: Fix boot performance by switching to console_initcall()
> 
>     Replace arch_initcall() with console_initcall() for PL011 driver initialization
>     to resolve severe boot performance issues.

pl011_init() registers as an AMBA device, so the PL011 driver depends
on the AMBA bus initialization. The AMBA bus is initialized with:

    postcore_initcall(amba_init);

Therefore, the PL011 driver is initialized with arch_initcall(), which
occurs later than the postcore init.

My understanding is that console_initcall() is invoked much earlier
than other initcalls triggered by do_initcalls(). With your change, I
saw the PL011 driver fails to register on Juno-r2 board, due to AMBA bus
driver is not ready for a console init.

  Driver 'uart-pl011' was unable to register with bus_type 'amba'
  because the bus was not initialized.

>     The current arch_initcall() registration causes the console to initialize
>     before the printk subsystem is ready, forcing the driver into atomic mode
>     during early boot. This results in:
> 
>     - 5-8 second boot delay while ~700 boot messages are processed
>     - System freeze with IRQs disabled during message output
>     - Each character transmitted synchronously with cpu_relax() polling
> 
>     This is what is driving the driver to atomic mode in the early boot:
> 
>       static inline void printk_get_console_flush_type(struct console_flush_type *ft)
>       {
>             ....
>             if (printk_kthreads_running)
>                     ft->nbcon_offload = true;
> 
>     The atomic path processes each character individually through
>     pl011_console_putchar(), waiting for UART transmission completion
>     before proceeding. With only one CPU online during early boot,
>     this creates a bottleneck where the system spends excessive time
>     in interrupt-disabled state.

The atomic path is introduced recently by the commit:

  2eb2608618ce ("serial: amba-pl011: Implement nbcon console")

My conclusion is that changing the initcall will not disable the atomic
path, changing to console_initcall() will cause AMBA device init
failure, and as a result, the clock operations will not be invoked.
Thus, I am curious if you have ruled out the issue is caused by the UART
clock (as I mentioned in another reply).

BTW, since the atomic path is enabled in the commit 2eb2608618ce, what
is the result after reverting the commit?

Thanks,
Leo

>     Here is how the code looks like:
> 
>       1) disable interrupt
>       2) for each of these 700 messages, call pl011_console_write_atomic()
>       3) for each character in the message, calls pl011_console_putchar(),
>          which waits for the character to be transmitted
>       4) once all the line is transmitted, wait for the UART to be idle
>       5) re-enable interrupt
> 
>     Here is the code representation of the above:
> 
>             pl011_console_write_atomic() {
>                     ...
>                     // For each char in the message
>                     pl011_console_putchar() {
>                             while (pl011_read(uap, REG_FR) & UART01x_FR_TXFF)
>                                     cpu_relax();
>                     }
>                     while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
>                             cpu_relax();
> 
>     Using console_initcall() ensures proper initialization order,
>     allowing the printk subsystem to use threaded output instead
>     of atomic mode, eliminating the performance bottleneck.
> 
>     Performance improvement: 16x faster kernel boot time at my GRACE SoC
>     machine.
> 
>       - Before: 10.08s to reach init process
>       - After: 0.62s to reach init process
> 
>     Here are more timing details, collected from Linus' upstream, where the
>     only different is this patch:
> 
>     Linus upstream:
>       [    0.616203] printk: legacy console [netcon_ext0] enabled
>       [    0.627469] Run /init as init process
>       [    0.837477] loop: module loaded
>       [    8.354803] Adding 134199360k swap on /swapvol/swapfile.
> 
>     With this patch:
>       [    0.305109] ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA
>       [   10.081742] Run /init as init process
>       [   13.288717] loop: module loaded
>       [   22.919934] Adding 134199168k swap on /swapvol/swapfile.
> 
>     Link: https://lore.kernel.org/all/aGVn%2FSnOvwWewkOW@gmail.com/ [1]
> 
>     Signed-off-by: Breno Leitao <leitao@debian.org>
> 
> diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
> index 22939841b1de..0cf251365825 100644
> --- a/drivers/tty/serial/amba-pl011.c
> +++ b/drivers/tty/serial/amba-pl011.c
> @@ -3116,7 +3116,7 @@ static void __exit pl011_exit(void)
>   * While this can be a module, if builtin it's most likely the console
>   * So let's leave module_exit but move module_init to an earlier place
>   */
> -arch_initcall(pl011_init);
> +console_initcall(pl011_init);
>  module_exit(pl011_exit);
> 
>  MODULE_AUTHOR("ARM Ltd/Deep Blue Solutions Ltd");
> 
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-10 13:35           ` Leo Yan
@ 2025-07-10 17:30             ` Breno Leitao
  2025-07-11  9:50               ` Leo Yan
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-10 17:30 UTC (permalink / raw)
  To: Leo Yan
  Cc: Mark Rutland, ankita, bwicaksono, rmk+kernel, catalin.marinas,
	linux-serial, rmikey, linux-arm-kernel, usamaarif642,
	linux-kernel, paulmck

Hello Leo,

On Thu, Jul 10, 2025 at 02:35:57PM +0100, Leo Yan wrote:
> >     serial: amba-pl011: Fix boot performance by switching to console_initcall()
> > 
> >     Replace arch_initcall() with console_initcall() for PL011 driver initialization
> >     to resolve severe boot performance issues.
> 
> pl011_init() registers as an AMBA device, so the PL011 driver depends
> on the AMBA bus initialization. The AMBA bus is initialized with:
> 
>     postcore_initcall(amba_init);
> 
> Therefore, the PL011 driver is initialized with arch_initcall(), which
> occurs later than the postcore init.
> 
> My understanding is that console_initcall() is invoked much earlier
> than other initcalls triggered by do_initcalls(). With your change, I
> saw the PL011 driver fails to register on Juno-r2 board, due to AMBA bus
> driver is not ready for a console init.

You are absolutely correct. I confirm that the machine is fast because
pl011 is not working, which is sad and embarassing.

> The atomic path is introduced recently by the commit:
> 
>   2eb2608618ce ("serial: amba-pl011: Implement nbcon console")
> 
> My conclusion is that changing the initcall will not disable the atomic
> path, changing to console_initcall() will cause AMBA device init
> failure, and as a result, the clock operations will not be invoked.
> Thus, I am curious if you have ruled out the issue is caused by the UART
> clock (as I mentioned in another reply).
> 
> BTW, since the atomic path is enabled in the commit 2eb2608618ce, what
> is the result after reverting the commit?

I've reverted commit 2eb2608618ce ("serial: amba-pl011: Implement nbcon
console"), and I don't see the CSD locks anymoer. The serial speed is
the same and continue to be slow, but, the CSD lock is not there. Here
is the time spent on the serial flush when reverting the commit above

	[    0.309561] printk: legacy console [ttyAMA0] enabled
	[    8.657938] ACPI: PCI Root Bridge [PCI2] (domain 0002 [bus 00-ff])

Thanks
--breno


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-10 17:30             ` Breno Leitao
@ 2025-07-11  9:50               ` Leo Yan
       [not found]                 ` <cno3lsprrz36gqu27omvwrw27d2ubqtshac3cahfgkhcm2rla7@bpnulcmsyx2u>
  0 siblings, 1 reply; 17+ messages in thread
From: Leo Yan @ 2025-07-11  9:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Mark Rutland, ankita, bwicaksono, rmk+kernel, catalin.marinas,
	linux-serial, rmikey, linux-arm-kernel, usamaarif642,
	linux-kernel, paulmck

Hi Breno,

On Thu, Jul 10, 2025 at 10:30:45AM -0700, Breno Leitao wrote:

[...]

> > The atomic path is introduced recently by the commit:
> > 
> >   2eb2608618ce ("serial: amba-pl011: Implement nbcon console")
> > 
> > My conclusion is that changing the initcall will not disable the atomic
> > path, changing to console_initcall() will cause AMBA device init
> > failure, and as a result, the clock operations will not be invoked.
> > Thus, I am curious if you have ruled out the issue is caused by the UART
> > clock (as I mentioned in another reply).
> > 
> > BTW, since the atomic path is enabled in the commit 2eb2608618ce, what
> > is the result after reverting the commit?
> 
> I've reverted commit 2eb2608618ce ("serial: amba-pl011: Implement nbcon
> console"), and I don't see the CSD locks anymoer. The serial speed is
> the same and continue to be slow, but, the CSD lock is not there. Here
> is the time spent on the serial flush when reverting the commit above
> 
> 	[    0.309561] printk: legacy console [ttyAMA0] enabled
> 	[    8.657938] ACPI: PCI Root Bridge [PCI2] (domain 0002 [bus 00-ff])

From this result, we can know both the atomic path and the thread path
take a long time polling.

Since both paths configure the UART clock, I'm curious about the
behaviour if the UART clock is untouched. The relevant code is shown
below.

I may seem a bit stubborn in suspecting a clock issue :) But if you
have confirmed that a standard pl011 UART IP is being used, then the
only external factor I am aware of is the clock.

Thanks,
Leo

---8<---

diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 71482d639a6d..b04773ba2602 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -64,6 +64,8 @@ static int amba_get_enable_pclk(struct amba_device *pcdev)
 {
 	int ret;
 
+	return 0;
+
 	pcdev->pclk = clk_get(&pcdev->dev, "apb_pclk");
 	if (IS_ERR(pcdev->pclk))
 		return PTR_ERR(pcdev->pclk);
@@ -77,8 +79,8 @@ static int amba_get_enable_pclk(struct amba_device *pcdev)
 
 static void amba_put_disable_pclk(struct amba_device *pcdev)
 {
-	clk_disable_unprepare(pcdev->pclk);
-	clk_put(pcdev->pclk);
+	//clk_disable_unprepare(pcdev->pclk);
+	//clk_put(pcdev->pclk);
 }
 
 
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 22939841b1de..9ba24afb6e4f 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -1777,7 +1777,7 @@ static int pl011_hwinit(struct uart_port *port)
 	/*
 	 * Try to enable the clock producer.
 	 */
-	retval = clk_prepare_enable(uap->clk);
+	retval = clk_prepare(uap->clk);
 	if (retval)
 		return retval;
 
@@ -1934,7 +1934,7 @@ static int pl011_startup(struct uart_port *port)
 	return 0;
 
  clk_dis:
-	clk_disable_unprepare(uap->clk);
+	//clk_disable_unprepare(uap->clk);
 	return retval;
 }
 
@@ -2025,7 +2025,7 @@ static void pl011_shutdown(struct uart_port *port)
 	/*
 	 * Shut down the clock producer
 	 */
-	clk_disable_unprepare(uap->clk);
+	//clk_disable_unprepare(uap->clk);
 	/* Optionally let pins go into sleep states */
 	pinctrl_pm_select_sleep_state(port->dev);
 
@@ -2524,7 +2524,7 @@ pl011_console_write_atomic(struct console *co, struct nbcon_write_context *wctxt
 	if (!nbcon_enter_unsafe(wctxt))
 		return;
 
-	clk_enable(uap->clk);
+	//clk_enable(uap->clk);
 
 	if (!uap->vendor->always_enabled) {
 		old_cr = pl011_read(uap, REG_CR);
@@ -2542,7 +2542,7 @@ pl011_console_write_atomic(struct console *co, struct nbcon_write_context *wctxt
 	if (!uap->vendor->always_enabled)
 		pl011_write(old_cr, uap, REG_CR);
 
-	clk_disable(uap->clk);
+	//clk_disable(uap->clk);
 
 	nbcon_exit_unsafe(wctxt);
 }
@@ -2556,7 +2556,7 @@ pl011_console_write_thread(struct console *co, struct nbcon_write_context *wctxt
 	if (!nbcon_enter_unsafe(wctxt))
 		return;
 
-	clk_enable(uap->clk);
+	//clk_enable(uap->clk);
 
 	if (!uap->vendor->always_enabled) {
 		old_cr = pl011_read(uap, REG_CR);
@@ -2586,7 +2586,7 @@ pl011_console_write_thread(struct console *co, struct nbcon_write_context *wctxt
 	if (!uap->vendor->always_enabled)
 		pl011_write(old_cr, uap, REG_CR);
 
-	clk_disable(uap->clk);
+	//clk_disable(uap->clk);
 
 	nbcon_exit_unsafe(wctxt);
 }


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
       [not found]                 ` <cno3lsprrz36gqu27omvwrw27d2ubqtshac3cahfgkhcm2rla7@bpnulcmsyx2u>
@ 2025-07-14 10:52                   ` Leo Yan
  0 siblings, 0 replies; 17+ messages in thread
From: Leo Yan @ 2025-07-14 10:52 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Mark Rutland, ankita, bwicaksono, rmk+kernel, catalin.marinas,
	linux-serial, rmikey, linux-arm-kernel, usamaarif642,
	linux-kernel, paulmck

On Fri, Jul 11, 2025 at 03:45:03AM -0700, Breno Leitao wrote:
> On Fri, Jul 11, 2025 at 10:50:23AM +0100, Leo Yan wrote:
> > > I've reverted commit 2eb2608618ce ("serial: amba-pl011: Implement nbcon
> > > console"), and I don't see the CSD locks anymoer. The serial speed is
> > > the same and continue to be slow, but, the CSD lock is not there. Here
> > > is the time spent on the serial flush when reverting the commit above
> > > 
> > > 	[    0.309561] printk: legacy console [ttyAMA0] enabled
> > > 	[    8.657938] ACPI: PCI Root Bridge [PCI2] (domain 0002 [bus 00-ff])
> > 
> > From this result, we can know both the atomic path and the thread path
> > take a long time polling.
>
> I am wondering if this slowdown is by design, or, if there is a bug
> somewhere.

Polling for over 5 seconds (5,001,000,000 ns) seems too long; it
should not be by design.

> > Since both paths configure the UART clock, I'm curious about the
> > behaviour if the UART clock is untouched. The relevant code is shown
> > below.
> 
> Is this the clock frequency that is connected to pl011 coming from AMBA?

The programming clock is for programming registers, my understanding is
the clock is not provided by bus.

> > I may seem a bit stubborn in suspecting a clock issue :) But if you
> > have confirmed that a standard pl011 UART IP is being used.
> 
> How do I double check this is a pl011 UART IP or if this is being
> emulated by firmware/ACPI.

The log shows:

  ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA

This should be emulated by firmware by reading Peripheral Identification
Registers (see [1]).

The most reliable way is to confirm with SoC vendor if the IP is
standard or includes any modification.

[1] https://developer.arm.com/documentation/ddi0183/g/programmers-model/register-descriptions/peripheral-identification-registers--uartperiphid0-3

> > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> > index 71482d639a6d..b04773ba2602 100644
> > --- a/drivers/amba/bus.c
> > +++ b/drivers/amba/bus.c
> >
> I've tested it, but, it seems it didnt' help much. here is the full log

[...]

> [    0.307809] ARMH0011:00: ttyAMA0 at MMIO 0xc280000 (irq = 66, base_baud = 0) is a SBSA
> [    0.307816] printk: console [ttyAMA0] enabled

Based on this log and followed OOPS, the issue happens just after
enabling console. It is a bit tricky that it prints logs during
console registration (and configuration).

Seems to me, it is good to check pl011_console_setup() or
sbsa_uart_set_termios() has any impaction on the long waiting. For
example, the flow below will configure pin control and which might
impact the data transaction:

  pl011_console_setup()
   `> pinctrl_pm_select_default_state()

Based on the information shared earlier, the UART FIFO appears to be
stalled, which is causing the long wait times.

Apologies if my suggestion was unreliable or misled anything, and
welcome others to freely correct me.

Thanks,
Leo

> [    5.414515] smp: csd: Detected non-responsive CSD lock (#1) on CPU#0, waiting 5001000000 ns for CPU#01 do_nothing+0x0/0x10(0x0).
> [    5.414532] smp: 	csd: CSD lock (#1) unresponsive.
> [    5.414535] Sending NMI from CPU 0 to CPUs 1:
> [    5.414541] NMI backtrace for cpu 1
> [    5.414545] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc5-00056-gd7753fd13f9e #86 NONE 
> [    5.414548] Hardware name: Quanta JAVA ISLAND PVT 29F0EMAZ049/Java Island, BIOS F0EJ3A13 06/04/2025
> [    5.414550] pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [    5.414552] pc : pl011_console_write_atomic+0x94/0x170
> [    5.414559] lr : pl011_console_write_atomic+0x84/0x170
> [    5.414562] sp : ffff80008492f1a0
> [    5.414564] pmr: 000000c0
> [    5.414565] x29: ffff80008492f1a0 x28: 0000000000000000 x27: 00000000000000e0
> [    5.414572] x26: 0000000000000190 x25: 0000000000000000 x24: 0000000000000000
> [    5.414578] x23: 0000000000000001 x22: 0000000000000000 x21: ffff800080c38ea8
> [    5.414584] x20: ffff80008492f288 x19: ffff0000b2060080 x18: 00000000ffffffff
> [    5.414590] x17: 6362323230303030 x16: 3030303078303a30 x15: 206e6f6967657220
> [    5.414595] x14: 3030303064322072 x13: 3030303063623232 x12: 3030303030303030
> [    5.414601] x11: 78303a30206e6f69 x10: 6765722030303030 x9 : ffff800080c390e4
> [    5.414607] x8 : 6964657220646e75 x7 : 205d313533383630 x6 : 302e30202020205b
> [    5.414612] x5 : ffff0000ad10200f x4 : 0000000000000000 x3 : 0000000000000008
> [    5.414618] x2 : ffff8000816336c0 x1 : 0000000000000018 x0 : 0000000000000018
> [    5.414624] Call trace:
> [    5.414626]  pl011_console_write_atomic+0x94/0x170 (P)
> [    5.414630]  nbcon_emit_next_record+0x234/0x388
> [    5.414634]  __nbcon_atomic_flush_pending_con+0x88/0x108
> [    5.414637]  __nbcon_atomic_flush_pending+0x108/0x198
> [    5.414640]  nbcon_atomic_flush_pending+0x24/0x38
> [    5.414643]  vprintk_emit+0x200/0x348
> [    5.414645]  vprintk_default+0x3c/0x50
> [    5.414647]  vprintk+0x2c/0x40
> [    5.414650]  _printk+0x50/0x60
> [    5.414652]  register_console+0x424/0x560
> [    5.414654]  serial_core_register_port+0x878/0x898
> [    5.414659]  serial_ctrl_register_port+0x14/0x28
> [    5.414661]  uart_add_one_port+0x14/0x28
> [    5.414664]  pl011_register_port+0x74/0x130
> [    5.414667]  sbsa_uart_probe+0x164/0x1b8
> [    5.414670]  platform_probe+0x8c/0x100
> [    5.414674]  really_probe+0xc4/0x398
> [    5.414676]  __driver_probe_device+0x80/0x1a8
> [    5.414679]  driver_probe_device+0x44/0x120
> [    5.414681]  __device_attach_driver+0xb8/0x158
> [    5.414683]  bus_for_each_drv+0x74/0xc0
> [    5.414685]  __device_attach+0xac/0x1e0
> [    5.414687]  device_initial_probe+0x18/0x28
> [    5.414690]  bus_probe_device+0xa8/0xb8
> [    5.414692]  device_add+0x648/0x830
> [    5.414697]  platform_device_add+0x114/0x280
> [    5.414700]  platform_device_register_full+0x148/0x1b8
> [    5.414702]  acpi_create_platform_device+0x264/0x388
> [    5.414706]  acpi_bus_attach+0x2c8/0x2e8
> [    5.414709]  acpi_dev_for_one_check+0x38/0x48
> [    5.414711]  device_for_each_child+0x54/0xa0
> [    5.414714]  acpi_dev_for_each_child+0x30/0x40
> [    5.414716]  acpi_bus_attach+0x6c/0x2e8
> [    5.414718]  acpi_dev_for_one_check+0x38/0x48
> [    5.414720]  device_for_each_child+0x54/0xa0
> [    5.414723]  acpi_dev_for_each_child+0x30/0x40
> [    5.414725]  acpi_bus_attach+0x6c/0x2e8
> [    5.414727]  acpi_bus_scan+0x5c/0x1b0
> [    5.414730]  acpi_scan_init+0xa0/0x288
> [    5.414737]  acpi_init+0x1e8/0x4b8
> [    5.414740]  do_one_initcall+0x3c/0x2a0
> [    5.414743]  kernel_init_freeable+0x22c/0x460
> [    5.414746]  kernel_init+0x28/0x1f0
> [    5.414750]  ret_from_fork+0x10/0x20


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-07-02 17:10 arm64: csdlock at early boot due to slow serial (?) Breno Leitao
  2025-07-02 17:20 ` Leo Yan
  2025-07-03 10:28 ` Mark Rutland
@ 2025-08-06 11:44 ` Nirmoy Das
  2025-08-06 15:53   ` Breno Leitao
  2 siblings, 1 reply; 17+ messages in thread
From: Nirmoy Das @ 2025-08-06 11:44 UTC (permalink / raw)
  To: leitao
  Cc: catalin.marinas, cov, leo.yan, linux-arm-kernel, linux-kernel,
	linux-serial, mark.rutland, paulmck, rmikey, rmk+kernel,
	usamaarif642

Hi Breno,

I tried the latest kernel which seems to contain a fix. Bisecting it led me to https://patch.msgid.link/20250604142045.253301-1-pmladek@suse.com and that indeed fixes it for me.
Could you please give it a try.

Regards,
Nirmoy


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: arm64: csdlock at early boot due to slow serial (?)
  2025-08-06 11:44 ` Nirmoy Das
@ 2025-08-06 15:53   ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2025-08-06 15:53 UTC (permalink / raw)
  To: Nirmoy Das
  Cc: catalin.marinas, cov, leo.yan, linux-arm-kernel, linux-kernel,
	linux-serial, mark.rutland, paulmck, rmikey, rmk+kernel,
	usamaarif642

Hello Nirmoy,

On Wed, Aug 06, 2025 at 04:44:41AM -0700, Nirmoy Das wrote:
> Hi Breno,
> 
> I tried the latest kernel which seems to contain a fix. Bisecting it led me to https://patch.msgid.link/20250604142045.253301-1-pmladek@suse.com and that indeed fixes it for me.
> Could you please give it a try.

That is awesome. I've played a little bit with it, and I see it solving
my problem as well.

The machine is booting very fast now, taking 0.8 seconds to hit init.

	[    0.811320] Run /init as init process

Thanks for the heads-up!
--breno


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-08-06 16:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-02 17:10 arm64: csdlock at early boot due to slow serial (?) Breno Leitao
2025-07-02 17:20 ` Leo Yan
2025-07-03 10:02   ` Breno Leitao
2025-07-03 11:45     ` Leo Yan
2025-07-03 15:01       ` Breno Leitao
2025-07-03 16:17         ` Leo Yan
2025-07-03 10:28 ` Mark Rutland
2025-07-03 14:13   ` Breno Leitao
2025-07-03 16:31     ` Mark Rutland
2025-07-08 14:00       ` Breno Leitao
2025-07-09 14:23         ` Breno Leitao
2025-07-10 13:35           ` Leo Yan
2025-07-10 17:30             ` Breno Leitao
2025-07-11  9:50               ` Leo Yan
     [not found]                 ` <cno3lsprrz36gqu27omvwrw27d2ubqtshac3cahfgkhcm2rla7@bpnulcmsyx2u>
2025-07-14 10:52                   ` Leo Yan
2025-08-06 11:44 ` Nirmoy Das
2025-08-06 15:53   ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).