printk NMI splat on boot

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* printk NMI splat on boot
@ 2025-05-20 16:38 Jens Axboe
  2025-05-20 20:08 ` John Ogness
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2025-05-20 16:38 UTC (permalink / raw)
  To: LKML; +Cc: Petr Mladek, Steven Rostedt, john.ogness,
	senozhatsky@chromium.org

Hi,

This has been going on for a while, and finally getting around to
reporting it. For every boot on my Dell R7625, I get one of these:

bnxt_en 0000:01:00.2 eno12419np2: renamed from eth2
usb 1-1.2: new low-speed USB device number 3 using xhci_hcd
usb 1-1.2: New USB device found, idVendor=047b, idProduct=0011, bcdDevice= 1.00
usb 3-1: New USB device found, idVendor=1604, idProduct=10c0, bcdDevice= 0.00
usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 96-.... } 3 jiffies s: 53 root: 0x40/.
rcu: blocking rcu_node structures (internal RCU debug): l=1:96-111:0x1/.
Sending NMI from CPU 0 to CPUs 96:
NMI backtrace for cpu 96
CPU: 96 UID: 0 PID: 3241 Comm: kworker/96:1 Not tainted 6.15.0-rc7+ #254 NONE 
Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.11.2 12/18/2024
Workqueue: usb_hub_wq hub_event [usbcore]
RIP: 0010:vprintk_emit+0x360/0x430
Code: 87 01 0f 84 91 00 00 00 c6 05 36 ec 8a 01 01 c6 05 3c ec 8a 01 00 0f b6 05 28 ec 8a 01 84 c0 74 0d f3 90 0f b6 05 1b ec 8a 01 <84> c0 75 f3 e8 47 0d 00 00 80 e7 02 74 01 fb c7 05 bb ec 8c 01 00
RSP: 0018:ffffbe47220c7b78 EFLAGS: 00000002
RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffff93019c48
RBP: ffffbe47220c7c48 R08: 00000000fffdffff R09: ffffa1bf658fffa8
R10: ffffa1bf64e00000 R11: 0000000000000002 R12: ffffbe47220c7be0
R13: ffffffff9260f8ee R14: ffffbe47220c7c58 R15: 0000000000000043
FS:  0000000000000000(0000) GS:ffffa14dd4a34000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561066512398 CR3: 00000090a0a11002 CR4: 0000000000370ef0
Call Trace:
 <TASK>
 dev_vprintk_emit+0x130/0x140
 dev_printk_emit+0x3e/0x40
 ? __dev_printk+0x2d/0x70
 _dev_info+0x5c/0x5e
 usb_new_device.cold+0x87/0x3a0 [usbcore]
 hub_event+0x1113/0x1900 [usbcore]
 ? __schedule+0x4e5/0xaf0
 process_one_work+0x140/0x2b0
 worker_thread+0x2ea/0x430
 ? process_one_work+0x2b0/0x2b0
 kthread+0xd4/0x1d0
 ? kthreads_online_cpu+0xf0/0xf0
 ret_from_fork+0x2d/0x50
 ? kthreads_online_cpu+0xf0/0xf0
 ret_from_fork_asm+0x11/0x20
 </TASK>
usb 3-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
usb 1-1.2: Product: Standard USB Keyboard 
hub 3-1:1.0: USB hub found
usb 1-1.2: Manufacturer: Silitek

which appears to be this spin section:

        /* We spin waiting for the owner to release us */
        spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
        /* Owner will clear console_waiter on hand off */
        while (READ_ONCE(console_waiter))
                cpu_relax();
        spin_release(&console_owner_dep_map, _THIS_IP_);

Box is a 2-socket, with 2:

AMD EPYC 9754 128-Core Processor

CPUs installed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: printk NMI splat on boot
  2025-05-20 16:38 printk NMI splat on boot Jens Axboe
@ 2025-05-20 20:08 ` John Ogness
  2025-05-20 20:18   ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: John Ogness @ 2025-05-20 20:08 UTC (permalink / raw)
  To: Jens Axboe, LKML
  Cc: Petr Mladek, Steven Rostedt, senozhatsky@chromium.org,
	Paul E. McKenney

Added Cc: Paul E. McKenney <paulmck@kernel.org>

Hi Jens,

On 2025-05-20, Jens Axboe <axboe@kernel.dk> wrote:
> This has been going on for a while, and finally getting around to
> reporting it. For every boot on my Dell R7625, I get one of these:

Is it possible that you could compile with:

CONFIG_PRINTK_TIME=y
CONFIG_PRINTK_CALLER=y

so that we can see some timestamps and from which CPU the printk calls
are originating from?

> bnxt_en 0000:01:00.2 eno12419np2: renamed from eth2
> usb 1-1.2: new low-speed USB device number 3 using xhci_hcd
> usb 1-1.2: New USB device found, idVendor=047b, idProduct=0011, bcdDevice= 1.00
> usb 3-1: New USB device found, idVendor=1604, idProduct=10c0, bcdDevice= 0.00
> usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 96-.... } 3 jiffies s: 53 root: 0x40/.

The expedited grace period has only been going on for 3 jiffies.

What values are you using for CONFIG_RCU_EXP_CPU_STALL_TIMEOUT and
CONFIG_RCU_CPU_STALL_TIMEOUT?

What CONFIG_*HZ* options are you using?

> rcu: blocking rcu_node structures (internal RCU debug): l=1:96-111:0x1/.
> Sending NMI from CPU 0 to CPUs 96:
> NMI backtrace for cpu 96
> CPU: 96 UID: 0 PID: 3241 Comm: kworker/96:1 Not tainted 6.15.0-rc7+ #254 NONE 
> Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.11.2 12/18/2024
> Workqueue: usb_hub_wq hub_event [usbcore]
> RIP: 0010:vprintk_emit+0x360/0x430
> Code: 87 01 0f 84 91 00 00 00 c6 05 36 ec 8a 01 01 c6 05 3c ec 8a 01 00 0f b6 05 28 ec 8a 01 84 c0 74 0d f3 90 0f b6 05 1b ec 8a 01 <84> c0 75 f3 e8 47 0d 00 00 80 e7 02 74 01 fb c7 05 bb ec 8c 01 00
> RSP: 0018:ffffbe47220c7b78 EFLAGS: 00000002
> RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffff93019c48
> RBP: ffffbe47220c7c48 R08: 00000000fffdffff R09: ffffa1bf658fffa8
> R10: ffffa1bf64e00000 R11: 0000000000000002 R12: ffffbe47220c7be0
> R13: ffffffff9260f8ee R14: ffffbe47220c7c58 R15: 0000000000000043
> FS:  0000000000000000(0000) GS:ffffa14dd4a34000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000561066512398 CR3: 00000090a0a11002 CR4: 0000000000370ef0
> Call Trace:
>  <TASK>
>  dev_vprintk_emit+0x130/0x140
>  dev_printk_emit+0x3e/0x40
>  ? __dev_printk+0x2d/0x70
>  _dev_info+0x5c/0x5e
>  usb_new_device.cold+0x87/0x3a0 [usbcore]
>  hub_event+0x1113/0x1900 [usbcore]
>  ? __schedule+0x4e5/0xaf0
>  process_one_work+0x140/0x2b0
>  worker_thread+0x2ea/0x430
>  ? process_one_work+0x2b0/0x2b0
>  kthread+0xd4/0x1d0
>  ? kthreads_online_cpu+0xf0/0xf0
>  ret_from_fork+0x2d/0x50
>  ? kthreads_online_cpu+0xf0/0xf0
>  ret_from_fork_asm+0x11/0x20
>  </TASK>
> usb 3-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
> usb 1-1.2: Product: Standard USB Keyboard 
> hub 3-1:1.0: USB hub found
> usb 1-1.2: Manufacturer: Silitek
>
> which appears to be this spin section:
>
>         /* We spin waiting for the owner to release us */
>         spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
>         /* Owner will clear console_waiter on hand off */
>         while (READ_ONCE(console_waiter))
>                 cpu_relax();
>         spin_release(&console_owner_dep_map, _THIS_IP_);

This code is waiting for a CPU to finish printing a line so that it can
take over printing the next line.

If you are using CONFIG_HZ_1000 and a baud rate of 115200bps, you will
only get about 14 characters printed per jiffy. The last line
successfully printed has 68 characters and the expedited grace period
timed out after only 3 jiffies (which would be around 42 characters). So
the numbers and the timeout make sense. But why is the grace period so
short?

John Ogness

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: printk NMI splat on boot
  2025-05-20 20:08 ` John Ogness
@ 2025-05-20 20:18   ` Jens Axboe
  2025-05-20 20:41     ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2025-05-20 20:18 UTC (permalink / raw)
  To: John Ogness, LKML
  Cc: Petr Mladek, Steven Rostedt, senozhatsky@chromium.org,
	Paul E. McKenney

On 5/20/25 2:08 PM, John Ogness wrote:
> Added Cc: Paul E. McKenney <paulmck@kernel.org>
> 
> Hi Jens,
> 
> On 2025-05-20, Jens Axboe <axboe@kernel.dk> wrote:
>> This has been going on for a while, and finally getting around to
>> reporting it. For every boot on my Dell R7625, I get one of these:
> 
> Is it possible that you could compile with:
> 
> CONFIG_PRINTK_TIME=y
> CONFIG_PRINTK_CALLER=y
> 
> so that we can see some timestamps and from which CPU the printk calls
> are originating from?

Sure, let me try that right now. Here it is:

[   21.450298] [   T3193] ahci 0000:65:00.0: 4/4 ports implemented (port mask 0xf)
[   21.458971] [     T17] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 32-.... 96-.... } 3 jiffies s: 29 root: 0x44/.
[   21.458980] [     T17] rcu: blocking rcu_node structures (internal RCU debug): l=1:32-47:0x1/. l=1:96-111:0x1/.
[   21.458985] [     T17] Sending NMI from CPU 0 to CPUs 32:
[   21.458990] [     C32] NMI backtrace for cpu 32
[   21.458993] [     C32] CPU: 32 UID: 0 PID: 3193 Comm: kworker/32:1 Not tainted 6.15.0-rc7+ #255 NONE
[   21.458995] [     C32] Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.11.2 12/18/2024
[   21.458997] [     C32] Workqueue: events work_for_cpu_fn
[   21.459002] [     C32] RIP: 0010:vprintk_emit+0x360/0x430
[   21.459004] [     C32] Code: 87 01 0f 84 91 00 00 00 c6 05 76 ec 8a 01 01 c6 05 7c ec 8a 01 00 0f b6 05 68 ec 8a 01 84 c0 74 0d f3 90 0f b6 05 5b ec 8a 01 <84> c0 75 f3 e8 47 0d 00 00 80 e7 02 74 01 fb c7 05 fb ec 8c 01 00
[   21.459006] [     C32] RSP: 0018:ffffaf75617b7b18 EFLAGS: 00000002
[   21.459007] [     C32] RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000000000
[   21.459008] [     C32] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffff84019c48
[   21.459009] [     C32] RBP: ffffaf75617b7be8 R08: 00000000fffdffff R09: ffffa1a6e58fffa8
[   21.459009] [     C32] R10: ffffa1a6e4e00000 R11: 0000000000000002 R12: ffffaf75617b7b80
[   21.459010] [     C32] R13: ffffffff8360f9d1 R14: ffffaf75617b7bf8 R15: 0000000000000038
[   21.459011] [     C32] FS:  0000000000000000(0000) GS:ffffa10563a34000(0000) knlGS:0000000000000000
[   21.459012] [     C32] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   21.459012] [     C32] CR2: 0000000000000000 CR3: 000000270b02a001 CR4: 0000000000370ef0
[   21.459013] [     C32] Call Trace:
[   21.459015] [     C32]  <TASK>
[   21.459017] [     C32]  dev_vprintk_emit+0x130/0x140
[   21.459023] [     C32]  dev_printk_emit+0x3e/0x40
[   21.459025] [     C32]  ? __dev_printk+0x2d/0x70
[   21.459028] [     C32]  _dev_info+0x5c/0x5e
[   21.459031] [     C32]  ahci_print_info+0xab/0x30c [libahci]
[   21.459036] [     C32]  ? pci_conf1_read+0xae/0xf0
[   21.459039] [     C32]  ahci_init_one+0xa3a/0xbf5 [ahci]
[   21.459045] [     C32]  ? sched_balance_newidle+0x2aa/0x3c0
[   21.459048] [     C32]  local_pci_probe+0x3c/0x80
[   21.459050] [     C32]  work_for_cpu_fn+0x10/0x20
[   21.459052] [     C32]  process_one_work+0x140/0x2b0
[   21.459054] [     C32]  worker_thread+0x2ea/0x430
[   21.459056] [     C32]  ? process_one_work+0x2b0/0x2b0
[   21.459057] [     C32]  kthread+0xd4/0x1d0
[   21.459059] [     C32]  ? kthreads_online_cpu+0xf0/0xf0
[   21.459061] [     C32]  ret_from_fork+0x2d/0x50
[   21.459062] [     C32]  ? kthreads_online_cpu+0xf0/0xf0
[   21.459064] [     C32]  ret_from_fork_asm+0x11/0x20
[   21.459066] [     C32]  </TASK>
[   21.459989] [     T17] Sending NMI from CPU 0 to CPUs 96:
[   21.460026] [     C96] NMI backtrace for cpu 96
[   21.460028] [     C96] CPU: 96 UID: 0 PID: 503 Comm: kworker/96:0 Not tainted 6.15.0-rc7+ #255 NONE
[   21.460030] [     C96] Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.11.2 12/18/2024
[   21.460032] [     C96] Workqueue: events work_for_cpu_fn
[   21.460037] [     C96] RIP: 0010:fbcon_putcs+0xf0/0x150
[   21.460041] [     C96] Code: e5 ff ff 4c 63 44 24 0c 48 89 ee 48 89 df 89 c2 44 8b 4c 24 08 4c 89 c0 48 c1 e0 04 4c 29 c0 48 8d 04 c5 40 71 06 84 8b 48 10 <0f> bf 40 0e 41 51 45 89 f9 52 4c 89 e2 4c 8b 54 24 10 44 01 f0 41
[   21.460042] [     C96] RSP: 0018:ffffaf755af47968 EFLAGS: 00000046
[   21.460044] [     C96] RAX: ffffffff84067140 RBX: ffffa0e780215c00 RCX: 0000000000000030
[   21.460045] [     C96] RDX: 0000000000000007 RSI: ffffa0ff018a3800 RDI: ffffa0e780215c00
[   21.460046] [     C96] RBP: ffffa0ff018a3800 R08: 0000000000000000 R09: 0000000000000000
[   21.460047] [     C96] R10: ffffffff82c88930 R11: 000000000000002f R12: ffffa0ff0cca9f5c
[   21.460048] [     C96] R13: 000000000000001f R14: 000000000000001e R15: 000000000000002e
[   21.460049] [     C96] FS:  0000000000000000(0000) GS:ffffa13563a34000(0000) knlGS:0000000000000000
[   21.460050] [     C96] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   21.460051] [     C96] CR2: 00007fe388c23e10 CR3: 0000007885bd2002 CR4: 0000000000370ef0
[   21.460052] [     C96] Call Trace:
[   21.460054] [     C96]  <TASK>
[   21.460055] [     C96]  ? bit_update_start+0x40/0x40
[   21.460059] [     C96]  fbcon_redraw+0x103/0x240
[   21.460061] [     C96]  fbcon_scroll+0x15a/0x1a0
[   21.460063] [     C96]  con_scroll+0xf4/0x200
[   21.460065] [     C96]  ? fbcon_putcs+0x117/0x150
[   21.460066] [     C96]  lf+0x83/0x90
[   21.460067] [     C96]  vt_console_print+0x2b9/0x420
[   21.460069] [     C96]  console_flush_all+0x2da/0x5b0
[   21.460071] [     C96]  console_unlock+0xd9/0x240
[   21.460072] [     C96]  vprintk_emit+0x37e/0x430
[   21.460073] [     C96]  dev_vprintk_emit+0x130/0x140
[   21.460076] [     C96]  ? kernfs_activate+0x80/0xa0
[   21.460078] [     C96]  dev_printk_emit+0x3e/0x40
[   21.460080] [     C96]  ? __dev_printk+0x2d/0x70
[   21.460084] [     C96]  _dev_info+0x5c/0x5e
[   21.460089] [     C96]  usb_add_hcd.cold+0xd2/0x57d [usbcore]
[   21.460103] [     C96]  usb_hcd_pci_probe+0x16f/0x3e0 [usbcore]
[   21.460114] [     C96]  xhci_pci_common_probe+0x5c/0x220 [xhci_pci]
[   21.460118] [     C96]  local_pci_probe+0x3c/0x80
[   21.460121] [     C96]  work_for_cpu_fn+0x10/0x20
[   21.460122] [     C96]  process_one_work+0x140/0x2b0
[   21.460124] [     C96]  worker_thread+0x2ea/0x430
[   21.460126] [     C96]  ? process_one_work+0x2b0/0x2b0
[   21.460128] [     C96]  kthread+0xd4/0x1d0
[   21.460130] [     C96]  ? kthreads_online_cpu+0xf0/0xf0
[   21.460131] [     C96]  ret_from_fork+0x2d/0x50
[   21.460133] [     C96]  ? kthreads_online_cpu+0xf0/0xf0
[   21.460134] [     C96]  ret_from_fork_asm+0x11/0x20
[   21.460136] [     C96]  </TASK>

>> bnxt_en 0000:01:00.2 eno12419np2: renamed from eth2
>> usb 1-1.2: new low-speed USB device number 3 using xhci_hcd
>> usb 1-1.2: New USB device found, idVendor=047b, idProduct=0011, bcdDevice= 1.00
>> usb 3-1: New USB device found, idVendor=1604, idProduct=10c0, bcdDevice= 0.00
>> usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
>> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 96-.... } 3 jiffies s: 53 root: 0x40/.
> 
> The expedited grace period has only been going on for 3 jiffies.
> 
> What values are you using for CONFIG_RCU_EXP_CPU_STALL_TIMEOUT and
> CONFIG_RCU_CPU_STALL_TIMEOUT?

CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=2

> What CONFIG_*HZ* options are you using?

CONFIG_HZ=100

>> rcu: blocking rcu_node structures (internal RCU debug): l=1:96-111:0x1/.
>> Sending NMI from CPU 0 to CPUs 96:
>> NMI backtrace for cpu 96
>> CPU: 96 UID: 0 PID: 3241 Comm: kworker/96:1 Not tainted 6.15.0-rc7+ #254 NONE 
>> Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.11.2 12/18/2024
>> Workqueue: usb_hub_wq hub_event [usbcore]
>> RIP: 0010:vprintk_emit+0x360/0x430
>> Code: 87 01 0f 84 91 00 00 00 c6 05 36 ec 8a 01 01 c6 05 3c ec 8a 01 00 0f b6 05 28 ec 8a 01 84 c0 74 0d f3 90 0f b6 05 1b ec 8a 01 <84> c0 75 f3 e8 47 0d 00 00 80 e7 02 74 01 fb c7 05 bb ec 8c 01 00
>> RSP: 0018:ffffbe47220c7b78 EFLAGS: 00000002
>> RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffff93019c48
>> RBP: ffffbe47220c7c48 R08: 00000000fffdffff R09: ffffa1bf658fffa8
>> R10: ffffa1bf64e00000 R11: 0000000000000002 R12: ffffbe47220c7be0
>> R13: ffffffff9260f8ee R14: ffffbe47220c7c58 R15: 0000000000000043
>> FS:  0000000000000000(0000) GS:ffffa14dd4a34000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000561066512398 CR3: 00000090a0a11002 CR4: 0000000000370ef0
>> Call Trace:
>>  <TASK>
>>  dev_vprintk_emit+0x130/0x140
>>  dev_printk_emit+0x3e/0x40
>>  ? __dev_printk+0x2d/0x70
>>  _dev_info+0x5c/0x5e
>>  usb_new_device.cold+0x87/0x3a0 [usbcore]
>>  hub_event+0x1113/0x1900 [usbcore]
>>  ? __schedule+0x4e5/0xaf0
>>  process_one_work+0x140/0x2b0
>>  worker_thread+0x2ea/0x430
>>  ? process_one_work+0x2b0/0x2b0
>>  kthread+0xd4/0x1d0
>>  ? kthreads_online_cpu+0xf0/0xf0
>>  ret_from_fork+0x2d/0x50
>>  ? kthreads_online_cpu+0xf0/0xf0
>>  ret_from_fork_asm+0x11/0x20
>>  </TASK>
>> usb 3-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
>> usb 1-1.2: Product: Standard USB Keyboard 
>> hub 3-1:1.0: USB hub found
>> usb 1-1.2: Manufacturer: Silitek
>>
>> which appears to be this spin section:
>>
>>         /* We spin waiting for the owner to release us */
>>         spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
>>         /* Owner will clear console_waiter on hand off */
>>         while (READ_ONCE(console_waiter))
>>                 cpu_relax();
>>         spin_release(&console_owner_dep_map, _THIS_IP_);
> 
> This code is waiting for a CPU to finish printing a line so that it can
> take over printing the next line.
> 
> If you are using CONFIG_HZ_1000 and a baud rate of 115200bps, you will
> only get about 14 characters printed per jiffy. The last line
> successfully printed has 68 characters and the expedited grace period
> timed out after only 3 jiffies (which would be around 42 characters). So
> the numbers and the timeout make sense. But why is the grace period so
> short?

I think I just used whatever defaults were there. At least I didn't set
anything low on purpose. I do have a serial console turned on, which I'm
sure would explain why it's so slow (even if not used).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: printk NMI splat on boot
  2025-05-20 20:18   ` Jens Axboe
@ 2025-05-20 20:41     ` Jens Axboe
  2025-05-21  6:06       ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2025-05-20 20:41 UTC (permalink / raw)
  To: John Ogness, LKML
  Cc: Petr Mladek, Steven Rostedt, senozhatsky@chromium.org,
	Paul E. McKenney

On 5/20/25 2:18 PM, Jens Axboe wrote:
>> What values are you using for CONFIG_RCU_EXP_CPU_STALL_TIMEOUT and
>> CONFIG_RCU_CPU_STALL_TIMEOUT?
> 
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=2

This was =20 btw, guess it could cut a bit too much...

Changed them to:

CONFIG_RCU_CPU_STALL_TIMEOUT=100
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0

and complaining is gone.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: printk NMI splat on boot
  2025-05-20 20:41     ` Jens Axboe
@ 2025-05-21  6:06       ` Paul E. McKenney
  2025-05-21 13:05         ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2025-05-21  6:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: John Ogness, LKML, Petr Mladek, Steven Rostedt,
	senozhatsky@chromium.org

On Tue, May 20, 2025 at 02:41:40PM -0600, Jens Axboe wrote:
> On 5/20/25 2:18 PM, Jens Axboe wrote:
> >> What values are you using for CONFIG_RCU_EXP_CPU_STALL_TIMEOUT and
> >> CONFIG_RCU_CPU_STALL_TIMEOUT?
> > 
> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=2
> 
> This was =20 btw, guess it could cut a bit too much...

Just confirming that setting CONFIG_RCU_EXP_CPU_STALL_TIMEOUT to two
milliseconds is more than a bit on the aggressive side.  ;-)

Setting it to 20 milliseconds is OK for smartphone-class devices, but
to the best of my knowledge, setting it less than 21 seconds (as in
21,000 milliseconds) has not been tested on any other platform.

> Changed them to:
> 
> CONFIG_RCU_CPU_STALL_TIMEOUT=100
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> 
> and complaining is gone.

This makes it take the default, which in this case would be the specified
CONFIG_RCU_CPU_STALL_TIMEOUT value of 100 seconds.  Which is an unusually
long timeout -- mainline these days is 21 seconds and some distros still
use the old value of 60 seconds.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: printk NMI splat on boot
  2025-05-21  6:06       ` Paul E. McKenney
@ 2025-05-21 13:05         ` Jens Axboe
  2025-05-21 15:06           ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2025-05-21 13:05 UTC (permalink / raw)
  To: paulmck
  Cc: John Ogness, LKML, Petr Mladek, Steven Rostedt,
	senozhatsky@chromium.org

On 5/21/25 12:06 AM, Paul E. McKenney wrote:
> On Tue, May 20, 2025 at 02:41:40PM -0600, Jens Axboe wrote:
>> On 5/20/25 2:18 PM, Jens Axboe wrote:
>>>> What values are you using for CONFIG_RCU_EXP_CPU_STALL_TIMEOUT and
>>>> CONFIG_RCU_CPU_STALL_TIMEOUT?
>>>
>>> CONFIG_RCU_CPU_STALL_TIMEOUT=21
>>> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=2
>>
>> This was =20 btw, guess it could cut a bit too much...
> 
> Just confirming that setting CONFIG_RCU_EXP_CPU_STALL_TIMEOUT to two
> milliseconds is more than a bit on the aggressive side.  ;-)

Sorry guess I wasn't clear - I had pasted in =2, but the setting in my
config was =20.

> Setting it to 20 milliseconds is OK for smartphone-class devices, but
> to the best of my knowledge, setting it less than 21 seconds (as in
> 21,000 milliseconds) has not been tested on any other platform.
> 
>> Changed them to:
>>
>> CONFIG_RCU_CPU_STALL_TIMEOUT=100
>> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
>>
>> and complaining is gone.
> 
> This makes it take the default, which in this case would be the specified
> CONFIG_RCU_CPU_STALL_TIMEOUT value of 100 seconds.  Which is an unusually
> long timeout -- mainline these days is 21 seconds and some distros still
> use the old value of 60 seconds.

IMHO the settings for these are very odd. Which I guess is fine for
debugging kind of infrastructure, but fairly nonsensical in any case.
But not really that important - looks like RCU_EXP_CPU_STALL_TIMEOUT has
a default of '0' so not sure how on earth I ended up with 20 in that
one. Most likely from not reading the help entry and hence setting it
similarly to RCU_CPU_STALL_TIMEOUT.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: printk NMI splat on boot
  2025-05-21 13:05         ` Jens Axboe
@ 2025-05-21 15:06           ` Paul E. McKenney
  0 siblings, 0 replies; 7+ messages in thread
From: Paul E. McKenney @ 2025-05-21 15:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: John Ogness, LKML, Petr Mladek, Steven Rostedt,
	senozhatsky@chromium.org

On Wed, May 21, 2025 at 07:05:09AM -0600, Jens Axboe wrote:
> On 5/21/25 12:06 AM, Paul E. McKenney wrote:
> > On Tue, May 20, 2025 at 02:41:40PM -0600, Jens Axboe wrote:
> >> On 5/20/25 2:18 PM, Jens Axboe wrote:
> >>>> What values are you using for CONFIG_RCU_EXP_CPU_STALL_TIMEOUT and
> >>>> CONFIG_RCU_CPU_STALL_TIMEOUT?
> >>>
> >>> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> >>> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=2
> >>
> >> This was =20 btw, guess it could cut a bit too much...
> > 
> > Just confirming that setting CONFIG_RCU_EXP_CPU_STALL_TIMEOUT to two
> > milliseconds is more than a bit on the aggressive side.  ;-)
> 
> Sorry guess I wasn't clear - I had pasted in =2, but the setting in my
> config was =20.

Ah, got it!  Less aggressive, but not recommended for other than small
devices, such as Android smartphones.

> > Setting it to 20 milliseconds is OK for smartphone-class devices, but
> > to the best of my knowledge, setting it less than 21 seconds (as in
> > 21,000 milliseconds) has not been tested on any other platform.
> > 
> >> Changed them to:
> >>
> >> CONFIG_RCU_CPU_STALL_TIMEOUT=100
> >> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> >>
> >> and complaining is gone.
> > 
> > This makes it take the default, which in this case would be the specified
> > CONFIG_RCU_CPU_STALL_TIMEOUT value of 100 seconds.  Which is an unusually
> > long timeout -- mainline these days is 21 seconds and some distros still
> > use the old value of 60 seconds.
> 
> IMHO the settings for these are very odd. Which I guess is fine for
> debugging kind of infrastructure, but fairly nonsensical in any case.
> But not really that important - looks like RCU_EXP_CPU_STALL_TIMEOUT has
> a default of '0' so not sure how on earth I ended up with 20 in that
> one. Most likely from not reading the help entry and hence setting it
> similarly to RCU_CPU_STALL_TIMEOUT.

Agreed, it would be better if both had the same units.  But back 20 years
ago, milliseconds would have seemed insane for RCU_CPU_STALL_TIMEOUT,
in fact, many would have argued that the current setting of "only"
21 seconds would be insane.  And then a few years ago, people really
needed milliseconds for RCU_EXP_CPU_STALL_TIMEOUT, so here we are...

Me, I should have seen it coming.  After all, the equivalent to
RCU_CPU_STALL_TIMEOUT in DYNIX/ptx was 1.5 seconds.  But, again, here
we are...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-05-21 15:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-20 16:38 printk NMI splat on boot Jens Axboe
2025-05-20 20:08 ` John Ogness
2025-05-20 20:18   ` Jens Axboe
2025-05-20 20:41     ` Jens Axboe
2025-05-21  6:06       ` Paul E. McKenney
2025-05-21 13:05         ` Jens Axboe
2025-05-21 15:06           ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).