[linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck

linux-serial.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
@ 2025-01-22  2:28 kernel test robot
  2025-01-22  8:41 ` John Ogness
  2025-01-24 16:10 ` Petr Mladek
  0 siblings, 2 replies; 21+ messages in thread
From: kernel test robot @ 2025-01-22  2:28 UTC (permalink / raw)
  To: John Ogness
  Cc: oe-lkp, lkp, Greg Kroah-Hartman, Petr Mladek, linux-serial,
	oliver.sang



Hello,

kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on:

commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b]

in testcase: kunit
version: 
with following parameters:

	group: group-01



config: x86_64-rhel-9.4-kunit
compiler: gcc-12
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com


[  231.759560][    C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860]
[  231.759572][    C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu]
[  231.759681][    C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S  B            N 6.13.0-rc3-00034-gb63e6f60eab4 #1
[  231.759690][    C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST
[  231.759694][    C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
[ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
[ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9
All code
========
   0:	87 68 83             	xchg   %ebp,-0x7d(%rax)
   3:	00 00                	add    %al,(%rax)
   5:	89 da                	mov    %ebx,%edx
   7:	d3 fa                	sar    %cl,%edx
   9:	41 0f b6 4d 00       	movzbl 0x0(%r13),%ecx
   e:	41 38 cc             	cmp    %cl,%r12b
  11:	7c 08                	jl     0x1b
  13:	84 c9                	test   %cl,%cl
  15:	0f 85 64 02 00 00    	jne    0x27f
  1b:	8b 75 04             	mov    0x4(%rbp),%esi
  1e:	21 f3                	and    %esi,%ebx
  20:	01 d3                	add    %edx,%ebx
  22:	39 de                	cmp    %ebx,%esi
  24:	7e c0                	jle    0xffffffffffffffe6
  26:	48 8b 3c 24          	mov    (%rsp),%rdi
  2a:*	48 63 db             	movslq %ebx,%rbx		<-- trapping instruction
  2d:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
  31:	48 89 d9             	mov    %rbx,%rcx
  34:	48 c1 e9 03          	shr    $0x3,%rcx
  38:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
  3d:	48 89 d9             	mov    %rbx,%rcx

Code starting with the faulting instruction
===========================================
   0:	48 63 db             	movslq %ebx,%rbx
   3:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
   7:	48 89 d9             	mov    %rbx,%rcx
   a:	48 c1 e9 03          	shr    $0x3,%rcx
   e:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
  13:	48 89 d9             	mov    %rbx,%rcx
[  231.759717][    C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297
[  231.759723][    C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000
[  231.759728][    C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00
[  231.759733][    C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e
[  231.759738][    C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007
[  231.759743][    C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000
[  231.759748][    C3] FS:  00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000
[  231.759754][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  231.759759][    C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0
[  231.759764][    C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e
[  231.759769][    C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  231.759774][    C3] Call Trace:
[  231.759778][    C3]  <IRQ>
[ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770) 
[ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685) 
[ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803) 
[ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773) 
[ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 (discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3)) 
[ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286) 
[ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868) 
[ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1056) 
[ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) 
[  231.759842][    C3]  </IRQ>
[  231.759845][    C3]  <TASK>
[ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) 
[ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
[ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib 
[ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib 
[ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib 
[ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib 
[ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib 
[ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib 
[ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib 
[ 231.759915][ C3] do_one_initcall (init/main.c:1266) 
[ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) 
[ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257) 
[ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345) 
[ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324) 
[ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182) 
[ 231.759951][ C3] do_init_module (kernel/module/main.c:2910) 
[ 231.759958][ C3] load_module (kernel/module/main.c:3376) 
[ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223) 
[ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363) 
[ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122) 
[ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38) 
[ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565) 
[ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565) 
[ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541) 
[ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153) 
[ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577) 
[ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569) 
[ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.c:1145 fs/file.c:1159) 
[ 231.760028][ C3] ? security_capable (security/security.c:1142) 
[ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.c:3587 kernel/module/main.c:3587) 
[ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) 
[ 231.760046][ C3] ? do_mmap (mm/mmap.c:496) 
[ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838) 
[ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288) 
[ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590) 
[ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586) 
[ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584) 
[ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570) 
[ 231.760089][ C3] ? fget (fs/file.c:1063) 
[ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.h:157 fs/file_table.c:501) 
[ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547) 
[ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
[ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
[ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
[ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340) 
[ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
[ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
[ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) 
[ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[  231.760148][    C3] RIP: 0033:0x7f64c7711719
[ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
All code
========
   0:	08 89 e8 5b 5d c3    	or     %cl,-0x3ca2a418(%rcx)
   6:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
   d:	00 00 00 
  10:	90                   	nop
  11:	48 89 f8             	mov    %rdi,%rax
  14:	48 89 f7             	mov    %rsi,%rdi
  17:	48 89 d6             	mov    %rdx,%rsi
  1a:	48 89 ca             	mov    %rcx,%rdx
  1d:	4d 89 c2             	mov    %r8,%r10
  20:	4d 89 c8             	mov    %r9,%r8
  23:	4c 8b 4c 24 08       	mov    0x8(%rsp),%r9
  28:	0f 05                	syscall
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	ret
  33:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06f1
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	ret
   9:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06c7
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-01-22  2:28 [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] kernel test robot
@ 2025-01-22  8:41 ` John Ogness
  2025-01-22  9:37   ` Greg Kroah-Hartman
  2025-01-24 16:10 ` Petr Mladek
  1 sibling, 1 reply; 21+ messages in thread
From: John Ogness @ 2025-01-22  8:41 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, Greg Kroah-Hartman, Petr Mladek, linux-serial,
	oliver.sang

Hi Greg,

On 2025-01-22, kernel test robot <oliver.sang@intel.com> wrote:
> kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on:
>
> commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

It looks like we need to revert this commit (and the one after it) until
I can find out what is going on here.

From linux-next they are:

b63e6f60eab4 serial: 8250: Switch to nbcon console

422c9727b07f serial: 8250: Revert "drop lockdep annotation from serial8250_clear_IER()"

I hope to get this solved soon, but I am currently taking part in a Rust
training, which does not leave me many extra brain cycles at the moment.

John Ogness

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-01-22  8:41 ` John Ogness
@ 2025-01-22  9:37   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 21+ messages in thread
From: Greg Kroah-Hartman @ 2025-01-22  9:37 UTC (permalink / raw)
  To: John Ogness; +Cc: kernel test robot, oe-lkp, lkp, Petr Mladek, linux-serial

On Wed, Jan 22, 2025 at 09:47:00AM +0106, John Ogness wrote:
> Hi Greg,
> 
> On 2025-01-22, kernel test robot <oliver.sang@intel.com> wrote:
> > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on:
> >
> > commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> It looks like we need to revert this commit (and the one after it) until
> I can find out what is going on here.
> 
> >From linux-next they are:
> 
> b63e6f60eab4 serial: 8250: Switch to nbcon console
> 
> 422c9727b07f serial: 8250: Revert "drop lockdep annotation from serial8250_clear_IER()"

Ok, now reverted, thanks for finding the offending commits.

> I hope to get this solved soon, but I am currently taking part in a Rust
> training, which does not leave me many extra brain cycles at the moment.

Totally understood :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-01-22  2:28 [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] kernel test robot
  2025-01-22  8:41 ` John Ogness
@ 2025-01-24 16:10 ` Petr Mladek
  2025-01-24 16:39   ` Petr Mladek
  2025-03-15  3:38   ` Ryo Takakura
  1 sibling, 2 replies; 21+ messages in thread
From: Petr Mladek @ 2025-01-24 16:10 UTC (permalink / raw)
  To: kernel test robot
  Cc: John Ogness, oe-lkp, lkp, Greg Kroah-Hartman, linux-serial,
	Jason A. Donenfeld

Hi,

I am Jason into Cc. I wonder if the softlockup might be caused
by a lack of entropy.

On Wed 2025-01-22 10:28:52, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on:
> 
> commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> [test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b]
> 
> in testcase: kunit
> version: 
> with following parameters:
> 
> 	group: group-01
> 
> 
> 
> config: x86_64-rhel-9.4-kunit
> compiler: gcc-12
> test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com
> 
> 
> [  231.759560][    C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860]
> [  231.759572][    C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu]
> [  231.759681][    C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S  B            N 6.13.0-rc3-00034-gb63e6f60eab4 #1
> [  231.759690][    C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST
> [  231.759694][    C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
> [ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
> [ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9
> All code
> ========
>    0:	87 68 83             	xchg   %ebp,-0x7d(%rax)
>    3:	00 00                	add    %al,(%rax)
>    5:	89 da                	mov    %ebx,%edx
>    7:	d3 fa                	sar    %cl,%edx
>    9:	41 0f b6 4d 00       	movzbl 0x0(%r13),%ecx
>    e:	41 38 cc             	cmp    %cl,%r12b
>   11:	7c 08                	jl     0x1b
>   13:	84 c9                	test   %cl,%cl
>   15:	0f 85 64 02 00 00    	jne    0x27f
>   1b:	8b 75 04             	mov    0x4(%rbp),%esi
>   1e:	21 f3                	and    %esi,%ebx
>   20:	01 d3                	add    %edx,%ebx
>   22:	39 de                	cmp    %ebx,%esi
>   24:	7e c0                	jle    0xffffffffffffffe6
>   26:	48 8b 3c 24          	mov    (%rsp),%rdi
>   2a:*	48 63 db             	movslq %ebx,%rbx		<-- trapping instruction
>   2d:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
>   31:	48 89 d9             	mov    %rbx,%rcx
>   34:	48 c1 e9 03          	shr    $0x3,%rcx
>   38:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
>   3d:	48 89 d9             	mov    %rbx,%rcx
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	48 63 db             	movslq %ebx,%rbx
>    3:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
>    7:	48 89 d9             	mov    %rbx,%rcx
>    a:	48 c1 e9 03          	shr    $0x3,%rcx
>    e:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
>   13:	48 89 d9             	mov    %rbx,%rcx
> [  231.759717][    C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297
> [  231.759723][    C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000
> [  231.759728][    C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00
> [  231.759733][    C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e
> [  231.759738][    C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007
> [  231.759743][    C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000
> [  231.759748][    C3] FS:  00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000
> [  231.759754][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  231.759759][    C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0
> [  231.759764][    C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e
> [  231.759769][    C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600
> [  231.759774][    C3] Call Trace:
> [  231.759778][    C3]  <IRQ>
> [ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770) 
> [ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685) 
> [ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803) 
> [ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773) 
> [ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 (discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3)) 
> [ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286) 
> [ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868) 
> [ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1056) 
> [ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) 
> [  231.759842][    C3]  </IRQ>
> [  231.759845][    C3]  <TASK>
> [ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) 
> [ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
> [ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib 

Honestly, I do not see much how this could be related to the serial
console. This is a module for testing the Generic Reed Solomon
encoder / decoder library.

It seems to do a lot of computation and needs a lot of random numbers.
I wonder if there is not enough entropy and the test is too slow.

From the config file:

CONFIG_PREEMPT_VOLUNTARY=y

I wonder if a cond_resched() in some loop would help. Or using a
pseudorandom generator. I remember the problems related to much
slower random generator, for example, see the commit f900fde28883602b6
("crypto: testmgr - fix RNG performance in fuzz tests").

That said, I did not dig deep into the code. And I did not try to
reproduce the softlockup. I am pretty busy at the moment with some
other stuff. I just wanted to give it a look and share my opinion.

Best Regards,
Petr


> [ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib 
> [ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib 
> [ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib 
> [ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib 
> [ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib 
> [ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib 
> [ 231.759915][ C3] do_one_initcall (init/main.c:1266) 
> [ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) 
> [ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257) 
> [ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345) 
> [ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324) 
> [ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182) 
> [ 231.759951][ C3] do_init_module (kernel/module/main.c:2910) 
> [ 231.759958][ C3] load_module (kernel/module/main.c:3376) 
> [ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223) 
> [ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363) 
> [ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122) 
> [ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38) 
> [ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565) 
> [ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565) 
> [ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541) 
> [ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153) 
> [ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577) 
> [ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569) 
> [ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.c:1145 fs/file.c:1159) 
> [ 231.760028][ C3] ? security_capable (security/security.c:1142) 
> [ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.c:3587 kernel/module/main.c:3587) 
> [ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) 
> [ 231.760046][ C3] ? do_mmap (mm/mmap.c:496) 
> [ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838) 
> [ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288) 
> [ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590) 
> [ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586) 
> [ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584) 
> [ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570) 
> [ 231.760089][ C3] ? fget (fs/file.c:1063) 
> [ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.h:157 fs/file_table.c:501) 
> [ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547) 
> [ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
> [ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
> [ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
> [ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340) 
> [ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
> [ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
> [ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) 
> [ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
> [  231.760148][    C3] RIP: 0033:0x7f64c7711719
> [ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
> All code
> ========
>    0:	08 89 e8 5b 5d c3    	or     %cl,-0x3ca2a418(%rcx)
>    6:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
>    d:	00 00 00 
>   10:	90                   	nop
>   11:	48 89 f8             	mov    %rdi,%rax
>   14:	48 89 f7             	mov    %rsi,%rdi
>   17:	48 89 d6             	mov    %rdx,%rsi
>   1a:	48 89 ca             	mov    %rcx,%rdx
>   1d:	4d 89 c2             	mov    %r8,%r10
>   20:	4d 89 c8             	mov    %r9,%r8
>   23:	4c 8b 4c 24 08       	mov    0x8(%rsp),%r9
>   28:	0f 05                	syscall
>   2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
>   30:	73 01                	jae    0x33
>   32:	c3                   	ret
>   33:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06f1
>   3a:	f7 d8                	neg    %eax
>   3c:	64 89 01             	mov    %eax,%fs:(%rcx)
>   3f:	48                   	rex.W
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
>    6:	73 01                	jae    0x9
>    8:	c3                   	ret
>    9:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06c7
>   10:	f7 d8                	neg    %eax
>   12:	64 89 01             	mov    %eax,%fs:(%rcx)
>   15:	48                   	rex.W
> 
> 
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com
> 
> 
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-01-24 16:10 ` Petr Mladek
@ 2025-01-24 16:39   ` Petr Mladek
  2025-03-15  3:38   ` Ryo Takakura
  1 sibling, 0 replies; 21+ messages in thread
From: Petr Mladek @ 2025-01-24 16:39 UTC (permalink / raw)
  To: kernel test robot
  Cc: John Ogness, oe-lkp, lkp, Greg Kroah-Hartman, linux-serial,
	Jason A. Donenfeld

On Fri 2025-01-24 17:10:13, Petr Mladek wrote:
> Hi,
> 
> I am Jason into Cc. I wonder if the softlockup might be caused
> by a lack of entropy.
> 
> On Wed 2025-01-22 10:28:52, kernel test robot wrote:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on:
> > 
> > commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > 
> > [test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b]
> > 
> > in testcase: kunit
> > version: 
> > with following parameters:
> > 
> > 	group: group-01
> > 
> > 
> > 
> > config: x86_64-rhel-9.4-kunit
> > compiler: gcc-12
> > test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory
> > 
> > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > 
> > 
> > 
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com
> > 
> > 
> > [  231.759560][    C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860]
> > [  231.759572][    C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu]
> > [  231.759681][    C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S  B            N 6.13.0-rc3-00034-gb63e6f60eab4 #1
> > [  231.759690][    C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST
> > [  231.759694][    C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
> > [ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
> > [ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9
> > All code
> > ========
> >    0:	87 68 83             	xchg   %ebp,-0x7d(%rax)
> >    3:	00 00                	add    %al,(%rax)
> >    5:	89 da                	mov    %ebx,%edx
> >    7:	d3 fa                	sar    %cl,%edx
> >    9:	41 0f b6 4d 00       	movzbl 0x0(%r13),%ecx
> >    e:	41 38 cc             	cmp    %cl,%r12b
> >   11:	7c 08                	jl     0x1b
> >   13:	84 c9                	test   %cl,%cl
> >   15:	0f 85 64 02 00 00    	jne    0x27f
> >   1b:	8b 75 04             	mov    0x4(%rbp),%esi
> >   1e:	21 f3                	and    %esi,%ebx
> >   20:	01 d3                	add    %edx,%ebx
> >   22:	39 de                	cmp    %ebx,%esi
> >   24:	7e c0                	jle    0xffffffffffffffe6
> >   26:	48 8b 3c 24          	mov    (%rsp),%rdi
> >   2a:*	48 63 db             	movslq %ebx,%rbx		<-- trapping instruction
> >   2d:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
> >   31:	48 89 d9             	mov    %rbx,%rcx
> >   34:	48 c1 e9 03          	shr    $0x3,%rcx
> >   38:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
> >   3d:	48 89 d9             	mov    %rbx,%rcx
> > 
> > Code starting with the faulting instruction
> > ===========================================
> >    0:	48 63 db             	movslq %ebx,%rbx
> >    3:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
> >    7:	48 89 d9             	mov    %rbx,%rcx
> >    a:	48 c1 e9 03          	shr    $0x3,%rcx
> >    e:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
> >   13:	48 89 d9             	mov    %rbx,%rcx
> > [  231.759717][    C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297
> > [  231.759723][    C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000
> > [  231.759728][    C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00
> > [  231.759733][    C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e
> > [  231.759738][    C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007
> > [  231.759743][    C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000
> > [  231.759748][    C3] FS:  00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000
> > [  231.759754][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  231.759759][    C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0
> > [  231.759764][    C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e
> > [  231.759769][    C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600
> > [  231.759774][    C3] Call Trace:
> > [  231.759778][    C3]  <IRQ>
> > [ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770) 
> > [ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685) 
> > [ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803) 
> > [ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773) 
> > [ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 (discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3)) 
> > [ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286) 
> > [ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868) 
> > [ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1056) 
> > [ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) 
> > [  231.759842][    C3]  </IRQ>
> > [  231.759845][    C3]  <TASK>
> > [ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) 
> > [ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
> > [ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib 
> 
> Honestly, I do not see much how this could be related to the serial
> console. This is a module for testing the Generic Reed Solomon
> encoder / decoder library.
> 
> It seems to do a lot of computation and needs a lot of random numbers.
> I wonder if there is not enough entropy and the test is too slow.
> 
> From the config file:
> 
> CONFIG_PREEMPT_VOLUNTARY=y
> 
> I wonder if a cond_resched() in some loop would help. Or using a
> pseudorandom generator. I remember the problems related to much
> slower random generator, for example, see the commit f900fde28883602b6
> ("crypto: testmgr - fix RNG performance in fuzz tests").

I was curious whether this was the only soft lockup error reported in
this code. And I found one more occurence, see
https://lore.kernel.org/all/6A256ED30D339626+ZXwTQL1lMXhQ_inW@HX09040029.powercore.com.cn/

Unfortunately, the discussion about the other occurence does not
continue. So that it is not clear if they found the root of
the problem at that time.

Anyway, it is good to know that this test module is "prone" to
soft lockups.

Best Regards,
Petr
> 
> > [ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib 
> > [ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib 
> > [ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib 
> > [ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib 
> > [ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib 
> > [ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib 
> > [ 231.759915][ C3] do_one_initcall (init/main.c:1266) 
> > [ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) 
> > [ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257) 
> > [ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345) 
> > [ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324) 
> > [ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182) 
> > [ 231.759951][ C3] do_init_module (kernel/module/main.c:2910) 
> > [ 231.759958][ C3] load_module (kernel/module/main.c:3376) 
> > [ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223) 
> > [ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363) 
> > [ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122) 
> > [ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38) 
> > [ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565) 
> > [ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565) 
> > [ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541) 
> > [ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153) 
> > [ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577) 
> > [ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569) 
> > [ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.c:1145 fs/file.c:1159) 
> > [ 231.760028][ C3] ? security_capable (security/security.c:1142) 
> > [ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.c:3587 kernel/module/main.c:3587) 
> > [ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) 
> > [ 231.760046][ C3] ? do_mmap (mm/mmap.c:496) 
> > [ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838) 
> > [ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288) 
> > [ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590) 
> > [ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586) 
> > [ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584) 
> > [ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570) 
> > [ 231.760089][ C3] ? fget (fs/file.c:1063) 
> > [ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.h:157 fs/file_table.c:501) 
> > [ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547) 
> > [ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
> > [ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
> > [ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
> > [ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340) 
> > [ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
> > [ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
> > [ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) 
> > [ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
> > [  231.760148][    C3] RIP: 0033:0x7f64c7711719
> > [ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
> > All code
> > ========
> >    0:	08 89 e8 5b 5d c3    	or     %cl,-0x3ca2a418(%rcx)
> >    6:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
> >    d:	00 00 00 
> >   10:	90                   	nop
> >   11:	48 89 f8             	mov    %rdi,%rax
> >   14:	48 89 f7             	mov    %rsi,%rdi
> >   17:	48 89 d6             	mov    %rdx,%rsi
> >   1a:	48 89 ca             	mov    %rcx,%rdx
> >   1d:	4d 89 c2             	mov    %r8,%r10
> >   20:	4d 89 c8             	mov    %r9,%r8
> >   23:	4c 8b 4c 24 08       	mov    0x8(%rsp),%r9
> >   28:	0f 05                	syscall
> >   2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
> >   30:	73 01                	jae    0x33
> >   32:	c3                   	ret
> >   33:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06f1
> >   3a:	f7 d8                	neg    %eax
> >   3c:	64 89 01             	mov    %eax,%fs:(%rcx)
> >   3f:	48                   	rex.W
> > 
> > Code starting with the faulting instruction
> > ===========================================
> >    0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
> >    6:	73 01                	jae    0x9
> >    8:	c3                   	ret
> >    9:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06c7
> >   10:	f7 d8                	neg    %eax
> >   12:	64 89 01             	mov    %eax,%fs:(%rcx)
> >   15:	48                   	rex.W
> > 
> > 
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-01-24 16:10 ` Petr Mladek
  2025-01-24 16:39   ` Petr Mladek
@ 2025-03-15  3:38   ` Ryo Takakura
  2025-03-17  8:45     ` John Ogness
  1 sibling, 1 reply; 21+ messages in thread
From: Ryo Takakura @ 2025-03-15  3:38 UTC (permalink / raw)
  To: pmladek, john.ogness
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang

Hi!

Is the thread still active? I looked into this.

On Fri, 24 Jan 2025 17:10:11 +0100, Petr Mladek wrote:
>On Wed 2025-01-22 10:28:52, kernel test robot wrote:
>> 
>> 
>> Hello,
>> 
>> kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on:
>> 
>> commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> 
>> [test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b]
>> 
>> in testcase: kunit
>> version: 
>> with following parameters:
>> 
>> 	group: group-01
>> 
>> 
>> 
>> config: x86_64-rhel-9.4-kunit
>> compiler: gcc-12
>> test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory
>> 
>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>> 
>> 
>> 
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>> | Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com
>> 
>> 
>> [  231.759560][    C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860]
>> [  231.759572][    C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common >btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel >snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg >dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel >dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt >libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc >drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu]
>> [  231.759681][    C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S  B            N 6.13.0-rc3-00034-gb63e6f60eab4 #1
>> [  231.759690][    C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST
>> [  231.759694][    C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
>> [ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
>> [ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e >c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9
>> All code
>> ========
>>    0:	87 68 83             	xchg   %ebp,-0x7d(%rax)
>>    3:	00 00                	add    %al,(%rax)
>>    5:	89 da                	mov    %ebx,%edx
>>    7:	d3 fa                	sar    %cl,%edx
>>    9:	41 0f b6 4d 00       	movzbl 0x0(%r13),%ecx
>>    e:	41 38 cc             	cmp    %cl,%r12b
>>   11:	7c 08                	jl     0x1b
>>   13:	84 c9                	test   %cl,%cl
>>   15:	0f 85 64 02 00 00    	jne    0x27f
>>   1b:	8b 75 04             	mov    0x4(%rbp),%esi
>>   1e:	21 f3                	and    %esi,%ebx
>>   20:	01 d3                	add    %edx,%ebx
>>   22:	39 de                	cmp    %ebx,%esi
>>   24:	7e c0                	jle    0xffffffffffffffe6
>>   26:	48 8b 3c 24          	mov    (%rsp),%rdi
>>   2a:*	48 63 db             	movslq %ebx,%rbx		<-- trapping instruction
>>   2d:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
>>   31:	48 89 d9             	mov    %rbx,%rcx
>>   34:	48 c1 e9 03          	shr    $0x3,%rcx
>>   38:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
>>   3d:	48 89 d9             	mov    %rbx,%rcx
>> 
>> Code starting with the faulting instruction
>> ===========================================
>>    0:	48 63 db             	movslq %ebx,%rbx
>>    3:	48 8d 1c 5f          	lea    (%rdi,%rbx,2),%rbx
>>    7:	48 89 d9             	mov    %rbx,%rcx
>>    a:	48 c1 e9 03          	shr    $0x3,%rcx
>>    e:	42 0f b6 34 39       	movzbl (%rcx,%r15,1),%esi
>>   13:	48 89 d9             	mov    %rbx,%rcx
>> [  231.759717][    C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297
>> [  231.759723][    C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000
>> [  231.759728][    C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00
>> [  231.759733][    C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e
>> [  231.759738][    C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007
>> [  231.759743][    C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000
>> [  231.759748][    C3] FS:  00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000
>> [  231.759754][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  231.759759][    C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0
>> [  231.759764][    C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e
>> [  231.759769][    C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600
>> [  231.759774][    C3] Call Trace:
>> [  231.759778][    C3]  <IRQ>
>> [ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770) 
>> [ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685) 
>> [ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803) 
>> [ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773) 
>> [ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 >(discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3)) 
>> [ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286) 
>> [ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868) 
>> [ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 >arch/x86/kernel/apic/apic.c:1056) 
>> [ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) 
>> [  231.759842][    C3]  </IRQ>
>> [  231.759845][    C3]  <TASK>
>> [ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) 
>> [ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon 
>> [ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib 
>
>Honestly, I do not see much how this could be related to the serial
>console. This is a module for testing the Generic Reed Solomon
>encoder / decoder library.
>
>It seems to do a lot of computation and needs a lot of random numbers.
>I wonder if there is not enough entropy and the test is too slow.
>
>From the config file:
>
>CONFIG_PREEMPT_VOLUNTARY=y

I tested with and without the bleow two commits on v6.14-rc6 
where CONFIG_PREEMPT_VOLUNTARY=y and CONFIG_REED_SOLOMON_TEST=y.
I used Raspberry pi 4 (4G Mem).

    b63e6f60eab4 ("serial: 8250: Switch to nbcon console")
    422c9727b07f ("serial: 8250: Revert "drop lockdep annotation from serial8250_clear_IER()"")

I got the same softlockup during the test regardless of the presence
of the commits.

[   60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1]
[   60.222023] Modules linked in:
[   60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.14.0-rc6-v14-rc6-voluntary+ #4
[   60.222047] Tainted: [L]=SOFTLOCKUP
[   60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
[   60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   60.222066] pc : get_random_u32+0xac/0x118
[   60.222081] lr : __get_random_u32_below+0x20/0x78
[   60.222094] sp : ffffffc08002bb80
[   60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001
[   60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c
[   60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000
[   60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002
[   60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796
[   60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0
[   60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10
[   60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005
[   60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010
[   60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013
[   60.222237] Call trace:
[   60.222241]  get_random_u32+0xac/0x118 (P)
[   60.222256]  __get_random_u32_below+0x20/0x78
[   60.222268]  get_rcw_we+0x180/0x208
[   60.222278]  test_rslib_init+0x2c8/0xba0
[   60.222292]  do_one_initcall+0x4c/0x210
[   60.222303]  kernel_init_freeable+0x1fc/0x3a0
[   60.222317]  kernel_init+0x28/0x1f8
[   60.222327]  ret_from_fork+0x10/0x20

>
>I wonder if a cond_resched() in some loop would help. Or using a

I wasn't sure which loop would be the appropriate one but adding 
cond_resched() as below worked as suggested.

----- BEGIN -----
diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
index 75cb1adac..322d7b0a8 100644
--- a/lib/reed_solomon/test_rslib.c
+++ b/lib/reed_solomon/test_rslib.c
@@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs,

                if (memcmp(r, c, len * sizeof(*r)))
                        stat->dwrong++;
+
+               cond_resched();
        }
        stat->nwords += trials;
 }
@@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs,
                } else {
                        stat->rfail++;
                }
+
+               cond_resched();
        }
        stat->nwords += trials;
 }
----- END -----

>pseudorandom generator. I remember the problems related to much
>slower random generator, for example, see the commit f900fde28883602b6
>("crypto: testmgr - fix RNG performance in fuzz tests").

I haven't tested this but I'll look into it!

>That said, I did not dig deep into the code. And I did not try to
>reproduce the softlockup. I am pretty busy at the moment with some
>other stuff. I just wanted to give it a look and share my opinion.

I think the softlockup is rather a problem of test itself, 
not the two commits.

I hope this helps!

Sincerely,
Ryo Takakura

>Best Regards,
>Petr
>
>
>> [ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib 
>> [ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib 
>> [ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib 
>> [ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib 
>> [ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib 
>> [ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib 
>> [ 231.759915][ C3] do_one_initcall (init/main.c:1266) 
>> [ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) 
>> [ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257) 
>> [ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345) 
>> [ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324) 
>> [ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182) 
>> [ 231.759951][ C3] do_init_module (kernel/module/main.c:2910) 
>> [ 231.759958][ C3] load_module (kernel/module/main.c:3376) 
>> [ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223) 
>> [ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363) 
>> [ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/>atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122) 
>> [ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38) 
>> [ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565) 
>> [ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565) 
>> [ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541) 
>> [ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153) 
>> [ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577) 
>> [ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569) 
>> [ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.>c:1145 fs/file.c:1159) 
>> [ 231.760028][ C3] ? security_capable (security/security.c:1142) 
>> [ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.>c:3587 kernel/module/main.c:3587) 
>> [ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) 
>> [ 231.760046][ C3] ? do_mmap (mm/mmap.c:496) 
>> [ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838) 
>> [ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288) 
>> [ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 >include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.>c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590) 
>> [ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586) 
>> [ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584) 
>> [ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570) 
>> [ 231.760089][ C3] ? fget (fs/file.c:1063) 
>> [ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/>atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.>h:157 fs/file_table.c:501) 
>> [ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547) 
>> [ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/>entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
>> [ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/>entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
>> [ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
>> [ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340) 
>> [ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/>entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) 
>> [ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) 
>> [ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 >arch/x86/mm/fault.c:1539) 
>> [ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
>> [  231.760148][    C3] RIP: 0033:0x7f64c7711719
>> [ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b >4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
>> All code
>> ========
>>    0:	08 89 e8 5b 5d c3    	or     %cl,-0x3ca2a418(%rcx)
>>    6:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
>>    d:	00 00 00 
>>   10:	90                   	nop
>>   11:	48 89 f8             	mov    %rdi,%rax
>>   14:	48 89 f7             	mov    %rsi,%rdi
>>   17:	48 89 d6             	mov    %rdx,%rsi
>>   1a:	48 89 ca             	mov    %rcx,%rdx
>>   1d:	4d 89 c2             	mov    %r8,%r10
>>   20:	4d 89 c8             	mov    %r9,%r8
>>   23:	4c 8b 4c 24 08       	mov    0x8(%rsp),%r9
>>   28:	0f 05                	syscall
>>   2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
>>   30:	73 01                	jae    0x33
>>   32:	c3                   	ret
>>   33:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06f1
>>   3a:	f7 d8                	neg    %eax
>>   3c:	64 89 01             	mov    %eax,%fs:(%rcx)
>>   3f:	48                   	rex.W
>> 
>> Code starting with the faulting instruction
>> ===========================================
>>    0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
>>    6:	73 01                	jae    0x9
>>    8:	c3                   	ret
>>    9:	48 8b 0d b7 06 0d 00 	mov    0xd06b7(%rip),%rcx        # 0xd06c7
>>   10:	f7 d8                	neg    %eax
>>   12:	64 89 01             	mov    %eax,%fs:(%rcx)
>>   15:	48                   	rex.W
>> 
>> 
>> The kernel config and materials to reproduce are available at:
>> https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com
>> 
>> 
>> 
>> -- 
>> 0-DAY CI Kernel Test Service
>> https://github.com/intel/lkp-tests/wiki

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-03-15  3:38   ` Ryo Takakura
@ 2025-03-17  8:45     ` John Ogness
  2025-03-17 14:42       ` Ryo Takakura
  0 siblings, 1 reply; 21+ messages in thread
From: John Ogness @ 2025-03-17  8:45 UTC (permalink / raw)
  To: Ryo Takakura, pmladek
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang

On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote:
> Is the thread still active?

Yes. It is one of the open issues that needs to be resolved before I
repost the 8250 atomic console series.

> I got the same softlockup during the test regardless of the presence
> of the commits.
>
> [   60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1]
> [   60.222023] Modules linked in:
> [   60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.14.0-rc6-v14-rc6-voluntary+ #4
> [   60.222047] Tainted: [L]=SOFTLOCKUP
> [   60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
> [   60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   60.222066] pc : get_random_u32+0xac/0x118
> [   60.222081] lr : __get_random_u32_below+0x20/0x78
> [   60.222094] sp : ffffffc08002bb80
> [   60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001
> [   60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c
> [   60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000
> [   60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002
> [   60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796
> [   60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0
> [   60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10
> [   60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005
> [   60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010
> [   60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013
> [   60.222237] Call trace:
> [   60.222241]  get_random_u32+0xac/0x118 (P)
> [   60.222256]  __get_random_u32_below+0x20/0x78
> [   60.222268]  get_rcw_we+0x180/0x208
> [   60.222278]  test_rslib_init+0x2c8/0xba0
> [   60.222292]  do_one_initcall+0x4c/0x210
> [   60.222303]  kernel_init_freeable+0x1fc/0x3a0
> [   60.222317]  kernel_init+0x28/0x1f8
> [   60.222327]  ret_from_fork+0x10/0x20
>
>> I wonder if a cond_resched() in some loop would help. Or using a
>
> I wasn't sure which loop would be the appropriate one but adding
> cond_resched() as below worked as suggested.
>
> ----- BEGIN -----
> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
> index 75cb1adac..322d7b0a8 100644
> --- a/lib/reed_solomon/test_rslib.c
> +++ b/lib/reed_solomon/test_rslib.c
> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs,
>
>                 if (memcmp(r, c, len * sizeof(*r)))
>                         stat->dwrong++;
> +
> +               cond_resched();
>         }
>         stat->nwords += trials;
>  }
> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs,
>                 } else {
>                         stat->rfail++;
>                 }
> +
> +               cond_resched();
>         }
>         stat->nwords += trials;
>  }
> ----- END -----
>
>> pseudorandom generator. I remember the problems related to much
>> slower random generator, for example, see the commit
>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz
>> tests").
>
> I haven't tested this but I'll look into it!
>
>> That said, I did not dig deep into the code. And I did not try to
>> reproduce the softlockup. I am pretty busy at the moment with some
>> other stuff. I just wanted to give it a look and share my opinion.
>
> I think the softlockup is rather a problem of test itself, 
> not the two commits.

Thanks Ryo for looking into this! I think we need to have a technical
explanation/understanding of the problem so that it is clear how my
series triggers or exaggerates the issue.

John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-03-17  8:45     ` John Ogness
@ 2025-03-17 14:42       ` Ryo Takakura
  2025-04-21  3:41         ` Ryo Takakura
  0 siblings, 1 reply; 21+ messages in thread
From: Ryo Takakura @ 2025-03-17 14:42 UTC (permalink / raw)
  To: john.ogness
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, pmladek,
	ryotkkr98

Hi John!

On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote:
>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>> I got the same softlockup during the test regardless of the presence
>> of the commits.
>>
>> [   60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1]
>> [   60.222023] Modules linked in:
>> [   60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.14.0-rc6-v14-rc6-voluntary+ #4
>> [   60.222047] Tainted: [L]=SOFTLOCKUP
>> [   60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
>> [   60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [   60.222066] pc : get_random_u32+0xac/0x118
>> [   60.222081] lr : __get_random_u32_below+0x20/0x78
>> [   60.222094] sp : ffffffc08002bb80
>> [   60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001
>> [   60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c
>> [   60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000
>> [   60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002
>> [   60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796
>> [   60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0
>> [   60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10
>> [   60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005
>> [   60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010
>> [   60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013
>> [   60.222237] Call trace:
>> [   60.222241]  get_random_u32+0xac/0x118 (P)
>> [   60.222256]  __get_random_u32_below+0x20/0x78
>> [   60.222268]  get_rcw_we+0x180/0x208
>> [   60.222278]  test_rslib_init+0x2c8/0xba0
>> [   60.222292]  do_one_initcall+0x4c/0x210
>> [   60.222303]  kernel_init_freeable+0x1fc/0x3a0
>> [   60.222317]  kernel_init+0x28/0x1f8
>> [   60.222327]  ret_from_fork+0x10/0x20
>>
>>> I wonder if a cond_resched() in some loop would help. Or using a
>>
>> I wasn't sure which loop would be the appropriate one but adding
>> cond_resched() as below worked as suggested.
>>
>> ----- BEGIN -----
>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>> index 75cb1adac..322d7b0a8 100644
>> --- a/lib/reed_solomon/test_rslib.c
>> +++ b/lib/reed_solomon/test_rslib.c
>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs,
>>
>>                 if (memcmp(r, c, len * sizeof(*r)))
>>                         stat->dwrong++;
>> +
>> +               cond_resched();
>>         }
>>         stat->nwords += trials;
>>  }
>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs,
>>                 } else {
>>                         stat->rfail++;
>>                 }
>> +
>> +               cond_resched();
>>         }
>>         stat->nwords += trials;
>>  }
>> ----- END -----
>>
>>> pseudorandom generator. I remember the problems related to much
>>> slower random generator, for example, see the commit
>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz
>>> tests").
>>
>> I haven't tested this but I'll look into it!
>>
>>> That said, I did not dig deep into the code. And I did not try to
>>> reproduce the softlockup. I am pretty busy at the moment with some
>>> other stuff. I just wanted to give it a look and share my opinion.
>>
>> I think the softlockup is rather a problem of test itself, 
>> not the two commits.
>
>Thanks Ryo for looking into this! I think we need to have a technical
>explanation/understanding of the problem so that it is clear how my
>series triggers or exaggerates the issue.

I see. I can't yet say anything but still not sure if your series has
anything to do with the softlockup... I found that there was similar 
report in the past [0].

Anyways, I will look into the rslib test itself more carefully to see 
if I can give the needed technical explanation for the problem!

Sincerely,
Ryo Takakura

[0] https://lore.kernel.org/linux-kernel//a309f09e07374e62a77bb84c70e6715efd288448.1583122776.git.planteen@gmail.com/

>John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-03-17 14:42       ` Ryo Takakura
@ 2025-04-21  3:41         ` Ryo Takakura
  2025-04-22 12:15           ` Petr Mladek
  0 siblings, 1 reply; 21+ messages in thread
From: Ryo Takakura @ 2025-04-21  3:41 UTC (permalink / raw)
  To: john.ogness, pmladek
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang

Hi!

I would like to follow up the last email that I sent.

First, I'm sorry that I later realized that I should have tested
the rslib test as an inserted module, as how the robot does, by
choosing CONFIG_REED_SOLOMON_TEST=m.
Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y.

Running the rslib test as an inserted module without the John's series
was less prone to softlockup. Without the John's series, softlockup shows
up once in a test or not at all. With the John's series, softlockup can
be observed constanly over the test.

On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote:
>On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote:
>>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>>> I got the same softlockup during the test regardless of the presence
>>> of the commits.
>>>
>>> [   60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1]
>>> [   60.222023] Modules linked in:
>>> [   60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.14.0-rc6-v14-rc6-voluntary+ #4
>>> [   60.222047] Tainted: [L]=SOFTLOCKUP
>>> [   60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
>>> [   60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [   60.222066] pc : get_random_u32+0xac/0x118
>>> [   60.222081] lr : __get_random_u32_below+0x20/0x78
>>> [   60.222094] sp : ffffffc08002bb80
>>> [   60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001
>>> [   60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c
>>> [   60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000
>>> [   60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002
>>> [   60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796
>>> [   60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0
>>> [   60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10
>>> [   60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005
>>> [   60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010
>>> [   60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013
>>> [   60.222237] Call trace:
>>> [   60.222241]  get_random_u32+0xac/0x118 (P)
>>> [   60.222256]  __get_random_u32_below+0x20/0x78
>>> [   60.222268]  get_rcw_we+0x180/0x208
>>> [   60.222278]  test_rslib_init+0x2c8/0xba0
>>> [   60.222292]  do_one_initcall+0x4c/0x210
>>> [   60.222303]  kernel_init_freeable+0x1fc/0x3a0
>>> [   60.222317]  kernel_init+0x28/0x1f8
>>> [   60.222327]  ret_from_fork+0x10/0x20
>>>
>>>> I wonder if a cond_resched() in some loop would help. Or using a
>>>
>>> I wasn't sure which loop would be the appropriate one but adding
>>> cond_resched() as below worked as suggested.
>>>
>>> ----- BEGIN -----
>>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>>> index 75cb1adac..322d7b0a8 100644
>>> --- a/lib/reed_solomon/test_rslib.c
>>> +++ b/lib/reed_solomon/test_rslib.c
>>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs,
>>>
>>>                 if (memcmp(r, c, len * sizeof(*r)))
>>>                         stat->dwrong++;
>>> +
>>> +               cond_resched();
>>>         }
>>>         stat->nwords += trials;
>>>  }
>>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs,
>>>                 } else {
>>>                         stat->rfail++;
>>>                 }
>>> +
>>> +               cond_resched();
>>>         }
>>>         stat->nwords += trials;
>>>  }
>>> ----- END -----
>>>
>>>> pseudorandom generator. I remember the problems related to much
>>>> slower random generator, for example, see the commit
>>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz
>>>> tests").
>>>
>>> I haven't tested this but I'll look into it!

For the same reason pointed out by the commit f900fde28883 ("crypto: 
testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f
("random32: use real rng for non-deterministic randomness") does seem
to have slowed down the rslib test.

On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with
the commit d4150779e60f ("random32: use real rng for non-deterministic
randomness") takes about 450sec compared to v5.18(without the commit)
which only takes about 350sec.

However, random number generation does not seem to be the only reason
for the rslib test being slow.
Below is the function stats for v5.18 vs v5.19 for the test.

v5.18
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
+   59.90%    59.90%  insmod   [kernel.kallsyms]      [k] decode_rs16
+   30.28%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a748
+   21.39%    21.39%  insmod   [kernel.kallsyms]      [k] encode_rs16
+   20.94%     0.00%  insmod   [unknown]              [k] 0xffffffd60441b0d4
+   14.17%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a324
+   12.92%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a720
+   11.74%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a368
+   11.63%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a4b4
+    6.60%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a46c
+    3.26%     3.26%  insmod   [unknown]              [k] 0xffffffd60413a3c4
+    3.11%     3.11%  insmod   [kernel.kallsyms]      [k] prandom_u32

v5.19
+  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] load_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_init_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_one_initcall
+   45.13%    45.13%  insmod   [kernel.kallsyms]  [k] decode_rs16
+   27.84%    27.83%  insmod   [kernel.kallsyms]  [k] get_random_u32
+   22.83%     0.01%  insmod   [unknown]          [k] 0xffffffd62b379324
+   22.76%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379748
+   22.35%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379720
+   16.17%    16.17%  insmod   [kernel.kallsyms]  [k] encode_rs16

As the v5.19 stats shows, the commit d4150779e60f ("random32: use real
rng for non-deterministic randomness") did slowed down the rslib test
by using real random number generator(prandom_u32->get_random_u32).
But it also shows that the decode/encoding takes quite a lot of time.

For the above reason, if we want make the softlockup go away, maybe
just converting from real to pseudo random generator won't be enough?
I tested with below changes applied to v6.14 which uses pseudo random
numbers instead of real ones similar to what the commit f900fde28883
("crypto: testmgr - fix RNG performance in fuzz tests") did.

The result is that while it did help to fast up the test from 450 sec
to 350 sec, it still triggers the softlockup.

--- BEGIN ---
diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
index 75cb1adac..5785e3c0c 100644
--- a/lib/reed_solomon/test_rslib.c
+++ b/lib/reed_solomon/test_rslib.c
@@ -9,6 +9,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
+#include <linux/prandom.h>
 #include <linux/random.h>
 #include <linux/slab.h>

@@ -95,6 +96,16 @@ static struct pad pad_coef[] = {
        { 1, 0 },
 };

+static inline void init_rnd_state(struct rnd_state *rng)
+{
+       prandom_seed_state(rng, get_random_u64());
+}
+
+static inline u32 prandom_u32_below(struct rnd_state *rng, u32 ceil)
+{
+       return prandom_u32_state(rng) % ceil;
+}
+
 static void free_ws(struct wspace *ws)
 {
        if (!ws)
@@ -151,6 +162,7 @@ static struct wspace *alloc_ws(struct rs_codec *rs)
 static int get_rcw_we(struct rs_control *rs, struct wspace *ws,
                        int len, int errs, int eras)
 {
+       struct rnd_state rng;
        int nroots = rs->codec->nroots;
        int *derrlocs = ws->derrlocs;
        int *errlocs = ws->errlocs;
@@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws,
        int errloc;
        int i;

+       init_rnd_state(&rng);
+
        /* Load c with random data and encode */
        for (i = 0; i < dlen; i++)
-               c[i] = get_random_u32() & nn;
+               c[i] = prandom_u32_state(&rng) & nn;

        memset(c + dlen, 0, nroots * sizeof(*c));
        encode_rs16(rs, c, dlen, c + dlen, 0);
@@ -178,12 +192,12 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws,
        for (i = 0; i < errs; i++) {
                do {
                        /* Error value must be nonzero */
-                       errval = get_random_u32() & nn;
+                       errval = prandom_u32_state(&rng) & nn;
                } while (errval == 0);

                do {
                        /* Must not choose the same location twice */
-                       errloc = get_random_u32_below(len);
+                       errloc = prandom_u32_below(&rng, len);
                } while (errlocs[errloc] != 0);

                errlocs[errloc] = 1;
@@ -194,19 +208,19 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws,
        for (i = 0; i < eras; i++) {
                do {
                        /* Must not choose the same location twice */
-                       errloc = get_random_u32_below(len);
+                       errloc = prandom_u32_below(&rng, len);
                } while (errlocs[errloc] != 0);

                derrlocs[i] = errloc;

-               if (ewsc && get_random_u32_below(2)) {
+               if (ewsc && prandom_u32_below(&rng, 2)) {
                        /* Erasure with the symbol intact */
                        errlocs[errloc] = 2;
                } else {
                        /* Erasure with corrupted symbol */
                        do {
                                /* Error value must be nonzero */
-                               errval = get_random_u32() & nn;
+                               errval = prandom_u32_state(&rng) & nn;
                        } while (errval == 0);

                        errlocs[errloc] = 1;
--- END ---

Below is the function stats for with vs without the above change.

v6.14 without the above change
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
+   46.94%    46.94%  insmod   [kernel.kallsyms]      [k] decode_rs16
+   24.92%    24.92%  insmod   [kernel.kallsyms]      [k] get_random_u32
+   23.59%     0.01%  insmod   [unknown]              [k] 0xffffffde423d88cc
+   22.66%     0.00%  insmod   [unknown]              [k] 0xffffffde423d82c0
+   22.39%     0.00%  insmod   [unknown]              [k] 0xffffffde423d88a4
+   16.67%    16.67%  insmod   [kernel.kallsyms]      [k] encode_rs16

v6.14 with the above change
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
+  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
+   61.51%    61.50%  insmod   [kernel.kallsyms]      [k] decode_rs16
+   31.15%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858cc
+   21.75%    21.75%  insmod   [kernel.kallsyms]      [k] encode_rs16
+   21.36%     0.01%  insmod   [unknown]              [k] 0xffffffe70137f120
+   15.20%     0.00%  insmod   [unknown]              [k] 0xffffffe7013852c0
+   12.79%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858a4
+   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe7013854c0
+   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe701385304
+    6.74%     0.01%  insmod   [unknown]              [k] 0xffffffe701385600
+    3.43%     3.43%  insmod   [unknown]              [k] 0xffffffe701385598
+    2.50%     2.50%  insmod   [unknown]              [k] 0xffffffe701385564
+    1.96%     1.96%  insmod   [kernel.kallsyms]      [k] prandom_u32_state
+    1.85%     1.85%  insmod   [kernel.kallsyms]      [k] get_random_u64

So maybe we still need cond_resched() in addition to converting the
test from using psuedo random numbers to real ones in order to make
the softlockup go away?

>>>> That said, I did not dig deep into the code. And I did not try to
>>>> reproduce the softlockup. I am pretty busy at the moment with some
>>>> other stuff. I just wanted to give it a look and share my opinion.
>>>
>>> I think the softlockup is rather a problem of test itself,
>>> not the two commits.
>>
>>Thanks Ryo for looking into this! I think we need to have a technical
>>explanation/understanding of the problem so that it is clear how my
>>series triggers or exaggerates the issue.

As mentioned earlier, I'm sorry that I should have run the test as
inserted module... It seems the series does make the test more prone
to softlockups.

I am still looking into the issue but I noticed that if the printing
during the test were done in a caller context, not by printk thread,
the softlock goes away.

I tested with a change something like below so that the printing during
the test were not delegated to printk thread ("pr/ttyS0"). With the below
change, the softlock no longer shows up.

--- BEGIN ---
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index fd12efcc4..07c879a64 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void)
        cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
        (*cpu_emergency_nesting)++;
 }
+EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter);

 /**
  * nbcon_cpu_emergency_exit - Exit an emergency section
@@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void)

        preempt_enable();
 }
+EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit);

 /**
  * nbcon_alloc - Allocate and init the nbcon console specific data
diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
index 75cb1adac..8208f8317 100644
--- a/lib/reed_solomon/test_rslib.c
+++ b/lib/reed_solomon/test_rslib.c
@@ -11,6 +11,7 @@
 #include <linux/moduleparam.h>
 #include <linux/random.h>
 #include <linux/slab.h>
+#include <linux/console.h>

 enum verbosity {
        V_SILENT,
@@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws,
        int nroots = rs->codec->nroots;
        int errs, eras, retval;

-       if (v >= V_PROGRESS)
+       if (v >= V_PROGRESS) {
+               nbcon_cpu_emergency_enter();
                pr_info("  %s\n", desc[method]);
+               nbcon_cpu_emergency_exit();
+       }

        for (errs = 0; errs <= nroots / 2; errs++)
                for (eras = 0; eras <= nroots - 2 * errs; eras++)
@@ -353,8 +357,11 @@ static int exercise_rs(struct rs_control *rs, struct wspace *ws,
        int retval = 0;
        int i;

-       if (v >= V_PROGRESS)
+       if (v >= V_PROGRESS) {
+               nbcon_cpu_emergency_enter();
                pr_info("Testing up to error correction capacity...\n");
+               nbcon_cpu_emergency_exit();
+       }

        for (i = 0; i <= IN_PLACE; i++)
                retval |= ex_rs_helper(rs, ws, len, trials, i);
@@ -411,8 +418,11 @@ static int exercise_rs_bc(struct rs_control *rs, struct wspace *ws,
        int nroots = rs->codec->nroots;
        int errs, eras, cutoff;

-       if (v >= V_PROGRESS)
+       if (v >= V_PROGRESS) {
+               nbcon_cpu_emergency_enter();
                pr_info("Testing beyond error correction capacity...\n");
+               nbcon_cpu_emergency_exit();
+       }

        for (errs = 1; errs <= nroots; errs++) {
                eras = nroots - 2 * errs + 1;
@@ -468,8 +478,10 @@ static int run_exercise(struct etab *e)

                prev_pad = pad;
                if (v >= V_PROGRESS) {
+                       nbcon_cpu_emergency_enter();
                        pr_info("Testing (%d,%d)_%d code...\n",
                                        len, kk - pad, nn + 1);
+                       nbcon_cpu_emergency_exit();
                }

                retval |= exercise_rs(rsc, ws, len, e->ntrials);
--- END ---

I tried to figure out how softlockup disappears over the weekend but
couldn't... Seems the John's series somehow prevents the softlockup
thread from touching its timestamp.

I would like to continue looking into the issue and appreaciate any
comment! Hope this provides some insight for now.

Sincerely,
Ryo Takakura

>I see. I can't yet say anything but still not sure if your series has
>anything to do with the softlockup... I found that there was similar
>report in the past [0].
>
>Anyways, I will look into the rslib test itself more carefully to see
>if I can give the needed technical explanation for the problem!
>[0] https://lore.kernel.org/linux-kernel//a309f09e07374e62a77bb84c70e6715efd288448.1583122776.git.planteen@gmail.com/

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-21  3:41         ` Ryo Takakura
@ 2025-04-22 12:15           ` Petr Mladek
  2025-04-22 14:03             ` John Ogness
  2025-04-24  8:11             ` Ryo Takakura
  0 siblings, 2 replies; 21+ messages in thread
From: Petr Mladek @ 2025-04-22 12:15 UTC (permalink / raw)
  To: Ryo Takakura
  Cc: john.ogness, Jason, gregkh, linux-serial, lkp, oe-lkp,
	oliver.sang

On Mon 2025-04-21 12:41:50, Ryo Takakura wrote:
> Hi!
> 
> I would like to follow up the last email that I sent.
> 
> First, I'm sorry that I later realized that I should have tested
> the rslib test as an inserted module, as how the robot does, by
> choosing CONFIG_REED_SOLOMON_TEST=m.
> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y.
> 
> Running the rslib test as an inserted module without the John's series
> was less prone to softlockup. Without the John's series, softlockup shows
> up once in a test or not at all. With the John's series, softlockup can
> be observed constanly over the test.

> On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote:
> >On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote:
> >>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote:
> >>> I got the same softlockup during the test regardless of the presence
> >>> of the commits.
> >>>
> >>> [   60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1]
> >>> [   60.222023] Modules linked in:
> >>> [   60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.14.0-rc6-v14-rc6-voluntary+ #4
> >>> [   60.222047] Tainted: [L]=SOFTLOCKUP
> >>> [   60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
> >>> [   60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>> [   60.222066] pc : get_random_u32+0xac/0x118
> >>> [   60.222081] lr : __get_random_u32_below+0x20/0x78
> >>> [   60.222094] sp : ffffffc08002bb80
> >>> [   60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001
> >>> [   60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c
> >>> [   60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000
> >>> [   60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002
> >>> [   60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796
> >>> [   60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0
> >>> [   60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10
> >>> [   60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005
> >>> [   60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010
> >>> [   60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013
> >>> [   60.222237] Call trace:
> >>> [   60.222241]  get_random_u32+0xac/0x118 (P)
> >>> [   60.222256]  __get_random_u32_below+0x20/0x78
> >>> [   60.222268]  get_rcw_we+0x180/0x208
> >>> [   60.222278]  test_rslib_init+0x2c8/0xba0
> >>> [   60.222292]  do_one_initcall+0x4c/0x210
> >>> [   60.222303]  kernel_init_freeable+0x1fc/0x3a0
> >>> [   60.222317]  kernel_init+0x28/0x1f8
> >>> [   60.222327]  ret_from_fork+0x10/0x20
> >>>
> >>>> I wonder if a cond_resched() in some loop would help. Or using a
> >>>
> >>> I wasn't sure which loop would be the appropriate one but adding
> >>> cond_resched() as below worked as suggested.
> >>>
> >>> ----- BEGIN -----
> >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
> >>> index 75cb1adac..322d7b0a8 100644
> >>> --- a/lib/reed_solomon/test_rslib.c
> >>> +++ b/lib/reed_solomon/test_rslib.c
> >>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs,
> >>>
> >>>                 if (memcmp(r, c, len * sizeof(*r)))
> >>>                         stat->dwrong++;
> >>> +
> >>> +               cond_resched();
> >>>         }
> >>>         stat->nwords += trials;
> >>>  }
> >>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs,
> >>>                 } else {
> >>>                         stat->rfail++;
> >>>                 }
> >>> +
> >>> +               cond_resched();
> >>>         }
> >>>         stat->nwords += trials;
> >>>  }
> >>> ----- END -----
> >>>
> >>>> pseudorandom generator. I remember the problems related to much
> >>>> slower random generator, for example, see the commit
> >>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz
> >>>> tests").
> >>>
> >>> I haven't tested this but I'll look into it!
> 
> For the same reason pointed out by the commit f900fde28883 ("crypto: 
> testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f
> ("random32: use real rng for non-deterministic randomness") does seem
> to have slowed down the rslib test.
> 
> On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with
> the commit d4150779e60f ("random32: use real rng for non-deterministic
> randomness") takes about 450sec compared to v5.18(without the commit)
> which only takes about 350sec.
> 
> However, random number generation does not seem to be the only reason
> for the rslib test being slow.
> Below is the function stats for v5.18 vs v5.19 for the test.
> 
> v5.18
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
> +   59.90%    59.90%  insmod   [kernel.kallsyms]      [k] decode_rs16
> +   30.28%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a748
> +   21.39%    21.39%  insmod   [kernel.kallsyms]      [k] encode_rs16
> +   20.94%     0.00%  insmod   [unknown]              [k] 0xffffffd60441b0d4
> +   14.17%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a324
> +   12.92%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a720
> +   11.74%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a368
> +   11.63%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a4b4
> +    6.60%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a46c
> +    3.26%     3.26%  insmod   [unknown]              [k] 0xffffffd60413a3c4
> +    3.11%     3.11%  insmod   [kernel.kallsyms]      [k] prandom_u32
> 
> v5.19
> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] load_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_init_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_one_initcall
> +   45.13%    45.13%  insmod   [kernel.kallsyms]  [k] decode_rs16
> +   27.84%    27.83%  insmod   [kernel.kallsyms]  [k] get_random_u32
> +   22.83%     0.01%  insmod   [unknown]          [k] 0xffffffd62b379324
> +   22.76%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379748
> +   22.35%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379720
> +   16.17%    16.17%  insmod   [kernel.kallsyms]  [k] encode_rs16
> 
> As the v5.19 stats shows, the commit d4150779e60f ("random32: use real
> rng for non-deterministic randomness") did slowed down the rslib test
> by using real random number generator(prandom_u32->get_random_u32).
> But it also shows that the decode/encoding takes quite a lot of time.
> 
> For the above reason, if we want make the softlockup go away, maybe
> just converting from real to pseudo random generator won't be enough?
> I tested with below changes applied to v6.14 which uses pseudo random
> numbers instead of real ones similar to what the commit f900fde28883
> ("crypto: testmgr - fix RNG performance in fuzz tests") did.
> 
> The result is that while it did help to fast up the test from 450 sec
> to 350 sec, it still triggers the softlockup.

It is not that big speedup. 350 sec still provides a lot of
opportunities for a softlockup.

> --- BEGIN ---
> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
> index 75cb1adac..5785e3c0c 100644
> --- a/lib/reed_solomon/test_rslib.c
> +++ b/lib/reed_solomon/test_rslib.c
[...]
> @@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws,
>         int errloc;
>         int i;
> 
> +       init_rnd_state(&rng);
> +
>         /* Load c with random data and encode */
>         for (i = 0; i < dlen; i++)
> -               c[i] = get_random_u32() & nn;
> +               c[i] = prandom_u32_state(&rng) & nn;
> 
>         memset(c + dlen, 0, nroots * sizeof(*c));
>         encode_rs16(rs, c, dlen, c + dlen, 0);
[...]
> --- END ---
> 
> Below is the function stats for with vs without the above change.
> 
> v6.14 without the above change
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
> +   46.94%    46.94%  insmod   [kernel.kallsyms]      [k] decode_rs16
> +   24.92%    24.92%  insmod   [kernel.kallsyms]      [k] get_random_u32
> +   23.59%     0.01%  insmod   [unknown]              [k] 0xffffffde423d88cc
> +   22.66%     0.00%  insmod   [unknown]              [k] 0xffffffde423d82c0
> +   22.39%     0.00%  insmod   [unknown]              [k] 0xffffffde423d88a4
> +   16.67%    16.67%  insmod   [kernel.kallsyms]      [k] encode_rs16
> 
> v6.14 with the above change
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
> +   61.51%    61.50%  insmod   [kernel.kallsyms]      [k] decode_rs16
> +   31.15%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858cc
> +   21.75%    21.75%  insmod   [kernel.kallsyms]      [k] encode_rs16
> +   21.36%     0.01%  insmod   [unknown]              [k] 0xffffffe70137f120
> +   15.20%     0.00%  insmod   [unknown]              [k] 0xffffffe7013852c0
> +   12.79%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858a4
> +   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe7013854c0
> +   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe701385304
> +    6.74%     0.01%  insmod   [unknown]              [k] 0xffffffe701385600
> +    3.43%     3.43%  insmod   [unknown]              [k] 0xffffffe701385598
> +    2.50%     2.50%  insmod   [unknown]              [k] 0xffffffe701385564
> +    1.96%     1.96%  insmod   [kernel.kallsyms]      [k] prandom_u32_state
> +    1.85%     1.85%  insmod   [kernel.kallsyms]      [k] get_random_u64
> 
> So maybe we still need cond_resched() in addition to converting the
> test from using psuedo random numbers to real ones in order to make
> the softlockup go away?

IMHO, cond_resched() should be added, definitely. The code clearly
prevents scheduling for tens of seconds.

> >>>> That said, I did not dig deep into the code. And I did not try to
> >>>> reproduce the softlockup. I am pretty busy at the moment with some
> >>>> other stuff. I just wanted to give it a look and share my opinion.
> >>>
> >>> I think the softlockup is rather a problem of test itself,
> >>> not the two commits.
> >>
> >>Thanks Ryo for looking into this! I think we need to have a technical
> >>explanation/understanding of the problem so that it is clear how my
> >>series triggers or exaggerates the issue.
> 
> As mentioned earlier, I'm sorry that I should have run the test as
> inserted module... It seems the series does make the test more prone
> to softlockups.

IMHO, the main difference is that the patch "serial: 8250: Switch to
nbcon console" removes touch_nmi_watchdog() from
serial8250_console_write().

The touch_nmi_watchdog() resets the softlockup watchdog. It might
hide that the CPU did not schedule for a long time.

The touch_nmi_watchdog() was there because the console_lock() owner,
used by the legacy loop, was responsible for flushing all pending
messages. And there might be many pending messages when new ones
were added by other CPUs in parallel. And the legacy loop
could not call cond_resched() when called from printk() because
printk() might be called in atomic context.

The series removed the touch_nmi_watchdog() because the kthread
called cond_resched. And the amount of messages flushed in
the emergency or panic context was limited.

My view:

The touch_nmi_watchdog() did hide the softlockup scenario in the rslib
test. The printk series allowed to see the problem reliably.

By other words, the rslib test should get fixed by adding
cond_resched(). And the conversion of the 8250 serial console
driver to nbcon is correct.

That said, see below.

> I am still looking into the issue but I noticed that if the printing
> during the test were done in a caller context, not by printk thread,
> the softlock goes away.
> 
> I tested with a change something like below so that the printing during
> the test were not delegated to printk thread ("pr/ttyS0"). With the below
> change, the softlock no longer shows up.
> 
> --- BEGIN ---
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index fd12efcc4..07c879a64 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void)
>         cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
>         (*cpu_emergency_nesting)++;
>  }
> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter);
> 
>  /**
>   * nbcon_cpu_emergency_exit - Exit an emergency section
> @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void)
> 
>         preempt_enable();
>  }
> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit);
> 
>  /**
>   * nbcon_alloc - Allocate and init the nbcon console specific data
> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
> index 75cb1adac..8208f8317 100644
> --- a/lib/reed_solomon/test_rslib.c
> +++ b/lib/reed_solomon/test_rslib.c
> @@ -11,6 +11,7 @@
>  #include <linux/moduleparam.h>
>  #include <linux/random.h>
>  #include <linux/slab.h>
> +#include <linux/console.h>
> 
>  enum verbosity {
>         V_SILENT,
> @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws,
>         int nroots = rs->codec->nroots;
>         int errs, eras, retval;
> 
> -       if (v >= V_PROGRESS)
> +       if (v >= V_PROGRESS) {
> +               nbcon_cpu_emergency_enter();
>                 pr_info("  %s\n", desc[method]);
> +               nbcon_cpu_emergency_exit();
> +       }

I do not understand how this prevented the softlockup.

The emergency context causes that the messages is flushed directly
from printk(), see nbcon_atomic_flush_pending() in vprintk_emit().
But see neither cond_resched() nor touch_nmi_watchdog() in this
code path.

By other words, I do _not_ see how the emergency context either
triggered scheduling or did reset the watchdog.

But I might be looking at a different code base. I looked at
Linus' master and linux-next.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-22 12:15           ` Petr Mladek
@ 2025-04-22 14:03             ` John Ogness
  2025-04-24  8:11             ` Ryo Takakura
  1 sibling, 0 replies; 21+ messages in thread
From: John Ogness @ 2025-04-22 14:03 UTC (permalink / raw)
  To: Petr Mladek, Ryo Takakura
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang

On 2025-04-22, Petr Mladek <pmladek@suse.com> wrote:
> IMHO, the main difference is that the patch "serial: 8250: Switch to
> nbcon console" removes touch_nmi_watchdog() from
> serial8250_console_write().
>
> The touch_nmi_watchdog() resets the softlockup watchdog. It might
> hide that the CPU did not schedule for a long time.
>
> The touch_nmi_watchdog() was there because the console_lock() owner,
> used by the legacy loop, was responsible for flushing all pending
> messages. And there might be many pending messages when new ones
> were added by other CPUs in parallel. And the legacy loop
> could not call cond_resched() when called from printk() because
> printk() might be called in atomic context.
>
> The series removed the touch_nmi_watchdog() because the kthread
> called cond_resched. And the amount of messages flushed in
> the emergency or panic context was limited.
>
> My view:
>
> The touch_nmi_watchdog() did hide the softlockup scenario in the rslib
> test. The printk series allowed to see the problem reliably.
>
> By other words, the rslib test should get fixed by adding
> cond_resched(). And the conversion of the 8250 serial console
> driver to nbcon is correct.
>
> That said, see below.
>
>> I am still looking into the issue but I noticed that if the printing
>> during the test were done in a caller context, not by printk thread,
>> the softlock goes away.
>> 
>> I tested with a change something like below so that the printing during
>> the test were not delegated to printk thread ("pr/ttyS0"). With the below
>> change, the softlock no longer shows up.
>> 
>> --- BEGIN ---
>> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
>> index fd12efcc4..07c879a64 100644
>> --- a/kernel/printk/nbcon.c
>> +++ b/kernel/printk/nbcon.c
>> @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void)
>>         cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
>>         (*cpu_emergency_nesting)++;
>>  }
>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter);
>> 
>>  /**
>>   * nbcon_cpu_emergency_exit - Exit an emergency section
>> @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void)
>> 
>>         preempt_enable();
>>  }
>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit);
>> 
>>  /**
>>   * nbcon_alloc - Allocate and init the nbcon console specific data
>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>> index 75cb1adac..8208f8317 100644
>> --- a/lib/reed_solomon/test_rslib.c
>> +++ b/lib/reed_solomon/test_rslib.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/moduleparam.h>
>>  #include <linux/random.h>
>>  #include <linux/slab.h>
>> +#include <linux/console.h>
>> 
>>  enum verbosity {
>>         V_SILENT,
>> @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws,
>>         int nroots = rs->codec->nroots;
>>         int errs, eras, retval;
>> 
>> -       if (v >= V_PROGRESS)
>> +       if (v >= V_PROGRESS) {
>> +               nbcon_cpu_emergency_enter();
>>                 pr_info("  %s\n", desc[method]);
>> +               nbcon_cpu_emergency_exit();
>> +       }
>
> I do not understand how this prevented the softlockup.
>
> The emergency context causes that the messages is flushed directly
> from printk(), see nbcon_atomic_flush_pending() in vprintk_emit().
> But see neither cond_resched() nor touch_nmi_watchdog() in this
> code path.

Note there is still a touch_nmi_watchdog() in wait_for_lsr(). If the
console printing thread is on a different CPU, then its use of
touch_nmi_watchdog() will not help the CPU running the rslib test. By
forcing emergency context, now the CPU running the rslib test will be
calling touch_nmi_watchdog().

John Ogness

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-22 12:15           ` Petr Mladek
  2025-04-22 14:03             ` John Ogness
@ 2025-04-24  8:11             ` Ryo Takakura
  2025-04-24  9:00               ` John Ogness
                                 ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Ryo Takakura @ 2025-04-24  8:11 UTC (permalink / raw)
  To: pmladek, john.ogness
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, ryotkkr98

Hi Petr and John!

On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote:
>On Mon 2025-04-21 12:41:50, Ryo Takakura wrote:
>> Hi!
>> 
>> I would like to follow up the last email that I sent.
>> 
>> First, I'm sorry that I later realized that I should have tested
>> the rslib test as an inserted module, as how the robot does, by
>> choosing CONFIG_REED_SOLOMON_TEST=m.
>> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y.
>> 
>> Running the rslib test as an inserted module without the John's series
>> was less prone to softlockup. Without the John's series, softlockup shows
>> up once in a test or not at all. With the John's series, softlockup can
>> be observed constanly over the test.
>
>> On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote:
>> >On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote:
>> >>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>> >>> I got the same softlockup during the test regardless of the presence
>> >>> of the commits.
>> >>>
>> >>> [   60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1]
>> >>> [   60.222023] Modules linked in:
>> >>> [   60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.14.0-rc6-v14-rc6-voluntary+ #4
>> >>> [   60.222047] Tainted: [L]=SOFTLOCKUP
>> >>> [   60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
>> >>> [   60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> >>> [   60.222066] pc : get_random_u32+0xac/0x118
>> >>> [   60.222081] lr : __get_random_u32_below+0x20/0x78
>> >>> [   60.222094] sp : ffffffc08002bb80
>> >>> [   60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001
>> >>> [   60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c
>> >>> [   60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000
>> >>> [   60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002
>> >>> [   60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796
>> >>> [   60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0
>> >>> [   60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10
>> >>> [   60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005
>> >>> [   60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010
>> >>> [   60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013
>> >>> [   60.222237] Call trace:
>> >>> [   60.222241]  get_random_u32+0xac/0x118 (P)
>> >>> [   60.222256]  __get_random_u32_below+0x20/0x78
>> >>> [   60.222268]  get_rcw_we+0x180/0x208
>> >>> [   60.222278]  test_rslib_init+0x2c8/0xba0
>> >>> [   60.222292]  do_one_initcall+0x4c/0x210
>> >>> [   60.222303]  kernel_init_freeable+0x1fc/0x3a0
>> >>> [   60.222317]  kernel_init+0x28/0x1f8
>> >>> [   60.222327]  ret_from_fork+0x10/0x20
>> >>>
>> >>>> I wonder if a cond_resched() in some loop would help. Or using a
>> >>>
>> >>> I wasn't sure which loop would be the appropriate one but adding
>> >>> cond_resched() as below worked as suggested.
>> >>>
>> >>> ----- BEGIN -----
>> >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>> >>> index 75cb1adac..322d7b0a8 100644
>> >>> --- a/lib/reed_solomon/test_rslib.c
>> >>> +++ b/lib/reed_solomon/test_rslib.c
>> >>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs,
>> >>>
>> >>>                 if (memcmp(r, c, len * sizeof(*r)))
>> >>>                         stat->dwrong++;
>> >>> +
>> >>> +               cond_resched();
>> >>>         }
>> >>>         stat->nwords += trials;
>> >>>  }
>> >>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs,
>> >>>                 } else {
>> >>>                         stat->rfail++;
>> >>>                 }
>> >>> +
>> >>> +               cond_resched();
>> >>>         }
>> >>>         stat->nwords += trials;
>> >>>  }
>> >>> ----- END -----
>> >>>
>> >>>> pseudorandom generator. I remember the problems related to much
>> >>>> slower random generator, for example, see the commit
>> >>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz
>> >>>> tests").
>> >>>
>> >>> I haven't tested this but I'll look into it!
>> 
>> For the same reason pointed out by the commit f900fde28883 ("crypto: 
>> testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f
>> ("random32: use real rng for non-deterministic randomness") does seem
>> to have slowed down the rslib test.
>> 
>> On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with
>> the commit d4150779e60f ("random32: use real rng for non-deterministic
>> randomness") takes about 450sec compared to v5.18(without the commit)
>> which only takes about 350sec.
>> 
>> However, random number generation does not seem to be the only reason
>> for the rslib test being slow.
>> Below is the function stats for v5.18 vs v5.19 for the test.
>> 
>> v5.18
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
>> +   59.90%    59.90%  insmod   [kernel.kallsyms]      [k] decode_rs16
>> +   30.28%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a748
>> +   21.39%    21.39%  insmod   [kernel.kallsyms]      [k] encode_rs16
>> +   20.94%     0.00%  insmod   [unknown]              [k] 0xffffffd60441b0d4
>> +   14.17%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a324
>> +   12.92%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a720
>> +   11.74%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a368
>> +   11.63%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a4b4
>> +    6.60%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a46c
>> +    3.26%     3.26%  insmod   [unknown]              [k] 0xffffffd60413a3c4
>> +    3.11%     3.11%  insmod   [kernel.kallsyms]      [k] prandom_u32
>> 
>> v5.19
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] load_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_init_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_one_initcall
>> +   45.13%    45.13%  insmod   [kernel.kallsyms]  [k] decode_rs16
>> +   27.84%    27.83%  insmod   [kernel.kallsyms]  [k] get_random_u32
>> +   22.83%     0.01%  insmod   [unknown]          [k] 0xffffffd62b379324
>> +   22.76%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379748
>> +   22.35%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379720
>> +   16.17%    16.17%  insmod   [kernel.kallsyms]  [k] encode_rs16
>> 
>> As the v5.19 stats shows, the commit d4150779e60f ("random32: use real
>> rng for non-deterministic randomness") did slowed down the rslib test
>> by using real random number generator(prandom_u32->get_random_u32).
>> But it also shows that the decode/encoding takes quite a lot of time.
>> 
>> For the above reason, if we want make the softlockup go away, maybe
>> just converting from real to pseudo random generator won't be enough?
>> I tested with below changes applied to v6.14 which uses pseudo random
>> numbers instead of real ones similar to what the commit f900fde28883
>> ("crypto: testmgr - fix RNG performance in fuzz tests") did.
>> 
>> The result is that while it did help to fast up the test from 450 sec
>> to 350 sec, it still triggers the softlockup.
>
>It is not that big speedup. 350 sec still provides a lot of
>opportunities for a softlockup.
>
>> --- BEGIN ---
>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>> index 75cb1adac..5785e3c0c 100644
>> --- a/lib/reed_solomon/test_rslib.c
>> +++ b/lib/reed_solomon/test_rslib.c
>[...]
>> @@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws,
>>         int errloc;
>>         int i;
>> 
>> +       init_rnd_state(&rng);
>> +
>>         /* Load c with random data and encode */
>>         for (i = 0; i < dlen; i++)
>> -               c[i] = get_random_u32() & nn;
>> +               c[i] = prandom_u32_state(&rng) & nn;
>> 
>>         memset(c + dlen, 0, nroots * sizeof(*c));
>>         encode_rs16(rs, c, dlen, c + dlen, 0);
>[...]
>> --- END ---
>> 
>> Below is the function stats for with vs without the above change.
>> 
>> v6.14 without the above change
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
>> +   46.94%    46.94%  insmod   [kernel.kallsyms]      [k] decode_rs16
>> +   24.92%    24.92%  insmod   [kernel.kallsyms]      [k] get_random_u32
>> +   23.59%     0.01%  insmod   [unknown]              [k] 0xffffffde423d88cc
>> +   22.66%     0.00%  insmod   [unknown]              [k] 0xffffffde423d82c0
>> +   22.39%     0.00%  insmod   [unknown]              [k] 0xffffffde423d88a4
>> +   16.67%    16.67%  insmod   [kernel.kallsyms]      [k] encode_rs16
>> 
>> v6.14 with the above change
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
>> +   61.51%    61.50%  insmod   [kernel.kallsyms]      [k] decode_rs16
>> +   31.15%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858cc
>> +   21.75%    21.75%  insmod   [kernel.kallsyms]      [k] encode_rs16
>> +   21.36%     0.01%  insmod   [unknown]              [k] 0xffffffe70137f120
>> +   15.20%     0.00%  insmod   [unknown]              [k] 0xffffffe7013852c0
>> +   12.79%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858a4
>> +   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe7013854c0
>> +   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe701385304
>> +    6.74%     0.01%  insmod   [unknown]              [k] 0xffffffe701385600
>> +    3.43%     3.43%  insmod   [unknown]              [k] 0xffffffe701385598
>> +    2.50%     2.50%  insmod   [unknown]              [k] 0xffffffe701385564
>> +    1.96%     1.96%  insmod   [kernel.kallsyms]      [k] prandom_u32_state
>> +    1.85%     1.85%  insmod   [kernel.kallsyms]      [k] get_random_u64
>> 
>> So maybe we still need cond_resched() in addition to converting the
>> test from using psuedo random numbers to real ones in order to make
>> the softlockup go away?
>
>IMHO, cond_resched() should be added, definitely. The code clearly
>prevents scheduling for tens of seconds.

Thank you for checking, and I agree.
If needed, I can prepare the fix for the test.

>> >>>> That said, I did not dig deep into the code. And I did not try to
>> >>>> reproduce the softlockup. I am pretty busy at the moment with some
>> >>>> other stuff. I just wanted to give it a look and share my opinion.
>> >>>
>> >>> I think the softlockup is rather a problem of test itself,
>> >>> not the two commits.
>> >>
>> >>Thanks Ryo for looking into this! I think we need to have a technical
>> >>explanation/understanding of the problem so that it is clear how my
>> >>series triggers or exaggerates the issue.
>> 
>> As mentioned earlier, I'm sorry that I should have run the test as
>> inserted module... It seems the series does make the test more prone
>> to softlockups.
>
>IMHO, the main difference is that the patch "serial: 8250: Switch to
>nbcon console" removes touch_nmi_watchdog() from
>serial8250_console_write().
>
>The touch_nmi_watchdog() resets the softlockup watchdog. It might
>hide that the CPU did not schedule for a long time.
>
>The touch_nmi_watchdog() was there because the console_lock() owner,
>used by the legacy loop, was responsible for flushing all pending
>messages. And there might be many pending messages when new ones
>were added by other CPUs in parallel. And the legacy loop
>could not call cond_resched() when called from printk() because
>printk() might be called in atomic context.

I see. Without the John's series, the cond_resched() in the mention
code path should be called during the rslib test as it's not in atomic
context in addition to the touch_nmi_watchdog().

>The series removed the touch_nmi_watchdog() because the kthread
>called cond_resched. And the amount of messages flushed in
>the emergency or panic context was limited.

With the John's series appied, I guess the kthread were running on a
cpu other than the one running the rslib test as John said[0], and neither
of touch_nmi_watchdog() nor cond_resched() were called that could prevent
the softlockup.

>My view:
>
>The touch_nmi_watchdog() did hide the softlockup scenario in the rslib
>test. The printk series allowed to see the problem reliably.
>
>By other words, the rslib test should get fixed by adding
>cond_resched(). And the conversion of the 8250 serial console
>driver to nbcon is correct.

I see, and agree! I'll test what's discussed here so that we can say
the above statement is true and the John's series can be brought back.

>That said, see below.
>
>> I am still looking into the issue but I noticed that if the printing
>> during the test were done in a caller context, not by printk thread,
>> the softlock goes away.
>> 
>> I tested with a change something like below so that the printing during
>> the test were not delegated to printk thread ("pr/ttyS0"). With the below
>> change, the softlock no longer shows up.
>> 
>> --- BEGIN ---
>> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
>> index fd12efcc4..07c879a64 100644
>> --- a/kernel/printk/nbcon.c
>> +++ b/kernel/printk/nbcon.c
>> @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void)
>>         cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
>>         (*cpu_emergency_nesting)++;
>>  }
>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter);
>> 
>>  /**
>>   * nbcon_cpu_emergency_exit - Exit an emergency section
>> @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void)
>> 
>>         preempt_enable();
>>  }
>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit);
>> 
>>  /**
>>   * nbcon_alloc - Allocate and init the nbcon console specific data
>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>> index 75cb1adac..8208f8317 100644
>> --- a/lib/reed_solomon/test_rslib.c
>> +++ b/lib/reed_solomon/test_rslib.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/moduleparam.h>
>>  #include <linux/random.h>
>>  #include <linux/slab.h>
>> +#include <linux/console.h>
>> 
>>  enum verbosity {
>>         V_SILENT,
>> @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws,
>>         int nroots = rs->codec->nroots;
>>         int errs, eras, retval;
>> 
>> -       if (v >= V_PROGRESS)
>> +       if (v >= V_PROGRESS) {
>> +               nbcon_cpu_emergency_enter();
>>                 pr_info("  %s\n", desc[method]);
>> +               nbcon_cpu_emergency_exit();
>> +       }
>
>I do not understand how this prevented the softlockup.
>
>The emergency context causes that the messages is flushed directly
>from printk(), see nbcon_atomic_flush_pending() in vprintk_emit().
>But see neither cond_resched() nor touch_nmi_watchdog() in this
>code path.
>
>By other words, I do _not_ see how the emergency context either
>triggered scheduling or did reset the watchdog.
>
>But I might be looking at a different code base. I looked at
>Linus' master and linux-next.

I used this kernel[1] which is for raspberry pi. Let me recheck
with some other machine with Linus' master and linux-next to see
if the behavior is raspberry pi specific.

Sincerely,
Ryo Takakura

>Best Regards,
>Petr

[0] https://lore.kernel.org/all/84ikmwqonm.fsf@jogness.linutronix.de/
[1] https://github.com/raspberrypi/linux/tree/rpi-6.14.y

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-24  8:11             ` Ryo Takakura
@ 2025-04-24  9:00               ` John Ogness
  2025-04-24 14:13                 ` Ryo Takakura
  2025-04-24  9:02               ` Petr Mladek
  2025-04-30  9:15               ` Ryo Takakura
  2 siblings, 1 reply; 21+ messages in thread
From: John Ogness @ 2025-04-24  9:00 UTC (permalink / raw)
  To: Ryo Takakura, pmladek
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, ryotkkr98

On 2025-04-24, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>> By other words, the rslib test should get fixed by adding
>> cond_resched(). And the conversion of the 8250 serial console
>> driver to nbcon is correct.
>
> I see, and agree! I'll test what's discussed here so that we can say
> the above statement is true and the John's series can be brought back.

I'm looking forward to seeing your test results. Note there is also 2
more issues blocking the series:

https://lore.kernel.org/linux-serial/80b020fc-c18a-4da4-b222-16da1cab2f4c@nvidia.com/

and an offlist report from NXP relating to not being able to suspend on
ARMv8 due to printk irq work IPIs.

John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-24  8:11             ` Ryo Takakura
  2025-04-24  9:00               ` John Ogness
@ 2025-04-24  9:02               ` Petr Mladek
  2025-04-24 14:17                 ` Ryo Takakura
  2025-04-30  9:15               ` Ryo Takakura
  2 siblings, 1 reply; 21+ messages in thread
From: Petr Mladek @ 2025-04-24  9:02 UTC (permalink / raw)
  To: Ryo Takakura
  Cc: john.ogness, Jason, gregkh, linux-serial, lkp, oe-lkp,
	oliver.sang

On Thu 2025-04-24 17:11:01, Ryo Takakura wrote:
> Hi Petr and John!
> 
> On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote:
> >On Mon 2025-04-21 12:41:50, Ryo Takakura wrote:
> >> Hi!
> >> 
> >> I would like to follow up the last email that I sent.
> >> 
> >> First, I'm sorry that I later realized that I should have tested
> >> the rslib test as an inserted module, as how the robot does, by
> >> choosing CONFIG_REED_SOLOMON_TEST=m.
> >> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y.
> >> 
> >> Running the rslib test as an inserted module without the John's series
> >> was less prone to softlockup. Without the John's series, softlockup shows
> >> up once in a test or not at all. With the John's series, softlockup can
> >> be observed constanly over the test.
> >
> >> >>Thanks Ryo for looking into this! I think we need to have a technical
> >> >>explanation/understanding of the problem so that it is clear how my
> >> >>series triggers or exaggerates the issue.
> >> 
> >> As mentioned earlier, I'm sorry that I should have run the test as
> >> inserted module... It seems the series does make the test more prone
> >> to softlockups.
> >
> >IMHO, the main difference is that the patch "serial: 8250: Switch to
> >nbcon console" removes touch_nmi_watchdog() from
> >serial8250_console_write().
> >
> >The touch_nmi_watchdog() resets the softlockup watchdog. It might
> >hide that the CPU did not schedule for a long time.
> >
> >The touch_nmi_watchdog() was there because the console_lock() owner,
> >used by the legacy loop, was responsible for flushing all pending
> >messages. And there might be many pending messages when new ones
> >were added by other CPUs in parallel. And the legacy loop
> >could not call cond_resched() when called from printk() because
> >printk() might be called in atomic context.
> 
> I see. Without the John's series, the cond_resched() in the mention
> code path should be called during the rslib test as it's not in atomic
> context in addition to the touch_nmi_watchdog().

Just to be sure. The right fix is to add cond_resched() to rslib test.
The code should allow scheduling and do not block the CPU for too
long.

touch_nmi_watchdog() just hides the problem. It was used in printk()
because there was no better solution.

> I used this kernel[1] which is for raspberry pi. Let me recheck
> with some other machine with Linus' master and linux-next to see
> if the behavior is raspberry pi specific.

John explained why the emergency context helped. I think that we have
a pretty good understating of what is going on there.

I believe that the problem will be the same in all code streams.
It might be enough to check one of them (Linus' tree or linux-next)
just to be sure that the fix applies and it has not been already
fixed.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-24  9:00               ` John Ogness
@ 2025-04-24 14:13                 ` Ryo Takakura
  0 siblings, 0 replies; 21+ messages in thread
From: Ryo Takakura @ 2025-04-24 14:13 UTC (permalink / raw)
  To: john.ogness
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, pmladek,
	ryotkkr98

On Thu, 24 Apr 2025 11:06:23 +0206, John Ogness wrote:
>On 2025-04-24, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>>> By other words, the rslib test should get fixed by adding
>>> cond_resched(). And the conversion of the 8250 serial console
>>> driver to nbcon is correct.
>>
>> I see, and agree! I'll test what's discussed here so that we can say
>> the above statement is true and the John's series can be brought back.
>
>I'm looking forward to seeing your test results. Note there is also 2

OK!

>more issues blocking the series:
>
>https://lore.kernel.org/linux-serial/80b020fc-c18a-4da4-b222-16da1cab2f4c@nvidia.com/
>
>and an offlist report from NXP relating to not being able to suspend on
>ARMv8 due to printk irq work IPIs.

Oh, I wasn't aware of those two...
I may look into them as well after this!

Sincerely,
Ryo Takakura

>John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-24  9:02               ` Petr Mladek
@ 2025-04-24 14:17                 ` Ryo Takakura
  0 siblings, 0 replies; 21+ messages in thread
From: Ryo Takakura @ 2025-04-24 14:17 UTC (permalink / raw)
  To: pmladek
  Cc: Jason, gregkh, john.ogness, linux-serial, lkp, oe-lkp,
	oliver.sang, ryotkkr98

On Thu, 24 Apr 2025 11:02:54 +0200, Petr Mladek wrote:
>On Thu 2025-04-24 17:11:01, Ryo Takakura wrote:
>> Hi Petr and John!
>> 
>> On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote:
>> >On Mon 2025-04-21 12:41:50, Ryo Takakura wrote:
>> >> Hi!
>> >> 
>> >> I would like to follow up the last email that I sent.
>> >> 
>> >> First, I'm sorry that I later realized that I should have tested
>> >> the rslib test as an inserted module, as how the robot does, by
>> >> choosing CONFIG_REED_SOLOMON_TEST=m.
>> >> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y.
>> >> 
>> >> Running the rslib test as an inserted module without the John's series
>> >> was less prone to softlockup. Without the John's series, softlockup shows
>> >> up once in a test or not at all. With the John's series, softlockup can
>> >> be observed constanly over the test.
>> >
>> >> >>Thanks Ryo for looking into this! I think we need to have a technical
>> >> >>explanation/understanding of the problem so that it is clear how my
>> >> >>series triggers or exaggerates the issue.
>> >> 
>> >> As mentioned earlier, I'm sorry that I should have run the test as
>> >> inserted module... It seems the series does make the test more prone
>> >> to softlockups.
>> >
>> >IMHO, the main difference is that the patch "serial: 8250: Switch to
>> >nbcon console" removes touch_nmi_watchdog() from
>> >serial8250_console_write().
>> >
>> >The touch_nmi_watchdog() resets the softlockup watchdog. It might
>> >hide that the CPU did not schedule for a long time.
>> >
>> >The touch_nmi_watchdog() was there because the console_lock() owner,
>> >used by the legacy loop, was responsible for flushing all pending
>> >messages. And there might be many pending messages when new ones
>> >were added by other CPUs in parallel. And the legacy loop
>> >could not call cond_resched() when called from printk() because
>> >printk() might be called in atomic context.
>> 
>> I see. Without the John's series, the cond_resched() in the mention
>> code path should be called during the rslib test as it's not in atomic
>> context in addition to the touch_nmi_watchdog().
>
>Just to be sure. The right fix is to add cond_resched() to rslib test.
>The code should allow scheduling and do not block the CPU for too
>long.
>
>touch_nmi_watchdog() just hides the problem. It was used in printk()
>because there was no better solution.

Understood!

>> I used this kernel[1] which is for raspberry pi. Let me recheck
>> with some other machine with Linus' master and linux-next to see
>> if the behavior is raspberry pi specific.
>
>John explained why the emergency context helped. I think that we have
>a pretty good understating of what is going on there.
>
>I believe that the problem will be the same in all code streams.
>It might be enough to check one of them (Linus' tree or linux-next)
>just to be sure that the fix applies and it has not been already
>fixed.

Thanks for clarifying.
I'll make sure that the problem can also be observed on either of
Linus' tree or linux-next and they still needs the fix.

Sincerely,
Ryo Takakura

>Best Regards,
>Petr

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-24  8:11             ` Ryo Takakura
  2025-04-24  9:00               ` John Ogness
  2025-04-24  9:02               ` Petr Mladek
@ 2025-04-30  9:15               ` Ryo Takakura
  2025-04-30 15:41                 ` John Ogness
  2 siblings, 1 reply; 21+ messages in thread
From: Ryo Takakura @ 2025-04-30  9:15 UTC (permalink / raw)
  To: john.ogness, pmladek
  Cc: ryotkkr98, Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang

Hi, I tested what we have discussed!

On Thu, 24 Apr 2025 17:11:01 +0900, Ryo Takakura wrote:
>On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote:
>>On Mon 2025-04-21 12:41:50, Ryo Takakura wrote:
>>> First, I'm sorry that I later realized that I should have tested
>>> the rslib test as an inserted module, as how the robot does, by
>>> choosing CONFIG_REED_SOLOMON_TEST=m.
>>> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y.
>>> 
>>> Running the rslib test as an inserted module without the John's series
>>> was less prone to softlockup. Without the John's series, softlockup shows
>>> up once in a test or not at all. With the John's series, softlockup can
>>> be observed constanly over the test.
>>
>>> On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote:
>>> >On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote:
>>> >>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>>> >>> I got the same softlockup during the test regardless of the presence
>>> >>> of the commits.
>>> >>>
>>> >>> [   60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1]
>>> >>> [   60.222023] Modules linked in:
>>> >>> [   60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.14.>0-rc6-v14-rc6-voluntary+ #4
>>> >>> [   60.222047] Tainted: [L]=SOFTLOCKUP
>>> >>> [   60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
>>> >>> [   60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> >>> [   60.222066] pc : get_random_u32+0xac/0x118
>>> >>> [   60.222081] lr : __get_random_u32_below+0x20/0x78
>>> >>> [   60.222094] sp : ffffffc08002bb80
>>> >>> [   60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001
>>> >>> [   60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c
>>> >>> [   60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000
>>> >>> [   60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002
>>> >>> [   60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796
>>> >>> [   60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0
>>> >>> [   60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10
>>> >>> [   60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005
>>> >>> [   60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010
>>> >>> [   60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013
>>> >>> [   60.222237] Call trace:
>>> >>> [   60.222241]  get_random_u32+0xac/0x118 (P)
>>> >>> [   60.222256]  __get_random_u32_below+0x20/0x78
>>> >>> [   60.222268]  get_rcw_we+0x180/0x208
>>> >>> [   60.222278]  test_rslib_init+0x2c8/0xba0
>>> >>> [   60.222292]  do_one_initcall+0x4c/0x210
>>> >>> [   60.222303]  kernel_init_freeable+0x1fc/0x3a0
>>> >>> [   60.222317]  kernel_init+0x28/0x1f8
>>> >>> [   60.222327]  ret_from_fork+0x10/0x20
>>> >>>
>>> >>>> I wonder if a cond_resched() in some loop would help. Or using a
>>> >>>
>>> >>> I wasn't sure which loop would be the appropriate one but adding
>>> >>> cond_resched() as below worked as suggested.
>>> >>>
>>> >>> ----- BEGIN -----
>>> >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>>> >>> index 75cb1adac..322d7b0a8 100644
>>> >>> --- a/lib/reed_solomon/test_rslib.c
>>> >>> +++ b/lib/reed_solomon/test_rslib.c
>>> >>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs,
>>> >>>
>>> >>>                 if (memcmp(r, c, len * sizeof(*r)))
>>> >>>                         stat->dwrong++;
>>> >>> +
>>> >>> +               cond_resched();
>>> >>>         }
>>> >>>         stat->nwords += trials;
>>> >>>  }
>>> >>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs,
>>> >>>                 } else {
>>> >>>                         stat->rfail++;
>>> >>>                 }
>>> >>> +
>>> >>> +               cond_resched();
>>> >>>         }
>>> >>>         stat->nwords += trials;
>>> >>>  }
>>> >>> ----- END -----
>>> >>>
>>> >>>> pseudorandom generator. I remember the problems related to much
>>> >>>> slower random generator, for example, see the commit
>>> >>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz
>>> >>>> tests").
>>> >>>
>>> >>> I haven't tested this but I'll look into it!
>>> 
>>> For the same reason pointed out by the commit f900fde28883 ("crypto: 
>>> testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f
>>> ("random32: use real rng for non-deterministic randomness") does seem
>>> to have slowed down the rslib test.
>>> 
>>> On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with
>>> the commit d4150779e60f ("random32: use real rng for non-deterministic
>>> randomness") takes about 450sec compared to v5.18(without the commit)
>>> which only takes about 350sec.
>>> 
>>> However, random number generation does not seem to be the only reason
>>> for the rslib test being slow.
>>> Below is the function stats for v5.18 vs v5.19 for the test.
>>> 
>>> v5.18
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
>>> +   59.90%    59.90%  insmod   [kernel.kallsyms]      [k] decode_rs16
>>> +   30.28%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a748
>>> +   21.39%    21.39%  insmod   [kernel.kallsyms]      [k] encode_rs16
>>> +   20.94%     0.00%  insmod   [unknown]              [k] 0xffffffd60441b0d4
>>> +   14.17%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a324
>>> +   12.92%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a720
>>> +   11.74%     0.00%  insmod   [unknown]              [k] 0xffffffd60413a368
>>> +   11.63%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a4b4
>>> +    6.60%     0.01%  insmod   [unknown]              [k] 0xffffffd60413a46c
>>> +    3.26%     3.26%  insmod   [unknown]              [k] 0xffffffd60413a3c4
>>> +    3.11%     3.11%  insmod   [kernel.kallsyms]      [k] prandom_u32
>>> 
>>> v5.19
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] load_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_init_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]  [k] do_one_initcall
>>> +   45.13%    45.13%  insmod   [kernel.kallsyms]  [k] decode_rs16
>>> +   27.84%    27.83%  insmod   [kernel.kallsyms]  [k] get_random_u32
>>> +   22.83%     0.01%  insmod   [unknown]          [k] 0xffffffd62b379324
>>> +   22.76%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379748
>>> +   22.35%     0.00%  insmod   [unknown]          [k] 0xffffffd62b379720
>>> +   16.17%    16.17%  insmod   [kernel.kallsyms]  [k] encode_rs16
>>> 
>>> As the v5.19 stats shows, the commit d4150779e60f ("random32: use real
>>> rng for non-deterministic randomness") did slowed down the rslib test
>>> by using real random number generator(prandom_u32->get_random_u32).
>>> But it also shows that the decode/encoding takes quite a lot of time.
>>> 
>>> For the above reason, if we want make the softlockup go away, maybe
>>> just converting from real to pseudo random generator won't be enough?
>>> I tested with below changes applied to v6.14 which uses pseudo random
>>> numbers instead of real ones similar to what the commit f900fde28883
>>> ("crypto: testmgr - fix RNG performance in fuzz tests") did.
>>> 
>>> The result is that while it did help to fast up the test from 450 sec
>>> to 350 sec, it still triggers the softlockup.
>>
>>It is not that big speedup. 350 sec still provides a lot of
>>opportunities for a softlockup.
>>
>>> --- BEGIN ---
>>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>>> index 75cb1adac..5785e3c0c 100644
>>> --- a/lib/reed_solomon/test_rslib.c
>>> +++ b/lib/reed_solomon/test_rslib.c
>>[...]
>>> @@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws,
>>>         int errloc;
>>>         int i;
>>> 
>>> +       init_rnd_state(&rng);
>>> +
>>>         /* Load c with random data and encode */
>>>         for (i = 0; i < dlen; i++)
>>> -               c[i] = get_random_u32() & nn;
>>> +               c[i] = prandom_u32_state(&rng) & nn;
>>> 
>>>         memset(c + dlen, 0, nroots * sizeof(*c));
>>>         encode_rs16(rs, c, dlen, c + dlen, 0);
>>[...]
>>> --- END ---
>>> 
>>> Below is the function stats for with vs without the above change.
>>> 
>>> v6.14 without the above change
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
>>> +   46.94%    46.94%  insmod   [kernel.kallsyms]      [k] decode_rs16
>>> +   24.92%    24.92%  insmod   [kernel.kallsyms]      [k] get_random_u32
>>> +   23.59%     0.01%  insmod   [unknown]              [k] 0xffffffde423d88cc
>>> +   22.66%     0.00%  insmod   [unknown]              [k] 0xffffffde423d82c0
>>> +   22.39%     0.00%  insmod   [unknown]              [k] 0xffffffde423d88a4
>>> +   16.67%    16.67%  insmod   [kernel.kallsyms]      [k] encode_rs16
>>> 
>>> v6.14 with the above change
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] load_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_init_module
>>> +  100.00%     0.00%  insmod   [kernel.kallsyms]      [k] do_one_initcall
>>> +   61.51%    61.50%  insmod   [kernel.kallsyms]      [k] decode_rs16
>>> +   31.15%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858cc
>>> +   21.75%    21.75%  insmod   [kernel.kallsyms]      [k] encode_rs16
>>> +   21.36%     0.01%  insmod   [unknown]              [k] 0xffffffe70137f120
>>> +   15.20%     0.00%  insmod   [unknown]              [k] 0xffffffe7013852c0
>>> +   12.79%     0.00%  insmod   [unknown]              [k] 0xffffffe7013858a4
>>> +   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe7013854c0
>>> +   12.00%     0.00%  insmod   [unknown]              [k] 0xffffffe701385304
>>> +    6.74%     0.01%  insmod   [unknown]              [k] 0xffffffe701385600
>>> +    3.43%     3.43%  insmod   [unknown]              [k] 0xffffffe701385598
>>> +    2.50%     2.50%  insmod   [unknown]              [k] 0xffffffe701385564
>>> +    1.96%     1.96%  insmod   [kernel.kallsyms]      [k] prandom_u32_state
>>> +    1.85%     1.85%  insmod   [kernel.kallsyms]      [k] get_random_u64
>>> 
>>> So maybe we still need cond_resched() in addition to converting the
>>> test from using psuedo random numbers to real ones in order to make
>>> the softlockup go away?
>>
>>IMHO, cond_resched() should be added, definitely. The code clearly
>>prevents scheduling for tens of seconds.
>
>Thank you for checking, and I agree.
>If needed, I can prepare the fix for the test.
>
>>> >>>> That said, I did not dig deep into the code. And I did not try to
>>> >>>> reproduce the softlockup. I am pretty busy at the moment with some
>>> >>>> other stuff. I just wanted to give it a look and share my opinion.
>>> >>>
>>> >>> I think the softlockup is rather a problem of test itself,
>>> >>> not the two commits.
>>> >>
>>> >>Thanks Ryo for looking into this! I think we need to have a technical
>>> >>explanation/understanding of the problem so that it is clear how my
>>> >>series triggers or exaggerates the issue.
>>> 
>>> As mentioned earlier, I'm sorry that I should have run the test as
>>> inserted module... It seems the series does make the test more prone
>>> to softlockups.
>>
>>IMHO, the main difference is that the patch "serial: 8250: Switch to
>>nbcon console" removes touch_nmi_watchdog() from
>>serial8250_console_write().
>>
>>The touch_nmi_watchdog() resets the softlockup watchdog. It might
>>hide that the CPU did not schedule for a long time.
>>
>>The touch_nmi_watchdog() was there because the console_lock() owner,
>>used by the legacy loop, was responsible for flushing all pending
>>messages. And there might be many pending messages when new ones
>>were added by other CPUs in parallel. And the legacy loop
>>could not call cond_resched() when called from printk() because
>>printk() might be called in atomic context.
>
>I see. Without the John's series, the cond_resched() in the mention
>code path should be called during the rslib test as it's not in atomic
>context in addition to the touch_nmi_watchdog()s.

To test the above, I run the rslib test using legacy console(without the
John's series) with the touch_nmi_watchdog()s removed as following.

The result is that it triggered the softlockup as expected. So I guess we
can say that the legacy console was indeed hiding the softlockup scenario
as suggested by Petr.

--- BEGIN ---
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index c8704eb9e..50b64a6a2 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -2113,7 +2113,6 @@ static bool wait_for_lsr(struct uart_8250_port *up, int bits)
                if (--tmout == 0)
                        break;
                udelay(1);
-               touch_nmi_watchdog();
        }

        return (tmout != 0);
@@ -3425,8 +3424,6 @@ void serial8250_console_write(struct uart_8250_port *up, const char *s,
        unsigned int ier, use_fifo;
        int locked = 1;

-       touch_nmi_watchdog();
-
        if (oops_in_progress)
                locked = uart_port_trylock_irqsave(port, &flags);
        else
--- END ---

>>The series removed the touch_nmi_watchdog() because the kthread
>>called cond_resched. And the amount of messages flushed in
>>the emergency or panic context was limited.
>
>With the John's series appied, I guess the kthread were running on a
>cpu other than the one running the rslib test as John said[0], and neither
>of touch_nmi_watchdog() nor cond_resched() were called that could prevent
>the softlockup.

To test the above, I removed the touch_nmi_watchdog() mentioned by John[0]
on top of the John's series with the printing forced to be done on the same
cpu running the rslib test by marking emergency section(below change applied).

The result is that it triggered the softlockup as expected. Similar to
the first test, the touch_nmi_watchdog() was preventing the softlockup
when its marked with emergency section.
And I believe this implies that the kthread was running on some other cpu
as stated above as it should also be calling the touch_nmi_watchdog()
when it does printing.

--- BEGIN ---
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 21bc9e79c..2f97a603d 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -2133,7 +2133,6 @@ static bool wait_for_lsr(struct uart_8250_port *up, int bits)
                if (--tmout == 0)
                        break;
                udelay(1);
-               touch_nmi_watchdog();
        }

        return (tmout != 0);
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index fd12efcc4..07c879a64 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void)
        cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
        (*cpu_emergency_nesting)++;
 }
+EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter);

 /**
  * nbcon_cpu_emergency_exit - Exit an emergency section
@@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void)

        preempt_enable();
 }
+EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit);

 /**
  * nbcon_alloc - Allocate and init the nbcon console specific data
diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
index 75cb1adac..8208f8317 100644
--- a/lib/reed_solomon/test_rslib.c
+++ b/lib/reed_solomon/test_rslib.c
@@ -11,6 +11,7 @@
 #include <linux/moduleparam.h>
 #include <linux/random.h>
 #include <linux/slab.h>
+#include <linux/console.h>

 enum verbosity {
        V_SILENT,
@@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws,
        int nroots = rs->codec->nroots;
        int errs, eras, retval;

-       if (v >= V_PROGRESS)
+       if (v >= V_PROGRESS) {
+               nbcon_cpu_emergency_enter();
                pr_info("  %s\n", desc[method]);
+               nbcon_cpu_emergency_exit();
+       }

        for (errs = 0; errs <= nroots / 2; errs++)
                for (eras = 0; eras <= nroots - 2 * errs; eras++)
@@ -353,8 +357,11 @@ static int exercise_rs(struct rs_control *rs, struct wspace *ws,
        int retval = 0;
        int i;

-       if (v >= V_PROGRESS)
+       if (v >= V_PROGRESS) {
+               nbcon_cpu_emergency_enter();
                pr_info("Testing up to error correction capacity...\n");
+               nbcon_cpu_emergency_exit();
+       }

        for (i = 0; i <= IN_PLACE; i++)
                retval |= ex_rs_helper(rs, ws, len, trials, i);
@@ -411,8 +418,11 @@ static int exercise_rs_bc(struct rs_control *rs, struct wspace *ws,
        int nroots = rs->codec->nroots;
        int errs, eras, cutoff;

-       if (v >= V_PROGRESS)
+       if (v >= V_PROGRESS) {
+               nbcon_cpu_emergency_enter();
                pr_info("Testing beyond error correction capacity...\n");
+               nbcon_cpu_emergency_exit();
+       }

        for (errs = 1; errs <= nroots; errs++) {
                eras = nroots - 2 * errs + 1;
@@ -468,8 +478,10 @@ static int run_exercise(struct etab *e)

                prev_pad = pad;
                if (v >= V_PROGRESS) {
+                       nbcon_cpu_emergency_enter();
                        pr_info("Testing (%d,%d)_%d code...\n",
                                        len, kk - pad, nn + 1);
+                       nbcon_cpu_emergency_exit();
                }

                retval |= exercise_rs(rsc, ws, len, e->ntrials);
--- END ---

>>My view:
>>
>>The touch_nmi_watchdog() did hide the softlockup scenario in the rslib
>>test. The printk series allowed to see the problem reliably.
>>
>>By other words, the rslib test should get fixed by adding
>>cond_resched(). And the conversion of the 8250 serial console
>>driver to nbcon is correct.
>
>I see, and agree! I'll test what's discussed here so that we can say
>the above statement is true and the John's series can be brought back.

I would appreciate any comment on the above tests!
If sounds good, I will prepare a fix adding cond_resched() to the
rslib test later on.

>>That said, see below.
>>
>>> I am still looking into the issue but I noticed that if the printing
>>> during the test were done in a caller context, not by printk thread,
>>> the softlock goes away.
>>> 
>>> I tested with a change something like below so that the printing during
>>> the test were not delegated to printk thread ("pr/ttyS0"). With the below
>>> change, the softlock no longer shows up.
>>> 
>>> --- BEGIN ---
>>> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
>>> index fd12efcc4..07c879a64 100644
>>> --- a/kernel/printk/nbcon.c
>>> +++ b/kernel/printk/nbcon.c
>>> @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void)
>>>         cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
>>>         (*cpu_emergency_nesting)++;
>>>  }
>>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter);
>>> 
>>>  /**
>>>   * nbcon_cpu_emergency_exit - Exit an emergency section
>>> @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void)
>>> 
>>>         preempt_enable();
>>>  }
>>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit);
>>> 
>>>  /**
>>>   * nbcon_alloc - Allocate and init the nbcon console specific data
>>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c
>>> index 75cb1adac..8208f8317 100644
>>> --- a/lib/reed_solomon/test_rslib.c
>>> +++ b/lib/reed_solomon/test_rslib.c
>>> @@ -11,6 +11,7 @@
>>>  #include <linux/moduleparam.h>
>>>  #include <linux/random.h>
>>>  #include <linux/slab.h>
>>> +#include <linux/console.h>
>>> 
>>>  enum verbosity {
>>>         V_SILENT,
>>> @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws,
>>>         int nroots = rs->codec->nroots;
>>>         int errs, eras, retval;
>>> 
>>> -       if (v >= V_PROGRESS)
>>> +       if (v >= V_PROGRESS) {
>>> +               nbcon_cpu_emergency_enter();
>>>                 pr_info("  %s\n", desc[method]);
>>> +               nbcon_cpu_emergency_exit();
>>> +       }
>>
>>I do not understand how this prevented the softlockup.
>>
>>The emergency context causes that the messages is flushed directly
>>from printk(), see nbcon_atomic_flush_pending() in vprintk_emit().
>>But see neither cond_resched() nor touch_nmi_watchdog() in this
>>code path.
>>
>>By other words, I do _not_ see how the emergency context either
>>triggered scheduling or did reset the watchdog.
>>
>>But I might be looking at a different code base. I looked at
>>Linus' master and linux-next.
>
>I used this kernel[1] which is for raspberry pi. Let me recheck
>with some other machine with Linus' master and linux-next to see
>if the behavior is raspberry pi specific.

I tested with Linus' master on x86 qemu. I was able to see the rslib
tests prone to softlockup and the John's series exaggerating it, same
as raspberry pi.

But I couldn't see the softlockup go away by marking the emergency
section around the printk call within the rslib test. Looks like it
never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because
trasmission gets immediatly completed on qemu board?
I'm afraid that I currently don't have a convenient machine that I can
test this futher at the moment...

Sincerely,
Ryo Takakura

[0] https://lore.kernel.org/all/84ikmwqonm.fsf@jogness.linutronix.de/

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-30  9:15               ` Ryo Takakura
@ 2025-04-30 15:41                 ` John Ogness
  2025-05-01  4:10                   ` Ryo Takakura
  0 siblings, 1 reply; 21+ messages in thread
From: John Ogness @ 2025-04-30 15:41 UTC (permalink / raw)
  To: Ryo Takakura, pmladek
  Cc: ryotkkr98, Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang

On 2025-04-30, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>>> The touch_nmi_watchdog() resets the softlockup watchdog. It might
>>> hide that the CPU did not schedule for a long time.
>
> To test the above, I run the rslib test using legacy console(without
> the John's series) with the touch_nmi_watchdog()s removed as
> following.
>
> The result is that it triggered the softlockup as expected. So I guess
> we can say that the legacy console was indeed hiding the softlockup
> scenario as suggested by Petr.

Excellent.

>> With the John's series appied, I guess the kthread were running on a
>> cpu other than the one running the rslib test as John said[0], and
>> neither of touch_nmi_watchdog() nor cond_resched() were called that
>> could prevent the softlockup.
>
> To test the above, I removed the touch_nmi_watchdog() mentioned by John[0]
> on top of the John's series with the printing forced to be done on the same
> cpu running the rslib test by marking emergency section(below change applied).
>
> The result is that it triggered the softlockup as expected. Similar to
> the first test, the touch_nmi_watchdog() was preventing the softlockup
> when its marked with emergency section.

Excellent.

> And I believe this implies that the kthread was running on some other cpu
> as stated above as it should also be calling the touch_nmi_watchdog()
> when it does printing.

Agreed.

> If sounds good, I will prepare a fix adding cond_resched() to the
> rslib test later on.

Yes, please add a cond_resched() to the rslib test.

> I tested with Linus' master on x86 qemu. I was able to see the rslib
> tests prone to softlockup and the John's series exaggerating it, same
> as raspberry pi.
>
> But I couldn't see the softlockup go away by marking the emergency
> section around the printk call within the rslib test. Looks like it
> never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because
> trasmission gets immediatly completed on qemu board?

qemu does not emulate a baudrate, so indeed transmission completes
immediately. It would need to be tested on real hardware. But I am
certain we found the cause and explanation. Thank you for digging into
this!

John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-04-30 15:41                 ` John Ogness
@ 2025-05-01  4:10                   ` Ryo Takakura
  2025-06-16 15:15                     ` Florian Bezdeka
  0 siblings, 1 reply; 21+ messages in thread
From: Ryo Takakura @ 2025-05-01  4:10 UTC (permalink / raw)
  To: john.ogness, pmladek
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, ryotkkr98

On Wed, 30 Apr 2025 17:47:01 +0206, John Ogness wrote:
>On 2025-04-30, Ryo Takakura <ryotkkr98@gmail.com> wrote:
>>>> The touch_nmi_watchdog() resets the softlockup watchdog. It might
>>>> hide that the CPU did not schedule for a long time.
>>
>> To test the above, I run the rslib test using legacy console(without
>> the John's series) with the touch_nmi_watchdog()s removed as
>> following.
>>
>> The result is that it triggered the softlockup as expected. So I guess
>> we can say that the legacy console was indeed hiding the softlockup
>> scenario as suggested by Petr.
>
>Excellent.
>
>>> With the John's series appied, I guess the kthread were running on a
>>> cpu other than the one running the rslib test as John said[0], and
>>> neither of touch_nmi_watchdog() nor cond_resched() were called that
>>> could prevent the softlockup.
>>
>> To test the above, I removed the touch_nmi_watchdog() mentioned by John[0]
>> on top of the John's series with the printing forced to be done on the same
>> cpu running the rslib test by marking emergency section(below change applied).
>>
>> The result is that it triggered the softlockup as expected. Similar to
>> the first test, the touch_nmi_watchdog() was preventing the softlockup
>> when its marked with emergency section.
>
>Excellent.
>
>> And I believe this implies that the kthread was running on some other cpu
>> as stated above as it should also be calling the touch_nmi_watchdog()
>> when it does printing.
>
>Agreed.
>
>> If sounds good, I will prepare a fix adding cond_resched() to the
>> rslib test later on.
>
>Yes, please add a cond_resched() to the rslib test.

Thank you for checking the results, Got it!

>> I tested with Linus' master on x86 qemu. I was able to see the rslib
>> tests prone to softlockup and the John's series exaggerating it, same
>> as raspberry pi.
>>
>> But I couldn't see the softlockup go away by marking the emergency
>> section around the printk call within the rslib test. Looks like it
>> never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because
>> trasmission gets immediatly completed on qemu board?
>
>qemu does not emulate a baudrate, so indeed transmission completes
>immediately. It would need to be tested on real hardware. But I am

Oh I see, that makes sense.

>certain we found the cause and explanation. Thank you for digging into
>this!

Great, my pleasure!

Sincerely,
Ryo Takakura

>John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-05-01  4:10                   ` Ryo Takakura
@ 2025-06-16 15:15                     ` Florian Bezdeka
  2025-06-18  4:42                       ` John Ogness
  0 siblings, 1 reply; 21+ messages in thread
From: Florian Bezdeka @ 2025-06-16 15:15 UTC (permalink / raw)
  To: Ryo Takakura, john.ogness, pmladek
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, Jan Kiszka

Hi all,

On Thu, 2025-05-01 at 13:10 +0900, Ryo Takakura wrote:
> On Wed, 30 Apr 2025 17:47:01 +0206, John Ogness wrote:
> > On 2025-04-30, Ryo Takakura <ryotkkr98@gmail.com> wrote:
> > > > > The touch_nmi_watchdog() resets the softlockup watchdog. It might
> > > > > hide that the CPU did not schedule for a long time.
> > > 
> > > To test the above, I run the rslib test using legacy console(without
> > > the John's series) with the touch_nmi_watchdog()s removed as
> > > following.
> > > 
> > > The result is that it triggered the softlockup as expected. So I guess
> > > we can say that the legacy console was indeed hiding the softlockup
> > > scenario as suggested by Petr.
> > 
> > Excellent.
> > 
> > > > With the John's series appied, I guess the kthread were running on a
> > > > cpu other than the one running the rslib test as John said[0], and
> > > > neither of touch_nmi_watchdog() nor cond_resched() were called that
> > > > could prevent the softlockup.
> > > 
> > > To test the above, I removed the touch_nmi_watchdog() mentioned by John[0]
> > > on top of the John's series with the printing forced to be done on the same
> > > cpu running the rslib test by marking emergency section(below change applied).
> > > 
> > > The result is that it triggered the softlockup as expected. Similar to
> > > the first test, the touch_nmi_watchdog() was preventing the softlockup
> > > when its marked with emergency section.
> > 
> > Excellent.
> > 
> > > And I believe this implies that the kthread was running on some other cpu
> > > as stated above as it should also be calling the touch_nmi_watchdog()
> > > when it does printing.
> > 
> > Agreed.
> > 
> > > If sounds good, I will prepare a fix adding cond_resched() to the
> > > rslib test later on.
> > 
> > Yes, please add a cond_resched() to the rslib test.
> 
> Thank you for checking the results, Got it!
> 
> > > I tested with Linus' master on x86 qemu. I was able to see the rslib
> > > tests prone to softlockup and the John's series exaggerating it, same
> > > as raspberry pi.
> > > 
> > > But I couldn't see the softlockup go away by marking the emergency
> > > section around the printk call within the rslib test. Looks like it
> > > never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because
> > > trasmission gets immediatly completed on qemu board?
> > 
> > qemu does not emulate a baudrate, so indeed transmission completes
> > immediately. It would need to be tested on real hardware. But I am
> 
> Oh I see, that makes sense.
> 
> > certain we found the cause and explanation. Thank you for digging into
> > this!
> 
> Great, my pleasure!

With that it has been confirmed that the original test report [1] was
wrong. It reported the trigger, not the root cause.

b63e6f60eab4 ("serial: 8250: Switch to nbcon console")

was / is correct but got reverted by Greg in mainline with 

f79b163c4231 ("Revert "serial: 8250: Switch to nbcon console"")

This leads to the situation that stable-rt branches still ship this
patch while mainline and stable do not. John, is that intended or was
mainline / stable never informed about that?

Best regards,
Florian


[1] https://lore.kernel.org/all/202501221029.fb0d574d-lkp@intel.com/#t

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [linux-next:master] [serial]  b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
  2025-06-16 15:15                     ` Florian Bezdeka
@ 2025-06-18  4:42                       ` John Ogness
  0 siblings, 0 replies; 21+ messages in thread
From: John Ogness @ 2025-06-18  4:42 UTC (permalink / raw)
  To: Florian Bezdeka, Ryo Takakura, pmladek
  Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, Jan Kiszka

Hi Florian,

On 2025-06-16, Florian Bezdeka <florian.bezdeka@siemens.com> wrote:
> With that it has been confirmed that the original test report [1] was
> wrong. It reported the trigger, not the root cause.
>
> b63e6f60eab4 ("serial: 8250: Switch to nbcon console")
>
> was / is correct but got reverted by Greg in mainline with 
>
> f79b163c4231 ("Revert "serial: 8250: Switch to nbcon console"")
>
> This leads to the situation that stable-rt branches still ship this
> patch while mainline and stable do not. John, is that intended or was
> mainline / stable never informed about that?

stable-rt carries the 8250-nbcon patches because we intend for them to
go mainline. Greg reverted them from mainline at my request because of 2
power management regression reports (related to suspending on arm32). I
do not believe 8250-nbcon is the real problem, but I need to determine
the root issue and offer a solution before sending the 8250-nbcon
patches to mainline again.

Regardless, the fix in this thread is still relevant for mainline now
because there are now other nbcon serial drivers mainline.

John

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-06-18  4:42 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-22  2:28 [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] kernel test robot
2025-01-22  8:41 ` John Ogness
2025-01-22  9:37   ` Greg Kroah-Hartman
2025-01-24 16:10 ` Petr Mladek
2025-01-24 16:39   ` Petr Mladek
2025-03-15  3:38   ` Ryo Takakura
2025-03-17  8:45     ` John Ogness
2025-03-17 14:42       ` Ryo Takakura
2025-04-21  3:41         ` Ryo Takakura
2025-04-22 12:15           ` Petr Mladek
2025-04-22 14:03             ` John Ogness
2025-04-24  8:11             ` Ryo Takakura
2025-04-24  9:00               ` John Ogness
2025-04-24 14:13                 ` Ryo Takakura
2025-04-24  9:02               ` Petr Mladek
2025-04-24 14:17                 ` Ryo Takakura
2025-04-30  9:15               ` Ryo Takakura
2025-04-30 15:41                 ` John Ogness
2025-05-01  4:10                   ` Ryo Takakura
2025-06-16 15:15                     ` Florian Bezdeka
2025-06-18  4:42                       ` John Ogness

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).