* [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]
@ 2025-01-22 2:28 kernel test robot
2025-01-22 8:41 ` John Ogness
2025-01-24 16:10 ` Petr Mladek
0 siblings, 2 replies; 21+ messages in thread
From: kernel test robot @ 2025-01-22 2:28 UTC (permalink / raw)
To: John Ogness
Cc: oe-lkp, lkp, Greg Kroah-Hartman, Petr Mladek, linux-serial,
oliver.sang
Hello,
kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on:
commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b]
in testcase: kunit
version:
with following parameters:
group: group-01
config: x86_64-rhel-9.4-kunit
compiler: gcc-12
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com
[ 231.759560][ C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860]
[ 231.759572][ C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu]
[ 231.759681][ C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S B N 6.13.0-rc3-00034-gb63e6f60eab4 #1
[ 231.759690][ C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST
[ 231.759694][ C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
[ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon
[ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9
All code
========
0: 87 68 83 xchg %ebp,-0x7d(%rax)
3: 00 00 add %al,(%rax)
5: 89 da mov %ebx,%edx
7: d3 fa sar %cl,%edx
9: 41 0f b6 4d 00 movzbl 0x0(%r13),%ecx
e: 41 38 cc cmp %cl,%r12b
11: 7c 08 jl 0x1b
13: 84 c9 test %cl,%cl
15: 0f 85 64 02 00 00 jne 0x27f
1b: 8b 75 04 mov 0x4(%rbp),%esi
1e: 21 f3 and %esi,%ebx
20: 01 d3 add %edx,%ebx
22: 39 de cmp %ebx,%esi
24: 7e c0 jle 0xffffffffffffffe6
26: 48 8b 3c 24 mov (%rsp),%rdi
2a:* 48 63 db movslq %ebx,%rbx <-- trapping instruction
2d: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx
31: 48 89 d9 mov %rbx,%rcx
34: 48 c1 e9 03 shr $0x3,%rcx
38: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi
3d: 48 89 d9 mov %rbx,%rcx
Code starting with the faulting instruction
===========================================
0: 48 63 db movslq %ebx,%rbx
3: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx
7: 48 89 d9 mov %rbx,%rcx
a: 48 c1 e9 03 shr $0x3,%rcx
e: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi
13: 48 89 d9 mov %rbx,%rcx
[ 231.759717][ C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297
[ 231.759723][ C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000
[ 231.759728][ C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00
[ 231.759733][ C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e
[ 231.759738][ C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007
[ 231.759743][ C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000
[ 231.759748][ C3] FS: 00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000
[ 231.759754][ C3] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 231.759759][ C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0
[ 231.759764][ C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e
[ 231.759769][ C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 231.759774][ C3] Call Trace:
[ 231.759778][ C3] <IRQ>
[ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770)
[ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685)
[ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803)
[ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773)
[ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 (discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3))
[ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286)
[ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868)
[ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1056)
[ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049)
[ 231.759842][ C3] </IRQ>
[ 231.759845][ C3] <TASK>
[ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702)
[ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon
[ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib
[ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib
[ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib
[ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib
[ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib
[ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib
[ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib
[ 231.759915][ C3] do_one_initcall (init/main.c:1266)
[ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69)
[ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257)
[ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345)
[ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324)
[ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182)
[ 231.759951][ C3] do_init_module (kernel/module/main.c:2910)
[ 231.759958][ C3] load_module (kernel/module/main.c:3376)
[ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223)
[ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363)
[ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122)
[ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38)
[ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565)
[ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565)
[ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541)
[ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153)
[ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577)
[ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569)
[ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.c:1145 fs/file.c:1159)
[ 231.760028][ C3] ? security_capable (security/security.c:1142)
[ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.c:3587 kernel/module/main.c:3587)
[ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
[ 231.760046][ C3] ? do_mmap (mm/mmap.c:496)
[ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838)
[ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288)
[ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590)
[ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586)
[ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584)
[ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570)
[ 231.760089][ C3] ? fget (fs/file.c:1063)
[ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.h:157 fs/file_table.c:501)
[ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547)
[ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218)
[ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218)
[ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102)
[ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340)
[ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218)
[ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102)
[ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539)
[ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 231.760148][ C3] RIP: 0033:0x7f64c7711719
[ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
All code
========
0: 08 89 e8 5b 5d c3 or %cl,-0x3ca2a418(%rcx)
6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
d: 00 00 00
10: 90 nop
11: 48 89 f8 mov %rdi,%rax
14: 48 89 f7 mov %rsi,%rdi
17: 48 89 d6 mov %rdx,%rsi
1a: 48 89 ca mov %rcx,%rdx
1d: 4d 89 c2 mov %r8,%r10
20: 4d 89 c8 mov %r9,%r8
23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 ret
33: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06f1
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 ret
9: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06c7
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-01-22 2:28 [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] kernel test robot @ 2025-01-22 8:41 ` John Ogness 2025-01-22 9:37 ` Greg Kroah-Hartman 2025-01-24 16:10 ` Petr Mladek 1 sibling, 1 reply; 21+ messages in thread From: John Ogness @ 2025-01-22 8:41 UTC (permalink / raw) To: kernel test robot Cc: oe-lkp, lkp, Greg Kroah-Hartman, Petr Mladek, linux-serial, oliver.sang Hi Greg, On 2025-01-22, kernel test robot <oliver.sang@intel.com> wrote: > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on: > > commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console") > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master It looks like we need to revert this commit (and the one after it) until I can find out what is going on here. From linux-next they are: b63e6f60eab4 serial: 8250: Switch to nbcon console 422c9727b07f serial: 8250: Revert "drop lockdep annotation from serial8250_clear_IER()" I hope to get this solved soon, but I am currently taking part in a Rust training, which does not leave me many extra brain cycles at the moment. John Ogness ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-01-22 8:41 ` John Ogness @ 2025-01-22 9:37 ` Greg Kroah-Hartman 0 siblings, 0 replies; 21+ messages in thread From: Greg Kroah-Hartman @ 2025-01-22 9:37 UTC (permalink / raw) To: John Ogness; +Cc: kernel test robot, oe-lkp, lkp, Petr Mladek, linux-serial On Wed, Jan 22, 2025 at 09:47:00AM +0106, John Ogness wrote: > Hi Greg, > > On 2025-01-22, kernel test robot <oliver.sang@intel.com> wrote: > > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on: > > > > commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console") > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master > > It looks like we need to revert this commit (and the one after it) until > I can find out what is going on here. > > >From linux-next they are: > > b63e6f60eab4 serial: 8250: Switch to nbcon console > > 422c9727b07f serial: 8250: Revert "drop lockdep annotation from serial8250_clear_IER()" Ok, now reverted, thanks for finding the offending commits. > I hope to get this solved soon, but I am currently taking part in a Rust > training, which does not leave me many extra brain cycles at the moment. Totally understood :) thanks, greg k-h ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-01-22 2:28 [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] kernel test robot 2025-01-22 8:41 ` John Ogness @ 2025-01-24 16:10 ` Petr Mladek 2025-01-24 16:39 ` Petr Mladek 2025-03-15 3:38 ` Ryo Takakura 1 sibling, 2 replies; 21+ messages in thread From: Petr Mladek @ 2025-01-24 16:10 UTC (permalink / raw) To: kernel test robot Cc: John Ogness, oe-lkp, lkp, Greg Kroah-Hartman, linux-serial, Jason A. Donenfeld Hi, I am Jason into Cc. I wonder if the softlockup might be caused by a lack of entropy. On Wed 2025-01-22 10:28:52, kernel test robot wrote: > > > Hello, > > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on: > > commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console") > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master > > [test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b] > > in testcase: kunit > version: > with following parameters: > > group: group-01 > > > > config: x86_64-rhel-9.4-kunit > compiler: gcc-12 > test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory > > (please refer to attached dmesg/kmsg for entire log/backtrace) > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <oliver.sang@intel.com> > | Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com > > > [ 231.759560][ C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860] > [ 231.759572][ C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu] > [ 231.759681][ C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S B N 6.13.0-rc3-00034-gb63e6f60eab4 #1 > [ 231.759690][ C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST > [ 231.759694][ C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013 > [ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon > [ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9 > All code > ======== > 0: 87 68 83 xchg %ebp,-0x7d(%rax) > 3: 00 00 add %al,(%rax) > 5: 89 da mov %ebx,%edx > 7: d3 fa sar %cl,%edx > 9: 41 0f b6 4d 00 movzbl 0x0(%r13),%ecx > e: 41 38 cc cmp %cl,%r12b > 11: 7c 08 jl 0x1b > 13: 84 c9 test %cl,%cl > 15: 0f 85 64 02 00 00 jne 0x27f > 1b: 8b 75 04 mov 0x4(%rbp),%esi > 1e: 21 f3 and %esi,%ebx > 20: 01 d3 add %edx,%ebx > 22: 39 de cmp %ebx,%esi > 24: 7e c0 jle 0xffffffffffffffe6 > 26: 48 8b 3c 24 mov (%rsp),%rdi > 2a:* 48 63 db movslq %ebx,%rbx <-- trapping instruction > 2d: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx > 31: 48 89 d9 mov %rbx,%rcx > 34: 48 c1 e9 03 shr $0x3,%rcx > 38: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi > 3d: 48 89 d9 mov %rbx,%rcx > > Code starting with the faulting instruction > =========================================== > 0: 48 63 db movslq %ebx,%rbx > 3: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx > 7: 48 89 d9 mov %rbx,%rcx > a: 48 c1 e9 03 shr $0x3,%rcx > e: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi > 13: 48 89 d9 mov %rbx,%rcx > [ 231.759717][ C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297 > [ 231.759723][ C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000 > [ 231.759728][ C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00 > [ 231.759733][ C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e > [ 231.759738][ C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007 > [ 231.759743][ C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000 > [ 231.759748][ C3] FS: 00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000 > [ 231.759754][ C3] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 231.759759][ C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0 > [ 231.759764][ C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e > [ 231.759769][ C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600 > [ 231.759774][ C3] Call Trace: > [ 231.759778][ C3] <IRQ> > [ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770) > [ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685) > [ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803) > [ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773) > [ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 (discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3)) > [ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286) > [ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868) > [ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1056) > [ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) > [ 231.759842][ C3] </IRQ> > [ 231.759845][ C3] <TASK> > [ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) > [ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon > [ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib Honestly, I do not see much how this could be related to the serial console. This is a module for testing the Generic Reed Solomon encoder / decoder library. It seems to do a lot of computation and needs a lot of random numbers. I wonder if there is not enough entropy and the test is too slow. From the config file: CONFIG_PREEMPT_VOLUNTARY=y I wonder if a cond_resched() in some loop would help. Or using a pseudorandom generator. I remember the problems related to much slower random generator, for example, see the commit f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz tests"). That said, I did not dig deep into the code. And I did not try to reproduce the softlockup. I am pretty busy at the moment with some other stuff. I just wanted to give it a look and share my opinion. Best Regards, Petr > [ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib > [ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib > [ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib > [ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib > [ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib > [ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib > [ 231.759915][ C3] do_one_initcall (init/main.c:1266) > [ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) > [ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257) > [ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345) > [ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324) > [ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182) > [ 231.759951][ C3] do_init_module (kernel/module/main.c:2910) > [ 231.759958][ C3] load_module (kernel/module/main.c:3376) > [ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223) > [ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363) > [ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122) > [ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38) > [ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565) > [ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565) > [ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541) > [ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153) > [ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577) > [ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569) > [ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.c:1145 fs/file.c:1159) > [ 231.760028][ C3] ? security_capable (security/security.c:1142) > [ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.c:3587 kernel/module/main.c:3587) > [ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) > [ 231.760046][ C3] ? do_mmap (mm/mmap.c:496) > [ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838) > [ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288) > [ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590) > [ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586) > [ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584) > [ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570) > [ 231.760089][ C3] ? fget (fs/file.c:1063) > [ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.h:157 fs/file_table.c:501) > [ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547) > [ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) > [ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) > [ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) > [ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340) > [ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) > [ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) > [ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) > [ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) > [ 231.760148][ C3] RIP: 0033:0x7f64c7711719 > [ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48 > All code > ======== > 0: 08 89 e8 5b 5d c3 or %cl,-0x3ca2a418(%rcx) > 6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) > d: 00 00 00 > 10: 90 nop > 11: 48 89 f8 mov %rdi,%rax > 14: 48 89 f7 mov %rsi,%rdi > 17: 48 89 d6 mov %rdx,%rsi > 1a: 48 89 ca mov %rcx,%rdx > 1d: 4d 89 c2 mov %r8,%r10 > 20: 4d 89 c8 mov %r9,%r8 > 23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9 > 28: 0f 05 syscall > 2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction > 30: 73 01 jae 0x33 > 32: c3 ret > 33: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06f1 > 3a: f7 d8 neg %eax > 3c: 64 89 01 mov %eax,%fs:(%rcx) > 3f: 48 rex.W > > Code starting with the faulting instruction > =========================================== > 0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > 6: 73 01 jae 0x9 > 8: c3 ret > 9: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06c7 > 10: f7 d8 neg %eax > 12: 64 89 01 mov %eax,%fs:(%rcx) > 15: 48 rex.W > > > The kernel config and materials to reproduce are available at: > https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com > > > > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-01-24 16:10 ` Petr Mladek @ 2025-01-24 16:39 ` Petr Mladek 2025-03-15 3:38 ` Ryo Takakura 1 sibling, 0 replies; 21+ messages in thread From: Petr Mladek @ 2025-01-24 16:39 UTC (permalink / raw) To: kernel test robot Cc: John Ogness, oe-lkp, lkp, Greg Kroah-Hartman, linux-serial, Jason A. Donenfeld On Fri 2025-01-24 17:10:13, Petr Mladek wrote: > Hi, > > I am Jason into Cc. I wonder if the softlockup might be caused > by a lack of entropy. > > On Wed 2025-01-22 10:28:52, kernel test robot wrote: > > > > > > Hello, > > > > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on: > > > > commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console") > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master > > > > [test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b] > > > > in testcase: kunit > > version: > > with following parameters: > > > > group: group-01 > > > > > > > > config: x86_64-rhel-9.4-kunit > > compiler: gcc-12 > > test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory > > > > (please refer to attached dmesg/kmsg for entire log/backtrace) > > > > > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > the same patch/commit), kindly add following tags > > | Reported-by: kernel test robot <oliver.sang@intel.com> > > | Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com > > > > > > [ 231.759560][ C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860] > > [ 231.759572][ C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu] > > [ 231.759681][ C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S B N 6.13.0-rc3-00034-gb63e6f60eab4 #1 > > [ 231.759690][ C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST > > [ 231.759694][ C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013 > > [ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon > > [ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9 > > All code > > ======== > > 0: 87 68 83 xchg %ebp,-0x7d(%rax) > > 3: 00 00 add %al,(%rax) > > 5: 89 da mov %ebx,%edx > > 7: d3 fa sar %cl,%edx > > 9: 41 0f b6 4d 00 movzbl 0x0(%r13),%ecx > > e: 41 38 cc cmp %cl,%r12b > > 11: 7c 08 jl 0x1b > > 13: 84 c9 test %cl,%cl > > 15: 0f 85 64 02 00 00 jne 0x27f > > 1b: 8b 75 04 mov 0x4(%rbp),%esi > > 1e: 21 f3 and %esi,%ebx > > 20: 01 d3 add %edx,%ebx > > 22: 39 de cmp %ebx,%esi > > 24: 7e c0 jle 0xffffffffffffffe6 > > 26: 48 8b 3c 24 mov (%rsp),%rdi > > 2a:* 48 63 db movslq %ebx,%rbx <-- trapping instruction > > 2d: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx > > 31: 48 89 d9 mov %rbx,%rcx > > 34: 48 c1 e9 03 shr $0x3,%rcx > > 38: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi > > 3d: 48 89 d9 mov %rbx,%rcx > > > > Code starting with the faulting instruction > > =========================================== > > 0: 48 63 db movslq %ebx,%rbx > > 3: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx > > 7: 48 89 d9 mov %rbx,%rcx > > a: 48 c1 e9 03 shr $0x3,%rcx > > e: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi > > 13: 48 89 d9 mov %rbx,%rcx > > [ 231.759717][ C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297 > > [ 231.759723][ C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000 > > [ 231.759728][ C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00 > > [ 231.759733][ C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e > > [ 231.759738][ C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007 > > [ 231.759743][ C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000 > > [ 231.759748][ C3] FS: 00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000 > > [ 231.759754][ C3] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 231.759759][ C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0 > > [ 231.759764][ C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e > > [ 231.759769][ C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600 > > [ 231.759774][ C3] Call Trace: > > [ 231.759778][ C3] <IRQ> > > [ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770) > > [ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685) > > [ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803) > > [ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773) > > [ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 (discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3)) > > [ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286) > > [ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868) > > [ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1056) > > [ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) > > [ 231.759842][ C3] </IRQ> > > [ 231.759845][ C3] <TASK> > > [ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) > > [ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon > > [ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib > > Honestly, I do not see much how this could be related to the serial > console. This is a module for testing the Generic Reed Solomon > encoder / decoder library. > > It seems to do a lot of computation and needs a lot of random numbers. > I wonder if there is not enough entropy and the test is too slow. > > From the config file: > > CONFIG_PREEMPT_VOLUNTARY=y > > I wonder if a cond_resched() in some loop would help. Or using a > pseudorandom generator. I remember the problems related to much > slower random generator, for example, see the commit f900fde28883602b6 > ("crypto: testmgr - fix RNG performance in fuzz tests"). I was curious whether this was the only soft lockup error reported in this code. And I found one more occurence, see https://lore.kernel.org/all/6A256ED30D339626+ZXwTQL1lMXhQ_inW@HX09040029.powercore.com.cn/ Unfortunately, the discussion about the other occurence does not continue. So that it is not clear if they found the root of the problem at that time. Anyway, it is good to know that this test module is "prone" to soft lockups. Best Regards, Petr > > > [ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib > > [ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib > > [ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib > > [ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib > > [ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib > > [ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib > > [ 231.759915][ C3] do_one_initcall (init/main.c:1266) > > [ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) > > [ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257) > > [ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345) > > [ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324) > > [ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182) > > [ 231.759951][ C3] do_init_module (kernel/module/main.c:2910) > > [ 231.759958][ C3] load_module (kernel/module/main.c:3376) > > [ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223) > > [ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363) > > [ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122) > > [ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38) > > [ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565) > > [ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565) > > [ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541) > > [ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153) > > [ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577) > > [ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569) > > [ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.c:1145 fs/file.c:1159) > > [ 231.760028][ C3] ? security_capable (security/security.c:1142) > > [ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.c:3587 kernel/module/main.c:3587) > > [ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) > > [ 231.760046][ C3] ? do_mmap (mm/mmap.c:496) > > [ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838) > > [ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288) > > [ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590) > > [ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586) > > [ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584) > > [ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570) > > [ 231.760089][ C3] ? fget (fs/file.c:1063) > > [ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.h:157 fs/file_table.c:501) > > [ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547) > > [ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) > > [ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) > > [ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) > > [ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340) > > [ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) > > [ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) > > [ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) > > [ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) > > [ 231.760148][ C3] RIP: 0033:0x7f64c7711719 > > [ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48 > > All code > > ======== > > 0: 08 89 e8 5b 5d c3 or %cl,-0x3ca2a418(%rcx) > > 6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) > > d: 00 00 00 > > 10: 90 nop > > 11: 48 89 f8 mov %rdi,%rax > > 14: 48 89 f7 mov %rsi,%rdi > > 17: 48 89 d6 mov %rdx,%rsi > > 1a: 48 89 ca mov %rcx,%rdx > > 1d: 4d 89 c2 mov %r8,%r10 > > 20: 4d 89 c8 mov %r9,%r8 > > 23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9 > > 28: 0f 05 syscall > > 2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction > > 30: 73 01 jae 0x33 > > 32: c3 ret > > 33: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06f1 > > 3a: f7 d8 neg %eax > > 3c: 64 89 01 mov %eax,%fs:(%rcx) > > 3f: 48 rex.W > > > > Code starting with the faulting instruction > > =========================================== > > 0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > > 6: 73 01 jae 0x9 > > 8: c3 ret > > 9: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06c7 > > 10: f7 d8 neg %eax > > 12: 64 89 01 mov %eax,%fs:(%rcx) > > 15: 48 rex.W > > > > > > The kernel config and materials to reproduce are available at: > > https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-01-24 16:10 ` Petr Mladek 2025-01-24 16:39 ` Petr Mladek @ 2025-03-15 3:38 ` Ryo Takakura 2025-03-17 8:45 ` John Ogness 1 sibling, 1 reply; 21+ messages in thread From: Ryo Takakura @ 2025-03-15 3:38 UTC (permalink / raw) To: pmladek, john.ogness Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang Hi! Is the thread still active? I looked into this. On Fri, 24 Jan 2025 17:10:11 +0100, Petr Mladek wrote: >On Wed 2025-01-22 10:28:52, kernel test robot wrote: >> >> >> Hello, >> >> kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![modprobe:#]" on: >> >> commit: b63e6f60eab45b16a1bf734fef9035a4c4187cd5 ("serial: 8250: Switch to nbcon console") >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master >> >> [test failed on linux-next/master 0907e7fb35756464aa34c35d6abb02998418164b] >> >> in testcase: kunit >> version: >> with following parameters: >> >> group: group-01 >> >> >> >> config: x86_64-rhel-9.4-kunit >> compiler: gcc-12 >> test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 16G memory >> >> (please refer to attached dmesg/kmsg for entire log/backtrace) >> >> >> >> If you fix the issue in a separate patch/commit (i.e. not just a new version of >> the same patch/commit), kindly add following tags >> | Reported-by: kernel test robot <oliver.sang@intel.com> >> | Closes: https://lore.kernel.org/oe-lkp/202501221029.fb0d574d-lkp@intel.com >> >> >> [ 231.759560][ C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [modprobe:3860] >> [ 231.759572][ C3] Modules linked in: test_rslib(+) reed_solomon ipmi_devintf ipmi_msghandler intel_rapl_msr intel_rapl_common >btrfs snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp blake2b_generic coretemp xor raid6_pq libcrc32c kvm_intel >snd_hda_codec_realtek snd_hda_codec_generic platform_profile i915 kvm snd_hda_scodec_component snd_hda_intel sd_mod snd_intel_dspcfg >dell_wmi crc32_generic snd_intel_sdw_acpi sg crct10dif_pclmul cec crc32_pclmul dell_smbios snd_hda_codec intel_gtt crc32c_intel >dell_wmi_descriptor ghash_clmulni_intel sparse_keymap snd_hda_core ttm snd_hwdep ahci rapl rfkill drm_display_helper snd_pcm mei_wdt >libahci intel_cstate dcdbas snd_timer mei_me libata intel_uncore drm_kms_helper snd pcspkr drm_buddy mei soundcore video wmi binfmt_misc >drm fuse loop dm_mod ip_tables poly1305_generic chacha_generic [last unloaded: test_fpu] >> [ 231.759681][ C3] CPU: 3 UID: 0 PID: 3860 Comm: modprobe Tainted: G S B N 6.13.0-rc3-00034-gb63e6f60eab4 #1 >> [ 231.759690][ C3] Tainted: [S]=CPU_OUT_OF_SPEC, [B]=BAD_PAGE, [N]=TEST >> [ 231.759694][ C3] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013 >> [ 231.759699][ C3] RIP: 0010:encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon >> [ 231.759708][ C3] Code: 87 68 83 00 00 89 da d3 fa 41 0f b6 4d 00 41 38 cc 7c 08 84 c9 0f 85 64 02 00 00 8b 75 04 21 f3 01 d3 39 de 7e >c0 48 8b 3c 24 <48> 63 db 48 8d 1c 5f 48 89 d9 48 c1 e9 03 42 0f b6 34 39 48 89 d9 >> All code >> ======== >> 0: 87 68 83 xchg %ebp,-0x7d(%rax) >> 3: 00 00 add %al,(%rax) >> 5: 89 da mov %ebx,%edx >> 7: d3 fa sar %cl,%edx >> 9: 41 0f b6 4d 00 movzbl 0x0(%r13),%ecx >> e: 41 38 cc cmp %cl,%r12b >> 11: 7c 08 jl 0x1b >> 13: 84 c9 test %cl,%cl >> 15: 0f 85 64 02 00 00 jne 0x27f >> 1b: 8b 75 04 mov 0x4(%rbp),%esi >> 1e: 21 f3 and %esi,%ebx >> 20: 01 d3 add %edx,%ebx >> 22: 39 de cmp %ebx,%esi >> 24: 7e c0 jle 0xffffffffffffffe6 >> 26: 48 8b 3c 24 mov (%rsp),%rdi >> 2a:* 48 63 db movslq %ebx,%rbx <-- trapping instruction >> 2d: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx >> 31: 48 89 d9 mov %rbx,%rcx >> 34: 48 c1 e9 03 shr $0x3,%rcx >> 38: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi >> 3d: 48 89 d9 mov %rbx,%rcx >> >> Code starting with the faulting instruction >> =========================================== >> 0: 48 63 db movslq %ebx,%rbx >> 3: 48 8d 1c 5f lea (%rdi,%rbx,2),%rbx >> 7: 48 89 d9 mov %rbx,%rcx >> a: 48 c1 e9 03 shr $0x3,%rcx >> e: 42 0f b6 34 39 movzbl (%rcx,%r15,1),%esi >> 13: 48 89 d9 mov %rbx,%rcx >> [ 231.759717][ C3] RSP: 0018:ffffc90000abf3b0 EFLAGS: 00000297 >> [ 231.759723][ C3] RAX: ffff888102c9ff0a RBX: 00000000000000dd RCX: 0000000000000000 >> [ 231.759728][ C3] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff88816b6b7c00 >> [ 231.759733][ C3] RBP: ffff88812901fb00 R08: 00000000000000c8 R09: ffff88816c8b518e >> [ 231.759738][ C3] R10: 1ffff11025203f60 R11: ffff88816c8b5184 R12: 0000000000000007 >> [ 231.759743][ C3] R13: ffffed1025203f60 R14: ffffed1025203f60 R15: dffffc0000000000 >> [ 231.759748][ C3] FS: 00007f64c760f040(0000) GS:ffff8883a7d80000(0000) knlGS:0000000000000000 >> [ 231.759754][ C3] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 231.759759][ C3] CR2: 00007f024b693000 CR3: 00000001d4462004 CR4: 00000000001726f0 >> [ 231.759764][ C3] DR0: ffffffff8789050c DR1: ffffffff8789050d DR2: ffffffff8789050e >> [ 231.759769][ C3] DR3: ffffffff8789050f DR6: 00000000fffe0ff0 DR7: 0000000000000600 >> [ 231.759774][ C3] Call Trace: >> [ 231.759778][ C3] <IRQ> >> [ 231.759782][ C3] ? watchdog_timer_fn (kernel/watchdog.c:770) >> [ 231.759790][ C3] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:685) >> [ 231.759796][ C3] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1739 kernel/time/hrtimer.c:1803) >> [ 231.759803][ C3] ? __pfx___hrtimer_run_queues (kernel/time/hrtimer.c:1773) >> [ 231.759808][ C3] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:312 (discriminator 3) kernel/time/timekeeping.c:335 >(discriminator 3) kernel/time/timekeeping.c:2457 (discriminator 3)) >> [ 231.759814][ C3] ? sched_clock (arch/x86/include/asm/preempt.h:94 arch/x86/kernel/tsc.c:286) >> [ 231.759821][ C3] ? hrtimer_interrupt (kernel/time/hrtimer.c:1868) >> [ 231.759828][ C3] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:36 arch/x86/include/asm/trace/irq_vectors.h:41 >arch/x86/kernel/apic/apic.c:1056) >> [ 231.759835][ C3] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1049 arch/x86/kernel/apic/apic.c:1049) >> [ 231.759842][ C3] </IRQ> >> [ 231.759845][ C3] <TASK> >> [ 231.759848][ C3] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702) >> [ 231.759857][ C3] ? encode_rs16 (lib/reed_solomon/encode_rs.c:33) reed_solomon >> [ 231.759864][ C3] get_rcw_we (lib/reed_solomon/test_rslib.c:173) test_rslib > >Honestly, I do not see much how this could be related to the serial >console. This is a module for testing the Generic Reed Solomon >encoder / decoder library. > >It seems to do a lot of computation and needs a lot of random numbers. >I wonder if there is not enough entropy and the test is too slow. > >From the config file: > >CONFIG_PREEMPT_VOLUNTARY=y I tested with and without the bleow two commits on v6.14-rc6 where CONFIG_PREEMPT_VOLUNTARY=y and CONFIG_REED_SOLOMON_TEST=y. I used Raspberry pi 4 (4G Mem). b63e6f60eab4 ("serial: 8250: Switch to nbcon console") 422c9727b07f ("serial: 8250: Revert "drop lockdep annotation from serial8250_clear_IER()"") I got the same softlockup during the test regardless of the presence of the commits. [ 60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1] [ 60.222023] Modules linked in: [ 60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G L 6.14.0-rc6-v14-rc6-voluntary+ #4 [ 60.222047] Tainted: [L]=SOFTLOCKUP [ 60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT) [ 60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 60.222066] pc : get_random_u32+0xac/0x118 [ 60.222081] lr : __get_random_u32_below+0x20/0x78 [ 60.222094] sp : ffffffc08002bb80 [ 60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001 [ 60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c [ 60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000 [ 60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002 [ 60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796 [ 60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0 [ 60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10 [ 60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005 [ 60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010 [ 60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013 [ 60.222237] Call trace: [ 60.222241] get_random_u32+0xac/0x118 (P) [ 60.222256] __get_random_u32_below+0x20/0x78 [ 60.222268] get_rcw_we+0x180/0x208 [ 60.222278] test_rslib_init+0x2c8/0xba0 [ 60.222292] do_one_initcall+0x4c/0x210 [ 60.222303] kernel_init_freeable+0x1fc/0x3a0 [ 60.222317] kernel_init+0x28/0x1f8 [ 60.222327] ret_from_fork+0x10/0x20 > >I wonder if a cond_resched() in some loop would help. Or using a I wasn't sure which loop would be the appropriate one but adding cond_resched() as below worked as suggested. ----- BEGIN ----- diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c index 75cb1adac..322d7b0a8 100644 --- a/lib/reed_solomon/test_rslib.c +++ b/lib/reed_solomon/test_rslib.c @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs, if (memcmp(r, c, len * sizeof(*r))) stat->dwrong++; + + cond_resched(); } stat->nwords += trials; } @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs, } else { stat->rfail++; } + + cond_resched(); } stat->nwords += trials; } ----- END ----- >pseudorandom generator. I remember the problems related to much >slower random generator, for example, see the commit f900fde28883602b6 >("crypto: testmgr - fix RNG performance in fuzz tests"). I haven't tested this but I'll look into it! >That said, I did not dig deep into the code. And I did not try to >reproduce the softlockup. I am pretty busy at the moment with some >other stuff. I just wanted to give it a look and share my opinion. I think the softlockup is rather a problem of test itself, not the two commits. I hope this helps! Sincerely, Ryo Takakura >Best Regards, >Petr > > >> [ 231.759873][ C3] test_bc (lib/reed_solomon/test_rslib.c:379) test_rslib >> [ 231.759880][ C3] ? run_exercise (lib/reed_solomon/test_rslib.c:127 lib/reed_solomon/test_rslib.c:457) test_rslib >> [ 231.759886][ C3] run_exercise (lib/reed_solomon/test_rslib.c:423 lib/reed_solomon/test_rslib.c:477) test_rslib >> [ 231.759895][ C3] ? __pfx_run_exercise (lib/reed_solomon/test_rslib.c:443) test_rslib >> [ 231.759902][ C3] test_rslib_init (lib/reed_solomon/test_rslib.c:155) test_rslib >> [ 231.759909][ C3] ? __pfx_test_rslib_init (lib/reed_solomon/test_rslib.c:153) test_rslib >> [ 231.759915][ C3] do_one_initcall (init/main.c:1266) >> [ 231.759922][ C3] ? kasan_save_track (arch/x86/include/asm/current.h:49 mm/kasan/common.c:60 mm/kasan/common.c:69) >> [ 231.759928][ C3] ? __pfx_do_one_initcall (init/main.c:1257) >> [ 231.759934][ C3] ? __kasan_slab_alloc (mm/kasan/common.c:318 mm/kasan/common.c:345) >> [ 231.759939][ C3] ? __kmalloc_cache_noprof (mm/slub.c:4119 mm/slub.c:4168 mm/slub.c:4324) >> [ 231.759945][ C3] ? kasan_unpoison (mm/kasan/shadow.c:156 mm/kasan/shadow.c:182) >> [ 231.759951][ C3] do_init_module (kernel/module/main.c:2910) >> [ 231.759958][ C3] load_module (kernel/module/main.c:3376) >> [ 231.759964][ C3] ? __pfx_load_module (kernel/module/main.c:3223) >> [ 231.759969][ C3] ? security_kernel_post_read_file (security/security.c:3363) >> [ 231.759975][ C3] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:992 include/linux/>atomic/atomic-instrumented.h:436 include/linux/fs.h:3060 fs/kernel_read_file.c:122) >> [ 231.759982][ C3] ? __pfx_kernel_read_file (fs/kernel_read_file.c:38) >> [ 231.759988][ C3] ? init_module_from_file (kernel/module/main.c:3565) >> [ 231.759993][ C3] init_module_from_file (kernel/module/main.c:3565) >> [ 231.759998][ C3] ? __pfx_init_module_from_file (kernel/module/main.c:3541) >> [ 231.760004][ C3] ? __pfx__raw_spin_lock (kernel/locking/spinlock.c:153) >> [ 231.760011][ C3] idempotent_init_module (kernel/module/main.c:3577) >> [ 231.760016][ C3] ? __pfx_idempotent_init_module (kernel/module/main.c:3569) >> [ 231.760022][ C3] ? fdget (include/linux/atomic/atomic-arch-fallback.h:479 include/linux/atomic/atomic-instrumented.h:50 fs/file.>c:1145 fs/file.c:1159) >> [ 231.760028][ C3] ? security_capable (security/security.c:1142) >> [ 231.760035][ C3] __x64_sys_finit_module (include/linux/file.h:62 include/linux/file.h:83 kernel/module/main.c:3600 kernel/module/main.>c:3587 kernel/module/main.c:3587) >> [ 231.760040][ C3] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) >> [ 231.760046][ C3] ? do_mmap (mm/mmap.c:496) >> [ 231.760053][ C3] ? __pfx_userfaultfd_unmap_complete (fs/userfaultfd.c:838) >> [ 231.760060][ C3] ? __pfx_do_mmap (mm/mmap.c:288) >> [ 231.760065][ C3] ? down_write_killable (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2629 >include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3224 kernel/locking/rwsem.c:143 kernel/locking/rwsem.>c:268 kernel/locking/rwsem.c:1303 kernel/locking/rwsem.c:1318 kernel/locking/rwsem.c:1590) >> [ 231.760071][ C3] ? __pfx_down_write_killable (kernel/locking/rwsem.c:1586) >> [ 231.760077][ C3] ? vm_mmap_pgoff (mm/util.c:584) >> [ 231.760084][ C3] ? __pfx_vm_mmap_pgoff (mm/util.c:570) >> [ 231.760089][ C3] ? fget (fs/file.c:1063) >> [ 231.760095][ C3] ? fput (arch/x86/include/asm/atomic64_64.h:79 include/linux/atomic/atomic-arch-fallback.h:2913 include/linux/atomic/>atomic-arch-fallback.h:3364 include/linux/atomic/atomic-long.h:698 include/linux/atomic/atomic-instrumented.h:3767 include/linux/file_ref.>h:157 fs/file_table.c:501) >> [ 231.760100][ C3] ? ksys_mmap_pgoff (mm/mmap.c:547) >> [ 231.760105][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/>entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) >> [ 231.760111][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/>entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) >> [ 231.760116][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) >> [ 231.760121][ C3] ? do_user_addr_fault (include/linux/rcupdate.h:882 include/linux/mm.h:741 arch/x86/mm/fault.c:1340) >> [ 231.760127][ C3] ? syscall_exit_to_user_mode (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 include/linux/>entry-common.h:232 kernel/entry/common.c:206 kernel/entry/common.c:218) >> [ 231.760133][ C3] ? do_syscall_64 (arch/x86/entry/common.c:102) >> [ 231.760137][ C3] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:92 arch/x86/mm/fault.c:1489 >arch/x86/mm/fault.c:1539) >> [ 231.760142][ C3] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) >> [ 231.760148][ C3] RIP: 0033:0x7f64c7711719 >> [ 231.760154][ C3] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b >4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48 >> All code >> ======== >> 0: 08 89 e8 5b 5d c3 or %cl,-0x3ca2a418(%rcx) >> 6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) >> d: 00 00 00 >> 10: 90 nop >> 11: 48 89 f8 mov %rdi,%rax >> 14: 48 89 f7 mov %rsi,%rdi >> 17: 48 89 d6 mov %rdx,%rsi >> 1a: 48 89 ca mov %rcx,%rdx >> 1d: 4d 89 c2 mov %r8,%r10 >> 20: 4d 89 c8 mov %r9,%r8 >> 23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9 >> 28: 0f 05 syscall >> 2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction >> 30: 73 01 jae 0x33 >> 32: c3 ret >> 33: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06f1 >> 3a: f7 d8 neg %eax >> 3c: 64 89 01 mov %eax,%fs:(%rcx) >> 3f: 48 rex.W >> >> Code starting with the faulting instruction >> =========================================== >> 0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax >> 6: 73 01 jae 0x9 >> 8: c3 ret >> 9: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06c7 >> 10: f7 d8 neg %eax >> 12: 64 89 01 mov %eax,%fs:(%rcx) >> 15: 48 rex.W >> >> >> The kernel config and materials to reproduce are available at: >> https://download.01.org/0day-ci/archive/20250122/202501221029.fb0d574d-lkp@intel.com >> >> >> >> -- >> 0-DAY CI Kernel Test Service >> https://github.com/intel/lkp-tests/wiki ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-03-15 3:38 ` Ryo Takakura @ 2025-03-17 8:45 ` John Ogness 2025-03-17 14:42 ` Ryo Takakura 0 siblings, 1 reply; 21+ messages in thread From: John Ogness @ 2025-03-17 8:45 UTC (permalink / raw) To: Ryo Takakura, pmladek Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote: > Is the thread still active? Yes. It is one of the open issues that needs to be resolved before I repost the 8250 atomic console series. > I got the same softlockup during the test regardless of the presence > of the commits. > > [ 60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1] > [ 60.222023] Modules linked in: > [ 60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G L 6.14.0-rc6-v14-rc6-voluntary+ #4 > [ 60.222047] Tainted: [L]=SOFTLOCKUP > [ 60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT) > [ 60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 60.222066] pc : get_random_u32+0xac/0x118 > [ 60.222081] lr : __get_random_u32_below+0x20/0x78 > [ 60.222094] sp : ffffffc08002bb80 > [ 60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001 > [ 60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c > [ 60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000 > [ 60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002 > [ 60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796 > [ 60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0 > [ 60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10 > [ 60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005 > [ 60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010 > [ 60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013 > [ 60.222237] Call trace: > [ 60.222241] get_random_u32+0xac/0x118 (P) > [ 60.222256] __get_random_u32_below+0x20/0x78 > [ 60.222268] get_rcw_we+0x180/0x208 > [ 60.222278] test_rslib_init+0x2c8/0xba0 > [ 60.222292] do_one_initcall+0x4c/0x210 > [ 60.222303] kernel_init_freeable+0x1fc/0x3a0 > [ 60.222317] kernel_init+0x28/0x1f8 > [ 60.222327] ret_from_fork+0x10/0x20 > >> I wonder if a cond_resched() in some loop would help. Or using a > > I wasn't sure which loop would be the appropriate one but adding > cond_resched() as below worked as suggested. > > ----- BEGIN ----- > diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c > index 75cb1adac..322d7b0a8 100644 > --- a/lib/reed_solomon/test_rslib.c > +++ b/lib/reed_solomon/test_rslib.c > @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs, > > if (memcmp(r, c, len * sizeof(*r))) > stat->dwrong++; > + > + cond_resched(); > } > stat->nwords += trials; > } > @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs, > } else { > stat->rfail++; > } > + > + cond_resched(); > } > stat->nwords += trials; > } > ----- END ----- > >> pseudorandom generator. I remember the problems related to much >> slower random generator, for example, see the commit >> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz >> tests"). > > I haven't tested this but I'll look into it! > >> That said, I did not dig deep into the code. And I did not try to >> reproduce the softlockup. I am pretty busy at the moment with some >> other stuff. I just wanted to give it a look and share my opinion. > > I think the softlockup is rather a problem of test itself, > not the two commits. Thanks Ryo for looking into this! I think we need to have a technical explanation/understanding of the problem so that it is clear how my series triggers or exaggerates the issue. John ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-03-17 8:45 ` John Ogness @ 2025-03-17 14:42 ` Ryo Takakura 2025-04-21 3:41 ` Ryo Takakura 0 siblings, 1 reply; 21+ messages in thread From: Ryo Takakura @ 2025-03-17 14:42 UTC (permalink / raw) To: john.ogness Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, pmladek, ryotkkr98 Hi John! On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote: >On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote: >> I got the same softlockup during the test regardless of the presence >> of the commits. >> >> [ 60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1] >> [ 60.222023] Modules linked in: >> [ 60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G L 6.14.0-rc6-v14-rc6-voluntary+ #4 >> [ 60.222047] Tainted: [L]=SOFTLOCKUP >> [ 60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT) >> [ 60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >> [ 60.222066] pc : get_random_u32+0xac/0x118 >> [ 60.222081] lr : __get_random_u32_below+0x20/0x78 >> [ 60.222094] sp : ffffffc08002bb80 >> [ 60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001 >> [ 60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c >> [ 60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000 >> [ 60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002 >> [ 60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796 >> [ 60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0 >> [ 60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10 >> [ 60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005 >> [ 60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010 >> [ 60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013 >> [ 60.222237] Call trace: >> [ 60.222241] get_random_u32+0xac/0x118 (P) >> [ 60.222256] __get_random_u32_below+0x20/0x78 >> [ 60.222268] get_rcw_we+0x180/0x208 >> [ 60.222278] test_rslib_init+0x2c8/0xba0 >> [ 60.222292] do_one_initcall+0x4c/0x210 >> [ 60.222303] kernel_init_freeable+0x1fc/0x3a0 >> [ 60.222317] kernel_init+0x28/0x1f8 >> [ 60.222327] ret_from_fork+0x10/0x20 >> >>> I wonder if a cond_resched() in some loop would help. Or using a >> >> I wasn't sure which loop would be the appropriate one but adding >> cond_resched() as below worked as suggested. >> >> ----- BEGIN ----- >> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >> index 75cb1adac..322d7b0a8 100644 >> --- a/lib/reed_solomon/test_rslib.c >> +++ b/lib/reed_solomon/test_rslib.c >> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs, >> >> if (memcmp(r, c, len * sizeof(*r))) >> stat->dwrong++; >> + >> + cond_resched(); >> } >> stat->nwords += trials; >> } >> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs, >> } else { >> stat->rfail++; >> } >> + >> + cond_resched(); >> } >> stat->nwords += trials; >> } >> ----- END ----- >> >>> pseudorandom generator. I remember the problems related to much >>> slower random generator, for example, see the commit >>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz >>> tests"). >> >> I haven't tested this but I'll look into it! >> >>> That said, I did not dig deep into the code. And I did not try to >>> reproduce the softlockup. I am pretty busy at the moment with some >>> other stuff. I just wanted to give it a look and share my opinion. >> >> I think the softlockup is rather a problem of test itself, >> not the two commits. > >Thanks Ryo for looking into this! I think we need to have a technical >explanation/understanding of the problem so that it is clear how my >series triggers or exaggerates the issue. I see. I can't yet say anything but still not sure if your series has anything to do with the softlockup... I found that there was similar report in the past [0]. Anyways, I will look into the rslib test itself more carefully to see if I can give the needed technical explanation for the problem! Sincerely, Ryo Takakura [0] https://lore.kernel.org/linux-kernel//a309f09e07374e62a77bb84c70e6715efd288448.1583122776.git.planteen@gmail.com/ >John ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-03-17 14:42 ` Ryo Takakura @ 2025-04-21 3:41 ` Ryo Takakura 2025-04-22 12:15 ` Petr Mladek 0 siblings, 1 reply; 21+ messages in thread From: Ryo Takakura @ 2025-04-21 3:41 UTC (permalink / raw) To: john.ogness, pmladek Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang Hi! I would like to follow up the last email that I sent. First, I'm sorry that I later realized that I should have tested the rslib test as an inserted module, as how the robot does, by choosing CONFIG_REED_SOLOMON_TEST=m. Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y. Running the rslib test as an inserted module without the John's series was less prone to softlockup. Without the John's series, softlockup shows up once in a test or not at all. With the John's series, softlockup can be observed constanly over the test. On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote: >On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote: >>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote: >>> I got the same softlockup during the test regardless of the presence >>> of the commits. >>> >>> [ 60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1] >>> [ 60.222023] Modules linked in: >>> [ 60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G L 6.14.0-rc6-v14-rc6-voluntary+ #4 >>> [ 60.222047] Tainted: [L]=SOFTLOCKUP >>> [ 60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT) >>> [ 60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>> [ 60.222066] pc : get_random_u32+0xac/0x118 >>> [ 60.222081] lr : __get_random_u32_below+0x20/0x78 >>> [ 60.222094] sp : ffffffc08002bb80 >>> [ 60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001 >>> [ 60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c >>> [ 60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000 >>> [ 60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002 >>> [ 60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796 >>> [ 60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0 >>> [ 60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10 >>> [ 60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005 >>> [ 60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010 >>> [ 60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013 >>> [ 60.222237] Call trace: >>> [ 60.222241] get_random_u32+0xac/0x118 (P) >>> [ 60.222256] __get_random_u32_below+0x20/0x78 >>> [ 60.222268] get_rcw_we+0x180/0x208 >>> [ 60.222278] test_rslib_init+0x2c8/0xba0 >>> [ 60.222292] do_one_initcall+0x4c/0x210 >>> [ 60.222303] kernel_init_freeable+0x1fc/0x3a0 >>> [ 60.222317] kernel_init+0x28/0x1f8 >>> [ 60.222327] ret_from_fork+0x10/0x20 >>> >>>> I wonder if a cond_resched() in some loop would help. Or using a >>> >>> I wasn't sure which loop would be the appropriate one but adding >>> cond_resched() as below worked as suggested. >>> >>> ----- BEGIN ----- >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >>> index 75cb1adac..322d7b0a8 100644 >>> --- a/lib/reed_solomon/test_rslib.c >>> +++ b/lib/reed_solomon/test_rslib.c >>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs, >>> >>> if (memcmp(r, c, len * sizeof(*r))) >>> stat->dwrong++; >>> + >>> + cond_resched(); >>> } >>> stat->nwords += trials; >>> } >>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs, >>> } else { >>> stat->rfail++; >>> } >>> + >>> + cond_resched(); >>> } >>> stat->nwords += trials; >>> } >>> ----- END ----- >>> >>>> pseudorandom generator. I remember the problems related to much >>>> slower random generator, for example, see the commit >>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz >>>> tests"). >>> >>> I haven't tested this but I'll look into it! For the same reason pointed out by the commit f900fde28883 ("crypto: testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f ("random32: use real rng for non-deterministic randomness") does seem to have slowed down the rslib test. On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with the commit d4150779e60f ("random32: use real rng for non-deterministic randomness") takes about 450sec compared to v5.18(without the commit) which only takes about 350sec. However, random number generation does not seem to be the only reason for the rslib test being slow. Below is the function stats for v5.18 vs v5.19 for the test. v5.18 + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall + 59.90% 59.90% insmod [kernel.kallsyms] [k] decode_rs16 + 30.28% 0.00% insmod [unknown] [k] 0xffffffd60413a748 + 21.39% 21.39% insmod [kernel.kallsyms] [k] encode_rs16 + 20.94% 0.00% insmod [unknown] [k] 0xffffffd60441b0d4 + 14.17% 0.01% insmod [unknown] [k] 0xffffffd60413a324 + 12.92% 0.00% insmod [unknown] [k] 0xffffffd60413a720 + 11.74% 0.00% insmod [unknown] [k] 0xffffffd60413a368 + 11.63% 0.01% insmod [unknown] [k] 0xffffffd60413a4b4 + 6.60% 0.01% insmod [unknown] [k] 0xffffffd60413a46c + 3.26% 3.26% insmod [unknown] [k] 0xffffffd60413a3c4 + 3.11% 3.11% insmod [kernel.kallsyms] [k] prandom_u32 v5.19 + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall + 45.13% 45.13% insmod [kernel.kallsyms] [k] decode_rs16 + 27.84% 27.83% insmod [kernel.kallsyms] [k] get_random_u32 + 22.83% 0.01% insmod [unknown] [k] 0xffffffd62b379324 + 22.76% 0.00% insmod [unknown] [k] 0xffffffd62b379748 + 22.35% 0.00% insmod [unknown] [k] 0xffffffd62b379720 + 16.17% 16.17% insmod [kernel.kallsyms] [k] encode_rs16 As the v5.19 stats shows, the commit d4150779e60f ("random32: use real rng for non-deterministic randomness") did slowed down the rslib test by using real random number generator(prandom_u32->get_random_u32). But it also shows that the decode/encoding takes quite a lot of time. For the above reason, if we want make the softlockup go away, maybe just converting from real to pseudo random generator won't be enough? I tested with below changes applied to v6.14 which uses pseudo random numbers instead of real ones similar to what the commit f900fde28883 ("crypto: testmgr - fix RNG performance in fuzz tests") did. The result is that while it did help to fast up the test from 450 sec to 350 sec, it still triggers the softlockup. --- BEGIN --- diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c index 75cb1adac..5785e3c0c 100644 --- a/lib/reed_solomon/test_rslib.c +++ b/lib/reed_solomon/test_rslib.c @@ -9,6 +9,7 @@ #include <linux/kernel.h> #include <linux/module.h> #include <linux/moduleparam.h> +#include <linux/prandom.h> #include <linux/random.h> #include <linux/slab.h> @@ -95,6 +96,16 @@ static struct pad pad_coef[] = { { 1, 0 }, }; +static inline void init_rnd_state(struct rnd_state *rng) +{ + prandom_seed_state(rng, get_random_u64()); +} + +static inline u32 prandom_u32_below(struct rnd_state *rng, u32 ceil) +{ + return prandom_u32_state(rng) % ceil; +} + static void free_ws(struct wspace *ws) { if (!ws) @@ -151,6 +162,7 @@ static struct wspace *alloc_ws(struct rs_codec *rs) static int get_rcw_we(struct rs_control *rs, struct wspace *ws, int len, int errs, int eras) { + struct rnd_state rng; int nroots = rs->codec->nroots; int *derrlocs = ws->derrlocs; int *errlocs = ws->errlocs; @@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws, int errloc; int i; + init_rnd_state(&rng); + /* Load c with random data and encode */ for (i = 0; i < dlen; i++) - c[i] = get_random_u32() & nn; + c[i] = prandom_u32_state(&rng) & nn; memset(c + dlen, 0, nroots * sizeof(*c)); encode_rs16(rs, c, dlen, c + dlen, 0); @@ -178,12 +192,12 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws, for (i = 0; i < errs; i++) { do { /* Error value must be nonzero */ - errval = get_random_u32() & nn; + errval = prandom_u32_state(&rng) & nn; } while (errval == 0); do { /* Must not choose the same location twice */ - errloc = get_random_u32_below(len); + errloc = prandom_u32_below(&rng, len); } while (errlocs[errloc] != 0); errlocs[errloc] = 1; @@ -194,19 +208,19 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws, for (i = 0; i < eras; i++) { do { /* Must not choose the same location twice */ - errloc = get_random_u32_below(len); + errloc = prandom_u32_below(&rng, len); } while (errlocs[errloc] != 0); derrlocs[i] = errloc; - if (ewsc && get_random_u32_below(2)) { + if (ewsc && prandom_u32_below(&rng, 2)) { /* Erasure with the symbol intact */ errlocs[errloc] = 2; } else { /* Erasure with corrupted symbol */ do { /* Error value must be nonzero */ - errval = get_random_u32() & nn; + errval = prandom_u32_state(&rng) & nn; } while (errval == 0); errlocs[errloc] = 1; --- END --- Below is the function stats for with vs without the above change. v6.14 without the above change + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall + 46.94% 46.94% insmod [kernel.kallsyms] [k] decode_rs16 + 24.92% 24.92% insmod [kernel.kallsyms] [k] get_random_u32 + 23.59% 0.01% insmod [unknown] [k] 0xffffffde423d88cc + 22.66% 0.00% insmod [unknown] [k] 0xffffffde423d82c0 + 22.39% 0.00% insmod [unknown] [k] 0xffffffde423d88a4 + 16.67% 16.67% insmod [kernel.kallsyms] [k] encode_rs16 v6.14 with the above change + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall + 61.51% 61.50% insmod [kernel.kallsyms] [k] decode_rs16 + 31.15% 0.00% insmod [unknown] [k] 0xffffffe7013858cc + 21.75% 21.75% insmod [kernel.kallsyms] [k] encode_rs16 + 21.36% 0.01% insmod [unknown] [k] 0xffffffe70137f120 + 15.20% 0.00% insmod [unknown] [k] 0xffffffe7013852c0 + 12.79% 0.00% insmod [unknown] [k] 0xffffffe7013858a4 + 12.00% 0.00% insmod [unknown] [k] 0xffffffe7013854c0 + 12.00% 0.00% insmod [unknown] [k] 0xffffffe701385304 + 6.74% 0.01% insmod [unknown] [k] 0xffffffe701385600 + 3.43% 3.43% insmod [unknown] [k] 0xffffffe701385598 + 2.50% 2.50% insmod [unknown] [k] 0xffffffe701385564 + 1.96% 1.96% insmod [kernel.kallsyms] [k] prandom_u32_state + 1.85% 1.85% insmod [kernel.kallsyms] [k] get_random_u64 So maybe we still need cond_resched() in addition to converting the test from using psuedo random numbers to real ones in order to make the softlockup go away? >>>> That said, I did not dig deep into the code. And I did not try to >>>> reproduce the softlockup. I am pretty busy at the moment with some >>>> other stuff. I just wanted to give it a look and share my opinion. >>> >>> I think the softlockup is rather a problem of test itself, >>> not the two commits. >> >>Thanks Ryo for looking into this! I think we need to have a technical >>explanation/understanding of the problem so that it is clear how my >>series triggers or exaggerates the issue. As mentioned earlier, I'm sorry that I should have run the test as inserted module... It seems the series does make the test more prone to softlockups. I am still looking into the issue but I noticed that if the printing during the test were done in a caller context, not by printk thread, the softlock goes away. I tested with a change something like below so that the printing during the test were not delegated to printk thread ("pr/ttyS0"). With the below change, the softlock no longer shows up. --- BEGIN --- diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c index fd12efcc4..07c879a64 100644 --- a/kernel/printk/nbcon.c +++ b/kernel/printk/nbcon.c @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void) cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting(); (*cpu_emergency_nesting)++; } +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter); /** * nbcon_cpu_emergency_exit - Exit an emergency section @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void) preempt_enable(); } +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit); /** * nbcon_alloc - Allocate and init the nbcon console specific data diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c index 75cb1adac..8208f8317 100644 --- a/lib/reed_solomon/test_rslib.c +++ b/lib/reed_solomon/test_rslib.c @@ -11,6 +11,7 @@ #include <linux/moduleparam.h> #include <linux/random.h> #include <linux/slab.h> +#include <linux/console.h> enum verbosity { V_SILENT, @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws, int nroots = rs->codec->nroots; int errs, eras, retval; - if (v >= V_PROGRESS) + if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info(" %s\n", desc[method]); + nbcon_cpu_emergency_exit(); + } for (errs = 0; errs <= nroots / 2; errs++) for (eras = 0; eras <= nroots - 2 * errs; eras++) @@ -353,8 +357,11 @@ static int exercise_rs(struct rs_control *rs, struct wspace *ws, int retval = 0; int i; - if (v >= V_PROGRESS) + if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info("Testing up to error correction capacity...\n"); + nbcon_cpu_emergency_exit(); + } for (i = 0; i <= IN_PLACE; i++) retval |= ex_rs_helper(rs, ws, len, trials, i); @@ -411,8 +418,11 @@ static int exercise_rs_bc(struct rs_control *rs, struct wspace *ws, int nroots = rs->codec->nroots; int errs, eras, cutoff; - if (v >= V_PROGRESS) + if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info("Testing beyond error correction capacity...\n"); + nbcon_cpu_emergency_exit(); + } for (errs = 1; errs <= nroots; errs++) { eras = nroots - 2 * errs + 1; @@ -468,8 +478,10 @@ static int run_exercise(struct etab *e) prev_pad = pad; if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info("Testing (%d,%d)_%d code...\n", len, kk - pad, nn + 1); + nbcon_cpu_emergency_exit(); } retval |= exercise_rs(rsc, ws, len, e->ntrials); --- END --- I tried to figure out how softlockup disappears over the weekend but couldn't... Seems the John's series somehow prevents the softlockup thread from touching its timestamp. I would like to continue looking into the issue and appreaciate any comment! Hope this provides some insight for now. Sincerely, Ryo Takakura >I see. I can't yet say anything but still not sure if your series has >anything to do with the softlockup... I found that there was similar >report in the past [0]. > >Anyways, I will look into the rslib test itself more carefully to see >if I can give the needed technical explanation for the problem! >[0] https://lore.kernel.org/linux-kernel//a309f09e07374e62a77bb84c70e6715efd288448.1583122776.git.planteen@gmail.com/ ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-21 3:41 ` Ryo Takakura @ 2025-04-22 12:15 ` Petr Mladek 2025-04-22 14:03 ` John Ogness 2025-04-24 8:11 ` Ryo Takakura 0 siblings, 2 replies; 21+ messages in thread From: Petr Mladek @ 2025-04-22 12:15 UTC (permalink / raw) To: Ryo Takakura Cc: john.ogness, Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang On Mon 2025-04-21 12:41:50, Ryo Takakura wrote: > Hi! > > I would like to follow up the last email that I sent. > > First, I'm sorry that I later realized that I should have tested > the rslib test as an inserted module, as how the robot does, by > choosing CONFIG_REED_SOLOMON_TEST=m. > Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y. > > Running the rslib test as an inserted module without the John's series > was less prone to softlockup. Without the John's series, softlockup shows > up once in a test or not at all. With the John's series, softlockup can > be observed constanly over the test. > On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote: > >On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote: > >>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote: > >>> I got the same softlockup during the test regardless of the presence > >>> of the commits. > >>> > >>> [ 60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1] > >>> [ 60.222023] Modules linked in: > >>> [ 60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G L 6.14.0-rc6-v14-rc6-voluntary+ #4 > >>> [ 60.222047] Tainted: [L]=SOFTLOCKUP > >>> [ 60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT) > >>> [ 60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > >>> [ 60.222066] pc : get_random_u32+0xac/0x118 > >>> [ 60.222081] lr : __get_random_u32_below+0x20/0x78 > >>> [ 60.222094] sp : ffffffc08002bb80 > >>> [ 60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001 > >>> [ 60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c > >>> [ 60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000 > >>> [ 60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002 > >>> [ 60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796 > >>> [ 60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0 > >>> [ 60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10 > >>> [ 60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005 > >>> [ 60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010 > >>> [ 60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013 > >>> [ 60.222237] Call trace: > >>> [ 60.222241] get_random_u32+0xac/0x118 (P) > >>> [ 60.222256] __get_random_u32_below+0x20/0x78 > >>> [ 60.222268] get_rcw_we+0x180/0x208 > >>> [ 60.222278] test_rslib_init+0x2c8/0xba0 > >>> [ 60.222292] do_one_initcall+0x4c/0x210 > >>> [ 60.222303] kernel_init_freeable+0x1fc/0x3a0 > >>> [ 60.222317] kernel_init+0x28/0x1f8 > >>> [ 60.222327] ret_from_fork+0x10/0x20 > >>> > >>>> I wonder if a cond_resched() in some loop would help. Or using a > >>> > >>> I wasn't sure which loop would be the appropriate one but adding > >>> cond_resched() as below worked as suggested. > >>> > >>> ----- BEGIN ----- > >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c > >>> index 75cb1adac..322d7b0a8 100644 > >>> --- a/lib/reed_solomon/test_rslib.c > >>> +++ b/lib/reed_solomon/test_rslib.c > >>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs, > >>> > >>> if (memcmp(r, c, len * sizeof(*r))) > >>> stat->dwrong++; > >>> + > >>> + cond_resched(); > >>> } > >>> stat->nwords += trials; > >>> } > >>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs, > >>> } else { > >>> stat->rfail++; > >>> } > >>> + > >>> + cond_resched(); > >>> } > >>> stat->nwords += trials; > >>> } > >>> ----- END ----- > >>> > >>>> pseudorandom generator. I remember the problems related to much > >>>> slower random generator, for example, see the commit > >>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz > >>>> tests"). > >>> > >>> I haven't tested this but I'll look into it! > > For the same reason pointed out by the commit f900fde28883 ("crypto: > testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f > ("random32: use real rng for non-deterministic randomness") does seem > to have slowed down the rslib test. > > On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with > the commit d4150779e60f ("random32: use real rng for non-deterministic > randomness") takes about 450sec compared to v5.18(without the commit) > which only takes about 350sec. > > However, random number generation does not seem to be the only reason > for the rslib test being slow. > Below is the function stats for v5.18 vs v5.19 for the test. > > v5.18 > + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall > + 59.90% 59.90% insmod [kernel.kallsyms] [k] decode_rs16 > + 30.28% 0.00% insmod [unknown] [k] 0xffffffd60413a748 > + 21.39% 21.39% insmod [kernel.kallsyms] [k] encode_rs16 > + 20.94% 0.00% insmod [unknown] [k] 0xffffffd60441b0d4 > + 14.17% 0.01% insmod [unknown] [k] 0xffffffd60413a324 > + 12.92% 0.00% insmod [unknown] [k] 0xffffffd60413a720 > + 11.74% 0.00% insmod [unknown] [k] 0xffffffd60413a368 > + 11.63% 0.01% insmod [unknown] [k] 0xffffffd60413a4b4 > + 6.60% 0.01% insmod [unknown] [k] 0xffffffd60413a46c > + 3.26% 3.26% insmod [unknown] [k] 0xffffffd60413a3c4 > + 3.11% 3.11% insmod [kernel.kallsyms] [k] prandom_u32 > > v5.19 > + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall > + 45.13% 45.13% insmod [kernel.kallsyms] [k] decode_rs16 > + 27.84% 27.83% insmod [kernel.kallsyms] [k] get_random_u32 > + 22.83% 0.01% insmod [unknown] [k] 0xffffffd62b379324 > + 22.76% 0.00% insmod [unknown] [k] 0xffffffd62b379748 > + 22.35% 0.00% insmod [unknown] [k] 0xffffffd62b379720 > + 16.17% 16.17% insmod [kernel.kallsyms] [k] encode_rs16 > > As the v5.19 stats shows, the commit d4150779e60f ("random32: use real > rng for non-deterministic randomness") did slowed down the rslib test > by using real random number generator(prandom_u32->get_random_u32). > But it also shows that the decode/encoding takes quite a lot of time. > > For the above reason, if we want make the softlockup go away, maybe > just converting from real to pseudo random generator won't be enough? > I tested with below changes applied to v6.14 which uses pseudo random > numbers instead of real ones similar to what the commit f900fde28883 > ("crypto: testmgr - fix RNG performance in fuzz tests") did. > > The result is that while it did help to fast up the test from 450 sec > to 350 sec, it still triggers the softlockup. It is not that big speedup. 350 sec still provides a lot of opportunities for a softlockup. > --- BEGIN --- > diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c > index 75cb1adac..5785e3c0c 100644 > --- a/lib/reed_solomon/test_rslib.c > +++ b/lib/reed_solomon/test_rslib.c [...] > @@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws, > int errloc; > int i; > > + init_rnd_state(&rng); > + > /* Load c with random data and encode */ > for (i = 0; i < dlen; i++) > - c[i] = get_random_u32() & nn; > + c[i] = prandom_u32_state(&rng) & nn; > > memset(c + dlen, 0, nroots * sizeof(*c)); > encode_rs16(rs, c, dlen, c + dlen, 0); [...] > --- END --- > > Below is the function stats for with vs without the above change. > > v6.14 without the above change > + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall > + 46.94% 46.94% insmod [kernel.kallsyms] [k] decode_rs16 > + 24.92% 24.92% insmod [kernel.kallsyms] [k] get_random_u32 > + 23.59% 0.01% insmod [unknown] [k] 0xffffffde423d88cc > + 22.66% 0.00% insmod [unknown] [k] 0xffffffde423d82c0 > + 22.39% 0.00% insmod [unknown] [k] 0xffffffde423d88a4 > + 16.67% 16.67% insmod [kernel.kallsyms] [k] encode_rs16 > > v6.14 with the above change > + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module > + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall > + 61.51% 61.50% insmod [kernel.kallsyms] [k] decode_rs16 > + 31.15% 0.00% insmod [unknown] [k] 0xffffffe7013858cc > + 21.75% 21.75% insmod [kernel.kallsyms] [k] encode_rs16 > + 21.36% 0.01% insmod [unknown] [k] 0xffffffe70137f120 > + 15.20% 0.00% insmod [unknown] [k] 0xffffffe7013852c0 > + 12.79% 0.00% insmod [unknown] [k] 0xffffffe7013858a4 > + 12.00% 0.00% insmod [unknown] [k] 0xffffffe7013854c0 > + 12.00% 0.00% insmod [unknown] [k] 0xffffffe701385304 > + 6.74% 0.01% insmod [unknown] [k] 0xffffffe701385600 > + 3.43% 3.43% insmod [unknown] [k] 0xffffffe701385598 > + 2.50% 2.50% insmod [unknown] [k] 0xffffffe701385564 > + 1.96% 1.96% insmod [kernel.kallsyms] [k] prandom_u32_state > + 1.85% 1.85% insmod [kernel.kallsyms] [k] get_random_u64 > > So maybe we still need cond_resched() in addition to converting the > test from using psuedo random numbers to real ones in order to make > the softlockup go away? IMHO, cond_resched() should be added, definitely. The code clearly prevents scheduling for tens of seconds. > >>>> That said, I did not dig deep into the code. And I did not try to > >>>> reproduce the softlockup. I am pretty busy at the moment with some > >>>> other stuff. I just wanted to give it a look and share my opinion. > >>> > >>> I think the softlockup is rather a problem of test itself, > >>> not the two commits. > >> > >>Thanks Ryo for looking into this! I think we need to have a technical > >>explanation/understanding of the problem so that it is clear how my > >>series triggers or exaggerates the issue. > > As mentioned earlier, I'm sorry that I should have run the test as > inserted module... It seems the series does make the test more prone > to softlockups. IMHO, the main difference is that the patch "serial: 8250: Switch to nbcon console" removes touch_nmi_watchdog() from serial8250_console_write(). The touch_nmi_watchdog() resets the softlockup watchdog. It might hide that the CPU did not schedule for a long time. The touch_nmi_watchdog() was there because the console_lock() owner, used by the legacy loop, was responsible for flushing all pending messages. And there might be many pending messages when new ones were added by other CPUs in parallel. And the legacy loop could not call cond_resched() when called from printk() because printk() might be called in atomic context. The series removed the touch_nmi_watchdog() because the kthread called cond_resched. And the amount of messages flushed in the emergency or panic context was limited. My view: The touch_nmi_watchdog() did hide the softlockup scenario in the rslib test. The printk series allowed to see the problem reliably. By other words, the rslib test should get fixed by adding cond_resched(). And the conversion of the 8250 serial console driver to nbcon is correct. That said, see below. > I am still looking into the issue but I noticed that if the printing > during the test were done in a caller context, not by printk thread, > the softlock goes away. > > I tested with a change something like below so that the printing during > the test were not delegated to printk thread ("pr/ttyS0"). With the below > change, the softlock no longer shows up. > > --- BEGIN --- > diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c > index fd12efcc4..07c879a64 100644 > --- a/kernel/printk/nbcon.c > +++ b/kernel/printk/nbcon.c > @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void) > cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting(); > (*cpu_emergency_nesting)++; > } > +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter); > > /** > * nbcon_cpu_emergency_exit - Exit an emergency section > @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void) > > preempt_enable(); > } > +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit); > > /** > * nbcon_alloc - Allocate and init the nbcon console specific data > diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c > index 75cb1adac..8208f8317 100644 > --- a/lib/reed_solomon/test_rslib.c > +++ b/lib/reed_solomon/test_rslib.c > @@ -11,6 +11,7 @@ > #include <linux/moduleparam.h> > #include <linux/random.h> > #include <linux/slab.h> > +#include <linux/console.h> > > enum verbosity { > V_SILENT, > @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws, > int nroots = rs->codec->nroots; > int errs, eras, retval; > > - if (v >= V_PROGRESS) > + if (v >= V_PROGRESS) { > + nbcon_cpu_emergency_enter(); > pr_info(" %s\n", desc[method]); > + nbcon_cpu_emergency_exit(); > + } I do not understand how this prevented the softlockup. The emergency context causes that the messages is flushed directly from printk(), see nbcon_atomic_flush_pending() in vprintk_emit(). But see neither cond_resched() nor touch_nmi_watchdog() in this code path. By other words, I do _not_ see how the emergency context either triggered scheduling or did reset the watchdog. But I might be looking at a different code base. I looked at Linus' master and linux-next. Best Regards, Petr ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-22 12:15 ` Petr Mladek @ 2025-04-22 14:03 ` John Ogness 2025-04-24 8:11 ` Ryo Takakura 1 sibling, 0 replies; 21+ messages in thread From: John Ogness @ 2025-04-22 14:03 UTC (permalink / raw) To: Petr Mladek, Ryo Takakura Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang On 2025-04-22, Petr Mladek <pmladek@suse.com> wrote: > IMHO, the main difference is that the patch "serial: 8250: Switch to > nbcon console" removes touch_nmi_watchdog() from > serial8250_console_write(). > > The touch_nmi_watchdog() resets the softlockup watchdog. It might > hide that the CPU did not schedule for a long time. > > The touch_nmi_watchdog() was there because the console_lock() owner, > used by the legacy loop, was responsible for flushing all pending > messages. And there might be many pending messages when new ones > were added by other CPUs in parallel. And the legacy loop > could not call cond_resched() when called from printk() because > printk() might be called in atomic context. > > The series removed the touch_nmi_watchdog() because the kthread > called cond_resched. And the amount of messages flushed in > the emergency or panic context was limited. > > My view: > > The touch_nmi_watchdog() did hide the softlockup scenario in the rslib > test. The printk series allowed to see the problem reliably. > > By other words, the rslib test should get fixed by adding > cond_resched(). And the conversion of the 8250 serial console > driver to nbcon is correct. > > That said, see below. > >> I am still looking into the issue but I noticed that if the printing >> during the test were done in a caller context, not by printk thread, >> the softlock goes away. >> >> I tested with a change something like below so that the printing during >> the test were not delegated to printk thread ("pr/ttyS0"). With the below >> change, the softlock no longer shows up. >> >> --- BEGIN --- >> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c >> index fd12efcc4..07c879a64 100644 >> --- a/kernel/printk/nbcon.c >> +++ b/kernel/printk/nbcon.c >> @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void) >> cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting(); >> (*cpu_emergency_nesting)++; >> } >> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter); >> >> /** >> * nbcon_cpu_emergency_exit - Exit an emergency section >> @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void) >> >> preempt_enable(); >> } >> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit); >> >> /** >> * nbcon_alloc - Allocate and init the nbcon console specific data >> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >> index 75cb1adac..8208f8317 100644 >> --- a/lib/reed_solomon/test_rslib.c >> +++ b/lib/reed_solomon/test_rslib.c >> @@ -11,6 +11,7 @@ >> #include <linux/moduleparam.h> >> #include <linux/random.h> >> #include <linux/slab.h> >> +#include <linux/console.h> >> >> enum verbosity { >> V_SILENT, >> @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws, >> int nroots = rs->codec->nroots; >> int errs, eras, retval; >> >> - if (v >= V_PROGRESS) >> + if (v >= V_PROGRESS) { >> + nbcon_cpu_emergency_enter(); >> pr_info(" %s\n", desc[method]); >> + nbcon_cpu_emergency_exit(); >> + } > > I do not understand how this prevented the softlockup. > > The emergency context causes that the messages is flushed directly > from printk(), see nbcon_atomic_flush_pending() in vprintk_emit(). > But see neither cond_resched() nor touch_nmi_watchdog() in this > code path. Note there is still a touch_nmi_watchdog() in wait_for_lsr(). If the console printing thread is on a different CPU, then its use of touch_nmi_watchdog() will not help the CPU running the rslib test. By forcing emergency context, now the CPU running the rslib test will be calling touch_nmi_watchdog(). John Ogness ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-22 12:15 ` Petr Mladek 2025-04-22 14:03 ` John Ogness @ 2025-04-24 8:11 ` Ryo Takakura 2025-04-24 9:00 ` John Ogness ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: Ryo Takakura @ 2025-04-24 8:11 UTC (permalink / raw) To: pmladek, john.ogness Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, ryotkkr98 Hi Petr and John! On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote: >On Mon 2025-04-21 12:41:50, Ryo Takakura wrote: >> Hi! >> >> I would like to follow up the last email that I sent. >> >> First, I'm sorry that I later realized that I should have tested >> the rslib test as an inserted module, as how the robot does, by >> choosing CONFIG_REED_SOLOMON_TEST=m. >> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y. >> >> Running the rslib test as an inserted module without the John's series >> was less prone to softlockup. Without the John's series, softlockup shows >> up once in a test or not at all. With the John's series, softlockup can >> be observed constanly over the test. > >> On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote: >> >On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote: >> >>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote: >> >>> I got the same softlockup during the test regardless of the presence >> >>> of the commits. >> >>> >> >>> [ 60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1] >> >>> [ 60.222023] Modules linked in: >> >>> [ 60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G L 6.14.0-rc6-v14-rc6-voluntary+ #4 >> >>> [ 60.222047] Tainted: [L]=SOFTLOCKUP >> >>> [ 60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT) >> >>> [ 60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >> >>> [ 60.222066] pc : get_random_u32+0xac/0x118 >> >>> [ 60.222081] lr : __get_random_u32_below+0x20/0x78 >> >>> [ 60.222094] sp : ffffffc08002bb80 >> >>> [ 60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001 >> >>> [ 60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c >> >>> [ 60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000 >> >>> [ 60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002 >> >>> [ 60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796 >> >>> [ 60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0 >> >>> [ 60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10 >> >>> [ 60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005 >> >>> [ 60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010 >> >>> [ 60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013 >> >>> [ 60.222237] Call trace: >> >>> [ 60.222241] get_random_u32+0xac/0x118 (P) >> >>> [ 60.222256] __get_random_u32_below+0x20/0x78 >> >>> [ 60.222268] get_rcw_we+0x180/0x208 >> >>> [ 60.222278] test_rslib_init+0x2c8/0xba0 >> >>> [ 60.222292] do_one_initcall+0x4c/0x210 >> >>> [ 60.222303] kernel_init_freeable+0x1fc/0x3a0 >> >>> [ 60.222317] kernel_init+0x28/0x1f8 >> >>> [ 60.222327] ret_from_fork+0x10/0x20 >> >>> >> >>>> I wonder if a cond_resched() in some loop would help. Or using a >> >>> >> >>> I wasn't sure which loop would be the appropriate one but adding >> >>> cond_resched() as below worked as suggested. >> >>> >> >>> ----- BEGIN ----- >> >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >> >>> index 75cb1adac..322d7b0a8 100644 >> >>> --- a/lib/reed_solomon/test_rslib.c >> >>> +++ b/lib/reed_solomon/test_rslib.c >> >>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs, >> >>> >> >>> if (memcmp(r, c, len * sizeof(*r))) >> >>> stat->dwrong++; >> >>> + >> >>> + cond_resched(); >> >>> } >> >>> stat->nwords += trials; >> >>> } >> >>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs, >> >>> } else { >> >>> stat->rfail++; >> >>> } >> >>> + >> >>> + cond_resched(); >> >>> } >> >>> stat->nwords += trials; >> >>> } >> >>> ----- END ----- >> >>> >> >>>> pseudorandom generator. I remember the problems related to much >> >>>> slower random generator, for example, see the commit >> >>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz >> >>>> tests"). >> >>> >> >>> I haven't tested this but I'll look into it! >> >> For the same reason pointed out by the commit f900fde28883 ("crypto: >> testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f >> ("random32: use real rng for non-deterministic randomness") does seem >> to have slowed down the rslib test. >> >> On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with >> the commit d4150779e60f ("random32: use real rng for non-deterministic >> randomness") takes about 450sec compared to v5.18(without the commit) >> which only takes about 350sec. >> >> However, random number generation does not seem to be the only reason >> for the rslib test being slow. >> Below is the function stats for v5.18 vs v5.19 for the test. >> >> v5.18 >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >> + 59.90% 59.90% insmod [kernel.kallsyms] [k] decode_rs16 >> + 30.28% 0.00% insmod [unknown] [k] 0xffffffd60413a748 >> + 21.39% 21.39% insmod [kernel.kallsyms] [k] encode_rs16 >> + 20.94% 0.00% insmod [unknown] [k] 0xffffffd60441b0d4 >> + 14.17% 0.01% insmod [unknown] [k] 0xffffffd60413a324 >> + 12.92% 0.00% insmod [unknown] [k] 0xffffffd60413a720 >> + 11.74% 0.00% insmod [unknown] [k] 0xffffffd60413a368 >> + 11.63% 0.01% insmod [unknown] [k] 0xffffffd60413a4b4 >> + 6.60% 0.01% insmod [unknown] [k] 0xffffffd60413a46c >> + 3.26% 3.26% insmod [unknown] [k] 0xffffffd60413a3c4 >> + 3.11% 3.11% insmod [kernel.kallsyms] [k] prandom_u32 >> >> v5.19 >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >> + 45.13% 45.13% insmod [kernel.kallsyms] [k] decode_rs16 >> + 27.84% 27.83% insmod [kernel.kallsyms] [k] get_random_u32 >> + 22.83% 0.01% insmod [unknown] [k] 0xffffffd62b379324 >> + 22.76% 0.00% insmod [unknown] [k] 0xffffffd62b379748 >> + 22.35% 0.00% insmod [unknown] [k] 0xffffffd62b379720 >> + 16.17% 16.17% insmod [kernel.kallsyms] [k] encode_rs16 >> >> As the v5.19 stats shows, the commit d4150779e60f ("random32: use real >> rng for non-deterministic randomness") did slowed down the rslib test >> by using real random number generator(prandom_u32->get_random_u32). >> But it also shows that the decode/encoding takes quite a lot of time. >> >> For the above reason, if we want make the softlockup go away, maybe >> just converting from real to pseudo random generator won't be enough? >> I tested with below changes applied to v6.14 which uses pseudo random >> numbers instead of real ones similar to what the commit f900fde28883 >> ("crypto: testmgr - fix RNG performance in fuzz tests") did. >> >> The result is that while it did help to fast up the test from 450 sec >> to 350 sec, it still triggers the softlockup. > >It is not that big speedup. 350 sec still provides a lot of >opportunities for a softlockup. > >> --- BEGIN --- >> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >> index 75cb1adac..5785e3c0c 100644 >> --- a/lib/reed_solomon/test_rslib.c >> +++ b/lib/reed_solomon/test_rslib.c >[...] >> @@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws, >> int errloc; >> int i; >> >> + init_rnd_state(&rng); >> + >> /* Load c with random data and encode */ >> for (i = 0; i < dlen; i++) >> - c[i] = get_random_u32() & nn; >> + c[i] = prandom_u32_state(&rng) & nn; >> >> memset(c + dlen, 0, nroots * sizeof(*c)); >> encode_rs16(rs, c, dlen, c + dlen, 0); >[...] >> --- END --- >> >> Below is the function stats for with vs without the above change. >> >> v6.14 without the above change >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >> + 46.94% 46.94% insmod [kernel.kallsyms] [k] decode_rs16 >> + 24.92% 24.92% insmod [kernel.kallsyms] [k] get_random_u32 >> + 23.59% 0.01% insmod [unknown] [k] 0xffffffde423d88cc >> + 22.66% 0.00% insmod [unknown] [k] 0xffffffde423d82c0 >> + 22.39% 0.00% insmod [unknown] [k] 0xffffffde423d88a4 >> + 16.67% 16.67% insmod [kernel.kallsyms] [k] encode_rs16 >> >> v6.14 with the above change >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >> + 61.51% 61.50% insmod [kernel.kallsyms] [k] decode_rs16 >> + 31.15% 0.00% insmod [unknown] [k] 0xffffffe7013858cc >> + 21.75% 21.75% insmod [kernel.kallsyms] [k] encode_rs16 >> + 21.36% 0.01% insmod [unknown] [k] 0xffffffe70137f120 >> + 15.20% 0.00% insmod [unknown] [k] 0xffffffe7013852c0 >> + 12.79% 0.00% insmod [unknown] [k] 0xffffffe7013858a4 >> + 12.00% 0.00% insmod [unknown] [k] 0xffffffe7013854c0 >> + 12.00% 0.00% insmod [unknown] [k] 0xffffffe701385304 >> + 6.74% 0.01% insmod [unknown] [k] 0xffffffe701385600 >> + 3.43% 3.43% insmod [unknown] [k] 0xffffffe701385598 >> + 2.50% 2.50% insmod [unknown] [k] 0xffffffe701385564 >> + 1.96% 1.96% insmod [kernel.kallsyms] [k] prandom_u32_state >> + 1.85% 1.85% insmod [kernel.kallsyms] [k] get_random_u64 >> >> So maybe we still need cond_resched() in addition to converting the >> test from using psuedo random numbers to real ones in order to make >> the softlockup go away? > >IMHO, cond_resched() should be added, definitely. The code clearly >prevents scheduling for tens of seconds. Thank you for checking, and I agree. If needed, I can prepare the fix for the test. >> >>>> That said, I did not dig deep into the code. And I did not try to >> >>>> reproduce the softlockup. I am pretty busy at the moment with some >> >>>> other stuff. I just wanted to give it a look and share my opinion. >> >>> >> >>> I think the softlockup is rather a problem of test itself, >> >>> not the two commits. >> >> >> >>Thanks Ryo for looking into this! I think we need to have a technical >> >>explanation/understanding of the problem so that it is clear how my >> >>series triggers or exaggerates the issue. >> >> As mentioned earlier, I'm sorry that I should have run the test as >> inserted module... It seems the series does make the test more prone >> to softlockups. > >IMHO, the main difference is that the patch "serial: 8250: Switch to >nbcon console" removes touch_nmi_watchdog() from >serial8250_console_write(). > >The touch_nmi_watchdog() resets the softlockup watchdog. It might >hide that the CPU did not schedule for a long time. > >The touch_nmi_watchdog() was there because the console_lock() owner, >used by the legacy loop, was responsible for flushing all pending >messages. And there might be many pending messages when new ones >were added by other CPUs in parallel. And the legacy loop >could not call cond_resched() when called from printk() because >printk() might be called in atomic context. I see. Without the John's series, the cond_resched() in the mention code path should be called during the rslib test as it's not in atomic context in addition to the touch_nmi_watchdog(). >The series removed the touch_nmi_watchdog() because the kthread >called cond_resched. And the amount of messages flushed in >the emergency or panic context was limited. With the John's series appied, I guess the kthread were running on a cpu other than the one running the rslib test as John said[0], and neither of touch_nmi_watchdog() nor cond_resched() were called that could prevent the softlockup. >My view: > >The touch_nmi_watchdog() did hide the softlockup scenario in the rslib >test. The printk series allowed to see the problem reliably. > >By other words, the rslib test should get fixed by adding >cond_resched(). And the conversion of the 8250 serial console >driver to nbcon is correct. I see, and agree! I'll test what's discussed here so that we can say the above statement is true and the John's series can be brought back. >That said, see below. > >> I am still looking into the issue but I noticed that if the printing >> during the test were done in a caller context, not by printk thread, >> the softlock goes away. >> >> I tested with a change something like below so that the printing during >> the test were not delegated to printk thread ("pr/ttyS0"). With the below >> change, the softlock no longer shows up. >> >> --- BEGIN --- >> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c >> index fd12efcc4..07c879a64 100644 >> --- a/kernel/printk/nbcon.c >> +++ b/kernel/printk/nbcon.c >> @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void) >> cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting(); >> (*cpu_emergency_nesting)++; >> } >> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter); >> >> /** >> * nbcon_cpu_emergency_exit - Exit an emergency section >> @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void) >> >> preempt_enable(); >> } >> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit); >> >> /** >> * nbcon_alloc - Allocate and init the nbcon console specific data >> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >> index 75cb1adac..8208f8317 100644 >> --- a/lib/reed_solomon/test_rslib.c >> +++ b/lib/reed_solomon/test_rslib.c >> @@ -11,6 +11,7 @@ >> #include <linux/moduleparam.h> >> #include <linux/random.h> >> #include <linux/slab.h> >> +#include <linux/console.h> >> >> enum verbosity { >> V_SILENT, >> @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws, >> int nroots = rs->codec->nroots; >> int errs, eras, retval; >> >> - if (v >= V_PROGRESS) >> + if (v >= V_PROGRESS) { >> + nbcon_cpu_emergency_enter(); >> pr_info(" %s\n", desc[method]); >> + nbcon_cpu_emergency_exit(); >> + } > >I do not understand how this prevented the softlockup. > >The emergency context causes that the messages is flushed directly >from printk(), see nbcon_atomic_flush_pending() in vprintk_emit(). >But see neither cond_resched() nor touch_nmi_watchdog() in this >code path. > >By other words, I do _not_ see how the emergency context either >triggered scheduling or did reset the watchdog. > >But I might be looking at a different code base. I looked at >Linus' master and linux-next. I used this kernel[1] which is for raspberry pi. Let me recheck with some other machine with Linus' master and linux-next to see if the behavior is raspberry pi specific. Sincerely, Ryo Takakura >Best Regards, >Petr [0] https://lore.kernel.org/all/84ikmwqonm.fsf@jogness.linutronix.de/ [1] https://github.com/raspberrypi/linux/tree/rpi-6.14.y ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-24 8:11 ` Ryo Takakura @ 2025-04-24 9:00 ` John Ogness 2025-04-24 14:13 ` Ryo Takakura 2025-04-24 9:02 ` Petr Mladek 2025-04-30 9:15 ` Ryo Takakura 2 siblings, 1 reply; 21+ messages in thread From: John Ogness @ 2025-04-24 9:00 UTC (permalink / raw) To: Ryo Takakura, pmladek Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, ryotkkr98 On 2025-04-24, Ryo Takakura <ryotkkr98@gmail.com> wrote: >> By other words, the rslib test should get fixed by adding >> cond_resched(). And the conversion of the 8250 serial console >> driver to nbcon is correct. > > I see, and agree! I'll test what's discussed here so that we can say > the above statement is true and the John's series can be brought back. I'm looking forward to seeing your test results. Note there is also 2 more issues blocking the series: https://lore.kernel.org/linux-serial/80b020fc-c18a-4da4-b222-16da1cab2f4c@nvidia.com/ and an offlist report from NXP relating to not being able to suspend on ARMv8 due to printk irq work IPIs. John ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-24 9:00 ` John Ogness @ 2025-04-24 14:13 ` Ryo Takakura 0 siblings, 0 replies; 21+ messages in thread From: Ryo Takakura @ 2025-04-24 14:13 UTC (permalink / raw) To: john.ogness Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, pmladek, ryotkkr98 On Thu, 24 Apr 2025 11:06:23 +0206, John Ogness wrote: >On 2025-04-24, Ryo Takakura <ryotkkr98@gmail.com> wrote: >>> By other words, the rslib test should get fixed by adding >>> cond_resched(). And the conversion of the 8250 serial console >>> driver to nbcon is correct. >> >> I see, and agree! I'll test what's discussed here so that we can say >> the above statement is true and the John's series can be brought back. > >I'm looking forward to seeing your test results. Note there is also 2 OK! >more issues blocking the series: > >https://lore.kernel.org/linux-serial/80b020fc-c18a-4da4-b222-16da1cab2f4c@nvidia.com/ > >and an offlist report from NXP relating to not being able to suspend on >ARMv8 due to printk irq work IPIs. Oh, I wasn't aware of those two... I may look into them as well after this! Sincerely, Ryo Takakura >John ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-24 8:11 ` Ryo Takakura 2025-04-24 9:00 ` John Ogness @ 2025-04-24 9:02 ` Petr Mladek 2025-04-24 14:17 ` Ryo Takakura 2025-04-30 9:15 ` Ryo Takakura 2 siblings, 1 reply; 21+ messages in thread From: Petr Mladek @ 2025-04-24 9:02 UTC (permalink / raw) To: Ryo Takakura Cc: john.ogness, Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang On Thu 2025-04-24 17:11:01, Ryo Takakura wrote: > Hi Petr and John! > > On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote: > >On Mon 2025-04-21 12:41:50, Ryo Takakura wrote: > >> Hi! > >> > >> I would like to follow up the last email that I sent. > >> > >> First, I'm sorry that I later realized that I should have tested > >> the rslib test as an inserted module, as how the robot does, by > >> choosing CONFIG_REED_SOLOMON_TEST=m. > >> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y. > >> > >> Running the rslib test as an inserted module without the John's series > >> was less prone to softlockup. Without the John's series, softlockup shows > >> up once in a test or not at all. With the John's series, softlockup can > >> be observed constanly over the test. > > > >> >>Thanks Ryo for looking into this! I think we need to have a technical > >> >>explanation/understanding of the problem so that it is clear how my > >> >>series triggers or exaggerates the issue. > >> > >> As mentioned earlier, I'm sorry that I should have run the test as > >> inserted module... It seems the series does make the test more prone > >> to softlockups. > > > >IMHO, the main difference is that the patch "serial: 8250: Switch to > >nbcon console" removes touch_nmi_watchdog() from > >serial8250_console_write(). > > > >The touch_nmi_watchdog() resets the softlockup watchdog. It might > >hide that the CPU did not schedule for a long time. > > > >The touch_nmi_watchdog() was there because the console_lock() owner, > >used by the legacy loop, was responsible for flushing all pending > >messages. And there might be many pending messages when new ones > >were added by other CPUs in parallel. And the legacy loop > >could not call cond_resched() when called from printk() because > >printk() might be called in atomic context. > > I see. Without the John's series, the cond_resched() in the mention > code path should be called during the rslib test as it's not in atomic > context in addition to the touch_nmi_watchdog(). Just to be sure. The right fix is to add cond_resched() to rslib test. The code should allow scheduling and do not block the CPU for too long. touch_nmi_watchdog() just hides the problem. It was used in printk() because there was no better solution. > I used this kernel[1] which is for raspberry pi. Let me recheck > with some other machine with Linus' master and linux-next to see > if the behavior is raspberry pi specific. John explained why the emergency context helped. I think that we have a pretty good understating of what is going on there. I believe that the problem will be the same in all code streams. It might be enough to check one of them (Linus' tree or linux-next) just to be sure that the fix applies and it has not been already fixed. Best Regards, Petr ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-24 9:02 ` Petr Mladek @ 2025-04-24 14:17 ` Ryo Takakura 0 siblings, 0 replies; 21+ messages in thread From: Ryo Takakura @ 2025-04-24 14:17 UTC (permalink / raw) To: pmladek Cc: Jason, gregkh, john.ogness, linux-serial, lkp, oe-lkp, oliver.sang, ryotkkr98 On Thu, 24 Apr 2025 11:02:54 +0200, Petr Mladek wrote: >On Thu 2025-04-24 17:11:01, Ryo Takakura wrote: >> Hi Petr and John! >> >> On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote: >> >On Mon 2025-04-21 12:41:50, Ryo Takakura wrote: >> >> Hi! >> >> >> >> I would like to follow up the last email that I sent. >> >> >> >> First, I'm sorry that I later realized that I should have tested >> >> the rslib test as an inserted module, as how the robot does, by >> >> choosing CONFIG_REED_SOLOMON_TEST=m. >> >> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y. >> >> >> >> Running the rslib test as an inserted module without the John's series >> >> was less prone to softlockup. Without the John's series, softlockup shows >> >> up once in a test or not at all. With the John's series, softlockup can >> >> be observed constanly over the test. >> > >> >> >>Thanks Ryo for looking into this! I think we need to have a technical >> >> >>explanation/understanding of the problem so that it is clear how my >> >> >>series triggers or exaggerates the issue. >> >> >> >> As mentioned earlier, I'm sorry that I should have run the test as >> >> inserted module... It seems the series does make the test more prone >> >> to softlockups. >> > >> >IMHO, the main difference is that the patch "serial: 8250: Switch to >> >nbcon console" removes touch_nmi_watchdog() from >> >serial8250_console_write(). >> > >> >The touch_nmi_watchdog() resets the softlockup watchdog. It might >> >hide that the CPU did not schedule for a long time. >> > >> >The touch_nmi_watchdog() was there because the console_lock() owner, >> >used by the legacy loop, was responsible for flushing all pending >> >messages. And there might be many pending messages when new ones >> >were added by other CPUs in parallel. And the legacy loop >> >could not call cond_resched() when called from printk() because >> >printk() might be called in atomic context. >> >> I see. Without the John's series, the cond_resched() in the mention >> code path should be called during the rslib test as it's not in atomic >> context in addition to the touch_nmi_watchdog(). > >Just to be sure. The right fix is to add cond_resched() to rslib test. >The code should allow scheduling and do not block the CPU for too >long. > >touch_nmi_watchdog() just hides the problem. It was used in printk() >because there was no better solution. Understood! >> I used this kernel[1] which is for raspberry pi. Let me recheck >> with some other machine with Linus' master and linux-next to see >> if the behavior is raspberry pi specific. > >John explained why the emergency context helped. I think that we have >a pretty good understating of what is going on there. > >I believe that the problem will be the same in all code streams. >It might be enough to check one of them (Linus' tree or linux-next) >just to be sure that the fix applies and it has not been already >fixed. Thanks for clarifying. I'll make sure that the problem can also be observed on either of Linus' tree or linux-next and they still needs the fix. Sincerely, Ryo Takakura >Best Regards, >Petr ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-24 8:11 ` Ryo Takakura 2025-04-24 9:00 ` John Ogness 2025-04-24 9:02 ` Petr Mladek @ 2025-04-30 9:15 ` Ryo Takakura 2025-04-30 15:41 ` John Ogness 2 siblings, 1 reply; 21+ messages in thread From: Ryo Takakura @ 2025-04-30 9:15 UTC (permalink / raw) To: john.ogness, pmladek Cc: ryotkkr98, Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang Hi, I tested what we have discussed! On Thu, 24 Apr 2025 17:11:01 +0900, Ryo Takakura wrote: >On Tue, 22 Apr 2025 14:15:01 +0200, Petr Mladek wrote: >>On Mon 2025-04-21 12:41:50, Ryo Takakura wrote: >>> First, I'm sorry that I later realized that I should have tested >>> the rslib test as an inserted module, as how the robot does, by >>> choosing CONFIG_REED_SOLOMON_TEST=m. >>> Not as a boottime test by enabling CONFIG_REED_SOLOMON_TEST=y. >>> >>> Running the rslib test as an inserted module without the John's series >>> was less prone to softlockup. Without the John's series, softlockup shows >>> up once in a test or not at all. With the John's series, softlockup can >>> be observed constanly over the test. >> >>> On Mon, 17 Mar 2025 23:42:21 +0900, Ryo Takakura wrote: >>> >On Mon, 17 Mar 2025 09:51:46 +0106, John Ogness wrote: >>> >>On 2025-03-15, Ryo Takakura <ryotkkr98@gmail.com> wrote: >>> >>> I got the same softlockup during the test regardless of the presence >>> >>> of the commits. >>> >>> >>> >>> [ 60.222013] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/0:1] >>> >>> [ 60.222023] Modules linked in: >>> >>> [ 60.222032] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G L 6.14.>0-rc6-v14-rc6-voluntary+ #4 >>> >>> [ 60.222047] Tainted: [L]=SOFTLOCKUP >>> >>> [ 60.222051] Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT) >>> >>> [ 60.222055] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>> >>> [ 60.222066] pc : get_random_u32+0xac/0x118 >>> >>> [ 60.222081] lr : __get_random_u32_below+0x20/0x78 >>> >>> [ 60.222094] sp : ffffffc08002bb80 >>> >>> [ 60.222098] x29: ffffffc08002bb80 x28: 0000000000000003 x27: 0000000000000001 >>> >>> [ 60.222114] x26: ffffff804112e6a4 x25: ffffffd33ed21820 x24: ffffff804112e69c >>> >>> [ 60.222128] x23: 0000000000000000 x22: ffffff804112e64e x21: 0000000000000000 >>> >>> [ 60.222142] x20: 000000000000000d x19: ffffff80fb7aebb8 x18: 0000000000000002 >>> >>> [ 60.222156] x17: 0000000000000004 x16: ffffff804112e584 x15: ffffff8041126796 >>> >>> [ 60.222169] x14: ffffff80411267c0 x13: 0000000000000006 x12: ffffff804112e5c0 >>> >>> [ 60.222183] x11: ffffff804112e64c x10: 0000000000000007 x9 : ffffffd33dccdd10 >>> >>> [ 60.222196] x8 : ffffff804112e6a8 x7 : 0000000000000000 x6 : 0005000400060005 >>> >>> [ 60.222210] x5 : ffffff804112e65a x4 : 0000000000000000 x3 : 0000000000000010 >>> >>> [ 60.222223] x2 : 0000000000000014 x1 : 000000002c7d0b7a x0 : 0000000000000013 >>> >>> [ 60.222237] Call trace: >>> >>> [ 60.222241] get_random_u32+0xac/0x118 (P) >>> >>> [ 60.222256] __get_random_u32_below+0x20/0x78 >>> >>> [ 60.222268] get_rcw_we+0x180/0x208 >>> >>> [ 60.222278] test_rslib_init+0x2c8/0xba0 >>> >>> [ 60.222292] do_one_initcall+0x4c/0x210 >>> >>> [ 60.222303] kernel_init_freeable+0x1fc/0x3a0 >>> >>> [ 60.222317] kernel_init+0x28/0x1f8 >>> >>> [ 60.222327] ret_from_fork+0x10/0x20 >>> >>> >>> >>>> I wonder if a cond_resched() in some loop would help. Or using a >>> >>> >>> >>> I wasn't sure which loop would be the appropriate one but adding >>> >>> cond_resched() as below worked as suggested. >>> >>> >>> >>> ----- BEGIN ----- >>> >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >>> >>> index 75cb1adac..322d7b0a8 100644 >>> >>> --- a/lib/reed_solomon/test_rslib.c >>> >>> +++ b/lib/reed_solomon/test_rslib.c >>> >>> @@ -306,6 +306,8 @@ static void test_uc(struct rs_control *rs, int len, int errs, >>> >>> >>> >>> if (memcmp(r, c, len * sizeof(*r))) >>> >>> stat->dwrong++; >>> >>> + >>> >>> + cond_resched(); >>> >>> } >>> >>> stat->nwords += trials; >>> >>> } >>> >>> @@ -400,6 +402,8 @@ static void test_bc(struct rs_control *rs, int len, int errs, >>> >>> } else { >>> >>> stat->rfail++; >>> >>> } >>> >>> + >>> >>> + cond_resched(); >>> >>> } >>> >>> stat->nwords += trials; >>> >>> } >>> >>> ----- END ----- >>> >>> >>> >>>> pseudorandom generator. I remember the problems related to much >>> >>>> slower random generator, for example, see the commit >>> >>>> f900fde28883602b6 ("crypto: testmgr - fix RNG performance in fuzz >>> >>>> tests"). >>> >>> >>> >>> I haven't tested this but I'll look into it! >>> >>> For the same reason pointed out by the commit f900fde28883 ("crypto: >>> testmgr - fix RNG performance in fuzz tests"), the commit d4150779e60f >>> ("random32: use real rng for non-deterministic randomness") does seem >>> to have slowed down the rslib test. >>> >>> On Raspberry pi 4(4G Mem, 4 CPUs), running the rslib test on v5.19 with >>> the commit d4150779e60f ("random32: use real rng for non-deterministic >>> randomness") takes about 450sec compared to v5.18(without the commit) >>> which only takes about 350sec. >>> >>> However, random number generation does not seem to be the only reason >>> for the rslib test being slow. >>> Below is the function stats for v5.18 vs v5.19 for the test. >>> >>> v5.18 >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >>> + 59.90% 59.90% insmod [kernel.kallsyms] [k] decode_rs16 >>> + 30.28% 0.00% insmod [unknown] [k] 0xffffffd60413a748 >>> + 21.39% 21.39% insmod [kernel.kallsyms] [k] encode_rs16 >>> + 20.94% 0.00% insmod [unknown] [k] 0xffffffd60441b0d4 >>> + 14.17% 0.01% insmod [unknown] [k] 0xffffffd60413a324 >>> + 12.92% 0.00% insmod [unknown] [k] 0xffffffd60413a720 >>> + 11.74% 0.00% insmod [unknown] [k] 0xffffffd60413a368 >>> + 11.63% 0.01% insmod [unknown] [k] 0xffffffd60413a4b4 >>> + 6.60% 0.01% insmod [unknown] [k] 0xffffffd60413a46c >>> + 3.26% 3.26% insmod [unknown] [k] 0xffffffd60413a3c4 >>> + 3.11% 3.11% insmod [kernel.kallsyms] [k] prandom_u32 >>> >>> v5.19 >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >>> + 45.13% 45.13% insmod [kernel.kallsyms] [k] decode_rs16 >>> + 27.84% 27.83% insmod [kernel.kallsyms] [k] get_random_u32 >>> + 22.83% 0.01% insmod [unknown] [k] 0xffffffd62b379324 >>> + 22.76% 0.00% insmod [unknown] [k] 0xffffffd62b379748 >>> + 22.35% 0.00% insmod [unknown] [k] 0xffffffd62b379720 >>> + 16.17% 16.17% insmod [kernel.kallsyms] [k] encode_rs16 >>> >>> As the v5.19 stats shows, the commit d4150779e60f ("random32: use real >>> rng for non-deterministic randomness") did slowed down the rslib test >>> by using real random number generator(prandom_u32->get_random_u32). >>> But it also shows that the decode/encoding takes quite a lot of time. >>> >>> For the above reason, if we want make the softlockup go away, maybe >>> just converting from real to pseudo random generator won't be enough? >>> I tested with below changes applied to v6.14 which uses pseudo random >>> numbers instead of real ones similar to what the commit f900fde28883 >>> ("crypto: testmgr - fix RNG performance in fuzz tests") did. >>> >>> The result is that while it did help to fast up the test from 450 sec >>> to 350 sec, it still triggers the softlockup. >> >>It is not that big speedup. 350 sec still provides a lot of >>opportunities for a softlockup. >> >>> --- BEGIN --- >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >>> index 75cb1adac..5785e3c0c 100644 >>> --- a/lib/reed_solomon/test_rslib.c >>> +++ b/lib/reed_solomon/test_rslib.c >>[...] >>> @@ -162,9 +174,11 @@ static int get_rcw_we(struct rs_control *rs, struct wspace *ws, >>> int errloc; >>> int i; >>> >>> + init_rnd_state(&rng); >>> + >>> /* Load c with random data and encode */ >>> for (i = 0; i < dlen; i++) >>> - c[i] = get_random_u32() & nn; >>> + c[i] = prandom_u32_state(&rng) & nn; >>> >>> memset(c + dlen, 0, nroots * sizeof(*c)); >>> encode_rs16(rs, c, dlen, c + dlen, 0); >>[...] >>> --- END --- >>> >>> Below is the function stats for with vs without the above change. >>> >>> v6.14 without the above change >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >>> + 46.94% 46.94% insmod [kernel.kallsyms] [k] decode_rs16 >>> + 24.92% 24.92% insmod [kernel.kallsyms] [k] get_random_u32 >>> + 23.59% 0.01% insmod [unknown] [k] 0xffffffde423d88cc >>> + 22.66% 0.00% insmod [unknown] [k] 0xffffffde423d82c0 >>> + 22.39% 0.00% insmod [unknown] [k] 0xffffffde423d88a4 >>> + 16.67% 16.67% insmod [kernel.kallsyms] [k] encode_rs16 >>> >>> v6.14 with the above change >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] load_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_init_module >>> + 100.00% 0.00% insmod [kernel.kallsyms] [k] do_one_initcall >>> + 61.51% 61.50% insmod [kernel.kallsyms] [k] decode_rs16 >>> + 31.15% 0.00% insmod [unknown] [k] 0xffffffe7013858cc >>> + 21.75% 21.75% insmod [kernel.kallsyms] [k] encode_rs16 >>> + 21.36% 0.01% insmod [unknown] [k] 0xffffffe70137f120 >>> + 15.20% 0.00% insmod [unknown] [k] 0xffffffe7013852c0 >>> + 12.79% 0.00% insmod [unknown] [k] 0xffffffe7013858a4 >>> + 12.00% 0.00% insmod [unknown] [k] 0xffffffe7013854c0 >>> + 12.00% 0.00% insmod [unknown] [k] 0xffffffe701385304 >>> + 6.74% 0.01% insmod [unknown] [k] 0xffffffe701385600 >>> + 3.43% 3.43% insmod [unknown] [k] 0xffffffe701385598 >>> + 2.50% 2.50% insmod [unknown] [k] 0xffffffe701385564 >>> + 1.96% 1.96% insmod [kernel.kallsyms] [k] prandom_u32_state >>> + 1.85% 1.85% insmod [kernel.kallsyms] [k] get_random_u64 >>> >>> So maybe we still need cond_resched() in addition to converting the >>> test from using psuedo random numbers to real ones in order to make >>> the softlockup go away? >> >>IMHO, cond_resched() should be added, definitely. The code clearly >>prevents scheduling for tens of seconds. > >Thank you for checking, and I agree. >If needed, I can prepare the fix for the test. > >>> >>>> That said, I did not dig deep into the code. And I did not try to >>> >>>> reproduce the softlockup. I am pretty busy at the moment with some >>> >>>> other stuff. I just wanted to give it a look and share my opinion. >>> >>> >>> >>> I think the softlockup is rather a problem of test itself, >>> >>> not the two commits. >>> >> >>> >>Thanks Ryo for looking into this! I think we need to have a technical >>> >>explanation/understanding of the problem so that it is clear how my >>> >>series triggers or exaggerates the issue. >>> >>> As mentioned earlier, I'm sorry that I should have run the test as >>> inserted module... It seems the series does make the test more prone >>> to softlockups. >> >>IMHO, the main difference is that the patch "serial: 8250: Switch to >>nbcon console" removes touch_nmi_watchdog() from >>serial8250_console_write(). >> >>The touch_nmi_watchdog() resets the softlockup watchdog. It might >>hide that the CPU did not schedule for a long time. >> >>The touch_nmi_watchdog() was there because the console_lock() owner, >>used by the legacy loop, was responsible for flushing all pending >>messages. And there might be many pending messages when new ones >>were added by other CPUs in parallel. And the legacy loop >>could not call cond_resched() when called from printk() because >>printk() might be called in atomic context. > >I see. Without the John's series, the cond_resched() in the mention >code path should be called during the rslib test as it's not in atomic >context in addition to the touch_nmi_watchdog()s. To test the above, I run the rslib test using legacy console(without the John's series) with the touch_nmi_watchdog()s removed as following. The result is that it triggered the softlockup as expected. So I guess we can say that the legacy console was indeed hiding the softlockup scenario as suggested by Petr. --- BEGIN --- diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c index c8704eb9e..50b64a6a2 100644 --- a/drivers/tty/serial/8250/8250_port.c +++ b/drivers/tty/serial/8250/8250_port.c @@ -2113,7 +2113,6 @@ static bool wait_for_lsr(struct uart_8250_port *up, int bits) if (--tmout == 0) break; udelay(1); - touch_nmi_watchdog(); } return (tmout != 0); @@ -3425,8 +3424,6 @@ void serial8250_console_write(struct uart_8250_port *up, const char *s, unsigned int ier, use_fifo; int locked = 1; - touch_nmi_watchdog(); - if (oops_in_progress) locked = uart_port_trylock_irqsave(port, &flags); else --- END --- >>The series removed the touch_nmi_watchdog() because the kthread >>called cond_resched. And the amount of messages flushed in >>the emergency or panic context was limited. > >With the John's series appied, I guess the kthread were running on a >cpu other than the one running the rslib test as John said[0], and neither >of touch_nmi_watchdog() nor cond_resched() were called that could prevent >the softlockup. To test the above, I removed the touch_nmi_watchdog() mentioned by John[0] on top of the John's series with the printing forced to be done on the same cpu running the rslib test by marking emergency section(below change applied). The result is that it triggered the softlockup as expected. Similar to the first test, the touch_nmi_watchdog() was preventing the softlockup when its marked with emergency section. And I believe this implies that the kthread was running on some other cpu as stated above as it should also be calling the touch_nmi_watchdog() when it does printing. --- BEGIN --- diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c index 21bc9e79c..2f97a603d 100644 --- a/drivers/tty/serial/8250/8250_port.c +++ b/drivers/tty/serial/8250/8250_port.c @@ -2133,7 +2133,6 @@ static bool wait_for_lsr(struct uart_8250_port *up, int bits) if (--tmout == 0) break; udelay(1); - touch_nmi_watchdog(); } return (tmout != 0); diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c index fd12efcc4..07c879a64 100644 --- a/kernel/printk/nbcon.c +++ b/kernel/printk/nbcon.c @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void) cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting(); (*cpu_emergency_nesting)++; } +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter); /** * nbcon_cpu_emergency_exit - Exit an emergency section @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void) preempt_enable(); } +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit); /** * nbcon_alloc - Allocate and init the nbcon console specific data diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c index 75cb1adac..8208f8317 100644 --- a/lib/reed_solomon/test_rslib.c +++ b/lib/reed_solomon/test_rslib.c @@ -11,6 +11,7 @@ #include <linux/moduleparam.h> #include <linux/random.h> #include <linux/slab.h> +#include <linux/console.h> enum verbosity { V_SILENT, @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws, int nroots = rs->codec->nroots; int errs, eras, retval; - if (v >= V_PROGRESS) + if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info(" %s\n", desc[method]); + nbcon_cpu_emergency_exit(); + } for (errs = 0; errs <= nroots / 2; errs++) for (eras = 0; eras <= nroots - 2 * errs; eras++) @@ -353,8 +357,11 @@ static int exercise_rs(struct rs_control *rs, struct wspace *ws, int retval = 0; int i; - if (v >= V_PROGRESS) + if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info("Testing up to error correction capacity...\n"); + nbcon_cpu_emergency_exit(); + } for (i = 0; i <= IN_PLACE; i++) retval |= ex_rs_helper(rs, ws, len, trials, i); @@ -411,8 +418,11 @@ static int exercise_rs_bc(struct rs_control *rs, struct wspace *ws, int nroots = rs->codec->nroots; int errs, eras, cutoff; - if (v >= V_PROGRESS) + if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info("Testing beyond error correction capacity...\n"); + nbcon_cpu_emergency_exit(); + } for (errs = 1; errs <= nroots; errs++) { eras = nroots - 2 * errs + 1; @@ -468,8 +478,10 @@ static int run_exercise(struct etab *e) prev_pad = pad; if (v >= V_PROGRESS) { + nbcon_cpu_emergency_enter(); pr_info("Testing (%d,%d)_%d code...\n", len, kk - pad, nn + 1); + nbcon_cpu_emergency_exit(); } retval |= exercise_rs(rsc, ws, len, e->ntrials); --- END --- >>My view: >> >>The touch_nmi_watchdog() did hide the softlockup scenario in the rslib >>test. The printk series allowed to see the problem reliably. >> >>By other words, the rslib test should get fixed by adding >>cond_resched(). And the conversion of the 8250 serial console >>driver to nbcon is correct. > >I see, and agree! I'll test what's discussed here so that we can say >the above statement is true and the John's series can be brought back. I would appreciate any comment on the above tests! If sounds good, I will prepare a fix adding cond_resched() to the rslib test later on. >>That said, see below. >> >>> I am still looking into the issue but I noticed that if the printing >>> during the test were done in a caller context, not by printk thread, >>> the softlock goes away. >>> >>> I tested with a change something like below so that the printing during >>> the test were not delegated to printk thread ("pr/ttyS0"). With the below >>> change, the softlock no longer shows up. >>> >>> --- BEGIN --- >>> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c >>> index fd12efcc4..07c879a64 100644 >>> --- a/kernel/printk/nbcon.c >>> +++ b/kernel/printk/nbcon.c >>> @@ -1639,6 +1639,7 @@ void nbcon_cpu_emergency_enter(void) >>> cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting(); >>> (*cpu_emergency_nesting)++; >>> } >>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_enter); >>> >>> /** >>> * nbcon_cpu_emergency_exit - Exit an emergency section >>> @@ -1656,6 +1657,7 @@ void nbcon_cpu_emergency_exit(void) >>> >>> preempt_enable(); >>> } >>> +EXPORT_SYMBOL_GPL(nbcon_cpu_emergency_exit); >>> >>> /** >>> * nbcon_alloc - Allocate and init the nbcon console specific data >>> diff --git a/lib/reed_solomon/test_rslib.c b/lib/reed_solomon/test_rslib.c >>> index 75cb1adac..8208f8317 100644 >>> --- a/lib/reed_solomon/test_rslib.c >>> +++ b/lib/reed_solomon/test_rslib.c >>> @@ -11,6 +11,7 @@ >>> #include <linux/moduleparam.h> >>> #include <linux/random.h> >>> #include <linux/slab.h> >>> +#include <linux/console.h> >>> >>> enum verbosity { >>> V_SILENT, >>> @@ -323,8 +324,11 @@ static int ex_rs_helper(struct rs_control *rs, struct wspace *ws, >>> int nroots = rs->codec->nroots; >>> int errs, eras, retval; >>> >>> - if (v >= V_PROGRESS) >>> + if (v >= V_PROGRESS) { >>> + nbcon_cpu_emergency_enter(); >>> pr_info(" %s\n", desc[method]); >>> + nbcon_cpu_emergency_exit(); >>> + } >> >>I do not understand how this prevented the softlockup. >> >>The emergency context causes that the messages is flushed directly >>from printk(), see nbcon_atomic_flush_pending() in vprintk_emit(). >>But see neither cond_resched() nor touch_nmi_watchdog() in this >>code path. >> >>By other words, I do _not_ see how the emergency context either >>triggered scheduling or did reset the watchdog. >> >>But I might be looking at a different code base. I looked at >>Linus' master and linux-next. > >I used this kernel[1] which is for raspberry pi. Let me recheck >with some other machine with Linus' master and linux-next to see >if the behavior is raspberry pi specific. I tested with Linus' master on x86 qemu. I was able to see the rslib tests prone to softlockup and the John's series exaggerating it, same as raspberry pi. But I couldn't see the softlockup go away by marking the emergency section around the printk call within the rslib test. Looks like it never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because trasmission gets immediatly completed on qemu board? I'm afraid that I currently don't have a convenient machine that I can test this futher at the moment... Sincerely, Ryo Takakura [0] https://lore.kernel.org/all/84ikmwqonm.fsf@jogness.linutronix.de/ ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-30 9:15 ` Ryo Takakura @ 2025-04-30 15:41 ` John Ogness 2025-05-01 4:10 ` Ryo Takakura 0 siblings, 1 reply; 21+ messages in thread From: John Ogness @ 2025-04-30 15:41 UTC (permalink / raw) To: Ryo Takakura, pmladek Cc: ryotkkr98, Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang On 2025-04-30, Ryo Takakura <ryotkkr98@gmail.com> wrote: >>> The touch_nmi_watchdog() resets the softlockup watchdog. It might >>> hide that the CPU did not schedule for a long time. > > To test the above, I run the rslib test using legacy console(without > the John's series) with the touch_nmi_watchdog()s removed as > following. > > The result is that it triggered the softlockup as expected. So I guess > we can say that the legacy console was indeed hiding the softlockup > scenario as suggested by Petr. Excellent. >> With the John's series appied, I guess the kthread were running on a >> cpu other than the one running the rslib test as John said[0], and >> neither of touch_nmi_watchdog() nor cond_resched() were called that >> could prevent the softlockup. > > To test the above, I removed the touch_nmi_watchdog() mentioned by John[0] > on top of the John's series with the printing forced to be done on the same > cpu running the rslib test by marking emergency section(below change applied). > > The result is that it triggered the softlockup as expected. Similar to > the first test, the touch_nmi_watchdog() was preventing the softlockup > when its marked with emergency section. Excellent. > And I believe this implies that the kthread was running on some other cpu > as stated above as it should also be calling the touch_nmi_watchdog() > when it does printing. Agreed. > If sounds good, I will prepare a fix adding cond_resched() to the > rslib test later on. Yes, please add a cond_resched() to the rslib test. > I tested with Linus' master on x86 qemu. I was able to see the rslib > tests prone to softlockup and the John's series exaggerating it, same > as raspberry pi. > > But I couldn't see the softlockup go away by marking the emergency > section around the printk call within the rslib test. Looks like it > never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because > trasmission gets immediatly completed on qemu board? qemu does not emulate a baudrate, so indeed transmission completes immediately. It would need to be tested on real hardware. But I am certain we found the cause and explanation. Thank you for digging into this! John ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-04-30 15:41 ` John Ogness @ 2025-05-01 4:10 ` Ryo Takakura 2025-06-16 15:15 ` Florian Bezdeka 0 siblings, 1 reply; 21+ messages in thread From: Ryo Takakura @ 2025-05-01 4:10 UTC (permalink / raw) To: john.ogness, pmladek Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, ryotkkr98 On Wed, 30 Apr 2025 17:47:01 +0206, John Ogness wrote: >On 2025-04-30, Ryo Takakura <ryotkkr98@gmail.com> wrote: >>>> The touch_nmi_watchdog() resets the softlockup watchdog. It might >>>> hide that the CPU did not schedule for a long time. >> >> To test the above, I run the rslib test using legacy console(without >> the John's series) with the touch_nmi_watchdog()s removed as >> following. >> >> The result is that it triggered the softlockup as expected. So I guess >> we can say that the legacy console was indeed hiding the softlockup >> scenario as suggested by Petr. > >Excellent. > >>> With the John's series appied, I guess the kthread were running on a >>> cpu other than the one running the rslib test as John said[0], and >>> neither of touch_nmi_watchdog() nor cond_resched() were called that >>> could prevent the softlockup. >> >> To test the above, I removed the touch_nmi_watchdog() mentioned by John[0] >> on top of the John's series with the printing forced to be done on the same >> cpu running the rslib test by marking emergency section(below change applied). >> >> The result is that it triggered the softlockup as expected. Similar to >> the first test, the touch_nmi_watchdog() was preventing the softlockup >> when its marked with emergency section. > >Excellent. > >> And I believe this implies that the kthread was running on some other cpu >> as stated above as it should also be calling the touch_nmi_watchdog() >> when it does printing. > >Agreed. > >> If sounds good, I will prepare a fix adding cond_resched() to the >> rslib test later on. > >Yes, please add a cond_resched() to the rslib test. Thank you for checking the results, Got it! >> I tested with Linus' master on x86 qemu. I was able to see the rslib >> tests prone to softlockup and the John's series exaggerating it, same >> as raspberry pi. >> >> But I couldn't see the softlockup go away by marking the emergency >> section around the printk call within the rslib test. Looks like it >> never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because >> trasmission gets immediatly completed on qemu board? > >qemu does not emulate a baudrate, so indeed transmission completes >immediately. It would need to be tested on real hardware. But I am Oh I see, that makes sense. >certain we found the cause and explanation. Thank you for digging into >this! Great, my pleasure! Sincerely, Ryo Takakura >John ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-05-01 4:10 ` Ryo Takakura @ 2025-06-16 15:15 ` Florian Bezdeka 2025-06-18 4:42 ` John Ogness 0 siblings, 1 reply; 21+ messages in thread From: Florian Bezdeka @ 2025-06-16 15:15 UTC (permalink / raw) To: Ryo Takakura, john.ogness, pmladek Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, Jan Kiszka Hi all, On Thu, 2025-05-01 at 13:10 +0900, Ryo Takakura wrote: > On Wed, 30 Apr 2025 17:47:01 +0206, John Ogness wrote: > > On 2025-04-30, Ryo Takakura <ryotkkr98@gmail.com> wrote: > > > > > The touch_nmi_watchdog() resets the softlockup watchdog. It might > > > > > hide that the CPU did not schedule for a long time. > > > > > > To test the above, I run the rslib test using legacy console(without > > > the John's series) with the touch_nmi_watchdog()s removed as > > > following. > > > > > > The result is that it triggered the softlockup as expected. So I guess > > > we can say that the legacy console was indeed hiding the softlockup > > > scenario as suggested by Petr. > > > > Excellent. > > > > > > With the John's series appied, I guess the kthread were running on a > > > > cpu other than the one running the rslib test as John said[0], and > > > > neither of touch_nmi_watchdog() nor cond_resched() were called that > > > > could prevent the softlockup. > > > > > > To test the above, I removed the touch_nmi_watchdog() mentioned by John[0] > > > on top of the John's series with the printing forced to be done on the same > > > cpu running the rslib test by marking emergency section(below change applied). > > > > > > The result is that it triggered the softlockup as expected. Similar to > > > the first test, the touch_nmi_watchdog() was preventing the softlockup > > > when its marked with emergency section. > > > > Excellent. > > > > > And I believe this implies that the kthread was running on some other cpu > > > as stated above as it should also be calling the touch_nmi_watchdog() > > > when it does printing. > > > > Agreed. > > > > > If sounds good, I will prepare a fix adding cond_resched() to the > > > rslib test later on. > > > > Yes, please add a cond_resched() to the rslib test. > > Thank you for checking the results, Got it! > > > > I tested with Linus' master on x86 qemu. I was able to see the rslib > > > tests prone to softlockup and the John's series exaggerating it, same > > > as raspberry pi. > > > > > > But I couldn't see the softlockup go away by marking the emergency > > > section around the printk call within the rslib test. Looks like it > > > never calls the touch_nmi_watchdog() in wait_for_lsr(). Maybe because > > > trasmission gets immediatly completed on qemu board? > > > > qemu does not emulate a baudrate, so indeed transmission completes > > immediately. It would need to be tested on real hardware. But I am > > Oh I see, that makes sense. > > > certain we found the cause and explanation. Thank you for digging into > > this! > > Great, my pleasure! With that it has been confirmed that the original test report [1] was wrong. It reported the trigger, not the root cause. b63e6f60eab4 ("serial: 8250: Switch to nbcon console") was / is correct but got reverted by Greg in mainline with f79b163c4231 ("Revert "serial: 8250: Switch to nbcon console"") This leads to the situation that stable-rt branches still ship this patch while mainline and stable do not. John, is that intended or was mainline / stable never informed about that? Best regards, Florian [1] https://lore.kernel.org/all/202501221029.fb0d574d-lkp@intel.com/#t ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] 2025-06-16 15:15 ` Florian Bezdeka @ 2025-06-18 4:42 ` John Ogness 0 siblings, 0 replies; 21+ messages in thread From: John Ogness @ 2025-06-18 4:42 UTC (permalink / raw) To: Florian Bezdeka, Ryo Takakura, pmladek Cc: Jason, gregkh, linux-serial, lkp, oe-lkp, oliver.sang, Jan Kiszka Hi Florian, On 2025-06-16, Florian Bezdeka <florian.bezdeka@siemens.com> wrote: > With that it has been confirmed that the original test report [1] was > wrong. It reported the trigger, not the root cause. > > b63e6f60eab4 ("serial: 8250: Switch to nbcon console") > > was / is correct but got reverted by Greg in mainline with > > f79b163c4231 ("Revert "serial: 8250: Switch to nbcon console"") > > This leads to the situation that stable-rt branches still ship this > patch while mainline and stable do not. John, is that intended or was > mainline / stable never informed about that? stable-rt carries the 8250-nbcon patches because we intend for them to go mainline. Greg reverted them from mainline at my request because of 2 power management regression reports (related to suspending on arm32). I do not believe 8250-nbcon is the real problem, but I need to determine the root issue and offer a solution before sending the 8250-nbcon patches to mainline again. Regardless, the fix in this thread is still relevant for mainline now because there are now other nbcon serial drivers mainline. John ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2025-06-18 4:42 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-22 2:28 [linux-next:master] [serial] b63e6f60ea: BUG:soft_lockup-CPU##stuck_for#s![modprobe:#] kernel test robot 2025-01-22 8:41 ` John Ogness 2025-01-22 9:37 ` Greg Kroah-Hartman 2025-01-24 16:10 ` Petr Mladek 2025-01-24 16:39 ` Petr Mladek 2025-03-15 3:38 ` Ryo Takakura 2025-03-17 8:45 ` John Ogness 2025-03-17 14:42 ` Ryo Takakura 2025-04-21 3:41 ` Ryo Takakura 2025-04-22 12:15 ` Petr Mladek 2025-04-22 14:03 ` John Ogness 2025-04-24 8:11 ` Ryo Takakura 2025-04-24 9:00 ` John Ogness 2025-04-24 14:13 ` Ryo Takakura 2025-04-24 9:02 ` Petr Mladek 2025-04-24 14:17 ` Ryo Takakura 2025-04-30 9:15 ` Ryo Takakura 2025-04-30 15:41 ` John Ogness 2025-05-01 4:10 ` Ryo Takakura 2025-06-16 15:15 ` Florian Bezdeka 2025-06-18 4:42 ` John Ogness
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).