* [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
@ 2025-09-04 7:58 bugzilla-daemon
2025-09-04 8:18 ` [Bug 220535] " bugzilla-daemon
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-04 7:58 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
Bug ID: 220535
Summary: ext4 __jbd2_log_wait_for_space soft lockup and CPU
stuck for 134s
Product: File System
Version: 2.5
Hardware: Intel
OS: Linux
Status: NEW
Severity: normal
Priority: P3
Component: ext4
Assignee: fs_ext4@kernel-bugs.osdl.org
Reporter: waxihus@gmail.com
Regression: No
On a three-node storage cluster, running mdtest concurrently causes a soft
lockup that leads to a node crash.
System load averages spike to roughly 300–400, and the heavy file I/O causes
severe memory churn.
The issue reproduces on kernels 5.15.0-189 and 6.4, but does not occur on the
SUSE SP4 (June 2023) kernel.
mdtest command:
mpirun -np 128 --hostfile /root/mpirun-hosts mdtest -d /cluster_test_dir/ -n 10000 -z 1 -b 10 -u -R -w 4096 -e 4096
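While mdtest is running, the load spike and dirty-memory churn described above can be watched from another shell with standard tools; this is only a rough observation sketch, nothing below is specific to the cluster setup:

  # one-second samples of run queue, memory, swap and block I/O
  vmstat 1
  # dirty/writeback page counters driving the memory churn
  watch -n 1 "grep -E 'Dirty|Writeback' /proc/meminfo"
  # load averages
  uptime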
[ 1503.243551] perf: interrupt took too long (2501 > 2500), lowering
kernel.perf_event_max_sample_rate to 79750
[ 2064.418909] perf: interrupt took too long (3187 > 3126), lowering
kernel.perf_event_max_sample_rate to 62750
[ 2120.339245] BUG: workqueue lockup - pool cpus=118 node=1 flags=0x0 nice=0
stuck for 32s!
[ 2120.339282] Showing busy workqueues and worker pools:
[ 2120.339300] workqueue events: flags=0x0
[ 2120.339327] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2120.339331] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 2120.339484] workqueue mm_percpu_wq: flags=0x8
[ 2120.339488] pwq 236: cpus=118 node=1 flags=0x0 nice=0 active=3/256
refcnt=6
[ 2120.339490] pending: vmstat_update, lru_add_drain_per_cpu BAR(853),
drain_local_pages_wq BAR(3933)
[ 2120.339516] pwq 120: cpus=60 node=1 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2120.339519] pending: vmstat_update
[ 2120.339624] workqueue writeback: flags=0x4a
[ 2120.339625] pwq 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0
active=1/256 refcnt=2
[ 2120.339629] in-flight: 26886:wb_workfn
[ 2120.339670] workqueue kblockd: flags=0x18
[ 2120.339673] pwq 237: cpus=118 node=1 flags=0x0 nice=-20 active=1/256
refcnt=2
[ 2120.339675] pending: blk_mq_timeout_work
[ 2120.342089] workqueue yrfs_xq:30648e4: flags=0xa
[ 2120.342091] pwq 257: cpus=0-31,64-95 node=0 flags=0x4 nice=0 active=1/128
refcnt=2
[ 2120.342095] in-flight: 28216:yrfs_ops_update_all_osds_capacity_work
[yrfs]
[ 2120.342179] pool 257: cpus=0-31,64-95 node=0 flags=0x4 nice=0 hung=0s
workers=8 idle: 48356 20446 785 831 821 786 825
[ 2120.342187] pool 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0 hung=0s
workers=8 idle: 47239 807 826 55340 34878 804 20445
[ 2120.342196] Showing backtraces of running workers in stalled CPU-bound
worker pools:
[ 2243.225831] BUG: workqueue lockup - pool cpus=45 node=1 flags=0x0 nice=0
stuck for 38s!
[ 2243.225883] Showing busy workqueues and worker pools:
[ 2243.225896] workqueue events: flags=0x0
[ 2243.225918] pwq 90: cpus=45 node=1 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2243.225922] pending: mlx5e_rx_dim_work [mlx5_core]
[ 2243.226028] pwq 54: cpus=27 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2243.226031] pending: mlx5e_rx_dim_work [mlx5_core]
[ 2243.226100] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2243.226103] pending: drm_fb_helper_damage_work [drm_kms_helper]
[ 2243.226271] workqueue mm_percpu_wq: flags=0x8
[ 2243.226291] pwq 90: cpus=45 node=1 flags=0x0 nice=0 active=2/256 refcnt=4
[ 2243.226293] pending: drain_local_pages_wq BAR(853), vmstat_update
[ 2243.226304] pwq 56: cpus=28 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2243.226306] pending: vmstat_update
[ 2243.226405] workqueue writeback: flags=0x4a
[ 2243.226407] pwq 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0
active=1/256 refcnt=2
[ 2243.226411] in-flight: 26886:wb_workfn
[ 2243.226463] workqueue kblockd: flags=0x18
[ 2243.226482] pwq 91: cpus=45 node=1 flags=0x0 nice=-20 active=1/256
refcnt=2
[ 2243.226485] pending: blk_mq_timeout_work
[ 2243.229035] pool 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0 hung=0s
workers=7 idle: 34878 62315 826 55340 807 47239
[ 2243.229046] Showing backtraces of running workers in stalled CPU-bound
worker pools:
[ 2269.951231] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 2269.951257] rcu: 45-....: (14999 ticks this GP)
idle=346/1/0x4000000000000000 softirq=270824/270824 fqs=5063
[ 2269.951284] (t=15000 jiffies g=378937 q=2931783)
[ 2269.951288] NMI backtrace for cpu 45
[ 2269.951290] CPU: 45 PID: 26886 Comm: kworker/u258:1 Kdump: loaded Tainted: G
S OE X N 5.14.21-20250107.el7.x86_64 #1 SLE15-SP5 (unreleased)
d1123ed60f76c89394d27d1f68f473498cc063b4
[ 2269.951295] Hardware name: ASUSTeK COMPUTER INC. RS720-E11-RS24U/Z13PP-D32
Series, BIOS 2801 12/11/2024
[ 2269.951297] Workqueue: writeback wb_workfn (flush-259:17)
[ 2269.951305] Call Trace:
[ 2269.951309] <IRQ>
[ 2269.951314] dump_stack_lvl+0x58/0x7b
[ 2269.951320] nmi_cpu_backtrace+0xf2/0x110
[ 2269.951326] ? lapic_can_unplug_cpu+0xa0/0xa0
[ 2269.951331] nmi_trigger_cpumask_backtrace+0xf2/0x140
[ 2269.951334] rcu_dump_cpu_stacks+0xc8/0xfc
[ 2269.951338] rcu_sched_clock_irq+0x9b1/0xe50
[ 2269.951345] ? task_tick_fair+0x158/0x410
[ 2269.951349] ? sched_clock_cpu+0x9/0xb0
[ 2269.951353] ? trigger_load_balance+0x62/0x370
[ 2269.951356] ? tick_sched_handle.isra.20+0x60/0x60
[ 2269.951359] update_process_times+0x8c/0xb0
[ 2269.951363] tick_sched_handle.isra.20+0x1d/0x60
[ 2269.951365] tick_sched_timer+0x67/0x80
[ 2269.951367] __hrtimer_run_queues+0x10b/0x2a0
[ 2269.951371] hrtimer_interrupt+0xe5/0x250
[ 2269.951373] __sysvec_apic_timer_interrupt+0x5a/0x130
[ 2269.951378] sysvec_apic_timer_interrupt+0x4b/0x90
[ 2269.951382] </IRQ>
[ 2269.951383] <TASK>
[ 2269.951384] asm_sysvec_apic_timer_interrupt+0x4d/0x60
[ 2269.951389] RIP: 0010:native_queued_spin_lock_slowpath+0x19d/0x1e0
[ 2269.951393] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 40
47 03 00 48 03 04 f5 20 fc 03 a5 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08
85 c0 74 f7 48 8b 32 48 85 f6 74 94 0f 0d 0e eb 8f 8b 07
[ 2269.951395] RSP: 0000:ff730c07d357b8b0 EFLAGS: 00000246
[ 2269.951397] RAX: 0000000000000000 RBX: ff4364ec52698f08 RCX:
0000000000b80000
[ 2269.951399] RDX: ff4365197e374740 RSI: 000000000000005a RDI:
ff4364fbaa882450
[ 2269.951400] RBP: ff4364fd31d162d0 R08: 0000000000b80000 R09:
ffa50c073f167ac0
[ 2269.951401] R10: ff730c07d357b830 R11: 0000000000000000 R12:
ff4364fb339c8900
[ 2269.951402] R13: ff4364fbaa882000 R14: ff4364fbaa882450 R15:
0000000000a587a5
[ 2269.951405] _raw_spin_lock+0x25/0x30
[ 2269.951409] jbd2_log_do_checkpoint+0x149/0x300 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2269.951421] __jbd2_log_wait_for_space+0xf1/0x1e0 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2269.951427] add_transaction_credits+0x188/0x290 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2269.951432] start_this_handle+0x107/0x530 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2269.951438] ? kmem_cache_alloc+0x39c/0x4e0
[ 2269.951441] jbd2__journal_start+0xf4/0x1f0 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2269.951447] __ext4_journal_start_sb+0x105/0x120 [ext4
9b921105c859c08f218cdec280983ebbdfc1b3c6]
[ 2269.951480] ext4_writepages+0x496/0xd30 [ext4
9b921105c859c08f218cdec280983ebbdfc1b3c6]
[ 2269.951501] ? update_sd_lb_stats.constprop.149+0xfb/0x8e0
[ 2269.951505] do_writepages+0xd2/0x1b0
[ 2269.951509] ? fprop_reflect_period_percpu.isra.7+0x70/0xb0
[ 2269.951512] __writeback_single_inode+0x41/0x350
[ 2269.951517] writeback_sb_inodes+0x1d7/0x460
[ 2269.951520] __writeback_inodes_wb+0x5f/0xd0
[ 2269.951523] wb_writeback+0x235/0x2d0
[ 2269.951526] wb_workfn+0x205/0x4a0
[ 2269.951528] ? finish_task_switch+0x8a/0x2d0
[ 2269.951532] process_one_work+0x264/0x440
[ 2269.951536] worker_thread+0x2d/0x3c0
[ 2269.951538] ? process_one_work+0x440/0x440
[ 2269.951540] kthread+0x154/0x180
[ 2269.951543] ? set_kthread_struct+0x50/0x50
[ 2269.951544] ret_from_fork+0x1f/0x30
[ 2269.951549] </TASK>
[ 2304.669118] BUG: workqueue lockup - pool cpus=45 node=1 flags=0x0 nice=0
stuck for 99s!
[ 2304.669152] BUG: workqueue lockup - pool cpus=45 node=1 flags=0x0 nice=-20
stuck for 87s!
[ 2304.669188] Showing busy workqueues and worker pools:
[ 2304.669201] workqueue events: flags=0x0
[ 2304.669222] pwq 90: cpus=45 node=1 flags=0x0 nice=0 active=2/256 refcnt=3
[ 2304.669227] pending: mlx5e_rx_dim_work [mlx5_core],
drm_fb_helper_damage_work [drm_kms_helper]
[ 2304.669497] workqueue mm_percpu_wq: flags=0x8
[ 2304.669517] pwq 90: cpus=45 node=1 flags=0x0 nice=0 active=2/256 refcnt=4
[ 2304.669520] pending: drain_local_pages_wq BAR(853), vmstat_update
[ 2304.669627] workqueue writeback: flags=0x4a
[ 2304.669628] pwq 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0
active=1/256 refcnt=2
[ 2304.669633] in-flight: 26886:wb_workfn
[ 2304.669685] workqueue kblockd: flags=0x18
[ 2304.669706] pwq 91: cpus=45 node=1 flags=0x0 nice=-20 active=1/256
refcnt=2
[ 2304.669709] pending: blk_mq_timeout_work
[ 2304.672268] pool 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0 hung=0s
workers=7 idle: 62315 55340 47239 34878 807 826
[ 2304.672278] Showing backtraces of running workers in stalled CPU-bound
worker pools:
[ 2335.390761] BUG: workqueue lockup - pool cpus=45 node=1 flags=0x0 nice=0
stuck for 130s!
[ 2335.390793] BUG: workqueue lockup - pool cpus=45 node=1 flags=0x0 nice=-20
stuck for 118s!
[ 2335.390830] Showing busy workqueues and worker pools:
[ 2335.390832] workqueue events: flags=0x0
[ 2335.390835] pwq 242: cpus=121 node=1 flags=0x0 nice=0 active=1/256
refcnt=2
[ 2335.390839] pending: kfree_rcu_monitor
[ 2335.390863] pwq 92: cpus=46 node=1 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2335.390866] pending: kfree_rcu_monitor
[ 2335.390869] pwq 90: cpus=45 node=1 flags=0x0 nice=0 active=4/256 refcnt=5
[ 2335.390872] pending: mlx5e_rx_dim_work [mlx5_core],
drm_fb_helper_damage_work [drm_kms_helper], mlx5e_tx_dim_work [mlx5_core],
mlx5e_tx_dim_work [mlx5_core]
[ 2335.391266] workqueue mm_percpu_wq: flags=0x8
[ 2335.391287] pwq 90: cpus=45 node=1 flags=0x0 nice=0 active=2/256 refcnt=4
[ 2335.391289] pending: drain_local_pages_wq BAR(853), vmstat_update
[ 2335.391301] pwq 32: cpus=16 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2335.391304] pending: vmstat_update
[ 2335.391399] workqueue writeback: flags=0x4a
[ 2335.391401] pwq 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0
active=1/256 refcnt=2
[ 2335.391405] in-flight: 26886:wb_workfn
[ 2335.391457] workqueue kblockd: flags=0x18
[ 2335.391476] pwq 91: cpus=45 node=1 flags=0x0 nice=-20 active=1/256
refcnt=2
[ 2335.391478] pending: blk_mq_timeout_work
[ 2335.393980] pool 258: cpus=32-63,96-127 node=1 flags=0x4 nice=0 hung=0s
workers=7 idle: 47239 55340 62315 34878 807 826
[ 2335.393990] Showing backtraces of running workers in stalled CPU-bound
worker pools:
[ 2346.279770] watchdog: BUG: soft lockup - CPU#45 stuck for 134s!
[kworker/u258:1:26886]
[ 2346.279798] Modules linked in: yrfs(OEN) cpufreq_conservative(EN) ip_vs(EN)
uio_pci_generic(EN) uio(EN) vfio_pci(EN) vfio_pci_core(EN) nf_conntrack(EN)
vfio_virqfd(EN) vfio_iommu_type1(EN) vfio(EN) cuse(EN) fuse(EN) msr(EN)
nf_defrag_ipv6(EN) nbd(EN) nf_defrag_ipv4(EN) af_packet(EN) rdma_ucm(OEX)
intel_rapl_msr(EN) rdma_cm(OEX) iw_cm(OEX) configfs(EN) intel_rapl_common(EN)
intel_uncore_frequency(EN) intel_uncore_frequency_common(EN) ib_ipoib(OEX)
i10nm_edac(EN) nfit(EN) ib_cm(OEX) libnvdimm(EN) x86_pkg_temp_thermal(EN)
coretemp(EN) lockd(EN) cdc_ether(EN) sd_mod(EN) ib_umad(OEX) kvm_intel(EN)
usbnet(EN) grace(EN) xfs(EN) sg(EN) libcrc32c(EN) sunrpc(EN) mii(EN) kvm(EN)
iTCO_wdt(EN) intel_pmc_bxt(EN) iTCO_vendor_support(EN) irqbypass(EN)
mfd_core(EN) crc32_pclmul(EN) ghash_clmulni_intel(EN) mlx5_ib(OEX)
nls_iso8859_1(EN) pmt_crashlog(EN) nls_cp437(EN) pmt_telemetry(EN)
ib_uverbs(OEX) wmi_bmof(EN) vfat(EN) aesni_intel(EN) i2c_i801(EN) idxd(EN)
intel_sdsi(EN) pmt_class(EN)
[ 2346.279851] isst_if_mmio(EN) mei_me(EN) isst_if_mbox_pci(EN)
crypto_simd(EN) fat(EN) cryptd(EN) efi_pstore(EN) pcspkr(EN) ib_core(OEX)
intel_vsec(EN) idxd_bus(EN) isst_if_common(EN) mei(EN) i2c_smbus(EN)
i2c_ismt(EN) vmd(EN) wmi(EN) joydev(EN) acpi_ipmi(EN) ipmi_si(EN)
ipmi_devintf(EN) ipmi_msghandler(EN) cxl_acpi(EN) cxl_core(EN)
pinctrl_emmitsburg(EN) acpi_power_meter(EN) hid_generic(EN) button(EN)
usbhid(EN) knem(OEX) efivarfs(EN) ip_tables(EN) x_tables(EN) uas(EN)
usb_storage(EN) ext4(EN) crc16(EN) mbcache(EN) jbd2(EN) ast(EN)
drm_vram_helper(EN) drm_ttm_helper(EN) mlx5_core(OEX) ttm(EN) nvme(EN)
pci_hyperv_intf(EN) drm_kms_helper(EN) psample(EN) ahci(EN) syscopyarea(EN)
mlxdevm(OEX) nvme_core(EN) xhci_pci(EN) sysfillrect(EN) libahci(EN)
sysimgblt(EN) mlx_compat(OEX) xhci_pci_renesas(EN) nvme_common(EN)
fb_sys_fops(EN) igb(EN) mlxfw(OEX) t10_pi(EN) libata(EN) xhci_hcd(EN) dca(EN)
crc64_rocksoft(EN) drm(EN) crc32c_intel(EN) usbcore(EN) scsi_mod(EN) tls(EN)
[ 2346.279904] i2c_algo_bit(EN) crc64(EN) xpmem(OEX)
[ 2346.279907] Supported: No, Unreleased kernel
[ 2346.279909] CPU: 45 PID: 26886 Comm: kworker/u258:1 Kdump: loaded Tainted: G
S OE X N 5.14.21-20250107.el7.x86_64 #1 SLE15-SP5 (unreleased)
d1123ed60f76c89394d27d1f68f473498cc063b4
[ 2346.279914] Hardware name: ASUSTeK COMPUTER INC. RS720-E11-RS24U/Z13PP-D32
Series, BIOS 2801 12/11/2024
[ 2346.279916] Workqueue: writeback wb_workfn (flush-259:17)
[ 2346.279923] RIP: 0010:native_queued_spin_lock_slowpath+0x19d/0x1e0
[ 2346.279929] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 40
47 03 00 48 03 04 f5 20 fc 03 a5 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08
85 c0 74 f7 48 8b 32 48 85 f6 74 94 0f 0d 0e eb 8f 8b 07
[ 2346.279931] RSP: 0000:ff730c07d357b8b0 EFLAGS: 00000246
[ 2346.279934] RAX: 0000000000000000 RBX: ff4364ebccbed750 RCX:
0000000000b80000
[ 2346.279936] RDX: ff4365197e374740 RSI: 000000000000003f RDI:
ff4364fbaa882450
[ 2346.279937] RBP: ff4365013b9f3e00 R08: 0000000000b80000 R09:
0000000000000000
[ 2346.279938] R10: ff730c07d357b830 R11: 0000000000000000 R12:
ff4364fb339c8900
[ 2346.279939] R13: ff4364fbaa882000 R14: ff4364fbaa882450 R15:
0000000000a587a5
[ 2346.279940] FS: 0000000000000000(0000) GS:ff4365197e340000(0000)
knlGS:0000000000000000
[ 2346.279941] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2346.279942] CR2: 00007f70c5710000 CR3: 00000018bc610002 CR4:
0000000000771ee0
[ 2346.279944] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2346.279945] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ 2346.279946] PKRU: 55555554
[ 2346.279946] Call Trace:
[ 2346.279950] <TASK>
[ 2346.279952] _raw_spin_lock+0x25/0x30
[ 2346.279956] jbd2_log_do_checkpoint+0x149/0x300 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.279969] __jbd2_log_wait_for_space+0xf1/0x1e0 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.279975] add_transaction_credits+0x188/0x290 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.279981] start_this_handle+0x107/0x530 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.279986] ? kmem_cache_alloc+0x39c/0x4e0
[ 2346.279990] jbd2__journal_start+0xf4/0x1f0 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.279996] __ext4_journal_start_sb+0x105/0x120 [ext4
9b921105c859c08f218cdec280983ebbdfc1b3c6]
[ 2346.280027] ext4_writepages+0x496/0xd30 [ext4
9b921105c859c08f218cdec280983ebbdfc1b3c6]
[ 2346.280049] ? update_sd_lb_stats.constprop.149+0xfb/0x8e0
[ 2346.280054] do_writepages+0xd2/0x1b0
[ 2346.280058] ? fprop_reflect_period_percpu.isra.7+0x70/0xb0
[ 2346.280063] __writeback_single_inode+0x41/0x350
[ 2346.280067] writeback_sb_inodes+0x1d7/0x460
[ 2346.280071] __writeback_inodes_wb+0x5f/0xd0
[ 2346.280074] wb_writeback+0x235/0x2d0
[ 2346.280077] wb_workfn+0x205/0x4a0
[ 2346.280079] ? finish_task_switch+0x8a/0x2d0
[ 2346.280083] process_one_work+0x264/0x440
[ 2346.280087] worker_thread+0x2d/0x3c0
[ 2346.280089] ? process_one_work+0x440/0x440
[ 2346.280091] kthread+0x154/0x180
[ 2346.280094] ? set_kthread_struct+0x50/0x50
[ 2346.280096] ret_from_fork+0x1f/0x30
[ 2346.280100] </TASK>
[ 2346.280102] Kernel panic - not syncing: softlockup: hung tasks
[ 2346.280117] CPU: 45 PID: 26886 Comm: kworker/u258:1 Kdump: loaded Tainted: G
S OEL X N 5.14.21-20250107.el7.x86_64 #1 SLE15-SP5 (unreleased)
d1123ed60f76c89394d27d1f68f473498cc063b4
[ 2346.280147] Hardware name: ASUSTeK COMPUTER INC. RS720-E11-RS24U/Z13PP-D32
Series, BIOS 2801 12/11/2024
[ 2346.280163] Workqueue: writeback wb_workfn (flush-259:17)
[ 2346.280175] Call Trace:
[ 2346.280184] <IRQ>
[ 2346.280191] dump_stack_lvl+0x58/0x7b
[ 2346.280201] panic+0x118/0x2f0
[ 2346.280213] watchdog_timer_fn+0x1f1/0x210
[ 2346.280225] ? softlockup_fn+0x30/0x30
[ 2346.280234] __hrtimer_run_queues+0x10b/0x2a0
[ 2346.280246] hrtimer_interrupt+0xe5/0x250
[ 2346.280257] __sysvec_apic_timer_interrupt+0x5a/0x130
[ 2346.280271] sysvec_apic_timer_interrupt+0x4b/0x90
[ 2346.280283] </IRQ>
[ 2346.280289] <TASK>
[ 2346.280296] asm_sysvec_apic_timer_interrupt+0x4d/0x60
[ 2346.280309] RIP: 0010:native_queued_spin_lock_slowpath+0x19d/0x1e0
[ 2346.280322] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 40
47 03 00 48 03 04 f5 20 fc 03 a5 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08
85 c0 74 f7 48 8b 32 48 85 f6 74 94 0f 0d 0e eb 8f 8b 07
[ 2346.280350] RSP: 0000:ff730c07d357b8b0 EFLAGS: 00000246
[ 2346.280361] RAX: 0000000000000000 RBX: ff4364ebccbed750 RCX:
0000000000b80000
[ 2346.280374] RDX: ff4365197e374740 RSI: 000000000000003f RDI:
ff4364fbaa882450
[ 2346.280386] RBP: ff4365013b9f3e00 R08: 0000000000b80000 R09:
0000000000000000
[ 2346.280399] R10: ff730c07d357b830 R11: 0000000000000000 R12:
ff4364fb339c8900
[ 2346.280411] R13: ff4364fbaa882000 R14: ff4364fbaa882450 R15:
0000000000a587a5
[ 2346.280425] _raw_spin_lock+0x25/0x30
[ 2346.280434] jbd2_log_do_checkpoint+0x149/0x300 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.280455] __jbd2_log_wait_for_space+0xf1/0x1e0 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.280474] add_transaction_credits+0x188/0x290 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.280495] start_this_handle+0x107/0x530 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.280513] ? kmem_cache_alloc+0x39c/0x4e0
[ 2346.280524] jbd2__journal_start+0xf4/0x1f0 [jbd2
fe085bf250f00c1909bc8f60167717c81ef52839]
[ 2346.280543] __ext4_journal_start_sb+0x105/0x120 [ext4
9b921105c859c08f218cdec280983ebbdfc1b3c6]
[ 2346.280986] ext4_writepages+0x496/0xd30 [ext4
9b921105c859c08f218cdec280983ebbdfc1b3c6]
[ 2346.281361] ? update_sd_lb_stats.constprop.149+0xfb/0x8e0
[ 2346.281706] do_writepages+0xd2/0x1b0
[ 2346.282035] ? fprop_reflect_period_percpu.isra.7+0x70/0xb0
[ 2346.282370] __writeback_single_inode+0x41/0x350
[ 2346.282687] writeback_sb_inodes+0x1d7/0x460
[ 2346.283007] __writeback_inodes_wb+0x5f/0xd0
[ 2346.283322] wb_writeback+0x235/0x2d0
[ 2346.283612] wb_workfn+0x205/0x4a0
[ 2346.283908] ? finish_task_switch+0x8a/0x2d0
[ 2346.284195] process_one_work+0x264/0x440
[ 2346.284487] worker_thread+0x2d/0x3c0
[ 2346.284762] ? process_one_work+0x440/0x440
[ 2346.285045] kthread+0x154/0x180
[ 2346.285317] ? set_kthread_struct+0x50/0x50
[ 2346.285576] ret_from_fork+0x1f/0x30
[ 2346.285830] </TASK>
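The backtraces show the writeback worker spinning inside jbd2_log_do_checkpoint, reached from __jbd2_log_wait_for_space, i.e. new transaction handles are waiting for journal space and the checkpoint lists are heavily contended. While a node is still responsive, jbd2's per-journal statistics and tracepoints can help confirm that the journal is the bottleneck; the device name below is illustrative and trace-cmd is assumed to be installed:

  # transaction/commit statistics for the affected ext4 journal
  cat /proc/fs/jbd2/nvme1n1p1-8/info
  # record jbd2 activity for 30 seconds and look at checkpoint/commit events
  trace-cmd record -e jbd2 sleep 30
  trace-cmd report | grep -E 'jbd2_(checkpoint|run_stats|start_commit)'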
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are watching the assignee of the bug.
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
@ 2025-09-04 8:18 ` bugzilla-daemon
2025-09-05 8:13 ` bugzilla-daemon
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-04 8:18 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
--- Comment #1 from waxihus@gmail.com ---
ext4 format command:
/opt/tools/e2fsprogs/mkfs.ext4 -i 2048 -I 1024 -J size=4096 -Odir_index,casefold,large_dir,filetype -E encoding_flags=strict %(device)s -F
mount (fstab) parameters:
UUID="11adf923-6482-46c1-b418-7df27f1755a1" /data/mds1 ext4 defaults,noatime,nodiratime,user_xattr,nofail,x-systemd.device-timeout=5 0 0
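For reference, the superblock settings produced by the mkfs options above (features, inode count/size, journal parameters) can be checked on the formatted device with dumpe2fs; the device path below is illustrative:

  dumpe2fs -h /dev/nvme1n1p1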
CPU: 2U Intel 6530
memory: DDR5 256GiB
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
2025-09-04 8:18 ` [Bug 220535] " bugzilla-daemon
@ 2025-09-05 8:13 ` bugzilla-daemon
2025-09-05 12:52 ` bugzilla-daemon
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-05 8:13 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
--- Comment #2 from Artem S. Tashkinov (aros@gmx.com) ---
Unless this is reproducible under a vanilla supported kernel, no one will do
anything about it.
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
2025-09-04 8:18 ` [Bug 220535] " bugzilla-daemon
2025-09-05 8:13 ` bugzilla-daemon
@ 2025-09-05 12:52 ` bugzilla-daemon
2025-09-05 13:34 ` bugzilla-daemon
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-05 12:52 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
--- Comment #3 from waxihus@gmail.com ---
We can reproduce this issue on linux-5.15.y and linux-6.4.0, and it may also be
present in the latest mainline kernel.
What specific information should be collected after the issue has been
reproduced?
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
` (2 preceding siblings ...)
2025-09-05 12:52 ` bugzilla-daemon
@ 2025-09-05 13:34 ` bugzilla-daemon
2025-09-08 13:11 ` bugzilla-daemon
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-05 13:34 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
Christian Kujau (kernel@nerdbynature.de) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kernel@nerdbynature.de
--- Comment #4 from Christian Kujau (kernel@nerdbynature.de) ---
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html may be
of help here. Please try to reproduce with the latest kernel version and post
the resulting backtrace here.
An untainted kernel would be helpful too, I guess; see
https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
From the backtrace above:
> Tainted: G S OE X N 5.14.21-20250107.el7.x86_64
- S: the kernel is running on a processor or system that is out of specification.
- O: an externally-built (“out-of-tree”) module has been loaded.
- E: an unsigned module has been loaded in a kernel supporting module signature.
- X: auxiliary taint, defined for and used by Linux distributors.
- N: an in-kernel test, such as a KUnit test, has been run.
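The taint value of a running kernel can be read from procfs and decoded with the helper script that ships in the kernel source tree:

  cat /proc/sys/kernel/tainted
  # decode the individual taint bits (script from the kernel source tree)
  sh tools/debugging/kernel-chktaint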
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
` (3 preceding siblings ...)
2025-09-05 13:34 ` bugzilla-daemon
@ 2025-09-08 13:11 ` bugzilla-daemon
2025-09-19 2:16 ` bugzilla-daemon
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-08 13:11 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
--- Comment #5 from waxihus@gmail.com ---
Thank you for your suggestion.
Since the storage cluster requires InfiniBand RDMA network cards, the OFED
driver needs to be installed; otherwise, the cluster cannot generate sufficient
load to reproduce the issue.
This problem needs to be reproduced on high-performance servers. We have
already requested such servers and will attempt to reproduce the issue once the
machines are available (expected next week).
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
` (4 preceding siblings ...)
2025-09-08 13:11 ` bugzilla-daemon
@ 2025-09-19 2:16 ` bugzilla-daemon
2025-09-19 2:34 ` bugzilla-daemon
2025-10-30 12:48 ` bugzilla-daemon
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-19 2:16 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
--- Comment #6 from waxihus@gmail.com ---
Created attachment 308696
--> https://bugzilla.kernel.org/attachment.cgi?id=308696&action=edit
vmcore dmesg reproduced with the latest version and untainted kernel
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
` (5 preceding siblings ...)
2025-09-19 2:16 ` bugzilla-daemon
@ 2025-09-19 2:34 ` bugzilla-daemon
2025-10-30 12:48 ` bugzilla-daemon
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-09-19 2:34 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
--- Comment #7 from waxihus@gmail.com ---
I have reproduced the issue with the latest kernel, untainted; see the
attachment for the full dmesg log.
The kernel source was cloned at commit 46a51f4f5edade43ba66b3c151f0e25ec8b69cb6.
[ 533.816688] INFO: task kworker/u778:1:1854 blocked for more than 481
seconds.
[ 533.816713] Not tainted 6.17.0-rc6-master-default #2
[ 533.816723] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 533.816734] task:kworker/u778:1 state:D stack:0 pid:1854 tgid:1854
ppid:2 task_flags:0x4248060 flags:0x00004000
[ 533.816751] Workqueue: writeback wb_workfn (flush-259:1)
[ 533.816766] Call Trace:
[ 533.816773] <TASK>
[ 533.816782] __schedule+0x462/0x1400
[ 533.816793] ? sysvec_apic_timer_interrupt+0xf/0x90
[ 533.816804] ? srso_alias_return_thunk+0x5/0xfbef5
[ 533.816817] schedule+0x27/0xd0
[ 533.816825] schedule_preempt_disabled+0x15/0x30
[ 533.816834] __mutex_lock.constprop.0+0x357/0x940
[ 533.816846] mutex_lock_io+0x41/0x50
[ 533.816857] __jbd2_log_wait_for_space+0xda/0x1f0 [jbd2
371d593b5f5403746c7713ab4dc9d5e5c1953199]
[ 533.816877] add_transaction_credits+0x2f2/0x300 [jbd2
371d593b5f5403746c7713ab4dc9d5e5c1953199]
[ 533.816895] start_this_handle+0xfe/0x520 [jbd2
371d593b5f5403746c7713ab4dc9d5e5c1953199]
[ 533.816910] ? srso_alias_return_thunk+0x5/0xfbef5
[ 533.816921] jbd2__journal_start+0xfe/0x200 [jbd2
371d593b5f5403746c7713ab4dc9d5e5c1953199]
[ 533.816936] ext4_do_writepages+0x46a/0xee0 [ext4
893473fac91f34d580e31648f305d1177dd81b63]
[ 533.816968] ? __dequeue_entity+0x3c0/0x480
[ 533.816977] ? update_load_avg+0x80/0x760
[ 533.816985] ? srso_alias_return_thunk+0x5/0xfbef5
[ 533.816996] ? ext4_writepages+0xbe/0x190 [ext4
893473fac91f34d580e31648f305d1177dd81b63]
[ 533.817019] ext4_writepages+0xbe/0x190 [ext4
893473fac91f34d580e31648f305d1177dd81b63]
[ 533.817044] do_writepages+0xc7/0x160
[ 533.817055] __writeback_single_inode+0x41/0x340
[ 533.817066] writeback_sb_inodes+0x215/0x4c0
[ 533.817084] __writeback_inodes_wb+0x4c/0xe0
[ 533.817094] wb_writeback+0x192/0x300
[ 533.817105] ? get_nr_inodes+0x3b/0x60
[ 533.817116] wb_workfn+0x38a/0x460
[ 533.817126] process_one_work+0x1a1/0x3e0
[ 533.817137] worker_thread+0x292/0x420
[ 533.817147] ? __pfx_worker_thread+0x10/0x10
[ 533.817156] kthread+0xfc/0x240
[ 533.817165] ? __pfx_kthread+0x10/0x10
[ 533.817174] ? __pfx_kthread+0x10/0x10
[ 533.817182] ret_from_fork+0x1c1/0x1f0
[ 533.817192] ? __pfx_kthread+0x10/0x10
[ 533.817200] ret_from_fork_asm+0x1a/0x30
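When this hung-task report appears, backtraces of all blocked (D-state) tasks can be dumped in one go via sysrq, which gives a fuller picture than the single task shown above (assumes sysrq is available):

  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger    # dump all blocked tasks to the kernel log
  dmesg | tail -n 200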
A soft lockup also occurs, though the probability of hitting it is very low.
[ 329.157094] watchdog: BUG: soft lockup - CPU#21 stuck for 67s!
[kworker/u513:2:795]
[ 329.157169] Workqueue: writeback wb_workfn (flush-259:8)
[ 329.157176] RIP: 0010:queued_read_lock_slowpath+0x52/0x130
[ 329.157194] Call Trace:
[ 329.157196] <TASK>
[ 329.157200] start_this_handle+0x99/0x520 [jbd2
0a56678a235e076a07e3222376de4dc1cbec6f17]
[ 329.157216] ? finish_task_switch.isra.0+0x97/0x2c0
[ 329.157220] jbd2__journal_start+0xfe/0x200 [jbd2
0a56678a235e076a07e3222376de4dc1cbec6f17]
[ 329.157226] ext4_do_writepages+0x46a/0xee0 [ext4
bcac05fee1dc1aaf21870e1e652c064619591c71]
[ 329.157273] ? find_get_block_common+0x1a8/0x3f0
[ 329.157277] ? ext4_writepages+0xbe/0x190 [ext4
bcac05fee1dc1aaf21870e1e652c064619591c71]
[ 329.157303] ext4_writepages+0xbe/0x190 [ext4
bcac05fee1dc1aaf21870e1e652c064619591c71]
[ 329.157328] do_writepages+0xc7/0x160
[ 329.157331] __writeback_single_inode+0x41/0x340
[ 329.157334] writeback_sb_inodes+0x215/0x4c0
[ 329.157339] __writeback_inodes_wb+0x4c/0xe0
[ 329.157341] wb_writeback+0x192/0x300
[ 329.157344] ? get_nr_inodes+0x3b/0x60
[ 329.157347] wb_workfn+0x291/0x460
[ 329.157350] process_one_work+0x1a1/0x3e0
[ 329.157353] worker_thread+0x292/0x420
[ 329.157356] ? __pfx_worker_thread+0x10/0x10
[ 329.157358] kthread+0xfc/0x240
[ 329.157360] ? __pfx_kthread+0x10/0x10
[ 329.157361] ? __pfx_kthread+0x10/0x10
[ 329.157362] ret_from_fork+0x1c1/0x1f0
[ 329.157365] ? __pfx_kthread+0x10/0x10
[ 329.157366] ret_from_fork_asm+0x1a/0x30
Reproduction steps:
Format 10 NVMe drives with XFS and run 3 concurrent 100 GB file reads on each
drive.
Format 1 NVMe drive with ext4 and run 256 concurrent workers that create files
and directories and add/delete xattrs; the issue also reproduces with 192
concurrent workers, though with lower probability. A rough sketch of this ext4
metadata workload is shown below.
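A minimal sketch of the ext4 metadata workload, assuming the filesystem is mounted at /mnt/ext4test; the path, loop counts and xattr name are illustrative and not the exact tool that was used:

  #!/bin/bash
  # spawn 256 workers, each creating files, toggling an xattr and deleting the file
  for i in $(seq 1 256); do
    (
      d=/mnt/ext4test/worker$i
      mkdir -p "$d"
      for j in $(seq 1 10000); do
        f="$d/file$j"
        : > "$f"                            # create an empty file
        setfattr -n user.test -v data "$f"  # add an xattr
        setfattr -x user.test "$f"          # delete the xattr
        rm -f "$f"
      done
    ) &
  done
  wait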
cpuinfo:
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 1
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9A14 96-Core Processor
Stepping: 1
CPU MHz: 3699.375
CPU max MHz: 3703.3760
CPU min MHz: 1500.0000
BogoMIPS: 5200.37
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid
extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext
perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3
hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase
bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap
avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec
xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk
avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv
svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip
pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg
avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
debug_swap
* [Bug 220535] ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s
2025-09-04 7:58 [Bug 220535] New: ext4 __jbd2_log_wait_for_space soft lockup and CPU stuck for 134s bugzilla-daemon
` (6 preceding siblings ...)
2025-09-19 2:34 ` bugzilla-daemon
@ 2025-10-30 12:48 ` bugzilla-daemon
7 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-10-30 12:48 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=220535
--- Comment #8 from waxihus@gmail.com ---
I wanted to kindly follow up on this issue. Is anyone looking into it, or has
any progress been made? Please let me know if there is any additional
information I can provide to help with the investigation.