* [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
@ 2022-03-18 13:46 Brian Foster
  2022-03-18 16:11 ` Brian Foster
  2022-03-18 21:48 ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread
From: Brian Foster @ 2022-03-18 13:46 UTC (permalink / raw)
To: linux-xfs

Hi,

I'm not sure if this is known and/or fixed already, but it didn't look
familiar so here is a report. I hit a splat when testing Willy's
prospective folio bookmark change and it turns out it replicates on
Linus' current master (551acdc3c3d2). This initially reproduced on
xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
xfs/006, but when I attempted to reproduce the latter a second time I
hit what looks like the same problem as xfs/264. Both tests seem to
involve some form of error injection, so possibly the same underlying
problem. The GPF splat from xfs/264 is below.

Brian

--- 8< ---

general protection fault, probably for non-canonical address 0x102e31d0105f07d: 0000 [#1] PREEMPT SMP NOPTI
CPU: 24 PID: 1647 Comm: kworker/24:1H Tainted: G S 5.17.0-rc8+ #14
Hardware name: Dell Inc.
PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] RIP: 0010:native_queued_spin_lock_slowpath+0x1a4/0x1e0 Code: f3 90 48 8b 0a 48 85 c9 74 f6 eb c5 c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 60 b9 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 0a RSP: 0018:ff407350cf917b48 EFLAGS: 00010206 RAX: 0102e31d0105f07d RBX: ff1a52eeb16dd1c0 RCX: 0000000000002c5a RDX: ff1a53123fb30d40 RSI: ffffffffb95826f6 RDI: ffffffffb9554147 RBP: ff1a53123fb30d40 R08: ff1a52d3c8684028 R09: 0000000000000121 R10: 00000000000000bf R11: 0000000000000b65 R12: 0000000000640000 R13: 0000000000000008 R14: ff1a52d3d0899000 R15: ff1a52d5bdf07800 FS: 0000000000000000(0000) GS:ff1a53123fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000556b2ed15f44 CR3: 000000019a4c0005 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> _raw_spin_lock+0x2c/0x30 xfs_trans_ail_delete+0x27/0xd0 [xfs] xfs_buf_item_done+0x22/0x30 [xfs] xfs_buf_ioend+0x71/0x5e0 [xfs] xfs_trans_committed_bulk+0x167/0x2c0 [xfs] ? enqueue_entity+0x121/0x4d0 ? enqueue_task_fair+0x417/0x530 ? resched_curr+0x23/0xc0 ? check_preempt_curr+0x3f/0x70 ? _raw_spin_unlock_irqrestore+0x1f/0x31 ? __wake_up_common_lock+0x87/0xc0 xlog_cil_committed+0x29c/0x2d0 [xfs] ? _raw_spin_unlock_irqrestore+0x1f/0x31 ? __wake_up_common_lock+0x87/0xc0 xlog_cil_process_committed+0x69/0x80 [xfs] xlog_state_shutdown_callbacks+0xce/0xf0 [xfs] xlog_force_shutdown+0xd0/0x110 [xfs] xfs_do_force_shutdown+0x5f/0x150 [xfs] xlog_ioend_work+0x71/0x80 [xfs] process_one_work+0x1c5/0x390 ? process_one_work+0x390/0x390 worker_thread+0x30/0x350 ? process_one_work+0x390/0x390 kthread+0xe6/0x110 ? 
kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi intel_rapl_msr scsi_transport_iscsi intel_rapl_common ib_umad i10nm_edac x86_pkg_temp_thermal intel_powerclamp ch dm_log dm_mod ---[ end trace 0000000000000000 ]--- RIP: 0010:native_queued_spin_lock_slowpath+0x1a4/0x1e0 Code: f3 90 48 8b 0a 48 85 c9 74 f6 eb c5 c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 60 b9 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 0a RSP: 0018:ff407350cf917b48 EFLAGS: 00010206 RAX: 0102e31d0105f07d RBX: ff1a52eeb16dd1c0 RCX: 0000000000002c5a RDX: ff1a53123fb30d40 RSI: ffffffffb95826f6 RDI: ffffffffb9554147 RBP: ff1a53123fb30d40 R08: ff1a52d3c8684028 R09: 0000000000000121 R10: 00000000000000bf R11: 0000000000000b65 R12: 0000000000640000 R13: 0000000000000008 R14: ff1a52d3d0899000 R15: ff1a52d5bdf07800 FS: 0000000000000000(0000) GS:ff1a53123fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000556b2ed15f44 CR3: 000000019a4c0005 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x37200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 13:46 [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 Brian Foster
@ 2022-03-18 16:11 ` Brian Foster
  2022-03-18 21:42   ` Dave Chinner
  2022-03-18 21:48 ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Brian Foster @ 2022-03-18 16:11 UTC (permalink / raw)
To: linux-xfs

On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> Hi,
> 
> I'm not sure if this is known and/or fixed already, but it didn't look
> familiar so here is a report. I hit a splat when testing Willy's
> prospective folio bookmark change and it turns out it replicates on
> Linus' current master (551acdc3c3d2). This initially reproduced on
> xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> xfs/006, but when I attempted to reproduce the latter a second time I
> hit what looks like the same problem as xfs/264. Both tests seem to
> involve some form of error injection, so possibly the same underlying
> problem. The GPF splat from xfs/264 is below.
> 

Darrick pointed out this [1] series on IRC (particularly the final
patch) so I gave that a try. I _think_ that addresses the GPF issue
given it was nearly 100% reproducible before and I didn't see it in a
few iterations, but once I started a test loop for a longer test I ran
into the aforementioned soft lockup again. A snippet of that one is
below [2]. When this occurs, the task appears to be stuck (i.e. the
warning repeats) indefinitely.

Brian

[1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
[2] Soft lockup warning from xfs/264 with patches from [1] applied:

watchdog: BUG: soft lockup - CPU#52 stuck for 134s!
[kworker/52:1H:1881] Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S L 5.17.0-rc8+ #17 Hardware name: Dell Inc. 
PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0 Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719 RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147 RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000 R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400 FS: 0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> _raw_spin_lock+0x2c/0x30 xfs_trans_ail_delete+0x2a/0xd0 [xfs] xfs_buf_item_done+0x22/0x30 [xfs] xfs_buf_ioend+0x71/0x5e0 [xfs] xfs_trans_committed_bulk+0x167/0x2c0 [xfs] ? enqueue_entity+0x121/0x4d0 ? enqueue_task_fair+0x417/0x530 ? resched_curr+0x23/0xc0 ? check_preempt_curr+0x3f/0x70 ? _raw_spin_unlock_irqrestore+0x1f/0x31 ? __wake_up_common_lock+0x87/0xc0 xlog_cil_committed+0x29c/0x2d0 [xfs] ? _raw_spin_unlock_irqrestore+0x1f/0x31 ? __wake_up_common_lock+0x87/0xc0 xlog_cil_process_committed+0x69/0x80 [xfs] xlog_state_shutdown_callbacks+0xce/0xf0 [xfs] xlog_force_shutdown+0xd0/0x110 [xfs] xfs_do_force_shutdown+0x5f/0x150 [xfs] xlog_ioend_work+0x71/0x80 [xfs] process_one_work+0x1c5/0x390 ? process_one_work+0x390/0x390 worker_thread+0x30/0x350 ? process_one_work+0x390/0x390 kthread+0xe6/0x110 ? 
kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8
  2022-03-18 16:11 ` Brian Foster
@ 2022-03-18 21:42   ` Dave Chinner
  2022-03-21 18:35     ` Brian Foster
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-03-18 21:42 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-xfs

On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote:
> On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote:
> > Hi,
> > 
> > I'm not sure if this is known and/or fixed already, but it didn't look
> > familiar so here is a report. I hit a splat when testing Willy's
> > prospective folio bookmark change and it turns out it replicates on
> > Linus' current master (551acdc3c3d2). This initially reproduced on
> > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via
> > xfs/006, but when I attempted to reproduce the latter a second time I
> > hit what looks like the same problem as xfs/264. Both tests seem to
> > involve some form of error injection, so possibly the same underlying
> > problem. The GPF splat from xfs/264 is below.
> > 
> 
> 
> Darrick pointed out this [1] series on IRC (particularly the final
> patch) so I gave that a try. I _think_ that addresses the GPF issue
> given it was nearly 100% reproducible before and I didn't see it in a
> few iterations, but once I started a test loop for a longer test I ran
> into the aforementioned soft lockup again. A snippet of that one is
> below [2]. When this occurs, the task appears to be stuck (i.e. the
> warning repeats) indefinitely.
> 
> Brian
> 
> [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/
> [2] Soft lockup warning from xfs/264 with patches from [1] applied:
> 
> watchdog: BUG: soft lockup - CPU#52 stuck for 134s!
[kworker/52:1H:1881] > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash > dm_log dm_mod > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S L 5.17.0-rc8+ #17 > Hardware name: Dell Inc. 
PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0 > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246 > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719 > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147 > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000 > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400 > FS: 0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > PKRU: 55555554 > Call Trace: > <TASK> > _raw_spin_lock+0x2c/0x30 > xfs_trans_ail_delete+0x2a/0xd0 [xfs] So what is running around in a tight circle holding the AIL lock? Or what assert failed before this while holding the AIL lock? > xfs_buf_item_done+0x22/0x30 [xfs] > xfs_buf_ioend+0x71/0x5e0 [xfs] > xfs_trans_committed_bulk+0x167/0x2c0 [xfs] > ? enqueue_entity+0x121/0x4d0 > ? enqueue_task_fair+0x417/0x530 > ? resched_curr+0x23/0xc0 > ? check_preempt_curr+0x3f/0x70 > ? _raw_spin_unlock_irqrestore+0x1f/0x31 > ? __wake_up_common_lock+0x87/0xc0 > xlog_cil_committed+0x29c/0x2d0 [xfs] > ? _raw_spin_unlock_irqrestore+0x1f/0x31 > ? 
__wake_up_common_lock+0x87/0xc0
> xlog_cil_process_committed+0x69/0x80 [xfs]
> xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
> xlog_force_shutdown+0xd0/0x110 [xfs]

The stack trace here looks mangled - it's missing functions between
xfs_trans_committed_bulk() and xfs_buf_ioend():

xfs_trans_committed_bulk(abort = true)
xfs_trans_committed_bulk
lip->li_iop->iop_unpin(remove = true)
xfs_buf_item_unpin()
xfs_buf_ioend_fail()
xfs_buf_ioend()
xfs_buf_item_done()

Unless, of course, the xfs_buf_ioend symbol is wrongly detected
because it's the last function call in xfs_buf_item_unpin(). That
would give a stack of:

xfs_trans_committed_bulk(abort = true)
xfs_trans_committed_bulk
lip->li_iop->iop_unpin(remove = true)
xfs_buf_item_unpin()
xfs_buf_item_done()

Which is the stale inode buffer release path, which has a problem as I
mention here:

https://lore.kernel.org/linux-xfs/20220317053907.164160-8-david@fromorbit.com/

@@ -720,6 +721,17 @@ xfs_iflush_ail_updates(
 		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
 			continue;
+		/*
+		 * dgc: Not sure how this happens, but it happens very
+		 * occassionaly via generic/388. xfs_iflush_abort() also
+		 * silently handles this same "under writeback but not in AIL at
+		 * shutdown" condition via xfs_trans_ail_delete().
+		 */
+		if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
+			ASSERT(xlog_is_shutdown(lip->li_log));
+			continue;
+		}
+

The symptoms that this worked around were double AIL unlocks, AIL list
corruptions, and bp->b_li_list corruptions leading to
xfs_buf_inode_iodone() getting stuck in an endless loop whilst holding
the AIL lock, leading to soft lockups exactly like this one.

I now know what is causing this problem - it is xfs_iflush_abort()
being called from xfs_reclaim_inode() that removes the inode from the
buffer list without holding the buffer lock.

Hence a traversal in xfs_iflush_cluster() or xfs_buf_inode_iodone()
can store the next inode in the list in n, n then gets aborted by
reclaim and removed from the list, and then the list traversal moves
onto n - which is now an empty, self-pointing list because it was
removed. Hence the list traversal gets stuck forever on n because
n->next = n.....

If this sort of thing happens in xfs_iflush_ail_updates(), we can
either do a double AIL removal, which fires an assert with the AIL
lock held, or we get stuck spinning on n with the AIL lock held.
Either way, both lead to soft lockups on the AIL lock like this one.

'echo l > sysrq-trigger' is your friend in these situations - you'll
see if there's a process spinning with the lock held on some other
CPU...

This situation is a regression introduced by the async inode reclaim
patch series:

https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/

It is a locking screwup with xfs_iflush_abort() being called without
holding the inode cluster buffer lock - a thinko w.r.t. list removal
and traversal using the inode item lock. The bug has been there since
June 2020, and it's only now that we have peeled back the shutdown
onion a couple of layers further that it is manifesting.

I have a prototype patch (below) to fix this - the locking is not
pretty, but the AIL corruptions and soft lockups have gone away in my
testing, only to be replaced with a whole new set of g/388 failures *I
have never seen before*.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: locking on b_io_list is broken for inodes

From: Dave Chinner <dchinner@redhat.com>

Most buffer io list operations are run with the bp->b_lock held, but
xfs_iflush_abort() can be called without the buffer lock being held,
resulting in inodes being removed from the buffer list while other
list operations are occurring.
This causes problems with corrupted bp->b_io_list inode lists during filesystem shutdown, leading to traversals that never end, double removals from the AIL, etc. Fix this by passing the buffer to xfs_iflush_abort() if we have it locked. If the inode is attached to the buffer, we're going to have to remove it from the buffer list and we'd have to get the buffer off the inode log item to do that anyway. If we don't have a buffer passed in (e.g. from xfs_reclaim_inode()) then we can determine if the inode has a log item and if it is attached to a buffer before we do anything else. If it does have an attached buffer, we can lock it safely (because the inode has a reference to it) and then perform the inode abort. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_icache.c | 2 +- fs/xfs/xfs_inode.c | 4 +- fs/xfs/xfs_inode_item.c | 123 ++++++++++++++++++++++++++++++++---------------- fs/xfs/xfs_inode_item.h | 2 +- 4 files changed, 87 insertions(+), 44 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 4148cdf7ce4a..ec907be2d5b1 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -883,7 +883,7 @@ xfs_reclaim_inode( */ if (xlog_is_shutdown(ip->i_mount->m_log)) { xfs_iunpin_wait(ip); - xfs_iflush_abort(ip); + xfs_iflush_abort(ip, NULL); goto reclaim; } if (xfs_ipincount(ip)) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index aab55a06ece7..de8815211a7a 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3612,7 +3612,7 @@ xfs_iflush_cluster( /* * We must use the safe variant here as on shutdown xfs_iflush_abort() - * can remove itself from the list. + * will remove itself from the list. 
*/ list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) { iip = (struct xfs_inode_log_item *)lip; @@ -3662,7 +3662,7 @@ xfs_iflush_cluster( */ if (xlog_is_shutdown(mp->m_log)) { xfs_iunpin_wait(ip); - xfs_iflush_abort(ip); + xfs_iflush_abort(ip, bp); xfs_iunlock(ip, XFS_ILOCK_SHARED); error = -EIO; continue; diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 11158fa81a09..89fa1fd9ed5b 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -721,17 +721,6 @@ xfs_iflush_ail_updates( if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn) continue; - /* - * dgc: Not sure how this happens, but it happens very - * occassionaly via generic/388. xfs_iflush_abort() also - * silently handles this same "under writeback but not in AIL at - * shutdown" condition via xfs_trans_ail_delete(). - */ - if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) { - ASSERT(xlog_is_shutdown(lip->li_log)); - continue; - } - lsn = xfs_ail_delete_one(ailp, lip); if (!tail_lsn && lsn) tail_lsn = lsn; @@ -799,7 +788,7 @@ xfs_buf_inode_iodone( struct xfs_inode_log_item *iip = INODE_ITEM(lip); if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) { - xfs_iflush_abort(iip->ili_inode); + xfs_iflush_abort(iip->ili_inode, bp); continue; } if (!iip->ili_last_fields) @@ -834,44 +823,98 @@ xfs_buf_inode_io_fail( } /* - * This is the inode flushing abort routine. It is called when - * the filesystem is shutting down to clean up the inode state. It is - * responsible for removing the inode item from the AIL if it has not been - * re-logged and clearing the inode's flush state. + * Abort flushing the inode. + * + * There are two cases where this is called. The first is when the inode cluster + * buffer has been removed and the inodes attached to it have been marked + * XFS_ISTALE. Inode cluster buffer IO completion will be called on the buffer to mark the stale + * inodes clean and remove them from the AIL without doing IO on them. 
The inode + * should always have a log item attached if it is ISTALE, and we should always + * be passed the locked buffer the inodes are attached to. + * + * The second case is log shutdown. When the log has been shut down, we need + * to abort any flush that is in progress, mark the inode clean and remove it + * from the AIL. We may get passed clean inodes without log items, as well as + * clean inodes with log items that aren't attached to cluster buffers. And + * depending on whether we are called from, we may or may not have a locked + * buffer passed to us. + * + * If we don't have a locked buffer, we try to get it from the inode log item. + * If there is a buffer attached to the ili, then we have a reference to the + * buffer and we can safely lock it, then remove the inode from the buffer. */ void xfs_iflush_abort( - struct xfs_inode *ip) + struct xfs_inode *ip, + struct xfs_buf *locked_bp) { struct xfs_inode_log_item *iip = ip->i_itemp; - struct xfs_buf *bp = NULL; + struct xfs_buf *ibp; - if (iip) { - /* - * Clear the failed bit before removing the item from the AIL so - * xfs_trans_ail_delete() doesn't try to clear and release the - * buffer attached to the log item before we are done with it. - */ - clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags); - xfs_trans_ail_delete(&iip->ili_item, 0); + if (!iip) { + /* clean inode, nothing to do */ + xfs_iflags_clear(ip, XFS_IFLUSHING); + return; + } - /* - * Clear the inode logging fields so no more flushes are - * attempted. - */ - spin_lock(&iip->ili_lock); - iip->ili_last_fields = 0; - iip->ili_fields = 0; - iip->ili_fsync_fields = 0; - iip->ili_flush_lsn = 0; - bp = iip->ili_item.li_buf; - iip->ili_item.li_buf = NULL; - list_del_init(&iip->ili_item.li_bio_list); + /* + * Capture the associated buffer and lock it if the caller didn't + * pass us the locked buffer to begin with. 
+ */ + spin_lock(&iip->ili_lock); + ibp = iip->ili_item.li_buf; + if (!locked_bp && ibp) { + xfs_buf_hold(ibp); spin_unlock(&iip->ili_lock); + xfs_buf_lock(ibp); + spin_lock(&iip->ili_lock); + if (!iip->ili_item.li_buf) { + /* + * Raced with another removal, hold the only reference + * to ibp now. + */ + ASSERT(list_empty(&iip->ili_item.li_bio_list)); + } else { + /* + * Got two references to ibp, drop one now. The other + * ges dropped when we are done. + */ + ASSERT(iip->ili_item.li_buf == ibp); + xfs_buf_rele(ibp); + } + } else { + ASSERT(!ibp || ibp == locked_bp); } + + /* + * Clear the inode logging fields so no more flushes are attempted. + * If we are on a buffer list, it is now safe to remove it because + * the buffer is guaranteed to be locked. + */ + iip->ili_last_fields = 0; + iip->ili_fields = 0; + iip->ili_fsync_fields = 0; + iip->ili_flush_lsn = 0; + iip->ili_item.li_buf = NULL; + list_del_init(&iip->ili_item.li_bio_list); + spin_unlock(&iip->ili_lock); + + /* + * Clear the failed bit before removing the item from the AIL so + * xfs_trans_ail_delete() doesn't try to clear and release the buffer + * attached to the log item before we are done with it. + */ + clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags); + xfs_trans_ail_delete(&iip->ili_item, 0); + xfs_iflags_clear(ip, XFS_IFLUSHING); - if (bp) - xfs_buf_rele(bp); + + /* we can now release the buffer reference the inode log item held. 
*/ + if (ibp) { + if (!locked_bp) + xfs_buf_unlock(ibp); + xfs_buf_rele(ibp); + } } /* diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index 1a302000d604..01e5845c7f3d 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -43,7 +43,7 @@ static inline int xfs_inode_clean(struct xfs_inode *ip) extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *); extern void xfs_inode_item_destroy(struct xfs_inode *); -extern void xfs_iflush_abort(struct xfs_inode *); +extern void xfs_iflush_abort(struct xfs_inode *, struct xfs_buf *); extern int xfs_inode_item_format_convert(xfs_log_iovec_t *, struct xfs_inode_log_format *); ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 2022-03-18 21:42 ` Dave Chinner @ 2022-03-21 18:35 ` Brian Foster 2022-03-21 22:14 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Brian Foster @ 2022-03-21 18:35 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote: > On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote: > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote: > > > Hi, > > > > > > I'm not sure if this is known and/or fixed already, but it didn't look > > > familiar so here is a report. I hit a splat when testing Willy's > > > prospective folio bookmark change and it turns out it replicates on > > > Linus' current master (551acdc3c3d2). This initially reproduced on > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via > > > xfs/006, but when I attempted to reproduce the latter a second time I > > > hit what looks like the same problem as xfs/264. Both tests seem to > > > involve some form of error injection, so possibly the same underlying > > > problem. The GPF splat from xfs/264 is below. > > > > > > > Darrick pointed out this [1] series on IRC (particularly the final > > patch) so I gave that a try. I _think_ that addresses the GPF issue > > given it was nearly 100% reproducible before and I didn't see it in a > > few iterations, but once I started a test loop for a longer test I ran > > into the aforementioned soft lockup again. A snippet of that one is > > below [2]. When this occurs, the task appears to be stuck (i.e. the > > warning repeats) indefinitely. > > > > Brian > > > > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/ > > [2] Soft lockup warning from xfs/264 with patches from [1] applied: > > > > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! 
[kworker/52:1H:1881] > > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash > > dm_log dm_mod > > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S L 5.17.0-rc8+ #17 > > Hardware name: Dell Inc. 
PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 > > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] > > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0 > > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 > > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246 > > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719 > > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147 > > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc > > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000 > > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400 > > FS: 0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > PKRU: 55555554 > > Call Trace: > > <TASK> > > _raw_spin_lock+0x2c/0x30 > > xfs_trans_ail_delete+0x2a/0xd0 [xfs] > > So what is running around in a tight circle holding the AIL lock? > > Or what assert failed before this while holding the AIL lock? > I don't have much information beyond the test and resulting bug. There are no assert failures before the bug occurs. An active CPU task dump shows the stack from the soft lockup warning, the task running the dump itself, and all other (94/96) CPUs appear idle. I tried the appended patch on top of latest for-next (which now includes the other log shutdown fix) and the problem still occurs. Brian > > xfs_buf_item_done+0x22/0x30 [xfs] > > xfs_buf_ioend+0x71/0x5e0 [xfs] > > xfs_trans_committed_bulk+0x167/0x2c0 [xfs] > > ? enqueue_entity+0x121/0x4d0 > > ? enqueue_task_fair+0x417/0x530 > > ? resched_curr+0x23/0xc0 > > ? 
check_preempt_curr+0x3f/0x70 > > ? _raw_spin_unlock_irqrestore+0x1f/0x31 > > ? __wake_up_common_lock+0x87/0xc0 > > xlog_cil_committed+0x29c/0x2d0 [xfs] > > ? _raw_spin_unlock_irqrestore+0x1f/0x31 > > ? __wake_up_common_lock+0x87/0xc0 > > xlog_cil_process_committed+0x69/0x80 [xfs] > > xlog_state_shutdown_callbacks+0xce/0xf0 [xfs] > > xlog_force_shutdown+0xd0/0x110 [xfs] > > The stack trace here looks mangled - it's missing functions between > xfs_trans_committed_bulk() and xfs_buf_ioend() > > xfs_trans_committed_bulk(abort = true) > xfs_trans_committed_bulk > lip->li_iop->iop_unpin(remove = true) > xfs_buf_item_unpin() > xfs_buf_ioend_fail() > xfs_buf_ioend() > xfs_buf_item_done() > > Unless, of course, the xfs_buf_ioend symbol is wrongly detected > because it's the last function call in xfs_buf_item_unpin(). That > would give a stack of > > xfs_trans_committed_bulk(abort = true) > xfs_trans_committed_bulk > lip->li_iop->iop_unpin(remove = true) > xfs_buf_item_unpin() > xfs_buf_item_done() > > Which is the stale inode buffer release path. Which has a problem > as I mention here: > > https://lore.kernel.org/linux-xfs/20220317053907.164160-8-david@fromorbit.com/ > > @@ -720,6 +721,17 @@ xfs_iflush_ail_updates( > if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn) > continue; > > + /* > + * dgc: Not sure how this happens, but it happens very > + * occassionaly via generic/388. xfs_iflush_abort() also > + * silently handles this same "under writeback but not in AIL at > + * shutdown" condition via xfs_trans_ail_delete(). > + */ > + if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) { > + ASSERT(xlog_is_shutdown(lip->li_log)); > + continue; > + } > + > > THe symptoms that this worked around were double AIL unlocks, AIL > list corruptions, and bp->b_li_list corruptions leading to > xfs_buf_inode_iodone() getting stuck in an endless loop whilst > holding the AIL lock, leading to soft lookups exactly like this one. 
> > I now know what is causing this problem - it is xfs_iflush_abort() > being called from xfs_reclaim_inode() that removes the inode from > the buffer list without holding the buffer lock. > > Hence a traversal in xfs_iflush_cluster or xfs_buf_inode_iodone > can store the next inode in the list in n, n then gets aborted by > reclaim and removed from the list, and then the list traversal moves > onto n, and it's now an empty list because it was removed. Hence > the list traversal gets stuck forever on n because n->next = n..... > > If this sort of thing happens in xfs_iflush_ail_updates(), we can > either do a double AIL removal which fires an assert with the AIL > lock held, or we get stuck spinning on n with > the AIL lock held. Either way, they both lead to softlockups on the > AIL lock like this one. > > 'echo l > sysrq-trigger' is your friend in these situations - you'll > see if there's a process spinning with the lock held on some other > CPU... > > This situation is a regression introduced in the async inode reclaim > patch series: > > https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/ > > And is a locking screwup with xfs_iflush_abort() being called > without holding the inode cluster buffer lock. It was a thinko > w.r.t. list removal and traversal using the inode item lock. The bug > has been there since June 2020, and it's only now that we have > peeled back the shutdown onion a couple of layers further that it is > manifesting. > > I have a prototype patch (below) to fix this - the locking is not > pretty, but the AIL corruptions and soft lockups have gone away in > my testing only to be replaced with a whole new set of g/388 > failures *I have never seen before*. > > Cheers, > > Dave. 
> -- > Dave Chinner > david@fromorbit.com > > > xfs: locking on b_io_list is broken for inodes > > From: Dave Chinner <dchinner@redhat.com> > > Most buffer io list operations are run with the bp->b_lock held, but > xfs_iflush_abort() can be called without the buffer lock being held > resulting in inodes being removed from the buffer list while other > list operations are occurring. This causes problems with corrupted > bp->b_io_list inode lists during filesystem shutdown, leading to > traversals that never end, double removals from the AIL, etc. > > Fix this by passing the buffer to xfs_iflush_abort() if we have > it locked. If the inode is attached to the buffer, we're going to > have to remove it from the buffer list and we'd have to get the > buffer off the inode log item to do that anyway. > > If we don't have a buffer passed in (e.g. from xfs_reclaim_inode()) > then we can determine if the inode has a log item and if it is > attached to a buffer before we do anything else. If it does have an > attached buffer, we can lock it safely (because the inode has a > reference to it) and then perform the inode abort. 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_icache.c | 2 +- > fs/xfs/xfs_inode.c | 4 +- > fs/xfs/xfs_inode_item.c | 123 ++++++++++++++++++++++++++++++++---------------- > fs/xfs/xfs_inode_item.h | 2 +- > 4 files changed, 87 insertions(+), 44 deletions(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 4148cdf7ce4a..ec907be2d5b1 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -883,7 +883,7 @@ xfs_reclaim_inode( > */ > if (xlog_is_shutdown(ip->i_mount->m_log)) { > xfs_iunpin_wait(ip); > - xfs_iflush_abort(ip); > + xfs_iflush_abort(ip, NULL); > goto reclaim; > } > if (xfs_ipincount(ip)) > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index aab55a06ece7..de8815211a7a 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -3612,7 +3612,7 @@ xfs_iflush_cluster( > > /* > * We must use the safe variant here as on shutdown xfs_iflush_abort() > - * can remove itself from the list. > + * will remove itself from the list. > */ > list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) { > iip = (struct xfs_inode_log_item *)lip; > @@ -3662,7 +3662,7 @@ xfs_iflush_cluster( > */ > if (xlog_is_shutdown(mp->m_log)) { > xfs_iunpin_wait(ip); > - xfs_iflush_abort(ip); > + xfs_iflush_abort(ip, bp); > xfs_iunlock(ip, XFS_ILOCK_SHARED); > error = -EIO; > continue; > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 11158fa81a09..89fa1fd9ed5b 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -721,17 +721,6 @@ xfs_iflush_ail_updates( > if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn) > continue; > > - /* > - * dgc: Not sure how this happens, but it happens very > - * occassionaly via generic/388. xfs_iflush_abort() also > - * silently handles this same "under writeback but not in AIL at > - * shutdown" condition via xfs_trans_ail_delete(). 
> - */ > - if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) { > - ASSERT(xlog_is_shutdown(lip->li_log)); > - continue; > - } > - > lsn = xfs_ail_delete_one(ailp, lip); > if (!tail_lsn && lsn) > tail_lsn = lsn; > @@ -799,7 +788,7 @@ xfs_buf_inode_iodone( > struct xfs_inode_log_item *iip = INODE_ITEM(lip); > > if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) { > - xfs_iflush_abort(iip->ili_inode); > + xfs_iflush_abort(iip->ili_inode, bp); > continue; > } > if (!iip->ili_last_fields) > @@ -834,44 +823,98 @@ xfs_buf_inode_io_fail( > } > > /* > - * This is the inode flushing abort routine. It is called when > - * the filesystem is shutting down to clean up the inode state. It is > - * responsible for removing the inode item from the AIL if it has not been > - * re-logged and clearing the inode's flush state. > + * Abort flushing the inode. > + * > + * There are two cases where this is called. The first is when the inode cluster > + * buffer has been removed and the inodes attached to it have been marked > + * XFS_ISTALE. Inode cluster buffer IO completion will be called on the buffer to mark the stale > + * inodes clean and remove them from the AIL without doing IO on them. The inode > + * should always have a log item attached if it is ISTALE, and we should always > + * be passed the locked buffer the inodes are attached to. > + * > + * The second case is log shutdown. When the log has been shut down, we need > + * to abort any flush that is in progress, mark the inode clean and remove it > + * from the AIL. We may get passed clean inodes without log items, as well as > + * clean inodes with log items that aren't attached to cluster buffers. And > + * depending on where we are called from, we may or may not have a locked > + * buffer passed to us. > + * > + * If we don't have a locked buffer, we try to get it from the inode log item. 
> + * If there is a buffer attached to the ili, then we have a reference to the > + * buffer and we can safely lock it, then remove the inode from the buffer. > + */ > void > xfs_iflush_abort( > - struct xfs_inode *ip) > + struct xfs_inode *ip, > + struct xfs_buf *locked_bp) > { > struct xfs_inode_log_item *iip = ip->i_itemp; > - struct xfs_buf *bp = NULL; > + struct xfs_buf *ibp; > > - if (iip) { > - /* > - * Clear the failed bit before removing the item from the AIL so > - * xfs_trans_ail_delete() doesn't try to clear and release the > - * buffer attached to the log item before we are done with it. > - */ > - clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags); > - xfs_trans_ail_delete(&iip->ili_item, 0); > + if (!iip) { > + /* clean inode, nothing to do */ > + xfs_iflags_clear(ip, XFS_IFLUSHING); > + return; > + } > > - /* > - * Clear the inode logging fields so no more flushes are > - * attempted. > - */ > - spin_lock(&iip->ili_lock); > - iip->ili_last_fields = 0; > - iip->ili_fields = 0; > - iip->ili_fsync_fields = 0; > - iip->ili_flush_lsn = 0; > - bp = iip->ili_item.li_buf; > - iip->ili_item.li_buf = NULL; > - list_del_init(&iip->ili_item.li_bio_list); > + /* > + * Capture the associated buffer and lock it if the caller didn't > + * pass us the locked buffer to begin with. > + */ > + spin_lock(&iip->ili_lock); > + ibp = iip->ili_item.li_buf; > + if (!locked_bp && ibp) { > + xfs_buf_hold(ibp); > spin_unlock(&iip->ili_lock); > + xfs_buf_lock(ibp); > + spin_lock(&iip->ili_lock); > + if (!iip->ili_item.li_buf) { > + /* > + * Raced with another removal, hold the only reference > + * to ibp now. > + */ > + ASSERT(list_empty(&iip->ili_item.li_bio_list)); > + } else { > + /* > + * Got two references to ibp, drop one now. The other > + * gets dropped when we are done. 
> + */ > + ASSERT(iip->ili_item.li_buf == ibp); > + xfs_buf_rele(ibp); > + } > + } else { > + ASSERT(!ibp || ibp == locked_bp); > } > + > + /* > + * Clear the inode logging fields so no more flushes are attempted. > + * If we are on a buffer list, it is now safe to remove it because > + * the buffer is guaranteed to be locked. > + */ > + iip->ili_last_fields = 0; > + iip->ili_fields = 0; > + iip->ili_fsync_fields = 0; > + iip->ili_flush_lsn = 0; > + iip->ili_item.li_buf = NULL; > + list_del_init(&iip->ili_item.li_bio_list); > + spin_unlock(&iip->ili_lock); > + > + /* > + * Clear the failed bit before removing the item from the AIL so > + * xfs_trans_ail_delete() doesn't try to clear and release the buffer > + * attached to the log item before we are done with it. > + */ > + clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags); > + xfs_trans_ail_delete(&iip->ili_item, 0); > + > xfs_iflags_clear(ip, XFS_IFLUSHING); > - if (bp) > - xfs_buf_rele(bp); > + > + /* we can now release the buffer reference the inode log item held. */ > + if (ibp) { > + if (!locked_bp) > + xfs_buf_unlock(ibp); > + xfs_buf_rele(ibp); > + } > } > > /* > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h > index 1a302000d604..01e5845c7f3d 100644 > --- a/fs/xfs/xfs_inode_item.h > +++ b/fs/xfs/xfs_inode_item.h > @@ -43,7 +43,7 @@ static inline int xfs_inode_clean(struct xfs_inode *ip) > > extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *); > extern void xfs_inode_item_destroy(struct xfs_inode *); > -extern void xfs_iflush_abort(struct xfs_inode *); > +extern void xfs_iflush_abort(struct xfs_inode *, struct xfs_buf *); > extern int xfs_inode_item_format_convert(xfs_log_iovec_t *, > struct xfs_inode_log_format *); > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 2022-03-21 18:35 ` Brian Foster @ 2022-03-21 22:14 ` Dave Chinner 2022-03-22 14:33 ` Brian Foster 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2022-03-21 22:14 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Mon, Mar 21, 2022 at 02:35:21PM -0400, Brian Foster wrote: > On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote: > > On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote: > > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote: > > > > Hi, > > > > > > > > I'm not sure if this is known and/or fixed already, but it didn't look > > > > familiar so here is a report. I hit a splat when testing Willy's > > > > prospective folio bookmark change and it turns out it replicates on > > > > Linus' current master (551acdc3c3d2). This initially reproduced on > > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via > > > > xfs/006, but when I attempted to reproduce the latter a second time I > > > > hit what looks like the same problem as xfs/264. Both tests seem to > > > > involve some form of error injection, so possibly the same underlying > > > > problem. The GPF splat from xfs/264 is below. > > > > > > > > > > Darrick pointed out this [1] series on IRC (particularly the final > > > patch) so I gave that a try. I _think_ that addresses the GPF issue > > > given it was nearly 100% reproducible before and I didn't see it in a > > > few iterations, but once I started a test loop for a longer test I ran > > > into the aforementioned soft lockup again. A snippet of that one is > > > below [2]. When this occurs, the task appears to be stuck (i.e. the > > > warning repeats) indefinitely. 
> > > > > > Brian > > > > > > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/ > > > [2] Soft lockup warning from xfs/264 with patches from [1] applied: > > > > > > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881] > > > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash > > > dm_log dm_mod > > > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S L 5.17.0-rc8+ #17 > > > Hardware name: Dell Inc. 
PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 > > > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] > > > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0 > > > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 > > > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246 > > > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719 > > > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147 > > > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc > > > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000 > > > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400 > > > FS: 0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > PKRU: 55555554 > > > Call Trace: > > > <TASK> > > > _raw_spin_lock+0x2c/0x30 > > > xfs_trans_ail_delete+0x2a/0xd0 [xfs] > > > > So what is running around in a tight circle holding the AIL lock? > > > > Or what assert failed before this while holding the AIL lock? > > > > I don't have much information beyond the test and resulting bug. There > are no assert failures before the bug occurs. An active CPU task dump > shows the stack from the soft lockup warning, the task running the dump > itself, and all other (94/96) CPUs appear idle. I tried the appended > patch on top of latest for-next (which now includes the other log > shutdown fix) and the problem still occurs. 
Yeah, I got another assert fail in xfs_ail_check() last night from: xfs_ail_check+0xa8/0x180 xfs_ail_delete_one+0x3b/0xf0 xfs_buf_inode_iodone+0x329/0x3f0 xfs_buf_ioend+0x1f8/0x530 xfs_buf_ioend_work+0x15/0x20 Finding an item that didn't have IN_AIL set on it. I think I've found another mount vs log shutdown case that can result in dirty aborted inodes that aren't in the AIL being flushed and bad things happen when we then try to remove them from the AIL and they aren't there... Whether that is this problem or not, I don't know, but the assert failures do end up with other threads spinning on the AIL lock because of the assert failures under the AIL lock... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 2022-03-21 22:14 ` Dave Chinner @ 2022-03-22 14:33 ` Brian Foster 2022-03-22 21:41 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Brian Foster @ 2022-03-22 14:33 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Mar 22, 2022 at 09:14:33AM +1100, Dave Chinner wrote: > On Mon, Mar 21, 2022 at 02:35:21PM -0400, Brian Foster wrote: > > On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote: > > > On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote: > > > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote: > > > > > Hi, > > > > > > > > > > I'm not sure if this is known and/or fixed already, but it didn't look > > > > > familiar so here is a report. I hit a splat when testing Willy's > > > > > prospective folio bookmark change and it turns out it replicates on > > > > > Linus' current master (551acdc3c3d2). This initially reproduced on > > > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via > > > > > xfs/006, but when I attempted to reproduce the latter a second time I > > > > > hit what looks like the same problem as xfs/264. Both tests seem to > > > > > involve some form of error injection, so possibly the same underlying > > > > > problem. The GPF splat from xfs/264 is below. > > > > > > > > > > > > > Darrick pointed out this [1] series on IRC (particularly the final > > > > patch) so I gave that a try. I _think_ that addresses the GPF issue > > > > given it was nearly 100% reproducible before and I didn't see it in a > > > > few iterations, but once I started a test loop for a longer test I ran > > > > into the aforementioned soft lockup again. A snippet of that one is > > > > below [2]. When this occurs, the task appears to be stuck (i.e. the > > > > warning repeats) indefinitely. 
> > > > > > > > Brian > > > > > > > > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/ > > > > [2] Soft lockup warning from xfs/264 with patches from [1] applied: > > > > > > > > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881] > > > > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash > > > > dm_log dm_mod > > > > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S L 5.17.0-rc8+ #17 > > > > Hardware name: Dell Inc. 
PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 > > > > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] > > > > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0 > > > > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 > > > > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246 > > > > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719 > > > > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147 > > > > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc > > > > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000 > > > > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400 > > > > FS: 0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > > PKRU: 55555554 > > > > Call Trace: > > > > <TASK> > > > > _raw_spin_lock+0x2c/0x30 > > > > xfs_trans_ail_delete+0x2a/0xd0 [xfs] > > > > > > So what is running around in a tight circle holding the AIL lock? > > > > > > Or what assert failed before this while holding the AIL lock? > > > > > > > I don't have much information beyond the test and resulting bug. There > > are no assert failures before the bug occurs. An active CPU task dump > > shows the stack from the soft lockup warning, the task running the dump > > itself, and all other (94/96) CPUs appear idle. I tried the appended > > patch on top of latest for-next (which now includes the other log > > shutdown fix) and the problem still occurs. 
> > Yeah, I got another assert fail in xfs_ail_check() last night from: > > xfs_ail_check+0xa8/0x180 > xfs_ail_delete_one+0x3b/0xf0 > xfs_buf_inode_iodone+0x329/0x3f0 > xfs_buf_ioend+0x1f8/0x530 > xfs_buf_ioend_work+0x15/0x20 > > Finding an item that didn't have IN_AIL set on it. I think I've > found another mount vs log shutdown case that can result in dirty > aborted inodes that aren't in the AIL being flushed and bad things > happen when we then try to remove them from the AIL and they aren't > there... > > Whether that is this problem or not, I don't know, but the assert > failures do end up with other threads spinning on the AIL lock > because of the assert failures under the AIL lock... > Some updates.. I tried to reproduce with lock debugging and whatnot enabled but the problem was no longer reproducible, probably due to disruption of timing. When I went back to a reproducing kernel, I ended up seeing a page fault variant crash via xfs/006 instead of the soft lockup. This occurred shortly after the unmount attempt started, so I retried again with KASAN enabled but ran into the same heisenbug behavior. From there, I replaced the generic debug mechanisms with a custom mount flag to set in xfs_trans_ail_destroy() immediately after freeing the xfs_ail object and assert check on entry of xfs_trans_ail_delete(). The next xfs/006 failure produced the splat below [2]. I suspect that is the smoking gun [1] and perhaps we've somehow broken serialization between in-core object teardown and outstanding log I/O completion after the filesystem happens to shutdown. Brian [1] I've not reproduced the soft lockup variant with this hack in place, but if the problem is a UAF then I'd expect the resulting behavior to be somewhat erratic/unpredictable. 
[2] xfs/006 custom assert and BUG splat: XFS: Assertion failed: !mp->m_freed_ail, file: fs/xfs/xfs_trans_ail.c, line: 879 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1289 at fs/xfs/xfs_message.c:97 asswarn+0x1a/0x1d [xfs] Modules linked in: rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support inth dm_log dm_mod CPU: 2 PID: 1289 Comm: kworker/2:1H Tainted: G S 5.17.0-rc6+ #29 Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] RIP: 0010:asswarn+0x1a/0x1d [xfs] Code: e8 e8 13 8c c4 48 83 c4 60 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0 d8 c0 e8 19 fd ff ff <0f> 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0 RSP: 0018:ff5757be8f37fb70 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ff4847238f4b2940 RCX: 0000000000000000 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc0d7e076 RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000000a R11: f000000000000000 R12: 0000000200002600 R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940 FS: 0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f55c21f0f70 CR3: 0000000142c9c002 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> xfs_trans_ail_delete+0x102/0x130 [xfs] xfs_buf_item_done+0x22/0x30 [xfs] xfs_buf_ioend+0x73/0x4d0 [xfs] xfs_trans_committed_bulk+0x17e/0x2f0 [xfs] ? enqueue_task_fair+0x91/0x680 ? remove_entity_load_avg+0x2e/0x70 ? __wake_up_common_lock+0x87/0xc0 xlog_cil_committed+0x2a9/0x300 [xfs] ? 
__wake_up_common_lock+0x87/0xc0 xlog_cil_process_committed+0x69/0x80 [xfs] xlog_state_shutdown_callbacks+0xce/0xf0 [xfs] xlog_force_shutdown+0xdf/0x150 [xfs] xfs_do_force_shutdown+0x5f/0x150 [xfs] xlog_ioend_work+0x71/0x80 [xfs] process_one_work+0x1c5/0x390 ? process_one_work+0x390/0x390 worker_thread+0x30/0x350 ? process_one_work+0x390/0x390 kthread+0xd7/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> ---[ end trace 0000000000000000 ]--- BUG: unable to handle page fault for address: 0000000000030840 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 281cd3067 P4D 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 2 PID: 1289 Comm: kworker/2:1H Tainted: G S W 5.17.0-rc6+ #29 Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] RIP: 0010:native_queued_spin_lock_slowpath+0x173/0x1b0 Code: f3 90 48 8b 32 48 85 f6 74 f6 eb d5 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 08 03 00 48 03 04 f5 e0 5a ec 85 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 RSP: 0018:ff5757be8f37fb68 EFLAGS: 00010202 RAX: 0000000000030840 RBX: ff4847238f4b2940 RCX: 00000000000c0000 RDX: ff484761ffc70800 RSI: 0000000000000759 RDI: ff4847341d6a0fc0 RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000000a R11: f000000000000000 R12: ff4847341d6a0fc0 R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940 FS: 0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000030840 CR3: 0000000142c9c002 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> _raw_spin_lock+0x21/0x30 xfs_trans_ail_delete+0x38/0x130 [xfs] xfs_buf_item_done+0x22/0x30 [xfs] xfs_buf_ioend+0x73/0x4d0 [xfs] 
xfs_trans_committed_bulk+0x17e/0x2f0 [xfs] ? enqueue_task_fair+0x91/0x680 ? remove_entity_load_avg+0x2e/0x70 ? __wake_up_common_lock+0x87/0xc0 xlog_cil_committed+0x2a9/0x300 [xfs] ? __wake_up_common_lock+0x87/0xc0 xlog_cil_process_committed+0x69/0x80 [xfs] xlog_state_shutdown_callbacks+0xce/0xf0 [xfs] xlog_force_shutdown+0xdf/0x150 [xfs] xfs_do_force_shutdown+0x5f/0x150 [xfs] xlog_ioend_work+0x71/0x80 [xfs] process_one_work+0x1c5/0x390 ? process_one_work+0x390/0x390 worker_thread+0x30/0x350 ? process_one_work+0x390/0x390 kthread+0xd7/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support inth dm_log dm_mod CR2: 0000000000030840 ---[ end trace 0000000000000000 ]--- RIP: 0010:native_queued_spin_lock_slowpath+0x173/0x1b0 Code: f3 90 48 8b 32 48 85 f6 74 f6 eb d5 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 08 03 00 48 03 04 f5 e0 5a ec 85 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 RSP: 0018:ff5757be8f37fb68 EFLAGS: 00010202 RAX: 0000000000030840 RBX: ff4847238f4b2940 RCX: 00000000000c0000 RDX: ff484761ffc70800 RSI: 0000000000000759 RDI: ff4847341d6a0fc0 RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000000a R11: f000000000000000 R12: ff4847341d6a0fc0 R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940 FS: 0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000030840 CR3: 0000000142c9c002 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x3c00000 from 0xffffffff81000000 (relocation range: 
0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]---
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 2022-03-22 14:33 ` Brian Foster @ 2022-03-22 21:41 ` Dave Chinner 0 siblings, 0 replies; 10+ messages in thread From: Dave Chinner @ 2022-03-22 21:41 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Tue, Mar 22, 2022 at 10:33:00AM -0400, Brian Foster wrote: > On Tue, Mar 22, 2022 at 09:14:33AM +1100, Dave Chinner wrote: > > On Mon, Mar 21, 2022 at 02:35:21PM -0400, Brian Foster wrote: > > > On Sat, Mar 19, 2022 at 08:42:53AM +1100, Dave Chinner wrote: > > > > On Fri, Mar 18, 2022 at 12:11:07PM -0400, Brian Foster wrote: > > > > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote: > > > > > > Hi, > > > > > > > > > > > > I'm not sure if this is known and/or fixed already, but it didn't look > > > > > > familiar so here is a report. I hit a splat when testing Willy's > > > > > > prospective folio bookmark change and it turns out it replicates on > > > > > > Linus' current master (551acdc3c3d2). This initially reproduced on > > > > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via > > > > > > xfs/006, but when I attempted to reproduce the latter a second time I > > > > > > hit what looks like the same problem as xfs/264. Both tests seem to > > > > > > involve some form of error injection, so possibly the same underlying > > > > > > problem. The GPF splat from xfs/264 is below. > > > > > > > > > > > > > > > > Darrick pointed out this [1] series on IRC (particularly the final > > > > > patch) so I gave that a try. I _think_ that addresses the GPF issue > > > > > given it was nearly 100% reproducible before and I didn't see it in a > > > > > few iterations, but once I started a test loop for a longer test I ran > > > > > into the aforementioned soft lockup again. A snippet of that one is > > > > > below [2]. When this occurs, the task appears to be stuck (i.e. the > > > > > warning repeats) indefinitely. 
> > > > > > > > > > Brian > > > > > > > > > > [1] https://lore.kernel.org/linux-xfs/20220317053907.164160-1-david@fromorbit.com/ > > > > > [2] Soft lockup warning from xfs/264 with patches from [1] applied: > > > > > > > > > > watchdog: BUG: soft lockup - CPU#52 stuck for 134s! [kworker/52:1H:1881] > > > > > Modules linked in: rfkill rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm ib_srpt ib_isert iscsi_target_mod i10nm_edac target_core_mod x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp libiscsi scsi_transport_iscsi kvm_intel rdma_cm ib_umad ipmi_ssif ib_ipoib iw_cm ib_cm kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul acpi_ipmi mlx5_ib ghash_clmulni_intel bnxt_re ipmi_si rapl intel_cstate ib_uverbs ipmi_devintf mei_me isst_if_mmio isst_if_mbox_pci i2c_i801 nd_pmem ib_core intel_uncore wmi_bmof pcspkr isst_if_common mei i2c_smbus intel_pch_thermal ipmi_msghandler nd_btt dax_pmem acpi_power_meter xfs libcrc32c sd_mod sg mlx5_core lpfc mgag200 i2c_algo_bit drm_shmem_helper nvmet_fc drm_kms_helper nvmet nvme_fc mlxfw nvme_fabrics syscopyarea sysfillrect pci_hyperv_intf sysimgblt fb_sys_fops nvme_core ahci tls t10_pi libahci crc32c_intel psample scsi_transport_fc bnxt_en drm megaraid_sas tg3 libata wmi nfit libnvdimm dm_mirror dm_region_hash > > > > > dm_log dm_mod > > > > > CPU: 52 PID: 1881 Comm: kworker/52:1H Tainted: G S L 5.17.0-rc8+ #17 > > > > > Hardware name: Dell Inc. 
PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 > > > > > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] > > > > > RIP: 0010:native_queued_spin_lock_slowpath+0x1b0/0x1e0 > > > > > Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 40 0d 03 00 48 03 04 cd e0 ba 00 8c 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 > > > > > RSP: 0018:ff4ed0b360e4bb48 EFLAGS: 00000246 > > > > > RAX: 0000000000000000 RBX: ff3413f05c684540 RCX: 0000000000001719 > > > > > RDX: ff34142ebfeb0d40 RSI: ffffffff8bf826f6 RDI: ffffffff8bf54147 > > > > > RBP: ff34142ebfeb0d40 R08: ff34142ebfeb0a68 R09: 00000000000001bc > > > > > R10: 00000000000001d1 R11: 0000000000000abd R12: 0000000000d40000 > > > > > R13: 0000000000000008 R14: ff3413f04cd84000 R15: ff3413f059404400 > > > > > FS: 0000000000000000(0000) GS:ff34142ebfe80000(0000) knlGS:0000000000000000 > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > CR2: 00007f9200514f70 CR3: 0000000216c16005 CR4: 0000000000771ee0 > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > > > PKRU: 55555554 > > > > > Call Trace: > > > > > <TASK> > > > > > _raw_spin_lock+0x2c/0x30 > > > > > xfs_trans_ail_delete+0x2a/0xd0 [xfs] > > > > > > > > So what is running around in a tight circle holding the AIL lock? > > > > > > > > Or what assert failed before this while holding the AIL lock? > > > > > > > > > > I don't have much information beyond the test and resulting bug. There > > > are no assert failures before the bug occurs. An active CPU task dump > > > shows the stack from the soft lockup warning, the task running the dump > > > itself, and all other (94/96) CPUs appear idle. I tried the appended > > > patch on top of latest for-next (which now includes the other log > > > shutdown fix) and the problem still occurs. 
> > Yeah, I got another assert fail in xfs_ail_check() last night from:
> >
> > xfs_ail_check+0xa8/0x180
> > xfs_ail_delete_one+0x3b/0xf0
> > xfs_buf_inode_iodone+0x329/0x3f0
> > xfs_buf_ioend+0x1f8/0x530
> > xfs_buf_ioend_work+0x15/0x20
> >
> > Finding an item that didn't have IN_AIL set on it. I think I've found another mount vs log shutdown case that can result in dirty aborted inodes that aren't in the AIL being flushed, and bad things happen when we then try to remove them from the AIL and they aren't there...
> >
> > Whether that is this problem or not, I don't know, but the assert failures do end up with other threads spinning on the AIL lock because of the assert failures under the AIL lock...
>
> Some updates: I tried to reproduce with lock debugging and whatnot enabled, but the problem was no longer reproducible, probably due to disruption of timing. When I went back to a reproducing kernel, I ended up seeing a page fault variant crash via xfs/006 instead of the soft lockup. This occurred shortly after the unmount attempt started, so I retried again with KASAN enabled but ran into the same heisenbug behavior. From there, I replaced the generic debug mechanisms with a custom mount flag set in xfs_trans_ail_destroy() immediately after freeing the xfs_ail object, plus an assert check on entry to xfs_trans_ail_delete(). The next xfs/006 failure produced the splat below [2]. I suspect that is the smoking gun [1], and perhaps we've somehow broken serialization between in-core object teardown and outstanding log I/O completion after the filesystem happens to shut down.

OK, so the AIL has been torn down, then we are getting log I/O completing and being failed, trying to remove a buffer item from the AIL?
xfs_log_unmount() does:

xfs_log_unmount(
	struct xfs_mount	*mp)
{
	xfs_log_clean(mp);
	xfs_buftarg_drain(mp->m_ddev_targp);
	xfs_trans_ail_destroy(mp);

And so the AIL cannot be destroyed if there are any active buffers still in the system (e.g. being logged) - it will block until xfs_buftarg_drain() returns with an empty buffer cache. Which means this buffer is likely the superblock buffer - the only buffer in the filesystem that isn't cached in the buffer cache but is logged.

So I suspect we have:

xfs_log_unmount()
  xfs_log_clean()
    xfs_log_quiesce()
      xfs_log_cover()
        xfs_sync_sb()
        xfs_ail_push_all_sync()

failing to push the superblock buffer into the AIL, so xfs_log_cover() returns while the log I/O is still in progress. We then write an unmount record, which can silently fail to force the iclog, and so we get through xfs_log_clean() while the superblock buffer is still attached to an iclog... The buffer cache is empty, and we then tear down the AIL, only to have iclog callbacks run, fail the superblock buffer item that is in the AIL, and try to remove the superblock buffer from the AIL as the last reference to the buf log item is released...

This is convoluted, but I think it can happen if we get an iclog I/O error on the log force from xfs_sync_sb(). I think the key thing is how the log is shut down and how wakeups are processed in xlog_force_shutdown....
> > [2] xfs/006 custom assert and BUG splat: > > XFS: Assertion failed: !mp->m_freed_ail, file: fs/xfs/xfs_trans_ail.c, line: 879 > ------------[ cut here ]------------ > WARNING: CPU: 2 PID: 1289 at fs/xfs/xfs_message.c:97 asswarn+0x1a/0x1d [xfs] > Modules linked in: rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support inth > dm_log dm_mod > CPU: 2 PID: 1289 Comm: kworker/2:1H Tainted: G S 5.17.0-rc6+ #29 > Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021 > Workqueue: xfs-log/dm-5 xlog_ioend_work [xfs] > RIP: 0010:asswarn+0x1a/0x1d [xfs] > Code: e8 e8 13 8c c4 48 83 c4 60 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0 d8 c0 e8 19 fd ff ff <0f> 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 c0 a0 > RSP: 0018:ff5757be8f37fb70 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: ff4847238f4b2940 RCX: 0000000000000000 > RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc0d7e076 > RBP: ff4847341d6a0f80 R08: 0000000000000000 R09: 0000000000000000 > R10: 000000000000000a R11: f000000000000000 R12: 0000000200002600 > R13: ff484724b1c35000 R14: 0000000000000008 R15: ff4847238f4b2940 > FS: 0000000000000000(0000) GS:ff484761ffc40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f55c21f0f70 CR3: 0000000142c9c002 CR4: 0000000000771ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > PKRU: 55555554 > Call Trace: > <TASK> > xfs_trans_ail_delete+0x102/0x130 [xfs] > xfs_buf_item_done+0x22/0x30 [xfs] > xfs_buf_ioend+0x73/0x4d0 [xfs] > xfs_trans_committed_bulk+0x17e/0x2f0 [xfs] > ? enqueue_task_fair+0x91/0x680 > ? remove_entity_load_avg+0x2e/0x70 > ? __wake_up_common_lock+0x87/0xc0 > xlog_cil_committed+0x2a9/0x300 [xfs] > ? 
__wake_up_common_lock+0x87/0xc0
> xlog_cil_process_committed+0x69/0x80 [xfs]
> xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
> xlog_force_shutdown+0xdf/0x150 [xfs]
> xfs_do_force_shutdown+0x5f/0x150 [xfs]
> xlog_ioend_work+0x71/0x80 [xfs]

OK, so this is processing an EIO error on a log write, and it's triggering a force shutdown. This causes the log to be shut down, and then it runs the attached iclog callbacks from the shutdown context. That means the fs and log have already been marked xfs_is_shutdown/xlog_is_shutdown, and so high level code will abort (e.g. xfs_trans_commit(), xfs_log_force(), etc.) with an error because of the shutdown.

That's the first thing we need for xfs_sync_sb() to exit without waiting for the superblock buffer to be committed to disk above.

The second is xlog_state_shutdown_callbacks() doing this:

{
	struct xlog_in_core	*iclog;
	LIST_HEAD(cb_list);

	spin_lock(&log->l_icloglock);
	iclog = log->l_iclog;
	do {
		if (atomic_read(&iclog->ic_refcnt)) {
			/* Reference holder will re-run iclog callbacks. */
			continue;
		}
		list_splice_init(&iclog->ic_callbacks, &cb_list);
>>>>>>		wake_up_all(&iclog->ic_write_wait);
>>>>>>		wake_up_all(&iclog->ic_force_wait);
	} while ((iclog = iclog->ic_next) != log->l_iclog);

	wake_up_all(&log->l_flush_wait);
	spin_unlock(&log->l_icloglock);

>>>>>>	xlog_cil_process_committed(&cb_list);
}

It wakes force waiters before shutdown processing has run all the pending callbacks. That means the xfs_sync_sb() caller waiting on a sync transaction in xfs_log_force() on iclog->ic_force_wait will get woken before the callbacks attached to that iclog are run. Normally this is just fine, because the force waiter has nothing to do with AIL operations. But in the case of this unmount path, the log force waiter goes on to tear down the AIL because the log is now shut down, and nothing ever blocks it again from the wait point in xfs_log_cover().
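Laid out as a two-CPU timeline (an illustrative reconstruction of the interleaving described above; the exact call sites on the unmount side are inferred, not taken from a trace):

```c
/*
 * CPU 0: log I/O error			CPU 1: unmount
 * (xlog_ioend_work)			(waiter in xfs_log_force())
 *
 * xlog_force_shutdown()
 *  xlog_state_shutdown_callbacks()
 *   splice iclog->ic_callbacks
 *   wake_up_all(ic_force_wait) --->	xfs_log_force() returns
 *					xfs_log_clean() completes
 *					xfs_trans_ail_destroy()
 *					  frees the AIL
 *   xlog_cil_process_committed()
 *    xfs_buf_ioend()
 *     xfs_trans_ail_delete()
 *      takes the freed AIL lock -> GPF
 */
```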
Hence it's a race to see who gets to the AIL first - the unmount code or xlog_cil_process_committed() killing the superblock buffer. So that's the bug, and it has nothing to do with any of the other shutdown issues I'm trying to sort out right now. I'll add it to the list. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 2022-03-18 13:46 [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 Brian Foster 2022-03-18 16:11 ` Brian Foster @ 2022-03-18 21:48 ` Dave Chinner 2022-03-18 21:51 ` Darrick J. Wong 1 sibling, 1 reply; 10+ messages in thread From: Dave Chinner @ 2022-03-18 21:48 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote: > Hi, > > I'm not sure if this is known and/or fixed already, but it didn't look > familiar so here is a report. I hit a splat when testing Willy's > prospective folio bookmark change and it turns out it replicates on > Linus' current master (551acdc3c3d2). This initially reproduced on > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via > xfs/006, but when I attempted to reproduce the latter a second time I > hit what looks like the same problem as xfs/264. Both tests seem to > involve some form of error injection, so possibly the same underlying > problem. The GPF splat from xfs/264 is below. On a side note, I'm wondering if we should add xfs/006 and xfs/264 to the recoveryloop group - they do a shutdown under load and a followup mount to ensure the filesystem gets recovered before the test ends and the fs is checked, so while they don't explicitly test recovery, they do exercise it.... Thoughts? -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 2022-03-18 21:48 ` Dave Chinner @ 2022-03-18 21:51 ` Darrick J. Wong 2022-03-18 22:39 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Darrick J. Wong @ 2022-03-18 21:51 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On Sat, Mar 19, 2022 at 08:48:31AM +1100, Dave Chinner wrote: > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote: > > Hi, > > > > I'm not sure if this is known and/or fixed already, but it didn't look > > familiar so here is a report. I hit a splat when testing Willy's > > prospective folio bookmark change and it turns out it replicates on > > Linus' current master (551acdc3c3d2). This initially reproduced on > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via > > xfs/006, but when I attempted to reproduce the latter a second time I > > hit what looks like the same problem as xfs/264. Both tests seem to > > involve some form of error injection, so possibly the same underlying > > problem. The GPF splat from xfs/264 is below. > > On a side note, I'm wondering if we should add xfs/006 and xfs/264 > to the recoveryloop group - they do a shutdown under load and a > followup mount to ensure the filesystem gets recovered before > the test ends and the fs is checked, so while they don't explicitly > test recovery, they do exercise it.... > > Thoughts? Someone else asked about this the other day, and I proposed a 'recovery' group for tests that don't run in a loop. --D > > -Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 2022-03-18 21:51 ` Darrick J. Wong @ 2022-03-18 22:39 ` Dave Chinner 0 siblings, 0 replies; 10+ messages in thread From: Dave Chinner @ 2022-03-18 22:39 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs On Fri, Mar 18, 2022 at 02:51:33PM -0700, Darrick J. Wong wrote: > On Sat, Mar 19, 2022 at 08:48:31AM +1100, Dave Chinner wrote: > > On Fri, Mar 18, 2022 at 09:46:53AM -0400, Brian Foster wrote: > > > Hi, > > > > > > I'm not sure if this is known and/or fixed already, but it didn't look > > > familiar so here is a report. I hit a splat when testing Willy's > > > prospective folio bookmark change and it turns out it replicates on > > > Linus' current master (551acdc3c3d2). This initially reproduced on > > > xfs/264 (mkfs defaults) and I saw a soft lockup warning variant via > > > xfs/006, but when I attempted to reproduce the latter a second time I > > > hit what looks like the same problem as xfs/264. Both tests seem to > > > involve some form of error injection, so possibly the same underlying > > > problem. The GPF splat from xfs/264 is below. > > > > On a side note, I'm wondering if we should add xfs/006 and xfs/264 > > to the recoveryloop group - they do a shutdown under load and a > > followup mount to ensure the filesystem gets recovered before > > the test ends and the fs is checked, so while they don't explicitly > > test recovery, they do exercise it.... > > > > Thoughts? > > Someone else asked about this the other day, and I proposed a 'recovery' > group for tests that don't run in a loop. That distinction is largely meaningless to me. I tend to think of "recoveryloop" as the recovery tests I want to run in a long running loop via iteration, e.g. something like 'check -I 250 -g recoveryloop'.
I don't really care if the tests loop internally doing multiple recoveries - I'm wanting to run the recovery tests that reproduce problems frequently in a tight loop repeatedly. Hence I think we should just lump the shutdown+recovery tests all in one group so that when we want to exercise shutdown/recovery we just have one single group to run repeatedly in a loop. Whether that group is named 'recovery' or 'recoveryloop' is largely irrelevant to me. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2022-03-22 21:42 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-03-18 13:46 [BUG] log I/O completion GPF via xfs/006 and xfs/264 on 5.17.0-rc8 Brian Foster 2022-03-18 16:11 ` Brian Foster 2022-03-18 21:42 ` Dave Chinner 2022-03-21 18:35 ` Brian Foster 2022-03-21 22:14 ` Dave Chinner 2022-03-22 14:33 ` Brian Foster 2022-03-22 21:41 ` Dave Chinner 2022-03-18 21:48 ` Dave Chinner 2022-03-18 21:51 ` Darrick J. Wong 2022-03-18 22:39 ` Dave Chinner