From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>,
Filipe Manana <fdmanana@suse.com>,
David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.10 07/52] btrfs: fix hang during unmount when stopping a space reclaim worker
Date: Mon, 3 Oct 2022 09:11:14 +0200
Message-ID: <20221003070718.934923187@linuxfoundation.org>
In-Reply-To: <20221003070718.687440096@linuxfoundation.org>
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit a362bb864b8db4861977d00bd2c3222503ccc34b ]
Often when running generic/562 from fstests we can hang during unmount,
resulting in a trace like this:
Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1
Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000
Sep 07 11:55:32 debian9 kernel: Call Trace:
Sep 07 11:55:32 debian9 kernel: <TASK>
Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0
Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70
Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0
Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130
Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0
Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420
Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0
Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200
Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0
Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530
Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140
Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30
Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0
Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170
Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs]
Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0
Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120
Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30
Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs]
Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0
Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160
Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0
Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0
Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40
Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90
Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
Sep 07 11:55:32 debian9 kernel: </TASK>
What happens is the following:
1) The cleaner kthread tries to start a transaction to delete an unused
block group, but the metadata reservation cannot be satisfied right
away, so a reservation ticket is created and it starts the async
metadata reclaim task (fs_info->async_reclaim_work);
2) Writeback for all the filler inodes with an i_size of 2K starts
(generic/562 creates a lot of 2K files with the goal of filling
metadata space). We try to create an inline extent for them, but we
fail when trying to insert the inline extent with -ENOSPC (at
cow_file_range_inline()) - since this is not critical, we fall back
to non-inline mode (back to cow_file_range()), reserve extents, create
extent maps and create the ordered extents;
3) An unmount starts, enters close_ctree();
4) The async reclaim task is flushing stuff, entering the flush states one
by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
delayed iputs.
After running the delayed iputs and before calling
btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
and btrfs_add_delayed_iput() is called for each one through
btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
in bumping fs_info->nr_delayed_iputs from 0 to some positive value.
So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
for fs_info->nr_delayed_iputs to become 0;
5) The current transaction is committed by the transaction kthread, we then
start unpinning extents and end up calling btrfs_try_granting_tickets()
through unpin_extent_range(), since we released some space.
This results in satisfying the ticket created by the cleaner kthread at
step 1, waking up the cleaner kthread;
6) At close_ctree() we ask the cleaner kthread to park;
7) The cleaner kthread starts the transaction, deletes the unused block
group, and then calls kthread_should_park(), which returns true, so it
parks. And at this point we have the delayed iputs added by the
completion of the ordered extents still pending;
8) Then later at close_ctree(), when we call:
cancel_work_sync(&fs_info->async_reclaim_work);
We hang forever, since the cleaner was parked and no one else can run
delayed iputs after that, while the reclaim task is waiting for the
remaining delayed iputs to be completed.
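The ordering problem in steps 4 to 8 can be replayed with a small
single-threaded model. This is only a sketch under simplifying assumptions:
struct fs_model and every model_* function are hypothetical names invented
for illustration, not kernel code, and the concurrency is flattened into one
fixed interleaving (the one described above).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the state involved in the hang. */
struct fs_model {
	int nr_delayed_iputs;	/* stands in for fs_info->nr_delayed_iputs */
	int pending_ordered;	/* ordered extents whose completion has not run */
	bool cleaner_parked;	/* once parked, nobody else runs delayed iputs */
};

/* Models btrfs_run_delayed_iputs(): drains every queued delayed iput. */
static void model_run_delayed_iputs(struct fs_model *fs)
{
	fs->nr_delayed_iputs = 0;
}

/* Models ordered extent completion: each final put queues a delayed iput. */
static void model_complete_ordered_extents(struct fs_model *fs)
{
	fs->nr_delayed_iputs += fs->pending_ordered;
	fs->pending_ordered = 0;
}

/* Models btrfs_wait_on_delayed_iputs(): true if the waiter never wakes up. */
static bool model_wait_blocks_forever(const struct fs_model *fs)
{
	return fs->nr_delayed_iputs > 0 && fs->cleaner_parked;
}

/*
 * Replays steps 4 to 8. Returns true if cancel_work_sync() on the reclaim
 * work would hang. With apply_fix set, the unmount path first drains the
 * ordered extent completions and runs delayed iputs, as the patch does.
 */
bool model_unmount_hangs(int ordered_extents, bool apply_fix)
{
	struct fs_model fs = {
		.nr_delayed_iputs = 0,
		.pending_ordered = ordered_extents,
		.cleaner_parked = false,
	};

	/* Step 4: the reclaim worker runs the delayed iputs queued so far... */
	model_run_delayed_iputs(&fs);
	/* ...and ordered extents complete before it starts waiting. */
	model_complete_ordered_extents(&fs);

	/* Steps 6 and 7: the cleaner kthread parks. */
	fs.cleaner_parked = true;

	if (apply_fix) {
		/* Stand-in for flushing the completion work queues. */
		model_complete_ordered_extents(&fs);
		model_run_delayed_iputs(&fs);
	}

	/* Step 8: cancel_work_sync() has to wait for the reclaim worker. */
	return model_wait_blocks_forever(&fs);
}
```

In this model, model_unmount_hangs(3, false) reports a hang, while
model_unmount_hangs(3, true) does not, because the counter is back at zero
before the reclaim work is cancelled.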
Fix this by waiting for all ordered extents to complete and running the
delayed iputs before attempting to stop the async reclaim tasks. Note that
we cannot wait for ordered extents with btrfs_wait_ordered_roots() (or
other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
flag to be set on an ordered extent, but the delayed iput is added after
that, when doing the final btrfs_put_ordered_extent(). So instead wait for
the work queues used for executing ordered extent completion to be empty,
which works because we do the final put on an ordered extent at
btrfs_finish_ordered_io() (while we are in the unmount context).
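Why waiting on the flag is not enough can be illustrated with a two-phase
model of one ordered extent completing. This is a hedged sketch, not kernel
code: struct ordered_model and the model_* functions are hypothetical, and
the completion work is reduced to two sequential phases (set the flag, then
do the final put that queues the delayed iput).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-phase model of one ordered extent completing. */
struct ordered_model {
	bool complete_flag;	/* stands in for BTRFS_ORDERED_COMPLETE */
	bool final_put_done;	/* final btrfs_put_ordered_extent() has run */
};

/* Advance the completion work by one phase; returns delayed iputs queued. */
static int model_step(struct ordered_model *oe)
{
	if (!oe->complete_flag) {
		oe->complete_flag = true;	/* the flag is set first */
		return 0;
	}
	if (!oe->final_put_done) {
		oe->final_put_done = true;	/* the final put queues the iput */
		return 1;
	}
	return 0;
}

/*
 * Waiter that only watches the flag, like btrfs_wait_ordered_roots() in
 * this model. Returns the delayed iputs it can observe when it returns.
 */
int model_iputs_seen_after_flag_wait(void)
{
	struct ordered_model oe = { false, false };
	int iputs = 0;

	while (!oe.complete_flag)
		iputs += model_step(&oe);
	return iputs;
}

/*
 * Waiter that flushes the work queue, i.e. waits for the whole work
 * function to return. Returns the delayed iputs it can observe.
 */
int model_iputs_seen_after_flush(void)
{
	struct ordered_model oe = { false, false };
	int iputs = 0;

	while (!oe.final_put_done)
		iputs += model_step(&oe);
	return iputs;
}
```

The flag waiter returns before the delayed iput exists, so a following
btrfs_run_delayed_iputs() would miss it. The flushing waiter returns only
after the iput has been queued.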
Fixes: d6fd0ae25c6495 ("Btrfs: fix missing delayed iputs on unmount")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/btrfs/disk-io.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2c7e50980a70..f2abd8bfd4a0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4105,6 +4105,31 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	/* clear out the rbtree of defraggable inodes */
 	btrfs_cleanup_defrag_inodes(fs_info);
 
+	/*
+	 * After we parked the cleaner kthread, ordered extents may have
+	 * completed and created new delayed iputs. If one of the async reclaim
+	 * tasks is running and in the RUN_DELAYED_IPUTS flush state, then we
+	 * can hang forever trying to stop it, because if a delayed iput is
+	 * added after it ran btrfs_run_delayed_iputs() and before it called
+	 * btrfs_wait_on_delayed_iputs(), it will hang forever since there is
+	 * no one else to run iputs.
+	 *
+	 * So wait for all ongoing ordered extents to complete and then run
+	 * delayed iputs. This works because once we reach this point no one
+	 * can either create new ordered extents nor create delayed iputs
+	 * through some other means.
+	 *
+	 * Also note that btrfs_wait_ordered_roots() is not safe here, because
+	 * it waits for BTRFS_ORDERED_COMPLETE to be set on an ordered extent,
+	 * but the delayed iput for the respective inode is made only when doing
+	 * the final btrfs_put_ordered_extent() (which must happen at
+	 * btrfs_finish_ordered_io() when we are unmounting).
+	 */
+	btrfs_flush_workqueue(fs_info->endio_write_workers);
+	/* Ordered extents for free space inodes. */
+	btrfs_flush_workqueue(fs_info->endio_freespace_worker);
+	btrfs_run_delayed_iputs(fs_info);
+
 	cancel_work_sync(&fs_info->async_reclaim_work);
 	cancel_work_sync(&fs_info->async_data_reclaim_work);
 
--
2.35.1