[PATCH] Btrfs: fix missing delayed iputs on unmount

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Btrfs: fix missing delayed iputs on unmount
@ 2018-10-31  0:14 Omar Sandoval
  2018-10-31  7:14 ` Lu Fengqi
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Omar Sandoval @ 2018-10-31  0:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
We found this with a stress test that our containers team runs. I'm
wondering if this same race could have caused any other issues other
than this new iput thing, but I couldn't identify any.

 fs/btrfs/disk-io.c | 40 +++++++---------------------------------
 1 file changed, 7 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..7c17284ae3c2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int again;
-	struct btrfs_trans_handle *trans;
 
-	do {
+	while (1) {
 		again = 0;
 
 		/* Make the cleaner go to sleep early. */
@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 sleep:
+		if (kthread_should_park())
+			kthread_parkme();
+		if (kthread_should_stop())
+			return 0;
 		if (!again) {
 			set_current_state(TASK_INTERRUPTIBLE);
-			if (!kthread_should_stop())
-				schedule();
+			schedule();
 			__set_current_state(TASK_RUNNING);
 		}
-	} while (!kthread_should_stop());
-
-	/*
-	 * Transaction kthread is stopped before us and wakes us up.
-	 * However we might have started a new transaction and COWed some
-	 * tree blocks when deleting unused block groups for example. So
-	 * make sure we commit the transaction we started to have a clean
-	 * shutdown when evicting the btree inode - if it has dirty pages
-	 * when we do the final iput() on it, eviction will trigger a
-	 * writeback for it which will fail with null pointer dereferences
-	 * since work queues and other resources were already released and
-	 * destroyed by the time the iput/eviction/writeback is made.
-	 */
-	trans = btrfs_attach_transaction(root);
-	if (IS_ERR(trans)) {
-		if (PTR_ERR(trans) != -ENOENT)
-			btrfs_err(fs_info,
-				  "cleaner transaction attach returned %ld",
-				  PTR_ERR(trans));
-	} else {
-		int ret;
-
-		ret = btrfs_commit_transaction(trans);
-		if (ret)
-			btrfs_err(fs_info,
-				  "cleaner open transaction commit returned %d",
-				  ret);
 	}
-
-	return 0;
 }
 
 static int transaction_kthread(void *arg)
@@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	int ret;
 
 	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
+	kthread_park(fs_info->cleaner_kthread);
 
 	/* wait for the qgroup rescan worker to stop */
 	btrfs_qgroup_wait_for_completion(fs_info, false);
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31  0:14 [PATCH] Btrfs: fix missing delayed iputs on unmount Omar Sandoval
@ 2018-10-31  7:14 ` Lu Fengqi
  2018-10-31  8:26   ` Nikolay Borisov
  2018-10-31  7:41 ` Lu Fengqi
  2018-10-31  8:43 ` Nikolay Borisov
  2 siblings, 1 reply; 10+ messages in thread
From: Lu Fengqi @ 2018-10-31  7:14 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

[-- Attachment #1: Type: text/plain, Size: 4156 bytes --]

On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>From: Omar Sandoval <osandov@fb.com>
>
>There's a race between close_ctree() and cleaner_kthread().
>close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>sees it set, but this is racy; the cleaner might have already checked
>the bit and could be cleaning stuff. In particular, if it deletes unused
>block groups, it will create delayed iputs for the free space cache
>inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>longer running delayed iputs after a commit. Therefore, if the cleaner
>creates more delayed iputs after delayed iputs are run in
>btrfs_commit_super(), we will leak inodes on unmount and get a busy
>inode crash from the VFS.
>
>Fix it by parking the cleaner before we actually close anything. Then,
>any remaining delayed iputs will always be handled in
>btrfs_commit_super(). This also ensures that the commit in close_ctree()
>is really the last commit, so we can get rid of the commit in
>cleaner_kthread().
>
>Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>Signed-off-by: Omar Sandoval <osandov@fb.com>
>---
>We found this with a stress test that our containers team runs. I'm
>wondering if this same race could have caused any other issues other
>than this new iput thing, but I couldn't identify any.

I noticed an inode leak issue in generic/475, but whether dropping commit
30928e9baac2 ("btrfs: don't run delayed_iputs in commit") or applying
this patch, the issue still exists.

I have attached the dmesg.

>
> fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> 1 file changed, 7 insertions(+), 33 deletions(-)
>
>diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>index b0ab41da91d1..7c17284ae3c2 100644
>--- a/fs/btrfs/disk-io.c
>+++ b/fs/btrfs/disk-io.c
>@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> 	struct btrfs_root *root = arg;
> 	struct btrfs_fs_info *fs_info = root->fs_info;
> 	int again;
>-	struct btrfs_trans_handle *trans;
> 
>-	do {
>+	while (1) {
> 		again = 0;
> 
> 		/* Make the cleaner go to sleep early. */
>@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> 		 */
> 		btrfs_delete_unused_bgs(fs_info);
> sleep:
>+		if (kthread_should_park())
>+			kthread_parkme();
>+		if (kthread_should_stop())
>+			return 0;
> 		if (!again) {
> 			set_current_state(TASK_INTERRUPTIBLE);
>-			if (!kthread_should_stop())
>-				schedule();
>+			schedule();
> 			__set_current_state(TASK_RUNNING);
> 		}
>-	} while (!kthread_should_stop());
>-
>-	/*
>-	 * Transaction kthread is stopped before us and wakes us up.
>-	 * However we might have started a new transaction and COWed some
>-	 * tree blocks when deleting unused block groups for example. So
>-	 * make sure we commit the transaction we started to have a clean
>-	 * shutdown when evicting the btree inode - if it has dirty pages
>-	 * when we do the final iput() on it, eviction will trigger a
>-	 * writeback for it which will fail with null pointer dereferences
>-	 * since work queues and other resources were already released and
>-	 * destroyed by the time the iput/eviction/writeback is made.
>-	 */
>-	trans = btrfs_attach_transaction(root);
>-	if (IS_ERR(trans)) {
>-		if (PTR_ERR(trans) != -ENOENT)
>-			btrfs_err(fs_info,
>-				  "cleaner transaction attach returned %ld",
>-				  PTR_ERR(trans));
>-	} else {
>-		int ret;
>-
>-		ret = btrfs_commit_transaction(trans);
>-		if (ret)
>-			btrfs_err(fs_info,
>-				  "cleaner open transaction commit returned %d",
>-				  ret);
> 	}
>-
>-	return 0;
> }
> 
> static int transaction_kthread(void *arg)
>@@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> 	int ret;
> 
> 	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
>+	kthread_park(fs_info->cleaner_kthread);

Since we are not going to call kthread_unpark, I am not sure why
kthread_park is used instead of kthread_stop here. It looks like there
is no significant difference between stopping instantly and delayed stop.

-- 
Thanks,
Lu

> 
> 	/* wait for the qgroup rescan worker to stop */
> 	btrfs_qgroup_wait_for_completion(fs_info, false);
>-- 
>2.19.1
>
>
>
>



[-- Attachment #2: generic_475.dmesg --]
[-- Type: text/plain, Size: 14435 bytes --]

[  366.955193] run fstests generic/475 at 2018-10-31 15:06:43
[  367.495791] BTRFS: device fsid 812f883c-40b2-4456-9769-b94ddf1cb07e devid 1 transid 5 /dev/nvme0n1p2
[  367.624469] BTRFS info (device dm-3): disk space caching is enabled
[  367.627305] BTRFS info (device dm-3): has skinny extents
[  367.628771] BTRFS info (device dm-3): flagging fs with big metadata feature
[  367.637585] BTRFS info (device dm-3): enabling ssd optimizations
[  367.640079] BTRFS info (device dm-3): checking UUID tree
[  367.666290] btrfs_dev_stat_print_on_error: 86 callbacks suppressed
[  367.666293] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[  367.670665] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[  367.673332] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[  367.674024] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[  367.676072] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[  367.677316] Buffer I/O error on dev dm-3, logical block 4718576, async page read
[  367.682713] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[  367.685150] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[  367.690053] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[  367.690298] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[  367.694166] BTRFS error (device dm-3): bdev /dev/mapper/error-test errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[  367.700484] BTRFS: error (device dm-3) in btrfs_commit_transaction:2239: errno=-5 IO failure (Error while writing out transaction)
[  367.707185] BTRFS info (device dm-3): forced readonly
[  367.708627] BTRFS warning (device dm-3): Skipping commit of aborted transaction.
[  367.710526] BTRFS: error (device dm-3) in cleanup_transaction:1849: errno=-5 IO failure
[  367.712330] BTRFS info (device dm-3): delayed_refs has NO entry
[  367.975623] BTRFS error (device dm-3): cleaner transaction attach returned -30
[  367.977998] BTRFS info (device dm-3): at unmount delalloc count 8192
[  367.980424] WARNING: CPU: 2 PID: 14918 at fs/btrfs/disk-io.c:3846 btrfs_free_fs_root+0x94/0xa0 [btrfs]
[  367.983228] Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs nvme nvme_core
[  367.986171] CPU: 2 PID: 14918 Comm: umount Kdump: loaded Not tainted 4.19.0-rc8+ #18
[  367.988154] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  367.990175] RIP: 0010:btrfs_free_fs_root+0x94/0xa0 [btrfs]
[  367.997036] RSP: 0018:ffffc9000744bd60 EFLAGS: 00010282
[  367.998595] RAX: ffff880354eb6270 RBX: ffff8803661c4548 RCX: 0000000000000000
[  368.000665] RDX: 0000000000000002 RSI: 0000000000000001 RDI: 0000000000000000
[  368.002527] RBP: ffff88033cd70000 R08: 0000000000000000 R09: 0000000000000001
[  368.004426] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000744bd88
[  368.006248] R13: ffff88033cd700c0 R14: ffffc9000744bd80 R15: 0000000000000000
[  368.008082] FS:  00007f206293e780(0000) GS:ffff880373800000(0000) knlGS:0000000000000000
[  368.010068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.011654] CR2: 000055d225976b70 CR3: 000000033f992004 CR4: 00000000003606e0
[  368.013440] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  368.015232] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  368.016979] Call Trace:
[  368.017947]  btrfs_free_fs_roots+0xcf/0x160 [btrfs]
[  368.019317]  ? wait_for_completion+0x126/0x150
[  368.020617]  close_ctree+0x149/0x330 [btrfs]
[  368.021822]  generic_shutdown_super+0x6c/0x100
[  368.023043]  kill_anon_super+0x14/0x30
[  368.024203]  btrfs_kill_super+0x13/0x100 [btrfs]
[  368.025471]  deactivate_locked_super+0x2f/0x70
[  368.027647]  cleanup_mnt+0x3b/0x70
[  368.028712]  task_work_run+0x9e/0xd0
[  368.029875]  do_syscall_64+0x470/0x600
[  368.031009]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  368.032350]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  368.033620] RIP: 0033:0x7f2062ec1b5b
[  368.039127] RSP: 002b:00007ffd75ebd748 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  368.040906] RAX: 0000000000000000 RBX: 000055b8b9bdb2b0 RCX: 00007f2062ec1b5b
[  368.042564] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b8b9bdb490
[  368.044250] RBP: 0000000000000000 R08: 000055b8b9bdb4b0 R09: 00007ffd75ebbfa0
[  368.045862] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b8b9bdb490
[  368.047503] R13: 00007f20631e11a4 R14: 0000000000000000 R15: 00007ffd75ebd9b8
[  368.049089] irq event stamp: 0
[  368.049979] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[  368.051674] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.053587] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.055550] softirqs last disabled at (0): [<0000000000000000>]           (null)
[  368.058098] ---[ end trace 08fc3163d80dad81 ]---
[  368.061134] WARNING: CPU: 2 PID: 14918 at fs/btrfs/extent-tree.c:123 btrfs_put_block_group+0x51/0x60 [btrfs]
[  368.063637] Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs nvme nvme_core
[  368.066230] CPU: 2 PID: 14918 Comm: umount Kdump: loaded Tainted: G        W         4.19.0-rc8+ #18
[  368.068615] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  368.070406] RIP: 0010:btrfs_put_block_group+0x51/0x60 [btrfs]
[  368.076431] RSP: 0018:ffffc9000744bda0 EFLAGS: 00010206
[  368.077765] RAX: 0000000000000001 RBX: ffff88033cd70000 RCX: ffff8803414f76c0
[  368.079477] RDX: ffff8803414f76c0 RSI: 0000000000000001 RDI: ffff8803414f74e8
[  368.081141] RBP: ffff8803414f74e8 R08: 0000000000000000 R09: 0000000000000000
[  368.082830] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88033cd70118
[  368.084565] R13: ffff88033cd70168 R14: ffff8803414f7680 R15: dead000000000100
[  368.087151] FS:  00007f206293e780(0000) GS:ffff880373800000(0000) knlGS:0000000000000000
[  368.089087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.090616] CR2: 000055d225976b70 CR3: 000000033f992004 CR4: 00000000003606e0
[  368.092390] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  368.094103] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  368.095812] Call Trace:
[  368.096700]  btrfs_free_block_groups+0x1da/0x3b0 [btrfs]
[  368.098065]  close_ctree+0x171/0x330 [btrfs]
[  368.099254]  generic_shutdown_super+0x6c/0x100
[  368.100486]  kill_anon_super+0x14/0x30
[  368.101554]  btrfs_kill_super+0x13/0x100 [btrfs]
[  368.102758]  deactivate_locked_super+0x2f/0x70
[  368.103986]  cleanup_mnt+0x3b/0x70
[  368.105002]  task_work_run+0x9e/0xd0
[  368.106038]  do_syscall_64+0x470/0x600
[  368.107123]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  368.108397]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  368.109677] RIP: 0033:0x7f2062ec1b5b
[  368.115270] RSP: 002b:00007ffd75ebd748 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  368.117866] RAX: 0000000000000000 RBX: 000055b8b9bdb2b0 RCX: 00007f2062ec1b5b
[  368.119595] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b8b9bdb490
[  368.121368] RBP: 0000000000000000 R08: 000055b8b9bdb4b0 R09: 00007ffd75ebbfa0
[  368.122983] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b8b9bdb490
[  368.124680] R13: 00007f20631e11a4 R14: 0000000000000000 R15: 00007ffd75ebd9b8
[  368.126330] irq event stamp: 0
[  368.127242] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[  368.128874] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.130836] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.132742] softirqs last disabled at (0): [<0000000000000000>]           (null)
[  368.134443] ---[ end trace 08fc3163d80dad82 ]---
[  368.140638] WARNING: CPU: 2 PID: 14918 at fs/btrfs/extent-tree.c:9789 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[  368.143175] Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs nvme nvme_core
[  368.145923] CPU: 2 PID: 14918 Comm: umount Kdump: loaded Tainted: G        W         4.19.0-rc8+ #18
[  368.148908] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  368.150766] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[  368.156851] RSP: 0018:ffffc9000744bdb0 EFLAGS: 00010206
[  368.158212] RAX: ffff8803414f4620 RBX: ffff88033cd70000 RCX: 0000000000000000
[  368.159883] RDX: 0000000000000002 RSI: ffff88033cd71750 RDI: ffff8803414f7a38
[  368.161614] RBP: ffff8803414f0558 R08: 0000000000000000 R09: ffff8803414f7b70
[  368.163340] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88033cd70118
[  368.165029] R13: ffff8803414f4620 R14: 0000000000000000 R15: dead000000000100
[  368.166730] FS:  00007f206293e780(0000) GS:ffff880373800000(0000) knlGS:0000000000000000
[  368.168613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.170055] CR2: 000055d225976b70 CR3: 000000033f992004 CR4: 00000000003606e0
[  368.171794] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  368.173501] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  368.175273] Call Trace:
[  368.176969]  close_ctree+0x171/0x330 [btrfs]
[  368.178203]  generic_shutdown_super+0x6c/0x100
[  368.179443]  kill_anon_super+0x14/0x30
[  368.180560]  btrfs_kill_super+0x13/0x100 [btrfs]
[  368.181794]  deactivate_locked_super+0x2f/0x70
[  368.182975]  cleanup_mnt+0x3b/0x70
[  368.184024]  task_work_run+0x9e/0xd0
[  368.185051]  do_syscall_64+0x470/0x600
[  368.186098]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  368.187392]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  368.188675] RIP: 0033:0x7f2062ec1b5b
[  368.194299] RSP: 002b:00007ffd75ebd748 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  368.195990] RAX: 0000000000000000 RBX: 000055b8b9bdb2b0 RCX: 00007f2062ec1b5b
[  368.197690] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b8b9bdb490
[  368.199348] RBP: 0000000000000000 R08: 000055b8b9bdb4b0 R09: 00007ffd75ebbfa0
[  368.200998] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b8b9bdb490
[  368.202651] R13: 00007f20631e11a4 R14: 0000000000000000 R15: 00007ffd75ebd9b8
[  368.204354] irq event stamp: 0
[  368.205236] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[  368.207825] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.209804] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.211807] softirqs last disabled at (0): [<0000000000000000>]           (null)
[  368.213501] ---[ end trace 08fc3163d80dad83 ]---
[  368.214836] BTRFS info (device dm-3): space_info 1 has 7708672 free, is not full
[  368.216595] BTRFS info (device dm-3): space_info total=8388608, used=585728, pinned=0, reserved=86016, may_use=8192, readonly=0
[  368.219751] WARNING: CPU: 2 PID: 14918 at fs/btrfs/extent-tree.c:9789 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[  368.222497] Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash raid6_pq efivarfs xfs nvme nvme_core
[  368.225120] CPU: 2 PID: 14918 Comm: umount Kdump: loaded Tainted: G        W         4.19.0-rc8+ #18
[  368.227581] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  368.229402] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[  368.235564] RSP: 0018:ffffc9000744bdb0 EFLAGS: 00010206
[  368.237770] RAX: ffff8803414f7b40 RBX: ffff88033cd70000 RCX: 000000000017000c
[  368.239647] RDX: 0000000000000002 RSI: 000000000017000c RDI: ffffffff8125fb74
[  368.241447] RBP: ffff8803414f47d0 R08: 0000000000000000 R09: 0000000000000000
[  368.243206] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8803414f4848
[  368.245071] R13: ffff8803414f7b40 R14: 0000000000000000 R15: dead000000000100
[  368.246825] FS:  00007f206293e780(0000) GS:ffff880373800000(0000) knlGS:0000000000000000
[  368.248758] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.250245] CR2: 000055d225976b70 CR3: 000000033f992004 CR4: 00000000003606e0
[  368.251982] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  368.253714] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  368.255491] Call Trace:
[  368.256375]  close_ctree+0x171/0x330 [btrfs]
[  368.257584]  generic_shutdown_super+0x6c/0x100
[  368.258786]  kill_anon_super+0x14/0x30
[  368.259900]  btrfs_kill_super+0x13/0x100 [btrfs]
[  368.261180]  deactivate_locked_super+0x2f/0x70
[  368.262433]  cleanup_mnt+0x3b/0x70
[  368.263452]  task_work_run+0x9e/0xd0
[  368.264531]  do_syscall_64+0x470/0x600
[  368.265621]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  368.267805]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  368.269288] RIP: 0033:0x7f2062ec1b5b
[  368.275103] RSP: 002b:00007ffd75ebd748 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  368.276887] RAX: 0000000000000000 RBX: 000055b8b9bdb2b0 RCX: 00007f2062ec1b5b
[  368.278598] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b8b9bdb490
[  368.280268] RBP: 0000000000000000 R08: 000055b8b9bdb4b0 R09: 00007ffd75ebbfa0
[  368.281917] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b8b9bdb490
[  368.283580] R13: 00007f20631e11a4 R14: 0000000000000000 R15: 00007ffd75ebd9b8
[  368.285271] irq event stamp: 0
[  368.286151] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[  368.287851] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.289798] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[  368.291756] softirqs last disabled at (0): [<0000000000000000>]           (null)
[  368.293419] ---[ end trace 08fc3163d80dad84 ]---
[  368.294684] BTRFS info (device dm-3): space_info 4 has 275857408 free, is not full
[  368.297278] BTRFS info (device dm-3): space_info total=276824064, used=114688, pinned=0, reserved=0, may_use=786432, readonly=65536
[  368.300677] VFS: Busy inodes after unmount of dm-3. Self-destruct in 5 seconds.  Have a nice day...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31  7:14 ` Lu Fengqi
@ 2018-10-31  8:26   ` Nikolay Borisov
  2018-10-31  8:38     ` Nikolay Borisov
  0 siblings, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2018-10-31  8:26 UTC (permalink / raw)
  To: Lu Fengqi, Omar Sandoval; +Cc: linux-btrfs, kernel-team



On 31.10.18 г. 9:14 ч., Lu Fengqi wrote:
> On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> There's a race between close_ctree() and cleaner_kthread().
>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>> sees it set, but this is racy; the cleaner might have already checked
>> the bit and could be cleaning stuff. In particular, if it deletes unused
>> block groups, it will create delayed iputs for the free space cache
>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>> longer running delayed iputs after a commit. Therefore, if the cleaner
>> creates more delayed iputs after delayed iputs are run in
>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>> inode crash from the VFS.
>>
>> Fix it by parking the cleaner before we actually close anything. Then,
>> any remaining delayed iputs will always be handled in
>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
>> is really the last commit, so we can get rid of the commit in
>> cleaner_kthread().
>>
>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>> ---
>> We found this with a stress test that our containers team runs. I'm
>> wondering if this same race could have caused any other issues other
>> than this new iput thing, but I couldn't identify any.
> 
> I noticed an inode leak issue in generic/475, but whether dropping commit
> 30928e9baac2 ("btrfs: don't run delayed_iputs in commit") or applying
> this patch, the issue still exists.
> 
> I have attached the dmesg.

Are you able to trigger this reliably i.e 100% or just, say, 80-90% of
the time? If it's sporadic (but frequent) then it is likely an issue
with error handling. Also looking at the log I see:

[  367.977998] BTRFS info (device dm-3): at unmount delalloc count 8192

Meaning we are leaking 8k of delalloc space which should have been
cleaned up. And the warn is caused because we have an inode in the inode
rb tree. I will suggest that you patch btrfs_free_fs_root to print the
ino's if root->inode_tree is not empty. That way you can see if the inos
correspond to a data file or a freespace ino. But given btrfs free_fs
also complains of delaloc bytes I'd be willing to say it's data inodes.

<snip>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31  8:26   ` Nikolay Borisov
@ 2018-10-31  8:38     ` Nikolay Borisov
  0 siblings, 0 replies; 10+ messages in thread
From: Nikolay Borisov @ 2018-10-31  8:38 UTC (permalink / raw)
  To: Lu Fengqi, Omar Sandoval; +Cc: linux-btrfs, kernel-team



On 31.10.18 г. 10:26 ч., Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 9:14 ч., Lu Fengqi wrote:
>> On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> There's a race between close_ctree() and cleaner_kthread().
>>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>>> sees it set, but this is racy; the cleaner might have already checked
>>> the bit and could be cleaning stuff. In particular, if it deletes unused
>>> block groups, it will create delayed iputs for the free space cache
>>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>>> longer running delayed iputs after a commit. Therefore, if the cleaner
>>> creates more delayed iputs after delayed iputs are run in
>>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>>> inode crash from the VFS.
>>>
>>> Fix it by parking the cleaner before we actually close anything. Then,
>>> any remaining delayed iputs will always be handled in
>>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
>>> is really the last commit, so we can get rid of the commit in
>>> cleaner_kthread().
>>>
>>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>> We found this with a stress test that our containers team runs. I'm
>>> wondering if this same race could have caused any other issues other
>>> than this new iput thing, but I couldn't identify any.
>>
>> I noticed an inode leak issue in generic/475, but whether dropping commit
>> 30928e9baac2 ("btrfs: don't run delayed_iputs in commit") or applying
>> this patch, the issue still exists.
>>
>> I have attached the dmesg.
> 
> Are you able to trigger this reliably i.e 100% or just, say, 80-90% of
> the time? If it's sporadic (but frequent) then it is likely an issue
> with error handling. Also looking at the log I see:
> 
> [  367.977998] BTRFS info (device dm-3): at unmount delalloc count 8192
> 
> Meaning we are leaking 8k of delalloc space which should have been
> cleaned up. And the warn is caused because we have an inode in the inode
> rb tree. I will suggest that you patch btrfs_free_fs_root to print the
> ino's if root->inode_tree is not empty. That way you can see if the inos
> correspond to a data file or a freespace ino. But given btrfs free_fs
> also complains of delaloc bytes I'd be willing to say it's data inodes.

As a matter of fact this should have been fixed by :

fe816d0f1d4c ("btrfs: Fix delalloc inodes invalidation during
transaction abort")


Perhaps this commit is not a full fix ?

> 
> <snip>
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31  0:14 [PATCH] Btrfs: fix missing delayed iputs on unmount Omar Sandoval
  2018-10-31  7:14 ` Lu Fengqi
@ 2018-10-31  7:41 ` Lu Fengqi
  2018-10-31 16:37   ` Omar Sandoval
  2018-10-31  8:43 ` Nikolay Borisov
  2 siblings, 1 reply; 10+ messages in thread
From: Lu Fengqi @ 2018-10-31  7:41 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>From: Omar Sandoval <osandov@fb.com>
>
>There's a race between close_ctree() and cleaner_kthread().
>close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>sees it set, but this is racy; the cleaner might have already checked
>the bit and could be cleaning stuff. In particular, if it deletes unused
>block groups, it will create delayed iputs for the free space cache
>inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>longer running delayed iputs after a commit. Therefore, if the cleaner
>creates more delayed iputs after delayed iputs are run in
>btrfs_commit_super(), we will leak inodes on unmount and get a busy

Since the assert added via commit e187831e1875 ("btrfs: assert on non-empty
delayed iputs") wasn't triggered, it doesn't seem to be the cause of inode
leak.

-- 
Thanks,
Lu

>inode crash from the VFS.
>
>Fix it by parking the cleaner before we actually close anything. Then,
>any remaining delayed iputs will always be handled in
>btrfs_commit_super(). This also ensures that the commit in close_ctree()
>is really the last commit, so we can get rid of the commit in
>cleaner_kthread().
>
>Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>Signed-off-by: Omar Sandoval <osandov@fb.com>
>---
>We found this with a stress test that our containers team runs. I'm
>wondering if this same race could have caused any other issues other
>than this new iput thing, but I couldn't identify any.
>
> fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> 1 file changed, 7 insertions(+), 33 deletions(-)
>
>diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>index b0ab41da91d1..7c17284ae3c2 100644
>--- a/fs/btrfs/disk-io.c
>+++ b/fs/btrfs/disk-io.c
>@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> 	struct btrfs_root *root = arg;
> 	struct btrfs_fs_info *fs_info = root->fs_info;
> 	int again;
>-	struct btrfs_trans_handle *trans;
> 
>-	do {
>+	while (1) {
> 		again = 0;
> 
> 		/* Make the cleaner go to sleep early. */
>@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> 		 */
> 		btrfs_delete_unused_bgs(fs_info);
> sleep:
>+		if (kthread_should_park())
>+			kthread_parkme();
>+		if (kthread_should_stop())
>+			return 0;
> 		if (!again) {
> 			set_current_state(TASK_INTERRUPTIBLE);
>-			if (!kthread_should_stop())
>-				schedule();
>+			schedule();
> 			__set_current_state(TASK_RUNNING);
> 		}
>-	} while (!kthread_should_stop());
>-
>-	/*
>-	 * Transaction kthread is stopped before us and wakes us up.
>-	 * However we might have started a new transaction and COWed some
>-	 * tree blocks when deleting unused block groups for example. So
>-	 * make sure we commit the transaction we started to have a clean
>-	 * shutdown when evicting the btree inode - if it has dirty pages
>-	 * when we do the final iput() on it, eviction will trigger a
>-	 * writeback for it which will fail with null pointer dereferences
>-	 * since work queues and other resources were already released and
>-	 * destroyed by the time the iput/eviction/writeback is made.
>-	 */
>-	trans = btrfs_attach_transaction(root);
>-	if (IS_ERR(trans)) {
>-		if (PTR_ERR(trans) != -ENOENT)
>-			btrfs_err(fs_info,
>-				  "cleaner transaction attach returned %ld",
>-				  PTR_ERR(trans));
>-	} else {
>-		int ret;
>-
>-		ret = btrfs_commit_transaction(trans);
>-		if (ret)
>-			btrfs_err(fs_info,
>-				  "cleaner open transaction commit returned %d",
>-				  ret);
> 	}
>-
>-	return 0;
> }
> 
> static int transaction_kthread(void *arg)
>@@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> 	int ret;
> 
> 	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
>+	kthread_park(fs_info->cleaner_kthread);
> 
> 	/* wait for the qgroup rescan worker to stop */
> 	btrfs_qgroup_wait_for_completion(fs_info, false);
>-- 
>2.19.1
>
>
>
>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31  7:41 ` Lu Fengqi
@ 2018-10-31 16:37   ` Omar Sandoval
  0 siblings, 0 replies; 10+ messages in thread
From: Omar Sandoval @ 2018-10-31 16:37 UTC (permalink / raw)
  To: Lu Fengqi; +Cc: linux-btrfs, kernel-team

On Wed, Oct 31, 2018 at 03:41:47PM +0800, Lu Fengqi wrote:
> On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
> >From: Omar Sandoval <osandov@fb.com>
> >
> >There's a race between close_ctree() and cleaner_kthread().
> >close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >sees it set, but this is racy; the cleaner might have already checked
> >the bit and could be cleaning stuff. In particular, if it deletes unused
> >block groups, it will create delayed iputs for the free space cache
> >inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> >longer running delayed iputs after a commit. Therefore, if the cleaner
> >creates more delayed iputs after delayed iputs are run in
> >btrfs_commit_super(), we will leak inodes on unmount and get a busy
> 
> Since the assert added via commit e187831e1875 ("btrfs: assert on non-empty
> delayed iputs") wasn't triggered, it doesn't seem to be the cause of inode
> leak.

This was in our production build without CONFIG_BTRFS_ASSERT.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31  0:14 [PATCH] Btrfs: fix missing delayed iputs on unmount Omar Sandoval
  2018-10-31  7:14 ` Lu Fengqi
  2018-10-31  7:41 ` Lu Fengqi
@ 2018-10-31  8:43 ` Nikolay Borisov
  2018-10-31 16:35   ` Omar Sandoval
  2 siblings, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2018-10-31  8:43 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team



On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There's a race between close_ctree() and cleaner_kthread().
> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> sees it set, but this is racy; the cleaner might have already checked
> the bit and could be cleaning stuff. In particular, if it deletes unused
> block groups, it will create delayed iputs for the free space cache
> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> longer running delayed iputs after a commit. Therefore, if the cleaner
> creates more delayed iputs after delayed iputs are run in
> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> inode crash from the VFS.
> 
> Fix it by parking the cleaner before we actually close anything. Then,
> any remaining delayed iputs will always be handled in
> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> is really the last commit, so we can get rid of the commit in
> cleaner_kthread().
> 
> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> We found this with a stress test that our containers team runs. I'm
> wondering if this same race could have caused any other issues other
> than this new iput thing, but I couldn't identify any.
> 
>  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
>  1 file changed, 7 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b0ab41da91d1..7c17284ae3c2 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
>  	struct btrfs_root *root = arg;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	int again;
> -	struct btrfs_trans_handle *trans;
>  
> -	do {
> +	while (1) {
>  		again = 0;
>  
>  		/* Make the cleaner go to sleep early. */
> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
>  		 */
>  		btrfs_delete_unused_bgs(fs_info);
>  sleep:
> +		if (kthread_should_park())
> +			kthread_parkme();
> +		if (kthread_should_stop())
> +			return 0;
>  		if (!again) {
>  			set_current_state(TASK_INTERRUPTIBLE);
> -			if (!kthread_should_stop())
> -				schedule();
> +			schedule();
>  			__set_current_state(TASK_RUNNING);
>  		}
> -	} while (!kthread_should_stop());
> -
> -	/*
> -	 * Transaction kthread is stopped before us and wakes us up.
> -	 * However we might have started a new transaction and COWed some
> -	 * tree blocks when deleting unused block groups for example. So
> -	 * make sure we commit the transaction we started to have a clean
> -	 * shutdown when evicting the btree inode - if it has dirty pages
> -	 * when we do the final iput() on it, eviction will trigger a
> -	 * writeback for it which will fail with null pointer dereferences
> -	 * since work queues and other resources were already released and
> -	 * destroyed by the time the iput/eviction/writeback is made.
> -	 */
> -	trans = btrfs_attach_transaction(root);
> -	if (IS_ERR(trans)) {
> -		if (PTR_ERR(trans) != -ENOENT)
> -			btrfs_err(fs_info,
> -				  "cleaner transaction attach returned %ld",
> -				  PTR_ERR(trans));
> -	} else {
> -		int ret;
> -
> -		ret = btrfs_commit_transaction(trans);
> -		if (ret)
> -			btrfs_err(fs_info,
> -				  "cleaner open transaction commit returned %d",
> -				  ret);
>  	}
> -
> -	return 0;
>  }
>  
>  static int transaction_kthread(void *arg)
> @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  	int ret;
>  
>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> +	kthread_park(fs_info->cleaner_kthread);

Can't you directly call kthread_stop here? When you park the thread it
will sleep and then when you call kthread_stop that function will unpark
the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
just return 0. So the from the moment the thread is parked until it's
stopped it doesn't have a chance to do useful work.

>  
>  	/* wait for the qgroup rescan worker to stop */
>  	btrfs_qgroup_wait_for_completion(fs_info, false);
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31  8:43 ` Nikolay Borisov
@ 2018-10-31 16:35   ` Omar Sandoval
  2018-10-31 16:40     ` Nikolay Borisov
  0 siblings, 1 reply; 10+ messages in thread
From: Omar Sandoval @ 2018-10-31 16:35 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team

On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > There's a race between close_ctree() and cleaner_kthread().
> > close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > sees it set, but this is racy; the cleaner might have already checked
> > the bit and could be cleaning stuff. In particular, if it deletes unused
> > block groups, it will create delayed iputs for the free space cache
> > inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > longer running delayed iputs after a commit. Therefore, if the cleaner
> > creates more delayed iputs after delayed iputs are run in
> > btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > inode crash from the VFS.
> > 
> > Fix it by parking the cleaner before we actually close anything. Then,
> > any remaining delayed iputs will always be handled in
> > btrfs_commit_super(). This also ensures that the commit in close_ctree()
> > is really the last commit, so we can get rid of the commit in
> > cleaner_kthread().
> > 
> > Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > We found this with a stress test that our containers team runs. I'm
> > wondering if this same race could have caused any other issues other
> > than this new iput thing, but I couldn't identify any.
> > 
> >  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> >  1 file changed, 7 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index b0ab41da91d1..7c17284ae3c2 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> >  	struct btrfs_root *root = arg;
> >  	struct btrfs_fs_info *fs_info = root->fs_info;
> >  	int again;
> > -	struct btrfs_trans_handle *trans;
> >  
> > -	do {
> > +	while (1) {
> >  		again = 0;
> >  
> >  		/* Make the cleaner go to sleep early. */
> > @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> >  		 */
> >  		btrfs_delete_unused_bgs(fs_info);
> >  sleep:
> > +		if (kthread_should_park())
> > +			kthread_parkme();
> > +		if (kthread_should_stop())
> > +			return 0;
> >  		if (!again) {
> >  			set_current_state(TASK_INTERRUPTIBLE);
> > -			if (!kthread_should_stop())
> > -				schedule();
> > +			schedule();
> >  			__set_current_state(TASK_RUNNING);
> >  		}
> > -	} while (!kthread_should_stop());
> > -
> > -	/*
> > -	 * Transaction kthread is stopped before us and wakes us up.
> > -	 * However we might have started a new transaction and COWed some
> > -	 * tree blocks when deleting unused block groups for example. So
> > -	 * make sure we commit the transaction we started to have a clean
> > -	 * shutdown when evicting the btree inode - if it has dirty pages
> > -	 * when we do the final iput() on it, eviction will trigger a
> > -	 * writeback for it which will fail with null pointer dereferences
> > -	 * since work queues and other resources were already released and
> > -	 * destroyed by the time the iput/eviction/writeback is made.
> > -	 */
> > -	trans = btrfs_attach_transaction(root);
> > -	if (IS_ERR(trans)) {
> > -		if (PTR_ERR(trans) != -ENOENT)
> > -			btrfs_err(fs_info,
> > -				  "cleaner transaction attach returned %ld",
> > -				  PTR_ERR(trans));
> > -	} else {
> > -		int ret;
> > -
> > -		ret = btrfs_commit_transaction(trans);
> > -		if (ret)
> > -			btrfs_err(fs_info,
> > -				  "cleaner open transaction commit returned %d",
> > -				  ret);
> >  	}
> > -
> > -	return 0;
> >  }
> >  
> >  static int transaction_kthread(void *arg)
> > @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> >  	int ret;
> >  
> >  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > +	kthread_park(fs_info->cleaner_kthread);
> 
> Can't you directly call kthread_stop here? When you park the thread it
> will sleep and then when you call kthread_stop that function will unpark
> the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
> just return 0. So the from the moment the thread is parked until it's
> stopped it doesn't have a chance to do useful work.

kthread_stop() frees the task_struct, but we might still try to wake up
the cleaner kthread from somewhere (e.g., from the transaction kthread).
So we really need to keep the cleaner alive but not doing work.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31 16:35   ` Omar Sandoval
@ 2018-10-31 16:40     ` Nikolay Borisov
  2018-10-31 16:48       ` Omar Sandoval
  0 siblings, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2018-10-31 16:40 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team



On 31.10.18 г. 18:35 ч., Omar Sandoval wrote:
> On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> There's a race between close_ctree() and cleaner_kthread().
>>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>>> sees it set, but this is racy; the cleaner might have already checked
>>> the bit and could be cleaning stuff. In particular, if it deletes unused
>>> block groups, it will create delayed iputs for the free space cache
>>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>>> longer running delayed iputs after a commit. Therefore, if the cleaner
>>> creates more delayed iputs after delayed iputs are run in
>>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>>> inode crash from the VFS.
>>>
>>> Fix it by parking the cleaner before we actually close anything. Then,
>>> any remaining delayed iputs will always be handled in
>>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
>>> is really the last commit, so we can get rid of the commit in
>>> cleaner_kthread().
>>>
>>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>> We found this with a stress test that our containers team runs. I'm
>>> wondering if this same race could have caused any other issues other
>>> than this new iput thing, but I couldn't identify any.
>>>
>>>  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
>>>  1 file changed, 7 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index b0ab41da91d1..7c17284ae3c2 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
>>>  	struct btrfs_root *root = arg;
>>>  	struct btrfs_fs_info *fs_info = root->fs_info;
>>>  	int again;
>>> -	struct btrfs_trans_handle *trans;
>>>  
>>> -	do {
>>> +	while (1) {
>>>  		again = 0;
>>>  
>>>  		/* Make the cleaner go to sleep early. */
>>> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
>>>  		 */
>>>  		btrfs_delete_unused_bgs(fs_info);
>>>  sleep:
>>> +		if (kthread_should_park())
>>> +			kthread_parkme();
>>> +		if (kthread_should_stop())
>>> +			return 0;
>>>  		if (!again) {
>>>  			set_current_state(TASK_INTERRUPTIBLE);
>>> -			if (!kthread_should_stop())
>>> -				schedule();
>>> +			schedule();
>>>  			__set_current_state(TASK_RUNNING);
>>>  		}
>>> -	} while (!kthread_should_stop());
>>> -
>>> -	/*
>>> -	 * Transaction kthread is stopped before us and wakes us up.
>>> -	 * However we might have started a new transaction and COWed some
>>> -	 * tree blocks when deleting unused block groups for example. So
>>> -	 * make sure we commit the transaction we started to have a clean
>>> -	 * shutdown when evicting the btree inode - if it has dirty pages
>>> -	 * when we do the final iput() on it, eviction will trigger a
>>> -	 * writeback for it which will fail with null pointer dereferences
>>> -	 * since work queues and other resources were already released and
>>> -	 * destroyed by the time the iput/eviction/writeback is made.
>>> -	 */
>>> -	trans = btrfs_attach_transaction(root);
>>> -	if (IS_ERR(trans)) {
>>> -		if (PTR_ERR(trans) != -ENOENT)
>>> -			btrfs_err(fs_info,
>>> -				  "cleaner transaction attach returned %ld",
>>> -				  PTR_ERR(trans));
>>> -	} else {
>>> -		int ret;
>>> -
>>> -		ret = btrfs_commit_transaction(trans);
>>> -		if (ret)
>>> -			btrfs_err(fs_info,
>>> -				  "cleaner open transaction commit returned %d",
>>> -				  ret);
>>>  	}
>>> -
>>> -	return 0;
>>>  }
>>>  
>>>  static int transaction_kthread(void *arg)
>>> @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>>>  	int ret;
>>>  
>>>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
>>> +	kthread_park(fs_info->cleaner_kthread);
>>
>> Can't you directly call kthread_stop here? When you park the thread it
>> will sleep and then when you call kthread_stop that function will unpark
>> the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
>> just return 0. So the from the moment the thread is parked until it's
>> stopped it doesn't have a chance to do useful work.
> 
> kthread_stop() frees the task_struct, but we might still try to wake up
> the cleaner kthread from somewhere (e.g., from the transaction kthread).
> So we really need to keep the cleaner alive but not doing work.

This dependency then needs to be documented via a comment or at the very
least mentioned in the changelog. Is it possible to refactor the code
(in a different patch) to actually ensure that transaction is stopped
and then kthread as well to remove this dependency ?

> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: fix missing delayed iputs on unmount
  2018-10-31 16:40     ` Nikolay Borisov
@ 2018-10-31 16:48       ` Omar Sandoval
  0 siblings, 0 replies; 10+ messages in thread
From: Omar Sandoval @ 2018-10-31 16:48 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team

On Wed, Oct 31, 2018 at 06:40:29PM +0200, Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 18:35 ч., Omar Sandoval wrote:
> > On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> >>> From: Omar Sandoval <osandov@fb.com>
> >>>
> >>> There's a race between close_ctree() and cleaner_kthread().
> >>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >>> sees it set, but this is racy; the cleaner might have already checked
> >>> the bit and could be cleaning stuff. In particular, if it deletes unused
> >>> block groups, it will create delayed iputs for the free space cache
> >>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> >>> longer running delayed iputs after a commit. Therefore, if the cleaner
> >>> creates more delayed iputs after delayed iputs are run in
> >>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> >>> inode crash from the VFS.
> >>>
> >>> Fix it by parking the cleaner before we actually close anything. Then,
> >>> any remaining delayed iputs will always be handled in
> >>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> >>> is really the last commit, so we can get rid of the commit in
> >>> cleaner_kthread().
> >>>
> >>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> >>> Signed-off-by: Omar Sandoval <osandov@fb.com>
> >>> ---
> >>> We found this with a stress test that our containers team runs. I'm
> >>> wondering if this same race could have caused any other issues other
> >>> than this new iput thing, but I couldn't identify any.
> >>>
> >>>  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> >>>  1 file changed, 7 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index b0ab41da91d1..7c17284ae3c2 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> >>>  	struct btrfs_root *root = arg;
> >>>  	struct btrfs_fs_info *fs_info = root->fs_info;
> >>>  	int again;
> >>> -	struct btrfs_trans_handle *trans;
> >>>  
> >>> -	do {
> >>> +	while (1) {
> >>>  		again = 0;
> >>>  
> >>>  		/* Make the cleaner go to sleep early. */
> >>> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> >>>  		 */
> >>>  		btrfs_delete_unused_bgs(fs_info);
> >>>  sleep:
> >>> +		if (kthread_should_park())
> >>> +			kthread_parkme();
> >>> +		if (kthread_should_stop())
> >>> +			return 0;
> >>>  		if (!again) {
> >>>  			set_current_state(TASK_INTERRUPTIBLE);
> >>> -			if (!kthread_should_stop())
> >>> -				schedule();
> >>> +			schedule();
> >>>  			__set_current_state(TASK_RUNNING);
> >>>  		}
> >>> -	} while (!kthread_should_stop());
> >>> -
> >>> -	/*
> >>> -	 * Transaction kthread is stopped before us and wakes us up.
> >>> -	 * However we might have started a new transaction and COWed some
> >>> -	 * tree blocks when deleting unused block groups for example. So
> >>> -	 * make sure we commit the transaction we started to have a clean
> >>> -	 * shutdown when evicting the btree inode - if it has dirty pages
> >>> -	 * when we do the final iput() on it, eviction will trigger a
> >>> -	 * writeback for it which will fail with null pointer dereferences
> >>> -	 * since work queues and other resources were already released and
> >>> -	 * destroyed by the time the iput/eviction/writeback is made.
> >>> -	 */
> >>> -	trans = btrfs_attach_transaction(root);
> >>> -	if (IS_ERR(trans)) {
> >>> -		if (PTR_ERR(trans) != -ENOENT)
> >>> -			btrfs_err(fs_info,
> >>> -				  "cleaner transaction attach returned %ld",
> >>> -				  PTR_ERR(trans));
> >>> -	} else {
> >>> -		int ret;
> >>> -
> >>> -		ret = btrfs_commit_transaction(trans);
> >>> -		if (ret)
> >>> -			btrfs_err(fs_info,
> >>> -				  "cleaner open transaction commit returned %d",
> >>> -				  ret);
> >>>  	}
> >>> -
> >>> -	return 0;
> >>>  }
> >>>  
> >>>  static int transaction_kthread(void *arg)
> >>> @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> >>>  	int ret;
> >>>  
> >>>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> >>> +	kthread_park(fs_info->cleaner_kthread);
> >>
> >> Can't you directly call kthread_stop here? When you park the thread it
> >> will sleep and then when you call kthread_stop that function will unpark
> >> the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
> >> just return 0. So the from the moment the thread is parked until it's
> >> stopped it doesn't have a chance to do useful work.
> > 
> > kthread_stop() frees the task_struct, but we might still try to wake up
> > the cleaner kthread from somewhere (e.g., from the transaction kthread).
> > So we really need to keep the cleaner alive but not doing work.
> 
> This dependency then needs to be documented via a comment or at the very
> least mentioned in the changelog. Is it possible to refactor the code
> (in a different patch) to actually ensure that transaction is stopped
> and then kthread as well to remove this dependency ?

Then we get the same issue with things trying to wake up the transaction
kthread after it's been freed (__btrfs_end_transaction(), in
particular). Maybe we could make it work, but it seems very fragile.

I'll send a v2 with a comment.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-10-31 16:48 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-10-31  0:14 [PATCH] Btrfs: fix missing delayed iputs on unmount Omar Sandoval
2018-10-31  7:14 ` Lu Fengqi
2018-10-31  8:26   ` Nikolay Borisov
2018-10-31  8:38     ` Nikolay Borisov
2018-10-31  7:41 ` Lu Fengqi
2018-10-31 16:37   ` Omar Sandoval
2018-10-31  8:43 ` Nikolay Borisov
2018-10-31 16:35   ` Omar Sandoval
2018-10-31 16:40     ` Nikolay Borisov
2018-10-31 16:48       ` Omar Sandoval

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).