From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([59.151.112.132]:29884 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1751486AbcF3Dc0 (ORCPT ); Wed, 29 Jun 2016 23:32:26 -0400 Subject: Re: [PATCH 1/2] btrfs: fix fsfreeze hang caused by delayed iputs deal To: References: <20160629051510.17222-1-wangxg.fnst@cn.fujitsu.com> CC: "linux-btrfs@vger.kernel.org" From: Wang Xiaoguang Message-ID: <5774925E.6060106@cn.fujitsu.com> Date: Thu, 30 Jun 2016 11:30:38 +0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: hello, On 06/29/2016 07:16 PM, Filipe Manana wrote: > On Wed, Jun 29, 2016 at 6:15 AM, Wang Xiaoguang > wrote: >> When running fstests generic/068, sometimes we got below WARNING: >> xfs_io D ffff8800331dbb20 0 6697 6693 0x00000080 >> ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000 >> ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001 >> ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8 >> Call Trace: >> [] schedule+0x35/0x80 >> [] rwsem_down_read_failed+0xf2/0x140 >> [] ? __filemap_fdatawrite_range+0xd1/0x100 >> [] call_rwsem_down_read_failed+0x18/0x30 >> [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs] >> [] percpu_down_read+0x35/0x50 >> [] __sb_start_write+0x2c/0x40 >> [] start_transaction+0x2a5/0x4d0 [btrfs] >> [] btrfs_join_transaction+0x17/0x20 [btrfs] >> [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs] >> [] evict+0xba/0x1a0 >> [] iput+0x196/0x200 >> [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs] >> [] btrfs_commit_transaction+0x928/0xa80 [btrfs] >> [] btrfs_freeze+0x30/0x40 [btrfs] >> [] freeze_super+0xf0/0x190 >> [] do_vfs_ioctl+0x4a5/0x5c0 >> [] ? do_audit_syscall_entry+0x66/0x70 >> [] ? syscall_trace_enter_phase1+0x11f/0x140 >> [] SyS_ioctl+0x79/0x90 >> [] do_syscall_64+0x62/0x110 >> [] entry_SYSCALL64_slow_path+0x25/0x25 >> >> From this warning, freeze_super() already holds SB_FREEZE_FS, but >> btrfs_freeze() will call btrfs_commit_transaction() again, if >> btrfs_commit_transaction() finds that it has delayed iputs to handle, >> it'll start_transaction(), which will try to get SB_FREEZE_FS lock >> again, then deadlock occurs. >> >> The root cause is that in btrfs, sync_filesystem(sb) does not make >> sure all metadata is updated. See below race window in freeze_super(): >> sync_filesystem(sb); >> | >> | race window >> | In this period, cleaner_kthread() may be scheduled to >> | run, and it call btrfs_delete_unused_bgs() which will >> | add some delayed iputs. >> | >> sb->s_writers.frozen = SB_FREEZE_FS; >> sb_wait_write(sb, SB_FREEZE_FS); >> if (sb->s_op->freeze_fs) { >> /* freeze_fs will call btrfs_commit_transaction() */ >> ret = sb->s_op->freeze_fs(sb); >> > This pseudo diagram also doesn't well the problem. > You should add two timelines for 2 different CPUs/tasks where we see > one locking SB_FREEZE_FS and calling btrfs_freeze() > while the task in the other CPU calls btrfs_commit_transaction(), runs > delayed iputs and then starts/joins a transaction in the > eviction handler. OK, got it. > > > >> So if btrfs is doing freeze job, we should block >> btrfs_delete_unused_bgs(), to avoid add delayed iputs. > Nop, this isn't a real solution. > > You can have many other parts adding inodes to the delayed iputs list, > not just btrfs_delete_unused_bgs() (doing it indirectly for the free > space cache inodes). > For example, when an ordered extent completes we add its inode to the > delayed iputs list at btrfs_put_ordered_extent(), called through > btrfs_finish_ordered_io(). Yes, I know. Before sending this patch, I had searched related codes which call btrfs_add_delayed_iput(). I had thought sync_filesystem() in freeze_super() would ensure that all other btrfs_add_delayed_iput()s are called before transaction starts except the one in cleaner_kthread(). I also have one question, for compressed buffered write, even sync_filesystem() returns, we can not ensure the data is written to disk, right? It seems that btrfs_writepages() can return directly for compressed data, but with page still locked and no WRITEBACK flag set, so later filemap_fdatawait() won't work. btrfs_fdatawrite_range() is used to resolve this issue, but it seems that sync_filesystem() won't make btrfs_fdatawrite_range() called. Also would you be OK with this solution: diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 948aa18..89a996c6 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2278,7 +2278,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, kmem_cache_free(btrfs_trans_handle_cachep, trans); if (current != root->fs_info->transaction_kthread && - current != root->fs_info->cleaner_kthread) + current != root->fs_info->cleaner_kthread && + root->fs_info->sb->s_writers.frozen != SB_FREEZE_FS) btrfs_run_delayed_iputs(root); return ret; Regards, Xiaoguang Wang > > thanks > > > >> Signed-off-by: Wang Xiaoguang >> --- >> fs/btrfs/disk-io.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 863bf7a..fdbe0df 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -1846,8 +1846,11 @@ static int cleaner_kthread(void *arg) >> * after acquiring fs_info->delete_unused_bgs_mutex. So we >> * can't hold, nor need to, fs_info->cleaner_mutex when deleting >> * unused block groups. >> + * > Unneeded and unrelated change, please remove it. > >> */ >> + __sb_start_write(root->fs_info->sb, SB_FREEZE_WRITE, true); >> btrfs_delete_unused_bgs(root->fs_info); >> + __sb_end_write(root->fs_info->sb, SB_FREEZE_WRITE); > A comment explaining why we do this would be valuable, since we never > do these calls except when starting/committing transactions. > >> sleep: >> if (!again) { >> set_current_state(TASK_INTERRUPTIBLE); >> -- >> 2.9.0 >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >