From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 286DBC61DA4 for ; Mon, 6 Mar 2023 23:07:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230033AbjCFXH1 (ORCPT ); Mon, 6 Mar 2023 18:07:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50060 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229702AbjCFXH0 (ORCPT ); Mon, 6 Mar 2023 18:07:26 -0500 Received: from out28-68.mail.aliyun.com (out28-68.mail.aliyun.com [115.124.28.68]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 63F096A41A for ; Mon, 6 Mar 2023 15:07:23 -0800 (PST) X-Alimail-AntiSpam: AC=CONTINUE;BC=0.04436261|-1;CH=green;DM=|CONTINUE|false|;DS=CONTINUE|ham_system_inform|0.0168781-3.81792e-05-0.983084;FP=0|0|0|0|0|-1|-1|-1;HT=ay29a033018047204;MF=wangyugui@e16-tech.com;NM=1;PH=DS;RN=3;RT=3;SR=0;TI=SMTPD_---.RfaU-3R_1678144040; Received: from 192.168.2.112(mailfrom:wangyugui@e16-tech.com fp:SMTPD_---.RfaU-3R_1678144040) by smtp.aliyun-inc.com; Tue, 07 Mar 2023 07:07:21 +0800 Date: Tue, 07 Mar 2023 07:07:21 +0800 From: Wang Yugui To: Boris Burkov Subject: Re: [PATCH v2 2/2] btrfs: fix dio continue after short write due to buffer page fault Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com In-Reply-To: References: Message-Id: <20230307070720.63BA.409509F4@e16-tech.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.81.04 [en] Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Hi, fstests generic/276 trigger a deadloop with the patch in misc-next. sysrq(w) output: [33812.455122] sysrq: Show Blocked State [33812.458831] task:kworker/u20:31 state:D stack:0 pid:1457865 ppid:2 flags:0x00004000 [33812.467371] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [33812.473916] Call Trace: [33812.476365] [33812.478469] __schedule+0x2cb/0x880 [33812.481965] schedule+0x50/0xc0 [33812.485123] btrfs_start_ordered_extent+0xf7/0x140 [btrfs] [33812.490668] ? add_wait_queue+0xa0/0xa0 [33812.494508] btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs] [33812.500219] btrfs_work_helper+0x27f/0x360 [btrfs] [33812.505073] process_one_work+0x1b0/0x390 [33812.509101] worker_thread+0x3c/0x370 [33812.512779] ? process_one_work+0x390/0x390 [33812.516979] kthread+0xe3/0x110 [33812.520138] ? kthread_complete_and_exit+0x20/0x20 [33812.524945] ret_from_fork+0x1f/0x30 [33812.528542] [33812.530751] task:umount state:D stack:0 pid:1465218 ppid:1464868 flags:0x00004002 [33812.539372] Call Trace: [33812.541823] [33812.543938] __schedule+0x2cb/0x880 [33812.547441] ? wait_for_completion+0x83/0x160 [33812.551807] schedule+0x50/0xc0 [33812.554964] schedule_timeout+0x269/0x310 [33812.559005] ? wait_for_completion+0x83/0x160 [33812.563370] wait_for_completion+0xb5/0x160 [33812.567561] btrfs_wait_ordered_extents+0x346/0x450 [btrfs] [33812.573197] btrfs_wait_ordered_roots+0x163/0x250 [btrfs] [33812.578645] btrfs_sync_fs+0x3e/0x1c0 [btrfs] [33812.583052] sync_filesystem+0x74/0x90 [33812.586830] generic_shutdown_super+0x22/0x120 [33812.591281] kill_anon_super+0x14/0x30 [33812.595049] btrfs_kill_super+0x12/0x20 [btrfs] [33812.599629] deactivate_locked_super+0x2c/0x70 [33812.604079] cleanup_mnt+0xb8/0x140 [33812.607585] task_work_run+0x6a/0xb0 [33812.611168] exit_to_user_mode_prepare+0x1b9/0x1c0 [33812.615977] syscall_exit_to_user_mode+0x12/0x30 [33812.620617] do_syscall_64+0x67/0x80 [33812.624198] ? syscall_exit_to_user_mode+0x12/0x30 [33812.629007] ? do_syscall_64+0x67/0x80 [33812.632771] ? syscall_exit_to_user_mode+0x12/0x30 [33812.637579] ? do_syscall_64+0x67/0x80 [33812.641346] ? syscall_exit_to_user_mode+0x12/0x30 [33812.646144] ? do_syscall_64+0x67/0x80 [33812.649907] ? exc_page_fault+0x64/0x140 [33812.653847] entry_SYSCALL_64_after_hwframe+0x63/0xcd [33812.658932] RIP: 0033:0x7f2eb454e56b [33812.662510] RSP: 002b:00007ffdf683ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [33812.670088] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f2eb454e56b [33812.677242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000563731e5b5a0 [33812.684394] RBP: 0000563731e5b370 R08: 0000000000000000 R09: 00007ffdf6839a00 [33812.691536] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [33812.698678] R13: 0000563731e5b5a0 R14: 0000563731e5b480 R15: 0000563731e5b370 [33812.705819] Best Regards Wang Yugui (wangyugui@e16-tech.com) 2023/03/07 > If an application is doing direct io to a btrfs file and experiences a > page fault reading from the write buffer, iomap will issue a partial > bio, and allow the fs to keep going. However, there was a subtle bug in > this codepath in the btrfs dio iomap implementation that led to the > partial write ending up as a gap in the file's extents and to be read > back as zeros. > > The sequence of events in a partial write, lightly summarized and > trimmed down for brevity is as follows: > > ====WRITING TASK==== > btrfs_direct_write > __iomap_dio_write > iomap_iter > btrfs_dio_iomap_begin # create full ordered extent > iomap_dio_bio_iter > bio_iov_iter_get_pages # page fault; partial read > submit_bio # partial bio > iomap_iter > btrfs_dio_iomap_end > btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR; > # submit to finish_ordered_fn wq > fault_in_iov_iter_readable # btrfs_direct_write detects partial write > __iomap_dio_write > iomap_iter > btrfs_dio_iomap_begin # create second partial ordered extent > iomap_dio_bio_iter > bio_iov_iter_get_pages # read all of remainder > submit_bio # partial bio with all of remainder > iomap_iter > btrfs_dio_iomap_end # nothing exciting to do with ordered io > > ====DIO ENDIO==== > ==FIRST PARTIAL BIO== > btrfs_dio_end_io > btrfs_mark_ordered_io_finished # bytes_left > 0 > # don't submit to finish_ordered_fn wq > ==SECOND PARTIAL BIO== > btrfs_dio_end_io > btrfs_mark_ordered_io_finished # bytes_left == 0 > # submit to finish_ordered_fn wq > > ====BTRFS FINISH ORDERED WQ==== > ==FIRST PARTIAL BIO== > btrfs_finish_ordered_io # called by dio_iomap_end_io, sees > # BTRFS_ORDERED_IOERR, just drops the > # ordered_extent > ==SECOND PARTIAL BIO== > btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file > # extents, csums, etc... > > The essence of the problem is that while btrfs_direct_write and iomap > properly interact to submit all the correct bios, there is insufficient > logic in the btrfs dio functions (btrfs_dio_iomap_begin, > btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to > ensure that every bio is at least a part of a completed ordered_extent. > And it is completing an ordered_extent that results in crucial > functionality like writing out a file extent for the range. > > More specifically, btrfs_dio_end_io treats the ordered extent as > unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it. > Thus, the finish io work doesn't result in file extents, csums, etc... > In the aftermath, such a file behaves as though it has a hole in it, > instead of the purportedly written data. > > We considered a few options for fixing the bug (apologies for any > incorrect summary of a proposal which I didn't implement and fully > understand): > 1. treat the partial bio as if we had truncated the file, which would > result in properly finishing it. > 2. split the ordered extent when submitting a partial bio. > 3. cache the ordered extent across calls to __iomap_dio_rw in > iter->private, so that we could reuse it and correctly apply several > bios to it. > > I had trouble with 1, and it felt the most like a hack, so I tried 2 > and 3. Since 3 has the benefit of also not creating an extra file > extent, and avoids an ordered extent lookup during bio submission, it > felt like the best option. > > A quick summary of the changes necessary to implement this cached > ordered_extent behavior: > - btrfs_direct_write keeps track of an ordered_extent for the duration > of a call, possible across several __iomap_dio_rws. > - zero the btrfs_dio_data before using it, since its fields constitute > state now. > - btrfs_dio_write uses dio_data to pass this ordered extent into and out > of __iomap_dio_rw. > - when the write is done, put the ordered_extent. > - if the short write happens to be length 0, then we _don't_ get an > extra bio, so we do need to cancel the ordered_extent like we used > to (and ditch the cached ordered extent) > - in btrfs_dio_iomap_begin, if the cached ordered extent is present, > skip all the work of creating it, just look up the extent mapping and > jump to setting up the iomap. (This part could likely be more > elegant..) > > Thanks to Josef, Christoph, and Filipe with their help figuring out the > bug and the fix. > > Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes") > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947 > Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/ > Link: https://pastebin.com/3SDaH8C6 > Signed-off-by: Boris Burkov > --- > fs/btrfs/btrfs_inode.h | 1 + > fs/btrfs/file.c | 11 ++++++- > fs/btrfs/inode.c | 75 +++++++++++++++++++++++++++++++----------- > 3 files changed, 67 insertions(+), 20 deletions(-) > > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h > index 49a92aa65de1..87020aa58121 100644 > --- a/fs/btrfs/btrfs_inode.h > +++ b/fs/btrfs/btrfs_inode.h > @@ -516,6 +516,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from, > ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, > size_t done_before); > struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter, > + struct btrfs_ordered_extent **ordered_extent, > size_t done_before); > > extern const struct dentry_operations btrfs_dentry_operations; > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index 5cc5a1faaef5..ec5c5355906b 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@ -1465,6 +1465,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) > ssize_t err; > unsigned int ilock_flags = 0; > struct iomap_dio *dio; > + struct btrfs_ordered_extent *ordered_extent = NULL; > > if (iocb->ki_flags & IOCB_NOWAIT) > ilock_flags |= BTRFS_ILOCK_TRY; > @@ -1526,7 +1527,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) > * got -EFAULT, faulting in the pages before the retry. > */ > from->nofault = true; > - dio = btrfs_dio_write(iocb, from, written); > + dio = btrfs_dio_write(iocb, from, &ordered_extent, written); > from->nofault = false; > > /* > @@ -1569,6 +1570,14 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) > goto relock; > } > } > + /* > + * We can't loop back to btrfs_dio_write, so we can drop the cached > + * ordered extent. Typically btrfs_dio_iomap_end will run and put the > + * ordered_extent, but this is needed to clean up in case of an error > + * path breaking out of iomap_iter before the final iomap_end call. > + */ > + if (ordered_extent) > + btrfs_put_ordered_extent(ordered_extent); > > /* > * If 'err' is -ENOTBLK or we have not written all data, then it means > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 44e9acc77a74..f1a59c5f3140 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -81,6 +81,7 @@ struct btrfs_dio_data { > struct extent_changeset *data_reserved; > bool data_space_reserved; > bool nocow_done; > + struct btrfs_ordered_extent *ordered; > }; > > struct btrfs_dio_private { > @@ -6976,6 +6977,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, > } > > static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, > + struct btrfs_dio_data *dio_data, > const u64 start, > const u64 len, > const u64 orig_start, > @@ -6986,7 +6988,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, > const int type) > { > struct extent_map *em = NULL; > - int ret; > + struct btrfs_ordered_extent *ordered; > > if (type != BTRFS_ORDERED_NOCOW) { > em = create_io_em(inode, start, len, orig_start, block_start, > @@ -6996,18 +6998,21 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, > if (IS_ERR(em)) > goto out; > } > - ret = btrfs_add_ordered_extent(inode, start, len, len, block_start, > - block_len, 0, > - (1 << type) | > - (1 << BTRFS_ORDERED_DIRECT), > - BTRFS_COMPRESS_NONE); > - if (ret) { > + ordered = btrfs_alloc_ordered_extent(inode, start, len, len, > + block_start, block_len, 0, > + (1 << type) | > + (1 << BTRFS_ORDERED_DIRECT), > + BTRFS_COMPRESS_NONE); > + if (IS_ERR(ordered)) { > if (em) { > free_extent_map(em); > btrfs_drop_extent_map_range(inode, start, > start + len - 1, false); > } > - em = ERR_PTR(ret); > + em = ERR_PTR(PTR_ERR(ordered)); > + } else { > + ASSERT(!dio_data->ordered); > + dio_data->ordered = ordered; > } > out: > > @@ -7015,6 +7020,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, > } > > static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, > + struct btrfs_dio_data *dio_data, > u64 start, u64 len) > { > struct btrfs_root *root = inode->root; > @@ -7030,7 +7036,8 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, > if (ret) > return ERR_PTR(ret); > > - em = btrfs_create_dio_extent(inode, start, ins.offset, start, > + em = btrfs_create_dio_extent(inode, dio_data, > + start, ins.offset, start, > ins.objectid, ins.offset, ins.offset, > ins.offset, BTRFS_ORDERED_REGULAR); > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > @@ -7375,7 +7382,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, > } > space_reserved = true; > > - em2 = btrfs_create_dio_extent(BTRFS_I(inode), start, len, > + em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, > + start, len, > orig_start, block_start, > len, orig_block_len, > ram_bytes, type); > @@ -7417,7 +7425,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, > goto out; > space_reserved = true; > > - em = btrfs_new_extent_direct(BTRFS_I(inode), start, len); > + em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len); > if (IS_ERR(em)) { > ret = PTR_ERR(em); > goto out; > @@ -7521,6 +7529,17 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, > } > } > > + if (dio_data->ordered) { > + ASSERT(write); > + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, > + dio_data->ordered->file_offset, > + dio_data->ordered->bytes_left); > + if (IS_ERR(em)) { > + ret = PTR_ERR(em); > + goto err; > + } > + goto map_iomap; > + } > memset(dio_data, 0, sizeof(*dio_data)); > > /* > @@ -7662,6 +7681,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, > else > free_extent_state(cached_state); > > +map_iomap: > /* > * Translate extent map information to iomap. > * We trim the extents (and move the addr) even though iomap code does > @@ -7715,13 +7735,25 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, > if (submitted < length) { > pos += submitted; > length -= submitted; > - if (write) > - btrfs_mark_ordered_io_finished(BTRFS_I(inode), NULL, > - pos, length, false); > - else > + if (write) { > + if (submitted == 0) { > + btrfs_mark_ordered_io_finished(BTRFS_I(inode), > + NULL, pos, > + length, false); > + btrfs_put_ordered_extent(dio_data->ordered); > + dio_data->ordered = NULL; > + } > + } else { > unlock_extent(&BTRFS_I(inode)->io_tree, pos, > pos + length - 1, NULL); > + } > ret = -ENOTBLK; > + } else { > + /* On the last bio, release our cached ordered_extent */ > + if (write) { > + btrfs_put_ordered_extent(dio_data->ordered); > + dio_data->ordered = NULL; > + } > } > > if (write) > @@ -7784,19 +7816,24 @@ static const struct iomap_dio_ops btrfs_dio_ops = { > > ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before) > { > - struct btrfs_dio_data data; > + struct btrfs_dio_data data = { }; > > return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops, > IOMAP_DIO_PARTIAL, &data, done_before); > } > > struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter, > + struct btrfs_ordered_extent **ordered_extent, > size_t done_before) > { > - struct btrfs_dio_data data; > + struct btrfs_dio_data dio_data = { .ordered = *ordered_extent }; > + struct iomap_dio *dio; > > - return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops, > - IOMAP_DIO_PARTIAL, &data, done_before); > + dio = __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops, > + IOMAP_DIO_PARTIAL, &dio_data, done_before); > + if (!IS_ERR_OR_NULL(dio)) > + *ordered_extent = dio_data.ordered; > + return dio; > } > > static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, > -- > 2.38.1