From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C7C67C433EF for ; Mon, 21 Feb 2022 10:49:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345111AbiBUKuM (ORCPT ); Mon, 21 Feb 2022 05:50:12 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:50082 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1355239AbiBUKuB (ORCPT ); Mon, 21 Feb 2022 05:50:01 -0500 Received: from out20-97.mail.aliyun.com (out20-97.mail.aliyun.com [115.124.20.97]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 309C4617A for ; Mon, 21 Feb 2022 02:11:43 -0800 (PST) X-Alimail-AntiSpam: AC=CONTINUE;BC=0.04437186|-1;CH=green;DM=|CONTINUE|false|;DS=CONTINUE|ham_alarm|0.14753-0.000839838-0.851631;FP=0|0|0|0|0|-1|-1|-1;HT=ay29a033018047198;MF=wangyugui@e16-tech.com;NM=1;PH=DS;RN=2;RT=2;SR=0;TI=SMTPD_---.MtOA12F_1645438300; Received: from 192.168.2.112(mailfrom:wangyugui@e16-tech.com fp:SMTPD_---.MtOA12F_1645438300) by smtp.aliyun-inc.com(33.38.168.99); Mon, 21 Feb 2022 18:11:40 +0800 Date: Mon, 21 Feb 2022 18:11:42 +0800 From: Wang Yugui To: fdmanana@kernel.org Subject: Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync Cc: linux-btrfs@vger.kernel.org In-Reply-To: <0b6139a324f2cd5b470d58c45617fc29ab893f12.1645098951.git.fdmanana@suse.com> References: <0b6139a324f2cd5b470d58c45617fc29ab893f12.1645098951.git.fdmanana@suse.com> Message-Id: <20220221181141.DD77.409509F4@e16-tech.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.75.04 [en] Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Hi, xfstest btrfs/261 PASSED 40 times without this patch. so there is some problem in xfstest btrfs/261? Best Regards Wang Yugui (wangyugui@e16-tech.com) 2022/02/21 > From: Filipe Manana > > When doing a full fsync, if we have prealloc extents beyond (or at) eof, > and the leaves that contain them were not modified in the current > transaction, we end up not logging them. This results in losing those > extents when we replay the log after a power failure, since the inode is > truncated to the current value of the logged i_size. > > Just like for the fast fsync path, we need to always log all prealloc > extents starting at or beyond i_size. The fast fsync case was fixed in > commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size > after fsync log replay") but it missed the full fsync path. The problem > exists since the very early days, when the log tree was added by > commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize > synchronous operations"). > > Example reproducer: > > $ mkfs.btrfs -f /dev/sdc > $ mount /dev/sdc /mnt > > # Create our test file with many file extent items, so that they span > # several leaves of metadata, even if the node/page size is 64K. Use > # direct IO and not fsync/O_SYNC because it's both faster and it avoids > # clearing the full sync flag from the inode - we want the fsync below > # to trigger the slow full sync code path. > $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo > > # Now add two preallocated extents to our file without extending the > # file's size. One right at i_size, and another further beyond, leaving > # a gap between the two prealloc extents. > $ xfs_io -c "falloc -k 16M 1M" /mnt/foo > $ xfs_io -c "falloc -k 20M 1M" /mnt/foo > > # Make sure everything is durably persisted and the transaction is > # committed. This makes all created extents to have a generation lower > # than the generation of the transaction used by the next write and > # fsync. > sync > > # Now overwrite only the first extent, which will result in modifying > # only the first leaf of metadata for our inode. Then fsync it. This > # fsync will use the slow code path (inode full sync bit is set) because > # it's the first fsync since the inode was created/loaded. > $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo > > # Extent list before power failure. > $ xfs_io -c "fiemap -v" /mnt/foo > /mnt/foo: > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS > 0: [0..7]: 2178048..2178055 8 0x0 > 1: [8..16383]: 26632..43007 16376 0x0 > 2: [16384..32767]: 2156544..2172927 16384 0x0 > 3: [32768..34815]: 2172928..2174975 2048 0x800 > 4: [34816..40959]: hole 6144 > 5: [40960..43007]: 2174976..2177023 2048 0x801 > > > > # Mount fs again, trigger log replay. > $ mount /dev/sdc /mnt > > # Extent list after power failure and log replay. > $ xfs_io -c "fiemap -v" /mnt/foo > /mnt/foo: > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS > 0: [0..7]: 2178048..2178055 8 0x0 > 1: [8..16383]: 26632..43007 16376 0x0 > 2: [16384..32767]: 2156544..2172927 16384 0x1 > > # The prealloc extents at file offsets 16M and 20M are missing. > > So fix this by calling btrfs_log_prealloc_extents() when we are doing a > full fsync, so that we always log all prealloc extents beyond eof. > > A test case for fstests will follow soon. > > CC: stable@vger.kernel.org # 4.19+ > Signed-off-by: Filipe Manana > --- > fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------ > 1 file changed, 31 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > index a483337e8f41..71a5a961fef7 100644 > --- a/fs/btrfs/tree-log.c > +++ b/fs/btrfs/tree-log.c > @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans, > > /* > * Log all prealloc extents beyond the inode's i_size to make sure we do not > - * lose them after doing a fast fsync and replaying the log. We scan the > + * lose them after doing a full/fast fsync and replaying the log. We scan the > * subvolume's root instead of iterating the inode's extent map tree because > * otherwise we can log incorrect extent items based on extent map conversion. > * That can happen due to the fact that extent maps are merged when they > @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, > struct btrfs_log_ctx *ctx, > bool *need_log_inode_item) > { > + const u64 i_size = i_size_read(&inode->vfs_inode); > struct btrfs_root *root = inode->root; > int ins_start_slot = 0; > int ins_nr = 0; > @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, > if (min_key->type > max_key->type) > break; > > - if (min_key->type == BTRFS_INODE_ITEM_KEY) > + if (min_key->type == BTRFS_INODE_ITEM_KEY) { > *need_log_inode_item = false; > - > - if ((min_key->type == BTRFS_INODE_REF_KEY || > - min_key->type == BTRFS_INODE_EXTREF_KEY) && > - inode->generation == trans->transid && > - !recursive_logging) { > + } else if (min_key->type == BTRFS_EXTENT_DATA_KEY && > + min_key->offset >= i_size) { > + /* > + * Extents at and beyond eof are logged with > + * btrfs_log_prealloc_extents(). > + * Only regular files have BTRFS_EXTENT_DATA_KEY keys, > + * and no keys greater than that, so bail out. > + */ > + break; > + } else if ((min_key->type == BTRFS_INODE_REF_KEY || > + min_key->type == BTRFS_INODE_EXTREF_KEY) && > + inode->generation == trans->transid && > + !recursive_logging) { > u64 other_ino = 0; > u64 other_parent = 0; > > @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, > btrfs_release_path(path); > goto next_key; > } > - } > - > - /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */ > - if (min_key->type == BTRFS_XATTR_ITEM_KEY) { > + } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) { > + /* Skip xattrs, logged later with btrfs_log_all_xattrs() */ > if (ins_nr == 0) > goto next_slot; > ret = copy_items(trans, inode, dst_path, path, > @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, > break; > } > } > - if (ins_nr) > + if (ins_nr) { > ret = copy_items(trans, inode, dst_path, path, ins_start_slot, > ins_nr, inode_only, logged_isize); > + if (ret) > + return ret; > + } > + > + if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) { > + /* > + * Release the path because otherwise we might attempt to double > + * lock the same leaf with btrfs_log_prealloc_extents() below. > + */ > + btrfs_release_path(path); > + ret = btrfs_log_prealloc_extents(trans, inode, dst_path); > + } > > return ret; > } > -- > 2.33.0