From: Filipe Manana <fdmanana@kernel.org>
To: Liu Bo <bo.li.liu@oracle.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] Btrfs: fix log replay failure after unlink and link combination
Date: Fri, 2 Mar 2018 21:02:03 +0000 [thread overview]
Message-ID: <CAL3q7H45j8QkGzj_03rKk4mEndXXj-Rh-bpXwp9OM-7nQ3LFhA@mail.gmail.com> (raw)
In-Reply-To: <20180302182933.GE30920@dhcp-10-211-47-181.usdhcp.oraclecorp.com>
On Fri, Mar 2, 2018 at 6:29 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Wed, Feb 28, 2018 at 03:56:10PM +0000, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> If we have a file with 2 (or more) hard links in the same directory,
>> remove one of the hard links, create a new file (or link an existing file)
>> in the same directory with the name of the removed hard link, and then
>> finally fsync the new file, we end up with a log that fails to replay,
>> causing a mount failure.
>>
>> Example:
>>
>> $ mkfs.btrfs -f /dev/sdb
>> $ mount /dev/sdb /mnt
>>
>> $ mkdir /mnt/testdir
>> $ touch /mnt/testdir/foo
>> $ ln /mnt/testdir/foo /mnt/testdir/bar
>>
>> $ sync
>>
>> $ unlink /mnt/testdir/bar
>> $ touch /mnt/testdir/bar
>> $ xfs_io -c "fsync" /mnt/testdir/bar
>>
>> <power failure>
>>
>> $ mount /dev/sdb /mnt
>> mount: mount(2) failed: /mnt: No such file or directory
>>
>> When replaying the log, for that example, we also see the following in
>> dmesg/syslog:
>>
>> [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
>> [71813.674204] ------------[ cut here ]------------
>> [71813.675694] BTRFS: Transaction aborted (error -2)
>> [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
>> [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
>> [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1
>> [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
>> [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
>> [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
>> [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
>> [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
>> [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
>> [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
>> [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
>> [71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>> [71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
>> [71813.679669] Call Trace:
>> [71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs]
>> [71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs]
>> [71813.679669] add_inode_ref+0x71e/0x851 [btrfs]
>> [71813.679669] ? __lock_is_held+0x39/0x71
>> [71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs]
>> [71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs]
>> [71813.679669] ? rcu_read_unlock+0x3a/0x57
>> [71813.679669] ? __lock_is_held+0x39/0x71
>> [71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs]
>> [71813.679669] walk_log_tree+0xad/0x188 [btrfs]
>> [71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
>> [71813.679669] ? replay_one_extent+0x544/0x544 [btrfs]
>> [71813.679669] open_ctree+0x1cf6/0x2209 [btrfs]
>> [71813.679669] btrfs_mount_root+0x368/0x482 [btrfs]
>> [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
>> [71813.679669] ? __lockdep_init_map+0x176/0x1c2
>> [71813.679669] ? mount_fs+0x64/0x10b
>> [71813.679669] mount_fs+0x64/0x10b
>> [71813.679669] vfs_kern_mount+0x68/0xce
>> [71813.679669] btrfs_mount+0x13e/0x772 [btrfs]
>> [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
>> [71813.679669] ? __lockdep_init_map+0x176/0x1c2
>> [71813.679669] ? mount_fs+0x64/0x10b
>> [71813.679669] mount_fs+0x64/0x10b
>> [71813.679669] vfs_kern_mount+0x68/0xce
>> [71813.679669] do_mount+0x6e5/0x973
>> [71813.679669] ? memdup_user+0x3e/0x5c
>> [71813.679669] SyS_mount+0x72/0x98
>> [71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b
>> [71813.679669] RIP: 0033:0x7f7cedf150ba
>> [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
>> [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
>> [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
>> [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
>> [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
>> [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
>> [71814.128078] BTRFS error (device dm-0): open_ctree failed
>>
>> This happens because the log has inode reference items for both inode 258
>> (the first file we created) and inode 259 (the second file created), and
>> when processing the reference item for inode 258, we replace the
>> corresponding item in the subvolume tree (which has two names, "foo" and
>> "bar") witht he one in the log (which only has one name, "foo") without
>> removing the corresponding dir index keys from the parent directory.
>> Later, when processing the inode reference item for inode 259, which has
>> a name of "bar" associated to it, we notice that dir index entries exist
>> for that name and for a different inode, so we attempt to unlink that
>> name, which fails because the inode reference item for inode 258 no longer
>> has the name "bar" associated to it, making a call to btrfs_unlink_inode()
>> fail with a -ENOENT error.
>>
>> Fix this by unlinking all the names in an inode reference item from a
>> subvolume tree that are not present in the inode reference item found in
>> the log tree, before overwriting it with the item from the log tree.
>
> Since the order during replaying is INODE_ITEM then DIR_INDEX then
> other types, in this particular case, can we fix the problem by also
> logging the parent so that we have the correct DIR_INDEX?
>
> With DIR_INDEX, the problem would be fixed simpler, wouldn't it?
It wouldn't work when the parent directory has a higher inode number
then the child.
Plus for large directories, that would take a performance penalty on
fsync'ing any file inside them (existing or new).
thanks
>
> Thanks,
>
> -liubo
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>> fs/btrfs/ctree.h | 5 ++-
>> fs/btrfs/inode-item.c | 44 ++++++++++++--------
>> fs/btrfs/tree-log.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 139 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 1a462ab85c49..5d33478bc704 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3095,7 +3095,10 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>> u64 inode_objectid, u64 ref_objectid, int ins_len,
>> int cow);
>>
>> -int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
>> +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
>> + const char *name,
>> + int name_len, struct btrfs_inode_ref **ref_ret);
>> +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
>> u64 ref_objectid, const char *name,
>> int name_len,
>> struct btrfs_inode_extref **extref_ret);
>> diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
>> index 39c968f80157..65e1a76bf755 100644
>> --- a/fs/btrfs/inode-item.c
>> +++ b/fs/btrfs/inode-item.c
>> @@ -22,10 +22,10 @@
>> #include "transaction.h"
>> #include "print-tree.h"
>>
>> -static int find_name_in_backref(struct btrfs_path *path, const char *name,
>> - int name_len, struct btrfs_inode_ref **ref_ret)
>> +int btrfs_find_name_in_backref(struct extent_buffer *leaf, int slot,
>> + const char *name,
>> + int name_len, struct btrfs_inode_ref **ref_ret)
>> {
>> - struct extent_buffer *leaf;
>> struct btrfs_inode_ref *ref;
>> unsigned long ptr;
>> unsigned long name_ptr;
>> @@ -33,9 +33,8 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>> u32 cur_offset = 0;
>> int len;
>>
>> - leaf = path->nodes[0];
>> - item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>> - ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> + item_size = btrfs_item_size_nr(leaf, slot);
>> + ptr = btrfs_item_ptr_offset(leaf, slot);
>> while (cur_offset < item_size) {
>> ref = (struct btrfs_inode_ref *)(ptr + cur_offset);
>> len = btrfs_inode_ref_name_len(leaf, ref);
>> @@ -44,18 +43,19 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
>> if (len != name_len)
>> continue;
>> if (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0) {
>> - *ref_ret = ref;
>> + if (ref_ret)
>> + *ref_ret = ref;
>> return 1;
>> }
>> }
>> return 0;
>> }
>>
>> -int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
>> +int btrfs_find_name_in_ext_backref(struct extent_buffer *leaf, int slot,
>> + u64 ref_objectid,
>> const char *name, int name_len,
>> struct btrfs_inode_extref **extref_ret)
>> {
>> - struct extent_buffer *leaf;
>> struct btrfs_inode_extref *extref;
>> unsigned long ptr;
>> unsigned long name_ptr;
>> @@ -63,9 +63,8 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
>> u32 cur_offset = 0;
>> int ref_name_len;
>>
>> - leaf = path->nodes[0];
>> - item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>> - ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> + item_size = btrfs_item_size_nr(leaf, slot);
>> + ptr = btrfs_item_ptr_offset(leaf, slot);
>>
>> /*
>> * Search all extended backrefs in this item. We're only
>> @@ -113,7 +112,9 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>> return ERR_PTR(ret);
>> if (ret > 0)
>> return NULL;
>> - if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
>> + if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
>> + ref_objectid, name, name_len,
>> + &extref))
>> return NULL;
>> return extref;
>> }
>> @@ -155,7 +156,8 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
>> * This should always succeed so error here will make the FS
>> * readonly.
>> */
>> - if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
>> + if (!btrfs_find_name_in_ext_backref(path->nodes[0], path->slots[0],
>> + ref_objectid,
>> name, name_len, &extref)) {
>> btrfs_handle_fs_error(root->fs_info, -ENOENT, NULL);
>> ret = -EROFS;
>> @@ -225,7 +227,8 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>> } else if (ret < 0) {
>> goto out;
>> }
>> - if (!find_name_in_backref(path, name, name_len, &ref)) {
>> + if (!btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
>> + name, name_len, &ref)) {
>> ret = -ENOENT;
>> search_ext_refs = 1;
>> goto out;
>> @@ -293,7 +296,9 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
>> ret = btrfs_insert_empty_item(trans, root, path, &key,
>> ins_len);
>> if (ret == -EEXIST) {
>> - if (btrfs_find_name_in_ext_backref(path, ref_objectid,
>> + if (btrfs_find_name_in_ext_backref(path->nodes[0],
>> + path->slots[0],
>> + ref_objectid,
>> name, name_len, NULL))
>> goto out;
>>
>> @@ -351,7 +356,8 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>> if (ret == -EEXIST) {
>> u32 old_size;
>>
>> - if (find_name_in_backref(path, name, name_len, &ref))
>> + if (btrfs_find_name_in_backref(path->nodes[0], path->slots[0],
>> + name, name_len, &ref))
>> goto out;
>>
>> old_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
>> @@ -365,7 +371,9 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>> ret = 0;
>> } else if (ret < 0) {
>> if (ret == -EOVERFLOW) {
>> - if (find_name_in_backref(path, name, name_len, &ref))
>> + if (btrfs_find_name_in_backref(path->nodes[0],
>> + path->slots[0],
>> + name, name_len, &ref))
>> ret = -EEXIST;
>> else
>> ret = -EMLINK;
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 411a022489e4..fd573816f461 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -966,7 +966,9 @@ static noinline int backref_in_log(struct btrfs_root *log,
>> ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
>>
>> if (key->type == BTRFS_INODE_EXTREF_KEY) {
>> - if (btrfs_find_name_in_ext_backref(path, ref_objectid,
>> + if (btrfs_find_name_in_ext_backref(path->nodes[0],
>> + path->slots[0],
>> + ref_objectid,
>> name, namelen, NULL))
>> match = 1;
>>
>> @@ -1190,7 +1192,8 @@ static int extref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>> read_extent_buffer(eb, *name, (unsigned long)&extref->name,
>> *namelen);
>>
>> - *index = btrfs_inode_extref_index(eb, extref);
>> + if (index)
>> + *index = btrfs_inode_extref_index(eb, extref);
>> if (parent_objectid)
>> *parent_objectid = btrfs_inode_extref_parent(eb, extref);
>>
>> @@ -1211,12 +1214,102 @@ static int ref_get_fields(struct extent_buffer *eb, unsigned long ref_ptr,
>>
>> read_extent_buffer(eb, *name, (unsigned long)(ref + 1), *namelen);
>>
>> - *index = btrfs_inode_ref_index(eb, ref);
>> + if (index)
>> + *index = btrfs_inode_ref_index(eb, ref);
>>
>> return 0;
>> }
>>
>> /*
>> + * Take an inode reference item from the log tree and iterate all names from the
>> + * inode reference item in the subvolume tree with the same key (if it exists).
>> + * For any name that is not in the inode reference item from the log tree, do a
>> + * proper unlink of that name (that is, remove its entry from the inode
>> + * reference item and both dir index keys).
>> + */
>> +static int unlink_old_inode_refs(struct btrfs_trans_handle *trans,
>> + struct btrfs_root *root,
>> + struct btrfs_path *path,
>> + struct btrfs_inode *inode,
>> + struct extent_buffer *log_eb,
>> + int log_slot,
>> + struct btrfs_key *key)
>> +{
>> + int ret;
>> + unsigned long ref_ptr;
>> + unsigned long ref_end;
>> + struct extent_buffer *eb;
>> +
>> +again:
>> + btrfs_release_path(path);
>> + ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
>> + if (ret > 0) {
>> + ret = 0;
>> + goto out;
>> + }
>> + if (ret < 0)
>> + goto out;
>> +
>> + eb = path->nodes[0];
>> + ref_ptr = btrfs_item_ptr_offset(eb, path->slots[0]);
>> + ref_end = ref_ptr + btrfs_item_size_nr(eb, path->slots[0]);
>> + while (ref_ptr < ref_end) {
>> + char *name = NULL;
>> + int namelen;
>> + u64 parent_id;
>> +
>> + if (key->type == BTRFS_INODE_EXTREF_KEY) {
>> + ret = extref_get_fields(eb, ref_ptr, &namelen, &name,
>> + NULL, &parent_id);
>> + } else {
>> + parent_id = key->offset;
>> + ret = ref_get_fields(eb, ref_ptr, &namelen, &name,
>> + NULL);
>> + }
>> + if (ret)
>> + goto out;
>> +
>> + if (key->type == BTRFS_INODE_EXTREF_KEY)
>> + ret = btrfs_find_name_in_ext_backref(log_eb, log_slot,
>> + parent_id, name,
>> + namelen, NULL);
>> + else
>> + ret = btrfs_find_name_in_backref(log_eb, log_slot, name,
>> + namelen, NULL);
>> +
>> + if (!ret) {
>> + struct inode *dir;
>> +
>> + btrfs_release_path(path);
>> + dir = read_one_inode(root, parent_id);
>> + if (!dir) {
>> + ret = -ENOENT;
>> + kfree(name);
>> + goto out;
>> + }
>> + ret = btrfs_unlink_inode(trans, root, BTRFS_I(dir),
>> + inode, name, namelen);
>> + kfree(name);
>> + iput(dir);
>> + if (ret)
>> + goto out;
>> + goto again;
>> + }
>> +
>> + kfree(name);
>> + ref_ptr += namelen;
>> + if (key->type == BTRFS_INODE_EXTREF_KEY)
>> + ref_ptr += sizeof(struct btrfs_inode_extref);
>> + else
>> + ref_ptr += sizeof(struct btrfs_inode_ref);
>> + }
>> + ret = 0;
>> + out:
>> + btrfs_release_path(path);
>> + return ret;
>> +}
>> +
>> +/*
>> * replay one inode back reference item found in the log tree.
>> * eb, slot and key refer to the buffer and key found in the log tree.
>> * root is the destination we are replaying into, and path is for temp
>> @@ -1344,6 +1437,19 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
>> }
>> }
>>
>> + /*
>> + * Before we overwrite the inode reference item in the subvolume tree
>> + * with the item from the log tree, we must unlink all names from the
>> + * parent directory that are in the subvolume's tree inode reference
>> + * item, otherwise we end up with an inconsistent subvolume tree where
>> + * dir index entries exist for a name but there is no inode reference
>> + * item with the same name.
>> + */
>> + ret = unlink_old_inode_refs(trans, root, path, BTRFS_I(inode), eb, slot,
>> + key);
>> + if (ret)
>> + goto out;
>> +
>> /* finally write the back reference in the inode */
>> ret = overwrite_item(trans, root, path, eb, slot, key);
>> out:
>> --
>> 2.11.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
prev parent reply other threads:[~2018-03-02 21:02 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-28 15:56 [PATCH] Btrfs: fix log replay failure after unlink and link combination fdmanana
2018-03-02 18:29 ` Liu Bo
2018-03-02 19:00 ` Liu Bo
2018-03-02 21:08 ` Filipe Manana
2018-03-02 21:02 ` Filipe Manana [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAL3q7H45j8QkGzj_03rKk4mEndXXj-Rh-bpXwp9OM-7nQ3LFhA@mail.gmail.com \
--to=fdmanana@kernel.org \
--cc=bo.li.liu@oracle.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).