Re: [PATCH] btrfs: Check the first key and level for cached extent buffer

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Nikolay Borisov <nborisov@suse.com>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org
Cc: Yoon Jungyeon <jungyeon@gatech.edu>
Subject: Re: [PATCH] btrfs: Check the first key and level for cached extent buffer
Date: Tue, 12 Mar 2019 16:32:53 +0800	[thread overview]
Message-ID: <cd40d9ec-1a9c-fc82-c50c-127b0fa6f44b@gmx.com> (raw)
In-Reply-To: <1dbd21a1-1e11-d877-553a-2961b3074c16@suse.com>



On 2019/3/12 下午4:11, Nikolay Borisov wrote:
> 
> 
> On 12.03.19 г. 9:57 ч., Nikolay Borisov wrote:
>>
>>
>> On 12.03.19 г. 9:45 ч., Qu Wenruo wrote:
>>> [BUG]
>>> When reading a file from a fuzzed image, kernel can panic like:
>>>   BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1
>>>   assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544
>>>   ------------[ cut here ]------------
>>>   kernel BUG at fs/btrfs/ctree.h:3500!
>>>   invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>   RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs]
>>>   Call Trace:
>>>    btrfs_lookup_csum+0x52/0x150 [btrfs]
>>>    __btrfs_lookup_bio_sums+0x209/0x640 [btrfs]
>>>    btrfs_submit_bio_hook+0x103/0x170 [btrfs]
>>>    submit_one_bio+0x59/0x80 [btrfs]
>>>    extent_read_full_page+0x58/0x80 [btrfs]
>>>    generic_file_read_iter+0x2f6/0x9d0
>>>    __vfs_read+0x14d/0x1a0
>>>    vfs_read+0x8d/0x140
>>>    ksys_read+0x52/0xc0
>>>    do_syscall_64+0x60/0x210
>>>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>> [CAUSE]
>>> The fuzzed image has a corrupted leaf whose first key doesn't match with its parent:
>>>   checksum tree key (CSUM_TREE ROOT_ITEM 0)
>>>   node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
>>>   fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
>>>   chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
>>>   	...
>>>           key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19
>>>
>>>   leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE
>>>   leaf 29761536 flags 0x1(WRITTEN) backref revision 1
>>>   fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
>>>   chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
>>>           item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244
>>>                   range start 8798638964736 end 8798641262592 length 2297856
>>>
>>> For the first time tree read, it will not pass verify_level_key() check.
>>> But the extent buffer will still be cached.
>>>
>>> Also there is a pitfall in read_block_for_search(), where a cached
>>> extent buffer will not be checked for its level and first key.
>>>
>>> There are context where we read tree block without verifying its
>>> first key, such as scrub.
>>>
>>> So in that case, a corrupted leaf can sneak in and screw up the kernel.
>>>
>>> [FIX]
>>> Export verify_level_key() as btrfs_verify_level_key() and call it in
>>> read_block_for_search() to fill the hole.
>>>
>>> Please note, this will cause a lot of extra error message if we have a
>>> bad tree block in any hot tree, but it's still much better to trigger
>>> the final safe net in key_search_validate().
>>>
> 
> <snip>
> 
>>>  				ret = -EIO;
>>> -			else if (verify_level_key(fs_info, eb, level,
>>> -						  first_key, parent_transid))
>>> +			else if (btrfs_verify_level_key(fs_info, eb, level,
>>> +						first_key, parent_transid))
>>>  				ret = -EUCLEAN;
>>
>> Actually why is the buffer still held when we return EUCLEAN since in
>> read_tree_block if btree_read_extent_buffer_pages returns an error
>> free_extent_buffer should be called and it should delete the eb from eb
>> cache, no ? IMO the correct behavior should be to remove the corrupted
>> buffer ASAP and not rely on later validation.
> 
> Actually in this case the call to free_extent_buffer in read_tree_block
> won't really clean the buffer since at this point the buffer has refs =
> 2 (one from alloc_extent_buffer and one from being added to the tree),
> however the code in free_extent_buffer won't execute the atomic_cmpxchg
> to do the decrement nor will it execute the fix up right after the
> spinlock if (refs==2 && EXTENT_BUFFER_STALE) which leaves only a single
> call to atomic_dec_and_test in release_extent_buffer which will return
> false. That's wrong.
> 
> 
> The way to fix it is to either:
> a) add a call to atomic_dec(eb->refs) so that the single call to
> atomic_dec_and_test frees the eb
> 
> b) call free_extent_buffer_stale which does atomic_dec itself, I'm more
> inclined to use this option.

Despite the scrub case I described, there is even a more possible case
to sneak a bad eb into cache tree.

One tree block shared by two snapshots, and one of the parent has bad key.

Anyway, either method you mentioned can't solve either shared tree block
nor the scrub case.

So we still need the check, and keep the key_seach_validate() as final
safe net.

Thanks,
Qu

> 
> 
>>
>>>  			else
>>>  				break;
>>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>>> index 987a64bc0c66..67a9fe2d29c7 100644
>>> --- a/fs/btrfs/disk-io.h
>>> +++ b/fs/btrfs/disk-io.h
>>> @@ -39,6 +39,9 @@ static inline u64 btrfs_sb_offset(int mirror)
>>>  struct btrfs_device;
>>>  struct btrfs_fs_devices;
>>>  
>>> +int btrfs_verify_level_key(struct btrfs_fs_info *fs_info,
>>> +			   struct extent_buffer *eb, int level,
>>> +			   struct btrfs_key *first_key, u64 parent_transid);
>>>  struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
>>>  				      u64 parent_transid, int level,
>>>  				      struct btrfs_key *first_key);
>>>
>>

next prev parent reply	other threads:[~2019-03-12  8:33 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-12  7:45 [PATCH] btrfs: Check the first key and level for cached extent buffer Qu Wenruo
2019-03-12  7:57 ` Nikolay Borisov
2019-03-12  8:05   ` Qu Wenruo
2019-03-12  8:11   ` Nikolay Borisov
2019-03-12  8:32     ` Qu Wenruo [this message]
2019-03-12  8:34       ` Nikolay Borisov
2019-03-12  8:39         ` Qu Wenruo
2019-03-12  8:28 ` Nikolay Borisov
2019-03-12  8:34   ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cd40d9ec-1a9c-fc82-c50c-127b0fa6f44b@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=jungyeon@gatech.edu \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=nborisov@suse.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).