linux-btrfs.vger.kernel.org archive mirror
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Steve Dainard <sdainard@spd1.com>,
	Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: <linux-btrfs@vger.kernel.org>
Subject: Re: Can't mount btrfs volume on rbd
Date: Wed, 22 Jul 2015 10:01:40 +0800
Message-ID: <55AEF984.9080706@cn.fujitsu.com>
In-Reply-To: <CAEMJtDuaOYPomha_VbTah521BZTYzmrg2D5tk6N4rd1Y8UPaog@mail.gmail.com>



Steve Dainard wrote on 2015/07/21 14:07 -0700:
> On Tue, Jul 21, 2015 at 4:15 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2015-07-21 04:38, Qu Wenruo wrote:
>>>
>>> Hi Steve,
>>>
>>> I checked your binary dump.
>>>
>>> Previously I was too focused on the assert error and overlooked an even
>>> larger bug...
>>>
>>> According to the btrfs-debug-tree output, subvol 257 and 5 are completely
>>> corrupted.
>>> Subvol 257 seems to contain a new tree root, and 5 seems to contain a
>>> new device tree.
>>>
>>> ------
>>> fs tree key (FS_TREE ROOT_ITEM 0)
>>> leaf 29409280 items 8 free space 15707 generation 9 owner 4
>>> fs uuid 1bb22a03-bc25-466f-b078-c66c6f6a6d28
>>> chunk uuid 11cca6df-e850-45d7-a928-cdff82c5f295
>>>           item 0 key (0 DEV_STATS 1) itemoff 16243 itemsize 40
>>>                   device stats
>>>           item 1 key (1 DEV_EXTENT 0) itemoff 16195 itemsize 48
>>>                   dev extent chunk_tree 3
>>>                   chunk objectid 256 chunk offset 0 length 4194304
>>>           item 2 key (1 DEV_EXTENT 4194304) itemoff 16147 itemsize 48
>>>                   dev extent chunk_tree 3
>>>                   chunk objectid 256 chunk offset 4194304 length 8388608
>>>           item 3 key (1 DEV_EXTENT 12582912) itemoff 16099 itemsize 48
>>>                   dev extent chunk_tree 3
>>> ......
>>> # DEV_EXTENT should never occur in an fs tree. It should only occur in
>>> # the dev tree.
>>>
>>> file tree key (257 ROOT_ITEM 0)
>>> leaf 29376512 items 13 free space 12844 generation 9 owner 1
>>> fs uuid 1bb22a03-bc25-466f-b078-c66c6f6a6d28
>>> chunk uuid 11cca6df-e850-45d7-a928-cdff82c5f295
>>>           item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
>>>                   root data bytenr 29392896 level 0 dirid 0 refs 1 gen 9
>>>                   uuid 00000000-0000-0000-0000-000000000000
>>>           item 1 key (DEV_TREE ROOT_ITEM 0) itemoff 15405 itemsize 439
>>>                   root data bytenr 29409280 level 0 dirid 0 refs 1 gen 9
>>>                   uuid 00000000-0000-0000-0000-000000000000
>>>           item 2 key (FS_TREE INODE_REF 6) itemoff 15388 itemsize 17
>>>                   inode ref index 0 namelen 7 name: default
>>>           item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
>>>                   root data bytenr 29360128 level 0 dirid 256 refs 1 gen 4
>>>                   uuid 00000000-0000-0000-0000-000000000000
>>>           item 4 key (ROOT_TREE_DIR INODE_ITEM 0) itemoff 14789 itemsize 160
>>>                   inode generation 3 transid 0 size 0 nbytes 16384
>>>                   block group 0 mode 40755 links 1 uid 0 gid 0
>>>                   rdev 0 flags 0x0
>>> # These items belong only in the tree root.
>>> ------
>>>
>>> So the problem is that the kernel you use has some bug (btrfs or rbd
>>> related) that causes btrfs to write the wrong tree blocks over existing
>>> tree blocks.
>>>
>>> In such a case, btrfsck won't be able to fix the critical error.
>>> And I don't even have an idea of how to turn the assert into a
>>> normal error, as the corruption affects the whole structure of btrfs...
>>>
>>> I can't even recall a btrfs bug this critical...
>>>
>>>
>>> I'm not familiar with rbd, but does it allow a block device to be
>>> mounted on different systems?
>>>
>>> For example, exporting device A to system B and system C, with both
>>> systems mounting device A as btrfs at the same time?
>>>
>> Yes, it's a distributed SAN type system built on top of Ceph.  It does allow
>> having multiple systems mount the device.
>
> This is accurate, but it's a configured setting with the default being
> not shareable. The host which has mapped the block device should have
> a lock on it, so if another host attempts to map the same block device
> it should error out.
>
> The first time I had this occur was when it appeared pacemaker (HA
> daemon) couldn't fence one of two nodes and somehow bypassed the ceph
> locking mechanism, mapping/mounting the block device on both nodes at
> the same time, which would account for the corruption.
>
> The last time this occurred (which is where the image you've analysed
> came from) pacemaker was not involved, and only one node was
> mapping/mounting the block device.
>
>>
>> Ideally, we really should put in some kind of protection against multiple
>> mounts (this would be a significant selling point of BTRFS in my opinion, as
>> the only other Linux native FS that has this is ext4), and make it _very_
>> obvious that mounting a BTRFS filesystem on multiple nodes concurrently
>> _WILL_ result in pretty much irreparable corruption.
>>
>
>
> I don't know if this has any bearing on the failure case, but the
> filesystem that I sent an image of was only ever created, subvol
> created, and mounted/unmounted several times. There was never any data
> written to that mount point.
>
Subvol creation plus a rw mount is enough to trigger 2~3 transactions with
DATA written into btrfs, because the first rw mount creates the free space
cache, which is counted as data.

But without multiple mount instances, I really can't think of another
way to destroy btrfs so badly while leaving all csums OK...
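
The mismatch visible in the dump quoted above (a leaf listed under one tree
but carrying the "owner" of another) can be spotted mechanically. Below is a
minimal sketch in Python, assuming the text layout of the btrfs-debug-tree
output shown in this thread. The function name find_owner_mismatches and the
two regexes are hypothetical and not part of btrfs-progs; only the tree
objectids come from the btrfs on-disk format.

```python
import re

# Well-known btrfs tree objectids (from the on-disk format).
TREE_IDS = {"ROOT_TREE": 1, "EXTENT_TREE": 2, "CHUNK_TREE": 3,
            "DEV_TREE": 4, "FS_TREE": 5, "CSUM_TREE": 7}

# Header printed before each tree, e.g. "fs tree key (FS_TREE ROOT_ITEM 0)".
# Item lines read "item N key (...)", so they do not match "tree key (".
TREE_RE = re.compile(r"tree key \((\w+) ROOT_ITEM \d+\)")
# Leaf header, e.g. "leaf 29409280 items 8 ... generation 9 owner 4".
LEAF_RE = re.compile(r"leaf (\d+) items \d+ .* owner (\d+)")


def find_owner_mismatches(dump_text):
    """Return (tree_id, leaf_bytenr, owner) for leaves owned by another tree."""
    mismatches = []
    current_tree = None
    for line in dump_text.splitlines():
        m = TREE_RE.search(line)
        if m:
            name = m.group(1)
            # Named trees map through TREE_IDS; subvolumes appear as numbers.
            current_tree = TREE_IDS.get(name)
            if current_tree is None and name.isdigit():
                current_tree = int(name)
            continue
        m = LEAF_RE.search(line)
        if m and current_tree is not None:
            bytenr, owner = int(m.group(1)), int(m.group(2))
            if owner != current_tree:
                mismatches.append((current_tree, bytenr, owner))
    return mismatches


# The two tree/leaf header pairs from the dump quoted in this message.
sample = """\
fs tree key (FS_TREE ROOT_ITEM 0)
leaf 29409280 items 8 free space 15707 generation 9 owner 4
file tree key (257 ROOT_ITEM 0)
leaf 29376512 items 13 free space 12844 generation 9 owner 1
"""
for tree_id, bytenr, owner in find_owner_mismatches(sample):
    print(f"tree {tree_id}: leaf {bytenr} is owned by tree {owner}")
```

On the two headers above this reports the fs tree (5) pointing at a leaf
owned by the dev tree (4), and subvol 257 pointing at a leaf owned by the
tree root (1). It is only a hint, not a checker: on a filesystem with
snapshots, a leaf shared with the source subvolume may legitimately carry a
different owner.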

Thanks,
Qu
