Subject: Re: Can't mount btrfs volume on rbd
To: Steve Dainard , Austin S Hemmelgarn
References: <557A890D.8080306@cn.fujitsu.com> <557E877E.2060704@cn.fujitsu.com> <557F7B82.2060203@cn.fujitsu.com> <55A46473.8070106@cn.fujitsu.com> <55AE04EF.6040807@cn.fujitsu.com> <55AE29DA.4050201@gmail.com>
CC: 
From: Qu Wenruo
Message-ID: <55AEF984.9080706@cn.fujitsu.com>
Date: Wed, 22 Jul 2015 10:01:40 +0800
In-Reply-To: 

Steve Dainard wrote on 2015/07/21 14:07 -0700:
> On Tue, Jul 21, 2015 at 4:15 AM, Austin S Hemmelgarn
>  wrote:
>> On 2015-07-21 04:38, Qu Wenruo wrote:
>>>
>>> Hi Steve,
>>>
>>> I checked your binary dump.
>>>
>>> Previously I was too focused on the assert error, but ignored an even
>>> larger bug...
>>>
>>> As for the btrfs-debug-tree output, subvols 257 and 5 are completely
>>> corrupted.
>>> Subvol 257 seems to contain a new tree root, and 5 seems to contain a
>>> new device tree.
>>>
>>> ------
>>> fs tree key (FS_TREE ROOT_ITEM 0)
>>> leaf 29409280 items 8 free space 15707 generation 9 owner 4
>>> fs uuid 1bb22a03-bc25-466f-b078-c66c6f6a6d28
>>> chunk uuid 11cca6df-e850-45d7-a928-cdff82c5f295
>>>     item 0 key (0 DEV_STATS 1) itemoff 16243 itemsize 40
>>>         device stats
>>>     item 1 key (1 DEV_EXTENT 0) itemoff 16195 itemsize 48
>>>         dev extent chunk_tree 3
>>>         chunk objectid 256 chunk offset 0 length 4194304
>>>     item 2 key (1 DEV_EXTENT 4194304) itemoff 16147 itemsize 48
>>>         dev extent chunk_tree 3
>>>         chunk objectid 256 chunk offset 4194304 length 8388608
>>>     item 3 key (1 DEV_EXTENT 12582912) itemoff 16099 itemsize 48
>>>         dev extent chunk_tree 3
>>> ......
>>> # DEV_EXTENT should never occur in the fs tree. It should only occur
>>> # in the dev tree.
>>>
>>> file tree key (257 ROOT_ITEM 0)
>>> leaf 29376512 items 13 free space 12844 generation 9 owner 1
>>> fs uuid 1bb22a03-bc25-466f-b078-c66c6f6a6d28
>>> chunk uuid 11cca6df-e850-45d7-a928-cdff82c5f295
>>>     item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
>>>         root data bytenr 29392896 level 0 dirid 0 refs 1 gen 9
>>>         uuid 00000000-0000-0000-0000-000000000000
>>>     item 1 key (DEV_TREE ROOT_ITEM 0) itemoff 15405 itemsize 439
>>>         root data bytenr 29409280 level 0 dirid 0 refs 1 gen 9
>>>         uuid 00000000-0000-0000-0000-000000000000
>>>     item 2 key (FS_TREE INODE_REF 6) itemoff 15388 itemsize 17
>>>         inode ref index 0 namelen 7 name: default
>>>     item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
>>>         root data bytenr 29360128 level 0 dirid 256 refs 1 gen 4
>>>         uuid 00000000-0000-0000-0000-000000000000
>>>     item 4 key (ROOT_TREE_DIR INODE_ITEM 0) itemoff 14789 itemsize 160
>>>         inode generation 3 transid 0 size 0 nbytes 16384
>>>         block group 0 mode 40755 links 1 uid 0 gid 0
>>>         rdev 0 flags 0x0
>>> # These items belong only in the tree root.
>>> ------
>>>
>>> So the problem is, the kernel you use has some bug (btrfs or rbd
>>> related), causing btrfs to write wrong tree blocks into existing tree
>>> blocks.
>>>
>>> For such a case, btrfsck won't be able to fix the critical error.
>>> And I don't even have an idea how to fix the assert by turning it into
>>> a normal error, as it's corrupting the whole structure of btrfs...
>>>
>>> I can't even recall such a critical btrfs bug...
>>>
>>> Not familiar with rbd, but will it allow a block device to be mounted
>>> on different systems?
>>>
>>> Like exporting a device A to system B and system C, and both system B
>>> and system C mounting device A at the same time as btrfs?
>>>
>> Yes, it's a distributed SAN type system built on top of Ceph. It does
>> allow having multiple systems mount the device.
>
> This is accurate, but it's a configured setting with the default being
> not shareable. The host which has mapped the block device should have
> a lock on it, so if another host attempts to map the same block device
> it should error out.
>
> The first time I had this occur was when it appeared pacemaker (HA
> daemon) couldn't fence one of two nodes, and somehow bypassed the ceph
> locking mechanism, mapping/mounting the block device on both nodes at
> the same time, which would account for the corruption.
>
> The last time this occurred (which is where the image you've analysed
> came from) pacemaker was not involved, and only one node was
> mapping/mounting the block device.
>
>> Ideally, we really should put in some kind of protection against
>> multiple mounts (this would be a significant selling point of BTRFS in
>> my opinion, as the only other Linux native FS that has this is ext4),
>> and make it _very_ obvious that mounting a BTRFS filesystem on
>> multiple nodes concurrently _WILL_ result in pretty much irreparable
>> corruption.
>
> I don't know if this has any bearing on the failure case, but the
> filesystem that I sent an image of was only ever created, subvol
> created, and mounted/unmounted several times. There was never any data
> written to that mount point.
>
Subvol creation and an rw mount are enough to trigger 2~3 transactions
with DATA written into btrfs, as the first rw mount will create the
free space cache, which is counted as data.

But without multiple mount instances, I really can't think of another
way to destroy btrfs so badly while still leaving all csums OK...

Thanks,
Qu
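
To illustrate Austin's point about multiple-mount protection: ext4's
version of this is MMP, where a reserved block on the device carries a
sequence number that the active mounter keeps updating, and a
prospective mounter re-reads that block after a delay and refuses to
continue if it changed. Below is a minimal userspace sketch of just the
detection half of that idea, assuming a hypothetical protection block
at a fixed offset; it is only an illustration, not ext4's or btrfs's
actual code, and the offset, structure layout and timing are made up.

------
/*
 * Conceptual multi-mount-protection check (detection only).
 * Assumes the active mounter periodically bumps mmp_block.seq at
 * MMP_OFFSET on the shared device; none of this exists in btrfs today.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

struct mmp_block {
	unsigned long long seq;   /* bumped by the active mounter */
	char node[64];            /* hostname of the active mounter */
};

#define MMP_OFFSET 4096           /* hypothetical reserved block */

static int read_mmp(int fd, struct mmp_block *m)
{
	return pread(fd, m, sizeof(*m), MMP_OFFSET) == (ssize_t)sizeof(*m)
		? 0 : -1;
}

int main(int argc, char **argv)
{
	struct mmp_block a, b;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (read_mmp(fd, &a) < 0) {
		perror("pread");
		return 1;
	}
	sleep(5);	/* give another node time to bump the sequence */
	if (read_mmp(fd, &b) < 0) {
		perror("pread");
		return 1;
	}
	if (a.seq != b.seq || strncmp(a.node, b.node, sizeof(a.node))) {
		fprintf(stderr, "device appears mounted on node %.64s, refusing\n",
			b.node);
		return 1;
	}
	printf("no concurrent mounter detected\n");
	close(fd);
	return 0;
}
------

A real implementation would also have to claim the block (write its own
node name and keep bumping the sequence) before mounting, and handle a
node that crashed while holding it, which is why ext4 pairs the
sequence with an update interval and a clean marker.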