From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.commandict.com.au ([203.190.234.55]:47906 "EHLO mascot.commandict.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752302AbbAUJPs convert rfc822-to-8bit (ORCPT ); Wed, 21 Jan 2015 04:15:48 -0500 Received: from mascot.commandict.com.au (localhost [IPv6:::1]) by mascot.commandict.com.au (Postfix) with ESMTP id 3DE1C8021D for ; Wed, 21 Jan 2015 20:15:44 +1100 (AEDT) Subject: RE: Recovery options for FS forced readonly due to 3.17 snapshot bug From: =?utf-8?Q?Brett_King?= To: =?utf-8?Q?linux-btrfs=40vger=2Ekernel=2Eorg?= Date: Wed, 21 Jan 2015 20:15:44 +1100 Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Message-Id: Sender: linux-btrfs-owner@vger.kernel.org List-ID: -----Original message----- From: Brett King Sent: Wed 01-21-2015 09:26 am Subject: RE: Recovery options for FS forced readonly due to 3.17 snapshot bug CC: linux-btrfs@vger.kernel.org; To: fdmanana@gmail.com; > From: Filipe David Manana > Sent: Tue 01-20-2015 11:40 pm > Subject: Re: Recovery options for FS forced readonly due to 3.17 snapshot bug > To: brett.king@commandict.com.au; > CC: linux-btrfs@vger.kernel.org; > > On Tue, Jan 20, 2015 at 12:15 PM, wrote: > > > Hi, > > > My FS has been forced readonly by the early 3.17 snapshot bug. After much > > reading, I'm looking for validation of some recovery scenarios: > > > > > > 1) btrfsck --repair under a later kernel. > > > > > > 2) replacing the devices one by one under a later kernel, effectively > > removing the corruption. > > > > > > 3) Just copying data across from the readonly source to a new FS created > > under a later kernel. > > > > > > Option 1 with check only (not repair) segfaults and coredumps on 3.18 (see > > below) .. are there some other options needed here or perhaps patches pending > > to rectify this ? > > > > > > Option 2 would be better than nothing, as I get to keep my snapshots and > > don't need any more surplus drives than 2x units of the largest size (data is > > RAID1). > > > > > > Option 3 is a fallback and basically forfeits the source FS including > > snapshots, is guaranteed to work (I can read the data fine), however requires > > 1:1 capacity in new disks to achieve. > > > > > > So being unable to complete a btrfsck check and generally scared off by a > > btrfsck -repair, will option 2 work ? I.E. can I still replace devices on a > > readonly FS ? > > > > > > And are there any other options ? > > > > > > Note the FS is very nearly full of data - I did also get continuous disk > > activity due to this high utilisation triggering the endless reclaim unused > > extent routine, for a few days prior to it being forced readonly. > > > > > > [root@array ~]# uname -r > > > 3.18.2-200.fc21.x86_64 > > > > > > (upgraded since getting the issue) > > > > > > [root@array ~]# btrfsck /dev/sdm > > > > > > parent transid verify failed on 40198674350080 wanted 1623837 found 1622986 > > > Ignoring transid failure > > > Segmentation fault (core dumped) > > > > > > [root@array ~]# btrfs fi sh /export/archive/ > > > Label: 'archive' uuid: 22c7663a-93ca-40a6-9491-26abaa62b924 > > > Total devices 8 FS bytes used 18.07TiB > > > devid 1 size 5.46TiB used 5.46TiB path /dev/sdm > > > devid 2 size 3.64TiB used 3.64TiB path /dev/sdc > > > devid 3 size 3.64TiB used 3.64TiB path /dev/sdd > > > devid 4 size 5.46TiB used 5.46TiB path /dev/sdl > > > devid 5 size 4.55TiB used 4.55TiB path /dev/sdb > > > devid 8 size 4.55TiB used 4.55TiB path /dev/sdj > > > devid 9 size 4.55TiB used 4.55TiB path /dev/sdk > > > devid 10 size 4.55TiB used 4.55TiB path /dev/sde > > > > > > Btrfs v3.18.1 > > > > > > [root@array ~]# df -h /export/archive/ > > > Filesystem Size Used Avail Use% Mounted on > > > /dev/sdm 19T 19T 123G 100% /export/archive > > > > > > [root@array ~]# btrfs fi df /export/archive/ > > > Data, RAID1: total=18.17TiB, used=18.05TiB > > > System, RAID1: total=32.00MiB, used=3.00MiB > > > Metadata, RAID1: total=23.00GiB, used=22.49GiB > > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > > > [root@array ~]# dmesg |grep -i btrfs |less > > > [ 2.044134] Btrfs loaded > > > [ 2.046327] BTRFS: device label array devid 1 transid 289965 /dev/sda4 > > > [ 2.472321] BTRFS info (device sda4): disk space caching is enabled > > > [ 2.513784] BTRFS: detected SSD devices, enabling SSD mode > > > [ 2.985877] BTRFS info (device sda4): disk space caching is enabled > > > [ 3.520737] BTRFS: device label archive devid 5 transid 1623944 /dev/sdf > > > [ 3.544394] BTRFS: device label archive devid 3 transid 1623944 /dev/sdh > > > [ 3.544553] BTRFS: device label archive devid 10 transid 1623944 /dev/sdi > > > [ 3.544957] BTRFS: device label archive devid 2 transid 1623944 /dev/sdg > > > [ 3.614636] BTRFS: device label archive devid 9 transid 1623944 /dev/sdc > > > [ 3.663006] BTRFS: device label archive devid 8 transid 1623944 /dev/sdb > > > [ 3.663092] BTRFS: device label archive devid 4 transid 1623944 /dev/sdd > > > [ 3.685397] BTRFS: device label archive devid 1 transid 1623944 /dev/sde > > > [ 3.701732] BTRFS info (device sde): disk space caching is enabled > > > [ 3.903413] BTRFS: bdev /dev/sde errs: wr 2805, rd 1548, flush 0, > corrupt > > 0, gen 0 > > > [ 3.903421] BTRFS: bdev /dev/sdd errs: wr 2805, rd 0, flush 0, corrupt > 0, > > gen 0 > > > [ 3.903426] BTRFS: bdev /dev/sdb errs: wr 2805, rd 33, flush 0, corrupt > 0, > > gen 0 > > > [ 3.903430] BTRFS: bdev /dev/sdc errs: wr 2806, rd 0, flush 0, corrupt > 0, > > gen 0 > > > > You have device write and read errors here - the snapshots issue with > > 3.17 wouldn't bump these device error stats. > > Even if you were affected by the snapshots issue, you clearly have > > here another problem, very likely hardware related due to those stats. > > Yes that is my other problem, these disks are in external USB enclosures which > are flakey from time to time; I'm redesigning this imminently. > > > > > If you are affected by the snapshots issue, btrfs check/btrfsck as of > > btrfs-progs v3.17+ should be able to fix it and if you are affected by > > it, it reports at a very early stage (without --repair) one or more > > messages with the following format: > > > > "root %llu has a root item with a more recent gen (%llu) compared to > > the found root node (%llu)\n" > > > > If the snapshots issue from 3.17.0/1 was the only issue you had, those > > messages would be reported by fsck before it segfaults (at least for > > all cases I've seen). > > Hmm I've not seen any of those errors - as you say it seems like something else > is affecting it also. > > > > > Also when one is affected by the snapshots issue, it should only cause > > problems when reading from the affected readonly snapshots, and not > > when running the delayed references code in the trace below. > > This makes sense; I've only read from the (RO) snapshots when being paranoid > about my data & manually comparing some sample files. > > > > > My best recommendation is to check how healthy your drives are, > > perform a backup of whatever is possible (btrfs restore for example) > > I hadn't heard of restore - looks good from the wiki, after a quick trial it > seems to be the way to keep my snapshots if I had to clone the data over to a > new FS - i.e. option 2.5; fantastic. > > > and then only after run btrfsck with --repair. > > Have kicked off a btrfs check --repair now, as it is the option which avoids > the time & cost of procuring new disks; I have backups of my important data and > am an optimist :) > > Will report how it went - thanks for the info ! Well the optimist invites disappointment .. btrfs check --repair still segfaults however my write attempts now end with out of space errors, which are more encouraging however as they appear after the forced readonly message, I suspect they are stemming from that instead of capacity. Jan 21 20:13:38 array kernel: [ 956.187594] BTRFS (device sdd): parent transid verify failed on 57023356125184 wanted 1623924 found 1622992 Jan 21 20:13:38 array kernel: [ 956.187632] WARNING: CPU: 2 PID: 471 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x54/0x120 [btrfs]() Jan 21 20:13:38 array kernel: [ 956.187636] Modules linked in: binfmt_misc pl2303 vfat fat iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul crc32c_intel snd_hda_intel ghash_clmulni_intel snd_hda_controller i2c_i801 mei_me mei ir_mce_kbd_decoder ir_lirc_codec ir_xmp_decoder lirc_dev ir_sharp_decoder ir_sanyo_decoder snd_hda_codec ir_jvc_decoder ir_sony_decoder lpc_ich ir_rc6_decoder mfd_core ir_rc5_decoder snd_hwdep snd_seq ir_nec_decoder snd_seq_device snd_pcm snd_timer i2c_hid rc_rc6_mce tpm_tis tpm i2c_designware_platform nuvoton_cir snd rc_core i2c_designware_core dw_dmac dw_dmac_core soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc uas usb_storage btrfs xor i915 raid6_pq i2c_algo_bit drm_kms_helper e1000e drm ptp pps_core sdhci_acpi sdhci video mmc_core Jan 21 20:13:38 array kernel: [ 956.187711] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Jan 21 20:13:38 array kernel: [ 956.187760] [] __btrfs_abort_transaction+0x54/0x120 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187774] [] __btrfs_free_extent+0x179/0xca0 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187803] [] ? btrfs_merge_delayed_refs+0x3b9/0x490 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187819] [] __btrfs_run_delayed_refs+0x8fe/0x1210 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187866] [] btrfs_run_delayed_refs.part.64+0x6e/0x280 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187904] [] btrfs_write_dirty_block_groups+0x461/0x740 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187943] [] commit_cowonly_roots+0x17a/0x234 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187974] [] btrfs_commit_transaction+0x491/0xa20 [btrfs] Jan 21 20:13:38 array kernel: [ 956.187996] [] flush_space+0x9a/0x4e0 [btrfs] Jan 21 20:13:38 array kernel: [ 956.188016] [] ? btrfs_get_alloc_profile+0x30/0x40 [btrfs] Jan 21 20:13:38 array kernel: [ 956.188043] [] ? can_overcommit+0x54/0xf0 [btrfs] Jan 21 20:13:38 array kernel: [ 956.188063] [] btrfs_async_reclaim_metadata_space+0xe5/0x200 [btrfs] Jan 21 20:13:38 array kernel: [ 956.188133] BTRFS: error (device sdd) in __btrfs_free_extent:5978: errno=-5 IO failure Jan 21 20:13:38 array kernel: [ 956.188224] BTRFS info (device sdd): forced readonly Jan 21 20:13:38 array kernel: [ 956.188232] BTRFS: error (device sdd) in btrfs_run_delayed_refs:2792: errno=-5 IO failure Jan 21 20:13:38 array kernel: BTRFS (device sdd): parent transid verify failed on 57023356125184 wanted 1623924 found 1622992 Jan 21 20:13:38 array kernel: [ 956.190791] BTRFS warning (device sdd): btrfs_uuid_scan_kthread failed -30 Jan 21 20:13:38 array kernel: WARNING: CPU: 2 PID: 471 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x54/0x120 [btrfs]() Jan 21 20:13:38 array kernel: Modules linked in: binfmt_misc pl2303 vfat fat iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul crc32c_intel snd_hda_intel ghash_clmulni_intel snd_hda_controller i2c_i801 mei_me mei ir_mce_kbd_decoder ir_lirc_codec ir_xmp_decoder lirc_dev ir_sharp_decoder ir_sanyo_decoder snd_hda_codec ir_jvc_decoder ir_sony_decoder lpc_ich ir_rc6_decoder mfd_core ir_rc5_decoder snd_hwdep snd_seq ir_nec_decoder snd_seq_device snd_pcm snd_timer i2c_hid rc_rc6_mce tpm_tis tpm i2c_designware_platform nuvoton_cir snd rc_core i2c_designware_core dw_dmac dw_dmac_core soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc uas usb_storage btrfs xor i915 raid6_pq i2c_algo_bit drm_kms_helper Jan 21 20:13:38 array kernel: Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Jan 21 20:13:38 array kernel: [] __btrfs_abort_transaction+0x54/0x120 [btrfs] Jan 21 20:13:38 array kernel: [] __btrfs_free_extent+0x179/0xca0 [btrfs] Jan 21 20:13:38 array kernel: [] ? btrfs_merge_delayed_refs+0x3b9/0x490 [btrfs] Jan 21 20:13:38 array kernel: [] __btrfs_run_delayed_refs+0x8fe/0x1210 [btrfs] Jan 21 20:13:38 array kernel: [] btrfs_run_delayed_refs.part.64+0x6e/0x280 [btrfs] Jan 21 20:13:38 array kernel: [] btrfs_write_dirty_block_groups+0x461/0x740 [btrfs] Jan 21 20:13:38 array kernel: [] commit_cowonly_roots+0x17a/0x234 [btrfs] Jan 21 20:13:38 array kernel: [] btrfs_commit_transaction+0x491/0xa20 [btrfs] Jan 21 20:13:38 array kernel: [] flush_space+0x9a/0x4e0 [btrfs] Jan 21 20:13:38 array kernel: [] ? btrfs_get_alloc_profile+0x30/0x40 [btrfs] Jan 21 20:13:38 array kernel: [] ? can_overcommit+0x54/0xf0 [btrfs] Jan 21 20:13:38 array kernel: [] btrfs_async_reclaim_metadata_space+0xe5/0x200 [btrfs] Jan 21 20:13:38 array kernel: BTRFS: error (device sdd) in __btrfs_free_extent:5978: errno=-5 IO failure Jan 21 20:13:38 array kernel: BTRFS info (device sdd): forced readonly Jan 21 20:13:38 array kernel: BTRFS: error (device sdd) in btrfs_run_delayed_refs:2792: errno=-5 IO failure Jan 21 20:13:38 array kernel: BTRFS warning (device sdd): btrfs_uuid_scan_kthread failed -30 Jan 21 20:13:38 array kernel: [ 956.319756] BTRFS: error (device sdd) in write_one_cache_group:3153: errno=-28 No space left Jan 21 20:13:38 array kernel: [ 956.319805] BTRFS warning (device sdd): Skipping commit of aborted transaction. Jan 21 20:13:38 array kernel: [ 956.319807] BTRFS: error (device sdd) in cleanup_transaction:1607: errno=-28 No space left Jan 21 20:13:38 array kernel: BTRFS: error (device sdd) in write_one_cache_group:3153: errno=-28 No space left Jan 21 20:13:38 array kernel: BTRFS warning (device sdd): Skipping commit of aborted transaction. Jan 21 20:13:38 array kernel: BTRFS: error (device sdd) in cleanup_transaction:1607: errno=-28 No space left So it appears something else is awry, potentially exacerbated by my flakey hardware and which can now only effectively be recovered from with a restore / copy off to a healthier place. Thanks all, good learnings - I'll be back as I couldn't sleep at night without a checksum-on-read FS ! > > > > > > [ 1432.231219] BTRFS: space cache generation (1623006) does not match inode > > (1623943) > > > [ 1432.231283] BTRFS warning (device sde): failed to load free space cache > > for block group 57021632675840, rebuild it now > > > [ 1434.603652] BTRFS (device sde): parent transid verify failed on > > 57022998740992 wanted 1623896 found 1622991 > > > [ 1434.608726] BTRFS (device sde): parent transid verify failed on > > 57022998740992 wanted 1623896 found 1622991 > > > [ 1434.919115] BTRFS (device sde): parent transid verify failed on > > 57022584373248 wanted 1623923 found 1622990 > > > [ 1434.937616] BTRFS (device sde): parent transid verify failed on > > 57022584373248 wanted 1623923 found 1622990 > > > [ 1434.937658] WARNING: CPU: 0 PID: 1571 at fs/btrfs/super.c:260 > > __btrfs_abort_transaction+0x54/0x120 [btrfs]() > > > [ 1434.937660] BTRFS: Transaction aborted (error -5) > > > [ 1434.937662] Modules linked in: binfmt_misc pl2303 iTCO_wdt > > iTCO_vendor_support intel_rapl x86_pkg_temp_thermal coretemp kvm_intel vfat > fat > > kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel > > snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel > > snd_hda_controller mei_me snd_hda_codec mei ir_xmp_decoder ir_lirc_codec > > snd_seq lirc_dev ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder > i2c_i801 > > ir_sony_decoder lpc_ich ir_jvc_decoder snd_hwdep mfd_core ir_rc6_decoder > > ir_rc5_decoder ir_nec_decoder snd_seq_device snd_pcm i2c_hid tpm_tis tpm > > rc_rc6_mce i2c_designware_platform nuvoton_cir dw_dmac i2c_designware_core > > dw_dmac_core rc_core snd_timer snd soundcore nfsd auth_rpcgss nfs_acl lockd > > grace sunrpc btrfs xor i915 raid6_pq e1000e i2c_algo_bit drm_kms_helper drm > > > [ 1434.937732] Workqueue: events_unbound btrfs_async_reclaim_metadata_space > > [btrfs] > > > [ 1434.937779] [] __btrfs_abort_transaction+0x54/0x120 > > [btrfs] > > > [ 1434.937795] [] > > btrfs_run_delayed_refs.part.64+0x12a/0x280 [btrfs] > > > [ 1434.937811] [] btrfs_run_delayed_refs+0x17/0x20 > [btrfs] > > > [ 1434.937829] [] btrfs_commit_transaction+0x407/0xa20 > > [btrfs] > > > [ 1434.937843] [] flush_space+0x9a/0x4e0 [btrfs] > > > [ 1434.937857] [] ? btrfs_get_alloc_profile+0x30/0x40 > > [btrfs] > > > [ 1434.937870] [] ? can_overcommit+0x54/0xf0 [btrfs] > > > [ 1434.937883] [] > > btrfs_async_reclaim_metadata_space+0xe5/0x200 [btrfs] > > > [ 1434.937920] BTRFS: error (device sde) in btrfs_run_delayed_refs:2792: > > errno=-5 IO failure > > > [ 1434.937972] BTRFS info (device sde): forced readonly > > > [ 1434.937975] BTRFS warning (device sde): Skipping commit of aborted > > transaction. > > > [ 1434.937978] BTRFS: error (device sde) in cleanup_transaction:1607: > > errno=-5 IO failure > > > [ 1435.424679] BTRFS (device sde): parent transid verify failed on > > 29792056147968 wanted 1623936 found 1623004 > > > [ 1435.522805] BTRFS (device sde): parent transid verify failed on > > 29792056999936 wanted 1623936 found 1623004 > > > [ 1435.565828] BTRFS (device sde): parent transid verify failed on > > 29792057737216 wanted 1623936 found 1623004 > > > [ 1435.651292] BTRFS (device sde): parent transid verify failed on > > 29790951227392 wanted 1623935 found 1623003 > > > [ 1435.718304] BTRFS (device sde): parent transid verify failed on > > 29790914019328 wanted 1623935 found 1623003 > > > [ 1435.754679] BTRFS (device sde): parent transid verify failed on > > 29790914592768 wanted 1623935 found 1623003 > > > [ 1439.616104] BTRFS (device sde): parent transid verify failed on > > 29793154777088 wanted 1623936 found 1623005 > > > [ 1439.622553] BTRFS (device sde): parent transid verify failed on > > 29796082139136 wanted 1623939 found 1622980 > > > [ 1439.635702] BTRFS (device sde): parent transid verify failed on > > 29795165683712 wanted 1623938 found 1623007 > > > [ 1439.639677] BTRFS (device sde): parent transid verify failed on > > 29796037656576 wanted 1623939 found 1622980 > > > [ 1439.644303] BTRFS (device sde): parent transid verify failed on > > 29792017874944 wanted 1623936 found 1623004 > > > [ 1439.653535] BTRFS (device sde): parent transid verify failed on > > 29795102965760 wanted 1623939 found 1623007 > > > [ 1439.663393] BTRFS (device sde): parent transid verify failed on > > 29793354055680 wanted 1623937 found 1623005 > > > [ 1439.666222] BTRFS (device sde): parent transid verify failed on > > 29795590062080 wanted 1623939 found 1623007 > > > [ 1439.677759] BTRFS (device sde): parent transid verify failed on > > 29798681542656 wanted 1623943 found 1622983 > > > [ 1439.693965] BTRFS (device sde): parent transid verify failed on > > 29795888676864 wanted 1623939 found 1623007 > > > [ 1444.624351] BTRFS (device sde): parent transid verify failed on > > 29799304413184 wanted 1623943 found 1622984 > > > [ 1444.640856] BTRFS (device sde): parent transid verify failed on > > 29794051571712 wanted 1623937 found 1623005 > > > [ 1444.654402] BTRFS (device sde): parent transid verify failed on > > 29796307681280 wanted 1623940 found 1622980 > > > [ 1444.667444] BTRFS (device sde): parent transid verify failed on > > 29796331765760 wanted 1623940 found 1622980 > > > [ 1444.677949] BTRFS (device sde): parent transid verify failed on > > 29795524362240 wanted 1623939 found 1623007 > > > [ 1444.687418] BTRFS (device sde): parent transid verify failed on > > 29799375421440 wanted 1623943 found 1622984 > > > [ 1444.694611] BTRFS (device sde): parent transid verify failed on > > 29797253136384 wanted 1623941 found 1622981 > > > [ 1444.708855] BTRFS (device sde): parent transid verify failed on > > 29796242833408 wanted 1623940 found 1622980 > > > [ 1444.717589] BTRFS (device sde): parent transid verify failed on > > 29795054223360 wanted 1623939 found 1623007 > > > [ 1444.719838] BTRFS (device sde): parent transid verify failed on > > 29797253316608 wanted 1623941 found 1622981 > > > > > > The FS mounts RW fine on boot, until a write is actually attempted, at > which > > point it spits these errors - ultimately the write doesn't get committed & > the > > FS gets forced readonly. > > > > > > Thanks in advance ! > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- > > Filipe David Manana, > > > > "Reasonable men adapt themselves to the world. > > Unreasonable men adapt the world to themselves. > > That's why all progress depends on unreasonable men." > >