Hitting error after failed balance

public inbox for linux-btrfs@vger.kernel.org

* Hitting error after failed balance
From: Mitchel Humpherys @ 2014-01-14 19:47 UTC
  To: linux-btrfs

I have a btrfs volume with two disks in it. Inspired by the recent LWN
series on btrfs, I set it up last week and things seemed to be going
quite well. However, I tried to balance the disks the other day, since
they were quite out of balance, but the balance job failed to complete,
erroring out with -ENOSPC (even though there was still space left on
both disks). Things seemed okay for a day or so after that, but now I've
been hitting the following WARN() pretty much constantly:

    [346628.037252] ------------[ cut here ]------------
    [346628.037258] WARNING: CPU: 2 PID: 404 at fs/btrfs/super.c:255 __btrfs_abort_transaction+0x11d/0x130 [btrfs]()
    [346628.037260] btrfs: Transaction aborted (error -5)
    [346628.037261] Modules linked in: arc4 ecb md4 md5 hmac nls_utf8 cifs fuse nfsv3 rpcsec_gss_krb5 nfsv4 nfsd auth_rpcgss oid_registry nfs_acl snd_hda_codec_hdmi qmi_wwan cdc_wdm usbnet mii cdc_acm snd_hda_codec_realtek ftdi_sio usbserial snd_hda_intel snd_hda_codec x86_pkg_temp_thermal intel_powerclamp coretemp nouveau kvm_intel kvm mxm_wmi video ttm drm_kms_helper drm snd_hwdep snd_pcm snd_page_alloc snd_timer snd psmouse crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper iTCO_wdt hp_wmi i2c_algo_bit soundcore sparse_keymap e1000e serio_raw cryptd iTCO_vendor_support rfkill gpio_ich ptp evdev pps_core wmi e1000 shpchp i2c_i801 i2c_core microcode processor mei_me mei lpc_ich button pcspkr vboxdrv(O) nfs lockd sunrpc fscache ext4
    [346628.037296]  crc16 mbcache jbd2 usb_storage btrfs libcrc32c hid_generic usbhid hid xor raid6_pq sr_mod cdrom sd_mod ata_generic crc32c_intel ahci libahci pata_acpi ehci_pci ehci_hcd libata usbcore usb_common scsi_mod
    [346628.037307] CPU: 2 PID: 404 Comm: btrfs-transacti Tainted: G        W  O 3.12.6-1-ARCH #1
    [346628.037309] Hardware name: Hewlett-Packard HP Z210 Workstation/1587h, BIOS J51 v01.35 01/10/2012
    [346628.037309]  0000000000000009 ffff8800caed3ca8 ffffffff814ee4fb ffff8800caed3cf0
    [346628.037312]  ffff8800caed3ce0 ffffffff81062bcd 00000000fffffffb ffff880416b13000
    [346628.037314]  ffff8803d82ce500 ffffffffa02dc280 0000000000000a9b ffff8800caed3d40
    [346628.037317] Call Trace:
    [346628.037320]  [<ffffffff814ee4fb>] dump_stack+0x54/0x8d
    [346628.037323]  [<ffffffff81062bcd>] warn_slowpath_common+0x7d/0xa0
    [346628.037326]  [<ffffffff81062c3c>] warn_slowpath_fmt+0x4c/0x50
    [346628.037334]  [<ffffffffa023fbdd>] __btrfs_abort_transaction+0x11d/0x130 [btrfs]
    [346628.037342]  [<ffffffffa025a263>] btrfs_run_delayed_refs+0x443/0x550 [btrfs]
    [346628.037351]  [<ffffffffa026a1ae>] btrfs_commit_transaction+0x4e/0x9d0 [btrfs]
    [346628.037359]  [<ffffffffa02619ad>] transaction_kthread+0x19d/0x220 [btrfs]
    [346628.037367]  [<ffffffffa0261810>] ? free_fs_root+0xc0/0xc0 [btrfs]
    [346628.037370]  [<ffffffff81084fe0>] kthread+0xc0/0xd0
    [346628.037373]  [<ffffffff81084f20>] ? kthread_create_on_node+0x120/0x120
    [346628.037375]  [<ffffffff814fcffc>] ret_from_fork+0x7c/0xb0
    [346628.037378]  [<ffffffff81084f20>] ? kthread_create_on_node+0x120/0x120
    [346628.037379] ---[ end trace fe83d0a80efc9fc0 ]---


Larger kernel log (including the part where the balance job failed)
here: https://gist.github.com/mgalgs/8423964

Rebooting didn't clear things up. I also tried mounting with
skip_balance, but I still get the same error. I'm not sure whether the
balance is actually related, but something is very unhappy here.

Any ideas what's going on here? Is this salvageable?


Other possibly relevant information:

    $ sudo btrfs filesystem show /local
    Label: none  uuid: 03a83a42-0bc7-42a2-bed6-df19c825897c
            Total devices 2 FS bytes used 380.83GiB
            devid    1 size 410.18GiB used 164.03GiB path /dev/sda6
            devid    2 size 465.76GiB used 220.00GiB path /dev/sdc
    
    Btrfs v3.12

    $ uname -a
    Linux mitchelh-linux 3.12.6-1-ARCH #1 SMP PREEMPT Fri Dec 20 19:39:00 CET 2013 x86_64 GNU/Linux
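The numbers above can be checked against the "there was still space left" claim. In btrfs filesystem show, the "used" figure on each devid line is space already allocated to chunks on that device. Size minus used is therefore unallocated space, which is where the new chunks that a balance needs must come from. The sketch below is a minimal Python version of that subtraction, run on the output pasted above. It assumes the sizes are printed in GiB, as they are here, and the regular expression is written for this output only.

```python
import re

# Output copied from the `btrfs filesystem show /local` run above.
SHOW_OUTPUT = """\
Label: none  uuid: 03a83a42-0bc7-42a2-bed6-df19c825897c
        Total devices 2 FS bytes used 380.83GiB
        devid    1 size 410.18GiB used 164.03GiB path /dev/sda6
        devid    2 size 465.76GiB used 220.00GiB path /dev/sdc
"""

# Matches the per-device lines; assumes both figures are printed in GiB.
DEVID_RE = re.compile(
    r"devid\s+(\d+)\s+size\s+([\d.]+)GiB\s+used\s+([\d.]+)GiB\s+path\s+(\S+)"
)


def unallocated_gib(show_output: str) -> dict:
    """Map each device path to (size - allocated) in GiB, rounded to 0.01."""
    result = {}
    for match in DEVID_RE.finditer(show_output):
        _devid, size, used, path = match.groups()
        result[path] = round(float(size) - float(used), 2)
    return result


if __name__ == "__main__":
    for path, free in unallocated_gib(SHOW_OUTPUT).items():
        print(f"{path}: {free:.2f} GiB unallocated")
```

For this output the sketch reports 246.15 GiB unallocated on /dev/sda6 and 245.76 GiB on /dev/sdc. At the time of this output, then, a simple lack of unallocated space is unlikely to be the whole story behind the -ENOSPC.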

/dev/sda6 started out as an ext4 partition that I converted with
btrfs-convert. /dev/sdc was fresh. I've been using /dev/sda for over 1.5
years without issues; /dev/sdc is new, so it could be a wildcard.


Let me know if you need any more information.

Please Cc me on replies since I'm not subscribed to this list.

Cool filesystem, by the way :). I can't say I've been excited about a
filesystem since playing around with ZFS on FreeBSD, but btrfs is pretty
awesome. The user interface is great.

Thanks!



-- 
Mitch


* Re: Hitting error after failed balance
From: Mitchel Humpherys @ 2014-01-14 23:21 UTC
  To: linux-btrfs

On Tue, Jan 14 2014 at 11:47:18 AM, Mitchel Humpherys <mitch.special@gmail.com> wrote:
>
> Other possibly relevant information:
>
>     $ sudo btrfs filesystem show /local
>     Label: none  uuid: 03a83a42-0bc7-42a2-bed6-df19c825897c
>             Total devices 2 FS bytes used 380.83GiB
>             devid    1 size 410.18GiB used 164.03GiB path /dev/sda6
>             devid    2 size 465.76GiB used 220.00GiB path /dev/sdc
>     
>     Btrfs v3.12
>
>     $ uname -a
>     Linux mitchelh-linux 3.12.6-1-ARCH #1 SMP PREEMPT Fri Dec 20 19:39:00 CET 2013 x86_64 GNU/Linux
>

Well, I tried btrfsck --repair, and now the filesystem appears to be unmountable...

    $ sudo btrfsck --repair /dev/sda6
    enabling repair mode
    Checking filesystem on /dev/sda6
    UUID: 03a83a42-0bc7-42a2-bed6-df19c825897c
    checking extents
    Backref 122643533824 parent 5 root 5 not found in extent tree
    Backref 122643533824 parent 123939635200 not referenced back 0x55ef990
    Incorrect global backref count on 122643533824 found 2 wanted 1
    backpointer mismatch on [122643533824 4096]
    ...
    Incorrect global backref count on 846095245312 found 1 wanted 0
    backpointer mismatch on [846095245312 4096]
    Backref 846103539712 parent 2 root 2 not found in extent tree
    Backref 846103539712 root 2 not referenced back 0x15d4460
    Incorrect global backref count on 846103539712 found 1 wanted 0
    backpointer mismatch on [846103539712 4096]
    repaired damaged extent references
    checking free space cache
    cache and super generation don't match, space cache will be invalidated
    checking fs roots
    root 256 inode 257 errors 800, odd csum item
    found 43901995420 bytes used err is 1
    total csum bytes: 363657120
    total tree bytes: 9200095232
    total fs tree bytes: 8303808512
    total extent tree bytes: 473149440
    btree space waste bytes: 2280314010
    file data blocks allocated: 864793686016
     referenced 757795065856
    Btrfs v3.12


    $ sudo mount -a
    mount: wrong fs type, bad option, bad superblock on /dev/sda6,
           missing codepage or helper program, or other error

           In some cases useful info is found in syslog - try
           dmesg | tail or so.

fstab entry:

    /dev/sda6           	/local    	btrfs      	skip_balance	0 0


and here's the kernel log from the mount -a:

    [13462.391937] btrfs: device fsid 03a83a42-0bc7-42a2-bed6-df19c825897c devid 2 transid 95625 /dev/sdc
    [13462.532136] btrfs: device fsid 03a83a42-0bc7-42a2-bed6-df19c825897c devid 1 transid 95625 /dev/sda6
    [13479.393750] btrfs: device fsid 03a83a42-0bc7-42a2-bed6-df19c825897c devid 1 transid 95625 /dev/sda6
    [13479.394386] btrfs: disk space caching is enabled
    [13479.995184] parent transid verify failed on 625292959744 wanted 95626 found 95625
    [13479.995191] btrfs: failed to read log tree
    [13480.051334] btrfs: open_ctree failed
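The "parent transid verify failed" line means the kernel read a tree block whose stamped generation (transaction ID) did not match the generation it expected for that block. Here it expected 95626 and found 95625, one transaction behind, and reading the log tree failed on that mismatch. The sketch below is a minimal Python parser that pulls those three numbers out of such a line. It assumes only the message format shown in this log.

```python
import re
from typing import Optional, Tuple

# Line copied from the mount attempt's kernel log above.
LOG_LINE = ("[13479.995184] parent transid verify failed on 625292959744 "
            "wanted 95626 found 95625")

TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def parse_transid_failure(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (block bytenr, expected generation, found generation), or None."""
    match = TRANSID_RE.search(line)
    if match is None:
        return None
    bytenr, wanted, found = (int(g) for g in match.groups())
    return bytenr, wanted, found


if __name__ == "__main__":
    bytenr, wanted, found = parse_transid_failure(LOG_LINE)
    print(f"block {bytenr}: expected generation {wanted}, found {found} "
          f"({wanted - found} behind)")
```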

> wanted 95626 found 95625

So close! :)

Am I completely hosed?

-- 
Mitch


* Re: Hitting error after failed balance
From: Duncan @ 2014-01-15 17:19 UTC
  To: linux-btrfs

Mitchel Humpherys posted on Tue, 14 Jan 2014 15:21:19 -0800 as excerpted:

> On Tue, Jan 14 2014 at 11:47:18 AM, Mitchel Humpherys
> <mitch.special@gmail.com> wrote:

>>     
>>     Btrfs v3.12
>>
>>     $ uname -a
>>     Linux mitchelh-linux 3.12.6-1-ARCH #1 SMP PREEMPT Fri
>>     Dec 20 19:39:00 CET 2013 x86_64 GNU/Linux
>>
> Well I tried btrfsck --repair and now it appears to be unmountable...

[If you wish to continue being direct-mailed as well, please keep that
request in every reply, as I read and reply to this list as a newsgroup via
gmane.org. In my news client, replying via email is an extra step, so
I don't do it by default, but I do try to honor requests when I see 'em.]

You're aware that btrfsck --repair is considered a last-ditch option,
right?

In some cases it makes the problems worse, not better.  There are other
things to try first.  And when it does come down to btrfsck --repair, if
the filesystem is still mountable, as yours was, the recommendation is to
first update your backups if needed (since btrfs is still considered
testing and under development, you should have routine, tested backups
any time you're testing it anyway, so this means updating an existing
backup, not making your FIRST one =:^), then test that you can recover
from the backup in case btrfsck --repair makes things worse, and /then/
try the repair.

>     $ sudo mount -a
>     mount: wrong fs type, bad option, bad superblock on
>     /dev/sda6, missing codepage or helper program, or other error

> and here's the kernel log from the mount -a:
> 
>     [13479.995191] btrfs: failed to read log tree
>     [13480.051334]
>     btrfs: open_ctree failed

There's the btrfs-zero-log tool.  When the kernel says it failed to read
the log tree, that's the thing to try.  You will lose whatever the log
recorded since the last commit (the last few seconds of activity), but
btrfs is designed to maintain a consistent commit state.  So, in theory
at least, when the log is corrupt and unusable anyway, zeroing it should
at least get you back to a consistent filesystem state.

Tho I'm not sure what further damage btrfsck --repair might have done, or
what damage the failed balance and the problems before it may have
caused.  But hopefully, with the log zeroed, the filesystem will at least
be mountable.

> So close! :)
> 
> Am I completely hosed?

If zero-log doesn't work, there's still hope.  You can try btrfs-find-root
and btrfs restore, using the instructions here:

https://btrfs.wiki.kernel.org/index.php/Restore

In general, if you haven't already, I'd suggest spending some time 
reading the documentation on the wiki.  You can also read some of the 
back-list posts right here, say from gmane.org, since not all the common 
wisdom on the list may be on the wiki just yet.  I use the nntp/news 
interface, but there's also a web interface (actually multiple web 
interfaces, take your pick =:^), if you're not familiar with nntp or 
simply don't want to bother setting up a proper nntp client.

If that doesn't work either, well, it may be time to mkfs.btrfs and fall 
back to the backups.
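The escalation order in this reply can be written down as a checklist: zero the log, retry the mount, then btrfs-find-root and btrfs restore. The sketch below is a minimal Python version. It assumes the btrfs-progs 3.12-era tool names used in this thread; later btrfs-progs releases expose the log-zeroing step as btrfs rescue zero-log. RESTORE_DIR is a hypothetical path. The sketch only prints the commands, because each step depends on whether the previous one worked. The last resort, mkfs.btrfs plus restoring from backups, is deliberately left out.

```python
import shlex
from typing import List

# DEVICE and MOUNTPOINT come from this thread (the fstab entry);
# RESTORE_DIR is a hypothetical directory on a separate, healthy filesystem.
DEVICE = "/dev/sda6"
MOUNTPOINT = "/local"
RESTORE_DIR = "/mnt/rescue"


def recovery_steps(device: str, mountpoint: str, restore_dir: str) -> List[List[str]]:
    """Return the escalation ladder as argv lists, least drastic first.

    Later steps only make sense if earlier ones fail, so nothing here
    executes them; a human decides when to move on to the next one.
    """
    return [
        # 1. Throw away the unreadable log tree (loses only what it held
        #    since the last commit).
        ["btrfs-zero-log", device],
        # 2. Retry the mount that failed with "failed to read log tree".
        ["mount", device, mountpoint],
        # 3. If it still won't mount, look for usable older tree roots...
        ["btrfs-find-root", device],
        # 4. ...and copy files off the unmounted filesystem into restore_dir.
        ["btrfs", "restore", device, restore_dir],
    ]


if __name__ == "__main__":
    steps = recovery_steps(DEVICE, MOUNTPOINT, RESTORE_DIR)
    for number, argv in enumerate(steps, 1):
        print(f"step {number}: {shlex.join(argv)}")
```

Run each printed command by hand (as root), and stop at the first one that gets the data back.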

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
