floating point exception (core dumped) - btrfs rescue chunk-recover

From: P R Shah <getprshah@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: floating point exception (core dumped) - btrfs rescue chunk-recover
Date: Wed, 6 Jan 2016 10:07:47 +0530
Message-ID: <568C9A1B.6090907@gmail.com>

Hello,

TL;DR ==

btrfs 3x500GB RAID 5 - One device failed. Added a new device (btrfs device add) and tried to remove the failed device (btrfs device delete).

I tried to mount the array in degraded mode, but that didn't work either. After multiple attempts (including adding back the failed HDD), I finally ran the btrfs rescue chunk-recover command on the primary member /dev/sdb.

This ran for about 4 hours, and then failed with "floating point exception (core dumped)".
==

I am testing out btrfs to gain familiarity with it. I am quite amazed at its capabilities and performance. However, I am not able to either understand or implement RAID5 fault tolerance.

I understand from the wiki that RAID56 is experimental. The data I am working with is backed up elsewhere and, for all intents and purposes, discardable.

I have set up a btrfs RAID5 with 3x500GB Seagate HDDs, with a mount point of /storage. Booting is off a fourth HDD (ext4, Lubuntu 64-bit) that is not involved in the RAID.

Everything was working amazingly well, until one HDD failed and was quietly offlined. For a couple of days, the RAID was running off 2 HDDs and I didn't notice.

When I did notice, I shut down the system and ordered a new HDD (2TB), which took a couple of days to arrive.

When I powered up the system again, the failed 500GB was back. Everything loaded fine, and looked good. To be on the safe side, I ran a badblocks test (ro) on the failing HDD.

Halfway through the test, the HDD disappeared again. After a cold reboot, it was detected fine again.

At this point, I decided to replace the failed HDD. I shut down, plugged in the new HDD in place of the boot HDD, booted up with Lubuntu live, mounted the array at /storage, and added the new device to the RAID.

After adding the device successfully, I issued a device delete command for the failed HDD. Partway through the process, the failing HDD (/dev/sdc) disappeared again. After waiting a couple of hours, I hard-reset the system and removed the failing HDD, assuming that the RAID would rebuild on the remaining devices.

Now the RAID (/storage) refused to mount. I got an open_ctree error (please see the logs below).

I tried to mount the array in degraded mode, but that didn't work either. After multiple attempts (including adding back the failed HDD), I finally ran the btrfs rescue chunk-recover command on the primary member /dev/sdb.

This ran for about 4 hours, and then failed with "floating point exception (core dumped)".

Can I recover the array, or should I start again? The data is not important, but I would like to understand the recovery process. Also, is it a misconception on my part that RAID5 with 3 devices is enough for SOHO-level fault tolerance?

Any advice, pointers, etc., much appreciated. Tech level: medium-high (RHCE).

Relevant system information:
== uname -a
Linux lubuntu 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 15:35:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

== btrfs --version
btrfs-progs v4.0

== btrfs fi show
warning, device 2 is missing
Label: 'storage'  uuid: 5a3d6590-df08-4520-b61b-802d350849c7
	Total devices 4 FS bytes used 176.91GiB
	devid    1 size 465.76GiB used 90.03GiB path /dev/sdb
	devid    3 size 465.76GiB used 90.01GiB path /dev/sdc
	devid    4 size 1.82TiB used 10.00GiB path /dev/sda
	*** Some devices missing

== dmesg info
...
Jan  5 01:45:22 lubuntu kernel: [   10.338295] Btrfs loaded
Jan  5 01:45:22 lubuntu kernel: [   10.338899] BTRFS: device label storage devid 4 transid 969 /dev/sda
Jan  5 01:45:22 lubuntu kernel: [   10.340448] BTRFS info (device sda): disk space caching is enabled
Jan  5 01:45:22 lubuntu kernel: [   10.340454] BTRFS: has skinny extents
Jan  5 01:45:22 lubuntu kernel: [   10.343395] BTRFS: failed to read the system array on sda
Jan  5 01:45:22 lubuntu kernel: [   10.352137] BTRFS: open_ctree failed
Jan  5 01:45:22 lubuntu kernel: [   10.382199] BTRFS: device label storage devid 1 transid 969 /dev/sdb
Jan  5 01:45:22 lubuntu kernel: [   10.383740] BTRFS info (device sdb): disk space caching is enabled
Jan  5 01:45:22 lubuntu kernel: [   10.383744] BTRFS: has skinny extents
Jan  5 01:45:22 lubuntu kernel: [   10.384469] BTRFS: failed to read the system array on sdb
Jan  5 01:45:22 lubuntu kernel: [   10.392116] BTRFS: open_ctree failed
Jan  5 01:45:22 lubuntu kernel: [   10.423075] BTRFS: device label storage devid 3 transid

... // after running btrfs rescue chunk-recover for about 4 hours
Jan  5 06:01:45 lubuntu kernel: [15404.828156] traps: btrfs[3016] trap divide error ip:4211a0 sp:7ffd7dbb03a8 error:0 in btrfs[400000+73000]
...
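
A note on the trap line above: "trap divide error" is the kernel reporting an integer division fault, which Linux delivers to the process as SIGFPE; the shell then prints "floating point exception" even though no floating-point arithmetic need be involved. As a minimal sketch for anyone who wants to locate the faulting instruction, the snippet below parses that line and computes where the instruction pointer falls inside the mapped btrfs binary. The helper name parse_trap_line and the regular expression are illustrative assumptions, not part of btrfs-progs or any kernel tool.

```python
import re

# Hypothetical helper: matches a kernel "traps:" line of the form seen in dmesg above.
TRAP_RE = re.compile(
    r"traps: (?P<comm>[^\[\s]+)\[(?P<pid>\d+)\] trap (?P<kind>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_trap_line(line):
    """Extract the process, trap kind, and ip offset within the mapping."""
    m = TRAP_RE.search(line)
    if m is None:
        raise ValueError("not a recognised traps line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "kind": m.group("kind"),
        "ip": ip,
        "offset": ip - base,                # offset of ip from the mapping base
        "in_mapping": 0 <= ip - base < size,
    }

line = ("Jan  5 06:01:45 lubuntu kernel: [15404.828156] traps: btrfs[3016] "
        "trap divide error ip:4211a0 sp:7ffd7dbb03a8 error:0 in btrfs[400000+73000]")
info = parse_trap_line(line)
print(info["kind"], hex(info["offset"]))  # divide error 0x211a0
```

With a matching binary and debug symbols, an address like this could be handed to a symbolizer such as addr2line to find the source line of the division; whether symbols are available for this particular build is not known from the log.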

== some output from btrfs rescue chunk-recover -vv
...
	    Stripes list:
	    [ 0] Stripe: devid = 3, offset = 21484273664
	    [ 1] Stripe: devid = 2, offset = 21484273664
	    [ 2] Stripe: devid = 1, offset = 21504196608
	Chunk: start = 45134905344, len = 2147483648, type = 81, num_stripes = 3
	    Stripes list:
	    [ 0] Stripe: devid = 3, offset = 22558015488
	    [ 1] Stripe: devid = 2, offset = 22558015488
	    [ 2] Stripe: devid = 1, offset = 22577938432
	Chunk: start = 47282388992, len = 2147483648, type = 81, num_stripes = 3
	    Stripes list:
	    [ 0] Stripe: devid = 3, offset = 23631757312
	    [ 1] Stripe: devid = 2, offset = 23631757312
	    [ 2] Stripe: devid = 1, offset = 23651680256
...
	Device extent: devid = 4, start = 5369757696, len = 1073741824, chunk offset = 201901211648
	Device extent: devid = 4, start = 6443499520, len = 1073741824, chunk offset = 204048695296
	Device extent: devid = 4, start = 7517241344, len = 1073741824, chunk offset = 206196178944
	Device extent: devid = 4, start = 8590983168, len = 1073741824, chunk offset = 208343662592
// floating point error
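
For reading the chunk records above, here is a minimal sketch that decodes the "type" field and sanity-checks the chunk length. It assumes the type value is printed in hexadecimal (so "81" means 0x81); that is an assumption about the tool's output format, although the decimal reading would give a nonsensical flag combination. The flag bits are the block group flags of the btrfs on-disk format. The function names are illustrative only.

```python
# Block group flag bits from the btrfs on-disk format.
BLOCK_GROUP_FLAGS = {
    0x001: "DATA",
    0x002: "SYSTEM",
    0x004: "METADATA",
    0x008: "RAID0",
    0x010: "RAID1",
    0x020: "DUP",
    0x040: "RAID10",
    0x080: "RAID5",
    0x100: "RAID6",
}

def decode_chunk_type(type_field):
    """Decode a chunk 'type' string, ASSUMING it is printed in hex."""
    value = int(type_field, 16)
    return [name for bit, name in sorted(BLOCK_GROUP_FLAGS.items()) if value & bit]

def raid5_chunk_len(num_stripes, stripe_len):
    """Logical chunk length for RAID5: one stripe's worth of space holds parity."""
    return (num_stripes - 1) * stripe_len

print(decode_chunk_type("81"))                       # ['DATA', 'RAID5']
print(raid5_chunk_len(3, 1073741824) == 2147483648)  # True
```

Under that assumption, the records describe RAID5 data chunks: three stripes of 1 GiB each (consecutive stripe offsets on a device differ by 1073741824) give a 2 GiB logical chunk, which matches the reported len = 2147483648. The stripe lists also still reference devid 2, the device that "btrfs fi show" reports as missing.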

Regards,
PRShah
