linux-btrfs.vger.kernel.org archive mirror
* Btrfs/RAID5 became unmountable after SATA cable fault
@ 2015-10-19  8:39 Janos Toth F.
  2015-10-20 14:59 ` Duncan
  2015-10-21 16:09 ` Janos Toth F.
  0 siblings, 2 replies; 17+ messages in thread
From: Janos Toth F. @ 2015-10-19  8:39 UTC (permalink / raw)
  To: linux-btrfs

I was in the middle of replacing the drives of my NAS one-by-one (I
wished to move to bigger and faster storage at the end), so I used one
more SATA drive + SATA cable than usual. Unfortunately, the extra
cable turned out to be faulty and it looks like it caused some heavy
damage to the file system.

There was no "devive replace" running at the moment or the disaster.
The first round already got finished hours ago and I planned to start
the next one before going to sleep. So, it was a full RAID-5 setup in
normal state. But one of the active, mounted devices was the first
replacment HDD and it was hanging on the spare SATA cable.

I tried to save a file to my mounted Samba share and realized the
file system had become read-only. I rebooted the machine and saw that my
/data can't be mounted.
According to SmartmonTools, one of the drives was suffering from SATA
communication errors.

I tried some trivial recovery methods and searched the mailing
list archives, but I didn't really find a solution. I wonder if
somebody can help with this.

Should I run "btrfs rescue chunk-recover /dev/sda"?
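
Just for reference, a sketch of some read-only checks before anything as
invasive as chunk-recover (using /dev/sda as above; I'm not sure any of
them actually help here, and super-recover asks before changing anything):

# btrfs check -s 1 /dev/sda                # retry the check against the first backup superblock copy
# btrfs rescue super-recover -v /dev/sda   # compare the superblock copies; prompts before fixing anything
# btrfs-find-root /dev/sda                 # scan for older tree roots that might still be readable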

Here are some raw details:

# uname -a
Linux F17a_NAS 4.2.3-gentoo #2 SMP Sun Oct 18 17:56:45 CEST 2015
x86_64 AMD E-350 Processor AuthenticAMD GNU/Linux

# btrfs --version
btrfs-progs v4.2.2

# btrfs check /dev/sda
checksum verify failed on 21102592 found 295F0086 wanted 00000000
checksum verify failed on 21102592 found 295F0086 wanted 00000000
checksum verify failed on 21102592 found 99D0FC26 wanted B08FFCA0
checksum verify failed on 21102592 found 99D0FC26 wanted B08FFCA0
bytenr mismatch, want=21102592, have=65536
Couldn't read chunk root
Couldn't open file system

# mount /dev/sda /data -o ro,recovery
mount: wrong fs type, bad option, bad superblock on /dev/sda, ...

# cat /proc/kmsg
<6>[ 1902.033164] BTRFS info (device sdb): enabling auto recovery
<6>[ 1902.033184] BTRFS info (device sdb): disk space caching is enabled
<6>[ 1902.033191] BTRFS: has skinny extents
<3>[ 1902.034931] BTRFS (device sdb): bad tree block start 0 21102592
<3>[ 1902.051259] BTRFS (device sdb): parent transid verify failed on
21147648 wanted 101748 found 101124
<3>[ 1902.111807] BTRFS (device sdb): parent transid verify failed on
44613632 wanted 101770 found 101233
<3>[ 1902.126529] BTRFS (device sdb): parent transid verify failed on
40595456 wanted 101767 found 101232
<6>[ 1902.164667] BTRFS: bdev /dev/sda errs: wr 858, rd 8057, flush
280, corrupt 0, gen 0
<3>[ 1902.165929] BTRFS (device sdb): parent transid verify failed on
44617728 wanted 101770 found 101233
<3>[ 1902.166975] BTRFS (device sdb): parent transid verify failed on
44621824 wanted 101770 found 101233
<3>[ 1902.271296] BTRFS (device sdb): parent transid verify failed on
38621184 wanted 101765 found 101223
<3>[ 1902.380526] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1902.381510] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1902.381549] BTRFS: Failed to read block groups: -5
<3>[ 1902.394835] BTRFS: open_ctree failed
<6>[ 1911.202254] BTRFS info (device sdb): enabling auto recovery
<6>[ 1911.202270] BTRFS info (device sdb): disk space caching is enabled
<6>[ 1911.202275] BTRFS: has skinny extents
<3>[ 1911.203611] BTRFS (device sdb): bad tree block start 0 21102592
<3>[ 1911.204803] BTRFS (device sdb): parent transid verify failed on
21147648 wanted 101748 found 101124
<3>[ 1911.246384] BTRFS (device sdb): parent transid verify failed on
44613632 wanted 101770 found 101233
<3>[ 1911.248729] BTRFS (device sdb): parent transid verify failed on
40595456 wanted 101767 found 101232
<6>[ 1911.251658] BTRFS: bdev /dev/sda errs: wr 858, rd 8057, flush
280, corrupt 0, gen 0
<3>[ 1911.252485] BTRFS (device sdb): parent transid verify failed on
44617728 wanted 101770 found 101233
<3>[ 1911.253542] BTRFS (device sdb): parent transid verify failed on
44621824 wanted 101770 found 101233
<3>[ 1911.278414] BTRFS (device sdb): parent transid verify failed on
38621184 wanted 101765 found 101223
<3>[ 1911.283950] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1911.284835] BTRFS (device sdb): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[ 1911.284873] BTRFS: Failed to read block groups: -5
<3>[ 1911.298783] BTRFS: open_ctree failed


# btrfs-show-super /dev/sda
superblock: bytenr=65536, device=/dev/sda
---------------------------------------------------------
csum                    0xe8789014 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    2bba7cff-b4bf-4554-bee4-66f69c761ec4
label
generation              101480
root                    37892096
sys_array_size          258
chunk_root_generation   101124
root_level              2
chunk_root              21147648
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6001196802048
bytes_used              3593129504768
sectorsize              4096
nodesize                4096
leafsize                4096
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x381
                        ( MIXED_BACKREF |
                          RAID56 |
                          SKINNY_METADATA |
                          NO_HOLES )
csum_type               0
csum_size               4
cache_generation        101480
uuid_tree_generation    101480
dev_item.uuid           330c9c98-4140-497a-814f-ac76a5b07172
dev_item.fsid           2bba7cff-b4bf-4554-bee4-66f69c761ec4 [match]
dev_item.type           0
dev_item.total_bytes    2000398934016
dev_item.bytes_used     1809263362048
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          2
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0


# btrfs-show-super /dev/sdb
superblock: bytenr=65536, device=/dev/sdb
---------------------------------------------------------
csum                    0x177aae67 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    2bba7cff-b4bf-4554-bee4-66f69c761ec4
label
generation              101770
root                    44650496
sys_array_size          258
chunk_root_generation   101748
root_level              2
chunk_root              21102592
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6001196802048
bytes_used              3533993762816
sectorsize              4096
nodesize                4096
leafsize                4096
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x381
                        ( MIXED_BACKREF |
                          RAID56 |
                          SKINNY_METADATA |
                          NO_HOLES )
csum_type               0
csum_size               4
cache_generation        101770
uuid_tree_generation    101770
dev_item.uuid           f14b343e-b701-47f2-a652-e52a47be42b2
dev_item.fsid           2bba7cff-b4bf-4554-bee4-66f69c761ec4 [match]
dev_item.type           0
dev_item.total_bytes    2000398934016
dev_item.bytes_used     1815705812992
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          3
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0


# btrfs-show-super /dev/sdc
superblock: bytenr=65536, device=/dev/sdc
---------------------------------------------------------
csum                    0xa06026f3 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    2bba7cff-b4bf-4554-bee4-66f69c761ec4
label
generation              101770
root                    44650496
sys_array_size          258
chunk_root_generation   101748
root_level              2
chunk_root              21102592
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             6001196802048
bytes_used              3533993762816
sectorsize              4096
nodesize                4096
leafsize                4096
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x381
                        ( MIXED_BACKREF |
                          RAID56 |
                          SKINNY_METADATA |
                          NO_HOLES )
csum_type               0
csum_size               4
cache_generation        101770
uuid_tree_generation    101770
dev_item.uuid           4dadced6-392f-4d57-920c-ee8fbebbd608
dev_item.fsid           2bba7cff-b4bf-4554-bee4-66f69c761ec4 [match]
dev_item.type           0
dev_item.total_bytes    2000398934016
dev_item.bytes_used     1815726784512
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0


# smartctl -a /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.2.3-gentoo] (local build)
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       16

This was a new drive and this counter didn't move before I touched
the cables again in order to prepare for the next "device replace"
round.
I checked the SMART data several times before, during and after the
first round of "device replace" to make sure the new drive didn't come
faulty from the factory/reseller... I'm sure these two (the unmountable
filesystem and this SATA cable error counter) are directly related.
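
A quick way to keep an eye on both counters side by side would be something
like this (a sketch; the btrfs counters need the filesystem mounted, so this
is for next time, and the device/mount point are as above):

# smartctl -A /dev/sda | grep -i -E 'crc|pending|realloc'   # the drive's own link/surface counters
# btrfs device stats /data                                  # btrfs's per-device write/read/flush/corruption counters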

I threw away these SATA cables because another one from this "batch" (a
four-pack I picked up somewhere, sometime...) proved to be faulty as
well (although that one didn't cause any practical harm, other than
making a Windows PC hang and the CRC error counter of the SSD
rise).


I am not really happy that Btrfs in RAID5 mode wasn't a little more
fault tolerant towards "disk" faults. Although it might still be
saved, right? Right? :)


Thank you for your answers in advance!


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-10-19  8:39 Janos Toth F.
@ 2015-10-20 14:59 ` Duncan
  2015-10-21 16:09 ` Janos Toth F.
  1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2015-10-20 14:59 UTC (permalink / raw)
  To: linux-btrfs

Janos Toth F. posted on Mon, 19 Oct 2015 10:39:06 +0200 as excerpted:

> I was in the middle of replacing the drives of my NAS one-by-one (I
> wished to move to bigger and faster storage at the end), so I used one
> more SATA drive + SATA cable than usual. Unfortunately, the extra cable
> turned out to be faulty and it looks like it caused some heavy damage to
> the file system.
> 
> There was no "devive replace" running at the moment or the disaster. The
> first round already got finished hours ago and I planned to start the
> next one before going to sleep. So, it was a full RAID-5 setup in normal
> state. But one of the active, mounted devices was the first replacment
> HDD and it was hanging on the spare SATA cable.
> 
> I tried to save some file to my mounted samba share and I realized the
> file system because read-only. I rebooted the machine and saw that my
> /data can't be mounted.
> According to SmartmonTools, one of the drives was suffering from SATA
> communication errors.
> 
> I tried some tirivial recovery methods and I tried to search the mailing
> list archives but I didn't really find a solution. I wonder if somebody
> can help with this.
> 
> Should I run "btrfs rescue chunk-recover /dev/sda"?
> 
> Here are some raw details:
> 
> # uname -a
> Linux F17a_NAS 4.2.3-gentoo #2 SMP Sun Oct 18 17:56:45 CEST 2015
> x86_64 AMD E-350 Processor AuthenticAMD GNU/Linux
> 
> # btrfs --version
> btrfs-progs v4.2.2

OK, nice and current kernel and userspace, vital for btrfs raid56 mode 
especially, as it's so new...

> # mount /dev/sda /data -o ro,recovery
> mount: wrong fs type, bad option, bad superblock on /dev/sda, ...

Did you try mount -o degraded ?  The recovery option is for a different 
problem and you probably simply need degraded at this point.  But there's 
no indication that you tried degraded.

If one of the devices is/was on a faulting cable, then it's likely not 
being detected properly, and btrfs is simply failing to mount it business-
as-usual because it's degraded, and btrfs wants to ensure that you know 
that before it lets you mount, to maximize the chances of fixing the 
problem before something else goes wrong as well, where the two problems 
combined really /could/ screw up the filesystem beyond simple repair.

Do note that due to a bug with the current kernel, you may get just one 
writable mount to fix the problem.  Attempting a second degraded writable 
mount will often fail due to the bug, and you can only do degraded,ro 
after that, which will let you copy off the data to elsewhere but won't 
let you repair the filesystem, as a writable mount is required for that.  
So if you don't have current backups and you want to maximize your chance 
of saving the data, mounting degraded,ro *first* and (assuming it mounts) 
taking that opportunity to back up, before attempting to mount degraded,rw 
in order to replace the screwed-up device, is recommended.  

Once you have a current backup of anything on it that you consider 
important[1], then do a btrfs filesystem show and see if it says some 
devices missing (as it probably will if you had to use the degraded 
option), after which you can mount degraded,rw and replace/remove the 
missing device.
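
A minimal sketch of that sequence (devid 2 is the /dev/sda that logged the
errors, per the superblock dump earlier in the thread; /dev/sdX is a
placeholder for the replacement disk, and this assumes the degraded mounts
actually succeed):

# mount -o degraded,ro /dev/sdb /data     # read-only first; copy off anything important now
# btrfs filesystem show /data             # check whether a device is reported missing
# umount /data
# mount -o degraded /dev/sdb /data        # the one writable degraded mount (see the bug below)
# btrfs replace start 2 /dev/sdX /data    # replace devid 2 with the new disk, or...
# btrfs device delete missing /data       # ...drop the missing device instead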

Meanwhile, should you end up having mounted degraded,rw, and not gotten 
the missing devices properly deleted/replaced before a second degraded 
mount, which probably won't let you mount rw due to that bug I mentioned, 
there are patches available, I believe already applied to the latest kernel 
4.3-rc, that should let you mount degraded,rw more than once, if the data 
and metadata are actually all still available.

The bug is that btrfs currently looks only at the number of devices and 
the chunk-types when deciding whether a filesystem can be mounted 
degraded,rw, while if new data is written in degraded,rw mode, it may 
have to fall back to writing new chunks in single mode.  That's fine for 
the first writable mount, but on the second, it sees those single chunks 
and devices missing, and thinks that single mode with devices missing 
means data is missing too, even tho as long as no further devices went 
missing, the single chunks could only have been written to the present 
devices, so no data should actually be missing at all.  But current btrfs 
doesn't know that, as it only looks at the number of devices vs. what's 
supposed to be there, along with the chunk type, and if there are any 
single-type chunks with devices missing, it gives up and won't allow a 
writable mount.
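
If you want to see that effect for yourself, the single chunks created during 
a degraded,rw mount show up in the normal space report (a sketch, assuming 
the filesystem is mountable at that point):

# btrfs filesystem df /data    # any 'single' data/metadata lines next to the RAID5 ones are those new chunks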

What the patches make btrfs do instead is look at the individual chunks.  
If all the chunks are actually available (as they should be in this 
second degraded writable-mount scenario), then it can still mount 
degraded,rw despite missing devices, thus allowing you to actually remove/
replace the missing device, since a writable mount is needed in order to 
do that remove/replace.

---
[1] If you're a regular reader of the list, you'll know that I regularly 
point to the sysadmin's rule of backups: If it's not backed up, by 
definition you don't care about the data as much as the time/hassle/
resources saved by not backing up, despite any protests to the contrary 
after loss, because your actions spoke louder than your words and your 
actions said you valued the time/resources saved in not making the backup 
more than the data.  By that definition, you either have the backup, or 
you don't really care about the data anyway and have saved what you value 
most, the time/resources that would have otherwise gone into the backup.  
That said, valuing the time/resources saved more than the calculated 
potential risk of losing the data is one thing, actually having it at 
risk right now is something entirely different since it changes the risk 
factor in the equation, so despite the data not being worth the backup 
cost when the risk was theoretical, it may still be worth the cost once 
that risk factor jumps dramatically, making it worthwhile to take the 
opportunity to do the backup if you get the chance, once you know the data 
is at increased risk of being lost entirely, even if it wasn't 
worthwhile while that risk was much lower.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-10-19  8:39 Janos Toth F.
  2015-10-20 14:59 ` Duncan
@ 2015-10-21 16:09 ` Janos Toth F.
  2015-10-21 16:44   ` ronnie sahlberg
                     ` (3 more replies)
  1 sibling, 4 replies; 17+ messages in thread
From: Janos Toth F. @ 2015-10-21 16:09 UTC (permalink / raw)
  To: linux-btrfs

I went through all the recovery options I could find (starting from
read-only to "extraordinarily dangerous"). Nothing seemed to work.

A Windows based proprietary recovery software (ReclaiMe) could scratch
the surface, but only that (it showed me the whole original folder
structure after a few minutes of scanning and the "preview" of some
plaintext files was promising, but most of the bigger files seemed
to be broken).

I used this as bulk storage for backups and all the things I didn't
care to keep in more than one copy, but that includes my
"scratchpad", so I cared enough to use RAID5 mode and to try restoring
some things.

Any last ideas before I "ata secure erase" and sell/repurpose the disks?


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-10-21 16:09 ` Janos Toth F.
@ 2015-10-21 16:44   ` ronnie sahlberg
  2015-10-21 17:42   ` ronnie sahlberg
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: ronnie sahlberg @ 2015-10-21 16:44 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: Btrfs BTRFS

If it is for mostly archival storage, I would suggest you take a look
at snapraid.


On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F. <toth.f.janos@gmail.com> wrote:
> I went through all the recovery options I could find (starting from
> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>
> A Windows based proprietary recovery software (ReclaiMe) could scratch
> the surface but only that (it showed me the whole original folder
> structure after a few minutes of scanning and the "preview" of some
> some plaintext files was promising but most of the bigger files seemed
> to be broken).
>
> I used this as a bulk storage for backups and all the things I didn't
> care to keep in more than one copies but that includes my
> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
> some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-10-21 16:09 ` Janos Toth F.
  2015-10-21 16:44   ` ronnie sahlberg
@ 2015-10-21 17:42   ` ronnie sahlberg
  2015-10-21 18:40     ` Janos Toth F.
  2015-10-21 17:46   ` Janos Toth F.
  2015-10-21 20:26   ` Chris Murphy
  3 siblings, 1 reply; 17+ messages in thread
From: ronnie sahlberg @ 2015-10-21 17:42 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: Btrfs BTRFS

Maybe hold off erasing the drives a little in case someone wants to
collect some extra data for diagnosing how/why the filesystem got into
this unrecoverable state.

A single device having issues should not cause the whole filesystem to
become unrecoverable.

On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F. <toth.f.janos@gmail.com> wrote:
> I went through all the recovery options I could find (starting from
> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>
> A Windows based proprietary recovery software (ReclaiMe) could scratch
> the surface but only that (it showed me the whole original folder
> structure after a few minutes of scanning and the "preview" of some
> some plaintext files was promising but most of the bigger files seemed
> to be broken).
>
> I used this as a bulk storage for backups and all the things I didn't
> care to keep in more than one copies but that includes my
> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
> some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-10-21 16:09 ` Janos Toth F.
  2015-10-21 16:44   ` ronnie sahlberg
  2015-10-21 17:42   ` ronnie sahlberg
@ 2015-10-21 17:46   ` Janos Toth F.
  2015-10-21 20:26   ` Chris Murphy
  3 siblings, 0 replies; 17+ messages in thread
From: Janos Toth F. @ 2015-10-21 17:46 UTC (permalink / raw)
  To: linux-btrfs

I tried several things, including the degraded mount option. One example:

# mount /dev/sdb /data -o ro,degraded,nodatasum,notreelog
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

# cat /proc/kmsg
<6>[  262.616929] BTRFS info (device sdd): allowing degraded mounts
<6>[  262.616943] BTRFS info (device sdd): setting nodatasum
<6>[  262.616949] BTRFS info (device sdd): disk space caching is enabled
<6>[  262.616953] BTRFS: has skinny extents
<6>[  262.652671] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280,
corrupt 0, gen 0
<3>[  262.697162] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  262.697633] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  262.697660] BTRFS: Failed to read block groups: -5
<3>[  262.709885] BTRFS: open_ctree failed
<6>[  267.197365] BTRFS info (device sdd): allowing degraded mounts
<6>[  267.197385] BTRFS info (device sdd): setting nodatasum
<6>[  267.197397] BTRFS info (device sdd): disabling tree log
<6>[  267.197406] BTRFS info (device sdd): disk space caching is enabled
<6>[  267.197412] BTRFS: has skinny extents
<6>[  267.232809] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280,
corrupt 0, gen 0
<3>[  267.246167] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  267.246706] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  267.246727] BTRFS: Failed to read block groups: -5
<3>[  267.261392] BTRFS: open_ctree failed

On Wed, Oct 21, 2015 at 6:09 PM, Janos Toth F. <toth.f.janos@gmail.com> wrote:
> I went through all the recovery options I could find (starting from
> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>
> A Windows based proprietary recovery software (ReclaiMe) could scratch
> the surface but only that (it showed me the whole original folder
> structure after a few minutes of scanning and the "preview" of some
> some plaintext files was promising but most of the bigger files seemed
> to be broken).
>
> I used this as a bulk storage for backups and all the things I didn't
> care to keep in more than one copies but that includes my
> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
> some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-10-21 17:42   ` ronnie sahlberg
@ 2015-10-21 18:40     ` Janos Toth F.
  0 siblings, 0 replies; 17+ messages in thread
From: Janos Toth F. @ 2015-10-21 18:40 UTC (permalink / raw)
  Cc: Btrfs BTRFS

I am afraid the filesystem right now is really damaged regardless of
its state upon the unexpected cable failure, because I tried some
dangerous options after the read-only restore/recovery methods all failed
(including zero-log, followed by init-csum-tree and even
chunk-recover -> all of them just spat out several kinds of errors,
which suggests they probably didn't even write anything to the disks
before they decided that they had already failed, but they only caused
more harm than good if they did write something).

Actually, I almost got rid of this data myself intentionally when my
new set of drives arrived. I was considering whether I should simply start
from scratch (maybe reviewing, and maybe saving, my "scratchpad"
portion of the data, but nothing really irreplaceable and/or valuable),
but I thought it was a good idea to test the "device replace" function
in real life.

Even though the replace operation seemed to be successful, I am
beginning to wonder whether it really was.


On Wed, Oct 21, 2015 at 7:42 PM, ronnie sahlberg
<ronniesahlberg@gmail.com> wrote:
> Maybe hold off erasing the drives a little in case someone wants to
> collect some extra data for diagnosing how/why the filesystem got into
> this unrecoverable state.
>
> A single device having issues should not cause the whole filesystem to
> become unrecoverable.
>
> On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F. <toth.f.janos@gmail.com> wrote:
>> I went through all the recovery options I could find (starting from
>> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>>
>> A Windows based proprietary recovery software (ReclaiMe) could scratch
>> the surface but only that (it showed me the whole original folder
>> structure after a few minutes of scanning and the "preview" of some
>> some plaintext files was promising but most of the bigger files seemed
>> to be broken).
>>
>> I used this as a bulk storage for backups and all the things I didn't
>> care to keep in more than one copies but that includes my
>> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
>> some things.
>>
>> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-10-21 16:09 ` Janos Toth F.
                     ` (2 preceding siblings ...)
  2015-10-21 17:46   ` Janos Toth F.
@ 2015-10-21 20:26   ` Chris Murphy
  3 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2015-10-21 20:26 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: Btrfs BTRFS

https://btrfs.wiki.kernel.org/index.php/Restore

This should still be possible even with a degraded/unmounted raid5. It
is a bit tedious to figure out how to use it, but if you've got some
things you want off the volume, it's not so difficult that it should
prevent you from trying it.
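
A rough sketch of how restore is typically driven in this situation (the
destination must be a different, healthy filesystem; <bytenr> is whatever
btrfs-find-root reports, and -i just keeps it going past errors):

# mkdir -p /mnt/rescue
# btrfs restore -v -i /dev/sdb /mnt/rescue
# btrfs-find-root /dev/sdb                              # if the default roots are too damaged
# btrfs restore -t <bytenr> -v -i /dev/sdb /mnt/rescue  # retry from an older root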


Chris Murphy


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
@ 2015-10-22  1:18 János Tóth F.
  0 siblings, 0 replies; 17+ messages in thread
From: János Tóth F. @ 2015-10-22  1:18 UTC (permalink / raw)
  To: Btrfs BTRFS

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=utf-8, Size: 1119 bytes --]

I tried that after every possible combination of RO mount failed. I used it
in the past for a USB attached drive where a USB-SATA adapter had some issues
(I plugged it into a standard USB2 port even though it expected USB3 power
current, so a high-current or several standard USB2 ports should have been
used to prevent it from an emergency shutdown during file copy). It worked
flawlessly then, but in this case it threw the same errors as "btrfs check".
It couldn't even find and list alternative roots. Every tool seemed to give
up really fast and easily, sometimes with segfaults and tracing messages in
the kernel log.

On Oct 21, 2015 22:26, Chris Murphy <lists@colorremedies.com> wrote:
>
> https://btrfs.wiki.kernel.org/index.php/Restore 
>
> This should still be possible with even a degraded/unmounted raid5. It 
> is a bit tedious to figure out how to use it but if you've got some 
> things you want off the volume, it's not so difficult to prevent 
> trying it. 
>
>
> Chris Murphy 


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
       [not found] <g7loe3red3ksp64hmb0vsbs2.1445476794489@email.android.com>
@ 2015-11-04 18:01 ` Janos Toth F.
  2015-11-04 18:45   ` Austin S Hemmelgarn
  2015-11-06  9:03   ` Janos Toth F.
  0 siblings, 2 replies; 17+ messages in thread
From: Janos Toth F. @ 2015-11-04 18:01 UTC (permalink / raw)
  To: Btrfs BTRFS

Well. Now I am really confused about Btrfs RAID-5!

So, I replaced all SATA cables (with ones explicitly marked as being
aimed at SATA3 speeds) and all the 3x2Tb WD Red 2.0 drives with 3x4Tb
Seagate Constellation ES.3 drives and started from scratch. I
secure-erased every drive, created an empty filesystem and ran a
"long" SMART self-test on all drives before I started using the
storage space (the tests finished without errors, all drives looked
fine, zero bad sectors, zero read or SATA CRC errors... all looked
perfectly fine at the time...).

It didn't take long before I realized that one of the new drives had
started failing.
I started a scrub and it reported both corrected and uncorrectable errors.
I looked at the SMART data. Two drives look perfectly fine and one drive
seems to be really sick. The latter has some "reallocated" and
several hundred "pending" sectors, among other error indications in
the log. I guess it's not the drive surface but the HDD controller (or
maybe a head) which is really dying.

I figured the uncorrectable errors are write errors, which is not
surprising given the perceived "health" of the drive according to its
SMART attributes and error logs. That's understandable.
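
For what it's worth, the per-device error counters make it easy to confirm
whether those really were write errors (a sketch; /data is the mount point
as before):

# btrfs device stats /data       # per-device write/read/flush/corruption/generation error counts
# btrfs scrub status -d /data    # per-device breakdown of what the scrub found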


Although, I tried to copy data from the filesystem and it failed in
various ways.
There was a file which couldn't be copied at all. Good question why. I
guess it's because the filesystem needs to be repaired to get the
checksums and parities sorted out first. That's also understandable
(though unexpected; I thought RAID-5 Btrfs is sort-of "self-healing"
in these situations: it should theoretically still be able to
reconstruct and present the correct data, based on checksums and
parities, seamlessly, and only place an error in the kernel log...).
But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs give a read error when the checksums didn't add up?


Isn't Btrfs supposed to constantly check the integrity of the file
data during any normal read operations and give an error instead of
spitting out corrupt data as if it was perfectly legit?
I thought that's how it is supposed to work.
What's the point of full data checksumming if only an explicitly
requested scrub operation might look for errors? I thought the
logical thing is for checksum verification to happen during every
single read operation, and for passing that check to be mandatory in
order to get any data out of the filesystem (perhaps excluding
Direct-I/O mode, but I never use that on Btrfs - if that's even
actually supported, I don't know).
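
A rough way to check whether those files are even covered by data checksums,
and whether reading them trips the verification at all (a sketch; the path is
a placeholder, assuming a default mount without nodatasum/nodatacow):

# lsattr /data/path/to/file.iso                      # a 'C' flag would mean NOCOW, i.e. no data checksums
# grep ' /data ' /proc/mounts                        # confirm nodatasum/nodatacow aren't among the mount options
# dd if=/data/path/to/file.iso of=/dev/null bs=1M    # a checksum failure should show up as an I/O error here
# dmesg | tail                                       # ...and as a csum error in the kernel log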


Now I am really considering to move from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a
new motherboard with more native SATA connectors and an extra HDD).
That one seemed to actually do what it promises (abort any read
operations upon checksum errors [which always happens seamlessly on
every read] but look at the redundant data first and seamlessly
"self-heal" if possible). The only thing which made Btrfs to look as a
better alternative was the RAID-5 support. But I recently experienced
two cases of 1 drive failing out of 3, and it always turned out to be a
smaller or bigger disaster (completely lost data or inconsistent data).


Does anybody have ideas about what might have gone wrong in this second scenario?


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-11-04 18:01 ` Janos Toth F.
@ 2015-11-04 18:45   ` Austin S Hemmelgarn
  2015-11-05  4:06     ` Duncan
  2015-11-06  9:03   ` Janos Toth F.
  1 sibling, 1 reply; 17+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-04 18:45 UTC (permalink / raw)
  To: Janos Toth F., Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 4556 bytes --]

On 2015-11-04 13:01, Janos Toth F. wrote:
> But the worst part is that there are some ISO files which were
> seemingly copied without errors but their external checksums (the one
> which I can calculate with md5sum and compare to the one supplied by
> the publisher of the ISO file) don't match!
> Well... this, I cannot understand.
> How could these files become corrupt from a single disk failure? And
> more importantly: how could these files be copied without errors? Why
> didn't Btrfs gave a read error when the checksums didn't add up?
If you can prove that there was a checksum mismatch and BTRFS returned 
invalid data instead of a read error or going to the other disk, then 
that is a very serious bug that needs to be fixed.  You need to keep in 
mind also however that it's completely possible that the data was bad 
before you wrote it to the filesystem, and if that's the case, there's 
nothing any filesystem can do to fix it for you.
>
> Isn't Btrfs supposed to constantly check the integrity of the file
> data during any normal read operations and give an error instead of
> spitting out corrupt data as if it was perfectly legit?
> I thought that's how it is supposed to work.
Assuming that all of your hardware is working exactly like it's supposed 
to, yes it should work that way.  If, however, you have something that 
corrupts the data in RAM before or while BTRFS is computing the checksum 
prior to writing the data, then it's fully possible for bad data to get 
written to disk and still have a perfectly correct checksum.  Bad RAM 
may also explain your issues mentioned above with not being able to copy 
stuff off of the filesystem.

Also, if you're using NOCOW files (or just the mount option), those very 
specifically do not store checksums for the blocks, because there is no 
way to do it without significant risk of data corruption.
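
A quick way to check for (or deliberately set) the NOCOW attribute, in case 
any of the affected files were created that way (a sketch; the paths are 
placeholders, and +C only takes effect on empty files, or on directories for 
files created under them afterwards):

# lsattr /data/some/file.iso    # a capital 'C' in the flag field means NOCOW, and therefore no data checksums
# chattr +C /data/nocow-dir     # new files created in this directory will be NOCOW
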
> What's the point of full data checksuming if only an explicitly
> requested scrub operation might look for errors? I thought's it's the
> logical thing to do if checksum verification happens during every
> single read operation and passing that check is mandatory in order to
> get any data out of the filesystem (might be excluding the Direct-I/O
> mode but I never use that on Btrfs - if that's even actually
> supported, I don't know).
>
>
> Now I am really considering to move from Linux to Windows and from
> Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
> that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a
> new motherboard with more native SATA connectors and an extra HDD).
> That one seemed to actually do what it promises (abort any read
> operations upon checksum errors [which always happens seamlessly on
> every read] but look at the redundant data first and seamlessly
> "self-heal" if possible). The only thing which made Btrfs to look as a
> better alternative was the RAID-5 support. But I recently experienced
> two cases of 1 drive failing of 3 and it always tuned out as a smaller
> or bigger disaster (completely lost data or inconsistent data).
Have you considered looking into ZFS?  I hate to suggest it as an 
alternative to BTRFS, but it's a much more mature and well tested 
technology than ReFS, and has many of the same features as BTRFS (and 
even has the option for triple parity instead of the double you get with 
RAID6).  If you do consider ZFS, make a point to look at FreeBSD in 
addition to the Linux version, the BSD one was a much better written 
port of the original Solaris drivers, and has better performance in many 
cases (and as much as I hate to admit it, BSD is way more reliable than 
Linux in most use cases).

You should also seriously consider whether the convenience of having a 
filesystem that fixes internal errors itself with no user intervention 
is worth the risk of it corrupting your data.  Returning correct data 
whenever possible is one thing, being 'self-healing' is completely 
different.  When you start talking about things that automatically fix 
internal errors without user intervention is when most seasoned system 
administrators start to get really nervous.  Self correcting systems 
have just as much chance to make things worse as they do to make things 
better, and most of them depend on the underlying hardware working 
correctly to actually provide any guarantee of reliability.  I cannot 
count the number of stories I've heard of 'self-healing' hardware RAID 
controllers destroying data.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-11-04 18:45   ` Austin S Hemmelgarn
@ 2015-11-05  4:06     ` Duncan
  2015-11-05 12:30       ` Austin S Hemmelgarn
  2015-11-06  3:19       ` Zoiled
  0 siblings, 2 replies; 17+ messages in thread
From: Duncan @ 2015-11-05  4:06 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as
excerpted:

> On 2015-11-04 13:01, Janos Toth F. wrote:
>> But the worst part is that there are some ISO files which were
>> seemingly copied without errors but their external checksums (the one
>> which I can calculate with md5sum and compare to the one supplied by
>> the publisher of the ISO file) don't match!
>> Well... this, I cannot understand.
>> How could these files become corrupt from a single disk failure? And
>> more importantly: how could these files be copied without errors? Why
>> didn't Btrfs gave a read error when the checksums didn't add up?
> If you can prove that there was a checksum mismatch and BTRFS returned
> invalid data instead of a read error or going to the other disk, then
> that is a very serious bug that needs to be fixed.  You need to keep in
> mind also however that it's completely possible that the data was bad
> before you wrote it to the filesystem, and if that's the case, there's
> nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned 
off checksumming with nodatasum or nocow, then it's almost certainly 
returning the data it was given to write out in the first place.  Whether 
that data it was given to write out was correct, however, is an 
/entirely/ different matter.

If ISOs are failing their external checksums, then something is going 
on.  Had you verified the external checksums when you first got the 
files?  That is, are you sure the files were correct as downloaded and/or 
ripped?

Where were the ISOs stored between original procurement/validation and 
writing to btrfs?  Is it possible you still have some/all of them on that 
media?  Do they still external-checksum-verify there?
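
Re-verifying against the publisher's checksum file is the quickest way to 
settle that (a sketch; SHA256SUMS stands for whatever checksum file they 
actually ship):

# sha256sum -c SHA256SUMS | grep -v ': OK$'    # print only the files that fail
# md5sum some-image.iso                        # or compare a single file by hand against the published value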

Basically, assuming btrfs checksums are validating, there's three other 
likely possibilities for where the corruption could have come from before 
writing to btrfs.  Either the files were bad as downloaded or otherwise 
procured -- which is why I asked whether you verified them upon receipt 
-- or you have memory that's going bad, or your temporary storage is 
going bad, before the files ever got written to btrfs.

The memory going bad is a particularly worrying possibility, 
considering...

>> Now I am really considering to move from Linux to Windows and from
>> Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
>> that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new
>> motherboard with more native SATA connectors and an extra HDD). That
>> one seemed to actually do what it promises (abort any read operations
>> upon checksum errors [which always happens seamlessly on every read]
>> but look at the redundant data first and seamlessly "self-heal" if
>> possible). The only thing which made Btrfs to look as a better
>> alternative was the RAID-5 support. But I recently experienced two
>> cases of 1 drive failing of 3 and it always tuned out as a smaller or
>> bigger disaster (completely lost data or inconsistent data).

> Have you considered looking into ZFS?  I hate to suggest it as an
> alternative to BTRFS, but it's a much more mature and well tested
> technology than ReFS, and has many of the same features as BTRFS (and
> even has the option for triple parity instead of the double you get with
> RAID6).  If you do consider ZFS, make a point to look at FreeBSD in
> addition to the Linux version, the BSD one was a much better written
> port of the original Solaris drivers, and has better performance in many
> cases (and as much as I hate to admit it, BSD is way more reliable than
> Linux in most use cases).
> 
> You should also seriously consider whether the convenience of having a
> filesystem that fixes internal errors itself with no user intervention
> is worth the risk of it corrupting your data.  Returning correct data
> whenever possible is one thing, being 'self-healing' is completely
> different.  When you start talking about things that automatically fix
> internal errors without user intervention is when most seasoned system
> administrators start to get really nervous.  Self correcting systems
> have just as much chance to make things worse as they do to make things
> better, and most of them depend on the underlying hardware working
> correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one 
related smaller one!

The people who have a lot of ZFS experience say it's generally quite 
reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*!  
The self-healing works well, *PROVIDED* memory isn't producing errors.  
Absolutely reliable memory is in fact *so* critical, that running ZFS on 
non-ECC memory is severely discouraged as a very real risk to your data.

Which is why the above hints that your memory may be bad are so 
worrying.  Don't even *THINK* about ZFS, particularly its self-healing 
features, if you're not absolutely sure your memory is 100% reliable, 
because apparently, based on the comments I've seen, if it's not, you 
WILL have data loss, likely far worse than btrfs under similar 
circumstances, because when btrfs detects a checksum error it tries 
another copy if it has one (raid1/10 mode), and simply fails the read if 
it doesn't, while apparently, zfs with self-healing activated will give 
you what it thinks is the corrected data, writing it back to repair the 
problem as well, but if memory is bad, it'll be self-damaging instead of 
self-healing, and from what I've read, that's actually a reasonably 
common experience with non-ecc RAM, the reason they so severely 
discourage attempts to run zfs on non-ecc.  But people still keep doing 
it, and still keep getting burned as a result.

(The smaller, in context, caveat, is that zfs works best with /lots/ of 
RAM, particularly when run on Linux, since it is designed to work with a 
different cache system than Linux uses, and won't work without it, so in 
effect with ZFS on Linux everything must be cached twice, upping the 
memory requirements dramatically.) 


(Tho I should mention, while not on zfs, I've actually had my own 
problems with ECC RAM too.  In my case, the RAM was certified to run at 
speeds faster than it was actually reliable at, such that actually stored 
data, what the ECC protects, was fine, the data was actually getting 
damaged in transit to/from the RAM.  On a lightly loaded system, such as 
one running many memory tests or under normal desktop usage conditions, 
the RAM was generally fine, no problems.  But on a heavily loaded system, 
such as when doing parallel builds (I run gentoo, which builds from 
sources in order to get the higher level of option flexibility that 
comes only when you can toggle build-time options), I'd often have memory 
faults and my builds would fail.

The most common failure, BTW, was on tarball decompression, bunzip2 or 
the like, since the tarballs contained checksums that were verified on 
data decompression, and often they'd fail to verify.

Once I updated the BIOS to one that would let me set the memory speed 
instead of using the speed the modules themselves reported, and I 
declocked the memory just one notch (this was DDR1, IIRC I declocked from 
the PC3200 it was rated, to PC3000 speeds), not only was the memory then 
100% reliable, but I could and did actually reduce the number of wait-
states for various operations, and it was STILL 100% reliable.  It simply 
couldn't handle the raw speeds it was certified to run, is all, tho it 
did handle it well enough, enough of the time, to make the problem far 
more difficult to diagnose and confirm than it would have been had the 
problem appeared at low load as well.

As it happens, I was running reiserfs at the time, and it handled both 
that hardware issue, and a number of others I've had, far better than I'd 
have expected of /any/ filesystem, when the memory feeding it is simply 
not reliable.  Reiserfs metadata, in particular, seems incredibly 
resilient in the face of hardware issues, and I lost far less data than I 
might have expected, tho without checksums and with bad memory, I imagine 
I had occasional undetected bitflip corruption in files here or there, 
but generally nothing I detected.  I still use reiserfs on my spinning 
rust today, but it's not well suited to SSD, which is where I run btrfs.

But the point for this discussion is that just because it's ECC RAM 
doesn't mean you can't have memory related errors, just that if you do, 
they're likely to be different errors, "transit errors", that will tend 
to be undetected by many memory checkers, at least the ones that don't 
tend to run full out memory bandwidth if they're simply checking that 
what was stored in a cell can be read back, unchanged.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-11-05  4:06     ` Duncan
@ 2015-11-05 12:30       ` Austin S Hemmelgarn
  2015-11-06  3:19       ` Zoiled
  1 sibling, 0 replies; 17+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-05 12:30 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3810 bytes --]

On 2015-11-04 23:06, Duncan wrote:
> (Tho I should mention, while not on zfs, I've actually had my own
> problems with ECC RAM too.  In my case, the RAM was certified to run at
> speeds faster than it was actually reliable at, such that actually stored
> data, what the ECC protects, was fine, the data was actually getting
> damaged in transit to/from the RAM.  On a lightly loaded system, such as
> one running many memory tests or under normal desktop usage conditions,
> the RAM was generally fine, no problems.  But on a heavily loaded system,
> such as when doing parallel builds (I run gentoo, which builds from
> sources in ordered to get the higher level of option flexibility that
> comes only when you can toggle build-time options), I'd often have memory
> faults and my builds would fail.
>
> The most common failure, BTW, was on tarball decompression, bunzip2 or
> the like, since the tarballs contained checksums that were verified on
> data decompression, and often they'd fail to verify.
>
> Once I updated the BIOS to one that would let me set the memory speed
> instead of using the speed the modules themselves reported, and I
> declocked the memory just one notch (this was DDR1, IIRC I declocked from
> the PC3200 it was rated, to PC3000 speeds), not only was the memory then
> 100% reliable, but I could and did actually reduce the number of wait-
> states for various operations, and it was STILL 100% reliable.  It simply
> couldn't handle the raw speeds it was certified to run, is all, tho it
> did handle it well enough, enough of the time, to make the problem far
> more difficult to diagnose and confirm than it would have been had the
> problem appeared at low load as well.
>
> As it happens, I was running reiserfs at the time, and it handled both
> that hardware issue, and a number of others I've had, far better than I'd
> have expected of /any/ filesystem, when the memory feeding it is simply
> not reliable.  Reiserfs metadata, in particular, seems incredibly
> resilient in the face of hardware issues, and I lost far less data than I
> might have expected, tho without checksums and with bad memory, I imagine
> I had occasional undetected bitflip corruption in files here or there,
> but generally nothing I detected.  I still use reiserfs on my spinning
> rust today, but it's not well suited to SSD, which is where I run btrfs.
>
> But the point for this discussion is that just because it's ECC RAM
> doesn't mean you can't have memory related errors, just that if you do,
> they're likely to be different errors, "transit errors", that will tend
> to be undetected by many memory checkers, at least the ones that don't
> tend to run full out memory bandwidth if they're simply checking that
> what was stored in a cell can be read back, unchanged.)
I've actually seen similar issues with both ECC and non-ECC memory 
myself.  Any time I'm getting RAM for a system that I can afford to 
over-spec, I get the next higher speed and under-clock it (which in turn 
means I can lower the timing parameters and usually get a faster system 
than if I was running it at the rated speed).  FWIW, I also make a point 
of doing multiple memtest86+ runs (at a minimum, one running single 
core, and one with forced SMP) when I get new RAM, and even have a 
run-level configured on my Gentoo based home server system where it 
boots Xen and fires up twice as many VM's running memtest86+ as I have 
CPU cores, which is usually enough to fully saturate memory bandwidth 
and check for the type of issues you mentioned having above (although 
the BOINC client I run usually does a good job of triggering those kind 
of issues fast, distributed computing apps tend to be memory bound and 
use a lot of memory bandwidth).
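
For anyone without a spare hypervisor handy, a couple of userspace tools can 
saturate memory bandwidth with verification in much the same way (a sketch; 
both need to be installed separately, and the sizes/durations are just 
examples):

# memtester 2048 5                                                   # lock and test 2048 MB for 5 passes; run several instances in parallel
# stress-ng --vm 4 --vm-bytes 75% --vm-method all --verify -t 30m    # 4 VM workers exercising 75% of RAM with verification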


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]


* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-11-05  4:06     ` Duncan
  2015-11-05 12:30       ` Austin S Hemmelgarn
@ 2015-11-06  3:19       ` Zoiled
  1 sibling, 0 replies; 17+ messages in thread
From: Zoiled @ 2015-11-06  3:19 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Duncan wrote:
> Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as
> excerpted:
>
>> On 2015-11-04 13:01, Janos Toth F. wrote:
>>> But the worst part is that there are some ISO files which were
>>> seemingly copied without errors but their external checksums (the one
>>> which I can calculate with md5sum and compare to the one supplied by
>>> the publisher of the ISO file) don't match!
>>> Well... this, I cannot understand.
>>> How could these files become corrupt from a single disk failure? And
>>> more importantly: how could these files be copied without errors? Why
>>> didn't Btrfs gave a read error when the checksums didn't add up?
>> If you can prove that there was a checksum mismatch and BTRFS returned
>> invalid data instead of a read error or going to the other disk, then
>> that is a very serious bug that needs to be fixed.  You need to keep in
>> mind also however that it's completely possible that the data was bad
>> before you wrote it to the filesystem, and if that's the case, there's
>> nothing any filesystem can do to fix it for you.
> As Austin suggests, if btrfs is returning data, and you haven't turned
> off checksumming with nodatasum or nocow, then it's almost certainly
> returning the data it was given to write out in the first place.  Whether
> that data it was given to write out was correct, however, is an
> /entirely/ different matter.
>
> If ISOs are failing their external checksums, then something is going
> on.  Had you verified the external checksums when you first got the
> files?  That is, are you sure the files were correct as downloaded and/or
> ripped?
>
> Where were the ISOs stored between original procurement/validation and
> writing to btrfs?  Is it possible you still have some/all of them on that
> media?  Do they still external-checksum-verify there?
>
> Basically, assuming btrfs checksums are validating, there's three other
> likely possibilities for where the corruption could have come from before
> writing to btrfs.  Either the files were bad as downloaded or otherwise
> procured -- which is why I asked whether you verified them upon receipt
> -- or you have memory that's going bad, or your temporary storage is
> going bad, before the files ever got written to btrfs.
>
> The memory going bad is a particularly worrying possibility,
> considering...
>
>>> Now I am really considering to move from Linux to Windows and from
>>> Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
>>> that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new
>>> motherboard with more native SATA connectors and an extra HDD). That
>>> one seemed to actually do what it promises (abort any read operations
>>> upon checksum errors [which always happens seamlessly on every read]
>>> but look at the redundant data first and seamlessly "self-heal" if
>>> possible). The only thing which made Btrfs to look as a better
>>> alternative was the RAID-5 support. But I recently experienced two
>>> cases of 1 drive failing of 3 and it always tuned out as a smaller or
>>> bigger disaster (completely lost data or inconsistent data).
>> Have you considered looking into ZFS?  I hate to suggest it as an
>> alternative to BTRFS, but it's a much more mature and well tested
>> technology than ReFS, and has many of the same features as BTRFS (and
>> even has the option for triple parity instead of the double you get with
>> RAID6).  If you do consider ZFS, make a point to look at FreeBSD in
>> addition to the Linux version, the BSD one was a much better written
>> port of the original Solaris drivers, and has better performance in many
>> cases (and as much as I hate to admit it, BSD is way more reliable than
>> Linux in most use cases).
>>
>> You should also seriously consider whether the convenience of having a
>> filesystem that fixes internal errors itself with no user intervention
>> is worth the risk of it corrupting your data.  Returning correct data
>> whenever possible is one thing, being 'self-healing' is completely
>> different.  When you start talking about things that automatically fix
>> internal errors without user intervention is when most seasoned system
>> administrators start to get really nervous.  Self correcting systems
>> have just as much chance to make things worse as they do to make things
>> better, and most of them depend on the underlying hardware working
>> correctly to actually provide any guarantee of reliability.
> I too would point you at ZFS, but there's one VERY BIG caveat, and one
> related smaller one!
>
> The people who have a lot of ZFS experience say it's generally quite
> reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*!
> The self-healing works well, *PROVIDED* memory isn't producing errors.
> Absolutely reliable memory is in fact *so* critical, that running ZFS on
> non-ECC memory is severely discouraged as a very real risk to your data.
>
> Which is why the above hints that your memory may be bad are so
> worrying.  Don't even *THINK* about ZFS, particularly its self-healing
> features, if you're not absolutely sure your memory is 100% reliable,
> because apparently, based on the comments I've seen, if it's not, you
> WILL have data loss, likely far worse than btrfs under similar
> circumstances, because when btrfs detects a checksum error it tries
> another copy if it has one (raid1/10 mode), and simply fails the read if
> it doesn't, while apparently, zfs with self-healing activated will give
> you what it thinks is the corrected data, writing it back to repair the
> problem as well, but if memory is bad, it'll be self-damaging instead of
> self-healing, and from what I've read, that's actually a reasonably
> common experience with non-ecc RAM, the reason they so severely
> discourage attempts to run zfs on non-ecc.  But people still keep doing
> it, and still keep getting burned as a result.
>
> (The smaller caveat, in context, is that zfs works best with /lots/ of
> RAM, particularly when run on Linux, since it is designed to work with a
> different cache system than Linux uses, and won't work without it, so in
> effect with ZFS on Linux everything must be cached twice, upping the
> memory requirements dramatically.)
>
>
> (Tho I should mention, while not on zfs, I've actually had my own
> problems with ECC RAM too.  In my case, the RAM was certified to run at
> speeds faster than it was actually reliable at, such that actually stored
> data, what the ECC protects, was fine, the data was actually getting
> damaged in transit to/from the RAM.  On a lightly loaded system, such as
> one running many memory tests or under normal desktop usage conditions,
> the RAM was generally fine, no problems.  But on a heavily loaded system,
> such as when doing parallel builds (I run gentoo, which builds from
> sources in order to get the higher level of option flexibility that
> comes only when you can toggle build-time options), I'd often have memory
> faults and my builds would fail.
>
> The most common failure, BTW, was on tarball decompression, bunzip2 or
> the like, since the tarballs contained checksums that were verified on
> data decompression, and often they'd fail to verify.
>
> Once I updated the BIOS to one that would let me set the memory speed
> instead of using the speed the modules themselves reported, and I
> declocked the memory just one notch (this was DDR1, IIRC I declocked from
> the PC3200 it was rated, to PC3000 speeds), not only was the memory then
> 100% reliable, but I could and did actually reduce the number of wait-
> states for various operations, and it was STILL 100% reliable.  It simply
> couldn't handle the raw speeds it was certified to run, is all, tho it
> did handle it well enough, enough of the time, to make the problem far
> more difficult to diagnose and confirm than it would have been had the
> problem appeared at low load as well.
>
> As it happens, I was running reiserfs at the time, and it handled both
> that hardware issue, and a number of others I've had, far better than I'd
> have expected of /any/ filesystem, when the memory feeding it is simply
> not reliable.  Reiserfs metadata, in particular, seems incredibly
> resilient in the face of hardware issues, and I lost far less data than I
> might have expected, tho without checksums and with bad memory, I imagine
> I had occasional undetected bitflip corruption in files here or there,
> but generally nothing I detected.  I still use reiserfs on my spinning
> rust today, but it's not well suited to SSD, which is where I run btrfs.
>
> But the point for this discussion is that just because it's ECC RAM
> doesn't mean you can't have memory related errors, just that if you do,
> they're likely to be different errors, "transit errors", that will tend
> to be undetected by many memory checkers, at least the ones that don't
> tend to run full out memory bandwidth if they're simply checking that
> what was stored in a cell can be read back, unchanged.)
>
I just want to point out: please don't forget about your hard drive
controller's memory. Your mainboard might have ECC RAM but your
controller might not.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-11-04 18:01 ` Janos Toth F.
  2015-11-04 18:45   ` Austin S Hemmelgarn
@ 2015-11-06  9:03   ` Janos Toth F.
  2015-11-06 10:23     ` Patrik Lundquist
  1 sibling, 1 reply; 17+ messages in thread
From: Janos Toth F. @ 2015-11-06  9:03 UTC (permalink / raw)
  To: Btrfs BTRFS

I created a fresh RAID-5 mode Btrfs on the same 3 disks (including the
faulty one which is still producing numerous random read errors) and
Btrfs now seems to work exactly as I would anticipate.

I copied some data and verified the checksums. The data is readable and
correct despite the constant warning messages in the kernel log about
read errors on the single faulty HDD (the bad behavior is confirmed by
the SMART logs and I tested the drive in a different PC as well...).

I also ran several scrubs, and now they always finish with X corrected
and 0 uncorrected errors. (The errors are supposedly corrected, but the
faulty HDD keeps randomly corrupting the data...)
Last time, I saw uncorrected errors during the scrub and not all of the
data was readable. Rather strange...
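
For the record, the setup and check sequence was roughly the following
(only a sketch; the device names below are placeholders and the exact
invocations may have differed slightly):

# mkfs.btrfs -f -d raid5 -m raid5 /dev/sda /dev/sdb /dev/sdc
# mount /dev/sda /data
... copy the test data, compare md5sums against the known-good values ...
# btrfs scrub start /data
# btrfs scrub status /data
# btrfs device stats /data

"btrfs device stats" shows the per-device read/write/corruption error
counters, which is a quick way to confirm which drive is misbehaving.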

I ran 24 hours of the Gimps/Prime95 Blend stress test without errors on
the problematic machine.
I did update the firmware of the drives, though. (I found an IMPORTANT
update when I went there to download SeaTools, although there was no
changelog to tell me why it was important.) This might have changed the
error handling behavior of the drives...?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs/RAID5 became unmountable after SATA cable fault
  2015-11-06  9:03   ` Janos Toth F.
@ 2015-11-06 10:23     ` Patrik Lundquist
  0 siblings, 0 replies; 17+ messages in thread
From: Patrik Lundquist @ 2015-11-06 10:23 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: Btrfs BTRFS

On 6 November 2015 at 10:03, Janos Toth F. <toth.f.janos@gmail.com> wrote:
>
> I did update the firmware of the drives, though. (I found an IMPORTANT
> update when I went there to download SeaTools, although there was no
> changelog to tell me why it was important.) This might have changed the
> error handling behavior of the drives...?

I've had Seagate drives not reporting errors until I updated the
firmware. They tended to time out instead. Got a shitload of SMART
errors after I updated, but they still didn't handle errors very well
(became unresponsive).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Btrfs/RAID5 became unmountable after SATA cable fault
@ 2016-07-23 13:20 Janos Toth F.
  0 siblings, 0 replies; 17+ messages in thread
From: Janos Toth F. @ 2016-07-23 13:20 UTC (permalink / raw)
  To: Btrfs BTRFS

It seems like I accidentally managed to break my Btrfs/RAID5
filesystem, yet again, in a similar fashion.
This time around, I ran into some random libata driver issue (?)
instead of a faulty hardware part, but the end result is quite similar.

I issued the command (replacing X with the valid letter for every
hard drive in the system):
# echo 1 > /sys/block/sdX/device/queue_depth
and I ended up with read-only filesystems.
I checked dmesg and saw write errors on every disk (not just those in the RAID-5).
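
In practice it was a small loop over all the drives, along these lines
(just a sketch; the exact glob is from memory):

# for q in /sys/block/sd?/device/queue_depth; do echo 1 > "$q"; done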

I tried to reboot immediately, without success. My root filesystem, a
single-disk Btrfs (on an SSD, so it has the "single" profile for both
data and metadata), was unmountable, so the kernel was stuck in a
panic-reboot cycle.
I managed to fix this one by booting from a USB stick and trying
various recovery methods (like mounting it with "-o
clear_cache,nospace_cache,recovery" and running "btrfs rescue
chunk-recover") until everything seemed to be fine (it can now be
mounted read-write without error messages in the kernel log, can be
fully scrubbed without errors reported, it passes "btrfs check", and
files can actually be written and read, etc.).
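
The attempts on the root filesystem looked roughly like this (a sketch
from memory; the device name is a placeholder and the exact order may
have been different):

# mount -o clear_cache,nospace_cache,recovery /dev/sdX /mnt
# umount /mnt
# btrfs rescue chunk-recover /dev/sdX
# btrfs check /dev/sdX
# mount /dev/sdX /mnt && btrfs scrub start /mnt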

Once my system was up and running (well, sort of), I realized my /data
was also unmountable. I tried the same recovery methods on this RAID-5
filesystem but nothing seemed to help (with one exception among the
recovery attempts: the system drive was a small and fast SSD, so
"chunk-recover" was a viable option to try, but this filesystem
consists of huge, slow HDDs - so I ran it as a last resort overnight,
only to find an unresponsive machine in the morning with the process
stuck relatively early on).

I can always mount it read-only and access files on it, seemingly
without errors (I compared some of the contents with backups and they
look good), but as soon as I mount it read-write, all hell breaks
loose: it falls into a read-only state in no time (with some files
seemingly disappearing from the filesystem) and the kernel log starts
getting spammed with various kinds of error messages (including
missing csums, etc.).
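
So a read-only mount is at least good enough to salvage the contents
before wiping; something along these lines works (the backup path is
just a placeholder):

# mount -o ro,nospace_cache /dev/sdb /data
# rsync -a /data/ /path/to/backup/
# umount /data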


After mounting it like this:
# mount /dev/sdb /data -o rw,noatime,nospace_cache
and doing:
# btrfs scrub start /data
the result is:

scrub status for 7d4769d6-2473-4c94-b476-4facce24b425
        scrub started at Sat Jul 23 13:50:55 2016 and was aborted after 00:05:30
        total bytes scrubbed: 18.99GiB with 16 errors
        error details: read=16
        corrected errors: 0, uncorrectable errors: 16, unverified errors: 0

The relevant dmesg output is:

[ 1047.709830] BTRFS info (device sdc): disabling disk space caching
[ 1047.709846] BTRFS: has skinny extents
[ 1047.895818] BTRFS info (device sdc): bdev /dev/sdc errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[ 1047.895835] BTRFS info (device sdc): bdev /dev/sdb errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[ 1065.764352] BTRFS: checking UUID tree
[ 1386.423973] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[ 1386.430922] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[ 1411.738955] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1411.948040] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.040964] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.040980] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.041134] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.042628] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.042748] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1499.222245] BTRFS error (device sdc): parent transid verify failed
on 24432312270848 wanted 585779 found 586143
[ 1499.230264] BTRFS error (device sdc): parent transid verify failed
on 24432312270848 wanted 585779 found 586143
[ 1525.865143] BTRFS error (device sdc): parent transid verify failed
on 24432367730688 wanted 585779 found 586144
[ 1525.880537] BTRFS error (device sdc): parent transid verify failed
on 24432367730688 wanted 585779 found 586144
[ 1552.434209] BTRFS error (device sdc): parent transid verify failed
on 24432415821824 wanted 585781 found 586144
[ 1552.437325] BTRFS error (device sdc): parent transid verify failed
on 24432415821824 wanted 585781 found 586144


btrfs check /dev/sdc results in:

Checking filesystem on /dev/sdc
UUID: 7d4769d6-2473-4c94-b476-4facce24b425
checking extents
parent transid verify failed on 24431859855360 wanted 585941 found 586144
parent transid verify failed on 24431859855360 wanted 585941 found 586144
checksum verify failed on 24431859855360 found 3F0C0853 wanted 165308D5
parent transid verify failed on 24431859855360 wanted 585941 found 586144
Ignoring transid failure
parent transid verify failed on 24432402878464 wanted 585947 found 586144
parent transid verify failed on 24432402878464 wanted 585947 found 586144
checksum verify failed on 24432402878464 found 2018608B wanted 0947600D
parent transid verify failed on 24432402878464 wanted 585947 found 586144
Ignoring transid failure
leaf parent key incorrect 24432402878464
parent transid verify failed on 24431936729088 wanted 585936 found 586145
parent transid verify failed on 24431936729088 wanted 585936 found 586145
checksum verify failed on 24431936729088 found E464923E wanted CD3B92B8
parent transid verify failed on 24431936729088 wanted 585936 found 586145
Ignoring transid failure
leaf parent key incorrect 24431936729088
parent transid verify failed on 24432268873728 wanted 585946 found 586143
parent transid verify failed on 24432268873728 wanted 585946 found 586143
checksum verify failed on 24432268873728 found 7748C8E4 wanted 5E17C862
parent transid verify failed on 24432268873728 wanted 585946 found 586143
Ignoring transid failure
leaf parent key incorrect 24432268873728
parent transid verify failed on 24432268873728 wanted 585946 found 586143
Ignoring transid failure
leaf parent key incorrect 24432268873728
parent transid verify failed on 24432268873728 wanted 585946 found 586143
Ignoring transid failure
leaf parent key incorrect 24432268873728
parent transid verify failed on 24432268873728 wanted 585946 found 586143
Ignoring transid failure
leaf parent key incorrect 24432268873728
parent transid verify failed on 24432112070656 wanted 585944 found 586142
parent transid verify failed on 24432112070656 wanted 585944 found 586142
checksum verify failed on 24432112070656 found 0482AA77 wanted 2DDDAAF1
parent transid verify failed on 24432112070656 wanted 585944 found 586142
Ignoring transid failure
parent transid verify failed on 24432112070656 wanted 585944 found 586142
Ignoring transid failure
parent transid verify failed on 24432112070656 wanted 585944 found 586142
Ignoring transid failure
parent transid verify failed on 24431790055424 wanted 585936 found 586144
parent transid verify failed on 24431790055424 wanted 585936 found 586144
checksum verify failed on 24431790055424 found 3B2164E6 wanted 127E6460
parent transid verify failed on 24431790055424 wanted 585936 found 586144
Ignoring transid failure
leaf parent key incorrect 24431790055424
parent transid verify failed on 24432038637568 wanted 585941 found 586145
parent transid verify failed on 24432038637568 wanted 585941 found 586145
checksum verify failed on 24432038637568 found 7A070E86 wanted 53580E00
parent transid verify failed on 24432038637568 wanted 585941 found 586145
Ignoring transid failure
leaf parent key incorrect 24432038637568
parent transid verify failed on 24432038637568 wanted 585941 found 586145
Ignoring transid failure
leaf parent key incorrect 24432038637568
parent transid verify failed on 24431790055424 wanted 585936 found 586144
Ignoring transid failure
leaf parent key incorrect 24431790055424
bad block 24431790055424
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 24432322764800 wanted 585779 found 586145
parent transid verify failed on 24432322764800 wanted 585779 found 586145
checksum verify failed on 24432322764800 found 2B2DE1E6 wanted 0272E160
parent transid verify failed on 24432322764800 wanted 585779 found 586145
Ignoring transid failure
Segmentation fault


So, it seems like there is no way of recovering from this.
Thus, so far, my experience with Btrfs RAID-5 is that it's anything
but resilient. Something sneezes in the system and it's gone. The only
fix is recreating the filesystem from scratch and restoring the
backups (if any), or maybe recovering some of the content (with a
read-only mount or the "btrfs rescue" tools). But it seems to be much
more prone to becoming unrecoverable than Btrfs filesystems with
"single" data and/or metadata profiles.


This particular accident could possibly be related to the new
space_cache=v2, since I had that enabled when the corruption occurred
and now I am unable to mount with that option (mounting with "-o
clear_cache,space_cache=v2" fails completely; the attempted mounts are
summarized after the log below). So, maybe that experimental feature
played some role in this:

[  906.664963] BTRFS info (device sdc): disabling disk space caching
[  906.664974] BTRFS: has skinny extents
[  907.032573] BTRFS info (device sdc): bdev /dev/sdc errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[  907.032589] BTRFS info (device sdc): bdev /dev/sdb errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[  951.948672] BTRFS info (device sdc): enabling free space tree
[  951.948682] BTRFS info (device sdc): force clearing of disk cache
[  951.948694] BTRFS info (device sdc): using free space tree
[  951.948696] BTRFS: has skinny extents
[  952.125700] BTRFS info (device sdc): bdev /dev/sdc errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[  952.125717] BTRFS info (device sdc): bdev /dev/sdb errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[  970.019994] BTRFS: creating free space tree
[  970.308042] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[  970.316104] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[  988.288037] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[  988.311250] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[  988.311265] ------------[ cut here ]------------
[  988.311276] WARNING: CPU: 0 PID: 1930 at
fs/btrfs/free-space-tree.c:1196
btrfs_create_free_space_tree+0x160/0x498
[  988.311280] BTRFS: Transaction aborted (error -5)
[  988.311285] CPU: 0 PID: 1930 Comm: mount Not tainted 4.6.4-gentoo #6
[  988.311288] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./FM2A75 Pro4, BIOS P2.40 07/11/2013
[  988.311291]  0000000000000286 000000008bf8f073 ffffffff812bdd7d
ffff8800d31af9b8
[  988.311297]  0000000000000000 ffffffff8106919f ffff8800da8652a0
ffff8800d31afa10
[  988.311302]  ffff8800d478e000 ffff880000000000 ffff8800da8652a0
ffff8800da865150
[  988.311307] Call Trace:
[  988.311314]  [<ffffffff812bdd7d>] ? dump_stack+0x46/0x59
[  988.311320]  [<ffffffff8106919f>] ? __warn+0xaf/0xd0
[  988.311324]  [<ffffffff8106921a>] ? warn_slowpath_fmt+0x5a/0x78
[  988.311330]  [<ffffffff8126d898>] ? btrfs_create_free_space_tree+0x160/0x498
[  988.311334]  [<ffffffff811f4fe2>] ? open_ctree+0x1d82/0x26b0
[  988.311340]  [<ffffffff811cb497>] ? btrfs_mount+0xca7/0xde0
[  988.311346]  [<ffffffff810fa289>] ? pcpu_alloc_area+0x219/0x3e0
[  988.311350]  [<ffffffff810fadcc>] ? pcpu_alloc+0x38c/0x690
[  988.311356]  [<ffffffff8112e4da>] ? mount_fs+0xa/0x88
[  988.311362]  [<ffffffff81147e86>] ? vfs_kern_mount+0x56/0x100
[  988.311367]  [<ffffffff811cab38>] ? btrfs_mount+0x348/0xde0
[  988.311371]  [<ffffffff811337ca>] ? terminate_walk+0x8a/0xf0
[  988.311375]  [<ffffffff810fa289>] ? pcpu_alloc_area+0x219/0x3e0
[  988.311379]  [<ffffffff810fa065>] ? pcpu_next_unpop+0x35/0x40
[  988.311383]  [<ffffffff810fadcc>] ? pcpu_alloc+0x38c/0x690
[  988.311388]  [<ffffffff8112e4da>] ? mount_fs+0xa/0x88
[  988.311393]  [<ffffffff81147e86>] ? vfs_kern_mount+0x56/0x100
[  988.311397]  [<ffffffff811491ed>] ? do_mount+0x1fd/0xce0
[  988.311400]  [<ffffffff8113f8fb>] ? dput+0xd3/0x248
[  988.311405]  [<ffffffff81120d38>] ? __kmalloc_track_caller+0x20/0xe8
[  988.311408]  [<ffffffff810f7318>] ? memdup_user+0x38/0x60
[  988.311412]  [<ffffffff81149fe0>] ? SyS_mount+0x80/0xc8
[  988.311417]  [<ffffffff816f379b>] ? entry_SYSCALL_64_fastpath+0x13/0x8f
[  988.311420] ---[ end trace a3cc21d9a0eba35e ]---
[  988.311425] BTRFS: error (device sdc) in
btrfs_create_free_space_tree:1196: errno=-5 IO failure
[  988.311463] BTRFS: failed to create free space tree -5
[  988.311475] BTRFS error (device sdc): commit super ret -30
[  988.311561] BTRFS error (device sdc): cleaner transaction attach returned -30
[  988.350206] BTRFS: open_ctree failed
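
To be explicit, the mount attempts described above were essentially
these (from memory):

# mount -o rw,noatime,nospace_cache /dev/sdb /data
  (mounts, but scrub then reports uncorrectable errors)
# mount -o clear_cache,space_cache=v2 /dev/sdb /data
  (tries to recreate the free space tree and hits the abort above)
# mount -o ro /dev/sdb /data
  (read-only access still works)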


Any ideas before I wipe the filesystem?

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-07-23 13:20 UTC | newest]

Thread overview: 17+ messages
2015-10-22  1:18 Btrfs/RAID5 became unmountable after SATA cable fault János Tóth F.
  -- strict thread matches above, loose matches on Subject: below --
2016-07-23 13:20 Janos Toth F.
     [not found] <g7loe3red3ksp64hmb0vsbs2.1445476794489@email.android.com>
2015-11-04 18:01 ` Janos Toth F.
2015-11-04 18:45   ` Austin S Hemmelgarn
2015-11-05  4:06     ` Duncan
2015-11-05 12:30       ` Austin S Hemmelgarn
2015-11-06  3:19       ` Zoiled
2015-11-06  9:03   ` Janos Toth F.
2015-11-06 10:23     ` Patrik Lundquist
2015-10-19  8:39 Janos Toth F.
2015-10-20 14:59 ` Duncan
2015-10-21 16:09 ` Janos Toth F.
2015-10-21 16:44   ` ronnie sahlberg
2015-10-21 17:42   ` ronnie sahlberg
2015-10-21 18:40     ` Janos Toth F.
2015-10-21 17:46   ` Janos Toth F.
2015-10-21 20:26   ` Chris Murphy
