From: Gaardiolor <gaardiolor@gmail.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Corrupted data, failed drive(s)
Date: Fri, 4 Jun 2021 11:27:44 +0200
Message-ID: <5272b826-ec8e-f3a3-6fc1-bb863b698c83@gmail.com>
In-Reply-To: <CAJCQCtRkZPqQ_Rfx1Kk6rXZ_GyxDcLymdFjJkS12zZZ0mep3vQ@mail.gmail.com>
Hi Chris,
Thanks for your reply. I just noticed I forgot to mention that I'm running
kernel 5.12.8-300.fc34.x86_64 with btrfs-progs-5.12.1-1.fc34.x86_64.
>> I have a couple of questions:
>>
>> 1) Unpacking some .tar.gz files from /storage resulted in files with
>> weird names, and the data was unusable. But it's raid1, so why is my
>> data corrupt? I've read that btrfs verifies checksums on read.
>
> It suggests an additional problem, but we kinda need full dmesg to
> figure it out I think. If it were just one device having either
> partial or full failure, you'd get a bunch of messages indicating
> those failures or csum mismatches, as well as fixup attempts which then
> either succeed or fail, but no EIO. That there's EIO suggests both
> copies are somehow bad, so it could be two independent problems. That
> there's four drives with a small number of reported corruptions could
> mean some common problem affecting all of them: cabling or power
> supply.
>
The first attempt at unpacking the .tar.gz worked. A second attempt on
the same .tar.gz now results in:
gzip: stdin: Input/output error
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
dmesg:
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247860736 csum 0x545eef4e expected csum 0x2cd08f83
mirror 1
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sdb errs:
wr 0, rd 0, flush 0, corrupt 323, gen 0
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x79b50174 expected csum 0xa744e4f5
mirror 1
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sdb errs:
wr 0, rd 0, flush 0, corrupt 324, gen 0
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x79b50174 expected csum 0xa744e4f5
mirror 2
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sda errs:
wr 0, rd 0, flush 0, corrupt 409, gen 0
[Fri Jun 4 09:53:03 2021] repair_io_failure: 326 callbacks suppressed
[Fri Jun 4 09:53:03 2021] BTRFS info (device sdc): read error
corrected: ino 5114941 off 5247860736 (dev /dev/sdb sector 6674359360)
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x79b50174 expected csum 0xa744e4f5
mirror 1
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sdb errs:
wr 0, rd 0, flush 0, corrupt 325, gen 0
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x81568248 expected csum 0xa744e4f5
mirror 2
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sda errs:
wr 0, rd 0, flush 0, corrupt 410, gen 0
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x81568248 expected csum 0xa744e4f5
mirror 1
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sdb errs:
wr 0, rd 0, flush 0, corrupt 326, gen 0
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x81568248 expected csum 0xa744e4f5
mirror 2
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sda errs:
wr 0, rd 0, flush 0, corrupt 411, gen 0
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x79b50174 expected csum 0xa744e4f5
mirror 1
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sdb errs:
wr 0, rd 0, flush 0, corrupt 327, gen 0
[Fri Jun 4 09:53:03 2021] BTRFS warning (device sdc): csum failed root
5 ino 5114941 off 5247864832 csum 0x81568248 expected csum 0xa744e4f5
mirror 2
[Fri Jun 4 09:53:03 2021] BTRFS error (device sdc): bdev /dev/sda errs:
wr 0, rd 0, flush 0, corrupt 412, gen 0
No weird filenames this time, though, and no sdd errors. I also see these
errors in /var/log/messages (which is on a different filesystem), but I
don't see any "csum failed" errors in yesterday's messages log, when the
strange filenames appeared.
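For reference, the cumulative per-device error counters (the same
wr/rd/flush/corrupt/gen numbers shown in the dmesg lines above, i.e. the
corruption_errs I mentioned earlier) can be dumped with:

  btrfs device stats /storage

I can post the full output if that's useful.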
>
>> 2) Are all my 4 drives faulty because of the corruption_errs? If so, 4
>> faulty drives is somewhat unusual. Any other possibilities?
>> 3) Given that
>> - I can't 'btrfs device remove' the device
>> - I do not have a free SATA port
>> - I'd prefer a method that doesn't unnecessarily take a very long time
>
> You really don't want to remove a drive unless it's on fire. With partial
> failures, you're better off leaving it in until it's ready to be replaced.
> And even then it officially stays in until the replace completes. Btrfs
> is AOK with partially failing drives; it can unambiguously determine
> when any block is untrustworthy. But the partial failure case also
> means possibly quite a lot of *good* blocks that you might need in
> order to recover from this situation, so you don't want to throw the
> baby out with the bath water, so to speak.
>
I think we're mixing up 'btrfs device remove' with physically removing the
drive. I did not plan on physically removing it, but I might be forced to,
because the graceful 'btrfs device remove' results in an I/O error. Or is
there a better way? Can 'btrfs device remove' ignore errors and continue
with the good blocks?
My guess was that 'btrfs device remove' would take a very long time; I'd
have a new drive before it finished. I had enough free space available to
remove this device without adding a new drive first. At the time I didn't
realize the other 3 drives had issues as well, though.
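To be concrete, what I ran was along these lines (a sketch; devid 4 is
sdd, /storage is the mount point):

  # confirm which devid maps to which drive
  btrfs filesystem show /storage
  # evict the drive's data onto the remaining devices
  btrfs device remove /dev/sdd /storage

and the remove is the command that aborts with the I/O error mentioned
above.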
>>
>> What's the best way to migrate to a different device ? I'm guessing,
>> after doing some reading:
>> - shutdown
>> - physically remove faulty disk
>> - boot
>> - verify /dev/sdd is missing, and that I've removed the correct disk
>> - shutdown
>> - connect new disk, it will also be /dev/sdd, because I have no other
>> free SATA port
>> - boot
>> - check that the new disk is /dev/sdd
>> - mount -o degraded /dev/sda /storage
>> - btrfs replace start 4 /dev/sdd /storage
>> - btrfs balance /storage
>
> You can, but again this throws away quite a lot of good blocks, both
> data and metadata. This is only a good idea if all three of the other
> drives are perfect, and there's some evidence that isn't the case.
>
> I'd say your top priority is to freshen the backups of the most important
> things you cannot stand to lose from this file system in case it gets
> much worse. Then try to figure out what's wrong and fix that. The
> direct causes need to be discovered and fixed, and the above sequence
> doesn't identify multiple problems; it just assumes it's this one
> drive. And the available evidence suggests more than one thing is
> going on. If this is cable or dirty/noisy power supply related, the
> recovery process itself can be negatively affected and make things
> worse (more corruptions).
>
> I think a better approach, as finicky as they can be, is a USB SATA
> enclosure connected to an externally powered hub. Really you want a
> fifth SATA port, even if it's eSATA. But barring that, I think it's
> less risky to keep all four drives together, to do the replacement.
>
Yes, a fifth SATA port is a good idea. I did plan for this too; I actually
have 6 SATA ports. What I didn't realize, though, is that 2 of them are
disabled because I installed 2 NVMe drives. Should have read the fm :)
Apart from the general problem I might have (the PSU, for example), I
might be able to hook up the new drive temporarily via USB3? But what
would be the approach then? I'd still need to 'btrfs device remove' sdd,
right, to evict its data gracefully and replace it with the new drive?
But 'btrfs device remove' results in an I/O error.
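If 'replace' is the better tool here, this is a rough sketch of what I
have in mind with the new drive on USB3, assuming it shows up as /dev/sde
(the device name is just a guess):

  # replace devid 4 (sdd) with the USB-attached drive, leaving all four
  # existing drives connected so their good copies stay available
  btrfs replace start 4 /dev/sde /storage
  btrfs replace status /storage
  # afterwards, verify/repair the remaining copies
  btrfs scrub start /storage

Does that make sense, or would the device remove still be needed first?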
Turns out my drives aren't in great shape, though. Two have >45k power-on
hours; the other two have >12k, which should be fairly OK, but they are
SMR. It just might be that they are all failing. Any idea how plausible
that scenario is?
sda
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1ER164
TBR: 1350.26 TB
TBW: 30.4776 TB
Power_On_Hours 47582
sdb
Model Family: Seagate BarraCuda 3.5 (SMR)
Device Model: ST4000DM004-2CV104
TBR: 6.71538 TB
TBW: 32.5086 TB
Power_On_Hours 12079
sdc
Model Family: Seagate BarraCuda 3.5 (SMR)
Device Model: ST4000DM004-2CV104
TBR: 9.48872 TB
TBW: 34.1534 TB
Power_On_Hours 12079
sdd
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1ER164
TBR: 863.043 TB
TBW: 28.6935 TB
Power_On_Hours 47583
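(For reference, the numbers above were pulled from smartctl, roughly like
this:)

  for d in /dev/sd{a,b,c,d}; do
      smartctl -a "$d" | grep -E 'Model Family|Device Model|Power_On_Hours'
  done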
Thanks