Uncorrectable errors with RAID1

From: Christoph Groth <christoph@grothesque.org>
To: linux-btrfs@vger.kernel.org
Subject: Uncorrectable errors with RAID1
Date: Mon, 16 Jan 2017 12:10:30 +0100
Message-ID: <87o9z7dzvd.fsf@grothesque.org>

Hi,

I’ve been using a btrfs RAID1 of two hard disks since early 2012 
on my home server.  The machine has been working well overall, but 
recently some problems with the file system surfaced.  Since I 
have backups, I am not worried about the data, but I am posting 
here to better understand what happened.  My case may also be 
useful in some way for btrfs development.

First some information about the system:

root@mim:~# uname -a
Linux mim 4.6.0-1-amd64 #1 SMP Debian 4.6.3-1 (2016-07-04) x86_64 GNU/Linux
root@mim:~# btrfs --version
btrfs-progs v4.7.3
root@mim:~# btrfs fi show
Label: none  uuid: 2da00153-f9ea-4d6c-a6cc-10c913d22686
	Total devices 2 FS bytes used 345.97GiB
	devid    1 size 465.29GiB used 420.06GiB path /dev/sda2
	devid    2 size 465.29GiB used 420.04GiB path /dev/sdb2

root@mim:~# btrfs fi df /
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B
root@mim:~# dmesg | grep -i btrfs
[    4.165859] Btrfs loaded
[    4.481712] BTRFS: device fsid 2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 1 transid 2075354 /dev/sda2
[    4.482025] BTRFS: device fsid 2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 2 transid 2075354 /dev/sdb2
[    4.521090] BTRFS info (device sdb2): disk space caching is enabled
[    4.628506] BTRFS info (device sdb2): bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[    4.628521] BTRFS info (device sdb2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[   18.315694] BTRFS info (device sdb2): disk space caching is enabled
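
The per-device error counters in the dmesg output above can be pulled out with a short script.  This is only a sketch: the regular expression is written against the two "errs:" lines quoted in this message, and the names DMESG, COUNTERS, FIELDS and parse_error_counters are made up for the illustration.

```python
import re

# The two counter lines from the dmesg excerpt above, joined onto single lines.
DMESG = """\
[    4.628506] BTRFS info (device sdb2): bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[    4.628521] BTRFS info (device sdb2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
"""

# Pattern written against the lines above; other kernel versions may print them differently.
COUNTERS = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
FIELDS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_error_counters(text):
    """Return {device: {counter name: value}} for every 'errs:' line in text."""
    result = {}
    for match in COUNTERS.finditer(text):
        dev, *values = match.groups()
        result[dev] = dict(zip(FIELDS, map(int, values)))
    return result


counters = parse_error_counters(DMESG)
for dev, errs in sorted(counters.items()):
    # Show only the counters that are not zero.
    nonzero = {name: value for name, value in errs.items() if value}
    print(dev, nonzero or "no errors recorded")
```

For the excerpt above, both devices carry the same "corrupt" count and all other counters are zero.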

The disks themselves have been spinning for almost 5 years now, 
but their SMART health is still fully satisfactory.

I noticed that something was wrong because printing stopped 
working.  So I ran a scrub, which detected 0 "correctable" errors 
and 6 "uncorrectable" errors.  The relevant bits from kern.log are:

Jan 11 11:05:56 mim kernel: [159873.938579] BTRFS warning (device sdb2): checksum error at logical 180829634560 on dev /dev/sdb2, sector 353143968, root 5, inode 10014144, offset 221184, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
Jan 11 11:05:57 mim kernel: [159874.857132] BTRFS warning (device sdb2): checksum error at logical 180829634560 on dev /dev/sda2, sector 353182880, root 5, inode 10014144, offset 221184, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
Jan 11 11:28:42 mim kernel: [161240.083721] BTRFS warning (device sdb2): checksum error at logical 260254629888 on dev /dev/sda2, sector 508309824, root 5, inode 9990924, offset 6676480, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:28:42 mim kernel: [161240.235837] BTRFS warning (device sdb2): checksum error at logical 260254638080 on dev /dev/sda2, sector 508309840, root 5, inode 9990924, offset 6684672, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:37:21 mim kernel: [161759.725120] BTRFS warning (device sdb2): checksum error at logical 260254629888 on dev /dev/sdb2, sector 508270912, root 5, inode 9990924, offset 6676480, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:37:21 mim kernel: [161759.750251] BTRFS warning (device sdb2): checksum error at logical 260254638080 on dev /dev/sdb2, sector 508270928, root 5, inode 9990924, offset 6684672, length 4096, links 1 (path: var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
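
The scrub warnings above can be grouped by logical address to check whether both mirrors report the same blocks.  The following is a minimal sketch that assumes the warning format quoted in this message; the names LOG, PATTERN and group_by_logical are illustrative, and the log lines are shortened to the fields the script uses.

```python
import re
from collections import defaultdict

# Scrub warnings from the kern.log excerpt above, trimmed to the relevant fields.
LOG = """\
checksum error at logical 180829634560 on dev /dev/sdb2, sector 353143968, root 5, inode 10014144
checksum error at logical 180829634560 on dev /dev/sda2, sector 353182880, root 5, inode 10014144
checksum error at logical 260254629888 on dev /dev/sda2, sector 508309824, root 5, inode 9990924
checksum error at logical 260254638080 on dev /dev/sda2, sector 508309840, root 5, inode 9990924
checksum error at logical 260254629888 on dev /dev/sdb2, sector 508270912, root 5, inode 9990924
checksum error at logical 260254638080 on dev /dev/sdb2, sector 508270928, root 5, inode 9990924
"""

# Pattern written against the warning text above; it is not an official log format.
PATTERN = re.compile(r"checksum error at logical (\d+) on dev (\S+), sector (\d+)")


def group_by_logical(log_text):
    """Map each logical address to {device: sector} for every checksum error."""
    errors = defaultdict(dict)
    for match in PATTERN.finditer(log_text):
        logical, dev, sector = match.groups()
        errors[int(logical)][dev] = int(sector)
    return dict(errors)


errors = group_by_logical(LOG)
for logical, devs in sorted(errors.items()):
    # A logical address reported on both devices means both copies failed their checksum.
    print(logical, sorted(devs))
```

Each of the three logical addresses shows up once per device, and the sector numbers on /dev/sda2 and /dev/sdb2 differ by the same constant in all three cases.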

As you can see, each disk has the same three errors, and there are 
no other errors.  Random bad blocks cannot explain this situation. 
I asked on #btrfs and someone suggested that these errors are 
likely due to RAM problems.  This may indeed be the case, since 
the machine has no ECC.  I managed to fix these errors by 
replacing the broken files with good copies.  Scrubbing shows no 
errors now:

root@mim:~# btrfs scrub status /
scrub status for 2da00153-f9ea-4d6c-a6cc-10c913d22686
	scrub started at Sat Jan 14 12:52:03 2017 and finished after 01:49:10
	total bytes scrubbed: 699.17GiB with 0 errors

However, there are further problems.  When trying to archive the 
full filesystem I noticed that some files/directories cannot be 
read.  (The problem is confined to a ".git" directory that I 
don’t need.)  Any attempt to read the broken files (or to delete 
them) fails:

$ du -sh .git
du: cannot access '.git/objects/28/ea2aae3fe57ab4328adaa8b79f3c1cf005dd8d': No such file or directory
du: cannot access '.git/objects/28/fd95a5e9d08b6684819ce6e3d39d99e2ecccd5': Stale file handle
du: cannot access '.git/objects/28/52e887ed436ed2c549b20d4f389589b7b58e09': Stale file handle
du: cannot access '.git/objects/info': Stale file handle
du: cannot access '.git/objects/pack': Stale file handle

During the above command the following lines were added to 
kern.log:

Jan 16 09:41:34 mim kernel: [132206.957566] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.957924] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.958505] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.958971] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.959534] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.959874] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.960523] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
Jan 16 09:41:34 mim kernel: [132206.960943] BTRFS critical (device sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15

So I tried to repair the file system by running 
"btrfs check --repair", but it aborts with a failed assertion:

(initramfs) btrfs --version
btrfs-progs v4.7.3
(initramfs) btrfs check --repair /dev/sda2
UUID: ...
checking extents
incorrect offsets 2527 2543
items overlap, can't fix
cmds-check.c:4297: fix_item_offset: Assertion `ret` failed.
btrfs[0x41a8b4]
btrfs[0x41a8db]
btrfs[0x42428b]
btrfs[0x424f83]
btrfs[0x4259cd]
btrfs(cmd_check+0x1111)[0x427d6d]
btrfs(main+0x12f)[0x40a341]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fd98859d2b1]
btrfs(_start+0x2a)[0x40a37a]

I now have the following questions:

* So scrubbing is not enough to check the health of a btrfs file 
  system?  Is it also necessary to read all the files?

* Any ideas what could have caused the "stale file handle" errors? 
  Is there any way to fix them?  Of course RAM errors can in 
  principle have _any_ consequences, but I would have hoped that 
  even without ECC RAM it’s practically impossible to end up with 
  an unrepairable file system.  Perhaps I simply had very bad 
  luck.

* I believe that btrfs RAID1 is considered reasonably safe for 
  production use by now.  I want to replace that home server with 
  a new machine (still without ECC).  Is it a good idea to use 
  btrfs for the main file system?  I would certainly hope so! :-)
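
Regarding the first question, reading every file can be scripted.  The sketch below walks a directory tree, reads each regular file to the end, and collects the errors the kernel returns (such as the "Stale file handle" and "No such file or directory" errors that du reported).  The function name find_unreadable is made up for this illustration, and the sketch only surfaces read errors; it is not a substitute for scrub or "btrfs check".

```python
import os
import stat


def find_unreadable(root):
    """Walk root and return (path, error message) pairs for entries that cannot be read."""
    failures = []

    def on_walk_error(err):
        # os.walk calls this when a directory itself cannot be listed.
        failures.append((err.filename, err.strerror))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks, FIFOs, sockets and devices; only regular files are read.
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue
                with open(path, "rb") as handle:
                    # Read in 1 MiB chunks so that every data block is fetched.
                    while handle.read(1 << 20):
                        pass
            except OSError as err:
                failures.append((path, err.strerror))
    return failures


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the directory to check as the first argument.
    for path, message in find_unreadable(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{path}: {message}")
```

Run against the damaged ".git" directory, a script like this would be expected to list the same paths that du could not access, since both rely on the same system calls.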

Thanks for your time,
Christoph
