linux-btrfs.vger.kernel.org archive mirror
* corruption, bad block, input/output errors - do i run --repair?
@ 2014-11-07 14:33 Matt McKinnon
  2014-11-08  4:04 ` Duncan
  2014-11-10  3:11 ` Qu Wenruo
  0 siblings, 2 replies; 3+ messages in thread
From: Matt McKinnon @ 2014-11-07 14:33 UTC (permalink / raw)
  To: linux-btrfs

Hi All,

I'm running into some corruption and I wanted to seek out advice on 
whether or not to run btrfs check --repair, or if I should fall back to 
my backup file server, or both.

The system is mountable, and usable.

# uname -a
Linux cbmm-fs 3.17.2-custom #1 SMP Thu Oct 30 14:09:57 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
Btrfs v3.14.2
# btrfs fi show
Label: none  uuid: 30c15060-8fb4-4926-87d4-f7d08c3033c5
	Total devices 1 FS bytes used 58.92TiB
	devid    1 size 76.40TiB used 59.05TiB path /dev/sda1

# btrfs fi df /home
Data, single: total=58.75TiB, used=58.75TiB
System, DUP: total=32.00MiB, used=2.66MiB
System, single: total=4.00MiB, used=3.68MiB
Metadata, DUP: total=119.00GiB, used=116.63GiB
Metadata, single: total=64.01GiB, used=57.68GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


I did run into some RO snapshot corruption which caused me to run btrfs 
check:

parent transid verify failed on 20809493159936 wanted 4486137218058286914 found 390978
parent transid verify failed on 20809493159936 wanted 4486137218058286914 found 390978
Ignoring transid failure
Checking filesystem on /dev/sda1
UUID: 30c15060-8fb4-4926-87d4-f7d08c3033c5
checking extents
bad block 69290357067776
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots

...

"dir isize wrong" 1 error
"errors 500, file extent discount, nbytes wrong" 14 errors
"errors 2001, no inode item, link count wrong" 257302 errors

...

found 185063071745 bytes used err is 1
total csum bytes: 8428
total tree bytes: 1889284096
total fs tree bytes: 962678784
total extent tree bytes: 159297536
btree space waste bytes: 340014684
file data blocks allocated: 57344
  referenced 57344
Btrfs v3.14.2

Output of a scrub:

ERROR: scrubbing /home failed for device id 1 (Input/output error)
scrub canceled for 30c15060-8fb4-4926-87d4-f7d08c3033c5
         scrub started at Mon Nov  3 06:43:58 2014 and was aborted after 7613 seconds
         data_extents_scrubbed: 248507555
         tree_extents_scrubbed: 10870729
         data_bytes_scrubbed: 15375990317056
         tree_bytes_scrubbed: 44526505984
         read_errors: 0
         csum_errors: 0
         verify_errors: 0
         no_csum: 15712
         csum_discards: 988018
         super_errors: 0
         malloc_errors: 0
         uncorrectable_errors: 0
         unverified_errors: 0
         corrected_errors: 0
         last_physical: 15425663205376

Output of a balance:

ERROR: error during balancing '/home' - Input/output error
There may be more info in syslog - try dmesg | tail

[501087.506642] ------------[ cut here ]------------
[501087.543971] WARNING: CPU: 5 PID: 31885 at fs/btrfs/relocation.c:925 build_backref_tree+0x11f0/0x1230 [btrfs]()
[501087.543991] Modules linked in: ipmi_devintf(E) autofs4(E) sb_edac(E) edac_core(E) joydev(E) mei_me(E) mei(E) lpc_ich(E) ioatdma(E) ipmi_si(E) wmi(E) mac_hid(E) bnep(E) rfcomm(E) bluetooth(E) lp(E) parport(E) nfsd(E) nfs_acl(E) auth_rpcgss(E) nfs(E) fscache(E) lockd(E) sunrpc(E) ses(E) enclosure(E) hid_generic(E) ahci(E) libahci(E) usbhid(E) hid(E) igb(E) dca(E) i2c_algo_bit(E) ptp(E) pps_core(E) megaraid_sas(E) btrfs(E) raid6_pq(E) xor(E) libcrc32c(E)
[501087.543995] CPU: 5 PID: 31885 Comm: btrfs Tainted: G      D     E 3.17.2-custom #1
[501087.543997] Hardware name: Supermicro X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0a 12/27/2013
[501087.543999]  000000000000039d ffff88000eadb808 ffffffff8176733c 0000000000000282
[501087.544001]  0000000000000000 ffff88000eadb848 ffffffff8107163c 0000000000001000
[501087.544003]  ffff8801d0d9acf0 ffff880497c70380 0000000000000001 0000000000000001
[501087.544004] Call Trace:
[501087.544014]  [<ffffffff8176733c>] dump_stack+0x46/0x58
[501087.544022]  [<ffffffff8107163c>] warn_slowpath_common+0x8c/0xc0
[501087.544024]  [<ffffffff8107168a>] warn_slowpath_null+0x1a/0x20
[501087.544039]  [<ffffffffa00b4020>] build_backref_tree+0x11f0/0x1230 [btrfs]
[501087.544052]  [<ffffffffa00b4331>] relocate_tree_blocks+0x2d1/0x690 [btrfs]
[501087.544060]  [<ffffffff811c1609>] ? kmem_cache_alloc_trace+0x39/0x1f0
[501087.544072]  [<ffffffffa00b54a2>] relocate_block_group+0x202/0x5f0 [btrfs]
[501087.544083]  [<ffffffffa00b5a40>] btrfs_relocate_block_group+0x1b0/0x2d0 [btrfs]
[501087.544098]  [<ffffffffa0088cf5>] btrfs_relocate_chunk.isra.62+0x75/0x760 [btrfs]
[501087.544111]  [<ffffffffa0084d86>] ? release_extent_buffer+0x36/0xe0 [btrfs]
[501087.544124]  [<ffffffffa0085281>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[501087.544136]  [<ffffffffa008d7db>] btrfs_balance+0x8ab/0xf50 [btrfs]
[501087.544150]  [<ffffffffa00985ac>] btrfs_ioctl_balance+0x1cc/0x530 [btrfs]
[501087.544156]  [<ffffffff811786eb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
[501087.544168]  [<ffffffffa009aa82>] btrfs_ioctl+0x562/0x1f00 [btrfs]
[501087.544173]  [<ffffffff811e9c0b>] ? putname+0x2b/0x40
[501087.544176]  [<ffffffff811ef193>] ? user_path_at_empty+0x63/0xa0
[501087.544183]  [<ffffffff8105f59c>] ? __do_page_fault+0x28c/0x550
[501087.544187]  [<ffffffff8112528c>] ? acct_account_cputime+0x1c/0x20
[501087.544189]  [<ffffffff811f1106>] do_vfs_ioctl+0x86/0x4f0
[501087.544192]  [<ffffffff810244a5>] ? syscall_trace_enter+0x165/0x280
[501087.544193]  [<ffffffff811f1601>] SyS_ioctl+0x91/0xb0
[501087.544198]  [<ffffffff8176fc7f>] tracesys+0xe1/0xe6
[501087.544199] ---[ end trace e2a77238816656f5 ]---
[501087.579519] parent transid verify failed on 20809493159936 wanted 4486137218058286914 found 390978


I have been sending incremental snapshot dumps over to an identical file 
server as backups.  Everything checks out OK there.  Do I try to run 
check with --repair first, and fall back to my backup if that fails?

-Matt


* Re: corruption, bad block, input/output errors - do i run --repair?
  2014-11-07 14:33 corruption, bad block, input/output errors - do i run --repair? Matt McKinnon
@ 2014-11-08  4:04 ` Duncan
  2014-11-10  3:11 ` Qu Wenruo
  1 sibling, 0 replies; 3+ messages in thread
From: Duncan @ 2014-11-08  4:04 UTC (permalink / raw)
  To: linux-btrfs

Matt McKinnon posted on Fri, 07 Nov 2014 09:33:44 -0500 as excerpted:

> I'm running into some corruption and I wanted to seek out advice on
> whether or not to run btrfs check --repair, or if I should fall back to
> my backup file server, or both.
> 
> The system is mountable, and usable.
> 
> # uname -a
> Linux cbmm-fs 3.17.2-custom #1 SMP Thu Oct 30 14:09:57 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
> 
> # btrfs --version
> Btrfs v3.14.2

> I did run into some RO snapshot corruption [...]

> I have been sending incremental snapshot dumps over to an identical file
> server as backups.  Everything checks out OK there.  Do I try to run
> check with --repair first, and fall back to my backup if that fails?


It looks like you already know, from the list or elsewhere, about the 
early 3.17 series RO-snapshot corruption bug, which you appear to have 
hit, but haven't been following the list closely enough to have noted 
the fix.

Kernel 3.17.2, which you have, fixed the bug that caused the problem.  The 
bug only affected earlier 3.17 series kernels, and only filesystems with 
read-only snapshots.

But the kernel fix only prevents new damage.  It does not repair 
filesystems (apparently including yours) that were already corrupted by 
the bug.

The fix for existing damage is *ONLY* in btrfs-progs 3.17 or newer.  With 
it, running btrfs check --repair should fix existing damage.

*HOWEVER*, attempting to repair the damage with btrfs check --repair with 
btrfs-progs versions PRIOR TO 3.17 WILL MAKE IT WORSE, basically 
unrecoverable using existing tools.

So for this specific damage, running btrfs check --repair from btrfs-progs 
3.17 or newer should fix it.  Do NOT attempt to repair it with earlier 
btrfs-progs versions.
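
As a minimal sketch of that rule, the check below refuses to consider 
--repair unless the reported btrfs-progs version is at least 3.17.  The 
function names and the MIN_REPAIR_VERSION constant are made up for 
illustration, and the parser assumes the version string contains 
"v<digits>" as in the "Btrfs v3.14.2" output quoted above.  It only 
inspects text and never touches a filesystem.

```python
import re

# Threshold taken from the advice in this thread: only btrfs-progs 3.17 or
# newer should be used to repair this particular damage.
MIN_REPAIR_VERSION = (3, 17)


def parse_progs_version(version_output):
    """Extract a numeric tuple from text such as 'Btrfs v3.14.2'."""
    match = re.search(r"v(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def repair_is_safe(version_output, minimum=MIN_REPAIR_VERSION):
    """Return True only if the reported version is at least `minimum`."""
    return parse_progs_version(version_output) >= minimum


print(repair_is_safe("Btrfs v3.14.2"))  # False: the userspace in this thread
print(repair_is_safe("Btrfs v3.17"))    # True
```

Comparing tuples of integers rather than raw strings avoids ordering 
mistakes such as treating "3.9" as newer than "3.17".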


More generally, as discussed here last week in the "Compatibility matrix 
kernels/tools" thread, any recent kernel should in general work with any 
recent userspace.  Keeping reasonably current on kernels is strongly 
recommended, as older ones have now-fixed bugs that may trigger damage in 
some cases.  Keeping userspace current isn't generally as vital, AS LONG 
AS you're primarily running "online" tools (in general those that work 
with mounted filesystems), which normally do their work via kernel calls 
anyway.  In that case, the most you will be missing is some of the newer 
features.

HOWEVER, once you get into the offline userspace tools like btrfs check 
and btrfs restore, where the functionality is either fixing damaged 
filesystems or retrieving data off of them while unmounted, a current 
btrfs userspace becomes MUCH more important, since then it's the 
userspace code working on the filesystem.

Which is what we see here.  A kernel bug started creating damage in 
certain corner cases but was relatively rapidly fixed.  However, that fix 
only kept it from creating further damage; it didn't do anything to 
correct existing filesystem damage of that type.  That's where the 
userspace fix comes in.  Only the newest btrfs-progs (userspace) has the 
fixes to correct the existing damage properly.  Older versions, including 
the 3.14.2 you're running, can detect that something isn't right, but 
they don't understand the problem, and if used to try to fix it they 
would instead make it worse.


So... applying that to your specific case:

Kernel 3.17.2 has the kernel fix and won't cause more damage.

Your 3.14.2 userspace is too old to fix the existing damage, however.

Since you have been wise enough to have backups, you are thus left with 
two choices:

1) Upgrade the userspace and fix the existing damage with the upgraded 
userspace btrfs check --repair.

2) Do a mkfs, thus eliminating the existing damage along with the data on 
the existing filesystem, and restore from backup to the new filesystem, 
recreated free of the damage.  Optionally upgrade the btrfs-progs 
userspace.

In either case, continue to run kernel 3.17.2 or newer, so that neither 
this bug nor the one that affected kernel 3.15 and early 3.16 reappears.

Either way should work.  Here, if the existing filesystem had been created 
before, say, kernel 3.14, I'd probably do the mkfs and the optional 
userspace upgrade too, taking advantage of newer filesystem options such 
as skinny-metadata and 16-KiB metadata nodes while I was at it.  If the 
filesystem was new and already took advantage of those features, I'd 
probably just do the userspace upgrade and btrfs check --repair.  But 
fortunately for you, unlike many unfortunate posters here, you have a 
backup available, thus giving you the /choice/, and that choice is up to 
you. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: corruption, bad block, input/output errors - do i run --repair?
  2014-11-07 14:33 corruption, bad block, input/output errors - do i run --repair? Matt McKinnon
  2014-11-08  4:04 ` Duncan
@ 2014-11-10  3:11 ` Qu Wenruo
  1 sibling, 0 replies; 3+ messages in thread
From: Qu Wenruo @ 2014-11-10  3:11 UTC (permalink / raw)
  To: Matt McKinnon, linux-btrfs


-------- Original Message --------
Subject: corruption, bad block, input/output errors - do i run --repair?
From: Matt McKinnon <matt@techsquare.com>
To: <linux-btrfs@vger.kernel.org>
Date: 2014-11-07 22:33
> I did run into some RO snapshot corruption which caused me to run 
> btrfs check:

This is a known bug in kernel 3.17 with RO snapshots.
It's fixable, but not with your old 3.14 btrfs-progs.

Please update to btrfs-progs 3.17 and re-run btrfsck (without --repair).
If this is the exact bug, it will prompt you to use --repair.
Running it with --repair should then fix it.
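
The order of operations suggested here can be sketched as a small 
dry-run planner.  This is only an illustration: plan_check_commands and 
its repair_confirmed flag are hypothetical names, /dev/sda1 is the device 
from the report above, and the code only prints the command lines rather 
than executing them.  The filesystem would need to be unmounted before 
either command is actually run.

```python
import shlex


def plan_check_commands(device, repair_confirmed=False):
    """Return btrfs check invocations in the suggested order.

    The read-only check always comes first.  The --repair pass is added
    only when the caller states that the read-only pass (run with
    btrfs-progs 3.17 or newer) indicated that a repair is appropriate.
    """
    commands = [["btrfs", "check", device]]  # read-only pass, no --repair
    if repair_confirmed:
        commands.append(["btrfs", "check", "--repair", device])
    return commands


# Dry run: print the commands instead of executing them.
for argv in plan_check_commands("/dev/sda1", repair_confirmed=True):
    print(shlex.join(argv))
```

Keeping the repair step behind an explicit flag mirrors the advice to 
look at the read-only result before modifying anything on disk.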

Thanks,
Qu


