balance induced csum errors

linux-btrfs.vger.kernel.org archive mirror

* balance induced csum errors
@ 2013-09-23 21:57 Chris Murphy
  2013-09-23 22:35 ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2013-09-23 21:57 UTC (permalink / raw)
  To: Btrfs BTRFS

SAMSUNG SSD 830 Series
CPU0: Intel® Core(TM) i7-2820QM CPU @ 2.30GHz (fam: 06, model: 2a, stepping: 07)
8GB RAM (quite heavily tested, not recently, with several days of memtest)
kernel 3.11.1-200.fc19.x86_64 running on baremetal
btrfs-progs-0.20.rc1.20130308git704a08c-1.fc19.x86_64

Today I did a scrub on a btrfs volume, with no messages or errors in console or dmesg or journal. Immediately after the scrub I did a balance on the volume, which resulted in:
ERROR: error during balancing '/' - Input/output error

In dmesg for the time of that error, this is reported:
[  567.921661] btrfs: relocating block group 6463422464 flags 1
[  568.282371] btrfs: found 200 extents
[  568.800974] btrfs: found 200 extents
[  568.868567] btrfs: relocating block group 5389680640 flags 1
[  571.929662] btrfs: found 4410 extents
[  572.896410] btrfs: found 4410 extents
[  572.962479] btrfs: relocating block group 4315938816 flags 1
[  574.681576] BTRFS info (device sda6): csum failed ino 259 off 428470272 csum 2566472073 private 2181120065
[  574.692047] BTRFS info (device sda6): csum failed ino 259 off 428470272 csum 2566472073 private 2181120065

Upon reboot with kernel 3.11.1-200.fc19.x86_64 and also kernel-3.10.4-300.fc19.x86_64 the following is reported in dmesg:


[    6.053511] btrfs no csum found for inode 37693 start 25538560
[    6.054463] BTRFS info (device sda6): csum failed ino 37693 off 25538560 csum 3474434693 private 0
[    6.055299] btrfs no csum found for inode 37693 start 26218496
[    6.056086] BTRFS info (device sda6): csum failed ino 37693 off 26218496 csum 2772176352 private 0
[    6.085993] btrfs no csum found for inode 37693 start 22286336
[    6.086093] btrfs no csum found for inode 37693 start 22368256
[    6.087636] BTRFS info (device sda6): csum failed ino 37693 off 22286336 csum 396494483 private 0
[    6.087741] BTRFS info (device sda6): csum failed ino 37693 off 22368256 csum 2249156591 private 0
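The boot-time lines above pair each "no csum found" message with a "csum failed ... private 0" line, while the failure during the balance shows a non-zero private value. A minimal Python sketch for separating the two cases follows. It assumes, without confirmation from this thread, that `csum` is the checksum computed from the data just read and `private` is the checksum the filesystem looked up, so that `private 0` means no stored checksum was found. The function name and the regular expression are illustrative only.

```python
import re

# Matches the kernel message format quoted in this thread (3.10/3.11 era).
CSUM_FAILED_RE = re.compile(
    r"csum failed ino (\d+) off (\d+) csum (\d+) private (\d+)"
)


def classify_csum_failures(dmesg_text):
    """Split 'csum failed' lines into missing-checksum and mismatch cases.

    Assumption: 'private 0' means no stored checksum was found for the block.
    """
    missing, mismatch = [], []
    for line in dmesg_text.splitlines():
        m = CSUM_FAILED_RE.search(line)
        if not m:
            continue
        ino, off, computed, stored = (int(g) for g in m.groups())
        record = {"ino": ino, "off": off, "computed": computed, "stored": stored}
        (missing if stored == 0 else mismatch).append(record)
    return missing, mismatch
```

Feeding it the output of `dmesg` gives two lists, which makes it easier to see whether the failures come from absent checksum items or from checksums that disagree with the data.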

[root@f19l chris]# btrfs fi show
failed to open /dev/sr0: No medium found
Label: 'fedora'  uuid: d505bdee-ba7c-4a64-9481-d5cd76ab8b3e
	Total devices 1 FS bytes used 3.64GB
	devid    1 size 12.99GB used 6.51GB path /dev/sda6

The file system is on an SSD, so single profile for both data and metadata:
[root@f19l chris]# btrfs fi df /
Data: total=6.01GB, used=3.39GB
System: total=4.00MB, used=4.00KB
Metadata: total=512.00MB, used=258.93MB
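For comparing allocated and used space across the reports in this thread, the `btrfs fi df` output can be parsed with a short Python sketch. It assumes the old btrfs-progs output format shown here, that the KB/MB/GB labels are binary multiples, and that a line without a profile name means the single profile. The function name `parse_fi_df` is made up for this example.

```python
import re

# Assumed binary multipliers for the unit labels printed by old btrfs-progs.
UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

FI_DF_RE = re.compile(
    r"^(?P<kind>\w+)(?:, (?P<profile>\w+))?: "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]B)?$"
)


def parse_fi_df(text):
    """Parse `btrfs fi df` lines into dicts with byte counts and fill ratio."""
    rows = []
    for line in text.splitlines():
        m = FI_DF_RE.match(line.strip())
        if not m:
            continue
        total = float(m["total"]) * UNITS[m["tunit"]]
        # A bare "used=0.00" carries no unit; treat it as bytes.
        used = float(m["used"]) * UNITS.get(m["uunit"], 1)
        rows.append({
            "kind": m["kind"],
            # Assumption: no profile label means "single".
            "profile": m["profile"] or "single",
            "total": total,
            "used": used,
            "fill": used / total if total else 0.0,
        })
    return rows
```

On the three lines above this reports the Data allocation as a little over half full, which is the kind of layout a full balance will repack.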


If this is not the result of a known bug, let me know if there's more information I should provide. I do have a ~22MB btrfs-image -c9 -t4 of the file system.

This fs is disposable, but I might try btrfsck --repair --init-csum-tree with a slightly newer btrfs-progs.


Chris Murphy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: balance induced csum errors
  2013-09-23 21:57 balance induced csum errors Chris Murphy
@ 2013-09-23 22:35 ` Chris Murphy
  2013-09-24 21:36   ` Chris Murphy
  2013-09-24 23:12   ` balance induced csum errors Chris Murphy
  0 siblings, 2 replies; 13+ messages in thread
From: Chris Murphy @ 2013-09-23 22:35 UTC (permalink / raw)
  To: Btrfs BTRFS

Result of btrfsck (without --repair) on the fs.

Checking filesystem on /dev/sda6
UUID: d505bdee-ba7c-4a64-9481-d5cd76ab8b3e
checking extents
checking fs roots
root 257 inode 37693 errors 1800
found 3938304000 bytes used err is 1
total csum bytes: 3557972
total tree bytes: 271794176
total fs tree bytes: 253009920
btree space waste bytes: 79371605
file data blocks allocated: 4546076672
 referenced 3631865856
Btrfs v0.20-rc1
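The `errors 1800` value reported for inode 37693 reads as a hexadecimal bitmask. The Python sketch below decodes it under two assumptions that this thread does not confirm: that btrfsck prints the mask in hex, and that bits 11 and 12 carry the meanings "odd csum item" and "some csum missing", as recalled from the btrfs-progs inode error flags. Only those two bits are listed; anything else is returned as unknown.

```python
# Assumed meanings of two btrfsck inode error bits (not confirmed in this thread).
INODE_ERR_BITS = {
    1 << 11: "odd csum item",
    1 << 12: "some csum missing",
}


def decode_inode_errors(hex_text):
    """Return (known flag names, leftover unknown bits) for an 'errors' value."""
    value = int(hex_text, 16)
    known = [name for bit, name in sorted(INODE_ERR_BITS.items()) if value & bit]
    unknown = value & ~sum(INODE_ERR_BITS)
    return known, unknown
```

Under those assumptions, `decode_inode_errors("1800")` names both checksum-related flags and leaves no unknown bits, which is consistent with the csum messages in dmesg.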

Console result of subsequent scrub on the mounted fs:

scrub status for d505bdee-ba7c-4a64-9481-d5cd76ab8b3e
	scrub started at Mon Sep 23 16:23:33 2013 and finished after 8 seconds
	total bytes scrubbed: 3.67GB with 10 errors
	error details: csum=10
	corrected errors: 0, uncorrectable errors: 10, unverified errors: 0

dmesg result of a subsequent scrub on the mounted file system:

[   30.682058] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[   30.682095] btrfs: unable to fixup (regular) error at logical 461914112 on dev /dev/sda6
[   30.682141] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[   30.682174] btrfs: unable to fixup (regular) error at logical 460079104 on dev /dev/sda6
[   30.689792] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[   30.689823] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
[   30.689824] btrfs: unable to fixup (regular) error at logical 456085504 on dev /dev/sda6
[   30.689896] btrfs: unable to fixup (regular) error at logical 457531392 on dev /dev/sda6
[   30.743222] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
[   30.743260] btrfs: unable to fixup (regular) error at logical 460230656 on dev /dev/sda6
[   30.970989] btrfs: checksum error at logical 462082048 on dev /dev/sda6, sector 902504, root 257, inode 37693, offset 22282240, length 4096, links 1 (path: var/log/journal/180d14c18233452d9918c3aec1c6c68b/system.journal)
[   30.970993] btrfs: checksum error at logical 464195584 on dev /dev/sda6, sector 906632, root 257, inode 37693, offset 22638592, length 4096, links 1 (path: var/log/journal/180d14c18233452d9918c3aec1c6c68b/system.journal)
[   30.970997] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
[   30.970998] btrfs: unable to fixup (regular) error at logical 464195584 on dev /dev/sda6
[   30.971270] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
[   30.971300] btrfs: unable to fixup (regular) error at logical 462082048 on dev /dev/sda6
[   31.047120] btrfs: checksum error at logical 462123008 on dev /dev/sda6, sector 902584, root 257, inode 37693, offset 22360064, length 4096, links 1 (path: var/log/journal/180d14c18233452d9918c3aec1c6c68b/system.journal)
[   31.047206] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
[   31.047235] btrfs: unable to fixup (regular) error at logical 462123008 on dev /dev/sda6
[   36.290269] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
[   36.290305] btrfs: unable to fixup (regular) error at logical 4744409088 on dev /dev/sda6
[   37.882830] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
[   37.882867] btrfs: unable to fixup (regular) error at logical 6730518528 on dev /dev/sda6



Also, there have been no crashes, panics, or power cuts to this system. Thus far it seems like the balance itself is what has caused the csum corruption. Prior to balance, scrub finds no problems. After balance there is some corruption. But isn't it ambiguous whether the data or the metadata have been corrupted since there is only a single copy of each? In which case is it wise to init-csum-tree?


Chris Murphy


* Re: balance induced csum errors
  2013-09-23 22:35 ` Chris Murphy
@ 2013-09-24 21:36   ` Chris Murphy
  2013-09-24 22:30     ` Chris Murphy
  2013-09-25  4:34     ` balance induced csum errors, systemd-journal Chris Murphy
  2013-09-24 23:12   ` balance induced csum errors Chris Murphy
  1 sibling, 2 replies; 13+ messages in thread
From: Chris Murphy @ 2013-09-24 21:36 UTC (permalink / raw)
  To: Btrfs BTRFS

I'm able to reproduce this on a different drive, HDD (WDC WD5000BEVT-22ZAT0), also with data and metadata set to single. There are no problems reported when scrubbing before balance, and then there is corruption reported after balance.

File system is created with:
kernel-3.9.5-301.fc19.x86_64
btrfs-progs-0.20.rc1.20130308git704a08c-1.fc19.x86_64


Then updated to:
kernel-3.11.1-200.fc19.x86_64
btrfs-progs-0.20.rc1.20130308git704a08c-1.fc19.x86_64

Then scrubbed with no errors.
Then balanced with no errors (unlike the previous report with SSD which stopped)
Then scrubbed with errors, see below the balance followed by 2nd scrub.



[  226.333352] btrfs: relocating block group 2168455168 flags 1
[  233.032499] btrfs: found 2816 extents
[  234.522162] btrfs: found 2816 extents
[  234.818501] btrfs: relocating block group 1094713344 flags 1
[  261.631679] btrfs: found 13255 extents
[  266.133269] btrfs: found 13254 extents
[  266.464119] btrfs: relocating block group 20971520 flags 4
[  268.665678] btrfs: found 2018 extents
[  268.976324] btrfs: relocating block group 12582912 flags 1
[  269.397991] btrfs: found 246 extents
[  269.931383] btrfs: found 246 extents
[  270.209504] btrfs: relocating block group 4194304 flags 4
[  270.642570] btrfs: found 378 extents
[  318.029771] btrfs: checksum error at logical 2209439744 on dev /dev/sda4, sector 6412464, root 256, inode 25764, offset 6746112, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  318.029793] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[  318.029827] btrfs: unable to fixup (regular) error at logical 2209439744 on dev /dev/sda4
[  318.045206] btrfs: checksum error at logical 2207895552 on dev /dev/sda4, sector 6409448, root 256, inode 25764, offset 6668288, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  318.045211] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  318.045224] btrfs: unable to fixup (regular) error at logical 2207895552 on dev /dev/sda4
[  318.172649] btrfs: checksum error at logical 2211979264 on dev /dev/sda4, sector 6417424, root 256, inode 25764, offset 7389184, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  318.172657] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[  318.172671] btrfs: unable to fixup (regular) error at logical 2211979264 on dev /dev/sda4
[  318.175607] btrfs: checksum error at logical 2212261888 on dev /dev/sda4, sector 6417976, root 256, inode 25764, offset 7065600, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  318.175611] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
[  318.175619] btrfs: unable to fixup (regular) error at logical 2212261888 on dev /dev/sda4
[  318.175979] btrfs: checksum error at logical 2212278272 on dev /dev/sda4, sector 6418008, root 256, inode 25764, offset 7081984, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  318.175984] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
[  318.175992] btrfs: unable to fixup (regular) error at logical 2212278272 on dev /dev/sda4
[  318.200069] btrfs: checksum error at logical 2213355520 on dev /dev/sda4, sector 6420112, root 256, inode 25764, offset 7581696, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  318.200074] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
[  318.200083] btrfs: unable to fixup (regular) error at logical 2213355520 on dev /dev/sda4
[  318.207868] btrfs: checksum error at logical 2214825984 on dev /dev/sda4, sector 6422984, root 256, inode 25764, offset 7806976, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  318.207872] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
[  318.207881] btrfs: unable to fixup (regular) error at logical 2214825984 on dev /dev/sda4
[  323.564405] btrfs: checksum error at logical 2650460160 on dev /dev/sda4, sector 7273832, root 256, inode 25764, offset 4247552, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  323.564422] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
[  323.564456] btrfs: unable to fixup (regular) error at logical 2650460160 on dev /dev/sda4
[  325.307954] btrfs: checksum error at logical 2792796160 on dev /dev/sda4, sector 7551832, root 256, inode 25764, offset 5857280, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  325.307975] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
[  325.308088] btrfs: unable to fixup (regular) error at logical 2792796160 on dev /dev/sda4
[  325.317607] btrfs: checksum error at logical 2791784448 on dev /dev/sda4, sector 7549856, root 256, inode 25764, offset 5378048, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  325.317621] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
[  325.317648] btrfs: unable to fixup (regular) error at logical 2791784448 on dev /dev/sda4
[  325.431833] btrfs: checksum error at logical 2791858176 on dev /dev/sda4, sector 7550000, root 256, inode 25764, offset 5455872, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  325.431849] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 11, gen 0
[  325.431877] btrfs: unable to fixup (regular) error at logical 2791858176 on dev /dev/sda4
[  325.432530] btrfs: checksum error at logical 2791907328 on dev /dev/sda4, sector 7550096, root 256, inode 25764, offset 5505024, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  325.432543] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 12, gen 0
[  325.432567] btrfs: unable to fixup (regular) error at logical 2791907328 on dev /dev/sda4
[  327.321507] btrfs: checksum error at logical 2792157184 on dev /dev/sda4, sector 7550584, root 256, inode 25764, offset 5758976, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  327.321525] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 13, gen 0
[  327.321560] btrfs: unable to fixup (regular) error at logical 2792157184 on dev /dev/sda4
[  329.996557] btrfs: checksum error at logical 3165069312 on dev /dev/sda4, sector 8278928, root 256, inode 25764, offset 6369280, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  329.996575] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0
[  329.996604] btrfs: unable to fixup (regular) error at logical 3165069312 on dev /dev/sda4
[  329.997239] btrfs: checksum error at logical 3165126656 on dev /dev/sda4, sector 8279040, root 256, inode 25764, offset 6430720, length 4096, links 1 (path: var/log/journal/10db2764a11a4829bf82a94c6559d121/system.journal)
[  329.997253] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
[  329.997279] btrfs: unable to fixup (regular) error at logical 3165126656 on dev /dev/sda4
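Every checksum error in the excerpt above that names a path points at the same file. A small Python sketch for tallying scrub errors per file makes that pattern easy to spot in longer logs. The regular expression follows the message format quoted in this thread; the function name is illustrative.

```python
import re
from collections import Counter

# Message format as printed by the 3.11-era kernels quoted in this thread.
CSUM_ERR_RE = re.compile(
    r"checksum error at logical (\d+) on dev (\S+), sector (\d+), "
    r"root (\d+), inode (\d+), offset (\d+), length (\d+), "
    r"links \d+ \(path: (.+)\)$"
)


def errors_by_path(dmesg_text):
    """Count scrub checksum errors per affected file path."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = CSUM_ERR_RE.search(line.strip())
        if m:
            counts[m.group(8)] += 1
    return counts
```

Calling `errors_by_path(text).most_common()` on saved dmesg output lists the files with the most bad blocks first.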



Also on reboot now it is reported:

[root@f19s ~]# dmesg | grep -i btrfs
[    1.835049] Btrfs loaded
[    1.966412] btrfs: disk space caching is enabled
[    1.980436] btrfs: bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 15, gen 0
[    3.233524] SELinux: initialized (dev sda4, type btrfs), uses xattr
[    4.316491] btrfs: disk space caching is enabled
[    9.715052] btrfs no csum found for inode 25764 start 6402048
[    9.715503] btrfs no csum found for inode 25764 start 6754304
[    9.827785] BTRFS info (device sda4): csum failed ino 25764 off 6402048 csum 3000251694 private 0
[   10.204708] BTRFS info (device sda4): csum failed ino 25764 off 6754304 csum 1612034066 private 0
[   11.187301] btrfs no csum found for inode 25764 start 7393280
[   11.187578] btrfs no csum found for inode 25764 start 7585792
[   11.187866] btrfs no csum found for inode 25764 start 7819264
[   11.403389] BTRFS info (device sda4): csum failed ino 25764 off 7393280 csum 3889482771 private 0
[   11.427616] BTRFS info (device sda4): csum failed ino 25764 off 7819264 csum 4086456643 private 0
[   11.486405] BTRFS info (device sda4): csum failed ino 25764 off 7585792 csum 3911271769 private 0



Chris Murphy


* Re: balance induced csum errors
  2013-09-24 21:36   ` Chris Murphy
@ 2013-09-24 22:30     ` Chris Murphy
  2013-09-25  4:34     ` balance induced csum errors, systemd-journal Chris Murphy
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2013-09-24 22:30 UTC (permalink / raw)
  To: Btrfs BTRFS

OK so now I'm able to reproduce this with Fedora 20 alpha RC4 on a HDD, which uses:

kernel-3.11.1-300.fc20.x86_64
btrfs-progs-0.20.rc1.20130917git194aa4a-1.fc20.x86_64

Since it's HDD, metadata profile DUP is used. But I still get munged checksums with balance, and the corruption isn't fixable by a subsequent scrub. So even though the data is probably OK and this is just a checksum problem, it's apparently not fixable (?).

[root@oldlaptop ~]# btrfs balance start /
Done, had to relocate 5 out of 5 chunks

[root@oldlaptop ~]# dmesg (snippet)
[  390.770699] btrfs: relocating block group 1103101952 flags 1
[  406.639113] btrfs: found 10341 extents
[  414.172873] btrfs: found 10331 extents
[  414.530059] btrfs: relocating block group 29360128 flags 36
[  418.761208] btrfs: found 9281 extents
[  419.136338] btrfs: relocating block group 20971520 flags 34
[  419.536539] btrfs: found 1 extents
[  419.880757] btrfs: relocating block group 12582912 flags 1
[  420.380511] btrfs: found 282 extents
[  421.080667] btrfs: found 282 extents
[  421.426891] btrfs: relocating block group 4194304 flags 4
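The relocation messages in this thread carry `flags 1`, `flags 4`, `flags 34` and `flags 36`. The Python sketch below decodes them as a bitmask, assuming the bit values of the kernel's BTRFS_BLOCK_GROUP_* constants (data, system, metadata, then the RAID0, RAID1, DUP and RAID10 profile bits) and assuming that no profile bit means the single profile. The function name is made up for this example.

```python
# Assumed bit layout, following the kernel's BTRFS_BLOCK_GROUP_* constants.
BLOCK_GROUP_BITS = [
    (1 << 0, "data"),
    (1 << 1, "system"),
    (1 << 2, "metadata"),
    (1 << 3, "raid0"),
    (1 << 4, "raid1"),
    (1 << 5, "dup"),
    (1 << 6, "raid10"),
]
PROFILE_MASK = (1 << 3) | (1 << 4) | (1 << 5) | (1 << 6)


def decode_block_group_flags(flags):
    """Return the type and profile names encoded in a block group flags value."""
    names = [name for bit, name in BLOCK_GROUP_BITS if flags & bit]
    if not flags & PROFILE_MASK:
        # No profile bit set is taken to mean the single profile.
        names.append("single")
    return names
```

Read this way, `flags 36` is a DUP metadata block group and `flags 34` a DUP system block group, matching the DUP lines in the `btrfs fi df` output for this machine, while `flags 1` and `flags 4` are single data and single metadata.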



[root@oldlaptop ~]# btrfs scrub start /
scrub started on /, fsid 1463a31b-472a-47cd-a8c8-86bf09f978fa (pid=894)

[root@oldlaptop ~]# dmesg (snippet)
[  460.533990] btrfs: checksum error at logical 2607853568 on dev /dev/sda5, sector 7207000, root 256, inode 24622, offset 4247552, length 4096, links 1 (path: var/log/journal/d212cf4a840f4e78a33781c56189a7da/system.journal)
[  460.534045] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[  460.534082] btrfs: unable to fixup (regular) error at logical 2607853568 on dev /dev/sda5
[  460.534581] btrfs: checksum error at logical 2607869952 on dev /dev/sda5, sector 7207032, root 256, inode 24622, offset 4263936, length 4096, links 1 (path: var/log/journal/d212cf4a840f4e78a33781c56189a7da/system.journal)
[  460.534594] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  460.534614] btrfs: unable to fixup (regular) error at logical 2607869952 on dev /dev/sda5
[  460.535128] btrfs: checksum error at logical 2607886336 on dev /dev/sda5, sector 7207064, root 256, inode 24622, offset 4280320, length 4096, links 1 (path: var/log/journal/d212cf4a840f4e78a33781c56189a7da/system.journal)
[  460.535140] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[  460.535161] btrfs: unable to fixup (regular) error at logical 2607886336 on dev /dev/sda5
[  460.535607] btrfs: checksum error at logical 2607902720 on dev /dev/sda5, sector 7207096, root 256, inode 24622, offset 4296704, length 4096, links 1 (path: var/log/journal/d212cf4a840f4e78a33781c56189a7da/system.journal)
[  460.535619] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
[  460.535639] btrfs: unable to fixup (regular) error at logical 2607902720 on dev /dev/sda5
[  460.536421] btrfs: checksum error at logical 2608025600 on dev /dev/sda5, sector 7207336, root 256, inode 24622, offset 4313088, length 4096, links 1 (path: var/log/journal/d212cf4a840f4e78a33781c56189a7da/system.journal)
[  460.536437] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
[  460.536457] btrfs: unable to fixup (regular) error at logical 2608025600 on dev /dev/sda5
[  460.779192] btrfs: checksum error at logical 2626674688 on dev /dev/sda5, sector 7243760, root 256, inode 24622, offset 4595712, length 4096, links 1 (path: var/log/journal/d212cf4a840f4e78a33781c56189a7da/system.journal)
[  460.779210] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
[  460.779245] btrfs: unable to fixup (regular) error at logical 2626674688 on dev /dev/sda5
[  460.779822] btrfs: checksum error at logical 2626715648 on dev /dev/sda5, sector 7243840, root 256, inode 24622, offset 4231168, length 4096, links 1 (path: var/log/journal/d212cf4a840f4e78a33781c56189a7da/system.journal)
[  460.779834] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
[  460.779854] btrfs: unable to fixup (regular) error at logical 2626715648 on dev /dev/sda5


And now on reboot:
[root@f20s ~]# dmesg | grep -i btrfs
[    1.725224] Btrfs loaded
[    1.980491] btrfs: disk space caching is enabled
[    2.001684] btrfs: bdev /dev/sda5 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
[    3.011628] SELinux: initialized (dev sda5, type btrfs), uses xattr
[    5.092593] btrfs: disk space caching is enabled
[    8.703883] btrfs no csum found for inode 24622 start 4235264
[    8.844562] btrfs no csum found for inode 24622 start 4251648
[    8.844589] btrfs no csum found for inode 24622 start 4272128
[    8.844611] btrfs no csum found for inode 24622 start 4288512
[    8.844632] btrfs no csum found for inode 24622 start 4304896
[    8.844658] btrfs no csum found for inode 24622 start 4321280
[    8.856069] BTRFS info (device sda5): csum failed ino 24622 off 4251648 csum 1113579642 private 0
[    8.856084] BTRFS info (device sda5): csum failed ino 24622 off 4272128 csum 2433646103 private 0
[    8.856092] BTRFS info (device sda5): csum failed ino 24622 off 4288512 csum 2276263411 private 0
[    8.857248] BTRFS info (device sda5): csum failed ino 24622 off 4304896 csum 1156822344 private 0
[    8.857424] BTRFS info (device sda5): csum failed ino 24622 off 4321280 csum 3967991073 private 0
[    8.867242] BTRFS info (device sda5): csum failed ino 24622 off 4235264 csum 172180530 private 0



Other info:
[root@oldlaptop ~]# btrfs fi show
failed to open /dev/sr0: No medium found
Label: 'fedora'  uuid: 1463a31b-472a-47cd-a8c8-86bf09f978fa
	Total devices 1 FS bytes used 700.04MB
	devid    1 size 432.62GB used 3.04GB path /dev/sda5

Btrfs v0.20-rc1
[root@oldlaptop ~]# btrfs fi df /
Data: total=1.01GB, used=662.47MB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=37.57MB
Metadata: total=8.00MB, used=0.00



Chris Murphy


* Re: balance induced csum errors
  2013-09-23 22:35 ` Chris Murphy
  2013-09-24 21:36   ` Chris Murphy
@ 2013-09-24 23:12   ` Chris Murphy
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2013-09-24 23:12 UTC (permalink / raw)
  To: Btrfs BTRFS

Since I can now reproduce this bug on two different computers, one with SSD, the other HDD, and a scrub does not fix the csum errors, I've filed a bug. It's reproducible with:

kernel-3.11.1-300.fc20.x86_64
btrfs-progs-0.20.rc1.20130917git194aa4a-1.fc20.x86_64

Bug is at:
https://bugzilla.redhat.com/show_bug.cgi?id=1011714



Chris Murphy


* Re: balance induced csum errors, systemd-journal
  2013-09-24 21:36   ` Chris Murphy
  2013-09-24 22:30     ` Chris Murphy
@ 2013-09-25  4:34     ` Chris Murphy
  2013-09-25  5:44       ` Chris Murphy
                         ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Chris Murphy @ 2013-09-25  4:34 UTC (permalink / raw)
  To: Btrfs BTRFS

OK so I think I'm narrowing this down to just the systemd journal, and it's not checksums that are corrupted, it's the journal itself.

[   19.354354] systemd-journald[210]: /var/log/journal/8e4cbfea404512ae70096c6202c9a3bf/system.journal: Journal file corrupted, rotating.

If I set systemd journald.conf Storage=volatile so that it stores journals only in memory, the problem is not reproducible.

However, even after deleting all corrupt journal files, and a subsequent scrub reporting no errors, on each reboot (and mount of the filesystem) I get:

[    3.646448] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0

So somehow the corrupt counter isn't being reset?

And how would I go about setting /var/log/journal contents to inherit nodatacow? Possible?

Chris Murphy


* Re: balance induced csum errors, systemd-journal
  2013-09-25  4:34     ` balance induced csum errors, systemd-journal Chris Murphy
@ 2013-09-25  5:44       ` Chris Murphy
  2013-09-25 12:30         ` Josef Bacik
  2013-09-25  6:38       ` Duncan
  2013-09-27 15:07       ` Johannes Hirte
  2 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2013-09-25  5:44 UTC (permalink / raw)
  To: Btrfs BTRFS


On Sep 24, 2013, at 10:34 PM, Chris Murphy <lists@colorremedies.com> wrote:

> And how would I go about setting /var/log/journal contents to inherit nodatacow? Possible?

chattr +C /var/log/journal

Resolved the problem. Whether this is an appropriate long term fix that systemd should apply to this directory, I don't know.
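One caveat from chattr(1): setting `C` on a directory only affects files created in it afterwards, so journal files that already exist keep copy-on-write until they are rotated or recreated. A minimal Python sketch for checking the attribute in `lsattr` output follows. The sample line format (a flags field, a space, then the path) is an assumption about lsattr, and the function name is illustrative.

```python
def has_nocow(lsattr_line):
    """Return True if an `lsattr` output line shows the 'C' (No_COW) attribute.

    Only the flags field is inspected, so a 'C' in the path does not count.
    Lowercase 'c' (compression) is a different attribute and is ignored.
    """
    flags, _, _path = lsattr_line.strip().partition(" ")
    return "C" in flags
```

Running `lsattr -d /var/log/journal` and passing the resulting line to `has_nocow` confirms that the directory flag took effect; the same check on an individual journal file shows whether that file was created before or after the change.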

Chris Murphy



* Re: balance induced csum errors, systemd-journal
  2013-09-25  4:34     ` balance induced csum errors, systemd-journal Chris Murphy
  2013-09-25  5:44       ` Chris Murphy
@ 2013-09-25  6:38       ` Duncan
  2013-09-27 15:07       ` Johannes Hirte
  2 siblings, 0 replies; 13+ messages in thread
From: Duncan @ 2013-09-25  6:38 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Tue, 24 Sep 2013 22:34:20 -0600 as excerpted:

> However, even after deleting all corrupt journal files, and a subsequent
> scrub reporting no errors, on each reboot (and mount of the filesystem)
> I get:
> 
> [    3.646448] btrfs: bdev /dev/sda6 errs: wr 0, rd 0, flush 0, corrupt
> 17, gen 0
> 
> So somehow the corrupt counter isn't being reset?

AFAIK, it's deliberate that errors aren't reset automatically so there's 
some historical record and it's possible to see if they start to 
accumulate.

But there is of course a manual reset available, should a sysadmin wish 
to use it...  <quick lookup, quoting the commandline help output> ...

btrfs device stats [-z] <path>|<device>

Show current device IO stats.  -z to reset stats afterwards.

What the (brief) help output doesn't say but the (longer) manpage does... 
for multi-device filesystems <path> will list (and zero with -z) stats 
for all devices (listing one device's stats after another) composing the 
filesystem, <device> will list/zero them for just that single component 
device.

The -z does reset things here.
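Because the counters persist across mounts until `btrfs device stats -z` is run, the mount-time `errs:` line is a record of past events and not of the current state. A short Python sketch for pulling the counters out of that line, for example to compare them between boots, is below. The function name is made up, and the regular expression follows the kernel message format quoted in this thread.

```python
import re

ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_errs(line):
    """Return (device, counters dict) from a btrfs 'errs:' line, or None."""
    m = ERRS_RE.search(line)
    if not m:
        return None
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m["dev"], counters
```

If the `corrupt` value is unchanged from one boot to the next and a scrub is clean, the number reflects old errors only.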

(FWIW I have a device that's occasionally so slow to stabilize on
power-up that, at least with 3.11, btrfs would occasionally drop it on
resume after a suspend, forcing a hard reboot soon after, with resulting
corruption.  Fortunately I'm running raid1 mode for both data and
metadata, and a scrub has always fixed things, as verified by a further
scrub and balance, but the stats errors of course stuck around until I
did a -z reset.  So I have personal knowledge of this one.  But with last
night's 3.12-rc2 git kernel pull and build I changed the kernel
commandline option I was using from rootdelay=N to rootwait, and between
that and the btrfs fixes in 3.12, I'm hoping I won't see that problem
again.  I guess I'll find out over the coming couple of weeks or so, at
which point I'll declare the issue gone if I've not seen it again.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: balance induced csum errors, systemd-journal
  2013-09-25  5:44       ` Chris Murphy
@ 2013-09-25 12:30         ` Josef Bacik
  2013-09-25 14:56           ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2013-09-25 12:30 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Tue, Sep 24, 2013 at 11:44:15PM -0600, Chris Murphy wrote:
> 
> On Sep 24, 2013, at 10:34 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> > And how would I go about setting /var/log/journal contents to inherit nodatacow? Possible?
> 
> chattr +C /var/log/journal
> 
> Resolved the problem. Whether this is an appropriate long term fix that systemd should apply to this directory, I don't know.
> 

That just disables cow which in turn disables csumming so it is a good solution
for you right now and gives me time to figure out wtf is going on here.  Looking
at the systemd code it isn't doing O_DIRECT, which is how you usually end up
with this sort of situation.  So it is likely a bug on our side, I will try and
track it down today.  Thanks for narrowing this down,

Josef


* Re: balance induced csum errors, systemd-journal
  2013-09-25 12:30         ` Josef Bacik
@ 2013-09-25 14:56           ` Chris Murphy
  2013-09-25 15:08             ` Josef Bacik
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2013-09-25 14:56 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Btrfs BTRFS

On Sep 25, 2013, at 6:30 AM, Josef Bacik <jbacik@fusionio.com> wrote:

> 
> That just disables cow which in turn disables csumming so it is a good solution
> for you right now and gives me time to figure out wtf is going on here.

I think it's preventing the corruption of the journal logs, because I'm also no longer getting messages from systemd saying a log is corrupt. So I don't think the problem is solved just by not having csums. I'm thinking the csums were always correct, it was the data that was corrupting… or both data and csums were wrong.

It seems that the way systemd-journal is writing to disk is handled differently only during balance operations. The corruption has never happened with days of normal usage (no balance). But happens within tens of seconds upon balance. Naturally something or other is always being written to the systemd-journal logs during a balance (someone logs in=journal entry, kernel reports extent found=journal entry, kernel reports moved chunk=journal entry).

Chris Murphy


* Re: balance induced csum errors, systemd-journal
  2013-09-25 14:56           ` Chris Murphy
@ 2013-09-25 15:08             ` Josef Bacik
  0 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2013-09-25 15:08 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Josef Bacik, Btrfs BTRFS

On Wed, Sep 25, 2013 at 08:56:52AM -0600, Chris Murphy wrote:
> 
> On Sep 25, 2013, at 6:30 AM, Josef Bacik <jbacik@fusionio.com> wrote:
> 
> > 
> > That just disables cow which in turn disables csumming so it is a good solution
> > for you right now and gives me time to figure out wtf is going on here.
> 
> I think it's preventing the corruption of the journal logs, because I'm also no longer getting messages from systemd saying a log is corrupt. So I don't think the problem is solved just by not having csums. I'm thinking the csums were always correct, it was the data that was corrupting… or both data and csums were wrong.
> 
> It seems that the way systemd-journal is writing to disk is handled differently only during balance operations. The corruption has never happened with days of normal usage (no balance). But happens within tens of seconds upon balance. Naturally something or other is always being written to the systemd-journal logs during a balance (someone logs in=journal entry, kernel reports extent found=journal entry, kernel reports moved chunk=journal entry).
> 

I've reproduced it locally so I'll hopefully figure out what is going on soon.
Thanks,

Josef


* Re: balance induced csum errors, systemd-journal
  2013-09-25  4:34     ` balance induced csum errors, systemd-journal Chris Murphy
  2013-09-25  5:44       ` Chris Murphy
  2013-09-25  6:38       ` Duncan
@ 2013-09-27 15:07       ` Johannes Hirte
  2013-09-27 16:22         ` Chris Murphy
  2 siblings, 1 reply; 13+ messages in thread
From: Johannes Hirte @ 2013-09-27 15:07 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Josef Bacik

On Tue, 24 Sep 2013 22:34:20 -0600
Chris Murphy <lists@colorremedies.com> wrote:

> OK so I think I'm narrowing this down to just the systemd journal,
> and it's not checksums that are corrupted, it's the journal itself.

I doubt it's systemd-dependent, cause I've seen similar behaviour on a
Gentoo system without systemd. Before balance the filesystem was ok,
after I get

root 257 inode 2875 errors 1800
root 257 inode 2881 errors 1800
root 257 inode 2969 errors 1800
root 257 inode 3063 errors 1800
root 257 inode 3120 errors 1800
root 257 inode 12407 errors 1800
root 257 inode 19496 errors 1800
root 257 inode 19500 errors 1800
root 257 inode 19564 errors 1800
root 257 inode 19643 errors 1800
root 257 inode 19693 errors 1800
root 257 inode 19949 errors 1800
root 257 inode 20178 errors 1800
root 257 inode 20320 errors 1800
root 257 inode 20406 errors 1800
root 257 inode 20512 errors 1800
root 257 inode 20586 errors 1800
root 257 inode 20654 errors 1800
root 257 inode 20727 errors 1800
root 257 inode 20728 errors 1800
root 257 inode 20821 errors 1800
root 257 inode 20843 errors 1800
root 257 inode 21062 errors 1800
root 257 inode 21078 errors 1800
root 257 inode 21222 errors 1800
root 257 inode 21356 errors 1800
root 257 inode 21437 errors 1800
root 257 inode 55082 errors 1800
root 257 inode 65343 errors 1800
root 257 inode 72413 errors 1800

on a fsck, and scrub tells me that there are unfixable csum errors.
Kernel is 3.12.0-rc2-00083-g4b97280.

I've observed this two times, and every time only the first
subvolume (root 257) was affected.
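
A minimal sketch (not part of the original message) of how fsck output in the "root N inode M errors X" form shown above could be grouped by root, to check whether only one subvolume is affected. The function name `group_by_root` and the sample text are illustrative assumptions. The error code is kept as an opaque string, because its meaning is not stated in this thread.

```python
import re
from collections import defaultdict

# Matches lines such as: "root 257 inode 2875 errors 1800"
FSCK_RE = re.compile(r"^root (\d+) inode (\d+) errors (\w+)$")

def group_by_root(fsck_text):
    """Return {root_id: [(inode, error_code), ...]} for every matching line."""
    by_root = defaultdict(list)
    for line in fsck_text.splitlines():
        m = FSCK_RE.match(line.strip())
        if m:
            root, inode, code = m.groups()
            # The error code is left as a string; it is not interpreted here.
            by_root[int(root)].append((int(inode), code))
    return dict(by_root)

# Sample lines copied from the report above.
SAMPLE = """\
root 257 inode 2875 errors 1800
root 257 inode 2881 errors 1800
root 257 inode 72413 errors 1800
"""

if __name__ == "__main__":
    for root, entries in group_by_root(SAMPLE).items():
        print(f"root {root}: {len(entries)} inode(s) flagged")
```

If the resulting dictionary has a single key, every flagged inode belongs to the same subvolume tree.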

regards,
  Johannes


* Re: balance induced csum errors, systemd-journal
  2013-09-27 15:07       ` Johannes Hirte
@ 2013-09-27 16:22         ` Chris Murphy
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2013-09-27 16:22 UTC (permalink / raw)
  To: Johannes Hirte; +Cc: Btrfs BTRFS, Josef Bacik

On Sep 27, 2013, at 9:07 AM, Johannes Hirte <johannes.hirte@datenkhaos.de> wrote:

> On Tue, 24 Sep 2013 22:34:20 -0600
> Chris Murphy <lists@colorremedies.com> wrote:
> 
>> OK so I think I'm narrowing this down to just the systemd journal,
>> and it's not checksums that are corrupted, it's the journal itself.
> 
> I doubt it's systemd-dependent,

I did not intend to indicate only systemd journal can trigger this, but rather on my system those appear to be the only affected files. Anything that has the same write behavior as systemd-journald probably has the same problem.

> 
> on a fsck and scrub tells me that there are unfixable csum errors.

The scrub should cause messages to appear in dmesg that include a pathname to the affected files, which might hint at what has the same write behavior. Even though a fix has been sent to stable for the systemd journal triggered issue, you should still find out what's being corrupted in your situation in case the write behaviors are different yet are still triggering corruption.
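
As a rough sketch (not part of the original message), the kernel messages could be reduced to a list of affected inodes like this. It assumes the "csum failed ino N off M" line format quoted earlier in this thread; the function name `failed_inodes` and the sample text are illustrative.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
# "BTRFS info (device sda6): csum failed ino 259 off 428470272 csum ... private ..."
CSUM_RE = re.compile(
    r"BTRFS \w+ \(device (?P<dev>[^)]+)\): "
    r"csum failed ino (?P<ino>\d+) off (?P<off>\d+)"
)

def failed_inodes(log_text):
    """Map (device, inode) to a sorted list of distinct failing offsets."""
    hits = defaultdict(set)
    for line in log_text.splitlines():
        m = CSUM_RE.search(line)
        if m:
            key = (m.group("dev"), int(m.group("ino")))
            hits[key].add(int(m.group("off")))  # a set drops repeated messages
    return {key: sorted(offs) for key, offs in hits.items()}

# Sample lines copied from the dmesg output at the start of this thread.
SAMPLE = """\
[  574.681576] BTRFS info (device sda6): csum failed ino 259 off 428470272 csum 2566472073 private 2181120065
[  574.692047] BTRFS info (device sda6): csum failed ino 259 off 428470272 csum 2566472073 private 2181120065
[    6.053511] btrfs no csum found for inode 37693 start 25538560
[    6.054463] BTRFS info (device sda6): csum failed ino 37693 off 25538560 csum 3474434693 private 0
[    6.056086] BTRFS info (device sda6): csum failed ino 37693 off 26218496 csum 2772176352 private 0
"""

if __name__ == "__main__":
    for (dev, ino), offsets in failed_inodes(SAMPLE).items():
        print(f"{dev}: inode {ino} failed at offsets {offsets}")
```

Each inode number can then be turned into a path with `btrfs inspect-internal inode-resolve <inode> <path>`, bearing in mind that inode numbers are per subvolume, so the path given should be inside the affected subvolume.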

Chris Murphy


