* Got 10 csum errors according to dmesg but 0 errors according to dev stats
@ 2015-05-10 14:37 Philip Seeger
From: Philip Seeger @ 2015-05-10 14:37 UTC (permalink / raw)
To: linux-btrfs

I have installed a new virtual machine (VirtualBox) with Arch on btrfs (just a root fs and a swap partition, no other partitions). I suddenly noticed 10 checksum errors in the kernel log:

$ dmesg | grep csum
[ 736.283506] BTRFS warning (device sda1): csum failed ino 1704363 off 761856 csum 1145980813 expected csum 2566472073
[ 736.283605] BTRFS warning (device sda1): csum failed ino 1704363 off 1146880 csum 1961240434 expected csum 2566472073
[ 745.583064] BTRFS warning (device sda1): csum failed ino 1704346 off 393216 csum 4035064017 expected csum 2566472073
[ 752.324899] BTRFS warning (device sda1): csum failed ino 1705927 off 2125824 csum 3638986839 expected csum 2566472073
[ 752.333115] BTRFS warning (device sda1): csum failed ino 1705927 off 2588672 csum 176788087 expected csum 2566472073
[ 752.333303] BTRFS warning (device sda1): csum failed ino 1705927 off 3276800 csum 1891435134 expected csum 2566472073
[ 752.333397] BTRFS warning (device sda1): csum failed ino 1705927 off 3964928 csum 3304112727 expected csum 2566472073
[ 2761.889460] BTRFS warning (device sda1): csum failed ino 1705927 off 2125824 csum 3638986839 expected csum 2566472073
[ 9054.226022] BTRFS warning (device sda1): csum failed ino 1704363 off 761856 csum 1145980813 expected csum 2566472073
[ 9054.226106] BTRFS warning (device sda1): csum failed ino 1704363 off 1146880 csum 1961240434 expected csum 2566472073

This is a new vm; it hasn't crashed (which might have caused filesystem corruption). The virtual disk is on RAID storage on the host, which is healthy. All corrupted files are Firefox data files:

$ dmesg | grep csum | grep -Eo 'csum failed ino [0-9]* ' | awk '{print $4}' | xargs -I{} find -inum {}
./.mozilla/firefox/nfh217zw.default/cookies.sqlite
./.mozilla/firefox/nfh217zw.default/cookies.sqlite
./.mozilla/firefox/nfh217zw.default/webappsstore.sqlite
./.mozilla/firefox/nfh217zw.default/places.sqlite
./.mozilla/firefox/nfh217zw.default/places.sqlite
./.mozilla/firefox/nfh217zw.default/places.sqlite
./.mozilla/firefox/nfh217zw.default/places.sqlite
./.mozilla/firefox/nfh217zw.default/places.sqlite
./.mozilla/firefox/nfh217zw.default/cookies.sqlite
./.mozilla/firefox/nfh217zw.default/cookies.sqlite
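
(An aside on the lookup itself: walking the whole tree with find once per error line is slow and prints duplicates. A minimal sketch of the same lookup, assuming this btrfs-progs version ships the inspect-internal subcommand:

$ dmesg | grep -Eo 'csum failed ino [0-9]+' | awk '{print $4}' | sort -u \
      | xargs -I{} sudo btrfs inspect-internal inode-resolve {} /

inode-resolve asks the filesystem for the path directly, and sort -u collapses repeated inodes, so each damaged file is listed once.)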

How could this possibly happen?

And more importantly: Why doesn't the btrfs stat(u)s output tell me that errors have occurred?

$ sudo btrfs dev stats /
[/dev/sda1].write_io_errs 0
[/dev/sda1].read_io_errs 0
[/dev/sda1].flush_io_errs 0
[/dev/sda1].corruption_errs 0
[/dev/sda1].generation_errs 0

If filesystem health were monitored with btrfs dev stats from a cronjob (the way a zpool is checked with zpool status), the admin would not have been notified:

$ sudo btrfs dev stats / | grep -v 0 -c
0
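
(Note that grep -v 0 is fragile on top of this: it hides every line containing the digit 0 anywhere, so a counter of, say, 10 would be missed even if the stats were updated. A monitoring sketch that compares the counter values instead:

$ sudo btrfs dev stats / | awk '$2 != 0 { bad = 1; print } END { exit bad }' \
      || echo "btrfs device errors detected"

The awk program exits nonzero exactly when some counter is nonzero, so a cronjob can alert on the output or on the exit status.)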

Is my understanding of the stats command wrong? Does "corruption_errs" not mean corruption errors?

--
Philip

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-10 14:58 UTC (permalink / raw)
To: linux-btrfs

Forgot to mention kernel version: Linux 4.0.1-1-ARCH

$ sudo btrfs fi show
Label: none  uuid: 3e8973d3-83ce-4d93-8d50-2989c0be256a
        Total devices 1 FS bytes used 19.87GiB
        devid    1 size 45.00GiB used 21.03GiB path /dev/sda1

btrfs-progs v3.19.1

On 05/10/2015 04:37 PM, Philip Seeger wrote:
> [original message quoted in full; trimmed]

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-10 17:32 UTC (permalink / raw)
To: linux-btrfs

(Again, last message was rejected.)

Hi Richard,

Thank you for this tip; I hadn't noticed that btrfs-progs didn't match the kernel version. I've updated btrfs-progs (from the repository, not manually installed); btrfs --version now shows v4.0.

However, it seems strange to me that a bunch of files would be corrupted simply because btrfs-progs is older than the kernel.

To trigger more csum errors, I ran a script that basically finds all files and runs cat $file >/dev/null on each. I also scrubbed the filesystem.

It's getting worse. The number of corrupted files has grown to 79 - all in /home. Some of these files have not been modified in 3 years. I copied them into this Arch vm from another vm, which runs Fedora (kernel 3.19). The Fedora vm also uses btrfs, so it has the right checksums for all of those files. There are no csum errors in dmesg on that Fedora system. I've also started a scrub there, which has not generated any errors yet.

To be clear, we're talking about roughly 50k files (about 11 GB) that I copied onto this vm; I have used a handful of them and created <10.

So after copying a lot of files onto this Arch vm, many of them have been corrupted for unknown reasons (mostly old files, not changed on this Arch system).

Scrub:
# time btrfs scrub start -B / ; echo scrub $? done
scrub done for 3e8973d3-83ce-4d93-8d50-2989c0be256a
        scrub started at Sun May 10 17:47:34 2015 and finished after 427 seconds
        total bytes scrubbed: 19.87GiB with 21941 errors
        error details: csum=21941
        corrected errors: 0, uncorrectable errors: 21941, unverified errors: 0
ERROR: There are uncorrectable errors.

During the scrub, I also saw several of these:
[19935.898678] __readpage_endio_check: 14 callbacks suppressed

I have started another scrub (now with v4.0); I still get errors, but the affected file names are mentioned in dmesg, which is nice. Is there a btrfs status command that will list permanently damaged files as well (like zpool status -v), since dmesg will be empty after a reboot or crash?

I believe, thanks to Richard, I can now answer my second question: the old version 3.19 failed to increase the error counter(s) in dev stats, but this is apparently fixed in 4.0 (so a monitoring job would now be able to notify an admin):

$ sudo btrfs dev stats / | grep -v 0
[/dev/sda1].corruption_errs 43882

Thanks
Philip
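
(PS, regarding the persistent-listing question above: a cron-able stopgap sketch - assuming a persistent systemd journal and the inspect-internal subcommand - would be to harvest the inodes from the logged warnings:

$ journalctl -k | grep -Eo 'csum failed ino [0-9]+' | awk '{print $4}' \
      | sort -u | xargs -I{} sudo btrfs inspect-internal inode-resolve {} /

This only reaches as far back as the journal does, so it complements rather than replaces a real status command.)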

On 05/10/2015 05:33 PM, Richard Michael wrote:
> Hi Philip,
>
> Have you tried latest btrfs-progs?
>
> The progs release version has sync'd up with the kernel version, so
> your kernel v4.0.1 with progs v3.19.1 could be taken as a "mismatch".
>
> I haven't read the progs v3.19.1..v4.0 commit diff, and the wiki
> doesn't mention csum fixes/work related to corruption, but, in your
> situation, I'd probably try out v4.0 progs to be sure.
>
> https://btrfs.wiki.kernel.org/index.php/Main_Page#News
>
> Sorry I don't have more than this to offer.
>
> Regards,
> Richard
>
> [nested quote of earlier messages trimmed]

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Russell Coker @ 2015-05-11  1:41 UTC (permalink / raw)
To: Philip Seeger; +Cc: linux-btrfs

On Mon, 11 May 2015 03:32:45 AM Philip Seeger wrote:
> However, it seems strange to me that a bunch of files would be corrupted
> simply because btrfs-progs is older than the kernel.

That won't happen.

> To trigger more csum errors, I ran a script that basically finds all
> files and runs cat $file >/dev/null on each. I also scrubbed the
> filesystem. It's getting worse. The number of corrupted files has grown
> to 79 - all in /home. Some of these files have not been modified in 3
> years.

Sounds like you are having errors in your RAM, CPU, motherboard, or hard drive cabling. Turn the machine off ASAP and plug the disks into a different system; if you keep it running you will make it worse.

--
My Main Blog        http://etbe.coker.com.au/
My Documents Blog   http://doc.coker.com.au/

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-12  0:14 UTC (permalink / raw)
To: linux-btrfs

> Sounds like you are having errors in your RAM, CPU, motherboard, or hard
> drive cabling. Turn the machine off ASAP and plug the disks into a
> different system; if you keep it running you will make it worse.

I know it sounds like it, but the host is fine. The host filesystem (on which the vm virtual hdd resides) is healthy. Other vms are running on the same host, no problems there. Just to be sure, I will run memtest, but I'm pretty sure that's not it. The system is under high load a lot, but I don't think btrfs would fail because of a slow system.

So I have deleted all those corrupted files in this Arch vm, ran a scrub, 0 errors, all fixed. I restored them, fixed some other things, and now I get checksum errors again. Interestingly, it looks like the corruption is not happening randomly, because the same sqlite files are affected under ~/.mozilla/ and exactly one library file (ghostscript).

Meanwhile, other vms (not Arch but Fedora and Debian) are running without a problem (one of them using btrfs as well).

Is it possible that systemd isn't unmounting the filesystem properly, so it gets corrupted on shutdown? (Just a wild guess.) Although I'm not sure if all this happened between reboots.

Philip

* RE: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Paul Jones @ 2015-05-12  1:04 UTC (permalink / raw)
To: Philip Seeger, linux-btrfs@vger.kernel.org

> From: Philip Seeger
> Sent: Tuesday, 12 May 2015 10:15 AM
>
> [quoted message trimmed; see above]
>
> Is it possible that systemd isn't unmounting the filesystem properly, so
> it gets corrupted on shutdown? (Just a wild guess.) Although I'm not sure
> if all this happened between reboots.

Are you using KVM with some form of disk caching? I had a Windows vm that was constantly creating errors on the host filesystem (btrfs) somewhere within the disk image. I changed the caching option (I can't remember from/to what) and it fixed the error. It didn't seem to be causing any errors on the Windows guest, but it's Windows so you never know :)
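
(For the KVM case, the cache mode is a per-disk setting on the qemu command line; an illustrative invocation, with the image path and the writeback value as examples only, not a recommendation:

$ qemu-system-x86_64 ... \
      -drive file=/var/lib/libvirt/images/guest.qcow2,format=qcow2,cache=writeback

The recognized cache modes are none, writeback, writethrough, directsync and unsafe.)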
Paul.


* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Chris Murphy @ 2015-05-12  1:37 UTC (permalink / raw)
To: linux-btrfs@vger.kernel.org

There are two file systems involved, guest and host. What are their file systems? I know one of them is Btrfs, but I can't tell if they're both Btrfs.

There is a regression somewhere, I don't know where yet, when libvirt cache=none or cache=directsync and the disk image (qcow2 in my case) is on Btrfs. The guest file system doesn't matter; it'll eventually spew some corruption-related errors.
https://bugzilla.redhat.com/show_bug.cgi?id=1204569

Chris Murphy

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-15 18:40 UTC (permalink / raw)
To: linux-btrfs

The host filesystem (where the virtual hdd is stored) is ZFS (no errors). It's a bit slower recently because I am moving a lot of data around while other vms are running, but that shouldn't be a problem.

This Arch vm appears to be the only one having problems. Again, I have even scrubbed one other vm, which also uses btrfs (but 3.19, not 4.0), and found no errors. This is the vm from which I copied the 11 GB directory that now keeps getting corrupted in the Arch vm.

I'd just like to repeat that I copied one directory from an older vm, which uses btrfs, to a new Arch vm, which also uses btrfs, and while that older vm has worked fine (still does), the same files in that copied directory are getting corrupted in the new Arch vm...

The bug report mentions Gnome Boxes, but I'm using VirtualBox. I don't know if both are affected in the same way. If it makes a difference: the Arch vm has an IDE controller (host cache enabled) for the virtual optical drive and one SATA controller (host cache disabled) for the virtual hard drive.

I will try to get this vm into a healthy state again by deleting all affected files and copying them back again. Also, should I try a specific mount option, maybe a lower commit interval?
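
(For reference, the interval in question is the commit= mount option; btrfs commits a transaction every 30 seconds by default. A hedged example that shortens it to 5 seconds:

$ sudo mount -o remount,commit=5 /

Whether this would help here is an open question; it only bounds how long dirty data stays unflushed.)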
Philip

On 05/12/2015 03:37 AM, Chris Murphy wrote:
> [quoted above; trimmed]


* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-15 18:33 UTC (permalink / raw)
To: linux-btrfs

The SATA controller used for the virtual hard drive of this vm does not have host caching enabled (checkbox not checked). So, no, VirtualBox should not be using any form of disk caching. Also, there is no Windows involved.

Philip

On 05/12/2015 03:04 AM, Paul Jones wrote:
> [quoted above; trimmed]

* Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-17  1:53 UTC (permalink / raw)
To: linux-btrfs

I deleted and restored the corrupted files again. One of those files (not a new one) got corrupted again.

I think I forgot to mention that this btrfs filesystem was converted from ext4 (not initially created as btrfs). Could this cause the corruption?

Also, does this df output look weird to anyone? Shouldn't metadata be duplicated?
# btrfs fi df /
Data, single: total=21.00GiB, used=20.82GiB
System, single: total=32.00MiB, used=4.00KiB
Metadata, single: total=1.25GiB, used=901.21MiB
GlobalReserve, single: total=304.00MiB, used=0.00B
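
(The usual fix would be a balance with a convert filter; a minimal sketch, assuming balance-convert works on the running kernel, which the replies below show is exactly the catch:

$ sudo btrfs balance start -mconvert=dup /

On this kernel the command completes without actually converting, as discussed below.)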

Philip

On 05/10/2015 04:58 PM, Philip Seeger wrote:
> [earlier messages quoted in full; trimmed]

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Duncan @ 2015-05-17  8:19 UTC (permalink / raw)
To: linux-btrfs

Philip Seeger posted on Sun, 17 May 2015 03:53:20 +0200 as excerpted:

> I think I forgot to mention that this btrfs filesystem was converted
> from ext4 (not initially created as btrfs). Could this cause the
> corruption?
>
> Also, does this df output look weird to anyone? Shouldn't metadata be
> duplicated?
> # btrfs fi df /
> Data, single: total=21.00GiB, used=20.82GiB
> System, single: total=32.00MiB, used=4.00KiB
> Metadata, single: total=1.25GiB, used=901.21MiB
> GlobalReserve, single: total=304.00MiB, used=0.00B

[Reordered to standard quote/reply order, so replies have proper context. Top posting... not so fun to reply to! =:^( ]

I can't answer the corruption bit, but answering the df metadata question...

Normally, btrfs on a single device defaults to dup metadata, single data. The one /normal/ exception is when mkfs.btrfs detects an ssd, where metadata defaults to single as well, due to ssd firmware often canceling out the intended redundancy of dup anyway.[1]

However, conversion from ext* is a bit of a different ball game, and while it /should/ end up with dup metadata as well, on 4.0 and into the 4.1-rcs there's a balance-conversion bug (a proper fix hasn't been posted) that keeps type conversion from occurring, both in the normal btrfs balance convert case and in the ext* conversion case. Thus ext* conversions remain metadata-single and cannot be converted to metadata-dup until this bug is fixed.

I said a /proper/ fix hasn't yet been posted. The bug has been bisected to the commit that killed balance-convert, and that commit can be reverted, as I guess some distros are doing in their current releases. However, that commit happened to fix an ext* to btrfs conversion fault that would cause ext* conversions to fail entirely. So reverting it fixes normal btrfs balance conversions, but breaks the ability to convert from ext* at all. I don't know when /that/ was broken, but apparently it was further back.

So right now, the only way to get a desired btrfs chunk redundancy type is to use mkfs.btrfs to create it that way in the first place. Which means no ext* conversion unless you're happy with single data/single metadata, since that's what it ends up with, and balance-convert is currently broken and can't convert to other redundancy types.

Well, unless you want to do the ext* to btrfs convert with the current tools as they are (with the commit in question, so the ext* conversion actually works), then rebuild with that commit reverted, so balance-convert works...

Chris Mason has stated he has what he believes to be the correct fix in his head, but he hasn't posted it yet. Either it turned out to have other problems, or he simply hasn't had time to write it out and properly test that it /doesn't/ have other problems.
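
(Concretely, the mkfs route is just the profile flags; a sketch, device path illustrative:

$ mkfs.btrfs -m dup -d single /dev/sdb1

-m selects the metadata profile and -d the data profile; on a single rotating device, -m dup is what mkfs.btrfs would pick by default anyway.)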
Either way, as I said above, until that patch appears, the only /current/ way (other than jumping thru rebuild-and-revert hoops) to get anything other than single data/metadata for data that's currently on ext4 is to either back it up or use it as a backup, create a /new/ btrfs with the intended chunk redundancy layout using mkfs.btrfs, mount it, and copy the data in from that backup.

---
[1] Ssd firmware canceling out dup redundancy: this can happen in two ways. First, some common ssd firmware (sandforce, IIRC, perhaps others) does its own dedup, such that two identical copies only get written once anyway, directly canceling out the benefit of filesystem dup. Second, even for firmware that actually writes two copies, because they are written one right after the other they may well land in the same erase block, and since ssds normally fail whole erase blocks at a time (or very close to it), dup won't provide the intended redundancy anyway. Thus on ssds one really needs two physically separate devices in raid1 mode to provide the redundancy that single-device dup is intended to provide. Some ssds /may/ preserve dup protection as intended, but it's sufficiently unreliable on available ssds that simply defaulting to single, and not pretending otherwise, was seen as the wiser path - particularly since users can still specify dup at mkfs.btrfs time, or (normally, when balance-convert is working) convert to it later if necessary.

--
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Omar Sandoval @ 2015-05-17  8:36 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs

On Sun, May 17, 2015 at 08:19:48AM +0000, Duncan wrote:
> [...]
> So reverting it fixes normal btrfs balance conversions, but breaks the
> ability to convert from ext* at all.
> [...]

Duncan is referring to the commit reverted here:
https://patchwork.kernel.org/patch/6238111/

Just to clarify, reverting 2f0810880f082fa8ba66ab2c33b02e4ff9770a5e does not break ext4 conversion. If you revert it, you can btrfs-convert, do a btrfs balance to finalize the conversion, then do another btrfs balance -dconvert=... -mconvert=... to convert the profile.

I should have been clearer in the other thread: conversion from ext4 to Btrfs works; it's just that the commit that caused the regression did not actually accomplish what it set out to do, which was to allow converting the data/metadata profile of a freshly btrfs-converted ext4 filesystem.
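
(Spelled out, that sequence on a kernel with the revert applied would be roughly the following; device path and target profiles are illustrative:

# btrfs-convert /dev/sdb1
# mount /dev/sdb1 /mnt
# btrfs balance start /mnt
# btrfs balance start -dconvert=single -mconvert=dup /mnt

The first balance finalizes the conversion; the second one changes the profiles.)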
--
Omar


* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Duncan @ 2015-05-17  8:57 UTC (permalink / raw)
To: linux-btrfs

Omar Sandoval posted on Sun, 17 May 2015 01:36:54 -0700 as excerpted:

> I should have been clearer in the other thread: conversion from ext4 to
> Btrfs works; it's just that the commit that caused the regression did
> not actually accomplish what it set out to do, which was to allow
> converting the data/metadata profile of a freshly btrfs-converted ext4
> filesystem.

Thank you.  Clearer now. =:^)

--
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-23 12:49 UTC (permalink / raw)
To: linux-btrfs

On Sun, 2015-05-17 at 08:19 +0000, Duncan wrote:
> [explanation of the metadata defaults and the balance-conversion bug
> quoted above; trimmed]

Thanks for the detailed explanation. I might take a look at the commit you're referring to when I have some more time.

For now, I simply used an older live system (3.16) to balance the filesystem (btrfs balance start), which worked. Before that, I deleted the corrupted files.

Arch after the balance:
# btrfs fi df /
Data, single: total=25.72GiB, used=19.47GiB
System, DUP: total=32.00MiB, used=12.00KiB
Metadata, DUP: total=1.25GiB, used=742.27MiB
GlobalReserve, single: total=248.00MiB, used=0.00B

I have copied my files back (the ones that kept getting corrupted in this vm) and the system is working fine now, no more corruption (so far). It seems the balance has fixed it.

Is this a known side effect, that files can get corrupted if no balance is run after an ext4 conversion (not counting the balance with 4.0, which doesn't work due to that commit)?

--
Philip

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Duncan @ 2015-05-23 16:52 UTC (permalink / raw)
To: linux-btrfs

Philip Seeger posted on Sat, 23 May 2015 14:49:50 +0200 as excerpted:

> Is this a known side effect, that files can get corrupted if no balance
> is run after an ext4 conversion (not counting the balance with 4.0,
> which doesn't work due to that commit)?

I'm not sure.

What I am sure of is that I'd not trust a btrfs converted from ext* until the saved subvol is deleted and a defrag and balance run. Even then, I'd personally be more comfortable with a fresh mkfs.btrfs and a copy over from backup, tho I know the reality is that btrfs /needs/ a working conversion program or it'll never take off as the default successor to the ext* crown, as it wants to be. I simply don't trust that conversion, as I've seen too many people have problems with their btrfs after doing the conversion from ext*. Balance-conversions between raid modes of btrfs are a little different, and somewhat more trustworthy... to me, anyway.

To be fair, it might well be a personal bias of mine against ext* in the first place, as I never really was comfortable with it, for various reasons. Among others, I think enough kernel devs see ext* as simple enough to meddle with that it gets more changes than it really should, ext3's period with data=writeback as the default being a primary example. Reiserfs (my personal favorite), xfs, and now btrfs all seem to be different enough that the hacking is left to the folks who really know the filesystem, with others afraid to touch it, at least more so than with ext*. Anyway, it's quite possible I have enough of a bias there that it taints anything converted from ext* more than it should, as well.

Either way, I personally just don't trust ext* conversions, and would rather see people do a mkfs.btrfs and copy over from backup, as I think the filesystem is cleanly native btrfs that way and has fewer problems as a result. But you really need a second opinion before trusting /that/, because as I said, it might simply be my personal bias talking.

--
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
From: Philip Seeger @ 2015-05-27 20:25 UTC (permalink / raw)
To: linux-btrfs

On Sat, 2015-05-23 at 16:52 +0000, Duncan wrote:
> What I am sure of is that I'd not trust a btrfs converted from ext*
> until the saved subvol is deleted and a defrag and balance run.

I agree. I did delete the saved subvolume right away, but given that I effectively did not run a balance (due to this bug, the balance had no effect) and then had new files corrupted (repeatedly) as a consequence, this makes it pretty clear.

Maybe there should be a warning in the wiki ("run a complete balance before you start using the converted fs, otherwise your files might get corrupted")?

Though I'd be more interested in some details as to how this might have happened. It seems wrong that corruption occurs after a successful conversion (before a proper balance). Is there anyone else who has had this issue? Maybe someone who's converted to btrfs using a btrfs version with the balance bug?

Philip