Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats

From: Philip Seeger <p0h0i0l0i0p@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: Got 10 csum errors according to dmesg but 0 errors according to dev stats
Date: Sun, 10 May 2015 19:32:45 +0200
Message-ID: <554F963D.2040209@googlemail.com>
In-Reply-To: <CABR0jERqzkdTJxX_1S5WEZHDzX8=O8P7r+Bk0mesPLsR2n=w8A@mail.gmail.com>

(Again, last message was rejected.)

Hi Richard,

Thank you for the tip; I hadn't noticed that btrfs-progs didn't match
the kernel version.
I've updated btrfs-progs (from the repository, not manually installed);
btrfs --version now shows v4.0.

However, it seems strange to me that a bunch of files would be corrupted
simply because btrfs-progs is older than the kernel.
To trigger more csum errors, I ran a script that basically finds all
files and runs cat $file >/dev/null. I also scrubbed the filesystem.
It's getting worse. The number of corrupted files has grown to 79, all
in /home. Some of these files have not been modified in 3 years. I
copied them into this Arch vm from another vm, which runs Fedora (kernel
3.19). The Fedora vm also uses btrfs, so it has the right checksums for
all of those files. There are no csum errors in dmesg on that Fedora
system. I've also started a scrub there, which has not reported any
errors yet. To be clear, we're talking about roughly 50k files (about
11 GB) that I copied onto this vm; I have used only a handful of them
and created fewer than 10.
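
The read-everything script mentioned above is not shown in this message. A minimal sketch of the idea, assuming POSIX sh, find and cat (the function name read_all is hypothetical): every regular file on one filesystem is read once, and the path of any file whose read fails is printed. On btrfs, a data checksum mismatch is expected to surface as a failed read with an I/O error.

```shell
# read_all DIR
# Read every regular file below DIR once and discard the data.
# Prints the path of each file that could not be read completely.
# (read_all is a hypothetical helper name, not part of btrfs-progs.)
read_all() {
    # -xdev stays on one filesystem; -type f skips devices, sockets, etc.
    # The inner sh loop receives batches of paths from find as "$@".
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            cat -- "$f" >/dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Example (run as root so that every file is readable):
#   read_all /home
```

Reading through find -exec rather than a for loop over command substitution keeps file names with spaces or newlines intact.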

So after copying a lot of files onto this Arch vm, many of them have 
been corrupted for unknown reasons (mostly old files, not changed on 
this Arch system).

Scrub:
# time btrfs scrub start -B / ; echo scrub $? done

scrub done for 3e8973d3-83ce-4d93-8d50-2989c0be256a
     scrub started at Sun May 10 17:47:34 2015 and finished after 427 seconds
     total bytes scrubbed: 19.87GiB with 21941 errors
     error details: csum=21941
     corrected errors: 0, uncorrectable errors: 21941, unverified errors: 0
ERROR: There are uncorrectable errors.

During the scrub, I also saw several of these:
[19935.898678] __readpage_endio_check: 14 callbacks suppressed

I have started another scrub (now with v4.0). I still get errors, but the
affected file names are now mentioned in dmesg, which is nice. Is there a
btrfs status command that will also list permanently damaged files
(like zpool status -v)? dmesg will be empty after a reboot or crash.
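
Until such a command is known, the inode numbers can at least be saved from the kernel log before it is lost. A minimal sketch, assuming the "csum failed ino N off M" message format quoted further down in this thread (the function name csum_inodes and the output path are hypothetical):

```shell
# csum_inodes
# Read kernel log lines on stdin and print each inode number that appears
# in a "csum failed ino N ..." warning, once, in ascending numeric order.
# (csum_inodes is a hypothetical helper name.)
csum_inodes() {
    sed -n 's/.*csum failed ino \([0-9][0-9]*\) .*/\1/p' | sort -nu
}

# Example: keep the list across reboots, then map inodes back to paths.
#   dmesg | csum_inodes > /var/tmp/csum-inodes.txt
#   while read -r ino; do find / -xdev -inum "$ino"; done < /var/tmp/csum-inodes.txt
```

This only records what the kernel has already logged; it is not a substitute for a status command that tracks damaged files itself.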

I believe, thanks to Richard, I can now answer my second question: The
old btrfs-progs version (3.19.1) failed to increase the error counter(s)
in dev stats, but this is apparently fixed in 4.0 (so a monitoring job
would now be able to notify an admin):
$ sudo btrfs dev stats / | grep -v 0
[/dev/sda1].corruption_errs 43882
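
One caveat for a monitoring job: grep -v 0 drops every line that contains the character 0 anywhere, so a counter of 10 or a device named /dev/sda10 would be filtered out as well. A minimal sketch that compares only the counter value, assuming the two-column output format shown above (the function name nonzero_stats is hypothetical):

```shell
# nonzero_stats
# Read `btrfs dev stats` output on stdin and print only the lines whose
# counter (the last field) is not zero.
# Exit status is 1 if any counter is nonzero, 0 otherwise.
# (nonzero_stats is a hypothetical helper name.)
nonzero_stats() {
    awk 'NF >= 2 && $NF + 0 != 0 { print; bad = 1 } END { exit bad }'
}

# Example for a cron job:
#   btrfs dev stats / | nonzero_stats || echo "btrfs device errors detected"
```

Because the check looks at the numeric value of the last field, zeros inside the device name or inside a larger count no longer hide a line.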



Thanks
Philip

On 05/10/2015 05:33 PM, Richard Michael wrote:
> Hi Philip,
>
> Have you tried latest btrfs-progs?
>
> The progs release version has sync'd up with the kernel version, so 
> your kernel v4.0.1 with progs v3.19.1 could be taken as a "mismatch".
>
> I haven't read the progs v3.19.1 to v4.0 commit diff, and the wiki 
> doesn't mention csum fixes/work related to corruption, but, in your 
> situation, I'd probably try out v4.0 progs to be sure.
>
> https://btrfs.wiki.kernel.org/index.php/Main_Page#News
>
> Sorry I don't have more than this to offer.
>
>
> Regards,
> Richard
>
>
> On Sun, May 10, 2015 at 10:58 AM, Philip Seeger <p0h0i0l0i0p@gmail.com> wrote:
>
>     Forgot to mention kernel version: Linux 4.0.1-1-ARCH
>
>     $ sudo btrfs fi show
>     Label: none  uuid: 3e8973d3-83ce-4d93-8d50-2989c0be256a
>         Total devices 1 FS bytes used 19.87GiB
>         devid    1 size 45.00GiB used 21.03GiB path /dev/sda1
>
>     btrfs-progs v3.19.1
>
>
>
>
>     On 05/10/2015 04:37 PM, Philip Seeger wrote:
>
>         I have installed a new virtual machine (VirtualBox) with Arch on btrfs
>         (just a root fs and swap partition, no other partitions).
>         I suddenly noticed 10 checksum errors in the kernel log:
>         $ dmesg | grep csum
>         [  736.283506] BTRFS warning (device sda1): csum failed ino 1704363 off 761856 csum 1145980813 expected csum 2566472073
>         [  736.283605] BTRFS warning (device sda1): csum failed ino 1704363 off 1146880 csum 1961240434 expected csum 2566472073
>         [  745.583064] BTRFS warning (device sda1): csum failed ino 1704346 off 393216 csum 4035064017 expected csum 2566472073
>         [  752.324899] BTRFS warning (device sda1): csum failed ino 1705927 off 2125824 csum 3638986839 expected csum 2566472073
>         [  752.333115] BTRFS warning (device sda1): csum failed ino 1705927 off 2588672 csum 176788087 expected csum 2566472073
>         [  752.333303] BTRFS warning (device sda1): csum failed ino 1705927 off 3276800 csum 1891435134 expected csum 2566472073
>         [  752.333397] BTRFS warning (device sda1): csum failed ino 1705927 off 3964928 csum 3304112727 expected csum 2566472073
>         [ 2761.889460] BTRFS warning (device sda1): csum failed ino 1705927 off 2125824 csum 3638986839 expected csum 2566472073
>         [ 9054.226022] BTRFS warning (device sda1): csum failed ino 1704363 off 761856 csum 1145980813 expected csum 2566472073
>         [ 9054.226106] BTRFS warning (device sda1): csum failed ino 1704363 off 1146880 csum 1961240434 expected csum 2566472073
>
>         This is a new vm, it hasn't crashed (which might have caused filesystem
>         corruption). The virtual disk is on a RAID storage on the host, which is
>         healthy. All corrupted files are Firefox data files:
>         $ dmesg | grep csum | grep -Eo 'csum failed ino [0-9]* ' | awk '{print $4}' | xargs -I{} find -inum {}
>         ./.mozilla/firefox/nfh217zw.default/cookies.sqlite
>         ./.mozilla/firefox/nfh217zw.default/cookies.sqlite
>         ./.mozilla/firefox/nfh217zw.default/webappsstore.sqlite
>         ./.mozilla/firefox/nfh217zw.default/places.sqlite
>         ./.mozilla/firefox/nfh217zw.default/places.sqlite
>         ./.mozilla/firefox/nfh217zw.default/places.sqlite
>         ./.mozilla/firefox/nfh217zw.default/places.sqlite
>         ./.mozilla/firefox/nfh217zw.default/places.sqlite
>         ./.mozilla/firefox/nfh217zw.default/cookies.sqlite
>         ./.mozilla/firefox/nfh217zw.default/cookies.sqlite
>
>         How could this possibly happen?
>
>         And more importantly: Why doesn't the btrfs stat(u)s output tell me that
>         errors have occurred?
>         $ sudo btrfs dev stats /
>         [/dev/sda1].write_io_errs   0
>         [/dev/sda1].read_io_errs    0
>         [/dev/sda1].flush_io_errs   0
>         [/dev/sda1].corruption_errs 0
>         [/dev/sda1].generation_errs 0
>
>         If the filesystem health was monitored using btrfs dev stats (cronjob)
>         (like checking a zpool using zpool status), the admin would not have
>         been notified:
>         $ sudo btrfs dev stats / | grep -v 0 -c
>         0
>
>         Is my understanding of the stats command wrong, does "corruption_errs"
>         not mean corruption errors?
>
>
>
>
>     -- 
>     Philip
