2 errors when scrubbing - but I don't know what they mean

linux-btrfs.vger.kernel.org archive mirror

* 2 errors when scrubbing - but I don't know what they mean
@ 2013-11-28 20:36 Sebastian Ochmann
  2013-11-29  1:10 ` Duncan
  2013-11-29  5:51 ` Wang Shilong
  0 siblings, 2 replies; 9+ messages in thread
From: Sebastian Ochmann @ 2013-11-28 20:36 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

when I scrubbed one of my btrfs volumes today, the result of the scrub was:

total bytes scrubbed: 1.27TB with 2 errors
error details: super=2
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

and dmesg said:

btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 2

Can someone please enlighten me what these errors mean (especially the 
"super" and "gen" values)? As an additional info: The drive is sometimes 
used in a machine with kernel 3.11.6 and sometimes with 3.12.0, could 
this swapping explain the problem somehow?

Best regards
Sebastian


* Re: 2 errors when scrubbing - but I don't know what they mean
  2013-11-28 20:36 2 errors when scrubbing - but I don't know what they mean Sebastian Ochmann
@ 2013-11-29  1:10 ` Duncan
  2013-11-30 11:31   ` Sebastian Ochmann
  2013-11-29  5:51 ` Wang Shilong
  1 sibling, 1 reply; 9+ messages in thread
From: Duncan @ 2013-11-29  1:10 UTC (permalink / raw)
  To: linux-btrfs

Sebastian Ochmann posted on Thu, 28 Nov 2013 21:36:32 +0100 as excerpted:

> when I scrubbed one of my btrfs volumes today, the result of the scrub
> was:
> 
> total bytes scrubbed: 1.27TB with 2 errors
> error details: super=2
> corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
> 
> and dmesg said:
> 
> btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
> btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 2
> 
> Can someone please enlighten me what these errors mean (especially the
> "super" and "gen" values)? As an additional info: The drive is sometimes
> used in a machine with kernel 3.11.6 and sometimes with 3.12.0, could
> this swapping explain the problem somehow?

[Just an admin using/testing btrfs here; not a dev.]

Super=superblock.  I really can't say what errors registered as superblock 
errors might mean as I've never seen them here and haven't chanced across 
an explanation on-list or on the wiki, but were I seeing that here, my 
approach would be to try the scrub again and hope the errors were fixed 
(tho I should mention that I'm on SSD with multiple independent rather 
small btrfs partitions, so scrubs take a couple minutes for my larger 
partitions, not the hours you're likely to see with multi-TB spinning 
rust, so rerunning a scrub is trivial, /here/!).  If that didn't catch 
them, then I'd try btrfsck (without --repair) and see if it had any 
further information to offer.  (Repair is a further step that I'd only 
take if necessary -- making sure I had a good backup first!)  There's 
also btrfs-show-super, which should be safe as it's read-only, simply 
displaying a lot of information, much of which probably won't make much 
sense except to a btrfs dev/expert (it's beyond me).

As for the dmesg output you quoted, if you compare your syslog times for 
the same messages, I suspect you'll find they were printed at filesystem 
mount time, NOT during the scrub, and are thus not directly related.
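One way to check when those counter lines were logged is to pull the timestamp and the "gen" value out of each matching kernel log line. The following is a minimal sketch, not an official tool: the function name bdev_gen_events is made up for illustration, and the pattern assumes timestamped lines in the 3.x-era message format quoted in this thread (newer kernels word the prefix differently).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of btrfs-progs): read kernel log lines on
# stdin and print "<seconds-since-boot> <device> gen=<count>" for every
# btrfs per-device error-counter line that carries a timestamp.
bdev_gen_events() {
    sed -n 's/^\[ *\([0-9.]*\)].*bdev \([^ ]*\) errs: .*gen \([0-9][0-9]*\).*$/\1 \2 gen=\3/p'
}

# Typical use (reading the kernel log may require root on some systems):
#   dmesg | bdev_gen_events
```

Timestamps within a few seconds of the mount point to the mount-time report; timestamps that line up with the start of a scrub point to the scrub itself.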

What the dmesg output IS directly related to is the output of btrfs 
device stat.  The first thing to note about it is that errors reported 
are cumulative, only being reset if its -z option is used.  Thus, stats 
let you track whether the number of errors is rising, but unless you 
reset stats (using btrfs dev stat -z) after your last scrub, they'll 
still reflect historical errors that have already been corrected -- 
errors reported at mount time and by device stat reflect historical 
status and do NOT necessarily reflect *CURRENT* errors.
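Because the counters are cumulative, it can help to list only the non-zero ones before deciding whether to reset them. A small sketch, assuming the "[device].counter value" line format that btrfs device stats prints; the helper name nonzero_dev_stats and the /mnt mount point are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read `btrfs device stats` output on stdin and print
# only the counters whose value is not zero, as "<counter>=<value>".
nonzero_dev_stats() {
    awk 'NF == 2 && $2 + 0 != 0 { print $1 "=" $2 }'
}

# Typical use (example mount point; usually needs root):
#   btrfs device stats /mnt | nonzero_dev_stats
# Printing the counters and resetting them afterwards, as described above:
#   btrfs device stats -z /mnt
```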

As with the superblock errors, I've not actually seen generation errors 
here, so I don't know whether they're the superblock errors scrub is 
reporting or are different.  Similarly, I don't know what fixes them.

What I /have/ seen here are read_ and write_io_errs (as reported by stat, 
simply wr/rd as reported by the kernel at mount time), due to bad 
shutdowns (well, suspend-to-ram that didn't resume properly).  I know 
scrub can and does recover those, provided it has a second copy to 
recover from, as it does here since (with the exception of /boot) all my 
btrfs filesystems are btrfs raid1 mode, both data and metadata, across 
two SSDs.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: 2 errors when scrubbing - but I don't know what they mean
  2013-11-28 20:36 2 errors when scrubbing - but I don't know what they mean Sebastian Ochmann
  2013-11-29  1:10 ` Duncan
@ 2013-11-29  5:51 ` Wang Shilong
  1 sibling, 0 replies; 9+ messages in thread
From: Wang Shilong @ 2013-11-29  5:51 UTC (permalink / raw)
  To: Sebastian Ochmann; +Cc: linux-btrfs

Hi,

On 11/29/2013 04:36 AM, Sebastian Ochmann wrote:
> Hello everyone,
>
> when I scrubbed one of my btrfs volumes today, the result of the scrub 
> was:
>
> total bytes scrubbed: 1.27TB with 2 errors
> error details: super=2
> corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
Here a super error means a superblock checksum mismatch. Scrub only
reports superblock errors; it doesn't try to fix them.

Maybe this is just a read error. In any case, the superblocks will be
rewritten when the next transaction is committed.
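
For scripting around this, the "super=" count can be pulled out of the scrub summary. A rough sketch, assuming the summary format quoted above; scrub_super_errors is a made-up helper name, and it prints 0 when no "super=" field is present.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a scrub summary on stdin and print the number
# that follows "super=" on the "error details:" line (0 if there is none).
scrub_super_errors() {
    awk '/^[[:space:]]*error details:/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^super=[0-9]+$/) { sub(/^super=/, "", $i); n = $i }
         }
         END { print n + 0 }'
}

# Typical use (example mount point; needs root):
#   btrfs scrub start -B /mnt | scrub_super_errors
```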

Thanks,
Wang
>
> and dmesg said:
>
> btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
> btrfs: bdev /dev/mapper/tray errs: wr 0, rd 0, flush 0, corrupt 0, gen 2
>
> Can someone please enlighten me what these errors mean (especially the 
> "super" and "gen" values)? As an additional info: The drive is 
> sometimes used in a machine with kernel 3.11.6 and sometimes with 
> 3.12.0, could this swapping explain the problem somehow?
>
> Best regards
> Sebastian
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* Re: 2 errors when scrubbing - but I don't know what they mean
  2013-11-29  1:10 ` Duncan
@ 2013-11-30 11:31   ` Sebastian Ochmann
       [not found]     ` <CAP9B-Q=Y+uY2kErYb1ZKMsvFrbYidmGpPnUbHm8iApj7v6wK+w@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Sebastian Ochmann @ 2013-11-30 11:31 UTC (permalink / raw)
  To: 1i5t5.duncan; +Cc: linux-btrfs

Hello,

thank you for your input. I didn't know that btrfs keeps the error 
counters over mounts/reboots, but that's nice.

I'm still trying to figure out how such a generation error may occur in 
the first place. One thing I noticed looking at the btrfs code is that 
the generation error counter will only get incremented in the actual 
scrubbing code (either in "scrub_checksum_super" or in 
"scrub_handle_errored_block", both in scrub.c - please correct me if I'm 
wrong, I'm not a btrfs dev). Also, the dmesg errors I saw were not there 
at boot time, but about 10 minutes after boot which was about the time 
when I started the scrub so I'm pretty sure that it was the scrub that 
detected the errors.

The question remains what can cause superblock/gen errors. Sure it could 
be "some" read error, but I'd really like to make sure that it's not a 
systematic error. I wasn't able to reproduce it yet though.

Best
Sebastian


* Fwd: 2 errors when scrubbing - but I don't know what they mean
       [not found]     ` <CAP9B-Q=Y+uY2kErYb1ZKMsvFrbYidmGpPnUbHm8iApj7v6wK+w@mail.gmail.com>
@ 2013-12-01  1:16       ` Shilong Wang
  2013-12-01 20:45       ` Sebastian Ochmann
  1 sibling, 0 replies; 9+ messages in thread
From: Shilong Wang @ 2013-12-01  1:16 UTC (permalink / raw)
  To: linux-btrfs

cc: linux-btrfs

---------- Forwarded message ----------
From: Shilong Wang <wangshilong1991@gmail.com>
Date: 2013/12/1
Subject: Re: 2 errors when scrubbing - but I don't know what they mean
To: Sebastian Ochmann <ochmann@informatik.uni-bonn.de>

Hello Sebastian,

2013/11/30 Sebastian Ochmann <ochmann@informatik.uni-bonn.de>:
> Hello,
>
> thank you for your input. I didn't know that btrfs keeps the error counters
> over mounts/reboots, but that's nice.
>
> I'm still trying to figure out how such a generation error may occur in the
> first place. One thing I noticed looking at the btrfs code is that the
> generation error counter will only get incremented in the actual scrubbing
> code (either in "scrub_checksum_super" or in "scrub_handle_errored_block",
> both in scrub.c - please correct me if I'm wrong, I'm not a btrfs dev).

Right. Scrub reads the superblocks with bios rather than through the page
cache, which means they are re-read from disk. If a checksum mismatch
happens, the possible reasons are:

1. A read error happens while scrubbing, although the superblocks on disk
   are actually good.
2. During the last transaction, silent corruption happened while the
   superblocks were being written to disk.
3. Some unexpected operation wrote data directly over a superblock, for
   example something like 'dd if=/dev/zero of=/dev/ seek=65536 count=4k'.

At boot time the superblock should be fine, because its checksum is
verified before the superblock is used; if the checksum doesn't match, we
refuse to mount. After mounting, the superblocks are cached in memory
until you unmount the filesystem.

So in the ideal case your disk is fine, the superblocks are rewritten
during the next transaction, and after the next unmount you can mount the
filesystem again successfully.

However, if you see such superblock checksum mismatches very often during
scrub, there may be something wrong with the disk!

> Also, the dmesg errors I saw were not there at boot time, but about 10
> minutes after boot which was about the time when I started the scrub so I'm
> pretty sure that it was the scrub that detected the errors.
>
> The question remains what can cause superblock/gen errors. Sure it could be
> "some" read error, but I'd really like to make sure that it's not a
> systematic error. I wasn't able to reproduce it yet though.

You can reproduce this by running 'dd if=/dev/zero of=/dev/sd*
seek=65536 count=4k' before scrubbing the btrfs filesystem.
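
For reference, the arithmetic behind the superblock locations: btrfs keeps up to three superblock copies, at byte offsets 64 KiB, 64 MiB and 256 GiB, each 4 KiB long; a copy that would lie beyond the end of the device is simply absent. The sketch below computes these offsets and the matching dd "seek=" value for a given block size. The function names are made up for illustration. Note that "seek=" counts output blocks, so with dd's default 512-byte block size the primary copy at byte 65536 corresponds to seek=128, and seek=65536 addresses byte 65536 only when the block size is 1.

```shell
#!/usr/bin/env bash
# Byte offset of superblock copy $1 (0, 1 or 2).
sb_offset() {
    if [ "$1" -eq 0 ]; then
        echo $((64 * 1024))
    else
        echo $((16384 << (12 * $1)))
    fi
}

# dd "seek=" value for copy $1 with block size $2 in bytes (default 512).
sb_seek_blocks() {
    echo $(( $(sb_offset "$1") / ${2:-512} ))
}

# Print a small table; this only does arithmetic and touches no device.
for i in 0 1 2; do
    printf 'copy %d: byte offset %d, dd seek (bs=512) %d\n' \
        "$i" "$(sb_offset "$i")" "$(sb_seek_blocks "$i")"
done
```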

Thanks,
Wang

* Re: 2 errors when scrubbing - but I don't know what they mean
       [not found]     ` <CAP9B-Q=Y+uY2kErYb1ZKMsvFrbYidmGpPnUbHm8iApj7v6wK+w@mail.gmail.com>
  2013-12-01  1:16       ` Fwd: " Shilong Wang
@ 2013-12-01 20:45       ` Sebastian Ochmann
  2013-12-02  1:30         ` Wang Shilong
  2013-12-02  9:21         ` Wang Shilong
  1 sibling, 2 replies; 9+ messages in thread
From: Sebastian Ochmann @ 2013-12-01 20:45 UTC (permalink / raw)
  To: Shilong Wang; +Cc: linux-btrfs

Hello,

 > However, if you find such superblocks checksum mismatch very often
 > during scrub, it maybe
 > there are something wrong with disk!

I'm sorry, but I don't think there's a problem with my disks because I 
was able to trigger the errors that increment the "gen" error counter 
during scrub on a completely different machine and drive today. I 
basically performed some I/O operations on a drive and scrubbed at the 
same time over and over again until I actually saw "super" errors during 
scrub. But the error is really hard to trigger. It seems to me like a 
race condition somewhere.

So I went a step further and tried to create a repro for this. It seems 
like I can trigger the errors now once every few minutes with the method 
described below, but sometimes it really takes a long time until the 
error pops up, so be patient when trying this...

For the repro:

I'm using a btrfs image in RAM for this for two reasons: I can scrub 
quickly over and over again and I can rule out hard drive errors. My 
machine has 32 GB of RAM, so that comes in handy here - if you try this 
on a physical drive, make sure to adjust some parameters, if necessary.

Create a tmpfs and a testing image, format as btrfs:

$ mkdir btrfstest
$ cd btrfstest/
$ mkdir tmp
$ mount -t tmpfs -o size=20G none tmp
$ dd if=/dev/zero of=tmp/vol bs=1G count=19
$ mkfs.btrfs tmp/vol
$ mkdir mnt
$ mount -o commit=1 tmp/vol mnt

Note the "commit=1" mount option. It's not strictly necessary, but I 
have the feeling it helps with triggering the problem...

So now we have a 19 GB btrfs filesystem in RAM, mounted in "mnt". What I 
did for performing some artificial I/O operations is to rm and cp a 
linux source tree over and over again. Suppose you have an unpacked 
linux source tree available in the "/somewhere/linux" directory (and 
you're using bash). We'll spawn some loops that keep the filesystem busy:

$ while true; do rm -fr mnt/a; sleep 1.0; cp -R /somewhere/linux mnt/a; sleep 1.0; done
$ while true; do rm -fr mnt/b; sleep 1.1; cp -R /somewhere/linux mnt/b; sleep 1.1; done
$ while true; do rm -fr mnt/c; sleep 1.2; cp -R /somewhere/linux mnt/c; sleep 1.2; done

Now that the filesystem is busy, we'll also scrub it repeatedly (without 
backgrounding, -B):

$ while true; do btrfs scrub start -B mnt; sleep 0.5; done

On my machine and in RAM, each scrub takes 0-1 second and the "total 
bytes scrubbed" should fluctuate (seems to be especially true with 
commit=1, but not sure). Get a beverage of your choice and wait.
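
Since the error can take a long time to show up, a variant of the scrub loop that stops by itself on the first non-clean result may be convenient. This is only a sketch under assumptions: the function name scrub_until_error and the SCRUB_INTERVAL variable are made up here, and the check relies on the "with 0 errors" wording of the summary line shown in this thread (other btrfs-progs versions may word it differently, in which case the loop stops on the first scrub).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the given scrub command repeatedly and stop as
# soon as its output no longer contains "with 0 errors".
scrub_until_error() {
    # "$@": command that performs one foreground scrub and prints a summary
    local n=0 out
    while :; do
        n=$((n + 1))
        out=$("$@" 2>&1)
        case $out in
            *"with 0 errors"*) ;;   # clean run, keep going
            *) printf 'scrub #%d reported errors:\n%s\n' "$n" "$out"
               return 0 ;;
        esac
        sleep "${SCRUB_INTERVAL:-0.5}"
    done
}

# Typical use, from the btrfstest directory (needs root):
#   scrub_until_error btrfs scrub start -B mnt
```

Passing the scrub command as arguments keeps the loop itself free of any btrfs-specific calls, so it can be tried out with a stand-in command first.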

(about 10 minutes later)

When I was writing this repro it took about 10 minutes until scrub said:

   total bytes scrubbed: 1.20GB with 2 errors
   error details: super=2
   corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

and in dmesg:

   [15282.155170] btrfs: bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
   [15282.155176] btrfs: bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 0, gen 2

After that, scrub is happy again and will continue normally until the 
same errors happen again after a few hundred scrubs or so.

So all in all, the error can be triggered using normal I/O operations 
and scrubbing at the right moments, it seems. Even with a btrfs image in 
RAM, so no hard drive error is possible.

I hope someone can reproduce this and maybe debug it.

Best regards
Sebastian


* Re: 2 errors when scrubbing - but I don't know what they mean
  2013-12-01 20:45       ` Sebastian Ochmann
@ 2013-12-02  1:30         ` Wang Shilong
  2013-12-02  1:53           ` Wang Shilong
  2013-12-02  9:21         ` Wang Shilong
  1 sibling, 1 reply; 9+ messages in thread
From: Wang Shilong @ 2013-12-02  1:30 UTC (permalink / raw)
  To: Sebastian Ochmann; +Cc: Shilong Wang, linux-btrfs

On 12/02/2013 04:45 AM, Sebastian Ochmann wrote:
> Hope anyone can reproduce this and maybe debug it.
Let me have a look at this.

Thanks,
Wang

* Re: 2 errors when scrubbing - but I don't know what they mean
  2013-12-02  1:30         ` Wang Shilong
@ 2013-12-02  1:53           ` Wang Shilong
  0 siblings, 0 replies; 9+ messages in thread
From: Wang Shilong @ 2013-12-02  1:53 UTC (permalink / raw)
  To: Sebastian Ochmann; +Cc: Shilong Wang, linux-btrfs

On 12/02/2013 09:30 AM, Wang Shilong wrote:
> On 12/02/2013 04:45 AM, Sebastian Ochmann wrote:
>> Hope anyone can reproduce this and maybe debug it.
It seems this is a generation mismatch, not a checksum mismatch.

The story is that `tree log sync` currently flushes only the first
superblock, which causes a superblock generation mismatch when we scrub
the other two superblocks.

I will send a patch to fix this issue. Thanks for reporting!


Thanks,
Wang

* Re: 2 errors when scrubbing - but I don't know what they mean
  2013-12-01 20:45       ` Sebastian Ochmann
  2013-12-02  1:30         ` Wang Shilong
@ 2013-12-02  9:21         ` Wang Shilong
  1 sibling, 0 replies; 9+ messages in thread
From: Wang Shilong @ 2013-12-02  9:21 UTC (permalink / raw)
  To: Sebastian Ochmann; +Cc: Shilong Wang, linux-btrfs

Hi Sebastian,

On 12/02/2013 04:45 AM, Sebastian Ochmann wrote:
> Hello,
>
> > However, if you find such superblocks checksum mismatch very often
> > during scrub, it maybe
> > there are something wrong with disk!
>
> I'm sorry, but I don't think there's a problem with my disks because I 
> was able to trigger the errors that increment the "gen" error counter 
> during scrub on a completely different machine and drive today. I 
> basically performed some I/O operations on a drive and scrubbed at the 
> same time over and over again until I actually saw "super" errors 
> during scrub. But the error is really hard to trigger. It seems to me 
> like a race condition somewhere.
I am sorry, I tried to reproduce the problem following the steps you
described, but it hasn't shown up yet (I have run it for more than 6
hours). :-(
I took a careful look at the code.

A superblock generation mismatch can only be detected in
scrub_checksum_super(). It is reported when:
superblock generation != last_trans_committed.

The value of 'last_trans_committed' is only modified in one place
(committing a transaction). However, when committing a transaction we call
btrfs_scrub_pause() before changing last_trans_committed, which should
make it impossible for scrubbing and writing the supers to happen at the
same time. Otherwise, I must be missing something important here. :-)
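
The generation stored in each on-disk superblock copy can also be inspected from user space, which gives a rough, offline view of the mismatch discussed here. This is only a sketch: it assumes the per-copy "generation <N>" lines printed by btrfs-show-super -a (or btrfs inspect-internal dump-super -a on newer btrfs-progs), the helper name is made up, /dev/sdX is a placeholder, and comparing the copies with each other is not the same check as the kernel's comparison against last_trans_committed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read superblock dump text on stdin. Exit status 0 if
# the copies carry different "generation" values, 1 if they all agree (or
# no generation line is found). Fields such as chunk_root_generation are
# ignored because only an exact first field of "generation" is matched.
super_generations_differ() {
    awk '$1 == "generation" { seen[$2] = 1 }
         END { n = 0; for (g in seen) n++; exit !(n > 1) }'
}

# Typical use (needs root):
#   btrfs inspect-internal dump-super -a /dev/sdX | super_generations_differ \
#       && echo "superblock generations differ"
```

On a mounted, busy filesystem the copies may differ briefly while a commit is in flight, so a single positive result is a hint rather than proof of a problem.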

Would you please have a try with btrfs-next and see if the problem still 
exist in that branch:
https://git.kernel.org/cgit/linux/kernel/git/josef/btrfs-next.git/

Thanks,
Wang

end of thread, other threads:[~2013-12-02  9:21 UTC | newest]

Thread overview: 9+ messages
2013-11-28 20:36 2 errors when scrubbing - but I don't know what they mean Sebastian Ochmann
2013-11-29  1:10 ` Duncan
2013-11-30 11:31   ` Sebastian Ochmann
     [not found]     ` <CAP9B-Q=Y+uY2kErYb1ZKMsvFrbYidmGpPnUbHm8iApj7v6wK+w@mail.gmail.com>
2013-12-01  1:16       ` Fwd: " Shilong Wang
2013-12-01 20:45       ` Sebastian Ochmann
2013-12-02  1:30         ` Wang Shilong
2013-12-02  1:53           ` Wang Shilong
2013-12-02  9:21         ` Wang Shilong
2013-11-29  5:51 ` Wang Shilong
