Kernel 2.6.36 btrfs csum bugreport

linux-btrfs.vger.kernel.org archive mirror

* Kernel 2.6.36 btrfs csum bugreport
@ 2010-10-31 12:55 Andreas Bauer
  2010-10-31 22:15 ` cwillu
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Bauer @ 2010-10-31 12:55 UTC (permalink / raw)
  To: linux-btrfs

Hi everybody,

Today while playing around with btrfs I uncovered what must be a bug in the btrfs checksum code. My kernel log received a couple of these messages with various ino and off numbers:

btrfs csum failed ino 5098 off 524288 csum 2981133980 private 959545494
[..]

This happens on reading from the btrfs filesystem.
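
To see at a glance which inodes are affected, the log lines can be tallied with standard tools. The following is only a minimal sketch: it assumes the messages have exactly the format shown above, and `csum_failures` and `csum_failures_per_inode` are hypothetical helper names, not part of btrfs or its utilities.

```shell
# Extract "inode offset" pairs from btrfs csum failure lines read on stdin.
# Assumes the message format "btrfs csum failed ino N off N csum N private N".
csum_failures() {
    sed -n 's/.*btrfs csum failed ino \([0-9][0-9]*\) off \([0-9][0-9]*\) .*/\1 \2/p'
}

# Count failures per inode and print "count inode", most affected first.
csum_failures_per_inode() {
    csum_failures | awk '{ n[$1]++ } END { for (i in n) print n[i], i }' | sort -rn
}
```

A typical invocation would be `dmesg | csum_failures_per_inode`, or the same pipeline fed from a saved syslog file.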

The funny thing is that the files are read correctly, as verified by md5sum. I have cross-checked this on another machine (with the same kernel and btrfs utils): same result. A full filesystem md5sum check showed no errors. The md5sums were, of course, computed before the data was copied to the btrfs.
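
The verification described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact commands that were run: it assumes GNU coreutils and findutils, and `record_md5` and `verify_md5` are hypothetical helper names.

```shell
# Record an md5 manifest for every regular file under a source tree.
# Paths are stored relative to the tree so the manifest can be replayed elsewhere.
# usage: record_md5 SRC_DIR MANIFEST
record_md5() {
    ( cd "$1" && find . -xdev -type f -print0 | sort -z | xargs -0 -r md5sum ) > "$2"
}

# Verify a (copied) tree against a manifest. Only mismatches are printed;
# the exit status is non-zero if any file fails verification.
# usage: verify_md5 DIR MANIFEST
verify_md5() {
    ( cd "$1" && md5sum --quiet -c - ) < "$2"
}
```

The manifest would be recorded on the original data before copying, then replayed against the btrfs mount point, for example `verify_md5 /whatever/mountpoint manifest.md5`.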

So I conclude that these messages are faulty because the data is read correctly. In addition, when you have more than one btrfs you cannot tell from the message which fs it is referring to.

Here is my setup, maybe it has something to do with the (nowadays) unusual kernel target:

- unmodified upstream 2.6.36 kernel
- Debian Squeeze
- Standard Debian gcc 4.3.5 with target i486
- CPU AMD Geode LX800 on ALIX board
- btrfs on USB-ATA connected IDE drive Seagate Barracuda 7200.8 ST3400832A
- btrfs utils v0.19
- about 300GB of data of all sorts in 50000+ files on the fs
- the csum errors occur on read, while the data is being rsynced to another btrfs volume of 1TB

I hope some of this information rings a bell for someone. If so, please let me know ;)

bye, Andreas


* Re: Kernel 2.6.36 btrfs csum bugreport
  2010-10-31 12:55 Andreas Bauer
@ 2010-10-31 22:15 ` cwillu
  0 siblings, 0 replies; 7+ messages in thread
From: cwillu @ 2010-10-31 22:15 UTC (permalink / raw)
  Cc: linux-btrfs

> Today while playing around with btrfs I uncovered what must be a bug in the btrfs checksum code. My kernel log received a couple of these messages with various ino and off numbers:
>
> btrfs csum failed ino 5098 off 524288 csum 2981133980 private 959545494
> [..]
>
> This happens on reading from the btrfs filesystem.
>
> The funny thing is that the files are read correctly, as verified by md5sum. I have cross-checked this on another machine (with the same kernel and btrfs utils): same result. A full filesystem md5sum check showed no errors. The md5sums were, of course, computed before the data was copied to the btrfs.
>
> So I conclude that these messages are faulty because the data is read correctly. In addition, when you have more than one btrfs you cannot tell from the message which fs it is referring to.

Is this a raid1 or a dup array?


* RE: Kernel 2.6.36 btrfs csum bugreport
@ 2010-11-01  0:35 Andreas Bauer
  2010-11-01 10:55 ` Daniel J Blueman
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Bauer @ 2010-11-01  0:35 UTC (permalink / raw)
  To: cwillu; +Cc: linux-btrfs

>> So I conclude that these messages are faulty because data is read correctly.
>> In addition, when you have more than one btrfs you cannot see from the message
>> which fs it is referring to.
>
> Is this a raid1 or a dup array?

No, it is a plain vanilla partition on a physical hard disk. The filesystem was created with the command "mkfs.btrfs /dev/sdc1", with no extra arguments.

bye, A.B.


* RE: Kernel 2.6.36 btrfs csum bugreport
@ 2010-11-01 10:39 Andreas Bauer
  0 siblings, 0 replies; 7+ messages in thread
From: Andreas Bauer @ 2010-11-01 10:39 UTC (permalink / raw)
  To: linux-btrfs

To follow up on this matter, I have created another two btrfs volumes (also plain - no options - also on two external USB-SATA disks), and am at the moment copying heaps of data between these two. No errors as of yet. All copies are verified by md5sum after the deed.

The volume in question can still "reliably" reproduce the csum errors on read, though. Approx. 30 csum errors occur when the whole fs is read. The data is still fine. I can put it aside for further debugging until Wednesday morning at the latest.

If someone wants me to run diagnostics on it, please let me know. I am glad to be of help (until Wednesday morning).

Andreas


* Re: Kernel 2.6.36 btrfs csum bugreport
  2010-11-01  0:35 Andreas Bauer
@ 2010-11-01 10:55 ` Daniel J Blueman
  2010-11-01 11:02   ` cwillu
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel J Blueman @ 2010-11-01 10:55 UTC (permalink / raw)
  To: Andreas Bauer; +Cc: cwillu, linux-btrfs

On 1 November 2010 00:35, Andreas Bauer <ab@voltage.de> wrote:
> So I conclude that these messages are faulty because data is read correctly.
> In addition, when you have more than one btrfs you cannot see from the message
> which fs it is referring to.
>
> Is this a raid1 or a dup array?
>
> No, plain vanilla partition on physical hard disk. Btrfs was made with the command "mkfs.btrfs /dev/sdc1" no extra arguments.

By default, metadata is duplicated, thus it could be that BTRFS is
using the correct copy of the metadata after finding checksum errors
in the first copy.

Daniel
--
Daniel J Blueman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Kernel 2.6.36 btrfs csum bugreport
  2010-11-01 10:55 ` Daniel J Blueman
@ 2010-11-01 11:02   ` cwillu
  0 siblings, 0 replies; 7+ messages in thread
From: cwillu @ 2010-11-01 11:02 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Andreas Bauer, linux-btrfs

On Mon, Nov 1, 2010 at 4:55 AM, Daniel J Blueman
<daniel.blueman@gmail.com> wrote:
> On 1 November 2010 00:35, Andreas Bauer <ab@voltage.de> wrote:
>> So I conclude that these messages are faulty because data is read correctly.
>> In addition, when you have more than one btrfs you cannot see from the message
>> which fs it is referring to.
>>
>> Is this a raid1 or a dup array?
>>
>> No, plain vanilla partition on physical hard disk. Btrfs was made with the command "mkfs.btrfs /dev/sdc1" no extra arguments.
>
> By default, metadata is duplicated, thus it could be that BTRFS is
> using the correct copy of the metadata after finding checksum errors
> in the first copy.

Ahhhhhhh, and that makes this make sense:

Andreas, have you checked which file(s) are giving the errors?  if
not, you can use "find /whatever/mountpoint -xdev -inum 5098 -print"
to get the filename.  And I would bet that it's small enough that it's
being inlined into the metadata block group, and therefore covered
under the default "dup" profile of that block group, which is why
you're getting the actual file data back.
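
For more than a handful of inodes, the lookup can be wrapped in a small loop that also prints each file's size, which makes it easy to judge whether the affected files are small. This is a rough sketch assuming GNU find (for `-printf`); `inodes_to_files` is a hypothetical helper name.

```shell
# For each inode number read on stdin (one per line), print
# "inode size_in_bytes path" for the matching file(s) under a mount point.
# usage: ... | inodes_to_files MOUNTPOINT
inodes_to_files() {
    while read -r ino; do
        [ -n "$ino" ] || continue   # skip empty lines
        find "$1" -xdev -inum "$ino" -printf '%i %s %p\n'
    done
}
```

For example, `echo 5098 | inodes_to_files /whatever/mountpoint`. Each inode triggers a full walk of the filesystem, so this is slow on large trees, but it keeps the sketch simple.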


* RE: Kernel 2.6.36 btrfs csum bugreport
@ 2010-11-07 15:37 Andreas Bauer
  0 siblings, 0 replies; 7+ messages in thread
From: Andreas Bauer @ 2010-11-07 15:37 UTC (permalink / raw)
  To: linux-btrfs

On Mon, Nov 01, 2010 at 12:02:10PM CET, cwillu wrote:

> Ahhhhhhh, and that makes this make sense:
>
> Andreas, have you checked which file(s) are giving the errors?  if
> not, you can use "find /whatever/mountpoint -xdev -inum 5098 -print"
> to get the filename.  And I would bet that it's small enough that it's
> being inlined into the metadata block group, and therefore covered
> under the default "dup" profile of that block group, which is why
> you're getting the actual file data back.

Sorry to disappoint: the affected files range from big (8 GB) to small. I took
the opportunity to compare the syslog from both machines I tested on,
and the csum ino and off values are completely different in each case.

The filesystem which showed this behaviour has now been destroyed, and
in further testing I wasn't able to reproduce the bug.

To summarize:

- a btrfs about 400GB in size showed several csum errors
on reading, while the data read was correct. The same thing happened
when the filesystem was mounted on another machine (same kernel).

- the errors could be consistently reproduced by reading enough data.

- about 60 to 120 csum errors occurred while reading about 250 GB of data.

- the csum errors hit different inodes each time (and on each run)
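
The comparison between runs can be made mechanical by intersecting the (inode, offset) pairs from two saved logs. The following is only a sketch, assuming the message format quoted at the start of this thread; `failure_pairs` and `common_failures` are hypothetical helper names.

```shell
# Print sorted, de-duplicated "inode offset" pairs from one saved log file.
failure_pairs() {
    sed -n 's/.*btrfs csum failed ino \([0-9][0-9]*\) off \([0-9][0-9]*\) .*/\1 \2/p' "$1" | sort -u
}

# Print the pairs that appear in both logs; empty output means no overlap.
# usage: common_failures LOG_A LOG_B
common_failures() {
    a=$(mktemp) && b=$(mktemp) || return 1
    failure_pairs "$1" > "$a"
    failure_pairs "$2" > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

Empty output from `common_failures run1.log run2.log` would match the observation above that each run hits different inodes and offsets.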

As I don't have enough time at the moment to familiarize myself with
the btrfs code, I have to let go of this issue at this point. Thank 
you for your work.

-- A.B.
