* btrfs csum failed
@ 2011-05-03 21:56 Martin Schitter
2011-05-04 0:28 ` Josef Bacik
2011-05-04 12:39 ` Chris Mason
0 siblings, 2 replies; 17+ messages in thread
From: Martin Schitter @ 2011-05-03 21:56 UTC (permalink / raw)
To: linux-btrfs
since my last debian kernel-update to 2.6.38-2-amd64 i got troubles with
csum failures. it's a volume full of huge kvm-images on md-RAID1 and
LVM, so i used the mount options: 'noatime,nodatasum' to maximize the
performance.
it happened two weeks ago for the first time, and now a kvm-image
isn't readable again. i have to use an older snapshot to substitute the
virtual machine.
these are the entries in dmesg/kernel-log on any access:
...
[2412668.409442] btrfs csum failed ino 258 off 2331529216 csum
3632892464 private 2115348581
...
it's a production machine, so i cannot run too many experiments on it.
do you see an obvious way to solve this problem?
thanks!
martin
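A minimal sketch of how the log line quoted in the report could be split into its numeric fields when collecting such messages. The regular expression only matches the layout shown above (other kernel versions may print it differently), the function name is hypothetical, and the 4 KiB block size used for the conversion is an assumption. The field names are taken verbatim from the message; the thread does not say how to interpret them.

```python
import re

# Matches the message layout quoted in the report; this is an assumption
# about the format, not a stable kernel interface.
CSUM_RE = re.compile(
    r"btrfs csum failed ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>\d+) private (?P<private>\d+)"
)


def parse_csum_failure(text):
    """Return the numeric fields of the first csum failure in text, or None."""
    # The log line may be wrapped across lines, so collapse whitespace first.
    flat = " ".join(text.split())
    match = CSUM_RE.search(flat)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = """[2412668.409442] btrfs csum failed ino 258 off 2331529216 csum
3632892464 private 2115348581"""

info = parse_csum_failure(sample)
print(info)
# File offset expressed as a 4 KiB block index (block size is an assumption).
print("block index:", info["off"] // 4096)
```

The inode number and offset identify which file and which part of it failed verification, which is the information needed to compare the affected region against a backup copy.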
* Re: btrfs csum failed
2011-05-03 21:56 btrfs csum failed Martin Schitter
@ 2011-05-04 0:28 ` Josef Bacik
2011-05-04 0:44 ` Martin Schitter
2011-05-04 12:39 ` Chris Mason
1 sibling, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-05-04 0:28 UTC (permalink / raw)
To: Martin Schitter; +Cc: linux-btrfs
On Tue, May 03, 2011 at 11:56:32PM +0200, Martin Schitter wrote:
> since my last debian kernel-update to 2.6.38-2-amd64 i got troubles with
> csum failures. it's a volume full of huge kvm-images on md-RAID1 and
> LVM, so i used the mount options: 'noatime,nodatasum' to maximize the
> performance.
>
> it happened two weeks ago for the first time, and now a kvm-image
> isn't readable again. i have to use an older snapshot to substitute the
> virtual machine.
>
> these are the entries in dmesg/kernel-log on any access:
> ...
> [2412668.409442] btrfs csum failed ino 258 off 2331529216 csum
> 3632892464 private 2115348581
> ...
>
> it's a production machine, so i cannot run too many experiments on it.
> do you see an obvious way to solve this problem?
>
Wait why are you running with btrfs in production? What OS is in this vm image?
Thanks,
Josef
* Re: btrfs csum failed
2011-05-04 0:28 ` Josef Bacik
@ 2011-05-04 0:44 ` Martin Schitter
2011-05-04 2:18 ` Fajar A. Nugraha
2011-05-04 14:39 ` Josef Bacik
0 siblings, 2 replies; 17+ messages in thread
From: Martin Schitter @ 2011-05-04 0:44 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
Am 2011-05-04 02:28, schrieb Josef Bacik:
> Wait why are you running with btrfs in production?
do you know a better alternative for continuous snapshots? :)
it has worked surprisingly well for more than a year.
well the performance could be better for vm-image-hosting but it works.
we used cache='writeback' for a long time but now all virtual instances
have set cache='none'
> What OS is in this vm image?
2.6.30-bpo.1-amd64 with virtio-driver
could you give me some advice how to debug/report this specific problem
more precisely?
thanks
martin
* Re: btrfs csum failed
2011-05-04 0:44 ` Martin Schitter
@ 2011-05-04 2:18 ` Fajar A. Nugraha
2011-05-04 11:39 ` Martin Schitter
2011-05-04 14:39 ` Josef Bacik
1 sibling, 1 reply; 17+ messages in thread
From: Fajar A. Nugraha @ 2011-05-04 2:18 UTC (permalink / raw)
To: Martin Schitter; +Cc: linux-btrfs
On Wed, May 4, 2011 at 7:44 AM, Martin Schitter <ms@mur.at> wrote:
> Am 2011-05-04 02:28, schrieb Josef Bacik:
>>
>> Wait why are you running with btrfs in production?
>
> do you know a better alternative for continuous snapshots? :)
zfs :D
>
> it has worked surprisingly well for more than a year.
> well the performance could be better for vm-image-hosting but it works.
>
> we used cache='writeback' for a long time but now all virtual instances have
> set cache='none'
>
>> What OS is in this vm image?
>
> 2.6.30-bpo.1-amd64 with virtio-driver
>
> could you give me some advice how to debug/report this specific problem more
> precisely?
If it's not reproducible then I'd suspect it'd be hard to do.
Usually checksum errors are an early sign of hardware failure (most common
are disk or power supply).
--
Fajar
* Re: btrfs csum failed
2011-05-04 2:18 ` Fajar A. Nugraha
@ 2011-05-04 11:39 ` Martin Schitter
2011-05-04 11:47 ` Hugo Mills
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Martin Schitter @ 2011-05-04 11:39 UTC (permalink / raw)
To: Fajar A. Nugraha; +Cc: linux-btrfs
Am 2011-05-04 04:18, schrieb Fajar A. Nugraha:
>> could you give me some advice how to debug/report this specific
>> problem more precisely?
> If it's not reproducible then I'd suspect it'd be hard to do.
the last working snapshot is from 2011-05-02-17:13. i can reproduce this
file system corruption on one specific file in every later hourly snapshot.
whenever i run a simple:
cat snapshot-2011-05-02-18:13/sata-images/image_xy.raw > /dev/null
i get an "Input/output error" and the quoted debug messages in dmesg and
kernel-log
could this be seen as a useful starting point for further investigations?
> Usually checksum errors are an early sign of hardware failure (most
> common are disk or power supply).
that looks very implausible to me. there is a RAID1 layer beneath btrfs
in our setup and i don't see any errors there.
and the 'nodatasum' option should also ignore csum issues -- shouldn't it?
martin
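The `cat ... > /dev/null` test above stops at the first failing read. A rough sketch of a variant that keeps going and lists every block offset that returns an I/O error could look like the following. The function name and the 4 KiB block size are assumptions made for illustration, and the sketch relies only on the behaviour described in the message (reads of the affected region fail with "Input/output error").

```python
import errno
import os


def find_unreadable_blocks(path, block_size=4096):
    """Return the offsets of all blocks whose read fails with EIO."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for offset in range(0, size, block_size):
            try:
                # pread reads at an explicit offset, so one failing block
                # does not prevent the following blocks from being tried.
                os.pread(fd, block_size, offset)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # anything other than an I/O error is unexpected
                bad_offsets.append(offset)
    finally:
        os.close(fd)
    return bad_offsets


# Usage (path taken from the message above):
#   find_unreadable_blocks("snapshot-2011-05-02-18:13/sata-images/image_xy.raw")
```

Comparing the returned offsets with the `off` values in the kernel log would show whether a single block or a larger region of the image is affected.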
* Re: btrfs csum failed
2011-05-04 11:39 ` Martin Schitter
@ 2011-05-04 11:47 ` Hugo Mills
2011-05-04 11:51 ` cwillu
2011-05-04 12:31 ` Kaspar Schleiser
2 siblings, 0 replies; 17+ messages in thread
From: Hugo Mills @ 2011-05-04 11:47 UTC (permalink / raw)
To: Martin Schitter; +Cc: Fajar A. Nugraha, linux-btrfs
On Wed, May 04, 2011 at 01:39:46PM +0200, Martin Schitter wrote:
> and the 'nodatasum' option should also ignore csum issues -- shouldn't it?
No, "nodatasum" will prevent newly-written data from being
checksummed. However, if a checksum already exists (because the data
was written to a filesystem mounted without the "nodatasum" option),
btrfs will still verify the checksum, regardless of the current
setting of nodatasum.
There is currently no way of preventing btrfs from verifying
checksums if they exist; I don't believe that there's any way of
removing an existing checksum, either.
Hugo.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Charting the inexorable advance of Western syphilisation... ---
* Re: btrfs csum failed
2011-05-04 11:39 ` Martin Schitter
2011-05-04 11:47 ` Hugo Mills
@ 2011-05-04 11:51 ` cwillu
2011-05-04 12:27 ` Martin Schitter
2011-05-04 14:09 ` Jan Schmidt
2011-05-04 12:31 ` Kaspar Schleiser
2 siblings, 2 replies; 17+ messages in thread
From: cwillu @ 2011-05-04 11:51 UTC (permalink / raw)
To: Martin Schitter; +Cc: Fajar A. Nugraha, linux-btrfs
On Wed, May 4, 2011 at 5:39 AM, Martin Schitter <ms@mur.at> wrote:
> Am 2011-05-04 04:18, schrieb Fajar A. Nugraha:
>>> could you give me some advice how to debug/report this specific
>>> problem more precisely?
>>
>> If it's not reproducible then I'd suspect it'd be hard to do.
>
> the last working snapshot is from 2011-05-02-17:13. i can reproduce this
> file system corruption on one specific file in every later hourly snapshot.
That's not surprising, any later snapshots will be sharing the same
corrupted block.
> that looks very implausible to me. there is a RAID1 layer beneath btrfs in
> our setup and i don't see any errors there.
That doesn't rule out the possibility of corruption when it was
written in the first place, or some similar problem that the raid1
faithfully reproduced on both mirrors. That's not to say that it's
impossible that the problem is in btrfs, just that it's not the only
plausible possibility.
> and the 'nodatasum' option should also ignore csum issues -- shouldn't it?
No, it only affects writing new checksums; any existing checksums are
still checked.
* Re: btrfs csum failed
2011-05-04 11:51 ` cwillu
@ 2011-05-04 12:27 ` Martin Schitter
2011-05-04 13:23 ` Edward Ned Harvey
2011-05-04 14:09 ` Jan Schmidt
1 sibling, 1 reply; 17+ messages in thread
From: Martin Schitter @ 2011-05-04 12:27 UTC (permalink / raw)
To: cwillu; +Cc: Fajar A. Nugraha, linux-btrfs
Am 2011-05-04 13:51, schrieb cwillu:
>> that looks very implausible to me. there is a RAID1 layer beneath btrfs in
>> our setup and i don't see any errors there.
>
> That doesn't rule out the possibility of corruption when it was
> written in the first place, or some similar problem that the raid1
> faithfully reproduced on both mirrors. That's not to say that it's
> impossible that the problem is in btrfs, just that it's not the only
> plausible possibility.
well -- i am doing a backup of all images every night. this process
should work like a simple "scrub" because all data (and its checksums)
will be read. that's the way i stumbled over this problem!
>> and the 'nodatasum' option should also ignore csum issues -- shouldn't it?
>
> No, it only affects writing new checksums; any existing checksums are
> still checked.
would it make sense to remount the volume with checksumming enabled
and run additional tests to find similar suspect blocks, to prevent
files from suddenly breaking like this?
martin
* Re: btrfs csum failed
2011-05-04 11:39 ` Martin Schitter
2011-05-04 11:47 ` Hugo Mills
2011-05-04 11:51 ` cwillu
@ 2011-05-04 12:31 ` Kaspar Schleiser
2011-05-04 13:25 ` Martin Schitter
2 siblings, 1 reply; 17+ messages in thread
From: Kaspar Schleiser @ 2011-05-04 12:31 UTC (permalink / raw)
To: Martin Schitter; +Cc: linux-btrfs
Hey Martin,
On 05/04/11 13:39, Martin Schitter wrote:
>> Usually checksum errors are an early sign of hardware failure (most
>> common are disk or power supply).
>
> that looks very implausible to me. there is a RAID1 layer beneath btrfs
> in our setup and i don't see any errors there.
Is the btrfs RAID1 itself inside a virtual machine? I've had data
corruption with virtio block devices > 1TB on early squeeze kernels.
Kaspar
* Re: btrfs csum failed
2011-05-03 21:56 btrfs csum failed Martin Schitter
2011-05-04 0:28 ` Josef Bacik
@ 2011-05-04 12:39 ` Chris Mason
2011-05-04 14:06 ` Martin Schitter
1 sibling, 1 reply; 17+ messages in thread
From: Chris Mason @ 2011-05-04 12:39 UTC (permalink / raw)
To: Martin Schitter; +Cc: linux-btrfs
Excerpts from Martin Schitter's message of 2011-05-03 17:56:32 -0400:
> since my last debian kernel-update to 2.6.38-2-amd64 i got troubles with
> csum failures. it's a volume full of huge kvm-images on md-RAID1 and
> LVM, so i used the mount options: 'noatime,nodatasum' to maximize the
> performance.
>
> it happened two weeks ago for the first time, and now a kvm-image
> isn't readable again. i have to use an older snapshot to substitute the
> virtual machine.
>
> these are the entries in dmesg/kernel-log on any access:
> ...
> [2412668.409442] btrfs csum failed ino 258 off 2331529216 csum
> 3632892464 private 2115348581
> ...
>
> it's a production machine, so i cannot run too many experiments on it.
> do you see an obvious way to solve this problem?
What OS is inside these virtual machines? The btrfs unstable tree has
some fixes for windows based OSes.
Is your kvm config using O_DIRECT?
I've also got patches here that force us to honor nodatasum even when
the file has csums, that can help if the contents of the file are
actually good.
-chris
* RE: btrfs csum failed
2011-05-04 12:27 ` Martin Schitter
@ 2011-05-04 13:23 ` Edward Ned Harvey
2011-05-04 14:42 ` Martin Schitter
0 siblings, 1 reply; 17+ messages in thread
From: Edward Ned Harvey @ 2011-05-04 13:23 UTC (permalink / raw)
To: 'Martin Schitter', cwillu; +Cc: Fajar A. Nugraha, linux-btrfs
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-
> owner@vger.kernel.org] On Behalf Of Martin Schitter
>
> well -- i am doing a backup of all images every night. this process
> should work like a simple "scrub" because all data (and its checksums)
> will be read.
Sorry, not correct. When you read all the data using something in user-land, the OS only needs to read one side of the data. It can accelerate by staggering the read requests across multiple disks. So some sectors remain unread on some disks.
When you scrub, it reads all the data from all the redundant copies (mirrored or raid) on all the individual disks in the raid set.
For this reason, you always want to use JBOD, and don't use hardware raid. Because if there's an undetected hardware error, the hardware raid will make it impossible for the OS to examine individual disks to identify the failing one.
At least I know all the above is true for reading & scrubbing in another filesystem, I don't actually know any of this for fact in btrfs, but it seems so basic I would be flabbergasted if I learned that wasn't the btrfs behavior.
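The difference between a plain read and a scrub described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it uses strict round-robin read balancing across two mirrors, which is not how any particular RAID implementation necessarily schedules reads, and all function names are hypothetical.

```python
def blocks_touched_by_read(n_blocks, n_mirrors=2):
    """Plain read: each logical block is served by exactly one mirror."""
    touched = {mirror: set() for mirror in range(n_mirrors)}
    for block in range(n_blocks):
        # Toy round-robin balancing; real schedulers are more elaborate.
        touched[block % n_mirrors].add(block)
    return touched


def blocks_touched_by_scrub(n_blocks, n_mirrors=2):
    """Scrub: every block is read from every mirror."""
    return {mirror: set(range(n_blocks)) for mirror in range(n_mirrors)}


def coverage(touched, n_blocks):
    """Fraction of each mirror's blocks that were actually read."""
    return {mirror: len(blocks) / n_blocks for mirror, blocks in touched.items()}


n = 8
print("read :", coverage(blocks_touched_by_read(n), n))
print("scrub:", coverage(blocks_touched_by_scrub(n), n))
```

In this model a full read of the data leaves half of each mirror unverified, while a scrub covers both copies completely, which is the point being made about nightly backups not being equivalent to a scrub.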
* Re: btrfs csum failed
2011-05-04 12:31 ` Kaspar Schleiser
@ 2011-05-04 13:25 ` Martin Schitter
0 siblings, 0 replies; 17+ messages in thread
From: Martin Schitter @ 2011-05-04 13:25 UTC (permalink / raw)
To: Kaspar Schleiser; +Cc: linux-btrfs
Am 2011-05-04 14:31, schrieb Kaspar Schleiser:
> Is the btrfs RAID1 itself inside a virtual machine? I've had data
> corruption with virtio block devices > 1TB on early squeeze kernels.
no -- it's on the (native) host side. and we use a very recent kernel
from debian 'testing' (2.6.38-2).
martin
* Re: btrfs csum failed
2011-05-04 12:39 ` Chris Mason
@ 2011-05-04 14:06 ` Martin Schitter
0 siblings, 0 replies; 17+ messages in thread
From: Martin Schitter @ 2011-05-04 14:06 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-btrfs
Am 2011-05-04 14:39, schrieb Chris Mason:
> What OS is inside these virtual machines? The btrfs unstable tree has
> some fixes for windows based OSes.
we have only linux guests of different flavors, no windows guests.
both corruptions during these last weeks belong to different virtual
block device images of the same guest instance.
> Is your kvm config using O_DIRECT?
yes -- the kvm/qemu option cache="none" implies O_DIRECT.
> I've also got patches here that force us to honor nodatasum even when
> the file has csums, that can help if the contents of the file are
> actually good.
that sounds interesting! in our case it may be easier to use some
recent backup data, but it could be very helpful in similar situations.
i would really like to help isolate the reasons for this failure and
find a practical strategy to prevent additional breakdowns.
thanks
martin
* Re: btrfs csum failed
2011-05-04 11:51 ` cwillu
2011-05-04 12:27 ` Martin Schitter
@ 2011-05-04 14:09 ` Jan Schmidt
1 sibling, 0 replies; 17+ messages in thread
From: Jan Schmidt @ 2011-05-04 14:09 UTC (permalink / raw)
To: cwillu; +Cc: linux-btrfs
On 04.05.2011 13:51, cwillu wrote:
> On Wed, May 4, 2011 at 5:39 AM, Martin Schitter <ms@mur.at> wrote:
>> and the 'nodatasum' option should also ignore csum issues -- shouldn't it?
>
> No, it only affects writing new checksums; any existing checksums are
> still checked.
From the report I assume this must be the case for metadata, but it
does not hold true for data. I was just looking at
btrfs_readpage_end_io_hook for some other reason and realized it skips
checksum checking when the file system is mounted nodatasum.
-Jan
* Re: btrfs csum failed
2011-05-04 0:44 ` Martin Schitter
2011-05-04 2:18 ` Fajar A. Nugraha
@ 2011-05-04 14:39 ` Josef Bacik
1 sibling, 0 replies; 17+ messages in thread
From: Josef Bacik @ 2011-05-04 14:39 UTC (permalink / raw)
To: Martin Schitter; +Cc: linux-btrfs
On 05/03/2011 08:44 PM, Martin Schitter wrote:
> Am 2011-05-04 02:28, schrieb Josef Bacik:
>> Wait why are you running with btrfs in production?
>
> do you know a better alternative for continuous snapshots? :)
>
> it has worked surprisingly well for more than a year.
> well the performance could be better for vm-image-hosting but it works.
>
> we used cache='writeback' for a long time but now all virtual instances
> have set cache='none'
>
>> What OS is in this vm image?
>
> 2.6.30-bpo.1-amd64 with virtio-driver
>
> could you give me some advice how to debug/report this specific problem
> more precisely?
>
So there is a problem with DIO: since userspace can modify pages while
they are in flight, we can end up with the wrong checksums. I was trying
to come up with a way to fix this, but there's really nothing to be done
at the moment other than turning off checksumming per file. Windows was
particularly bad about this, but I hadn't seen it with Linux guests (even
though it should still be happening). So I'll come up with something to
turn off checksumming per file to get around this for now; I'll try to
get to that soonish. Thanks,
Josef
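The race described above can be modelled in a few lines. This is a sequential toy, not real concurrent direct I/O: the `mutate` callback stands in for userspace changing the buffer while the write is in flight, `zlib.crc32` is used as a stand-in checksum (btrfs uses crc32c, which is not in the Python standard library), and all names are hypothetical.

```python
import zlib


def write_with_checksum(buf, mutate=None):
    """Model a write where the checksum is taken before the data is stored."""
    # Step 1: the checksum is computed over the buffer as it is right now.
    csum = zlib.crc32(bytes(buf))
    # Step 2: userspace may still change the buffer while the I/O is in flight.
    if mutate is not None:
        mutate(buf)
    # Step 3: the device stores whatever the buffer holds at this point.
    stored = bytes(buf)
    return stored, csum


def verify(stored, csum):
    """Model a later read: recompute the checksum and compare."""
    return zlib.crc32(stored) == csum


def flip_first_byte(buf):
    buf[0] ^= 0xFF


quiet = write_with_checksum(bytearray(b"a" * 4096))
racy = write_with_checksum(bytearray(b"a" * 4096), mutate=flip_first_byte)
print("unmodified buffer verifies:", verify(*quiet))
print("modified-in-flight buffer verifies:", verify(*racy))
```

In the second case the stored data is exactly what the writer last put in the buffer, yet verification fails, which matches the situation where the file contents may be fine but the recorded checksum is stale.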
* Re: btrfs csum failed
2011-05-04 13:23 ` Edward Ned Harvey
@ 2011-05-04 14:42 ` Martin Schitter
2011-05-04 18:10 ` Chris Mason
0 siblings, 1 reply; 17+ messages in thread
From: Martin Schitter @ 2011-05-04 14:42 UTC (permalink / raw)
To: Edward Ned Harvey; +Cc: cwillu, Fajar A. Nugraha, linux-btrfs
Am 2011-05-04 15:23, schrieb Edward Ned Harvey:
>> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-
>> owner@vger.kernel.org] On Behalf Of Martin Schitter
>>
>> well -- i am doing a backup of all images every night. this
>> process should work like a simple "scrub" because all data (and its
>> checksums) will be read.
>
> Sorry, not correct. When you read all the data using something in
> user-land, the OS only needs to read one side of the data. It can
> accelerate by staggering the read requests across multiple disks. So
> some sectors remain unread on some disks.
>
> When you scrub, it reads all the data from all the redundant copies
> (mirrored or raid) on all the individual disks in the raid set.
ok -- i see -- you're right!
i know, there are some benefits in the way btrfs and zfs implement RAID /
multiple disk usage and checksumming, but i also want to stay on the
safe side when it comes to real practical problems. so i decided to use
'classical' linux software RAID-1 as the base layer. that's a very old
fashioned solution, but it usually simply works... and you can change a
broken disk without any regard to the filesystem(s) in use. in general i
try to use btrfs only on account of its snapshot features in a very
simple way.
it looks very strange to me that i don't see any SMART warnings on the
harddisks or errors on other filesystems on the same raid-array. there
was also no reboot, power-failure or similar when the corruption
suddenly appeared. so i think a btrfs bug would be the most obvious
explanation.
martin
* Re: btrfs csum failed
2011-05-04 14:42 ` Martin Schitter
@ 2011-05-04 18:10 ` Chris Mason
0 siblings, 0 replies; 17+ messages in thread
From: Chris Mason @ 2011-05-04 18:10 UTC (permalink / raw)
To: Martin Schitter; +Cc: Edward Ned Harvey, cwillu, Fajar A. Nugraha, linux-btrfs
Excerpts from Martin Schitter's message of 2011-05-04 10:42:51 -0400:
> Am 2011-05-04 15:23, schrieb Edward Ned Harvey:
> >> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-
> >> owner@vger.kernel.org] On Behalf Of Martin Schitter
> >>
> >> well -- i am doing a backup of all images every night. this
> >> process should work like a simple "scrub" because all data (and its
> >> checksums) will be read.
> >
> > Sorry, not correct. When you read all the data using something in
> > user-land, the OS only needs to read one side of the data. It can
> > accelerate by staggering the read requests across multiple disks. So
> > some sectors remain unread on some disks.
> >
> > When you scrub, it reads all the data from all the redundant copies
> > (mirrored or raid) on all the individual disks in the raid set.
>
> ok -- i see -- you're right!
>
> i know, there are some benefits in the way btrfs and zfs implement RAID /
> multiple disk usage and checksumming, but i also want to stay on the
> safe side when it comes to real practical problems. so i decided to use
> 'classical' linux software RAID-1 as the base layer. that's a very old
> fashioned solution, but it usually simply works... and you can change a
> broken disk without any regard to the filesystem(s) in use. in general i
> try to use btrfs only on account of its snapshot features in a very
> simple way.
>
> it looks very strange to me that i don't see any SMART warnings on the
> harddisks or errors on other filesystems on the same raid-array. there
> was also no reboot, power-failure or similar when the corruption
> suddenly appeared. so i think a btrfs bug would be the most obvious
> explanation.
That's the bad news: it can be very hard to tell. The disk could be
returning garbage, or btrfs could be messing up the csums.
The btrfs unstable tree does have one fix that is related to O_DIRECT
and kvm, but we've only ever seen it happen with a windows guest. This
doesn't mean it is impossible for a linux guest to trigger it though.
-chris