Intermittent data corruption and dmesg spam

* Intermittent data corruption and dmesg spam
@ 2013-10-23  3:58 Henry de Valence
  2013-10-23 13:09 ` Duncan
  2013-10-23 15:39 ` Chris Murphy
  0 siblings, 2 replies; 3+ messages in thread
From: Henry de Valence @ 2013-10-23  3:58 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

Two questions:

First, I have a ton of lines in dmesg like

[  123.664465] incomplete page read in btrfs with offset 2048 and length 2048
[  123.835761] incomplete page read in btrfs with offset 512 and length 3584

What does this mean? I tried searching on Google but all I got was the commit 
that added the code that prints these messages. Should I be worried?
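
One way to get a handle on the volume of these messages is to tally them 
from saved dmesg output. The following is only a sketch: the function name 
parse_incomplete_reads is made up for illustration, and the pattern is 
taken from the two sample lines above. In both samples, offset plus length 
comes to 4096, which matches the usual 4 KiB page size on x86; whether 
that holds for every such message is an assumption to check against the 
full log.

```python
import re

# Pattern built from the message text quoted above; the leading
# "[ seconds ]" dmesg timestamp is simply skipped by findall().
LINE_RE = re.compile(
    r"incomplete page read in btrfs with offset (\d+) and length (\d+)"
)


def parse_incomplete_reads(dmesg_text):
    """Return a list of (offset, length) pairs found in dmesg output."""
    return [(int(off), int(length)) for off, length in LINE_RE.findall(dmesg_text)]


sample = """\
[  123.664465] incomplete page read in btrfs with offset 2048 and length 2048
[  123.835761] incomplete page read in btrfs with offset 512 and length 3584
"""

for offset, length in parse_incomplete_reads(sample):
    # Print each pair together with the end position of the read.
    print(offset, length, offset + length)
```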

Second, I’m having some intermittent data corruption issues, and I’m not 
really sure how to pin down the cause. Sometimes, I’ll get errors trying to 
read a file due to a failed checksum, but when I run btrfs scrub, it reports 
that everything is OK. For instance, this time I booted, I get a line in dmesg 
saying

btrfs: bdev /dev/bcache0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0

but when I run btrfs scrub I get:

scrub status for 56118d27-c9a8-483c-afaa-e429d59884e9
     scrub started at Tue Oct 22 22:46:17 2013 and finished after 2802 seconds
     total bytes scrubbed: 426.03GB with 0 errors

My setup is a btrfs partition on a bcache device, which has a new-ish hard 
drive as the backing store and a partition on an older SSD as the cache. The 
bcache documentation suggests that sequential reads bypass the cache device. 
Is it possible that I have some bad blocks on my SSD, which cause the errors 
and data corruption, but the data corruption doesn’t show up with btrfs scrub 
because the disk accesses in the scrub are bypassing the cache?

Does anyone know how I could test this theory, or otherwise try to determine 
the source of the problems?
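
One rough way to probe for intermittent read corruption is to hash the 
same file several times and see whether the result ever changes or a read 
fails. This is a sketch under assumptions, not a definitive test: the 
function names are invented for illustration, and repeated reads are 
normally served from the kernel page cache, so the cache has to be dropped 
between passes (or the file has to be larger than RAM) before the reads 
actually reach the bcache device. A btrfs checksum failure is expected to 
surface to the reader as an I/O error, which Python raises as OSError.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, passes=5):
    """Hash the same file several times.

    Returns (set of distinct digests, list of read errors). More than one
    digest, or any error, means the reads were not consistent.
    """
    digests, errors = set(), []
    for _ in range(passes):
        try:
            digests.add(sha256_of(path))
        except OSError as exc:
            # Record the failure and keep going with the next pass.
            errors.append(exc)
    return digests, errors
```

A single digest and an empty error list across many passes does not prove 
the hardware is fine; it only fails to reproduce the problem.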

For what it’s worth, I ran smartctl on both my hard drive and my SSD, and it 
didn’t detect anything.

My btrfs version is Btrfs v0.20-rc1-358-g194aa4a on Linux 3.11.3 (Arch).

Thanks,
Henry de Valence


* Re: Intermittent data corruption and dmesg spam
  2013-10-23  3:58 Intermittent data corruption and dmesg spam Henry de Valence
@ 2013-10-23 13:09 ` Duncan
  2013-10-23 15:39 ` Chris Murphy
  1 sibling, 0 replies; 3+ messages in thread
From: Duncan @ 2013-10-23 13:09 UTC (permalink / raw)
  To: linux-btrfs

Henry de Valence posted on Tue, 22 Oct 2013 23:58:33 -0400 as excerpted:

> Second, I’m having some intermittent data corruption issues, and I’m not
> really sure how to pin down the cause. Sometimes, I’ll get errors trying
> to read a file due to a failed checksum, but when I run btrfs scrub, it
> reports that everything is OK. For instance, this time I booted, I get a
> line in dmesg saying
> 
> btrfs: bdev /dev/bcache0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
> 
> but when I run btrfs scrub I get:
> 
> scrub status for 56118d27-c9a8-483c-afaa-e429d59884e9
>      scrub started at Tue Oct 22 22:46:17 2013 and finished after 2802 seconds
>      total bytes scrubbed: 426.03GB with 0 errors

I know nothing (other than its general purpose) about bcache so I'll stay 
away from that angle, but...

[This takes a bit of a long way around, but comes back to your issue, so 
be patient...]

This reminds me of some years ago when I had some hard-to-pin-down memory 
corruption issues.  Memtest would say everything was OK, and most of the 
time the system was fine, but every once in a while, things would go 
haywire.  (In my case, one of the most common symptoms was a bunzip2 
failure due to checksum mismatch... but it wasn't the file, it was the 
memory, as a retry would bunzip just fine.)  I had occasional machine-check 
errors too, when the hardware would catch the issue.

My problem ultimately turned out to be borderline speed-certified 
memory.  A BIOS update eventually gave me the ability to de-clock the 
memory from its rating just slightly (IIRC from 333 MHz to 330 or some 
such, this was in the DDR1 era), after which I was actually able to 
tighten some of the other settings (various wait-state timings) a bit and 
get back some of the speed lost to the slightly lower clock.  The memory 
cells themselves were fine, hence Memtest coming up clean, and so was the 
bus... most of the time, but at the rated clock speed, every once in 
a while...

Then later I upgraded memory and didn't have the problem at all with the 
new memory, so it was indeed the memory modules that weren't quite 
reliable at the rated speed, NOT the mobo or on-board bus.

Back to your current situation: someone else just recently had a problem 
that, like my memory experience but with storage rather than memory, traced 
to a SATA system that wasn't quite stable at the rated SATA-3 speed.  When 
he forced it back to SATA-2, it worked just fine.  (Unfortunately, with 
SATA that means halving the link speed, not losing a percent or two as 
with the slightly lower clock on my memory, which I could partly make up 
with tighter wait-state timings.  But IIRC he was on spinning rust, where 
the physical platter speed is in practice the bottleneck, so he didn't 
lose much except a bit of cache-access speed.)

So I'd suggest using hdparm or the like to (temporarily) force a lower 
SATA/SAS/whatever speed and see if that helps at all.  If it does, you 
can investigate that further and decide what to do then.  If it doesn't, 
you can return to your normal speeds and no harm done.

Of course the bcache device complicates things a bit, but I guess you can 
try setting speed for both devices one at a time, and possibly try 
disabling the cache and running direct too (assuming that's possible with 
bcache).  But with the exception of those comments, as I said I'll leave 
the bcache stuff for you to figure out as I know little or nothing about 
it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Intermittent data corruption and dmesg spam
  2013-10-23  3:58 Intermittent data corruption and dmesg spam Henry de Valence
  2013-10-23 13:09 ` Duncan
@ 2013-10-23 15:39 ` Chris Murphy
  1 sibling, 0 replies; 3+ messages in thread
From: Chris Murphy @ 2013-10-23 15:39 UTC (permalink / raw)
  To: Henry de Valence; +Cc: linux-btrfs

On Oct 22, 2013, at 9:58 PM, Henry de Valence <hdevalence@hdevalence.ca> wrote:
> 
> 
> btrfs: bdev /dev/bcache0 errs: wr 0, rd 0, flush 0, corrupt 16, gen 0
> 
> but when I run btrfs scrub I get:
> 
> scrub status for 56118d27-c9a8-483c-afaa-e429d59884e9
>     scrub started at Tue Oct 22 22:46:17 2013 and finished after 2802 seconds
>     total bytes scrubbed: 426.03GB with 0 errors

Well, it sounds to me like there was some kind of corruption at some point, and possibly btrfs fixed it (maybe it was metadata corruption, which is simple for btrfs to fix because it keeps two copies of metadata). Note that the mount-time line, the first output you report, is a persistent counter. Within the last 2-3 weeks there was a post on this list, I think by Hugo or Duncan, on how to reset that counter. The second output, the scrub status, is not persistent; it shows the values for the most recently run scrub.

So it seems clear to me that 16 corruptions were encountered during normal operation and had been fixed by the time you ran the scrub. Overall I'd say you are probably OK, but you'd need to go back through syslog or journalctl and see if you can find the first instances of the original corruption. I know there were some issues with bcache and btrfs and dirty data, although my vague recollection is that this discussion wasn't on this list but rather on lkml. You might do an lkml search for "overstreet btrfs" and see what hits you get.
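
Because the mount-time counter is cumulative, what matters is whether it grows between boots, not its absolute value. A minimal sketch for tracking that from saved log lines follows; the function names parse_dev_stats and counter_growth are hypothetical, and the pattern is derived only from the one line quoted in this thread, so it may need adjusting for other kernel versions.

```python
import re

# Pattern derived from the single log line quoted in this thread.
STATS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_dev_stats(line):
    """Return (device, {counter: value}) for a matching line, else None."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    device = fields.pop("dev")
    return device, {name: int(value) for name, value in fields.items()}


def counter_growth(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if after[name] != before[name]
    }
```

If the corrupt counter still reads 16 on later boots, counter_growth returns an empty dict and no new corruption has been recorded since the first snapshot; a non-empty result points at the time window in which to search the logs.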

> 
> My setup is a btrfs partition on a bcache device, which has a new-ish hard 
> drive as the backing store and a partition on an older SSD as the cache. The 
> bcache documentation suggests that sequential reads bypass the cache device. 
> Is it possible that I have some bad blocks on my SSD, which cause the errors 
> and data corruption, but the data corruption doesn’t show up with btrfs scrub 
> because the disk accesses in the scrub are bypassing the cache?

That seems doubtful; otherwise the corruption would show up persistently in the scrubs, yet your scrub is clean. You just have a persistent counter that says that, since it was last reset, 16 corruptions have been found. It seems clear they have been fixed since then. So the question is why they occurred in the first place, and there's probably a better chance of this being a bug than bad hardware, simply because this combination is not significantly tested yet.

Chris Murphy
