* Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change.
@ 2014-07-10 23:32 Tomasz Kusmierz
2014-07-11 1:57 ` Austin S Hemmelgarn
2014-07-11 10:38 ` Duncan
0 siblings, 2 replies; 4+ messages in thread
From: Tomasz Kusmierz @ 2014-07-10 23:32 UTC (permalink / raw)
To: linux-btrfs
Hi all !
So it's been some time with btrfs, and so far I was very pleased, but
since I upgraded Ubuntu from 13.10 to 14.04 problems have started to
occur (YES, I know this might be unrelated).
In the past I've had problems with btrfs which turned out to be caused
by static from a printer corrupting RAM and producing checksum failures
on the filesystem - so I'm not going to assume from the start that
there is something wrong with btrfs.
Anyway:
On my server I'm running 6 x 2TB disks in RAID 10 for general storage
and 2 x ~0.5TB in RAID 1 for the system. Might be unrelated, but after
upgrading to 14.04 I've started using ownCloud, which uses Apache &
MySQL for its backing store - all data is stored on the storage array,
and MySQL was on the system array.
It all started with csum errors showing up in MySQL data files and in
some transactions!!! Generally the system was immediately switching the
btrfs filesystems to read-only mode, forced by the kernel (I don't have
dmesg / syslog now). I removed the offending files, the problem seemed
to go away, and I started from scratch. After 5 days the problem
reappeared, now located around the same MySQL files and in files
managed by Apache as "cloud". At this point, since these files are
rather dear to me, I've decided to pull out all the stops and try to
rescue as much as I can.
As an exercise in btrfs management I've run btrfsck --repair - it did
not help. Repeated with --init-csum-tree - it turned out that this left
me with a blank system array. Nice! Could use some warning here.
I've moved all the drives to my main rig, which has a nice 16GB of ECC
RAM, so errors from RAM, CPU or controller should theoretically be
eliminated. I've used the system array drives and a spare drive to
extract all "dear to me" files to a newly created array (1TB + 500GB +
640GB). Ran a scrub on it and everything seemed OK. At this point I
deleted the "dear to me" files from the storage array and ran a scrub.
The scrub now showed even more csum errors in transactions and in one
large file that was not touched FOR A VERY LONG TIME (size ~1GB).
Deleted the file. Ran scrub - no errors. Copied the "dear to me" files
back to the storage array. Ran scrub - no issues. Deleted the files
from my backup array and decided to call it a day. The next day I
decided to run a scrub once more "just to be sure"; this time it
discovered a myriad of errors in files and transactions. Since I had no
time to continue I decided to postpone to the next day - the next day I
started my rig and noticed that neither the backup array nor the
storage array would mount anymore. I tried to rescue the situation
without any luck. Power cycled the PC and on the next startup both
arrays failed to mount; when I tried to mount the backup array, mount
told me that this specific UUID DOES NOT EXIST !?!?!
my fstab uuid:
fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
new uuid:
771a4ed0-5859-4e10-b916-07aec4b1a60b
I tried to mount by /dev/sdb1 and it did mount. Tried by the new UUID
and it mounted as well. Scrub passes with flying colours on the backup
array, while the storage array still fails to mount with:
root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/
mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
for any device in the array.
Honestly, this is a question for the more senior guys - what should I do now?
Chris Mason - have you got any updates to your "old friend stress.sh"?
If not, I can try using the previous version that you provided to
stress test my system - but this is the second system that exposes this
erratic behaviour.
Anyone - what can I do to rescue my "beloved files"? (No sarcastic
suggestions of zfs / ext4 / tapes / DVDs, please.)
P.S. Needless to say: SMART shows no SATA CRC errors, no relocated
sectors, no errors whatsoever (as far as I can see).
* Re: Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change.
2014-07-10 23:32 Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change Tomasz Kusmierz
@ 2014-07-11 1:57 ` Austin S Hemmelgarn
2014-07-11 10:38 ` Duncan
1 sibling, 0 replies; 4+ messages in thread
From: Austin S Hemmelgarn @ 2014-07-11 1:57 UTC (permalink / raw)
To: Tomasz Kusmierz, linux-btrfs
On 07/10/2014 07:32 PM, Tomasz Kusmierz wrote:
> Hi all !
>
> So it's been some time with btrfs, and so far I was very pleased, but
> since I upgraded Ubuntu from 13.10 to 14.04 problems have started to
> occur (YES, I know this might be unrelated).
>
> In the past I've had problems with btrfs which turned out to be caused
> by static from a printer corrupting RAM and producing checksum failures
> on the filesystem - so I'm not going to assume from the start that
> there is something wrong with btrfs.
>
> Anyway:
> On my server I'm running 6 x 2TB disks in RAID 10 for general storage
> and 2 x ~0.5TB in RAID 1 for the system. Might be unrelated, but after
> upgrading to 14.04 I've started using ownCloud, which uses Apache &
> MySQL for its backing store - all data is stored on the storage array,
> and MySQL was on the system array.
>
> It all started with csum errors showing up in MySQL data files and in
> some transactions!!! Generally the system was immediately switching the
> btrfs filesystems to read-only mode, forced by the kernel (I don't have
> dmesg / syslog now). I removed the offending files, the problem seemed
> to go away, and I started from scratch. After 5 days the problem
> reappeared, now located around the same MySQL files and in files
> managed by Apache as "cloud". At this point, since these files are
> rather dear to me, I've decided to pull out all the stops and try to
> rescue as much as I can.
>
> As an exercise in btrfs management I've run btrfsck --repair - it did
> not help. Repeated with --init-csum-tree - it turned out that this left
> me with a blank system array. Nice! Could use some warning here.
>
I know that this will eventually be pointed out by somebody, so I'm
going to save them the trouble and mention that it does say on both the
wiki and in the manpages that btrfsck should be a last resort (i.e., after
you have made sure you have backups of anything on the FS).
> I've moved all the drives to my main rig, which has a nice 16GB of ECC
> RAM, so errors from RAM, CPU or controller should theoretically be
> eliminated. I've used the system array drives and a spare drive to
> extract all "dear to me" files to a newly created array (1TB + 500GB +
> 640GB). Ran a scrub on it and everything seemed OK. At this point I
> deleted the "dear to me" files from the storage array and ran a scrub.
> The scrub now showed even more csum errors in transactions and in one
> large file that was not touched FOR A VERY LONG TIME (size ~1GB).
> Deleted the file. Ran scrub - no errors. Copied the "dear to me" files
> back to the storage array. Ran scrub - no issues. Deleted the files
> from my backup array and decided to call it a day. The next day I
> decided to run a scrub once more "just to be sure"; this time it
> discovered a myriad of errors in files and transactions. Since I had no
> time to continue I decided to postpone to the next day - the next day I
> started my rig and noticed that neither the backup array nor the
> storage array would mount anymore. I tried to rescue the situation
> without any luck. Power cycled the PC and on the next startup both
> arrays failed to mount; when I tried to mount the backup array, mount
> told me that this specific UUID DOES NOT EXIST !?!?!
>
> my fstab uuid:
> fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
> new uuid:
> 771a4ed0-5859-4e10-b916-07aec4b1a60b
>
>
> I tried to mount by /dev/sdb1 and it did mount. Tried by the new UUID
> and it mounted as well. Scrub passes with flying colours on the backup
> array, while the storage array still fails to mount with:
>
> root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/
> mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
> missing codepage or helper program, or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so
>
> for any device in the array.
>
> Honestly, this is a question for the more senior guys - what should I do now?
>
> Chris Mason - have you got any updates to your "old friend stress.sh"?
> If not, I can try using the previous version that you provided to
> stress test my system - but this is the second system that exposes this
> erratic behaviour.
>
> Anyone - what can I do to rescue my "beloved files"? (No sarcastic
> suggestions of zfs / ext4 / tapes / DVDs, please.)
>
> P.S. Needless to say: SMART shows no SATA CRC errors, no relocated
> sectors, no errors whatsoever (as far as I can see).
First thing that I would do is some very heavy testing with tools like
iozone and fio. I would use the verify mode from iozone to further
check data integrity. My guess based on what you have said is that it
is probably issues with either the storage controller (I've had issues
with almost every brand of SATA controller other than Intel, AMD, Via,
and Nvidia, and it almost always manifested as data corruption under
heavy load), or something in the disk's firmware. I would still suggest
double-checking your RAM with Memtest, and checking the cables on the
drives. The one other thing that I can think of is potential voltage
sags from the PSU (either because the PSU is overloaded at times, or
because of really noisy/poorly-conditioned line power). Of course, I
may be totally off with these ideas, but the only 2 times that I have
ever had issues like these myself were caused by a bad storage
controller doing writes from the wrong location in RAM, and a
line-voltage sag that happened right as BTRFS was in the middle of
writing to the root tree.
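
Something along these lines is what I had in mind for the verify runs;
the file names and sizes are just examples, and it's worth
double-checking the option names against your fio/iozone versions:

  fio --name=verify-test --filename=/arrays/@storage/fio-test.tmp \
      --size=4g --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
      --verify=crc32c --do_verify=1
  iozone -a -V 123456 -g 4g -f /arrays/@storage/iozone-test.tmp

If either of these turns up verification errors on a filesystem that
scrubs clean, that points much more at the hardware than at BTRFS.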
* Re: Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change.
2014-07-10 23:32 Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change Tomasz Kusmierz
2014-07-11 1:57 ` Austin S Hemmelgarn
@ 2014-07-11 10:38 ` Duncan
2014-07-11 12:33 ` Russell Coker
1 sibling, 1 reply; 4+ messages in thread
From: Duncan @ 2014-07-11 10:38 UTC (permalink / raw)
To: linux-btrfs
Tomasz Kusmierz posted on Fri, 11 Jul 2014 00:32:33 +0100 as excerpted:
> So it's been some time with btrfs, and so far I was very pleased, but
> since I upgraded Ubuntu from 13.10 to 14.04 problems have started to
> occur (YES, I know this might be unrelated).
Many points below; might as well start with this one.
You list the Ubuntu version but don't list the kernel or btrfs-tools
versions. This is an upstream list, so the Ubuntu version means little to
us. We need kernel and userspace (btrfs-tools) versions.
As the wiki stresses, btrfs is still under heavy development and it's
particularly vital to run current kernels as they fix known bugs in older
kernels. 3.16 is on rc4 now so if you're not on the latest 3.15.x stable
series kernel at minimum, you're missing patches for known bugs. And by
rc2 or rc3, many btrfs users have already switched to the development
kernel series, assuming they're not affected by any of the still active
regressions in the development kernel. (FWIW, this is where I am.)
Further, there's a btrfs-next branch that many run as well, with patches
not yet in mainline but slated for it, tho that's a bit /too/ bleeding
edge for my tastes.
Keeping /absolutely/ current with the latest btrfs-tools release isn't
/quite/ as vital, as the most risky operations are handled by the kernel,
but keeping somewhere near current is definitely recommended. Current
btrfs-tools git-master is 3.14.2, with 3.12.0 the last release before
3.14, as well as the earliest recommended version. If you're still on
0.19-something or 0.20-rc1 or so, please upgrade to at least 3.12
userspace.
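
For reference, the quickest way to report that is something like the
following (the package name assumes Ubuntu's btrfs-tools packaging):

  uname -r                # running kernel version
  btrfs --version         # userspace btrfs-tools version
  dpkg -l btrfs-tools     # what the distro thinks it installed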
> In the past I've had problems with btrfs which turned out to be caused
> by static from a printer corrupting RAM and producing checksum failures
> on the filesystem - so I'm not going to assume from the start that
> there is something wrong with btrfs.
Just as a note, RAM shouldn't be that touchy. There are buffer capacitors
and the like that should keep the system core (including RAM) stable even
in the face of a noisy electronic environment. While that might have
been the immediately visible problem, I'd consider it a warning sign that
you have something else unhealthy going on.
The last time I started having issues like that, it was on an old
motherboard, and the capacitors were going bad. By the time I quit using
it, I could still run it if I kept the room cold enough (60F/15C or so),
but any warmer and the data buses would start corrupting data on the way
to and from the drives. Turned out several of the capacitors were
bulging and a couple had popped.
As Austin H mentioned, it can also be power supply issues. In one place
I lived the wall power simply wasn't stable enough and computers kept
dying. That was actually a couple decades ago now, but I've seen other
people report similar problems more recently. Another power issue I've
seen was a UPS that simply wasn't providing the necessary power --
replaced and the problem disappeared.
Another thing I've seen happen that fits right in with the upgrade
coincidence, is that a newer release (or certain distros as opposed to
others, or one time someone reported it was Linux crashing where MSWindows
ran fine) might be better optimized for current systems, which can stress
them more, triggering problems where the less optimized OS ran fine.
However, that tends to trigger CPU issues due to overheating, not so much
RAM issues.
There's another possibility too, but more below on that. Bottom line,
however, your printer shouldn't be able to destabilize the computer like
that, and if it is, that's evidence of other problems, which you really
do need to get to the bottom of.
> Anyway:
> On my server I'm running 6 x 2TB disks in RAID 10 for general storage
> and 2 x ~0.5TB in RAID 1 for the system.
Wait a minute... Btrfs raid10 and raid1 modes, or hardware RAID, or
software/firmware RAID such as the kernel's mdraid or dmraid?
There's a critical difference here, in that btrfs raid modes are checksum
protected, while the kernel's software raid, at least, is not. More
below.
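
In the meantime, you can check which it is straight from btrfs itself;
if these report RAID10/RAID1 profiles it's btrfs raid, while a single
btrfs device sitting on top of md/dm raid will show up with "single"
data (the mountpoint here is just an example from your setup):

  btrfs filesystem show
  btrfs filesystem df /arrays/@storage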
> Might be unrelated, but after upgrading
> to 14.04 I've started using ownCloud, which uses Apache & MySQL for its
> backing store - all data is stored on the storage array, and MySQL was
> on the system array.
>
> It all started with csum errors showing up in MySQL data files and in
> some transactions!!! Generally the system was immediately switching the
> btrfs filesystems to read-only mode, forced by the kernel (I don't have
> dmesg / syslog now). I removed the offending files, the problem seemed
> to go away, and I started from scratch. After 5 days the problem
> reappeared, now located around the same MySQL files and in files
> managed by Apache as "cloud". At this point, since these files are
> rather dear to me, I've decided to pull out all the stops and try to
> rescue as much as I can.
Just to clarify, btrfs csum errors, right? Or btrfs saying fine and you
mean mysql errors? I'll assume btrfs csum errors...
How large are these files? I'm not familiar with owncloud internals, but
does it store its files as files (which would presumably be on btrfs)
that are simply indexed by the mysql instance, or as database objects
actually stored in mysql (so in one or more huge database files that are
themselves presumably on btrfs)?
The reason I'm asking is potential file fragmentation and the additional
filesystem overhead involved, with a correspondingly increased risk of
corruption due to the extra stress that overhead puts on the
filesystem. A rather long and detailed discussion of the
problem and some potential solutions follows. Feel free to skim or skip
it (down to the next quote fragment) for now and come back to it later if
you think it may apply.
Due to the nature of COW (copy-on-write) based filesystems (including
btrfs) in general, they always find a particular write-pattern
challenging, and btrfs is no exception. The write pattern in question is
modify-in-place (which I often refer to as internal write, since it's
writes to the middle of a file, not just the end), as opposed to write
out serially and either truncate/replace or append-only modify.
Databases and VM images are particularly prone to exactly this sort of
write pattern, since for them the common operation is modifying some
chunk of data somewhere in the middle, then writing it back out, without
modifying either the data before or after that particular chunk.
Normal filesystems modify-in-place with that access pattern, and while
there's some risk of corruption in particular if the system crashes
during the modify-write, because modify-in-place-filesystems are the
common case, and this is the most common write mode for these apps, they
have evolved to detect and deal with this problem.
Copy-on-write filesystems deal with file modifications differently. They
write the new data to a different location, and then update the
filesystem metadata to map the new location into the existing file at the
appropriate spot.
For btrfs, this is typically done in 4 KiB (4096 byte) chunks at a time.
If you have say a 128 MiB file, that's 128 MiB / 4 KiB = 128 * 1024 / 4 =
32,768 blocks, each 4 KiB long. Now make that 128 MiB 32,768 block file
a database file with essentially random writes, copying each block
elsewhere as soon as it's written to, and the problem quickly becomes
apparent -- you quickly end up with a HEAVILY fragmented file of tens of
thousands of extents. If the file is a GiB or larger, the file may well
be hundreds of thousands of extents!
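
You can see how bad any given file has gotten with filefrag from
e2fsprogs (the path is just an example of a typical database file; note
that on compressed btrfs files the extent count it reports is inflated,
so treat it as a rough indicator):

  filefrag /var/lib/mysql/ibdata1        # prints "N extents found"
  filefrag -v /var/lib/mysql/ibdata1     # lists the individual extents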
Of course particularly on spinning rust hard drives, fragmentation that
heavy means SERIOUS access time issues as the heads seek back and forth
and then wait for the appropriate disk segment to spin under the read/
write heads. SSDs don't have the seek latency, but they have their own
issues in terms of IOPs limits and erase-block sizes. And of course
there's the extra filesystem metadata overhead in tracking all those
hundreds of thousands of extents, too, that affects both types of storage.
It's all this extra filesystem metadata overhead that eventually seems to
cause problems. When you consider the potential race conditions inherent
in updating not only block location mapping but also checksums for
hundreds of thousands of extents, with real-time updates coming in that
need to be written to disk in exactly the correct order along with the extent
and checksum updates, and possibly compression thrown in if you have that
turned on as well, plus the possibility of something crashing or in your
case that extra bit of electronic noise at just the wrong moment, it's a
wonder there aren't more issues than there are.
Fortunately, btrfs has a couple ways of ameliorating the problem. First,
there's the autodefrag mount option. With this option enabled, btrfs
will auto-detect file fragmenting writes and queue those files for later
automatic defrag by a background defrag thread.
This works well for reasonably small database files up to a few hundred
MiB in size. Firefox's sqlite-based history etc. database files are a
good example, and autodefrag works well with them.
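
Enabling it is just a mount option, either in fstab or on a live
remount (the label and mountpoint here are examples):

  # /etc/fstab
  LABEL=storage  /arrays/@storage  btrfs  defaults,autodefrag  0  0

  mount -o remount,autodefrag /arrays/@storage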
But as file sizes approach a GiB, particularly for fairly active
databases or VMs, autodefrag doesn't work so well, because larger files
take longer for the defrag thread to rewrite, and at some point, the
changes are coming in faster than the file can be rewritten! =:^(
The next alternative is similar to autodefrag, except at a different
level. Script the defrags and have a timer-based trigger.
Traditionally, it'd be a cronjob, but in today's new systemd based world,
the trigger could just as easily be a systemd timer. Either way, the
idea here is to run the defrag once a day or so, probably when the system
or at least the database or VM image in question isn't so active. That
way, more fragmentation will build up during the hours between defrags,
but it'll be limited to a day's worth, with the defrag being scheduled to
take care of the problem during a daily low activity period, overnight or
whatever it may be. A number of people have reported that this works
better for them than autodefrag, particularly as files approach and
exceed a GiB in size.
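
A minimal sketch of the cron variant, assuming the database files live
under a directory like the one shown (the path is an example, recursive
-r needs a reasonably recent btrfs-progs, and you may need the full
path to the btrfs binary depending on cron's PATH):

  # /etc/cron.d/btrfs-defrag -- run daily at 03:15, during the quiet hours
  15 3 * * * root btrfs filesystem defragment -r /arrays/@storage/owncloud-data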
Finally, btrfs has the NOCOW file attribute, set using chattr +C.
Basically, it tells btrfs to handle this file like a normal update-in-
place filesystem would, instead of doing the COW thing btrfs does for
most files. But there are a few significant limitations.
First of all, NOCOW must be set on a file before it has any content.
Setting it on a file that already has data doesn't guarantee NOCOW. The
easiest way to do this is to set NOCOW on the directory that will hold
the files that you want NOCOWed, after which any newly created file (or
subdir) in that directory will inherit the NOCOW attribute at creation,
thus before it gets any data. Existing database and VM-image files can
then be copied (NOT moved, unless it's between filesystems, as a move
within the same filesystem won't create a new file that could inherit
the attribute) into the NOCOW directory, and should get the attribute
properly that way.
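
A minimal sketch of that sequence, with example paths:

  mkdir /arrays/@storage/mysql-data
  chattr +C /arrays/@storage/mysql-data        # new files created here inherit NOCOW
  cp -a /var/lib/mysql/. /arrays/@storage/mysql-data/   # copy, don't move within the same fs
  lsattr -d /arrays/@storage/mysql-data        # should show the 'C' attribute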
Second, NOCOW turns off both btrfs checksumming and (if otherwise
enabled) compression for that file. This is because updating the file in-
place introduces all sorts of race conditions and the like that make
checksumming and compression impractical. The reason btrfs can normally
do them is due to its COW nature, and as a result, turning that off turns
off checksumming and compression as well.
Now loss of checksumming in particular sounds pretty bad, but for this
type of file, it's not actually quite as bad as it sounds, because as I
mentioned above, apps that routinely do update-in-place have generally
evolved their own error detection and correction procedures, so having
btrfs do it too, particularly when they're not done in cooperation with
each other, doesn't always work out so well anyway.
Third, NOCOW interacts badly with the btrfs snapshot feature, which
depends on COW. A btrfs snapshot locks the existing version of the file
in place, relying on COW to create a new copy of any changed block
somewhere else, while keeping the old unmodified copy where it is. So
any modification to a file block after a snapshot by definition MUST COW
that block, even if the file is otherwise set NOCOW. And that's
precisely what happens -- the first write to a file block after a
snapshot COWS that block, tho subsequent writes to the same file block
(until the next snapshot) will overwrite in-place once again, due to the
NOCOW.
The implications of "the snapshot COW exception" are particularly bad
when automated snapshotting, perhaps hourly or even every minute, is
happening. Snapper and a number of other automated snapshotting scripts
do this. But what the snapshot COW exception in combination with
frequent snapshotting means, is that NOCOW basically loses its effect,
because the first write to a file-block after a snapshot must be COW in
any case.
Tho there's a workaround for that, too. Since btrfs snapshots stop at
btrfs subvolume boundaries, make that NOCOW directory a dedicated
subvolume, so snapshots of the parent don't include it, and then use
conventional backups instead of snapshotting on that subvolume.
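
Sketch of that layout, again with example paths:

  btrfs subvolume create /arrays/@storage/nocow
  chattr +C /arrays/@storage/nocow
  # snapshots of /arrays/@storage stop at the subvolume boundary, so
  # /arrays/@storage/nocow is left out and needs its own backups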
So basically what I said about NOCOW above is exactly that: btrfs handles
the file as if it were a normal filesystem, not the COW based filesystem
btrfs normally is. That means normal filesystem rules apply, which means
the COW-dependent features that btrfs normally has simply don't work on
NOCOW files. Which is a bit of a negative, yes, but consider this: the
features you're losing are ones you wouldn't have on a normal filesystem
anyway, so compared to a normal filesystem you're not losing anything.
And btrfs
features that don't depend on COW, like the btrfs multi-device-filesystem
option, and the btrfs subvolume option, can still be used. =:^)
Still, one of the autodefrag options may be better, since they do NOT
kill btrfs' COW based features as NOCOW does.
> As an exercise in btrfs management I've run btrfsck --repair - it did
> not help. Repeated with --init-csum-tree - it turned out that this left
> me with a blank system array. Nice! Could use some warning here.
As Austin mentioned, btrfsck --repair is normally recommended only as a
last resort, to be run either on the recommendation of a developer when
they've decided it should fix the problem without making it worse, or if
you've tried everything else and the next step would be blowing away the
filesystem with a new mkfs anyway, so you've nothing to lose.
Normally before that, you'd try mounting with the recovery and then the
recovery,ro options, and if that didn't work, you'd try btrfs restore,
possibly in combination with btrfs-find-root, as detailed on the wiki.
(I actually had to do this recently, mounting recovery or recovery,ro
didn't work, but btrfs restore did. FWIW the wiki page on restore
helped, but I think it could use some further improvement, which I might
give it if I ever get around to registering on the wiki so I can edit it.)
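
Roughly, the order I'd try things in on your storage array looks like
this (device and target paths are examples, and restore copies files
out to some other, healthy filesystem, so it needs a target with enough
space):

  mount -o recovery /dev/sdd1 /arrays/@storage
  mount -o recovery,ro /dev/sdd1 /arrays/@storage
  btrfs restore /dev/sdd1 /mnt/rescue-target
  btrfs-find-root /dev/sdd1                          # if restore can't find a usable root
  btrfs restore -t <bytenr> /dev/sdd1 /mnt/rescue-target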
As for the blank array after --init-csum-tree, some versions back there
was a bug of just that sort. Without knowing for sure which versions you
have and going back to look, I can't say for sure that's the bug you ran
into, but it was a known problem back then, it has been fixed in current
versions, and if you're running old versions, that's /exactly/ the sort
of risk that you are taking!
> I've moved all the drives to my main rig, which has a nice 16GB of ECC
> RAM, so errors from RAM, CPU or controller should theoretically be
> eliminated.
It's worth noting that ECC RAM doesn't necessarily help when it's an in-
transit bus error. Some years ago I had one of the original 3-digit
Opteron machines, which of course required registered and thus ECC RAM.
The first RAM I purchased for that board was apparently borderline on its
timing certifications, and while it worked fine when the system wasn't
too stressed, including with memtest, which passed with flying colors,
under medium memory activity it would very occasionally give me, for
instance, a bad bzip2 csum, and with intensive memory activity, the
problem would be worse (more bz2 decompress errors, gcc would error out
too sometimes and I'd have to restart my build, very occasionally the
system would crash).
Some of the time I'd get machine-checks from it, usually not, and as I
said, memtest passed with flying colors.
Eventually a BIOS update gave me the ability to declock memory timings
from what the memory was supposedly certified at, and at the next lower
speed memory clock, it was solid as a rock, even when I then tightened up
some of the individual wait-state timings from what it was rated for.
And later I upgraded memory, with absolutely no problems with the new
memory, either. It was simply that the memory (original DDR) was PC3200
certified when it could only reliably do PC3000 speeds.
The point being, just because it's ECC RAM doesn't mean errors will
always be reported, particularly when it's bus errors due to effectively
overclocking -- I wasn't overclocking from the certification, it was
simply cheap RAM and the certification was just a bit "optimistic".
That's the other hardware issue possibility I mentioned above, that I
punted discussing until here as a reply to your ECC comments.
(FWIW, I was running reiserfs at the time, and thru all the bad memory
problems, reiserfs was solid as a rock. That was /after/ reiserfs got
the data=ordered by default behavior, which Chris Mason, yes, the same
one leading btrfs now, helped with as it happens (IDR if he was patch
author or not but IIRC he was kernel reiserfs maintainer at the time),
curing the risky reiserfs data=writeback behavior that reiserfs
originally had that's the reason it got its reputation for poor
reliability. I'm still solidly impressed with reiserfs reliability and
continue running it on my spinning rust to this day, tho with the journal
it's not particularly appropriate for ssds, which is where I run btrfs.)
> I've used the system array drives and a spare drive to extract all
> "dear to me" files to a newly created array (1TB + 500GB + 640GB). Ran
> a scrub on it and everything seemed OK. At this point I deleted the
> "dear to me" files from the storage array and ran a scrub. The scrub
> now showed even more csum errors in transactions and in one large file
> that was not touched FOR A VERY LONG TIME (size ~1GB).
Btrfs scrub, right? Not mdraid or whatever scrub.
Here's where btrfs raid mode comes in. If you are running btrfs raid1 or
raid10 mode, there's two copies of the data/metadata on the filesystem,
and if one is bad, scrub csum-verifies the other and assuming it's good,
rewrites the bad copy.
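
For completeness, that's a plain btrfs scrub, started and checked like
this (the mountpoint is an example; -B keeps it in the foreground, -d
prints per-device stats):

  btrfs scrub start -Bd /arrays/@storage
  btrfs scrub status /arrays/@storage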
If you're running btrfs on a single device with mdraid or whatever
underneath it, then btrfs will default to dup mode for metadata -- two
copies of it -- but only single mode for data, and if that copy doesn't
csum-validate, there's no second copy for btrfs scrub to try to recover
from. =:^(
Which is why the sort of raid you are running is important. Btrfs
raid1/10 gives you that extra csummed copy in case the other fails csum-
verification. Single-device btrfs on top of some other raid won't give
you that for data, only for metadata. Because you don't have direct
control over which lower-level raid component the copy it supplies gets
pulled from, and (with the possible exception of expensive hardware
raid) that copy isn't checksummed, for all you know what the lower-level
raid feeds btrfs is the bad copy; there may or may not be a good copy
elsewhere, but there's no way for you to ensure btrfs gets a chance at
it.
My biggest complaint with that, is that regardless of the number of
devices, both btrfs raid1 and raid10 modes only have two copies of the
data. More devices still means only two copies, just more capacity. N-
way-mirroring is on the roadmap, but it's not here yet. Which is
frustrating as my ideal balance between risk and expense would be 3-way-
mirroring, giving me a second fallback in case the first TWO copies csum-
verify-fail. But it's roadmapped, and at this point btrfs is still
immature enough that the weak point isn't yet the limit of two copies,
but instead the risk of btrfs itself being buggy, so I have to be patient.
> Deleted the file. Ran scrub - no errors. Copied the "dear to me" files
> back to the storage array. Ran scrub - no issues. Deleted the files
> from my backup array and decided to call it a day. The next day I
> decided to run a scrub once more "just to be sure"; this time it
> discovered a myriad of errors in files and transactions. Since I had no
> time to continue I decided to postpone to the next day - the next day I
> started my rig and noticed that neither the backup array nor the
> storage array would mount anymore.
That really *REALLY* sounds to me like hardware issues. If you ran scrub
and it said no errors, and you properly umounted at shutdown, then the
filesystem /should/ be clean. If scrub is giving you that many csum-
verify errors that quickly, on the same data that passed without error
the day before, then I really don't see any other alternative /but/
hardware errors. It was safely on disk and verifying the one day. It's
not the next. Either the disk is rotting out from under you in real time
-- rather unlikely especially if SMART is saying it's fine, or you have
something else dynamically wrong with your hardware. It could be power.
It could be faulty memory. It could be bad caps on the mobo. It could
be a bad SATA cable or controller. Whatever it is, at this point all the
experience I have is telling me that's a hardware error, and you're not
going to be safe until you get it fixed.
That said, something ELSE I can definitely say from experience, altho
btrfs is several kernels better at this point than it was for my
experience, is that btrfs isn't a particularly great filesystem to be
running with faulty hardware. I already mentioned some of the stuff I've
put reiserfs thru over the years, and it's really REALLY better at
running on faulty hardware than btrfs is.
Bottom line: if your hardware is faulty, and I believe it is, and you
simply don't have the economic resources to fix it properly at this
point, then I STRONGLY recommend a mature and proven stable filesystem like
(for me at least) reiserfs, not something as immature and still under
intensive development as btrfs is at this point. I can say from
experience that there's a *WORLD* of difference between trying to run
btrfs on known unstable hardware, and running it on rock-stable, good
hardware.
Now I'm not saying you have to go with reiserfs. Ext3 or something else
may be a better choice for you. But I'd DEFINITELY not recommend btrfs
for unstable hardware, and if the hardware is indeed unstable, I don't
believe I'd recommend ext4 either, because ext4 simply doesn't have the
solid maturity either, for the unstable hardware situation.
> I tried to rescue the situation without any luck. Power cycled the PC
> and on the next startup both arrays failed to mount; when I tried to
> mount the backup array, mount told me that this specific UUID DOES NOT
> EXIST !?!?!
>
> my fstab uuid:
> fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
> new uuid:
> 771a4ed0-5859-4e10-b916-07aec4b1a60b
>
>
> I tried to mount by /dev/sdb1 and it did mount. Tried by the new UUID
> and it mounted as well.
I haven't a clue on that. The only thing remotely connected that I can
think of is that I know btrfs has a UUID tree as I've seen it dealt with
in various patches, that I have some vague idea is somehow related to the
various btrfs subvolumes and snapshots. But I'm not a dev and other than
thinking it might possibly have something to do with an older snapshot of
the filesystem if you had made one, I haven't a clue at all. And even
that is just the best guess I could come up with, that I'd say is more
likely wrong than right.
I take it you've considered and eliminated the possibility of your fstab
somehow getting replaced with a stale backup from a year ago or whatever,
that had the other long stale and forgotten UUID in it?
(FWIW, I prefer LABEL= to UUID= in my fstabs here, precisely because UUIDs
are so humanly unreadable that I can see myself making just this sort of
error. My labeling scheme is a bit strange and is in effect its own ID
system for my own use, but I can make a bit more sense of it than UUIDs,
and thus I'd be rather more likely to catch a label error I made than a
corresponding UUID error. Of course as they say, YMMV. By all means if
you prefer UUIDs, stick with them! I just can't see myself using them,
when labels are /so/ much easier to deal with!)
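
Setting and using a label is trivial if you do decide to go that way
(the label, device and mountpoint here are examples):

  btrfs filesystem label /dev/sdd1 storage

  # then in /etc/fstab:
  LABEL=storage  /arrays/@storage  btrfs  defaults  0  0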
> P.S. Needless to say: SMART shows no SATA CRC errors, no relocated
> sectors, no errors whatsoever (as far as I can see).
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change.
2014-07-11 10:38 ` Duncan
@ 2014-07-11 12:33 ` Russell Coker
0 siblings, 0 replies; 4+ messages in thread
From: Russell Coker @ 2014-07-11 12:33 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Fri, 11 Jul 2014 10:38:22 Duncan wrote:
> > I've moved all the drives to my main rig, which has a nice 16GB of
> > ECC RAM, so errors from RAM, CPU or controller should theoretically
> > be eliminated.
>
> It's worth noting that ECC RAM doesn't necessarily help when it's an in-
> transit bus error. Some years ago I had one of the original 3-digit
> Opteron machines, which of course required registered and thus ECC RAM.
> The first RAM I purchased for that board was apparently borderline on its
> timing certifications, and while it worked fine when the system wasn't
> too stressed, including with memtest, which passed with flying colors,
> under medium memory activity it would very occasionally give me, for
> instance, a bad bzip2 csum, and with intensive memory activity, the
> problem would be worse (more bz2 decompress errors, gcc would error out
> too sometimes and I'd have to restart my build, very occasionally the
> system would crash).
If bad RAM causes corrupt memory but no ECC error reports then it probably
wouldn't be a bus error. A bus error SHOULD give ECC reports.
One problem is that RAM errors aren't random. From memory the Hamming codes
used fix 100% of single bit errors, detect 100% of 2 bit errors, and let some
3 bit errors through. If you have a memory module with 3 chips on it (the
later generation of DIMM for any given size) then an error in 1 chip can
change 4 bits.
The other main problem is that if you have a read or write going to the
wrong address, then you lose, as AFAIK there's no ECC on address lines.
But I still recommend ECC RAM, it just decreases the scope for problems.
About half the serious problems I've had with BTRFS have been caused by a
faulty DIMM...
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/