From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
Date: Mon, 9 May 2016 19:18:59 +0000 (UTC)
Message-ID: <pan$5176$b51afa6d$75bca86d$7e2b8169@cox.net>
In-Reply-To: 4aa3dda7-70d6-5dcf-2fa7-4f2b509e4a1e@gmail.com
Austin S. Hemmelgarn posted on Mon, 09 May 2016 14:21:57 -0400 as
excerpted:
> This practice evolved out of the fact that the only bad RAM I've ever
> dealt with either completely failed to POST (which can have all kinds of
> interesting symptoms if it's just one module, some MB's refuse to boot,
> some report the error, others just disable the module and act like
> nothing happened), or passed all the memory testing tools I threw at it
> (memtest86, memtest86+, memtester, concurrent memtest86 invocations from
> Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed
> under heavy concurrent random access, which can be reliably produced by
> running a bunch of big software builds at the same time with the CPU
> insanely over-committed.
My (likely much more limited) experience matches yours.
Tho FWIW, in my case I did find that one of the more common memory
failure indicators was bz2-ed tarball decompression, where the tarball
would fail its decompression checksum safety checks. However, that most
reliably happened in the context of a heavily loaded system doing other
package builds in parallel to the package tarball extraction that failed.
In my case, I even had ECC RAM, but it was apparently just slightly out
of spec for its labeled and internally configured memory speeds (PC3200
DDR1 at the time), at least on my hardware. Once I got a BIOS update
that let me, I slightly downclocked the memory (to PC3000, IIRC), and it
was absolutely solid, no more errors, even with tightened up wait-state
timings. Later I upgraded RAM, and the new RAM worked just fine at the
same PC3200 speeds that were a problem for the older RAM.
The problem was apparently that while the RAM cells that memtest checks
were fine, it was testing in an otherwise calm environment (not much
choice, since you can only boot to the test directly and can't do anything
else at the same time). It lacked all the other activity of the hectic
multi-package parallel build environment, which apparently happened to
occasionally trigger the edge case that corrupted things.
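For anyone wanting to approximate that failure mode from a running system, here is a minimal sketch, not a substitute for a real memory tester. It runs many bz2 compress/decompress round-trips at once and counts any that fail their checksum. The names `roundtrip_ok` and `stress`, the buffer size, and the worker counts are all made up for illustration, and `randbytes` needs Python 3.9 or later. Threads keep the sketch simple; a process pool or a real parallel build alongside it would load the machine harder.

```python
import bz2
import hashlib
import os
import random
from concurrent.futures import ThreadPoolExecutor


def roundtrip_ok(seed: int, size: int = 1 << 20) -> bool:
    """Compress and decompress one pseudo-random buffer; True if it survives intact."""
    # Seeded data, so a failing seed can be replayed later (hypothetical convention).
    data = random.Random(seed).randbytes(size)
    expected = hashlib.sha256(data).hexdigest()
    try:
        restored = bz2.decompress(bz2.compress(data))
    except (OSError, ValueError, EOFError):
        # bz2 reports a damaged stream (e.g. a failed CRC) as an exception.
        return False
    return hashlib.sha256(restored).hexdigest() == expected


def stress(workers: int, rounds: int, size: int = 1 << 20) -> int:
    """Run `rounds` round-trips across `workers` threads; return the failure count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(roundtrip_ok, range(rounds), [size] * rounds)
        return sum(1 for ok in results if not ok)


if __name__ == "__main__":
    # Deliberately over-commit the CPU, in the spirit of the parallel builds above.
    workers = (os.cpu_count() or 1) * 4
    failures = stress(workers, rounds=workers * 8)
    print(f"{failures} failed round-trips")
```

On healthy hardware the failure count should stay at zero. A nonzero count only says something between RAM, CPU and bus is unreliable under load, not which part.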
And FWIW, I still have major respect for how well reiserfs behaved under
those conditions. No filesystem can be expected to be 100% reliable when
it's getting corrupted data due to bad memory, but reiserfs held up
remarkably well, far better than btrfs did under similar conditions (that
time the problem was the PCI and SATA bus) a few years later. That forced
me back to reiserfs for a time, which again continued to work like a
champ, even under hardware conditions that were absolutely unworkable
with btrfs. I
had a heat-related (AC went out, in Phoenix, in the summer, 40+ C
outside, 50+ C inside, who knows how hot the disks were!?) head crash on a
disk too, where the partitions that were mounted and likely had the head
flying over them were damaged beyond (easy) recovery, but other
partitions on the same disk were absolutely fine, and I actually
continued to run off them for a few months after cooling everything back
down. That sort of experience is the reason I still use reiserfs on
spinning rust, including my second and third level backups, even while
I'm running btrfs on the ssds for the working system and primary backup.
It's also the reason I continue to use a partitioned system with multiple
independent filesystems (btrfs raid1 on a pair of ssds for most of the
working btrfs and primary backups, individual ssd btrfs in dup mode for
/boot, and its backup on the other ssd), instead of putting my data eggs
all in the same filesystem basket with subvolumes, where if the
filesystem goes out all the subvolumes go with it!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman