From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: BTRFS free space handling still needs more work: Hangs again
Date: Sat, 27 Dec 2014 04:26:10 +0000 (UTC)
Message-ID: <pan$248af$ca4818d4$b1ce7df3$527f91c1@cox.net>
In-Reply-To: 6794875.ElpGA7E3Vr@merkaba
Martin Steigerwald posted on Fri, 26 Dec 2014 16:59:09 +0100 as excerpted:
> Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce:
> [Hardware Error]: Machine check events logged
> Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce:
> [Hardware Error]: Machine check events logged
Have you checked these MCEs? What are they?
MCEs are hardware errors. These are *NOT* kernel errors, tho of course
they may /trigger/ kernel errors. The reported event codes can be looked
up and translated into English.
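As a first step before decoding anything, it can help to pull the MCE
notices out of the kernel log and see how they are spaced. The following
is a minimal sketch, assuming log lines in the format quoted above
(unwrapped onto one line each); the regular expression and the
mce_timestamps helper are my own illustration, not part of any tool.
Decoding what each event actually means is left to a utility such as
mcelog, assuming it is available on the machine in question.

```python
import re

# Kernel log lines as quoted above, unwrapped onto one line each.
LOG = """\
Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce: [Hardware Error]: Machine check events logged
Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce: [Hardware Error]: Machine check events logged
"""

# Matches the kernel uptime stamp, e.g. "[ 8102.029438]", followed by the
# mce notice. The pattern is an assumption based on the two lines above.
MCE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+mce: \[Hardware Error\]")


def mce_timestamps(text):
    """Return the kernel uptime (in seconds) of every MCE notice in text."""
    return [float(m.group(1)) for m in MCE_RE.finditer(text)]


stamps = mce_timestamps(LOG)
# Seconds elapsed between each pair of consecutive notices.
gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
print(len(stamps), "MCE notices; gaps in seconds:", [round(g, 3) for g in gaps])
```

For the two quoted lines the notices are roughly 150 seconds apart, which
matches the wall-clock stamps (16:17:57 and 16:20:27).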
Since shortly after the first one until a bit before the second one here,
you had hardware thermal throttling: the CPUs, on-chip cache, and
possibly the memory were working pretty hard.
FWIW, I had an AMD machine that would MCE with memory related errors some
time (about a decade) ago. I had ECC RAM, but it was cheap and
apparently not quite up to the speeds it was actually rated for. MemTest
checked out the memory fine, but under high stress especially, it would
sometimes have bus/transit related corruption, which would sometimes (not
always) trigger those MCEs.
Eventually a BIOS update gave me the ability to turn down the memory
timings, and turning them down just one notch made everything rock-stable
-- I was even able to decrease some of the wait-states to get a bit of
the memory speed back. It just so happened that it was borderline stable
at the rated clock, and turning the memory clock down just one notch was
all it took. Later, I upgraded the RAM (the bad RAM was two half-gig
sticks, back when they were $100+ apiece; I upgraded to four 2-gig
sticks), and the new RAM didn't have the problem at all -- the bad RAM
sticks simply weren't /quite/ stable at the rated speed, that was it.
I run gentoo so of course do a lot of building from sources, and
interestingly enough, the thing that turned out to detect the corruption
the most often was bzip2 compression checksums. I'd get errors when
decompressing sources prior to the build rather more often than actual
build failures, altho those would happen occasionally as well. Redoing
it would work fine -- checksums passed -- and I never had a build that
actually finished but then failed to run due to a bad build.
Now here's the thing. Of course a decade ago was well before I was
running btrfs (FWIW I was running reiserfs at the time, and it seemed
pretty resilient given the bad RAM I had), so it was the bzip2 checksums
it failed on.
But guess what btrfs uses for file integrity: checksums. If your MCEs
are either like my memory-related MCEs were, or are similar CPU-cache or
CPU related but still something that would affect checksumming, btrfs may
well be fighting bad checksums due to the same issues, and that would of
course throw all sorts of wrenches into things. Another thing I've seen
reported as triggering MCEs is bad power (in that case it was a UPS that
was either underpowered or going bad; once it was out of the picture, the
MCEs and problems stopped).
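To make the checksum point concrete, here is a small sketch of how a
single flipped bit between "write" and "read" shows up as a checksum
mismatch, and why simply redoing the read can succeed. It is an
illustration only: btrfs's default checksum is CRC-32C as far as I know,
while Python's zlib.crc32 is ordinary CRC-32, used here purely as a
stand-in. The block contents, the bit position, and the checksum and
flip_bit helpers are all hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in only: zlib.crc32 is plain CRC-32, not the CRC-32C that
    # btrfs uses by default.
    return zlib.crc32(data)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit flipped (simulated transit error)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Hypothetical payload standing in for one block of file data.
block = b"btrfs data block " * 241
stored = checksum(block)  # checksum recorded when the data was written

# First read: one bit gets corrupted on the way through flaky hardware.
bad_read = flip_bit(block, 12345)
print("corrupted read matches:", checksum(bad_read) == stored)

# Retry: the data at rest was fine all along, so a clean read verifies.
good_read = block
print("clean reread matches:", checksum(good_read) == stored)
```

The same transient fault can just as easily hit while the checksum itself
is being computed or stored, in which case perfectly good data would look
bad on every later read.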
Now I think you're having other btrfs issues as well, some of which are
likely legit bugs. However, your MCEs certainly aren't helping things,
and I'd definitely recommend checking up on them to see what's actually
happening to your hardware. It may well be that without whatever
hardware issues are triggering those MCEs, you'd end up with fewer btrfs
problems as well.
Or maybe not, but it's something to look into, because right now,
regardless of whether they're making things worse physically, they're at
minimum obscuring a troubleshooting picture that would be clearer without
them.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman