From: Leho Kraav <leho@kraav.com>
To: linux-btrfs@vger.kernel.org
Subject: btrfs 3.2.2 -> 3.3.1 upgrade finally ate babies, some advice?
Date: Mon, 09 Apr 2012 16:24:33 +0300
Message-ID: <4F82E311.1040905@kraav.com>
Hi all
$ uname -a
Gentoo Linux s9 3.3.1-pf #2 SMP PREEMPT Mon Apr 9 00:35:28 EEST 2012
i686 Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz GenuineIntel GNU/Linux
I was running stuff for the past year or so on 4 partitions:
/dev/sda1 -> dm-crypt -> btrfs raid 0 ROOT 10.0GB
/dev/sda2 -> dm-crypt -> btrfs raid 0 ROOT 10.0GB
/dev/sda3 -> dm-crypt -> btrfs raid 0 HOME 10.0GB
/dev/sda4 -> dm-crypt -> btrfs raid 0 HOME 10.0GB
Both filesystems mounted with "noatime,nodiratime,ssd,discard,compress=lzo"
I set that multi-partition monster up back in the 2.6.36ish days, when
dm-crypt either was not capable of utilizing multicores on a single
partition or I possibly didn't know that it already could. At one point
it definitely couldn't.
So over time HOME started filling up and at the point of last night's
baby eating "df -hT" showed 1.7G free. Yes I know free space is
complicated in btrfs. Space had not been an issue so I didn't think to
use any better tools regularly to check, such as "btrfs fi show" I guess.
I upgraded my 3.2.2-pf to 3.3.1-pf* and proceeded to launch my
regular apps: Firefox, TB, office, etc. Except they all hung. Checking my
/var/log/messages window revealed what was happening:
* pf-sources => http://pf.natalenko.name/
...
Apr 8 02:45:52 s9 sudo: leho : TTY=pts/0 ; PWD=/home/leho ;
USER=root ; COMMAND=/bin/tail -f /home/leho/.tail/awesome-leho
/home/leho/.tail/messages /home/leho/.tail/openvpn.log
Apr 8 02:45:52 s9 sudo: pam_unix(sudo:session): session opened for user
root by (uid=0)
Apr 8 02:46:11 s9 kernel: [ 189.691778] attempt to access beyond end
of device
Apr 8 02:46:11 s9 kernel: [ 189.691787] dm-3: rw=129, want=23361976,
limit=20967424
Apr 8 02:46:11 s9 kernel: [ 189.691792] attempt to access beyond end
of device
Apr 8 02:46:11 s9 kernel: [ 189.691795] dm-3: rw=129, want=27556216,
limit=20967424
Apr 8 02:46:11 s9 kernel: [ 189.691799] attempt to access beyond end
of device
...
Apr 8 02:46:11 s9 kernel: [ 189.691869] attempt to access beyond end
of device
Apr 8 02:46:11 s9 kernel: [ 189.691874] dm-3: rw=129, want=69498616,
limit=20967424
...
Apr 8 02:46:11 s9 kernel: [ 189.692233] attempt to access beyond end
of device
Apr 8 02:46:11 s9 kernel: [ 189.692237] dm-3: rw=129, want=228879736,
limit=20967424
(thousands of lines of this; as we can see, "want" gets bigger all the time)
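To put the "want" and "limit" values in perspective, they can be converted to sizes. What follows is only a minimal Python sketch, assuming the kernel reports both values in 512-byte sectors; the names parse_overruns and to_gib are made up for illustration, and the sample text is two of the log entries quoted above.

```python
import re

SECTOR_BYTES = 512  # assumption: want/limit are counted in 512-byte sectors

# \s* after the comma tolerates the line wrap introduced by the mail client.
PATTERN = re.compile(r"(dm-\d+): rw=\d+, want=(\d+),\s*limit=(\d+)")


def parse_overruns(log_text):
    """Return (device, want, limit) tuples, with want/limit in sectors."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in PATTERN.finditer(log_text)]


def to_gib(sectors):
    """Convert a sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30


# Two entries copied from the log excerpt above, wrapped as in the mail.
sample = """\
Apr 8 02:46:11 s9 kernel: [ 189.691787] dm-3: rw=129, want=23361976,
limit=20967424
Apr 8 02:46:11 s9 kernel: [ 189.692237] dm-3: rw=129, want=228879736,
limit=20967424
"""

for dev, want, limit in parse_overruns(sample):
    print(f"{dev}: want {to_gib(want):.2f} GiB, limit {to_gib(limit):.2f} GiB")
```

Under that assumption the limit comes out at roughly 10.0 GiB, which matches the size of each partition, while the last "want" shown is past 100 GiB, far beyond anything the device could hold.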
And it was all downhill from there. The result is a majorly corrupted
filesystem that seems to be beyond repair. After a hard reboot, I started
getting csum errors in various spots, and any modification to the
filesystem, even deleting files, would start another flood of "attempt
to access beyond end of device", totally messing up syslog-ng. With the
blazing speeds of an SSD that probably isn't a surprise.
So searching around, I found out about the ENOSPC thing which is
possibly still an issue in 3.3. Is there any useful info I could provide
for this? I now have some bigger partitions and probably won't run out
of space again for a while.
I also discovered the btrfs "restore" binary, although possibly it was
too late, since I had already hard rebooted a few times and done some
more damage to HOME. It returned a whole bunch of "ret is -3"
messages and 0-byte files. Occasionally files were good as well, but
the majority of the files seem to be corrupt. When a filesystem runs out
of space, is this a reasonable result to expect?
"btrfs scrub" reported an uncorrectable error count in the millions, with
at least thousands of csum mismatch errors visible in dmesg.
"btrfs balance" would bomb the machine with the same "access beyond end
of device".
I made images of the two btrfs partitions on sda3 and sda4 for future
diagnosis. I do think they are pretty corrupt though. Or could there be
some magic poke or offset that would make more stuff magically
"restore"-able :>
So in conclusion:
* is filesystem-wide corruption like this made more likely by running on
top of dm-crypt or btrfs multi-device? dm-crypt is definitely staying for
me, but I did consolidate partitions now to just 2.
* what exactly should happen when an out of space scenario like the
above happens?
* I guess I should keep an eye on "btrfs fi show" on the regular?
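For the last point, a regular check could be as simple as collecting the output of the btrfs tools in one place. This is a rough Python sketch, not a tested monitoring setup: it assumes the "btrfs" binary is on PATH and that the caller has enough privileges, and the function names (btrfs_commands, space_report) are hypothetical. It does not try to parse the output, since the format may differ between btrfs-progs versions.

```python
import subprocess


def btrfs_commands(mountpoint):
    """Commands to run: device-level allocation, then per-profile usage."""
    return [
        ["btrfs", "filesystem", "show"],
        ["btrfs", "filesystem", "df", mountpoint],
    ]


def space_report(mountpoint, run=subprocess.run):
    """Run each command and return its raw output under a '$ cmd' header.

    'run' is injectable so the function can be exercised without btrfs.
    """
    parts = []
    for cmd in btrfs_commands(mountpoint):
        result = run(cmd, capture_output=True, text=True, check=False)
        parts.append("$ " + " ".join(cmd) + "\n" + result.stdout)
    return "\n".join(parts)


# Example (needs btrfs-progs installed, typically run as root):
#   print(space_report("/home"))
```

Running something like this from cron and reading the output now and then would at least show how much space is allocated per device, which "df -hT" alone does not.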