From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: "kernel BUG at /home/apw/COD/linux/fs/btrfs/extent_io.c:2116!" when deleting device or balancing filesystem.
Date: Mon, 28 Apr 2014 03:26:45 +0000 (UTC)
Message-ID: <pan$2abc$efe882ea$fdd493d$5779ebd8@cox.net>
In-Reply-To: C25AA8A9-30B0-4D00-80B5-B681D93D67A3@pieroen.nl
Jaap Pieroen posted on Sun, 27 Apr 2014 18:30:19 +0200 as excerpted:
> Hello,
>
> When I try to delete a device from my btrfs filesystem I always get the
> following kernel bug error:
> kernel BUG at /home/apw/COD/linux/fs/btrfs/extent_io.c:2116!
> invalid opcode: 0000 [#3] SMP
> See attached log file for more details.
That's a reasonably common, generic error. The "invalid opcode" line is
simply how a kernel BUG() assertion shows up on x86 (BUG() deliberately
executes an invalid instruction to trap), so by itself it doesn't say why
the assertion tripped, tho the log does give some more info.
In the log, it relocates various block groups, but then fails on one, due
to invalid checksum (csum). See below for the implications of that.
> I’m trying to delete the device /dev/sdb from my filesystem.
>
> Steps I tried so far are:
> 1. mount with the clear_cache option
> 2. balance the filesystem (results in the same kernel error)
> 3. scrub the filesystem
> 4. btrfsck --repair
Never use btrfsck (or btrfs check) with the --repair option, unless
you're about ready to give up on the filesystem and do a mkfs, in which
case you aren't risking anything anyway, or unless a dev suggests you run
it.
The reason being, btrfs check --repair knows how to fix some types of
errors, but among the ones it doesn't know how to fix, it can sometimes
make the problem worse. At some point it should know most problems and
at least not make them worse, but until then, it's not a good risk to
take unless you really know what you're doing or it's no risk as the
next step is blowing away the filesystem anyway.
(btrfs check, without --repair, is fine to run, since it's read-only and
thus won't make anything worse. But by the same token, it won't fix
anything either, it's simply informational.)
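To make the distinction concrete, here is a minimal shell sketch of the read-only invocation. The wrapper name check_readonly and the device path are made up for illustration; only "btrfs check <device>" itself is the real command, and it should be pointed at an unmounted filesystem.

```shell
# Hypothetical wrapper: run btrfs check in its default read-only mode and
# refuse to pass --repair through by accident.
check_readonly() {
    dev=$1
    shift
    for arg in "$@"; do
        if [ "$arg" = "--repair" ]; then
            echo "refusing to run with --repair" >&2
            return 1
        fi
    done
    btrfs check "$@" "$dev"
}

# Example (device path is a placeholder; the filesystem must be unmounted):
#   check_readonly /dev/sdX
```

The guard is only a convenience for interactive use; it doesn't make --repair any safer, it just makes it harder to type by reflex.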
> During scrubbing and btrfsck some errors were found and fixed. But I
> think these were errors caused by system lockups during copying data to
> the new btrfs filesystem. These lockups were caused by an extraordinary
> amount of hard links, since I was using rsnapshot to create hourly
> snapshots on my old filesystem that I am migrating towards btrfs.
> Removing these hard links solved the lockup problems.
>
> Something I also noted was that after the btrfsck run, the command
> ‘btrfs fi show’ reported
> “devid 4 size 0.0GiB used 98.00GiB path /dev/sdb” (mind the 0.0GB).
>
> I’m ready to run any diagnostics necessary, but the filesystem is 4.7T
> so it won’t be able to provide an image.
>
> System details:
> $ uname -a
> Linux nasbak 3.14.1-031401-generic
Good, latest stable kernel. =:^)
> $ btrfs --version
> Btrfs v3.12
You're behind on btrfs-tools. =:^( The latest version is v3.14.1.
> $ sudo btrfs fi show
> Label: btrfs_storage uuid: 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
> Total devices 6 FS bytes used 4.57TiB
> devid 1 size 1.82TiB used 1.32TiB path /dev/sde
> devid 2 size 1.82TiB used 1.32TiB path /dev/sdf
> devid 3 size 1.82TiB used 1.32TiB path /dev/sdg
> devid 4 size 931.51GiB used 88.00GiB path /dev/sdb
> devid 6 size 2.73TiB used 947.03GiB path /dev/sdh
> devid 7 size 2.73TiB used 947.03GiB path /dev/sdi
>
> Btrfs v3.12
For further reference, whenever you post btrfs fi show, please post btrfs
fi df as well, as the two provide complementary information, and the
picture without both of them is incomplete.
If you'd supplied the btrfs fi df output, we could see what raid level
you're running for data/metadata/system, as well as which type of chunks
were still left on /dev/sdb.
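As a rough sketch, the two reports can be gathered in one go. The function name fs_report and the mountpoint are placeholders of mine; "btrfs filesystem show" and "btrfs filesystem df <path>" are the actual commands (fi is just the short form of filesystem).

```shell
# Hypothetical helper: print the two complementary reports together.
# $1 is the mountpoint of the mounted btrfs filesystem.
fs_report() {
    mnt=$1
    echo "== btrfs filesystem show =="
    btrfs filesystem show
    echo "== btrfs filesystem df $mnt =="
    btrfs filesystem df "$mnt"
}

# Example (mountpoint is a placeholder; typically run as root):
#   fs_report /mnt/btrfs_storage
```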
For raid1 and raid10 modes (and dup mode on a single device), there are
two copies of each chunk, thus a second copy to try if the checksum
fails. Single and raid0 modes only keep a single copy, so there's not
much to do there but find the corresponding file and delete it to
correct the problem. In normal operation, if a checksum error is found
and there is a second copy that passes checksum, the invalid copy is
rewritten to match. What scrub does is go thru the entire filesystem
looking for such errors and rewriting the invalid copy where possible,
so you don't have to wait until you happen on the problem by accident.
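The profile-to-redundancy rule described here can be written down as a small lookup. This is only an illustration of the reasoning, not anything btrfs-progs provides: has_second_copy is a made-up name, and it covers just the profiles named above, answering "unknown" for anything else.

```shell
# Hypothetical helper: does a btrfs profile name keep a second copy
# that a failed checksum could be repaired from?
has_second_copy() {
    case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
        raid1|raid10|dup) echo yes ;;
        single|raid0)     echo no ;;
        *)                echo unknown ;;
    esac
}

# Examples:
#   has_second_copy RAID1    # prints: yes
#   has_second_copy single   # prints: no
```

The profile names to feed it are the ones btrfs fi df prints for each of data, metadata and system.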
You mentioned that you did try scrub and that it fixed some errors, which
would be csum errors. But did it leave any unfixed because there wasn't
a second, valid copy of the invalid data with which to rewrite it? If it
found and fixed all the errors, then you shouldn't be seeing further csum
errors like those in the log file, unless more are being created, which
would indicate an ongoing problem (perhaps a device going bad).
Of course the kernel bug is presumably locking up your system, not
allowing a clean shutdown, in which case you may well have more csum
errors due to that. So after rebooting, be sure to run a scrub before
you try to balance or device delete, and hopefully eliminate the problem.
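That ordering, scrub first and device delete only afterward, might be sketched like this. The function name and paths are placeholders. The sketch also assumes that a nonzero exit status from "btrfs scrub start -B" means the scrub failed or left uncorrectable errors; check the btrfs-scrub man page for your installed version, and look at "btrfs scrub status" as well, before relying on that.

```shell
# Hypothetical helper: scrub first, and only attempt the device delete
# if the scrub finished with exit status 0.
scrub_then_delete() {
    dev=$1
    mnt=$2
    # -B keeps scrub in the foreground so its exit status can be checked.
    if btrfs scrub start -B "$mnt"; then
        btrfs device delete "$dev" "$mnt"
    else
        echo "scrub did not finish cleanly; not deleting $dev" >&2
        return 1
    fi
}

# Example (placeholders; run as root after a fresh reboot):
#   scrub_then_delete /dev/sdb /mnt/btrfs_storage
```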
But... since you didn't post the df output, we don't know what the
remaining content on the device is (data, metadata or system), nor what
mode it's in. It could well be that scrub can't fix the invalid csums
because there's no second, valid copy, as will definitely be the case if
it's single or raid0 mode. (Data chunks are single by default, tho
metadata and system chunks default to raid1 on a multi-device filesystem
and dup on a single-device filesystem.)
If there's no valid second copy to rewrite the bad one with, you may
simply have to figure out what file and/or snapshot(s) it belongs to and
delete them, fixing the bad csums that way.
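One way to start tracking down the affected files, sketched under some assumptions: that the kernel log lines for these failures contain the text "csum failed ino <N>", and that the inode can then be mapped to a path with "btrfs inspect-internal inode-resolve". The helper name csum_failed_inodes and the mountpoint are made up. Inode numbers are per-subvolume, so the path given to inode-resolve has to be inside the right subvolume, and failures logged during a balance or device delete may refer to internal relocation inodes that don't resolve to a file at all.

```shell
# Hypothetical helper: read kernel log text on stdin and print the unique
# inode numbers from lines containing "csum failed ino <N>".
csum_failed_inodes() {
    sed -n 's/.*csum failed ino \([0-9][0-9]*\).*/\1/p' | sort -un
}

# Example (mountpoint is a placeholder):
#   dmesg | csum_failed_inodes | while read -r ino; do
#       btrfs inspect-internal inode-resolve "$ino" /mnt/btrfs_storage
#   done
```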
Of course that's assuming it's the bad csums causing the problem, not
something else.
Meanwhile, while I don't claim to be a dev nor to /really/ read code, I
did see some recent patches go by with comments that described bugs that
looked to me like they might match the problem you're reporting here,
specifically, failure to properly device delete under some conditions.
So I'd suggest updating to the current btrfs-progs, v3.14.1, and seeing if
that helps. If not, try a current v3.15-rcX testing kernel, or if you
don't want to try that, wait a couple stable kernel releases and see if
any btrfs patches get applied.
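For checking whether the installed userspace tools are behind, a small comparison helper could look like the following. version_older is a made-up name, and the sketch assumes GNU sort with its -V (version sort) option is available.

```shell
# Hypothetical helper: succeed if version $1 is strictly older than $2.
# Relies on GNU sort's -V (version sort) option.
version_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example, using the versions mentioned in this thread:
#   installed=$(btrfs --version | sed 's/.*v//')   # "Btrfs v3.12" -> "3.12"
#   version_older "$installed" 3.14.1 && echo "btrfs-progs is behind"
```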
With a bit of luck, between tracking down and eliminating the bad csums,
and the newer code that I think fixes at least some of the failure to
device delete issues, the problem will be addressed. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman