From: David Seikel <onefang@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: Oddness with phantom device replacing real device.
Date: Fri, 14 Aug 2015 09:25:34 +1000
Message-ID: <20150814092534.29e08bde.onefang@gmail.com>
In-Reply-To: <20150813095510.GP12976@carfax.org.uk>
On Thu, 13 Aug 2015 09:55:10 +0000 Hugo Mills <hugo@carfax.org.uk>
wrote:
> On Thu, Aug 13, 2015 at 01:33:22PM +1000, David Seikel wrote:
> > I don't actually think that this is a BTRFS problem, but it's
> > showing symptoms within BTRFS, and I have no other clues, so maybe
> > the BTRFS experts can help me figure out what is actually going
> > wrong.
> >
> > I'm a sysadmin working for a company that does scientific modelling.
> > They have many TBs of data. We use two servers running Ubuntu
> > 14.04 LTS to back up all of this data. One of them includes 16
> > spinning rust disks hooked to a RAID controller running in JBOD
> > mode (in other words, as far as Linux is concerned, they are just
> > 16 ordinary disks). They are /dev/sdc to /dev/sdr, all being used
> > as a single BTRFS file system.
> >
> > I have been having no end of trouble with this system recently.
> > Keep in mind that due to the huge amount of data we deal with, doing
> > anything takes a long time. So "recently" means "in the last
> > several months".
> >
> > My latest attempt to beat some sense into this server was to
> > upgrade it to the latest officially backported kernel from Ubuntu,
> > and compile my own copy of btrfs-progs from source code (latest
> > release from github). Then I recreated the 16 disk BTRFS file
> > system, and started the backup software running again, from
> > scratch. The next day, /dev/sdc had vanished, to be replaced by a
> > phantom /dev/sds. There's no such disk as /dev/sds. /dev/sds is
> > now included in the BTRFS file system replacing /dev/sdc. In /dev,
> > sdc does indeed vanish, and sds does indeed appear. This was
> > happening before. /dev/sds then starts to fill up with errors,
> > since no such disk actually exists.
>
> Sounds like the kind of behaviour when the disk has vanished from
> the system for long enough to drop out and be recreated by the
> driver. The renaming may (possibly) be down to a poor error-handling
> path in btrfs -- we see this happening on USB sometimes, where the
> original device node is still hung on to by the FS on a hardware
> error, and so when the device comes back it's given a different name.

So that part may need some fixing in BTRFS.

> > I don't know what is actually causing the problem. The disks are
> > in a hot swap backplane, and if I actually pulled sdc out, then it
> > would still be listed as part of the BTRFS file system, wouldn't it?
>
> With btrfs fi show, no, you'd get ** some devices missing ** in the
> output.

Which is different from what I'm getting.

> > If I
> > then were to plug some new disk into the same spot, it would not be
> > recognised as part of the file system?
>
> Correct... Unless the device had a superblock with the same UUID in
> it (like, say, the new device is just the old one reappearing
> again). In that case, udev would trigger a btrfs dev scan, and the
> "new" device would rejoin the FS -- probably a little out of date, but
> that would be caught by checksums and be fixed if you have redundancy
> in the storage.

But btrfs is thinking it's a different device, hence all the errors as
it gets confused.

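A rough sketch of one way to check whether sds is really the same
physical disk as sdc. The symlinks that udev creates under
/dev/disk/by-id are named after model and serial number, so they should
follow the disk even when the kernel name changes. The function name
kernel_names_by_id is made up for illustration, and this assumes udev
populates that directory on the system in question.

```python
import os


def kernel_names_by_id(by_id_dir="/dev/disk/by-id"):
    """Map each kernel device name (e.g. 'sdc') to its persistent IDs.

    Assumption: by_id_dir holds udev-created symlinks that point at the
    kernel device nodes, as /dev/disk/by-id normally does.
    """
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        # Resolve the symlink and keep only the final component,
        # so '/dev/disk/by-id/ata-FOO -> ../../sdc' yields 'sdc'.
        target = os.path.basename(os.path.realpath(link))
        mapping.setdefault(target, []).append(entry)
    return mapping
```

Calling kernel_names_by_id() before and after the rename and comparing
which serial number sits behind sdc and then behind sds would show
whether it is the same disk coming back under a new name.
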
> > So assuming that the RAID
> > controller is getting confused and thinking that sdc has been
> > pulled, then replaced by sds, it should not be showing up as part
> > of the BTRFS file system? Or maybe there's a signature on sdc that
> > BTRFS notices makes it part of the file system, even though BTRFS
> > is now confused about its location?
>
> See above.
>
> > After a reboot, sdc returns and sds is gone again.
>
> Expected.
>
> > The RAID controller has recently been replaced, but there were
> > similar problems with the old one as well. A better model of RAID
> > controller was chosen this time.
> >
> > I've also not been able to complete a scrub on this system recently.
> > The really odd thing is that I get messages that the scrub has
> > aborted, yet the scrub continues, then much later (days later) the
> > scrub causes a kernel panic. The "aborted" happens some random
> > time into the scrub, but usually in the early part of the scrub.
> > Mind you, if BTRFS is completely confused due to a problem
> > elsewhere, then maybe this can be excused.
>
> I think that means that it's aborting on one device but continuing
> on all the others.

Ah, it would be useful for scrub to say so, and point out which
device(s) got aborted.

> > The other backup server is almost identical, though it has fewer
> > disks in the array. It doesn't have any issues with the BTRFS file
> > system.
> >
> > Can anyone help shed some light on this please? Hopefully some
> > "quick" things to try, given my definition of "recently" above means
> > that most things take days or weeks, or even months for me to try.
> >
> > I have attached the usual debugging info requested. This is after
> > the bogus sds replaces sdc.
> >
>
> The first thing would be to check your system logs for signs of
> hardware problems (ATA errors). This sounds a lot like you've got a
> dodgy disk that needs to be replaced.

Just got to figure out which one; I thought I had already replaced the
dodgy one. There might be more than one. *sigh*

I'm guessing /dev/sdc.
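
A minimal sketch of how the log check could be narrowed down to a
specific disk: tally error-looking kernel log lines per device name.
The regular expressions are guesses at typical message wording ("I/O
error, dev sdX", "[sdX] ... Medium Error", btrfs "errs:" lines); actual
messages differ between kernel versions and drivers, so the patterns
would need adjusting against the real logs. The function name is
invented for illustration.

```python
import re
from collections import Counter

# Assumed patterns, not an exhaustive or authoritative list.
DEVICE_RE = re.compile(r"\b(sd[a-z]{1,2})\b")
ERROR_RE = re.compile(
    r"I/O error|failed command|exception Emask|medium error|errs:",
    re.IGNORECASE,
)


def count_errors_by_device(lines):
    """Count log lines that look like errors, grouped by sdX name."""
    counts = Counter()
    for line in lines:
        if not ERROR_RE.search(line):
            continue
        # Count each device at most once per line.
        for dev in set(DEVICE_RE.findall(line)):
            counts[dev] += 1
    return counts
```

Feeding it the lines of the kernel log (for example the output of dmesg
saved to a file) and printing counts.most_common() would put the
noisiest device at the top. That is only a hint about where to look, not
proof of which disk is failing.
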
--
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.