From: David Seikel <onefang@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: Oddness with phantom device replacing real device.
Date: Fri, 14 Aug 2015 09:25:34 +1000
Message-ID: <20150814092534.29e08bde.onefang@gmail.com>
In-Reply-To: <20150813095510.GP12976@carfax.org.uk>
On Thu, 13 Aug 2015 09:55:10 +0000 Hugo Mills <hugo@carfax.org.uk>
wrote:
> On Thu, Aug 13, 2015 at 01:33:22PM +1000, David Seikel wrote:
> > I don't actually think that this is a BTRFS problem, but it's
> > showing symptoms within BTRFS, and I have no other clues, so maybe
> > the BTRFS experts can help me figure out what is actually going
> > wrong.
> >
> > I'm a sysadmin working for a company that does scientific modelling.
> > They have many TBs of data. We use two servers running Ubuntu
> > 14.04 LTS to back up all of this data. One of them includes 16
> > spinning rust disks hooked to a RAID controller running in JBOD
> > mode (in other words, as far as Linux is concerned, they are just
> > 16 ordinary disks). They are /dev/sdc to /dev/sdr, all being used
> > as a single BTRFS file system.
> >
> > I have been having no end of trouble with this system recently.
> > Keep in mind that due to the huge amount of data we deal with, doing
> > anything takes a long time. So "recently" means "in the last
> > several months".
> >
> > My latest attempt to beat some sense into this server was to
> > upgrade it to the latest officially backported kernel from Ubuntu,
> > and compile my own copy of btrfs-progs from source code (latest
> > release from github). Then I recreated the 16 disk BTRFS file
> > system, and started the backup software running again, from
> > scratch. The next day, /dev/sdc has vanished, to be replaced by a
> > phantom /dev/sds. There's no such disk as /dev/sds. /dev/sds is
> > now included in the BTRFS file system replacing /dev/sdc. In /dev,
> > sdc does indeed vanish, and sds does indeed appear. This was
> > happening before. /dev/sds then starts to fill up with errors,
> > since no such disk actually exists.
>
> Sounds like the kind of behaviour when the disk has vanished from
> the system for long enough to drop out and be recreated by the
> driver. The renaming may (possibly) be down to a poor error-handling
> path in btrfs -- we see this happening on USB sometimes, where the
> original device node is still hung on to by the FS on a hardware
> error, and so when the device comes back it's given a different name.
So that part may need some fixing in BTRFS.
> > I don't know what is actually causing the problem. The disks are
> > in a hot swap backplane, and if I actually pulled sdc out, then it
> > would still be listed as part of the BTRFS file system, wouldn't it?
>
> With btrfs fi show, no, you'd get ** some devices missing ** in the
> output.
Which is different from what I'm getting.
> > If I
> > then were to plug some new disk into the same spot, it would not be
> > recognised as part of the file system?
>
> Correct... Unless the device had a superblock with the same UUID in
> it (like, say, the new device is just the old one reappearing
> again). In that case, udev would trigger a btrfs dev scan, and the
> "new" device would rejoin the FS -- probably a little out of date, but
> that would be caught by checksums and be fixed if you have redundancy
> in the storage.
But btrfs is thinking it's a different device, hence all the errors as
it gets confused.
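One way to check whether the phantom sds really is the same physical disk
as sdc would be to compare the persistent names that udev normally keeps
under /dev/disk/by-id, since those follow the disk rather than the sdX
letter. What follows is only a minimal sketch, assuming udev populates that
directory on this box (the entries may look different behind a RAID
controller in JBOD mode). The function name `persistent_names` and its
`by_id_dir` parameter are made up for illustration.

```python
import os


def persistent_names(by_id_dir="/dev/disk/by-id"):
    """Map kernel device names (e.g. 'sdc') to their by-id link names.

    Assumption: by_id_dir contains symlinks pointing at device nodes,
    which is how udev usually lays out /dev/disk/by-id.
    """
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue  # skip anything that is not a symlink
        # Resolve the link and keep only the final component, e.g. 'sdc'.
        target = os.path.basename(os.path.realpath(link))
        mapping.setdefault(target, []).append(name)
    return mapping
```

Saving the result while sdc is still present, then running it again once sds
has appeared, would show whether the by-id names that used to point at sdc
now point at sds.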
> > So assuming that the RAID
> > controller is getting confused and thinking that sdc has been
> > pulled, then replaced by sds, it should not be showing up as part
> > of the BTRFS file system? Or maybe there's a signature on sdc that
> > BTRFS notices makes it part of the file system, even though BTRFS
> > is now confused about its location?
>
> See above.
>
> > After a reboot, sdc returns and sds is gone again.
>
> Expected.
>
> > The RAID controller has recently been replaced, but there were
> > similar problems with the old one as well. A better model of RAID
> > controller was chosen this time.
> >
> > I've also not been able to complete a scrub on this system recently.
> > The really odd thing is that I get messages that the scrub has
> > aborted, yet the scrub continues, then much later (days later) the
> > scrub causes a kernel panic. The "aborted" happens some random
> > time into the scrub, but usually in the early part of the scrub.
> > Mind you, if BTRFS is completely confused due to a problem
> > elsewhere, then maybe this can be excused.
>
> I think that means that it's aborting on one device but continuing
> on all the others.
Ah, would be useful for scrub to say so, and point out which device/s
got aborted.
> > The other backup server is almost identical, though it has fewer
> > disks in the array. It doesn't have any issues with the BTRFS file
> > system.
> >
> > Can anyone help shed some light on this please? Hopefully some
> > "quick" things to try, given my definition of "recently" above means
> > that most things take days or weeks, or even months for me to try.
> >
> > I have attached the usual debugging info requested. This is after
> > the bogus sds replaces sdc.
> >
>
> The first thing would be to check your system logs for signs of
> hardware problems (ATA errors). This sounds a lot like you've got a
> dodgy disk that needs to be replaced.
Just gotta figure out which one. I thought I had already replaced the
dodgy one, but there might be more than one. *sigh*
I'm guessing /dev/sdc.
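To narrow it down from the logs, something like the following could tally
errors per device. It is a rough sketch, not a finished tool: the two
regular expressions assume the common kernel wording "I/O error, dev sdX,
sector N" and the btrfs per-device counter line "bdev /dev/sdX errs: wr N,
rd N, flush N, corrupt N, gen N", and may need adjusting to whatever this
kernel actually prints. The function name `summarise` is made up for
illustration.

```python
import re
from collections import Counter

# Assumed message formats; adjust to match the real log lines.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector")
BTRFS_ERRS = re.compile(
    r"bdev (/dev/\w+) errs: wr (\d+), rd (\d+), "
    r"flush (\d+), corrupt (\d+), gen (\d+)"
)


def summarise(lines):
    """Return (io_errors, btrfs_errs) gathered from an iterable of log lines.

    io_errors  : Counter of block-layer I/O error lines per device name.
    btrfs_errs : dict of device path -> (wr, rd, flush, corrupt, gen),
                 keeping the last line seen, since the counters are cumulative.
    """
    io_errors = Counter()
    btrfs_errs = {}
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            io_errors[match.group(1)] += 1
        match = BTRFS_ERRS.search(line)
        if match:
            counts = tuple(int(value) for value in match.groups()[1:])
            btrfs_errs[match.group(1)] = counts
    return io_errors, btrfs_errs
```

Feeding it an open kernel log file (on Ubuntu that is usually
/var/log/kern.log) and sorting the I/O error counter by count should show
whether the errors cluster on one disk or are spread across several.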
--
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.