linux-btrfs.vger.kernel.org archive mirror
* Oddness with phantom device replacing real device.
@ 2015-08-13  3:33 David Seikel
  2015-08-13  9:55 ` Hugo Mills
  0 siblings, 1 reply; 3+ messages in thread
From: David Seikel @ 2015-08-13  3:33 UTC (permalink / raw)
  To: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3427 bytes --]

I don't actually think that this is a BTRFS problem, but it's showing
symptoms within BTRFS, and I have no other clues, so maybe the BTRFS
experts can help me figure out what is actually going wrong.

I'm a sysadmin working for a company that does scientific modelling.
They have many TBs of data.  We use two servers running Ubuntu 14.04 LTS
to back up all of this data.  One of them includes 16 spinning rust
disks hooked to a RAID controller running in JBOD mode (in other words,
as far as Linux is concerned, they are just 16 ordinary disks).  They
are /dev/sdc to /dev/sdr, all being used as a single BTRFS file system.

I have been having no end of trouble with this system recently.  Keep
in mind that due to the huge amount of data we deal with, doing
anything takes a long time.  So "recently" means "in the last several
months".

My latest attempt to beat some sense into this server was to upgrade it
to the latest officially backported kernel from Ubuntu, and compile my
own copy of btrfs-progs from source code (latest release from github).
Then I recreated the 16 disk BTRFS file system, and started the backup
software running again, from scratch.  The next day, /dev/sdc had
vanished, replaced by a phantom /dev/sds.  There's no such disk
as /dev/sds, yet /dev/sds is now included in the BTRFS file system
in place of /dev/sdc.  In /dev, sdc does indeed vanish, and sds does
indeed appear.  This was happening before the rebuild as well.
/dev/sds then starts to rack up errors, since no such disk actually
exists.
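
Something like the following shows the mismatch (just a sketch, using
the device names above; the btrfs-progs path is from my install):

  # Which device nodes actually exist right now:
  ls -l /dev/sdc /dev/sds
  # What the kernel's block layer thinks:
  grep -E 'sd[cs]' /proc/partitions
  # What BTRFS thinks:
  /opt/btrfs-progs/bin/btrfs fi show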

I don't know what is actually causing the problem.  The disks are in a
hot swap backplane, and if I actually pulled sdc out, then it would
still be listed as part of the BTRFS file system, wouldn't it?  If I
were then to plug some new disk into the same spot, it would not be
recognised as part of the file system, would it?  So assuming that the
RAID controller is getting confused and thinking that sdc has been
pulled, then replaced by sds, the new device should not be showing up
as part of the BTRFS file system?  Or maybe there's a signature on sdc
that BTRFS notices and that makes it part of the file system, even
though BTRFS is now confused about its location?
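
If it helps, the signature theory can be checked with something like
this (a sketch; blkid is standard, and btrfs-show-super ships with
btrfs-progs of this vintage -- the install path is my guess):

  # Print the filesystem UUID and TYPE="btrfs" if a superblock is present:
  blkid /dev/sdc
  # Dump the superblock, including filesystem and device UUIDs:
  /opt/btrfs-progs/bin/btrfs-show-super /dev/sdc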

After a reboot, sdc returns and sds is gone again.

The RAID controller has recently been replaced, but there were similar
problems with the old one as well.  A better model of RAID controller
was chosen this time.

I've also not been able to complete a scrub on this system recently.
The really odd thing is that I get messages that the scrub has aborted,
yet the scrub continues, then much later (days later) the scrub causes
a kernel panic.  The "aborted" happens some random time into the scrub,
but usually in the early part of the scrub.  Mind you, if BTRFS is
completely confused due to a problem elsewhere, then maybe this can be
excused.

The other backup server is almost identical, though it has fewer disks
in the array.  It doesn't have any issues with the BTRFS file system.

Can anyone help shed some light on this, please?  Hopefully with some
"quick" things to try, given that my definition of "recently" above
means most things take days, weeks, or even months for me to try.

I have attached the usual debugging info requested.  This is after the
bogus sds replaces sdc.

-- 
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.

[-- Attachment #1.2: onefang_btrfs_details.txt --]
[-- Type: text/plain, Size: 7275 bytes --]

> uname -a
Linux walker 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


> /opt/btrfs-progs/bin/btrfs --version
btrfs-progs v4.1.2


>  /opt/btrfs-progs/bin/btrfs fi show
Label: none  uuid: 901d40d5-5881-468d-b07e-bfda80b20525
	Total devices 2 FS bytes used 24.79GiB
	devid    1 size 29.13GiB used 29.13GiB path /dev/sda1
	devid    2 size 29.13GiB used 29.13GiB path /dev/sdb1

Label: none  uuid: a017a1f7-8a09-4427-8f4b-25fe39dd3a61
	Total devices 16 FS bytes used 952.21GiB
	devid    1 size 2.73TiB used 20.00MiB path /dev/sds
	devid    2 size 2.73TiB used 0.00B path /dev/sdd
	devid    3 size 2.73TiB used 0.00B path /dev/sde
	devid    4 size 2.73TiB used 0.00B path /dev/sdf
	devid    5 size 2.73TiB used 0.00B path /dev/sdg
	devid    6 size 2.73TiB used 0.00B path /dev/sdh
	devid    7 size 2.73TiB used 0.00B path /dev/sdi
	devid    8 size 2.73TiB used 0.00B path /dev/sdj
	devid    9 size 3.64TiB used 636.00GiB path /dev/sdk
	devid   10 size 2.73TiB used 0.00B path /dev/sdl
	devid   11 size 2.73TiB used 1.00GiB path /dev/sdm
	devid   12 size 2.73TiB used 1.00GiB path /dev/sdn
	devid   13 size 2.73TiB used 1.00GiB path /dev/sdo
	devid   14 size 2.73TiB used 1.00GiB path /dev/sdp
	devid   15 size 3.64TiB used 635.01GiB path /dev/sdq
	devid   16 size 3.64TiB used 635.01GiB path /dev/sdr


> /opt/btrfs-progs/bin/btrfs fi df /var/lib/backuppc
btrfs-progs v4.1.2
Data, RAID1: total=951.00GiB, used=950.04GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=160.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=4.00GiB, used=2.63GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


> dmesg
[60174.590577] BTRFS: bdev /dev/sds errs: wr 1920, rd 0, flush 640, corrupt 0, gen 0
[60174.613263] BTRFS: lost page write due to I/O error on /dev/sds
[60174.613265] BTRFS: bdev /dev/sds errs: wr 1921, rd 0, flush 640, corrupt 0, gen 0
[60174.635281] BTRFS: lost page write due to I/O error on /dev/sds
[60174.635282] BTRFS: bdev /dev/sds errs: wr 1922, rd 0, flush 640, corrupt 0, gen 0
[60174.659822] BTRFS: lost page write due to I/O error on /dev/sds
[60174.659824] BTRFS: bdev /dev/sds errs: wr 1923, rd 0, flush 640, corrupt 0, gen 0
[60206.196739] BTRFS: bdev /dev/sds errs: wr 1923, rd 0, flush 641, corrupt 0, gen 0
[60206.219317] BTRFS: lost page write due to I/O error on /dev/sds
[60206.219321] BTRFS: bdev /dev/sds errs: wr 1924, rd 0, flush 641, corrupt 0, gen 0
[60206.241866] BTRFS: lost page write due to I/O error on /dev/sds
[60206.241870] BTRFS: bdev /dev/sds errs: wr 1925, rd 0, flush 641, corrupt 0, gen 0
[60206.265205] BTRFS: lost page write due to I/O error on /dev/sds
[60206.265208] BTRFS: bdev /dev/sds errs: wr 1926, rd 0, flush 641, corrupt 0, gen 0
[60237.102648] BTRFS: bdev /dev/sds errs: wr 1926, rd 0, flush 642, corrupt 0, gen 0
[60237.125815] BTRFS: lost page write due to I/O error on /dev/sds
[60237.125819] BTRFS: bdev /dev/sds errs: wr 1927, rd 0, flush 642, corrupt 0, gen 0
[60237.148393] BTRFS: lost page write due to I/O error on /dev/sds
[60237.148398] BTRFS: bdev /dev/sds errs: wr 1928, rd 0, flush 642, corrupt 0, gen 0
[60237.170912] BTRFS: lost page write due to I/O error on /dev/sds
[60237.170917] BTRFS: bdev /dev/sds errs: wr 1929, rd 0, flush 642, corrupt 0, gen 0
[60268.100120] BTRFS: bdev /dev/sds errs: wr 1929, rd 0, flush 643, corrupt 0, gen 0
[60268.123432] BTRFS: lost page write due to I/O error on /dev/sds
[60268.123435] BTRFS: bdev /dev/sds errs: wr 1930, rd 0, flush 643, corrupt 0, gen 0
[60268.145911] BTRFS: lost page write due to I/O error on /dev/sds
[60268.145915] BTRFS: bdev /dev/sds errs: wr 1931, rd 0, flush 643, corrupt 0, gen 0
[60268.168411] BTRFS: lost page write due to I/O error on /dev/sds
[60268.168415] BTRFS: bdev /dev/sds errs: wr 1932, rd 0, flush 643, corrupt 0, gen 0
[60299.811546] BTRFS: bdev /dev/sds errs: wr 1932, rd 0, flush 644, corrupt 0, gen 0
[60299.833913] BTRFS: lost page write due to I/O error on /dev/sds
[60299.833918] BTRFS: bdev /dev/sds errs: wr 1933, rd 0, flush 644, corrupt 0, gen 0
[60299.856655] BTRFS: lost page write due to I/O error on /dev/sds
[60299.856659] BTRFS: bdev /dev/sds errs: wr 1934, rd 0, flush 644, corrupt 0, gen 0
[60299.878772] BTRFS: lost page write due to I/O error on /dev/sds
[60299.878776] BTRFS: bdev /dev/sds errs: wr 1935, rd 0, flush 644, corrupt 0, gen 0
[60330.940527] BTRFS: bdev /dev/sds errs: wr 1935, rd 0, flush 645, corrupt 0, gen 0
[60330.964429] BTRFS: lost page write due to I/O error on /dev/sds
[60330.964434] BTRFS: bdev /dev/sds errs: wr 1936, rd 0, flush 645, corrupt 0, gen 0
[60330.987095] BTRFS: lost page write due to I/O error on /dev/sds
[60330.987099] BTRFS: bdev /dev/sds errs: wr 1937, rd 0, flush 645, corrupt 0, gen 0
[60331.009462] BTRFS: lost page write due to I/O error on /dev/sds
[60331.009466] BTRFS: bdev /dev/sds errs: wr 1938, rd 0, flush 645, corrupt 0, gen 0
[60361.875654] BTRFS: bdev /dev/sds errs: wr 1938, rd 0, flush 646, corrupt 0, gen 0
[60361.898426] BTRFS: lost page write due to I/O error on /dev/sds
[60361.898431] BTRFS: bdev /dev/sds errs: wr 1939, rd 0, flush 646, corrupt 0, gen 0
[60361.922188] BTRFS: lost page write due to I/O error on /dev/sds
[60361.922192] BTRFS: bdev /dev/sds errs: wr 1940, rd 0, flush 646, corrupt 0, gen 0
[60361.944643] BTRFS: lost page write due to I/O error on /dev/sds
[60361.944647] BTRFS: bdev /dev/sds errs: wr 1941, rd 0, flush 646, corrupt 0, gen 0
[60393.246924] BTRFS: bdev /dev/sds errs: wr 1941, rd 0, flush 647, corrupt 0, gen 0
[60393.269525] BTRFS: lost page write due to I/O error on /dev/sds
[60393.269529] BTRFS: bdev /dev/sds errs: wr 1942, rd 0, flush 647, corrupt 0, gen 0
[60393.292396] BTRFS: lost page write due to I/O error on /dev/sds
[60393.292401] BTRFS: bdev /dev/sds errs: wr 1943, rd 0, flush 647, corrupt 0, gen 0
[60393.314915] BTRFS: lost page write due to I/O error on /dev/sds
[60393.314919] BTRFS: bdev /dev/sds errs: wr 1944, rd 0, flush 647, corrupt 0, gen 0
[60424.085923] BTRFS: bdev /dev/sds errs: wr 1944, rd 0, flush 648, corrupt 0, gen 0
[60424.109367] BTRFS: lost page write due to I/O error on /dev/sds
[60424.109371] BTRFS: bdev /dev/sds errs: wr 1945, rd 0, flush 648, corrupt 0, gen 0
[60424.131817] BTRFS: lost page write due to I/O error on /dev/sds
[60424.131819] BTRFS: bdev /dev/sds errs: wr 1946, rd 0, flush 648, corrupt 0, gen 0
[60424.154242] BTRFS: lost page write due to I/O error on /dev/sds
[60424.154246] BTRFS: bdev /dev/sds errs: wr 1947, rd 0, flush 648, corrupt 0, gen 0
[60454.996199] BTRFS: bdev /dev/sds errs: wr 1947, rd 0, flush 649, corrupt 0, gen 0
[60455.019418] BTRFS: lost page write due to I/O error on /dev/sds
[60455.019423] BTRFS: bdev /dev/sds errs: wr 1948, rd 0, flush 649, corrupt 0, gen 0
[60455.042932] BTRFS: lost page write due to I/O error on /dev/sds
[60455.042937] BTRFS: bdev /dev/sds errs: wr 1949, rd 0, flush 649, corrupt 0, gen 0
[60455.067661] BTRFS: lost page write due to I/O error on /dev/sds
[60455.067665] BTRFS: bdev /dev/sds errs: wr 1950, rd 0, flush 649, corrupt 0, gen 0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]


* Re: Oddness with phantom device replacing real device.
  2015-08-13  3:33 Oddness with phantom device replacing real device David Seikel
@ 2015-08-13  9:55 ` Hugo Mills
  2015-08-13 23:25   ` David Seikel
  0 siblings, 1 reply; 3+ messages in thread
From: Hugo Mills @ 2015-08-13  9:55 UTC (permalink / raw)
  To: David Seikel; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4790 bytes --]

On Thu, Aug 13, 2015 at 01:33:22PM +1000, David Seikel wrote:
> I don't actually think that this is a BTRFS problem, but it's showing
> symptoms within BTRFS, and I have no other clues, so maybe the BTRFS
> experts can help me figure out what is actually going wrong.
> 
> I'm a sysadmin working for a company that does scientific modelling.
> They have many TBs of data.  We use two servers running Ubuntu 14.04 LTS
> to back up all of this data.  One of them includes 16 spinning rust
> disks hooked to a RAID controller running in JBOD mode (in other words,
> as far as Linux is concerned, they are just 16 ordinary disks).  They
> are /dev/sdc to /dev/sdr, all being used as a single BTRFS file system.
> 
> I have been having no end of trouble with this system recently.  Keep
> in mind that due to the huge amount of data we deal with, doing
> anything takes a long time.  So "recently" means "in the last several
> months".
> 
> My latest attempt to beat some sense into this server was to upgrade it
> to the latest officially backported kernel from Ubuntu, and compile my
> own copy of btrfs-progs from source code (latest release from github).
> Then I recreated the 16 disk BTRFS file system, and started the backup
> software running again, from scratch.  The next day, /dev/sdc had
> vanished, replaced by a phantom /dev/sds.  There's no such disk
> as /dev/sds, yet /dev/sds is now included in the BTRFS file system
> in place of /dev/sdc.  In /dev, sdc does indeed vanish, and sds does
> indeed appear.  This was happening before the rebuild as well.
> /dev/sds then starts to rack up errors, since no such disk actually
> exists.

   Sounds like the kind of behaviour you get when the disk has
vanished from the system for long enough to drop out and be recreated
by the driver. The renaming may (possibly) be down to a poor
error-handling path in btrfs -- we see this happening on USB
sometimes, where the FS hangs on to the original device node after a
hardware error, and so when the device comes back it's given a
different name.
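
   A quick way to look for that in the logs is something like this (a
sketch; on Ubuntu 14.04 kernel messages also land in
/var/log/kern.log, which survives dmesg buffer wrap):

  # Driver-level resets, drops and re-adds for the suspect devices:
  dmesg | grep -E 'sd[cs]|ata[0-9]+'
  # The same, from the persistent kernel log:
  grep -E 'sd[cs]' /var/log/kern.log | tail -50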

> I don't know what is actually causing the problem.  The disks are in a
> hot swap backplane, and if I actually pulled sdc out, then it would
> still be listed as part of the BTRFS file system, wouldn't it?

   With btrfs fi show, no, you'd get ** some devices missing ** in the
output.

>  If I
> were then to plug some new disk into the same spot, it would not be
> recognised as part of the file system, would it?

   Correct... Unless the device had a superblock with the same UUID in
it (like, say, the new device is just the old one reappearing
again). In that case, udev would trigger a btrfs dev scan, and the
"new" device would rejoin the FS -- probably a little out of date, but
that would be caught by checksums and be fixed if you have redundancy
in the storage.
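
   You can trigger the same rescan by hand to test that theory (a
sketch; both commands are standard):

  # Rescan all block devices for btrfs superblocks:
  btrfs device scan
  # Or replay the udev block-device events instead:
  udevadm trigger --subsystem-match=block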

>  So assuming that the RAID
> controller is getting confused and thinking that sdc has been pulled,
> then replaced by sds, the new device should not be showing up as part
> of the BTRFS file system?  Or maybe there's a signature on sdc that
> BTRFS notices and that makes it part of the file system, even though
> BTRFS is now confused about its location?

   See above.

> After a reboot, sdc returns and sds is gone again.

   Expected.

> The RAID controller has recently been replaced, but there were similar
> problems with the old one as well.  A better model of RAID controller
> was chosen this time.
> 
> I've also not been able to complete a scrub on this system recently.
> The really odd thing is that I get messages that the scrub has aborted,
> yet the scrub continues, then much later (days later) the scrub causes
> a kernel panic.  The "aborted" happens some random time into the scrub,
> but usually in the early part of the scrub.  Mind you, if BTRFS is
> completely confused due to a problem elsewhere, then maybe this can be
> excused.

   I think that means that it's aborting on one device but continuing
on all the others.
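
   You can check that with per-device scrub statistics (a sketch,
using your backuppc mount point; -d prints one line per device):

  btrfs scrub status -d /var/lib/backuppc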

> The other backup server is almost identical, though it has fewer disks
> in the array.  It doesn't have any issues with the BTRFS file system.
> 
> Can anyone help shed some light on this, please?  Hopefully with some
> "quick" things to try, given that my definition of "recently" above
> means most things take days, weeks, or even months for me to try.
> 
> I have attached the usual debugging info requested.  This is after the
> bogus sds replaces sdc.
> 

   The first thing would be to check your system logs for signs of
hardware problems (ATA errors). This sounds a lot like you've got a
dodgy disk that needs to be replaced.
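
   Something like this is the usual first pass (a sketch; smartctl is
in the smartmontools package, and behind some RAID controllers it
needs an extra -d device-type option):

  # Link resets, timeouts and I/O errors:
  dmesg | grep -iE 'ata[0-9]+|i/o error|reset'
  # Classic dying-disk counters on the suspect device:
  smartctl -a /dev/sdc | grep -iE 'reallocated|pending|uncorrect'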

   Hugo.

-- 
Hugo Mills             | A gentleman doesn't do damage unless he's paid for
hugo@... carfax.org.uk | it.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                            Juri Papay

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: Oddness with phantom device replacing real device.
  2015-08-13  9:55 ` Hugo Mills
@ 2015-08-13 23:25   ` David Seikel
  0 siblings, 0 replies; 3+ messages in thread
From: David Seikel @ 2015-08-13 23:25 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5469 bytes --]

On Thu, 13 Aug 2015 09:55:10 +0000 Hugo Mills <hugo@carfax.org.uk>
wrote:

> On Thu, Aug 13, 2015 at 01:33:22PM +1000, David Seikel wrote:
> > I don't actually think that this is a BTRFS problem, but it's
> > showing symptoms within BTRFS, and I have no other clues, so maybe
> > the BTRFS experts can help me figure out what is actually going
> > wrong.
> > 
> > I'm a sysadmin working for a company that does scientific modelling.
> > They have many TBs of data.  We use two servers running Ubuntu
> > 14.04 LTS to back up all of this data.  One of them includes 16
> > spinning rust disks hooked to a RAID controller running in JBOD
> > mode (in other words, as far as Linux is concerned, they are just
> > 16 ordinary disks).  They are /dev/sdc to /dev/sdr, all being used
> > as a single BTRFS file system.
> > 
> > I have been having no end of trouble with this system recently.
> > Keep in mind that due to the huge amount of data we deal with, doing
> > anything takes a long time.  So "recently" means "in the last
> > several months".
> > 
> > My latest attempt to beat some sense into this server was to
> > upgrade it to the latest officially backported kernel from Ubuntu,
> > and compile my own copy of btrfs-progs from source code (latest
> > release from github). Then I recreated the 16 disk BTRFS file
> > system, and started the backup software running again, from
> > scratch.  The next day, /dev/sdc had vanished, replaced by a
> > phantom /dev/sds.  There's no such disk as /dev/sds, yet /dev/sds is
> > now included in the BTRFS file system in place of /dev/sdc.  In
> > /dev, sdc does indeed vanish, and sds does indeed appear.  This was
> > happening before the rebuild as well.  /dev/sds then starts to rack
> > up errors, since no such disk actually exists.
> 
>    Sounds like the kind of behaviour you get when the disk has
> vanished from the system for long enough to drop out and be recreated
> by the driver. The renaming may (possibly) be down to a poor
> error-handling path in btrfs -- we see this happening on USB
> sometimes, where the FS hangs on to the original device node after a
> hardware error, and so when the device comes back it's given a
> different name.

So that part may need some fixing in BTRFS.

> > I don't know what is actually causing the problem.  The disks are
> > in a hot swap backplane, and if I actually pulled sdc out, then it
> > would still be listed as part of the BTRFS file system, wouldn't it?
> 
>    With btrfs fi show, no, you'd get ** some devices missing ** in the
> output.

Which is different from what I'm getting.

> >  If I
> > were then to plug some new disk into the same spot, it would not be
> > recognised as part of the file system, would it?
> 
>    Correct... Unless the device had a superblock with the same UUID in
> it (like, say, the new device is just the old one reappearing
> again). In that case, udev would trigger a btrfs dev scan, and the
> "new" device would rejoin the FS -- probably a little out of date, but
> that would be caught by checksums and be fixed if you have redundancy
> in the storage.

But btrfs thinks it's a different device, hence all the errors as it
gets confused.

> >  So assuming that the RAID
> > controller is getting confused and thinking that sdc has been
> > pulled, then replaced by sds, the new device should not be showing
> > up as part of the BTRFS file system?  Or maybe there's a signature
> > on sdc that BTRFS notices and that makes it part of the file
> > system, even though BTRFS is now confused about its location?
> 
>    See above.
> 
> > After a reboot, sdc returns and sds is gone again.
> 
>    Expected.
> 
> > The RAID controller has recently been replaced, but there were
> > similar problems with the old one as well.  A better model of RAID
> > controller was chosen this time.
> > 
> > I've also not been able to complete a scrub on this system recently.
> > The really odd thing is that I get messages that the scrub has
> > aborted, yet the scrub continues, then much later (days later) the
> > scrub causes a kernel panic.  The "aborted" happens some random
> > time into the scrub, but usually in the early part of the scrub.
> > Mind you, if BTRFS is completely confused due to a problem
> > elsewhere, then maybe this can be excused.
> 
>    I think that means that it's aborting on one device but continuing
> on all the others.

Ah, it would be useful for scrub to say so, and to point out which
device(s) got aborted.

> > The other backup server is almost identical, though it has fewer
> > disks in the array.  It doesn't have any issues with the BTRFS file
> > system.
> > 
> > Can anyone help shed some light on this, please?  Hopefully with
> > some "quick" things to try, given that my definition of "recently"
> > above means most things take days, weeks, or even months for me to
> > try.
> > 
> > I have attached the usual debugging info requested.  This is after
> > the bogus sds replaces sdc.
> > 
> 
>    The first thing would be to check your system logs for signs of
> hardware problems (ATA errors). This sounds a lot like you've got a
> dodgy disk that needs to be replaced.

Just gotta figure out which one; I thought I'd already replaced the
dodgy one.  Might be more than one.  Sigh.

I'm guessing /dev/sdc.
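
Something like this over the whole array should narrow it down (a
rough sketch; smartctl may need a -d device-type option behind this
RAID controller):

  for d in /dev/sd[c-r]; do
      echo "== $d =="
      # Overall SMART health verdict for each disk:
      smartctl -H "$d" | grep -iE 'result|status'
  done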

-- 
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

