Looking for some advice on best way to identify drives / recover from issues

linux-raid.vger.kernel.org archive mirror

* Looking for some advice on best way to identify drives / recover from issues
@ 2014-01-05 15:04 Dylan Distasio
  2014-01-05 15:44 ` Mark Knecht
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dylan Distasio @ 2014-01-05 15:04 UTC (permalink / raw)
  To: linux-raid

Hi all-

I've been fortunate enough to not have to email this august group for
advice regarding my mdadm arrays in quite a while, but am looking for
some suggestions.

I woke up this morning to something beeping in my headless Norco
server case at home (never a promising start to the morning).  I was
unable to ping the box which increased my dismay.  I proceeded to
perform a hard reboot, and still nothing on the ping.  At this point,
I plugged a monitor in to see what was happening on reboot.

Let me take a moment to provide details of my basic set up.  There are
three separate HD controllers being used in this box: the motherboard
headers, a supermicro PCI-X card (in a PCI slot), and a Highpoint
RocketRaid SAS controller used as JBOD.

I have a number of separate mdadm arrays tied to this physical box
that have been built over the years including a RAID6 one, a RAID10,
and 2 mirrors.

Unfortunately, I did not take the time to physically label the drives
in the box (there are close to 20) as I built these, and had been
meaning to, but life got in the way.  Since I have had no issues with
these arrays in a very long time, I don't even remember if I split
them across controllers or what.

So back to the reboot, I can see the motherboard drives showing up as
the POST runs through its paces.  I can then see what appears to be
the Supermicro drives showing up, but when the Highpoint controller
gets to its own internal boot screen, it hangs at detecting drives, and
I am unable to get into the controller card BIOS by hitting ctrl-H
(keyboard works though, as I can ctrl-alt-delete, so it is not locking
the PC).

So at this point, I don't know my point of failure.  I am guessing the
Highpoint flaked out though, especially since I now believe that was
the component beeping based on the PC restarting ok otherwise.

I am looking for advice on minimizing my risk of making things worse
as I attempt to identify which drives belong to which array.  The
RAID6 is my most immediate concern in getting back up and running.

My immediate thought was to disconnect all drives and then reconnect
them one by one from a motherboard header, and use:

mdadm --examine /dev/sdX1

Will that give me enough info to figure out which drive belongs to
which array?  Does anyone have any other suggestions?  I am not sure
of the current state of ANY of the arrays that were on this box, but I
don't want to make things worse by booting this system up with some
drives missing because I've unplugged them, and having a bad
situation get worse.


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 15:04 Looking for some advice on best way to identify drives / recover from issues Dylan Distasio
@ 2014-01-05 15:44 ` Mark Knecht
  2014-01-05 17:01   ` Dylan Distasio
  2014-01-05 16:33 ` Roger Heflin
  2014-01-05 18:34 ` Phil Turmel
  2 siblings, 1 reply; 11+ messages in thread
From: Mark Knecht @ 2014-01-05 15:44 UTC (permalink / raw)
  To: Dylan Distasio; +Cc: Linux-RAID

I'm reacting to nothing more than 20 drives, no documentation and beeping:

1) Are the beeps POST codes?

http://www.computerhope.com/beep.htm

2) Before making any physical changes I'd start by drawing an
accurate picture. Exactly what cables go to exactly what drives. Put a
label on each drive & each cable (masking tape/black pen, etc.) so
that if you do disassemble things you have a chance of getting it back
together later in the same configuration.

3) If I thought it was the High Point I'd likely just remove it from
its slot and try booting again. (Assuming it's not needed to boot.)

4) It's not clear to me from this email exactly what's required (if
anything) in terms of RAID to make machine boot but if I could boot
from nothing but what's attached to the MB then that's what I'd be
trying to do first. With 20 drives you could have a power supply
failing and the system isn't getting enough power to run 20 drives,
etc. Minimize as much as possible.

5) If you get to where you can boot then you should run smartctl on
each drive looking for any info. However I would understand if 20
drives over a bunch of years means not all drives support S.M.A.R.T.

6) Once you get it booting I'd run a check of any RAID that's included
at that point to ensure it hadn't been damaged and then look to add
things back in.
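
A rough sketch of point 5, assuming smartmontools is installed and the
drives show up as /dev/sd? devices. The check_drives name and the
SMARTCTL variable are made up for this example; "smartctl -H" is the
real health query. A non-zero exit status only means the drive deserves
a closer look, since it is also what you get from a drive that cannot
be opened or has no SMART support:

```shell
#!/bin/sh
# Hypothetical helper, not part of smartmontools.
# SMARTCTL can be overridden, e.g. to point at a different install path.
SMARTCTL=${SMARTCTL:-smartctl}

# check_drives DEVICE...: ask each drive for its overall health verdict.
check_drives() {
    for dev in "$@"; do
        if "$SMARTCTL" -H "$dev" >/dev/null 2>&1; then
            echo "$dev: no SMART failure reported"
        else
            # $? still holds the exit status of the smartctl call above
            echo "$dev: needs a closer look (smartctl exit status $?)"
        fi
    done
}

# Example (as root):  check_drives /dev/sd?
```

Anything flagged here is worth a full "smartctl -a" on that one device
before it goes back into an array.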

Good luck!

- Mark


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 15:44 ` Mark Knecht
@ 2014-01-05 17:01   ` Dylan Distasio
  2014-01-05 18:05     ` Mark Knecht
  0 siblings, 1 reply; 11+ messages in thread
From: Dylan Distasio @ 2014-01-05 17:01 UTC (permalink / raw)
  To: linux-raid

>
> I'm reacting to nothing more than 20 drives, no documentation and beeping:
>
> 1) Are the beeps POST codes?
>
> http://www.computerhope.com/beep.htm

No, I woke up to it beeping, but it POSTs fine.  I think the beeping
was probably coming from the Highpoint card indicating array failure.
It no longer beeps, but it also won't pick up any drives.
>
> 2) Before making any physical changes I'd start by drawing an
> accurate picture. Exactly what cables go to exactly what drives. Put a
> label on each drive & each cable (masking tape/black pen, etc.) so
> that if you do disassemble things you have a chance of getting it back
> together later in the same configuration.
>

Good idea.
> 3) If I thought it was the High Point I'd likely just remove it from
> its slot and try booting again. (Assuming it's not needed to boot.)
>
> 4) It's not clear to me from this email exactly what's required (if
> anything) in terms of RAID to make machine boot but if I could boot
> from nothing but what's attached to the MB then that's what I'd be
> trying to do first. With 20 drives you could have a power supply
> failing and the system isn't getting enough power to run 20 drives,
> etc. Minimize as much as possible.
>
> 5) If you get to where you can boot then you should run smartctl on
> each drive looking for any info. However I would understand if 20
> drives over a bunch of years means not all drives support S.M.A.R.T.
>
> 6) Once you get it booting I'd run a check of any RAID that's included
> at that point to ensure it hadn't been damaged and then look to add
> things back in.
>
> Good luck!
>
> - Mark

The machine will boot fine from what I can see, but I am concerned
with disconnecting some drives and letting it boot because it might
result in a degraded array assembling simply because I did not
reconnect the drives it was looking for when I remove the highpoint
from the picture.


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 17:01   ` Dylan Distasio
@ 2014-01-05 18:05     ` Mark Knecht
  0 siblings, 0 replies; 11+ messages in thread
From: Mark Knecht @ 2014-01-05 18:05 UTC (permalink / raw)
  To: Dylan Distasio; +Cc: Linux-RAID

On Sun, Jan 5, 2014 at 9:01 AM, Dylan Distasio <interzone@gmail.com> wrote:
>>
>> I'm reacting to nothing more than 20 drives, no documentation and beeping:
>>
>> 1) Are the beeps POST codes?
>>
>> http://www.computerhope.com/beep.htm
>
> No, I woke up to it beeping, but it POSTs fine.  I think the beeping
> was probably coming from the Highpoint card indicating array failure.
> It no longer beeps, but it also won't pick up any drives.

>
> The machine will boot fine from what I can see, but I am concerned
> with disconnecting some drives and letting it boot because it might
> result in a degraded array assembling simply because I did not
> reconnect the drives it was looking for when I remove the highpoint
> from the picture.

A couple of things. Again, I'm not an md expert so be careful, but
this is just informational:

1) You said you have a couple of RAIDs in this box but you didn't
describe them very completely. To the extent you can what are they?
Give all the info you can: RAID type, number of drives, which drives,
do the RAIDs use dev names (/dev/sda) or do you use UUIDs? What
controllers are connected to what RAIDs.

2) IMO if the machine is booting without picking up the drives
attached to the Highpoint then it would seem logical that you could
completely remove the Highpoint - actually take it out of the PCI slot
- and still boot the machine. If you can do that and get to a Linux
prompt then document and test what you have. After that's done and
someone more md knowledgeable checks in on what you are reporting then
you could look at the Highpoint drives and finally the Highpoint
controller.

Personally I wouldn't do anything in terms of assembling the RAID by
hand until I had LOTS of mdadm data (Examine & Detail analysis for all
drives and RAIDs). If you have another machine where you could test
drives then I suppose you could remove them one-at-a-time and run
smartctl on each one to determine if there's any obvious problem.

Good luck,
Mark


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 15:04 Looking for some advice on best way to identify drives / recover from issues Dylan Distasio
  2014-01-05 15:44 ` Mark Knecht
@ 2014-01-05 16:33 ` Roger Heflin
  2014-01-05 17:06   ` Dylan Distasio
  2014-01-05 18:34 ` Phil Turmel
  2 siblings, 1 reply; 11+ messages in thread
From: Roger Heflin @ 2014-01-05 16:33 UTC (permalink / raw)
  To: Dylan Distasio; +Cc: Linux RAID

The crude but simple way is this:

Get the machine up with all disks that will work.

dd if=/dev/mdX of=/dev/null on each array, noting which disks light
up, repeat on all arrays, same process can be done with each disk (dd
if=/dev/sdX of=/dev/null ) to see exactly what disk maps to where.
This trick is rather nice since it pretty much works with
everything...even if you have a hw raid controller and a failed disk,
that will be the one disk that never lights, so you can find the
failed one there also, just make sure that when done you have the
expected number of disks to not light up.

The biggest issue is that if the md's come up missing the 4 drives it
may complicate things with MD, though at worst that should require
some usage of the raw mdadm command to force things on after doing
this.
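
The dd trick above can be wrapped so it stops on its own instead of
reading a whole disk. This is only a sketch: the blink name and the
2048 MiB default are arbitrary choices for the example, and bs=1M
assumes GNU dd. Reading a raw /dev/sdX this way does not write to the
disk and does not need any array to be assembled; reading /dev/mdX only
works once that array is running:

```shell
#!/bin/sh
# blink DEVICE [MIB]: read the first MIB mebibytes of DEVICE and throw
# them away, so the matching activity LED lights up for a while.
blink() {
    dev=$1
    mib=${2:-2048}
    echo "Reading $mib MiB from $dev, watch the activity LEDs..."
    dd if="$dev" of=/dev/null bs=1M count="$mib" 2>/dev/null
}

# Example (as root):  blink /dev/sdc        # one physical disk
#                     blink /dev/md0 4096   # every member of an array
```

If a second pass over the same device no longer lights anything, the
data is probably coming from the page cache; a larger count or a
different starting disk gets around that.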


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 16:33 ` Roger Heflin
@ 2014-01-05 17:06   ` Dylan Distasio
  2014-01-05 19:37     ` Krzysztof Adamski
  2014-01-22 17:16     ` Dylan Distasio
  0 siblings, 2 replies; 11+ messages in thread
From: Dylan Distasio @ 2014-01-05 17:06 UTC (permalink / raw)
  To: linux-raid

Thanks for the trick.  The issue of complicating things with MD is
what I am concerned about.  I am afraid to boot the PC up with drives
missing (if for example I remove the highpoint controller) because it
may end up assembling an array with drives missing and degrading it
when it didn't need to be.

I'm really wishing I had labeled my drives now, since I don't know
which ones are part of which array physically, and don't want any
arrays to assemble until I do.  I was wondering if booting into a live
CD would be the way to go.  I need some way of checking which drive is
in which array without the risk of any arrays assembling.
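
One way to get labels without assembling anything, sketched under the
assumption that udev populates /dev/disk/by-id (most distributions and
live CDs do): the link names there carry the model and serial number
that are also printed on each drive's sticker, and each link points at
the kernel name. The list_drive_ids function and the BY_ID variable are
invented for this example, and drives behind a SAS controller may show
up under other prefixes than the two handled here:

```shell
#!/bin/sh
# Directory holding the persistent-name symlinks (override for testing).
BY_ID=${BY_ID:-/dev/disk/by-id}

# list_drive_ids: print "model_serial -> kernel device" for whole disks,
# skipping the per-partition links (names ending in -partN).
list_drive_ids() {
    for link in "$BY_ID"/ata-* "$BY_ID"/scsi-*; do
        [ -L "$link" ] || continue          # also skips unmatched globs
        case $link in *-part[0-9]*) continue ;; esac
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Example:  list_drive_ids
```

Matching the serial in the output against the sticker on the drive
gives the physical-to-/dev/sdX mapping, which can then go on the
masking-tape labels.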


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 17:06   ` Dylan Distasio
@ 2014-01-05 19:37     ` Krzysztof Adamski
  2014-01-22 17:16     ` Dylan Distasio
  1 sibling, 0 replies; 11+ messages in thread
From: Krzysztof Adamski @ 2014-01-05 19:37 UTC (permalink / raw)
  To: Dylan Distasio; +Cc: linux-raid

Boot with a Knoppix CD and examine the drives without MD starting. Do
this after you remove the Highpoint card.

This way you can figure out which drives belong to which array.
If you can, use lsdrv while running from Knoppix.
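
If lsdrv is not at hand, the same grouping can be approximated with
"mdadm --examine", which only reads the superblock of the device it is
given. A minimal sketch, assuming the members are partitions such as
/dev/sdX1; the two function names are invented for this example, and
the field labels matched are the ones mdadm prints for 0.90 and 1.x
metadata ("UUID" or "Array UUID", "Raid Level", "Events"):

```shell
#!/bin/sh
# summarize_member DEVICE: condense "mdadm --examine" output read on
# stdin into one line: array UUID, RAID level, event count, device.
summarize_member() {
    awk -v dev="$1" '
        # Return the field that follows the first lone ":" on the line.
        function val(   i) {
            for (i = 1; i < NF; i++) if ($i == ":") return $(i + 1)
            return ""
        }
        /^ *(Array )?UUID :/ { uuid = val() }    # not the "Device UUID" line
        /^ *Raid Level :/    { level = val() }
        /^ *Events :/        { events = val() }
        END { if (uuid != "") print uuid, level, events, dev }
    '
}

# list_members DEVICE...: examine each device and sort the result so
# members of the same array end up on adjacent lines. Devices without
# an md superblock produce no line.
list_members() {
    for dev in "$@"; do
        mdadm --examine "$dev" 2>/dev/null | summarize_member "$dev"
    done | sort
}

# Example (as root):  list_members /dev/sd?1
```

Members that share a UUID belong to the same array, and a member whose
event count lags behind its siblings is the one that dropped out first.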



* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 17:06   ` Dylan Distasio
  2014-01-05 19:37     ` Krzysztof Adamski
@ 2014-01-22 17:16     ` Dylan Distasio
  1 sibling, 0 replies; 11+ messages in thread
From: Dylan Distasio @ 2014-01-22 17:16 UTC (permalink / raw)
  To: linux-raid

I just wanted to provide an update on my situation for those
interested.  It might help someone in the future with my combination
of hardware.  I originally suspected my Highpoint controller was the
point of failure, and decided to get a LSI 9211 controller card to
swap it out with since they are pretty affordable on eBay and seem to
be decent low end controllers.  I figured that would be the easiest
troubleshooting first step based on my situation, especially since the
original controller was seizing up.

Anyways, I had time yesterday to swap it out after waiting for it to
arrive from Hong Kong.  When I rebooted, I got a grub error 15.  I'll
be honest, grub isn't my forte, but I imagined that it might have been
related to device order assignment, and that the new hardware had
confused it somehow.  I did some googling, and decided to roll the
dice with a boot repair live CD.  I went through the steps to install
a newer version of grub and could see that the tool had successfully
found the OS drive, so I let it finish, and rebooted.  At this point,
I now got a cryptic "No upper memory" error.  I was beginning to pull
out some of what little hair is left at that point.  I did some
additional googling and stumbled across some threads on Gigabyte
motherboards not playing nice with LSI 9211 cards...Doh!  I had heard
that updating to a Beta Bios had sometimes helped, and proceeded to
format a flash drive as a bootable DOS disk with the flash utility.
Of course, this is an older motherboard and I could not get it to boot
from the flash drive, and I don't have a floppy handy.

On to the next choice...I had a newer gigabyte mobo lying around with
a processor and ram already installed.  I swapped that in, and tried
the LSI card.  Immediately, I was greeted with the same upper memory
error.  Ugh!  I decided to flash that mobo with a beta bios as a long
shot.  I was able to do so with a flash drive.  I rebooted, and was
greeted with the grub menu!  It worked, the only problem now was that
I noticed the new LSI card was dropping 4 of the 8 drives.  At this
point, the lightbulb went off that it was probably NOT the highpoint
controller that was the point of failure.  I swapped that back in and
disconnected the mini-SAS cable to the problematic drives.  Sure
enough, it was working fine.  I realized at this point that there were
only two choices left as failure points.  The 4 drives (which I was
hoping was not the case, as losing that many at once would not have
been good), or the SATA backplane on my Norco 4020 case.  The
backplane is divided into 5 separate banks of SATA connectors, each
with their own power connection, that control 4 drives each.  I
proceeded to pull the 4 drives in their trays and hook them up
directly to the SATA end of the controller card.  I rebooted, and
success!  My arrays were all running successfully.

I am now working on trying to repair the backplane assuming I can swap
out the damaged sections.  I will need to pull apart and rewire this
entire case which won't be a fun project, but most importantly I got
to the root cause, and there doesn't appear to be any harm to the
arrays.  I am going to run a check on them shortly though.
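
For reference, a check like that can be started through the md sysfs
interface. This is a sketch rather than a record of what was actually
run here: the function names and the SYSFS variable are made up for the
example, and md0 stands in for whichever array is being checked. The
sync_action and mismatch_cnt files are the kernel's own:

```shell
#!/bin/sh
# Root of sysfs (override for testing).
SYSFS=${SYSFS:-/sys}

# start_check MD: ask the kernel to read and compare all members of
# array MD (for example md0). A check reports mismatches, it does not
# rewrite data.
start_check() {
    echo check > "$SYSFS/block/$1/md/sync_action"
}

# report_check MD: show the current action and the mismatch count.
report_check() {
    printf '%s: action=%s mismatch_cnt=%s\n' "$1" \
        "$(cat "$SYSFS/block/$1/md/sync_action")" \
        "$(cat "$SYSFS/block/$1/md/mismatch_cnt")"
}

# Example (as root):  start_check md0; report_check md0
```

Progress shows up in /proc/mdstat while the check runs; once the action
is back to idle, mismatch_cnt holds the result.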

>>> the POST runs through its paces.  I can then see what appears to be
>>> the Supermicro drives showing up, but when the Highpoint controller
>>> gets to it own internal boot screen, it hangs at detecting drives, and
>>> I am unable to get into the controller card BIOS by hitting ctrl-H
>>> (keyboard works though, as I can ctrl-alt-delete, so it is not locking
>>> the PC).
>>>
>>> So at this point, I don't know my point of failure.  I am guessing the
>>> Highpoint flaked out though, especially since I now believe that was
>>> the component beeping based on the PC restarting ok otherwise.
>>>
>>> I am looking for advice on minimizing my risk of making things worse
>>> as I attempt to identify what drives belong which with array.   The
>>> RAID6 is my most immediate concern in getting back up and running.
>>>
>>> My immediate thought was to disconnect all drives and then reconnect
>>> them one by one from a motherboard header, and use:
>>>
>>> mdadm --examine /dev/sdX1
>>>
>>> Will that give me enough info to figure out which drive belongs to
>>> which array?  Does anyone have any other suggestions?  I am not sure
>>> of the current state of ANY of the arrays that were on this box, but I
>>> don't want to make things worse by booting this system up with some
>>> drives missing because I've unplugged them, and having the a bad
>>> situation get worse.


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 15:04 Looking for some advice on best way to identify drives / recover from issues Dylan Distasio
  2014-01-05 15:44 ` Mark Knecht
  2014-01-05 16:33 ` Roger Heflin
@ 2014-01-05 18:34 ` Phil Turmel
  2014-01-06 15:57   ` John Stoffel
  2 siblings, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2014-01-05 18:34 UTC (permalink / raw)
  To: Dylan Distasio, linux-raid

Good afternoon Dylan,

On 01/05/2014 10:04 AM, Dylan Distasio wrote:
> Hi all-

[trim /]

> Unfortunately, I did not take the time to physically label the drives
> in the box (there are close to 20) as I built these, and had been
> meaning to, but life got in the way.  Since I have had no issues with
> these arrays in a very long time, I don't even remember if I split
> them across controllers or what.

[trim /]

> Will that give me enough info to figure out which drive belongs to
> which array?  Does anyone have any other suggestions?  I am not sure
> of the current state of ANY of the arrays that were on this box, but I
> don't want to make things worse by booting this system up with some
> drives missing because I've unplugged them, and having the a bad
> situation get worse.

I created a script for precisely this type of documentation task, keyed
to drive serial numbers and UUIDs wherever identifiable.

https://github.com/pturmel/lsdrv

(If any of your drive serial numbers are entirely numeric, you'll want
the patch shown in the open issues)
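
As a rough illustration of the same idea (this is not lsdrv's code), the
sketch below pairs each member device with its drive serial number and
md array UUID.  It assumes udevadm and mdadm are installed and that it is
run as root; the function names and /dev/sdb1 are placeholders.  Both
commands only read metadata, so nothing is assembled.

```shell
#!/bin/sh
# Sketch only: print "device serial array-uuid" for one member device.

# Extract the array UUID from `mdadm --examine` output on stdin.
# v1.x superblocks label it "Array UUID", 0.90 superblocks just "UUID".
array_uuid() {
    sed -n 's/^ *\(Array \)\{0,1\}UUID : *\([^ ]*\).*/\2/p' | head -n 1
}

# Extract the short serial from `udevadm info --query=property` output.
serial_from_udev() {
    sed -n 's/^ID_SERIAL_SHORT=//p' | head -n 1
}

describe_member() {
    # $1 is a member device such as /dev/sdb1 (placeholder name).
    serial="$(udevadm info --query=property --name="$1" | serial_from_udev)"
    uuid="$(mdadm --examine "$1" 2>/dev/null | array_uuid)"
    printf '%s %s %s\n' "$1" "${serial:-unknown}" "${uuid:-no-md-superblock}"
}
```

Running "describe_member /dev/sdb1" for each candidate partition gives
one line per drive; members that share an array UUID belong together.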

HTH,

Phil



* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-05 18:34 ` Phil Turmel
@ 2014-01-06 15:57   ` John Stoffel
  2014-01-06 16:54     ` Phil Turmel
  0 siblings, 1 reply; 11+ messages in thread
From: John Stoffel @ 2014-01-06 15:57 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Dylan Distasio, linux-raid

>>>>> "Phil" == Phil Turmel <philip@turmel.org> writes:

Phil> I created a script for precisely this type of documentation task, keyed
Phil> to drive serial numbers and UUIDs wherever identifiable.

Phil> https://github.com/pturmel/lsdrv

Phil,

Thanks for the script, it looks good, but I wanted to poke you about
the continuation and corner vars, which are defined with funky graphic
chars.  Would it be hard to put in simpler plain ASCII graphics there
by default, and offer a switch for UTF-8 (???) output?

Also, I wonder if having it go the other way, from mount point down
to the device(s), would make sense as well?  Now I just need
to find the time to hack your code and see what I can do.

Thanks!
John


* Re: Looking for some advice on best way to identify drives / recover from issues
  2014-01-06 15:57   ` John Stoffel
@ 2014-01-06 16:54     ` Phil Turmel
  0 siblings, 0 replies; 11+ messages in thread
From: Phil Turmel @ 2014-01-06 16:54 UTC (permalink / raw)
  To: John Stoffel; +Cc: Dylan Distasio, linux-raid

On 01/06/2014 10:57 AM, John Stoffel wrote:
>>>>>> "Phil" == Phil Turmel <philip@turmel.org> writes:
> 
> Phil> I created a script for precisely this type of documentation task, keyed
> Phil> to drive serial numbers and UUIDs wherever identifiable.
> 
> Phil> https://github.com/pturmel/lsdrv
> 
> Phil,
> 
> Thanks for the script, it looks good, but I wanted to poke you about
> the continuation and corner vars, which are defined with funky graphic
> chars.  Would it be hard to put in simpler plain ASCII graphics there
> by default, and offer a switch for UTF-8 (???) output?

I'll take patches that make the script locale-aware using python2's
normal methods.  I hadn't bothered to figure out how to do so since
distros have been installing with utf-8 by default for years now.  I do
not want to give up the utf-8 line drawing characters for the majority.
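
To make the request concrete, here is the decision sketched in shell
rather than in lsdrv's Python.  The function and variable names are
illustrative and not taken from the script; the only real interface
assumed is the POSIX "locale charmap" command.

```shell
#!/bin/sh
# Sketch: choose tree-drawing characters from the active character map.
# Names are illustrative, not taken from lsdrv.
pick_tree_chars() {
    # $1 is a character map name, e.g. the output of `locale charmap`.
    case "$1" in
        UTF-8|utf-8|utf8) corner='└─'; cont='├─' ;;
        *)                corner='`-'; cont='|-' ;;
    esac
}
```

A caller would run pick_tree_chars "$(locale charmap 2>/dev/null)" once
at startup, so UTF-8 systems keep the line-drawing characters and
everything else falls back to plain ASCII.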

> Also, I wonder if having it going the other way, which is from mount
> point down to the device(s) would make sense as well?  Now I just need
> to find the time to hack your code and see what I can do.

Shouldn't be hard.

I chose not to go in that direction as it would leave out unmounted
devices.  My intent is to document the relationships amongst everything
present (with SNs and UUIDs) to the fullest extent possible.

Thanks for the feedback.

Phil


Thread overview: 11+ messages
2014-01-05 15:04 Looking for some advice on best way to identify drives / recover from issues Dylan Distasio
2014-01-05 15:44 ` Mark Knecht
2014-01-05 17:01   ` Dylan Distasio
2014-01-05 18:05     ` Mark Knecht
2014-01-05 16:33 ` Roger Heflin
2014-01-05 17:06   ` Dylan Distasio
2014-01-05 19:37     ` Krzysztof Adamski
2014-01-22 17:16     ` Dylan Distasio
2014-01-05 18:34 ` Phil Turmel
2014-01-06 15:57   ` John Stoffel
2014-01-06 16:54     ` Phil Turmel
