* Device naming and raid1
@ 2008-08-26 15:32 Tony Coffman
2008-08-27 11:11 ` David Lethe
0 siblings, 1 reply; 8+ messages in thread
From: Tony Coffman @ 2008-08-26 15:32 UTC (permalink / raw)
To: linux-raid
I have a CentOS 5 box running a software raid-1 set on a pair of SATA
drives.
The SATA controller or driver has a flaw.
Every 150 days or so, one of the two drives will experience errors and fail.
Subsequent tests always show the drive and cable to be ok. We bought a
couple of replacement drives before we figured that out :-(
On the last event this weekend, I went searching for a way to get the
raid back online with no host downtime. I found the technique that
deletes the drive and then brings it back online with a bus scan using
the /sys filesystem delete and rescan entities.
I didn't realize that you could also perform a rescan on a single LUN.
I'll have to use that next time.
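As a minimal sketch of the delete and rescan technique described above (not
taken from the original post): on 2.6 kernels a SCSI/SATA device is normally
detached by writing 1 to its sysfs delete file and found again by writing a
channel/target/LUN triple to the host's scan file, where "-" is a wildcard.
The helper names below are hypothetical, and host0 and 0:0:0:0 are example
addresses only. The script only prints the writes it would perform; actually
performing them needs root and takes a live disk offline.

```python
from pathlib import Path
from typing import Tuple

SYSFS = Path("/sys")


def delete_step(hctl: str) -> Tuple[Path, str]:
    """File and value that detach the device at host:channel:target:lun."""
    return SYSFS / "class" / "scsi_device" / hctl / "device" / "delete", "1"


def scan_step(host: int, channel: str = "-", target: str = "-",
              lun: str = "-") -> Tuple[Path, str]:
    """File and value that rescan a host adapter; "-" matches anything."""
    path = SYSFS / "class" / "scsi_host" / f"host{host}" / "scan"
    return path, f"{channel} {target} {lun}"


def describe(step: Tuple[Path, str]) -> str:
    """Render a step as the equivalent shell command, without running it."""
    path, value = step
    return f"echo '{value}' > {path}"


if __name__ == "__main__":
    # Example addresses only: host0 and 0:0:0:0 are assumptions.
    print(describe(delete_step("0:0:0:0")))
    print(describe(scan_step(0)))                 # whole-bus rescan
    print(describe(scan_step(0, "0", "0", "0")))  # single-LUN rescan
```

Passing an explicit triple to scan_step, rather than the "- - -" wildcard,
corresponds to the single-LUN rescan mentioned above.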
My question - since I've done a delete/rescan bus operation, my device
name and major,minor numbers have changed.
Original
[0:0:0:0] disk ATA ST3250410AS 3.AA /dev/sda
Current
[0:0:0:0] disk ATA ST3250410AS 3.AA /dev/sdc
If I re-add the device to the raid set using the new device name, will
it cause any problems on the next boot?
The drive appears to be fine. I can read all blocks with no errors.
Partition table looks ok, etc.
In the future if I rescan just the single LUN, I'm pretty sure I won't
run into this again, but I'd like to avoid an outage on this event if
possible.
Thanks and regards,
--Tony
* RE: Device naming and raid1
2008-08-26 15:32 Device naming and raid1 Tony Coffman
@ 2008-08-27 11:11 ` David Lethe
2008-08-27 13:21 ` Tony Coffman
0 siblings, 1 reply; 8+ messages in thread
From: David Lethe @ 2008-08-27 11:11 UTC (permalink / raw)
To: Tony Coffman, linux-raid
Don't be too quick to say the drive(s) are good or, for that matter,
to make any assumptions about what is bad or good. (Well, OK, let's
assume the monitor is good.) If the drives are reporting errors and
then failing, why not trap the error messages and do some diagnostics
while the drives are still in that failed state? Error messages tell you
what the errors are. Make yourself a bootable CDROM or USB device, and the
next time the drives lock up and/or start spitting out errors, capture
everything. Then boot from the external device (do NOT cycle power) and
run one of the many available diagnostics to confirm or eliminate the disks.
* Re: Device naming and raid1
2008-08-27 11:11 ` David Lethe
@ 2008-08-27 13:21 ` Tony Coffman
2008-08-27 13:31 ` Sujit Karataparambil
2008-08-27 13:42 ` Steve Fairbairn
0 siblings, 2 replies; 8+ messages in thread
From: Tony Coffman @ 2008-08-27 13:21 UTC (permalink / raw)
To: linux-raid
Thanks much for the reply. For the purposes of this discussion you can
assume that I've already re-established confidence in the drive, the
cable, and the controller and that the data on the drives is worthless
and I just want to get maximum uptime without causing a raid assemble
problem on the next reboot.
Any idea on my original question? If I re-add the drive using the
/dev/sdc name will I have problems on the next boot when the drive is
named /dev/sda?
Based on my experience with Linux and other software raid
implementations, I'm strongly inclined to think that the device naming
doesn't matter - the system will scan the drives at boot looking for
raid sets and re-assemble them no matter what the major and minor numbers
or device names are. I'm not opposed to finding out the hard way, but I'd
really like to get a definitive answer now because by the time this
system is next rebooted I'll probably have long forgotten about this.
Regards,
--Tony
* Re: Device naming and raid1
2008-08-27 13:21 ` Tony Coffman
@ 2008-08-27 13:31 ` Sujit Karataparambil
2008-08-27 14:08 ` Sujit Karataparambil
2008-08-27 13:42 ` Steve Fairbairn
1 sibling, 1 reply; 8+ messages in thread
From: Sujit Karataparambil @ 2008-08-27 13:31 UTC (permalink / raw)
To: linux-raid
> Thanks much for the reply. For the purposes of this discussion you can
> assume that I've already re-established confidence in the drive, the
> cable, and the controller and that the data on the drives is worthless
> and I just want to get maximum uptime without causing a raid assemble
> problem on the next reboot.
Good.
>
> Any idea on my original question? If I re-add the drive using the
> /dev/sdc name will I have problems on the next boot when the drive is
> named /dev/sda?
Since this seems to be a block device, it really does not matter.
> Based on my experience with Linux and other software raid
> implementations, I'm strongly inclined to think that the device naming
> doesn't matter - the system will scan the drives at boot looking for
Kindly read some decent kernel documentation before you jump up and
say this. Kindly surf the net and read some decent articles before you
do any precious upgrades for now.
Sujit
--
--linux(2.4/2.6),bsd(4.5.x+),solaris(2.5+)
* RE: Device naming and raid1
2008-08-27 13:21 ` Tony Coffman
2008-08-27 13:31 ` Sujit Karataparambil
@ 2008-08-27 13:42 ` Steve Fairbairn
1 sibling, 0 replies; 8+ messages in thread
From: Steve Fairbairn @ 2008-08-27 13:42 UTC (permalink / raw)
To: 'Tony Coffman', linux-raid
>
> Any idea on my original question? If I re-add the drive
> using the /dev/sdc name will I have problems on the next boot
> when the drive is named /dev/sda?
>
> Based on my experience with Linux and other software raid
> implementations, I'm strongly inclined to think that the
> device naming doesn't matter - the system will scan the
> drives at boot looking for raid sets and re-assemble them no
> matter what major and minor numbers or device names are. I'm
> not opposed to finding out the hard way but I'd really like
> to get a definitive answer now because by the time this
> system is next rebooted I'll probably have long forgotten about this.
>
Not an expert in any sense of the word, but if the array is being
identified by its UUID and not by a specific list of devices in
mdadm.conf, then the array should assemble fine after a reboot with the
device names having changed. I recently had a similar situation where I
added a new PCI-E x1 2-port SATA II card, which promptly pushed all my
other sdx devices down two places. The raid mounted as expected and all
was/is wonderful.
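The distinction drawn above can be checked mechanically. The following is a
rough sketch, assuming the usual mdadm.conf syntax (ARRAY lines carrying
uuid= and/or devices= identities, with indented lines continuing the previous
line). The function name and the sample file are made up for illustration and
do not come from this thread. Any ARRAY line with a devices= list is treated
as name-dependent, on the understanding that mdadm requires a member to match
every identity given on the line.

```python
def classify_arrays(conf_text):
    """Map each ARRAY in mdadm.conf text to 'devices', 'uuid' or 'other'."""
    logical = []
    for raw in conf_text.splitlines():
        # Skip blank lines and whole-line comments (trailing comments
        # within a line are not handled in this sketch).
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        # A line starting with whitespace continues the previous line.
        if raw[0].isspace() and logical:
            logical[-1] += " " + raw.strip()
        else:
            logical.append(raw.strip())

    result = {}
    for line in logical:
        words = line.split()
        if words[0].lower() != "array" or len(words) < 2:
            continue
        keys = {w.split("=", 1)[0].lower() for w in words[2:] if "=" in w}
        if "devices" in keys:
            result[words[1]] = "devices"   # tied to specific device names
        elif "uuid" in keys:
            result[words[1]] = "uuid"      # survives device renaming
        else:
            result[words[1]] = "other"
    return result


# Hypothetical sample configuration, not from the thread.
SAMPLE = """\
# example only
DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2
   UUID=01234567:89abcdef:01234567:89abcdef
ARRAY /dev/md1 devices=/dev/sda2,/dev/sdb2
"""

if __name__ == "__main__":
    for array, how in classify_arrays(SAMPLE).items():
        print(array, how)
```

In the sample, /dev/md0 is reported as uuid and /dev/md1 as devices; only the
latter would be expected to care about an sdX rename.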
Hope this helps,
Steve.
* Re: Device naming and raid1
2008-08-27 13:31 ` Sujit Karataparambil
@ 2008-08-27 14:08 ` Sujit Karataparambil
2008-08-27 14:46 ` David Greaves
2008-08-27 14:48 ` Tony Coffman
0 siblings, 2 replies; 8+ messages in thread
From: Sujit Karataparambil @ 2008-08-27 14:08 UTC (permalink / raw)
To: linux-raid
http://www.gagme.com/greg/linux/raid-lvm.php
You can try this with the spare drives you have.
Basically, what you have to do is check whether the drive
now being linked to another device name is the reason for this
problem.
Once it shows as unplugged or failed, you can use your new
replacement drive and reboot.
Kindly read the comments on this article, which are very
useful.
--
--linux(2.4/2.6),bsd(4.5.x+),solaris(2.5+)
* Re: Device naming and raid1
2008-08-27 14:08 ` Sujit Karataparambil
@ 2008-08-27 14:46 ` David Greaves
2008-08-27 14:48 ` Tony Coffman
1 sibling, 0 replies; 8+ messages in thread
From: David Greaves @ 2008-08-27 14:46 UTC (permalink / raw)
To: Sujit Karataparambil; +Cc: linux-raid
Sujit Karataparambil wrote:
> http://www.gagme.com/greg/linux/raid-lvm.php
You may want to consider relating this to our community linux-raid wiki here:
http://linux-raid.osdl.org/
David
* Re: Device naming and raid1
2008-08-27 14:08 ` Sujit Karataparambil
2008-08-27 14:46 ` David Greaves
@ 2008-08-27 14:48 ` Tony Coffman
1 sibling, 0 replies; 8+ messages in thread
From: Tony Coffman @ 2008-08-27 14:48 UTC (permalink / raw)
To: linux-raid
Sujit,
Thanks for the replies and the link. I appreciate them.
I spent several hours this week reading the kernel documentation
(md.txt), the mdadm man pages, the linux-raid wiki, and reading articles
on the net before posting to the list.
I highly recommend this wiki page from IBM for anybody who has an issue
similar to mine (a temporary failure that knocks a drive offline) and wants
to bring the drive back online without a reboot. This really helped me
understand the process for running a highly available raid-1 set on
the Linux kernel. The hardware is different but the same principles apply.
http://www-941.ibm.com/collaboration/wiki/pages/viewpage.action?pageId=3625
For anybody who is interested, I grabbed a test system this morning to
simulate my situation (temporary drive failure). You can, in fact,
bring the failed drive back online with a different device name,
remirror it, and reboot with no issues on CentOS 5. The key, as Steve
Fairbairn pointed out, is that the mdadm.conf file is set up to use the
RAID UUID. I also moved the drives into a different scan order and repeated
the test, and that works too. My situation is slightly complicated by
the fact that I'm booting off these same drives, so I had to mirror the
MBR onto the second drive, but since I had already done this previously on
the wonky system this was easily achieved.
See here:
http://www.dirigo.net/tuxTips/avoidingProblems/GrubMdMbr.php
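For reference, re-adding a member under its new name and watching the
remirror usually comes down to a few mdadm invocations. The sketch below
only builds the command lines and does not run them. The array and partition
names (/dev/md0, /dev/sdc1) are placeholders, since the thread does not give
them, and the function names are hypothetical.

```python
import shlex


def readd_commands(array, member):
    """Build (but do not run) mdadm commands for re-adding a member."""
    return [
        ["mdadm", "--examine", member],     # superblock, including array UUID
        ["mdadm", array, "--add", member],  # add the member; resync starts
        ["mdadm", "--detail", array],       # array state and rebuild progress
    ]


def as_shell(argv):
    """Quote an argument list so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Placeholder names; substitute the real array and partition.
    for argv in readd_commands("/dev/md0", "/dev/sdc1"):
        print(as_shell(argv))
```

Comparing the UUID reported by --examine with the one in mdadm.conf before
adding the member is a cheap way to confirm that the next boot will not
depend on the device name.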
I'm running a couple more tests today to see what happens if I rescan
the device to bring it back online with the same device name. I'll post
the result in case anybody is interested. I expect that I'll have to
initiate the rebuild but don't expect any other problems.
Regards,
--Tony