* md and sd out of sync
From: Richard Scobie @ 2010-07-04 22:25 UTC (permalink / raw)
To: 'Linux RAID'
I have 16 x 2TB drives, each partitioned into 3 equal-sized partitions.
Three md RAID6 arrays have then been built, each utilising one partition
on each drive.
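For reference, a layout like this would be built with commands along the
following lines (the md names and the drive letters other than sdz are
illustrative, not the exact ones used):

  mdadm --create /dev/md0 --level=6 --raid-devices=16 /dev/sd[k-z]1
  mdadm --create /dev/md1 --level=6 --raid-devices=16 /dev/sd[k-z]2
  mdadm --create /dev/md2 --level=6 --raid-devices=16 /dev/sd[k-z]3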
Over the weekend, one member of one array was failed out:
end_request: I/O error, dev sdz, sector 1302228737
md: super_written gets error=-5, uptodate=0
raid5: Disk failure on sdz1, disabling device.
raid5: Operation continuing on 15 devices.
Checking with smartctl is not an option as the controller (LSI SAS)
reacts badly. On the basis of it possibly being a transitory error or a
sector that could be remapped on resync I re-added it to the array.
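The re-add would have been something along these lines (array name
illustrative):

  mdadm /dev/md0 --remove /dev/sdz1
  mdadm /dev/md0 --re-add /dev/sdz1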
This failed part way through and caused enough disruption to the
controller that the whole drive was taken offline:
sd 8:0:24:0: [sdz] <6>sd 8:0:24:0: [sdz] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdz, sector 569772337
Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdz, sector 569769777
sd 8:0:24:0: [sdz] <6>mptsas: ioc0: removing sata device, channel 0, id 32, phy 11
Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdz, sector 569770097
port-8:1:8: mptsas: ioc0: delete port (8)
I would have thought that at this point mdadm would be reporting that the
remaining 2 complete arrays had each lost their /dev/sdz component, but
this is not the case - it shows healthy arrays.
Is this expected behaviour?
To complicate things further, without any intervention, the disconnected
drive was then recognised again as a new device and reconnected as
/dev/sdai:
mptsas: ioc0: attaching sata device, channel 0, id 32, phy 11
scsi 8:0:34:0: Direct-Access ATA WDC WD2003FYYS-0 0D02 PQ: 0 ANSI: 5
sd 8:0:34:0: [sdai] 3907029168 512-byte hardware sectors (2000399 MB)
sd 8:0:34:0: [sdai] Write Protect is off
sd 8:0:34:0: [sdai] Mode Sense: 73 00 00 08
sd 8:0:34:0: [sdai] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 8:0:34:0: [sdai] 3907029168 512-byte hardware sectors (2000399 MB)
sd 8:0:34:0: [sdai] Write Protect is off
sd 8:0:34:0: [sdai] Mode Sense: 73 00 00 08
sd 8:0:34:0: [sdai] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sdai: sdai1 sdai2 sdai3
sd 8:0:34:0: [sdai] Attached SCSI disk
sd 8:0:34:0: Attached scsi generic sg26 type 0
Because sdz no longer exists, I cannot fail and remove /dev/sdz2 and
/dev/sdz3 from the other 2 md arrays.
I will proceed by replacing the drive and rebooting, at which point I
should be able to re-add it to all arrays, but I wanted to draw attention
to how unaware md seems to be of all the changes that have occurred.
Maybe things have changed in later versions:
Kernel 2.6.27.19-78.2.30.fc9.x86_64
mdadm 2.6.4
Regards,
Richard
* Re: md and sd out of sync
From: richard @ 2010-07-04 23:17 UTC (permalink / raw)
To: 'Linux RAID'
Richard Scobie wrote:
> I have 16 x 2TB drives, each partitioned into 3 equal-sized partitions.
>
> Three md RAID6 arrays have then been built, each utilising one partition
> on each drive.
>
> Over the weekend, one member of one array was failed out:
As a follow-up to this, when I hot-removed the failed drive, mdadm
notified me of a failure of /dev/sdz2 (despite /dev/sdz2 no longer
existing).
Then, some 20 minutes later, upon hot inserting the new drive, I got an
mdadm notification of /dev/sdz3 failing.
So it got there in the end.
Regards,
Richard
* Re: md and sd out of sync
From: Neil Brown @ 2010-07-05 0:05 UTC (permalink / raw)
To: Richard Scobie; +Cc: 'Linux RAID'
On Mon, 05 Jul 2010 10:25:37 +1200
Richard Scobie <richard@sauce.co.nz> wrote:
> I have 16 x 2TB drives, each partitioned into 3 equal-sized partitions.
>
> Three md RAID6 arrays have then been built, each utilising one partition
> on each drive.
>
> Over the weekend, one member of one array was failed out:
>
> end_request: I/O error, dev sdz, sector 1302228737
> md: super_written gets error=-5, uptodate=0
> raid5: Disk failure on sdz1, disabling device.
> raid5: Operation continuing on 15 devices.
>
> Checking with smartctl is not an option as the controller (LSI SAS)
> reacts badly. On the basis of it possibly being a transitory error or a
> sector that could be remapped on resync I re-added it to the array.
>
> This failed part way through and caused enough disruption to the
> controller that the whole drive was taken offline:
>
> sd 8:0:24:0: [sdz] <6>sd 8:0:24:0: [sdz] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdz, sector 569772337
> Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdz, sector 569769777
> sd 8:0:24:0: [sdz] <6>mptsas: ioc0: removing sata device, channel 0, id 32, phy 11
> Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdz, sector 569770097
> port-8:1:8: mptsas: ioc0: delete port (8)
>
>
> I would have thought that at this point mdadm would be reporting that the
> remaining 2 complete arrays had each lost their /dev/sdz component, but
> this is not the case - it shows healthy arrays.
>
> Is this expected behaviour?
Yes. md only notices that a device has failed when it tries to perform IO
and gets an error.
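(One way to make md attempt IO on every member sooner, if you want it to
notice a vanished device, is to trigger a check of the array, which reads
all members - array name illustrative:

  echo check > /sys/block/md0/md/sync_action
)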
The next release of mdadm will have "mdadm --incremental --fail", which can be
called by udev when it notices a device disappearing. mdadm will find any
array that included the given device and fail/remove it.
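(The hook would be a udev rule roughly like the following - a sketch of the
idea only, not the exact rule that will ship with mdadm:

  # when a block device goes away, let mdadm fail/remove it from any array
  SUBSYSTEM=="block", ACTION=="remove", RUN+="/sbin/mdadm --incremental --fail $name"
)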
>
> To complicate things further, without any intervention, the disconnected
> drive was then recognised again as a new device and reconnected as
> /dev/sdai:
>
> mptsas: ioc0: attaching sata device, channel 0, id 32, phy 11
> scsi 8:0:34:0: Direct-Access ATA WDC WD2003FYYS-0 0D02 PQ: 0 ANSI: 5
> sd 8:0:34:0: [sdai] 3907029168 512-byte hardware sectors (2000399 MB)
> sd 8:0:34:0: [sdai] Write Protect is off
> sd 8:0:34:0: [sdai] Mode Sense: 73 00 00 08
> sd 8:0:34:0: [sdai] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> sd 8:0:34:0: [sdai] 3907029168 512-byte hardware sectors (2000399 MB)
> sd 8:0:34:0: [sdai] Write Protect is off
> sd 8:0:34:0: [sdai] Mode Sense: 73 00 00 08
> sd 8:0:34:0: [sdai] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> sdai: sdai1 sdai2 sdai3
> sd 8:0:34:0: [sdai] Attached SCSI disk
> sd 8:0:34:0: Attached scsi generic sg26 type 0
>
>
> Because sdz no longer exists, I cannot fail and remove /dev/sdz2 and
> /dev/sdz3 from the other 2 md arrays.
You can.
mdadm /dev/mdXX --fail detached
mdadm /dev/mdXX --remove detached
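("detached" matches any member whose device node has gone away, so the same
pair of commands can be run against both remaining arrays - array names
illustrative:

  for md in /dev/md1 /dev/md2; do
      mdadm $md --fail detached --remove detached
  done
)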
NeilBrown
>
> I will proceed by replacing the drive and rebooting, at which point I
> should be able to re-add it to all arrays, but I wanted to draw attention
> to how unaware md seems to be of all the changes that have occurred.
> Maybe things have changed in later versions:
>
> Kernel 2.6.27.19-78.2.30.fc9.x86_64
> mdadm 2.6.4
>
> Regards,
>
> Richard