linux-raid.vger.kernel.org archive mirror
* FailSpare event?
@ 2007-01-11 22:11 Mike
  2007-01-11 22:23 ` Neil Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Mike @ 2007-01-11 22:11 UTC (permalink / raw)
  To: linux-raid

Can someone tell me what this means please? I just received this in
an email from one of my servers:


From: mdadm monitoring [root@$DOMAIN.com]
To: root@$DOMAIN.com
Subject: FailSpare event on /dev/md2:$HOST.$DOMAIN.com

This is an automatically generated mail message from mdadm
running on $HOST.$DOMAIN.com

A FailSpare event had been detected on md device /dev/md2.

It could be related to component device /dev/sde2.

Faithfully yours, etc.

On this machine I execute:

$ cat /proc/mdstat
Personalities : [raid5] [raid4] [raid1] 
md0 : active raid1 sdf1[2](S) sde1[3](S) sdd1[4](S) sdc1[5](S) sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md1 : active raid1 sdf3[2](S) sde3[3](S) sdd3[4](S) sdc3[5](S) sdb3[1] sda3[0]
3068288 blocks [2/2] [UU]

md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>


Does the email message mean drive sde2[5] has failed? I know the sde2 refers
to the second partition of /dev/sde. Here is the partition table

[root@elo ~]# fdisk -l /dev/sde

Disk /dev/sde: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
   /dev/sde1   *           1          13      104391   fd  Linux raid autodetect
   /dev/sde2              14       17465   140183190   fd  Linux raid autodetect
   /dev/sde3           17466       17847     3068415   fd  Linux raid autodetect

I have partition 2 of drive sde as one of the raid devices for md. Does the (S)
on sde3[2](S) mean the device is a spare for md1 and the same for md0?

Mike


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: FailSpare event?
  2007-01-11 22:11 FailSpare event? Mike
@ 2007-01-11 22:23 ` Neil Brown
  2007-01-11 22:36   ` Mike
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Neil Brown @ 2007-01-11 22:23 UTC (permalink / raw)
  To: Mike; +Cc: linux-raid

On Thursday January 11, mikee@mikee.ath.cx wrote:
> Can someone tell me what this means please? I just received this in
> an email from one of my servers:
> 
....

> 
> A FailSpare event had been detected on md device /dev/md2.
> 
> It could be related to component device /dev/sde2.

It means that mdadm has just noticed that /dev/sde2 is a spare and is faulty.

You would normally expect this if the array is rebuilding a spare and
a write to the spare fails however...

> 
> md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
> 560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]

That isn't the case here - your array doesn't need rebuilding.
Possibly a superblock-update failed.  Possibly mdadm only just started
monitoring the array and the spare has been faulty for some time.

> 
> Does the email message mean drive sde2[5] has failed? I know the sde2 refers
> to the second partition of /dev/sde. Here is the partition table

It means that md thinks sde2 cannot be trusted.  To find out why you
would need to look at kernel logs for IO errors.
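
For example, a minimal sketch of that log hunt (the sample line below is
the style of message md emits on an IO error; on the server you would
run the same filter over /var/log/messages or dmesg instead of the
embedded sample):

```shell
# On the real box:  grep -iE 'i/o error|sense key|disk failure' /var/log/messages
# Sample input so the sketch runs anywhere:
sample='Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053'
printf '%s\n' "$sample" | grep -iE 'i/o error|sense key|disk failure'
```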

> 
> I have partition 2 of drive sde as one of the raid devices for md. Does the (S)
> on sde3[2](S) mean the device is a spare for md1 and the same for md0?
> 

Yes, (S) means the device is spare.  You don't have (S) next to sde2
on md2 because (F) (failed) overrides (S).
You can tell by the position [5], that it isn't part of the array
(being a 5 disk array, the active positions are 0,1,2,3,4).
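
Those decoding rules can be sketched as a small shell loop (illustrative
only; `line` is the md2 device list from the mdstat output above, and
`raid_disks` comes from the [5/5] field):

```shell
line="sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]"
raid_disks=5      # from the [5/5] field
for tok in $line; do
  dev=${tok%%\[*}                                   # device name before '['
  slot=$(printf '%s\n' "$tok" | sed 's/.*\[\([0-9]*\)\].*/\1/')
  case $tok in
    *"(F)"*) state=failed ;;                        # (F) overrides (S)
    *"(S)"*) state=spare  ;;
    *)       state=active ;;
  esac
  # a slot number >= raid_disks means "not part of the active array"
  if [ "$slot" -ge "$raid_disks" ] && [ "$state" = active ]; then
    state=spare
  fi
  echo "$dev slot=$slot $state"
done
```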

NeilBrown

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: FailSpare event?
  2007-01-11 22:23 ` Neil Brown
@ 2007-01-11 22:36   ` Mike
  2007-01-11 22:59     ` Neil Brown
  2007-01-12 14:34   ` Ernst Herzberg
  2007-01-13 22:29   ` Mike
  2 siblings, 1 reply; 17+ messages in thread
From: Mike @ 2007-01-11 22:36 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Fri, 12 Jan 2007, Neil Brown might have said:

> On Thursday January 11, mikee@mikee.ath.cx wrote:
> > Can someone tell me what this means please? I just received this in
> > an email from one of my servers:
> > 
> ....
> 
> > 
> > A FailSpare event had been detected on md device /dev/md2.
> > 
> > It could be related to component device /dev/sde2.
> 
> It means that mdadm has just noticed that /dev/sde2 is a spare and is faulty.
> 
> You would normally expect this if the array is rebuilding a spare and
> a write to the spare fails however...
> 
> > 
> > md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
> > 560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]
> 
> That isn't the case here - your array doesn't need rebuilding.
> Possibly a superblock-update failed.  Possibly mdadm only just started
> monitoring the array and the spare has been faulty for some time.
> 
> > 
> > Does the email message mean drive sde2[5] has failed? I know the sde2 refers
> > to the second partition of /dev/sde. Here is the partition table
> 
> It means that md thinks sde2 cannot be trusted.  To find out why you
> would need to look at kernel logs for IO errors.
> 
> > 
> > I have partition 2 of drive sde as one of the raid devices for md. Does the (S)
> > on sde3[2](S) mean the device is a spare for md1 and the same for md0?
> > 
> 
> Yes, (S) means the device is spare.  You don't have (S) next to sde2
> on md2 because (F) (failed) overrides (S).
> You can tell by the position [5], that it isn't part of the array
> (being a 5 disk array, the active positions are 0,1,2,3,4).
> 
> NeilBrown
> 

Thanks for the quick response.

So I'm ok for the moment? Yes, I need to find the error and fix everything
back to the (S) state.

The messages in $HOST:/var/log/messages for the time of the email are:

Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x8000002
Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
Jan 11 16:04:25 elo kernel:     Additional sense: Internal target failure
Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device. Operation continuing on 5 devices

This is a Dell box running Fedora Core with recent patches. It is a production
box, so I do not patch each night.

On AIX boxes I can blink the drives to identify a bad/failing device. Is there
a way to blink the drives in linux?

Mike


* Re: FailSpare event?
  2007-01-11 22:36   ` Mike
@ 2007-01-11 22:59     ` Neil Brown
  2007-01-11 23:06       ` Mike
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Brown @ 2007-01-11 22:59 UTC (permalink / raw)
  To: Mike; +Cc: linux-raid

On Thursday January 11, mikee@mikee.ath.cx wrote:
> 
> So I'm ok for the moment? Yes, I need to find the error and fix everything
> back to the (S) state.

Yes, OK for the moment.

> 
> The messages in $HOST:/var/log/messages for the time of the email are:
> 
> Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x8000002
> Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
> Jan 11 16:04:25 elo kernel:     Additional sense: Internal target failure
> Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
> Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
> Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device. Operation continuing on 5 devices

Given the sector number it looks likely that it was a superblock
update.
No idea how bad an 'internal target failure' is.  Maybe powercycling
the drive would 'fix' it, maybe not.

> 
> On AIX boxes I can blink the drives to identify a bad/failing device. Is there
> a way to blink the drives in linux?

Unfortunately not.
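
Not as an md feature, anyway. A crude trick people use instead is to
keep the suspect disk busy with reads so its activity LED flashes while
the other drives stay mostly idle. A sketch (TARGET defaults to a
scratch file here so it is safe to try; point it at the real device,
e.g. TARGET=/dev/sde, on the server):

```shell
# Crude stand-in for AIX-style drive identification: hammer the suspect
# disk with reads and watch for the LED that stays lit.
TARGET=${TARGET:-/tmp/blink-demo.img}   # use TARGET=/dev/sde on the real box
[ -e "$TARGET" ] || dd if=/dev/zero of="$TARGET" bs=1M count=4 2>/dev/null
dd if="$TARGET" of=/dev/null bs=64k 2>/dev/null
echo "read pass over $TARGET complete"
```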

NeilBrown


* Re: FailSpare event?
  2007-01-11 22:59     ` Neil Brown
@ 2007-01-11 23:06       ` Mike
  2007-01-12  0:05         ` Mike Hardy
                           ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Mike @ 2007-01-11 23:06 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Fri, 12 Jan 2007, Neil Brown might have said:

> On Thursday January 11, mikee@mikee.ath.cx wrote:
> > 
> > So I'm ok for the moment? Yes, I need to find the error and fix everything
> > back to the (S) state.
> 
> Yes, OK for the moment.
> 
> > 
> > The messages in $HOST:/var/log/messages for the time of the email are:
> > 
> > Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x8000002
> > Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
> > Jan 11 16:04:25 elo kernel:     Additional sense: Internal target failure
> > Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
> > Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
> > Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device. Operation continuing on 5 devices
> 
> Given the sector number it looks likely that it was a superblock
> update.
> No idea how bad an 'internal target failure' is.  Maybe powercycling
> the drive would 'fix' it, maybe not.
> 
> > 
> > On AIX boxes I can blink the drives to identify a bad/failing device. Is there
> > a way to blink the drives in linux?
> 
> Unfortunately not.
> 
> NeilBrown
> 

I found the smartctl command. I have a 'long' test running in the background.
I checked this drive and the other drives. This drive has been used the least
(confirms it is a spare?) and is the only one with 'Total uncorrected errors' > 0.

How do I determine the error, and how can I correct or clear it?

Mike

[root@$HOST ~]# smartctl -a /dev/sde
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: SEAGATE  ST3146707LC      Version: D703
Serial number: 3KS30WY8
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Thu Jan 11 17:00:26 2007 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     48 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 66108
  Blocks received from initiator = 147374656
  Blocks read from cache and sent to initiator = 42215
  Number of read and write commands whose size <= segment size = 12635583
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 3943.42
  number of minutes until next internal SMART test = 94

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:        354        0         0       354        354          0.546           0
write:         0        0         0         0          0        185.871           1

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed, segment failed   -    3943                 - [-   -    -]

Long (extended) Self Test duration: 2726 seconds [45.4 minutes]
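
To spot which drive is complaining without eyeballing every table, a
one-line filter over the error counter log works; a sketch, using the
read/write counter rows pasted above as sample input (on the box you
would feed it `smartctl -a` output for each drive):

```shell
# Print any counter row whose "Total uncorrected errors" (last column)
# is nonzero.
sample='read:        354        0         0       354        354          0.546           0
write:         0        0         0         0          0        185.871           1'
flagged=$(printf '%s\n' "$sample" | awk '$NF > 0 { print $1 }')
echo "rows with uncorrected errors: $flagged"
```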



* Re: FailSpare event?
  2007-01-11 23:06       ` Mike
@ 2007-01-12  0:05         ` Mike Hardy
  2007-01-12  0:40         ` Corey Hickey
  2007-01-12  0:48         ` Martin Schröder
  2 siblings, 0 replies; 17+ messages in thread
From: Mike Hardy @ 2007-01-12  0:05 UTC (permalink / raw)
  To: Mike; +Cc: linux-raid


google "BadBlockHowto"

Any "just google it" response sounds glib, but this is actually how to
do it :-)

If you're new to md and mdadm, don't forget to actually remove the drive
from the array before you start working on it with 'dd'

-Mike

Mike wrote:
> On Fri, 12 Jan 2007, Neil Brown might have said:
> 
>> On Thursday January 11, mikee@mikee.ath.cx wrote:
>>> So I'm ok for the moment? Yes, I need to find the error and fix everything
>>> back to the (S) state.
>> Yes, OK for the moment.
>>
>>> The messages in $HOST:/var/log/messages for the time of the email are:
>>>
>>> Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x8000002
>>> Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
>>> Jan 11 16:04:25 elo kernel:     Additional sense: Internal target failure
>>> Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
>>> Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
>>> Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device. Operation continuing on 5 devices
>> Given the sector number it looks likely that it was a superblock
>> update.
>> No idea how bad an 'internal target failure' is.  Maybe powercycling
>> the drive would 'fix' it, maybe not.
>>
>>> On AIX boxes I can blink the drives to identify a bad/failing device. Is there
>>> a way to blink the drives in linux?
>> Unfortunately not.
>>
>> NeilBrown
>>
> 
> I found the smartctl command. I have a 'long' test running in the background.
> I checked this drive and the other drives. This drive has been used the least
> (confirms it is a spare?) and is the only one with 'Total uncorrected errors' > 0.
> 
> How do I determine the error, and how can I correct or clear it?
> 
> Mike
> 
> [root@$HOST ~]# smartctl -a /dev/sde
> smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> Device: SEAGATE  ST3146707LC      Version: D703
> Serial number: 3KS30WY8
> Device type: disk
> Transport protocol: Parallel SCSI (SPI-4)
> Local Time is: Thu Jan 11 17:00:26 2007 CST
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
> 
> Current Drive Temperature:     48 C
> Drive Trip Temperature:        68 C
> Elements in grown defect list: 0
> Vendor (Seagate) cache information
>   Blocks sent to initiator = 66108
>   Blocks received from initiator = 147374656
>   Blocks read from cache and sent to initiator = 42215
>   Number of read and write commands whose size <= segment size = 12635583
>   Number of read and write commands whose size > segment size = 0
> Vendor (Seagate/Hitachi) factory information
>   number of hours powered up = 3943.42
>   number of minutes until next internal SMART test = 94
> 
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:        354        0         0       354        354          0.546           0
> write:         0        0         0         0          0        185.871           1
> 
> Non-medium error count:        0
> 
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background long   Completed, segment failed   -    3943                 - [-   -    -]
> 
> Long (extended) Self Test duration: 2726 seconds [45.4 minutes]
> 


* Re: FailSpare event?
  2007-01-11 23:06       ` Mike
  2007-01-12  0:05         ` Mike Hardy
@ 2007-01-12  0:40         ` Corey Hickey
  2007-01-12  0:48         ` Martin Schröder
  2 siblings, 0 replies; 17+ messages in thread
From: Corey Hickey @ 2007-01-12  0:40 UTC (permalink / raw)
  To: linux-raid

Mike wrote:
> I found the smartctl command. I have a 'long' test running in the background.
> I checked this drive and the other drives. This drive has been used the least
> (confirms it is a spare?) and is the only one with 'Total uncorrected errors' > 0.
> 
> How do I determine the error, and how can I correct or clear it?
> 
> Mike
> 

[cut]

> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background long   Completed, segment failed   -    3943                 - [-   -    -]
> 
> Long (extended) Self Test duration: 2726 seconds [45.4 minutes]

Am I mistaken, or does the above information not say that the long 
self-test actually failed? If a SMART test fails, that should be 
sufficient cause to RMA the drive if it's still under warranty.

It might not actually be that surprising to have a largely unused drive 
fail. I've had a couple drives fail due to what I presume is bearing 
wear: the drive gradually gets noisy (over many months) and eventually 
starts having intermittent errors that get more and more frequent. If 
your drive was spinning while it was a spare, then it would be just as 
likely to wear out a bad bearing as any of your other drives. Of course, 
it could be some other problem; that's just an example.

-Corey


* Re: FailSpare event?
  2007-01-11 23:06       ` Mike
  2007-01-12  0:05         ` Mike Hardy
  2007-01-12  0:40         ` Corey Hickey
@ 2007-01-12  0:48         ` Martin Schröder
  2 siblings, 0 replies; 17+ messages in thread
From: Martin Schröder @ 2007-01-12  0:48 UTC (permalink / raw)
  To: Linux List

2007/1/12, Mike <mikee@mikee.ath.cx>:
> # 1  Background long   Completed, segment failed   -    3943

This should still be in warranty. Try to get a replacement.

Best
   Martin


* Re: FailSpare event?
  2007-01-11 22:23 ` Neil Brown
  2007-01-11 22:36   ` Mike
@ 2007-01-12 14:34   ` Ernst Herzberg
  2007-01-13 18:10     ` Nix
  2007-01-13 22:29   ` Mike
  2 siblings, 1 reply; 17+ messages in thread
From: Ernst Herzberg @ 2007-01-12 14:34 UTC (permalink / raw)
  To: Neil Brown; +Cc: Mike, linux-raid

On Thursday 11 January 2007 23:23, Neil Brown wrote:
> On Thursday January 11, mikee@mikee.ath.cx wrote:
> > Can someone tell me what this means please? I just received this in
> > an email from one of my servers:
>
> ....
>

Same problem here, on different machines, but only with mdadm 2.6; with
mdadm 2.5.5 there were no problems.

The first machine sends this directly after starting mdadm in monitor mode:
(kernel 2.6.20-rc3)
-----------------------------
event=DeviceDisappeared
mddev=/dev/md1
device=Wrong-Level

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] 
md1 : active raid0 sdb2[1] sda2[0]
      3904704 blocks 16k chunks
      
md2 : active raid0 sdb3[1] sda3[0]
      153930112 blocks 16k chunks
      
md3 : active raid5 sdf1[3] sde1[2] sdd1[1] sdc1[0]
      732587712 blocks level 5, 16k chunk, algorithm 2 [4/4] [UUUU]
      
md0 : active raid1 sdb1[1] sda1[0]
      192640 blocks [2/2] [UU]
      
unused devices: <none>
-----------------------
and a second time for md2.
Then, about every 60 seconds, 4 copies of:

event=SpareActive
mddev=/dev/md3

******************************

The second machine sends, about every 60 seconds, 8 messages with:
(kernel 2.6.19.2)
--------------------------
event=SpareActive
mddev=/dev/md0
device=

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md1 : active raid1 sdb1[1] sda1[0]
      979840 blocks [2/2] [UU]
      
md3 : active raid5 sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      4899200 blocks level 5, 8k chunk, algorithm 2 [6/6] [UUUUUU]
      
md2 : active raid5 sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] 
sda2[0]
      6858880 blocks level 5, 4k chunk, algorithm 2 [8/8] [UUUUUUUU]
      
md0 : active raid5 sdh3[7] sdg3[6] sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1] 
sda3[0]
      235086656 blocks level 5, 16k chunk, algorithm 2 [8/8] [UUUUUUUU]
      
unused devices: <none>

--------------------------

Both machines had never seen any spare device, and there are no failing
devices; everything works as expected.

<earny>


* Re: FailSpare event?
  2007-01-12 14:34   ` Ernst Herzberg
@ 2007-01-13 18:10     ` Nix
  2007-01-13 23:34       ` Nix
                         ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Nix @ 2007-01-13 18:10 UTC (permalink / raw)
  To: earny; +Cc: Neil Brown, Mike, linux-raid

On 12 Jan 2007, Ernst Herzberg told this:
> Then every about 60 sec 4 times
>
> event=SpareActive
> mddev=/dev/md3

I see exactly this on both my RAID-5 arrays, neither of which have any
spare device --- nor have any active devices transitioned to spare
(which is what that event is actually supposed to mean).

mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
shortly: I can't afford to not run mdadm --monitor... odd, that
code hasn't changed during 2.6 development.

-- 
`He accused the FSF of being "something of a hypocrit", which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


* Re: FailSpare event?
  2007-01-11 22:23 ` Neil Brown
  2007-01-11 22:36   ` Mike
  2007-01-12 14:34   ` Ernst Herzberg
@ 2007-01-13 22:29   ` Mike
  2 siblings, 0 replies; 17+ messages in thread
From: Mike @ 2007-01-13 22:29 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Fri, 12 Jan 2007, Neil Brown might have said:

> On Thursday January 11, mikee@mikee.ath.cx wrote:
> > Can someone tell me what this means please? I just received this in
> > an email from one of my servers:
> > 
> ....
> 
> > 
> > A FailSpare event had been detected on md device /dev/md2.
> > 
> > It could be related to component device /dev/sde2.
> 
> It means that mdadm has just noticed that /dev/sde2 is a spare and is faulty.
> 
> You would normally expect this if the array is rebuilding a spare and
> a write to the spare fails however...
> 
> > 
> > md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
> > 560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]
> 
> That isn't the case here - your array doesn't need rebuilding.
> Possibly a superblock-update failed.  Possibly mdadm only just started
> monitoring the array and the spare has been faulty for some time.
> 
> > 
> > Does the email message mean drive sde2[5] has failed? I know the sde2 refers
> > to the second partition of /dev/sde. Here is the partition table
> 
> It means that md thinks sde2 cannot be trusted.  To find out why you
> would need to look at kernel logs for IO errors.
> 
> > 
> > I have partition 2 of drive sde as one of the raid devices for md. Does the (S)
> > on sde3[2](S) mean the device is a spare for md1 and the same for md0?
> > 
> 
> Yes, (S) means the device is spare.  You don't have (S) next to sde2
> on md2 because (F) (failed) overrides (S).
> You can tell by the position [5], that it isn't part of the array
> (being a 5 disk array, the active positions are 0,1,2,3,4).
> 
> NeilBrown
> 

I have cleared the error by:

# mdadm --manage /dev/md2 -f /dev/sde2
( make sure it has failed )
# mdadm --manage /dev/md2 -r /dev/sde2
( remove from the array )
# mdadm --manage /dev/md2 -a /dev/sde2
( add the device back to the array )
# mdadm --detail /dev/md2
( verify there are no faults and the array knows about the spare )


* Re: FailSpare event?
  2007-01-13 18:10     ` Nix
@ 2007-01-13 23:34       ` Nix
  2007-01-13 23:38       ` Nix
  2007-01-14 15:01       ` Nix
  2 siblings, 0 replies; 17+ messages in thread
From: Nix @ 2007-01-13 23:34 UTC (permalink / raw)
  To: earny; +Cc: Neil Brown, Mike, linux-raid

On 13 Jan 2007, nix@esperi.org.uk spake thusly:

> On 12 Jan 2007, Ernst Herzberg told this:
>> Then every about 60 sec 4 times
>>
>> event=SpareActive
>> mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which have any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

Hm, the manual says that it means that a spare has transitioned to
active (which seems more likely). Perhaps the comment at line 82 of
Monitor.c is wrong, or I just don't understand what a `reverse
transition' is supposed to be.

-- 
`He accused the FSF of being "something of a hypocrit", which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


* Re: FailSpare event?
  2007-01-13 18:10     ` Nix
  2007-01-13 23:34       ` Nix
@ 2007-01-13 23:38       ` Nix
       [not found]         ` <45ABA3E4.3050800@tmr.com>
  2007-01-14 15:01       ` Nix
  2 siblings, 1 reply; 17+ messages in thread
From: Nix @ 2007-01-13 23:38 UTC (permalink / raw)
  To: earny; +Cc: Neil Brown, Mike, linux-raid

On 13 Jan 2007, nix@esperi.org.uk uttered the following:

> On 12 Jan 2007, Ernst Herzberg told this:
>> Then every about 60 sec 4 times
>>
>> event=SpareActive
>> mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which have any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

One oddity has already come to light. My /proc/mdstat says

md2 : active raid5 sdb7[0] hda5[3] sda7[1]
      19631104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 sda6[0] hdc5[3] sdb6[1]
      76807296 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

hda5 and hdc5 look odd. Indeed, --examine says

    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       8       22        1      active sync   /dev/sdb6
       3      22        5        2      active sync   /dev/hdc5

    Number   Major   Minor   RaidDevice State
       0       8       23        0      active sync   /dev/sdb7
       1       8        7        1      active sync   /dev/sda7
       3       3        5        2      active sync   /dev/hda5

0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
from `RaidDevice'? Why have both?)

-- 
`He accused the FSF of being "something of a hypocrit", which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


* Re: FailSpare event?
  2007-01-13 18:10     ` Nix
  2007-01-13 23:34       ` Nix
  2007-01-13 23:38       ` Nix
@ 2007-01-14 15:01       ` Nix
  2007-01-14 21:20         ` Neil Brown
  2 siblings, 1 reply; 17+ messages in thread
From: Nix @ 2007-01-14 15:01 UTC (permalink / raw)
  To: earny; +Cc: Neil Brown, Mike, linux-raid

On 13 Jan 2007, nix@esperi.org.uk uttered the following:
> mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
> shortly: I can't afford to not run mdadm --monitor... odd, that
> code hasn't changed during 2.6 development.

Whoo! Compile Monitor.c without optimization and the problem goes away.

Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing
this?), maybe mdadm is tripping undefined behaviour somewhere...

-- 
`He accused the FSF of being "something of a hypocrit", which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


* Re: FailSpare event?
  2007-01-14 15:01       ` Nix
@ 2007-01-14 21:20         ` Neil Brown
  2007-01-15 20:08           ` Nix
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Brown @ 2007-01-14 21:20 UTC (permalink / raw)
  To: Nix; +Cc: earny, Mike, linux-raid

On Sunday January 14, nix@esperi.org.uk wrote:
> On 13 Jan 2007, nix@esperi.org.uk uttered the following:
> > mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
> > shortly: I can't afford to not run mdadm --monitor... odd, that
> > code hasn't changed during 2.6 development.
> 
> Whoo! Compile Monitor.c without optimization and the problem goes away.
> 
> Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing
> this?), maybe mdadm is tripping undefined behaviour somewhere...

Probably....

A quick look suggests that the following patch might make a
difference, but there is more to it than that.  I think there are
subtle differences due to the use of version-1 superblocks.  That
might be just another one-line change, but I want to make sure first.

Thanks,
NeilBrown



### Diffstat output
 ./Monitor.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/Monitor.c ./Monitor.c
--- .prev/Monitor.c	2006-12-21 17:15:55.000000000 +1100
+++ ./Monitor.c	2007-01-15 08:17:30.000000000 +1100
@@ -383,7 +383,7 @@ int Monitor(mddev_dev_t devlist,
 						)
 						alert("SpareActive", dev, dv, mailaddr, mailfrom, alert_cmd, dosyslog);
 				}
-				st->devstate[i] = disc.state;
+				st->devstate[i] = newstate;
 				st->devid[i] = makedev(disc.major, disc.minor);
 			}
 			st->active = array.active_disks;


* Re: FailSpare event?
       [not found]         ` <45ABA3E4.3050800@tmr.com>
@ 2007-01-15 19:59           ` Nix
  0 siblings, 0 replies; 17+ messages in thread
From: Nix @ 2007-01-15 19:59 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: earny, Neil Brown, Mike, linux-raid

On 15 Jan 2007, Bill Davidsen told this:
> Nix wrote:
>>     Number   Major   Minor   RaidDevice State
>>        0       8        6        0      active sync   /dev/sda6
>>        1       8       22        1      active sync   /dev/sdb6
>>        3      22        5        2      active sync   /dev/hdc5
>>
>>     Number   Major   Minor   RaidDevice State
>>        0       8       23        0      active sync   /dev/sdb7
>>        1       8        7        1      active sync   /dev/sda7
>>        3       3        5        2      active sync   /dev/hda5
>>
>> 0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
>> from `RaidDevice'? Why have both?)
>>
>>
> Did you ever move the data to these drives from another? I think this
> is what you see when you migrate by adding a drive as a spare, then
> mark an existing drive as failed, so the data is rebuilt on the new
> drive. Was there ever a device 2?

Nope. These arrays were created in one lump and never had a spare.

Plenty of pvmoves have happened on them, but that's *inside* the
arrays, of course...

-- 
`He accused the FSF of being "something of a hypocrit", which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


* Re: FailSpare event?
  2007-01-14 21:20         ` Neil Brown
@ 2007-01-15 20:08           ` Nix
  0 siblings, 0 replies; 17+ messages in thread
From: Nix @ 2007-01-15 20:08 UTC (permalink / raw)
  To: Neil Brown; +Cc: earny, Mike, linux-raid

On 14 Jan 2007, Neil Brown told this:
> A quick look suggests that the following patch might make a
> difference, but there is more to it than that.  I think there are
> subtle differences due to the use of version-1 superblocks.  That
> might be just another one-line change, but I want to make sure first.

Well, that certainly made that warning go away. I don't have any
actually-failed disks, so I can't tell if it would *ever* warn anymore ;)

... actually, it just picked up some monthly array check activity:

Jan 15 20:03:17 loki daemon warning: mdadm: Rebuild20 event detected on md device /dev/md2

So it looks like it works perfectly well now.

(Looking at the code, yeah, without that change it'll never remember
state changes at all!)

One bit of residue from the state before this patch remains on line 352,
where you initialize disc.state and then never use it for anything...

-- 
`He accused the FSF of being "something of a hypocrit", which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


