* RE: raid5 won't resync
@ 2004-08-31 15:32 Mike Fowler
0 siblings, 0 replies; 10+ messages in thread
From: Mike Fowler @ 2004-08-31 15:32 UTC (permalink / raw)
To: linux-raid
Hi,
I had a similar problem on my box a couple of weeks ago. I have a
dual-AMD 1.3GHz system, with two SIIG UltraATA 133 PCI IDE cards, eight
hard drives, forming 2 RAID-5 arrays. Each array consists of four drives
on one card. With just one RAID active the machine would run fine, but
with two it would lock up on startup. In the end it turned out to be an
interrupt race on the APIC chip; by disabling the APIC (kernel parameter
noapic) the machine was able to boot and resync the arrays. Might this
help in your situation?
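In case it helps, here's roughly how we pass it. The stanza below is just a
hypothetical lilo.conf entry (kernel image, label and root device are made
up); the same word can also be typed at the boot: prompt for a one-off test:
image=/boot/vmlinuz-2.4.22        # hypothetical kernel image
    label=linux
    root=/dev/md1                 # made-up root device
    append="noapic"               # the relevant bit
(rerun /sbin/lilo after editing, or just enter "linux noapic" at the prompt)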
--
-Mike Fowler
"I could be a genius if I just put my mind to it, and I,
I could do anything, if only I could get 'round to it"
Jon Lewis wrote:
> On Tue, 31 Aug 2004, Guy wrote:
>
>
>
>> I have read where someone else had a similar problem.
>> The slowdown was caused by a bad hard disk.
>>
>> Do a dd read test of each disk in the array.
>>
>> Example:
>> time dd if=/dev/sdj of=/dev/null bs=64k
>>
>
>
> All of these finished at about the same time with no read errors
> reported.
>
>
>
>> Someone else has said:
>> Performance can be bad if the disk controller is sharing an interrupt
>> with
>> another device.
>> It is ok for 2 of the same model cards to share 1 interrupt.
>>
>
>
> Since it's an SMP system, IO APIC gives us lots of IRQs and there is no
> sharing.
>
> CPU0 CPU1
> 0: 739040 1188881 IO-APIC-edge timer
> 1: 173 178 IO-APIC-edge keyboard
> 2: 0 0 XT-PIC cascade
> 14: 355893 353513 IO-APIC-edge ide0
> 15: 1963919 1944260 IO-APIC-edge ide1
> 20: 7171 7690 IO-APIC-level eth0
> 21: 2 3 IO-APIC-level eth1
> 23: 1540742 1537849 IO-APIC-level qlogicfc
> 27: 1540624 1539874 IO-APIC-level qlogicfc
>
> Since the recovery had stopped making progress, I decided to fail the
> drive it had brought in as the spare with mdadm /dev/md2 -f /dev/sdf1.
> That worked as expected. mdadm /dev/md2 -r /dev/sdf1 seems to have hung.
> It's in state D and I can't terminate it. Trying to add a new spare,
> mdadm can't get a lock on /dev/md2 because the previous one is stuck.
>
> I suspect at this point, we're going to have to just reboot again.
>
> ----------------------------------------------------------------------
> Jon Lewis | I route
> Senior Network Engineer | therefore you are
> Atlantic Net |
> _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* raid5 won't resync
@ 2004-08-31 3:08 Jon Lewis
2004-08-31 4:08 ` Guy
0 siblings, 1 reply; 10+ messages in thread
From: Jon Lewis @ 2004-08-31 3:08 UTC (permalink / raw)
To: linux-raid; +Cc: aaron
We had a large mail server lose a drive today (not the first time), but
we've been having a lot of trouble with the resync this time.
mdadm told us /dev/sde1 had failed. Coworker did a raidhotadd with a hot
spare (/dev/sdg1). Machine was under heavy load so we weren't surprised
that the rebuild was going kind of slowly. About 4 hours later, the
system locked up with lots of "qlogicfc0: no handle slots, this should not
happen" error messages.
At this point, we moved the drives (fiber channel attached SCA scsi drive
array) to a spare system with its own qlogic card. Kernel sees the RAID5
and says that /dev/sde1 is bad. It starts trying to resync it, but
it's using a different spare drive. After about 10% of the resync, the
resync speed slows to a few hundred K/sec and keeps getting slower.
At this point the FS on the RAID5 isn't even mounted, so there shouldn't
be any system activity competing with the RAID rebuild.
/proc/sys/dev/raid/speed_limit_max is set to 100000.
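For what it's worth, both md throttle knobs can be checked and raised like
this; speed_limit_min is normally the one that bites when the box is busy,
so it shouldn't be the limiting factor with nothing else running:
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
echo 10000 > /proc/sys/dev/raid/speed_limit_min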
Personalities : [raid5]
read_ahead 1024 sectors
md2 : active raid5 sdf1[10] sdm1[9] sdl1[8] sdk1[7] sdj1[6] sdn1[5] sdg1[3] sdd1[2] sdc1[1] sdb1[0]
315266688 blocks level 5, 64k chunk, algorithm 2 [10/9] [UUUU_UUUUU]
[==>..................] recovery = 11.6% (4065836/35029632) finish=1400.0min speed=368K/sec
The kernel version on the original system, where the drive failed and the
lockup happened during resync, was 2.4.20-28.rh8.0.atsmp from
http://atrpms.net. The ATrpms kernels are simply the Red Hat kernel
rebuilt with the XFS patches applied.
That system will also crash with the following ATrpms kernels:
2.4.20-35
2.4.20-19
2.4.18-14
Kernel version on spare system doing the slow resync is 2.4.22 from
kernel.org with XFS patches from http://oss.sgi.com/projects/xfs/. The
big raid5 is an XFS fs.
Each system has 2 qlogic cards (all of which are the same). The ones in the
system where it's resyncing now are:
QLogic ISP2100 SCSI on PCI bus 01 device 10 irq 27 base 0xe800
QLogic ISP2100 SCSI on PCI bus 01 device 18 irq 23 base 0xe400
The drives are all:
Vendor: IBM Model: DRHL36L CLAR36 Rev: 3347
Type: Direct-Access ANSI SCSI revision: 02
Both systems are dual PIII 1.4's with 4GB RAM.
Anyone have any idea what bug(s) we're running into or have suggestions
for getting this RAID5 back in sync and in service?
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 3:08 Jon Lewis
@ 2004-08-31 4:08 ` Guy
2004-08-31 8:08 ` Jon Lewis
0 siblings, 1 reply; 10+ messages in thread
From: Guy @ 2004-08-31 4:08 UTC (permalink / raw)
To: 'Jon Lewis', linux-raid; +Cc: aaron
I have read where someone else had a similar problem.
The slowdown was caused by a bad hard disk.
Do a dd read test of each disk in the array.
Example:
time dd if=/dev/sdj of=/dev/null bs=64k
Open different windows and test all of the disks at the same time, 1 per
window. If you test them all from the same window using "&" the output will
get mixed.
The time command is to compare the performance of each disk.
The time command is optional.
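If you'd rather drive it all from one shell, here's a rough sketch that keeps
the outputs separate by writing one log per disk (the device list is only an
example -- substitute the actual members of your array):
for d in sdb sdc sdd sde sdf sdg sdj sdk sdl sdm sdn; do
  { time dd if=/dev/$d of=/dev/null bs=64k; } > /tmp/dd-$d.log 2>&1 &
done
wait
grep real /tmp/dd-*.log     # compare the elapsed times side by side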
Someone else has said:
Performance can be bad if the disk controller is sharing an interrupt with
another device.
It is ok for 2 of the same model cards to share 1 interrupt.
Use this to determine which interrupts are being used:
cat /proc/interrupts
Moving the card may change the interrupt.
You may also change the interrupts from the BIOS.
I don't think an interrupt problem would cause a slowdown over time.
I bet you have a problem with a disk drive.
I hope this helps!
Guy
-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Jon Lewis
Sent: Monday, August 30, 2004 11:09 PM
To: linux-raid@vger.kernel.org
Cc: aaron@america.com
Subject: raid5 won't resync
We had a large mail server lose a drive today (not the first time), but
we've been having a lot of trouble with the resync this time.
mdadm told us /dev/sde1 had failed. Coworker did a raidhotadd with a hot
spare (/dev/sdg1). Machine was under heavy load so we weren't surprised
that the rebuild was going kind of slowly. About 4 hours later, the
system locked up with lots of "qlogicfc0: no handle slots, this should not
happen" error messages.
At this point, we moved the drives (fiber channel attached SCA scsi drive
array) to a spare system with its own qlogic card. Kernel sees the RAID5
and says that /dev/sde1 is bad. It starts trying to resync it, but
it's using a different spare drive. After about 10% of the resync, the
resync speed slows to a few hundred K/sec and keeps getting slower.
At this point the FS on the RAID5 isn't even mounted, so there shouldn't
be any system activity competing with the RAID rebuild.
/proc/sys/dev/raid/speed_limit_max is set to 100000.
Personalities : [raid5]
read_ahead 1024 sectors
md2 : active raid5 sdf1[10] sdm1[9] sdl1[8] sdk1[7] sdj1[6] sdn1[5] sdg1[3] sdd1[2] sdc1[1] sdb1[0]
315266688 blocks level 5, 64k chunk, algorithm 2 [10/9] [UUUU_UUUUU]
[==>..................] recovery = 11.6% (4065836/35029632) finish=1400.0min speed=368K/sec
The kernel version on the original system, where the drive failed and the
lockup happened during resync, was 2.4.20-28.rh8.0.atsmp from
http://atrpms.net. The ATrpms kernels are simply the Red Hat kernel
rebuilt with the XFS patches applied.
That system will also crash with the following ATrpms kernels:
2.4.20-35
2.4.20-19
2.4.18-14
Kernel version on spare system doing the slow resync is 2.4.22 from
kernel.org with XFS patches from http://oss.sgi.com/projects/xfs/. The
big raid5 is an XFS fs.
Each system has 2 qlogic cards (all of which are the same). The ones in the
system where it's resyncing now are:
QLogic ISP2100 SCSI on PCI bus 01 device 10 irq 27 base 0xe800
QLogic ISP2100 SCSI on PCI bus 01 device 18 irq 23 base 0xe400
The drives are all:
Vendor: IBM Model: DRHL36L CLAR36 Rev: 3347
Type: Direct-Access ANSI SCSI revision: 02
Both systems are dual PIII 1.4's with 4GB RAM.
Anyone have any idea what bug(s) we're running into or have suggestions
for getting this RAID5 back in sync and in service?
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 4:08 ` Guy
@ 2004-08-31 8:08 ` Jon Lewis
2004-08-31 14:50 ` Guy
0 siblings, 1 reply; 10+ messages in thread
From: Jon Lewis @ 2004-08-31 8:08 UTC (permalink / raw)
To: Guy; +Cc: linux-raid, aaron
On Tue, 31 Aug 2004, Guy wrote:
> I have read where someone else had a similar problem.
> The slowdown was caused by a bad hard disk.
>
> Do a dd read test of each disk in the array.
>
> Example:
> time dd if=/dev/sdj of=/dev/null bs=64k
All of these finished at about the same time with no read errors reported.
> Someone else has said:
> Performance can be bad if the disk controller is sharing an interrupt with
> another device.
> It is ok for 2 of the same model cards to share 1 interrupt.
Since it's an SMP system, IO APIC gives us lots of IRQs and there is no
sharing.
CPU0 CPU1
0: 739040 1188881 IO-APIC-edge timer
1: 173 178 IO-APIC-edge keyboard
2: 0 0 XT-PIC cascade
14: 355893 353513 IO-APIC-edge ide0
15: 1963919 1944260 IO-APIC-edge ide1
20: 7171 7690 IO-APIC-level eth0
21: 2 3 IO-APIC-level eth1
23: 1540742 1537849 IO-APIC-level qlogicfc
27: 1540624 1539874 IO-APIC-level qlogicfc
Since the recovery had stopped making progress, I decided to fail the
drive it had brought in as the spare with mdadm /dev/md2 -f /dev/sdf1.
That worked as expected. mdadm /dev/md2 -r /dev/sdf1 seems to have hung.
It's in state D and I can't terminate it. Trying to add a new spare,
mdadm can't get a lock on /dev/md2 because the previous one is stuck.
I suspect at this point, we're going to have to just reboot again.
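(If you want to see where the stuck mdadm is sleeping first, the kernel wait
channel shows up with something like this -- purely a diagnostic sketch:
ps axo pid,stat,wchan:25,args | grep '[m]dadm'
A process in state D won't go away until whatever it's waiting on in the
kernel completes, so a reboot may indeed be the only way out.)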
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 8:08 ` Jon Lewis
@ 2004-08-31 14:50 ` Guy
2004-08-31 20:09 ` Jon Lewis
0 siblings, 1 reply; 10+ messages in thread
From: Guy @ 2004-08-31 14:50 UTC (permalink / raw)
To: 'Jon Lewis'; +Cc: linux-raid, aaron
At this point you need professional help! :)
I don't know what to tell you.
Good luck,
Guy
-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Jon Lewis
Sent: Tuesday, August 31, 2004 4:08 AM
To: Guy
Cc: linux-raid@vger.kernel.org; aaron@america.com
Subject: RE: raid5 won't resync
On Tue, 31 Aug 2004, Guy wrote:
> I have read where someone else had a similar problem.
> The slowdown was caused by a bad hard disk.
>
> Do a dd read test of each disk in the array.
>
> Example:
> time dd if=/dev/sdj of=/dev/null bs=64k
All of these finished at about the same time with no read errors reported.
> Someone else has said:
> Performance can be bad if the disk controller is sharing an interrupt with
> another device.
> It is ok for 2 of the same model cards to share 1 interrupt.
Since it's an SMP system, IO APIC gives us lots of IRQs and there is no
sharing.
CPU0 CPU1
0: 739040 1188881 IO-APIC-edge timer
1: 173 178 IO-APIC-edge keyboard
2: 0 0 XT-PIC cascade
14: 355893 353513 IO-APIC-edge ide0
15: 1963919 1944260 IO-APIC-edge ide1
20: 7171 7690 IO-APIC-level eth0
21: 2 3 IO-APIC-level eth1
23: 1540742 1537849 IO-APIC-level qlogicfc
27: 1540624 1539874 IO-APIC-level qlogicfc
Since the recovery had stopped making progress, I decided to fail the
drive it had brought in as the spare with mdadm /dev/md2 -f /dev/sdf1.
That worked as expected. mdadm /dev/md2 -r /dev/sdf1 seems to have hung.
It's in state D and I can't terminate it. Trying to add a new spare,
mdadm can't get a lock on /dev/md2 because the previous one is stuck.
I suspect at this point, we're going to have to just reboot again.
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 14:50 ` Guy
@ 2004-08-31 20:09 ` Jon Lewis
2004-08-31 20:40 ` Guy
0 siblings, 1 reply; 10+ messages in thread
From: Jon Lewis @ 2004-08-31 20:09 UTC (permalink / raw)
To: linux-raid; +Cc: aaron
Now we've got a new problem with the raid array from last night. We've
switched qlogic drivers to one that some people have posted is more stable
than the one we were using. This unfortunately changed all the scsi
device names: the drives that were sd[a-g] are now sd[h-n], and vice
versa.
I put the following in /etc/mdadm.conf:
DEVICE /dev/sd[abcdefghijklmn][1]
ARRAY /dev/md2 level=raid5 num-devices=10 UUID=532d4b61:48f5278b:4fd2e730:6dd4a608
That DEVICE line should cover all the members (under their new device
names) for the raid5 array.
then I ran:
mdadm --assemble /dev/md2 --uuid 532d4b61:48f5278b:4fd2e730:6dd4a608
or
mdadm --assemble /dev/md2 --scan
Both terminate with the same result:
mdadm: /dev/md2 assembled from 4 drives and 1 spare - not enough to start
the array.
but if I look at /proc/mdstat, it did find all 10 (actually 11) devices.
# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md2 : inactive sdc1[6] sdm1[10] sdf1[9] sde1[8] sdd1[7] sdg1[5] sdl1[4] sdn1[3] sdk1[2] sdj1[1] sdi1[0]
0 blocks
md1 : active raid1 hda1[0] hdc1[1]
30716160 blocks [2/2] [UU]
[>....................] resync = 3.5% (1098392/30716160) finish=298.2min speed=1654K/sec
md0 : active raid1 sdh2[0] sda2[1]
104320 blocks [2/2] [UU]
unused devices: <none>
I suspect it's found both the failed drive (originally sde1, now named
sdl1) and the spare that it had started, but never finished, rebuilding
on (sdg1, now sdn1). Why is mdadm saying there are only 4 devices + 1
spare? Is there a best way to proceed at this point to try to get this
array repaired?
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 20:09 ` Jon Lewis
@ 2004-08-31 20:40 ` Guy
2004-08-31 21:27 ` Jon Lewis
0 siblings, 1 reply; 10+ messages in thread
From: Guy @ 2004-08-31 20:40 UTC (permalink / raw)
To: 'Jon Lewis', linux-raid; +Cc: aaron
I think what you did should work, but...
I have had similar problems.
Try again, but this time don't include any spare disks, or any other disks.
Only include the disks you know have the data.
Or, just list the disks on the command line.
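Something along these lines, i.e. only the nine members you believe hold
current data, leaving out the failed disk and the half-rebuilt spare (the
names below are taken from the mdstat in your mail -- double check them
before running):
mdadm --assemble /dev/md2 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 \
    /dev/sdg1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdm1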
Keep your fingers crossed!
Guy
-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Jon Lewis
Sent: Tuesday, August 31, 2004 4:10 PM
To: linux-raid@vger.kernel.org
Cc: aaron@america.com
Subject: RE: raid5 won't resync
Now we've got a new problem with the raid array from last night. We've
switched qlogic drivers to one that some people have posted is more stable
than the one we were using. This unfortunately changed all the scsi
device names: the drives that were sd[a-g] are now sd[h-n], and vice
versa.
I put the following in /etc/mdadm.conf:
DEVICE /dev/sd[abcdefghijklmn][1]
ARRAY /dev/md2 level=raid5 num-devices=10 UUID=532d4b61:48f5278b:4fd2e730:6dd4a608
That DEVICE line should cover all the members (under their new device
names) for the raid5 array.
then I ran:
mdadm --assemble /dev/md2 --uuid 532d4b61:48f5278b:4fd2e730:6dd4a608
or
mdadm --assemble /dev/md2 --scan
Both terminate with the same result:
mdadm: /dev/md2 assembled from 4 drives and 1 spare - not enough to start
the array.
but if I look at /proc/mdstat, it did find all 10 (actually 11) devices.
# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md2 : inactive sdc1[6] sdm1[10] sdf1[9] sde1[8] sdd1[7] sdg1[5] sdl1[4] sdn1[3] sdk1[2] sdj1[1] sdi1[0]
0 blocks
md1 : active raid1 hda1[0] hdc1[1]
30716160 blocks [2/2] [UU]
[>....................] resync = 3.5% (1098392/30716160) finish=298.2min speed=1654K/sec
md0 : active raid1 sdh2[0] sda2[1]
104320 blocks [2/2] [UU]
unused devices: <none>
I suspect it's found both the failed drive (originally sde1, now named
sdl1) and the spare that it had started, but never finished, rebuilding
on (sdg1, now sdn1). Why is mdadm saying there are only 4 devices + 1
spare? Is there a best way to proceed at this point to try to get this
array repaired?
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 20:40 ` Guy
@ 2004-08-31 21:27 ` Jon Lewis
2004-08-31 22:37 ` Guy
0 siblings, 1 reply; 10+ messages in thread
From: Jon Lewis @ 2004-08-31 21:27 UTC (permalink / raw)
To: Guy; +Cc: linux-raid, aaron
On Tue, 31 Aug 2004, Guy wrote:
> I think what you did should work, but...
> I have had similar problems.
> Try again, but this time don't include any spare disks, or any other disks.
> Only include the disks you know have the data.
> Or, just list the disks on the command line.
# mdadm --assemble /dev/md2 /dev/sdc1 /dev/sdm1 /dev/sdf1 /dev/sde1
/dev/sdd1 /dev/sdg1 /dev/sdk1 /dev/sdj1 /dev/sdi1
mdadm: /dev/md2 assembled from 4 drives and 1 spare - not enough to start
the array.
I've left sdl1 and sdn1 out of the above as they're the failed drive and
the partially rebuilt spare.
I see a pattern that could explain why mdadm thinks there are only 4
drives. From mdadm -E on each drive:
sdc1: Update Time : Tue Aug 31 03:47:27 2004
sdd1: Update Time : Tue Aug 31 03:47:27 2004
sde1: Update Time : Tue Aug 31 03:47:27 2004
sdf1: Update Time : Tue Aug 31 03:47:27 2004
sdg1: Update Time : Mon Aug 30 22:42:36 2004
sdi1: Update Time : Mon Aug 30 22:42:36 2004
sdj1: Update Time : Mon Aug 30 22:42:36 2004
sdk1: Update Time : Mon Aug 30 22:42:36 2004
sdl1: Update Time : Tue Jul 13 02:08:37 2004
sdm1: Update Time : Mon Aug 30 22:42:36 2004
sdn1: Update Time : Mon Aug 30 22:42:36 2004
Is mdadm --assemble seeing that 4 drives have a more recent Update Time
than the rest and ignoring the rest?
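It might also be worth pulling the event counters next to the update times;
the md superblock keeps an Events count and that's what gets compared when
deciding which members are current. A quick sketch (adjust the glob to the
actual member partitions):
for d in /dev/sd[c-n]1; do
  echo "== $d"; mdadm -E $d | egrep 'Update Time|Events'
done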
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 21:27 ` Jon Lewis
@ 2004-08-31 22:37 ` Guy
2004-09-01 0:25 ` Jon Lewis
0 siblings, 1 reply; 10+ messages in thread
From: Guy @ 2004-08-31 22:37 UTC (permalink / raw)
To: 'Jon Lewis'; +Cc: linux-raid, aaron
You have 2 failed drives?
RAID5 only supports 1 failed drive.
Have you tested the drives to determine if they are good?
Example:
dd if=/dev/sdf of=/dev/null bs=64k
If you can find enough good drives, use the force option on assemble.
But don't include any disks that don't have 100% of the data.
A spare that did a partial re-build is not good to use at this point.
So, if your array had 10 disks, you need to find 9 of them that are still
working.
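Roughly like this -- the member list below is lifted from your earlier mail,
minus the failed disk and the half-rebuilt spare, so double check it, and
treat --force as a last resort since it tells mdadm to assemble even when
some superblocks look out of date:
mdadm --assemble --force /dev/md2 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 \
    /dev/sdg1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdm1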
Guy
-----Original Message-----
From: Jon Lewis [mailto:jlewis@lewis.org]
Sent: Tuesday, August 31, 2004 5:27 PM
To: Guy
Cc: linux-raid@vger.kernel.org; aaron@america.com
Subject: RE: raid5 won't resync
On Tue, 31 Aug 2004, Guy wrote:
> I think what you did should work, but...
> I have had similar problems.
> Try again, but this time don't include any spare disks, or any other
disks.
> Only include the disks you know have the data.
> Or, just list the disks on the command line.
# mdadm --assemble /dev/md2 /dev/sdc1 /dev/sdm1 /dev/sdf1 /dev/sde1
/dev/sdd1 /dev/sdg1 /dev/sdk1 /dev/sdj1 /dev/sdi1
mdadm: /dev/md2 assembled from 4 drives and 1 spare - not enough to start
the array.
I've left sdl1 and sdn1 out of the above as they're the failed drive and
the partially rebuilt spare.
I see a pattern that could explain why mdadm thinks there are only 4
drives. From mdadm -E on each drive:
sdc1: Update Time : Tue Aug 31 03:47:27 2004
sdd1: Update Time : Tue Aug 31 03:47:27 2004
sde1: Update Time : Tue Aug 31 03:47:27 2004
sdf1: Update Time : Tue Aug 31 03:47:27 2004
sdg1: Update Time : Mon Aug 30 22:42:36 2004
sdi1: Update Time : Mon Aug 30 22:42:36 2004
sdj1: Update Time : Mon Aug 30 22:42:36 2004
sdk1: Update Time : Mon Aug 30 22:42:36 2004
sdl1: Update Time : Tue Jul 13 02:08:37 2004
sdm1: Update Time : Mon Aug 30 22:42:36 2004
sdn1: Update Time : Mon Aug 30 22:42:36 2004
Is mdadm --assemble seeing that 4 drives have a more recent Update Time
than the rest and ignoring the rest?
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
* RE: raid5 won't resync
2004-08-31 22:37 ` Guy
@ 2004-09-01 0:25 ` Jon Lewis
0 siblings, 0 replies; 10+ messages in thread
From: Jon Lewis @ 2004-09-01 0:25 UTC (permalink / raw)
To: Guy; +Cc: linux-raid, aaron
I don't believe we have 2 failed drives, and AFAICT from doing the dd read
tests last night, none are actually bad. md decided for whatever reason
(qlogic driver bug I'm guessing) that 1 drive had failed. We put in a
spare drive to let it rebuild, but that rebuild never completed. Unless
something else happened that I'm not aware of (quite possible since I'm
125 miles away), we should still have a 10 drive raid5 with one failed
drive...so we ought to be able to get the 9 drives + parity/missing bits
calculation up and running.
On Tue, 31 Aug 2004, Guy wrote:
> You have 2 failed drives?
> RAID5 only supports 1 failed drive.
>
> Have you tested the drives to determine if they are good?
> Example:
> dd if=/dev/sdf of=/dev/null bs=64k
>
> If you can find enough good drives, use the force option on assemble.
> But don't include any disks that don't have 100% of the data.
> A spare that did a partial re-build is not good to use at this point.
>
> So, if your array had 10 disks, you need to find 9 of them that are still
> working.
>
> Guy
>
> -----Original Message-----
> From: Jon Lewis [mailto:jlewis@lewis.org]
> Sent: Tuesday, August 31, 2004 5:27 PM
> To: Guy
> Cc: linux-raid@vger.kernel.org; aaron@america.com
> Subject: RE: raid5 won't resync
>
> On Tue, 31 Aug 2004, Guy wrote:
>
> > I think what you did should work, but...
> > I have had similar problems.
> > Try again, but this time don't include any spare disks, or any other
> disks.
> > Only include the disks you know have the data.
> > Or, just list the disks on the command line.
>
> # mdadm --assemble /dev/md2 /dev/sdc1 /dev/sdm1 /dev/sdf1 /dev/sde1
> /dev/sdd1 /dev/sdg1 /dev/sdk1 /dev/sdj1 /dev/sdi1
> mdadm: /dev/md2 assembled from 4 drives and 1 spare - not enough to start
> the array.
>
> I've left sdl1 and sdn1 out of the above as they're the failed drive and
> the partially rebuilt spare.
>
> I see a pattern that could explain why mdadm thinks there are only 4
> drives. From mdadm -E on each drive:
>
> sdc1: Update Time : Tue Aug 31 03:47:27 2004
> sdd1: Update Time : Tue Aug 31 03:47:27 2004
> sde1: Update Time : Tue Aug 31 03:47:27 2004
> sdf1: Update Time : Tue Aug 31 03:47:27 2004
> sdg1: Update Time : Mon Aug 30 22:42:36 2004
> sdi1: Update Time : Mon Aug 30 22:42:36 2004
> sdj1: Update Time : Mon Aug 30 22:42:36 2004
> sdk1: Update Time : Mon Aug 30 22:42:36 2004
> sdl1: Update Time : Tue Jul 13 02:08:37 2004
> sdm1: Update Time : Mon Aug 30 22:42:36 2004
> sdn1: Update Time : Mon Aug 30 22:42:36 2004
>
> Is mdadm --assemble seeing that 4 drives have a more recent Update Time
> than the rest and ignoring the rest?
>
> ----------------------------------------------------------------------
> Jon Lewis | I route
> Senior Network Engineer | therefore you are
> Atlantic Net |
> _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
>
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________