linux-raid.vger.kernel.org archive mirror
* RE: raid5 won't resync
@ 2004-08-31 15:32 Mike Fowler
From: Mike Fowler @ 2004-08-31 15:32 UTC
  To: linux-raid

Hi,

I had a similar problem on my box a couple of weeks ago. I have a 
dual-AMD 1.3GHz system, with two SIIG UltraATA 133 PCI IDE cards, eight 
hard drives, forming 2 RAID-5 arrays. Each array consists of four drives 
on one card. With just one array active the machine would run fine, but 
with two it would lock up on startup. In the end it turned out to be an 
interrupt race on the APIC chip. Disabling the APIC (kernel parameter 
noapic) let the machine boot and resync the arrays. Might this 
help in your situation?
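
To confirm whether a running kernel was actually booted with noapic, one can check the contents of /proc/cmdline. A minimal Python sketch follows; the function name has_boot_param is made up for illustration, and only the noapic parameter itself comes from the message above.

```python
def has_boot_param(cmdline, param="noapic"):
    """Return True if `param` appears as a whole word on the kernel command line.

    `cmdline` is the text of /proc/cmdline; the helper name is hypothetical.
    """
    # Boot parameters are whitespace-separated, so compare whole tokens
    # rather than substrings (otherwise "nolapic" would not be told apart).
    return param in cmdline.split()
```

Usage would be along the lines of has_boot_param(open("/proc/cmdline").read()) on the affected machine.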

-- 
-Mike Fowler
"I could be a genius if I just put my mind to it, and I,
I could do anything, if only I could get 'round to it"



Jon Lewis wrote:

> On Tue, 31 Aug 2004, Guy wrote:
>
>  
>
>> I have read where someone else had a similar problem.
>> The slowdown was caused by a bad hard disk.
>>
>> Do a dd read test of each disk in the array.
>>
>> Example:
>> time dd if=/dev/sdj of=/dev/null bs=64k
>>   
>
>
> All of these finished at about the same time with no read errors 
> reported.
>
>  
>
>> Someone else has said:
>> Performance can be bad if the disk controller is sharing an interrupt 
>> with
>> another device.
>> It is ok for 2 of the same model cards to share 1 interrupt.
>>   
>
>
> Since it's an SMP system, IO APIC gives us lots of IRQs and there is no
> sharing.
>
>           CPU0       CPU1
>  0:     739040    1188881    IO-APIC-edge  timer
>  1:        173        178    IO-APIC-edge  keyboard
>  2:          0          0          XT-PIC  cascade
> 14:     355893     353513    IO-APIC-edge  ide0
> 15:    1963919    1944260    IO-APIC-edge  ide1
> 20:       7171       7690   IO-APIC-level  eth0
> 21:          2          3   IO-APIC-level  eth1
> 23:    1540742    1537849   IO-APIC-level  qlogicfc
> 27:    1540624    1539874   IO-APIC-level  qlogicfc
>
> Since the recovery had stopped making progress, I decided to fail the
> drive it had brought in as the spare with mdadm /dev/md2 -f /dev/sdf1.
> That worked as expected.  mdadm /dev/md2 -r /dev/sdf1 seems to have hung.
> It's in state D and I can't terminate it.  Trying to add a new spare,
> mdadm can't get a lock on /dev/md2 because the previous one is stuck.
>
> I suspect at this point, we're going to have to just reboot again.
>
> ----------------------------------------------------------------------
> Jon Lewis                   |  I route
> Senior Network Engineer     |  therefore you are
> Atlantic Net                |
> _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________

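The interrupt-sharing question in the quoted message can also be checked mechanically. The sketch below is a rough Python illustration, assuming the 2.4-style /proc/interrupts layout quoted above (IRQ number, per-CPU counts, controller type, then a comma-separated device list); the function name shared_irqs is hypothetical, and newer kernels use a somewhat different layout that this does not handle.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> device names for IRQs used by more than one device.

    Assumes the 2.4-era /proc/interrupts layout shown in this thread.
    """
    shared = {}
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        # Skip the CPU header row and non-numeric rows such as NMI/ERR.
        if not sep or not head.strip().isdigit():
            continue
        fields = rest.split()
        # Skip the per-CPU interrupt counters.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        # fields[i] is the controller type (e.g. IO-APIC-level);
        # everything after it is the device list.
        devices = " ".join(fields[i + 1:])
        names = [d.strip() for d in devices.split(",") if d.strip()]
        if len(names) > 1:
            shared[int(head)] = names
    return shared
```

Fed the listing quoted above, this returns an empty dict, which agrees with the statement that nothing is sharing an IRQ.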

* raid5 won't resync
@ 2004-08-31  3:08 Jon Lewis
From: Jon Lewis @ 2004-08-31  3:08 UTC
  To: linux-raid; +Cc: aaron

We had a large mail server lose a drive today (not the first time), but
we've been having a lot of trouble with the resync this time.

mdadm told us /dev/sde1 had failed.  A coworker did a raidhotadd with a hot
spare (/dev/sdg1).  The machine was under heavy load so we weren't surprised
that the rebuild was going kind of slowly.  About 4 hours later, the
system locked up with lots of "qlogicfc0: no handles slots, this should not
happen" error messages.

At this point, we moved the drives (Fibre Channel attached SCA SCSI drive
array) to a spare system with its own qlogic card.  The kernel sees the RAID5
and says that /dev/sde1 is bad.  It starts trying to resync it, but
it's using a different spare drive.  After about 10% of the resync, the
resync speed slows to a few hundred K/sec, and keeps getting slower.
At this point the FS on the RAID5 isn't even mounted, so there shouldn't
be any system activity competing with the RAID rebuild.
/proc/sys/dev/raid/speed_limit_max is set to 100000.
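
For anyone wanting to double-check the md throttling settings, both limits live under /proc/sys/dev/raid and are expressed in K/sec. A small Python sketch, assuming nothing beyond those two files; the function name and the sysctl_dir parameter are made up here so the code can be pointed at a test directory:

```python
import os


def read_raid_speed_limits(sysctl_dir="/proc/sys/dev/raid"):
    """Return the md resync speed limits (K/sec) as a dict.

    The helper name and `sysctl_dir` parameter are illustrative only.
    """
    limits = {}
    for name in ("speed_limit_min", "speed_limit_max"):
        # Each file contains a single integer followed by a newline.
        with open(os.path.join(sysctl_dir, name)) as f:
            limits[name] = int(f.read().strip())
    return limits
```

With speed_limit_max at 100000, a resync running at a few hundred K/sec is nowhere near the ceiling, so the ceiling is not what is holding it back.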

Personalities : [raid5]
read_ahead 1024 sectors
md2 : active raid5 sdf1[10] sdm1[9] sdl1[8] sdk1[7] sdj1[6] sdn1[5] sdg1[3] sdd1[2] sdc1[1] sdb1[0]
     315266688 blocks level 5, 64k chunk, algorithm 2 [10/9] [UUUU_UUUUU]
     [==>..................]  recovery = 11.6% (4065836/35029632) finish=1400.0min speed=368K/sec
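
As a sanity check on the finish estimate, the remaining time can be recomputed from the recovery counters and the current speed. The Python sketch below does this, assuming the /proc/mdstat recovery format shown above, with blocks counted in 1K units; the function name recovery_eta_minutes is hypothetical.

```python
import re


def recovery_eta_minutes(mdstat_text):
    """Estimate remaining recovery time in minutes from /proc/mdstat text.

    Returns None if no recovery line or speed is found, or if speed is 0.
    """
    progress = re.search(r"recovery\s*=\s*[\d.]+%\s*\((\d+)/(\d+)\)", mdstat_text)
    speed = re.search(r"speed=(\d+)K/sec", mdstat_text)
    if not progress or not speed:
        return None
    done, total = int(progress.group(1)), int(progress.group(2))
    kb_per_sec = int(speed.group(1))
    if kb_per_sec == 0:
        return None
    # Remaining 1K blocks divided by K/sec gives seconds; convert to minutes.
    return (total - done) / kb_per_sec / 60.0
```

For the numbers above, (35029632 - 4065836) / 368 / 60 comes to roughly 1402 minutes, in line with the 1400.0min the kernel reports, so the estimate reflects the slow speed and not a miscalculation.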

The kernel version on the original system, where the drive failed and the
lockup happened during resync, was 2.4.20-28.rh8.0.atsmp from
http://atrpms.net.  ATrpms simply rebuilds the Red Hat kernel with
the XFS patches applied.

That system will also crash with the following ATrpms kernels:
2.4.20-35
2.4.20-19
2.4.18-14

Kernel version on spare system doing the slow resync is 2.4.22 from
kernel.org with XFS patches from http://oss.sgi.com/projects/xfs/.  The
big raid5 is an XFS fs.

Each system has 2 qlogic cards (all of which are the same).  The ones in
the system where it's resyncing now are:

QLogic ISP2100 SCSI on PCI bus 01 device 10 irq 27 base 0xe800
QLogic ISP2100 SCSI on PCI bus 01 device 18 irq 23 base 0xe400

The drives are all:
  Vendor: IBM      Model: DRHL36L  CLAR36  Rev: 3347
  Type:   Direct-Access                    ANSI SCSI revision: 02

Both systems are dual PIII 1.4's with 4GB RAM.

Anyone have any idea what bug(s) we're running into or have suggestions
for getting this RAID5 back in sync and in service?

----------------------------------------------------------------------
 Jon Lewis                   |  I route
 Senior Network Engineer     |  therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________

