linux-raid.vger.kernel.org archive mirror
* FSCK and it crashes...
@ 2002-12-10  9:42 Gordon Henderson
  2002-12-10 10:22 ` Wim Vinckier
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Gordon Henderson @ 2002-12-10  9:42 UTC (permalink / raw)
  To: linux-raid


I've been using Linux RAID for a few years now with good success, but I have
recently hit a problem common to several of the systems I look after running
2.4.x kernels.

When I try to fsck a RAID partition (and I've seen this happen on RAID 1 and
RAID 5), the machine locks up and needs a reset to get it going again. On past
occasions I reverted to a 2.2 kernel with the RAID patches and it went just
fine; this time, however, I need hardware support (new Promise IDE controllers)
and LVM that only seem to be available in very recent 2.4 kernels (i.e. 2.4.19
for the hardware).

I've had a quick search of the archives and didn't really find anything.
Does anyone have any clues? Maybe I'm missing something obvious.

The box runs Debian 3 and is a dual-processor (AMD Athlon MP 1600+) machine
with 4 IDE drives on 2 Promise dual ATA/133 controllers (only the CD-ROM is on
the on-board controller). The kernels are stock ones off ftp.kernel.org.
(Debian 3 ships with 2.4.18, which doesn't have the Promise drivers, so I had
to do the initial build by connecting one drive to the on-board controller and
then migrating it over.)

The 4 drives are partitioned identically with 4 primary partitions: 256M,
1024M, 2048M and the rest of the disk (~120GB). The 4 big partitions are
combined into a RAID 5 array, which I then turn into one big physical volume
using LVM, and from that I create a 150GB logical volume (so I can take LVM
snapshots using the remaining ~200GB available). I'm wondering if this is now
a bit too ambitious. I'll do some tests later without LVM, but I have had this
problem on 2 other boxes that don't use LVM.
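
For illustration, the LVM layering described here might look roughly like the
following sketch, assuming the big RAID 5 array is /dev/md3 as in the mdstat
output below; the volume group and volume names are made up and the snapshot
size is only an example:

  pvcreate /dev/md3                    # the ~350GB RAID 5 array becomes one physical volume
  vgcreate vg0 /dev/md3                # one volume group on top of it
  lvcreate -L 150G -n data vg0         # the 150GB logical volume
  lvcreate -s -L 20G -n nightly /dev/vg0/data   # a snapshot carved from the remaining space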

The other partitions are also RAID 5, except for the root partition, which is
RAID 1 so it can boot.

It's nice and fast, seems stable when running, and can withstand the loss of
any one disk, but the nagging fear that you might never be able to fsck it is
a bit worrying... (Moving to XFS is planned anyway, but I feel we're right on
the edge here with new hardware and software and don't want to push ourselves
over!)

So any insight or clues would be appreciated,

Thanks,

Gordon


PS. Output of /proc/mdstat, if it helps:

md0 : active raid1 hdg1[1] hde1[0]
      248896 blocks [2/2] [UU]

md4 : active raid1 hdk1[1] hdi1[0]
      248896 blocks [2/2] [UU]

md1 : active raid5 hdk2[3] hdi2[2] hdg2[1] hde2[0]
      1493760 blocks level 5, 32k chunk, algorithm 0 [4/4] [UUUU]

md2 : active raid5 hdk3[3] hdi3[2] hdg3[1] hde3[0]
      6000000 blocks level 5, 32k chunk, algorithm 0 [4/4] [UUUU]

md3 : active raid5 hdk4[3] hdi4[2] hdg4[1] hde4[0]
      353630592 blocks level 5, 32k chunk, algorithm 0 [4/4] [UUUU]

unused devices: <none>




* Re: FSCK and it crashes...
  2002-12-10  9:42 FSCK and it crashes Gordon Henderson
@ 2002-12-10 10:22 ` Wim Vinckier
  2002-12-10 18:58 ` Steven Dake
  2002-12-13 15:38 ` raid5: switching cache buffer size Gordon Henderson
  2 siblings, 0 replies; 8+ messages in thread
From: Wim Vinckier @ 2002-12-10 10:22 UTC (permalink / raw)
  To: Gordon Henderson; +Cc: linux-raid

On Tue, 10 Dec 2002, Gordon Henderson wrote:

> When I try to fsck a RAID partition (and I've seen this happen on RAID 1 and
> RAID 5), the machine locks up and needs a reset to get it going again.

I had a problem which was very similar. One disk in my raid1 system failed
and needed to be replaced. I installed a new disk, but after it had been
syncing for a while, my system crashed. I didn't need an fsck because I was
using ReiserFS. When I started my system again, the RAID needed a sync, so I
got the crash again. I tried it with 2.4.18, 2.4.19 and 2.4.20 kernels.
2.4.18 was able to sync my disks, and after a little searching I found a
different value in /proc/sys/dev/raid/speed_limit_max. When I changed it on
my 2.4.20 kernel, I could usually sync. I wasn't able to trace the exact
problem. I thought it was a thermal problem, since I had had some thermal
trouble with my system, but it seems there are more people out there with
almost the same problem.
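
For reference, checking and lowering the resync ceiling Wim mentions looks
something like this (the 10000 KB/s figure is only an illustrative value, not
the one he used):

  cat /proc/sys/dev/raid/speed_limit_max            # current maximum resync speed, in KB/s
  echo 10000 > /proc/sys/dev/raid/speed_limit_max   # lower it to throttle the resync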

Greetings,

Wim.
------------------------------------------------------------------------
Wim VINCKIER
Wasstraat 38                                          Wim-Raid@tisnix.be
B-9000  Gent                                               ICQ 100545109
------------------------------------------------------------------------
'Windows 98 or better required' said the box... so I installed linux



* Re: FSCK and it crashes...
  2002-12-10  9:42 FSCK and it crashes Gordon Henderson
  2002-12-10 10:22 ` Wim Vinckier
@ 2002-12-10 18:58 ` Steven Dake
  2002-12-10 19:04   ` Gordon Henderson
  2002-12-13 15:38 ` raid5: switching cache buffer size Gordon Henderson
  2 siblings, 1 reply; 8+ messages in thread
From: Steven Dake @ 2002-12-10 18:58 UTC (permalink / raw)
  To: Gordon Henderson; +Cc: linux-raid

Gordon,

I believe there is a bug in RAID in 2.4.18 which causes resyncs never to
complete (it's fixed in 2.4.19). You might check that your drive isn't
resyncing (cat /proc/mdstat; if you see a percentage, it's resyncing).
This may or may not be your problem, although I'd try a newer kernel.
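
A quick way to make that check, for what it's worth (a rebuild in progress
shows up as a resync or recovery line with a percentage):

  grep -E 'resync|recovery' /proc/mdstat    # prints nothing if no rebuild is running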

Thanks
-steve




* Re: FSCK and it crashes...
  2002-12-10 18:58 ` Steven Dake
@ 2002-12-10 19:04   ` Gordon Henderson
  0 siblings, 0 replies; 8+ messages in thread
From: Gordon Henderson @ 2002-12-10 19:04 UTC (permalink / raw)
  To: Steven Dake; +Cc: linux-raid

On Tue, 10 Dec 2002, Steven Dake wrote:

> Gordon,
>
> I believe there is a bug in RAID in 2.4.18 which causes resyncs never to
> complete (it's fixed in 2.4.19). You might check that your drive isn't
> resyncing (cat /proc/mdstat; if you see a percentage, it's resyncing).
> This may or may not be your problem, although I'd try a newer kernel.

Thanks, but we started with 2.4.20, and it crashed even after it had finished
the resync... I'm going to do some more tests without LVM to see if that's
getting in the way.

Gordon



* raid5: switching cache buffer size
  2002-12-10  9:42 FSCK and it crashes Gordon Henderson
  2002-12-10 10:22 ` Wim Vinckier
  2002-12-10 18:58 ` Steven Dake
@ 2002-12-13 15:38 ` Gordon Henderson
  2002-12-13 18:59   ` Luca Berra
  2 siblings, 1 reply; 8+ messages in thread
From: Gordon Henderson @ 2002-12-13 15:38 UTC (permalink / raw)
  To: linux-raid


Anyone know what this means?

My big server is currently fscking an LVM partition which sits on top of a
raid5 device, and the console is scrolling these messages as fast as it can.
It says it's switching from 0 to 512, from 512 to 0 and from 0 to 1024 (or
maybe 512 to 1024; it's hard to tell).

What's happening and how do I stop it?

Cheers,

Gordon

PS. The earlier fsck problems seemed to be due to dodgy AMD motherboard
hardware and (U)DMA. It's a documented AMD bug and they have no workaround,
but plugging in a PS/2 mouse seems to fix it...



* Re: raid5: switching cache buffer size
  2002-12-13 15:38 ` raid5: switching cache buffer size Gordon Henderson
@ 2002-12-13 18:59   ` Luca Berra
  2002-12-13 19:17     ` Kanoalani Withington
  2002-12-13 22:29     ` Gordon Henderson
  0 siblings, 2 replies; 8+ messages in thread
From: Luca Berra @ 2002-12-13 18:59 UTC (permalink / raw)
  To: Gordon Henderson; +Cc: linux-raid

Gordon Henderson wrote:
> Anyone know what this means?
>
> My big server is currently fscking an LVM partition which sits on top of a
> raid5 device, and the console is scrolling these messages as fast as it can.
> It says it's switching from 0 to 512, from 512 to 0 and from 0 to 1024 (or
> maybe 512 to 1024; it's hard to tell).
>
> What's happening and how do I stop it?

Hi,
you probably have filesystems with different block sizes in the volume group.
The RAID cache cannot cope with requests of varying block size, so it has to
flush the cache each time it switches; this only hurts performance and is not
a problem for data safety.

Fixes:
1) Make sure everything on your raid5 uses the same block size. If you have
swap you are locked into 4k blocks; if you are using snapshots, 1k blocks; if
you use both, jump.
2) Comment out the printk in the raid5 source.

L.
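
A sketch of checking for mixed block sizes and of finding the printk Luca
mentions (the device path is illustrative, and this assumes ext2 filesystems
and a 2.4 kernel source tree):

  dumpe2fs -h /dev/vg0/data | grep 'Block size'               # repeat for each LV in the group
  grep -n 'switching cache buffer size' drivers/md/raid5.c    # the message to comment out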




* Re: raid5: switching cache buffer size
  2002-12-13 18:59   ` Luca Berra
@ 2002-12-13 19:17     ` Kanoalani Withington
  2002-12-13 22:29     ` Gordon Henderson
  1 sibling, 0 replies; 8+ messages in thread
From: Kanoalani Withington @ 2002-12-13 19:17 UTC (permalink / raw)
  To: Luca Berra; +Cc: Gordon Henderson, linux-raid

You might also be using a journalled filesystem with the journal on the same
volume as the data. XFS does this by default, for example; writes to the
journal are not the same block size as writes to the data volume, so the RAID
layer reports tons of messages like the ones you describe. The proper solution
(according to the XFS maintainers) is to create a small disk mirror for the
journal, separate from the data volume.
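
As a sketch, that suggestion amounts to something like the following (md4
here just stands in for a small spare mirror, and the data device name is
made up):

  mkfs.xfs -l logdev=/dev/md4 /dev/vg0/data            # put the XFS log on the small mirror
  mount -t xfs -o logdev=/dev/md4 /dev/vg0/data /data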

-Kanoa




* Re: raid5: switching cache buffer size
  2002-12-13 18:59   ` Luca Berra
  2002-12-13 19:17     ` Kanoalani Withington
@ 2002-12-13 22:29     ` Gordon Henderson
  1 sibling, 0 replies; 8+ messages in thread
From: Gordon Henderson @ 2002-12-13 22:29 UTC (permalink / raw)
  To: linux-raid


Thanks for the replies - I've also been searching the archives of both the
raid and lvm lists and found some more on it. It seems OK until I take a
snapshot, so I need to look more into the LVM stuff if I'm going to continue
using it, but I'm starting to think that LVM on raid5 might be more hassle
than it's worth. I also managed to get it to crash when using dump/restore
and (s)tar to a SCSI tape system while running an LVM snapshot. As I have
enough disk space, I'm now thinking I'd be better off with 3 raid5 partitions
and simply copying the data from the live partition onto the 'yesterday'
partition every night, rather than using the LVM snapshot facility...

(I was using ext2, by the way. I was thinking about XFS, but maybe that'd
just be too much for the system!)
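
One way to do that nightly copy, as a sketch (the rsync choice, mount points
and schedule are assumptions, not from the thread):

  # /etc/crontab entry: mirror the live partition onto 'yesterday' at 03:00
  0 3 * * * root rsync -a --delete /live/ /yesterday/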

Cheers,

Gordon


