public inbox for linux-kernel@vger.kernel.org
* oops with dual xeon 2.8ghz  4gb ram +smp, software raid, lvm, and xfs
@ 2004-11-22 19:06 Phil Dier
  2004-11-23  0:17 ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Phil Dier @ 2004-11-22 19:06 UTC (permalink / raw)
  To: Kernel Mailing List

Hi,

I'm setting up a storage array with Linux, software RAID, LVM, and XFS,
but I keep getting oopses during heavy I/O. I've been able to reproduce
this with 2.6.6, 2.6.8.1, 2.6.9, and 2.6.10-rc2-bk4. I have dual xeon
2.8s with 4gb of ram. I'm using Adaptec and Fusion MPT SCSI devices
(more details in the following link). Connected are 2 ultra160 scsi
jbods w/ 2 disks apiece. I'm using raid 10 (or should it be 01?) mirrored 
stripes.
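For the record, a mirrored-stripe layout of this shape (mirror pairs first, striped on top, i.e. RAID 1+0) could be built roughly like this with mdadm; the device names are made up:

```shell
# Sketch only -- device names are hypothetical. Build two RAID-1 mirror
# pairs (one member from each JBOD), then stripe (RAID-0) across them.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdc1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdd1
mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1
```

Striping over mirrors (1+0) tolerates one failed disk per pair; mirroring two stripes (0+1) loses all redundancy after a single disk failure, which is why 1+0 is usually the one you want.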

Due to its size, I've posted my debug info at this location (I've included
output from all of the above kernels):

<http://www.icglink.com/cluster-debug-info.html> (~235kb)

Please let me know if I've left anything out that would help in locating
the source of the problem.  I'm very willing to try out any patches/config
changes.

Please CC me on any replies, as I am not subscribed to the list...

Thanks,

--

Phil Dier (ICGLink.com -- 615 370-1530 x733)

/* vim:set noai nocindent ts=8 sw=8: */

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: oops with dual xeon 2.8ghz 4gb ram +smp, software raid, lvm, and xfs
@ 2004-11-23 20:48 Joerg Sommrey
  0 siblings, 0 replies; 28+ messages in thread
From: Joerg Sommrey @ 2004-11-23 20:48 UTC (permalink / raw)
  To: Linux kernel mailing list; +Cc: Phil Dier

>Hi,
>
>I'm setting up a storage array with Linux, software RAID, LVM, and XFS,
>but I keep getting oopses during heavy I/O. I've been able to reproduce
>this with 2.6.6, 2.6.8.1, 2.6.9, and 2.6.10-rc2-bk4. I have dual xeon
>2.8s with 4gb of ram. I'm using adaptec and a fusion mpt scsi devices
>(more details in the following link). Connected are 2 ultra160 scsi
>jbods w/ 2 disks apiece. I'm using raid 10 (or should it be 01?) mirrored 
>stripes.

This looks very interesting.  My setup is somewhat similar:
linux 2.6.9-ac8, SMP (2 x Athlon), Adaptec 2940UW + Promise SATA150 TX4,
4K stacks, software RAID, LVM and XFS.  The symptoms are different,
however.  When creating snapshots I sometimes get errors like this
(which I'm unable to reproduce):

Nov 21 03:00:48 bear kernel:  [__alloc_pages+457/912] __alloc_pages+0x1c9/0x390
Nov 21 03:00:48 bear kernel:  [check_poison_obj+47/480] check_poison_obj+0x2f/0x1e0
Nov 21 03:00:48 bear kernel:  [__get_free_pages+37/64] __get_free_pages+0x25/0x40
Nov 21 03:00:48 bear kernel:  [kmem_getpages+33/208] kmem_getpages+0x21/0xd0
Nov 21 03:00:48 bear kernel:  [dbg_redzone1+21/48] dbg_redzone1+0x15/0x30
Nov 21 03:00:48 bear kernel:  [cache_grow+176/352] cache_grow+0xb0/0x160
Nov 21 03:00:48 bear kernel:  [cache_alloc_refill+428/640] cache_alloc_refill+0x1ac/0x280
Nov 21 03:00:48 bear kernel:  [__kmalloc+188/240] __kmalloc+0xbc/0xf0
Nov 21 03:00:48 bear kernel:  [mempool_resize+156/416] mempool_resize+0x9c/0x1a0
Nov 21 03:00:48 bear kernel:  [resize_pool+100/224] resize_pool+0x64/0xe0
Nov 21 03:00:48 bear kernel:  [dm_create_persistent+40/320] dm_create_persistent+0x28/0x140
Nov 21 03:00:48 bear kernel:  [snapshot_ctr+804/912] snapshot_ctr+0x324/0x390
Nov 21 03:00:48 bear kernel:  [dm_table_add_target+262/432] dm_table_add_target+0x106/0x1b0
Nov 21 03:00:48 bear kernel:  [populate_table+130/224] populate_table+0x82/0xe0
Nov 21 03:00:48 bear kernel:  [table_load+104/320] table_load+0x68/0x140
Nov 21 03:00:48 bear kernel:  [ctl_ioctl+241/336] ctl_ioctl+0xf1/0x150
Nov 21 03:00:48 bear kernel:  [table_load+0/320] table_load+0x0/0x140
Nov 21 03:00:48 bear kernel:  [sys_ioctl+253/640] sys_ioctl+0xfd/0x280
Nov 21 03:00:48 bear kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
Nov 21 03:00:48 bear kernel: device-mapper: : Couldn't create exception store
Nov 21 03:00:48 bear kernel:
Nov 21 03:00:48 bear kernel: device-mapper: error adding target to table

When I tried the sample script (modified according to the size of the
FS) nothing happened at first.  But creating snapshots in parallel
locked up the system after a while. (Hard lockup, no diagnostic data
available.)
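For context, the kind of parallel snapshot churn that triggered the lockup can be sketched like this (the volume group, LV names, and sizes here are purely illustrative, not the actual script):

```shell
#!/bin/sh
# Hypothetical reproducer: several loops creating and removing LVM
# snapshots concurrently while the filesystem is under write load.
for i in 1 2 3 4; do
    (
        while true; do
            lvcreate --snapshot --size 512M --name snap$i /dev/vg0/data
            sleep 5
            lvremove -f /dev/vg0/snap$i
        done
    ) &
done
wait
```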

I then switched to 8K stacks. Running the sample and creating snapshots
didn't lock up the machine, but resulted in an error similar to the one
shown above:

Nov 23 20:46:33 bear kernel: lvcreate: page allocation failure. order:0, mode:0xd0
Nov 23 20:46:33 bear kernel:  [__alloc_pages+457/912] __alloc_pages+0x1c9/0x390
Nov 23 20:46:33 bear kernel:  [__get_free_pages+37/64] __get_free_pages+0x25/0x40
Nov 23 20:46:33 bear kernel:  [kmem_getpages+33/208] kmem_getpages+0x21/0xd0
Nov 23 20:46:33 bear kernel:  [cache_grow+176/352] cache_grow+0xb0/0x160
Nov 23 20:46:33 bear kernel:  [check_slabp+24/240] check_slabp+0x18/0xf0
Nov 23 20:46:33 bear kernel:  [cache_alloc_refill+428/640] cache_alloc_refill+0x1ac/0x280
Nov 23 20:46:33 bear kernel:  [dbg_redzone1+21/48] dbg_redzone1+0x15/0x30
Nov 23 20:46:33 bear kernel:  [cache_alloc_debugcheck_after+65/368] cache_alloc_debugcheck_after+0x41/0x170
Nov 23 20:46:33 bear kernel:  [kmem_cache_alloc+149/192] kmem_cache_alloc+0x95/0xc0
Nov 23 20:46:33 bear kernel:  [alloc_io+34/48] alloc_io+0x22/0x30
Nov 23 20:46:33 bear kernel:  [alloc_io+34/48] alloc_io+0x22/0x30
Nov 23 20:46:33 bear kernel:  [mempool_resize+284/416] mempool_resize+0x11c/0x1a0
Nov 23 20:46:34 bear kernel:  [resize_pool+100/224] resize_pool+0x64/0xe0
Nov 23 20:46:34 bear kernel:  [kcopyd_client_create+134/208] kcopyd_client_create+0x86/0xd0
Nov 23 20:46:34 bear kernel:  [snapshot_ctr+700/912] snapshot_ctr+0x2bc/0x390
Nov 23 20:46:34 bear kernel:  [dm_table_add_target+262/432] dm_table_add_target+0x106/0x1b0
Nov 23 20:46:34 bear kernel:  [populate_table+130/224] populate_table+0x82/0xe0
Nov 23 20:46:34 bear kernel:  [table_load+104/320] table_load+0x68/0x140
Nov 23 20:46:34 bear kernel:  [ctl_ioctl+241/336] ctl_ioctl+0xf1/0x150
Nov 23 20:46:34 bear kernel:  [table_load+0/320] table_load+0x0/0x140
Nov 23 20:46:35 bear kernel:  [sys_ioctl+253/640] sys_ioctl+0xfd/0x280
Nov 23 20:46:36 bear kernel:  [syscall_call+7/11] syscall_call+0x7/0xb

Maybe these issues are not related.  If they are, I'd be glad to help
with some additional testing.

-jo


^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: oops with dual xeon 2.8ghz  4gb ram +smp, software raid, lvm, and xfs
@ 2004-11-24  9:28 Anders Saaby
  0 siblings, 0 replies; 28+ messages in thread
From: Anders Saaby @ 2004-11-24  9:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Phil Dier, Jakob Oestergaard, Christoph Hellwig

Hi Phil,

I have some hands-on experience with this kind of setup. I am working with a 
fairly similar setup to yours:

Two (UP) Xeon servers, each with ~1TB SCSI RAID. Running XFS, exporting via NFS 
on 2.6.8.1. Serving ~18,000 home directories. - 24/7 heavy load.
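(For reference, the export itself is nothing exotic; a hypothetical /etc/exports line for such a setup would look something like the following, with a made-up path and subnet:)

```
/home  10.0.0.0/24(rw,sync)
```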

I have seen quite a lot of oopses on these servers, but now have a stable 
setup.

Here's the highlights:

< 2.6.8.1 (+/- various patches): XFS b0rks after a short period of heavy load 
(tried several different setups and patches from SGI).
2.6.8.1 (+/- various patches): SMP+XFS+NFS oopses after ~1 hour under heavy 
load.
2.6.8.1 (without patches): UP+XFS+NFS has now been running stable for 56 days, 
06h 24m. :)

I haven't tried 2.6.9 on these servers yet because of the stale filehandle 
issue, and I have no urge to break a stable setup. I am not 100% sure whether 
the issue with weird changes to files that Jakob talked about was introduced 
in 2.6.9, but I'm not seeing it on my setup.

- So bottom line - 2.6.8.1 running on a single-CPU machine does the trick for 
me. (And who needs a lot of CPU powah on an NFS server? :))

Regarding ext3... This filesystem also seems to be b0rked on at least the 
newer 2.6.x kernels. We have some mail servers which, until two days ago, 
oopsed on ext3. These now run XFS and the errors seem to be gone. - Don't get 
me wrong here, I have never seen ext3 oops on a low-load server - only under 
heavy load (and SMP).

Snip of one of the ext3 oopses (you will see several people here on LKML 
reporting the same/similar problem):
<SNIP>
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
c018b2f5
*pde = 00000000
Oops: 0002 [#1]
SMP
Modules linked in: nfs e1000 iptable_nat rtc
CPU:    2
EIP:    0060:[<c018b2f5>]    Not tainted VLI
EFLAGS: 00010286   (2.6.9)
EIP is at journal_commit_transaction+0x545/0x11b0
eax: d971826c   ebx: 00000000   ecx: e489eefc   edx: 00000014
esi: d971826c   edi: f7406000   ebp: ea0a6f80   esp: f7407d8c
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 177, threadinfo=f7406000 task=f7df63b0)
Stack: 03afe6b2 c2157478 f7407e40 f7406000 c2157414 00000000 00000000 00000000
       00000000 00000000 e489ebfc cd61056c 000010e8 01c2bf60 c040e020 00000000
       f7406000 0000001e f7407e1c c0412f80 00000008 f7407e5c c01134e3 f7407e1c
Call Trace:
 [<c01134e3>] find_busiest_group+0xf3/0x300
 [<c0113799>] find_busiest_queue+0xa9/0xd0
 [<c0115620>] autoremove_wake_function+0x0/0x40
 [<c0115620>] autoremove_wake_function+0x0/0x40
 [<c018e0e1>] kjournald+0xc1/0x230
 [<c0115620>] autoremove_wake_function+0x0/0x40
 [<c0112ba3>] finish_task_switch+0x33/0x70
 [<c0115620>] autoremove_wake_function+0x0/0x40
 [<c0103ff6>] ret_from_fork+0x6/0x14
 [<c018e000>] commit_timeout+0x0/0x10
 [<c018e020>] kjournald+0x0/0x230
 [<c010253d>] kernel_thread_helper+0x5/0x18
Code: 00 89 f0 e8 5e e1 17 00 83 c4 14 8b 45 18 85 c0 0f 84 49 01 00 00 bf 00 
e0 ff ff 21 e7 89 f6 8d bc 27 00 00 00 00 8b 70 20 8b 1e <f0> ff 43 0c 8b 03 
83 e0 04 74 4e 8b 94 24
 e8 01 00 00 8d 82 c0
</SNIP>

Phil Dier wrote:

> 
> Thanks for the tips, Jakob.
> 
> I *will* be exporting via NFS, so this is definitely good to know. I've
> been looking at using jfs and reiser as well, but some preliminary
> benchmarks suggested that xfs was the best performer for the kind of
> workload that I'm anticipating. I guess xfs is out of the question now,
> as I definitely don't want to deal with weird interactions like that.
> 
> Can anyone speak on the stability of (reiser|jfs|other) with nfs? My
> biggest requirements are online resizing and stability (ext3 online
> resize is still beta IIRC, but I wouldn't be opposed to using it if
> someone could tell me otherwise); speed would be nice, but I'm willing
> to sacrifice speed for the sake of reliability.
> 
> I'm personally using lvm + reiser + nfs without consequence on my
> fileserver at home, but it's not seeing nearly the loads that this box
> is going to see.
> 
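On the online-resizing point above: growing an LVM-backed XFS filesystem while it stays mounted is supported (XFS can grow but not shrink). A rough sketch, with made-up VG/LV names and sizes:

```shell
# Hypothetical names/sizes. Grow the logical volume first, then
# tell XFS to expand into the new space -- all while mounted.
lvextend -L +50G /dev/vg0/home
xfs_growfs /home          # argument is the mount point, not the device
```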

-- 
Med venlig hilsen - Best regards - Meilleures salutations

Anders Saaby
Systems Engineer
------------------------------------------------
Cohaesio A/S - Maglebjergvej 5D - DK-2800 Lyngby
Phone: +45 45 880 888 - Fax: +45 45 880 777
Mail: as@cohaesio.com - http://www.cohaesio.com
------------------------------------------------

^ permalink raw reply	[flat|nested] 28+ messages in thread
[parent not found: <33rTj-1VZ-13@gated-at.bofh.it>]

end of thread, other threads:[~2004-12-09  3:51 UTC | newest]

Thread overview: 28+ messages
2004-11-22 19:06 oops with dual xeon 2.8ghz 4gb ram +smp, software raid, lvm, and xfs Phil Dier
2004-11-23  0:17 ` Andrew Morton
2004-11-23 15:37   ` Phil Dier
2004-11-23 17:02     ` Jakob Oestergaard
2004-11-23 18:29       ` Phil Dier
2004-11-23 22:39       ` Christoph Hellwig
2004-11-23 22:56         ` Jakob Oestergaard
2004-11-23 23:12           ` Christoph Hellwig
2004-11-30 17:37         ` Phil Dier
2004-11-24 15:45   ` Phil Dier
2004-11-24 16:56     ` Christoph Hellwig
2004-11-24 23:12     ` Andrew Morton
2004-11-25  0:48       ` Phil Dier
2004-11-28 11:29       ` David Greaves
2004-11-28 18:27         ` Andrew Morton
2004-12-08  9:03           ` David Greaves
2004-12-08  9:15             ` Andrew Morton
2004-12-09  3:50               ` Nigel Cunningham
2004-11-24 23:12   ` Neil Brown
2004-11-24 23:50     ` Andrew Morton
2004-11-25  0:14       ` Neil Brown
2004-11-25  1:05         ` Andrew Morton
2004-11-25  6:57         ` Jens Axboe
2004-11-25  7:08           ` Andrew Morton
2004-11-25  7:11             ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2004-11-23 20:48 Joerg Sommrey
2004-11-24  9:28 Anders Saaby
     [not found] <33rTj-1VZ-13@gated-at.bofh.it>
     [not found] ` <33wJq-633-25@gated-at.bofh.it>
     [not found]   ` <34fwL-P1-21@gated-at.bofh.it>
     [not found]     ` <34fGp-V2-9@gated-at.bofh.it>
     [not found]       ` <34fGp-V2-7@gated-at.bofh.it>
     [not found]         ` <34lVr-5WH-1@gated-at.bofh.it>
     [not found]           ` <34m5a-61Z-9@gated-at.bofh.it>
2004-11-25 11:07             ` Andi Kleen
