* 2.4.20 instability on bigmem systems?
@ 2003-03-14 0:27 Gregory K. Ruiz-Ade
2003-03-14 0:42 ` William Lee Irwin III
2003-03-14 18:31 ` Martin J. Bligh
0 siblings, 2 replies; 13+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-14 0:27 UTC (permalink / raw)
To: linux-kernel
This started out as a question about random crashes on 2.4.19, but in the
course of trying to figure out what was going on, I got to experience
first-hand the incredibly poor performance of 2.4.x (notably 2.4.20) on a
system with >4GB of memory (8GB).
So, I'm looking for a solution, preferably a set of patches, to help with
the problems described below.
On Monday, I installed a 2.4.20 kernel on our Dell PowerEdge 6600. This
machine is configured as follows:
4x 1.6GHz Xeon CPUs
8GB RAM
Built-in:
ATI Rage 128 graphics
dual Broadcom Gigabit Ethernet
serial/parallel/usb
Adaptec SCSI
Dell PERC3/DC (AMI/LSI MegaRAID) dual-channel
The kernel was built from the linux-2.4.20.tar.bz2 from kernel.org, and
patched with only the lvm-1.0.7 and linux-2.4.20 VFS locking patches from
Sistina's LVM-1.0.7 package.
The primary problem: Whenever any process (or set of processes) initiates
intensive disk I/O, the system grinds to a halt, kswapd and kupdated
consume upwards of 40% to 60% CPU each, and system load averages can jump
upwards of 21.00. The problem can be replicated with a simple find command
("find / -print" seems to do it nicely).
I have had two rather painful nights dealing with this (Monday and Tuesday
nights). Luckily, I have a serial null-modem cable rigged up between the
troubled server and another server, and was able to capture all the info
from the Magic Sysrq commands that I could.
Full details are at http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.20
I've included the kernel config, the kernel and initrd images, the system
map file, output from "ps auxfww" and a couple screen scrapings from top,
and captures from magic sysrq commands from both crashes.
I had problems like this with 2.4.19, and was directed to apply a patch to
inode.c, which appears to be part of a patch set for 2.4.19pre9aa2. I've
archived it at:
http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.19/patches/10_inode-highmem-2
For 2.4.19, this solves _most_ of the stability issues, but I still have to
work with the LVM people and possibly whoever is responsible for the VM in
2.4.19/2.4.20 to track down some kernel oopses (possibly a separate
problem).
I will happily provide whatever other information is needed, though my
opportunities to test things on the machine in question are limited by the
fact that it's a production server.
Thanks in advance,
Gregory
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 0:27 2.4.20 instability on bigmem systems? Gregory K. Ruiz-Ade
@ 2003-03-14 0:42 ` William Lee Irwin III
2003-03-14 1:45 ` Gregory K. Ruiz-Ade
2003-03-14 18:31 ` Martin J. Bligh
1 sibling, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2003-03-14 0:42 UTC (permalink / raw)
To: Gregory K. Ruiz-Ade; +Cc: linux-kernel
On Thu, Mar 13, 2003 at 04:27:22PM -0800, Gregory K. Ruiz-Ade wrote:
> The primary problem: Whenever any process (or set of processes) initiates
> intensive disk I/O, the system grinds to a halt, kswapd and kupdated
> consume upwards of 40% to 60% CPU each, and system load averages can jump
> upwards of 21.00. The problem can be replicated with a simple find command
> ("find / -print" seems to do it nicely).
> I have had two rather painful nights dealing with this (Monday and Tuesday
> nights). Luckily, I have a serial null-modem cable rigged up between the
> troubled server and another server, and was able to capture all the info
> from the Magic Sysrq commands that I could.
> Full details are at http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.20
Hmm, slabinfo would be very helpful, as well as meminfo.
On Thu, Mar 13, 2003 at 04:27:22PM -0800, Gregory K. Ruiz-Ade wrote:
> I've included the kernel config, the kernel and initrd images, the system
> map file, output from "ps auxfww" and a couple screen scrapings from top,
> and captures from magic sysrq commands from both crashes.
> I had problems like this with 2.4.19, and was directed to apply a patch to
> inode.c, which appears to be part of a patch set for 2.4.19pre9aa2. I've
> archived it at:
> http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.19/patches/10_inode-highmem-2
> For 2.4.19, this solves _most_ of the stability issues, but I still have to
> work with the LVM people and possibly whoever is responsible for the VM in
> 2.4.19/2.4.20 to track down some kernel oopses (possibly a separate
> problem).
> I will happily provide whatever other information is needed, though my
> opportunities to test things on the machine in question are limited by the
> fact that it's a production server.
You might need bh stuff (memclass-related or something like it) if it's
general disk io. Can't be too sure until slabinfo + meminfo materialize.
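For readers following along, here is a minimal sketch of how the requested meminfo data might be summarised. It assumes the kernel reports the LowTotal, LowFree, HighTotal and HighFree fields in /proc/meminfo (highmem-enabled 2.4 kernels do); the function names parse_meminfo and zone_free_percent are hypothetical and not from the thread.

```python
def parse_meminfo(text):
    """Return {field: kilobytes} for the 'Name:  value kB' lines of meminfo text."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines shaped like "LowFree:  12345 kB"; this skips the
        # byte-count summary rows that 2.4 kernels print at the top.
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            if parts[1].isdigit():
                fields[parts[0][:-1]] = int(parts[1])
    return fields


def zone_free_percent(fields):
    """Percentage of low and high memory that is free, for zones that are reported."""
    result = {}
    for zone in ("Low", "High"):
        total = fields.get(zone + "Total", 0)
        if total > 0:
            result[zone] = 100.0 * fields.get(zone + "Free", 0) / total
    return result
```

Feeding it the contents of /proc/meminfo yields a small dictionary such as {"Low": ..., "High": ...}. On kernels that do not report the Low*/High* fields the result is simply empty.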
-- wli
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 0:42 ` William Lee Irwin III
@ 2003-03-14 1:45 ` Gregory K. Ruiz-Ade
2003-03-14 1:53 ` William Lee Irwin III
0 siblings, 1 reply; 13+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-14 1:45 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel
On Thursday 13 March 2003 16:42, William Lee Irwin III wrote:
> > Full details are at
> > http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.20
>
> Hmm, slabinfo would be very helpful, as well as meminfo.
I'll have to schedule a reboot into that kernel, but I'll try to get it
tonight if at all possible.
> You might need bh stuff (memclass-related or something like it) if it's
> general disk io. Can't be too sure until slabinfo + meminfo materialize.
I'm not familiar with "bh"... where can I read up on what it is?
Thanks again,
Gregory
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 1:45 ` Gregory K. Ruiz-Ade
@ 2003-03-14 1:53 ` William Lee Irwin III
2003-03-14 3:55 ` Gregory K. Ruiz-Ade
0 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2003-03-14 1:53 UTC (permalink / raw)
To: Gregory K. Ruiz-Ade; +Cc: linux-kernel
On Thursday 13 March 2003 16:42, William Lee Irwin III wrote:
>> Hmm, slabinfo would be very helpful, as well as meminfo.
On Thu, Mar 13, 2003 at 05:45:28PM -0800, Gregory K. Ruiz-Ade wrote:
> I'll have to schedule a reboot into that kernel, but I'll try to get it
> tonight if at all possible.
That's fine; I'm not in a hurry. =)
On Thursday 13 March 2003 16:42, William Lee Irwin III wrote:
>> You might need bh stuff (memclass-related or something like it) if it's
>> general disk io. Can't be too sure until slabinfo + meminfo materialize.
On Thu, Mar 13, 2003 at 05:45:28PM -0800, Gregory K. Ruiz-Ade wrote:
> I'm not familiar with "bh"... where can I read up on what it is?
bh == buffer_head. fs/buffer.c has the code.
-- wli
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 1:53 ` William Lee Irwin III
@ 2003-03-14 3:55 ` Gregory K. Ruiz-Ade
2003-03-14 4:13 ` William Lee Irwin III
0 siblings, 1 reply; 13+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-14 3:55 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel
On Thursday 13 March 2003 17:53, William Lee Irwin III wrote:
> On Thursday 13 March 2003 16:42, William Lee Irwin III wrote:
> >> Hmm, slabinfo would be very helpful, as well as meminfo.
>
> On Thu, Mar 13, 2003 at 05:45:28PM -0800, Gregory K. Ruiz-Ade wrote:
> > I'll have to schedule a reboot into that kernel, but I'll try to get it
> > tonight if at all possible.
>
> That's fine; I'm not in a hurry. =)
I got an opportunity, so the contents of cpuinfo, slabinfo, and meminfo are
at: http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.20
Hopefully they're useful.
Additionally, I managed to generate an Oops (processed with ksymoops, too)
while trying to create an LVM snapshot. The oops and its
ksymoops-translated file are up there too. The oops traces back into the
VM somewhere.
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 3:55 ` Gregory K. Ruiz-Ade
@ 2003-03-14 4:13 ` William Lee Irwin III
2003-03-14 17:31 ` Gregory K. Ruiz-Ade
0 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2003-03-14 4:13 UTC (permalink / raw)
To: Gregory K. Ruiz-Ade; +Cc: linux-kernel
On Thu, Mar 13, 2003 at 07:55:27PM -0800, Gregory K. Ruiz-Ade wrote:
> I got an opportunity, so the contents of cpuinfo, slabinfo, and meminfo are
> at: http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.20
> Hopefully they're useful.
> Additionally, I managed to generate an Oops (processed with ksymoops, too)
> while trying to create an LVM snapshot. The oops and its
> ksymoops-translated file are up there too. The oops traces back into the
> VM somewhere.
Hmm, neither slabinfo nor meminfo show the machine being under any stress.
Were they generated while the problem was happening?
The useful information would be meminfo and slabinfo collected while
kswapd and kupdated are spinning. Also, cpuinfo doesn't ever change
(at least while being run on the same box), so you can leave that out.
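As an illustration of what one might look for in a slabinfo capture, the sketch below ranks slab caches by approximate size. It assumes the version 1.x /proc/slabinfo column order (cache name, active objects, total objects, object size), and it ignores per-slab overhead, so the byte counts are estimates. The names parse_slabinfo and largest_caches are hypothetical.

```python
def parse_slabinfo(text):
    """Parse slabinfo text into a list of (cache_name, approx_bytes) pairs.

    Assumed columns: name, active objects, total objects, object size.
    """
    caches = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the version banner, comment/header rows and short lines.
        if len(parts) < 4 or line.startswith(("slabinfo", "#")):
            continue
        try:
            total_objs, obj_size = int(parts[2]), int(parts[3])
        except ValueError:
            continue
        caches.append((parts[0], total_objs * obj_size))
    return caches


def largest_caches(text, count=5):
    """Return the `count` caches holding the most object memory, largest first."""
    ranked = sorted(parse_slabinfo(text), key=lambda cache: cache[1], reverse=True)
    return ranked[:count]
```

Comparing the top few entries between an idle capture and one taken while the machine is struggling is one way to see which cache is growing.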
BTW, oopses tracing back into the VM doesn't help. It's usually someone
doing something wrong the VM checks for. In this case I'll bet someone
(i.e. LVM) called vmalloc() with interrupts off.
-- wli
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 4:13 ` William Lee Irwin III
@ 2003-03-14 17:31 ` Gregory K. Ruiz-Ade
2003-03-14 20:08 ` William Lee Irwin III
0 siblings, 1 reply; 13+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-14 17:31 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel
On Thursday 13 March 2003 20:13, William Lee Irwin III wrote:
> Hmm, neither slabinfo nor meminfo show the machine being under any
> stress. Were they generated while the problem was happening?
>
> The useful information would be meminfo and slabinfo collected while
> kswapd and kupdated are spinning. Also, cpuinfo doesn't ever change
> (at least while being run on the same box), so you can leave that out.
Ahh. I was a bit out of it yesterday, and didn't think to actually stress
the machine. :\
I'll be able to give it a good beating this weekend sometime.
> BTW, oopses tracing back into the VM doesn't help. It's usually someone
> doing something wrong the VM checks for. In this case I'll bet someone
> (i.e. LVM) called vmalloc() with interrupts off.
Hmm... Okay, mind if I quote you when I post that oops to the lvm list? :)
Thanks again,
Gregory
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 17:31 ` Gregory K. Ruiz-Ade
@ 2003-03-14 20:08 ` William Lee Irwin III
2003-03-17 2:15 ` Gregory K. Ruiz-Ade
0 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2003-03-14 20:08 UTC (permalink / raw)
To: Gregory K. Ruiz-Ade; +Cc: linux-kernel
On Thursday 13 March 2003 20:13, William Lee Irwin III wrote:
>> Hmm, neither slabinfo nor meminfo show the machine being under any
>> stress. Were they generated while the problem was happening?
>> The useful information would be meminfo and slabinfo collected while
>> kswapd and kupdated are spinning. Also, cpuinfo doesn't ever change
>> (at least while being run on the same box), so you can leave that out.
On Fri, Mar 14, 2003 at 09:31:15AM -0800, Gregory K. Ruiz-Ade wrote:
> Ahh. I was a bit out of it yesterday, and didn't think to actually stress
> the machine. :\
> I'll be able to give it a good beating this weekend sometime.
cc: me when you post those results.
On Thursday 13 March 2003 20:13, William Lee Irwin III wrote:
>> BTW, oopses tracing back into the VM doesn't help. It's usually someone
>> doing something wrong the VM checks for. In this case I'll bet someone
>> (i.e. LVM) called vmalloc() with interrupts off.
On Fri, Mar 14, 2003 at 09:31:15AM -0800, Gregory K. Ruiz-Ade wrote:
> Hmm... Okay, mind if I quote you when I post that oops to the lvm list? :)
Understand that was said in the context of finding the VM bug. I'm not
interested in LVM bugs, legitimate though they may be, mostly b/c it's
not my project and I can't save the world.
-- wli
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 20:08 ` William Lee Irwin III
@ 2003-03-17 2:15 ` Gregory K. Ruiz-Ade
2003-03-17 2:26 ` William Lee Irwin III
0 siblings, 1 reply; 13+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-17 2:15 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel
On Friday 14 March 2003 12:08, William Lee Irwin III wrote:
> On Fri, Mar 14, 2003 at 09:31:15AM -0800, Gregory K. Ruiz-Ade wrote:
> > Ahh. I was a bit out of it yesterday, and didn't think to actually
> > stress the machine. :\
> > I'll be able to give it a good beating this weekend sometime.
>
> cc: me when you post those results.
Okay, I tried to load the system a bit and stress out the disk I/O, running
a couple finds across the whole system (find | xargs stat,
find | xargs cat > /dev/null, a couple other things) after sucking up free
memory by catting our database disk files to /dev/null. I also had a
'make -j5 clean oldconfig dep bzImage modules' running to try to drive the
load up a bit, too.
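The metadata-heavy part of that load (the "find | xargs stat" pass) can be approximated with a short script. This is only a sketch of the idea, not the commands actually run; stat_tree is a hypothetical name.

```python
import os


def stat_tree(root):
    """Walk `root` and lstat() every directory entry; return how many were statted."""
    count = 0
    # Unreadable directories are skipped silently, much as find would
    # carry on after printing a permission error.
    for dirpath, dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # entry vanished or is inaccessible
    return count
```

Calling stat_tree("/") touches the inode of every reachable file without reading file data, so it exercises the inode and dentry caches rather than the page cache.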
I've got snapshots of meminfo, slabinfo, and output from 'ps auxfww' at:
http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.20/loadtest/
It only really starts getting interesting after 20030316.1725, when I
started the kernel build. I have a very simple shell script that basically
does nothing other than "make clean oldconfig dep && make -j5 bzImage &&
make -j5 modules". I ran that a couple times in Red Hat's 2.4.9-e.12
kernel sources.
Surprisingly I wasn't able to grind down the system like I expected. Not
sure why it's behaving so wonderfully today.
It crashed again on Friday night, running 2.4.19. The only information I
was able to get was a kernel BUG message on the serial console (I
ksymoops'ed it after rebooting). From what I could tell after the fact,
nothing was really running. Several scripts got fired off by cron, which
check various things (mainly to make sure certain services are still
running), and around then is when the system locked up. The info I have
for that crash is at:
http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.19
As it is, I'm going to try running on Red Hat's 2.4.9-e.12 sources this
week. I'm compiling the kernel right now, and will be rebooting into it
shortly.
This has been quite the week of headaches for me.
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: 2.4.20 instability on bigmem systems?
2003-03-17 2:15 ` Gregory K. Ruiz-Ade
@ 2003-03-17 2:26 ` William Lee Irwin III
2003-03-17 4:59 ` Gregory K. Ruiz-Ade
0 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2003-03-17 2:26 UTC (permalink / raw)
To: Gregory K. Ruiz-Ade; +Cc: linux-kernel
On Sun, Mar 16, 2003 at 06:15:11PM -0800, Gregory K. Ruiz-Ade wrote:
> Okay, I tried to load the system a bit and stress out the disk I/O, running
> a couple finds across the whole system (find | xargs stat, find | xargs cat
> > /dev/null, a couple other things) after sucking up free memory by catting
> our database disk files to /dev/null. I also had a 'make -j5 clean
> oldconfig dep bzImage modules' running to try to drive the load up a bit,
> too.
> I've got snapshots of meminfo, slabinfo, and output from 'ps auxfww' at:
> http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.20/loadtest/
> It only really starts getting interesting after 20030316.1725, when I
> started the kernel build. I have a very simple shell script that basically
> does nothing other than "make clean oldconfig dep && make -j5 bzImage &&
> make -j5 modules". I ran that a couple times in the sources for Red Hat's
> 2.4.9-e.12 kernel sources.
> Surprisingly I wasn't able to grind down the system like I expected. Not
> sure why it's behaving so wonderfully today.
If it didn't behave badly then it won't help to look at the stats.
-- wli
* Re: 2.4.20 instability on bigmem systems?
2003-03-17 2:26 ` William Lee Irwin III
@ 2003-03-17 4:59 ` Gregory K. Ruiz-Ade
2003-03-17 5:38 ` Gregory K. Ruiz-Ade
0 siblings, 1 reply; 13+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-17 4:59 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel
On Sunday 16 March 2003 18:26, William Lee Irwin III wrote:
> If it didn't behave badly then it won't help to look at the stats.
I'm seriously at my wits' end. (Not because of you! I hate vague
problems...) I have no idea why it behaved itself this time, just like I
have no idea why it misbehaves.
Right now, I'm back to running 2.4.19 with the inode.c patch from one of the
2.4.19-preXX-aaY kernels (see
http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.19/patches) as the most
stable thing we've gotten so far.
I'm going to write some scripts to run out of cron every 5 minutes (or maybe
even every minute) to collect meminfo, slabinfo, ps output, and whatever
else I can think of. What else would be useful to help you track down
these problems?
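A collector along those lines might look like the sketch below. It is an illustration under assumptions, not the script that was eventually written: the snapshot function, its parameters and the destination layout are hypothetical, and the directory names simply mirror the 20030316.1725 style used for the earlier captures. Capturing ps output is left out for brevity.

```python
import os
import shutil
import time


def snapshot(sources, dest_dir, now=None):
    """Copy each readable file in `sources` into dest_dir/<YYYYMMDD.HHMM>/.

    Returns the directory the copies were written to. `now` is a Unix
    timestamp; None means the current time.
    """
    stamp = time.strftime("%Y%m%d.%H%M", time.localtime(now))
    target = os.path.join(dest_dir, stamp)
    os.makedirs(target, exist_ok=True)
    for src in sources:
        try:
            # Stream the contents: /proc files report a size of zero,
            # so read until EOF rather than trusting the size.
            with open(src, "rb") as fin:
                with open(os.path.join(target, os.path.basename(src)), "wb") as fout:
                    shutil.copyfileobj(fin, fout)
        except OSError:
            continue  # missing or unreadable source: skip it
    return target
```

Run from cron every few minutes with something like snapshot(["/proc/meminfo", "/proc/slabinfo"], "/var/tmp/vmstats"), where the destination path is a placeholder. That leaves a timestamped trail, so the captures closest to a lockup are already on disk when the machine comes back.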
Hopefully, the next time the system goes to hell, I'll have _something_ to
give you.
As a side question, is bigmem >2GB? I.e., if I pass "mem=2048m" to the
kernel from lilo, will the bigmem stuff for the VM be disabled, or should I
instead build a new kernel with high memory support turned off? Also, with
highmem support turned off, the max memory is 2GB, right? I may well just
ignore the high 6GB out of desperation to get a stable system until 2.6 is
released.
Thanks again for taking the time.
Working on a migraine,
Gregory
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: 2.4.20 instability on bigmem systems?
2003-03-17 4:59 ` Gregory K. Ruiz-Ade
@ 2003-03-17 5:38 ` Gregory K. Ruiz-Ade
0 siblings, 0 replies; 13+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-17 5:38 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: linux-kernel
On Sunday 16 March 2003 20:59, Gregory K. Ruiz-Ade wrote:
> Right now, I'm back to running 2.4.19 with the inode.c patch from one of
> the 2.4.19-preXX-aaY kernels (see
> http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.19/patches) as the
> most stable thing we've gotten so far.
Could you possibly take a look at the kernel BUG that I got Friday night on
the 2.4.19 kernel, to see if it (oh god please) hopefully points in a
useful direction?
http://castandcrew.com/~gregory/lkmlstuff/burpr/2.4.19
The files are:
burpr-kernel-bug.20030314.2100 (the raw bug)
burpr-kernel-bug.20030314.2100.ksymoops (processed through ksymoops)
FYI, the amaivschk program listed in the BUG is a shell script, and also at
that URL if you're curious as to what it does.
Thanks again...
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: 2.4.20 instability on bigmem systems?
2003-03-14 0:27 2.4.20 instability on bigmem systems? Gregory K. Ruiz-Ade
2003-03-14 0:42 ` William Lee Irwin III
@ 2003-03-14 18:31 ` Martin J. Bligh
1 sibling, 0 replies; 13+ messages in thread
From: Martin J. Bligh @ 2003-03-14 18:31 UTC (permalink / raw)
To: Gregory K. Ruiz-Ade, linux-kernel
> The primary problem: Whenever any process (or set of processes) initiates
> intensive disk I/O, the system grinds to a halt, kswapd and kupdated
> consume upwards of 40% to 60% CPU each, and system load averages can jump
> upwards of 21.00. The problem can be replicated with a simple find command
> ("find / -print" seems to do it nicely).
Well known set of problems.
2.4 vm sucks on big machines. Run 2.5, 2.4-aa, or UL.
Yes, you can spend a few weeks beating your head against a brick wall
gathering various bugfixes if you like ... but you'll probably just end
up with a sore head ...
M.