public inbox for linux-kernel@vger.kernel.org
* system lockup issues w/ 2.4.19
@ 2003-03-05 21:10 Gregory K. Ruiz-Ade
From: Gregory K. Ruiz-Ade @ 2003-03-05 21:10 UTC (permalink / raw)
  To: linux-kernel

I was wondering if anyone here might be able to point me in the right 
direction for a solution to this problem.

About once a month or so (not very regular), one or the other of our Dell 
PowerEdge servers goes catatonic.  Examining the system in this state, it 
seems to exhibit the symptoms of a full process table, in that no new 
processes can be started at all.  This results in a system that has a 
completely unresponsive console, services which answer to new connections 
but never do anything, and general frustration as the only sign of life at 
all is that it'll respond to pings.

Investigation after a power cycle of the catatonic machine reveals that 
processes that had been running at the time of the event (whatever the 
event is) either kept running (in the case of services, like syslog) or ran 
to normal completion (i.e., reports or other jobs that had been started 
earlier).  As I said, everything I've gleaned from these systems when 
they've gone catatonic suggests a full process table, but I have no proof 
of it.
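
A rough sketch of how that proof could be collected ahead of the next 
event: sample the number of processes and compare it with the kernel's 
task limit.  This assumes a Python 3 interpreter on the box, that numeric 
entries under /proc correspond to processes, and that the limit is 
readable from /proc/sys/kernel/threads-max; the function names are made up 
for illustration and the count is only an approximation of what the kernel 
tracks internally.

```python
import os


def count_processes(proc_root="/proc"):
    """Count numeric entries under proc_root; each one is a process ID."""
    return sum(1 for name in os.listdir(proc_root) if name.isdigit())


def read_threads_max(proc_root="/proc"):
    """Read the kernel-wide task limit (assumed path: sys/kernel/threads-max)."""
    path = os.path.join(proc_root, "sys", "kernel", "threads-max")
    with open(path) as f:
        return int(f.read().strip())


def process_table_usage(proc_root="/proc"):
    """Return (processes seen, task limit, fraction of the limit in use)."""
    used = count_processes(proc_root)
    limit = read_threads_max(proc_root)
    return used, limit, used / limit
```

Calling process_table_usage() from cron every minute or so and appending 
the result to a file on local disk would show, after a power cycle, 
whether the count was climbing toward the limit before the lockup.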

One of these servers (where it's more critical that this not happen) is a 
Dell PowerEdge 6600, 4x 1.6GHz Xeon, 8GB ram, dual Broadcom GigE NICs, 
PERC3/DC (Megaraid) Raid controller.  This system is running SuSE Linux 
Enterprise Server 7 (essentially SuSE 7.2 Pro) with kernel 2.4.19 (from 
kernel.org) patched with LVM 1.0.5 (sistina) and "10_inode-highmem-2", a 
patch recommended to me way back when I was trying to take care of some 
LVM/VM issues with bigmem support (which are still unresolved).  This is a 
production machine, and as such my ability to load new kernels and do 
testing is limited, but I can get ahold of it on the weekends.

The other server is our development machine, a Dell PowerEdge 4600, 2x 
2.4GHz Xeon, 2GB ram, e100 and Broadcom GigE NICs, aacraid.  This system is 
also running SuSE Linux Enterprise Server 7, kernel 2.4.19 + LVM 1.0.5, but 
without the highmem patch.

At this point, I'm looking for any options in resolving this issue.  I'm 
going to be contacting SuSE to open a support ticket, to see if they can 
help.  I'm preparing a 2.4.20 kernel + LVM 1.0.7 which I can hopefully 
install tonight on both machines.  I'm trying to get my hands on one of Red 
Hat's 2.4.20 "bigmem" kernels to see what they include and if maybe one of 
RH's kernels might do the job...  Any other suggestions would be most 
appreciated.

TIA,
Gregory

-- 
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.



* Re: system lockup issues w/ 2.4.19
       [not found] <mailman.1046898841.30893.linux-kernel2news@redhat.com>
@ 2003-03-05 23:42 ` Pete Zaitcev
From: Pete Zaitcev @ 2003-03-05 23:42 UTC (permalink / raw)
  To: Gregory K. Ruiz-Ade; +Cc: linux-kernel

> About once a month or so (not very regular), one or the other of our Dell 
> PowerEdge servers goes catatonic.  Examining the system in this state, it 
> seems to exhibit the symptoms of a full process table, in that no new 
> processes can be started at all.

It is essential that you explain how you did the examining,
with relevant shell traces/snapshots, etc. If they are too
long, upload them somewhere. And use of stock kernels goes
without saying, or you have to go to your vendor (SuSE).

-- Pete


* Re: system lockup issues w/ 2.4.19
       [not found] <13694.1047106361@ocs3.intra.ocs.com.au>
@ 2003-03-10 18:12 ` Gregory K. Ruiz-Ade
From: Gregory K. Ruiz-Ade @ 2003-03-10 18:12 UTC (permalink / raw)
  To: Keith Owens; +Cc: linux-kernel

On Friday 07 March 2003 22:52, Keith Owens wrote:
> Those symptoms do not necessarily mean a full process table.  You get
> exactly those symptoms if some code has grabbed a spin lock related to
> process creation and not released it.

Hmm... Well, it happened again on Friday night, and poring over the 
syslogs, I see that sendmail started refusing mail due to a load average of 
18 and then 19...  This system, even under its heaviest use, never breaks a 
system load average of 4-5.  Would a "stuck" spinlock result in an 
artificial inflation of system load averages (as a symptom)?
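
As a rough model of why that might be (an assumption about how the load 
average behaves in general, not something measured on these machines): the 
figure is an exponentially damped average of the number of tasks that are 
runnable or in uninterruptible sleep, sampled every 5 seconds.  Tasks 
spinning on a lock stay runnable, so they would be counted.  The names and 
sample values below are made up for illustration.

```python
import math

SAMPLE_INTERVAL = 5.0   # seconds between samples of the active task count
WINDOW = 60.0           # time constant of the 1-minute load average


def next_load(load, active, interval=SAMPLE_INTERVAL, window=WINDOW):
    """Fold one sample of the active task count into the damped average."""
    decay = math.exp(-interval / window)
    return load * decay + active * (1.0 - decay)


def simulate(load, active, seconds):
    """Hold the active task count constant for `seconds` and return the load."""
    for _ in range(int(seconds // SAMPLE_INTERVAL)):
        load = next_load(load, active)
    return load
```

Under this model, simulate(4.0, 19, 600) ends up very close to 19: if 19 
tasks pile up behind a lock and never leave the run queue, the reported 
load converges on 19 within a few minutes even though no useful work is 
being done.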

> You need kernel debugging features to find out which lock is the
> problem.  Booting with nmi_watchdog and a serial console (see
> linux/Documentation) will often tell you what has hung.

I'm building a 2.4.20 kernel using sources from kernel.org, and turning on 
the following options:

-->8--[Cut Here (.config)]-->8--
#
# Kernel hacking
#
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_HIGHMEM=y
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_IOVIRT is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_FRAME_POINTER=y
-->8--[Cut Here (.config)]-->8--

Should I enable the other two (CONFIG_DEBUG_SLAB and CONFIG_DEBUG_IOVIRT), 
as well?

Also, I've wired up the serial console on this machine to another machine 
(that's much more stable) so that I can access it remotely... should I try 
to set something up that simply monitors the serial console constantly and 
logs it to a file, or will I be able to get the info I need via a program 
like minicom by poking the kernel after the fact?
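
For the constant-logging option, a minimal sketch of what I have in mind, 
assuming Python 3 on the monitoring machine and that the serial line 
settings (speed, parity) have already been configured separately, e.g. 
with stty.  The function name and the device path in the usage note are 
placeholders, and this has not been tried against real hardware.

```python
import time


def log_console(src, dst, clock=time.time):
    """Copy lines from src to dst, prefixing each with a local timestamp.

    src is any binary file-like object that yields lines; dst is a text
    file-like object.  Returns the number of lines written.
    """
    count = 0
    for raw in src:
        # Console output is normally ASCII; replace anything undecodable.
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        dst.write(f"{stamp} {line}\n")
        dst.flush()  # push each line out so the log stays current
        count += 1
    return count
```

It would be run as something like 
log_console(open("/dev/ttyS0", "rb"), open("console.log", "a")), so that 
whatever the kernel prints before the hang is already on disk with a 
timestamp, rather than depending on someone attaching minicom afterwards.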

> The kdb patch (ftp://oss.sgi.com/projects/kdb/download/v3.0) will let
> you print the state of each process and find out where they are
> spinning.  Note: kdb patches are against standard kernels, ask your
> distributor about how to patch the distributor's kernel with kdb.

I'll add this in to the kernel as well.  Hopefully I'll be able to get some 
more useful information out of the system the next time this happens.

Thanks for all the pointers!

Gregory

-- 
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.

