* Re: system lockup issues w/ 2.4.19
[not found] <13694.1047106361@ocs3.intra.ocs.com.au>
@ 2003-03-10 18:12 ` Gregory K. Ruiz-Ade
0 siblings, 0 replies; 3+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-10 18:12 UTC
To: Keith Owens; +Cc: linux-kernel
On Friday 07 March 2003 22:52, Keith Owens wrote:
> Those symptoms do not necessarily mean a full process table. You get
> exactly those symptoms if some code has grabbed a spin lock related to
> process creation and not released it.
Hmm... Well, it happened again on Friday night, and poring through the
syslogs, I found that sendmail started refusing mail due to a load average
of 18 and then 19... This system, even under its heaviest use, never breaks
a system load average of 4-5. Would a "stuck" spinlock result in an
artificial inflation of system load averages (as a symptom)?
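One rough way to see what is feeding the load average: on Linux it counts
tasks that are runnable (state R) or in uninterruptible sleep (state D), so
a tally of task states from /proc shows whether tasks are piling up in
either state. The following is only a sketch, assuming a Python interpreter
is available on the box; the helper names parse_state and count_states are
made up for illustration, and it would have to be running already (for
example in a loop that appends to a file) since nothing new can be started
once the machine hangs.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST ')' instead of on whitespace.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_states(proc="/proc"):
    """Tally task states (R, S, D, Z, ...) over all numeric /proc entries."""
    counts = Counter()
    for name in os.listdir(proc):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, name, "stat"), errors="replace") as f:
                counts[parse_state(f.read())] += 1
        except OSError:
            continue  # process exited between listdir() and open()
    return counts
```

Calling count_states() and adding the "R" and "D" entries gives a number
that should roughly track the instantaneous load; a large and growing "D"
count would point at tasks blocked in the kernel rather than at real CPU
demand.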
> You need kernel debugging features to find out which lock is the
> problem. Booting with nmi_watchdog and a serial console (see
> linux/Documentation) will often tell you what has hung.
I'm building a 2.4.20 kernel using sources from kernel.org, and turning on
the following options:
-->8--[Cut Here (.config)]-->8--
#
# Kernel hacking
#
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_HIGHMEM=y
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_IOVIRT is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_FRAME_POINTER=y
-->8--[Cut Here (.config)]-->8--
Should I enable the other two (CONFIG_DEBUG_SLAB and CONFIG_DEBUG_IOVIRT),
as well?
Also, I've wired up the serial console on this machine to another machine
(that's much more stable) so that I can access it remotely. Should I set
something up that monitors the serial console constantly and logs it to a
file, or will I be able to get the info I need via a program like minicom
by poking the kernel after the fact?
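For the "monitor constantly and log to a file" option, a minimal sketch of
what could run on the stable machine is below. It assumes the cable is on
/dev/ttyS0 at 9600 baud and that Python is available there; the device
path, baud rate, log file name, and the function names stamp_lines and
log_serial are all assumptions for illustration, not anything taken from
the kernel documentation.

```python
import os
import termios
import time
import tty


def stamp_lines(src, dst, now=time.time):
    """Copy lines from binary stream src to text stream dst, each prefixed
    with a UTC timestamp. Returns the number of lines written."""
    count = 0
    for raw in iter(src.readline, b""):
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        ts = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now()))
        dst.write(f"{ts} {text}\n")
        dst.flush()  # make sure the line is in the file before the next read
        count += 1
    return count


def log_serial(device="/dev/ttyS0", logfile="console.log",
               baud=termios.B9600):
    """Read the serial line forever and append timestamped output to logfile.
    Device, log file and baud rate are assumed values; adjust to the setup."""
    fd = os.open(device, os.O_RDONLY | os.O_NOCTTY)
    tty.setraw(fd)                    # no line editing or echo on the port
    attrs = termios.tcgetattr(fd)
    attrs[4] = attrs[5] = baud        # input and output speed
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    with os.fdopen(fd, "rb", buffering=0) as src, open(logfile, "a") as dst:
        stamp_lines(src, dst)
```

Running log_serial() in the background on the stable machine keeps a
timestamped record of anything the kernel prints before the hang, which
would otherwise be lost if nobody is attached with minicom at that moment.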
> The kdb patch (ftp://oss.sgi.com/projects/kdb/download/v3.0) will let
> you print the state of each process and find out where they are
> spinning. Note: kdb patches are against standard kernels, ask your
> distributor about how to patch the distributor's kernel with kdb.
I'll add this in to the kernel as well. Hopefully I'll be able to get some
more useful information out of the system the next time this happens.
Thanks for all the pointers!
Gregory
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.
* Re: system lockup issues w/ 2.4.19
[not found] <mailman.1046898841.30893.linux-kernel2news@redhat.com>
@ 2003-03-05 23:42 ` Pete Zaitcev
0 siblings, 0 replies; 3+ messages in thread
From: Pete Zaitcev @ 2003-03-05 23:42 UTC
To: Gregory K. Ruiz-Ade; +Cc: linux-kernel
> About once a month or so (not very regular), one or the other of our Dell
> PowerEdge servers goes catatonic. Examining the system in this state, it
> seems to exhibit the symptoms of a full process table, in that no new
> processes can be started at all.
It is essential that you explain how you did the examining,
with relevant shell traces/snapshots, etc. If they are too
long, upload them somewhere. And use of stock kernels goes
without saying, or you have to go to your vendor (SuSE).
-- Pete
* system lockup issues w/ 2.4.19
@ 2003-03-05 21:10 Gregory K. Ruiz-Ade
0 siblings, 0 replies; 3+ messages in thread
From: Gregory K. Ruiz-Ade @ 2003-03-05 21:10 UTC
To: linux-kernel
I was wondering if anyone here might be able to point me in the right
direction for a solution to this problem.
About once a month or so (not very regular), one or the other of our Dell
PowerEdge servers goes catatonic. Examining the system in this state, it
seems to exhibit the symptoms of a full process table, in that no new
processes can be started at all. This results in a system that has a
completely unresponsive console, services which answer to new connections
but never do anything, and general frustration as the only sign of life at
all is that it'll respond to pings.
Investigation after a power cycle of the catatonic machine reveals that
processes that had been running at the time of the event (whatever the
event is) either kept running (in the case of services, like syslog) or ran
to normal completion (i.e., reports or other jobs that had been started
earlier). As I said, everything I've gleaned from these systems when
they've gone catatonic suggests a full process table, but I have no proof
of it.
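One way to get actual evidence for or against the full-process-table
theory is to record, while the machine is still healthy, how many
processes exist compared with the kernel's limit. The sketch below counts
the numeric entries in /proc and reads /proc/sys/kernel/threads-max. It
assumes Python is available; the function names are made up for
illustration, and the comparison is only approximate, since the limit is
on threads while the count here is of /proc entries.

```python
import os


def count_pids(names):
    """Count directory names that look like PIDs (all digits)."""
    return sum(1 for n in names if n.isdigit())


def read_threads_max(path="/proc/sys/kernel/threads-max"):
    """Read the system-wide task limit exported by the kernel."""
    with open(path) as f:
        return int(f.read().strip())


def process_table_usage(proc="/proc"):
    """Return (current process count, threads-max) as a rough usage pair."""
    used = count_pids(os.listdir(proc))
    limit = read_threads_max(os.path.join(proc, "sys/kernel/threads-max"))
    return used, limit
```

Logging process_table_usage() every minute or so from an already-running
loop would show whether the count was anywhere near the limit just before
a lockup. If it never gets close, the process table is probably not the
cause.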
One of these servers (where it's more critical that this not happen) is a
Dell PowerEdge 6600, 4x 1.6GHz Xeon, 8GB ram, dual Broadcom GigE NICs,
PERC3/DC (Megaraid) Raid controller. This system is running SuSE Linux
Enterprise Server 7 (essentially SuSE 7.2 Pro) with kernel 2.4.19 (from
kernel.org) patched with LVM 1.0.5 (sistina) and "10_inode-highmem-2", a
patch recommended to me way back when I was trying to take care of some
LVM/VM issues with bigmem support (which are still unresolved). This is a
production machine, and as such my ability to load new kernels and do
testing is limited, but I can get ahold of it on the weekends.
The other server is our development machine, a Dell PowerEdge 4600, 2x
2.4GHz Xeon, 2GB ram, e100 and Broadcom GigE NICs, aacraid. This system is
also running SuSE Linux Enterprise Server 7, kernel 2.4.19 + LVM 1.0.5, but
without the highmem patch.
At this point, I'm looking for any options in resolving this issue. I'm
going to be contacting SuSE to open a support ticket, to see if they can
help. I'm preparing a 2.4.20 kernel + LVM 1.0.7 which I can hopefully
install tonight on both machines. I'm trying to get my hands on one of Red
Hat's 2.4.20 "bigmem" kernels to see what they include and if maybe one of
RH's kernels might do the job... Any other suggestions would be most
appreciated.
TIA,
Gregory
--
Gregory K. Ruiz-Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.