Consistent kernel hang during heavy TCP connection handling load

* Consisten kernel hang during heavy TCP connection handling load
@ 2004-09-22 19:17 Andrew A.
  2004-09-24  0:05 ` Consistent " Andrew A.
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew A. @ 2004-09-22 19:17 UTC
  To: linux-kernel

I have written a pair of applications, the server side of which consistently causes my Linux Fedora Core 2 system to become
completely unresponsive; all consoles hang, and it no longer services network connections.

The applications engage in the rapid opening and closing of TCP connections.  The server side is multithreaded (approx. 5 threads).
It services the connections by dumping data into them from a file.  The client side reads no data.  The server then receives EAGAIN
from send(...,MSG_DONTWAIT) calls, and issues a 5 ms sleep before resending on any particular TCP connection.  It loops up to 20 times
waiting for the connection to become unblocked.  The applications are running within GDB, and threads *are* created/destroyed
during the process.
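
For reference, a minimal sketch of the retry loop described above. It is not the actual server code: it assumes the non-blocking flag is MSG_DONTWAIT, that the retry counter resets whenever some data gets through, and that SIGPIPE is suppressed with MSG_NOSIGNAL. The helper name send_with_retry and both constants are made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

#define MAX_RETRIES    20        /* "loops up to 20 times" */
#define RETRY_SLEEP_NS 5000000L  /* 5 ms */

/*
 * Hypothetical helper: push len bytes into fd without blocking.
 * On EAGAIN it sleeps 5 ms and tries again, giving up after
 * MAX_RETRIES consecutive blocked attempts.
 * Returns the number of bytes sent (possibly short), or -1 on a hard error.
 */
static ssize_t send_with_retry(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    int blocked = 0;

    while (sent < len && blocked < MAX_RETRIES) {
        /* MSG_NOSIGNAL is an addition here, to avoid SIGPIPE on a closed peer */
        ssize_t n = send(fd, buf + sent, len - sent,
                         MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n >= 0) {
            sent += (size_t)n;
            blocked = 0;            /* assumption: progress resets the counter */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            struct timespec ts = { 0, RETRY_SLEEP_NS };
            nanosleep(&ts, NULL);   /* socket buffer full: back off 5 ms */
            blocked++;
        } else if (errno != EINTR) {
            return -1;              /* real error, e.g. EPIPE or ECONNRESET */
        }
    }
    return (ssize_t)sent;
}
```

With a peer that never reads, this returns a short count after roughly 100 ms of sleeping, which matches the behaviour described.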

I will change the application to use select() rather than sleeping on a blocked pipe.  However, I don't think it's a "good thing"
that the machine hangs so completely.
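
A rough sketch of what the select()-based variant could look like, under the same assumptions as any illustration: the helper names wait_writable and send_all_select and the timeout parameter are hypothetical, and MSG_DONTWAIT/MSG_NOSIGNAL are my choice of flags rather than anything taken from the real application.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/*
 * Hypothetical helper: wait until fd is writable or timeout_ms elapses.
 * Returns 1 if writable, 0 on timeout, -1 on error.
 * An EINTR restarts the wait with the full timeout.
 */
static int wait_writable(int fd, int timeout_ms)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EBADF;              /* select() cannot handle this fd */
        return -1;
    }
    for (;;) {
        fd_set wfds;
        struct timeval tv;

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int rc = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (rc > 0)
            return 1;
        if (rc == 0)
            return 0;
        if (errno != EINTR)
            return -1;
    }
}

/*
 * Hypothetical helper: send len bytes, sleeping in select() instead of
 * a fixed 5 ms nap whenever the socket buffer is full.
 * Returns bytes sent (short if the peer stays blocked), or -1 on error.
 */
static ssize_t send_all_select(int fd, const char *buf, size_t len,
                               int timeout_ms)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent,
                         MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n >= 0) {
            sent += (size_t)n;
            continue;
        }
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        int rc = wait_writable(fd, timeout_ms);
        if (rc < 0)
            return -1;
        if (rc == 0)
            break;                  /* peer never drained the buffer */
    }
    return (ssize_t)sent;
}
```

The difference from the sleep loop is that the thread wakes as soon as the kernel reports buffer space, rather than polling on a timer.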

I looked for tools to help catch the kernel before it goes la-la (assuming it's the kernel going la-la), but got
frustrated/ran-out-of-time.  E.g., lkcd seems defunct.

If pointed in the right direction, I would be happy to perform further forensics after re-creating the hang.  I am also in the
process of upgrading the kernel to see if that resolves the problem.

Andrew Athan

uname -a:

Linux bbox.memeplex.com 2.6.6-1.435 #1 Mon Jun 14 09:09:07 EDT 2004 i686 i686 i386 GNU/Linux

lsmod:

Module                  Size  Used by
snd_mixer_oss          13824  2
snd_via82xx            20644  3
snd_ac97_codec         54788  1 snd_via82xx
snd_pcm                69256  1 snd_via82xx
snd_timer              17284  1 snd_pcm
snd_page_alloc          8072  2 snd_via82xx,snd_pcm
gameport                3328  1 snd_via82xx
snd_mpu401_uart         4864  1 snd_via82xx
snd_rawmidi            17444  1 snd_mpu401_uart
snd_seq_device          6152  1 snd_rawmidi
snd                    39396  10 snd_mixer_oss,snd_via82xx,snd_ac97_codec,snd_pcm,snd_timer,snd_mpu401_uart,snd_rawmidi,snd_seq_device
soundcore               6112  3 snd
ipt_mark                1408  2
ipt_MARK                1664  14
cls_u32                 5508  2
cls_fw                  3200  2
sch_sfq                 4352  9
sch_htb                18048  1
iptable_mangle          2176  1
ip_tables              13568  3 ipt_mark,ipt_MARK,iptable_mangle
nfsd                  159488  9
exportfs                4224  1 nfsd
lockd                  47816  2 nfsd
parport_pc             19392  1
lp                      8236  0
parport                29640  2 parport_pc,lp
autofs4                12932  0
sunrpc                109924  19 nfsd,lockd
via_rhine              15752  0
mii                     3584  1 via_rhine
floppy                 47440  0
sg                     27680  0
scsi_mod               91984  1 sg
microcode               4768  0
dm_mod                 32800  0
ehci_hcd               22916  0
uhci_hcd               24472  0
button                  4632  0
battery                 6924  0
asus_acpi               8984  0
ac                      3340  0
r128                   85796  2
ipv6                  184672  18
ext3                  103656  2
jbd                    40728  1 ext3

* RE: Consistent kernel hang during heavy TCP connection handling load
  2004-09-22 19:17 Consisten kernel hang during heavy TCP connection handling load Andrew A.
@ 2004-09-24  0:05 ` Andrew A.
  2004-09-26 17:42   ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew A. @ 2004-09-24  0:05 UTC
  To: linux-kernel

I would not normally quote an entire message, but it contains data relevant to this problem.

The hang below occurs even outside of GDB, and also occurs after upgrading the kernel:

Linux bbox.memeplex.com 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004 i686 i686 i386 GNU/Linux

Can anyone please give me a clue/pointer to tools/techniques that might help identify where in the kernel the hang occurs?  The
system is so completely unresponsive when this occurs that I cannot provide any forensic data.

Does anyone's experience show that these types of hangs might occur purely as the result of use (or mis-use) of the pthreads
library?  I'm looking for hints about what parts of my code to review.

There could easily be erroneous calls to pthread_detach(), pthread_join(), close(), and other system calls involved.

Thanks,
Andrew Athan

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Andrew A.
Sent: Wednesday, September 22, 2004 3:17 PM
To: linux-kernel@vger.kernel.org
Subject: Consisten kernel hang during heavy TCP connection handling load

I have written a pair of applications, the server side of which consistently causes my Linux Fedora Core 2 system to become
completely unresponsive; all consoles hang, and it no longer services network connections.

The applications engage in the rapid opening and closing of TCP connections.  The server side is multithreaded (approx. 5 threads).
It services the connections by dumping data into them from a file.  The client side reads no data.  The server then receives EAGAIN
from send(...,MSG_DONTWAIT) calls, and issues a 5 ms sleep before resending on any particular TCP connection.  It loops up to 20 times
waiting for the connection to become unblocked.  The applications are running within GDB, and threads *are* created/destroyed
during the process.

I will change the application to use select() rather than sleeping on a blocked pipe.  However, I don't think it's a "good thing"
that the machine hangs so completely.

I looked for tools to help catch the kernel before it goes la-la (assuming it's the kernel going la-la), but got
frustrated/ran-out-of-time.  E.g., lkcd seems defunct.

If pointed in the right direction, I would be happy to perform further forensics after re-creating the hang.  I am also in the
process of upgrading the kernel to see if that resolves the problem.

Andrew Athan

uname -a:

Linux bbox.memeplex.com 2.6.6-1.435 #1 Mon Jun 14 09:09:07 EDT 2004 i686 i686 i386 GNU/Linux

lsmod:

Module                  Size  Used by
snd_mixer_oss          13824  2
snd_via82xx            20644  3
snd_ac97_codec         54788  1 snd_via82xx
snd_pcm                69256  1 snd_via82xx
snd_timer              17284  1 snd_pcm
snd_page_alloc          8072  2 snd_via82xx,snd_pcm
gameport                3328  1 snd_via82xx
snd_mpu401_uart         4864  1 snd_via82xx
snd_rawmidi            17444  1 snd_mpu401_uart
snd_seq_device          6152  1 snd_rawmidi
snd                    39396  10 snd_mixer_oss,snd_via82xx,snd_ac97_codec,snd_pcm,snd_timer,snd_mpu401_uart,snd_rawmidi,snd_seq_device
soundcore               6112  3 snd
ipt_mark                1408  2
ipt_MARK                1664  14
cls_u32                 5508  2
cls_fw                  3200  2
sch_sfq                 4352  9
sch_htb                18048  1
iptable_mangle          2176  1
ip_tables              13568  3 ipt_mark,ipt_MARK,iptable_mangle
nfsd                  159488  9
exportfs                4224  1 nfsd
lockd                  47816  2 nfsd
parport_pc             19392  1
lp                      8236  0
parport                29640  2 parport_pc,lp
autofs4                12932  0
sunrpc                109924  19 nfsd,lockd
via_rhine              15752  0
mii                     3584  1 via_rhine
floppy                 47440  0
sg                     27680  0
scsi_mod               91984  1 sg
microcode               4768  0
dm_mod                 32800  0
ehci_hcd               22916  0
uhci_hcd               24472  0
button                  4632  0
battery                 6924  0
asus_acpi               8984  0
ac                      3340  0
r128                   85796  2
ipv6                  184672  18
ext3                  103656  2
jbd                    40728  1 ext3

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

* Re: Consistent kernel hang during heavy TCP connection handling load
  2004-09-24  0:05 ` Consistent " Andrew A.
@ 2004-09-26 17:42   ` Jan Kara
  2004-09-26 21:33     ` Andrew A.
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2004-09-26 17:42 UTC
  To: Andrew A.; +Cc: linux-kernel

> 
> I would not normally quote an entire message, but it contains data
> relevant to this problem.
> 
> The hang below occurs even outside of GDB, and also occurs after
> upgrading the kernel:
> 
> Linux bbox.memeplex.com 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004
> i686 i686 i386 GNU/Linux
> 
> 
> 
> Can anyone please give me a clue/pointer to tools/techniques that
> might help identify where in the kernel the hang occurs?  The system
> is so completely unresponsive when this occurs that I cannot provide
> any forensic data.
  How unresponsive is it, exactly? Can you switch consoles and type? I
suppose ps(1) hangs... Is the disk working?

You can compile the kernel with magic SysRq key support (the option is in
the kernel debugging section), boot it, and then press Alt-SysRq-T; the
state of all processes will be printed. That might help...

> Does anyone's experience show that these types of hangs might occur
> purely as the result of use (or mis-use) of the pthreads library?  I'm
> looking for hints about what parts of my code to review.
> 
> There could easily be erroneous calls to pthread_detach(),
> pthread_join(), close(), and other system calls involved.
> 
> Thanks,
> Andrew Athan

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

* RE: Consistent kernel hang during heavy TCP connection handling load
  2004-09-26 17:42   ` Jan Kara
@ 2004-09-26 21:33     ` Andrew A.
  2004-09-27  8:24       ` Jan Kara
  2004-09-30 16:34       ` Bill Davidsen
  0 siblings, 2 replies; 9+ messages in thread
From: Andrew A. @ 2004-09-26 21:33 UTC
  To: Jan Kara, linux-kernel

Jan,

Thanks for responding.  When I got no responses, I searched for ways to get more data out of the kernel--I must say that it has been
quite a journey to identify what is working, where to get it, and how to install it when it comes to kernel
debugging/crash-data-gathering tools.  LKCD, for example, is not available at the location you'll eventually arrive at if you search
for it on Google, and it's not obvious what its state is (current/defunct/superseded).  Then there are KDB, KGDB, netdump, netconsole,
netlog, diskdump (confusingly known as lkdump), etc.  And even if you do figure out which tools are current, you then have
to match the tool to the particular kernel version you are running -- which can be a task and a half unto itself.

Is diskdump available for 2.4?  Can anyone comment on the choice of tools below?

Anyway, I have also done all of the following:

(1) Enabled netdump/netconsole on the 2.6.8-1.521 Fedora Core kernel, after first fixing the startup scripts.  Fixes can be found at
www.memeplex.com/Linux.html  Note that after I also fixed crash.c to be a 2.6-compliant kernel module and loaded it to test
netdump, I always end up with a vmcore-incomplete image approx. 45k in size on the netdump-server.  Can anyone tell me if this is
absurdly small, and if so, what might be the solution?  The client box always reboots, so I suspect too-small timeouts are the issue.

(2) Downloaded the latest 2.4 kernel, installed KDB patches and modified configs on the system to accept the 2.4 kernel --
specifically, /etc/modules.conf and xorg.conf changes (added Mouse1/SendCoreEvents on /dev/psaux).  I don't think I found any
netdump patches for the 2.4 line of kernels.  Can someone point me in the right direction?

(3) Enabled sysrq on both kernels, including echo "1" > /proc/sys/kernel/sysrq
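
For anyone who wants to do step (3) from a program rather than the shell, a small sketch follows. The helper name write_control_file is invented for this example; the only path taken from the above is /proc/sys/kernel/sysrq, and writing to it needs root.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a short string to a procfs control file,
 * much as `echo "1" > path` would.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_control_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputs(value, f) >= 0);
    if (fclose(f) != 0)             /* write errors may only show up here */
        ok = 0;
    return ok ? 0 : -1;
}
```

Usage would be along the lines of write_control_file("/proc/sys/kernel/sysrq", "1\n"), with the return value checked so a missing CONFIG option or lack of privilege does not go unnoticed.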

I'll wait for the next hang now, trying it on both kernels.  By the way, the system is hung VERY badly--doesn't respond to anything,
no switching consoles, no keyboard events, no disk activity.  Dunno about network, since I haven't put a sniffer on it yet.

A.

* Re: Consistent kernel hang during heavy TCP connection handling load
  2004-09-26 21:33     ` Andrew A.
@ 2004-09-27  8:24       ` Jan Kara
  2004-09-27 13:15         ` Andrew A.
  2004-09-30 16:34       ` Bill Davidsen
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2004-09-27  8:24 UTC
  To: Andrew A.; +Cc: linux-kernel

  Hello,

> Thanks for responding.  When I got no responses, I searched for ways
> to get more data out of the kernel--I must say that it has been quite
> a journey to identify what is working, where to get it, and how to
> install it when it comes to kernel debugging/crash-data-gathering
> tools.  LKCD, for example, is not available at the location you'll
> eventually arrive at if you search for it on Google, and it's not
> obvious what its state is (current/defunct/superseded).  Then there are
> KDB, KGDB, netdump, netconsole, netlog, diskdump (confusingly known as
> lkdump), etc.  And even if you do figure out which tools are
> current, you then have to match the tool to the particular kernel
> version you are running -- which can be a task and a half unto itself.
> 
> Is diskdump available for 2.4?  Can anyone comment on the choice of
> tools below?
> 
> Anyway, I have also done all of the following:
> 
> (1) Enabled netdump/netconsole on the 2.6.8-1.521 Fedora Core kernel,
> after first fixing the startup scripts.  Fixes can be found at
> www.memeplex.com/Linux.html  Note that after I also fixed crash.c to
> be a 2.6-compliant kernel module and loaded it to test netdump, I
> always end up with a vmcore-incomplete image approx 45k in size, on
> the netdump-server.  Can anyone tell me if this is absurdly small, and
> if so, what might be the solution?  The client box always reboots so I
> suspect too-small timeouts are the issue.
> 
> (2) Downloaded the latest 2.4 kernel, installed KDB patches and
> modified configs on the system to accept the 2.4 kernel --
> specifically, /etc/modules.conf and xorg.conf changes (added
> Mouse1/SendCoreEvents on /dev/psaux).  I don't think I found any
> netdump patches for the 2.4 line of kernels.  Can someone point me in
> the right direction?
  I don't personally have much experience debugging with the above tools,
so I won't be of much help. From your description of the problem below, I
personally think that you won't get much from them if the system is as
unresponsive as you write.

> (3) Enabled sysrq on both kernels, including echo "1" > /proc/sys/kernel/sysrq
> 
> I'll wait for the next hang now, trying it on both kernels.  By the
> way, the system is hung VERY badly--doesn't respond to anything, no
> switching consoles, no keyboard events, no disk activity.  Dunno about
> network, since I haven't put a sniffer on it yet.
  Hmm... that looks bad. Do you debug things on the console and not
in X? If that is the case, either there is some hardware problem (you
likely generate quite a high load on the machine) or some driver is stuck
with interrupts disabled. If the debugging tools don't help, you can try
compiling a kernel with a minimal config (just disable everything not
needed to run the test). Also, reproducing on a different machine would
be useful to rule out hardware...

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

* RE: Consistent kernel hang during heavy TCP connection handling load
  2004-09-27  8:24       ` Jan Kara
@ 2004-09-27 13:15         ` Andrew A.
  2004-09-27 13:33           ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew A. @ 2004-09-27 13:15 UTC
  To: Jan Kara, Andrew A.; +Cc: linux-kernel


Jan/all:

Yes, I have reproduced the problem on another machine running a similar kernel but with different network card, CPU, etc.

A.

* Re: Consistent kernel hang during heavy TCP connection handling load
  2004-09-27 13:15         ` Andrew A.
@ 2004-09-27 13:33           ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2004-09-27 13:33 UTC
  To: Andrew A.; +Cc: linux-kernel

  Hello,

> Yes, I have reproduced the problem on another machine running a
> similar kernel but with different network card, CPU, etc.
  OK, so it probably isn't hardware. Any debugging output? If I
understand correctly, you are using an RH kernel - can you try the vanilla
one from ftp.kernel.org to rule out RH-specific patches? Can you send
your kernel configuration?

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

* Re: Consistent kernel hang during heavy TCP connection handling load
  2004-09-26 21:33     ` Andrew A.
  2004-09-27  8:24       ` Jan Kara
@ 2004-09-30 16:34       ` Bill Davidsen
  2004-09-30 18:18         ` Daniel Stekloff
  1 sibling, 1 reply; 9+ messages in thread
From: Bill Davidsen @ 2004-09-30 16:34 UTC
  To: linux-kernel

Andrew A. wrote:
> (1) Enabled netdump/netconsole on the 2.6.8-1.521 Fedora Core kernel, after first fixing the startup scripts.  Fixes can be found at
> www.memeplex.com/Linux.html  Note that after I also fixed crash.c to be a 2.6-compliant kernel module and loaded it to test
> netdump, I always end up with a vmcore-incomplete image approx. 45k in size on the netdump-server.  Can anyone tell me if this is
> absurdly small, and if so, what might be the solution?  The client box always reboots, so I suspect too-small timeouts are the issue.

My experience is 100% with RH kernels, but the dump should be about 
memory size, in my case 2.5G or 4G and it is. But I did see hangs which 
resulted in the size you mention, a few k and hang.

There was a patch floating around to write a core image to a disk
partition like Solaris, AIX, and other commercial systems, but Linus was
opposed for some reason I remember as "I don't need this and it could be
dangerous" or similar. If that can be retrofitted to a current kernel it
would be more useful than netdump, I suspect.

In any case, the short answer is that what you see is way too short, it 
sounds like the header info on config, registers, or somesuch that 
netdump sends first before the core.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

* Re: Consistent kernel hang during heavy TCP connection handling load
  2004-09-30 16:34       ` Bill Davidsen
@ 2004-09-30 18:18         ` Daniel Stekloff
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Stekloff @ 2004-09-30 18:18 UTC
  To: Bill Davidsen, aathan-linux-kernel-1542; +Cc: linux-kernel

On Thu, 2004-09-30 at 09:34, Bill Davidsen wrote:
> My experience is 100% with RH kernels, but the dump should be about 
> memory size, in my case 2.5G or 4G and it is. But I did see hangs which 
> resulted in the size you mention, a few k and hang.


Yep, the vmcore should be around the size of the memory on the dumping
system. The size is too small and vmcore-incomplete wasn't made into
vmcore, so it's incomplete.

Are you able to build a new kernel? Here's a fix that I think
will help. This will speed things up and help with the timeout issues.

drivers/net/netdump.c :

1) Initialize jiffy_cycles to 1000 * (1000000/HZ) -

-static unsigned long long t0, jiffy_cycles;
+static unsigned long long t0, jiffy_cycles = 1000 * (1000000/HZ);

2) Change prev_jiffies from an "int" to an "unsigned long" in
print_status() function -

-       static int prev_jiffies = 0;
+       static unsigned long prev_jiffies = 0;
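
To make the arithmetic behind the two changes concrete, here is a standalone userspace sketch (not netdump code). 1000000/HZ is the number of microseconds per jiffy, so the initializer appears to assume about 1000 cycles per microsecond, i.e. a 1 GHz clock; that reading is an interpretation, not something stated in the patch. The second function models a 32-bit unsigned long jiffies counter, as on i386, to show why an unsigned type keeps tick differences correct across wraparound. Both function names are invented for illustration.

```c
#include <stdint.h>

/*
 * Value the patch assigns to jiffy_cycles for a given HZ:
 * 1000 * (microseconds per jiffy).
 */
static unsigned long long default_jiffy_cycles(unsigned int hz)
{
    return 1000ULL * (1000000ULL / hz);
}

/*
 * Ticks elapsed between two samples of a 32-bit unsigned counter.
 * Unsigned subtraction is modulo 2^32, so the result stays correct
 * even when the counter wrapped between prev and now.
 */
static uint32_t ticks_since(uint32_t now, uint32_t prev)
{
    return now - prev;
}
```

For HZ=1000 the default works out to 1,000,000 cycles per jiffy, and for HZ=100 to 10,000,000. In the kernel, jiffies is an unsigned long, so a static int copy of it cannot represent the upper half of its range on a 32-bit machine.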


Let me know if this helps,

Thanks,

Dan

