Temporary random kernel hang

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Temporary random kernel hang
@ 2006-12-08 10:38 seven
  2006-12-10  7:07 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: seven @ 2006-12-08 10:38 UTC (permalink / raw)
  To: linux-kernel


Hello,

I have some trouble with a multithreaded java network server running on
SLES10. At random times I see the kernel take 80% of the CPU leaving iddle
to 0% for 30 seconds. After this period the system returns to normal
operation state.

Below is a vmstat -a 3 recording that shows the problem:

 1  0      0 773068 529184 693048    0    0     0     0  272  201  0  0 100 
0  0
 0  0      0 773068 529184 693064    0    0     0    25  317  334  1  0 99 
1  0
 0  0      0 772944 529216 693248    0    0     0    24  477 1017  3  0 96 
0  0
 0  0      0 772820 529256 693316    0    0     0     0  525 1376  4  1 95 
0  0
 0  0      0 772448 529344 693636    0    0     0   107 1098 3306 11  2 86 
0  0
 0  0      0 772324 529404 693456    0    0     0     0  723 2247  7  2 91 
0  0
 0  0      0 772076 529496 693656    0    0     0   132  770 2488  7  2 91 
1  0
 0  0      0 772200 529528 693608    0    0     0    91  528 1168  4  1 94 
1  0
 0  0      0 772200 529532 693728    0    0     0     0  334  387  1  0 99 
0  0
 0  0      0 772076 529568 693680    0    0     0    24  564 1250  4  1 95 
0  0
 0  0      0 771828 529636 693784    0    0     0     0  787 2144  7  2 91 
0  0
 0  0      0 771580 529744 694232    0    0     0   111  995 3081 11  2 86 
1  0
107  0      0 771316 529792 694904    0    0     0   153  829 1650 12 37 51 
0  0
113  0      0 771316 529792 694912    0    0     0     0  323  169 15 85  0 
0  0
116  0      0 771216 529792 694728    0    0     0    25  292  190 14 86  0 
0  0
122  0      0 771340 529792 694728    0    0     0    21  311  191 15 85  0 
0  0
138  0      0 771464 529792 694728    0    0     0     0  365  196 14 86  0 
0  0
146  0      0 771464 529792 694728    0    0     0     0  331  189 16 84  0 
0  0
150  0      0 771472 529792 694728    0    0     0     0  336  183 15 85  0 
0  0
146  0      0 771472 529792 694728    0    0     0     4  310  201 14 86  0 
0  0
145  0      0 771472 529792 694728    0    0     0     0  285  163 15 85  0 
0  0
procs -----------memory---------- ---swap-- -----io---- -system--
-----cpu------
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id
wa st
146  0      0 771472 529792 694728    0    0     0     0  277  159 14 86  0 
0  0
145  0      0 771472 529792 694728    0    0     0    32  275  133 15 85  0 
0  0
 0  0      0 771208 529892 694176    0    0     0     0 1012 3408 12  4 84 
0  0
 0  0      0 770712 529972 694488    0    0     0   149  774 2869  8  2 90 
0  0
 0  0      0 770712 529972 694488    0    0     0     0  271  195  0  0 100 
0  0
 0  0      0 770728 529972 694488    0    0     0    35  269  167  0  0 100 
1  0
 0  0      0 770728 529972 694488    0    0     0     7  269  189  0  0 100 
0  0

The application is memory stable ( no leaks ) and a deadlock is out of the
question since in a deadlock case the system would freeze forever and not
temporarily. There are around 200 - 250 tcp/ip clients connected to the
application and 550 threads ( streaming blocking sockets are used so every
client is managed by one reading thread and one writing thread)

The same application works fine on SLES9.3

Hanging Evironment:
-----------------------------------------------------------------------------
mustang:~ # uname -a
Linux mustang 2.6.16.21-0.25-smp #1 SMP Tue Sep 19 07:26:15 UTC 2006 x86_64
x86_64 x86_64 GNU/Linux
mustang:~ # java -version
java version "1.6.0-rc"
Java(TM) SE Runtime Environment (build 1.6.0-rc-b104)
Java HotSpot(TM) Server VM (build 1.6.0-rc-b104, mixed mode)
mustang:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
-----------------------------------------------------------------------------

Working environment:
-----------------------------------------------------------------------------
apollo:~ # uname -a
Linux apollo 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 UTC 2006 x86_64
x86_64 x86_64 GNU/Linux
apollo:~ # java -version
java version "1.6.0-rc"
Java(TM) SE Runtime Environment (build 1.6.0-rc-b95)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0-rc-b95, mixed mode)
apollo:~ # cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (x86_64)
VERSION = 9
PATCHLEVEL = 3
-----------------------------------------------------------------------------

Can you give me some pointers about where to start debugging this issue?

Regards,
Horia
-- 
View this message in context: http://www.nabble.com/Temporary-random-kernel-hang-tf2779860.html#a7755634
Sent from the linux-kernel mailing list archive at Nabble.com.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Temporary random kernel hang
  2006-12-08 10:38 Temporary random kernel hang seven
@ 2006-12-10  7:07 ` Andrew Morton
  2006-12-11 18:46   ` seven
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2006-12-10  7:07 UTC (permalink / raw)
  To: seven; +Cc: linux-kernel

> On Fri, 8 Dec 2006 02:38:45 -0800 (PST) seven <horia.muntean@gmail.com> wrote:
> 
> Hello,
> 
> I have some trouble with a multithreaded java network server running on
> SLES10. At random times I see the kernel take 80% of the CPU leaving iddle
> to 0% for 30 seconds. After this period the system returns to normal
> operation state.
> 
> Below is a vmstat -a 3 recording that shows the problem:
> 
>  1  0      0 773068 529184 693048    0    0     0     0  272  201  0  0 100 
> 0  0
>  0  0      0 773068 529184 693064    0    0     0    25  317  334  1  0 99 
> 1  0
>  0  0      0 772944 529216 693248    0    0     0    24  477 1017  3  0 96 
> 0  0
>  0  0      0 772820 529256 693316    0    0     0     0  525 1376  4  1 95 
> 0  0
>  0  0      0 772448 529344 693636    0    0     0   107 1098 3306 11  2 86 
> 0  0
>  0  0      0 772324 529404 693456    0    0     0     0  723 2247  7  2 91 
> 0  0
>  0  0      0 772076 529496 693656    0    0     0   132  770 2488  7  2 91 
> 1  0
>  0  0      0 772200 529528 693608    0    0     0    91  528 1168  4  1 94 
> 1  0
>  0  0      0 772200 529532 693728    0    0     0     0  334  387  1  0 99 
> 0  0
>  0  0      0 772076 529568 693680    0    0     0    24  564 1250  4  1 95 
> 0  0
>  0  0      0 771828 529636 693784    0    0     0     0  787 2144  7  2 91 
> 0  0
>  0  0      0 771580 529744 694232    0    0     0   111  995 3081 11  2 86 
> 1  0
> 107  0      0 771316 529792 694904    0    0     0   153  829 1650 12 37 51 
> 0  0
> 113  0      0 771316 529792 694912    0    0     0     0  323  169 15 85  0 
> 0  0
> 116  0      0 771216 529792 694728    0    0     0    25  292  190 14 86  0 
> 0  0
> 122  0      0 771340 529792 694728    0    0     0    21  311  191 15 85  0 
> 0  0
> 138  0      0 771464 529792 694728    0    0     0     0  365  196 14 86  0 
> 0  0
> 146  0      0 771464 529792 694728    0    0     0     0  331  189 16 84  0 
> 0  0
> 150  0      0 771472 529792 694728    0    0     0     0  336  183 15 85  0 
> 0  0
> 146  0      0 771472 529792 694728    0    0     0     4  310  201 14 86  0 
> 0  0
> 145  0      0 771472 529792 694728    0    0     0     0  285  163 15 85  0 
> 0  0
> procs -----------memory---------- ---swap-- -----io---- -system--
> -----cpu------
>  r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id
> wa st
> 146  0      0 771472 529792 694728    0    0     0     0  277  159 14 86  0 
> 0  0
> 145  0      0 771472 529792 694728    0    0     0    32  275  133 15 85  0 
> 0  0
>  0  0      0 771208 529892 694176    0    0     0     0 1012 3408 12  4 84 
> 0  0
>  0  0      0 770712 529972 694488    0    0     0   149  774 2869  8  2 90 
> 0  0
>  0  0      0 770712 529972 694488    0    0     0     0  271  195  0  0 100 
> 0  0
>  0  0      0 770728 529972 694488    0    0     0    35  269  167  0  0 100 
> 1  0
>  0  0      0 770728 529972 694488    0    0     0     7  269  189  0  0 100 
> 0  0
> 
> The application is memory stable ( no leaks ) and a deadlock is out of the
> question since in a deadlock case the system would freeze forever and not
> temporarily. There are around 200 - 250 tcp/ip clients connected to the
> application and 550 threads ( streaming blocking sockets are used so every
> client is managed by one reading thread and one writing thread)
> 
> The same application works fine on SLES9.3
> 
> Hanging Evironment:
> -----------------------------------------------------------------------------
> mustang:~ # uname -a
> Linux mustang 2.6.16.21-0.25-smp #1 SMP Tue Sep 19 07:26:15 UTC 2006 x86_64
> x86_64 x86_64 GNU/Linux
> mustang:~ # java -version
> java version "1.6.0-rc"
> Java(TM) SE Runtime Environment (build 1.6.0-rc-b104)
> Java HotSpot(TM) Server VM (build 1.6.0-rc-b104, mixed mode)
> mustang:~ # cat /etc/SuSE-release
> SUSE Linux Enterprise Server 10 (x86_64)
> VERSION = 10
> -----------------------------------------------------------------------------
> 
> Working environment:
> -----------------------------------------------------------------------------
> apollo:~ # uname -a
> Linux apollo 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 UTC 2006 x86_64
> x86_64 x86_64 GNU/Linux
> apollo:~ # java -version
> java version "1.6.0-rc"
> Java(TM) SE Runtime Environment (build 1.6.0-rc-b95)
> Java HotSpot(TM) 64-Bit Server VM (build 1.6.0-rc-b95, mixed mode)
> apollo:~ # cat /etc/SuSE-release
> SUSE LINUX Enterprise Server 9 (x86_64)
> VERSION = 9
> PATCHLEVEL = 3
> -----------------------------------------------------------------------------
> 
> Can you give me some pointers about where to start debugging this issue?
> 

A kernel profile will tell us were the kernel is burning CPU.  Something
like this (run as root):

#!/bin/sh
while true
do
	opcontrol --stop
	opcontrol --shutdown
	rm -rf /var/lib/oprofile
	opcontrol --vmlinux=/boot/vmlinux-$(uname -r)
	opcontrol --start-daemon
	opcontrol --start
	date
	sleep 5
	opcontrol --stop
	opcontrol --shutdown
	opreport -l /boot/vmlinux-$(uname -r) | head -50
done | tee /tmp/oprofile-output

Then locate the output record which correlates with one of these episodes.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Temporary random kernel hang
  2006-12-10  7:07 ` Andrew Morton
@ 2006-12-11 18:46   ` seven
  2006-12-14 22:10     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: seven @ 2006-12-11 18:46 UTC (permalink / raw)
  To: linux-kernel




> A kernel profile will tell us were the kernel is burning CPU.  Something
> like this (run as root):
> 
> #!/bin/sh
> while true
> do
> 	opcontrol --stop
> 	opcontrol --shutdown
> 	rm -rf /var/lib/oprofile
> 	opcontrol --vmlinux=/boot/vmlinux-$(uname -r)
> 	opcontrol --start-daemon
> 	opcontrol --start
> 	date
> 	sleep 5
> 	opcontrol --stop
> 	opcontrol --shutdown
> 	opreport -l /boot/vmlinux-$(uname -r) | head -50
> done | tee /tmp/oprofile-output
> 
> Then locate the output record which correlates with one of these episodes.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 


Hello Andrew,

First, thank you for the pointers. I did what you suggested ( installed
oprofile and run the script ). In the same time, I added a timestamp column
to vmstat recording.

I can see dramatic changes in the output of opreport during the freeze
period, but I do not know what it means. 

Here are my recordings:

vmstat -a 3 | awk {print $0"-"system("date")} > vmstat.txt &

 0  0      0 211888 980736 755460    0    0     0   600 1599 2679 30  8 62 
0  0 Mon Dec 11 18:30:07 GMT 2006
 4  0      0 209356 982216 756600    0    0     0   604 2201 6401 44  8 48 
0  0 Mon Dec 11 18:30:10 GMT 2006
88  0      0 208616 982648 756912    0    0     0     0 2340 3690 45 10 44 
0  0 Mon Dec 11 18:30:13 GMT 2006
 5  0      0 206892 984024 757228    0    0     0   752 2780 5246 53  9 38 
0  0 Mon Dec 11 18:31:00 GMT 2006
179  0      0 206428 984244 758152    0    0     0   287  834  787 17 81  2 
0  0 Mon Dec 11 18:31:00 GMT 2006
245  0      0 206344 984244 758256    0    0     0     0  527  244 14 86  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
297  0      0 206476 984248 758048    0    0     0    32  535  238 15 85  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
procs -----------memory---------- ---swap-- -----io---- -system--
-----cpu------ Mon Dec 11 18:31:00 GMT 2006
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id
wa st Mon Dec 11 18:31:00 GMT 2006
303  0      0 206476 984256 758060    0    0     0   161  375  210 15 85  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
309  0      0 206468 984260 758064    0    0     0     0  352  209 15 85  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
316  0      0 206468 984264 758068    0    0     0    21  345  220 14 86  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
315  0      0 206592 984268 758072    0    0     0    15  315  229 15 85  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
323  0      0 206592 984272 758072    0    0     0    21  330  230 14 86  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
332  0      0 206608 984272 758072    0    0     0     0  439  235 13 87  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
333  0      0 206608 984272 758072    0    0     0    12  374  199 14 86  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
337  0      0 206608 984292 758096    0    0     0    28  333  233 15 85  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
341  0      0 206484 984292 758096    0    0     0     0  328  201 14 86  0 
0  0 Mon Dec 11 18:31:00 GMT 2006
 6  0      0 205344 984368 758164    0    0     0    21  597  887 21 78  1 
0  0 Mon Dec 11 18:31:03 GMT 2006
 0  0      0 204420 984768 757704    0    0     0     0 3911 4646 76 14 10 
0  0 Mon Dec 11 18:31:06 GMT 2006
 5  0      0 293960 917620 736532    0    0     0   707 3191 5204 62  9 28 
0  0 Mon Dec 11 18:31:09 GMT 2006
28  0      0 294556 917764 735684    0    0     0   831 3128 5132 62 14 23 
0  0 Mon Dec 11 18:31:12 GMT 2006
 9  0      0 291836 919304 736960    0    0     0     0 2935 5228 57  9 34 
0  0 Mon Dec 11 18:31:15 GMT 2006

-------------------------------------------------------------------------------------
The script output corresponding to the freeze period is below. Notice that
the third sample has completely different look compared to sample 1, 2, 4, 5
witch correspond to normal operation situation.


Mon Dec 11 18:29:56 GMT 2006
Stopping profiling.
Stopping profiling.
Killing daemon.
CPU: P4 / Xeon, speed 3000.18 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
700       3.9093  ia32_sysenter_target
608       3.3955  find_vma
603       3.3676  copy_user_generic
570       3.1833  sysenter_do_call
486       2.7142  schedule
485       2.7086  tcp_sendmsg
445       2.4852  __find_get_block
429       2.3958  do_futex
365       2.0384  page_fault
355       1.9826  main_timer_handler
293       1.6363  do_gettimeofday
284       1.5861  try_to_wake_up
277       1.5470  __switch_to
258       1.4409  fget
255       1.4241  clear_page
220       1.2286  ia32_syscall
213       1.1895  ll_rw_block
197       1.1002  find_lock_page
194       1.0834  gs_change
187       1.0443  __up_read
177       0.9885  lock_sock
177       0.9885  tcp_ack
171       0.9550  __down_read
164       0.9159  IRQ0xa9_interrupt
156       0.8712  fget_light
152       0.8489  fput
146       0.8154  kfree
144       0.8042  math_state_restore
144       0.8042  tcp_v4_rcv
128       0.7148  __brelse
128       0.7148  recalc_task_prio
123       0.6869  __handle_mm_fault
123       0.6869  __kmalloc
122       0.6813  copy_page
122       0.6813  kmem_cache_alloc
120       0.6702  __tcp_push_pending_frames
117       0.6534  unlock_buffer
116       0.6478  rw_verify_area
115       0.6422  thread_return
114       0.6367  unmap_vmas
112       0.6255  kmem_cache_free
112       0.6255  radix_tree_lookup
110       0.6143  device_not_available
105       0.5864  ip_output
104       0.5808  pfn_to_page
104       0.5808  release_sock
90        0.5026  release_pages
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Daemon started.
Profiler running.
Mon Dec 11 18:30:02 GMT 2006
Stopping profiling.
Stopping profiling.
Killing daemon.
CPU: P4 / Xeon, speed 3000.18 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
741       4.3771  ia32_sysenter_target
682       4.0286  sysenter_do_call
622       3.6742  find_vma
523       3.0894  copy_user_generic
515       3.0421  schedule
495       2.9240  do_futex
419       2.4750  __find_get_block
373       2.2033  do_gettimeofday
372       2.1974  tcp_sendmsg
325       1.9198  main_timer_handler
312       1.8430  try_to_wake_up
298       1.7603  __switch_to
260       1.5358  fget
239       1.4118  gs_change
227       1.3409  ia32_syscall
222       1.3114  __down_read
219       1.2936  page_fault
193       1.1401  find_lock_page
189       1.1164  __up_read
180       1.0633  fget_light
178       1.0515  clear_page
178       1.0515  tcp_ack
175       1.0337  ll_rw_block
167       0.9865  math_state_restore
156       0.9215  thread_return
154       0.9097  lock_sock
151       0.8920  fput
138       0.8152  recalc_task_prio
135       0.7974  tcp_v4_rcv
129       0.7620  kfree
128       0.7561  IRQ0xa9_interrupt
124       0.7325  release_sock
123       0.7266  device_not_available
120       0.7088  __tcp_push_pending_frames
114       0.6734  __kmalloc
111       0.6557  unlock_buffer
110       0.6498  inet_sendmsg
108       0.6380  __brelse
105       0.6202  do_gettimeoffset_tsc
100       0.5907  rw_verify_area
95        0.5612  tcp_init_tso_segs
94        0.5553  kmem_cache_free
94        0.5553  radix_tree_lookup
92        0.5434  ip_output
89        0.5257  file_update_time
86        0.5080  dnotify_parent
86        0.5080  kmem_cache_alloc
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Daemon started.
Profiler running.
Mon Dec 11 18:30:09 GMT 2006
Stopping profiling.
Stopping profiling.
Killing daemon.
CPU: P4 / Xeon, speed 3000.18 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
196048    9.3307  page_fault
184945    8.8023  force_sig_info
153475    7.3045  do_signal
120071    5.7146  get_signal_to_deliver
116196    5.5302  ia32_setup_sigcontext
89368     4.2534  ia32_syscall
82987     3.9497  __down_read_trylock
82396     3.9215  save_i387_ia32
79436     3.7807  copy_user_generic
77587     3.6927  __sigqueue_alloc
74439     3.5428  sys32_rt_sigreturn
69484     3.3070  recalc_sigpending_tsk
65733     3.1285  do_page_fault
49694     2.3651  retint_restore_args
45756     2.1777  ia32_setup_rt_frame
45717     2.1759  try_to_wake_up
45658     2.1730  restore_i387_ia32
31626     1.5052  __up_read
30286     1.4414  __dequeue_signal
29849     1.4206  memcpy
27139     1.2917  sys32_sigaltstack
25838     1.2297  ia32_restore_sigcontext
25126     1.1958  kmem_cache_free
24757     1.1783  do_sigaltstack
22884     1.0891  __sigqueue_free
22818     1.0860  error_sti
22007     1.0474  signal_wake_up
19816     0.9431  kmem_cache_alloc
17810     0.8476  retint_signal
17208     0.8190  is_prefetch
16315     0.7765  _atomic_dec_and_lock
14320     0.6815  specific_send_sig_info
13881     0.6607  send_signal
13704     0.6522  find_vma
12392     0.5898  free_uid
9283      0.4418  get_sigframe
9184      0.4371  copy_siginfo_to_user32
8525      0.4057  pfn_to_page
7980      0.3798  unhandled_signal
7465      0.3553  error_entry
7309      0.3479  convert_rip_to_linear
6926      0.3296  notifier_call_chain
6367      0.3030  kprobe_exceptions_notify
5948      0.2831  copy_to_user
5408      0.2574  error_exit
4407      0.2097  int_ret_from_sys_call
4128      0.1965  dequeue_signal
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Daemon started.
Profiler running.
Mon Dec 11 18:31:02 GMT 2006
Stopping profiling.
Stopping profiling.
Killing daemon.
CPU: P4 / Xeon, speed 3000.18 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
1113      4.4269  ia32_sysenter_target
911       3.6234  find_vma
887       3.5280  copy_user_generic
850       3.3808  sysenter_do_call
715       2.8438  do_futex
712       2.8319  tcp_sendmsg
679       2.7007  __find_get_block
657       2.6132  schedule
508       2.0205  do_gettimeofday
399       1.5870  try_to_wake_up
375       1.4915  fget
366       1.4557  main_timer_handler
341       1.3563  __switch_to
314       1.2489  IRQ0xa9_interrupt
307       1.2211  ia32_syscall
297       1.1813  clear_page
292       1.1614  ll_rw_block
285       1.1336  unlock_buffer
278       1.1057  gs_change
277       1.1017  fget_light
276       1.0978  find_lock_page
274       1.0898  page_fault
268       1.0659  tcp_ack
247       0.9824  lock_sock
242       0.9625  __down_read
233       0.9267  kfree
223       0.8870  __tcp_push_pending_frames
213       0.8472  tcp_v4_rcv
210       0.8353  math_state_restore
203       0.8074  __up_read
202       0.8034  fput
196       0.7796  recalc_task_prio
194       0.7716  release_sock
187       0.7438  kmem_cache_free
182       0.7239  __brelse
180       0.7159  thread_return
173       0.6881  test_clear_page_dirty
170       0.6762  inet_sendmsg
161       0.6404  ip_output
159       0.6324  device_not_available
156       0.6205  pfn_to_page
151       0.6006  __kmalloc
149       0.5926  tcp_transmit_skb
145       0.5767  file_update_time
143       0.5688  dnotify_parent
142       0.5648  kmem_cache_alloc
141       0.5608  do_gettimeoffset_tsc
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Daemon started.
Profiler running.
Mon Dec 11 18:31:08 GMT 2006
Stopping profiling.
Stopping profiling.
Killing daemon.
CPU: P4 / Xeon, speed 3000.18 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
1054      4.3503  ia32_sysenter_target
880       3.6322  copy_user_generic
859       3.5455  sysenter_do_call
842       3.4753  find_vma
694       2.8645  tcp_sendmsg
675       2.7860  __find_get_block
656       2.7076  do_futex
645       2.6622  schedule
487       2.0101  do_gettimeofday
409       1.6881  try_to_wake_up
384       1.5849  fget
382       1.5767  main_timer_handler
360       1.4859  __switch_to
323       1.3332  tcp_ack
304       1.2547  clear_page
304       1.2547  ll_rw_block
302       1.2465  IRQ0xa9_interrupt
293       1.2093  fget_light
289       1.1928  page_fault
279       1.1516  find_lock_page
273       1.1268  ia32_syscall
271       1.1185  gs_change
256       1.0566  lock_sock
233       0.9617  kfree
229       0.9452  __up_read
227       0.9369  __down_read
223       0.9204  tcp_v4_rcv
207       0.8544  fput
194       0.8007  unlock_buffer
190       0.7842  thread_return
189       0.7801  math_state_restore
186       0.7677  release_sock
178       0.7347  __tcp_push_pending_frames
175       0.7223  __kmalloc
168       0.6934  inet_sendmsg
164       0.6769  dnotify_parent
164       0.6769  recalc_task_prio
158       0.6521  kmem_cache_free
154       0.6356  tcp_transmit_skb
153       0.6315  device_not_available
144       0.5944  tcp_init_tso_segs
140       0.5778  __brelse
139       0.5737  pfn_to_page
132       0.5448  kmem_cache_alloc
131       0.5407  file_update_time
131       0.5407  ip_output
128       0.5283  tcp_current_mss

---------------------------------------------------------------------------------------------------

I did notice something that can be related to those freeze periods: I
started to monitor the temperatures of the CPUs and noticed that the freeze
(I could reproduce only one freeze today) happened in the same interval as a
temperature spike ( around 50 degrees C ). My system is a 1U, 2-way Intel
Xeon 3.0 GHz blade ( IBM x336 ) and I think it has some kind of temperature
control that kicks in when the CPU temp reaches certain levels. AFAIK the
temperature control system will change the CPU speed in order to cool it
down.

Anyway, I will try to heat up the other system tomorrow (the stable one) to
see if I can bring it to the same temp. level and see how it responds. 

In the same time I will try to heat up again the unstable system to see if
there is really relation between the CPU temp. and system freezes (only one
match could be just a coincidence).

Is is possible that this kernel behavior could be related to temp. control
events?

Regards and thank you again,
Horia

-- 
View this message in context: http://www.nabble.com/Temporary-random-kernel-hang-tf2779860.html#a7800583
Sent from the linux-kernel mailing list archive at Nabble.com.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Temporary random kernel hang
  2006-12-11 18:46   ` seven
@ 2006-12-14 22:10     ` Andrew Morton
  2006-12-14 23:07       ` seven
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2006-12-14 22:10 UTC (permalink / raw)
  To: seven; +Cc: linux-kernel

On Mon, 11 Dec 2006 10:46:25 -0800 (PST)
seven <horia.muntean@gmail.com> wrote:

> 
> 
> 
> > A kernel profile will tell us were the kernel is burning CPU.  Something
> > like this (run as root):
> > 
> > #!/bin/sh
> > while true
> > do
> > 	opcontrol --stop
> > 	opcontrol --shutdown
> > 	rm -rf /var/lib/oprofile
> > 	opcontrol --vmlinux=/boot/vmlinux-$(uname -r)
> > 	opcontrol --start-daemon
> > 	opcontrol --start
> > 	date
> > 	sleep 5
> > 	opcontrol --stop
> > 	opcontrol --shutdown
> > 	opreport -l /boot/vmlinux-$(uname -r) | head -50
> > done | tee /tmp/oprofile-output
> > 
> > Then locate the output record which correlates with one of these episodes.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> > 
> 
> 
> Hello Andrew,

No, I wasn't ignoring you for four days.  Please, always do reply-to-all.

> First, thank you for the pointers. I did what you suggested ( installed
> oprofile and run the script ). In the same time, I added a timestamp column
> to vmstat recording.
> 
> I can see dramatic changes in the output of opreport during the freeze
> period, but I do not know what it means. 
> 
> ..
>
> CPU: P4 / Xeon, speed 3000.18 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not
> stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples  %        symbol name
> 741       4.3771  ia32_sysenter_target
> 682       4.0286  sysenter_do_call
> 622       3.6742  find_vma
> 523       3.0894  copy_user_generic
> 515       3.0421  schedule
> 495       2.9240  do_futex
> 419       2.4750  __find_get_block
> 373       2.2033  do_gettimeofday
> 372       2.1974  tcp_sendmsg
> 325       1.9198  main_timer_handler
> 312       1.8430  try_to_wake_up
> 298       1.7603  __switch_to
> 260       1.5358  fget
> 239       1.4118  gs_change
> 227       1.3409  ia32_syscall
> 222       1.3114  __down_read
> 219       1.2936  page_fault
> 193       1.1401  find_lock_page
> 189       1.1164  __up_read
> 180       1.0633  fget_light
> 178       1.0515  clear_page
> 178       1.0515  tcp_ack
> 175       1.0337  ll_rw_block
> 167       0.9865  math_state_restore
> 156       0.9215  thread_return
> 154       0.9097  lock_sock
> 151       0.8920  fput
> 138       0.8152  recalc_task_prio
> 135       0.7974  tcp_v4_rcv
> 129       0.7620  kfree
> 128       0.7561  IRQ0xa9_interrupt
> 124       0.7325  release_sock
> 123       0.7266  device_not_available
> 120       0.7088  __tcp_push_pending_frames
> 114       0.6734  __kmalloc
> 111       0.6557  unlock_buffer
> 110       0.6498  inet_sendmsg
> 108       0.6380  __brelse
> 105       0.6202  do_gettimeoffset_tsc
> 100       0.5907  rw_verify_area
> 95        0.5612  tcp_init_tso_segs
> 94        0.5553  kmem_cache_free
> 94        0.5553  radix_tree_lookup
> 92        0.5434  ip_output
> 89        0.5257  file_update_time
> 86        0.5080  dnotify_parent
> 86        0.5080  kmem_cache_alloc

versus

> Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
> Daemon started.
> Profiler running.
> Mon Dec 11 18:30:09 GMT 2006
> Stopping profiling.
> Stopping profiling.
> Killing daemon.
> CPU: P4 / Xeon, speed 3000.18 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not
> stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples  %        symbol name
> 196048    9.3307  page_fault
> 184945    8.8023  force_sig_info
> 153475    7.3045  do_signal
> 120071    5.7146  get_signal_to_deliver
> 116196    5.5302  ia32_setup_sigcontext
> 89368     4.2534  ia32_syscall
> 82987     3.9497  __down_read_trylock
> 82396     3.9215  save_i387_ia32
> 79436     3.7807  copy_user_generic
> 77587     3.6927  __sigqueue_alloc
> 74439     3.5428  sys32_rt_sigreturn
> 69484     3.3070  recalc_sigpending_tsk
> 65733     3.1285  do_page_fault
> 49694     2.3651  retint_restore_args
> 45756     2.1777  ia32_setup_rt_frame
> 45717     2.1759  try_to_wake_up
> 45658     2.1730  restore_i387_ia32
> 31626     1.5052  __up_read
> 30286     1.4414  __dequeue_signal
> 29849     1.4206  memcpy
> 27139     1.2917  sys32_sigaltstack
> 25838     1.2297  ia32_restore_sigcontext
> 25126     1.1958  kmem_cache_free
> 24757     1.1783  do_sigaltstack
> 22884     1.0891  __sigqueue_free
> 22818     1.0860  error_sti
> 22007     1.0474  signal_wake_up
> 19816     0.9431  kmem_cache_alloc
> 17810     0.8476  retint_signal
> 17208     0.8190  is_prefetch
> 16315     0.7765  _atomic_dec_and_lock

It's certainly different.  It is hard to work out what might be going
on in there, but there's no obvious sign of misbehaviour.  It could be that
the application has simply gone berzerk and is doing a larger number or
expensive system calls.

The next step would be to try to catch one of these eposides with strace.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Temporary random kernel hang
  2006-12-14 22:10     ` Andrew Morton
@ 2006-12-14 23:07       ` seven
  2006-12-15 10:54         ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: seven @ 2006-12-14 23:07 UTC (permalink / raw)
  To: linux-kernel



Andrew Morton wrote:
> 
> No, I wasn't ignoring you for four days.  Please, always do reply-to-all.
> 

I am using nabble's interface to post messages ( www.nabble.com ) and I am
not subscribed to the mailing list, so it's possible that this is the cause
of the direct mail.


Andrew Morton wrote:
> 
> It's certainly different.  It is hard to work out what might be going
> on in there, but there's no obvious sign of misbehaviour.  It could be
> that
> the application has simply gone berzerk and is doing a larger number or
> expensive system calls.
> 
> The next step would be to try to catch one of these eposides with strace.
> 

I will try. Thank you.

Regards,
Horia 
-- 
View this message in context: http://www.nabble.com/Temporary-random-kernel-hang-tf2779860.html#a7883193
Sent from the linux-kernel mailing list archive at Nabble.com.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Temporary random kernel hang
  2006-12-14 23:07       ` seven
@ 2006-12-15 10:54         ` Peter Zijlstra
  2006-12-15 12:06           ` seven
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2006-12-15 10:54 UTC (permalink / raw)
  To: seven; +Cc: linux-kernel, Andrew Morton

On Thu, 2006-12-14 at 15:07 -0800, seven wrote:
> 
> Andrew Morton wrote:
> > 
> > No, I wasn't ignoring you for four days.  Please, always do reply-to-all.
> > 
> 
> I am using nabble's interface to post messages ( www.nabble.com ) and I am
> not subscribed to the mailing list, so it's possible that this is the cause
> of the direct mail.

However, all these replies are send to your gmail account, so you
_could_ just 'reply to all' there, right?

Please do, finding lone responses on lkml is hard.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Temporary random kernel hang
  2006-12-15 10:54         ` Peter Zijlstra
@ 2006-12-15 12:06           ` seven
  0 siblings, 0 replies; 7+ messages in thread
From: seven @ 2006-12-15 12:06 UTC (permalink / raw)
  To: linux-kernel



Peter Zijlstra wrote:
> 
> On Thu, 2006-12-14 at 15:07 -0800, seven wrote:
>> 
>> Andrew Morton wrote:
>> > 
>> > No, I wasn't ignoring you for four days.  Please, always do
>> reply-to-all.
>> > 
>> 
>> I am using nabble's interface to post messages ( www.nabble.com ) and I
>> am
>> not subscribed to the mailing list, so it's possible that this is the
>> cause
>> of the direct mail.
> 
> However, all these replies are send to your gmail account, so you
> _could_ just 'reply to all' there, right?
> 
> Please do, finding lone responses on lkml is hard.
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 

Well, nabble sends an alert mail to my gmail account whenever there is a
reply to the thread I monitor accompanied by a second message that is
'coming' from the author of the reply and a CC to
linux-kernel@vger.kernel.org. So I use gmail only for reading the replays.

In the Reply form used by nabble one can find the following 


> This message will be sent to the linux-kernel@vger.kernel.org mailing
> list.
> 
so I assumed that nabble will do proper posting to the mailing list. It
appears that it does not.

I will use gmail with 'Reply to all' from now on in order to post messages. 
Sorry for any inconveniences.

Regards,
Horia


-- 
View this message in context: http://www.nabble.com/Temporary-random-kernel-hang-tf2779860.html#a7890237
Sent from the linux-kernel mailing list archive at Nabble.com.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2006-12-15 12:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-12-08 10:38 Temporary random kernel hang seven
2006-12-10  7:07 ` Andrew Morton
2006-12-11 18:46   ` seven
2006-12-14 22:10     ` Andrew Morton
2006-12-14 23:07       ` seven
2006-12-15 10:54         ` Peter Zijlstra
2006-12-15 12:06           ` seven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox