linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Strange CPU occupation...
@ 2007-10-31 12:38 BERTRAND Joël
  2007-11-01  9:25 ` Strange CPU occupation... and system hangs BERTRAND Joël
  2007-11-07  8:55 ` Strange CPU occupation Goswin von Brederlow
  0 siblings, 2 replies; 3+ messages in thread
From: BERTRAND Joël @ 2007-10-31 12:38 UTC (permalink / raw)
  To: sparclinux, linux-raid, iscsitarget-devel

	Hello,

	I'm looking for a bug in iSCSI target code, but I have found this 
morning a new bug that is certainly related to mine...

Please consider these raid volumes:
Root gershwin:[/etc] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2](F) md_d0p1[0]
       1464725632 blocks [2/1] [U_]

md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
       1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

md6 : active raid1 sda1[0] sdb1[1]
       7815552 blocks [2/2] [UU]

md5 : active raid1 sda8[0] sdb8[1]
       14538752 blocks [2/2] [UU]

md4 : active raid1 sda7[0] sdb7[1]
       4883648 blocks [2/2] [UU]

md3 : active raid1 sda6[0] sdb6[1]
       9767424 blocks [2/2] [UU]

md2 : active raid1 sda5[0] sdb5[1]
       29294400 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
       489856 blocks [2/2] [UU]

md0 : active raid1 sdb4[1] sda4[0]
       4883648 blocks [2/2] [UU]

unused devices: <none>
Root gershwin:[/etc] >

md7 only has one disk because I cannot synchronize it over iSCSI. But 
without any message, load average of this server (24 threads T1000) 
increases until more than 9. top returns :
top - 13:36:08 up 4 days,  1:00,  3 users,  load average: 9.23, 8.46, 6.26
Tasks: 252 total,   5 running, 246 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.0%us,  4.2%sy,  0.0%ni, 87.4%id,  8.4%wa,  0.0%hi,  0.0%si, 
0.0%st
Mem:   4139024k total,  4115920k used,    23104k free,   743976k buffers
Swap:  7815536k total,      304k used,  7815232k free,  2188048k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 

  5426 root      15  -5     0    0    0 R  100  0.0  46:32.54 
md_d0_raid5
17215 root      20   0  3120 1552 1112 R    1  0.0   0:01.38 top 

     1 root      20   0  2576  960  816 S    0  0.0   0:09.74 init 

     2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd 

     3 root      RT  -5     0    0    0 S    0  0.0   0:00.18 
migration/0
     4 root      15  -5     0    0    0 S    0  0.0   0:00.18 
ksoftirqd/0

and some process are in D state :
Root gershwin:[/etc] > ps auwx | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       270  0.0  0.0      0     0 ?        D    Oct27   1:17 [pdflush]
root      3676  0.9  0.0      0     0 ?        D    Oct27  56:03 [nfsd]
root      5435  0.0  0.0      0     0 ?        D<   Oct27   3:16 [md7_raid1]
root      5438  0.0  0.0      0     0 ?        D<   Oct27   1:01 [kjournald]
root      5440  0.0  0.0      0     0 ?        D<   Oct27   0:33 [loop0]
root      5441  0.0  0.0      0     0 ?        D<   Oct27   0:05 [kjournald]
root     16442  0.0  0.0  20032  1208 pts/2    D+   13:23   0:00 iftop 
-i eth2

	Why md7_raid is in D state ? Same question about iftop ?

	Regards,

	JKB

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Strange CPU occupation... and system hangs
  2007-10-31 12:38 Strange CPU occupation BERTRAND Joël
@ 2007-11-01  9:25 ` BERTRAND Joël
  2007-11-07  8:55 ` Strange CPU occupation Goswin von Brederlow
  1 sibling, 0 replies; 3+ messages in thread
From: BERTRAND Joël @ 2007-11-01  9:25 UTC (permalink / raw)
  To: sparclinux, linux-raid, iscsitarget-devel

BERTRAND Joël wrote:

<snip>

> and some process are in D state :
> Root gershwin:[/etc] > ps auwx | grep D
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       270  0.0  0.0      0     0 ?        D    Oct27   1:17 [pdflush]
> root      3676  0.9  0.0      0     0 ?        D    Oct27  56:03 [nfsd]
> root      5435  0.0  0.0      0     0 ?        D<   Oct27   3:16 [md7_raid1]
> root      5438  0.0  0.0      0     0 ?        D<   Oct27   1:01 [kjournald]
> root      5440  0.0  0.0      0     0 ?        D<   Oct27   0:33 [loop0]
> root      5441  0.0  0.0      0     0 ?        D<   Oct27   0:05 [kjournald]
> root     16442  0.0  0.0  20032  1208 pts/2    D+   13:23   0:00 iftop 
> -i eth2
> 
> 	Why md7_raid is in D state ? Same question about iftop ?

	Some bad news... After ten or eleven hours, kernel crashes on this 
server. The last top screen is :

top - 04:59:46 up 4 days, 16:24,  3 users,  load average: 19.72, 19.22, 
19.05
Tasks: 285 total,   5 running, 279 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.0%us,  4.2%sy,  0.0%ni, 68.5%id, 27.3%wa,  0.0%hi,  0.0%si, 
0.0%st
Mem:   4139024k total,  4130800k used,     8224k free,    38984k buffers
Swap:  7815536k total,      304k used,  7815232k free,    79056k cached
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 

5426 root      15  -5     0    0    0 R  100  0.0 970:17.21 md_d0_raid5
26923 root      20   0  3120 1568 1112 R    2  0.0  13:32.24 top 

...

	I have rebooted. I don't have any message in log files. I don't have 
any screen but I haven't seen anything on serial console. In ker.log, I 
have :
Oct 31 15:36:15 gershwin kernel: swapper: page allocation failure. 
order:2, mode:0x4020
Oct 31 15:36:15 gershwin kernel: Call Trace:
Oct 31 15:36:15 gershwin kernel:  [00000000004b6568] 
__slab_alloc+0x1b0/0x720
Oct 31 15:36:15 gershwin kernel:  [00000000004b87a8] 
__kmalloc_track_caller+0xb0/0xe0
Oct 31 15:36:15 gershwin kernel:  [0000000000601d68] __alloc_skb+0x50/0x120
Oct 31 15:36:15 gershwin kernel:  [0000000000642ee0] 
tcp_collapse+0x1e8/0x440
Oct 31 15:36:15 gershwin kernel:  [0000000000643298] 
tcp_prune_queue+0x160/0x3a0
Oct 31 15:36:15 gershwin kernel:  [0000000000643d08] 
tcp_data_queue+0x830/0xde0
Oct 31 15:36:15 gershwin kernel:  [0000000000645d74] 
tcp_rcv_established+0x35c/0x840
Oct 31 15:36:15 gershwin kernel:  [000000000064cf7c] 
tcp_v4_do_rcv+0xe4/0x4a0
Oct 31 15:36:15 gershwin kernel:  [000000000064fdd8] tcp_v4_rcv+0xb00/0xb20
Oct 31 15:36:15 gershwin kernel:  [000000000062e2ac] 
ip_local_deliver+0x194/0x3a0
Oct 31 15:36:15 gershwin kernel:  [000000000062dd98] ip_rcv+0x360/0x6e0
Oct 31 15:36:15 gershwin kernel:  [0000000000607f64] 
netif_receive_skb+0x1ec/0x480
Oct 31 15:36:15 gershwin kernel:  [00000000005a5fe0] tg3_poll+0x6c8/0xc40
Oct 31 15:36:15 gershwin kernel:  [000000000060a940] 
net_rx_action+0x88/0x160
Oct 31 15:36:15 gershwin kernel:  [0000000000468078] __do_softirq+0x80/0x100
Oct 31 15:36:15 gershwin kernel:  [000000000046815c] do_softirq+0x64/0x80
Oct 31 15:36:15 gershwin kernel: Mem-info:
Oct 31 15:36:15 gershwin kernel: Normal per-cpu:
Oct 31 15:36:15 gershwin kernel: CPU    0: Hot: hi:   90, btch:  15 usd: 
  15   Cold: hi:   30, btch:   7 usd:   5
Oct 31 15:36:15 gershwin kernel: CPU    1: Hot: hi:   90, btch:  15 usd: 
  31   Cold: hi:   30, btch:   7 usd:   4
Oct 31 15:36:15 gershwin kernel: CPU    2: Hot: hi:   90, btch:  15 usd: 
   4   Cold: hi:   30, btch:   7 usd:   3
Oct 31 15:36:15 gershwin kernel: CPU    3: Hot: hi:   90, btch:  15 usd: 
  82   Cold: hi:   30, btch:   7 usd:   2
Oct 31 15:36:15 gershwin kernel: CPU    4: Hot: hi:   90, btch:  15 usd: 
  84   Cold: hi:   30, btch:   7 usd:   0
Oct 31 15:36:15 gershwin kernel: CPU    5: Hot: hi:   90, btch:  15 usd: 
  65   Cold: hi:   30, btch:   7 usd:   4
Oct 31 15:36:15 gershwin kernel: CPU    6: Hot: hi:   90, btch:  15 usd: 
  85   Cold: hi:   30, btch:   7 usd:   6
Oct 31 15:36:15 gershwin kernel: CPU    7: Hot: hi:   90, btch:  15 usd: 
  69   Cold: hi:   30, btch:   7 usd:   4
Oct 31 15:36:15 gershwin kernel: CPU    8: Hot: hi:   90, btch:  15 usd: 
  11   Cold: hi:   30, btch:   7 usd:   5
Oct 31 15:36:15 gershwin kernel: CPU    9: Hot: hi:   90, btch:  15 usd: 
  75   Cold: hi:   30, btch:   7 usd:   1
Oct 31 15:36:15 gershwin kernel: CPU   10: Hot: hi:   90, btch:  15 usd: 
  84   Cold: hi:   30, btch:   7 usd:   2
Oct 31 15:36:15 gershwin kernel: CPU   11: Hot: hi:   90, btch:  15 usd: 
  13   Cold: hi:   30, btch:   7 usd:   1
Oct 31 15:36:15 gershwin kernel: CPU   12: Hot: hi:   90, btch:  15 usd: 
  17   Cold: hi:   30, btch:   7 usd:  23
Oct 31 15:36:15 gershwin kernel: CPU   13: Hot: hi:   90, btch:  15 usd: 
   7   Cold: hi:   30, btch:   7 usd:  25
Oct 31 15:36:15 gershwin kernel: CPU   14: Hot: hi:   90, btch:  15 usd: 
  64   Cold: hi:   30, btch:   7 usd:  27
Oct 31 15:36:15 gershwin kernel: CPU   15: Hot: hi:   90, btch:  15 usd: 
  12   Cold: hi:   30, btch:   7 usd:   6
Oct 31 15:36:15 gershwin kernel: CPU   16: Hot: hi:   90, btch:  15 usd: 
   2   Cold: hi:   30, btch:   7 usd:   1
Oct 31 15:36:15 gershwin kernel: CPU   17: Hot: hi:   90, btch:  15 usd: 
  80   Cold: hi:   30, btch:   7 usd:   1
Oct 31 15:36:15 gershwin kernel: CPU   18: Hot: hi:   90, btch:  15 usd: 
   4   Cold: hi:   30, btch:   7 usd:  17
Oct 31 15:36:15 gershwin kernel: CPU   19: Hot: hi:   90, btch:  15 usd: 
  58   Cold: hi:   30, btch:   7 usd:   1
Oct 31 15:36:16 gershwin kernel: CPU   20: Hot: hi:   90, btch:  15 usd: 
  13   Cold: hi:   30, btch:   7 usd:   4
Oct 31 15:36:16 gershwin kernel: CPU   21: Hot: hi:   90, btch:  15 usd: 
  87   Cold: hi:   30, btch:   7 usd:   2
Oct 31 15:36:16 gershwin kernel: CPU   22: Hot: hi:   90, btch:  15 usd: 
  77   Cold: hi:   30, btch:   7 usd:   6
Oct 31 15:36:16 gershwin kernel: CPU   23: Hot: hi:   90, btch:  15 usd: 
  10   Cold: hi:   30, btch:   7 usd:   2
Oct 31 15:36:16 gershwin kernel: Active:72131 inactive:275943 dirty:6433 
writeback:1041 unstable:0
Oct 31 15:36:16 gershwin kernel:  free:2187 slab:131941 mapped:1732 
pagetables:258 bounce:0
Oct 31 15:36:16 gershwin kernel: Normal free:17496kB min:8144kB 
low:10176kB high:12216kB active:577048kB inactive:2207544kB 
present:4149440kB pages_scanned:0 all_unreclaimable? no
Oct 31 15:36:16 gershwin kernel: lowmem_reserve[]: 0 0
Oct 31 15:36:16 gershwin kernel: Normal: 1786*8kB 173*16kB 0*32kB 1*64kB 
0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 17632kB
Oct 31 15:36:16 gershwin kernel: Swap cache: add 39, delete 39, find 
0/0, race 0+0
Oct 31 15:36:16 gershwin kernel: Free swap  = 7815232kB
Oct 31 15:36:16 gershwin kernel: Free swap  = 7815232kB
Oct 31 15:36:16 gershwin kernel: Total swap = 7815536kB
Oct 31 15:36:16 gershwin kernel: Free swap:       7815232kB
Oct 31 15:36:17 gershwin kernel: 524260 pages of RAM
Oct 31 15:36:17 gershwin kernel: 6882 reserved pages
Oct 31 15:36:17 gershwin kernel: 278965 pages shared
Oct 31 15:36:17 gershwin kernel: 0 pages swap cached
Oct 31 15:36:17 gershwin kernel: 6435 pages dirty
Oct 31 15:36:17 gershwin kernel: 1041 pages writeback
Oct 31 15:36:17 gershwin kernel: 1732 pages mapped
Oct 31 15:36:17 gershwin kernel: 131941 pages slab
Oct 31 15:36:17 gershwin kernel: 258 pages pagetables

	Any idea ?

	Regards,

	JKB

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Strange CPU occupation...
  2007-10-31 12:38 Strange CPU occupation BERTRAND Joël
  2007-11-01  9:25 ` Strange CPU occupation... and system hangs BERTRAND Joël
@ 2007-11-07  8:55 ` Goswin von Brederlow
  1 sibling, 0 replies; 3+ messages in thread
From: Goswin von Brederlow @ 2007-11-07  8:55 UTC (permalink / raw)
  To: BERTRAND Joel; +Cc: sparclinux, linux-raid, iscsitarget-devel

BERTRAND Joël <joel.bertrand@systella.fr> writes:

>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>
>  5426 root      15  -5     0    0    0 R  100  0.0  46:32.54
> md_d0_raid5

First: You can tune the stripe cache.

Secondly: You might want the raid speedup patches that implement
prioritized queueing. That reduces the number of incomplete stripes
written to disk and thus speeds up things a lot.

Thirdly: If you have multiple cores then you probably want patches to
multithread the raid5 code. With just one thread it is ultimately cpu
bound and just won't get faster. XORing and maintainance overhead
seems to be awfully cpu intensive.

Fourth: Learn to live with it. Software Raid5/6 is expensive.

> and some process are in D state :
> Root gershwin:[/etc] > ps auwx | grep D
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       270  0.0  0.0      0     0 ?        D    Oct27   1:17 [pdflush]
> root      3676  0.9  0.0      0     0 ?        D    Oct27  56:03 [nfsd]
> root      5435  0.0  0.0      0     0 ?        D<   Oct27   3:16 [md7_raid1]
> root      5438  0.0  0.0      0     0 ?        D<   Oct27   1:01 [kjournald]
> root      5440  0.0  0.0      0     0 ?        D<   Oct27   0:33 [loop0]
> root      5441  0.0  0.0      0     0 ?        D<   Oct27   0:05 [kjournald]
> root     16442  0.0  0.0  20032  1208 pts/2    D+   13:23   0:00 iftop
> -i eth2
>
> 	Why md7_raid is in D state ? Same question about iftop ?

I would say it is just waiting for the underlying devices to finish
some I/O request. And iftop will be in some kernel call waiting for
something or other.

MfG
        Goswin
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2007-11-07  8:55 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-31 12:38 Strange CPU occupation BERTRAND Joël
2007-11-01  9:25 ` Strange CPU occupation... and system hangs BERTRAND Joël
2007-11-07  8:55 ` Strange CPU occupation Goswin von Brederlow

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).