public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* processes in D State too long too often
@ 2009-02-07  6:30 Gary L. Grobe
  2009-02-07  6:45 ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Gary L. Grobe @ 2009-02-07  6:30 UTC (permalink / raw)
  To: linux-kernel

I'm currently running 2.6.27-r7 with Ethernet and Myrinet interconnects on slave quad and dual-quad Dell 1950s w/ 16-32 GB RAM, and a master node that is a Dell 6850 w/ 32 GB.

I've got processes running on diskless nodes that mount a master node via NFS. Several of these processes go into a D state, yet they recover back to running in a very short time (a few seconds), and then CPU usage goes from 0% back to 100%. The 100% is correct, as these CPUs are running number-crunching simulations and should be pegged; while in D state, CPU usage drops to 0%.

So why are they in a D state, waiting on I/O? Looking on the master node, I see that several nfsd's are also in a D state and shortly recover back to running (shown as '-' in ps, or R in 'top').

Running 'ps -eal', the WCHAN column for the processes in a D state shows what I believe they are waiting on. It can be a mix of these; I usually see 'sync_p', 'nfs_wa', 'lmGrou', and 'txLock'. My file system type is JFS.
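For sampling this live, something along these lines can pull out just the D-state tasks and their wait channels (a sketch; ps output formats and column widths vary a little between versions):

```shell
# List tasks currently in uninterruptible sleep (state D), together
# with the kernel wait channel (WCHAN) each one is blocked in.
ps -eo state,pid,wchan:30,comm --no-headers | awk '$1 == "D"'

# Just count them, e.g. for logging once per second from a loop:
ps -eo state= | grep -c '^D' || true
```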

Here's a snipped 'ps -eal' listing on the master node.

1 D     0 26709     2  0  75  -5 -     0 lmGrou ?        00:00:07 nfsd
1 S     0 26710     2  0  75  -5 -     0 -      ?        00:00:07 nfsd
1 S     0 26711     2  0  75  -5 -     0 -      ?        00:00:04 nfsd
1 S     0 26712     2  0  75  -5 -     0 -      ?        00:00:08 nfsd
1 D     0 26713     2  0  75  -5 -     0 lmGrou ?        00:00:10 nfsd
1 S     0 26714     2  0  75  -5 -     0 -      ?        00:00:09 nfsd
1 D     0 26715     2  0  75  -5 -     0 txLock ?        00:00:08 nfsd
1 D     0 26716     2  0  75  -5 -     0 -      ?        00:00:09 nfsd
1 D     0 26717     2  0  75  -5 -     0 txLock ?        00:00:09 nfsd
1 S     0 26718     2  0  75  -5 -     0 -      ?        00:00:07 nfsd
1 D     0 26719     2  0  75  -5 -     0 -      ?        00:00:08 nfsd
1 D     0 26720     2  0  75  -5 -     0 sync_p ?        00:00:09 nfsd
1 S     0 26721     2  0  75  -5 -     0 -      ?        00:00:09 nfsd
1 S     0 26722     2  0  75  -5 -     0 -      ?        00:00:09 nfsd

And here's the same command on a diskless node, which shows my processes in a D state on what seems to be nfs_wait (they recover quite quickly, a few seconds later) ...

# ps -eal
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
...
1 S  1001  6145     1  0  80   0 -  6560 924758 ?        00:00:01 orted
0 D  1001  6146  6145 71  80   0 - 941316 nfs_wa ?       19:27:10 lve_1
0 D  1001  6147  6145 58  80   0 - 894594 nfs_wa ?       15:57:56 lve_1
0 R  1001  6148  6145 57  80   0 - 901343 -     ?        15:33:07 lve_1
0 R  1001  6149  6145 78  80   0 - 896065 -     ?        21:31:32 lve_1
...

'rpcinfo -p master_node' shows that I have portmapper, mountd, nlockmgr, and nfs running w/ all the correct normal info.

It would seem as if NFS were dropping out intermittently, but I've gone all through the NFS config and see nothing wrong, my DNS servers are working fine, it's all running on a local LAN (no firewall issues), and I see the same results on many different diskless nodes, so I don't believe it's a hardware issue. All my previous installations have run fine w/ this same NFS config.

Others have suggested this may be a 2.6.27-r7 kernel bug. I must note that I did not have this same problem running a 2.6.17 kernel w/ XFS. The hold-up seems to be in the kernel, and I'm looking for any advice on whether this might be the case.

Because these processes are going into a D state so often, a simulation that might normally run for 6 hours now takes 2 days to complete. I've tested the myrinet and ether interconnects and I see no issues from node to node or switch. I can reproduce the problem every time between any one node and the master.

So I'm looking for thoughts as to what might be going on and how to further investigate if this is in fact a kernel issue.







^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: processes in D State too long too often
  2009-02-07  6:30 processes in D State too long too often Gary L. Grobe
@ 2009-02-07  6:45 ` Andrew Morton
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2009-02-07  6:45 UTC (permalink / raw)
  To: Gary L. Grobe; +Cc: linux-kernel, linux-nfs

(cc linux-nfs)

On Sat, 07 Feb 2009 06:30:22 +0000 "Gary L. Grobe" <gary@grobe.net> wrote:

> I'm currently running 2.6.27-r7 with Ethernet and Myrinet interconnects on slave quad and dual-quad Dell 1950s w/ 16-32 GB RAM, and a master node that is a Dell 6850 w/ 32 GB.
> 
> I've got processes running on diskless nodes that mount a master node via NFS. Several of these processes go into a D state, yet they recover back to running in a very short time (a few seconds), and then CPU usage goes from 0% back to 100%. The 100% is correct, as these CPUs are running number-crunching simulations and should be pegged; while in D state, CPU usage drops to 0%.
> 
> So why are they in a D state, waiting on I/O? Looking on the master node, I see that several nfsd's are also in a D state and shortly recover back to running (shown as '-' in ps, or R in 'top').
> 
> Running 'ps -eal', the WCHAN column for the processes in a D state shows what I believe they are waiting on. It can be a mix of these; I usually see 'sync_p', 'nfs_wa', 'lmGrou', and 'txLock'. My file system type is JFS.
> 
> Here's a snipped 'ps -eal' listing on the master node.
> 
> 1 D     0 26709     2  0  75  -5 -     0 lmGrou ?        00:00:07 nfsd
> 1 S     0 26710     2  0  75  -5 -     0 -      ?        00:00:07 nfsd
> 1 S     0 26711     2  0  75  -5 -     0 -      ?        00:00:04 nfsd
> 1 S     0 26712     2  0  75  -5 -     0 -      ?        00:00:08 nfsd
> 1 D     0 26713     2  0  75  -5 -     0 lmGrou ?        00:00:10 nfsd
> 1 S     0 26714     2  0  75  -5 -     0 -      ?        00:00:09 nfsd
> 1 D     0 26715     2  0  75  -5 -     0 txLock ?        00:00:08 nfsd
> 1 D     0 26716     2  0  75  -5 -     0 -      ?        00:00:09 nfsd
> 1 D     0 26717     2  0  75  -5 -     0 txLock ?        00:00:09 nfsd
> 1 S     0 26718     2  0  75  -5 -     0 -      ?        00:00:07 nfsd
> 1 D     0 26719     2  0  75  -5 -     0 -      ?        00:00:08 nfsd
> 1 D     0 26720     2  0  75  -5 -     0 sync_p ?        00:00:09 nfsd
> 1 S     0 26721     2  0  75  -5 -     0 -      ?        00:00:09 nfsd
> 1 S     0 26722     2  0  75  -5 -     0 -      ?        00:00:09 nfsd
> 
> And here's the same command on a diskless node, which shows my processes in a D state on what seems to be nfs_wait (they recover quite quickly, a few seconds later) ...
> 
> # ps -eal
> F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
> ...
> 1 S  1001  6145     1  0  80   0 -  6560 924758 ?        00:00:01 orted
> 0 D  1001  6146  6145 71  80   0 - 941316 nfs_wa ?       19:27:10 lve_1
> 0 D  1001  6147  6145 58  80   0 - 894594 nfs_wa ?       15:57:56 lve_1
> 0 R  1001  6148  6145 57  80   0 - 901343 -     ?        15:33:07 lve_1
> 0 R  1001  6149  6145 78  80   0 - 896065 -     ?        21:31:32 lve_1
> ...
> 
> 'rpcinfo -p master_node' shows that I have portmapper, mountd, nlockmgr, and nfs running w/ all the correct normal info.
> 
> It would seem as if NFS were dropping out intermittently, but I've gone all through the NFS config and see nothing wrong, my DNS servers are working fine, it's all running on a local LAN (no firewall issues), and I see the same results on many different diskless nodes, so I don't believe it's a hardware issue. All my previous installations have run fine w/ this same NFS config.
> 
> Others have suggested this may be a 2.6.27-r7 kernel bug. I must note that I did not have this same problem running a 2.6.17 kernel w/ XFS. The hold-up seems to be in the kernel, and I'm looking for any advice on whether this might be the case.
> 
> Because these processes are going into a D state so often, a simulation that might normally run for 6 hours now takes 2 days to complete. I've tested the myrinet and ether interconnects and I see no issues from node to node or switch. I can reproduce the problem every time between any one node and the master.
> 
> So I'm looking for thoughts as to what might be going on and how to further investigate if this is in fact a kernel issue.
> 

I guess it would help if you can run

	echo w > /proc/sysrq-trigger

and manage to hit enter when this is happening.

Then run

	dmesg -c -s 1000000 > foo

then check `foo' to see that you caught some nice traces of the stuck tasks.

Then send us those traces.  Please try to avoid wordwrapping them in
the email.
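If hitting enter by hand at the right moment is awkward, a small capture loop can take the samples automatically. This is only a sketch (the /tmp paths and the 30-second window are arbitrary choices); it needs root, and sysrq must be enabled:

```shell
# Write a capture script: dump blocked-task backtraces once a second
# for 30 seconds, then save the accumulated kernel log.
cat > /tmp/capture-blocked.sh <<'EOF'
#!/bin/sh
i=0
while [ $i -lt 30 ]; do
    echo w > /proc/sysrq-trigger   # SysRq-W: show blocked (D state) tasks
    sleep 1
    i=$((i + 1))
done
dmesg -s 1000000 > /tmp/blocked-traces.txt
EOF
chmod +x /tmp/capture-blocked.sh
```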


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: processes in D state too long too often
@ 2009-02-07 20:49 Gary L. Grobe
  2009-02-09 21:27 ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Gary L. Grobe @ 2009-02-07 20:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-nfs

> Then send us those traces. Please try to avoid wordwrapping them in
> the email.

Ok, so I run my simulations, and then run the following script. 

--- This was run on the master node
echo w > /proc/sysrq-trigger
dmesg -c -s 1000000 > foo

I've captured these stuck tasks on both the master and the slave node, on separate runs.

Here is the master capture.
---
SysRq : Show Blocked State
  task                        PC stack   pid father
nfsd          D 00000001079318f8     0  5832      2
 ffff880837a23c70 0000000000000046 0000000000000000 0000000000000002
 ffff88083aef5950 ffff88083cda5790 ffff88083aef5b80 0000000437a23c80
 00000000ffffffff 00000001079318fb 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5840      2
 ffff88083725fc70 0000000000000046 0000000000000000 0000000000000002
 ffff880838cca8e0 ffffffff8070a340 ffff880838ccab10 000000003725fc80
 00000000ffffffff 00000001079318fb 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5847      2
 ffff8808369a3c70 0000000000000046 0000000000000000 0000000000000002
 ffff880836f88bb0 ffffffff8070a340 ffff880836f88de0 00000000369a3c80
 00000000ffffffff 00000001079318f9 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5867      2
 ffff88083561fc70 0000000000000046 0000000000000000 0000000000000002
 ffff880835422cb0 ffffffff8070a340 ffff880835422ee0 000000003561fc80
 00000000ffffffff 00000001079318f8 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5868      2
 ffff880835739c70 0000000000000046 0000000000000000 0000000000000002
 ffff880835422720 ffffffff8070a340 ffff880835422950 0000000035739c80
 00000000ffffffff 00000001079318f9 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5887      2
 ffff8808342bbc70 0000000000000046 0000000000000000 0000000000000002
 ffff8808340a2db0 ffffffff8070a340 ffff8808340a2fe0 00000000342bbc80
 00000000ffffffff 00000001079318f9 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5888      2
 ffff8808343b9c70 0000000000000046 0000000000000000 0000000000000002
 ffff8808340a2820 ffffffff8070a340 ffff8808340a2a50 00000000343b9c80
 00000000ffffffff 00000001079318fb 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5893      2
 ffff8808338e7c70 0000000000000046 0000000000000000 0000000000000002
 ffff880833dd0860 ffffffff8070a340 ffff880833dd0a90 00000000338e7c80
 00000000ffffffff 00000001079318f9 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

nfsd          D 00000001079318e9     0  5905      2
 ffff880832d41c70 0000000000000046 0000000000000000 0000000000000002
 ffff880832d3f9d0 ffffffff8070a340 ffff880832d3fc00 0000000032d41c80
 00000000ffffffff 00000001079318fb 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
 [<ffffffff80237e8b>] process_timeout+0x0/0x5
 [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
 [<ffffffff80238178>] msleep+0x14/0x1e
 [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
 [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
 [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
 [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
 [<ffffffff80307ac1>] decode_fh+0x1c/0x45
 [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
 [<ffffffff805456f5>] svc_process+0x408/0x6e7
 [<ffffffff8056f920>] __down_read+0x12/0x93
 [<ffffffff802fef26>] nfsd+0x1b7/0x285
 [<ffffffff802fed6f>] nfsd+0x0/0x285
 [<ffffffff80240d34>] kthread+0x47/0x73
 [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
 [<ffffffff8020c109>] child_rip+0xa/0x11
 [<ffffffff80240ced>] kthread+0x0/0x73
 [<ffffffff8020c0ff>] child_rip+0x0/0x11

And here's what was caught on the slave node during a separate run.
---
SysRq : Show Blocked State
  task                        PC stack   pid father
hello_c       D 00000001027c68c2     0  9840   9839
 ffff88043b101bf8 0000000000000086 0000000000000000 0000000000000246
 ffff88043e58f2c0 ffff88043e4d6bf0 ffff88043e58f4f0 0000000200000246
 00000000ffffffff ffffffff805317a9 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff805317a9>] xprt_prepare_transmit+0x81/0x8c
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff802f58e9>] nfs_wait_bit_killable+0x2a/0x30
 [<ffffffff80560e03>] __wait_on_bit+0x40/0x6e
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff80560e9d>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff80240e87>] wake_bit_function+0x0/0x23
 [<ffffffff802f94b0>] nfs_sync_mapping_wait+0xed/0x21a
 [<ffffffff802f9672>] nfs_wb_page+0x95/0xc7
 [<ffffffff802f9da0>] nfs_flush_incompatible+0x40/0x51
 [<ffffffff802eea77>] nfs_vm_page_mkwrite+0xbd/0xf2
 [<ffffffff80267f9e>] do_wp_page+0xe3/0x511
 [<ffffffff80269a34>] handle_mm_fault+0x665/0x6c1
 [<ffffffff8026d46b>] vma_merge+0x147/0x1f4
 [<ffffffff80563f91>] do_page_fault+0x43a/0x7f5
 [<ffffffff80562049>] error_exit+0x0/0x51

hello_c       D 00000001027c68e6     0  9841   9839
 ffff88043a0a3bf8 0000000000000086 0000000000000000 ffff88043c491800
 ffff88043c9d8660 ffff88043e4a0620 ffff88043c9d8890 000000013c491800
 00000000ffffffff ffffffff805307a5 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff805307a5>] xprt_end_transmit+0x2c/0x39
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff802f58e9>] nfs_wait_bit_killable+0x2a/0x30
 [<ffffffff80560e03>] __wait_on_bit+0x40/0x6e
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff80560e9d>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff80240e87>] wake_bit_function+0x0/0x23
 [<ffffffff802f94b0>] nfs_sync_mapping_wait+0xed/0x21a
 [<ffffffff802f9672>] nfs_wb_page+0x95/0xc7
 [<ffffffff802f9da0>] nfs_flush_incompatible+0x40/0x51
 [<ffffffff802eea77>] nfs_vm_page_mkwrite+0xbd/0xf2
 [<ffffffff80267f9e>] do_wp_page+0xe3/0x511
 [<ffffffff80269a34>] handle_mm_fault+0x665/0x6c1
 [<ffffffff80281f21>] put_unused_fd+0x31/0x3c
 [<ffffffff80563f91>] do_page_fault+0x43a/0x7f5
 [<ffffffff80562049>] error_exit+0x0/0x51

hello_c       D ffffffff802f58bf     0  9842   9839
 ffff88043b1bdbf8 0000000000000082 ffff88043a0ee688 ffff88043c491800
 ffff88043bbc8d30 ffff88043c9a99d0 ffff88043bbc8f60 000000033c491800
 ffff88043a0ee688 ffffffff805307a5 ffff88043a0ee688 0000000000000246
Call Trace:
 [<ffffffff805307a5>] xprt_end_transmit+0x2c/0x39
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff802f58e9>] nfs_wait_bit_killable+0x2a/0x30
 [<ffffffff80560e03>] __wait_on_bit+0x40/0x6e
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff80560e9d>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff80240e87>] wake_bit_function+0x0/0x23
 [<ffffffff802f94b0>] nfs_sync_mapping_wait+0xed/0x21a
 [<ffffffff802f9672>] nfs_wb_page+0x95/0xc7
 [<ffffffff802f9da0>] nfs_flush_incompatible+0x40/0x51
 [<ffffffff802eea77>] nfs_vm_page_mkwrite+0xbd/0xf2
 [<ffffffff80267f9e>] do_wp_page+0xe3/0x511
 [<ffffffff80269a34>] handle_mm_fault+0x665/0x6c1
 [<ffffffff80563f91>] do_page_fault+0x43a/0x7f5
 [<ffffffff80562049>] error_exit+0x0/0x51

hello_c       D 00000001027c68db     0  9843   9839
 ffff88043ac59bf8 0000000000000082 0000000000000000 0000000000000246
 ffff88043bbc8210 ffffffff806f9340 ffff88043bbc8440 0000000000000246
 00000000ffffffff ffffffff805317a9 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff805317a9>] xprt_prepare_transmit+0x81/0x8c
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff802f58e9>] nfs_wait_bit_killable+0x2a/0x30
 [<ffffffff80560e03>] __wait_on_bit+0x40/0x6e
 [<ffffffff802f58bf>] nfs_wait_bit_killable+0x0/0x30
 [<ffffffff80560e9d>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff80240e87>] wake_bit_function+0x0/0x23
 [<ffffffff802f94b0>] nfs_sync_mapping_wait+0xed/0x21a
 [<ffffffff802f9672>] nfs_wb_page+0x95/0xc7
 [<ffffffff802f9da0>] nfs_flush_incompatible+0x40/0x51
 [<ffffffff802eea77>] nfs_vm_page_mkwrite+0xbd/0xf2
 [<ffffffff80267f9e>] do_wp_page+0xe3/0x511
 [<ffffffff80269a34>] handle_mm_fault+0x665/0x6c1
 [<ffffffff80563f91>] do_page_fault+0x43a/0x7f5
 [<ffffffff80562049>] error_exit+0x0/0x51





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: processes in D state too long too often
  2009-02-07 20:49 processes in D state " Gary L. Grobe
@ 2009-02-09 21:27 ` J. Bruce Fields
  0 siblings, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2009-02-09 21:27 UTC (permalink / raw)
  To: Gary L. Grobe; +Cc: linux-kernel, linux-nfs, Andrew Morton

(By the way, please try not to start new threads or drop people from the
cc.)

On Sat, Feb 07, 2009 at 08:49:27PM +0000, Gary L. Grobe wrote:
> > Then send us those traces. Please try to avoid wordwrapping them in
> > the email.
> 
> Ok, so I run my simulations, and then run the following script. 
> 
> --- This was run on the master node
> echo w > /proc/sysrq-trigger
> dmesg -c -s 1000000 > foo
> 
> I've captured these stuck tasks on both the master and the slave node, on separate runs.

So the nfsd stacks are all identical:

> nfsd          D 00000001079318f8     0  5832      2
>  ffff880837a23c70 0000000000000046 0000000000000000 0000000000000002
>  ffff88083aef5950 ffff88083cda5790 ffff88083aef5b80 0000000437a23c80
>  00000000ffffffff 00000001079318fb 0000000000000000 0000000000000000
> Call Trace:
>  [<ffffffff8056e8e3>] schedule_timeout+0x8a/0xad
>  [<ffffffff80237e8b>] process_timeout+0x0/0x5
>  [<ffffffff8056e8de>] schedule_timeout+0x85/0xad
>  [<ffffffff80238178>] msleep+0x14/0x1e
>  [<ffffffff8030235d>] nfsd_vfs_write+0x221/0x2dd
>  [<ffffffff8028222b>] __dentry_open+0x14c/0x23b
>  [<ffffffff80302a93>] nfsd_write+0xc5/0xe2
>  [<ffffffff803003bd>] nfsd_proc_write+0xc5/0xde
>  [<ffffffff80307ac1>] decode_fh+0x1c/0x45
>  [<ffffffff802fe8f3>] nfsd_dispatch+0xde/0x1c2
>  [<ffffffff805456f5>] svc_process+0x408/0x6e7
>  [<ffffffff8056f920>] __down_read+0x12/0x93
>  [<ffffffff802fef26>] nfsd+0x1b7/0x285
>  [<ffffffff802fed6f>] nfsd+0x0/0x285
>  [<ffffffff80240d34>] kthread+0x47/0x73
>  [<ffffffff8022bd1d>] schedule_tail+0x27/0x5f
>  [<ffffffff8020c109>] child_rip+0xa/0x11
>  [<ffffffff80240ced>] kthread+0x0/0x73
>  [<ffffffff8020c0ff>] child_rip+0x0/0x11

Presumably that's the msleep(10) in nfsd_vfs_write().

That wouldn't explain the same nfsd thread waiting for several seconds,
though.  Or was it just that there were several seconds during which different
nfsd threads were stuck in D, not necessarily the same ones?

What does your /etc/exports file look like on the server, and what are
the mount options on the client?

You could try turning off that msleep() with no_wdelay, but it may not
help.

The more likely explanation is that you just switched to a more recent
distro where "sync" (as opposed to "async") is the default.  Depending on
workload, "async" may improve performance a great deal, at the expense
of possible data corruption on server reboot!
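For reference, the two variants might look like this in /etc/exports (the path and client range here are made up; re-export with exportfs -r after editing):

```
# keep synchronous commits but skip the write-gathering delay:
/export/data  10.0.0.0/16(rw,sync,no_wdelay,no_subtree_check)

# or trade crash-safety on server reboot for throughput:
/export/data  10.0.0.0/16(rw,async,no_subtree_check)
```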

If you're doing a lot of writing and using NFSv2, then switching to
NFSv3 may give you performance close to the "async" performance without
the corruption worries.

--b.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re:  processes in D state too long too often
@ 2009-02-10  1:04 Gary L. Grobe
  2009-02-10 16:02 ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Gary L. Grobe @ 2009-02-10  1:04 UTC (permalink / raw)
  To: J. Bruce Fields, Gary L. Grobe; +Cc: linux-kernel, linux-nfs, Andrew Morton

>Presumably that's the msleep(10) in nfsd_vfs_write().
>
>That wouldn't explain the same nfsd thread waiting for several seconds,
>though.  Or was it just that there were several seconds during which different
>nfsd threads were stuck in D, not necessarily the same ones?
>
>What does your /etc/exports file look like on the server, and what are
>the mount options on the client?
>
>You could try turning off that msleep() with no_wdelay, but it may not
>help.
>
>The more likely explanation is that you just switched to a more recent
>distro where "sync" (as opposed to "async") is the default.  Depending on
>workload, "async" may improve performance a great deal, at the expense
>of possible data corruption on server reboot!
>
>If you're doing a lot of writing and using NFSv2, then switching to
>NFSv3 may give you performance close to the "async" performance without
>the corruption worries.

Apologies for the unintentional separate thread.

I really think I'm seeing the same nfsd threads going into D for a very short time. Here's what's in /etc/exports ...

/diskless/10.0.1.1 10.0.1.1(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
/diskless/10.0.1.2 10.0.1.2(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
/diskless/10.0.1.3 10.0.1.3(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
# ... same lines above for another 80+ nodes

# Common to all slave nodes.
/usr    10.0.0.0/16(sync,ro,subtree_check,no_root_squash,no_all_squash)
/opt    10.0.0.0/16(sync,rw,no_subtree_check,no_root_squash,no_all_squash)
/home   10.0.0.0/16(sync,rw,no_subtree_check,no_root_squash,no_all_squash)
#/var/log 10.0.0.0/16(sync,rw,subtree_check,no_root_squash,no_all_squash)

Mount options on each client are as follows ...

10.0.0.10:/diskless/10.0.1.1    /   nfs     sync,hard,intr,rw,rsize=8192,wsize=8192 0   0
10.0.0.10:/opt          /opt    nfs sync,hard,intr,rw,rsize=8192,wsize=8192 0   0
10.0.0.10:/usr          /usr    nfs sync,hard,intr,ro,rsize=8192,wsize=8192 0   0
10.0.0.10:/home         /home   nfs sync,hard,intr,rw,rsize=8192,wsize=8192 0   0
none                    /proc   proc    defaults    0 0
#10.0.0.10:/var/log     /var/log nfs    sync,hard,intr,rw   0 0
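For what it's worth, the options the kernel has actually negotiated (rsize/wsize can be clamped by the server) and the client-side retransmission counters can be checked like this (a sketch; nfsstat may not be installed on the diskless nodes):

```shell
# Show the NFS mount options the client kernel is actually using:
grep ' nfs ' /proc/mounts || true

# Client-side RPC counters; a climbing 'retrans' would point at the
# network rather than the server:
nfsstat -rc 2>/dev/null || true
```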

I'm not following the suggestion to turn off the msleep(). What are you referring to?

I've got NFSv3 enabled and have used this in a previous installation (same distro, Gentoo) on this same hardware with no issues, with 'sync', and the performance was much better. Something worth noting: I've rolled back my kernel several times now (w/ the same version on master and slave nodes each time), and with each step back, the time my simulation processes spend in a D state keeps shrinking. I went from 2.6.27-r7 to 2.6.24-r8, and now I'm running 2.6.20-r10, each one better than the later kernel before it. I was running 2.6.18-r2 in the past, which I'm having difficulty getting hold of at the moment.






^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re:   processes in D state too long too often
@ 2009-02-10  2:18 Gary L. Grobe
  2009-02-10 16:02 ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Gary L. Grobe @ 2009-02-10  2:18 UTC (permalink / raw)
  To: Gary L. Grobe, J. Bruce Fields; +Cc: linux-kernel, linux-nfs, Andrew Morton


>>The more likely explanation is that you just switched to a more recent
>>distro where "sync" (as opposed to "async") is the default.  Depending on
>>workload, "async" may improve performance a great deal, at the expense
>>of possible data corruption on server reboot!
>>
>>If you're doing a lot of writing and using NFSv2, then switching to
>>NFSv3 may give you performance close to the "async" performance without
>>the corruption worries.

Just a small update to correct our rollback note. It turns out our problem has been solved by going with 2.6.20-r10 of the patched gentoo-sources kernel. Although Gentoo marks this as unstable for amd64, it's working fine. I've made no changes other than going back a few versions on the kernel and adjusting the .config w/ the same settings.

Tomorrow I'll likely give the next patched gentoo-sources kernel marked stable, which was 2.6.24-r10, another try, recheck my configs, and gather anything else I can from it.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: processes in D state too long too often
  2009-02-10  1:04 Gary L. Grobe
@ 2009-02-10 16:02 ` J. Bruce Fields
  0 siblings, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2009-02-10 16:02 UTC (permalink / raw)
  To: Gary L. Grobe; +Cc: linux-kernel, linux-nfs, Andrew Morton

On Tue, Feb 10, 2009 at 01:04:10AM +0000, Gary L. Grobe wrote:
> >Presumably that's the msleep(10) in nfsd_vfs_write().
> >
> >That wouldn't explain the same nfsd thread waiting for several seconds,
> >though.  Or was it just that that several seconds during which different
> >nfsd threads were stuck in D, not necessarily the same ones?
> >
> >What does your /etc/exports file look like on the server, and what are
> >the mount options on the client?
> >
> >You could try turning off that msleep() with no_wdelay, but it may not
> >help.
> >
> >The more likely explanation is that you just switched to a more recent
> >distro where "sync" (as opposed to "async") is the option.  Depending on
> >workload, "async" may improve performance a great deal, at the expense
> >of possible data corruption on server reboot!
> >
> >If you're doing a lot of writing and using NFSv2, then switching to
> >NFSv3 may give you performance close to the "async" performance without
> >the corruption worries.
> 
> Apologies for the unintentional separate thread.
> 
> I really think I'm seeing the same nfsd threads going into D for a very short time. Here's what's in /etc/exports ...
> 
> /diskless/10.0.1.1 10.0.1.1(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
> /diskless/10.0.1.2 10.0.1.2(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
> /diskless/10.0.1.3 10.0.1.3(sync,rw,no_root_squash,no_all_squash,no_subtree_check)
> # ... same lines above for another 80+ nodes
> 
> # Common to all slave nodes.
> /usr    10.0.0.0/16(sync,ro,subtree_check,no_root_squash,no_all_squash)
> /opt    10.0.0.0/16(sync,rw,no_subtree_check,no_root_squash,no_all_squash)
> /home   10.0.0.0/16(sync,rw,no_subtree_check,no_root_squash,no_all_squash)
> #/var/log 10.0.0.0/16(sync,rw,subtree_check,no_root_squash,no_all_squash)
> 
> Mount options on each client are as follows ...
> 
> 10.0.0.10:/diskless/10.0.1.1    /   nfs     sync,hard,intr,rw,rsize=8192,wsize=8192 0   0
> 10.0.0.10:/opt          /opt    nfs sync,hard,intr,rw,rsize=8192,wsize=8192 0   0
> 10.0.0.10:/usr          /usr    nfs sync,hard,intr,ro,rsize=8192,wsize=8192 0   0
> 10.0.0.10:/home         /home   nfs sync,hard,intr,rw,rsize=8192,wsize=8192 0   0
> none                    /proc   proc    defaults    0 0
> #10.0.0.10:/var/log     /var/log nfs    sync,hard,intr,rw   0 0
> 
> I'm not following turning off the msleep() option. Where are you referring to this from?

If you add no_wdelay to the export options, then we won't see the
msleep() calls in the sysrq-w output anymore.  I doubt that'll solve the
problem, but it may be worth a try just to see what changes.
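Concretely (a sketch against the exports quoted above, not something tested on this setup), that would mean adding no_wdelay to the option list of each export, e.g.:

```
/diskless/10.0.1.1 10.0.1.1(sync,rw,no_wdelay,no_root_squash,no_all_squash,no_subtree_check)
```

and then re-applying the export table on the server with `exportfs -ra` (as root), without restarting nfsd.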

> I've got NFSv3 enabled and have used this in a previous installation
> (using the same distro, gentoo) on this same hardware with no issues,
> and 'sync', and the performance was much better.

OK, then I'm out of theories for now....

--b.

> Something worth
> noting, I've rolled back my kernel several times now and each time I
> go back (w/ same vers on master and slave node), the D state time in
> simulation processes keeps getting better (cut down). I went from
> 2.6.27-r7 to 2.6.24-r8 and now I'm running 2.6.20-r10, and each one
> better than the previous (and later) kernel. I was running 2.6.18-r2
> in the past, which I'm having difficulties getting at the moment.
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: processes in D state too long too often
  2009-02-10  2:18 Gary L. Grobe
@ 2009-02-10 16:02 ` J. Bruce Fields
  0 siblings, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2009-02-10 16:02 UTC (permalink / raw)
  To: Gary L. Grobe; +Cc: linux-kernel, linux-nfs, Andrew Morton

On Tue, Feb 10, 2009 at 02:18:33AM +0000, Gary L. Grobe wrote:
> 
> >>The more likely explanation is that you just switched to a more recent
> >>distro where "sync" (as opposed to "async") is the option.  Depending on
> >>workload, "async" may improve performance a great deal, at the expense
> >>of possible data corruption on server reboot!
> >>
> >>If you're doing a lot of writing and using NFSv2, then switching to
> >>NFSv3 may give you performance close to the "async" performance without
> >>the corruption worries.
> 
> Just a small update about our rollback I need to correct. Turns out our problem has been solved by going with the 2.6.20-r10 of the gentoo-sources patched kernel. Although gentoo marks this as unstable for amd64, it's working fine. I've made no other changes than going back a few versions on the kernel and adjusting the .config w/ the same settings. 
> 
> Tomorrow I'll likely give the next marked stable patched gentoo-sources kernel another try which was 2.6.24-r10 and recheck my configs and try to gather anything else I can gather from it.

Sounds good, thanks.

--b.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-02-10 16:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-07  6:30 processes in D State too long too often Gary L. Grobe
2009-02-07  6:45 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2009-02-07 20:49 processes in D state " Gary L. Grobe
2009-02-09 21:27 ` J. Bruce Fields
2009-02-10  1:04 Gary L. Grobe
2009-02-10 16:02 ` J. Bruce Fields
2009-02-10  2:18 Gary L. Grobe
2009-02-10 16:02 ` J. Bruce Fields
