linux-nfs.vger.kernel.org archive mirror
* processes hanging in state D when reading from nfs
@ 2011-08-27 19:22 Rüdiger Meier
  2011-09-20 13:52 ` J. Bruce Fields
From: Rüdiger Meier @ 2011-08-27 19:22 UTC
  To: linux-nfs

Hi,


I've got an annoying problem with my NFSv4 clients.
Lately I see many processes hanging in state D when reading from an NFS
mount. Sometimes they can be killed, sometimes not.

This occurs mostly with shell scripts started by cron.

For example, on one machine there is a file on which all reads suddenly
hang, although ls -ls still works:

rwxr-xr-x 1 tk users 128 2010-09-08 15:54 /home/tk/usr/local/scripts/plain_ALLMAJOR.sh

As you can see, it's an old script that has not been modified in a long
time. It has been running a few times per day for months.

This is the current process list:

tk        8829  0.0  0.0  11372   800 ?        Ds   Aug25   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
tk        8830  0.0  0.0  11372   824 ?        Ds   Aug25   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
tk       18864  0.0  0.0  11372   844 ?        Ds   Aug26   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
tk       18865  0.0  0.0  11372   860 ?        Ds   Aug26   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
rudi     23745  0.0  0.0  10300   748 pts/20   D    20:39   0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
rudi     24361  0.0  0.0  10300   748 pts/20   D    20:40   0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
root     30417  0.0  0.0  10056   472 ?        D    Aug24   0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
rudi     30569  0.0  0.0  10064  1128 pts/1    D+   20:41   0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh

The /bin/sh processes hang forever in state "Ds" but can be killed.
The less and file commands can't be killed.
On other clients I can read that file without problems.

The logs on the server and the clients don't tell me anything.
What can I do to find out what the problem is?
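
One low-cost first step is to record which tasks are in uninterruptible sleep and what they are waiting on. The following is only a minimal sketch, assuming a Linux client with /proc mounted; the helper names (parse_stat, d_state_tasks) are made up for illustration, and /proc/<pid>/wchan may show only "0" on kernels that hide it from unprivileged users.

```python
import os

def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]

def read_text(path):
    """Read a small /proc file; return '' if the task vanished meanwhile."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

def d_state_tasks(proc_root="/proc"):
    """Yield (pid, comm, wchan) for every task in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no procfs here (assumption: Linux only)
    for entry in sorted(entries):
        if not entry.isdigit():
            continue
        stat = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat:
            continue
        pid, comm, state = parse_stat(stat)
        if state == "D":
            yield pid, comm, read_text(os.path.join(proc_root, entry, "wchan"))

if __name__ == "__main__":
    for pid, comm, wchan in d_state_tasks():
        print(pid, comm, wchan or "?")
```

Running this periodically and comparing the wchan column across hung tasks can show whether they all block in the same kernel function.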


BTW, each hanging process increases the load by 1, but the affected machines
are still quite usable even with a load of 800 on a single-core CPU!
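
The high load with a responsive machine fits how Linux computes the load average: it counts tasks in uninterruptible sleep (state D) as well as runnable ones, so blocked readers raise the load without using any CPU. A small sketch, assuming the usual five-field layout of /proc/loadavg; the function name parse_loadavg is hypothetical.

```python
import os

def parse_loadavg(text):
    """Parse /proc/loadavg: three averages, runnable/total tasks, last PID."""
    fields = text.split()
    one, five, fifteen = (float(x) for x in fields[:3])
    runnable, total = (int(x) for x in fields[3].split("/"))
    return {
        "1min": one,
        "5min": five,
        "15min": fifteen,
        "runnable": runnable,  # tasks currently runnable
        "total": total,        # all scheduling entities on the system
    }

if __name__ == "__main__":
    # Guarded so the sketch does nothing on systems without procfs.
    if os.path.exists("/proc/loadavg"):
        with open("/proc/loadavg") as f:
            info = parse_loadavg(f.read())
        # A large 1-minute load next to a tiny runnable count suggests
        # that most of the load comes from blocked (state D) tasks.
        print(info["1min"], info["runnable"])
```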


Here are my specs:
2.6.37.6-0.7-desktop
openSUSE 11.4 (x86_64)


cu,
Rudi


* Re: processes hanging in state D when reading from nfs
  2011-08-27 19:22 processes hanging in state D when reading from nfs Rüdiger Meier
@ 2011-09-20 13:52 ` J. Bruce Fields
  2011-09-21 23:40   ` Rüdiger Meier
From: J. Bruce Fields @ 2011-09-20 13:52 UTC
  To: Rüdiger Meier; +Cc: linux-nfs

On Sat, Aug 27, 2011 at 09:22:53PM +0200, Rüdiger Meier wrote:
> I've got an annoying problem with my nfs4 clients.
> Lately I see many processes hanging in state D when reading from nfs
> mount. Sometimes they can be killed sometimes not.

Is this still happening?

> This occurs mostly with shell scripts started by cron.
> 
> For example on one machine there is a file where suddenly all reads on
> it are hanging, ls -ls still works:
> 
> rwxr-xr-x 1 tk users 128 2010-09-08 15:54 /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> 
> As you see it's an old script, not modified since long time. It was
> running a few times per day since months.
> 
> Now this is the processlist:
> 
> tk        8829  0.0  0.0  11372   800 ?        Ds   Aug25   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> tk        8830  0.0  0.0  11372   824 ?        Ds   Aug25   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> tk       18864  0.0  0.0  11372   844 ?        Ds   Aug26   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> tk       18865  0.0  0.0  11372   860 ?        Ds   Aug26   0:00 /bin/sh -c ~/usr/local/scripts/plain_ALLMAJOR.sh
> rudi     23745  0.0  0.0  10300   748 pts/20   D    20:39   0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> rudi     24361  0.0  0.0  10300   748 pts/20   D    20:40   0:00 file /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> root     30417  0.0  0.0  10056   472 ?        D    Aug24   0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> rudi     30569  0.0  0.0  10064  1128 pts/1    D+   20:41   0:00 less /home/tk/usr/local/scripts/plain_ALLMAJOR.sh
> 
> The /bin/sh processes are hanging forever in state "Ds" but can be killed.
> The less and file commands can't be killed.
> On other clients I can read that file without problems.
> 
> The logs on server and clients don't tell me anything.
> What can I do to find out what's the problem?

Running wireshark and watching the network traffic may sometimes give an
idea whether the client or server is to blame.
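
A capture for later inspection in wireshark could be started along these lines. This is a sketch under assumptions: the interface name, server hostname and output file are placeholders, tcpdump must be installed and run as root, and filtering on port 2049 covers NFSv4 only (NFSv3 also talks to portmapper, mountd and lockd on other ports).

```python
import shlex

def nfs_capture_cmd(interface, server, outfile):
    """Build a tcpdump command that saves full NFS packets to a pcap file."""
    return [
        "tcpdump",
        "-i", interface,   # interface facing the NFS server
        "-s", "0",         # capture whole packets, not just headers
        "-w", outfile,     # write raw packets for wireshark
        "host", server, "and", "port", "2049",
    ]

if __name__ == "__main__":
    # Placeholder values; substitute the real interface and server.
    cmd = nfs_capture_cmd("eth0", "nfsserver.example", "nfs-hang.pcap")
    print(shlex.join(cmd))
```

The script only prints the command; running it by hand while reproducing a hanging read keeps the capture small.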

> BTW each hanging process increases the load by 1 but the affected machines
> are still quite usable even with a load of 800 on a single core CPU!
> 
> 
> here my specs:
> 2.6.37.6-0.7-desktop
> openSUSE 11.4 (x86_64)

On both client and server?

--b.


* Re: processes hanging in state D when reading from nfs
  2011-09-20 13:52 ` J. Bruce Fields
@ 2011-09-21 23:40   ` Rüdiger Meier
  2011-09-22 15:39     ` Michael Gutteridge
From: Rüdiger Meier @ 2011-09-21 23:40 UTC
  To: linux-nfs

On Tuesday 20 September 2011, J. Bruce Fields wrote:
> On Sat, Aug 27, 2011 at 09:22:53PM +0200, Rüdiger Meier wrote:
> > I've got an annoying problem with my nfs4 clients.
> > Lately I see many processes hanging in state D when reading from
> > nfs mount. Sometimes they can be killed sometimes not.
>
> Is this still happening?

Yes, although we've managed to avoid the "dangerous" things.
Sometimes we also have problems like those in the other current thread
"Writing / Locking problem with NFSv4".

Moreover, I've got issues with wrong file permissions on newly created
files, mostly when using make: the first make fails and you see some
object files with strange permissions. If you look at these files on
other clients, they are fine. Then the second make finishes
successfully.
I still suspect the damn readdir cache changes in 2.6.37.

> Running wireshark and watching the network traffic may sometimes give
> an idea whether the client or server is to blame.

I should do that, but somehow I'm a bit tired of debugging my NFS issues
after doing it the whole last year, only to get one thing fixed and run
into another issue. Maybe I'll first try a current kernel, hoping that
things get better without my doing anything.

cu,
Rudi


* Re: processes hanging in state D when reading from nfs
  2011-09-21 23:40   ` Rüdiger Meier
@ 2011-09-22 15:39     ` Michael Gutteridge
From: Michael Gutteridge @ 2011-09-22 15:39 UTC
  To: linux-nfs

Rüdiger Meier <sweet_f_a@...> writes:

> 
> On Tuesday 20 September 2011, J. Bruce Fields wrote:
> > On Sat, Aug 27, 2011 at 09:22:53PM +0200, Rüdiger Meier wrote:
> > > I've got an annoying problem with my nfs4 clients.
> > > Lately I see many processes hanging in state D when reading from
> > > nfs mount. Sometimes they can be killed sometimes not.
> >
> > Is this still happening?
> 
> Yes, although we've managed to avoid the "dangerous" things.
> Sometimes we have also problems like the other current thread
> "Writing / Locking problem with NFSv4".
> 

For what it's worth:  we have been seeing very similar behavior on our OpenSuSE
11.3 (x86_64, 2.6.34.10-0.2) systems, though one other difference is that we are
using NFSv3 for these mounts.

I was able to get some traces via sysrq, though no ethernet dumps (these
problems happen only occasionally, and it is impossible to predict when or
where). These are heavily loaded systems, doing lots of compute and IO.

[3754730.533669] R             D ffffffff810dc3e0     0 22621      1 0x00000004
[3754730.533671]  ffff88165f993cb8 0000000000000086 ffff881037174600 ffffffffa0332bbd
[3754730.533673]  0000000000013e80 0000000000013e80 ffff88165f993fd8 0000000000013e80
[3754730.533675]  ffff88165f993fd8 ffff881e5cd521c0 0000000000013e80 0000000000013e80
[3754730.533676] Call Trace:
[3754730.533678]  [<ffffffff8145004e>] io_schedule+0x6e/0xb0
[3754730.533681]  [<ffffffff810dc418>] sync_page+0x38/0x50
[3754730.533683]  [<ffffffff814505da>] __wait_on_bit_lock+0x4a/0xb0
[3754730.533685]  [<ffffffff810dc3be>] __lock_page+0x5e/0x70
[3754730.533687]  [<ffffffff810dd2f8>] filemap_fault+0x2f8/0x410
[3754730.533690]  [<ffffffff810f7c12>] __do_fault+0x52/0x4f0
[3754730.533692]  [<ffffffff810fbf82>] handle_mm_fault+0x1b2/0xbd0
[3754730.533694]  [<ffffffff81455799>] do_page_fault+0x169/0x3a0
[3754730.533697]  [<ffffffff8145271f>] page_fault+0x1f/0x30
[3754730.533699]  [<00007f79e2486ce0>] 0x7f79e2486ce0

This is pretty representative of the processes in D.  Does this help, or are
there too many differences from the original?
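
When many tasks are stuck, reducing each sysrq dump to its list of kernel functions makes it easier to check whether they really share one call path. A rough sketch, assuming the "[<address>] symbol+0xoff/0xlen" frame format shown in the trace; the name trace_functions is invented for this example.

```python
import re

# One kernel frame: "[<addr>] symbol+0xoff/0xlen", optionally marked
# with "? " when the unwinder is unsure about it.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def trace_functions(text):
    """Return the kernel function names in a trace, innermost first.

    Frames without a symbol (for example the final user-space address)
    are skipped because they carry no '+0xoff/0xlen' suffix.
    """
    return FRAME.findall(text)

if __name__ == "__main__":
    sample = """\
[3754730.533678]  [<ffffffff8145004e>] io_schedule+0x6e/0xb0
[3754730.533681]  [<ffffffff810dc418>] sync_page+0x38/0x50
[3754730.533699]  [<00007f79e2486ce0>] 0x7f79e2486ce0
"""
    print(" <- ".join(trace_functions(sample)))
```

Comparing these lists across all D-state tasks shows quickly whether they all wait in the same place.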

Thanks

Michael



