public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Tracing down 250ms open/chdir calls
@ 2009-02-16  7:55 Carsten Aulbert
  2009-02-16 12:56 ` Trond Myklebust
  2009-02-16 13:24 ` Joe Landman
  0 siblings, 2 replies; 5+ messages in thread
From: Carsten Aulbert @ 2009-02-16  7:55 UTC (permalink / raw)
  To: linux-nfs, linux-kernel, beowulf

Hi all,

sorry in advance for this vague subject and also the vague email, I'm
trying my best to summarize the problem:

On our large cluster we sometimes encounter the problem that our main
scheduling processes are often in state D and in the end not capable
anymore of pushing work to the cluster.

The head nodes are 8 core boxes with Xeon CPUs and equipped with 16 GB
of memory, when certain types of jobs are running we see system loads of
about 20-30 which might go up to 80-100 from time to time. Looking at
the individual cores they are mostly busy with system tasks (e.g. htop
shows 'red' bars).

strace -tt -c showed that several system calls of the scheduler take a
long time to complete, most notably open and chdir, which took between
180 and 230 ms to complete (during our testing). Since most of these open
and chdir calls go via NFSv3, I'm including that list as well. The NFS
servers are Sun Fire X4500 boxes running Solaris 10u5 right now.

A standard output line looks like:
93.37   38.997264     230753      169    78      open

i.e. 93.37% of the system-call time was spent in 169 open calls (78 of
which returned errors), at 230753 us/call on average, thus 39 wall-clock
seconds per minute were spent just doing open.

We tried several things to understand the problem, but apart from moving
more files (mostly log files of currently running jobs) off NFS we have
not made much headway so far. On
https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/H2Problems
we have summarized some things.

With the help of 'stress' and a tiny program just doing open/putc/close
on a single file, I've tried to get a feeling for how good or bad things
are when compared to other head nodes with different tasks/loads:

https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/OpenCloseIotest

(this test may or may not help in the long run; I'm just poking in the
dark).

Now my questions:

* Do you have any suggestions on how to continue debugging this problem?
* Does anyone know how to improve the situation? Next on my agenda would
be to try different I/O schedulers; any hints as to which ones should be
good for such boxes?
* I have probably missed some vital information, so please let me know if
you need more details about the system.

Please Cc me from linux-kernel, I'm only on the other two addressed lists.

Cheers and a lot of TIA

Carsten

-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tracing down 250ms open/chdir calls
  2009-02-16  7:55 Tracing down 250ms open/chdir calls Carsten Aulbert
@ 2009-02-16 12:56 ` Trond Myklebust
  2009-02-16 13:00   ` Carsten Aulbert
  2009-02-16 13:24 ` Joe Landman
  1 sibling, 1 reply; 5+ messages in thread
From: Trond Myklebust @ 2009-02-16 12:56 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-nfs, linux-kernel, beowulf

On Mon, 2009-02-16 at 08:55 +0100, Carsten Aulbert wrote:
> Hi all,
> 
> sorry in advance for this vague subject and also the vague email, I'm
> trying my best to summarize the problem:
> 
> On our large cluster we sometimes encounter the problem that our main
> scheduling processes are often in state D and in the end not capable
> anymore of pushing work to the cluster.
> 
> The head nodes are 8 core boxes with Xeon CPUs and equipped with 16 GB
> of memory, when certain types of jobs are running we see system loads of
> about 20-30 which might go up to 80-100 from time to time. Looking at
> the individual cores they are mostly busy with system tasks (e.g. htop
> shows 'red' bars).
> 
> strace -tt -c showed that several system calls of the scheduler take a
> long time to complete, most notably open and chdir, which took between
> 180 and 230 ms to complete (during our testing). Since most of these open
> and chdir calls go via NFSv3, I'm including that list as well. The NFS
> servers are Sun Fire X4500 boxes running Solaris 10u5 right now.
> 
> A standard output line looks like:
> 93.37   38.997264     230753      169    78      open
> 
> i.e. 93.37% of the system-call time was spent in 169 open calls (78 of
> which returned errors), at 230753 us/call on average, thus 39 wall-clock
> seconds per minute were spent just doing open.
> 
> We tried several things to understand the problem, but apart from moving
> more files (mostly log files of currently running jobs) off NFS we have
> not made much headway so far. On
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/H2Problems
> we have summarized some things.
> 
> With the help of 'stress' and a tiny program just doing open/putc/close
> on a single file, I've tried to get a feeling for how good or bad things
> are when compared to other head nodes with different tasks/loads:
> 
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/OpenCloseIotest
> 
> (this test may or may not help in the long run; I'm just poking in the
> dark).
> 
> Now my questions:
> 
> * Do you have any suggestions on how to continue debugging this problem?
> * Does anyone know how to improve the situation? Next on my agenda would
> be to try different I/O schedulers; any hints as to which ones should be
> good for such boxes?
> * I have probably missed some vital information, so please let me know if
> you need more details about the system.
> 
> Please Cc me from linux-kernel, I'm only on the other two addressed lists.
> 
> Cheers and a lot of TIA
> 
> Carsten

2.6.27.7 has a known NFS client performance bug due to a change in the
authentication code. The fix was merged in 2.6.27.9: see the commit
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git&a=commitdiff&h=a0f04d0096bd7edb543576c55f7a0993628f924a

Cheers
  Trond


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tracing down 250ms open/chdir calls
  2009-02-16 12:56 ` Trond Myklebust
@ 2009-02-16 13:00   ` Carsten Aulbert
  0 siblings, 0 replies; 5+ messages in thread
From: Carsten Aulbert @ 2009-02-16 13:00 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel, beowulf

Hi Trond,

Trond Myklebust schrieb:
> 2.6.27.7 has a known NFS client performance bug due to a change in the
> authentication code. The fix was merged in 2.6.27.9: see the commit
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git&a=commitdiff&h=a0f04d0096bd7edb543576c55f7a0993628f924a

Maybe that's also a reason why we don't see it that often anymore -
although the user load has shifted a bit also. We are running 2.6.27.14
(and .15 since Friday).

Thanks for letting me know, I'll probably re-run the tests again with a
newer kernel.

Cheers

Carsten

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tracing down 250ms open/chdir calls
  2009-02-16  7:55 Tracing down 250ms open/chdir calls Carsten Aulbert
  2009-02-16 12:56 ` Trond Myklebust
@ 2009-02-16 13:24 ` Joe Landman
  2009-02-16 13:44   ` Carsten Aulbert
  1 sibling, 1 reply; 5+ messages in thread
From: Joe Landman @ 2009-02-16 13:24 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: beowulf, linux-nfs, linux-kernel

Carsten Aulbert wrote:
> Hi all,
> 
> sorry in advance for this vague subject and also the vague email, I'm
> trying my best to summarize the problem:
> 
> On our large cluster we sometimes encounter the problem that our main
> scheduling processes are often in state D and in the end not capable
> anymore of pushing work to the cluster.

Hi Carsten

   Are you using a "standard" cluster scheduler (SGE, PBS, ...) or a 
locally written one?

> The head nodes are 8 core boxes with Xeon CPUs and equipped with 16 GB
> of memory, when certain types of jobs are running we see system loads of
> about 20-30 which might go up to 80-100 from time to time. Looking at
> the individual cores they are mostly busy with system tasks (e.g. htop
> shows 'red' bars).

Hmmm...  These are your head nodes?  Not your NFS server nodes?  Sounds 
like there are a large number of blocked IO processes ... try a

	vmstat 1

and look at the "b" column (usually second from the left, or nearly 
there).  Lots of blocked IO processes can have the effect of introducing 
significant latency into all system calls.

> strace -tt -c showed that several system calls of the scheduler take a
> long time to complete, most notably open and chdir, which took between
> 180 and 230 ms to complete (during our testing). Since most of these open
> and chdir calls go via NFSv3, I'm including that list as well. The NFS
> servers are Sun Fire X4500 boxes running Solaris 10u5 right now.

Hmmm.... What happens if you make these local to each box?  What are the 
mount options for the mount points?  We have spoken to other users with 
performance problems on such servers.  The NFS server/OS combination you 
indicate above isn't known to be very fast.  This isn't your problem 
(see later), but it looks like your own data suggests you are giving up 
nearly an order of magnitude performance using this NFS server/OS 
combination, likely at a premium price as well.

> A standard output line looks like:
> 93.37   38.997264     230753      169    78      open
> 
> i.e. 93.37% of the system-call time was spent in 169 open calls (78 of
> which returned errors), at 230753 us/call on average, thus 39 wall-clock
> seconds per minute were spent just doing open.
> 
> We tried several things to understand the problem, but apart from moving
> more files (mostly log files of currently running jobs) off NFS we have
> not made much headway so far. On
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/H2Problems
> we have summarized some things.

I didn't catch if these are tcp/udp mounts.  What are the mount options?

> With the help of 'stress' and a tiny program just doing open/putc/close
> on a single file, I've tried to get a feeling for how good or bad things
> are when compared to other head nodes with different tasks/loads:
> 
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/OpenCloseIotest

... and these results do in fact confirm my earlier comment about the 
NFS server/OS combo not being known for speed.

From my laptop running a Ubuntu 8.10 variant over gigabit to a known 
very slow NFS server ...

./measure.exe /misc/crunch/home/landman/test2
Wrote 24905 times into /misc/crunch/home/landman/test2, speed is 
421.14/s, time elapsed: 59.14s


from a different server in a different location to a JackRabbit-S server 
over NFS on a single gigabit with default mounts (this server running a 
long IOzone)

./measure.exe /mnt/d/1
Wrote 124312 times into /mnt/d/1, speed is 2072.99/s, time elapsed: 59.97s

and from the same starting point but to a JackRabbit-S server over NFS 
on a single gigabit with default mounts (this server idle)

./measure.exe /mnt/d/x
Wrote 134426 times into /mnt/d/x, speed is 2270.79/s, time elapsed: 59.20s

This system would be roughly similar to your d02 machine.

Assuming you aren't using mount options of noac,sync,...  Could you 
enlighten us as to what mount options are for the head nodes?

Also, the way the code is written, you are doing quite a few calls to 
gettimeofday ... you could probably avoid this with a little re-writing 
of the main loop.

If you are using noac or sync on your NFS mounts, then this could 
explain some of the differences you are seeing (certainly the 100/s vs 
800/s ... but not likely the 4/s)

However, if you notice that h2 in your table is an apparent outlier, 
there may be something more systematic going on there.  Since you 
indicate there is a high load going on while you are testing, this is 
likely what you need to explore.

Grab the atop program.  It is pretty good about letting you explore what 
is causing load.  That is, despite your X4500/Solaris combo showing 
itself not to be a fast NFS server, the problem appears more likely to 
be localized on the h2 machine than on the NFS machines.

http://freshmeat.net/projects/atop/

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tracing down 250ms open/chdir calls
  2009-02-16 13:24 ` Joe Landman
@ 2009-02-16 13:44   ` Carsten Aulbert
  0 siblings, 0 replies; 5+ messages in thread
From: Carsten Aulbert @ 2009-02-16 13:44 UTC (permalink / raw)
  To: landman; +Cc: beowulf, linux-nfs, linux-kernel

Hi Joe, 

(keeping all lists cross-posted; please shout at me briefly if I cross
the line into rudeness):

Joe Landman schrieb:
> 
>   Are you using a "standard" cluster scheduler (SGE, PBS, ...) or a
> locally written one?
> 

We use Condor (http://www.cs.wisc.edu/condor/).
> 
> Hmmm...  These are your head nodes?  Not your NFS server nodes?  Sounds
> like there are a large number of blocked IO processes ... try a
> 

Yes, these are the head nodes and not the NFS servers.

>     vmstat 1
> 
> and look at the "b" column (usually second from the left, or nearly
> there).  Lots of blocked IO processes can have the effect of introducing
> significant latency into all system calls.
> 

Right now, (system load 40, but still quite responsive box):

vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0     60 1748932  49204 7181132    0    0   393   730   15   12  8  4 76 12
 1  0     60 1738804  49204 7187768    0    0     0   254 3340 11635 15  8 77  0
 1  0     60 1693220  49216 7193328    0    0     0   306 3118 9296 16  7 77  0
 1  0     60 1751184  49224 7192924    0    0     0   343 3450 10866 12  9 79  0
 0  0     60 1754924  49224 7192936    0    0     0   127 2744 6750  5  7 87  2
 0  0     60 1756932  49240 7187552    0    0     0   532 3289 9673  3  6 91  0
 0  0     60 1752664  49240 7193776    0    0     0    77 2835 10075  2  7 92  0
 2  0     60 1754956  49244 7193820    0    0     0   553 3976 14870  4 12 84  0
 1  0     60 1742032  49252 7206288    0    0     0   193 3588 9133  5  6 89  0
 1  0     60 1736920  49252 7206316    0    0     0   284 3821 9402  7  7 86  0
 4  0     60 1742292  49260 7193964    0    0     0   514 4545 12428 17 10 74  0

> Hmmm.... What happens if you make these local to each box?  What are the
> mount options for the mount points?  We have spoken to other users with
> performance problems on such servers.  The NFS server/OS combination you
> indicate above isn't known to be very fast.  This isn't your problem
> (see later), but it looks like your own data suggests you are giving up
> nearly an order of magnitude performance using this NFS server/OS
> combination, likely at a premium price as well.
> 

The mount options are pretty standard NFSv3 via tcp, ...:

s02:/atlashome/USER on /home/USER type nfs (rw,vers=3,rsize=32768,wsize=32768,namlen=255,soft,nointr,nolock,noacl,proto=tcp,
timeo=600,retrans=2,sec=sys,mountaddr=10.20.20.2,mountvers=3,mountproto=tcp,addr=10.20.20.2)

> Assuming you aren't using mount options of noac,sync,...  Could you
> enlighten us as to what mount options are for the head nodes?
The Linux data servers are exporting NFS with async; the X4500s should
behave about the same, i.e. we set nocacheflushing for the zpool.

> 
> Also, the way the code is written, you are doing quite a few calls to
> gettimeofday ... you could probably avoid this with a little re-writing
> of the main loop.
> 

Well, that proves that I should never be let near any serious programming - 
at least not in the time-critical parts of the code ;)

> If you are using noac or sync on your NFS mounts, then this could
> explain some of the differences you are seeing (certainly the 100/s vs
> 800/s ... but not likely the 4/s)
> 
> However, if you notice that h2 in your table is an apparent outlier,
> there may be something more systematic going on there.  Since you
> indicate there is a high load going on while you are testing, this is
> likely what you need to explore.
> 

That was my idea as well.

> Grab the atop program.  It is pretty good about letting you explore what
> is causing load.  That is, despite your X4500/Solaris combo showing
> itself not to be a fast NFS server, the problem appears more likely to
> be localized on the h2 machine than on the NFS machines.
> 
> http://freshmeat.net/projects/atop/

I'll try that and will dive again into iotop as well.

Cheers

Carsten

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2009-02-16 13:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-16  7:55 Tracing down 250ms open/chdir calls Carsten Aulbert
2009-02-16 12:56 ` Trond Myklebust
2009-02-16 13:00   ` Carsten Aulbert
2009-02-16 13:24 ` Joe Landman
2009-02-16 13:44   ` Carsten Aulbert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox