linux-nfs.vger.kernel.org archive mirror
* ls stalls
       [not found] <4FB43655.5040907@fhcrc.org>
@ 2012-05-16 23:34 ` Stuart Kendrick
  2012-05-16 23:41   ` Chuck Lever
  0 siblings, 1 reply; 7+ messages in thread
From: Stuart Kendrick @ 2012-05-16 23:34 UTC (permalink / raw)
  To: linux-nfs

Hi folks,

On our large memory (64GB) HPC nodes, we intermittently see what we call
'interactive stalls':  pauses in receiving 'ls' output.  Also, bash
shell completion stalls, emacs stalls.  We've hacked /bin/ls to time how
long it takes to complete and then to log diagnostic information when
that time exceeds 3 seconds.  In some cases, the result isn't surprising
-- a directory containing thousands or tens of thousands of files,
hosted on slow storage, might well take seconds to display.  But most of
the time, these stalls occur on directories containing tens or
occasionally hundreds of files; 'ls' on such a directory normally takes
a millisecond or less to complete.  Stalls vary in length:  most last
under 10s, a significant portion run between 10s and 100s, and the
occasional stall falls in the 100-300s range.
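
(The wrapper amounts to something like the sketch below -- illustrative
only; the deployed script logs more diagnostics, and the log path here
is made up:)

#!/bin/bash
# Sketch of our /bin/ls wrapper -- illustrative, not the deployed script.
# Assumes the real binary has been moved aside to /bin/ls.orig.
start=$(date +%s%N)
/bin/ls.orig "$@"
rc=$?
elapsed_ms=$(( ($(date +%s%N) - start) / 1000000 ))
if [ "$elapsed_ms" -gt 3000 ]; then
    # Record slow invocations for later correlation with packet traces.
    echo "$(date '+%F %T') ls took ${elapsed_ms}ms: $*" >> /var/log/slow-ls.log
fi
exit $rc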

I've been correlating strace output ('strace -f -tt ls {directory}')
with packet traces.  And I see the following pattern:
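
(Roughly like this, per incident -- the interface, filter, and output
paths are illustrative:)

# Run a packet capture alongside the strace so the timestamps line up.
tcpdump -i eth0 -s 0 -w /tmp/ls-stall.pcap port 2049 &
strace -f -tt -o /tmp/ls-stall.strace ls {directory}
kill %1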

(A) The stall occurs between a 'stat' on the directory and the 'open' on
the directory ... and sometimes, though not always, between the 'open'
and the following 'fcntl'.  Here's an example of a 10s stall:

17:20:01.365375
stat("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
{st_mode=S_IFDIR|0770, st_size=327680, ...}) = 0

17:20:11.774368
open("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3


And of a ~200s stall:


10:30:01.768459
stat("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
{st_mode=S_IFDIR|0770, st_size=327680, ...}) = 0

10:33:06.072659
open("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3

10:33:28.884426 fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)

10:33:28.884600 getdents64(3, /* 683 entries */, 32768) = 32736



(B) On the wire, during that stall, the HPC node says nothing to the NFS
server (sometimes literally nothing; sometimes it is reading or writing
in support of some other task/user, but it emits no GETATTR, READDIR, or
READDIRPLUS calls).  [No dropped frames, no TCP pathology.]


(C) Network I/O is substantial:  the node is reading and/or writing,
rapidly, with at least one of the handful of NFS servers that provide
storage to the HPC environment.

The clients are all running OpenSuse 11.3 Teal (kernel
2.6.34.10-0.2-default).  The NFS servers are a mix -- Solaris 10,
several NetApps, Windows 2008 -- backed by several different storage
systems.

Diagrams and related information visible at
https://vishnu.fhcrc.org/Rhino-RCA/

Insights?  Suggestions?

--sk

Stuart Kendrick
Fred Hutchinson Cancer Research Center
Seattle, WA USA




* Re: ls stalls
  2012-05-16 23:34 ` ls stalls Stuart Kendrick
@ 2012-05-16 23:41   ` Chuck Lever
  2012-05-17 13:12     ` Stuart Kendrick
  0 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever @ 2012-05-16 23:41 UTC (permalink / raw)
  To: Stuart Kendrick; +Cc: linux-nfs


On May 16, 2012, at 7:34 PM, Stuart Kendrick wrote:

> Hi folks,
> 
> On our large memory (64GB) HPC nodes, we intermittently see what we call
> 'interactive stalls':  pauses in receiving 'ls' output.  Also, bash
> shell completion stalls, emacs stalls.  We've hacked /bin/ls to time how
> long it takes to complete and then to log diagnostic information when
> that time exceeds 3 seconds.  In some cases, the result isn't surprising
> -- a directory containing thousands or tens of thousands of files,
> hosted on slow storage, might well take seconds to display.  But most of
> the time, these stalls occur on directories containing tens or
> occasionally hundreds of files; 'ls' on such a directory normally takes
> a millisecond or less to complete.  Stalls vary in length:  most last
> under 10s, a significant portion run between 10s and 100s, and the
> occasional stall falls in the 100-300s range.
> 
> I've been correlating strace output ('strace -f -tt ls {directory}')
> with packet traces.  And I see the following pattern:
> 
> (A) The stall occurs between a 'stat' on the directory and the 'open' on
> the directory ... and sometimes, though not always, between the 'open'
> and the following 'fcntl'.  Here's an example of a 10s stall:
> 
> 17:20:01.365375
> stat("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
> {st_mode=S_IFDIR|0770, st_size=327680, ...}) = 0
> 
> 17:20:11.774368
> open("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
> O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> 
> 
> And of a ~200s stall:
> 
> 
> 10:30:01.768459
> stat("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
> {st_mode=S_IFDIR|0770, st_size=327680, ...}) = 0
> 
> 10:33:06.072659
> open("/shared/silo_r/xxx/colongwas_archive/plco-sshfs/pancreatic-panscan-dbgap/panscan-work/610-gtc",
> O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> 
> 10:33:28.884426 fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> 
> 10:33:28.884600 getdents64(3, /* 683 entries */, 32768) = 32736
> 
> 
> 
> (B) On the wire, during that stall, the HPC node says nothing to the NFS
> server (sometimes literally nothing; sometimes it is reading or writing
> in support of some other task/user, but it emits no GETATTR, READDIR, or
> READDIRPLUS calls).  [No dropped frames, no TCP pathology.]
> 
> 
> (C) Network I/O is substantial:  the node is reading and/or writing,
> rapidly, with at least one of the handful of NFS servers that provide
> storage to the HPC environment.
> 
> The clients are all running OpenSuse 11.3 Teal (kernel
> 2.6.34.10-0.2-default).  The NFS servers are a mix -- Solaris 10,
> several NetApps, Windows 2008 -- backed by several different storage
> systems.
> 
> Diagrams and related information visible at
> https://vishnu.fhcrc.org/Rhino-RCA/
> 
> Insights?  Suggestions?

My starting guess is that there is some task on that client that has dirtied the pages of one of the files in the directory you are trying to list.  Before a GETATTR can be sent, the client must flush outstanding writes to the file so the server can provide size and mtime attributes that reflect the most recent write to that file.

Possible work-arounds: You can reduce the dirty_ratio setting on the client to force it to start flushing outstanding writes sooner, you can change the writing applications to use synchronous writes (or flush manually), or you can alter your "ls" command so that it doesn't require a GETATTR for each file.
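
For example, something along these lines (a sketch; the values and the
{directory} placeholder are illustrative, not recommendations):

# On a 64GB client the percentage knobs are coarse; the byte-based
# knobs (vm.dirty_bytes, vm.dirty_background_bytes, added in 2.6.29)
# give finer control.  Values are illustrative -- tune to taste.
sysctl -w vm.dirty_bytes=268435456             # block writers above 256MB dirty
sysctl -w vm.dirty_background_bytes=67108864   # background flush from 64MB

# A listing that avoids a per-file stat(2) -- and hence per-file
# GETATTRs -- by skipping -l and color:
/bin/ls -1 --color=never {directory}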

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: ls stalls
  2012-05-16 23:41   ` Chuck Lever
@ 2012-05-17 13:12     ` Stuart Kendrick
  2012-05-17 13:46       ` Chuck Lever
  2012-05-29 13:07       ` Stuart Kendrick
  0 siblings, 2 replies; 7+ messages in thread
From: Stuart Kendrick @ 2012-05-17 13:12 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs

> My starting guess is that there is some task on that client that has dirtied the pages of one of the files in the directory you are trying to list.  Before a GETATTR can be sent, the client must flush outstanding writes to the file so the server can provide size and mtime attributes that reflect the most recent write to that file.
> 
> Possible work-arounds: You can reduce the dirty_ratio setting on the client to force it to start flushing outstanding writes sooner, you can change the writing applications to use synchronous writes (or flush manually), or you can alter your "ls" command so that it doesn't require a GETATTR for each file.
> 
How might I replicate this?

I'm trying a continuous write to a file ... but 'ls -l' remains snappy.

#!/bin/bash
while true; do
	cp foo.1 foo.2
	echo "Do it again"
done

--sk


* Re: ls stalls
  2012-05-17 13:12     ` Stuart Kendrick
@ 2012-05-17 13:46       ` Chuck Lever
  2012-05-17 23:51         ` Stuart Kendrick
  2012-05-29 13:07       ` Stuart Kendrick
  1 sibling, 1 reply; 7+ messages in thread
From: Chuck Lever @ 2012-05-17 13:46 UTC (permalink / raw)
  To: Stuart Kendrick; +Cc: linux-nfs


On May 17, 2012, at 9:12 AM, Stuart Kendrick wrote:

>> My starting guess is that there is some task on that client that has dirtied the pages of one of the files in the directory you are trying to list.  Before a GETATTR can be sent, the client must flush outstanding writes to the file so the server can provide size and mtime attributes that reflect the most recent write to that file.
>> 
>> Possible work-arounds: You can reduce the dirty_ratio setting on the client to force it to start flushing outstanding writes sooner, you can change the writing applications to use synchronous writes (or flush manually), or you can alter your "ls" command so that it doesn't require a GETATTR for each file.
>> 
> How might I replicate this?
> 
> I'm trying a continuous write to a file ... but 'ls -l' remains snappy.

The file must reside in the directory you are listing, and it should be very large.  Typically dd'ing zeroes into a large file is a good way to starve non-writers.
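
A rough sketch (mount point, file name, and sizes are illustrative):

# Dirty a large amount of page cache in the directory you will list;
# make the file large relative to the dirty thresholds on a 64GB box.
dd if=/dev/zero of=/mnt/nfs/testdir/bigfile bs=1M count=20000 &

# While dd runs, time a listing that stats the file; the stat(2) of
# bigfile should block behind the flush of its dirty pages.
time ls -l /mnt/nfs/testdir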

> #!/bin/bash
> while true; do
> 	cp foo.1 foo.2
> 	echo "Do it again"
> done

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






* Re: ls stalls
  2012-05-17 13:46       ` Chuck Lever
@ 2012-05-17 23:51         ` Stuart Kendrick
  0 siblings, 0 replies; 7+ messages in thread
From: Stuart Kendrick @ 2012-05-17 23:51 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs


> The file must reside in the directory you are listing, and it should be very large.  Typically dd'ing zeroes into a large file is a good way to starve non-writers.

Thanx Chuck.  I can replicate the problem; I've learned a lot about
dirty_bytes/dirty_ratio, and I'm on my way to resolving this.

--sk


* Re: ls stalls
  2012-05-17 13:12     ` Stuart Kendrick
  2012-05-17 13:46       ` Chuck Lever
@ 2012-05-29 13:07       ` Stuart Kendrick
  2012-05-29 16:20         ` Chuck Lever
  1 sibling, 1 reply; 7+ messages in thread
From: Stuart Kendrick @ 2012-05-29 13:07 UTC (permalink / raw)
  To: linux-nfs

>> My starting guess is that there is some task on that client that has dirtied the pages of one of the files in the directory you are trying to list.  Before a GETATTR can be sent, the client must flush outstanding writes to the file so the server can provide size and mtime attributes that reflect the most recent write to that file.
>>
>> Possible work-arounds: You can reduce the dirty_ratio setting on the client to force it to start flushing outstanding writes sooner, you can change the writing applications to use synchronous writes (or flush manually), or you can alter your "ls" command so that it doesn't require a GETATTR for each file.
>>

OK, I think I understand:  if I've dirtied the pages of a file and then
issue a stat against it, the NFS Client will stall my stat until it
flushes cache.

And I'm focused on the work-around of:  "You can reduce the dirty_ratio
setting ..."

(a) Does the stat /trigger/ a cache flush?  Or does the stat have to
wait for the usual mechanisms to initiate a flush?

(b) How granular is this process?  Will the NFS Client issue my stat
(GETATTR) once all the dirty pages relevant to my particular file are
flushed (but other pages are still being written)?  Or does the NFS
Client wait until pdflush has entirely emptied cache before proceeding
with the stat?

(c) When cache hits dirty_ratio (or dirty_bytes), I believe the kernel
blocks /all/ writers until it has /emptied/ cache (and flushing a big
cache to slow storage can take a while) ... am I correct?

(d) If my load is dominated by large writes over NFS, does cache buy me
anything?  Seems to me that the dominant benefit of cache is elevatoring
... the block structure of storage accessible via NFS is opaque ...
thus, there's nothing write cache can do to increase the performance of
the write ... it may as well hand off the blocks in any old order to the
NFS server and let the storage's cache worry about sequencing the blocks.

==> Seems to me that shrinking dirty_xxx to something really low ...
like 100K ... would:
(1) maximize 'stat' performance during heavy writes
(2) leave NFS write performance unaffected

but I can tell that I'm missing something here (because when I try it,
interactive performance, i.e. 'ls -l', tanks).  What am I not understanding?

--sk


* Re: ls stalls
  2012-05-29 13:07       ` Stuart Kendrick
@ 2012-05-29 16:20         ` Chuck Lever
  0 siblings, 0 replies; 7+ messages in thread
From: Chuck Lever @ 2012-05-29 16:20 UTC (permalink / raw)
  To: Stuart Kendrick; +Cc: linux-nfs


On May 29, 2012, at 9:07 AM, Stuart Kendrick wrote:

>>> My starting guess is that there is some task on that client that has dirtied the pages of one of the files in the directory you are trying to list.  Before a GETATTR can be sent, the client must flush outstanding writes to the file so the server can provide size and mtime attributes that reflect the most recent write to that file.
>>> 
>>> Possible work-arounds: You can reduce the dirty_ratio setting on the client to force it to start flushing outstanding writes sooner, you can change the writing applications to use synchronous writes (or flush manually), or you can alter your "ls" command so that it doesn't require a GETATTR for each file.
>>> 
> 
> OK, I think I understand:  if I've dirtied the pages of a file and then
> issue a stat against it, the NFS Client will stall my stat until it
> flushes cache.
> 
> And I'm focused on the work-around of:  "You can reduce the dirty_ratio
> setting ..."
> 
> (a) Does the stat /trigger/ a cache flush?  Or does the stat have to
> wait for the usual mechanisms to initiate a flush?

stat(2) invokes the VFS ->getattr method, which is nfs_getattr() for NFS.  nfs_getattr() initiates the flush before sending the GETATTR operation.
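
You can watch this from user space; a rough sketch, assuming nfsstat(8)
is installed and bigfile has dirty pages:

# Snapshot client RPC counters, stat the dirty file, snapshot again.
# A jump in WRITE and COMMIT counts alongside the GETATTR shows the
# flush that nfs_getattr() initiated.  (Output format varies by version.)
nfsstat -c > /tmp/nfs.before
stat /mnt/nfs/testdir/bigfile
nfsstat -c > /tmp/nfs.after
diff /tmp/nfs.before /tmp/nfs.after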

> (b) How granular is this process?  Will the NFS Client issue my stat
> (GETATTR) once all the dirty pages relevant to my particular file are
> flushed (but other pages are still being written)?  Or does the NFS
> Client wait until pdflush has entirely emptied cache before proceeding
> with the stat?

The nfs_getattr() function flushes dirty pages for just that file.

nfs_getattr() should cause other writers to wait for the GETATTR to complete; otherwise the GETATTR can be starved.  I'm not sure I see a mechanism like that in 3.4.

> (c) When cache hits dirty_ratio (or dirty_bytes), I believe the kernel
> blocks /all/ writers until it has /emptied/ cache (and flushing a big
> cache to slow storage can take a while) ... am I correct?

Not sure.  You may want to try dirty_background_ratio instead, as that kicks off background flushing, but writers may continue.
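
For example (the value is illustrative):

# Start background writeback early, but leave the hard blocking
# threshold (dirty_ratio) alone so writers are not stalled.
sysctl -w vm.dirty_background_ratio=1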

> (d) If my load is dominated by large writes over NFS, does cache buy me
> anything?  Seems to me that the dominant benefit of cache is elevatoring
> ... the block structure of storage accessible via NFS is opaque ...
> thus, there's nothing write cache can do to increase the performance of
> the write ... it may as well hand off the blocks in any old order to the
> NFS server and let the storage's cache worry about sequencing the blocks.

Caching is mostly of benefit for readers.

However, delaying small writes has the benefit of allowing a client to coalesce small application write requests into fewer large on-the-wire requests.  It can also be of benefit if an application is overwriting the same part of a file many times, or if it writes then truncates.

Honestly, I've always regarded the aggressive write caching behavior of the Linux NFS client as a performance bug.  The problem is that write flushing is driven almost entirely by the VM in modern Linux kernels, and there isn't a way to adjust specific flushing behavior for a specific file system implementation (hence the use of system-wide tunables as a workaround).

> ==> Seems to me that shrinking dirty_xxx to something really low ...
> like 100K ... would:
> (1) maximize 'stat' performance during heavy writes
> (2) leave NFS write performance unaffected
> 
> but I can tell that I'm missing something here (because when I try it,
> interactive performance, i.e. 'ls -l', tanks).  What am I not understanding?

That's a very broad question.  Try diagnosing the specific problem in your environment, and do try out dirty_background_ratio.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





