* Performance Diagnosis
@ 2008-07-15 15:34 Andrew Bell
0 siblings, 1 reply; 16+ messages in thread
From: Andrew Bell @ 2008-07-15 15:34 UTC (permalink / raw)
To: linux-nfs
Hi,
I have a RHEL 5 system that exhibits less than wonderful performance
when copying large files from/to an NFS filesystem. When the copy is
taking place, other access to the filesystem is painfully slow. I
would like to have the filesystem react well to small requests while a
large request is taking place.
A couple of questions:
Is this a reasonable expectation?
Is this perhaps an I/O scheduling issue that isn't specific to NFS,
but shows up there because of the latency of my NFS setup?
Is this most likely a client issue, a server issue or a combination?
Do you have recommendations on the best way to determine what is
happening? Are there existing tools to monitor active IO/NFS
requests/responses and any relevant queues?
Thanks for any info/ideas before I get in too deep :)
--
Andrew Bell
andrew.bell.ia@gmail.com
* Re: Performance Diagnosis
@ 2008-07-15 15:49 Chuck Lever
From: Chuck Lever @ 2008-07-15 15:49 UTC (permalink / raw)
To: Andrew Bell; +Cc: linux-nfs

On Tue, Jul 15, 2008 at 11:34 AM, Andrew Bell <andrew.bell.ia@gmail.com> wrote:
> I have a RHEL 5 system that exhibits less than wonderful performance
> when copying large files from/to an NFS filesystem. When the copy is
> taking place, other access to the filesystem is painfully slow.
>
> Is this a reasonable expectation?

Yes, but Linux NFS can't fulfill it. :-)

There is currently only one RPC transport socket between client and server
for each mount point. Large file copies (or similar operations) will queue
a lot of I/O, so your small requests will take a while to get through the
queued-up writes or reads ahead of them.

> Is this perhaps an I/O scheduling issue that isn't specific to NFS,
> but shows up there because of the latency of my NFS setup?
>
> Is this most likely a client issue, a server issue or a combination?

Well, if your server or network is slow, this kind of thing is more likely
to happen.

> Do you have recommendations on the best way to determine what is
> happening? Are there existing tools to monitor active IO/NFS
> requests/responses and any relevant queues?

Yes, I wrote some Python tools that are still undocumented (i.e. you will
likely have to read the Python source to figure out what they do). They
were recently included in nfs-utils, but you can download them from:

  http://oss.oracle.com/~cel/linux-2.6/2.6.25/nfs-iostat
  http://oss.oracle.com/~cel/linux-2.6/2.6.25/mountstats

--
Chuck Lever
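The tools mentioned above read per-mount RPC statistics from /proc/self/mountstats. A minimal sketch of that kind of parser is below; the eight-counter per-op layout assumed here (operations, transmissions, major timeouts, bytes sent/received, and cumulative queue, RTT, and execute times in milliseconds) is my reading of the statvers=1.0 format and should be checked against the real nfs-iostat/mountstats source.

```python
# Hypothetical, simplified reader for the per-op section of
# /proc/self/mountstats (statvers=1.0). The field layout is an
# assumption; verify against the actual nfs-iostat/mountstats tools.

def parse_per_op_stats(mountstats_text):
    """Return {op_name: counters} for each per-op statistics line."""
    fields = ("ops", "trans", "timeouts", "bytes_sent", "bytes_recv",
              "queue_ms", "rtt_ms", "execute_ms")
    stats = {}
    in_per_op = False
    for line in mountstats_text.splitlines():
        line = line.strip()
        if line == "per-op statistics":
            in_per_op = True
            continue
        if in_per_op and ":" in line:
            op, _, rest = line.partition(":")
            counters = rest.split()
            if len(counters) == len(fields):
                stats[op] = dict(zip(fields, map(int, counters)))
    return stats

def avg_rtt_ms(stats, op):
    """Average server round-trip time per operation, in milliseconds."""
    s = stats[op]
    return s["rtt_ms"] / s["ops"] if s["ops"] else 0.0
```

A high average queue time relative to RTT would suggest requests are waiting on the client side (backlog queue, slot table) rather than on the server.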
* Re: Performance Diagnosis
@ 2008-07-15 15:58 Peter Staubach
From: Peter Staubach @ 2008-07-15 15:58 UTC (permalink / raw)
To: Andrew Bell; +Cc: linux-nfs

Andrew Bell wrote:
> Is this a reasonable expectation?

Well, yes, I think that it would be a reasonable expectation. I know that
I would certainly like for it to be true. :-) That said, this is a common
situation, but not one that we've had/made the time to resolve yet.

> Is this perhaps an I/O scheduling issue that isn't specific to NFS,
> but shows up there because of the latency of my NFS setup?

Could be. Nothing is impossible. That said...

> Is this most likely a client issue, a server issue or a combination?

It could be either one, both, or even the network. It could easily just be
the architecture of the NFS client solution, in that it shares a single
TCP connection for both data operations and metadata operations. The
metadata operations can get behind the larger data operations in the TCP
stream, thus increasing their latencies.

> Do you have recommendations on the best way to determine what is
> happening? Are there existing tools to monitor active IO/NFS
> requests/responses and any relevant queues?

Perhaps ensure that the local file system on the server is performing
well, that there are no obvious hot spots, and that the activity is not
causing the file system to thrash. Some file systems, such as ext3, tend
to bottleneck in the journaling code, so that might be an area of the
local file system to consider.

> Thanks for any info/ideas before I get in too deep :)

We could use some idea of the activities that are occurring when you
encounter the slowness that you are concerned about. If it is the notion
described above, sometimes called head-of-line blocking, then we could
think about ways to duplex operations over multiple TCP connections,
perhaps with one connection for small, low-latency operations and another
connection for larger, higher-latency operations.

Thanx...

ps
* Re: Performance Diagnosis
@ 2008-07-15 16:23 Chuck Lever
From: Chuck Lever @ 2008-07-15 16:23 UTC (permalink / raw)
To: Peter Staubach; +Cc: Andrew Bell, linux-nfs

On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <staubach@redhat.com> wrote:
> If it is the notion described above, sometimes called head
> of line blocking, then we could think about ways to duplex
> operations over multiple TCP connections, perhaps with one
> connection for small, low latency operations, and another
> connection for larger, higher latency operations.

I've dreamed about that for years. I don't think it would be too
difficult, but one thing that has held it back is that the shortage of
ephemeral ports on the client may reduce the number of concurrent mount
points we can support.

One way to avoid the port issue is to construct an SCTP transport for NFS.
SCTP allows multiple streams on the same connection, effectively
eliminating head-of-line blocking.

--
Chuck Lever
* Re: Performance Diagnosis
@ 2008-07-15 16:34 Andrew Bell
From: Andrew Bell @ 2008-07-15 16:34 UTC (permalink / raw)
To: chucklever; +Cc: Peter Staubach, linux-nfs

On Tue, Jul 15, 2008 at 11:23 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> I've dreamed about that for years. I don't think it would be too
> difficult, but one thing that has held it back is the shortage of
> ephemeral ports on the client may reduce the number of concurrent
> mount points we can support.

Could one come up with a way to insert "small" ops somewhere in the middle
of the existing queue, or are the TCP send buffers typically too deep for
this to do much good? It seems like more than one connection would allow
"good" servers to handle requests simultaneously anyway.

Is there really that big a shortage of ephemeral ports? I guess one could
do active connection management.

> One way to avoid the port issue is to construct an SCTP transport for
> NFS. SCTP allows multiple streams on the same connection, effectively
> eliminating head of line blocking.

Waiting for SCTP sounds like a long-term solution, as server vendors
probably have little incentive.

Thanks for the ideas. I'll have to see what kind of time I can get to
investigate this stuff.

--
Andrew Bell
andrew.bell.ia@gmail.com
* Re: Performance Diagnosis
@ 2008-07-15 17:20 Chuck Lever
From: Chuck Lever @ 2008-07-15 17:20 UTC (permalink / raw)
To: Andrew Bell; +Cc: Peter Staubach, linux-nfs

On Tue, Jul 15, 2008 at 12:34 PM, Andrew Bell <andrew.bell.ia@gmail.com> wrote:
> Could one come up with a way to insert "small" ops somewhere in the
> middle of the existing queue, or are the TCP send buffers typically too
> deep for this to do much good? Seems like more than one connection would
> allow "good" servers to handle requests simultaneously anyway.

There are several queues inside the NFS client stack.

The underlying RPC client manages a slot table. Each slot contains one
pending RPC request; i.e. an RPC has been sent, and the slot is held while
waiting for the reply. The table contains 16 slots by default. You can
adjust the size (up to 128 slots) via a sysctl, and that may help your
situation by allowing more reads or writes to go to the server at once.

The RPC client allows a single RPC to be sent on the socket at a time.
(Waiting for the reply is asynchronous, so the next request can be sent on
the socket as soon as this one is done being sent.) Especially for large
requests, this may mean waiting for the socket buffer to be emptied before
more data can be sent. The socket is held for each request until the
request is entirely sent, so that data for different requests are not
intermingled. If the network is not congested, this is generally not a
problem, but if the server is backed up, it can take a while before the
buffer is ready for more data from a single large request.

Before an RPC gets into a slot, though, it waits on a backlog queue. This
queue can grow quite long in situations where there are a lot of reads or
writes and the server or network is slow.

The Python scripts I mentioned before have information about the backlog
queue size, slot table utilization, and per-operation average latency. So
you can clearly determine what the client is waiting for.

> Is there really that big a shortage of ephemeral ports?

Yes. The NFS client uses only privileged ports (although you can
optionally tell it to use non-privileged ports as well). For long-lived
sockets (such as transport sockets for NFS), it is careful to choose
privileged ports that are not a "well known" service (e.g. port 22 is the
standard ssh service port). So the default port range is roughly between
670 and 1023.

>> One way to avoid the port issue is to construct an SCTP transport for
>> NFS. SCTP allows multiple streams on the same connection, effectively
>> eliminating head of line blocking.
>
> Waiting for SCTP sounds like a long-term solution, as server vendors
> probably have little incentive.

Yep.

> Thanks for the ideas. I'll have to see what kind of time I can get to
> investigate this stuff.

We neglected to mention that you can also increase the number of NFSD
threads on your server. I think eight is the default, and often that isn't
enough.

--
Chuck Lever
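The slot table sysctl referred to above is sunrpc.tcp_slot_table_entries, visible as /proc/sys/sunrpc/tcp_slot_table_entries. A small tuning-helper sketch follows; the minimum of 2 and maximum of 128 slots are assumptions based on kernels of that era, and should be verified against the running kernel's sunrpc code.

```python
# Hedged sketch of a slot-table tuning helper. The bounds below
# (RPC_MIN_SLOT_TABLE=2, RPC_MAX_SLOT_TABLE=128) are assumptions taken
# from circa-2.6.25 kernels; check them against your kernel source.

RPC_MIN_SLOT_TABLE = 2
RPC_MAX_SLOT_TABLE = 128

def clamp_slot_entries(requested):
    """Keep a requested slot count inside the kernel's accepted range."""
    return max(RPC_MIN_SLOT_TABLE, min(requested, RPC_MAX_SLOT_TABLE))

def set_slot_entries(requested,
                     path="/proc/sys/sunrpc/tcp_slot_table_entries"):
    """Write the clamped value to the sysctl file (root required).

    Note: on older kernels the value is commonly read when a transport
    is created, so set it before mounting the filesystem.
    """
    value = clamp_slot_entries(requested)
    with open(path, "w") as f:
        f.write("%d\n" % value)
    return value
```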
* Re: Performance Diagnosis
@ 2008-07-15 17:44 Peter Staubach
From: Peter Staubach @ 2008-07-15 17:44 UTC (permalink / raw)
To: chucklever; +Cc: Andrew Bell, linux-nfs

Chuck Lever wrote:
> I've dreamed about that for years. I don't think it would be too
> difficult, but one thing that has held it back is the shortage of
> ephemeral ports on the client may reduce the number of concurrent
> mount points we can support.
>
> One way to avoid the port issue is to construct an SCTP transport for
> NFS. SCTP allows multiple streams on the same connection, effectively
> eliminating head of line blocking.

I like the idea of combining this work with implementing a proper
connection manager so that we don't need a connection per mount. We really
only need one connection per client and server, no matter how many
individual mounts there might be from that single server. (Or two
connections, if we want to do something like this...)

We could also manage the connection space and thus never run into the
shortage of ports ever again. When the port space is full or we've run
into some other artificial limit, then we simply close down some other
connection to make space.

ps
* Re: Performance Diagnosis
@ 2008-07-15 18:17 Chuck Lever
From: Chuck Lever @ 2008-07-15 18:17 UTC (permalink / raw)
To: Peter Staubach; +Cc: Andrew Bell, linux-nfs

On Tue, Jul 15, 2008 at 1:44 PM, Peter Staubach <staubach@redhat.com> wrote:
> I like the idea of combining this work with implementing a proper
> connection manager so that we don't need a connection per mount.
> We really only need one connection per client and server, no matter
> how many individual mounts there might be from that single server.
> (Or two connections, if we want to do something like this...)
>
> We could also manage the connection space and thus, never run into
> the shortage of ports ever again. When the port space is full or
> we've run into some other artificial limit, then we simply close
> down some other connection to make space.

I think we should do this for text-based mounts; however this would mean
the connection management would happen in the kernel, which (only
slightly) complicates things.

I was thinking about this a little last week when Trond mentioned
implementing a connected UDP socket transport...

It would be nice if all the kernel RPC services that need to send a single
RPC request (like mount, rpcbind, and so on) could share a small managed
pool of sockets (a pool of TCP sockets, or a pool of connected UDP
sockets). Connected sockets have the ostensible advantage that they can
quickly detect the absence of a remote listener. But such a pool would be
a good idea because multiple mount requests to the same server could all
flow over the same set of connections.

But we might be able to get away with something nearly as efficient if the
RPC client would always invoke a connect(AF_UNSPEC) before destroying the
socket. Wouldn't that free the ephemeral port immediately? What are the
risks of trying something like this?

--
"Alright guard, begin the unnecessarily slow-moving dipping mechanism."
--Dr. Evil
* Re: Performance Diagnosis
@ 2008-07-15 18:51 Trond Myklebust
From: Trond Myklebust @ 2008-07-15 18:51 UTC (permalink / raw)
To: chucklever; +Cc: Peter Staubach, Andrew Bell, linux-nfs

On Tue, 2008-07-15 at 14:17 -0400, Chuck Lever wrote:
> [...]
>
> It would be nice if all the kernel RPC services that needed to send a
> single RPC request (like mount, rpcbind, and so on) could share a
> small managed pool of sockets (a pool of TCP sockets, or a pool of
> connected UDP sockets).
>
> But we might be able to get away with something nearly as efficient if
> the RPC client would always invoke a connect(AF_UNSPEC) before
> destroying the socket. Wouldn't that free the ephemeral port
> immediately? What are the risks of trying something like this?

Why is all the talk here only about RPC level solutions?

Newer kernels already have a good deal of extra throttling of writes at
the NFS superblock level, and there is even a sysctl to control the amount
of outstanding writes before the VM congestion control sets in. Please see
/proc/sys/fs/nfs/nfs_congestion_kb

Cheers
  Trond
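As a rough illustration of what /proc/sys/fs/nfs/nfs_congestion_kb bounds, the sketch below compares an amount of dirty page data against the threshold. The helper names and the 4 KB page size are assumptions for illustration; the kernel's actual writeback congestion logic is more involved than this arithmetic.

```python
# Illustrative sketch only: compare outstanding dirty NFS write data
# against the nfs_congestion_kb threshold. The real check lives in the
# kernel's NFS writeback path; this just shows the unit arithmetic.

PAGE_SIZE = 4096  # typical page size on x86; an assumption

def read_congestion_kb(path="/proc/sys/fs/nfs/nfs_congestion_kb"):
    """Read the current threshold (kilobytes of outstanding writes)."""
    with open(path) as f:
        return int(f.read().strip())

def writes_congested(dirty_pages, congestion_kb, page_size=PAGE_SIZE):
    """True if the outstanding write data exceeds the threshold."""
    return dirty_pages * page_size // 1024 > congestion_kb
```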
* Re: Performance Diagnosis
@ 2008-07-15 19:21 Peter Staubach
From: Peter Staubach @ 2008-07-15 19:21 UTC (permalink / raw)
To: Trond Myklebust; +Cc: chucklever, Andrew Bell, linux-nfs

Trond Myklebust wrote:
> Why is all the talk here only about RPC level solutions?
>
> Newer kernels already have a good deal of extra throttling of writes at
> the NFS superblock level, and there is even a sysctl to control the
> amount of outstanding writes before the VM congestion control sets in.
> Please see /proc/sys/fs/nfs/nfs_congestion_kb

The throttling of writes definitely seems like an NFS level issue, so
that's a good thing. (RHEL-5 might be a tad far enough behind to not be
able to take advantage of all of these modern things... :-))

The connection manager would seem to be an RPC level thing, although I
haven't thought through the ramifications of the NFSv4.1 stuff and how it
might impact a connection manager sufficiently.

ps
* Re: Performance Diagnosis
@ 2008-07-15 19:35 Trond Myklebust
From: Trond Myklebust @ 2008-07-15 19:35 UTC (permalink / raw)
To: Peter Staubach; +Cc: chucklever, Andrew Bell, linux-nfs

On Tue, 2008-07-15 at 15:21 -0400, Peter Staubach wrote:
> The connection manager would seem to be an RPC level thing, although
> I haven't thought through the ramifications of the NFSv4.1 stuff
> and how it might impact a connection manager sufficiently.

We already have the scheme that shuts down connections on inactive RPC
clients after a suitable timeout period, so the only gains I can see would
have to involve shutting down connections on active clients.

At that point, the danger isn't with NFSv4.1, it is rather with
NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
means that you have to be very careful about schemes that involve shutting
down connections on active RPC clients.

Cheers
  Trond
* Re: Performance Diagnosis
@ 2008-07-15 19:55 Peter Staubach
From: Peter Staubach @ 2008-07-15 19:55 UTC (permalink / raw)
To: Trond Myklebust; +Cc: chucklever, Andrew Bell, linux-nfs

Trond Myklebust wrote:
> We already have the scheme that shuts down connections on inactive RPC
> clients after a suitable timeout period, so the only gains I can see
> would have to involve shutting down connections on active clients.
>
> At that point, the danger isn't with NFSv4.1, it is rather with
> NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
> mean that you have to be very careful about schemes that involve
> shutting down connections on active RPC clients.

It seems to me that as long as we don't shut down a connection which is
actively being used for an outstanding request, then we shouldn't have any
larger problems with the duplicate caches on servers than we do now.

We can do this easily enough by reference counting the connection state
and then only closing connections which are not being referenced. I
definitely agree, shutting down a connection which is being used is just
inviting trouble.

A gain would be that we could reduce the number of connections on active
clients if we could disassociate a connection from a particular mounted
file system. As long as we can achieve maximum network bandwidth through a
single connection, then we don't need more than one connection per server.

We could handle the case where the client was talking to more servers than
it had connection space for by forcibly, but safely, closing connections
to servers and then using the space for a new connection to a server. We
could do this in the connection manager by checking to see if there was an
available connection which was not marked as in the process of being
closed. If so, then it just enters the fray as needing a connection and
works like all of the others.

The algorithm could look something like:

top:
    Look for a connection to the right server which is not marked
    as being closed.
    If one was found, then increment its reference count and
    return it.
    Attempt to create a new connection.
    If this works, then increment its reference count and
    return it.
    Find a connection to be closed, either one not currently being
    used or via some heuristic like round-robin.
    If this connection is not actively being used, then close it
    and go to top.
    Mark the connection as being closed, wait until it is closed,
    and then go to top.

I know that this is rough and there are several races that I glossed over,
but hopefully this will outline the general bones of a solution. When the
system is having to recycle connections, it may slow down, but at least it
will work and not have things just fail.

Thanx...

ps
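A user-space sketch of that reference-counting algorithm is below. The class and method names are illustrative; a real implementation would live in the kernel RPC client and would have to handle the close/wait races glossed over here.

```python
import threading

class Connection:
    """Stand-in for a transport connection to one server."""
    def __init__(self, server):
        self.server = server
        self.refcount = 0

class ConnectionManager:
    """Sketch of the algorithm above: reuse a live connection when one
    exists, create one when there is room, otherwise evict an idle
    (unreferenced) connection to make space and retry from the top."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.conns = {}          # server -> Connection
        self.lock = threading.Lock()

    def acquire(self, server):
        with self.lock:
            while True:
                conn = self.conns.get(server)
                if conn is not None:            # reuse: bump refcount
                    conn.refcount += 1
                    return conn
                if len(self.conns) < self.max_connections:
                    conn = Connection(server)   # room for a new one
                    conn.refcount = 1
                    self.conns[server] = conn
                    return conn
                # full: close an idle connection, then retry from the top
                victim = next((s for s, c in self.conns.items()
                               if c.refcount == 0), None)
                if victim is None:
                    # all connections busy; a real implementation would
                    # wait for, or force, one to become unused
                    raise RuntimeError("all connections in use")
                del self.conns[victim]

    def release(self, conn):
        with self.lock:
            conn.refcount -= 1
```

With a pool of two, acquiring a third server's connection evicts an idle one rather than failing, which is the behavior the algorithm is after.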
* Re: Performance Diagnosis
@ 2008-07-15 20:27 Trond Myklebust
From: Trond Myklebust @ 2008-07-15 20:27 UTC (permalink / raw)
To: Peter Staubach; +Cc: chucklever, Andrew Bell, linux-nfs

On Tue, 2008-07-15 at 15:55 -0400, Peter Staubach wrote:
> It seems to me that as long as we don't shut down a connection
> which is actively being used for an outstanding request, then
> we shouldn't have any larger problems with the duplicate caches
> on servers than we do now.
>
> We can do this easily enough by reference counting the connection
> state and then only closing connections which are not being
> referenced.

Agreed.

> A gain would be that we could reduce the numbers of connections on
> active clients if we could disassociate a connection with a
> particular mounted file system. As long as we can achieve maximum
> network bandwidth through a single connection, then we don't need
> more than one connection per server.

Isn't that pretty much the norm today anyway? The only call to
rpc_create() that I can find is made when creating the nfs_client
structure. All other NFS-related rpc connections are created as clones of
the above shared structure, and thus share the same rpc_xprt.

I'm not sure that we want to share connections in the cases where we can't
share the same nfs_client, since that usually means that RPC level
parameters such as timeout values and NFS protocol versions differ.

> The algorithm could look something like:
> [...]

Actually, what you really want to do is look at whether or not any of the
rpc slots are in use. If they aren't, then you are free to close the
connection; if not, go to the next.

Unfortunately, you still can't get rid of the 2 minute TIME_WAIT state in
the case of a TCP connection, so I'm not sure how useful this will turn
out to be...

Cheers
  Trond
* Re: Performance Diagnosis
  2008-07-15 20:27               ` Trond Myklebust
@ 2008-07-15 20:48                 ` Peter Staubach
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Staubach @ 2008-07-15 20:48 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: chucklever, Andrew Bell, linux-nfs

Trond Myklebust wrote:
> On Tue, 2008-07-15 at 15:55 -0400, Peter Staubach wrote:
>
>> It seems to me that as long as we don't shut down a connection
>> which is actively being used for an outstanding request, then
>> we shouldn't have any larger problems with the duplicate caches
>> on servers than we do now.
>>
>> We can do this easily enough by reference counting the connection
>> state and then only closing connections which are not being
>> referenced.
>>
>
> Agreed.
>
>
>> A gain would be that we could reduce the numbers of connections on
>> active clients if we could disassociate a connection with a
>> particular mounted file system. As long as we can achieve maximum
>> network bandwidth through a single connection, then we don't need
>> more than one connection per server.
>>
>
> Isn't that pretty much the norm today anyway? The only call to
> rpc_create() that I can find is made when creating the nfs_client
> structure. All other NFS-related rpc connections are created as clones
> of the above shared structure, and thus share the same rpc_xprt.
>

Well, it is the norm for the shared superblock situation, yes.

> I'm not sure that we want to share connections in the cases where we
> can't share the same nfs_client, since that usually means that RPC-level
> parameters such as timeout values or NFS protocol versions differ.
>

I think the TCP connection can be managed independent of these things.

>> We could handle the case where the client was talking to more
>> servers than it had connection space for by forcibly, but safely,
>> closing connections to servers and then using the space for a
>> new connection to a server. We could do this in the connection
>> manager by checking to see if there was an available connection
>> which was not marked as in the process of being closed. If so,
>> then it just enters the fray as needing a connection and works
>> like all of the others.
>>
>> The algorithm could look something like:
>>
>> top:
>>     Look for a connection to the right server which is not marked
>>     as being closed.
>>     If one was found, then increment its reference count and
>>     return it.
>>     Attempt to create a new connection.
>>     If this works, then increment its reference count and
>>     return it.
>>     Find a connection to be closed, either one not being currently
>>     used or via some heuristic like round-robin.
>>     If this connection is not actively being used, then close it
>>     and go to top.
>>     Mark the connection as being closed, wait until it is closed,
>>     and then go to top.
>>
>
> Actually, what you really want to do is look at whether or not any of
> the rpc slots are in use. If they aren't, then you are free to close
> the connection; if not, go to the next.
>

I think that we would still need to be able to handle the situation
where we needed a connection and all connections appeared to be in use.
I think that the ability to force a connection to become unused would
be required.

> Unfortunately, you still can't get rid of the 2 minute TIME_WAIT state
> in the case of a TCP connection, so I'm not sure how useful this will
> turn out to be...

Well, there is that... :-) It sure seems like there has to be some way
of dealing with that thing.

       Thanx...

          ps

^ permalink raw reply	[flat|nested] 16+ messages in thread
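"Forcing a connection to become unused", as discussed above, amounts to a drain: refuse new requests, wait for outstanding slots to empty, then close. A minimal user-space Python sketch of that drain protocol (illustrative only; the class and method names are invented, and in the kernel this would be built from the sunrpc transport's own locking and wait queues):

```python
import threading

class DrainableConnection:
    def __init__(self):
        self.lock = threading.Lock()
        self.idle = threading.Condition(self.lock)
        self.slots_in_use = 0
        self.closing = False

    def acquire_slot(self):
        with self.lock:
            if self.closing:
                return False          # caller must find another connection
            self.slots_in_use += 1
            return True

    def release_slot(self):
        with self.lock:
            self.slots_in_use -= 1
            if self.slots_in_use == 0:
                self.idle.notify_all()  # wake anyone draining us

    def close_when_unused(self):
        with self.lock:
            self.closing = True         # no new requests from here on
            while self.slots_in_use:    # wait for in-flight RPCs to finish
                self.idle.wait()
        # ...the actual socket close would happen here, after which the
        # TCP TIME_WAIT state Trond mentions still applies.
```

This guarantees no request in flight is ever cut off, which is what keeps the server duplicate-cache behavior no worse than it is today; it does nothing about TIME_WAIT.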
* Re: Performance Diagnosis
  2008-07-15 19:55             ` Peter Staubach
  2008-07-15 20:27               ` Trond Myklebust
@ 2008-07-15 21:15               ` Talpey, Thomas
  1 sibling, 0 replies; 16+ messages in thread
From: Talpey, Thomas @ 2008-07-15 21:15 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Trond Myklebust, chucklever, Andrew Bell, linux-nfs

At 03:55 PM 7/15/2008, Peter Staubach wrote:
>Trond Myklebust wrote:
>> On Tue, 2008-07-15 at 15:21 -0400, Peter Staubach wrote:
>>
>>> The connection manager would seem to be a RPC level thing, although
>>> I haven't thought through the ramifications of the NFSv4.1 stuff
>>> and how it might impact a connection manager sufficiently.
>>>
>>
>> We already have the scheme that shuts down connections on inactive RPC
>> clients after a suitable timeout period, so the only gains I can see
>> would have to involve shutting down connections on active clients.
>>
>> At that point, the danger isn't with NFSv4.1, it is rather with
>> NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
>> mean that you have to be very careful about schemes that involve
>> shutting down connections on active RPC clients.
>
>It seems to me that as long as we don't shut down a connection
>which is actively being used for an outstanding request, then
>we shouldn't have any larger problems with the duplicate caches
>on servers than we do now.
>
>We can do this easily enough by reference counting the connection
>state and then only closing connections which are not being
>referenced.
>
>I definitely agree, shutting down a connection which is being used
>is just inviting trouble.
>
>A gain would be that we could reduce the numbers of connections on
>active clients if we could disassociate a connection with a
>particular mounted file system. As long as we can achieve maximum
>network bandwidth through a single connection, then we don't need
>more than one connection per server.

Not quite! Getting full network bandwidth is one requirement, but
having the slots to use it is another! The problem with sharing a
mount currently is that the slot table is preallocated at mount time;
each time the mount is shared, the slots become less and less adequate
to the task.

If we include growing the slot table with sharing the connection, and
having some sort of non-starvation so readaheads and deep random read
workloads don't hog the slots and block out getattrs, then I agree.

The v4.1 session brings this to the top level btw, by explicitly
negotiating these limits end to end.

>
>We could handle the case where the client was talking to more
>servers than it had connection space for by forcibly, but safely,
>closing connections to servers and then using the space for a
>new connection to a server. We could do this in the connection
>manager by checking to see if there was an available connection
>which was not marked as in the process of being closed. If so,
>then it just enters the fray as needing a connection and works
>like all of the others.
>
>The algorithm could look something like:
>
>top:
>    Look for a connection to the right server which is not marked
>    as being closed.
>    If one was found, then increment its reference count and

...increase its slot count and...

>    return it.
>    Attempt to create a new connection.
>    If this works, then increment its reference count and
>    return it.
>    Find a connection to be closed, either one not being currently
>    used or via some heuristic like round-robin.
>    If this connection is not actively being used, then close it
>    and go to top.
>    Mark the connection as being closed, wait until it is closed,
>    and then go to top.
>
>I know that this is rough and there are several races that I
>glossed over, but hopefully, this will outline the general bones
>of a solution.

There is one other *very* important thing to note. The RPC XID is
managed on a per-mount basis in Linux; two different mount points can
have duplicate XIDs. There is no ambiguity at the server, because the
two mounts, with two connections, have different IP 5-tuples. But if
mounts are shared, then we need to be sure that XIDs are also shared,
to avoid reply cache collisions.

Tom.

I'd like to add that sharing a connection when only one is

>
>When the system is having to recycle connections, it may slow down,
>but at least it will work and not have things just fail.
>
>        Thanx...
>
>           ps
>--
>To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread
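Tom's XID point can be made concrete with a small user-space Python sketch (not the actual sunrpc data structures; `SharedTransport` and `Mount` are invented names): if each mount kept its own XID counter, two mounts sharing one connection could emit duplicate XIDs and collide in the server's reply cache, so the counter must live on the shared transport.

```python
import itertools

class SharedTransport:
    """One TCP connection; owns the XID counter for every mount on it."""
    def __init__(self, seed=0x12345678):
        self._xids = itertools.count(seed)

    def next_xid(self):
        # XIDs are 32-bit values in ONC RPC, so wrap at 2^32.
        return next(self._xids) & 0xFFFFFFFF

class Mount:
    def __init__(self, transport):
        self.transport = transport

    def send_request(self):
        # Every mount draws from the transport's counter, never its own,
        # so XIDs are unique per connection regardless of sharing.
        return self.transport.next_xid()
```

With a per-transport counter, any interleaving of requests from the sharing mounts yields distinct XIDs on that connection, which is exactly the property the server's reply cache needs.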
* Re: Performance Diagnosis
  2008-07-15 19:35           ` Trond Myklebust
  2008-07-15 19:55             ` Peter Staubach
@ 2008-07-16  7:35             ` Benny Halevy
  1 sibling, 0 replies; 16+ messages in thread
From: Benny Halevy @ 2008-07-16  7:35 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Peter Staubach, chucklever, Andrew Bell, linux-nfs

On Jul. 15, 2008, 22:35 +0300, Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> On Tue, 2008-07-15 at 15:21 -0400, Peter Staubach wrote:
>> The connection manager would seem to be a RPC level thing, although
>> I haven't thought through the ramifications of the NFSv4.1 stuff
>> and how it might impact a connection manager sufficiently.
>
> We already have the scheme that shuts down connections on inactive RPC
> clients after a suitable timeout period, so the only gains I can see
> would have to involve shutting down connections on active clients.
>
> At that point, the danger isn't with NFSv4.1, it is rather with
> NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
> mean that you have to be very careful about schemes that involve
> shutting down connections on active RPC clients.

One more thing to consider about NFSv4.1 is the back channel, which
uses one of the forward-going connections. You may need to keep it
alive while you hold state on the client (data/dir delegations,
layouts, etc.)

Benny

>
> Cheers
>   Trond
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2008-07-16 7:35 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-15 15:34 Performance Diagnosis Andrew Bell
[not found] ` <e80abd30807150834m47a1b86cle39885150f1d5bfd-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-15 15:49 ` Chuck Lever
2008-07-15 15:58 ` Peter Staubach
2008-07-15 16:23 ` Chuck Lever
[not found] ` <76bd70e30807150923r31027edxb0394a220bbe879b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-15 16:34 ` Andrew Bell
[not found] ` <e80abd30807150934tc14e793ydd7aae44b4c3111b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-15 17:20 ` Chuck Lever
2008-07-15 17:44 ` Peter Staubach
2008-07-15 18:17 ` Chuck Lever
[not found] ` <76bd70e30807151117g520f22cj1dfe26b971987d38-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-15 18:51 ` Trond Myklebust
2008-07-15 19:21 ` Peter Staubach
2008-07-15 19:35 ` Trond Myklebust
2008-07-15 19:55 ` Peter Staubach
2008-07-15 20:27 ` Trond Myklebust
2008-07-15 20:48 ` Peter Staubach
2008-07-15 21:15 ` Talpey, Thomas
2008-07-16 7:35 ` Benny Halevy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox