* Re: sendmsg blocking with sendtimeout vs. non-blocking [not found] ` <4a4634330810230926m9285a3etf6340e3de53c576@mail.gmail.com> @ 2008-10-23 16:42 ` Steve French 2008-10-23 17:40 ` Steve French 2008-10-23 18:00 ` Jeff Layton 0 siblings, 2 replies; 24+ messages in thread From: Steve French @ 2008-10-23 16:42 UTC (permalink / raw) To: Shirish Pargaonkar Cc: Jeff Layton, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar <shirishpargaonkar@gmail.com> wrote: > On Thu, Oct 23, 2008 at 9:54 AM, Steve French <smfrench@gmail.com> wrote: >>> The big question is what happens under heavy memory pressure. In that >>> case we may not be able to copy the data from the buffer or iov into >>> the skbuf. With non-blocking I/O kernel_sendmsg will return error >>> if nothing was copied (probably -EAGAIN or -ENOSPC). If some of the >>> data could be copied, then it'll return a positive value that was less >>> than the size of the send. I think we are ok, as long as in all paths we either: a) resend the remainder of the SMB (by retrying what is left in the buffer or b) kill the tcp session, reconnect (and either retry with a new handle or return a EHOSTDOWN error to the vfs depending on the path) >> >>> Regardless of whether we use blocking I/O or not, kernel_sendmsg could >>> still copy less data than was requested. The patch needs to account >>> for this, probably by retrying the send with the remaining data if it >>> occurs. >> Yes - if that is true we need to retry at least once on this. >> >>> I don't think Shirish's patch is likely fixing this problem, but is >>> probably just making it harder for it to occur. >>> >>> I think, overall that blocking I/O makes sense. It should make error >>> handling easier and I don't see any real disadvantage. We might >>> consider using longer timeouts, removing the MSG_NOSIGNAL flag and >>> changing the code to deal with the possibility of catching a signal >>> during the kernel_sendmsg call. >>> >>> Also, we're setting the sk_sndbuf and sk_rcvbuf in ipv4_connect. >>> According to Neil, that may be hurting us here. This has the effect of >>> placing a hard limit on the send buffer for the socket, and may be >>> making it more likely that we get an error back from kernel_sendmsg. >> Yes - Sridhar and Dave Stevens mentioned that too. We haven't needed >> that (setting sndbuf) for at least a few kernel versions (is the auto socket >> buffer size tuning working as far back as RHEL5/2.6.18 and >> SLES10 2.6.16 too?) >> >> Shirish, >> I thought that that was part of your patch? > > No it is not but this is an easy change, just do not set them. As I look into this sndbuf and rcvbuf size setting ... what concerns me is why nfs sets these sizes for snd and rcvbuf sizes still if they don't need to be set? We (cifs) have larger write sizes (56K) than nfs's default. See svc_set_sockbufsize in net/sunrpc/svcsock.c -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 16:42 ` sendmsg blocking with sendtimeout vs. non-blocking Steve French @ 2008-10-23 17:40 ` Steve French 2008-10-23 17:46 ` Evgeniy Polyakov 2008-10-23 18:00 ` Jeff Layton 1 sibling, 1 reply; 24+ messages in thread From: Steve French @ 2008-10-23 17:40 UTC (permalink / raw) To: Shirish Pargaonkar Cc: Jeff Layton, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz On Thu, Oct 23, 2008 at 11:42 AM, Steve French <smfrench@gmail.com> wrote: > On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar > As I look into this sndbuf and rcvbuf size setting ... what concerns > me is why nfs sets these sizes for snd and rcvbuf sizes still if they > don't need to be set? We (cifs) have larger write sizes (56K) than > nfs's default. See svc_set_sockbufsize in net/sunrpc/svcsock.c What is the valid range for the sndbuf and rcvbuf size so I can sanity check this if the user overrides it on mount? -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 17:40 ` Steve French @ 2008-10-23 17:46 ` Evgeniy Polyakov 0 siblings, 0 replies; 24+ messages in thread From: Evgeniy Polyakov @ 2008-10-23 17:46 UTC (permalink / raw) To: Steve French Cc: Shirish Pargaonkar, Jeff Layton, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz Hi. On Thu, Oct 23, 2008 at 12:40:05PM -0500, Steve French (smfrench@gmail.com) wrote: > On Thu, Oct 23, 2008 at 11:42 AM, Steve French <smfrench@gmail.com> wrote: > > On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar > > As I look into this sndbuf and rcvbuf size setting ... what concerns > > me is why nfs sets these sizes for snd and rcvbuf sizes still if they > > don't need to be set? We (cifs) have larger write sizes (56K) than > > nfs's default. See svc_set_sockbufsize in net/sunrpc/svcsock.c > > What is the valid range for the sndbuf and rcvbuf size so I can sanity > check this if the user overrides it on mount? >From zero to infinity. Actual size will be automatically ajusted by the kernel, but will not exceed specified one, so there is no need to tune this parameter. Very likely you do not want to change socket queue sizes, since it may hurt performance, when previously autotuning could rise the maximum socket buffer size. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 16:42 ` sendmsg blocking with sendtimeout vs. non-blocking Steve French 2008-10-23 17:40 ` Steve French @ 2008-10-23 18:00 ` Jeff Layton 2008-10-23 19:19 ` Steve French 1 sibling, 1 reply; 24+ messages in thread From: Jeff Layton @ 2008-10-23 18:00 UTC (permalink / raw) To: Steve French Cc: Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz On Thu, 23 Oct 2008 11:42:49 -0500 "Steve French" <smfrench@gmail.com> wrote: > On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar > <shirishpargaonkar@gmail.com> wrote: > > On Thu, Oct 23, 2008 at 9:54 AM, Steve French <smfrench@gmail.com> wrote: > >>> The big question is what happens under heavy memory pressure. In that > >>> case we may not be able to copy the data from the buffer or iov into > >>> the skbuf. With non-blocking I/O kernel_sendmsg will return error > >>> if nothing was copied (probably -EAGAIN or -ENOSPC). If some of the > >>> data could be copied, then it'll return a positive value that was less > >>> than the size of the send. > I think we are ok, as long as in all paths we either: > a) resend the remainder of the SMB (by retrying what is left in the buffer > or > b) kill the tcp session, reconnect (and either retry with a new handle > or return a EHOSTDOWN error to the vfs depending on the path) > > >> > >>> Regardless of whether we use blocking I/O or not, kernel_sendmsg could > >>> still copy less data than was requested. The patch needs to account > >>> for this, probably by retrying the send with the remaining data if it > >>> occurs. > >> Yes - if that is true we need to retry at least once on this. > >> > >>> I don't think Shirish's patch is likely fixing this problem, but is > >>> probably just making it harder for it to occur. > >>> > >>> I think, overall that blocking I/O makes sense. It should make error > >>> handling easier and I don't see any real disadvantage. We might > >>> consider using longer timeouts, removing the MSG_NOSIGNAL flag and > >>> changing the code to deal with the possibility of catching a signal > >>> during the kernel_sendmsg call. > >>> > >>> Also, we're setting the sk_sndbuf and sk_rcvbuf in ipv4_connect. > >>> According to Neil, that may be hurting us here. This has the effect of > >>> placing a hard limit on the send buffer for the socket, and may be > >>> making it more likely that we get an error back from kernel_sendmsg. > >> Yes - Sridhar and Dave Stevens mentioned that too. We haven't needed > >> that (setting sndbuf) for at least a few kernel versions (is the auto socket > >> buffer size tuning working as far back as RHEL5/2.6.18 and > >> SLES10 2.6.16 too?) > >> > >> Shirish, > >> I thought that that was part of your patch? > > > > No it is not but this is an easy change, just do not set them. > > As I look into this sndbuf and rcvbuf size setting ... what concerns > me is why nfs sets these sizes for snd and rcvbuf sizes still if they > don't need to be set? We (cifs) have larger write sizes (56K) than > nfs's default. See svc_set_sockbufsize in net/sunrpc/svcsock.c > AFAICT, those size settings have been there for a long time (2002?). My guess is that this may predate the socket autotuning stuff that Neil H. mentioned to me. His statement was basically "unless you know for sure that you don't want to use more than X amount of memory, then there isn't much reason to set the send and receive buffers". -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 18:00 ` Jeff Layton @ 2008-10-23 19:19 ` Steve French 2008-10-23 19:35 ` Jeff Layton 0 siblings, 1 reply; 24+ messages in thread From: Steve French @ 2008-10-23 19:19 UTC (permalink / raw) To: Jeff Layton, Evgeniy Polyakov Cc: Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz, Jim Rees On Thu, Oct 23, 2008 at 1:00 PM, Jeff Layton <jlayton@redhat.com> wrote: > On Thu, 23 Oct 2008 11:42:49 -0500 > "Steve French" <smfrench@gmail.com> wrote: > >> On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar >> <shirishpargaonkar@gmail.com> wrote: >> > On Thu, Oct 23, 2008 at 9:54 AM, Steve French <smfrench@gmail.com> wrote: >> >>> The big question is what happens under heavy memory pressure. In that >> >>> case we may not be able to copy the data from the buffer or iov into >> >>> the skbuf. With non-blocking I/O kernel_sendmsg will return error >> >>> if nothing was copied (probably -EAGAIN or -ENOSPC). If some of the >> >>> data could be copied, then it'll return a positive value that was less >> >>> than the size of the send. >> I think we are ok, as long as in all paths we either: >> a) resend the remainder of the SMB (by retrying what is left in the buffer >> or >> b) kill the tcp session, reconnect (and either retry with a new handle >> or return a EHOSTDOWN error to the vfs depending on the path) >> >> >> >> >>> Regardless of whether we use blocking I/O or not, kernel_sendmsg could >> >>> still copy less data than was requested. The patch needs to account >> >>> for this, probably by retrying the send with the remaining data if it >> >>> occurs. >> >> Yes - if that is true we need to retry at least once on this. >> >> >> >>> I don't think Shirish's patch is likely fixing this problem, but is >> >>> probably just making it harder for it to occur. >> >>> >> >>> I think, overall that blocking I/O makes sense. It should make error >> >>> handling easier and I don't see any real disadvantage. We might >> >>> consider using longer timeouts, removing the MSG_NOSIGNAL flag and >> >>> changing the code to deal with the possibility of catching a signal >> >>> during the kernel_sendmsg call. >> >>> >> >>> Also, we're setting the sk_sndbuf and sk_rcvbuf in ipv4_connect. >> >>> According to Neil, that may be hurting us here. This has the effect of >> >>> placing a hard limit on the send buffer for the socket, and may be >> >>> making it more likely that we get an error back from kernel_sendmsg. >> >> Yes - Sridhar and Dave Stevens mentioned that too. We haven't needed >> >> that (setting sndbuf) for at least a few kernel versions (is the auto socket >> >> buffer size tuning working as far back as RHEL5/2.6.18 and >> >> SLES10 2.6.16 too?) >> >> >> >> Shirish, >> >> I thought that that was part of your patch? >> > >> > No it is not but this is an easy change, just do not set them. >> >> As I look into this sndbuf and rcvbuf size setting ... what concerns >> me is why nfs sets these sizes for snd and rcvbuf sizes still if they >> don't need to be set? We (cifs) have larger write sizes (56K) than >> nfs's default. See svc_set_sockbufsize in net/sunrpc/svcsock.c >> > > AFAICT, those size settings have been there for a long time (2002?). My > guess is that this may predate the socket autotuning stuff that Neil H. > mentioned to me. > > His statement was basically "unless you know for sure that you don't > want to use more than X amount of memory, then there isn't much reason > to set the send and receive buffers". I think that there is a problem still - cifs needs tcp autotuning but the buffers probably need to be at least as big as the largest SMB response (about 17K on receives, but configurable to be larger). See comment below about NFS: On Thu, Oct 23, 2008 at 2:05 PM, Jim Rees <rees@umich.edu> wrote: > There are two issues to be aware of. One is that the socket buffers have to > be big enough for the tcp congestion window. In the old days, the > application would have to know ahead of time how big this is, and call > setsockopt(), which sets these numbers. > > Now however, the tcp stack "autotunes" the buffer sizes to the correct > amount. If you call setsockopt() to set a buffer size, or set sk_*buf, or > set the SOCK_*BUF_LOCK bits in sk_userlocks, you defeat this autotuning. > This is almost always a bad idea. > > The other issue is that at least for NFS, the receive buffer must be big > enough to hold the biggest possible rpc. If not, a partial rpc will get > stuck in the buffer and no progress will be made. > > Issue one is easy to deal with, just don't muck with the socket internal > data structure. The second one is harder. What's really needed is a new > api into the tcp layer that will reserve a minimum amount of memory in the > socket buffer so that receives won't stall. For now, our patch sets the > sk_*buf values without setting the lock flags. -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 19:19 ` Steve French @ 2008-10-23 19:35 ` Jeff Layton 2008-10-23 19:40 ` Steve French 2008-10-23 20:02 ` Jim Rees 0 siblings, 2 replies; 24+ messages in thread From: Jeff Layton @ 2008-10-23 19:35 UTC (permalink / raw) To: Steve French Cc: Evgeniy Polyakov, Frank S Filz, nhorman, Suresh Jayaraman, linux-fsdevel, linux-cifs-client@lists.samba.org, Jim Rees On Thu, 23 Oct 2008 14:19:04 -0500 "Steve French" <smfrench@gmail.com> wrote: > > His statement was basically "unless you know for sure that you don't > > want to use more than X amount of memory, then there isn't much reason > > to set the send and receive buffers". > > I think that there is a problem still - cifs needs tcp autotuning but > the buffers probably need to be at least as big as the largest SMB > response (about 17K on receives, but configurable to be larger). See > comment below about NFS: > On Thu, Oct 23, 2008 at 2:05 PM, Jim Rees <rees@umich.edu> wrote: > There are two issues to be aware of. One is that the socket buffers have to > be big enough for the tcp congestion window. In the old days, the > application would have to know ahead of time how big this is, and call > setsockopt(), which sets these numbers. > > Now however, the tcp stack "autotunes" the buffer sizes to the correct > amount. If you call setsockopt() to set a buffer size, or set sk_*buf, or > set the SOCK_*BUF_LOCK bits in sk_userlocks, you defeat this autotuning. > This is almost always a bad idea. > > The other issue is that at least for NFS, the receive buffer must be big > enough to hold the biggest possible rpc. If not, a partial rpc will get > stuck in the buffer and no progress will be made. > > Issue one is easy to deal with, just don't muck with the socket internal > data structure. The second one is harder. What's really needed is a new > api into the tcp layer that will reserve a minimum amount of memory in the > socket buffer so that receives won't stall. For now, our patch sets the > sk_*buf values without setting the lock flags. Agreed. Thanks Jim, for sending along that info... It sounds like what we should do for CIFS is the same thing. Set the buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that they can grow as needed. Does the kernel ever shrink these buffers? -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 19:35 ` Jeff Layton @ 2008-10-23 19:40 ` Steve French 2008-10-23 20:59 ` Evgeniy Polyakov 2008-10-23 20:02 ` Jim Rees 1 sibling, 1 reply; 24+ messages in thread From: Steve French @ 2008-10-23 19:40 UTC (permalink / raw) To: Jeff Layton Cc: Evgeniy Polyakov, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz, Jim Rees On Thu, Oct 23, 2008 at 2:35 PM, Jeff Layton <jlayton@redhat.com> wrote: > On Thu, 23 Oct 2008 14:19:04 -0500 > "Steve French" <smfrench@gmail.com> wrote: >> > His statement was basically "unless you know for sure that you don't >> > want to use more than X amount of memory, then there isn't much reason >> > to set the send and receive buffers". >> >> I think that there is a problem still - cifs needs tcp autotuning but >> the buffers probably need to be at least as big as the largest SMB >> response (about 17K on receives, but configurable to be larger). See >> comment below about NFS: >> > > On Thu, Oct 23, 2008 at 2:05 PM, Jim Rees <rees@umich.edu> wrote: >> There are two issues to be aware of. One is that the socket buffers have to >> be big enough for the tcp congestion window. In the old days, the >> application would have to know ahead of time how big this is, and call >> setsockopt(), which sets these numbers. >> >> Now however, the tcp stack "autotunes" the buffer sizes to the correct >> amount. If you call setsockopt() to set a buffer size, or set sk_*buf, or >> set the SOCK_*BUF_LOCK bits in sk_userlocks, you defeat this autotuning. >> This is almost always a bad idea. >> >> The other issue is that at least for NFS, the receive buffer must be big >> enough to hold the biggest possible rpc. If not, a partial rpc will get >> stuck in the buffer and no progress will be made. >> >> Issue one is easy to deal with, just don't muck with the socket internal >> data structure. The second one is harder. What's really needed is a new >> api into the tcp layer that will reserve a minimum amount of memory in the >> socket buffer so that receives won't stall. For now, our patch sets the >> sk_*buf values without setting the lock flags. > > > Agreed. Thanks Jim, for sending along that info... > > It sounds like what we should do for CIFS is the same thing. Set the > buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that > they can grow as needed. CIFS, unlike NFS, never set the SOCK_*BUF_LOCK flags. I think we already do the same thing as NFS, they still are turning off autotuning on the client right? If we could set a "sk_min_sndbuf_size" someday (to e.g. twice the size of the cifs write frames ie about 112K) - would that be enough? -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 19:40 ` Steve French @ 2008-10-23 20:59 ` Evgeniy Polyakov 2008-10-24 1:18 ` Steve French 0 siblings, 1 reply; 24+ messages in thread From: Evgeniy Polyakov @ 2008-10-23 20:59 UTC (permalink / raw) To: Steve French Cc: Frank S Filz, nhorman, Jeff Layton, Suresh Jayaraman, linux-fsdevel, linux-cifs-client@lists.samba.org, Jim Rees On Thu, Oct 23, 2008 at 02:40:01PM -0500, Steve French (smfrench@gmail.com) wrote: > I think we already do the same thing as NFS, they still are turning > off autotuning on the client right? > > If we could set a "sk_min_sndbuf_size" someday (to e.g. twice the size > of the cifs write frames ie about 112K) - would that be enough? You do not need to set sockt buffers, but instead read data in chunks, which will automatically make a TCP progress. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 20:59 ` Evgeniy Polyakov @ 2008-10-24 1:18 ` Steve French 2008-10-24 1:30 ` Jim Rees 2008-10-24 1:52 ` Trond Myklebust 0 siblings, 2 replies; 24+ messages in thread From: Steve French @ 2008-10-24 1:18 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz, Jim Rees On Thu, Oct 23, 2008 at 3:59 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote: > On Thu, Oct 23, 2008 at 02:40:01PM -0500, Steve French (smfrench@gmail.com) wrote: >> I think we already do the same thing as NFS, they still are turning >> off autotuning on the client right? >> >> If we could set a "sk_min_sndbuf_size" someday (to e.g. twice the size >> of the cifs write frames ie about 112K) - would that be enough? > > > You do not need to set sockt buffers, but instead read data in chunks, > which will automatically make a TCP progress. If that is the case (ie that cifs and nfs never need to set these over tcp - I am still having trouble reconciling that with the NFS guys' comments that they must set rcvbuf (and Jim's comment below) > The other issue is that at least for NFS, the receive buffer must be big > enough to hold the biggest possible rpc. If not, a partial rpc will get > stuck in the buffer and no progress will be made. > -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 1:18 ` Steve French @ 2008-10-24 1:30 ` Jim Rees 2008-10-24 1:48 ` Jeff Layton 2008-10-24 12:53 ` Neil Horman 2008-10-24 1:52 ` Trond Myklebust 1 sibling, 2 replies; 24+ messages in thread From: Jim Rees @ 2008-10-24 1:30 UTC (permalink / raw) To: Steve French Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz Steve French wrote: If that is the case (ie that cifs and nfs never need to set these over tcp - I am still having trouble reconciling that with the NFS guys' comments that they must set rcvbuf (and Jim's comment below) If you have an application that wants to read a big chunk of data from a socket, and won't remove any of that data from the socket until the entire chunk has arrived, then the application must set the receive socket buffer size big enough to hold the entire chunk. Otherwise the application will stall. As far as I can tell, the corresponding situation does not hold for send buffers, because the tcp layer will grow the send buffer to be big enough to hold whatever the application wants to send. But I don't know if that's always true, or if there is some limit, so to be safe our NFS patch continues to set both the receive and send buffer sizes. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 1:30 ` Jim Rees @ 2008-10-24 1:48 ` Jeff Layton 2008-10-24 12:53 ` Neil Horman 1 sibling, 0 replies; 24+ messages in thread From: Jeff Layton @ 2008-10-24 1:48 UTC (permalink / raw) To: Jim Rees Cc: Steve French, Evgeniy Polyakov, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz On Thu, 23 Oct 2008 21:30:32 -0400 Jim Rees <rees@umich.edu> wrote: > Steve French wrote: > > If that is the case (ie that cifs and nfs never need to set these over > tcp - I am still having trouble reconciling that with the NFS guys' > comments that they must set rcvbuf (and Jim's comment below) > > If you have an application that wants to read a big chunk of data from a > socket, and won't remove any of that data from the socket until the entire > chunk has arrived, then the application must set the receive socket buffer > size big enough to hold the entire chunk. Otherwise the application will > stall. > > As far as I can tell, the corresponding situation does not hold for send > buffers, because the tcp layer will grow the send buffer to be big enough to > hold whatever the application wants to send. But I don't know if that's > always true, or if there is some limit, so to be safe our NFS patch > continues to set both the receive and send buffer sizes. Right, that's not actually the case for CIFS, so I don't think we need to have CIFS tune the receive buffer. cifs_demultiplex_thread just slurps data off of the socket as it comes without concern about whether we have a whole request or not. We may end up having to call kernel_recvmsg a few extra times, but I don't think we need to worry about it getting stuck. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 1:30 ` Jim Rees 2008-10-24 1:48 ` Jeff Layton @ 2008-10-24 12:53 ` Neil Horman 2008-10-24 13:59 ` Jim Rees 2008-10-24 14:23 ` Steve French 1 sibling, 2 replies; 24+ messages in thread From: Neil Horman @ 2008-10-24 12:53 UTC (permalink / raw) To: Jim Rees Cc: Steve French, Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz On Thu, Oct 23, 2008 at 09:30:32PM -0400, Jim Rees wrote: > Steve French wrote: > > If that is the case (ie that cifs and nfs never need to set these over > tcp - I am still having trouble reconciling that with the NFS guys' > comments that they must set rcvbuf (and Jim's comment below) > > If you have an application that wants to read a big chunk of data from a > socket, and won't remove any of that data from the socket until the entire > chunk has arrived, then the application must set the receive socket buffer > size big enough to hold the entire chunk. Otherwise the application will > stall. > Thats not true. Assuming there is sufficient memory available, and the tcp segments receive window isn't closed, then you will continue to receive data, as not setting SO_RCVBUF will allow the receive buffers to autotune in size. I know this is getting off the subject a bit, but if cifs is waiting for an entire message to arrive on a tcp segment before it reads it, thats a cifs design flaw. While you can send messages accross stream oriented sockets, you can't enforce message boundaries for data still in the receive buffer. Since the application (cifs in this case) can't (and shouldn't) have visibility into the tcp sockets congestion window or recieve window, you'll never really know if you'll be waiting forever for data that is waiting for you to read it out of the socket. The correct solution is to read data out of the socket into a temporary buffer that you allocate in the cifs code, and loop repetatively calling recv[msg|from] until such time as the entire message has been read. > As far as I can tell, the corresponding situation does not hold for send > buffers, because the tcp layer will grow the send buffer to be big enough to > hold whatever the application wants to send. But I don't know if that's > always true, or if there is some limit, so to be safe our NFS patch > continues to set both the receive and send buffer sizes. receive buffers do the same thing as send buffers in regards to autotuning. You may have had to set SO_RCVBUF previously, but that should no longer be the case Neil -- /*************************************************** *Neil Horman *Senior Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 12:53 ` Neil Horman @ 2008-10-24 13:59 ` Jim Rees 2008-10-24 15:20 ` Evgeniy Polyakov 2008-10-24 15:29 ` Neil Horman 2008-10-24 14:23 ` Steve French 1 sibling, 2 replies; 24+ messages in thread From: Jim Rees @ 2008-10-24 13:59 UTC (permalink / raw) To: Neil Horman Cc: Steve French, Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz Neil Horman wrote: Thats not true. Assuming there is sufficient memory available, and the tcp segments receive window isn't closed, then you will continue to receive data, as not setting SO_RCVBUF will allow the receive buffers to autotune in size. My explanation may be flawed but the behavior is real. If you turn on autotuning on the nfsd socket, and don't do anything with the buffer sizes, nfsd will stall. This is in a 2.6.24 kernel. I would of course welcome independent verification. I thought that the receive buffer size would only grow as large as the receiver's advertised window, but I'll admit I don't fully understand the autotuning code. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 13:59 ` Jim Rees @ 2008-10-24 15:20 ` Evgeniy Polyakov 2008-10-24 15:36 ` Jim Rees 2008-10-24 15:29 ` Neil Horman 1 sibling, 1 reply; 24+ messages in thread From: Evgeniy Polyakov @ 2008-10-24 15:20 UTC (permalink / raw) To: Jim Rees Cc: Neil Horman, Steve French, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz Hi. On Fri, Oct 24, 2008 at 09:59:36AM -0400, Jim Rees (rees@umich.edu) wrote: > My explanation may be flawed but the behavior is real. If you turn on > autotuning on the nfsd socket, and don't do anything with the buffer sizes, > nfsd will stall. This is in a 2.6.24 kernel. I would of course welcome > independent verification. That's effectively a problem related to wait for full request to be in the socket queue, which apparently can not grow infinitely and for huge enough request may 'overflow' currently calculated buffer size. I do not know why nfsd decided to only read request from the socket, when it was fully received, probably because it was decided that playing with additional locks (or socket lock itself) to safely demultiplex request is not appropriate and instead system could read it in one kernel_recvmsg() or friend call with own internal locking. Whatever its decision was based on, it does not apply to cifs which reads data in own demultiplexin thread via loops of kernel_recvmsg()s. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 15:20 ` Evgeniy Polyakov @ 2008-10-24 15:36 ` Jim Rees 0 siblings, 0 replies; 24+ messages in thread From: Jim Rees @ 2008-10-24 15:36 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Neil Horman, Steve French, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz Evgeniy Polyakov wrote: I do not know why nfsd decided to only read request from the socket, when it was fully received, I assume it's because nfsd can't copy the data for performance reasons, but doesn't want to release any of the buffer until the entire rpc has arrived, so it can guarantee atomicity. But I'm not a nfsd expert. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 13:59 ` Jim Rees 2008-10-24 15:20 ` Evgeniy Polyakov @ 2008-10-24 15:29 ` Neil Horman 1 sibling, 0 replies; 24+ messages in thread From: Neil Horman @ 2008-10-24 15:29 UTC (permalink / raw) To: Jim Rees Cc: Neil Horman, Steve French, Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz On Fri, Oct 24, 2008 at 09:59:36AM -0400, Jim Rees wrote: > Neil Horman wrote: > > Thats not true. Assuming there is sufficient memory available, and the tcp > segments receive window isn't closed, then you will continue to receive data, as > not setting SO_RCVBUF will allow the receive buffers to autotune in size. > > My explanation may be flawed but the behavior is real. If you turn on > autotuning on the nfsd socket, and don't do anything with the buffer sizes, > nfsd will stall. This is in a 2.6.24 kernel. I would of course welcome > independent verification. > I'll veryify when I get a chance, and I believe that what you say about nfsd is true, but I don't think its directly a result of autotuning (see below) > I thought that the receive buffer size would only grow as large as the > receiver's advertised window, but I'll admit I don't fully understand the > autotuning code. This is roughly correct, since thats as large as the buffer ever really needs to be. I say roughly because it will actually be somewhat larger than the window (allowing for skb overhead). I would need to verify this as well, but I would expect whats happening in the nfsd code is that by setting SO_RCVBUF, you are forcing the receive window to be sufficiently large to hold an entire rpc transaction. While this works, It does force the nfsd code to have some understanding of the underlying transport. It might be better (subject to much debate of course), to allocate a temporary buffer in the nfsd receve code that can hold a maximally sized rpc transaction, and simply call kernel_recvmsg with a timeout associated for whatever size you were expecting. You may get less than that back, but you can then store it in the temp buffer, update the appropriate pointers and sizes and call recvmsg again. that keeps the receive window open and data streaming in. Using the tcp receive buffer to force recives on application message boundaries is fraught with problems like this (in any application). Its better to use some extra ram to handle your application message reassembly. Neil -- /*************************************************** *Neil Horman *Senior Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 12:53 ` Neil Horman 2008-10-24 13:59 ` Jim Rees @ 2008-10-24 14:23 ` Steve French 2008-10-24 15:02 ` Shirish Pargaonkar 1 sibling, 1 reply; 24+ messages in thread From: Steve French @ 2008-10-24 14:23 UTC (permalink / raw) To: Neil Horman Cc: Frank S Filz, Jeff Layton, Suresh Jayaraman, linux-cifs-client@lists.samba.org, linux-fsdevel, Evgeniy Polyakov, Jim Rees [-- Attachment #1.1: Type: text/plain, Size: 818 bytes --] On Fri, Oct 24, 2008 at 7:53 AM, Neil Horman <nhorman@redhat.com> wrote: > The correct solution is to read data out of the socket into a temporary > buffer that you allocate in the cifs code, and loop repetatively calling > recv[msg|from] until such time as the entire message has been read. > > > That is what the cifs code does in cifs_demultiplex_thread. Shirish, If we can verify that performance does not appear to be impacted (e.g. some dbench runs against remote server and also against localhost/samba) when removing the few lines of code in (cifs's) ipv4_connect in which cifs sets sndbuf/rcvbuf (we never set them in the case fortunately) - then I will remove those lines immediately (and I am ok with not having a mount option to turn off the autotuning ie revert to old behavior) -- Thanks, Steve [-- Attachment #1.2: Type: text/html, Size: 1186 bytes --] [-- Attachment #2: Type: text/plain, Size: 172 bytes --] _______________________________________________ linux-cifs-client mailing list linux-cifs-client@lists.samba.org https://lists.samba.org/mailman/listinfo/linux-cifs-client ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 14:23 ` Steve French @ 2008-10-24 15:02 ` Shirish Pargaonkar 2008-10-24 16:18 ` Steve French 0 siblings, 1 reply; 24+ messages in thread From: Shirish Pargaonkar @ 2008-10-24 15:02 UTC (permalink / raw) To: Steve French Cc: Frank S Filz, Neil Horman, Jeff Layton, Suresh Jayaraman, linux-fsdevel, Evgeniy Polyakov, linux-cifs-client@lists.samba.org, Jim Rees Steve, Here is a run against a samba share usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_n -c client.txt 1 dbench version 3.04 - Copyright Andrew Tridgell 1999-2004 Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs 1 clients started 1 12 0.00 MB/sec warmup 1 sec 1 12 0.00 MB/sec warmup 2 sec 1 12 0.00 MB/sec warmup 3 sec 1 12 0.00 MB/sec warmup 4 sec 1 12 0.00 MB/sec warmup 5 sec 1 12 0.00 MB/sec warmup 6 sec 1 12 0.00 MB/sec warmup 7 sec 1 12 0.00 MB/sec warmup 8 sec 1 12 0.00 MB/sec warmup 9 sec 1 12 0.00 MB/sec warmup 10 sec 1 12 0.00 MB/sec warmup 11 sec 1 12 0.00 MB/sec warmup 12 sec 1 12 0.00 MB/sec warmup 13 sec 1 12 0.00 MB/sec warmup 14 sec 1 12 0.00 MB/sec warmup 15 sec 1 12 0.00 MB/sec warmup 16 sec 1 12 0.00 MB/sec warmup 17 sec 1 12 0.00 MB/sec warmup 18 sec 1 12 0.00 MB/sec warmup 19 sec 1 12 0.00 MB/sec warmup 20 sec 1 12 0.00 MB/sec warmup 21 sec 1 12 0.00 MB/sec warmup 22 sec 1 12 0.00 MB/sec warmup 23 sec 1 12 0.00 MB/sec warmup 24 sec 1 102 2.82 MB/sec execute 1 sec 1 163 2.42 MB/sec execute 2 sec 1 223 2.24 MB/sec execute 3 sec 1 232 1.75 MB/sec execute 4 sec 1 232 1.40 MB/sec cleanup 5 sec 1 232 1.16 MB/sec cleanup 6 sec 1 232 1.00 MB/sec cleanup 7 sec 1 232 0.87 MB/sec cleanup 8 sec 1 232 0.78 MB/sec cleanup 9 sec 1 232 0.70 MB/sec cleanup 10 sec 1 232 0.63 MB/sec cleanup 11 sec 1 232 0.58 MB/sec cleanup 12 sec 1 232 0.54 MB/sec cleanup 13 sec 1 232 0.50 MB/sec cleanup 14 sec 1 232 0.47 MB/sec cleanup 15 sec 1 232 0.44 MB/sec cleanup 16 sec 1 232 0.41 MB/sec cleanup 17 sec 1 232 0.39 MB/sec cleanup 18 sec 1 232 0.37 MB/sec cleanup 19 sec /bin/rm: cannot remove directory `/mnt/smb_n/clients/client0/~dmtmp/COREL': Directory not empty 1 232 0.35 MB/sec cleanup 20 sec Throughput 1.39629 MB/sec (sync open) (sync dirs) 1 procs And here is against a Windows XP share :/usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_r -c client.txt 1 dbench version 3.04 - Copyright Andrew Tridgell 1999-2004 Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs 1 clients started 1 132 3.01 MB/sec execute 1 sec 1 207 2.68 MB/sec execute 2 sec 1 284 2.61 MB/sec execute 3 sec 1 357 2.51 MB/sec execute 4 sec 1 418 2.40 MB/sec cleanup 5 sec 1 418 2.01 MB/sec cleanup 6 sec 1 418 1.72 MB/sec cleanup 7 sec 1 418 1.51 MB/sec cleanup 8 sec 1 418 1.34 MB/sec cleanup 9 sec 1 418 1.20 MB/sec cleanup 10 sec 1 418 1.10 MB/sec cleanup 11 sec 1 418 1.00 MB/sec cleanup 12 sec 1 418 0.93 MB/sec cleanup 13 sec 1 418 0.86 MB/sec cleanup 14 sec 1 418 0.80 MB/sec cleanup 15 sec 1 418 0.75 MB/sec cleanup 16 sec 1 418 0.71 MB/sec cleanup 17 sec 1 418 0.67 MB/sec cleanup 18 sec 1 418 0.63 MB/sec cleanup 19 sec 1 418 0.60 MB/sec cleanup 20 sec 1 418 0.57 MB/sec cleanup 21 sec 1 418 0.55 MB/sec cleanup 22 sec 1 418 0.52 MB/sec cleanup 23 sec 1 418 0.50 MB/sec cleanup 24 sec /bin/rm: cannot remove directory `/mnt/smb_r/clients/client0/~dmtmp/PARADOX': Directory not empty 1 418 0.49 MB/sec cleanup 24 sec Throughput 2.40957 MB/sec (sync open) (sync dirs) 1 procs And the testrun still running with your patch. Regards, Shirish On Fri, Oct 24, 2008 at 9:23 AM, Steve French <smfrench@gmail.com> wrote: > > > On Fri, Oct 24, 2008 at 7:53 AM, Neil Horman <nhorman@redhat.com> wrote: >> >> The correct solution is to read data out of the socket into a temporary >> buffer that you allocate in the cifs code, and loop repetatively calling >> recv[msg|from] until such time as the entire message has been read. >> >> > That is what the cifs code does in cifs_demultiplex_thread. > > Shirish, > If we can verify that performance does not appear to be impacted (e.g. some > dbench runs against remote server and also against localhost/samba) when > removing the few lines of code in (cifs's) ipv4_connect in which cifs sets > sndbuf/rcvbuf (we never set them in the case fortunately) - then I will > remove those lines immediately (and I am ok with not having a mount option > to turn off the autotuning ie revert to old behavior) > > > > -- > Thanks, > > Steve > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 15:02 ` Shirish Pargaonkar @ 2008-10-24 16:18 ` Steve French 0 siblings, 0 replies; 24+ messages in thread From: Steve French @ 2008-10-24 16:18 UTC (permalink / raw) To: Shirish Pargaonkar Cc: Frank S Filz, Neil Horman, Jeff Layton, Suresh Jayaraman, linux-fsdevel, Evgeniy Polyakov, linux-cifs-client@lists.samba.org, Jim Rees [-- Attachment #1.1: Type: text/plain, Size: 4633 bytes --] A few comments - you need a warmup longer than 1 second, and a test run longer than 5 seconds. A 100 second test run is probably ok - we need the "before and after" comparison (ie compare the performance with the patch and umount and compare the performance with noautotune turned on (old behavior). But in any case something looks wrong - a performance of 1.40MB/sec to Samba is worse than I would expect. Also please try it with more simulated clients (e.g. 50) With dbench 3.04 I am getting more than double what you got with 1 client, although my system could be a bit faster. Also can you make sure that you are running current code. I had hoped that the /bin/rm: cannot remove directory `/mnt/smb_r/clients/client0/~ > > dmtmp/PARADOX': Directory not empty error would go away with the fixes Jeff and I did to delete on close (it may be that this version of dbench keeps files open when it tries to delete them and then does not close them before the directory is removed thus delete on close doesn't help). You should probably be running dbench 4 instead On Fri, Oct 24, 2008 at 10:02 AM, Shirish Pargaonkar < shirishpargaonkar@gmail.com> wrote: > Steve, > > Here is a run against a samba share > > usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_n -c > client.txt 1 > dbench version 3.04 - Copyright Andrew Tridgell 1999-2004 > > Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs > 1 clients started > 1 12 0.00 MB/sec warmup 1 sec > 1 12 0.00 MB/sec warmup 24 sec > 1 102 2.82 MB/sec execute 1 sec > 1 232 0.37 MB/sec cleanup 19 sec > /bin/rm: cannot remove directory > `/mnt/smb_n/clients/client0/~dmtmp/COREL': Directory not empty > 1 232 0.35 MB/sec cleanup 20 sec > > Throughput 1.39629 MB/sec (sync open) (sync dirs) 1 procs > > > And here is against a Windows XP share > > :/usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_r -c > client.txt 1 > dbench version 3.04 - Copyright Andrew Tridgell 1999-2004 > > Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs > 1 clients started > 1 132 3.01 MB/sec execute 1 sec > 1 207 2.68 MB/sec execute 2 sec > 1 284 2.61 MB/sec execute 3 sec > 1 357 2.51 MB/sec execute 4 sec > 1 418 2.40 MB/sec cleanup 5 sec > 1 418 2.01 MB/sec cleanup 6 sec > 1 418 1.72 MB/sec cleanup 7 sec > 1 418 1.51 MB/sec cleanup 8 sec > 1 418 1.34 MB/sec cleanup 9 sec > 1 418 1.20 MB/sec cleanup 10 sec > 1 418 1.10 MB/sec cleanup 11 sec > 1 418 1.00 MB/sec cleanup 12 sec > 1 418 0.93 MB/sec cleanup 13 sec > 1 418 0.86 MB/sec cleanup 14 sec > 1 418 0.80 MB/sec cleanup 15 sec > 1 418 0.75 MB/sec cleanup 16 sec > 1 418 0.71 MB/sec cleanup 17 sec > 1 418 0.67 MB/sec cleanup 18 sec > 1 418 0.63 MB/sec cleanup 19 sec > 1 418 0.60 MB/sec cleanup 20 sec > 1 418 0.57 MB/sec cleanup 21 sec > 1 418 0.55 MB/sec cleanup 22 sec > 1 418 0.52 MB/sec cleanup 23 sec > 1 418 0.50 MB/sec cleanup 24 sec > /bin/rm: cannot remove directory > `/mnt/smb_r/clients/client0/~dmtmp/PARADOX': Directory not empty > 1 418 0.49 MB/sec cleanup 24 sec > > Throughput 2.40957 MB/sec (sync open) (sync dirs) 1 procs > > > And the testrun still running with your patch. > > Regards, > > Shirish > > On Fri, Oct 24, 2008 at 9:23 AM, Steve French <smfrench@gmail.com> wrote: > > > > > > On Fri, Oct 24, 2008 at 7:53 AM, Neil Horman <nhorman@redhat.com> wrote: > >> > >> The correct solution is to read data out of the socket into a temporary > >> buffer that you allocate in the cifs code, and loop repetatively calling > >> recv[msg|from] until such time as the entire message has been read. > >> > >> > > That is what the cifs code does in cifs_demultiplex_thread. > > > > Shirish, > > If we can verify that performance does not appear to be impacted (e.g. > some > > dbench runs against remote server and also against localhost/samba) when > > removing the few lines of code in (cifs's) ipv4_connect in which cifs > sets > > sndbuf/rcvbuf (we never set them in the case fortunately) - then I will > > remove those lines immediately (and I am ok with not having a mount > option > > to turn off the autotuning ie revert to old behavior) > > > > > > > > -- > > Thanks, > > > > Steve > > > -- Thanks, Steve [-- Attachment #1.2: Type: text/html, Size: 6881 bytes --] [-- Attachment #2: Type: text/plain, Size: 172 bytes --] _______________________________________________ linux-cifs-client mailing list linux-cifs-client@lists.samba.org https://lists.samba.org/mailman/listinfo/linux-cifs-client ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 1:18 ` Steve French 2008-10-24 1:30 ` Jim Rees @ 2008-10-24 1:52 ` Trond Myklebust 2008-10-24 2:06 ` Steve French 1 sibling, 1 reply; 24+ messages in thread From: Trond Myklebust @ 2008-10-24 1:52 UTC (permalink / raw) To: Steve French Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz, Jim Rees On Thu, 2008-10-23 at 20:18 -0500, Steve French wrote: > On Thu, Oct 23, 2008 at 3:59 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote: > > On Thu, Oct 23, 2008 at 02:40:01PM -0500, Steve French (smfrench@gmail.com) wrote: > >> I think we already do the same thing as NFS, they still are turning > >> off autotuning on the client right? > >> > >> If we could set a "sk_min_sndbuf_size" someday (to e.g. twice the size > >> of the cifs write frames ie about 112K) - would that be enough? > > > > > > You do not need to set sockt buffers, but instead read data in chunks, > > which will automatically make a TCP progress. > > If that is the case (ie that cifs and nfs never need to set these over > tcp - I am still having trouble reconciling that with the NFS guys' > comments that they must set rcvbuf (and Jim's comment below) > > The other issue is that at least for NFS, the receive buffer must be big > > enough to hold the biggest possible rpc. If not, a partial rpc will get > > stuck in the buffer and no progress will be made. Jim & co are talking about the _server_ side, which has very different requirements when compared to a client. One of the NFS kernel server's main tasks is to manage its own resources, and for that reason one of the design constraints is that it only starts reading the next request from a socket when it knows that it has enough resources to send a reply without blocking. Look rather at the NFS client: that uses non-blocking modes together with an aio engine (a.k.a. rpciod) to shove stuff as quickly as possible out of the socket's skbufs and into the page cache/inode metadata caches. No locking of TCP socket buffer sizes needed or used... Trond ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 1:52 ` Trond Myklebust @ 2008-10-24 2:06 ` Steve French 2008-10-24 4:43 ` Trond Myklebust 0 siblings, 1 reply; 24+ messages in thread From: Steve French @ 2008-10-24 2:06 UTC (permalink / raw) To: Trond Myklebust Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz, Jim Rees On Thu, Oct 23, 2008 at 8:52 PM, Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > On Thu, 2008-10-23 at 20:18 -0500, Steve French wrote: > Jim & co are talking about the _server_ side, which has very different > requirements when compared to a client. One of the NFS kernel server's > main tasks is to manage its own resources, and for that reason one of > the design constraints is that it only starts reading the next request > from a socket when it knows that it has enough resources to send a reply > without blocking. > > Look rather at the NFS client: that uses non-blocking modes together > with an aio engine (a.k.a. rpciod) to shove stuff as quickly as possible > out of the socket's skbufs and into the page cache/inode metadata > caches. No locking of TCP socket buffer sizes needed or used... That brings us back to what started this discussion though (blocking vs. non-blocking sends) ... on the send path in the cifs client, Shirish's testing shows improvements when we change from non-blocking sends with retry to blocking sends: - smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL; + smb_msg.msg_flags = MSG_NOSIGNAL; -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-24 2:06 ` Steve French @ 2008-10-24 4:43 ` Trond Myklebust 0 siblings, 0 replies; 24+ messages in thread From: Trond Myklebust @ 2008-10-24 4:43 UTC (permalink / raw) To: Steve French Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, nhorman@redhat.com, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz, Jim Rees On Oct 23, 2008, at 22:06, "Steve French" <smfrench@gmail.com> wrote: > On Thu, Oct 23, 2008 at 8:52 PM, Trond Myklebust > <trond.myklebust@fys.uio.no> wrote: >> On Thu, 2008-10-23 at 20:18 -0500, Steve French wrote: >> Jim & co are talking about the _server_ side, which has very >> different >> requirements when compared to a client. One of the NFS kernel >> server's >> main tasks is to manage its own resources, and for that reason one of >> the design constraints is that it only starts reading the next >> request >> from a socket when it knows that it has enough resources to send a >> reply >> without blocking. >> >> Look rather at the NFS client: that uses non-blocking modes together >> with an aio engine (a.k.a. rpciod) to shove stuff as quickly as >> possible >> out of the socket's skbufs and into the page cache/inode metadata >> caches. No locking of TCP socket buffer sizes needed or used... > > That brings us back to what started this discussion though (blocking > vs. > non-blocking sends) ... on the send path in the cifs client, > Shirish's testing > shows improvements when we change from non-blocking sends > with retry to blocking sends: > > - smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL; > + smb_msg.msg_flags = MSG_NOSIGNAL; > So, what triggers the retry in your case? In NFS, we use a sk_writespace callback to initiate another attempt. That's pretty much the mechanism used by the non-blocking sendmsg() too. Trond ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 19:35 ` Jeff Layton 2008-10-23 19:40 ` Steve French @ 2008-10-23 20:02 ` Jim Rees 2008-10-23 20:18 ` Neil Horman 1 sibling, 1 reply; 24+ messages in thread From: Jim Rees @ 2008-10-23 20:02 UTC (permalink / raw) To: Jeff Layton Cc: Evgeniy Polyakov, Steve French, nhorman, Frank S Filz, Suresh Jayaraman, linux-fsdevel, linux-cifs-client@lists.samba.org Jeff Layton wrote: It sounds like what we should do for CIFS is the same thing. Set the buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that they can grow as needed. You have to be careful about when you set them. The old nfs code does it in the wrong place, just before the first read. That's too late. You have to do it just after accept(). Does the kernel ever shrink these buffers? Not now, but it could in the future. Yet another reason this should be an actual socket (or tcp) call with well defined semantics. By the way, I'm pretty sure you never have to set sk_sndbuf, as the tcp code will grow the send buffer as needed. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: sendmsg blocking with sendtimeout vs. non-blocking 2008-10-23 20:02 ` Jim Rees @ 2008-10-23 20:18 ` Neil Horman 0 siblings, 0 replies; 24+ messages in thread From: Neil Horman @ 2008-10-23 20:18 UTC (permalink / raw) To: Jim Rees Cc: Jeff Layton, Steve French, Evgeniy Polyakov, Shirish Pargaonkar, Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz On Thu, Oct 23, 2008 at 04:02:22PM -0400, Jim Rees wrote: > Jeff Layton wrote: > > It sounds like what we should do for CIFS is the same thing. Set the > buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that > they can grow as needed. > > You have to be careful about when you set them. The old nfs code does it in > the wrong place, just before the first read. That's too late. You have to > do it just after accept(). > > Does the kernel ever shrink these buffers? > Yes. Its not so much a buffer, but an allowable allocation threshold. And it should shrink see sk_*_free and its child functions. Neil -- /*************************************************** *Neil Horman *Senior Software Engineer *Red Hat, Inc. *nhorman@redhat.com *gpg keyid: 1024D / 0x92A74FA1 *http://pgp.mit.edu ***************************************************/ ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2008-10-24 16:18 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <524f69650810221414n52e236d9rfc6f231d1db75405@mail.gmail.com>
[not found] ` <20081023091325.0ad059c6@barsoom.rdu.redhat.com>
[not found] ` <524f69650810230754m20b815b5n256e82ef1d5dd4f0@mail.gmail.com>
[not found] ` <4a4634330810230926m9285a3etf6340e3de53c576@mail.gmail.com>
2008-10-23 16:42 ` sendmsg blocking with sendtimeout vs. non-blocking Steve French
2008-10-23 17:40 ` Steve French
2008-10-23 17:46 ` Evgeniy Polyakov
2008-10-23 18:00 ` Jeff Layton
2008-10-23 19:19 ` Steve French
2008-10-23 19:35 ` Jeff Layton
2008-10-23 19:40 ` Steve French
2008-10-23 20:59 ` Evgeniy Polyakov
2008-10-24 1:18 ` Steve French
2008-10-24 1:30 ` Jim Rees
2008-10-24 1:48 ` Jeff Layton
2008-10-24 12:53 ` Neil Horman
2008-10-24 13:59 ` Jim Rees
2008-10-24 15:20 ` Evgeniy Polyakov
2008-10-24 15:36 ` Jim Rees
2008-10-24 15:29 ` Neil Horman
2008-10-24 14:23 ` Steve French
2008-10-24 15:02 ` Shirish Pargaonkar
2008-10-24 16:18 ` Steve French
2008-10-24 1:52 ` Trond Myklebust
2008-10-24 2:06 ` Steve French
2008-10-24 4:43 ` Trond Myklebust
2008-10-23 20:02 ` Jim Rees
2008-10-23 20:18 ` Neil Horman
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.