* Re: sendmsg blocking with sendtimeout vs. non-blocking
       [not found]     ` <4a4634330810230926m9285a3etf6340e3de53c576@mail.gmail.com>
@ 2008-10-23 16:42       ` Steve French
  2008-10-23 17:40         ` Steve French
  2008-10-23 18:00         ` Jeff Layton
  0 siblings, 2 replies; 24+ messages in thread
From: Steve French @ 2008-10-23 16:42 UTC (permalink / raw)
  To: Shirish Pargaonkar
  Cc: Jeff Layton, Suresh Jayaraman, nhorman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz

On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar
<shirishpargaonkar@gmail.com> wrote:
> On Thu, Oct 23, 2008 at 9:54 AM, Steve French <smfrench@gmail.com> wrote:
>>> The big question is what happens under heavy memory pressure. In that
>>> case we may not be able to copy the data from the buffer or iov into
>>> the skbuf. With non-blocking I/O kernel_sendmsg will return error
>>> if nothing was copied (probably -EAGAIN or -ENOSPC). If some of the
>>> data could be copied, then it'll return a positive value that was less
>>> than the size of the send.
I think we are ok, as long as in all paths we either:
a) resend the remainder of the SMB (by retrying what is left in the buffer,
as sketched below), or
b) kill the tcp session and reconnect (and either retry with a new handle
or return an EHOSTDOWN error to the vfs, depending on the path).
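
A minimal sketch of option (a), retrying a short kernel_sendmsg() with the
remaining data (the helper name, retry limit and backoff are illustrative,
not the actual cifs code):

#include <linux/net.h>          /* kernel_sendmsg() */
#include <linux/socket.h>
#include <linux/uio.h>          /* struct kvec */
#include <linux/delay.h>        /* msleep() */
#include <linux/errno.h>

/* Sketch only: keep calling kernel_sendmsg() until the whole SMB is out. */
static int smb_send_all(struct socket *sock, void *smb_buf, size_t smb_len)
{
        struct msghdr msg = { .msg_flags = MSG_NOSIGNAL };
        struct kvec iov;
        size_t sent = 0;
        int rc, retries = 0;

        while (sent < smb_len) {
                iov.iov_base = (char *)smb_buf + sent;
                iov.iov_len = smb_len - sent;

                rc = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
                if (rc == -EAGAIN || rc == -ENOSPC || rc == -EINTR) {
                        if (++retries > 2)
                                return rc;  /* give up: caller reconnects */
                        msleep(100);        /* brief backoff, then retry */
                        continue;
                }
                if (rc < 0)
                        return rc;          /* hard error: reconnect path */
                sent += rc;                 /* partial send: resend the rest */
        }
        return 0;
}

If the retries are exhausted, the caller falls through to option (b) and
tears down the tcp session.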

>>
>>> Regardless of whether we use blocking I/O or not, kernel_sendmsg could
>>> still copy less data than was requested. The patch needs to account
>>> for this, probably by retrying the send with the remaining data if it
>>> occurs.
>> Yes - if that is true we need to retry at least once on this.
>>
>>> I don't think Shirish's patch is likely fixing this problem, but is
>>> probably just making it harder for it to occur.
>>>
>>> I think, overall that blocking I/O makes sense. It should make error
>>> handling easier and I don't see any real disadvantage. We might
>>> consider using longer timeouts, removing the MSG_NOSIGNAL flag and
>>> changing the code to deal with the possibility of catching a signal
>>> during the kernel_sendmsg call.
>>>
>>> Also, we're setting the sk_sndbuf and sk_rcvbuf in ipv4_connect.
>>> According to Neil, that may be hurting us here. This has the effect of
>>> placing a hard limit on the send buffer for the socket, and may be
>>> making it more likely that we get an error back from kernel_sendmsg.
>> Yes - Sridhar and Dave Stevens mentioned that too.   We haven't needed
>> that (setting sndbuf) for at least a few kernel versions (is the auto socket
>> buffer size tuning working as far back as RHEL5/2.6.18 and
>> SLES10 2.6.16 too?)
>>
>> Shirish,
>> I thought that that was part of your patch?
>
> No it is not but this is an easy change, just do not set them.

As I look into this sndbuf and rcvbuf size setting ... what concerns
me is why nfs still sets the sndbuf and rcvbuf sizes if they
don't need to be set.  We (cifs) have larger write sizes (56K) than
nfs's default.   See svc_set_sockbufsize in net/sunrpc/svcsock.c


-- 
Thanks,

Steve


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 16:42       ` sendmsg blocking with sendtimeout vs. non-blocking Steve French
@ 2008-10-23 17:40         ` Steve French
  2008-10-23 17:46           ` Evgeniy Polyakov
  2008-10-23 18:00         ` Jeff Layton
  1 sibling, 1 reply; 24+ messages in thread
From: Steve French @ 2008-10-23 17:40 UTC (permalink / raw)
  To: Shirish Pargaonkar
  Cc: Jeff Layton, Suresh Jayaraman, nhorman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz

On Thu, Oct 23, 2008 at 11:42 AM, Steve French <smfrench@gmail.com> wrote:
> On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar
> As I look into this sndbuf and rcvbuf size setting ... what concerns
> me is why nfs sets these sizes for snd and rcvbuf sizes still if they
> don't need to be set?  We (cifs) have larger write sizes (56K) than
> nfs's default.   See svc_set_sockbufsize in net/sunrpc/svcsock.c

What is the valid range for the sndbuf and rcvbuf size so I can sanity
check this if the user overrides it on mount?



-- 
Thanks,

Steve


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 17:40         ` Steve French
@ 2008-10-23 17:46           ` Evgeniy Polyakov
  0 siblings, 0 replies; 24+ messages in thread
From: Evgeniy Polyakov @ 2008-10-23 17:46 UTC (permalink / raw)
  To: Steve French
  Cc: Shirish Pargaonkar, Jeff Layton, Suresh Jayaraman, nhorman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz

Hi.

On Thu, Oct 23, 2008 at 12:40:05PM -0500, Steve French (smfrench@gmail.com) wrote:
> On Thu, Oct 23, 2008 at 11:42 AM, Steve French <smfrench@gmail.com> wrote:
> > On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar
> > As I look into this sndbuf and rcvbuf size setting ... what concerns
> > me is why nfs sets these sizes for snd and rcvbuf sizes still if they
> > don't need to be set?  We (cifs) have larger write sizes (56K) than
> > nfs's default.   See svc_set_sockbufsize in net/sunrpc/svcsock.c
> 
> What is the valid range for the sndbuf and rcvbuf size so I can sanity
> check this if the user overrides it on mount?

From zero to infinity. The actual size will be automatically adjusted by the
kernel, but will not exceed the specified one, so there is no need to tune
this parameter. Very likely you do not want to change the socket queue sizes,
since it may hurt performance in cases where autotuning could previously raise
the maximum socket buffer size.
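
If the buffer-size override is kept as a mount option at all, a sanity check
could be as simple as clamping the value; the helper and bounds below are
purely illustrative (the core socket code adjusts the effective size anyway):

/* Sketch only: clamp a user-supplied buffer-size mount option.
 * CIFS_MIN_SOCKBUF / CIFS_MAX_SOCKBUF are illustrative, not real constants. */
#define CIFS_MIN_SOCKBUF        (64 * 1024)       /* > largest SMB frame */
#define CIFS_MAX_SOCKBUF        (8 * 1024 * 1024)

static unsigned int cifs_clamp_sockbuf(unsigned int requested)
{
        if (requested == 0)
                return 0;               /* 0 == leave autotuning alone */
        if (requested < CIFS_MIN_SOCKBUF)
                return CIFS_MIN_SOCKBUF;
        if (requested > CIFS_MAX_SOCKBUF)
                return CIFS_MAX_SOCKBUF;
        return requested;
}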

-- 
	Evgeniy Polyakov


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 16:42       ` sendmsg blocking with sendtimeout vs. non-blocking Steve French
  2008-10-23 17:40         ` Steve French
@ 2008-10-23 18:00         ` Jeff Layton
  2008-10-23 19:19           ` Steve French
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2008-10-23 18:00 UTC (permalink / raw)
  To: Steve French
  Cc: Shirish Pargaonkar, Suresh Jayaraman, nhorman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz

On Thu, 23 Oct 2008 11:42:49 -0500
"Steve French" <smfrench@gmail.com> wrote:

> On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar
> <shirishpargaonkar@gmail.com> wrote:
> > On Thu, Oct 23, 2008 at 9:54 AM, Steve French <smfrench@gmail.com> wrote:
> >>> The big question is what happens under heavy memory pressure. In that
> >>> case we may not be able to copy the data from the buffer or iov into
> >>> the skbuf. With non-blocking I/O kernel_sendmsg will return error
> >>> if nothing was copied (probably -EAGAIN or -ENOSPC). If some of the
> >>> data could be copied, then it'll return a positive value that was less
> >>> than the size of the send.
> I think we are ok, as long as in all paths we either:
> a) resend the remainder of the SMB (by retrying what is left in the buffer
> or
> b) kill the tcp session, reconnect (and either retry with a new handle
> or return a EHOSTDOWN error to the vfs depending on the path)
> 
> >>
> >>> Regardless of whether we use blocking I/O or not, kernel_sendmsg could
> >>> still copy less data than was requested. The patch needs to account
> >>> for this, probably by retrying the send with the remaining data if it
> >>> occurs.
> >> Yes - if that is true we need to retry at least once on this.
> >>
> >>> I don't think Shirish's patch is likely fixing this problem, but is
> >>> probably just making it harder for it to occur.
> >>>
> >>> I think, overall that blocking I/O makes sense. It should make error
> >>> handling easier and I don't see any real disadvantage. We might
> >>> consider using longer timeouts, removing the MSG_NOSIGNAL flag and
> >>> changing the code to deal with the possibility of catching a signal
> >>> during the kernel_sendmsg call.
> >>>
> >>> Also, we're setting the sk_sndbuf and sk_rcvbuf in ipv4_connect.
> >>> According to Neil, that may be hurting us here. This has the effect of
> >>> placing a hard limit on the send buffer for the socket, and may be
> >>> making it more likely that we get an error back from kernel_sendmsg.
> >> Yes - Sridhar and Dave Stevens mentioned that too.   We haven't needed
> >> that (setting sndbuf) for at least a few kernel versions (is the auto socket
> >> buffer size tuning working as far back as RHEL5/2.6.18 and
> >> SLES10 2.6.16 too?)
> >>
> >> Shirish,
> >> I thought that that was part of your patch?
> >
> > No it is not but this is an easy change, just do not set them.
> 
> As I look into this sndbuf and rcvbuf size setting ... what concerns
> me is why nfs sets these sizes for snd and rcvbuf sizes still if they
> don't need to be set?  We (cifs) have larger write sizes (56K) than
> nfs's default.   See svc_set_sockbufsize in net/sunrpc/svcsock.c
> 

AFAICT, those size settings have been there for a long time (2002?). My
guess is that this may predate the socket autotuning stuff that Neil H.
mentioned to me.

His statement was basically "unless you know for sure that you don't
want to use more than X amount of memory, then there isn't much reason
to set the send and receive buffers".

-- 
Jeff Layton <jlayton@redhat.com>


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 18:00         ` Jeff Layton
@ 2008-10-23 19:19           ` Steve French
  2008-10-23 19:35             ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Steve French @ 2008-10-23 19:19 UTC (permalink / raw)
  To: Jeff Layton, Evgeniy Polyakov
  Cc: Shirish Pargaonkar, Suresh Jayaraman, nhorman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz,
	Jim Rees

On Thu, Oct 23, 2008 at 1:00 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 23 Oct 2008 11:42:49 -0500
> "Steve French" <smfrench@gmail.com> wrote:
>
>> On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar
>> <shirishpargaonkar@gmail.com> wrote:
>> > On Thu, Oct 23, 2008 at 9:54 AM, Steve French <smfrench@gmail.com> wrote:
>> >>> The big question is what happens under heavy memory pressure. In that
>> >>> case we may not be able to copy the data from the buffer or iov into
>> >>> the skbuf. With non-blocking I/O kernel_sendmsg will return error
>> >>> if nothing was copied (probably -EAGAIN or -ENOSPC). If some of the
>> >>> data could be copied, then it'll return a positive value that was less
>> >>> than the size of the send.
>> I think we are ok, as long as in all paths we either:
>> a) resend the remainder of the SMB (by retrying what is left in the buffer
>> or
>> b) kill the tcp session, reconnect (and either retry with a new handle
>> or return a EHOSTDOWN error to the vfs depending on the path)
>>
>> >>
>> >>> Regardless of whether we use blocking I/O or not, kernel_sendmsg could
>> >>> still copy less data than was requested. The patch needs to account
>> >>> for this, probably by retrying the send with the remaining data if it
>> >>> occurs.
>> >> Yes - if that is true we need to retry at least once on this.
>> >>
>> >>> I don't think Shirish's patch is likely fixing this problem, but is
>> >>> probably just making it harder for it to occur.
>> >>>
>> >>> I think, overall that blocking I/O makes sense. It should make error
>> >>> handling easier and I don't see any real disadvantage. We might
>> >>> consider using longer timeouts, removing the MSG_NOSIGNAL flag and
>> >>> changing the code to deal with the possibility of catching a signal
>> >>> during the kernel_sendmsg call.
>> >>>
>> >>> Also, we're setting the sk_sndbuf and sk_rcvbuf in ipv4_connect.
>> >>> According to Neil, that may be hurting us here. This has the effect of
>> >>> placing a hard limit on the send buffer for the socket, and may be
>> >>> making it more likely that we get an error back from kernel_sendmsg.
>> >> Yes - Sridhar and Dave Stevens mentioned that too.   We haven't needed
>> >> that (setting sndbuf) for at least a few kernel versions (is the auto socket
>> >> buffer size tuning working as far back as RHEL5/2.6.18 and
>> >> SLES10 2.6.16 too?)
>> >>
>> >> Shirish,
>> >> I thought that that was part of your patch?
>> >
>> > No it is not but this is an easy change, just do not set them.
>>
>> As I look into this sndbuf and rcvbuf size setting ... what concerns
>> me is why nfs sets these sizes for snd and rcvbuf sizes still if they
>> don't need to be set?  We (cifs) have larger write sizes (56K) than
>> nfs's default.   See svc_set_sockbufsize in net/sunrpc/svcsock.c
>>
>
> AFAICT, those size settings have been there for a long time (2002?). My
> guess is that this may predate the socket autotuning stuff that Neil H.
> mentioned to me.
>
> His statement was basically "unless you know for sure that you don't
> want to use more than X amount of memory, then there isn't much reason
> to set the send and receive buffers".

I think that there is a problem still - cifs needs tcp autotuning but
the buffers probably need to be at least as big as the largest SMB
response (about 17K on receives, but configurable to be larger).   See
comment below about NFS:

On Thu, Oct 23, 2008 at 2:05 PM, Jim Rees <rees@umich.edu> wrote:
> There are two issues to be aware of.  One is that the socket buffers have to
> be big enough for the tcp congestion window.  In the old days, the
> application would have to know ahead of time how big this is, and call
> setsockopt(), which sets these numbers.
>
> Now however, the tcp stack "autotunes" the buffer sizes to the correct
> amount.  If you call setsockopt() to set a buffer size, or set sk_*buf, or
> set the SOCK_*BUF_LOCK bits in sk_userlocks, you defeat this autotuning.
> This is almost always a bad idea.
>
> The other issue is that at least for NFS, the receive buffer must be big
> enough to hold the biggest possible rpc.  If not, a partial rpc will get
> stuck in the buffer and no progress will be made.
>
> Issue one is easy to deal with, just don't muck with the socket internal
> data structure.  The second one is harder.  What's really needed is a new
> api into the tcp layer that will reserve a minimum amount of memory in the
> socket buffer so that receives won't stall.  For now, our patch sets the
> sk_*buf values without setting the lock flags.



-- 
Thanks,

Steve


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 19:19           ` Steve French
@ 2008-10-23 19:35             ` Jeff Layton
  2008-10-23 19:40               ` Steve French
  2008-10-23 20:02               ` Jim Rees
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff Layton @ 2008-10-23 19:35 UTC (permalink / raw)
  To: Steve French
  Cc: Evgeniy Polyakov, Frank S Filz, nhorman, Suresh Jayaraman,
	linux-fsdevel, linux-cifs-client@lists.samba.org, Jim Rees

On Thu, 23 Oct 2008 14:19:04 -0500
"Steve French" <smfrench@gmail.com> wrote:
> > His statement was basically "unless you know for sure that you don't
> > want to use more than X amount of memory, then there isn't much reason
> > to set the send and receive buffers".
> 
> I think that there is a problem still - cifs needs tcp autotuning but
> the buffers probably need to be at least as big as the largest SMB
> response (about 17K on receives, but configurable to be larger).   See
> comment below about NFS:
> 

On Thu, Oct 23, 2008 at 2:05 PM, Jim Rees <rees@umich.edu> wrote:
> There are two issues to be aware of.  One is that the socket buffers have to
> be big enough for the tcp congestion window.  In the old days, the
> application would have to know ahead of time how big this is, and call
> setsockopt(), which sets these numbers.
>
> Now however, the tcp stack "autotunes" the buffer sizes to the correct
> amount.  If you call setsockopt() to set a buffer size, or set sk_*buf, or
> set the SOCK_*BUF_LOCK bits in sk_userlocks, you defeat this autotuning.
> This is almost always a bad idea.
>
> The other issue is that at least for NFS, the receive buffer must be big
> enough to hold the biggest possible rpc.  If not, a partial rpc will get
> stuck in the buffer and no progress will be made.
>
> Issue one is easy to deal with, just don't muck with the socket internal
> data structure.  The second one is harder.  What's really needed is a new
> api into the tcp layer that will reserve a minimum amount of memory in the
> socket buffer so that receives won't stall.  For now, our patch sets the
> sk_*buf values without setting the lock flags.  


Agreed. Thanks Jim, for sending along that info...

It sounds like what we should do for CIFS is the same thing. Set the
buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that
they can grow as needed.
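
A rough sketch of that approach (the constant and helper are illustrative; the
key point is that assigning sk_sndbuf/sk_rcvbuf directly leaves sk_userlocks
alone, whereas going through sock_setsockopt()/SO_*BUF would set the
SOCK_*BUF_LOCK bits and pin the sizes):

#include <linux/net.h>
#include <net/sock.h>

/* Sketch only: hint at initial buffer sizes without defeating autotuning.
 * CIFS_SOCKBUF_HINT is an illustrative value, not a real cifs constant. */
#define CIFS_SOCKBUF_HINT       (128 * 1024)

static void cifs_hint_sockbuf(struct socket *sock)
{
        struct sock *sk = sock->sk;

        lock_sock(sk);
        if (sk->sk_sndbuf < CIFS_SOCKBUF_HINT)
                sk->sk_sndbuf = CIFS_SOCKBUF_HINT;
        if (sk->sk_rcvbuf < CIFS_SOCKBUF_HINT)
                sk->sk_rcvbuf = CIFS_SOCKBUF_HINT;
        /* note: no sk->sk_userlocks |= SOCK_SNDBUF_LOCK / SOCK_RCVBUF_LOCK */
        release_sock(sk);
}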

Does the kernel ever shrink these buffers?
-- 
Jeff Layton <jlayton@redhat.com>


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 19:35             ` Jeff Layton
@ 2008-10-23 19:40               ` Steve French
  2008-10-23 20:59                 ` Evgeniy Polyakov
  2008-10-23 20:02               ` Jim Rees
  1 sibling, 1 reply; 24+ messages in thread
From: Steve French @ 2008-10-23 19:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Evgeniy Polyakov, Shirish Pargaonkar, Suresh Jayaraman, nhorman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz,
	Jim Rees

On Thu, Oct 23, 2008 at 2:35 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 23 Oct 2008 14:19:04 -0500
> "Steve French" <smfrench@gmail.com> wrote:
>> > His statement was basically "unless you know for sure that you don't
>> > want to use more than X amount of memory, then there isn't much reason
>> > to set the send and receive buffers".
>>
>> I think that there is a problem still - cifs needs tcp autotuning but
>> the buffers probably need to be at least as big as the largest SMB
>> response (about 17K on receives, but configurable to be larger).   See
>> comment below about NFS:
>>
>
> On Thu, Oct 23, 2008 at 2:05 PM, Jim Rees <rees@umich.edu> wrote:
>> There are two issues to be aware of.  One is that the socket buffers have to
>> be big enough for the tcp congestion window.  In the old days, the
>> application would have to know ahead of time how big this is, and call
>> setsockopt(), which sets these numbers.
>>
>> Now however, the tcp stack "autotunes" the buffer sizes to the correct
>> amount.  If you call setsockopt() to set a buffer size, or set sk_*buf, or
>> set the SOCK_*BUF_LOCK bits in sk_userlocks, you defeat this autotuning.
>> This is almost always a bad idea.
>>
>> The other issue is that at least for NFS, the receive buffer must be big
>> enough to hold the biggest possible rpc.  If not, a partial rpc will get
>> stuck in the buffer and no progress will be made.
>>
>> Issue one is easy to deal with, just don't muck with the socket internal
>> data structure.  The second one is harder.  What's really needed is a new
>> api into the tcp layer that will reserve a minimum amount of memory in the
>> socket buffer so that receives won't stall.  For now, our patch sets the
>> sk_*buf values without setting the lock flags.
>
>
> Agreed. Thanks Jim, for sending along that info...
>
> It sounds like what we should do for CIFS is the same thing. Set the
> buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that
> they can grow as needed.

CIFS, unlike NFS, never set the SOCK_*BUF_LOCK flags.

I think we already do the same thing as NFS - they are still turning
off autotuning on the client, right?

If we could set a "sk_min_sndbuf_size" someday (e.g. to twice the size
of the cifs write frames, i.e. about 112K) - would that be enough?


-- 
Thanks,

Steve


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 19:35             ` Jeff Layton
  2008-10-23 19:40               ` Steve French
@ 2008-10-23 20:02               ` Jim Rees
  2008-10-23 20:18                 ` Neil Horman
  1 sibling, 1 reply; 24+ messages in thread
From: Jim Rees @ 2008-10-23 20:02 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Evgeniy Polyakov, Steve French, nhorman, Frank S Filz,
	Suresh Jayaraman, linux-fsdevel,
	linux-cifs-client@lists.samba.org

Jeff Layton wrote:

  It sounds like what we should do for CIFS is the same thing. Set the
  buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that
  they can grow as needed.

You have to be careful about when you set them.  The old nfs code does it in
the wrong place, just before the first read.  That's too late.  You have to
do it just after accept().

  Does the kernel ever shrink these buffers?

Not now, but it could in the future.  Yet another reason this should be an
actual socket (or tcp) call with well defined semantics.

By the way, I'm pretty sure you never have to set sk_sndbuf, as the tcp code
will grow the send buffer as needed.


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 20:02               ` Jim Rees
@ 2008-10-23 20:18                 ` Neil Horman
  0 siblings, 0 replies; 24+ messages in thread
From: Neil Horman @ 2008-10-23 20:18 UTC (permalink / raw)
  To: Jim Rees
  Cc: Jeff Layton, Steve French, Evgeniy Polyakov, Shirish Pargaonkar,
	Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz

On Thu, Oct 23, 2008 at 04:02:22PM -0400, Jim Rees wrote:
> Jeff Layton wrote:
> 
>   It sounds like what we should do for CIFS is the same thing. Set the
>   buffer sizes but make sure the SOCK_*BUF_LOCK bits are cleared so that
>   they can grow as needed.
> 
> You have to be careful about when you set them.  The old nfs code does it in
> the wrong place, just before the first read.  That's too late.  You have to
> do it just after accept().
> 
>   Does the kernel ever shrink these buffers?
> 
Yes.  It's not so much a buffer as an allowable allocation threshold.  And it
should shrink; see sk_*_free and its child functions.

Neil


-- 
/***************************************************
 *Neil Horman
 *Senior Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 19:40               ` Steve French
@ 2008-10-23 20:59                 ` Evgeniy Polyakov
  2008-10-24  1:18                   ` Steve French
  0 siblings, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2008-10-23 20:59 UTC (permalink / raw)
  To: Steve French
  Cc: Frank S Filz, nhorman, Jeff Layton, Suresh Jayaraman,
	linux-fsdevel, linux-cifs-client@lists.samba.org, Jim Rees

On Thu, Oct 23, 2008 at 02:40:01PM -0500, Steve French (smfrench@gmail.com) wrote:
> I think we already do the same thing as NFS, they still are turning
> off autotuning on the client right?
> 
> If we could set a "sk_min_sndbuf_size" someday (to e.g. twice the size
> of the cifs write frames ie about 112K)  - would that be enough?


You do not need to set the socket buffers; instead read data in chunks,
which will automatically let TCP make progress.

-- 
	Evgeniy Polyakov


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-23 20:59                 ` Evgeniy Polyakov
@ 2008-10-24  1:18                   ` Steve French
  2008-10-24  1:30                     ` Jim Rees
  2008-10-24  1:52                     ` Trond Myklebust
  0 siblings, 2 replies; 24+ messages in thread
From: Steve French @ 2008-10-24  1:18 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jeff Layton, Shirish Pargaonkar, Suresh Jayaraman, nhorman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz,
	Jim Rees

On Thu, Oct 23, 2008 at 3:59 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> On Thu, Oct 23, 2008 at 02:40:01PM -0500, Steve French (smfrench@gmail.com) wrote:
>> I think we already do the same thing as NFS, they still are turning
>> off autotuning on the client right?
>>
>> If we could set a "sk_min_sndbuf_size" someday (to e.g. twice the size
>> of the cifs write frames ie about 112K)  - would that be enough?
>
>
> You do not need to set sockt buffers, but instead read data in chunks,
> which will automatically make a TCP progress.

If that is the case (i.e. that cifs and nfs never need to set these over
tcp), I am still having trouble reconciling it with the NFS guys'
comments that they must set rcvbuf (and Jim's comment below):
> The other issue is that at least for NFS, the receive buffer must be big
> enough to hold the biggest possible rpc.  If not, a partial rpc will get
> stuck in the buffer and no progress will be made.
>


-- 
Thanks,

Steve


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24  1:18                   ` Steve French
@ 2008-10-24  1:30                     ` Jim Rees
  2008-10-24  1:48                       ` Jeff Layton
  2008-10-24 12:53                       ` Neil Horman
  2008-10-24  1:52                     ` Trond Myklebust
  1 sibling, 2 replies; 24+ messages in thread
From: Jim Rees @ 2008-10-24  1:30 UTC (permalink / raw)
  To: Steve French
  Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz

Steve French wrote:

  If that is the case (ie that cifs and nfs never need to set these over
  tcp - I am still having trouble reconciling that with the NFS guys'
  comments that they must set rcvbuf (and Jim's comment below)

If you have an application that wants to read a big chunk of data from a
socket, and won't remove any of that data from the socket until the entire
chunk has arrived, then the application must set the receive socket buffer
size big enough to hold the entire chunk.  Otherwise the application will
stall.

As far as I can tell, the corresponding situation does not hold for send
buffers, because the tcp layer will grow the send buffer to be big enough to
hold whatever the application wants to send.  But I don't know if that's
always true, or if there is some limit, so to be safe our NFS patch
continues to set both the receive and send buffer sizes.


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24  1:30                     ` Jim Rees
@ 2008-10-24  1:48                       ` Jeff Layton
  2008-10-24 12:53                       ` Neil Horman
  1 sibling, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2008-10-24  1:48 UTC (permalink / raw)
  To: Jim Rees
  Cc: Steve French, Evgeniy Polyakov, Shirish Pargaonkar,
	Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz

On Thu, 23 Oct 2008 21:30:32 -0400
Jim Rees <rees@umich.edu> wrote:

> Steve French wrote:
> 
>   If that is the case (ie that cifs and nfs never need to set these over
>   tcp - I am still having trouble reconciling that with the NFS guys'
>   comments that they must set rcvbuf (and Jim's comment below)
> 
> If you have an application that wants to read a big chunk of data from a
> socket, and won't remove any of that data from the socket until the entire
> chunk has arrived, then the application must set the receive socket buffer
> size big enough to hold the entire chunk.  Otherwise the application will
> stall.
> 
> As far as I can tell, the corresponding situation does not hold for send
> buffers, because the tcp layer will grow the send buffer to be big enough to
> hold whatever the application wants to send.  But I don't know if that's
> always true, or if there is some limit, so to be safe our NFS patch
> continues to set both the receive and send buffer sizes.

Right, that's not actually the case for CIFS, so I don't think we need
to have CIFS tune the receive buffer. cifs_demultiplex_thread just
slurps data off of the socket as it comes without concern about whether
we have a whole request or not. We may end up having to call
kernel_recvmsg a few extra times, but I don't think we need to worry
about it getting stuck.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24  1:18                   ` Steve French
  2008-10-24  1:30                     ` Jim Rees
@ 2008-10-24  1:52                     ` Trond Myklebust
  2008-10-24  2:06                       ` Steve French
  1 sibling, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2008-10-24  1:52 UTC (permalink / raw)
  To: Steve French
  Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz, Jim Rees

On Thu, 2008-10-23 at 20:18 -0500, Steve French wrote:
> On Thu, Oct 23, 2008 at 3:59 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> > On Thu, Oct 23, 2008 at 02:40:01PM -0500, Steve French (smfrench@gmail.com) wrote:
> >> I think we already do the same thing as NFS, they still are turning
> >> off autotuning on the client right?
> >>
> >> If we could set a "sk_min_sndbuf_size" someday (to e.g. twice the size
> >> of the cifs write frames ie about 112K)  - would that be enough?
> >
> >
> > You do not need to set sockt buffers, but instead read data in chunks,
> > which will automatically make a TCP progress.
> 
> If that is the case (ie that cifs and nfs never need to set these over
> tcp - I am still having trouble reconciling that with the NFS guys'
> comments that they must set rcvbuf (and Jim's comment below)
> > The other issue is that at least for NFS, the receive buffer must be big
> > enough to hold the biggest possible rpc.  If not, a partial rpc will get
> > stuck in the buffer and no progress will be made.

Jim & co are talking about the _server_ side, which has very different
requirements when compared to a client. One of the NFS kernel server's
main tasks is to manage its own resources, and for that reason one of
the design constraints is that it only starts reading the next request
from a socket when it knows that it has enough resources to send a reply
without blocking.

Look rather at the NFS client: that uses non-blocking modes together
with an aio engine (a.k.a. rpciod) to shove stuff as quickly as possible
out of the socket's skbufs and into the page cache/inode metadata
caches. No locking of TCP socket buffer sizes needed or used...

Trond



* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24  1:52                     ` Trond Myklebust
@ 2008-10-24  2:06                       ` Steve French
  2008-10-24  4:43                         ` Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Steve French @ 2008-10-24  2:06 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz, Jim Rees

On Thu, Oct 23, 2008 at 8:52 PM, Trond Myklebust
<trond.myklebust@fys.uio.no> wrote:
> On Thu, 2008-10-23 at 20:18 -0500, Steve French wrote:
> Jim & co are talking about the _server_ side, which has very different
> requirements when compared to a client. One of the NFS kernel server's
> main tasks is to manage its own resources, and for that reason one of
> the design constraints is that it only starts reading the next request
> from a socket when it knows that it has enough resources to send a reply
> without blocking.
>
> Look rather at the NFS client: that uses non-blocking modes together
> with an aio engine (a.k.a. rpciod) to shove stuff as quickly as possible
> out of the socket's skbufs and into the page cache/inode metadata
> caches. No locking of TCP socket buffer sizes needed or used...

That brings us back to what started this discussion though (blocking vs.
non-blocking sends) ... on the send path in the cifs client, Shirish's testing
shows improvements when we change from non-blocking sends
with retry to blocking sends:

-	smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL;
+	smb_msg.msg_flags = MSG_NOSIGNAL;
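
For reference, a rough sketch of what pairing the blocking send with a send
timeout could look like (the helper, the 10 * HZ value and where it is applied
are illustrative - this is not the actual patch):

#include <linux/net.h>
#include <net/sock.h>

/* Sketch only: give the socket a send timeout so a blocking kernel_sendmsg()
 * (no MSG_DONTWAIT) fails with -EAGAIN instead of hanging forever if the
 * server stops reading. */
static void smb_set_send_timeout(struct socket *sock)
{
        sock->sk->sk_sndtimeo = 10 * HZ;        /* in jiffies */
}

With that in place a blocking send either completes, returns a short count, or
times out with -EAGAIN, and the caller can fall back to the retry/reconnect
paths discussed earlier in the thread.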



-- 
Thanks,

Steve


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24  2:06                       ` Steve French
@ 2008-10-24  4:43                         ` Trond Myklebust
  0 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2008-10-24  4:43 UTC (permalink / raw)
  To: Steve French
  Cc: Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, nhorman@redhat.com,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz,
	Jim Rees


On Oct 23, 2008, at 22:06, "Steve French" <smfrench@gmail.com> wrote:

> On Thu, Oct 23, 2008 at 8:52 PM, Trond Myklebust
> <trond.myklebust@fys.uio.no> wrote:
>> On Thu, 2008-10-23 at 20:18 -0500, Steve French wrote:
>> Jim & co are talking about the _server_ side, which has very  
>> different
>> requirements when compared to a client. One of the NFS kernel  
>> server's
>> main tasks is to manage its own resources, and for that reason one of
>> the design constraints is that it only starts reading the next  
>> request
>> from a socket when it knows that it has enough resources to send a  
>> reply
>> without blocking.
>>
>> Look rather at the NFS client: that uses non-blocking modes together
>> with an aio engine (a.k.a. rpciod) to shove stuff as quickly as  
>> possible
>> out of the socket's skbufs and into the page cache/inode metadata
>> caches. No locking of TCP socket buffer sizes needed or used...
>
> That brings us back to what started this discussion though (blocking  
> vs.
> non-blocking sends) ... on the send path in the cifs client,  
> Shirish's testing
> shows improvements when we change from non-blocking sends
> with retry to blocking sends:
>
> -    smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL;
> +    smb_msg.msg_flags = MSG_NOSIGNAL;
>

So, what triggers the retry in your case? In NFS, we use a
sk_write_space callback to initiate another attempt. That's pretty much
the mechanism used by the non-blocking sendmsg() too.
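
For reference, the general shape of such a callback (the connection structure,
field names and setup helper below are illustrative, not the actual sunrpc or
cifs code):

#include <net/sock.h>
#include <linux/wait.h>

/* Sketch only: per-connection state for the write-space wakeup. */
struct smb_conn {
        void (*old_write_space)(struct sock *sk);
        wait_queue_head_t send_wait;
};

/* Called by the network layer when the socket has send space again. */
static void smb_write_space(struct sock *sk)
{
        struct smb_conn *conn = sk->sk_user_data;

        conn->old_write_space(sk);               /* chain to saved callback */
        wake_up_interruptible(&conn->send_wait); /* retry the pending send */
}

static void smb_setup_write_space(struct smb_conn *conn, struct socket *sock)
{
        struct sock *sk = sock->sk;

        init_waitqueue_head(&conn->send_wait);
        lock_sock(sk);
        sk->sk_user_data = conn;
        conn->old_write_space = sk->sk_write_space;
        sk->sk_write_space = smb_write_space;
        release_sock(sk);
}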

Trond 
     


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24  1:30                     ` Jim Rees
  2008-10-24  1:48                       ` Jeff Layton
@ 2008-10-24 12:53                       ` Neil Horman
  2008-10-24 13:59                         ` Jim Rees
  2008-10-24 14:23                         ` Steve French
  1 sibling, 2 replies; 24+ messages in thread
From: Neil Horman @ 2008-10-24 12:53 UTC (permalink / raw)
  To: Jim Rees
  Cc: Steve French, Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, nhorman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz

On Thu, Oct 23, 2008 at 09:30:32PM -0400, Jim Rees wrote:
> Steve French wrote:
> 
>   If that is the case (ie that cifs and nfs never need to set these over
>   tcp - I am still having trouble reconciling that with the NFS guys'
>   comments that they must set rcvbuf (and Jim's comment below)
> 
> If you have an application that wants to read a big chunk of data from a
> socket, and won't remove any of that data from the socket until the entire
> chunk has arrived, then the application must set the receive socket buffer
> size big enough to hold the entire chunk.  Otherwise the application will
> stall.
> 
That's not true.  Assuming there is sufficient memory available, and the tcp
socket's receive window isn't closed, then you will continue to receive data,
as not setting SO_RCVBUF will allow the receive buffers to autotune in size.

I know this is getting off the subject a bit, but if cifs is waiting for an
entire message to arrive on a tcp socket before it reads it, that's a cifs
design flaw.  While you can send messages across stream-oriented sockets, you
can't enforce message boundaries for data still in the receive buffer.  Since
the application (cifs in this case) can't (and shouldn't) have visibility into
the tcp socket's congestion window or receive window, you'll never really know
whether you'll be waiting forever while data sits in the socket waiting for you
to read it out.  The correct solution is to read data out of the socket into a
temporary buffer that you allocate in the cifs code, and loop repeatedly
calling recv[msg|from] until the entire message has been read.
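
That receive loop looks roughly like the following (the helper name and the
error handling are simplified and illustrative - this is not the literal
cifs_demultiplex_thread code):

#include <linux/net.h>          /* kernel_recvmsg() */
#include <linux/socket.h>
#include <linux/uio.h>          /* struct kvec */
#include <linux/errno.h>

/* Sketch only: pull "want" bytes of a length-prefixed message off the socket
 * in as many chunks as it takes, so the receive window keeps draining. */
static int smb_read_all(struct socket *sock, char *buf, size_t want)
{
        struct msghdr msg = { .msg_flags = 0 };
        struct kvec iov;
        size_t got = 0;
        int rc;

        while (got < want) {
                iov.iov_base = buf + got;
                iov.iov_len = want - got;

                rc = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len, 0);
                if (rc == -EAGAIN || rc == -EINTR)
                        continue;                       /* transient, retry */
                if (rc <= 0)
                        return rc ? rc : -ECONNRESET;   /* error or EOF */
                got += rc;                              /* partial read: loop */
        }
        return 0;
}

The caller would first read the 4-byte length header and then call this again
for the rest of the frame, which is essentially what the cifs demultiplex
thread already does.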


> As far as I can tell, the corresponding situation does not hold for send
> buffers, because the tcp layer will grow the send buffer to be big enough to
> hold whatever the application wants to send.  But I don't know if that's
> always true, or if there is some limit, so to be safe our NFS patch
> continues to set both the receive and send buffer sizes.

Receive buffers do the same thing as send buffers with regard to autotuning.
You may have had to set SO_RCVBUF previously, but that should no longer be the
case.

Neil

-- 
/***************************************************
 *Neil Horman
 *Senior Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24 12:53                       ` Neil Horman
@ 2008-10-24 13:59                         ` Jim Rees
  2008-10-24 15:20                           ` Evgeniy Polyakov
  2008-10-24 15:29                           ` Neil Horman
  2008-10-24 14:23                         ` Steve French
  1 sibling, 2 replies; 24+ messages in thread
From: Jim Rees @ 2008-10-24 13:59 UTC (permalink / raw)
  To: Neil Horman
  Cc: Steve French, Evgeniy Polyakov, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz

Neil Horman wrote:

  Thats not true.  Assuming there is sufficient memory available, and the tcp
  segments receive window isn't closed, then you will continue to receive data, as
  not setting SO_RCVBUF will allow the receive buffers to autotune in size.

My explanation may be flawed but the behavior is real.  If you turn on
autotuning on the nfsd socket, and don't do anything with the buffer sizes,
nfsd will stall.  This is in a 2.6.24 kernel.  I would of course welcome
independent verification.

I thought that the receive buffer size would only grow as large as the
receiver's advertised window, but I'll admit I don't fully understand the
autotuning code.


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24 12:53                       ` Neil Horman
  2008-10-24 13:59                         ` Jim Rees
@ 2008-10-24 14:23                         ` Steve French
  2008-10-24 15:02                           ` Shirish Pargaonkar
  1 sibling, 1 reply; 24+ messages in thread
From: Steve French @ 2008-10-24 14:23 UTC (permalink / raw)
  To: Neil Horman
  Cc: Frank S Filz, Jeff Layton, Suresh Jayaraman,
	linux-cifs-client@lists.samba.org, linux-fsdevel,
	Evgeniy Polyakov, Jim Rees


On Fri, Oct 24, 2008 at 7:53 AM, Neil Horman <nhorman@redhat.com> wrote:

>  The correct solution is to read data out of the socket into a temporary
> buffer that you allocate in the cifs code, and loop repetatively calling
> recv[msg|from] until such time as the entire message has been read.
>

That is what the cifs code does in cifs_demultiplex_thread.

Shirish,
If we can verify that performance does not appear to be impacted (e.g. some
dbench runs against remote server and also against localhost/samba)  when
removing the few lines of code in (cifs's) ipv4_connect in which cifs sets
sndbuf/rcvbuf (we never set them in the case fortunately) - then I will
remove those lines immediately (and I am ok with not having a mount option
to turn off the autotuning ie revert to old behavior)



-- 
Thanks,

Steve



* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24 14:23                         ` Steve French
@ 2008-10-24 15:02                           ` Shirish Pargaonkar
  2008-10-24 16:18                             ` Steve French
  0 siblings, 1 reply; 24+ messages in thread
From: Shirish Pargaonkar @ 2008-10-24 15:02 UTC (permalink / raw)
  To: Steve French
  Cc: Frank S Filz, Neil Horman, Jeff Layton, Suresh Jayaraman,
	linux-fsdevel, Evgeniy Polyakov,
	linux-cifs-client@lists.samba.org, Jim Rees

Steve,

Here is a run against a samba share

usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_n -c client.txt 1
dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs
1 clients started
   1        12     0.00 MB/sec  warmup   1 sec
   1        12     0.00 MB/sec  warmup   2 sec
   1        12     0.00 MB/sec  warmup   3 sec
   1        12     0.00 MB/sec  warmup   4 sec
   1        12     0.00 MB/sec  warmup   5 sec
   1        12     0.00 MB/sec  warmup   6 sec
   1        12     0.00 MB/sec  warmup   7 sec
   1        12     0.00 MB/sec  warmup   8 sec
   1        12     0.00 MB/sec  warmup   9 sec
   1        12     0.00 MB/sec  warmup  10 sec
   1        12     0.00 MB/sec  warmup  11 sec
   1        12     0.00 MB/sec  warmup  12 sec
   1        12     0.00 MB/sec  warmup  13 sec
   1        12     0.00 MB/sec  warmup  14 sec
   1        12     0.00 MB/sec  warmup  15 sec
   1        12     0.00 MB/sec  warmup  16 sec
   1        12     0.00 MB/sec  warmup  17 sec
   1        12     0.00 MB/sec  warmup  18 sec
   1        12     0.00 MB/sec  warmup  19 sec
   1        12     0.00 MB/sec  warmup  20 sec
   1        12     0.00 MB/sec  warmup  21 sec
   1        12     0.00 MB/sec  warmup  22 sec
   1        12     0.00 MB/sec  warmup  23 sec
   1        12     0.00 MB/sec  warmup  24 sec
   1       102     2.82 MB/sec  execute   1 sec
   1       163     2.42 MB/sec  execute   2 sec
   1       223     2.24 MB/sec  execute   3 sec
   1       232     1.75 MB/sec  execute   4 sec
   1       232     1.40 MB/sec  cleanup   5 sec
   1       232     1.16 MB/sec  cleanup   6 sec
   1       232     1.00 MB/sec  cleanup   7 sec
   1       232     0.87 MB/sec  cleanup   8 sec
   1       232     0.78 MB/sec  cleanup   9 sec
   1       232     0.70 MB/sec  cleanup  10 sec
   1       232     0.63 MB/sec  cleanup  11 sec
   1       232     0.58 MB/sec  cleanup  12 sec
   1       232     0.54 MB/sec  cleanup  13 sec
   1       232     0.50 MB/sec  cleanup  14 sec
   1       232     0.47 MB/sec  cleanup  15 sec
   1       232     0.44 MB/sec  cleanup  16 sec
   1       232     0.41 MB/sec  cleanup  17 sec
   1       232     0.39 MB/sec  cleanup  18 sec
   1       232     0.37 MB/sec  cleanup  19 sec
/bin/rm: cannot remove directory
`/mnt/smb_n/clients/client0/~dmtmp/COREL': Directory not empty
   1       232     0.35 MB/sec  cleanup  20 sec

Throughput 1.39629 MB/sec (sync open) (sync dirs) 1 procs


And here is against a Windows XP share

:/usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_r -c
client.txt 1
dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs
1 clients started
   1       132     3.01 MB/sec  execute   1 sec
   1       207     2.68 MB/sec  execute   2 sec
   1       284     2.61 MB/sec  execute   3 sec
   1       357     2.51 MB/sec  execute   4 sec
   1       418     2.40 MB/sec  cleanup   5 sec
   1       418     2.01 MB/sec  cleanup   6 sec
   1       418     1.72 MB/sec  cleanup   7 sec
   1       418     1.51 MB/sec  cleanup   8 sec
   1       418     1.34 MB/sec  cleanup   9 sec
   1       418     1.20 MB/sec  cleanup  10 sec
   1       418     1.10 MB/sec  cleanup  11 sec
   1       418     1.00 MB/sec  cleanup  12 sec
   1       418     0.93 MB/sec  cleanup  13 sec
   1       418     0.86 MB/sec  cleanup  14 sec
   1       418     0.80 MB/sec  cleanup  15 sec
   1       418     0.75 MB/sec  cleanup  16 sec
   1       418     0.71 MB/sec  cleanup  17 sec
   1       418     0.67 MB/sec  cleanup  18 sec
   1       418     0.63 MB/sec  cleanup  19 sec
   1       418     0.60 MB/sec  cleanup  20 sec
   1       418     0.57 MB/sec  cleanup  21 sec
   1       418     0.55 MB/sec  cleanup  22 sec
   1       418     0.52 MB/sec  cleanup  23 sec
   1       418     0.50 MB/sec  cleanup  24 sec
/bin/rm: cannot remove directory
`/mnt/smb_r/clients/client0/~dmtmp/PARADOX': Directory not empty
   1       418     0.49 MB/sec  cleanup  24 sec

Throughput 2.40957 MB/sec (sync open) (sync dirs) 1 procs


And the test run with your patch is still running.

Regards,

Shirish

On Fri, Oct 24, 2008 at 9:23 AM, Steve French <smfrench@gmail.com> wrote:
>
>
> On Fri, Oct 24, 2008 at 7:53 AM, Neil Horman <nhorman@redhat.com> wrote:
>>
>>  The correct solution is to read data out of the socket into a temporary
>> buffer that you allocate in the cifs code, and loop repetatively calling
>> recv[msg|from] until such time as the entire message has been read.
>>
>>
> That is what the cifs code does in cifs_demultiplex_thread.
>
> Shirish,
> If we can verify that performance does not appear to be impacted (e.g. some
> dbench runs against remote server and also against localhost/samba)  when
> removing the few lines of code in (cifs's) ipv4_connect in which cifs sets
> sndbuf/rcvbuf (we never set them in the case fortunately) - then I will
> remove those lines immediately (and I am ok with not having a mount option
> to turn off the autotuning ie revert to old behavior)
>
>
>
> --
> Thanks,
>
> Steve
>


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24 13:59                         ` Jim Rees
@ 2008-10-24 15:20                           ` Evgeniy Polyakov
  2008-10-24 15:36                             ` Jim Rees
  2008-10-24 15:29                           ` Neil Horman
  1 sibling, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2008-10-24 15:20 UTC (permalink / raw)
  To: Jim Rees
  Cc: Neil Horman, Steve French, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz

Hi.

On Fri, Oct 24, 2008 at 09:59:36AM -0400, Jim Rees (rees@umich.edu) wrote:
> My explanation may be flawed but the behavior is real.  If you turn on
> autotuning on the nfsd socket, and don't do anything with the buffer sizes,
> nfsd will stall.  This is in a 2.6.24 kernel.  I would of course welcome
> independent verification.

That's effectively a problem with waiting for a full request to be
in the socket queue, which apparently cannot grow infinitely and for a
large enough request may 'overflow' the currently calculated buffer size.

I do not know why nfsd decided to only read a request from the socket
once it was fully received - probably because it was decided that playing
with additional locks (or the socket lock itself) to safely demultiplex a
request was not appropriate, and that instead the system could read it in one
kernel_recvmsg() (or friend) call with its own internal locking. Whatever that
decision was based on, it does not apply to cifs, which reads data in its own
demultiplexing thread via loops of kernel_recvmsg() calls.

-- 
	Evgeniy Polyakov


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24 13:59                         ` Jim Rees
  2008-10-24 15:20                           ` Evgeniy Polyakov
@ 2008-10-24 15:29                           ` Neil Horman
  1 sibling, 0 replies; 24+ messages in thread
From: Neil Horman @ 2008-10-24 15:29 UTC (permalink / raw)
  To: Jim Rees
  Cc: Neil Horman, Steve French, Evgeniy Polyakov, Jeff Layton,
	Shirish Pargaonkar, Suresh Jayaraman,
	linux-cifs-client@lists.samba.org, linux-fsdevel, Frank S Filz

On Fri, Oct 24, 2008 at 09:59:36AM -0400, Jim Rees wrote:
> Neil Horman wrote:
> 
>   Thats not true.  Assuming there is sufficient memory available, and the tcp
>   segments receive window isn't closed, then you will continue to receive data, as
>   not setting SO_RCVBUF will allow the receive buffers to autotune in size.
> 
> My explanation may be flawed but the behavior is real.  If you turn on
> autotuning on the nfsd socket, and don't do anything with the buffer sizes,
> nfsd will stall.  This is in a 2.6.24 kernel.  I would of course welcome
> independent verification.
> 
I'll verify when I get a chance, and I believe that what you say about nfsd is
true, but I don't think it's directly a result of autotuning (see below).

> I thought that the receive buffer size would only grow as large as the
> receiver's advertised window, but I'll admit I don't fully understand the
> autotuning code.

This is roughly correct, since that's as large as the buffer ever really needs
to be.  I say roughly because it will actually be somewhat larger than the
window (allowing for skb overhead).

I would need to verify this as well, but I would expect what's happening in the
nfsd code is that by setting SO_RCVBUF, you are forcing the receive window to
be sufficiently large to hold an entire rpc transaction.  While this works, it
does force the nfsd code to have some understanding of the underlying
transport.  It might be better (subject to much debate of course) to allocate a
temporary buffer in the nfsd receive code that can hold a maximally sized rpc
transaction, and simply call kernel_recvmsg with an associated timeout for
whatever size you were expecting.  You may get less than that back, but you can
then store it in the temp buffer, update the appropriate pointers and sizes and
call recvmsg again.  That keeps the receive window open and data streaming in.
Using the tcp receive buffer to force receives on application message
boundaries is fraught with problems like this (in any application).  It's
better to use some extra RAM to handle your application message reassembly.

Neil

-- 
/***************************************************
 *Neil Horman
 *Senior Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24 15:20                           ` Evgeniy Polyakov
@ 2008-10-24 15:36                             ` Jim Rees
  0 siblings, 0 replies; 24+ messages in thread
From: Jim Rees @ 2008-10-24 15:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Neil Horman, Steve French, Jeff Layton, Shirish Pargaonkar,
	Suresh Jayaraman, linux-cifs-client@lists.samba.org,
	linux-fsdevel, Frank S Filz

Evgeniy Polyakov wrote:

  I do not know why nfsd decided to only read request from the socket,
  when it was fully received,

I assume it's because nfsd can't copy the data for performance reasons, but
doesn't want to release any of the buffer until the entire rpc has arrived,
so it can guarantee atomicity.  But I'm not an nfsd expert.


* Re: sendmsg blocking with sendtimeout vs. non-blocking
  2008-10-24 15:02                           ` Shirish Pargaonkar
@ 2008-10-24 16:18                             ` Steve French
  0 siblings, 0 replies; 24+ messages in thread
From: Steve French @ 2008-10-24 16:18 UTC (permalink / raw)
  To: Shirish Pargaonkar
  Cc: Frank S Filz, Neil Horman, Jeff Layton, Suresh Jayaraman,
	linux-fsdevel, Evgeniy Polyakov,
	linux-cifs-client@lists.samba.org, Jim Rees


A few comments - you need a warmup longer than 1 second, and a test run
longer than 5 seconds. A 100 second test run is probably ok - we need the
"before and after" comparison (i.e. compare the performance with the patch,
then umount and compare the performance with noautotune turned on, the old
behavior).   But in any case something looks wrong - a throughput of
1.40 MB/sec to Samba is worse than I would expect.  Also please try it with
more simulated clients (e.g. 50).

With dbench 3.04 I am getting more than double what you got with 1 client,
although my system could be a bit faster.

Also can you make sure that you are running current code.  I had hoped that
the

/bin/rm: cannot remove directory
`/mnt/smb_r/clients/client0/~dmtmp/PARADOX': Directory not empty

error would go away with the fixes Jeff and I did to delete on close (it may
be that this version of dbench keeps files open when it tries to delete them
and then does not close them before the directory is removed, in which case
delete on close doesn't help).

You should probably be running dbench 4 instead


On Fri, Oct 24, 2008 at 10:02 AM, Shirish Pargaonkar <
shirishpargaonkar@gmail.com> wrote:

> Steve,
>
> Here is a run against a samba share
>
> usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_n -c
> client.txt 1
> dbench version 3.04 - Copyright Andrew Tridgell 1999-2004
>
> Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs
> 1 clients started
>   1        12     0.00 MB/sec  warmup   1 sec
>   1        12     0.00 MB/sec  warmup  24 sec
>   1       102     2.82 MB/sec  execute   1 sec
>   1       232     0.37 MB/sec  cleanup  19 sec
> /bin/rm: cannot remove directory
> `/mnt/smb_n/clients/client0/~dmtmp/COREL': Directory not empty
>   1       232     0.35 MB/sec  cleanup  20 sec
>
> Throughput 1.39629 MB/sec (sync open) (sync dirs) 1 procs
>
>
> And here is against a Windows XP share
>
> :/usr/src/dbench/dbench-3.04 # ./dbench -s -S -t 5s -D /mnt/smb_r -c
> client.txt 1
> dbench version 3.04 - Copyright Andrew Tridgell 1999-2004
>
> Running for 5 seconds with load 'client.txt' and minimum warmup 1 secs
> 1 clients started
>   1       132     3.01 MB/sec  execute   1 sec
>   1       207     2.68 MB/sec  execute   2 sec
>   1       284     2.61 MB/sec  execute   3 sec
>   1       357     2.51 MB/sec  execute   4 sec
>   1       418     2.40 MB/sec  cleanup   5 sec
>   1       418     2.01 MB/sec  cleanup   6 sec
>   1       418     1.72 MB/sec  cleanup   7 sec
>   1       418     1.51 MB/sec  cleanup   8 sec
>   1       418     1.34 MB/sec  cleanup   9 sec
>   1       418     1.20 MB/sec  cleanup  10 sec
>   1       418     1.10 MB/sec  cleanup  11 sec
>   1       418     1.00 MB/sec  cleanup  12 sec
>   1       418     0.93 MB/sec  cleanup  13 sec
>   1       418     0.86 MB/sec  cleanup  14 sec
>   1       418     0.80 MB/sec  cleanup  15 sec
>   1       418     0.75 MB/sec  cleanup  16 sec
>   1       418     0.71 MB/sec  cleanup  17 sec
>   1       418     0.67 MB/sec  cleanup  18 sec
>   1       418     0.63 MB/sec  cleanup  19 sec
>   1       418     0.60 MB/sec  cleanup  20 sec
>   1       418     0.57 MB/sec  cleanup  21 sec
>   1       418     0.55 MB/sec  cleanup  22 sec
>   1       418     0.52 MB/sec  cleanup  23 sec
>   1       418     0.50 MB/sec  cleanup  24 sec
> /bin/rm: cannot remove directory
> `/mnt/smb_r/clients/client0/~dmtmp/PARADOX': Directory not empty
>   1       418     0.49 MB/sec  cleanup  24 sec
>
> Throughput 2.40957 MB/sec (sync open) (sync dirs) 1 procs
>
>
> And the testrun still running with your patch.
>
> Regards,
>
> Shirish
>
> On Fri, Oct 24, 2008 at 9:23 AM, Steve French <smfrench@gmail.com> wrote:
> >
> >
> > On Fri, Oct 24, 2008 at 7:53 AM, Neil Horman <nhorman@redhat.com> wrote:
> >>
> >>  The correct solution is to read data out of the socket into a temporary
> >> buffer that you allocate in the cifs code, and loop repetatively calling
> >> recv[msg|from] until such time as the entire message has been read.
> >>
> >>
> > That is what the cifs code does in cifs_demultiplex_thread.
> >
> > Shirish,
> > If we can verify that performance does not appear to be impacted (e.g.
> some
> > dbench runs against remote server and also against localhost/samba)  when
> > removing the few lines of code in (cifs's) ipv4_connect in which cifs
> sets
> > sndbuf/rcvbuf (we never set them in the case fortunately) - then I will
> > remove those lines immediately (and I am ok with not having a mount
> option
> > to turn off the autotuning ie revert to old behavior)
> >
> >
> >
> > --
> > Thanks,
> >
> > Steve
> >
>



-- 
Thanks,

Steve


