SCTP data chunk bundling when SCTP

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* SCTP data chunk bundling when SCTP_NODELAY is set
@ 2014-06-17 14:17 David Laight
  2014-06-17 15:45 ` Vlad Yasevich
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2014-06-17 14:17 UTC (permalink / raw)
  To: netdev@vger.kernel.org

If SCTP_NODELAY is set it is difficult to get SCTP to bundle
data chunks into ethernet packets.
This leads to very high packet rates which bundling could easily
reduce by a factor or 8 or 10.

Nagle can't really be enabled because it generates unwanted delays
when traffic is light (Nagle only really works for unidirectional bulk
data and command-response when the messages are smaller than the mtu).

Even if the sending application knows it has more data to send,
there isn't much it can do to get the chunks bundled.

AFAICT 'corking' the socket even stops full sized packets being
sent - so the application will deadlock if the socket write
buffer size is reached before the socket is 'uncorked'.
This also means that the application can't send back to back
full sized packets unless it uncorks the socket at exactly
the right places.

MSG_MORE isn't supported by SCTP, but I'm not sure it would help.
You really need a MSG_NO_MORE flag and to leave Nagle enabled.

About the only thing I can think of is to normally have Nagle
enabled, and then perform the following sequence to force the
buffered data chunks be sent:
1) disable Nagle
2) cork the socket
3) uncork the socket
4) enable Nagle
Four socket calls is a little excessive!

Any other ideas ?

	David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-17 14:17 SCTP data chunk bundling when SCTP_NODELAY is set David Laight
@ 2014-06-17 15:45 ` Vlad Yasevich
  2014-06-17 16:07   ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Vlad Yasevich @ 2014-06-17 15:45 UTC (permalink / raw)
  To: David Laight, netdev@vger.kernel.org

On 06/17/2014 10:17 AM, David Laight wrote:
> If SCTP_NODELAY is set it is difficult to get SCTP to bundle
> data chunks into ethernet packets.
> This leads to very high packet rates which bundling could easily
> reduce by a factor or 8 or 10.
> 
> Nagle can't really be enabled because it generates unwanted delays
> when traffic is light (Nagle only really works for unidirectional bulk
> data and command-response when the messages are smaller than the mtu).
> 
> Even if the sending application knows it has more data to send,
> there isn't much it can do to get the chunks bundled.
> 
> AFAICT 'corking' the socket even stops full sized packets being
> sent - so the application will deadlock if the socket write
> buffer size is reached before the socket is 'uncorked'.
> This also means that the application can't send back to back
> full sized packets unless it uncorks the socket at exactly
> the right places.
> 
> MSG_MORE isn't supported by SCTP, but I'm not sure it would help.
> You really need a MSG_NO_MORE flag and to leave Nagle enabled.
> 
> About the only thing I can think of is to normally have Nagle
> enabled, and then perform the following sequence to force the
> buffered data chunks be sent:
> 1) disable Nagle
> 2) cork the socket
> 3) uncork the socket
> 4) enable Nagle
> Four socket calls is a little excessive!

First, how are you corking an SCTP socket?  There is no SCTP_CORK
and looking at the code, I don't see how an SCTP queue can be
cored by user...

I suppose we could implement SCTP_CORK to do the right thing.

I thought is possibly utilizing something like sendmmsg() and passing
an extra flag to let it be know that this is a multi-message send
that should be queued up by sctp..

-vlad

> 
> Any other ideas ?
> 
> 	David
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-17 15:45 ` Vlad Yasevich
@ 2014-06-17 16:07   ` David Laight
  2014-06-17 16:49     ` Vlad Yasevich
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2014-06-17 16:07 UTC (permalink / raw)
  To: 'Vlad Yasevich', netdev@vger.kernel.org

From: Vlad Yasevich
> On 06/17/2014 10:17 AM, David Laight wrote:
> > If SCTP_NODELAY is set it is difficult to get SCTP to bundle
> > data chunks into ethernet packets.
> > This leads to very high packet rates which bundling could easily
> > reduce by a factor or 8 or 10.
> >
> > Nagle can't really be enabled because it generates unwanted delays
> > when traffic is light (Nagle only really works for unidirectional bulk
> > data and command-response when the messages are smaller than the mtu).
> >
> > Even if the sending application knows it has more data to send,
> > there isn't much it can do to get the chunks bundled.
> >
> > AFAICT 'corking' the socket even stops full sized packets being
> > sent - so the application will deadlock if the socket write
> > buffer size is reached before the socket is 'uncorked'.
> > This also means that the application can't send back to back
> > full sized packets unless it uncorks the socket at exactly
> > the right places.
> >
> > MSG_MORE isn't supported by SCTP, but I'm not sure it would help.
> > You really need a MSG_NO_MORE flag and to leave Nagle enabled.
> >
> > About the only thing I can think of is to normally have Nagle
> > enabled, and then perform the following sequence to force the
> > buffered data chunks be sent:
> > 1) disable Nagle
> > 2) cork the socket
> > 3) uncork the socket
> > 4) enable Nagle
> > Four socket calls is a little excessive!
> 
> First, how are you corking an SCTP socket?  There is no SCTP_CORK
> and looking at the code, I don't see how an SCTP queue can be
> cored by user...

I only looked as far as seeing that the code in sm_sideffect.c
allows someone else to have corked the socket, and the effect
that the 'cork' had.

> I suppose we could implement SCTP_CORK to do the right thing.
> 
> I thought is possibly utilizing something like sendmmsg() and passing
> an extra flag to let it be know that this is a multi-message send
> that should be queued up by sctp..

It would be as easy to expose the extra flag to the 'application'
allowing it to use sendmsg() or sendmmsg().
While sendmmsg() saves a system call, it is fairly horrid to use.
(and I'm sending from a kernel driver so don't care about the 
system call cost!)

Possibly MSG_MORE with Nagle disabled could invoke the Nagle send
delay - but you'd need to know whether any chunks in the queue
had MSG_MORE clear.
It would also be nice to have a 'send on local timeout', rather than
Nagle's 'wait for everything to be acked'.

	David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-17 16:07   ` David Laight
@ 2014-06-17 16:49     ` Vlad Yasevich
  2014-06-18 13:47       ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Vlad Yasevich @ 2014-06-17 16:49 UTC (permalink / raw)
  To: David Laight, netdev@vger.kernel.org

On 06/17/2014 12:07 PM, David Laight wrote:
> From: Vlad Yasevich
>> On 06/17/2014 10:17 AM, David Laight wrote:
>>> If SCTP_NODELAY is set it is difficult to get SCTP to bundle
>>> data chunks into ethernet packets.
>>> This leads to very high packet rates which bundling could easily
>>> reduce by a factor or 8 or 10.
>>>
>>> Nagle can't really be enabled because it generates unwanted delays
>>> when traffic is light (Nagle only really works for unidirectional bulk
>>> data and command-response when the messages are smaller than the mtu).
>>>
>>> Even if the sending application knows it has more data to send,
>>> there isn't much it can do to get the chunks bundled.
>>>
>>> AFAICT 'corking' the socket even stops full sized packets being
>>> sent - so the application will deadlock if the socket write
>>> buffer size is reached before the socket is 'uncorked'.
>>> This also means that the application can't send back to back
>>> full sized packets unless it uncorks the socket at exactly
>>> the right places.
>>>
>>> MSG_MORE isn't supported by SCTP, but I'm not sure it would help.
>>> You really need a MSG_NO_MORE flag and to leave Nagle enabled.
>>>
>>> About the only thing I can think of is to normally have Nagle
>>> enabled, and then perform the following sequence to force the
>>> buffered data chunks be sent:
>>> 1) disable Nagle
>>> 2) cork the socket
>>> 3) uncork the socket
>>> 4) enable Nagle
>>> Four socket calls is a little excessive!
>>
>> First, how are you corking an SCTP socket?  There is no SCTP_CORK
>> and looking at the code, I don't see how an SCTP queue can be
>> cored by user...
> 
> I only looked as far as seeing that the code in sm_sideffect.c
> allows someone else to have corked the socket, and the effect
> that the 'cork' had.

Right.  Cork is pretty dumb right now.  It should work more like TCP
where if you've queued up enough data to bypass nagle checks, it should
flush some number for full MTU packets.

> 
>> I suppose we could implement SCTP_CORK to do the right thing.
>>
>> I thought is possibly utilizing something like sendmmsg() and passing
>> an extra flag to let it be know that this is a multi-message send
>> that should be queued up by sctp..
> 
> It would be as easy to expose the extra flag to the 'application'
> allowing it to use sendmsg() or sendmmsg().
> While sendmmsg() saves a system call, it is fairly horrid to use.
> (and I'm sending from a kernel driver so don't care about the 
> system call cost!)
> 
> Possibly MSG_MORE with Nagle disabled could invoke the Nagle send
> delay - but you'd need to know whether any chunks in the queue
> had MSG_MORE clear.

That's why doing this with cork would be simpler.  The ULP can just
queue up a bunch of small data and if we pass nagle checks, it will be
flushed.  If not, uncork will flush it.

I could work up a patch for you if you want.

Thanks
-vlad

> It would also be nice to have a 'send on local timeout', rather than
> Nagle's 'wait for everything to be acked'.
> 
> 	David
> 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-17 16:49     ` Vlad Yasevich
@ 2014-06-18 13:47       ` David Laight
  2014-06-18 16:38         ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2014-06-18 13:47 UTC (permalink / raw)
  To: 'Vlad Yasevich', netdev@vger.kernel.org

From: Vlad Yasevich 
...
> >> I suppose we could implement SCTP_CORK to do the right thing.
> >>
> >> I thought is possibly utilizing something like sendmmsg() and passing
> >> an extra flag to let it be know that this is a multi-message send
> >> that should be queued up by sctp..
> >
> > It would be as easy to expose the extra flag to the 'application'
> > allowing it to use sendmsg() or sendmmsg().
> > While sendmmsg() saves a system call, it is fairly horrid to use.
> > (and I'm sending from a kernel driver so don't care about the
> > system call cost!)
> >
> > Possibly MSG_MORE with Nagle disabled could invoke the Nagle send
> > delay - but you'd need to know whether any chunks in the queue
> > had MSG_MORE clear.
> 
> That's why doing this with cork would be simpler.  The ULP can just
> queue up a bunch of small data and if we pass nagle checks, it will be
> flushed.  If not, uncork will flush it.

I think you need only care about the 'MSG_MORE' flag of the last data chunk.
Any earlier data (with MSG_MORE clear) will usually have been sent (unless
prevented by Nagle or flow control), so you probably wouldn't be able to
send it regardless of the state of MSG_MORE on a chunk being queued.
There is also the expectation that another send without MSG_MORE will
happen almost immediately.

So MSG_MORE could have the same effect as corking the socket.
Although you'd need separate bits - but uncork could clear both.

What I would like to implement (from M3UA) is to hold data for a maximum
of (say) 5ms awaiting M3UA data chunks. To do this properly requires
knowledge of the actual ethernet packet boundaries.

The problem is there are (at least) three cases:
1) This data should be sent as soon as possible.
2) Send this data some time soonish.
3) I've got another data block I'm going to give you after this one.

> I could work up a patch for you if you want.

I was thinking I might try to write one.
But I've been asked to look at something else for a bit.

	David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-18 13:47       ` David Laight
@ 2014-06-18 16:38         ` David Laight
  2014-06-19 13:27           ` Vlad Yasevich
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2014-06-18 16:38 UTC (permalink / raw)
  To: David Laight, 'Vlad Yasevich', netdev@vger.kernel.org

From: David Laight
> From: Vlad Yasevich
> ...
> > >> I suppose we could implement SCTP_CORK to do the right thing.
> > >>
> > >> I thought is possibly utilizing something like sendmmsg() and passing
> > >> an extra flag to let it be know that this is a multi-message send
> > >> that should be queued up by sctp..
> > >
> > > It would be as easy to expose the extra flag to the 'application'
> > > allowing it to use sendmsg() or sendmmsg().
> > > While sendmmsg() saves a system call, it is fairly horrid to use.
> > > (and I'm sending from a kernel driver so don't care about the
> > > system call cost!)
> > >
> > > Possibly MSG_MORE with Nagle disabled could invoke the Nagle send
> > > delay - but you'd need to know whether any chunks in the queue
> > > had MSG_MORE clear.
> >
> > That's why doing this with cork would be simpler.  The ULP can just
> > queue up a bunch of small data and if we pass nagle checks, it will be
> > flushed.  If not, uncork will flush it.
> 
> I think you need only care about the 'MSG_MORE' flag of the last data chunk.
> Any earlier data (with MSG_MORE clear) will usually have been sent (unless
> prevented by Nagle or flow control), so you probably wouldn't be able to
> send it regardless of the state of MSG_MORE on a chunk being queued.
> There is also the expectation that another send without MSG_MORE will
> happen almost immediately.
> 
> So MSG_MORE could have the same effect as corking the socket.
> Although you'd need separate bits - but uncork could clear both.
> 
> What I would like to implement (from M3UA) is to hold data for a maximum
> of (say) 5ms awaiting M3UA data chunks. To do this properly requires
> knowledge of the actual ethernet packet boundaries.
> 
> The problem is there are (at least) three cases:
> 1) This data should be sent as soon as possible.
> 2) Send this data some time soonish.
> 3) I've got another data block I'm going to give you after this one.
> 
> > I could work up a patch for you if you want.
> 
> I was thinking I might try to write one.

Actually this might work for what I'm trying to do.
(untested).

diff --git a/net/sctp/output.c b/net/sctp/output.c
index 0f4d15f..51030bc 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -691,7 +691,7 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
 	 * if any previously transmitted data on the connection remains
 	 * unacknowledged.
 	 */
-	if (!sctp_sk(asoc->base.sk)->nodelay && sctp_packet_empty(packet) &&
+	if (sctp_sk(asoc->base.sk)->nodelay != 1 && sctp_packet_empty(packet) &&
 	    inflight && sctp_state(asoc, ESTABLISHED)) {
 		unsigned int max = transport->pathmtu - packet->overhead;
 		unsigned int len = chunk->skb->len + q->out_qlen;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index fee06b9..084b957 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1928,7 +1928,10 @@ static int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
 	}
 
 	/* Break the message into multiple chunks of maximum size. */
+	if (msg->msg_flags & MSG_MORE)
+		sp->nodelay |= 2;
 	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
+	sp->nodelay &= 1;
 	if (IS_ERR(datamsg)) {
 		err = PTR_ERR(datamsg);
 		goto out_free;

Ideally MSG_MORE should delay sending even if 'inflight' is false.
But that would require 'flush on timeout'.
I'd prefer that, and with a configurable timeout.
But I can implement the timeout in the 'application'.

Given the way Nagle is implemented in sctp, I could keep flipping
it on and off - but that probably has undocumented behaviour
(ie it might suddenly change).

	David

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-18 16:38         ` David Laight
@ 2014-06-19 13:27           ` Vlad Yasevich
  2014-06-19 13:45             ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Vlad Yasevich @ 2014-06-19 13:27 UTC (permalink / raw)
  To: David Laight, netdev@vger.kernel.org

On 06/18/2014 12:38 PM, David Laight wrote:
> From: David Laight
>> From: Vlad Yasevich
>> ...
>>>>> I suppose we could implement SCTP_CORK to do the right thing.
>>>>>
>>>>> I thought is possibly utilizing something like sendmmsg() and passing
>>>>> an extra flag to let it be know that this is a multi-message send
>>>>> that should be queued up by sctp..
>>>>
>>>> It would be as easy to expose the extra flag to the 'application'
>>>> allowing it to use sendmsg() or sendmmsg().
>>>> While sendmmsg() saves a system call, it is fairly horrid to use.
>>>> (and I'm sending from a kernel driver so don't care about the
>>>> system call cost!)
>>>>
>>>> Possibly MSG_MORE with Nagle disabled could invoke the Nagle send
>>>> delay - but you'd need to know whether any chunks in the queue
>>>> had MSG_MORE clear.
>>>
>>> That's why doing this with cork would be simpler.  The ULP can just
>>> queue up a bunch of small data and if we pass nagle checks, it will be
>>> flushed.  If not, uncork will flush it.
>>
>> I think you need only care about the 'MSG_MORE' flag of the last data chunk.
>> Any earlier data (with MSG_MORE clear) will usually have been sent (unless
>> prevented by Nagle or flow control), so you probably wouldn't be able to
>> send it regardless of the state of MSG_MORE on a chunk being queued.
>> There is also the expectation that another send without MSG_MORE will
>> happen almost immediately.
>>
>> So MSG_MORE could have the same effect as corking the socket.
>> Although you'd need separate bits - but uncork could clear both.
>>
>> What I would like to implement (from M3UA) is to hold data for a maximum
>> of (say) 5ms awaiting M3UA data chunks. To do this properly requires
>> knowledge of the actual ethernet packet boundaries.
>>
>> The problem is there are (at least) three cases:
>> 1) This data should be sent as soon as possible.
>> 2) Send this data some time soonish.
>> 3) I've got another data block I'm going to give you after this one.
>>
>>> I could work up a patch for you if you want.
>>
>> I was thinking I might try to write one.
> 
> Actually this might work for what I'm trying to do.
> (untested).
> 
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 0f4d15f..51030bc 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -691,7 +691,7 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
>  	 * if any previously transmitted data on the connection remains
>  	 * unacknowledged.
>  	 */
> -	if (!sctp_sk(asoc->base.sk)->nodelay && sctp_packet_empty(packet) &&
> +	if (sctp_sk(asoc->base.sk)->nodelay != 1 && sctp_packet_empty(packet) &&
>  	    inflight && sctp_state(asoc, ESTABLISHED)) {
>  		unsigned int max = transport->pathmtu - packet->overhead;
>  		unsigned int len = chunk->skb->len + q->out_qlen;
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index fee06b9..084b957 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1928,7 +1928,10 @@ static int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>  	}
>  
>  	/* Break the message into multiple chunks of maximum size. */
> +	if (msg->msg_flags & MSG_MORE)
> +		sp->nodelay |= 2;
>  	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> +	sp->nodelay &= 1;

I think you reset it too early.  You have to reset after the call to
sctp_primitive_SEND().  This way, you queue up the data and go through
the state machine with nodelay != 1, thus triggering the updated code
on output.

>  	if (IS_ERR(datamsg)) {
>  		err = PTR_ERR(datamsg);
>  		goto out_free;
> 
> Ideally MSG_MORE should delay sending even if 'inflight' is false.
> But that would require 'flush on timeout'.

You can use a lack of MSG_MORE to be an indication of a flush.  Thus
MSG_MORE would always queue up data until MSG_MORE is 0, at which point
flush should happen.

> I'd prefer that, and with a configurable timeout.
> But I can implement the timeout in the 'application'.
> 
> Given the way Nagle is implemented in sctp, I could keep flipping
> it on and off - but that probably has undocumented behaviour
> (ie it might suddenly change).

With the above MSG_MORE, I think you can just turn off nagle once and
use MSG_MORE and when you drain your application queue, clear MSG_MORE
on the last write.

-vlad


> 
> 	David
> 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-19 13:27           ` Vlad Yasevich
@ 2014-06-19 13:45             ` David Laight
  2014-06-19 14:05               ` Eric Dumazet
  2014-06-19 15:36               ` Vlad Yasevich
  0 siblings, 2 replies; 10+ messages in thread
From: David Laight @ 2014-06-19 13:45 UTC (permalink / raw)
  To: 'Vlad Yasevich', netdev@vger.kernel.org

From: Vlad Yasevich
> On 06/18/2014 12:38 PM, David Laight wrote:
> > From: David Laight
> >> From: Vlad Yasevich
> >> ...
> >>>>> I suppose we could implement SCTP_CORK to do the right thing.
> >>>>>
> >>>>> I thought is possibly utilizing something like sendmmsg() and passing
> >>>>> an extra flag to let it be know that this is a multi-message send
> >>>>> that should be queued up by sctp..
> >>>>
> >>>> It would be as easy to expose the extra flag to the 'application'
> >>>> allowing it to use sendmsg() or sendmmsg().
> >>>> While sendmmsg() saves a system call, it is fairly horrid to use.
> >>>> (and I'm sending from a kernel driver so don't care about the
> >>>> system call cost!)
> >>>>
> >>>> Possibly MSG_MORE with Nagle disabled could invoke the Nagle send
> >>>> delay - but you'd need to know whether any chunks in the queue
> >>>> had MSG_MORE clear.
> >>>
> >>> That's why doing this with cork would be simpler.  The ULP can just
> >>> queue up a bunch of small data and if we pass nagle checks, it will be
> >>> flushed.  If not, uncork will flush it.
> >>
> >> I think you need only care about the 'MSG_MORE' flag of the last data chunk.
> >> Any earlier data (with MSG_MORE clear) will usually have been sent (unless
> >> prevented by Nagle or flow control), so you probably wouldn't be able to
> >> send it regardless of the state of MSG_MORE on a chunk being queued.
> >> There is also the expectation that another send without MSG_MORE will
> >> happen almost immediately.
> >>
> >> So MSG_MORE could have the same effect as corking the socket.
> >> Although you'd need separate bits - but uncork could clear both.
> >>
> >> What I would like to implement (from M3UA) is to hold data for a maximum
> >> of (say) 5ms awaiting M3UA data chunks. To do this properly requires
> >> knowledge of the actual ethernet packet boundaries.
> >>
> >> The problem is there are (at least) three cases:
> >> 1) This data should be sent as soon as possible.
> >> 2) Send this data some time soonish.
> >> 3) I've got another data block I'm going to give you after this one.
> >>
> >>> I could work up a patch for you if you want.
> >>
> >> I was thinking I might try to write one.
> >
> > Actually this might work for what I'm trying to do.
> > (untested).
> >
> > diff --git a/net/sctp/output.c b/net/sctp/output.c
> > index 0f4d15f..51030bc 100644
> > --- a/net/sctp/output.c
> > +++ b/net/sctp/output.c
> > @@ -691,7 +691,7 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
> >  	 * if any previously transmitted data on the connection remains
> >  	 * unacknowledged.
> >  	 */
> > -	if (!sctp_sk(asoc->base.sk)->nodelay && sctp_packet_empty(packet) &&
> > +	if (sctp_sk(asoc->base.sk)->nodelay != 1 && sctp_packet_empty(packet) &&
> >  	    inflight && sctp_state(asoc, ESTABLISHED)) {
> >  		unsigned int max = transport->pathmtu - packet->overhead;
> >  		unsigned int len = chunk->skb->len + q->out_qlen;
> > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > index fee06b9..084b957 100644
> > --- a/net/sctp/socket.c
> > +++ b/net/sctp/socket.c
> > @@ -1928,7 +1928,10 @@ static int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
> >  	}
> >
> >  	/* Break the message into multiple chunks of maximum size. */
> > +	if (msg->msg_flags & MSG_MORE)
> > +		sp->nodelay |= 2;
> >  	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
> > +	sp->nodelay &= 1;
> 
> I think you reset it too early.  You have to reset after the call to
> sctp_primitive_SEND().  This way, you queue up the data and go through
> the state machine with nodelay != 1, thus triggering the updated code
> on output.

I changed it to clear the flag if MSG_MORE is clear before I tested it.

I'll post a patch after net-next opens.

> > Ideally MSG_MORE should delay sending even if 'inflight' is false.
> > But that would require 'flush on timeout'.
> 
> You can use a lack of MSG_MORE to be an indication of a flush.  Thus
> MSG_MORE would always queue up data until MSG_MORE is 0, at which point
> flush should happen.

With Nagle disabled. If MSG_MORE was clear on the previous send then there
will normally be nothing queued (would have to be flow control limited).

> 
> > I'd prefer that, and with a configurable timeout.
> > But I can implement the timeout in the 'application'.
> >
> > Given the way Nagle is implemented in sctp, I could keep flipping
> > it on and off - but that probably has undocumented behaviour
> > (ie it might suddenly change).
> 
> With the above MSG_MORE, I think you can just turn off nagle once and
> use MSG_MORE and when you drain your application queue, clear MSG_MORE
> on the last write.

That is what I've done.
When my test application sends 100 messages through M3UA I now see a
small number of ethernet packets - rather than 100.
The whole thing does rather rely on the lengths of various code
paths in order to get multiple messages queued at the point that
sendmsg() is finally called.

I'm not worried about sending 2 packets at the start of a burst of data.
It seems safer than not sending data because the application send
a single block with MSG_MORE set.

I do need to do some testing with simulated network delays.
Someone posted how to set that up earlier today.

I noticed that the TCP code is documented to eventually send data after 200ms.
It would be better if that interval were settable per-socket.
I'd set it to 1ms (or next tick).

	David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-19 13:45             ` David Laight
@ 2014-06-19 14:05               ` Eric Dumazet
  2014-06-19 15:36               ` Vlad Yasevich
  1 sibling, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2014-06-19 14:05 UTC (permalink / raw)
  To: David Laight; +Cc: 'Vlad Yasevich', netdev@vger.kernel.org

On Thu, 2014-06-19 at 13:45 +0000, David Laight wrote:

> I noticed that the TCP code is documented to eventually send data after 200ms.
> It would be better if that interval were settable per-socket.
> I'd set it to 1ms (or next tick).

This 200ms is really RTO. We do not use an extra timer for this yet.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: SCTP data chunk bundling when SCTP_NODELAY is set
  2014-06-19 13:45             ` David Laight
  2014-06-19 14:05               ` Eric Dumazet
@ 2014-06-19 15:36               ` Vlad Yasevich
  1 sibling, 0 replies; 10+ messages in thread
From: Vlad Yasevich @ 2014-06-19 15:36 UTC (permalink / raw)
  To: David Laight, netdev@vger.kernel.org

On 06/19/2014 09:45 AM, David Laight wrote:
> From: Vlad Yasevich
>> On 06/18/2014 12:38 PM, David Laight wrote:
>>> From: David Laight
>>>> From: Vlad Yasevich
>>>> ...
>>>>>>> I suppose we could implement SCTP_CORK to do the right thing.
>>>>>>>
>>>>>>> I thought is possibly utilizing something like sendmmsg() and passing
>>>>>>> an extra flag to let it be know that this is a multi-message send
>>>>>>> that should be queued up by sctp..
>>>>>>
>>>>>> It would be as easy to expose the extra flag to the 'application'
>>>>>> allowing it to use sendmsg() or sendmmsg().
>>>>>> While sendmmsg() saves a system call, it is fairly horrid to use.
>>>>>> (and I'm sending from a kernel driver so don't care about the
>>>>>> system call cost!)
>>>>>>
>>>>>> Possibly MSG_MORE with Nagle disabled could invoke the Nagle send
>>>>>> delay - but you'd need to know whether any chunks in the queue
>>>>>> had MSG_MORE clear.
>>>>>
>>>>> That's why doing this with cork would be simpler.  The ULP can just
>>>>> queue up a bunch of small data and if we pass nagle checks, it will be
>>>>> flushed.  If not, uncork will flush it.
>>>>
>>>> I think you need only care about the 'MSG_MORE' flag of the last data chunk.
>>>> Any earlier data (with MSG_MORE clear) will usually have been sent (unless
>>>> prevented by Nagle or flow control), so you probably wouldn't be able to
>>>> send it regardless of the state of MSG_MORE on a chunk being queued.
>>>> There is also the expectation that another send without MSG_MORE will
>>>> happen almost immediately.
>>>>
>>>> So MSG_MORE could have the same effect as corking the socket.
>>>> Although you'd need separate bits - but uncork could clear both.
>>>>
>>>> What I would like to implement (from M3UA) is to hold data for a maximum
>>>> of (say) 5ms awaiting M3UA data chunks. To do this properly requires
>>>> knowledge of the actual ethernet packet boundaries.
>>>>
>>>> The problem is there are (at least) three cases:
>>>> 1) This data should be sent as soon as possible.
>>>> 2) Send this data some time soonish.
>>>> 3) I've got another data block I'm going to give you after this one.
>>>>
>>>>> I could work up a patch for you if you want.
>>>>
>>>> I was thinking I might try to write one.
>>>
>>> Actually this might work for what I'm trying to do.
>>> (untested).
>>>
>>> diff --git a/net/sctp/output.c b/net/sctp/output.c
>>> index 0f4d15f..51030bc 100644
>>> --- a/net/sctp/output.c
>>> +++ b/net/sctp/output.c
>>> @@ -691,7 +691,7 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
>>>  	 * if any previously transmitted data on the connection remains
>>>  	 * unacknowledged.
>>>  	 */
>>> -	if (!sctp_sk(asoc->base.sk)->nodelay && sctp_packet_empty(packet) &&
>>> +	if (sctp_sk(asoc->base.sk)->nodelay != 1 && sctp_packet_empty(packet) &&
>>>  	    inflight && sctp_state(asoc, ESTABLISHED)) {
>>>  		unsigned int max = transport->pathmtu - packet->overhead;
>>>  		unsigned int len = chunk->skb->len + q->out_qlen;
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index fee06b9..084b957 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -1928,7 +1928,10 @@ static int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
>>>  	}
>>>
>>>  	/* Break the message into multiple chunks of maximum size. */
>>> +	if (msg->msg_flags & MSG_MORE)
>>> +		sp->nodelay |= 2;
>>>  	datamsg = sctp_datamsg_from_user(asoc, sinfo, msg, msg_len);
>>> +	sp->nodelay &= 1;
>>
>> I think you reset it too early.  You have to reset after the call to
>> sctp_primitive_SEND().  This way, you queue up the data and go through
>> the state machine with nodelay != 1, thus triggering the updated code
>> on output.
> 
> I changed it to clear the flag if MSG_MORE is clear before I tested it.
> 
> I'll post a patch after net-next opens.
> 
>>> Ideally MSG_MORE should delay sending even if 'inflight' is false.
>>> But that would require 'flush on timeout'.
>>
>> You can use a lack of MSG_MORE to be an indication of a flush.  Thus
>> MSG_MORE would always queue up data until MSG_MORE is 0, at which point
>> flush should happen.
> 
> With Nagle disabled. If MSG_MORE was clear on the previous send then there
> will normally be nothing queued (would have to be flow control limited).
> 
>>
>>> I'd prefer that, and with a configurable timeout.
>>> But I can implement the timeout in the 'application'.
>>>
>>> Given the way Nagle is implemented in sctp, I could keep flipping
>>> it on and off - but that probably has undocumented behaviour
>>> (ie it might suddenly change).
>>
>> With the above MSG_MORE, I think you can just turn off nagle once and
>> use MSG_MORE and when you drain your application queue, clear MSG_MORE
>> on the last write.
> 
> That is what I've done.
> When my test application sends 100 messages through M3UA I now see a
> small number of ethernet packets - rather than 100.
> The whole thing does rather rely on the lengths of various code
> paths in order to get multiple messages queued at the point that
> sendmsg() is finally called.
> 
> I'm not worried about sending 2 packets at the start of a burst of data.
> It seems safer than not sending data because the application send
> a single block with MSG_MORE set.
> 
> I do need to do some testing with simulated network delays.
> Someone posted how to set that up earlier today.

http://www.linuxfoundation.org/collaborate/workgroups/networking/netem

-vlad

> 
> I noticed that the TCP code is documented to eventually send data after 200ms.
> It would be better if that interval were settable per-socket.
> I'd set it to 1ms (or next tick).
> 
> 	David
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-06-19 15:36 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-17 14:17 SCTP data chunk bundling when SCTP_NODELAY is set David Laight
2014-06-17 15:45 ` Vlad Yasevich
2014-06-17 16:07   ` David Laight
2014-06-17 16:49     ` Vlad Yasevich
2014-06-18 13:47       ` David Laight
2014-06-18 16:38         ` David Laight
2014-06-19 13:27           ` Vlad Yasevich
2014-06-19 13:45             ` David Laight
2014-06-19 14:05               ` Eric Dumazet
2014-06-19 15:36               ` Vlad Yasevich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).