All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vlad Yasevich <vladislav.yasevich@hp.com>
To: linux-sctp@vger.kernel.org
Subject: Re: [PATCH] Fix piggybacked ACKs
Date: Tue, 04 Aug 2009 14:50:18 +0000	[thread overview]
Message-ID: <4A784AAA.4000009@hp.com> (raw)
In-Reply-To: <20090729160557.GC29475@nortel.com>

Doug Graham wrote:
> Wei Yongjun wrote:
>> Doug Graham wrote:
>>  
>>> Oops.  Sent the last one in HTML,  so the mailing list rejected it.
>>> Damned GUI email
>>> clients!
>>>
>>> Wei Yongjun wrote:
>>>    
>>>> Doug Graham wrote:
>>>>  
>>>>      
>>>>> On Fri, Jul 31, 2009 at 12:21:15PM +0800, Wei Yongjun wrote:
>>>>>             
>>>>>> Doug Graham wrote:
>>>>>>                   
>>>>>>>  13 2.002632    10.0.0.15   10.0.0.11   DATA (1452 bytes data)  14
>>>>>>> 2.203092    10.0.0.11   10.0.0.15   SACK  15 2.203153  
>>>>>>> 10.0.0.15   10.0.0.11   DATA (2 bytes data)
>>>>>>>  16 2.203427    10.0.0.11   10.0.0.15   SACK  17 2.203808  
>>>>>>> 10.0.0.11   10.0.0.15   DATA (1452 bytes data)
>>>>>>>  18 2.403524    10.0.0.15   10.0.0.11   SACK  19 2.403686  
>>>>>>> 10.0.0.11   10.0.0.15   DATA (2 bytes data)
>>>>>>>  20 2.603285    10.0.0.15   10.0.0.11   SACK
>>>>>>> What bothers me about this is that Nagle seems to be introducing a
>>>>>>> delay
>>>>>>> here.  The first DATA packets in both directions are MTU-sized
>>>>>>> packets,
>>>>>>> yet both the Linux client and the BSD server wait 200ms until they
>>>>>>> get
>>>>>>> the SACK to the first fragment before sending the second fragment.
>>>>>>> The server can't send its reply until it gets both fragments, and
>>>>>>> the
>>>>>>> client can't reassemble the reply until it gets both fragments, so
>>>>>>> from
>>>>>>> the application's point of view, the reply doesn't arrive until
>>>>>>> 400ms
>>>>>>> after the request is sent.  This could probably be fixed by
>>>>>>> disabling
>>>>>>> Nagle with SCTP_NODELAY, but that shouldn't be required.  Nagle is
>>>>>>> only
>>>>>>> supposed to prevent multiple outstanding *small* packets.
>>>>>>>                             
>>>>>> I think you hit the point which Nagle's algorithm should be not used.
>>>>>>
>>>>>> Can you try the following patch?
>>>>>>
>>>>>> [PATCH] sctp: do not used Nagle algorithm while fragmented data is
>>>>>> transmitted
>>>>>>
>>>>>> If fragmented data is sent, the Nagle's algorithm should not be
>>>>>> used. In special case, if only one large packet is sent, the delay
>>>>>> send of fragmented data will cause the receiver wait for more
>>>>>> fragmented data to reassembe them and not send SACK, but the sender
>>>>>> still wait for SACK before send the last fragment.
>>>>>>                     
>>>>> [patch deleted]
>>>>>
>>>>> This patch seems to work quite well, but I think disabling Nagle
>>>>> completely for large messages is not quite the right thing to do.
>>>>> There's a draft-minshall-nagle-01.txt floating around that describes a
>>>>> modified Nagle algorithm for TCP.  It appears to have been implemented
>>>>> in Linux TCP even though the draft has expired.  The modified
>>>>> algorithm
>>>>> is how I thought Nagle had always worked to begin with.  From the
>>>>> draft:
>>>>>
>>>>>         "If a TCP has less than a full-sized packet to transmit,
>>>>>         and if any previously transmitted less than full-sized
>>>>>         packet has not yet been acknowledged, do not transmit
>>>>>         a packet."
>>>>>
>>>>> so in the case of sending a fragmented SCTP message, all but the last
>>>>> fragment will be full-sized and will be sent without delay.  The last
>>>>> fragment will usually not be full-sized, but it too will be sent
>>>>> without
>>>>> delay because there are no outstanding non-full-sized packets.
>>>>>
>>>>> The difference between this and your method is that yours would
>>>>> allow many small fragments of big messages to be outstanding, whereas
>>>>> this one would only allow the first big message to be sent in its
>>>>> entirety, followed by the full-sized fragments of the next big
>>>>> message.  When it came time to send the second small fragment,
>>>>> Nagle would force it to wait for an ACK for the first small fragment.
>>>>> I'm not convinced that the difference is all that important,
>>>>> but who knows.
>>>>>
>>>>> Here's my attempt at implementing the modified Nagle algorithm
>>>>> described
>>>>> in draft-minshall-nagle-01.txt.  It should be applied instead of your
>>>>> patch, not on top of it.  If (q->outstanding_bytes % asoc->frag_point)
>>>>> is zero, no delay is introduced.  The assumption is that this means
>>>>> that
>>>>> all outstanding packets (if any) are full-sized.
>>>>>
>>>>> Signed-off-by: Doug Graham <dgraham@nortel.com>
>>>>>
>>>>> ---
>>>>> --- linux-2.6.29/net/sctp/output.c    2009/08/02 00:47:44    1.3
>>>>> +++ linux-2.6.29/net/sctp/output.c    2009/08/02 00:51:18
>>>>> @@ -717,7 +717,8 @@ static sctp_xmit_t sctp_packet_append_da
>>>>>       * unacknowledged.
>>>>>       */
>>>>>      if (!sp->nodelay && sctp_packet_empty(packet) &&
>>>>> -        q->outstanding_bytes && sctp_state(asoc, ESTABLISHED)) {
>>>>> +        (q->outstanding_bytes % asoc->frag_point) != 0 &&
>>>>> +        sctp_state(asoc, ESTABLISHED)) {
>>>>>          unsigned len = datasize + q->out_qlen;
>>>>>  
>>>>>          /* Check whether this chunk and all the rest of pending
>>>>>               
>>>> Seem good! But it may be broken the small packet transmit which can be
>>>> used Nagle algorithm.
>>>> Such as this:
>>>>
>>>> Endpoint A                Endpint B
>>>>           <-------------  DATA (size\x1452/2) delay send
>>>>           <-------------  DATA (size\x1452/2) send immediately
>>>>           <-------------  DATA (size\x1452/2) send immediately ** broken
>>>>           <-------------  DATA (size\x1452/2) delay send
>>>>           <-------------  DATA (size\x1452/2) send immediately
>>>>           <-------------  DATA (size\x1452/2) send immediately ** broken
>>>>
>>>>
>>>> Can you try this one?
>>>>
>>>>
>>>>         
>>> I would, except I don't understand what you're getting at.  Does this
>>> mean to send a total of
>>> 6 1454 byte messages from B to A?  If so, why would the first one be
>>> delayed?
>>>     
>>
>> Oh, no, six 726 bytes(1452/2) messages, may be the 1st and 2nd are
>> bundled in one packet,
>> the 3rd is a single packet, the 4th, 5th are bundled, the 6th is single.
>> I have no test it.
>>
>>   
> Ah, so that / meant *division*!  I thought that was your notation
> meaning that the
> packets were fragmented into a 1452 byte chunk and a 2 byte chunk!
> Makes more sense now :-)
> 
> I admit that I didn't study too closely exactly what
> q->outstanding_bytes represents.  I assumed
> it meant the number of bytes that had been sent on the wire, but not yet
> acknowledged.
> Any bytes that were delayed because of Nagle would not be counted in
> outstanding_bytes
> (I assume).  So the first send of 726 would get sent immediately and
> counted in
> outstanding_bytes.  The second one would get delayed by Nagle and not
> counted
> in outstanding_bytes.  All the later ones would also get delayed by
> Nagle because
> outstanding_bytes is still 726.
> 
> I do think that using outstanding_bytes the way I did is probably an
> ugly kludge, and
> there's hopefully a better way.  But the right way will probably involve
> adding
> some more state to each association (the snd.sml variable mentioned in
> the minshall
> draft at the very least).  I'm not sure that using asoc->frag_point the
> way I did is correct
> either, because I think the frag_point can change during the lifetime of
> an association.

Using division in such a hot path is a non-starter to begin with, so we
definitely need to find a better way.

Using frag_point is not the right way to do it either since it's effected by
MTU and user API.

I think we can add something to sctp_outq structure to properly track this.

-vlad


> 
> --Doug
> 


  parent reply	other threads:[~2009-08-04 14:50 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-07-29 16:05 [PATCH] Fix piggybacked ACKs Doug Graham
2009-07-30  6:48 ` Wei Yongjun
2009-07-30  9:51 ` Wei Yongjun
2009-07-30 16:49 ` Doug Graham
2009-07-30 17:05 ` Vlad Yasevich
2009-07-30 21:24 ` Vlad Yasevich
2009-07-30 23:40 ` Doug Graham
2009-07-31  0:53 ` Wei Yongjun
2009-07-31  1:17 ` Doug Graham
2009-07-31  1:43 ` Doug Graham
2009-07-31  4:21 ` Wei Yongjun
2009-07-31  7:30 ` Michael Tüxen
2009-07-31  7:34 ` Michael Tüxen
2009-07-31 12:59 ` Doug Graham
2009-07-31 13:11 ` Doug Graham
2009-07-31 13:39 ` Doug Graham
2009-07-31 14:18 ` Vlad Yasevich
2009-08-02  2:03 ` Doug Graham
2009-08-03  2:00 ` Wei Yongjun
2009-08-03  2:15 ` Wei Yongjun
2009-08-03  3:32 ` Wei Yongjun
2009-08-04  3:00 ` Doug Graham
2009-08-04  3:03 ` Wei Yongjun
2009-08-04  3:28 ` Doug Graham
2009-08-04  3:44 ` Doug Graham
2009-08-04  3:57 ` Doug Graham
2009-08-04 14:50 ` Vlad Yasevich [this message]
2009-08-04 17:05 ` Doug Graham
2009-08-04 17:14 ` Vlad Yasevich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A784AAA.4000009@hp.com \
    --to=vladislav.yasevich@hp.com \
    --cc=linux-sctp@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.