Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP

From: Eddie Kohler <kohler@cs.ucla.edu>
To: dccp@vger.kernel.org
Subject: Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
Date: Thu, 08 Feb 2007 00:59:52 +0000
Message-ID: <45CA7608.7060702@cs.ucla.edu>
In-Reply-To: <200701101021.38920@strip-the-willow>

I have some minor thoughts relating to this.

- In what units is t_nom kept?  I would hope microseconds at least, not 
milliseconds.  You say "dccps_xmit_timer is reset to expire in t_now + rc 
milliseconds"; I assume you mean that the value t_now + rc is cast to 
milliseconds.  Clearly high rates will require that t_nom be kept in 
microseconds at least.

I wonder because you say:

>  -> assume that t_ipi is less than 1 millisecond, then in effect all packets are
>     sent immediately; hence we have a _continuous_ burst of packets

but this does not follow.  If t_ipi<1ms, but is non-zero, then at most 1ms 
worth of packets can be sent in a burst -- not continuous (assuming it takes 
way less than 1ms to send a packet).
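
To put a number on that bound, here is a minimal sketch in C.  It assumes 
the sender wakes up once per tick and releases every packet whose nominal 
send time has passed; the helper name and the tick_us parameter are made up 
for illustration and are not part of the CCID 3 code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: upper bound on the number of packets released in
 * one burst when the sender wakes up once every tick_us microseconds and
 * nominal send times are spaced t_ipi_us microseconds apart. */
static uint32_t max_burst_packets(uint32_t t_ipi_us, uint32_t tick_us)
{
	assert(t_ipi_us > 0);
	/* ceil(tick_us / t_ipi_us): packets that fall due within one tick */
	return (tick_us + t_ipi_us - 1) / t_ipi_us;
}
```

With t_ipi = 250us and a 1ms tick this gives 4 packets per burst, which is 
bounded rather than continuous.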

- ccid3_hc_tx_send_packet should return a value that is measured in 
MICROSECONDS not milliseconds.  It also sounds like there is a rounding error 
in step 3a); it should probably return (delay + 500)/1000 at least.
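
For illustration, the two conversions side by side, as a standalone sketch 
(the function names are mine, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Truncating conversion, as in step 3a of the quoted analysis. */
static uint32_t usecs_to_msecs_trunc(uint32_t delay_us)
{
	return delay_us / 1000;
}

/* Round-to-nearest conversion, i.e. (delay + 500)/1000. */
static uint32_t usecs_to_msecs_round(uint32_t delay_us)
{
	assert(delay_us <= UINT32_MAX - 500);	/* avoid wrap-around */
	return (delay_us + 500) / 1000;
}
```

A delay of 999us truncates to 0ms but rounds to 1ms.  Delays below 500us 
still come out as 0 either way, so rounding alone does not remove the 
sub-millisecond case.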

- What if the delay till the next packet is >0 but less than one jiffy (1/HZ 
seconds)?  It sounds like that case is causing the problem.  One answer: Do 
not schedule a TIMER; instead, leave a kernel thread or bottom half scheduled, 
so that the next time the kernel runs, it will poll DCCP, even if that is 
less than one jiffy from now.
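
Roughly, the decision I have in mind looks like the sketch below.  This is 
plain C for illustration, not kernel code: the enum, the function name and 
the idea of passing HZ as a parameter are all assumptions of mine.

```c
#include <assert.h>
#include <stdint.h>

enum tx_action { TX_SEND_NOW, TX_POLL_SOON, TX_ARM_TIMER };

/* Hypothetical helper: pick a scheduling action for a packet that is due
 * in delay_us microseconds, given hz timer ticks per second. */
static enum tx_action choose_tx_action(uint32_t delay_us, uint32_t hz)
{
	assert(hz > 0 && hz <= 1000000);
	uint32_t tick_us = 1000000 / hz;	/* length of one timer tick */

	if (delay_us == 0)
		return TX_SEND_NOW;
	if (delay_us < tick_us)
		return TX_POLL_SOON;	/* too short for a timer: poll instead */
	return TX_ARM_TIMER;
}
```

With hz = 250 a tick is 4000us, so a 500us delay would be polled for rather 
than handed to a timer that cannot fire that early.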

Make sense?
Eddie


> Below I throw in my 2 cents of why I think there is a critical speed X_crit. Maybe you can
> help me dispel it or point out other possibilities which we can - step-by-step - eliminate,
> until the cause becomes fully clear.
> 
> Firstly, all packet scheduling is based on schedule_timeout().
> 
> The return code rc of ccid_hc_tx_send_packet (wrapper around ccid3_hc_tx_send_packet) is used
> to decide whether to 
> 
>  (a) send the packet immediately or
>  (b) sleep with HZ granularity before retrying
> 
> I am assuming that there is no loss on the link and no backlog of packets which couldn't be
> scheduled so far (i.e. if t_nom < t_now then t_now - t_nom < t_ipi). I assume further that
> there is a constant stream of packets, fed into the TX queue by continuously calling
> dccp_sendmsg. This is also the background of the experiments/graphs. 
> 
> Here is the analysis, starting with ccid3_hc_tx_send_packet:
> 
>  1) dccp_sendmsg calls dccp_write_xmit(sk, 0)
> 
>  2) dccp_write_xmit calls ccid_hc_tx_send_packet, a wrapper around ccid3_hc_tx_send_packet
> 
>  3) ccid3_hc_tx_send_packet gets the current time in usecs and computes  delay = t_nom - t_now
> 
>      (a) if delay >= delta = min(t_ipi/2, t_gran/2) then it returns delay/1000
>      (b) otherwise it returns 0
> 
>  4) back in dccp_write_xmit, 
>      * if rc=0 then the packet is sent immediately;  otherwise (since block=0), 
>      * dccps_xmit_timer is reset to expire in t_now + rc  milliseconds (sk_reset_timer)
>          -- in this case dccp_write_xmit exits now and
> 	 -- when the write timer expires, dccp_write_xmit_timer is called, which again
> 	    calls dccp_write_xmit(sk, 0)
> 	 -- this means going back to (3), now delay < delta, the function returns 0
> 	    and the packet is sent immediately
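
Step 3 as described can be modelled in a few lines of standalone C.  This 
is a sketch of the description above, not the actual 
ccid3_hc_tx_send_packet; the function name and parameter list are 
assumptions, with all times in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Model of step 3: returns 0 to send immediately, otherwise the delay
 * in milliseconds that the caller should wait before retrying. */
static uint32_t send_packet_rc(int64_t t_nom_us, int64_t t_now_us,
			       uint32_t t_ipi_us, uint32_t t_gran_us)
{
	assert(t_ipi_us > 0);
	int64_t delay = t_nom_us - t_now_us;
	uint32_t half_ipi = t_ipi_us / 2, half_gran = t_gran_us / 2;
	int64_t delta = half_ipi < half_gran ? half_ipi : half_gran;

	if (delay >= delta)
		return (uint32_t)(delay / 1000);  /* truncates: < 1000us gives 0 */
	return 0;
}
```

Note that case (a) also yields 0 whenever delta <= delay < 1000us, so the 
caller cannot tell "send now" apart from "wait a little less than 1ms".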
> 
> To find where the problematic case is, assume that the sender is in slow start and
> doubles X each RTT. As X increases, t_ipi decreases so that there is a point where
> t_ipi < 1000 usec. 
>  
>  -> all differences delay = t_nom - t_now which are less than 1000 result in 
>     delay / 1000 = 0 due to integer division
>  -> hence all packets which are late up to 1 millisecond are sent immediately
>  -> assume that t_ipi is less than 1 millisecond, then in effect all packets are
>     sent immediately; hence we have a _continuous_ burst of packets
>  -> schedule_timeout() really only has a granularity of HZ:
>      * if HZ=1000,   msecs_to_jiffies(m) returns m
>      * if HZ < 1000, msecs_to_jiffies(m) returns (m * HZ + 999)/1000
>           => hence m=1 millisecond will give a result of 1 jiffy
> 	  => but one jiffy lasts 1/HZ seconds, i.e. more than 1 millisecond
> 	      when HZ < 1000, so the timer expires with a granularity of 1/HZ
>           => that means if X is higher than X_crit, t_ipi will always be such that
> 	      the timer expires at a time which is too late, so that packets are all
> 	      sent in immediate bursts or in scheduled bursts, but there is no longer
> 	      any real scheduling
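
The rounding described in this list can be checked with a small standalone 
model.  It follows the formula quoted above rather than the kernel's own 
msecs_to_jiffies source, and both function names are made up for this 
sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the conversion quoted above: milliseconds to jiffies,
 * rounding up, for hz timer ticks per second (hz <= 1000). */
static uint32_t msecs_to_jiffies_model(uint32_t m, uint32_t hz)
{
	assert(hz > 0 && hz <= 1000);
	if (hz == 1000)
		return m;
	return (uint32_t)(((uint64_t)m * hz + 999) / 1000);
}

/* Wall-clock length of j jiffies in microseconds. */
static uint32_t jiffies_to_usecs_model(uint32_t j, uint32_t hz)
{
	assert(hz > 0);
	return j * (1000000 / hz);
}
```

With hz = 250, a request for 1ms becomes 1 jiffy, which lasts 4000us; with 
hz = 100 it lasts 10000us.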
> 
> The other points which I am not entirely sure about yet are
>  * compression of packet spacing due to using TX output queues
>  * interactions with the traffic control subsystem
>  * times when the main socket is locked
> 
> - Gerrit
> 
> |  > I have a snapshot which illustrates this state: 
> |  > 
> |  >  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/no_tx_locking/transmit_rate.png
> |  >   
> |  > The oscillating behaviour is well visible. In contrast, I am sure that you would agree that the
> |  > desirable state is the following:
> |  > 
> |  >  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/with_tx_locking/transmit_rate.png
> |  > 
> |  > These snapshots were originally taken to compare the performance with and without serializing access to
> |  > TX history. I didn't submit the patch since, at times, I would get the same chaotic behaviour with TX locking.
> |  > 
> |  > Other people on this list have reported that iperf performance is unpredictable with CCID 3. 
> |  > 
> |  > The point is that, without putting in some kind of control, we have a system which gets into a state of
> |  > chaos as soon as the maximum controllable speed X_crit is reached. When it is past that point, there is
> |  > no longer a notion of predictable performance or correct average rate: what happens is then outside the
> |  > control of the CCID 3 module, performance is then a matter of coincidence.
> |  > 
> |  > I don't think that a kernel maintainer will gladly support a module which is liable to reaching such a
> |  > chaotic state.
> |  >  
> |  >   
> |  > |  > I have done a back-of-the-envelope calculation below for different sizes of s; 9kbyte
> |  > |  > I think is the maximum size of an Ethernet jumbo frame.
> |  > |  > 
> |  > |  >    -----------+---------+---------+---------+---------+-------+---------+-------+
> |  > |  >             s | 32      | 100     | 250     | 500     | 1000  | 1500    | 9000  |
> |  > |  >    -----------+---------+---------+---------+---------+-------+---------+-------+
> |  > |  >     X_critical| 32kbps  | 100kbps | 250kbps | 500kbps | 1mbps | 1.5mbps | 9mbps |
> |  > |  >    -----------+---------+---------+---------+---------+-------+---------+-------+ 
> |  > |  > 
> |  > |  > That means we can only expect predictable performance up to 9mbps ?????
> |  > |  
> |  > |  Same comment.  I imagine performance will be predictable at speeds FAR 
> |  > |  ABOVE 9mbps, DESPITE the sub-RTT bursts.  Predictable performance means 
> |  > |  about the same average rate from one RTT to the next.
> |  > I think that, without finer timer resolution, we need to put in some kind of throttle to avoid
> |  > entering the region where speed can no longer be controlled.
> |  > 
> |  >   
> |  > |  > I am dumbstruck - it means that the whole endeavour to try and use Gigabit cards (or
> |  > |  > even 100 Mbit ethernet cards) is futile and we should be using the old 10 Mbit cards???
> |  > |  
> |  > |  Remember that TCP is ENTIRELY based on bursts!!!!!  No rate control at 
> |  > |  all.  And it still gets predictable performance at high rates.
> |  > |  
> |  > Yes, but ..... it uses an entirely different mechanism and is not rate-based.
> |  
> |
