Re: [PATCH 1/1] DCCP: Fix up t_nom

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
@ 2007-01-10 10:21 Gerrit Renker
  2007-01-10 19:40 ` Ian McDonald
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: Gerrit Renker @ 2007-01-10 10:21 UTC (permalink / raw)
  To: dccp

		Packet scheduling tardiness problem
		-----------------------------------

I would like to continue discussion on your patch; I hope you are not dismayed
about the response: this concerned only the code changes as such. I think that
there is a valid point which you were trying to resolve.

I have therefore consulted with colleagues and tried to find out why successive
packets might be too late. This resulted in the following.

(A) The granularity of the rate-based packet scheduling is too coarse

    We are resolving t_ipi with microsecond granularity but the timing for the
    delay between packets uses millisecond granularity (schedule_timeout). This
    means we can only generate inter-packet delays in the range of 1 millisecond
    up to 64 seconds (the largest possible inter-packet interval, which
    corresponds to 1 byte per 64 seconds).

    As a result, our range of speeds that can be influenced is:

	    1/64 byte per second  ....  1000 bytes per second

    In all other cases, the delay will be 0 (due to integer division) and hence
    their packet spacing solely depends on how fast the hardware can cope.
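
A minimal sketch of the truncation described above, assuming the timer delay is
obtained by dividing t_ipi (in microseconds) by 1000. The helper name is
hypothetical and this is not the actual CCID 3 code:

```c
#include <stdint.h>

/* Hypothetical helper: convert an inter-packet interval given in
 * microseconds into a whole-millisecond timer delay. */
uint32_t delay_ms_from_t_ipi(uint32_t t_ipi_us)
{
	/* Integer division truncates, so any t_ipi below 1000 us yields 0. */
	return t_ipi_us / 1000;
}
```

With this conversion a t_ipi of 60 microseconds gives a delay of 0, so such a
packet would be handed to the hardware without any spacing at all.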

    In essence, this is like a car whose accelerator works really well in the
    range 1 meter/hour up to 2 miles per hour, and for everything else it tries
    to use top speed of 120 mph. 

    Therefore I wonder if there is some kind of `micro/nanosleep' which we can use?
    Did some grepping and inevitably landed in kernel/hrtimers.c - any advice on
    how to best deploy these?
    
    On healthy links the inter-packet times are often in the range of multiples of
    10 microseconds (60 microseconds is frequent). 


(B) Fixing send time for packets which are too late
    
    You were mentioning bursts of packets which appear to be too late. I consulted
    with a colleague of how to fix this: the solution seems much more complicated
    than the current infrastructure supports.
    Suppose the TX queue length is n and the packet at the head of the queue is too
    late. Then one would need to recompute the sending times for each packet in the
    TX queue by taking into account the tardiness (it would not be sufficient to
    simply drain the queue). It seems that we would need to implement a type of
    credit-based system (e.g. like a Token Bucket Filter); in effect a kind of Qdisc
    on layer 4.
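
To make the credit idea concrete, here is a rough token-bucket sketch. It is an
illustration only, not a proposal for the actual code: the struct, the function
names and the burst cap are all assumptions, times are in microseconds, and
overflow of the intermediate product is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layer-4 credit state; all names are illustrative only. */
struct tx_credit {
	uint64_t rate;		/* allowed sending rate X in bytes per second */
	uint64_t burst;		/* upper bound on stored credit in bytes */
	uint64_t credit;	/* credit currently available in bytes */
	uint64_t last_us;	/* time up to which credit has been granted */
};

/* Grant credit for the time elapsed since the last refill. */
void tx_credit_refill(struct tx_credit *c, uint64_t now_us)
{
	uint64_t earned;

	assert(c->rate > 0 && now_us >= c->last_us);
	earned = (now_us - c->last_us) * c->rate / 1000000;
	c->credit += earned;
	if (c->credit >= c->burst) {
		/* Bucket full: credit beyond the cap is discarded. */
		c->credit = c->burst;
		c->last_us = now_us;
	} else {
		/* Advance only by the time that was converted into whole
		 * bytes, so fractional credit is not lost. */
		c->last_us += earned * 1000000 / c->rate;
	}
}

/* Return 1 and consume credit if a packet of len bytes may be sent now. */
int tx_credit_try_send(struct tx_credit *c, uint64_t now_us, uint64_t len)
{
	tx_credit_refill(c, now_us);
	if (c->credit < len)
		return 0;
	c->credit -= len;
	return 1;
}
```

In this sketch a late wake-up does not require recomputing per-packet send
times: the tardiness shows up as accumulated credit, and the burst cap bounds
how many queued packets may be drained at once.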

When using only a single t_nom for the next-packet-to-send as we are doing at the moment,
resetting t_nom to t_now when packets are late seems the most sensible thing to do. 

So what I think we should do is fix the packet scheduling algorithm to use finer-grained
delay. Since in practice no one really is interested in speeds of 1kbyte/sec and below,
it in effect means that we are not controlling packet spacing at all.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
@ 2007-01-10 19:40 ` Ian McDonald
  2007-01-12 10:39 ` Gerrit Renker
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Ian McDonald @ 2007-01-10 19:40 UTC (permalink / raw)
  To: dccp

Gerrit,

Thank you for this summary. I think that this discussion shows the
fundamental difference in our patches and therefore the fundamental
issue we need to resolve. It would be helpful if Eddie or others
threw in their two cents' worth.

On 10/01/07, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
>                 Packet scheduling tardiness problem
>                 -----------------------------------
>
> I would like to continue discussion on your patch; I hope you are not dismayed
> about the response: this concerned only the code changes as such. I think that
> there is a valid point which you were trying to resolve.
>
I'm not dismayed at all by the response. These issues sometimes take a
lot of discussion before they go into the kernel. Look at hrtimers,
kevents, netchannels, reiser4 etc. and you will see that we are at a
low number of revisions! It is far better to discuss so that we all
understand what is happening and can get the best possible solution.

> I have therefore consulted with colleagues and tried to find out why successive
> packets might be too late. This resulted in the following.
>
> (A) The granularity of the rate-based packet scheduling is too coarse
>
>     We are resolving t_ipi with microsecond granularity but the timing for the
>     delay between packets uses millisecond granularity (schedule_timeout). This
>     means we can only generate inter-packet delays in the range of 1 millisecond
>     up to 64 seconds (the largest possible inter-packet interval, which
>     corresponds to 1 byte per 64 seconds).
>
>     As a result, our range of speeds that can be influenced is:
>
>             1/64 byte per second  ....  1000 bytes per second
>
>     In all other cases, the delay will be 0 (due to integer division) and hence
>     their packet spacing solely depends on how fast the hardware can cope.
>
>     In essence, this is like a car whose accelerator works really well in the
>     range 1 meter/hour up to 2 miles per hour, and for everything else it tries
>     to use top speed of 120 mph.
>

This is correct provided you are working on the assumption of part B.
I think that this is the whole crux of the problem and why we haven't
got an agreed patch between us yet.

Rather than espouse my views directly I'll try and comment on the RFC
which is the canonical definition of what we should be doing. I've put
in the whole 4.6 section of RFC 3448 and commented inline.

4.6.  Scheduling of Packet Transmissions

   As TFRC is rate-based, and as operating systems typically cannot
   schedule events precisely, it is necessary to be opportunistic about
   sending data packets so that the correct average rate is maintained
   despite the course-grain or irregular scheduling of the operating
   system.

Note here that the requirement is not to maintain precise timing. The
requirement is to maintain the correct average rate. This was the
light that went on in my head when I was reviewing both of our
previous iterations. I believe we were coming from the wrong starting
point - we were trying to improve scheduling, not maintain the correct
rate.

   Thus a typical sending loop will calculate the correct
   inter-packet interval, t_ipi, as follows:

        t_ipi = s/X_inst;

Note here that we must calculate t_ipi by this definition. The RFC doesn't
say that this MUST happen in the sending loop (capitals deliberate, as these
have special meaning), so my approach of doing it in the sending loop is fine,
as is yours, provided we pick up changes in both s and X_inst. From
memory you detect changes in X_inst and recalculate t_ipi, but not changes
in s (although it could be the other way around).

   When a sender first starts sending at time t_0, it calculates t_ipi,
   and calculates a nominal send time t_1 = t_0 + t_ipi for packet 1.
   When the application becomes idle, it checks the current time, t_now,
   and then requests re-scheduling after (t_ipi - (t_now - t_0))
   seconds.  When the application is re-scheduled, it checks the current
   time, t_now, again.  If (t_now > t_1 - delta) then packet 1 is sent.

Note that initially we set t_ipi to 1 second. This could be set to a
better value based on connection setup, as per your earlier discussion
with Eddie, but I haven't implemented this yet. As it stands my
code is a hack: I remove the 1 second and add the initial RTT once
we obtain it. This ugliness can be removed when we make the code
base conform to the RFC intent (it is not in the RFC yet, but Eddie said he
would propose it for a revision).

   Now a new t_ipi may be calculated, and used to calculate a nominal
   send time t_2 for packet 2: t2 = t_1 + t_ipi.  The process then
   repeats, with each successive packet's send time being calculated
   from the nominal send time of the previous packet.

   In some cases, when the nominal send time, t_i, of the next packet is
   calculated, it may already be the case that t_now > t_i - delta.  In
   such a case the packet should be sent immediately.  Thus if the
   operating system has coarse timer granularity and the transmit rate
   is high, then TFRC may send short bursts of several packets separated
   by intervals of the OS timer granularity.

Note a couple of phrases here "the packet should be sent immediately",
"TFRC may send short bursts of several packets separated by intervals
of the OS timer granularity". This is why I don't think we should ever
reset t_nom to the current time. Doing this stops us achieving the
average rate required. By resetting t_nom we are in effect treating
X as the maximum instantaneous rate rather than the average rate we
are trying to achieve.
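
The difference between the two approaches can be illustrated with a toy
simulation. This is only a sketch under stated assumptions, not the CCID 3
code: the sender is woken exactly once per timer tick of t_gran, it always has
data queued, and the function name and parameters are made up for
illustration. All times are in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Count the packets sent within duration_us when the sender only runs
 * once every t_gran_us. If reset_when_late is non-zero, a late t_nom is
 * first reset to t_now before the next interval is added. */
unsigned int sim_packets_sent(uint64_t t_ipi_us, uint64_t t_gran_us,
			      uint64_t duration_us, int reset_when_late)
{
	uint64_t half_ipi = t_ipi_us / 2, half_gran = t_gran_us / 2;
	/* delta = min(t_ipi/2, t_gran/2), as in RFC 3448, section 4.6 */
	uint64_t delta = half_ipi < half_gran ? half_ipi : half_gran;
	uint64_t t_nom = t_ipi_us;	/* t_1 = t_0 + t_ipi with t_0 = 0 */
	unsigned int sent = 0;

	assert(t_ipi_us > 0 && t_gran_us > 0);

	for (uint64_t t_now = t_gran_us; t_now <= duration_us; t_now += t_gran_us) {
		/* Send while t_now > t_nom - delta (rearranged to avoid
		 * unsigned underflow). */
		while (t_now + delta > t_nom) {
			sent++;
			if (reset_when_late && t_now > t_nom)
				t_nom = t_now;
			t_nom += t_ipi_us;
		}
	}
	return sent;
}
```

With t_ipi = 100 and t_gran = 1000 over one second, the catch-up variant sends
10000 packets (bursts of 10 per tick), which matches the target rate, while the
reset variant sends only 1000, i.e. one packet per tick. Once t_ipi is at least
t_gran, both variants behave the same in this model.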

   The parameter delta is to allow a degree of flexibility in the send
   time of a packet.  If the operating system has a scheduling timer
   granularity of t_gran seconds, then delta would typically be set to:

        delta = min(t_ipi/2, t_gran/2);

   t_gran is 10ms on many Unix systems.  If t_gran is not known, a value
   of 10ms can be safely assumed.

We do this properly.

>     Therefore I wonder if there is some kind of `micro/nanosleep' which we can use?
>     Did some grepping and inevitably landed in kernel/hrtimers.c - any advice on
>     how to best deploy these?
>
>     On healthy links the inter-packet times are often in the range of multiples of
>     10 microseconds (60 microseconds is frequent).
>
I don't believe we need to do this from a practical point of view as
the RFC says we aim for average packet rate rather than precise
timing. It is interesting from a research point of view whether we
achieve better results from smoother packet flow given precise timing,
and I have seen published literature on this effect. It is not
necessary for RFC compliance though.

However if you want to implement for research that is fine provided
that we satisfy a couple of criteria:
- negligible performance impact. I would have thought that hrtimer
would add to overhead a reasonable amount. Is the point of hrtimers to
achieve a large number of timers per second with low overhead or does
it allow you to precisely schedule an event at a time you precisely
require? By my calculations, using CCID3 on a 1 Gbit/s link with 1
hrtimer per packet would mean around 90,000 timer fires per second vs
a maximum of 1000 timeslices if we are using standard setup and HZ of
1000. Interrupting other operations 90,000 times per second is surely
not good??
- cross platform support. I'm not sure if platforms like ARM support
hrtimers. I want to see CCID3 used across many platforms and small
embedded devices based on ARM would be a key market (your cellphone
for video calls for example).

Based on this I think hrtimer support is not needed but if it is done
then it should be an option not a requirement.
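
The order of magnitude of the timer load can be checked with a small
calculation. This is a sketch only: the function name is hypothetical, and the
packet size of 1400 bytes used in the note below is an assumption, since the
packet size behind the figure of around 90,000 is not stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Packets per second needed to fill a link, and hence timer expiries per
 * second if one timer fires per packet. */
uint64_t packets_per_second(uint64_t link_bits_per_sec, uint32_t packet_bytes)
{
	assert(packet_bytes > 0);
	return link_bits_per_sec / (8ULL * packet_bytes);
}
```

For a 1 Gbit/s link and 1400-byte packets this gives 89285 packets per second;
with 1500-byte packets it gives 83333.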

>
> (B) Fixing send time for packets which are too late
>
>     You were mentioning bursts of packets which appear to be too late. I consulted
>     with a colleague of how to fix this: the solution seems much more complicated
>     than the current infrastructure supports.
>     Suppose the TX queue length is n and the packet at the head of the queue is too
>     late. Then one would need to recompute the sending times for each packet in the
>     TX queue by taking into account the tardiness (it would not be sufficient to
>     simply drain the queue). It seems that we would need to implement a type of
>     credit-based system (e.g. like a Token Bucket Filter); in effect a kind of Qdisc
>     on layer 4.
>
> When using only a single t_nom for the next-packet-to-send as we are doing at the moment,
> resetting t_nom to t_now when packets are late seems the most sensible thing to do.
>
> So what I think we should do is fix the packet scheduling algorithm to use finer-grained
> delay. Since in practice no one really is interested in speeds of 1kbyte/sec and below,
> it in effect means that we are not controlling packet spacing at all.
>
See above.

-- 
Web: http://wand.net.nz/~iam4
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
  2007-01-10 19:40 ` Ian McDonald
@ 2007-01-12 10:39 ` Gerrit Renker
  2007-01-12 12:54 ` Gerrit Renker
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Gerrit Renker @ 2007-01-12 10:39 UTC (permalink / raw)
  To: dccp

|  Summary: I agree with Ian that hrtimer support is not required, and that 
|  bursts are OK.  They are explicitly allowed by the RFC.
I don't have much disagreement with your points. However, the `bursts' issue
can really not be dismissed as unproblematic. Please see other reply.

|  
|  >   When a sender first starts sending at time t_0, it calculates t_ipi,
|  >   and calculates a nominal send time t_1 = t_0 + t_ipi for packet 1.
|  >   When the application becomes idle, it checks the current time, t_now,
|  >   and then requests re-scheduling after (t_ipi - (t_now - t_0))
|  >   seconds.  When the application is re-scheduled, it checks the current
|  >   time, t_now, again.  If (t_now > t_1 - delta) then packet 1 is sent.
|  > 
|  > Note that initially we set t_ipi to 1 second. This could be set to a
|  > better value based on connection setup as per Eddie and your
|  > discussion earlier but I haven't implemented this yet. In this way my
|  > code is a hack that I remove the 1 second and add the initial RTT once
|  > we obtain it. I see this ugliness can be removed when we make the code
|  > base conform to the RFC intent (it is not in RFC yet but Eddie said he
|  > would propose for revision)
|  
|  For what it's worth, it's as close to in the RFC as it can get without a 
|  revision.  The authors of the RFC agree that we meant the initial 
|  Request-Response RTT to be usable as an initial RTT estimate; the 
|  working group agreed; errata has been sent.
So we are RFC-compliant for the moment. One point which remains unresolved is,
the above is about RTTs, but what about the initial sending rate of 1 packet
per second, which implies an initial t_ipi of 1 second?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
  2007-01-10 19:40 ` Ian McDonald
  2007-01-12 10:39 ` Gerrit Renker
@ 2007-01-12 12:54 ` Gerrit Renker
  2007-01-12 16:33 ` Eddie Kohler
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Gerrit Renker @ 2007-01-12 12:54 UTC (permalink / raw)
  To: dccp

Quoting Eddie Kohler:
|  > The problem I see is that
|  >  * scheduling granularity is at best 1ms
|  >  * hence t_gran/2 is at best 500usec
|  >  * and so when t_ipi < 1ms, packets will always be sent in bursts
|  > 
|  > So we have a `critical point' after which the average sending rate is dominated by the
|  > hardware and no longer under the rate-based control of t_ipi.
|  
|  I think there might be a confusion about what "average sending rate" 
|  actually means.  You appear to assume that precise timing control is 
|  needed to obtain a smooth average.  I.e., if there are bursts, then 
|  there is no smooth average.  But this isn't what the RFC means by 
|  average.  The average rate is supposed to be smooth *from one RTT to the 
|  next*.  Sub-RTT burstiness is *explicitly allowed*, although 
|  implementations should try to avoid it when it's easy to do so.
Sorry, I don't think you got my point. The problem is NOT in occasional bursts, but rather
that there is a critical speed X_crit after which the system essentially gets completely out
of control.

If bursts occurred only occasionally, it would be like a car which speeds up for a moment
but then slows down again (and thus keeps the average speed) - I agree that this would not be a problem.
Here however we have that, once X_crit is reached, the system will _oscillate_ between the top
available speed (and this could be hundreds of Mbits per second) and whatever it gets in terms
of feedback from the receiver. After slowing down, it would again climb in slow-start up to 
X_crit, then jump to top speed. Thus it is  like a car which will abruptly switch to top speed
of e.g. 160 mph when trying to accelerate past 2 mph.

I have a snapshot which illustrates this state: 

 http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/no_tx_locking/transmit_rate.png

The oscillating behaviour is well visible. In contrast, I am sure that you would agree that the
desirable state is the following:

 http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/with_tx_locking/transmit_rate.png

These snapshots were originally taken to compare the performance with and without serializing access to
TX history. I didn't submit the patch since, at times, I would get the same chaotic behaviour with TX locking.

Other people on this list have reported that iperf performance is unpredictable with CCID 3. 

The point is that, without putting in some kind of control, we have a system which gets into a state of
chaos as soon as the maximum controllable speed X_crit is reached. When it is past that point, there is
no longer a notion of predictable performance or correct average rate: what happens is then outside the
control of the CCID 3 module, and performance becomes a matter of coincidence.

I don't think that a kernel maintainer will gladly support a module which is liable to reaching such a
chaotic state.

|  > I have done a back-of-the-envelope calculation below for different sizes of s; 9kbyte
|  > I think is the maximum size of an Ethernet jumbo frame.
|  > 
|  >    -----------+---------+---------+---------+---------+-------+---------+-------+
|  >             s | 32      | 100     | 250     | 500     | 1000  | 1500    | 9000  |
|  >    -----------+---------+---------+---------+---------+-------+---------+-------+
|  >     X_critical| 32kbps  | 100kbps | 250kbps | 500kbps | 1mbps | 1.5mbps | 9mbps |
|  >    -----------+---------+---------+---------+---------+-------+---------+-------+ 
|  > 
|  > That means we can only expect predictable performance up to 9mbps ?????
|  
|  Same comment.  I imagine performance will be predictable at speeds FAR 
|  ABOVE 9mbps, DESPITE the sub-RTT bursts.  Predictable performance means 
|  about the same average rate from one RTT to the next.
I think that, without finer timer resolution, we need to put in some kind of throttle to avoid
entering the region where speed can no longer be controlled.
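
The table values follow from one packet of s bytes per smallest timer interval.
A sketch of that calculation is below; the function name is hypothetical, and
reading the table's units as bytes per second (so that s = 32 gives 32 kbyte/s)
is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Highest rate, in bytes per second, at which one packet of s bytes can
 * still be spaced by a timer with granularity t_gran_us microseconds. */
uint64_t x_crit_bytes_per_sec(uint32_t s, uint32_t t_gran_us)
{
	assert(t_gran_us > 0);
	return (uint64_t)s * 1000000 / t_gran_us;
}
```

With t_gran = 1 ms this gives 32000 for s = 32, 1500000 for s = 1500 and
9000000 for s = 9000, matching the columns of the table.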

|  > I am dumbstruck - it means that the whole endeavour to try and use Gigabit cards (or
|  > even 100 Mbit ethernet cards) is futile and we should be using the old 10 Mbit cards???
|  
|  Remember that TCP is ENTIRELY based on bursts!!!!!  No rate control at 
|  all.  And it still gets predictable performance at high rates.
|  
Yes, but ..... it uses an entirely different mechanism and is not rate-based.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (2 preceding siblings ...)
  2007-01-12 12:54 ` Gerrit Renker
@ 2007-01-12 16:33 ` Eddie Kohler
  2007-01-12 16:41 ` Eddie Kohler
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Eddie Kohler @ 2007-01-12 16:33 UTC (permalink / raw)
  To: dccp

> |  For what it's worth, it's as close to in the RFC as it can get without a 
> |  revision.  The authors of the RFC agree that we meant the initial 
> |  Request-Response RTT to be usable as an initial RTT estimate; the 
> |  working group agreed; errata has been sent.
> So we are RFC-compliant for the moment. One point which remains unresolved is,
> the above is about RTTs, but what about the initial sending rate of 1 packet
> per second, which implies an initial t_ipi of 1 second?

"Therefore, in contrast to [RFC3448], the initial CCID 3 sending rate is 
allowed to be at least two packets per RTT, and at most four packets per 
RTT, depending on the packet size. ..." [RFC4342, Sec 5]

Since CCID3 always has an estimate of the round trip time when it sends 
packets, there is no such thing as an initial sending rate of 1 packet 
per second.  The initial sending rate, after the Response arrives (thus 
providing an initial estimate of the round trip time), is 2-4 packets 
per RTT, depending on s.
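
A sketch of how the 2-4 packets per RTT could be computed, assuming the
RFC 3390 initial-window rule min(4*s, max(2*s, 4380 bytes)) that RFC 4342,
Section 5 builds on. The function names are hypothetical and the RTT is taken
in microseconds:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes allowed in the first RTT: min(4*s, max(2*s, 4380)). */
uint32_t initial_window_bytes(uint32_t s)
{
	uint32_t lower = 2 * s > 4380 ? 2 * s : 4380;

	return 4 * s < lower ? 4 * s : lower;
}

/* Initial sending rate in bytes per second for a given initial RTT. */
uint64_t initial_rate_bytes_per_sec(uint32_t s, uint32_t rtt_us)
{
	assert(rtt_us > 0);
	return (uint64_t)initial_window_bytes(s) * 1000000 / rtt_us;
}
```

Under these assumptions s = 1000 gives 4000 bytes (four packets) per RTT,
s = 1460 gives 4380 bytes (three packets), and s = 2500 gives 5000 bytes
(two packets).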

Eddie

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (3 preceding siblings ...)
  2007-01-12 16:33 ` Eddie Kohler
@ 2007-01-12 16:41 ` Eddie Kohler
  2007-01-12 16:58 ` Gerrit Renker
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Eddie Kohler @ 2007-01-12 16:41 UTC (permalink / raw)
  To: dccp

The first figure certainly demonstrates a problem.  However, that 
problem is not inherent in CCID3, it is not inherent in rate-based 
solutions, and high-rate timers probably wouldn't solve it.  CCID3 has 
been tested -- in simulation mind you -- at high rates.  The problem is 
a bug in the Linux implementation.  Ian seems to think he can solve the 
problem with bursts and I am inclined to agree.

Your comments about X_crit are based on your observations, not analysis, 
yes?  If you can provide some reason why CCID3 inherently has an X_crit, 
I'd like to hear it.  "Oscillat[ing] between the top available speed ... 
and whatever it gets in terms of feedback" is not TFRC.  Sounds like a bug.

I agree that kernel maintainers don't want bugs in the kernel.

Anyway, if you can go deeper into the code and determine why you're 
observing this behavior (I assume in the absence of loss, which is even 
weirder), then that might be useful.

Eddie


Gerrit Renker wrote:
> Quoting Eddie Kohler:
> |  > The problem I see is that
> |  >  * scheduling granularity is at best 1ms
> |  >  * hence t_gran/2 is at best 500usec
> |  >  * and so when t_ipi < 1ms, packets will always be sent in bursts
> |  > 
> |  > So we have a `critical point' after which the average sending rate is dominated by the
> |  > hardware and no longer under the rate-based control of t_ipi.
> |  
> |  I think there might be a confusion about what "average sending rate" 
> |  actually means.  You appear to assume that precise timing control is 
> |  needed to obtain a smooth average.  I.e., if there are bursts, then 
> |  there is no smooth average.  But this isn't what the RFC means by 
> |  average.  The average rate is supposed to be smooth *from one RTT to the 
> |  next*.  Sub-RTT burstiness is *explicitly allowed*, although 
> |  implementations should try to avoid it when it's easy to do so.
> Sorry, I don't think you got my point. The problem is NOT in occasional bursts, but rather
> that there is a critical speed X_crit after which the system essentially gets completely out
> of control.
> 
> If bursts would occur occasionally then it would be like a car which is speeding up for a moment, 
> but then slows down again (and thus keeps the average speed) - agree that this would not be a problem.
> Here however we have that, once X_crit is reached, the system will _oscillate_ between the top
> available speed (and this could be hundreds of Mbits per second) and whatever it gets in terms
> of feedback from the receiver. After slowing down, it would again climb in slow-start up to 
> X_crit, then jump to top speed. Thus it is  like a car which will abruptly switch to top speed
> of e.g. 160 mph when trying to accelerate past 2 mph.
> 
> I have a snapshot which illustrates this state: 
> 
>  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/no_tx_locking/transmit_rate.png
>   
> The oscillating behaviour is well visible. In contrast, I am sure that you would agree that the
> desirable state is the following:
> 
>  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/with_tx_locking/transmit_rate.png
> 
> These snapshots were originally taken to compare the performance with and without serializing access to
> TX history. I didn't submit the patch since, at times, I would get the same chaotic behaviour with TX locking.
> 
> Other people on this list have reported that iperf performance is unpredictable with CCID 3. 
> 
> The point is that, without putting in some kind of control, we have a system which gets into a state of
> chaos as soon as the maximum controllable speed X_crit is reached. When it is past that point, there is
> no longer a notion of predictable performance or correct average rate: what happens is then outside the
> control of the CCID 3 module, performance is then a matter of coincidence.
> 
> I don't think that a kernel maintainer will gladly support a module which is liable to reaching such a
> chaotic state.
>  
>   
> |  > I have done a back-of-the-envelope calculation below for different sizes of s; 9kbyte
> |  > I think is the maximum size of an Ethernet jumbo frame.
> |  > 
> |  >    -----------+---------+---------+---------+---------+-------+---------+-------+
> |  >             s | 32      | 100     | 250     | 500     | 1000  | 1500    | 9000  |
> |  >    -----------+---------+---------+---------+---------+-------+---------+-------+
> |  >     X_critical| 32kbps  | 100kbps | 250kbps | 500kbps | 1mbps | 1.5mbps | 9mbps |
> |  >    -----------+---------+---------+---------+---------+-------+---------+-------+ 
> |  > 
> |  > That means we can only expect predictable performance up to 9mbps ?????
> |  
> |  Same comment.  I imagine performance will be predictable at speeds FAR 
> |  ABOVE 9mbps, DESPITE the sub-RTT bursts.  Predictable performance means 
> |  about the same average rate from one RTT to the next.
> I think that, without finer timer resolution, we need to put in some kind of throttle to avoid
> entering the region where speed can no longer be controlled.
> 
>   
> |  > I am dumbstruck - it means that the whole endeavour to try and use Gigabit cards (or
> |  > even 100 Mbit ethernet cards) is futile and we should be using the old 10 Mbit cards???
> |  
> |  Remember that TCP is ENTIRELY based on bursts!!!!!  No rate control at 
> |  all.  And it still gets predictable performance at high rates.
> |  
> Yes, but ..... it uses an entirely different mechanism and is not rate-based.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (4 preceding siblings ...)
  2007-01-12 16:41 ` Eddie Kohler
@ 2007-01-12 16:58 ` Gerrit Renker
  2007-01-12 20:02 ` Ian McDonald
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Gerrit Renker @ 2007-01-12 16:58 UTC (permalink / raw)
  To: dccp

|  > |  For what it's worth, it's as close to in the RFC as it can get without a 
|  > |  revision.  The authors of the RFC agree that we meant the initial 
|  > |  Request-Response RTT to be usable as an initial RTT estimate; the 
|  > |  working group agreed; errata has been sent.
|  > So we are RFC-compliant for the moment. One point which remains unresolved is,
|  > the above is about RTTs, but what about the initial sending rate of 1 packet
|  > per second, which implies an initial t_ipi of 1 second?
|  
|  "Therefore, in contrast to [RFC3448], the initial CCID 3 sending rate is 
|  allowed to be at least two packets per RTT, and at most four packets per 
|  RTT, depending on the packet size. ..." [RFC4342, Sec 5]
|  
|  Since CCID3 always has an estimate of the round trip time when it sends 
|  packets, there is no such thing as an initial sending rate of 1 packet 
|  per second.  The initial sending rate, after the Response arrives (thus 
|  providing an initial estimate of the round trip time), is 2-4 packets 
|  per RTT, depending on s.
Thanks for the clarification. Using the initial RTT estimate for the first packets
has not been ignored, but it will require some API changes, since CCIDs operate
only in the connected state and hence have no access to handshake-RTT information.
But before changing the API, there are many bugs which need to be fixed.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (5 preceding siblings ...)
  2007-01-12 16:58 ` Gerrit Renker
@ 2007-01-12 20:02 ` Ian McDonald
  2007-01-15  7:56 ` Gerrit Renker
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Ian McDonald @ 2007-01-12 20:02 UTC (permalink / raw)
  To: dccp

On 13/01/07, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
> I have a snapshot which illustrates this state:
>
>  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/no_tx_locking/transmit_rate.png
>
> The oscillating behaviour is well visible. In contrast, I am sure that you would agree that the
> desirable state is the following:
>
>  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/with_tx_locking/transmit_rate.png
>

Questions on the graphs:
- are these direct connect nodes?
- why is Xcalc 0?
- why on the stable one is Xrecv different from X. Is X the rate that
it is allowed to send at but doesn't so only achieves Xrecv?
-- 
Web: http://wand.net.nz/~iam4
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (6 preceding siblings ...)
  2007-01-12 20:02 ` Ian McDonald
@ 2007-01-15  7:56 ` Gerrit Renker
  2007-01-15  8:34 ` Gerrit Renker
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Gerrit Renker @ 2007-01-15  7:56 UTC (permalink / raw)
  To: dccp

|  > I have a snapshot which illustrates this state:
|  >
|  >  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/no_tx_locking/transmit_rate.png
|  >
|  > The oscillating behaviour is well visible. In contrast, I am sure that you would agree that the
|  > desirable state is the following:
|  >
|  >  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/with_tx_locking/transmit_rate.png
|  >
|  
|  Questions on the graphs:
|  - are these direct connect nodes?
The nodes are on the same Ethernet LAN, connected via 100 Mbps / 1 Gbps interfaces. There is no
other traffic on that LAN.
|  - why is Xcalc 0?
Since no losses occurred (p is always 0 on these tests).
|  - why on the stable one is Xrecv different from X. Is X the rate that
|  it is allowed to send at but doesn't so only achieves Xrecv?
When there is no loss, X = max(min(2*X, 2*X_recv), s/R) - the graph agrees with this, since
X is about 2*X_recv.
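
The no-loss update quoted above, written out as a small sketch. The function
name is hypothetical; rates are in bytes per second and the RTT R is in
microseconds:

```c
#include <assert.h>
#include <stdint.h>

/* X = max(min(2*X, 2*X_recv), s/R), the update used while p == 0. */
uint64_t tfrc_x_no_loss(uint64_t x, uint64_t x_recv, uint32_t s, uint32_t rtt_us)
{
	uint64_t doubled, min_rate;

	assert(rtt_us > 0);
	doubled = 2 * x < 2 * x_recv ? 2 * x : 2 * x_recv;
	min_rate = (uint64_t)s * 1000000 / rtt_us;	/* s/R */
	return doubled > min_rate ? doubled : min_rate;
}
```

As long as X_recv is below X, the result is 2*X_recv, so X stays at about twice
the received rate.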

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (7 preceding siblings ...)
  2007-01-15  7:56 ` Gerrit Renker
@ 2007-01-15  8:34 ` Gerrit Renker
  2007-02-08  0:59 ` Eddie Kohler
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Gerrit Renker @ 2007-01-15  8:34 UTC (permalink / raw)
  To: dccp

Quoting Eddie Kohler:
|  The first figure certainly demonstrates a problem.  However, that 
|  problem is not inherent in CCID3, it is not inherent in rate-based 
|  solutions, and high-rate timers probably wouldn't solve it.  CCID3 has 
|  been tested -- in simulation mind you -- at high rates.  The problem is 
|  a bug in the Linux implementation.  Ian seems to think he can solve the 
|  problem with bursts and I am inclined to agree.
|  
|  Your comments about X_crit are based on your observations, not analysis, 
|  yes?  If you can provide some reason why CCID3 inherently has an X_crit, 
|  I'd like to hear it.  "Oscillat[ing] between the top available speed ... 
|  and whatever it gets in terms of feedback" is not TFRC.  Sounds like a bug.
|  
|  I agree that kernel maintainers don't want bugs in the kernel.
|  
|  Anyway, if you can go deeper into the code and determine why you're 
|  observing this behavior (I assume in the absence of loss, which is even 
|  weirder), then that might be useful.
|  

First off - I think we all agree that the RFCs are all sound and 
thus virtually everything here deals with implementation problems (if there
are additional observations or discussions, we can copy to dccp@ietf).

I think to find why the CCID 3 performance is so chaotic, unpredictable and 
abysmally poor we should try to combine the various strengths of people on this list.

Apart from writing standards documents, you have designed the core of the click
modular router system, so many issues arising here you can probably evaluate from
a practical perspective as well as from a standards-based perspective.

Ian has been the maintainer of the CCID 3 module for so long and knows all the background
from the original Lulea code, through the WAND research code and the various stages it
went through. So it more or less entirely depends on the communication on this list how
good this code can be made. 

Below I throw in my 2 cents of why I think there is a critical speed X_crit. Maybe you can
help me dispel it or point out other possibilities which we can - step-by-step - eliminate,
until the cause becomes fully clear.

Firstly, all packet scheduling is based on schedule_timeout().

The return code rc of ccid_hc_tx_send_packet (wrapper around ccid3_hc_tx_send_packet) is used
to decide whether to 

 (a) send the packet immediately or
 (b) sleep with HZ granularity before retrying

I am assuming that there is no loss on the link and no backlog of packets which couldn't be
scheduled so far (i.e. if t_nom < t_now then t_now - t_nom < t_ipi). I assume further that
there is a constant stream of packets, fed into the TX queue by continuously calling
dccp_sendmsg. This is also the background of the experiments/graphs. 

Here is the analysis, starting with ccid3_hc_tx_send_packet:

 1) dccp_sendmsg calls dccp_write_xmit(sk, 0)

 2) dccp_write_xmit calls ccid_hc_tx_send_packet, a wrapper around ccid3_hc_tx_send_packet

 3) ccid3_hc_tx_send_packet gets the current time in usecs and computes  delay = t_nom - t_now

     (a) if delay >= delta = min(t_ipi/2, t_gran/2) then it returns delay/1000
     (b) otherwise it returns 0

 4) back in dccp_write_xmit, 
     * if rc=0 then the packet is sent immediately;  otherwise (since block=0), 
     * dccps_xmit_timer is reset to expire in t_now + rc  milliseconds (sk_reset_timer)
         -- in this case dccp_write_xmit exits now and
	 -- when the write timer expires, dccp_write_xmit_timer is called, which again
	    calls dccp_write_xmit(sk, 0)
	 -- this means going back to (3), now delay < delta, the function returns 0
	    and the packet is sent immediately
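
The decision in step (3) can be sketched as follows. This is a simplified model
of the description above, not the kernel function itself: the Python name
send_packet_rc and the sample values (in particular t_gran = 10000 usec) are
assumptions made for illustration.

```python
def send_packet_rc(t_nom_us, t_now_us, t_ipi_us, t_gran_us):
    """Model of step (3): 0 means 'send now', otherwise a delay in milliseconds."""
    delay = t_nom_us - t_now_us
    delta = min(t_ipi_us // 2, t_gran_us // 2)
    if delay >= delta:
        # Integer division: any delay below 1000 usec becomes 0 here.
        return delay // 1000
    return 0


# t_ipi = 5000 usec, packet due in 2500 usec: the timer is armed for 2 msec.
print(send_packet_rc(2500, 0, 5000, 10000))  # 2
# t_ipi = 800 usec, packet due in 700 usec: the result is truncated to 0,
# so the packet goes out immediately although it is not yet due.
print(send_packet_rc(700, 0, 800, 10000))    # 0
```

The second call is the problematic case: a non-zero delay in microseconds is
indistinguishable from "send now" once it has been converted to milliseconds.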

To find where the problematic case is, assume that the sender is in slow start and
doubles X each RTT. As X increases, t_ipi decreases so that there is a point where
t_ipi < 1000 usec. 

 -> all differences delay = t_nom - t_now which are less than 1000 result in 
    delay / 1000 = 0 due to integer division
 -> hence all packets which are late up to 1 millisecond are sent immediately
 -> assume that t_ipi is less than 1 millisecond, then in effect all packets are
    sent immediately; hence we have a _continuous_ burst of packets
 -> schedule_timeout() really only has a granularity of HZ:
     * if HZ=1000,   msecs_to_jiffies(m) returns m
     * if HZ < 1000, msecs_to_jiffies(m) returns (m * HZ + 999)/1000
          => hence m=1 millisecond will give a result of 1 jiffy
          => but with HZ < 1000 one jiffy is longer than 1 millisecond, so the
             timer can only expire with a granularity of 1/HZ seconds
          => that means if X is higher than X_crit, t_ipi will always be such that
             the timer expires too late, so that packets are all sent in
             immediate bursts or in scheduled bursts, but there is no longer
             any real scheduling
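
To make the granularity argument concrete, here is a small model of the
conversion formula quoted above. The helper names timer_wait_us and
packets_due_per_expiry are hypothetical, and the HZ and t_ipi values are only
sample numbers.

```python
def msecs_to_jiffies(m, hz):
    """Round-up conversion as quoted above: (m * HZ + 999) / 1000."""
    return (m * hz + 999) // 1000


def timer_wait_us(delay_ms, hz):
    """Nominal duration in usec of a timer armed for delay_ms milliseconds."""
    return msecs_to_jiffies(delay_ms, hz) * 1_000_000 // hz


def packets_due_per_expiry(t_ipi_us, delay_ms, hz):
    """Packets whose nominal send time passes while one timer is pending."""
    return timer_wait_us(delay_ms, hz) // t_ipi_us


print(timer_wait_us(1, 1000))  # 1000
print(timer_wait_us(1, 250))   # 4000
print(timer_wait_us(1, 100))   # 10000
# With t_ipi = 800 usec and HZ = 250, five packets fall due per timer expiry.
print(packets_due_per_expiry(800, 1, 250))  # 5
```

In this model a request for 1 millisecond never waits less than one full jiffy,
which is why several packets become due at once when t_ipi is below 1/HZ.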

The other points which I am not entirely sure about yet are
 * compression of packet spacing due to using TX output queues
 * interactions with the traffic control subsystem
 * times when the main socket is locked

- Gerrit

|  > I have a snapshot which illustrates this state: 
|  > 
|  >  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/no_tx_locking/transmit_rate.png
|  >   
|  > The oscillating behaviour is well visible. In contrast, I am sure that you would agree that the
|  > desirable state is the following:
|  > 
|  >  http://www.erg.abdn.ac.uk/users/gerrit/dccp/dccp_probe/examples/with_tx_locking/transmit_rate.png
|  > 
|  > These snapshots were originally taken to compare the performance with and without serializing access to
|  > TX history. I didn't submit the patch since, at times, I would get the same chaotic behaviour with TX locking.
|  > 
|  > Other people on this list have reported that iperf performance is unpredictable with CCID 3. 
|  > 
|  > The point is that, without putting in some kind of control, we have a system which gets into a state of
|  > chaos as soon as the maximum controllable speed X_crit is reached. When it is past that point, there is
|  > no longer a notion of predictable performance or correct average rate: what happens is then outside the
|  > control of the CCID 3 module, performance is then a matter of coincidence.
|  > 
|  > I don't think that a kernel maintainer will gladly support a module which is liable to reaching such a
|  > chaotic state.
|  >  
|  >   
|  > |  > I have done a back-of-the-envelope calculation below for different sizes of s; 9kbyte
|  > |  > I think is the maximum size of an Ethernet jumbo frame.
|  > |  > 
|  > |  >    -----------+---------+---------+---------+---------+-------+---------+-------+
|  > |  >             s | 32      | 100     | 250     | 500     | 1000  | 1500    | 9000  |
|  > |  >    -----------+---------+---------+---------+---------+-------+---------+-------+
|  > |  >     X_critical| 32kbps  | 100kbps | 250kbps | 500kbps | 1mbps | 1.5mbps | 9mbps |
|  > |  >    -----------+---------+---------+---------+---------+-------+---------+-------+ 
|  > |  > 
|  > |  > That means we can only expect predictable performance up to 9mbps ?????
|  > |  
|  > |  Same comment.  I imagine performance will be predictable at speeds FAR 
|  > |  ABOVE 9mbps, DESPITE the sub-RTT bursts.  Predictable performance means 
|  > |  about the same average rate from one RTT to the next.
|  > I think that, without finer timer resolution, we need to put in some kind of throttle to avoid
|  > entering the region where speed can no longer be controlled.
|  > 
|  >   
|  > |  > I am dumbstruck - it means that the whole endeavour to try and use Gigabit cards (or
|  > |  > even 100 Mbit ethernet cards) is futile and we should be using the old 10 Mbit cards???
|  > |  
|  > |  Remember that TCP is ENTIRELY based on bursts!!!!!  No rate control at 
|  > |  all.  And it still gets predictable performance at high rates.
|  > |  
|  > Yes, but ..... it uses an entirely different mechanism and is not rate-based.
|  
|  


* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (8 preceding siblings ...)
  2007-01-15  8:34 ` Gerrit Renker
@ 2007-02-08  0:59 ` Eddie Kohler
  2007-02-08  1:13 ` Ian McDonald
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Eddie Kohler @ 2007-02-08  0:59 UTC (permalink / raw)
  To: dccp

I have some minor thoughts relating to this.

- In what units is t_nom kept?  I would hope microseconds at least, not 
milliseconds.  You say "dccps_xmit_timer is reset to expire in t_now + rc 
milliseconds"; I assume you mean that the value t_now + rc is cast to 
milliseconds.  Clearly high rates will require that t_nom be kept in 
microseconds at least.

I wonder because you say:

>  -> assume that t_ipi is less than 1 millisecond, then in effect all packets are
>     sent immediately; hence we have a _continuous_ burst of packets

but this does not follow.  If t_ipi<1ms, but is non-zero, then at most 1ms 
worth of packets can be sent in a burst -- not continuous (assuming it takes 
way less than 1ms to send a packet).

- ccid3_hc_tx_send_packet should return a value that is measured in 
MICROSECONDS not milliseconds.  It also sounds like there is a rounding error 
in step 3a); it should probably return (delay + 500)/1000 at least.

- What if the delay till the next packet is >0 but less than one jiffy (1/HZ 
seconds)?  It sounds like that case is causing the problem.  One answer: Do not 
schedule a TIMER; instead, leave a kernel thread or bottom half scheduled, so 
that the next time the kernel runs, it will poll DCCP, even if that is before 
one jiffy from now.
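
The two numeric points above (the rounding in step 3a and the bound on burst
size) can be sketched as follows. The function names are hypothetical, and the
burst bound is only approximate since it ignores the time needed to transmit
each packet.

```python
def truncated_ms(delay_us):
    """What step 3a) is described as returning: delay/1000, rounded down."""
    return delay_us // 1000


def rounded_ms(delay_us):
    """The suggested alternative: (delay + 500)/1000, rounded to nearest."""
    return (delay_us + 500) // 1000


def burst_packets(t_ipi_us, window_us=1000):
    """Approximate number of packets that fall due within one window."""
    return window_us // t_ipi_us


print(truncated_ms(700), rounded_ms(700))    # 0 1
print(truncated_ms(1600), rounded_ms(1600))  # 1 2
# With t_ipi = 250 usec, about 4 packets are due per millisecond, so a
# burst is bounded rather than continuous.
print(burst_packets(250))  # 4
```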

Make sense?
Eddie




* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (9 preceding siblings ...)
  2007-02-08  0:59 ` Eddie Kohler
@ 2007-02-08  1:13 ` Ian McDonald
  2007-02-08  1:23 ` Eddie Kohler
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Ian McDonald @ 2007-02-08  1:13 UTC (permalink / raw)
  To: dccp

On 2/8/07, Eddie Kohler <kohler@cs.ucla.edu> wrote:
> I have some minor thoughts relating to this.
>
> - In what units are t_nom kept?  I would hope microseconds at least, not
> milliseconds.  You say "dccps_xmit_timer is reset to expire in t_now + rc
> milliseconds"; I assume you mean that the value t_now + rc is cast to
> milliseconds.  Clearly high rates will require that t_nom be kept in
> microseconds at least.

It's definitely microseconds. Nearly all critical values such as t_nom,
t_ipi and rtt are calculated in usecs. Early on, some of the code tried to
work in msecs and it was a disaster.

But it wouldn't matter if t_nom were in msecs in this case, as you send
packets up to t_nom and then set a timer. If there were a backlog, those
packets would get sent next time and, as you say, it takes way less than
1ms to send a packet.

>
> I wonder because you say:
>
> >  -> assume that t_ipi is less than 1 millisecond, then in effect all packets are
> >     sent immediately; hence we have a _continuous_ burst of packets
>
> but this does not follow.  If t_ipi<1ms, but is non-zero, then at most 1ms
> worth of packets can be sent in a burst -- not continuous (assuming it takes
> way less than 1ms to send a packet).
>
> - ccid3_hc_tx_send_packet should return a value that is measured in
> MICROSECONDS not milliseconds.  It also sounds like there is a rounding error
> in step 3a); it should probably return (delay + 500)/1000 at least.
>
This return value is used to set a timer for when to wake up again, and
msecs are valid for that. We don't want to be firing interrupts at usec
intervals, as that is too intensive, and that's why you have t_gran there.

> - What if the delay till the next packet is >0 but <HZ?  It sounds like that
> case is causing the problem.  One answer: Do not schedule a TIMER; instead,
> leave a kernel thread or bottom half scheduled, so that the next time the
> kernel runs, it will poll DCCP, even if that is before 1HZ from now.
>
Might have a look at that but it's not affecting me right now - more
Gerrit's area.

> Make sense?
> Eddie
>
>
-- 
Web: http://wand.net.nz/~iam4
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group


* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (10 preceding siblings ...)
  2007-02-08  1:13 ` Ian McDonald
@ 2007-02-08  1:23 ` Eddie Kohler
  2007-02-08  1:47 ` Ian McDonald
  2007-02-08  5:50 ` Eddie Kohler
  13 siblings, 0 replies; 15+ messages in thread
From: Eddie Kohler @ 2007-02-08  1:23 UTC (permalink / raw)
  To: dccp

>> - ccid3_hc_tx_send_packet should return a value that is measured in
>> MICROSECONDS not milliseconds.  It also sounds like there is a 
>> rounding error
>> in step 3a); it should probably return (delay + 500)/1000 at least.
>>
> This is used to set a timer to know when to wake up again which is
> valid to be in msecs. We don't want to be firing interrupts in usec
> intervals as too intensive and that's why you have t_gran there.

You want it to return microseconds precisely so you can distinguish between 
cases that require a timer and cases that require something lighter-weight, 
like a simple rescheduling.  If you pre-truncate the information to 
millisecond granularity there's no way to distinguish between "send NOW" and 
"send pretty soon but not now".  Just divide by 1000 when you actually 
schedule the timer.
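
A minimal sketch of that distinction, assuming the sender gets the delay in
microseconds. The function name, the action labels and the one-jiffy threshold
are illustrative assumptions, not existing kernel interfaces.

```python
def next_action(delay_us, jiffy_us):
    """Pick a send strategy from a delay that is still in microseconds."""
    if delay_us <= 0:
        return ("send_now", 0)
    if delay_us < jiffy_us:
        # Too short for a timer: stay runnable and poll again soon.
        return ("reschedule", delay_us)
    # Only here is the value reduced to milliseconds for the timer.
    return ("timer_ms", delay_us // 1000)


print(next_action(0, 1000))     # ('send_now', 0)
print(next_action(300, 1000))   # ('reschedule', 300)
print(next_action(2500, 1000))  # ('timer_ms', 2)
```

With millisecond return values the first two cases both collapse to 0, which is
exactly the information that gets lost.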

E


> 
>> - What if the delay till the next packet is >0 but <HZ?  It sounds 
>> like that
>> case is causing the problem.  One answer: Do not schedule a TIMER; 
>> instead,
>> leave a kernel thread or bottom half scheduled, so that the next time the
>> kernel runs, it will poll DCCP, even if that is before 1HZ from now.
>>
> Might have a look at that but it's not affecting me right now - more
> Gerrit's area.
> 
>> Make sense?
>> Eddie
>>
>>


* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (11 preceding siblings ...)
  2007-02-08  1:23 ` Eddie Kohler
@ 2007-02-08  1:47 ` Ian McDonald
  2007-02-08  5:50 ` Eddie Kohler
  13 siblings, 0 replies; 15+ messages in thread
From: Ian McDonald @ 2007-02-08  1:47 UTC (permalink / raw)
  To: dccp

On 2/8/07, Eddie Kohler <kohler@cs.ucla.edu> wrote:
> >> - ccid3_hc_tx_send_packet should return a value that is measured in
> >> MICROSECONDS not milliseconds.  It also sounds like there is a
> >> rounding error
> >> in step 3a); it should probably return (delay + 500)/1000 at least.
> >>
> > This is used to set a timer to know when to wake up again which is
> > valid to be in msecs. We don't want to be firing interrupts in usec
> > intervals as too intensive and that's why you have t_gran there.
>
> You want it to return microseconds precisely so you can distinguish between
> cases that require a timer and cases that require something lighterweight,
> like a simple rescheduling.  If you pre-truncate the information to
> millisecond granularity there's no way to distinguish between "send NOW" and
> "send pretty soon but not now".  Just divide by 1000 when you actually
> schedule the timer.
>
I'm not sure about this. If you reschedule you've got no guarantee of
when you get your next timeslice. I also think that timers probably
aren't that heavy in Linux as it probably weights the scheduler. If
you ask for something 5 msecs in the future, you are guaranteed it won't
go off before then, only sometime after.

This is probably one to test some time, but I can't see it affecting things
too much and there are probably bigger bugs to fry at present. But it is
certainly worth checking at some point.

Ian
-- 
Web: http://wand.net.nz/~iam4
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group


* Re: [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP
  2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
                   ` (12 preceding siblings ...)
  2007-02-08  1:47 ` Ian McDonald
@ 2007-02-08  5:50 ` Eddie Kohler
  13 siblings, 0 replies; 15+ messages in thread
From: Eddie Kohler @ 2007-02-08  5:50 UTC (permalink / raw)
  To: dccp

>> You want it to return microseconds precisely so you can distinguish 
>> between
>> cases that require a timer and cases that require something 
>> lighterweight,
>> like a simple rescheduling.  If you pre-truncate the information to
>> millisecond granularity there's no way to distinguish between "send 
>> NOW" and
>> "send pretty soon but not now".  Just divide by 1000 when you actually
>> schedule the timer.
>>
> I'm not sure about this. If you reschedule you've got no guarantee
> when you get your next timeslice.

You've got no GUARANTEE of when you get your next timeslice, but in most 
cases it will be soon.  As opposed to scheduling a timer 1ms in the 
future, where you KNOW you won't run for 1ms, as you say.

Gerrit's problem is burstiness at the level of milliseconds.  The only 
way to solve this is to run at sub-millisecond granularity.  Use hi-res 
timers, or easier, use existing mechanisms for running code frequently: 
i.e., leaving the code runnable.

Eddie


> I also think that timers probably
> aren't that heavy in Linux as it probably weights the scheduler. If
> you ask for something 5 msecs in future you are guaranteed it won't go
> off before then but sometime after..
> 
> Probably one to test some time but I can't see this affecting things
> too much and bigger bugs to fry at present probably. But certainly
> worth checking at some point.
> 
> Ian


Thread overview: 15+ messages
2007-01-10 10:21 [PATCH 1/1] DCCP: Fix up t_nom - FOLLOW-UP Gerrit Renker
2007-01-10 19:40 ` Ian McDonald
2007-01-12 10:39 ` Gerrit Renker
2007-01-12 12:54 ` Gerrit Renker
2007-01-12 16:33 ` Eddie Kohler
2007-01-12 16:41 ` Eddie Kohler
2007-01-12 16:58 ` Gerrit Renker
2007-01-12 20:02 ` Ian McDonald
2007-01-15  7:56 ` Gerrit Renker
2007-01-15  8:34 ` Gerrit Renker
2007-02-08  0:59 ` Eddie Kohler
2007-02-08  1:13 ` Ian McDonald
2007-02-08  1:23 ` Eddie Kohler
2007-02-08  1:47 ` Ian McDonald
2007-02-08  5:50 ` Eddie Kohler
