[PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
@ 2006-06-14  9:40 Jesper Dangaard Brouer
  2006-06-14 12:06 ` jamal
                   ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Jesper Dangaard Brouer @ 2006-06-14  9:40 UTC (permalink / raw)
  To: Stephen Hemminger, Jamal Hadi Salim, netdev, lartc
  Cc: russell-tcatm, hawk, hawk, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1311 bytes --]

The Linux traffic's control engine inaccurately calculates
transmission times for packets sent over ADSL links.  For
some packet sizes the error rises to over 50%.  This occurs
because ADSL uses ATM as its link layer transport, and ATM
transmits packets in fixed sized 53 byte cells.

The following patches to iproute2 and the kernel add an
option to calculate traffic transmission times over all
ATM links, including ADSL, with perfect accuracy.

A longer presentation of the patch, its rational, what it
does and how to use it can be found here:
   http://www.stuart.id.au/russell/files/tc/tc-atm/

A earlier version of the patch, and a _detailed_ empirical
investigation of its effects can be found here:
   http://www.adsl-optimizer.dk/

The patches are both backwards and forwards compatible.
This means unpatched kernels will work with a patched
version of iproute2, and an unpatched iproute2 will work
on patches kernels.

This is a combined effort of Jesper Brouer and Russell Stuart,
to get these patches into the upstream kernel.

Let the discussion start about what we need to change to get this
upstream?

We see this as a feature enhancement, as thus hope that it can be
queued in davem's net-2.6.18.git tree.

---
Regards,
 Jesper Brouer & Russell Stuart.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14  9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer
@ 2006-06-14 12:06 ` jamal
  2006-06-14 12:55   ` Jesper Dangaard Brouer
                     ` (3 more replies)
  2006-06-14 14:27 ` Phillip Susi
  2006-06-20  5:35 ` Chris Wedgwood
  2 siblings, 4 replies; 56+ messages in thread
From: jamal @ 2006-06-14 12:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger


I have taken linux-kernel off the list.

Russell's site is inaccessible to me (I actually think this is related
to some DNS issues i may be having) and your masters is too long to
spend 2 minutes and glean it; so heres a question or two for you:

- Have you tried to do a long-lived session such as a large FTP and 
seen how far off the deviation was? That would provide some interesting
data point.
- To be a devil's advocate (and not claim there is no issue), where do
you draw the line with "overhead"? 
Example the smallest ethernet packet is 64 bytes of which 14 bytes are
ethernet headers ("overhead" for IP) - and this is not counting CRC etc.
If you were to set an MTU of say 64 bytes and tried to do a http or ftp,
how accurate do you think the calculation would be? I would think not
very different.
Does it matter if it is accurate on the majority of the cases?
- For further reflection: Have you considered the case where the rate
table has already been considered on some link speed in user space and
then somewhere post-config the physical link speed changes? This would
happen in the case where ethernet AN is involved and the partner makes
some changes (use ethtool). 

I would say the last bullet is a more interesting problem than a corner
case of some link layer technology that has high overhead.
Your work would be more interesting if it was generic for many link
layers instead of just ATM.


cheers,
jamal

On Wed, 2006-14-06 at 11:40 +0200, Jesper Dangaard Brouer wrote:
> The Linux traffic's control engine inaccurately calculates
> transmission times for packets sent over ADSL links.  For
> some packet sizes the error rises to over 50%.  This occurs
> because ADSL uses ATM as its link layer transport, and ATM
> transmits packets in fixed sized 53 byte cells.
> 
> The following patches to iproute2 and the kernel add an
> option to calculate traffic transmission times over all
> ATM links, including ADSL, with perfect accuracy.
> 
> A longer presentation of the patch, its rational, what it
> does and how to use it can be found here:
>    http://www.stuart.id.au/russell/files/tc/tc-atm/
> 
> A earlier version of the patch, and a _detailed_ empirical
> investigation of its effects can be found here:
>    http://www.adsl-optimizer.dk/
> 
> The patches are both backwards and forwards compatible.
> This means unpatched kernels will work with a patched
> version of iproute2, and an unpatched iproute2 will work
> on patches kernels.
> 
> 
> This is a combined effort of Jesper Brouer and Russell Stuart,
> to get these patches into the upstream kernel.
> 
> Let the discussion start about what we need to change to get this
> upstream?
> 
> We see this as a feature enhancement, as thus hope that it can be
> queued in davem's net-2.6.18.git tree.
> 
> ---
> Regards,
>  Jesper Brouer & Russell Stuart.
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14 12:06 ` jamal
@ 2006-06-14 12:55   ` Jesper Dangaard Brouer
  2006-06-15 12:57     ` jamal
  2006-06-15 13:16     ` jamal
  2006-06-14 15:32   ` Andy Furniss
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 56+ messages in thread
From: Jesper Dangaard Brouer @ 2006-06-14 12:55 UTC (permalink / raw)
  To: hadi; +Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger

[-- Attachment #1: Type: text/plain, Size: 4519 bytes --]


On Wed, 2006-06-14 at 08:06 -0400, jamal wrote:

> Russell's site is inaccessible to me (I actually think this is related
> to some DNS issues i may be having) 

Strange, I have access to Russell's site.  Maybe its his redirect
feature that confuses your browser, try:
 http://ace-host.stuart.id.au/russell/files/tc/tc-atm/

> and your masters is too long to
> spend 2 minutes and glean it; so heres a question or two for you:

Yes, I is quite long and very detailed.  But it worth reading (... says
the author him self ;-))


> - Have you tried to do a long-lived session such as a large FTP and 
> seen how far off the deviation was? That would provide some interesting
> data point.

The deviation can be calculated.  The impact is of cause small for large
packets.  But the argument that bulk TCP transfers is not as badly
affected, is wrong because all the TCP ACK packets gets maximum penalty.

On an ADSL link with more than 8 bytes overhead, a 40 bytes TCP ACK will
use more that one ATM frame, causing 2 ATM frames to be send that
consumes 106 bytes, eg. 62% overhead.  On a small upstream ADSL line
that hurts! (See thesis page 53, table 5.3 "Overhead summary").


> - To be a devil's advocate (and not claim there is no issue), where do
> you draw the line with "overhead"? 
> Example the smallest ethernet packet is 64 bytes of which 14 bytes are
> ethernet headers ("overhead" for IP) - and this is not counting CRC etc.
> If you were to set an MTU of say 64 bytes and tried to do a http or ftp,
> how accurate do you think the calculation would be? I would think not
> very different.

I do think we handle this situation, but I'm not quite sure that I fully
understand the question (sorry).


> Does it matter if it is accurate on the majority of the cases?
> - For further reflection: Have you considered the case where the rate
> table has already been considered on some link speed in user space and
> then somewhere post-config the physical link speed changes? This would
> happen in the case where ethernet AN is involved and the partner makes
> some changes (use ethtool). 
>
> I would say the last bullet is a more interesting problem than a corner
> case of some link layer technology that has high overhead.

We only claim to do magic on ATM/ADSL links... nothing else ;-)


> Your work would be more interesting if it was generic for many link
> layers instead of just ATM.

Well, we did consider to do so, but we though that it would be harder to
get it into the kernel.

Actually thats the reason for the defines:
 #define	ATM_CELL_SIZE		53
 #define	ATM_CELL_PAYLOAD	48

Changing these should should make it possible to adapt to any other SAR
(Segment And Reasembly) link layer.  


> On Wed, 2006-14-06 at 11:40 +0200, Jesper Dangaard Brouer wrote:
> > The Linux traffic's control engine inaccurately calculates
> > transmission times for packets sent over ADSL links.  For
> > some packet sizes the error rises to over 50%.  This occurs
> > because ADSL uses ATM as its link layer transport, and ATM
> > transmits packets in fixed sized 53 byte cells.
> > 
> > The following patches to iproute2 and the kernel add an
> > option to calculate traffic transmission times over all
> > ATM links, including ADSL, with perfect accuracy.
> > 
> > A longer presentation of the patch, its rational, what it
> > does and how to use it can be found here:
> >    http://www.stuart.id.au/russell/files/tc/tc-atm/
> > 
> > A earlier version of the patch, and a _detailed_ empirical
> > investigation of its effects can be found here:
> >    http://www.adsl-optimizer.dk/
> > 
> > The patches are both backwards and forwards compatible.
> > This means unpatched kernels will work with a patched
> > version of iproute2, and an unpatched iproute2 will work
> > on patches kernels.
> > 
> > 
> > This is a combined effort of Jesper Brouer and Russell Stuart,
> > to get these patches into the upstream kernel.
> > 
> > Let the discussion start about what we need to change to get this
> > upstream?
> > 
> > We see this as a feature enhancement, as thus hope that it can be
> > queued in davem's net-2.6.18.git tree.
> > 
> > ---
> > Regards,
> >  Jesper Brouer & Russell Stuart.
> > 
> 

Thanks for your comments :-)

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14 12:55   ` Jesper Dangaard Brouer
@ 2006-06-15 12:57     ` jamal
  2006-06-15 13:16     ` jamal
  1 sibling, 0 replies; 56+ messages in thread
From: jamal @ 2006-06-15 12:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk

On Wed, 2006-14-06 at 14:55 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 2006-06-14 at 08:06 -0400, jamal wrote:

> > - Have you tried to do a long-lived session such as a large FTP and 
> > seen how far off the deviation was? That would provide some interesting
> > data point.
> 
> The deviation can be calculated.  The impact is of cause small for large
> packets.  
>
> But the argument that bulk TCP transfers is not as badly
> affected, is wrong because all the TCP ACK packets gets maximum penalty.
> 

ACKs have always played a prominent role. The last numbers i have seen
for North America (but i think probably valid globaly) show in the range
of 40% ACKs in internet traffic
http://netweb.usc.edu/~rsinha/pkt-sizes/
I suspect a lot of these stats are on their way to change with voip, p2p
etc.
But i dont think it is ACKs perse that you or Russell are contending
cause these issues. It's the presence of ATM . And all evidence seems to
point to the fact that ISPs bill you for something other than your
point of view, no?

> On an ADSL link with more than 8 bytes overhead, a 40 bytes TCP ACK will
> use more that one ATM frame, causing 2 ATM frames to be send that
> consumes 106 bytes, eg. 62% overhead.  On a small upstream ADSL line
> that hurts! (See thesis page 53, table 5.3 "Overhead summary").
> 

But how are you connected to the DSLAM? In north America it is typically
ethernet. If i use the current tables i dont see much of a problem with
say cable modems. Are you trying to compensate for the accounting
differences between what your service provider measures (accounting for
their ATM cells) and what you do accounting for your ethernet frames? 
I guess i am lost as to where the ATM is in the topology and more
importantly whether we (Linux) mis-account or whether your approach is
trying to compensate for the ISPs mis-accounting.

> 
> > - To be a devil's advocate (and not claim there is no issue), where do
> > you draw the line with "overhead"? 
> > Example the smallest ethernet packet is 64 bytes of which 14 bytes are
> > ethernet headers ("overhead" for IP) - and this is not counting CRC etc.
> > If you were to set an MTU of say 64 bytes and tried to do a http or ftp,
> > how accurate do you think the calculation would be? I would think not
> > very different.
> 
> I do think we handle this situation, but I'm not quite sure that I fully
> understand the question (sorry).
> 

Assume the following:
- You had ethernet end to end. Is there still a problem?
- Take it a notch up and assume you had ethernet with MTU of 64B. This
way you will have all your packets being small and having high overhead.
Do you still have a problem?

> 
> > Does it matter if it is accurate on the majority of the cases?
> > - For further reflection: Have you considered the case where the rate
> > table has already been considered on some link speed in user space and
> > then somewhere post-config the physical link speed changes? This would
> > happen in the case where ethernet AN is involved and the partner makes
> > some changes (use ethtool). 
> >
> > I would say the last bullet is a more interesting problem than a corner
> > case of some link layer technology that has high overhead.
> 
> We only claim to do magic on ATM/ADSL links... nothing else ;-)
> 

This is well and good given the focus of your thesis. Up/down here we
need something more generic. Your masters-thesis is a good start but
consider doing the phd next and complete this work;->

> 
> > Your work would be more interesting if it was generic for many link
> > layers instead of just ATM.
> 
> Well, we did consider to do so, but we though that it would be harder to
> get it into the kernel.
> 
> Actually thats the reason for the defines:
>  #define	ATM_CELL_SIZE		53
>  #define	ATM_CELL_PAYLOAD	48
> 
> Changing these should should make it possible to adapt to any other SAR
> (Segment And Reasembly) link layer.  
> 

You are still speaking ATM (and the above may still be valid), but: 
Could you for example look at the netdevice->type and from that figure
out the link layer overhead and compensate for it.
Obviously a lot more useful if such activity is doable in user space
without any knowledge of the kernel? and therefore zero change to the
kernel and everything then becomes forward and backward compatible.

cheers,
jamal

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14 12:55   ` Jesper Dangaard Brouer
  2006-06-15 12:57     ` jamal
@ 2006-06-15 13:16     ` jamal
  2006-06-20  1:04       ` Patrick McHardy
  1 sibling, 1 reply; 56+ messages in thread
From: jamal @ 2006-06-15 13:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk

On Wed, 2006-14-06 at 14:55 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 2006-06-14 at 08:06 -0400, jamal wrote:

> > - Have you tried to do a long-lived session such as a large FTP and 
> > seen how far off the deviation was? That would provide some interesting
> > data point.
> 
> The deviation can be calculated.  The impact is of cause small for large
> packets.  
>
> But the argument that bulk TCP transfers is not as badly
> affected, is wrong because all the TCP ACK packets gets maximum penalty.
> 

ACKs have always played a prominent role. The last numbers i have seen
for North America (but i think probably valid globaly) show in the range
of 40% ACKs in internet traffic
http://netweb.usc.edu/~rsinha/pkt-sizes/
I suspect a lot of these stats are on their way to change with voip, p2p
etc.
But i dont think it is ACKs perse that you or Russell are contending
cause these issues. It's the presence of ATM . And all evidence seems to
point to the fact that ISPs bill you for something other than your
point of view, no?

> On an ADSL link with more than 8 bytes overhead, a 40 bytes TCP ACK will
> use more that one ATM frame, causing 2 ATM frames to be send that
> consumes 106 bytes, eg. 62% overhead.  On a small upstream ADSL line
> that hurts! (See thesis page 53, table 5.3 "Overhead summary").
> 

But how are you connected to the DSLAM? In north America it is typically
ethernet. If i use the current tables i dont see much of a problem with
say cable modems. Are you trying to compensate for the accounting
differences between what your service provider measures (accounting for
their ATM cells) and what you do accounting for your ethernet frames? 
I guess i am lost as to where the ATM is in the topology and more
importantly whether we (Linux) mis-account or whether your approach is
trying to compensate for the ISPs mis-accounting.

> 
> > - To be a devil's advocate (and not claim there is no issue), where do
> > you draw the line with "overhead"? 
> > Example the smallest ethernet packet is 64 bytes of which 14 bytes are
> > ethernet headers ("overhead" for IP) - and this is not counting CRC etc.
> > If you were to set an MTU of say 64 bytes and tried to do a http or ftp,
> > how accurate do you think the calculation would be? I would think not
> > very different.
> 
> I do think we handle this situation, but I'm not quite sure that I fully
> understand the question (sorry).
> 

Assume the following:
- You had ethernet end to end. Is there still a problem?
- Take it a notch up and assume you had ethernet with MTU of 64B. This
way you will have all your packets being small and having high overhead.
Do you still have a problem?

> 
> > Does it matter if it is accurate on the majority of the cases?
> > - For further reflection: Have you considered the case where the rate
> > table has already been considered on some link speed in user space and
> > then somewhere post-config the physical link speed changes? This would
> > happen in the case where ethernet AN is involved and the partner makes
> > some changes (use ethtool). 
> >
> > I would say the last bullet is a more interesting problem than a corner
> > case of some link layer technology that has high overhead.
> 
> We only claim to do magic on ATM/ADSL links... nothing else ;-)
> 

This is well and good given the focus of your thesis. Up/down here we
need something more generic. Your masters-thesis is a good start but
consider doing the phd next and complete this work;->

> 
> > Your work would be more interesting if it was generic for many link
> > layers instead of just ATM.
> 
> Well, we did consider to do so, but we though that it would be harder to
> get it into the kernel.
> 
> Actually thats the reason for the defines:
>  #define	ATM_CELL_SIZE		53
>  #define	ATM_CELL_PAYLOAD	48
> 
> Changing these should should make it possible to adapt to any other SAR
> (Segment And Reasembly) link layer.  
> 

You are still speaking ATM (and the above may still be valid), but: 
Could you for example look at the netdevice->type and from that figure
out the link layer overhead and compensate for it.
Obviously a lot more useful if such activity is doable in user space
without any knowledge of the kernel? and therefore zero change to the
kernel and everything then becomes forward and backward compatible.

cheers,
jamal

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-15 13:16     ` jamal
@ 2006-06-20  1:04       ` Patrick McHardy
  2006-06-20 14:59         ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-20  1:04 UTC (permalink / raw)
  To: hadi
  Cc: Jesper Dangaard Brouer, Stephen Hemminger, netdev, lartc,
	russell-tcatm, hawk

jamal wrote:
> You are still speaking ATM (and the above may still be valid), but: 
> Could you for example look at the netdevice->type and from that figure
> out the link layer overhead and compensate for it.
> Obviously a lot more useful if such activity is doable in user space
> without any knowledge of the kernel? and therefore zero change to the
> kernel and everything then becomes forward and backward compatible.

It would be nice to have support for HFSC as well, which unfortunately
needs to be done in the kernel since it doesn't use rate tables.
What about qdiscs like SFQ (which uses the packet size in quantum
calculations)? I guess it would make sense to use the wire-length
there as well.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-20  1:04       ` Patrick McHardy
@ 2006-06-20 14:59         ` jamal
  2006-06-20 15:16           ` Patrick McHardy
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-06-20 14:59 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger,
	Jesper Dangaard Brouer

On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote:
> jamal wrote:
> > You are still speaking ATM (and the above may still be valid), but: 
> > Could you for example look at the netdevice->type and from that figure
> > out the link layer overhead and compensate for it.
> > Obviously a lot more useful if such activity is doable in user space
> > without any knowledge of the kernel? and therefore zero change to the
> > kernel and everything then becomes forward and backward compatible.
> 
> It would be nice to have support for HFSC as well, which unfortunately
> needs to be done in the kernel since it doesn't use rate tables.
> What about qdiscs like SFQ (which uses the packet size in quantum
> calculations)? I guess it would make sense to use the wire-length
> there as well.

Didnt even think of that ;-> 
Is it getting too complicated? 

BTW, I forgot to mention one thing on the bandwidth issue is we could do
is send netlink events on link speed changes too; some listener
somewhere would then do the adjustment.

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-20 14:59         ` jamal
@ 2006-06-20 15:16           ` Patrick McHardy
  2006-06-21 12:21             ` Krzysztof Matusik
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-20 15:16 UTC (permalink / raw)
  To: hadi
  Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger,
	Jesper Dangaard Brouer

jamal wrote:
> On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote:
> 
>>It would be nice to have support for HFSC as well, which unfortunately
>>needs to be done in the kernel since it doesn't use rate tables.
>>What about qdiscs like SFQ (which uses the packet size in quantum
>>calculations)? I guess it would make sense to use the wire-length
>>there as well.
> 
> 
> Didnt even think of that ;-> 
> Is it getting too complicated? 

The code wouldn't be very complicated, it just adds some overhead. If
you do something like I described in my previous mail the overhead for
people not using it would be an additional pointer test before reading
skb->len. I guess we could also make it a compile time option.
I personally think this is something that really improves our quality
of implementation, after all, its "wire" resources qdiscs are meant
to manage.

> BTW, I forgot to mention one thing on the bandwidth issue is we could do
> is send netlink events on link speed changes too; some listener
> somewhere would then do the adjustment.

See the mail I just wrote :)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-20 15:16           ` Patrick McHardy
@ 2006-06-21 12:21             ` Krzysztof Matusik
  2006-06-21 12:54               ` Patrick McHardy
  0 siblings, 1 reply; 56+ messages in thread
From: Krzysztof Matusik @ 2006-06-21 12:21 UTC (permalink / raw)
  To: netdev

Dnia wtorek, 20 czerwca 2006 17:16, Patrick McHardy napisał:
> jamal wrote:
> > On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote:
> >>It would be nice to have support for HFSC as well, which unfortunately
> >>needs to be done in the kernel since it doesn't use rate tables.
> >>What about qdiscs like SFQ (which uses the packet size in quantum
> >>calculations)? I guess it would make sense to use the wire-length
> >>there as well.
> >
> > Didnt even think of that ;->
> > Is it getting too complicated?
>
> The code wouldn't be very complicated, it just adds some overhead. If
> you do something like I described in my previous mail the overhead for
> people not using it would be an additional pointer test before reading
> skb->len. I guess we could also make it a compile time option.
> I personally think this is something that really improves our quality
> of implementation, after all, its "wire" resources qdiscs are meant
> to manage.

I'd love to see this one implemented. I'm using HFSC more than a year and it 
never provides proper QoS on ATM/ADSL links; low delays can never be achieved 
even with significant throttling below the h/w link bandwidth.
This would help a lot regarding the amount of adsl users but on the other 
hand- there's not many HFSC implementations in real-life I guess (users seem 
to be afraid of it's 'complexity'). 
This idea doesn't seem look dirty- is there a chance to implement it in the 
kernel and iproute?

regards

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-21 12:21             ` Krzysztof Matusik
@ 2006-06-21 12:54               ` Patrick McHardy
  2006-06-21 14:33                 ` Krzysztof Matusik
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-21 12:54 UTC (permalink / raw)
  To: Krzysztof Matusik; +Cc: netdev

Krzysztof Matusik wrote:
> Dnia wtorek, 20 czerwca 2006 17:16, Patrick McHardy napisał:
> 
>>The code wouldn't be very complicated, it just adds some overhead. If
>>you do something like I described in my previous mail the overhead for
>>people not using it would be an additional pointer test before reading
>>skb->len. I guess we could also make it a compile time option.
>>I personally think this is something that really improves our quality
>>of implementation, after all, its "wire" resources qdiscs are meant
>>to manage.
> 
> 
> I'd love to see this one implemented. I'm using HFSC more than a year and it 
> never provides proper QoS on ATM/ADSL links; low delays can never be achieved 
> even with significant throttling below the h/w link bandwidth.

Mhh .. I trust you picked a proper clocksource and timer frequency?

> This would help a lot regarding the amount of adsl users but on the other 
> hand- there's not many HFSC implementations in real-life I guess (users seem 
> to be afraid of it's 'complexity'). 
> This idea doesn't seem look dirty- is there a chance to implement it in the 
> kernel and iproute?

I hacked up a patch yesterday. I want to do a bit more testing before
posting it, but its hard to really test the effectiveness because even
without this patch my DSL line already delivers higher throughput and
much better delay than it should (its a throttled SDSL line sold as
ADSL). I'll try to do some testing on an ethernet link now.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-21 12:54               ` Patrick McHardy
@ 2006-06-21 14:33                 ` Krzysztof Matusik
  0 siblings, 0 replies; 56+ messages in thread
From: Krzysztof Matusik @ 2006-06-21 14:33 UTC (permalink / raw)
  To: netdev

Dnia środa, 21 czerwca 2006 14:54, Patrick McHardy napisał:
> > I'd love to see this one implemented. I'm using HFSC more than a year and
> > it never provides proper QoS on ATM/ADSL links; low delays can never be
> > achieved even with significant throttling below the h/w link bandwidth.
>
> Mhh .. I trust you picked a proper clocksource and timer frequency?

250Hz or 1000Hz and jiffies (for a SMP machine)
I don't mean I've got big delays though and those are lowest with HFSC, when 
link is satiated.

IMHO since ATM is sending small packets constantly thorough its link (ATM 
specific) encapsulated data defragmentation cannot simply be achieved (or it 
simply isn't done by modems). As a result one can't simply throttle qdisc 
basing on computations like 
UpperLimit = ( bandwidth*(1 - header_bits/frame_size )
having bandwidth specified by ATM link speed rate, as it can be reported by 
modem, and expect it won't ever be congested.
This seems to be main problem. Link asymmetry, window size etc don't affect 
line congestion and so delays that much I suppose.
The same with incorrect timer :-).

> I hacked up a patch yesterday. I want to do a bit more testing before
> posting it, but its hard to really test the effectiveness because even
> without this patch my DSL line already delivers higher throughput and
> much better delay than it should (its a throttled SDSL line sold as
> ADSL). I'll try to do some testing on an ethernet link now.

I'll follow this thread then.

regards

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14 12:06 ` jamal
  2006-06-14 12:55   ` Jesper Dangaard Brouer
@ 2006-06-14 15:32   ` Andy Furniss
  2006-06-20  0:54   ` Patrick McHardy
       [not found]   ` <1150287983.3246.27.camel@ras.pc.brisbane.lube>
  3 siblings, 0 replies; 56+ messages in thread
From: Andy Furniss @ 2006-06-14 15:32 UTC (permalink / raw)
  To: hadi
  Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, lartc, netdev,
	Stephen Hemminger

jamal wrote:
> I have taken linux-kernel off the list.
> 
> Russell's site is inaccessible to me (I actually think this is related
> to some DNS issues i may be having) and your masters is too long to
> spend 2 minutes and glean it; so heres a question or two for you:
> 
> - Have you tried to do a long-lived session such as a large FTP and 
> seen how far off the deviation was? That would provide some interesting
> data point.
> - To be a devil's advocate (and not claim there is no issue), where do
> you draw the line with "overhead"? 

Me and many others have run a smilar hack for years, there is also a 
userspace project still alive which does the same.

The difference is that without it I would need to sacrifice almost half 
my 288kbit atm/dsl showtime bandwidth to be sure of control.

With the modification I can run at 286kbit / 288 and know I will never 
have jitter worse than the bitrate latency of a mtu packet. The 286 
figure was choses to allow a full buffer to drain/ allow for timer 
innaccuracy etc. On a p200 with tsc, 2.6.12 it's never gone over for me 
- though talking of timers I notice on my desktop 2.6.16 I gain 2 
minutes a day now.

Andy.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14 12:06 ` jamal
  2006-06-14 12:55   ` Jesper Dangaard Brouer
  2006-06-14 15:32   ` Andy Furniss
@ 2006-06-20  0:54   ` Patrick McHardy
  2006-06-20 14:56     ` jamal
       [not found]   ` <1150287983.3246.27.camel@ras.pc.brisbane.lube>
  3 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-20  0:54 UTC (permalink / raw)
  To: hadi
  Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, lartc, netdev,
	Stephen Hemminger

jamal wrote:
> - For further reflection: Have you considered the case where the rate
> table has already been considered on some link speed in user space and
> then somewhere post-config the physical link speed changes? This would
> happen in the case where ethernet AN is involved and the partner makes
> some changes (use ethtool). 
> 
> I would say the last bullet is a more interesting problem than a corner
> case of some link layer technology that has high overhead.
> Your work would be more interesting if it was generic for many link
> layers instead of just ATM.

I've thought about this a couple of times, scaling the virtual clock
rate should be enough for "simple" qdiscs like TBF or HTB, which have
a linear relation between time and bandwidth. I haven't really thought
about the effects on HFSC yet, on a small scale the relation is
non-linear. But this is a different problem from trying to accomodate
for link-layer overhead.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-20  0:54   ` Patrick McHardy
@ 2006-06-20 14:56     ` jamal
  2006-06-20 15:09       ` Patrick McHardy
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-06-20 14:56 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk,
	Jesper Dangaard Brouer

On Tue, 2006-20-06 at 02:54 +0200, Patrick McHardy wrote:
> jamal wrote:
> > - For further reflection: Have you considered the case where the rate
> > table has already been considered on some link speed in user space and
> > then somewhere post-config the physical link speed changes? This would
> > happen in the case where ethernet AN is involved and the partner makes
> > some changes (use ethtool). 
> > 
[..]
> I've thought about this a couple of times, scaling the virtual clock
> rate should be enough for "simple" qdiscs like TBF or HTB, which have
> a linear relation between time and bandwidth. I haven't really thought
> about the effects on HFSC yet, on a small scale the relation is
> non-linear. 

Does HFSC not depend on bandwith? How is rate control achieved?

> But this is a different problem from trying to accomodate
> for link-layer overhead.
> 

Yes it is different issue.

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-20 14:56     ` jamal
@ 2006-06-20 15:09       ` Patrick McHardy
  2006-06-22 18:41         ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-20 15:09 UTC (permalink / raw)
  To: hadi
  Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk,
	Jesper Dangaard Brouer

jamal wrote:
> On Tue, 2006-20-06 at 02:54 +0200, Patrick McHardy wrote:
> 
>>jamal wrote:
>>
>>>- For further reflection: Have you considered the case where the rate
>>>table has already been considered on some link speed in user space and
>>>then somewhere post-config the physical link speed changes? This would
>>>happen in the case where ethernet AN is involved and the partner makes
>>>some changes (use ethtool). 
>>>
> 
> [..]
> 
>>I've thought about this a couple of times, scaling the virtual clock
>>rate should be enough for "simple" qdiscs like TBF or HTB, which have
>>a linear relation between time and bandwidth. I haven't really thought
>>about the effects on HFSC yet, on a small scale the relation is
>>non-linear. 
> 
> 
> Does HFSC not depend on bandwith? How is rate control achieved?

"Depend on bandwidth" is not the right term. All of TBF, HTB and HFSC
provide bandwidth per time, but with TBF and HTB the relation between
the amount of bandwidth is linear to the amount of time, with HFSC
it is only on a linear on larger scale since it uses service curves,
which are represented as two linear pieces. So you have bandwidth b1
for time t1, bandwidth b2 after that until eternity. By scaling the
clock rate you alter after how much time b2 kicks in, which affects
the guaranteed delays. The end result should be that both bandwidth
and delay scale up or down proportionally, but I'm not sure that this
is what HFSC would do in all cases (on small scale). But it should
be easy to answer with a bit more time for visualizing it.

The thing I'm not sure about is whether this wouldn't be handled better
by userspace, if the link layer speed changes you might not want
proportional scaling but prefer to still give a fixed amount of that
bandwidth to some class, for example VoIP traffic. Do we have netlink
notifications for link speed changes?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-20 15:09       ` Patrick McHardy
@ 2006-06-22 18:41         ` jamal
  2006-06-23 14:32           ` Patrick McHardy
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-06-22 18:41 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev,
	Stephen Hemminger

On Tue, 2006-20-06 at 17:09 +0200, Patrick McHardy wrote:
> jamal wrote:

> "Depend on bandwidth" is not the right term. All of TBF, HTB and HFSC
> provide bandwidth per time, but with TBF and HTB the relation between
> the amount of bandwidth is linear to the amount of time, with HFSC
> it is only on a linear on larger scale since it uses service curves,
> which are represented as two linear pieces. So you have bandwidth b1
> for time t1, bandwidth b2 after that until eternity. By scaling the
> clock rate you alter after how much time b2 kicks in, which affects
> the guaranteed delays. The end result should be that both bandwidth
> and delay scale up or down proportionally, but I'm not sure that this
> is what HFSC would do in all cases (on small scale). But it should
> be easy to answer with a bit more time for visualizing it.
> 

Ok, this makes things a little trickier though, no? 

> The thing I'm not sure about is whether this wouldn't be handled better
> by userspace, 

If you do it in user space you will need a daemon of some form; this is
my preference but it seems a lot of people hate daemons - the standard
claim is it is counter-usability. Such people are forgiving if you built
the daemon into the kernel as a thread. Perhaps the netcarrier that
Stefan Rompf has added could be extended to handle this)
Note, if you wanna do it right as well you will factor in other things
like some wireless technologies which changes their throughput
capability over a period of time ( A lot of these guys try to have their
own hardware level schedulers to compensate for this).

> if the link layer speed changes you might not want
> proportional scaling but prefer to still give a fixed amount of that
> bandwidth to some class, for example VoIP traffic. Do we have netlink
> notifications for link speed changes?

Not there at the moment - but we do emit event for other link layer
stuff like carrier on/off - so adding this should be trivial and a
reasonable thing to have; with a caveat: it will be link-layer specific;
so whoever ends up adding will have to be careful to make sure it is not
hard-coded to be specific to ethernet-like netdevices. It could probably
be reported together with link state as a TLV like ETHER_SPEED_CHANGED
which carries probably old speed and new speed
and maybe even reason why it changed.

cheers,
jamal

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-22 18:41         ` jamal
@ 2006-06-23 14:32           ` Patrick McHardy
  2006-06-24 14:39             ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-23 14:32 UTC (permalink / raw)
  To: hadi; +Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev,
	Stephen Hemminger

jamal wrote:
> On Tue, 2006-20-06 at 17:09 +0200, Patrick McHardy wrote:
> 
>>The thing I'm not sure about is whether this wouldn't be handled better
>>by userspace, 
> 
> 
> If you do it in user space you will need a daemon of some form; this is
> my preference but it seems a lot of people hate daemons - the standard
> claim is it is counter-usability. Such people are forgiving if you built
> the daemon into the kernel as a thread. Perhaps the netcarrier that
> Stefan Rompf has added could be extended to handle this)

I absolutely disagree, in my opinion we currently have too few
specialized daemons and a lot too much wild shell scripting with
seding, awking, grepping. I'm actually working on a daemon that
provides a small scripting language to express networking configuration
states and react to result set changes and state transitions.
Its far from complete, but already allows you to express things like
"on transition from { none, link flags !up } to { link flags up }
execute /etc/network/link/$LINK_NAME up" (same for routes, addresses,
basically all object types) or "for each link with flags lowerup,up
execute add-to-bridge.sh". The value of each expression is dumped
into the environment on execution, so you can comfortably use
$LINK_NAME or $LINK_MTU instead of having to cut it out the
"ip link list" output. Should be trivial to support link speed changes
once we have notifications for that.

>>if the link layer speed changes you might not want
>>proportional scaling but prefer to still give a fixed amount of that
>>bandwidth to some class, for example VoIP traffic. Do we have netlink
>>notifications for link speed changes?
> 
> 
> Not there at the moment - but we do emit event for other link layer
> stuff like carrier on/off - so adding this should be trivial and a
> reasonable thing to have; with a caveat: it will be link-layer specific;
> so whoever ends up adding will have to be careful to make sure it is not
> hard-coded to be specific to ethernet-like netdevices. It could probably
> be reported together with link state as a TLV like ETHER_SPEED_CHANGED
> which carries probably old speed and new speed
> and maybe even reason why it changed.

I don't think it should carry both old and new speed. Netlink
notifications usually provide a snapshot of the new state, but
no indication what changed, with one notable exception, the
ifi_change field, which IMO is a hack for lazy userspace. Since
notifications can get lost, userspace needs to resync occasionally.
The naiive approach (works for every other object) to determine if
the object state changed from my last known state is to compare
all attributes .. but the ifi_change field will be different
between notifications and dumps even if the object itself didn't
change. "Lazy userspace" because looking at ifi_change is obviously
only useful if it doesn't keep its last known state and tries to
derive the change from update notifications alone .. which means it
fails when notifications are lost.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-23 14:32           ` Patrick McHardy
@ 2006-06-24 14:39             ` jamal
  2006-06-26 11:21               ` Patrick McHardy
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-06-24 14:39 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev,
	Stephen Hemminger

On Fri, 2006-23-06 at 16:32 +0200, Patrick McHardy wrote:
> jamal wrote:

> > If you do it in user space you will need a daemon of some form; this is
> > my preference but it seems a lot of people hate daemons - the standard
> > claim is it is counter-usability. Such people are forgiving if you built
> > the daemon into the kernel as a thread. Perhaps the netcarrier that
> > Stefan Rompf has added could be extended to handle this)
> 
> I absolutely disagree, in my opinion we currently have too few
> specialized daemons and a lot too much wild shell scripting with
> seding, awking, grepping. 

Like i said i too prefer a daemon; but i have experienced a lot of
people who dont (especially in small embedded devices).

> I'm actually working on a daemon that
> provides a small scripting language to express networking configuration
> states and react to result set changes and state transitions.
> Its far from complete, but already allows you to express things like
> "on transition from { none, link flags !up } to { link flags up }
> execute /etc/network/link/$LINK_NAME up" (same for routes, addresses,
> basically all object types) or "for each link with flags lowerup,up
> execute add-to-bridge.sh". The value of each expression is dumped
> into the environment on execution, so you can comfortably use
> $LINK_NAME or $LINK_MTU instead of having to cut it out the
> "ip link list" output. Should be trivial to support link speed changes
> once we have notifications for that.
> 

cool - a neteventd


> I don't think it should carry both old and new speed. Netlink
> notifications usually provide a snapshot of the new state, but
> no indication what changed, with one notable exception, the
> ifi_change field, which IMO is a hack for lazy userspace. 

I am quiet fond of the ifi_change ;->

> Since
> notifications can get lost, userspace needs to resync occasionally.
> The naiive approach (works for every other object) to determine if
> the object state changed from my last known state is to compare
> all attributes ..

scalability issues abound when you have a gazillion things to look at.
There used or may still be a way to tell from looking at netlink socket
that an error occurred since last time - such as "a message was lost".
You could use that to tell a message was lost and do scanning only
then.  

>  but the ifi_change field will be different
> between notifications and dumps even if the object itself didn't
> change. "Lazy userspace" because looking at ifi_change is obviously
> only useful if it doesn't keep its last known state and tries to
> derive the change from update notifications alone .. which means it
> fails when notifications are lost.
> 

But thats not the real intent for it. 

cheers,
jamal

PS:- I will look at your other postings and respond later i have to run
for now.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-24 14:39             ` jamal
@ 2006-06-26 11:21               ` Patrick McHardy
  2006-06-27 13:01                 ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-26 11:21 UTC (permalink / raw)
  To: hadi; +Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev,
	Stephen Hemminger

jamal wrote:
> On Fri, 2006-23-06 at 16:32 +0200, Patrick McHardy wrote:
> 
>>I don't think it should carry both old and new speed. Netlink
>>notifications usually provide a snapshot of the new state, but
>>no indication what changed, with one notable exception, the
>>ifi_change field, which IMO is a hack for lazy userspace. 
> 
> 
> I am quiet fond of the ifi_change ;->
> 
>>Since
>>notifications can get lost, userspace needs to resync occasionally.
>>The naiive approach (works for every other object) to determine if
>>the object state changed from my last known state is to compare
>>all attributes ..
> 
> 
> scalability issues abound when you have a gazillion things to look at.
> There used or may still be a way to tell from looking at netlink socket
> that an error occurred since last time - such as "a message was lost".
> You could use that to tell a message was lost and do scanning only
> then.  

It returns -ENOBUFS on socket overrun. Without it netlink notifications
wouldn't be very useable as you couldn't figure out when you missed
some.

>> but the ifi_change field will be different
>>between notifications and dumps even if the object itself didn't
>>change. "Lazy userspace" because looking at ifi_change is obviously
>>only useful if it doesn't keep its last known state and tries to
>>derive the change from update notifications alone .. which means it
>>fails when notifications are lost.
>>
> 
> 
> But thats not the real intent for it. 

Then what is the intent, it doesn't carry any other information?
It includes information that are not available any other way from
the kernel, yet the information is not transmitted reliably. How
could a program that relies on this possibly work reliable?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-26 11:21               ` Patrick McHardy
@ 2006-06-27 13:01                 ` jamal
  2006-07-02  4:23                   ` Patrick McHardy
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-06-27 13:01 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Stephen Hemminger, netdev, russell-tcatm, hawk,
	Jesper Dangaard Brouer

On Mon, 2006-26-06 at 13:21 +0200, Patrick McHardy wrote:
> jamal wrote:

> > scalability issues abound when you have a gazillion things to look at.
> > There used or may still be a way to tell from looking at netlink socket
> > that an error occurred since last time - such as "a message was lost".
> > You could use that to tell a message was lost and do scanning only
> > then.  
> 
> It returns -ENOBUFS on socket overrun. Without it netlink notifications
> wouldn't be very useable as you couldn't figure out when you missed
> some.
> 

I think you are probably right; it is -ENOBUFS. I recall something along
the lines of a socket->error flag somewhere. I will go and snoop on
code.

> >> but the ifi_change field will be different
> >>between notifications and dumps even if the object itself didn't
> >>change. "Lazy userspace" because looking at ifi_change is obviously
> >>only useful if it doesn't keep its last known state and tries to
> >>derive the change from update notifications alone .. which means it
> >>fails when notifications are lost.
> >>
> > 
> > 
> > But thats not the real intent for it. 
> 
> Then what is the intent, it doesn't carry any other information?

Generally it is a filter to what the ifi_flags reflect.
>From an event architecture scalability perspective, it is useful to be
able to look at what changed without having to query the source. That is
what ifi_change provides.  So it is not "Lazy userspace" rather it is an
effective optimization.
You keep less state and you have less message exchanges between user and
kernel.

> It includes information that are not available any other way from
> the kernel, yet the information is not transmitted reliably. How
> could a program that relies on this possibly work reliable?

Ok, so first lets separate them as two different issues:

- Any message to user space may be lost whether it has ifi_change or
not. You need some way to figure out a message was lost and declare your
state may be invalid. The -ENOBUFs is one way to discover stale state.
- by looking at ifi_change i can tell what changed from about 10 
other things reflected in the flags. If you get an ENOBUFS in this case
(or any other), it means your state is not consistent and to have
reliable info you need to query the kernel.

Hope that makes sense.

cheers,
jamal



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-27 13:01                 ` jamal
@ 2006-07-02  4:23                   ` Patrick McHardy
  2006-07-02 13:59                     ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-07-02  4:23 UTC (permalink / raw)
  To: hadi; +Cc: Stephen Hemminger, netdev, russell-tcatm, hawk,
	Jesper Dangaard Brouer

Sorry, missed your reply so far.

>>Then what is the intent, it doesn't carry any other information?
> 
> 
> Generally it is a filter to what the ifi_flags reflect.
>>From an event architecture scalability perspective, it is useful to be
> able to look at what changed without having to query the source. That is
> what ifi_change provides.  So it is not "Lazy userspace" rather it is an
> effective optimization.
> You keep less state and you have less message exchanges between user and
> kernel.

It might have been, but the situation is a bit more complicated today,
and special case solutions (aka hacks) don't work very well for the
huge diversity of interest in update messages. Besides that, it is
flawed.

>>It includes information that are not available any other way from
>>the kernel, yet the information is not transmitted reliably. How
>>could a program that relies on this possibly work reliable?
> 
> 
> Ok, so first lets separate them as two different issues:
> 
> - Any message to user space may be lost whether it has ifi_change or
> not. You need some way to figure out a message was lost and declare your
> state may be invalid. The -ENOBUFs is one way to discover stale state.

Thats not the point. ifi_change doesn't inform you about updates, it
informs you about specific changes. And you can't determine those in
the case of lost notifications without keeping the old state yourself.
It may save you some time _comparing_ (not searching) in case it wasn't
lost, but it only contains a very limited set of information and nothing
saves you from searching the entry a message refers to based on the
primary keys (which unfortunately aren't even defined for every message
type, which means they're not even clearly defined inside the kernel.
routing rules are one example). Its a hack for some specific usage case,
and since you're defending this so hard, I'm naturally wondering if
you introduced it?

> - by looking at ifi_change i can tell what changed from about 10 
> other things reflected in the flags. If you get an ENOBUFS in this case
> (or any other), it means your state is not consistent and to have
> reliable info you need to query the kernel.

The flags only reflect a single piece of information, one that userspace
needs to be able to reconstruct anyway. Once again the point: netlink
notifications are _update_ messages, not state change messages. This is
a fundamental difference, misleading userspace (see tc monitor) to
believing it is something different for very limited usage cases does
nothing good. I hereby state my believe that 99% of netlink users
are broken.

> Hope that makes sense.

Me too :)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-02  4:23                   ` Patrick McHardy
@ 2006-07-02 13:59                     ` jamal
  0 siblings, 0 replies; 56+ messages in thread
From: jamal @ 2006-07-02 13:59 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev,
	Stephen Hemminger

On Sun, 2006-02-07 at 06:23 +0200, Patrick McHardy wrote:
[..]
> > - Any message to user space may be lost whether it has ifi_change or
> > not. You need some way to figure out a message was lost and declare your
> > state may be invalid. The -ENOBUFs is one way to discover stale state.
> 
> Thats not the point. ifi_change doesn't inform you about updates, it
> informs you about specific changes. 

yes, 

> And you can't determine those in
> the case of lost notifications without keeping the old state yourself.

nod,

> It may save you some time _comparing_ (not searching) in case it wasn't
> lost, 

This is one of the points of where it has value.

Take a hypothetical example of when you have a gazillion events/sec
happening, this is an exageration in case of ifi_change, just assume it
so for illustration purposes. Assume no messages are lost. Assume you
are only interested in the netdevice becoming  promisc or un-promisc.
In such a case, you can quickly check if among the vector of events
that this just happened or not.

> but it only contains a very limited set of information and nothing
> saves you from searching the entry a message refers to based on the
> primary keys (which unfortunately aren't even defined for every message
> type, which means they're not even clearly defined inside the kernel.
> routing rules are one example). 

In some cases it does provide you with details - example in the case of
booleans where it will tell you something changed from 0->1 or 1->0.
Actually the promisc case i gave is a bad example because it falls under
such a category although i hope it still made the point i was trying to
make. 

But yes in general, it is just supposed to provide you with info of
"dude, you got a message from X and Y" - which is more useful than the
select() "dude, you got a message". You have then to go and read that
message if you are interested or if you are busy you can go later.

> Its a hack for some specific usage case,
> and since you're defending this so hard, I'm naturally wondering if
> you introduced it?
> 

No, i didnt - it has been there for years; it may/must have been Alexey
who introduced it.

The reason i am defending it is i find it useful and it is a classical
approach in event management; i have seen it used in a lot of middleware
as an optimization. When you have a vector of events that can be
reported, then as optimization you dont waste resources by sending
details about the event in case nobody is interested in hearing about
the details or is ready to process them.

BTW,theres a consistency bug/hack at the moment in iproute2 on its
usage.

> > - by looking at ifi_change i can tell what changed from about 10 
> > other things reflected in the flags. If you get an ENOBUFS in this case
> > (or any other), it means your state is not consistent and to have
> > reliable info you need to query the kernel.
> 
> The flags only reflect a single piece of information, one that userspace
> needs to be able to reconstruct anyway. Once again the point: netlink
> notifications are _update_ messages, not state change messages. This is
> a fundamental difference, misleading userspace (see tc monitor) to
> believing it is something different for very limited usage cases does
> nothing good. I hereby state my believe that 99% of netlink users
> are broken.
> 

yes, it would mislead users who dont use it appropriately such as ip mon
but that doesnt make it less valuable. I thought ip mon would bitch if
it received enobufs? 

cheers,
jamal

^ permalink raw reply	[flat|nested] 56+ messages in thread

[parent not found: <1150287983.3246.27.camel@ras.pc.brisbane.lube>]

[parent not found: <1150292693.5197.1.camel@jzny2>]

[parent not found: <1150843471.17455.2.camel@ras.pc.brisbane.lube>]

[parent not found: <15653CE98281AD4FBD7F70BCEE3666E53CD54A@comxexch01.comx.local>]

[parent not found: <1151000966.5392.34.camel@jzny2>]

* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
       [not found]           ` <1151000966.5392.34.camel@jzny2>
@ 2006-06-23 12:37             ` Russell Stuart
  2006-06-23 15:21               ` Patrick McHardy
  2006-06-24 14:13               ` jamal
  0 siblings, 2 replies; 56+ messages in thread
From: Russell Stuart @ 2006-06-23 12:37 UTC (permalink / raw)
  To: hadi
  Cc: Patrick McHardy, Alan Cox, Stephen Hemminger, netdev,
	Jesper Dangaard Brouer

On Thu, 2006-06-22 at 14:29 -0400, jamal wrote: 
> Russell,
> 
> I did look at what you sent me and somewhere in those discussions i
> argue that the changes compensate to make the rate be a goodput
> instead of advertised throughput.

I did see that, but didn't realise you were responding to 
me.  A lot of discussion has gone on since and evidently 
quite a bit of which was addressed to me.  I will try to 
answer the some of the points.   Sorry for the digest 
like reply :(

On Wed, 2006-06-14 at 11:57 +0100, Alan Cox wrote:
> I'm 
> not sure if that matters but for modern processors I'm also sceptical
> that the clever computation is actually any faster than just doing the
> maths, especially if something cache intensive is also running.

Assuming you are referring to the rate tables - I hadn't
thought about it, but I guess I would agree.   However, this 
patch wasn't trying to radically re-engineer the traffic 
control engines rate calculation code.  Quite the reverse I -
was was trying to change it as little as possible.  The kernel 
part of the patch actually only introduced one small change - 
the optional addition of a constant the packet length.

On Thu, 2006-06-15 at 08:57 -0400, jamal wrote: 
> But i dont think it is ACKs perse that you or Russell are contending
> cause these issues. It's the presence of ATM . And all evidence seems to
> point to the fact that ISPs bill you for something other than your
> point of view, no?

I don't know about anywhere else, but certainly here in
Australia some ISP's creative in how they advertise their
link speeds.  Again that is not the issue we were trying 
to address with the patch.

On Thu, 2006-06-15 at 08:57 -0400, jamal wrote: 
> You are still speaking ATM (and the above may still be valid), but: 
> Could you for example look at the netdevice->type and from that figure
> out the link layer overhead and compensate for it.

As others have pointed out, this doesn't work for the ADSL 
user.  An ADSL modem is connected to the box using either 
ethernet, wireless or USB.

On Thu, 2006-06-15 at 09:03 -0400, jamal wrote: 
> It is probably doable by just looking at netdevice->type and figuring
> the link layer technology. Totally in user space and building the
> compensated for tables there before telling the kernel (advantage is no
> kernel changes and therefore it would work with older kernels as well).

Others have had this same thought, and have spent time trying
to come up with a user space only solution.  They failed because 
it isn't possible.  To understand why see this thread:

  http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html

Also, the user space patch does improve the performance of 
older kernels (ie unpatched kernels).  Rather than getting 
the rate wrong 99.9% of the time, older kernels only get it 
wrong 14% of the time, on average.

On Tue, 2006-06-20 at 03:04 +0200, Patrick McHardy wrote: 
> What about qdiscs like SFQ (which uses the packet size in quantum
> calculations)? I guess it would make sense to use the wire-length
> there as well.

Being pedantic, SQF automatically assigns traffic to classes 
and gives each class an equal share of the available bandwidth.  
As I am sure you are aware SQF's trick is that it randomly 
changes its classification algorithm - every second in the Linux 
implementation.  If there are errors in rate calculation this 
randomisation will ensure they are distributed equally between 
the classes as time goes on.  So no, accurate packets sizes are 
not that important to SQF.

But they are important to many other qdiscs, and I am sure 
that was your point.  SQF just happened to be a bad example.

On Tue, 2006-06-20 at 10:06 -0400, jamal wrote:
> What this means is that Linux computes based on ethernet
> headers. Somewhere downstream ATM (refer to above) comes in and that
> causes mismatch in what Linux expects to be the bandwidth and what
> your service provider who doesnt account for the ATM overhead when
> they sell you "1.5Mbps".
> Reminds me of hard disk vendors who define 1K to be 1000 to show
> how large their drives are.
> Yes, Linux cant tell if your service provider is lying to you.

No, it can't.  But you can measure the bandwidth you are 
getting from your ISP and plug that into the tc command 
line.  The web page I sent to you describes how to do this
for ADSL lines.

On Tue, 2006-06-20 at 10:06 -0400, jamal wrote:
> > On Mon, 2006-19-06 at 21:31 +0200, Jesper Dangaard Brouer wrote:
> > The issue here is, that ATM does not have fixed overhead (due to alignment 
> > and padding).  This means that a fixed reduction of the bandwidth is not 
> > the solution.  We could reduce the bandwidth to the worst-case overhead, 
> > which is 62%, I do not think that is a good solution...
> > 
> 
> I dont see it as wrong to be honest with you. Your mileage may vary.

Jamal am I reading this correctly?  Did you just say that you 
don't see having to reduce your available bandwidth by 62% to 
take account of deficiencies in Linux traffic engine as wrong?  
Why on earth would you say that?

On Tue, 2006-06-20 at 10:06 -0400, jamal wrote:
> Dont have time to read your doc and dont get me wrong, there is a
> "quark" practical problem: As practical as the hard disk manufacturer
> who claims that they have 11G drive when it is 10G.

This reads like we don't see the same problem in the same way.
Your disk example is a 10% error that effects less savvy users.
The ATM problem we are trying to address effects a big chunk of 
all Linux's traffic control users.  (Big chunk as counted by
boxes, not bytes.)

Something like 60% of all broadband connections use ADSL.  Most 
of the remainder live in the US and use cable. Or at least so 
says this web page:
  http://tinyurl.com/pydnj
Extrapolating from that, I think it is safe to say fair chunk
of all people using the Linux Traffic Control engine use ADSL,
and thus may benefit from this patch.

Now it is true that right now these people may not see a great
benefit from the patch.  Those that will are divided into two
categories:

1.  Those that saturate their upstream bandwidth.  This isn't
    hard to do on ADSL, due to its first letter.  It effects
    people who use run web sites, email lists - which is bugger
    all, and those who play games or run P2P - which is most
    home users.

2.  Those that use Voip.  Again there aren't many people who do
    this right now, but that will change.  Its not hard to 
    envisage a future where real time streaming like this will
    come to dominate Internet traffic.  Voip effects the other
    major group of users out there - business.

Ergo I believe that in the long term the patch will benefit a
lot of people.  The next argument is how much it will benefit
them.

It turns out that the patch is only useful if you have some
small packets that MUST have priority on the ADSL link. 
Jesper's traffic was TCP ACK's (he was addressing problem 1) 
and mine was VOIP traffic.  This would seem a trivial problem 
to solve with Linux's traffic control engine.  I don't know 
what path Jesper took - but I tried using it in the obvious 
fashion and it didn't work.  A couple of large emails would 
take out an office's phone system.  It took me days of head 
scratching to figure out why.

The cause was ADSL using ATM as a carrier.  In my case I was 
using approx 110 byte packets.  Do the sums.  It takes 3 ATM 
cells to carry an 110 byte packets.  That is 159 bytes.  A
50% error.  That meant the ISP was doing the traffic control, 
and he wasn't prioritising VOIP traffic.  Sure, you can 
optimise the values  you pass to tc for 110 byte packets.  But 
then it fails miserably for a other packet sizes; such as 
a different VOIP codec, or TCP acks. The only solution is to 
understate your available bandwidth by at least a 1/3rd.  I 
hope you don't consider that acceptable.

The reason this patch wasn't thought of until now is that
large packets don't see much benefit.  For similar packet
sizes the maximum error is determined by the ATM cell size 
(you can be +/- one ATM cell) and that is 53 bytes.  This 
means on packets around MTU size the error is 53/1500 = 3.5%.  
Hardly worth worrying about.  For traditional Internet usage, 
ie the one ADSL was designed for, the upstream channel, ie 
the one carrying the TCP ACKS, was rarely saturated.  The 
speed was limited by the downstream channel - the one 
carrying MTU sized packets.

So in summary - no, Jamal, I see no correspondence between 
your 10/11Gb hard drives example and this patch.

On Tue, 2006-06-20 at 10:06 -0400, jamal wrote:
> It needs to be
> resolved - but not in an intrusive way in my opinion.

To be honest, I didn't think the patch was that intrusive.
It adds an optional constant to the skb->len.  Hardly earth
shattering.

On Tue, 2006-06-20 at 16:45 +0200, Patrick McHardy wrote: 
> Handling all qdiscs would mean adding a pointer to a mapping table
> to struct net_device and using something like "skb_wire_len(skb, dev)"
> instead of skb->len in the queueing layer. That of course doesn't
> mean that we can't still provide pre-adjusted ratetables for qdiscs
> that use them.

Yes, that would work well, and is probably how it should of
been done when the kernel stuff was originally written.  As 
it happens Jesper's original solution was closer to this.  The 
reason we choose not to go that way it is would change the 
kernel-userspace API.   The current patch solves the problem 
and works well as possible on all kernel versions - both 
patched and unpatched.

Now that I think about to change things the way you suggest
here does seem simple enough.  But it probably belongs in a 
different patch.  We wrote this patch to fix a specific problem 
with ATM links, and it should succeed or fail on the merits 
of doing that.  Cleaning up the kernel code to do what you 
suggest is a different issue.  Let whether it to should be 
done, or not, be based on its own merits.

On Tue, 2006-06-20 at 11:38 -0400, jamal wrote: 
> The issue is really is whether Linux should be interested in the
> throughput it is told about or the goodput (also known as effective
> throughput) the service provider offers. Two different issues by
> definition. 
<snip>
On Thu, 2006-06-22 at 14:29 -0400, jamal wrote:
> I did look at what you sent me and somewhere in those discussions i
> argue that the changes compensate to make the rate be a goodput
> instead of advertised throughput. Throughput is typically what 
> schedulers work with and is typically to what is on the wire.
> Goodput tends to be end-to-end; so somewhere down the road ATM
> "reduces" the goodput but not the throughput.
> I am actaully just fine with telling the scheduler you have less
> throughput than what your ISP is telling you. I am also
> not against a generic change as long as it is non-intrusive because i
> believe this is a practical issue and Patrick Mchardy says he can
> deliver such a patch.

I have read your throughput versus goodput thing a couple of
times, and I'm sorry - I don't understand.  What is it you
would like us to achieve?

As for the patch being invasive, it changes 37 lines of 
kernel code.  No other suggestion I have seen here will be 
that small.

If making the patch generic, ie allowing it to handle cell 
sizes other than ATM, then let me know I will make the
change on the weekend.  It is just a user space change.

One final point: if you are happy with an invasive patch that
changes the world, I have a suggestion.  Modularise the rate
calculation function.  We have qdisc modules, filter modules
and whatnot - so add another type.  Rate calculation.  The
current system can become the default rate calculation module
if none is specified.  Patrick can have his system, and Alan
can have his.  And we can add an ATM one.  If you wish, I can
(with Jespers help, I hope) re-do the patch in that style,
producing the default one and an ATM one.  My personal
preference though would be to put this patch in, and then
let this new idea stand or fall on its own merits.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-23 12:37             ` Russell Stuart
@ 2006-06-23 15:21               ` Patrick McHardy
  2006-06-26  0:45                 ` Russell Stuart
  2006-06-24 14:13               ` jamal
  1 sibling, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-23 15:21 UTC (permalink / raw)
  To: Russell Stuart
  Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer

Russell Stuart wrote:
> On Thu, 2006-06-22 at 14:29 -0400, jamal wrote: 
> 

> On Tue, 2006-06-20 at 03:04 +0200, Patrick McHardy wrote: 
> 
>>What about qdiscs like SFQ (which uses the packet size in quantum
>>calculations)? I guess it would make sense to use the wire-length
>>there as well.
> 
> 
> Being pedantic, SQF automatically assigns traffic to classes 
> and gives each class an equal share of the available bandwidth.  
> As I am sure you are aware SQF's trick is that it randomly 
> changes its classification algorithm - every second in the Linux 
> implementation.  If there are errors in rate calculation this 
> randomisation will ensure they are distributed equally between 
> the classes as time goes on.  So no, accurate packets sizes are 
> not that important to SQF.
> 
> But they are important to many other qdiscs, and I am sure 
> that was your point.  SQF just happened to be a bad example.

Not really. The randomization doesn't happen by default, but it doesn't
influence this anyway. SFQ allows flows to send up to "quantum" bytes
at a time before moving on to the next one. A flow that sends 75 * 20
byte will in the eyes of SFQ use 1500bytes, on the (ethernet) wire it
needs 4800bytes. A flow that sents 1500byte packets will only need
1504 bytes on the wire, but will be treated equally. So it does make
a different for SFQ.

> On Tue, 2006-06-20 at 16:45 +0200, Patrick McHardy wrote: 
> 
>>Handling all qdiscs would mean adding a pointer to a mapping table
>>to struct net_device and using something like "skb_wire_len(skb, dev)"
>>instead of skb->len in the queueing layer. That of course doesn't
>>mean that we can't still provide pre-adjusted ratetables for qdiscs
>>that use them.
> 
> 
> Yes, that would work well, and is probably how it should of
> been done when the kernel stuff was originally written.  As 
> it happens Jesper's original solution was closer to this.  The 
> reason we choose not to go that way it is would change the 
> kernel-userspace API.   The current patch solves the problem 
> and works well as possible on all kernel versions - both 
> patched and unpatched.

Not a problem as long as the new stuff doesn't break anything existing.
My patch introduces a TCA_STAB (for size table), similar to the _RTAB
attributes. Old iproute with new kernel and new iproute with old kernel
both work fine.

> Now that I think about to change things the way you suggest
> here does seem simple enough.  But it probably belongs in a 
> different patch.  We wrote this patch to fix a specific problem 
> with ATM links, and it should succeed or fail on the merits 
> of doing that.  Cleaning up the kernel code to do what you 
> suggest is a different issue.  Let whether it to should be 
> done, or not, be based on its own merits.

Its not about cleanup, its about providing the same capabilities
to all qdiscs instead of just a few selected ones and generalizing
it so it is also usable for non-ATM overhead calculations.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-23 15:21               ` Patrick McHardy
@ 2006-06-26  0:45                 ` Russell Stuart
  2006-06-26 11:10                   ` Patrick McHardy
  0 siblings, 1 reply; 56+ messages in thread
From: Russell Stuart @ 2006-06-26  0:45 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer

On Fri, 2006-06-23 at 17:21 +0200, Patrick McHardy wrote: 
> Not really. The randomization doesn't happen by default, but it doesn't
> influence this anyway. SFQ allows flows to send up to "quantum" bytes
> at a time before moving on to the next one. A flow that sends 75 * 20
> byte will in the eyes of SFQ use 1500bytes, on the (ethernet) wire it
> needs 4800bytes. A flow that sents 1500byte packets will only need
> 1504 bytes on the wire, but will be treated equally. So it does make
> a different for SFQ.

I hadn't even thought to check.  My bad.  The S in SFQ stands 
for stochastic, so something that does without randomisation 
the algorithm implemented couldn't really be called SFQ - 
particularly as it weakens the algorithm considerably.  I 
hope that most users do specify a perturb.

Your 20 byte example is hardly realistic.  skb->len includes 
the 14 byte ethernet header, so there is a total of 6 data 
bytes in a 20 byte packet.  The IP header alone is 20 bytes.  
TCP as implemented on Linux adds another 32 bytes (20 + the 
rtt option).  In other words I agree with Jamal's comments 
elsewhere - optimising for MPU sized packets doesn't seem 
like a win.

> Not a problem as long as the new stuff doesn't break anything existing.
> My patch introduces a TCA_STAB (for size table), similar to the _RTAB
> attributes. Old iproute with new kernel and new iproute with old kernel
> both work fine.

OK, good.

> Its not about cleanup, its about providing the same capabilities
> to all qdiscs instead of just a few selected ones and generalizing
> it so it is also usable for non-ATM overhead calculations.

Perhaps I chose my words poorly.

My intent was to contrast the size and goals of the two
proposed patches.  The ATM patch is a 37 line patch.  It 
includes some minor cleanups.  From the pseudo code you 
have posted what you are proposing is a more ambitious and 
much larger patch that moves a chunk of user space code 
into the kernel.  I am a complete newbie when it comes to 
getting code into the kernel, but that strikes me as 
contentious.  I would rather not have the ATM patch 
depend on it.

By the by, here are a couple of observations:

1.  The entries in the current rtab are already very closely 
    related to packet lengths.  They are actually the packet
    length multiplied by a constant that converts the units
    from "bytes" to "jiffies".  The constant is the same for
    all entries in the table.

2.  As such, the current rtab could already be used by SFQ
    and any other qdisc that needs to know the packet length.
    That SFQ doesn't do this is probably because it doesn't
    effect its performance overly.

3.  Be that as it may, the current RTAB isn't in the most
    convenient form for SFQ, and I am guessing it is in a 
    very inconvenient form for HFSC.  Adding a new version 
    that is identical except that it contains the raw packet 
    length would be a simple change.  In that format it
    could be used by all qdiscs.  The users of the existing
    rtab would have to do the multiplication that converts
    the packet length to jiffies in the kernel.  This means
    the conceptually at least, should the gootput change
    you need to change this one constant, not the entire
    table.

4.  Much as you seem to dislike having the rate / packet length
    calculations in user space, having them there makes it easy 
    to add new technologies such as ATM.  You just have to 
    change a user space tool - not the kernel.

5.  We still did have to modify the kernel for ATM.  That was
    because of its rather unusual characteristics.  However,
    it you look at the size of modifications made to the kernel
    verses the size made to the user space tool, (37 lines
    versus 303 lines,) the bulk of the work was does in user
    space.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-26  0:45                 ` Russell Stuart
@ 2006-06-26 11:10                   ` Patrick McHardy
  2006-06-27  6:19                     ` Russell Stuart
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-06-26 11:10 UTC (permalink / raw)
  To: Russell Stuart
  Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer

Russell Stuart wrote:
> On Fri, 2006-06-23 at 17:21 +0200, Patrick McHardy wrote: 
> 
>>Not really. The randomization doesn't happen by default, but it doesn't
>>influence this anyway. SFQ allows flows to send up to "quantum" bytes
>>at a time before moving on to the next one. A flow that sends 75 * 20
>>byte will in the eyes of SFQ use 1500bytes, on the (ethernet) wire it
>>needs 4800bytes. A flow that sents 1500byte packets will only need
>>1504 bytes on the wire, but will be treated equally. So it does make
>>a different for SFQ.
> 
> 
> I hadn't even thought to check.  My bad.  The S in SFQ stands 
> for stochastic, so something that does without randomisation 
> the algorithm implemented couldn't really be called SFQ - 
> particularly as it weakens the algorithm considerably.  I 
> hope that most users do specify a perturb.

Its not as great as you think. Changing hash-functions on the
fly causes reordering for non-idle flows when bucket-lengths
aren't distributed even. I never use it.

> Your 20 byte example is hardly realistic.  skb->len includes 
> the 14 byte ethernet header, so there is a total of 6 data 
> bytes in a 20 byte packet.  The IP header alone is 20 bytes.  
> TCP as implemented on Linux adds another 32 bytes (20 + the 
> rtt option).  In other words I agree with Jamal's comments 
> elsewhere - optimising for MPU sized packets doesn't seem 
> like a win.

The point is that SFQ does care about packet sizes, and this is
true for both MPU-sized and other packets.

>>Its not about cleanup, its about providing the same capabilities
>>to all qdiscs instead of just a few selected ones and generalizing
>>it so it is also usable for non-ATM overhead calculations.
> 
> 
> Perhaps I chose my words poorly.
> 
> My intent was to contrast the size and goals of the two
> proposed patches.  The ATM patch is a 37 line patch.  It 
> includes some minor cleanups.  From the pseudo code you 
> have posted what you are proposing is a more ambitious and 
> much larger patch that moves a chunk of user space code 
> into the kernel.  I am a complete newbie when it comes to 
> getting code into the kernel, but that strikes me as 
> contentious.  I would rather not have the ATM patch 
> depend on it.
> 
> By the by, here are a couple of observations:
> 
> 1.  The entries in the current rtab are already very closely 
>     related to packet lengths.  They are actually the packet
>     length multiplied by a constant that converts the units
>     from "bytes" to "jiffies".  The constant is the same for
>     all entries in the table.
> 
> 2.  As such, the current rtab could already be used by SFQ
>     and any other qdisc that needs to know the packet length.
>     That SFQ doesn't do this is probably because it doesn't
>     effect its performance overly.

The rtab includes the transmission time, which is related to,
but still is something different than the length. You can't
calculate the transmission time without a rate, which is
not needed otherwise for SFQ for example. The way rtabs
are used also needs more space, my current size tables
only use as much space as needed, which is 16 entries for
ethernet instead of 256.

> 3.  Be that as it may, the current RTAB isn't in the most
>     convenient form for SFQ, and I am guessing it is in a 
>     very inconvenient form for HFSC.  Adding a new version 
>     that is identical except that it contains the raw packet 
>     length would be a simple change.  In that format it
>     could be used by all qdiscs.  The users of the existing
>     rtab would have to do the multiplication that converts
>     the packet length to jiffies in the kernel.  This means
>     the conceptually at least, should the gootput change
>     you need to change this one constant, not the entire
>     table.

Well, I don't care much whether we use rtabs or something new,
but rtabs are meant for something different and as such are not
optimally suited for this.

> 4.  Much as you seem to dislike having the rate / packet length
>     calculations in user space, having them there makes it easy 
>     to add new technologies such as ATM.  You just have to 
>     change a user space tool - not the kernel.

I don't dislike it for beeing in userspace, I dislike it for
a) beeing used for this since it only covers TBF-based qdiscs
b) beeing used for an ATM "special case" which is not special
   at all.

I should also note that my tables don't come out of the random
generator but are provided by userspace as well. Unless the
mechanism is unable to express the needs of a particular link
layer type all you have to do is to provide a different set
of values without touching any code at all.

> 5.  We still did have to modify the kernel for ATM.  That was
>     because of its rather unusual characteristics.  However,
>     it you look at the size of modifications made to the kernel
>     verses the size made to the user space tool, (37 lines
>     versus 303 lines,) the bulk of the work was does in user
>     space.

I'm sorry, but arguing that a limited special case solution is
better because it needs slightly less code is just not reasonable.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-26 11:10                   ` Patrick McHardy
@ 2006-06-27  6:19                     ` Russell Stuart
  2006-06-27 17:18                       ` Patrick McHardy
  2006-07-04 13:29                       ` Patrick McHardy
  0 siblings, 2 replies; 56+ messages in thread
From: Russell Stuart @ 2006-06-27  6:19 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Russell Stuart, hadi, Alan Cox, Stephen Hemminger, netdev,
	Jesper Dangaard Brouer

On 26/06/2006 9:10 PM, Patrick McHardy wrote:
>>5.  We still did have to modify the kernel for ATM.  That was
>>    because of its rather unusual characteristics.  However,
>>    it you look at the size of modifications made to the kernel
>>    verses the size made to the user space tool, (37 lines
>>    versus 303 lines,) the bulk of the work was does in user
>>    space.
> 
> I'm sorry, but arguing that a limited special case solution is
> better because it needs slightly less code is just not reasonable.

Without seeing your actual proposal it is difficult to
judge whether this is a reasonable trade-off or not.
Hopefully we will see your code soon.  Do you have any
idea when?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-27  6:19                     ` Russell Stuart
@ 2006-06-27 17:18                       ` Patrick McHardy
  2006-07-04 13:29                       ` Patrick McHardy
  1 sibling, 0 replies; 56+ messages in thread
From: Patrick McHardy @ 2006-06-27 17:18 UTC (permalink / raw)
  To: Russell Stuart
  Cc: Russell Stuart, hadi, Alan Cox, Stephen Hemminger, netdev,
	Jesper Dangaard Brouer

Russell Stuart wrote:
> Without seeing your actual proposal it is difficult to
> judge whether this is a reasonable trade-off or not.
> Hopefully we will see your code soon.  Do you have any
> idea when?

Probably not today, I'll try to get it into shape until tomomorrow.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-27  6:19                     ` Russell Stuart
  2006-06-27 17:18                       ` Patrick McHardy
@ 2006-07-04 13:29                       ` Patrick McHardy
  2006-07-04 19:29                         ` jamal
  2006-07-06  0:39                         ` Russell Stuart
  1 sibling, 2 replies; 56+ messages in thread
From: Patrick McHardy @ 2006-07-04 13:29 UTC (permalink / raw)
  To: Russell Stuart
  Cc: Russell Stuart, hadi, Alan Cox, Stephen Hemminger, netdev,
	Jesper Dangaard Brouer

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

Russell Stuart wrote:
> On 26/06/2006 9:10 PM, Patrick McHardy wrote:
> 
>>> 5.  We still did have to modify the kernel for ATM.  That was
>>>    because of its rather unusual characteristics.  However,
>>>    it you look at the size of modifications made to the kernel
>>>    verses the size made to the user space tool, (37 lines
>>>    versus 303 lines,) the bulk of the work was does in user
>>>    space.
>>
>>
>> I'm sorry, but arguing that a limited special case solution is
>> better because it needs slightly less code is just not reasonable.
> 
> 
> Without seeing your actual proposal it is difficult to
> judge whether this is a reasonable trade-off or not.
> Hopefully we will see your code soon.  Do you have any
> idea when?

Unfortunately I still didn't got to cleaning them up, so I'm sending
them in their preliminary state. Its not much that is missing, but
the netem usage of skb->cb needs to be integrated better, I failed
to move it to the qdisc_skb_cb so far because of circular includes.
But nothing unfixable. I'm mostly interested if the current size-tables
can express what you need for ATM, I wasn't able to understand the
big comment in tc_core.c in your patch.


[-- Attachment #2: 01.diff --]
[-- Type: text/plain, Size: 13169 bytes --]

[NET_SCHED]: Add accessor function for packet length for qdiscs

Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit 2a6508576111d82246ee018edbcc4b0f0d18acad
tree 8be27ab6040ea90ed11728763e5b8fcf9e221b67
parent 31304c909e6945b005af62cd55a582e9c010a0b4
author Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:01 +0200
committer Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:01 +0200

 include/net/sch_generic.h |    9 +++++++--
 net/sched/sch_atm.c       |    4 ++--
 net/sched/sch_cbq.c       |   12 ++++++------
 net/sched/sch_dsmark.c    |    2 +-
 net/sched/sch_fifo.c      |    2 +-
 net/sched/sch_gred.c      |   12 ++++++------
 net/sched/sch_hfsc.c      |    8 ++++----
 net/sched/sch_htb.c       |    8 ++++----
 net/sched/sch_netem.c     |    6 +++---
 net/sched/sch_prio.c      |    2 +-
 net/sched/sch_red.c       |    2 +-
 net/sched/sch_sfq.c       |   14 +++++++-------
 net/sched/sch_tbf.c       |    6 +++---
 net/sched/sch_teql.c      |    4 ++--
 14 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index b0e9108..75d7a55 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -184,12 +184,17 @@ tcf_destroy(struct tcf_proto *tp)
 	kfree(tp);
 }
 
+static inline unsigned int qdisc_tx_len(struct sk_buff *skb)
+{
+	return skb->len;
+}
+
 static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
 				       struct sk_buff_head *list)
 {
 	__skb_queue_tail(list, skb);
-	sch->qstats.backlog += skb->len;
-	sch->bstats.bytes += skb->len;
+	sch->qstats.backlog += qdisc_tx_len(skb);
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
 
 	return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index dbf44da..4df305e 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -453,9 +453,9 @@ #endif
 		if (flow) flow->qstats.drops++;
 		return ret;
 	}
-	sch->bstats.bytes += skb->len;
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
-	flow->bstats.bytes += skb->len;
+	flow->bstats.bytes += qdisc_tx_len(skb);
 	flow->bstats.packets++;
 	/*
 	 * Okay, this may seem weird. We pretend we've dropped the packet if
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 80b7f6a..5d705e2 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -404,7 +404,7 @@ static int
 cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct cbq_sched_data *q = qdisc_priv(sch);
-	int len = skb->len;
+	int len = qdisc_tx_len(skb);
 	int ret;
 	struct cbq_class *cl = cbq_classify(skb, sch, &ret);
 
@@ -688,7 +688,7 @@ #ifdef CONFIG_NET_CLS_POLICE
 
 static int cbq_reshape_fail(struct sk_buff *skb, struct Qdisc *child)
 {
-	int len = skb->len;
+	int len = qdisc_tx_len(skb);
 	struct Qdisc *sch = child->__parent;
 	struct cbq_sched_data *q = qdisc_priv(sch);
 	struct cbq_class *cl = q->rx_class;
@@ -915,7 +915,7 @@ cbq_dequeue_prio(struct Qdisc *sch, int 
 			if (skb == NULL)
 				goto skip_class;
 
-			cl->deficit -= skb->len;
+			cl->deficit -= qdisc_tx_len(skb);
 			q->tx_class = cl;
 			q->tx_borrowed = borrow;
 			if (borrow != cl) {
@@ -923,11 +923,11 @@ #ifndef CBQ_XSTATS_BORROWS_BYTES
 				borrow->xstats.borrows++;
 				cl->xstats.borrows++;
 #else
-				borrow->xstats.borrows += skb->len;
-				cl->xstats.borrows += skb->len;
+				borrow->xstats.borrows += qdisc_tx_len(skb);
+				cl->xstats.borrows += qdisc_tx_len(skb);
 #endif
 			}
-			q->tx_len = skb->len;
+			q->tx_len = qdisc_tx_len(skb);
 
 			if (cl->deficit <= 0) {
 				q->active[prio] = cl;
diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c
index 11c8a21..53346c6 100644
--- a/net/sched/sch_dsmark.c
+++ b/net/sched/sch_dsmark.c
@@ -265,7 +265,7 @@ #endif
 		return err;
 	}
 
-	sch->bstats.bytes += skb->len;
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
 	sch->q.qlen++;
 
diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index c2689f4..ec99321 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -28,7 +28,7 @@ static int bfifo_enqueue(struct sk_buff 
 {
 	struct fifo_sched_data *q = qdisc_priv(sch);
 
-	if (likely(sch->qstats.backlog + skb->len <= q->limit))
+	if (likely(sch->qstats.backlog + qdisc_tx_len(skb) <= q->limit))
 		return qdisc_enqueue_tail(skb, sch);
 
 	return qdisc_reshape_fail(skb, sch);
diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c
index 0cafdd5..f0bf5d7 100644
--- a/net/sched/sch_gred.c
+++ b/net/sched/sch_gred.c
@@ -189,7 +189,7 @@ static int gred_enqueue(struct sk_buff *
 	}
 
 	q->packetsin++;
-	q->bytesin += skb->len;
+	q->bytesin += qdisc_tx_len(skb);
 
 	if (gred_wred_mode(t))
 		gred_load_wred_set(t, q);
@@ -227,8 +227,8 @@ static int gred_enqueue(struct sk_buff *
 			break;
 	}
 
-	if (q->backlog + skb->len <= q->limit) {
-		q->backlog += skb->len;
+	if (q->backlog + qdisc_tx_len(skb) <= q->limit) {
+		q->backlog += qdisc_tx_len(skb);
 		return qdisc_enqueue_tail(skb, sch);
 	}
 
@@ -255,7 +255,7 @@ static int gred_requeue(struct sk_buff *
 	} else {
 		if (red_is_idling(&q->parms))
 			red_end_of_idle_period(&q->parms);
-		q->backlog += skb->len;
+		q->backlog += qdisc_tx_len(skb);
 	}
 
 	return qdisc_requeue(skb, sch);
@@ -278,7 +278,7 @@ static struct sk_buff *gred_dequeue(stru
 				       "VQ 0x%x after dequeue, screwing up "
 				       "backlog.\n", tc_index_to_dp(skb));
 		} else {
-			q->backlog -= skb->len;
+			q->backlog -= qdisc_tx_len(skb);
 
 			if (!q->backlog && !gred_wred_mode(t))
 				red_start_of_idle_period(&q->parms);
@@ -300,7 +300,7 @@ static unsigned int gred_drop(struct Qdi
 
 	skb = qdisc_dequeue_tail(sch);
 	if (skb) {
-		unsigned int len = skb->len;
+		unsigned int len = qdisc_tx_len(skb);
 		struct gred_sched_data *q;
 		u16 dp = tc_index_to_dp(skb);
 
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 6b1b4a9..3fc8351 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -942,7 +942,7 @@ qdisc_peek_len(struct Qdisc *sch)
 			printk("qdisc_peek_len: non work-conserving qdisc ?\n");
 		return 0;
 	}
-	len = skb->len;
+	len = qdisc_tx_len(skb);
 	if (unlikely(sch->ops->requeue(skb, sch) != NET_XMIT_SUCCESS)) {
 		if (net_ratelimit())
 			printk("qdisc_peek_len: failed to requeue\n");
@@ -1648,7 +1648,7 @@ hfsc_enqueue(struct sk_buff *skb, struct
 		return err;
 	}
 
-	len = skb->len;
+	len = qdisc_tx_len(skb);
 	err = cl->qdisc->enqueue(skb, cl->qdisc);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		cl->qstats.drops++;
@@ -1712,9 +1712,9 @@ hfsc_dequeue(struct Qdisc *sch)
 		return NULL;
 	}
 
-	update_vf(cl, skb->len, cur_time);
+	update_vf(cl, qdisc_tx_len(skb), cur_time);
 	if (realtime)
-		cl->cl_cumul += skb->len;
+		cl->cl_cumul += qdisc_tx_len(skb);
 
 	if (cl->qdisc->q.qlen != 0) {
 		if (cl->cl_flags & HFSC_RSC) {
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 34afe41..b26fa9a 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -733,12 +733,12 @@ #endif
 	cl->qstats.drops++;
 	return NET_XMIT_DROP;
     } else {
-	cl->bstats.packets++; cl->bstats.bytes += skb->len;
+	cl->bstats.packets++; cl->bstats.bytes += qdisc_tx_len(skb);
 	htb_activate (q,cl);
     }
 
     sch->q.qlen++;
-    sch->bstats.packets++; sch->bstats.bytes += skb->len;
+    sch->bstats.packets++; sch->bstats.bytes += qdisc_tx_len(skb);
     HTB_DBG(1,1,"htb_enq_ok cl=%X skb=%p\n",(cl && cl != HTB_DIRECT)?cl->classid:0,skb);
     return NET_XMIT_SUCCESS;
 }
@@ -1067,7 +1067,7 @@ next:
 	} while (cl != start);
 
 	if (likely(skb != NULL)) {
-		if ((cl->un.leaf.deficit[level] -= skb->len) < 0) {
+		if ((cl->un.leaf.deficit[level] -= qdisc_tx_len(skb)) < 0) {
 			HTB_DBG(4,2,"htb_next_cl oldptr=%p quant_add=%d\n",
 				level?cl->parent->un.inner.ptr[prio]:q->ptr[0][prio],cl->un.leaf.quantum);
 			cl->un.leaf.deficit[level] += cl->un.leaf.quantum;
@@ -1077,7 +1077,7 @@ next:
 		   gives us slightly better performance */
 		if (!cl->un.leaf.q->q.qlen)
 			htb_deactivate (q,cl);
-		htb_charge_class (q,cl,level,skb->len);
+		htb_charge_class (q,cl,level,qdisc_tx_len(skb));
 	}
 	return skb;
 }
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index c5bd806..aa97ecb 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -225,7 +225,7 @@ static int netem_enqueue(struct sk_buff 
 
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		sch->q.qlen++;
-		sch->bstats.bytes += skb->len;
+		sch->bstats.bytes += qdisc_tx_len(skb);
 		sch->bstats.packets++;
 	} else
 		sch->qstats.drops++;
@@ -507,8 +507,8 @@ static int tfifo_enqueue(struct sk_buff 
 
 		__skb_queue_after(list, skb, nskb);
 
-		sch->qstats.backlog += nskb->len;
-		sch->bstats.bytes += nskb->len;
+		sch->qstats.backlog += qdisc_tx_len(nskb);
+		sch->bstats.bytes += qdisc_tx_len(nskb);
 		sch->bstats.packets++;
 
 		return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index a5fa03c..2175732 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -99,7 +99,7 @@ #ifdef CONFIG_NET_CLS_ACT
 #endif
 
 	if ((ret = qdisc->enqueue(skb, qdisc)) == NET_XMIT_SUCCESS) {
-		sch->bstats.bytes += skb->len;
+		sch->bstats.bytes += qdisc_tx_len(skb);
 		sch->bstats.packets++;
 		sch->q.qlen++;
 		return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index d65cadd..24ec0b2 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -95,7 +95,7 @@ static int red_enqueue(struct sk_buff *s
 
 	ret = child->enqueue(skb, child);
 	if (likely(ret == NET_XMIT_SUCCESS)) {
-		sch->bstats.bytes += skb->len;
+		sch->bstats.bytes += qdisc_tx_len(skb);
 		sch->bstats.packets++;
 		sch->q.qlen++;
 	} else {
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index d0d6e59..2a57d0d 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -225,7 +225,7 @@ static unsigned int sfq_drop(struct Qdis
 	if (d > 1) {
 		sfq_index x = q->dep[d+SFQ_DEPTH].next;
 		skb = q->qs[x].prev;
-		len = skb->len;
+		len = qdisc_tx_len(skb);
 		__skb_unlink(skb, &q->qs[x]);
 		kfree_skb(skb);
 		sfq_dec(q, x);
@@ -241,7 +241,7 @@ static unsigned int sfq_drop(struct Qdis
 		q->next[q->tail] = q->next[d];
 		q->allot[q->next[d]] += q->quantum;
 		skb = q->qs[d].prev;
-		len = skb->len;
+		len = qdisc_tx_len(skb);
 		__skb_unlink(skb, &q->qs[d]);
 		kfree_skb(skb);
 		sfq_dec(q, d);
@@ -267,7 +267,7 @@ sfq_enqueue(struct sk_buff *skb, struct 
 		q->ht[hash] = x = q->dep[SFQ_DEPTH].next;
 		q->hash[x] = hash;
 	}
-	sch->qstats.backlog += skb->len;
+	sch->qstats.backlog += qdisc_tx_len(skb);
 	__skb_queue_tail(&q->qs[x], skb);
 	sfq_inc(q, x);
 	if (q->qs[x].qlen == 1) {		/* The flow is new */
@@ -282,7 +282,7 @@ sfq_enqueue(struct sk_buff *skb, struct 
 		}
 	}
 	if (++sch->q.qlen < q->limit-1) {
-		sch->bstats.bytes += skb->len;
+		sch->bstats.bytes += qdisc_tx_len(skb);
 		sch->bstats.packets++;
 		return 0;
 	}
@@ -303,7 +303,7 @@ sfq_requeue(struct sk_buff *skb, struct 
 		q->ht[hash] = x = q->dep[SFQ_DEPTH].next;
 		q->hash[x] = hash;
 	}
-	sch->qstats.backlog += skb->len;
+	sch->qstats.backlog += qdisc_tx_len(skb);
 	__skb_queue_head(&q->qs[x], skb);
 	sfq_inc(q, x);
 	if (q->qs[x].qlen == 1) {		/* The flow is new */
@@ -347,7 +347,7 @@ sfq_dequeue(struct Qdisc* sch)
 	skb = __skb_dequeue(&q->qs[a]);
 	sfq_dec(q, a);
 	sch->q.qlen--;
-	sch->qstats.backlog -= skb->len;
+	sch->qstats.backlog -= qdisc_tx_len(skb);
 
 	/* Is the slot empty? */
 	if (q->qs[a].qlen == 0) {
@@ -359,7 +359,7 @@ sfq_dequeue(struct Qdisc* sch)
 		}
 		q->next[q->tail] = a;
 		q->allot[a] += q->quantum;
-	} else if ((q->allot[a] -= skb->len) <= 0) {
+	} else if ((q->allot[a] -= qdisc_tx_len(skb)) <= 0) {
 		q->tail = a;
 		a = q->next[a];
 		q->allot[a] += q->quantum;
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index d9a5d29..c87b0e6 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -139,7 +139,7 @@ static int tbf_enqueue(struct sk_buff *s
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	int ret;
 
-	if (skb->len > q->max_size) {
+	if (qdisc_tx_len(skb) > q->max_size) {
 		sch->qstats.drops++;
 #ifdef CONFIG_NET_CLS_POLICE
 		if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
@@ -155,7 +155,7 @@ #endif
 	}
 
 	sch->q.qlen++;
-	sch->bstats.bytes += skb->len;
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
 	return 0;
 }
@@ -204,7 +204,7 @@ static struct sk_buff *tbf_dequeue(struc
 		psched_time_t now;
 		long toks, delay;
 		long ptoks = 0;
-		unsigned int len = skb->len;
+		unsigned int len = qdisc_tx_len(skb);
 
 		PSCHED_GET_TIME(now);
 
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index 4c16ad5..538f63f 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -97,7 +97,7 @@ teql_enqueue(struct sk_buff *skb, struct
 
 	__skb_queue_tail(&q->q, skb);
 	if (q->q.qlen <= dev->tx_queue_len) {
-		sch->bstats.bytes += skb->len;
+		sch->bstats.bytes += qdisc_tx_len(skb);
 		sch->bstats.packets++;
 		return 0;
 	}
@@ -278,7 +278,7 @@ static int teql_master_xmit(struct sk_bu
 	struct Qdisc *start, *q;
 	int busy;
 	int nores;
-	int len = skb->len;
+	int len = qdisc_tx_len(skb);
 	struct sk_buff *skb_res = NULL;
 
 	start = master->slaves;

[-- Attachment #3: 02.diff --]
[-- Type: text/plain, Size: 2212 bytes --]

[NET_SCHED]: Move top-level device queueing code to seperate function

Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit a39585afe71dafab96208515a8fa99c92b108fee
tree fbb7672a3061a38edc9f75d3fb8f34652796b109
parent 2a6508576111d82246ee018edbcc4b0f0d18acad
author Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:28 +0200
committer Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:28 +0200

 include/net/pkt_sched.h |    1 +
 net/core/dev.c          |   10 +---------
 net/sched/sch_generic.c |   12 ++++++++++++
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 1925c65..44cf69e 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -224,6 +224,7 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
+extern int qdisc_enqueue_root(struct net_device *dev, struct sk_buff *skb);
 extern void __qdisc_run(struct net_device *dev);
 
 static inline void qdisc_run(struct net_device *dev)
diff --git a/net/core/dev.c b/net/core/dev.c
index 066a60a..8599120 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1449,15 +1449,7 @@ #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
 #endif
 	if (q->enqueue) {
-		/* Grab device queue */
-		spin_lock(&dev->queue_lock);
-
-		rc = q->enqueue(skb, q);
-
-		qdisc_run(dev);
-
-		spin_unlock(&dev->queue_lock);
-		rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
+		rc = qdisc_enqueue_root(dev, skb);
 		goto out;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d735f51..2bab466 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -77,6 +77,18 @@ void qdisc_unlock_tree(struct net_device
    if one is grabbed, another must be free.
  */
 
+int qdisc_enqueue_root(struct net_device *dev, struct sk_buff *skb)
+{
+	int ret;
+
+	spin_lock(&dev->queue_lock);
+	ret = dev->qdisc->enqueue(skb, dev->qdisc);
+	qdisc_run(dev);
+	spin_unlock(&dev->queue_lock);
+
+	return ret == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : ret;
+}
+
 
 /* Kick device.
    Note, that this procedure can be called by a watchdog timer, so that

[-- Attachment #4: 03.diff --]
[-- Type: text/plain, Size: 8494 bytes --]

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..2ce55d5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -83,6 +83,21 @@ struct tc_ratespec
 	__u32		rate;
 };
 
+struct tc_sizespec
+{
+	unsigned int	cell_log;
+	unsigned int	addend;
+};
+
+enum {
+	TCA_STAB_UNSPEC,
+	TCA_STAB_BASE,
+	TCA_STAB_DATA,
+	__TCA_STAB_MAX
+};
+
+#define TCA_STAB_MAX (__TCA_STAB_MAX - 1)
+
 /* FIFO section */
 
 struct tc_fifo_qopt
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index facd9ee..167cc22 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -821,6 +821,7 @@ enum
 	TCA_RATE,
 	TCA_FCNT,
 	TCA_STATS2,
+	TCA_STAB,
 	__TCA_MAX
 };
 
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 44cf69e..8fd9a42 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -223,6 +223,7 @@ extern struct Qdisc *qdisc_lookup_class(
 extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
+extern void qdisc_put_stab(struct qdisc_size_table *tab);
 
 extern int qdisc_enqueue_root(struct net_device *dev, struct sk_buff *skb);
 extern void __qdisc_run(struct net_device *dev);
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 75d7a55..76c50a1 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -23,6 +23,15 @@ struct qdisc_rate_table
 	int		refcnt;
 };
 
+struct qdisc_size_table
+{
+	struct list_head	list;
+	struct tc_sizespec	size;
+	int			refcnt;
+	unsigned int		tsize;
+	u32			data[];
+};
+
 struct Qdisc
 {
 	int 			(*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
@@ -33,6 +42,7 @@ #define TCQ_F_THROTTLED	2
 #define TCQ_F_INGRESS	4
 	int			padded;
 	struct Qdisc_ops	*ops;
+	struct qdisc_size_table	*stab;
 	u32			handle;
 	u32			parent;
 	atomic_t		refcnt;
@@ -184,9 +194,19 @@ tcf_destroy(struct tcf_proto *tp)
 	kfree(tp);
 }
 
+struct qdisc_skb_cb {
+	unsigned int	len;
+	char		data[];
+};
+
+static inline struct qdisc_skb_cb *qdisc_skb_cb(struct sk_buff *skb)
+{
+	return (struct qdisc_skb_cb *)skb->cb;
+}
+
 static inline unsigned int qdisc_tx_len(struct sk_buff *skb)
 {
-	return skb->len;
+	return qdisc_skb_cb(skb)->len;
 }
 
 static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index c7844ba..479fc85 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -286,6 +286,78 @@ void qdisc_put_rtab(struct qdisc_rate_ta
 	}
 }
 
+static LIST_HEAD(qdisc_stab_list);
+
+static struct qdisc_size_table *qdisc_get_stab(struct rtattr *tab, int *err)
+{
+	struct qdisc_size_table *stab;
+	struct rtattr *tb[TCA_STAB_MAX];
+	unsigned int tsize;
+
+	*err = -EINVAL;
+	if (rtattr_parse_nested(tb, TCA_STAB_MAX, tab))
+		return NULL;
+	if (tb[TCA_STAB_BASE-1] == NULL ||
+	    RTA_PAYLOAD(tb[TCA_STAB_BASE-1]) < sizeof(struct tc_sizespec))
+	    	return NULL;
+
+	tsize = 0;
+	if (tb[TCA_STAB_DATA-1] != NULL)
+		tsize = RTA_PAYLOAD(tb[TCA_STAB_DATA-1]) / sizeof(u32);
+
+	list_for_each_entry(stab, &qdisc_stab_list, list) {
+		if (stab->tsize != tsize)
+			continue;
+		if (memcmp(&stab->size, RTA_DATA(tb[TCA_STAB_BASE-1]),
+		           sizeof(stab->size)))
+			continue;
+		if (tsize > 0  &&
+		    memcmp(stab->data, RTA_DATA(tb[TCA_STAB_DATA-1]),
+		    	   sizeof(u32) * tsize));
+			continue;
+		stab->refcnt++;
+		return stab;
+	}
+
+	*err = -ENOMEM;
+	stab = kmalloc(sizeof(*stab) + sizeof(u32) * tsize, GFP_KERNEL);
+	if (stab == NULL)
+		return stab;
+	memcpy(&stab->size, RTA_DATA(tb[TCA_STAB_BASE-1]), sizeof(stab->size));
+	stab->tsize = tsize;
+	if (tsize > 0)
+		memcpy(stab->data, RTA_DATA(tb[TCA_STAB_DATA-1]),
+		       sizeof(u32) * tsize);
+	list_add_tail(&stab->list, &qdisc_stab_list);
+	*err = 0;
+	return stab;
+}
+
+void qdisc_put_stab(struct qdisc_size_table *stab)
+{
+	if (!stab || --stab->refcnt)
+		return;
+	list_del(&stab->list);
+	kfree(stab);
+}
+
+static int
+qdisc_dump_stab(struct sk_buff *skb, struct qdisc_size_table *stab)
+{
+	unsigned char *b = skb->tail;
+	struct rtattr *rta = (struct rtattr *)b;
+
+	RTA_PUT(skb, TCA_STAB, 0, NULL);
+	RTA_PUT(skb, TCA_STAB_BASE, sizeof(stab->size), &stab->size);
+	RTA_PUT(skb, TCA_STAB_DATA, sizeof(stab->data[0]) * stab->tsize,
+		stab->data);
+	rta->rta_len = skb->tail - b;
+	return skb->len;
+
+rtattr_failure:
+	skb_trim(skb, b - skb->data);
+	return -1;
+}
 
 /* Allocate an unique handle from space managed by kernel */
 
@@ -453,6 +525,11 @@ #endif
 	sch->handle = handle;
 
 	if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS-1])) == 0) {
+		if (tca[TCA_STAB-1]) {
+			sch->stab = qdisc_get_stab(tca[TCA_STAB-1], &err);
+			if (sch->stab == NULL)
+				goto err_out3;
+		}
 #ifdef CONFIG_NET_ESTIMATOR
 		if (tca[TCA_RATE-1]) {
 			err = gen_new_estimator(&sch->bstats, &sch->rate_est,
@@ -477,6 +554,7 @@ #endif
 		return sch;
 	}
 err_out3:
+	qdisc_put_stab(sch->stab);
 	dev_put(dev);
 	kfree((char *) sch - sch->padded);
 err_out2:
@@ -488,15 +566,26 @@ err_out:
 
 static int qdisc_change(struct Qdisc *sch, struct rtattr **tca)
 {
-	if (tca[TCA_OPTIONS-1]) {
-		int err;
+	int err;
 
+	if (tca[TCA_OPTIONS-1]) {
 		if (sch->ops->change == NULL)
 			return -EINVAL;
 		err = sch->ops->change(sch, tca[TCA_OPTIONS-1]);
 		if (err)
 			return err;
 	}
+	if (tca[TCA_STAB-1]) {
+		struct qdisc_size_table *stab;
+
+		stab = qdisc_get_stab(tca[TCA_STAB-1], &err);
+		if (stab == NULL)
+			return err;
+		spin_lock_bh(&sch->dev->queue_lock);
+		qdisc_put_stab(sch->stab);
+		sch->stab = stab;
+		spin_unlock_bh(&sch->dev->queue_lock);
+	}
 #ifdef CONFIG_NET_ESTIMATOR
 	if (tca[TCA_RATE-1])
 		gen_replace_estimator(&sch->bstats, &sch->rate_est,
@@ -769,6 +858,9 @@ static int tc_fill_qdisc(struct sk_buff 
 		goto rtattr_failure;
 	q->qstats.qlen = q->q.qlen;
 
+	if (q->stab != NULL && qdisc_dump_stab(skb, q->stab) < 0)
+		goto rtattr_failure;
+
 	if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS,
 			TCA_XSTATS, q->stats_lock, &d) < 0)
 		goto rtattr_failure;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2bab466..9022650 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -67,6 +67,21 @@ void qdisc_unlock_tree(struct net_device
 	write_unlock_bh(&qdisc_tree_lock);
 }
 
+static void qdisc_init_len(struct sk_buff *skb, struct Qdisc *q)
+{
+	unsigned int idx, len = skb->len;
+	struct qdisc_size_table *stab = q->stab;
+
+	if (stab == NULL)
+		goto out;
+	idx = len >> stab->size.cell_log;
+	if (idx < stab->tsize)
+		len = stab->data[idx];
+	len += stab->size.addend;
+out:
+	((struct qdisc_skb_cb *)skb->cb)->len = len;
+}
+
 /* 
    dev->queue_lock serializes queue accesses for this device
    AND dev->qdisc pointer itself.
@@ -82,6 +97,7 @@ int qdisc_enqueue_root(struct net_device
 	int ret;
 
 	spin_lock(&dev->queue_lock);
+	qdisc_init_len(skb, dev->qdisc);
 	ret = dev->qdisc->enqueue(skb, dev->qdisc);
 	qdisc_run(dev);
 	spin_unlock(&dev->queue_lock);
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index aa97ecb..15dde88 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -148,7 +148,7 @@ static long tabledist(unsigned long mu, 
 static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct netem_sched_data *q = qdisc_priv(sch);
-	struct netem_skb_cb *cb = (struct netem_skb_cb *)skb->cb;
+	struct netem_skb_cb *cb = (struct netem_skb_cb *)qdisc_skb_cb(skb)->data;
 	struct sk_buff *skb2;
 	int ret;
 	int count = 1;
@@ -268,7 +268,7 @@ static struct sk_buff *netem_dequeue(str
 	skb = q->qdisc->dequeue(q->qdisc);
 	if (skb) {
 		const struct netem_skb_cb *cb
-			= (const struct netem_skb_cb *)skb->cb;
+			= (const struct netem_skb_cb *)qdisc_skb_cb(skb)->data;
 		psched_time_t now;
 
 		/* if more time remaining? */
@@ -493,13 +493,13 @@ static int tfifo_enqueue(struct sk_buff 
 	struct fifo_sched_data *q = qdisc_priv(sch);
 	struct sk_buff_head *list = &sch->q;
 	const struct netem_skb_cb *ncb
-		= (const struct netem_skb_cb *)nskb->cb;
+		= (const struct netem_skb_cb *)qdisc_skb_cb(nskb)->data;
 	struct sk_buff *skb;
 
 	if (likely(skb_queue_len(list) < q->limit)) {
 		skb_queue_reverse_walk(list, skb) {
 			const struct netem_skb_cb *cb
-				= (const struct netem_skb_cb *)skb->cb;
+				= (const struct netem_skb_cb *)qdisc_skb_cb(skb)->data;
 
 			if (!PSCHED_TLESS(ncb->time_to_send, cb->time_to_send))
 				break;

[-- Attachment #5: iproute.diff --]
[-- Type: text/plain, Size: 4483 bytes --]

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..2ce55d5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -83,6 +83,21 @@ struct tc_ratespec
 	__u32		rate;
 };
 
+struct tc_sizespec
+{
+	unsigned int	cell_log;
+	unsigned int	addend;
+};
+
+enum {
+	TCA_STAB_UNSPEC,
+	TCA_STAB_BASE,
+	TCA_STAB_DATA,
+	__TCA_STAB_MAX
+};
+
+#define TCA_STAB_MAX (__TCA_STAB_MAX - 1)
+
 /* FIFO section */
 
 struct tc_fifo_qopt
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 5e33a20..addf5fb 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -821,6 +821,7 @@ enum
 	TCA_RATE,
 	TCA_FCNT,
 	TCA_STATS2,
+	TCA_STAB,
 	__TCA_MAX
 };
 
diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index e9174ab..c38fa87 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -41,10 +41,79 @@ static int usage(void)
 	return -1;
 }
 
+static int parse_stab(int *argcp, char ***argvp, struct tc_sizespec *stab,
+		      __u32 **datap)
+{
+	int argc = *argcp;
+	char **argv = *argvp;
+
+	NEXT_ARG();
+	while (argc > 0) {
+		if (matches("overhead", *argv) == 0) {
+			NEXT_ARG();
+			if (stab->addend)
+				duparg("overhead", *argv);
+			if (get_size(&stab->addend, *argv))
+				return -1;
+			NEXT_ARG();
+		} else if (matches("cell_log", *argv) == 0) {
+			NEXT_ARG();
+			if (stab->cell_log)
+				duparg("cell_log", *argv);
+			if (get_u32(&stab->cell_log, *argv, 0))
+				return -1;
+			NEXT_ARG();
+		} else if (get_size(*datap, *argv) == 0) {
+			argv++, argc--;
+			++*datap;
+		} else
+			break;
+	}
+	if (!stab->addend && !stab->cell_log)
+		return -1;
+	*argcp = argc;
+	*argvp = argv;
+	return 0;
+}
+
+static void print_stab(FILE *f, char *prefix, struct rtattr *tab)
+{
+	struct rtattr *tb[TCA_STAB_MAX+1];
+	struct tc_sizespec *size;
+	unsigned int i;
+	__u32 *data;
+	SPRINT_BUF(buf);
+
+	parse_rtattr_nested(tb, TCA_STAB_MAX, tab);
+	if (tb[TCA_STAB_BASE] == NULL ||
+	    RTA_PAYLOAD(tb[TCA_STAB_BASE]) < sizeof(struct tc_sizespec))
+		return;
+	fprintf(f, "%s", prefix);
+	size = RTA_DATA(tb[TCA_STAB_BASE]);
+	if (size->addend) {
+		print_size(buf, SPRINT_BSIZE-1, size->addend);
+		fprintf(f, "overhead %s ", buf);
+	}
+       	if (size->cell_log)
+		fprintf(f, "cell_log %u ", size->cell_log);
+	if (tb[TCA_STAB_DATA] == NULL)
+		return;
+	data = RTA_DATA(tb[TCA_STAB_DATA]);
+	for (i = 0; i < RTA_PAYLOAD(tb[TCA_STAB_DATA]) / sizeof(__u32); i++) {
+		print_size(buf, SPRINT_BSIZE-1, data[i]);
+		fprintf(f, "%s ", buf);
+	}
+}
+
 int tc_qdisc_modify(int cmd, unsigned flags, int argc, char **argv)
 {
 	struct qdisc_util *q = NULL;
 	struct tc_estimator est;
+	struct {
+		struct tc_sizespec size;
+		__u32 data[256];
+	} stab;
+	__u32 *stabdata = &stab.data[0];
 	char  d[16];
 	char  k[16];
 	struct {
@@ -55,6 +124,7 @@ int tc_qdisc_modify(int cmd, unsigned fl
 
 	memset(&req, 0, sizeof(req));
 	memset(&est, 0, sizeof(est));
+	memset(&stab, 0, sizeof(stab));
 	memset(&d, 0, sizeof(d));
 	memset(&k, 0, sizeof(k));
 
@@ -108,6 +178,10 @@ #endif
 		} else if (matches(*argv, "estimator") == 0) {
 			if (parse_estimator(&argc, &argv, &est))
 				return -1;
+		} else if (matches(*argv, "stab") == 0) {
+			if (parse_stab(&argc, &argv, &stab.size, &stabdata))
+				return -1;
+			continue;
 		} else if (matches(*argv, "help") == 0) {
 			usage();
 		} else {
@@ -124,6 +198,16 @@ #endif
 		addattr_l(&req.n, sizeof(req), TCA_KIND, k, strlen(k)+1);
 	if (est.ewma_log)
 		addattr_l(&req.n, sizeof(req), TCA_RATE, &est, sizeof(est));
+	if (stab.size.addend || stab.size.cell_log) {
+		struct rtattr *tail = NLMSG_TAIL(&req.n);
+
+		addattr_l(&req.n, sizeof(req), TCA_STAB, NULL, 0);
+		addattr_l(&req.n, sizeof(req), TCA_STAB_BASE, &stab.size,
+			  sizeof(stab.size));
+		addattr_l(&req.n, sizeof(req), TCA_STAB_DATA, stab.data,
+		          (void *)stabdata - (void *)stab.data);
+		tail->rta_len = (void *)NLMSG_TAIL(&req.n) - (void *)tail;
+	}
 
 	if (q) {
 		if (!q->parse_qopt) {
@@ -215,7 +299,7 @@ static int print_qdisc(const struct sock
 		q = get_qdisc_kind("prio");
 	else
 		q = get_qdisc_kind(RTA_DATA(tb[TCA_KIND]));
-	
+
 	if (tb[TCA_OPTIONS]) {
 		if (q)
 			q->print_qopt(q, fp, tb[TCA_OPTIONS]);
@@ -223,6 +307,12 @@ static int print_qdisc(const struct sock
 			fprintf(fp, "[cannot parse qdisc parameters]");
 	}
 	fprintf(fp, "\n");
+
+	if (tb[TCA_STAB]) {
+		print_stab(fp, " ", tb[TCA_STAB]);
+		fprintf(fp, "\n");
+	}
+
 	if (show_stats) {
 		struct rtattr *xstats = NULL;
 

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-04 13:29                       ` Patrick McHardy
@ 2006-07-04 19:29                         ` jamal
  2006-07-04 23:53                           ` Patrick McHardy
  2006-07-06  0:39                         ` Russell Stuart
  1 sibling, 1 reply; 56+ messages in thread
From: jamal @ 2006-07-04 19:29 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Alan Cox,
	Russell Stuart, Russell Stuart

On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote:
> Russell Stuart wrote:
[..]
> > Without seeing your actual proposal it is difficult to
> > judge whether this is a reasonable trade-off or not.
> > Hopefully we will see your code soon.  Do you have any
> > idea when?
> 
> Unfortunately I still didn't got to cleaning them up, so I'm sending
> them in their preliminary state. Its not much that is missing, but
> the netem usage of skb->cb needs to be integrated better, I failed
> to move it to the qdisc_skb_cb so far because of circular includes.
> But nothing unfixable. I'm mostly interested if the current size-tables
> can express what you need for ATM, I wasn't able to understand the
> big comment in tc_core.c in your patch.
> 

Looks good from within the range of "change within reason" of addressed
problem. The cb on the qdisc seems only usable for netem, correct?
Also while not unreasonable, i wasnt sure how qdisc_enqueue_root()
fit in the grand scheme of things for this change (it seemed out of
place).

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-04 19:29                         ` jamal
@ 2006-07-04 23:53                           ` Patrick McHardy
  0 siblings, 0 replies; 56+ messages in thread
From: Patrick McHardy @ 2006-07-04 23:53 UTC (permalink / raw)
  To: hadi
  Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Alan Cox,
	Russell Stuart, Russell Stuart

jamal wrote:
> On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote:
> 
>>Russell Stuart wrote:
> 
> [..]
> 
>>>Without seeing your actual proposal it is difficult to
>>>judge whether this is a reasonable trade-off or not.
>>>Hopefully we will see your code soon.  Do you have any
>>>idea when?
>>
>>Unfortunately I still didn't got to cleaning them up, so I'm sending
>>them in their preliminary state. Its not much that is missing, but
>>the netem usage of skb->cb needs to be integrated better, I failed
>>to move it to the qdisc_skb_cb so far because of circular includes.
>>But nothing unfixable. I'm mostly interested if the current size-tables
>>can express what you need for ATM, I wasn't able to understand the
>>big comment in tc_core.c in your patch.
>>
> 
> 
> Looks good from within the range of "change within reason" of addressed
> problem. The cb on the qdisc seems only usable for netem, correct?

Yes, it has the same limitations as current netem cb usage. Really
makeing it useable for all qdiscs would require reserving a few bytes
for every level, so far that isn't necessary and I would prefer to
just add a time_to_send field for netem. The problem with this is
that it currently requires sch_generic.h and pkt_sched.h to include
one another, so I did the qdisc_skb_cb() thing to at least get it to
compile for now.

> Also while not unreasonable, i wasnt sure how qdisc_enqueue_root()
> fit in the grand scheme of things for this change (it seemed out of
> place).

Its there as a spot to do the initial time calculations and store them
in the cb. I didn't want to put this in net/core/dev.c.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-04 13:29                       ` Patrick McHardy
  2006-07-04 19:29                         ` jamal
@ 2006-07-06  0:39                         ` Russell Stuart
  2006-07-07  8:00                           ` Patrick McHardy
  1 sibling, 1 reply; 56+ messages in thread
From: Russell Stuart @ 2006-07-06  0:39 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer

[-- Attachment #1: Type: text/plain, Size: 1921 bytes --]

On Tue, 2006-07-04 at 15:29 +0200, Patrick McHardy wrote:
> Unfortunately I still didn't got to cleaning them up, so I'm sending
> them in their preliminary state. Its not much that is missing, but
> the netem usage of skb->cb needs to be integrated better, I failed
> to move it to the qdisc_skb_cb so far because of circular includes.

Cleanups aside, architecturally the bulk of your patch 
looks like a no-brainier to me.  The calculation of
packet length should be in one place.  Caching it in
skb->cb was a nice touch.

> But nothing unfixable. I'm mostly interested if the current size-tables
> can express what you need for ATM, I wasn't able to understand the
> big comment in tc_core.c in your patch.

Unfortunately you do things in the wrong order for ATM.
See: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html
for an overview of the problem, and then the attached email for
a detailed description of how the current patch addresses it.
It is a trivial fix.

As I said earlier, RTAB and STAB contain the same numbers,
just scaled differently.  The ATM patch stuffed around with
RTAB.  With your patch in place it will have to do the same 
exactly the same thing with STAB - because RTAB and STAB
carry the same data.  So to me the two patches seem
orthogonal.

One observation is the size optimisation you applied to STAB, 
making it variable length, could also be applied to RTAB.  
In fact it should be.  Then they would be identical, apart 
from the scaling.  Even the lookup operation (performed in
qdisc_init_len in your patch) would be identical.

However, now you lot have made me go away and think, I have
another idea on how to attack this.  Perhaps it will be
more palatable to you.  It would replace RTAB and STAB with
a 28 byte structure for most protocol stacks - well all I can
think of off the top of my head, anyway.  RTAB would have to
remain for backwards compatibility, of course.

[-- Attachment #2: Attached message - Re: Getting ATM patches into the kernel --]
[-- Type: message/rfc822, Size: 10535 bytes --]

From: Russell Stuart <russell@stuart.id.au>
To: Jesper Dangaard Brouer <jdb@comx.dk>
Subject: Re: Getting ATM patches into the kernel
Date: Fri, 19 May 2006 22:59:34 +1000
Message-ID: <1148043575.4206.259.camel@ras.pc.brisbane.lube>

Explanation of rate calculation code.

tc_core.c lines 53..61, tc_align_to_cells():

  Does the same thing as you function of the same name.
  Code is intended to be semantically equivalent.

tc_core.c lines 145..160, in tc_calc_ratespec:

  There are two material differences to your code.  Both
  are in this line:

    int sz = ((i+1)<<cell_log) - 1 + overhead;

  The first is that where I add the overhead, you don't.
  This is because I am calculating the rate table so it
  works as well as close to optimal as possible on an
  unpatched kernel.  Given the background in your draft
  email, I assume I don't have to prove you that it does
  in fact calculate the best possible table for an 
  unpatched kernel.

  The second change is I use:
    ((i+1)<<cell_log) - 1
  Whereas you had:
    ((i+1)<<cell_log)

  The number calculated by this expression:
    ((i+1) << cell_log)
  is the shortest packet length rtab[(i+1)] applies to.  
  Ergo, that number minus 1 must be the longest packet 
  length the previous rtab entry, rtab[i], applies to.
  So the intent of the change is that in unpatched kernels,
  rtab[i] will contain the time to send the longest 
  packet that uses it.  I suspect your code ends up doing 
  the same thing on kernels that add overhead to the 
  packet length.

There should be no other semantic differences between and
how you and I calculate the rate table.  If so this should
be sufficient to show that if your code calculated an ideal
rate table for kernels that added the overhead to the 
packet length before looking up the rate table, my code 
should calculate the best possible table for unpatched 
kernels.  It won't be ideal because some table entries will
apply to packets using different numbers of ATM cells.  
This is unavoidable.  But where that happens my code will 
always be conservative, ie use the longest packet length.

I think backward the compatibility increases the odds of
the patch being accepted.  The next step is to patch the
kernel it can get 100% accurate results by using this
table.

Consider this "diagram":

Overhead 0
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             39 40 41 42 43 44 45 46 47 48 49 50
rtab[skb->len/8]:      1  1  1  1  1  1  1  1  1 x2  2  2

Each column is for a particular packet length.  The rows
show the packet length as seen by a layer of the protocol.  
The top row, "ATM Payload", contains packet length seen 
by the ATM layer - ie the ATM payload.  It includes all 
overheads bar cell overhead and padding.  It is always 
equal to: skb->len + overhead.

The second row, "skb->len" contains the packet seen by 
the kernel.  It is just the figure in the first line
less the "overhead" passed to tc.

The third line shows how "tc" will calculate the rate 
table entry for the packet length.  It contains the
number of ATM cells "tc" thinks the packet will occupy.
I am assuming a cell_log of 3 here, meaning each rate
table entry is used for 8 different packet lengths.  
In an unpatched kernel a single rate table entry may 
apply to packets which use different numbers of ATM 
cells.  Since the rate table entry is a single number 
it cant be right for all packet lengths when that 
happens.  If faced with this ambiguity tc uses the number
of cells occupied by the longest packet the rate table 
entry applies to. This means the rate table will be wrong 
for the shorter packets, and this is highlighted in the
diagram an "x".

Thus we see from the table that for a 39 byte packet tc
thinks it will take 1 cell to send.  But for a 48 byte
packet the table says it will take "x2" packets.  The "x"
means it will really take 1 packet, but because this rate
table entry is also used by 49..55 byte packets tc used
the bigger value for that rate table entry.

We can fix the errant "x2" entry by sliding the rate
table along so it sits under the smallest packet length
it could apply to (in this case 49).  To achieve that
the rate table must be slide one byte to right.  This 
can be achieved by adding an "alignment" figure of -1 
to the packet length before looking up the rate table.
Here is the same diagram as above, but with two 
additional lines added showing the effect of adding the
alignment:

Overhead 0
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             39 40 41 42 43 44 45 46 47 48 49 50
rtab[skb->len/8]:      1  1  1  1  1  1  1  1  1 x2  2  2
skb->len-1:           38 39 40 41 42 43 44 45 46 47 48 49
rtab[(skb->len-1)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

The rows "skb->len-1" and "rtab[(skb->len-1)/8]" are
copies of the rows above that have been moved slightly.
They have to be copies, as the rtab computed by tc hasn't
changed, so for example 47 must still yield 1 cell, and
48 must still yield 2 cells.

Below is the same diagram repeated for overheads 1 .. 9.
The are all pretty much the same as the one above.  The
only changes is the increasing overhead, which in turn 
changes the number of columns in error.  The alignment 
changes to compensate.  You will see it follows a 
straightforward pattern as the overhead increases.  

Overhead 1
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             38 39 40 41 42 43 44 45 46 47 48 49
rtab[skb->len/8]:      1  1  1  1  1  1  1  1  1  1  2  2
skb->len-0:           38 39 40 41 42 43 44 45 46 47 48 49
rtab[(skb->len-0)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 2
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             37 38 39 40 41 42 43 44 45 46 47 48
rtab[skb->len/8]:      1  1  1 x2 x2 x2 x2 x2 x2 x2  2  2
skb->len-7:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-7)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 3
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             36 37 38 39 40 41 42 43 44 45 46 47
rtab[skb->len/8]:      1  1  1  1 x2 x2 x2 x2 x2 x2  2  2
skb->len-6:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-6)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 4
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             35 36 37 38 39 40 41 42 43 44 45 46
rtab[skb->len/8]:      1  1  1  1  1 x2 x2 x2 x2 x2  2  2
skb->len-5:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-5)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 5
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             34 35 36 37 38 39 40 41 42 43 44 45
rtab[skb->len/8]:      1  1  1  1  1  1 x2 x2 x2 x2  2  2
skb->len-4:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-4)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 6
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             33 34 35 36 37 38 39 40 41 42 43 44
rtab[skb->len/8]:      1  1  1  1  1  1  1 x2 x2 x2  2  2
skb->len-3:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-3)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 7
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             32 33 34 35 36 37 38 39 40 41 42 43
rtab[skb->len/8]:      1  1  1  1  1  1  1  1 x2 x2  2  2
skb->len-2:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-2)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 8
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             31 32 33 34 35 36 37 38 39 40 41 42
rtab[skb->len/8]:      1  1  1  1  1  1  1  1  1 x2  2  2
skb->len-1:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-1)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Overhead 9
----------
ATM Payload:          39 40 41 42 43 44 45 46 47 48 49 50
skb->len:             30 31 32 33 34 35 36 37 38 39 40 41
rtab[skb->len/8]:      1  1  1  1  1  1  1  1  1  1  2  2
skb->len-0:           30 31 32 33 34 35 36 37 38 39 40 41
rtab[(skb->len-0)/8]:  1  1  1  1  1  1  1  1  1  1  2  2

Notice that at overhead 8 has an alignment of -1, and this
is the same alignment as overhead 0 had.  This is because
the errors have moved through one rate table entry and are
now into the next one, so the "fix" is the same.

We can draw up a table showing the required alignment for
each overhead, like this:

  overhead | alignment
  ---------+----------
      0    |    -1
      1    |     0
      2    |    -7
      3    |    -6
      4    |    -5
      5    |    -4
      6    |    -3
      7    |    -2

We only need 8 elements, as once we get to 8 (or more
accurately 1<<cell_log) it repeats.  If the kernel
adds the alignment figure from the table to the packet
size before looking up the rate table, then it will 
always get accurate results from the optimal table for 
the unpatched kernel.

The alignment figure is calculated by "tc" and passed to
the kernel, much as you passed the "overhead" to the 
kernel in your code.  In my case it will always be small.

And finally, we come to ......

tc_core.c line 122, in tc_calc_cell_align.  The expression 
I use to calculate the alignment is:

    return (overhead + cell_size - 2) % cell_size - cell_size + 1

You could try and mathematically prove it does yield the
table above, but since there are only 8 values that
matter (the % in the expression ensures it repeats after
that), it much easier to just plug in the overheads 0..7
and verify it does come up with the correct alignment.

QED, for a cell_log of 3 anyway.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-06  0:39                         ` Russell Stuart
@ 2006-07-07  8:00                           ` Patrick McHardy
  2006-07-10  8:44                             ` Russell Stuart
  0 siblings, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-07-07  8:00 UTC (permalink / raw)
  To: Russell Stuart
  Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer

Russell Stuart wrote:
> Unfortunately you do things in the wrong order for ATM.
> See: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html
> for an overview of the problem, and then the attached email for
> a detailed description of how the current patch addresses it.
> It is a trivial fix.

Actually that was the part I didn't understand, you keep talking
(also in that comment in tc_core.c) about an "unknown overhead".
What is that and why would it be unknown? The mail you attached
is quite long, is there an simple example that shows what you
mean?

> As I said earlier, RTAB and STAB contain the same numbers,
> just scaled differently.  The ATM patch stuffed around with
> RTAB.  With your patch in place it will have to do the same 
> exactly the same thing with STAB - because RTAB and STAB
> carry the same data.  So to me the two patches seem
> orthogonal.

Yes. As I said in the beginning of this thread, we could still
do the RTAB thing for qdiscs that support it. But since the
visible size should be the same for all qdiscs (including child
ones) we need to do the initial calculation anyway, so I don't
see any purpose in still adjusting the rate tables and using
skb->len instead of using the adjusted size.

> One observation is the size optimisation you applied to STAB, 
> making it variable length, could also be applied to RTAB.  
> In fact it should be.  Then they would be identical, apart 
> from the scaling.  Even the lookup operation (performed in
> qdisc_init_len in your patch) would be identical.

We can do that. I'm not attached to the size tables :)

> However, now you lot have made me go away and think, I have
> another idea on how to attack this.  Perhaps it will be
> more palatable to you.  It would replace RTAB and STAB with
> a 28 byte structure for most protocol stacks - well all I can
> think of off the top of my head, anyway.  RTAB would have to
> remain for backwards compatibility, of course.

Can you describe in more detail?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-07  8:00                           ` Patrick McHardy
@ 2006-07-10  8:44                             ` Russell Stuart
  0 siblings, 0 replies; 56+ messages in thread
From: Russell Stuart @ 2006-07-10  8:44 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, hadi, Alan Cox, lartc

On Fri, 2006-07-07 at 10:00 +0200, Patrick McHardy wrote:
> Russell Stuart wrote:
> > Unfortunately you do things in the wrong order for ATM.
> > See: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html
> > for an overview of the problem, and then the attached email for
> > a detailed description of how the current patch addresses it.
> > It is a trivial fix.
> 
> Actually that was the part I didn't understand, you keep talking
> (also in that comment in tc_core.c) about an "unknown overhead".
> What is that and why would it be unknown? The mail you attached
> is quite long, is there an simple example that shows what you
> mean?

The "unknown overhead" is just the overhead passed to tc
using the "tc ... overhead xxx" option.  It is probably
what you intended to put into your addend attribute.

It is "unknown" because the kernel currently doesn't use
it.  It is passed in the tc_ratespec, but is ignored by
the kernel as are most fields in there.

The easy way to fix the "ATM" problem described in the big
comment is simply to add the "overhead" to the packet 
length before doing the RTAB lookup.  (Identical comments 
apply to STAB).  If you don't accept this or understand
why, then go read the "long emails" which attempt to
explain it in detail.  Jesper's initial version of the
patch did just that, BTW.

However if you do that then you have to adjust RTAB for
all cases (not just ATM) to reflect that the kernel is 
now adding the overhead.  Thus the RTAB tc sends to the 
kernel now changes for different kernel versions, making 
modern versions of tc incompatible with older kernels, 
and visa versa.  I didn't consider that acceptable.

My solution to this to give the kernel the old format
RTAB (ie the one that assumed the kernel didn't add the
overhead) and a small adjustment.  This small adjustment 
is called cell_align in the ATM patch.  You do the same 
thing with cell_align as the previous solution did with 
the overhead - ie add it in just before looking up RTAB.  
This is in effect all the kernel part of the ATM patch
does - make the kernel accept the cell_align option,
and add it to skb->len before looking up RTAB.

The difference between cell_align and overhead is that
cell_align is always 0 when there is no packetisation,
and even when non zero it is small (less than 1<<cell_log, 
ie less than 8 for typical MTU's).  So for anything bar 
ATM it is zero which means old kernels are completely
unaffected, and even for ATM not sending it produces a 
small error which means older kernels still benefit from
the "ATM" user space patch.   This makes the proposed 
"ATM" version of tc both forward and  backward compatible.

One other point arises here.  The fields in "tc_ratespec"
that "tc" fills and the kernel ignores are there so "tc 
show" will work.  The essence of the problem is "tc"
compiles the stuff you give it into a single "RTAB".  
That "RTAB" can't be reverse compiled into the original 
numbers the user provided.  So if "tc show" is to work,
"tc" has to save that information somewhere.  I don't
think the "tc_ratespec" was the best choice for two
reasons.

Firstly, having the fields show up in tc_ratespec 
makes it seem like the kernel can use them.  It can't,
as the "overhead" example above shows.  Secondly, from
tc's point of view it is inflexible.  Over time new
features have been be added to "tc", and each time a
new way of encoding it in the existing "tc_ratespec" 
has to be invented.  Thus we now have hacks like the
storing the "overhead" in the upper bits of the MPU
figure.

A better solution would be to provide a TLV (ie a 
TCA_XXX constant) for TC's private use.  From the 
kernels point of view it would be an opaque structure
which just saves and echos back when asked.  This
would solve both problems.

> > However, now you lot have made me go away and think, I have
> > another idea on how to attack this.  Perhaps it will be
> > more palatable to you.  It would replace RTAB and STAB with
> > a 28 byte structure for most protocol stacks - well all I can
> > think of off the top of my head, anyway.  RTAB would have to
> > remain for backwards compatibility, of course.
> 
> Can you describe in more detail?

OK, but first I want to make the point that the only
reason I suggest this is to get some sort of ATM
patch into the kernel, as the current patch on the
table is having a rough time.

Alan Cox made the point earlier (if I understood him
correctly) that this tabling lookup probably isn't
a big win on modern CPU's - we may be better off
moving it all into the kernel.  Thinking about this,
I tried to come up with a way of describing the
mapping between skb->len and the on the wire packet
length for every protocol I know.  This is what I
came up with.

Assume we have a packet length L, which is to be
transported by some protocol.  For now we consider
one protocol only, ie: TCP, PPP, ATM, Ethernet or
whatever.   I will generalise it to multiple protocols
later.  I think a generalised transformation can be 
made using using 5 numbers which are applied in this
order:

  Overhead - A fixed overhead that is added to L.

  Mpu      - Minimum packet size.  If the result of
             (Overhead+L) is smaller that this, then 
             the new result becomes this size.

  Round    - The result is then rounded up to this
             many bytes.  For protocols that always
             transmit single bytes this figure would be
             1.  If there were some protocol that
             transmitted data as 4 byte chunks then this
             would be 4.  For ATM it is 48.

  CellPay  - If the packet is broken down into smaller
             packets when sent, then this is the amount
             of data that will fit into each chunk.

  CallOver - This is the additional overhead each cell
             carries.

The idea is the kernel would do this calculation on the
fly for each packet.  If you represent this set of number 
numbers as a comma separated list in the order they were 
presented above, then here are some examples:

  IP:       20
  Ethernet: 18,64
  PPP:      2
  ATM:      0,0,48,48,5

It may be that 5 numbers are a overkill.  It is for all
protocols I am aware of - for those you could get away
with 4.  But I am no expert.

The next step is to generalise for many protocols.  As
the protocols are stacked the length output by one 
protocol becoming the input length for the downstream 
one.  So we just need to apply the same transformation 
serially. I will use '+' to indicate the stacking.  For 
a typical ATM stack, PPPoE over LLC, we have:

  ppp:2+pppoe:6+ethernet:14,64+llc:8+all5:4+atm:0,0,48,48,5

If this were implemented naively, then the kernel would
have to apply the above calculation 6 times, like this:

  Protocol   InputLength    OutputLength
  ---------  ------------   ----------------
  ppp        skb->len       skb->len+2
  pppoe:     skb->len+2     skb->len+2+6
  ethernet:  skb->len+2+6   skb->len+2+6+14
  ... and so on.

But it can be optimised.  In this particular case we can
combine those six operations into 1:

  adsl_pppoe_llc:34,64,48,48,5

The five numbers have the same meaning as before.  It
it not difficult to come up with a generalised rule that
allows you to do this for most cases.  For the remainder
(if they exist - I can't think of any) the kernel would
have to apply the transformation iteratively.

Before going on, it is worth while comparing this to the
current RTAB solution (and by implication STAB):

  1.  Oddly, the number of steps and hence speed for 
      common protocols is probably the same.  Compare:
        RTAB - You have to add an OverHead in the general
               case.
             - You have to scale by cell_log.
             - You have to ensure the overhead+skb->len
               doesn't overflow / underflow the RTAB.
             - You have to do the lookup.
        New  - You have to add overhead.
             - You have to check the MPU.
             - You have to check if you have to apply
               Round,CellPay,CellOver - but you won't
               have to for any protocol except ATM.

  2.  Because of the cell_log, RTAB gives an 100% accurate
      answer 1 time in every (1<<cell_log) packet lengths.
      The new version is always 100% accurate.

  3.  The new version isn't as flexible as RTAB, as least
      from the kernels point of view.  Conceivably there are 
      protocols that could be handled by RTAB that are not 
      handled by the new one.  Since RTAB is computed in
      user space, this implies these new protocols might
      be handled by a user space change only.  The new
      version would always require a kernel change.  Note 
      however that even RTAB required a kernel change for
      ATM, however.

So far we have what your STAB would provide us with - a
way to calculate the packet length.  It takes 5 int's for 
every protocol stack I can think of.  It probably runs
faster for most protocols but is less robust to the
introduction of new protocols.

But some qdisc's need to know how long it takes to send a 
packet.  This is what RTAB provides us with, in fact.  
So if we were to do away with RTAB completely, then we need
a way for the kernel to covert packet lengths into the time
it takes to send a packet.  This is what I discuss next.
The comments apply to both STAB and the new algorithm above,
as they both compute packet lengths.

If J is the time it takes to send one byte over a link, 
then we can compute RTAB from STAB like this:

  for (int i = 0; i < array_length(STAB); i += 1)
    RTAB[i] = STAB[i] * J;

This is exactly the operation "tc" performs now in user
space.  It is possibly in user space because J is usually
less than 1, and thus is most conveniently represented as
a floating point number.  Floating point operations in
the kernel are verboten.

I can think of two ways to move this operation into the
kernel.  The straight forward way is to represent J as a 
scale and a division.  Ie the kernel does:

  RTAB[i] = (STAB[i] << scale) / divisor.

The second way depends on that fact that most CPU's can
multiply two 32 uint's to produce a 64 bit result in a
single operation.   I don't know whether this operation is
available to kernel code.  But if it is, J can be 
represented as a 32 bit fixed point number, with the
implied decimal point after the most significant 8 bits.
Then this operation would suffice:

  extern long long mul64(unsigned long, a, unsigned long b);

  RTAB[i] = (unsigned long)(mul64(STAB[i], J) >> 24);

This method doesn't use division, and is probably faster
on lower end CPU's.  It would handle 100G Ethernet on a 
machine with Hz == 1000, and 1200 bits/sec on a machine
with Hz == 10000.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-23 12:37             ` Russell Stuart
  2006-06-23 15:21               ` Patrick McHardy
@ 2006-06-24 14:13               ` jamal
  2006-06-26  4:23                 ` Russell Stuart
  2006-07-18  2:06                 ` Russell Stuart
  1 sibling, 2 replies; 56+ messages in thread
From: jamal @ 2006-06-24 14:13 UTC (permalink / raw)
  To: Russell Stuart
  Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Alan Cox,
	Patrick McHardy

On Fri, 2006-23-06 at 22:37 +1000, Russell Stuart wrote: 
>  Sorry for the digest like reply :(

;-> I know this discussion has gone in a few directions already
and i am sure as you kept responding to the emails it became obvious.
Let me just jump to one point so we dont repeat a lot of the same things
over and over..
You can actually stop reading here if you have gathered the view at
this point that i am not objecting to the simple approach Patrick is
going with...

> On Tue, 2006-06-20 at 10:06 -0400, jamal wrote:
>  But you can measure the bandwidth you are 
> getting from your ISP and plug that into the tc command 
> line.  The web page I sent to you describes how to do this
> for ADSL lines.
> 

Indeed and i referred to it in the exchanges. 
And yes, I was arguing that the tc scheme you describe would not be so
bad either if the cost of making a generic change is expensive.

To reiterate the scheduler "owns the resources" of the link
i.e link level bandwidth - aka throughput; however, teaching the
scheduler as in your description in that page about effective
throughput aka "goodput" is also within reason if you knew _it was
consistent_ . The diagram i drew was:

|Linux| --ethernet-- |Modem| --DSL-- |DSLAM| --ATM-- |BRAS| 

As can be seen from above, Linux has direct ownership of the ethernet
link resources (within limits of fixed bandwidth no AN etc). Just like a
disk quota scheduler owns the disk resources or CPU scheduler owns the
CPU cycles etc. Likewise, the DSLAM downstream owns scheduling of the
ATM link resources.

if you have knowledge that ATM runs between DSLAM and BRAS and that
the effective throughput is in-fact lower than the advertised one at
that point in the end2end path, then i admit it is fine to do as you
described in your web-page. 
Given the ubiquity of ADSL it is also ok to have simple tweak knobs for
the scheduler in Linux to allow for it to be able to compensate for the
ATM fragmentation 3 hops down. 

There are a lot of link layer issues that you may end up knowing of
(other than the ATM fragmentation overhead) in regards to something
downstream and you keep adding knobs is just adding more bloat. 
Example: If that 3rd hop was wireless that happened to be doing CDMA RLP
with a lot of retransmits, or wireless that varied its throughput from
1-3Mbps at any point in time or it was a satellite link that had a lot
of latency etc etc. You could always have some way to tweak these via
the kernel. In-fact people have written schedulers specifically for
these sorts of link layer problems (I think even some of the IEEE 802.11
or wimax folks have standardized specific schedulers). You basically
have to draw a line somewhere. My line was "can it be done via user
space? yes - do it there".

Patrick seems to have a simple way to compensate generically for link
layer fragmentation, so i will not argue the practically; hopefully that
settles it? ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-24 14:13               ` jamal
@ 2006-06-26  4:23                 ` Russell Stuart
  2006-07-18  2:06                 ` Russell Stuart
  1 sibling, 0 replies; 56+ messages in thread
From: Russell Stuart @ 2006-06-26  4:23 UTC (permalink / raw)
  To: hadi
  Cc: Russell Stuart, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Alan Cox, Patrick McHardy

On 25/06/2006 12:13 AM, jamal wrote:
> You can actually stop reading here if you have gathered the view at
> this point that i am not objecting to the simple approach Patrick is
> going with...

Perhaps this is my problem.  I am not sure I understand
what Patrick is proposing.  I can wait for his patch, I
guess.

> Indeed and i referred to it in the exchanges. 
> And yes, I was arguing that the tc scheme you describe would not be so
> bad either if the cost of making a generic change is expensive.

OK.  I take it from this you think there is merit in
the idea of adding code so the kernel can calculate
the ATM link speeds correctly.  The discussion is
really about the best way to go about it?

If so, excellent.  I am not really too fussy about how
it is achieved, I just want my VOIP connections to
work well on stock kernels.

> There are a lot of link layer issues that you may end up knowing of
> (other than the ATM fragmentation overhead) in regards to something
> downstream and you keep adding knobs is just adding more bloat. 
> Example: If that 3rd hop was wireless that happened to be doing CDMA RLP
> with a lot of retransmits, or wireless that varied its throughput from
> 1-3Mbps at any point in time or it was a satellite link that had a lot
> of latency etc etc. You could always have some way to tweak these via
> the kernel. In-fact people have written schedulers specifically for
> these sorts of link layer problems (I think even some of the IEEE 802.11
> or wimax folks have standardized specific schedulers). You basically
> have to draw a line somewhere. My line was "can it be done via user
> space? yes - do it there".

If you mean by adding lots of knobs, you mean we need a knob
for 802.11, a knob for ATM, a knob for ethernet and so on,
then we do need lots of knobs.  And you need to know which
of those layers is the bottle neck, so you know what knob to
fit.  But you only ever need one knob on a given link.

I can only think of two ways out of needing lots of knobs.
One is to have a qdisc that doesn't need to know the link
speed in order to shape traffic to it gets to the scheduling
and not someone upstream.  Sounds like black magic to me,
but perhaps HFSC does this - I have not read the papers
yet, but I plan to do so soon.

The second way is to automatically calculate the link speed,
using a daemon perhaps :).  Again it sounds like black
magic.  Note that there is already code in the kernel that
does this, but it lives in the layers above - in TCP and
DCCP.  I am referring to Westwood, and friends.  These
algorithms live in the layers above because the need feed
back from the network - which can only come from the other
end of connection unless ECN is working.

I have not been able to figure out how Patrick intends to
solve these problems from his posts so far, so I am waiting
for his code.  Hopefully it will include a lot of comments.

> Patrick seems to have a simple way to compensate generically for link
> layer fragmentation, so i will not argue the practically; hopefully that
> settles it? ;->

Yes, it does.  It will be interesting to see what Patrick
comes up with.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-24 14:13               ` jamal
  2006-06-26  4:23                 ` Russell Stuart
@ 2006-07-18  2:06                 ` Russell Stuart
  2006-07-18 13:35                   ` jamal
  2006-07-18 21:46                   ` Andy Furniss
  1 sibling, 2 replies; 56+ messages in thread
From: Russell Stuart @ 2006-07-18  2:06 UTC (permalink / raw)
  To: hadi; +Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

On Sat, 2006-06-24 at 10:13 -0400, jamal wrote:
> And yes, I was arguing that the tc scheme you describe would not be so
> bad either if the cost of making a generic change is expensive.
<snip>
> Patrick seems to have a simple way to compensate generically for link
> layer fragmentation, so i will not argue the practically; hopefully that
> settles it? ;->

Things seem to have died down.  Patrick's patch seemed 
unrelated to ATM to me.  I did put up another suggestion, 
but I don't think anybody was too impressed with the 
idea.  So that leave the current ATM patch as the only 
one we have on the table that addresses the ATM issue.

Since you don't think it is "too bad", can we proceed 
with it?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-18  2:06                 ` Russell Stuart
@ 2006-07-18 13:35                   ` jamal
  2006-07-18 21:46                   ` Andy Furniss
  1 sibling, 0 replies; 56+ messages in thread
From: jamal @ 2006-07-18 13:35 UTC (permalink / raw)
  To: Russell Stuart
  Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

Russell,
I apologize i havent followed the latest discussion to the detail.
My understanding of Patricks work was it solved the ATM problem as well.
I think he is busy somewhere - lets give him an opportunity to respond
and i will try to catchup with the thread as well.

cheers,
jamal

On Tue, 2006-18-07 at 12:06 +1000, Russell Stuart wrote:
> On Sat, 2006-06-24 at 10:13 -0400, jamal wrote:
> > And yes, I was arguing that the tc scheme you describe would not be so
> > bad either if the cost of making a generic change is expensive.
> <snip>
> > Patrick seems to have a simple way to compensate generically for link
> > layer fragmentation, so i will not argue the practically; hopefully that
> > settles it? ;->
> 
> Things seem to have died down.  Patrick's patch seemed 
> unrelated to ATM to me.  I did put up another suggestion, 
> but I don't think anybody was too impressed with the 
> idea.  So that leave the current ATM patch as the only 
> one we have on the table that addresses the ATM issue.
> 
> Since you don't think it is "too bad", can we proceed 
> with it?
> 
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-18  2:06                 ` Russell Stuart
  2006-07-18 13:35                   ` jamal
@ 2006-07-18 21:46                   ` Andy Furniss
  2006-07-19  1:02                     ` Russell Stuart
  1 sibling, 1 reply; 56+ messages in thread
From: Andy Furniss @ 2006-07-18 21:46 UTC (permalink / raw)
  To: Russell Stuart
  Cc: hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

Russell Stuart wrote:
> On Sat, 2006-06-24 at 10:13 -0400, jamal wrote:
> 
>>And yes, I was arguing that the tc scheme you describe would not be so
>>bad either if the cost of making a generic change is expensive.
> 
> <snip>
> 
>>Patrick seems to have a simple way to compensate generically for link
>>layer fragmentation, so i will not argue the practically; hopefully that
>>settles it? ;->
> 
> 
> Things seem to have died down.  Patrick's patch seemed 
> unrelated to ATM to me.  I did put up another suggestion, 
> but I don't think anybody was too impressed with the 
> idea.  So that leave the current ATM patch as the only 
> one we have on the table that addresses the ATM issue.

FWIW I think it may be possible to do it Patricks' way, as if I read it 
properly he will end up with the ATM cell train length which gets 
shifted by cell_log and looked up as before. The ATM length will be in 
steps of 53 so with cell_log 3 or 4 I think there will be no collisions 
- so special rate tables for ATM can still be made perfect.

As you say, I think mpu should be added aswell - so eth/other can benefit.


Andy.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-18 21:46                   ` Andy Furniss
@ 2006-07-19  1:02                     ` Russell Stuart
  2006-07-19 14:42                       ` Andy Furniss
  2006-07-19 14:50                       ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy
  0 siblings, 2 replies; 56+ messages in thread
From: Russell Stuart @ 2006-07-19  1:02 UTC (permalink / raw)
  To: lists
  Cc: hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

On Tue, 2006-07-18 at 22:46 +0100, Andy Furniss wrote: 
> FWIW I think it may be possible to do it Patricks' way, as if I read it 
> properly he will end up with the ATM cell train length which gets 
> shifted by cell_log and looked up as before. The ATM length will be in 
> steps of 53 so with cell_log 3 or 4 I think there will be no collisions 
> - so special rate tables for ATM can still be made perfect.

Patrick is proposing that the packet lengths be sent to 
the kernel in a similar way to how transmission times (ie
RTAB) is sent now.  I agree that is how things should be 
done - but it doesn't have much to do with the ATM patch, 
other than he has allowed for ATM in the way he does the 
calculation in the kernel [1].

In particular:

- As it stands, it doesn't help the qdiscs that use 
  RTAB.  So unless he proposes to remove RTAB entirely 
  the ATM patch as it will still have to go in.

- A bit of effort was put into making this current
  ATM patch both backwards and forwards compatible.
  Patricks patch would work with newer kernels,
  obviously.  Older kernels, and in particular the
  kernel that Debian is Etch is likely to distribute
  would miss out.

If Patrick did intend remove RTAB entirely then he
needs to add a fair bit more into his patch.  Since 
RTAB is just STAB scaled, its certainly possible.
The kernel will have to do a shift and a division
for each packet, which I assume is permissible.

> As you say, I think mpu should be added aswell - so eth/other can benefit.

Not really.  The MPU is reflected in the STAB table,
just as it is for RTAB.

One other point - the optimisation Patrick proposes
for STAB (over RTAB) was to make the number of entries
variable.  This seems like a good idea.  However there 
is no such thing as a free lunch, and if you did 
indeed reduce the number of entries to 16 for Ethernet 
(as I think Patrick suggested), then each entry would
cover 1500/16 = 93 different packet lengths.  Ie,
entry 0 would cover packet lengths 0..93, entry 1
94..186, and so on.  A single entry can't be right
for all those packet lengths, so again we are back
to a average 30% error for typical VOIP length
packets.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-19  1:02                     ` Russell Stuart
@ 2006-07-19 14:42                       ` Andy Furniss
  2006-07-19 14:54                         ` Patrick McHardy
  2006-07-19 20:26                         ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer
  2006-07-19 14:50                       ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy
  1 sibling, 2 replies; 56+ messages in thread
From: Andy Furniss @ 2006-07-19 14:42 UTC (permalink / raw)
  To: Russell Stuart
  Cc: hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

Russell Stuart wrote:
> On Tue, 2006-07-18 at 22:46 +0100, Andy Furniss wrote: 
> 
>>FWIW I think it may be possible to do it Patricks' way, as if I read it 
>>properly he will end up with the ATM cell train length which gets 
>>shifted by cell_log and looked up as before. The ATM length will be in 
>>steps of 53 so with cell_log 3 or 4 I think there will be no collisions 
>>- so special rate tables for ATM can still be made perfect.
> 
> 
> Patrick is proposing that the packet lengths be sent to 
> the kernel in a similar way to how transmission times (ie
> RTAB) is sent now.  I agree that is how things should be 
> done - but it doesn't have much to do with the ATM patch, 
> other than he has allowed for ATM in the way he does the 
> calculation in the kernel [1].
> 
> In particular:
> 
> - As it stands, it doesn't help the qdiscs that use 
>   RTAB.  So unless he proposes to remove RTAB entirely 
>   the ATM patch as it will still have to go in.

Hmm - I was just looking at the kernel changes to htb. The only 
difference is the len - I am blindly assuming that it does/will return 
the link lengths properly for atm.

So for atm, qdisc_tx_len(skb) will always return lengths that are 
multiples of 53.

If nothing else were done we would suffer innacuarcy from the cell_log 
just like eth.

But no other kernel hack would be needed to do it perfectly - rather 
like we (who patch for atm already) just fill the tc generated rate 
table with what we like, that would be an option.

> 
> - A bit of effort was put into making this current
>   ATM patch both backwards and forwards compatible.
>   Patricks patch would work with newer kernels,
>   obviously.  Older kernels, and in particular the
>   kernel that Debian is Etch is likely to distribute
>   would miss out.
> 
> If Patrick did intend remove RTAB entirely then he
> needs to add a fair bit more into his patch.  Since 
> RTAB is just STAB scaled, its certainly possible.
> The kernel will have to do a shift and a division
> for each packet, which I assume is permissible.

I guess that is for others to decide :-) I think Patrick has a point 
about sfq/htb drr, Like you I guess, I thought that alot of extra per 
packet calculations would have got an instant NO.

> 
>>As you say, I think mpu should be added aswell - so eth/other can benefit.
> 
> 
> Not really.  The MPU is reflected in the STAB table,
> just as it is for RTAB.

OK, I was thinking of what Jamal said about helping others, so 
everything TC should be capable of accepting mpu and overhead with these 
patches - or is more work needed?

It will be good to be able to say

tc ... police rate 500kbit mpu 60 overhead 24 ... for eth.
(Assuming eth mpu/overhead are really 46/38 - p in mpu is payload AIUI 
so 60 and 24 come from allowing for skb->len being IP+14)

or for ATM + pppoa something like

tc ... police rate 500kbit overhead 10 atm ...

In the case of eth someone already added mpu/overhead for HTB and it 
doesn't need any extra per packet calcs. I guess this way it would.

> 
> One other point - the optimisation Patrick proposes
> for STAB (over RTAB) was to make the number of entries
> variable.  This seems like a good idea.  However there 
> is no such thing as a free lunch, and if you did 
> indeed reduce the number of entries to 16 for Ethernet 
> (as I think Patrick suggested), then each entry would
> cover 1500/16 = 93 different packet lengths.  Ie,
> entry 0 would cover packet lengths 0..93, entry 1
> 94..186, and so on.  A single entry can't be right
> for all those packet lengths, so again we are back
> to a average 30% error for typical VOIP length
> packets.

I agree less accuracy will not be nice. But as an option it could be the 
only way you can do 1/10Gig + jumbo frames.

Andy.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-19 14:42                       ` Andy Furniss
@ 2006-07-19 14:54                         ` Patrick McHardy
  2006-07-19 20:26                         ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer
  1 sibling, 0 replies; 56+ messages in thread
From: Patrick McHardy @ 2006-07-19 14:54 UTC (permalink / raw)
  To: lists
  Cc: Russell Stuart, hadi, Jesper Dangaard Brouer, netdev,
	Stephen Hemminger

Andy Furniss wrote:
> Russell Stuart wrote:
> 
>> - As it stands, it doesn't help the qdiscs that use   RTAB.  So unless
>> he proposes to remove RTAB entirely   the ATM patch as it will still
>> have to go in.
> 
> 
> Hmm - I was just looking at the kernel changes to htb. The only
> difference is the len - I am blindly assuming that it does/will return
> the link lengths properly for atm.
> 
> So for atm, qdisc_tx_len(skb) will always return lengths that are
> multiples of 53.
> 
> If nothing else were done we would suffer innacuarcy from the cell_log
> just like eth.
> 
> But no other kernel hack would be needed to do it perfectly - rather
> like we (who patch for atm already) just fill the tc generated rate
> table with what we like, that would be an option.

That is how it should work. If the calculation doesn't fit, lets fix it.

>> The kernel will have to do a shift and a division
>> for each packet, which I assume is permissible.
> 
> 
> I guess that is for others to decide :-) I think Patrick has a point
> about sfq/htb drr, Like you I guess, I thought that alot of extra per
> packet calculations would have got an instant NO.

Its only done once per packet (currently, it might be interesting to
override the length for specific classes and their childs, for example
if you do queueing on eth0 and have an DSL router one hop apart).
The division is gone in my patch btw.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG)
  2006-07-19 14:42                       ` Andy Furniss
  2006-07-19 14:54                         ` Patrick McHardy
@ 2006-07-19 20:26                         ` Jesper Dangaard Brouer
  2006-07-19 21:00                           ` Alexey Kuznetsov
  1 sibling, 1 reply; 56+ messages in thread
From: Jesper Dangaard Brouer @ 2006-07-19 20:26 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Russell Stuart, hadi, netdev, Stephen Hemminger, Patrick McHardy,
	Alexey Kuznetsov

Russell Stuart wrote:
> 
> - As it stands, it doesn't help the qdiscs that use 
> RTAB.  So unless he proposes to remove RTAB entirely the ATM patch as 
> it will still have to go in.

Here is a very important point here:

  The RTAB (rate-table) in the kernel is NOT aligned, this is the ONLY
  reason why we need to patch the kernel.

This why the kernel RTAB patch is still needed even with Patrick's 
excelent STAB patch.  (Maybe the RTAB system should be removed entirely?)

I have discussed this RTAB issue with Patrick (off list) and he denoted 
this RTAB issue as a regular BUG.

This is the problem with the RTAB:
----------------------------------
The "hash" function for packet size lookups is a simple binary shift
operation.

  rtab[pkt_len>>cell_log] = pkt_xmit_time;

With a cell_log of 3 (the default), this implies that:
     pkt_len  0 to  7 goes into entry 0,
and pkt_len  8 to 15 goes into entry 1,
and pkt_len 16 to 23 goes into entry 2,
and pkt_len 24 to 31 goes into entry 3,
and so on...

Current mapping:
entry[0](maps: 0- 7)=xmit_size:0
entry[1](maps: 8-15)=xmit_size:8
entry[2](maps:16-23)=xmit_size:16
entry[3](maps:24-31)=xmit_size:24
entry[4](maps:32-39)=xmit_size:32
entry[5](maps:40-47)=xmit_size:40
entry[6](maps:48-55)=xmit_size:48

When the table is constructed (in tc_core.c) the pkt_xmit_time is 
calculated from the lower boundary.  Meaning that transmitting a 7 byte 
packet "costs" 0 bytes to transmit.  The zero transmit cost is properly 
not a real-world problem as the IP header bounds the minimum packet size.

for (i=0; i<256; i++) {
   unsigned sz = (i<<cell_log);
   rtab[i] = tc_core_usec2tick(1000000*((double)sz/bps));
}

Basically transmitting a packet size and accounting for a smaller
packet size in the QoS system pose a problem.  We can loose queue
control by transmitting faster than the QoS was configured for.
(maybe a (DoS/queue) attack to loose QoS control with small special
sized packets?)

One possible FIX to the RTAB "BUG":
-----------------------------------
One solution would be to change the table to:
  entry[0](maps:0-7)=xmit_size:7
  entry[1](maps:8-15)=xmit_size:15
  entry[2](maps:16-23)=xmit_size:23
  entry[3](maps:24-31)=xmit_size:31
  entry[4](maps:32-39)=xmit_size:39
  entry[5](maps:40-47)=xmit_size:47
  entry[6](maps:48-55)=xmit_size:55

But most packets align to a even boundary, thus in the general case we
would over estimate. Instead it would be preferable to change the
table (mapping) to:

  rtab[(pkt_len-1)>>cell_log]

and adjusting the xmit_size accordingly.
Giving the table:

  entry[0](maps:1-8)=xmit_size:8
  entry[1](maps:9-16)=xmit_size:16
  entry[2](maps:15-24)=xmit_size:24
  entry[3](maps:24-32)=xmit_size:32
  entry[4](maps:33-40)=xmit_size:40
  entry[5](maps:41-48)=xmit_size:48
  entry[6](maps:47-56)=xmit_size:56

The xmit_size is done like this:

for (i=0; i<256; i++) {
   unsigned sz = ((i+1)<<cell_log);
   rtab[i] = tc_core_usec2tick(1000000*((double)sz/bps));
}

All table lookups should be changed to:
    rtab[(pkt_len-1)>>cell_log]

Another fix for the RTAB:
-------------------------
Remove the RTAB array lookups and calculate the pkt_xmit_time instead.

Guess that Alexey wrote these RTAB lookup in a time where array lookups 
was faster... now we have that memory lookups are the bottleneck.

What about removing the RTAB system entirely?

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG)
  2006-07-19 20:26                         ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer
@ 2006-07-19 21:00                           ` Alexey Kuznetsov
  2006-07-20  5:47                             ` Russell Stuart
  0 siblings, 1 reply; 56+ messages in thread
From: Alexey Kuznetsov @ 2006-07-19 21:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andy Furniss, Russell Stuart, hadi, netdev, Stephen Hemminger,
	Patrick McHardy

Hello!

> Guess that Alexey wrote these RTAB lookup in a time where array lookups 
> was faster... now we have that memory lookups are the bottleneck.

No, they were slower from the very beginning. If I remember correctly,
there is comment about this somewhere.

I just did not find any simple way to do 32 bit fixed point arithmetics
scaling from bps to Gbps and was lazy to investigate this further,
tables are much simpler and more flexible.

> What about removing the RTAB system entirely?

Well, if fixed point arithmetics is not a problem.

Plus, remember, the function is not R*size, it is at least
R*size+addend, to account for link overhead. Plus account for padding
of small packets. Plus, when policing it should deaccount already added
link headers, QoS counts only network payload.

Alexey

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG)
  2006-07-19 21:00                           ` Alexey Kuznetsov
@ 2006-07-20  5:47                             ` Russell Stuart
  2006-07-20 23:49                               ` Alexey Kuznetsov
  0 siblings, 1 reply; 56+ messages in thread
From: Russell Stuart @ 2006-07-20  5:47 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Jesper Dangaard Brouer, Andy Furniss, hadi, netdev,
	Stephen Hemminger, Patrick McHardy

On Thu, 2006-07-20 at 01:00 +0400, Alexey Kuznetsov wrote:
> Hello!

So you really do exist?  I thought it was just
rumour.

> Well, if fixed point arithmetics is not a problem.

It shouldn't be.  Any decimal number can be expressed
as a fraction, eg:

  0.00123 = 123/100000

Which can be calculated as a multiply and a divide. With
MTU's up to 2048, it should be possible to do this with
99.9999% accuracy (ie 2048/2^23).

With a bit more work in userspace (ie in tc), it can be
be reduced to a multiply and a shift.

> Plus, remember, the function is not R*size, it is at least
> R*size+addend, to account for link overhead. Plus account for padding
> of small packets. Plus, when policing it should deaccount already added
> link headers, QoS counts only network payload.

Yes, it is flexible - and has served us well up until
now.  It doesn't work well for ATM, but with a small
bit of extra calculation in the kernel it could.
However, it turns out that ATM is a special case.  If 
ATM's cell payload was 58 bytes instead of 48 bytes 
(say), then it would not be possible to produce a RTAB 
that had small errors (eg < 10%) for smallish packet 
sizes (< 290 bytes).  I seem to have trouble 
explaining why in a concise way that people understand, 
so I won't try here.

So when Alan Cox said our ATM patch didn't solve the 
packetisation problem in general, he was right as our
patch just built upon RTAB.  Patrick's STAB proposal 
in general either for that matter, as it is just another 
implementation of RTAB with the same limitations.  The 
only way I can think of to solve it in general is to 
move many more calculations into the kernel - as I 
proposed in a long winded answer to Patrick earlier 
in this thread.

But doing so would get rid of the table implementation 
and the flexibility it has given us to date.  For that 
reason I feel uncomfortable with it.

The engineering decision becomes this - are there any
other protocols like ATM out there that could justify 
such a change?  (In my more cynical moments I think of 
it differently - has/is the world going to make a 
second engineering fuck up on the scale of ATM again?  
How on earth did anyone decide that pushing data 
packets over ATM, as happens in ADSL, was a good 
idea?)  I know of no other such protocols.  But then
I don't have an encyclopedic knowledge of comms
protocols, so that doesn't mean much.  I suspect you
know a good deal more about them than I do.  What say
you?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG)
  2006-07-20  5:47                             ` Russell Stuart
@ 2006-07-20 23:49                               ` Alexey Kuznetsov
  0 siblings, 0 replies; 56+ messages in thread
From: Alexey Kuznetsov @ 2006-07-20 23:49 UTC (permalink / raw)
  To: Russell Stuart
  Cc: Jesper Dangaard Brouer, Andy Furniss, hadi, netdev,
	Stephen Hemminger, Patrick McHardy

Hello!

> It shouldn't be.  Any decimal number can be expressed
> as a fraction, eg:

I remember this. :-) I stalled selecting corrects divisors
to fight over/underflows. Not becuase it was difficult,
just because did not see a reason to do this.

> But doing so would get rid of the table implementation 
> and the flexibility it has given us to date.  For that 
> reason I feel uncomfortable with it.
> 
> The engineering decision becomes this - are there any
> other protocols like ATM out there that could justify 
> such a change? 

Is it faster? You say, yes. Is it required? You say, yes.
Is there some protocols, which needs more flexibility? No.

> know a good deal more about them than I do.  What say
> you?

Frankly, I seriously believed that rtabs is a good way to handle ATM. :-)
I seriously believed that you have to do something like:
((packet_size+cell_payload-1)/cell_payload)*cell_size

So, if in reality even this protocol does not justify keeping ratbs, kill
them.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-19  1:02                     ` Russell Stuart
  2006-07-19 14:42                       ` Andy Furniss
@ 2006-07-19 14:50                       ` Patrick McHardy
  2006-07-20  4:56                         ` Russell Stuart
  1 sibling, 1 reply; 56+ messages in thread
From: Patrick McHardy @ 2006-07-19 14:50 UTC (permalink / raw)
  To: Russell Stuart
  Cc: lists, hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger

Russell Stuart wrote:
> On Tue, 2006-07-18 at 22:46 +0100, Andy Furniss wrote: 
> 
>>FWIW I think it may be possible to do it Patricks' way, as if I read it 
>>properly he will end up with the ATM cell train length which gets 
>>shifted by cell_log and looked up as before. The ATM length will be in 
>>steps of 53 so with cell_log 3 or 4 I think there will be no collisions 
>>- so special rate tables for ATM can still be made perfect.

Please excuse my silence, I was travelling and am still catching up
with my mails.

> Patrick is proposing that the packet lengths be sent to 
> the kernel in a similar way to how transmission times (ie
> RTAB) is sent now.  I agree that is how things should be 
> done - but it doesn't have much to do with the ATM patch, 
> other than he has allowed for ATM in the way he does the 
> calculation in the kernel [1].
> 
> In particular:
> 
> - As it stands, it doesn't help the qdiscs that use 
>   RTAB.  So unless he proposes to remove RTAB entirely 
>   the ATM patch as it will still have to go in.

Why? The length calculated by my STABs (or something similar)
is used by _all_ qdiscs. Not only for transmission time calculation,
but also for statistics and estimators. If the length calculation
doesn't fit for ATM, that can be fixed.

> - A bit of effort was put into making this current
>   ATM patch both backwards and forwards compatible.
>   Patricks patch would work with newer kernels,
>   obviously.  Older kernels, and in particular the
>   kernel that Debian is Etch is likely to distribute
>   would miss out.

True, but it provides more consistency, and making current
kernels behave better is more important than old kernels.

> If Patrick did intend remove RTAB entirely then he
> needs to add a fair bit more into his patch.  Since 
> RTAB is just STAB scaled, its certainly possible.
> The kernel will have to do a shift and a division
> for each packet, which I assume is permissible.

You seem to have misunderstood my patch. It doesn't need to
touch RTABs, it just calculates the packet length as seen
on the wire (whereever it is) and uses that thoughout the
entire qdisc layer.

>>As you say, I think mpu should be added aswell - so eth/other can benefit.
> 
> 
> Not really.  The MPU is reflected in the STAB table,
> just as it is for RTAB.
> 
> One other point - the optimisation Patrick proposes
> for STAB (over RTAB) was to make the number of entries
> variable.  This seems like a good idea.  However there 
> is no such thing as a free lunch, and if you did 
> indeed reduce the number of entries to 16 for Ethernet 
> (as I think Patrick suggested), then each entry would
> cover 1500/16 = 93 different packet lengths.  Ie,
> entry 0 would cover packet lengths 0..93, entry 1
> 94..186, and so on.  A single entry can't be right
> for all those packet lengths, so again we are back
> to a average 30% error for typical VOIP length
> packets.

My patch doesn't uses fixed sized cells, so it can deal
with anything, worst case is you use one cell per packet
size. Optimizing size and lookup speed for ethernet makes
a lot more sense than optimizing for ADSL.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-19 14:50                       ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy
@ 2006-07-20  4:56                         ` Russell Stuart
  2006-07-30 23:06                           ` Russell Stuart
  0 siblings, 1 reply; 56+ messages in thread
From: Russell Stuart @ 2006-07-20  4:56 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: lists, hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger

On Wed, 2006-07-19 at 16:50 +0200, Patrick McHardy wrote:
> Please excuse my silence, I was travelling and am still catching up
> with my mails.

Sorry.  Had I realised you were busy I would of
waited.

> > - As it stands, it doesn't help the qdiscs that use 
> >   RTAB.  So unless he proposes to remove RTAB entirely 
> >   the ATM patch as it will still have to go in.
> 
> Why? The length calculated by my STABs (or something similar)
> is used by _all_ qdiscs. Not only for transmission time calculation,
> but also for statistics and estimators.

Oh.  I didn't see where it is used for the time 
calculation in your patch.  Did I miss something,
or is that the unfinished bit?

This is possibly my stumbling block.  If you don't remove
RTAB the ATM patch as stands will be needed.  Your patch
didn't remove RTAB, and you didn't say it was intended to,
so I presume it wasn't going to.

>  If the length calculation
> doesn't fit for ATM, that can be fixed.

Yes of course.  Just to be clear: as far as I am concerned
this never was an issue.

> > - A bit of effort was put into making this current
> >   ATM patch both backwards and forwards compatible.
> >   Patricks patch would work with newer kernels,
> >   obviously.  Older kernels, and in particular the
> >   kernel that Debian is Etch is likely to distribute
> >   would miss out.
> 
> True, but it provides more consistency, and making current
> kernels behave better is more important than old kernels.

I guess provided the new "tc" works with older kernels this
is OK - although a disappoint to me.  Works here being defined
as "works as well as a previous the version of tc does".  For 
me not working would be OK as well provided "tc" issued a 
warning message to the effect that it "needs kernel version 
XXX or above"", but doing that would probably require it to 
look at the kernel version.  Looking at the kernel version 
in tc seems to be frowned upon.

> You seem to have misunderstood my patch. It doesn't need to
> touch RTABs, it just calculates the packet length as seen
> on the wire (whereever it is) and uses that thoughout the
> entire qdisc layer.

No, you have it in reverse - as I said above.  My problem is 
that your patch does not touch RTAB.  Several qdiscs really 
don't care about the length of a packet (other than for 
keeping track of stats) - they just care about how long 
it takes to send.  Off the top of my these are HTB, CBQ 
and TBF.  They use RTAB to make this calculation.  So unless
you replace RTAB with STAB the current ATM patch will still 
be needed.

> > One other point - the optimisation Patrick proposes
> > for STAB (over RTAB) was to make the number of entries
> > variable.  This seems like a good idea.  However there 
> > is no such thing as a free lunch, and if you did 
> > indeed reduce the number of entries to 16 for Ethernet 
> > (as I think Patrick suggested), then each entry would
> > cover 1500/16 = 93 different packet lengths.  Ie,
> > entry 0 would cover packet lengths 0..93, entry 1
> > 94..186, and so on.  A single entry can't be right
> > for all those packet lengths, so again we are back
> > to a average 30% error for typical VOIP length
> > packets.
> 
> My patch doesn't uses fixed sized cells, so it can deal
> with anything, worst case is you use one cell per packet
> size. Optimizing size and lookup speed for ethernet makes
> a lot more sense than optimizing for ADSL.

I was just responding to a point you made earlier, when
you said STAB could only use 16 entries as opposed to the
256 used by RTAB.  I suspect nobody would actually do that 
because of the inaccuracy it creates, so the comparison is
perhaps unfair.  I agree the flexibility of making STAB 
variable length is a good idea, and comes at 0 cost in 
the kernel.

Andy Furniss wrote:
> > Russell Stuart wrote:
> >> The kernel will have to do a shift and a division
> >> for each packet, which I assume is permissible.
> > 
> > 
> > I guess that is for others to decide :-) I think Patrick has a point
> > about sfq/htb drr, Like you I guess, I thought that alot of extra per
> > packet calculations would have got an instant NO.
> 
> Its only done once per packet (currently, it might be interesting to
> override the length for specific classes and their childs, for example
> if you do queueing on eth0 and have an DSL router one hop apart).
> The division is gone in my patch btw.

Unlike the packet length the time calculation can't be
cached in the skb.  Most classes in HTB/CBQ use different
packet transmission rates.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-20  4:56                         ` Russell Stuart
@ 2006-07-30 23:06                           ` Russell Stuart
  2006-08-08 22:01                             ` Russell Stuart
  0 siblings, 1 reply; 56+ messages in thread
From: Russell Stuart @ 2006-07-30 23:06 UTC (permalink / raw)
  To: hadi
  Cc: lists, hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

On Thu, 2006-07-20 at 14:56 +1000, Russell Stuart wrote:
> On Wed, 2006-07-19 at 16:50 +0200, Patrick McHardy wrote:
> > Please excuse my silence, I was travelling and am still catching up
> > with my mails.
> 
> Sorry.  Had I realised you were busy I would of
> waited.
> 
> > > - As it stands, it doesn't help the qdiscs that use 
> > >   RTAB.  So unless he proposes to remove RTAB entirely 
> > >   the ATM patch as it will still have to go in.
> > 
> > Why? The length calculated by my STABs (or something similar)
> > is used by _all_ qdiscs. Not only for transmission time calculation,
> > but also for statistics and estimators.
> 
> Oh.  I didn't see where it is used for the time 
> calculation in your patch.  Did I miss something,
> or is that the unfinished bit?
> 
> This is possibly my stumbling block.  If you don't remove
> RTAB the ATM patch as stands will be needed.  Your patch
> didn't remove RTAB, and you didn't say it was intended to,
> so I presume it wasn't going to.

It has gone quiet again.  In my mind the one unresolved issue
is whether Patrick intended to remove RTAB with his patch.
If not, the ATM patch as it stands will have to go in.

Patrick - it would be nice to hear from you.




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-07-30 23:06                           ` Russell Stuart
@ 2006-08-08 22:01                             ` Russell Stuart
  2006-08-09 11:33                               ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Russell Stuart @ 2006-08-08 22:01 UTC (permalink / raw)
  To: hadi
  Cc: lists, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

On Mon, 2006-07-31 at 09:06 +1000, Russell Stuart wrote:
> It has gone quiet again.  In my mind the one unresolved issue
> is whether Patrick intended to remove RTAB with his patch.
> If not, the ATM patch as it stands will have to go in.
> 
> Patrick - it would be nice to hear from you.

It appears Patrick isn't going to reply.  The situation as
I see it is:

- Patrick's last post said his STAB patch wasn't going to
  touch RTAB, so regardless of whether Patrick's STAB
  patch proceeds or not the ATM patch will be needed to
  make the current kernel work with ATM.

- If I read Jesper's last post correctly, he said Patrick
  acknowledged to him in a private email that RTAB had to 
  change to accommodate ATM, and in fact called it a bug
  in RTAB.  Fixing this "bug" is what the kernel part of
  the current ATM patch does - and it does it in a 
  backward compatible way.

- Jamal, you said that this ATM patch should go in if 
  Patrick didn't come up with a better solution.

Jamal - unless there are other outstanding issues I have 
missed or someone has had second thoughts this means you
should be OK with the patch going in.  Can we get it into
Dave M's tree now?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-08-08 22:01                             ` Russell Stuart
@ 2006-08-09 11:33                               ` jamal
  2006-09-04 10:37                                 ` Russell Stuart
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-08-09 11:33 UTC (permalink / raw)
  To: Russell Stuart
  Cc: lists, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

On Wed, 2006-09-08 at 08:01 +1000, Russell Stuart wrote:
> On Mon, 2006-07-31 at 09:06 +1000, Russell Stuart wrote:
> > It has gone quiet again.  In my mind the one unresolved issue
> > is whether Patrick intended to remove RTAB with his patch.
> > If not, the ATM patch as it stands will have to go in.
> > 
> > Patrick - it would be nice to hear from you.
> 
> It appears Patrick isn't going to reply.  The situation as
> I see it is:
> 
> - Patrick's last post said his STAB patch wasn't going to
>   touch RTAB, so regardless of whether Patrick's STAB
>   patch proceeds or not the ATM patch will be needed to
>   make the current kernel work with ATM.
> 
> - If I read Jesper's last post correctly, he said Patrick
>   acknowledged to him in a private email that RTAB had to 
>   change to accommodate ATM, and in fact called it a bug
>   in RTAB.  Fixing this "bug" is what the kernel part of
>   the current ATM patch does - and it does it in a 
>   backward compatible way.
> 
> - Jamal, you said that this ATM patch should go in if 
>   Patrick didn't come up with a better solution.
> 
> Jamal - unless there are other outstanding issues I have 
> missed or someone has had second thoughts this means you
> should be OK with the patch going in.  Can we get it into
> Dave M's tree now?

Hi Russell,

My apologies; at some point i lost track of the discussion because
things seemed under control. 
Lets hear from Patrick - I have a lot of faith in him; i just happen to
know he is occupied elsewhere as we speak (which may explain his
silence). So please be a little patient with him. In the meantime please
dont be discouraged in continuing to contribute - sometimes this is how
things go.

You are right that if there is no generic solution from Patrick we
should not hold your patch. My understanding of Patrick's patch is it
covers for your case; i may have misread that given you and him are
still discussing the differences. 

Perhaps one approach is to push your patch in first and then when
Patrick is ready he can operate on that. Patrick, thoughts?

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-08-09 11:33                               ` jamal
@ 2006-09-04 10:37                                 ` Russell Stuart
  0 siblings, 0 replies; 56+ messages in thread
From: Russell Stuart @ 2006-09-04 10:37 UTC (permalink / raw)
  To: hadi
  Cc: lists, Jesper Dangaard Brouer, netdev, Stephen Hemminger,
	Patrick McHardy

On Wed, 2006-08-09 at 07:33 -0400, jamal wrote:
> > Jamal - unless there are other outstanding issues I have 
> > missed or someone has had second thoughts this means you
> > should be OK with the patch going in.  Can we get it into
> > Dave M's tree now?
> 
> Hi Russell,
> 
> My apologies; at some point i lost track of the discussion because
> things seemed under control. 
> Lets hear from Patrick - I have a lot of faith in him; i just happen to
> know he is occupied elsewhere as we speak (which may explain his
> silence). So please be a little patient with him. In the meantime please
> dont be discouraged in continuing to contribute - sometimes this is how
> things go.
> 
> You are right that if there is no generic solution from Patrick we
> should not hold your patch. My understanding of Patrick's patch is it
> covers for your case; i may have misread that given you and him are
> still discussing the differences. 
> 
> Perhaps one approach is to push your patch in first and then when
> Patrick is ready he can operate on that. Patrick, thoughts?

Jamal, I think its time to assume Patrick has lost interest.
Is it possible to get this in now?


-- 
VGER BF report: U 0.542723

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14  9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer
  2006-06-14 12:06 ` jamal
@ 2006-06-14 14:27 ` Phillip Susi
  2006-06-14 15:08   ` Jesper Dangaard Brouer
  2006-06-20  5:35 ` Chris Wedgwood
  2 siblings, 1 reply; 56+ messages in thread
From: Phillip Susi @ 2006-06-14 14:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Stephen Hemminger, Jamal Hadi Salim, netdev, lartc, russell-tcatm,
	hawk, linux-kernel

Jesper Dangaard Brouer wrote:
> The Linux traffic's control engine inaccurately calculates
> transmission times for packets sent over ADSL links.  For
> some packet sizes the error rises to over 50%.  This occurs
> because ADSL uses ATM as its link layer transport, and ATM
> transmits packets in fixed sized 53 byte cells.
> 

I could have sworn that DSL uses its own framing protocol that is 
similar to the frame/superframe structure of HDSL ( T1 ) lines and over 
that you can run ATM or ethernet.  Or is it typically ethernet -> ATM -> 
HDSL?

In any case, why does the kernel care about the exact time that the IP 
packet has been received and reassembled on the headend?



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14 14:27 ` Phillip Susi
@ 2006-06-14 15:08   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 56+ messages in thread
From: Jesper Dangaard Brouer @ 2006-06-14 15:08 UTC (permalink / raw)
  To: Phillip Susi; +Cc: netdev, russell-tcatm, hawk

[-- Attachment #1: Type: text/plain, Size: 1184 bytes --]

On Wed, 2006-06-14 at 10:27 -0400, Phillip Susi wrote:
> Jesper Dangaard Brouer wrote:
> > The Linux traffic's control engine inaccurately calculates
> > transmission times for packets sent over ADSL links.  For
> > some packet sizes the error rises to over 50%.  This occurs
> > because ADSL uses ATM as its link layer transport, and ATM
> > transmits packets in fixed sized 53 byte cells.
> > 
> 
> I could have sworn that DSL uses its own framing protocol that is 
> similar to the frame/superframe structure of HDSL ( T1 ) lines and over 
> that you can run ATM or ethernet.  Or is it typically ethernet -> ATM -> 
> HDSL?

Nope, not according to the ADSL standards G.992.1 and G.992.2.

> In any case, why does the kernel care about the exact time that the IP 
> packet has been received and reassembled on the headend?

I think you have misunderstood what the rate table does...
(There is an explaination in the thesis page 57 section 6.1.2)
http://www.adsl-optimizer.dk/thesis/

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-14  9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer
  2006-06-14 12:06 ` jamal
  2006-06-14 14:27 ` Phillip Susi
@ 2006-06-20  5:35 ` Chris Wedgwood
  2006-06-20  7:33   ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 56+ messages in thread
From: Chris Wedgwood @ 2006-06-20  5:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev

On Wed, Jun 14, 2006 at 11:40:04AM +0200, Jesper Dangaard Brouer wrote:

> The Linux traffic's control engine inaccurately calculates
> transmission times for packets sent over ADSL links.  For some
> packet sizes the error rises to over 50%.  This occurs because ADSL
> uses ATM as its link layer transport, and ATM transmits packets in
> fixed sized 53 byte cells.

What if AAL5 is used?  The cell-alignment math is going to be wrong
there surely?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
  2006-06-20  5:35 ` Chris Wedgwood
@ 2006-06-20  7:33   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 56+ messages in thread
From: Jesper Dangaard Brouer @ 2006-06-20  7:33 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: netdev, russell-tcatm

[-- Attachment #1: Type: text/plain, Size: 882 bytes --]


On Mon, 2006-06-19 at 22:35 -0700, Chris Wedgwood wrote:
> On Wed, Jun 14, 2006 at 11:40:04AM +0200, Jesper Dangaard Brouer wrote:
> 
> > The Linux traffic's control engine inaccurately calculates
> > transmission times for packets sent over ADSL links.  For some
> > packet sizes the error rises to over 50%.  This occurs because ADSL
> > uses ATM as its link layer transport, and ATM transmits packets in
> > fixed sized 53 byte cells.
> 
> What if AAL5 is used?  The cell-alignment math is going to be wrong
> there surely?

Actually it _is_ AAL5 which is accounted for.

See Chapter 5 "ADSL link layer overhead" (page 48-54).
http://www.adsl-optimizer.dk/thesis/main_final_hyper.pdf

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2006-09-04 10:39 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-14  9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer
2006-06-14 12:06 ` jamal
2006-06-14 12:55   ` Jesper Dangaard Brouer
2006-06-15 12:57     ` jamal
2006-06-15 13:16     ` jamal
2006-06-20  1:04       ` Patrick McHardy
2006-06-20 14:59         ` jamal
2006-06-20 15:16           ` Patrick McHardy
2006-06-21 12:21             ` Krzysztof Matusik
2006-06-21 12:54               ` Patrick McHardy
2006-06-21 14:33                 ` Krzysztof Matusik
2006-06-14 15:32   ` Andy Furniss
2006-06-20  0:54   ` Patrick McHardy
2006-06-20 14:56     ` jamal
2006-06-20 15:09       ` Patrick McHardy
2006-06-22 18:41         ` jamal
2006-06-23 14:32           ` Patrick McHardy
2006-06-24 14:39             ` jamal
2006-06-26 11:21               ` Patrick McHardy
2006-06-27 13:01                 ` jamal
2006-07-02  4:23                   ` Patrick McHardy
2006-07-02 13:59                     ` jamal
     [not found]   ` <1150287983.3246.27.camel@ras.pc.brisbane.lube>
     [not found]     ` <1150292693.5197.1.camel@jzny2>
     [not found]       ` <1150843471.17455.2.camel@ras.pc.brisbane.lube>
     [not found]         ` <15653CE98281AD4FBD7F70BCEE3666E53CD54A@comxexch01.comx.local>
     [not found]           ` <1151000966.5392.34.camel@jzny2>
2006-06-23 12:37             ` Russell Stuart
2006-06-23 15:21               ` Patrick McHardy
2006-06-26  0:45                 ` Russell Stuart
2006-06-26 11:10                   ` Patrick McHardy
2006-06-27  6:19                     ` Russell Stuart
2006-06-27 17:18                       ` Patrick McHardy
2006-07-04 13:29                       ` Patrick McHardy
2006-07-04 19:29                         ` jamal
2006-07-04 23:53                           ` Patrick McHardy
2006-07-06  0:39                         ` Russell Stuart
2006-07-07  8:00                           ` Patrick McHardy
2006-07-10  8:44                             ` Russell Stuart
2006-06-24 14:13               ` jamal
2006-06-26  4:23                 ` Russell Stuart
2006-07-18  2:06                 ` Russell Stuart
2006-07-18 13:35                   ` jamal
2006-07-18 21:46                   ` Andy Furniss
2006-07-19  1:02                     ` Russell Stuart
2006-07-19 14:42                       ` Andy Furniss
2006-07-19 14:54                         ` Patrick McHardy
2006-07-19 20:26                         ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer
2006-07-19 21:00                           ` Alexey Kuznetsov
2006-07-20  5:47                             ` Russell Stuart
2006-07-20 23:49                               ` Alexey Kuznetsov
2006-07-19 14:50                       ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy
2006-07-20  4:56                         ` Russell Stuart
2006-07-30 23:06                           ` Russell Stuart
2006-08-08 22:01                             ` Russell Stuart
2006-08-09 11:33                               ` jamal
2006-09-04 10:37                                 ` Russell Stuart
2006-06-14 14:27 ` Phillip Susi
2006-06-14 15:08   ` Jesper Dangaard Brouer
2006-06-20  5:35 ` Chris Wedgwood
2006-06-20  7:33   ` Jesper Dangaard Brouer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).