* [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
@ 2006-06-14 9:40 Jesper Dangaard Brouer
2006-06-14 12:06 ` jamal
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: Jesper Dangaard Brouer @ 2006-06-14 9:40 UTC (permalink / raw)
To: Stephen Hemminger, Jamal Hadi Salim, netdev, lartc
Cc: russell-tcatm, hawk, hawk, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1311 bytes --]
The Linux traffic's control engine inaccurately calculates
transmission times for packets sent over ADSL links. For
some packet sizes the error rises to over 50%. This occurs
because ADSL uses ATM as its link layer transport, and ATM
transmits packets in fixed sized 53 byte cells.
The following patches to iproute2 and the kernel add an
option to calculate traffic transmission times over all
ATM links, including ADSL, with perfect accuracy.
A longer presentation of the patch, its rational, what it
does and how to use it can be found here:
http://www.stuart.id.au/russell/files/tc/tc-atm/
A earlier version of the patch, and a _detailed_ empirical
investigation of its effects can be found here:
http://www.adsl-optimizer.dk/
The patches are both backwards and forwards compatible.
This means unpatched kernels will work with a patched
version of iproute2, and an unpatched iproute2 will work
on patches kernels.
This is a combined effort of Jesper Brouer and Russell Stuart,
to get these patches into the upstream kernel.
Let the discussion start about what we need to change to get this
upstream?
We see this as a feature enhancement, as thus hope that it can be
queued in davem's net-2.6.18.git tree.
---
Regards,
Jesper Brouer & Russell Stuart.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer @ 2006-06-14 12:06 ` jamal 2006-06-14 12:55 ` Jesper Dangaard Brouer ` (3 more replies) 2006-06-14 14:27 ` Phillip Susi 2006-06-20 5:35 ` Chris Wedgwood 2 siblings, 4 replies; 56+ messages in thread From: jamal @ 2006-06-14 12:06 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger I have taken linux-kernel off the list. Russell's site is inaccessible to me (I actually think this is related to some DNS issues i may be having) and your masters is too long to spend 2 minutes and glean it; so heres a question or two for you: - Have you tried to do a long-lived session such as a large FTP and seen how far off the deviation was? That would provide some interesting data point. - To be a devil's advocate (and not claim there is no issue), where do you draw the line with "overhead"? Example the smallest ethernet packet is 64 bytes of which 14 bytes are ethernet headers ("overhead" for IP) - and this is not counting CRC etc. If you were to set an MTU of say 64 bytes and tried to do a http or ftp, how accurate do you think the calculation would be? I would think not very different. Does it matter if it is accurate on the majority of the cases? - For further reflection: Have you considered the case where the rate table has already been considered on some link speed in user space and then somewhere post-config the physical link speed changes? This would happen in the case where ethernet AN is involved and the partner makes some changes (use ethtool). I would say the last bullet is a more interesting problem than a corner case of some link layer technology that has high overhead. Your work would be more interesting if it was generic for many link layers instead of just ATM. cheers, jamal On Wed, 2006-14-06 at 11:40 +0200, Jesper Dangaard Brouer wrote: > The Linux traffic's control engine inaccurately calculates > transmission times for packets sent over ADSL links. For > some packet sizes the error rises to over 50%. This occurs > because ADSL uses ATM as its link layer transport, and ATM > transmits packets in fixed sized 53 byte cells. > > The following patches to iproute2 and the kernel add an > option to calculate traffic transmission times over all > ATM links, including ADSL, with perfect accuracy. > > A longer presentation of the patch, its rational, what it > does and how to use it can be found here: > http://www.stuart.id.au/russell/files/tc/tc-atm/ > > A earlier version of the patch, and a _detailed_ empirical > investigation of its effects can be found here: > http://www.adsl-optimizer.dk/ > > The patches are both backwards and forwards compatible. > This means unpatched kernels will work with a patched > version of iproute2, and an unpatched iproute2 will work > on patches kernels. > > > This is a combined effort of Jesper Brouer and Russell Stuart, > to get these patches into the upstream kernel. > > Let the discussion start about what we need to change to get this > upstream? > > We see this as a feature enhancement, as thus hope that it can be > queued in davem's net-2.6.18.git tree. > > --- > Regards, > Jesper Brouer & Russell Stuart. > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 12:06 ` jamal @ 2006-06-14 12:55 ` Jesper Dangaard Brouer 2006-06-15 12:57 ` jamal 2006-06-15 13:16 ` jamal 2006-06-14 15:32 ` Andy Furniss ` (2 subsequent siblings) 3 siblings, 2 replies; 56+ messages in thread From: Jesper Dangaard Brouer @ 2006-06-14 12:55 UTC (permalink / raw) To: hadi; +Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger [-- Attachment #1: Type: text/plain, Size: 4519 bytes --] On Wed, 2006-06-14 at 08:06 -0400, jamal wrote: > Russell's site is inaccessible to me (I actually think this is related > to some DNS issues i may be having) Strange, I have access to Russell's site. Maybe its his redirect feature that confuses your browser, try: http://ace-host.stuart.id.au/russell/files/tc/tc-atm/ > and your masters is too long to > spend 2 minutes and glean it; so heres a question or two for you: Yes, I is quite long and very detailed. But it worth reading (... says the author him self ;-)) > - Have you tried to do a long-lived session such as a large FTP and > seen how far off the deviation was? That would provide some interesting > data point. The deviation can be calculated. The impact is of cause small for large packets. But the argument that bulk TCP transfers is not as badly affected, is wrong because all the TCP ACK packets gets maximum penalty. On an ADSL link with more than 8 bytes overhead, a 40 bytes TCP ACK will use more that one ATM frame, causing 2 ATM frames to be send that consumes 106 bytes, eg. 62% overhead. On a small upstream ADSL line that hurts! (See thesis page 53, table 5.3 "Overhead summary"). > - To be a devil's advocate (and not claim there is no issue), where do > you draw the line with "overhead"? > Example the smallest ethernet packet is 64 bytes of which 14 bytes are > ethernet headers ("overhead" for IP) - and this is not counting CRC etc. > If you were to set an MTU of say 64 bytes and tried to do a http or ftp, > how accurate do you think the calculation would be? I would think not > very different. I do think we handle this situation, but I'm not quite sure that I fully understand the question (sorry). > Does it matter if it is accurate on the majority of the cases? > - For further reflection: Have you considered the case where the rate > table has already been considered on some link speed in user space and > then somewhere post-config the physical link speed changes? This would > happen in the case where ethernet AN is involved and the partner makes > some changes (use ethtool). > > I would say the last bullet is a more interesting problem than a corner > case of some link layer technology that has high overhead. We only claim to do magic on ATM/ADSL links... nothing else ;-) > Your work would be more interesting if it was generic for many link > layers instead of just ATM. Well, we did consider to do so, but we though that it would be harder to get it into the kernel. Actually thats the reason for the defines: #define ATM_CELL_SIZE 53 #define ATM_CELL_PAYLOAD 48 Changing these should should make it possible to adapt to any other SAR (Segment And Reasembly) link layer. > On Wed, 2006-14-06 at 11:40 +0200, Jesper Dangaard Brouer wrote: > > The Linux traffic's control engine inaccurately calculates > > transmission times for packets sent over ADSL links. For > > some packet sizes the error rises to over 50%. This occurs > > because ADSL uses ATM as its link layer transport, and ATM > > transmits packets in fixed sized 53 byte cells. > > > > The following patches to iproute2 and the kernel add an > > option to calculate traffic transmission times over all > > ATM links, including ADSL, with perfect accuracy. > > > > A longer presentation of the patch, its rational, what it > > does and how to use it can be found here: > > http://www.stuart.id.au/russell/files/tc/tc-atm/ > > > > A earlier version of the patch, and a _detailed_ empirical > > investigation of its effects can be found here: > > http://www.adsl-optimizer.dk/ > > > > The patches are both backwards and forwards compatible. > > This means unpatched kernels will work with a patched > > version of iproute2, and an unpatched iproute2 will work > > on patches kernels. > > > > > > This is a combined effort of Jesper Brouer and Russell Stuart, > > to get these patches into the upstream kernel. > > > > Let the discussion start about what we need to change to get this > > upstream? > > > > We see this as a feature enhancement, as thus hope that it can be > > queued in davem's net-2.6.18.git tree. > > > > --- > > Regards, > > Jesper Brouer & Russell Stuart. > > > Thanks for your comments :-) -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network developer Cand. Scient Datalog / MSc. Author of http://adsl-optimizer.dk [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 12:55 ` Jesper Dangaard Brouer @ 2006-06-15 12:57 ` jamal 2006-06-15 13:16 ` jamal 1 sibling, 0 replies; 56+ messages in thread From: jamal @ 2006-06-15 12:57 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk On Wed, 2006-14-06 at 14:55 +0200, Jesper Dangaard Brouer wrote: > On Wed, 2006-06-14 at 08:06 -0400, jamal wrote: > > - Have you tried to do a long-lived session such as a large FTP and > > seen how far off the deviation was? That would provide some interesting > > data point. > > The deviation can be calculated. The impact is of cause small for large > packets. > > But the argument that bulk TCP transfers is not as badly > affected, is wrong because all the TCP ACK packets gets maximum penalty. > ACKs have always played a prominent role. The last numbers i have seen for North America (but i think probably valid globaly) show in the range of 40% ACKs in internet traffic http://netweb.usc.edu/~rsinha/pkt-sizes/ I suspect a lot of these stats are on their way to change with voip, p2p etc. But i dont think it is ACKs perse that you or Russell are contending cause these issues. It's the presence of ATM . And all evidence seems to point to the fact that ISPs bill you for something other than your point of view, no? > On an ADSL link with more than 8 bytes overhead, a 40 bytes TCP ACK will > use more that one ATM frame, causing 2 ATM frames to be send that > consumes 106 bytes, eg. 62% overhead. On a small upstream ADSL line > that hurts! (See thesis page 53, table 5.3 "Overhead summary"). > But how are you connected to the DSLAM? In north America it is typically ethernet. If i use the current tables i dont see much of a problem with say cable modems. Are you trying to compensate for the accounting differences between what your service provider measures (accounting for their ATM cells) and what you do accounting for your ethernet frames? I guess i am lost as to where the ATM is in the topology and more importantly whether we (Linux) mis-account or whether your approach is trying to compensate for the ISPs mis-accounting. > > > - To be a devil's advocate (and not claim there is no issue), where do > > you draw the line with "overhead"? > > Example the smallest ethernet packet is 64 bytes of which 14 bytes are > > ethernet headers ("overhead" for IP) - and this is not counting CRC etc. > > If you were to set an MTU of say 64 bytes and tried to do a http or ftp, > > how accurate do you think the calculation would be? I would think not > > very different. > > I do think we handle this situation, but I'm not quite sure that I fully > understand the question (sorry). > Assume the following: - You had ethernet end to end. Is there still a problem? - Take it a notch up and assume you had ethernet with MTU of 64B. This way you will have all your packets being small and having high overhead. Do you still have a problem? > > > Does it matter if it is accurate on the majority of the cases? > > - For further reflection: Have you considered the case where the rate > > table has already been considered on some link speed in user space and > > then somewhere post-config the physical link speed changes? This would > > happen in the case where ethernet AN is involved and the partner makes > > some changes (use ethtool). > > > > I would say the last bullet is a more interesting problem than a corner > > case of some link layer technology that has high overhead. > > We only claim to do magic on ATM/ADSL links... nothing else ;-) > This is well and good given the focus of your thesis. Up/down here we need something more generic. Your masters-thesis is a good start but consider doing the phd next and complete this work;-> > > > Your work would be more interesting if it was generic for many link > > layers instead of just ATM. > > Well, we did consider to do so, but we though that it would be harder to > get it into the kernel. > > Actually thats the reason for the defines: > #define ATM_CELL_SIZE 53 > #define ATM_CELL_PAYLOAD 48 > > Changing these should should make it possible to adapt to any other SAR > (Segment And Reasembly) link layer. > You are still speaking ATM (and the above may still be valid), but: Could you for example look at the netdevice->type and from that figure out the link layer overhead and compensate for it. Obviously a lot more useful if such activity is doable in user space without any knowledge of the kernel? and therefore zero change to the kernel and everything then becomes forward and backward compatible. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 12:55 ` Jesper Dangaard Brouer 2006-06-15 12:57 ` jamal @ 2006-06-15 13:16 ` jamal 2006-06-20 1:04 ` Patrick McHardy 1 sibling, 1 reply; 56+ messages in thread From: jamal @ 2006-06-15 13:16 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk On Wed, 2006-14-06 at 14:55 +0200, Jesper Dangaard Brouer wrote: > On Wed, 2006-06-14 at 08:06 -0400, jamal wrote: > > - Have you tried to do a long-lived session such as a large FTP and > > seen how far off the deviation was? That would provide some interesting > > data point. > > The deviation can be calculated. The impact is of cause small for large > packets. > > But the argument that bulk TCP transfers is not as badly > affected, is wrong because all the TCP ACK packets gets maximum penalty. > ACKs have always played a prominent role. The last numbers i have seen for North America (but i think probably valid globaly) show in the range of 40% ACKs in internet traffic http://netweb.usc.edu/~rsinha/pkt-sizes/ I suspect a lot of these stats are on their way to change with voip, p2p etc. But i dont think it is ACKs perse that you or Russell are contending cause these issues. It's the presence of ATM . And all evidence seems to point to the fact that ISPs bill you for something other than your point of view, no? > On an ADSL link with more than 8 bytes overhead, a 40 bytes TCP ACK will > use more that one ATM frame, causing 2 ATM frames to be send that > consumes 106 bytes, eg. 62% overhead. On a small upstream ADSL line > that hurts! (See thesis page 53, table 5.3 "Overhead summary"). > But how are you connected to the DSLAM? In north America it is typically ethernet. If i use the current tables i dont see much of a problem with say cable modems. Are you trying to compensate for the accounting differences between what your service provider measures (accounting for their ATM cells) and what you do accounting for your ethernet frames? I guess i am lost as to where the ATM is in the topology and more importantly whether we (Linux) mis-account or whether your approach is trying to compensate for the ISPs mis-accounting. > > > - To be a devil's advocate (and not claim there is no issue), where do > > you draw the line with "overhead"? > > Example the smallest ethernet packet is 64 bytes of which 14 bytes are > > ethernet headers ("overhead" for IP) - and this is not counting CRC etc. > > If you were to set an MTU of say 64 bytes and tried to do a http or ftp, > > how accurate do you think the calculation would be? I would think not > > very different. > > I do think we handle this situation, but I'm not quite sure that I fully > understand the question (sorry). > Assume the following: - You had ethernet end to end. Is there still a problem? - Take it a notch up and assume you had ethernet with MTU of 64B. This way you will have all your packets being small and having high overhead. Do you still have a problem? > > > Does it matter if it is accurate on the majority of the cases? > > - For further reflection: Have you considered the case where the rate > > table has already been considered on some link speed in user space and > > then somewhere post-config the physical link speed changes? This would > > happen in the case where ethernet AN is involved and the partner makes > > some changes (use ethtool). > > > > I would say the last bullet is a more interesting problem than a corner > > case of some link layer technology that has high overhead. > > We only claim to do magic on ATM/ADSL links... nothing else ;-) > This is well and good given the focus of your thesis. Up/down here we need something more generic. Your masters-thesis is a good start but consider doing the phd next and complete this work;-> > > > Your work would be more interesting if it was generic for many link > > layers instead of just ATM. > > Well, we did consider to do so, but we though that it would be harder to > get it into the kernel. > > Actually thats the reason for the defines: > #define ATM_CELL_SIZE 53 > #define ATM_CELL_PAYLOAD 48 > > Changing these should should make it possible to adapt to any other SAR > (Segment And Reasembly) link layer. > You are still speaking ATM (and the above may still be valid), but: Could you for example look at the netdevice->type and from that figure out the link layer overhead and compensate for it. Obviously a lot more useful if such activity is doable in user space without any knowledge of the kernel? and therefore zero change to the kernel and everything then becomes forward and backward compatible. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-15 13:16 ` jamal @ 2006-06-20 1:04 ` Patrick McHardy 2006-06-20 14:59 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-20 1:04 UTC (permalink / raw) To: hadi Cc: Jesper Dangaard Brouer, Stephen Hemminger, netdev, lartc, russell-tcatm, hawk jamal wrote: > You are still speaking ATM (and the above may still be valid), but: > Could you for example look at the netdevice->type and from that figure > out the link layer overhead and compensate for it. > Obviously a lot more useful if such activity is doable in user space > without any knowledge of the kernel? and therefore zero change to the > kernel and everything then becomes forward and backward compatible. It would be nice to have support for HFSC as well, which unfortunately needs to be done in the kernel since it doesn't use rate tables. What about qdiscs like SFQ (which uses the packet size in quantum calculations)? I guess it would make sense to use the wire-length there as well. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-20 1:04 ` Patrick McHardy @ 2006-06-20 14:59 ` jamal 2006-06-20 15:16 ` Patrick McHardy 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-06-20 14:59 UTC (permalink / raw) To: Patrick McHardy Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger, Jesper Dangaard Brouer On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote: > jamal wrote: > > You are still speaking ATM (and the above may still be valid), but: > > Could you for example look at the netdevice->type and from that figure > > out the link layer overhead and compensate for it. > > Obviously a lot more useful if such activity is doable in user space > > without any knowledge of the kernel? and therefore zero change to the > > kernel and everything then becomes forward and backward compatible. > > It would be nice to have support for HFSC as well, which unfortunately > needs to be done in the kernel since it doesn't use rate tables. > What about qdiscs like SFQ (which uses the packet size in quantum > calculations)? I guess it would make sense to use the wire-length > there as well. Didnt even think of that ;-> Is it getting too complicated? BTW, I forgot to mention one thing on the bandwidth issue is we could do is send netlink events on link speed changes too; some listener somewhere would then do the adjustment. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-20 14:59 ` jamal @ 2006-06-20 15:16 ` Patrick McHardy 2006-06-21 12:21 ` Krzysztof Matusik 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-20 15:16 UTC (permalink / raw) To: hadi Cc: hawk, russell-tcatm, lartc, netdev, Stephen Hemminger, Jesper Dangaard Brouer jamal wrote: > On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote: > >>It would be nice to have support for HFSC as well, which unfortunately >>needs to be done in the kernel since it doesn't use rate tables. >>What about qdiscs like SFQ (which uses the packet size in quantum >>calculations)? I guess it would make sense to use the wire-length >>there as well. > > > Didnt even think of that ;-> > Is it getting too complicated? The code wouldn't be very complicated, it just adds some overhead. If you do something like I described in my previous mail the overhead for people not using it would be an additional pointer test before reading skb->len. I guess we could also make it a compile time option. I personally think this is something that really improves our quality of implementation, after all, its "wire" resources qdiscs are meant to manage. > BTW, I forgot to mention one thing on the bandwidth issue is we could do > is send netlink events on link speed changes too; some listener > somewhere would then do the adjustment. See the mail I just wrote :) ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-20 15:16 ` Patrick McHardy @ 2006-06-21 12:21 ` Krzysztof Matusik 2006-06-21 12:54 ` Patrick McHardy 0 siblings, 1 reply; 56+ messages in thread From: Krzysztof Matusik @ 2006-06-21 12:21 UTC (permalink / raw) To: netdev Dnia wtorek, 20 czerwca 2006 17:16, Patrick McHardy napisał: > jamal wrote: > > On Tue, 2006-20-06 at 03:04 +0200, Patrick McHardy wrote: > >>It would be nice to have support for HFSC as well, which unfortunately > >>needs to be done in the kernel since it doesn't use rate tables. > >>What about qdiscs like SFQ (which uses the packet size in quantum > >>calculations)? I guess it would make sense to use the wire-length > >>there as well. > > > > Didnt even think of that ;-> > > Is it getting too complicated? > > The code wouldn't be very complicated, it just adds some overhead. If > you do something like I described in my previous mail the overhead for > people not using it would be an additional pointer test before reading > skb->len. I guess we could also make it a compile time option. > I personally think this is something that really improves our quality > of implementation, after all, its "wire" resources qdiscs are meant > to manage. I'd love to see this one implemented. I'm using HFSC more than a year and it never provides proper QoS on ATM/ADSL links; low delays can never be achieved even with significant throttling below the h/w link bandwidth. This would help a lot regarding the amount of adsl users but on the other hand- there's not many HFSC implementations in real-life I guess (users seem to be afraid of it's 'complexity'). This idea doesn't seem look dirty- is there a chance to implement it in the kernel and iproute? regards ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-21 12:21 ` Krzysztof Matusik @ 2006-06-21 12:54 ` Patrick McHardy 2006-06-21 14:33 ` Krzysztof Matusik 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-21 12:54 UTC (permalink / raw) To: Krzysztof Matusik; +Cc: netdev Krzysztof Matusik wrote: > Dnia wtorek, 20 czerwca 2006 17:16, Patrick McHardy napisał: > >>The code wouldn't be very complicated, it just adds some overhead. If >>you do something like I described in my previous mail the overhead for >>people not using it would be an additional pointer test before reading >>skb->len. I guess we could also make it a compile time option. >>I personally think this is something that really improves our quality >>of implementation, after all, its "wire" resources qdiscs are meant >>to manage. > > > I'd love to see this one implemented. I'm using HFSC more than a year and it > never provides proper QoS on ATM/ADSL links; low delays can never be achieved > even with significant throttling below the h/w link bandwidth. Mhh .. I trust you picked a proper clocksource and timer frequency? > This would help a lot regarding the amount of adsl users but on the other > hand- there's not many HFSC implementations in real-life I guess (users seem > to be afraid of it's 'complexity'). > This idea doesn't seem look dirty- is there a chance to implement it in the > kernel and iproute? I hacked up a patch yesterday. I want to do a bit more testing before posting it, but its hard to really test the effectiveness because even without this patch my DSL line already delivers higher throughput and much better delay than it should (its a throttled SDSL line sold as ADSL). I'll try to do some testing on an ethernet link now. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-21 12:54 ` Patrick McHardy @ 2006-06-21 14:33 ` Krzysztof Matusik 0 siblings, 0 replies; 56+ messages in thread From: Krzysztof Matusik @ 2006-06-21 14:33 UTC (permalink / raw) To: netdev Dnia środa, 21 czerwca 2006 14:54, Patrick McHardy napisał: > > I'd love to see this one implemented. I'm using HFSC more than a year and > > it never provides proper QoS on ATM/ADSL links; low delays can never be > > achieved even with significant throttling below the h/w link bandwidth. > > Mhh .. I trust you picked a proper clocksource and timer frequency? 250Hz or 1000Hz and jiffies (for a SMP machine) I don't mean I've got big delays though and those are lowest with HFSC, when link is satiated. IMHO since ATM is sending small packets constantly thorough its link (ATM specific) encapsulated data defragmentation cannot simply be achieved (or it simply isn't done by modems). As a result one can't simply throttle qdisc basing on computations like UpperLimit = ( bandwidth*(1 - header_bits/frame_size ) having bandwidth specified by ATM link speed rate, as it can be reported by modem, and expect it won't ever be congested. This seems to be main problem. Link asymmetry, window size etc don't affect line congestion and so delays that much I suppose. The same with incorrect timer :-). > I hacked up a patch yesterday. I want to do a bit more testing before > posting it, but its hard to really test the effectiveness because even > without this patch my DSL line already delivers higher throughput and > much better delay than it should (its a throttled SDSL line sold as > ADSL). I'll try to do some testing on an ethernet link now. I'll follow this thread then. regards ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 12:06 ` jamal 2006-06-14 12:55 ` Jesper Dangaard Brouer @ 2006-06-14 15:32 ` Andy Furniss 2006-06-20 0:54 ` Patrick McHardy [not found] ` <1150287983.3246.27.camel@ras.pc.brisbane.lube> 3 siblings, 0 replies; 56+ messages in thread From: Andy Furniss @ 2006-06-14 15:32 UTC (permalink / raw) To: hadi Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, lartc, netdev, Stephen Hemminger jamal wrote: > I have taken linux-kernel off the list. > > Russell's site is inaccessible to me (I actually think this is related > to some DNS issues i may be having) and your masters is too long to > spend 2 minutes and glean it; so heres a question or two for you: > > - Have you tried to do a long-lived session such as a large FTP and > seen how far off the deviation was? That would provide some interesting > data point. > - To be a devil's advocate (and not claim there is no issue), where do > you draw the line with "overhead"? Me and many others have run a smilar hack for years, there is also a userspace project still alive which does the same. The difference is that without it I would need to sacrifice almost half my 288kbit atm/dsl showtime bandwidth to be sure of control. With the modification I can run at 286kbit / 288 and know I will never have jitter worse than the bitrate latency of a mtu packet. The 286 figure was choses to allow a full buffer to drain/ allow for timer innaccuracy etc. On a p200 with tsc, 2.6.12 it's never gone over for me - though talking of timers I notice on my desktop 2.6.16 I gain 2 minutes a day now. Andy. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 12:06 ` jamal 2006-06-14 12:55 ` Jesper Dangaard Brouer 2006-06-14 15:32 ` Andy Furniss @ 2006-06-20 0:54 ` Patrick McHardy 2006-06-20 14:56 ` jamal [not found] ` <1150287983.3246.27.camel@ras.pc.brisbane.lube> 3 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-20 0:54 UTC (permalink / raw) To: hadi Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, lartc, netdev, Stephen Hemminger jamal wrote: > - For further reflection: Have you considered the case where the rate > table has already been considered on some link speed in user space and > then somewhere post-config the physical link speed changes? This would > happen in the case where ethernet AN is involved and the partner makes > some changes (use ethtool). > > I would say the last bullet is a more interesting problem than a corner > case of some link layer technology that has high overhead. > Your work would be more interesting if it was generic for many link > layers instead of just ATM. I've thought about this a couple of times, scaling the virtual clock rate should be enough for "simple" qdiscs like TBF or HTB, which have a linear relation between time and bandwidth. I haven't really thought about the effects on HFSC yet, on a small scale the relation is non-linear. But this is a different problem from trying to accomodate for link-layer overhead. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-20 0:54 ` Patrick McHardy @ 2006-06-20 14:56 ` jamal 2006-06-20 15:09 ` Patrick McHardy 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-06-20 14:56 UTC (permalink / raw) To: Patrick McHardy Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk, Jesper Dangaard Brouer On Tue, 2006-20-06 at 02:54 +0200, Patrick McHardy wrote: > jamal wrote: > > - For further reflection: Have you considered the case where the rate > > table has already been considered on some link speed in user space and > > then somewhere post-config the physical link speed changes? This would > > happen in the case where ethernet AN is involved and the partner makes > > some changes (use ethtool). > > [..] > I've thought about this a couple of times, scaling the virtual clock > rate should be enough for "simple" qdiscs like TBF or HTB, which have > a linear relation between time and bandwidth. I haven't really thought > about the effects on HFSC yet, on a small scale the relation is > non-linear. Does HFSC not depend on bandwith? How is rate control achieved? > But this is a different problem from trying to accomodate > for link-layer overhead. > Yes it is different issue. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-20 14:56 ` jamal @ 2006-06-20 15:09 ` Patrick McHardy 2006-06-22 18:41 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-20 15:09 UTC (permalink / raw) To: hadi Cc: Stephen Hemminger, netdev, lartc, russell-tcatm, hawk, Jesper Dangaard Brouer jamal wrote: > On Tue, 2006-20-06 at 02:54 +0200, Patrick McHardy wrote: > >>jamal wrote: >> >>>- For further reflection: Have you considered the case where the rate >>>table has already been considered on some link speed in user space and >>>then somewhere post-config the physical link speed changes? This would >>>happen in the case where ethernet AN is involved and the partner makes >>>some changes (use ethtool). >>> > > [..] > >>I've thought about this a couple of times, scaling the virtual clock >>rate should be enough for "simple" qdiscs like TBF or HTB, which have >>a linear relation between time and bandwidth. I haven't really thought >>about the effects on HFSC yet, on a small scale the relation is >>non-linear. > > > Does HFSC not depend on bandwith? How is rate control achieved? "Depend on bandwidth" is not the right term. All of TBF, HTB and HFSC provide bandwidth per time, but with TBF and HTB the relation between the amount of bandwidth is linear to the amount of time, with HFSC it is only on a linear on larger scale since it uses service curves, which are represented as two linear pieces. So you have bandwidth b1 for time t1, bandwidth b2 after that until eternity. By scaling the clock rate you alter after how much time b2 kicks in, which affects the guaranteed delays. The end result should be that both bandwidth and delay scale up or down proportionally, but I'm not sure that this is what HFSC would do in all cases (on small scale). But it should be easy to answer with a bit more time for visualizing it. The thing I'm not sure about is whether this wouldn't be handled better by userspace, if the link layer speed changes you might not want proportional scaling but prefer to still give a fixed amount of that bandwidth to some class, for example VoIP traffic. Do we have netlink notifications for link speed changes? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-20 15:09 ` Patrick McHardy @ 2006-06-22 18:41 ` jamal 2006-06-23 14:32 ` Patrick McHardy 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-06-22 18:41 UTC (permalink / raw) To: Patrick McHardy Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev, Stephen Hemminger On Tue, 2006-20-06 at 17:09 +0200, Patrick McHardy wrote: > jamal wrote: > "Depend on bandwidth" is not the right term. All of TBF, HTB and HFSC > provide bandwidth per time, but with TBF and HTB the relation between > the amount of bandwidth is linear to the amount of time, with HFSC > it is only on a linear on larger scale since it uses service curves, > which are represented as two linear pieces. So you have bandwidth b1 > for time t1, bandwidth b2 after that until eternity. By scaling the > clock rate you alter after how much time b2 kicks in, which affects > the guaranteed delays. The end result should be that both bandwidth > and delay scale up or down proportionally, but I'm not sure that this > is what HFSC would do in all cases (on small scale). But it should > be easy to answer with a bit more time for visualizing it. > Ok, this makes things a little trickier though, no? > The thing I'm not sure about is whether this wouldn't be handled better > by userspace, If you do it in user space you will need a daemon of some form; this is my preference but it seems a lot of people hate daemons - the standard claim is it is counter-usability. Such people are forgiving if you built the daemon into the kernel as a thread. Perhaps the netcarrier that Stefan Rompf has added could be extended to handle this) Note, if you wanna do it right as well you will factor in other things like some wireless technologies which changes their throughput capability over a period of time ( A lot of these guys try to have their own hardware level schedulers to compensate for this). > if the link layer speed changes you might not want > proportional scaling but prefer to still give a fixed amount of that > bandwidth to some class, for example VoIP traffic. Do we have netlink > notifications for link speed changes? Not there at the moment - but we do emit event for other link layer stuff like carrier on/off - so adding this should be trivial and a reasonable thing to have; with a caveat: it will be link-layer specific; so whoever ends up adding will have to be careful to make sure it is not hard-coded to be specific to ethernet-like netdevices. It could probably be reported together with link state as a TLV like ETHER_SPEED_CHANGED which carries probably old speed and new speed and maybe even reason why it changed. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-22 18:41 ` jamal @ 2006-06-23 14:32 ` Patrick McHardy 2006-06-24 14:39 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-23 14:32 UTC (permalink / raw) To: hadi; +Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev, Stephen Hemminger jamal wrote: > On Tue, 2006-20-06 at 17:09 +0200, Patrick McHardy wrote: > >>The thing I'm not sure about is whether this wouldn't be handled better >>by userspace, > > > If you do it in user space you will need a daemon of some form; this is > my preference but it seems a lot of people hate daemons - the standard > claim is it is counter-usability. Such people are forgiving if you built > the daemon into the kernel as a thread. Perhaps the netcarrier that > Stefan Rompf has added could be extended to handle this) I absolutely disagree, in my opinion we currently have too few specialized daemons and a lot too much wild shell scripting with seding, awking, grepping. I'm actually working on a daemon that provides a small scripting language to express networking configuration states and react to result set changes and state transitions. Its far from complete, but already allows you to express things like "on transition from { none, link flags !up } to { link flags up } execute /etc/network/link/$LINK_NAME up" (same for routes, addresses, basically all object types) or "for each link with flags lowerup,up execute add-to-bridge.sh". The value of each expression is dumped into the environment on execution, so you can comfortably use $LINK_NAME or $LINK_MTU instead of having to cut it out the "ip link list" output. Should be trivial to support link speed changes once we have notifications for that. >>if the link layer speed changes you might not want >>proportional scaling but prefer to still give a fixed amount of that >>bandwidth to some class, for example VoIP traffic. Do we have netlink >>notifications for link speed changes? > > > Not there at the moment - but we do emit event for other link layer > stuff like carrier on/off - so adding this should be trivial and a > reasonable thing to have; with a caveat: it will be link-layer specific; > so whoever ends up adding will have to be careful to make sure it is not > hard-coded to be specific to ethernet-like netdevices. It could probably > be reported together with link state as a TLV like ETHER_SPEED_CHANGED > which carries probably old speed and new speed > and maybe even reason why it changed. I don't think it should carry both old and new speed. Netlink notifications usually provide a snapshot of the new state, but no indication what changed, with one notable exception, the ifi_change field, which IMO is a hack for lazy userspace. Since notifications can get lost, userspace needs to resync occasionally. The naiive approach (works for every other object) to determine if the object state changed from my last known state is to compare all attributes .. but the ifi_change field will be different between notifications and dumps even if the object itself didn't change. "Lazy userspace" because looking at ifi_change is obviously only useful if it doesn't keep its last known state and tries to derive the change from update notifications alone .. which means it fails when notifications are lost. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-23 14:32 ` Patrick McHardy @ 2006-06-24 14:39 ` jamal 2006-06-26 11:21 ` Patrick McHardy 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-06-24 14:39 UTC (permalink / raw) To: Patrick McHardy Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev, Stephen Hemminger On Fri, 2006-23-06 at 16:32 +0200, Patrick McHardy wrote: > jamal wrote: > > If you do it in user space you will need a daemon of some form; this is > > my preference but it seems a lot of people hate daemons - the standard > > claim is it is counter-usability. Such people are forgiving if you built > > the daemon into the kernel as a thread. Perhaps the netcarrier that > > Stefan Rompf has added could be extended to handle this) > > I absolutely disagree, in my opinion we currently have too few > specialized daemons and a lot too much wild shell scripting with > seding, awking, grepping. Like i said i too prefer a daemon; but i have experienced a lot of people who dont (especially in small embedded devices). > I'm actually working on a daemon that > provides a small scripting language to express networking configuration > states and react to result set changes and state transitions. > Its far from complete, but already allows you to express things like > "on transition from { none, link flags !up } to { link flags up } > execute /etc/network/link/$LINK_NAME up" (same for routes, addresses, > basically all object types) or "for each link with flags lowerup,up > execute add-to-bridge.sh". The value of each expression is dumped > into the environment on execution, so you can comfortably use > $LINK_NAME or $LINK_MTU instead of having to cut it out the > "ip link list" output. Should be trivial to support link speed changes > once we have notifications for that. > cool - a neteventd > I don't think it should carry both old and new speed. Netlink > notifications usually provide a snapshot of the new state, but > no indication what changed, with one notable exception, the > ifi_change field, which IMO is a hack for lazy userspace. I am quiet fond of the ifi_change ;-> > Since > notifications can get lost, userspace needs to resync occasionally. > The naiive approach (works for every other object) to determine if > the object state changed from my last known state is to compare > all attributes .. scalability issues abound when you have a gazillion things to look at. There used or may still be a way to tell from looking at netlink socket that an error occurred since last time - such as "a message was lost". You could use that to tell a message was lost and do scanning only then. > but the ifi_change field will be different > between notifications and dumps even if the object itself didn't > change. "Lazy userspace" because looking at ifi_change is obviously > only useful if it doesn't keep its last known state and tries to > derive the change from update notifications alone .. which means it > fails when notifications are lost. > But thats not the real intent for it. cheers, jamal PS:- I will look at your other postings and respond later i have to run for now. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-24 14:39 ` jamal @ 2006-06-26 11:21 ` Patrick McHardy 2006-06-27 13:01 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-26 11:21 UTC (permalink / raw) To: hadi; +Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev, Stephen Hemminger jamal wrote: > On Fri, 2006-23-06 at 16:32 +0200, Patrick McHardy wrote: > >>I don't think it should carry both old and new speed. Netlink >>notifications usually provide a snapshot of the new state, but >>no indication what changed, with one notable exception, the >>ifi_change field, which IMO is a hack for lazy userspace. > > > I am quiet fond of the ifi_change ;-> > >>Since >>notifications can get lost, userspace needs to resync occasionally. >>The naiive approach (works for every other object) to determine if >>the object state changed from my last known state is to compare >>all attributes .. > > > scalability issues abound when you have a gazillion things to look at. > There used or may still be a way to tell from looking at netlink socket > that an error occurred since last time - such as "a message was lost". > You could use that to tell a message was lost and do scanning only > then. It returns -ENOBUFS on socket overrun. Without it netlink notifications wouldn't be very useable as you couldn't figure out when you missed some. >> but the ifi_change field will be different >>between notifications and dumps even if the object itself didn't >>change. "Lazy userspace" because looking at ifi_change is obviously >>only useful if it doesn't keep its last known state and tries to >>derive the change from update notifications alone .. which means it >>fails when notifications are lost. >> > > > But thats not the real intent for it. Then what is the intent, it doesn't carry any other information? It includes information that are not available any other way from the kernel, yet the information is not transmitted reliably. How could a program that relies on this possibly work reliable? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-26 11:21 ` Patrick McHardy @ 2006-06-27 13:01 ` jamal 2006-07-02 4:23 ` Patrick McHardy 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-06-27 13:01 UTC (permalink / raw) To: Patrick McHardy Cc: Stephen Hemminger, netdev, russell-tcatm, hawk, Jesper Dangaard Brouer On Mon, 2006-26-06 at 13:21 +0200, Patrick McHardy wrote: > jamal wrote: > > scalability issues abound when you have a gazillion things to look at. > > There used or may still be a way to tell from looking at netlink socket > > that an error occurred since last time - such as "a message was lost". > > You could use that to tell a message was lost and do scanning only > > then. > > It returns -ENOBUFS on socket overrun. Without it netlink notifications > wouldn't be very useable as you couldn't figure out when you missed > some. > I think you are probably right; it is -ENOBUFS. I recall something along the lines of a socket->error flag somewhere. I will go and snoop on code. > >> but the ifi_change field will be different > >>between notifications and dumps even if the object itself didn't > >>change. "Lazy userspace" because looking at ifi_change is obviously > >>only useful if it doesn't keep its last known state and tries to > >>derive the change from update notifications alone .. which means it > >>fails when notifications are lost. > >> > > > > > > But thats not the real intent for it. > > Then what is the intent, it doesn't carry any other information? Generally it is a filter to what the ifi_flags reflect. >From an event architecture scalability perspective, it is useful to be able to look at what changed without having to query the source. That is what ifi_change provides. So it is not "Lazy userspace" rather it is an effective optimization. You keep less state and you have less message exchanges between user and kernel. > It includes information that are not available any other way from > the kernel, yet the information is not transmitted reliably. How > could a program that relies on this possibly work reliable? Ok, so first lets separate them as two different issues: - Any message to user space may be lost whether it has ifi_change or not. You need some way to figure out a message was lost and declare your state may be invalid. The -ENOBUFs is one way to discover stale state. - by looking at ifi_change i can tell what changed from about 10 other things reflected in the flags. If you get an ENOBUFS in this case (or any other), it means your state is not consistent and to have reliable info you need to query the kernel. Hope that makes sense. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-27 13:01 ` jamal @ 2006-07-02 4:23 ` Patrick McHardy 2006-07-02 13:59 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-07-02 4:23 UTC (permalink / raw) To: hadi; +Cc: Stephen Hemminger, netdev, russell-tcatm, hawk, Jesper Dangaard Brouer Sorry, missed your reply so far. >>Then what is the intent, it doesn't carry any other information? > > > Generally it is a filter to what the ifi_flags reflect. >>From an event architecture scalability perspective, it is useful to be > able to look at what changed without having to query the source. That is > what ifi_change provides. So it is not "Lazy userspace" rather it is an > effective optimization. > You keep less state and you have less message exchanges between user and > kernel. It might have been, but the situation is a bit more complicated today, and special case solutions (aka hacks) don't work very well for the huge diversity of interest in update messages. Besides that, it is flawed. >>It includes information that are not available any other way from >>the kernel, yet the information is not transmitted reliably. How >>could a program that relies on this possibly work reliable? > > > Ok, so first lets separate them as two different issues: > > - Any message to user space may be lost whether it has ifi_change or > not. You need some way to figure out a message was lost and declare your > state may be invalid. The -ENOBUFs is one way to discover stale state. Thats not the point. ifi_change doesn't inform you about updates, it informs you about specific changes. And you can't determine those in the case of lost notifications without keeping the old state yourself. It may save you some time _comparing_ (not searching) in case it wasn't lost, but it only contains a very limited set of information and nothing saves you from searching the entry a message refers to based on the primary keys (which unfortunately aren't even defined for every message type, which means they're not even clearly defined inside the kernel. routing rules are one example). Its a hack for some specific usage case, and since you're defending this so hard, I'm naturally wondering if you introduced it? > - by looking at ifi_change i can tell what changed from about 10 > other things reflected in the flags. If you get an ENOBUFS in this case > (or any other), it means your state is not consistent and to have > reliable info you need to query the kernel. The flags only reflect a single piece of information, one that userspace needs to be able to reconstruct anyway. Once again the point: netlink notifications are _update_ messages, not state change messages. This is a fundamental difference, misleading userspace (see tc monitor) to believing it is something different for very limited usage cases does nothing good. I hereby state my believe that 99% of netlink users are broken. > Hope that makes sense. Me too :) ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-02 4:23 ` Patrick McHardy @ 2006-07-02 13:59 ` jamal 0 siblings, 0 replies; 56+ messages in thread From: jamal @ 2006-07-02 13:59 UTC (permalink / raw) To: Patrick McHardy Cc: Jesper Dangaard Brouer, hawk, russell-tcatm, netdev, Stephen Hemminger On Sun, 2006-02-07 at 06:23 +0200, Patrick McHardy wrote: [..] > > - Any message to user space may be lost whether it has ifi_change or > > not. You need some way to figure out a message was lost and declare your > > state may be invalid. The -ENOBUFs is one way to discover stale state. > > Thats not the point. ifi_change doesn't inform you about updates, it > informs you about specific changes. yes, > And you can't determine those in > the case of lost notifications without keeping the old state yourself. nod, > It may save you some time _comparing_ (not searching) in case it wasn't > lost, This is one of the points of where it has value. Take a hypothetical example of when you have a gazillion events/sec happening, this is an exageration in case of ifi_change, just assume it so for illustration purposes. Assume no messages are lost. Assume you are only interested in the netdevice becoming promisc or un-promisc. In such a case, you can quickly check if among the vector of events that this just happened or not. > but it only contains a very limited set of information and nothing > saves you from searching the entry a message refers to based on the > primary keys (which unfortunately aren't even defined for every message > type, which means they're not even clearly defined inside the kernel. > routing rules are one example). In some cases it does provide you with details - example in the case of booleans where it will tell you something changed from 0->1 or 1->0. Actually the promisc case i gave is a bad example because it falls under such a category although i hope it still made the point i was trying to make. But yes in general, it is just supposed to provide you with info of "dude, you got a message from X and Y" - which is more useful than the select() "dude, you got a message". You have then to go and read that message if you are interested or if you are busy you can go later. > Its a hack for some specific usage case, > and since you're defending this so hard, I'm naturally wondering if > you introduced it? > No, i didnt - it has been there for years; it may/must have been Alexey who introduced it. The reason i am defending it is i find it useful and it is a classical approach in event management; i have seen it used in a lot of middleware as an optimization. When you have a vector of events that can be reported, then as optimization you dont waste resources by sending details about the event in case nobody is interested in hearing about the details or is ready to process them. BTW,theres a consistency bug/hack at the moment in iproute2 on its usage. > > - by looking at ifi_change i can tell what changed from about 10 > > other things reflected in the flags. If you get an ENOBUFS in this case > > (or any other), it means your state is not consistent and to have > > reliable info you need to query the kernel. > > The flags only reflect a single piece of information, one that userspace > needs to be able to reconstruct anyway. Once again the point: netlink > notifications are _update_ messages, not state change messages. This is > a fundamental difference, misleading userspace (see tc monitor) to > believing it is something different for very limited usage cases does > nothing good. I hereby state my believe that 99% of netlink users > are broken. > yes, it would mislead users who dont use it appropriately such as ip mon but that doesnt make it less valuable. I thought ip mon would bitch if it received enobufs? cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <1150287983.3246.27.camel@ras.pc.brisbane.lube>]
[parent not found: <1150292693.5197.1.camel@jzny2>]
[parent not found: <1150843471.17455.2.camel@ras.pc.brisbane.lube>]
[parent not found: <15653CE98281AD4FBD7F70BCEE3666E53CD54A@comxexch01.comx.local>]
[parent not found: <1151000966.5392.34.camel@jzny2>]
* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL [not found] ` <1151000966.5392.34.camel@jzny2> @ 2006-06-23 12:37 ` Russell Stuart 2006-06-23 15:21 ` Patrick McHardy 2006-06-24 14:13 ` jamal 0 siblings, 2 replies; 56+ messages in thread From: Russell Stuart @ 2006-06-23 12:37 UTC (permalink / raw) To: hadi Cc: Patrick McHardy, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer On Thu, 2006-06-22 at 14:29 -0400, jamal wrote: > Russell, > > I did look at what you sent me and somewhere in those discussions i > argue that the changes compensate to make the rate be a goodput > instead of advertised throughput. I did see that, but didn't realise you were responding to me. A lot of discussion has gone on since and evidently quite a bit of which was addressed to me. I will try to answer the some of the points. Sorry for the digest like reply :( On Wed, 2006-06-14 at 11:57 +0100, Alan Cox wrote: > I'm > not sure if that matters but for modern processors I'm also sceptical > that the clever computation is actually any faster than just doing the > maths, especially if something cache intensive is also running. Assuming you are referring to the rate tables - I hadn't thought about it, but I guess I would agree. However, this patch wasn't trying to radically re-engineer the traffic control engines rate calculation code. Quite the reverse I - was was trying to change it as little as possible. The kernel part of the patch actually only introduced one small change - the optional addition of a constant the packet length. On Thu, 2006-06-15 at 08:57 -0400, jamal wrote: > But i dont think it is ACKs perse that you or Russell are contending > cause these issues. It's the presence of ATM . And all evidence seems to > point to the fact that ISPs bill you for something other than your > point of view, no? I don't know about anywhere else, but certainly here in Australia some ISP's creative in how they advertise their link speeds. Again that is not the issue we were trying to address with the patch. On Thu, 2006-06-15 at 08:57 -0400, jamal wrote: > You are still speaking ATM (and the above may still be valid), but: > Could you for example look at the netdevice->type and from that figure > out the link layer overhead and compensate for it. As others have pointed out, this doesn't work for the ADSL user. An ADSL modem is connected to the box using either ethernet, wireless or USB. On Thu, 2006-06-15 at 09:03 -0400, jamal wrote: > It is probably doable by just looking at netdevice->type and figuring > the link layer technology. Totally in user space and building the > compensated for tables there before telling the kernel (advantage is no > kernel changes and therefore it would work with older kernels as well). Others have had this same thought, and have spent time trying to come up with a user space only solution. They failed because it isn't possible. To understand why see this thread: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html Also, the user space patch does improve the performance of older kernels (ie unpatched kernels). Rather than getting the rate wrong 99.9% of the time, older kernels only get it wrong 14% of the time, on average. On Tue, 2006-06-20 at 03:04 +0200, Patrick McHardy wrote: > What about qdiscs like SFQ (which uses the packet size in quantum > calculations)? I guess it would make sense to use the wire-length > there as well. Being pedantic, SQF automatically assigns traffic to classes and gives each class an equal share of the available bandwidth. As I am sure you are aware SQF's trick is that it randomly changes its classification algorithm - every second in the Linux implementation. If there are errors in rate calculation this randomisation will ensure they are distributed equally between the classes as time goes on. So no, accurate packets sizes are not that important to SQF. But they are important to many other qdiscs, and I am sure that was your point. SQF just happened to be a bad example. On Tue, 2006-06-20 at 10:06 -0400, jamal wrote: > What this means is that Linux computes based on ethernet > headers. Somewhere downstream ATM (refer to above) comes in and that > causes mismatch in what Linux expects to be the bandwidth and what > your service provider who doesnt account for the ATM overhead when > they sell you "1.5Mbps". > Reminds me of hard disk vendors who define 1K to be 1000 to show > how large their drives are. > Yes, Linux cant tell if your service provider is lying to you. No, it can't. But you can measure the bandwidth you are getting from your ISP and plug that into the tc command line. The web page I sent to you describes how to do this for ADSL lines. On Tue, 2006-06-20 at 10:06 -0400, jamal wrote: > > On Mon, 2006-19-06 at 21:31 +0200, Jesper Dangaard Brouer wrote: > > The issue here is, that ATM does not have fixed overhead (due to alignment > > and padding). This means that a fixed reduction of the bandwidth is not > > the solution. We could reduce the bandwidth to the worst-case overhead, > > which is 62%, I do not think that is a good solution... > > > > I dont see it as wrong to be honest with you. Your mileage may vary. Jamal am I reading this correctly? Did you just say that you don't see having to reduce your available bandwidth by 62% to take account of deficiencies in Linux traffic engine as wrong? Why on earth would you say that? On Tue, 2006-06-20 at 10:06 -0400, jamal wrote: > Dont have time to read your doc and dont get me wrong, there is a > "quark" practical problem: As practical as the hard disk manufacturer > who claims that they have 11G drive when it is 10G. This reads like we don't see the same problem in the same way. Your disk example is a 10% error that effects less savvy users. The ATM problem we are trying to address effects a big chunk of all Linux's traffic control users. (Big chunk as counted by boxes, not bytes.) Something like 60% of all broadband connections use ADSL. Most of the remainder live in the US and use cable. Or at least so says this web page: http://tinyurl.com/pydnj Extrapolating from that, I think it is safe to say fair chunk of all people using the Linux Traffic Control engine use ADSL, and thus may benefit from this patch. Now it is true that right now these people may not see a great benefit from the patch. Those that will are divided into two categories: 1. Those that saturate their upstream bandwidth. This isn't hard to do on ADSL, due to its first letter. It effects people who use run web sites, email lists - which is bugger all, and those who play games or run P2P - which is most home users. 2. Those that use Voip. Again there aren't many people who do this right now, but that will change. Its not hard to envisage a future where real time streaming like this will come to dominate Internet traffic. Voip effects the other major group of users out there - business. Ergo I believe that in the long term the patch will benefit a lot of people. The next argument is how much it will benefit them. It turns out that the patch is only useful if you have some small packets that MUST have priority on the ADSL link. Jesper's traffic was TCP ACK's (he was addressing problem 1) and mine was VOIP traffic. This would seem a trivial problem to solve with Linux's traffic control engine. I don't know what path Jesper took - but I tried using it in the obvious fashion and it didn't work. A couple of large emails would take out an office's phone system. It took me days of head scratching to figure out why. The cause was ADSL using ATM as a carrier. In my case I was using approx 110 byte packets. Do the sums. It takes 3 ATM cells to carry an 110 byte packets. That is 159 bytes. A 50% error. That meant the ISP was doing the traffic control, and he wasn't prioritising VOIP traffic. Sure, you can optimise the values you pass to tc for 110 byte packets. But then it fails miserably for a other packet sizes; such as a different VOIP codec, or TCP acks. The only solution is to understate your available bandwidth by at least a 1/3rd. I hope you don't consider that acceptable. The reason this patch wasn't thought of until now is that large packets don't see much benefit. For similar packet sizes the maximum error is determined by the ATM cell size (you can be +/- one ATM cell) and that is 53 bytes. This means on packets around MTU size the error is 53/1500 = 3.5%. Hardly worth worrying about. For traditional Internet usage, ie the one ADSL was designed for, the upstream channel, ie the one carrying the TCP ACKS, was rarely saturated. The speed was limited by the downstream channel - the one carrying MTU sized packets. So in summary - no, Jamal, I see no correspondence between your 10/11Gb hard drives example and this patch. On Tue, 2006-06-20 at 10:06 -0400, jamal wrote: > It needs to be > resolved - but not in an intrusive way in my opinion. To be honest, I didn't think the patch was that intrusive. It adds an optional constant to the skb->len. Hardly earth shattering. On Tue, 2006-06-20 at 16:45 +0200, Patrick McHardy wrote: > Handling all qdiscs would mean adding a pointer to a mapping table > to struct net_device and using something like "skb_wire_len(skb, dev)" > instead of skb->len in the queueing layer. That of course doesn't > mean that we can't still provide pre-adjusted ratetables for qdiscs > that use them. Yes, that would work well, and is probably how it should of been done when the kernel stuff was originally written. As it happens Jesper's original solution was closer to this. The reason we choose not to go that way it is would change the kernel-userspace API. The current patch solves the problem and works well as possible on all kernel versions - both patched and unpatched. Now that I think about to change things the way you suggest here does seem simple enough. But it probably belongs in a different patch. We wrote this patch to fix a specific problem with ATM links, and it should succeed or fail on the merits of doing that. Cleaning up the kernel code to do what you suggest is a different issue. Let whether it to should be done, or not, be based on its own merits. On Tue, 2006-06-20 at 11:38 -0400, jamal wrote: > The issue is really is whether Linux should be interested in the > throughput it is told about or the goodput (also known as effective > throughput) the service provider offers. Two different issues by > definition. <snip> On Thu, 2006-06-22 at 14:29 -0400, jamal wrote: > I did look at what you sent me and somewhere in those discussions i > argue that the changes compensate to make the rate be a goodput > instead of advertised throughput. Throughput is typically what > schedulers work with and is typically to what is on the wire. > Goodput tends to be end-to-end; so somewhere down the road ATM > "reduces" the goodput but not the throughput. > I am actaully just fine with telling the scheduler you have less > throughput than what your ISP is telling you. I am also > not against a generic change as long as it is non-intrusive because i > believe this is a practical issue and Patrick Mchardy says he can > deliver such a patch. I have read your throughput versus goodput thing a couple of times, and I'm sorry - I don't understand. What is it you would like us to achieve? As for the patch being invasive, it changes 37 lines of kernel code. No other suggestion I have seen here will be that small. If making the patch generic, ie allowing it to handle cell sizes other than ATM, then let me know I will make the change on the weekend. It is just a user space change. One final point: if you are happy with an invasive patch that changes the world, I have a suggestion. Modularise the rate calculation function. We have qdisc modules, filter modules and whatnot - so add another type. Rate calculation. The current system can become the default rate calculation module if none is specified. Patrick can have his system, and Alan can have his. And we can add an ATM one. If you wish, I can (with Jespers help, I hope) re-do the patch in that style, producing the default one and an ATM one. My personal preference though would be to put this patch in, and then let this new idea stand or fall on its own merits. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-23 12:37 ` Russell Stuart @ 2006-06-23 15:21 ` Patrick McHardy 2006-06-26 0:45 ` Russell Stuart 2006-06-24 14:13 ` jamal 1 sibling, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-23 15:21 UTC (permalink / raw) To: Russell Stuart Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer Russell Stuart wrote: > On Thu, 2006-06-22 at 14:29 -0400, jamal wrote: > > On Tue, 2006-06-20 at 03:04 +0200, Patrick McHardy wrote: > >>What about qdiscs like SFQ (which uses the packet size in quantum >>calculations)? I guess it would make sense to use the wire-length >>there as well. > > > Being pedantic, SQF automatically assigns traffic to classes > and gives each class an equal share of the available bandwidth. > As I am sure you are aware SQF's trick is that it randomly > changes its classification algorithm - every second in the Linux > implementation. If there are errors in rate calculation this > randomisation will ensure they are distributed equally between > the classes as time goes on. So no, accurate packets sizes are > not that important to SQF. > > But they are important to many other qdiscs, and I am sure > that was your point. SQF just happened to be a bad example. Not really. The randomization doesn't happen by default, but it doesn't influence this anyway. SFQ allows flows to send up to "quantum" bytes at a time before moving on to the next one. A flow that sends 75 * 20 byte will in the eyes of SFQ use 1500bytes, on the (ethernet) wire it needs 4800bytes. A flow that sents 1500byte packets will only need 1504 bytes on the wire, but will be treated equally. So it does make a different for SFQ. > On Tue, 2006-06-20 at 16:45 +0200, Patrick McHardy wrote: > >>Handling all qdiscs would mean adding a pointer to a mapping table >>to struct net_device and using something like "skb_wire_len(skb, dev)" >>instead of skb->len in the queueing layer. That of course doesn't >>mean that we can't still provide pre-adjusted ratetables for qdiscs >>that use them. > > > Yes, that would work well, and is probably how it should of > been done when the kernel stuff was originally written. As > it happens Jesper's original solution was closer to this. The > reason we choose not to go that way it is would change the > kernel-userspace API. The current patch solves the problem > and works well as possible on all kernel versions - both > patched and unpatched. Not a problem as long as the new stuff doesn't break anything existing. My patch introduces a TCA_STAB (for size table), similar to the _RTAB attributes. Old iproute with new kernel and new iproute with old kernel both work fine. > Now that I think about to change things the way you suggest > here does seem simple enough. But it probably belongs in a > different patch. We wrote this patch to fix a specific problem > with ATM links, and it should succeed or fail on the merits > of doing that. Cleaning up the kernel code to do what you > suggest is a different issue. Let whether it to should be > done, or not, be based on its own merits. Its not about cleanup, its about providing the same capabilities to all qdiscs instead of just a few selected ones and generalizing it so it is also usable for non-ATM overhead calculations. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-23 15:21 ` Patrick McHardy @ 2006-06-26 0:45 ` Russell Stuart 2006-06-26 11:10 ` Patrick McHardy 0 siblings, 1 reply; 56+ messages in thread From: Russell Stuart @ 2006-06-26 0:45 UTC (permalink / raw) To: Patrick McHardy Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer On Fri, 2006-06-23 at 17:21 +0200, Patrick McHardy wrote: > Not really. The randomization doesn't happen by default, but it doesn't > influence this anyway. SFQ allows flows to send up to "quantum" bytes > at a time before moving on to the next one. A flow that sends 75 * 20 > byte will in the eyes of SFQ use 1500bytes, on the (ethernet) wire it > needs 4800bytes. A flow that sents 1500byte packets will only need > 1504 bytes on the wire, but will be treated equally. So it does make > a different for SFQ. I hadn't even thought to check. My bad. The S in SFQ stands for stochastic, so something that does without randomisation the algorithm implemented couldn't really be called SFQ - particularly as it weakens the algorithm considerably. I hope that most users do specify a perturb. Your 20 byte example is hardly realistic. skb->len includes the 14 byte ethernet header, so there is a total of 6 data bytes in a 20 byte packet. The IP header alone is 20 bytes. TCP as implemented on Linux adds another 32 bytes (20 + the rtt option). In other words I agree with Jamal's comments elsewhere - optimising for MPU sized packets doesn't seem like a win. > Not a problem as long as the new stuff doesn't break anything existing. > My patch introduces a TCA_STAB (for size table), similar to the _RTAB > attributes. Old iproute with new kernel and new iproute with old kernel > both work fine. OK, good. > Its not about cleanup, its about providing the same capabilities > to all qdiscs instead of just a few selected ones and generalizing > it so it is also usable for non-ATM overhead calculations. Perhaps I chose my words poorly. My intent was to contrast the size and goals of the two proposed patches. The ATM patch is a 37 line patch. It includes some minor cleanups. From the pseudo code you have posted what you are proposing is a more ambitious and much larger patch that moves a chunk of user space code into the kernel. I am a complete newbie when it comes to getting code into the kernel, but that strikes me as contentious. I would rather not have the ATM patch depend on it. By the by, here are a couple of observations: 1. The entries in the current rtab are already very closely related to packet lengths. They are actually the packet length multiplied by a constant that converts the units from "bytes" to "jiffies". The constant is the same for all entries in the table. 2. As such, the current rtab could already be used by SFQ and any other qdisc that needs to know the packet length. That SFQ doesn't do this is probably because it doesn't effect its performance overly. 3. Be that as it may, the current RTAB isn't in the most convenient form for SFQ, and I am guessing it is in a very inconvenient form for HFSC. Adding a new version that is identical except that it contains the raw packet length would be a simple change. In that format it could be used by all qdiscs. The users of the existing rtab would have to do the multiplication that converts the packet length to jiffies in the kernel. This means the conceptually at least, should the gootput change you need to change this one constant, not the entire table. 4. Much as you seem to dislike having the rate / packet length calculations in user space, having them there makes it easy to add new technologies such as ATM. You just have to change a user space tool - not the kernel. 5. We still did have to modify the kernel for ATM. That was because of its rather unusual characteristics. However, it you look at the size of modifications made to the kernel verses the size made to the user space tool, (37 lines versus 303 lines,) the bulk of the work was does in user space. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-26 0:45 ` Russell Stuart @ 2006-06-26 11:10 ` Patrick McHardy 2006-06-27 6:19 ` Russell Stuart 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-06-26 11:10 UTC (permalink / raw) To: Russell Stuart Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer Russell Stuart wrote: > On Fri, 2006-06-23 at 17:21 +0200, Patrick McHardy wrote: > >>Not really. The randomization doesn't happen by default, but it doesn't >>influence this anyway. SFQ allows flows to send up to "quantum" bytes >>at a time before moving on to the next one. A flow that sends 75 * 20 >>byte will in the eyes of SFQ use 1500bytes, on the (ethernet) wire it >>needs 4800bytes. A flow that sents 1500byte packets will only need >>1504 bytes on the wire, but will be treated equally. So it does make >>a different for SFQ. > > > I hadn't even thought to check. My bad. The S in SFQ stands > for stochastic, so something that does without randomisation > the algorithm implemented couldn't really be called SFQ - > particularly as it weakens the algorithm considerably. I > hope that most users do specify a perturb. Its not as great as you think. Changing hash-functions on the fly causes reordering for non-idle flows when bucket-lengths aren't distributed even. I never use it. > Your 20 byte example is hardly realistic. skb->len includes > the 14 byte ethernet header, so there is a total of 6 data > bytes in a 20 byte packet. The IP header alone is 20 bytes. > TCP as implemented on Linux adds another 32 bytes (20 + the > rtt option). In other words I agree with Jamal's comments > elsewhere - optimising for MPU sized packets doesn't seem > like a win. The point is that SFQ does care about packet sizes, and this is true for both MPU-sized and other packets. >>Its not about cleanup, its about providing the same capabilities >>to all qdiscs instead of just a few selected ones and generalizing >>it so it is also usable for non-ATM overhead calculations. > > > Perhaps I chose my words poorly. > > My intent was to contrast the size and goals of the two > proposed patches. The ATM patch is a 37 line patch. It > includes some minor cleanups. From the pseudo code you > have posted what you are proposing is a more ambitious and > much larger patch that moves a chunk of user space code > into the kernel. I am a complete newbie when it comes to > getting code into the kernel, but that strikes me as > contentious. I would rather not have the ATM patch > depend on it. > > By the by, here are a couple of observations: > > 1. The entries in the current rtab are already very closely > related to packet lengths. They are actually the packet > length multiplied by a constant that converts the units > from "bytes" to "jiffies". The constant is the same for > all entries in the table. > > 2. As such, the current rtab could already be used by SFQ > and any other qdisc that needs to know the packet length. > That SFQ doesn't do this is probably because it doesn't > effect its performance overly. The rtab includes the transmission time, which is related to, but still is something different than the length. You can't calculate the transmission time without a rate, which is not needed otherwise for SFQ for example. The way rtabs are used also needs more space, my current size tables only use as much space as needed, which is 16 entries for ethernet instead of 256. > 3. Be that as it may, the current RTAB isn't in the most > convenient form for SFQ, and I am guessing it is in a > very inconvenient form for HFSC. Adding a new version > that is identical except that it contains the raw packet > length would be a simple change. In that format it > could be used by all qdiscs. The users of the existing > rtab would have to do the multiplication that converts > the packet length to jiffies in the kernel. This means > the conceptually at least, should the gootput change > you need to change this one constant, not the entire > table. Well, I don't care much whether we use rtabs or something new, but rtabs are meant for something different and as such are not optimally suited for this. > 4. Much as you seem to dislike having the rate / packet length > calculations in user space, having them there makes it easy > to add new technologies such as ATM. You just have to > change a user space tool - not the kernel. I don't dislike it for beeing in userspace, I dislike it for a) beeing used for this since it only covers TBF-based qdiscs b) beeing used for an ATM "special case" which is not special at all. I should also note that my tables don't come out of the random generator but are provided by userspace as well. Unless the mechanism is unable to express the needs of a particular link layer type all you have to do is to provide a different set of values without touching any code at all. > 5. We still did have to modify the kernel for ATM. That was > because of its rather unusual characteristics. However, > it you look at the size of modifications made to the kernel > verses the size made to the user space tool, (37 lines > versus 303 lines,) the bulk of the work was does in user > space. I'm sorry, but arguing that a limited special case solution is better because it needs slightly less code is just not reasonable. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-26 11:10 ` Patrick McHardy @ 2006-06-27 6:19 ` Russell Stuart 2006-06-27 17:18 ` Patrick McHardy 2006-07-04 13:29 ` Patrick McHardy 0 siblings, 2 replies; 56+ messages in thread From: Russell Stuart @ 2006-06-27 6:19 UTC (permalink / raw) To: Patrick McHardy Cc: Russell Stuart, hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer On 26/06/2006 9:10 PM, Patrick McHardy wrote: >>5. We still did have to modify the kernel for ATM. That was >> because of its rather unusual characteristics. However, >> it you look at the size of modifications made to the kernel >> verses the size made to the user space tool, (37 lines >> versus 303 lines,) the bulk of the work was does in user >> space. > > I'm sorry, but arguing that a limited special case solution is > better because it needs slightly less code is just not reasonable. Without seeing your actual proposal it is difficult to judge whether this is a reasonable trade-off or not. Hopefully we will see your code soon. Do you have any idea when? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-27 6:19 ` Russell Stuart @ 2006-06-27 17:18 ` Patrick McHardy 2006-07-04 13:29 ` Patrick McHardy 1 sibling, 0 replies; 56+ messages in thread From: Patrick McHardy @ 2006-06-27 17:18 UTC (permalink / raw) To: Russell Stuart Cc: Russell Stuart, hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer Russell Stuart wrote: > Without seeing your actual proposal it is difficult to > judge whether this is a reasonable trade-off or not. > Hopefully we will see your code soon. Do you have any > idea when? Probably not today, I'll try to get it into shape until tomomorrow. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-27 6:19 ` Russell Stuart 2006-06-27 17:18 ` Patrick McHardy @ 2006-07-04 13:29 ` Patrick McHardy 2006-07-04 19:29 ` jamal 2006-07-06 0:39 ` Russell Stuart 1 sibling, 2 replies; 56+ messages in thread From: Patrick McHardy @ 2006-07-04 13:29 UTC (permalink / raw) To: Russell Stuart Cc: Russell Stuart, hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer [-- Attachment #1: Type: text/plain, Size: 1192 bytes --] Russell Stuart wrote: > On 26/06/2006 9:10 PM, Patrick McHardy wrote: > >>> 5. We still did have to modify the kernel for ATM. That was >>> because of its rather unusual characteristics. However, >>> it you look at the size of modifications made to the kernel >>> verses the size made to the user space tool, (37 lines >>> versus 303 lines,) the bulk of the work was does in user >>> space. >> >> >> I'm sorry, but arguing that a limited special case solution is >> better because it needs slightly less code is just not reasonable. > > > Without seeing your actual proposal it is difficult to > judge whether this is a reasonable trade-off or not. > Hopefully we will see your code soon. Do you have any > idea when? Unfortunately I still didn't got to cleaning them up, so I'm sending them in their preliminary state. Its not much that is missing, but the netem usage of skb->cb needs to be integrated better, I failed to move it to the qdisc_skb_cb so far because of circular includes. But nothing unfixable. I'm mostly interested if the current size-tables can express what you need for ATM, I wasn't able to understand the big comment in tc_core.c in your patch. [-- Attachment #2: 01.diff --] [-- Type: text/plain, Size: 13169 bytes --] [NET_SCHED]: Add accessor function for packet length for qdiscs Signed-off-by: Patrick McHardy <kaber@trash.net> --- commit 2a6508576111d82246ee018edbcc4b0f0d18acad tree 8be27ab6040ea90ed11728763e5b8fcf9e221b67 parent 31304c909e6945b005af62cd55a582e9c010a0b4 author Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:01 +0200 committer Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:01 +0200 include/net/sch_generic.h | 9 +++++++-- net/sched/sch_atm.c | 4 ++-- net/sched/sch_cbq.c | 12 ++++++------ net/sched/sch_dsmark.c | 2 +- net/sched/sch_fifo.c | 2 +- net/sched/sch_gred.c | 12 ++++++------ net/sched/sch_hfsc.c | 8 ++++---- net/sched/sch_htb.c | 8 ++++---- net/sched/sch_netem.c | 6 +++--- net/sched/sch_prio.c | 2 +- net/sched/sch_red.c | 2 +- net/sched/sch_sfq.c | 14 +++++++------- net/sched/sch_tbf.c | 6 +++--- net/sched/sch_teql.c | 4 ++-- 14 files changed, 48 insertions(+), 43 deletions(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index b0e9108..75d7a55 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -184,12 +184,17 @@ tcf_destroy(struct tcf_proto *tp) kfree(tp); } +static inline unsigned int qdisc_tx_len(struct sk_buff *skb) +{ + return skb->len; +} + static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff_head *list) { __skb_queue_tail(list, skb); - sch->qstats.backlog += skb->len; - sch->bstats.bytes += skb->len; + sch->qstats.backlog += qdisc_tx_len(skb); + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; return NET_XMIT_SUCCESS; diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c index dbf44da..4df305e 100644 --- a/net/sched/sch_atm.c +++ b/net/sched/sch_atm.c @@ -453,9 +453,9 @@ #endif if (flow) flow->qstats.drops++; return ret; } - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; - flow->bstats.bytes += skb->len; + flow->bstats.bytes += qdisc_tx_len(skb); flow->bstats.packets++; /* * Okay, this may seem weird. We pretend we've dropped the packet if diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c index 80b7f6a..5d705e2 100644 --- a/net/sched/sch_cbq.c +++ b/net/sched/sch_cbq.c @@ -404,7 +404,7 @@ static int cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch) { struct cbq_sched_data *q = qdisc_priv(sch); - int len = skb->len; + int len = qdisc_tx_len(skb); int ret; struct cbq_class *cl = cbq_classify(skb, sch, &ret); @@ -688,7 +688,7 @@ #ifdef CONFIG_NET_CLS_POLICE static int cbq_reshape_fail(struct sk_buff *skb, struct Qdisc *child) { - int len = skb->len; + int len = qdisc_tx_len(skb); struct Qdisc *sch = child->__parent; struct cbq_sched_data *q = qdisc_priv(sch); struct cbq_class *cl = q->rx_class; @@ -915,7 +915,7 @@ cbq_dequeue_prio(struct Qdisc *sch, int if (skb == NULL) goto skip_class; - cl->deficit -= skb->len; + cl->deficit -= qdisc_tx_len(skb); q->tx_class = cl; q->tx_borrowed = borrow; if (borrow != cl) { @@ -923,11 +923,11 @@ #ifndef CBQ_XSTATS_BORROWS_BYTES borrow->xstats.borrows++; cl->xstats.borrows++; #else - borrow->xstats.borrows += skb->len; - cl->xstats.borrows += skb->len; + borrow->xstats.borrows += qdisc_tx_len(skb); + cl->xstats.borrows += qdisc_tx_len(skb); #endif } - q->tx_len = skb->len; + q->tx_len = qdisc_tx_len(skb); if (cl->deficit <= 0) { q->active[prio] = cl; diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c index 11c8a21..53346c6 100644 --- a/net/sched/sch_dsmark.c +++ b/net/sched/sch_dsmark.c @@ -265,7 +265,7 @@ #endif return err; } - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; sch->q.qlen++; diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c index c2689f4..ec99321 100644 --- a/net/sched/sch_fifo.c +++ b/net/sched/sch_fifo.c @@ -28,7 +28,7 @@ static int bfifo_enqueue(struct sk_buff { struct fifo_sched_data *q = qdisc_priv(sch); - if (likely(sch->qstats.backlog + skb->len <= q->limit)) + if (likely(sch->qstats.backlog + qdisc_tx_len(skb) <= q->limit)) return qdisc_enqueue_tail(skb, sch); return qdisc_reshape_fail(skb, sch); diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c index 0cafdd5..f0bf5d7 100644 --- a/net/sched/sch_gred.c +++ b/net/sched/sch_gred.c @@ -189,7 +189,7 @@ static int gred_enqueue(struct sk_buff * } q->packetsin++; - q->bytesin += skb->len; + q->bytesin += qdisc_tx_len(skb); if (gred_wred_mode(t)) gred_load_wred_set(t, q); @@ -227,8 +227,8 @@ static int gred_enqueue(struct sk_buff * break; } - if (q->backlog + skb->len <= q->limit) { - q->backlog += skb->len; + if (q->backlog + qdisc_tx_len(skb) <= q->limit) { + q->backlog += qdisc_tx_len(skb); return qdisc_enqueue_tail(skb, sch); } @@ -255,7 +255,7 @@ static int gred_requeue(struct sk_buff * } else { if (red_is_idling(&q->parms)) red_end_of_idle_period(&q->parms); - q->backlog += skb->len; + q->backlog += qdisc_tx_len(skb); } return qdisc_requeue(skb, sch); @@ -278,7 +278,7 @@ static struct sk_buff *gred_dequeue(stru "VQ 0x%x after dequeue, screwing up " "backlog.\n", tc_index_to_dp(skb)); } else { - q->backlog -= skb->len; + q->backlog -= qdisc_tx_len(skb); if (!q->backlog && !gred_wred_mode(t)) red_start_of_idle_period(&q->parms); @@ -300,7 +300,7 @@ static unsigned int gred_drop(struct Qdi skb = qdisc_dequeue_tail(sch); if (skb) { - unsigned int len = skb->len; + unsigned int len = qdisc_tx_len(skb); struct gred_sched_data *q; u16 dp = tc_index_to_dp(skb); diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 6b1b4a9..3fc8351 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -942,7 +942,7 @@ qdisc_peek_len(struct Qdisc *sch) printk("qdisc_peek_len: non work-conserving qdisc ?\n"); return 0; } - len = skb->len; + len = qdisc_tx_len(skb); if (unlikely(sch->ops->requeue(skb, sch) != NET_XMIT_SUCCESS)) { if (net_ratelimit()) printk("qdisc_peek_len: failed to requeue\n"); @@ -1648,7 +1648,7 @@ hfsc_enqueue(struct sk_buff *skb, struct return err; } - len = skb->len; + len = qdisc_tx_len(skb); err = cl->qdisc->enqueue(skb, cl->qdisc); if (unlikely(err != NET_XMIT_SUCCESS)) { cl->qstats.drops++; @@ -1712,9 +1712,9 @@ hfsc_dequeue(struct Qdisc *sch) return NULL; } - update_vf(cl, skb->len, cur_time); + update_vf(cl, qdisc_tx_len(skb), cur_time); if (realtime) - cl->cl_cumul += skb->len; + cl->cl_cumul += qdisc_tx_len(skb); if (cl->qdisc->q.qlen != 0) { if (cl->cl_flags & HFSC_RSC) { diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c index 34afe41..b26fa9a 100644 --- a/net/sched/sch_htb.c +++ b/net/sched/sch_htb.c @@ -733,12 +733,12 @@ #endif cl->qstats.drops++; return NET_XMIT_DROP; } else { - cl->bstats.packets++; cl->bstats.bytes += skb->len; + cl->bstats.packets++; cl->bstats.bytes += qdisc_tx_len(skb); htb_activate (q,cl); } sch->q.qlen++; - sch->bstats.packets++; sch->bstats.bytes += skb->len; + sch->bstats.packets++; sch->bstats.bytes += qdisc_tx_len(skb); HTB_DBG(1,1,"htb_enq_ok cl=%X skb=%p\n",(cl && cl != HTB_DIRECT)?cl->classid:0,skb); return NET_XMIT_SUCCESS; } @@ -1067,7 +1067,7 @@ next: } while (cl != start); if (likely(skb != NULL)) { - if ((cl->un.leaf.deficit[level] -= skb->len) < 0) { + if ((cl->un.leaf.deficit[level] -= qdisc_tx_len(skb)) < 0) { HTB_DBG(4,2,"htb_next_cl oldptr=%p quant_add=%d\n", level?cl->parent->un.inner.ptr[prio]:q->ptr[0][prio],cl->un.leaf.quantum); cl->un.leaf.deficit[level] += cl->un.leaf.quantum; @@ -1077,7 +1077,7 @@ next: gives us slightly better performance */ if (!cl->un.leaf.q->q.qlen) htb_deactivate (q,cl); - htb_charge_class (q,cl,level,skb->len); + htb_charge_class (q,cl,level,qdisc_tx_len(skb)); } return skb; } diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c index c5bd806..aa97ecb 100644 --- a/net/sched/sch_netem.c +++ b/net/sched/sch_netem.c @@ -225,7 +225,7 @@ static int netem_enqueue(struct sk_buff if (likely(ret == NET_XMIT_SUCCESS)) { sch->q.qlen++; - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; } else sch->qstats.drops++; @@ -507,8 +507,8 @@ static int tfifo_enqueue(struct sk_buff __skb_queue_after(list, skb, nskb); - sch->qstats.backlog += nskb->len; - sch->bstats.bytes += nskb->len; + sch->qstats.backlog += qdisc_tx_len(nskb); + sch->bstats.bytes += qdisc_tx_len(nskb); sch->bstats.packets++; return NET_XMIT_SUCCESS; diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index a5fa03c..2175732 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -99,7 +99,7 @@ #ifdef CONFIG_NET_CLS_ACT #endif if ((ret = qdisc->enqueue(skb, qdisc)) == NET_XMIT_SUCCESS) { - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; sch->q.qlen++; return NET_XMIT_SUCCESS; diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c index d65cadd..24ec0b2 100644 --- a/net/sched/sch_red.c +++ b/net/sched/sch_red.c @@ -95,7 +95,7 @@ static int red_enqueue(struct sk_buff *s ret = child->enqueue(skb, child); if (likely(ret == NET_XMIT_SUCCESS)) { - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; sch->q.qlen++; } else { diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c index d0d6e59..2a57d0d 100644 --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -225,7 +225,7 @@ static unsigned int sfq_drop(struct Qdis if (d > 1) { sfq_index x = q->dep[d+SFQ_DEPTH].next; skb = q->qs[x].prev; - len = skb->len; + len = qdisc_tx_len(skb); __skb_unlink(skb, &q->qs[x]); kfree_skb(skb); sfq_dec(q, x); @@ -241,7 +241,7 @@ static unsigned int sfq_drop(struct Qdis q->next[q->tail] = q->next[d]; q->allot[q->next[d]] += q->quantum; skb = q->qs[d].prev; - len = skb->len; + len = qdisc_tx_len(skb); __skb_unlink(skb, &q->qs[d]); kfree_skb(skb); sfq_dec(q, d); @@ -267,7 +267,7 @@ sfq_enqueue(struct sk_buff *skb, struct q->ht[hash] = x = q->dep[SFQ_DEPTH].next; q->hash[x] = hash; } - sch->qstats.backlog += skb->len; + sch->qstats.backlog += qdisc_tx_len(skb); __skb_queue_tail(&q->qs[x], skb); sfq_inc(q, x); if (q->qs[x].qlen == 1) { /* The flow is new */ @@ -282,7 +282,7 @@ sfq_enqueue(struct sk_buff *skb, struct } } if (++sch->q.qlen < q->limit-1) { - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; return 0; } @@ -303,7 +303,7 @@ sfq_requeue(struct sk_buff *skb, struct q->ht[hash] = x = q->dep[SFQ_DEPTH].next; q->hash[x] = hash; } - sch->qstats.backlog += skb->len; + sch->qstats.backlog += qdisc_tx_len(skb); __skb_queue_head(&q->qs[x], skb); sfq_inc(q, x); if (q->qs[x].qlen == 1) { /* The flow is new */ @@ -347,7 +347,7 @@ sfq_dequeue(struct Qdisc* sch) skb = __skb_dequeue(&q->qs[a]); sfq_dec(q, a); sch->q.qlen--; - sch->qstats.backlog -= skb->len; + sch->qstats.backlog -= qdisc_tx_len(skb); /* Is the slot empty? */ if (q->qs[a].qlen == 0) { @@ -359,7 +359,7 @@ sfq_dequeue(struct Qdisc* sch) } q->next[q->tail] = a; q->allot[a] += q->quantum; - } else if ((q->allot[a] -= skb->len) <= 0) { + } else if ((q->allot[a] -= qdisc_tx_len(skb)) <= 0) { q->tail = a; a = q->next[a]; q->allot[a] += q->quantum; diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c index d9a5d29..c87b0e6 100644 --- a/net/sched/sch_tbf.c +++ b/net/sched/sch_tbf.c @@ -139,7 +139,7 @@ static int tbf_enqueue(struct sk_buff *s struct tbf_sched_data *q = qdisc_priv(sch); int ret; - if (skb->len > q->max_size) { + if (qdisc_tx_len(skb) > q->max_size) { sch->qstats.drops++; #ifdef CONFIG_NET_CLS_POLICE if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch)) @@ -155,7 +155,7 @@ #endif } sch->q.qlen++; - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; return 0; } @@ -204,7 +204,7 @@ static struct sk_buff *tbf_dequeue(struc psched_time_t now; long toks, delay; long ptoks = 0; - unsigned int len = skb->len; + unsigned int len = qdisc_tx_len(skb); PSCHED_GET_TIME(now); diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c index 4c16ad5..538f63f 100644 --- a/net/sched/sch_teql.c +++ b/net/sched/sch_teql.c @@ -97,7 +97,7 @@ teql_enqueue(struct sk_buff *skb, struct __skb_queue_tail(&q->q, skb); if (q->q.qlen <= dev->tx_queue_len) { - sch->bstats.bytes += skb->len; + sch->bstats.bytes += qdisc_tx_len(skb); sch->bstats.packets++; return 0; } @@ -278,7 +278,7 @@ static int teql_master_xmit(struct sk_bu struct Qdisc *start, *q; int busy; int nores; - int len = skb->len; + int len = qdisc_tx_len(skb); struct sk_buff *skb_res = NULL; start = master->slaves; [-- Attachment #3: 02.diff --] [-- Type: text/plain, Size: 2212 bytes --] [NET_SCHED]: Move top-level device queueing code to seperate function Signed-off-by: Patrick McHardy <kaber@trash.net> --- commit a39585afe71dafab96208515a8fa99c92b108fee tree fbb7672a3061a38edc9f75d3fb8f34652796b109 parent 2a6508576111d82246ee018edbcc4b0f0d18acad author Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:28 +0200 committer Patrick McHardy <kaber@trash.net> Tue, 04 Jul 2006 15:03:28 +0200 include/net/pkt_sched.h | 1 + net/core/dev.c | 10 +--------- net/sched/sch_generic.c | 12 ++++++++++++ 3 files changed, 14 insertions(+), 9 deletions(-) diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index 1925c65..44cf69e 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -224,6 +224,7 @@ extern struct qdisc_rate_table *qdisc_ge struct rtattr *tab); extern void qdisc_put_rtab(struct qdisc_rate_table *tab); +extern int qdisc_enqueue_root(struct net_device *dev, struct sk_buff *skb); extern void __qdisc_run(struct net_device *dev); static inline void qdisc_run(struct net_device *dev) diff --git a/net/core/dev.c b/net/core/dev.c index 066a60a..8599120 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1449,15 +1449,7 @@ #ifdef CONFIG_NET_CLS_ACT skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); #endif if (q->enqueue) { - /* Grab device queue */ - spin_lock(&dev->queue_lock); - - rc = q->enqueue(skb, q); - - qdisc_run(dev); - - spin_unlock(&dev->queue_lock); - rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc; + rc = qdisc_enqueue_root(dev, skb); goto out; } diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index d735f51..2bab466 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -77,6 +77,18 @@ void qdisc_unlock_tree(struct net_device if one is grabbed, another must be free. */ +int qdisc_enqueue_root(struct net_device *dev, struct sk_buff *skb) +{ + int ret; + + spin_lock(&dev->queue_lock); + ret = dev->qdisc->enqueue(skb, dev->qdisc); + qdisc_run(dev); + spin_unlock(&dev->queue_lock); + + return ret == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : ret; +} + /* Kick device. Note, that this procedure can be called by a watchdog timer, so that [-- Attachment #4: 03.diff --] [-- Type: text/plain, Size: 8494 bytes --] diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..2ce55d5 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -83,6 +83,21 @@ struct tc_ratespec __u32 rate; }; +struct tc_sizespec +{ + unsigned int cell_log; + unsigned int addend; +}; + +enum { + TCA_STAB_UNSPEC, + TCA_STAB_BASE, + TCA_STAB_DATA, + __TCA_STAB_MAX +}; + +#define TCA_STAB_MAX (__TCA_STAB_MAX - 1) + /* FIFO section */ struct tc_fifo_qopt diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index facd9ee..167cc22 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -821,6 +821,7 @@ enum TCA_RATE, TCA_FCNT, TCA_STATS2, + TCA_STAB, __TCA_MAX }; diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index 44cf69e..8fd9a42 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -223,6 +223,7 @@ extern struct Qdisc *qdisc_lookup_class( extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r, struct rtattr *tab); extern void qdisc_put_rtab(struct qdisc_rate_table *tab); +extern void qdisc_put_stab(struct qdisc_size_table *tab); extern int qdisc_enqueue_root(struct net_device *dev, struct sk_buff *skb); extern void __qdisc_run(struct net_device *dev); diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 75d7a55..76c50a1 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -23,6 +23,15 @@ struct qdisc_rate_table int refcnt; }; +struct qdisc_size_table +{ + struct list_head list; + struct tc_sizespec size; + int refcnt; + unsigned int tsize; + u32 data[]; +}; + struct Qdisc { int (*enqueue)(struct sk_buff *skb, struct Qdisc *dev); @@ -33,6 +42,7 @@ #define TCQ_F_THROTTLED 2 #define TCQ_F_INGRESS 4 int padded; struct Qdisc_ops *ops; + struct qdisc_size_table *stab; u32 handle; u32 parent; atomic_t refcnt; @@ -184,9 +194,19 @@ tcf_destroy(struct tcf_proto *tp) kfree(tp); } +struct qdisc_skb_cb { + unsigned int len; + char data[]; +}; + +static inline struct qdisc_skb_cb *qdisc_skb_cb(struct sk_buff *skb) +{ + return (struct qdisc_skb_cb *)skb->cb; +} + static inline unsigned int qdisc_tx_len(struct sk_buff *skb) { - return skb->len; + return qdisc_skb_cb(skb)->len; } static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch, diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index c7844ba..479fc85 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -286,6 +286,78 @@ void qdisc_put_rtab(struct qdisc_rate_ta } } +static LIST_HEAD(qdisc_stab_list); + +static struct qdisc_size_table *qdisc_get_stab(struct rtattr *tab, int *err) +{ + struct qdisc_size_table *stab; + struct rtattr *tb[TCA_STAB_MAX]; + unsigned int tsize; + + *err = -EINVAL; + if (rtattr_parse_nested(tb, TCA_STAB_MAX, tab)) + return NULL; + if (tb[TCA_STAB_BASE-1] == NULL || + RTA_PAYLOAD(tb[TCA_STAB_BASE-1]) < sizeof(struct tc_sizespec)) + return NULL; + + tsize = 0; + if (tb[TCA_STAB_DATA-1] != NULL) + tsize = RTA_PAYLOAD(tb[TCA_STAB_DATA-1]) / sizeof(u32); + + list_for_each_entry(stab, &qdisc_stab_list, list) { + if (stab->tsize != tsize) + continue; + if (memcmp(&stab->size, RTA_DATA(tb[TCA_STAB_BASE-1]), + sizeof(stab->size))) + continue; + if (tsize > 0 && + memcmp(stab->data, RTA_DATA(tb[TCA_STAB_DATA-1]), + sizeof(u32) * tsize)); + continue; + stab->refcnt++; + return stab; + } + + *err = -ENOMEM; + stab = kmalloc(sizeof(*stab) + sizeof(u32) * tsize, GFP_KERNEL); + if (stab == NULL) + return stab; + memcpy(&stab->size, RTA_DATA(tb[TCA_STAB_BASE-1]), sizeof(stab->size)); + stab->tsize = tsize; + if (tsize > 0) + memcpy(stab->data, RTA_DATA(tb[TCA_STAB_DATA-1]), + sizeof(u32) * tsize); + list_add_tail(&stab->list, &qdisc_stab_list); + *err = 0; + return stab; +} + +void qdisc_put_stab(struct qdisc_size_table *stab) +{ + if (!stab || --stab->refcnt) + return; + list_del(&stab->list); + kfree(stab); +} + +static int +qdisc_dump_stab(struct sk_buff *skb, struct qdisc_size_table *stab) +{ + unsigned char *b = skb->tail; + struct rtattr *rta = (struct rtattr *)b; + + RTA_PUT(skb, TCA_STAB, 0, NULL); + RTA_PUT(skb, TCA_STAB_BASE, sizeof(stab->size), &stab->size); + RTA_PUT(skb, TCA_STAB_DATA, sizeof(stab->data[0]) * stab->tsize, + stab->data); + rta->rta_len = skb->tail - b; + return skb->len; + +rtattr_failure: + skb_trim(skb, b - skb->data); + return -1; +} /* Allocate an unique handle from space managed by kernel */ @@ -453,6 +525,11 @@ #endif sch->handle = handle; if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS-1])) == 0) { + if (tca[TCA_STAB-1]) { + sch->stab = qdisc_get_stab(tca[TCA_STAB-1], &err); + if (sch->stab == NULL) + goto err_out3; + } #ifdef CONFIG_NET_ESTIMATOR if (tca[TCA_RATE-1]) { err = gen_new_estimator(&sch->bstats, &sch->rate_est, @@ -477,6 +554,7 @@ #endif return sch; } err_out3: + qdisc_put_stab(sch->stab); dev_put(dev); kfree((char *) sch - sch->padded); err_out2: @@ -488,15 +566,26 @@ err_out: static int qdisc_change(struct Qdisc *sch, struct rtattr **tca) { - if (tca[TCA_OPTIONS-1]) { - int err; + int err; + if (tca[TCA_OPTIONS-1]) { if (sch->ops->change == NULL) return -EINVAL; err = sch->ops->change(sch, tca[TCA_OPTIONS-1]); if (err) return err; } + if (tca[TCA_STAB-1]) { + struct qdisc_size_table *stab; + + stab = qdisc_get_stab(tca[TCA_STAB-1], &err); + if (stab == NULL) + return err; + spin_lock_bh(&sch->dev->queue_lock); + qdisc_put_stab(sch->stab); + sch->stab = stab; + spin_unlock_bh(&sch->dev->queue_lock); + } #ifdef CONFIG_NET_ESTIMATOR if (tca[TCA_RATE-1]) gen_replace_estimator(&sch->bstats, &sch->rate_est, @@ -769,6 +858,9 @@ static int tc_fill_qdisc(struct sk_buff goto rtattr_failure; q->qstats.qlen = q->q.qlen; + if (q->stab != NULL && qdisc_dump_stab(skb, q->stab) < 0) + goto rtattr_failure; + if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, TCA_XSTATS, q->stats_lock, &d) < 0) goto rtattr_failure; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 2bab466..9022650 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -67,6 +67,21 @@ void qdisc_unlock_tree(struct net_device write_unlock_bh(&qdisc_tree_lock); } +static void qdisc_init_len(struct sk_buff *skb, struct Qdisc *q) +{ + unsigned int idx, len = skb->len; + struct qdisc_size_table *stab = q->stab; + + if (stab == NULL) + goto out; + idx = len >> stab->size.cell_log; + if (idx < stab->tsize) + len = stab->data[idx]; + len += stab->size.addend; +out: + ((struct qdisc_skb_cb *)skb->cb)->len = len; +} + /* dev->queue_lock serializes queue accesses for this device AND dev->qdisc pointer itself. @@ -82,6 +97,7 @@ int qdisc_enqueue_root(struct net_device int ret; spin_lock(&dev->queue_lock); + qdisc_init_len(skb, dev->qdisc); ret = dev->qdisc->enqueue(skb, dev->qdisc); qdisc_run(dev); spin_unlock(&dev->queue_lock); diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c index aa97ecb..15dde88 100644 --- a/net/sched/sch_netem.c +++ b/net/sched/sch_netem.c @@ -148,7 +148,7 @@ static long tabledist(unsigned long mu, static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch) { struct netem_sched_data *q = qdisc_priv(sch); - struct netem_skb_cb *cb = (struct netem_skb_cb *)skb->cb; + struct netem_skb_cb *cb = (struct netem_skb_cb *)qdisc_skb_cb(skb)->data; struct sk_buff *skb2; int ret; int count = 1; @@ -268,7 +268,7 @@ static struct sk_buff *netem_dequeue(str skb = q->qdisc->dequeue(q->qdisc); if (skb) { const struct netem_skb_cb *cb - = (const struct netem_skb_cb *)skb->cb; + = (const struct netem_skb_cb *)qdisc_skb_cb(skb)->data; psched_time_t now; /* if more time remaining? */ @@ -493,13 +493,13 @@ static int tfifo_enqueue(struct sk_buff struct fifo_sched_data *q = qdisc_priv(sch); struct sk_buff_head *list = &sch->q; const struct netem_skb_cb *ncb - = (const struct netem_skb_cb *)nskb->cb; + = (const struct netem_skb_cb *)qdisc_skb_cb(nskb)->data; struct sk_buff *skb; if (likely(skb_queue_len(list) < q->limit)) { skb_queue_reverse_walk(list, skb) { const struct netem_skb_cb *cb - = (const struct netem_skb_cb *)skb->cb; + = (const struct netem_skb_cb *)qdisc_skb_cb(skb)->data; if (!PSCHED_TLESS(ncb->time_to_send, cb->time_to_send)) break; [-- Attachment #5: iproute.diff --] [-- Type: text/plain, Size: 4483 bytes --] diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..2ce55d5 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -83,6 +83,21 @@ struct tc_ratespec __u32 rate; }; +struct tc_sizespec +{ + unsigned int cell_log; + unsigned int addend; +}; + +enum { + TCA_STAB_UNSPEC, + TCA_STAB_BASE, + TCA_STAB_DATA, + __TCA_STAB_MAX +}; + +#define TCA_STAB_MAX (__TCA_STAB_MAX - 1) + /* FIFO section */ struct tc_fifo_qopt diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index 5e33a20..addf5fb 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -821,6 +821,7 @@ enum TCA_RATE, TCA_FCNT, TCA_STATS2, + TCA_STAB, __TCA_MAX }; diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c index e9174ab..c38fa87 100644 --- a/tc/tc_qdisc.c +++ b/tc/tc_qdisc.c @@ -41,10 +41,79 @@ static int usage(void) return -1; } +static int parse_stab(int *argcp, char ***argvp, struct tc_sizespec *stab, + __u32 **datap) +{ + int argc = *argcp; + char **argv = *argvp; + + NEXT_ARG(); + while (argc > 0) { + if (matches("overhead", *argv) == 0) { + NEXT_ARG(); + if (stab->addend) + duparg("overhead", *argv); + if (get_size(&stab->addend, *argv)) + return -1; + NEXT_ARG(); + } else if (matches("cell_log", *argv) == 0) { + NEXT_ARG(); + if (stab->cell_log) + duparg("cell_log", *argv); + if (get_u32(&stab->cell_log, *argv, 0)) + return -1; + NEXT_ARG(); + } else if (get_size(*datap, *argv) == 0) { + argv++, argc--; + ++*datap; + } else + break; + } + if (!stab->addend && !stab->cell_log) + return -1; + *argcp = argc; + *argvp = argv; + return 0; +} + +static void print_stab(FILE *f, char *prefix, struct rtattr *tab) +{ + struct rtattr *tb[TCA_STAB_MAX+1]; + struct tc_sizespec *size; + unsigned int i; + __u32 *data; + SPRINT_BUF(buf); + + parse_rtattr_nested(tb, TCA_STAB_MAX, tab); + if (tb[TCA_STAB_BASE] == NULL || + RTA_PAYLOAD(tb[TCA_STAB_BASE]) < sizeof(struct tc_sizespec)) + return; + fprintf(f, "%s", prefix); + size = RTA_DATA(tb[TCA_STAB_BASE]); + if (size->addend) { + print_size(buf, SPRINT_BSIZE-1, size->addend); + fprintf(f, "overhead %s ", buf); + } + if (size->cell_log) + fprintf(f, "cell_log %u ", size->cell_log); + if (tb[TCA_STAB_DATA] == NULL) + return; + data = RTA_DATA(tb[TCA_STAB_DATA]); + for (i = 0; i < RTA_PAYLOAD(tb[TCA_STAB_DATA]) / sizeof(__u32); i++) { + print_size(buf, SPRINT_BSIZE-1, data[i]); + fprintf(f, "%s ", buf); + } +} + int tc_qdisc_modify(int cmd, unsigned flags, int argc, char **argv) { struct qdisc_util *q = NULL; struct tc_estimator est; + struct { + struct tc_sizespec size; + __u32 data[256]; + } stab; + __u32 *stabdata = &stab.data[0]; char d[16]; char k[16]; struct { @@ -55,6 +124,7 @@ int tc_qdisc_modify(int cmd, unsigned fl memset(&req, 0, sizeof(req)); memset(&est, 0, sizeof(est)); + memset(&stab, 0, sizeof(stab)); memset(&d, 0, sizeof(d)); memset(&k, 0, sizeof(k)); @@ -108,6 +178,10 @@ #endif } else if (matches(*argv, "estimator") == 0) { if (parse_estimator(&argc, &argv, &est)) return -1; + } else if (matches(*argv, "stab") == 0) { + if (parse_stab(&argc, &argv, &stab.size, &stabdata)) + return -1; + continue; } else if (matches(*argv, "help") == 0) { usage(); } else { @@ -124,6 +198,16 @@ #endif addattr_l(&req.n, sizeof(req), TCA_KIND, k, strlen(k)+1); if (est.ewma_log) addattr_l(&req.n, sizeof(req), TCA_RATE, &est, sizeof(est)); + if (stab.size.addend || stab.size.cell_log) { + struct rtattr *tail = NLMSG_TAIL(&req.n); + + addattr_l(&req.n, sizeof(req), TCA_STAB, NULL, 0); + addattr_l(&req.n, sizeof(req), TCA_STAB_BASE, &stab.size, + sizeof(stab.size)); + addattr_l(&req.n, sizeof(req), TCA_STAB_DATA, stab.data, + (void *)stabdata - (void *)stab.data); + tail->rta_len = (void *)NLMSG_TAIL(&req.n) - (void *)tail; + } if (q) { if (!q->parse_qopt) { @@ -215,7 +299,7 @@ static int print_qdisc(const struct sock q = get_qdisc_kind("prio"); else q = get_qdisc_kind(RTA_DATA(tb[TCA_KIND])); - + if (tb[TCA_OPTIONS]) { if (q) q->print_qopt(q, fp, tb[TCA_OPTIONS]); @@ -223,6 +307,12 @@ static int print_qdisc(const struct sock fprintf(fp, "[cannot parse qdisc parameters]"); } fprintf(fp, "\n"); + + if (tb[TCA_STAB]) { + print_stab(fp, " ", tb[TCA_STAB]); + fprintf(fp, "\n"); + } + if (show_stats) { struct rtattr *xstats = NULL; ^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-04 13:29 ` Patrick McHardy @ 2006-07-04 19:29 ` jamal 2006-07-04 23:53 ` Patrick McHardy 2006-07-06 0:39 ` Russell Stuart 1 sibling, 1 reply; 56+ messages in thread From: jamal @ 2006-07-04 19:29 UTC (permalink / raw) To: Patrick McHardy Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Alan Cox, Russell Stuart, Russell Stuart On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote: > Russell Stuart wrote: [..] > > Without seeing your actual proposal it is difficult to > > judge whether this is a reasonable trade-off or not. > > Hopefully we will see your code soon. Do you have any > > idea when? > > Unfortunately I still didn't got to cleaning them up, so I'm sending > them in their preliminary state. Its not much that is missing, but > the netem usage of skb->cb needs to be integrated better, I failed > to move it to the qdisc_skb_cb so far because of circular includes. > But nothing unfixable. I'm mostly interested if the current size-tables > can express what you need for ATM, I wasn't able to understand the > big comment in tc_core.c in your patch. > Looks good from within the range of "change within reason" of addressed problem. The cb on the qdisc seems only usable for netem, correct? Also while not unreasonable, i wasnt sure how qdisc_enqueue_root() fit in the grand scheme of things for this change (it seemed out of place). cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-04 19:29 ` jamal @ 2006-07-04 23:53 ` Patrick McHardy 0 siblings, 0 replies; 56+ messages in thread From: Patrick McHardy @ 2006-07-04 23:53 UTC (permalink / raw) To: hadi Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Alan Cox, Russell Stuart, Russell Stuart jamal wrote: > On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote: > >>Russell Stuart wrote: > > [..] > >>>Without seeing your actual proposal it is difficult to >>>judge whether this is a reasonable trade-off or not. >>>Hopefully we will see your code soon. Do you have any >>>idea when? >> >>Unfortunately I still didn't got to cleaning them up, so I'm sending >>them in their preliminary state. Its not much that is missing, but >>the netem usage of skb->cb needs to be integrated better, I failed >>to move it to the qdisc_skb_cb so far because of circular includes. >>But nothing unfixable. I'm mostly interested if the current size-tables >>can express what you need for ATM, I wasn't able to understand the >>big comment in tc_core.c in your patch. >> > > > Looks good from within the range of "change within reason" of addressed > problem. The cb on the qdisc seems only usable for netem, correct? Yes, it has the same limitations as current netem cb usage. Really makeing it useable for all qdiscs would require reserving a few bytes for every level, so far that isn't necessary and I would prefer to just add a time_to_send field for netem. The problem with this is that it currently requires sch_generic.h and pkt_sched.h to include one another, so I did the qdisc_skb_cb() thing to at least get it to compile for now. > Also while not unreasonable, i wasnt sure how qdisc_enqueue_root() > fit in the grand scheme of things for this change (it seemed out of > place). Its there as a spot to do the initial time calculations and store them in the cb. I didn't want to put this in net/core/dev.c. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-04 13:29 ` Patrick McHardy 2006-07-04 19:29 ` jamal @ 2006-07-06 0:39 ` Russell Stuart 2006-07-07 8:00 ` Patrick McHardy 1 sibling, 1 reply; 56+ messages in thread From: Russell Stuart @ 2006-07-06 0:39 UTC (permalink / raw) To: Patrick McHardy Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer [-- Attachment #1: Type: text/plain, Size: 1921 bytes --] On Tue, 2006-07-04 at 15:29 +0200, Patrick McHardy wrote: > Unfortunately I still didn't got to cleaning them up, so I'm sending > them in their preliminary state. Its not much that is missing, but > the netem usage of skb->cb needs to be integrated better, I failed > to move it to the qdisc_skb_cb so far because of circular includes. Cleanups aside, architecturally the bulk of your patch looks like a no-brainier to me. The calculation of packet length should be in one place. Caching it in skb->cb was a nice touch. > But nothing unfixable. I'm mostly interested if the current size-tables > can express what you need for ATM, I wasn't able to understand the > big comment in tc_core.c in your patch. Unfortunately you do things in the wrong order for ATM. See: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html for an overview of the problem, and then the attached email for a detailed description of how the current patch addresses it. It is a trivial fix. As I said earlier, RTAB and STAB contain the same numbers, just scaled differently. The ATM patch stuffed around with RTAB. With your patch in place it will have to do the same exactly the same thing with STAB - because RTAB and STAB carry the same data. So to me the two patches seem orthogonal. One observation is the size optimisation you applied to STAB, making it variable length, could also be applied to RTAB. In fact it should be. Then they would be identical, apart from the scaling. Even the lookup operation (performed in qdisc_init_len in your patch) would be identical. However, now you lot have made me go away and think, I have another idea on how to attack this. Perhaps it will be more palatable to you. It would replace RTAB and STAB with a 28 byte structure for most protocol stacks - well all I can think of off the top of my head, anyway. RTAB would have to remain for backwards compatibility, of course. [-- Attachment #2: Attached message - Re: Getting ATM patches into the kernel --] [-- Type: message/rfc822, Size: 10535 bytes --] From: Russell Stuart <russell@stuart.id.au> To: Jesper Dangaard Brouer <jdb@comx.dk> Subject: Re: Getting ATM patches into the kernel Date: Fri, 19 May 2006 22:59:34 +1000 Message-ID: <1148043575.4206.259.camel@ras.pc.brisbane.lube> Explanation of rate calculation code. tc_core.c lines 53..61, tc_align_to_cells(): Does the same thing as you function of the same name. Code is intended to be semantically equivalent. tc_core.c lines 145..160, in tc_calc_ratespec: There are two material differences to your code. Both are in this line: int sz = ((i+1)<<cell_log) - 1 + overhead; The first is that where I add the overhead, you don't. This is because I am calculating the rate table so it works as well as close to optimal as possible on an unpatched kernel. Given the background in your draft email, I assume I don't have to prove you that it does in fact calculate the best possible table for an unpatched kernel. The second change is I use: ((i+1)<<cell_log) - 1 Whereas you had: ((i+1)<<cell_log) The number calculated by this expression: ((i+1) << cell_log) is the shortest packet length rtab[(i+1)] applies to. Ergo, that number minus 1 must be the longest packet length the previous rtab entry, rtab[i], applies to. So the intent of the change is that in unpatched kernels, rtab[i] will contain the time to send the longest packet that uses it. I suspect your code ends up doing the same thing on kernels that add overhead to the packet length. There should be no other semantic differences between and how you and I calculate the rate table. If so this should be sufficient to show that if your code calculated an ideal rate table for kernels that added the overhead to the packet length before looking up the rate table, my code should calculate the best possible table for unpatched kernels. It won't be ideal because some table entries will apply to packets using different numbers of ATM cells. This is unavoidable. But where that happens my code will always be conservative, ie use the longest packet length. I think backward the compatibility increases the odds of the patch being accepted. The next step is to patch the kernel it can get 100% accurate results by using this table. Consider this "diagram": Overhead 0 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 39 40 41 42 43 44 45 46 47 48 49 50 rtab[skb->len/8]: 1 1 1 1 1 1 1 1 1 x2 2 2 Each column is for a particular packet length. The rows show the packet length as seen by a layer of the protocol. The top row, "ATM Payload", contains packet length seen by the ATM layer - ie the ATM payload. It includes all overheads bar cell overhead and padding. It is always equal to: skb->len + overhead. The second row, "skb->len" contains the packet seen by the kernel. It is just the figure in the first line less the "overhead" passed to tc. The third line shows how "tc" will calculate the rate table entry for the packet length. It contains the number of ATM cells "tc" thinks the packet will occupy. I am assuming a cell_log of 3 here, meaning each rate table entry is used for 8 different packet lengths. In an unpatched kernel a single rate table entry may apply to packets which use different numbers of ATM cells. Since the rate table entry is a single number it cant be right for all packet lengths when that happens. If faced with this ambiguity tc uses the number of cells occupied by the longest packet the rate table entry applies to. This means the rate table will be wrong for the shorter packets, and this is highlighted in the diagram an "x". Thus we see from the table that for a 39 byte packet tc thinks it will take 1 cell to send. But for a 48 byte packet the table says it will take "x2" packets. The "x" means it will really take 1 packet, but because this rate table entry is also used by 49..55 byte packets tc used the bigger value for that rate table entry. We can fix the errant "x2" entry by sliding the rate table along so it sits under the smallest packet length it could apply to (in this case 49). To achieve that the rate table must be slide one byte to right. This can be achieved by adding an "alignment" figure of -1 to the packet length before looking up the rate table. Here is the same diagram as above, but with two additional lines added showing the effect of adding the alignment: Overhead 0 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 39 40 41 42 43 44 45 46 47 48 49 50 rtab[skb->len/8]: 1 1 1 1 1 1 1 1 1 x2 2 2 skb->len-1: 38 39 40 41 42 43 44 45 46 47 48 49 rtab[(skb->len-1)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 The rows "skb->len-1" and "rtab[(skb->len-1)/8]" are copies of the rows above that have been moved slightly. They have to be copies, as the rtab computed by tc hasn't changed, so for example 47 must still yield 1 cell, and 48 must still yield 2 cells. Below is the same diagram repeated for overheads 1 .. 9. The are all pretty much the same as the one above. The only changes is the increasing overhead, which in turn changes the number of columns in error. The alignment changes to compensate. You will see it follows a straightforward pattern as the overhead increases. Overhead 1 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 38 39 40 41 42 43 44 45 46 47 48 49 rtab[skb->len/8]: 1 1 1 1 1 1 1 1 1 1 2 2 skb->len-0: 38 39 40 41 42 43 44 45 46 47 48 49 rtab[(skb->len-0)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 2 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 37 38 39 40 41 42 43 44 45 46 47 48 rtab[skb->len/8]: 1 1 1 x2 x2 x2 x2 x2 x2 x2 2 2 skb->len-7: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-7)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 3 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 36 37 38 39 40 41 42 43 44 45 46 47 rtab[skb->len/8]: 1 1 1 1 x2 x2 x2 x2 x2 x2 2 2 skb->len-6: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-6)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 4 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 35 36 37 38 39 40 41 42 43 44 45 46 rtab[skb->len/8]: 1 1 1 1 1 x2 x2 x2 x2 x2 2 2 skb->len-5: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-5)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 5 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 34 35 36 37 38 39 40 41 42 43 44 45 rtab[skb->len/8]: 1 1 1 1 1 1 x2 x2 x2 x2 2 2 skb->len-4: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-4)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 6 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 33 34 35 36 37 38 39 40 41 42 43 44 rtab[skb->len/8]: 1 1 1 1 1 1 1 x2 x2 x2 2 2 skb->len-3: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-3)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 7 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 32 33 34 35 36 37 38 39 40 41 42 43 rtab[skb->len/8]: 1 1 1 1 1 1 1 1 x2 x2 2 2 skb->len-2: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-2)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 8 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 31 32 33 34 35 36 37 38 39 40 41 42 rtab[skb->len/8]: 1 1 1 1 1 1 1 1 1 x2 2 2 skb->len-1: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-1)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Overhead 9 ---------- ATM Payload: 39 40 41 42 43 44 45 46 47 48 49 50 skb->len: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[skb->len/8]: 1 1 1 1 1 1 1 1 1 1 2 2 skb->len-0: 30 31 32 33 34 35 36 37 38 39 40 41 rtab[(skb->len-0)/8]: 1 1 1 1 1 1 1 1 1 1 2 2 Notice that at overhead 8 has an alignment of -1, and this is the same alignment as overhead 0 had. This is because the errors have moved through one rate table entry and are now into the next one, so the "fix" is the same. We can draw up a table showing the required alignment for each overhead, like this: overhead | alignment ---------+---------- 0 | -1 1 | 0 2 | -7 3 | -6 4 | -5 5 | -4 6 | -3 7 | -2 We only need 8 elements, as once we get to 8 (or more accurately 1<<cell_log) it repeats. If the kernel adds the alignment figure from the table to the packet size before looking up the rate table, then it will always get accurate results from the optimal table for the unpatched kernel. The alignment figure is calculated by "tc" and passed to the kernel, much as you passed the "overhead" to the kernel in your code. In my case it will always be small. And finally, we come to ...... tc_core.c line 122, in tc_calc_cell_align. The expression I use to calculate the alignment is: return (overhead + cell_size - 2) % cell_size - cell_size + 1 You could try and mathematically prove it does yield the table above, but since there are only 8 values that matter (the % in the expression ensures it repeats after that), it much easier to just plug in the overheads 0..7 and verify it does come up with the correct alignment. QED, for a cell_log of 3 anyway. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-06 0:39 ` Russell Stuart @ 2006-07-07 8:00 ` Patrick McHardy 2006-07-10 8:44 ` Russell Stuart 0 siblings, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-07-07 8:00 UTC (permalink / raw) To: Russell Stuart Cc: hadi, Alan Cox, Stephen Hemminger, netdev, Jesper Dangaard Brouer Russell Stuart wrote: > Unfortunately you do things in the wrong order for ATM. > See: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html > for an overview of the problem, and then the attached email for > a detailed description of how the current patch addresses it. > It is a trivial fix. Actually that was the part I didn't understand, you keep talking (also in that comment in tc_core.c) about an "unknown overhead". What is that and why would it be unknown? The mail you attached is quite long, is there an simple example that shows what you mean? > As I said earlier, RTAB and STAB contain the same numbers, > just scaled differently. The ATM patch stuffed around with > RTAB. With your patch in place it will have to do the same > exactly the same thing with STAB - because RTAB and STAB > carry the same data. So to me the two patches seem > orthogonal. Yes. As I said in the beginning of this thread, we could still do the RTAB thing for qdiscs that support it. But since the visible size should be the same for all qdiscs (including child ones) we need to do the initial calculation anyway, so I don't see any purpose in still adjusting the rate tables and using skb->len instead of using the adjusted size. > One observation is the size optimisation you applied to STAB, > making it variable length, could also be applied to RTAB. > In fact it should be. Then they would be identical, apart > from the scaling. Even the lookup operation (performed in > qdisc_init_len in your patch) would be identical. We can do that. I'm not attached to the size tables :) > However, now you lot have made me go away and think, I have > another idea on how to attack this. Perhaps it will be > more palatable to you. It would replace RTAB and STAB with > a 28 byte structure for most protocol stacks - well all I can > think of off the top of my head, anyway. RTAB would have to > remain for backwards compatibility, of course. Can you describe in more detail? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-07 8:00 ` Patrick McHardy @ 2006-07-10 8:44 ` Russell Stuart 0 siblings, 0 replies; 56+ messages in thread From: Russell Stuart @ 2006-07-10 8:44 UTC (permalink / raw) To: Patrick McHardy; +Cc: netdev, hadi, Alan Cox, lartc On Fri, 2006-07-07 at 10:00 +0200, Patrick McHardy wrote: > Russell Stuart wrote: > > Unfortunately you do things in the wrong order for ATM. > > See: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html > > for an overview of the problem, and then the attached email for > > a detailed description of how the current patch addresses it. > > It is a trivial fix. > > Actually that was the part I didn't understand, you keep talking > (also in that comment in tc_core.c) about an "unknown overhead". > What is that and why would it be unknown? The mail you attached > is quite long, is there an simple example that shows what you > mean? The "unknown overhead" is just the overhead passed to tc using the "tc ... overhead xxx" option. It is probably what you intended to put into your addend attribute. It is "unknown" because the kernel currently doesn't use it. It is passed in the tc_ratespec, but is ignored by the kernel as are most fields in there. The easy way to fix the "ATM" problem described in the big comment is simply to add the "overhead" to the packet length before doing the RTAB lookup. (Identical comments apply to STAB). If you don't accept this or understand why, then go read the "long emails" which attempt to explain it in detail. Jesper's initial version of the patch did just that, BTW. However if you do that then you have to adjust RTAB for all cases (not just ATM) to reflect that the kernel is now adding the overhead. Thus the RTAB tc sends to the kernel now changes for different kernel versions, making modern versions of tc incompatible with older kernels, and visa versa. I didn't consider that acceptable. My solution to this to give the kernel the old format RTAB (ie the one that assumed the kernel didn't add the overhead) and a small adjustment. This small adjustment is called cell_align in the ATM patch. You do the same thing with cell_align as the previous solution did with the overhead - ie add it in just before looking up RTAB. This is in effect all the kernel part of the ATM patch does - make the kernel accept the cell_align option, and add it to skb->len before looking up RTAB. The difference between cell_align and overhead is that cell_align is always 0 when there is no packetisation, and even when non zero it is small (less than 1<<cell_log, ie less than 8 for typical MTU's). So for anything bar ATM it is zero which means old kernels are completely unaffected, and even for ATM not sending it produces a small error which means older kernels still benefit from the "ATM" user space patch. This makes the proposed "ATM" version of tc both forward and backward compatible. One other point arises here. The fields in "tc_ratespec" that "tc" fills and the kernel ignores are there so "tc show" will work. The essence of the problem is "tc" compiles the stuff you give it into a single "RTAB". That "RTAB" can't be reverse compiled into the original numbers the user provided. So if "tc show" is to work, "tc" has to save that information somewhere. I don't think the "tc_ratespec" was the best choice for two reasons. Firstly, having the fields show up in tc_ratespec makes it seem like the kernel can use them. It can't, as the "overhead" example above shows. Secondly, from tc's point of view it is inflexible. Over time new features have been be added to "tc", and each time a new way of encoding it in the existing "tc_ratespec" has to be invented. Thus we now have hacks like the storing the "overhead" in the upper bits of the MPU figure. A better solution would be to provide a TLV (ie a TCA_XXX constant) for TC's private use. From the kernels point of view it would be an opaque structure which just saves and echos back when asked. This would solve both problems. > > However, now you lot have made me go away and think, I have > > another idea on how to attack this. Perhaps it will be > > more palatable to you. It would replace RTAB and STAB with > > a 28 byte structure for most protocol stacks - well all I can > > think of off the top of my head, anyway. RTAB would have to > > remain for backwards compatibility, of course. > > Can you describe in more detail? OK, but first I want to make the point that the only reason I suggest this is to get some sort of ATM patch into the kernel, as the current patch on the table is having a rough time. Alan Cox made the point earlier (if I understood him correctly) that this tabling lookup probably isn't a big win on modern CPU's - we may be better off moving it all into the kernel. Thinking about this, I tried to come up with a way of describing the mapping between skb->len and the on the wire packet length for every protocol I know. This is what I came up with. Assume we have a packet length L, which is to be transported by some protocol. For now we consider one protocol only, ie: TCP, PPP, ATM, Ethernet or whatever. I will generalise it to multiple protocols later. I think a generalised transformation can be made using using 5 numbers which are applied in this order: Overhead - A fixed overhead that is added to L. Mpu - Minimum packet size. If the result of (Overhead+L) is smaller that this, then the new result becomes this size. Round - The result is then rounded up to this many bytes. For protocols that always transmit single bytes this figure would be 1. If there were some protocol that transmitted data as 4 byte chunks then this would be 4. For ATM it is 48. CellPay - If the packet is broken down into smaller packets when sent, then this is the amount of data that will fit into each chunk. CallOver - This is the additional overhead each cell carries. The idea is the kernel would do this calculation on the fly for each packet. If you represent this set of number numbers as a comma separated list in the order they were presented above, then here are some examples: IP: 20 Ethernet: 18,64 PPP: 2 ATM: 0,0,48,48,5 It may be that 5 numbers are a overkill. It is for all protocols I am aware of - for those you could get away with 4. But I am no expert. The next step is to generalise for many protocols. As the protocols are stacked the length output by one protocol becoming the input length for the downstream one. So we just need to apply the same transformation serially. I will use '+' to indicate the stacking. For a typical ATM stack, PPPoE over LLC, we have: ppp:2+pppoe:6+ethernet:14,64+llc:8+all5:4+atm:0,0,48,48,5 If this were implemented naively, then the kernel would have to apply the above calculation 6 times, like this: Protocol InputLength OutputLength --------- ------------ ---------------- ppp skb->len skb->len+2 pppoe: skb->len+2 skb->len+2+6 ethernet: skb->len+2+6 skb->len+2+6+14 ... and so on. But it can be optimised. In this particular case we can combine those six operations into 1: adsl_pppoe_llc:34,64,48,48,5 The five numbers have the same meaning as before. It it not difficult to come up with a generalised rule that allows you to do this for most cases. For the remainder (if they exist - I can't think of any) the kernel would have to apply the transformation iteratively. Before going on, it is worth while comparing this to the current RTAB solution (and by implication STAB): 1. Oddly, the number of steps and hence speed for common protocols is probably the same. Compare: RTAB - You have to add an OverHead in the general case. - You have to scale by cell_log. - You have to ensure the overhead+skb->len doesn't overflow / underflow the RTAB. - You have to do the lookup. New - You have to add overhead. - You have to check the MPU. - You have to check if you have to apply Round,CellPay,CellOver - but you won't have to for any protocol except ATM. 2. Because of the cell_log, RTAB gives an 100% accurate answer 1 time in every (1<<cell_log) packet lengths. The new version is always 100% accurate. 3. The new version isn't as flexible as RTAB, as least from the kernels point of view. Conceivably there are protocols that could be handled by RTAB that are not handled by the new one. Since RTAB is computed in user space, this implies these new protocols might be handled by a user space change only. The new version would always require a kernel change. Note however that even RTAB required a kernel change for ATM, however. So far we have what your STAB would provide us with - a way to calculate the packet length. It takes 5 int's for every protocol stack I can think of. It probably runs faster for most protocols but is less robust to the introduction of new protocols. But some qdisc's need to know how long it takes to send a packet. This is what RTAB provides us with, in fact. So if we were to do away with RTAB completely, then we need a way for the kernel to covert packet lengths into the time it takes to send a packet. This is what I discuss next. The comments apply to both STAB and the new algorithm above, as they both compute packet lengths. If J is the time it takes to send one byte over a link, then we can compute RTAB from STAB like this: for (int i = 0; i < array_length(STAB); i += 1) RTAB[i] = STAB[i] * J; This is exactly the operation "tc" performs now in user space. It is possibly in user space because J is usually less than 1, and thus is most conveniently represented as a floating point number. Floating point operations in the kernel are verboten. I can think of two ways to move this operation into the kernel. The straight forward way is to represent J as a scale and a division. Ie the kernel does: RTAB[i] = (STAB[i] << scale) / divisor. The second way depends on that fact that most CPU's can multiply two 32 uint's to produce a 64 bit result in a single operation. I don't know whether this operation is available to kernel code. But if it is, J can be represented as a 32 bit fixed point number, with the implied decimal point after the most significant 8 bits. Then this operation would suffice: extern long long mul64(unsigned long, a, unsigned long b); RTAB[i] = (unsigned long)(mul64(STAB[i], J) >> 24); This method doesn't use division, and is probably faster on lower end CPU's. It would handle 100G Ethernet on a machine with Hz == 1000, and 1200 bits/sec on a machine with Hz == 10000. ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-23 12:37 ` Russell Stuart 2006-06-23 15:21 ` Patrick McHardy @ 2006-06-24 14:13 ` jamal 2006-06-26 4:23 ` Russell Stuart 2006-07-18 2:06 ` Russell Stuart 1 sibling, 2 replies; 56+ messages in thread From: jamal @ 2006-06-24 14:13 UTC (permalink / raw) To: Russell Stuart Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Alan Cox, Patrick McHardy On Fri, 2006-23-06 at 22:37 +1000, Russell Stuart wrote: > Sorry for the digest like reply :( ;-> I know this discussion has gone in a few directions already and i am sure as you kept responding to the emails it became obvious. Let me just jump to one point so we dont repeat a lot of the same things over and over.. You can actually stop reading here if you have gathered the view at this point that i am not objecting to the simple approach Patrick is going with... > On Tue, 2006-06-20 at 10:06 -0400, jamal wrote: > But you can measure the bandwidth you are > getting from your ISP and plug that into the tc command > line. The web page I sent to you describes how to do this > for ADSL lines. > Indeed and i referred to it in the exchanges. And yes, I was arguing that the tc scheme you describe would not be so bad either if the cost of making a generic change is expensive. To reiterate the scheduler "owns the resources" of the link i.e link level bandwidth - aka throughput; however, teaching the scheduler as in your description in that page about effective throughput aka "goodput" is also within reason if you knew _it was consistent_ . The diagram i drew was: |Linux| --ethernet-- |Modem| --DSL-- |DSLAM| --ATM-- |BRAS| As can be seen from above, Linux has direct ownership of the ethernet link resources (within limits of fixed bandwidth no AN etc). Just like a disk quota scheduler owns the disk resources or CPU scheduler owns the CPU cycles etc. Likewise, the DSLAM downstream owns scheduling of the ATM link resources. if you have knowledge that ATM runs between DSLAM and BRAS and that the effective throughput is in-fact lower than the advertised one at that point in the end2end path, then i admit it is fine to do as you described in your web-page. Given the ubiquity of ADSL it is also ok to have simple tweak knobs for the scheduler in Linux to allow for it to be able to compensate for the ATM fragmentation 3 hops down. There are a lot of link layer issues that you may end up knowing of (other than the ATM fragmentation overhead) in regards to something downstream and you keep adding knobs is just adding more bloat. Example: If that 3rd hop was wireless that happened to be doing CDMA RLP with a lot of retransmits, or wireless that varied its throughput from 1-3Mbps at any point in time or it was a satellite link that had a lot of latency etc etc. You could always have some way to tweak these via the kernel. In-fact people have written schedulers specifically for these sorts of link layer problems (I think even some of the IEEE 802.11 or wimax folks have standardized specific schedulers). You basically have to draw a line somewhere. My line was "can it be done via user space? yes - do it there". Patrick seems to have a simple way to compensate generically for link layer fragmentation, so i will not argue the practically; hopefully that settles it? ;-> cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-24 14:13 ` jamal @ 2006-06-26 4:23 ` Russell Stuart 2006-07-18 2:06 ` Russell Stuart 1 sibling, 0 replies; 56+ messages in thread From: Russell Stuart @ 2006-06-26 4:23 UTC (permalink / raw) To: hadi Cc: Russell Stuart, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Alan Cox, Patrick McHardy On 25/06/2006 12:13 AM, jamal wrote: > You can actually stop reading here if you have gathered the view at > this point that i am not objecting to the simple approach Patrick is > going with... Perhaps this is my problem. I am not sure I understand what Patrick is proposing. I can wait for his patch, I guess. > Indeed and i referred to it in the exchanges. > And yes, I was arguing that the tc scheme you describe would not be so > bad either if the cost of making a generic change is expensive. OK. I take it from this you think there is merit in the idea of adding code so the kernel can calculate the ATM link speeds correctly. The discussion is really about the best way to go about it? If so, excellent. I am not really too fussy about how it is achieved, I just want my VOIP connections to work well on stock kernels. > There are a lot of link layer issues that you may end up knowing of > (other than the ATM fragmentation overhead) in regards to something > downstream and you keep adding knobs is just adding more bloat. > Example: If that 3rd hop was wireless that happened to be doing CDMA RLP > with a lot of retransmits, or wireless that varied its throughput from > 1-3Mbps at any point in time or it was a satellite link that had a lot > of latency etc etc. You could always have some way to tweak these via > the kernel. In-fact people have written schedulers specifically for > these sorts of link layer problems (I think even some of the IEEE 802.11 > or wimax folks have standardized specific schedulers). You basically > have to draw a line somewhere. My line was "can it be done via user > space? yes - do it there". If you mean by adding lots of knobs, you mean we need a knob for 802.11, a knob for ATM, a knob for ethernet and so on, then we do need lots of knobs. And you need to know which of those layers is the bottle neck, so you know what knob to fit. But you only ever need one knob on a given link. I can only think of two ways out of needing lots of knobs. One is to have a qdisc that doesn't need to know the link speed in order to shape traffic to it gets to the scheduling and not someone upstream. Sounds like black magic to me, but perhaps HFSC does this - I have not read the papers yet, but I plan to do so soon. The second way is to automatically calculate the link speed, using a daemon perhaps :). Again it sounds like black magic. Note that there is already code in the kernel that does this, but it lives in the layers above - in TCP and DCCP. I am referring to Westwood, and friends. These algorithms live in the layers above because the need feed back from the network - which can only come from the other end of connection unless ECN is working. I have not been able to figure out how Patrick intends to solve these problems from his posts so far, so I am waiting for his code. Hopefully it will include a lot of comments. > Patrick seems to have a simple way to compensate generically for link > layer fragmentation, so i will not argue the practically; hopefully that > settles it? ;-> Yes, it does. It will be interesting to see what Patrick comes up with. ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-24 14:13 ` jamal 2006-06-26 4:23 ` Russell Stuart @ 2006-07-18 2:06 ` Russell Stuart 2006-07-18 13:35 ` jamal 2006-07-18 21:46 ` Andy Furniss 1 sibling, 2 replies; 56+ messages in thread From: Russell Stuart @ 2006-07-18 2:06 UTC (permalink / raw) To: hadi; +Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy On Sat, 2006-06-24 at 10:13 -0400, jamal wrote: > And yes, I was arguing that the tc scheme you describe would not be so > bad either if the cost of making a generic change is expensive. <snip> > Patrick seems to have a simple way to compensate generically for link > layer fragmentation, so i will not argue the practically; hopefully that > settles it? ;-> Things seem to have died down. Patrick's patch seemed unrelated to ATM to me. I did put up another suggestion, but I don't think anybody was too impressed with the idea. So that leave the current ATM patch as the only one we have on the table that addresses the ATM issue. Since you don't think it is "too bad", can we proceed with it? ^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-18 2:06 ` Russell Stuart @ 2006-07-18 13:35 ` jamal 2006-07-18 21:46 ` Andy Furniss 1 sibling, 0 replies; 56+ messages in thread From: jamal @ 2006-07-18 13:35 UTC (permalink / raw) To: Russell Stuart Cc: Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy Russell, I apologize i havent followed the latest discussion to the detail. My understanding of Patricks work was it solved the ATM problem as well. I think he is busy somewhere - lets give him an opportunity to respond and i will try to catchup with the thread as well. cheers, jamal On Tue, 2006-18-07 at 12:06 +1000, Russell Stuart wrote: > On Sat, 2006-06-24 at 10:13 -0400, jamal wrote: > > And yes, I was arguing that the tc scheme you describe would not be so > > bad either if the cost of making a generic change is expensive. > <snip> > > Patrick seems to have a simple way to compensate generically for link > > layer fragmentation, so i will not argue the practically; hopefully that > > settles it? ;-> > > Things seem to have died down. Patrick's patch seemed > unrelated to ATM to me. I did put up another suggestion, > but I don't think anybody was too impressed with the > idea. So that leave the current ATM patch as the only > one we have on the table that addresses the ATM issue. > > Since you don't think it is "too bad", can we proceed > with it? > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-18 2:06 ` Russell Stuart 2006-07-18 13:35 ` jamal @ 2006-07-18 21:46 ` Andy Furniss 2006-07-19 1:02 ` Russell Stuart 1 sibling, 1 reply; 56+ messages in thread From: Andy Furniss @ 2006-07-18 21:46 UTC (permalink / raw) To: Russell Stuart Cc: hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy Russell Stuart wrote: > On Sat, 2006-06-24 at 10:13 -0400, jamal wrote: > >>And yes, I was arguing that the tc scheme you describe would not be so >>bad either if the cost of making a generic change is expensive. > > <snip> > >>Patrick seems to have a simple way to compensate generically for link >>layer fragmentation, so i will not argue the practically; hopefully that >>settles it? ;-> > > > Things seem to have died down. Patrick's patch seemed > unrelated to ATM to me. I did put up another suggestion, > but I don't think anybody was too impressed with the > idea. So that leave the current ATM patch as the only > one we have on the table that addresses the ATM issue. FWIW I think it may be possible to do it Patricks' way, as if I read it properly he will end up with the ATM cell train length which gets shifted by cell_log and looked up as before. The ATM length will be in steps of 53 so with cell_log 3 or 4 I think there will be no collisions - so special rate tables for ATM can still be made perfect. As you say, I think mpu should be added aswell - so eth/other can benefit. Andy. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-18 21:46 ` Andy Furniss @ 2006-07-19 1:02 ` Russell Stuart 2006-07-19 14:42 ` Andy Furniss 2006-07-19 14:50 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy 0 siblings, 2 replies; 56+ messages in thread From: Russell Stuart @ 2006-07-19 1:02 UTC (permalink / raw) To: lists Cc: hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy On Tue, 2006-07-18 at 22:46 +0100, Andy Furniss wrote: > FWIW I think it may be possible to do it Patricks' way, as if I read it > properly he will end up with the ATM cell train length which gets > shifted by cell_log and looked up as before. The ATM length will be in > steps of 53 so with cell_log 3 or 4 I think there will be no collisions > - so special rate tables for ATM can still be made perfect. Patrick is proposing that the packet lengths be sent to the kernel in a similar way to how transmission times (ie RTAB) is sent now. I agree that is how things should be done - but it doesn't have much to do with the ATM patch, other than he has allowed for ATM in the way he does the calculation in the kernel [1]. In particular: - As it stands, it doesn't help the qdiscs that use RTAB. So unless he proposes to remove RTAB entirely the ATM patch as it will still have to go in. - A bit of effort was put into making this current ATM patch both backwards and forwards compatible. Patricks patch would work with newer kernels, obviously. Older kernels, and in particular the kernel that Debian is Etch is likely to distribute would miss out. If Patrick did intend remove RTAB entirely then he needs to add a fair bit more into his patch. Since RTAB is just STAB scaled, its certainly possible. The kernel will have to do a shift and a division for each packet, which I assume is permissible. > As you say, I think mpu should be added aswell - so eth/other can benefit. Not really. The MPU is reflected in the STAB table, just as it is for RTAB. One other point - the optimisation Patrick proposes for STAB (over RTAB) was to make the number of entries variable. This seems like a good idea. However there is no such thing as a free lunch, and if you did indeed reduce the number of entries to 16 for Ethernet (as I think Patrick suggested), then each entry would cover 1500/16 = 93 different packet lengths. Ie, entry 0 would cover packet lengths 0..93, entry 1 94..186, and so on. A single entry can't be right for all those packet lengths, so again we are back to a average 30% error for typical VOIP length packets. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-19 1:02 ` Russell Stuart @ 2006-07-19 14:42 ` Andy Furniss 2006-07-19 14:54 ` Patrick McHardy 2006-07-19 20:26 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer 2006-07-19 14:50 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy 1 sibling, 2 replies; 56+ messages in thread From: Andy Furniss @ 2006-07-19 14:42 UTC (permalink / raw) To: Russell Stuart Cc: hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy Russell Stuart wrote: > On Tue, 2006-07-18 at 22:46 +0100, Andy Furniss wrote: > >>FWIW I think it may be possible to do it Patricks' way, as if I read it >>properly he will end up with the ATM cell train length which gets >>shifted by cell_log and looked up as before. The ATM length will be in >>steps of 53 so with cell_log 3 or 4 I think there will be no collisions >>- so special rate tables for ATM can still be made perfect. > > > Patrick is proposing that the packet lengths be sent to > the kernel in a similar way to how transmission times (ie > RTAB) is sent now. I agree that is how things should be > done - but it doesn't have much to do with the ATM patch, > other than he has allowed for ATM in the way he does the > calculation in the kernel [1]. > > In particular: > > - As it stands, it doesn't help the qdiscs that use > RTAB. So unless he proposes to remove RTAB entirely > the ATM patch as it will still have to go in. Hmm - I was just looking at the kernel changes to htb. The only difference is the len - I am blindly assuming that it does/will return the link lengths properly for atm. So for atm, qdisc_tx_len(skb) will always return lengths that are multiples of 53. If nothing else were done we would suffer innacuarcy from the cell_log just like eth. But no other kernel hack would be needed to do it perfectly - rather like we (who patch for atm already) just fill the tc generated rate table with what we like, that would be an option. > > - A bit of effort was put into making this current > ATM patch both backwards and forwards compatible. > Patricks patch would work with newer kernels, > obviously. Older kernels, and in particular the > kernel that Debian is Etch is likely to distribute > would miss out. > > If Patrick did intend remove RTAB entirely then he > needs to add a fair bit more into his patch. Since > RTAB is just STAB scaled, its certainly possible. > The kernel will have to do a shift and a division > for each packet, which I assume is permissible. I guess that is for others to decide :-) I think Patrick has a point about sfq/htb drr, Like you I guess, I thought that alot of extra per packet calculations would have got an instant NO. > >>As you say, I think mpu should be added aswell - so eth/other can benefit. > > > Not really. The MPU is reflected in the STAB table, > just as it is for RTAB. OK, I was thinking of what Jamal said about helping others, so everything TC should be capable of accepting mpu and overhead with these patches - or is more work needed? It will be good to be able to say tc ... police rate 500kbit mpu 60 overhead 24 ... for eth. (Assuming eth mpu/overhead are really 46/38 - p in mpu is payload AIUI so 60 and 24 come from allowing for skb->len being IP+14) or for ATM + pppoa something like tc ... police rate 500kbit overhead 10 atm ... In the case of eth someone already added mpu/overhead for HTB and it doesn't need any extra per packet calcs. I guess this way it would. > > One other point - the optimisation Patrick proposes > for STAB (over RTAB) was to make the number of entries > variable. This seems like a good idea. However there > is no such thing as a free lunch, and if you did > indeed reduce the number of entries to 16 for Ethernet > (as I think Patrick suggested), then each entry would > cover 1500/16 = 93 different packet lengths. Ie, > entry 0 would cover packet lengths 0..93, entry 1 > 94..186, and so on. A single entry can't be right > for all those packet lengths, so again we are back > to a average 30% error for typical VOIP length > packets. I agree less accuracy will not be nice. But as an option it could be the only way you can do 1/10Gig + jumbo frames. Andy. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-19 14:42 ` Andy Furniss @ 2006-07-19 14:54 ` Patrick McHardy 2006-07-19 20:26 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer 1 sibling, 0 replies; 56+ messages in thread From: Patrick McHardy @ 2006-07-19 14:54 UTC (permalink / raw) To: lists Cc: Russell Stuart, hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger Andy Furniss wrote: > Russell Stuart wrote: > >> - As it stands, it doesn't help the qdiscs that use RTAB. So unless >> he proposes to remove RTAB entirely the ATM patch as it will still >> have to go in. > > > Hmm - I was just looking at the kernel changes to htb. The only > difference is the len - I am blindly assuming that it does/will return > the link lengths properly for atm. > > So for atm, qdisc_tx_len(skb) will always return lengths that are > multiples of 53. > > If nothing else were done we would suffer innacuarcy from the cell_log > just like eth. > > But no other kernel hack would be needed to do it perfectly - rather > like we (who patch for atm already) just fill the tc generated rate > table with what we like, that would be an option. That is how it should work. If the calculation doesn't fit, lets fix it. >> The kernel will have to do a shift and a division >> for each packet, which I assume is permissible. > > > I guess that is for others to decide :-) I think Patrick has a point > about sfq/htb drr, Like you I guess, I thought that alot of extra per > packet calculations would have got an instant NO. Its only done once per packet (currently, it might be interesting to override the length for specific classes and their childs, for example if you do queueing on eth0 and have an DSL router one hop apart). The division is gone in my patch btw. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) 2006-07-19 14:42 ` Andy Furniss 2006-07-19 14:54 ` Patrick McHardy @ 2006-07-19 20:26 ` Jesper Dangaard Brouer 2006-07-19 21:00 ` Alexey Kuznetsov 1 sibling, 1 reply; 56+ messages in thread From: Jesper Dangaard Brouer @ 2006-07-19 20:26 UTC (permalink / raw) To: Andy Furniss Cc: Russell Stuart, hadi, netdev, Stephen Hemminger, Patrick McHardy, Alexey Kuznetsov Russell Stuart wrote: > > - As it stands, it doesn't help the qdiscs that use > RTAB. So unless he proposes to remove RTAB entirely the ATM patch as > it will still have to go in. Here is a very important point here: The RTAB (rate-table) in the kernel is NOT aligned, this is the ONLY reason why we need to patch the kernel. This why the kernel RTAB patch is still needed even with Patrick's excelent STAB patch. (Maybe the RTAB system should be removed entirely?) I have discussed this RTAB issue with Patrick (off list) and he denoted this RTAB issue as a regular BUG. This is the problem with the RTAB: ---------------------------------- The "hash" function for packet size lookups is a simple binary shift operation. rtab[pkt_len>>cell_log] = pkt_xmit_time; With a cell_log of 3 (the default), this implies that: pkt_len 0 to 7 goes into entry 0, and pkt_len 8 to 15 goes into entry 1, and pkt_len 16 to 23 goes into entry 2, and pkt_len 24 to 31 goes into entry 3, and so on... Current mapping: entry[0](maps: 0- 7)=xmit_size:0 entry[1](maps: 8-15)=xmit_size:8 entry[2](maps:16-23)=xmit_size:16 entry[3](maps:24-31)=xmit_size:24 entry[4](maps:32-39)=xmit_size:32 entry[5](maps:40-47)=xmit_size:40 entry[6](maps:48-55)=xmit_size:48 When the table is constructed (in tc_core.c) the pkt_xmit_time is calculated from the lower boundary. Meaning that transmitting a 7 byte packet "costs" 0 bytes to transmit. The zero transmit cost is properly not a real-world problem as the IP header bounds the minimum packet size. for (i=0; i<256; i++) { unsigned sz = (i<<cell_log); rtab[i] = tc_core_usec2tick(1000000*((double)sz/bps)); } Basically transmitting a packet size and accounting for a smaller packet size in the QoS system pose a problem. We can loose queue control by transmitting faster than the QoS was configured for. (maybe a (DoS/queue) attack to loose QoS control with small special sized packets?) One possible FIX to the RTAB "BUG": ----------------------------------- One solution would be to change the table to: entry[0](maps:0-7)=xmit_size:7 entry[1](maps:8-15)=xmit_size:15 entry[2](maps:16-23)=xmit_size:23 entry[3](maps:24-31)=xmit_size:31 entry[4](maps:32-39)=xmit_size:39 entry[5](maps:40-47)=xmit_size:47 entry[6](maps:48-55)=xmit_size:55 But most packets align to a even boundary, thus in the general case we would over estimate. Instead it would be preferable to change the table (mapping) to: rtab[(pkt_len-1)>>cell_log] and adjusting the xmit_size accordingly. Giving the table: entry[0](maps:1-8)=xmit_size:8 entry[1](maps:9-16)=xmit_size:16 entry[2](maps:15-24)=xmit_size:24 entry[3](maps:24-32)=xmit_size:32 entry[4](maps:33-40)=xmit_size:40 entry[5](maps:41-48)=xmit_size:48 entry[6](maps:47-56)=xmit_size:56 The xmit_size is done like this: for (i=0; i<256; i++) { unsigned sz = ((i+1)<<cell_log); rtab[i] = tc_core_usec2tick(1000000*((double)sz/bps)); } All table lookups should be changed to: rtab[(pkt_len-1)>>cell_log] Another fix for the RTAB: ------------------------- Remove the RTAB array lookups and calculate the pkt_xmit_time instead. Guess that Alexey wrote these RTAB lookup in a time where array lookups was faster... now we have that memory lookups are the bottleneck. What about removing the RTAB system entirely? Cheers, Jesper Brouer -- ------------------------------------------------------------------- MSc. Master of Computer Science Dept. of Computer Science, University of Copenhagen Author of http://www.adsl-optimizer.dk ------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) 2006-07-19 20:26 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer @ 2006-07-19 21:00 ` Alexey Kuznetsov 2006-07-20 5:47 ` Russell Stuart 0 siblings, 1 reply; 56+ messages in thread From: Alexey Kuznetsov @ 2006-07-19 21:00 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Andy Furniss, Russell Stuart, hadi, netdev, Stephen Hemminger, Patrick McHardy Hello! > Guess that Alexey wrote these RTAB lookup in a time where array lookups > was faster... now we have that memory lookups are the bottleneck. No, they were slower from the very beginning. If I remember correctly, there is comment about this somewhere. I just did not find any simple way to do 32 bit fixed point arithmetics scaling from bps to Gbps and was lazy to investigate this further, tables are much simpler and more flexible. > What about removing the RTAB system entirely? Well, if fixed point arithmetics is not a problem. Plus, remember, the function is not R*size, it is at least R*size+addend, to account for link overhead. Plus account for padding of small packets. Plus, when policing it should deaccount already added link headers, QoS counts only network payload. Alexey ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) 2006-07-19 21:00 ` Alexey Kuznetsov @ 2006-07-20 5:47 ` Russell Stuart 2006-07-20 23:49 ` Alexey Kuznetsov 0 siblings, 1 reply; 56+ messages in thread From: Russell Stuart @ 2006-07-20 5:47 UTC (permalink / raw) To: Alexey Kuznetsov Cc: Jesper Dangaard Brouer, Andy Furniss, hadi, netdev, Stephen Hemminger, Patrick McHardy On Thu, 2006-07-20 at 01:00 +0400, Alexey Kuznetsov wrote: > Hello! So you really do exist? I thought it was just rumour. > Well, if fixed point arithmetics is not a problem. It shouldn't be. Any decimal number can be expressed as a fraction, eg: 0.00123 = 123/100000 Which can be calculated as a multiply and a divide. With MTU's up to 2048, it should be possible to do this with 99.9999% accuracy (ie 2048/2^23). With a bit more work in userspace (ie in tc), it can be be reduced to a multiply and a shift. > Plus, remember, the function is not R*size, it is at least > R*size+addend, to account for link overhead. Plus account for padding > of small packets. Plus, when policing it should deaccount already added > link headers, QoS counts only network payload. Yes, it is flexible - and has served us well up until now. It doesn't work well for ATM, but with a small bit of extra calculation in the kernel it could. However, it turns out that ATM is a special case. If ATM's cell payload was 58 bytes instead of 48 bytes (say), then it would not be possible to produce a RTAB that had small errors (eg < 10%) for smallish packet sizes (< 290 bytes). I seem to have trouble explaining why in a concise way that people understand, so I won't try here. So when Alan Cox said our ATM patch didn't solve the packetisation problem in general, he was right as our patch just built upon RTAB. Patrick's STAB proposal in general either for that matter, as it is just another implementation of RTAB with the same limitations. The only way I can think of to solve it in general is to move many more calculations into the kernel - as I proposed in a long winded answer to Patrick earlier in this thread. But doing so would get rid of the table implementation and the flexibility it has given us to date. For that reason I feel uncomfortable with it. The engineering decision becomes this - are there any other protocols like ATM out there that could justify such a change? (In my more cynical moments I think of it differently - has/is the world going to make a second engineering fuck up on the scale of ATM again? How on earth did anyone decide that pushing data packets over ATM, as happens in ADSL, was a good idea?) I know of no other such protocols. But then I don't have an encyclopedic knowledge of comms protocols, so that doesn't mean much. I suspect you know a good deal more about them than I do. What say you? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) 2006-07-20 5:47 ` Russell Stuart @ 2006-07-20 23:49 ` Alexey Kuznetsov 0 siblings, 0 replies; 56+ messages in thread From: Alexey Kuznetsov @ 2006-07-20 23:49 UTC (permalink / raw) To: Russell Stuart Cc: Jesper Dangaard Brouer, Andy Furniss, hadi, netdev, Stephen Hemminger, Patrick McHardy Hello! > It shouldn't be. Any decimal number can be expressed > as a fraction, eg: I remember this. :-) I stalled selecting corrects divisors to fight over/underflows. Not becuase it was difficult, just because did not see a reason to do this. > But doing so would get rid of the table implementation > and the flexibility it has given us to date. For that > reason I feel uncomfortable with it. > > The engineering decision becomes this - are there any > other protocols like ATM out there that could justify > such a change? Is it faster? You say, yes. Is it required? You say, yes. Is there some protocols, which needs more flexibility? No. > know a good deal more about them than I do. What say > you? Frankly, I seriously believed that rtabs is a good way to handle ATM. :-) I seriously believed that you have to do something like: ((packet_size+cell_payload-1)/cell_payload)*cell_size So, if in reality even this protocol does not justify keeping ratbs, kill them. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-19 1:02 ` Russell Stuart 2006-07-19 14:42 ` Andy Furniss @ 2006-07-19 14:50 ` Patrick McHardy 2006-07-20 4:56 ` Russell Stuart 1 sibling, 1 reply; 56+ messages in thread From: Patrick McHardy @ 2006-07-19 14:50 UTC (permalink / raw) To: Russell Stuart Cc: lists, hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger Russell Stuart wrote: > On Tue, 2006-07-18 at 22:46 +0100, Andy Furniss wrote: > >>FWIW I think it may be possible to do it Patricks' way, as if I read it >>properly he will end up with the ATM cell train length which gets >>shifted by cell_log and looked up as before. The ATM length will be in >>steps of 53 so with cell_log 3 or 4 I think there will be no collisions >>- so special rate tables for ATM can still be made perfect. Please excuse my silence, I was travelling and am still catching up with my mails. > Patrick is proposing that the packet lengths be sent to > the kernel in a similar way to how transmission times (ie > RTAB) is sent now. I agree that is how things should be > done - but it doesn't have much to do with the ATM patch, > other than he has allowed for ATM in the way he does the > calculation in the kernel [1]. > > In particular: > > - As it stands, it doesn't help the qdiscs that use > RTAB. So unless he proposes to remove RTAB entirely > the ATM patch as it will still have to go in. Why? The length calculated by my STABs (or something similar) is used by _all_ qdiscs. Not only for transmission time calculation, but also for statistics and estimators. If the length calculation doesn't fit for ATM, that can be fixed. > - A bit of effort was put into making this current > ATM patch both backwards and forwards compatible. > Patricks patch would work with newer kernels, > obviously. Older kernels, and in particular the > kernel that Debian is Etch is likely to distribute > would miss out. True, but it provides more consistency, and making current kernels behave better is more important than old kernels. > If Patrick did intend remove RTAB entirely then he > needs to add a fair bit more into his patch. Since > RTAB is just STAB scaled, its certainly possible. > The kernel will have to do a shift and a division > for each packet, which I assume is permissible. You seem to have misunderstood my patch. It doesn't need to touch RTABs, it just calculates the packet length as seen on the wire (whereever it is) and uses that thoughout the entire qdisc layer. >>As you say, I think mpu should be added aswell - so eth/other can benefit. > > > Not really. The MPU is reflected in the STAB table, > just as it is for RTAB. > > One other point - the optimisation Patrick proposes > for STAB (over RTAB) was to make the number of entries > variable. This seems like a good idea. However there > is no such thing as a free lunch, and if you did > indeed reduce the number of entries to 16 for Ethernet > (as I think Patrick suggested), then each entry would > cover 1500/16 = 93 different packet lengths. Ie, > entry 0 would cover packet lengths 0..93, entry 1 > 94..186, and so on. A single entry can't be right > for all those packet lengths, so again we are back > to a average 30% error for typical VOIP length > packets. My patch doesn't uses fixed sized cells, so it can deal with anything, worst case is you use one cell per packet size. Optimizing size and lookup speed for ethernet makes a lot more sense than optimizing for ADSL. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-19 14:50 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy @ 2006-07-20 4:56 ` Russell Stuart 2006-07-30 23:06 ` Russell Stuart 0 siblings, 1 reply; 56+ messages in thread From: Russell Stuart @ 2006-07-20 4:56 UTC (permalink / raw) To: Patrick McHardy Cc: lists, hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger On Wed, 2006-07-19 at 16:50 +0200, Patrick McHardy wrote: > Please excuse my silence, I was travelling and am still catching up > with my mails. Sorry. Had I realised you were busy I would of waited. > > - As it stands, it doesn't help the qdiscs that use > > RTAB. So unless he proposes to remove RTAB entirely > > the ATM patch as it will still have to go in. > > Why? The length calculated by my STABs (or something similar) > is used by _all_ qdiscs. Not only for transmission time calculation, > but also for statistics and estimators. Oh. I didn't see where it is used for the time calculation in your patch. Did I miss something, or is that the unfinished bit? This is possibly my stumbling block. If you don't remove RTAB the ATM patch as stands will be needed. Your patch didn't remove RTAB, and you didn't say it was intended to, so I presume it wasn't going to. > If the length calculation > doesn't fit for ATM, that can be fixed. Yes of course. Just to be clear: as far as I am concerned this never was an issue. > > - A bit of effort was put into making this current > > ATM patch both backwards and forwards compatible. > > Patricks patch would work with newer kernels, > > obviously. Older kernels, and in particular the > > kernel that Debian is Etch is likely to distribute > > would miss out. > > True, but it provides more consistency, and making current > kernels behave better is more important than old kernels. I guess provided the new "tc" works with older kernels this is OK - although a disappoint to me. Works here being defined as "works as well as a previous the version of tc does". For me not working would be OK as well provided "tc" issued a warning message to the effect that it "needs kernel version XXX or above"", but doing that would probably require it to look at the kernel version. Looking at the kernel version in tc seems to be frowned upon. > You seem to have misunderstood my patch. It doesn't need to > touch RTABs, it just calculates the packet length as seen > on the wire (whereever it is) and uses that thoughout the > entire qdisc layer. No, you have it in reverse - as I said above. My problem is that your patch does not touch RTAB. Several qdiscs really don't care about the length of a packet (other than for keeping track of stats) - they just care about how long it takes to send. Off the top of my these are HTB, CBQ and TBF. They use RTAB to make this calculation. So unless you replace RTAB with STAB the current ATM patch will still be needed. > > One other point - the optimisation Patrick proposes > > for STAB (over RTAB) was to make the number of entries > > variable. This seems like a good idea. However there > > is no such thing as a free lunch, and if you did > > indeed reduce the number of entries to 16 for Ethernet > > (as I think Patrick suggested), then each entry would > > cover 1500/16 = 93 different packet lengths. Ie, > > entry 0 would cover packet lengths 0..93, entry 1 > > 94..186, and so on. A single entry can't be right > > for all those packet lengths, so again we are back > > to a average 30% error for typical VOIP length > > packets. > > My patch doesn't uses fixed sized cells, so it can deal > with anything, worst case is you use one cell per packet > size. Optimizing size and lookup speed for ethernet makes > a lot more sense than optimizing for ADSL. I was just responding to a point you made earlier, when you said STAB could only use 16 entries as opposed to the 256 used by RTAB. I suspect nobody would actually do that because of the inaccuracy it creates, so the comparison is perhaps unfair. I agree the flexibility of making STAB variable length is a good idea, and comes at 0 cost in the kernel. Andy Furniss wrote: > > Russell Stuart wrote: > >> The kernel will have to do a shift and a division > >> for each packet, which I assume is permissible. > > > > > > I guess that is for others to decide :-) I think Patrick has a point > > about sfq/htb drr, Like you I guess, I thought that alot of extra per > > packet calculations would have got an instant NO. > > Its only done once per packet (currently, it might be interesting to > override the length for specific classes and their childs, for example > if you do queueing on eth0 and have an DSL router one hop apart). > The division is gone in my patch btw. Unlike the packet length the time calculation can't be cached in the skb. Most classes in HTB/CBQ use different packet transmission rates. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-20 4:56 ` Russell Stuart @ 2006-07-30 23:06 ` Russell Stuart 2006-08-08 22:01 ` Russell Stuart 0 siblings, 1 reply; 56+ messages in thread From: Russell Stuart @ 2006-07-30 23:06 UTC (permalink / raw) To: hadi Cc: lists, hadi, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy On Thu, 2006-07-20 at 14:56 +1000, Russell Stuart wrote: > On Wed, 2006-07-19 at 16:50 +0200, Patrick McHardy wrote: > > Please excuse my silence, I was travelling and am still catching up > > with my mails. > > Sorry. Had I realised you were busy I would of > waited. > > > > - As it stands, it doesn't help the qdiscs that use > > > RTAB. So unless he proposes to remove RTAB entirely > > > the ATM patch as it will still have to go in. > > > > Why? The length calculated by my STABs (or something similar) > > is used by _all_ qdiscs. Not only for transmission time calculation, > > but also for statistics and estimators. > > Oh. I didn't see where it is used for the time > calculation in your patch. Did I miss something, > or is that the unfinished bit? > > This is possibly my stumbling block. If you don't remove > RTAB the ATM patch as stands will be needed. Your patch > didn't remove RTAB, and you didn't say it was intended to, > so I presume it wasn't going to. It has gone quiet again. In my mind the one unresolved issue is whether Patrick intended to remove RTAB with his patch. If not, the ATM patch as it stands will have to go in. Patrick - it would be nice to hear from you. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-07-30 23:06 ` Russell Stuart @ 2006-08-08 22:01 ` Russell Stuart 2006-08-09 11:33 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Russell Stuart @ 2006-08-08 22:01 UTC (permalink / raw) To: hadi Cc: lists, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy On Mon, 2006-07-31 at 09:06 +1000, Russell Stuart wrote: > It has gone quiet again. In my mind the one unresolved issue > is whether Patrick intended to remove RTAB with his patch. > If not, the ATM patch as it stands will have to go in. > > Patrick - it would be nice to hear from you. It appears Patrick isn't going to reply. The situation as I see it is: - Patrick's last post said his STAB patch wasn't going to touch RTAB, so regardless of whether Patrick's STAB patch proceeds or not the ATM patch will be needed to make the current kernel work with ATM. - If I read Jesper's last post correctly, he said Patrick acknowledged to him in a private email that RTAB had to change to accommodate ATM, and in fact called it a bug in RTAB. Fixing this "bug" is what the kernel part of the current ATM patch does - and it does it in a backward compatible way. - Jamal, you said that this ATM patch should go in if Patrick didn't come up with a better solution. Jamal - unless there are other outstanding issues I have missed or someone has had second thoughts this means you should be OK with the patch going in. Can we get it into Dave M's tree now? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-08-08 22:01 ` Russell Stuart @ 2006-08-09 11:33 ` jamal 2006-09-04 10:37 ` Russell Stuart 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-08-09 11:33 UTC (permalink / raw) To: Russell Stuart Cc: lists, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy On Wed, 2006-09-08 at 08:01 +1000, Russell Stuart wrote: > On Mon, 2006-07-31 at 09:06 +1000, Russell Stuart wrote: > > It has gone quiet again. In my mind the one unresolved issue > > is whether Patrick intended to remove RTAB with his patch. > > If not, the ATM patch as it stands will have to go in. > > > > Patrick - it would be nice to hear from you. > > It appears Patrick isn't going to reply. The situation as > I see it is: > > - Patrick's last post said his STAB patch wasn't going to > touch RTAB, so regardless of whether Patrick's STAB > patch proceeds or not the ATM patch will be needed to > make the current kernel work with ATM. > > - If I read Jesper's last post correctly, he said Patrick > acknowledged to him in a private email that RTAB had to > change to accommodate ATM, and in fact called it a bug > in RTAB. Fixing this "bug" is what the kernel part of > the current ATM patch does - and it does it in a > backward compatible way. > > - Jamal, you said that this ATM patch should go in if > Patrick didn't come up with a better solution. > > Jamal - unless there are other outstanding issues I have > missed or someone has had second thoughts this means you > should be OK with the patch going in. Can we get it into > Dave M's tree now? Hi Russell, My apologies; at some point i lost track of the discussion because things seemed under control. Lets hear from Patrick - I have a lot of faith in him; i just happen to know he is occupied elsewhere as we speak (which may explain his silence). So please be a little patient with him. In the meantime please dont be discouraged in continuing to contribute - sometimes this is how things go. You are right that if there is no generic solution from Patrick we should not hold your patch. My understanding of Patrick's patch is it covers for your case; i may have misread that given you and him are still discussing the differences. Perhaps one approach is to push your patch in first and then when Patrick is ready he can operate on that. Patrick, thoughts? cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-08-09 11:33 ` jamal @ 2006-09-04 10:37 ` Russell Stuart 0 siblings, 0 replies; 56+ messages in thread From: Russell Stuart @ 2006-09-04 10:37 UTC (permalink / raw) To: hadi Cc: lists, Jesper Dangaard Brouer, netdev, Stephen Hemminger, Patrick McHardy On Wed, 2006-08-09 at 07:33 -0400, jamal wrote: > > Jamal - unless there are other outstanding issues I have > > missed or someone has had second thoughts this means you > > should be OK with the patch going in. Can we get it into > > Dave M's tree now? > > Hi Russell, > > My apologies; at some point i lost track of the discussion because > things seemed under control. > Lets hear from Patrick - I have a lot of faith in him; i just happen to > know he is occupied elsewhere as we speak (which may explain his > silence). So please be a little patient with him. In the meantime please > dont be discouraged in continuing to contribute - sometimes this is how > things go. > > You are right that if there is no generic solution from Patrick we > should not hold your patch. My understanding of Patrick's patch is it > covers for your case; i may have misread that given you and him are > still discussing the differences. > > Perhaps one approach is to push your patch in first and then when > Patrick is ready he can operate on that. Patrick, thoughts? Jamal, I think its time to assume Patrick has lost interest. Is it possible to get this in now? -- VGER BF report: U 0.542723 ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer 2006-06-14 12:06 ` jamal @ 2006-06-14 14:27 ` Phillip Susi 2006-06-14 15:08 ` Jesper Dangaard Brouer 2006-06-20 5:35 ` Chris Wedgwood 2 siblings, 1 reply; 56+ messages in thread From: Phillip Susi @ 2006-06-14 14:27 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Stephen Hemminger, Jamal Hadi Salim, netdev, lartc, russell-tcatm, hawk, linux-kernel Jesper Dangaard Brouer wrote: > The Linux traffic's control engine inaccurately calculates > transmission times for packets sent over ADSL links. For > some packet sizes the error rises to over 50%. This occurs > because ADSL uses ATM as its link layer transport, and ATM > transmits packets in fixed sized 53 byte cells. > I could have sworn that DSL uses its own framing protocol that is similar to the frame/superframe structure of HDSL ( T1 ) lines and over that you can run ATM or ethernet. Or is it typically ethernet -> ATM -> HDSL? In any case, why does the kernel care about the exact time that the IP packet has been received and reassembled on the headend? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 14:27 ` Phillip Susi @ 2006-06-14 15:08 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 56+ messages in thread From: Jesper Dangaard Brouer @ 2006-06-14 15:08 UTC (permalink / raw) To: Phillip Susi; +Cc: netdev, russell-tcatm, hawk [-- Attachment #1: Type: text/plain, Size: 1184 bytes --] On Wed, 2006-06-14 at 10:27 -0400, Phillip Susi wrote: > Jesper Dangaard Brouer wrote: > > The Linux traffic's control engine inaccurately calculates > > transmission times for packets sent over ADSL links. For > > some packet sizes the error rises to over 50%. This occurs > > because ADSL uses ATM as its link layer transport, and ATM > > transmits packets in fixed sized 53 byte cells. > > > > I could have sworn that DSL uses its own framing protocol that is > similar to the frame/superframe structure of HDSL ( T1 ) lines and over > that you can run ATM or ethernet. Or is it typically ethernet -> ATM -> > HDSL? Nope, not according to the ADSL standards G.992.1 and G.992.2. > In any case, why does the kernel care about the exact time that the IP > packet has been received and reassembled on the headend? I think you have misunderstood what the rate table does... (There is an explaination in the thesis page 57 section 6.1.2) http://www.adsl-optimizer.dk/thesis/ -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network developer Cand. Scient Datalog / MSc. Author of http://adsl-optimizer.dk [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-14 9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer 2006-06-14 12:06 ` jamal 2006-06-14 14:27 ` Phillip Susi @ 2006-06-20 5:35 ` Chris Wedgwood 2006-06-20 7:33 ` Jesper Dangaard Brouer 2 siblings, 1 reply; 56+ messages in thread From: Chris Wedgwood @ 2006-06-20 5:35 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: netdev On Wed, Jun 14, 2006 at 11:40:04AM +0200, Jesper Dangaard Brouer wrote: > The Linux traffic's control engine inaccurately calculates > transmission times for packets sent over ADSL links. For some > packet sizes the error rises to over 50%. This occurs because ADSL > uses ATM as its link layer transport, and ATM transmits packets in > fixed sized 53 byte cells. What if AAL5 is used? The cell-alignment math is going to be wrong there surely? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL 2006-06-20 5:35 ` Chris Wedgwood @ 2006-06-20 7:33 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 56+ messages in thread From: Jesper Dangaard Brouer @ 2006-06-20 7:33 UTC (permalink / raw) To: Chris Wedgwood; +Cc: netdev, russell-tcatm [-- Attachment #1: Type: text/plain, Size: 882 bytes --] On Mon, 2006-06-19 at 22:35 -0700, Chris Wedgwood wrote: > On Wed, Jun 14, 2006 at 11:40:04AM +0200, Jesper Dangaard Brouer wrote: > > > The Linux traffic's control engine inaccurately calculates > > transmission times for packets sent over ADSL links. For some > > packet sizes the error rises to over 50%. This occurs because ADSL > > uses ATM as its link layer transport, and ATM transmits packets in > > fixed sized 53 byte cells. > > What if AAL5 is used? The cell-alignment math is going to be wrong > there surely? Actually it _is_ AAL5 which is accounted for. See Chapter 5 "ADSL link layer overhead" (page 48-54). http://www.adsl-optimizer.dk/thesis/main_final_hyper.pdf -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network developer Cand. Scient Datalog / MSc. Author of http://adsl-optimizer.dk [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2006-09-04 10:39 UTC | newest]
Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-14 9:40 [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Jesper Dangaard Brouer
2006-06-14 12:06 ` jamal
2006-06-14 12:55 ` Jesper Dangaard Brouer
2006-06-15 12:57 ` jamal
2006-06-15 13:16 ` jamal
2006-06-20 1:04 ` Patrick McHardy
2006-06-20 14:59 ` jamal
2006-06-20 15:16 ` Patrick McHardy
2006-06-21 12:21 ` Krzysztof Matusik
2006-06-21 12:54 ` Patrick McHardy
2006-06-21 14:33 ` Krzysztof Matusik
2006-06-14 15:32 ` Andy Furniss
2006-06-20 0:54 ` Patrick McHardy
2006-06-20 14:56 ` jamal
2006-06-20 15:09 ` Patrick McHardy
2006-06-22 18:41 ` jamal
2006-06-23 14:32 ` Patrick McHardy
2006-06-24 14:39 ` jamal
2006-06-26 11:21 ` Patrick McHardy
2006-06-27 13:01 ` jamal
2006-07-02 4:23 ` Patrick McHardy
2006-07-02 13:59 ` jamal
[not found] ` <1150287983.3246.27.camel@ras.pc.brisbane.lube>
[not found] ` <1150292693.5197.1.camel@jzny2>
[not found] ` <1150843471.17455.2.camel@ras.pc.brisbane.lube>
[not found] ` <15653CE98281AD4FBD7F70BCEE3666E53CD54A@comxexch01.comx.local>
[not found] ` <1151000966.5392.34.camel@jzny2>
2006-06-23 12:37 ` Russell Stuart
2006-06-23 15:21 ` Patrick McHardy
2006-06-26 0:45 ` Russell Stuart
2006-06-26 11:10 ` Patrick McHardy
2006-06-27 6:19 ` Russell Stuart
2006-06-27 17:18 ` Patrick McHardy
2006-07-04 13:29 ` Patrick McHardy
2006-07-04 19:29 ` jamal
2006-07-04 23:53 ` Patrick McHardy
2006-07-06 0:39 ` Russell Stuart
2006-07-07 8:00 ` Patrick McHardy
2006-07-10 8:44 ` Russell Stuart
2006-06-24 14:13 ` jamal
2006-06-26 4:23 ` Russell Stuart
2006-07-18 2:06 ` Russell Stuart
2006-07-18 13:35 ` jamal
2006-07-18 21:46 ` Andy Furniss
2006-07-19 1:02 ` Russell Stuart
2006-07-19 14:42 ` Andy Furniss
2006-07-19 14:54 ` Patrick McHardy
2006-07-19 20:26 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL (RTAB BUG) Jesper Dangaard Brouer
2006-07-19 21:00 ` Alexey Kuznetsov
2006-07-20 5:47 ` Russell Stuart
2006-07-20 23:49 ` Alexey Kuznetsov
2006-07-19 14:50 ` [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL Patrick McHardy
2006-07-20 4:56 ` Russell Stuart
2006-07-30 23:06 ` Russell Stuart
2006-08-08 22:01 ` Russell Stuart
2006-08-09 11:33 ` jamal
2006-09-04 10:37 ` Russell Stuart
2006-06-14 14:27 ` Phillip Susi
2006-06-14 15:08 ` Jesper Dangaard Brouer
2006-06-20 5:35 ` Chris Wedgwood
2006-06-20 7:33 ` Jesper Dangaard Brouer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).