* creating netdev queues on the fly?
@ 2011-11-10 13:58 Johannes Berg
2011-11-10 14:35 ` Eric Dumazet
[not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
0 siblings, 2 replies; 12+ messages in thread
From: Johannes Berg @ 2011-11-10 13:58 UTC (permalink / raw)
To: netdev; +Cc: linux-wireless
Hi,
I've been thinking about how we manage TX queues in wifi and right now
we just split things up by access category for QoS purposes.
However we have the issue that we might be pushing data to stations with
completely different speeds. On wired, our outgoing speed is
essentially constant and some router/switch has to drop packets for the
slow link:
machine A === 1000mbps link ==== [switch] === 1000mbps === machine B
                                    |
                                    +--- 100mbps link --- machine C
But on wireless we really transmit to slow stations only with a slow
speed, so our outgoing speed differs. I think the scenario is quite
different, also because the speed can vary obviously.
So to get to my question: What if we could create netdev queues on the
fly?
The reason to do that is that we really don't want to reserve some 8000
queues just because somebody could possibly try to create 2000
connections (2007 is the theoretical max due to protocol restrictions)
to the AP interface. We also don't really want to create a netdev for
each peer (though you could implement it that way today).
I looked at this and it doesn't seem terrible. Creating & destroying the
queues might be tricky though. I think ndo_select_queue might return the
queue pointer instead of an index, and then that queue could be used.
The normal queues would still be in an array, with maybe a linked list
of extra queues that were dynamically created. Obviously the driver
would have to be able to manage that.
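
A rough userspace sketch of that layout, purely for illustration: a fixed array for the normal per-AC queues plus a linked list of queues created on first use, with the select hook returning a pointer rather than an index. Every name here (`struct txq`, `struct txdev`, `select_queue`, `destroy_queue`) is hypothetical and none of this is existing kernel API; it also keys only on the station, not on station/AC, and ignores locking entirely.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_STATIC_QUEUES 4	/* one per access category */

/* Hypothetical TX queue; NOT the kernel's struct netdev_queue. */
struct txq {
	uint16_t station_id;	/* unused (0) for the static per-AC queues */
	struct txq *next;	/* links the dynamically created queues */
};

struct txdev {
	struct txq fixed[NUM_STATIC_QUEUES];	/* normal queues, in an array */
	struct txq *extra;			/* queues created on the fly */
};

/*
 * Return the queue for a station, creating it on first use.
 * Falls back to the static per-AC queue if allocation fails.
 */
static struct txq *select_queue(struct txdev *dev, uint16_t station_id,
				unsigned int ac)
{
	struct txq *q;

	for (q = dev->extra; q; q = q->next)
		if (q->station_id == station_id)
			return q;

	q = calloc(1, sizeof(*q));
	if (!q)
		return &dev->fixed[ac % NUM_STATIC_QUEUES];

	q->station_id = station_id;
	q->next = dev->extra;
	dev->extra = q;
	return q;
}

/* Tear down the queue of a station that went away. */
static void destroy_queue(struct txdev *dev, uint16_t station_id)
{
	struct txq **pp;

	for (pp = &dev->extra; *pp; pp = &(*pp)->next) {
		if ((*pp)->station_id == station_id) {
			struct txq *dead = *pp;

			*pp = dead->next;
			free(dead);
			return;
		}
	}
}
```

The linear list walk is only acceptable for a sketch; with up to 2007 stations a hash table would be the more plausible lookup structure.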
Ultimately, all the frames will of course end up on the same four
hardware queues again. But this would allow some better management, and piled
up traffic to one station that suddenly dies wouldn't impact performance
for all others as badly as it does today since we wouldn't let all those
frames pile up on the hardware queues, they'd only get there with some
mechanism that might take airtime into account.
I think this might also make implementing reservation (tspec) easier.
Not sure if anyone wants/needs that though.
Am I completely crazy?
johannes
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: creating netdev queues on the fly?
2011-11-10 13:58 creating netdev queues on the fly? Johannes Berg
@ 2011-11-10 14:35 ` Eric Dumazet
2011-11-10 16:26 ` Johannes Berg
[not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2011-11-10 14:35 UTC (permalink / raw)
To: Johannes Berg; +Cc: netdev, linux-wireless
On Thursday 10 November 2011 at 14:58 +0100, Johannes Berg wrote:
> Hi,
>
> I've been thinking about how we manage TX queues in wifi and right now
> we just split things up by access category for QoS purposes.
>
> However we have the issue that we might be pushing data to stations with
> completely different speeds. On wired, our outgoing speed is
> essentially constant and some router/switch has to drop packets for the
> slow link:
>
> machine A === 1000mbps link ==== [switch] === 1000mbps === machine B
>                                     |
>                                     +--- 100mbps link --- machine C
>
> But on wireless we really transmit to slow stations only with a slow
> speed, so our outgoing speed differs. I think the scenario is quite
> different, also because the speed can vary obviously.
>
> So to get to my question: What if we could create netdev queues on the
> fly?
>
> The reason to do that is that we really don't want to reserve some 8000
> queues just because somebody could possibly try to create 2000
> connections (2007 is the theoretical max due to protocol restrictions)
> to the AP interface. We also don't really want to create a netdev for
> each peer (though you could implement it that way today).
>
> I looked at this and it doesn't seem terrible. Creating & destroying the
> queues might be tricky though. I think ndo_select_queue might return the
> queue pointer instead of an index, and then that queue could be used.
> The normal queues would still be in an array, with maybe a linked list
> of extra queues that were dynamically created. Obviously the driver
> would have to be able to manage that.
>
> Ultimately, all the frames will of course end up on the same four
> hardware queues again. But this would allow some better management, and piled
> up traffic to one station that suddenly dies wouldn't impact performance
> for all others as badly as it does today since we wouldn't let all those
> frames pile up on the hardware queues, they'd only get there with some
> mechanism that might take airtime into account.
>
> I think this might also make implementing reservation (tspec) easier.
> Not sure if anyone wants/needs that though.
>
>
> Am I completely crazy?
>

In terms of qdisc management I believe it's a bit complex if we start to
dynamically add netdev queues :)

My first idea would be to extend Qdisc management so that a device can
call back the qdisc when a frame is finally delivered / consumed / discarded.

We currently only have qdisc->enqueue() and qdisc->dequeue(), we could
add qdisc->deliver_callback(skb)

You keep devices as they are, with a netdevqueue per hardware queue.

Then, using a Qdisc like existing ones, but with a limit of
outstanding (given to device but not yet consumed) packets per class.

An external tc classifier would deliver a hash/index depending on the remote
station.

As a bonus you can get all the existing rate estimators / QOS /
shapers ...
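
To make the "limit of outstanding packets per class" idea concrete, here is a small userspace sketch, under stated assumptions: one class per remote station, a plain round-robin scan, and a completion hook that the device would call. `struct sta_class`, `struct sched`, `sched_dequeue` and `sched_deliver_callback` are made-up names for this illustration; `deliver_callback` does not exist in the kernel's Qdisc ops, it is only the hook proposed in the mail above.

```c
#define NUM_CLASSES 4

/* Hypothetical per-class state; one class per remote station. */
struct sta_class {
	unsigned int backlog;		/* packets still waiting in the qdisc */
	unsigned int outstanding;	/* given to the device, not yet consumed */
	unsigned int limit;		/* max outstanding packets for this class */
};

struct sched {
	struct sta_class cls[NUM_CLASSES];
	unsigned int next;		/* round-robin cursor */
};

/*
 * Pick the next class allowed to hand a packet to the device.
 * A class is skipped when it is empty or already at its limit.
 * Returns the class index, or -1 if no class is eligible.
 */
static int sched_dequeue(struct sched *s)
{
	for (unsigned int i = 0; i < NUM_CLASSES; i++) {
		unsigned int idx = (s->next + i) % NUM_CLASSES;
		struct sta_class *c = &s->cls[idx];

		if (c->backlog == 0 || c->outstanding >= c->limit)
			continue;

		c->backlog--;
		c->outstanding++;
		s->next = (idx + 1) % NUM_CLASSES;
		return (int)idx;
	}
	return -1;
}

/* Completion hook: the device reports a frame delivered or discarded. */
static void sched_deliver_callback(struct sched *s, unsigned int idx)
{
	if (idx < NUM_CLASSES && s->cls[idx].outstanding > 0)
		s->cls[idx].outstanding--;
}
```

With this shape, a station that stops acknowledging frames fills only its own `outstanding` budget; the other classes keep being served.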
* Re: creating netdev queues on the fly?
2011-11-10 14:35 ` Eric Dumazet
@ 2011-11-10 16:26 ` Johannes Berg
[not found] ` <1320942369.3967.127.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Johannes Berg @ 2011-11-10 16:26 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-wireless
On Thu, 2011-11-10 at 15:35 +0100, Eric Dumazet wrote:
> > So to get to my question: What if we could create netdev queues on the
> > fly?
>
> In terms of qdisc management I believe it's a bit complex if we start to
> dynamically add netdev queues :)

Yeah I was thinking that would be the problematic part.

> My first idea would be to extend Qdisc management so that a device can
> call back the qdisc when a frame is finally delivered / consumed / discarded.
>
> We currently only have qdisc->enqueue() and qdisc->dequeue(), we could
> add qdisc->deliver_callback(skb)
>
> You keep devices as they are, with a netdevqueue per hardware queue.
>
> Then, using a Qdisc like existing ones, but with a limit of
> outstanding (given to device but not yet consumed) packets per class.
>
> An external tc classifier would deliver a hash/index depending on the remote
> station.

So basically you'd have a class for each station? Or a class for each
station/AC? That's a lot of classes rather than queues, are they
pre-allocated, or does it not matter? Overall that sounds like a good
approach.

> As a bonus you can get all the existing rate estimators / QOS /
> shapers ...

Yeah ... I guess I need to start learning tc :)

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[parent not found: <1320942369.3967.127.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>]
* Re: creating netdev queues on the fly?
[not found] ` <1320942369.3967.127.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
@ 2011-11-10 17:00 ` Eric Dumazet
0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-11-10 17:00 UTC (permalink / raw)
To: Johannes Berg; +Cc: netdev, linux-wireless
On Thursday 10 November 2011 at 17:26 +0100, Johannes Berg wrote:
> On Thu, 2011-11-10 at 15:35 +0100, Eric Dumazet wrote:
>
> > > So to get to my question: What if we could create netdev queues on the
> > > fly?
> >
> > In terms of qdisc management I believe it's a bit complex if we start to
> > dynamically add netdev queues :)
>
> Yeah I was thinking that would be the problematic part.
>
> > My first idea would be to extend Qdisc management so that a device can
> > call back the qdisc when a frame is finally delivered / consumed / discarded.
> >
> > We currently only have qdisc->enqueue() and qdisc->dequeue(), we could
> > add qdisc->deliver_callback(skb)
> >
> > You keep devices as they are, with a netdevqueue per hardware queue.
> >
> > Then, using a Qdisc like existing ones, but with a limit of
> > outstanding (given to device but not yet consumed) packets per class.
> >
> > An external tc classifier would deliver a hash/index depending on the remote
> > station.
>
> So basically you'd have a class for each station? Or a class for each
> station/AC? That's a lot of classes rather than queues, are they
> pre-allocated, or does it not matter? Overall that sounds like a good
> approach.

They could be statically allocated in a stochastic fashion (sorry, SFQ is
one of my favorite qdiscs, very light in memory/cpu and yet powerful):

net/sched/sch_sfq.c

Of course, HFSC / DRR are considered more flexible, but eat _much_ more
memory.

http://people.netfilter.org/kaber/shaping

You declare a hash divisor of 1024 to map one station to one class
(there might be hash collisions of course...)

Each active slot contains a local queue to one remote station.

Each queue can be pfifo or whatever smart qdisc (sfq, or a more flexible
qdisc)

Then you use a filter to select the appropriate class:

    tc filter add dev ${dev} protocol all pref 1 parent $handle handle 1 \
        flow hash keys remote_station divisor 1024
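
As an illustration of the "hash divisor of 1024" mapping, the sketch below hashes a station's 6-byte MAC address into one of 1024 slots. FNV-1a is used here only because it is short; it is an assumption of this example and not the hash the tc flow classifier or SFQ actually uses, and `station_slot` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_DIVISOR 1024u	/* number of slots, as in the mail above */

/*
 * Map a station (identified by its MAC address) to a slot.
 * Two different stations can land in the same slot; such a
 * collision simply means they share one local queue.
 */
static uint32_t station_slot(const uint8_t mac[6])
{
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */

	for (size_t i = 0; i < 6; i++) {
		h ^= mac[i];
		h *= 16777619u;		/* FNV-1a 32-bit prime */
	}
	return h % HASH_DIVISOR;
}
```

The memory cost is fixed by the divisor, not by the number of associated stations, which is the point of the stochastic approach.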
[parent not found: <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>]
* Re: creating netdev queues on the fly?
[not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
@ 2011-11-10 14:40 ` Helmut Schaa
[not found] ` <CAGXE3d-_RFgW_zwfX2vTBe1psXmgoBFO5pd5cAgtYo=Jwpddhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-10 14:47 ` Dave Taht
1 sibling, 1 reply; 12+ messages in thread
From: Helmut Schaa @ 2011-11-10 14:40 UTC (permalink / raw)
To: Johannes Berg; +Cc: netdev, linux-wireless
Hi,

On Thu, Nov 10, 2011 at 2:58 PM, Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> wrote:
> But on wireless we really transmit to slow stations only with a slow
> speed, so our outgoing speed differs. I think the scenario is quite
> different, also because the speed can vary obviously.
>
> So to get to my question: What if we could create netdev queues on the
> fly?
>
> The reason to do that is that we really don't want to reserve some 8000
> queues just because somebody could possibly try to create 2000
> connections (2007 is the theoretical max due to protocol restrictions)
> to the AP interface. We also don't really want to create a netdev for
> each peer (though you could implement it that way today).
>
> I looked at this and it doesn't seem terrible. Creating & destroying the
> queues might be tricky though. I think ndo_select_queue might return the
> queue pointer instead of an index, and then that queue could be used.
> The normal queues would still be in an array, with maybe a linked list
> of extra queues that were dynamically created. Obviously the driver
> would have to be able to manage that.
>
> Ultimately, all the frames will of course end up on the same four
> hardware queues again. But this would allow some better management, and piled
> up traffic to one station that suddenly dies wouldn't impact performance
> for all others as badly as it does today since we wouldn't let all those
> frames pile up on the hardware queues, they'd only get there with some
> mechanism that might take airtime into account.
>
> I think this might also make implementing reservation (tspec) easier.
> Not sure if anyone wants/needs that though.

Wouldn't it be possible to implement something like this as a qdisc on top of
mq that makes use of the current tx rate per station to distribute the airtime
equitably?

Of course this would require the qdisc to know the tx rate a priori but for
mac80211 drivers we could just use last_tx_rate as an estimate ...

Helmut
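
A back-of-the-envelope sketch of what "distribute the airtime equitably" could look like: charge each station the estimated airtime of every frame it sends, computed from its last TX rate, and always serve the backlogged station that has used the least airtime so far. The structure and function names are invented for this example, and the estimate deliberately ignores preambles, ACKs, retries, contention and aggregation, so it is only a crude proxy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-station accounting. */
struct station {
	uint32_t last_tx_rate_kbps;	/* most recent TX rate, used as estimate */
	uint64_t airtime_us;		/* estimated airtime consumed so far */
	unsigned int backlog;		/* frames waiting for this station */
};

/*
 * Payload-only airtime estimate: bits / rate, in microseconds.
 * Returns 0 when the rate is unknown (rate_kbps == 0).
 */
static uint64_t est_airtime_us(uint32_t bytes, uint32_t rate_kbps)
{
	if (rate_kbps == 0)
		return 0;
	return ((uint64_t)bytes * 8u * 1000u) / rate_kbps;
}

/* Serve the backlogged station with the least airtime used; -1 if none. */
static int pick_station(const struct station *sta, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (sta[i].backlog == 0)
			continue;
		if (best < 0 || sta[i].airtime_us < sta[best].airtime_us)
			best = (int)i;
	}
	return best;
}

/* Account for one frame of the given size sent to this station. */
static void charge(struct station *s, uint32_t bytes)
{
	if (s->backlog == 0)
		return;
	s->airtime_us += est_airtime_us(bytes, s->last_tx_rate_kbps);
	s->backlog--;
}
```

Under this estimate a 1500-byte frame costs about 222 us at 54 Mbit/s and 12000 us at 1 Mbit/s, so the fast station gets roughly 54 frames for every frame sent to the slow one, which is the equal-airtime outcome rather than equal-packets.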
[parent not found: <CAGXE3d-_RFgW_zwfX2vTBe1psXmgoBFO5pd5cAgtYo=Jwpddhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: creating netdev queues on the fly?
[not found] ` <CAGXE3d-_RFgW_zwfX2vTBe1psXmgoBFO5pd5cAgtYo=Jwpddhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-10 14:55 ` Denys Fedoryshchenko
2011-11-10 15:25 ` Dave Taht
0 siblings, 1 reply; 12+ messages in thread
From: Denys Fedoryshchenko @ 2011-11-10 14:55 UTC (permalink / raw)
To: Helmut Schaa; +Cc: Johannes Berg, netdev, linux-wireless
On Thu, 10 Nov 2011 15:40:01 +0100, Helmut Schaa wrote:
>>
>> I think this might also make implementing reservation (tspec) easier.
>> Not sure if anyone wants/needs that though.
>
> Wouldn't it be possible to implement something like this as a qdisc on top of
> mq that makes use of the current tx rate per station to distribute the airtime
> equitably?
>
> Of course this would require the qdisc to know the tx rate a priori but for
> mac80211 drivers we could just use last_tx_rate as an estimate ...
>
> Helmut

Maybe someone will make something like "tfifo" in future :)
And when clients are connected, each has its own queue.

Then, for example:

    qdisc add dev wlan0 parent 1:10 handle 10 tfifo limit 100ms

If packets are older than 100ms they will be dropped, or new packets are not
added if there are packets older than 100ms that have not been sent yet.

I am not sure that bandwidth will be distributed fairly, that is a different
question; probably each queue should have some "limited chunk of time" to
send data.
And again, 802.11a/b/g at least are half-duplex and CSMA, and without
polling/TDMA or CTS/RTS tricks it will be complicated to give guaranteed
chunks of time.

P.S. That's just a dream :)

---
Network engineer
Denys Fedoryshchenko

P.O.Box 41553 Jeddah 21531
Tel: 920023422
Fax: +966 26501784
E-Mail: denys-ArQk2d8GGkZT5gTzvV8LJA@public.gmane.org
* Re: creating netdev queues on the fly?
2011-11-10 14:55 ` Denys Fedoryshchenko
@ 2011-11-10 15:25 ` Dave Taht
[not found] ` <CAA93jw7ECQegWj6rpd48sbmDQUjorCYzXANXJSj0baHtxzC7EA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Dave Taht @ 2011-11-10 15:25 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: Helmut Schaa, Johannes Berg, netdev, linux-wireless
On Thu, Nov 10, 2011 at 3:55 PM, Denys Fedoryshchenko <denys@wbc.net.sa> wrote:
> On Thu, 10 Nov 2011 15:40:01 +0100, Helmut Schaa wrote:
>>>
>>> I think this might also make implementing reservation (tspec) easier.
>>> Not sure if anyone wants/needs that though.
>>
>> Wouldn't it be possible to implement something like this as a qdisc on top of
>> mq that makes use of the current tx rate per station to distribute the airtime
>> equitably?
>>
>> Of course this would require the qdisc to know the tx rate a priori but for
>> mac80211 drivers we could just use last_tx_rate as an estimate ...

Need last_aggregation_depth bytes and packets and last feasible
versions of those for an estimate with n.

completion rate, perhaps...

>>
>> Helmut
>
> Maybe someone will make something like "tfifo" in future :)

Did some of it last weekend. Code doesn't work yet - I mixed in too
many of the ideas mentioned in my earlier, longer, more difficult to
read missive.

Time based queue limits are an important part of the answer, however.
Our current fixed packet queue limits are merely a proxy for time. At
100Mbit ethernet, they were 100ms - which is conus distances...

and I'd hope that someone more knowledgeable than I at these layers
take all this on...

Two notes:

1) Getting 'time' from the kernel is expensive. And: System time wanders.

mac80211 Wifi devices however do export a get_tsf function which could
be used as a relative-to-the-queue clock - and we actually don't need
accuracy down to the level of get_tsf (25ns)

2) You need to get a timestamp on entry to the first queue and check
against the allowable latency on exit from the last. So to construct a
tc chain you'd want a tfifo (timestamp on entry), fifot (check
timestamp against limit on dequeue), and for the simplest of
applications : tfifot (timestamp on entry, check on exit)

> And when clients are connected, each have his own queue.

Needed, yes. Particularly with mixed g/n, but also to optimize
aggregation AND fairness.

>
> then, for example qdisc add dev wlan0 parent 1:10 handle 10 tfifo limit 100ms
> If packet are older than 100ms will be dropped, or new packets are not added, if
> there is packet older than 100ms are not sent yet.

Aside from me calling it 'tfifot' last weekend, yes. Though I prefer
using 'timelimit' as the key name, to distinguish between previous
overloaded uses of the word limit.

>
> I am not sure that bandwidth will be distributed fairly, it is different question,
> probably each queue should have some "limited chunk of time" to send data.

Round robin with random starting points for each cycle through the
stations in what I was calling a 'grouper' in the previous mail.

> And again, 802.11a/b/g at least are half-duplex and CSMA, and without
> polling/TDMA or CTS/RTS tricks
> it will be complicated to give guaranteed chunks of time.

I *think* but am not sure that what I have described interacts with
the EDCA reasonably well.

>
> P.S. That's just a dream :)
>

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
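
The "timestamp on entry, check on exit" behaviour discussed above can be sketched in a few lines of userspace C. This is only an illustration of the idea, not the unfinished code mentioned in the mail: `struct tfifo`, `tfifo_enqueue` and `tfifo_dequeue` are hypothetical names, the clock is passed in by the caller as milliseconds, and the queue is a fixed-size ring.

```c
#include <stdbool.h>
#include <stdint.h>

#define TFIFO_SLOTS 64

struct tpkt {
	int id;			/* stand-in for the real packet */
	uint64_t enq_ms;	/* timestamp taken on entry */
};

struct tfifo {
	struct tpkt ring[TFIFO_SLOTS];
	unsigned int head, count;
	uint64_t timelimit_ms;	/* max time a packet may sit in the queue */
	unsigned int dropped;	/* packets discarded as too old */
};

/* Timestamp on entry. Returns false if the ring is full. */
static bool tfifo_enqueue(struct tfifo *q, int id, uint64_t now_ms)
{
	unsigned int tail;

	if (q->count == TFIFO_SLOTS)
		return false;

	tail = (q->head + q->count) % TFIFO_SLOTS;
	q->ring[tail].id = id;
	q->ring[tail].enq_ms = now_ms;
	q->count++;
	return true;
}

/*
 * Check on exit: packets older than timelimit_ms are dropped instead
 * of being sent. Returns the packet id, or -1 if nothing is left.
 */
static int tfifo_dequeue(struct tfifo *q, uint64_t now_ms)
{
	while (q->count > 0) {
		struct tpkt p = q->ring[q->head];

		q->head = (q->head + 1) % TFIFO_SLOTS;
		q->count--;

		if (now_ms - p.enq_ms > q->timelimit_ms) {
			q->dropped++;
			continue;
		}
		return p.id;
	}
	return -1;
}
```

Passing the time in as a parameter keeps the sketch independent of which clock is used, which is exactly the point under debate in this part of the thread.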
[parent not found: <CAA93jw7ECQegWj6rpd48sbmDQUjorCYzXANXJSj0baHtxzC7EA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: creating netdev queues on the fly?
[not found] ` <CAA93jw7ECQegWj6rpd48sbmDQUjorCYzXANXJSj0baHtxzC7EA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-11 11:02 ` Eric Dumazet
2011-11-11 11:42 ` Dave Taht
2011-11-12 9:45 ` Eric Dumazet
0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-11-11 11:02 UTC (permalink / raw)
To: Dave Taht
Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev, linux-wireless
On Thursday 10 November 2011 at 16:25 +0100, Dave Taht wrote:
> Two notes:
>
> 1) Getting 'time' from the kernel is expensive. And: System time wanders.
>
> mac80211 Wifi devices however do export a get_tsf function which could
> be used as a relative-to-the-queue clock - and we actually don't need
> accuracy down to the level of get_tsf (25ns)
>

Getting 'time' from the kernel is not that expensive, depending on the
resolution you need. We are not going to use timestamps from devices!

If ms resolution is enough (say you want to drop packets if they stay
more than 100ms in qdisc), jiffies is a single memory read.

If needing us or ns resolution, psched_get_time() uses ktime_get(), and
is used in CBQ, HTB, HFSC, TBF, so if it was expensive we would have a big
problem right now :)

> 2) You need to get a timestamp on entry to the first queue and check
> against the allowable latency on exit from the last. So to construct a
> tc chain you'd want a tfifo (timestamp on entry), fifot (check
> timestamp against limit on dequeue), and for the simplest of
> applications : tfifot (timestamp on entry, check on exit)

That would not be very practical.

I would see a new Qdisc/Class property, like the rate estimator, that we
can attach to any Qdisc/Class with a new tc option.

Even without any limit enforcing (might be Random Early Detection by the
way), it could be used to get a Queue Delay estimation, using EWMA

    avqdelay = avqdelay*(1-W) + qdelay*W;
    W = 2^(-ewma_log);

    tc [ qdisc | class] add [...] [est 1sec 8sec] [delayest ewma_log ] ..

    tc -s -d qdisc ...
    qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
     Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0)
     rate 2557Kbit 215pps backlog 0b 0p requeues 0
     delay 91ms

    tc [ qdisc | class] add [...] [est 1sec 8sec] [delaylimit max ] ..

    tc -s -d qdisc ...
    qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
     Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0)
     rate 2557Kbit 215pps backlog 0b 0p requeues 0
     delay 91ms delaylimit 100ms (dropped 12)
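
Because W = 2^(-ewma_log), the EWMA above can be rewritten as avqdelay + (qdelay - avqdelay) * W and computed with shifts only, with no multiplication or floating point. The sketch below shows one way to do that in fixed point; the 16-bit scaling factor and the names `struct delay_est`, `delay_est_update` and `delay_est_ns` are assumptions made for this example.

```c
#include <stdint.h>

#define AVDELAY_FACTOR 16	/* fractional bits kept in the running average */

/* Hypothetical estimator state. */
struct delay_est {
	uint64_t avdelay_scaled;	/* average delay in ns << AVDELAY_FACTOR */
	uint8_t ewma_log;		/* W = 2^(-ewma_log) */
};

/*
 * avg = avg*(1-W) + sample*W
 *     = avg + (sample >> ewma_log) - (avg >> ewma_log)
 * Unsigned wrap-around in the intermediate difference is harmless
 * because the final sum is taken modulo 2^64.
 */
static void delay_est_update(struct delay_est *e, uint64_t qdelay_ns)
{
	uint64_t sample = qdelay_ns << AVDELAY_FACTOR;

	e->avdelay_scaled += (sample >> e->ewma_log) -
			     (e->avdelay_scaled >> e->ewma_log);
}

/* Current estimate in nanoseconds. */
static uint64_t delay_est_ns(const struct delay_est *e)
{
	return e->avdelay_scaled >> AVDELAY_FACTOR;
}
```

For example, with ewma_log = 1 (W = 1/2) and an initial average of 0, two samples of 1000 ns give 500 ns and then 750 ns. A larger ewma_log smooths more and reacts more slowly.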
* Re: creating netdev queues on the fly?
2011-11-11 11:02 ` Eric Dumazet
@ 2011-11-11 11:42 ` Dave Taht
[not found] ` <CAA93jw7n1jYiWrnHOF0Zmzd0cVtadNhPSCpP5YqEdq_Q9opw5A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-12 9:45 ` Eric Dumazet
1 sibling, 1 reply; 12+ messages in thread
From: Dave Taht @ 2011-11-11 11:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev, linux-wireless, Andrew McGregor
On Fri, Nov 11, 2011 at 12:02 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thursday 10 November 2011 at 16:25 +0100, Dave Taht wrote:
>
>> Two notes:
>>
>> 1) Getting 'time' from the kernel is expensive. And: System time wanders.
>>
>> mac80211 Wifi devices however do export a get_tsf function which could
>> be used as a relative-to-the-queue clock - and we actually don't need
>> accuracy down to the level of get_tsf (25ns)
>>
>
> getting 'time' from kernel is not that expensive, depending on
> resolution you need. We are not going to use timestamps from devices !

Wireless keeps track of time via a TSF function.

http://en.wikipedia.org/wiki/Timing_Synchronization_Function_%28TSF%29

TSF wanders forward based on an election process from broadcast
beacons, the presentation below contains animations and the like that
describe how it works

http://www.cse.ohio-state.edu/~lai/788-Au05/2-scalibility.ppt

Further it copes with life in terms of TU. One TU is 1024us.

Now, whether any of that is truly relevant to how to do packet
scheduling well on wireless, is a good question. TU is fairly
fundamental to the minstrel algorithm. It bothers me to have anything
resembling a beat frequency.

>
> If ms resolution is enough (say you want to drop packets if they stay
> more than 100ms in qdisc), jiffies is a single memory read.

The intended properties of the hardware queues are 10 ms VO, 100 ms
VI, and some_sane_number_relative_to_real_bandwidth for BE and BK.

The software queues should make similar assumptions, although for
voice traffic, being able to drop and 'pull forward' the next packet
in a stream should it exceed time in queue would be good, and you
could go with 30 ms in queue with that level of jitter.

>
> If needing us or ns resolution, psched_get_time() uses ktime_get(), and
> is used in CBQ, HTB, HFSC, TBF, so if it was expensive we would have big
> problem right now :)
>
>> 2) You need to get a timestamp on entry to the first queue and check
>> against the allowable latency on exit from the last. So to construct a
>> tc chain you'd want a tfifo (timestamp on entry), fifot (check
>> timestamp against limit on dequeue), and for the simplest of
>> applications : tfifot (timestamp on entry, check on exit)
>
> That would be not very practical.
>
> I would see a new Qdisc/Class property, like the rate estimator, that we
> can attach to any Qdisc/Class with a new tc option.
>
> Even without any limit enforcing (might be Random Early Detection by the
> way), it could be used to get a Queue Delay estimation, using EWMA
>
> avqdelay = avqdelay*(1-W) + qdelay*W;
> W = 2^(-ewma_log);
>
> tc [ qdisc | class] add [...] [est 1sec 8sec] [delayest ewma_log ] ..

More elegant to be sure. But for N you kind of need to do ewma between
aggregated transmission groups.

>
> tc -s -d qdisc ...
> qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
>  Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0)
>  rate 2557Kbit 215pps backlog 0b 0p requeues 0
>  delay 91ms

I still don't 'get' how we can split out a stream based on a
stationid, toss it in a queue to be further scheduled (my choice would
be QFQ, btw), and then sanely de-schedule a burst of packets for that
destination appropriate for that station's aggregation level and
transmit rate using existing tc methods.

I liked the callback idea discussed earlier for implementing a
'grouper' of this sort.

That said I'm strongly encouraged by the dialog thus far on this thread.

>
> tc [ qdisc | class] add [...] [est 1sec 8sec] [delaylimit max ] ..
>
> tc -s -d qdisc ...
> qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
>  Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0)
>  rate 2557Kbit 215pps backlog 0b 0p requeues 0
>  delay 91ms delaylimit 100ms (dropped 12)

and I assume that you are either making this syntax up or coding
faster than emailing in some tree somewhere...

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
[parent not found: <CAA93jw7n1jYiWrnHOF0Zmzd0cVtadNhPSCpP5YqEdq_Q9opw5A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: creating netdev queues on the fly?
[not found] ` <CAA93jw7n1jYiWrnHOF0Zmzd0cVtadNhPSCpP5YqEdq_Q9opw5A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-11 11:54 ` Eric Dumazet
0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-11-11 11:54 UTC (permalink / raw)
To: Dave Taht
Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev, linux-wireless, Andrew McGregor
On Friday 11 November 2011 at 12:42 +0100, Dave Taht wrote:
> More elegant to be sure. But for N you kind of need to do ewma between
> aggregated transmission groups.
>

I speak at Class/Qdisc level. If each transmission group has its own
class [ I believe it should ], it should fit.

> I still don't 'get' how we can split out a stream based on a
> stationid, toss it in a queue to be further scheduled (my choice would
> be QFQ, btw), and then sanely de-schedule a burst of packets for that
> destination appropriate for that station's aggregation level and
> transmit rate using existing tc methods.
>

Right now it's not possible since we don't have any feedback once a packet
is dequeued from the qdisc. But it should be doable.

> I liked the callback idea discussed earlier for implementing a
> 'grouper' of this sort.
>
> That said I'm strongly encouraged by the dialog thus far on this thread.
>
...
> and I assume that you are either making this syntax up or coding
> faster than emailing in some tree somewhere...
>

That's because I prefer discussing this first and starting to code once
the general idea is accepted.

Note this delay idea is not new, it was already mentioned on netdev some
months ago.
* Re: creating netdev queues on the fly? 2011-11-11 11:02 ` Eric Dumazet 2011-11-11 11:42 ` Dave Taht @ 2011-11-12 9:45 ` Eric Dumazet 1 sibling, 0 replies; 12+ messages in thread From: Eric Dumazet @ 2011-11-12 9:45 UTC (permalink / raw) To: Dave Taht Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev, linux-wireless Le vendredi 11 novembre 2011 à 12:02 +0100, Eric Dumazet a écrit : > I would see a new Qdisc/Class property, like the rate estimator, that we > can attach to any Qdisc/Class with a new tc option. > > Even without any limit enforcing (might be Random Early Detection by the > way), it could be used to get a Queue Delay estimation, using EWMA > > avqdelay = avqdelay*(1-W) + qdelay*W; > W = 2^(-ewma_log); > > tc [ qdisc | class] add [...] [est 1sec 8sec] [delayest ewma_log ] .. > > tc -s -d qdisc ... > qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17 > Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0) > rate 2557Kbit 215pps backlog 0b 0p requeues 0 > delay 91ms > > I coded the thing (delayest at qdisc level) and got interesting values, for example with following HTB setup : DEV=eth3 MTU=1500 rate=10mbit EST="est 1sec 8sec delayest 6" tc qdisc del dev $DEV root tc qdisc add dev $DEV root handle 1: ${EST} \ htb default 1 tc class add dev $DEV parent 1: classid 1:1 htb \ rate ${rate} mtu 40000 quantum 80000 tc qdisc add dev $DEV parent 1:1 handle 10: ${EST} pfifo limit 3 With light trafic on my x86_64 machine I have : # tcnew -s -d qdisc show dev eth3 qdisc htb 1: root refcnt 17 r2q 10 default 1 direct_packets_stat 0 ver 3.17 Sent 78216 bytes 569 pkt (dropped 0, overlimits 0 requeues 0) rate 9040bit 13pps backlog 0b 0p requeues 0 delay 1126 ns log 6 qdisc pfifo 10: parent 1:1 limit 3p Sent 78216 bytes 569 pkt (dropped 0, overlimits 0 requeues 0) rate 9040bit 13pps backlog 0b 0p requeues 0 delay 731 ns log 6 Wow, 1126ns of overhead per packet... 
This is the prototype kernel patch I used on top of net-next:

diff --git a/include/linux/gen_stats.h b/include/linux/gen_stats.h
index 552c8a0..5ad57a6 100644
--- a/include/linux/gen_stats.h
+++ b/include/linux/gen_stats.h
@@ -63,5 +63,16 @@ struct gnet_estimator {
 	unsigned char	ewma_log;
 };
 
+/**
+ * struct gnet_qdelay - queue delay configuration / reports
+ * @avdelay:	average queue delay in ns
+ * @limit:	packets delayed more than this value are dropped
+ * @avdelaylog:	the log of measurement window weight
+ */
+struct gnet_qdelay {
+	__u64	avdelay;
+	__u64	limit;
+	__u32	avdelaylog;
+};
 
 #endif /* __LINUX_GEN_STATS_H */
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 8e872ea..61c66ec 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -484,6 +484,7 @@ enum {
 	TCA_FCNT,
 	TCA_STATS2,
 	TCA_STAB,
+	TCA_QDELAY,
 	__TCA_MAX
 };
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f6bb08b..e293228 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -42,6 +42,13 @@ struct qdisc_size_table {
 	u16			data[];
 };
 
+/*
+ * qdisc/class avdelay is computed using EWMA, with a fixed factor of 16
+ * Only the weight is a parameter (avdelaylog)
+ * With u64 values, this leaves 48 bits, a max of 281474 seconds.
+ */
+#define TCQ_AVDELAY_FACTOR 16
+
 struct Qdisc {
 	int			(*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
 	struct sk_buff *	(*dequeue)(struct Qdisc *dev);
@@ -50,8 +57,11 @@ struct Qdisc {
 #define TCQ_F_INGRESS		2
 #define TCQ_F_CAN_BYPASS	4
 #define TCQ_F_MQROOT		8
+#define TCQ_F_QDELAY		0x10
 #define TCQ_F_WARN_NONWC	(1 << 16)
-	int			padded;
+	u8			padded;
+	u8			avdelaylog; /* avdelay EWMA weight */
+	u8			_pad[2];
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
 	struct list_head	list;
@@ -80,6 +90,10 @@ struct Qdisc {
 	struct gnet_stats_basic_packed bstats;
 	unsigned int		__state;
 	struct gnet_stats_queue	qstats;
+
+	/* average queue delay in ns << TCQ_AVDELAY_FACTOR */
+	u64			avdelay;
+
 	struct rcu_head		rcu_head;
 	spinlock_t		busylock;
 	u32			limit;
@@ -219,6 +233,9 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
+#ifdef CONFIG_NET_SCHED_QDELAY
+	ktime_t			enqueue_time;
+#endif
 	unsigned int		pkt_len;
 	long			data[];
 };
@@ -467,6 +484,14 @@ static inline void qdisc_bstats_update(struct Qdisc *sch,
 				       const struct sk_buff *skb)
 {
 	bstats_update(&sch->bstats, skb);
+#ifdef CONFIG_NET_SCHED_QDELAY
+	if (sch->flags & TCQ_F_QDELAY) {
+		u64 delay = ktime_to_ns(ktime_sub(ktime_get(),
+						  qdisc_skb_cb(skb)->enqueue_time));
+		delay <<= TCQ_AVDELAY_FACTOR;
+		sch->avdelay += (delay >> sch->avdelaylog) - (sch->avdelay >> sch->avdelaylog);
+	}
+#endif
 }
 
 static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
diff --git a/net/core/dev.c b/net/core/dev.c
index 6ba50a1..587534d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2402,6 +2402,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	int rc;
 
 	qdisc_skb_cb(skb)->pkt_len = skb->len;
+
+#ifdef CONFIG_NET_SCHED_QDELAY
+	qdisc_skb_cb(skb)->enqueue_time.tv64 = 0;
+	if (q->flags & TCQ_F_QDELAY)
+		qdisc_skb_cb(skb)->enqueue_time = ktime_get();
+#endif
+
 	qdisc_calculate_pkt_len(skb, q);
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 2590e91..028f882 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -470,6 +470,17 @@ config NET_CLS_ACT
 	  A recent version of the iproute2 package is required to use
 	  extended matches.
 
+config NET_SCHED_QDELAY
+	bool "QDISC/CLASS queue delay Estimators and Limits"
+	---help---
+	  Say Y here if you want to be able to track queue delays, and
+	  be able to drop packets if they stay in a queue a too long time.
+	  It adds some overhead per packet, since it needs to get precise
+	  time at enqueue and dequeue time.
+
+	  A recent version of the iproute2 package is required to use
+	  extended matches.
+
 config NET_ACT_POLICE
 	tristate "Traffic Policing"
 	depends on NET_CLS_ACT
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index dca6c1a..212fba9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -842,6 +842,17 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 		}
 		rcu_assign_pointer(sch->stab, stab);
 	}
+	if (tca[TCA_QDELAY]) {
+		struct gnet_qdelay *parm = nla_data(tca[TCA_QDELAY]);
+		err = -EINVAL;
+		if (nla_len(tca[TCA_QDELAY]) < sizeof(*parm))
+			goto err_out4;
+		sch->avdelaylog = parm->avdelaylog;
+		sch->flags |= TCQ_F_QDELAY;
+	} else { /* temporary testing */
+		sch->avdelaylog = 6;
+		sch->flags |= TCQ_F_QDELAY;
+	}
 	if (tca[TCA_RATE]) {
 		spinlock_t *root_lock;
 
@@ -1206,6 +1217,16 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 	if (stab && qdisc_dump_stab(skb, stab) < 0)
 		goto nla_put_failure;
 
+#ifdef CONFIG_NET_SCHED_QDELAY
+	if (q->flags & TCQ_F_QDELAY) {
+		struct gnet_qdelay qdelay;
+
+		memset(&qdelay, 0, sizeof(qdelay));
+		qdelay.avdelay = q->avdelay >> TCQ_AVDELAY_FACTOR;
+		qdelay.avdelaylog = q->avdelaylog;
+		NLA_PUT(skb, TCA_QDELAY, sizeof(qdelay), &qdelay);
+	}
+#endif
 	if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, TCA_XSTATS,
 					 qdisc_root_sleeping_lock(q), &d) < 0)
 		goto nla_put_failure;
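
For readers who want to experiment with the averaging outside the kernel, the shift-based EWMA update from the patch (average kept as ns << 16, weight W = 2^(-avdelaylog)) can be sketched in user space. This is only an illustration: struct delay_est, delay_est_update() and delay_est_read_ns() are hypothetical names, not part of the patch, and AVDELAY_FACTOR merely mirrors TCQ_AVDELAY_FACTOR.

```c
#include <stdint.h>

#define AVDELAY_FACTOR 16	/* mirrors TCQ_AVDELAY_FACTOR from the patch */

/* Hypothetical user-space stand-in for the per-qdisc fields. */
struct delay_est {
	uint64_t avdelay;	/* average delay in ns, scaled by 2^AVDELAY_FACTOR */
	uint8_t  ewma_log;	/* weight W = 2^(-ewma_log) */
};

/*
 * Fold one measured delay (in ns) into the average:
 *   avdelay = avdelay*(1-W) + delay*W
 * written with shifts. The subtraction may wrap when the new sample is
 * below the average; unsigned arithmetic modulo 2^64 still yields the
 * right sum.
 */
static void delay_est_update(struct delay_est *e, uint64_t delay_ns)
{
	uint64_t scaled = delay_ns << AVDELAY_FACTOR;

	e->avdelay += (scaled >> e->ewma_log) - (e->avdelay >> e->ewma_log);
}

/* Report the average in plain nanoseconds, as tc_fill_qdisc does. */
static uint64_t delay_est_read_ns(const struct delay_est *e)
{
	return e->avdelay >> AVDELAY_FACTOR;
}
```

With ewma_log = 6, a single 1000 ns sample moves an empty average to about 1000/64 ns, and a steady stream of 1000 ns samples converges on 1000 ns; the extra 16 bits of scaling keep the small per-sample contributions from being rounded away.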
* Re: creating netdev queues on the fly?
[not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
2011-11-10 14:40 ` Helmut Schaa
@ 2011-11-10 14:47 ` Dave Taht
1 sibling, 0 replies; 12+ messages in thread
From: Dave Taht @ 2011-11-10 14:47 UTC (permalink / raw)
To: Johannes Berg; +Cc: netdev, linux-wireless

On Thu, Nov 10, 2011 at 2:58 PM, Johannes Berg
<johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> wrote:
> Hi,
>
> Am I completely crazy?

Somewhat. :)

Much of your thinking aligns with mine, however my goal was to try and
reduce latencies on wireless-n, where we send variable size truckloads
of packets to each destination.

Solving that one is hard, and requires two levels of active queue
management in the packet scheduler layer, and a bit more communication
up from the driver itself.

We could have a unique 'station identifier' which fits handily into 32
bits as the max allowed range is 0-2008, and map from MAC to that on tx
entry.

Having that as a flow classifier lets us have per station destination
queues easily split up via a std tc filter... and then we have the
ability, finally, to manage queue depth on a per station basis.

So you end up with four queues, each tied to a hardware queue, that
then split things up on a per station basis, fair queue within the
queues to each station, and recombine them at the end on a basis bursty
enough to aggregate as they exit the radio.

I don't mind at all up to 8000 queues, honestly. Wasting 99% on mostly
unused queue structures, versus pouring megabytes into useless
bufferbloated FIFO-only packet buffers, seems an acceptable compromise,
but I'm easy...
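
As a rough illustration of the MAC-to-station-identifier mapping described above, here is a minimal user-space sketch. Everything in it is an assumption made for illustration: struct sta_table, sta_lookup() and sta_get_id() are hypothetical names, the table size is simply taken from the range quoted in the mail, and the linear scan stands in for whatever lookup a driver already has.

```c
#include <stdint.h>
#include <string.h>

#define MAX_STA 2008	/* assumed capacity, taken from the range quoted above */

/* Hypothetical table; a real driver would already track its stations. */
struct sta_table {
	uint8_t mac[MAX_STA][6];
	uint8_t used[MAX_STA];
};

/* Return the identifier for a MAC address, or -1 if it is unknown. */
static int sta_lookup(const struct sta_table *t, const uint8_t mac[6])
{
	for (int i = 0; i < MAX_STA; i++)
		if (t->used[i] && memcmp(t->mac[i], mac, 6) == 0)
			return i;
	return -1;
}

/*
 * Return the existing identifier for a MAC address, or claim the first
 * free slot for it. Returns -1 when the table is full.
 */
static int sta_get_id(struct sta_table *t, const uint8_t mac[6])
{
	int id = sta_lookup(t, mac);

	if (id >= 0)
		return id;
	for (int i = 0; i < MAX_STA; i++) {
		if (!t->used[i]) {
			memcpy(t->mac[i], mac, 6);
			t->used[i] = 1;
			return i;
		}
	}
	return -1;
}
```

The small integer returned here is the kind of value a tc filter could then match on to steer packets into a per-station queue.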

As for managing queue depth on a per station basis, some of what has
been discussed on "byte queue limits" applies, but given wireless's
peculiarities, tsf timestamping on entry to the first qdisc, doing fair
queuing inside the per-sta queue (QFQ?), and checking the timestamp on
exit from the queue against a sane limit for the queue type would do
wonders for overall latencies and network responsiveness.

Done right, instead of seeing a single tcp stream capable of inducing
multi-second latencies for the next stream, latencies would stay flat
up to the max aggregation depth of different streams on a given sta,
subject only to how many other competing stations there are, the net
effect of packet loss would be vastly lessened, and world peace,
achieved.

I dream of 2ms pings and dns lookups, even gaming, under load, on
wireless. I do.

First steps are getting a station identifier and some useful statistics
regarding that station's max packet bundle size (the quantum) and
completion rate, mostly from minstrel... on each packet... a tc
classifier that can use it to toss into the tcindex mechanism (if that
is what is used), another sane classifier for something like QFQ per
station, and a packet 'grouper' that can output correctly sized bursts
of packets on a sane basis from the queues in a randomly sane order
(not round robin per se; to even out the load it has to start
dequeuing groups).

How to do all that within tc? Well... I like the idea of throwing out
the 32 bitness of tc's classifiers (mac hashing and ipv6 hashing is not
very effective), but I doubt that will fly..

So to fit into the existing structures, the idea of adding the concept
of a tc qdisc 'grouper' along with all the other tc filter 'splitters'
- that could be multiqueue and multiple hardware queue aware - seems
like an answer.
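
The "timestamp on entry, check on exit against a sane limit" idea can be sketched as a small user-space FIFO. This is a minimal sketch under stated assumptions, not kernel code: struct timed_fifo, tf_enqueue() and tf_dequeue() are hypothetical names, packets are reduced to an integer id, and the caller supplies the clock value in nanoseconds.

```c
#include <stddef.h>
#include <stdint.h>

#define QLEN 64			/* arbitrary ring size for the sketch */

struct pkt {
	uint64_t enqueue_ns;	/* timestamp taken on entry to the queue */
	int id;			/* stand-in for the real packet */
};

/* Hypothetical FIFO that drops packets which waited too long. */
struct timed_fifo {
	struct pkt ring[QLEN];
	size_t head, count;
	uint64_t max_sojourn_ns;	/* the "sane limit" for this queue */
	unsigned long dropped;
};

/* Stamp the packet on entry; returns -1 if the ring is full. */
static int tf_enqueue(struct timed_fifo *q, int id, uint64_t now_ns)
{
	size_t tail;

	if (q->count == QLEN)
		return -1;
	tail = (q->head + q->count) % QLEN;
	q->ring[tail].id = id;
	q->ring[tail].enqueue_ns = now_ns;
	q->count++;
	return 0;
}

/*
 * On exit, compare the time spent queued against the limit. Stale
 * packets are counted and skipped; the first fresh packet id is
 * returned, or -1 if nothing usable is left.
 */
static int tf_dequeue(struct timed_fifo *q, uint64_t now_ns)
{
	while (q->count) {
		struct pkt *p = &q->ring[q->head];

		q->head = (q->head + 1) % QLEN;
		q->count--;
		if (now_ns - p->enqueue_ns <= q->max_sojourn_ns)
			return p->id;
		q->dropped++;
	}
	return -1;
}
```

Dropping at dequeue rather than at enqueue means the limit tracks how long packets actually waited, which is what matters when the drain rate to a station varies.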

Another crazy piece of the idea (courtesy nbd - I'd rather go crazy
adding fields to the skb) is to wedge that id and some minstrel
statistics and completion rates and the timestamp into each skb's
mostly unused 48 byte 'reserved for special uses' field... which it has
to do under rcu lock anyway.

I started coding up time based queue limits the other weekend,
actually... some of this has been discussed on the bloat list.

>
> johannes
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
Thread overview: 12+ messages
2011-11-10 13:58 creating netdev queues on the fly? Johannes Berg
2011-11-10 14:35 ` Eric Dumazet
2011-11-10 16:26 ` Johannes Berg
[not found] ` <1320942369.3967.127.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
2011-11-10 17:00 ` Eric Dumazet
[not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
2011-11-10 14:40 ` Helmut Schaa
[not found] ` <CAGXE3d-_RFgW_zwfX2vTBe1psXmgoBFO5pd5cAgtYo=Jwpddhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-10 14:55 ` Denys Fedoryshchenko
2011-11-10 15:25 ` Dave Taht
[not found] ` <CAA93jw7ECQegWj6rpd48sbmDQUjorCYzXANXJSj0baHtxzC7EA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-11 11:02 ` Eric Dumazet
2011-11-11 11:42 ` Dave Taht
[not found] ` <CAA93jw7n1jYiWrnHOF0Zmzd0cVtadNhPSCpP5YqEdq_Q9opw5A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-11 11:54 ` Eric Dumazet
2011-11-12 9:45 ` Eric Dumazet
2011-11-10 14:47 ` Dave Taht