creating netdev queues on the fly?

Netdev List
 help / color / mirror / Atom feed

* creating netdev queues on the fly?
@ 2011-11-10 13:58 Johannes Berg
  2011-11-10 14:35 ` Eric Dumazet
       [not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Johannes Berg @ 2011-11-10 13:58 UTC (permalink / raw)
  To: netdev; +Cc: linux-wireless

Hi,

I've been thinking about how we manage TX queues in wifi and right now
we just split things up by access category for QoS purposes.

However we have the issue that we might be pushing data to stations with
completely different speeds. Onn wired, where our outgoing speed is
essentially constant and some router/switch has to drop packets for the
slow link:

machine A === 1000mbps link ==== [switch] === 1000mbps === machine B
                                     |
                                     +--- 100mbps link --- machine C

But on wireless we really transmit to slow stations only with a slow
speed, so our outgoing speed differs. I think the scenario is quite
different, also because the speed can vary obviously.

So to get to my question: What if we could create netdev queues on the
fly?

The reason to do that is that we really don't want to reserve some 8000
queues just because somebody could possibly try to create 2000
connections (2007 is the theoretical max due to protocol restrictions)
to the AP interface. We also don't really want to create a netdev for
each peer (though you could implement it that way today).

I looked at this and it doesn't seem terrible. Creating & destroying the
queues might be tricky though. I think ndo_select_queue might return the
queue pointer instead of an index, and then that queue could be used.
The normal queues would still be in an array, with maybe a linked list
of extra queues that were dynamically created. Obviously the driver
would have to be able to manage that.

Ultimately, all the frames will of course end up on the same four
hardware queues again. But this would some better management, and piled
up traffic to one station that suddenly dies wouldn't impact performance
for all others as badly as it does today since we wouldn't let all those
frames pile up on the hardware queues, they'd only get there with some
mechanism that might take airtime into account.

I think this might also make implementing reservation (tspec) easier.
Not sure if anyone wants/needs that though.

Am I completely crazy?

johannes

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
  2011-11-10 13:58 creating netdev queues on the fly? Johannes Berg
@ 2011-11-10 14:35 ` Eric Dumazet
  2011-11-10 16:26   ` Johannes Berg
       [not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2011-11-10 14:35 UTC (permalink / raw)
  To: Johannes Berg; +Cc: netdev, linux-wireless

Le jeudi 10 novembre 2011 à 14:58 +0100, Johannes Berg a écrit :
> Hi,
> 
> I've been thinking about how we manage TX queues in wifi and right now
> we just split things up by access category for QoS purposes.
> 
> However we have the issue that we might be pushing data to stations with
> completely different speeds. Onn wired, where our outgoing speed is
> essentially constant and some router/switch has to drop packets for the
> slow link:
> 
> machine A === 1000mbps link ==== [switch] === 1000mbps === machine B
>                                      |
>                                      +--- 100mbps link --- machine C
> 
> But on wireless we really transmit to slow stations only with a slow
> speed, so our outgoing speed differs. I think the scenario is quite
> different, also because the speed can vary obviously.
> 
> So to get to my question: What if we could create netdev queues on the
> fly?
> 
> The reason to do that is that we really don't want to reserve some 8000
> queues just because somebody could possibly try to create 2000
> connections (2007 is the theoretical max due to protocol restrictions)
> to the AP interface. We also don't really want to create a netdev for
> each peer (though you could implement it that way today).
> 
> I looked at this and it doesn't seem terrible. Creating & destroying the
> queues might be tricky though. I think ndo_select_queue might return the
> queue pointer instead of an index, and then that queue could be used.
> The normal queues would still be in an array, with maybe a linked list
> of extra queues that were dynamically created. Obviously the driver
> would have to be able to manage that.
> 
> Ultimately, all the frames will of course end up on the same four
> hardware queues again. But this would some better management, and piled
> up traffic to one station that suddenly dies wouldn't impact performance
> for all others as badly as it does today since we wouldn't let all those
> frames pile up on the hardware queues, they'd only get there with some
> mechanism that might take airtime into account.
> 
> I think this might also make implementing reservation (tspec) easier.
> Not sure if anyone wants/needs that though.
> 
> 
> Am I completely crazy?
> 

In term of qdisc management I believe its a bit complex if we start to
dynamically add netdev queues :)

My first idea would be to extend Qdisc management so that a device can
callback qdisc when a frame is finaly delivered / consumed / discarded.

We currently only have qdisc->enqueue() and qdisc->dequeue(), we could
add qdisc->deliver_callback(skb)

You keep devices as they are, with a netdevqueue per hardware queue.

Then, using a Qdisc like existing ones, but with a limit of
outstanding(given to device but not yet consumed) packets per class.

external tc classifier would deliver a hash/index depending on remote
station.

As a bonus you can get all the existing rate estimators / QOS /
shapers ...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
       [not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
@ 2011-11-10 14:40   ` Helmut Schaa
       [not found]     ` <CAGXE3d-_RFgW_zwfX2vTBe1psXmgoBFO5pd5cAgtYo=Jwpddhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-11-10 14:47   ` Dave Taht
  1 sibling, 1 reply; 12+ messages in thread
From: Helmut Schaa @ 2011-11-10 14:40 UTC (permalink / raw)
  To: Johannes Berg; +Cc: netdev, linux-wireless

Hi,

On Thu, Nov 10, 2011 at 2:58 PM, Johannes Berg
<johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> wrote:
> But on wireless we really transmit to slow stations only with a slow
> speed, so our outgoing speed differs. I think the scenario is quite
> different, also because the speed can vary obviously.
>
> So to get to my question: What if we could create netdev queues on the
> fly?
>
> The reason to do that is that we really don't want to reserve some 8000
> queues just because somebody could possibly try to create 2000
> connections (2007 is the theoretical max due to protocol restrictions)
> to the AP interface. We also don't really want to create a netdev for
> each peer (though you could implement it that way today).
>
> I looked at this and it doesn't seem terrible. Creating & destroying the
> queues might be tricky though. I think ndo_select_queue might return the
> queue pointer instead of an index, and then that queue could be used.
> The normal queues would still be in an array, with maybe a linked list
> of extra queues that were dynamically created. Obviously the driver
> would have to be able to manage that.
>
> Ultimately, all the frames will of course end up on the same four
> hardware queues again. But this would some better management, and piled
> up traffic to one station that suddenly dies wouldn't impact performance
> for all others as badly as it does today since we wouldn't let all those
> frames pile up on the hardware queues, they'd only get there with some
> mechanism that might take airtime into account.
>
> I think this might also make implementing reservation (tspec) easier.
> Not sure if anyone wants/needs that though.

Wouldn't it be possible to implement something like this as a qdisc on top of
mq that makes use of the current tx rate per station to distribute the airtime
equitably?

Of course this would require the qdisc to know the tx rate a priori but for
mac80211 drivers we could just use last_tx_rate as an estimate ...

Helmut
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
       [not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
  2011-11-10 14:40   ` Helmut Schaa
@ 2011-11-10 14:47   ` Dave Taht
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Taht @ 2011-11-10 14:47 UTC (permalink / raw)
  To: Johannes Berg; +Cc: netdev, linux-wireless

On Thu, Nov 10, 2011 at 2:58 PM, Johannes Berg
<johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> wrote:
> Hi,

>
> Am I completely crazy?

Somewhat. :)

Much of your thinking aligns with mine, however my goal was to try and
reduce latencies on wireless-n, where we send variable size truckloads
of packets to each destination.

Solving that one is hard, and requires two levels of active queue
management in the packet scheduler layer, and a bit more communication
up from the driver itself.

We could have a unique 'station identifier' which fits handily into 32
bits as the max allowed range is 0-2008 and map from MAC to that on tx
entry. Having that as a flow classifier lets us have per station
destination queues easily split up via a std tc filter... and then  we
have the ability, finally, to manage queue depth on a per station
basis.

So you end up with four queues, each tied to a hardware queue that
then splits things up on a per station basis, fair queues within the
queues to each station, and recombines them at the end on a basis
bursty enough to aggregate as they exit the radio.

I don't mind at all up to 8000 queues, honestly, wasting 99% on mostly
unused queue structures via pouring megabytes into useless
bufferbloated FIFO only packet buffers seems an acceptible compromise,
but I'm easy...

As for managing queue depth on a per station basis, some of what has
been discussed on "byte queue limits" applies, but given wireless's
peculiarites, tsf timestamping on entry to the first qdisc, doing fair
queuing inside the per-sta queue (QFQ?), and checking the timestamp on
exit from the queue against a sane limit for the queue type would do
wonders for overall latencies and network responsiveness.

Done right, instead of seeing a single tcp stream capable of inducing
multi-second latencies for the next stream, latencies would stay flat
up unto the max aggregation depth of different streams on a given sta,
subject only the how many other competing stations there are, the net
effect of packet loss would be vastly lessened, and world peace,
achieved. I dream of 2ms pings and dns lookups, even gaming, under
load, on wireless. I do.

First steps are getting a station identifier and some useful
statistics regarding that stations max (that quantum) packet bundle
size, and completion rate, mostly from minstrel... on each packet...

a tc classifier that can use it to toss into the tcindex mechanism (if
that is what is used), another sane classifier for something like QFQ
per station, and a packet 'grouper' that can output correctly sized
bursts of packets on a sane basis from the queues in a randomly sane
order (not round robin per se', to even out the load it has to start
dequeuing groups)

How to do all that within tc? Well... I like the idea of throwing out
the 32 bitness of tc's calssifiers (mac hashing and ipv6 hashing is
not very effective), but I doubt that will fly..

So to fit into the the existing structures the idea of adding the
concept of a  tc qdisc 'grouper'  along with all the other tc filter
'splitters' - that could be multiqueue and multiple hardware queue
aware - seems like an answer.

Another crazy piece of the idea (courtesy nbd - I'd rather go crazy
adding fields to the skb) is to wedge that id and some minstrel
statistics and completion rates and the timestamp into each skb's
mostly unused 48 byte 'reserved for special uses field... which it has
to do under rcu lock anyway.

I started coding up time based queue limits the other weekend,
actually... some of this has been discussed on the bloat list.

>
>
> johannes
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
       [not found]     ` <CAGXE3d-_RFgW_zwfX2vTBe1psXmgoBFO5pd5cAgtYo=Jwpddhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-10 14:55       ` Denys Fedoryshchenko
  2011-11-10 15:25         ` Dave Taht
  0 siblings, 1 reply; 12+ messages in thread
From: Denys Fedoryshchenko @ 2011-11-10 14:55 UTC (permalink / raw)
  To: Helmut Schaa; +Cc: Johannes Berg, netdev, linux-wireless

 On Thu, 10 Nov 2011 15:40:01 +0100, Helmut Schaa wrote:
>>
>> I think this might also make implementing reservation (tspec) 
>> easier.
>> Not sure if anyone wants/needs that though.
>
> Wouldn't it be possible to implement something like this as a qdisc 
> on top of
> mq that makes use of the current tx rate per station to distribute
> the airtime
> equitably?
>
> Of course this would require the qdisc to know the tx rate a priori 
> but for
> mac80211 drivers we could just use last_tx_rate as an estimate ...
>
> Helmut
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

 Maybe someone will make something like "tfifo" in future :)
 And when clients are connected, each have his own queue.

 then, for example qdisc add dev wlan0 parent 1:10 handle 10 tfifo limit 
 100ms
 If packet are older than 100ms will be dropped, or new packets are not 
 added, if
 there is packet older than 100ms are not sent yet.

 I am not sure that bandwidth will be distributed fairly, it is 
 different question,
 probably each queue should have some "limited chunk of time" to send 
 data.
 And again, 802.11a/b/g at least are half-duplex and CSMA, and without 
 polling/TDMA or CTS/RTS tricks
 it will be complicated to give guaranteed chunks of time.

 P.S. That's just a dream :)

 ---
 Network engineer
 Denys Fedoryshchenko

 P.O.Box 41553 Jeddah 21531
 Tel:   920023422
 Fax:  +966 26501784
 E-Mail: denys-ArQk2d8GGkZT5gTzvV8LJA@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
  2011-11-10 14:55       ` Denys Fedoryshchenko
@ 2011-11-10 15:25         ` Dave Taht
       [not found]           ` <CAA93jw7ECQegWj6rpd48sbmDQUjorCYzXANXJSj0baHtxzC7EA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Taht @ 2011-11-10 15:25 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Helmut Schaa, Johannes Berg, netdev, linux-wireless

On Thu, Nov 10, 2011 at 3:55 PM, Denys Fedoryshchenko <denys@wbc.net.sa> wrote:
> On Thu, 10 Nov 2011 15:40:01 +0100, Helmut Schaa wrote:
>>>
>>> I think this might also make implementing reservation (tspec) easier.
>>> Not sure if anyone wants/needs that though.
>>
>> Wouldn't it be possible to implement something like this as a qdisc on top
>> of
>> mq that makes use of the current tx rate per station to distribute
>> the airtime
>> equitably?
>>
>> Of course this would require the qdisc to know the tx rate a priori but
>> for
>> mac80211 drivers we could just use last_tx_rate as an estimate ...

Need last_aggregation_depth bytes and packets and last feasible
versions of those for an estimate with n. completion rate, perhaps...

>>
>> Helmut
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> Maybe someone will make something like "tfifo" in future :)

Did some of it last weekend. Code doesn't work yet - I mixed in too
many of the ideas mentioned in
my earlier, longer, more difficult to read missive. Time based queue
limits are an important part of the answer, however.

Our current fixed packet queue limits are merely a proxy for time. At
100Mbit ethernet, they were 100ms - which is conus distances...

 and I'd hope that someone more knowledgable than I at these layers
take all this on...

Two notes:

1) Getting 'time' from the kernel is expensive. And: System time wanders.

mac80211 Wifi devices however do export a get_tsf function which could
be used as a relative-to-the-queue clock - and we actually don't need
accuracy down to the level of get_tsf (25ns)

2) You need to get a timestamp on entry to the first queue and check
against the allowable latency on exit from the last. So to construct a
tc chain you'd want a tfifo (timestamp on entry), fifot (check
timestamp against limit on dequeue), and for the simplest of
applications : tfifot (timestamp on entry, check on exit)

> And when clients are connected, each have his own queue.

Needed, yes. Particularly with mixed g/n, but also to optimize
aggregation AND fairness.

>
> then, for example qdisc add dev wlan0 parent 1:10 handle 10 tfifo limit
> 100ms
> If packet are older than 100ms will be dropped, or new packets are not
> added, if
> there is packet older than 100ms are not sent yet.

Aside from me calling it 'tfifot last weekend, yes. Though I prefer
using 'timelimit' as the key name, to distinguish between previous
overloaded uses of the word limit.

>
> I am not sure that bandwidth will be distributed fairly, it is different
> question,
> probably each queue should have some "limited chunk of time" to send data.

Round robin with random starting points for each cycle through the
stations in what I was calling a 'grouper' in the previous mail.

> And again, 802.11a/b/g at least are half-duplex and CSMA, and without
> polling/TDMA or CTS/RTS tricks
> it will be complicated to give guaranteed chunks of time.

I *think* but am not sure that what I have described interacts with
the EDCA reasonably well.

>
> P.S. That's just a dream :)
>
> ---
> Network engineer
> Denys Fedoryshchenko
>
> P.O.Box 41553 Jeddah 21531
> Tel:   920023422
> Fax:  +966 26501784
> E-Mail: denys@wbc.net.sa
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
  2011-11-10 14:35 ` Eric Dumazet
@ 2011-11-10 16:26   ` Johannes Berg
       [not found]     ` <1320942369.3967.127.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Berg @ 2011-11-10 16:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-wireless

On Thu, 2011-11-10 at 15:35 +0100, Eric Dumazet wrote:

> > So to get to my question: What if we could create netdev queues on the
> > fly?

> In term of qdisc management I believe its a bit complex if we start to
> dynamically add netdev queues :)

Yeah I was thinking that would be the problematic part.

> My first idea would be to extend Qdisc management so that a device can
> callback qdisc when a frame is finaly delivered / consumed / discarded.
> 
> We currently only have qdisc->enqueue() and qdisc->dequeue(), we could
> add qdisc->deliver_callback(skb)
> 
> You keep devices as they are, with a netdevqueue per hardware queue.
> 
> Then, using a Qdisc like existing ones, but with a limit of
> outstanding(given to device but not yet consumed) packets per class.
> 
> external tc classifier would deliver a hash/index depending on remote
> station.

So basically you'd have a class for each station? Or a class for each
station/AC? That's a lot of classes rather than queues, are they
pre-allocated, or does it not matter? Overall that sounds like a good
approach.

> As a bonus you can get all the existing rate estimators / QOS /
> shapers ...

Yeah ... I guess I need to start learning tc :)

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
       [not found]     ` <1320942369.3967.127.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
@ 2011-11-10 17:00       ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-11-10 17:00 UTC (permalink / raw)
  To: Johannes Berg; +Cc: netdev, linux-wireless

Le jeudi 10 novembre 2011 à 17:26 +0100, Johannes Berg a écrit :
> On Thu, 2011-11-10 at 15:35 +0100, Eric Dumazet wrote:
> 
> > > So to get to my question: What if we could create netdev queues on the
> > > fly?
> 
> > In term of qdisc management I believe its a bit complex if we start to
> > dynamically add netdev queues :)
> 
> Yeah I was thinking that would be the problematic part.
> 
> > My first idea would be to extend Qdisc management so that a device can
> > callback qdisc when a frame is finaly delivered / consumed / discarded.
> > 
> > We currently only have qdisc->enqueue() and qdisc->dequeue(), we could
> > add qdisc->deliver_callback(skb)
> > 
> > You keep devices as they are, with a netdevqueue per hardware queue.
> > 
> > Then, using a Qdisc like existing ones, but with a limit of
> > outstanding(given to device but not yet consumed) packets per class.
> > 
> > external tc classifier would deliver a hash/index depending on remote
> > station.
> 
> So basically you'd have a class for each station? Or a class for each
> station/AC? That's a lot of classes rather than queues, are they
> pre-allocated, or does it not matter? Overall that sounds like a good
> approach.

They could be statically allocated in a stochastic fashion (Sorry, SFQ
is one of my favorite qdisc, very light in memory/cpu and yet powerful)

net/sched/sch_sfq.c

Of course, HFSC / DRR are considered more flexible, but eat _much_ more
memory.

http://people.netfilter.org/kaber/shaping

You declare a hash divisor of 1024 to map one station to one class
(Might have hash collision of course...)

Each active slot contains a local queue to one remote station.
Each queue can be pfifo or whatever smart qdisc (sfq, or a more flexible
qdisc)

Then you use a filter to select the appropriate class :

tc filter add dev ${dev} protocol all pref 1 parent $handle handle 1 \
	flow hash keys remote_station divisor 1024



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
       [not found]           ` <CAA93jw7ECQegWj6rpd48sbmDQUjorCYzXANXJSj0baHtxzC7EA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-11 11:02             ` Eric Dumazet
  2011-11-11 11:42               ` Dave Taht
  2011-11-12  9:45               ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-11-11 11:02 UTC (permalink / raw)
  To: Dave Taht
  Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev,
	linux-wireless

Le jeudi 10 novembre 2011 à 16:25 +0100, Dave Taht a écrit :



> Two notes:
> 
> 1) Getting 'time' from the kernel is expensive. And: System time wanders.
> 
> mac80211 Wifi devices however do export a get_tsf function which could
> be used as a relative-to-the-queue clock - and we actually don't need
> accuracy down to the level of get_tsf (25ns)
> 


getting 'time' from kernel is not that expensive, depending on
resolution you need. We are not going to use timestamps from devices !

If ms resolution is enough (say you want to drop packets if they stay
more than 100ms in qdisc), jiffies is a single memory read.

If needing us or ns resolution, psched_get_time() uses ktime_get(), and
is used in CBQ, HTB, HFSC, TBF, so if it was expensive we would have big
problem right now :)

2) You need to get a timestamp on entry to the first queue and check
> against the allowable latency on exit from the last. So to construct a
> tc chain you'd want a tfifo (timestamp on entry), fifot (check
> timestamp against limit on dequeue), and for the simplest of
> applications : tfifot (timestamp on entry, check on exit)

That would be not very practical.

I would see a new Qdisc/Class property, like the rate estimator, that we
can attach to any Qdisc/Class with a new tc option.

Even without any limit enforcing (might be Random Early Detection by the
way), it could be used to get a Queue Delay estimation, using EWMA

avqdelay = avqdelay*(1-W) + qdelay*W;
W = 2^(-ewma_log);

tc [ qdisc | class] add [...] [est 1sec 8sec] [delayest ewma_log ] ..

tc -s -d qdisc ...
qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
 Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0) 
 rate 2557Kbit 215pps backlog 0b 0p requeues 0 
 delay 91ms



tc [ qdisc | class] add [...] [est 1sec 8sec] [delaylimit max ] ..

tc -s -d qdisc ...
qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
 Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0) 
 rate 2557Kbit 215pps backlog 0b 0p requeues 0 
 delay 91ms delaylimit 100ms (dropped 12)



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
  2011-11-11 11:02             ` Eric Dumazet
@ 2011-11-11 11:42               ` Dave Taht
       [not found]                 ` <CAA93jw7n1jYiWrnHOF0Zmzd0cVtadNhPSCpP5YqEdq_Q9opw5A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-11-12  9:45               ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Taht @ 2011-11-11 11:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev,
	linux-wireless, Andrew McGregor

On Fri, Nov 11, 2011 at 12:02 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Le jeudi 10 novembre 2011 à 16:25 +0100, Dave Taht a écrit :
>
>
>
>> Two notes:
>>
>> 1) Getting 'time' from the kernel is expensive. And: System time wanders.
>>
>> mac80211 Wifi devices however do export a get_tsf function which could
>> be used as a relative-to-the-queue clock - and we actually don't need
>> accuracy down to the level of get_tsf (25ns)
>>
>
>
> getting 'time' from kernel is not that expensive, depending on
> resolution you need. We are not going to use timestamps from devices !

Wireless keeps track of time via a TSF function.

http://en.wikipedia.org/wiki/Timing_Synchronization_Function_%28TSF%29

TSF wanders forward based on an election process from broadcast beacons,
the presentation below contains animations and the like that describe
how it works

http://www.cse.ohio-state.edu/~lai/788-Au05/2-scalibility.ppt

Further it copes with life in terms of TU. One tu is 1024us.

Now, whether any of that is truly relevant to how to do packet
scheduling well on wireless, is a good question. TU is fairly
fundamental to the minstrel algorithm. It bothers me to have anything
resembling a beat frequency.

>
> If ms resolution is enough (say you want to drop packets if they stay
> more than 100ms in qdisc), jiffies is a single memory read.

The intended properties of the hardware queues are 10 ms VO, 100 ms
VI, and some_sane_number_relative_to_real_bandwidth for BE and BK.

The software queues should make similar assumptions, although for
voice traffic, being able to drop and 'pull forward' the next packet
in a stream should it exceed time in queue would be good, and you
could go with 30 ms in queue with that level of jitter.

>
> If needing us or ns resolution, psched_get_time() uses ktime_get(), and
> is used in CBQ, HTB, HFSC, TBF, so if it was expensive we would have big
> problem right now :)
>
> 2) You need to get a timestamp on entry to the first queue and check
>> against the allowable latency on exit from the last. So to construct a
>> tc chain you'd want a tfifo (timestamp on entry), fifot (check
>> timestamp against limit on dequeue), and for the simplest of
>> applications : tfifot (timestamp on entry, check on exit)
>
> That would be not very practical.
>
> I would see a new Qdisc/Class property, like the rate estimator, that we
> can attach to any Qdisc/Class with a new tc option.
>
> Even without any limit enforcing (might be Random Early Detection by the
> way), it could be used to get a Queue Delay estimation, using EWMA
>
> avqdelay = avqdelay*(1-W) + qdelay*W;
> W = 2^(-ewma_log);
>
> tc [ qdisc | class] add [...] [est 1sec 8sec] [delayest ewma_log ] ..

More elegant to be sure. But for N you kind of need to do ewma between
aggregated transmission groups.

>
> tc -s -d qdisc ...
> qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
>  Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0)
>  rate 2557Kbit 215pps backlog 0b 0p requeues 0
>  delay 91ms

I still don't 'get' how we can split out a stream based on a
stationid, toss it in a queue to be further scheduled (my choice would
be QFQ, btw), and then sanely de-schedule a burst of packets for that
destination appropriate for that station's aggregation level and
transmit rate using existing tc methods.

I liked the callback idea discussed earlier for implementing a
'grouper' of this sort.

That said I'm strongly encouraged by the dialog thus far on this thread.

>
>
>
> tc [ qdisc | class] add [...] [est 1sec 8sec] [delaylimit max ] ..
>
> tc -s -d qdisc ...
> qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
>  Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0)
>  rate 2557Kbit 215pps backlog 0b 0p requeues 0
>  delay 91ms delaylimit 100ms (dropped 12)

and I assume that you are either making this syntax up or coding
faster than emailing in some tree somewhere...

>
>
>
>



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
       [not found]                 ` <CAA93jw7n1jYiWrnHOF0Zmzd0cVtadNhPSCpP5YqEdq_Q9opw5A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-11 11:54                   ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-11-11 11:54 UTC (permalink / raw)
  To: Dave Taht
  Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev,
	linux-wireless, Andrew McGregor

Le vendredi 11 novembre 2011 à 12:42 +0100, Dave Taht a écrit :

> More elegant to be sure. But for N you kind of need to do ewma between
> aggregated transmission groups.
> 

I speak at Class/Qdisc level. If each transmission group has its own
class [ I believe it should ], it should fit.


> I still don't 'get' how we can split out a stream based on a
> stationid, toss it in a queue to be further scheduled (my choice would
> be QFQ, btw), and then sanely de-schedule a burst of packets for that
> destination appropriate for that station's aggregation level and
> transmit rate using existing tc methods.
> 

Right now its not possible since we dont have a feedback once a packet
is dequeued from qdisc. But it should be doable.

> I liked the callback idea discussed earlier for implementing a
> 'grouper' of this sort.
> 
> That said I'm strongly encouraged by the dialog thus far on this thread.
> 

...

> and I assume that you are either making this syntax up or coding
> faster than emailing in some tree somewhere...
> 

That because I prefer discussing on this before starting coding once
general idea is accepted.

Note this delay idea is not new, it already was mentioned on netdev some
months ago.



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: creating netdev queues on the fly?
  2011-11-11 11:02             ` Eric Dumazet
  2011-11-11 11:42               ` Dave Taht
@ 2011-11-12  9:45               ` Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2011-11-12  9:45 UTC (permalink / raw)
  To: Dave Taht
  Cc: Denys Fedoryshchenko, Helmut Schaa, Johannes Berg, netdev,
	linux-wireless

Le vendredi 11 novembre 2011 à 12:02 +0100, Eric Dumazet a écrit :

> I would see a new Qdisc/Class property, like the rate estimator, that we
> can attach to any Qdisc/Class with a new tc option.
> 
> Even without any limit enforcing (might be Random Early Detection by the
> way), it could be used to get a Queue Delay estimation, using EWMA
> 
> avqdelay = avqdelay*(1-W) + qdelay*W;
> W = 2^(-ewma_log);
> 
> tc [ qdisc | class] add [...] [est 1sec 8sec] [delayest ewma_log ] ..
> 
> tc -s -d qdisc ...
> qdisc htb 1: root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
>  Sent 3596219 bytes 2567 pkt (dropped 238, overlimits 3797 requeues 0) 
>  rate 2557Kbit 215pps backlog 0b 0p requeues 0 
>  delay 91ms
> 
> 


I coded the thing (delayest at qdisc level) and got interesting values,
for example with following HTB setup :

DEV=eth3
MTU=1500
rate=10mbit
EST="est 1sec 8sec delayest 6"

tc qdisc del dev $DEV root
tc qdisc add dev $DEV root handle 1: ${EST} \
	htb default 1 
tc class add dev $DEV parent 1: classid 1:1 htb \
	rate ${rate} mtu 40000 quantum 80000
tc qdisc add dev $DEV parent 1:1 handle 10: ${EST} pfifo limit 3


With light trafic on my x86_64 machine I have :

# tcnew -s -d qdisc show dev eth3
qdisc htb 1: root refcnt 17 r2q 10 default 1 direct_packets_stat 0 ver 3.17
 Sent 78216 bytes 569 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 9040bit 13pps backlog 0b 0p requeues 0 
 delay 1126 ns log 6
qdisc pfifo 10: parent 1:1 limit 3p
 Sent 78216 bytes 569 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 9040bit 13pps backlog 0b 0p requeues 0 
 delay 731 ns log 6

Wow, 1126ns of overhead per packet...

This is the prototype kernel patch I used on top of net-next :

diff --git a/include/linux/gen_stats.h b/include/linux/gen_stats.h
index 552c8a0..5ad57a6 100644
--- a/include/linux/gen_stats.h
+++ b/include/linux/gen_stats.h
@@ -63,5 +63,16 @@ struct gnet_estimator {
 	unsigned char	ewma_log;
 };
 
+/**
+ * struct gnet_qdelay - queue delay configuration / reports
+ * @avdelay: average queue delay in ns
+ * @limit: packets delayed more than this value are dropped
+ * @avdelaylog: the log of measurement window weight
+ */
+struct gnet_qdelay {
+	__u64 avdelay;
+	__u64 limit;
+	__u32 avdelaylog;
+};
 
 #endif /* __LINUX_GEN_STATS_H */
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 8e872ea..61c66ec 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -484,6 +484,7 @@ enum {
 	TCA_FCNT,
 	TCA_STATS2,
 	TCA_STAB,
+	TCA_QDELAY,
 	__TCA_MAX
 };
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f6bb08b..e293228 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -42,6 +42,13 @@ struct qdisc_size_table {
 	u16			data[];
 };
 
+/*
+ * qdisc/class avdelay is computed using EWMA, with a fixed factor of 16
+ * Only the weight is a parameter (avdelaylog)
+ * With u64 values, this leaves 48 bits, a max of 281474 seconds.
+ */
+#define TCQ_AVDELAY_FACTOR 16
+
 struct Qdisc {
 	int 			(*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
 	struct sk_buff *	(*dequeue)(struct Qdisc *dev);
@@ -50,8 +57,11 @@ struct Qdisc {
 #define TCQ_F_INGRESS		2
 #define TCQ_F_CAN_BYPASS	4
 #define TCQ_F_MQROOT		8
+#define TCQ_F_QDELAY		0x10
 #define TCQ_F_WARN_NONWC	(1 << 16)
-	int			padded;
+	u8			padded;
+	u8			avdelaylog; /* avdelay EWMA weight */
+	u8			_pad[2];
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
 	struct list_head	list;
@@ -80,6 +90,10 @@ struct Qdisc {
 	struct gnet_stats_basic_packed bstats;
 	unsigned int		__state;
 	struct gnet_stats_queue	qstats;
+
+	/* average queue delay in ns << TCQ_AVDELAY_FACTOR */
+	u64			avdelay;
+
 	struct rcu_head		rcu_head;
 	spinlock_t		busylock;
 	u32			limit;
@@ -219,6 +233,9 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
+#ifdef CONFIG_NET_SCHED_QDELAY
+	ktime_t			enqueue_time;
+#endif
 	unsigned int		pkt_len;
 	long			data[];
 };
@@ -467,6 +484,14 @@ static inline void qdisc_bstats_update(struct Qdisc *sch,
 				       const struct sk_buff *skb)
 {
 	bstats_update(&sch->bstats, skb);
+#ifdef CONFIG_NET_SCHED_QDELAY
+	if (sch->flags & TCQ_F_QDELAY) {
+		u64 delay = ktime_to_ns(ktime_sub(ktime_get(),
+					qdisc_skb_cb(skb)->enqueue_time));
+		delay <<= TCQ_AVDELAY_FACTOR;
+		sch->avdelay += (delay >> sch->avdelaylog) - (sch->avdelay >> sch->avdelaylog);
+	}
+#endif
 }
 
 static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
diff --git a/net/core/dev.c b/net/core/dev.c
index 6ba50a1..587534d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2402,6 +2402,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	int rc;
 
 	qdisc_skb_cb(skb)->pkt_len = skb->len;
+
+#ifdef CONFIG_NET_SCHED_QDELAY
+	qdisc_skb_cb(skb)->enqueue_time.tv64 = 0;
+	if (q->flags & TCQ_F_QDELAY)
+		qdisc_skb_cb(skb)->enqueue_time = ktime_get();
+#endif
+
 	qdisc_calculate_pkt_len(skb, q);
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 2590e91..028f882 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -470,6 +470,17 @@ config NET_CLS_ACT
 	  A recent version of the iproute2 package is required to use
 	  extended matches.
 
+config NET_SCHED_QDELAY
+	bool "QDISC/CLASS queue delay Estimators and Limits"
+	---help---
+	  Say Y here if you want to be able to track queue delays, and
+	  be able to drop packets if they stay in a queue a too long time.
+	  It adds some overhead per packet, since it needs to get precise
+	  time at enqueue and dequeue time.
+
+	  A recent version of the iproute2 package is required to use
+	  extended matches.
+
 config NET_ACT_POLICE
 	tristate "Traffic Policing"
         depends on NET_CLS_ACT 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index dca6c1a..212fba9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -842,6 +842,17 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 			}
 			rcu_assign_pointer(sch->stab, stab);
 		}
+		if (tca[TCA_QDELAY]) {
+			struct gnet_qdelay *parm = nla_data(tca[TCA_QDELAY]);
+			err = -EINVAL;
+			if (nla_len(tca[TCA_QDELAY]) < sizeof(*parm))
+				goto err_out4;
+			sch->avdelaylog = parm->avdelaylog;
+			sch->flags |= TCQ_F_QDELAY;
+		} else { /* temporary testing */
+			sch->avdelaylog = 6;
+			sch->flags |= TCQ_F_QDELAY;
+		}
 		if (tca[TCA_RATE]) {
 			spinlock_t *root_lock;
 
@@ -1206,6 +1217,16 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 	if (stab && qdisc_dump_stab(skb, stab) < 0)
 		goto nla_put_failure;
 
+#ifdef CONFIG_NET_SCHED_QDELAY
+	if (q->flags & TCQ_F_QDELAY) {
+		struct gnet_qdelay qdelay;
+
+		memset(&qdelay, 0, sizeof(qdelay));	
+		qdelay.avdelay = q->avdelay >> TCQ_AVDELAY_FACTOR;
+		qdelay.avdelaylog = q->avdelaylog;
+		NLA_PUT(skb, TCA_QDELAY, sizeof(qdelay), &qdelay);
+	}
+#endif
 	if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, TCA_XSTATS,
 					 qdisc_root_sleeping_lock(q), &d) < 0)
 		goto nla_put_failure;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-11-12  9:45 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-10 13:58 creating netdev queues on the fly? Johannes Berg
2011-11-10 14:35 ` Eric Dumazet
2011-11-10 16:26   ` Johannes Berg
     [not found]     ` <1320942369.3967.127.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
2011-11-10 17:00       ` Eric Dumazet
     [not found] ` <1320933501.3967.68.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>
2011-11-10 14:40   ` Helmut Schaa
     [not found]     ` <CAGXE3d-_RFgW_zwfX2vTBe1psXmgoBFO5pd5cAgtYo=Jwpddhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-10 14:55       ` Denys Fedoryshchenko
2011-11-10 15:25         ` Dave Taht
     [not found]           ` <CAA93jw7ECQegWj6rpd48sbmDQUjorCYzXANXJSj0baHtxzC7EA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-11 11:02             ` Eric Dumazet
2011-11-11 11:42               ` Dave Taht
     [not found]                 ` <CAA93jw7n1jYiWrnHOF0Zmzd0cVtadNhPSCpP5YqEdq_Q9opw5A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-11-11 11:54                   ` Eric Dumazet
2011-11-12  9:45               ` Eric Dumazet
2011-11-10 14:47   ` Dave Taht

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox