netdev.vger.kernel.org archive mirror
* tc filter mask for ACK packets off?
@ 2012-01-01  2:30 John A. Sullivan III
  2012-01-03  7:31 ` Michal Kubeček
  0 siblings, 1 reply; 18+ messages in thread
From: John A. Sullivan III @ 2012-01-01  2:30 UTC (permalink / raw)
  To: netdev

Hello, all.  I've been noticing that virtually all the documentation
says we should prioritize ACK-only packets and that they can be
identified with match u8 0x10 0xff.  However, isn't the actual flags
field only 6 bits long, with the first two bits of that byte belonging
to the preceding reserved field?
If that is true, our filters will break unnecessarily whenever those
bits are set.  Shouldn't it be match u8 0x10 0x3f?  Then again, I'm
very new at this.  Thanks - John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-01  2:30 tc filter mask for ACK packets off? John A. Sullivan III
@ 2012-01-03  7:31 ` Michal Kubeček
  2012-01-03  9:36   ` Dave Taht
  2012-01-04  0:01   ` Michal Soltys
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Kubeček @ 2012-01-03  7:31 UTC (permalink / raw)
  To: netdev; +Cc: John A. Sullivan III

On Saturday 31 of December 2011 21:30, John A. Sullivan III wrote:
> Hello, all.  I've been noticing that virtually all the documentation
> says we should prioritize ACK-only packets and that they can be
> identified with match u8 0x10 0xff.  However, isn't the actual flags
> field only 6 bits long, with the first two bits of that byte belonging
> to the preceding reserved field?

It's even worse: those two bits are in fact used for ECN (RFC 3168), so
a 0xff mask fails to match pure ACKs whenever CWR or ECE is set.

> If that is true, our filters will break unnecessarily whenever those
> bits are set.  Shouldn't it be match u8 0x10 0x3f?

I think so.
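
As a sketch (device, classid, and the assumption of a 20-byte IP
header are placeholders), the corrected u32 filter might look like:

tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
	match ip protocol 6 0xff \
	match u8 0x05 0x0f at 0 \
	match u8 0x10 0x3f at 33 \
	flowid 1:10

The 0x05 0x0f match pins the IP header to 20 bytes so that offset 33
really is the TCP flags byte; the 0x3f mask then ignores the two ECN
bits.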

However, by an "ACK only" packet (worth prioritizing), I would rather
understand a packet with the ACK flag and no payload, not a packet with
ACK as the only flag.  For many TCP connections, all packets except the
initial SYN and SYN-ACK and the two FIN packets have ACK as the only
flag.  So my guess is you should rather prioritize all TCP packets with
no application-layer data.
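
Roughly, and again only as a sketch assuming a 20-byte IP header, that
could be approximated by additionally requiring a small IP total length
(under 64 bytes) while testing only the ACK bit:

tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
	match ip protocol 6 0xff \
	match u8 0x05 0x0f at 0 \
	match u16 0x0000 0xffc0 at 2 \
	match u8 0x10 0x10 at 33 \
	flowid 1:10

This is "no payload" only approximately - it passes any TCP segment
shorter than 64 bytes in total - but it catches bare ACKs even when
other flags or TCP options such as timestamps are present.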

                                                         Michal Kubecek

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-03  7:31 ` Michal Kubeček
@ 2012-01-03  9:36   ` Dave Taht
  2012-01-03 10:40     ` [RFC] SFQ planned changes Eric Dumazet
  2012-01-03 12:18     ` tc filter mask for ACK packets off? John A. Sullivan III
  2012-01-04  0:01   ` Michal Soltys
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Taht @ 2012-01-03  9:36 UTC (permalink / raw)
  To: Michal Kubeček; +Cc: netdev, John A. Sullivan III

On Tue, Jan 3, 2012 at 8:31 AM, Michal Kubeček <mkubecek@suse.cz> wrote:
> On Saturday 31 of December 2011 21:30, John A. Sullivan III wrote:
>> Hello, all.  I've been noticing that virtually all the documentation
>> says we should prioritize ACK-only packets and that they can be
>> identified with match u8 0x10 0xff.  However, isn't the actual flags

Most of that invalid documentation was derived from the original
'wondershaper' effort and became 'canon' elsewhere.

I'm hoping that we get a chance to correct the documentation
on the new wiki and remove the old, incorrect info from the web...

wshaper's (2001) assumptions were gradually invalidated over the
years.  It was a suitable shaper for a 200k-800k download link at the
time, when web sites were 70k in size, people still used things like
ssh heavily, men were men, and javascript was scarce....

Good follow-on work came from the esfq and adsl-shaper efforts, but
those publications have also been obsoleted by events.

ESFQ's core features were incorporated into sfq, and adsl-shaper sort
of made it in, generically.

The stories of those two shaping efforts are useful bits of history
worth reading about, to gain context about the problems they
were trying to solve.

>> field only 6 bits long, with the first two bits of that byte belonging
>> to the preceding reserved field?
>
> It's even worse: those two bits are in fact used for ECN (RFC 3168), so
> a 0xff mask fails to match pure ACKs whenever CWR or ECE is set.

I had submitted a patch to openwrt to fix this issue with wondershaper
a while back. I don't know if it got taken up or not...

and either way, wshaper's approach doesn't work well at
modern bandwidths.  The core idea (prioritizing small acks
somewhat) retains some value, but the implementation is
unworkable.

>> If that is true, our filters will break unnecessarily whenever those
>> bits are set.  Shouldn't it be match u8 0x10 0x3f?
>
> I think so.

Yes, the old-style 'canonical' filters break ECN, and have
been breaking it everywhere for a decade.

Also: most TCPs now carry timestamps, so the ack size is larger.

And alas... none of the shapers mentioned above do ipv6 properly.

There are innumerable other limitations... notably, prioritizing dns,
syn, and synack can also help (a couple of illustrative filters are
sketched below).  These shapers are unable to detect or prioritize
voip packets (sip or skype), either.
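
As sketches only (device and flowid are placeholders), dns and
connection setup could be lifted into the fast lane with:

tc filter add dev eth0 parent 1: protocol ip prio 9 u32 \
	match ip protocol 17 0xff \
	match ip dport 53 0xffff \
	flowid 1:10
tc filter add dev eth0 parent 1: protocol ip prio 9 u32 \
	match ip protocol 6 0xff \
	match u8 0x05 0x0f at 0 \
	match u8 0x02 0x02 at 33 \
	flowid 1:10

The second filter tests only the SYN bit, so it matches both syn and
synack (again assuming a 20-byte IP header).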

> However, by an "ACK only" packet (worth prioritizing), I would rather
> understand a packet with the ACK flag and no payload, not a packet with
> ACK as the only flag.  For many TCP connections, all packets except the
> initial SYN and SYN-ACK and the two FIN packets have ACK as the only
> flag.  So my guess is you should rather prioritize all TCP packets with
> no application-layer data.

No. :)

I'd go into more detail, but after what I hope are the final two
fixes to sfq and qfq land in the net-next kernel (after some more
testing), I like to think I have a more valid approach than this
in the works, but that too will require some more development
and testing.

http://www.teklibre.com/~d/bloat/pfifo_fast_vs_sfq_qfq_linear.png

If you are interested in seeing that work in progress:

git clone git://github.com/dtaht/deBloat.git

see the src/staqfq.lua script for a start at a general-purpose
new-age shaper...

and src/qmodels/*4mbit* for some prototypes of a 'soft bandwidth'
one.

(regrettably net-next + some patches is required at present)

I note that (as of yesterday) sfq is performing as well as qfq did
under most workloads, and is considerably simpler than qfq, but
what I have in mind for shaping in an asymmetric scenario
*may* involve 'weighting' - rather than strictly prioritizing -
small acks... and it may not - I'd like to be able to benchmark
the various AQM approaches against a variety of workloads
before declaring victory. Could use some help with all that....

-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC] SFQ planned changes
  2012-01-03  9:36   ` Dave Taht
@ 2012-01-03 10:40     ` Eric Dumazet
  2012-01-03 12:07       ` Dave Taht
  2012-01-03 12:18     ` tc filter mask for ACK packets off? John A. Sullivan III
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-01-03 10:40 UTC (permalink / raw)
  To: Dave Taht; +Cc: Michal Kubeček, netdev, John A. Sullivan III

Le mardi 03 janvier 2012 à 10:36 +0100, Dave Taht a écrit :

> I note that (as of yesterday) sfq is performing as well as qfq did
> under most workloads, and is considerably simpler than qfq, but
> what I have in mind for shaping in an asymmetric scenario
> *may* involve 'weighting' - rather than strictly prioritizing -
> small acks... and it may not - I'd like to be able to benchmark
> the various AQM approaches against a variety of workloads
> before declaring victory.


A QFQ setup with more than 1024 classes/qdisc is way too slow at init
time, and consumes ~384 bytes per class: ~12582912 bytes for 32768
classes.

We are also limited to 65536 qdiscs per device, so a QFQ setup using a
hash is limited to a 32768 divisor.


Now, SFQ as implemented in Linux is very limited, with at most 127
flows and a limit of 127 packets. [ So if 127 flows are active, we have
one packet per flow. ]

I plan to add the following features to SFQ:

- Ability to specify a per-flow limit.
     This is what is called the 'depth',
     currently hardcoded to min(127, limit)

- Ability to have up to 65535 flows (instead of 127)

- Ability to do head drop (dropping old packets from a flow)

Example of use: no more than 20 packets per flow, max 8000 flows, max
20000 packets in the SFQ qdisc, hash table of 65536 slots.

tc qdisc add ... sfq \
	flows 8000 \
	depth 20 \
	headdrop \
	limit 20000 divisor 65536

RAM usage: 32 bytes per flow, instead of 384 for QFQ, so a much better
cache hit ratio.  2 bytes per hash table slot, instead of 8 for QFQ.
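
(For the example above: 8000 flows * 32 bytes = 256 KB of flow state
versus 8000 * 384 ~= 3 MB of QFQ classes, and 65536 slots * 2 bytes =
128 KB of hash table versus 65536 * 8 = 512 KB.)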

(a perturb timer would not be recommended for a huge SFQ setup)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] SFQ planned changes
  2012-01-03 10:40     ` [RFC] SFQ planned changes Eric Dumazet
@ 2012-01-03 12:07       ` Dave Taht
  2012-01-03 12:50         ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Taht @ 2012-01-03 12:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Michal Kubeček, netdev, John A. Sullivan III

It will take me a while to fully comment on this... there are
all sorts of subtleties to deal with (one biggie - ledbat vs multi-queue
behavior)... but I am encouraged by the events of the past
months and my testing today....

On Tue, Jan 3, 2012 at 11:40 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 03 janvier 2012 à 10:36 +0100, Dave Taht a écrit :
>
>> I note that (as of yesterday) sfq is performing as well as qfq did
>> under most workloads, and is considerably simpler than qfq, but
>> what I have in mind for shaping in an asymmetric scenario
>> *may* involve 'weighting' - rather than strictly prioritizing -
>> small acks... and it may not - I'd like to be able to benchmark

I need to be clear that the above is a subtle problem that I'd
have to address in a separate mail - AND both SFQ and QFQ
do such a better job than wshaper did in the first place that
weighting small acks only wins in a limited number of
scenarios.

We have a larger problem in dealing with TSO/GSO
size superpackets that's hard to solve.

I'd prefer to think, design tests, and benchmark, and think again
for a while...

>> the various AQM approaches against a variety of workloads
>> before declaring victory.
>
>
> A QFQ setup with more than 1024 classes/qdisc is way too slow at init
> time, and consumes ~384 bytes per class: ~12582912 bytes for 32768
> classes.

QFQ could be improved with some of the same techniques you
describe below.

> We are also limited to 65536 qdiscs per device, so a QFQ setup using a
> hash is limited to a 32768 divisor.
>
>
> Now, SFQ as implemented in Linux is very limited, with at most 127
> flows and a limit of 127 packets. [ So if 127 flows are active, we have
> one packet per flow. ]

I agree SFQ can be improved upwards in scale, greatly.

My own personal goal is to evolve towards something that
can replace pfifo_fast as the default in linux.

I don't know if that goal is shared by all as yet. :)

> I plan to add the following features to SFQ:

From a 'doing science' perspective, I'd like it to remain possible
to continue using and benchmarking SFQ as it was, and to create
this set of ideas as a new qdisc ('efq'?).

As these changes seem to require changes to userspace tc, anyway,
and (selfishly) my patching burden is great enough...

Perhaps some additional benefit could be had by losing
full backward API compatibility with sfq, as well?

> - Ability to specify a per-flow limit.
>     This is what is called the 'depth',
>     currently hardcoded to min(127, limit)

Introducing per-flow buffering (as QFQ does) *re-introduces*
the overall AQM problem of managing the size of the
individual flows.

This CDF graph shows how badly wireless is currently behaving
(courtesy of Albert Rafetseder of the University of Vienna):

http://www.teklibre.com/~d/bloat/qfq_vs_pfifo_fast_wireless_iwl_card_vs_cerowrt.pdf

(I have to convince gnuplot to give me these!!)

If I were to give QFQ a larger sub-qdisc depth than what's in there
(presently 24), the same graph would also show the median latency
increasing proportionately.

The Time in Queue idea for managing that queue depth is quite
strong; there may be others.

(in fact, I'm carrying your preliminary TiQ patch in my
 bql trees, not that I've done anything with it yet)

> - Ability to have up to 65535 flows (instead of 127)
>
> - Ability to do head drop (dropping old packets from a flow)

The head drop idea is strong, when combined with time in queue.

However: it would be useful to be able to pull forward the next packet
in that sub-queue and deliver it, so as to provide proper signalling
upstream.  Packets nowadays arrive in bursts, which means that
once one timestamp has expired, many will.  What I just suggested
would (worst case) deliver every other packet in a backlog and
obviously needs refinement.....

>
> Example of use: no more than 20 packets per flow, max 8000 flows, max
> 20000 packets in the SFQ qdisc, hash table of 65536 slots.
>
> tc qdisc add ... sfq \
>        flows 8000 \
>        depth 20 \
>        headdrop \
>        limit 20000 divisor 65536
>
> RAM usage: 32 bytes per flow, instead of 384 for QFQ, so a much better
> cache hit ratio.  2 bytes per hash table slot, instead of 8 for QFQ.

I do like it!

I retain a liking for QFQ because other qdiscs (red, for example)
can be attached to it, but having a simple yet good default with SFQ
scaled up to modern requirements would also be awesome!

> (a perturb timer would not be recommended for a huge SFQ setup)

no kidding!

-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-03  9:36   ` Dave Taht
  2012-01-03 10:40     ` [RFC] SFQ planned changes Eric Dumazet
@ 2012-01-03 12:18     ` John A. Sullivan III
  2012-01-03 12:32       ` Eric Dumazet
  1 sibling, 1 reply; 18+ messages in thread
From: John A. Sullivan III @ 2012-01-03 12:18 UTC (permalink / raw)
  To: Dave Taht; +Cc: Michal Kubeček, netdev

On Tue, 2012-01-03 at 10:36 +0100, Dave Taht wrote:
<snip>
> I'd go into more detail, but after what I hope are the final two
> fixes to sfq and qfq land in the net-next kernel (after some more
> testing), I like to think I have a more valid approach than this
> in the works, but that too will require some more development
> and testing.
> 
> http://www.teklibre.com/~d/bloat/pfifo_fast_vs_sfq_qfq_linear.png
> 
<snip>
Hmmm . . . certainly shattered my concerns about replacing pfifo_fast
with SFQ! Thanks - John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-03 12:18     ` tc filter mask for ACK packets off? John A. Sullivan III
@ 2012-01-03 12:32       ` Eric Dumazet
  2012-01-03 12:45         ` John A. Sullivan III
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-01-03 12:32 UTC (permalink / raw)
  To: John A. Sullivan III; +Cc: Dave Taht, Michal Kubeček, netdev

Le mardi 03 janvier 2012 à 07:18 -0500, John A. Sullivan III a écrit :
> On Tue, 2012-01-03 at 10:36 +0100, Dave Taht wrote:
> <snip>
> > I'd go into more detail, but after what I hope are the final two
> > fixes to sfq and qfq land in the net-next kernel (after some more
> > testing), I like to think I have a more valid approach than this
> > in the works, but that too will require some more development
> > and testing.
> > 
> > http://www.teklibre.com/~d/bloat/pfifo_fast_vs_sfq_qfq_linear.png
> > 
> <snip>
> Hmmm . . . certainly shattered my concerns about replacing pfifo_fast
> with SFQ! Thanks - John

Before you do, take the time to read the warning in the sfq source:


	ADVANTAGE:

	- It is very cheap. Both CPU and memory requirements are minimal.

	DRAWBACKS:

	- "Stochastic" -> It is not 100% fair.
	When hash collisions occur, several flows are considered as one.

	- "Round-robin" -> It introduces larger delays than virtual clock
	based schemes, and should not be used for isolating interactive
	traffic	from non-interactive. It means, that this scheduler
	should be used as leaf of CBQ or P3, which put interactive traffic
	to higher priority band.


SFQ (as a direct replacement for the dev root qdisc) is fine if most of
your traffic is of the same kind/priority.
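
A minimal sketch of the recommended arrangement (device and details are
placeholders) is a prio root with sfq leaves:

tc qdisc add dev eth0 root handle 1: prio bands 3
tc qdisc add dev eth0 parent 1:1 handle 10: sfq perturb 10
tc qdisc add dev eth0 parent 1:2 handle 20: sfq perturb 10
tc qdisc add dev eth0 parent 1:3 handle 30: sfq perturb 10

Interactive traffic then lands in band 1:1 via the usual priomap
mapping, while each band still gets per-flow fairness.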

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-03 12:32       ` Eric Dumazet
@ 2012-01-03 12:45         ` John A. Sullivan III
  2012-01-03 13:00           ` Dave Taht
  0 siblings, 1 reply; 18+ messages in thread
From: John A. Sullivan III @ 2012-01-03 12:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Dave Taht, Michal Kubeček, netdev

On Tue, 2012-01-03 at 13:32 +0100, Eric Dumazet wrote:
> Le mardi 03 janvier 2012 à 07:18 -0500, John A. Sullivan III a écrit :
> > On Tue, 2012-01-03 at 10:36 +0100, Dave Taht wrote:
> > <snip>
> > > I'd go into more detail, but after what I hope are the final two
> > > fixes to sfq and qfq land in the net-next kernel (after some more
> > > testing), I like to think I have a more valid approach than this
> > > in the works, but that too will require some more development
> > > and testing.
> > > 
> > > http://www.teklibre.com/~d/bloat/pfifo_fast_vs_sfq_qfq_linear.png
> > > 
> > <snip>
> > Hmmm . . . certainly shattered my concerns about replacing pfifo_fast
> > with SFQ! Thanks - John
> 
> Before you do, take the time to read the warning in the sfq source:
> 
> 
> 	ADVANTAGE:
> 
> 	- It is very cheap. Both CPU and memory requirements are minimal.
> 
> 	DRAWBACKS:
> 
> 	- "Stochastic" -> It is not 100% fair.
> 	When hash collisions occur, several flows are considered as one.
> 
> 	- "Round-robin" -> It introduces larger delays than virtual clock
> 	based schemes, and should not be used for isolating interactive
> 	traffic	from non-interactive. It means, that this scheduler
> 	should be used as leaf of CBQ or P3, which put interactive traffic
> 	to higher priority band.
> 
> 
> SFQ (as a direct replacement for the dev root qdisc) is fine if most of
> your traffic is of the same kind/priority.
> 
> 
> 
Yes, I suppose I should have been more specific: replacing pfifo_fast
when I am using something else, like HFSC, to prioritize and shape my
traffic.  Hmm . . . although I still wonder about iSCSI SANs . . .
Thanks - John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] SFQ planned changes
  2012-01-03 12:07       ` Dave Taht
@ 2012-01-03 12:50         ` Eric Dumazet
  2012-01-03 16:08           ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-01-03 12:50 UTC (permalink / raw)
  To: Dave Taht; +Cc: Michal Kubeček, netdev, John A. Sullivan III

Le mardi 03 janvier 2012 à 13:07 +0100, Dave Taht a écrit :

> From a 'doing science' perspective, I'd like it to remain possible
> to continue using and benchmarking SFQ as it was, and to create
> this set of ideas as a new qdisc ('efq'?).
> 
> As these changes seem to require changes to userspace tc, anyway,
> and (selfishly) my patching burden is great enough...
> 
> Perhaps some additional benefit could be had by losing
> full backward API compatibility with sfq, as well?

No, it's completely compatible with the prior version.

An old tc command, or the lack of new arguments, will set up the SFQ
qdisc exactly as before.

I coded the thing and am doing stress tests before submission.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-03 12:45         ` John A. Sullivan III
@ 2012-01-03 13:00           ` Dave Taht
  2012-01-03 17:57             ` John A. Sullivan III
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Taht @ 2012-01-03 13:00 UTC (permalink / raw)
  To: John A. Sullivan III; +Cc: Eric Dumazet, Michal Kubeček, netdev

On Tue, Jan 3, 2012 at 1:45 PM, John A. Sullivan III
<jsullivan@opensourcedevel.com> wrote:
> On Tue, 2012-01-03 at 13:32 +0100, Eric Dumazet wrote:
>> Le mardi 03 janvier 2012 à 07:18 -0500, John A. Sullivan III a écrit :
>> > On Tue, 2012-01-03 at 10:36 +0100, Dave Taht wrote:
>> > <snip>
>> > > I'd go into more detail, but after what I hope are the final two
>> > > fixes to sfq and qfq land in the net-next kernel (after some more
>> > > testing), I like to think I have a more valid approach than this
>> > > in the works, but that too will require some more development
>> > > and testing.
>> > >
>> > > http://www.teklibre.com/~d/bloat/pfifo_fast_vs_sfq_qfq_linear.png
>> > >
>> > <snip>
>> > Hmmm . . . certainly shattered my concerns about replacing pfifo_fast
>> > with SFQ! Thanks - John

SFQ as presently implemented (and by presently I mean as of yesterday;
by tomorrow it could be different at the rate eric is going!) is VERY
suitable for sub-100Mbit desktops, wireless stations/laptops and other
devices, home gateways with sub-100Mbit uplinks, and the like.  That's
a few hundred million devices that aren't using it today, defaulting to
pfifo_fast and suffering for it.

QFQ is its big brother, and I have hopes it can scale up to 10GigE
once suitable techniques are found for managing the sub-queue depth.

The enhancements to SFQ eric proposed in the other thread might get it
to where it outperforms pfifo_fast (by a lot) in its default
configuration (eg txqueuelen 1000) with few side effects.  Scaling
further up than that...

... I don't have a good picture of gigE performance at the moment with
any of these advanced qdiscs and have no recommendation.

I do highly recommend that you fiddle with this stuff!  I do have to
note that the graph above had GSO/TSO turned off.

>> Before you do, take the time to read the warning in the sfq source:
>>
>>
>>       ADVANTAGE:
>>
>>       - It is very cheap. Both CPU and memory requirements are minimal.
>>
>>       DRAWBACKS:
>>
>>       - "Stochastic" -> It is not 100% fair.
>>       When hash collisions occur, several flows are considered as one.

This is in part the benefit of SFQ vs QFQ, in that the maximum queue
depth is well managed.

>>       - "Round-robin" -> It introduces larger delays than virtual clock
>>       based schemes, and should not be used for isolating interactive
>>       traffic from non-interactive. It means, that this scheduler
>>       should be used as leaf of CBQ or P3, which put interactive traffic
>>       to higher priority band.

These delays are NOTHING compared to what pfifo_fast can induce.

Very little traffic nowadays is marked as interactive to any statistically
significant extent, so any FQ method effectively makes more traffic
interactive than prioritization can.

>> SFQ (as a direct replacement for the dev root qdisc) is fine if most
>> of your traffic is of the same kind/priority.

Which is the case for most desktops, laptops, gws, wireless, etc.

> Yes, I suppose I should have been more specific: replacing pfifo_fast
> when I am using something else, like HFSC, to prioritize and shape my
> traffic.

I enjoyed getting your HFSC experience secondhand.  It would be
very interesting to get your feedback on trying this stuff.

More data is needed to beat the bloat.

> Hmm . . . although I still wonder about iSCSI SANs . . .   Thanks

I wonder too. Most of the people running iSCSI seem to have an
aversion to packet loss, yet are running over TCP. I *think*
FQ methods will improve latency dramatically for iSCSI
when iSCSI has multiple initiators....


> - John
>



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] SFQ planned changes
  2012-01-03 12:50         ` Eric Dumazet
@ 2012-01-03 16:08           ` Eric Dumazet
  2012-01-03 23:57             ` Dave Taht
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-01-03 16:08 UTC (permalink / raw)
  To: Dave Taht; +Cc: Michal Kubeček, netdev, John A. Sullivan III

Here is the code I ran on my test server with 200 netperf TCP_STREAM
flows with pretty good results (each flow gets 0.5 % of bandwidth)

$TC qdisc add dev $DEV root handle 1: est 1sec 8sec htb default 1 
$TC class add dev $DEV parent 1: classid 1:1 est 1sec 8sec htb \
	rate 200Mbit mtu 40000 quantum 80000

$TC qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec sfq \
	limit 2000 depth 10 headdrop flows 1000 divisor 16384 

# tcnew -s -d qdisc show dev eth3
qdisc htb 1: root refcnt 18 r2q 10 default 1 direct_packets_stat 0 ver 3.17
 Sent 4512949730 bytes 3030391 pkt (dropped 44409, overlimits 6105100 requeues 1) 
 rate 198288Kbit 16629pps backlog 0b 1732p requeues 1 
qdisc sfq 10: parent 1:1 limit 2000p quantum 1514b depth 10 headdrop flows 1000/16384 divisor 16384 
 Sent 4512949730 bytes 3030391 pkt (dropped 44409, overlimits 0 requeues 0) 
 rate 198288Kbit 16629pps backlog 2622248b 1732p requeues 0 

patch on top of current net-next

 include/linux/pkt_sched.h |    7 +
 net/sched/sch_sfq.c       |  144 ++++++++++++++++++++++++------------
 2 files changed, 104 insertions(+), 47 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 8daced3..c2c6cfd 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -162,6 +162,13 @@ struct tc_sfq_qopt {
 	unsigned	flows;		/* Maximal number of flows  */
 };
 
+struct tc_sfq_ext_qopt {
+	struct tc_sfq_qopt qopt;
+	unsigned int depth;		/* max number of packets per flow */
+	unsigned int headdrop;
+};
+
+
 struct tc_sfq_xstats {
 	__s32		allot;
 };
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index d329a8a..66682fd 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -67,15 +67,16 @@
 
 	IMPLEMENTATION:
 	This implementation limits maximal queue length to 128;
-	max mtu to 2^18-1; max 128 flows, number of hash buckets to 1024.
-	The only goal of this restrictions was that all data
-	fit into one 4K page on 32bit arches.
+	max mtu to 2^18-1;
+	max 65280 flows,
+	number of hash buckets to 65536.
 
 	It is easy to increase these values, but not in flight.  */
 
 #define SFQ_DEPTH		128 /* max number of packets per flow */
-#define SFQ_SLOTS		128 /* max number of flows */
-#define SFQ_EMPTY_SLOT		255
+#define SFQ_DEFAULT_FLOWS	128
+#define SFQ_MAX_FLOWS		(0x10000 - 256) /* max number of flows */
+#define SFQ_EMPTY_SLOT		0xffff
 #define SFQ_DEFAULT_HASH_DIVISOR 1024
 
 /* We use 16 bits to store allot, and want to handle packets up to 64K
@@ -84,13 +85,13 @@
 #define SFQ_ALLOT_SHIFT		3
 #define SFQ_ALLOT_SIZE(X)	DIV_ROUND_UP(X, 1 << SFQ_ALLOT_SHIFT)
 
-/* This type should contain at least SFQ_DEPTH + SFQ_SLOTS values */
-typedef unsigned char sfq_index;
+/* This type should contain at least SFQ_DEPTH + SFQ_MAX_FLOWS values */
+typedef u16 sfq_index;
 
 /*
  * We dont use pointers to save space.
- * Small indexes [0 ... SFQ_SLOTS - 1] are 'pointers' to slots[] array
- * while following values [SFQ_SLOTS ... SFQ_SLOTS + SFQ_DEPTH - 1]
+ * Small indexes [0 ... SFQ_MAX_FLOWS - 1] are 'pointers' to slots[] array
+ * while following values [SFQ_MAX_FLOWS ... SFQ_MAX_FLOWS + SFQ_DEPTH - 1]
  * are 'pointers' to dep[] array
  */
 struct sfq_head {
@@ -112,8 +113,11 @@ struct sfq_sched_data {
 /* Parameters */
 	int		perturb_period;
 	unsigned int	quantum;	/* Allotment per round: MUST BE >= MTU */
-	int		limit;
+	int		limit;		/* limit of total number of packets in this qdisc */
 	unsigned int	divisor;	/* number of slots in hash table */
+	unsigned int	maxflows;	/* number of flows in flows array */
+	int		headdrop;
+	int		depth;		/* limit depth of each flow */
 /* Variables */
 	struct tcf_proto *filter_list;
 	struct timer_list perturb_timer;
@@ -122,7 +126,7 @@ struct sfq_sched_data {
 	unsigned short  scaled_quantum; /* SFQ_ALLOT_SIZE(quantum) */
 	struct sfq_slot *tail;		/* current slot in round */
 	sfq_index	*ht;		/* Hash table (divisor slots) */
-	struct sfq_slot	slots[SFQ_SLOTS];
+	struct sfq_slot	*slots;
 	struct sfq_head	dep[SFQ_DEPTH];	/* Linked list of slots, indexed by depth */
 };
 
@@ -131,9 +135,9 @@ struct sfq_sched_data {
  */
 static inline struct sfq_head *sfq_dep_head(struct sfq_sched_data *q, sfq_index val)
 {
-	if (val < SFQ_SLOTS)
+	if (val < SFQ_MAX_FLOWS)
 		return &q->slots[val].dep;
-	return &q->dep[val - SFQ_SLOTS];
+	return &q->dep[val - SFQ_MAX_FLOWS];
 }
 
 /*
@@ -199,18 +203,19 @@ static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
 }
 
 /*
- * x : slot number [0 .. SFQ_SLOTS - 1]
+ * x : slot number [0 .. SFQ_MAX_FLOWS - 1]
  */
 static inline void sfq_link(struct sfq_sched_data *q, sfq_index x)
 {
 	sfq_index p, n;
-	int qlen = q->slots[x].qlen;
+	struct sfq_slot *slot = &q->slots[x];
+	int qlen = slot->qlen;
 
-	p = qlen + SFQ_SLOTS;
+	p = qlen + SFQ_MAX_FLOWS;
 	n = q->dep[qlen].next;
 
-	q->slots[x].dep.next = n;
-	q->slots[x].dep.prev = p;
+	slot->dep.next = n;
+	slot->dep.prev = p;
 
 	q->dep[qlen].next = x;		/* sfq_dep_head(q, p)->next = x */
 	sfq_dep_head(q, n)->prev = x;
@@ -305,7 +310,7 @@ static unsigned int sfq_drop(struct Qdisc *sch)
 		x = q->dep[d].next;
 		slot = &q->slots[x];
 drop:
-		skb = slot_dequeue_tail(slot);
+		skb = q->headdrop ? slot_dequeue_head(slot) : slot_dequeue_tail(slot);
 		len = qdisc_pkt_len(skb);
 		sfq_dec(q, x);
 		kfree_skb(skb);
@@ -349,16 +354,26 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	slot = &q->slots[x];
 	if (x == SFQ_EMPTY_SLOT) {
 		x = q->dep[0].next; /* get a free slot */
+		if (x >= SFQ_MAX_FLOWS)
+			return qdisc_drop(skb, sch);
 		q->ht[hash] = x;
 		slot = &q->slots[x];
 		slot->hash = hash;
 	}
 
-	/* If selected queue has length q->limit, do simple tail drop,
-	 * i.e. drop _this_ packet.
-	 */
-	if (slot->qlen >= q->limit)
-		return qdisc_drop(skb, sch);
+	if (slot->qlen >= q->depth) {
+		struct sk_buff *head;
+
+		if (!q->headdrop)
+			return qdisc_drop(skb, sch);
+		head = slot_dequeue_head(slot);
+		sch->qstats.backlog -= qdisc_pkt_len(head);
+		kfree_skb(head);
+		sch->qstats.drops++;
+		sch->qstats.backlog += qdisc_pkt_len(skb);
+		slot_queue_add(slot, skb);
+		return NET_XMIT_CN;
+	}
 
 	sch->qstats.backlog += qdisc_pkt_len(skb);
 	slot_queue_add(slot, skb);
@@ -366,11 +381,11 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (slot->qlen == 1) {		/* The flow is new */
 		if (q->tail == NULL) {	/* It is the first flow */
 			slot->next = x;
+			q->tail = slot;
 		} else {
 			slot->next = q->tail->next;
 			q->tail->next = x;
 		}
-		q->tail = slot;
 		slot->allot = q->scaled_quantum;
 	}
 	if (++sch->q.qlen <= q->limit)
@@ -445,16 +460,17 @@ sfq_reset(struct Qdisc *sch)
  * We dont use sfq_dequeue()/sfq_enqueue() because we dont want to change
  * counters.
  */
-static void sfq_rehash(struct sfq_sched_data *q)
+static int sfq_rehash(struct sfq_sched_data *q)
 {
 	struct sk_buff *skb;
 	int i;
 	struct sfq_slot *slot;
 	struct sk_buff_head list;
+	int dropped = 0;
 
 	__skb_queue_head_init(&list);
 
-	for (i = 0; i < SFQ_SLOTS; i++) {
+	for (i = 0; i < q->maxflows; i++) {
 		slot = &q->slots[i];
 		if (!slot->qlen)
 			continue;
@@ -474,6 +490,11 @@ static void sfq_rehash(struct sfq_sched_data *q)
 		slot = &q->slots[x];
 		if (x == SFQ_EMPTY_SLOT) {
 			x = q->dep[0].next; /* get a free slot */
+			if (x >= SFQ_MAX_FLOWS) {
+				kfree_skb(skb);
+				dropped++;
+				continue;
+			}
 			q->ht[hash] = x;
 			slot = &q->slots[x];
 			slot->hash = hash;
@@ -491,6 +512,7 @@ static void sfq_rehash(struct sfq_sched_data *q)
 			slot->allot = q->scaled_quantum;
 		}
 	}
+	return dropped;
 }
 
 static void sfq_perturbation(unsigned long arg)
@@ -502,7 +524,7 @@ static void sfq_perturbation(unsigned long arg)
 	spin_lock(root_lock);
 	q->perturbation = net_random();
 	if (!q->filter_list && q->tail)
-		sfq_rehash(q);
+		qdisc_tree_decrease_qlen(sch, sfq_rehash(q));
 	spin_unlock(root_lock);
 
 	if (q->perturb_period)
@@ -513,11 +535,13 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	struct tc_sfq_qopt *ctl = nla_data(opt);
+	struct tc_sfq_ext_qopt *ctl_ext = NULL;
 	unsigned int qlen;
 
 	if (opt->nla_len < nla_attr_size(sizeof(*ctl)))
 		return -EINVAL;
-
+	if (opt->nla_len >= nla_attr_size(sizeof(*ctl_ext)))
+		ctl_ext = nla_data(opt);
 	if (ctl->divisor &&
 	    (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
 		return -EINVAL;
@@ -526,10 +550,18 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
 	q->quantum = ctl->quantum ? : psched_mtu(qdisc_dev(sch));
 	q->scaled_quantum = SFQ_ALLOT_SIZE(q->quantum);
 	q->perturb_period = ctl->perturb_period * HZ;
-	if (ctl->limit)
-		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 1);
+	if (ctl->flows)
+		q->maxflows = min_t(u32, ctl->flows, SFQ_MAX_FLOWS);
 	if (ctl->divisor)
 		q->divisor = ctl->divisor;
+	if (ctl_ext) {
+		if (ctl_ext->depth)
+			q->depth = min_t(u32, ctl_ext->depth, SFQ_DEPTH - 1);
+		q->headdrop = ctl_ext->headdrop;
+	}
+	if (ctl->limit)
+		q->limit = min_t(u32, ctl->limit, q->depth * q->maxflows);
+
 	qlen = sch->q.qlen;
 	while (sch->q.qlen > q->limit)
 		sfq_drop(sch);
@@ -544,6 +576,16 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
 	return 0;
 }
 
+static void sfq_free(void *addr)
+{
+	if (addr) {
+		if (is_vmalloc_addr(addr))
+			vfree(addr);
+		else
+			kfree(addr);
+	}
+}
+
 static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
@@ -555,14 +597,16 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 	init_timer_deferrable(&q->perturb_timer);
 
 	for (i = 0; i < SFQ_DEPTH; i++) {
-		q->dep[i].next = i + SFQ_SLOTS;
-		q->dep[i].prev = i + SFQ_SLOTS;
+		q->dep[i].next = i + SFQ_MAX_FLOWS;
+		q->dep[i].prev = i + SFQ_MAX_FLOWS;
 	}
 
 	q->limit = SFQ_DEPTH - 1;
+	q->depth = SFQ_DEPTH - 1;
 	q->cur_depth = 0;
 	q->tail = NULL;
 	q->divisor = SFQ_DEFAULT_HASH_DIVISOR;
+	q->maxflows = SFQ_DEFAULT_FLOWS;
 	if (opt == NULL) {
 		q->quantum = psched_mtu(qdisc_dev(sch));
 		q->scaled_quantum = SFQ_ALLOT_SIZE(q->quantum);
@@ -575,15 +619,22 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 	}
 
 	sz = sizeof(q->ht[0]) * q->divisor;
-	q->ht = kmalloc(sz, GFP_KERNEL);
+	q->ht = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN);
 	if (!q->ht && sz > PAGE_SIZE)
 		q->ht = vmalloc(sz);
 	if (!q->ht)
 		return -ENOMEM;
+
+	q->slots = kzalloc(sizeof(q->slots[0]) * q->maxflows, GFP_KERNEL | __GFP_NOWARN);
+	if (!q->slots)
+		q->slots = vzalloc(sizeof(q->slots[0]) * q->maxflows);
+	if (!q->slots) {
+		sfq_free(q->ht);
+		return -ENOMEM;
+	}
 	for (i = 0; i < q->divisor; i++)
 		q->ht[i] = SFQ_EMPTY_SLOT;
-
-	for (i = 0; i < SFQ_SLOTS; i++) {
+	for (i = 0; i < q->maxflows; i++) {
 		slot_queue_init(&q->slots[i]);
 		sfq_link(q, i);
 	}
@@ -601,25 +652,24 @@ static void sfq_destroy(struct Qdisc *sch)
 	tcf_destroy_chain(&q->filter_list);
 	q->perturb_period = 0;
 	del_timer_sync(&q->perturb_timer);
-	if (is_vmalloc_addr(q->ht))
-		vfree(q->ht);
-	else
-		kfree(q->ht);
+	sfq_free(q->ht);
+	sfq_free(q->slots);
 }
 
 static int sfq_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	unsigned char *b = skb_tail_pointer(skb);
-	struct tc_sfq_qopt opt;
-
-	opt.quantum = q->quantum;
-	opt.perturb_period = q->perturb_period / HZ;
+	struct tc_sfq_ext_qopt opt;
 
-	opt.limit = q->limit;
-	opt.divisor = q->divisor;
-	opt.flows = q->limit;
+	opt.qopt.quantum = q->quantum;
+	opt.qopt.perturb_period = q->perturb_period / HZ;
 
+	opt.qopt.limit = q->limit;
+	opt.qopt.divisor = q->divisor;
+	opt.qopt.flows = q->maxflows;
+	opt.depth = q->depth;
+	opt.headdrop = q->headdrop;
 	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
 
 	return skb->len;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-03 13:00           ` Dave Taht
@ 2012-01-03 17:57             ` John A. Sullivan III
  0 siblings, 0 replies; 18+ messages in thread
From: John A. Sullivan III @ 2012-01-03 17:57 UTC (permalink / raw)
  To: Dave Taht; +Cc: Eric Dumazet, Michal Kubeček, netdev

On Tue, 2012-01-03 at 14:00 +0100, Dave Taht wrote:
<snip>
> SFQ as presently implemented (and by presently I mean as of yesterday;
> by tomorrow it could be different at the rate eric is going!) is VERY
> suitable for sub-100Mbit desktops, wireless stations/laptops and other
> devices, home gateways with sub-100Mbit uplinks, and the like.  That's
> a few hundred million devices that aren't using it today, defaulting to
> pfifo_fast and suffering for it.
> 
> QFQ is its big brother, and I have hopes it can scale up to 10GigE
> once suitable techniques are found for managing the sub-queue depth.
> 
> The enhancements to SFQ eric proposed in the other thread might get it
> to where it outperforms pfifo_fast (by a lot) in its default
> configuration (eg txqueuelen 1000) with few side effects.  Scaling
> further up than that...
> 
> ... I don't have a good picture of gigE performance at the moment with
> any of these advanced qdiscs and have no recommendation.
Hmm . . . that's interesting in light of our thoughts about using SFQ
for iSCSI.  In that case, the links are GbE or 10GbE.  Is there a
problem using SFQ on links of that size rather than pfifo_fast?
> 
<snip>
> >>       - "Round-robin" -> It introduces larger delays than virtual clock
> >>       based schemes, and should not be used for isolating interactive
> >>       traffic from non-interactive. It means, that this scheduler
> >>       should be used as leaf of CBQ or P3, which put interactive traffic
> >>       to higher priority band.
> 
> These delays are NOTHING compared to what pfifo_fast can induce.
> 
> Very little traffic nowadays is marked as interactive to any statistically
> significant extent, so any FQ method effectively makes more traffic
> interactive than prioritization can.
That may be changing quickly.  I am doing a lot of work with Desktop
Virtualization.  This is all interactive traffic and, unlike terminal
screens over telnet or ssh in the past, these can be fairly large chunks
of data using full-sized packets.  They are also bursty rather than
periodic.  I would think we very much need prioritization here combined
with FQ (hence our interest in HFSC + SFQ).
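
The combination I have in mind is roughly the following (rates, device,
and classids are placeholders, not a tested configuration):

tc qdisc add dev eth0 root handle 1: hfsc default 20
tc class add dev eth0 parent 1: classid 1:1 hfsc sc rate 1gbit ul rate 1gbit
tc class add dev eth0 parent 1:1 classid 1:10 hfsc rt m2 200mbit
tc class add dev eth0 parent 1:1 classid 1:20 hfsc ls m2 800mbit
tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10

The rt curve guarantees the interactive class its share; sfq underneath
keeps any single session from starving the others within a class.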
> 
<snip>
> > Hmm . . . although I still wonder about iSCSI SANs . . .   Thanks
> 
> I wonder too. Most of the people running iSCSI seem to have an
> aversion to packet loss, yet are running over TCP. I *think*
> FQ methods will improve latency dramatically for iSCSI
> when iSCSI has multiple initiators....
<snip>
I haven't had a chance to play with this yet, but I'll do a little
thinking out loud.  Since these can be very large data transmissions, I
would think it quite possible that a new connection's SYN packet gets
stuck behind a pile of full-sized iSCSI packets.  On the other hand, I'm
not sure where the bottleneck is in iSCSI and whether these queues ever
backlog.  I just ran a quick, simple test on a non-optimized SAN doing a
cat /dev/zero > filename, hit 3.6Gbps throughput with four e1000 NICs
doing multipath multibus, and saw no backlog in the pfifo_fast qdiscs.

If we do ever backlog, I would think SFQ would provide a more
immediate response to new streams, whereas users of the bulk downloads
already in progress would not even notice the blip when the new stream
is inserted.

I would be a little concerned about iSCSI packets being delivered out
of order when multipath multibus is used, i.e., when the iSCSI commands
are round-robined across several NICs and thus several queues.  If
those queues are in varying states of backlog, a later packet in one
queue might be delivered before an earlier packet in another queue.
Then again, I would think pfifo_fast could produce a greater delay than
SFQ - John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] SFQ planned changes
  2012-01-03 16:08           ` Eric Dumazet
@ 2012-01-03 23:57             ` Dave Taht
  2012-01-04  0:14               ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Taht @ 2012-01-03 23:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Michal Kubeček, netdev, John A. Sullivan III

On Tue, Jan 3, 2012 at 5:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Here is the code I ran on my test server with 200 netperf TCP_STREAM
> flows with pretty good results (each flow gets 0.5 % of bandwidth)

Can I encourage you to always simultaneously run a fping and/or a
netperf -t TCP_RR latency-under-load test when doing stuff like this?

The amount of backlogged bytes is rather impressive...


>
> $TC qdisc add dev $DEV root handle 1: est 1sec 8sec htb default 1
> $TC class add dev $DEV parent 1: classid 1:1 est 1sec 8sec htb \
>        rate 200Mbit mtu 40000 quantum 80000
>
> $TC qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec sfq \
>        limit 2000 depth 10 headdrop flows 1000 divisor 16384
>
> # tcnew -s -d qdisc show dev eth3
> qdisc htb 1: root refcnt 18 r2q 10 default 1 direct_packets_stat 0 ver 3.17
>  Sent 4512949730 bytes 3030391 pkt (dropped 44409, overlimits 6105100 requeues 1)
>  rate 198288Kbit 16629pps backlog 0b 1732p requeues 1
> qdisc sfq 10: parent 1:1 limit 2000p quantum 1514b depth 10 headdrop flows 1000/16384 divisor 16384
>  Sent 4512949730 bytes 3030391 pkt (dropped 44409, overlimits 0 requeues 0)
>  rate 198288Kbit 16629pps backlog 2622248b 1732p requeues 0
>
> patch on top of current net-next

I'm not going to have time to get to this for a while...


> <snip patch>



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: tc filter mask for ACK packets off?
  2012-01-03  7:31 ` Michal Kubeček
  2012-01-03  9:36   ` Dave Taht
@ 2012-01-04  0:01   ` Michal Soltys
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Soltys @ 2012-01-04  0:01 UTC (permalink / raw)
  To: Michal Kubeček; +Cc: netdev, John A. Sullivan III

On 12-01-03 08:31, Michal Kubeček wrote:
> On Saturday 31 of December 2011 21:30, John A. Sullivan III wrote:
>>  <cut>
>
> However, by an "ACK only" packet (worth prioritizing), I would rather
> understand a packet with the ACK flag and no payload, not a packet with
> ACK as the only flag.  For many TCP connections, all packets except the
> initial SYN and SYN-ACK and the two FIN packets have ACK as the only
> flag.  So my guess is you should rather prioritize all TCP packets with
> no application-layer data.
>
>                                                           Michal Kubecek

In the context of the above, xtables-addons provides a length2 match
which (possibly paired with other iptables matches) gives excellent
control for such tasks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] SFQ planned changes
  2012-01-03 23:57             ` Dave Taht
@ 2012-01-04  0:14               ` Eric Dumazet
  2012-01-04  7:56                 ` Dave Taht
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-01-04  0:14 UTC (permalink / raw)
  To: Dave Taht; +Cc: Michal Kubeček, netdev, John A. Sullivan III

Le mercredi 04 janvier 2012 à 00:57 +0100, Dave Taht a écrit :
> On Tue, Jan 3, 2012 at 5:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Here is the code I ran on my test server with 200 netperf TCP_STREAM
> > flows with pretty good results (each flow gets 0.5 % of bandwidth)
> 
> Can I encourage you to always simultaneously run a fping and/or a
> netperf -t TCP_RR
> 

ping is pretty nice ;)

# ping -c 20 192.168.20.112
PING 192.168.20.112 (192.168.20.112) 56(84) bytes of data.
64 bytes from 192.168.20.112: icmp_req=1 ttl=64 time=0.251 ms
64 bytes from 192.168.20.112: icmp_req=2 ttl=64 time=0.123 ms
64 bytes from 192.168.20.112: icmp_req=3 ttl=64 time=0.124 ms
64 bytes from 192.168.20.112: icmp_req=4 ttl=64 time=0.108 ms
64 bytes from 192.168.20.112: icmp_req=5 ttl=64 time=0.131 ms
64 bytes from 192.168.20.112: icmp_req=6 ttl=64 time=0.126 ms
64 bytes from 192.168.20.112: icmp_req=7 ttl=64 time=0.156 ms
64 bytes from 192.168.20.112: icmp_req=8 ttl=64 time=0.123 ms
64 bytes from 192.168.20.112: icmp_req=9 ttl=64 time=0.111 ms
64 bytes from 192.168.20.112: icmp_req=10 ttl=64 time=0.129 ms
64 bytes from 192.168.20.112: icmp_req=11 ttl=64 time=0.112 ms
64 bytes from 192.168.20.112: icmp_req=12 ttl=64 time=0.138 ms
64 bytes from 192.168.20.112: icmp_req=13 ttl=64 time=0.118 ms
64 bytes from 192.168.20.112: icmp_req=14 ttl=64 time=0.119 ms
64 bytes from 192.168.20.112: icmp_req=15 ttl=64 time=0.121 ms
64 bytes from 192.168.20.112: icmp_req=16 ttl=64 time=0.125 ms
64 bytes from 192.168.20.112: icmp_req=17 ttl=64 time=0.128 ms
64 bytes from 192.168.20.112: icmp_req=18 ttl=64 time=0.108 ms
64 bytes from 192.168.20.112: icmp_req=19 ttl=64 time=0.149 ms
64 bytes from 192.168.20.112: icmp_req=20 ttl=64 time=0.126 ms

--- 192.168.20.112 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 18999ms
rtt min/avg/max/mdev = 0.108/0.131/0.251/0.031 ms


> latency under load test when doing stuff like this?
> 
> The amount of backlogged bytes is rather impressive...

200 TCP flooding flows... that's pretty normal.

If I add a TCP_RR flow to this load, I get:

# netperf -H 192.168.20.110 -v 0 -l 10 -t TCP_RR
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.20.110 (192.168.20.110) port 0 AF_INET : demo
7606.18 


If I stop the flood and start the TCP_RR alone:

# netperf -H 192.168.20.110 -v 0 -l 10 -t TCP_RR
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.20.110 (192.168.20.110) port 0 AF_INET : demo
12031.39 

And a ping on an idle link:
#  ping -c 20 192.168.20.112
PING 192.168.20.112 (192.168.20.112) 56(84) bytes of data.
64 bytes from 192.168.20.112: icmp_req=1 ttl=64 time=0.119 ms
64 bytes from 192.168.20.112: icmp_req=2 ttl=64 time=0.090 ms
64 bytes from 192.168.20.112: icmp_req=3 ttl=64 time=0.085 ms
64 bytes from 192.168.20.112: icmp_req=4 ttl=64 time=0.087 ms
64 bytes from 192.168.20.112: icmp_req=5 ttl=64 time=0.084 ms
64 bytes from 192.168.20.112: icmp_req=6 ttl=64 time=0.084 ms
64 bytes from 192.168.20.112: icmp_req=7 ttl=64 time=0.088 ms
64 bytes from 192.168.20.112: icmp_req=8 ttl=64 time=0.085 ms
64 bytes from 192.168.20.112: icmp_req=9 ttl=64 time=0.083 ms
64 bytes from 192.168.20.112: icmp_req=10 ttl=64 time=0.082 ms
64 bytes from 192.168.20.112: icmp_req=11 ttl=64 time=0.082 ms
64 bytes from 192.168.20.112: icmp_req=12 ttl=64 time=0.085 ms
64 bytes from 192.168.20.112: icmp_req=13 ttl=64 time=0.086 ms
64 bytes from 192.168.20.112: icmp_req=14 ttl=64 time=0.084 ms
64 bytes from 192.168.20.112: icmp_req=15 ttl=64 time=0.089 ms
64 bytes from 192.168.20.112: icmp_req=16 ttl=64 time=0.081 ms
64 bytes from 192.168.20.112: icmp_req=17 ttl=64 time=0.084 ms
64 bytes from 192.168.20.112: icmp_req=18 ttl=64 time=0.086 ms
64 bytes from 192.168.20.112: icmp_req=19 ttl=64 time=0.084 ms
64 bytes from 192.168.20.112: icmp_req=20 ttl=64 time=0.084 ms

--- 192.168.20.112 ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19000ms
rtt min/avg/max/mdev = 0.081/0.086/0.119/0.012 ms


I can do a test at full Gigabit speed (removing the HTB) with 1000
flows and post the results

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] SFQ planned changes
  2012-01-04  0:14               ` Eric Dumazet
@ 2012-01-04  7:56                 ` Dave Taht
  2012-01-04  8:17                   ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Taht @ 2012-01-04  7:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Michal Kubeček, netdev, John A. Sullivan III

On Wed, Jan 4, 2012 at 1:14 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 04 janvier 2012 à 00:57 +0100, Dave Taht a écrit :
>> On Tue, Jan 3, 2012 at 5:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > Here is the code I ran on my test server with 200 netperf TCP_STREAM
>> > flows with pretty good results (each flow gets 0.5 % of bandwidth)
>>
>> Can I encourage you to always simultaneously run a fping and/or a
>> netperf -t TCP_RR
>>

So I sat down and set up something that could do gigE and exercise
everything I had lying around worth playing with, to see what crashed...

> a ping on idle link :
> #  ping -c 20 192.168.20.112
> PING 192.168.20.112 (192.168.20.112) 56(84) bytes of data.

> 64 bytes from 192.168.20.112: icmp_req=1 ttl=64 time=0.119 ms
> 64 bytes from 192.168.20.112: icmp_req=2 ttl=64 time=0.090 ms
> 64 bytes from 192.168.20.112: icmp_req=3 ttl=64 time=0.085 ms
> 64 bytes from 192.168.20.112: icmp_req=4 ttl=64 time=0.087 ms

I find it puzzling that my baseline ping time is nearly 3x yours.

I guess this is the price I pay for a 680MHz box on the other end.

My baseline ping (1 hop, e1000e to router):

64 bytes from 172.30.50.1: icmp_req=18 ttl=64 time=0.239 ms
64 bytes from 172.30.50.1: icmp_req=19 ttl=64 time=0.247 ms
64 bytes from 172.30.50.1: icmp_req=20 ttl=64 time=0.301 ms

(or, in my data format)

|T|172.30.50.1  |172.30.47.1  |172.30.47.27 |
|-+-+-+-+|
|1|0.34|0.63|0.59|
|2|0.28|0.42|0.45|
|3|0.39|0.41|0.48|
|4|0.37|0.42|0.51|
|5|0.33|0.43|0.49|


your load test:

> # ping -c 20 192.168.20.112
> PING 192.168.20.112 (192.168.20.112) 56(84) bytes of data.
> 64 bytes from 192.168.20.112: icmp_req=1 ttl=64 time=0.251 ms
> 64 bytes from 192.168.20.112: icmp_req=2 ttl=64 time=0.123 ms
> 64 bytes from 192.168.20.112: icmp_req=3 ttl=64 time=0.124 ms

This was my complex qfq/sfq test that ran all night (somehow), at gigE.

STAQFQ is on the source laptop, 100 iperfs in play, 600 seconds at a
time, net transfer rate about 250Mbit... and I rate limited BQL to a
9000 limit_max. GSO/TSO are off throughout.

(STAQFQ is 514 QFQ bins, 24 pfifo_fast qdiscs per)

The first router has STAQFQ on the external interface connected to
laptop #1, and SFQ on the internal interface connected to router #2.

Router #2 had SFQ on both its external and internal interfaces.

Laptop #2 had pfifo_fast on it.
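
A rough sketch of attaching qdiscs like these (interface names and
parameters are placeholders, not the exact test config):

  # QFQ root with one class and a catch-all filter
  tc qdisc add dev eth0 root handle 1: qfq
  tc class add dev eth0 parent 1: classid 1:1 qfq weight 1 maxpkt 1514
  tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
     match u32 0 0 flowid 1:1
  # plain SFQ on the internal interface
  tc qdisc add dev eth1 root sfq perturb 10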

|count|e1000e to router|to next router|through next router's switch to laptop #2|
|100|0.40|0.42|0.57|
|101|0.49|0.48|0.54|
|102|0.59|0.65|0.73|
|103|0.48|0.59|0.83|
|104|0.36|0.56|0.75|
|105|0.51|0.63|0.66|
|106|0.41|0.60|0.40|
|107|0.62|0.44|0.81|
|108|0.33|0.36|0.79|
|109|0.49|0.49|0.49|
|110|0.48|0.42|0.54|

Four notes of interest while I sort through this:

1) I saw spikes of up to 12ms with BQL's limiter enabled at one point
or another.
I'll try to duplicate that.
2) I did manage to crash QFQ multiple times earlier in the night
(on every interface that has sfq on it now)
3) And when the ping ends up in the wrong bin, the results can be interesting.

|125|0.56|98.91|0.55|
|126|0.41|96.54|0.52|
|127|0.35|96.11|0.91|
|128|0.23|106.52|0.57|
|129|0.42|104.01|0.83|
|130|0.44|105.92|0.59|

4) There was packet loss (yea!) and many other anomalies. I ran each
test for 600 seconds; I need to look at the actual data transferred,
and will try a plot later.

But I can say the 2-day-old SFQ stuff stands up to a load test...
And QFQ can do pretty well, too, when not crashing...
I will get to your new patch set over the weekend.

-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] SFQ planned changes
  2012-01-04  7:56                 ` Dave Taht
@ 2012-01-04  8:17                   ` Eric Dumazet
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2012-01-04  8:17 UTC (permalink / raw)
  To: Dave Taht; +Cc: Michal Kubeček, netdev, John A. Sullivan III

Le mercredi 04 janvier 2012 à 08:56 +0100, Dave Taht a écrit :

> I find puzzling that my baseline ping time is nearly 3x yours.
> 
> I guess this is the price I pay for a 680mhz box on the other end.
> 

Hmm... maybe... but this seems strange. A ping handler should be a
matter of 1 to 10 us at most, not 100 us.

Check the rx coalescing params on the receiver machine: ethtool -c eth0

Here, the sender is on a normal link (not trunk mode):

# ethtool -c eth3
Coalesce parameters for eth3:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 24
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 48
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0


And on my 2nd server, the receiver of the ping request (two switches
are also crossed between these machines). This eth2 is part of a
bond0 device, with trunking (VLAN) activated on this link:

$ ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 20
rx-frames: 5
rx-usecs-irq: 0
rx-frames-irq: 5

tx-usecs: 72
tx-frames: 53
tx-usecs-irq: 0
tx-frames-irq: 5

So I have a 20us delay at rx time before the NIC sends an interrupt
to the host to 'deliver' the incoming packet.

If I change it to 1 us:

ethtool -C eth2 rx-usecs 1

then pings are even better, but a given load should generate more
interrupts.

# ping 192.168.20.110
PING 192.168.20.110 (192.168.20.110) 56(84) bytes of data.
64 bytes from 192.168.20.110: icmp_req=1 ttl=64 time=0.067 ms
64 bytes from 192.168.20.110: icmp_req=2 ttl=64 time=0.061 ms
64 bytes from 192.168.20.110: icmp_req=3 ttl=64 time=0.064 ms
64 bytes from 192.168.20.110: icmp_req=4 ttl=64 time=0.064 ms
64 bytes from 192.168.20.110: icmp_req=5 ttl=64 time=0.061 ms
64 bytes from 192.168.20.110: icmp_req=6 ttl=64 time=0.061 ms
64 bytes from 192.168.20.110: icmp_req=7 ttl=64 time=0.062 ms
64 bytes from 192.168.20.110: icmp_req=8 ttl=64 time=0.060 ms
64 bytes from 192.168.20.110: icmp_req=9 ttl=64 time=0.062 ms
64 bytes from 192.168.20.110: icmp_req=10 ttl=64 time=0.062 ms
64 bytes from 192.168.20.110: icmp_req=11 ttl=64 time=0.062 ms
64 bytes from 192.168.20.110: icmp_req=12 ttl=64 time=0.062 ms
64 bytes from 192.168.20.110: icmp_req=13 ttl=64 time=0.058 ms
64 bytes from 192.168.20.110: icmp_req=14 ttl=64 time=0.062 ms
64 bytes from 192.168.20.110: icmp_req=15 ttl=64 time=0.063 ms
64 bytes from 192.168.20.110: icmp_req=16 ttl=64 time=0.063 ms
64 bytes from 192.168.20.110: icmp_req=17 ttl=64 time=0.059 ms
64 bytes from 192.168.20.110: icmp_req=18 ttl=64 time=0.062 ms
^C
--- 192.168.20.110 ping statistics ---
18 packets transmitted, 18 received, 0% packet loss, time 16999ms
rtt min/avg/max/mdev = 0.058/0.061/0.067/0.010 ms
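
Where the NIC driver supports it, adaptive coalescing (the "Adaptive
RX" field in the ethtool output above) can make a similar trade-off
dynamically; a sketch:

  ethtool -C eth2 adaptive-rx on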

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-01-04  8:17 UTC | newest]

Thread overview: 18+ messages
2012-01-01  2:30 tc filter mask for ACK packets off? John A. Sullivan III
2012-01-03  7:31 ` Michal Kubeček
2012-01-03  9:36   ` Dave Taht
2012-01-03 10:40     ` [RFC] SFQ planned changes Eric Dumazet
2012-01-03 12:07       ` Dave Taht
2012-01-03 12:50         ` Eric Dumazet
2012-01-03 16:08           ` Eric Dumazet
2012-01-03 23:57             ` Dave Taht
2012-01-04  0:14               ` Eric Dumazet
2012-01-04  7:56                 ` Dave Taht
2012-01-04  8:17                   ` Eric Dumazet
2012-01-03 12:18     ` tc filter mask for ACK packets off? John A. Sullivan III
2012-01-03 12:32       ` Eric Dumazet
2012-01-03 12:45         ` John A. Sullivan III
2012-01-03 13:00           ` Dave Taht
2012-01-03 17:57             ` John A. Sullivan III
2012-01-04  0:01   ` Michal Soltys
