* Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
@ 2014-10-13 14:22 Dave Taht
[not found] ` <CAL4WiioOodM60_Ud6qyGt1_uyeqpDeGFF6PtEKtGa=NwHhVFiw@mail.gmail.com>
0 siblings, 1 reply; 6+ messages in thread
From: Dave Taht @ 2014-10-13 14:22 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David Miller, Eric Dumazet, netdev@vger.kernel.org, Tom Herbert,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend,
Toke Høiland-Jørgensen
When I first got cc'd on these threads and saw netperf-wrapper being
used on them, I thought: "Oh god, I've created a monster." My intent
in helping create such a measurement tool was not to routinely drive a
network to saturation, but to be able to measure the impact on latency
of doing so.
I was trying to get reasonable behavior when a "router" went into overload.
Servers, on the other hand, have more options to avoid overload than
routers do. There's been a great deal of really nice work on that
front. I love all that.
And I like BQL because it provides enough backpressure to be able to
do smarter things about scheduling packets higher in the stack.
(Life pre-BQL cost some hair.)
But Tom once told me "BQL's objective is to keep the hardware busy."
It uses an MIAD controller rather than a saner AIMD one; in
particular, I'd much rather it ramped back down to smaller values
after absorbing a burst.
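To make the shape concrete, here's a toy sketch (plain Python,
invented constants, nothing to do with the real dql code in
lib/dynamic_queue_limits.c): grow additively while the hardware is
hungry, cut multiplicatively once a burst has been absorbed.

# Illustrative only: an AIMD-shaped byte-limit controller, not the
# real dql algorithm.  All constants here are made up.
def aimd_limit(limit, starved, absorbed_burst,
               step=1514, cut=2, floor=2 * 1514, ceiling=512 * 1024):
    if starved:
        # Hardware ran dry: additive increase, roughly one MTU at a time.
        return min(limit + step, ceiling)
    if absorbed_burst:
        # Burst has drained: multiplicative decrease, so the limit does
        # not stay pinned at the burst's high-water mark.
        return max(limit // cut, floor)
    return limit

Only the shape of the response matters here, not the numbers.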
My objective is always to keep the *network's behavior optimal*:
minimizing bursts and the subsequent tail loss on the other side,
responding quickly to loss, and doing that by preserving, to the
greatest extent possible, the ack clocking that a fluid model has.
I LOVE BQL for providing more backpressure than has ever existed
before, and I know it's incredibly difficult to approximate fluid
models on a conventional cpu architecture that has other work to do.
But to get the best network behavior I'm willing to sacrifice a great
deal of cpu, interrupts, whatever it takes! The goal is to get the
most packets to all the destinations specified, whatever the workload,
with the *minimum possible latency between ack and reply*.
What I'd hoped for from the new bulking and RCU work was to see a net
reduction in TSO/GSO size and/or BQL's size. I also kept hoping for
some profiles of sch_fq, and for more complex benchmarking with dozens
or hundreds of realistically sized TCP flows (in both directions) to
exercise it all.
Some of the data presented showed a single BQL'd queue at >400K, and
at 128K with hardware multi-queue, when TSO and GSO were used; with
hardware multi-queue and no TSO/GSO, BQL was closer to 30K.
That said to me that the "right" maximum size for a TSO/GSO "packet"
in this environment was closer to 12k, and the right size for BQL
30k, before it starts exerting backpressure on the qdisc.
This would reduce the potential inter-flow network latency by a factor
of 10 in the single hw-queue scenario, and 4 in the multi-queue one.
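Back-of-envelope arithmetic behind those factors (illustrative only;
the byte counts are just the ones quoted above):

# Rough serialization delay of a standing queue; illustrative only.
def drain_ms(queue_bytes, link_gbit):
    return queue_bytes * 8 / (link_gbit * 1e9) * 1e3

for label, q in (("single queue, TSO/GSO", 400e3),
                 ("multi-queue, TSO/GSO", 128e3),
                 ("multi-queue, no TSO/GSO", 30e3)):
    print("%-24s %5.2f ms at 1GbE, %6.3f ms at 10GbE"
          % (label, drain_ms(q, 1), drain_ms(q, 10)))
# 400K -> ~30-40K is roughly the factor of 10 above; 128K -> 30K the 4.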
It would probably cost some interrupts, and some throughput in
scenarios without packet loss, but in scenarios with lots of flows
each flow will ramp up in speed faster as you reduce the RTTs.
Paying attention to this would also push profiling activity into
areas of the stack where it might be profitable.
I would very much like to see profiles of what happens now, both here
and elsewhere in the stack, with this new code, with TSO/GSO sizes
capped as above, BQL capped to 30k, and a smarter qdisc like fq in use.
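Concretely, the sort of pinning I have in mind is roughly this (a
sketch only: the BQL limit_max sysfs file and tc are real interfaces,
but the device name and the 30k value are just the example above, and
capping the TSO/GSO size itself isn't shown since there is no single
obvious knob for that here):

import glob
import subprocess

DEV = "eth0"          # example device name, not from the thread
BQL_CAP = 30 * 1024   # the ~30k cap suggested above

# Pin BQL's upper bound on every tx queue of the device.
pattern = "/sys/class/net/%s/queues/tx-*/byte_queue_limits/limit_max" % DEV
for path in glob.glob(pattern):
    with open(path, "w") as f:
        f.write(str(BQL_CAP))

# Put a smarter qdisc (fq) at the root so the profiles reflect it.
subprocess.check_call(["tc", "qdisc", "replace", "dev", DEV, "root", "fq"])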
2) Most of the time, a server is not driving the wire to saturation. If
it is, you are doing something wrong. The BQL queues are empty, or
nearly so, so the instant someone creates a qdisc queue, it
drains.
But: if two or more flows are in contention, a qdisc queue that better
multiplexes them is highly desirable, and the stack should be smart
enough to make that overload last only briefly.
This is part of why I'm not fond of the deep and persistent BQL queues
we get today.
3) Pure ack-only workloads are rare. They are a useful test case, but...
4) I thought the ring-cleanup optimization was rather interesting and
could be made more dynamic.
5) I remain amazed at the vast improvements in throughput, the
reductions in interrupts, the lockless operation, and the RCU work that
have come out of this so far, but I had to make these points in the
hope that the big picture is retained.
It does no good to blast packets through the network unless there is a
high probability that they will actually be received on the other side.
thanks for listening.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
[not found] ` <CAL4WiioOodM60_Ud6qyGt1_uyeqpDeGFF6PtEKtGa=NwHhVFiw@mail.gmail.com>
@ 2014-10-13 16:58 ` Eric Dumazet
2014-10-13 17:20 ` Dave Taht
0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2014-10-13 16:58 UTC (permalink / raw)
To: Dave Taht
Cc: Alexander Duyck, Jesper Dangaard Brouer, Hannes Frederic Sowa,
John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
Florian Westphal, netdev@vger.kernel.org,
Toke Høiland-Jørgensen, Tom Herbert, David Miller
> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
> [...]
Dave
I am kind of surprised you wrote this nonsense.
Being able to send data at full speed has nothing to do with how
packets are scheduled. You are concerned about packet scheduling, and
not about how fast we can send raw data on a 40Gb NIC.
We made all these changes so that we can spend cpu cycles at the right
place.
There are reasons we have fq_codel, and fq. Do not forget this, please.
Take a deep breath, some coffee, and rethink.
Thanks
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
2014-10-13 16:58 ` Eric Dumazet
@ 2014-10-13 17:20 ` Dave Taht
2014-10-13 17:43 ` Tom Herbert
2014-10-13 20:27 ` Jesper Dangaard Brouer
0 siblings, 2 replies; 6+ messages in thread
From: Dave Taht @ 2014-10-13 17:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexander Duyck, Jesper Dangaard Brouer, Hannes Frederic Sowa,
John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
Florian Westphal, netdev@vger.kernel.org,
Toke Høiland-Jørgensen, Tom Herbert, David Miller
On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>> [...]
>
> Dave
>
> I am kind of surprised you wrote this nonsense.
Sorry, it was definitely pre-coffee! I get twitchy when all the joy
seems to be in spewing packets at high rates rather than optimizing
for low RTTs in packet-paired flow behavior.
> Being able to send data at full speed has nothing to do with how
> packets are scheduled. You are concerned about packet scheduling, and
> not about how fast we can send raw data on a 40Gb NIC.
I would like to also get better behavior out of gigE and below, and for
these changes to not impact the downstream behavior of the network
overall.
To give you an example, I would like to see the TCP flows in the
2nd chart here converge faster than the 5 seconds they currently
take at GigE speeds:
http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html
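To put a rough number on why the RTT matters for that convergence,
here is a crude slow-start model (illustrative only; the 500 Mbit/s
per-flow target, the MSS and the initial window are assumptions, not
anything measured in that chart):

import math

def ramp_ms(target_bps, rtt_ms, mss=1448, init_cwnd=10):
    # cwnd (in packets) needed to carry target_bps at this RTT, and
    # the number of doubling rounds to get there from the initial window.
    cwnd = target_bps * (rtt_ms / 1e3) / (8.0 * mss)
    rounds = max(0.0, math.log(cwnd / init_cwnd, 2))
    return rounds * rtt_ms

for rtt in (0.5, 2.0, 10.0, 40.0):
    print("RTT %5.1f ms -> ~%6.1f ms to reach a 500 Mbit/s share"
          % (rtt, ramp_ms(500e6, rtt)))

Every millisecond of queue-induced RTT multiplies the time it takes a
flow to find its share.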
> We made all these changes so that we can spend cpu cycles at the right
> place.
I grok that.
> There are reasons we have fq_codel, and fq. Do not forget this, please.
Which is why I was hoping to see profiles along the way showing where
else locks were being taken, where the cpu cycles were going, and what
the hotspots were when a smarter qdisc was engaged.
--
Dave Täht
https://www.bufferbloat.net/projects/make-wifi-fast
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
2014-10-13 17:20 ` Dave Taht
@ 2014-10-13 17:43 ` Tom Herbert
2014-10-13 20:27 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 6+ messages in thread
From: Tom Herbert @ 2014-10-13 17:43 UTC (permalink / raw)
To: Dave Taht
Cc: Eric Dumazet, Alexander Duyck, Jesper Dangaard Brouer,
Hannes Frederic Sowa, John Fastabend, Jamal Hadi Salim,
Daniel Borkmann, Florian Westphal, netdev@vger.kernel.org,
Toke Høiland-Jørgensen, David Miller
On Mon, Oct 13, 2014 at 10:20 AM, Dave Taht <dave.taht@gmail.com> wrote:
> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
>>> [...]
>>
>> Dave
>>
>> I am kind of surprised you wrote this nonsense.
>
> Sorry, it was definitely pre-coffee! I get twitchy when all the joy
> seems to be in spewing packets at high rates rather than optimizing
> for low RTTs in packet-paired flow behavior.
>
There are multiple dimensions we are trying to optimize for. Bulk
dequeue should not adversely affect latency; we are merely doing work
in one operation that was previously done in several.
The lack of granularity in GSO segments might be something that could
be addressed, though. When TSO/GSO is enabled with BQL we tend to see
larger limits than when they are disabled. This is because we treat a
GSO packet as a single packet all the way to queuing in the device.
The BQL limit is nominally 2*N+1 MSS, where N is the minimum number
needed to keep the queue full. With GSO the MSS is up to 64K, so a
limit of 192K is common (without TSO/GSO I see a limit of 30K).
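Reading that rule of thumb literally, in numbers (my own arithmetic,
illustrative only):

# (2*N + 1) segments, taking a whole GSO packet as one "segment".
def bql_limit(n_segments, seg_bytes):
    return (2 * n_segments + 1) * seg_bytes

print(bql_limit(1, 64 * 1024))   # 196608, i.e. the ~192K GSO case
print(30 * 1024 // 1500)         # ~20 MTU-sized frames fit in a 30K limit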
For GSO, it seems like we could split larger segments. For instance,
if in a bulk dequeue we need 30K but have a 64K segment next in the
qdisc, maybe we could do GSO on the first 30K of the segment and
requeue the other 34K to the qdisc.
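As a toy model of the byte accounting in that split (not the actual
skb/GSO code; it just rounds the budget down to an MSS boundary):

# Toy model: split a GSO super-packet against a bulk-dequeue byte budget.
def split_for_budget(seg_bytes, budget_bytes, mss=1448):
    take = min(seg_bytes, (budget_bytes // mss) * mss)  # whole MSS units
    if take <= 0:
        return 0, seg_bytes            # nothing fits; requeue it all
    return take, seg_bytes - take      # (segment now, requeue remainder)

print(split_for_budget(64 * 1024, 30 * 1024))   # -> (30408, 35128)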
With TSO we might do something similar, but it is probably harder to
get the granularity since TX completions are only done on whole TSO
packets (it might be interesting if a device could somehow report
partial completions).
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
2014-10-13 17:20 ` Dave Taht
2014-10-13 17:43 ` Tom Herbert
@ 2014-10-13 20:27 ` Jesper Dangaard Brouer
2014-10-13 20:47 ` Dave Taht
1 sibling, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2014-10-13 20:27 UTC (permalink / raw)
To: Dave Taht
Cc: Eric Dumazet, Alexander Duyck, Hannes Frederic Sowa,
John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
Florian Westphal, netdev@vger.kernel.org,
Toke Høiland-Jørgensen, Tom Herbert, David Miller,
brouer
On Mon, 13 Oct 2014 10:20:17 -0700 Dave Taht <dave.taht@gmail.com> wrote:
> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
[...]
>
> I would like to also get better behavior out of gigE and below, and for
> these changes to not impact the downstream behavior of the network
> overall.
I also care about 1Gbit/s and below; that is why I did so many tests
(with igb at 10Mbit/s, 100Mbit/s and 1Gbit/s).
> To give you an example, I would like to see the tcp flows in the
> 2nd chart here, to converge faster than the 5 seconds they currently
> take at GigE speeds.
>
> http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html
In the last graph you cannot saturate the link, because you turned off
GSO, GRO and TSO. Here I expect you will see the benefit of the qdisc
bulking: you should be able to saturate the link and still achieve the
lower latency, as BQL will cut off the bursts at +1 MTU.
I would be interested in the results...
> > We made all these changes so that we can spend cpu cycles at the right
> > place.
Exactly.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_
2014-10-13 20:27 ` Jesper Dangaard Brouer
@ 2014-10-13 20:47 ` Dave Taht
0 siblings, 0 replies; 6+ messages in thread
From: Dave Taht @ 2014-10-13 20:47 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, Alexander Duyck, Hannes Frederic Sowa,
John Fastabend, Jamal Hadi Salim, Daniel Borkmann,
Florian Westphal, netdev@vger.kernel.org,
Toke Høiland-Jørgensen, Tom Herbert, David Miller
On Mon, Oct 13, 2014 at 1:27 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 13 Oct 2014 10:20:17 -0700 Dave Taht <dave.taht@gmail.com> wrote:
>
>> On Mon, Oct 13, 2014 at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> >> On Oct 13, 2014 7:22 AM, "Dave Taht" <dave.taht@gmail.com> wrote:
> [...]
>>
>> I would like to also get better behavior out of gigE and below, and for
>> these changes to not impact the downstream behavior of the network
>> overall.
>
> I also care about 1Gbit/s and below; that is why I did so many tests
> (with igb at 10Mbit/s, 100Mbit/s and 1Gbit/s).
>
>
>> To give you an example, I would like to see the TCP flows in the
>> 2nd chart here converge faster than the 5 seconds they currently
>> take at GigE speeds:
>>
>> http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/results.html
>
> In the last graph you cannot saturate the link, because you turned
> off GSO, GRO and TSO. Here I expect you will see the benefit of the
> qdisc bulking: you should be able to saturate the link and still
> achieve the lower latency, as BQL will cut off the bursts at +1 MTU.
> I would be interested in the results...
I am too!!!! 5 seconds to converge? 50x the baseline latency when under load?
vs not being able to saturate the link at all? Ugh. Two lousy choices.
I think xmit_more will help a lot in the latter case, and my other
suggestions about reducing the size of the offloads will help in the
former.
But it looks like xmit_more support needs to be added to fq and
fq_codel (?), and even having read the patches submitted thus far, it
would be saner for someone else to patch in e1000e support for the nuc
(and the zillions of other e1000e platforms).
(Did I miss that patch go by?)
I'm certainly willing to test the result on that platform (and I have
some other tweaks
in my queue at the qdisc layer that I can throw in, also).
>> > We made all these changes so that we can spend cpu cycles at the right
>> > place.
>
> Exactly.
+1.
So what happens when more cpu cycles are available in the right place?
The dequeue routines in both fq and fq_codel are a bit more complex
than pfifo_fast's (and I've longed to kill the maxpacket concept in
codel, btw)... Y'all are in such a lovely place, with profilers and
hardware at the ready, to just tweak a simple sysctl and analyze what
happens at rates I can only dream of.
--
Dave Täht
https://www.bufferbloat.net/projects/make-wifi-fast
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2014-10-13 20:47 UTC | newest]
Thread overview: 6+ messages
-- links below jump to the message on this page --
2014-10-13 14:22 Network optimality (was Re: [PATCH net-next] qdisc: validate skb without holding lock_ Dave Taht
[not found] ` <CAL4WiioOodM60_Ud6qyGt1_uyeqpDeGFF6PtEKtGa=NwHhVFiw@mail.gmail.com>
2014-10-13 16:58 ` Eric Dumazet
2014-10-13 17:20 ` Dave Taht
2014-10-13 17:43 ` Tom Herbert
2014-10-13 20:27 ` Jesper Dangaard Brouer
2014-10-13 20:47 ` Dave Taht