* netif_rx packet dumping
@ 2005-03-03 20:38 Stephen Hemminger
2005-03-03 20:55 ` David S. Miller
2005-03-03 21:26 ` Baruch Even
0 siblings, 2 replies; 51+ messages in thread
From: Stephen Hemminger @ 2005-03-03 20:38 UTC (permalink / raw)
To: Injong Rhee, John Heffner, David S. Miller, Yee-Ting Li,
Baruch Even; +Cc: netdev
Both BIC TCP 1.1 and TCP-H include patches to disable the queue
throttling behaviour of netif_rx. The existing throttling algorithm
causes all packets to be dumped (until the queue empties) when the packet
backlog reaches netdev_max_backlog. I suppose this is some kind of DoS
prevention mechanism. The problem is that this dumping action creates
multiple packet losses that force TCP back into slow start.

But all this is really moot for any reasonably high-speed device because
of NAPI. netif_rx is not even used for a device that uses NAPI; the NAPI
code path uses netif_receive_skb, and receive queue management is done by
the receive scheduling (dev->quota) of the RX scheduler.

My question is: why did BIC TCP and TCP-H turn off the throttling?
Was it because they were/are using older 2.4 drivers without NAPI?
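As a rough illustration of the "dump till empty" policy described above,
here is a minimal userspace model (this is not the kernel source; the
struct, the helper name and the 300 default below are purely
illustrative stand-ins for the behaviour under discussion):

/* Model of the throttling policy: once the backlog hits the limit,
 * every further packet is dropped until the queue has drained to zero
 * (a drain path, not shown, would clear 'throttle' via the same test). */
#include <stdbool.h>
#include <stdio.h>

#define NETDEV_MAX_BACKLOG 300

struct backlog {
    int qlen;        /* packets currently queued     */
    bool throttle;   /* set once qlen hits the limit */
};

/* Returns true if the packet is queued, false if it is dropped. */
static bool model_netif_rx(struct backlog *b)
{
    if (b->throttle) {
        if (b->qlen == 0)
            b->throttle = false;   /* queue drained: accept again */
        else
            return false;          /* dump everything meanwhile   */
    }
    if (b->qlen >= NETDEV_MAX_BACKLOG) {
        b->throttle = true;        /* start dumping               */
        return false;
    }
    b->qlen++;
    return true;
}

int main(void)
{
    struct backlog b = { 0, false };
    int dropped = 0;

    for (int i = 0; i < 400; i++)      /* a burst of 400 packets */
        if (!model_netif_rx(&b))
            dropped++;
    printf("queued %d, dropped %d\n", b.qlen, dropped);
    return 0;
}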
* Re: netif_rx packet dumping
2005-03-03 20:38 netif_rx packet dumping Stephen Hemminger
@ 2005-03-03 20:55 ` David S. Miller
2005-03-03 21:01 ` Stephen Hemminger
2005-03-03 21:18 ` jamal
2005-03-03 21:26 ` Baruch Even
1 sibling, 2 replies; 51+ messages in thread
From: David S. Miller @ 2005-03-03 20:55 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: rhee, jheffner, Yee-Ting.Li, baruch, netdev
On Thu, 3 Mar 2005 12:38:11 -0800
Stephen Hemminger <shemminger@osdl.org> wrote:
> The existing throttling algorithm causes all packets to be dumped
> (until the queue empties) when the packet backlog reaches
> netdev_max_backlog. I suppose this is some kind of DoS prevention
> mechanism. The problem is that this dumping action creates multiple
> packet losses that force TCP back into slow start.
>
> But all this is really moot for any reasonably high-speed device
> because of NAPI. netif_rx is not even used for a device that uses
> NAPI; the NAPI code path uses netif_receive_skb, and receive queue
> management is done by the receive scheduling (dev->quota) of the RX
> scheduler.
Even without NAPI, netif_rx() ends up using the quota etc. mechanisms
when the queue gets processed via process_backlog().

ksoftirqd should handle CPU starvation issues at a higher level.

I think it is therefore safe to remove the netdev_max_backlog stuff
altogether. "300" is such a nonsensical setting, especially for gigabit
drivers which aren't using NAPI for whatever reason. It's even low
for a system with two 100Mbit devices.
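For reference, a minimal userspace model of the quota mechanism referred
to here: process_backlog() (like any NAPI poll routine) drains at most a
fixed budget of packets per pass before yielding back to the softirq
loop. The names and numbers are illustrative, not the kernel source:

#include <stdio.h>

static int backlog_qlen = 1000;   /* pretend 1000 packets are pending */

/* Drain up to 'quota' packets; return how many were actually handled. */
static int model_process_backlog(int quota)
{
    int done = 0;

    while (done < quota && backlog_qlen > 0) {
        backlog_qlen--;           /* "deliver" one packet upstream */
        done++;
    }
    return done;
}

int main(void)
{
    int pass = 0;

    while (backlog_qlen > 0) {
        int n = model_process_backlog(64);   /* e.g. a dev->quota of 64 */
        printf("pass %d: processed %d, %d left\n", ++pass, n, backlog_qlen);
        /* a real softirq would yield (to ksoftirqd) between passes */
    }
    return 0;
}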
* Re: netif_rx packet dumping
2005-03-03 20:55 ` David S. Miller
@ 2005-03-03 21:01 ` Stephen Hemminger
2005-03-03 21:18 ` jamal
1 sibling, 0 replies; 51+ messages in thread
From: Stephen Hemminger @ 2005-03-03 21:01 UTC (permalink / raw)
To: David S. Miller; +Cc: rhee, jheffner, Yee-Ting.Li, baruch, netdev
On Thu, 3 Mar 2005 12:55:56 -0800
"David S. Miller" <davem@davemloft.net> wrote:
> On Thu, 3 Mar 2005 12:38:11 -0800
> Stephen Hemminger <shemminger@osdl.org> wrote:
>
> > The existing throttling algorithm causes all packets to be dumped
> > (until the queue empties) when the packet backlog reaches
> > netdev_max_backlog. I suppose this is some kind of DoS prevention
> > mechanism. The problem is that this dumping action creates multiple
> > packet losses that force TCP back into slow start.
> >
> > But all this is really moot for any reasonably high-speed device
> > because of NAPI. netif_rx is not even used for a device that uses
> > NAPI; the NAPI code path uses netif_receive_skb, and receive queue
> > management is done by the receive scheduling (dev->quota) of the RX
> > scheduler.
>
> Even without NAPI, netif_rx() ends up using the quota etc. mechanisms
> when the queue gets processed via process_backlog().
>
> ksoftirqd should handle CPU starvation issues at a higher level.
>
> I think it is therefore safe to remove the netdev_max_backlog stuff
> altogether. "300" is such a nonsensical setting, especially for gigabit
> drivers which aren't using NAPI for whatever reason. It's even low
> for a system with two 100Mbit devices.
Okay, I already have a patchset to clean out the sample_stats and other
leftovers, so I'll add it to the tail of that.
* Re: netif_rx packet dumping
2005-03-03 20:55 ` David S. Miller
2005-03-03 21:01 ` Stephen Hemminger
@ 2005-03-03 21:18 ` jamal
2005-03-03 21:21 ` Stephen Hemminger
1 sibling, 1 reply; 51+ messages in thread
From: jamal @ 2005-03-03 21:18 UTC (permalink / raw)
To: David S. Miller
Cc: Stephen Hemminger, rhee, jheffner, Yee-Ting.Li, baruch, netdev
On Thu, 2005-03-03 at 15:55, David S. Miller wrote:
> On Thu, 3 Mar 2005 12:38:11 -0800
> Stephen Hemminger <shemminger@osdl.org> wrote:
>
> > The existing throttling algorithm causes all packets to be dumped
> > (until the queue empties) when the packet backlog reaches
> > netdev_max_backlog. I suppose this is some kind of DoS prevention
> > mechanism. The problem is that this dumping action creates multiple
> > packet losses that force TCP back into slow start.
> >
> > But all this is really moot for any reasonably high-speed device
> > because of NAPI. netif_rx is not even used for a device that uses
> > NAPI; the NAPI code path uses netif_receive_skb, and receive queue
> > management is done by the receive scheduling (dev->quota) of the RX
> > scheduler.
>
> Even without NAPI, netif_rx() ends up using the quota etc. mechanisms
> when the queue gets processed via process_backlog().
>
> ksoftirqd should handle CPU starvation issues at a higher level.
>
> I think it is therefore safe to remove the netdev_max_backlog stuff
> altogether. "300" is such a nonsensical setting, especially for gigabit
> drivers which aren't using NAPI for whatever reason. It's even low
> for a system with two 100Mbit devices.
A couple of issues with this:
- the rx softirq uses netdev_max_backlog as a constraint on how long to
run before yielding. Could probably fix that by having a different
variable. It may be fair to decouple those two in any case.
- if you don't put a restriction on how many netif_rx packets get queued,
then it is more than likely you will run into an OOM case for non-NAPI
drivers under interrupt overload. Could probably resolve this by
increasing the backlog size to several TCP window sizes (handwaving:
2?). What would be the optimal TCP window size in these big fat pipes,
assuming really low RTT?

I would say whoever is worried about this should use a NAPI driver;
otherwise you don't deserve that pipe!
cheers,
jamal
* Re: netif_rx packet dumping
2005-03-03 21:18 ` jamal
@ 2005-03-03 21:21 ` Stephen Hemminger
2005-03-03 21:24 ` jamal
0 siblings, 1 reply; 51+ messages in thread
From: Stephen Hemminger @ 2005-03-03 21:21 UTC (permalink / raw)
To: hadi; +Cc: David S. Miller, rhee, jheffner, Yee-Ting.Li, baruch, netdev
On 03 Mar 2005 16:18:08 -0500
jamal <hadi@cyberus.ca> wrote:
> On Thu, 2005-03-03 at 15:55, David S. Miller wrote:
> > On Thu, 3 Mar 2005 12:38:11 -0800
> > Stephen Hemminger <shemminger@osdl.org> wrote:
> >
> > > The existing throttling algorithm causes all packets to be dumped
> > > (until the queue empties) when the packet backlog reaches
> > > netdev_max_backlog. I suppose this is some kind of DoS prevention
> > > mechanism. The problem is that this dumping action creates multiple
> > > packet losses that force TCP back into slow start.
> > >
> > > But all this is really moot for any reasonably high-speed device
> > > because of NAPI. netif_rx is not even used for a device that uses
> > > NAPI; the NAPI code path uses netif_receive_skb, and receive queue
> > > management is done by the receive scheduling (dev->quota) of the RX
> > > scheduler.
> >
> > Even without NAPI, netif_rx() ends up using the quota etc. mechanisms
> > when the queue gets processed via process_backlog().
> >
> > ksoftirqd should handle CPU starvation issues at a higher level.
> >
> > I think it is therefore safe to remove the netdev_max_backlog stuff
> > altogether. "300" is such a nonsensical setting, especially for
> > gigabit drivers which aren't using NAPI for whatever reason. It's
> > even low for a system with two 100Mbit devices.
>
> A couple of issues with this:
> - the rx softirq uses netdev_max_backlog as a constraint on how long to
> run before yielding. Could probably fix that by having a different
> variable. It may be fair to decouple those two in any case.
> - if you don't put a restriction on how many netif_rx packets get
> queued, then it is more than likely you will run into an OOM case for
> non-NAPI drivers under interrupt overload. Could probably resolve this
> by increasing the backlog size to several TCP window sizes (handwaving:
> 2?). What would be the optimal TCP window size in these big fat pipes,
> assuming really low RTT?
>
> I would say whoever is worried about this should use a NAPI driver;
> otherwise you don't deserve that pipe!
My plan is to keep netdev_max_backlog but bump it up to something bigger
by default, maybe even autosizing it based on available memory, but to
get rid of the "dump till empty" behaviour that screws over TCP.
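A minimal sketch of what that plan could look like, in userspace model
form: plain tail drop (only the packet that does not fit is dropped, with
no "dump till empty" state), plus an autosized limit. The sizing
heuristic, the 2 KB per-packet estimate and the 300 floor below are
assumptions for illustration only, not the eventual patch:

#include <stdbool.h>
#include <stdio.h>

struct backlog {
    int qlen;
    int max;      /* plays the role of netdev_max_backlog */
};

/* Assumed heuristic: let the backlog use ~0.1% of memory at ~2 KB per
 * packet, but never go below the old default of 300. */
static int autosize_backlog(long mem_bytes)
{
    long pkts = (mem_bytes / 1000) / 2048;

    return pkts < 300 ? 300 : (int)pkts;
}

static bool enqueue(struct backlog *b)
{
    if (b->qlen >= b->max)
        return false;   /* drop just this packet; keep accepting later */
    b->qlen++;
    return true;
}

int main(void)
{
    struct backlog b = { 0, autosize_backlog(1024L * 1024 * 1024) };

    printf("limit on a 1 GB box: %d packets\n", b.max);
    printf("first packet %s\n", enqueue(&b) ? "queued" : "dropped");
    return 0;
}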
* Re: netif_rx packet dumping
2005-03-03 21:21 ` Stephen Hemminger
@ 2005-03-03 21:24 ` jamal
2005-03-03 21:32 ` David S. Miller
0 siblings, 1 reply; 51+ messages in thread
From: jamal @ 2005-03-03 21:24 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David S. Miller, rhee, jheffner, Yee-Ting.Li, baruch, netdev
On Thu, 2005-03-03 at 16:21, Stephen Hemminger wrote:
> My plan is to keep netdev_max_backlog but bump it up to something bigger
> by default, maybe even autosizing it based on available memory, but to
> get rid of the "dump till empty" behaviour that screws over TCP.
Ok, this does sound more reasonable. Out of curiosity, are packets being
dropped at the socket queue? Why is the "dump till empty" behaviour
screwing over TCP?
cheers,
jamal
* Re: netif_rx packet dumping
2005-03-03 20:38 netif_rx packet dumping Stephen Hemminger
2005-03-03 20:55 ` David S. Miller
@ 2005-03-03 21:26 ` Baruch Even
2005-03-03 21:36 ` David S. Miller
2005-03-03 22:03 ` jamal
1 sibling, 2 replies; 51+ messages in thread
From: Baruch Even @ 2005-03-03 21:26 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Injong Rhee, John Heffner, David S. Miller, Yee-Ting Li, netdev
Stephen Hemminger wrote:
> Both BIC TCP 1.1 and TCP-H include patches to disable the queue
> throttling behaviour of netif_rx. The existing throttling algorithm
> causes all packets to be dumped (until the queue empties) when the packet
> backlog reaches netdev_max_backlog. I suppose this is some kind of DoS
> prevention mechanism. The problem is that this dumping action creates
> multiple packet losses that force TCP back into slow start.
>
> But all this is really moot for any reasonably high-speed device because
> of NAPI. netif_rx is not even used for a device that uses NAPI; the NAPI
> code path uses netif_receive_skb, and receive queue management is done by
> the receive scheduling (dev->quota) of the RX scheduler.
>
> My question is: why did BIC TCP and TCP-H turn off the throttling?
> Was it because they were/are using older 2.4 drivers without NAPI?
NAPI was not used because it caused skews in performance; I haven't
tested it myself, I'm just passing on hearsay.

I have patches for the SACK processing to improve performance, which
should reduce the problems with the queues, but they are for 2.6.6 and
forward-porting them to 2.6.11 is quite a bit of work (too much has
changed in conflicting areas). I hope to get to work on this soon.

The bad effect of the queue throttling was mostly that it killed the ACK
clock and that recovery only happened on a timeout. Preferably, only the
packets that don't fit should be dropped; we should not wait for the
queue to empty.
Baruch
* Re: netif_rx packet dumping
2005-03-03 21:24 ` jamal
@ 2005-03-03 21:32 ` David S. Miller
2005-03-03 21:54 ` Stephen Hemminger
2005-03-03 22:01 ` jamal
0 siblings, 2 replies; 51+ messages in thread
From: David S. Miller @ 2005-03-03 21:32 UTC (permalink / raw)
To: hadi; +Cc: shemminger, rhee, jheffner, Yee-Ting.Li, baruch, netdev
On 03 Mar 2005 16:24:25 -0500
jamal <hadi@cyberus.ca> wrote:
> Ok, this does sound more reasonable. Out of curiosity, are packets being
> dropped at the socket queue? Why is "dump till empty" behaviour screwing
> over TCP.
Because it does the same thing tail-drop in routers does.
It makes everything back off a lot and go into slow start.
If we'd just drop 1 packet per flow or something like that
(so it could be fixed with a quick fast retransmit), TCP
would avoid regressing into slow start.
You say "use a NAPI driver", but netif_rx() _IS_ a NAPI driver.
process_backlog() adheres to quotas and every other stabilizing
effect NAPI drivers use; the only missing part is the RX interrupt
disabling.
We should eliminate the max backlog thing completely. There is
no need for it.
* Re: netif_rx packet dumping
2005-03-03 21:26 ` Baruch Even
@ 2005-03-03 21:36 ` David S. Miller
2005-03-03 21:44 ` Baruch Even
2005-03-03 22:03 ` jamal
1 sibling, 1 reply; 51+ messages in thread
From: David S. Miller @ 2005-03-03 21:36 UTC (permalink / raw)
To: Baruch Even; +Cc: shemminger, rhee, jheffner, Yee-Ting.Li, netdev
On Thu, 03 Mar 2005 21:26:58 +0000
Baruch Even <baruch@ev-en.org> wrote:
> I have patches for the SACK processing to improve performance which
> should reduce the problems with the queues, but they are for 2.6.6 and
> forward porting them to 2.6.11 is quite a bit of work (too much was
> changed in conflicting areas). I hope to get to work on this soon.
Please split up your patches properly this time. Last time you
split up the patches, there were common changes in several of the
patch files. It looked like you just hand-edited the patches in
order to split up the changes, or something like that, which was
very error-prone and made review impossible.
And I'm not accepting your changes if you're going to still add all
that linked list stuff to the generic struct sk_buff. Adding anything
new to sk_buff is going to make it straddle more L2 cache lines on
ia64 and other 64-bit systems and that totally kills performance.
* Re: netif_rx packet dumping
2005-03-03 21:36 ` David S. Miller
@ 2005-03-03 21:44 ` Baruch Even
2005-03-03 21:54 ` Andi Kleen
2005-03-03 21:57 ` David S. Miller
0 siblings, 2 replies; 51+ messages in thread
From: Baruch Even @ 2005-03-03 21:44 UTC (permalink / raw)
To: David S. Miller; +Cc: shemminger, rhee, jheffner, Yee-Ting.Li, netdev
David S. Miller wrote:
> On Thu, 03 Mar 2005 21:26:58 +0000
> Baruch Even <baruch@ev-en.org> wrote:
>
>
>>I have patches for the SACK processing to improve performance which
>>should reduce the problems with the queues, but they are for 2.6.6 and
>>forward porting them to 2.6.11 is quite a bit of work (too much was
>>changed in conflicting areas). I hope to get to work on this soon.
>
> Please split up your patches properly this time. Last time you
> split up the patches, there was common changes in several of the
> patch files. It looked like you just hand edited the patches in
> order to split up the changes, or something like that, and it's
> very error prone and made review impossible.
That was before my time; I've cleaned that up since then.
> And I'm not accepting your changes if you're going to still add all
> that linked list stuff to the generic struct sk_buff. Adding anything
> new to sk_buff is going to make it straddle more L2 cache lines on
> ia64 and other 64-bit systems and that totally kills performance.
That's a bit more of a problem; that's the exact performance improvement
we are trying to add!

The current linked list goes over all the packets, while the linked list
we add covers only the packets that were not SACKed. The idea is that
this is a lot faster, since there are far fewer packets that have not
been SACKed than packets already SACKed (or never mentioned in SACKs).

If you have a way around this I'd be happy to hear it.
Baruch
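A toy userspace illustration of the idea (purely illustrative; the field
and function names are made up and this is not the actual patch): thread
an extra list through only the not-yet-SACKed packets, so SACK processing
walks the short un-SACKed list instead of every packet in the retransmit
queue.

#include <stdio.h>

struct pkt {
    unsigned int seq;
    int sacked;
    struct pkt *next_unsacked;   /* the extra pointer under discussion */
};

/* Mark 'p' SACKed and unlink it from the un-SACKed list.  With only a
 * single forward pointer, the previous element has to be re-found here,
 * which is the "extra work" trade-off mentioned later in the thread. */
static void mark_sacked(struct pkt **head, struct pkt *p)
{
    struct pkt **pp = head;

    while (*pp && *pp != p)
        pp = &(*pp)->next_unsacked;
    if (*pp) {
        *pp = p->next_unsacked;
        p->sacked = 1;
    }
}

int main(void)
{
    struct pkt q[5];
    struct pkt *unsacked = NULL, **tail = &unsacked;

    for (int i = 0; i < 5; i++) {         /* five packets in flight */
        q[i] = (struct pkt){ .seq = 1000 + 1000u * i };
        *tail = &q[i];
        tail = &q[i].next_unsacked;
    }
    *tail = NULL;

    mark_sacked(&unsacked, &q[1]);        /* a SACK covers packets 2 and 3 */
    mark_sacked(&unsacked, &q[2]);

    for (struct pkt *p = unsacked; p; p = p->next_unsacked)
        printf("still un-SACKed: seq %u\n", p->seq);
    return 0;
}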
* Re: netif_rx packet dumping
2005-03-03 21:32 ` David S. Miller
@ 2005-03-03 21:54 ` Stephen Hemminger
2005-03-03 22:02 ` John Heffner
2005-03-04 19:49 ` Jason Lunz
2005-03-03 22:01 ` jamal
1 sibling, 2 replies; 51+ messages in thread
From: Stephen Hemminger @ 2005-03-03 21:54 UTC (permalink / raw)
To: David S. Miller; +Cc: hadi, rhee, jheffner, Yee-Ting.Li, baruch, netdev
On Thu, 3 Mar 2005 13:32:37 -0800
"David S. Miller" <davem@davemloft.net> wrote:
> On 03 Mar 2005 16:24:25 -0500
> jamal <hadi@cyberus.ca> wrote:
>
> > Ok, this does sound more reasonable. Out of curiosity, are packets being
> > dropped at the socket queue? Why is "dump till empty" behaviour screwing
> > over TCP.
>
> Because it does the same thing tail-drop in routers do.
> It makes everything back off a lot and go into slow start.
> If we'd just drop 1 packet per flow or something like that
> (so it could be fixed with a quick fast retransmit), TCP
> would avoid regressing into slow start.
Maybe a simple Random Exponential Drop (RED) would be more friendly.
> You say "use a NAPI driver", but netif_rx() _IS_ a NAPI driver.
> process_backlog() adheres to quotas and every other stabilizing
> effect NAPI drivers use, the only missing part is the RX interrupt
> disabling.
>
> We should eliminate the max backlog thing completely. There is
> no need for it.
We still need some bound, because if process_backlog is running slower
than the network, then the queue could grow until memory is exhausted
by skbuffs.
* Re: netif_rx packet dumping
2005-03-03 21:44 ` Baruch Even
@ 2005-03-03 21:54 ` Andi Kleen
2005-03-03 22:04 ` David S. Miller
2005-03-03 21:57 ` David S. Miller
1 sibling, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2005-03-03 21:54 UTC (permalink / raw)
To: Baruch Even; +Cc: shemminger, rhee, jheffner, Yee-Ting.Li, netdev
Baruch Even <baruch@ev-en.org> writes:
>
> If you have a way around this I'd be happy to hear it.
There may be free space left over in skb->cb that you can use
for this.
-Andi
* Re: netif_rx packet dumping
2005-03-03 21:44 ` Baruch Even
2005-03-03 21:54 ` Andi Kleen
@ 2005-03-03 21:57 ` David S. Miller
2005-03-03 22:14 ` Baruch Even
` (2 more replies)
1 sibling, 3 replies; 51+ messages in thread
From: David S. Miller @ 2005-03-03 21:57 UTC (permalink / raw)
To: Baruch Even; +Cc: shemminger, rhee, jheffner, Yee-Ting.Li, netdev
On Thu, 03 Mar 2005 21:44:52 +0000
Baruch Even <baruch@ev-en.org> wrote:
> The current linked list goes over all the packets, the linked list we
> add is for the packets that were not SACKed. The idea being that it is a
> lot faster since there are a lot less packets not SACKed compared to
> packets already SACKed (or never mentioned in SACKs).
>
> If you have a way around this I'd be happy to hear it.
I'm sure you can find a way to steal sizeof(void *) from
"struct tcp_skb_cb" :-)
It is currently 36 bytes on both 32-bit and 64-bit platforms.
This means that if you can squeeze out 4 bytes (so that it fits
in the 40-byte skb->cb[] area), you can fit a pointer in there
for the linked list stuff.
I'll try to brainstorm on this as well.
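The arithmetic being discussed, as a small compile-and-run check (the
struct below is a stand-in, not the real tcp_skb_cb layout): skb->cb[]
gives 40 bytes, the TCP control block currently uses 36, so an 8-byte
pointer on a 64-bit platform only fits if 4 bytes are squeezed out
elsewhere.

#include <stdint.h>
#include <stdio.h>

#define SKB_CB_SIZE 40

struct model_tcp_skb_cb {
    uint32_t seq, end_seq, when, ack_seq;   /* 16 bytes                 */
    uint32_t misc[4];                       /* 16 bytes of other state  */
    /* 4 bytes squeezed out of the original 36 make room ...            */
    void *next_unsacked;                    /* ... for the list pointer */
};

int main(void)
{
    printf("model cb uses %zu of %d bytes\n",
           sizeof(struct model_tcp_skb_cb), SKB_CB_SIZE);
    return sizeof(struct model_tcp_skb_cb) <= SKB_CB_SIZE ? 0 : 1;
}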
* Re: netif_rx packet dumping
2005-03-03 21:32 ` David S. Miller
2005-03-03 21:54 ` Stephen Hemminger
@ 2005-03-03 22:01 ` jamal
1 sibling, 0 replies; 51+ messages in thread
From: jamal @ 2005-03-03 22:01 UTC (permalink / raw)
To: David S. Miller; +Cc: shemminger, rhee, jheffner, Yee-Ting.Li, baruch, netdev
On Thu, 2005-03-03 at 16:32, David S. Miller wrote:
> On 03 Mar 2005 16:24:25 -0500
> jamal <hadi@cyberus.ca> wrote:
>
> > Ok, this does sound more reasonable. Out of curiosity, are packets being
> > dropped at the socket queue? Why is "dump till empty" behaviour screwing
> > over TCP.
>
> Because it does the same thing tail-drop in routers do.
> It makes everything back off a lot and go into slow start.
> If we'd just drop 1 packet per flow or something like that
> (so it could be fixed with a quick fast retransmit), TCP
> would avoid regressing into slow start.
>
> You say "use a NAPI driver", but netif_rx() _IS_ a NAPI driver.
> process_backlog() adheres to quotas and every other stabilizing
> effect NAPI drivers use, the only missing part is the RX interrupt
> disabling.
>
That's true, but it's the RX interrupt disabling that worries me,
and the fact that memory is finite.
Let me throw out some worst-case scenarios:
In the (ahem) "old" days when 100Mbps was hip, at 148kpps (translating
to about 1-2 interrupts per packet) you pretty much fill that queue
quickly before it is processed. You could say the Pentium-2 class used
then was "slow", but it probably has the same compute capacity as most
of the embedded systems out there today. On GigE we are talking about
queueing up to 100K packets per ethX.
I realize I am using an unreasonable worst case, but eventually it
becomes a choice of when to stop accepting packets in order to maintain
sanity.
cheers,
jamal
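For reference, the 148kpps figure is just line rate divided by the
minimum frame's footprint on the wire (64-byte frame plus 8-byte preamble
plus 12-byte inter-frame gap, i.e. 672 bits per packet); no kernel code
involved:

#include <stdio.h>

int main(void)
{
    const double wire_bits = (64 + 8 + 12) * 8;            /* 672 bits */

    printf("100 Mbit/s: %.0f pps\n", 100e6 / wire_bits);   /* ~148810  */
    printf("  1 Gbit/s: %.0f pps\n",   1e9 / wire_bits);   /* ~1488095 */
    return 0;
}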
* Re: netif_rx packet dumping
2005-03-03 21:54 ` Stephen Hemminger
@ 2005-03-03 22:02 ` John Heffner
2005-03-03 22:26 ` jamal
2005-03-04 19:49 ` Jason Lunz
1 sibling, 1 reply; 51+ messages in thread
From: John Heffner @ 2005-03-03 22:02 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David S. Miller, hadi, rhee, Yee-Ting.Li, baruch, netdev
On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> On Thu, 3 Mar 2005 13:32:37 -0800
> "David S. Miller" <davem@davemloft.net> wrote:
>
> > On 03 Mar 2005 16:24:25 -0500
> > jamal <hadi@cyberus.ca> wrote:
> >
> > > Ok, this does sound more reasonable. Out of curiosity, are packets being
> > > dropped at the socket queue? Why is "dump till empty" behaviour screwing
> > > over TCP.
> >
> > Because it does the same thing tail-drop in routers do.
> > It makes everything back off a lot and go into slow start.
> > If we'd just drop 1 packet per flow or something like that
> > (so it could be fixed with a quick fast retransmit), TCP
> > would avoid regressing into slow start.
>
> Maybe a simple Random Exponential Drop (RED) would be more friendly.
That would probably not be appropriate. This queue is only for absorbing
micro-scale bursts. It should not hold any data in steady state like a
router queue can. The receive window can handle the macro-scale flow
control.
-John
* Re: netif_rx packet dumping
2005-03-03 21:26 ` Baruch Even
2005-03-03 21:36 ` David S. Miller
@ 2005-03-03 22:03 ` jamal
2005-03-03 22:31 ` Baruch Even
1 sibling, 1 reply; 51+ messages in thread
From: jamal @ 2005-03-03 22:03 UTC (permalink / raw)
To: Baruch Even
Cc: Stephen Hemminger, Injong Rhee, John Heffner, David S. Miller,
Yee-Ting Li, netdev
On Thu, 2005-03-03 at 16:26, Baruch Even wrote:
> NAPI was not used because it caused skews in the performance, I haven't
> tested it myself, just passing hearsay.
It would help to post on this list when such issues are noticed.
It could be a simple driver bug, such as the one Lennert posted about two
days ago for e1000 NAPI - such a bug could have had serious repercussions
on TCP because it occasionally sat on packets in DMA for up to 2 seconds.
Seems like that bug has been sitting there for a long time.
What kernel version? What kind of skews?
Is it possible for you to ask this person to repeat the tests with 2.6.11?
cheers,
jamal
* Re: netif_rx packet dumping
2005-03-03 21:54 ` Andi Kleen
@ 2005-03-03 22:04 ` David S. Miller
0 siblings, 0 replies; 51+ messages in thread
From: David S. Miller @ 2005-03-03 22:04 UTC (permalink / raw)
To: Andi Kleen; +Cc: baruch, shemminger, rhee, jheffner, Yee-Ting.Li, netdev
On Thu, 03 Mar 2005 22:54:24 +0100
Andi Kleen <ak@muc.de> wrote:
> Baruch Even <baruch@ev-en.org> writes:
> >
> > If you have a way around this I'd be happy to hear it.
>
> There may be free space left over in skb->cb that you can use
> for this.
There are 4 bytes, enough for 32-bit but not for 64-bit platforms.
Also, see my other reply to his posting.
* Re: netif_rx packet dumping
2005-03-03 21:57 ` David S. Miller
@ 2005-03-03 22:14 ` Baruch Even
2005-03-08 15:42 ` Baruch Even
2005-03-31 16:33 ` Baruch Even
2 siblings, 0 replies; 51+ messages in thread
From: Baruch Even @ 2005-03-03 22:14 UTC (permalink / raw)
To: David S. Miller; +Cc: shemminger, rhee, jheffner, Yee-Ting.Li, netdev
David S. Miller wrote:
> On Thu, 03 Mar 2005 21:44:52 +0000
> Baruch Even <baruch@ev-en.org> wrote:
>
>
>>The current linked list goes over all the packets, the linked list we
>>add is for the packets that were not SACKed. The idea being that it is a
>>lot faster since there are a lot less packets not SACKed compared to
>>packets already SACKed (or never mentioned in SACKs).
>>
>>If you have a way around this I'd be happy to hear it.
>
> I'm sure you can find a way to steal sizeof(void *) from
> "struct tcp_skb_cb" :-)
>
> It is currently 36 bytes on both 32-bit and 64-bit platforms.
> This means if you can squeeze out 4 bytes (so that it fits
> in the skb->cb[] 40 byte area), you can fit a pointer in there
> for the linked list stuff.
Stephen has a patch to move some of the extra congestion control data
into its own struct; that would free some space for me :-)
I'll need to take a look at this again; the original patch actually
increased the size of the cb beyond 40 bytes to get some extra space.
Baruch
* Re: netif_rx packet dumping
2005-03-03 22:02 ` John Heffner
@ 2005-03-03 22:26 ` jamal
2005-03-03 23:16 ` Stephen Hemminger
2005-03-04 19:52 ` Edgar E Iglesias
0 siblings, 2 replies; 51+ messages in thread
From: jamal @ 2005-03-03 22:26 UTC (permalink / raw)
To: John Heffner
Cc: Stephen Hemminger, David S. Miller, rhee, Yee-Ting.Li, baruch,
netdev
On Thu, 2005-03-03 at 17:02, John Heffner wrote:
> On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> > Maybe a simple Random Exponential Drop (RED) would be more friendly.
>
> That would probably not be appropriate. This queue is only for absorbing
> micro-scale bursts. It should not hold any data in steady state like a
> router queue can. The receive window can handle the macro scale flow
> control.
Recall this is a queue that is potentially shared by many, many flows
from potentially many, many interfaces, i.e. it deals with many
micro-scale bursts.
Clearly, the best approach is to have lots and lots of memory and to
make that queue really huge so it can cope with all of them all the time.
We don't have that luxury - if you restrict the queue size, you will have
to drop packets... Which ones?
Probably the simplest solution is to leave it as is right now and just
adjust the constraints based on your system memory, etc.
cheers,
jamal
* Re: netif_rx packet dumping
2005-03-03 22:03 ` jamal
@ 2005-03-03 22:31 ` Baruch Even
0 siblings, 0 replies; 51+ messages in thread
From: Baruch Even @ 2005-03-03 22:31 UTC (permalink / raw)
To: hadi
Cc: Stephen Hemminger, Injong Rhee, John Heffner, David S. Miller,
Yee-Ting Li, netdev
jamal wrote:
> On Thu, 2005-03-03 at 16:26, Baruch Even wrote:
>
>
>>NAPI was not used because it caused skews in the performance, I haven't
>>tested it myself, just passing hearsay.
>
> It will help to post on this list when such issues are noticed.
> It could be a simple a driver bug: such a the one posted on by Lennert 2
> days ago on e1000 NAPI - such a bug could have had serious repurcasions
> on TCP because it sat on packets in DMA occasionally upto 2 seconds.
> Seems like that bug has been sitting there for a long time.
> What kernel version? What kind of skews?
> Is it possible you tell this person to repeat the tests with 2.6.11?
I have asked around, but there is no hard data to substantiate the claim
at this time, and since then all tests were done non-NAPI style.
The kernel version was probably 2.4.23 or some other 2.4 kernel.
Baruch
* Re: netif_rx packet dumping
2005-03-03 22:26 ` jamal
@ 2005-03-03 23:16 ` Stephen Hemminger
2005-03-03 23:40 ` jamal
` (2 more replies)
2005-03-04 19:52 ` Edgar E Iglesias
1 sibling, 3 replies; 51+ messages in thread
From: Stephen Hemminger @ 2005-03-03 23:16 UTC (permalink / raw)
To: hadi; +Cc: John Heffner, David S. Miller, rhee, Yee-Ting.Li, baruch, netdev
On 03 Mar 2005 17:26:51 -0500
jamal <hadi@cyberus.ca> wrote:
> On Thu, 2005-03-03 at 17:02, John Heffner wrote:
> > On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> > > Maybe a simple Random Exponential Drop (RED) would be more friendly.
> >
> > That would probably not be appropriate. This queue is only for absorbing
> > micro-scale bursts. It should not hold any data in steady state like a
> > router queue can. The receive window can handle the macro scale flow
> > control.
>
> recall this is a queue that is potentially shared by many many flows
> from potentially many many interfaces i.e it deals with many many
> micro-scale bursts.
> Clearly, the best approach is to have lots and lots of memmory and to
> make that queue real huge so it can cope with all of them all the time.
> We dont have that luxury - If you restrict the queue size, you will have
> to drop packets... Which ones?
> Probably simplest solution is to leave it as is right now and just
> adjust the contraints based on your system memmory etc.
Another alternative would be some form of adaptive threshold,
something like the adaptive drop-tail scheme described in this paper:
http://www.ee.cityu.edu.hk/~gchen/pdf/ITC18.pdf
Since netif_rx is running at interrupt time, it has to be simple/quick.
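For concreteness, one possible shape of an adaptive drop-tail threshold -
this is NOT the algorithm from the cited paper, just an illustrative
userspace sketch of the general idea: shrink the limit when forced to
drop, grow it back slowly while the queue stays shallow.

#include <stdbool.h>
#include <stdio.h>

struct adaptive_backlog {
    int qlen;
    int limit;        /* current threshold                     */
    int limit_min;    /* never adapt below this                */
    int limit_max;    /* never adapt above this (memory bound) */
};

static bool adaptive_enqueue(struct adaptive_backlog *b)
{
    if (b->qlen >= b->limit) {
        /* overload: halve the threshold so congestion is signalled sooner */
        b->limit = b->limit / 2 > b->limit_min ? b->limit / 2 : b->limit_min;
        return false;                       /* tail-drop this packet */
    }
    b->qlen++;
    if (b->qlen < b->limit / 4 && b->limit < b->limit_max)
        b->limit++;                         /* quiet again: relax the limit */
    return true;
}

int main(void)
{
    struct adaptive_backlog b = { 0, 300, 75, 3000 };
    int dropped = 0;

    for (int i = 0; i < 600; i++)           /* sustained overload */
        if (!adaptive_enqueue(&b))
            dropped++;
    printf("queued %d, dropped %d, final limit %d\n",
           b.qlen, dropped, b.limit);
    return 0;
}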
* Re: netif_rx packet dumping
2005-03-03 23:16 ` Stephen Hemminger
@ 2005-03-03 23:40 ` jamal
2005-03-03 23:48 ` Baruch Even
2005-03-03 23:48 ` John Heffner
2 siblings, 0 replies; 51+ messages in thread
From: jamal @ 2005-03-03 23:40 UTC (permalink / raw)
To: Stephen Hemminger
Cc: John Heffner, David S. Miller, rhee, Yee-Ting.Li, baruch, netdev
On Thu, 2005-03-03 at 18:16, Stephen Hemminger wrote:
> On 03 Mar 2005 17:26:51 -0500
> jamal <hadi@cyberus.ca> wrote:
>
> > On Thu, 2005-03-03 at 17:02, John Heffner wrote:
> > > On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> > > > Maybe a simple Random Exponential Drop (RED) would be more friendly.
> > >
> > > That would probably not be appropriate. This queue is only for absorbing
> > > micro-scale bursts. It should not hold any data in steady state like a
> > > router queue can. The receive window can handle the macro scale flow
> > > control.
> >
> > recall this is a queue that is potentially shared by many many flows
> > from potentially many many interfaces i.e it deals with many many
> > micro-scale bursts.
> > Clearly, the best approach is to have lots and lots of memmory and to
> > make that queue real huge so it can cope with all of them all the time.
> > We dont have that luxury - If you restrict the queue size, you will have
> > to drop packets... Which ones?
> > Probably simplest solution is to leave it as is right now and just
> > adjust the contraints based on your system memmory etc.
>
> Another alternative would be some form of adaptive threshold,
> something like adaptive drop tail described in this paper.
> http://www.ee.cityu.edu.hk/~gchen/pdf/ITC18.pdf
>
> Since netif_rx is running at interrupt time, it has to be simple/quick.
Note we do have some form of "detection" in place which maintains
history via an EWMA through get_sample_stats(). This essentially tells
you whether the average queue size is growing and how fast, i.e. a
first-order differential. The idea was for drivers to use this
information to back off and not hit netif_rx() for a period of time.
You could use that scheme to kick in and adjust the backlog - I think
that would be a cheap effort (if you decouple it from the rx softirq).
This would be much simpler than the scheme described in the paper.

I think to really make progress, though, you need to "cheaply" recognize
things at a flow level. You don't have to be very accurate; something
like a Bloom filter with timers to clear state would do fine. But that's
more work than the above suggestion.
cheers,
jamal
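A sketch of that EWMA idea in userspace form (the weight, threshold and
sample values are illustrative; this is not the actual get_sample_stats()
code): keep a running average of the backlog length and look at how fast
successive averages grow.

#include <stdio.h>

struct queue_stats {
    double avg;        /* EWMA of the queue length        */
    double prev_avg;   /* previous average, for the trend */
};

static void sample(struct queue_stats *s, int qlen)
{
    s->prev_avg = s->avg;
    s->avg = 0.875 * s->avg + 0.125 * qlen;   /* avg += (qlen - avg) / 8 */
}

int main(void)
{
    struct queue_stats s = { 0.0, 0.0 };
    int qlen[] = { 10, 40, 120, 260, 300, 300 };   /* a building backlog */

    for (int i = 0; i < 6; i++) {
        double growth;

        sample(&s, qlen[i]);
        growth = s.avg - s.prev_avg;               /* first-order trend */
        printf("qlen %3d  avg %6.1f  growth %+6.1f%s\n",
               qlen[i], s.avg, growth,
               growth > 10.0 ? "  -> back off / adapt the limit" : "");
    }
    return 0;
}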
* Re: netif_rx packet dumping
2005-03-03 23:16 ` Stephen Hemminger
2005-03-03 23:40 ` jamal
@ 2005-03-03 23:48 ` Baruch Even
2005-03-04 3:45 ` jamal
2005-03-03 23:48 ` John Heffner
2 siblings, 1 reply; 51+ messages in thread
From: Baruch Even @ 2005-03-03 23:48 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hadi, John Heffner, David S. Miller, rhee, Yee-Ting.Li, netdev
Stephen Hemminger wrote:
> On 03 Mar 2005 17:26:51 -0500
> jamal <hadi@cyberus.ca> wrote:
>
>>On Thu, 2005-03-03 at 17:02, John Heffner wrote:
>>
>>>On Thu, 3 Mar 2005, Stephen Hemminger wrote:
>>>
>>>>Maybe a simple Random Exponential Drop (RED) would be more friendly.
>>>
>>>That would probably not be appropriate. This queue is only for absorbing
>>>micro-scale bursts. It should not hold any data in steady state like a
>>>router queue can. The receive window can handle the macro scale flow
>>>control.
>>
>>recall this is a queue that is potentially shared by many many flows
>>from potentially many many interfaces i.e it deals with many many
>>micro-scale bursts.
>>Clearly, the best approach is to have lots and lots of memmory and to
>>make that queue real huge so it can cope with all of them all the time.
>>We dont have that luxury - If you restrict the queue size, you will have
>>to drop packets... Which ones?
>>Probably simplest solution is to leave it as is right now and just
>>adjust the contraints based on your system memmory etc.
>
> Another alternative would be some form of adaptive threshold,
> something like adaptive drop tail described in this paper.
> http://www.ee.cityu.edu.hk/~gchen/pdf/ITC18.pdf
What is the purpose of all of these schemes?
The queue is there to handle short bursts of packets when the network
stack cannot keep up. The bad behaviour was the throttling of the
queue; the smart schemes are not going to make things that much better
if the hardware/software can't keep up.
Even adding more memory to the queue is not going to make a big
difference; it will just delay the inevitable and add some more
queueing latency to the connections.
Baruch
* Re: netif_rx packet dumping
2005-03-03 23:16 ` Stephen Hemminger
2005-03-03 23:40 ` jamal
2005-03-03 23:48 ` Baruch Even
@ 2005-03-03 23:48 ` John Heffner
2005-03-04 1:42 ` Lennert Buytenhek
2 siblings, 1 reply; 51+ messages in thread
From: John Heffner @ 2005-03-03 23:48 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hadi, David S. Miller, rhee, Yee-Ting.Li, baruch, netdev
On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> Another alternative would be some form of adaptive threshold,
> something like adaptive drop tail described in this paper.
> http://www.ee.cityu.edu.hk/~gchen/pdf/ITC18.pdf
>
> Since netif_rx is running at interrupt time, it has to be simple/quick.
All these AQM schemes are trying to solve a fundamentally different
problem. With TCP at least, the only congestion experienced at this point
will be transient, so you do not want to send any congestion signals (drop
packets) if you can avoid it at all. Making the limit as high as you can
tolerate seems like the best thing to me.
I am a bit worried that removing the throttling *is* a DoS risk, though.
-John
* Re: netif_rx packet dumping
2005-03-03 23:48 ` John Heffner
@ 2005-03-04 1:42 ` Lennert Buytenhek
2005-03-04 3:10 ` John Heffner
0 siblings, 1 reply; 51+ messages in thread
From: Lennert Buytenhek @ 2005-03-04 1:42 UTC (permalink / raw)
To: John Heffner
Cc: Stephen Hemminger, hadi, David S. Miller, rhee, Yee-Ting.Li,
baruch, netdev
On Thu, Mar 03, 2005 at 06:48:50PM -0500, John Heffner wrote:
> > Another alternative would be some form of adaptive threshold,
> > something like adaptive drop tail described in this paper.
> > http://www.ee.cityu.edu.hk/~gchen/pdf/ITC18.pdf
> >
> > Since netif_rx is running at interrupt time, it has to be simple/quick.
>
> All these AQM schemes are trying to solve a fundamentally different
> problem. With TCP at least, the only congestion experienced at this point
> will be transient, so you do not want to send any congestion signals (drop
> packets) if you can avoid it at all. Making the limit as high as you can
> tolerate seems like the best thing to me.
If the traffic does not terminate locally (e.g. when doing routing),
an insanely large queue has more disadvantages than advantages.
If you're routing those exact same TCP packets on the way to their
final destination, you run the risk of not sending out any congestion
signals in the cases where you should, making your forwarding latency
skyrocket (punishing all the other flows) in the process.
--L
* Re: netif_rx packet dumping
2005-03-04 1:42 ` Lennert Buytenhek
@ 2005-03-04 3:10 ` John Heffner
2005-03-04 3:31 ` Lennert Buytenhek
0 siblings, 1 reply; 51+ messages in thread
From: John Heffner @ 2005-03-04 3:10 UTC (permalink / raw)
To: Lennert Buytenhek; +Cc: Stephen Hemminger, baruch, netdev
On Fri, 4 Mar 2005, Lennert Buytenhek wrote:
> On Thu, Mar 03, 2005 at 06:48:50PM -0500, John Heffner wrote:
>
> > All these AQM schemes are trying to solve a fundamentally different
> > problem. With TCP at least, the only congestion experienced at this point
> > will be transient, so you do not want to send any congestion signals (drop
> > packets) if you can avoid it at all. Making the limit as high as you can
> > tolerate seems like the best thing to me.
>
> If the traffic does not terminate locally (f.e. when doing routing),
> an insanely large queue has more disadvantages than advantages.
>
> If you're routing those exact same TCP packets on the way to their
> final destination, you run the risk of not sending out any congestion
> signals in the cases where you should, making your forwarding latency
> skyrocket (punishing all the other flows) in the process.
Yes. In "as high as you can tolerate", latency is implicit. :) This is
just as true whether forwarding or not. Offhand I'd say 10 ms is a good
number (bursts should be shorter than this, but it's not too much
latency).
The forwarding case where you actually need congestion control, as opposed
to absorbing bursts, is pretty gross. If you have a router (more likely a
firewall) whose bottleneck is the CPU, then you're operating entirely in
your input queue. Yuck. In such a situation, if you don't want user
processes to get starved, you need to do throttling => bad for TCP.
-John
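Rough numbers behind that 10 ms suggestion, for a sense of scale (the
packet sizes are just assumptions): how many packets a 10 ms backlog
corresponds to at a given line rate.

#include <stdio.h>

static double pkts_in_budget(double bits_per_sec, double pkt_bytes, double secs)
{
    return bits_per_sec * secs / (pkt_bytes * 8);
}

int main(void)
{
    printf("100 Mbit/s, 1500 B frames: %.0f packets\n",
           pkts_in_budget(100e6, 1500, 0.010));    /* ~83    */
    printf("  1 Gbit/s, 1500 B frames: %.0f packets\n",
           pkts_in_budget(1e9, 1500, 0.010));      /* ~833   */
    printf("  1 Gbit/s,   64 B frames: %.0f packets\n",
           pkts_in_budget(1e9, 64, 0.010));        /* ~19531 */
    return 0;
}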
* Re: netif_rx packet dumping
2005-03-04 3:10 ` John Heffner
@ 2005-03-04 3:31 ` Lennert Buytenhek
0 siblings, 0 replies; 51+ messages in thread
From: Lennert Buytenhek @ 2005-03-04 3:31 UTC (permalink / raw)
To: John Heffner; +Cc: Stephen Hemminger, baruch, netdev
On Thu, Mar 03, 2005 at 10:10:55PM -0500, John Heffner wrote:
> The forwarding case where you actually need congestion control, as
> opposed to absorbing bursts, is pretty gross. If you have a router
> (more likely firewall) whose bottleneck is the CPU, then you're
> operating entirely in you input queue. Yuck.
Yes. This does happen under DoS (or just if your hardware is plain
underspec'ed), and even though you can't really avoid interrupt
livelock and starving userland processes when you're using a non-NAPI
driver (which is what we're talking about here), you don't want to go
OOM as well.
Removing the backlog limit is also a problem for the non-forwarding case --
you don't want someone to be able to OOM your server just by flooding
it with enough packets. ("Wait, I can't drop those, those might be
legitimate ACKs!")
--L
* Re: netif_rx packet dumping
2005-03-03 23:48 ` Baruch Even
@ 2005-03-04 3:45 ` jamal
2005-03-04 8:47 ` Baruch Even
0 siblings, 1 reply; 51+ messages in thread
From: jamal @ 2005-03-04 3:45 UTC (permalink / raw)
To: Baruch Even
Cc: Stephen Hemminger, John Heffner, David S. Miller, rhee,
Yee-Ting.Li, netdev
On Thu, 2005-03-03 at 18:48, Baruch Even wrote:
>
> The queue is there to handle short bursts of packets when the network
> stack cannot handle it. The bad behaviour was the throttling of the
> queue,
Can you explain a little more? Why does the throttling cause any
bad behavior that's any different from the queue being full? In both
cases, packets arriving during that transient will be dropped.
> the smart schemes are not going to make it that much better if
> the hardware/software can't keep up.
Consider that this queue could be shared by as many as a few thousand
unrelated TCP flows - not just one. It is also used for packets being
forwarded. If you factor in that the system has to react to protect
itself, then these schemes may make sense. The best place to do this is
really in hardware, but as close to the hardware as possible is the next
best spot.
cheers,
jamal
* Re: netif_rx packet dumping
2005-03-04 3:45 ` jamal
@ 2005-03-04 8:47 ` Baruch Even
2005-03-07 13:55 ` jamal
0 siblings, 1 reply; 51+ messages in thread
From: Baruch Even @ 2005-03-04 8:47 UTC (permalink / raw)
To: hadi
Cc: Stephen Hemminger, John Heffner, David S. Miller, rhee,
Yee-Ting.Li, netdev
jamal wrote:
> On Thu, 2005-03-03 at 18:48, Baruch Even wrote:
>
>
>>The queue is there to handle short bursts of packets when the network
>>stack cannot handle it. The bad behaviour was the throttling of the
>>queue,
>
>
> Can you explain a little more? Why does the the throttling cause any
> bad behavior thats any different from the queue being full? In both
> cases, packets arriving during that transient will be dropped.
If you have 300 packets in the queue and the throttling kicks in, you now
drop ALL packets until the queue is empty. This will normally take some
time, and during all of this time you are dropping all the ACKs that are
coming in. You lose SACK information, and potentially you leave no packet
in flight, so that the next packet will be sent only when the retransmit
timer wakes up, at which point your congestion control algorithm starts
from cwnd=1.
You can look at the report http://hamilton.ie/net/LinuxHighSpeed.pdf for
some graphs of the effects.
>>the smart schemes are not going to make it that much better if
>>the hardware/software can't keep up.
>
> consider that this queue could be shared by as many as a few thousand
> unrelated TCP flows - not just one. It is also used for packets being
> forwarded. If you factor that the system has to react to protect itself
> then these schemes may make sense. The best place to do it is really in
> hardware, but the closer to the hardware as possible is the next besr
> possible spot.
Actually, the problem we had was with TCP end-system performance;
compared to that, the router problem is more limited, since a router
only needs to do a lookup on a hash, tree or whatever, and not walk a
linked list of several thousand packets.

I'd prefer to avoid an AQM scheme on the incoming queue; if you do add
one, please make it configurable so I can disable it. The drop-tail
behaviour is good enough for me. Remember that an AQM scheme needs to
drop packets long before the queue is full, so there will likely be more
losses involved.
Baruch
* Re: netif_rx packet dumping
2005-03-03 21:54 ` Stephen Hemminger
2005-03-03 22:02 ` John Heffner
@ 2005-03-04 19:49 ` Jason Lunz
1 sibling, 0 replies; 51+ messages in thread
From: Jason Lunz @ 2005-03-04 19:49 UTC (permalink / raw)
To: netdev
shemminger@osdl.org said:
>> We should eliminate the max backlog thing completely. There is
>> no need for it.
>
> Still need some bound, because if process_backlog is running slower
> than the net; then the queue could grow till memory exhausted from
> skbuff's.
Yes. I eliminated netdev_max_backlog long ago in an experiment with a
2.4 kernel, and it became quite easy to consume huge amounts of kernel
memory.
Jason
* Re: netif_rx packet dumping
2005-03-03 22:26 ` jamal
2005-03-03 23:16 ` Stephen Hemminger
@ 2005-03-04 19:52 ` Edgar E Iglesias
2005-03-04 19:54 ` Stephen Hemminger
1 sibling, 1 reply; 51+ messages in thread
From: Edgar E Iglesias @ 2005-03-04 19:52 UTC (permalink / raw)
To: jamal
Cc: John Heffner, Stephen Hemminger, David S. Miller, rhee,
Yee-Ting.Li, baruch, netdev
On Thu, Mar 03, 2005 at 05:26:51PM -0500, jamal wrote:
> On Thu, 2005-03-03 at 17:02, John Heffner wrote:
> > On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> > > Maybe a simple Random Exponential Drop (RED) would be more friendly.
> >
> > That would probably not be appropriate. This queue is only for absorbing
> > micro-scale bursts. It should not hold any data in steady state like a
> > router queue can. The receive window can handle the macro scale flow
> > control.
>
> recall this is a queue that is potentially shared by many many flows
> from potentially many many interfaces i.e it deals with many many
> micro-scale bursts.
> Clearly, the best approach is to have lots and lots of memmory and to
> make that queue real huge so it can cope with all of them all the time.
> We dont have that luxury - If you restrict the queue size, you will have
> to drop packets... Which ones?
> Probably simplest solution is to leave it as is right now and just
> adjust the contraints based on your system memmory etc.
>
Why not have smaller queues, but per interface? This would avoid
introducing too much latency and keep memory consumption low,
but scale as we add more interfaces. It would also provide some
kind of fair queueing between the interfaces, to keep high-speed
NICs from starving low-speed ones.

Queue length would still be an issue, though; it should somehow be
related to the interface rate and the acceptable introduced latency.

Regarding RED and other more sophisticated algorithms, I assume
these are up to the ingress qdisc to take care of. What the queues
before the ingress qdiscs should do is avoid introducing
too much latency. In my opinion, low latency or high burst
tolerance should be the choice of the admin, as for egress.

I am not very familiar with the Linux code, so I may be completely
wrong here...
Regards
--
Programmer
Edgar E Iglesias <edgar@axis.com> 46.46.272.1946
* Re: netif_rx packet dumping
2005-03-04 19:52 ` Edgar E Iglesias
@ 2005-03-04 19:54 ` Stephen Hemminger
2005-03-04 21:41 ` Edgar E Iglesias
0 siblings, 1 reply; 51+ messages in thread
From: Stephen Hemminger @ 2005-03-04 19:54 UTC (permalink / raw)
To: Edgar E Iglesias
Cc: jamal, John Heffner, David S. Miller, rhee, Yee-Ting.Li, baruch,
netdev
On Fri, 4 Mar 2005 20:52:39 +0100
Edgar E Iglesias <edgar.iglesias@axis.com> wrote:
> On Thu, Mar 03, 2005 at 05:26:51PM -0500, jamal wrote:
> > On Thu, 2005-03-03 at 17:02, John Heffner wrote:
> > > On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> > > > Maybe a simple Random Exponential Drop (RED) would be more friendly.
> > >
> > > That would probably not be appropriate. This queue is only for absorbing
> > > micro-scale bursts. It should not hold any data in steady state like a
> > > router queue can. The receive window can handle the macro scale flow
> > > control.
> >
> > recall this is a queue that is potentially shared by many many flows
> > from potentially many many interfaces i.e it deals with many many
> > micro-scale bursts.
> > Clearly, the best approach is to have lots and lots of memmory and to
> > make that queue real huge so it can cope with all of them all the time.
> > We dont have that luxury - If you restrict the queue size, you will have
> > to drop packets... Which ones?
> > Probably simplest solution is to leave it as is right now and just
> > adjust the contraints based on your system memmory etc.
> >
>
> Why not have smaller queues but per interface? this would avoid
> introducing too much latency and keep memory consumption low
> but scale as we add more interfaces. It would also provide some
> kind of fair queueing between the interfaces to avoid highspeed
> nics to starve lowspeed ones.
That would require locking and effectively turn every device
into a NAPI device.
> Queue length would still be an issue though, should somehow be
> related to interface rate and acceptable introduced latency.
>
> Regarding RED and other more sophisticated algorithms, I assume
> this is up to the ingress qdisc to take care of. What the queues
> before the ingress qdiscs should do, is to avoid introducing
> too much latency. In my opinion, low latency or high burst
> tolerance should be the choice of the admin, like for egress.
>
All this happens at a much lower level before the ingress qdisc
(which is optional) gets involved.
* Re: netif_rx packet dumping
2005-03-04 19:54 ` Stephen Hemminger
@ 2005-03-04 21:41 ` Edgar E Iglesias
0 siblings, 0 replies; 51+ messages in thread
From: Edgar E Iglesias @ 2005-03-04 21:41 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Edgar E Iglesias, jamal, John Heffner, David S. Miller, rhee,
Yee-Ting.Li, baruch, netdev
On Fri, Mar 04, 2005 at 11:54:22AM -0800, Stephen Hemminger wrote:
> On Fri, 4 Mar 2005 20:52:39 +0100
> Edgar E Iglesias <edgar.iglesias@axis.com> wrote:
>
> > On Thu, Mar 03, 2005 at 05:26:51PM -0500, jamal wrote:
> > > On Thu, 2005-03-03 at 17:02, John Heffner wrote:
> > > > On Thu, 3 Mar 2005, Stephen Hemminger wrote:
> > > > > Maybe a simple Random Exponential Drop (RED) would be more friendly.
> > > >
> > > > That would probably not be appropriate. This queue is only for absorbing
> > > > micro-scale bursts. It should not hold any data in steady state like a
> > > > router queue can. The receive window can handle the macro scale flow
> > > > control.
> > >
> > > recall this is a queue that is potentially shared by many many flows
> > > from potentially many many interfaces i.e it deals with many many
> > > micro-scale bursts.
> > > Clearly, the best approach is to have lots and lots of memmory and to
> > > make that queue real huge so it can cope with all of them all the time.
> > > We dont have that luxury - If you restrict the queue size, you will have
> > > to drop packets... Which ones?
> > > Probably simplest solution is to leave it as is right now and just
> > > adjust the contraints based on your system memmory etc.
> > >
> >
> > Why not have smaller queues but per interface? this would avoid
> > introducing too much latency and keep memory consumption low
> > but scale as we add more interfaces. It would also provide some
> > kind of fair queueing between the interfaces to avoid highspeed
> > nics to starve lowspeed ones.
>
> That would require locking and effectively turn every device
> into a NAPI device.
>
OK, then why not a weighted algorithm, like we do when we feed
netif_receive_skb to give fairness among CPUs, but this time for
netif_rx input to give fairness among interfaces? The total queue length
would be the sum of the weights and would grow as more interfaces
are added. When an interface's quota is reached it begins to drop.
The individual weights could be chosen based on interface rates.
A simple WRR would do.
This may (compared to the current queue) cost CPU cycles though...
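A toy sketch of that weighted round-robin idea (weights, names and queue
contents are made up for illustration): one small queue per interface,
drained in proportion to per-interface weights so a fast NIC cannot
starve a slow one.

#include <stdio.h>

#define NIFS 3

struct ifq {
    const char *name;
    int weight;   /* packets this interface may dequeue per round */
    int qlen;     /* packets currently queued                     */
};

/* One WRR round: each interface gets up to 'weight' dequeues. */
static void wrr_round(struct ifq *ifs, int n)
{
    for (int i = 0; i < n; i++) {
        int take = ifs[i].qlen < ifs[i].weight ? ifs[i].qlen : ifs[i].weight;

        ifs[i].qlen -= take;
        printf("%s: dequeued %d, %d left\n", ifs[i].name, take, ifs[i].qlen);
    }
}

int main(void)
{
    struct ifq ifs[NIFS] = {
        { "eth0 (1G)",   16, 500 },
        { "eth1 (100M)",  4,  40 },
        { "eth2 (100M)",  4,  10 },
    };

    for (int round = 0; round < 3; round++)
        wrr_round(ifs, NIFS);
    return 0;
}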
> > Queue length would still be an issue though, should somehow be
> > related to interface rate and acceptable introduced latency.
> >
> > Regarding RED and other more sophisticated algorithms, I assume
> > this is up to the ingress qdisc to take care of. What the queues
> > before the ingress qdiscs should do, is to avoid introducing
> > too much latency. In my opinion, low latency or high burst
> > tolerance should be the choice of the admin, like for egress.
> >
>
> All this happens at a much lower level before the ingress qdisc
> (which is optional) gets involved.
Exactly, and this is why we should not introduce latency; latency should
be a choice for upper layers. When ingress qdiscs are disabled it's
acceptable (I guess) to have a default behavior with some kind of
balanced tradeoff, but when qdiscs are enabled, a 300-skb list could
become a problem by introducing latency. Some applications would like
to signal congestion much earlier.
--
Programmer
Edgar E Iglesias <edgar@axis.com> 46.46.272.1946
* Re: netif_rx packet dumping
2005-03-04 8:47 ` Baruch Even
@ 2005-03-07 13:55 ` jamal
2005-03-08 15:56 ` Baruch Even
0 siblings, 1 reply; 51+ messages in thread
From: jamal @ 2005-03-07 13:55 UTC (permalink / raw)
To: Baruch Even
Cc: Stephen Hemminger, John Heffner, David S. Miller, rhee,
Yee-Ting.Li, netdev
On Fri, 2005-03-04 at 03:47, Baruch Even wrote:
> jamal wrote:
> > Can you explain a little more? Why does the the throttling cause any
> > bad behavior thats any different from the queue being full? In both
> > cases, packets arriving during that transient will be dropped.
>
> If you have 300 packets in the queue and the throttling kicks in you now
> drop ALL packets until the queue is empty, this will normally take some
> time, during all of this time you are dropping all the ACKs that are
> coming in, you lose SACK information and potentially you leave no packet
> in flight so that the next packet will be sent only due to retransmit
> timer waking up, at which point your congestion control algorithm starts
> from cwnd=1.
>
> You can look at the report http://hamilton.ie/net/LinuxHighSpeed.pdf for
> some graphs of the effects.
>
Always cool to see some tests running across the pond.
Were the processors tied to NICs?
Your experiment is more than likely a single flow, correct?
In other words, the whole queue was in fact dedicated just to your one
flow - that's why you can call this queue a transient burst queue.
Do you still have the data that shows how many packets were dropped
during this period? Do you still have the experimental data? I am
particularly interested in seeing the softnet stats as well as the TCP
netstats.
I think your main problem was the huge amount of SACK work on the write
queue and the resultant processing, i.e. section 1.1 and how you resolved
that. I don't see any issue in dropping ACKs, even many of them, for such
large windows as you have - TCP's ACKs are cumulative. It is true that if
you drop a "large" enough amount of ACKs you will end up in timeouts - but
"large enough" in your case must be at a minimum 1000 packets. And to say
you dropped 1000 packets while processing 300 means you were taking too
long processing the 300. So it would be interesting to see a repeat of
the test after you've resolved 1.1, but without removing the congestion
code.
Then what would be really interesting is to see the performance you get
from multiple flows, with and without the congestion code.
I am not against the benchmark nature of the single flow and tuning
for that, but we should also look at the effect at a wider scope before
you handwave based on the result of one test case.
In fact I would agree with giving you a way to turn off the congestion
control - and I am not sure how long we should keep it around with NAPI
getting more popular. I will prepare a simple patch.
What you really need to do eventually is use NAPI, not these antiquated
schemes.
I am also worried that, since you used a non-NAPI driver, the effect of
reordering necessitating the UNDO is much, much higher.
So if I were you I would repeat 1.2 with the fix from 1.1, as well as
tying the NIC to one CPU. And it would be a good idea to present more
detailed results - not just TCP windows fluctuating (you may not need
the other parameters for the paper, but they would be useful to see for
debugging purposes).
> >>the smart schemes are not going to make it that much better if
> >>the hardware/software can't keep up.
> >
> > consider that this queue could be shared by as many as a few thousand
> > unrelated TCP flows - not just one. It is also used for packets being
> > forwarded. If you factor that the system has to react to protect itself
> > then these schemes may make sense. The best place to do it is really in
> > hardware, but as close to the hardware as possible is the next best
> > possible spot.
>
> Actually the problem we had was with TCP end-system performance
> problems, compared to them the router problem is more limited since it
> only needs to do a lookup on a hash, tree or whatever and not a linked
> list of several thousand packets.
>
I am not sure I followed. If you mean routers don't use linked lists,
you are highly mistaken.
> I'd prefer avoiding an AFQ scheme in the incoming queue, if you do add
> one, please make it configurable so I can disable it. The drop-tail
> behaviour is good enough for me. Remember that an AFQ needs to drop
> packets long before the queue is full so there will likely be more
> losses involved.
What I was suggesting to Stephen would probably make more sense kicking
in when there's congestion. Weighted windowing allows us to sense things
that are coming; so the idea was rather to not admit new flows once we
are congested.
Just use a NAPI driver and you won't have to worry about this.
cheers,
jamal
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-03 21:57 ` David S. Miller
2005-03-03 22:14 ` Baruch Even
@ 2005-03-08 15:42 ` Baruch Even
2005-03-08 17:00 ` Andi Kleen
2005-03-31 16:33 ` Baruch Even
2 siblings, 1 reply; 51+ messages in thread
From: Baruch Even @ 2005-03-08 15:42 UTC (permalink / raw)
To: David S. Miller; +Cc: shemminger, netdev
David S. Miller wrote:
> On Thu, 03 Mar 2005 21:44:52 +0000
> Baruch Even <baruch@ev-en.org> wrote:
>
>>The current linked list goes over all the packets, the linked list we
>>add is for the packets that were not SACKed. The idea being that it is a
>>lot faster since there are a lot less packets not SACKed compared to
>>packets already SACKed (or never mentioned in SACKs).
>>
>>If you have a way around this I'd be happy to hear it.
>
> I'm sure you can find a way to steal sizeof(void *) from
> "struct tcp_skb_cb" :-)
>
> It is currently 36 bytes on both 32-bit and 64-bit platforms.
> This means if you can squeeze out 4 bytes (so that it fits
> in the skb->cb[] 40 byte area), you can fit a pointer in there
> for the linked list stuff.
The current code adds 2 pointers to tcp_skb_cb and 20 bytes to tcp_sock.
I can squeeze tcp_skb_cb down to one pointer at the expense of extra work
to remove a packet from the list (the other pointer is the prev pointer).
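To make the trade-off concrete, here is a minimal sketch of the idea,
assuming a hypothetical hint structure aliased onto skb->cb[] (the names
"struct sack_hint", "unsacked_add" and "unsacked_del" are illustrative
only; in the real patch the pointer would be a member of struct
tcp_skb_cb and the list head would live in the socket):

#include <linux/skbuff.h>

/* Illustrative sketch only: a singly linked list of not-yet-SACKed
 * skbs threaded through per-skb control-block state. */
struct sack_hint {
	struct sk_buff *next_unsacked;	/* NULL terminates the list */
};

#define SACK_HINT(skb) ((struct sack_hint *)&(skb)->cb[0])

/* O(1) insertion at the head of the socket's unsacked list. */
static void unsacked_add(struct sk_buff **head, struct sk_buff *skb)
{
	SACK_HINT(skb)->next_unsacked = *head;
	*head = skb;
}

/* Without a prev pointer, removal costs a walk from the head --
 * this is the "extra work" traded for the saved pointer. */
static void unsacked_del(struct sk_buff **head, struct sk_buff *skb)
{
	struct sk_buff **pp = head;

	while (*pp && *pp != skb)
		pp = &SACK_HINT(*pp)->next_unsacked;
	if (*pp)
		*pp = SACK_HINT(skb)->next_unsacked;
}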
I'm trying to do two tasks at the same time: porting to 2.6.11 and
rerunning all the tests to produce a chapter in my thesis (on the ported
2.6.11 patches). Hopefully it will work out, otherwise the porting will
be delayed.
Baruch
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-07 13:55 ` jamal
@ 2005-03-08 15:56 ` Baruch Even
2005-03-08 22:02 ` jamal
2005-03-22 21:55 ` cliff white
0 siblings, 2 replies; 51+ messages in thread
From: Baruch Even @ 2005-03-08 15:56 UTC (permalink / raw)
To: hadi
Cc: Stephen Hemminger, John Heffner, David S. Miller, rhee,
Yee-Ting.Li, netdev
jamal wrote:
> On Fri, 2005-03-04 at 03:47, Baruch Even wrote:
>
>>jamal wrote:
>
>
>>>Can you explain a little more? Why does the the throttling cause any
>>>bad behavior thats any different from the queue being full? In both
>>>cases, packets arriving during that transient will be dropped.
>>
>>If you have 300 packets in the queue and the throttling kicks in you now
>>drop ALL packets until the queue is empty, this will normally take some
>>time, during all of this time you are dropping all the ACKs that are
>>coming in, you lose SACK information and potentially you leave no packet
>>in flight so that the next packet will be sent only due to retransmit
>>timer waking up, at which point your congestion control algorithm starts
>>from cwnd=1.
>>
>>You can look at the report http://hamilton.ie/net/LinuxHighSpeed.pdf for
>>some graphs of the effects.
>
> Were the processors tied to NICs?
No. These are single CPU machines (with HT).
> Your experiment is more than likely a single flow, correct?
Yes.
> In other words the whole queue was infact dedicated just for your one
> flow - thats why you can call this queue a transient burst queue.
Indeed. For a router or a web server handling several thousand flows it
might be different, but I don't expect it to spend one ms (or more) on a
single packet, as happens with the current end-system ACK handling code.
> Do you still have the data that shows how many packets were dropped
> during this period. Do you still have the experimental data? I am
> particulary interested in seeing the softnet stats as well as tcp
> netstats.
No; these tests were not run by me. I'll probably rerun similar tests as
well to base my work on. Send me privately how to get the stats from
the kernel and I'll add it to my test scripts.
> I think your main problem was the huge amounts of SACK on the writequeue
> and the resultant processing i.e section 1.1 and how you resolved that.
That is my main guess as well; the original work was done rather
quickly. We are now reorganizing our thoughts and redoing the tests in a
more orderly fashion.
> I dont see any issue in dropping ACKs, many of them even for such large
> windows as you have - TCPs ACKs are cummulative. It is true if you drop
> "large" enough amounts of ACKS, you will end up in timeouts - but large
> enough in your case must be in the minimal 1000 packets. And to say you
> dropped a 1000 packets while processing 300 means you were taking too
> long processing the 300.
With the current code SACK processing takes a long time, so it is
possible that it did drop more than a thousand packets while
handling 300. I think that after fixing the SACK code, the rest
might work without putting too much into the ingress queue. But that
might still change when we go to even higher speeds.
> Then what would be really interesting is to see the perfomance you get
> from multiple flows with and without congestion.
We'd need to get a very high speed link for multiple high speed flows.
> I am not against a the benchmarky nature of the single flow and tuning
> for that, but we should also look at a wider scope at the effect before
> you handwave based on the result of one testcase.
I can't say I didn't handwave, but then, there is little experimentation
done to see whether the other claims are correct and whether AFQ is really
needed so early in the packet receive stage. There are also voices that
say AFQ sucks and causes more damage than good; I don't remember the
details right now.
> So if i was you i would repeat 1.2 with the fix from 1.1 as well as
> tying the NIC to one CPU. And it would be a good idea to present more
> detailed results - not just tcp windows fluctuating (you may not need
> them for the paper, but would be useful to see for debugging purposes
> other parameters).
I'd be happy to hear what other benchmarks you would like to see. I
currently intend to add some ACK processing time analysis and oprofile
information, possibly also showing the size of the ingress queue as a
measure.
Making it as thorough as possible is one of my goals. Input is always
welcome.
Baruch
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 15:42 ` Baruch Even
@ 2005-03-08 17:00 ` Andi Kleen
2005-03-08 18:01 ` Baruch Even
2005-03-08 18:09 ` David S. Miller
0 siblings, 2 replies; 51+ messages in thread
From: Andi Kleen @ 2005-03-08 17:00 UTC (permalink / raw)
To: Baruch Even; +Cc: shemminger, netdev
Baruch Even <baruch@ev-en.org> writes:
>
> I can squeeze the tcp_skb_cb to one pointer at the expense of extra
> work to remove a packet from the list (the other pointer is the prev
> pointer).
You could also use an XOR list in theory. But I'm not sure it's worth it.
Increasing cb by 4 bytes shouldn't be a very big issue.
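For reference, the XOR-list trick stores prev XOR next in a single
pointer-sized field, so a doubly linked structure costs only one word
per node. The sketch below is a generic userspace illustration (the node
type and names are made up, not kernel code); the catch, and why it
rarely pays off, is that you can only decode a neighbour if you already
know the other one:

#include <stdint.h>
#include <stdio.h>

/* Generic XOR linked list: one word per node encodes both neighbours. */
struct xnode {
	uintptr_t link;		/* (uintptr_t)prev ^ (uintptr_t)next */
	int payload;
};

static struct xnode *xor_ptr(struct xnode *a, uintptr_t link)
{
	return (struct xnode *)((uintptr_t)a ^ link);
}

/* Walking forward needs the previous node to decode the next one. */
static void xlist_walk(struct xnode *head)
{
	struct xnode *prev = NULL, *cur = head, *next;

	while (cur) {
		printf("%d\n", cur->payload);
		next = xor_ptr(prev, cur->link);
		prev = cur;
		cur = next;
	}
}

int main(void)
{
	struct xnode a = { 0, 1 }, b = { 0, 2 }, c = { 0, 3 };

	/* a <-> b <-> c; the ends use NULL (0) as the missing neighbour */
	a.link = (uintptr_t)NULL ^ (uintptr_t)&b;
	b.link = (uintptr_t)&a ^ (uintptr_t)&c;
	c.link = (uintptr_t)&b ^ (uintptr_t)NULL;
	xlist_walk(&a);		/* prints 1 2 3 */
	return 0;
}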
-Andi
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 17:00 ` Andi Kleen
@ 2005-03-08 18:01 ` Baruch Even
2005-03-08 18:09 ` David S. Miller
1 sibling, 0 replies; 51+ messages in thread
From: Baruch Even @ 2005-03-08 18:01 UTC (permalink / raw)
To: Andi Kleen; +Cc: shemminger, netdev
Andi Kleen wrote:
> Baruch Even <baruch@ev-en.org> writes:
>
>>I can squeeze the tcp_skb_cb to one pointer at the expense of extra
>>work to remove a packet from the list (the other pointer is the prev
>>pointer).
>
> You could also use a xor list in theory. But I'm not sure it's worth it.
> Increasing cb by 4 bytes shouldn't be a very big issue.
If increasing the cb by 4 bytes is OK, it would make my life easier and
the code simpler. But I don't think it would be a big issue to remove
the prev pointer either.
Now, if we can agree on increasing tcp_sock by 20 bytes I'll be able
to continue my work.
Baruch
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 17:00 ` Andi Kleen
2005-03-08 18:01 ` Baruch Even
@ 2005-03-08 18:09 ` David S. Miller
2005-03-08 18:18 ` Andi Kleen
2005-03-08 18:27 ` Ben Greear
1 sibling, 2 replies; 51+ messages in thread
From: David S. Miller @ 2005-03-08 18:09 UTC (permalink / raw)
To: Andi Kleen; +Cc: baruch, shemminger, netdev
On Tue, 08 Mar 2005 18:00:49 +0100
Andi Kleen <ak@muc.de> wrote:
> Baruch Even <baruch@ev-en.org> writes:
> >
> > I can squeeze the tcp_skb_cb to one pointer at the expense of extra
> > work to remove a packet from the list (the other pointer is the prev
> > pointer).
>
> You could also use a xor list in theory. But I'm not sure it's worth it.
> Increasing cb by 4 bytes shouldn't be a very big issue.
Going from "40" to "44" takes 64-bit platforms onto another cache line
for struct sk_buff, as I stated in another email.
And every time I let this happen, I get an email from David Mosberger because
it shows up in performance tests on ia64. :-)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 18:09 ` David S. Miller
@ 2005-03-08 18:18 ` Andi Kleen
2005-03-08 18:37 ` Thomas Graf
2005-03-08 18:27 ` Ben Greear
1 sibling, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2005-03-08 18:18 UTC (permalink / raw)
To: David S. Miller; +Cc: baruch, shemminger, netdev
> > You could also use a xor list in theory. But I'm not sure it's worth it.
> > Increasing cb by 4 bytes shouldn't be a very big issue.
>
> Going from "40" to "44" takes 64-bit platforms onto another cache line
> for struct sk_buff, as I stated in another email.
>
> And every time I let this happen, I get an email from David Mosberger because
> it shows up in performance tests on ia64. :-)
OK, then use an XOR list or trim some other field.
There are some other savings possible, e.g. from a quick look:
- skb->list is AFAIK totally unnecessary and probably even unused.
- struct timeval could be replaced by an optimized structure using 32 bits
for the sub-second part (it would need moving somewhere else, otherwise
alignment doesn't help); a possible layout is sketched below.
- Are three device pointers really needed? Perhaps things can
be optimized a bit here.
- HIPPI could finally be changed to use skb->cb instead of its
private field.
- Is skb->security still needed? It should be obsolete with ->sec_path, no?
It would only help together with the timestamp optimization.
Of course none of these would change the number of cache lines
significantly, but they would possibly allow other optimizations that
need new fields.
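As a rough illustration of the timestamp point, something like the
following would shrink the stamp to eight bytes instead of a struct
timeval of two longs (16 bytes on 64-bit). The struct and field names
are invented here for the sketch, not an existing kernel type:

#include <linux/types.h>
#include <linux/time.h>

/* Hypothetical compact timestamp: 32-bit seconds plus 32-bit
 * microseconds. */
struct skb_stamp {
	__u32	sec;	/* seconds since the epoch (wraps in 2106) */
	__u32	usec;	/* microseconds, 0..999999 */
};

static inline void skb_stamp_from_timeval(struct skb_stamp *st,
					   const struct timeval *tv)
{
	st->sec  = (__u32)tv->tv_sec;
	st->usec = (__u32)tv->tv_usec;
}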
-Andi
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 18:09 ` David S. Miller
2005-03-08 18:18 ` Andi Kleen
@ 2005-03-08 18:27 ` Ben Greear
2005-03-09 23:57 ` Thomas Graf
1 sibling, 1 reply; 51+ messages in thread
From: Ben Greear @ 2005-03-08 18:27 UTC (permalink / raw)
To: David S. Miller; +Cc: Andi Kleen, baruch, shemminger, netdev
David S. Miller wrote:
> On Tue, 08 Mar 2005 18:00:49 +0100
> Andi Kleen <ak@muc.de> wrote:
>
>
>>Baruch Even <baruch@ev-en.org> writes:
>>
>>>I can squeeze the tcp_skb_cb to one pointer at the expense of extra
>>>work to remove a packet from the list (the other pointer is the prev
>>>pointer).
>>
>>You could also use a xor list in theory. But I'm not sure it's worth it.
>>Increasing cb by 4 bytes shouldn't be a very big issue.
>
>
> Going from "40" to "44" takes 64-bit platforms onto another cache line
> for struct sk_buff, as I stated in another email.
Seems like we could squish the sk_buff a bit:
Do we really need 32 bits for mac_len:
unsigned int len,
data_len,
mac_len,
csum;
Some of these flags could be collapsed into a single field and we
could do bit-shift operations for the single flags we care about.
This would also make it easier to add new flags as desired without
growing the structure.
unsigned char local_df,
cloned,
pkt_type,
ip_summed;
The priority could probably be 16 bits as well; do we really need more
than 65k different priorities:
__u32 priority;
Of course... this might be a thing for 2.7 since lots of modules will
probably be accessing these fields. Maybe to get started we could add
macros to grab the flags and such, so that when we finally do collapse
things into a single flags field the external code doesn't have to know
or care?
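As a rough sketch of that accessor idea (all names below are invented,
not a proposed kernel API, and "skb->flags" is a hypothetical packed
field): wrap a single flags word behind macros so callers never touch
the layout, and the bits can later be re-packed without changing
external code.

#include <linux/types.h>

/* Hypothetical packed flags word replacing several unsigned chars. */
#define SKB_FLAG_LOCAL_DF	0
#define SKB_FLAG_CLONED		1
#define SKB_FLAG_NOHDR		2
/* multi-bit fields get a shift and a mask */
#define SKB_PKT_TYPE_SHIFT	8
#define SKB_PKT_TYPE_MASK	0x7
#define SKB_IP_SUMMED_SHIFT	11
#define SKB_IP_SUMMED_MASK	0x3

#define skb_flag_get(flags, bit)   (((flags) >> (bit)) & 1)
#define skb_flag_set(flags, bit)   ((flags) |= 1U << (bit))
#define skb_flag_clear(flags, bit) ((flags) &= ~(1U << (bit)))

#define skb_pkt_type_get(flags) \
	(((flags) >> SKB_PKT_TYPE_SHIFT) & SKB_PKT_TYPE_MASK)
#define skb_pkt_type_set(flags, v)					\
	((flags) = ((flags) & ~(SKB_PKT_TYPE_MASK << SKB_PKT_TYPE_SHIFT)) | \
		   (((v) & SKB_PKT_TYPE_MASK) << SKB_PKT_TYPE_SHIFT))

External code would then call e.g. skb_flag_get(skb->flags,
SKB_FLAG_CLONED) instead of reading skb->cloned directly, so collapsing
the chars into one word later would not break it.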
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 18:18 ` Andi Kleen
@ 2005-03-08 18:37 ` Thomas Graf
2005-03-08 18:51 ` Arnaldo Carvalho de Melo
2005-03-08 22:16 ` Andi Kleen
0 siblings, 2 replies; 51+ messages in thread
From: Thomas Graf @ 2005-03-08 18:37 UTC (permalink / raw)
To: Andi Kleen; +Cc: David S. Miller, baruch, shemminger, netdev
* Andi Kleen <20050308181844.GA37392@muc.de> 2005-03-08 19:18
> There are some other savings possible e.g. from a quick look:
> - skb->list is afaik totally unnecessary and probably even unused.
> - struct timeval could be an optimized structure using 32bit
> for the sub second part.
> (would need moving it somewhere else, otherwise alignment doesn't help)
> - Are really three device pointers needed? Perhaps things can
> be a bit optimized here.
Likely real_dev can be moved to cb. I would like to keep indev
though; it really helps with policy routing decisions.
> - Hippi could be finally changed to use skb->cb instead of its
> private field.
Definitely.
> - is skb->security still needed? It should be obsolete with ->sec_path, no?
> Would only help together with the timestamp optimization.
security has been unused for quite some time as far as I can see.
Anyone going for a patch? Otherwise I'll give it a try.
Speaking of it, I see tcp_sock is marginally over 2**10 bytes on 32-bit
archs, and Stephen's plans to outsource the cc bits bring us closer to the
boundary. Would it be worth trying to get it below 2**10? I spotted some
places for optimization but not enough to really save the needed amount.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 18:37 ` Thomas Graf
@ 2005-03-08 18:51 ` Arnaldo Carvalho de Melo
2005-03-08 22:16 ` Andi Kleen
1 sibling, 0 replies; 51+ messages in thread
From: Arnaldo Carvalho de Melo @ 2005-03-08 18:51 UTC (permalink / raw)
To: Thomas Graf; +Cc: Andi Kleen, David S. Miller, baruch, shemminger, netdev
On Tue, 8 Mar 2005 19:37:59 +0100, Thomas Graf <tgraf@suug.ch> wrote:
> Speaking of it, I see tcp_sock is marginal over 2**10 on 32 bit archs and
> Stephen's plans to outsource the cc bits brings us closer to the border.
> Would it be worth to try and get it below 2**10? I spotted some places
> for optimizations but not enough to really save the needed amount.
sk_protinfo, sk_slab, sk_zapped are going away when I finish my
connection_sock-2.6 series.
sk_protinfo isn't needed if all proto families use sk_alloc + kmalloc,
i.e. specify the size of the proto-specific socket (like tcp_sock) in
"zero_it" and pass NULL in the slab parameter, like I did with bluetooth
today and will do with the ham radio protocols, the last ones still
using sk_protinfo.
sk_slab won't be needed either; the slab will only be obtained from
sk->sk_prot->slab.
sk_zapped was just a debugging member that got reused, and can be turned
into a SOCK_ZAPPED bit in sk_flags.
--
- Arnaldo
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 15:56 ` Baruch Even
@ 2005-03-08 22:02 ` jamal
2005-03-22 21:55 ` cliff white
1 sibling, 0 replies; 51+ messages in thread
From: jamal @ 2005-03-08 22:02 UTC (permalink / raw)
To: Baruch Even
Cc: Stephen Hemminger, John Heffner, David S. Miller, rhee,
Yee-Ting.Li, netdev
[-- Attachment #1: Type: text/plain, Size: 4927 bytes --]
On Tue, 2005-03-08 at 10:56, Baruch Even wrote:
> jamal wrote:
> >
> > Were the processors tied to NICs?
>
> No. These are single CPU machines (with HT).
>
You have SMP.
> > In other words the whole queue was infact dedicated just for your one
> > flow - thats why you can call this queue a transient burst queue.
>
> Indeed, For a router or a web server handling several thousand flows it
> might be different, but I don't expect it handles a single packet in one
> ms (or more) as it happens for the current end-system ack handling code.
>
I think it would be interesting as well to see more than one flow;
maybe go up to 16 (1, 2, 4, 8, 16), and if there's anything to observe you
should see it.
> > Do you still have the data that shows how many packets were dropped
> > during this period. Do you still have the experimental data? I am
> > particulary interested in seeing the softnet stats as well as tcp
> > netstats.
>
> No, These tests were not run by me, I'll probably rerun similar tests as
> well to base my work on, send me in private how do I get the stats from
> the kernel and I'll add it to my test scripts.
>
I will be more than happy to help. Let me know when you are ready.
> > I think your main problem was the huge amounts of SACK on the writequeue
> > and the resultant processing i.e section 1.1 and how you resolved that.
>
> That is my main guess as well, the original work was done rather
> quickly, we are now reorganizing thoughts and redoing the tests in a
> more orderly fashion.
>
I have attached the little patch I forgot to attach last time. Adjust
the lo_cong parameter in /proc to tune the number of packets at which
the congestion valve gets opened. Make sure you stay within range of
the other parameters like no_cong etc.
Or you can change that check to use no_cong instead.
> > I dont see any issue in dropping ACKs, many of them even for such large
> > windows as you have - TCPs ACKs are cummulative. It is true if you drop
> > "large" enough amounts of ACKS, you will end up in timeouts - but large
> > enough in your case must be in the minimal 1000 packets. And to say you
> > dropped a 1000 packets while processing 300 means you were taking too
> > long processing the 300.
>
> With the current code SACK processing takes a long time, so it is
> possible that it happened to drop more than a thousand packets while
> handling 300. I think that after the fixing of the SACK code, the rest
> might work without getting to much into the ingress queue. But that
> might still change when we go to even higher speeds.
>
Sure. I am not questioning fixing the SACK code - I don't think it was
envisioned that someone was going to have 1000 outstanding SACKs in
_one_ flow. But an interesting test case would be to fix the SACK
processing and then retest without getting rid of the congestion valve.
Again, shall I repeat that you really should be using NAPI? ;->
> > Then what would be really interesting is to see the perfomance you get
> > from multiple flows with and without congestion.
>
> We'd need to get a very high speed link for multiple high speed flows.
>
Get a couple of PCs and hook them back to back.
> > I am not against a the benchmarky nature of the single flow and tuning
> > for that, but we should also look at a wider scope at the effect before
> > you handwave based on the result of one testcase.
>
> I can't say I didn't handwave, but then, there is little experimentation
> done to see if the other claims are correct and that AFQ is really
> needed so early in the packet receive stage. There are also voices that
> say AFQ sucks and causes more damage than good, I don't remember details
> currently.
>
It's not totally AFQ.
The idea is that the code is measuring (based on history) how busy the
system is. What Stephen and I were discussing is that you may want to
punish only the new flows coming in, instead of the flows that are already
flowing, when the going gets tough (i.e. congestion detected). However,
this may be too big and unneeded a hack when we have NAPI, which will do
just fine for you.
> > So if i was you i would repeat 1.2 with the fix from 1.1 as well as
> > tying the NIC to one CPU. And it would be a good idea to present more
> > detailed results - not just tcp windows fluctuating (you may not need
> > them for the paper, but would be useful to see for debugging purposes
> > other parameters).
>
> I'd be happy to hear what other benchmarks you would like to see, I
> currently intend to add some ack processing time analysis and oprofile
> information. With possibly showing the size of the ingress queue as a
> measure as well.
>
> Making it as thorough as possible is one of my goals. Input is always
> welcome.
>
Good - and thanks for not being defensive; your work can only get better
this way. Ping me when you have ported to 2.6.11 and are ready to do the
testing.
cheers,
jamal
[-- Attachment #2: cong_p --]
[-- Type: text/plain, Size: 291 bytes --]
--- 2611-mod/net/core/dev.c 2005/03/07 12:26:21 1.1
+++ 2611-mod/net/core/dev.c 2005/03/07 12:30:19
@@ -1742,6 +1742,9 @@
if (work >= quota || jiffies - start_time > 1)
break;
+ if (queue->throttle && work >= lo_cong)
+ queue->throttle = 0;
+
}
backlog_dev->quota -= work;
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 18:37 ` Thomas Graf
2005-03-08 18:51 ` Arnaldo Carvalho de Melo
@ 2005-03-08 22:16 ` Andi Kleen
1 sibling, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2005-03-08 22:16 UTC (permalink / raw)
To: Thomas Graf; +Cc: David S. Miller, baruch, shemminger, netdev
On Tue, Mar 08, 2005 at 07:37:59PM +0100, Thomas Graf wrote:
> * Andi Kleen <20050308181844.GA37392@muc.de> 2005-03-08 19:18
> > There are some other savings possible e.g. from a quick look:
> > - skb->list is afaik totally unnecessary and probably even unused.
I was wrong on that. Removing skb->list would be worthwhile, but it needs
a lot of changes.
[BTW there seems to be large cleanup potential in the skb list functions;
lots of cruft and even some unused functions around, and the locking
is prehistoric too. In case anybody is interested in a useful cleanup
project.]
> > - struct timeval could be an optimized structure using 32bit
> > for the sub second part.
> > (would need moving it somewhere else, otherwise alignment doesn't help)
> > - Are really three device pointers needed? Perhaps things can
> > be a bit optimized here.
>
> Likely that real_dev can be moved to cb. I would like to keep indev
> though, it really helps at policy routing decisions.
Moving it to cb is useless; you would just need to enlarge cb then.
-Andi
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 18:27 ` Ben Greear
@ 2005-03-09 23:57 ` Thomas Graf
2005-03-10 0:03 ` Stephen Hemminger
2005-03-10 8:33 ` Andi Kleen
0 siblings, 2 replies; 51+ messages in thread
From: Thomas Graf @ 2005-03-09 23:57 UTC (permalink / raw)
To: Ben Greear; +Cc: David S. Miller, Andi Kleen, baruch, shemminger, netdev
* Ben Greear <422DEE85.5010302@candelatech.com> 2005-03-08 10:27
> Seems like we might could squish the sk_buff a bit:
>
> Do we really need 32-bits for the mac-len:
>
> unsigned int len,
> data_len,
> mac_len,
> csum;
Yes, I guess it can be 16 bits.
> Some of these flags could be collapsed into a single field and we
> could do bit-shift operations for the single flags we care about.
> This would also make it easier to add new flags as desired w/out
> growing the structure.
>
> unsigned char local_df,
> cloned,
> pkt_type,
> ip_summed;
I changed them to be :1 bitfields; less work, and they don't have to be
atomic anyway.
> The priority could probably be 16 bits as well, do we really need more
> than 65k different priorities:
>
> __u32 priority;
We use skb->priority to map to tc handles, which are by definition
32 bits; it's not really used at the moment but it will be of use again
soon.
> Of course...this might be things for 2.7 since lots of modules will probably
> be accessing these fields. Maybe to get started we could add macros to grab
> the flags and such so that when we finally do collapse things into a single
> flags field the external code doesn't have to know or care?
I attached a small patch below, saving 4 bytes and leaving some room for
additional flags. The removal of security does indeed have the potential
to break external modules.
diff -Nru linux-2.6.11-bk5.orig/include/linux/skbuff.h linux-2.6.11-bk5/include/linux/skbuff.h
--- linux-2.6.11-bk5.orig/include/linux/skbuff.h 2005-03-09 22:00:23.000000000 +0100
+++ linux-2.6.11-bk5/include/linux/skbuff.h 2005-03-10 00:48:46.000000000 +0100
@@ -250,16 +250,16 @@
unsigned int len,
data_len,
- mac_len,
csum;
- unsigned char local_df,
+ unsigned short mac_len,
+ protocol;
+ unsigned char pkt_type,
+ local_df:1,
cloned:1,
- nohdr:1,
- pkt_type,
- ip_summed;
+ ip_summed:2,
+ nohdr:1;
+ /* 20 bits spare */
__u32 priority;
- unsigned short protocol,
- security;
void (*destructor)(struct sk_buff *skb);
#ifdef CONFIG_NETFILTER
diff -Nru linux-2.6.11-bk5.orig/include/linux/tc_ematch/tc_em_meta.h linux-2.6.11-bk5/include/linux/tc_ematch/tc_em_meta.h
--- linux-2.6.11-bk5.orig/include/linux/tc_ematch/tc_em_meta.h 2005-03-09 22:00:23.000000000 +0100
+++ linux-2.6.11-bk5/include/linux/tc_ematch/tc_em_meta.h 2005-03-09 23:34:28.000000000 +0100
@@ -45,7 +45,7 @@
TCF_META_ID_REALDEV,
TCF_META_ID_PRIORITY,
TCF_META_ID_PROTOCOL,
- TCF_META_ID_SECURITY,
+ TCF_META_ID_SECURITY, /* obsolete */
TCF_META_ID_PKTTYPE,
TCF_META_ID_PKTLEN,
TCF_META_ID_DATALEN,
diff -Nru linux-2.6.11-bk5.orig/net/core/skbuff.c linux-2.6.11-bk5/net/core/skbuff.c
--- linux-2.6.11-bk5.orig/net/core/skbuff.c 2005-03-09 22:00:39.000000000 +0100
+++ linux-2.6.11-bk5/net/core/skbuff.c 2005-03-09 23:33:54.000000000 +0100
@@ -359,7 +359,6 @@
C(ip_summed);
C(priority);
C(protocol);
- C(security);
n->destructor = NULL;
#ifdef CONFIG_NETFILTER
C(nfmark);
@@ -427,7 +426,6 @@
new->pkt_type = old->pkt_type;
new->stamp = old->stamp;
new->destructor = NULL;
- new->security = old->security;
#ifdef CONFIG_NETFILTER
new->nfmark = old->nfmark;
new->nfcache = old->nfcache;
diff -Nru linux-2.6.11-bk5.orig/net/ipv4/ip_output.c linux-2.6.11-bk5/net/ipv4/ip_output.c
--- linux-2.6.11-bk5.orig/net/ipv4/ip_output.c 2005-03-09 22:00:40.000000000 +0100
+++ linux-2.6.11-bk5/net/ipv4/ip_output.c 2005-03-09 23:41:40.000000000 +0100
@@ -388,7 +388,6 @@
to->pkt_type = from->pkt_type;
to->priority = from->priority;
to->protocol = from->protocol;
- to->security = from->security;
dst_release(to->dst);
to->dst = dst_clone(from->dst);
to->dev = from->dev;
diff -Nru linux-2.6.11-bk5.orig/net/ipv6/ip6_output.c linux-2.6.11-bk5/net/ipv6/ip6_output.c
--- linux-2.6.11-bk5.orig/net/ipv6/ip6_output.c 2005-03-09 22:00:42.000000000 +0100
+++ linux-2.6.11-bk5/net/ipv6/ip6_output.c 2005-03-09 23:46:01.000000000 +0100
@@ -462,7 +462,6 @@
to->pkt_type = from->pkt_type;
to->priority = from->priority;
to->protocol = from->protocol;
- to->security = from->security;
dst_release(to->dst);
to->dst = dst_clone(from->dst);
to->dev = from->dev;
diff -Nru linux-2.6.11-bk5.orig/net/sched/em_meta.c linux-2.6.11-bk5/net/sched/em_meta.c
--- linux-2.6.11-bk5.orig/net/sched/em_meta.c 2005-03-09 22:00:41.000000000 +0100
+++ linux-2.6.11-bk5/net/sched/em_meta.c 2005-03-09 23:34:16.000000000 +0100
@@ -204,11 +204,6 @@
dst->value = skb->protocol;
}
-META_COLLECTOR(int_security)
-{
- dst->value = skb->security;
-}
-
META_COLLECTOR(int_pkttype)
{
dst->value = skb->pkt_type;
@@ -311,7 +306,6 @@
[TCF_META_ID_REALDEV] = { .get = meta_int_realdev },
[TCF_META_ID_PRIORITY] = { .get = meta_int_priority },
[TCF_META_ID_PROTOCOL] = { .get = meta_int_protocol },
- [TCF_META_ID_SECURITY] = { .get = meta_int_security },
[TCF_META_ID_PKTTYPE] = { .get = meta_int_pkttype },
[TCF_META_ID_PKTLEN] = { .get = meta_int_pktlen },
[TCF_META_ID_DATALEN] = { .get = meta_int_datalen },
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-09 23:57 ` Thomas Graf
@ 2005-03-10 0:03 ` Stephen Hemminger
2005-03-10 8:33 ` Andi Kleen
1 sibling, 0 replies; 51+ messages in thread
From: Stephen Hemminger @ 2005-03-10 0:03 UTC (permalink / raw)
To: Thomas Graf; +Cc: Ben Greear, David S. Miller, Andi Kleen, baruch, netdev
On Thu, 10 Mar 2005 00:57:28 +0100
Thomas Graf <tgraf@suug.ch> wrote:
> > Of course...this might be things for 2.7 since lots of modules will probably
> > be accessing these fields. Maybe to get started we could add macros to grab
> > the flags and such so that when we finally do collapse things into a single
> > flags field the external code doesn't have to know or care?
There may never be a 2.7; don't wait that long.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-09 23:57 ` Thomas Graf
2005-03-10 0:03 ` Stephen Hemminger
@ 2005-03-10 8:33 ` Andi Kleen
2005-03-10 14:08 ` Thomas Graf
1 sibling, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2005-03-10 8:33 UTC (permalink / raw)
To: Thomas Graf; +Cc: Ben Greear, David S. Miller, baruch, shemminger, netdev
> I attached a small patch below saving 4 bytes and leaving some room for
> additional flags. The removal of security has indeed potential to break
> external modules.
And? The Linux kernel never had a stable ABI.
-Andi
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-10 8:33 ` Andi Kleen
@ 2005-03-10 14:08 ` Thomas Graf
0 siblings, 0 replies; 51+ messages in thread
From: Thomas Graf @ 2005-03-10 14:08 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ben Greear, David S. Miller, baruch, shemminger, netdev
* Andi Kleen <20050310083309.GA510@muc.de> 2005-03-10 09:33
> > I attached a small patch below saving 4 bytes and leaving some room for
> > additional flags. The removal of security has indeed potential to break
> > external modules.
>
> And? The Linux kernel never had a stable ABI.
Sure, but why break it if it's not really necessary? Anyway, here's a
revised version saving 4 bytes and leaving 19 bits for additional flags
or a new field. A trivial reordering saves another 4 bytes on 64-bit
archs when both netfilter debugging and bridging are enabled.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
diff -Nru linux-2.6.11-bk5.orig/include/linux/skbuff.h linux-2.6.11-bk5/include/linux/skbuff.h
--- linux-2.6.11-bk5.orig/include/linux/skbuff.h 2005-03-10 13:34:00.000000000 +0100
+++ linux-2.6.11-bk5/include/linux/skbuff.h 2005-03-10 13:49:17.000000000 +0100
@@ -250,16 +250,16 @@
unsigned int len,
data_len,
- mac_len,
csum;
- unsigned char local_df,
+ unsigned short mac_len,
+ protocol;
+ unsigned char pkt_type,
+ local_df:1,
cloned:1,
- nohdr:1,
- pkt_type,
- ip_summed;
+ ip_summed:2,
+ nohdr:1;
+ __u16 __pad;
__u32 priority;
- unsigned short protocol,
- security;
void (*destructor)(struct sk_buff *skb);
#ifdef CONFIG_NETFILTER
@@ -267,12 +267,12 @@
__u32 nfcache;
__u32 nfctinfo;
struct nf_conntrack *nfct;
-#ifdef CONFIG_NETFILTER_DEBUG
- unsigned int nf_debug;
-#endif
#ifdef CONFIG_BRIDGE_NETFILTER
struct nf_bridge_info *nf_bridge;
#endif
+#ifdef CONFIG_NETFILTER_DEBUG
+ unsigned int nf_debug;
+#endif
#endif /* CONFIG_NETFILTER */
#if defined(CONFIG_HIPPI)
union {
diff -Nru linux-2.6.11-bk5.orig/include/linux/tc_ematch/tc_em_meta.h linux-2.6.11-bk5/include/linux/tc_ematch/tc_em_meta.h
--- linux-2.6.11-bk5.orig/include/linux/tc_ematch/tc_em_meta.h 2005-03-10 13:34:00.000000000 +0100
+++ linux-2.6.11-bk5/include/linux/tc_ematch/tc_em_meta.h 2005-03-09 23:34:28.000000000 +0100
@@ -45,7 +45,7 @@
TCF_META_ID_REALDEV,
TCF_META_ID_PRIORITY,
TCF_META_ID_PROTOCOL,
- TCF_META_ID_SECURITY,
+ TCF_META_ID_SECURITY, /* obsolete */
TCF_META_ID_PKTTYPE,
TCF_META_ID_PKTLEN,
TCF_META_ID_DATALEN,
diff -Nru linux-2.6.11-bk5.orig/net/core/skbuff.c linux-2.6.11-bk5/net/core/skbuff.c
--- linux-2.6.11-bk5.orig/net/core/skbuff.c 2005-03-10 13:34:00.000000000 +0100
+++ linux-2.6.11-bk5/net/core/skbuff.c 2005-03-09 23:33:54.000000000 +0100
@@ -359,7 +359,6 @@
C(ip_summed);
C(priority);
C(protocol);
- C(security);
n->destructor = NULL;
#ifdef CONFIG_NETFILTER
C(nfmark);
@@ -427,7 +426,6 @@
new->pkt_type = old->pkt_type;
new->stamp = old->stamp;
new->destructor = NULL;
- new->security = old->security;
#ifdef CONFIG_NETFILTER
new->nfmark = old->nfmark;
new->nfcache = old->nfcache;
diff -Nru linux-2.6.11-bk5.orig/net/ipv4/ip_output.c linux-2.6.11-bk5/net/ipv4/ip_output.c
--- linux-2.6.11-bk5.orig/net/ipv4/ip_output.c 2005-03-10 13:34:00.000000000 +0100
+++ linux-2.6.11-bk5/net/ipv4/ip_output.c 2005-03-09 23:41:40.000000000 +0100
@@ -388,7 +388,6 @@
to->pkt_type = from->pkt_type;
to->priority = from->priority;
to->protocol = from->protocol;
- to->security = from->security;
dst_release(to->dst);
to->dst = dst_clone(from->dst);
to->dev = from->dev;
diff -Nru linux-2.6.11-bk5.orig/net/ipv6/ip6_output.c linux-2.6.11-bk5/net/ipv6/ip6_output.c
--- linux-2.6.11-bk5.orig/net/ipv6/ip6_output.c 2005-03-10 13:34:00.000000000 +0100
+++ linux-2.6.11-bk5/net/ipv6/ip6_output.c 2005-03-09 23:46:01.000000000 +0100
@@ -462,7 +462,6 @@
to->pkt_type = from->pkt_type;
to->priority = from->priority;
to->protocol = from->protocol;
- to->security = from->security;
dst_release(to->dst);
to->dst = dst_clone(from->dst);
to->dev = from->dev;
diff -Nru linux-2.6.11-bk5.orig/net/sched/em_meta.c linux-2.6.11-bk5/net/sched/em_meta.c
--- linux-2.6.11-bk5.orig/net/sched/em_meta.c 2005-03-10 13:34:00.000000000 +0100
+++ linux-2.6.11-bk5/net/sched/em_meta.c 2005-03-09 23:34:16.000000000 +0100
@@ -204,11 +204,6 @@
dst->value = skb->protocol;
}
-META_COLLECTOR(int_security)
-{
- dst->value = skb->security;
-}
-
META_COLLECTOR(int_pkttype)
{
dst->value = skb->pkt_type;
@@ -311,7 +306,6 @@
[TCF_META_ID_REALDEV] = { .get = meta_int_realdev },
[TCF_META_ID_PRIORITY] = { .get = meta_int_priority },
[TCF_META_ID_PROTOCOL] = { .get = meta_int_protocol },
- [TCF_META_ID_SECURITY] = { .get = meta_int_security },
[TCF_META_ID_PKTTYPE] = { .get = meta_int_pkttype },
[TCF_META_ID_PKTLEN] = { .get = meta_int_pktlen },
[TCF_META_ID_DATALEN] = { .get = meta_int_datalen },
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-08 15:56 ` Baruch Even
2005-03-08 22:02 ` jamal
@ 2005-03-22 21:55 ` cliff white
1 sibling, 0 replies; 51+ messages in thread
From: cliff white @ 2005-03-22 21:55 UTC (permalink / raw)
To: Baruch Even, netdev
On Tue, 08 Mar 2005 15:56:04 +0000
Baruch Even <baruch@ev-en.org> wrote:
> jamal wrote:
> > On Fri, 2005-03-04 at 03:47, Baruch Even wrote:
> >
> >>jamal wrote:
> >
> >
[snip]
>
> > So if i was you i would repeat 1.2 with the fix from 1.1 as well as
> > tying the NIC to one CPU. And it would be a good idea to present more
> > detailed results - not just tcp windows fluctuating (you may not need
> > them for the paper, but would be useful to see for debugging purposes
> > other parameters).
>
> I'd be happy to hear what other benchmarks you would like to see, I
> currently intend to add some ack processing time analysis and oprofile
> information. With possibly showing the size of the ingress queue as a
> measure as well.
>
> Making it as thorough as possible is one of my goals. Input is always
> welcome.
I am collecting network benchmarks/tests in the hope of doing some testing here at OSDL.
I would love to run your benchmark, if you wish to share it.
cliffw
>
> Baruch
>
--
"Ive always gone through periods where I bolt upright at four in the morning;
now at least theres a reason." -Michael Feldman
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: netif_rx packet dumping
2005-03-03 21:57 ` David S. Miller
2005-03-03 22:14 ` Baruch Even
2005-03-08 15:42 ` Baruch Even
@ 2005-03-31 16:33 ` Baruch Even
2 siblings, 0 replies; 51+ messages in thread
From: Baruch Even @ 2005-03-31 16:33 UTC (permalink / raw)
To: David S. Miller; +Cc: shemminger, jheffner, netdev
David S. Miller wrote:
> On Thu, 03 Mar 2005 21:44:52 +0000
> Baruch Even <baruch@ev-en.org> wrote:
>
>
>>The current linked list goes over all the packets, the linked list we
>>add is for the packets that were not SACKed. The idea being that it is a
>>lot faster since there are a lot less packets not SACKed compared to
>>packets already SACKed (or never mentioned in SACKs).
>>
>>If you have a way around this I'd be happy to hear it.
>
>
> I'm sure you can find a way to steal sizeof(void *) from
> "struct tcp_skb_cb" :-)
>
> It is currently 36 bytes on both 32-bit and 64-bit platforms.
> This means if you can squeeze out 4 bytes (so that it fits
> in the skb->cb[] 40 byte area), you can fit a pointer in there
> for the linked list stuff.
I changed the code to use only the next pointer and dropped the prev.
The cb still fits into 40 bytes for 32-bit, but for an em64t compile with
gcc version 3.4.4 20041218 (prerelease) (Debian 3.4.3-6) the cb now
requires 48 bytes. I haven't looked at the code emitted, but I suspect
it's alignment that forced the pointer to start at offset 40, and the
pointer itself is 8 bytes.
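A tiny userspace illustration of that alignment effect, with an invented
stand-in struct for the existing cb contents (field names and sizes are
assumptions, not the real tcp_skb_cb layout):

#include <stdio.h>
#include <stddef.h>

/* Stand-in for ~36 bytes of existing control-block fields followed by
 * the added list pointer. */
struct cb_like {
	unsigned int words[9];	/* 36 bytes of existing state */
	void *next_hint;	/* the added pointer */
};

int main(void)
{
	/* On a typical 64-bit ABI the pointer is aligned to 8 bytes, so
	 * it starts at offset 40 and the struct rounds up to 48 bytes;
	 * on 32-bit it starts at 36 and the struct stays at 40. */
	printf("offsetof(next_hint) = %zu, sizeof = %zu\n",
	       offsetof(struct cb_like, next_hint), sizeof(struct cb_like));
	return 0;
}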
From a quick test I did there was no performance hit and the numbers were
very similar to those in the 32-bit case. The machine is exactly the
same (it's a 3GHz Xeon which I had so far only run as 32-bit, not
suspecting I had a 64-bit test machine).
Baruch
P.S. Thanks to whoever put in the compile-time test for this; it probably
saved me quite a lot of time hunting weird crashes.
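The compile-time test referred to here is, roughly, an assertion that
struct tcp_skb_cb still fits inside skb->cb[]. A minimal sketch of such
a check (the macro name is invented and this is not the kernel's exact
wording) could look like this, invoked once from TCP init code:

#include <linux/skbuff.h>
#include <net/tcp.h>

/* BUILD_BUG_ON-style trick: the array size becomes negative, and the
 * build fails, if struct tcp_skb_cb no longer fits in skb->cb[]. */
#define assert_tcp_cb_fits()						\
	((void)sizeof(char[1 - 2 * (sizeof(struct tcp_skb_cb) >	\
				    sizeof(((struct sk_buff *)0)->cb))]))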
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2005-03-31 16:33 UTC | newest]
Thread overview: 51+ messages
2005-03-03 20:38 netif_rx packet dumping Stephen Hemminger
2005-03-03 20:55 ` David S. Miller
2005-03-03 21:01 ` Stephen Hemminger
2005-03-03 21:18 ` jamal
2005-03-03 21:21 ` Stephen Hemminger
2005-03-03 21:24 ` jamal
2005-03-03 21:32 ` David S. Miller
2005-03-03 21:54 ` Stephen Hemminger
2005-03-03 22:02 ` John Heffner
2005-03-03 22:26 ` jamal
2005-03-03 23:16 ` Stephen Hemminger
2005-03-03 23:40 ` jamal
2005-03-03 23:48 ` Baruch Even
2005-03-04 3:45 ` jamal
2005-03-04 8:47 ` Baruch Even
2005-03-07 13:55 ` jamal
2005-03-08 15:56 ` Baruch Even
2005-03-08 22:02 ` jamal
2005-03-22 21:55 ` cliff white
2005-03-03 23:48 ` John Heffner
2005-03-04 1:42 ` Lennert Buytenhek
2005-03-04 3:10 ` John Heffner
2005-03-04 3:31 ` Lennert Buytenhek
2005-03-04 19:52 ` Edgar E Iglesias
2005-03-04 19:54 ` Stephen Hemminger
2005-03-04 21:41 ` Edgar E Iglesias
2005-03-04 19:49 ` Jason Lunz
2005-03-03 22:01 ` jamal
2005-03-03 21:26 ` Baruch Even
2005-03-03 21:36 ` David S. Miller
2005-03-03 21:44 ` Baruch Even
2005-03-03 21:54 ` Andi Kleen
2005-03-03 22:04 ` David S. Miller
2005-03-03 21:57 ` David S. Miller
2005-03-03 22:14 ` Baruch Even
2005-03-08 15:42 ` Baruch Even
2005-03-08 17:00 ` Andi Kleen
2005-03-08 18:01 ` Baruch Even
2005-03-08 18:09 ` David S. Miller
2005-03-08 18:18 ` Andi Kleen
2005-03-08 18:37 ` Thomas Graf
2005-03-08 18:51 ` Arnaldo Carvalho de Melo
2005-03-08 22:16 ` Andi Kleen
2005-03-08 18:27 ` Ben Greear
2005-03-09 23:57 ` Thomas Graf
2005-03-10 0:03 ` Stephen Hemminger
2005-03-10 8:33 ` Andi Kleen
2005-03-10 14:08 ` Thomas Graf
2005-03-31 16:33 ` Baruch Even
2005-03-03 22:03 ` jamal
2005-03-03 22:31 ` Baruch Even