* [e1000 2.6 10/11] TxDescriptors -> 1024 default
@ 2003-09-09 3:14 Feldman, Scott
2003-09-11 19:18 ` Jeff Garzik
0 siblings, 1 reply; 38+ messages in thread
From: Feldman, Scott @ 2003-09-09 3:14 UTC (permalink / raw)
To: Jeff Garzik; +Cc: netdev, ricardoz
* Change the default number of Tx descriptors from 256 to 1024.
Data from [ricardoz@us.ibm.com] shows it's easy to overrun
the Tx desc queue.
-------------
diff -Nuarp linux-2.6.0-test4/drivers/net/e1000/e1000_param.c linux-2.6.0-test4/drivers/net/e1000.new/e1000_param.c
--- linux-2.6.0-test4/drivers/net/e1000/e1000_param.c 2003-08-22 16:57:59.000000000 -0700
+++ linux-2.6.0-test4/drivers/net/e1000.new/e1000_param.c 2003-09-08 09:13:12.000000000 -0700
@@ -63,9 +63,10 @@ MODULE_PARM_DESC(X, S);
/* Transmit Descriptor Count
*
* Valid Range: 80-256 for 82542 and 82543 gigabit ethernet controllers
- * Valid Range: 80-4096 for 82544
+ * Valid Range: 80-4096 for 82544 and newer
*
- * Default Value: 256
+ * Default Value: 256 for 82542 and 82543 gigabit ethernet controllers
+ * Default Value: 1024 for 82544 and newer
*/
E1000_PARAM(TxDescriptors, "Number of transmit descriptors");
@@ -73,7 +74,7 @@ E1000_PARAM(TxDescriptors, "Number of tr
/* Receive Descriptor Count
*
* Valid Range: 80-256 for 82542 and 82543 gigabit ethernet controllers
- * Valid Range: 80-4096 for 82544
+ * Valid Range: 80-4096 for 82544 and newer
*
* Default Value: 256
*/
@@ -200,6 +201,7 @@ E1000_PARAM(InterruptThrottleRate, "Inte
#define MAX_TXD 256
#define MIN_TXD 80
#define MAX_82544_TXD 4096
+#define DEFAULT_82544_TXD 1024
#define DEFAULT_RXD 256
#define MAX_RXD 256
@@ -320,12 +322,15 @@ e1000_check_options(struct e1000_adapter
struct e1000_option opt = {
.type = range_option,
.name = "Transmit Descriptors",
- .err = "using default of " __MODULE_STRING(DEFAULT_TXD),
- .def = DEFAULT_TXD,
.arg = { .r = { .min = MIN_TXD }}
};
struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
e1000_mac_type mac_type = adapter->hw.mac_type;
+ opt.err = mac_type < e1000_82544 ?
+ "using default of " __MODULE_STRING(DEFAULT_TXD) :
+ "using default of " __MODULE_STRING(DEFAULT_82544_TXD);
+ opt.def = mac_type < e1000_82544 ?
+ DEFAULT_TXD : DEFAULT_82544_TXD;
opt.arg.r.max = mac_type < e1000_82544 ?
MAX_TXD : MAX_82544_TXD;
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-09 3:14 Feldman, Scott
@ 2003-09-11 19:18 ` Jeff Garzik
2003-09-11 19:45 ` Ben Greear
0 siblings, 1 reply; 38+ messages in thread
From: Jeff Garzik @ 2003-09-11 19:18 UTC (permalink / raw)
To: Feldman, Scott; +Cc: netdev, ricardoz
Feldman, Scott wrote:
> * Change the default number of Tx descriptors from 256 to 1024.
> Data from [ricardoz@us.ibm.com] shows it's easy to overrun
> the Tx desc queue.
All e1000 patches applied except this one.
Of _course_ it's easy to overrun the Tx desc queue. That's why we have
a TX queue sitting on top of the NIC's hardware queue. And TCP socket
buffers on top of that. And similar things.
Descriptor increases like this are usually the result of some sillyhead
blasting out UDP packets, and then wondering why he sees packet loss on
the local computer (the "blast out packets" side).
You're just wasting memory.
Jeff
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 19:18 ` Jeff Garzik
@ 2003-09-11 19:45 ` Ben Greear
2003-09-11 19:59 ` Jeff Garzik
2003-09-11 20:12 ` David S. Miller
0 siblings, 2 replies; 38+ messages in thread
From: Ben Greear @ 2003-09-11 19:45 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Feldman, Scott, netdev, ricardoz
Jeff Garzik wrote:
> Feldman, Scott wrote:
>
>> * Change the default number of Tx descriptors from 256 to 1024.
>> Data from [ricardoz@us.ibm.com] shows it's easy to overrun
>> the Tx desc queue.
>
>
>
> All e1000 patches applied except this one.
>
> Of _course_ it's easy to overrun the Tx desc queue. That's why we have
> a TX queue sitting on top of the NIC's hardware queue. And TCP socket
> buffers on top of that. And similar things.
>
> Descriptor increases like this are usually the result of some sillyhead
> blasting out UDP packets, and then wondering why he sees packet loss on
> the local computer (the "blast out packets" side).
Erm, shouldn't the local machine back itself off if the various
queues are full? Some time back I looked through the code and it
appeared to. If not, I think it should.
>
> You're just wasting memory.
>
> Jeff
>
>
>
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 19:45 ` Ben Greear
@ 2003-09-11 19:59 ` Jeff Garzik
2003-09-11 20:12 ` David S. Miller
1 sibling, 0 replies; 38+ messages in thread
From: Jeff Garzik @ 2003-09-11 19:59 UTC (permalink / raw)
To: Ben Greear; +Cc: Feldman, Scott, netdev, ricardoz
Ben Greear wrote:
> Jeff Garzik wrote:
>
>> Feldman, Scott wrote:
>>
>>> * Change the default number of Tx descriptors from 256 to 1024.
>>> Data from [ricardoz@us.ibm.com] shows it's easy to overrun
>>> the Tx desc queue.
>>
>>
>>
>>
>> All e1000 patches applied except this one.
>>
>> Of _course_ it's easy to overrun the Tx desc queue. That's why we
>> have a TX queue sitting on top of the NIC's hardware queue. And TCP
>> socket buffers on top of that. And similar things.
>>
>> Descriptor increases like this are usually the result of some
>> sillyhead blasting out UDP packets, and then wondering why he sees
>> packet loss on the local computer (the "blast out packets" side).
>
>
> Erm, shouldn't the local machine back itself off if the various
> queues are full? Some time back I looked through the code and it
> appeared to. If not, I think it should.
Given the guarantees of the protocol, the net stack has the freedom to
drop UDP packets, for example at times when (for TCP) one would
otherwise queue a packet for retransmit.
Jeff
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 19:45 ` Ben Greear
2003-09-11 19:59 ` Jeff Garzik
@ 2003-09-11 20:12 ` David S. Miller
2003-09-11 20:40 ` Ben Greear
1 sibling, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-09-11 20:12 UTC (permalink / raw)
To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz
On Thu, 11 Sep 2003 12:45:55 -0700
Ben Greear <greearb@candelatech.com> wrote:
> Erm, shouldn't the local machine back itself off if the various
> queues are full? Some time back I looked through the code and it
> appeared to. If not, I think it should.
Generic networking device queues drop when they overflow.
Whatever dev->tx_queue_len is set to, the device driver needs
to be prepared to be able to queue successfully.
Most people run into problems when they run stupid UDP applications
that send a stream of tinygrams (<~64 bytes). The solutions are to
either fix the UDP app or restrict its socket send buffer size.
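A minimal sketch of that second workaround, capping a UDP sender's socket
buffer with setsockopt() (the 32KB value is purely illustrative):

#include <sys/socket.h>

/* Cap the socket's send buffer so a flood of tinygrams blocks the
 * sender (or returns EAGAIN on a non-blocking socket) instead of
 * piling up in front of the device. */
static int cap_udp_sndbuf(int fd)
{
        int sndbuf = 32 * 1024;         /* purely illustrative value */

        return setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                          &sndbuf, sizeof(sndbuf));
}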
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 20:12 ` David S. Miller
@ 2003-09-11 20:40 ` Ben Greear
2003-09-11 21:07 ` David S. Miller
0 siblings, 1 reply; 38+ messages in thread
From: Ben Greear @ 2003-09-11 20:40 UTC (permalink / raw)
To: David S. Miller; +Cc: jgarzik, scott.feldman, netdev, ricardoz
David S. Miller wrote:
> Generic networking device queues drop when they overflow.
>
> Whatever dev->tx_queue_len is set to, the device driver needs
> to be prepared to be able to queue successfully.
>
> Most people run into problems when they run stupid UDP applications
> that send a stream of tinygrams (<~64 bytes). The solutions are to
> either fix the UDP app or restrict its socket send buffer size.
Is this close to how it works?
So, assume we configure a 10MB socket send queue on our UDP socket...
Select says it's writable up to at least 5MB.
We write 5MB of 64-byte packets "right now".
Did we just drop a large number of packets?
I would expect that the packets, up to 10MB, are buffered in some
list/fifo in the socket code, and that as the underlying device queue
empties itself, the socket will feed it more packets.
The device queue, in turn, is emptied as the driver is able to fill its
TxDescriptors, and the hardware empties the TxDescriptors.
Obviously, I'm confused somewhere....
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 20:40 ` Ben Greear
@ 2003-09-11 21:07 ` David S. Miller
2003-09-11 21:29 ` Ben Greear
0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-09-11 21:07 UTC (permalink / raw)
To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz
On Thu, 11 Sep 2003 13:40:44 -0700
Ben Greear <greearb@candelatech.com> wrote:
> So, assume we configure a 10MB socket send queue on our UDP socket...
>
> Select says it's writable up to at least 5MB.
>
> We write 5MB of 64-byte packets "right now".
>
> Did we just drop a large number of packets?
Yes, we did _iff_ dev->tx_queue_len is less than or equal
to (5MB / (64 + sizeof(udp_id_headers))).
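As a rough illustration (assuming ~28 bytes of UDP/IP header per datagram
and ignoring per-skb accounting overhead):

  5MB / (64 + 28) = 5,242,880 / 92 =~ 57,000 datagrams

which dwarfs the default Ethernet dev->tx_queue_len of 100, so most of such
a burst could be dropped at the device queue.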
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
[not found] <3F60DE5B.1010700@pobox.com>
@ 2003-09-11 21:27 ` Ricardo C Gonzalez
0 siblings, 0 replies; 38+ messages in thread
From: Ricardo C Gonzalez @ 2003-09-11 21:27 UTC (permalink / raw)
To: Jeff Garzik; +Cc: netdev, jgarzik, scott.feldman, davem
Do not make this a UDP issue. It is much easier today to overrun the
Tx queue, as N TCP connections can be running in parallel on different
CPUs.
You can see this in the data I sent out.
regards,
----------------------------------------------------------------------------------
*** ALWAYS THINK POSITIVE ***
Rick Gonzalez
IBM Linux Performance Group
Building: 905 Office: 7G019
Phone: (512) 838-0623
Jeff Garzik <jgarzik@pobox.com> on 09/11/2003 03:43:07 PM
To: Ricardo C Gonzalez/Austin/IBM@ibmus
cc: scott.feldman@intel.com, greearb@candelatech.com
Subject: Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
Ricardo C Gonzalez wrote:
> The default Tx queue can easily be overrun by today’s SMP systems,
> taking into consideration that these are 4-way systems at GHz speeds.
As I said, this is expected.
And please CC the netdev@oss.sgi.com mailing list, where technical
discussions like this are held.
Jeff
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 21:29 ` Ben Greear
@ 2003-09-11 21:29 ` David S. Miller
2003-09-11 21:47 ` Ricardo C Gonzalez
2003-09-11 22:15 ` Ben Greear
0 siblings, 2 replies; 38+ messages in thread
From: David S. Miller @ 2003-09-11 21:29 UTC (permalink / raw)
To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz
On Thu, 11 Sep 2003 14:29:43 -0700
Ben Greear <greearb@candelatech.com> wrote:
> Thanks for that clarification. Is there no way to tell
> at 'sendto' time that the buffers are over-full, and either
> block or return -EBUSY or something like that?
The TX queue state can change by hundreds of packets by
the time we are finished making the "decision", also how would
you like to "wake" up sockets when the TX queue is liberated.
That extra overhead and logic would be wonderful for performance.
No, this is all nonsense. Packet scheduling and queueing is
an opaque layer to all the upper layers. It is the only sensible
design.
IP transmit is a black hole that may drop packets at any moment;
any datagram application not prepared for this should be prepared
for troubles or choose to move over to something like TCP.
I even listed a workaround for such stupid UDP apps: simply limit
their socket send queue limits.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 21:07 ` David S. Miller
@ 2003-09-11 21:29 ` Ben Greear
2003-09-11 21:29 ` David S. Miller
0 siblings, 1 reply; 38+ messages in thread
From: Ben Greear @ 2003-09-11 21:29 UTC (permalink / raw)
To: David S. Miller; +Cc: jgarzik, scott.feldman, netdev, ricardoz
David S. Miller wrote:
> On Thu, 11 Sep 2003 13:40:44 -0700
> Ben Greear <greearb@candelatech.com> wrote:
>
>
>>So, assume we configure a 10MB socket send queue on our UDP socket...
>>
>>Select says it's writable up to at least 5MB.
>>
>>We write 5MB of 64-byte packets "right now".
>>
>>Did we just drop a large number of packets?
>
>
> Yes, we did _iff_ dev->tx_queue_len is less than or equal
> to (5MB / (64 + sizeof(udp_id_headers))).
Thanks for that clarification. Is there no way to tell
at 'sendto' time that the buffers are over-full, and either
block or return -EBUSY or something like that?
Perhaps the poll logic should also take the underlying buffer
into account and not show the socket as writable in this case?
Supposing in the above example, I set tx_queue_len to
(5MB / (64 + sizeof(udp_id_headers))), will
the packets now be dropped in the driver instead, or will there
be no more (local) drops?
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 21:29 ` David S. Miller
@ 2003-09-11 21:47 ` Ricardo C Gonzalez
2003-09-11 22:00 ` Jeff Garzik
2003-09-11 22:15 ` Ben Greear
1 sibling, 1 reply; 38+ messages in thread
From: Ricardo C Gonzalez @ 2003-09-11 21:47 UTC (permalink / raw)
To: David S. Miller; +Cc: greearb, jgarzik, scott.feldman, netdev
>IP transmit is a black hole that may drop packets at any moment;
>any datagram application not prepared for this should be prepared
>for troubles or choose to move over to something like TCP.
As I said before, please do not make this a UDP issue. The data I sent out
was taken using a TCP_STREAM test case. Please review it.
regards,
----------------------------------------------------------------------------------
*** ALWAYS THINK POSITIVE ***
Rick Gonzalez
IBM Linux Performance Group
Building: 905 Office: 7G019
Phone: (512) 838-0623
"David S. Miller" <davem@redhat.com> on 09/11/2003 04:29:06 PM
To: Ben Greear <greearb@candelatech.com>
cc: jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com,
Ricardo C Gonzalez/Austin/IBM@ibmus
Subject: Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
On Thu, 11 Sep 2003 14:29:43 -0700
Ben Greear <greearb@candelatech.com> wrote:
> Thanks for that clarification. Is there no way to tell
> at 'sendto' time that the buffers are over-full, and either
> block or return -EBUSY or something like that?
The TX queue state can change by hundreds of packets by
the time we are finished making the "decision", also how would
you like to "wake" up sockets when the TX queue is liberated.
That extra overhead and logic would be wonderful for performance.
No, this is all nonsense. Packet scheduling and queueing is
an opaque layer to all the upper layers. It is the only sensible
design.
IP transmit is a black hole that may drop packets at any moment;
any datagram application not prepared for this should be prepared
for troubles or choose to move over to something like TCP.
I even listed a workaround for such stupid UDP apps: simply limit
their socket send queue limits.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 21:47 ` Ricardo C Gonzalez
@ 2003-09-11 22:00 ` Jeff Garzik
0 siblings, 0 replies; 38+ messages in thread
From: Jeff Garzik @ 2003-09-11 22:00 UTC (permalink / raw)
To: Ricardo C Gonzalez; +Cc: David S. Miller, greearb, scott.feldman, netdev
Ricardo C Gonzalez wrote:
>
>
>>IP transmit is a black hole that may drop packets at any moment;
>>any datagram application not prepared for this should be prepared
>>for troubles or choose to move over to something like TCP.
>
>
>
> As I said before, please do not make this a UDP issue. The data I sent out
> was taken using a TCP_STREAM test case. Please review it.
Your own words say "CPUs can fill TX queue". We already know this.
CPUs have been doing wire speed for ages.
Jeff
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 21:29 ` David S. Miller
2003-09-11 21:47 ` Ricardo C Gonzalez
@ 2003-09-11 22:15 ` Ben Greear
2003-09-11 23:02 ` David S. Miller
1 sibling, 1 reply; 38+ messages in thread
From: Ben Greear @ 2003-09-11 22:15 UTC (permalink / raw)
To: David S. Miller; +Cc: jgarzik, scott.feldman, netdev, ricardoz
David S. Miller wrote:
> On Thu, 11 Sep 2003 14:29:43 -0700
> Ben Greear <greearb@candelatech.com> wrote:
>
>
>>Thanks for that clarification. Is there no way to tell
>>at 'sendto' time that the buffers are over-full, and either
>>block or return -EBUSY or something like that?
>
>
> The TX queue state can change by hundreds of packets by
> the time we are finished making the "decision", also how would
> you like to "wake" up sockets when the TX queue is liberated.
So, at some point the decision is already made that we must drop
the packet, or that we can enqueue it. This is where I would propose
we block the thing trying to enqueue, or at least propagate a failure
code back up the stack(s) so that the packet can be retried by the
calling layer.
Preferably, one would propagate the error all the way to userspace
and let them deal with it, just like we currently deal with socket
queue full issues.
> That extra overhead and logic would be wonderful for performance.
The cost of a retransmit is also expensive, whether it is some hacked
up UDP protocol or for TCP. Even if one had to implement callbacks
from the device queue to the interested sockets, this should not
be a large performance hit.
>
> No, this is all nonsense. Packet scheduling and queueing is
> an opaque layer to all the upper layers. It is the only sensible
> design.
This is possible, but it does not seem cut and dried to me. If there
is any documentation or research that supports this assertion, please
do let us know.
>
> IP transmit is a black hole that may drop packets at any moment;
> any datagram application not prepared for this should be prepared
> for troubles or choose to move over to something like TCP.
>
> I even listed a workaround for such stupid UDP apps: simply limit
> their socket send queue limits.
And the original poster shows how a similar problem slows down TCP
as well due to local dropped packets. Don't you think we'd get better
TCP throughput if we instead had the calling code wait 1us for the buffers
to clear?
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 22:15 ` Ben Greear
@ 2003-09-11 23:02 ` David S. Miller
2003-09-11 23:22 ` Ben Greear
0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-09-11 23:02 UTC (permalink / raw)
To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz
On Thu, 11 Sep 2003 15:15:19 -0700
Ben Greear <greearb@candelatech.com> wrote:
> And the original poster shows how a similar problem slows down TCP
> as well due to local dropped packets.
So, again, dampen the per-socket send queue sizes.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 23:02 ` David S. Miller
@ 2003-09-11 23:22 ` Ben Greear
2003-09-11 23:29 ` David S. Miller
2003-09-12 1:34 ` jamal
0 siblings, 2 replies; 38+ messages in thread
From: Ben Greear @ 2003-09-11 23:22 UTC (permalink / raw)
Cc: jgarzik, scott.feldman, netdev, ricardoz
David S. Miller wrote:
> On Thu, 11 Sep 2003 15:15:19 -0700
> Ben Greear <greearb@candelatech.com> wrote:
>
>
>>And the original poster shows how a similar problem slows down TCP
>>as well due to local dropped packets.
>
>
> So, again, dampen the per-socket send queue sizes.
That's just a band-aid to cover up the flaw with the lack
of queue-pressure feedback to the higher stacks, as would be increasing the
TxDescriptors for that matter.
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 23:22 ` Ben Greear
@ 2003-09-11 23:29 ` David S. Miller
2003-09-12 1:34 ` jamal
1 sibling, 0 replies; 38+ messages in thread
From: David S. Miller @ 2003-09-11 23:29 UTC (permalink / raw)
To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz
On Thu, 11 Sep 2003 16:22:35 -0700
Ben Greear <greearb@candelatech.com> wrote:
> David S. Miller wrote:
> > So, again, dampen the per-socket send queue sizes.
>
> That's just a band-aid to cover up the flaw with the lack
> of queue-pressure feedback to the higher stacks, as would be increasing the
> TxDescriptors for that matter.
The whole point of the various packet scheduler algorithms
is foregone if we're just going to queue up and send the
crap again.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-11 23:22 ` Ben Greear
2003-09-11 23:29 ` David S. Miller
@ 2003-09-12 1:34 ` jamal
2003-09-12 2:20 ` Ricardo C Gonzalez
2003-09-13 3:49 ` David S. Miller
1 sibling, 2 replies; 38+ messages in thread
From: jamal @ 2003-09-12 1:34 UTC (permalink / raw)
To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz
Scott,
don't increase the tx descriptor ring size - that would truly be wasting
memory; 256 is pretty adequate.
* increase instead the txqueuelen (as suggested by Davem); user space
tools like ip or ifconfig could do it. The standard size has been around
100 for 100Mbps; i suppose it is fair to say that Gige can move data out
at 10x that; so set it to 1000. Maybe you can do this from the driver
based on what negotiated speed is detected?
--------
[root@jzny root]# ip link ls eth0
4: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
link/ether 00:b0:d0:05:ae:81 brd ff:ff:ff:ff:ff:ff
[root@jzny root]# ip link set eth0 txqueuelen 1000
[root@jzny root]# ip link ls eth0
4: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:b0:d0:05:ae:81 brd ff:ff:ff:ff:ff:ff
-------
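A minimal sketch of the driver-side idea, using a hypothetical helper
(dev->tx_queue_len is a real field; the function below is not existing
e1000 code):

#include <linux/netdevice.h>

/* Hypothetical helper: size the software Tx queue once the negotiated
 * link speed is known. */
static void example_set_tx_queue_len(struct net_device *netdev,
                                     unsigned int speed_mbps)
{
        if (speed_mbps >= 1000)
                netdev->tx_queue_len = 1000;    /* GigE */
        else
                netdev->tx_queue_len = 100;     /* historical 10/100 default */
}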
TCP already reacts to packets dropped at the scheduler level; UDP would
be too hard to enforce since the logic is typically in an app above UDP.
So just control it via the socket queue size.
cheers,
jamal
On Thu, 2003-09-11 at 19:22, Ben Greear wrote:
> David S. Miller wrote:
> > On Thu, 11 Sep 2003 15:15:19 -0700
> > Ben Greear <greearb@candelatech.com> wrote:
> >
> >
> >>And the original poster shows how a similar problem slows down TCP
> >>as well due to local dropped packets.
> >
> >
> > So, again, dampen the per-socket send queue sizes.
>
> That's just a band-aid to cover up the flaw with the lack
> of queue-pressure feedback to the higher stacks, as would be increasing the
> TxDescriptors for that matter.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 1:34 ` jamal
@ 2003-09-12 2:20 ` Ricardo C Gonzalez
2003-09-12 3:05 ` jamal
2003-09-13 3:49 ` David S. Miller
1 sibling, 1 reply; 38+ messages in thread
From: Ricardo C Gonzalez @ 2003-09-12 2:20 UTC (permalink / raw)
To: hadi; +Cc: greearb, jgarzik, scott.feldman, netdev
Jamal wrote:
>* increase instead the txqueuelen (as suggested by Davem); user space
>tools like ip or ifconfig could do it. The standard size has been around
>100 for 100Mbps; i suppose it is fair to say that Gige can move data out
>at 10x that; so set it to 1000. Maybe you can do this from the driver
>based on what negotiated speed is detected?
This is also another way to do it, as long as we make it harder for
users to drop packets and keep up to date with Gigabit speeds. We would also
have to think about the upcoming 10GigE adapters and their queue sizes,
but that is a separate issue. Anyway, the driver can easily set the
txqueuelen to 1000.
We should care about counting the packets being dropped on the
transmit side. Would it be the responsibility of the driver to account for
these drops? Because each driver has a dedicated software queue and, in my
opinion, the driver should account for these packets.
regards,
----------------------------------------------------------------------------------
*** ALWAYS THINK POSITIVE ***
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 2:20 ` Ricardo C Gonzalez
@ 2003-09-12 3:05 ` jamal
0 siblings, 0 replies; 38+ messages in thread
From: jamal @ 2003-09-12 3:05 UTC (permalink / raw)
To: Ricardo C Gonzalez; +Cc: greearb, jgarzik, scott.feldman, netdev
On Thu, 2003-09-11 at 22:20, Ricardo C Gonzalez wrote:
> Jamal wrote:
> We should care about counting the packets being dropped on the
> transmit side. Would it be the responsibility of the driver to account for
> these drops? Because each driver has a dedicated software queue and, in my
> opinion, the driver should account for these packets.
This is really the scheduler's responsibility. It's hard for the driver to
keep track of why a packet was dropped. For example, it could be dropped to
make room for a higher-priority packet that's anticipated to show
up soon.
The simple default 3-band scheduler unfortunately doesn't quite show its
stats... so a simple way to see drops is:
- install the prio qdisc
------
[root@jzny root]# tc qdisc add dev eth0 root prio
[root@jzny root]# tc -s qdisc
qdisc prio 8001: dev eth0 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 42 bytes 1 pkts (dropped 0, overlimits 0)
-----
or you may want to install a single pfifo queue with a size of 1000
(although this is a little too medieval). Example:
#tc qdisc add dev eth0 root pfifo limit 1000
#tc -s qdisc
qdisc pfifo 8002: dev eth0 limit 1000p
Sent 0 bytes 0 pkts (dropped 0, overlimits 0)
etc
cheers,
jamal
^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [e1000 2.6 10/11] TxDescriptors -> 1024 default
@ 2003-09-12 5:13 Feldman, Scott
2003-09-12 12:44 ` jamal
0 siblings, 1 reply; 38+ messages in thread
From: Feldman, Scott @ 2003-09-12 5:13 UTC (permalink / raw)
To: Jeff Garzik; +Cc: netdev, ricardoz
> Feldman, Scott wrote:
> > * Change the default number of Tx descriptors from 256 to 1024.
> > Data from [ricardoz@us.ibm.com] shows it's easy to overrun
> > the Tx desc queue.
>
>
> All e1000 patches applied except this one.
>
> You're just wasting memory.
256 descriptors does sound like enough, but what about the fragmented
skb case? MAX_SKB_FRAGS is (64K/4K + 2) = 18, and each fragment gets
mapped to one descriptor. It even gets worse with e1000 because we may
need to split a fragment into two fragments in the driver to work around
hardware errata. :-(
It would be interesting to see the frags:skb ratio for 2.6 with TSO
enabled and disabled with Rick's test. So our effective number of
descriptors needs to be adjusted by that ratio. I agree with David that
it's wasteful for the device driver to worry about more than
dev->tx_queue_len, but that's in skb units, not descriptor units.
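As a rough illustration of the unit mismatch: a worst-case skb can consume
18 descriptors for its fragments alone, so a 256-entry ring covers only
about 256 / 18 =~ 14 such skbs (fewer still if the errata split doubles
some fragments), while dev->tx_queue_len is counted in whole skbs.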
On the other hand, if we're always running the descriptor ring near
empty, we've got other problems. It stands to reason that it doesn't
matter how big the ring is if we're in that situation. If the CPU can
overrun the device, expanding the queues between the CPU and the device
may help with bursts but gets you nothing for a sustained load.
I flunked that queuing theory class anyway, so what do I know? Every
time I get stuck in a traffic slug on the freeway, I think about that
class. Hey, that means my car is like an skb, so maybe longer roads
would help? Not!
-scott
^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 5:13 [e1000 2.6 10/11] TxDescriptors -> 1024 default Feldman, Scott
@ 2003-09-12 12:44 ` jamal
2003-09-12 15:29 ` Donald Becker
2003-09-12 18:12 ` Ben Greear
0 siblings, 2 replies; 38+ messages in thread
From: jamal @ 2003-09-12 12:44 UTC (permalink / raw)
To: Feldman, Scott; +Cc: Jeff Garzik, netdev, ricardoz
On Fri, 2003-09-12 at 01:13, Feldman, Scott wrote:
> > Feldman, Scott wrote:
> > > * Change the default number of Tx descriptors from 256 to 1024.
> > > Data from [ricardoz@us.ibm.com] shows it's easy to overrun
> > > the Tx desc queue.
> >
> >
> > All e1000 patches applied except this one.
> >
> > You're just wasting memory.
>
> 256 descriptors does sound like enough, but what about the fragmented
> skb case? MAX_SKB_FRAGS is (64K/4K + 2) = 18, and each fragment gets
> mapped to one descriptor. It even gets worse with e1000 because we may
> > need to split a fragment into two fragments in the driver to work around
> hardware errata. :-(
>
> It would be interesting to see the frags:skb ratio for 2.6 with TSO
> enabled and disabled with Rick's test. So our effective number of
> descriptors needs to be adjusted by that ratio. I agree with David that
> it's wasteful for the device driver to worry about more than
> dev->tx_queue_len, but that's in skb units, not descriptor units.
>
Ok, i overlooked the frags part.
Donald Becker is the man behind setting the # of descriptors to either
32 or 64 for 10/100. I think i saw some email from him once on how he
reached the conclusion to choose those numbers. Note, this was before
zero copy tx and skb frags. Someone needs to talk to Donald and come up
with values that make more sense for Gige and skb frags. It would be
nice to see how the numbers are derived.
> On the other hand, if we're always running the descriptor ring near
> empty, we've got other problems. It stands to reason that it doesn't
> matter how big the ring is if we're in that situation. If the CPU can
> overrun the device, expanding the queues between the CPU and the device
> may help with bursts but gets you nothing for a sustained load.
>
Well, there's only one way out of that device ;-> and it goes out at a max
rate of Gige. If you have sustained incoming rates from the CPU(s) of
greater than Gige, then you are fucked anyways and you are better off
dropping at the scheduler queue.
> I flunked that queuing theory class anyway, so what do I know? Every
> time I get stuck in a traffic slug on the freeway, I think about that
> class. Hey, that means my car is like an skb, so maybe longer roads
> would help? Not!
Note we do return an indication that the packet was dropped. What you do
with that information is relative. TCP makes use of it in the kernel
which makes sense. UDP congestion control is mostly under the influence
of the UDP app in user space. The impedance between user space and
kernel makes that info useless to the UDP app especially in cases when
the system is overloaded (which is where this matters most). This is of
course theory and someone who really wants to find out should
experiment. I would be pleasantly shocked if it turned out the info to
the UDP app was useful. An interesting thing to try , which violates
UDP, is to have UDP requeue a packet back to the socket queue in the
kernel every time an indication is received that the scheduler queue
dropped the packet. User space by virtue of UDP sock queue not emptying
should find out soon and slow down.
All this is really speculation:
A UDP app that really cares about congestion should factor it from an end
to end perspective and use the big socket queues suggested to buffer
things.
To give an analogy with your car: if you only find out halfway later that
there was a red light a few meters back, then that info is useless. If
you don't get hit and reverse, you may find that in fact the light has
turned to green, which is again useless ;->
cheers,
jamal
^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 12:44 ` jamal
@ 2003-09-12 15:29 ` Donald Becker
2003-09-12 17:44 ` Ricardo C Gonzalez
2003-09-15 11:37 ` jamal
2003-09-12 18:12 ` Ben Greear
1 sibling, 2 replies; 38+ messages in thread
From: Donald Becker @ 2003-09-12 15:29 UTC (permalink / raw)
To: jamal; +Cc: Feldman, Scott, Jeff Garzik, netdev, ricardoz
On 12 Sep 2003, jamal wrote:
> On Fri, 2003-09-12 at 01:13, Feldman, Scott wrote:
> > > Feldman, Scott wrote:
> > > > * Change the default number of Tx descriptors from 256 to 1024.
> > > > Data from [ricardoz@us.ibm.com] shows it's easy to overrun
> > > > the Tx desc queue.
> > >
> > > You're just wasting memory.
> >
> > 256 descriptors does sound like enough, but what about the fragmented
> > skb case? MAX_SKB_FRAGS is (64K/4K + 2) = 18, and each fragment gets
> > mapped to one descriptor.
..
> Ok, i overlooked the frags part.
> Donald Becker is the man behind setting the # of descriptors to either
> 32 or 64 for 10/100. I think i saw some email from him once on how he
> reached the conclusion to choose those numbers.
The number varies between 10 and 16 skbuffs for 100Mbps, with a typical
Tx descriptor ring of 16 elements.
I spent much time instrumenting and measuring the behavior of the
drivers. I found that with 100Mbps, between four and seven Tx skbuffs
were sufficient to keep the hardware Tx queue from emptying while there
were still packets in the software Tx queue. Adding more neither
avoided gaps nor obviously improved the overall cache miss count.
With early Gb cards, Yellowfins, the numbers didn't increase nearly as
much as I expected. I concluded:
- At the time we were CPU limited generating tinygrams, thus the queue
connection performance only mattered for full sized packets.
- Increasing the link speed does not increase the tinygram series length.
A series of tinygrams is worst-case for the queue connection, since
the hardware can dispatch them almost as quickly as the queue fills.
But we don't get more tinygrams in a row with a faster link.
I could write extensively about the other parameters and designs that I
(mostly) confirmed and (occasionally) discovered. They would be easy to
attack as being based on old hardware, but I believe that relative
numbers have changed little. And I haven't seen any well-considered
analysis that would indicate 1000 driver queue entries is significantly
better than 20 in general purpose use.
> Note, this was before zero copy tx and skb frags.
[[ Grumble about mbufs and designing for a PDP omitted. ]]
Most users will see unfragmented skbuffs.
Even with fragmented skbuffs, you should average fewer than 2 frags per
skbuff. With 9KB jumbo frames that might increase slightly. Has anyone
measured more? Or measured at all?
Remember the purpose of the driver (hardware) Tx queue.
It should be the minimum size consistent with
Keeping the wire busy(1) when we have queued packets
Cache locality(2) when queueing Tx packets
Allowing interrupt mitigation
It is *not* supposed to act as an additional buffer in the system.
Unlike the queue layer, it cannot be changed dynamically, classify
packets, re-order based on priority or do any of the future clever
improvements possible with the general-purpose, device-independent
software queue.
(1) A short gap with atypical traffic is entirely acceptable
(2) Cache locality isn't important because we need local performance. It's
important to minimize cache line displacement for the rest of the
system, especially the application.
--
Donald Becker becker@scyld.com
Scyld Computing Corporation http://www.scyld.com
914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 15:29 ` Donald Becker
@ 2003-09-12 17:44 ` Ricardo C Gonzalez
2003-09-15 11:37 ` jamal
1 sibling, 0 replies; 38+ messages in thread
From: Ricardo C Gonzalez @ 2003-09-12 17:44 UTC (permalink / raw)
To: Donald Becker; +Cc: hadi, scott.feldman, jgarzik, netdev
Comment from Herman:
GigE has been around for a number of years now, and when it first came out
the CPUs were quite slow compared to the adapter.
Now processor speeds have continued to move ahead, and with N GHz-plus
class processors on an SMP they can clearly feed a GigE adapter at a
high rate.
The problem is you have to decide how many TCP connections you want to
support.
Each connection can be sending up to the TCP window size (based on the
receiver's socket recv space). This is typically 64K in a GigE environment
these days. Each individual TCP session only knows about its window size.
Thus with N of them active, you can overrun the transmit queue(s) (HW and
SW Q's).
It is not nice to have TCP do all the work of building up N packets and then
drop them on the floor simply because of a queue limit.
I mean the buffers are ALREADY built; now they must be discarded, and TCP
must time out and retransmit.
Customers can have lots of TCP connections on these SMP servers and they
expect the GigE to perform.
I don't think they care about the minimal space needed for the transmit
queue descriptors. They have a lot of money in high-performance network gear
(switches, etc.) and dropping packets at the sender is just plain bad.
If you don't want to increase the hardware queue, then at least increase
the software queue, which requires no space.
Thanks, Herman
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 12:44 ` jamal
2003-09-12 15:29 ` Donald Becker
@ 2003-09-12 18:12 ` Ben Greear
2003-09-12 18:31 ` Ricardo C Gonzalez
2003-09-15 11:29 ` jamal
1 sibling, 2 replies; 38+ messages in thread
From: Ben Greear @ 2003-09-12 18:12 UTC (permalink / raw)
To: hadi; +Cc: Feldman, Scott, Jeff Garzik, netdev, ricardoz
jamal wrote:
> On Fri, 2003-09-12 at 01:13, Feldman, Scott wrote:
>>On the other hand, if we're always running the descriptor ring near
>>empty, we've got other problems. It stands to reason that it doesn't
>>matter how big the ring is if we're in that situation. If the CPU can
>>overrun the device, expanding the queues between the CPU and the device
>>may help with bursts but gets you nothing for a sustained load.
>>
>
>
> Well, there's only one way out of that device ;-> and it goes out at a max
> rate of Gige. If you have sustained incoming rates from the CPU(s) of
> greater than Gige, then you are fucked anyways and you are better off
> dropping at the scheduler queue.
I have seen greater packets-per-second throughput when I increase the TxDescriptor
ring (and RxDescriptor ring) when using pktgen, which checks for enqueue errors
and re-queues as needed. So, it could help the case where we are running
at very high sustained speeds (or high packets-per-second rates).
>>I flunked that queuing theory class anyway, so what do I know? Every
>>time I get stuck in a traffic slug on the freeway, I think about that
>>class. Hey, that means my car is like an skb, so maybe longer roads
>>would help? Not!
>
>
>
> Note we do return an indication that the packet was dropped. What you do
> with that information is relative. TCP makes use of it in the kernel
> which makes sense. UDP congestion control is mostly under the influence
> of the UDP app in user space. The impedance between user space and
> kernel makes that info useless to the UDP app especially in cases when
> the system is overloaded (which is where this matters most). This is of
> course theory and someone who really wants to find out should
> experiment. I would be pleasantly shocked if it turned out the info to
> the UDP app was useful. An interesting thing to try , which violates
> UDP, is to have UDP requeue a packet back to the socket queue in the
> kernel every time an indication is received that the scheduler queue
> dropped the packet. User space by virtue of UDP sock queue not emptying
> should find out soon and slow down.
Um, I doubt the UDP protocol says you MUST drop packets when you reach
congestion...it just says that you _CAN_ drop the packet. Slowing
down user-space is exactly what you want to do in this case because it
saves user-space CPU, and it saves the user-space program from having
to deal (so often) with dropped packets.
Already, if the socket queue is full, poll/select will block, you'll get
-EBUSY returned, and/or your application will block on a wait queue....
Any of these allow the user space program to immediately back off,
saving the whole system work.
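A minimal sketch of that userspace back-off for a non-blocking UDP socket
(a full send buffer shows up as EAGAIN/EWOULDBLOCK):

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Retry a send only after poll() says the socket is writable again,
 * instead of hammering an already-full queue. */
static ssize_t send_with_backoff(int fd, const void *buf, size_t len)
{
        for (;;) {
                ssize_t n = send(fd, buf, len, 0);

                if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
                        return n;

                struct pollfd pfd = { .fd = fd, .events = POLLOUT };

                poll(&pfd, 1, -1);      /* block until writable */
        }
}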
> All this is really speculation:
> A UDP app that really cares about congestion should factor it from an end
> to end perspective and use the big socket queues suggested to buffer
> things.
Big socket queues can cause your machine to over-run the scheduler queue,
if I understand Dave right. And lots of big queues everywhere can cause
your machine to OOM and lock up completely (see another recent thread).
>
> To give an analogy with your car: if you only find out halfway later that
> there was a red light a few meters back, then that info is useless. If
> you don't get hit and reverse, you may find that in fact the light has
> turned to green, which is again useless ;->
So much better to have stopped the car earlier and kept him out of the
intersection in the first place :)
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 18:12 ` Ben Greear
@ 2003-09-12 18:31 ` Ricardo C Gonzalez
2003-09-15 11:29 ` jamal
1 sibling, 0 replies; 38+ messages in thread
From: Ricardo C Gonzalez @ 2003-09-12 18:31 UTC (permalink / raw)
To: Ben Greear; +Cc: hadi, Feldman, Scott, Jeff Garzik, netdev
Ben Greear wrote:
>I have seen greater packets-per-second throughput when I increase the TxDescriptor
>ring (and RxDescriptor ring) when using pktgen, which checks for enqueue errors
>and re-queues as needed. So, it could help the case where we are running
>at very high sustained speeds (or high packets-per-second rates).
This is exactly the type of test I run (TCP_STREAM tests at very high
sustained speeds), and I saw a very good performance improvement when
increasing the TxDescriptors to at least 1024.
I ran a network benchmark; in this case the data I gathered was on 4-way,
1.45 GHz systems, running the benchmark on only one adapter per machine,
point-to-point.
regards,
----------------------------------------------------------------------------------
*** ALWAYS THINK POSITIVE ***
Rick Gonzalez
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 1:34 ` jamal
2003-09-12 2:20 ` Ricardo C Gonzalez
@ 2003-09-13 3:49 ` David S. Miller
2003-09-13 11:52 ` Robert Olsson
2003-09-14 19:08 ` Ricardo C Gonzalez
1 sibling, 2 replies; 38+ messages in thread
From: David S. Miller @ 2003-09-13 3:49 UTC (permalink / raw)
To: hadi; +Cc: greearb, jgarzik, scott.feldman, netdev, ricardoz
On 11 Sep 2003 21:34:23 -0400
jamal <hadi@cyberus.ca> wrote:
> don't increase the tx descriptor ring size - that would truly be wasting
> memory; 256 is pretty adequate.
> * increase instead the txqueuelen (as suggested by Davem); user space
> tools like ip or ifconfig could do it. The standard size has been around
> 100 for 100Mbps; i suppose it is fair to say that Gige can move data out
> at 10x that; so set it to 1000. Maybe you can do this from the driver
> based on what negotiated speed is detected?
I spoke with Alexey once about this, actually tx_queue_len can
be arbitrarily large but it should be reasonable nonetheless.
Our preliminary conclusions were that values of 1000 for 100Mbit and
faster were probably appropriate. Maybe something larger for 1Gbit,
who knows.
We also determined that the only connection between TX descriptor
ring size and dev->tx_queue_len was that the latter should be large
enough to handle, at a minimum, the amount of pending TX descriptor
ACKs that can be pending considering mitigation et al.
So if TX irq mitigation can defer up to N TX descriptor completions
then dev->tx_queue_len must be at least that large.
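Purely illustrative numbers: if TX interrupt mitigation lets up to 64 sent
descriptors sit unreclaimed before an interrupt fires, a tx_queue_len of
1000 easily absorbs the skbs that back up in the meantime, while a
tx_queue_len smaller than 64 could start dropping for no good reason.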
Back to the main topic, maybe we should set dev->tx_queue_len to
1000 by default for all ethernet devices.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-13 3:49 ` David S. Miller
@ 2003-09-13 11:52 ` Robert Olsson
2003-09-15 12:12 ` jamal
2003-09-14 19:08 ` Ricardo C Gonzalez
1 sibling, 1 reply; 38+ messages in thread
From: Robert Olsson @ 2003-09-13 11:52 UTC (permalink / raw)
To: David S. Miller; +Cc: hadi, greearb, jgarzik, scott.feldman, netdev, ricardoz
David S. Miller writes:
> On 11 Sep 2003 21:34:23 -0400
> jamal <hadi@cyberus.ca> wrote:
>
> > don't increase the tx descriptor ring size - that would truly be wasting
> > memory; 256 is pretty adequate.
> > * increase instead the txqueuelen (as suggested by Davem); user space
> > tools like ip or ifconfig could do it. The standard size has been around
> > 100 for 100Mbps; i suppose it is fair to say that Gige can move data out
> > at 10x that; so set it to 1000. Maybe you can do this from the driver
> > based on what negotiated speed is detected?
>
> I spoke with Alexey once about this, actually tx_queue_len can
> be arbitrarily large but it should be reasonable nonetheless.
>
> Our preliminary conclusions were that values of 1000 for 100Mbit and
> faster were probably appropriate. Maybe something larger for 1Gbit,
> who knows.
>
> We also determined that the only connection between TX descriptor
> ring size and dev->tx_queue_len was that the latter should be large
> enough to handle, at a minimum, the amount of pending TX descriptor
> ACKs that can be pending considering mitigation et al.
>
> So if TX irq mitigation can defer up to N TX descriptor completions
> then dev->tx_queue_len must be at least that large.
>
> Back to the main topic, maybe we should set dev->tx_queue_len to
> 1000 by default for all ethernet devices.
Hello!
Yes, sounds like an adequate setting for GigE. This is what we use for production
and lab, but rather than increasing dev->tx_queue_len to 1000 we replace the
pfifo_fast qdisc with the pfifo qdisc, setting a qlen of 1000.
And with that we have a Tx descriptor ring size of 256, which is tuned to the NIC's
"TX service interval" with respect to interrupt mitigation etc. This seems
good enough even for small packets.
For routers this setting is even more crucial as we need to serialize
several flows and we know the flows are bursty.
Cheers.
--ro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-13 3:49 ` David S. Miller
2003-09-13 11:52 ` Robert Olsson
@ 2003-09-14 19:08 ` Ricardo C Gonzalez
2003-09-15 2:50 ` David Brownell
1 sibling, 1 reply; 38+ messages in thread
From: Ricardo C Gonzalez @ 2003-09-14 19:08 UTC (permalink / raw)
To: David S. Miller; +Cc: hadi, greearb, jgarzik, scott.feldman, netdev
David Miller wrote:
>Back to the main topic, maybe we should set dev->tx_queue_len to
>1000 by default for all ethernet devices.
I definitely agree with setting the dev->tx_queue_len to 1000 as a default
for all ethernet adapters. All adapters will benefit from this change.
regards,
----------------------------------------------------------------------------------
*** ALWAYS THINK POSITIVE ***
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-14 19:08 ` Ricardo C Gonzalez
@ 2003-09-15 2:50 ` David Brownell
2003-09-15 8:17 ` David S. Miller
0 siblings, 1 reply; 38+ messages in thread
From: David Brownell @ 2003-09-15 2:50 UTC (permalink / raw)
To: Ricardo C Gonzalez, David S. Miller
Cc: hadi, greearb, jgarzik, scott.feldman, netdev
Ricardo C Gonzalez wrote:
>
> David Miller wrote:
>
>
>>Back to the main topic, maybe we should set dev->tx_queue_len to
>>1000 by default for all ethernet devices.
>
>
>
> I definitely agree with setting the dev->tx_queue_len to 1000 as a default
> for all ethernet adapters. All adapters will benefit from this change.
Except ones where CONFIG_EMBEDDED, maybe? Not everyone wants
to spend that much memory, even when it's available...
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-15 2:50 ` David Brownell
@ 2003-09-15 8:17 ` David S. Miller
0 siblings, 0 replies; 38+ messages in thread
From: David S. Miller @ 2003-09-15 8:17 UTC (permalink / raw)
To: David Brownell; +Cc: ricardoz, hadi, greearb, jgarzik, scott.feldman, netdev
On Sun, 14 Sep 2003 19:50:56 -0700
David Brownell <david-b@pacbell.net> wrote:
> Except ones where CONFIG_EMBEDDED, maybe? Not everyone wants
> to spend that much memory, even when it's available...
Dropping the packet between the network stack and the driver
does waste memory for _LONGER_ periods of time.
When we drop, TCP still hangs onto the buffer, and we'll send
it again and again until it makes it and we get an ACK back
or the connection completely times out.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 18:12 ` Ben Greear
2003-09-12 18:31 ` Ricardo C Gonzalez
@ 2003-09-15 11:29 ` jamal
1 sibling, 0 replies; 38+ messages in thread
From: jamal @ 2003-09-15 11:29 UTC (permalink / raw)
To: Ben Greear; +Cc: Feldman, Scott, Jeff Garzik, netdev, ricardoz
On Fri, 2003-09-12 at 14:12, Ben Greear wrote:
> >
> > Well, there's only one way out of that device ;-> and it goes out at a max
> > rate of Gige. If you have sustained incoming rates from the CPU(s) of
> > greater than Gige, then you are fucked anyways and you are better off
> > dropping at the scheduler queue.
>
> I have seen greater packets-per-second throughput when I increase
> the TxDescriptor
> ring (and RxDescriptor ring) when using pktgen, which checks for enqueue
> errors
> and re-queues as needed. So, it could help the case where we are running
> at very high sustained speeds (or high packets-per-second rates).
>
Have you tried increasing the s/ware queue instead? ;-> That's been
mentioned in about the last 10 posts.
Even though you care about what you refer to as "very high sustained
speeds" type of apps, others may not. In fact i think the majority may
not:
Think of that poor ssh login packet queued behind 999 ftp packets in
the s/ware queue, which is also above another 1000 ftp packets in the TX
DMA path.
Whatever happened to good engineering such as the post from Donald?
> >>I flunked that queuing theory class anyway, so what do I know? Every
> >>time I get stuck in a traffic slug on the freeway, I think about that
> >>class. Hey, that means my car is like an skb, so maybe longer roads
> >>would help? Not!
> >
> >
> >
> > Note we do return an indication that the packet was dropped. What you do
> > with that information is relative. TCP makes use of it in the kernel
> > which makes sense. UDP congestion control is mostly under the influence
> > of the UDP app in user space. The impedance between user space and
> > kernel makes that info useless to the UDP app especially in cases when
> > the system is overloaded (which is where this matters most). This is of
> > course theory and someone who really wants to find out should
> > experiment. I would be pleasantly shocked if it turned out the info to
> > the UDP app was useful. An interesting thing to try , which violates
> > UDP, is to have UDP requeue a packet back to the socket queue in the
> > kernel every time an indication is received that the scheduler queue
> > dropped the packet. User space by virtue of UDP sock queue not emptying
> > should find out soon and slow down.
>
> Um, I doubt the UDP protocol says you MUST drop packets when you reach
> congestion...it just says that you _CAN_ drop the packet.
Which UDP spec did you read? ;-> Try to do voice or any "realtime" apps,
which use UDP precisely because it doesn't do what you described.
> Slowing
> down user-space is exactly what you want to do in this case because it
> saves user-space CPU, and it saves the user-space program from having
> to deal (so often) with dropped packets.
>
>
> Already, if the socket queue is full, poll/select will block, you'll get
> -EBUSY returned, and/or your application will block on a wait queue....
> Any of these allow the user space program to immediately back off,
> saving the whole system work.
>
[handwaving deleted]
Why don't you try an experiment to show that you can pass the signal to
user space?
> >
> > To give an analogy with your car: if you only find out halfway later that
> > there was a red light a few meters back, then that info is useless. If
> > you don't get hit and reverse, you may find that in fact the light has
> > turned to green, which is again useless ;->
>
> So much better to have stopped the car earlier and kept him out of the
> intersection in the first place :)
>
Try that experiment friend and see if you could have stopped the car in
time;->
cheers,
jamal
^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-12 15:29 ` Donald Becker
2003-09-12 17:44 ` Ricardo C Gonzalez
@ 2003-09-15 11:37 ` jamal
1 sibling, 0 replies; 38+ messages in thread
From: jamal @ 2003-09-15 11:37 UTC (permalink / raw)
To: Donald Becker; +Cc: Feldman, Scott, Jeff Garzik, netdev, ricardoz
On Fri, 2003-09-12 at 11:29, Donald Becker wrote:
> Remember the purpose of the driver (hardware) Tx queue.
> It should be the minimum size consistent with
> Keeping the wire busy(1) when we have queued packets
> Cache locality(2) when queueing Tx packets
> Allowing interrupt mitigation
> It is *not* supposed to act as an additional buffer in the system.
> Unlike the queue layer, it cannot be changed dynamically, classify
> packets, re-order based on priority or do any of the future clever
> improvements possible with the general-purpose, device-independent
> software queue.
>
> (1) A short gap with atypical traffic is entirely acceptable
>
> (2) Cache locality isn't important because we need local performance. It's
> important to minimize cache line displacement for the rest of the
> system, especially the application.
Don't know how much time you have, but this would be a good paper if you
wrote one. Since you have already gathered data "in the days when CPUs
were slow", you can complement it with newer data. If you don't have
time, i am sure there are people who will be interested in collecting
data for you.
cheers,
jamal
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-13 11:52 ` Robert Olsson
@ 2003-09-15 12:12 ` jamal
2003-09-15 13:45 ` Robert Olsson
0 siblings, 1 reply; 38+ messages in thread
From: jamal @ 2003-09-15 12:12 UTC (permalink / raw)
To: Robert Olsson
Cc: David S. Miller, greearb, jgarzik, scott.feldman, netdev,
ricardoz
On Sat, 2003-09-13 at 07:52, Robert Olsson wrote:
> >
> > I spoke with Alexey once about this, actually tx_queue_len can
> > be arbitrarily large but it should be reasonable nonetheless.
> >
> > Our preliminary conclusions were that values of 1000 for 100Mbit and
> > faster were probably appropriate. Maybe something larger for 1Gbit,
> > who knows.
If you recall we saw that even for the gent who was trying to do 100K
TCP sockets on a 4 way SMP, 1000 was sufficient and no packets were
dropped.
> >
> > We also determined that the only connection between TX descriptor
> > ring size and dev->tx_queue_len was that the latter should be large
> > enough to handle, at a minimum, the amount of pending TX descriptor
> > ACKs that can be pending considering mitigation et al.
> >
> > So if TX irq mitigation can defer up to N TX descriptor completions
> > then dev->tx_queue_len must be at least that large.
> >
> > Back to the main topic, maybe we should set dev->tx_queue_len to
> > 1000 by default for all ethernet devices.
>
> Hello!
>
> Yes sounds like adequate setting for GIGE. This is what use for production
> and lab but rather than increasing dev->tx_queue_len to 1000 we replace the
> pfifo_fast with the pfifo qdisc w. setting a qlen of 1000.
>
I think this may not be good for QoS reasons. You want BGP packets
to be given priority over ftp. A single queue kills that.
The current default 3-band queue is good enough, the only challenge
being that no one sees stats for it. I have a patch for the kernel at:
http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel
and for tc at:
http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc
cheers,
jamal
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-15 12:12 ` jamal
@ 2003-09-15 13:45 ` Robert Olsson
2003-09-15 23:15 ` David S. Miller
0 siblings, 1 reply; 38+ messages in thread
From: Robert Olsson @ 2003-09-15 13:45 UTC (permalink / raw)
To: hadi
Cc: Robert Olsson, David S. Miller, greearb, jgarzik, scott.feldman,
netdev, ricardoz
jamal writes:
> I think this may not be good for QoS reasons. You want BGP packets
> to be given priority over ftp. A single queue kills that.
Well, so far a single queue has been robust enough for BGP sessions. Talking
from my own experience...
> The current default 3-band queue is good enough, the only shortcoming
> being that no one sees stats for it. I have a patch for the kernel at:
> http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel
> and for tc at:
> http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc
Yes.
I had missed this. Our lazy work-around for the missing stats is to install
the pfifo qdisc, as mentioned. IMO it should be included.
Cheers.
--ro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-15 13:45 ` Robert Olsson
@ 2003-09-15 23:15 ` David S. Miller
2003-09-16 9:28 ` Robert Olsson
0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-09-15 23:15 UTC (permalink / raw)
To: Robert Olsson
Cc: hadi, Robert.Olsson, greearb, jgarzik, scott.feldman, netdev,
ricardoz
On Mon, 15 Sep 2003 15:45:42 +0200
Robert Olsson <Robert.Olsson@data.slu.se> wrote:
> > The current default 3-band queue is good enough, the only shortcoming
> > being that no one sees stats for it. I have a patch for the kernel at:
> > http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel
> > and for tc at:
> > http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc
>
> Yes.
> I had missed this. Our lazy work-around for the missing stats is to install
> the pfifo qdisc, as mentioned. IMO it should be included.
I've included Jamal's pfifo_fast statistic patch, and the
change to increase ethernet's tx_queue_len to 1000 in all
of my trees.
Thanks.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2003-09-15 23:15 ` David S. Miller
@ 2003-09-16 9:28 ` Robert Olsson
0 siblings, 0 replies; 38+ messages in thread
From: Robert Olsson @ 2003-09-16 9:28 UTC (permalink / raw)
To: David S. Miller
Cc: kuznet, Robert Olsson, hadi, greearb, jgarzik, scott.feldman,
netdev, ricardoz
David S. Miller writes:
> > > http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel
> > > and for tc at:
> > > http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc
>
> I've included Jamal's pfifo_fast statistic patch, and the
> change to increase ethernet's tx_queue_len to 1000 in all
> of my trees.
Thanks. We'll ask Alexey to include the tc part too.
Cheers.
--ro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
[not found] <Pine.LNX.4.58.0405141430340.4622@fcat>
@ 2004-05-18 14:34 ` Ricardo C Gonzalez
2004-06-02 19:11 ` Marc Herbert
0 siblings, 1 reply; 38+ messages in thread
From: Ricardo C Gonzalez @ 2004-05-18 14:34 UTC (permalink / raw)
To: Marc Herbert; +Cc: David S. Miller, netdev
Marc,
Are you considering the case of small packets? Many applications use
lots of small packets; an example is the Volano benchmark.
You need to look at throughput rates for small packets. A 1 Gb Ethernet
link can send something like 1.4 million 64-byte packets per second.
Let me know what you think.
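For reference, the arithmetic behind that packet rate, assuming minimum-size
frames plus the standard Ethernet preamble and inter-frame gap (the constants
are textbook Ethernet figures, not numbers taken from this thread):

/* Back-of-envelope check of the "~1.4 million 64-byte packets per second"
 * figure for gigabit ethernet. */
#include <stdio.h>

int main(void)
{
        double link_bps = 1e9;
        double frame    = 64;           /* minimum Ethernet frame         */
        double overhead = 8 + 12;       /* preamble/SFD + inter-frame gap */

        double pps = link_bps / ((frame + overhead) * 8);
        printf("theoretical max: %.2f Mpps\n", pps / 1e6);  /* ~1.49 Mpps */
        return 0;
}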
regards,
---------------------------------------------------------
Rick Gonzalez
LTC pSeries Performance
Building: 908 Office: 1D004
Phone: (512) 838-0623
Marc Herbert <marc.herbert@free.fr> wrote on 05/14/2004 09:16 AM:
To: Ricardo C Gonzalez/Austin/IBM@ibmus
cc: "David S. Miller" <davem@redhat.com>, netdev@oss.sgi.com
Subject: Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
On Sun, 14 Sep 2003, Ricardo C Gonzalez wrote:
> David Miller wrote:
>
> >Back to the main topic, maybe we should set dev->tx_queue_len to
> >1000 by default for all ethernet devices.
>
>
> I definitely agree with setting dev->tx_queue_len to 1000 as a default
> for all ethernet adapters. All adapters will benefit from this change.
>
Sorry to exhume this discussion, but I only recently discovered this
change, the hard way.
I carefully read this old thread and did not grasp every detail, but
there is one thing that I am sure of: 1000 packets @ 1 Gb/s looks
good, but on the other hand, 1000 full-size packets @ 10 Mb/s are
about 1.2 seconds long!
Too little buffering means not enough damping effect, which is very
important for performance in asynchronous systems. Granted. However,
_too much_ buffering means too big and too variable latencies. When
discussing buffers, duration is very often more important than size.
Applications and TCP's dynamics (and kernel dynamics too?) do not care
much about buffer sizes; they more often care about latencies (and
throughput, of course). Buffer sizes are often "just a small matter of
implementation" :-) People designing routers, for instance, talk about
buffers in _milliseconds_ much more often than in _bytes_ (despite the
fact that their memories cost more than in hosts, considering the
throughputs involved).
100 packets @ 100 Mb/s was 12 ms. 1000 packets @ 1 Gb/s is still
12 ms. 12 ms is great. It's a "good" latency because it is the
order of magnitude of real-world constants like: comfortable
interactive applications, operating system scheduler granularity or
propagation time in 2000 km of cable.
But 1000 packets @ 100 Mb/s is 120 ms and is neither very good nor
useful anymore. 1000 packets @ 10 Mb/s is 1.2 s, which is ridiculous.
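The arithmetic behind these figures, assuming 1000 full-size 1500-byte frames
as above (a minimal sketch, nothing more):

/* Queue drain time for a packet-counted queue:
 * time = packets * frame_bits / link_rate. */
#include <stdio.h>

int main(void)
{
        double frame_bits   = 1500.0 * 8;
        double pkts         = 1000.0;
        double rates_mbps[] = { 10, 100, 1000 };

        for (int i = 0; i < 3; i++) {
                double seconds = pkts * frame_bits / (rates_mbps[i] * 1e6);
                printf("%4.0f pkts @ %4.0f Mb/s -> %6.0f ms\n",
                       pkts, rates_mbps[i], seconds * 1e3);
        }
        return 0;
}

This reproduces the 12 ms, 120 ms and 1200 ms numbers above.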
It does mean that, when joe user is uploading some big file through
his cheap Ethernet card, and there are no other bottlenecks/drops
further in the network, every concurrent application will have to wait
1.2 s before accessing the network. Imagine now that some packet is
lost for whatever reason on some _other_ TCP connection going through
this terrible 1.2 s queue. Then you need one SACK/RTX extra round trip
time to recover from it: so it's now 2.4 s to deliver the data just
after the drop. Assuming of course TCP does not become confused by
this huge latency and probably huge jitter.
And I don't think you want to make fiddling with "tc" mandatory for
joe user.
I am unfortunately not familiar with this part of the linux kernel,
but I really think that, if possible, txqueuelen should be initialized
at "12 ms" and not at "1000 packets". I can imagine there are some
corner cases, like for instance when some GEth NIC is hot-plugged into
a 100 Mb/s link, but hey, those are corner cases. I think even a simple
constant-per-model txqueuelen initialization would already be great.
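A minimal sketch of what a "12 ms" initialization could look like, again
assuming full-size frames; the helper name and the clamp value are hypothetical
illustrations, not proposed kernel code:

/* Hypothetical helper: derive a packet-counted txqueuelen from a latency
 * budget, assuming full-size 1500-byte frames. */
#include <stdio.h>

static unsigned long txqueuelen_for_budget(unsigned long link_bps,
                                           unsigned long budget_ms)
{
        unsigned long frame_bits = 1500UL * 8;
        unsigned long pkts = link_bps / 1000 * budget_ms / frame_bits;

        return pkts < 16 ? 16 : pkts;   /* keep some sane minimum */
}

int main(void)
{
        printf("10 Mb/s : %lu\n", txqueuelen_for_budget(10000000UL, 12));   /* 16   */
        printf("100 Mb/s: %lu\n", txqueuelen_for_budget(100000000UL, 12));  /* 100  */
        printf("1 Gb/s  : %lu\n", txqueuelen_for_budget(1000000000UL, 12)); /* 1000 */
        return 0;
}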
Cheers,
Marc.
PS: one workaround for joe user against this 1.2s latency would be to
keep his SND_BUF and number of sockets small. But this is poor.
--
"Je n'ai fait cette lettre-ci plus longue que parce que je n'ai pas eu
le loisir de la faire plus courte." -- Blaise Pascal
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
2004-05-18 14:34 ` Ricardo C Gonzalez
@ 2004-06-02 19:11 ` Marc Herbert
0 siblings, 0 replies; 38+ messages in thread
From: Marc Herbert @ 2004-06-02 19:11 UTC (permalink / raw)
To: Ricardo C Gonzalez; +Cc: netdev
On Tue, 18 May 2004, Ricardo C Gonzalez wrote:
> Are you considering the case of small packets?
Good point: I was not, because it adds some complexity :-/
> Many applications use
> lots of small packets; an example is the Volano benchmark.
Well... isn't that example a bit extreme, rather than representative?
> You need to look at throughput rates for small packets. A 1 Gb Ethernet
> link can send something like 1.4 million 64-byte packets per second.
>
> Let me know what you think.
I said in some previous message that, from the point of view of IP
"applications" in the very broad sense (i.e., including TCP), the
txqueuelen should ideally be defined in milliseconds, in order to give
an upper bound on latency. I reiterate that. So, for a given Ethernet
link _speed_, this would translate into an ideal definition of
txqueuelen in _bytes_ and not in _packets_. The current approximation
(in packets) seems to assume that people wishing to send a whole lot
of small packets are rare, and can set the txqueuelen by themselves.
That seems sensible to me.
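A tiny sketch of that bytes-versus-packets distinction: an admission check
against a byte budget (derived from a latency target and the link rate) keeps
the worst-case drain time roughly constant regardless of packet size, whereas
a pure packet count lets full-size frames build up seconds of latency on slow
links while giving small-packet workloads only a fraction of the intended
budget. The structure and helper below are illustrative only.

/* Byte-budgeted admission check.
 * e.g. 1 Gb/s with a 12 ms budget: limit_bytes = 1e9 / 8 * 0.012 = 1.5 MB */
struct byte_fifo {
        unsigned long backlog_bytes;
        unsigned long limit_bytes;      /* latency budget * link rate / 8 */
};

static int byte_fifo_may_enqueue(const struct byte_fifo *q,
                                 unsigned int pkt_bytes)
{
        return q->backlog_bytes + pkt_bytes <= q->limit_bytes;
}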
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread
Thread overview: 38+ messages
2003-09-12 5:13 [e1000 2.6 10/11] TxDescriptors -> 1024 default Feldman, Scott
2003-09-12 12:44 ` jamal
2003-09-12 15:29 ` Donald Becker
2003-09-12 17:44 ` Ricardo C Gonzalez
2003-09-15 11:37 ` jamal
2003-09-12 18:12 ` Ben Greear
2003-09-12 18:31 ` Ricardo C Gonzalez
2003-09-15 11:29 ` jamal
[not found] <Pine.LNX.4.58.0405141430340.4622@fcat>
2004-05-18 14:34 ` Ricardo C Gonzalez
2004-06-02 19:11 ` Marc Herbert
[not found] <3F60DE5B.1010700@pobox.com>
2003-09-11 21:27 ` Ricardo C Gonzalez
-- strict thread matches above, loose matches on Subject: below --
2003-09-09 3:14 Feldman, Scott
2003-09-11 19:18 ` Jeff Garzik
2003-09-11 19:45 ` Ben Greear
2003-09-11 19:59 ` Jeff Garzik
2003-09-11 20:12 ` David S. Miller
2003-09-11 20:40 ` Ben Greear
2003-09-11 21:07 ` David S. Miller
2003-09-11 21:29 ` Ben Greear
2003-09-11 21:29 ` David S. Miller
2003-09-11 21:47 ` Ricardo C Gonzalez
2003-09-11 22:00 ` Jeff Garzik
2003-09-11 22:15 ` Ben Greear
2003-09-11 23:02 ` David S. Miller
2003-09-11 23:22 ` Ben Greear
2003-09-11 23:29 ` David S. Miller
2003-09-12 1:34 ` jamal
2003-09-12 2:20 ` Ricardo C Gonzalez
2003-09-12 3:05 ` jamal
2003-09-13 3:49 ` David S. Miller
2003-09-13 11:52 ` Robert Olsson
2003-09-15 12:12 ` jamal
2003-09-15 13:45 ` Robert Olsson
2003-09-15 23:15 ` David S. Miller
2003-09-16 9:28 ` Robert Olsson
2003-09-14 19:08 ` Ricardo C Gonzalez
2003-09-15 2:50 ` David Brownell
2003-09-15 8:17 ` David S. Miller