* [e1000 2.6 10/11] TxDescriptors -> 1024 default
@ 2003-09-09 3:14 Feldman, Scott
2003-09-11 19:18 ` Jeff Garzik
0 siblings, 1 reply; 38+ messages in thread
From: Feldman, Scott @ 2003-09-09 3:14 UTC (permalink / raw)
To: Jeff Garzik; +Cc: netdev, ricardoz
* Change the default number of Tx descriptors from 256 to 1024.
Data from [ricardoz@us.ibm.com] shows it's easy to overrun
the Tx desc queue.
-------------
diff -Nuarp linux-2.6.0-test4/drivers/net/e1000/e1000_param.c linux-2.6.0-test4/drivers/net/e1000.new/e1000_param.c
--- linux-2.6.0-test4/drivers/net/e1000/e1000_param.c 2003-08-22 16:57:59.000000000 -0700
+++ linux-2.6.0-test4/drivers/net/e1000.new/e1000_param.c 2003-09-08 09:13:12.000000000 -0700
@@ -63,9 +63,10 @@ MODULE_PARM_DESC(X, S);
/* Transmit Descriptor Count
*
* Valid Range: 80-256 for 82542 and 82543 gigabit ethernet controllers
- * Valid Range: 80-4096 for 82544
+ * Valid Range: 80-4096 for 82544 and newer
*
- * Default Value: 256
+ * Default Value: 256 for 82542 and 82543 gigabit ethernet controllers
+ * Default Value: 1024 for 82544 and newer
*/
E1000_PARAM(TxDescriptors, "Number of transmit descriptors");
@@ -73,7 +74,7 @@ E1000_PARAM(TxDescriptors, "Number of tr
/* Receive Descriptor Count
*
* Valid Range: 80-256 for 82542 and 82543 gigabit ethernet controllers
- * Valid Range: 80-4096 for 82544
+ * Valid Range: 80-4096 for 82544 and newer
*
* Default Value: 256
*/
@@ -200,6 +201,7 @@ E1000_PARAM(InterruptThrottleRate, "Inte
#define MAX_TXD 256
#define MIN_TXD 80
#define MAX_82544_TXD 4096
+#define DEFAULT_82544_TXD 1024
#define DEFAULT_RXD 256
#define MAX_RXD 256
@@ -320,12 +322,15 @@ e1000_check_options(struct e1000_adapter
struct e1000_option opt = {
.type = range_option,
.name = "Transmit Descriptors",
- .err = "using default of " __MODULE_STRING(DEFAULT_TXD),
- .def = DEFAULT_TXD,
.arg = { .r = { .min = MIN_TXD }}
};
struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
e1000_mac_type mac_type = adapter->hw.mac_type;
+ opt.err = mac_type < e1000_82544 ?
+ "using default of " __MODULE_STRING(DEFAULT_TXD) :
+ "using default of " __MODULE_STRING(DEFAULT_82544_TXD);
+ opt.def = mac_type < e1000_82544 ?
+ DEFAULT_TXD : DEFAULT_82544_TXD;
opt.arg.r.max = mac_type < e1000_82544 ?
MAX_TXD : MAX_82544_TXD;
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-09 3:14 [e1000 2.6 10/11] TxDescriptors -> 1024 default Feldman, Scott @ 2003-09-11 19:18 ` Jeff Garzik 2003-09-11 19:45 ` Ben Greear 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2003-09-11 19:18 UTC (permalink / raw) To: Feldman, Scott; +Cc: netdev, ricardoz Feldman, Scott wrote: > * Change the default number of Tx descriptors from 256 to 1024. > Data from [ricardoz@us.ibm.com] shows it's easy to overrun > the Tx desc queue. All e1000 patches applied except this one. Of _course_ it's easy to overrun the Tx desc queue. That's why we have a TX queue sitting on top of the NIC's hardware queue. And TCP socket buffers on top of that. And similar things. Descriptor increases like this are usually the result of some sillyhead blasting out UDP packets, and then wondering why he sees packet loss on the local computer (the "blast out packets" side). You're just wasting memory. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 19:18 ` Jeff Garzik @ 2003-09-11 19:45 ` Ben Greear 2003-09-11 19:59 ` Jeff Garzik 2003-09-11 20:12 ` David S. Miller 0 siblings, 2 replies; 38+ messages in thread From: Ben Greear @ 2003-09-11 19:45 UTC (permalink / raw) To: Jeff Garzik; +Cc: Feldman, Scott, netdev, ricardoz Jeff Garzik wrote: > Feldman, Scott wrote: > >> * Change the default number of Tx descriptors from 256 to 1024. >> Data from [ricardoz@us.ibm.com] shows it's easy to overrun >> the Tx desc queue. > > > > All e1000 patches applied except this one. > > Of _course_ it's easy to overrun the Tx desc queue. That's why we have > a TX queue sitting on top of the NIC's hardware queue. And TCP socket > buffers on top of that. And similar things. > > Descriptor increases like this are usually the result of some sillyhead > blasting out UDP packets, and then wondering why he sees packet loss on > the local computer (the "blast out packets" side). Erm, shouldn't the local machine back itself off if the various queues are full? Some time back I looked through the code and it appeared to. If not, I think it should. > > You're just wasting memory. > > Jeff > > > -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 19:45 ` Ben Greear @ 2003-09-11 19:59 ` Jeff Garzik 2003-09-11 20:12 ` David S. Miller 1 sibling, 0 replies; 38+ messages in thread From: Jeff Garzik @ 2003-09-11 19:59 UTC (permalink / raw) To: Ben Greear; +Cc: Feldman, Scott, netdev, ricardoz Ben Greear wrote: > Jeff Garzik wrote: > >> Feldman, Scott wrote: >> >>> * Change the default number of Tx descriptors from 256 to 1024. >>> Data from [ricardoz@us.ibm.com] shows it's easy to overrun >>> the Tx desc queue. >> >> >> >> >> All e1000 patches applied except this one. >> >> Of _course_ it's easy to overrun the Tx desc queue. That's why we >> have a TX queue sitting on top of the NIC's hardware queue. And TCP >> socket buffers on top of that. And similar things. >> >> Descriptor increases like this are usually the result of some >> sillyhead blasting out UDP packets, and then wondering why he sees >> packet loss on the local computer (the "blast out packets" side). > > > Erm, shouldn't the local machine back itself off if the various > queues are full? Some time back I looked through the code and it > appeared to. If not, I think it should. Given the guarantees of the protocol, the net stack has the freedom to drop UDP packets, for example at times when (for TCP) one would otherwise queue a packet for retransmit. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
  2003-09-11 19:45         ` Ben Greear
  2003-09-11 19:59           ` Jeff Garzik
@ 2003-09-11 20:12           ` David S. Miller
  2003-09-11 20:40             ` Ben Greear
  1 sibling, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-09-11 20:12 UTC (permalink / raw)
  To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz

On Thu, 11 Sep 2003 12:45:55 -0700
Ben Greear <greearb@candelatech.com> wrote:

> Erm, shouldn't the local machine back itself off if the various
> queues are full?  Some time back I looked through the code and it
> appeared to.  If not, I think it should.

Generic networking device queues drop when they overflow.

Whatever dev->tx_queue_len is set to, the device driver needs
to be prepared to be able to queue successfully.

Most people run into problems when they run stupid UDP applications
that send a stream of tinygrams (<~64 bytes).  The solutions are to
either fix the UDP app or restrict its socket send buffer size.

^ permalink raw reply [flat|nested] 38+ messages in thread
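The workaround suggested above for a blasting UDP application, restricting its socket send buffer, amounts to a single setsockopt() call. A minimal user-space sketch; the 64 KB value is only an illustration, not a figure recommended anywhere in this thread:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int sndbuf = 64 * 1024;		/* illustrative cap, tune as needed */

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	/* Limit how much unsent data this socket may have in flight; once
	 * the limit is reached, a blocking sendto() sleeps until earlier
	 * datagrams have been freed (transmitted by the NIC or dropped),
	 * instead of flooding the qdisc and the Tx descriptor ring. */
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		perror("setsockopt(SO_SNDBUF)");

	/* ... the application's sendto() loop goes here ... */
	return 0;
}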
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
  2003-09-11 20:12           ` David S. Miller
@ 2003-09-11 20:40             ` Ben Greear
  2003-09-11 21:07               ` David S. Miller
  0 siblings, 1 reply; 38+ messages in thread
From: Ben Greear @ 2003-09-11 20:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: jgarzik, scott.feldman, netdev, ricardoz

David S. Miller wrote:

> Generic networking device queues drop when they overflow.
>
> Whatever dev->tx_queue_len is set to, the device driver needs
> to be prepared to be able to queue successfully.
>
> Most people run into problems when they run stupid UDP applications
> that send a stream of tinygrams (<~64 bytes).  The solutions are to
> either fix the UDP app or restrict its socket send buffer size.

Is this close to how it works?

So, assume we configure a 10MB socket send queue on our UDP socket...

Select says it's writable up to at least 5MB.

We write 5MB of 64-byte packets "right now".

Did we just drop a large number of packets?

I would expect that the packets, up to 10MB, are buffered in some
list/fifo in the socket code, and that as the underlying device queue
empties itself, the socket will feed it more packets.  The device queue,
in turn, is emptied as the driver is able to fill its TxDescriptors,
and the hardware empties the TxDescriptors.

Obviously, I'm confused somewhere....

Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 20:40 ` Ben Greear @ 2003-09-11 21:07 ` David S. Miller 2003-09-11 21:29 ` Ben Greear 0 siblings, 1 reply; 38+ messages in thread From: David S. Miller @ 2003-09-11 21:07 UTC (permalink / raw) To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz On Thu, 11 Sep 2003 13:40:44 -0700 Ben Greear <greearb@candelatech.com> wrote: > So, assume we configure a 10MB socket send queue on our UDP socket... > > Select says its writable up to at least 5MB. > > We write 5MB of 64byte packets "righ now". > > Did we just drop a large number of packets? Yes, we did _iff_ dev->tx_queue_len is less than or equal to (5MB / (64 + sizeof(udp_id_headers))). ^ permalink raw reply [flat|nested] 38+ messages in thread
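To put rough numbers on that bound, here is a throwaway calculation assuming 64-byte payloads and 28 bytes of UDP/IPv4 headers; real per-packet overhead is higher, which only makes the mismatch worse:

#include <stdio.h>

int main(void)
{
	const long burst = 5L * 1024 * 1024;	/* the 5MB written "right now" */
	const long payload = 64;		/* tinygram payload */
	const long headers = 8 + 20;		/* UDP + IPv4 headers */

	/* ~57,000 datagrams, against a queue of only 100 or 1000 packets */
	printf("%ld datagrams in the burst\n", burst / (payload + headers));
	return 0;
}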
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 21:07 ` David S. Miller @ 2003-09-11 21:29 ` Ben Greear 2003-09-11 21:29 ` David S. Miller 0 siblings, 1 reply; 38+ messages in thread From: Ben Greear @ 2003-09-11 21:29 UTC (permalink / raw) To: David S. Miller; +Cc: jgarzik, scott.feldman, netdev, ricardoz David S. Miller wrote: > On Thu, 11 Sep 2003 13:40:44 -0700 > Ben Greear <greearb@candelatech.com> wrote: > > >>So, assume we configure a 10MB socket send queue on our UDP socket... >> >>Select says its writable up to at least 5MB. >> >>We write 5MB of 64byte packets "righ now". >> >>Did we just drop a large number of packets? > > > Yes, we did _iff_ dev->tx_queue_len is less than or equal > to (5MB / (64 + sizeof(udp_id_headers))). Thanks for that clarification. Is there no way to tell at 'sendto' time that the buffers are over-full, and either block or return -EBUSY or something like that? Perhaps the poll logic should also take the underlying buffer into account and not show the socket as writable in this case? Supposing in the above example, I set tx_queue_len to (5MB / (64 + sizeof(udp_id_headers))), will the packets now be dropped in the driver instead, or will there be no more (local) drops? Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 21:29 ` Ben Greear @ 2003-09-11 21:29 ` David S. Miller 2003-09-11 21:47 ` Ricardo C Gonzalez 2003-09-11 22:15 ` Ben Greear 0 siblings, 2 replies; 38+ messages in thread From: David S. Miller @ 2003-09-11 21:29 UTC (permalink / raw) To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz On Thu, 11 Sep 2003 14:29:43 -0700 Ben Greear <greearb@candelatech.com> wrote: > Thanks for that clarification. Is there no way to tell > at 'sendto' time that the buffers are over-full, and either > block or return -EBUSY or something like that? The TX queue state can change by hundreds of packets by the time we are finished making the "decision", also how would you like to "wake" up sockets when the TX queue is liberated. That extra overhead and logic would be wonderful for performance. No, this is all nonsense. Packet scheduling and queueing is an opaque layer to all the upper layers. It is the only sensible design. IP transmit is black hole that may drop packets at any moment, any datagram application not prepared for this should be prepared for troubles or choose to move over to something like TCP. I listed even a workaround for such stupid UDP apps, simply limit their socket send queue limits. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 21:29 ` David S. Miller @ 2003-09-11 21:47 ` Ricardo C Gonzalez 2003-09-11 22:00 ` Jeff Garzik 2003-09-11 22:15 ` Ben Greear 1 sibling, 1 reply; 38+ messages in thread From: Ricardo C Gonzalez @ 2003-09-11 21:47 UTC (permalink / raw) To: David S. Miller; +Cc: greearb, jgarzik, scott.feldman, netdev >IP transmit is black hole that may drop packets at any moment, >any datagram application not prepared for this should be prepared >for troubles or choose to move over to something like TCP. As I said before, please do not make this a UDP issue. The data I sent out was taken using a TCP_STREAM test case. Please review it. regards, ---------------------------------------------------------------------------------- *** ALWAYS THINK POSITIVE *** Rick Gonzalez IBM Linux Performance Group Building: 905 Office: 7G019 Phone: (512) 838-0623 "David S. Miller" <davem@redhat.com> on 09/11/2003 04:29:06 PM To: Ben Greear <greearb@candelatech.com> cc: jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com, Ricardo C Gonzalez/Austin/IBM@ibmus Subject: Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default On Thu, 11 Sep 2003 14:29:43 -0700 Ben Greear <greearb@candelatech.com> wrote: > Thanks for that clarification. Is there no way to tell > at 'sendto' time that the buffers are over-full, and either > block or return -EBUSY or something like that? The TX queue state can change by hundreds of packets by the time we are finished making the "decision", also how would you like to "wake" up sockets when the TX queue is liberated. That extra overhead and logic would be wonderful for performance. No, this is all nonsense. Packet scheduling and queueing is an opaque layer to all the upper layers. It is the only sensible design. IP transmit is black hole that may drop packets at any moment, any datagram application not prepared for this should be prepared for troubles or choose to move over to something like TCP. I listed even a workaround for such stupid UDP apps, simply limit their socket send queue limits. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 21:47 ` Ricardo C Gonzalez @ 2003-09-11 22:00 ` Jeff Garzik 0 siblings, 0 replies; 38+ messages in thread From: Jeff Garzik @ 2003-09-11 22:00 UTC (permalink / raw) To: Ricardo C Gonzalez; +Cc: David S. Miller, greearb, scott.feldman, netdev Ricardo C Gonzalez wrote: > > >>IP transmit is black hole that may drop packets at any moment, >>any datagram application not prepared for this should be prepared >>for troubles or choose to move over to something like TCP. > > > > As I said before, please do not make this a UDP issue. The data I sent out > was taken using a TCP_STREAM test case. Please review it. Your own words say "CPUs can fill TX queue". We already know this. CPUs have been doing wire speed for ages. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 21:29 ` David S. Miller 2003-09-11 21:47 ` Ricardo C Gonzalez @ 2003-09-11 22:15 ` Ben Greear 2003-09-11 23:02 ` David S. Miller 1 sibling, 1 reply; 38+ messages in thread From: Ben Greear @ 2003-09-11 22:15 UTC (permalink / raw) To: David S. Miller; +Cc: jgarzik, scott.feldman, netdev, ricardoz David S. Miller wrote: > On Thu, 11 Sep 2003 14:29:43 -0700 > Ben Greear <greearb@candelatech.com> wrote: > > >>Thanks for that clarification. Is there no way to tell >>at 'sendto' time that the buffers are over-full, and either >>block or return -EBUSY or something like that? > > > The TX queue state can change by hundreds of packets by > the time we are finished making the "decision", also how would > you like to "wake" up sockets when the TX queue is liberated. So, at some point the decision is already made that we must drop the packet, or that we can enqueue it. This is where I would propose we block the thing trying to enqueue, or at least propagate a failure code back up the stack(s) so that the packet can be retried by the calling layer. Preferably, one would propagate the error all the way to userspace and let them deal with it, just like we currently deal with socket queue full issues. > That extra overhead and logic would be wonderful for performance. The cost of a retransmit is also expensive, whether it is some hacked up UDP protocol or for TCP. Even if one had to implement callbacks from the device queue to the interested sockets, this should not be a large performance hit. > > No, this is all nonsense. Packet scheduling and queueing is > an opaque layer to all the upper layers. It is the only sensible > design. This is possible, but it does not seem cut and dried to me. If there is any documentation or research that support this assertion, please do let us know. > > IP transmit is black hole that may drop packets at any moment, > any datagram application not prepared for this should be prepared > for troubles or choose to move over to something like TCP. > > I listed even a workaround for such stupid UDP apps, simply limit > their socket send queue limits. And the original poster shows how a similar problem slows down TCP as well due to local dropped packets. Don't you think we'd get better TCP throughput if we instead had the calling code wait 1us for the buffers to clear? -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 22:15 ` Ben Greear @ 2003-09-11 23:02 ` David S. Miller 2003-09-11 23:22 ` Ben Greear 0 siblings, 1 reply; 38+ messages in thread From: David S. Miller @ 2003-09-11 23:02 UTC (permalink / raw) To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz On Thu, 11 Sep 2003 15:15:19 -0700 Ben Greear <greearb@candelatech.com> wrote: > And the original poster shows how a similar problem slows down TCP > as well due to local dropped packets. So, again, dampen the per-socket send queue sizes. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 23:02 ` David S. Miller @ 2003-09-11 23:22 ` Ben Greear 2003-09-11 23:29 ` David S. Miller 2003-09-12 1:34 ` jamal 0 siblings, 2 replies; 38+ messages in thread From: Ben Greear @ 2003-09-11 23:22 UTC (permalink / raw) Cc: jgarzik, scott.feldman, netdev, ricardoz David S. Miller wrote: > On Thu, 11 Sep 2003 15:15:19 -0700 > Ben Greear <greearb@candelatech.com> wrote: > > >>And the original poster shows how a similar problem slows down TCP >>as well due to local dropped packets. > > > So, again, dampen the per-socket send queue sizes. That's just a band-aid to cover up the flaw with the lack of queue-pressure feedback to the higher stacks, as would be increasing the TxDescriptors for that matter. -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-11 23:22 ` Ben Greear @ 2003-09-11 23:29 ` David S. Miller 2003-09-12 1:34 ` jamal 1 sibling, 0 replies; 38+ messages in thread From: David S. Miller @ 2003-09-11 23:29 UTC (permalink / raw) To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz On Thu, 11 Sep 2003 16:22:35 -0700 Ben Greear <greearb@candelatech.com> wrote: > David S. Miller wrote: > > So, again, dampen the per-socket send queue sizes. > > That's just a band-aid to cover up the flaw with the lack > of queue-pressure feedback to the higher stacks, as would be increasing the > TxDescriptors for that matter. The whole point of the various packet scheduler algorithms are foregone if we're just going to queue up and send the crap again. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
  2003-09-11 23:22                     ` Ben Greear
  2003-09-11 23:29                       ` David S. Miller
@ 2003-09-12 1:34                       ` jamal
  2003-09-12 2:20                         ` Ricardo C Gonzalez
  2003-09-13 3:49                         ` David S. Miller
  1 sibling, 2 replies; 38+ messages in thread
From: jamal @ 2003-09-12 1:34 UTC (permalink / raw)
  To: Ben Greear; +Cc: jgarzik, scott.feldman, netdev, ricardoz

Scott,

Don't increase the tx descriptor ring size - that would truly be wasting
memory; 256 is pretty adequate.
* Increase instead the txqueuelen (as suggested by Davem); user space
tools like ip or ifconfig could do it. The standard size has been around
100 for 100Mbps; I suppose it is fair to say that GigE can move data out
at 10x that; so set it to 1000. Maybe you can do this from the driver
based on what negotiated speed is detected?

--------
[root@jzny root]# ip link ls eth0
4: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:b0:d0:05:ae:81 brd ff:ff:ff:ff:ff:ff
[root@jzny root]# ip link set eth0 txqueuelen 1000
[root@jzny root]# ip link ls eth0
4: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:b0:d0:05:ae:81 brd ff:ff:ff:ff:ff:ff
-------

TCP already reacts to packets dropped at the scheduler level. UDP would
be too hard to enforce since the logic is typically in an app above UDP,
so just control it via the socket queue size.

cheers,
jamal

On Thu, 2003-09-11 at 19:22, Ben Greear wrote:
> David S. Miller wrote:
> > On Thu, 11 Sep 2003 15:15:19 -0700
> > Ben Greear <greearb@candelatech.com> wrote:
> >
> >
> >>And the original poster shows how a similar problem slows down TCP
> >>as well due to local dropped packets.
> >
> >
> > So, again, dampen the per-socket send queue sizes.
>
> That's just a band-aid to cover up the flaw with the lack
> of queue-pressure feedback to the higher stacks, as would be increasing the
> TxDescriptors for that matter.

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
  2003-09-12 1:34                       ` jamal
@ 2003-09-12 2:20                         ` Ricardo C Gonzalez
  2003-09-12 3:05                           ` jamal
  1 sibling, 1 reply; 38+ messages in thread
From: Ricardo C Gonzalez @ 2003-09-12 2:20 UTC (permalink / raw)
  To: hadi; +Cc: greearb, jgarzik, scott.feldman, netdev

Jamal wrote:

>* Increase instead the txqueuelen (as suggested by Davem); user space
>tools like ip or ifconfig could do it. The standard size has been around
>100 for 100Mbps; I suppose it is fair to say that GigE can move data out
>at 10x that; so set it to 1000. Maybe you can do this from the driver
>based on what negotiated speed is detected?

This is also another way to do it, as long as we make it harder for users
to drop packets and get up to date with Gigabit speeds. We would also have
to think about the upcoming 10 GigE adapters and their queue sizes, but
that is a separate issue. Anyway, the driver can easily set the txqueuelen
to 1000.

We should care about counting the packets being dropped on the transmit
side. Would it be the responsibility of the driver to account for these
drops? Because each driver has a dedicated software queue, in my opinion
the driver should account for these packets.

regards,

----------------------------------------------------------------------------------
*** ALWAYS THINK POSITIVE ***

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
  2003-09-12 2:20                         ` Ricardo C Gonzalez
@ 2003-09-12 3:05                           ` jamal
  0 siblings, 0 replies; 38+ messages in thread
From: jamal @ 2003-09-12 3:05 UTC (permalink / raw)
  To: Ricardo C Gonzalez; +Cc: greearb, jgarzik, scott.feldman, netdev

On Thu, 2003-09-11 at 22:20, Ricardo C Gonzalez wrote:
> Jamal wrote:
> We should care about counting the packets being dropped on the transmit
> side. Would it be the responsibility of the driver to account for these
> drops? Because each driver has a dedicated software queue, in my opinion
> the driver should account for these packets.

This is really the scheduler's responsibility. It's hard for the driver
to keep track of why a packet was dropped. For example, a packet could be
dropped to make room for a higher priority packet that's anticipated to
show up soon.

The simple default 3-band scheduler unfortunately doesn't quite show its
stats ... so a simple way to see drops is:

- install the prio qdisc

------
[root@jzny root]# tc qdisc add dev eth0 root prio
[root@jzny root]# tc -s qdisc
qdisc prio 8001: dev eth0 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 42 bytes 1 pkts (dropped 0, overlimits 0)
-----

or you may want to install a single pfifo queue with a size of 1000
(although this is a little too medieval), example:

#tc qdisc add dev eth0 root pfifo limit 1000
#tc -s qdisc
qdisc pfifo 8002: dev eth0 limit 1000p
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0)

etc

cheers,
jamal

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-12 1:34 ` jamal 2003-09-12 2:20 ` Ricardo C Gonzalez @ 2003-09-13 3:49 ` David S. Miller 2003-09-13 11:52 ` Robert Olsson 2003-09-14 19:08 ` Ricardo C Gonzalez 1 sibling, 2 replies; 38+ messages in thread From: David S. Miller @ 2003-09-13 3:49 UTC (permalink / raw) To: hadi; +Cc: greearb, jgarzik, scott.feldman, netdev, ricardoz On 11 Sep 2003 21:34:23 -0400 jamal <hadi@cyberus.ca> wrote: > dont increase the tx descriptor ring size - that would truly wasting > memory; 256 is pretty adequate. > * increase instead the txquelen (as suggested by Davem); user space > tools like ip or ifconfig could do it. The standard size has been around > 100 for 100Mbps; i suppose it is fair to say that Gige can move data out > at 10x that; so set it to 1000. Maybe you can do this from the driver > based on what negotiated speed is detected? I spoke with Alexey once about this, actually tx_queue_len can be arbitrarily large but it should be reasonable nonetheless. Our preliminary conclusions were that values of 1000 for 100Mbit and faster were probably appropriate. Maybe something larger for 1Gbit, who knows. We also determined that the only connection between TX descriptor ring size and dev->tx_queue_len was that the latter should be large enough to handle, at a minimum, the amount of pending TX descriptor ACKs that can be pending considering mitigation et al. So if TX irq mitigation can defer up to N TX descriptor completions then dev->tx_queue_len must be at least that large. Back to the main topic, maybe we should set dev->tx_queue_len to 1000 by default for all ethernet devices. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default
  2003-09-13 3:49                         ` David S. Miller
@ 2003-09-13 11:52                           ` Robert Olsson
  2003-09-15 12:12                             ` jamal
  0 siblings, 1 reply; 38+ messages in thread
From: Robert Olsson @ 2003-09-13 11:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, greearb, jgarzik, scott.feldman, netdev, ricardoz

David S. Miller writes:

> On 11 Sep 2003 21:34:23 -0400
> jamal <hadi@cyberus.ca> wrote:
>
> > dont increase the tx descriptor ring size - that would truly wasting
> > memory; 256 is pretty adequate.
> > * increase instead the txquelen (as suggested by Davem); user space
> > tools like ip or ifconfig could do it. The standard size has been around
> > 100 for 100Mbps; i suppose it is fair to say that Gige can move data out
> > at 10x that; so set it to 1000. Maybe you can do this from the driver
> > based on what negotiated speed is detected?
>
> I spoke with Alexey once about this, actually tx_queue_len can
> be arbitrarily large but it should be reasonable nonetheless.
>
> Our preliminary conclusions were that values of 1000 for 100Mbit and
> faster were probably appropriate. Maybe something larger for 1Gbit,
> who knows.
>
> We also determined that the only connection between TX descriptor
> ring size and dev->tx_queue_len was that the latter should be large
> enough to handle, at a minimum, the amount of pending TX descriptor
> ACKs that can be pending considering mitigation et al.
>
> So if TX irq mitigation can defer up to N TX descriptor completions
> then dev->tx_queue_len must be at least that large.
>
> Back to the main topic, maybe we should set dev->tx_queue_len to
> 1000 by default for all ethernet devices.

Hello!

Yes, that sounds like an adequate setting for GigE. This is what we use
for production and lab, but rather than increasing dev->tx_queue_len to
1000 we replace the pfifo_fast with the pfifo qdisc, setting a qlen of
1000. And with this we have a tx_descriptor_ring_size of 256, which is
tuned to the NIC's "TX service interval" with respect to interrupt
mitigation etc. This seems good enough even for small packets.

For routers this setting is even more crucial as we need to serialize
several flows and we know the flows are bursty.

Cheers.

--ro

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-13 11:52 ` Robert Olsson @ 2003-09-15 12:12 ` jamal 2003-09-15 13:45 ` Robert Olsson 0 siblings, 1 reply; 38+ messages in thread From: jamal @ 2003-09-15 12:12 UTC (permalink / raw) To: Robert Olsson Cc: David S. Miller, greearb, jgarzik, scott.feldman, netdev, ricardoz On Sat, 2003-09-13 at 07:52, Robert Olsson wrote: > > > > I spoke with Alexey once about this, actually tx_queue_len can > > be arbitrarily large but it should be reasonable nonetheless. > > > > Our preliminary conclusions were that values of 1000 for 100Mbit and > > faster were probably appropriate. Maybe something larger for 1Gbit, > > who knows. If you recall we saw that even for the gent who was trying to do 100K TCP sockets on a 4 way SMP, 1000 was sufficient and no packets were dropped. > > > > We also determined that the only connection between TX descriptor > > ring size and dev->tx_queue_len was that the latter should be large > > enough to handle, at a minimum, the amount of pending TX descriptor > > ACKs that can be pending considering mitigation et al. > > > > So if TX irq mitigation can defer up to N TX descriptor completions > > then dev->tx_queue_len must be at least that large. > > > > Back to the main topic, maybe we should set dev->tx_queue_len to > > 1000 by default for all ethernet devices. > > Hello! > > Yes sounds like adequate setting for GIGE. This is what use for production > and lab but rather than increasing dev->tx_queue_len to 1000 we replace the > pfifo_fast with the pfifo qdisc w. setting a qlen of 1000. > I think this may not be good for the reason of QoS. You want BGP packets to be given priority over ftp. A single queue kills that. The current default 3 band queue is good enough, the only challenge being noone sees stats for it. I have a patch for the kernel at: http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel and for tc at: http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc cheers, jamal ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-15 12:12 ` jamal @ 2003-09-15 13:45 ` Robert Olsson 2003-09-15 23:15 ` David S. Miller 0 siblings, 1 reply; 38+ messages in thread From: Robert Olsson @ 2003-09-15 13:45 UTC (permalink / raw) To: hadi Cc: Robert Olsson, David S. Miller, greearb, jgarzik, scott.feldman, netdev, ricardoz jamal writes: > I think this may not be good for the reason of QoS. You want BGP packets > to be given priority over ftp. A single queue kills that. Well so far single queue has been robust enough for BGP-sessions. Talking from own experiences... > The current default 3 band queue is good enough, the only challenge > being noone sees stats for it. I have a patch for the kernel at: > http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel > and for tc at: > http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc Yes. I've missed this. Our lazy work-around for the missing stats is to install pfifo qdisc as said. IMO it should be included. Cheers. --ro ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-15 13:45 ` Robert Olsson @ 2003-09-15 23:15 ` David S. Miller 2003-09-16 9:28 ` Robert Olsson 0 siblings, 1 reply; 38+ messages in thread From: David S. Miller @ 2003-09-15 23:15 UTC (permalink / raw) To: Robert Olsson Cc: hadi, Robert.Olsson, greearb, jgarzik, scott.feldman, netdev, ricardoz On Mon, 15 Sep 2003 15:45:42 +0200 Robert Olsson <Robert.Olsson@data.slu.se> wrote: > > The current default 3 band queue is good enough, the only challenge > > being noone sees stats for it. I have a patch for the kernel at: > > http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel > > and for tc at: > > http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc > > Yes. > I've missed this. Our lazy work-around for the missing stats is to install > pfifo qdisc as said. IMO it should be included. I've included Jamal's pfifo_fast statistic patch, and the change to increase ethernet's tx_queue_len to 1000 in all of my trees. Thanks. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-15 23:15 ` David S. Miller @ 2003-09-16 9:28 ` Robert Olsson 0 siblings, 0 replies; 38+ messages in thread From: Robert Olsson @ 2003-09-16 9:28 UTC (permalink / raw) To: David S. Miller Cc: kuznet, Robert Olsson, hadi, greearb, jgarzik, scott.feldman, netdev, ricardoz David S. Miller writes: > > > http://www.cyberus.ca/~hadi/patches/restore.pfifo.kernel > > > and for tc at: > > > http://www.cyberus.ca/~hadi/patches/restore.pfifo.tc > > I've included Jamal's pfifo_fast statistic patch, and the > change to increase ethernet's tx_queue_len to 1000 in all > of my trees. Thanks. We ask Alexey to include the tc part too. Cheers. --ro ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-13 3:49 ` David S. Miller 2003-09-13 11:52 ` Robert Olsson @ 2003-09-14 19:08 ` Ricardo C Gonzalez 2003-09-15 2:50 ` David Brownell 2004-05-15 12:14 ` TxDescriptors -> 1024 default. Please not for every NIC! Marc Herbert 1 sibling, 2 replies; 38+ messages in thread From: Ricardo C Gonzalez @ 2003-09-14 19:08 UTC (permalink / raw) To: David S. Miller; +Cc: hadi, greearb, jgarzik, scott.feldman, netdev David Miller wrote: >Back to the main topic, maybe we should set dev->tx_queue_len to >1000 by default for all ethernet devices. I definately agree with setting the dev->tx_queue_len to 1000 as a default for all ethernet adapters. All adapters will benefit from this change. regards, ---------------------------------------------------------------------------------- *** ALWAYS THINK POSITIVE *** ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-14 19:08 ` Ricardo C Gonzalez @ 2003-09-15 2:50 ` David Brownell 2003-09-15 8:17 ` David S. Miller 2004-05-15 12:14 ` TxDescriptors -> 1024 default. Please not for every NIC! Marc Herbert 1 sibling, 1 reply; 38+ messages in thread From: David Brownell @ 2003-09-15 2:50 UTC (permalink / raw) To: Ricardo C Gonzalez, David S. Miller Cc: hadi, greearb, jgarzik, scott.feldman, netdev Ricardo C Gonzalez wrote: > > David Miller wrote: > > >>Back to the main topic, maybe we should set dev->tx_queue_len to >>1000 by default for all ethernet devices. > > > > I definately agree with setting the dev->tx_queue_len to 1000 as a default > for all ethernet adapters. All adapters will benefit from this change. Except ones where CONFIG_EMBEDDED, maybe? Not everyone wants to spend that much memory, even when it's available... ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [e1000 2.6 10/11] TxDescriptors -> 1024 default 2003-09-15 2:50 ` David Brownell @ 2003-09-15 8:17 ` David S. Miller 0 siblings, 0 replies; 38+ messages in thread From: David S. Miller @ 2003-09-15 8:17 UTC (permalink / raw) To: David Brownell; +Cc: ricardoz, hadi, greearb, jgarzik, scott.feldman, netdev On Sun, 14 Sep 2003 19:50:56 -0700 David Brownell <david-b@pacbell.net> wrote: > Except ones where CONFIG_EMBEDDED, maybe? Not everyone wants > to spend that much memory, even when it's available... Dropping the packet between the network stack and the driver does waste memory for _LONGER_ periods of time. When we drop, TCP still hangs onto the buffer, and we'll send it again and again until it makes it and we get an ACK back or the connection completely times out. ^ permalink raw reply [flat|nested] 38+ messages in thread
* TxDescriptors -> 1024 default. Please not for every NIC! 2003-09-14 19:08 ` Ricardo C Gonzalez 2003-09-15 2:50 ` David Brownell @ 2004-05-15 12:14 ` Marc Herbert 2004-05-19 9:30 ` Marc Herbert 1 sibling, 1 reply; 38+ messages in thread From: Marc Herbert @ 2004-05-15 12:14 UTC (permalink / raw) To: netdev On Sun, 14 Sep 2003, Ricardo C Gonzalez wrote: > David Miller wrote: > > >Back to the main topic, maybe we should set dev->tx_queue_len to > >1000 by default for all ethernet devices. > > > I definately agree with setting the dev->tx_queue_len to 1000 as a default > for all ethernet adapters. All adapters will benefit from this change. > <http://oss.sgi.com/projects/netdev/archive/2003-09/threads.html#00247> Sorry to exhume this discussion but I only recently discovered this change, the hard way. I carefully read this old thread and did not grasp _every_ detail, but there is one thing that I am sure of: 1000 packets @ 1 Gb/s looks good, but on the other hand, 1000 full-size Ethernet packets @ 10 Mb/s are about 1.2 seconds long! Too little buffering means not enough dampering effect, which is very important for performance in asynchronous systems, granted. However, _too much_ buffering means too big and too variable latencies. When discussing buffers, duration is very often more important than size. Applications, TCP's dynamic (and kernel dynamics too?) do not care much about buffer sizes, they more often care about latencies (and throughput, of course). Buffers sizes is often "just a small matter of implementation" :-) For instance people designing routers talk about buffers in _milliseconds_ much more often than in _bytes_ (despite the fact that their memories cost more than in hosts, considering the throughputs involved). 100 packets @ 100 Mb/s was 12 ms. 1000 packets @ 1 Gb/s is still 12 ms. 12 ms is great. It's a "good" latency because it is the order of magnitude of real-world constants like: comfortable interactive applications, operating system sheduler granularity or propagation time in 2000 km of cable. But 1000 packets @ 100 Mb/s is 120 ms and is neither very good nor very useful anymore. 1000 packets @ 10 Mb/s is 1.2 s, which is ridiculous. It does mean that, when joe user is uploading some big file through his cheap Ethernet card, and that there are no other bottleneck/drops further in the network, every concurrent application will have to wait 1.2 s before accessing the network! It this hard to believe for you, just make the test yourself, it's very easy: force one of you NICs to 10Mb/s full duplex, txqueuelen 1000 and send a continuous flow to a nearby machine. Then try to ping anything. Imagine now that some packet is lost for whatever reason on some _other_ TCP connection going through this terrible 1.2 s queue. Then you need one SACK/RTX extra round trip time to recover from it: so it's now _2.4 s_ to deliver the data sent just after the dropped packet... Assuming of course TCP timers do not become confused by this huge latency and probably huge jitter. And I don't think you want to make fiddling with "tc" mandatory for joe user. Or tell him: "oh, please just 'ifconfig txqueuelen 10', or buy a new Ethernet card". I am unfortunately not familiar with this part of the linux kernel, but I really think that, if possible, txqueuelen should be initialized at some "constant 12 ms" and not at the "1000 packets" highly variable latency setting. 
I can imagine there are some corner cases, like for instance when some GEth NIC is hot-plugged into a 100 Mb/s, or jumbo frames, but hey, those are corner cases : as a first step, even a simple constant-per-model txqueuelen initialization would be already great. Cheers, Marc. PS: one workaround for joe user against this 1.2s latency would be to keep his SND_BUF and number of sockets small. But this is poor. -- "Je n'ai fait cette lettre-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte." -- Blaise Pascal ^ permalink raw reply [flat|nested] 38+ messages in thread
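The figures above follow directly from queue depth times per-packet serialization time; a throwaway program to reproduce them, assuming full-size 1500-byte frames and ignoring framing overhead (real numbers are slightly worse):

#include <stdio.h>

/* Worst-case drain time of a full tx queue: qlen packets of 1500 bytes
 * serialized at the given link rate. */
static double queue_ms(int qlen, double mbit_per_s)
{
	const double pkt_bits = 1500.0 * 8.0;

	return qlen * pkt_bits / (mbit_per_s * 1e6) * 1000.0;
}

int main(void)
{
	printf("100 pkts  @ 100 Mb/s: %6.0f ms\n", queue_ms(100, 100));   /*   12 ms */
	printf("1000 pkts @ 1 Gb/s:   %6.0f ms\n", queue_ms(1000, 1000)); /*   12 ms */
	printf("1000 pkts @ 100 Mb/s: %6.0f ms\n", queue_ms(1000, 100));  /*  120 ms */
	printf("1000 pkts @ 10 Mb/s:  %6.0f ms\n", queue_ms(1000, 10));   /* 1200 ms */
	return 0;
}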
* Re: TxDescriptors -> 1024 default. Please not for every NIC! 2004-05-15 12:14 ` TxDescriptors -> 1024 default. Please not for every NIC! Marc Herbert @ 2004-05-19 9:30 ` Marc Herbert 2004-05-19 10:27 ` Pekka Pietikainen 2004-05-19 11:54 ` Andi Kleen 0 siblings, 2 replies; 38+ messages in thread From: Marc Herbert @ 2004-05-19 9:30 UTC (permalink / raw) To: netdev On Sat, 15 May 2004, Marc Herbert wrote: > <http://oss.sgi.com/projects/netdev/archive/2003-09/threads.html#00247> > > Sorry to exhume this discussion but I only recently discovered this > change, the hard way. > > I am unfortunately not familiar with this part of the linux kernel, > but I really think that, if possible, txqueuelen should be initialized > at some "constant 12 ms" and not at the "1000 packets" highly variable > latency setting. I can imagine there are some corner cases, like for > instance when some GEth NIC is hot-plugged into a 100 Mb/s, or jumbo > frames, but hey, those are corner cases : as a first step, even a > simple constant-per-model txqueuelen initialization would be already > great. After some further study, I was glad to discover my suggestion above both easy and short to implement. See patch below. Trying to sum-it up: - Ricardo asks (among others) for a new 1000 packets default txqueuelen for Intel's e1000, based on some data (couldn't not find this data, please send me the pointer if you have it, thanks). - Me argues that we all lived happy for ages with this default setting of 100 packets @ 100 Mb/s (and lived approximately happy @ 10 Mb/s), but we'll soon see doom and gloom with this new and brutal change to 1000 packets for all this _legacy_ 10-100 Mb/s hardware. e1000 data only is not enough to justify this radical shift. If you are convinced by _both_ items above, then the patch below content _both_, and we're done. If you are not, then... wait for further discussion, including answers to latest Ricardo's post. PS: several people seem to think TCP "drops" packets when the qdisc is full. My analysis of the code _and_ my experiments makes me think they are wrong: TCP rather "blocks" when the qdisc is full. See explanation here: <http://oss.sgi.com/archives/netdev/2004-05/msg00151.html> (Subject: Re: TcpOutSegs way too optimistic (netstat -s)) ===== drivers/net/net_init.c 1.11 vs edited ===== --- 1.11/drivers/net/net_init.c Tue Sep 16 01:12:25 2003 +++ edited/drivers/net/net_init.c Wed May 19 11:05:34 2004 @@ -420,7 +420,10 @@ dev->hard_header_len = ETH_HLEN; dev->mtu = 1500; /* eth_mtu */ dev->addr_len = ETH_ALEN; - dev->tx_queue_len = 1000; /* Ethernet wants good queues */ + dev->tx_queue_len = 100; /* This is a sensible generic default for + 100 Mb/s: about 12ms with 1500 full size packets. + Drivers should tune this depending on interface + specificities and settings */ memset(dev->broadcast,0xFF, ETH_ALEN); ===== drivers/net/e1000/e1000_main.c 1.56 vs edited ===== --- 1.56/drivers/net/e1000/e1000_main.c Tue Feb 3 01:43:42 2004 +++ edited/drivers/net/e1000/e1000_main.c Wed May 19 03:14:32 2004 @@ -400,6 +400,8 @@ err = -ENOMEM; goto err_alloc_etherdev; } + + netdev->tx_queue_len = 1000; SET_MODULE_OWNER(netdev); ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: TxDescriptors -> 1024 default. Please not for every NIC! 2004-05-19 9:30 ` Marc Herbert @ 2004-05-19 10:27 ` Pekka Pietikainen 2004-05-20 14:11 ` Luis R. Rodriguez 2004-05-19 11:54 ` Andi Kleen 1 sibling, 1 reply; 38+ messages in thread From: Pekka Pietikainen @ 2004-05-19 10:27 UTC (permalink / raw) To: Marc Herbert; +Cc: netdev, prism54-devel On Wed, May 19, 2004 at 11:30:28AM +0200, Marc Herbert wrote: > - Me argues that we all lived happy for ages with this default > setting of 100 packets @ 100 Mb/s (and lived approximately happy @ > 10 Mb/s), but we'll soon see doom and gloom with this new and > brutal change to 1000 packets for all this _legacy_ 10-100 Mb/s > hardware. e1000 data only is not enough to justify this radical > shift. > > If you are convinced by _both_ items above, then the patch below > content _both_, and we're done. > > If you are not, then... wait for further discussion, including answers > to latest Ricardo's post. Not to mention that not all modern hardware is gigabit, current 2.6 seems to be setting txqueuelen of 1000 for 802.11 devices too (at least my prism54), which might be causing major problems for me. Well, I'm still trying to figure out whether it's txqueue or WEP that causes all traffic to stop (with rx invalid crypt packets showing up in iwconfig afterwards, AP is a linksys wrt54g in case it makes a difference) every now and then until a ifdown / ifup. Tried both vanilla 2.6 prism54 and CVS (which seems to have a reset on tx timeout thing added), but if txqueue is 1000 that won't easily get triggered will it? It's been running for a few days just fine with txqueue = 100 and no WEP, if it stays like that i'll start tweaking to find what exactly triggers it. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Re: TxDescriptors -> 1024 default. Please not for every NIC!
  2004-05-19 10:27 ` Pekka Pietikainen
@ 2004-05-20 14:11   ` Luis R. Rodriguez
  2004-05-20 16:38     ` [Prism54-devel] " Jean Tourrilhes
  0 siblings, 1 reply; 38+ messages in thread
From: Luis R. Rodriguez @ 2004-05-20 14:11 UTC (permalink / raw)
  To: Pekka Pietikainen; +Cc: Marc Herbert, netdev, prism54-devel, Jean Tourrilhes

[-- Attachment #1: Type: text/plain, Size: 1239 bytes --]

On Wed, May 19, 2004 at 01:27:00PM +0300, Pekka Pietikainen wrote:
> On Wed, May 19, 2004 at 11:30:28AM +0200, Marc Herbert wrote:
> > - Me argues that we all lived happy for ages with this default
> > setting of 100 packets @ 100 Mb/s (and lived approximately happy @
> > 10 Mb/s), but we'll soon see doom and gloom with this new and
> > brutal change to 1000 packets for all this _legacy_ 10-100 Mb/s
> > hardware. e1000 data only is not enough to justify this radical
> > shift.
> >
> > If you are convinced by _both_ items above, then the patch below
> > content _both_, and we're done.
> >
> > If you are not, then... wait for further discussion, including answers
> > to latest Ricardo's post.
>
> Not to mention that not all modern hardware is gigabit, current
> 2.6 seems to be setting txqueuelen of 1000 for 802.11 devices too (at least
> my prism54), which might be causing major problems for me.

Considering 802.11b's peak is at 11Mbit and standard 802.11g is at 54Mbit
(some manufacturers are using two channels and getting 108Mbit now) I'd
think we should stick at 100, as the patch proposes. Jean?

	Luis

--
GnuPG Key fingerprint = 113F B290 C6D2 0251 4D84 A34A 6ADD 4937 E20A 525E

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Prism54-devel] Re: TxDescriptors -> 1024 default. Please not for every NIC! 2004-05-20 14:11 ` Luis R. Rodriguez @ 2004-05-20 16:38 ` Jean Tourrilhes 2004-05-20 16:45 ` Tomasz Torcz 0 siblings, 1 reply; 38+ messages in thread From: Jean Tourrilhes @ 2004-05-20 16:38 UTC (permalink / raw) To: Pekka Pietikainen, Marc Herbert, netdev, prism54-devel On Thu, May 20, 2004 at 10:11:11AM -0400, Luis R. Rodriguez wrote: > On Wed, May 19, 2004 at 01:27:00PM +0300, Pekka Pietikainen wrote: > > On Wed, May 19, 2004 at 11:30:28AM +0200, Marc Herbert wrote: > > > - Me argues that we all lived happy for ages with this default > > > setting of 100?packets @?100?Mb/s (and lived approximately happy @ > > > 10 Mb/s), but we'll soon see doom and gloom with this new and > > > brutal change to 1000?packets for all this _legacy_ 10-100 Mb/s > > > hardware. e1000 data only is not enough to justify this radical > > > shift. > > > > > > If you are convinced by _both_ items above, then the patch below > > > content _both_, and we're done. > > > > > > If you are not, then... wait for further discussion, including answers > > > to latest Ricardo's post. > > > > Not to mention that not all modern hardware is gigabit, current > > 2.6 seems to be setting txqueuelen of 1000 for 802.11 devices too (at least > > my prism54), which might be causing major problems for me. > > Considering 802.11b's peak is at 11Mbit and standard 802.11g is at 54Mbit > (some manufacturers are using two channels and getting 108Mbit now) I'd > think we should stick at 100, as the patch proposes. Jean? > > Luis I never like to have huge queues of buffers. It waste memory, and degrade the latency, especially with competing sockets. In a theoritical stable system, you don't need buffers (you run everything synchronously), buffer are only needed to take care of the jitter in real networks. The real throughouput of 802.11g is more around 30Mb/s (at TCP/IP level). However, wireless networks tend to have more jitter (interference and contention). But, wireless cards tend to have a fair number of buffers in the hardware. I personally would stick with 100. The IrDA stack runs perfectly fine with 15 buffers at 4 Mb/s. If 100 is not enough, I think the problem is not the number of buffers, but somewhere else. For example, we might want to think about explicit socket callbacks (like I did in IrDA). But that's only personal opinions ;-) Have fun... Jean ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Prism54-devel] Re: TxDescriptors -> 1024 default. Please not for every NIC! 2004-05-20 16:38 ` [Prism54-devel] " Jean Tourrilhes @ 2004-05-20 16:45 ` Tomasz Torcz 2004-05-20 17:13 ` zero copy TX in benchmarks was " Andi Kleen 0 siblings, 1 reply; 38+ messages in thread From: Tomasz Torcz @ 2004-05-20 16:45 UTC (permalink / raw) To: netdev On Thu, May 20, 2004 at 09:38:11AM -0700, Jean Tourrilhes wrote: > I personally would stick with 100. The IrDA stack runs > perfectly fine with 15 buffers at 4 Mb/s. If 100 is not enough, I > think the problem is not the number of buffers, but somewhere else. I don't know how much trollish or true is that comment: http://bsd.slashdot.org/comments.pl?sid=106258&cid=9049422 but it suggest, that Linux' stack having no BSD like mbuf functionality, is not perfect for fast transmission. Maybe some network guru cna comment ? -- Tomasz Torcz ,,(...) today's high-end is tomorrow's embedded processor.'' zdzichu@irc.-nie.spam-.pl -- Mitchell Blank on LKML ^ permalink raw reply [flat|nested] 38+ messages in thread
* zero copy TX in benchmarks was Re: [Prism54-devel] Re: TxDescriptors -> 1024 default. Please not for every NIC! 2004-05-20 16:45 ` Tomasz Torcz @ 2004-05-20 17:13 ` Andi Kleen 0 siblings, 0 replies; 38+ messages in thread From: Andi Kleen @ 2004-05-20 17:13 UTC (permalink / raw) To: Tomasz Torcz; +Cc: netdev On Thu, May 20, 2004 at 06:45:16PM +0200, Tomasz Torcz wrote: > On Thu, May 20, 2004 at 09:38:11AM -0700, Jean Tourrilhes wrote: > > I personally would stick with 100. The IrDA stack runs > > perfectly fine with 15 buffers at 4 Mb/s. If 100 is not enough, I > > think the problem is not the number of buffers, but somewhere else. Not sure why you post this to this thread? It has nothing to do with the previous message. > > I don't know how much trollish or true is that comment: > http://bsd.slashdot.org/comments.pl?sid=106258&cid=9049422 Linux sk_buffs and BSD mbufs are not very different anymore today. The BSD mbufs have been getting more sk_buff'ish over time, and sk_buffs have grown some properties of mbufs. They both have changed to optionally pass references of memory around instead of copying always, which is what counts here. > but it suggest, that Linux' stack having no BSD like mbuf functionality, > is not perfect for fast transmission. Maybe some network guru > cna comment ? I have not read all the details, but I suppose they used sendmsg() instead of sendfile() for this test. NetBSD can use zero copy TX in this case; Linux can only with sendfile and sendmsg will copy. Obvious linux will be slower then because a copy can cost quite a lot of CPU. Or rather it is not really the CPU cost that is the problem here, but the bandwidth usage - very high speed networking i s essentially memory bandwidth limited and copying over the CPU adds additional bandwidth requirements to the memory subsystem. There was an implementation of zero copy sendmsg() for linux long ago, but it was removed because it was fundamentally incompatible with good SMP scaling, because it would require remote TLB flushes over possible many CPUs (if you search the archives of this list you will find long threads about it). It would not be very hard to readd (Linux has all the low level infrastructure needed for it), but it doesn't make sense. NetBSD may have the luxury to not care about MP scaling, but Linux doesn't. The disadvantage of sendfile is that you can only transmit files directly; if you want to transmit data directly out of an process' address space you have to put them into a file mmap and sendfile from there. This may be a bit inconvenient if the basic unit of data in your program isn't files. There was an plan suggested to fix that (implement zero copy TX for POSIX AIO instead of BSD sockets), which would not have this problem. POSIX AIO has all the infrastructure to do zero copy IO without problematic and slow TLB flushes. Just so far nobody implemented that. In practice it is not a too big issue because many tuned servers (your typical ftpd, httpd or samba server) use sendfile already. -Andi ^ permalink raw reply [flat|nested] 38+ messages in thread
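For readers who have not used it, the sendfile() transmit path mentioned above looks roughly like this from user space. A minimal sketch with trimmed error handling, where `sock` is assumed to be an already-connected TCP socket; this is an illustration, not code from any of the servers mentioned:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Transmit a whole file over an already-connected socket without copying
 * the data through user space; pages go from the page cache to the NIC
 * (given hardware that can do scatter-gather and checksum offload). */
static int send_whole_file(int sock, const char *path)
{
	struct stat st;
	off_t off = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	while (off < st.st_size) {
		ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
		if (n <= 0)
			break;	/* error or peer went away */
	}
	close(fd);
	return off == st.st_size ? 0 : -1;
}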
* Re: TxDescriptors -> 1024 default. Please not for every NIC! 2004-05-19 9:30 ` Marc Herbert 2004-05-19 10:27 ` Pekka Pietikainen @ 2004-05-19 11:54 ` Andi Kleen 1 sibling, 0 replies; 38+ messages in thread From: Andi Kleen @ 2004-05-19 11:54 UTC (permalink / raw) To: Marc Herbert; +Cc: netdev Marc Herbert <marc.herbert@free.fr> writes: > > PS: several people seem to think TCP "drops" packets when the qdisc is > full. My analysis of the code _and_ my experiments makes me think they > are wrong: TCP rather "blocks" when the qdisc is full. See explanation > here: <http://oss.sgi.com/archives/netdev/2004-05/msg00151.html> > (Subject: Re: TcpOutSegs way too optimistic (netstat -s)) This behaviour was only added relatively recently (in late 2.3.x timeframe) I believe all the default queue lengths tunings were done before that. So it would probably make sense to reevaluate/rebenchmark the default queue lengths for various devices with the newer code. -Andi ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <C925F8B43D79CC49ACD0601FB68FF50CDB13D3@orsmsx408>]
* RE: TxDescriptors -> 1024 default. Please not for every NIC! [not found] <C925F8B43D79CC49ACD0601FB68FF50CDB13D3@orsmsx408> @ 2004-06-02 19:14 ` Marc Herbert 2004-06-02 19:49 ` Cheng Jin 0 siblings, 1 reply; 38+ messages in thread From: Marc Herbert @ 2004-06-02 19:14 UTC (permalink / raw) To: Brandeburg, Jesse; +Cc: netdev On Wed, 26 May 2004, Brandeburg, Jesse wrote: > I'm not sure that you could actually get the problem to occur on 100 > or 10Mb/s hardware however because of TCP window size limitation and > such. What I'm getting at is that even if you have a device that can > queue lots of packets, it probably won't unless you're using an > unreliable protocol like UDP. Theoretically it's a problem, but I'm > not convinced that in real world scenarios it actually creates any > issues. Do you have a test that demonstrates the problem? OK, let's go for it. The following message details a simple experiment demonstrating how a too big (1000-packet) txqueuelen creates a dreaded latency at the IP level inside the sender for under-gigabit Ethernet interfaces, and so finally advocates a default 1000-packet txqueuelen defined _only_ by/for the _GigE_ drivers, leaving the previous (before sep 2003 in 2.4) 100-packet default untouched for slower interfaces. The experiment should be very quick and easy to reproduce in your lab, even in your home. Sorry, this message is way too long because it's... detailed, and tries to anticipate questions, hopefully avoiding the need to come back on this issue. The counter-part is that it's not dense and thus hopefully quick to read for anyone in the field. And please pardon my english mistakes. Detailed Experiment ------------------- You need at least 2 hosts, but ideally 3. The sender "S", the receiver "R1", and some witness host "R2". R1 and R2 can probably be collapsed together if you don't have enough hosts but I am afraid of unknown nasty side effects in this case. Host S is using a simple TCP connection to upload an infinite file to host R. The bottleneck is S's own 100Mb/s (or worst, 10Mb/s) network interface. This is very important: no packet drop must occur elsewhere between S and R, else TCP congestion avoidance algorithm will interpret this as a congestion sign and throttle down, and the txqueue will stay empty. If your TCP connection is under-performing your sending wire for any reason, you will obviously never fill your txqueue. The ACK-clocking property of TCP has for consequence that only the queue of the bottleneck of the path may fill up (except in very dynamic environnements where the bottleneck may be fast-changing, but let's stay simple). Actually almost everything below is still true when the bottleneck is elsewhere in some router further on the path instead of local in the sender. It's still true, just... elsewhere. For instance if you use a linux box as a router, and if it happens to be the bottleneck, I suspect this latency issue will appear more or less the same. But again, let's stay simple for the moment, forget those further routers and get back to this _local_ IP bottleneck and its too big txqueuelen. By the way, forcing your GigE interface to 100 or 10 is ok. I used iperf <http://dast.nlanr.net/Projects/Iperf/> to upload the infinite file, but any equivalent tool should do it. Since your TCP connection will suffer this artificial txqueue latency, you also need to increase SND_BUF and RCV_BUF, else the number of TCP packets sent (and thus the txqueue filling) will be capped (wait below for more about this). 
- So just run:

    host_R1 $ iperf --server --interval 1 --window 1M
    host_S  $ iperf --client R1 --time 1000 --window 1M

  Check that you get the full 94.1Mb/s (resp. 9.4Mb/s) wire rate. If
  not, investigate why and don't bother going on.

- Now watch the latency between S and some other host "R2", for
  instance using mtr:

    host_S $ mtr -l R2

  As the txqueue fills up, you will see the perceived latency increase
  every round-trip time, up to 120ms (worse with 10Mb/s: up to 1.2s!).
  When the txqueue is full, TCP detects it and reduces its congestion
  window, providing some temporary relief; then the artificial latency
  quickly ramps up again.

- You can also start another simultaneous upload to R2:

    host_R2 $ iperf -s -i 1
    host_S  $ iperf -c R2 -t 1000

  ... and watch how the artificial latency harms the start of the
  other TCP connection, which needs ages to ramp up its throughput.

I also learned from "A Map of the Networking Code in Linux Kernel
2.4.20", Technical Report DataTAG-2004-1, section 4.4
<http://datatag.web.cern.ch/datatag/publications.html> that you can get
interesting qdisc stats using commands such as:

    # tc qdisc add dev eth1 root pfifo limit 100
    # tc -s -d qdisc show dev eth1

but I did not try them. Warning: the interface's tx_ring size has to be
added to the qdisc's txqueuelen to get the total sender-side queue
length perceived by TCP, and some drivers set that big as well.

Solution
--------

Now reduce the txqueuelen to its previous value:

    ifconfig eth1 txqueuelen 100

For 10Mb/s you can even try:

    ifconfig eth1 txqueuelen 10

Your latency is now back to a sensible value (12ms max), and everything
works fine. It's as simple as that. Throughput is not harmed at all:
you still get the full 94.1Mb/s wire rate. If this (previous) setting
had been harmful to throughput on 100Mb/s interfaces, people would have
complained long ago.

The more complex truth
----------------------

If there is a real, distance-caused latency between S and R1, then
having an equivalent amount of buffering in the txqueue helps average
performance, because the interface then has a backlog of packets to
send while TCP takes time to ramp its congestion window back up after a
decrease, the former compensating for the latter. (This may be what the
e1000 people observed in the first place, motivating the increase to
1000? After all, 1.2ms of buffering was small.) The txqueue may smooth
the sawtooth evolution of the TCP congestion window, minimizing the
interface's idle time. But increased perceived latency is the price to
pay for this nice damper. There is a latency-versus-TCP-throughput
tradeoff to tune here _on wide-area_ routes, but pushing it as far as
storing _multiple_ times any real-world latency in the txqueue (did I
say "1.2s" already?) brings no benefit at all for throughput; it is
just terribly harmful for perceived latency. No IP router does so much
buffering. Besides Linux :-> I don't think IP queues should be sized to
cope with Earth-Moon latency by default.

Conclusion
----------
(aka: let's harass the maintainers)

Of course I have demonstrated the worst case here. In many other cases
TCP will throttle down for some reason (packet losses, too-small socket
buffers, ...), it will fill neither the pipe nor the txqueue, and this
dreaded latency will not appear. You could argue that my test case is
rare in the real world / not representative (and I would _not_ agree),
so that the txqueue will never be full in practice, since there will
always be some other reason making TCP under-perform the sending wire.
OK.
Even then, why define it uselessly high for every NIC? Why take this
risk, just as a small convenience for e1000 users? Mmmmm... So now
every interface has this 1000-packet queue (and soon 10,000, because of
the upcoming 10Gb/s interfaces). To ensure no one ever falls into this
too-big-txqueue trap, I suggest the following user documentation:

  "If you have a 100Mb/s interface, be warned that your txqueuelen is
  too high (it was tuned for real, gigabit men), so please reduce it
  using ifconfig. Alternatively, if you are not root, please tune your
  socket buffers finely: too small and you will under-perform, too big
  and you will fill up your txqueue and create artificial latency.
  Good luck. Of course, you can forget all of the above when your
  interface is not the bottleneck."

On the other hand, having a default maximum txqueuelen defined in
_milliseconds_ (just as most other routers do) is quite easy to
implement and covers all cases correctly, without complex tuning
instructions for the end user. The ideal implementation is for every
driver to define txqueuelen by itself, depending on the actual link
speed. That is unrealistic in the short term, but the incremental
implementation path is very easy: define a "sensible, generic default"
of 100 packets, perfect for the 100Mb/s masses and not too bad for the
10Mb/s masses, and let the (only a few so far, let's hurry up) gigabit
drivers override/optimize this to 1000, or whatever else is even more
finely tuned, for the cheap price of a few lines of code per driver.

Thanks in advance for agreeing OR for proving that I am wrong. I mean:
thanks in advance for anything besides remaining silent. And thanks for
reading all this gossiping, an impressive effort indeed.

^ permalink raw reply	[flat|nested] 38+ messages in thread
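A minimal sketch of the incremental path proposed above - keep a
conservative core default and let a gigabit driver override it in a few
lines - assuming a 2.6-era PCI driver. The driver name "mygige", its
private struct and its probe routine are invented for illustration;
only alloc_etherdev(), register_netdev() and the dev->tx_queue_len
field are real kernel interfaces, and the usual PCI/MAC setup is
elided.

    #include <linux/pci.h>
    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>

    struct mygige_priv {
            /* device-private state would live here */
    };

    static int mygige_probe(struct pci_dev *pdev,
                            const struct pci_device_id *ent)
    {
            struct net_device *dev;

            dev = alloc_etherdev(sizeof(struct mygige_priv));
            if (!dev)
                    return -ENOMEM;

            /*
             * Assume the core/ether_setup() default stayed at the old,
             * conservative 100 packets; only a gigabit-capable device
             * asks for a deeper soft tx queue.
             */
            dev->tx_queue_len = 1000;

            /* ... the usual PCI, register and MAC setup, then ... */
            return register_netdev(dev);
    }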
* RE: TxDescriptors -> 1024 default. Please not for every NIC!
  2004-06-02 19:14 ` Marc Herbert
@ 2004-06-02 19:49   ` Cheng Jin
  2004-06-05 14:37     ` jamal
  0 siblings, 1 reply; 38+ messages in thread
From: Cheng Jin @ 2004-06-02 19:49 UTC (permalink / raw)
  To: Marc Herbert; +Cc: netdev@oss.sgi.com

Marc,

In general, I very much agree with what you have stated about not
having a large txqueuelen. The txqueuelen should be something that
temporarily alleviates the mismatch between CPU speed and NIC
transmission speed. As long as the txqueuelen is greater than zero, say
10 just to be safe, the NIC will run at full speed (unless there are
inefficiencies in scheduling), so there is no incentive to set it to an
excessively large value like 1000.

> > I'm not sure that you could actually get the problem to occur on 100
> > or 10Mb/s hardware, however, because of TCP window size limitations

With today's CPUs, I think you will be able to fill up the txqueue on a
10 or 100Mbps NIC, assuming a large file transfer, a large window size,
and so on.

> If there is a real, distance-caused latency between S and R1, then
> having an equivalent amount of buffering in the txqueue helps average
> performance, because the interface then has a backlog of packets to
> send while TCP takes time to ramp its congestion window back up after
> a decrease, the former compensating for the latter. (This may be what
> the e1000 people observed in the first place, motivating the increase
> to 1000? After all, 1.2ms of buffering was small.) The txqueue may
> smooth the sawtooth evolution of the TCP congestion window, minimizing
> the interface's idle time. But increased perceived latency is the
> price to pay for this nice damper. There is a
> latency-versus-TCP-throughput tradeoff to tune here _on wide-area_
> routes, but pushing it as far as storing _multiple_ times any
> real-world latency in the txqueue (did I say "1.2s" already?) brings
> no benefit at all for throughput; it is just terribly harmful for
> perceived latency. No IP router does so much buffering. Besides Linux
> :-> I don't think IP queues should be sized to cope with Earth-Moon
> latency by default.

Very much agree with this paragraph. As long as the buffer holds more
than one bandwidth-delay product, a single TCP flow halving its window
after each loss will still sustain a window large enough to keep
packets in the buffer and achieve full utilization. The downside is
exactly what Marc said: a very, very large queueing delay for a long
time.

Going back to what Marc said in an earlier e-mail about keeping
txqueuelen in bytes rather than packets so as to provide a fixed
queueing delay: maintaining txqueuelen in milliseconds would be the
ideal solution, but it is probably hard to achieve in practice. Keeping
txqueuelen in bytes may be a problem for senders that want to send many
small packets: while the byte count may be small, the per-packet
overhead of sending small packets may introduce large delays.

Cheng

^ permalink raw reply	[flat|nested] 38+ messages in thread
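To make the packets-versus-bytes-versus-milliseconds comparison above
concrete, here is a small hypothetical helper (the name is invented and
nothing like it existed in the kernel at the time) that turns a target
queueing delay and a link speed into a byte budget:

    /*
     * bytes = (speed_mbps * 10^6 / 8 bits per byte) * (target_ms / 1000)
     *       = speed_mbps * 125 * target_ms
     *
     * 100 Mb/s at a 12 ms target gives ~150 kB, i.e. about 100
     * full-size Ethernet frames; 1000 Mb/s gives ~1.5 MB, about 1000
     * frames - the two values argued over in this thread.
     */
    static unsigned long tx_budget_bytes(unsigned int speed_mbps,
                                         unsigned int target_ms)
    {
            return (unsigned long)speed_mbps * 125UL * target_ms;
    }

Dividing the budget by the frame size gives Marc's packet counts back;
a pure byte limit, on the other hand, would admit far more small
packets than large ones, which is exactly the per-packet-overhead
concern raised above.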
* RE: TxDescriptors -> 1024 default. Please not for every NIC!
  2004-06-02 19:49 ` Cheng Jin
@ 2004-06-05 14:37   ` jamal
  0 siblings, 0 replies; 38+ messages in thread
From: jamal @ 2004-06-05 14:37 UTC (permalink / raw)
  To: Cheng Jin; +Cc: Marc Herbert, netdev@oss.sgi.com

On Wed, 2004-06-02 at 15:49, Cheng Jin wrote:
> Marc,
>
> In general, I very much agree with what you have stated about not
> having a large txqueuelen. The txqueuelen should be something that
> temporarily alleviates the mismatch between CPU speed and NIC
> transmission speed.

That's the theory. More interesting, of course, are bus speeds,
arbitration schemes, RAM latencies and throughput, and other dynamic
bottlenecks like system load.

> As long as the txqueuelen is greater than zero, say 10 just to be
> safe, the NIC will run at full speed (unless there are inefficiencies
> in scheduling), so there is no incentive to set it to an excessively
> large value like 1000.

In theory as well, the only time you even need to queue is when there's
congestion. In reality, it's a totally different ballgame. In other
words, it is not a simple system that you can throw Little's theorem
at.

Marc, good email; at least you didn't hand-wave and declare that the
wind was blowing towards the south today.

My opinion: I agree that a qlen of 1000 is excessive for 10/100 - in
fact I think the value should dynamically adjust itself even for
GigE-capable NICs (for example, if a GigE NIC negotiates 10Mbps with
its link partner, then you should adjust the qlen)[1]. That won't be
trivial to do - but more importantly, the motivation is lacking,
because I don't think the situation we have right now is devastating.

To clarify: a single TCP flow will fill any pipe you give it under
proper conditions (proper congestion-control algorithms, buffers,
etc.)[2]. Most apps using TCP don't care very much about latency; the
only exception would be some scientific clustering technologies using
TCP for control messaging. And for those kinds of apps, you should be
able to tune the qlen to your liking using the tc or ip utilities (I
claim they shouldn't be using TCP to begin with, but that's another
discussion). If you don't want to take that extra tuning step, then you
don't care, and IMO you shouldn't complain.

Having said all that, I still think there's value in maybe issuing a
warning, or in making the default qlen selection a compile-time config
option.

cheers,
jamal

[1] I think it would make a nice project for someone with time. I can
    consult for anyone interested.
[2] Looking at the recent patches on BIC, it does seem pretty
    aggressive and should have no problem filling a 10GigE pipe given
    proper processing power.

^ permalink raw reply	[flat|nested] 38+ messages in thread
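For what jamal's footnote [1] hints at - resizing the soft queue when a
GigE NIC negotiates a slower speed - a rough sketch might look like the
following. The hook name mydrv_link_change() is invented (it stands for
whatever a driver's link-state handler calls); only dev->tx_queue_len
is a real field, and as jamal notes a complete implementation is not
trivial, since it would also have to deal with packets already queued
and with non-default qdiscs that copied their limit at setup time.

    /* Scale the soft tx queue with the negotiated link speed. */
    static void mydrv_link_change(struct net_device *dev,
                                  unsigned int speed_mbps)
    {
            if (speed_mbps >= 1000)
                    dev->tx_queue_len = 1000;
            else if (speed_mbps >= 100)
                    dev->tx_queue_len = 100;
            else
                    dev->tx_queue_len = 10;
    }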
Thread overview: 38+ messages
2003-09-09 3:14 [e1000 2.6 10/11] TxDescriptors -> 1024 default Feldman, Scott
2003-09-11 19:18 ` Jeff Garzik
2003-09-11 19:45 ` Ben Greear
2003-09-11 19:59 ` Jeff Garzik
2003-09-11 20:12 ` David S. Miller
2003-09-11 20:40 ` Ben Greear
2003-09-11 21:07 ` David S. Miller
2003-09-11 21:29 ` Ben Greear
2003-09-11 21:29 ` David S. Miller
2003-09-11 21:47 ` Ricardo C Gonzalez
2003-09-11 22:00 ` Jeff Garzik
2003-09-11 22:15 ` Ben Greear
2003-09-11 23:02 ` David S. Miller
2003-09-11 23:22 ` Ben Greear
2003-09-11 23:29 ` David S. Miller
2003-09-12 1:34 ` jamal
2003-09-12 2:20 ` Ricardo C Gonzalez
2003-09-12 3:05 ` jamal
2003-09-13 3:49 ` David S. Miller
2003-09-13 11:52 ` Robert Olsson
2003-09-15 12:12 ` jamal
2003-09-15 13:45 ` Robert Olsson
2003-09-15 23:15 ` David S. Miller
2003-09-16 9:28 ` Robert Olsson
2003-09-14 19:08 ` Ricardo C Gonzalez
2003-09-15 2:50 ` David Brownell
2003-09-15 8:17 ` David S. Miller
2004-05-15 12:14 ` TxDescriptors -> 1024 default. Please not for every NIC! Marc Herbert
2004-05-19 9:30 ` Marc Herbert
2004-05-19 10:27 ` Pekka Pietikainen
2004-05-20 14:11 ` Luis R. Rodriguez
2004-05-20 16:38 ` [Prism54-devel] " Jean Tourrilhes
2004-05-20 16:45 ` Tomasz Torcz
2004-05-20 17:13 ` zero copy TX in benchmarks was " Andi Kleen
2004-05-19 11:54 ` Andi Kleen
[not found] <C925F8B43D79CC49ACD0601FB68FF50CDB13D3@orsmsx408>
2004-06-02 19:14 ` Marc Herbert
2004-06-02 19:49 ` Cheng Jin
2004-06-05 14:37 ` jamal