* Re: [E1000-devel] Transmission limit
From: P
Date: 2004-11-26 14:05 UTC
To: mellia
Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

I'm forwarding this to netdev, as these are very interesting
results (even if I don't believe them).

If you point us at the code/versions we will be better able to answer.

Marco Mellia wrote:
> We are trying to stress the e1000 hardware/driver under Linux and Click
> to see what is the maximum number of packets per second that can be
> received/transmitted by a single NIC.
>
> We found something which is counterintuitive:
>
> - in reception, we can receive ALL the traffic, regardless of the
>   packet size (or, if you prefer, we can receive ALL the minimum-sized
>   packets at gigabit speed)

I questioned whether you actually did receive at that rate, to
which you responded:

> - using Click, we can receive 100% of (small) packets at gigabit
>   speed with TWO cards (2 gigabit/s ~ 2.8 Mpps)
> - using Linux and the standard e1000 driver, we can receive up to about
>   80% of traffic from a single NIC (~1.1 Mpps)
> - using Linux and a modified (simplified) version of the driver, we
>   can receive 100% on a single NIC, but not 100% using two NICs (up
>   to ~1.5 Mpps).
>
> Reception means: receiving the packet up to the rx ring at the
> kernel level, and then IMMEDIATELY dropping it (no packet processing,
> no forwarding, nothing more...)
>
> Using NAPI or IRQ has little impact (as we are not processing the
> packets, the livelock due to hardIRQ preemption of the softIRQ
> handlers is not entered...)
>
> But the limit in TRANSMISSION seems to be 700 Kpps, regardless of
> - the traffic generator,
> - the driver version,
> - the O.S. (Linux/Click),
> - the hardware (Broadcom cards have the same limit).
>
> - in transmission we CAN ONLY transmit about 700,000 pkt/s when
>   minimum-sized packets are considered (64-byte-long Ethernet minimum
>   frame size). That is about HALF the maximum number of pkt/s for
>   a gigabit link.
>
> What is weird is that if we artificially "preload" the NIC tx-fifo with
> packets, and then instruct it to start sending them, those are actually
> transmitted AT WIRE SPEED!!
>
> These results have been obtained considering different software
> generators (namely UDPGEN, PACKETGEN, application-level generators)
> under LINUX (2.4.x, 2.6.x) and under CLICK (using a modified version of
> UDPGEN).
>
> The hardware setup consists of
> - 2.8 GHz Xeon hardware
> - PCI-X bus (133 MHz/64 bit)
> - 1 GB of RAM
> - Intel PRO 1000 MT single, double, and quad cards, integrated or in a
>   PCI slot.
>
> Different driver versions have been used, and while there are (small)
> differences when receiving packets, ALL of them present the same
> transmission limits.
>
> Moreover, the same happens with other vendors' cards (Broadcom-based
> chipsets).
>
> Is there any limit on the PCI-X (or PCI) bus that can be the bottleneck?
> Or a limit on the number of packets per second that can be stored in the
> NIC tx-fifo?
> May the length of the tx-fifo impact this?
>
> Any hints will be really appreciated.
> Thanks in advance

cheers,
Pádraig.

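For context on the numbers above: a minimum-size Ethernet frame occupies 84
bytes on the wire (the 64-byte frame including FCS, plus 8 bytes of preamble
and 12 bytes of inter-frame gap), so a 1 Gbit/s link tops out at roughly
1.49 Mpps, and the reported 700 Kpps is indeed about half of wire rate. A
minimal stand-alone sketch of that arithmetic (plain C, illustrative only,
not taken from any driver or tool in the thread):

#include <stdio.h>

/* Theoretical packet rate of gigabit Ethernet for a given frame size.
 * On-wire cost = frame (incl. 4-byte FCS) + 8-byte preamble + 12-byte
 * inter-frame gap. */
static double wire_pps(unsigned int frame_bytes)
{
    const double link_bps = 1e9;

    return link_bps / ((frame_bytes + 8 + 12) * 8.0);
}

int main(void)
{
    printf("  64-byte frames: %.0f pps\n", wire_pps(64));   /* ~1.49 Mpps */
    printf("1518-byte frames: %.0f pps\n", wire_pps(1518)); /* ~81 Kpps   */
    return 0;
}
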
* Re: [E1000-devel] Transmission limit
From: Marco Mellia
Date: 2004-11-26 15:31 UTC
To: P
Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

If you don't trust us, please, ignore this email.
Sorry.

That's the number we have, and it is actually very similar to what other
colleagues of ours got.

The point is:
while a PCI-X Linux (or Click) box can receive (receive just up to
the netif_receive_skb() level and then discard the skb) at up to and even
beyond wire speed using off-the-shelf gigabit Ethernet hardware, there is
no way to transmit more than about half that speed. This is true
for minimum-sized Ethernet frames.

This holds true with
- Linux 2.4.x and 2.6.x and Click-Linux 2.4.x
- Intel e1000 or Broadcom drivers (modified to drop packets after
  netif_receive_skb())
- whichever driver version you like (with minor modifications).

The only modification to the driver we made consists in carefully
prefetching the data into the CPU internal cache.

Some details and results can be retrieved from

http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf

Part of these results is presented in this paper:

A. Bianco, J.M. Finochietto, G. Galante, M. Mellia, F. Neri,
"Open-Source PC-Based Software Routers: a Viable Approach to
High-Performance Packet Switching",
Third International Workshop on QoS in Multiservice IP Networks,
Catania, Feb 2005.
http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf

Hope this helps.

> I'm forwarding this to netdev, as these are very interesting
> results (even if I don't believe them).
>
> If you point us at the code/versions we will be better able to answer.
[...]

--
Ciao,
/\/\/\rco
Marco Mellia - Assistant Professor
Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino - 10129 - Italy
http://www1.tlc.polito.it/mellia

The box said "Requires Windows 95 or Better." So I installed Linux.

* Re: [E1000-devel] Transmission limit
From: jamal
Date: 2004-11-26 19:56 UTC
To: mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.
> Sorry.

Don't take it the wrong way, please - nobody has been able to produce the
results you have. So that's why you may be getting that comment.
The fact that you have been able to do this is a good thing.

> That's the number we have, and it is actually very similar to what
> other colleagues of ours got.
>
> The point is:
> while a PCI-X Linux (or Click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) at up to and
> even beyond wire speed using off-the-shelf gigabit Ethernet hardware,
> there is no way to transmit more than about half that speed. This is
> true for minimum-sized Ethernet frames.

Hrm. I could not get more than 800-900 Kpps on receive-and-drop in the
driver on a super-fast Xeon. Can you post the diff for your driver?
My tests were with e1000.

What kind of hardware is this? Do you have a block diagram of how the NIC
is connected in the system? A lot of issues depend on how your hardware
is hooked up.

> This holds true with
> - Linux 2.4.x and 2.6.x and Click-Linux 2.4.x
> - Intel e1000 or Broadcom drivers (modified to drop packets after
>   netif_receive_skb())
> - whichever driver version you like (with minor modifications).
>
> The only modification to the driver we made consists in carefully
> prefetching the data into the CPU internal cache.

Prefetching as in the use of prefetch()?
What were you prefetching if you end up dropping the packet?

> Some details and results can be retrieved from
>
> http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
>
> Part of these results is presented in this paper:
> A. Bianco, J.M. Finochietto, G. Galante, M. Mellia, F. Neri,
> "Open-Source PC-Based Software Routers: a Viable Approach to
> High-Performance Packet Switching",
> Third International Workshop on QoS in Multiservice IP Networks,
> Catania, Feb 2005.
> http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf
>
> Hope this helps.

Thanks, I will read these papers.

Take a look at the presentation I made at SUCON:
www.suug.ch/sucon/04/slides/pkt_cls.pdf
I have solved the problem identified in the first of the slides (just
before the "why me momma?" slide) - I could describe the solution and even
provide patches which may (perhaps) address some of the transmit issues
you are seeing.

cheers,
jamal

* Re: [E1000-devel] Transmission limit
From: Marco Mellia
Date: 2004-11-29 14:21 UTC
To: hadi
Cc: mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, 2004-11-26 at 20:56, jamal wrote:
> On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> > If you don't trust us, please, ignore this email.
> > Sorry.
>
> Don't take it the wrong way, please - nobody has been able to produce
> the results you have. So that's why you may be getting that comment.
> The fact that you have been able to do this is a good thing.

No problem on this side. I also forgot a couple of 8-! I guess...

[...]

> Prefetching as in the use of prefetch()?
> What were you prefetching if you end up dropping the packet?

Sorry, I used the wrong terms there.
What we discovered is that the CPU caching mechanism has a HUGE impact,
and that you have very little control over it. Prefetching may help, but
it is difficult to predict its impact...

Indeed, if you access the packet struct, the CPU has to fetch data
from main memory, which stores the packet transferred via DMA from
the NIC. The penalty of the memory access is huge, and you have little
control over it.

In our experiments, we modified the kernel to drop packets just after
receiving them. skbs are just deallocated (using standard kernel
routines, i.e., no recycling is used). Logically, that happens when
netif_rx() is called.

Now, we have three cases:
1) just modify netif_rx() to drop packets;
2) as in 1, plus remove the protocol check in the driver
   (i.e., comment out the line
      skb->protocol = eth_type_trans(skb, netdev);
   ) to avoid accessing the real packet data;
3) as in 2, but the dealloc is performed at the driver level, instead of
   calling netif_rx().

In the first case, we can receive about 1.1 Mpps (~80% of packets).

In the second case, we can receive 100% of packets, as we removed the
penalty of looking at the packet headers to discover the protocol type.

In the third case, we can NOT receive 100% of packets!
The only difference is that we actually _REMOVED_ a function call. This
reduces the overhead, and yet the compiler/CPU/whatever cannot optimize
the data path that accesses the skb which must be freed.

Our guess is that freeing the skb in the netif_rx() function
actually allows the compiler/CPU to prefetch the skb itself, and
therefore keep the pipeline working...

My guess is that if you change compiler, CPU, or memory subsystem, you may
get very counterintuitive results...

--
Ciao,
/\/\/\rco

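The three receive-and-drop variants above can be pictured as small changes
to the per-packet rx path of an e1000-style driver. The sketch below is
only an illustration under that reading: eth_type_trans(), netif_rx() and
dev_kfree_skb_irq() are the standard kernel symbols mentioned in the
thread, while the surrounding function and its arguments are hypothetical
stand-ins for the real driver code.

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Per-packet rx handling with the three drop variants marked. */
static void rx_handle_one(struct net_device *netdev, struct sk_buff *skb)
{
    /* Cases 2 and 3: comment this out so the CPU never touches the
     * packet payload (the line that costs ~20% of the packets). */
    skb->protocol = eth_type_trans(skb, netdev);

    /* Cases 1 and 2: hand the skb up; the hacked netif_rx() /
     * netif_receive_skb() immediately frees it instead of queueing it. */
    netif_rx(skb);

    /* Case 3: instead of the call above, free directly in the driver:
     *     dev_kfree_skb_irq(skb);
     */
}
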
* Re: [E1000-devel] Transmission limit
From: jamal
Date: 2004-11-30 13:46 UTC
To: mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, 2004-11-29 at 09:21, Marco Mellia wrote:
> On Fri, 2004-11-26 at 20:56, jamal wrote:
> > Don't take it the wrong way, please - nobody has been able to produce
> > the results you have. So that's why you may be getting that comment.
> > The fact that you have been able to do this is a good thing.
>
> No problem on this side. I also forgot a couple of 8-! I guess...
>
> [...]
>
> > Prefetching as in the use of prefetch()?
> > What were you prefetching if you end up dropping the packet?

I read your paper over the weekend - there's one thing which I don't think
has been written about before for NAPI, and which you covered,
unfortunately with no melodrama ;->
This is the min-max fairness issue. If you actually mix and match
different speeds then it becomes a really interesting problem. For
example, try congesting a 100 Mbps port with 2x1 Gbps. What quotas to use,
etc.? Could this be done cleverly at runtime with dynamic adjustments,
etc.?
Next time you want students to do some work, talk to us - I've got plenty
of things you could try out to keep them busy forever ;->

> Sorry, I used the wrong terms there.
> What we discovered is that the CPU caching mechanism has a HUGE impact,
> and that you have very little control over it. Prefetching may help, but
> it is difficult to predict its impact...

Prefetching is hard. The only evidence I have seen of what actually
"appears" to be working prefetching is some code from David Morsberger at
HP. Other architectures are known to be more friendly - my experiences
with MIPS are far more pleasant.
BTW, that's another topic to get those students to investigate ;->

> Indeed, if you access the packet struct, the CPU has to fetch data
> from main memory, which stores the packet transferred via DMA from
> the NIC. The penalty of the memory access is huge, and you have little
> control over it.
>
> In our experiments, we modified the kernel to drop packets just after
> receiving them. skbs are just deallocated (using standard kernel
> routines, i.e., no recycling is used). Logically, that happens when
> netif_rx() is called.
>
> Now, we have three cases:
> 1) just modify netif_rx() to drop packets;
> 2) as in 1, plus remove the protocol check in the driver
>    (i.e., comment out the line
>       skb->protocol = eth_type_trans(skb, netdev);
>    ) to avoid accessing the real packet data;
> 3) as in 2, but the dealloc is performed at the driver level, instead of
>    calling netif_rx().
>
> In the first case, we can receive about 1.1 Mpps (~80% of packets).

Possible. I was able to receive 900 Kpps or so in my experiments with gact
drop, which is slightly above this, with a 2.4 GHz machine with IRQ
affinity.

> In the second case, we can receive 100% of packets, as we removed the
> penalty of looking at the packet headers to discover the protocol type.

This is the one people found hard to believe. I will go and retest this.
It is possible.

> In the third case, we can NOT receive 100% of packets!
> The only difference is that we actually _REMOVED_ a function call. This
> reduces the overhead, and yet the compiler/CPU/whatever cannot optimize
> the data path that accesses the skb which must be freed.

It doesn't seem like you were running NAPI if you depended on calling
netif_rx().
In that case, #3 would be freeing in hard-IRQ context while #2 is softIRQ.

> Our guess is that freeing the skb in the netif_rx() function
> actually allows the compiler/CPU to prefetch the skb itself, and
> therefore keep the pipeline working...
>
> My guess is that if you change compiler, CPU, or memory subsystem, you
> may get very counterintuitive results...

Refer to my comment above.
Repeat the tests with NAPI and see if you get the same results.

cheers,
jamal

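On the freeing-context point above, the choice of free helper depends on
where the drop happens. A minimal illustration (dev_kfree_skb_irq() and
dev_kfree_skb() are standard kernel helpers; the two wrapper functions are
hypothetical):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Dropping from the hard-IRQ rx handler (the non-NAPI reading of case 3):
 * dev_kfree_skb_irq() defers the actual free to softirq context. */
static void drop_in_hardirq(struct sk_buff *skb)
{
    dev_kfree_skb_irq(skb);
}

/* Dropping from a NAPI poll routine, which already runs in softirq
 * context, so a plain free is fine. */
static void drop_in_napi_poll(struct sk_buff *skb)
{
    dev_kfree_skb(skb);
}
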
* Re: [E1000-devel] Transmission limit
From: Marco Mellia
Date: 2004-12-02 17:24 UTC
To: hadi
Cc: mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

> > In our experiments, we modified the kernel to drop packets just after
> > receiving them. skbs are just deallocated (using standard kernel
> > routines, i.e., no recycling is used). Logically, that happens when
> > netif_rx() is called.
> >
> > Now, we have three cases:
> > 1) just modify netif_rx() to drop packets;
> > 2) as in 1, plus remove the protocol check in the driver
> >    (i.e., comment out the line
> >       skb->protocol = eth_type_trans(skb, netdev);
> >    ) to avoid accessing the real packet data;
> > 3) as in 2, but the dealloc is performed at the driver level, instead
> >    of calling netif_rx().
> >
> > In the first case, we can receive about 1.1 Mpps (~80% of packets).
>
> Possible. I was able to receive 900 Kpps or so in my experiments with
> gact drop, which is slightly above this, with a 2.4 GHz machine with IRQ
> affinity.

I double-checked with the people who actually did the work.
They indeed tested both cases, i.e., dropping packets either using IRQ
(therefore using netif_rx()) or using NAPI (therefore using
netif_receive_skb()). In both cases, with the eth_type_trans() check
disabled, we receive 100% of packets...

> > In the third case, we can NOT receive 100% of packets!
> > The only difference is that we actually _REMOVED_ a function call.
> > This reduces the overhead, and yet the compiler/CPU/whatever cannot
> > optimize the data path that accesses the skb which must be freed.
>
> It doesn't seem like you were running NAPI if you depended on calling
> netif_rx().
> In that case, #3 would be freeing in hard-IRQ context while #2 is
> softIRQ.

Again, that was my mistake. Case #3 was performed using the NAPI stack,
i.e., freeing the skb instead of calling netif_receive_skb(). Doing that,
we observed a performance drop, which we attribute to some caching issues.
Indeed, investigating with OProfile, case #3 registers about twice the
number of cache misses of case #2.
Again, we do not have any clear explanation, but our intuition is that
adding a function call with a pointer as argument might allow the
compiler/CPU to prefetch the skb and speed up the memory release...

> > Our guess is that freeing the skb in the netif_rx() function
> > actually allows the compiler/CPU to prefetch the skb itself, and
> > therefore keep the pipeline working...
> >
> > My guess is that if you change compiler, CPU, or memory subsystem, you
> > may get very counterintuitive results...
>
> Refer to my comment above.
> Repeat the tests with NAPI and see if you get the same results.

We were using NAPI. Sorry for the misunderstanding.
Hope this helps.

--
Ciao,
/\/\/\rco

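One way to probe the "the extra call gives the CPU a head start on the
skb" intuition is to prefetch the next buffer explicitly while the current
one is being freed, and see whether the OProfile cache-miss gap between
case #2 and case #3 closes. A hedged sketch (prefetch() and
dev_kfree_skb_irq() are standard kernel APIs; the ring array and count are
hypothetical stand-ins for the driver's rx ring state):

#include <linux/netdevice.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

/* Free a batch of received skbs, prefetching the next one so that its
 * cache lines are (hopefully) warm by the time it is freed. */
static void drop_ring(struct sk_buff **ring, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        if (i + 1 < count)
            prefetch(ring[i + 1]);
        dev_kfree_skb_irq(ring[i]);
    }
}
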
* Re: [E1000-devel] Transmission limit
From: jamal
Date: 2004-11-26 20:06 UTC
To: mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.

BTW, you have to be telling the truth, especially since you have
S. Giordano on your team ;->
We just need to figure out what you are saying. Off to read your paper.

cheers,
jamal

* Re: [E1000-devel] Transmission limit
From: Lennert Buytenhek
Date: 2004-11-26 20:56 UTC
To: Marco Mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> The point is:
> while a PCI-X Linux (or Click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) at up to and
> even beyond wire speed using off-the-shelf gigabit Ethernet hardware,
> there is no way to transmit more than about half that speed. This is
> true for minimum-sized Ethernet frames.

That's more or less what I'm seeing.

Theoretically, the maximum #pps you can send on gigabit is

  p = 125000000 / (s + 24)

where s is the packet size, and the constant 24 consists of the 8B
preamble, 4B FCS and 12B inter-frame gap.

On an e1000 in a 32b 66MHz PCI slot (Intel server mainboard, e1000
'desktop' NIC) I'm seeing that exact curve for packet sizes > ~350 bytes,
but for smaller packets than that, the curve goes like
p = 264000000 / (s + 335) (which is accurate to +/- 100 pps). The 2.64e8
component is exactly the theoretical max. bandwidth of the PCI slot the
card is in, the 335 a fitted constant that accounts for latency. On a
different mobo I get a curve following the same formula but a different
value instead of 335.

The same card in a 32b 33MHz PCI slot in a cheap Asus desktop board gives
something a bit stranger:
- p = 132000000 / (s + 260) for s < 128
- p = 132000000 / (s + 390) for 128 <= s < 256
- p = 132000000 / (s + 520) for 256 <= s < 384
- ...

Again, the 132000000 corresponds to the theoretical max. bandwidth of the
32/33 bus.

I'm not all that sure yet why things show this behavior.

--L

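The curves above are easy to tabulate side by side. The snippet below just
evaluates the formulas as quoted (the 24 is the fixed per-packet overhead
on the wire; the 335 and 260 are the empirically fitted latency constants
from the measurements, not derived values):

#include <stdio.h>

/* Wire-limited rate: 125e6 bytes/s of line rate, 24 bytes of preamble,
 * FCS and inter-frame gap per packet of size s. */
static double wire(double s)      { return 125000000.0 / (s + 24.0); }

/* Fitted PCI-limited curves quoted above. */
static double pci_66mhz(double s) { return 264000000.0 / (s + 335.0); } /* 32b/66MHz slot */
static double pci_33mhz(double s) { return 132000000.0 / (s + 260.0); } /* 32b/33MHz; fit valid for s < 128 */

int main(void)
{
    double s;

    for (s = 64.0; s <= 512.0; s += 64.0)
        printf("%4.0fB  wire %8.0f  pci66 %8.0f  pci33 %8.0f pps\n",
               s, wire(s), pci_66mhz(s), pci_33mhz(s));
    return 0;
}
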
* Re: [E1000-devel] Transmission limit
From: Lennert Buytenhek
Date: 2004-11-26 21:02 UTC
To: Marco Mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, Nov 26, 2004 at 09:56:59PM +0100, Lennert Buytenhek wrote:
> On an e1000 in a 32b 66MHz PCI slot (Intel server mainboard, e1000
> 'desktop' NIC) I'm seeing that exact curve for packet sizes > ~350
> bytes, but for smaller packets than that, the curve goes like
> p = 264000000 / (s + 335) (which is accurate to +/- 100 pps). The 2.64e8
> component is exactly the theoretical max. bandwidth of the PCI slot the
> card is in, the 335 a fitted constant that accounts for latency. On a
> different mobo I get a curve following the same formula but a different
> value instead of 335.
>
> The same card in a 32b 33MHz PCI slot in a cheap Asus desktop board
> gives something a bit stranger:
> - p = 132000000 / (s + 260) for s < 128
> - p = 132000000 / (s + 390) for 128 <= s < 256
> - p = 132000000 / (s + 520) for 256 <= s < 384
> - ...

This could be explained by observing that on the Intel mobo, the NIC sits
on a dedicated PCI bus, while on the cheap Asus board, all PCI slots plus
all onboard devices share the same PCI bus. Probably after pulling in a
single burst of packet data (32 clocks here, which sounds about right),
the NIC has to relinquish the bus to other bus masters and wait for 128
byte times until it gets to pull packet data from RAM again.

It would be interesting to find out where the latency is coming from.
Find a way to reduce/work around that and the 64b packet case will benefit
as well.

--L

* Re: [E1000-devel] Transmission limit
From: Harald Welte
Date: 2004-11-27 9:25 UTC
To: Marco Mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
[...]
> The point is:
> while a PCI-X Linux (or Click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) at up to and
> even beyond wire speed using off-the-shelf gigabit Ethernet hardware,
> there is no way to transmit more than about half that speed. This is
> true for minimum-sized Ethernet frames.

Yes, I've seen this, too.

I even rewrote the linux e1000 driver in order to re-fill the tx queue
from the hardirq handler, and it didn't help. 760 kpps is the most I could
ever get (133MHz 64bit PCI-X on a Sun Fire v20z, dual Opteron 1.8GHz).

I've posted this result to netdev at some earlier point; I also Cc'ed
Intel but never got a reply
(http://oss.sgi.com/archives/netdev/2004-09/msg00540.html).

My guess is that Intel always knew this and they want to sell their CSA
chips rather than improving the PCI e1000.

We are hitting a hard limit here, either PCI-X-wise or e1000-wise. You
cannot refill the tx queue faster than from hardirq, and still you don't
get any better numbers.

It was suggested that the problem is PCI DMA arbitration latency, since
the hardware needs to arbitrate the bus for every packet.

Interestingly, if you use a four-port e1000, the numbers get even worse
(580 kpps) because the additional PCI-X bridge on the card introduces
further latency.

--
- Harald Welte <laforge@gnumonks.org>             http://www.gnumonks.org/
Programming is like sex: One mistake and you have to support it your lifetime

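"Re-filling the tx queue from the hardirq handler" here means moving the
descriptor refill into the TX-completion interrupt path, roughly as in the
outline below. This is a schematic sketch, not the actual modification
described above; the queue structure and post_tx_descriptor() are
hypothetical stand-ins for driver internals.

#include <linux/skbuff.h>

#define PENDING_SLOTS 256

/* Hypothetical per-adapter TX state; real drivers keep equivalents of
 * these fields in their private structs. */
struct txq {
    struct sk_buff *pending[PENDING_SLOTS]; /* frames waiting for a descriptor */
    unsigned int head, tail;
    unsigned int free_descs;                /* descriptors reclaimed so far */
};

/* Stand-in for writing one frame into the hardware descriptor ring. */
void post_tx_descriptor(struct txq *q, struct sk_buff *skb);

/* Called from the TX-completion part of the interrupt handler, right
 * after reclaiming descriptors: hand the freed descriptors to waiting
 * frames immediately, instead of waiting for the softirq/qdisc path. */
static void refill_tx_from_irq(struct txq *q)
{
    while (q->free_descs > 0 && q->head != q->tail) {
        post_tx_descriptor(q, q->pending[q->head]);
        q->head = (q->head + 1) % PENDING_SLOTS;
        q->free_descs--;
    }
}
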
* Re: [E1000-devel] Transmission limit
From: Harald Welte
Date: 2004-11-27 11:31 UTC
To: Lennert Buytenhek
Cc: Linux Netdev List

On Sat, Nov 27, 2004 at 12:11:01PM +0100, Lennert Buytenhek wrote:
> On Sat, Nov 27, 2004 at 10:25:03AM +0100, Harald Welte wrote:
>
> > I even rewrote the linux e1000 driver [...]
>
> This is very interesting. You have chipset docs then?

Once again, please excuse my bad English. I seem to have translated
'umgeschrieben' into 'rewrote', which is absolutely not applicable here.
Please do s/rewrote/modified/, i.e. I modified/altered/changed the driver.

And no, I don't have any docs.

--
- Harald Welte <laforge@gnumonks.org>             http://www.gnumonks.org/

* Re: [E1000-devel] Transmission limit
From: Cesar Marcondes
Date: 2004-11-27 20:12 UTC
To: Harald Welte
Cc: Marco Mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

STOP !!!!

On Sat, 27 Nov 2004, Harald Welte wrote:
[...]

* Re: [E1000-devel] Transmission limit
From: Marco Mellia
Date: 2004-11-29 8:53 UTC
To: Harald Welte
Cc: Marco Mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Sat, 2004-11-27 at 10:25, Harald Welte wrote:
[...]
> It was suggested that the problem is PCI DMA arbitration latency, since
> the hardware needs to arbitrate the bus for every packet.

That's our intuition too.
Notice that we get the same results with 3Com (Broadcom-based) gigabit
cards.
We are thinking of sending packets in "bursts" instead of single
transfers. The only problem is letting the NIC know that there is more
than one packet in a burst...

--
Ciao,
/\/\/\rco

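One common way to approximate "bursts" on e1000-class hardware, which has
no explicit multi-packet notification, is to batch the doorbell: queue
several descriptors and advance the TX tail register once per batch, so
the NIC can fetch them with fewer bus arbitrations. A hedged sketch (the
ring layout, fill_tx_desc() and the tail register pointer are stand-ins,
not the real e1000 structures; header locations for writel()/wmb() vary by
kernel version):

#include <linux/skbuff.h>
#include <asm/io.h>
#include <asm/system.h>

struct tx_ring {
    unsigned int next_to_use;
    unsigned int count;
    void __iomem *tail_reg;   /* stand-in for the NIC's TX tail register */
};

/* Stand-in for building one hardware TX descriptor. */
void fill_tx_desc(struct tx_ring *ring, unsigned int idx, struct sk_buff *skb);

/* Queue a whole batch of frames, then notify the NIC once, instead of
 * issuing one tail write per frame. */
static void xmit_burst(struct tx_ring *ring, struct sk_buff **skbs, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        fill_tx_desc(ring, ring->next_to_use, skbs[i]);
        ring->next_to_use = (ring->next_to_use + 1) % ring->count;
    }
    wmb();                      /* descriptors visible before the doorbell */
    writel(ring->next_to_use, ring->tail_reg);
}
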
* Re: [E1000-devel] Transmission limit
From: Lennert Buytenhek
Date: 2004-11-29 14:50 UTC
To: Marco Mellia
Cc: Harald Welte, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:
> That's our intuition too.
> Notice that we get the same results with 3Com (Broadcom-based) gigabit
> cards.
> We are thinking of sending packets in "bursts" instead of single
> transfers. The only problem is letting the NIC know that there is more
> than one packet in a burst...

Jamal implemented exactly this for e1000 already; he might be persuaded
into posting his patch here. Jamal? :)

--L

* Re: [E1000-devel] Transmission limit
From: Marco Mellia
Date: 2004-11-30 8:42 UTC
To: Lennert Buytenhek
Cc: Marco Mellia, Harald Welte, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, 2004-11-29 at 15:50, Lennert Buytenhek wrote:
> On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:
>
> > That's our intuition too.
> > Notice that we get the same results with 3Com (Broadcom-based) gigabit
> > cards.
> > We are thinking of sending packets in "bursts" instead of single
> > transfers. The only problem is letting the NIC know that there is
> > more than one packet in a burst...
>
> Jamal implemented exactly this for e1000 already; he might be persuaded
> into posting his patch here. Jamal? :)

I guess that saying that we are _very_ interested in this might help.
:-)
We can offer ourselves as "beta-testers" as well...

--
Ciao,
/\/\/\rco

* Re: [E1000-devel] Transmission limit
From: jamal
Date: 2004-12-01 12:25 UTC
To: mellia
Cc: Lennert Buytenhek, Harald Welte, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Tue, 2004-11-30 at 03:42, Marco Mellia wrote:
> On Mon, 2004-11-29 at 15:50, Lennert Buytenhek wrote:
> > > We are thinking of sending packets in "bursts" instead of single
> > > transfers. The only problem is letting the NIC know that there is
> > > more than one packet in a burst...
> >
> > Jamal implemented exactly this for e1000 already; he might be
> > persuaded into posting his patch here. Jamal? :)
>
> I guess that saying that we are _very_ interested in this might help.
> :-)
> We can offer ourselves as "beta-testers" as well...

Sorry, I missed this (I wasn't CCed, so it went to a low-priority queue
which I read on a best-effort basis).
Let me clean up the patches a little bit this weekend. The patch is at
least 4 months old; its latest reincarnation was due to issue 1 in my
SUCON presentation. Would a patch against the latest 2.6.x BitKeeper
(whatever it is this weekend) be fine? If you are in a rush and don't mind
a little ugliness then I will pass them on as is.

BTW, Scott posted an interesting patch yesterday; you may wanna give that
a shot as well.

cheers,
jamal

* Re: [E1000-devel] Transmission limit
From: Marco Mellia
Date: 2004-12-02 13:39 UTC
To: hadi
Cc: mellia, Lennert Buytenhek, Harald Welte, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

> > > > We are thinking of sending packets in "bursts" instead of single
> > > > transfers. The only problem is letting the NIC know that there is
> > > > more than one packet in a burst...
> > >
> > > Jamal implemented exactly this for e1000 already; he might be
> > > persuaded into posting his patch here. Jamal? :)
> >
> > I guess that saying that we are _very_ interested in this might help.
> > :-)
> > We can offer ourselves as "beta-testers" as well...
>
> Sorry, I missed this (I wasn't CCed, so it went to a low-priority queue
> which I read on a best-effort basis).
> Let me clean up the patches a little bit this weekend. The patch is at
> least 4 months old; its latest reincarnation was due to issue 1 in my
> SUCON presentation. Would a patch against the latest 2.6.x BitKeeper
> (whatever it is this weekend) be fine? If you are in a rush and don't
> mind a little ugliness then I will pass them on as is.

We'll be glad to spend some time trying this out. Please note that we are
not very comfortable with the Linux BitKeeper maintenance method. Can we
ask you to provide us with a patch against a standard kernel/driver
(whatever you prefer...)? A complete source sub-tree would also be OK ;-)

> BTW, Scott posted an interesting patch yesterday; you may wanna give
> that a shot as well.

We're trying that out right now... (which means that in a couple of days,
we'll try it ;-))

Thanks a lot.

--
Ciao,
/\/\/\rco

* Re: [E1000-devel] Transmission limit
From: jamal
Date: 2004-12-03 13:07 UTC
To: mellia
Cc: Lennert Buytenhek, Harald Welte, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Thu, 2004-12-02 at 08:39, Marco Mellia wrote:
> We'll be glad to spend some time trying this out. Please note that we
> are not very comfortable with the Linux BitKeeper maintenance method.
> Can we ask you to provide us with a patch against a standard
> kernel/driver (whatever you prefer...)? A complete source sub-tree
> would also be OK ;-)

Would an -rcX patch be fine for you? 2.6.10-rc2, which means you will take
2.6.9, patch it with patch-2.6.10-rc2.gz from the kernel.org/v2.6/testing
directory, then patch one more time with the patch I give you.
Let me know if you are uncomfortable with that as well.
[Sorry, I am disk-poor and my stupid ISP still charges $1/MB/month, even
in this age, if I put it up at cyberus.]

In the patch I give you I will include rx path improvement code that I got
from David Morsberger; I "think" I have seen some improvements with it but
I am not 100% sure.
If you repeat the test where you drop the packet right after
eth_type_trans() with this patch on, I would be very interested to hear
whether you see any improvements.

In any case, expect something from me this weekend or Monday (big party
this weekend ;->).

cheers,
jamal

* Re: [E1000-devel] Transmission limit
From: Robert Olsson
Date: 2004-11-26 15:40 UTC
To: P
Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

P@draigBrady.com writes:

> I'm forwarding this to netdev, as these are very interesting
> results (even if I don't believe them).

> I questioned whether you actually did receive at that rate, to
> which you responded:
>
> > - using Click, we can receive 100% of (small) packets at gigabit
> >   speed with TWO cards (2 gigabit/s ~ 2.8 Mpps)
> > - using Linux and the standard e1000 driver, we can receive up to
> >   about 80% of traffic from a single NIC (~1.1 Mpps)
> > - using Linux and a modified (simplified) version of the driver, we
> >   can receive 100% on a single NIC, but not 100% using two NICs (up
> >   to ~1.5 Mpps).
> >
> > Reception means: receiving the packet up to the rx ring at the
> > kernel level, and then IMMEDIATELY dropping it (no packet processing,
> > no forwarding, nothing more...)

In more detail, please... The RX ring must be refilled? And the HW DMAs
to the memory buffer? But I assume the data is not touched otherwise.

Touching the packet data gives a major impact. See eth_type_trans
in all profiles.

So what forwarding numbers are seen?

> > But the limit in TRANSMISSION seems to be 700 Kpps, regardless of
> > - the traffic generator,
> > - the driver version,
> > - the O.S. (Linux/Click),
> > - the hardware (Broadcom cards have the same limit).
> >
> > - in transmission we CAN ONLY transmit about 700,000 pkt/s when
> >   minimum-sized packets are considered (64-byte-long Ethernet minimum
> >   frame size). That is about HALF the maximum number of pkt/s for
> >   a gigabit link.
> >
> > What is weird is that if we artificially "preload" the NIC tx-fifo
> > with packets, and then instruct it to start sending them, those are
> > actually transmitted AT WIRE SPEED!!

OK. Good to know about the e1000. Networking is mostly DMAs, and the CPU
is used administrating them; this is the challenge.

> > These results have been obtained considering different software
> > generators (namely UDPGEN, PACKETGEN, application-level generators)
> > under LINUX (2.4.x, 2.6.x) and under CLICK (using a modified version
> > of UDPGEN).

We get a hundred kpps more... Turn off all mitigation so interrupts are
not delayed and the TX ring can be filled as quickly as possible.

You could even try to fill TX as soon as the HW says there are available
buffers. This could even be done from the TX interrupt.

> > The hardware setup consists of
> > - 2.8 GHz Xeon hardware
> > - PCI-X bus (133 MHz/64 bit)
> > - 1 GB of RAM
> > - Intel PRO 1000 MT single, double, and quad cards, integrated or in
> >   a PCI slot.

> > Is there any limit on the PCI-X (or PCI) bus that can be the
> > bottleneck?
> > Or a limit on the number of packets per second that can be stored in
> > the NIC tx-fifo?
> > May the length of the tx-fifo impact this?

Small-packet performance is dependent on low latency. Higher bus speed
gives shorter latency, but on higher-speed buses there tend to be bridges
that add latency.

For packet generation we still use 866 MHz PIIIs and 82543GC NICs on
ServerWorks 64-bit boards, which are faster than most other systems. So
for testing routing performance in pps we have to use several flows. This
has the advantage of testing SMP/NUMA as well.

--ro

* Re: [E1000-devel] Transmission limit
From: Marco Mellia
Date: 2004-11-26 15:59 UTC
To: Robert Olsson
Cc: P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

Robert,
It's a pleasure to hear from you.

> > I questioned whether you actually did receive at that rate, to
> > which you responded:
> >
> > > - using Click, we can receive 100% of (small) packets at gigabit
> > >   speed with TWO cards (2 gigabit/s ~ 2.8 Mpps)
> > > - using Linux and the standard e1000 driver, we can receive up to
> > >   about 80% of traffic from a single NIC (~1.1 Mpps)
> > > - using Linux and a modified (simplified) version of the driver, we
> > >   can receive 100% on a single NIC, but not 100% using two NICs (up
> > >   to ~1.5 Mpps).
> > >
> > > Reception means: receiving the packet up to the rx ring at the
> > > kernel level, and then IMMEDIATELY dropping it (no packet
> > > processing, no forwarding, nothing more...)
>
> In more detail, please... The RX ring must be refilled? And the HW DMAs
> to the memory buffer? But I assume the data is not touched otherwise.
>
> Touching the packet data gives a major impact. See eth_type_trans
> in all profiles.

That's exactly what we removed from the driver code: touching the packet
limits the reception rate to about 1.1 Mpps, while skipping the
eth_type_trans() check actually allows us to receive 100% of packets.

skbs are de/allocated using standard kernel memory management. Still,
without touching the packet, we can receive 100% of them.

> So what forwarding numbers are seen?

Forwarding is another issue. It seems to us that the bottleneck is in the
transmission of packets. Indeed, considering reception and transmission
_separately_:
- all packets can be received
- no more than ~700 kpps can be transmitted

When IP forwarding is considered, once more we hit the transmission limit
(using NAPI and your buffer recycling patch, as mentioned in the paper and
in the slides... if no buffer recycling is adopted, performance drops a
bit).

So it seems to us that the major bottleneck is the transmission limit.
Again, you can get numbers and more details from
http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf

> > > But the limit in TRANSMISSION seems to be 700 Kpps, regardless of
> > > - the traffic generator,
> > > - the driver version,
> > > - the O.S. (Linux/Click),
> > > - the hardware (Broadcom cards have the same limit).
> > >
> > > - in transmission we CAN ONLY transmit about 700,000 pkt/s when
> > >   minimum-sized packets are considered (64-byte-long Ethernet
> > >   minimum frame size). That is about HALF the maximum number of
> > >   pkt/s for a gigabit link.
> > >
> > > What is weird is that if we artificially "preload" the NIC tx-fifo
> > > with packets, and then instruct it to start sending them, those are
> > > actually transmitted AT WIRE SPEED!!
>
> OK. Good to know about the e1000. Networking is mostly DMAs, and the
> CPU is used administrating them; this is the challenge.

That's true. There is still the chance that the limit is due to hardware
CRC calculation (which must be added to the Ethernet frame by the NIC...).
But we're quite confident that that is not the limit, since the same
operation must be performed in the reception path...

> We get a hundred kpps more... Turn off all mitigation so interrupts are
> not delayed and the TX ring can be filled as quickly as possible.
>
> You could even try to fill TX as soon as the HW says there are
> available buffers. This could even be done from the TX interrupt.

Are you suggesting we modify packetgen to be more aggressive?

[...]

> Small-packet performance is dependent on low latency. Higher bus speed
> gives shorter latency, but on higher-speed buses there tend to be
> bridges that add latency.

That's true. We suspect that the limit is due to bus latency. But still,
we are surprised, since the bus allows us to receive 100% but to transmit
only up to ~50%. Moreover, the raw aggregate bandwidth of the bus is _far_
larger (133 MHz * 64 bit ~ 8 Gbit/s).

> For packet generation we still use 866 MHz PIIIs and 82543GC NICs on
> ServerWorks 64-bit boards, which are faster than most other systems. So
> for testing routing performance in pps we have to use several flows.
> This has the advantage of testing SMP/NUMA as well.

We use a hardware generator (Agilent router tester)... which can saturate
a gigabit link with no problem (and costs much more than a PC...). So our
forwarding tests are not limited by the generator...

--
Ciao,
/\/\/\rco

* Re: [E1000-devel] Transmission limit
From: P
Date: 2004-11-26 16:57 UTC
To: mellia
Cc: Robert Olsson, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

I forgot a smiley on my previous post about not believing you.
So here are two: :-) :-)

Comments below:

Marco Mellia wrote:
> Robert,
> It's a pleasure to hear from you.
>
> > Touching the packet data gives a major impact. See eth_type_trans
> > in all profiles.

Notice the e1000 sets up the alignment for IP by default.

> skbs are de/allocated using standard kernel memory management. Still,
> without touching the packet, we can receive 100% of them.

I was doing some playing in this area this week.
I changed the alloc per packet to a "realloc" per packet,
i.e. the e1000 driver owns the packets. I noticed a
very nice speedup from this. In summary, a userspace
app was able to receive 2x250 Kpps without this patch,
and 2x490 Kpps with it. The patch is here:
http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff
Note 99% of that patch is just upgrading from
e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
the performance).

Wow, I just read your excellent paper, and noticed
you used this approach also :-)

> > Small-packet performance is dependent on low latency. Higher bus
> > speed gives shorter latency, but on higher-speed buses there tend to
> > be bridges that add latency.
>
> That's true. We suspect that the limit is due to bus latency. But
> still, we are surprised, since the bus allows us to receive 100% but to
> transmit only up to ~50%. Moreover, the raw aggregate bandwidth of the
> bus is _far_ larger (133 MHz * 64 bit ~ 8 Gbit/s).

Well, there definitely could be an asymmetry wrt bus latency.
Saying that though, in my tests with much the same hardware
as you, I could only get 800 Kpps into the driver. I'll
check this again when I have time. Note also that, as I understand
it, the PCI control bus is running at a much lower rate,
and that is used to arbitrate the bus for each packet.
I.e. the 8 Gb/s number above is not the bottleneck.

An lspci -vvv for your ethernet devices would be useful.
Also, to view the burst size: setpci -d 8086:1010 e6.b
(where 8086:1010 is the ethernet device PCI ID).

cheers,
Pádraig.

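The "realloc per packet / the driver owns the packets" idea amounts to
keeping a fixed pool of rx buffers and resetting each one for reuse
instead of freeing it and allocating a fresh skb. A rough sketch under
that reading, with all names hypothetical and the field resets written for
the 2.4/2.6 sk_buff layout of that era (this is not the patch linked
above):

#include <linux/skbuff.h>

struct rx_slot {
    struct sk_buff *skb;        /* buffer permanently owned by the driver */
    unsigned char  *orig_data;  /* skb->data right after initial setup    */
};

/* Once the consumer is done with the frame, make the skb look freshly
 * allocated again; the underlying memory (and DMA mapping, if kept) is
 * reused rather than freed. */
static void recycle_rx_skb(struct rx_slot *slot)
{
    struct sk_buff *skb = slot->skb;

    skb->data = slot->orig_data;
    skb->tail = skb->data;      /* tail is a plain pointer in kernels of that era */
    skb->len  = 0;
}
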
* Re: [E1000-devel] Transmission limit
From: jamal
Date: 2004-11-26 20:01 UTC
To: P
Cc: mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, 2004-11-26 at 11:57, P@draigBrady.com wrote:
> > skbs are de/allocated using standard kernel memory management. Still,
> > without touching the packet, we can receive 100% of them.
>
> I was doing some playing in this area this week.
> I changed the alloc per packet to a "realloc" per packet,
> i.e. the e1000 driver owns the packets. I noticed a
> very nice speedup from this. In summary, a userspace
> app was able to receive 2x250 Kpps without this patch,
> and 2x490 Kpps with it. The patch is here:
> http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff

A very angry gorilla at that URL ;->

> Note 99% of that patch is just upgrading from
> e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
> the performance).
>
> Wow, I just read your excellent paper, and noticed
> you used this approach also :-)

Have to read the paper - when Robert was last visiting here, we did some
tests, and packet recycling is not very valuable as far as SMP is
concerned (given that packets can be alloced on one CPU and freed on
another). There's a clear win on single-CPU machines.

> Well, there definitely could be an asymmetry wrt bus latency.
> Saying that though, in my tests with much the same hardware
> as you, I could only get 800 Kpps into the driver.

Yep, that's about the number I was seeing as well on both pieces of
hardware I used in the tests in my SUCON presentation.

> I'll check this again when I have time. Note also that, as I understand
> it, the PCI control bus is running at a much lower rate,
> and that is used to arbitrate the bus for each packet.
> I.e. the 8 Gb/s number above is not the bottleneck.
>
> An lspci -vvv for your ethernet devices would be useful.
> Also, to view the burst size: setpci -d 8086:1010 e6.b
> (where 8086:1010 is the ethernet device PCI ID).

Can you talk a little about this PCI control bus? I have heard you mention
it before... I am trying to visualize where it fits in the PCI system.

cheers,
jamal

* Re: [E1000-devel] Transmission limit 2004-11-26 20:01 ` jamal @ 2004-11-29 10:19 ` P 2004-11-29 13:09 ` Robert Olsson 1 sibling, 0 replies; 85+ messages in thread From: P @ 2004-11-29 10:19 UTC (permalink / raw) To: hadi Cc: mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev jamal wrote: > On Fri, 2004-11-26 at 11:57, P@draigBrady.com wrote: > > >>>skb are de/allocated using standard kernel memory management. Still, >>>without touching the packet, we can receive 100% of them. >> >>I was doing some playing in this area this week. >>I changed the alloc per packet to a "realloc" per packet. >>I.E. the e1000 driver owns the packets. I noticed a >>very nice speedup from this. In summary a userspace >>app was able to receive 2x250Kpps without this patch, >>and 2x490Kpps with it. The patch is here: >>http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff > > > A very angry gorilla on that url ;-> feck. Add a .gz http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff.gz >>Note 99% of that patch is just upgrading from >>e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect >>the performance). >> >>Wow I just read you're excellent paper, and noticed >>you used this approach also :-) >> > > > Have to read the paper - When Robert was last visiting here; we did some > tests and packet recycling is not very valuable as far as SMP is > concerned (given that packets can be alloced on one CPU and freed on > another). There a clear win on single CPU machines. Well for my app, I am just monitoring, so I use IRQ and process affinity. You could split the skb heads across CPUs also I guess. >>>>Small packet performance is dependent on low latency. Higher bus speed >>>>gives shorter latency but also on higher speed buses there use to be >>>>bridges that adds latency. >>> >>>That's true. We suspect that the limit is due to bus latency. But still, >>>we are surprised, since the bus allows to receive 100%, but to transmit >>>up to ~50%. Moreover the raw aggerate bandwidth of the buffer is _far_ >>>larger (133MHz*64bit ~ 8gbit/s >> >>Well there definitely could be an asymmetry wrt bus latency. >>Saying that though, in my tests with much the same hardware >>as you, I could only get 800Kpps into the driver. > > > Yep, thats about the number i was seeing as well in both pieces of > hardware i used in the tests in my SUCON presentation. > > >> I'll >>check this again when I have time. Note also that as I understand >>it the PCI control bus is running at a much lower rate, >>and that is used to arbitrate the bus for each packet. >>I.E. the 8Gb/s number above is not the bottleneck. >> >>An lspci -vvv for your ethernet devices would be useful >>Also to view the burst size: setpci -d 8086:1010 e6.b >>(where 8086:1010 is the ethernet device PCI id). >> > > Can you talk a little about this PCI control bus? I have heard you > mention it before ... I am trying to visualize where it fits in PCI > system. Basically the bus is arbitrated per packet. See secion 3.5 in: http://www.intel.com/design/network/applnots/ap453.pdf This also has lots of nice PCI info: http://www.hep.man.ac.uk/u/rich/PFLDnet2004/Rich_PFLDNet_10GE_v7.ppt -- Pádraig Brady - http://www.pixelbeat.org -- ^ permalink raw reply [flat|nested] 85+ messages in thread
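To put rough numbers on the "raw bandwidth is not the bottleneck" point made above, here is a small back-of-envelope calculation in plain C. The 16-byte descriptor fetch and the use of the thread's observed ~700 kpps TX limit are assumptions made only to show the shape of the problem: at minimum-size packets, almost all of each packet's time slot goes to arbitration and latency rather than payload transfer.

/* Rough illustration: raw PCI-X bandwidth vs. the per-packet time
 * budget implied by the ~700 kpps TX limit reported in this thread. */
#include <stdio.h>

int main(void)
{
	double bus_hz    = 133e6;   /* PCI-X, 133 MHz */
	double bus_bytes = 8;       /* 64-bit wide    */
	double raw_bps   = bus_hz * bus_bytes * 8;

	/* data actually moved per minimum-size packet:
	 * 64-byte frame + an (assumed) 16-byte descriptor fetch */
	double bytes_per_pkt = 64 + 16;
	double data_ns = bytes_per_pkt / bus_bytes / bus_hz * 1e9;

	double observed_pps = 700e3;          /* from the thread */
	double budget_ns = 1e9 / observed_pps;

	printf("raw bus bandwidth          : %.1f Gbit/s\n", raw_bps / 1e9);
	printf("payload transfer per packet: %.0f ns\n", data_ns);
	printf("per-packet budget @ 700kpps: %.0f ns\n", budget_ns);
	printf("=> ~%.0f%% of each packet slot is arbitration/latency,\n"
	       "   not data transfer\n",
	       100.0 * (budget_ns - data_ns) / budget_ns);
	return 0;
}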
* Re: [E1000-devel] Transmission limit 2004-11-26 20:01 ` jamal 2004-11-29 10:19 ` P @ 2004-11-29 13:09 ` Robert Olsson 2004-11-29 20:16 ` David S. Miller 2004-11-30 13:31 ` jamal 1 sibling, 2 replies; 85+ messages in thread From: Robert Olsson @ 2004-11-29 13:09 UTC (permalink / raw) To: hadi Cc: P, mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev jamal writes: > Have to read the paper - When Robert was last visiting here; we did some > tests and packet recycling is not very valuable as far as SMP is > concerned (given that packets can be alloced on one CPU and freed on > another). There a clear win on single CPU machines. Correct, yes, at your lab about 2 1/2 years ago. I see those experiments in a different light today, as we never got any packet budget contribution from SMP with a shared-memory arch whatsoever. Spent a week w. Alexey in the lab to understand what's going on. Two flows with total affinity (one for each CPU), even removed all locks and part of the IP stack. We were still confused... When Opteron/NUMA gave a good contribution in those setups, we started thinking it must be latency and memory controllers that make the difference, as each CPU has its own memory and memory controller in the Opteron case. So from that aspect we were expecting the impossible from the recycling patch; maybe it will do better on boxes w. local memory. But I think we should give up on skb recycling in its current form. If we extend it to deal with cache bouncing etc., we end up having something like slab in every driver. slab has improved and is not so dominant in profiles now. Also, from what I understand, new HW and MSI can help in the case where we pass objects between CPUs. Did I dream, or did someone tell me that S2IO could have several TX rings that could via MSI be routed to the proper cpu? slab packet-objects have been discussed. It would contribute something, but is the complexity worth it? Also I think it could be possible to do a more lightweight variant of skb recycling in case we need to recycle the PCI mapping etc. --ro ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-29 13:09 ` Robert Olsson @ 2004-11-29 20:16 ` David S. Miller 2004-12-01 16:47 ` Robert Olsson 2004-11-30 13:31 ` jamal 1 sibling, 1 reply; 85+ messages in thread From: David S. Miller @ 2004-11-29 20:16 UTC (permalink / raw) To: Robert Olsson Cc: hadi, P, mellia, Robert.Olsson, e1000-devel, jorge.finochietto, galante, netdev On Mon, 29 Nov 2004 14:09:08 +0100 Robert Olsson <Robert.Olsson@data.slu.se> wrote: > Did I dream or did someone tell me that S2IO > could have several TX ring that could via MSI be routed to proper cpu? One of Sun's gigabit chips can do this too, except it isn't via MSI, the driver has to read the descriptor to figure out which cpu gets the software interrupt to process the packet. SGI had hardware which allowed you to do this kind of stuff too. Obviously the MSI version works much better. It is important, the cpu selection process. First of all, it must be calculated such that flows always go through the same cpu. Otherwise TCP sockets bounce between the cpus for a streaming transfer. And even this doesn't avoid all such problems, TCP LISTEN state sockets will still thrash between the cpus with such a "pick a cpu based upon" flow scheme. Anyways, just some thoughts. ^ permalink raw reply [flat|nested] 85+ messages in thread
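The "flows must always go through the same cpu" selection Dave describes can be sketched as a hash over the flow 4-tuple, reduced modulo the number of CPUs. The sketch below is a toy (the mixing constants are arbitrary and real NICs use a stronger keyed hash over the tuple); it is only meant to show why the mapping is stable per flow.

/* Illustrative only: pick a CPU for a packet from its flow 4-tuple so
 * that all packets of one TCP/UDP flow are processed on the same CPU. */
#include <stdio.h>
#include <stdint.h>

static unsigned int flow_cpu(uint32_t saddr, uint32_t daddr,
			     uint16_t sport, uint16_t dport,
			     unsigned int ncpus)
{
	uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

	/* cheap avalanche so nearby addresses don't clump on one CPU */
	h ^= h >> 16;
	h *= 0x45d9f3b;
	h ^= h >> 16;

	return h % ncpus;
}

int main(void)
{
	/* same flow twice -> same CPU; a different flow may land elsewhere */
	printf("flow A -> cpu %u\n",
	       flow_cpu(0x0a000001, 0x0a000002, 1025, 80, 4));
	printf("flow A -> cpu %u (again)\n",
	       flow_cpu(0x0a000001, 0x0a000002, 1025, 80, 4));
	printf("flow B -> cpu %u\n",
	       flow_cpu(0x0a000003, 0x0a000002, 4242, 80, 4));
	return 0;
}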
* Re: [E1000-devel] Transmission limit 2004-11-29 20:16 ` David S. Miller @ 2004-12-01 16:47 ` Robert Olsson 0 siblings, 0 replies; 85+ messages in thread From: Robert Olsson @ 2004-12-01 16:47 UTC (permalink / raw) To: David S. Miller Cc: Robert Olsson, hadi, P, mellia, e1000-devel, jorge.finochietto, galante, netdev David S. Miller writes: > > Did I dream or did someone tell me that S2IO > > could have several TX ring that could via MSI be routed to proper cpu? > > One of Sun's gigabit chips can do this too, except it isn't > via MSI, the driver has to read the descriptor to figure out > which cpu gets the software interrupt to process the packet. > > SGI had hardware which allowed you to do this kind of stuff too. > > Obviously the MSI version works much better. > > It is important, the cpu selection process. First of all, it must > be calculated such that flows always go through the same cpu. > Otherwise TCP sockets bounce between the cpus for a streaming > transfer. > > And even this doesn't avoid all such problems, TCP LISTEN state > sockets will still thrash between the cpus with such a "pick > a cpu based upon" flow scheme. > > Anyways, just some thoughts. Thanks for the info. Well, we'll be forced to get into those problems when the HW is capable. I guess it will be w. the 10 GigE cards. --ro ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-29 13:09 ` Robert Olsson 2004-11-29 20:16 ` David S. Miller @ 2004-11-30 13:31 ` jamal 2004-11-30 13:46 ` Lennert Buytenhek 1 sibling, 1 reply; 85+ messages in thread From: jamal @ 2004-11-30 13:31 UTC (permalink / raw) To: Robert Olsson Cc: P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, 2004-11-29 at 08:09, Robert Olsson wrote: > jamal writes: > > > Have to read the paper - When Robert was last visiting here; we did some > > tests and packet recycling is not very valuable as far as SMP is > > concerned (given that packets can be alloced on one CPU and freed on > > another). There a clear win on single CPU machines. > > > Correct yes at you lab about 2 1/2 years ago. How time flies when you are having fun ;-> > I see those experiments in a > different light today as we never got any packet budget contribution > from SMP with shared mem arch whatsoever. Spent a week w. Alexey in the lab > to understand whats going on. Two flows with total affinity (for each CPU) > even removed all locks and part of the IP stack. We were still confused... > > When Opteron/NUMA gave good contribution in those setups. We start thinking > it must be latency and memory controllers that makes the difference. As w. > each CPU has it's own memory and memory controller in Opteron case. > > So from that aspect we expecting the impossible from recycling patch > maybe it will do better on boxes w. local memory. > Interesting thought. Not using a lot of my brain cells to compute i would say that it would get worse. But i suppose the real reason this gets nasty on x86 style SMP is because cache misses are more expensive there, maybe? > But I think we should give it up in current form skb recycling. If extend > it to deal cache bouncing etc. We end up having something like slab in > every driver. slab has improved is not so dominant in profiles now. > nod. > Also from what I understand new HW and MSI can help in the case where > pass objects between CPU. Did I dream or did someone tell me that S2IO > could have several TX ring that could via MSI be routed to proper cpu? I am wondering if the per CPU tx/rx irqs are valuable at all. They sound like more hell to maintain. > slab packet-objects have been discussed. It would do some contribution > but is the complexity worth it? May not be worth it. > > Also I think it could possible to do more lightweight variant of skb > recycling in case we need to recycle PCI-mapping etc. > I think its valuable to have it for people with UP; its not worth the complexity for SMP IMO. cheers, jamal ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-30 13:31 ` jamal @ 2004-11-30 13:46 ` Lennert Buytenhek 2004-11-30 14:25 ` jamal 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-11-30 13:46 UTC (permalink / raw) To: jamal Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Tue, Nov 30, 2004 at 08:31:41AM -0500, jamal wrote: > > Also from what I understand new HW and MSI can help in the case where > > pass objects between CPU. Did I dream or did someone tell me that S2IO > > could have several TX ring that could via MSI be routed to proper cpu? > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound > like more hell to maintain. On the TX path you'd have qdiscs to deal with as well, no? --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-30 13:46 ` Lennert Buytenhek @ 2004-11-30 14:25 ` jamal 2004-12-01 0:11 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: jamal @ 2004-11-30 14:25 UTC (permalink / raw) To: Lennert Buytenhek Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Tue, 2004-11-30 at 08:46, Lennert Buytenhek wrote: > On Tue, Nov 30, 2004 at 08:31:41AM -0500, jamal wrote: > > > > Also from what I understand new HW and MSI can help in the case where > > > pass objects between CPU. Did I dream or did someone tell me that S2IO > > > could have several TX ring that could via MSI be routed to proper cpu? > > > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound > > like more hell to maintain. > > On the TX path you'd have qdiscs to deal with as well, no? I think management of it would be non-trivial in SMP. You'd have to start playing stupid load-balancing tricks, which would reduce the value of having the tx irqs to begin with. cheers, jamal ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-30 14:25 ` jamal @ 2004-12-01 0:11 ` Lennert Buytenhek 2004-12-01 1:09 ` Scott Feldman 2004-12-01 12:08 ` jamal 0 siblings, 2 replies; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-01 0:11 UTC (permalink / raw) To: jamal Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Tue, Nov 30, 2004 at 09:25:54AM -0500, jamal wrote: > > > > Also from what I understand new HW and MSI can help in the case where > > > > pass objects between CPU. Did I dream or did someone tell me that S2IO > > > > could have several TX ring that could via MSI be routed to proper cpu? > > > > > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound > > > like more hell to maintain. > > > > On the TX path you'd have qdiscs to deal with as well, no? > > I think management of it would be non-trivial in SMP. Youd have to start > playing stupid loadbalancing tricks which would reduce the value of > existence of tx irqs to begin with. You mean the management of qdiscs would be non-trivial? Probably the idea of these kinds of tricks is to skip the qdisc step altogether. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 0:11 ` Lennert Buytenhek @ 2004-12-01 1:09 ` Scott Feldman 2004-12-01 15:34 ` Robert Olsson ` (3 more replies) 2004-12-01 12:08 ` jamal 1 sibling, 4 replies; 85+ messages in thread From: Scott Feldman @ 2004-12-01 1:09 UTC (permalink / raw) To: Lennert Buytenhek Cc: jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Hey, turns out, I know some e1000 tricks that might help get the kpps numbers up. My problem is I only have a P4 desktop system with a 82544 nic running at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx descriptor write-backs. For me, I see a nice jump in kpps, but I'd like others to try with their setups. We should be able to get to wire speed with 60-byte packets. I'm using pktgen in linux-2.6.9, count = 1000000. System: Intel 865 (HT 2.6Ghz) Nic: 82544 PCI 32-bit/33Mhz Driver: linux-2.6.9 e1000 (5.3.19-k2-NAPI), no Interrupt Delays BEFORE 256 descs pkt_size = 60: 253432pps 129Mb/sec errors: 0 pkt_size = 1500: 56356pps 678Mb/sec errors: 499791 4096 descs pkt_size = 60: 254222pps 130Mb/sec errors: 0 pkt_size = 1500: 52693pps 634Mb/sec errors: 497556 AFTER Modified driver to turn off Tx interrupts and descriptor write-backs. Uses a timer to schedule Tx cleanup. The timer runs at 1ms. This would work poorly where HZ=100. Needed to bump Tx descriptors up to 4096 because 1ms is a lot of time with 60-byte packets at 1GbE. Every time the timer expires, there is only one PIO read to get HW head pointer. This wouldn't work at lower media speeds like 10Mbps or 100Mbps because the ring isn't large enough (or we would need a higher resolution timer). This also get Tx cleanup out of the NAPI path. 4096 descs pkt_size = 60: 541618pps 277Mb/sec errors: 914 pkt_size = 1500: 76198pps 916Mb/sec errors: 12419 This doubles the kpps numbers for 60-byte packets. I'd like to see what happens on higher bus bandwidth systems. Anyone? -scott diff -Naurp linux-2.6.9/drivers/net/e1000/e1000.h linux-2.6.9/drivers/net/e1000.mod/e1000.h --- linux-2.6.9/drivers/net/e1000/e1000.h 2004-10-18 14:53:06.000000000 -0700 +++ linux-2.6.9/drivers/net/e1000.mod/e1000.h 2004-11-30 14:41:07.045391488 -0800 @@ -103,7 +103,7 @@ struct e1000_adapter; #define E1000_MAX_INTR 10 /* TX/RX descriptor defines */ -#define E1000_DEFAULT_TXD 256 +#define E1000_DEFAULT_TXD 4096 #define E1000_MAX_TXD 256 #define E1000_MIN_TXD 80 #define E1000_MAX_82544_TXD 4096 @@ -189,6 +189,7 @@ struct e1000_desc_ring { /* board specific private data structure */ struct e1000_adapter { + struct timer_list tx_cleanup_timer; struct timer_list tx_fifo_stall_timer; struct timer_list watchdog_timer; struct timer_list phy_info_timer; @@ -224,6 +225,7 @@ struct e1000_adapter { uint32_t tx_fifo_size; atomic_t tx_fifo_stall; boolean_t pcix_82544; + boolean_t tx_cleanup_scheduled; /* RX */ struct e1000_desc_ring rx_ring; diff -Naurp linux-2.6.9/drivers/net/e1000/e1000_hw.h linux-2.6.9/drivers/net/e1000.mod/e1000_hw.h --- linux-2.6.9/drivers/net/e1000/e1000_hw.h 2004-10-18 14:55:06.000000000 -0700 +++ linux-2.6.9/drivers/net/e1000.mod/e1000_hw.h 2004-11-30 13:48:07.983682328 -0800 @@ -417,14 +417,12 @@ int32_t e1000_set_d3_lplu_state(struct e /* This defines the bits that are set in the Interrupt Mask * Set/Read Register. 
Each bit is documented below: * o RXT0 = Receiver Timer Interrupt (ring 0) - * o TXDW = Transmit Descriptor Written Back * o RXDMT0 = Receive Descriptor Minimum Threshold hit (ring 0) * o RXSEQ = Receive Sequence Error * o LSC = Link Status Change */ #define IMS_ENABLE_MASK ( \ E1000_IMS_RXT0 | \ - E1000_IMS_TXDW | \ E1000_IMS_RXDMT0 | \ E1000_IMS_RXSEQ | \ E1000_IMS_LSC) diff -Naurp linux-2.6.9/drivers/net/e1000/e1000_main.c linux-2.6.9/drivers/net/e1000.mod/e1000_main.c --- linux-2.6.9/drivers/net/e1000/e1000_main.c 2004-10-18 14:53:50.000000000 -0700 +++ linux-2.6.9/drivers/net/e1000.mod/e1000_main.c 2004-11-30 16:15:13.777957656 -0800 @@ -131,7 +131,7 @@ static int e1000_set_mac(struct net_devi static void e1000_irq_disable(struct e1000_adapter *adapter); static void e1000_irq_enable(struct e1000_adapter *adapter); static irqreturn_t e1000_intr(int irq, void *data, struct pt_regs *regs); -static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter); +static void e1000_clean_tx(unsigned long data); #ifdef CONFIG_E1000_NAPI static int e1000_clean(struct net_device *netdev, int *budget); static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter, @@ -286,6 +286,7 @@ e1000_down(struct e1000_adapter *adapter e1000_irq_disable(adapter); free_irq(adapter->pdev->irq, netdev); + del_timer_sync(&adapter->tx_cleanup_timer); del_timer_sync(&adapter->tx_fifo_stall_timer); del_timer_sync(&adapter->watchdog_timer); del_timer_sync(&adapter->phy_info_timer); @@ -533,6 +534,10 @@ e1000_probe(struct pci_dev *pdev, e1000_get_bus_info(&adapter->hw); + init_timer(&adapter->tx_cleanup_timer); + adapter->tx_cleanup_timer.function = &e1000_clean_tx; + adapter->tx_cleanup_timer.data = (unsigned long) adapter; + init_timer(&adapter->tx_fifo_stall_timer); adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall; adapter->tx_fifo_stall_timer.data = (unsigned long) adapter; @@ -893,14 +898,9 @@ e1000_configure_tx(struct e1000_adapter e1000_config_collision_dist(&adapter->hw); /* Setup Transmit Descriptor Settings for eop descriptor */ - adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP | + adapter->txd_cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_IFCS; - if(adapter->hw.mac_type < e1000_82543) - adapter->txd_cmd |= E1000_TXD_CMD_RPS; - else - adapter->txd_cmd |= E1000_TXD_CMD_RS; - /* Cache if we're 82544 running in PCI-X because we'll * need this to apply a workaround later in the send path. 
*/ if(adapter->hw.mac_type == e1000_82544 && @@ -1820,6 +1820,11 @@ e1000_xmit_frame(struct sk_buff *skb, st return NETDEV_TX_LOCKED; } + if(!adapter->tx_cleanup_scheduled) { + adapter->tx_cleanup_scheduled = TRUE; + mod_timer(&adapter->tx_cleanup_timer, jiffies + 1); + } + /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { @@ -1856,6 +1861,7 @@ e1000_xmit_frame(struct sk_buff *skb, st netdev->trans_start = jiffies; spin_unlock_irqrestore(&adapter->tx_lock, flags); + return NETDEV_TX_OK; } @@ -2151,8 +2157,7 @@ e1000_intr(int irq, void *data, struct p } #else for(i = 0; i < E1000_MAX_INTR; i++) - if(unlikely(!e1000_clean_rx_irq(adapter) & - !e1000_clean_tx_irq(adapter))) + if(unlikely(!e1000_clean_rx_irq(adapter))) break; #endif @@ -2170,18 +2175,15 @@ e1000_clean(struct net_device *netdev, i { struct e1000_adapter *adapter = netdev->priv; int work_to_do = min(*budget, netdev->quota); - int tx_cleaned; int work_done = 0; - tx_cleaned = e1000_clean_tx_irq(adapter); e1000_clean_rx_irq(adapter, &work_done, work_to_do); *budget -= work_done; netdev->quota -= work_done; - /* if no Rx and Tx cleanup work was done, exit the polling mode */ - if(!tx_cleaned || (work_done < work_to_do) || - !netif_running(netdev)) { + /* if no Rx cleanup work was done, exit the polling mode */ + if((work_done < work_to_do) || !netif_running(netdev)) { netif_rx_complete(netdev); e1000_irq_enable(adapter); return 0; @@ -2192,66 +2194,74 @@ e1000_clean(struct net_device *netdev, i #endif /** - * e1000_clean_tx_irq - Reclaim resources after transmit completes - * @adapter: board private structure + * e1000_clean_tx - Reclaim resources after transmit completes + * @data: timer callback data (board private structure) **/ -static boolean_t -e1000_clean_tx_irq(struct e1000_adapter *adapter) +static void +e1000_clean_tx(unsigned long data) { + struct e1000_adapter *adapter = (struct e1000_adapter *)data; struct e1000_desc_ring *tx_ring = &adapter->tx_ring; struct net_device *netdev = adapter->netdev; struct pci_dev *pdev = adapter->pdev; - struct e1000_tx_desc *tx_desc, *eop_desc; struct e1000_buffer *buffer_info; - unsigned int i, eop; - boolean_t cleaned = FALSE; + unsigned int i, next; + int size = 0, count = 0; + uint32_t tx_head; - i = tx_ring->next_to_clean; - eop = tx_ring->buffer_info[i].next_to_watch; - eop_desc = E1000_TX_DESC(*tx_ring, eop); + spin_lock(&adapter->tx_lock); - while(eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) { - for(cleaned = FALSE; !cleaned; ) { - tx_desc = E1000_TX_DESC(*tx_ring, i); - buffer_info = &tx_ring->buffer_info[i]; + tx_head = E1000_READ_REG(&adapter->hw, TDH); - if(likely(buffer_info->dma)) { - pci_unmap_page(pdev, - buffer_info->dma, - buffer_info->length, - PCI_DMA_TODEVICE); - buffer_info->dma = 0; - } + i = next = tx_ring->next_to_clean; - if(buffer_info->skb) { - dev_kfree_skb_any(buffer_info->skb); - buffer_info->skb = NULL; - } + while(i != tx_head) { + size++; + if(i == tx_ring->buffer_info[next].next_to_watch) { + count += size; + size = 0; + if(unlikely(++i == tx_ring->count)) + i = 0; + next = i; + } else { + if(unlikely(++i == tx_ring->count)) + i = 0; + } + } - tx_desc->buffer_addr = 0; - tx_desc->lower.data = 0; - tx_desc->upper.data = 0; + i = tx_ring->next_to_clean; + while(count--) { + buffer_info = &tx_ring->buffer_info[i]; - cleaned = (i == eop); - if(unlikely(++i == tx_ring->count)) i = 0; + if(likely(buffer_info->dma)) { + pci_unmap_page(pdev, + 
buffer_info->dma, + buffer_info->length, + PCI_DMA_TODEVICE); + buffer_info->dma = 0; } - - eop = tx_ring->buffer_info[i].next_to_watch; - eop_desc = E1000_TX_DESC(*tx_ring, eop); + + if(buffer_info->skb) { + dev_kfree_skb_any(buffer_info->skb); + buffer_info->skb = NULL; + } + + if(unlikely(++i == tx_ring->count)) + i = 0; } tx_ring->next_to_clean = i; - spin_lock(&adapter->tx_lock); + if(E1000_DESC_UNUSED(tx_ring) != tx_ring->count) + mod_timer(&adapter->tx_cleanup_timer, jiffies + 1); + else + adapter->tx_cleanup_scheduled = FALSE; - if(unlikely(cleaned && netif_queue_stopped(netdev) && - netif_carrier_ok(netdev))) + if(unlikely(netif_queue_stopped(netdev) && netif_carrier_ok(netdev))) netif_wake_queue(netdev); spin_unlock(&adapter->tx_lock); - - return cleaned; } /** ^ permalink raw reply [flat|nested] 85+ messages in thread
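The core of the patch above is: stop asking the hardware to report completions (no TXDW interrupts, no descriptor write-backs) and instead, from a 1ms timer, read the hardware head pointer (TDH) and release everything between next_to_clean and that head. A stripped-down, self-contained model of that ring walk with wrap-around is shown below; it is an illustration of the logic only and leaves out what the real patch does (next_to_watch tracking, unmapping, locking), and read_hw_head() is a stand-in for the TDH register read.

/* Toy model of "clean by hardware head pointer": everything from
 * next_to_clean up to (but not including) the head read from the NIC
 * has been transmitted and its buffers can be released. */
#include <stdio.h>

#define RING_SIZE 4096

struct tx_ring {
	unsigned int next_to_use;	/* where the driver queues   */
	unsigned int next_to_clean;	/* oldest un-reclaimed entry */
};

/* stand-in for the PIO read of the TDH register */
static unsigned int read_hw_head(void)
{
	return 1234;	/* fixed value, just for the example */
}

static unsigned int clean_tx(struct tx_ring *r)
{
	unsigned int head = read_hw_head();
	unsigned int freed = 0;
	unsigned int i = r->next_to_clean;

	while (i != head) {
		/* real driver: pci_unmap_page() + dev_kfree_skb_any() */
		i = (i + 1) % RING_SIZE;
		freed++;
	}
	r->next_to_clean = i;
	return freed;
}

int main(void)
{
	struct tx_ring r = { .next_to_use = 2000, .next_to_clean = 100 };

	printf("reclaimed %u descriptors, next_to_clean now %u\n",
	       clean_tx(&r), r.next_to_clean);
	return 0;
}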
* Re: [E1000-devel] Transmission limit 2004-12-01 1:09 ` Scott Feldman @ 2004-12-01 15:34 ` Robert Olsson 2004-12-01 16:49 ` Scott Feldman 2004-12-01 18:29 ` Lennert Buytenhek ` (2 subsequent siblings) 3 siblings, 1 reply; 85+ messages in thread From: Robert Olsson @ 2004-12-01 15:34 UTC (permalink / raw) To: sfeldma Cc: Lennert Buytenhek, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Scott Feldman writes: > Hey, turns out, I know some e1000 tricks that might help get the kpps > numbers up. > > My problem is I only have a P4 desktop system with a 82544 nic running > at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a > rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx > descriptor write-backs. For me, I see a nice jump in kpps, but I'd like > others to try with their setups. We should be able to get to wire speed > with 60-byte packets. > > System: Intel 865 (HT 2.6Ghz) > Nic: 82544 PCI 32-bit/33Mhz > Driver: linux-2.6.9 e1000 (5.3.19-k2-NAPI), no Interrupt Delays > 4096 descs > pkt_size = 60: 541618pps 277Mb/sec errors: 914 Hello! Nice, but I see no improvements w. the 82546GB @ 133 MHz on a 1.6 GHz Opteron, it seems. SMP kernel linux-2.6.9-rc2 Vanilla. 801077pps 410Mb/sec (410151424bps) errors: 95596 Patch TXD=4096 608690pps 311Mb/sec (311649280bps) errors: 0 Patch TXD=2048 624103pps 319Mb/sec (319540736bps) errors: 0 Patch TXD=1024 551289pps 282Mb/sec (282259968bps) errors: 4506 The error count is a bit confusing... --ro ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 15:34 ` Robert Olsson @ 2004-12-01 16:49 ` Scott Feldman 2004-12-01 17:37 ` Robert Olsson ` (2 more replies) 0 siblings, 3 replies; 85+ messages in thread From: Scott Feldman @ 2004-12-01 16:49 UTC (permalink / raw) To: Robert Olsson Cc: Lennert Buytenhek, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Wed, 2004-12-01 at 07:34, Robert Olsson wrote: > Nice but I no improvements w. 82546GB @ 133 MHz on 1.6 GHz Opteron it seems. > SMP kernel linux-2.6.9-rc2 > > Vanilla. > 801077pps 410Mb/sec (410151424bps) errors: 95596 > > Patch TXD=4096 > 608690pps 311Mb/sec (311649280bps) errors: 0 Thank you Robert for trying it out. Well those results are counter-intuitive! We remove Tx interrupts and Tx descriptor DMA write-backs and get no re-tries, and performance drops? The only bus activities left are the DMA of buffers to device and the register writes to increment tail. I'm stumped. I'll need to get my hands on a faster system. Maybe there is a bus analyzer under the tree. :-) -scott ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 16:49 ` Scott Feldman @ 2004-12-01 17:37 ` Robert Olsson 2004-12-02 17:54 ` Robert Olsson 2004-12-02 18:23 ` Robert Olsson 2 siblings, 0 replies; 85+ messages in thread From: Robert Olsson @ 2004-12-01 17:37 UTC (permalink / raw) To: sfeldma Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Scott Feldman writes: > Thank you Robert for trying it out. > > Well those results are counter-intuitive! We remove Tx interrupts and > Tx descriptor DMA write-backs and get no re-tries, and performance > drops? The only bus activities left are the DMA of buffers to device > and the register writes to increment tail. I'm stumped. I'll need to > get my hands on a faster system. Maybe there is a bus analyzer under > the tree. :-) Huh. I've got a deja-vu feeling. What will happen if we remove almost all events (interrupts) and just have the timer waking up once-in-a-while? --ro ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 16:49 ` Scott Feldman 2004-12-01 17:37 ` Robert Olsson @ 2004-12-02 17:54 ` Robert Olsson 2004-12-02 18:23 ` Robert Olsson 2 siblings, 0 replies; 85+ messages in thread From: Robert Olsson @ 2004-12-02 17:54 UTC (permalink / raw) To: sfeldma Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Scott Feldman writes: > Thank you Robert for trying it out. Scott! I've rerun some of the tests. I've set maxcpus=1 to make sure everything happens on one CPU. Same HW as yesterday. I now see a lot of variation in the results from your patch. vanilla 804353pps 411Mb/sec (411828736bps) errors: 98877 patch TXD=4096 Sometimes: 882362pps 451Mb/sec (451769344bps) errors: 0 patch TXD=2048 Sometimes: 943007pps 482Mb/sec (482819584bps) errors: 0 But very often it runs around 500 kpps with the patch. This smells like scheduling to me, as smaller rings usually mean higher performance, but the ring needs to be big enough to hide latencies. See also my next mail... --ro ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 16:49 ` Scott Feldman 2004-12-01 17:37 ` Robert Olsson 2004-12-02 17:54 ` Robert Olsson @ 2004-12-02 18:23 ` Robert Olsson 2004-12-02 23:25 ` Lennert Buytenhek ` (2 more replies) 2 siblings, 3 replies; 85+ messages in thread From: Robert Olsson @ 2004-12-02 18:23 UTC (permalink / raw) To: sfeldma Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Hello! Below is a little patch to clean skbs at xmit. It's an old jungle trick Jamal and I used w. the tulip. Note we can now even decrease the size of the TX ring. It can increase TX performance from 800 kpps to 1125128pps 576Mb/sec (576065536bps) errors: 0 1124946pps 575Mb/sec (575972352bps) errors: 0 But it suffers from the same scheduling problems as the previous patch. Often we just get 582108pps 298Mb/sec (298039296bps) errors: 0 When the sending CPU frees (its own) skb's, we might get some "TX free affinity", which is unrelated to irq affinity; of course not 100% perfect. And some of Scott's changes may still be used. --- drivers/net/e1000/e1000.h.orig 2004-12-01 13:59:36.000000000 +0100 +++ drivers/net/e1000/e1000.h 2004-12-02 20:11:31.000000000 +0100 @@ -103,7 +103,7 @@ #define E1000_MAX_INTR 10 /* TX/RX descriptor defines */ -#define E1000_DEFAULT_TXD 256 +#define E1000_DEFAULT_TXD 128 #define E1000_MAX_TXD 256 #define E1000_MIN_TXD 80 #define E1000_MAX_82544_TXD 4096 --- drivers/net/e1000/e1000_main.c.orig 2004-12-01 13:59:36.000000000 +0100 +++ drivers/net/e1000/e1000_main.c 2004-12-02 20:37:40.000000000 +0100 @@ -1820,6 +1820,10 @@ return NETDEV_TX_LOCKED; } + + if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 ) + e1000_clean_tx_ring(adapter); + /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { --ro ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-02 18:23 ` Robert Olsson @ 2004-12-02 23:25 ` Lennert Buytenhek 2004-12-03 5:23 ` Scott Feldman 2004-12-10 16:24 ` Martin Josefsson 2 siblings, 0 replies; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-02 23:25 UTC (permalink / raw) To: Robert Olsson Cc: sfeldma, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Thu, Dec 02, 2004 at 07:23:24PM +0100, Robert Olsson wrote: > Below is little patch to clean skb at xmit. It's old jungle trick Jamal > and I used w. tulip. Note we can now even decrease the size of TX ring. > > It can increase TX performance from 800 kpps to > 1125128pps 576Mb/sec (576065536bps) errors: 0 > 1124946pps 575Mb/sec (575972352bps) errors: 0 > > But suffers from scheduling problems as the previous patch. Often we just get > 582108pps 298Mb/sec (298039296bps) errors: 0 Robert, there is something weird with your setup with packets sizes under 160 bytes. Can you check if you also get wildly variable numbers on a baseline kernel perhaps? The numbers you sent me of packet size vs. pps were very jumpy as well, even at 10M pkts per run. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-02 18:23 ` Robert Olsson 2004-12-02 23:25 ` Lennert Buytenhek @ 2004-12-03 5:23 ` Scott Feldman 2004-12-10 16:24 ` Martin Josefsson 2 siblings, 0 replies; 85+ messages in thread From: Scott Feldman @ 2004-12-03 5:23 UTC (permalink / raw) To: Robert Olsson Cc: Lennert Buytenhek, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Thu, 2004-12-02 at 10:23, Robert Olsson wrote: > It can increase TX performance from 800 kpps to > 1125128pps 576Mb/sec (576065536bps) errors: 0 > 1124946pps 575Mb/sec (575972352bps) errors: 0 These are the best numbers reported so far, right? > And some of Scotts may still be used. Did you try combining the two? > + > + if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 ) > + e1000_clean_tx_ring(adapter); > + You want to use E1000_DESC_UNUSED here because of the ring wrap. ;-) -scott ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-02 18:23 ` Robert Olsson 2004-12-02 23:25 ` Lennert Buytenhek 2004-12-03 5:23 ` Scott Feldman @ 2004-12-10 16:24 ` Martin Josefsson 2 siblings, 0 replies; 85+ messages in thread From: Martin Josefsson @ 2004-12-10 16:24 UTC (permalink / raw) To: Robert Olsson Cc: sfeldma, Lennert Buytenhek, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev [-- Attachment #1: Type: text/plain, Size: 1096 bytes --] On Thu, 2004-12-02 at 19:23, Robert Olsson wrote: > Hello! > > Below is little patch to clean skb at xmit. It's old jungle trick Jamal > and I used w. tulip. Note we can now even decrease the size of TX ring. Just a small unimportant note. > --- drivers/net/e1000/e1000_main.c.orig 2004-12-01 13:59:36.000000000 +0100 > +++ drivers/net/e1000/e1000_main.c 2004-12-02 20:37:40.000000000 +0100 > @@ -1820,6 +1820,10 @@ > return NETDEV_TX_LOCKED; > } > > + > + if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 ) > + e1000_clean_tx_ring(adapter); > + > /* need: count + 2 desc gap to keep tail from touching > * head, otherwise try next time */ > if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { This patch is pretty broken, I doubt you want to call e1000_clean_tx_ring(), I think you want some variant of e1000_clean_tx_irq() :) e1000_clean_tx_irq() takes adapter->tx_lock which e1000_xmit_frame() also does so it will need some modification. And it should use E1000_DESC_UNUSED as Scott pointed out. -- /Martin [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 85+ messages in thread
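The wrap problem Scott and Martin point at is the classic one: next_to_use - next_to_clean goes wrong (huge, with unsigned arithmetic) once the producer index has wrapped past the end of the ring. A wrap-safe "how full is the ring" helper in the same spirit as the driver's E1000_DESC_UNUSED macro can be written as below; this is a self-contained sketch, not the driver's exact code.

/* Wrap-safe ring occupancy arithmetic for a descriptor ring of
 * `count` entries. Sketch only; the e1000 driver has its own macro. */
#include <stdio.h>

struct ring {
	unsigned int count;		/* total descriptors        */
	unsigned int next_to_use;	/* producer index (driver)  */
	unsigned int next_to_clean;	/* consumer index (cleanup) */
};

static unsigned int ring_in_use(const struct ring *r)
{
	if (r->next_to_use >= r->next_to_clean)
		return r->next_to_use - r->next_to_clean;
	return r->count - (r->next_to_clean - r->next_to_use);
}

static unsigned int ring_unused(const struct ring *r)
{
	/* keep one slot free so head and tail never collide */
	return r->count - ring_in_use(r) - 1;
}

int main(void)
{
	struct ring a = { 256, 200, 100 };	/* no wrap          */
	struct ring b = { 256,  20, 200 };	/* producer wrapped */

	printf("a: in use %u, unused %u\n", ring_in_use(&a), ring_unused(&a));
	printf("b: in use %u, unused %u\n", ring_in_use(&b), ring_unused(&b));
	return 0;
}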
* Re: [E1000-devel] Transmission limit 2004-12-01 1:09 ` Scott Feldman 2004-12-01 15:34 ` Robert Olsson @ 2004-12-01 18:29 ` Lennert Buytenhek 2004-12-01 21:35 ` Lennert Buytenhek 2004-12-02 17:31 ` [E1000-devel] Transmission limit Marco Mellia 2004-12-03 20:57 ` Lennert Buytenhek 3 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-01 18:29 UTC (permalink / raw) To: Scott Feldman Cc: jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Tue, Nov 30, 2004 at 05:09:59PM -0800, Scott Feldman wrote: > This doubles the kpps numbers for 60-byte packets. I'd like to see what > happens on higher bus bandwidth systems. Anyone? Dual Xeon 2.4GHz, a 82540EM and a 82541GI both on 32/66 on separate PCI buses. BEFORE performance is approx the same for both, ~620kpps. AFTER performance is ~730kpps, also approx the same for both. (Note: only sending with one NIC at a time.) Once or twice it went into a state where it started spitting out these kinds of messages and never recovered: Dec 1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out [...] Dec 1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out [...] Dec 1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out But overall, looks good. Strange thing that Robert's numbers didn't improve. Doing some more measurements right now. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 18:29 ` Lennert Buytenhek @ 2004-12-01 21:35 ` Lennert Buytenhek 2004-12-02 6:13 ` Scott Feldman 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-01 21:35 UTC (permalink / raw) To: Scott Feldman Cc: jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev [-- Attachment #1: Type: text/plain, Size: 1209 bytes --] On Wed, Dec 01, 2004 at 07:29:43PM +0100, Lennert Buytenhek wrote: > > This doubles the kpps numbers for 60-byte packets. I'd like to see what > > happens on higher bus bandwidth systems. Anyone? > > Dual Xeon 2.4GHz, a 82540EM and a 82541GI both on 32/66 on separate > PCI buses. > > BEFORE performance is approx the same for both, ~620kpps. > AFTER performance is ~730kpps, also approx the same for both. Pretty graph attached. From ~220B packets or so it does wire speed, but there's still an odd drop in performance around 256B packets (which is also there without your patch.) From 350B packets or so, performance is identical with or without your patch (wire speed.) So. Do you have any other good plans perhaps? :) > Once or twice it went into a state where it started spitting out these > kinds of messages and never recovered: > > Dec 1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out > [...] > Dec 1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out > [...] > Dec 1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out Didn't see this happen anymore. (ifconfig down and then up recovered it both times I saw it happen.) thanks, Lennert [-- Attachment #2: feldman.png --] [-- Type: image/png, Size: 7959 bytes --] ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 21:35 ` Lennert Buytenhek @ 2004-12-02 6:13 ` Scott Feldman 2004-12-03 13:24 ` jamal 2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek 0 siblings, 2 replies; 85+ messages in thread From: Scott Feldman @ 2004-12-02 6:13 UTC (permalink / raw) To: Lennert Buytenhek Cc: jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Wed, 2004-12-01 at 13:35, Lennert Buytenhek wrote: > Pretty graph attached. From ~220B packets or so it does wire speed, but > there's still an odd drop in performance around 256B packets (which is > also there without your patch.) From 350B packets or so, performance is > identical with or without your patch (wire speed.) Seems this is helping PCI nics but not PCI-X. I was using PCI 32/33. Can't explain the dip around 256B. > So. Do you have any other good plans perhaps? :) Idea#1 Is the write of TDT causing interference with DMA transactions? In addition to my patch, what happens if you bump the Tx tail every n packets, where n is like 16 or 32 or 64? if((i % 16) == 0) E1000_REG_WRITE(&adapter->hw, TDT, i); This might piss the NETDEV timer off if the send count isn't a multiple of n, so you might want to disable netdev->tx_timeout. Idea#2 The Ultimate: queue up 4096 packets and then write TDT once to send all 4096 in one shot. Well, maybe a few less that 4096 so we don't wrap the ring. How about pkt_size = 4000? Take my patch and change the timer call in e1000_xmit_frame from jiffies + 1 to jiffies + HZ This will schedule the cleanup of the skbs 1 second after the first queue, so we shouldn't be doing any cleanup while the 4000 packets are DMA'ed. Oh, and change the tail write to if((i % 4000) == 0) E1000_REG_WRITE(&adapter->hw, TDT, i); Of course you'll need to close/open the driver after each run. Idea#3 http://www.mail-archive.com/freebsd-net@freebsd.org/msg10826.html Set TXDMAC to 0 in e1000_configure_tx. > > Once or twice it went into a state where it started spitting out these > > kinds of messages and never recovered: > > > > Dec 1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out > > [...] > > Dec 1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out > > [...] > > Dec 1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out > > Didn't see this happen anymore. (ifconfig down and then up recovered it > both times I saw it happen.) Well, it's probably not a HW bug that's causing the reset; it's probably some bug with my patch. -scott ^ permalink raw reply [flat|nested] 85+ messages in thread
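Idea#1 above, writing the tail register only every n packets, is essentially doorbell batching. A minimal sketch of the bookkeeping follows (illustrative only, not a patch: write_tdt() is a stand-in for the real register write, and the flush-the-remainder path is what keeps the last few packets from being stranded when the modulo test never fires, which is also why the NETDEV watchdog can get upset without it).

/* Illustrative doorbell batching: accumulate queued descriptors and
 * write the tail register (TDT) once per batch, plus a final flush. */
#include <stdio.h>

#define BATCH 16

static unsigned int tail;		/* software copy of TDT           */
static unsigned int pending;		/* descriptors not yet doorbelled */
static unsigned int mmio_writes;	/* how many PIO writes we did     */

/* stand-in for the E1000_WRITE_REG(hw, TDT, tail) register write */
static void write_tdt(unsigned int val)
{
	mmio_writes++;
	(void)val;
}

static void queue_packet(unsigned int ring_size)
{
	tail = (tail + 1) % ring_size;
	if (++pending >= BATCH) {
		write_tdt(tail);
		pending = 0;
	}
}

/* call from a timer / cleanup path so a partial batch still goes out */
static void flush_tail(void)
{
	if (pending) {
		write_tdt(tail);
		pending = 0;
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		queue_packet(4096);
	flush_tail();
	printf("1000 packets queued with %u TDT writes\n", mmio_writes);
	return 0;
}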
* Re: [E1000-devel] Transmission limit 2004-12-02 6:13 ` Scott Feldman @ 2004-12-03 13:24 ` jamal 2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek 1 sibling, 0 replies; 85+ messages in thread From: jamal @ 2004-12-03 13:24 UTC (permalink / raw) To: sfeldma Cc: Lennert Buytenhek, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Thu, 2004-12-02 at 01:13, Scott Feldman wrote: > On Wed, 2004-12-01 at 13:35, Lennert Buytenhek wrote: > > Pretty graph attached. From ~220B packets or so it does wire speed, but > > there's still an odd drop in performance around 256B packets (which is > > also there without your patch.) From 350B packets or so, performance is > > identical with or without your patch (wire speed.) > > Seems this is helping PCI nics but not PCI-X. I was using PCI 32/33. > Can't explain the dip around 256B. > Interesting thought. I also saw improvements with my batching patch for PCI 32/32 but nothing noticeable in PCI-X 64/66. cheers, jamal ^ permalink raw reply [flat|nested] 85+ messages in thread
* 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-02 6:13 ` Scott Feldman 2004-12-03 13:24 ` jamal @ 2004-12-05 14:50 ` Lennert Buytenhek 2004-12-05 15:03 ` Martin Josefsson 1 sibling, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 14:50 UTC (permalink / raw) To: Scott Feldman Cc: jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Wed, Dec 01, 2004 at 10:13:33PM -0800, Scott Feldman wrote: > Idea#3 > > http://www.mail-archive.com/freebsd-net@freebsd.org/msg10826.html > > Set TXDMAC to 0 in e1000_configure_tx. Enabling 'DMA packet prefetching' gives me an impressive boost in performance. Combined with your TX clean rework, I now get 1.03Mpps TX performance at 60B packets. Transmitting from both of the 82546 ports at the same time gives me close to 2 Mpps. The freebsd post hints that (some) e1000 hardware might be buggy w.r.t. this prefetching though. I'll play some more with the other ideas you suggested as well. 60 1036488 61 1037413 62 1036429 63 990239 64 993218 65 993233 66 993201 67 993234 68 993219 69 993208 70 992225 71 980560 --L diff -ur e1000.orig/e1000_main.c e1000/e1000_main.c --- e1000.orig/e1000_main.c 2004-12-04 11:43:12.000000000 +0100 +++ e1000/e1000_main.c 2004-12-05 15:40:49.284946897 +0100 @@ -879,6 +894,8 @@ E1000_WRITE_REG(&adapter->hw, TCTL, tctl); + E1000_WRITE_REG(&adapter->hw, TXDMAC, 0); + e1000_config_collision_dist(&adapter->hw); /* Setup Transmit Descriptor Settings for eop descriptor */ ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek @ 2004-12-05 15:03 ` Martin Josefsson 2004-12-05 15:15 ` Lennert Buytenhek ` (2 more replies) 0 siblings, 3 replies; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 15:03 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Lennert Buytenhek wrote: > Enabling 'DMA packet prefetching' gives me an impressive boost in performance. > Combined with your TX clean rework, I now get 1.03Mpps TX performance at 60B > packets. Transmitting from both of the 82546 ports at the same time gives me > close to 2 Mpps. > > The freebsd post hints that (some) e1000 hardware might be buggy w.r.t. this > prefetching though. > > I'll play some more with the other ideas you suggested as well. > > 60 1036488 I was just playing with prefetching when you sent your mail :) I get that number with Scott's patch but without prefetching. If I move the TDT update to the tx cleaning I get a few extra kpps, but not much. BUT if I use the above + prefetching I get this: 60 1483890 64 1418568 68 1356992 72 1300523 76 1248568 80 1142989 84 1140909 88 1114951 92 1076546 96 960732 100 949801 104 972876 108 945314 112 918380 116 891393 120 865923 124 843288 128 696465 Which is pretty nice :) This is on one port of a 82546GB. The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and the nic is located in a 64/66 slot. I won't post any patch until I've tested some more and cleaned up a few things. BTW, I also get some transmit timeouts with Scott's patch sometimes, not often but it does happen. /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:03 ` Martin Josefsson @ 2004-12-05 15:15 ` Lennert Buytenhek 2004-12-05 15:19 ` Martin Josefsson 2004-12-05 15:42 ` Martin Josefsson 2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman 2 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 15:15 UTC (permalink / raw) To: Martin Josefsson Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, Dec 05, 2004 at 04:03:36PM +0100, Martin Josefsson wrote: > BUT if I use the above + prefetching I get this: > > 60 1483890 > [snip] > > Which is pretty nice :) Not just that, it's also wire speed GigE. Damn. Now we all have to go and upgrade to 10GbE cards, and I don't think my girlfriend would give me one of those for christmas. > This is on one port of a 82546GB > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and > the nic is located in a 64/66 slot. Hmmm. Funny you get this number even on 64/66. How many PCI bridges between the CPUs and the NIC? Any idea how many cycles an MMIO read on your hardware is? cheers, Lennert ^ permalink raw reply [flat|nested] 85+ messages in thread
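"Wire speed" for minimum-size frames is easy to sanity-check: each 64-byte frame also costs an 8-byte preamble/SFD and a 12-byte inter-frame gap on the wire, so the theoretical ceiling on gigabit Ethernet is just under 1.49 Mpps, which is what the 1,483,890 pps figure above is brushing against. A one-liner to reproduce the number:

/* Theoretical max frame rate on gigabit Ethernet for 64-byte frames:
 * 64B frame + 8B preamble/SFD + 12B inter-frame gap = 84B per slot. */
#include <stdio.h>

int main(void)
{
	double line_rate = 1e9;			/* bits per second */
	double slot_bits = (64 + 8 + 12) * 8;	/* 672 bits        */

	printf("max frames/s: %.0f\n", line_rate / slot_bits);
	return 0;
}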
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:15 ` Lennert Buytenhek @ 2004-12-05 15:19 ` Martin Josefsson 2004-12-05 15:30 ` Martin Josefsson 0 siblings, 1 reply; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 15:19 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Lennert Buytenhek wrote: > > 60 1483890 > > [snip] > > > > Which is pretty nice :) > > Not just that, it's also wire speed GigE. Damn. Now we all have to go > and upgrade to 10GbE cards, and I don't think my girlfriend would give me > one of those for christmas. Yes it is, and it's lovely to see. You have to nerdify her so she sees the need for geeky hardware enough to give you what you need :) > > This is on one port of a 82546GB > > > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and > > the nic is located in a 64/66 slot. > > Hmmm. Funny you get this number even on 64/66. How many PCI bridges > between the CPUs and the NIC? Any idea how many cycles an MMIO read on > your hardware is? I verified that I get the same results on a small whimpy 82540EM that runs at 32/66 as well. Just about to see what I get at 32/33 with that card. /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:19 ` Martin Josefsson @ 2004-12-05 15:30 ` Martin Josefsson 2004-12-05 17:00 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 15:30 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Martin Josefsson wrote: > > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and > > > the nic is located in a 64/66 slot. > > > > Hmmm. Funny you get this number even on 64/66. How many PCI bridges > > between the CPUs and the NIC? Any idea how many cycles an MMIO read on > > your hardware is? > > I verified that I get the same results on a small whimpy 82540EM that runs > at 32/66 as well. Just about to see what I get at 32/33 with that card. Just tested the 82540EM at 32/33 and it's a big diffrence. 60 350229 64 247037 68 219643 72 218205 76 216786 80 215386 84 214003 88 212638 92 211291 96 210004 100 208647 104 182461 108 181468 112 180453 116 179482 120 185472 124 188336 128 153743 Sorry, forgot to answer your other questions, I'm a bit excited at the moment :) The 64/66 bus on this motherboard is directly connected to the northbridge. Here's the lspci output with the 82546GB nic attached to the 64/66 bus and 82540EM nic connected to the 32/33 bus that hangs off the southbridge: 00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11) 00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge 00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05) 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04) 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03) 00:08.0 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03) 00:08.1 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03) 00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05) 01:05.0 VGA compatible controller: Silicon Integrated Systems [SiS] 86C326 5598/6326 (rev 0b) 02:05.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c) 02:06.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02) 02:08.0 Ethernet controller: Intel Corp. 82540EM Gigabit Ethernet Controller (rev 02) And lspci -t -[00]-+-00.0 +-01.0-[01]----05.0 +-07.0 +-07.1 +-07.3 +-08.0 +-08.1 \-10.0-[02]--+-05.0 +-06.0 \-08.0 I have no idea how expensive an MMIO read is on this machine, do you have an relatively easy way to find out? /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:30 ` Martin Josefsson @ 2004-12-05 17:00 ` Lennert Buytenhek 2004-12-05 17:11 ` Martin Josefsson 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 17:00 UTC (permalink / raw) To: Martin Josefsson Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, Dec 05, 2004 at 04:30:47PM +0100, Martin Josefsson wrote: > > I verified that I get the same results on a small whimpy 82540EM > > that runs at 32/66 as well. Just about to see what I get at 32/33 > > with that card. > > Just tested the 82540EM at 32/33 and it's a big diffrence. > > 60 350229 > 64 247037 > 68 219643 > 72 218205 > 76 216786 > 80 215386 > 84 214003 > 88 212638 > 92 211291 > 96 210004 > 100 208647 > 104 182461 > 108 181468 > 112 180453 > 116 179482 > 120 185472 > 124 188336 > 128 153743 With or without prefetching? My 82540 in 32/33 mode gets on baseline 2.6.9: 60 431967 61 431311 62 431927 63 427827 64 427482 And with Scott's notxints patch: 60 514496 61 514493 62 514754 63 504629 64 504123 > Sorry, forgot to answer your other questions, I'm a bit excited at the > moment :) Makes sense :) > The 64/66 bus on this motherboard is directly connected to the > northbridge. Your lspci output seems to suggest there is another PCI bridge in between (00:10.0) Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the 'Memory Controller Hub' and P64H2 the PCI-X bridge chip. > I have no idea how expensive an MMIO read is on this machine, do you have > an relatively easy way to find out? A dirty way, yes ;-) Open up e1000_osdep.h and do: -#define E1000_READ_REG(a, reg) ( \ - readl((a)->hw_addr + \ - (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg))) +#define E1000_READ_REG(a, reg) ({ \ + unsigned long s, e, d, v; \ +\ + (a)->mmio_reads++; \ + rdtsc(s, d); \ + v = readl((a)->hw_addr + \ + (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \ + rdtsc(e, d); \ + e -= s; \ + printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \ + printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \ + dump_stack(); \ + v; \ +}) You might want to disable the stack dump of course. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 17:00 ` Lennert Buytenhek @ 2004-12-05 17:11 ` Martin Josefsson 2004-12-05 17:38 ` Martin Josefsson 0 siblings, 1 reply; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 17:11 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Lennert Buytenhek wrote: > > Just tested the 82540EM at 32/33 and it's a big diffrence. > > > > 60 350229 > > 64 247037 > > 68 219643 [snip] > With or without prefetching? My 82540 in 32/33 mode gets on baseline > 2.6.9: With, will test without. I've always suspected that the 32bit bus on this motherboard is a bit slow. > Your lspci output seems to suggest there is another PCI bridge in > between (00:10.0) Yes it sits between the 32bit and the 64bit bus. > Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the > 'Memory Controller Hub' and P64H2 the PCI-X bridge chip. I don't have PCI-X (unless 64/66 counts as PCI-x which I highly doubt) > > I have no idea how expensive an MMIO read is on this machine, do you have > > an relatively easy way to find out? > > A dirty way, yes ;-) Open up e1000_osdep.h and do: > > -#define E1000_READ_REG(a, reg) ( \ > - readl((a)->hw_addr + \ > - (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg))) > +#define E1000_READ_REG(a, reg) ({ \ > + unsigned long s, e, d, v; \ > +\ > + (a)->mmio_reads++; \ > + rdtsc(s, d); \ > + v = readl((a)->hw_addr + \ > + (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \ > + rdtsc(e, d); \ > + e -= s; \ > + printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \ > + printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \ > + dump_stack(); \ > + v; \ > +}) > > You might want to disable the stack dump of course. Will test this in a while. /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 17:11 ` Martin Josefsson @ 2004-12-05 17:38 ` Martin Josefsson 2004-12-05 18:14 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 17:38 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Martin Josefsson wrote: > > -#define E1000_READ_REG(a, reg) ( \ > > - readl((a)->hw_addr + \ > > - (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg))) > > +#define E1000_READ_REG(a, reg) ({ \ > > + unsigned long s, e, d, v; \ > > +\ > > + (a)->mmio_reads++; \ > > + rdtsc(s, d); \ > > + v = readl((a)->hw_addr + \ > > + (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \ > > + rdtsc(e, d); \ > > + e -= s; \ > > + printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \ > > + printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \ > > + dump_stack(); \ > > + v; \ > > +}) > > > > You might want to disable the stack dump of course. > > Will test this in a while. It gives pretty varied results. This is during a pktgen run. The machine is an Athlon MP 2000+ which operated at 1667 MHz e1000: MMIO read took 481 clocks e1000: MMIO read took 369 clocks e1000: MMIO read took 481 clocks e1000: MMIO read took 11 clocks e1000: MMIO read took 477 clocks e1000: MMIO read took 316 clocks e1000: MMIO read took 481 clocks e1000: MMIO read took 316 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 332 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 372 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 11 clocks e1000: MMIO read took 481 clocks e1000: MMIO read took 388 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 11 clocks e1000: MMIO read took 485 clocks e1000: MMIO read took 317 clocks e1000: MMIO read took 481 clocks e1000: MMIO read took 337 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 316 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 409 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 334 clocks e1000: MMIO read took 481 clocks e1000: MMIO read took 316 clocks e1000: MMIO read took 480 clocks e1000: MMIO read took 11 clocks e1000: MMIO read took 505 clocks e1000: MMIO read took 359 clocks e1000: MMIO read took 484 clocks e1000: MMIO read took 337 clocks e1000: MMIO read took 464 clocks e1000: MMIO read took 504 clocks /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 17:38 ` Martin Josefsson @ 2004-12-05 18:14 ` Lennert Buytenhek 0 siblings, 0 replies; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 18:14 UTC (permalink / raw) To: Martin Josefsson Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, Dec 05, 2004 at 06:38:05PM +0100, Martin Josefsson wrote: > e1000: MMIO read took 481 clocks > e1000: MMIO read took 369 clocks > e1000: MMIO read took 481 clocks > e1000: MMIO read took 11 clocks > e1000: MMIO read took 477 clocks > e1000: MMIO read took 316 clocks Interesting. On a 1667MHz CPU, this is around ~0.28us per MMIO read in the worst case. On my hardware (dual Xeon 2.4GHz), the best case I've ever seen was ~0.83us. This alone can make a hell of a difference, esp. for 60B packets. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
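As a quick sanity check on the arithmetic above (an illustrative calculation, not part of the original mail): 481 cycles at 1667 MHz works out to roughly 0.29 us, and 0.83 us at 2.4 GHz to roughly 2000 cycles, so the Athlon board really is paying several times fewer CPU cycles per MMIO read.

/* Convert the figures quoted above into common units. */
#include <stdio.h>

int main(void)
{
	printf("481 cycles @ 1667 MHz = %.2f us\n", 481.0 / 1667.0);   /* ~0.29 us */
	printf("0.83 us @ 2400 MHz = %.0f cycles\n", 0.83 * 2400.0);   /* ~1992 cycles */
	return 0;
}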
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:03 ` Martin Josefsson 2004-12-05 15:15 ` Lennert Buytenhek @ 2004-12-05 15:42 ` Martin Josefsson 2004-12-05 16:48 ` Martin Josefsson ` (2 more replies) 2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman 2 siblings, 3 replies; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 15:42 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Martin Josefsson wrote: [snip] > BUT if I use the above + prefetching I get this: > > 60 1483890 [snip] > This is on one port of a 82546GB > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and > the nic is located in a 64/66 slot. > > I won't post any patch until I've tested some more and cleaned up a few > things. > > BTW, I also get some transmit timouts with Scotts patch sometimes, not > often but it does happen. Here's the patch, not much more tested (it still gives some transmit timeouts since it's scotts patch + prefetching and delayed TDT updating). And it's not cleaned up, but hey, that's development :) The delayed TDT updating was a test and currently it delays the first tx'd packet after a timerrun 1ms. Would be interesting to see what other people get with this thing. Lennert? diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h --- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h 2004-12-04 18:16:53.000000000 +0100 +++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h 2004-12-05 15:12:25.000000000 +0100 @@ -101,7 +101,7 @@ struct e1000_adapter; #define E1000_MAX_INTR 10 /* TX/RX descriptor defines */ -#define E1000_DEFAULT_TXD 256 +#define E1000_DEFAULT_TXD 4096 #define E1000_MAX_TXD 256 #define E1000_MIN_TXD 80 #define E1000_MAX_82544_TXD 4096 @@ -187,6 +187,7 @@ struct e1000_desc_ring { /* board specific private data structure */ struct e1000_adapter { + struct timer_list tx_cleanup_timer; struct timer_list tx_fifo_stall_timer; struct timer_list watchdog_timer; struct timer_list phy_info_timer; @@ -222,6 +223,7 @@ struct e1000_adapter { uint32_t tx_fifo_size; atomic_t tx_fifo_stall; boolean_t pcix_82544; + boolean_t tx_cleanup_scheduled; /* RX */ struct e1000_desc_ring rx_ring; diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h --- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h 2004-12-04 18:16:53.000000000 +0100 +++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h 2004-12-05 15:37:50.000000000 +0100 @@ -417,14 +417,12 @@ int32_t e1000_set_d3_lplu_state(struct e /* This defines the bits that are set in the Interrupt Mask * Set/Read Register. 
Each bit is documented below: * o RXT0 = Receiver Timer Interrupt (ring 0) - * o TXDW = Transmit Descriptor Written Back * o RXDMT0 = Receive Descriptor Minimum Threshold hit (ring 0) * o RXSEQ = Receive Sequence Error * o LSC = Link Status Change */ #define IMS_ENABLE_MASK ( \ E1000_IMS_RXT0 | \ - E1000_IMS_TXDW | \ E1000_IMS_RXDMT0 | \ E1000_IMS_RXSEQ | \ E1000_IMS_LSC) diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c --- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c 2004-12-05 14:59:19.000000000 +0100 +++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c 2004-12-05 15:40:11.000000000 +0100 @@ -131,7 +131,7 @@ static int e1000_set_mac(struct net_devi static void e1000_irq_disable(struct e1000_adapter *adapter); static void e1000_irq_enable(struct e1000_adapter *adapter); static irqreturn_t e1000_intr(int irq, void *data, struct pt_regs *regs); -static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter); +static void e1000_clean_tx(unsigned long data); #ifdef CONFIG_E1000_NAPI static int e1000_clean(struct net_device *netdev, int *budget); static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter, @@ -286,6 +286,7 @@ e1000_down(struct e1000_adapter *adapter e1000_irq_disable(adapter); free_irq(adapter->pdev->irq, netdev); + del_timer_sync(&adapter->tx_cleanup_timer); del_timer_sync(&adapter->tx_fifo_stall_timer); del_timer_sync(&adapter->watchdog_timer); del_timer_sync(&adapter->phy_info_timer); @@ -522,6 +523,10 @@ e1000_probe(struct pci_dev *pdev, e1000_get_bus_info(&adapter->hw); + init_timer(&adapter->tx_cleanup_timer); + adapter->tx_cleanup_timer.function = &e1000_clean_tx; + adapter->tx_cleanup_timer.data = (unsigned long) adapter; + init_timer(&adapter->tx_fifo_stall_timer); adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall; adapter->tx_fifo_stall_timer.data = (unsigned long) adapter; @@ -882,19 +887,16 @@ e1000_configure_tx(struct e1000_adapter e1000_config_collision_dist(&adapter->hw); /* Setup Transmit Descriptor Settings for eop descriptor */ - adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP | + adapter->txd_cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_IFCS; - if(adapter->hw.mac_type < e1000_82543) - adapter->txd_cmd |= E1000_TXD_CMD_RPS; - else - adapter->txd_cmd |= E1000_TXD_CMD_RS; - /* Cache if we're 82544 running in PCI-X because we'll * need this to apply a workaround later in the send path. 
*/ if(adapter->hw.mac_type == e1000_82544 && adapter->hw.bus_type == e1000_bus_type_pcix) adapter->pcix_82544 = 1; + + E1000_WRITE_REG(&adapter->hw, TXDMAC, 0); } /** @@ -1707,7 +1709,7 @@ e1000_tx_queue(struct e1000_adapter *ada wmb(); tx_ring->next_to_use = i; - E1000_WRITE_REG(&adapter->hw, TDT, i); + /* E1000_WRITE_REG(&adapter->hw, TDT, i); */ } /** @@ -1809,6 +1811,11 @@ e1000_xmit_frame(struct sk_buff *skb, st return NETDEV_TX_LOCKED; } + if(!adapter->tx_cleanup_scheduled) { + adapter->tx_cleanup_scheduled = TRUE; + mod_timer(&adapter->tx_cleanup_timer, jiffies + 1); + } + /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { @@ -1845,6 +1852,7 @@ e1000_xmit_frame(struct sk_buff *skb, st netdev->trans_start = jiffies; spin_unlock_irqrestore(&adapter->tx_lock, flags); + return NETDEV_TX_OK; } @@ -2140,8 +2148,7 @@ e1000_intr(int irq, void *data, struct p } #else for(i = 0; i < E1000_MAX_INTR; i++) - if(unlikely(!e1000_clean_rx_irq(adapter) & - !e1000_clean_tx_irq(adapter))) + if(unlikely(!e1000_clean_rx_irq(adapter))) break; #endif @@ -2159,18 +2166,15 @@ e1000_clean(struct net_device *netdev, i { struct e1000_adapter *adapter = netdev->priv; int work_to_do = min(*budget, netdev->quota); - int tx_cleaned; int work_done = 0; - tx_cleaned = e1000_clean_tx_irq(adapter); e1000_clean_rx_irq(adapter, &work_done, work_to_do); *budget -= work_done; netdev->quota -= work_done; - /* if no Rx and Tx cleanup work was done, exit the polling mode */ - if(!tx_cleaned || (work_done < work_to_do) || - !netif_running(netdev)) { + /* if no Rx cleanup work was done, exit the polling mode */ + if((work_done < work_to_do) || !netif_running(netdev)) { netif_rx_complete(netdev); e1000_irq_enable(adapter); return 0; @@ -2181,66 +2185,76 @@ e1000_clean(struct net_device *netdev, i #endif /** - * e1000_clean_tx_irq - Reclaim resources after transmit completes - * @adapter: board private structure + * e1000_clean_tx - Reclaim resources after transmit completes + * @data: timer callback data (board private structure) **/ -static boolean_t -e1000_clean_tx_irq(struct e1000_adapter *adapter) +static void +e1000_clean_tx(unsigned long data) { + struct e1000_adapter *adapter = (struct e1000_adapter *)data; struct e1000_desc_ring *tx_ring = &adapter->tx_ring; struct net_device *netdev = adapter->netdev; struct pci_dev *pdev = adapter->pdev; - struct e1000_tx_desc *tx_desc, *eop_desc; struct e1000_buffer *buffer_info; - unsigned int i, eop; - boolean_t cleaned = FALSE; + unsigned int i, next; + int size = 0, count = 0; + uint32_t tx_head; - i = tx_ring->next_to_clean; - eop = tx_ring->buffer_info[i].next_to_watch; - eop_desc = E1000_TX_DESC(*tx_ring, eop); + spin_lock(&adapter->tx_lock); - while(eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) { - for(cleaned = FALSE; !cleaned; ) { - tx_desc = E1000_TX_DESC(*tx_ring, i); - buffer_info = &tx_ring->buffer_info[i]; + E1000_WRITE_REG(&adapter->hw, TDT, tx_ring->next_to_use); - if(likely(buffer_info->dma)) { - pci_unmap_page(pdev, - buffer_info->dma, - buffer_info->length, - PCI_DMA_TODEVICE); - buffer_info->dma = 0; - } + tx_head = E1000_READ_REG(&adapter->hw, TDH); - if(buffer_info->skb) { - dev_kfree_skb_any(buffer_info->skb); - buffer_info->skb = NULL; - } + i = next = tx_ring->next_to_clean; - tx_desc->buffer_addr = 0; - tx_desc->lower.data = 0; - tx_desc->upper.data = 0; + while(i != tx_head) { + size++; + if(i == tx_ring->buffer_info[next].next_to_watch) { + count += 
size; + size = 0; + if(unlikely(++i == tx_ring->count)) + i = 0; + next = i; + } else { + if(unlikely(++i == tx_ring->count)) + i = 0; + } + } - cleaned = (i == eop); - if(unlikely(++i == tx_ring->count)) i = 0; + i = tx_ring->next_to_clean; + while(count--) { + buffer_info = &tx_ring->buffer_info[i]; + + if(likely(buffer_info->dma)) { + pci_unmap_page(pdev, + buffer_info->dma, + buffer_info->length, + PCI_DMA_TODEVICE); + buffer_info->dma = 0; } - - eop = tx_ring->buffer_info[i].next_to_watch; - eop_desc = E1000_TX_DESC(*tx_ring, eop); + + if(buffer_info->skb) { + dev_kfree_skb_any(buffer_info->skb); + buffer_info->skb = NULL; + } + + if(unlikely(++i == tx_ring->count)) + i = 0; } tx_ring->next_to_clean = i; - spin_lock(&adapter->tx_lock); + if(E1000_DESC_UNUSED(tx_ring) != tx_ring->count) + mod_timer(&adapter->tx_cleanup_timer, jiffies + 1); + else + adapter->tx_cleanup_scheduled = FALSE; - if(unlikely(cleaned && netif_queue_stopped(netdev) && - netif_carrier_ok(netdev))) + if(unlikely(netif_queue_stopped(netdev) && netif_carrier_ok(netdev))) netif_wake_queue(netdev); spin_unlock(&adapter->tx_lock); - - return cleaned; } /** /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
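A hedged reading of what the patch above changes, with a stand-alone illustrative sketch (the names below are stubs, not the real e1000 API): the per-packet MMIO write to the TDT tail register is removed from the transmit hot path, the TXDW interrupt and the RS/RPS report bits are dropped so the card never signals TX completions, and a 1-jiffy timer instead writes TDT once for everything queued since its last run and reclaims finished buffers by walking the software indices up to the hardware head pointer (TDH). It also bumps the default TX ring to 4096 descriptors and clears TXDMAC.

/* Stand-alone sketch of the batching pattern in the patch above: one
 * doorbell (TDT) write per timer run instead of per packet, and reclaim
 * driven by polling the hardware head (TDH) instead of TX interrupts.
 * All names are illustrative stubs. */
#include <stdio.h>

#define RING_SIZE 4096

static unsigned int next_to_use, next_to_clean;
static int cleanup_scheduled;

static void write_tdt(unsigned int tail)
{
	printf("TDT <- %u (single MMIO write for the whole batch)\n", tail);
}

static unsigned int read_tdh(void)
{
	return next_to_use;	/* pretend the hardware has caught up */
}

/* fast path, called once per packet: touches no device register */
static void xmit_frame(void)
{
	/* descriptor at next_to_use would be filled in here */
	next_to_use = (next_to_use + 1) % RING_SIZE;
	if (!cleanup_scheduled)
		cleanup_scheduled = 1;	/* in the patch: mod_timer(..., jiffies + 1) */
}

/* timer callback: kick the hardware once, then reclaim up to TDH */
static void clean_tx(void)
{
	write_tdt(next_to_use);
	unsigned int head = read_tdh();
	while (next_to_clean != head) {
		/* in the driver: pci_unmap_page() + dev_kfree_skb_any() */
		next_to_clean = (next_to_clean + 1) % RING_SIZE;
	}
	cleanup_scheduled = 0;	/* or re-arm the timer if the ring is still busy */
}

int main(void)
{
	for (int i = 0; i < 100; i++)
		xmit_frame();
	clean_tx();	/* in the driver this runs from the 1-jiffy timer */
	printf("reclaimed up to descriptor %u\n", next_to_clean);
	return 0;
}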
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:42 ` Martin Josefsson @ 2004-12-05 16:48 ` Martin Josefsson 2004-12-05 17:01 ` Martin Josefsson 2004-12-05 17:58 ` Lennert Buytenhek 2004-12-05 17:44 ` Lennert Buytenhek 2004-12-08 23:36 ` Ray Lehtiniemi 2 siblings, 2 replies; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 16:48 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Martin Josefsson wrote: > The delayed TDT updating was a test and currently it delays the first tx'd > packet after a timerrun 1ms. I removed the delayed TDT updating and gave it a go again (this is Scott's patch + prefetching): 60 1486193 64 1267639 68 1259682 72 1243997 76 1243989 80 1153608 84 1123813 88 1115047 92 1076636 96 1040792 100 1007252 104 975806 108 946263 112 918456 116 892227 120 867477 124 844052 128 821858 The results are a little different: 60 bytes is ok, but then it drops a lot at 64 bytes, and the curve seems a bit flatter. This should be the same driver that Lennert got 1.03Mpps with. I get 1.03Mpps without prefetching. I tried using both ports on the 82546GB nic. delay nodelay 1CPU 1.95 Mpps 1.76 Mpps 2CPU 1.60 Mpps 1.44 Mpps All tests performed on an SMP kernel, the above mention of 1CPU vs 2CPU just means how the two nics were bound to the cpus. And there are no tx interrupts at all due to Scott's patch. /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
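For context on how close the 1486193 figure above is to the theoretical ceiling (an illustrative calculation, not from the thread): assuming the 60-byte pktgen size excludes the 4-byte CRC, a minimum-size frame occupies 84 byte-times on a gigabit wire (64 bytes of frame, 8 bytes of preamble/SFD, 12 bytes of inter-frame gap), which caps the rate at about 1.488 Mpps.

/* Theoretical maximum packet rate for minimum-size frames on 1 Gbit/s. */
#include <stdio.h>

int main(void)
{
	double link_bps = 1e9;
	double bytes_on_wire = 64 + 8 + 12;	/* frame + preamble/SFD + IFG = 84 */

	printf("max rate: %.0f pps\n", link_bps / (bytes_on_wire * 8));	/* ~1488095 */
	return 0;
}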
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 16:48 ` Martin Josefsson @ 2004-12-05 17:01 ` Martin Josefsson 2004-12-05 17:58 ` Lennert Buytenhek 1 sibling, 0 replies; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 17:01 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Martin Josefsson wrote: > I removed the delayed TDT updating and gave it a go again (this is scott + > prefetching): > > 60 1486193 > 64 1267639 > 68 1259682 Yet another mail, I hope you are using a NAPI-enabled MUA :) This time I tried vanilla + prefetch and it gave pretty nice performance as well: 60 1308047 64 1076044 68 1079377 72 1058993 76 1055708 80 1025659 84 1024692 88 1024236 92 1024510 96 1012853 100 1007925 104 976500 108 947061 112 919169 116 892804 120 868084 124 844609 128 822381 Large gap between 60 and 64byte, maybe the prefetching only prefetches 32bytes at a time? As a reference: here's a completely vanilla e1000 driver: 60 860931 64 772949 68 754738 72 754200 76 756093 80 756398 84 742111 88 738120 92 740426 96 739720 100 722322 104 729287 108 719312 112 723171 116 705551 120 704843 124 704622 128 665863 /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 16:48 ` Martin Josefsson 2004-12-05 17:01 ` Martin Josefsson @ 2004-12-05 17:58 ` Lennert Buytenhek 1 sibling, 0 replies; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 17:58 UTC (permalink / raw) To: Martin Josefsson Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, Dec 05, 2004 at 05:48:34PM +0100, Martin Josefsson wrote: > I tried using both ports on the 82546GB nic. > > delay nodelay > 1CPU 1.95 Mpps 1.76 Mpps > 2CPU 1.60 Mpps 1.44 Mpps I get: delay nodelay 1CPU 1837356 1837330 2CPU 2035060 1947424 So in your case using 2 CPUs degrades performance, in my case it increases it. And TDT delaying/coalescing only improves performance when using 2 CPUs, and even then only slightly (and only for <= 62B packets.) --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:42 ` Martin Josefsson 2004-12-05 16:48 ` Martin Josefsson @ 2004-12-05 17:44 ` Lennert Buytenhek 2004-12-05 17:51 ` Lennert Buytenhek 2004-12-08 23:36 ` Ray Lehtiniemi 2 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 17:44 UTC (permalink / raw) To: Martin Josefsson Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote: > The delayed TDT updating was a test and currently it delays the first tx'd > packet after a timerrun 1ms. > > Would be interesting to see what other people get with this thing. > Lennert? I took Scott's notxints patch, added the prefetch bits and moved the TDT updating to e1000_clean_tx as you did. Slightly better than before, but not much: 60 1070157 61 1066610 62 1062088 63 991447 64 991546 65 991537 66 991449 67 990857 68 989882 69 991347 Regular TDT updating: 60 1037469 61 1038425 62 1037393 63 993143 64 992156 65 993137 66 992203 67 992165 68 992185 69 988249 --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 17:44 ` Lennert Buytenhek @ 2004-12-05 17:51 ` Lennert Buytenhek 2004-12-05 17:54 ` Martin Josefsson 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 17:51 UTC (permalink / raw) To: Martin Josefsson Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, Dec 05, 2004 at 06:44:01PM +0100, Lennert Buytenhek wrote: > On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote: > > > The delayed TDT updating was a test and currently it delays the first tx'd > > packet after a timerrun 1ms. > > > > Would be interesting to see what other people get with this thing. > > Lennert? > > I took Scott's notxints patch, added the prefetch bits and moved the > TDT updating to e1000_clean_tx as you did. > > Slightly better than before, but not much: I've tested all packet sizes now, and delayed TDT updating once per jiffy (instead of once per packet) indeed gives about 25kpps more on 60,61,62 byte packets, and is hardly worth it for bigger packets. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 17:51 ` Lennert Buytenhek @ 2004-12-05 17:54 ` Martin Josefsson 2004-12-06 11:32 ` 1.03Mpps on e1000 (was: " jamal 0 siblings, 1 reply; 85+ messages in thread From: Martin Josefsson @ 2004-12-05 17:54 UTC (permalink / raw) To: Lennert Buytenhek Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 5 Dec 2004, Lennert Buytenhek wrote: > I've tested all packet sizes now, and delayed TDT updating once per jiffy > (instead of once per packet) indeed gives about 25kpps more on 60,61,62 > byte packets, and is hardly worth it for bigger packets. Maybe we can't see any real gains here now, I wonder if it has any effect if you have lots of nics on the same bus. I mean, in theory it saves a whole lot of traffic on the bus. /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
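To put a rough number on the bus-traffic point (an illustrative back-of-the-envelope estimate, not a measurement from the thread): with per-packet tail updates, every transmitted packet costs one extra 32-bit MMIO write to TDT, while batching behind the timer costs at most one write per timer run, i.e. about HZ writes per second (assuming the HZ=1000 default of these 2.6 kernels).

/* Doorbell (TDT) writes per second: per packet vs. batched per jiffy. */
#include <stdio.h>

int main(void)
{
	double pps = 1.48e6;	/* roughly the small-packet rates seen in this thread */
	double hz = 1000;	/* one TDT write per timer run */

	printf("per-packet TDT writes: %.0f per second\n", pps);
	printf("timer-batched TDT writes: %.0f per second\n", hz);
	return 0;
}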
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit) 2004-12-05 17:54 ` Martin Josefsson @ 2004-12-06 11:32 ` jamal 2004-12-06 12:11 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: jamal @ 2004-12-06 11:32 UTC (permalink / raw) To: Martin Josefsson Cc: Lennert Buytenhek, Scott Feldman, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 2004-12-05 at 12:54, Martin Josefsson wrote: > On Sun, 5 Dec 2004, Lennert Buytenhek wrote: > > > I've tested all packet sizes now, and delayed TDT updating once per jiffy > > (instead of once per packet) indeed gives about 25kpps more on 60,61,62 > > byte packets, and is hardly worth it for bigger packets. > > Maybe we can't see any real gains here now, I wonder if it has any effect > if you have lots of nics on the same bus. I mean, in theory it saves a > whole lot of traffic on the bus. > This sounds like really exciting stuff happening here over the weekend. Scott, you had to leave Intel before giving us this tip? ;-> Someone correct me if i am wrong - but does it appear as if all these changes are only useful on PCI but not PCI-X? cheers, jamal ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit) 2004-12-06 11:32 ` 1.03Mpps on e1000 (was: " jamal @ 2004-12-06 12:11 ` Lennert Buytenhek 2004-12-06 12:20 ` jamal 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-06 12:11 UTC (permalink / raw) To: jamal Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote: > Someone correct me if i am wrong - but does it appear as if all these > changes are only useful on PCI but not PCI-X? They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0. Martin gets the ~1Mpps number with just the tx rework, and even more with TXDMAC=0 added in as well. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit) 2004-12-06 12:11 ` Lennert Buytenhek @ 2004-12-06 12:20 ` jamal 2004-12-06 12:23 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: jamal @ 2004-12-06 12:20 UTC (permalink / raw) To: Lennert Buytenhek Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, 2004-12-06 at 07:11, Lennert Buytenhek wrote: > On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote: > > > Someone correct me if i am wrong - but does it appear as if all these > > > changes are only useful on PCI but not PCI-X? > > > > They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I > > get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0. > > > > Martin gets the ~1Mpps number with just the tx rework, and even more > > with TXDMAC=0 added in as well. Right, but so far when i scan the results all i see is PCI not PCI-X. Which of your (or Martins) boards has PCI-X? cheers, jamal ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit) 2004-12-06 12:20 ` jamal @ 2004-12-06 12:23 ` Lennert Buytenhek 2004-12-06 12:30 ` Martin Josefsson 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-06 12:23 UTC (permalink / raw) To: jamal Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, Dec 06, 2004 at 07:20:43AM -0500, jamal wrote: > > > Someone correct me if i am wrong - but does it appear as if all these > > > changes are only useful on PCI but not PCI-X? > > > > They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I > > get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0. > > > > Martin gets the ~1Mpps number with just the tx rework, and even more > > with TXDMAC=0 added in as well. > > Right, but so far when i scan the results all i see is PCI not PCI-X. > Which of your (or Martins) boards has PCI-X? I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin was running at 64/133 PCI-X. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit) 2004-12-06 12:23 ` Lennert Buytenhek @ 2004-12-06 12:30 ` Martin Josefsson 2004-12-06 13:11 ` jamal 0 siblings, 1 reply; 85+ messages in thread From: Martin Josefsson @ 2004-12-06 12:30 UTC (permalink / raw) To: Lennert Buytenhek Cc: jamal, Scott Feldman, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, 6 Dec 2004, Lennert Buytenhek wrote: > > Right, but so far when i scan the results all i see is PCI not PCI-X. > > Which of your (or Martins) boards has PCI-X? > > I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin > was running at 64/133 PCI-X. I don't have any motherboards with PCI-X so no :) I'm running the 82546GB (dualport) at 64/66 and the 82540EM (desktop adapter) at 32/66, both are able to send at wirespeed. /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit) 2004-12-06 12:30 ` Martin Josefsson @ 2004-12-06 13:11 ` jamal [not found] ` <20041206132907.GA13411@xi.wantstofly.org> 0 siblings, 1 reply; 85+ messages in thread From: jamal @ 2004-12-06 13:11 UTC (permalink / raw) To: Martin Josefsson Cc: Lennert Buytenhek, Scott Feldman, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Hopefully someone will beat me to testing to see if our forwarding capacity now goes up with this new recipe. cheers, jamal On Mon, 2004-12-06 at 07:30, Martin Josefsson wrote: > On Mon, 6 Dec 2004, Lennert Buytenhek wrote: > > > > Right, but so far when i scan the results all i see is PCI not PCI-X. > > > Which of your (or Martins) boards has PCI-X? > > > > I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin > > was running at 64/133 PCI-X. > > I don't have any motherboards with PCI-X so no :) > I'm running the 82546GB (dualport) at 64/66 and the 82540EM (desktop > adapter) at 32/66, both are able to send at wirespeed. > > /Martin ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <20041206132907.GA13411@xi.wantstofly.org>]
[parent not found: <16820.37049.396306.295878@robur.slu.se>]
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) [not found] ` <16820.37049.396306.295878@robur.slu.se> @ 2004-12-06 17:32 ` P 0 siblings, 0 replies; 85+ messages in thread From: P @ 2004-12-06 17:32 UTC (permalink / raw) To: Robert Olsson Cc: Lennert Buytenhek, jamal, Martin Josefsson, Scott Feldman, mellia, Jorge Manuel Finochietto, Giulio Galante, netdev Robert Olsson wrote: > Lennert Buytenhek writes: > > On Mon, Dec 06, 2004 at 08:11:02AM -0500, jamal wrote: > > > > > Hopefully someone will beat me to testing to see if our forwarding > > > capacity now goes up with this new recipe. > > > A breakthrough we now can send small packets at wire speed it will make > development and testing much easier... It surely will!! Just to recap, 2 people have been able to tx @ wire speed. The original poster was able to receive at wire speed, but could only TX at about 50% wire speed. It would be really cool if we could combine this to bridge @ wire speed. -- Pádraig Brady - http://www.pixelbeat.org -- ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:42 ` Martin Josefsson 2004-12-05 16:48 ` Martin Josefsson 2004-12-05 17:44 ` Lennert Buytenhek @ 2004-12-08 23:36 ` Ray Lehtiniemi [not found] ` <41B825A5.2000009@draigBrady.com> 2 siblings, 1 reply; 85+ messages in thread From: Ray Lehtiniemi @ 2004-12-08 23:36 UTC (permalink / raw) To: Martin Josefsson Cc: Lennert Buytenhek, Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev hello martin On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote: > > Here's the patch, not much more tested (it still gives some transmit > timeouts since it's scotts patch + prefetching and delayed TDT updating). > And it's not cleaned up, but hey, that's development :) > > The delayed TDT updating was a test and currently it delays the first tx'd > packet after a timerrun 1ms. > > Would be interesting to see what other people get with this thing. > Lennert? well, i'm brand new to gig ethernet, but i have access to some nice hardware right now, so i decided to give your patch a try. this is the average tx pps of 10 pktgen runs for each packet size: 60 1187589.1 64 601805.4 68 1115029.3 72 593096.4 76 1097761.1 80 587125.4 84 1098045.2 88 588159.1 92 1072124.8 96 582510.3 100 1008056.8 104 577898.0 108 946974.0 112 573719.2 116 892871.0 120 573072.5 124 844608.3 128 563685.7 any idea why the packet rates are cut in half for every other line? pktgen is running with eth0 bound to CPU0 on this box: NexGate NSA 2040G Dual Xeon 3.06 GHz, HT enabled 1 GB PC3200 DDR SDRAM Dual 82544EI - on PCI-X 64 bit 133 MHz bus - behind P64H2 bridge - on hub channel D of E7501 chipset thanks -- ---------------------------------------------------------------------- Ray L <rayl@mail.com> ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <41B825A5.2000009@draigBrady.com>]
[parent not found: <20041209161825.GA32454@mail.com>]
* Re: 1.03Mpps on e1000 [not found] ` <20041209161825.GA32454@mail.com> @ 2004-12-09 17:12 ` P [not found] ` <20041209164820.GB32454@mail.com> 1 sibling, 0 replies; 85+ messages in thread From: P @ 2004-12-09 17:12 UTC (permalink / raw) To: Ray Lehtiniemi; +Cc: netdev Ray Lehtiniemi wrote: > On Thu, Dec 09, 2004 at 10:15:01AM +0000, P@draigBrady.com wrote: > >>That is very interesting! >>I'm guessing it's due to some alignment bug? >> >>Can you repeat for 60-68 ? > > certainly. here are the raw results, and a summary oprofile for > 60-68. > > looking at the disassembly, it seems that the 'rdtsc' opcode > at 0x46f3 is causing the problem? Well that wasn't obvious to me :-) I did some manipulating with sort/join and came up with the following percentage changes Note the % diff col adds to 37% address % @ 60b % @ 64b % diff 000046f5 14.6006 22.3856 7.785000 #instruction after rdtsc 00004737 15.0990 20.2242 5.125200 #instruction after rdtsc 0000474b 11.3857 12.9496 1.563900 00004726 1.5419 2.5867 1.044800 000046f7 0.6258 1.1922 0.566400 00004751 4.9377 5.4016 0.463900 000047a1 0.0118 0.4675 0.455700 00004739 1.2614 1.6962 0.434800 000044f7 1.0592 1.4506 0.391400 00004749 0.5467 0.9253 0.378600 0000475d 0.0879 0.1769 0.089000 0000445f 0.3785 0.4599 0.081400 000047c3 0.1003 0.1652 0.064900 000045cf 0.0804 0.1316 0.051200 000047aa 0.0048 0.0194 0.014600 000047bd 5.5e-04 0.0142 0.013650 000047b3 0.0106 0.0200 0.009400 00004598 0.0061 0.0147 0.008600 000045e9 0.0026 0.0103 0.007700 00004640 0.0692 0.0701 0.000900 00004465 0.0014 0.0020 0.000600 0000481b 4.3e-04 7.3e-04 0.000300 0000470e 6.1e-05 2.4e-04 0.000179 0000458d 1.2e-04 2.7e-04 0.000150 00004a47 1.8e-04 3.0e-04 0.000120 00004735 0.0085 0.0086 0.000100 00004745 1.2e-04 2.2e-04 0.000100 000047dd 0.0032 0.0033 0.000100 00004a49 0.0037 0.0038 0.000100 00004663 1.8e-04 2.7e-04 0.000090 0000489a 8.0e-04 8.9e-04 0.000090 00004514 9.2e-04 0.0010 0.000080 00004a61 6.1e-05 1.4e-04 0.000079 000046d4 6.1e-05 1.1e-04 0.000049 00004789 6.1e-05 1.1e-04 0.000049 00004683 1.2e-04 1.6e-04 0.000040 00004a51 1.8e-04 2.2e-04 0.000040 000047cc 9.2e-04 9.5e-04 0.000030 000045ba 6.8e-04 7.0e-04 0.000020 00004a36 6.1e-05 8.1e-05 0.000020 00004620 1.8e-04 1.9e-04 0.000010 0000474f 0.0042 0.0042 0.000000 0000466d 6.1e-05 5.4e-05 -0.000007 00004817 1.2e-04 1.1e-04 -0.000010 0000470c 4.9e-04 4.6e-04 -0.000030 000045eb 6.1e-05 2.7e-05 -0.000034 00004616 6.1e-05 2.7e-05 -0.000034 00004a1e 6.1e-05 2.7e-05 -0.000034 00004652 1.2e-04 8.1e-05 -0.000039 000047ee 1.2e-04 8.1e-05 -0.000039 00004685 1.2e-04 5.4e-05 -0.000066 00004894 3.1e-04 2.4e-04 -0.000070 00004714 6.1e-04 5.2e-04 -0.000090 00004524 1.2e-04 2.7e-05 -0.000093 0000467b 1.2e-04 2.7e-05 -0.000093 000046bb 1.2e-04 2.7e-05 -0.000093 00004446 0.0010 8.9e-04 -0.000110 0000488b 2.5e-04 1.4e-04 -0.000110 00004522 4.3e-04 2.7e-04 -0.000160 00004508 3.1e-04 1.4e-04 -0.000170 00004634 6.1e-04 4.3e-04 -0.000180 00004587 8.0e-04 6.0e-04 -0.000200 000047ae 0.0032 0.0030 -0.000200 00004440 5.5e-04 3.3e-04 -0.000220 00004459 0.0012 9.8e-04 -0.000220 00004506 9.2e-04 6.5e-04 -0.000270 000049ff 0.0021 0.0018 -0.000300 0000451c 0.0013 9.8e-04 -0.000320 000046c7 3.7e-04 2.7e-05 -0.000343 00004673 4.9e-04 1.1e-04 -0.000380 0000478f 4.9e-04 1.1e-04 -0.000380 00004450 0.0012 8.1e-04 -0.000390 00004541 6.1e-04 2.2e-04 -0.000390 000045a9 7.4e-04 3.5e-04 -0.000390 00004777 5.5e-04 1.6e-04 -0.000390 000047d0 6.8e-04 2.7e-04 -0.000410 00004457 0.0084 0.0079 -0.000500 000047ba 0.0018 0.0013 -0.000500 00004a6b 0.0031 0.0026 -0.000500 
00004612 5.5e-04 2.7e-05 -0.000523 00004681 6.8e-04 1.4e-04 -0.000540 0000477b 7.4e-04 1.9e-04 -0.000550 00004503 0.0017 0.0011 -0.000600 000047df 0.0020 0.0014 -0.000600 000045b6 0.0010 3.8e-04 -0.000620 00004781 0.0010 3.8e-04 -0.000620 00004667 0.0012 5.2e-04 -0.000680 00004885 0.0015 8.1e-04 -0.000690 000045a3 0.0017 0.0010 -0.000700 000047da 0.0014 7.0e-04 -0.000700 00004747 8.6e-04 8.1e-05 -0.000779 0000446f 0.0151 0.0143 -0.000800 00004702 0.0019 0.0011 -0.000800 00004718 0.0157 0.0149 -0.000800 000047b6 0.0022 0.0014 -0.000800 00004a25 0.0054 0.0046 -0.000800 00004a65 0.0026 0.0018 -0.000800 0000477e 9.8e-04 1.4e-04 -0.000840 000045c8 0.0015 5.7e-04 -0.000930 00004543 0.0049 0.0039 -0.001000 00004604 0.0013 3.0e-04 -0.001000 00004787 0.0026 0.0016 -0.001000 00004a02 0.0018 7.6e-04 -0.001040 0000450e 0.0063 0.0052 -0.001100 0000465d 0.0022 0.0011 -0.001100 0000459d 0.0014 1.9e-04 -0.001210 0000464a 0.0017 3.8e-04 -0.001320 000047cf 0.0020 6.8e-04 -0.001320 00004a13 0.0016 1.1e-04 -0.001490 0000461e 0.0017 1.6e-04 -0.001540 000044ff 0.0040 0.0024 -0.001600 00004628 0.0020 3.5e-04 -0.001650 000045d5 0.0076 0.0055 -0.002100 00004638 0.0049 0.0027 -0.002200 00004650 0.0045 0.0021 -0.002400 00004632 0.0052 0.0026 -0.002600 00004769 0.0059 0.0033 -0.002600 00004444 0.0957 0.0930 -0.002700 00004610 0.0034 6.5e-04 -0.002750 000046fb 0.0097 0.0069 -0.002800 0000487f 0.0175 0.0146 -0.002900 000044f4 0.0071 0.0039 -0.003200 00004757 0.0068 0.0032 -0.003600 00004583 0.0176 0.0136 -0.004000 0000472d 0.0178 0.0138 -0.004000 00004624 0.0049 6.5e-04 -0.004250 00004700 0.0074 0.0029 -0.004500 00004763 0.0110 0.0059 -0.005100 00004755 0.0091 0.0037 -0.005400 000047b0 0.0201 0.0138 -0.006300 0000459b 0.0102 0.0035 -0.006700 000046fd 0.0146 0.0078 -0.006800 00004797 0.0253 0.0181 -0.007200 0000473f 0.0226 0.0153 -0.007300 0000476d 0.0253 0.0180 -0.007300 0000474d 0.0236 0.0152 -0.008400 000044f0 0.0191 0.0094 -0.009700 00004471 0.0332 0.0222 -0.011000 000046f3 0.0224 0.0112 -0.011200 0000472f 0.0221 0.0105 -0.011600 00004743 0.0146 0.0025 -0.012100 00004753 0.0311 0.0185 -0.012600 000044f9 0.0232 0.0100 -0.013200 000045f2 0.0781 0.0638 -0.014300 000045c0 0.0796 0.0632 -0.016400 000047a4 0.1020 0.0851 -0.016900 00004455 0.0468 0.0282 -0.018600 0000472a 0.0331 0.0140 -0.019100 00004720 0.0420 0.0228 -0.019200 00004741 0.0520 0.0255 -0.026500 0000460a 0.0296 6.8e-04 -0.028920 00004469 0.0696 0.0391 -0.030500 000047b8 0.0485 0.0164 -0.032100 00004771 0.0479 0.0151 -0.032800 000047d6 0.0634 0.0270 -0.036400 000045c2 0.1763 0.0500 -0.126300 0000488e 0.2228 0.0961 -0.126700 0000458f 0.2212 0.0932 -0.128000 00004709 0.8817 0.7529 -0.128800 0000479b 0.2469 0.1158 -0.131100 000047c6 0.2489 0.1103 -0.138600 00004775 0.2514 0.1124 -0.139000 00004657 0.2502 0.1105 -0.139700 0000444c 0.2555 0.1107 -0.144800 000045df 0.1822 0.0357 -0.146500 00004608 0.2596 0.1117 -0.147900 00004618 0.2635 0.1153 -0.148200 00004679 0.2580 0.1094 -0.148600 0000462c 0.2630 0.1134 -0.149600 00004594 0.2494 0.0958 -0.153600 000045f8 0.1934 0.0369 -0.156500 0000471a 0.8706 0.6718 -0.198800 000045e6 0.4986 0.2189 -0.279700 00004644 0.4393 0.1515 -0.287800 0000463c 0.5214 0.2247 -0.296700 000045fe 0.5160 0.2022 -0.313800 00004622 3.5942 1.5668 -2.027400 0000461c 3.6298 1.5695 -2.060300 00004716 19.2425 16.4027 -2.839800 00004600 5.2128 2.2837 -2.929100 000045b0 7.8500 3.3027 -4.547300 > > > it is worth noting that my box has become quite unstable since > i started to use oprofile and pktgen together. 
sshd stops responding, > and the network seems to go down. not sure what is happening there... > this instability seems to be persisting across reboots, unfortunately... > > > > > > > 60 bytes > -------- > > 60 1195259 > 60 1206652 > 60 1139822 > 60 1206650 > 60 1206654 > 60 1136447 > 60 1206651 > 60 1148050 > 60 1206504 > 60 1206653 > > CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 > vma samples % image name app name symbol name > 00004337 1626886 57.5170 pktgen.ko pktgen pktgen_thread_worker > c02f389d 282974 10.0043 vmlinux vmlinux _spin_lock > c021adc0 219795 7.7706 vmlinux vmlinux e1000_clean_tx > c02f3904 164371 5.8112 vmlinux vmlinux _spin_lock_bh > c0219c74 160383 5.6702 vmlinux vmlinux e1000_xmit_frame > c02f3870 124564 4.4038 vmlinux vmlinux _spin_trylock > 000041d1 48511 1.7151 pktgen.ko pktgen next_to_run > c02f399a 46205 1.6335 vmlinux vmlinux _spin_unlock_irqrestore > c010c7d9 20876 0.7381 vmlinux vmlinux mark_offset_tsc > c011fdb2 13116 0.4637 vmlinux vmlinux local_bh_enable > c0107248 8166 0.2887 vmlinux vmlinux timer_interrupt > c0103970 5607 0.1982 vmlinux vmlinux apic_timer_interrupt > c010123a 5368 0.1898 vmlinux vmlinux default_idle > c02f39a5 4256 0.1505 vmlinux vmlinux _spin_unlock_bh > c0103c08 4042 0.1429 vmlinux vmlinux page_fault > 0804ae00 3930 0.1389 oprofiled oprofiled sfile_find > 0804aa10 3573 0.1263 oprofiled oprofiled get_file > > > > 64 bytes > -------- > > 64 606104 > 64 597737 > 64 594927 > 64 595531 > 64 606876 > 64 594751 > 64 595709 > 64 595070 > 64 606876 > 64 595600 > > CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 > vma samples % image name app name symbol name > 00004337 3688998 68.9133 pktgen.ko pktgen pktgen_thread_worker > c02f389d 519536 9.7053 vmlinux vmlinux _spin_lock > c021adc0 271791 5.0773 vmlinux vmlinux e1000_clean_tx > c0219c74 214428 4.0057 vmlinux vmlinux e1000_xmit_frame > c02f3904 166334 3.1072 vmlinux vmlinux _spin_lock_bh > c02f3870 127623 2.3841 vmlinux vmlinux _spin_trylock > 000041d1 111650 2.0857 pktgen.ko pktgen next_to_run > c02f399a 47428 0.8860 vmlinux vmlinux _spin_unlock_irqrestore > c010c7d9 39586 0.7395 vmlinux vmlinux mark_offset_tsc > c0107248 14671 0.2741 vmlinux vmlinux timer_interrupt > c011fdb2 12926 0.2415 vmlinux vmlinux local_bh_enable > c0103970 11778 0.2200 vmlinux vmlinux apic_timer_interrupt > c010123a 9282 0.1734 vmlinux vmlinux default_idle > 0804ae00 7449 0.1392 oprofiled oprofiled sfile_find > 0804aa10 6387 0.1193 oprofiled oprofiled get_file > 0804ac30 6234 0.1165 oprofiled oprofiled sfile_log_sample > 0804f4b0 5852 0.1093 oprofiled oprofiled odb_insert > > > > 68 bytes > -------- > > 68 1124822 > 68 1124805 > 68 1090006 > 68 1124822 > 68 1089775 > 68 1124812 > 68 1123305 > 68 1091796 > 68 1124820 > 68 1087043 > > CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 > vma samples % image name app name symbol name > 00004337 1753028 58.4510 pktgen.ko pktgen pktgen_thread_worker > c02f389d 301835 10.0641 vmlinux vmlinux _spin_lock > c021adc0 223405 7.4490 vmlinux vmlinux e1000_clean_tx > c02f3904 167118 5.5722 vmlinux vmlinux _spin_lock_bh > c0219c74 166016 
5.5355 vmlinux vmlinux e1000_xmit_frame > c02f3870 131516 4.3851 vmlinux vmlinux _spin_trylock > 000041d1 56334 1.8783 pktgen.ko pktgen next_to_run > c02f399a 46860 1.5624 vmlinux vmlinux _spin_unlock_irqrestore > c010c7d9 26188 0.8732 vmlinux vmlinux mark_offset_tsc > c011fdb2 12199 0.4068 vmlinux vmlinux local_bh_enable > c0107248 10399 0.3467 vmlinux vmlinux timer_interrupt > c010123a 8799 0.2934 vmlinux vmlinux default_idle > c0103970 8194 0.2732 vmlinux vmlinux apic_timer_interrupt > c0117346 4822 0.1608 vmlinux vmlinux find_busiest_group > 0804ae00 4214 0.1405 oprofiled oprofiled sfile_find > c02f39a5 3955 0.1319 vmlinux vmlinux _spin_unlock_bh > 0804aa10 3745 0.1249 oprofiled oprofiled get_file > > > > here is the detailed breakdown for the 60 byte pktgen: > > CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 > vma samples % image name app name symbol name > 00004337 1626886 57.5170 pktgen.ko pktgen pktgen_thread_worker > 00004440 9 5.5e-04 > 00004444 1557 0.0957 > 00004446 17 0.0010 > 0000444c 4156 0.2555 > 00004450 19 0.0012 > 00004455 762 0.0468 > 00004457 136 0.0084 > 00004459 20 0.0012 > 0000445f 6157 0.3785 > 00004465 23 0.0014 > 00004469 1133 0.0696 > 0000446f 246 0.0151 > 00004471 540 0.0332 > 000044f0 310 0.0191 > 000044f4 115 0.0071 > 000044f7 17232 1.0592 > 000044f9 377 0.0232 > 000044ff 65 0.0040 > 00004503 28 0.0017 > 00004506 15 9.2e-04 > 00004508 5 3.1e-04 > 0000450e 102 0.0063 > 00004514 15 9.2e-04 > 0000451c 21 0.0013 > 00004522 7 4.3e-04 > 00004524 2 1.2e-04 > 00004541 10 6.1e-04 > 00004543 79 0.0049 > 00004583 287 0.0176 > 00004587 13 8.0e-04 > 0000458d 2 1.2e-04 > 0000458f 3598 0.2212 > 00004594 4057 0.2494 > 00004598 100 0.0061 > 0000459b 166 0.0102 > 0000459d 22 0.0014 > 000045a3 28 0.0017 > 000045a9 12 7.4e-04 > 000045b0 127711 7.8500 > 000045b6 17 0.0010 > 000045ba 11 6.8e-04 > 000045c0 1295 0.0796 > 000045c2 2869 0.1763 > 000045c8 24 0.0015 > 000045cf 1308 0.0804 > 000045d5 123 0.0076 > 000045df 2964 0.1822 > 000045e6 8111 0.4986 > 000045e9 42 0.0026 > 000045eb 1 6.1e-05 > 000045f2 1271 0.0781 > 000045f8 3146 0.1934 > 000045fe 8395 0.5160 > 00004600 84807 5.2128 > 00004604 21 0.0013 > 00004608 4223 0.2596 > 0000460a 481 0.0296 > 00004610 55 0.0034 > 00004612 9 5.5e-04 > 00004616 1 6.1e-05 > 00004618 4287 0.2635 > 0000461a 3 1.8e-04 > 0000461c 59052 3.6298 > 0000461e 28 0.0017 > 00004620 3 1.8e-04 > 00004622 58473 3.5942 > 00004624 79 0.0049 > 00004628 33 0.0020 > 0000462c 4279 0.2630 > 00004632 84 0.0052 > 00004634 10 6.1e-04 > 00004638 80 0.0049 > 0000463c 8483 0.5214 > 00004640 1126 0.0692 > 00004644 7147 0.4393 > 0000464a 27 0.0017 > 00004650 73 0.0045 > 00004652 2 1.2e-04 > 00004657 4070 0.2502 > 0000465d 36 0.0022 > 00004663 3 1.8e-04 > 00004665 2 1.2e-04 > 00004667 20 0.0012 > 0000466d 1 6.1e-05 > 00004673 8 4.9e-04 > 00004679 4197 0.2580 > 0000467b 2 1.2e-04 > 00004681 11 6.8e-04 > 00004683 2 1.2e-04 > 00004685 2 1.2e-04 > 000046bb 2 1.2e-04 > 000046c1 2 1.2e-04 > 000046c7 6 3.7e-04 > 000046d4 1 6.1e-05 > 000046f3 365 0.0224 > 000046f5 237535 14.6006 > 000046f7 10181 0.6258 > 000046fb 157 0.0097 > 000046fd 238 0.0146 > 00004700 120 0.0074 > 00004702 31 0.0019 > 00004709 14344 0.8817 > 0000470c 8 4.9e-04 > 0000470e 1 6.1e-05 > 00004714 10 6.1e-04 > 00004716 313053 19.2425 > 00004718 255 0.0157 > 0000471a 14164 0.8706 > 00004720 683 0.0420 > 00004726 25085 1.5419 > 0000472a 538 0.0331 > 0000472d 290 0.0178 
> 0000472f 359 0.0221 > 00004735 139 0.0085 > 00004737 245644 15.0990 > 00004739 20521 1.2614 > 0000473f 368 0.0226 > 00004741 846 0.0520 > 00004743 237 0.0146 > 00004745 2 1.2e-04 > 00004747 14 8.6e-04 > 00004749 8894 0.5467 > 0000474b 185233 11.3857 > 0000474d 384 0.0236 > 0000474f 69 0.0042 > 00004751 80331 4.9377 > 00004753 506 0.0311 > 00004755 148 0.0091 > 00004757 111 0.0068 > 0000475d 1430 0.0879 > 00004763 179 0.0110 > 00004769 96 0.0059 > 0000476d 411 0.0253 > 00004771 780 0.0479 > 00004775 4090 0.2514 > 00004777 9 5.5e-04 > 0000477b 12 7.4e-04 > 0000477e 16 9.8e-04 > 00004781 17 0.0010 > 00004787 43 0.0026 > 00004789 1 6.1e-05 > 0000478f 8 4.9e-04 > 00004797 412 0.0253 > 0000479b 4016 0.2469 > 000047a1 192 0.0118 > 000047a4 1660 0.1020 > 000047aa 78 0.0048 > 000047ae 52 0.0032 > 000047b0 327 0.0201 > 000047b3 173 0.0106 > 000047b6 35 0.0022 > 000047b8 789 0.0485 > 000047ba 29 0.0018 > 000047bd 9 5.5e-04 > 000047c3 1632 0.1003 > 000047c6 4049 0.2489 > 000047cc 15 9.2e-04 > 000047cf 33 0.0020 > 000047d0 11 6.8e-04 > 000047d6 1032 0.0634 > 000047da 22 0.0014 > 000047dd 52 0.0032 > 000047df 33 0.0020 > 000047ea 1 6.1e-05 > 000047ee 2 1.2e-04 > 000047f6 1 6.1e-05 > 000047ff 1 6.1e-05 > 00004809 1 6.1e-05 > 0000480e 1 6.1e-05 > 00004817 2 1.2e-04 > 0000481b 7 4.3e-04 > 0000487f 284 0.0175 > 00004885 24 0.0015 > 0000488b 4 2.5e-04 > 0000488e 3625 0.2228 > 00004894 5 3.1e-04 > 0000489a 13 8.0e-04 > 000049ff 34 0.0021 > 00004a02 30 0.0018 > 00004a04 4 2.5e-04 > 00004a0f 3 1.8e-04 > 00004a13 26 0.0016 > 00004a1e 1 6.1e-05 > 00004a25 88 0.0054 > 00004a36 1 6.1e-05 > 00004a47 3 1.8e-04 > 00004a49 60 0.0037 > 00004a51 3 1.8e-04 > 00004a61 1 6.1e-05 > 00004a65 42 0.0026 > 00004a6b 50 0.0031 > > > > here is the detailed breakdown for the 64 byte pktgen: > > CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 > vma samples % image name app name symbol name > 00004337 3688998 68.9133 pktgen.ko pktgen pktgen_thread_worker > 00004440 12 3.3e-04 > 00004444 3431 0.0930 > 00004446 33 8.9e-04 > 0000444c 4082 0.1107 > 00004450 30 8.1e-04 > 00004455 1041 0.0282 > 00004457 292 0.0079 > 00004459 36 9.8e-04 > 0000445f 16964 0.4599 > 00004465 73 0.0020 > 00004469 1442 0.0391 > 0000446f 528 0.0143 > 00004471 818 0.0222 > 000044f0 347 0.0094 > 000044f4 145 0.0039 > 000044f7 53514 1.4506 > 000044f9 369 0.0100 > 000044ff 90 0.0024 > 00004503 41 0.0011 > 00004506 24 6.5e-04 > 00004508 5 1.4e-04 > 0000450e 192 0.0052 > 00004514 37 0.0010 > 00004516 5 1.4e-04 > 0000451c 36 9.8e-04 > 00004522 10 2.7e-04 > 00004524 1 2.7e-05 > 00004541 8 2.2e-04 > 00004543 144 0.0039 > 00004583 503 0.0136 > 00004587 22 6.0e-04 > 0000458d 10 2.7e-04 > 0000458f 3437 0.0932 > 00004594 3533 0.0958 > 00004598 541 0.0147 > 0000459b 129 0.0035 > 0000459d 7 1.9e-04 > 000045a3 38 0.0010 > 000045a9 13 3.5e-04 > 000045b0 121838 3.3027 > 000045b6 14 3.8e-04 > 000045ba 26 7.0e-04 > 000045c0 2330 0.0632 > 000045c2 1843 0.0500 > 000045c8 21 5.7e-04 > 000045cf 4855 0.1316 > 000045d5 203 0.0055 > 000045df 1317 0.0357 > 000045e6 8076 0.2189 > 000045e9 381 0.0103 > 000045eb 1 2.7e-05 > 000045f2 2355 0.0638 > 000045f8 1362 0.0369 > 000045fe 7460 0.2022 > 00004600 84246 2.2837 > 00004604 11 3.0e-04 > 00004608 4122 0.1117 > 0000460a 25 6.8e-04 > 00004610 24 6.5e-04 > 00004612 1 2.7e-05 > 00004614 1 2.7e-05 > 00004616 1 2.7e-05 > 00004618 4254 0.1153 > 0000461c 57898 1.5695 > 0000461e 6 1.6e-04 > 00004620 7 
1.9e-04 > 00004622 57801 1.5668 > 00004624 24 6.5e-04 > 00004628 13 3.5e-04 > 0000462c 4185 0.1134 > 00004632 97 0.0026 > 00004634 16 4.3e-04 > 00004638 99 0.0027 > 0000463c 8288 0.2247 > 00004640 2585 0.0701 > 00004644 5590 0.1515 > 0000464a 14 3.8e-04 > 00004650 77 0.0021 > 00004652 3 8.1e-05 > 00004657 4077 0.1105 > 0000465d 41 0.0011 > 00004663 10 2.7e-04 > 00004667 19 5.2e-04 > 0000466d 2 5.4e-05 > 00004673 4 1.1e-04 > 00004679 4035 0.1094 > 0000467b 1 2.7e-05 > 00004681 5 1.4e-04 > 00004683 6 1.6e-04 > 00004685 2 5.4e-05 > 000046bb 1 2.7e-05 > 000046c7 1 2.7e-05 > 000046d4 4 1.1e-04 > 000046f3 415 0.0112 > 000046f5 825806 22.3856 > 000046f7 43980 1.1922 > 000046fb 256 0.0069 > 000046fd 286 0.0078 > 00004700 108 0.0029 > 00004702 41 0.0011 > 00004705 5 1.4e-04 > 00004709 27774 0.7529 > 0000470c 17 4.6e-04 > 0000470e 9 2.4e-04 > 00004714 19 5.2e-04 > 00004716 605096 16.4027 > 00004718 548 0.0149 > 0000471a 24782 0.6718 > 00004720 842 0.0228 > 00004726 95423 2.5867 > 0000472a 516 0.0140 > 0000472d 510 0.0138 > 0000472f 389 0.0105 > 00004735 316 0.0086 > 00004737 746069 20.2242 > 00004739 62574 1.6962 > 0000473f 565 0.0153 > 00004741 941 0.0255 > 00004743 91 0.0025 > 00004745 8 2.2e-04 > 00004747 3 8.1e-05 > 00004749 34135 0.9253 > 0000474b 477712 12.9496 > 0000474d 561 0.0152 > 0000474f 155 0.0042 > 00004751 199265 5.4016 > 00004753 684 0.0185 > 00004755 137 0.0037 > 00004757 119 0.0032 > 0000475d 6527 0.1769 > 00004763 217 0.0059 > 00004769 120 0.0033 > 0000476d 665 0.0180 > 00004771 558 0.0151 > 00004775 4148 0.1124 > 00004777 6 1.6e-04 > 0000477b 7 1.9e-04 > 0000477e 5 1.4e-04 > 00004781 14 3.8e-04 > 00004787 60 0.0016 > 00004789 4 1.1e-04 > 0000478f 4 1.1e-04 > 00004797 669 0.0181 > 0000479b 4271 0.1158 > 000047a1 17245 0.4675 > 000047a4 3138 0.0851 > 000047aa 716 0.0194 > 000047ae 112 0.0030 > 000047b0 508 0.0138 > 000047b3 736 0.0200 > 000047b6 53 0.0014 > 000047b8 604 0.0164 > 000047ba 47 0.0013 > 000047bd 525 0.0142 > 000047c3 6094 0.1652 > 000047c6 4068 0.1103 > 000047cc 35 9.5e-04 > 000047cf 25 6.8e-04 > 000047d0 10 2.7e-04 > 000047d6 995 0.0270 > 000047da 26 7.0e-04 > 000047dd 120 0.0033 > 000047df 50 0.0014 > 000047ee 3 8.1e-05 > 000047fa 1 2.7e-05 > 00004817 4 1.1e-04 > 0000481b 27 7.3e-04 > 0000487f 539 0.0146 > 00004885 30 8.1e-04 > 0000488b 5 1.4e-04 > 0000488e 3544 0.0961 > 00004894 9 2.4e-04 > 0000489a 33 8.9e-04 > 000049ff 67 0.0018 > 00004a02 28 7.6e-04 > 00004a11 1 2.7e-05 > 00004a13 4 1.1e-04 > 00004a18 3 8.1e-05 > 00004a1e 1 2.7e-05 > 00004a25 168 0.0046 > 00004a36 3 8.1e-05 > 00004a47 11 3.0e-04 > 00004a49 139 0.0038 > 00004a51 8 2.2e-04 > 00004a59 1 2.7e-05 > 00004a61 5 1.4e-04 > 00004a65 67 0.0018 > 00004a6b 97 0.0026 > > > and finally, here's the disasm of the threadworker function: > > > 00004337 <pktgen_Thread_worker>: > 4337: 55 push %ebp > 4338: 57 push %edi > 4339: 56 push %esi > 433a: 53 push %ebx > 433b: bb 00 e0 ff ff mov $0xffffe000,%ebx > 4340: 21 e3 and %esp,%ebx > 4342: 83 ec 2c sub $0x2c,%esp > 4345: 89 44 24 28 mov %eax,0x28(%esp) > 4349: 8b b0 bc 02 00 00 mov 0x2bc(%eax),%esi > 434f: c7 44 24 20 00 00 00 movl $0x0,0x20(%esp) > 4356: 00 > 4357: c7 04 24 c2 06 00 00 movl $0x6c2,(%esp) > 435e: 89 74 24 04 mov %esi,0x4(%esp) > 4362: e8 fc ff ff ff call 4363 <pktgen_thread_worker+0x2c> > 4367: 8b 03 mov (%ebx),%eax > 4369: 8b 80 90 04 00 00 mov 0x490(%eax),%eax > 436f: 05 04 05 00 00 add $0x504,%eax > 4374: e8 fc ff ff ff call 4375 <pktgen_thread_worker+0x3e> > 4379: 8b 03 mov (%ebx),%eax > 437b: c7 80 94 04 00 00 ff movl $0xfffbbeff,0x494(%eax) > 
4382: be fb ff > 4385: c7 80 98 04 00 00 ff movl $0xffffffff,0x498(%eax) > 438c: ff ff ff > 438f: e8 fc ff ff ff call 4390 <pktgen_thread_worker+0x59> > 4394: 8b 03 mov (%ebx),%eax > 4396: 8b 80 90 04 00 00 mov 0x490(%eax),%eax > 439c: 05 04 05 00 00 add $0x504,%eax > 43a1: e8 fc ff ff ff call 43a2 <pktgen_thread_worker+0x6b> > 43a6: 89 f1 mov %esi,%ecx > 43a8: ba 01 00 00 00 mov $0x1,%edx > 43ad: d3 e2 shl %cl,%edx > 43af: 8b 03 mov (%ebx),%eax > 43b1: e8 fc ff ff ff call 43b2 <pktgen_thread_worker+0x7b> > 43b6: 39 73 10 cmp %esi,0x10(%ebx) > 43b9: 74 08 je 43c3 <pktgen_thread_worker+0x8c> > 43bb: 0f 0b ud2a > 43bd: 27 daa > 43be: 0b b0 06 00 00 8b or 0x8b000006(%eax),%esi > 43c4: 44 inc %esp > 43c5: 24 28 and $0x28,%al > 43c7: 8b 54 24 28 mov 0x28(%esp),%edx > 43cb: 05 c0 02 00 00 add $0x2c0,%eax > 43d0: 89 44 24 1c mov %eax,0x1c(%esp) > 43d4: c7 82 c0 02 00 00 01 movl $0x1,0x2c0(%edx) > 43db: 00 00 00 > 43de: 89 d0 mov %edx,%eax > 43e0: 8b 4c 24 1c mov 0x1c(%esp),%ecx > 43e4: 05 c4 02 00 00 add $0x2c4,%eax > 43e9: 89 41 04 mov %eax,0x4(%ecx) > 43ec: 89 41 08 mov %eax,0x8(%ecx) > 43ef: 83 a2 b4 02 00 00 f0 andl $0xfffffff0,0x2b4(%edx) > 43f6: 8b 03 mov (%ebx),%eax > 43f8: 8b 80 a8 00 00 00 mov 0xa8(%eax),%eax > 43fe: 89 82 b8 02 00 00 mov %eax,0x2b8(%edx) > 4404: 8b 03 mov (%ebx),%eax > 4406: 8b 80 a8 00 00 00 mov 0xa8(%eax),%eax > 440c: 89 74 24 04 mov %esi,0x4(%esp) > 4410: c7 04 24 a0 06 00 00 movl $0x6a0,(%esp) > 4417: 89 44 24 08 mov %eax,0x8(%esp) > 441b: e8 fc ff ff ff call 441c <pktgen_thread_worker+0xe5> > 4420: 8b 44 24 28 mov 0x28(%esp),%eax > 4424: 8b 80 b0 02 00 00 mov 0x2b0(%eax),%eax > 442a: 89 44 24 24 mov %eax,0x24(%esp) > 442e: 8b 03 mov (%ebx),%eax > 4430: c7 00 01 00 00 00 movl $0x1,(%eax) > 4436: f0 83 44 24 00 00 lock addl $0x0,0x0(%esp) > 443c: 89 5c 24 18 mov %ebx,0x18(%esp) > 4440: 8b 54 24 18 mov 0x18(%esp),%edx > 4444: 8b 02 mov (%edx),%eax > 4446: c7 00 00 00 00 00 movl $0x0,(%eax) > 444c: 8b 44 24 28 mov 0x28(%esp),%eax > 4450: e8 7c fd ff ff call 41d1 <next_to_run> > 4455: 85 c0 test %eax,%eax > 4457: 89 c6 mov %eax,%esi > 4459: 0f 84 aa 03 00 00 je 4809 <pktgen_thread_worker+0x4d2> > 445f: 8b 88 44 04 00 00 mov 0x444(%eax),%ecx > 4465: 89 4c 24 14 mov %ecx,0x14(%esp) > 4469: 8b b8 90 02 00 00 mov 0x290(%eax),%edi > 446f: 85 ff test %edi,%edi > 4471: 74 7d je 44f0 <pktgen_thread_worker+0x1b9> > 4473: 0f 31 rdtsc > 4475: 89 44 24 0c mov %eax,0xc(%esp) > 4479: 89 54 24 10 mov %edx,0x10(%esp) > 447d: 85 d2 test %edx,%edx > 447f: 8b 1d 1c 00 00 00 mov 0x1c,%ebx > 4485: 89 d1 mov %edx,%ecx > 4487: 89 c5 mov %eax,%ebp > 4489: 74 08 je 4493 <pktgen_thread_worker+0x15c> > 448b: 89 d0 mov %edx,%eax > 448d: 31 d2 xor %edx,%edx > 448f: f7 f3 div %ebx > 4491: 89 c1 mov %eax,%ecx > 4493: 89 e8 mov %ebp,%eax > 4495: f7 f3 div %ebx > 4497: 89 ca mov %ecx,%edx > 4499: 89 d3 mov %edx,%ebx > 449b: 8b 96 b8 02 00 00 mov 0x2b8(%esi),%edx > 44a1: 39 d3 cmp %edx,%ebx > 44a3: 89 c1 mov %eax,%ecx > 44a5: 8b 86 b4 02 00 00 mov 0x2b4(%esi),%eax > 44ab: 77 37 ja 44e4 <pktgen_thread_worker+0x1ad> > 44ad: 72 04 jb 44b3 <pktgen_thread_worker+0x17c> > 44af: 39 c1 cmp %eax,%ecx > 44b1: 73 31 jae 44e4 <pktgen_thread_worker+0x1ad> > 44b3: 8b be b4 02 00 00 mov 0x2b4(%esi),%edi > 44b9: 29 cf sub %ecx,%edi > 44bb: 81 ff 0f 27 00 00 cmp $0x270f,%edi > 44c1: 0f 86 ee 04 00 00 jbe 49b5 <pktgen_thread_worker+0x67e> > 44c7: b9 d3 4d 62 10 mov $0x10624dd3,%ecx > 44cc: 89 f8 mov %edi,%eax > 44ce: f7 e1 mul %ecx > 44d0: 89 d1 mov %edx,%ecx > 44d2: 89 f2 mov %esi,%edx > 44d4: c1 e9 06 shr $0x6,%ecx > 
44d7: 89 c8 mov %ecx,%eax > 44d9: e8 10 e4 ff ff call 28ee <pg_udelay> > 44de: 8b be 90 02 00 00 mov 0x290(%esi),%edi > 44e4: 81 ff ff ff ff 7f cmp $0x7fffffff,%edi > 44ea: 0f 84 d5 04 00 00 je 49c5 <pktgen_thread_worker+0x68e> > 44f0: 8b 54 24 14 mov 0x14(%esp),%edx > 44f4: 8b 42 24 mov 0x24(%edx),%eax > 44f7: a8 01 test $0x1,%al > 44f9: 0f 85 f4 01 00 00 jne 46f3 <pktgen_thread_worker+0x3bc> > 44ff: 8b 4c 24 18 mov 0x18(%esp),%ecx > 4503: 8b 41 08 mov 0x8(%ecx),%eax > 4506: a8 08 test $0x8,%al > 4508: 0f 85 e5 01 00 00 jne 46f3 <pktgen_thread_worker+0x3bc> > 450e: 8b 86 c8 02 00 00 mov 0x2c8(%esi),%eax > 4514: 85 c0 test %eax,%eax > 4516: 0f 85 63 03 00 00 jne 487f <pktgen_thread_worker+0x548> > 451c: 8b 96 40 04 00 00 mov 0x440(%esi),%edx > 4522: 85 d2 test %edx,%edx > 4524: 75 5d jne 4583 <pktgen_thread_worker+0x24c> > 4526: 8b 86 c4 02 00 00 mov 0x2c4(%esi),%eax > 452c: 83 c0 01 add $0x1,%eax > 452f: 3b 86 e8 02 00 00 cmp 0x2e8(%esi),%eax > 4535: 89 86 c4 02 00 00 mov %eax,0x2c4(%esi) > 453b: 0f 83 5f 03 00 00 jae 48a0 <pktgen_thread_worker+0x569> > 4541: 85 d2 test %edx,%edx > 4543: 75 3e jne 4583 <pktgen_thread_worker+0x24c> > 4545: f6 86 81 02 00 00 02 testb $0x2,0x281(%esi) > 454c: 0f 84 87 03 00 00 je 48d9 <pktgen_thread_worker+0x5a2> > 4552: 89 f2 mov %esi,%edx > 4554: 8b 44 24 14 mov 0x14(%esp),%eax > 4558: e8 fc f1 ff ff call 3759 <fill_packet_ipv6> > 455d: 85 c0 test %eax,%eax > 455f: 89 86 40 04 00 00 mov %eax,0x440(%esi) > 4565: 0f 84 87 03 00 00 je 48f2 <pktgen_thread_worker+0x5bb> > 456b: 83 86 bc 02 00 00 01 addl $0x1,0x2bc(%esi) > 4572: c7 86 c4 02 00 00 00 movl $0x0,0x2c4(%esi) > 4579: 00 00 00 > 457c: 83 96 c0 02 00 00 00 adcl $0x0,0x2c0(%esi) > 4583: 8b 7c 24 14 mov 0x14(%esp),%edi > 4587: 81 c7 2c 01 00 00 add $0x12c,%edi > 458d: 89 f8 mov %edi,%eax > 458f: e8 fc ff ff ff call 4590 <pktgen_thread_worker+0x259> > 4594: 8b 54 24 14 mov 0x14(%esp),%edx > 4598: 8b 42 24 mov 0x24(%edx),%eax > 459b: a8 01 test $0x1,%al > 459d: 0f 85 6c 03 00 00 jne 490f <pktgen_thread_worker+0x5d8> > 45a3: 8b 86 40 04 00 00 mov 0x440(%esi),%eax > 45a9: f0 ff 80 94 00 00 00 lock incl 0x94(%eax) > 45b0: 8b 86 40 04 00 00 mov 0x440(%esi),%eax > 45b6: 8b 54 24 14 mov 0x14(%esp),%edx > 45ba: ff 92 6c 01 00 00 call *0x16c(%edx) > 45c0: 85 c0 test %eax,%eax > 45c2: 0f 85 37 04 00 00 jne 49ff <pktgen_thread_worker+0x6c8> > 45c8: 83 86 9c 02 00 00 01 addl $0x1,0x29c(%esi) > 45cf: 8b 86 2c 04 00 00 mov 0x42c(%esi),%eax > 45d5: c7 86 c8 02 00 00 01 movl $0x1,0x2c8(%esi) > 45dc: 00 00 00 > 45df: 83 96 a0 02 00 00 00 adcl $0x0,0x2a0(%esi) > 45e6: 83 c0 04 add $0x4,%eax > 45e9: 31 d2 xor %edx,%edx > 45eb: 83 86 e4 02 00 00 01 addl $0x1,0x2e4(%esi) > 45f2: 01 86 a4 02 00 00 add %eax,0x2a4(%esi) > 45f8: 11 96 a8 02 00 00 adc %edx,0x2a8(%esi) > 45fe: 0f 31 rdtsc > 4600: 89 44 24 0c mov %eax,0xc(%esp) > 4604: 89 54 24 10 mov %edx,0x10(%esp) > 4608: 85 d2 test %edx,%edx > 460a: 8b 1d 1c 00 00 00 mov 0x1c,%ebx > 4610: 89 d1 mov %edx,%ecx > 4612: 89 c5 mov %eax,%ebp > 4614: 74 08 je 461e <pktgen_thread_worker+0x2e7> > 4616: 89 d0 mov %edx,%eax > 4618: 31 d2 xor %edx,%edx > 461a: f7 f3 div %ebx > 461c: 89 c1 mov %eax,%ecx > 461e: 89 e8 mov %ebp,%eax > 4620: f7 f3 div %ebx > 4622: 89 ca mov %ecx,%edx > 4624: 89 54 24 10 mov %edx,0x10(%esp) > 4628: 89 44 24 0c mov %eax,0xc(%esp) > 462c: 8b 8e 90 02 00 00 mov 0x290(%esi),%ecx > 4632: 31 db xor %ebx,%ebx > 4634: 01 4c 24 0c add %ecx,0xc(%esp) > 4638: 11 5c 24 10 adc %ebx,0x10(%esp) > 463c: 8b 54 24 0c mov 0xc(%esp),%edx > 4640: 8b 4c 24 10 mov 0x10(%esp),%ecx > 
4644: 89 96 b4 02 00 00 mov %edx,0x2b4(%esi) > 464a: 89 8e b8 02 00 00 mov %ecx,0x2b8(%esi) > 4650: 89 f8 mov %edi,%eax > 4652: e8 fc ff ff ff call 4653 <pktgen_thread_worker+0x31c> > 4657: 8b 96 98 02 00 00 mov 0x298(%esi),%edx > 465d: 8b 86 94 02 00 00 mov 0x294(%esi),%eax > 4663: 89 d1 mov %edx,%ecx > 4665: 09 c1 or %eax,%ecx > 4667: 0f 84 f6 00 00 00 je 4763 <pktgen_thread_worker+0x42c> > 466d: 8b 9e a0 02 00 00 mov 0x2a0(%esi),%ebx > 4673: 8b 8e 9c 02 00 00 mov 0x29c(%esi),%ecx > 4679: 39 d3 cmp %edx,%ebx > 467b: 0f 82 e2 00 00 00 jb 4763 <pktgen_thread_worker+0x42c> > 4681: 77 08 ja 468b <pktgen_thread_worker+0x354> > 4683: 39 c1 cmp %eax,%ecx > 4685: 0f 82 d8 00 00 00 jb 4763 <pktgen_thread_worker+0x42c> > 468b: 8b 8e 40 04 00 00 mov 0x440(%esi),%ecx > 4691: 8b 81 94 00 00 00 mov 0x94(%ecx),%eax > 4697: 83 f8 01 cmp $0x1,%eax > 469a: 74 4e je 46ea <pktgen_thread_worker+0x3b3> > 469c: 0f 31 rdtsc > 469e: 89 c7 mov %eax,%edi > 46a0: 8b 81 94 00 00 00 mov 0x94(%ecx),%eax > 46a6: 83 f8 01 cmp $0x1,%eax > 46a9: 89 d5 mov %edx,%ebp > 46ab: 74 2b je 46d8 <pktgen_thread_worker+0x3a1> > 46ad: bb 00 e0 ff ff mov $0xffffe000,%ebx > 46b2: 21 e3 and %esp,%ebx > 46b4: eb 16 jmp 46cc <pktgen_thread_worker+0x395> > 46b6: e8 fc ff ff ff call 46b7 <pktgen_thread_worker+0x380> > 46bb: 8b 86 40 04 00 00 mov 0x440(%esi),%eax > 46c1: 8b 80 94 00 00 00 mov 0x94(%eax),%eax > 46c7: 83 f8 01 cmp $0x1,%eax > 46ca: 74 0c je 46d8 <pktgen_thread_worker+0x3a1> > 46cc: 8b 03 mov (%ebx),%eax > 46ce: 8b 40 04 mov 0x4(%eax),%eax > 46d1: 8b 40 08 mov 0x8(%eax),%eax > 46d4: a8 04 test $0x4,%al > 46d6: 74 de je 46b6 <pktgen_thread_worker+0x37f> > 46d8: 0f 31 rdtsc > 46da: 29 f8 sub %edi,%eax > 46dc: 19 ea sbb %ebp,%edx > 46de: 01 86 dc 02 00 00 add %eax,0x2dc(%esi) > 46e4: 11 96 e0 02 00 00 adc %edx,0x2e0(%esi) > 46ea: 89 f0 mov %esi,%eax > 46ec: e8 2f fa ff ff call 4120 <pktgen_stop_device> > 46f1: eb 70 jmp 4763 <pktgen_thread_worker+0x42c> > 46f3: 0f 31 rdtsc > 46f5: 89 d5 mov %edx,%ebp > 46f7: 8b 54 24 14 mov 0x14(%esp),%edx > 46fb: 89 c7 mov %eax,%edi > 46fd: 8b 42 24 mov 0x24(%edx),%eax > 4700: a8 02 test $0x2,%al > 4702: 2e 74 e5 je,pn 46ea <pktgen_thread_worker+0x3b3> > 4705: 8b 4c 24 18 mov 0x18(%esp),%ecx > 4709: 8b 41 08 mov 0x8(%ecx),%eax > 470c: a8 08 test $0x8,%al > 470e: 0f 85 e1 02 00 00 jne 49f5 <pktgen_thread_worker+0x6be> > 4714: 0f 31 rdtsc > 4716: 29 f8 sub %edi,%eax > 4718: 19 ea sbb %ebp,%edx > 471a: 01 86 dc 02 00 00 add %eax,0x2dc(%esi) > 4720: 11 96 e0 02 00 00 adc %edx,0x2e0(%esi) > 4726: 8b 54 24 14 mov 0x14(%esp),%edx > 472a: 8b 42 24 mov 0x24(%edx),%eax > 472d: a8 01 test $0x1,%al > 472f: 0f 84 d9 fd ff ff je 450e <pktgen_thread_worker+0x1d7> > 4735: 0f 31 rdtsc > 4737: 85 d2 test %edx,%edx > 4739: 8b 1d 1c 00 00 00 mov 0x1c,%ebx > 473f: 89 d1 mov %edx,%ecx > 4741: 89 c7 mov %eax,%edi > 4743: 74 08 je 474d <pktgen_thread_worker+0x416> > 4745: 89 d0 mov %edx,%eax > 4747: 31 d2 xor %edx,%edx > 4749: f7 f3 div %ebx > 474b: 89 c1 mov %eax,%ecx > 474d: 89 f8 mov %edi,%eax > 474f: f7 f3 div %ebx > 4751: 89 ca mov %ecx,%edx > 4753: 89 c1 mov %eax,%ecx > 4755: 89 d3 mov %edx,%ebx > 4757: 89 8e b4 02 00 00 mov %ecx,0x2b4(%esi) > 475d: 89 9e b8 02 00 00 mov %ebx,0x2b8(%esi) > 4763: 8b 96 c8 02 00 00 mov 0x2c8(%esi),%edx > 4769: 8b 4c 24 24 mov 0x24(%esp),%ecx > 476d: 01 54 24 20 add %edx,0x20(%esp) > 4771: 39 4c 24 20 cmp %ecx,0x20(%esp) > 4775: 76 20 jbe 4797 <pktgen_thread_worker+0x460> > 4777: 8b 54 24 18 mov 0x18(%esp),%edx > 477b: 8b 42 10 mov 0x10(%edx),%eax > 477e: c1 e0 07 shl $0x7,%eax > 
4781: 8b b8 00 00 00 00 mov 0x0(%eax),%edi > 4787: 85 ff test %edi,%edi > 4789: 0f 85 1c 02 00 00 jne 49ab <pktgen_thread_worker+0x674> > 478f: c7 44 24 20 00 00 00 movl $0x0,0x20(%esp) > 4796: 00 > 4797: 8b 4c 24 28 mov 0x28(%esp),%ecx > 479b: 8b 91 b4 02 00 00 mov 0x2b4(%ecx),%edx > 47a1: f6 c2 01 test $0x1,%dl > 47a4: 0f 85 7c 00 00 00 jne 4826 <pktgen_thread_worker+0x4ef> > 47aa: 8b 4c 24 18 mov 0x18(%esp),%ecx > 47ae: 8b 01 mov (%ecx),%eax > 47b0: 8b 40 04 mov 0x4(%eax),%eax > 47b3: 8b 40 08 mov 0x8(%eax),%eax > 47b6: a8 04 test $0x4,%al > 47b8: 75 6c jne 4826 <pktgen_thread_worker+0x4ef> > 47ba: f6 c2 02 test $0x2,%dl > 47bd: 0f 85 c7 01 00 00 jne 498a <pktgen_thread_worker+0x653> > 47c3: f6 c2 04 test $0x4,%dl > 47c6: 0f 85 9d 01 00 00 jne 4969 <pktgen_thread_worker+0x632> > 47cc: 80 e2 08 and $0x8,%dl > 47cf: 90 nop > 47d0: 0f 85 7a 01 00 00 jne 4950 <pktgen_thread_worker+0x619> > 47d6: 8b 54 24 18 mov 0x18(%esp),%edx > 47da: 8b 42 08 mov 0x8(%edx),%eax > 47dd: a8 08 test $0x8,%al > 47df: 0f 84 5b fc ff ff je 4440 <pktgen_thread_worker+0x109> > 47e5: e8 fc ff ff ff call 47e6 <pktgen_thread_worker+0x4af> > 47ea: 8b 54 24 18 mov 0x18(%esp),%edx > 47ee: 8b 02 mov (%edx),%eax > 47f0: c7 00 00 00 00 00 movl $0x0,(%eax) > 47f6: 8b 44 24 28 mov 0x28(%esp),%eax > 47fa: e8 d2 f9 ff ff call 41d1 <next_to_run> > 47ff: 85 c0 test %eax,%eax > 4801: 89 c6 mov %eax,%esi > 4803: 0f 85 56 fc ff ff jne 445f <pktgen_thread_worker+0x128> > 4809: ba 64 00 00 00 mov $0x64,%edx > 480e: 8b 44 24 1c mov 0x1c(%esp),%eax > 4812: e8 fc ff ff ff call 4813 <pktgen_thread_worker+0x4dc> > 4817: 8b 4c 24 28 mov 0x28(%esp),%ecx > 481b: 8b 91 b4 02 00 00 mov 0x2b4(%ecx),%edx > 4821: f6 c2 01 test $0x1,%dl > 4824: 74 84 je 47aa <pktgen_thread_worker+0x473> > 4826: 8b 5c 24 28 mov 0x28(%esp),%ebx > 482a: c7 04 24 c8 06 00 00 movl $0x6c8,(%esp) > 4831: 83 c3 0c add $0xc,%ebx > 4834: 89 5c 24 04 mov %ebx,0x4(%esp) > 4838: e8 fc ff ff ff call 4839 <pktgen_thread_worker+0x502> > 483d: 8b 44 24 28 mov 0x28(%esp),%eax > 4841: e8 f6 f9 ff ff call 423c <pktgen_stop> > 4846: 89 5c 24 04 mov %ebx,0x4(%esp) > 484a: c7 04 24 e8 06 00 00 movl $0x6e8,(%esp) > 4851: e8 fc ff ff ff call 4852 <pktgen_thread_worker+0x51b> > 4856: 8b 44 24 28 mov 0x28(%esp),%eax > 485a: e8 1b fa ff ff call 427a <pktgen_rem_all_ifs> > 485f: 89 5c 24 04 mov %ebx,0x4(%esp) > 4863: c7 04 24 cc 06 00 00 movl $0x6cc,(%esp) > 486a: e8 fc ff ff ff call 486b <pktgen_thread_worker+0x534> > 486f: 8b 44 24 28 mov 0x28(%esp),%eax > 4873: 83 c4 2c add $0x2c,%esp > 4876: 5b pop %ebx > 4877: 5e pop %esi > 4878: 5f pop %edi > 4879: 5d pop %ebp > 487a: e9 27 fa ff ff jmp 42a6 <pktgen_rem_thread> > 487f: 8b 96 40 04 00 00 mov 0x440(%esi),%edx > 4885: 8b 86 c4 02 00 00 mov 0x2c4(%esi),%eax > 488b: 83 c0 01 add $0x1,%eax > 488e: 3b 86 e8 02 00 00 cmp 0x2e8(%esi),%eax > 4894: 89 86 c4 02 00 00 mov %eax,0x2c4(%esi) > 489a: 0f 82 a1 fc ff ff jb 4541 <pktgen_thread_worker+0x20a> > 48a0: 85 d2 test %edx,%edx > 48a2: 0f 84 9d fc ff ff je 4545 <pktgen_thread_worker+0x20e> > 48a8: 8b 82 94 00 00 00 mov 0x94(%edx),%eax > 48ae: 83 f8 01 cmp $0x1,%eax > 48b1: 74 12 je 48c5 <pktgen_thread_worker+0x58e> > 48b3: f0 ff 8a 94 00 00 00 lock decl 0x94(%edx) > 48ba: 0f 94 c0 sete %al > 48bd: 84 c0 test %al,%al > 48bf: 0f 84 80 fc ff ff je 4545 <pktgen_thread_worker+0x20e> > 48c5: 89 d0 mov %edx,%eax > 48c7: e8 fc ff ff ff call 48c8 <pktgen_thread_worker+0x591> > 48cc: f6 86 81 02 00 00 02 testb $0x2,0x281(%esi) > 48d3: 0f 85 79 fc ff ff jne 4552 <pktgen_thread_worker+0x21b> > 48d9: 89 f2 mov 
%esi,%edx > 48db: 8b 44 24 14 mov 0x14(%esp),%eax > 48df: e8 22 e7 ff ff call 3006 <fill_packet_ipv4> > 48e4: 85 c0 test %eax,%eax > 48e6: 89 86 40 04 00 00 mov %eax,0x440(%esi) > 48ec: 0f 85 79 fc ff ff jne 456b <pktgen_thread_worker+0x234> > 48f2: c7 04 24 08 07 00 00 movl $0x708,(%esp) > 48f9: e8 fc ff ff ff call 48fa <pktgen_thread_worker+0x5c3> > 48fe: e8 fc ff ff ff call 48ff <pktgen_thread_worker+0x5c8> > 4903: 83 ae c4 02 00 00 01 subl $0x1,0x2c4(%esi) > 490a: e9 54 fe ff ff jmp 4763 <pktgen_thread_worker+0x42c> > 490f: c7 86 c8 02 00 00 00 movl $0x0,0x2c8(%esi) > 4916: 00 00 00 > 4919: 0f 31 rdtsc > 491b: 89 44 24 0c mov %eax,0xc(%esp) > 491f: 89 54 24 10 mov %edx,0x10(%esp) > 4923: 85 d2 test %edx,%edx > 4925: 8b 1d 1c 00 00 00 mov 0x1c,%ebx > 492b: 89 d1 mov %edx,%ecx > 492d: 89 c5 mov %eax,%ebp > 492f: 74 08 je 4939 <pktgen_thread_worker+0x602> > 4931: 89 d0 mov %edx,%eax > 4933: 31 d2 xor %edx,%edx > 4935: f7 f3 div %ebx > 4937: 89 c1 mov %eax,%ecx > 4939: 89 e8 mov %ebp,%eax > 493b: f7 f3 div %ebx > 493d: 89 ca mov %ecx,%edx > 493f: 89 86 b4 02 00 00 mov %eax,0x2b4(%esi) > 4945: 89 96 b8 02 00 00 mov %edx,0x2b8(%esi) > 494b: e9 00 fd ff ff jmp 4650 <pktgen_thread_worker+0x319> > 4950: 8b 44 24 28 mov 0x28(%esp),%eax > 4954: e8 21 f9 ff ff call 427a <pktgen_rem_all_ifs> > 4959: 8b 44 24 28 mov 0x28(%esp),%eax > 495d: 83 a0 b4 02 00 00 f7 andl $0xfffffff7,0x2b4(%eax) > 4964: e9 6d fe ff ff jmp 47d6 <pktgen_thread_worker+0x49f> > 4969: 8b 44 24 28 mov 0x28(%esp),%eax > 496d: e8 d7 f2 ff ff call 3c49 <pktgen_run> > 4972: 8b 4c 24 28 mov 0x28(%esp),%ecx > 4976: 8b 91 b4 02 00 00 mov 0x2b4(%ecx),%edx > 497c: 83 e2 fb and $0xfffffffb,%edx > 497f: 89 91 b4 02 00 00 mov %edx,0x2b4(%ecx) > 4985: e9 42 fe ff ff jmp 47cc <pktgen_thread_worker+0x495> > 498a: 8b 44 24 28 mov 0x28(%esp),%eax > 498e: e8 a9 f8 ff ff call 423c <pktgen_stop> > 4993: 8b 44 24 28 mov 0x28(%esp),%eax > 4997: 8b 90 b4 02 00 00 mov 0x2b4(%eax),%edx > 499d: 83 e2 fd and $0xfffffffd,%edx > 49a0: 89 90 b4 02 00 00 mov %edx,0x2b4(%eax) > 49a6: e9 18 fe ff ff jmp 47c3 <pktgen_thread_worker+0x48c> > 49ab: e8 fc ff ff ff call 49ac <pktgen_thread_worker+0x675> > 49b0: e9 da fd ff ff jmp 478f <pktgen_thread_worker+0x458> > 49b5: 89 f2 mov %esi,%edx > 49b7: 89 f8 mov %edi,%eax > 49b9: e8 cc de ff ff call 288a <nanospin> > 49be: 89 f6 mov %esi,%esi > 49c0: e9 19 fb ff ff jmp 44de <pktgen_thread_worker+0x1a7> > 49c5: 0f 31 rdtsc > 49c7: 85 d2 test %edx,%edx > 49c9: 8b 1d 1c 00 00 00 mov 0x1c,%ebx > 49cf: 89 d1 mov %edx,%ecx > 49d1: 89 c7 mov %eax,%edi > 49d3: 74 08 je 49dd <pktgen_thread_worker+0x6a6> > 49d5: 89 d0 mov %edx,%eax > 49d7: 31 d2 xor %edx,%edx > 49d9: f7 f3 div %ebx > 49db: 89 c1 mov %eax,%ecx > 49dd: 89 f8 mov %edi,%eax > 49df: f7 f3 div %ebx > 49e1: 89 ca mov %ecx,%edx > 49e3: 89 c1 mov %eax,%ecx > 49e5: 89 d3 mov %edx,%ebx > 49e7: 81 c1 ff ff ff 7f add $0x7fffffff,%ecx > 49ed: 83 d3 00 adc $0x0,%ebx > 49f0: e9 62 fd ff ff jmp 4757 <pktgen_thread_worker+0x420> > 49f5: e8 fc ff ff ff call 49f6 <pktgen_thread_worker+0x6bf> > 49fa: e9 15 fd ff ff jmp 4714 <pktgen_thread_worker+0x3dd> > 49ff: 83 f8 ff cmp $0xffffffff,%eax > 4a02: 75 14 jne 4a18 <pktgen_thread_worker+0x6e1> > 4a04: 8b 4c 24 14 mov 0x14(%esp),%ecx > 4a08: f6 81 59 01 00 00 10 testb $0x10,0x159(%ecx) > 4a0f: 74 07 je 4a18 <pktgen_thread_worker+0x6e1> > 4a11: f3 90 pause > 4a13: e9 98 fb ff ff jmp 45b0 <pktgen_thread_worker+0x279> > 4a18: 8b 86 40 04 00 00 mov 0x440(%esi),%eax > 4a1e: f0 ff 88 94 00 00 00 lock decl 0x94(%eax) > 4a25: 8b 2d 08 00 00 00 
mov 0x8,%ebp > 4a2b: 85 ed test %ebp,%ebp > 4a2d: 75 4f jne 4a7e <pktgen_thread_worker+0x747> > 4a2f: 83 86 ac 02 00 00 01 addl $0x1,0x2ac(%esi) > 4a36: c7 86 c8 02 00 00 00 movl $0x0,0x2c8(%esi) > 4a3d: 00 00 00 > 4a40: 83 96 b0 02 00 00 00 adcl $0x0,0x2b0(%esi) > 4a47: 0f 31 rdtsc > 4a49: 89 44 24 0c mov %eax,0xc(%esp) > 4a4d: 89 54 24 10 mov %edx,0x10(%esp) > 4a51: 85 d2 test %edx,%edx > 4a53: 8b 1d 1c 00 00 00 mov 0x1c,%ebx > 4a59: 89 d1 mov %edx,%ecx > 4a5b: 89 c5 mov %eax,%ebp > 4a5d: 74 08 je 4a67 <pktgen_thread_worker+0x730> > 4a5f: 89 d0 mov %edx,%eax > 4a61: 31 d2 xor %edx,%edx > 4a63: f7 f3 div %ebx > 4a65: 89 c1 mov %eax,%ecx > 4a67: 89 e8 mov %ebp,%eax > 4a69: f7 f3 div %ebx > 4a6b: 89 ca mov %ecx,%edx > 4a6d: 89 86 b4 02 00 00 mov %eax,0x2b4(%esi) > 4a73: 89 96 b8 02 00 00 mov %edx,0x2b8(%esi) > 4a79: e9 80 fb ff ff jmp 45fe <pktgen_thread_worker+0x2c7> > 4a7e: e8 fc ff ff ff call 4a7f <pktgen_thread_worker+0x748> > 4a83: 85 c0 test %eax,%eax > 4a85: 74 a8 je 4a2f <pktgen_thread_worker+0x6f8> > 4a87: c7 04 24 e9 06 00 00 movl $0x6e9,(%esp) > 4a8e: e8 fc ff ff ff call 4a8f <pktgen_thread_worker+0x758> > 4a93: eb 9a jmp 4a2f <pktgen_thread_worker+0x6f8> > -- Pádraig Brady - http://www.pixelbeat.org -- ^ permalink raw reply [flat|nested] 85+ messages in thread
[parent not found: <20041209164820.GB32454@mail.com>]
* Re: 1.03Mpps on e1000 [not found] ` <20041209164820.GB32454@mail.com> @ 2004-12-09 17:19 ` P 2004-12-09 23:25 ` Ray Lehtiniemi 0 siblings, 1 reply; 85+ messages in thread From: P @ 2004-12-09 17:19 UTC (permalink / raw) To: Ray Lehtiniemi Ray Lehtiniemi wrote: > On Thu, Dec 09, 2004 at 09:18:25AM -0700, Ray Lehtiniemi wrote: > >>it is worth noting that my box has become quite unstable since >>i started to use oprofile and pktgen together. sshd stops responding, >>and the network seems to go down. not sure what is happening there... >>this instability seems to be persisting across reboots, unfortunately... > > > > ok, it seems that this is related to martin's e1000 patch, and i > just hadn't noticed it before. rolling back the 1.2 Mpps patch > seems to cure the problem. > > symptoms are a total freezeup of the e1000 interfaces. netstat > -an shows a tcp connection for my ssh login to the box, with about > 53K in the send-Q. /proc/net/tcp is empty, however.... i can > reproduce this at will by doing > > # objdump -d /lib/modules/2.6.10-rc3-mp-rayl/kernel/net/core/pktgen.ko > > on that machine with the e1000-patched kernel running. > > > if there's any diagnostic output i can generate that might tell > me what's going wrong, let me know and i'll try to generate it. can you send this to again to netdev. thanks. -- Pádraig Brady - http://www.pixelbeat.org -- ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 2004-12-09 17:19 ` P @ 2004-12-09 23:25 ` Ray Lehtiniemi 0 siblings, 0 replies; 85+ messages in thread From: Ray Lehtiniemi @ 2004-12-09 23:25 UTC (permalink / raw) To: netdev hi all my apologies if this gets received twice... i originally sent a copy of this using mutt's 'bounce' function, but i don't think that's what i wanted to do..... this is a bug report re: martin e1000 patch. i'm seeing some lockups under normal traffic loads that seem to go away if i revert the patch. details below.. thanks On Thu, Dec 09, 2004 at 05:19:55PM +0000, P@draigBrady.com wrote: > Ray Lehtiniemi wrote: > >On Thu, Dec 09, 2004 at 09:18:25AM -0700, Ray Lehtiniemi wrote: > > > >>it is worth noting that my box has become quite unstable since > >>i started to use oprofile and pktgen together. sshd stops responding, > >>and the network seems to go down. not sure what is happening there... > >>this instability seems to be persisting across reboots, unfortunately... > > > > > > > >ok, it seems that this is related to martin's e1000 patch, and i > >just hadn't noticed it before. rolling back the 1.2 Mpps patch > >seems to cure the problem. > > > >symptoms are a total freezeup of the e1000 interfaces. netstat > >-an shows a tcp connection for my ssh login to the box, with about > >53K in the send-Q. /proc/net/tcp is empty, however.... i can > >reproduce this at will by doing > > > > # objdump -d /lib/modules/2.6.10-rc3-mp-rayl/kernel/net/core/pktgen.ko > > > >on that machine with the e1000-patched kernel running. > > > > > >if there's any diagnostic output i can generate that might tell > >me what's going wrong, let me know and i'll try to generate it. > > can you send this to again to netdev. > > thanks. > > -- > Pádraig Brady - http://www.pixelbeat.org > -- -- ---------------------------------------------------------------------- Ray L <rayl@mail.com> ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 15:03 ` Martin Josefsson 2004-12-05 15:15 ` Lennert Buytenhek 2004-12-05 15:42 ` Martin Josefsson @ 2004-12-05 21:12 ` Scott Feldman 2004-12-05 21:25 ` Lennert Buytenhek 2 siblings, 1 reply; 85+ messages in thread From: Scott Feldman @ 2004-12-05 21:12 UTC (permalink / raw) To: Martin Josefsson Cc: Lennert Buytenhek, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 2004-12-05 at 07:03, Martin Josefsson wrote: > BUT if I use the above + prefetching I get this: > > 60 1483890 Ok, proof that we can get to 1.4Mpps! That's the good news. The bad news is prefetching is potentially buggy as pointed out in the FreeBSD note. Buggy as in the controller may hang. Sorry, I don't have details on what conditions are necessary to cause a hang. Would Martin or Lennert run these tests for a longer duration so we can get some data, maybe adding in Rx? It could be that with the Tx interrupts and descriptor write-backs removed, prefetching may be OK. I don't know. Intel? Also, wouldn't it be great if someone wrote a document capturing all of the accumulated knowledge for future generations? -scott ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) 2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman @ 2004-12-05 21:25 ` Lennert Buytenhek 2004-12-06 1:23 ` 1.03Mpps on e1000 (was: " Scott Feldman 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-05 21:25 UTC (permalink / raw) To: Scott Feldman Cc: Martin Josefsson, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, Dec 05, 2004 at 01:12:22PM -0800, Scott Feldman wrote: > Would Martin or Lennert run these test for a longer duration so we can > get some data, maybe adding in Rx. It could be that removing the Tx > interrupts and descriptor write-backs, prefetching may be ok. I don't > know. Intel? What your patch does is (correct me if I'm wrong): - Masking TXDW, effectively preventing it from delivering TXdone ints. - Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes the chip to 'ignore the TIDV' register, which is the 'TX Interrupt Delay Value'. What exactly does this? - Not setting the "Report Packet Sent"/"Report Status" bits in the TXD command field. Is this the equivalent of the TXdone interrupt? Just exactly which bit avoids the descriptor writeback? I'm also a bit worried that only freeing packets 1ms later will mess up socket accounting and such. Any ideas on that? > Also, wouldn't it be great if someone wrote a document capturing all of > the accumulated knowledge for future generations? I'll volunteer for that. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
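In descriptor terms, the difference being probed here looks roughly like the sketch below. This is an illustration only, not the actual patch: macro and field names are taken from the e1000 driver headers of that era (e1000_hw.h), and checksum offload, multi-fragment packets and the exact length encoding are ignored.

	/* conventional e1000 legacy Tx descriptor setup: ask for a status
	 * write-back (RS) and arm the interrupt delay timer (IDE)          */
	tx_desc->lower.data = cpu_to_le32(E1000_TXD_CMD_EOP | E1000_TXD_CMD_IFCS |
	                                  E1000_TXD_CMD_RS | E1000_TXD_CMD_IDE |
	                                  skb->len);

	/* the variant under discussion: no RS, no IDE.  The controller then
	 * never DMAs status back into the descriptor and never starts the
	 * TIDV countdown, so there are no descriptor write-backs and no
	 * (delayed) Tx-done interrupts; masking TXDW in the interrupt mask
	 * register silences whatever interrupt source is left.             */
	tx_desc->lower.data = cpu_to_le32(E1000_TXD_CMD_EOP | E1000_TXD_CMD_IFCS |
	                                  skb->len);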
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit) 2004-12-05 21:25 ` Lennert Buytenhek @ 2004-12-06 1:23 ` Scott Feldman 0 siblings, 0 replies; 85+ messages in thread From: Scott Feldman @ 2004-12-06 1:23 UTC (permalink / raw) To: Lennert Buytenhek Cc: Martin Josefsson, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Sun, 2004-12-05 at 13:25, Lennert Buytenhek wrote: > What your patch does is (correct me if I'm wrong): > - Masking TXDW, effectively preventing it from delivering TXdone ints. > - Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes > the chip to 'ignore the TIDV' register, which is the 'TX Interrupt > Delay Value'. What exactly does this? A descriptor with IDE, when written back, starts the Tx delay timers countdown. Never setting IDE means the Tx delay timers never expire. > - Not setting the "Report Packet Sent"/"Report Status" bits in the TXD > command field. Is this the equivalent of the TXdone interrupt? > > Just exactly which bit avoids the descriptor writeback? As the name implies, Report Status (RS) instructs the controller to indicate the status of the descriptor by doing a write-back (DMA) to the descriptor memory. The only status we care about is the "done" indicator. By reading TDH (Tx head), we can figure out where hardware is without reading the status of each descriptor. Since we don't need status, we can turn off RS. > I'm also a bit worried that only freeing packets 1ms later will mess up > socket accounting and such. Any ideas on that? Well the timer solution is less than ideal, and any protocols that are sensitive to getting Tx resources returned by the driver as quickly as possible are not going to be happy. I don't know if 1ms is quick enough. You could eliminate the timer by doing the cleanup first thing in xmit_frame, but then you have two problems: 1) you might end up reading TDH for each send, and that's going to be expensive; 2) calls to xmit_frame might stop, leaving uncleaned work until xmit_frame is called again. -scott ^ permalink raw reply [flat|nested] 85+ messages in thread
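To make the TDH-based bookkeeping concrete, here is a minimal sketch (not the patch itself) of a clean routine that frees everything up to the hardware head instead of checking per-descriptor DD status. Structure and macro names follow the e1000 driver of that period; locking and chip-specific quirks are left out, and the MMIO read of TDH on every call is exactly the cost warned about above.

	static void e1000_clean_tx_by_tdh(struct e1000_adapter *adapter)
	{
		struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
		unsigned int head = E1000_READ_REG(&adapter->hw, TDH); /* uncached MMIO read */
		unsigned int i = tx_ring->next_to_clean;

		/* everything in [next_to_clean, head) has been consumed by the
		 * hardware, so its buffers can be unmapped and freed           */
		while (i != head) {
			struct e1000_buffer *buffer_info = &tx_ring->buffer_info[i];

			if (buffer_info->dma) {
				pci_unmap_page(adapter->pdev, buffer_info->dma,
				               buffer_info->length, PCI_DMA_TODEVICE);
				buffer_info->dma = 0;
			}
			if (buffer_info->skb) {
				dev_kfree_skb_any(buffer_info->skb);
				buffer_info->skb = NULL;
			}
			if (++i == tx_ring->count)
				i = 0;
		}
		tx_ring->next_to_clean = i;
	}

Whether something like this runs from a periodic timer or at the top of the xmit routine is then exactly the trade-off described above: the timer delays freeing, while the xmit path pays the TDH read per send and stops cleaning whenever transmission stops.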
* Re: [E1000-devel] Transmission limit 2004-12-01 1:09 ` Scott Feldman 2004-12-01 15:34 ` Robert Olsson 2004-12-01 18:29 ` Lennert Buytenhek @ 2004-12-02 17:31 ` Marco Mellia 2004-12-03 20:57 ` Lennert Buytenhek 3 siblings, 0 replies; 85+ messages in thread From: Marco Mellia @ 2004-12-02 17:31 UTC (permalink / raw) To: sfeldma Cc: birke, Lennert Buytenhek, jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Wed, 2004-12-01 at 02:09, Scott Feldman wrote: > Hey, turns out, I know some e1000 tricks that might help get the kpps > numbers up. > > My problem is I only have a P4 desktop system with a 82544 nic running > at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a > rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx > descriptor write-backs. For me, I see a nice jump in kpps, but I'd like > others to try with their setups. We should be able to get to wire speed > with 60-byte packets. > Here are the numbers in our setup:

    vanilla kernel [2.4.20 + packetgen + driver e1000 5.4.11]
    4096 Descr => 356 Mbps (60 bytes long frames) => 941 Mbps (1500 bytes long frames)
     256 Descr => 354 Mbps (60 bytes long frames) => 941 Mbps (1500 bytes long frames)
    Patched driver [2.4.20 + packetgen + driver e1000 5.4.11 patched]
    4096 Descr => 357 Mbps (60 bytes long frames) => 941 Mbps (1500 bytes long frames)

I guess that was _not_ the bottleneck, sigh... at least with a PCI-X bus. Again, a latency issue of the DMA transfer from RAM to NIC? -- Ciao, /\/\/\rco +-----------------------------------+ | Marco Mellia - Assistant Professor| | Tel: 39-011-2276-608 | | Tel: 39-011-564-4173 | | Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . . | Politecnico di Torino | \ / . ASCII Ribbon Campaign . | Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail . | Torino - 10129 - Italy | / \ .- NO Word docs in e-mail. | http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . . +-----------------------------------+ The box said "Requires Windows 95 or Better." So I installed Linux. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 1:09 ` Scott Feldman ` (2 preceding siblings ...) 2004-12-02 17:31 ` [E1000-devel] Transmission limit Marco Mellia @ 2004-12-03 20:57 ` Lennert Buytenhek 2004-12-04 10:36 ` Lennert Buytenhek 3 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-03 20:57 UTC (permalink / raw) To: Scott Feldman Cc: jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev [-- Attachment #1: Type: text/plain, Size: 2459 bytes --] On Tue, Nov 30, 2004 at 05:09:59PM -0800, Scott Feldman wrote: > Hey, turns out, I know some e1000 tricks that might help get the kpps > numbers up. > > My problem is I only have a P4 desktop system with a 82544 nic running > at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a > rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx > descriptor write-backs. For me, I see a nice jump in kpps, but I'd like > others to try with their setups. We should be able to get to wire speed > with 60-byte packets. Attached is a graph of my numbers with and without your patch for: - An 82540 at PCI 32/33, idle 33MHz card on the same bus forcing it to 33MHz. - An 82541 at PCI 32/66. - An 82546 at PCI-X 64/100, NIC can do 133MHz but mobo only does 100MHz. All 'phi' tests were done on my box phi, a dual 2.4GHz Xeon on an Intel SE7505VB2 board (http://www.intel.com/design/servers/se7505vb2/). I've included Robert's 64/133 numbers ('sourcemage') on his dual 866MHz P3 for comparison. I didn't test all packet sizes up to 1500, just the first few hundred bytes for each. As before, the max # pps at 60B packets is strongly influenced by the per-packet overhead (which seems to be reduced by your patch for my machine quite a bit, also on 64/100, even though Robert sees no improvement on 64/133) while the slope of each curve appears to depend only on the speed of the bus the NIC is in. I.e. the 60B kpps number more-or-less determines the shape of the rest of the graph in each case. Bus speed is most likely also the reason why the 64/100 setup w/o your patch starts off slower than the 32/66 with your patch, but then eventually beats the 32/66 (around 140B packets) just before they both hit the GigE saturation point. There's no drop at 256B for the 64/100 setup like with the 32/* setups. Perhaps the drop at 256B is because of the PCI latency timer being set to 64 by default, and that causes the transfer on 32b to be broken up in 256-byte chunks? I'm not able to saturate gigabit on 32/33 with 1500B packets, while Jamal does. Another thing to look into. Also note that the 64/100 NIC has rather wobbly performance between 60B and ~160B. This 'square wave pattern' is there both with and without your patch, perhaps something particular to the NIC. Its period appears to be 16 bytes, dropping down where packet_size mod 16 = 0, and then jumping up again a bit when packet_size mod 16 = 6. Odd. --L [-- Attachment #2: perf.png --] [-- Type: image/png, Size: 31312 bytes --] ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-03 20:57 ` Lennert Buytenhek @ 2004-12-04 10:36 ` Lennert Buytenhek 0 siblings, 0 replies; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-04 10:36 UTC (permalink / raw) To: Scott Feldman Cc: jamal, Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Fri, Dec 03, 2004 at 09:57:06PM +0100, Lennert Buytenhek wrote: > > My problem is I only have a P4 desktop system with a 82544 nic running > > at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a > > rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx > > descriptor write-backs. For me, I see a nice jump in kpps, but I'd like > > others to try with their setups. We should be able to get to wire speed > > with 60-byte packets. > > Attached is a graph of my numbers with and without your patch for: > - An 82540 at PCI 32/33, idle 33MHz card on the same bus forcing it to 33MHz. > - An 82541 at PCI 32/66. > - An 82546 at PCI-X 64/100, NIC can do 133MHz but mobo only does 100MHz. When extrapolating these numbers to the 0-byte packet case (which then tells you the per-packet overhead), I get the following approximate numbers:

    case                             overhead
    phi-32-33-82540-2.6.9            1.86 us
    phi-32-66-82541-2.6.9            1.41 us
    phi-64-100-82546-2.6.9           1.45 us
    phi-32-33-82540-2.6.9-feldman    1.48 us
    phi-32-66-82541-2.6.9-feldman    1.13 us
    phi-64-100-82546-2.6.9-feldman   1.25 us

Note that this figure doesn't differ all that much between the different bus widths/speeds. In any case, if I ever want to get more than ~880kpps on this hardware, there's no other way than to make this overhead go down. For saturating 1Gb/s with 60B packets on 64/100, the overhead can't be more than ~0.59 us per packet or you lose. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
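For reference, the wire-rate budget these overhead figures are measured against (a back-of-the-envelope check assuming standard 802.3 framing):

    minimum-size packet on the wire = 64 B frame (incl. CRC) + 8 B preamble/SFD + 12 B inter-frame gap
                                    = 84 B = 672 bits
    672 bits at 1 Gbit/s            = 672 ns per packet  =>  ~1.488 Mpps
    ~880 kpps                       =>  ~1.14 us spent per packet

so the ~0.59 us bound quoted above is presumably what remains of the 672 ns slot once the byte-dependent bus-transfer time of a 60 B packet on the 64/100 bus has been subtracted.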
* Re: [E1000-devel] Transmission limit 2004-12-01 0:11 ` Lennert Buytenhek 2004-12-01 1:09 ` Scott Feldman @ 2004-12-01 12:08 ` jamal 2004-12-01 15:24 ` Lennert Buytenhek 1 sibling, 1 reply; 85+ messages in thread From: jamal @ 2004-12-01 12:08 UTC (permalink / raw) To: Lennert Buytenhek Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Tue, 2004-11-30 at 19:11, Lennert Buytenhek wrote: > On Tue, Nov 30, 2004 at 09:25:54AM -0500, jamal wrote: > > > > > > Also from what I understand new HW and MSI can help in the case where > > > > > pass objects between CPU. Did I dream or did someone tell me that S2IO > > > > > could have several TX ring that could via MSI be routed to proper cpu? > > > > > > > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound > > > > like more hell to maintain. > > > > > > On the TX path you'd have qdiscs to deal with as well, no? > > > > I think management of it would be non-trivial in SMP. You'd have to start > > playing stupid load-balancing tricks which would reduce the value of > > existence of tx irqs to begin with. > > You mean the management of qdiscs would be non-trivial? I mean it is useful in only the most ideal cases and if you want to actually do something useful in most cases with it you will have to muck around. Take the case of forwarding (maybe with a little or almost no localhost generated traffic) - then you end up allocating on CPU A, processing and queueing on egress. Tx softirq, which is what stashes the packet on tx DMA eventually, is not guaranteed to run on the same CPU. Now add a little latency between ingress and egress .. The ideal case is where you end up processing to completion from ingress to egress (which is known to happen in Linux when there's no congestion). > Probably the idea of these kinds of tricks is to skip the qdisc step > altogether. > Which is preached by the BSD folks - bogus in my opinion. If you want to do something as bland/boring as that you can probably afford a $500 DLINK router which can do it at wire rate (with the cost of being locked into whatever features they have). cheers, jamal ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-12-01 12:08 ` jamal @ 2004-12-01 15:24 ` Lennert Buytenhek 0 siblings, 0 replies; 85+ messages in thread From: Lennert Buytenhek @ 2004-12-01 15:24 UTC (permalink / raw) To: jamal Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Wed, Dec 01, 2004 at 07:08:20AM -0500, jamal wrote: [ per-CPU TX/RX rings ] > > You mean the management of qdiscs would be non-trivial? > > I mean it is useful in only the most ideal cases and if you want to > actually do something useful in most cases with it you will have to > muck around. > Take the case of forwarding (maybe with a little or almost no localhost > generated traffic) - then you end allocating in CPUA, processing and > queueing on egress. Tx softirq, which is what stashes the packet on tx > DMA eventually, is not guaranteed to run on the same CPU. Now add a > little latency between ingress and egress .. > The ideal case is where you end up processing to completion from ingress > to egress (which is known to happen in Linux when theres no congestion). We disagreed on this topic at SUCON and I'm afraid we'll be disagreeing on it forever :) IMHO, on 10GbE any kind of qdisc is a waste of cycles. I don't think it's very likely that you'll be using that single 10GbE NIC for forwarding packets, doing that with a PC at this point in the history of PCs is just silly. If you do use it for forwarding, how likely is it that you'll be able to process an incoming burst of packets fast enough to require queueing on the egress interface? You have to be able to send a burst of packets bigger than the NIC's TX FIFO at >10GbE in the first place for queueing to be effective/useful at all. (Leaving the question of whether or not there'll be some room in the TX FIFO at TX time unanswered, what you're doing with per-CPU TX rings is basically just simulating the "N individual NICs each bound to its own CPU" case with a single NIC.) > > Probably the idea of these kinds of tricks is to skip the qdisc step > > altogether. > > Which is preached by the BSD folks - bogus in my opinion. If you want to > do something as bland/boring as that you can probably afford a $500 > DLINK router which can do it at wire rate with (with cost you being > locked in whatever features they have). That's an unfair comparison. Just because I don't need CBQ doesn't mean my $500 DLINK router does everything I'd want it to -- advanced firewalling is one thing that comes to mind. Last time I looked I couldn't load my own kernel modules on my DLINK router either. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-26 15:59 ` Marco Mellia 2004-11-26 16:57 ` P @ 2004-11-26 17:58 ` Robert Olsson 1 sibling, 0 replies; 85+ messages in thread From: Robert Olsson @ 2004-11-26 17:58 UTC (permalink / raw) To: mellia Cc: Robert Olsson, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Marco Mellia writes: > > Touching the packet-data gives a major impact. See eth_type_trans > > in all profiles. > > That's exactly what we removed from the driver code: touching the packet > limits the reception rate to about 1.1Mpps, while avoiding the > eth_type_trans check actually allows us to receive 100% of packets. > > skbs are de/allocated using standard kernel memory management. Still, > without touching the packet, we can receive 100% of them. Right. I recall I tried something similar but as I only have pktgen as sender I could only verify this up to pktgen TX speed, about 860 kpps for the PIII box I mentioned. This w. UP and one NIC. > When IP-forwarding is considered, we no longer hit the transmission limit > (using NAPI, and your buffer recycling patch, as mentioned on the paper > and on the slides... If no buffer recycling is adopted, performance drops > a bit) > So it seemed to us that the major bottleneck is due to the transmission > limit. > > Again, you can get numbers and more details from > > http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf > http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf Nice. Seems we're getting close to click w. NAPI and recycling. The skb recycling is outdated as it adds too much complexity to the kernel. I got some idea how to make a much more lightweight variant... If you feel like hacking I can outline the idea so you can try it. > > OK. Good to know about e1000. Networking is mostly DMAs and the CPU is used > > administrating it; this is the challenge. > > That's true. There is still the chance that the limit is due to hardware > CRC calculation (which must be added to the ethernet frame by the > nic...). But we're quite comfortable that that is not the limit, since > in the reception path the same operation must be performed... OK! > > Even you could try to fill TX as soon as the HW says there are available > > buffers. This could even be done from TX-interrupt. > > Are you suggesting to modify packetgen to be more aggressive? Well, it could be useful at least as an experiment. Our lab would be happy... > > Small packet performance is dependent on low latency. Higher bus speed > > gives shorter latency but also on higher speed buses there tend to be > > bridges that add latency. > > That's true. We suspect that the limit is due to bus latency. But still, > we are surprised, since the bus allows us to receive 100%, but to transmit > only up to ~50%. Moreover the raw aggregate bandwidth of the bus is _far_ > larger (133MHz*64bit ~ 8gbit/s). Have a look at the graph in the pktgen paper presented at Linux-Kongress in Erlangen 2004. It seems like even at 8gbit/s this is limiting small packet TX performance. ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf --ro ^ permalink raw reply [flat|nested] 85+ messages in thread
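For context, the "touching the packet data" cost lives in the rx clean loop: eth_type_trans() is usually the first place the CPU reads the freshly DMA'd buffer, so it eats the cache miss. The modified driver above simply skips that step; a softer variant is to hide the miss by prefetching the header area as soon as the descriptor is known to be complete. A sketch of that variant (generic 2.6-era rx path, not the actual modified driver):

	#include <linux/prefetch.h>

	/* in the rx clean loop, right after the descriptor's DD bit is seen */
	prefetch(skb->data);                   /* Ethernet/IP headers          */
	prefetch(skb->data + L1_CACHE_BYTES);  /* next cache line, speculative  */

	/* ... later, when the packet is handed up the stack ... */
	skb_put(skb, length);
	skb->protocol = eth_type_trans(skb, netdev);  /* first real read of the data */
	netif_receive_skb(skb);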
* Re: [E1000-devel] Transmission limit 2004-11-26 14:05 ` [E1000-devel] Transmission limit P 2004-11-26 15:31 ` Marco Mellia 2004-11-26 15:40 ` Robert Olsson @ 2004-11-27 20:00 ` Lennert Buytenhek 2004-11-29 12:44 ` Marco Mellia 2 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-11-27 20:00 UTC (permalink / raw) To: mellia; +Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Fri, Nov 26, 2004 at 02:05:26PM +0000, P@draigBrady.com wrote: > >What is weird, is that if we artificially "preload" the NIC tx-fifo with > >packets, and then instruct it to start sending them, those are actually > >transmitted AT WIRE SPEED!! I'm very interested in exactly what it is you're doing here. What do you mean by 'preload'? --L ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-27 20:00 ` Lennert Buytenhek @ 2004-11-29 12:44 ` Marco Mellia 2004-11-29 15:19 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: Marco Mellia @ 2004-11-29 12:44 UTC (permalink / raw) To: Lennert Buytenhek Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev > > >What is weird, is that if we artificially "preload" the NIC tx-fifo with > > >packets, and then instruct it to start sending them, those are actually > > >transmitted AT WIRE SPEED!! > > I'm very interested in exactly what it is you're doing here. What > do you mean by 'preload'? Here is a brief description of the trick we used. The modified driver code can be grabbed from http://www.tlc-networks.polito.it/~mellia/e1000_modified.tar.gz So: with "preloaded" we mean that we previously put the packets to be transmitted in the TX fifo of the nic without actually updating the register which counts the number of packets in the fifo queue. To do this a student modified the network driver, adding an entry in /proc/net/e1000/eth#. If you read from there you will get the values of the internal registers of the NIC regarding the internal fifo. Writing a number to it, you can set the TDFPC register which contains the number of pkts in the TX queue of the internal FIFO. To get the above result you have to: Compile this version of the driver (don't remember which version it was based on) Load it. After that you can take a look at the internal registers with: cat /proc/net/e1000/eth# (replace # with the correct number) Then we start placing something in the TX fifo. To do this I simply used: ping -c 10 x.x.x.x This has placed and also transmitted 10 ping pkts. But they aren't deleted from the internal FIFO; only the pointers have been updated. Take a look at the registers again with: cat /proc/net/e1000/eth# Now use: echo 10 > /proc/net/e1000/eth# Naturally 10 is the number we used above. This "resets" the registers and writes in the TDFPC that there are 10 pkts in the TX queue. Now when we do: ping -c 1 x.x.x.x You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the new one). If you try to measure the TX speed you will see that it is ~ the wire speed. Note: - note that if you don't have static arp tables there will also be some arp pkts (should be two more pkts) - probably if you write too many pkts it won't work because the FIFO is organized like a circular buffer and you will begin to overwrite the first pkts. - the normal ping pkts aren't minimum size, but you can reduce them with the -s option - the code modifications have been written with "quick and dirty" in mind; certainly it is possible to write them better -- Ciao, /\/\/\rco +-----------------------------------+ | Marco Mellia - Assistant Professor| | Tel: 39-011-2276-608 | | Tel: 39-011-564-4173 | | Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . . | Politecnico di Torino | \ / . ASCII Ribbon Campaign . | Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail . | Torino - 10129 - Italy | / \ .- NO Word docs in e-mail. | http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . . +-----------------------------------+ The box said "Requires Windows 95 or Better." So I installed Linux. ^ permalink raw reply [flat|nested] 85+ messages in thread
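A rough sketch of what such a /proc hook can look like, for readers who don't want to dig through the tarball. Register names (TDFH/TDFT/TDFHS/TDFTS/TDFPC) are as in e1000_hw.h of that era; the procfs registration (create_proc_entry and friends) and error handling are omitted, so this illustrates the idea rather than reproducing the actual modified driver.

	static int e1000_fifo_read_proc(char *page, char **start, off_t off,
	                                int count, int *eof, void *data)
	{
		struct e1000_adapter *adapter = data;
		struct e1000_hw *hw = &adapter->hw;
		int len;

		/* dump the Tx data-FIFO head/tail pointers, their saved copies
		 * and the packet count                                         */
		len = sprintf(page,
		              "TDFH=%08x TDFT=%08x TDFHS=%08x TDFTS=%08x TDFPC=%08x\n",
		              E1000_READ_REG(hw, TDFH), E1000_READ_REG(hw, TDFT),
		              E1000_READ_REG(hw, TDFHS), E1000_READ_REG(hw, TDFTS),
		              E1000_READ_REG(hw, TDFPC));
		*eof = 1;
		return len;
	}

	static int e1000_fifo_write_proc(struct file *file, const char __user *buf,
	                                 unsigned long count, void *data)
	{
		struct e1000_adapter *adapter = data;
		char kbuf[16];

		if (count >= sizeof(kbuf))
			return -EINVAL;
		if (copy_from_user(kbuf, buf, count))
			return -EFAULT;
		kbuf[count] = '\0';

		/* claim that this many already-transmitted packets still sit in
		 * the Tx data FIFO; the real patch also rewinds the FIFO head/
		 * tail registers to their saved values ("resets" them, as
		 * described above) -- elided here                              */
		E1000_WRITE_REG(&adapter->hw, TDFPC, simple_strtoul(kbuf, NULL, 0));
		return count;
	}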
* Re: [E1000-devel] Transmission limit 2004-11-29 12:44 ` Marco Mellia @ 2004-11-29 15:19 ` Lennert Buytenhek 2004-11-29 17:32 ` Marco Mellia 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-11-29 15:19 UTC (permalink / raw) To: Marco Mellia Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, Nov 29, 2004 at 01:44:27PM +0100, Marco Mellia wrote: > This "resets" the registers and writes in the TDFPC that there are 10 > pkts in the TX queue. > Now when we do: > > ping -c 1 x.x.x.x > > You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the > new one). > If you try to measure the TX speed you will see that it is ~ the wire > speed. How are you measuring this? cheers, Lennert ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-29 15:19 ` Lennert Buytenhek @ 2004-11-29 17:32 ` Marco Mellia 2004-11-29 19:08 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: Marco Mellia @ 2004-11-29 17:32 UTC (permalink / raw) To: Lennert Buytenhek Cc: Marco Mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev Using the Agilent Router tester as receiver... :-( > On Mon, Nov 29, 2004 at 01:44:27PM +0100, Marco Mellia wrote: > > > This "resets" the registers and writes in the TDFPC that there are 10 > > pkts in the TX queue. > > Now when we do: > > > > ping -c 1 x.x.x.x > > > > You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the > > new one). > > If you try to measure the TX speed you will see that it is ~ the wire > > speed. > > How are you measuring this? > > > cheers, > Lennert -- Ciao, /\/\/\rco +-----------------------------------+ | Marco Mellia - Assistant Professor| | Tel: 39-011-2276-608 | | Tel: 39-011-564-4173 | | Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . . | Politecnico di Torino | \ / . ASCII Ribbon Campaign . | Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail . | Torino - 10129 - Italy | / \ .- NO Word docs in e-mail. | http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . . +-----------------------------------+ The box said "Requires Windows 95 or Better." So I installed Linux. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-29 17:32 ` Marco Mellia @ 2004-11-29 19:08 ` Lennert Buytenhek 2004-11-29 19:09 ` Lennert Buytenhek 0 siblings, 1 reply; 85+ messages in thread From: Lennert Buytenhek @ 2004-11-29 19:08 UTC (permalink / raw) To: Marco Mellia Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, Nov 29, 2004 at 06:32:13PM +0100, Marco Mellia wrote: > Using the Agilent Router tester as receiver... > :-( OK, so you're measuring the inter-packet gap and in that burst of 11 (or whatever many) packets it's 96 bit times between every packet, yes? Interesting. Can you also try 'pre-loading' the TX ring with a bunch of packets and then writing the values 0, 1, 2, 3 ... n-1 to the TXD register with back-to-back MMIO writes (instead of doing a single write of the value 'n'), and check what inter-packet gap you get then? cheers, Lennert ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit 2004-11-29 19:08 ` Lennert Buytenhek @ 2004-11-29 19:09 ` Lennert Buytenhek 0 siblings, 0 replies; 85+ messages in thread From: Lennert Buytenhek @ 2004-11-29 19:09 UTC (permalink / raw) To: Marco Mellia Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev On Mon, Nov 29, 2004 at 08:08:08PM +0100, Lennert Buytenhek wrote: > packets and then writing the values 0, 1, 2, 3 ... n-1 to the TXD register ^^^ That should be TDT. --L ^ permalink raw reply [flat|nested] 85+ messages in thread
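In driver terms the proposed experiment amounts to something like the sketch below (TDT being the Tx Descriptor Tail, per the correction above; with the usual "tail points at the next free slot" convention, the off-by-one between the mail's 0 ... n-1 and the 1 ... n here is just bookkeeping, and the exact endpoints depend on how the ring was pre-filled):

	/* sketch: descriptors 0 .. n-1 have already been set up in the Tx ring */
	static void kick_tx_tail(struct e1000_adapter *adapter, unsigned int n,
	                         int one_by_one)
	{
		unsigned int i;

		wmb();	/* descriptor writes must be globally visible before the doorbell */

		if (!one_by_one) {
			/* one doorbell for the whole burst */
			E1000_WRITE_REG(&adapter->hw, TDT, n);
		} else {
			/* back-to-back MMIO writes, advancing the tail one slot at
			 * a time, to see whether the inter-packet gap changes      */
			for (i = 1; i <= n; i++)
				E1000_WRITE_REG(&adapter->hw, TDT, i);
		}
	}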
Thread overview: 85+ messages
[not found] <1101467291.24742.70.camel@mellia.lipar.polito.it>
2004-11-26 14:05 ` [E1000-devel] Transmission limit P
2004-11-26 15:31 ` Marco Mellia
2004-11-26 19:56 ` jamal
2004-11-29 14:21 ` Marco Mellia
2004-11-30 13:46 ` jamal
2004-12-02 17:24 ` Marco Mellia
2004-11-26 20:06 ` jamal
2004-11-26 20:56 ` Lennert Buytenhek
2004-11-26 21:02 ` Lennert Buytenhek
2004-11-27 9:25 ` Harald Welte
[not found] ` <20041127111101.GC23139@xi.wantstofly.org>
2004-11-27 11:31 ` Harald Welte
2004-11-27 20:12 ` Cesar Marcondes
2004-11-29 8:53 ` Marco Mellia
2004-11-29 14:50 ` Lennert Buytenhek
2004-11-30 8:42 ` Marco Mellia
2004-12-01 12:25 ` jamal
2004-12-02 13:39 ` Marco Mellia
2004-12-03 13:07 ` jamal
2004-11-26 15:40 ` Robert Olsson
2004-11-26 15:59 ` Marco Mellia
2004-11-26 16:57 ` P
2004-11-26 20:01 ` jamal
2004-11-29 10:19 ` P
2004-11-29 13:09 ` Robert Olsson
2004-11-29 20:16 ` David S. Miller
2004-12-01 16:47 ` Robert Olsson
2004-11-30 13:31 ` jamal
2004-11-30 13:46 ` Lennert Buytenhek
2004-11-30 14:25 ` jamal
2004-12-01 0:11 ` Lennert Buytenhek
2004-12-01 1:09 ` Scott Feldman
2004-12-01 15:34 ` Robert Olsson
2004-12-01 16:49 ` Scott Feldman
2004-12-01 17:37 ` Robert Olsson
2004-12-02 17:54 ` Robert Olsson
2004-12-02 18:23 ` Robert Olsson
2004-12-02 23:25 ` Lennert Buytenhek
2004-12-03 5:23 ` Scott Feldman
2004-12-10 16:24 ` Martin Josefsson
2004-12-01 18:29 ` Lennert Buytenhek
2004-12-01 21:35 ` Lennert Buytenhek
2004-12-02 6:13 ` Scott Feldman
2004-12-03 13:24 ` jamal
2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
2004-12-05 15:03 ` Martin Josefsson
2004-12-05 15:15 ` Lennert Buytenhek
2004-12-05 15:19 ` Martin Josefsson
2004-12-05 15:30 ` Martin Josefsson
2004-12-05 17:00 ` Lennert Buytenhek
2004-12-05 17:11 ` Martin Josefsson
2004-12-05 17:38 ` Martin Josefsson
2004-12-05 18:14 ` Lennert Buytenhek
2004-12-05 15:42 ` Martin Josefsson
2004-12-05 16:48 ` Martin Josefsson
2004-12-05 17:01 ` Martin Josefsson
2004-12-05 17:58 ` Lennert Buytenhek
2004-12-05 17:44 ` Lennert Buytenhek
2004-12-05 17:51 ` Lennert Buytenhek
2004-12-05 17:54 ` Martin Josefsson
2004-12-06 11:32 ` 1.03Mpps on e1000 (was: " jamal
2004-12-06 12:11 ` Lennert Buytenhek
2004-12-06 12:20 ` jamal
2004-12-06 12:23 ` Lennert Buytenhek
2004-12-06 12:30 ` Martin Josefsson
2004-12-06 13:11 ` jamal
[not found] ` <20041206132907.GA13411@xi.wantstofly.org>
[not found] ` <16820.37049.396306.295878@robur.slu.se>
2004-12-06 17:32 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] " P
2004-12-08 23:36 ` Ray Lehtiniemi
[not found] ` <41B825A5.2000009@draigBrady.com>
[not found] ` <20041209161825.GA32454@mail.com>
2004-12-09 17:12 ` 1.03Mpps on e1000 P
[not found] ` <20041209164820.GB32454@mail.com>
2004-12-09 17:19 ` P
2004-12-09 23:25 ` Ray Lehtiniemi
2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
2004-12-05 21:25 ` Lennert Buytenhek
2004-12-06 1:23 ` 1.03Mpps on e1000 (was: " Scott Feldman
2004-12-02 17:31 ` [E1000-devel] Transmission limit Marco Mellia
2004-12-03 20:57 ` Lennert Buytenhek
2004-12-04 10:36 ` Lennert Buytenhek
2004-12-01 12:08 ` jamal
2004-12-01 15:24 ` Lennert Buytenhek
2004-11-26 17:58 ` Robert Olsson
2004-11-27 20:00 ` Lennert Buytenhek
2004-11-29 12:44 ` Marco Mellia
2004-11-29 15:19 ` Lennert Buytenhek
2004-11-29 17:32 ` Marco Mellia
2004-11-29 19:08 ` Lennert Buytenhek
2004-11-29 19:09 ` Lennert Buytenhek