* TX performance of Intel 82546
From: Harald Welte @ 2004-09-15 8:14 UTC (permalink / raw)
To: Linux NICS; +Cc: netdev
Hi!
I'm currently trying to help Robert Olsson improve the performance of
the Linux in-kernel packet generator (pktgen.c). At the moment, we seem
to be unable to get more than 760kpps from a single port of an 82546
(or any other PCI-X MAC supported by e1000) - that's a bit more than 51%
of wire speed at 64-byte packet size.
I tried to find out whether this is a software problem (i.e. the Linux
network stack or pktgen), or a hardware or driver problem.
To do this, I hardwired some code into the e1000 driver that
automatically refills the TX queue with the same packet over and over
again (see the ugly hack attached to this email). When running in this
hardwired mode, I do not get any E1000_ICR_TXQE events - so apparently
the TX queue never runs empty, and the 82546 is transferring packets
from host memory as fast as it can.
However, I still don't get more than 760kpps from a single port.
Do you have any further recommendations or comments?
Is this 760kpps really a hardware limitation? Is it limited by the
82546, by PCI-X latency/bandwidth, or by memory latency/bandwidth?
Did Intel ever achieve higher TX pps rates with the 82546 MAC? If yes,
on which hardware and OS?
Thanks for your help.
Hardware:
MSI K8D Master-F, Dual Opteron 1.4GHz, 1GB RAM, PC-2700 (DDR-333), all on CPU1
0000:02:03.0 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03)
Subsystem: Intel Corp. PRO/1000 MT Dual Port Network Connection
Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 24
Memory at fc9c0000 (64-bit, non-prefetchable) [size=128K]
Memory at fc900000 (64-bit, non-prefetchable) [size=256K]
I/O ports at a880 [size=64]
Expansion ROM at fc8c0000 [disabled] [size=256K]
Capabilities: [dc] Power Management version 2
Capabilities: [e4] PCI-X
Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Software:
linux-2.6.8.1
- modified e1000 to keep re-filling 2048 TX descriptors with the same skb
- board does not generate any 'tx queue empty' interrupts
- thus, TX is running at full PC/HT/memory speed
TxDescriptors=2048,2048 FlowControl=0,0 Speed=1000,1000 TxIntDelay=0,0 TxAbsIntDelay=0,0 InterruptThrottleRate=0,0
No more than 763kpps possible :(
tried (no improvement):
- IntDelay, AbsIntDelay, dynamic ThrottleRate
- enabling NAPI (disables TX interrupt)
-> tx queue runs empty
- smp_affinity to other cpu (not sure which has local ram)
-> only 714kpps
- increase skb to 128 bytes
-> 746kpps
---
diff -Nru linux-2.6.8.1/drivers/net/e1000/e1000.h linux-2.6.8.1-test/drivers/net/e1000/e1000.h
--- linux-2.6.8.1/drivers/net/e1000/e1000.h 2004-08-14 10:54:47.000000000 +0000
+++ linux-2.6.8.1-test/drivers/net/e1000/e1000.h 2004-09-13 16:37:12.000000000 +0000
@@ -202,6 +202,7 @@
spinlock_t stats_lock;
atomic_t irq_sem;
struct work_struct tx_timeout_task;
+ struct work_struct tx_pktgen_task;
uint8_t fc_autoneg;
struct timer_list blink_timer;
diff -Nru linux-2.6.8.1/drivers/net/e1000/e1000_main.c linux-2.6.8.1-test/drivers/net/e1000/e1000_main.c
--- linux-2.6.8.1/drivers/net/e1000/e1000_main.c 2004-08-14 10:55:10.000000000 +0000
+++ linux-2.6.8.1-test/drivers/net/e1000/e1000_main.c 2004-09-13 21:19:13.996635320 +0000
@@ -111,6 +111,9 @@
void e1000_update_stats(struct e1000_adapter *adapter);
/* Local Function Prototypes */
+static struct sk_buff *test_dummy_skb(unsigned int size);
+static void test_refill_tx_queue(struct e1000_adapter *adapter);
+static void test_tx_pktgen_task(struct net_device *netdev);
static int e1000_init_module(void);
static void e1000_exit_module(void);
@@ -273,6 +276,9 @@
mod_timer(&adapter->watchdog_timer, jiffies);
e1000_irq_enable(adapter);
+ test_dummy_skb(60);
+ test_refill_tx_queue(adapter);
+
return 0;
}
@@ -281,6 +287,8 @@
{
struct net_device *netdev = adapter->netdev;
+ printk("%s: entering\n", __FUNCTION__);
+
e1000_irq_disable(adapter);
free_irq(adapter->pdev->irq, netdev);
del_timer_sync(&adapter->tx_fifo_stall_timer);
@@ -533,6 +541,9 @@
INIT_WORK(&adapter->tx_timeout_task,
(void (*)(void *))e1000_tx_timeout_task, netdev);
+ INIT_WORK(&adapter->tx_pktgen_task,
+ (void (*)(void *))test_tx_pktgen_task, netdev);
+
/* we're going to reset, so assume we have no link for now */
netif_carrier_off(netdev);
@@ -765,6 +776,7 @@
{
struct e1000_adapter *adapter = netdev->priv;
+ printk("%s: entering\n", __FUNCTION__);
e1000_down(adapter);
e1000_free_tx_resources(adapter);
@@ -1070,13 +1082,15 @@
for(i = 0; i < tx_ring->count; i++) {
buffer_info = &tx_ring->buffer_info[i];
- if(buffer_info->skb) {
-
+ if(buffer_info->dma) {
pci_unmap_page(pdev,
buffer_info->dma,
buffer_info->length,
PCI_DMA_TODEVICE);
+ buffer_info->dma = 0;
+ }
+ if(buffer_info->skb) {
dev_kfree_skb(buffer_info->skb);
buffer_info->skb = NULL;
@@ -1434,6 +1448,7 @@
* but we've got queued Tx work that's never going
* to get done, so reset controller to flush Tx.
* (Do the reset outside of interrupt context). */
+ printk("%s: scheduling timeout\n", __FUNCTION__);
schedule_work(&adapter->tx_timeout_task);
}
}
@@ -1555,6 +1570,7 @@
#define E1000_MAX_TXD_PWR 12
#define E1000_MAX_DATA_PER_TXD (1<<E1000_MAX_TXD_PWR)
+
static inline int
e1000_tx_map(struct e1000_adapter *adapter, struct sk_buff *skb,
unsigned int first, unsigned int max_per_txd,
@@ -1754,6 +1770,13 @@
return 0;
}
+#if 0
+ /* don't send any packets */
+ dev_kfree_skb_any(skb);
+ netdev->trans_start = jiffies;
+ return 0;
+#endif
+
#ifdef NETIF_F_TSO
mss = skb_shinfo(skb)->tso_size;
/* The controller does a simple calculation to
@@ -1791,6 +1814,8 @@
if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2 ) {
netif_stop_queue(netdev);
spin_unlock_irqrestore(&adapter->tx_lock, flags);
+ if (net_ratelimit())
+ printk(KERN_DEBUG "err: no unused descriptors\n");
return 1;
}
spin_unlock_irqrestore(&adapter->tx_lock, flags);
@@ -1834,6 +1859,7 @@
{
struct e1000_adapter *adapter = netdev->priv;
+ printk("%s: entering\n", __FUNCTION__);
/* Do the reset outside of interrupt context */
schedule_work(&adapter->tx_timeout_task);
}
@@ -1843,6 +1869,7 @@
{
struct e1000_adapter *adapter = netdev->priv;
+ printk("%s: entering\n", __FUNCTION__);
netif_device_detach(netdev);
e1000_down(adapter);
e1000_up(adapter);
@@ -2078,6 +2105,8 @@
{
if(atomic_dec_and_test(&adapter->irq_sem)) {
E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
+ /* disable RX interrupt generation */
+ //E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK & ~E1000_IMS_RXDMT0);
E1000_WRITE_FLUSH(&adapter->hw);
}
}
@@ -2103,11 +2132,27 @@
if(!icr)
return IRQ_NONE; /* Not our interrupt */
+#if 0
+ printk("e1000_intr: icr=0x%08x\n", icr);
+ printk("%s: tdh=%d, tdt=%d\n", __FUNCTION__,
+ E1000_READ_REG(&adapter->hw, TDH),
+ E1000_READ_REG(&adapter->hw, TDT));
+#endif
+
if(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC)) {
hw->get_link_status = 1;
mod_timer(&adapter->watchdog_timer, jiffies);
}
+ if (icr & E1000_ICR_TXQE) {
+ printk("TX queue empty: shouldn't happen!\n");
+ }
+
+ if (icr & E1000_ICR_TXDW) {
+ e1000_clean_tx_irq(adapter);
+ schedule_work(&adapter->tx_pktgen_task);
+ }
+
#ifdef CONFIG_E1000_NAPI
if(netif_rx_schedule_prep(netdev)) {
@@ -2116,7 +2161,7 @@
*/
atomic_inc(&adapter->irq_sem);
- E1000_WRITE_REG(hw, IMC, ~0);
+// E1000_WRITE_REG(hw, IMC, ~0);
__netif_rx_schedule(netdev);
}
#else
@@ -2142,6 +2187,7 @@
int work_to_do = min(*budget, netdev->quota);
int work_done = 0;
+ //printk("%s entered\n", __FUNCTION__);
e1000_clean_tx_irq(adapter);
e1000_clean_rx_irq(adapter, &work_done, work_to_do);
@@ -2175,6 +2221,7 @@
boolean_t cleaned = FALSE;
+ //printk("%s entered\n", __FUNCTION__);
i = tx_ring->next_to_clean;
eop = tx_ring->buffer_info[i].next_to_watch;
eop_desc = E1000_TX_DESC(*tx_ring, eop);
@@ -2184,6 +2231,7 @@
for(cleaned = FALSE; !cleaned; ) {
tx_desc = E1000_TX_DESC(*tx_ring, i);
buffer_info = &tx_ring->buffer_info[i];
+ //printk("cleaning tx_desc %d\n", i);
if(buffer_info->dma) {
@@ -2802,6 +2850,7 @@
uint32_t ctrl, ctrl_ext, rctl, manc, status;
uint32_t wufc = adapter->wol;
+ printk("%s: entering\n", __FUNCTION__);
netif_device_detach(netdev);
if(netif_running(netdev))
@@ -2920,4 +2969,169 @@
}
#endif
+
+#include <linux/ip.h>
+#include <linux/udp.h>
+static struct sk_buff *test_skb;
+
+static struct sk_buff *
+test_dummy_skb(unsigned int pkt_size)
+{
+ int datalen;
+ struct sk_buff *skb;
+ __u8 *eth;
+ struct iphdr *iph;
+ struct udphdr *udph;
+
+ skb = alloc_skb(pkt_size + 64 + 16, GFP_ATOMIC);
+
+ if (!skb)
+ return NULL;
+
+ /* increase reference count so no one can free it */
+ skb_get(skb);
+
+ skb_reserve(skb, 16);
+ eth = (__u8 *) skb_push(skb, 14);
+ iph = (struct iphdr *)skb_put(skb, sizeof(struct iphdr));
+ udph = (struct udphdr *)skb_put(skb, sizeof(struct udphdr));
+ memset(eth, 1, 6); // dst
+ memset(eth+6, 2, 6); // src
+ memset(eth+12, 0x08, 1);
+ memset(eth+13, 0x00, 1);
+
+ datalen = pkt_size - 14 - 20 - 8;
+
+ udph->source = htons(9);
+ udph->dest = htons(9);
+ udph->len = htons(datalen + 8);
+ udph->check = 0;
+
+ iph->ihl = 5;
+ iph->version = 4;
+ iph->ttl = 3;
+ iph->tos = 0;
+ iph->protocol = IPPROTO_UDP;
+ iph->saddr = htonl(0x01010101);
+ iph->daddr = htonl(0x02020202);
+ iph->frag_off = 0;
+ iph->tot_len = htons(20+8+datalen);
+ iph->check = 0;
+ iph->check = ip_fast_csum((void *) iph, iph->ihl);
+ skb->protocol = __constant_htons(ETH_P_IP);
+ skb->mac.raw = ((u8 *)iph) - 14;
+ skb->pkt_type = PACKET_HOST;
+
+ test_skb = skb;
+
+ return skb;
+}
+
+static inline void
+test_tx_queue(struct e1000_adapter *adapter, int count, int tx_flags)
+{
+ struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
+ struct e1000_tx_desc *tx_desc = NULL;
+ struct e1000_buffer *buffer_info;
+ uint32_t txd_upper = 0, txd_lower = E1000_TXD_CMD_IFCS;
+ unsigned int i;
+
+ if(tx_flags & E1000_TX_FLAGS_TSO) {
+ txd_lower |= E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_D |
+ E1000_TXD_CMD_TSE;
+ txd_upper |= (E1000_TXD_POPTS_IXSM | E1000_TXD_POPTS_TXSM) << 8;
+ }
+
+ if(tx_flags & E1000_TX_FLAGS_CSUM) {
+ txd_lower |= E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_D;
+ txd_upper |= E1000_TXD_POPTS_TXSM << 8;
+ }
+
+ if(tx_flags & E1000_TX_FLAGS_VLAN) {
+ txd_lower |= E1000_TXD_CMD_VLE;
+ txd_upper |= (tx_flags & E1000_TX_FLAGS_VLAN_MASK);
+ }
+
+ i = tx_ring->next_to_use;
+
+ while(count--) {
+ buffer_info = &tx_ring->buffer_info[i];
+ tx_desc = E1000_TX_DESC(*tx_ring, i);
+ tx_desc->buffer_addr = cpu_to_le64(buffer_info->dma);
+ tx_desc->lower.data =
+ cpu_to_le32(txd_lower | buffer_info->length);
+ tx_desc->upper.data = cpu_to_le32(txd_upper);
+ if(++i == tx_ring->count) i = 0;
+ }
+
+ tx_desc->lower.data |= cpu_to_le32(adapter->txd_cmd);
+
+ /* Force memory writes to complete before letting h/w
+ * know there are new descriptors to fetch. (Only
+ * applicable for weak-ordered memory model archs,
+ * such as IA-64). */
+ wmb();
+
+ tx_ring->next_to_use = i;
+}
+
+static void
+test_refill_tx_queue(struct e1000_adapter *adapter)
+{
+ int i, num;
+ unsigned long flags;
+ int reserve = 2 + 10;
+
+ spin_lock_irqsave(&adapter->tx_lock, flags);
+ num = E1000_DESC_UNUSED(&adapter->tx_ring);
+
+ if (num <= reserve) {
+ printk("too little unused descriptors to refill\n");
+ spin_unlock_irqrestore(&adapter->tx_lock, flags);
+ return;
+ }
+
+ //printk("%s: refilling %d descriptors\n", __FUNCTION__, num - reserve);
+
+ i = 0;
+ while (1) {
+ int ret, skb_idx;
+ if (i >= num-reserve)
+ break;
+
+ // printk("e1000_tx_map(%d)\n", adapter->tx_ring.next_to_use);
+ ret = e1000_tx_map(adapter, test_skb,
+ adapter->tx_ring.next_to_use,
+ E1000_MAX_DATA_PER_TXD,
+ skb_shinfo(test_skb)->nr_frags,
+ skb_shinfo(test_skb)->tso_size);
+ skb_idx = adapter->tx_ring.buffer_info[adapter->tx_ring.next_to_use].next_to_watch;
+ adapter->tx_ring.buffer_info[skb_idx].skb = NULL;
+ test_tx_queue(adapter, ret, 0);
+ i += ret;
+ }
+
+#if 0
+ printk("%s: tdh=%d, tdt=%d\n", __FUNCTION__,
+ E1000_READ_REG(&adapter->hw, TDH),
+ E1000_READ_REG(&adapter->hw, TDT));
+#endif
+ E1000_WRITE_REG(&adapter->hw, TDT, adapter->tx_ring.next_to_use);
+#if 0
+ printk("%s: tdh=%d, tdt=%d\n", __FUNCTION__,
+ E1000_READ_REG(&adapter->hw, TDH),
+ E1000_READ_REG(&adapter->hw, TDT));
+#endif
+
+ spin_unlock_irqrestore(&adapter->tx_lock, flags);
+}
+
+static void
+test_tx_pktgen_task(struct net_device *netdev)
+{
+ struct e1000_adapter *adapter = netdev->priv;
+ test_refill_tx_queue(adapter);
+}
+
+
/* e1000_main.c */
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
* Re: TX performance of Intel 82546
From: P @ 2004-09-15 9:18 UTC (permalink / raw)
To: Harald Welte; +Cc: Linux NICS, netdev
Harald Welte wrote:
> Hi!
>
> I'm currently trying to help Robert Olsson improving the performance of
> the Linux in-kernel packet generator (pktgen.c). At the moment, we seem
> to be unable to get more than 760kpps from a single port of a 82546,
> (or any other PCI-X MAC supported by e1000) - that's a bit more than 51%
> wirespeed at 64byte packet sizes.
In my experience anything around 750Kpps is a PCI limitation,
specifically PCI bus arbitration latency. Note the clock speed of
the control signal used for bus arbitration has not increased
in proportion to the PCI data clock speed.
The application note #453 referenced below is very informative:
http://www.intel.com/design/network/products/lan/docs/82546_docs.htm
I was able to confirm the above by passing 4x730Kpps
through a PCI-X system with 4 ethernet controllers,
but never more than 760Kpps through one particular controller.
Note also that you may be able to tune for transmission
using setpci (google for setpci & MMRBC), or by hacking with TSO?
Pádraig.
* Re: TX performance of Intel 82546
From: jamal @ 2004-09-15 12:18 UTC (permalink / raw)
To: P; +Cc: Harald Welte, Linux NICS, netdev
On Wed, 2004-09-15 at 05:18, P@draigBrady.com wrote:
> I was able to confirm the above by passing 4x730Kpps
> through a PCI-X system with 4 ethernet controllers,
> but never more than 760Kpps through one particular controller.
>
> Note also you may be able to tune for transmission
> using setpci (google for setpci & MMRBC), or hacking with TSO?
Our friends in FreeBSD claim they can do 1Mpps _forwarding_ with
e1000 - forget about transmit only ;->
I have been experimenting on and off since SUCON, because I found the BSD
folks batch their transmits. They have a very dumb egress path with no QoS.
Results are not conclusive yet; I will post at some point.
Harald, try moving the wmb() to just before you write the TDT, as I do in
kick_DMA in the rearranged e1000 patch attached.
I will move things around if you show worse results.
BTW, anyone wanting to experiment with this patch, talk to me privately. I
need to clean it up - maybe this weekend.
cheers,
jamal
[-- Attachment #2: e1000p --]
[-- Type: text/plain, Size: 4889 bytes --]
--- 269-rc1-bk10/drivers/net/e1000/e1000_main.c 2004/09/12 17:05:58 1.1
+++ 269-rc1-bk10/drivers/net/e1000/e1000_main.c 2004/09/15 12:13:24
@@ -125,6 +125,7 @@
static void e1000_watchdog(unsigned long data);
static void e1000_82547_tx_fifo_stall(unsigned long data);
static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+static int e1000_xmit_frames(struct sk_buff_head *list, struct net_device *netdev);
static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
static int e1000_change_mtu(struct net_device *netdev, int new_mtu);
static int e1000_set_mac(struct net_device *netdev, void *p);
@@ -448,6 +449,7 @@
netdev->open = &e1000_open;
netdev->stop = &e1000_close;
netdev->hard_start_xmit = &e1000_xmit_frame;
+ netdev->hard_batch_xmit = &e1000_xmit_frames;
netdev->get_stats = &e1000_get_stats;
netdev->set_multicast_list = &e1000_set_multi;
netdev->set_mac_address = &e1000_set_mac;
@@ -1673,6 +1675,14 @@
}
static inline void
+e1000_kick_DMA(struct e1000_adapter *adapter, int i)
+{
+ wmb();
+
+ E1000_WRITE_REG(&adapter->hw, TDT, i);
+}
+
+static inline void
e1000_tx_queue(struct e1000_adapter *adapter, int count, int tx_flags)
{
struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
@@ -1711,14 +1721,16 @@
tx_desc->lower.data |= cpu_to_le32(adapter->txd_cmd);
+#if 0
/* Force memory writes to complete before letting h/w
* know there are new descriptors to fetch. (Only
* applicable for weak-ordered memory model archs,
* such as IA-64). */
wmb();
- tx_ring->next_to_use = i;
E1000_WRITE_REG(&adapter->hw, TDT, i);
+#endif
+ tx_ring->next_to_use = i;
}
/**
@@ -1760,15 +1772,15 @@
}
#define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
-static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+#define NETDEV_TX_DROPPED 3
+static inline int
+e1000_queue_frame(struct sk_buff *skb, struct net_device *netdev)
{
struct e1000_adapter *adapter = netdev->priv;
unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
unsigned int tx_flags = 0;
unsigned int len = skb->len;
- unsigned long flags;
unsigned int nr_frags = 0;
unsigned int mss = 0;
int count = 0;
@@ -1778,7 +1790,7 @@
if(unlikely(skb->len <= 0)) {
dev_kfree_skb_any(skb);
- return 0;
+ return NETDEV_TX_DROPPED;
}
#ifdef NETIF_F_TSO
@@ -1813,27 +1825,19 @@
if(adapter->pcix_82544)
count += nr_frags;
- local_irq_save(flags);
- if (!spin_trylock(&adapter->tx_lock)) {
- /* Collision - tell upper layer to requeue */
- local_irq_restore(flags);
- return -1;
- }
/* need: count + 2 desc gap to keep tail from touching
* head, otherwise try next time */
if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
netif_stop_queue(netdev);
- spin_unlock_irqrestore(&adapter->tx_lock, flags);
- return 1;
+ return NETDEV_TX_BUSY;
}
if(unlikely(adapter->hw.mac_type == e1000_82547)) {
if(unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
netif_stop_queue(netdev);
mod_timer(&adapter->tx_fifo_stall_timer, jiffies);
- spin_unlock_irqrestore(&adapter->tx_lock, flags);
- return 1;
+ return NETDEV_TX_BUSY;
}
}
@@ -1855,8 +1859,69 @@
netdev->trans_start = jiffies;
+ return NETDEV_TX_OK;
+}
+
+static int
+e1000_xmit_frames(struct sk_buff_head *list, struct net_device *netdev)
+{
+ struct e1000_adapter *adapter = netdev->priv;
+ int ret = NETDEV_TX_OK;
+ int didq = 0;
+ int inbatch = skb_queue_len(list);
+ struct sk_buff *skb = NULL;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (!spin_trylock(&adapter->tx_lock)) {
+ /* Collision - tell upper layer to requeue */
+ local_irq_restore(flags);
+ return NETDEV_TX_LOCKED;
+ }
+
+ while ((skb = __skb_dequeue(list)) != NULL) {
+ ret = e1000_queue_frame(skb, netdev);
+ if (ret == NETDEV_TX_OK) {
+ didq++;
+ } else {
+ if (ret == NETDEV_TX_BUSY)
+ break;
+ }
+ }
+
+ if (didq)
+ e1000_kick_DMA(adapter, adapter->tx_ring.next_to_use);
+ if (skb_queue_len(list) && (inbatch > skb_queue_len(list)))
+ ret = NETDEV_TX_BUSY;
+ else
+ ret = NETDEV_TX_OK;
spin_unlock_irqrestore(&adapter->tx_lock, flags);
- return 0;
+ return ret;
+}
+
+static int
+e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+ struct e1000_adapter *adapter = netdev->priv;
+ int ret = NETDEV_TX_OK;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (!spin_trylock(&adapter->tx_lock)) {
+ /* Collision - tell upper layer to requeue */
+ local_irq_restore(flags);
+ return NETDEV_TX_LOCKED;
+ }
+
+ ret = e1000_queue_frame(skb, netdev);
+ if (ret == NETDEV_TX_OK) {
+ e1000_kick_DMA(adapter, adapter->tx_ring.next_to_use);
+ }
+
+ spin_unlock_irqrestore(&adapter->tx_lock, flags);
+ if (ret == NETDEV_TX_DROPPED)
+ ret = NETDEV_TX_OK;
+ return ret;
}
/**
* Re: TX performance of Intel 82546
From: Robert Olsson @ 2004-09-15 12:36 UTC (permalink / raw)
To: P; +Cc: Harald Welte, Linux NICS, netdev
P@draigBrady.com writes:
> Harald Welte wrote:
> > I'm currently trying to help Robert Olsson improving the performance of
> > the Linux in-kernel packet generator (pktgen.c). At the moment, we seem
> > to be unable to get more than 760kpps from a single port of a 82546,
> > (or any other PCI-X MAC supported by e1000) - that's a bit more than 51%
> > wirespeed at 64byte packet sizes.
Yes, it seems Intel adapters work better in BSD, as they claim to route
1 Mpps while we cannot even send more than ~750 kpps, even when only
feeding the adapter. :-)
> In my experience anything around 750Kpps is a PCI limitation,
> specifically PCI bus arbitration latency. Note the clock speed of
> the control signal used for bus arbitration has not increased
> in proportion to the PCI data clock speed.
Yes, data from an Opteron @ 1.6 GHz w. e1000 82546EB, 64-byte pkts:
133 MHz 830 pps
100 MHz 721 pps
66 MHz 561 pps
So higher bus bandwidth could increase the small-packet rate.
So is there a difference in PCI tuning between BSD and Linux?
And, more generally, can we measure the maximum number of
transactions on a PCI bus?
The chip should be able to transfer 64 packets in a single burst; I don't
know how to set/verify this.
Cheers.
--ro
* Re: TX performance of Intel 82546
From: jamal @ 2004-09-15 13:49 UTC (permalink / raw)
To: Robert Olsson; +Cc: P, Harald Welte, Linux NICS, netdev
On Wed, 2004-09-15 at 08:36, Robert Olsson wrote:
> > In my experience anything around 750Kpps is a PCI limitation,
> > specifically PCI bus arbitration latency. Note the clock speed of
> > the control signal used for bus arbitration has not increased
> > in proportion to the PCI data clock speed.
>
> Yes data from an Opteron @ 1.6 GHz w. e1000 82546EB 64 byte pkts.
>
> 133 MHz 830 pps
> 100 MHz 721 pps
> 66 MHz 561 pps
>
> So higher bus bandwidth could increase the small packet rate.
Nice data.
BTW, is this per interface? I thought I have seen numbers in the range
of 1.3Mpps from you.
> So is there a difference in PCI-tuning BSD versus Linux?
As far as I could tell, they batch transmits (mostly because of the way
mbufs are structured, really).
> And even more general can we measure the maximum numbers
> of transactions on a PCI-bus?
You would need specialized hardware for this, I think.
> Chip should be able to transfer 64 packets in single burst I don't now
> how set/verify this.
What Pádraig posted in regards to the MMRBC register is actually
enlightening. I kept thinking about it after I sent my last email.
If the overhead is indeed incurred in the transaction setup (everything
in my test setups points at this), then increasing the burst size should
show improvements.
cheers,
jamal
* Re: TX performance of Intel 82546
From: P @ 2004-09-15 13:59 UTC (permalink / raw)
To: Robert Olsson; +Cc: Harald Welte, netdev
Robert Olsson wrote:
> P@draigBrady.com writes:
> > Harald Welte wrote:
>
> > > I'm currently trying to help Robert Olsson improving the performance of
> > > the Linux in-kernel packet generator (pktgen.c). At the moment, we seem
> > > to be unable to get more than 760kpps from a single port of a 82546,
> > > (or any other PCI-X MAC supported by e1000) - that's a bit more than 51%
> > > wirespeed at 64byte packet sizes.
>
> Yes it seems intel adapters work better in BSD as they claim to route
> 1 Mpps and we cannot even send more ~750 kpps even with feeding the
> adapter only. :-)
>
> > In my experience anything around 750Kpps is a PCI limitation,
> > specifically PCI bus arbitration latency. Note the clock speed of
> > the control signal used for bus arbitration has not increased
> > in proportion to the PCI data clock speed.
>
> Yes data from an Opteron @ 1.6 GHz w. e1000 82546EB 64 byte pkts.
>
> 133 MHz 830 pps
> 100 MHz 721 pps
> 66 MHz 561 pps
Interesting info, thanks!
It would be very interesting to see the performance of PCI Express,
which should not have these bus arbitration issues.
> So higher bus bandwidth could increase the small packet rate.
>
> So is there a difference in PCI-tuning BSD versus Linux?
> And even more general can we measure the maximum numbers
> of transactions on a PCI-bus?
>
> Chip should be able to transfer 64 packets in single burst I don't now
> how set/verify this.
Well, from the Intel docs: "The devices include a PCI interface
that maximizes the use of bursts for efficient bus usage.
The controllers are able to cache up to 64 packet descriptors in
a single burst for efficient PCI bandwidth usage."
So I'm guessing that increasing the PCI-X burst size setting
(MMRBC) will automatically get more packets sent per transfer?
I said previously in this thread to google for setpci and MMRBC,
but what I know about it is...
To return the current setting(s):
setpci -d 8086:1010 e6.b
The MMRBC is the upper two bits of the lower nibble, where:
0 = 512 byte bursts
1 = 1024 byte bursts
2 = 2048 byte bursts
3 = 4096 byte bursts
For me to set 4KiB bursts I do:
setpci -d 8086:1010 e6.b=0e
The following measured a 30% throughput improvement (on 10G)
from setting the burst size to 4KiB:
https://mgmt.datatag.org/sravot/TCP_WAN_perf_sr061504.pdf
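If you want to poke the same bits from inside the driver rather than
with setpci, something along these lines should be equivalent (untested
sketch; it looks up the standard PCI-X capability instead of hard-coding
the 0xe6 offset, and the function name is only for illustration):

#include <linux/pci.h>

/* Untested sketch: set MMRBC (bits 3:2 of the PCI-X command register).
 * burst_code: 0 = 512, 1 = 1024, 2 = 2048, 3 = 4096 byte bursts. */
static void example_set_mmrbc(struct pci_dev *pdev, u8 burst_code)
{
	int pcix = pci_find_capability(pdev, PCI_CAP_ID_PCIX);
	u16 cmd;

	if (!pcix)
		return;		/* not a PCI-X function */

	pci_read_config_word(pdev, pcix + PCI_X_CMD, &cmd);
	cmd &= ~PCI_X_CMD_MAX_READ;	/* clear bits 3:2 */
	cmd |= (burst_code & 0x3) << 2;
	pci_write_config_word(pdev, pcix + PCI_X_CMD, cmd);
}

Calling it with burst_code = 3 should correspond to the e6.b=0e setting
above as far as MMRBC is concerned (the setpci write also touches the
relaxed-ordering bit).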
Pádraig.
* Re: TX performance of Intel 82546
From: Harald Welte @ 2004-09-15 14:02 UTC (permalink / raw)
To: jamal; +Cc: P, Linux NICS, netdev
On Wed, Sep 15, 2004 at 08:18:27AM -0400, jamal wrote:
> Our friends in FreeBSD claim they can do 1Mpps _forwarding_ with
> e1000 - forget about transmit only ;->
IMHO that was an 82547, attached via CSA and not PCI-X...
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
* Re: TX performance of Intel 82546
From: jamal @ 2004-09-15 14:46 UTC (permalink / raw)
To: Harald Welte; +Cc: Linux NICS, netdev
On Wed, 2004-09-15 at 10:02, Harald Welte wrote:
> On Wed, Sep 15, 2004 at 08:18:27AM -0400, jamal wrote:
>
> > Our friends in FreeBSD claim they can do 1Mpps _forwarding_ with
> > e1000 - forget about transmit only ;->
>
> IMHO that was a 82547, attached to CSA and not PCI-X...
I thought PCI-X should have less overhead in this case (because it has
split-transaction capability), no?
What do you mean by the above statement about it not being PCI-X?
cheers,
jamal
* Re: TX performance of Intel 82546
From: Andi Kleen @ 2004-09-15 14:55 UTC (permalink / raw)
To: Harald Welte; +Cc: jamal, P, Linux NICS, netdev
On Wed, Sep 15, 2004 at 04:02:22PM +0200, Harald Welte wrote:
> On Wed, Sep 15, 2004 at 08:18:27AM -0400, jamal wrote:
>
> > Our friends in FreeBSD claim they can do 1Mpps _forwarding_ with
> > e1000 - forget about transmit only ;->
>
> IMHO that was a 82547, attached to CSA and not PCI-X...
Still, only one e1000 can be attached to CSA; I don't think
there are any that can do two.
And routing with only a single NIC could be difficult...
-Andi
* Re: TX performance of Intel 82546
From: Robert Olsson @ 2004-09-15 15:33 UTC (permalink / raw)
To: hadi; +Cc: Robert Olsson, P, Harald Welte, Linux NICS, netdev
jamal writes:
> > Yes data from an Opteron @ 1.6 GHz w. e1000 82546EB 64 byte pkts.
> >
> > 133 MHz 830 pps
> > 100 MHz 721 pps
> > 66 MHz 561 pps
Well, the pps should be kpps, but everybody seems to understand this.
> BTW, is this per interface? i thought i have seen numbers in the range
> of 1.3Mpps from you.
Yes, 1.3 Mpps is the aggregated forwarding performance from 2 x 1.6 GHz
Opterons, in a setup where CPU0 handles eth0->eth1 and CPU1 handles
eth2->eth3.
Since a single NIC's TX does not keep up with the packet budget, I had
to use several "flows" to saturate it.
This is a little breakthrough, as for the first time we see some
aggregated performance with packet forwarding and get something in
return for all the multiprocessor efforts.
IMO this is much more important than the last few percent of pps numbers.
But the aggregated performance is only seen with Opterons; my conclusion,
as we discussed, is that memory and the memory controller are local to
the CPU, giving lower latency, and each additional CPU adds another
memory controller. Compare this to systems where many CPUs share the
same controller/memory.
> What Pádraig.posted in regards to the MMRBC register is actually
> enlightening. I kept thinking about it after i sent my last email.
> If indeed the overhead is incured in the setup(all fingers in my test
> setups point fingers at this) then increasing the burst size should show
> improvements.
It's worth testing...
Cheers.
--ro
* Re: TX performance of Intel 82546
From: Robert Olsson @ 2004-09-15 15:41 UTC (permalink / raw)
To: P; +Cc: Robert Olsson, Harald Welte, netdev
P@draigBrady.com writes:
> So I'm guessing that increasing the PCI-X burst size setting
> (MMRBC) will automatically get more packets sent per transfer?
> I said previously in this thread to google for setpci and MMRBC,
> but what I know about it is...
>
> To return the current setting(s):
>
> setpci -d 8086:1010 e6.b
>
> The MMRBC is the upper two bits of the lower nibble, where:
>
> 0 = 512 byte bursts
> 1 = 1024 byte bursts
> 2 = 2048 byte bursts
> 3 = 4096 byte bursts
>
> For me to set 4KiB bursts I do:
>
> setpci -d 8086:1010 e6.b=0e
Thanks!
That should definitely be tested. Four runs w. pktgen should be enough.
--ro
* Re: TX performance of Intel 82546
From: Harald Welte @ 2004-09-15 17:55 UTC (permalink / raw)
To: Robert Olsson; +Cc: P, Linux NICS, netdev
On Wed, Sep 15, 2004 at 02:36:25PM +0200, Robert Olsson wrote:
> Chip should be able to transfer 64 packets in single burst I don't now
> how set/verify this.
Maybe this is what the E1000_TCTL_PBE flag in TCTL and the E1000_TBT
register are for?
I tried switching TCTL_PBE on and played with different values of TBT
(0, 1, 255, 65535) - however, no improvement in pps rate.
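For reference, the change was basically just the following (rough sketch
of what I added to e1000_configure_tx(); whether TBT is even the right
knob for burst sizing is pure guesswork on my part):

	uint32_t tctl;

	tctl = E1000_READ_REG(&adapter->hw, TCTL);
	tctl |= E1000_TCTL_PBE;		/* packet burst enable */
	E1000_WRITE_REG(&adapter->hw, TCTL, tctl);

	/* transmit burst timer - tried 0, 1, 255 and 65535 here */
	E1000_WRITE_REG(&adapter->hw, TBT, 0);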
Maybe someone from Intel could comment on this?
> Cheers.
> --ro
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
* Re: TX performance of Intel 82546
From: Harald Welte @ 2004-09-15 18:15 UTC (permalink / raw)
To: P; +Cc: Robert Olsson, netdev
On Wed, Sep 15, 2004 at 02:59:30PM +0100, P@draigBrady.com wrote:
> Interesting info thanks!
> It would be very interesting to see the performance of PCI express
> which should not have the bus arbitration issues.
Unfortunately there is no e1000 for PCI Express available yet... only
Marvell Yukon and SysKonnect single-port boards so far :(
> Well from the intel docs they say "The devices include a PCI interface
> that maximizes the use of bursts for efficient bus usage.
> The controllers are able to cache up to 64 packet descriptors in
> a single burst for efficient PCI bandwidth usage."
>
> So I'm guessing that increasing the PCI-X burst size setting
> (MMRBC) will automatically get more packets sent per transfer?
> I said previously in this thread to google for setpci and MMRBC,
> but what I know about it is...
Mh, I tried it on my system with the following parameters:
dual 82546GB, PCI-X, 64-bit, 66MHz, UP x86_64 kernel, modified e1000 with
hard-wired TX descriptor refill.
I did not observe any change in TX pps throughput when setting MMRBC to
either 512 or 4096 byte bursts.
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
* Re: TX performance of Intel 82546
From: David S. Miller @ 2004-09-15 18:15 UTC (permalink / raw)
To: Harald Welte; +Cc: P, Robert.Olsson, netdev
On Wed, 15 Sep 2004 20:15:16 +0200
Harald Welte <laforge@netfilter.org> wrote:
> On Wed, Sep 15, 2004 at 02:59:30PM +0100, P@draigBrady.com wrote:
> > Interesting info thanks!
> > It would be very interesting to see the performance of PCI express
> > which should not have the bus arbitration issues.
>
> Unfortunately there is no e1000 for PCI Express available yet... only
> Marvell-Yukon and syskonnect single-port boards so far :(
There are TG3 chips that support PCI-Express. These are
the 5705/5750 variants.
I don't know if actual boards are being sold, or if these
are currently on-board only.