* [PATCH 2.6] e100: use NAPI mode all the time
From: Scott Feldman @ 2004-06-05  0:35 UTC
  To: jgarzik; +Cc: netdev, scott.feldman


I see no reason to keep the non-NAPI option for e100.  This patch removes
the CONFIG_E100_NAPI option and puts the driver in NAPI mode all the time.
Matches the way tg3 works.

Unless someone has a really good reason to keep the non-NAPI mode, this
should go in for 2.6.7.

-scott

----------------

diff -Naurp linux-2.6.7-rc2-bk5/drivers/net/e100.c linux-2.6.7-rc2-bk5.mod/drivers/net/e100.c
--- linux-2.6.7-rc2-bk5/drivers/net/e100.c	2004-06-04 15:58:07.000000000 -0700
+++ linux-2.6.7-rc2-bk5.mod/drivers/net/e100.c	2004-06-04 16:02:04.000000000 -0700
@@ -87,9 +87,8 @@
  *	cb_to_use is the next CB to use for queuing a command; cb_to_clean
  *	is the next CB to check for completion; cb_to_send is the first
  *	CB to start on in case of a previous failure to resume.  CB clean
- *	up happens in interrupt context in response to a CU interrupt, or
- *	in dev->poll in the case where NAPI is enabled.  cbs_avail keeps
- *	track of number of free CB resources available.
+ *	up happens in interrupt context in response to a CU interrupt.
+ *	cbs_avail keeps track of number of free CB resources available.
  *
  * 	Hardware padding of short packets to minimum packet size is
  * 	enabled.  82557 pads with 7Eh, while the later controllers pad
@@ -112,9 +111,8 @@
  *	replacement RFDs cannot be allocated, or the RU goes non-active,
  *	the RU must be restarted.  Frame arrival generates an interrupt,
  *	and Rx indication and re-allocation happen in the same context,
- *	therefore no locking is required.  If NAPI is enabled, this work
- *	happens in dev->poll.  A software-generated interrupt is gen-
- *	erated from the watchdog to recover from a failed allocation
+ *	therefore no locking is required.  A software-generated interrupt
+ *	is generated from the watchdog to recover from a failed allocation
  *	senario where all Rx resources have been indicated and none re-
  *	placed.
  *
@@ -126,8 +124,6 @@
  * 	supported.  Tx Scatter/Gather is not supported.  Jumbo Frames is
  * 	not supported (hardware limitation).
  *
- * 	NAPI support is enabled with CONFIG_E100_NAPI.
- *
  * 	MagicPacket(tm) WoL support is enabled/disabled via ethtool.
  *
  * 	Thanks to JC (jchapman@katalix.com) for helping with
@@ -158,7 +154,7 @@


 #define DRV_NAME		"e100"
-#define DRV_VERSION		"3.0.18"
+#define DRV_VERSION		"3.0.22-NAPI"
 #define DRV_DESCRIPTION		"Intel(R) PRO/100 Network Driver"
 #define DRV_COPYRIGHT		"Copyright(c) 1999-2004 Intel Corporation"
 #define PFX			DRV_NAME ": "
@@ -1463,11 +1459,7 @@ static inline int e100_rx_indicate(struc
 		nic->net_stats.rx_packets++;
 		nic->net_stats.rx_bytes += actual_size;
 		nic->netdev->last_rx = jiffies;
-#ifdef CONFIG_E100_NAPI
 		netif_receive_skb(skb);
-#else
-		netif_rx(skb);
-#endif
 		if(work_done)
 			(*work_done)++;
 	}
@@ -1562,20 +1554,12 @@ static irqreturn_t e100_intr(int irq, vo
 	if(stat_ack & stat_ack_rnr)
 		nic->ru_running = 0;

-#ifdef CONFIG_E100_NAPI
 	e100_disable_irq(nic);
 	netif_rx_schedule(netdev);
-#else
-	if(stat_ack & stat_ack_rx)
-		e100_rx_clean(nic, NULL, 0);
-	if(stat_ack & stat_ack_tx)
-		e100_tx_clean(nic);
-#endif

 	return IRQ_HANDLED;
 }

-#ifdef CONFIG_E100_NAPI
 static int e100_poll(struct net_device *netdev, int *budget)
 {
 	struct nic *nic = netdev_priv(netdev);
@@ -1598,7 +1582,6 @@ static int e100_poll(struct net_device *

 	return 1;
 }
-#endif

 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void e100_netpoll(struct net_device *netdev)
@@ -2137,10 +2120,8 @@ static int __devinit e100_probe(struct p
 	SET_ETHTOOL_OPS(netdev, &e100_ethtool_ops);
 	netdev->tx_timeout = e100_tx_timeout;
 	netdev->watchdog_timeo = E100_WATCHDOG_PERIOD;
-#ifdef CONFIG_E100_NAPI
 	netdev->poll = e100_poll;
 	netdev->weight = E100_NAPI_WEIGHT;
-#endif
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	netdev->poll_controller = e100_netpoll;
 #endif
diff -Naurp linux-2.6.7-rc2-bk5/drivers/net/Kconfig linux-2.6.7-rc2-bk5.mod/drivers/net/Kconfig
--- linux-2.6.7-rc2-bk5/drivers/net/Kconfig	2004-06-04 15:58:26.000000000 -0700
+++ linux-2.6.7-rc2-bk5.mod/drivers/net/Kconfig	2004-06-04 16:02:34.000000000 -0700
@@ -1498,10 +1498,6 @@ config E100
 	  <file:Documentation/networking/net-modules.txt>.  The module
 	  will be called e100.

-config E100_NAPI
-	bool "Use Rx Polling (NAPI)"
-	depends on E100
-
 config LNE390
 	tristate "Mylex EISA LNE390A/B support (EXPERIMENTAL)"
 	depends on NET_PCI && EISA && EXPERIMENTAL
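
For anyone not familiar with the interface this patch commits us to:
dev->poll runs in softirq context and is handed a budget of frames to
process per call.  A conforming poll routine looks roughly like the
sketch below (illustrative only, not e100's actual code; my_rx_clean()
and my_enable_irq() are made-up names):

static int my_poll(struct net_device *netdev, int *budget)
{
	unsigned int work_to_do = min(netdev->quota, *budget);
	unsigned int work_done = 0;

	/* Indicate up to work_to_do received frames to the stack. */
	my_rx_clean(netdev, &work_done, work_to_do);

	*budget -= work_done;
	netdev->quota -= work_done;

	if(work_done < work_to_do) {
		/* Rx ring drained: leave the poll list and
		 * re-enable the device's interrupt. */
		netif_rx_complete(netdev);
		my_enable_irq(netdev);
		return 0;	/* done polling */
	}

	return 1;	/* more work pending; stay on the poll list */
}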


* Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Tim Mattox @ 2004-06-06 22:57 UTC
  To: Scott Feldman; +Cc: netdev, bonding-devel, jgarzik

Scott,
Have you considered how this interacts with multiple e100's bonded
together with Linux channel bonding?
I've CC'd the bonding developer mailing list to flush out any more
opinions on this.

I have yet to set up a good test system, but my impression has been
that NAPI and channel bonding would lead to lots of packet re-ordering
load for the CPU that could outweigh the interrupt load savings.
Does anyone have experience with this?

Also, depending on the setting of /proc/sys/net/ipv4/tcp_reordering,
the TCP stack might issue aggressive NACKs because of false positives
for dropped packets, due to the large reordering that could occur with
NAPI and bonding combined.

In short, unless there has been study on this, I would suggest not yet
removing support for non-NAPI mode on any network driver.

On Jun 4, 2004, at 8:35 PM, Scott Feldman wrote:
> I see no reason to keep the non-NAPI option for e100.  This patch removes
> the CONFIG_E100_NAPI option and puts the driver in NAPI mode all the time.
> Matches the way tg3 works.
>
> Unless someone has a really good reason to keep the non-NAPI mode, this
> should go in for 2.6.7.
>
> -scott
--
Tim Mattox - tmattox@engr.uky.edu - http://homepage.mac.com/tmattox/
     http://aggregate.org/KAOS/ - http://advogato.org/person/tmattox/


* Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Scott Feldman @ 2004-06-07  0:03 UTC
  To: Tim Mattox; +Cc: Scott Feldman, netdev, bonding-devel, jgarzik

> Have you considered how this interacts with multiple e100's bonded
> together with Linux channel bonding?
> I've CC'd the bonding developer mailing list to flush out any more
> opinions on this.

No.  But if there is an issue between NAPI and bonding, that's something
to solve between NAPI and bonding, not in the nic driver.

> I have yet to set up a good test system, but my impression has been
> that NAPI and channel bonding would lead to lots of packet re-ordering
> load for the CPU that could outweigh the interrupt load savings.
> Does anyone have experience with this?

re-ordering or dropped?

> Also, depending on the setting of /proc/sys/net/ipv4/tcp_reordering,
> the TCP stack might issue aggressive NACKs because of false positives
> for dropped packets, due to the large reordering that could occur with
> NAPI and bonding combined.

I guess I don't see the bonding angle.  How does inserting a SW FIFO
between the nic HW and the softirq thread make things better for
bonding?

> In short, unless there has been study on this, I would suggest not yet
> removing support for non-NAPI mode on any network driver.

Fedora Core 2's default is e100-NAPI, so we're getting good test
coverage there without bonding.  tg3 has been NAPI-only for some time,
and I'm sure it's used with bonding.

-scott


* Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Tim Mattox @ 2004-06-07  1:51 UTC
  To: sfeldma; +Cc: netdev, bonding-devel, Scott Feldman, jgarzik

Please excuse the length of this e-mail.

I will attempt to explain the potential problem between NAPI and
bonding with an example below.  And the only reason I say "potential"
is that I have deliberately avoided building clusters with this
configuration and have not seen it "in the wild" personally.
I've read about this problem on the beowulf mailing list, usually
in conjunction with people trying to bond GigE NICs.
I will soon have a cluster that can easily be switched to various
modes on its network, including simple bonding, and I should be able
to directly test this myself in my lab.

The problem is caused by the order in which packets are delivered to the
TCP stack on the receiving machine.  In normal round-robin bonding mode,
the packets are sent out one per NIC in the bond.  For simplicity's
sake, let's say we have two NICs in a bond, eth0 and eth1.  When
sending packets, eth0 will handle all the even packets, and eth1 all
the odd packets.  Similarly when receiving, eth0 would get all
the even packets, and eth1 all the odd packets from a particular
TCP stream.

With NAPI (or other interrupt mitigation techniques) the
receiving machine will process multiple packets in a row from a
single NIC, before getting packets from another NIC.  In the
above example, eth0 would receive packets 0, 2, 4, 6, etc.
and pass them to the TCP layer.  Followed by eth1's
packets 1, 3, 5, 7, etc.  The specific number of out-of-order
packets received in a row would depend on many factors.

The TCP layer would need to reorder the packets from something
like 0, 2, 4, 6, 1, 3, 5, 7, or something like 0, 2, 4, 1, 3, 5, 6, 7,
with many possible variations.
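
To make that interleaving concrete, here is a toy user-space C program
(illustrative only; it assumes the worst case, where one NIC's ring is
drained completely before the other NIC is serviced):

#include <stdio.h>

#define NPKTS 8		/* packets in the TCP stream */
#define NNICS 2		/* NICs in the bond */

int main(void)
{
	int ring[NNICS][NPKTS], count[NNICS] = {0, 0};
	int i, nic;

	/* Round-robin transmit: packet i goes out on NIC i % NNICS. */
	for(i = 0; i < NPKTS; i++)
		ring[i % NNICS][count[i % NNICS]++] = i;

	/* Polled receive: drain one NIC completely before moving on. */
	printf("delivery order:");
	for(nic = 0; nic < NNICS; nic++)
		for(i = 0; i < count[nic]; i++)
			printf(" %d", ring[nic][i]);
	printf("\n");	/* prints: delivery order: 0 2 4 6 1 3 5 7 */
	return 0;
}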

Before NAPI (and hardware interrupt mitigation schemes), bonding
would work without causing this re-ordering, since each packet
would arrive and be enqueued to the TCP stack in the order of
arrival, which in a well designed network would match the
transmission order.  Sure, if your network randomly delayed packets
then things would get out of order, but in the HPC community which
uses bonding, the two network paths would normally be made
identical, and possibly with only a single switch between source
and destination NICs.  If there were congestion delays in one path and
not in another, then the HPC network/program had more serious problems.

I don't want to slow the progress of Linux networking development.
I was objecting to the removal of a feature from e100 that already has
working code and that was, AFAIK, necessary for the performance
enhancement of bonding.

If the overhead of re-ordering the packets is not significant, and
if simply increasing the value of /proc/sys/net/ipv4/tcp_reordering
will allow TCP to "chill" and not send negative ACKs when it sees
packets this much out of order, then sure, remove the non-NAPI support.
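
Bumping that parameter is just a write to procfs.  A minimal sketch
(needs root; the value 16 is an arbitrary example, not a recommendation
-- the stock default is 3):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/tcp_reordering", "w");

	if(!f) {
		perror("tcp_reordering");
		return 1;
	}
	/* Let TCP tolerate deeper reordering before it treats
	 * out-of-order delivery as packet loss. */
	fprintf(f, "16\n");
	fclose(f);
	return 0;
}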

I will attempt to re-locate the specific examples discussed on the
beowulf mailing list, but I don't have those URLs handy.

On Jun 6, 2004, at 8:03 PM, Scott Feldman wrote:
>> Have you considered how this interacts with multiple e100's bonded
>> together with Linux channel bonding?
>> I've CC'd the bonding developer mailing list to flush out any more
>> opinions on this.
>
> No.  But if there is an issue between NAPI and bonding, that's something
> to solve between NAPI and bonding, not in the nic driver.

More bonding code may yet be needed in the receive path to deal with
this re-ordering problem.  Or possibly a configuration option to NAPI
that works across various NIC drivers.  But I hope not.
Do any bonding developers have ideas on how to mitigate this problem?

>> I have yet to set up a good test system, but my impression has been
>> that NAPI and channel bonding would lead to lots of packet re-ordering
>> load for the CPU that could outweigh the interrupt load savings.
>> Does anyone have experience with this?
>
> re-ordering or dropped?

This re-ordering problem will show up without any actual packet loss.

>> Also, depending on the setting of /proc/sys/net/ipv4/tcp_reordering,
>> the TCP stack might issue aggressive NACKs because of false positives
>> for dropped packets, due to the large reordering that could occur with
>> NAPI and bonding combined.
>
> I guess I don't see the bonding angle.  How does inserting a SW FIFO
> between the nic HW and the softirq thread make things better for
> bonding?

I'm not sure I understand your question.  The tcp_reordering parameter
is supposed to control the number of out-of-order packets the receiving
TCP stack sees before issuing pre-emptive negative ACKs to the sender
(to avoid waiting for the TCP resend timer to expire).  This is an
optimization that works well in most situations, where packet re-ordering
is a strong indication of a dropped packet.  Such extra NACKs, and the
resulting unnecessary retransmits, would be quite detrimental to
performance in a bonded network setup that is not actually dropping
packets.

>> In short, unless there has been study on this, I would suggest not yet
>> removing support for non-NAPI mode on any network driver.
>
> Fedora Core 2's default is e100-NAPI, so we're getting good test
> coverage there without bonding.  tg3 has been NAPI-only for some time,
> and I'm sure it's used with bonding.
>
> -scott

I have NO problems with NAPI itself; I think it's a wonderful development.
I would even advocate for making NAPI the default across the board.
But for bonding, until I see otherwise, I want to be able to not use NAPI.
As I indicated, I will have a new cluster on which I can directly test
this NAPI vs bonding issue very soon.
--
Tim Mattox - tmattox@engr.uky.edu - http://homepage.mac.com/tmattox/
     http://aggregate.org/KAOS/ - http://advogato.org/person/tmattox/


* Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Jeff Garzik @ 2004-06-07  2:33 UTC
  To: Tim Mattox; +Cc: sfeldma, netdev, bonding-devel, Scott Feldman

Tim Mattox wrote:
> The problem is caused by the order in which packets are delivered to the
> TCP stack on the receiving machine.  In normal round-robin bonding mode,
> the packets are sent out one per NIC in the bond.  For simplicity's
> sake, let's say we have two NICs in a bond, eth0 and eth1.  When
> sending packets, eth0 will handle all the even packets, and eth1 all
> the odd packets.  Similarly when receiving, eth0 would get all
> the even packets, and eth1 all the odd packets from a particular
> TCP stream.
> 
> With NAPI (or other interrupt mitigation techniques) the
> receiving machine will process multiple packets in a row from a
> single NIC, before getting packets from another NIC.  In the
> above example, eth0 would receive packets 0, 2, 4, 6, etc.
> and pass them to the TCP layer.  Followed by eth1's
> packets 1, 3, 5, 7, etc.  The specific number of out-of-order
> packets received in a row would depend on many factors.
> 
> The TCP layer would need to reorder the packets from something
> like 0, 2, 4, 6, 1, 3, 5, 7, or something like 0, 2, 4, 1, 3, 5, 6, 7,
> with many possible variations.

Ethernet drivers have _always_ processed multiple packets per interrupt, 
since before the days of NAPI, and before the days of hardware mitigation.

Therefore, this is mainly an argument against using overly simplistic 
load balancing schemes that _create_ this problem :)  It's much smarter 
to load balance based on flows, for example.  I think the ALB mode does 
this?
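
Flow-based selection is essentially a hash over the flow tuple, so every
packet of a given connection stays on one slave.  A rough sketch of the
idea (the field names are illustrative; bonding's real transmit-hash
policies differ in detail):

/* Pick a slave NIC for a flow; the same tuple always maps to
 * the same slave, so packets of one connection stay in order. */
static unsigned int pick_slave(unsigned int saddr, unsigned int daddr,
			       unsigned short sport, unsigned short dport,
			       unsigned int nslaves)
{
	unsigned int hash;

	hash = saddr ^ daddr ^ ((unsigned int)sport << 16 | dport);
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash % nslaves;
}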

You appear to be making the incorrect assumption that packets sent in
this simplistic, round-robin manner could ever _hope_ to arrive in-order
at the destination.  Any number of things serve to gather packets into
bursts:  net stack TX queue, hardware DMA ring, hardware FIFO, remote
h/w FIFO, remote hardware DMA ring, remote softirq.


> I don't want to slow the progress of Linux networking development.
> I was objecting to the removal of a feature to e100 that already has
> working code and that was, AFAIK, necessary for the performance
> enhancement of bonding.

No, just don't use a bonding mode that kills performance.  It has 
nothing to do with NAPI.

As I said, ethernet drivers have been processing runs of packets per irq 
/ softirq for ages and ages.  This isn't new with NAPI, to be sure.


> I have NO problems with NAPI itself; I think it's a wonderful development.
> I would even advocate for making NAPI the default across the board.
> But for bonding, until I see otherwise, I want to be able to not use NAPI.
> As I indicated, I will have a new cluster on which I can directly test
> this NAPI vs bonding issue very soon.

As Scott indicated, people use bonding with tg3 (unconditional NAPI) all
the time.

Further, I hope you're not doing something silly like trying to load 
balance on the _same_ ethernet.  If you are, that's a signal that deeper 
problems exist -- you should be able to do wire speed with one NIC.

	Jeff


* Re: [Bonding-devel] Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Jay Vosburgh @ 2004-06-07  6:39 UTC
  To: Jeff Garzik; +Cc: Tim Mattox, sfeldma, netdev, bonding-devel, Scott Feldman

Jeff Garzik <jgarzik@pobox.com> wrote:

>Tim Mattox wrote:
>> The problem is caused by the order in which packets are delivered to the
>> TCP stack on the receiving machine.  In normal round-robin bonding mode,
>> the packets are sent out one per NIC in the bond.  For simplicity's
>> sake, let's say we have two NICs in a bond, eth0 and eth1.  When
>> sending packets, eth0 will handle all the even packets, and eth1 all
>> the odd packets.  Similarly when receiving, eth0 would get all
>> the even packets, and eth1 all the odd packets from a particular
>> TCP stream.

>Ethernet drivers have _always_ processed multiple packets per
>interrupt, since before the days of NAPI, and before the days of
>hardware mitigation.

	There was a discussion about this behavior (round-robin mode out
of order delivery) on bonding-devel in February 2003.  The archives can
be found here:

http://sourceforge.net/mailarchive/forum.php?forum_id=2094&max_rows=25&style=ultimate&viewmonth=200302

	The messages on Feb 19 relate to the effects of packet
coalescing, and Feb 17 to general out of order delivery problems.
Somewhere in there are the results of some testing I did, and analysis
of how tcp_reordering affects things.  As I recall, I even used e100s for
my testing, so it may be a fair apples-to-apples comparison.

	When I tested this (on 4 100Mbps ethernets), even after
adjusting tcp_reordering I could only get TCP single stream throughput
of about 235 Mb/sec out of a theoretical 375 or so (400 minus about 6%
for headers and whatnot).  UDP would run in the mid to upper 300's,
depending upon datagram size.  The tests did not examine UDP delivery
order.

	The round-robin mode will, for all practical purposes, always
deliver some large percentage of packets out of order.  You can fiddle
with the tcp_reordering parameter to mitigate the effects to some
degree, but there's no way it's going away entirely.

	I'm curious as to what types of systems the beowulf / HPC people
(mentioned by Tim in an earlier message) are using that they don't see
out of order problems with round robin, even without NAPI.

>Therefore, this is mainly an argument against using overly simplistic
>load balancing schemes that _create_ this problem :)  It's much
>smarter to load balance based on flows, for example.  I think the ALB
>mode does this?

	The round robin mode is unique in that it is the only mode that
will attempt (however stupidly) to stripe single connections (flows)
across multiple interfaces.  The other (smarter) modes, 802.3ad, alb,
and tlb, will try to keep particular connections generally on a
particular interface (for 802.3ad, it's required by the standard to
behave that way).  This means that a given single TCP/IP connection
won't get more than one interface worth of throughput.  With
round-robin, you can get more than one interface worth, but not very
efficiently.

>> I have NO problems with NAPI itself; I think it's a wonderful development.
>> I would even advocate for making NAPI the default across the board.
>> But for bonding, until I see otherwise, I want to be able to not use NAPI.
>> As I indicated, I will have a new cluster on which I can directly test
>> this NAPI vs bonding issue very soon.

	After taking into account the effects of delivering multiple
packets per interrupt and the scheduling order of network device
interrupts (potentially on different CPUs), I'm not really sure there's
much room for NAPI to make round-robin any worse than it already is.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [Bonding-devel] Re: [PATCH 2.6] e100: use NAPI mode all the time
From: jamal @ 2004-06-07 11:17 UTC
  To: Jay Vosburgh
  Cc: Jeff Garzik, Tim Mattox, sfeldma, netdev, bonding-devel,
	Scott Feldman

Hi, 

I don't have time to go through all of that thread, but let's understand
your problem and setup.  Let's start with the setup:
You have 4 ethx ports on PC1 x-connected to 4 on PC2.  You have bonding
on PC1 but not on PC2.  You have NAPI on both PC1 and PC2.  Are any of
them multiprocessor?

Let's pin down the setup; then we can continue the discussion.

cheers,
jamal


* Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Christopher Chan @ 2004-06-08  9:53 UTC
  To: Scott Feldman; +Cc: jgarzik, netdev

Scott Feldman wrote:
> I see no reason to keep the non-NAPI option for e100.  This patch removes
> the CONFIG_E100_NAPI option and puts the driver in NAPI mode all the time.
> Matches the way tg3 works.
> 
> Unless someone has a really good reason to keep the non-NAPI mode, this
> should go in for 2.6.7.

I for one need to test 2.6.6 e100 with NAPI on. Under 2.6.3/4 I had 
problems with NAPI mode turned on. Turning NAPI off and then also doing

net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.route.gc_thresh = 65536
net.ipv4.route.max_size = 1048576

was the only way to keep the machines I run available via the network.

I would get dst cache overflows, and sometimes the kernel would log
garbled messages; when that happened, the box required a reboot.


* Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Jeff Garzik @ 2004-06-11  0:16 UTC
  To: Scott Feldman; +Cc: netdev

applied to netdev-2.6 queue (and thus Andrew's -mm tree automatically).

We'll let it stew in there for a while and get testing feedback.

Your 3 recently-sent bugfixes will go straight upstream, of course.

	Jeff


* Re: [PATCH 2.6] e100: use NAPI mode all the time
From: Christopher Chan @ 2004-06-15 18:04 UTC
  To: Christopher Chan; +Cc: Scott Feldman, jgarzik, netdev

Christopher Chan wrote:
> Scott Feldman wrote:
> 
>> I see no reason to keep the non-NAPI option for e100.  This patch removes
>> the CONFIG_E100_NAPI option and puts the driver in NAPI mode all the time.
>> Matches the way tg3 works.
>>
>> Unless someone has a really good reason to keep the non-NAPI mode, this
>> should go in for 2.6.7.
> 
> 
> I for one need to test 2.6.6 e100 with NAPI on. Under 2.6.3/4 I had 
> problems with NAPI mode turned on. Turning NAPI off and then also doing
> 
> net.ipv4.tcp_max_syn_backlog = 2048
> net.ipv4.route.gc_thresh = 65536
> net.ipv4.route.max_size = 1048576
> 
> was the only way to keep the machines I run available via the network.
> 
> I would get dst cache overflows, and sometimes the kernel would log
> garbled messages; when that happened, the box required a reboot.
> 

KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | 
MSG_TRUNC))) failed at net/ipv4/tcp.c (1632)
KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1568)
KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | 
MSG_TRUNC))) failed at net/ipv4/tcp.c (1632)
KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1568)
printk: 4253 messages suppressed.
dst cache overflow
KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | 
MSG_TRUNC))) failed at net/ipv4/tcp.c (1632)
KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1568)
KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | 
MSG_TRUNC))) failed at net/ipv4/tcp.c (1632)
KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1568)
KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | 
MSG_TRUNC))) failed at net/ipv4/tcp.c (1632)
KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1568)
KERNEL: assertion (tp->copied_seq == tp->rcv_nxt || (flags & (MSG_PEEK | 
MSG_TRUNC))) failed at net/ipv4/tcp.c (1632)

I now get loads of these messages on the only box that has NAPI enabled
in the e100 driver.

This is on a 2.6.6 kernel.

