* RFC: NAPI packet weighting patch
@ 2005-05-26 21:36 Mitch Williams
2005-05-27 8:21 ` Robert Olsson
` (2 more replies)
0 siblings, 3 replies; 121+ messages in thread
From: Mitch Williams @ 2005-05-26 21:36 UTC (permalink / raw)
To: netdev; +Cc: john.ronciak, ganesh.venkatesan, jesse.brandeburg
The following patch (which applies to 2.6.12-rc4) adds a new sysctl
parameter called 'netdev_packet_weight'. This parameter controls how many
backlog work units each RX packet is worth.
With the parameter set to 0 (the default), NAPI polling works exactly as
it does today: each packet is worth one backlog work unit, and the
maximum number of received packets that will be processed in any given
softirq is controlled by the 'netdev_max_backlog' parameter.
By setting the netdev_packet_weight to a nonzero value, we make each
packet worth more than one backlog work unit. Since it's a shift value, a
setting of 1 makes each packet worth 2 work units, a setting of 2 makes
each packet worth 4 units, etc. Under normal circumstances you would
never use a value higher than 3, though 4 might work for Gigabit and 10
Gigabit networks.
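To make the arithmetic concrete: the patch below derives each device's
per-poll quota from dev->weight >> netdev_packet_weight, and charges each
processed packet 1 << netdev_packet_weight units against the softirq budget.
Purely as an example, with a typical driver weight of 64 and
netdev_packet_weight set to 2:

int quota = 64 >> 2;    /* each ->poll() may now clean at most 16 packets  */
int work  = 10;         /* suppose a poll actually returned 10 packets     */
int cost  = work << 2;  /* those 10 packets consume 40 budget units, so    */
                        /* the netdev_max_backlog budget drains 4x as fast */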
By increasing the packet weight, we accomplish two things: first, we
cause the individual NAPI RX loops in each driver to process fewer
packets. This means that they will free up RX resources to the hardware
more often, which reduces the possibility of dropped packets. Second, it
shortens the total time spent in the NAPI softirq, which can free the CPU
to handle other tasks more often, thus reducing overall latency.
Performance tests in our lab have shown that tweaking this parameter,
along with the netdev_max_backlog parameter, can provide a significant
performance increase -- greater than 100 Mbps of improvement -- over default
settings. I tested with both e1000 and tg3 drivers and saw improvement in
both cases. I did not see higher CPU utilization, even with the increased
throughput.
The caveat, of course, is that different systems and network
configurations require different settings. On the other hand, that's
really no different than what we see with the max_backlog parameter today.
On some systems neither parameter makes any difference.
Still, we feel that there is value to having this in the kernel. Please
test and comment as you have time available.
Thanks!
-Mitch Williams
mitch.a.williams@intel.com
diff -urpN -x dontdiff rc4-clean/Documentation/filesystems/proc.txt linux-2.6.12-rc4/Documentation/filesystems/proc.txt
--- rc4-clean/Documentation/filesystems/proc.txt 2005-05-18 16:35:43.000000000 -0700
+++ linux-2.6.12-rc4/Documentation/filesystems/proc.txt 2005-05-19 11:16:10.000000000 -0700
@@ -1378,7 +1378,13 @@ netdev_max_backlog
------------------
Maximum number of packets, queued on the INPUT side, when the interface
-receives packets faster than kernel can process them.
+receives packets faster than kernel can process them. This is also the
+maximum number of packets handled in a single softirq under NAPI.
+
+netdev_packet_weight
+--------------------
+The value, in netdev_max_backlog unit, of each received packet. This is a
+shift value, and should be set no higher than 3.
optmem_max
----------
diff -urpN -x dontdiff rc4-clean/include/linux/sysctl.h linux-2.6.12-rc4/include/linux/sysctl.h
--- rc4-clean/include/linux/sysctl.h 2005-05-18 16:36:06.000000000 -0700
+++ linux-2.6.12-rc4/include/linux/sysctl.h 2005-05-18 16:44:07.000000000 -0700
@@ -242,6 +242,7 @@ enum
NET_CORE_MOD_CONG=16,
NET_CORE_DEV_WEIGHT=17,
NET_CORE_SOMAXCONN=18,
+ NET_CORE_PACKET_WEIGHT=19,
};
/* /proc/sys/net/ethernet */
diff -urpN -x dontdiff rc4-clean/net/core/dev.c linux-2.6.12-rc4/net/core/dev.c
--- rc4-clean/net/core/dev.c 2005-05-18 16:36:07.000000000 -0700
+++ linux-2.6.12-rc4/net/core/dev.c 2005-05-19 11:16:57.000000000 -0700
@@ -1352,6 +1352,7 @@ out:
=======================================================================*/
int netdev_max_backlog = 300;
+int netdev_packet_weight = 0; /* each packet is worth 1 backlog unit */
int weight_p = 64; /* old backlog weight */
/* These numbers are selected based on intuition and some
* experimentatiom, if you have more scientific way of doing this
@@ -1778,6 +1779,7 @@ static void net_rx_action(struct softirq
struct softnet_data *queue = &__get_cpu_var(softnet_data);
unsigned long start_time = jiffies;
int budget = netdev_max_backlog;
+ int budget_temp;
local_irq_disable();
@@ -1793,21 +1795,22 @@ static void net_rx_action(struct softirq
dev = list_entry(queue->poll_list.next,
struct net_device, poll_list);
netpoll_poll_lock(dev);
-
- if (dev->quota <= 0 || dev->poll(dev, &budget)) {
+ budget_temp = budget;
+ if (dev->quota <= 0 || dev->poll(dev, &budget_temp)) {
netpoll_poll_unlock(dev);
local_irq_disable();
list_del(&dev->poll_list);
list_add_tail(&dev->poll_list, &queue->poll_list);
if (dev->quota < 0)
- dev->quota += dev->weight;
+ dev->quota += dev->weight >> netdev_packet_weight;
else
- dev->quota = dev->weight;
+ dev->quota = dev->weight >> netdev_packet_weight;
} else {
netpoll_poll_unlock(dev);
dev_put(dev);
local_irq_disable();
}
+ budget -= (budget - budget_temp) << netdev_packet_weight;
}
out:
local_irq_enable();
diff -urpN -x dontdiff rc4-clean/net/core/sysctl_net_core.c linux-2.6.12-rc4/net/core/sysctl_net_core.c
--- rc4-clean/net/core/sysctl_net_core.c 2005-03-01 23:38:03.000000000 -0800
+++ linux-2.6.12-rc4/net/core/sysctl_net_core.c 2005-05-18 16:44:09.000000000 -0700
@@ -13,6 +13,7 @@
#ifdef CONFIG_SYSCTL
extern int netdev_max_backlog;
+extern int netdev_packet_weight;
extern int weight_p;
extern int no_cong_thresh;
extern int no_cong;
@@ -91,6 +92,14 @@ ctl_table core_table[] = {
.proc_handler = &proc_dointvec
},
{
+ .ctl_name = NET_CORE_PACKET_WEIGHT,
+ .procname = "netdev_packet_weight",
+ .data = &netdev_packet_weight,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
+ {
.ctl_name = NET_CORE_MAX_BACKLOG,
.procname = "netdev_max_backlog",
.data = &netdev_max_backlog,
* RFC: NAPI packet weighting patch
2005-05-26 21:36 Mitch Williams
@ 2005-05-27 8:21 ` Robert Olsson
2005-05-27 11:18 ` jamal
2005-05-27 15:50 ` Stephen Hemminger
2 siblings, 0 replies; 121+ messages in thread
From: Robert Olsson @ 2005-05-27 8:21 UTC (permalink / raw)
To: Mitch Williams; +Cc: netdev, john.ronciak, ganesh.venkatesan, jesse.brandeburg
Hello!
Some comments below.
Mitch Williams writes:
> With the parameter set to 0 (the default), NAPI polling works exactly as
> it does today: each packet is worth one backlog work unit, and the
> maximum number of received packets that will be processed in any given
> softirq is controlled by the 'netdev_max_backlog' parameter.
You should be able to accomplish this on a per-device basis with dev->weight.
> By increasing the packet weight, we accomplish two things: first, we
> cause the individual NAPI RX loops in each driver to process fewer
> packets. This means that they will free up RX resources to the hardware
> more often, which reduces the possibility of dropped packets.
It's kind of an interesting and complex area, as weight setting should consider
interrupt coalescing etc. as we try to find an acceptable balance of interrupts,
polls, and packets per poll. Again, to me this indicates that this should be
done at the driver level.
Do you have more details about the cases you were able to improve, and what
your thinking was here? It's kind of an unresearched area.
> Second, it shortens the total time spent in the NAPI softirq, which can
> free the CPU to handle other tasks more often, thus reducing overall latency.
At high packet load from several devices we still only break the RX softirq
when exhausting the total budget or a jiffy. Generally the RX softirq is very
well-behaved due to this.
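Roughly, paraphrasing net_rx_action() from memory (exact code may differ a
bit), the only break conditions are:

while (!list_empty(&queue->poll_list)) {
        if (budget <= 0 || jiffies - start_time > 1)
                goto softnet_break;
        /* ... otherwise keep calling ->poll() on the next device ... */
}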
Cheers.
--ro
* Re: RFC: NAPI packet weighting patch
2005-05-26 21:36 Mitch Williams
2005-05-27 8:21 ` Robert Olsson
@ 2005-05-27 11:18 ` jamal
2005-05-27 15:50 ` Stephen Hemminger
2 siblings, 0 replies; 121+ messages in thread
From: jamal @ 2005-05-27 11:18 UTC (permalink / raw)
To: Mitch Williams; +Cc: netdev, john.ronciak, ganesh.venkatesan, jesse.brandeburg
On Thu, 2005-26-05 at 14:36 -0700, Mitch Williams wrote:
> The following patch (which applies to 2.6.12rc4) adds a new sysctl
> parameter called 'netdev_packet_weight'. This parameter controls how many
> backlog work units each RX packet is worth.
>
> With the parameter set to 0 (the default), NAPI polling works exactly as
> it does today: each packet is worth one backlog work unit, and the
> maximum number of received packets that will be processed in any given
> softirq is controlled by the 'netdev_max_backlog' parameter.
>
NAPI already uses a weighted round-robin scheduling scheme known as DRR.
I am not sure that providing a scale factor on top of the weight enhances
anything.
Did you try just reducing the weight to make it smaller instead? I.e., take
the resultant weight after your shift and set that directly as the weight.
cheers,
jamal
* Re: RFC: NAPI packet weighting patch
2005-05-26 21:36 Mitch Williams
2005-05-27 8:21 ` Robert Olsson
2005-05-27 11:18 ` jamal
@ 2005-05-27 15:50 ` Stephen Hemminger
2005-05-27 20:27 ` Mitch Williams
2 siblings, 1 reply; 121+ messages in thread
From: Stephen Hemminger @ 2005-05-27 15:50 UTC (permalink / raw)
To: Mitch Williams; +Cc: netdev, john.ronciak, ganesh.venkatesan, jesse.brandeburg
On Thu, 26 May 2005 14:36:22 -0700
Mitch Williams <mitch.a.williams@intel.com> wrote:
> The following patch (which applies to 2.6.12rc4) adds a new sysctl
> parameter called 'netdev_packet_weight'. This parameter controls how many
> backlog work units each RX packet is worth.
>
> With the parameter set to 0 (the default), NAPI polling works exactly as
> it does today: each packet is worth one backlog work unit, and the
> maximum number of received packets that will be processed in any given
> softirq is controlled by the 'netdev_max_backlog' parameter.
>
> By setting the netdev_packet_weight to a nonzero value, we make each
> packet worth more than one backlog work unit. Since it's a shift value, a
> setting of 1 makes each packet worth 2 work units, a setting of 2 makes
> each packet worth 4 units, etc. Under normal circumstances you would
> never use a value higher than 3, though 4 might work for Gigabit and 10
> Gigabit networks.
>
> By increasing the packet weight, we accomplish two things: first, we
> cause the individual NAPI RX loops in each driver to process fewer
> packets. This means that they will free up RX resources to the hardware
> more often, which reduces the possibility of dropped packets. Second, it
> shortens the total time spent in the NAPI softirq, which can free the CPU
> to handle other tasks more often, thus reducing overall latency.
Rather than weighting each packet differently, why not just reduce the upper
bound on the number of packets? There are several patches in my 2.6.12-rc5-tcp3
tree that make this easier:
----
http://developer.osdl.org/shemminger/patches/2.6.12-rc5-tcp3/patches/bigger-backlog.patch
Separate out the two uses of netdev_max_backlog. One controls the upper
bound on packets processed per softirq; the new name for this is
netdev_max_weight. The other controls the limit on packets queued via netif_rx.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Index: linux-2.6.12-rc4-tcp2/net/core/sysctl_net_core.c
===================================================================
--- linux-2.6.12-rc4-tcp2.orig/net/core/sysctl_net_core.c
+++ linux-2.6.12-rc4-tcp2/net/core/sysctl_net_core.c
@@ -13,6 +13,7 @@
#ifdef CONFIG_SYSCTL
extern int netdev_max_backlog;
+extern int netdev_max_weight;
extern int weight_p;
extern int net_msg_cost;
extern int net_msg_burst;
@@ -137,6 +138,14 @@ ctl_table core_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = NET_CORE_MAX_WEIGHT,
+ .procname = "netdev_max_weight",
+ .data = &netdev_max_weight,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ .ctl_name = 0 }
};
Index: linux-2.6.12-rc4-tcp2/net/core/dev.c
===================================================================
--- linux-2.6.12-rc4-tcp2.orig/net/core/dev.c
+++ linux-2.6.12-rc4-tcp2/net/core/dev.c
@@ -1334,7 +1334,8 @@ out:
Receiver routines
=======================================================================*/
-int netdev_max_backlog = 300;
+int netdev_max_backlog = 10000;
+int netdev_max_weight = 500;
int weight_p = 64; /* old backlog weight */
DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
@@ -1682,8 +1683,7 @@ static void net_rx_action(struct softirq
{
struct softnet_data *queue = &__get_cpu_var(softnet_data);
unsigned long start_time = jiffies;
- int budget = netdev_max_backlog;
-
+ int budget = netdev_max_weight;
local_irq_disable();
Index: linux-2.6.12-rc4-tcp2/include/linux/sysctl.h
===================================================================
--- linux-2.6.12-rc4-tcp2.orig/include/linux/sysctl.h
+++ linux-2.6.12-rc4-tcp2/include/linux/sysctl.h
@@ -242,6 +242,7 @@ enum
NET_CORE_MOD_CONG=16,
NET_CORE_DEV_WEIGHT=17,
NET_CORE_SOMAXCONN=18,
+ NET_CORE_MAX_WEIGHT=19,
};
/* /proc/sys/net/ethernet */
----
http://developer.osdl.org/shemminger/patches/2.6.12-rc5-tcp3/patches/fix-weightp.patch
Changing the dev_weight sysctl parameter has no effect because the weight
of the backlog devices is set during initialization and never changed.
Fix this by propagating changes.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Index: linux-2.6.12-rc4-tcp2/net/core/dev.c
===================================================================
--- linux-2.6.12-rc4-tcp2.orig/net/core/dev.c
+++ linux-2.6.12-rc4-tcp2/net/core/dev.c
@@ -1636,6 +1636,7 @@ static int process_backlog(struct net_de
struct softnet_data *queue = &__get_cpu_var(softnet_data);
unsigned long start_time = jiffies;
+ backlog_dev->weight = weight_p;
for (;;) {
struct sk_buff *skb;
struct net_device *dev;
* Re: RFC: NAPI packet weighting patch
2005-05-27 15:50 ` Stephen Hemminger
@ 2005-05-27 20:27 ` Mitch Williams
2005-05-27 21:01 ` Stephen Hemminger
0 siblings, 1 reply; 121+ messages in thread
From: Mitch Williams @ 2005-05-27 20:27 UTC (permalink / raw)
To: netdev, Stephen Hemminger, hadi, Robert.Olsson
Cc: Ronciak, John, Venkatesan, Ganesh, Brandeburg, Jesse
Stephen, Robert, and Jamal all replied to my original message, and all
said approximately the same thing: "Why don't you just reduce the weight
in the driver? It does the same thing."
To which I reply, respectfully, I know that. And no it doesn't, not
exactly.
My primary reason for adding this setting is to allow for runtime tweaking
-- just like max_backlog has right now. Driver weight is a compile-time
setting, and has to be changed for every driver that you run.
This setting allows you to scale the weight of all your drivers, at
runtime, in one place. It's complementary to Stephen's max_weight idea --
his patch affects how long you spend in any individual softirq; my patch
affects how long you spend in any driver's individual NAPI poll routine,
as well as how long the softirq lasts.
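As a rough worked example (using the default netdev_max_backlog of 300, a
driver weight of 64, and netdev_packet_weight = 2 purely for illustration):

/* per-poll cap    : 64 >> 2 = 16 packets before a driver must yield      */
/* per-softirq cap : each packet costs 1 << 2 = 4 budget units, so the    */
/*                   300-unit budget allows at most ~75 packets in total  */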
Perhaps we can merge the two patches to come up with some Ultimate
Tweakable Network Goodness. I'd be happy to do that (on Tuesday; I'm
heading home early today) if anybody's interested.
-Mitch
NB:
I've got a white paper that I wrote up for internal consumption. I plan
to post it to our Sourceforge archive, but I need to do a bunch of
scrubbing first, lest our lawyers go into convulsions.
Meanwhile, I can give away the ending: for my performance tests, on a
pure Gigabit network, I saw consistently better performance by a) using my
patch, b) reducing max_backlog, and c) using a nonzero value for
packet_weight.
* Re: RFC: NAPI packet weighting patch
2005-05-27 20:27 ` Mitch Williams
@ 2005-05-27 21:01 ` Stephen Hemminger
2005-05-28 0:56 ` jamal
0 siblings, 1 reply; 121+ messages in thread
From: Stephen Hemminger @ 2005-05-27 21:01 UTC (permalink / raw)
To: Mitch Williams
Cc: netdev, hadi, Robert.Olsson, Ronciak, John, Venkatesan, Ganesh,
Brandeburg, Jesse
On Fri, 27 May 2005 13:27:04 -0700
Mitch Williams <mitch.a.williams@intel.com> wrote:
>
> Stephen, Robert, and Jamal all replied to my original message, and all
> said approximately the same thing: "Why don't you just reduce the weight
> in the driver? It does the same thing."
>
> To which I reply, respectfully, I know that. And no it doesn't, not
> exactly.
>
> My primary reason for adding this setting is to allow for runtime tweaking
> -- just like max_backlog has right now. Driver weight is a compile-time
> setting, and has to be changed for every driver that you run.
>
> This setting allows you to scale the weight of all your drivers, at
> runtime, in one place. It's complimentary to Stephen's max_weight idea --
> his patch affects how long you spend in any individual softirq; my patch
> affects how long you spend in any driver's individual NAPI poll routine,
> as well as how long the softirq lasts.
>
Why not just allow adjusting dev->weight via sysfs?
* Re: RFC: NAPI packet weighting patch
2005-05-27 21:01 ` Stephen Hemminger
@ 2005-05-28 0:56 ` jamal
2005-05-31 17:35 ` Mitch Williams
0 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-05-28 0:56 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Mitch Williams, netdev, Robert.Olsson, Ronciak, John,
Venkatesan, Ganesh, Brandeburg, Jesse
On Fri, 2005-27-05 at 14:01 -0700, Stephen Hemminger wrote:
>
> Why not just allow adjusting dev->weight via sysfs?
I think that should be good enough - and i thought your patch already
did this.
Adding a shift to the weight in a _weighted_ RR algorithm does sound
odd.
cheers,
jamal
* Re: RFC: NAPI packet weighting patch
2005-05-28 0:56 ` jamal
@ 2005-05-31 17:35 ` Mitch Williams
2005-05-31 17:40 ` Stephen Hemminger
2005-05-31 22:07 ` Jon Mason
0 siblings, 2 replies; 121+ messages in thread
From: Mitch Williams @ 2005-05-31 17:35 UTC (permalink / raw)
To: jamal
Cc: Stephen Hemminger, Williams, Mitch A, netdev, Robert.Olsson,
Ronciak, John, Venkatesan, Ganesh, Brandeburg, Jesse
On Fri, 27 May 2005, jamal wrote:
>
> On Fri, 2005-27-05 at 14:01 -0700, Stephen Hemminger wrote:
>
> >
> > Why not just allow adjusting dev->weight via sysfs?
>
> I think that should be good enough - and i thought your patch already
> did this.
> Adding a shift to the weight in a _weighted_ RR algorithm does sound
> odd.
>
Stephen's patch only affects the weight for the backlog device. Exporting
dev->weight to sysfs would allow the weight to be set for any network
device, which makes perfect sense.
I'll work on getting this done and verifying performance this week.
Expect a patch in a few days.
Thanks, guys.
-Mitch
* Re: RFC: NAPI packet weighting patch
2005-05-31 17:35 ` Mitch Williams
@ 2005-05-31 17:40 ` Stephen Hemminger
2005-05-31 17:43 ` Mitch Williams
2005-05-31 22:07 ` Jon Mason
1 sibling, 1 reply; 121+ messages in thread
From: Stephen Hemminger @ 2005-05-31 17:40 UTC (permalink / raw)
To: Mitch Williams
Cc: jamal, Williams, Mitch A, netdev, Robert.Olsson, Ronciak, John,
Venkatesan, Ganesh, Brandeburg, Jesse
Like this (untested) patch:
Index: napi-sysfs/net/core/net-sysfs.c
===================================================================
--- napi-sysfs.orig/net/core/net-sysfs.c
+++ napi-sysfs/net/core/net-sysfs.c
@@ -184,6 +184,22 @@ static ssize_t store_tx_queue_len(struct
static CLASS_DEVICE_ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
store_tx_queue_len);
+NETDEVICE_SHOW(weight, fmt_ulong);
+
+static int change_weight(struct net_device *net, unsigned long new_weight)
+{
+ net->weight = new_weight;
+ return 0;
+}
+
+static ssize_t store_weight(struct class_device *dev, const char *buf, size_t len)
+{
+ return netdev_store(dev, buf, len, change_weight);
+}
+
+static CLASS_DEVICE_ATTR(weight, S_IRUGO | S_IWUSR, show_weight,
+ store_weight);
+
static struct class_device_attribute *net_class_attributes[] = {
&class_device_attr_ifindex,
@@ -193,6 +209,7 @@ static struct class_device_attribute *ne
&class_device_attr_features,
&class_device_attr_mtu,
&class_device_attr_flags,
+ &class_device_attr_weight,
&class_device_attr_type,
&class_device_attr_address,
&class_device_attr_broadcast,
* Re: RFC: NAPI packet weighting patch
2005-05-31 17:40 ` Stephen Hemminger
@ 2005-05-31 17:43 ` Mitch Williams
0 siblings, 0 replies; 121+ messages in thread
From: Mitch Williams @ 2005-05-31 17:43 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Williams, Mitch A, jamal, netdev, Robert.Olsson, Ronciak, John,
Venkatesan, Ganesh, Brandeburg, Jesse
On Tue, 31 May 2005, Stephen Hemminger wrote:
>
> Like this (untested) patch:
>
Gosh, you're making my life too easy. Thanks!
I'll apply this, give it a spin, and let you know what I see.
-Mitch
* Re: RFC: NAPI packet weighting patch
2005-05-31 17:35 ` Mitch Williams
2005-05-31 17:40 ` Stephen Hemminger
@ 2005-05-31 22:07 ` Jon Mason
2005-05-31 22:14 ` David S. Miller
1 sibling, 1 reply; 121+ messages in thread
From: Jon Mason @ 2005-05-31 22:07 UTC (permalink / raw)
To: Mitch Williams
Cc: jamal, Stephen Hemminger, netdev, Robert.Olsson, Ronciak, John,
Venkatesan, Ganesh, Brandeburg, Jesse
On Tuesday 31 May 2005 12:35 pm, Mitch Williams wrote:
> On Fri, 27 May 2005, jamal wrote:
> > On Fri, 2005-27-05 at 14:01 -0700, Stephen Hemminger wrote:
> > > Why not just allow adjusting dev->weight via sysfs?
> >
> > I think that should be good enough - and i thought your patch already
> > did this.
> > Adding a shift to the weight in a _weighted_ RR algorithm does sound
> > odd.
>
> Stephen's patch only affects the weight for the backlog device. Exporting
> dev-> weight to sysfs will allow the weight to be set for any network
> device. Which makes perfect sense.
>
> I'll work on getting this done and verifying performance this week.
> Expect a patch in a few days.
>
> Thanks, guys.
>
> -Mitch
It seems to me that drivers should adjust dev->weight depending on the
media speed/duplexity of the current link. A 10Mbps link will be constantly
re-enabling interrupts, as the incoming traffic is too slow. Why not have it
be 1/4 the weight of the gigabit NAPI weight, and set it when the link speed
is detected (or forced)?
Of course, some performance analysis would have to be done to determine the
optimal numbers for each speed/duplexity setting per driver.
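For concreteness, a rough sketch of the idea (the function name and the
16/64 values are purely illustrative, not from any posted patch):

/* Hypothetical: called from the driver's link-up / watchdog path once the
 * negotiated speed is known.
 */
static void example_scale_weight(struct net_device *netdev, int speed_mbps)
{
        if (speed_mbps <= 100)
                netdev->weight = 16;    /* 10/100: small bursts are plenty */
        else
                netdev->weight = 64;    /* gigabit: keep the usual default */
}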
Thanks,
Jon
* Re: RFC: NAPI packet weighting patch
2005-05-31 22:07 ` Jon Mason
@ 2005-05-31 22:14 ` David S. Miller
2005-05-31 23:28 ` Jon Mason
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-05-31 22:14 UTC (permalink / raw)
To: jdmason
Cc: mitch.a.williams, hadi, shemminger, netdev, Robert.Olsson,
john.ronciak, ganesh.venkatesan, jesse.brandeburg
From: Jon Mason <jdmason@us.ibm.com>
Date: Tue, 31 May 2005 17:07:54 -0500
> Of course some performace analysis would have to be done to determine the
> optimal numbers for each speed/duplexity setting per driver.
per cpu speed, per memory bus speed, per I/O bus speed, and add in other
complications such as NUMA
My point is that whatever experimental number you come up with will be
good for that driver on your systems, not necessarily for others.
Even within a system, whatever number you select will be the wrong
thing to use if one starts a continuous I/O stream to the SATA
controller in the next PCI slot, for example.
We keep getting bitten by this, as the Altix perf data continually shows,
and we need to absolutely stop thinking this way.
The way to go is to make selections based upon observed events and
measurements.
* Re: RFC: NAPI packet weighting patch
2005-05-31 22:14 ` David S. Miller
@ 2005-05-31 23:28 ` Jon Mason
2005-06-02 12:26 ` jamal
0 siblings, 1 reply; 121+ messages in thread
From: Jon Mason @ 2005-05-31 23:28 UTC (permalink / raw)
To: David S. Miller
Cc: mitch.a.williams, hadi, shemminger, netdev, Robert.Olsson,
john.ronciak, ganesh.venkatesan, jesse.brandeburg
On Tuesday 31 May 2005 05:14 pm, David S. Miller wrote:
> From: Jon Mason <jdmason@us.ibm.com>
> Date: Tue, 31 May 2005 17:07:54 -0500
>
> > Of course some performace analysis would have to be done to determine the
> > optimal numbers for each speed/duplexity setting per driver.
>
> per cpu speed, per memory bus speed, per I/O bus speed, and add in other
> complications such as NUMA
>
> My point is that whatever experimental number you come up with will be
> good for that driver on your systems, not necessarily for others.
>
> Even within a system, whatever number you select will be the wrong
> thing to use if one starts a continuous I/O stream to the SATA
> controller in the next PCI slot, for example.
>
> We keep getting bitten by this, as the Altix perf data continually shows,
> and we need to absolutely stop thinking this way.
>
> The way to go is to make selections based upon observed events and
> mesaurements.
I'm not arguing against a /proc entry to tune dev->weight for those sysadmins
advanced enough to do that. I am arguing that we can make the driver smarter
(at little/no cost) for "out of the box" users.
* Re: RFC: NAPI packet weighting patch
2005-05-31 23:28 ` Jon Mason
@ 2005-06-02 12:26 ` jamal
2005-06-02 17:30 ` Stephen Hemminger
0 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-02 12:26 UTC (permalink / raw)
To: Jon Mason
Cc: David S. Miller, mitch.a.williams, shemminger, netdev,
Robert.Olsson, john.ronciak, ganesh.venkatesan, jesse.brandeburg
On Tue, 2005-31-05 at 18:28 -0500, Jon Mason wrote:
> On Tuesday 31 May 2005 05:14 pm, David S. Miller wrote:
> > From: Jon Mason <jdmason@us.ibm.com>
> > Date: Tue, 31 May 2005 17:07:54 -0500
> >
> > > Of course some performace analysis would have to be done to determine the
> > > optimal numbers for each speed/duplexity setting per driver.
> >
> > per cpu speed, per memory bus speed, per I/O bus speed, and add in other
> > complications such as NUMA
> >
> > My point is that whatever experimental number you come up with will be
> > good for that driver on your systems, not necessarily for others.
> >
> > Even within a system, whatever number you select will be the wrong
> > thing to use if one starts a continuous I/O stream to the SATA
> > controller in the next PCI slot, for example.
> >
> > We keep getting bitten by this, as the Altix perf data continually shows,
> > and we need to absolutely stop thinking this way.
> >
> > The way to go is to make selections based upon observed events and
> > mesaurements.
>
> I'm not arguing against a /proc entry to tune dev->weight for those sysadmins
> advanced enough to do that. I am arguing that we can make the driver smarter
> (at little/no cost) for "out of the box" users.
>
What is the point of making the driver "smarter"?
Recall, the algorithm used to schedule the netdevices is based on an
extension of Weighted Round Robin from Varghese et al. known as DRR (ask
Google for details).
The idea is to provide fairness amongst many drivers. As an example, if
you have a GigE driver it shouldn't be taking all the resources at the
expense of starving the Fast Ethernet driver.
If the admin wants one driver to be more "important" than the other,
s/he will make sure it has a higher weight.
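As a toy restatement of the DRR idea (plain C, not the kernel's actual
implementation): each device gets a quantum per round, spends one unit per
packet, and an idle device keeps no credit.

struct toy_dev { int weight; int quota; int backlog; };

static int toy_turn(struct toy_dev *d)
{
        int sent = 0;

        d->quota += d->weight;                  /* quantum for this round     */
        while (d->backlog > 0 && d->quota > 0) {
                d->backlog--;
                d->quota--;
                sent++;
        }
        if (d->backlog == 0)
                d->quota = 0;                   /* idle queues keep no credit */
        return sent;                            /* packets served this turn   */
}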
cheers,
jamal
* Re: RFC: NAPI packet weighting patch
2005-06-02 12:26 ` jamal
@ 2005-06-02 17:30 ` Stephen Hemminger
0 siblings, 0 replies; 121+ messages in thread
From: Stephen Hemminger @ 2005-06-02 17:30 UTC (permalink / raw)
To: hadi
Cc: Jon Mason, David S. Miller, mitch.a.williams, netdev,
Robert.Olsson, john.ronciak, ganesh.venkatesan, jesse.brandeburg
On Thu, 02 Jun 2005 08:26:46 -0400
jamal <hadi@cyberus.ca> wrote:
> On Tue, 2005-31-05 at 18:28 -0500, Jon Mason wrote:
> > On Tuesday 31 May 2005 05:14 pm, David S. Miller wrote:
> > > From: Jon Mason <jdmason@us.ibm.com>
> > > Date: Tue, 31 May 2005 17:07:54 -0500
> > >
> > > > Of course some performace analysis would have to be done to determine the
> > > > optimal numbers for each speed/duplexity setting per driver.
> > >
> > > per cpu speed, per memory bus speed, per I/O bus speed, and add in other
> > > complications such as NUMA
> > >
> > > My point is that whatever experimental number you come up with will be
> > > good for that driver on your systems, not necessarily for others.
> > >
> > > Even within a system, whatever number you select will be the wrong
> > > thing to use if one starts a continuous I/O stream to the SATA
> > > controller in the next PCI slot, for example.
> > >
> > > We keep getting bitten by this, as the Altix perf data continually shows,
> > > and we need to absolutely stop thinking this way.
> > >
> > > The way to go is to make selections based upon observed events and
> > > mesaurements.
> >
> > I'm not arguing against a /proc entry to tune dev->weight for those sysadmins
> > advanced enough to do that. I am arguing that we can make the driver smarter
> > (at little/no cost) for "out of the box" users.
> >
>
> What is the point of making the driver "smarter"?
> Recall, the algorithm used to schedule the netdevices is based on an
> extension of Weighted Round Robin from Varghese et al known as DRR (ask
> gooogle for details).
> The idea is to provide fairness amongst many drivers. As an example, if
> you have a gige driver it shouldnt be taking all the resources at the
> expense of starving the fastether driver.
> If the admin wants one driver to be more "important" than the other,
> s/he will make sure it has a higher weight.
>
In fact, the default weighting should be based on the amount of CPU time
expended per frame rather than link speed. The point is that a more
"heavyweight" driver shouldn't starve out all the others.
* RE: RFC: NAPI packet weighting patch
@ 2005-06-02 21:19 Ronciak, John
2005-06-02 21:31 ` Stephen Hemminger
0 siblings, 1 reply; 121+ messages in thread
From: Ronciak, John @ 2005-06-02 21:19 UTC (permalink / raw)
To: hadi, Jon Mason
Cc: David S. Miller, Williams, Mitch A, shemminger, netdev,
Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
The DRR algorithm assumes a perfect world, where hardware resources are
infinite, packets arrive continuously (or separated by very long
delays), there are no bus latencies, and CPU speed is infinite.
The real world is much messier: hardware starves for resources if it's
not serviced quickly enough, packets arrive at inconvenient intervals
(especially at 10 and 100 Mbps speeds), and buses and CPUs are slow.
Thus, the driver should have the intelligence built into it to make an
"intelligent" choice on what the weight should be for that
driver/hardware. The calculation in the driver should take into account
all the factors that the driver has access to. These include link
speed, bus type and speed, processor speed and some amount of actual
device FIFO size and latency smarts. The driver would use all of the
factors to come up with a weight to prevent it from dropping frames and
not to starve out other devices in the system or hinder performance. It
seems to us that the driver is the one that knows best and should try to
come up with a reasonable value for weight based on its own knowledge of
the hardware.
This has been showing up in our NAPI test data which Mitch is currently
scrubbing for release. It shows that there is a need for either better
default static weight numbers or for them to be calculated based on some
system dynamic variables. We would like to see the latter tried but the
only problem is that each driver would have to make its own
calculations, and it may not have access to all of the system-wide data
it would need to make a proper calculation.
Even with a more intelligent driver, we still would like to see some
mechanism for the weight to be changed at runtime, such as with
Stephen's sysfs patch. This would allow a sysadmin (or user-space app)
to tune the system based on statistical data that isn't available to the
individual driver.
Cheers,
John
> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca]
> Sent: Thursday, June 02, 2005 5:27 AM
> To: Jon Mason
> Cc: David S. Miller; Williams, Mitch A; shemminger@osdl.org;
> netdev@oss.sgi.com; Robert.Olsson@data.slu.se; Ronciak, John;
> Venkatesan, Ganesh; Brandeburg, Jesse
> Subject: Re: RFC: NAPI packet weighting patch
>
>
> On Tue, 2005-31-05 at 18:28 -0500, Jon Mason wrote:
> > On Tuesday 31 May 2005 05:14 pm, David S. Miller wrote:
> > > From: Jon Mason <jdmason@us.ibm.com>
> > > Date: Tue, 31 May 2005 17:07:54 -0500
> > >
> > > > Of course some performace analysis would have to be
> done to determine the
> > > > optimal numbers for each speed/duplexity setting per driver.
> > >
> > > per cpu speed, per memory bus speed, per I/O bus speed,
> and add in other
> > > complications such as NUMA
> > >
> > > My point is that whatever experimental number you come up
> with will be
> > > good for that driver on your systems, not necessarily for others.
> > >
> > > Even within a system, whatever number you select will be the wrong
> > > thing to use if one starts a continuous I/O stream to the SATA
> > > controller in the next PCI slot, for example.
> > >
> > > We keep getting bitten by this, as the Altix perf data
> continually shows,
> > > and we need to absolutely stop thinking this way.
> > >
> > > The way to go is to make selections based upon observed events and
> > > mesaurements.
> >
> > I'm not arguing against a /proc entry to tune dev->weight
> for those sysadmins
> > advanced enough to do that. I am arguing that we can make
> the driver smarter
> > (at little/no cost) for "out of the box" users.
> >
>
> What is the point of making the driver "smarter"?
> Recall, the algorithm used to schedule the netdevices is based on an
> extension of Weighted Round Robin from Varghese et al known
> as DRR (ask
> gooogle for details).
> The idea is to provide fairness amongst many drivers. As an
> example, if
> you have a gige driver it shouldnt be taking all the resources at the
> expense of starving the fastether driver.
> If the admin wants one driver to be more "important" than the other,
> s/he will make sure it has a higher weight.
>
> cheers,
> jamal
>
>
* Re: RFC: NAPI packet weighting patch
2005-06-02 21:19 Ronciak, John
@ 2005-06-02 21:31 ` Stephen Hemminger
2005-06-02 21:40 ` David S. Miller
2005-06-02 21:51 ` Jon Mason
0 siblings, 2 replies; 121+ messages in thread
From: Stephen Hemminger @ 2005-06-02 21:31 UTC (permalink / raw)
To: Ronciak, John
Cc: hadi, Jon Mason, David S. Miller, Williams, Mitch A, netdev,
Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
On Thu, 2 Jun 2005 14:19:55 -0700
"Ronciak, John" <john.ronciak@intel.com> wrote:
> The DRR algorithm assumes a perfect world, where hardware resources are
> infinite, packets arrive continuously (or separated by very long
> delays), there are no bus latencies, and CPU speed is infinite.
>
> The real world is much messier: hardware starves for resources if it's
> not serviced quickly enough, packets arrive at inconvenient intervals
> (especially at 10 and 100 Mbps speeds), and buses and CPUs are slow.
>
> Thus, the driver should have the intelligence built into it to make an
> "intelligent" choice on what the weight should be for that
> driver/hardware. The calculation in the driver should take into account
> all the factors that the driver has access to. These include link
> speed, bus type and speed, processor speed and some amount of actual
> device FIFO size and latency smarts. The driver would use all of the
> factors to come up with a weight to prevent it from dropping frames and
> not to starve out other devices in the system or hinder performance. It
> seems to us that the driver is the one that know best and should try to
> come up with a reasonable value for weight based on its own knowledge of
> the hardware.
This is like saying each CPU vendor should write their own process scheduler
for Linux. Now with NUMA and HT, it is getting almost that bad, but we still
try to keep it CPU-neutral.
For networking the problem is worse: the "right" choice depends on the workload
and the relationships between components in the system. I can't see how you
could ever expect a driver-specific solution.
> This has been showing up in our NAPI test data which Mitch is currently
> scrubbing for release. It shows that there is a need for either better
> default static weight numbers or for them to be calculated based on some
> system dynamic variables. We would like to see the latter tried but the
> only problem is that each driver would have to make its own
> calculations, and it may not have access to all of the system-wide data
> it would need to make a proper calculation.
And for other workloads, and other systems (think about the Altix with
long access latencies), your numbers will be wrong. Perhaps we need
to quit trying for a perfect solution and just get a "good enough" one
that works.
Let's keep the intelligence out of the driver. Most of the existing
smart drivers end up looking like crap and don't work that well.
> Even with a more intelligent driver, we still would like to see some
> mechanism for the weight to be changed at runtime, such as with
> Stephen's sysfs patch. This would allow a sysadmin (or user-space app)
> to tune the system based on statistical data that isn't available to the
> individual driver.
>
It will be yet another knob that all except the benchmark tweakers can
ignore (hopefully).
* Re: RFC: NAPI packet weighting patch
2005-06-02 21:31 ` Stephen Hemminger
@ 2005-06-02 21:40 ` David S. Miller
2005-06-02 21:51 ` Jon Mason
1 sibling, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-02 21:40 UTC (permalink / raw)
To: shemminger
Cc: john.ronciak, hadi, jdmason, mitch.a.williams, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: Stephen Hemminger <shemminger@osdl.org>
Date: Thu, 2 Jun 2005 14:31:26 -0700
> For networking the problem is worse, the "right" choice depends on workload
> and relationship between components in the system. I can't see how you could
> ever expect a driver specific solution.
I totally agree, even the mere concept of driver-centric decisions
in this area is pretty bogus.
> And for other workloads, and other systems (think about the Altix with
> long access latencies), your numbers will be wrong. Perhaps we need
> to quit trying for a perfect solution and just get a "good enough" one
> that works.
I don't understand why nobody is investigating doing this stuff
by generic measurements that the core kernel can perform.
The generic ->poll() runner code can say, wow it took N-usec to
process M packets, perhaps I should adjust the weight.
I haven't seen one concrete suggestion along those lines, yet that is
where the answer to this kind of stuff is. Those kinds of solutions
are completely CPU, memory, I/O bus, network device, and workload
independent.
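Just to sketch the kind of thing I mean (the function and the 200 usec target
are made up for illustration, not a real proposal):

static int example_adapt_weight(int weight, int packets, long usecs)
{
        const long target_usecs = 200;          /* arbitrary per-poll budget  */

        if (packets == 0)
                return weight;
        if (usecs > target_usecs && weight > 8)
                return weight / 2;              /* poll ran long: shrink it   */
        if (usecs < target_usecs / 2 && weight < 256)
                return weight * 2;              /* poll was cheap: allow more */
        return weight;
}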
* Re: RFC: NAPI packet weighting patch
2005-06-02 21:31 ` Stephen Hemminger
2005-06-02 21:40 ` David S. Miller
@ 2005-06-02 21:51 ` Jon Mason
2005-06-02 22:12 ` David S. Miller
2005-06-02 22:15 ` Robert Olsson
1 sibling, 2 replies; 121+ messages in thread
From: Jon Mason @ 2005-06-02 21:51 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Ronciak, John, hadi, David S. Miller, Williams, Mitch A, netdev,
Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
On Thursday 02 June 2005 04:31 pm, Stephen Hemminger wrote:
<...>
> For networking the problem is worse, the "right" choice depends on workload
> and relationship between components in the system. I can't see how you
> could ever expect a driver specific solution.
I think there is a way for a generic driver NAPI enhancement. That is to
modify the weight depending on link speed.
Here is the problem as I see it: NAPI enablement at slow media speeds causes
unneeded strain on the system. This is because of the "weight" of NAPI.
Let's look at e1000 as an example. Currently the NAPI weight is 64,
regardless of link media speed. This weight is probably fine for a gigabit
link, but for 10/100 it is way too large, causing interrupts to be
enabled/disabled after every poll/interrupt. Lots of overhead, and not very
smart. Why not have the driver set the weight to 16/32 respectively (or
better yet, have someone run numbers to find weights that are closer to what
the adapter can actually use)? While these numbers may not be optimal for
every system, this is much better than the current behavior, and would only
require 5 or so extra lines of code per NAPI-enabled driver.
For those who want an optimal weight for their tuned system, let them
use the /proc entry that is being proposed.
Thanks,
Jon
* Re: RFC: NAPI packet weighting patch
2005-06-02 21:51 ` Jon Mason
@ 2005-06-02 22:12 ` David S. Miller
2005-06-02 22:19 ` Jon Mason
2005-06-02 22:15 ` Robert Olsson
1 sibling, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-02 22:12 UTC (permalink / raw)
To: jdmason
Cc: shemminger, john.ronciak, hadi, mitch.a.williams, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: Jon Mason <jdmason@us.ibm.com>
Date: Thu, 2 Jun 2005 16:51:48 -0500
> Why not have the driver set the weight to 16/32 respectively for the
> weight (or better yet, have someone run numbers to find weight that
> are closer to what the adapter can actually use)? While these
> numbers may not be optimal for every system, this is much better
> that the current system, and would only require 5 or so extra lines
> of code per NAPI enabled driver.
Why do this when we can adjust the weight in one spot,
namely the upper level NAPI ->poll() running loop?
It can measure the overhead, how many packets processed, etc.
and make intelligent decisions based upon that. This is a CPU
speed, memory speed, I/O bus speed, and link speed agnostic
solution.
The driver need not take any part in this, and the scheme will
dynamically adjust to resource usage changes in the system.
* Re: RFC: NAPI packet weighting patch
2005-06-02 21:51 ` Jon Mason
2005-06-02 22:12 ` David S. Miller
@ 2005-06-02 22:15 ` Robert Olsson
1 sibling, 0 replies; 121+ messages in thread
From: Robert Olsson @ 2005-06-02 22:15 UTC (permalink / raw)
To: Jon Mason
Cc: Stephen Hemminger, Ronciak, John, hadi, David S. Miller,
Williams, Mitch A, netdev, Robert.Olsson, Venkatesan, Ganesh,
Brandeburg, Jesse
Differentiate the meaning of weight a bit:
Let weight only limit the number of packets we deliver per ->poll.
Have some other mechanism or threshold to control when interrupts are
to be turned back on.
A first approximation for this could be to poll as long as we see
any packet on the RX ring, as interrupts seem expensive on all platforms.
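A rough sketch of such a ->poll() (the example_* driver helpers are
hypothetical; only the NAPI entry points are real):

static int example_poll(struct net_device *dev, int *budget)
{
        int limit = dev->quota < *budget ? dev->quota : *budget;
        int work  = example_clean_rx_ring(dev, limit);  /* hypothetical */

        *budget    -= work;
        dev->quota -= work;

        if (!example_rx_ring_has_work(dev)) {           /* hypothetical */
                netif_rx_complete(dev);  /* leave the poll list...      */
                example_enable_irq(dev); /* ...and re-arm the interrupt */
                return 0;                /* done until the next irq     */
        }
        return 1;                        /* ring still busy: poll again */
}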
Cheers.
--ro
Jon Mason writes:
> On Thursday 02 June 2005 04:31 pm, Stephen Hemminger wrote:
> <...>
> > For networking the problem is worse, the "right" choice depends on workload
> > and relationship between components in the system. I can't see how you
> > could ever expect a driver specific solution.
>
> I think there is a way for a generic driver NAPI enhancement. That is to
> modify the weight dependent on link speed.
>
> Here is the problem as I see it, NAPI enablement for slow media speeds causes
> unneeded strain on the system. This is because of the "weight" of NAPI.
> Lets look at e1000 as an example. Currently the NAPI weight is 64,
> regardless of link media speed. This weight is probably fine for a gigabit
> link, but for 10/100 this is way to large. Thus causing interrupts to be
> enabled/disabled after every poll/interrupt. Lots of overhead, and not very
> smart. Why not have the driver set the weight to 16/32 respectively for the
> weight (or better yet, have someone run numbers to find weight that are
> closer to what the adapter can actually use)? While these numbers may not be
> optimal for every system, this is much better that the current system, and
> would only require 5 or so extra lines of code per NAPI enabled driver.
>
> For those who want to have an optimal weight for their tuned system, let them
> use the /proc entry that is being proposed.
>
> Thanks,
> Jon
* Re: RFC: NAPI packet weighting patch
2005-06-02 22:12 ` David S. Miller
@ 2005-06-02 22:19 ` Jon Mason
0 siblings, 0 replies; 121+ messages in thread
From: Jon Mason @ 2005-06-02 22:19 UTC (permalink / raw)
To: David S. Miller
Cc: shemminger, john.ronciak, hadi, mitch.a.williams, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Thursday 02 June 2005 05:12 pm, David S. Miller wrote:
> From: Jon Mason <jdmason@us.ibm.com>
> Date: Thu, 2 Jun 2005 16:51:48 -0500
>
> > Why not have the driver set the weight to 16/32 respectively for the
> > weight (or better yet, have someone run numbers to find weight that
> > are closer to what the adapter can actually use)? While these
> > numbers may not be optimal for every system, this is much better
> > that the current system, and would only require 5 or so extra lines
> > of code per NAPI enabled driver.
>
> Why do this when we can adjust the weight in one spot,
> namely the upper level NAPI ->poll() running loop?
>
> It can measure the overhead, how many packets processed, etc.
> and make intelligent decisions based upon that. This is a CPU
> speed, memory speed, I/O bus speed, and link speed agnostic
> solution.
>
> The driver need not take any part in this, and the scheme will
> dynamically adjust to resource usage changes in the system.
Yes, a much better idea to do this generically. I 100% agree with you.
* RE: RFC: NAPI packet weighting patch
@ 2005-06-03 0:11 Ronciak, John
2005-06-03 0:18 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: Ronciak, John @ 2005-06-03 0:11 UTC (permalink / raw)
To: Jon Mason, David S. Miller
Cc: shemminger, hadi, Williams, Mitch A, netdev, Robert.Olsson,
Venkatesan, Ganesh, Brandeburg, Jesse
I like this idea as well, but I do see an issue with it. How would this
stack code find out that the weight is too high and packets are being
dropped (not being polled fast enough)? It would have to check the
controller stats to see the error count increasing for some period. I'm
not sure this is workable unless we have some sort of feedback which the
driver could send up (or set) saying that this is happening, which the
dynamic weight code could then take into account.
Comments?
Cheers,
John
> -----Original Message-----
> From: Jon Mason [mailto:jdmason@us.ibm.com]
> Sent: Thursday, June 02, 2005 3:20 PM
> To: David S. Miller
> Cc: shemminger@osdl.org; Ronciak, John; hadi@cyberus.ca;
> Williams, Mitch A; netdev@oss.sgi.com;
> Robert.Olsson@data.slu.se; Venkatesan, Ganesh; Brandeburg, Jesse
> Subject: Re: RFC: NAPI packet weighting patch
>
>
> On Thursday 02 June 2005 05:12 pm, David S. Miller wrote:
> > From: Jon Mason <jdmason@us.ibm.com>
> > Date: Thu, 2 Jun 2005 16:51:48 -0500
> >
> > > Why not have the driver set the weight to 16/32
> respectively for the
> > > weight (or better yet, have someone run numbers to find
> weight that
> > > are closer to what the adapter can actually use)? While these
> > > numbers may not be optimal for every system, this is much better
> > > that the current system, and would only require 5 or so
> extra lines
> > > of code per NAPI enabled driver.
> >
> > Why do this when we can adjust the weight in one spot,
> > namely the upper level NAPI ->poll() running loop?
> >
> > It can measure the overhead, how many packets processed, etc.
> > and make intelligent decisions based upon that. This is a CPU
> > speed, memory speed, I/O bus speed, and link speed agnostic
> > solution.
> >
> > The driver need not take any part in this, and the scheme will
> > dynamically adjust to resource usage changes in the system.
>
> Yes, a much better idea to do this generically. I 100% agree
> with you.
>
* Re: RFC: NAPI packet weighting patch
2005-06-03 0:11 RFC: NAPI packet weighting patch Ronciak, John
@ 2005-06-03 0:18 ` David S. Miller
2005-06-03 2:32 ` jamal
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-03 0:18 UTC (permalink / raw)
To: john.ronciak
Cc: jdmason, shemminger, hadi, mitch.a.williams, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: "Ronciak, John" <john.ronciak@intel.com>
Date: Thu, 2 Jun 2005 17:11:20 -0700
> I like this idea as well but I do an issue with it. How would this
> stack code find out that the weight is too high and pacekts are being
> dropped (not being polled fast enough)? It would have to check the
> controller stats to see the error count increasing for some period. I'm
> not sure this is workable unless we have some sort of feedback which the
> driver could send up (or set) saying that this is happening and the
> dynamic weight code could take into acount.
What more do you need other than checking the statistics counter? The
drop statistics (the ones we care about) are incremented in real time
by the ->poll() code, so it's not like we have to trigger some
asynchronous event to get a current version of the number.
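(Sketch only, to show what "checking the counter" could look like from the
generic code; whether a driver keeps these particular fields current from its
->poll() path is driver-dependent.)

static int example_drops_grew(struct net_device *dev, unsigned long *last)
{
        struct net_device_stats *s;
        unsigned long drops;
        int grew;

        if (!dev->get_stats)
                return 0;
        s = dev->get_stats(dev);
        drops = s->rx_missed_errors + s->rx_fifo_errors;
        grew = drops > *last;
        *last = drops;
        return grew;                    /* nonzero if drops increased */
}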
* Re: RFC: NAPI packet weighting patch
2005-06-03 0:18 ` David S. Miller
@ 2005-06-03 2:32 ` jamal
2005-06-03 17:43 ` Mitch Williams
0 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-03 2:32 UTC (permalink / raw)
To: David S. Miller
Cc: john.ronciak, jdmason, shemminger, mitch.a.williams, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Thu, 2005-02-06 at 17:18 -0700, David S. Miller wrote:
> From: "Ronciak, John" <john.ronciak@intel.com>
> Date: Thu, 2 Jun 2005 17:11:20 -0700
>
> > I like this idea as well but I do an issue with it. How would this
> > stack code find out that the weight is too high and pacekts are being
> > dropped (not being polled fast enough)? It would have to check the
> > controller stats to see the error count increasing for some period. I'm
> > not sure this is workable unless we have some sort of feedback which the
> > driver could send up (or set) saying that this is happening and the
> > dynamic weight code could take into acount.
>
> What more do you need other than checking the statistics counter? The
> drop statistics (the ones we care about) are incremented in real time
> by the ->poll() code, so it's not like we have to trigger some
> asynchronous event to get a current version of the number.
I am reading through all the emails and I think either the problem is
not being clearly stated or not understood. I was going to say "or I am
on crack" - but I know I am clean ;->
Here's what I think I saw as the flow of events:
Someone posted a theory that if you happen to reduce the weight
(IIRC the reduction was via a shift) then the DRR would give less CPU
time to the driver - what's the big surprise there? That's the DRR design
intent.
Stephen has a patch which allows people to reduce the weight.
DRR provides fairness. If you have 10 NICs coming at different wire
rates, the weights provide a fairness quota without caring about what
those speeds are. So it doesn't make any sense IMO to have the weight
based on what the NIC speed is. In fact I claim it is _nonsense_. You
don't need to factor in speed. And the claim that DRR is not real-world
is blasphemous.
Having said that:
I have a feeling that the issue being waded around is the amount of CPU
that the softirq chews (unfortunately a well-known issue) and, to some
extent, the amount a specific driver's packet path chews depending on the
path it takes.
In other words, for the DRR algorithm to enhance fairness it should
consider not only fairness in the number of packets each driver injects
into the system but also the amount of CPU that driver chews. At the
moment we lump all drivers together as far as CPU cycles are
concerned.
If we could narrow it down to this, then I think there is something that
could lead to meaningful discussion.
This, however, does not eradicate the need for DRR and is absolutely not
driver-specific.
cheers,
jamal
* RE: RFC: NAPI packet weighting patch
@ 2005-06-03 17:40 Ronciak, John
2005-06-03 18:08 ` Robert Olsson
0 siblings, 1 reply; 121+ messages in thread
From: Ronciak, John @ 2005-06-03 17:40 UTC (permalink / raw)
To: David S. Miller
Cc: jdmason, shemminger, hadi, Williams, Mitch A, netdev,
Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
> What more do you need other than checking the statistics counter? The
> drop statistics (the ones we care about) are incremented in real time
> by the ->poll() code, so it's not like we have to trigger some
> asynchronous event to get a current version of the number.
>
I think that there is some more confusion here. I'm talking about
frames dropped by the Ethernet controller at the hardware level (no
descriptor available). This for example is happening now with our
driver with the weight set to 64. This is also what started us looking
into what was going on with the weight. I don't see how the NAPI code
to dynamically adjust the weight could easily get the hardware stats
number to know if frames are being dropped or not. Sorry if I caused
the confusion here.
Mitch is working on a response to Jamal's last mail trying to level set
what we are seeing and doing.
Cheers,
John
* Re: RFC: NAPI packet weighting patch
2005-06-03 2:32 ` jamal
@ 2005-06-03 17:43 ` Mitch Williams
2005-06-03 18:38 ` David S. Miller
2005-06-03 18:42 ` jamal
0 siblings, 2 replies; 121+ messages in thread
From: Mitch Williams @ 2005-06-03 17:43 UTC (permalink / raw)
To: jamal
Cc: David S. Miller, Ronciak, John, jdmason, shemminger, netdev,
Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
On Thu, 2 Jun 2005, jamal wrote:
>
> Heres what i think i saw as a flow of events:
> Someone posted a theory that if you happen to reduce the weight
> (iirc the reduction was via a shift) then the DRR would give less CPU
> time cycle to the driver - Whats the big suprise there? thats DRR design
> intent.
Well, that was me. Or at least I was the original poster on this thread.
But my theory (if you can call it that) really wasn't about CPU time. I
spent several weeks in our lab with the somewhat nebulous task of "look at
Linux performance". And what I found was, to me, counterintuitive:
reducing weight improved performance, sometimes significantly.
>
> Stephen has a patch which allows people to reduce the weight.
> DRR provides fairness. If you have 10 NICs coming at different wire
> rates, the weights provide a fairness quota without caring about what
> those speeds are. So it doesnt make any sense IMO to have the weight
> based on what the NIC speed is. Infact i claim it is _nonsense_. You
> dont need to factor speed. And the claim that DRR is not real world
> is blasphemous.
OK, well, call me a blasphemer (against whom?). I'm not really saying
that the DRR algorithm is not real-world, but rather that NAPI as
currently implemented has some significant performance limitations.
In my mind, there are two major problems with NAPI as it stands today.
First, at Gigabit and higher speeds, the default settings don't allow the
driver to process received packets in a timely manner. This causes
dropped packets due to lack of receive resources. Lowering the weight can
fix this, at least in a single-adapter environment.
Second, at 10Mbps and 100Mbps, modern processors are just too fast for the
network. The NAPI polling loop runs so much quicker than the wire speed
that only one or two packets are processed per softirq -- which
effectively puts the adapter back in interrupt mode. Because of this, you
can easily bog down a very fast box with relatively slow traffic, just due
to the massive number of interrupts generated.
My original post (and patch) were to address the first issue. By using
the shift value on the quota, I effectively lowered the weight for every
driver in the system. Stephen sent out a patch that allowed you to
adjust each driver's weight individually. My testing has shown that, as
expected, you can achieve the same performance gain either way.
In a multiple-adapter environment, you need to adjust the weight of all
drivers together to fix the dropped packets issue. Lowering the weight on
one adapter won't help it if the other interfaces are still taking up a
lot of time in their receive loops.
My patch gave you one knob to twiddle that would correct this issue.
Stephen's patch gave you one knob for each adapter, but now you need to
twiddle them all to see any benefit.
The second issue currently has no fix. What is needed is a way for the
driver to request a delayed poll, possibly based on line speed. If we
could wait, say, 8 packet times before polling, we could significantly
reduce the number of interrupts the system has to deal with, at the cost
of higher latency. We haven't had time to investigate this at all, but
the need is clearly present -- we've had customer calls about this issue.
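Very roughly, I'm imagining something like this (an untested sketch against
the current API; my_adapter, poll_timer, and my_disable_rx_irq() are made-up
placeholders, and the one-jiffy delay is purely illustrative):
/* Untested sketch: instead of scheduling the poll straight from the
 * ISR, mask the NIC's receive interrupt and arm a short timer so a
 * few packets can accumulate before ->poll() runs.  At 100Mbps, 8
 * full-size packets take roughly 1ms, i.e. about one jiffy at HZ=1000. */
static void my_delayed_poll(unsigned long data)
{
	struct net_device *netdev = (struct net_device *)data;

	/* Enter the normal NAPI polling path now. */
	netif_rx_schedule(netdev);
}

static irqreturn_t my_intr(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *netdev = dev_id;
	struct my_adapter *adapter = netdev_priv(netdev);

	my_disable_rx_irq(adapter);		/* placeholder helper */

	/* Defer the poll by ~8 packet times instead of polling at once;
	 * adapter->poll_timer would be set up with my_delayed_poll(). */
	mod_timer(&adapter->poll_timer, jiffies + 1);

	return IRQ_HANDLED;
}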
>
> Having said that:
> I have a feeling that the issue which is being waded around is the
> amount that the softirq chews in the CPU (unfortunately a well known
> issue) and to some extent the packet flow a specific driver chews
> depending on the path it takes.
I fiddled with this concept a little bit, but didn't see much performance
gain by doing so. But it may be something that we can go back and look
at.
Either way, I think the netdev community needs to look critically at NAPI,
and make some changes. Network performance in 2.6.12-rcWhatever is
pretty poor. 2.4.30 beats it handily, and it really shouldn't be that
way.
> This, however, does not eradicate the need for DRR and is absolutely not
> driver specific.
Agreed. All of the changes I've experimented with at the NAPI level have
affected performance similarly on multiple drivers.
-Mitch
* RE: RFC: NAPI packet weighting patch
2005-06-03 17:40 Ronciak, John
@ 2005-06-03 18:08 ` Robert Olsson
0 siblings, 0 replies; 121+ messages in thread
From: Robert Olsson @ 2005-06-03 18:08 UTC (permalink / raw)
To: Ronciak, John
Cc: David S. Miller, jdmason, shemminger, hadi, Williams, Mitch A,
netdev, Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
Ronciak, John writes:
> > What more do you need other than checking the statistics counter? The
> > drop statistics (the ones we care about) are incremented in real time
> > by the ->poll() code, so it's not like we have to trigger some
> > asynchronous event to get a current version of the number.
> >
>
> I think that there is some more confusion here. I'm talking about
> frames dropped by the Ethernet controller at the hardware level (no
> descriptor available). This for example is happening now with our
> driver with the weight set to 64. This is also what started us looking
> into what was going on with the weight. I don't see how the NAPI code
> to dynamically adjust the weight could easily get the hardware stats
> number to know if frames are being dropped or not. Sorry if I caused
> the confusion here.
It's not obvious that weight is to blame for frames dropped. I would
look into RX ring size in relation to HW mitigation.
And of course, if your system is very loaded, the RX softirq gives room
for other jobs and frames get dropped.
Cheers.
--ro
* RE: RFC: NAPI packet weighting patch
@ 2005-06-03 18:19 Ronciak, John
2005-06-03 18:33 ` Ben Greear
2005-06-03 20:17 ` Robert Olsson
0 siblings, 2 replies; 121+ messages in thread
From: Ronciak, John @ 2005-06-03 18:19 UTC (permalink / raw)
To: Robert Olsson
Cc: David S. Miller, jdmason, shemminger, hadi, Williams, Mitch A,
netdev, Venkatesan, Ganesh, Brandeburg, Jesse
> It's not obvious that weight is to blame for frames dropped. I would
> look into RX ring size in relation to HW mitigation.
> And of course, if your system is very loaded, the RX softirq gives room
> for other jobs and frames get dropped.
>
With the same system (fairly high end with nothing major running on it)
we got rid of the dropped frames by just reducing the weight from 64. So
the weight did have something to do with the dropped frames. Maybe
other factors as well, but in static tests like this it sure looks like
the 64 value is wrong in some cases.
Cheers,
John
* Re: RFC: NAPI packet weighting patch
2005-06-03 18:19 Ronciak, John
@ 2005-06-03 18:33 ` Ben Greear
2005-06-03 18:49 ` David S. Miller
2005-06-03 20:17 ` Robert Olsson
1 sibling, 1 reply; 121+ messages in thread
From: Ben Greear @ 2005-06-03 18:33 UTC (permalink / raw)
To: Ronciak, John
Cc: Robert Olsson, David S. Miller, jdmason, shemminger, hadi,
Williams, Mitch A, netdev, Venkatesan, Ganesh, Brandeburg, Jesse
Ronciak, John wrote:
>> It's not obvious that weight is to blame for frames dropped. I would
>> look into RX ring size in relation to HW mitigation.
>> And of course, if your system is very loaded, the RX softirq gives room
>> for other jobs and frames get dropped.
>>
>
> With the same system (fairly high end with nothing major running on it)
> we got rid of the dropped frames by just reducing the weight from 64. So
> the weight did have something to do with the dropped frames. Maybe
> other factors as well, but in static tests like this it sure looks like
> the 64 value is wrong in some cases.
Is this implying that having the NAPI poll do less work per poll
of the driver actually increases performance? I would have guessed that
the opposite would be true.
Maybe the poll is disabling the IRQs on the NIC for too long, or something
like that?
For e1000, are you using larger than the default 256 receive descriptors?
I have seen that increasing these descriptors helps decrease drops by
a small percentage.
Have you tried increasing the netdev-backlog setting to see if that
fixes the problem (while leaving the weight at the default)?
What packet sizes and speeds are you using for your tests?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: RFC: NAPI packet weighting patch
2005-06-03 17:43 ` Mitch Williams
@ 2005-06-03 18:38 ` David S. Miller
2005-06-03 18:42 ` jamal
1 sibling, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-03 18:38 UTC (permalink / raw)
To: mitch.a.williams
Cc: hadi, john.ronciak, jdmason, shemminger, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
From: Mitch Williams <mitch.a.williams@intel.com>
Date: Fri, 3 Jun 2005 10:43:32 -0700
> In my mind, there are two major problems with NAPI as it stands today.
> First, at Gigabit and higher speeds, the default settings don't allow the
> driver to process received packets in a timely manner. This causes
> dropped packets due to lack of receive resources. Lowering the weight can
> fix this, at least in a single-adapter environment.
I really don't see how changing the weight can change things
in the single adapter case.
When we hit the quota, we just loop and process more packets.
It doesn't fundamentally change anything about how the NAPI
code operates.
Please investigate what exactly is happening. I have a few
theories. First, is it the case that with a lower weight we
drop out of the loop because 'jiffies' advanced one tick?
Some simple instrumentation in net/core/dev.c:net_rx_action()
would show what's going on. Actually, we keep this statistic
via netdev_rx_stat, so just cat /proc/net/softnet_stat to
get a look at if "time_squeeze" is being incremented when
dev->weight is 64 in your tests.
Next, I don't think "budget" in that function is going down to zero,
that's set to 300 by default.
If the quota is consumed, the device is just added right back
to the tail of the poll_list, and if it's the only device
active we jump right back into its ->poll() routine over
and over until there is no more pending work in the device
or we hit the "jiffies - start_time > 1" test.
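For reference, the loop in question looks roughly like this (a simplified
paraphrase, with locking, refcounting, and error handling omitted -- not
the verbatim net/core/dev.c source):
static void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *queue = &__get_cpu_var(softnet_data);
	unsigned long start_time = jiffies;
	int budget = netdev_max_backlog;	/* 300 by default */

	while (!list_empty(&queue->poll_list)) {
		struct net_device *dev;

		/* The two "squeeze" exits discussed above. */
		if (budget <= 0 || jiffies - start_time > 1)
			goto softnet_break;

		dev = list_entry(queue->poll_list.next,
				 struct net_device, poll_list);

		if (dev->quota <= 0 || dev->poll(dev, &budget)) {
			/* Quota consumed: rotate the device to the tail
			 * of the list and top its quota back up. */
			list_move_tail(&dev->poll_list, &queue->poll_list);
			if (dev->quota < 0)
				dev->quota += dev->weight;
			else
				dev->quota = dev->weight;
		}
		/* else: the driver finished all pending work and took
		 * itself off the poll list. */
	}
	return;

softnet_break:
	/* Third column of /proc/net/softnet_stat. */
	__get_cpu_var(netdev_rx_stat).time_squeeze++;
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}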
* Re: RFC: NAPI packet weighting patch
2005-06-03 17:43 ` Mitch Williams
2005-06-03 18:38 ` David S. Miller
@ 2005-06-03 18:42 ` jamal
2005-06-03 19:01 ` David S. Miller
1 sibling, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-03 18:42 UTC (permalink / raw)
To: Mitch Williams
Cc: David S. Miller, Ronciak, John, jdmason, shemminger, netdev,
Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
On Fri, 2005-03-06 at 10:43 -0700, Mitch Williams wrote:
>
> On Thu, 2 Jun 2005, jamal wrote:
> >
> > Heres what i think i saw as a flow of events:
> > Someone posted a theory that if you happen to reduce the weight
> > (iirc the reduction was via a shift) then the DRR would give less CPU
> > time cycle to the driver - Whats the big suprise there? thats DRR design
> > intent.
>
> Well, that was me. Or at least I was the original poster on this thread.
> But my theory (if you can call it that) really wasn't about CPU time. I
> spent several weeks in our lab with the somewhat nebulous task of "look at
> Linux performance". And what I found was, to me, counterintuitive:
> reducing weight improved performance, sometimes significantly.
>
When you reduce the weight, the system is spending less time in the
softirq processing packets before softirq yields. If this gives more
opportunity to your app to run, then the performance will go up.
Is this what you are seeing?
> OK, well, call me a blasphemer (against whom?).
> I'm not really saying
> that the DRR algorithm is not real-world, but rather that NAPI as
> currently implemented has some significant performance limitations.
>
And we need to be fair and investigate why.
> In my mind, there are two major problems with NAPI as it stands today.
> First, at Gigabit and higher speeds, the default settings don't allow the
> driver to process received packets in a timely manner.
What do you mean by timely?
> This causes
> dropped packets due to lack of receive resources. Lowering the weight can
> fix this, at least in a single-adapter environment.
>
If you know your workload you could tune the weight. Additionally, you
could tune the softirq using nice.
> Second, at 10Mbps and 100Mbps, modern processors are just too fast for the
> network. The NAPI polling loop runs so much quicker than the wire speed
> that only one or two packets are processed per softirq -- which
> effectively puts the adapter back in interrupt mode. Because of this, you
> can easily bog down a very fast box with relatively slow traffic, just due
> to the massive number of interrupts generated.
>
Massive is an overstatement. The issue is really IO. If you process one
packet in each interrupt then NAPI does add extra IO costs at "low"
traffic levels.
Note that this is also a known issue - reference the threads from way
back from people like Manfred Spraul and recently from the SGI folks.
IO unfortunately hasn't kept up with CPU speeds; hardware vendors such as
your company have been busy making processors faster but forgetting
about IO and RAM latencies. PCI-E seems promising from what I have heard;
interim PCI-E-to-PCI-X bridging is, from what I have heard, worse in its
IO performance.
> My original post (and patch) were to address the first issue. By using
> the shift value on the quota, I effectively lowered the weight for every
> driver in the system. Stephen sent out a patch that allowed you to
> adjust each driver's weight individually. My testing has shown that, as
> expected, you can achieve the same performance gain either way.
>
Ok, glad to hear that's resolved.
> In a multiple-adapter environment, you need to adjust the weight of all
> drivers together to fix the dropped packets issue. Lowering the weight on
> one adapter won't help it if the other interfaces are still taking up a
> lot of time in their receive loops.
>
> My patch gave you one knob to twiddle that would correct this issue.
> Stephen's patch gave you one knob for each adapter, but now you need to
> twiddle them all to see any benefit.
>
makes sense
> The second issue currently has no fix. What is needed is a way for the
> driver to request a delayed poll, possibly based on line speed. If we
> could wait, say, 8 packet times before polling, we could significantly
> reduce the number of interrupts the system has to deal with, at the cost
> of higher latency. We haven't had time to investigate this at all, but
> the need is clearly present -- we've had customer calls about this issue.
>
I can believe you (note it has to do with IO costs, though), having seen
how horrific MMIO numbers are on faster processors. Talk to Jesse, he
has seen a little program from Lennert/Robert/Harald that does MMIO
measurements.
It seems the trend is that as CPUs get faster, IO gets more expensive in
both cpu cycles as well as absolute time.
At the moment the solution to this issue is to be found in mitigation, in
conjunction with NAPI.
The SGI folks have made some real progress with recent patches from
Davem and Michael Chan on tg3.
I have been experimenting with some patches but they introduce
unacceptable jitter in latency.
So let's summarize it this way: This is something that needs to be
resolved - but whatever the solution is, it needs to be generic.
> Either way, I think the netdev community needs to look critically at NAPI,
> and make some changes.
I think what you call the second issue needs a solution. Mitigation
is the only generic solution at the moment.
> Network performance in 2.6.12-rcWhatever is
> pretty poor. 2.4.30 beats it handily, and it really shouldn't be that
> way.
>
Are you using NAPI as well on 2.4.30?
cheers,
jamal
* Re: RFC: NAPI packet weighting patch
2005-06-03 18:33 ` Ben Greear
@ 2005-06-03 18:49 ` David S. Miller
2005-06-03 18:59 ` Ben Greear
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-03 18:49 UTC (permalink / raw)
To: greearb
Cc: john.ronciak, Robert.Olsson, jdmason, shemminger, hadi,
mitch.a.williams, netdev, ganesh.venkatesan, jesse.brandeburg
From: Ben Greear <greearb@candelatech.com>
Date: Fri, 03 Jun 2005 11:33:00 -0700
> Is this implying that having the NAPI poll do less work per poll
> of the driver actually increases performance? I would have guessed that
> the opposite would be true.
Exactly my thoughts as well :)
> Maybe the poll is disabling the IRQs on the NIC for too long, or something
> like that?
In a reply I just sent out to this thread, I postulate that the
jiffies check is hitting earlier with a lower weight value, a quick
look at /proc/net/softnet_stat during their testing will confirm or
deny this theory.
It could also just be a simple bug in the dev->quota accounting
somewhere.
Note that, in all of this, I do not have any objections to providing
a way to configure the dev->weight values. I will be applying Stephen
Hemminger's patches.
But I think we MUST find out the reason for the observed behavior,
especially in the single-adapter case since the result is so illogical.
We could find an important bug in the NAPI implementation, or learn
something new about how NAPI behaves.
* Re: RFC: NAPI packet weighting patch
2005-06-03 18:49 ` David S. Miller
@ 2005-06-03 18:59 ` Ben Greear
2005-06-03 19:02 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: Ben Greear @ 2005-06-03 18:59 UTC (permalink / raw)
To: David S. Miller
Cc: john.ronciak, Robert.Olsson, jdmason, shemminger, hadi,
mitch.a.williams, netdev, ganesh.venkatesan, jesse.brandeburg
David S. Miller wrote:
> From: Ben Greear <greearb@candelatech.com>
>>Maybe the poll is disabling the IRQs on the NIC for too long, or something
>>like that?
>
>
> In a reply I just sent out to this thread, I postulate that the
> jiffies check is hitting earlier with a lower weight value, a quick
> look at /proc/net/softnet_stat during their testing will confirm or
> deny this theory.
That would basically just decrease the work done in the NAPI poll though,
so I don't see how that could be the problem, since the 'solution' was to
force less work to be done.
> It could also just be a simple bug in the dev->quota accounting
> somewhere.
>
> Note that, in all of this, I do not have any objections to providing
> a way to configure the dev->weight values. I will be applying Stephen
> Hemminger's patches.
Good. The more knobs the merrier, so long as they are at least somewhat
documented and default to good sane values :)
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: RFC: NAPI packet weighting patch
2005-06-03 18:42 ` jamal
@ 2005-06-03 19:01 ` David S. Miller
2005-06-03 19:28 ` Mitch Williams
2005-06-03 19:40 ` jamal
0 siblings, 2 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-03 19:01 UTC (permalink / raw)
To: hadi
Cc: mitch.a.williams, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: jamal <hadi@cyberus.ca>
Date: Fri, 03 Jun 2005 14:42:30 -0400
> When you reduce the weight, the system is spending less time in the
> softirq processing packets before softirq yields. If this gives more
> opportunity to your app to run, then the performance will go up.
> Is this what you are seeing?
Jamal, this is my current theory as well, we hit the jiffies
check.
It is the only logical explanation I can come up with for the
single adapter case.
There are some ways we can mitigate this. Here is one idea
off the top of my head.
When the jiffies check is hit, lower the weight of the most recently
polled device towards some minimum (perhaps divide by two). If we
successfully poll without hitting the jiffies check, make a small
increment of the weight up to some limit.
It is Van Jacobson TCP congestion avoidance applied to NAPI :-)
Just a simple AIMD (Additive Increase, Multiplicative Decrease).
So, hitting the jiffies work limit is congestion, and the cause
of the congestion is the most recently polled device.
In this regime, what the driver currently specifies as "->weight"
is actually the maximum we'll use in the congestion control
algorithm. And we can choose some constant minimum, something
like "8" ought to work well.
Comments?
* Re: RFC: NAPI packet weighting patch
2005-06-03 18:59 ` Ben Greear
@ 2005-06-03 19:02 ` David S. Miller
0 siblings, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-03 19:02 UTC (permalink / raw)
To: greearb
Cc: john.ronciak, Robert.Olsson, jdmason, shemminger, hadi,
mitch.a.williams, netdev, ganesh.venkatesan, jesse.brandeburg
From: Ben Greear <greearb@candelatech.com>
Date: Fri, 03 Jun 2005 11:59:35 -0700
> David S. Miller wrote:
> > In a reply I just sent out to this thread, I postulate that the
> > jiffies check is hitting earlier with a lower weight value, a quick
> > look at /proc/net/softnet_stat during their testing will confirm or
> > deny this theory.
>
> That would basically just decrease the work done in the NAPI poll though,
> so I don't see how that could be the problem, since the 'solution' was to
> force less work to be done.
It allows his application to get onto the CPU faster.
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:01 ` David S. Miller
@ 2005-06-03 19:28 ` Mitch Williams
2005-06-03 19:59 ` jamal
` (2 more replies)
2005-06-03 19:40 ` jamal
1 sibling, 3 replies; 121+ messages in thread
From: Mitch Williams @ 2005-06-03 19:28 UTC (permalink / raw)
To: David S. Miller
Cc: hadi, mitch.a.williams, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Fri, 3 Jun 2005, David S. Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Fri, 03 Jun 2005 14:42:30 -0400
>
> > When you reduce the weight, the system is spending less time in the
> > softirq processing packets before softirq yields. If this gives more
> > opportunity to your app to run, then the performance will go up.
> > Is this what you are seeing?
>
> Jamal, this is my current theory as well, we hit the jiffies
> check.
Well, I hate to mess up your guys' theories, but the real reason is
simpler: hardware receive resources, specifically descriptors and
buffers.
In a typical NAPI polling loop, the driver processes receive packets until
it either hits the quota or runs out of packets. Then, at the end of the
loop, it returns all of those now-free receive resources back to the
hardware.
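In generic terms, the pattern looks something like this (an illustrative
sketch only; the my_* helpers are placeholders, not actual e1000 code):
static int my_poll(struct net_device *netdev, int *budget)
{
	struct my_adapter *adapter = netdev_priv(netdev);
	int work_to_do = min(*budget, netdev->quota);
	int work_done = 0;

	/* Pull up to work_to_do packets off the RX ring... */
	while (work_done < work_to_do && my_rx_ring_has_packet(adapter)) {
		my_process_one_rx_packet(adapter);
		work_done++;
	}

	/* ...and only now return all of the freed descriptors to the NIC. */
	my_alloc_rx_buffers(adapter);

	*budget -= work_done;
	netdev->quota -= work_done;

	if (work_done < work_to_do) {
		netif_rx_complete(netdev);
		my_enable_irqs(adapter);
		return 0;
	}
	return 1;	/* more work pending: stay on the poll list */
}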
With a heavy receive load, the hardware will run out of receive
descriptors in the time it takes the driver/NAPI/stack to process 64
packets. So it drops them on the floor. And, as we know, dropped packets
are A Bad Thing.
By reducing the driver weight, we cause the driver to give receive
resources back to the hardware more often, which prevents dropped packets.
As Ben Greear noticed, increasing the number of descriptors can help with
this issue. But it really can't eliminate the problem -- once the ring
is full, it doesn't matter how big it is, it's still full.
In my testing (Dual 2.8GHz Xeon, PCI-X bus, Gigabit network, 10 clients),
I was able to completely eliminate dropped packets in most cases by
reducing the driver weight down to about 20.
Now for some speculation:
Aside from dropped packets, I saw continued performance gain with even
lower weights, with the sweet spot (on a single adapter) being about 8 to
10. I don't have a definite answer for why this is happening, but my
theory is that it's latency. Packets are processed more often, meaning
they spend less time sitting in hardware-owned buffers, which means they
get to the stack quicker, which means less latency.
But I'm happy to admit I might be wrong with this theory. Nevertheless,
the effect exists, and I've seen it on drivers other than just e1000.
(And, no, I'm not allowed to say which other drivers I've used, or give
specific numbers, or our lawyers will string me up by my toes.)
Anybody else got a theory?
-Mitch
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:01 ` David S. Miller
2005-06-03 19:28 ` Mitch Williams
@ 2005-06-03 19:40 ` jamal
2005-06-03 20:23 ` jamal
1 sibling, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-03 19:40 UTC (permalink / raw)
To: David S. Miller
Cc: mitch.a.williams, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Fri, 2005-03-06 at 12:01 -0700, David S. Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Fri, 03 Jun 2005 14:42:30 -0400
>
> > When you reduce the weight, the system is spending less time in the
> > softirq processing packets before softirq yields. If this gives more
> > opportunity to your app to run, then the performance will go up.
> > Is this what you are seeing?
>
> Jamal, this is my current theory as well, we hit the jiffies
> check.
>
I think you are more than likely right. If we can instrument it Mitch
could check it out. Mitch would you like to try something that will
instrument this? I know i have seen this behavior but it was when i was
playing with some system that had a real small HZ.
> It is the only logical explanation I can come up with for the
> single adapter case.
>
> There are some ways we can mitigate this. Here is one idea
> off the top of my head.
>
> When the jiffies check is hit, lower the weight of the most recently
> polled device towards some minimum (perhaps divide by two). If we
> successfully poll without hitting the jiffies check, make a small
> increment of the weight up to some limit.
>
You probably wanna start high up first until you hit congestion
and then start lowering.
> It is Van Jacobson TCP congestion avoidance applied to NAPI :-)
>
> Just a simple AIMD (Additive Increase, Multiplicative Decrease).
> So, hitting the jiffies work limit is congestion, and the cause
> of the congestion is the most recently polled device.
>
> In this regime, what the driver currently specifies as "->weight"
> is actually the maximum we'll use in the congestion control
> algorithm. And we can choose some constant minimum, something
> like "8" ought to work well.
>
> Comments?
>
In theory it looks good - but i think you end up defeating the fairness
factor. If you can narrow it down to which driver is causing congestion,
and only penalize that driver i think it would work well.
cheers,
jamal
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:29 ` David S. Miller
@ 2005-06-03 19:49 ` Michael Chan
2005-06-03 20:59 ` Lennert Buytenhek
0 siblings, 1 reply; 121+ messages in thread
From: Michael Chan @ 2005-06-03 19:49 UTC (permalink / raw)
To: David S. Miller
Cc: mitch.a.williams, hadi, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Fri, 2005-06-03 at 13:29 -0700, David S. Miller wrote:
> E1000 processes the full QUOTA of RX packets,
> _THEN_ replenishes with new RX buffers. No wonder
> the chip runs out of RX descriptors.
>
> You should replenish _AS_ you grab RX packets
> off the receive queue, just as tg3 does.
Yes, in tg3, rx buffers are replenished and put back into the ring as
completed packets are taken off the ring. But we don't tell the chip
about these new buffers until we get to the end of the loop, potentially
after a full quota of packets. Doesn't this make the end result the same
as e1000?
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:28 ` Mitch Williams
@ 2005-06-03 19:59 ` jamal
2005-06-03 20:31 ` David S. Miller
2005-06-03 20:22 ` David S. Miller
2005-06-03 20:30 ` Ben Greear
2 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-03 19:59 UTC (permalink / raw)
To: Mitch Williams
Cc: David S. Miller, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Fri, 2005-03-06 at 12:28 -0700, Mitch Williams wrote:
>
> On Fri, 3 Jun 2005, David S. Miller wrote:
>
> > From: jamal <hadi@cyberus.ca>
> > Date: Fri, 03 Jun 2005 14:42:30 -0400
> >
> > > When you reduce the weight, the system is spending less time in the
> > > softirq processing packets before softirq yields. If this gives more
> > > opportunity to your app to run, then the performance will go up.
> > > Is this what you are seeing?
> >
> > Jamal, this is my current theory as well, we hit the jiffies
> > check.
>
> Well, I hate to mess up your guys' theories, but the real reason is
> simpler: hardware receive resources, specifically descriptors and
> buffers.
>
> In a typical NAPI polling loop, the driver processes receive packets until
> it either hits the quota or runs out of packets. Then, at the end of the
> loop, it returns all of those now-free receive resources back to the
> hardware.
>
> With a heavy receive load, the hardware will run out of receive
> descriptors in the time it takes the driver/NAPI/stack to process 64
> packets. So it drops them on the floor. And, as we know, dropped packets
> are A Bad Thing.
>
> By reducing the driver weight, we cause the driver to give receive
> resources back to the hardware more often, which prevents dropped packets.
>
> As Ben Greear noticed, increasing the number of descriptors can help with
> this issue. But it really can't eliminate the problem -- once the ring
> is full, it doesn't matter how big it is, it's still full.
>
> In my testing (Dual 2.8GHz Xeon, PCI-X bus, Gigabit network, 10 clients),
> I was able to completely eliminate dropped packets in most cases by
> reducing the driver weight down to about 20.
>
> Now for some speculation:
>
What you said above is unfortunately also speculation ;->
But one that you could validate by putting proper hooks. As an example,
try to restore a descriptor every time you pick one - for an example of
this look at the sb1250 driver.
cheers,
jamal
* RE: RFC: NAPI packet weighting patch
2005-06-03 18:19 Ronciak, John
2005-06-03 18:33 ` Ben Greear
@ 2005-06-03 20:17 ` Robert Olsson
2005-06-03 20:30 ` David S. Miller
1 sibling, 1 reply; 121+ messages in thread
From: Robert Olsson @ 2005-06-03 20:17 UTC (permalink / raw)
To: Ronciak, John
Cc: Robert Olsson, David S. Miller, jdmason, shemminger, hadi,
Williams, Mitch A, netdev, Venkatesan, Ganesh, Brandeburg, Jesse
Ronciak, John writes:
> With the same system (fairly high end with nothing major running on it)
> we got rid of the dropped frames by just reducing the weight from 64. So
> the weight did have something to do with the dropped frames. Maybe
> other factors as well, but in static tests like this it sure looks like
> the 64 value is wrong in some cases.
It is possible that a lower weight forced your driver to disable interrupts
and do packet reception w/o interrupts; often this is more efficient, as
we get rid of intr. latency etc.
Again, I think weight should only be used for fairness and not to control
the threshold for when to disable interrupts.
You can test with a new policy in e1000_clean so you schedule for a new
poll if work_done (any pkts received) or tx_cleaned is true.
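Roughly like this (an approximate, untested sketch; the surrounding code is
from memory and may not match the driver exactly):
static int e1000_clean(struct net_device *netdev, int *budget)
{
	struct e1000_adapter *adapter = netdev_priv(netdev);
	int work_to_do = min(*budget, netdev->quota);
	int work_done = 0, tx_cleaned;

	tx_cleaned = e1000_clean_tx_irq(adapter);
	e1000_clean_rx_irq(adapter, &work_done, work_to_do);

	*budget -= work_done;
	netdev->quota -= work_done;

	/* New policy: keep polling as long as *any* RX or TX work was
	 * done; only fall back to interrupts when a pass finds nothing. */
	if ((work_done || tx_cleaned) && netif_running(netdev))
		return 1;

	netif_rx_complete(netdev);
	e1000_irq_enable(adapter);
	return 0;
}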
Cheers.
--ro
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:28 ` Mitch Williams
2005-06-03 19:59 ` jamal
@ 2005-06-03 20:22 ` David S. Miller
2005-06-03 20:29 ` David S. Miller
2005-06-03 20:30 ` Ben Greear
2 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-03 20:22 UTC (permalink / raw)
To: mitch.a.williams
Cc: hadi, john.ronciak, jdmason, shemminger, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
From: Mitch Williams <mitch.a.williams@intel.com>
Date: Fri, 3 Jun 2005 12:28:10 -0700
> In a typical NAPI polling loop, the driver processes receive packets until
> it either hits the quota or runs out of packets. Then, at the end of the
> loop, it returns all of those now-free receive resources back to the
> hardware.
>
> With a heavy receive load, the hardware will run out of receive
> descriptors in the time it takes the driver/NAPI/stack to process 64
> packets. So it drops them on the floor. And, as we know, dropped packets
> are A Bad Thing.
This is why you should replenish RX packets _IN_ your
RX packet receive processing, not via some tasklet
or other separate work processing context.
No wonder I never see this on tg3.
It is the only way to do this cleanly.
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:40 ` jamal
@ 2005-06-03 20:23 ` jamal
2005-06-03 20:28 ` Mitch Williams
0 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-03 20:23 UTC (permalink / raw)
To: David S. Miller
Cc: mitch.a.williams, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Fri, 2005-03-06 at 15:40 -0400, jamal wrote:
> On Fri, 2005-03-06 at 12:01 -0700, David S. Miller wrote:
>
> I think you are more than likely right. If we can instrument it Mitch
> could check it out. Mitch would you like to try something that will
> instrument this? I know i have seen this behavior but it was when i was
> playing with some system that had a real small HZ.
>
Sorry, it's already there, as Dave said in his email.
Look for time_squeeze. It's the column I labeled XXXX below.
-----
$ cat /proc/net/softnet_stat
0000f938 00000000 XXXXXXX 00000000 00000000 00000000 00000000 00000000 00000000
------
cheers,
jamal
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:23 ` jamal
@ 2005-06-03 20:28 ` Mitch Williams
0 siblings, 0 replies; 121+ messages in thread
From: Mitch Williams @ 2005-06-03 20:28 UTC (permalink / raw)
To: jamal
Cc: David S. Miller, Williams, Mitch A, Ronciak, John, jdmason,
shemminger, netdev, Robert.Olsson, Venkatesan, Ganesh,
Brandeburg, Jesse
On Fri, 3 Jun 2005, jamal wrote:
>
> Sorry, it's already there, as Dave said in his email.
> Look for time_squeeze. It's the column I labeled XXXX below.
>
> -----
> $ cat /proc/net/softnet_stat
> 0000f938 00000000 XXXXXXX 00000000 00000000 00000000 00000000 00000000 00000000
> ------
I might not be able to get into the lab today (they keep making me do
work!), but I should be able to pop in Monday and take a look. Shouldn't
take too long.
Thanks,
Mitch
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:22 ` David S. Miller
@ 2005-06-03 20:29 ` David S. Miller
2005-06-03 19:49 ` Michael Chan
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-03 20:29 UTC (permalink / raw)
To: mitch.a.williams
Cc: hadi, john.ronciak, jdmason, shemminger, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
From: "David S. Miller" <davem@davemloft.net>
Date: Fri, 03 Jun 2005 13:22:57 -0700 (PDT)
> This is why you should replenish RX packets _IN_ your
> RX packet receive processing, not via some tasklet
> or other seperate work processing context.
>
> No wonder I never see this on tg3.
Actually, the problem is slightly different.
E1000 processes the full QUOTA of RX packets,
_THEN_ replenishes with new RX buffers. No wonder
the chip runs out of RX descriptors.
You should replenish _AS_ you grab RX packets
off the receive queue, just as tg3 does. This
allows you to accomplish two things:
1) Keep up with the chip so that it does not starve,
regardless of dev->weight setting or system load.
2) Make intelligent decisions when RX buffer allocation
fails. When we look at a RX descriptor in tg3 we
never leave the descriptor empty.
If replacement RX buffer allocation fails, we simply ignore the
RX packet we're looking at and give it back to the
chip. Every driver should implement this policy.
Drivers that do not do things this way run into all
kinds of RX ring chip starvation issues like the ones
you are seeing here.
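In driver-generic terms, the per-descriptor policy is (sketch only; the
my_* helpers are placeholders):
static void my_rx_one_descriptor(struct my_adapter *adapter, unsigned int idx)
{
	struct sk_buff *skb = adapter->rx_skb[idx];
	struct sk_buff *new_skb = my_alloc_rx_skb(adapter);

	if (!new_skb) {
		/* Allocation failed: drop this packet by recycling its
		 * buffer straight back into the descriptor, so the ring
		 * entry is never left empty and the chip cannot starve. */
		my_attach_buffer_to_desc(adapter, idx, skb);
		return;
	}

	/* Refill the descriptor immediately, then hand the old buffer
	 * up the stack. */
	my_attach_buffer_to_desc(adapter, idx, new_skb);
	netif_receive_skb(skb);
}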
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:17 ` Robert Olsson
@ 2005-06-03 20:30 ` David S. Miller
0 siblings, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-03 20:30 UTC (permalink / raw)
To: Robert.Olsson
Cc: john.ronciak, jdmason, shemminger, hadi, mitch.a.williams, netdev,
ganesh.venkatesan, jesse.brandeburg
From: Robert Olsson <Robert.Olsson@data.slu.se>
Date: Fri, 3 Jun 2005 22:17:31 +0200
> It is possible that a lower weight forced your driver to disable interrupts
> and do packet reception w/o interrupts; often this is more efficient, as
> we get rid of intr. latency etc.
>
> Again, I think weight should only be used for fairness and not to control
> the threshold for when to disable interrupts.
>
> You can test with a new policy in e1000_clean so you schedule for a new
> poll if work_done (any pkts received) or tx_cleaned is true.
I don't think this is it. What's happening is that E1000
pulls up to a full dev->quota of packets off the ring,
and _THEN_ goes back and does RX buffer replenishing.
It is very clear why E1000 runs out of RX descriptors with
this kind of policy.
I outlined a way to fix this in the E1000 driver in another
email.
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:28 ` Mitch Williams
2005-06-03 19:59 ` jamal
2005-06-03 20:22 ` David S. Miller
@ 2005-06-03 20:30 ` Ben Greear
2 siblings, 0 replies; 121+ messages in thread
From: Ben Greear @ 2005-06-03 20:30 UTC (permalink / raw)
To: Mitch Williams
Cc: David S. Miller, hadi, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
Mitch Williams wrote:
>
> On Fri, 3 Jun 2005, David S. Miller wrote:
>
>
>>From: jamal <hadi@cyberus.ca>
>>Date: Fri, 03 Jun 2005 14:42:30 -0400
>>
>>
>>>When you reduce the weight, the system is spending less time in the
>>>softirq processing packets before softirq yields. If this gives more
>>>opportunity to your app to run, then the performance will go up.
>>>Is this what you are seeing?
>>
>>Jamal, this is my current theory as well, we hit the jiffies
>>check.
>
>
> Well, I hate to mess up your guys' theories, but the real reason is
> simpler: hardware receive resources, specifically descriptors and
> buffers.
>
> In a typical NAPI polling loop, the driver processes receive packets until
> it either hits the quota or runs out of packets. Then, at the end of the
> loop, it returns all of those now-free receive resources back to the
> hardware.
>
> With a heavy receive load, the hardware will run out of receive
> descriptors in the time it takes the driver/NAPI/stack to process 64
> packets. So it drops them on the floor. And, as we know, dropped packets
> are A Bad Thing.
If it can fill up more than 190 RX descriptors in the time it takes NAPI
to pull 64, then there is no possible way to not drop packets! How could
NAPI ever keep up if what you say is true?
> By reducing the driver weight, we cause the driver to give receive
> resources back to the hardware more often, which prevents dropped packets.
>
> As Ben Greear noticed, increasing the number of descriptors can help with
> this issue. But it really can't eliminate the problem -- once the ring
> is full, it doesn't matter how big it is, it's still full.
If you have 1024 rx descriptors, and the NAPI poll pulls off 64 at one
time, I do not see how pulling off 20 could be any more useful. Either way,
you have more than 900 other RX descriptors to be received.
Even if you only have the default of 256, the NIC should be able to continue
receiving packets with the other 190 or so descriptors while NAPI is doing
its receive poll. If the buffers are often nearly used up, then the problem
is that the NAPI poll cannot pull the packets fast enough, and again, I do not
see how making it do more polls could make it able to pull packets from the
NIC more efficiently.
Maybe you could instrument the NAPI receive logic to
see if there is some horrible waste of CPU and/or time when it tries to pull
larger amounts of packets at once? A linear increase in work cannot explain
what you are describing.
> In my testing (Dual 2.8GHz Xeon, PCI-X bus, Gigabit network, 10 clients),
> I was able to completely eliminate dropped packets in most cases by
> reducing the driver weight down to about 20.
At least tell us what type of traffic you are using: TCP with MTU-sized
packets, or a traffic generator with 60-byte packets? Actual speed that you
are running (aggregate)? Full-duplex traffic, or mostly uni-directional?
Packets per second you are receiving & transmitting when the drops occur?
On a dual 2.8GHz Xeon system with PCI-X bus, with a quad-port Intel PRO/1000
NIC I can run about 950Mbps of traffic, bi-directional, on two ports at the
same time, and drop few or no packets. (MTU-sized packets here).
This is using a modified version of pktgen, btw. So, if you are seeing
any amount of dropped pkts on a single NIC, especially if you are mostly
doing uni-directional traffic, then I think the problem might be elsewhere,
because the stock 2.6.11 and similar kernels can easily handle this amount
of network traffic.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:59 ` jamal
@ 2005-06-03 20:31 ` David S. Miller
2005-06-03 21:12 ` Jon Mason
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-03 20:31 UTC (permalink / raw)
To: hadi
Cc: mitch.a.williams, john.ronciak, jdmason, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: jamal <hadi@cyberus.ca>
Date: Fri, 03 Jun 2005 15:59:31 -0400
> But one that you could validate by putting proper hooks. As an example,
> try to restore a descriptor every time you pick one - for an example of
> this look at the sb1250 driver.
Yes, this in my mind is exactly the problem. TG3 does this
properly, as do several other drivers.
You should never defer RX buffer replenishment, you should
always do it as you grab packets off of the ring. You will
starve the chip otherwise.
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:59 ` Lennert Buytenhek
@ 2005-06-03 20:35 ` Michael Chan
2005-06-03 22:29 ` jamal
` (2 more replies)
2005-06-03 21:07 ` Edgar E Iglesias
1 sibling, 3 replies; 121+ messages in thread
From: Michael Chan @ 2005-06-03 20:35 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: David S. Miller, mitch.a.williams, hadi, john.ronciak, jdmason,
shemminger, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
On Fri, 2005-06-03 at 22:59 +0200, Lennert Buytenhek wrote:
> On Fri, Jun 03, 2005 at 12:49:29PM -0700, Michael Chan wrote:
>
> > Yes, in tg3, rx buffers are replenished and put back into the ring
> > as completed packets are taken off the ring. But we don't tell the
> > chip about these new buffers until we get to the end of the loop,
> > potentially after a full quota of packets.
>
> Which makes a lot more sense, since you'd rather do one MMIO write
> at the end of the loop than one per iteration, especially if your
> MMIO read (flush) latency is high. (Any subsequent MMIO read will
> have to flush out all pending writes, which'll be slow if there's
> a lot of writes still in the queue.)
>
I agree on the merit of issuing only one IO at the end. What I'm saying
is that doing so will make it similar to e1000 where all the buffers are
replenished at the end. Isn't that so or am I missing something?
By the way, in tg3 there is a buffer replenishment threshold programmed
to the chip and is currently set at rx_pending / 8 (200/8 = 25). This
means that the chip will replenish 25 rx buffers at a time.
* Re: RFC: NAPI packet weighting patch
2005-06-03 19:49 ` Michael Chan
@ 2005-06-03 20:59 ` Lennert Buytenhek
2005-06-03 20:35 ` Michael Chan
2005-06-03 21:07 ` Edgar E Iglesias
0 siblings, 2 replies; 121+ messages in thread
From: Lennert Buytenhek @ 2005-06-03 20:59 UTC (permalink / raw)
To: Michael Chan
Cc: David S. Miller, mitch.a.williams, hadi, john.ronciak, jdmason,
shemminger, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
On Fri, Jun 03, 2005 at 12:49:29PM -0700, Michael Chan wrote:
> > E1000 processes the full QUOTA of RX packets,
> > _THEN_ replenishes with new RX buffers. No wonder
> > the chip runs out of RX descriptors.
> >
> > You should replenish _AS_ you grab RX packets
> > off the receive queue, just as tg3 does.
>
> Yes, in tg3, rx buffers are replenished and put back into the ring
> as completed packets are taken off the ring. But we don't tell the
> chip about these new buffers until we get to the end of the loop,
> potentially after a full quota of packets.
Which makes a lot more sense, since you'd rather do one MMIO write
at the end of the loop than one per iteration, especially if your
MMIO read (flush) latency is high. (Any subsequent MMIO read will
have to flush out all pending writes, which'll be slow if there's
a lot of writes still in the queue.)
--L
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:59 ` Lennert Buytenhek
2005-06-03 20:35 ` Michael Chan
@ 2005-06-03 21:07 ` Edgar E Iglesias
2005-06-03 23:30 ` Lennert Buytenhek
1 sibling, 1 reply; 121+ messages in thread
From: Edgar E Iglesias @ 2005-06-03 21:07 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Michael Chan, David S. Miller, mitch.a.williams, hadi,
john.ronciak, jdmason, shemminger, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
On Fri, Jun 03, 2005 at 10:59:45PM +0200, Lennert Buytenhek wrote:
> On Fri, Jun 03, 2005 at 12:49:29PM -0700, Michael Chan wrote:
>
> > > E1000 processes the full QUOTA of RX packets,
> > > _THEN_ replenishes with new RX buffers. No wonder
> > > the chip runs out of RX descriptors.
> > >
> > > You should replenish _AS_ you grab RX packets
> > > off the receive queue, just as tg3 does.
> >
> > Yes, in tg3, rx buffers are replenished and put back into the ring
> > as completed packets are taken off the ring. But we don't tell the
> > chip about these new buffers until we get to the end of the loop,
> > potentially after a full quota of packets.
>
> Which makes a lot more sense, since you'd rather do one MMIO write
> at the end of the loop than one per iteration, especially if your
> MMIO read (flush) latency is high. (Any subsequent MMIO read will
> have to flush out all pending writes, which'll be slow if there's
> a lot of writes still in the queue.)
>
>
> --L
Maybe it would be better to put a fixed weight at this level and return
the descriptors to the HW after every X packets. That way you
can keep the NAPI weight at 64 (or whatever) and still give back
descriptors to HW more often.
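As a fragment of the poll loop, something like (sketch only; REFILL_BATCH
and the my_* helpers are illustrative):
#define REFILL_BATCH	16

	while (work_done < work_to_do && my_rx_ring_has_packet(adapter)) {
		my_process_one_rx_packet(adapter);
		work_done++;

		/* Hand a batch of freed descriptors back mid-poll with a
		 * single MMIO producer-index write per REFILL_BATCH packets,
		 * while the NAPI weight itself stays at 64. */
		if ((work_done & (REFILL_BATCH - 1)) == 0)
			my_write_rx_producer(adapter);
	}

	/* Flush whatever is left over at the end of the loop. */
	my_write_rx_producer(adapter);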
Best regards
--
Programmer
Edgar E Iglesias <edgar@axis.com> 46.46.272.1946
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:31 ` David S. Miller
@ 2005-06-03 21:12 ` Jon Mason
0 siblings, 0 replies; 121+ messages in thread
From: Jon Mason @ 2005-06-03 21:12 UTC (permalink / raw)
To: David S. Miller
Cc: hadi, mitch.a.williams, john.ronciak, shemminger, netdev,
Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
On Friday 03 June 2005 03:31 pm, David S. Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Fri, 03 Jun 2005 15:59:31 -0400
>
> > But one that you could validate by putting proper hooks. As an example,
> > try to restore a descriptor every time you pick one - for an example of
> > this look at the sb1250 driver.
>
> Yes, this in my mind is exactly the problem. TG3 does this
> properly, as do several other drivers.
>
> You should never defer RX buffer replenishment, you should
> always do it as you grab packets off of the ring. You will
> starve the chip otherwise.
e1000 isn't the only driver to do things this way; r8169, via-velocity, dl2k,
and skge do as well (and I'm sure many more). It might be nice to perform a
driver audit to see which drivers do this.
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:35 ` Michael Chan
@ 2005-06-03 22:29 ` jamal
2005-06-04 0:25 ` Michael Chan
2005-06-03 23:26 ` Lennert Buytenhek
2005-06-05 20:11 ` David S. Miller
2 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-03 22:29 UTC (permalink / raw)
To: Michael Chan
Cc: Lennert Buytenhek, David S. Miller, mitch.a.williams,
john.ronciak, jdmason, shemminger, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
On Fri, 2005-03-06 at 13:35 -0700, Michael Chan wrote:
> On Fri, 2005-06-03 at 22:59 +0200, Lennert Buytenhek wrote:
> > Which makes a lot more sense, since you'd rather do one MMIO write
> > at the end of the loop than one per iteration, especially if your
> > MMIO read (flush) latency is high. (Any subsequent MMIO read will
> > have to flush out all pending writes, which'll be slow if there's
> > a lot of writes still in the queue.)
> >
> I agree on the merit of issuing only one IO at the end. What I'm saying
> is that doing so will make it similar to e1000 where all the buffers are
> replenished at the end. Isn't that so or am I missing something?
>
I think the main issue would be a lot less CPU used in your case
(because of the single MMIO).
> By the way, in tg3 there is a buffer replenishment threshold programmed
> to the chip and is currently set at rx_pending / 8 (200/8 = 25). This
> means that the chip will replenish 25 rx buffers at a time.
>
So when you write the MMIO, 25 buffers are replenished or is this auto
magically happening in the background? Sounds like a neat feature either
way.
cheers,
jamal
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:35 ` Michael Chan
2005-06-03 22:29 ` jamal
@ 2005-06-03 23:26 ` Lennert Buytenhek
2005-06-05 20:11 ` David S. Miller
2 siblings, 0 replies; 121+ messages in thread
From: Lennert Buytenhek @ 2005-06-03 23:26 UTC (permalink / raw)
To: Michael Chan
Cc: David S. Miller, mitch.a.williams, hadi, john.ronciak, jdmason,
shemminger, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
On Fri, Jun 03, 2005 at 01:35:22PM -0700, Michael Chan wrote:
> > > Yes, in tg3, rx buffers are replenished and put back into the ring
> > > as completed packets are taken off the ring. But we don't tell the
> > > chip about these new buffers until we get to the end of the loop,
> > > potentially after a full quota of packets.
> >
> > Which makes a lot more sense, since you'd rather do one MMIO write
> > at the end of the loop than one per iteration, especially if your
> > MMIO read (flush) latency is high. (Any subsequent MMIO read will
> > have to flush out all pending writes, which'll be slow if there's
> > a lot of writes still in the queue.)
>
> I agree on the merit of issuing only one IO at the end. What I'm saying
> is that doing so will make it similar to e1000 where all the buffers are
> replenished at the end. Isn't that so or am I missing something?
I think you're right: for e1000 as well as tg3, the NIC cannot use
the new RX buffers until the CPU breaks out of the poll loop.
I don't understand why reducing the weight apparently makes the e1000
go faster. Perhaps as Robert said, the RX ring is not big enough and
that's why feeding RX buffers back to the chip more aggressively might
help prevent overruns?
I would say that running with a N+64-entry RX ring and a weight of 64
should not show any worse behavior than running with a N+16-entry RX
ring with a weight of 16. If anything, weight=64 should show _better_
performance than weight=16. Something else must be going on.
--L
* Re: RFC: NAPI packet weighting patch
2005-06-03 21:07 ` Edgar E Iglesias
@ 2005-06-03 23:30 ` Lennert Buytenhek
0 siblings, 0 replies; 121+ messages in thread
From: Lennert Buytenhek @ 2005-06-03 23:30 UTC (permalink / raw)
To: Edgar E Iglesias
Cc: Michael Chan, David S. Miller, mitch.a.williams, hadi,
john.ronciak, jdmason, shemminger, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
On Fri, Jun 03, 2005 at 11:07:01PM +0200, Edgar E Iglesias wrote:
> > > Yes, in tg3, rx buffers are replenished and put back into the ring
> > > as completed packets are taken off the ring. But we don't tell the
> > > chip about these new buffers until we get to the end of the loop,
> > > potentially after a full quota of packets.
> >
> > Which makes a lot more sense, since you'd rather do one MMIO write
> > at the end of the loop than one per iteration, especially if your
> > MMIO read (flush) latency is high. (Any subsequent MMIO read will
> > have to flush out all pending writes, which'll be slow if there's
> > a lot of writes still in the queue.)
>
> Maybe it would be better to put a fixed weight at this level and return
> the descriptors to the HW after every X packets. That way you
> can keep the NAPI weight at 64 (or whatever) and still give back
> descriptors to HW more often.
For this scheme to make any difference at all, the RX ring must be
overflowing in the case where we refill the RX ring only once every
64 packets.
If the RX ring _is_ overflowing but the system is otherwise capable of
keeping up with the receive rate (i.e. the packet service times as seen
by the NIC have a high variance), simply make the RX ring bigger.
I don't see what's going on.
--L
* Re: RFC: NAPI packet weighting patch
2005-06-03 22:29 ` jamal
@ 2005-06-04 0:25 ` Michael Chan
2005-06-05 21:36 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: Michael Chan @ 2005-06-04 0:25 UTC (permalink / raw)
To: hadi
Cc: Lennert Buytenhek, David S. Miller, mitch.a.williams,
john.ronciak, jdmason, shemminger, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
On Fri, 2005-06-03 at 18:29 -0400, jamal wrote:
> On Fri, 2005-03-06 at 13:35 -0700, Michael Chan wrote:
>
> > By the way, in tg3 there is a buffer replenishment threshold programmed
> > to the chip and is currently set at rx_pending / 8 (200/8 = 25). This
> > means that the chip will replenish 25 rx buffers at a time.
> >
>
> So when you write the MMIO, 25 buffers are replenished or is this auto
> magically happening in the background? Sounds like a neat feature either
> way.
>
The MMIO writes a cumulative producer index of new rx descriptors in the
ring. As the chip requires new buffers for rx packets, it will DMA 25 of
these rx descriptors at a time up to the producer index.
* Re: RFC: NAPI packet weighting patch
2005-06-03 20:35 ` Michael Chan
2005-06-03 22:29 ` jamal
2005-06-03 23:26 ` Lennert Buytenhek
@ 2005-06-05 20:11 ` David S. Miller
2 siblings, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-05 20:11 UTC (permalink / raw)
To: mchan
Cc: buytenh, mitch.a.williams, hadi, john.ronciak, jdmason,
shemminger, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
From: "Michael Chan" <mchan@broadcom.com>
Date: Fri, 03 Jun 2005 13:35:22 -0700
> I agree on the merit of issuing only one IO at the end. What I'm saying
> is that doing so will make it similar to e1000 where all the buffers are
> replenished at the end. Isn't that so or am I missing something?
You're totally right. I guess we don't see the e1000 behavior
due to any of the following:
1) we set the RX ring sizes larger by default
2) we set it larger than what the e1000 tests were done with
3) we process the RX ring faster and thus the chip can't catch up
and exhaust the ring
We use a default of 200 in tg3, and e1000 seems to use a default
of 256.
This actually points more to the fact that what you're actually
doing to process the packet has a huge influence on whether
the chip can catch up and exhaust the RX ring. How much
software work does the netif_receive_skb() call entail, on
average, for the given workload?
That is why the exact test being run is important in analyzing
reports such as these. If you're doing a TCP transfer, then
netif_receive_skb() can be _VERY_ expensive per-call. If, on
the other hand, you're routing tiny 64-byte packets or responding
to simple ICMP echo requests, the per-call cost can be significantly
lower.
* Re: RFC: NAPI packet weighting patch
2005-06-04 0:25 ` Michael Chan
@ 2005-06-05 21:36 ` David S. Miller
2005-06-06 6:43 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-05 21:36 UTC (permalink / raw)
To: mchan
Cc: hadi, buytenh, mitch.a.williams, john.ronciak, jdmason,
shemminger, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
To illustrate my most recent point (that packet processing cost
on RX is variable, and at times highly so) I made some hacks
to the tg3 driver to record how many system clock ticks each
netif_receive_skb() call consumed.
This clock on my sparc64 box updates at a rate of 12MHZ and
is used for system time keeping.
Anyways, here is a log from a stream transfer to this system. So the
packet trace is heavily TCP receive bound. Here is a sample from
this. I take a tick sample before the netif_receive_skb() call, take
one afterwards, and record the difference between the two:
[ 52 73 41 65 38 61 58 63 37 62 36 62 50 74 38 64 ]
[ 37 63 39 62 36 64 36 61 50 75 38 64 39 65 37 62 ]
[ 36 60 36 62 50 76 39 67 38 63 35 62 35 64 35 62 ]
[ 62 74 41 65 37 62 37 63 36 61 39 62 52 75 38 66 ]
[ 37 63 35 61 38 62 36 60 49 75 38 64 37 62 36 66 ]
[ 42 62 36 62 48 76 38 64 35 62 40 63 36 60 36 63 ]
[ 49 76 36 64 35 64 38 64 37 61 36 62 60 74 37 80 ]
[ 43 69 36 65 36 62 37 62 54 77 42 66 37 64 35 60 ]
[ 36 61 38 62 51 75 40 64 35 62 36 61 37 61 39 61 ]
[ 51 76 38 64 35 63 36 63 38 62 37 63 49 76 39 64 ]
[ 35 64 35 64 38 62 36 62 61 85 42 65 38 79 38 62 ]
[ 36 61 35 64 49 77 37 63 38 64 36 60 37 62 36 60 ]
[ 51 76 38 66 38 62 37 63 36 62 37 60 50 77 41 64 ]
[ 36 60 36 60 36 61 37 61 50 78 39 66 37 63 36 62 ]
[ 36 61 39 63 60 74 38 66 37 61 35 63 37 65 36 65 ]
[ 48 76 38 65 36 64 41 64 36 60 35 61 49 76 39 66 ]
[ 36 64 39 60 37 60 36 59 51 73 37 64 40 64 36 62 ]
[ 37 61 35 62 50 78 39 67 38 63 35 61 36 63 36 61 ]
[ 66 75 41 66 37 65 36 61 36 62 38 63 50 75 38 65 ]
[ 37 63 36 62 38 63 36 63 49 76 38 64 38 63 40 64 ]
[ 35 63 36 60 50 74 39 65 37 65 38 62 36 62 36 60 ]
[ 51 75 37 66 39 65 37 62 37 62 38 61 67 72 39 65 ]
[ 37 62 35 61 37 61 54 63 53 75 42 67 35 63 36 61 ]
[ 36 65 39 62 53 75 38 64 36 63 35 62 38 63 36 61 ]
[ 49 77 39 66 38 62 36 62 38 61 35 59 83 91 77 25 ]
[ 22 22 22 24 21 21 21 20 21 35 67 24 50 47 67 39 ]
[ 65 34 65 36 63 65 74 38 64 35 64 37 63 37 62 36 ]
[ 61 51 75 38 67 39 63 35 64 37 62 36 61 50 74 37 ]
[ 66 37 62 35 63 35 61 36 65 52 76 40 65 38 61 37 ]
[ 62 36 61 40 64 63 71 40 62 36 64 36 63 36 61 39 ]
[ 62 49 76 37 65 36 62 36 61 38 65 41 64 50 75 39 ]
[ 67 37 62 37 63 36 62 38 61 69 153 70 140 200 737 67 ]
Notice how the packet trace seems to bounce back and
forth between taking ~30 ticks and taking ~60 ticks?
The ~60 tick packets are the TCP data packets that
make us output an ACK packet, which costs roughly
double what it takes to process a TCP data packet
for which we do not immediately generate an ACK.
It pretty much shows that, to implement this stuff properly,
we need something other than a flat packet "COUNT" to represent
the NAPI weight; we should instead try to measure the real "work"
actually consumed, via some kind of time measurement and limit.
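As a very rough sketch of what a time-budgeted poll routine might look like (hypothetical helper names and budget value; the patch below only adds the measurement side, not this):

/* Sketch only: a dev->poll() that limits itself by elapsed time as well as
 * by packet count.  TIME_BUDGET_TICKS and example_rx_one() are made up;
 * example_rx_one() is assumed to process one RX descriptor and return 0
 * when the ring is empty.
 */
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/timex.h>

#define TIME_BUDGET_TICKS 3000          /* made-up limit, in get_cycles() units */

static int example_rx_one(struct net_device *netdev);   /* made-up helper */

static int example_time_budgeted_poll(struct net_device *netdev, int *budget)
{
        unsigned long start = get_cycles();
        int work_to_do = min(*budget, netdev->quota);
        int work_done = 0;
        int ring_empty = 0;

        while (work_done < work_to_do) {
                if (!example_rx_one(netdev)) {
                        ring_empty = 1;         /* ran out of packets */
                        break;
                }
                work_done++;
                if (get_cycles() - start > TIME_BUDGET_TICKS)
                        break;                  /* ran out of time, not packets */
        }

        *budget -= work_done;
        netdev->quota -= work_done;

        /* netif_rx_complete()/interrupt re-enabling omitted from this sketch */
        return !ring_empty;     /* nonzero: please poll us again */
}

The tick budget would of course have to be tuned, or derived from the clock rate, rather than being a hard-coded number like this.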
BTW, here is the patch implementing this stuff.
--- ./drivers/net/tg3.c.~1~ 2005-06-03 11:13:14.000000000 -0700
+++ ./drivers/net/tg3.c 2005-06-05 14:16:32.000000000 -0700
@@ -2836,7 +2836,17 @@ static int tg3_rx(struct tg3 *tp, int bu
desc->err_vlan & RXD_VLAN_MASK);
} else
#endif
+ {
+ unsigned long t = get_cycles();
+ unsigned int ent;
+
netif_receive_skb(skb);
+ t = get_cycles() - t;
+
+ ent = tp->rx_log_ent;
+ tp->rx_log[ent] = (u32) t;
+ tp->rx_log_ent = ((ent + 1) & RX_LOG_MASK);
+ }
tp->dev->last_rx = jiffies;
received++;
@@ -6609,6 +6619,28 @@ static struct net_device_stats *tg3_get_
stats->rx_crc_errors = old_stats->rx_crc_errors +
calc_crc_errors(tp);
+ /* XXX Yes, I know, do this right. :-) */
+ {
+ unsigned int ent, pos;
+
+ printk("TG3: RX LOG, current ent[%d]\n", tp->rx_log_ent);
+ ent = tp->rx_log_ent - 512;
+ pos = 0;
+ while (ent != tp->rx_log_ent) {
+ if (!pos) printk("[ ");
+
+ printk("%u ", tp->rx_log[ent]);
+
+ if (++pos >= 16) {
+ printk("]\n");
+ pos = 0;
+ }
+ ent = (ent + 1) & RX_LOG_MASK;
+ }
+ if (pos != 0)
+ printk("]\n");
+ }
+
return stats;
}
--- ./drivers/net/tg3.h.~1~ 2005-06-03 11:13:14.000000000 -0700
+++ ./drivers/net/tg3.h 2005-06-05 14:16:00.000000000 -0700
@@ -2232,6 +2232,11 @@ struct tg3 {
#define SST_25VF0X0_PAGE_SIZE 4098
struct ethtool_coalesce coal;
+
+#define RX_LOG_SIZE (1 << 14)
+#define RX_LOG_MASK (RX_LOG_SIZE - 1)
+ unsigned int rx_log_ent;
+ u32 rx_log[RX_LOG_SIZE];
};
#endif /* !(_T3_H) */
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-05 21:36 ` David S. Miller
@ 2005-06-06 6:43 ` David S. Miller
0 siblings, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-06 6:43 UTC (permalink / raw)
To: mchan
Cc: hadi, buytenh, mitch.a.williams, john.ronciak, jdmason,
shemminger, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
From: "David S. Miller" <davem@davemloft.net>
Date: Sun, 05 Jun 2005 14:36:53 -0700 (PDT)
> BTW, here is the patch implementing this stuff.
A new patch and some more data.
When we go to gigabit, and NAPI kicks in, the first RX
packet costs a lot (cache misses etc.) but the rest are
very efficient to process. I suspect this only holds
for the single socket case, and on a real system processing
many connections the cost drop might not be so clean.
The log output format is:
(TX_TICKS:RX_TICKS[ RX_TICK1 RX_TICK2 RX_TICK3 ... ])
Here is an example trace from a single socket TCP stream
send over gigabit:
(9:112[ 26 8 7 8 7 ])
(6:110[ 23 8 8 8 7 ])
(7:57[ 26 8 ])
(6:117[ 25 8 9 7 7 ])
(5:37[ 26 ])
(6:113[ 28 8 7 8 7 ])
(0:20[ 9 ])
(8:111[ 27 7 7 8 7 ])
(5:109[ 25 8 8 8 7 ])
(8:113[ 25 7 8 9 7 ])
(6:108[ 25 8 7 7 7 ])
(8:88[ 26 8 8 7 ])
(6:109[ 25 7 7 7 7 ])
(6:111[ 25 9 8 7 7 ])
(0:48[ 9 5 ])
This kind of trace reiterates some things we already know.
For example, mitigation (HW, SW, or a combination of both)
helps because processing multiple packets lets us "reuse"
the CPU cache priming that the handling of the first packet
achieves for us.
It would be great to stick something like this into the e1000
driver, and get some output from it with Intel's single NIC
performance degradation test case.
It is also necessary for the Intel folks to say whether the
NIC is running out of RX descriptors in the single NIC
case with dev->weight set to the default of 64. If so, does
increasing the RX ring size to a larger value via ethtool
help? If not, then why in the world are things running more
slowly?
I've got a crappy 1.5GHz sparc64 box in my tg3 tests here, and it can
handle gigabit line rate with much CPU to spare. So either Intel is
doing something other than TCP stream tests, or something else is out
of whack.
I even tried to do things like having a memory touching program
run in parallel with the TCP stream test, and this did not make
the timing numbers in the logs increase much at all.
--- ./drivers/net/tg3.c.~1~ 2005-06-03 11:13:14.000000000 -0700
+++ ./drivers/net/tg3.c 2005-06-05 23:21:11.000000000 -0700
@@ -2836,7 +2836,22 @@ static int tg3_rx(struct tg3 *tp, int bu
desc->err_vlan & RXD_VLAN_MASK);
} else
#endif
+ {
+ unsigned long t = get_cycles();
+ struct tg3_poll_log_ent *lp;
+ unsigned int ent;
+
netif_receive_skb(skb);
+ t = get_cycles() - t;
+
+ ent = tp->poll_log_ent;
+ lp = &tp->poll_log[ent];
+ ent = lp->rx_cur_ent;
+ if (ent < POLL_RX_SIZE) {
+ lp->rx_ents[ent] = (u16) t;
+ lp->rx_cur_ent = ent + 1;
+ }
+ }
tp->dev->last_rx = jiffies;
received++;
@@ -2897,9 +2912,15 @@ static int tg3_poll(struct net_device *n
/* run TX completion thread */
if (sblk->idx[0].tx_consumer != tp->tx_cons) {
+ unsigned long t;
+
spin_lock(&tp->tx_lock);
+ t = get_cycles();
tg3_tx(tp);
+ t = get_cycles() - t;
spin_unlock(&tp->tx_lock);
+
+ tp->poll_log[tp->poll_log_ent].tx_ticks = (u16) t;
}
spin_unlock_irqrestore(&tp->lock, flags);
@@ -2911,16 +2932,28 @@ static int tg3_poll(struct net_device *n
if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr) {
int orig_budget = *budget;
int work_done;
+ unsigned long t;
+ unsigned int ent;
if (orig_budget > netdev->quota)
orig_budget = netdev->quota;
+ t = get_cycles();
work_done = tg3_rx(tp, orig_budget);
+ t = get_cycles() - t;
+
+ ent = tp->poll_log_ent;
+ tp->poll_log[ent].rx_ticks = (u16) t;
*budget -= work_done;
netdev->quota -= work_done;
}
+ tp->poll_log_ent = (tp->poll_log_ent + 1) & POLL_LOG_MASK;
+ tp->poll_log[tp->poll_log_ent].tx_ticks = 0;
+ tp->poll_log[tp->poll_log_ent].rx_ticks = 0;
+ tp->poll_log[tp->poll_log_ent].rx_cur_ent = 0;
+
if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS)
tp->last_tag = sblk->status_tag;
rmb();
@@ -6609,6 +6642,27 @@ static struct net_device_stats *tg3_get_
stats->rx_crc_errors = old_stats->rx_crc_errors +
calc_crc_errors(tp);
+ /* XXX Yes, I know, do this right. :-) */
+ {
+ unsigned int ent;
+
+ printk("TG3: POLL LOG, current ent[%d]\n", tp->poll_log_ent);
+ ent = tp->poll_log_ent - (POLL_LOG_SIZE - 1);
+ ent &= POLL_LOG_MASK;
+ while (ent != tp->poll_log_ent) {
+ struct tg3_poll_log_ent *lp = &tp->poll_log[ent];
+ int i;
+
+ printk("(%u:%u[ ",
+ lp->tx_ticks, lp->rx_ticks);
+ for (i = 0; i < lp->rx_cur_ent; i++)
+ printk("%d ", lp->rx_ents[i]);
+ printk("])\n");
+
+ ent = (ent + 1) & POLL_LOG_MASK;
+ }
+ }
+
return stats;
}
--- ./drivers/net/tg3.h.~1~ 2005-06-03 11:13:14.000000000 -0700
+++ ./drivers/net/tg3.h 2005-06-05 23:21:05.000000000 -0700
@@ -2003,6 +2003,15 @@ struct tg3_ethtool_stats {
u64 nic_tx_threshold_hit;
};
+struct tg3_poll_log_ent {
+ u16 tx_ticks;
+ u16 rx_ticks;
+#define POLL_RX_SIZE 8
+#define POLL_RX_MASK (POLL_RX_SIZE - 1)
+ u16 rx_cur_ent;
+ u16 rx_ents[POLL_RX_SIZE];
+};
+
struct tg3 {
/* begin "general, frequently-used members" cacheline section */
@@ -2232,6 +2241,11 @@ struct tg3 {
#define SST_25VF0X0_PAGE_SIZE 4098
struct ethtool_coalesce coal;
+
+#define POLL_LOG_SIZE (1 << 7)
+#define POLL_LOG_MASK (POLL_LOG_SIZE - 1)
+ unsigned int poll_log_ent;
+ struct tg3_poll_log_ent poll_log[POLL_LOG_SIZE];
};
#endif /* !(_T3_H) */
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
@ 2005-06-06 15:35 Ronciak, John
2005-06-06 19:47 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: Ronciak, John @ 2005-06-06 15:35 UTC (permalink / raw)
To: David S. Miller, mchan
Cc: hadi, buytenh, Williams, Mitch A, jdmason, shemminger, netdev,
Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
We are dropping packets at the HW level (FIFO errors) with 256
descriptors and the default weight of 64. As we said, reducing the
weight eliminates this, which is understandable since the driver is being
serviced more frequently. We also hacked the driver to do a buffer
allocation per packet sent up the stack. This reduced the number of
dropped packets by about 80%, but it was still a significant number of
drops (190K down to 39K dropped). So I don't think this is where the problem
is. This is also confirmed by the tg3 driver, which does its buffer update
to the HW every 25 descriptors.
We did not up the descriptor ring size with the default weight but will
try this today and report back.
Cheers,
John
> -----Original Message-----
> From: David S. Miller [mailto:davem@davemloft.net]
> Sent: Sunday, June 05, 2005 11:43 PM
> To: mchan@broadcom.com
> Cc: hadi@cyberus.ca; buytenh@wantstofly.org; Williams, Mitch
> A; Ronciak, John; jdmason@us.ibm.com; shemminger@osdl.org;
> netdev@oss.sgi.com; Robert.Olsson@data.slu.se; Venkatesan,
> Ganesh; Brandeburg, Jesse
> Subject: Re: RFC: NAPI packet weighting patch
>
>
> From: "David S. Miller" <davem@davemloft.net>
> Date: Sun, 05 Jun 2005 14:36:53 -0700 (PDT)
>
> > BTW, here is the patch implementing this stuff.
>
> A new patch and some more data.
>
> [..]
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-06 15:35 Ronciak, John
@ 2005-06-06 19:47 ` David S. Miller
0 siblings, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-06 19:47 UTC (permalink / raw)
To: john.ronciak
Cc: mchan, hadi, buytenh, mitch.a.williams, jdmason, shemminger,
netdev, Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: "Ronciak, John" <john.ronciak@intel.com>
Date: Mon, 6 Jun 2005 08:35:26 -0700
> We are dropping packets at the HW level (FIFO errors) with 256
> descriptors and the default weight of 64. As we said, reducing the
> weight eliminates this, which is understandable since the driver is being
> serviced more frequently. We also hacked the driver to do a buffer
> allocation per packet sent up the stack. This reduced the number of
> dropped packets by about 80%, but it was still a significant number of
> drops (190K down to 39K dropped). So I don't think this is where the problem
> is. This is also confirmed by the tg3 driver, which does its buffer update
> to the HW every 25 descriptors.
I reach a different conclusion, sorry. :-)
Here is the invariant:
If you force the e1000 driver to do RX replenishment every N
packets it should reduce the packet drops the same (in the
single NIC case) as if you reduced the dev->weight to that
same value N.
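A rough sketch of that kind of in-loop replenishment, with made-up names (this is not e1000 code, just an illustration of the idea):

/* Sketch only: replenish RX buffers every N packets inside the clean loop
 * instead of once at the very end.  Every name here is made up.
 */
#define EXAMPLE_REPLENISH_EVERY 16      /* the "N" above, made-up value */

struct example_adapter;                                 /* stand-in for a driver's adapter */
static int example_rx_one(struct example_adapter *adapter);            /* made-up: 0 when ring empty */
static void example_alloc_rx_buffers(struct example_adapter *adapter); /* made-up refill routine */

static int example_clean_rx(struct example_adapter *adapter, int work_to_do)
{
        int cleaned = 0;

        while (cleaned < work_to_do && example_rx_one(adapter)) {
                cleaned++;
                if ((cleaned % EXAMPLE_REPLENISH_EVERY) == 0)
                        example_alloc_rx_buffers(adapter);      /* give buffers back early */
        }

        example_alloc_rx_buffers(adapter);      /* final top-up */
        return cleaned;
}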
You have two test cases, single NIC and multi-NIC, so you should
be very clear about which case your drop numbers apply to. They
are two totally different problems.
> We did not up the descriptor ring size with the default weight but will
> try this today and report back.
Thanks for all of your test data and hard work so far.
It's very valuable.
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
@ 2005-06-06 20:29 Ronciak, John
2005-06-06 23:55 ` Mitch Williams
0 siblings, 1 reply; 121+ messages in thread
From: Ronciak, John @ 2005-06-06 20:29 UTC (permalink / raw)
To: David S. Miller
Cc: mchan, hadi, buytenh, Williams, Mitch A, jdmason, shemminger,
netdev, Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
> If you force the e1000 driver to do RX replenishment every N
> packets it should reduce the packet drops the same (in the
> single NIC case) as if you reduced the dev->weight to that
> same value N.
But this isn't what we are seeing. Even if we just reduce the weight
value to 32 from 64, all of the drops go away. So there seem to be
other things affecting this.
We are just talking about single NIC testing at this point. I agree
that single and multi-NIC testing raise different issues, and we will need to
test the multi-NIC case as well with whatever we come up with out of this.
I also like your idea about the weight value being adjusted based on
real work done using some measurable metric. This seems like a good
path to explore as well.
Cheers,
John
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
2005-06-06 20:29 Ronciak, John
@ 2005-06-06 23:55 ` Mitch Williams
2005-06-07 0:08 ` Ben Greear
2005-06-07 4:53 ` Stephen Hemminger
0 siblings, 2 replies; 121+ messages in thread
From: Mitch Williams @ 2005-06-06 23:55 UTC (permalink / raw)
To: Ronciak, John
Cc: David S. Miller, mchan, hadi, buytenh, Williams, Mitch A, jdmason,
shemminger, netdev, Robert.Olsson, Venkatesan, Ganesh,
Brandeburg, Jesse
On Mon, 6 Jun 2005, Ronciak, John wrote:
> > If you force the e1000 driver to do RX replenishment every N
> > packets it should reduce the packet drops the same (in the
> > single NIC case) as if you reduced the dev->weight to that
> > same value N.
>
> But this isn't what we are seeing. Even if we just reduce the weight
> value to 32 from 64, all of the drops go away. So there seems to be
> other things affecting this.
Some quickie results for everybody -- I've been working on other stuff this
morning and haven't had much time in the lab.
Increasing the RX ring to 512 descriptors eliminates dropped packets, but
performance goes down. When I mentioned this, John and Jesse both nodded
and said, "Yep. That's what happens when the descriptor ring grows past
one page."
Reducing the weight to 32 got rid of almost all of the dropped packets
(down to < 1 per second); reducing it to 20 eliminated all of them. In
both cases performance rose as compared to the default weight of 64.
Tests were run on 2.6.12rc5 on a dual Xeon 2.8GHz PCI-X system. We run
Chariot for performance testing, using TCP/IP large file transfers with 10
Windows 2000 clients.
We're still looking at some methods of returning RX resources to the
hardware more often, but we don't have results on that yet.
> I also like your idea about the weight value being adjusted based on
> real work done using some measurable metric. This seems like a good
> path to explore as well.
Agreed. I think NAPI can be a lot smarter than it is today.
-Mitch
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-06 23:55 ` Mitch Williams
@ 2005-06-07 0:08 ` Ben Greear
2005-06-08 1:50 ` Jesse Brandeburg
2005-06-07 4:53 ` Stephen Hemminger
1 sibling, 1 reply; 121+ messages in thread
From: Ben Greear @ 2005-06-07 0:08 UTC (permalink / raw)
To: Mitch Williams
Cc: Ronciak, John, David S. Miller, mchan, hadi, buytenh, jdmason,
shemminger, netdev, Robert.Olsson, Venkatesan, Ganesh,
Brandeburg, Jesse
Mitch Williams wrote:
>
> On Mon, 6 Jun 2005, Ronciak, John wrote:
>
>
>>> If you force the e1000 driver to do RX replenishment every N
>>> packets it should reduce the packet drops the same (in the
>>> single NIC case) as if you reduced the dev->weight to that
>>> same value N.
>>
>>But this isn't what we are seeing. Even if we just reduce the weight
>>value to 32 from 64, all of the drops go away. So there seems to be
>>other things affecting this.
>
>
> Some quickie results for everybody -- I've been working on other stuff this
> morning and haven't had much time in the lab.
>
> Increasing the RX ring to 512 descriptors eliminates dropped packets, but
> performance goes down. When I mentioned this, John and Jesse both nodded
> and said, "Yep. That's what happens when the descriptor ring grows past
> one page."
>
> Reducing the weight to 32 got rid of almost all of the dropped packets
> (down to < 1 per second); reducing it to 20 eliminated all of them. In
> both cases performance rose as compared to the default weight of 64.
>
> Tests were run on 2.6.12rc5 on a dual Xeon 2.8GHz PCI-X system. We run
> Chariot for performance testing, using TCP/IP large file transfers with 10
> Windows 2000 clients.
So is the Linux server reading/writing these large files to/from the disk?
Can you tell us how much performance went down when you increased the
descriptors to 512?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-06 23:55 ` Mitch Williams
2005-06-07 0:08 ` Ben Greear
@ 2005-06-07 4:53 ` Stephen Hemminger
2005-06-07 12:38 ` jamal
2005-06-21 20:20 ` David S. Miller
1 sibling, 2 replies; 121+ messages in thread
From: Stephen Hemminger @ 2005-06-07 4:53 UTC (permalink / raw)
To: Mitch Williams
Cc: Ronciak, John, David S. Miller, mchan, hadi, buytenh, jdmason,
netdev, Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
I noticed that the tg3 driver copies packets less than a certain
threshold to a new buffer, but e1000 always passes the big buffer up
the stack. Could this be having an impact?
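For reference, a minimal sketch of that copy-small-packets idea, with a made-up threshold and helper name (not the actual tg3 code):

/* Sketch only: the "copy small packets" (copybreak) idea.  Below some
 * threshold, copy the data into a small freshly allocated skb so the
 * caller can recycle the original large RX buffer back to the hardware.
 */
#include <linux/skbuff.h>
#include <linux/string.h>

#define EXAMPLE_COPYBREAK 256           /* made-up threshold, in bytes */

static struct sk_buff *example_copybreak(struct sk_buff *rx_skb, unsigned int len)
{
        struct sk_buff *copy;

        if (len >= EXAMPLE_COPYBREAK)
                return NULL;            /* pass the big buffer up as-is */

        copy = dev_alloc_skb(len + 2);
        if (!copy)
                return NULL;

        skb_reserve(copy, 2);           /* align the IP header */
        memcpy(skb_put(copy, len), rx_skb->data, len);
        return copy;                    /* caller requeues rx_skb to the RX ring */
}

Whether the copy is a win depends on whether reusing the big buffer saves more than the memcpy costs for small frames.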
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 12:38 ` jamal
@ 2005-06-07 12:06 ` Martin Josefsson
2005-06-07 13:29 ` jamal
2005-06-21 20:37 ` David S. Miller
0 siblings, 2 replies; 121+ messages in thread
From: Martin Josefsson @ 2005-06-07 12:06 UTC (permalink / raw)
To: jamal
Cc: Stephen Hemminger, Mitch Williams, Ronciak, John, David S. Miller,
mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh, Brandeburg, Jesse
On Tue, 7 Jun 2005, jamal wrote:
> It is possible. Remember also that the cost of IO these days is worse than a
> cache miss, in cycles as well as in absolute time. So the e1000 may be doing
> more IO than the tg3.
>
> I think there is something fishy about the e1000 in general; from what I
> just read in these emails, there's improvement if the rx
> ring is replenished on a per-packet basis instead of in a batch at the end.
> This somehow is not an issue with tg3. I think doing the replenishing in
> smaller batches, like 5 packets at a time, would also help.
> That the tg3 doesn't need to have its rx ring size adjusted, while the
> e1000 gets better the lower the rx ring size is, is strange.
>
> To the Intel folks: shouldn't someone be investigating why this is so?
>
> Fixing the effect with "let's lower the weight" or "wait, let's adjust it
> at runtime", because we know it fixes our problem, sounds like a serious
> bandaid to me. Let's find the cause and fix that instead.
> Why is this issue happening with e1000? That's what needs to be resolved.
> So far some evidence seems to suggest that the tg3 uses less CPU.
One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1,
loops checking the status of each descriptor, and stops when it finds
a descriptor that isn't finished. Another way to do it is to read out the
current position of the ring and loop from lastrxdescriptor+1 up to the
current position. Scott Feldman implemented this for TX and there it
increased performance somewhat (discussed here on netdev some months ago).
I wonder if it could also decrease RX latency; I mean, we have to take the
cache miss sometime anyway.
I haven't checked how tg3 does it.
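A rough sketch of the head-position approach, with made-up register read and helpers (not actual e1000 code):

/* Sketch only: clean the RX ring up to the position reported by the
 * hardware's head register instead of testing the status bit of every
 * descriptor.
 */
struct example_adapter;                                                 /* stand-in for a driver's adapter */
static unsigned int example_read_rx_head(struct example_adapter *adapter);      /* made-up: one MMIO read */
static void example_rx_desc(struct example_adapter *adapter, unsigned int i);   /* made-up: process desc i */

static int example_clean_rx_to_head(struct example_adapter *adapter,
                                    unsigned int next_to_clean,
                                    unsigned int ring_size)
{
        unsigned int head = example_read_rx_head(adapter);     /* read head once */
        unsigned int i = next_to_clean;
        int cleaned = 0;

        while (i != head) {
                example_rx_desc(adapter, i);
                i = (i + 1) % ring_size;
                cleaned++;
        }

        /* the caller stores i back as its new next_to_clean */
        return cleaned;
}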
/Martin
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 13:29 ` jamal
@ 2005-06-07 12:36 ` Martin Josefsson
2005-06-07 16:34 ` Robert Olsson
0 siblings, 1 reply; 121+ messages in thread
From: Martin Josefsson @ 2005-06-07 12:36 UTC (permalink / raw)
To: jamal
Cc: Stephen Hemminger, Mitch Williams, Ronciak, John, David S. Miller,
mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh, Brandeburg, Jesse
On Tue, 7 Jun 2005, jamal wrote:
> On Tue, 2005-07-06 at 14:06 +0200, Martin Josefsson wrote:
>
> > One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1
> > and loops and checks the status of each descriptor and stops when it finds
> > a descriptor that isn't finished. Another way to do it is to read out the
> > current position of the ring and loop from lastrxdescriptor+1 up to the
> > current position. Scott Feldman implemented this for TX and there it
> > increased performance somewhat (discussed here on netdev some months ago).
> > I wonder if it could also decrease RX latency, I mean, we have to get the
> > cache miss sometime anyway.
> >
>
> The effect of Scott's patch was to reduce IO by amortizing it on the TX
> side. Are we talking about the same thing? This was in the case of TX
> descriptor pruning?
Yes, that was for TX pruning.
> So it is possible that the e1000 is doing more than necessary share of
> IO on the receive side as well.
Yes, that's what I mean. Same thing but for RX. The question is how
much we would gain from it, since we still need to touch the rx-descriptor
sooner or later. Would be worth a test.
My test setup isn't in a working condition right now, Robert?
/Martin
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 4:53 ` Stephen Hemminger
@ 2005-06-07 12:38 ` jamal
2005-06-07 12:06 ` Martin Josefsson
2005-06-21 20:20 ` David S. Miller
1 sibling, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-07 12:38 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Mitch Williams, Ronciak, John, David S. Miller, mchan, buytenh,
jdmason, netdev, Robert.Olsson, Venkatesan, Ganesh,
Brandeburg, Jesse
On Mon, 2005-06-06 at 21:53 -0700, Stephen Hemminger wrote:
> I noticed that the tg3 driver copies packets less than a certain
> threshold to a new buffer, but e1000 always passes the big buffer up
> the stack. Could this be having an impact?
It is possible. Remember also that the cost of IO these days is worse than a
cache miss, in cycles as well as in absolute time. So the e1000 may be doing
more IO than the tg3.
I think there is something fishy about the e1000 in general; from what I
just read in these emails, there's improvement if the rx
ring is replenished on a per-packet basis instead of in a batch at the end.
This somehow is not an issue with tg3. I think doing the replenishing in
smaller batches, like 5 packets at a time, would also help.
That the tg3 doesn't need to have its rx ring size adjusted, while the
e1000 gets better the lower the rx ring size is, is strange.
To the Intel folks: shouldn't someone be investigating why this is so?
Fixing the effect with "let's lower the weight" or "wait, let's adjust it
at runtime", because we know it fixes our problem, sounds like a serious
bandaid to me. Let's find the cause and fix that instead.
Why is this issue happening with e1000? That's what needs to be resolved.
So far some evidence seems to suggest that the tg3 uses less CPU.
cheers,
jamal
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 12:06 ` Martin Josefsson
@ 2005-06-07 13:29 ` jamal
2005-06-07 12:36 ` Martin Josefsson
2005-06-21 20:37 ` David S. Miller
1 sibling, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-07 13:29 UTC (permalink / raw)
To: Martin Josefsson
Cc: Stephen Hemminger, Mitch Williams, Ronciak, John, David S. Miller,
mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh, Brandeburg, Jesse
On Tue, 2005-07-06 at 14:06 +0200, Martin Josefsson wrote:
> One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1
> and loops and checks the status of each descriptor and stops when it finds
> a descriptor that isn't finished. Another way to do it is to read out the
> current position of the ring and loop from lastrxdescriptor+1 up to the
> current position. Scott Feldman implemented this for TX and there it
> increased performance somewhat (discussed here on netdev some months ago).
> I wonder if it could also decrease RX latency, I mean, we have to get the
> cache miss sometime anyway.
>
The effect of Scott's patch was to reduce IO by amortizing it on the TX
side. Are we talking about the same thing? This was in the case of TX
descriptor pruning?
So it is possible that the e1000 is doing more than its necessary share of
IO on the receive side as well.
cheers,
jamal
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
@ 2005-06-07 16:23 Ronciak, John
2005-06-07 20:21 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: Ronciak, John @ 2005-06-07 16:23 UTC (permalink / raw)
To: hadi, Stephen Hemminger
Cc: Williams, Mitch A, David S. Miller, mchan, buytenh, jdmason,
netdev, Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse
>>
> To the intel folks: shouldnt someone be investigating why this is so?
This is why we started all of this. We have data showing this
issue, where our overall performance is best in class and yet we can
make it better by changing things like the weight value.
There also seem to be some misconceptions about changing the weight
value. It actually improves the performance of other drivers as well.
Not as much as it improves the e1000 performance, but it does seem to
help others as well. We (Intel) have to be careful talking about
competitors' performance, so we just refer to them as competitors in these
threads. So it is not just e1000 that benefits from the lower weight
values. One thing it is doing for e1000 right now is that it is
stopping the e1000 from dropping frames, which is part of why it's
helping the e1000 more (I think).
I agree that we need to bottom out on this, and it's why we are
dedicating the time and resources to this effort. We also appreciate
all the effort to help resolve this. This should result in a
better performing 2.6 stack and drivers. The new TSO code is a big step
in that direction as well.
Cheers,
John
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 12:36 ` Martin Josefsson
@ 2005-06-07 16:34 ` Robert Olsson
2005-06-07 23:19 ` Rick Jones
0 siblings, 1 reply; 121+ messages in thread
From: Robert Olsson @ 2005-06-07 16:34 UTC (permalink / raw)
To: Martin Josefsson
Cc: jamal, Stephen Hemminger, Mitch Williams, Ronciak, John,
David S. Miller, mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh, Brandeburg, Jesse
Martin Josefsson writes:
> > So it is possible that the e1000 is doing more than necessary share of
> > IO on the receive side as well.
>
> Yes, that's what I mean. Same thing but for RX but the question is how
> much we would gain from it, we still need to touch the rx-descriptor
> sooner or later. Would be worth a test.
> My testsetup isn't in a working condition right now, Robert?
Next week possibly... but really no idea what to test or what's going on.
We have a dual-CPU TCP server with one NIC. How is it set up now? I don't even
know how it should be set up for maximum TCP performance.
How is IRQ affinity set up? Are IRQs jumping between the CPUs, etc.?
Do the ksoftirqd threads use CPU? If so, they can be adjusted/tuned too.
How are packets processed by the CPUs? See /proc/net/softnet_stat. Do we see
drops with one CPU too, etc.?
It might be an intricate question of balance between softirq and userland.
--ro
BTW, Can netperf be used for tests like this? (Rick?)
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 16:23 Ronciak, John
@ 2005-06-07 20:21 ` David S. Miller
2005-06-08 2:20 ` Jesse Brandeburg
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-07 20:21 UTC (permalink / raw)
To: john.ronciak
Cc: hadi, shemminger, mitch.a.williams, mchan, buytenh, jdmason,
netdev, Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: "Ronciak, John" <john.ronciak@intel.com>
Date: Tue, 7 Jun 2005 09:23:32 -0700
> There also seems to be some misconceptions about changing the weight
> value. It actually improves the performance of other drivers as well.
> Not as much as it improves the e1000 performance but it does seem to
> help others as well.
One reason it helps e1000 more, which Robert Olsson mentioned, could
be the HW irq mitigation settings used by the e1000 driver. Lowering
these would be a good test.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 16:34 ` Robert Olsson
@ 2005-06-07 23:19 ` Rick Jones
0 siblings, 0 replies; 121+ messages in thread
From: Rick Jones @ 2005-06-07 23:19 UTC (permalink / raw)
To: Robert Olsson, netdev
Cc: Martin Josefsson, jamal, Stephen Hemminger, Mitch Williams,
Ronciak, John, David S. Miller, mchan, buytenh, jdmason,
Venkatesan, Ganesh, Brandeburg, Jesse
>
> BTW, Can netperf be used for tests like this? (Rick?)
Assuming I'm translating "test like this" to the right sort of stuff :)
If one wants to see the effect of different buffer replenishment strategies, I
suppose that some netperf tests could indeed be used. It would be desirable to
look at service demand moreso than throughput (assuming the throughput is
link-bound). TCP_STREAM and/or TCP_MAERTS. I'm not sure the extent to which it
would be visible to a TCP_RR test.
Differences in service demand could also be used to measure effects of irq
migration, pinning IRQs and/or processes to specific CPUs and the like. The
linux processor affinity stuff in netperf could use a little help though - it is
easily confused as to when to use a two argument vs three argument
sched_setaffinity call. I suspect one may also see differences in TCP_RR
transaction rates.
I suspect some high number of confidence interval iterations might be required.
rick jones
i'd trim individual names from the dist list, but am not 100% sure who is on
netdev...
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 0:08 ` Ben Greear
@ 2005-06-08 1:50 ` Jesse Brandeburg
0 siblings, 0 replies; 121+ messages in thread
From: Jesse Brandeburg @ 2005-06-08 1:50 UTC (permalink / raw)
To: Ben Greear
Cc: Williams, Mitch A, Ronciak, John, David S. Miller, mchan, hadi,
buytenh, jdmason, shemminger, netdev, Robert.Olsson,
Venkatesan, Ganesh, Brandeburg, Jesse
On Mon, 6 Jun 2005, Ben Greear wrote:
> So is the Linux server reading/writing these large files to/from the
> disk?
no, the test runs completely from memory, and the clients are
reading/writing from/to the server
> Can you tell us how much performance went down when you increased the
> descriptors to 512?
Sorry, I don't know the answer to that one.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 20:21 ` David S. Miller
@ 2005-06-08 2:20 ` Jesse Brandeburg
2005-06-08 3:31 ` David S. Miller
2005-06-08 3:43 ` David S. Miller
0 siblings, 2 replies; 121+ messages in thread
From: Jesse Brandeburg @ 2005-06-08 2:20 UTC (permalink / raw)
To: David S. Miller
Cc: Ronciak, John, hadi, shemminger, Williams, Mitch A, mchan,
buytenh, jdmason, netdev, Robert.Olsson, Venkatesan, Ganesh,
Brandeburg, Jesse
On Tue, 7 Jun 2005, David S. Miller wrote:
> > There also seems to be some misconceptions about changing the weight
> > value. It actually improves the performance of other drivers as well.
> > Not as much as it improves the e1000 performance but it does seem to
> > help others as well.
>
> One reason it helps e1000 more, which Robert Olsson mentioned, could
> be the HW irq mitigation settings used by the e1000 driver. Lowering
> these would be a good test.
Well, first a little more data. The machine in question is a dual Xeon
running either 2.6.12-rc5 or 2.6.12-rc4-supertso.
With the 2.6.12-rc5 kernel (the old TSO), TSO promptly shuts down after a SACK,
and after that point the machine is CPU bound at 100%. This is the point
at which we start to drop packets at the hardware level.
I tried the experiment today where I replenish buffers to hardware every
16 packets or so. This appears to mitigate all drops at the hardware
level (no drops). We're still at 100% CPU with the rc5 kernel, however.
Even with this replenish fix, also dropping the weight to 16
helped increase our throughput, although only by about 1%.
On the other hand, taking our driver as-is with no changes and running the
supertso (not the split-out version, yet) kernel, we show no dropped
packets and 60% CPU use. This comes with a 6% increase in throughput,
and the data pattern on the wire is much more constant (I have tcpdumps,
do you want to see them, Dave?).
I'm looking forward to trying the split out patches tomorrow.
here is my (compile tested) patch, for e1000
diff -rup e1000-6.0.60.orig/src/e1000_main.c e1000-6.0.60/src/e1000_main.c
--- e1000-6.0.60.orig/src/e1000_main.c 2005-06-07 19:07:37.000000000 -0700
+++ e1000-6.0.60/src/e1000_main.c 2005-06-07 19:15:05.000000000 -0700
@@ -3074,11 +3074,14 @@ e1000_clean_rx_irq(struct e1000_adapter
next_desc:
rx_desc->status = 0;
buffer_info->skb = NULL;
+ if(unlikely((i & ~(E1000_RX_BUFFER_WRITE - 1)) == i))
+ adapter->alloc_rx_buf(adapter);
if(unlikely(++i == rx_ring->count)) i = 0;
rx_desc = E1000_RX_DESC(*rx_ring, i);
}
rx_ring->next_to_clean = i;
+ /* not sure this is necessary any more, but its safe */
adapter->alloc_rx_buf(adapter);
return cleaned;
@@ -3209,12 +3212,15 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
next_desc:
rx_desc->wb.middle.status_error &= ~0xFF;
buffer_info->skb = NULL;
+ if(unlikely((i & ~(E1000_RX_BUFFER_WRITE - 1)) == i))
+ adapter->alloc_rx_buf(adapter);
if(unlikely(++i == rx_ring->count)) i = 0;
rx_desc = E1000_RX_DESC_PS(*rx_ring, i);
staterr = le32_to_cpu(rx_desc->wb.middle.status_error);
}
rx_ring->next_to_clean = i;
+ /* not sure this is necessary any more, but its safe */
adapter->alloc_rx_buf(adapter);
return cleaned;
PS e1000-6.0.60 is posted on sf.net/projects/e1000 now.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-08 2:20 ` Jesse Brandeburg
@ 2005-06-08 3:31 ` David S. Miller
2005-06-08 3:43 ` David S. Miller
1 sibling, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-08 3:31 UTC (permalink / raw)
To: jesse.brandeburg
Cc: john.ronciak, hadi, shemminger, mitch.a.williams, mchan, buytenh,
jdmason, netdev, Robert.Olsson, ganesh.venkatesan
From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Tue, 7 Jun 2005 19:20:37 -0700 (PDT)
> I'm looking forward to trying the split out patches tomorrow.
Don't get too excited, those are purely bug fixes and don't
do the actual "Super TSO" part yet. I'm trying to
test the cleanups leading up to the actual TSO segmenting change
to make sure any regressions therein get weeded out.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-08 2:20 ` Jesse Brandeburg
2005-06-08 3:31 ` David S. Miller
@ 2005-06-08 3:43 ` David S. Miller
2005-06-08 13:36 ` jamal
1 sibling, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-08 3:43 UTC (permalink / raw)
To: jesse.brandeburg
Cc: john.ronciak, hadi, shemminger, mitch.a.williams, mchan, buytenh,
jdmason, netdev, Robert.Olsson, ganesh.venkatesan
From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Tue, 7 Jun 2005 19:20:37 -0700 (PDT)
> with the 2.6.12-rc5 kernel (the old) tso promptly shuts down after a SACK,
> and after that point the machine is CPU bound at 100%. This is the point
> that we start to drop packets at the hardware level.
You're getting packet loss on the local network where you're
running these tests? Or is it simple packet reordering?
> I tried the experiment today where I replenish buffers to hardware every
> 16 packets or so. This appears to mitigate all drops at the hardware
> level (no drops). We're still at 100% with the rc5 kernel, however.
>
> even with this replenish fix, the addition of dropping the weight to 16
> helped increase our throughput, although only about 1%.
Any minor timing difference of any kind can make up to a 3% or
4% difference in TCP performance when the receiver is CPU
limited.
> On the other hand, taking our driver as is with no changes and running the
> supertso (not the split out version, yet) kernel, we show no dropped
> packets and 60% cpu use. This combines with a 6% increase in throughput,
> and the data pattern on the wire is much more constant (i have tcpdumps,
> do you want to see them Dave?)
Yes, indeed the tcpdumps tend to look much nicer with supertso.
The 10gbit guys see regressions though. They are helping me
test things gradually in order to track down what change causes
the problems. That's why I've started rewriting super TSO from
scratch in a series of very small patches.
I don't see how supertso can help the receiver, which is where
the RX drops should be occurring. That's a little weird.
I can't believe a 2.5 GHz machine can't keep up with a simple 1 Gbit
TCP stream. Do you have some other computation going on in that
system? As stated yesterday, my 1.5 GHz crappy sparc64 box can receive
a 1 Gbit TCP stream with much CPU to spare, and my 750 MHz sparc64 box can
nearly do so as well.
Something is up, if a single gigabit TCP stream can fully CPU
load your machine. 10 gigabit, yeah, definitely all current
generation machines are cpu limited over that link speed, but
1 gigabit should be no problem.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-08 3:43 ` David S. Miller
@ 2005-06-08 13:36 ` jamal
2005-06-09 21:37 ` Jesse Brandeburg
0 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-08 13:36 UTC (permalink / raw)
To: David S. Miller
Cc: jesse.brandeburg, john.ronciak, shemminger, mitch.a.williams,
mchan, buytenh, jdmason, netdev, Robert.Olsson, ganesh.venkatesan
On Tue, 2005-07-06 at 20:43 -0700, David S. Miller wrote:
> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date: Tue, 7 Jun 2005 19:20:37 -0700 (PDT)
[..]
> > I tried the experiment today where I replenish buffers to hardware every
> > 16 packets or so. This appears to mitigate all drops at the hardware
> > level (no drops). We're still at 100% with the rc5 kernel, however.
> >
> > even with this replenish fix, the addition of dropping the weight to 16
> > helped increase our throughput, although only about 1%.
>
> Any minor timing difference of any kind can make up to a 3% or
> 4% difference in TCP performance when the receiver is CPU
> limited.
>
Agreed.
[..]
> I don't see how supertso can help the receiver, which is where
> the RX drops should be occuring. That's a little weird.
>
> I can't believe a 2.5 GHZ machine can't keep up with a simple 1 Gbit
> TCP stream. Do you have some other computation going on in that
> system? As stated yesterday my 1.5 GHZ crappy sparc64 box can receive
> a 1 Gbit TCP stream with much cpu to spare, my 750 MHZ sparc64 box can
> nearly do so as well.
>
> Something is up, if a single gigabit TCP stream can fully CPU
> load your machine. 10 gigabit, yeah, definitely all current
> generation machines are cpu limited over that link speed, but
> 1 gigabit should be no problem.
>
Yes, sir.
BTW, all along I thought the sender and receiver were hooked up directly
(there was some mention of Chariot a while back).
Even if they did have some smart-ass thing in the middle that reorders,
it is still surprising that such a fast CPU can't handle a mere one Gig of
what seem to be MTU=1500-byte-sized packets.
I suppose a netstat -s would help for visualization, in addition to those
dumps.
Here's what I am deducing from their data, correct me if I am wrong:
->The evidence is that something is expensive in their code path (duh).
-> Whatever that expensive code is, it is not helped by them
replenishing the descriptors only after all the budget is exhausted, since the
descriptor departure rate is much slower than the packet arrival rate.
---> This is why they would be seeing that the reduction of weight
improves performance, since the replenishing happens sooner with a
smaller weight.
------> Clearly the driver needs some fixing - if they could do what
their competitor's (who shall remain nameless) driver does, or replenish
more often, then that would go some way to help (Jesse's result with
replenish after 16 is proof).
This still hasn't resolved what the problem is, but we may be getting
close.
Even if they SACKed for every packet, this still would not make any
sense. So I think a profile of where the cycles are spent would also
help. I am suspecting the driver at this point, but I could be wrong.
cheers,
jamal
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-08 13:36 ` jamal
@ 2005-06-09 21:37 ` Jesse Brandeburg
2005-06-09 22:05 ` Stephen Hemminger
2005-06-09 22:20 ` jamal
0 siblings, 2 replies; 121+ messages in thread
From: Jesse Brandeburg @ 2005-06-09 21:37 UTC (permalink / raw)
To: jamal
Cc: David S. Miller, Brandeburg, Jesse, Ronciak, John, shemminger,
Williams, Mitch A, mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh
On Wed, 8 Jun 2005, jamal wrote:
> > Something is up, if a single gigabit TCP stream can fully CPU
> > load your machine. 10 gigabit, yeah, definitely all current
> > generation machines are cpu limited over that link speed, but
> > 1 gigabit should be no problem.
> >
>
> Yes, sir.
> BTW, all along i thought the sender and receiver are hooked up directly
> (there was some mention of chariot a while back).
Okay, let me clear this up once and for all. Here is our test setup:
* 10 1U rack machines (dual P3 - 1250MHz), with both Windows and Linux
installed (running Windows now)
* Extreme 1-gig switch
* dual 2.8 GHz P4 server, RHEL3 base, running 2.6.12-rc5 or the supertso patch
* the test entails transferring 1MB files of zeros from memory to memory,
using TCP, with each client doing primarily either send or recv, not both.
> Even if they did have some smart-ass thing in the middle that reorders,
> it is still surprising that such a fast CPU can't handle a mere one Gig of
> what seem to be MTU=1500-byte-sized packets.
It can handle a single thread (or even 6) just fine; it's after that that we get
into trouble somewhere.
> I suppose a netstat -s would help for visualization in addition to those
> dumps.
Okay I have that data, do you want it for the old tso, supertso, or no tso
at all?
> Heres what i am deducing from their data, correct me if i am wrong:
> ->The evidence is that something is expensive in their code path (duh).
Actually I've found that adding more threads (10 total) sending to the
server, while keeping the transmit thread count constant, yields an
increase in our throughput all the way to 1750+ Mb/s (with supertso).
> -> Whatever that expensive code is, it is not helped by them
> replenishing the descriptors only after all the budget is exhausted, since the
> descriptor departure rate is much slower than the packet arrival rate.
I'm running all my tests with the replenish patch mentioned earlier in
this thread.
> ---> This is why they would be seeing that the reduction of weight
> improves performance since the replenishing happens sooner with a
> smaller weight.
Seems like we're past the weight problem now; should I start a new thread?
> ------> Clearly the driver needs some fixing - if they could do what
I'm not convinced it is the driver that is having issues. We might be
having some complex interaction with the stack, but I definitely think we
have a lot of onion layers to hack through here, all of which are probably
relevant.
> Even if they SACKed for every packet, this still would not make any
> sense. So i think a profile of where the cycles are spent would also
> help. I am suspecting the driver at this point but i could be wrong.
I have profile data; here is an example with 5 tx/5 rx threads, where the
throughput was 1236 Mb/s total (936 tx, 300 rx), on 2.6.12-rc5 with old TSO
(the original problem case). We are at 100% CPU and generating 3289 ints/s,
with no hardware drops reported, probably due to my replenish patch.
CPU: P4 / Xeon with 2 hyper-threads, speed 2791.36 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name symbol name
533687 8.1472 vmlinux pskb_expand_head
428726 6.5449 vmlinux __copy_user_zeroing_intel
349934 5.3421 vmlinux _read_lock_irqsave
313667 4.7884 vmlinux csum_partial
218870 3.3413 vmlinux _spin_lock
214302 3.2715 vmlinux __copy_user_intel
193662 2.9564 vmlinux skb_release_data
177755 2.7136 vmlinux ipt_do_table
148445 2.2662 vmlinux _write_lock_irqsave
148080 2.2606 vmlinux _read_unlock_bh
143308 2.1877 vmlinux tcp_sendmsg
115745 1.7670 vmlinux ip_queue_xmit
111487 1.7020 vmlinux __kfree_skb
108383 1.6546 vmlinux _spin_lock_irqsave
108071 1.6498 e1000.ko e1000_xmit_frame
107850 1.6464 vmlinux tcp_clean_rtx_queue
104552 1.5961 e1000.ko e1000_clean_tx_irq
101308 1.5466 e1000.ko e1000_clean_rx_irq
94297 1.4395 vmlinux __copy_from_user_ll
85170 1.3002 vmlinux kfree
76730 1.1714 vmlinux tcp_transmit_skb
70976 1.0835 vmlinux eth_type_trans
67381 1.0286 vmlinux tcp_rcv_established
64670 0.9872 vmlinux sub_preempt_count
64451 0.9839 vmlinux dev_queue_xmit
64010 0.9772 vmlinux skb_clone
62314 0.9513 vmlinux tcp_v4_rcv
61980 0.9462 vmlinux nf_iterate
60374 0.9217 vmlinux ip_finish_output
57407 0.8764 vmlinux _write_unlock_bh
56165 0.8574 vmlinux mark_offset_tsc
54673 0.8346 endpoint (no symbols)
52662 0.8039 vmlinux __kmalloc
50112 0.7650 vmlinux sock_wfree
50001 0.7633 vmlinux _spin_trylock
47053 0.7183 vmlinux _read_lock_bh
45988 0.7021 vmlinux tcp_write_xmit
44229 0.6752 vmlinux kmem_cache_alloc
43506 0.6642 vmlinux smp_processor_id
42401 0.6473 vmlinux ip_conntrack_find_get
42095 0.6426 vmlinux alloc_skb
40619 0.6201 vmlinux tcp_in_window
38098 0.5816 vmlinux add_preempt_count
37701 0.5755 vmlinux __copy_to_user_ll
31529 0.4813 vmlinux ip_conntrack_in
31314 0.4780 vmlinux kmem_cache_free
30954 0.4725 vmlinux __ip_conntrack_find
30863 0.4712 vmlinux local_bh_enable
30774 0.4698 vmlinux tcp_packet
29426 0.4492 vmlinux _spin_unlock_irqrestore
28716 0.4384 vmlinux hash_conntrack
27073 0.4133 vmlinux ip_route_input
26540 0.4052 e1000.ko e1000_clean
25817 0.3941 vmlinux nf_hook_slow
23395 0.3571 vmlinux schedule
22981 0.3508 vmlinux tcp_v4_send_check
22139 0.3380 vmlinux __mod_timer
22126 0.3378 vmlinux timer_interrupt
21511 0.3284 vmlinux cache_alloc_refill
21161 0.3230 vmlinux netif_receive_skb
20418 0.3117 vmlinux _write_lock_bh
19443 0.2968 vmlinux skb_copy_datagram_iovec
19100 0.2916 vmlinux ip_nat_fn
18784 0.2868 vmlinux ip_local_deliver
18251 0.2786 vmlinux _read_lock
17513 0.2674 vmlinux nat_packet
17124 0.2614 e1000.ko e1000_intr
16357 0.2497 vmlinux default_idle
15358 0.2345 vmlinux qdisc_restart
14564 0.2223 vmlinux _read_unlock
14360 0.2192 vmlinux tcp_recvmsg
13853 0.2115 oprofiled odb_insert
13374 0.2042 e1000.ko e1000_alloc_rx_buffers
13321 0.2034 vmlinux apic_timer_interrupt
12668 0.1934 vmlinux pfifo_fast_enqueue
12618 0.1926 vmlinux tcp_sack
12180 0.1859 vmlinux ip_nat_local_fn
11434 0.1746 vmlinux system_call
11426 0.1744 vmlinux free_block
11377 0.1737 vmlinux try_to_wake_up
11138 0.1700 vmlinux irq_entries_start
11017 0.1682 vmlinux ipt_route_hook
10987 0.1677 vmlinux dev_queue_xmit_nit
10970 0.1675 vmlinux tcp_push_one
10508 0.1604 vmlinux tcp_error
10365 0.1582 vmlinux pfifo_fast_dequeue
10323 0.1576 vmlinux ip_rcv
10022 0.1530 vmlinux ip_output
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-09 21:37 ` Jesse Brandeburg
@ 2005-06-09 22:05 ` Stephen Hemminger
2005-06-09 22:12 ` Jesse Brandeburg
2005-06-09 22:22 ` David S. Miller
2005-06-09 22:20 ` jamal
1 sibling, 2 replies; 121+ messages in thread
From: Stephen Hemminger @ 2005-06-09 22:05 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: jamal, David S. Miller, Brandeburg, Jesse, Ronciak, John,
Williams, Mitch A, mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh
> I have profile data, here is an example of 5tx/5rx threads, where the
> throughput was 1236Mb/s total, 936tx, 300rx, on 2.6.12-rc5 with old TSO
> (the original problem case) we are at 100% cpu and generating 3289 ints/s,
> with no hardware drops reported prolly due to my replenish patch
> CPU: P4 / Xeon with 2 hyper-threads, speed 2791.36 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples % image name symbol name
> 533687 8.1472 vmlinux pskb_expand_head
> 428726 6.5449 vmlinux __copy_user_zeroing_intel
> 349934 5.3421 vmlinux _read_lock_irqsave
We should kill all reader/writer locks in the fastpath. reader locks are
more expensive than spinlocks unless they are going to be held for a fairly
large window.
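As a small illustration of that kind of conversion (made-up structure, sketch only):

/* Sketch only: converting a read/write lock to a plain spinlock on a
 * short fast-path critical section.  "example_table" is made up.
 */
#include <linux/spinlock.h>
#include <linux/list.h>

struct example_table {
        spinlock_t lock;                /* was: rwlock_t lock; */
        struct list_head entries;
};

static void example_fast_lookup(struct example_table *t)
{
        unsigned long flags;

        spin_lock_irqsave(&t->lock, flags);      /* was: read_lock_irqsave() */
        /* ... short walk over t->entries ... */
        spin_unlock_irqrestore(&t->lock, flags); /* was: read_unlock_irqrestore() */
}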
> 313667 4.7884 vmlinux csum_partial
> 218870 3.3413 vmlinux _spin_lock
> 214302 3.2715 vmlinux __copy_user_intel
> 193662 2.9564 vmlinux skb_release_data
> 177755 2.7136 vmlinux ipt_do_table
You are probably benchmarking iptables/netfilter! How many rules do you
have?
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-09 22:05 ` Stephen Hemminger
@ 2005-06-09 22:12 ` Jesse Brandeburg
2005-06-09 22:21 ` David S. Miller
2005-06-09 22:21 ` jamal
2005-06-09 22:22 ` David S. Miller
1 sibling, 2 replies; 121+ messages in thread
From: Jesse Brandeburg @ 2005-06-09 22:12 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Brandeburg, Jesse, jamal, David S. Miller, Ronciak, John,
Williams, Mitch A, mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh
On Thu, 9 Jun 2005, Stephen Hemminger wrote:
> > 313667 4.7884 vmlinux csum_partial
> > 218870 3.3413 vmlinux _spin_lock
> > 214302 3.2715 vmlinux __copy_user_intel
> > 193662 2.9564 vmlinux skb_release_data
> > 177755 2.7136 vmlinux ipt_do_table
>
> You are probably benchmarking iptables/netfilter! How many rules do you
> have?
I saw that... somehow iptables got compiled into the kernel statically. No
rules are active or installed; iptables -L -n shows nothing in any chain.
Jesse
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-09 21:37 ` Jesse Brandeburg
2005-06-09 22:05 ` Stephen Hemminger
@ 2005-06-09 22:20 ` jamal
1 sibling, 0 replies; 121+ messages in thread
From: jamal @ 2005-06-09 22:20 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: David S. Miller, Ronciak, John, shemminger, Williams, Mitch A,
mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh
On Thu, 2005-09-06 at 14:37 -0700, Jesse Brandeburg wrote:
> Okay let me clear this up once and for all, here is our test setup:
>
> * 10 1u rack machines (dual P3 - 1250MHz), with both windows and linux
> installed (running windows now)
> * Extreme 1gig switch
> * Dual 2.8 GHz P4 server, RHEL3 base, running 2.6.12-rc5 or supertso patch
>
> * the test entails transferring 1MB files of zeros from memory to memory,
> using TCP, with each client doing primary either send or recv, not both.
Linux as sender?
> > Even if they did have some smart ass thing in the middle that reorders,
> > it is still suprising that such a fast CPU cant handle a mere one Gig of
> > what seems to be MTU=1500 bytes sized packets.
>
> It can handle a single thread (or even 6) just fine; it's after that that we
> get into trouble somewhere.
>
Certainly interesting details?
> > I suppose a netstat -s would help for visualization in addition to those
> > dumps.
>
> Okay I have that data, do you want it for the old tso, supertso, or no tso
> at all?
>
hrmph - don't know. Dave could tell you.
I would say whatever you are running that's latest and greatest and
causes you trouble.
> > Heres what i am deducing from their data, correct me if i am wrong:
> > ->The evidence is that something is expensive in their code path (duh).
>
> Actually I've found that adding more threads (10 total) sending to the
> server, while keeping the transmit thread count constant, yields an
> increase in our throughput all the way to 1750+ Mb/s (with supertso)
>
Interesting tidbit
> > -> Whatever that expensive thing code is, it not helped by them
> > replenishing the descriptors after all the budget is exhausted since the
> > descriptor departure rate is much slower than packet arrival.
>
> I'm running all my tests with the replenish patch mentioned earlier in
> this thread.
>
Ok. When I said "in the data path" - it could be anything from the
driver all the way to the socket.
If you have some pig along that path - it would mean you get back less
often to replenish the descriptors.
> > ---> This is why they would be seeing that the reduction of weight
> > improves performance since the replenishing happens sooner with a
> > smaller weight.
>
> seems like we're past the weight problem now, should i start a new thread?
>
I think so.
> > ------> Clearly the driver needs some fixing - if they could do what
>
> I'm not convinced it is the driver that is having issues. We might be
> having some complex interaction with the stack, but I definitely think we
> have a lot of onion layers to hack through here, all of which are probably
> relevant.
>
I agree. But the driver could have some improvement as well if you did
what the other driver does ;->
> I have profile data, here is an example of 5tx/5rx threads, where the
> throughput was 1236Mb/s total, 936tx, 300rx, on 2.6.12-rc5 with old TSO
> (the original problem case) we are at 100% cpu and generating 3289 ints/s,
> with no hardware drops reported prolly due to my replenish patch
Hrm, reading Stephen's email as well ;->
Can you turn netfilter off totally? Most importantly, remove
conntracking.
cheers,
jamal
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-09 22:12 ` Jesse Brandeburg
@ 2005-06-09 22:21 ` David S. Miller
2005-06-09 22:21 ` jamal
1 sibling, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-09 22:21 UTC (permalink / raw)
To: jesse.brandeburg
Cc: shemminger, hadi, john.ronciak, mitch.a.williams, mchan, buytenh,
jdmason, netdev, Robert.Olsson, ganesh.venkatesan
From: Jesse Brandeburg <jesse.brandeburg@intel.com>
Date: Thu, 9 Jun 2005 15:12:09 -0700 (PDT)
> I saw that... somehow iptables got compiled into kernel statically. no
> rules are active or installed iptables -L -n shows nothing in any chain.
Netfilter can kill performance, even if no rules are loaded at all.
Please take that out of your kernel.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-09 22:12 ` Jesse Brandeburg
2005-06-09 22:21 ` David S. Miller
@ 2005-06-09 22:21 ` jamal
1 sibling, 0 replies; 121+ messages in thread
From: jamal @ 2005-06-09 22:21 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: Stephen Hemminger, David S. Miller, Ronciak, John,
Williams, Mitch A, mchan, buytenh, jdmason, netdev, Robert.Olsson,
Venkatesan, Ganesh
On Thu, 2005-09-06 at 15:12 -0700, Jesse Brandeburg wrote:
> On Thu, 9 Jun 2005, Stephen Hemminger wrote:
> > > 313667 4.7884 vmlinux csum_partial
> > > 218870 3.3413 vmlinux _spin_lock
> > > 214302 3.2715 vmlinux __copy_user_intel
> > > 193662 2.9564 vmlinux skb_release_data
> > > 177755 2.7136 vmlinux ipt_do_table
> >
> > You are probably benchmarking iptables/netfilter! How many rules do you
> > have?
>
> I saw that... somehow iptables got compiled into kernel statically. no
> rules are active or installed iptables -L -n shows nothing in any chain.
Conntracking is a lot worse of a problem. Just turn netfilter off
altogether.
cheers,
jamal
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-09 22:05 ` Stephen Hemminger
2005-06-09 22:12 ` Jesse Brandeburg
@ 2005-06-09 22:22 ` David S. Miller
1 sibling, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-09 22:22 UTC (permalink / raw)
To: shemminger
Cc: jesse.brandeburg, hadi, john.ronciak, mitch.a.williams, mchan,
buytenh, jdmason, netdev, Robert.Olsson, ganesh.venkatesan
From: Stephen Hemminger <shemminger@osdl.org>
Date: Thu, 9 Jun 2005 15:05:46 -0700
> > I have profile data, here is an example of 5tx/5rx threads, where the
> > throughput was 1236Mb/s total, 936tx, 300rx, on 2.6.12-rc5 with old TSO
> > (the original problem case) we are at 100% cpu and generating 3289 ints/s,
> > with no hardware drops reported prolly due to my replenish patch
> > CPU: P4 / Xeon with 2 hyper-threads, speed 2791.36 MHz (estimated)
> > Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> > samples % image name symbol name
> > 533687 8.1472 vmlinux pskb_expand_head
> > 428726 6.5449 vmlinux __copy_user_zeroing_intel
> > 349934 5.3421 vmlinux _read_lock_irqsave
>
> We should kill all reader/writer locks in the fastpath. reader locks are
> more expensive than spinlocks unless they are going to be held for a fairly
> large window.
True, but I see no reason why it should have any influence here.
Let's not get distracted by this in our analysis of the problem.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 4:53 ` Stephen Hemminger
2005-06-07 12:38 ` jamal
@ 2005-06-21 20:20 ` David S. Miller
2005-06-21 20:38 ` Rick Jones
1 sibling, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-21 20:20 UTC (permalink / raw)
To: shemminger
Cc: mitch.a.williams, john.ronciak, mchan, hadi, buytenh, jdmason,
netdev, Robert.Olsson, ganesh.venkatesan, jesse.brandeburg
From: Stephen Hemminger <shemminger@osdl.org>
Date: Mon, 06 Jun 2005 21:53:32 -0700
> I noticed that the tg3 driver copies packets less than a certain
> threshold to a new buffer, but e1000 always passes the big buffer up
> the stack. Could this be having an impact?
I bet it does; this makes ACK processing a lot more expensive. And it
is so much cheaper to just recycle the big buffer back to the chip
if you copy to a small buffer, and it warms up the caches for the
packet headers as a side effect as well.
Actually, it has a _HUGE_ _HUGE_ impact. If you pass the big buffer
up, the receiving socket gets charged for the size of the huge buffer,
not for just the size of the packet contained within. This makes
sockets get overcharged for data reception, and it can cause all kinds
of performance problems.
I highly recommend that this gets fixed.
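For readers following the thread: the "charging" described above is done
with skb->truesize, which reflects the size of the allocated buffer rather
than the packet length. The 2.6-era receive-side accounting in
include/net/sock.h looks roughly like this (simplified here for illustration):

/* the socket is charged skb->truesize (the whole buffer), not skb->len
 * (just the packet), when a received skb is attached to it
 */
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_rfree;
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
}

So an ACK delivered in a full MTU-sized (or jumbo) RX buffer eats receive
budget as if it were that large, which is why copying small frames into
right-sized skbs matters.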
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-07 12:06 ` Martin Josefsson
2005-06-07 13:29 ` jamal
@ 2005-06-21 20:37 ` David S. Miller
2005-06-22 7:27 ` Eric Dumazet
2005-06-22 8:42 ` P
1 sibling, 2 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-21 20:37 UTC (permalink / raw)
To: gandalf
Cc: hadi, shemminger, mitch.a.williams, john.ronciak, mchan, buytenh,
jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
From: Martin Josefsson <gandalf@wlug.westbo.se>
Date: Tue, 7 Jun 2005 14:06:18 +0200 (CEST)
> One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1
> and loops and checks the status of each descriptor and stops when it finds
> a descriptor that isn't finished. Another way to do it is to read out the
> current position of the ring and loop from lastrxdescriptor+1 up to the
> current position. Scott Feldman implemented this for TX and there it
> increased performance somewhat (discussed here on netdev some months ago).
> I wonder if it could also decrease RX latency, I mean, we have to get the
> cache miss sometime anyway.
>
> I haven't checked how tg3 does it.
I don't think this matters all that much. tg3 does loop on RX
producer index, so doesn't touch descriptors unless the RX producer
index states there is a ready packet there.
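To make the two polling styles concrete, here is a rough sketch; the
structures, flags, and helpers are invented for illustration and are not
taken from e1000 or tg3:

struct rx_desc {
	u32 status;			/* hypothetical descriptor status word */
};
#define RXD_STAT_DONE	0x01		/* hypothetical "descriptor done" bit */

struct rx_ring {
	struct rx_desc	*desc;
	unsigned int	count;
	void __iomem	*producer_reg;	/* hypothetical producer-index register */
};

void handle_rx_packet(struct rx_ring *r, unsigned int i);

/* style 1: test each descriptor's status bit until one is not done yet */
static void poll_status_bits(struct rx_ring *r, unsigned int i)
{
	while (r->desc[i].status & RXD_STAT_DONE) {
		handle_rx_packet(r, i);
		i = (i + 1) % r->count;
	}
}

/* style 2: read the hardware producer index once, walk up to it */
static void poll_producer_index(struct rx_ring *r, unsigned int i)
{
	unsigned int hw_idx = readl(r->producer_reg);

	while (i != hw_idx) {
		handle_rx_packet(r, i);
		i = (i + 1) % r->count;
	}
}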
One thing I noticed with Super TSO testing is that e1000 has very
expensive TSO transmit processing. The big problem is the context
descriptor. This is 4 extra 32-bit words eaten up in the transmit
ring for every TSO packet. Whereas tg3 stores all the TSO offload
information directly in the normal TX descriptor (which is the
same size, 16 bytes, as the e1000 normal TX descriptor).
It accounts for a non-trivial amount of overhead. On my SunBlade1500
with Super TSO, the e1000 transmitter eats 40% of CPU to fill a gigabit
pipe whereas tg3 takes 30%. All of the extra time, based upon quick
scans of oprofile dumps, shows up in the e1000 driver.
Also, e1000 sends full MTU sized SKBs down into the stack even if the
packet is very small. This also hurts performance a lot. As
discussed elsewhere, it should use a "small packet" cut-off just like
other drivers do. If the RX frame is less than this cut-off value, a
new smaller sized SKB is allocated and the RX data copied into it.
The RX ring SKB is left in-place and given back to the chip.
My only guess is that the e1000 driver implemented things this way
to simplify the RX recycling logic. Well, it is an area ripe for
improvement in this driver :)
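A minimal sketch of the copybreak pattern being described; the names and
the cut-off value are illustrative (the cut-off is discussed later in the
thread), and a real driver would also fix up skb->dev, checksum state, etc.:

#define RX_COPYBREAK 256	/* assumed cut-off for this sketch */

/* If the frame is small, copy it into a right-sized skb and leave the big
 * RX ring skb in place so it can be handed straight back to the chip.
 * Returns the skb to pass up the stack.
 */
static struct sk_buff *rx_copybreak(struct sk_buff *ring_skb, unsigned int len)
{
	struct sk_buff *small;

	if (len >= RX_COPYBREAK)
		return ring_skb;		/* pass the big buffer up as-is */

	small = dev_alloc_skb(len + NET_IP_ALIGN);
	if (!small)
		return ring_skb;

	skb_reserve(small, NET_IP_ALIGN);
	memcpy(skb_put(small, len), ring_skb->data, len);
	/* the caller recycles ring_skb back onto the RX ring */
	return small;
}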
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 20:20 ` David S. Miller
@ 2005-06-21 20:38 ` Rick Jones
2005-06-21 20:55 ` David S. Miller
2005-06-21 21:47 ` Andi Kleen
0 siblings, 2 replies; 121+ messages in thread
From: Rick Jones @ 2005-06-21 20:38 UTC (permalink / raw)
To: David S. Miller
Cc: shemminger, mitch.a.williams, john.ronciak, mchan, hadi, buytenh,
jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
David S. Miller wrote:
> From: Stephen Hemminger <shemminger@osdl.org>
> Date: Mon, 06 Jun 2005 21:53:32 -0700
>
>
>>I noticed that the tg3 driver copies packets less than a certain
>>threshold to a new buffer, but e1000 always passes the big buffer up
>>the stack. Could this be having an impact?
>
>
> I bet it does, this makes ACK processing a lot more expensive.
Why would ACK processing care about the size of the buffer containing the ACK
segment?
> And it
> is so much cheaper to just recycle the big buffer back to the chip
> if you copy to a small buffer, and it warms up the caches for the
> packet headers as a side effect as well.
I would think that the cache business would be a wash either way. With 64 byte
cache lines (128 in some cases) just accessing the link-level header has brought
the IP header into the cache, and probably the TCP header as well.
Isn't the decision point the sum of allocating a small buffer and doing
the copy, versus allocating a new large buffer and (re)mapping it for DMA? I
guess that would come down to copy versus mapping overhead.
> Actually, it has a _HUGE_ _HUGE_ impact. If you pass the big buffer
> up, the receiving socket gets charged for the size of the huge buffer,
> not for just the size of the packet contained within. This makes
> sockets get overcharged for data reception, and it can cause all kinds
> of performance problems.
Then copy when the socket is about to fill with overhead bytes?
> I highly recommend that this gets fixed.
What is the cut-off point for the copy?
rick jones
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 20:38 ` Rick Jones
@ 2005-06-21 20:55 ` David S. Miller
2005-06-21 21:47 ` Andi Kleen
1 sibling, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-21 20:55 UTC (permalink / raw)
To: rick.jones2
Cc: shemminger, mitch.a.williams, john.ronciak, mchan, hadi, buytenh,
jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
From: Rick Jones <rick.jones2@hp.com>
Date: Tue, 21 Jun 2005 13:38:39 -0700
> > I highly recommend that this gets fixed.
>
> What is the cut-off point for the copy?
256 has been found to be a well functioning value to use.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 20:38 ` Rick Jones
2005-06-21 20:55 ` David S. Miller
@ 2005-06-21 21:47 ` Andi Kleen
2005-06-21 22:22 ` Donald Becker
1 sibling, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2005-06-21 21:47 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev, davem
Rick Jones <rick.jones2@hp.com> writes:
> > Actually, it has a _HUGE_ _HUGE_ impact. If you pass the big buffer
> > up, the receiving socket gets charged for the size of the huge buffer,
> > not for just the size of the packet contained within. This makes
> > sockets get overcharged for data reception, and it can cause all kinds
> > of performance problems.
>
> Then copy when the socket is about to fill with overhead bytes?
The stack has supported that since 2.4.
Mostly because it is the only sane way to handle devices with very big
MTU. But it turns off all kinds of fast paths before it happens, I
guess that is what David was referring to.
However, I suspect the cut-off points with rx-copybreak in common drivers
were often tuned before that code was introduced, and it might
be worth doing some retesting.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 21:47 ` Andi Kleen
@ 2005-06-21 22:22 ` Donald Becker
2005-06-21 22:34 ` Andi Kleen
0 siblings, 1 reply; 121+ messages in thread
From: Donald Becker @ 2005-06-21 22:22 UTC (permalink / raw)
To: Andi Kleen; +Cc: Rick Jones, netdev, davem
On 21 Jun 2005, Andi Kleen wrote:
> Rick Jones <rick.jones2@hp.com> writes:
>
> > > Actually, it has a _HUGE_ _HUGE_ impact. If you pass the big buffer
> > > up, the receiving socket gets charged for the size of the huge buffer,
> > > not for just the size of the packet contained within. This makes
> > > sockets get overcharged for data reception, and it can cause all kinds
> > > of performance problems.
> >
> > Then copy when the socket is about to fill with overhead bytes?
Or better, predict when the frame you are currently stuffing into the
queue will be there when the queue fills up. And then use the same
crystal ball to...
> Mostly because it is the only sane way to handle devices with very big
> MTU. But it turns off all kinds of fast paths before it happens, I
> guess that is what David was refering too.
>
> However I suspect the cut-off points with rx-copybreak in common driver
> have been often tuned before that code was introduced and it might
> be worth to do some retesting.
Most of that analysis and tuning was done in the 1996-99 timeframe.
While much has changed since then, the same basic parameters remain
- cache line size
- frame header size (MAC+IP+ProtocolHeader)
- hot cache lines from copying or type classification
- cold memory lines from PCI writes
I suspect you'll find that a good rx_copybreak is pretty much the same as
it was when I did the original evaluation.
If you are looking for an area that has changed: the hidden cost of
maintaining consistent cache lines on SMP systems is far higher than it
was back in the days of the Pentium Pro.
Donald Becker becker@scyld.com
Scyld Software A Penguin Computing company
914 Bay Ridge Road, Suite 220 www.scyld.com
Annapolis MD 21403 410-990-9993
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 22:22 ` Donald Becker
@ 2005-06-21 22:34 ` Andi Kleen
2005-06-22 0:08 ` Donald Becker
0 siblings, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2005-06-21 22:34 UTC (permalink / raw)
To: Donald Becker; +Cc: Andi Kleen, Rick Jones, netdev, davem
> While much has changed since then, the same basic parameters remain
> - cache line size
In 96 we had 32 byte cache lines. These days 64-128 are common,
with some 256 byte cache line systems around.
> - frame header size (MAC+IP+ProtocolHeader)
In 96 people tended to not use time stamps.
> - hot cache lines from copying or type classification
Not sure what you mean with that.
> - cold memory lines from PCI writes
I suspect in '96 chipsets also didn't do as aggressive prefetching
as they do today.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 22:34 ` Andi Kleen
@ 2005-06-22 0:08 ` Donald Becker
2005-06-22 4:44 ` Chris Friesen
2005-06-22 16:23 ` Leonid Grossman
0 siblings, 2 replies; 121+ messages in thread
From: Donald Becker @ 2005-06-22 0:08 UTC (permalink / raw)
To: Andi Kleen; +Cc: Rick Jones, netdev, davem
On Wed, 22 Jun 2005, Andi Kleen wrote:
> > While much has changed since then, the same basic parameters remain
> > - cache line size
>
> In 96 we had 32 byte cache lines. These days 64-128 are common,
> with some 256 byte cache line systems around.
Good point.
I believe that the most common line size is 64 bytes for L1 cache.
Most L2 caches that have larger line sizes still fill only 64 byte
blocks unless prefetching is triggered. (Feel free to correct me with
non-obscure CPUs and relevant cases. For instance, I know that on the
Itanium the 128 byte line L2 cache is used as L1, but only for FPU
fetches. That doesn't count.)
The implication here is that as soon as we look at the first byte of the
MAC address, we have read in 64 bytes. That's a whole minimum-size
Ethernet frame.
> > - frame header size (MAC+IP+ProtocolHeader)
>
> In 96 people tended to not use time stamps.
Ehh, not a big difference.
> > - hot cache lines from copying or type classification
> Not sure what you mean with that.
See the comment above. We decide if a packet is multicast vs. unicast, IP
vs. other at approximately interrupt/"rx_copybreak" time. Very few NICs
provide this info in status bits, so we end up looking at the packet
header. That read moves the previously known-uncached data (after all, it
just came in from a bus write) into the L1 cache for the CPU handling
the device. Once it's there, the copy is almost free.
[[ Background: Yes, allocating the new skbuff is very expensive. But
we can either allocate a new, correctly-sized skbuff to copy into, or
allocate a new full-sized skbuff to replace the one we will send to the Rx
queue. ]]
> > - cold memory lines from PCI writes
>
> I suspect in '96 chipsets also didn't do as aggressive prefetching
> as they do today.
Prefetching helps linear read bandwidth, but we shouldn't be triggering
it. And I claim that cache line prefetching only restores the relative
balance between L1/L2 caches, otherwise the long L2 cache lines would be
very expensive, with bump-read-bump-read behavior on linear scans through memory.
--
Donald Becker becker@scyld.com
Scyld Software Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220 www.scyld.com
Annapolis MD 21403 410-990-9993
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 0:08 ` Donald Becker
@ 2005-06-22 4:44 ` Chris Friesen
2005-06-22 11:31 ` Andi Kleen
2005-06-22 16:23 ` Leonid Grossman
1 sibling, 1 reply; 121+ messages in thread
From: Chris Friesen @ 2005-06-22 4:44 UTC (permalink / raw)
To: Donald Becker; +Cc: Andi Kleen, Rick Jones, netdev, davem
Donald Becker wrote:
> On Wed, 22 Jun 2005, Andi Kleen wrote:
>
>
>>>While much has changed since then, the same basic parameters remain
>>> - cache line size
>>
>>In 96 we had 32 byte cache lines. These days 64-128 are common,
>>with some 256 byte cache line systems around.
>
>
> Good point.
> I believe that the most common line size is 64 bytes for L1 cache.
If I recall, G4 chips are 32 bytes, and G5s are 128 bytes. Most current
x86 chips are 64 bytes though.
Chris
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 20:37 ` David S. Miller
@ 2005-06-22 7:27 ` Eric Dumazet
2005-06-22 8:42 ` P
1 sibling, 0 replies; 121+ messages in thread
From: Eric Dumazet @ 2005-06-22 7:27 UTC (permalink / raw)
To: David S. Miller
Cc: gandalf, hadi, shemminger, mitch.a.williams, john.ronciak, mchan,
buytenh, jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
David S. Miller a écrit :
>
> Also, e1000 sends full MTU sized SKBs down into the stack even if the
> packet is very small. This also hurts performance a lot. As
> discussed elsewhere, it should use a "small packet" cut-off just like
> other drivers do. If the RX frame is less than this cut-off value, a
> new smaller sized SKB is allocated and the RX data copied into it.
> The RX ring SKB is left in-place and given back to the chip.
>
> My only guess is that the e1000 driver implemented things this way
> to simplify the RX recycling logic. Well, it is an area ripe for
> improvement in this driver :)
>
>
Here is a copy of a mail from Scott Feldman (19/11/2003) when I asked him to add this copybreak feature into the e1000 driver:
It did improve performance on my workload. It also reduced the memory requirement a *lot* (it was using 300,000 active TCP sockets, mostly
receiving short frames).
Eric Dumazet
---------------------------------------------------
Try this (untested) patch. It's against 5.2.26 (which you don't have),
so hand patch it. (Sorry).
Do you have any way to measure performance? CPU utilization? The copy
isn't free.
Oh, also, this patch doesn't try to recycle the 4K skb that was copied
from. Instead, it's freed and re-allocated. Shouldn't be a big deal
because your totally system memory allocation should remain constant
(except for outstanding copybreak skb's).
Let us know how it goes.
-scott
----------------
diff -Naurp e1000-5.2.26/src/e1000_main.c e1000-5.2.26-cb/src/e1000_main.c
--- e1000-5.2.26/src/e1000_main.c	2003-11-17 19:23:38.000000000 -0800
+++ e1000-5.2.26-cb/src/e1000_main.c	2003-11-18 18:18:07.000000000 -0800
@@ -2343,6 +2343,20 @@ e1000_clean_rx_irq(struct e1000_adapter
 		}
 	}
+		/* RONCH 11/18/03 - code added for copybreak test */
+#define E1000_CB_LENGTH 128
+		if(length < E1000_CB_LENGTH) {
+			struct sk_buff *new_skb = dev_alloc_skb(length + 2);
+			if(new_skb) {
+				skb_reserve(new_skb, 2);
+				new_skb->dev = netdev;
+				memcpy(new_skb->data, skb->data, length);
+				dev_kfree_skb(skb);
+				skb = new_skb;
+			}
+		}
+		/* end copybreak code */
+
 		/* Good Receive */
 		skb_put(skb, length - ETHERNET_FCS_SIZE);
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-21 20:37 ` David S. Miller
2005-06-22 7:27 ` Eric Dumazet
@ 2005-06-22 8:42 ` P
2005-06-22 19:37 ` jamal
1 sibling, 1 reply; 121+ messages in thread
From: P @ 2005-06-22 8:42 UTC (permalink / raw)
To: David S. Miller
Cc: gandalf, hadi, shemminger, mitch.a.williams, john.ronciak, mchan,
buytenh, jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
jesse.brandeburg
David S. Miller wrote:
> Also, e1000 sends full MTU sized SKBs down into the stack even if the
> packet is very small. This also hurts performance a lot. As
> discussed elsewhere, it should use a "small packet" cut-off just like
> other drivers do. If the RX frame is less than this cut-off value, a
> new smaller sized SKB is allocated and the RX data copied into it.
> The RX ring SKB is left in-place and given back to the chip.
Yes the copy is essentially free here as the data is already cached.
As a data point, I went the whole hog and used buffer recycling
in my essentially packet-sniffing application, i.e. there are no
allocs per packet at all, and this makes a HUGE difference. On a
2x3.4GHz 2xe1000 system I can receive 620Kpps per port sustained
into my userspace app, which does a LOT of processing per packet.
Without the buffer recycling it was around 250Kpps.
Note I don't reuse an skb until the packet is copied into a
PACKET_MMAP buffer.
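For context, the PACKET_MMAP receive path being described looks roughly like
this from user space (a minimal sketch using the v1 ring format of this era;
ring sizes are arbitrary and error handling is omitted):

#include <stdio.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
	int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	struct tpacket_req req = {
		.tp_block_size = 4096,
		.tp_block_nr   = 64,
		.tp_frame_size = 2048,
		.tp_frame_nr   = 128,	/* block_size * block_nr / frame_size */
	};
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	unsigned int i = 0;
	char *ring;

	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
	ring = mmap(NULL, req.tp_block_size * req.tp_block_nr,
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	for (;;) {
		struct tpacket_hdr *hdr =
			(struct tpacket_hdr *)(ring + i * req.tp_frame_size);

		while (!(hdr->tp_status & TP_STATUS_USER))
			poll(&pfd, 1, -1);	/* wait for the kernel to fill the frame */

		/* packet data starts at (char *)hdr + hdr->tp_mac */
		printf("got %u bytes\n", hdr->tp_len);

		hdr->tp_status = TP_STATUS_KERNEL;	/* hand the frame back */
		i = (i + 1) % req.tp_frame_nr;
	}
}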
--
Pádraig Brady - http://www.pixelbeat.org
--
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 4:44 ` Chris Friesen
@ 2005-06-22 11:31 ` Andi Kleen
0 siblings, 0 replies; 121+ messages in thread
From: Andi Kleen @ 2005-06-22 11:31 UTC (permalink / raw)
To: Chris Friesen; +Cc: Donald Becker, Andi Kleen, Rick Jones, netdev, davem
> If I recall, G4 chips are 32 bytes, and G5s are 128 bytes. Most current
> x86 chips are 64 bytes though.
P4s are effectively 128 byte. And that is the most common x86 right now.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
2005-06-22 0:08 ` Donald Becker
2005-06-22 4:44 ` Chris Friesen
@ 2005-06-22 16:23 ` Leonid Grossman
2005-06-22 16:37 ` jamal
2005-06-22 17:05 ` Andi Kleen
1 sibling, 2 replies; 121+ messages in thread
From: Leonid Grossman @ 2005-06-22 16:23 UTC (permalink / raw)
To: 'Donald Becker', 'Andi Kleen'
Cc: 'Rick Jones', netdev, davem
>
> See the comment above. We decide if a packet is multicast vs.
> unicast, IP vs. other at approximately
> interrupt/"rx_copybreak" time. Very few NIC provide this
> info in status bits, so we end up looking at the packet
> header. That read moves the previously known-uncached data
> (after all, it was just came in from a bus write) into the L1
> cache for the CPU handling the device. Once it's there, the
> copy is almost free.
What status bits does a NIC have to provide, in order for the stack to avoid
touching headers?
In our case, the headers are separated by the hardware so ideally we would
like to avoid any header processing altogether,
and reduce the number of cache misses.
>
> [[ Background: Yes, the allocating the new skbuff is very
> expensive. But we can either allocate a new, correctly-sized
> skbuff to copy into, or allocate a new full-sized skbuff to
> replace the one we will send to the Rx queue. ]]
>
> > > - cold memory lines from PCI writes
> >
> > I suspect in '96 chipsets also didn't do as aggressive
> prefetching as
> > they do today.
>
> Prefetching helps linear read bandwidth, but we shouldn't be
> triggering it. And I claim that cache line prefetching only
> restores the relative balance between L1/L2 caches, otherwise
> the long L2 cache lines would be very expensive with
> bump-read-bump-read with linear scans through memory.
>
> --
> Donald Becker becker@scyld.com
> Scyld Software Scyld Beowulf
> cluster systems
> 914 Bay Ridge Road, Suite 220 www.scyld.com
> Annapolis MD 21403 410-990-9993
>
>
>
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
2005-06-22 16:23 ` Leonid Grossman
@ 2005-06-22 16:37 ` jamal
2005-06-22 18:00 ` Leonid Grossman
2005-06-22 17:05 ` Andi Kleen
1 sibling, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-22 16:37 UTC (permalink / raw)
To: Leonid Grossman
Cc: 'Donald Becker', 'Andi Kleen',
'Rick Jones', netdev, davem
On Wed, 2005-22-06 at 09:23 -0700, Leonid Grossman wrote:
> >
> > See the comment above. We decide if a packet is multicast vs.
> > unicast, IP vs. other at approximately
> > interrupt/"rx_copybreak" time. Very few NIC provide this
> > info in status bits, so we end up looking at the packet
> > header. That read moves the previously known-uncached data
> > (after all, it was just came in from a bus write) into the L1
> > cache for the CPU handling the device. Once it's there, the
> > copy is almost free.
>
> What status bits a NIC has to provide, in order for the stack to avoid
> touching headers?
> In our case, the headers are separated by the hardware so ideally we would
> like to avoid any header processing altogether,
> and reduce the number of cache misses.
>
Provide metadata that can be used to totally replace eth_type_trans(),
i.e. answer the questions: is it multi/uni/broadcast, is the packet for
us (you would need to be programmed with what "for us" means), is it IP,
ARP, etc. I am sure any standard NIC these days can do a subset of these.
If you want to go one step further, then allow the user to download a
number of filters and tell you what tag you should put on the descriptor
when sending the packet to user space on a match or mismatch.
If, say, you allowed 1024 such filters (not very different from the
current multicast filters), you could cut down a lot of CPU time.
cheers,
jamal
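As a rough illustration of what replacing eth_type_trans() with descriptor
metadata could look like in a driver: the RXD_* flags and the descriptor
fields below are invented for illustration, not taken from any shipping NIC.

static void rx_fill_from_descriptor(struct sk_buff *skb,
				    struct net_device *dev,
				    u32 desc_status, __be16 desc_proto)
{
	skb->dev = dev;
	skb->mac.raw = skb->data;	/* 2.6.12-era sk_buff field layout */
	skb_pull(skb, ETH_HLEN);

	if (desc_status & RXD_MULTICAST)
		skb->pkt_type = PACKET_MULTICAST;
	else if (desc_status & RXD_BROADCAST)
		skb->pkt_type = PACKET_BROADCAST;
	else if (!(desc_status & RXD_ADDR_MATCH))
		skb->pkt_type = PACKET_OTHERHOST;
	else
		skb->pkt_type = PACKET_HOST;

	/* hardware-reported L2 protocol, e.g. htons(ETH_P_IP) */
	skb->protocol = desc_proto;
}

The packet header itself is never read on this path, so the first (cold)
cache line access is deferred until the stack actually needs the headers.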
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 16:23 ` Leonid Grossman
2005-06-22 16:37 ` jamal
@ 2005-06-22 17:05 ` Andi Kleen
1 sibling, 0 replies; 121+ messages in thread
From: Andi Kleen @ 2005-06-22 17:05 UTC (permalink / raw)
To: Leonid Grossman
Cc: 'Donald Becker', 'Andi Kleen',
'Rick Jones', netdev, davem
On Wed, Jun 22, 2005 at 09:23:41AM -0700, Leonid Grossman wrote:
>
> >
> > See the comment above. We decide if a packet is multicast vs.
> > unicast, IP vs. other at approximately
> > interrupt/"rx_copybreak" time. Very few NIC provide this
> > info in status bits, so we end up looking at the packet
> > header. That read moves the previously known-uncached data
> > (after all, it was just came in from a bus write) into the L1
> > cache for the CPU handling the device. Once it's there, the
> > copy is almost free.
>
> What status bits a NIC has to provide, in order for the stack to avoid
> touching headers?
To avoid it completely is pretty hard - you would need to supply
nearly everything in the header.
But when you supply the L2 protocol and unicast/broadcast/multicast information,
and whether the packet is destined to the local host or not, then the
headers can be fetched with an early prefetch, and when
the header is later processed it might, with some luck,
already be in cache.
BTW, quite a few modern NICs actually provide this information, contrary
to what Donald stated (sometimes with restrictions, like it only
works without multicast), but it hasn't been widely used yet.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
2005-06-22 16:37 ` jamal
@ 2005-06-22 18:00 ` Leonid Grossman
2005-06-22 18:06 ` Andi Kleen
0 siblings, 1 reply; 121+ messages in thread
From: Leonid Grossman @ 2005-06-22 18:00 UTC (permalink / raw)
To: hadi
Cc: 'Donald Becker', 'Andi Kleen',
'Rick Jones', netdev, davem
> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca]
> Sent: Wednesday, June 22, 2005 9:37 AM
> To: Leonid Grossman
> Cc: 'Donald Becker'; 'Andi Kleen'; 'Rick Jones';
> netdev@oss.sgi.com; davem@redhat.com
> Subject: RE: RFC: NAPI packet weighting patch
>
> On Wed, 2005-22-06 at 09:23 -0700, Leonid Grossman wrote:
> > >
> > > See the comment above. We decide if a packet is multicast vs.
> > > unicast, IP vs. other at approximately interrupt/"rx_copybreak"
> > > time. Very few NIC provide this info in status bits, so
> we end up
> > > looking at the packet header. That read moves the previously
> > > known-uncached data (after all, it was just came in from a bus
> > > write) into the L1 cache for the CPU handling the device.
> Once it's
> > > there, the copy is almost free.
> >
> > What status bits a NIC has to provide, in order for the
> stack to avoid
> > touching headers?
> > In our case, the headers are separated by the hardware so
> ideally we
> > would like to avoid any header processing altogether, and
> reduce the
> > number of cache misses.
> >
>
> Provide metadata that can be used to totaly replace
> eth_type_trans() i.e answer the questions: is it
> multi/uni/broadcast, is the packet for us (you would need to
> be programmed with what for us means), Is it IP, ARP etc. I
> am sure any standard NIC these days can do a subset of these
> You want to go one step further then allow the user to
> download a number of filters and tell you what tag you should
> put on the descriptor when sending the packet to user space
> on a match or mismatch.
> If say you allowed 1024 such filters (not very different from
> the current multicast filters), you could cut down a lot of CPU time.
Well, this is all supported in the hardware.
The number of filters is only 256 (not 1024) for direct match, but it is
unlimited for a hash match.
Of course, the upper layer still needs to be able to take advantage of the
filters...
Outside of the filters capability, from the (granted, pretty limited)
testing we see some noticeable improvement from providing status bits, but it
is not as big as I would expect.
It looks like the headers are still being touched somewhere... We will look
at this some more.
Thanks, Leonid
>
> cheers,
> jamal
>
>
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 18:00 ` Leonid Grossman
@ 2005-06-22 18:06 ` Andi Kleen
2005-06-22 20:22 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2005-06-22 18:06 UTC (permalink / raw)
To: Leonid Grossman
Cc: hadi, 'Donald Becker', 'Andi Kleen',
'Rick Jones', netdev, davem
> Outside of the filters capability, from the (granted, pretty limited)
> testing we see some noticeable improvement from providing status bits but it
> is not as big as I would expect,
> It looks like the headers are still being touched somewhere... We will look
> at this some more.
The headers are of course read in the main stack. No way around that.
It basically helps you only when you can space the prefetch for the header
out long enough that the data is in cache when you need it. However,
it is tricky because CPUs have only a limited number of load queue entries, and doing
too many prefetches will just overflow that.
This can be done by batching L2 packet processing, but doing so
is not good for your latency.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 8:42 ` P
@ 2005-06-22 19:37 ` jamal
2005-06-23 8:56 ` P
0 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-22 19:37 UTC (permalink / raw)
To: P
Cc: David S. Miller, gandalf, shemminger, mitch.a.williams,
john.ronciak, mchan, buytenh, jdmason, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
On Wed, 2005-22-06 at 09:42 +0100, P@draigBrady.com wrote:
>
> Yes the copy is essentially free here as the data is already cached.
>
> As a data point, I went the whole hog and used buffer recycling
> in my essentially packet sniffing application. I.E. there are no
> allocs per packet at all, and this make a HUGE difference. On a
> 2x3.4GHz 2xe1000 system I can receive 620Kpps per port sustained
> into my userspace app which does a LOT of processing per packet.
> Without the buffer recycling is was around 250Kpps.
> Note I don't reuse an skb until the packet is copied into a
> PACKET_MMAP buffer.
Was this machine SMP? NAPI involved? I take it nothing is interfering in
the middle with the headers?
cheers,
jamal
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 18:06 ` Andi Kleen
@ 2005-06-22 20:22 ` David S. Miller
2005-06-22 20:35 ` Rick Jones
` (3 more replies)
0 siblings, 4 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-22 20:22 UTC (permalink / raw)
To: ak; +Cc: leonid.grossman, hadi, becker, rick.jones2, netdev, davem
From: Andi Kleen <ak@suse.de>
Date: Wed, 22 Jun 2005 20:06:55 +0200
> However it is tricky because CPUs have only a limited load queue
> entries and doing too many prefetches will just overflow that.
Several processors can queue about 8 prefetch requests, and
these slots are independent of those consumed by a load.
Yes, if you queue too many prefetches, the queue overflows.
I think the optimal scheme would be:
1) eth_type_trans() info in RX descriptor
2) prefetch(skb->data) done as early as possible in driver
RX handling
Actually, I believe the most optimal scheme is:
foo_driver_rx()
{
for_each_rx_descriptor() {
...
skb = driver_priv->rx_skbs[index];
prefetch(skb->data);
skb = realloc_or_recycle_rx_descriptor(skb, index);
if (skb == NULL)
goto next_rxd;
skb->protocol = eth_type_trans(skb, driver_priv->dev);
netif_receive_skb(skb);
...
next_rxd:
...
}
}
The idea is that first the prefetch goes into flight, then you do the
recycle or reallocation of the RX descriptor SKB, then you try to
touch the data.
This makes it very likely the prefetch will be in the cpu in time.
Everyone seems to have this absolute fetish about batching the RX
descriptor refilling work. It's wrong, it should be done when you
pull a receive packet off the ring, for many reasons. Off the top of
my head:
1) Descriptors are refilled as soon as possible, decreasing
the chance of the device hitting the end of the RX ring
and thus unable to receive a packet.
2) As shown above, it gives you compute time which can be used to
schedule the prefetch. This nearly makes RX replenishment free.
Instead of having the CPU spin on a cache miss when we run
eth_type_trans() during those cycles, we do useful work.
I'm going to play around with these ideas in the tg3 driver.
Obvious patch below.
--- 1/drivers/net/tg3.c.~1~ 2005-06-22 12:33:07.000000000 -0700
+++ 2/drivers/net/tg3.c 2005-06-22 13:19:13.000000000 -0700
@@ -2772,6 +2772,13 @@
goto next_pkt_nopost;
}
+ /* Prefetch now. The recycle/realloc of the RX
+ * entry is moderately expensive, so by the time
+ * that is complete the data should have reached
+ * the cpu.
+ */
+ prefetch(skb->data);
+
work_mask |= opaque_key;
if ((desc->err_vlan & RXD_ERR_MASK) != 0 &&
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 20:22 ` David S. Miller
@ 2005-06-22 20:35 ` Rick Jones
2005-06-22 20:43 ` David S. Miller
2005-06-22 21:10 ` Andi Kleen
` (2 subsequent siblings)
3 siblings, 1 reply; 121+ messages in thread
From: Rick Jones @ 2005-06-22 20:35 UTC (permalink / raw)
To: netdev; +Cc: hadi, becker
> Everyone seems to have this absolute fetish about batching the RX
> descriptor refilling work. It's wrong, it should be done when you
> pull a receive packet off the ring, for many reasons. Off the top of
> my head:
>
> 1) Descriptors are refilled as soon as possible, decreasing
> the chance of the device hitting the end of the RX ring
> and thus unable to receive a packet.
IFF one pokes the NIC for each buffer right?
rick jones
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 20:35 ` Rick Jones
@ 2005-06-22 20:43 ` David S. Miller
0 siblings, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-22 20:43 UTC (permalink / raw)
To: rick.jones2; +Cc: netdev, hadi, becker
From: Rick Jones <rick.jones2@hp.com>
Date: Wed, 22 Jun 2005 13:35:46 -0700
> > Everyone seems to have this absolute fetish about batching the RX
> > descriptor refilling work. It's wrong, it should be done when you
> > pull a receive packet off the ring, for many reasons. Off the top of
> > my head:
> >
> > 1) Descriptors are refilled as soon as possible, decreasing
> > the chance of the device hitting the end of the RX ring
> > and thus unable to receive a packet.
>
> IFF one pokes the NIC for each buffer right?
Or "every 5" or something like that.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 20:22 ` David S. Miller
2005-06-22 20:35 ` Rick Jones
@ 2005-06-22 21:10 ` Andi Kleen
2005-06-22 21:16 ` David S. Miller
2005-06-22 21:53 ` Chris Friesen
2005-06-22 21:38 ` Eric Dumazet
2005-06-22 22:42 ` Leonid Grossman
3 siblings, 2 replies; 121+ messages in thread
From: Andi Kleen @ 2005-06-22 21:10 UTC (permalink / raw)
To: David S. Miller
Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem
On Wed, Jun 22, 2005 at 01:22:41PM -0700, David S. Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Wed, 22 Jun 2005 20:06:55 +0200
>
> > However it is tricky because CPUs have only a limited load queue
> > entries and doing too many prefetches will just overflow that.
>
> Several processors can queue about 8 prefetch requests, and
> these slots are independant of those consumed by a load.
8 entries? That sounds very small. Is that an old Sparc or something? :)
An Opteron has 44 entries, effectively 32 for L2. Netburst
or POWER4 derived CPUs have more than that.
> Yes, if you queue too many prefetches, the queue overflows.
>
> I think the optimal scheme would be:
>
> 1) eth_type_trans() info in RX descriptor
> 2) prefetch(skb->data) done as early as possible in driver
> RX handling
>
> Actually, I believe to most optimal scheme is:
Looks reasonable. Not sure about the "most optimal" though; some benchmarking
of different schemes would probably be a good idea.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 21:10 ` Andi Kleen
@ 2005-06-22 21:16 ` David S. Miller
2005-06-22 21:53 ` Chris Friesen
1 sibling, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-22 21:16 UTC (permalink / raw)
To: ak; +Cc: leonid.grossman, hadi, becker, rick.jones2, netdev, davem
From: Andi Kleen <ak@suse.de>
Date: Wed, 22 Jun 2005 23:10:58 +0200
> 8 entries? That sounds very small. Is that an old Sparc or something? :)
Hey, Sparc does suck, this isn't news for anyone :-)
> Looks reasonable. Not sure about the "most optimal" though, some benchmarking
> of different schemes would be probably a good idea.
Absolutely.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 20:22 ` David S. Miller
2005-06-22 20:35 ` Rick Jones
2005-06-22 21:10 ` Andi Kleen
@ 2005-06-22 21:38 ` Eric Dumazet
2005-06-22 22:13 ` Eric Dumazet
2005-06-22 22:23 ` David S. Miller
2005-06-22 22:42 ` Leonid Grossman
3 siblings, 2 replies; 121+ messages in thread
From: Eric Dumazet @ 2005-06-22 21:38 UTC (permalink / raw)
To: David S. Miller
Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem
David S. Miller a écrit :
>
> 2) As shown above, it gives you compute time which can be used to
> schedule the prefetch. This nearly makes RX replenishment free.
> Instead of having the CPU spin on a cache miss when we run
> eth_type_trans() during those cycles, we do useful work.
>
> I'm going to play around with these ideas in the tg3 driver.
> Obvious patch below.
Then maybe we could also play with prefetchw() in the case the incoming frame
is small enough to be copied to a new skb.
drivers/net/tg3.c
copy_skb = dev_alloc_skb(len + 2);
if (copy_skb == NULL)
goto drop_it_no_recycle;
+ prefetchw(copy_skb->data);
copy_skb->dev = tp->dev;
skb_reserve(copy_skb, 2);
skb_put(copy_skb, len);
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 21:10 ` Andi Kleen
2005-06-22 21:16 ` David S. Miller
@ 2005-06-22 21:53 ` Chris Friesen
2005-06-22 22:11 ` Andi Kleen
1 sibling, 1 reply; 121+ messages in thread
From: Chris Friesen @ 2005-06-22 21:53 UTC (permalink / raw)
To: Andi Kleen
Cc: David S. Miller, leonid.grossman, hadi, becker, rick.jones2,
netdev
Andi Kleen wrote:
> 8 entries? That sounds very small. Is that an old Sparc or something? :)
The G5 has 8 prefetch streams. Not an ancient cpu.
Chris
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 21:53 ` Chris Friesen
@ 2005-06-22 22:11 ` Andi Kleen
0 siblings, 0 replies; 121+ messages in thread
From: Andi Kleen @ 2005-06-22 22:11 UTC (permalink / raw)
To: Chris Friesen
Cc: Andi Kleen, David S. Miller, leonid.grossman, hadi, becker,
rick.jones2, netdev
On Wed, Jun 22, 2005 at 03:53:30PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
>
> >8 entries? That sounds very small. Is that an old Sparc or something? :)
>
> The G5 has 8 prefetch streams. Not an ancient cpu.
A prefetch stream means a context of the automatic prefetcher.
It is different from a load queue entry, which is just a load of a cache line
that can be triggered by user instructions or the auto prefetcher.
Each prefetch stream would consume a lot of them, so just for your 8 streams
above you probably need a large two-digit number or more.
I don't have exact numbers for the PPC970, but afaik its LS unit
has a very long queue. On POWER4 (which is a very similar CPU) we see
a lot of races that don't happen on other platforms. That seems to be
because it reorders writes very aggressively. I suppose this is true for
reads as well.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 21:38 ` Eric Dumazet
@ 2005-06-22 22:13 ` Eric Dumazet
2005-06-22 22:30 ` David S. Miller
2005-06-22 22:23 ` David S. Miller
1 sibling, 1 reply; 121+ messages in thread
From: Eric Dumazet @ 2005-06-22 22:13 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, ak, leonid.grossman, hadi, becker, rick.jones2,
netdev, davem
Eric Dumazet a écrit :
>
> Then maybe we could also play with prefetchw() in the case the incoming
> frame
> is small enough to be copied to a new skb.
>
> drivers/net/tg3.c
>
> copy_skb = dev_alloc_skb(len + 2);
> if (copy_skb == NULL)
> goto drop_it_no_recycle;
> + prefetchw(copy_skb->data);
>
> copy_skb->dev = tp->dev;
> skb_reserve(copy_skb, 2);
> skb_put(copy_skb, len);
>
>
>
I also found that the memcpy() done to copy the data to the new skb suffers from misalignment.
This is because of the skb_reserve(skb, 2) that was done on both skbs, and memcpy() (at least on x86_64) doing long word copies without checking
alignment of source or destination.
Maybe we could :
1) make sure both skbs had the same skb_reserve() of 2 (that's not clear because tg3.c mixes the '2' and tp->rx_offset,
and according to a comment :
rx_offset != 2 iff this is a 5701 card running
in PCI-X mode
2) and do :
- memcpy(copy_skb->data, skb->data, len);
+ memcpy(copy_skb->data-2, skb->data-2, len+2);
(That is, copy 2 more bytes, but gain an aligned copy to speed up memcpy().)
Eric Dumazet
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 21:38 ` Eric Dumazet
2005-06-22 22:13 ` Eric Dumazet
@ 2005-06-22 22:23 ` David S. Miller
2005-06-23 12:14 ` jamal
1 sibling, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-22 22:23 UTC (permalink / raw)
To: dada1; +Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 22 Jun 2005 23:38:21 +0200
> Then maybe we could also play with prefetchw() in the case the
> incoming frame is small enough to be copied to a new skb.
That's a good idea too. In fact, this would deal with platforms
that use non-temporal stores in their memcpy() implementation.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 22:13 ` Eric Dumazet
@ 2005-06-22 22:30 ` David S. Miller
0 siblings, 0 replies; 121+ messages in thread
From: David S. Miller @ 2005-06-22 22:30 UTC (permalink / raw)
To: dada1; +Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 23 Jun 2005 00:13:21 +0200
> I also found that the memcpy() done to copy the data to the new skb suffers from misalignment.
>
> This is because of skb_reserve(skbs, 2) that was done on both skb, and memcpy() (at least on x86_64) doing long word copies without checking
> alignment of source or destination.
>
> Maybe we could :
>
> 1) make sure both skbs had the same skb_reserve() of 2 (thats not clear because tg3.c mixes the '2' and tp->rx_offset,
> and according to a comment :
> rx_offset != 2 iff this is a 5701 card running
> in PCI-X mode
>
> 2) and do :
>
> - memcpy(copy_skb->data, skb->data, len);
> + memcpy(copy_skb->data-2, skb->data-2, len+2);
>
> (That is copy 2 more bytes, but gain aligned copy to speedup memcpy())
Yep, good idea. Actually, the driver should be using
NET_IP_ALIGN for rx_offset, except in the case of the 5701 card running
in PCI-X mode.
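For reference, the usual NET_IP_ALIGN allocation pattern is roughly the
following (a generic sketch, not the tg3 code itself):

static struct sk_buff *alloc_rx_skb(unsigned int rx_buf_size)
{
	struct sk_buff *skb = dev_alloc_skb(rx_buf_size + NET_IP_ALIGN);

	if (skb)
		/* shift the data pointer so the IP header, which sits 14
		 * bytes into the frame, lands on a 4-byte boundary
		 */
		skb_reserve(skb, NET_IP_ALIGN);
	return skb;
}

Reserving the same offset on both the ring skb and any copybreak skb gives
the two buffers identical alignment, which is what makes the
memcpy(...-2, ...-2, len+2) trick discussed above work.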
^ permalink raw reply [flat|nested] 121+ messages in thread
* RE: RFC: NAPI packet weighting patch
2005-06-22 20:22 ` David S. Miller
` (2 preceding siblings ...)
2005-06-22 21:38 ` Eric Dumazet
@ 2005-06-22 22:42 ` Leonid Grossman
2005-06-22 23:13 ` Andi Kleen
3 siblings, 1 reply; 121+ messages in thread
From: Leonid Grossman @ 2005-06-22 22:42 UTC (permalink / raw)
To: 'David S. Miller', ak; +Cc: hadi, becker, rick.jones2, netdev, davem
> -----Original Message-----
> From: David S. Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, June 22, 2005 1:23 PM
> To: ak@suse.de
> Cc: leonid.grossman@neterion.com; hadi@cyberus.ca;
> becker@scyld.com; rick.jones2@hp.com; netdev@oss.sgi.com;
> davem@redhat.com
> Subject: Re: RFC: NAPI packet weighting patch
>
> From: Andi Kleen <ak@suse.de>
> Date: Wed, 22 Jun 2005 20:06:55 +0200
>
> > However it is tricky because CPUs have only a limited load queue
> > entries and doing too many prefetches will just overflow that.
>
> Several processors can queue about 8 prefetch requests, and
> these slots are independant of those consumed by a load.
>
> Yes, if you queue too many prefetches, the queue overflows.
>
> I think the optimal scheme would be:
>
> 1) eth_type_trans() info in RX descriptor
> 2) prefetch(skb->data) done as early as possible in driver
> RX handling
>
> Actually, I believe to most optimal scheme is:
>
> foo_driver_rx()
> {
> for_each_rx_descriptor() {
> ...
> skb = driver_priv->rx_skbs[index];
> prefetch(skb->data);
>
> skb = realloc_or_recycle_rx_descriptor(skb, index);
> if (skb == NULL)
> goto next_rxd;
>
> skb->prot = eth_type_trans(skb, driver_priv->dev);
> netif_receive_skb(skb);
> ...
> next_rxd:
> ...
> }
> }
>
> The idea is that first the prefetch goes into flight, then
> you do the recycle or reallocation of the RX descriptor SKB,
> then you try to touch the data.
>
> This makes it very likely the prefetch will be in the cpu in time.
>
> Everyone seems to have this absolute fetish about batching
> the RX descriptor refilling work. It's wrong, it should be
> done when you pull a receive packet off the ring, for many
> reasons. Off the top of my head:
This is very hw-dependent, since there are NICs that read descriptors in
batches anyway - but the second argument below is compelling.
>
> 1) Descriptors are refilled as soon as possible, decreasing
> the chance of the device hitting the end of the RX ring
> and thus unable to receive a packet.
>
> 2) As shown above, it gives you compute time which can be used to
> schedule the prefetch. This nearly makes RX replenishment free.
> Instead of having the CPU spin on a cache miss when we run
> eth_type_trans() during those cycles, we do useful work.
>
> I'm going to play around with these ideas in the tg3 driver.
> Obvious patch below.
We will play around with the s2io driver as well, there seem to be several
interesting ideas to try - thanks a lot for the input!
Cheers, Leonid
>
> --- 1/drivers/net/tg3.c.~1~ 2005-06-22 12:33:07.000000000 -0700
> +++ 2/drivers/net/tg3.c 2005-06-22 13:19:13.000000000 -0700
> @@ -2772,6 +2772,13 @@
> goto next_pkt_nopost;
> }
>
> + /* Prefetch now. The recycle/realloc of the RX
> + * entry is moderately expensive, so by the time
> + * that is complete the data should have reached
> + * the cpu.
> + */
> + prefetch(skb->data);
> +
> work_mask |= opaque_key;
>
> if ((desc->err_vlan & RXD_ERR_MASK) != 0 &&
>
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 22:42 ` Leonid Grossman
@ 2005-06-22 23:13 ` Andi Kleen
2005-06-22 23:19 ` David S. Miller
0 siblings, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2005-06-22 23:13 UTC (permalink / raw)
To: Leonid Grossman
Cc: 'David S. Miller', ak, hadi, becker, rick.jones2, netdev,
davem
> This is very hw-dependent, since there are NICs that read descriptors in
> batches anyways - but the second argument below is compelling.
The computing time must be quite long to be really a win.
You need to waste a few hundred cycles at least on a modern fast CPU.
-Andi
> >
> > 2) As shown above, it gives you compute time which can be used to
> > schedule the prefetch. This nearly makes RX replenishment free.
> > Instead of having the CPU spin on a cache miss when we run
> > eth_type_trans() during those cycles, we do useful work.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 23:13 ` Andi Kleen
@ 2005-06-22 23:19 ` David S. Miller
2005-06-22 23:23 ` Andi Kleen
0 siblings, 1 reply; 121+ messages in thread
From: David S. Miller @ 2005-06-22 23:19 UTC (permalink / raw)
To: ak; +Cc: leonid.grossman, davem, hadi, becker, rick.jones2, netdev
From: Andi Kleen <ak@suse.de>
Date: Thu, 23 Jun 2005 01:13:00 +0200
> The computing time must be quite long to be really a win.
> You need to waste a few hundred cycles at least on a modern fast CPU.
SKB allocation more than fits this requirement, and that is exactly
what the RX descriptor replenishment will do.
Even if SKB allocation was only half the necessary number of cycles
for the prefetch to hit the cpu, it'd still be a win.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 23:19 ` David S. Miller
@ 2005-06-22 23:23 ` Andi Kleen
0 siblings, 0 replies; 121+ messages in thread
From: Andi Kleen @ 2005-06-22 23:23 UTC (permalink / raw)
To: David S. Miller
Cc: ak, leonid.grossman, davem, hadi, becker, rick.jones2, netdev
On Wed, Jun 22, 2005 at 07:19:56PM -0400, David S. Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Thu, 23 Jun 2005 01:13:00 +0200
>
> > The computing time must be quite long to be really a win.
> > You need to waste a few hundred cycles at least on a modern fast CPU.
>
> SKB allocation more than fits this requirement, and that is exactly
> what the RX descriptor replenishment will do.
It shouldn't in theory. Unless they did something bad to the slab
allocator again when I wasn't looking ;-)
>
> Even if SKB allocation was only half the necessary number of cycles
> for the prefetch to hit the cpu, it'd still be a win.
If it's too small then it might be lost in the noise.
-Andi
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 19:37 ` jamal
@ 2005-06-23 8:56 ` P
0 siblings, 0 replies; 121+ messages in thread
From: P @ 2005-06-23 8:56 UTC (permalink / raw)
To: hadi
Cc: David S. Miller, gandalf, shemminger, mitch.a.williams,
john.ronciak, mchan, buytenh, jdmason, netdev, Robert.Olsson,
ganesh.venkatesan, jesse.brandeburg
jamal wrote:
> On Wed, 2005-22-06 at 09:42 +0100, P@draigBrady.com wrote:
>
>
>>Yes the copy is essentially free here as the data is already cached.
>>
>>As a data point, I went the whole hog and used buffer recycling
>>in my essentially packet sniffing application. I.E. there are no
>>allocs per packet at all, and this make a HUGE difference. On a
>>2x3.4GHz 2xe1000 system I can receive 620Kpps per port sustained
>>into my userspace app which does a LOT of processing per packet.
>>Without the buffer recycling is was around 250Kpps.
>>Note I don't reuse an skb until the packet is copied into a
>>PACKET_MMAP buffer.
>
>
> Was this machine SMP?
Yes. 2 x 3.4GHz P4s
1 logical CPU per port (irq affinity)
1 thread (NB on same logical CPU as irq (sched_affinity))
to do user space per packet processing.
> NAPI involved?
Yep.
> I take it nothing interfering in
> the middle with the headers?
It uses the standard path to PACKET_MMAP buffer
e1000_clean_rx_irq -> netif_receive_skb -> tpacket_rcv
Pádraig.
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-22 22:23 ` David S. Miller
@ 2005-06-23 12:14 ` jamal
2005-06-23 17:36 ` David Mosberger
0 siblings, 1 reply; 121+ messages in thread
From: jamal @ 2005-06-23 12:14 UTC (permalink / raw)
To: David S. Miller
Cc: Lennert Buytenhek, davidm, netdev, dada1, ak, leonid.grossman,
becker, rick.jones2, davem
On Wed, 2005-22-06 at 15:23 -0700, David S. Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 22 Jun 2005 23:38:21 +0200
>
> > Then maybe we could also play with prefetchw() in the case the
> > incoming frame is small enough to be copied to a new skb.
>
> That's a good idea too. In fact, this would deal with platforms
> that use non-temporal stores in their memcpy() implementation.
For the fans of the e1000 (or even the tg3-deprived people), here's a
patch which originated from David Mosberger that I played around with (about
9 months back) - it will need some hand patching for the latest driver.
Similar approach: prefetch skb->data, twiddle twiddle not little star,
touch header.
I found the aggressive mode effective on a Xeon, but I believe David is
using this on x86_64. So Lennert, I lied to you saying it was never
effective on x86. You just have to do the right juju, such as factoring
in the memory load-latency and how much cache you have on your specific
CPU.
CCing davidm (in addition To: davem of course ;->) so he may provide
more insight on his tests.
Interestingly, if you miss the "twiddle here" step (as I saw in my
experiments) and do the obvious (such as defining AGGRESSIVE to 0), you
in fact end up paying a penalty in performance.
cheers,
jamal
===== drivers/net/e1000/e1000_main.c 1.134 vs edited =====
--- 1.134/drivers/net/e1000/e1000_main.c 2004-09-12 16:52:48 -07:00
+++ edited/drivers/net/e1000/e1000_main.c 2004-09-30 06:05:11 -07:00
@@ -2278,12 +2278,30 @@
 	uint8_t last_byte;
 	unsigned int i;
 	boolean_t cleaned = FALSE;
+#define AGGRESSIVE 1
 
 	i = rx_ring->next_to_clean;
+#if AGGRESSIVE
+	prefetch(rx_ring->buffer_info[i].skb->data);
+#endif
 	rx_desc = E1000_RX_DESC(*rx_ring, i);
 
 	while(rx_desc->status & E1000_RXD_STAT_DD) {
 		buffer_info = &rx_ring->buffer_info[i];
+# if AGGRESSIVE
+		{
+			struct e1000_rx_desc *next_rx;
+			unsigned int j = i + 1;
+
+			if (j == rx_ring->count)
+				j = 0;
+			next_rx = E1000_RX_DESC(*rx_ring, j);
+			if (next_rx->status & E1000_RXD_STAT_DD)
+				prefetch(rx_ring->buffer_info[j].skb->data);
+		}
+# else
+		prefetch(buffer_info->skb->data);
+# endif
 #ifdef CONFIG_E1000_NAPI
 		if(*work_done >= work_to_do)
 			break;
^ permalink raw reply [flat|nested] 121+ messages in thread
* Re: RFC: NAPI packet weighting patch
2005-06-23 12:14 ` jamal
@ 2005-06-23 17:36 ` David Mosberger
0 siblings, 0 replies; 121+ messages in thread
From: David Mosberger @ 2005-06-23 17:36 UTC (permalink / raw)
To: hadi
Cc: David S. Miller, Lennert Buytenhek, davidm, netdev, dada1, ak,
leonid.grossman, becker, rick.jones2, davem
>>>>> On Thu, 23 Jun 2005 08:14:11 -0400, jamal <hadi@cyberus.ca> said:
Jamal> For the fans of the e1000 (or even the tg3-deprived people),
Jamal> here's a patch which originated from David Mosberger that I
Jamal> played around with (about 9 months back) - it will need some
Jamal> hand patching for the latest driver. Similar approach: prefetch
Jamal> skb->data, twiddle twiddle not little star, touch header.
Jamal> I found the aggressive mode effective on a Xeon, but I believe
Jamal> David is using this on x86_64. So Lennert, I lied to you when I
Jamal> said it was never effective on x86. You just have to do the
Jamal> right juju, such as factoring in the memory load-latency and
Jamal> how much cache you have on your specific CPU. CCing davidm
Jamal> (in addition To: davem of course ;->) so he may provide more
Jamal> insight on his tests.
I couldn't remember what experiments I did, but I found the original
mail, with all the data. The experiments were done on ia64 (naturally
;-).
Enjoy,
--david
---
From: David Mosberger <davidm@linux.hpl.hp.com>
To: hadi@cyberus.ca
Cc: Alexey <kuznet@ms2.inr.ac.ru>, "David S. Miller" <davem@davemloft.net>,
Robert Olsson <Robert.Olsson@data.slu.se>,
Lennert Buytenhek <buytenh@wantstofly.org>, davidm@hpl.hp.com,
eranian@linux.hpl.hp.com, grundler@parisc-linux.org
Subject: Re: prefetch
Date: Thu, 30 Sep 2004 06:51:29 -0700
Reply-To: davidm@hpl.hp.com
X-URL: http://www.hpl.hp.com/personal/David_Mosberger/
>>>>> On 27 Sep 2004 11:08:00 -0400, jamal <hadi@cyberus.ca> said:
Jamal> One of the top abusers of CPU cycles in the netcode is
Jamal> eth_type_trans() on x86-type hardware. This is where
Jamal> skb->data is touched for the first time (hence a cache miss).
Jamal> Clearly a good place to prefetch is in eth_type_trans() itself:
Jamal> maybe right at the top you could prefetch skb->data, or after
Jamal> skb_pull() you could prefetch skb->mac.ethernet.
Jamal> oprofile shows me the cycles being abused
Jamal> (GLOBAL_POWER_EVENTS on a Xeon box) went down when I do either;
Jamal> I cut down more cycles prefetching skb->mac.ethernet than
Jamal> skb->data - but that's a different topic.
Jamal> My test is purely forwarding: packets come in through eth0,
Jamal> get exercised by the routing code, and come out eth1. So the
Jamal> important parameters for my test case are primarily throughput
Jamal> and, secondarily, latency. Adding the prefetch above, while
Jamal> showing lower CPU cycles, results in decreased throughput
Jamal> numbers and higher latency numbers. What gives?
Jamal> I am CCing the HP folks since they have some interesting
Jamal> tools I heard David talk about at SUCON.
I don't have a good setup to measure packet forwarding performance.
However, prefetching skb->data certainly does reduce CPU utilization
on ia64 as the measurements below show.
I tried three versions:
- original 2.6.9-rc3 (ORIGINAL)
- 2.6.9-rc3 with a prefetch in e1000_clean_rx_irq (OBVIOUS)
- 2.6.9-rc3 which prefetches the _next_ rx buffer (AGGRESSIVE)
All 3 cases use an e1000 board with NAPI enabled.
netperf results for 3 runs of ORIGINAL and AGGRESSIVE:
ORIGINAL:
$ netperf -l30 -c -C -H 192.168.10.15 -- -m1 -D
TCP STREAM TEST to 192.168.10.15 : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send     Recv
Size   Size    Size     Time     Throughput  local    remote   local    remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB    us/KB

 87380  16384      1    30.00       1.59     99.93    10.94    5155.593 2257.461
 87380  16384      1    30.00       1.62     99.87    11.19    5045.549 2260.294
 87380  16384      1    30.00       1.62     99.89    11.29    5045.269 2281.327

AGGRESSIVE:
$ netperf -l30 -c -C -H 192.168.10.15 -- -m1 -D
TCP STREAM TEST to 192.168.10.15 : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send     Recv
Size   Size    Size     Time     Throughput  local    remote   local    remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB    us/KB

 87380  16384      1    30.00       1.62     99.98    10.51    5062.204 2128.695
 87380  16384      1    30.00       1.62     99.99    10.51    5064.528 2128.940
 87380  16384      1    30.00       1.62     99.98    10.67    5053.365 2156.333
As you can see, not much of a throughput difference (I'd not expect
that, given the test...), but service demand on the receiver is down
significantly. This is also confirmed with the following three
profiles (collected with q-syscollect):
ORIGINAL:
% time     self    cumul    calls  self/call  tot/call  name
 53.73    32.05    32.05     471k      68.1u     68.1u  default_idle
  4.59     2.74    34.79    12.0M       228n      259n  eth_type_trans

OBVIOUS:
% time     self    cumul    calls  self/call  tot/call  name
 55.72    33.25    33.25     469k      70.8u     70.8u  default_idle
  4.49     2.68    35.93    12.0M       222n      278n  tcp_v4_rcv
  2.84     1.70    37.63     473k      3.59u     32.6u  e1000_clean
  2.81     1.68    39.30    12.2M       137n      525n  tcp_rcv_established
  2.71     1.62    40.92    12.1M       134n      711n  netif_receive_skb
  2.39     1.43    42.34    12.0M       119n      148n  eth_type_trans

AGGRESSIVE:
% time     self    cumul    calls  self/call  tot/call  name
 57.51    34.34    34.34     395k      86.9u     86.9u  default_idle
  4.40     2.62    36.96    12.3M       214n      265n  tcp_v4_rcv
  3.12     1.86    38.82     455k      4.09u     31.3u  e1000_clean
  3.09     1.84    40.66    12.0M       154n      584n  tcp_rcv_established
  2.89     1.72    42.39    12.0M       144n      723n  netif_receive_skb
  1.94     1.16    43.55     918k      1.26u     1.26u  _spin_unlock_irq
  1.90     1.13    44.68    12.3M      92.4n      115n  ip_route_input
  1.87     1.11    45.79    12.6M      88.4n     89.6n  kfree
  1.87     1.11    46.91    12.1M      91.8n      572n  ip_rcv
  1.68     1.00    47.91    12.1M      82.4n      351n  ip_local_deliver
  1.21     0.72    48.63    12.6M      57.7n     58.9n  __kmalloc
  1.01     0.60    49.23    12.3M      48.8n     53.7n  sba_unmap_single
  1.00     0.59    49.83    12.0M      49.4n     81.0n  eth_type_trans
Comparing ORIGINAL and AGGRESSIVE, we see that the latter spends an
additional 2.29 seconds in the idle loop (default_idle), which
corresponds closely to the 2.19 seconds of savings we're seeing in
eth_type_trans(), so the savings the prefetch achieves are real and not
offset by extra costs in other places.
The above also shows that the OBVIOUS prefetch is unable to cover the
entire load-latency. Thus, I suspect it would really be best to use
the AGGRESSIVE prefetching policy. If we were to do this, then the
code at label next_desc could be simplified, since we already
precomputed the next value of i/rx_desc as part of the prefetch.
It would be interesting to know how (modern) x86 CPUs behave. If
somebody wants to try this, I attached a patch below (setting
AGGRESSIVE to 1 gives you the AGGRESSIVE version, setting it to 0 gives
you the OBVIOUS version).
Cheers,
--david
===== drivers/net/e1000/e1000_main.c 1.134 vs edited =====
--- 1.134/drivers/net/e1000/e1000_main.c 2004-09-12 16:52:48 -07:00
+++ edited/drivers/net/e1000/e1000_main.c 2004-09-30 06:05:11 -07:00
@@ -2278,12 +2278,30 @@
 	uint8_t last_byte;
 	unsigned int i;
 	boolean_t cleaned = FALSE;
+#define AGGRESSIVE 1
 
 	i = rx_ring->next_to_clean;
+#if AGGRESSIVE
+	prefetch(rx_ring->buffer_info[i].skb->data);
+#endif
 	rx_desc = E1000_RX_DESC(*rx_ring, i);
 
 	while(rx_desc->status & E1000_RXD_STAT_DD) {
 		buffer_info = &rx_ring->buffer_info[i];
+# if AGGRESSIVE
+		{
+			struct e1000_rx_desc *next_rx;
+			unsigned int j = i + 1;
+
+			if (j == rx_ring->count)
+				j = 0;
+			next_rx = E1000_RX_DESC(*rx_ring, j);
+			if (next_rx->status & E1000_RXD_STAT_DD)
+				prefetch(rx_ring->buffer_info[j].skb->data);
+		}
+# else
+		prefetch(buffer_info->skb->data);
+# endif
 #ifdef CONFIG_E1000_NAPI
 		if(*work_done >= work_to_do)
 			break;
^ permalink raw reply [flat|nested] 121+ messages in thread
Thread overview: 121+ messages
2005-06-03 0:11 RFC: NAPI packet weighting patch Ronciak, John
2005-06-03 0:18 ` David S. Miller
2005-06-03 2:32 ` jamal
2005-06-03 17:43 ` Mitch Williams
2005-06-03 18:38 ` David S. Miller
2005-06-03 18:42 ` jamal
2005-06-03 19:01 ` David S. Miller
2005-06-03 19:28 ` Mitch Williams
2005-06-03 19:59 ` jamal
2005-06-03 20:31 ` David S. Miller
2005-06-03 21:12 ` Jon Mason
2005-06-03 20:22 ` David S. Miller
2005-06-03 20:29 ` David S. Miller
2005-06-03 19:49 ` Michael Chan
2005-06-03 20:59 ` Lennert Buytenhek
2005-06-03 20:35 ` Michael Chan
2005-06-03 22:29 ` jamal
2005-06-04 0:25 ` Michael Chan
2005-06-05 21:36 ` David S. Miller
2005-06-06 6:43 ` David S. Miller
2005-06-03 23:26 ` Lennert Buytenhek
2005-06-05 20:11 ` David S. Miller
2005-06-03 21:07 ` Edgar E Iglesias
2005-06-03 23:30 ` Lennert Buytenhek
2005-06-03 20:30 ` Ben Greear
2005-06-03 19:40 ` jamal
2005-06-03 20:23 ` jamal
2005-06-03 20:28 ` Mitch Williams
-- strict thread matches above, loose matches on Subject: below --
2005-06-07 16:23 Ronciak, John
2005-06-07 20:21 ` David S. Miller
2005-06-08 2:20 ` Jesse Brandeburg
2005-06-08 3:31 ` David S. Miller
2005-06-08 3:43 ` David S. Miller
2005-06-08 13:36 ` jamal
2005-06-09 21:37 ` Jesse Brandeburg
2005-06-09 22:05 ` Stephen Hemminger
2005-06-09 22:12 ` Jesse Brandeburg
2005-06-09 22:21 ` David S. Miller
2005-06-09 22:21 ` jamal
2005-06-09 22:22 ` David S. Miller
2005-06-09 22:20 ` jamal
2005-06-06 20:29 Ronciak, John
2005-06-06 23:55 ` Mitch Williams
2005-06-07 0:08 ` Ben Greear
2005-06-08 1:50 ` Jesse Brandeburg
2005-06-07 4:53 ` Stephen Hemminger
2005-06-07 12:38 ` jamal
2005-06-07 12:06 ` Martin Josefsson
2005-06-07 13:29 ` jamal
2005-06-07 12:36 ` Martin Josefsson
2005-06-07 16:34 ` Robert Olsson
2005-06-07 23:19 ` Rick Jones
2005-06-21 20:37 ` David S. Miller
2005-06-22 7:27 ` Eric Dumazet
2005-06-22 8:42 ` P
2005-06-22 19:37 ` jamal
2005-06-23 8:56 ` P
2005-06-21 20:20 ` David S. Miller
2005-06-21 20:38 ` Rick Jones
2005-06-21 20:55 ` David S. Miller
2005-06-21 21:47 ` Andi Kleen
2005-06-21 22:22 ` Donald Becker
2005-06-21 22:34 ` Andi Kleen
2005-06-22 0:08 ` Donald Becker
2005-06-22 4:44 ` Chris Friesen
2005-06-22 11:31 ` Andi Kleen
2005-06-22 16:23 ` Leonid Grossman
2005-06-22 16:37 ` jamal
2005-06-22 18:00 ` Leonid Grossman
2005-06-22 18:06 ` Andi Kleen
2005-06-22 20:22 ` David S. Miller
2005-06-22 20:35 ` Rick Jones
2005-06-22 20:43 ` David S. Miller
2005-06-22 21:10 ` Andi Kleen
2005-06-22 21:16 ` David S. Miller
2005-06-22 21:53 ` Chris Friesen
2005-06-22 22:11 ` Andi Kleen
2005-06-22 21:38 ` Eric Dumazet
2005-06-22 22:13 ` Eric Dumazet
2005-06-22 22:30 ` David S. Miller
2005-06-22 22:23 ` David S. Miller
2005-06-23 12:14 ` jamal
2005-06-23 17:36 ` David Mosberger
2005-06-22 22:42 ` Leonid Grossman
2005-06-22 23:13 ` Andi Kleen
2005-06-22 23:19 ` David S. Miller
2005-06-22 23:23 ` Andi Kleen
2005-06-22 17:05 ` Andi Kleen
2005-06-06 15:35 Ronciak, John
2005-06-06 19:47 ` David S. Miller
2005-06-03 18:19 Ronciak, John
2005-06-03 18:33 ` Ben Greear
2005-06-03 18:49 ` David S. Miller
2005-06-03 18:59 ` Ben Greear
2005-06-03 19:02 ` David S. Miller
2005-06-03 20:17 ` Robert Olsson
2005-06-03 20:30 ` David S. Miller
2005-06-03 17:40 Ronciak, John
2005-06-03 18:08 ` Robert Olsson
2005-06-02 21:19 Ronciak, John
2005-06-02 21:31 ` Stephen Hemminger
2005-06-02 21:40 ` David S. Miller
2005-06-02 21:51 ` Jon Mason
2005-06-02 22:12 ` David S. Miller
2005-06-02 22:19 ` Jon Mason
2005-06-02 22:15 ` Robert Olsson
2005-05-26 21:36 Mitch Williams
2005-05-27 8:21 ` Robert Olsson
2005-05-27 11:18 ` jamal
2005-05-27 15:50 ` Stephen Hemminger
2005-05-27 20:27 ` Mitch Williams
2005-05-27 21:01 ` Stephen Hemminger
2005-05-28 0:56 ` jamal
2005-05-31 17:35 ` Mitch Williams
2005-05-31 17:40 ` Stephen Hemminger
2005-05-31 17:43 ` Mitch Williams
2005-05-31 22:07 ` Jon Mason
2005-05-31 22:14 ` David S. Miller
2005-05-31 23:28 ` Jon Mason
2005-06-02 12:26 ` jamal
2005-06-02 17:30 ` Stephen Hemminger