Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: fix softnet_stat
From: Changli Gao @ 2010-04-15  3:08 UTC (permalink / raw)
  To: David S. Miller; +Cc: Tom Herbert, Eric Dumazet, netdev, Changli Gao
In-Reply-To: <1271300513-9995-1-git-send-email-xiaosuo@gmail.com>

On Thu, Apr 15, 2010 at 11:01 AM, Changli Gao <xiaosuo@gmail.com> wrote:
> fix softnet_stat
>
> Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context
> without any protection. And enqueue_to_backlog should update the netdev_rx_stat
> of the target CPU.
>
> This patch changes the meaning of softnet_data.total to the number of packets
> processed, not includes dropped, in order to avoid it sharing between IRQ and
> SoftIRQ. And softnet_data.dropped is protected in the corresponding
> input_pkt_queue.lock, if RPS is enabled.
>

I forget a comment:

This patch also fixed a bug: packets received by enqueue_to_backlog()
was counted twice: one in enqueue_to_backlog(), and one in
__netif_receive_skb(), although they may not on the same CPUs.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* [PATCH] net: fix softnet_stat
From: Changli Gao @ 2010-04-15  3:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: Tom Herbert, Eric Dumazet, netdev, Changli Gao

fix softnet_stat

Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context
without any protection. And enqueue_to_backlog should update the netdev_rx_stat
of the target CPU.

This patch changes the meaning of softnet_data.total to the number of packets
processed, not includes dropped, in order to avoid it sharing between IRQ and
SoftIRQ. And softnet_data.dropped is protected in the corresponding
input_pkt_queue.lock, if RPS is enabled.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 net/core/dev.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index a10a216..b38a991 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2336,7 +2336,6 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 	queue = &per_cpu(softnet_data, cpu);

 	local_irq_save(flags);
-	__get_cpu_var(netdev_rx_stat).total++;

 	rps_lock(queue);
 	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
@@ -2366,9 +2365,9 @@ enqueue:
 		goto enqueue;
 	}

+	per_cpu(netdev_rx_stat, cpu).dropped++;
 	rps_unlock(queue);

-	__get_cpu_var(netdev_rx_stat).dropped++;
 	local_irq_restore(flags);

 	kfree_skb(skb);
@@ -3565,7 +3564,7 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
 	struct netif_rx_stats *s = v;

 	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
-		   s->total, s->dropped, s->time_squeeze, 0,
+		   s->total + s->dropped, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
 		   s->cpu_collision, s->received_rps);
 	return 0;

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-15  2:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Tom Herbert, hadi, netdev, robert,
	David Miller, Andi Kleen
In-Reply-To: <1271299206.16881.1764.camel@edumazet-laptop>

On Thu, Apr 15, 2010 at 10:40 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Agree 100%, and irqbalance is the existing daemon. It should be used and
> changed if necessary.
>
> Changli, my stronges argument about your patches is that our scheduler
> and memory affinity api (numactl driven) is bitmask oriented, giving the
> same weight to individual cpu or individual memory node.
>

It works with the assumption: the workloads handled in non-schedulable
context are less than the others. If most of work is done in
non-schedulable(softirq) context, scheduler can't keep load balance.


-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH net-next-2.6] drivers: net: use skb_headlen()
From: Eric Dumazet @ 2010-04-15  2:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100414.161504.149014877.davem@davemloft.net>

Le mercredi 14 avril 2010 à 16:15 -0700, David Miller a écrit :

> Eric, there were several typos in this patch and it is clear
> that even e1000 wasn't build tested.
> 
> You used the mystical "skb_headsize()" instead of "skb_headlen()"
> in several locations.
> 
> I fixed this up, but please build test your changes no matter
> how seemingly trivial next time.
> 

Oh well, I was sure I build tested it on a modallconfig, but now I
realize it was on a custom config, sorry David, thanks !



^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-15  2:40 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Changli Gao, Tom Herbert, hadi, netdev, robert, David Miller,
	Andi Kleen
In-Reply-To: <20100414160220.43a62999@nehalam>

Le mercredi 14 avril 2010 à 16:02 -0700, Stephen Hemminger a écrit :
> On Thu, 15 Apr 2010 06:51:29 +0800
> Changli Gao <xiaosuo@gmail.com> wrote:
> 
> > On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > > RPS can be tuned (Changli wants a finer tuning...), it would be
> > > intereting to tune multiqueue devices too. I dont know if its possible
> > > right now.
> > >
> > > On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> > > 10Gigabit has 16 queues. It might be good to use less queues according
> > > to your results on some workloads, and eventually use RPS on a second
> > > layering.
> > 
> > My idear is: run a daemon in userland to monitor the softnet
> > statistics, and tun the RPS setting if necessary. It seems that the
> > current softnet statistics data isn't correct.
> > 
> > Long time ago, I did a test, and the conclution was
> > call_function_single IPI was more expensive than resched IPI, so I
> > moved to kernel thread from softirq for packet processing. I'll redo
> > the test later.
> > 
> 
> The big thing is data, data, data... Performance can only be examined
> with real hard data with multiple different kind of hardware.  Also, check for
> regressions in lmbench and TPC benchmarks. Yes this is hard, but papers
> on this would allow for rational rather than speculative choices.
> 
> Adding more tuning knobs is not the answer unless you can show when
> the tuning helps.
> 

Agree 100%, and irqbalance is the existing daemon. It should be used and
changed if necessary.

Changli, my stronges argument about your patches is that our scheduler
and memory affinity api (numactl driven) is bitmask oriented, giving the
same weight to individual cpu or individual memory node.




^ permalink raw reply

* RE: [PATCH] Infiniband: Randomize local port allocation.
From: Tetsuo Handa @ 2010-04-15  2:29 UTC (permalink / raw)
  To: sean.hefty
  Cc: amwang, opurdila, eric.dumazet, netdev, nhorman, davem, ebiederm,
	linux-kernel, rolandd
In-Reply-To: <195EB1AD388F455AA386EB214FCD09F4@amr.corp.intel.com>

Sean Hefty wrote:
> >+	remaining = (high - low) + 1;
> >+	rover = net_random() % remaining + low;
> >+	do {
> >+		rover++;
> >+		if ((rover < low) || (rover > high))
> >+			rover = low;
> 
> Assuming that we're likely to pick a valid port on the first try, it would be
> more efficient to move the above 3 lines to the end of the while loop.
> 
Indeed. I moved these lines to "if (--remaining) { ... }" block.
--------------------
[PATCH] Infiniband: Randomize local port allocation.

Randomize local port allocation in a way sctp_get_port_local() does.
Update rover at the end of loop since we're likely to pick a valid port
on the first try.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/infiniband/core/cma.c |   70 +++++++++++++++---------------------------
 1 file changed, 25 insertions(+), 45 deletions(-)

--- linux-2.6.34-rc4.orig/drivers/infiniband/core/cma.c
+++ linux-2.6.34-rc4/drivers/infiniband/core/cma.c
@@ -79,7 +79,6 @@ static DEFINE_IDR(sdp_ps);
 static DEFINE_IDR(tcp_ps);
 static DEFINE_IDR(udp_ps);
 static DEFINE_IDR(ipoib_ps);
-static int next_port;
 
 struct cma_device {
 	struct list_head	list;
@@ -1970,47 +1969,33 @@ err1:
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
 {
-	struct rdma_bind_list *bind_list;
-	int port, ret, low, high;
-
-	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
-	if (!bind_list)
-		return -ENOMEM;
-
-retry:
-	/* FIXME: add proper port randomization per like inet_csk_get_port */
-	do {
-		ret = idr_get_new_above(ps, bind_list, next_port, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
+	static unsigned int last_used_port;
+	int low, high, remaining;
+	unsigned int rover;
 
 	inet_get_local_port_range(&low, &high);
-	if (port > high) {
-		if (next_port != low) {
-			idr_remove(ps, port);
-			next_port = low;
-			goto retry;
-		}
-		ret = -EADDRNOTAVAIL;
-		goto err2;
+	remaining = (high - low) + 1;
+	rover = net_random() % remaining + low;
+retry:
+	if (last_used_port != rover &&
+	    !idr_find(ps, (unsigned short) rover)) {
+		int ret = cma_alloc_port(ps, id_priv, rover);
+		/*
+		 * Remember previously used port number in order to avoid
+		 * re-using same port immediately after it is closed.
+		 */
+		if (!ret)
+			last_used_port = rover;
+		if (ret != -EADDRNOTAVAIL)
+			return ret;
 	}
-
-	if (port == high)
-		next_port = low;
-	else
-		next_port = port + 1;
-
-	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
-	cma_bind_port(bind_list, id_priv);
-	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
-	kfree(bind_list);
-	return ret;
+	if (--remaining) {
+		rover++;
+		if ((rover < low) || (rover > high))
+			rover = low;
+		goto retry;
+	}
+	return -EADDRNOTAVAIL;
 }
 
 static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv)
@@ -2995,12 +2980,7 @@ static void cma_remove_one(struct ib_dev
 
 static int __init cma_init(void)
 {
-	int ret, low, high, remaining;
-
-	get_random_bytes(&next_port, sizeof next_port);
-	inet_get_local_port_range(&low, &high);
-	remaining = (high - low) + 1;
-	next_port = ((unsigned int) next_port % remaining) + low;
+	int ret;
 
 	cma_wq = create_singlethread_workqueue("rdma_cm");
 	if (!cma_wq)

^ permalink raw reply

* [PATCH] WAN: flush tx_queue in hdlc_ppp to prevent panic on rmmod hw_driver.
From: Krzysztof Halasa @ 2010-04-15  0:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

tx_queue is used as a temporary queue when not allowed to queue skb
directly to the hw device driver (which may sleep). Most paths flush
it before returning, but ppp_start() currently cannot. Make sure we
don't leave skbs pointing to a non-existent device.

Thanks to Michael Barkowski for reporting this problem.

Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>

diff --git a/drivers/net/wan/hdlc_ppp.c b/drivers/net/wan/hdlc_ppp.c
index b9b9d6b..941f053 100644
--- a/drivers/net/wan/hdlc_ppp.c
+++ b/drivers/net/wan/hdlc_ppp.c
@@ -628,9 +628,15 @@ static void ppp_stop(struct net_device *dev)
 	ppp_cp_event(dev, PID_LCP, STOP, 0, 0, 0, NULL);
 }

+static void ppp_close(struct net_device *dev)
+{
+	ppp_tx_flush();
+}
+
 static struct hdlc_proto proto = {
 	.start		= ppp_start,
 	.stop		= ppp_stop,
+	.close		= ppp_close,
 	.type_trans	= ppp_type_trans,
 	.ioctl		= ppp_ioctl,
 	.netif_rx	= ppp_rx,

^ permalink raw reply related

* Re: hdlc_ppp: why no detach()?
From: Krzysztof Halasa @ 2010-04-15  0:02 UTC (permalink / raw)
  To: Michael Barkowski
  Cc: David S. Miller, Julia Lawall, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <m3633todl2.fsf@intrepid.localdomain>

> 1. The SCR packet will be delayed by 2 seconds (both first and second
>    SCR will be sent the same time). Perhaps we delay only a little
>    (instead of full 2 seconds) and only then send the initial packet.

BTW can I sleep in a netdevice_notifier callback, no?
-- 
Krzysztof Halasa

^ permalink raw reply

* RE: [PATCH] Infiniband: Randomize local port allocation.
From: Sean Hefty @ 2010-04-15  0:01 UTC (permalink / raw)
  To: penguin-kernel, rolandd
  Cc: amwang, opurdila, eric.dumazet, netdev, nhorman, davem, ebiederm,
	linux-kernel
In-Reply-To: <201004140201.o3E21Aqn075978@www262.sakura.ne.jp>

>[PATCH] Infiniband: Randomize local port allocation.
>
>Randomize local port allocation in a way sctp_get_port_local() does.
>
>Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Thanks for fixing this long outstanding issue.  :)  The latest patch looks
correct and passed some simple tests that I ran against it.  One comment below,
which I didn't catch before:

>---
> drivers/infiniband/core/cma.c |   69 ++++++++++++++---------------------------
>-
> 1 file changed, 24 insertions(+), 45 deletions(-)
>
>--- linux-2.6.34-rc4.orig/drivers/infiniband/core/cma.c
>+++ linux-2.6.34-rc4/drivers/infiniband/core/cma.c
>@@ -79,7 +79,6 @@ static DEFINE_IDR(sdp_ps);
> static DEFINE_IDR(tcp_ps);
> static DEFINE_IDR(udp_ps);
> static DEFINE_IDR(ipoib_ps);
>-static int next_port;
>
> struct cma_device {
> 	struct list_head	list;
>@@ -1970,47 +1969,32 @@ err1:
>
> static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
> {
>-	struct rdma_bind_list *bind_list;
>-	int port, ret, low, high;
>-
>-	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
>-	if (!bind_list)
>-		return -ENOMEM;
>-
>-retry:
>-	/* FIXME: add proper port randomization per like inet_csk_get_port */
>-	do {
>-		ret = idr_get_new_above(ps, bind_list, next_port, &port);
>-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
>-
>-	if (ret)
>-		goto err1;
>+	static unsigned int last_used_port;
>+	int low, high, remaining;
>+	unsigned int rover;
>
> 	inet_get_local_port_range(&low, &high);
>-	if (port > high) {
>-		if (next_port != low) {
>-			idr_remove(ps, port);
>-			next_port = low;
>-			goto retry;
>+	remaining = (high - low) + 1;
>+	rover = net_random() % remaining + low;
>+	do {
>+		rover++;
>+		if ((rover < low) || (rover > high))
>+			rover = low;

Assuming that we're likely to pick a valid port on the first try, it would be
more efficient to move the above 3 lines to the end of the while loop.

>+		if (last_used_port != rover &&
>+		    !idr_find(ps, (unsigned short) rover)) {
>+			int ret = cma_alloc_port(ps, id_priv, rover);
>+			/*
>+			 * Remember previously used port number in order to
>+			 * avoid re-using same port immediately after it is
>+			 * closed.
>+			 */
>+			if (!ret)
>+				last_used_port = rover;
>+			if (ret != -EADDRNOTAVAIL)
>+				return ret;
> 		}
>-		ret = -EADDRNOTAVAIL;
>-		goto err2;
>-	}
>-
>-	if (port == high)
>-		next_port = low;
>-	else
>-		next_port = port + 1;
>-
>-	bind_list->ps = ps;
>-	bind_list->port = (unsigned short) port;
>-	cma_bind_port(bind_list, id_priv);
>-	return 0;
>-err2:
>-	idr_remove(ps, port);
>-err1:
>-	kfree(bind_list);
>-	return ret;
>+	} while (--remaining > 0);
>+	return -EADDRNOTAVAIL;
> }

^ permalink raw reply

* Re: [PATCH net-next-2.6] drivers: net: use skb_headlen()
From: David Miller @ 2010-04-14 23:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <20100414.155949.210580528.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 14 Apr 2010 15:59:49 -0700 (PDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 13 Apr 2010 22:48:20 +0200
> 
>> replaces (skb->len - skb->data_len) occurrences by skb_headlen(skb)
>> 
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Applied, thanks Eric.

Eric, there were several typos in this patch and it is clear
that even e1000 wasn't build tested.

You used the mystical "skb_headsize()" instead of "skb_headlen()"
in several locations.

I fixed this up, but please build test your changes no matter
how seemingly trivial next time.

Thanks.

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-14 23:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev
In-Reply-To: <1271258409.16881.1701.camel@edumazet-laptop>

On Wed, Apr 14, 2010 at 11:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 14 avril 2010 à 17:52 +0800, Changli Gao a écrit :
>> batch skb dequeueing from softnet input_pkt_queue
>>
>> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
>> contention and irq disabling/enabling.
>>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>
> Adding stop_machine() with no explanation ?

stop_machine() is added to flush the processing_queue. Because the old
flush_backlog() runs in IRQ context, it can't touch the things, which
are only valid in softirq context.

>
> No ack from my previous comments, suggestions, and still same logic ?

In this patch, the volatile variable flush_processing_queue is
removed, and the corresponding lines are removed too.

Oh, I should splice the old message back, as stop_machine is used
instead. So, the processing queue will be removed from softnet_data,
and a single int counter will be used instead to count the packets
which are being processing.

one issue you concern is potential cache miss when summing in enqueue
function. It won't happen all the time, in fact, I think it should
happen seldom, and it is the responsibility of hardware to cache the
data frequently used.

>
> Are we supposed to read patch, test it, make some benches, correct bugs,
> say Amen ?
>

OK, I'll test it.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [BUG net-next-2.6] fib: Some rcu warning
From: David Miller @ 2010-04-14 23:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, paulmck
In-Reply-To: <1271275859.16881.1753.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 14 Apr 2010 22:10:59 +0200

> [PATCH] fib: suppress lockdep-RCU false positive in FIB trie.
> 
> Followup of commit 634a4b20
> 
> Allow tnode_get_child_rcu() to be called either under rcu_read_lock()
> protection or with RTNL held.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Since both your and Paul both posted an identical patch I'll
just add both of your signoffs to this one :-)

Since 634a4b20 is in net-2.6, this should probably go into
net-2.6 as well.

Thanks guys.

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Stephen Hemminger @ 2010-04-14 23:02 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Tom Herbert, hadi, netdev, robert, David Miller,
	Andi Kleen
In-Reply-To: <v2s412e6f7f1004141551u5764604h3ab092ee9e70a650@mail.gmail.com>

On Thu, 15 Apr 2010 06:51:29 +0800
Changli Gao <xiaosuo@gmail.com> wrote:

> On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > RPS can be tuned (Changli wants a finer tuning...), it would be
> > intereting to tune multiqueue devices too. I dont know if its possible
> > right now.
> >
> > On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> > 10Gigabit has 16 queues. It might be good to use less queues according
> > to your results on some workloads, and eventually use RPS on a second
> > layering.
> 
> My idear is: run a daemon in userland to monitor the softnet
> statistics, and tun the RPS setting if necessary. It seems that the
> current softnet statistics data isn't correct.
> 
> Long time ago, I did a test, and the conclution was
> call_function_single IPI was more expensive than resched IPI, so I
> moved to kernel thread from softirq for packet processing. I'll redo
> the test later.
> 

The big thing is data, data, data... Performance can only be examined
with real hard data with multiple different kind of hardware.  Also, check for
regressions in lmbench and TPC benchmarks. Yes this is hard, but papers
on this would allow for rational rather than speculative choices.

Adding more tuning knobs is not the answer unless you can show when
the tuning helps.

-- 

^ permalink raw reply

* Re: [net-next-2.6 PATCH] ixgbe: fix bug with vlan strip in promsic mode
From: David Miller @ 2010-04-14 23:04 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, jesse.brandeburg
In-Reply-To: <20100414221015.11752.42490.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 14 Apr 2010 15:11:12 -0700

> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> 
> The ixgbe driver was setting up 82598 hardware correctly, so that
> when promiscuous mode was enabled hardware stripping was turned
> off.  But on 82599 the logic to disable/enable hardware stripping
> is different, and the code was not updated correctly when the
> hardware vlan stripping was enabled as default.
> 
> This change comprises the creation of two new helper functions
> and calling them from the right locations to disable and enable
> hardware stripping of vlan tags at appropriate times.
> 
> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied, for real this time! :-)

Thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6] drivers: net: use skb_headlen()
From: David Miller @ 2010-04-14 22:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1271191700.16881.574.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 13 Apr 2010 22:48:20 +0200

> replaces (skb->len - skb->data_len) occurrences by skb_headlen(skb)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [GIT PULL] vhost-net fix for 2.6.34-rc4
From: David Miller @ 2010-04-14 22:57 UTC (permalink / raw)
  To: mst; +Cc: netdev, hch
In-Reply-To: <20100414141733.GA5370@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 14 Apr 2010 17:17:33 +0300

> Christoph Hellwig (1):
>       vhost: fix sparse warnings

Do the sparse warnings show actual real bugs, or are we just
adding annotations to make the tool happy?

Unless the bugs being fixed are earth shattering, this belongs
in net-next-2.6 not net-2.6

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-14 22:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, hadi, Stephen Hemminger, netdev, robert,
	David Miller, Andi Kleen
In-Reply-To: <1271278661.16881.1761.camel@edumazet-laptop>

On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> RPS can be tuned (Changli wants a finer tuning...), it would be
> intereting to tune multiqueue devices too. I dont know if its possible
> right now.
>
> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> 10Gigabit has 16 queues. It might be good to use less queues according
> to your results on some workloads, and eventually use RPS on a second
> layering.

My idear is: run a daemon in userland to monitor the softnet
statistics, and tun the RPS setting if necessary. It seems that the
current softnet statistics data isn't correct.

Long time ago, I did a test, and the conclution was
call_function_single IPI was more expensive than resched IPI, so I
moved to kernel thread from softirq for packet processing. I'll redo
the test later.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: hdlc_ppp: why no detach()?
From: Krzysztof Halasa @ 2010-04-14 22:48 UTC (permalink / raw)
  To: Michael Barkowski
  Cc: David S. Miller, Julia Lawall, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4BC32B00.1030600@ruggedcom.com>

Hello Michael,

Michael Barkowski <michaelbarkowski@ruggedcom.com> writes:

> I am looking at your hdlc_ppp code and I don't understand: why is there
> not the equivalent of fr_detach() in there?

I assume you mean .detach = fr_destroy(). It's used only to kill
subdevices, i.e. it has nothing to do with the interface being up/down.

> pc8300_drv:cpc_remove_one() frees netdevs quite confidently but I wonder
> how it can be so sure that there are not skbs in hdlc_ppp's tx_queue
> associated with those devices before freeing them....q

Theoretically all paths adding skbs to the tx_queue should send them out
before returning (possibly also on behalf of other devices). However I
wonder if it's the case. Let's see: Only ppp_tx_cp() adds to the queue
directly:
- ppp_rx() (calls ppp_tx_flush())
- ppp_timer (calls ppp_tx_flush())
- ppp_cp_event():
  - ppp_cp_parse_cr() (calls ppp_tx_flush())
  - ppp_stop() calls ppp_cp_event(), but it won't queue any skb, it only
    marks the connection as closed and does the same to IPCP and IPV6CP.

This means the problematic part is ppp_start() which calls
ppp_cp_event(LCP, START) = IRC | SCR | 3 meaning
Initialize-Restart-Count, Send-Configure-Request and change state to
REQ_SENT. This causes two problems:

1. The SCR packet will be delayed by 2 seconds (both first and second
   SCR will be sent the same time). Perhaps we delay only a little
   (instead of full 2 seconds) and only then send the initial packet.

2. (as you noted) the skb will be added to tx_queue and left there. If
   we happen to "ifconfig up" and "rmmod driver" before receiving any
   packet and before ppp->req_timeout (2 seconds) and before any other
   PPP interface does the same, we will eventually get skb with invalid
   ->dev. This is simple to drain in .close (detach is a wrong place
   since it may be called long after the interface is deactivated, there
   is no need to delay it past .close). The fix for #1 will already fix
   #2, but the redundant safety doesn't cost us anything.

Thanks for noting the problem, I'll post a patch shortly.

Also it seems the timeouts etc. should be configurable. ATM we're only
fixing bugs, good.
-- 
Krzysztof Halasa

^ permalink raw reply

* [net-next-2.6 PATCH] ixgbe: fix bug with vlan strip in promsic mode
From: Jeff Kirsher @ 2010-04-14 22:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Jesse Brandeburg, Jeff Kirsher

From: Jesse Brandeburg <jesse.brandeburg@intel.com>

The ixgbe driver was setting up 82598 hardware correctly, so that
when promiscuous mode was enabled hardware stripping was turned
off.  But on 82599 the logic to disable/enable hardware stripping
is different, and the code was not updated correctly when the
hardware vlan stripping was enabled as default.

This change comprises the creation of two new helper functions
and calling them from the right locations to disable and enable
hardware stripping of vlan tags at appropriate times.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |  115 +++++++++++++++++++++++++---------------
 1 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..a98ff0e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2482,12 +2482,74 @@ static void ixgbe_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
 	hw->mac.ops.set_vfta(&adapter->hw, vid, pool_ndx, false);
 }
 
+/**
+ * ixgbe_vlan_filter_disable - helper to disable hw vlan filtering
+ * @adapter: driver data
+ */
+static void ixgbe_vlan_filter_disable(struct ixgbe_adapter *adapter)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+	int i, j;
+
+	switch (hw->mac.type) {
+	case ixgbe_mac_82598EB:
+		vlnctrl &= ~(IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE);
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		break;
+	case ixgbe_mac_82599EB:
+		vlnctrl &= ~IXGBE_VLNCTRL_VFE;
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		for (i = 0; i < adapter->num_rx_queues; i++) {
+			j = adapter->rx_ring[i]->reg_idx;
+			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
+			vlnctrl &= ~IXGBE_RXDCTL_VME;
+			IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+/**
+ * ixgbe_vlan_filter_enable - helper to enable hw vlan filtering
+ * @adapter: driver data
+ */
+static void ixgbe_vlan_filter_enable(struct ixgbe_adapter *adapter)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+	int i, j;
+
+	switch (hw->mac.type) {
+	case ixgbe_mac_82598EB:
+		vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		break;
+	case ixgbe_mac_82599EB:
+		vlnctrl |= IXGBE_VLNCTRL_VFE;
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		for (i = 0; i < adapter->num_rx_queues; i++) {
+			j = adapter->rx_ring[i]->reg_idx;
+			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
+			vlnctrl |= IXGBE_RXDCTL_VME;
+			IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
 static void ixgbe_vlan_rx_register(struct net_device *netdev,
                                    struct vlan_group *grp)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
-	u32 ctrl;
-	int i, j;
 
 	if (!test_bit(__IXGBE_DOWN, &adapter->state))
 		ixgbe_irq_disable(adapter);
@@ -2498,25 +2560,7 @@ static void ixgbe_vlan_rx_register(struct net_device *netdev,
 	 * still receive traffic from a DCB-enabled host even if we're
 	 * not in DCB mode.
 	 */
-	ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_VLNCTRL);
-
-	/* Disable CFI check */
-	ctrl &= ~IXGBE_VLNCTRL_CFIEN;
-
-	/* enable VLAN tag stripping */
-	if (adapter->hw.mac.type == ixgbe_mac_82598EB) {
-		ctrl |= IXGBE_VLNCTRL_VME;
-	} else if (adapter->hw.mac.type == ixgbe_mac_82599EB) {
-		for (i = 0; i < adapter->num_rx_queues; i++) {
-			u32 ctrl;
-			j = adapter->rx_ring[i]->reg_idx;
-			ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_RXDCTL(j));
-			ctrl |= IXGBE_RXDCTL_VME;
-			IXGBE_WRITE_REG(&adapter->hw, IXGBE_RXDCTL(j), ctrl);
-		}
-	}
-
-	IXGBE_WRITE_REG(&adapter->hw, IXGBE_VLNCTRL, ctrl);
+	ixgbe_vlan_filter_enable(adapter);
 
 	ixgbe_vlan_rx_add_vid(netdev, 0);
 
@@ -2551,17 +2595,17 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_hw *hw = &adapter->hw;
-	u32 fctrl, vlnctrl;
+	u32 fctrl;
 
 	/* Check for Promiscuous and All Multicast modes */
 
 	fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
-	vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
 
 	if (netdev->flags & IFF_PROMISC) {
 		hw->addr_ctrl.user_set_promisc = 1;
 		fctrl |= (IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
-		vlnctrl &= ~IXGBE_VLNCTRL_VFE;
+		/* don't hardware filter vlans in promisc mode */
+		ixgbe_vlan_filter_disable(adapter);
 	} else {
 		if (netdev->flags & IFF_ALLMULTI) {
 			fctrl |= IXGBE_FCTRL_MPE;
@@ -2569,12 +2613,11 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
 		} else {
 			fctrl &= ~(IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
 		}
-		vlnctrl |= IXGBE_VLNCTRL_VFE;
+		ixgbe_vlan_filter_enable(adapter);
 		hw->addr_ctrl.user_set_promisc = 0;
 	}
 
 	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
-	IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
 
 	/* reprogram secondary unicast list */
 	hw->mac.ops.update_uc_addr_list(hw, netdev);
@@ -2641,7 +2684,7 @@ static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
 static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
-	u32 txdctl, vlnctrl;
+	u32 txdctl;
 	int i, j;
 
 	ixgbe_dcb_check_config(&adapter->dcb_cfg);
@@ -2659,22 +2702,8 @@ static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
 		IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j), txdctl);
 	}
 	/* Enable VLAN tag insert/strip */
-	vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
-	if (hw->mac.type == ixgbe_mac_82598EB) {
-		vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
-		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
-		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
-	} else if (hw->mac.type == ixgbe_mac_82599EB) {
-		vlnctrl |= IXGBE_VLNCTRL_VFE;
-		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
-		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
-		for (i = 0; i < adapter->num_rx_queues; i++) {
-			j = adapter->rx_ring[i]->reg_idx;
-			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
-			vlnctrl |= IXGBE_RXDCTL_VME;
-			IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
-		}
-	}
+	ixgbe_vlan_filter_enable(adapter);
+
 	hw->mac.ops.set_vfta(&adapter->hw, 0, 0, true);
 }
 


^ permalink raw reply related

* Re: HTB - What's the minimal value for 'rate' parameter?
From: Jarek Poplawski @ 2010-04-14 21:45 UTC (permalink / raw)
  To: Antonio Almeida; +Cc: netdev, kaber, davem, devik
In-Reply-To: <p2q298f5c051004140322hc54f5315taa5c7d85cb74f06@mail.gmail.com>

Antonio Almeida wrote, On 04/14/2010 12:22 PM:

> What do you mean with "1:2 has grandchildren with overflown rate tables"?
> I couldn't understand your idea. Is there any mistake in the
> configuration I sent?
> How would you set rates for this particular example?

class htb 1:1 root rate 1000Mbit ceil 1000Mbit
class htb 1:2 parent 1:1 rate 4096Kbit ceil 4096Kbit
class htb 1:10 parent 1:2 rate 1024Kbit ceil 4096Kbit
class htb 1:11 parent 1:2 rate 1024Kbit ceil 4096Kbit
class htb 1:101 parent 1:10 prio 3 rate 8bit ceil 4096Kbit
class htb 1:111 parent 1:11 prio 3 rate 8bit ceil 4096Kbit

Classes 1:101 and 1:111 have too low rates, which causes wrong (overflowed!)
values in their rate tables, so their rates could be practically
uncontrollable. They are limited by their ceils instead, so something like:

class htb 1:101 parent 1:10 leaf 101: prio 3 rate 4096Kbit ceil 4096Kbit
class htb 1:111 parent 1:11 leaf 111: prio 3 rate 4096Kbit ceil 4096Kbit

But then their guaranteed rates are higher than their parents, and the
sum is higher than grandparent's rate, which means the config is wrong.
(You have to control these sums - HTB doesn't.)

As I wrote before, the minimal (overflow safe) rate depends on max
packet size, and for 1500 byte it would be something around:
1500b/2min, so if your clients can wait so long, try this:

class htb 1:101 parent 1:10 leaf 101: prio 3 rate 100bit ceil 4096Kbit
class htb 1:111 parent 1:11 leaf 111: prio 3 rate 100bit ceil 4096Kbit

Regards,
Jarek P.

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 20:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <z2y65634d661004141345o9bdcf759sf266866931823baf@mail.gmail.com>

Le mercredi 14 avril 2010 à 13:45 -0700, Tom Herbert a écrit :
> > Only if more than one flow is involved.
> >
> > And if you have many flows, chance they will spread several queues...
> >
> 
> But use too many queues and the efficiency of NAPI drops and cost of
> device interrupts becomes dominant, so that the overhead from
> additional hard interrupts can surpass the overhead of doing RPS and
> the IPIs.  I believe we are seeing this is in some of our results
> which shows that a combination of multi-queue and RPS can be better
> than just multi-queue (see rps changelog).  Again, I'm not claiming
> that is generally true, but there are a lot of factors to consider.
> --

RPS can be tuned (Changli wants a finer tuning...), it would be
intereting to tune multiqueue devices too. I dont know if its possible
right now.

On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
10Gigabit has 16 queues. It might be good to use less queues according
to your results on some workloads, and eventually use RPS on a second
layering.





^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 20:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414204003.GC11321@redhat.com>

On Wednesday 14 April 2010 22:40:03 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> > Well, if the guest not only wants to send data but also receive
> > frames coming from other machines, they need to get from the kernel
> > into qemu, and the only way I can see for doing that is to read
> > from this device if there is no vhost support around on the new
> > machine.
> > 
> > Maybe we're talking about different things here.
>
> mpassthrough is currently useless without vhost.
> If the new machine has no vhost, it can't use mpassthrough :)

Ok. Is that a planned feature though? vhost is currently limited
to guests with a virtio-net driver and even if you extend it to other
guest emulations, it will probably always be a subset of the qemu
supported drivers, but it may be useful to support zero-copy on other
drivers as well.

	Arnd

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-14 20:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271276855.16881.1756.camel@edumazet-laptop>

> Only if more than one flow is involved.
>
> And if you have many flows, chance they will spread several queues...
>

But use too many queues and the efficiency of NAPI drops and cost of
device interrupts becomes dominant, so that the overhead from
additional hard interrupts can surpass the overhead of doing RPS and
the IPIs.  I believe we are seeing this is in some of our results
which shows that a combination of multi-queue and RPS can be better
than just multi-queue (see rps changelog).  Again, I'm not claiming
that is generally true, but there are a lot of factors to consider.

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 20:40 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004142239.50299.arnd@arndb.de>

On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> > On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > qemu needs the ability to inject raw packets into device
> > > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > > > 
> > > > > Ok, but since there is only a write callback and no read, it won't
> > > > > actually be able to do this with the current code, right?
> > > > 
> > > > I think it'll work as is, with vhost qemu only ever writes,
> > > > never reads from device. We'll also never need GSO etc
> > > > which is a large part of what tap does (and macvtap will
> > > > have to do).
> > > 
> > > Ah, I see. I didn't realize that qemu needs to write to the
> > > device even if vhost is used. But for the case of migration to
> > > another machine without vhost, wouldn't qemu also need to read?
> > 
> > Not that I know. Why?
> 
> Well, if the guest not only wants to send data but also receive
> frames coming from other machines, they need to get from the kernel
> into qemu, and the only way I can see for doing that is to read
> from this device if there is no vhost support around on the new
> machine.
> 
> Maybe we're talking about different things here.
> 
> 	Arnd

mpassthrough is currently useless without vhost.
If the new machine has no vhost, it can't use mpassthrough :)

-- 
MST

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 20:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414203142.GA11321@redhat.com>

On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > > 
> > > > > qemu needs the ability to inject raw packets into device
> > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > > 
> > > > Ok, but since there is only a write callback and no read, it won't
> > > > actually be able to do this with the current code, right?
> > > 
> > > I think it'll work as is, with vhost qemu only ever writes,
> > > never reads from device. We'll also never need GSO etc
> > > which is a large part of what tap does (and macvtap will
> > > have to do).
> > 
> > Ah, I see. I didn't realize that qemu needs to write to the
> > device even if vhost is used. But for the case of migration to
> > another machine without vhost, wouldn't qemu also need to read?
> 
> Not that I know. Why?

Well, if the guest not only wants to send data but also receive
frames coming from other machines, they need to get from the kernel
into qemu, and the only way I can see for doing that is to read
from this device if there is no vhost support around on the new
machine.

Maybe we're talking about different things here.

	Arnd

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox