Netdev List


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Andi Kleen @ 2010-04-16 13:42 UTC (permalink / raw)
  To: jamal; +Cc: Andi Kleen, Tom Herbert, davem, netdev, eric.dumazet
In-Reply-To: <1271424726.4606.42.camel@bigi>

On Fri, Apr 16, 2010 at 09:32:06AM -0400, jamal wrote:
> How are you going to schedule the net softirq on an empty queue if you
> do this?

Sorry, I don't understand the question.

You can always do the flow as if rps was not there.

> BTW, in my tests sending an IPI to an SMT sibling or to another core
> didn't make any difference in terms of latency - still 5 microsecs.
> I don't have dual Nehalem where we have to cross QPI - there i suspect
> it will be longer than 5 microsecs.

I meant that an IPI to a sibling is not useful. You send the IPI
to get cache locality in the target, but if the target has the same
cache locality as you, you might as well avoid the cost of the IPI
and process directly.

For thread sibling I'm pretty sure it's useless. Not fully sure about
socket sibling. Maybe.
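
The check being described can be sketched in plain C. This is only an illustration, not kernel code: `struct cpu_topo`, its fields, and `ipi_worthwhile()` are hypothetical names, and the sketch treats only thread siblings as sharing a cache, leaving the less certain socket-sibling case to be steered as usual.

```c
#include <stdbool.h>

/* Hypothetical per-CPU topology record; field names are illustrative. */
struct cpu_topo {
	int core_id;	/* CPUs with equal core_id and socket_id are SMT siblings */
	int socket_id;	/* physical package */
};

/*
 * Return true when redirecting work from CPU 'cur' to CPU 'target' could
 * plausibly buy cache locality, i.e. when the two CPUs are not the same
 * CPU and are not hardware threads of the same core.
 */
bool ipi_worthwhile(const struct cpu_topo *topo, int cur, int target)
{
	if (cur == target)
		return false;	/* already on the desired CPU */

	if (topo[cur].socket_id == topo[target].socket_id &&
	    topo[cur].core_id == topo[target].core_id)
		return false;	/* thread siblings: same caches, skip the IPI */

	return true;
}
```

When the function returns false the caller would simply process the packet locally, as if rps was not there.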

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271399547.16881.3724.camel@edumazet-laptop>

On Fri, 2010-04-16 at 08:32 +0200, Eric Dumazet wrote:
> On Friday 16 April 2010 at 14:02 +0800, Changli Gao wrote:
> 
> > resched IPI, apparently. But it is async absolutely. and its IRQ
> > handler is lighter.
> > 
> 
> You still don't answer the question, and your claims are not grounded
> in hard facts, but in your interpretation of code.

My understanding of the current scheduler is that it does use IPIs to migrate
tasks around - so that's why things may be working for Changli, i.e.
it is scheduler magic if you use kthreads. It is hard to say if this
would work better...

cheers,
jamal



* Re: rps perfomance WAS(Re: rps: question
From: Andi Kleen @ 2010-04-16 13:37 UTC (permalink / raw)
  To: jamal
  Cc: Andi Kleen, Changli Gao, Eric Dumazet, Rick Jones, David Miller,
	therbert, netdev, robert
In-Reply-To: <1271424455.4606.39.camel@bigi>

On Fri, Apr 16, 2010 at 09:27:35AM -0400, jamal wrote:
> On Fri, 2010-04-16 at 09:15 +0200, Andi Kleen wrote:
> 
> > > resched IPI, apparently. But it is async absolutely. and its IRQ
> > > handler is lighter.
> > 
> > It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
> > that's in the tree for a few releases. So it would surprise me if it made
> > much difference. In the old days when there was only a single lock for
> > s_c_f() perhaps...
> 
> So you are saying that the old implementation of IPI (likely what i
> tried pre-napi and as recent as 2-3 years ago) was bad because of a
> single lock?

Yes.

The old implementation of smp_call_function. Also in the really old
days there was no smp_call_function_single() so you tended to broadcast.

Jens did a lot of work on this IPI implementation for his block device work.

> On IPIs:
> Is anyone familiar with what is going on with Nehalem? Why is it this
> good? I expect things will get a lot nastier with other hardware like
> xeon based or even Nehalem with rps going across QPI.

Nehalem is just fast. I don't know why it's fast in your specific
case. It might be simply because it has lots of bandwidth everywhere.
Atomic operations are also faster than on previous Intel CPUs.


> Here's why i think IPIs are bad, please correct me if i am wrong:
> - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> is in the form of an IPI).

In the hardware there's no ack, but in the Linux implementation there
usually is (because you need to know when to free the stack state used
to pass information).

However there's also now support for queued IPI
with a special API (I believe Tom is using that)

> - data cache has to be synced to main memory
> - the instruction pipeline is flushed

At least on Nehalem, data transfer can often go through the cache.

IPIs involve APIC accesses which are not very fast (so overall
it's far more than a pipeline worth of work), but it's still
not an incredibly expensive operation.

There's also X2APIC now which should be slightly faster, but it's
likely not in your Nehalem (this is only in the high-end Xeon versions).

> Do you know any specs i could read up which will tell me a little more?

If you're just interested in IPI and cache line transfer performance it's
probably best to just measure it.

Some general information is always in the Intel optimization guide.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: "kernel:nf_ct_icmp: bad HW ICMP checksum" too noisy
From: Benny Amorsen @ 2010-04-16 13:36 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Netfilter Development Mailinglist
In-Reply-To: <4BC723EE.9090504@trash.net>

Patrick McHardy <kaber@trash.net> writes:

> You should only see that message when nf_conntrack_log_invalid is
> active.

True, I can disable that one as a work-around.


/Benny



* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-16 13:34 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271424065.4606.31.camel@bigi>

On Fri, Apr 16, 2010 at 9:21 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
>
>>
>> A kernel module might do this, this could be integrated in perf bench so
>> that we can regression tests upcoming kernels.
>
> Perf would be good - but even a softnet_stat cleaner than the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
>
> IPI is important to me because i have tried it before and failed
> miserably. I was thinking the improvement may be due to the hardware used
> but i am having a hard time getting people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but is still vague, for example:
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
>
> Please don't consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tomorrow). I still think
> it is valuable.
>

+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);

Do you mean that received_rps is equal to ipi_rps? received_rps is the
number of IPIs used by RPS, and ipi_rps is the number of IPIs sent by
the function generic_exec_single(). If there is no other user of
generic_exec_single(), received_rps should be equal to ipi_rps.
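
To compare the two counters from user space, one row of the patched softnet_stat output could be parsed along these lines. This is a minimal sketch that assumes the 11-column hexadecimal layout printed by the debugging patch in this thread (ipi_rps does not exist in an unpatched kernel); `struct rps_counters` and `parse_softnet_row()` are hypothetical names.

```c
#include <stdio.h>

/* The two counters of interest from one per-CPU row. */
struct rps_counters {
	unsigned int received_rps;	/* column 10 */
	unsigned int ipi_rps;		/* column 11, debugging patch only */
};

/*
 * Parse one row of 11 hexadecimal columns in the order used by the
 * patched seq_printf(): total, dropped, time_squeeze, five zeroes,
 * cpu_collision, received_rps, ipi_rps.
 * Returns 0 on success, -1 if the row does not have 11 columns.
 */
int parse_softnet_row(const char *row, struct rps_counters *out)
{
	unsigned int col[11];
	int n;

	n = sscanf(row, "%x %x %x %x %x %x %x %x %x %x %x",
		   &col[0], &col[1], &col[2], &col[3], &col[4], &col[5],
		   &col[6], &col[7], &col[8], &col[9], &col[10]);
	if (n != 11)
		return -1;

	out->received_rps = col[9];
	out->ipi_rps = col[10];
	return 0;
}
```

If the two values in a row differ, something other than RPS is also going through generic_exec_single() on that CPU.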

@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [PATCH v5] rfs: Receive Flow Steering
From: jamal @ 2010-04-16 13:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tom Herbert, davem, netdev, eric.dumazet
In-Reply-To: <87r5mf8va9.fsf@basil.nowhere.org>

On Fri, 2010-04-16 at 13:57 +0200, Andi Kleen wrote:

> One thing I've been wondering while reading is if this should be made
> socket or SMT aware.
> 
> If you're on a hyperthreaded system, sending an IPI
> to your core sibling, which has a completely shared cache hierarchy,
> might not be the best use of cycles.
> 
> The same could potentially be true for shared L2 or shared L3 cache
> (e.g. only redirect flows between different sockets)
> 
> Have you ever considered that?
> 

How are you going to schedule the net softirq on an empty queue if you
do this?
BTW, in my tests sending an IPI to an SMT sibling or to another core
didn't make any difference in terms of latency - still 5 microsecs.
I don't have dual Nehalem where we have to cross QPI - there i suspect
it will be longer than 5 microsecs.

cheers,
jamal


* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Changli Gao, Eric Dumazet, Rick Jones, David Miller, therbert,
	netdev, robert
In-Reply-To: <20100416071522.GY18855@one.firstfloor.org>

On Fri, 2010-04-16 at 09:15 +0200, Andi Kleen wrote:

> > resched IPI, apparently. But it is async absolutely. and its IRQ
> > handler is lighter.
> 
> It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
> that's in the tree for a few releases. So it would surprise me if it made
> much difference. In the old days when there was only a single lock for
> s_c_f() perhaps...

So you are saying that the old implementation of IPI (likely what i
tried pre-napi and as recent as 2-3 years ago) was bad because of a
single lock?

BTW, I directed some questions to you earlier but didn't get a response,
to quote:
---
On IPIs:
Is anyone familiar with what is going on with Nehalem? Why is it this
good? I expect things will get a lot nastier with other hardware like
xeon based or even Nehalem with rps going across QPI.
Here's why i think IPIs are bad, please correct me if i am wrong:
- they are synchronous. i.e an IPI issuer has to wait for an ACK (which
is in the form of an IPI).
- data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did i miss? Andi?
---

Do you know any specs i could read up which will tell me a little more?

cheers,
jamal




* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271395106.16881.3645.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 1231 bytes --]

On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:

> 
> A kernel module might do this, this could be integrated in perf bench so
> that we can regression tests upcoming kernels.

Perf would be good - but even a softnet_stat cleaner than the nasty
hack i use (attached) would be a good start; the ping with and without
rps gives me a ballpark number.

IPI is important to me because i have tried it before and failed
miserably. I was thinking the improvement may be due to the hardware used
but i am having a hard time getting people to tell me what hardware they
used! I am old school - I need data;-> The RFS patch commit seems to
have more info but is still vague, for example:
"The benefits of RFS are dependent on cache hierarchy, application
load, and other factors"
Also, what does a "simple" or "complex" benchmark mean?;->
I think it is only fair to get this info, no?

Please don't consider what i say above as being anti-RPS.
5 microsec extra latency is not bad if it can be amortized.
Unfortunately, the best traffic i could generate was < 20Kpps of
ping which still manages to get 1 IPI/packet on Nehalem. I am going
to write up some app (lots of cycles available tomorrow). I still think
it is valuable.

cheers,
jamal

[-- Attachment #2: p1 --]
[-- Type: text/x-patch, Size: 1551 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..f8267fc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -224,6 +224,7 @@ struct netif_rx_stats {
 	unsigned time_squeeze;
 	unsigned cpu_collision;
 	unsigned received_rps;
+	unsigned ipi_rps;
 };

 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
diff --git a/kernel/smp.c b/kernel/smp.c
index 9867b6b..8c5dcb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -11,6 +11,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <linux/netdevice.h>

 static struct {
 	struct list_head	queue;
@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}

 	if (wait)
 		csd_lock_wait(data);
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..0bbbdcf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3563,10 +3563,12 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct netif_rx_stats *s = v;

-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);
+	s->ipi_rps = 0;
+	s->received_rps = 0;
 	return 0;
 }


* [PATCH] tg3: Fix INTx fallback when MSI fails
From: Andre Detsch @ 2010-04-16 13:15 UTC (permalink / raw)
  To: netdev, Matt Carlson

tg3: Fix INTx fallback when MSI fails

MSI setup changes the value of some key attributes of struct tg3 *tp.
These attributes must be taken into account and restored before
we try to do a new request_irq for INTx fallback.

On powerpc, the original code was leading to an EINVAL return within
request_irq, because the driver was trying to use the disabled MSI
virtual irq number instead of tp->pdev->irq.

Signed-off-by: Andre Detsch <adetsch@br.ibm.com>

---
Tested on powerpc, but should be safe for other architectures as well.

Index: linux-2.6.34-rc4/drivers/net/tg3.c
===================================================================
--- linux-2.6.34-rc4.orig/drivers/net/tg3.c	2010-04-12 21:41:35.000000000 -0400
+++ linux-2.6.34-rc4/drivers/net/tg3.c	2010-04-15 20:37:41.000000000 -0400
@@ -8633,6 +8633,9 @@ static int tg3_test_msi(struct tg3 *tp)
 	pci_disable_msi(tp->pdev);

 	tp->tg3_flags2 &= ~TG3_FLG2_USING_MSI;
+	tp->irq_cnt = 1;
+	tp->napi[0].irq_vec = tp->pdev->irq;
+	tp->dev->real_num_tx_queues = 1;

 	err = tg3_request_irq(tp, 0);
 	if (err)


* Re: [PATCH] drivers/net/pcmcia/3c574_cs: fixing stats.tx_bytes counter
From: Dominik Brodowski @ 2010-04-16 13:01 UTC (permalink / raw)
  To: Alexander Kurz, David S. Miller; +Cc: netdev
In-Reply-To: <alpine.DEB.1.10.1003312014120.9974@blala.de>

David,

as this is more netdev-related than PCMCIA-related, could you pick it up?
Else, I'm willing to take it upstream, but would prefer your ACK on this.

Thanks & best wishes,

	Dominik

From: Alexander Kurz <akurz@blala.de>
Date: Wed, 31 Mar 2010 20:21:29 +0400
Subject: [PATCH] net: 3c574_cs fix stats.tx_bytes counter

Update the stats counter calculation in 3c574_cs, similar
to the method used in 3c589_cs. This corrects the contents
of the counter on tests using a "Megahertz 574B" card.

[linux@dominikbrodowski.net: clean up commit message]
Signed-off-by: Alexander Kurz <linux@kbdbabel.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

diff --git a/drivers/net/pcmcia/3c574_cs.c b/drivers/net/pcmcia/3c574_cs.c
index 727bb38..30b7cf7 100644
--- a/drivers/net/pcmcia/3c574_cs.c
+++ b/drivers/net/pcmcia/3c574_cs.c
@@ -772,8 +772,13 @@ static netdev_tx_t el3_start_xmit(struct sk_buff *skb,
 		  inw(ioaddr + EL3_STATUS));

 	spin_lock_irqsave(&lp->window_lock, flags);
+
+	dev->stats.tx_bytes += skb->len;
+
+	/* Put out the doubleword header... */
 	outw(skb->len, ioaddr + TX_FIFO);
 	outw(0, ioaddr + TX_FIFO);
+	/* ... and the packet rounded to a doubleword. */
 	outsl(ioaddr + TX_FIFO, skb->data, (skb->len+3)>>2);

 	dev->trans_start = jiffies;
@@ -1012,8 +1017,6 @@ static void update_stats(struct net_device *dev)
 	/* BadSSD */				   inb(ioaddr + 12);
 	up					 = inb(ioaddr + 13);

-	dev->stats.tx_bytes 			+= tx + ((up & 0xf0) << 12);
-
 	EL3WINDOW(1);
 }


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Andi Kleen @ 2010-04-16 11:57 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, eric.dumazet
In-Reply-To: <alpine.DEB.1.00.1004152243470.15102@pokey.mtv.corp.google.com>

Tom Herbert <therbert@google.com> writes:
> +
> +		/*
> +		 * If the desired CPU (where last recvmsg was done) is
> +		 * different from current CPU (one in the rx-queue flow
> +		 * table entry), switch if one of the following holds:
> +		 *   - Current CPU is unset (equal to RPS_NO_CPU).
> +		 *   - Current CPU is offline.
> +		 *   - The current CPU's queue tail has advanced beyond the
> +		 *     last packet that was enqueued using this table entry.
> +		 *     This guarantees that all previous packets for the flow
> +		 *     have been dequeued, thus preserving in order delivery.
> +		 */
> +		if (unlikely(tcpu != next_cpu) &&
> +		    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
> +		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -

One thing I've been wondering while reading is if this should be made
socket or SMT aware.

If you're on a hyperthreaded system, sending an IPI
to your core sibling, which has a completely shared cache hierarchy,
might not be the best use of cycles.

The same could potentially be true for shared L2 or shared L3 cache
(e.g. only redirect flows between different sockets)

Have you ever considered that?

This is of course something that could be addressed post-merge, not
a blocker.
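
For reference, the switch rule in the quoted comment can be sketched as stand-alone C. This is an illustration under assumptions, not the patch itself: the right-hand operand of the subtraction is cut off in the quote above, so `last_enqueued` is a hypothetical stand-in for it, and `NO_CPU` stands in for RPS_NO_CPU with an illustrative value. The signed cast of an unsigned difference is the usual way to keep such a comparison correct across counter wraparound on common compilers.

```c
#include <stdbool.h>

#define NO_CPU (-1)	/* stand-in for RPS_NO_CPU; the value is illustrative */

/*
 * Wraparound-safe test: has the free-running counter 'head' reached or
 * passed 'mark'? Valid as long as the two are less than 2^31 apart.
 */
bool queue_passed(unsigned int head, unsigned int mark)
{
	return (int)(head - mark) >= 0;
}

/*
 * Decide whether a flow may move from its current CPU 'tcpu' to the
 * desired CPU 'next_cpu', following the three conditions in the comment:
 * current CPU unset, current CPU offline, or all packets previously
 * enqueued for the flow already dequeued.
 */
bool may_switch_cpu(int tcpu, int next_cpu, bool tcpu_online,
		    unsigned int queue_head, unsigned int last_enqueued)
{
	if (tcpu == next_cpu)
		return false;	/* nothing to switch */

	return tcpu == NO_CPU || !tcpu_online ||
	       queue_passed(queue_head, last_enqueued);
}
```

An SMT or socket aware variant would add one more term to this decision.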

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: HTB - What's the minimal value for 'rate' parameter?
From: Antonio Almeida @ 2010-04-16 11:56 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev, kaber, davem, devik
In-Reply-To: <4BC63766.5080104@gmail.com>

Now I understand. It makes sense - totally! Thanks for your endurance
trying to open my eyes :)
I've been trying rates bigger than 100bit for a while and it's working fine.
Thanks a lot for your illustration!

Regards
  Antonio Almeida



On Wed, Apr 14, 2010 at 10:45 PM, Jarek Poplawski wrote:
> Antonio Almeida wrote, On 04/14/2010 12:22 PM:
>
>> What do you mean with "1:2 has grandchildren with overflown rate tables"?
>> I couldn't understand your idea. Is there any mistake in the
>> configuration I sent?
>> How would you set rates for this particular example?
>
>
> class htb 1:1 root rate 1000Mbit ceil 1000Mbit
> class htb 1:2 parent 1:1 rate 4096Kbit ceil 4096Kbit
> class htb 1:10 parent 1:2 rate 1024Kbit ceil 4096Kbit
> class htb 1:11 parent 1:2 rate 1024Kbit ceil 4096Kbit
> class htb 1:101 parent 1:10 prio 3 rate 8bit ceil 4096Kbit
> class htb 1:111 parent 1:11 prio 3 rate 8bit ceil 4096Kbit
>
> Classes 1:101 and 1:111 have too low rates, which causes wrong (overflowed!)
> values in their rate tables, so their rates could be practically
> uncontrollable. They are limited by their ceils instead, so something like:
>
> class htb 1:101 parent 1:10 leaf 101: prio 3 rate 4096Kbit ceil 4096Kbit
> class htb 1:111 parent 1:11 leaf 111: prio 3 rate 4096Kbit ceil 4096Kbit
>
> But then their guaranteed rates are higher than their parents, and the
> sum is higher than grandparent's rate, which means the config is wrong.
> (You have to control these sums - HTB doesn't.)
>
> As I wrote before, the minimal (overflow safe) rate depends on max
> packet size, and for 1500 byte it would be something around:
> 1500b/2min, so if your clients can wait so long, try this:
>
> class htb 1:101 parent 1:10 leaf 101: prio 3 rate 100bit ceil 4096Kbit
> class htb 1:111 parent 1:11 leaf 111: prio 3 rate 100bit ceil 4096Kbit
>
> Regards,
> Jarek P.
>
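
The "1500b/2min" estimate works out to the 100bit figure used in the last two class lines: 1500 bytes is 12000 bits, and 12000 bits spread over 120 seconds is 100 bit/s. A small sketch of that arithmetic follows; `min_rate_bps()` is a hypothetical helper, and the 2 minute bound is taken from the message as given, not derived here.

```c
/*
 * Lowest rate, in bit/s, at which one packet of 'pkt_bytes' bytes still
 * takes no longer than 'max_seconds' to transmit.
 */
unsigned long min_rate_bps(unsigned long pkt_bytes, unsigned long max_seconds)
{
	return (pkt_bytes * 8UL) / max_seconds;
}
```

With these inputs, min_rate_bps(1500, 120) gives 100, which is why 8bit is far too low for 1500 byte packets.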


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16  7:48 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20100416.002632.83844236.davem@davemloft.net>

On Friday 16 April 2010 at 00:26 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 16 Apr 2010 09:18:03 +0200
> 
> > Hmm, I wonder if it's not an artifact of net-next-2.6 being a bit old
> > (versus linux-2.6). I know scheduler guys did some tweaks.
> 
> I synced net-next-2.6 up with Linus's current tree just a day
> or two ago when I pulled net-2.6 into net-next-2.6.

OK thanks :)

Tom, please add __read_mostly to rps_sock_flow_table

struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;

I'll spend some hours today to track the problem.





* Re: Network multiqueue question
From: George B. @ 2010-04-16  7:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1271393633.16881.3606.camel@edumazet-laptop>

On Thu, Apr 15, 2010 at 9:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday 15 April 2010 at 21:00 -0700, George B. wrote:

> What kind of traffic do your machines manage exactly ?

Content to mobile devices (cell phones and such). More detail sent privately.

> On server, you use two ports of the same kind (same number of queues) ?

Yes, same kind.  We try to make everything identical.  Fewer problems that way.

George


* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16  7:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1271402283.16881.3791.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 16 Apr 2010 09:18:03 +0200

> Hmm, I wonder if it's not an artifact of net-next-2.6 being a bit old
> (versus linux-2.6). I know scheduler guys did some tweaks.

I synced net-next-2.6 up with Linus's current tree just a day
or two ago when I pulled net-2.6 into net-next-2.6.


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16  7:18 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <1271401007.16881.3762.camel@edumazet-laptop>

On Friday 16 April 2010 at 08:56 +0200, Eric Dumazet wrote:

> I read the patch and found no error.
> 
> I booted a test machine and performed some tests.
> 
> I am a bit worried about a tbench regression I am looking at right now.
> 
> With RFS disabled, tbench 16   ->  4408.63 MB/sec 
> 
> 
> # grep . /sys/class/net/lo/queues/rx-0/*
> /sys/class/net/lo/queues/rx-0/rps_cpus:00000000
> /sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
> # cat /proc/sys/net/core/rps_sock_flow_entries
> 8192
> 
> 
> echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
> 
> tbench 16 -> 2336.32 MB/sec
> 
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>    PerfTop:   14561 irqs/sec  kernel:86.3% [1000Hz cycles],  (all, 16 CPUs)
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> 
>              samples  pcnt function                       DSO
>              _______ _____ ______________________________ __________________________________________________________
> 
>              2664.00  5.1% copy_user_generic_string       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              2323.00  4.4% acpi_os_read_port              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1641.00  3.1% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1260.00  2.4% schedule                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1159.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1051.00  2.0% tcp_ack                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               991.00  1.9% tcp_sendmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               922.00  1.8% tcp_recvmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               821.00  1.6% child_run                      /usr/bin/tbench                                           
>               766.00  1.5% all_string_sub                 /usr/bin/tbench                                           
>               630.00  1.2% __switch_to                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               608.00  1.2% __GI_strchr                    /lib/tls/libc-2.3.4.so                                    
>               606.00  1.2% ipt_do_table                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               600.00  1.1% __GI_strstr                    /lib/tls/libc-2.3.4.so                                    
>               556.00  1.1% __netif_receive_skb            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               504.00  1.0% tcp_transmit_skb               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               502.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               481.00  0.9% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               473.00  0.9% next_token                     /usr/bin/tbench                                           
>               449.00  0.9% ip_rcv                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               423.00  0.8% call_function_single_interrupt /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               422.00  0.8% ia32_sysenter_target           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               420.00  0.8% compat_sys_socketcall          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               401.00  0.8% mod_timer                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               400.00  0.8% process_backlog                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               399.00  0.8% ip_queue_xmit                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               387.00  0.7% select_task_rq_fair            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               377.00  0.7% _raw_spin_lock_bh              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               360.00  0.7% tcp_v4_rcv                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> 
> But if RFS is on, why does activating rps_cpus change tbench?
> 

Hmm, I wonder if it's not an artifact of net-next-2.6 being a bit old
(versus linux-2.6). I know scheduler guys did some tweaks.

Because apparently, some cpus are idle part of their time (30% ???)

Or a new bug in cpu accounting, reporting idle time while cpus are
busy....

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
16  0      0 5670264  13280  63392    0    0     2     1 1512  227 12 47 41  0
18  0      0 5669396  13280  63392    0    0     0     0 657952 1606102 14 58 28  0
17  0      0 5668776  13288  63392    0    0     0    12 656701 1606369 14 58 28  0
18  0      0 5669644  13288  63392    0    0     0     0 657636 1603960 15 57 28  0
17  0      0 5670900  13288  63392    0    0     0     0 666425 1584847 15 56 29  0
15  0      0 5669164  13288  63392    0    0     0     0 682578 1472616 14 56 30  0
16  0      0 5669412  13288  63392    0    0     0     0 695767 1506302 14 54 32  0
14  0      0 5668916  13296  63396    0    0     4   148 685286 1482897 14 56 30  0
17  0      0 5669784  13296  63396    0    0     0     0 683910 1477994 14 56 30  0
18  0      0 5670032  13296  63396    0    0     0     0 692023 1497195 14 55 31  0
16  0      0 5669040  13296  63396    0    0     0     0 677477 1468157 14 56 30  0
16  0      0 5668916  13312  63396    0    0     0    32 489358 1048553 14 57 30  0
18  0      0 5667924  13320  63396    0    0     0    12 424787 897145 15 55 29  0

RFS off :

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
24  0      0 5669624  13632  63476    0    0     2     1  261   82 12 48 40  0
26  0      0 5669492  13632  63476    0    0     0     0 4223 1740651 21 71  7  0
23  0      0 5669864  13640  63476    0    0     0    12 4205 1731882 21 71  8  0
23  0      0 5670484  13640  63476    0    0     0     0 4176 1733448 21 71  8  0
24  0      0 5670588  13640  63476    0    0     0     0 4176 1733845 21 72  7  0
21  0      0 5671084  13640  63476    0    0     0     0 4200 1734990 20 73  7  0
23  0      0 5671580  13640  63476    0    0     0     0 4168 1735100 21 71  8  0
23  0      0 5671704  13640  63480    0    0     4   132 4221 1733428 21 72  7  0
22  0      0 5671952  13640  63480    0    0     0     0 4190 1730370 21 72  8  0
20  0      0 5672292  13640  63480    0    0     0     0 4212 1732084 22 70  8  0





* Re: rps perfomance WAS(Re: rps: question
From: Andi Kleen @ 2010-04-16  7:15 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, hadi, Rick Jones, David Miller, therbert, netdev,
	robert, andi
In-Reply-To: <o2v412e6f7f1004152302j1aca5edam9d53d01781ddbe9d@mail.gmail.com>

> > Come on Changli.
> >
> > How do you wake up a thread on a remote cpu ?
> >
> 
> resched IPI, apparently. But it is async absolutely. and its IRQ
> handler is lighter.

It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
that's in the tree for a few releases. So it would surprise me if it made
much difference. In the old days when there was only a single lock for
s_c_f() perhaps...

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16  6:56 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20100415.233334.242114544.davem@davemloft.net>

On Thursday 15 April 2010 at 23:33 -0700, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
> 
> > Version 5 of RFS:
> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
> > static function.
> > - Apply limits to rps_sock_flow_entries sysctl and rps_flow_cnt
> > sysfs variable.
> 
> I've read this over a few times and I think it's ready to go into
> net-next-2.6, we can tweak things as-needed from here on out.
> 
> Eric, what do you think?

I read the patch and found no error.

I booted a test machine and performed some tests.

I am a bit worried about a tbench regression I am looking at right now.

With RFS disabled, tbench 16   ->  4408.63 MB/sec 


# grep . /sys/class/net/lo/queues/rx-0/*
/sys/class/net/lo/queues/rx-0/rps_cpus:00000000
/sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
# cat /proc/sys/net/core/rps_sock_flow_entries
8192


echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus

tbench 16 -> 2336.32 MB/sec
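
The value written to rps_cpus is a hexadecimal CPU bitmap, so ffff selects CPUs 0 through 15, which is every CPU on this 16-CPU test machine. A minimal sketch of building such a mask follows; `cpu_mask()` and `format_rps_mask()` are hypothetical helpers, and the sketch is limited to machines with at most 32 CPUs.

```c
#include <stdio.h>

/* Bitmap with the lowest 'ncpus' bits set (saturates at 32 CPUs). */
unsigned int cpu_mask(unsigned int ncpus)
{
	if (ncpus >= 32)
		return 0xffffffffu;
	return (1u << ncpus) - 1u;
}

/*
 * Format the mask the way it would be written to rps_cpus.
 * Returns the number of characters produced, as snprintf() does.
 */
int format_rps_mask(char *buf, size_t len, unsigned int ncpus)
{
	return snprintf(buf, len, "%x", cpu_mask(ncpus));
}
```

For 16 CPUs this yields the string ffff used in the test above; a mask of 0 leaves RPS off for that queue.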


-----------------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   14561 irqs/sec  kernel:86.3% [1000Hz cycles],  (all, 16 CPUs)
-----------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ __________________________________________________________

             2664.00  5.1% copy_user_generic_string       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             2323.00  4.4% acpi_os_read_port              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1641.00  3.1% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1260.00  2.4% schedule                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1159.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1051.00  2.0% tcp_ack                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              991.00  1.9% tcp_sendmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              922.00  1.8% tcp_recvmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              821.00  1.6% child_run                      /usr/bin/tbench                                           
              766.00  1.5% all_string_sub                 /usr/bin/tbench                                           
              630.00  1.2% __switch_to                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              608.00  1.2% __GI_strchr                    /lib/tls/libc-2.3.4.so                                    
              606.00  1.2% ipt_do_table                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              600.00  1.1% __GI_strstr                    /lib/tls/libc-2.3.4.so                                    
              556.00  1.1% __netif_receive_skb            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              504.00  1.0% tcp_transmit_skb               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              502.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              481.00  0.9% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              473.00  0.9% next_token                     /usr/bin/tbench                                           
              449.00  0.9% ip_rcv                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              423.00  0.8% call_function_single_interrupt /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              422.00  0.8% ia32_sysenter_target           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              420.00  0.8% compat_sys_socketcall          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              401.00  0.8% mod_timer                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              400.00  0.8% process_backlog                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              399.00  0.8% ip_queue_xmit                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              387.00  0.7% select_task_rq_fair            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              377.00  0.7% _raw_spin_lock_bh              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              360.00  0.7% tcp_v4_rcv                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux

But if RFS is on, why does activating rps_cpus change the tbench result?




* Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-16  6:51 UTC (permalink / raw)
  To: netdev

Hi,

I am trying to find a duplicate IP in the network using arping.

-------------------------
[root@vps1 ~]# ping -c 3 192.168.1.212
PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms

--- 192.168.1.212 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
[root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
ARPING 192.168.1.212 from 0.0.0.0 eth0
0
-------------------------


According to arping, that IP is a duplicate. But if I go ahead and ifdown
the IP at its known location, I can no longer ping it (which would mean the
IP is not duplicated?). This is the result after shutting down the IP.

--------------------------
[root@vps1 ~]# ping -c 3 192.168.1.212
PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
From 192.168.1.63 icmp_seq=3 Destination Host Unreachable

--- 192.168.1.212 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2001ms, pipe 3
[root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
ARPING 192.168.1.212 from 0.0.0.0 eth0
Sent 5 probes (5 broadcast(s))
Received 0 response(s)
0
[root@vps1 ~]#
--------------------------

My question: in this case IP 192.168.1.212 is not duplicated, yet arping
still appears to report it as a duplicate. Why is that?
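
One thing worth checking is how the exit status is being read. As far as
the iputils arping man page describes it, in -D (duplicate address
detection) mode an exit status of 0 means no replies were received, so the
address looked unused, and a non-zero status points to a reply or a
failure. The Python sketch below encodes that reading; the function names
are hypothetical, and the exit-status meaning is an assumption that should
be confirmed against the installed arping version.

```python
import subprocess


def interpret_dad_status(returncode: int) -> str:
    """Map an `arping -D` exit status to a verdict (assumed iputils semantics)."""
    # Assumption: 0 means DAD succeeded (no replies were received);
    # anything else means a reply was seen or the probe itself failed.
    if returncode == 0:
        return "no reply: address looks unused"
    return "reply or error: possible duplicate"


def probe(ip: str, iface: str = "eth0", count: int = 5) -> str:
    """Run the same command as in the transcripts above and interpret it."""
    result = subprocess.run(
        ["arping", "-D", "-I", iface, "-c", str(count), ip],
        capture_output=True, text=True,
    )
    return interpret_dad_status(result.returncode)


# Both transcripts above end with exit status 0.
print(interpret_dad_status(0))
```

Under that assumption, both runs shown above would count as "no duplicate
detected" rather than as a duplicate alert.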

-- 
Regards,
Unni
http://mutexes.org/
http://twitter.com/webofunni

* Re: [net-next-2.6 PATCH 2/2] net: replace ipfragok with skb->local_df
From: Herbert Xu @ 2010-04-16  6:49 UTC (permalink / raw)
  To: Shan Wei
  Cc: David Miller, yinghai.lu, kuznet, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, netdev@vger.kernel.org, dccp, linux-sctp,
	kleptog, jchapman, mostrows, acme
In-Reply-To: <20100416064344.GA12412@gondor.apana.org.au>

On Fri, Apr 16, 2010 at 02:43:44PM +0800, Herbert Xu wrote:
> On Fri, Apr 16, 2010 at 10:26:09AM +0800, Shan Wei wrote:
> > 
> > > Now, PPPoX/PPPoL2TP drivers still use ip_queue_xmit to send packets with ipfragok == 1.
> > So, now we can't remove the && ... bit. 
> 
> Huh? If they still call ip_queue_xmit with ipfragok then surely
> the build will fail after your patch as it removes the ipfragok
> argument?

Never mind, I was looking at the wrong tree.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH Resubmission v2] drivers/net/usb: Add new driver ipheth
From: David Miller @ 2010-04-16  6:44 UTC (permalink / raw)
  To: agimenez
  Cc: dgiagio, dborca, James.Bottomley, ralf, gregkh, jonas.sjoquist,
	torgny.johansson, steve.glendinning, dbrownell, omar.oberthur,
	remi.denis-courmont, netdev, linux-kernel, linux-usb
In-Reply-To: <1271360791-30312-1-git-send-email-agimenez@sysvalve.es>

From: L. Alberto Giménez <agimenez@sysvalve.es>
Date: Thu, 15 Apr 2010 21:46:29 +0200

> From: dborca@yahoo.com
> 
> Add new driver to use tethering with an iPhone device. After initial submission,
> apply fixes to fit the new driver into the kernel standards.
> 
> There are still a couple of minor (almost cosmetic-level) issues, but the driver
> is fully functional right now.
> 
> Signed-off-by: L. Alberto Giménez <agimenez@sysvalve.es>

I'm very confused about the authorship of this driver.

Who wrote it?

You added a "From: " line specifying Daniel Borca (btw,
when you add these "From: " lines you need to specify them in
the form "From: NAME <EMAIL>", not just "From: EMAIL", so in
this case we want to see "From: Daniel Borca <dborca@yahoo.com>").
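
The expected shape of that line can be checked mechanically. A rough
Python sketch follows, assuming a deliberately simple pattern; the regex
and the helper name are illustrative and are not what git or any kernel
script actually uses.

```python
import re

# "From: NAME <EMAIL>": a non-empty name, then an address in angle brackets.
FROM_LINE = re.compile(
    r"^From: (?P<name>[^<>@]+?) <(?P<email>[^<>\s@]+@[^<>\s@]+)>$"
)


def has_full_author(line: str) -> bool:
    """True if `line` names the author as "From: NAME <EMAIL>"."""
    return FROM_LINE.match(line.strip()) is not None


print(has_full_author("From: dborca@yahoo.com"))                 # False
print(has_full_author("From: Daniel Borca <dborca@yahoo.com>"))  # True
```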

The code itself gives copyright to Diego Giagio <diego@giagio.com>
and he is also the one listed in the MODULE_AUTHOR().

And you're the one submitting the code, and also the only person
actually giving a signoff in the commit message.

It's too confusing and ambiguous, and if there are any problems
down the road the last thing we need is for the authorship to
be ambiguous.

I would really appreciate it if the authorship were clearly stated, and
if the actual author of the code gave a "Signed-off-by: " line
in the commit message for the inclusion of this driver.

Please fix this up and resubmit, thank you.

Thanks.

* Re: [net-next-2.6 PATCH 2/2] net: replace ipfragok with skb->local_df
From: Herbert Xu @ 2010-04-16  6:43 UTC (permalink / raw)
  To: Shan Wei
  Cc: David Miller, yinghai.lu, kuznet, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, netdev@vger.kernel.org, dccp, linux-sctp,
	kleptog, jchapman, mostrows, acme
In-Reply-To: <4BC7CAC1.4000803@cn.fujitsu.com>

On Fri, Apr 16, 2010 at 10:26:09AM +0800, Shan Wei wrote:
> 
> Now, PPPoX/PPPoL2TP drivers still use ip_queue_xmit to send packets with ipfragok == 1.
> So, now we can't remove the && ... bit. 

Huh? If they still call ip_queue_xmit with ipfragok then surely
the build will fail after your patch as it removes the ipfragok
argument?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [net-next-2.6 PATCH 3/3 v2] ipv6: fix the comment of ip6_xmit()
From: David Miller @ 2010-04-16  6:37 UTC (permalink / raw)
  To: shanwei; +Cc: netdev
In-Reply-To: <4BC7D010.7080008@cn.fujitsu.com>

From: Shan Wei <shanwei@cn.fujitsu.com>
Date: Fri, 16 Apr 2010 10:48:48 +0800

> 
> ip6_xmit() is used by upper-layer transport protocols.
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

Applied.

* Re: [net-next-2.6 PATCH 2/3 v2] net: replace ipfragok with skb->local_df
From: David Miller @ 2010-04-16  6:37 UTC (permalink / raw)
  To: shanwei
  Cc: herbert, yinghai.lu, kuznet, pekkas, jmorris, yoshfuji, kaber,
	netdev, dccp, linux-sctp, jchapman, mostrows
In-Reply-To: <4BC7CEBC.70200@cn.fujitsu.com>

From: Shan Wei <shanwei@cn.fujitsu.com>
Date: Fri, 16 Apr 2010 10:43:08 +0800

> As Herbert Xu said: we should be able to simply replace ipfragok
> with skb->local_df. Commit f88037 (sctp: Drop ipfargok in sctp_xmit function)
> has dropped ipfragok and set the local_df value properly.
> 
> The patch kills the ipfragok parameter of .queue_xmit().
> 
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

Applied.

* Re: [net-next-2.6 PATCH 1/3 v2] ipv6: cancel to setting local_df in ip6_xmit()
From: David Miller @ 2010-04-16  6:37 UTC (permalink / raw)
  To: shanwei
  Cc: herbert, emils.tantilov, kuznet, pekkas, jmorris, yoshfuji, kaber,
	eric.dumazet, netdev
In-Reply-To: <4BC7CDD2.5020004@cn.fujitsu.com>

From: Shan Wei <shanwei@cn.fujitsu.com>
Date: Fri, 16 Apr 2010 10:39:14 +0800

> commit f88037(sctp: Drop ipfargok in sctp_xmit function)
> has dropped ipfragok and set the local_df value properly.
> 
> So the change made by commit 77e2f1 (ipv6: Fix ip6_xmit to
> send fragments if ipfragok is true) is no longer needed,
> and this patch removes it.
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

Applied.
