Netdev List
* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Wolfgang Grandegger @ 2010-05-01 14:09 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100501140823.GA2779@riccoc20.at.omicron.at>

Richard Cochran wrote:
> On Fri, Apr 30, 2010 at 07:58:41PM +0200, Wolfgang Grandegger wrote:
>>>  include/linux/Kbuild             |    1 +
>>>  include/linux/ptp_clock.h        |   37 +++++
>> ptp_clock.h should probably be added to "include/linux/Kbuild".
> 
> But it already is, see the two lines above. Or did you mean something
> else?

Oops, sorry for the noise.

Wolfgang.


* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-05-01 14:08 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: netdev
In-Reply-To: <4BDB1A51.20505@grandegger.com>

On Fri, Apr 30, 2010 at 07:58:41PM +0200, Wolfgang Grandegger wrote:
> >  include/linux/Kbuild             |    1 +
> >  include/linux/ptp_clock.h        |   37 +++++
> 
> ptp_clock.h should probably be added to "include/linux/Kbuild".

But it already is, see the two lines above. Or did you mean something
else?

Thanks,
Richard


* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 13:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272720125.2230.178.camel@edumazet-laptop>

On Sat, 2010-05-01 at 15:22 +0200, Eric Dumazet wrote:

> You must understand that the whole 'bench' is mostly governed by
> scheduler artifacts. The regression you mention is probably a side
> effect.

likely.

> By slowing down one part, it's possible to zap all calls to the scheduler
> and go maybe 300% faster (because consumer threads can avoid scheduling
> 3/4 of the time).
> 
> Reciprocally, optimizing one part of the network stack might make
> threads hit an empty queue and need to call the scheduler more
> often.

It is fair to say that what i am seeing is _not_ fatal because it is rps
that is regressing; non-rps is fine. I would consider non-rps to be the
common use scenario and if that was doing badly then it is a problem.
The good news is it is getting better - likely because of some changes
made on behalf of rps ;->
With rps, one could follow some instructions on how to make it better.
I am hoping that some of the system "magic" is documented as Tom
mentioned he will.

> This is why some highly specialized programs never block/schedule and
> perform busy loops instead.

Agreed. My brain cells should learn to accept this fact ;->

cheers,
jamal



* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 13:22 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272714966.14499.37.camel@bigi>

On Saturday 01 May 2010 at 07:56 -0400, jamal wrote:

> 
> [1]i.e with this program rps was getting worse (it was much better
> before say net-next of apr14) and that non-rps has been getting better
> numbers since. The regression is real - but it is likely in another
> subsystem.
> 

You must understand that the whole 'bench' is mostly governed by
scheduler artifacts. The regression you mention is probably a side
effect.

By slowing down one part, it's possible to zap all calls to the scheduler
and go maybe 300% faster (because consumer threads can avoid scheduling
3/4 of the time).

Reciprocally, optimizing one part of the network stack might make
threads hit an empty queue and need to call the scheduler more
often.

This is why some highly specialized programs never block/schedule and
perform busy loops instead.
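
For illustration only, a minimal userspace sketch of the busy-loop
approach, assuming a UDP socket bound on loopback. busy_recv(),
busy_recv_demo() and max_spins are hypothetical names, not taken from any
program discussed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <unistd.h>

/* Spin on a non-blocking receive instead of sleeping in the kernel.
 * Returns the datagram length, or -1 on a real error or when max_spins
 * attempts all found the queue empty. */
static ssize_t busy_recv(int fd, void *buf, size_t len, unsigned long max_spins)
{
	unsigned long i;

	assert(buf != NULL);
	for (i = 0; i < max_spins; i++) {
		ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

		if (n >= 0)
			return n;	/* got a datagram without blocking */
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;	/* real error, stop spinning */
		/* empty queue: retry at once, the thread never sleeps */
	}
	return -1;
}

/* Loopback helper: send one 4-byte datagram to ourselves and busy-poll
 * for it. Returns the number of bytes received, or -1 on failure. */
static ssize_t busy_recv_demo(void)
{
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	char buf[16];
	ssize_t n = -1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&addr, &alen) == 0 &&
	    sendto(fd, "ping", 4, 0, (struct sockaddr *)&addr, sizeof(addr)) == 4)
		n = busy_recv(fd, buf, sizeof(buf), 1000000UL);
	close(fd);
	return n;
}
```

The trade-off is the one described above: the consumer burns a full CPU
while the queue is empty, but it never pays for a sleep/wakeup cycle.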





* [PATCH linux-2.6.34-rc5] drivers/net/phy: micrel phy driver
From: Choi, David @ 2010-04-29 16:12 UTC (permalink / raw)
  To: netdev

To whom it may concern:

From: David J. Choi <david.choi@micrel.com>
This is the first version of the PHY driver from Micrel Inc.
Signed-off-by: David J. Choi <david.choi@micrel.com>

---
--- linux-2.6.34-rc5/drivers/net/phy/micrel.c.orig	2010-04-29 08:20:51.000000000 -0700
+++ linux-2.6.34-rc5/drivers/net/phy/micrel.c	2010-04-29 08:52:37.000000000 -0700
@@ -0,0 +1,104 @@
+/*
+ * drivers/net/phy/micrel.c
+ *
+ * Driver for Micrel PHYs
+ *
+ * Author: David J. Choi
+ *
+ * Copyright (c) 2010 Micrel, Inc.
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ *
+ * Support : ksz9021 , vsc8201, ks8001
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/phy.h>
+
+#define	PHY_ID_KSZ9021			0x00221611
+#define	PHY_ID_VSC8201			0x000FC413
+#define	PHY_ID_KS8001			0x0022161A
+
+
+static int kszphy_config_init(struct phy_device *phydev)
+{
+	return 0;
+}
+
+static struct phy_driver ks8001_driver = {
+	.phy_id		= PHY_ID_KS8001,
+	.name		= "Micrel KS8001",
+	.phy_id_mask	= 0x00fffff0,
+	.features	= PHY_BASIC_FEATURES,
+	.flags		= PHY_POLL,
+	.config_init	= kszphy_config_init,
+	.config_aneg	= genphy_config_aneg,
+	.read_status	= genphy_read_status,
+	.driver		= { .owner = THIS_MODULE,},
+};
+
+static struct phy_driver vsc8201_driver = {
+	.phy_id		= PHY_ID_VSC8201,
+	.name		= "Micrel VSC8201",
+	.phy_id_mask	= 0x00fffff0,
+	.features	= PHY_BASIC_FEATURES,
+	.flags		= PHY_POLL,
+	.config_init	= kszphy_config_init,
+	.config_aneg	= genphy_config_aneg,
+	.read_status	= genphy_read_status,
+	.driver		= { .owner = THIS_MODULE,},
+};
+
+static struct phy_driver ksz9021_driver = {
+	.phy_id		= PHY_ID_KSZ9021,
+	.phy_id_mask	= 0x000fff10,
+	.name		= "Micrel KSZ9021 Gigabit PHY",
+	.features	= PHY_GBIT_FEATURES | SUPPORTED_Pause,
+	.flags		= PHY_POLL,
+	.config_init	= kszphy_config_init,
+	.config_aneg	= genphy_config_aneg,
+	.read_status	= genphy_read_status,
+	.driver		= { .owner = THIS_MODULE, },
+};
+
+static int __init ksphy_init(void)
+{
+	int ret;
+
+	ret = phy_driver_register(&ks8001_driver);
+	if (ret)
+		goto err1;
+	ret = phy_driver_register(&vsc8201_driver);
+	if (ret)
+		goto err2;
+
+	ret = phy_driver_register(&ksz9021_driver);
+	if (ret)
+		goto err3;
+	return 0;
+
+err3:
+	phy_driver_unregister(&vsc8201_driver);
+err2:
+	phy_driver_unregister(&ks8001_driver);
+err1:
+	return ret;
+}
+
+static void __exit ksphy_exit(void)
+{
+	phy_driver_unregister(&ks8001_driver);
+	phy_driver_unregister(&vsc8201_driver);
+	phy_driver_unregister(&ksz9021_driver);
+}
+
+module_init(ksphy_init);
+module_exit(ksphy_exit);
+
+MODULE_DESCRIPTION("Micrel PHY driver");
+MODULE_AUTHOR("David J. Choi");
+MODULE_LICENSE("GPL");
--- linux-2.6.34-rc5/drivers/net/phy/Kconfig.orig	2010-04-29 08:21:12.000000000 -0700
+++ linux-2.6.34-rc5/drivers/net/phy/Kconfig	2010-04-29 08:25:18.000000000 -0700
@@ -88,6 +88,11 @@ config LSI_ET1011C_PHY
 	---help---
 	  Supports the LSI ET1011C PHY.
 
+config MICREL_PHY
+	tristate "Driver for Micrel PHYs"
+	---help---
+	  Supports the KSZ9021, VSC8201, KS8001 PHYs.
+
 config FIXED_PHY
 	bool "Driver for MDIO Bus/PHY emulation with fixed speed/link PHYs"
 	depends on PHYLIB=y
--- linux-2.6.34-rc5/drivers/net/phy/Makefile.orig	2010-04-29 08:20:25.000000000 -0700
+++ linux-2.6.34-rc5/drivers/net/phy/Makefile	2010-04-29 08:31:13.000000000 -0700
@@ -20,4 +20,5 @@ obj-$(CONFIG_MDIO_BITBANG)	+= mdio-bitba
 obj-$(CONFIG_MDIO_GPIO)		+= mdio-gpio.o
 obj-$(CONFIG_NATIONAL_PHY)	+= national.o
 obj-$(CONFIG_STE10XP)		+= ste10Xp.o
+obj-$(CONFIG_MICREL_PHY)	+= micrel.o
 obj-$(CONFIG_MDIO_OCTEON)	+= mdio-octeon.o

---


* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Arnd Bergmann @ 2010-05-01 12:36 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <C8008CCC.2D21E%scofeldm@cisco.com>

On Friday 30 April 2010, Scott Feldman wrote:
> >    ip iov set  port-profile DEVICE [ base BASE-DEVICE ] name PORT-PROFILE
> >                             [ host_uuid HOST_UUID ]
> >                             [ client_name CLIENT_NAME ]
> >                             [ client_uuid CLIENT_UUID ]
> >    ip iov set  vsi { associate | pre-associate | pre-associate-rr } BASE-DEVICE
> >                             vsi MGR:VTID:VER
> >                             mac LLADDR [ vlan VID ]
> >                             client_uuid CLIENT_UUID
> > 
> >    ip iov del  port_profile DEVICE      [ base BASE-DEVICE ]
> >    ip iov del  vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
> >                             [ client_uuid CLIENT_UUID ]
> > 
> >    ip iov show port_profile DEVICE      [ base BASE-DEVICE ]
> >    ip iov show vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
> >                             [ client_uuid CLIENT_UUID ]
> > 
> > You would obviously only implement the kernel support for the port-profile
> > stuff as callbacks, because no driver yet does VDP in the kernel, but we
> > should have a common netlink header that defines both variants.
> > 
> > Chris, any opinion on this interface as opposed to the combined one?
> > Either one should work, but splitting it seems cleaner to me.
> 
> I haven't seen Chris's response, but it seems vger was down for awhile, so
> maybe it's coming.  Assuming we go for the split design, we're still talking
> about using RTM_SETLINK/RTM_GETLINK/RTM_DELLINK for these netlink msgs?  Or
> are you suggesting by your cmd syntax that we return to
> RTM_SETIOV/RTM_GETIOV like in the first iovnl patch?  RTM_SET/GET/DELLINK is
> probably the simpler, cleaner patch.

In either case (split or combined), I would prefer the separate IOV
commands. The reason for this is that when support is not in the kernel,
it allows a cleaner separation between what's (always) handled in the
kernel and what's (potentially) done in user space.

	Arnd


* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Paul LeoNerd Evans @ 2010-05-01 12:06 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: therbert
In-Reply-To: <20100430.164115.257514715.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1455 bytes --]

On Fri, Apr 30, 2010 at 04:41:15PM -0700, David Miller wrote:
> If other people have an opinion about this, now would be the time
> to speak up. :-)

I have to say I agree with David.

The "receive timestamp" for a TCP recv() call is completely meaningless.
Each byte in the stream arguably could have a set of receive timestamps,
being the timestamp of the underlying IPv4 packet containing a fragment
of a TCP segment that covered that byte. One recv() call could cover
many packets, many recv() calls could be required to consume one packet.
We just don't know from userland.

The point about IPv4 fragments in UDP is a reasonable one; that because
of IPv4 fragmentation there are still potentially multiple timestamps
that could be relevant to a single UDP recv() call. But no two recv()
calls can possibly relate to the same IPv4 fragments, so I feel this is
more defined. Plus, of all the IPv4 fragments that go into a single UDP
packet, one of them is special - the first one, the one containing the
UDP header. We could easily say "the timestamp of a UDP recv() call
shall be the time at which its header was received, even if other
fragments arrived before or after it". 

We cannot make any such distinction for some window in a TCP stream. All
TCP segments are indistinct in this manner.
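
To make the UDP case concrete, here is a minimal sketch, assuming Linux and
a loopback socket, of how userland reads the single SO_TIMESTAMP value that
accompanies one datagram. recv_with_timestamp() and udp_timestamp_demo()
are hypothetical helper names used only for this illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef SCM_TIMESTAMP
#define SCM_TIMESTAMP SO_TIMESTAMP	/* same value on Linux */
#endif

/* Receive one datagram and copy its receive timestamp into *stamp.
 * Returns the payload length, or -1 on error or if no timestamp
 * control message was delivered. */
static ssize_t recv_with_timestamp(int fd, void *buf, size_t len,
				   struct timeval *stamp)
{
	union {
		char buf[CMSG_SPACE(sizeof(struct timeval))];
		struct cmsghdr align;	/* forces correct alignment */
	} ctrl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;
	struct cmsghdr *cmsg;
	ssize_t n;

	assert(stamp != NULL);
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	n = recvmsg(fd, &msg, 0);
	if (n < 0)
		return -1;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_TIMESTAMP) {
			memcpy(stamp, CMSG_DATA(cmsg), sizeof(*stamp));
			return n;
		}
	}
	return -1;
}

/* Loopback demo: one datagram, one receive call, one timestamp.
 * Returns 0 on success, -1 on any failure. */
static int udp_timestamp_demo(struct timeval *stamp)
{
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	char buf[16];
	int on = 1, ret = -1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on)) == 0 &&
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&addr, &alen) == 0 &&
	    sendto(fd, "ping", 4, 0, (struct sockaddr *)&addr, sizeof(addr)) == 4 &&
	    recv_with_timestamp(fd, buf, sizeof(buf), stamp) == 4)
		ret = 0;
	close(fd);
	return ret;
}
```

Each recvmsg() call here maps to exactly one datagram, which is why a
single timestamp per call is well defined for UDP. A TCP read has no such
one-to-one mapping.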

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]


* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 11:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272714179.2230.151.camel@edumazet-laptop>

On Sat, 2010-05-01 at 13:42 +0200, Eric Dumazet wrote:

> But the whole point of epoll is to not change the interest set each time
> you get an event.
> 
> Without EV_PERSIST, you need two more syscalls per recvfrom()
> 
> epoll_wait()
>  epoll_ctl(REMOVE)
>  epoll_ctl(ADD)
>  recvfrom()
> 
> Even poll() would be faster in your case
> 
> poll(one fd)
> recvfrom()
> 

This is true - but my goal was/is to replicate the regression i was
seeing[1]. 
I will try with PERSIST next opportunity. If it gets better
then it is something that needs documentation in the doc Tom
promised ;->

> I always thought copybreak was borderline...
> It can help to reduce memory footprint (allocating 128 bytes instead of
> 2048/4096 bytes per frame), but with RPS, it would make sense to perform
> copybreak after RPS, not before.
> 
> Reducing memory footprint also means fewer changes to
> udp_memory_allocated / tcp_memory_allocated (memory reclaim logic)

Indeed, something that didnt cross my mind in the rush to test - it is
one of those things that need to be mentioned in some doc somewhere.
Tom, are you listening? ;->

cheers,
jamal

[1]i.e with this program rps was getting worse (it was much better
before say net-next of apr14) and that non-rps has been getting better
numbers since. The regression is real - but it is likely in another
subsystem.



* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 11:42 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272713014.14499.21.camel@bigi>

On Saturday 01 May 2010 at 07:23 -0400, jamal wrote:
> On Sat, 2010-05-01 at 07:57 +0200, Eric Dumazet wrote:
> 
> > I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> > overhead for each packet...)
> 
> That's a different test case then ;-> You can also get rid of the timer
> (I doubt it will show much difference in results) - I have it in there
> because i am trying to replicate what i saw causing the regression.
> 
> > RPS off : 220.000 pps 
> > 
> > RPS on (ee mask) : 700.000 pps  (with a slightly modified tg3 driver)
> > 96% of delivered packets
> > 
> 
> That's a very very huge gap. What were the numbers before you changed to
> EV_PERSIST?

But the whole point of epoll is to not change the interest set each time
you get an event.

Without EV_PERSIST, you need two more syscalls per recvfrom()

epoll_wait()
 epoll_ctl(REMOVE)
 epoll_ctl(ADD)
 recvfrom()

Even poll() would be faster in your case

poll(one fd)
recvfrom()
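
A minimal sketch of the persistent variant, assuming raw epoll rather than
libevent (EV_PERSIST is a libevent flag; with plain epoll a registration
stays in place until it is removed, unless EPOLLONESHOT is requested).
persistent_epoll_demo() is a hypothetical name and the loopback setup is
only there to make the example self-contained.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <unistd.h>

/* Send `count` datagrams to a loopback UDP socket and drain them with a
 * single persistent epoll registration. Returns the number of datagrams
 * received; *ctl_calls reports how many epoll_ctl() calls were made. */
static int persistent_epoll_demo(int count, int *ctl_calls)
{
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	struct epoll_event ev, got;
	char buf[16];
	int received = 0, i;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int ep = epoll_create1(0);

	assert(ctl_calls != NULL);
	*ctl_calls = 0;
	if (fd < 0 || ep < 0)
		goto out;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &alen) < 0)
		goto out;

	/* Register interest once: level-triggered, no EPOLLONESHOT. */
	memset(&ev, 0, sizeof(ev));
	ev.events = EPOLLIN;
	ev.data.fd = fd;
	if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0)
		goto out;
	(*ctl_calls)++;

	for (i = 0; i < count; i++)
		if (sendto(fd, "ping", 4, 0,
			   (struct sockaddr *)&addr, sizeof(addr)) != 4)
			goto out;

	/* Two syscalls per packet: epoll_wait() + recv(). */
	while (received < count) {
		if (epoll_wait(ep, &got, 1, 1000) != 1)
			break;	/* timeout or error */
		if (recv(got.data.fd, buf, sizeof(buf), 0) != 4)
			break;
		received++;
	}
out:
	if (ep >= 0)
		close(ep);
	if (fd >= 0)
		close(fd);
	return received;
}
```

With the remove/add pattern the loop body would gain two epoll_ctl() calls
per datagram, which is the overhead being discussed.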



> Note: i did not add any of your other patches for dst refcnt, sockets
> etc. Were you running with those patches in these tests? I will try the
> next opportunity i get to have latest kernel + those patches. 
> 
> > This is on tg3 adapter, and tg3 has copybreak feature : small packets
> > are copied into skb of the right size.
> 
> Ok, so the driver tuning is also important then (and it shows in the
> profile).

I always thought copybreak was borderline...

It can help to reduce memory footprint (allocating 128 bytes instead of
2048/4096 bytes per frame), but with RPS, it would make sense to perform
copybreak after RPS, not before.

Reducing memory footprint also means fewer changes to
udp_memory_allocated / tcp_memory_allocated (memory reclaim logic)
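
As a rough userspace model of the copybreak idea (not the tg3 code), the
sketch below copies frames up to a threshold into a right-sized buffer so
the large receive buffer can stay in the ring. struct rx_result,
rx_copybreak() and the threshold parameter are hypothetical names chosen
for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct rx_result {
	unsigned char *data;	/* buffer handed up the stack */
	size_t alloc_size;	/* bytes that buffer occupies */
	int ring_buf_reused;	/* 1 if the big ring buffer stays in the ring */
};

/* Toy copybreak: frames of at most `threshold` bytes are copied into a
 * buffer of exactly frame_len bytes; larger frames hand the ring buffer
 * itself upward, so the ring needs a fresh buffer. */
static struct rx_result rx_copybreak(unsigned char *ring_buf, size_t ring_size,
				     size_t frame_len, size_t threshold)
{
	struct rx_result r = { ring_buf, ring_size, 0 };

	assert(frame_len <= ring_size);
	if (frame_len <= threshold) {
		unsigned char *small = malloc(frame_len ? frame_len : 1);

		if (small) {
			memcpy(small, ring_buf, frame_len);
			r.data = small;
			r.alloc_size = frame_len;
			r.ring_buf_reused = 1;
		}
		/* on allocation failure fall back to passing the ring buffer */
	}
	return r;
}
```

The copy costs CPU on whichever core runs it, which is the reason for
preferring it after RPS has steered the packet rather than before.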





* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 11:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272694442.2230.86.camel@edumazet-laptop>

On Sat, 2010-05-01 at 08:14 +0200, Eric Dumazet wrote:

> BTW, using ee mask, cpu4 is not used at _all_, even for the user
> threads. Scheduler does a bad job IMHO.

I have the opposite frustration ;->
I did notice it got used. My goal was to totally avoid using it, for
simple reason it is an SMT thread that shares same core as cpu0.
In retrospect i should probably set irq affinity then to cpu0 and 4.

> Using fe mask, I get all packets (sent at 733311pps by my pktgen
> machine), and my CPU0 even has idle time !!!

I will try this next time i get the chance.

cheers,
jamal



* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 11:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272693424.2230.75.camel@edumazet-laptop>

On Sat, 2010-05-01 at 07:57 +0200, Eric Dumazet wrote:

> I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> overhead for each packet...)

That's a different test case then ;-> You can also get rid of the timer
(I doubt it will show much difference in results) - I have it in there
because i am trying to replicate what i saw causing the regression.

> RPS off : 220.000 pps 
> 
> RPS on (ee mask) : 700.000 pps  (with a slightly modified tg3 driver)
> 96% of delivered packets
> 

That's a very very huge gap. What were the numbers before you changed to
EV_PERSIST?
Note: i did not add any of your other patches for dst refcnt, sockets
etc. Were you running with those patches in these tests? I will try the
next opportunity i get to have latest kernel + those patches. 

> This is on tg3 adapter, and tg3 has copybreak feature : small packets
> are copied into skb of the right size.

Ok, so the driver tuning is also important then (and it shows in the
profile).

cheers,
jamal



* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-01 11:00 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <20100430.163857.180417789.davem@davemloft.net>

On Fri, Apr 30, 2010 at 04:38:57PM -0700, David Miller wrote:
> From: Andi Kleen <ak@gargoyle.fritz.box>
> Date: Thu, 29 Apr 2010 23:41:44 +0200
> 
> >     Use io_schedule() in network stack to tell cpuidle governor to guarantee lower latencies
> > 
> >     XXX: probably too aggressive, some of these sleeps are not under high load.
> > 
> >     Based on a bug report from Eric Dumazet.
> >     
> >     Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> I like this, except that we probably don't want the delayacct_blkio_*() calls
> these things do.

Yes.

It needs more work, please don't apply it yet, to handle the "long sleep" case.

Still curious if it fixes Eric's test case.

> 
> Probably the rest of what these things do should remain in the io_schedule*()
> functions and the block layer can call its own versions which add in the
> delayacct_blkio_*() bits.

Good point.

> 
> Or, if the delayacct stuff is useful for socket I/O too, then its interface
> names should have the "blk" stripped from them :-)

Good question. I suspect it's actually useful for some cases, but just adding
sockets might confuse some users.

-Andi


* Re: OFT - reserving CPU's for networking
From: Andi Kleen @ 2010-05-01 10:53 UTC (permalink / raw)
  To: David Miller; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100430.153038.62351857.davem@davemloft.net>

> And we don't want it to, because the decision mechanisms for steering
> that we are using now are starting to get into the stateful territory and
> that's verboten for NIC offload as far as we're concerned.

Huh? I thought full TCP offload was forbidden?[1] Stateful as in NIC
(or someone else like netfilter) tracking flows is quite common and very far
from full offload. AFAIK it doesn't have nearly all the problems full
offload has.

-Andi

[1] although it seems to leak in more and more through the RDMA backdoor.


* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 10:47 UTC (permalink / raw)
  To: Changli Gao
  Cc: hadi, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <o2u412e6f7f1005010324sfb63393fo86acdff4c97c5be3@mail.gmail.com>

On Saturday 01 May 2010 at 18:24 +0800, Changli Gao wrote:
> On Sat, May 1, 2010 at 2:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > BTW, using ee mask, cpu4 is not used at _all_, even for the user
> > threads. Scheduler does a bad job IMHO.
> >
> > Using fe mask, I get all packets (sent at 733311pps by my pktgen
> > machine), and my CPU0 even has idle time !!!
> >
> > Limit seems to be around 800.000 pps
> >
> > ------------------------------------------------------------------------------------------------------------------------
> >   PerfTop:    5616 irqs/sec  kernel:93.9% [1000Hz cycles],  (all, 8 CPUs)
> > ------------------------------------------------------------------------------------------------------------------------
> >
> 
> Oh, cpu0 usage is about 100-(100-93.9)*8 = 51.2% (am I right?). If we
> could do weighted packet distribution, with cpu0's weight 1 and the other
> cpus' weight 2, maybe we could utilize all the cpu power.
> 

Nope, cpu0 was at 100% in this test, other cpus were about at 50% each.

Weighted distribution would be ok if I wanted to use cpu0 as one of the
'slave' cpus (RPS targets). But given the workload I am interested in, and
the need to resist DDoS, I want to keep cpu0 outside of the IP/TCP/UDP stack.


Later, inlining skb_pull() in eth_type_trans() made it possible to reach
840.000 pps.

top - 12:42:55 up  3:00,  2 users,  load average: 0.44, 0.11, 0.03
Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.2%us, 16.5%sy,  0.0%ni, 46.5%id, 11.4%wa,  0.9%hi, 22.5%si,  0.0%st
Mem:   4148112k total,   211152k used,  3936960k free,    15228k buffers
Swap:  4192928k total,        0k used,  4192928k free,   121804k cached

You can see an average idle of 46%.
So there are probably more optimizations to do, to reach maybe 1.300.000
pps ;)
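
As a rough cross-check of these figures, the sketch below averages
per-cpu load for the earlier case of cpu0 at 100% and seven other cpus at
about 50% each. average_busy_pct() is a hypothetical helper and the inputs
are the approximate values quoted in this message, not new measurements.

```c
#include <assert.h>

/* Mean of per-cpu busy percentages, which is what a system-wide
 * summary line effectively reports. */
static double average_busy_pct(const double *busy, int ncpus)
{
	double sum = 0.0;
	int i;

	assert(ncpus > 0);
	for (i = 0; i < ncpus; i++)
		sum += busy[i];
	return sum / ncpus;
}
```

With { 100, 50, 50, 50, 50, 50, 50, 50 } this gives (100 + 7 * 50) / 8 =
56.25% busy, i.e. 43.75% idle on average, the same ballpark as the 46.5%id
shown by top even though cpu0 itself has no headroom left.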





* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Changli Gao @ 2010-05-01 10:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272694442.2230.86.camel@edumazet-laptop>

On Sat, May 1, 2010 at 2:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> BTW, using ee mask, cpu4 is not used at _all_, even for the user
> threads. Scheduler does a bad job IMHO.
>
> Using fe mask, I get all packets (sent at 733311pps by my pktgen
> machine), and my CPU0 even has idle time !!!
>
> Limit seems to be around 800.000 pps
>
> ------------------------------------------------------------------------------------------------------------------------
>   PerfTop:    5616 irqs/sec  kernel:93.9% [1000Hz cycles],  (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------------------
>

Oh, cpu0 usage is about 100-(100-93.9)*8 = 51.2% (am I right?). If we
could do weighted packet distribution, with cpu0's weight 1 and the other
cpus' weight 2, maybe we could utilize all the cpu power.
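
A minimal sketch of what such a weighted choice could look like, written
as plain userspace C. This is not how RPS selects a cpu today;
weighted_cpu() and the weight array are hypothetical and only illustrate
the 1:2 weighting idea.

```c
#include <assert.h>
#include <stdint.h>

/* Map a flow hash to a cpu so that each cpu receives a share of the hash
 * space proportional to its weight. Returns -1 if all weights are zero. */
static int weighted_cpu(uint32_t hash, const unsigned int *weight, int ncpus)
{
	unsigned int total = 0, slot;
	int cpu;

	assert(weight != 0);
	for (cpu = 0; cpu < ncpus; cpu++)
		total += weight[cpu];
	if (total == 0)
		return -1;

	/* Walk the cumulative weights until the slot falls inside one. */
	slot = hash % total;
	for (cpu = 0; cpu < ncpus; cpu++) {
		if (slot < weight[cpu])
			return cpu;
		slot -= weight[cpu];
	}
	return -1;	/* not reached when total > 0 */
}
```

With weights { 1, 2, 2, 2, 2, 2, 2, 2 } the hash space splits into 15
slots, one for cpu0 and two for each other cpu.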

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Eric Dumazet @ 2010-05-01  8:03 UTC (permalink / raw)
  To: David Miller; +Cc: hadi, xiaosuo, therbert, shemminger, netdev, eilong, bmb
In-Reply-To: <1272697367.2230.106.camel@edumazet-laptop>

On Saturday 01 May 2010 at 09:02 +0200, Eric Dumazet wrote:
> On Friday 30 April 2010 at 16:35 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 29 Apr 2010 23:01:49 +0200
> > 
> > > [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
> > 
> > So what's the difference between call_rcu() freeing this little waitqueue
> > struct and doing it for the entire socket?
> > 
> > We'll still be doing an RCU call every socket destroy, and now we also have
> > a new memory allocation/free per connection.
> > 
> > This has to show up in things like 'lat_connect' and friends, does it not?
> 
> Before patch :
> 
> lat_connect -N 10 127.0.0.1
> TCP/IP connection cost to 127.0.0.1: 27.8872 microseconds
> 
> After :
> 
> lat_connect -N 10 127.0.0.1
> TCP/IP connection cost to 127.0.0.1: 20.7681 microseconds
> 
> Strange, isn't it?
> 
> (special care should be taken with this bench, as it leaves many sockets
> in TIME_WAIT state, so to get consistent numbers we have to wait a while
> before restarting it)


Oops, this was with the other patch (about dst no_refcounting in input
path), sorry.

With the "sock_def_readable() and friends RCU conversion" patch I got :

lat_connect -N 10 127.0.0.1
TCP/IP connection cost to 127.0.0.1: 27.6244 microseconds


Anyway, this lat_connect seems very unreliable (a lot of variance)

with linux-2.6.31, ~33 us
with linux-2.6.33, ~30 us

David, I also need this RCU thing in order to be able to group all
wakeups at the end of net_rx_action().

Plan was to use RCU, so that I dont need to increase sk_refcnt when
queueing a "wakeup" (and decrease sk_refcnt a long time after)

Previous attempt was a bit hacky,
http://patchwork.ozlabs.org/patch/24179/

I expect 2010 one will be cleaner :)




* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Eric Dumazet @ 2010-05-01  7:02 UTC (permalink / raw)
  To: David Miller; +Cc: hadi, xiaosuo, therbert, shemminger, netdev, eilong, bmb
In-Reply-To: <20100430.163519.133415203.davem@davemloft.net>

On Friday 30 April 2010 at 16:35 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 29 Apr 2010 23:01:49 +0200
> 
> > [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
> 
> So what's the difference between call_rcu() freeing this little waitqueue
> struct and doing it for the entire socket?
> 
> We'll still be doing an RCU call every socket destroy, and now we also have
> a new memory allocation/free per connection.
> 
> This has to show up in things like 'lat_connect' and friends, does it not?

Before patch :

lat_connect -N 10 127.0.0.1
TCP/IP connection cost to 127.0.0.1: 27.8872 microseconds

After :

lat_connect -N 10 127.0.0.1
TCP/IP connection cost to 127.0.0.1: 20.7681 microseconds

Strange, isn't it?

(special care should be taken with this bench, as it leaves many sockets
in TIME_WAIT state, so to get consistent numbers we have to wait a while
before restarting it)





* [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
From: Eric Dumazet @ 2010-05-01  6:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tom Herbert, jamal

840.000 pps instead of 800.000 pps on my 'old' machine, using RPS

Before patch, profile of CPU 0 (handling tg3 interrupts)

             2167.00 13.9% __alloc_skb            vmlinux
             1908.00 12.3% eth_type_trans         vmlinux
             1125.00  7.2% __kmalloc_track_caller vmlinux
              981.00  6.3% __netdev_alloc_skb     vmlinux
              925.00  5.9% _raw_spin_lock         vmlinux
              786.00  5.1% kmem_cache_alloc       vmlinux
              757.00  4.9% skb_pull               vmlinux
              698.00  4.5% tg3_read32             vmlinux
              637.00  4.1% __slab_alloc           vmlinux
              620.00  4.0% tg3_poll_work          vmlinux
              576.00  3.7% get_rps_cpu            vmlinux
              448.00  2.9% bnx2_interrupt         vmlinux

After (no more skb_pull, and eth_type_trans() not more expensive)
Predominant cost is memory allocator...

             1625.00 12.4% eth_type_trans         vmlinux
             1468.00 11.2% __alloc_skb            vmlinux
             1004.00  7.6% __kmalloc_track_caller vmlinux
              893.00  6.8% _raw_spin_lock         vmlinux
              738.00  5.6% __netdev_alloc_skb     vmlinux
              665.00  5.1% tg3_read32             vmlinux
              656.00  5.0% kmem_cache_alloc       vmlinux
              655.00  5.0% __slab_alloc           vmlinux
              509.00  3.9% bnx2_interrupt         vmlinux
              483.00  3.7% tg3_poll_work          vmlinux
              455.00  3.5% _raw_spin_lock_irqsave vmlinux
              330.00  2.5% get_rps_cpu            vmlinux
              286.00  2.2% nommu_map_page         vmlinux
              277.00  2.1% enqueue_to_backlog     vmlinux
              235.00  1.8% inet_gro_receive       vmlinux
              232.00  1.8% __copy_to_user_ll      vmlinux
              181.00  1.4% dev_gro_receive        vmlinux
              165.00  1.3% skb_gro_reset_offset   vmlinux

(bnx2_interrupt is called, because irq 16 is shared on this machine on two nics...)

Thanks !

[PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()

With RPS, this patch can give a 5 % boost in performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0c0d272..763524b 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -162,7 +162,8 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
 
 	skb->dev = dev;
 	skb_reset_mac_header(skb);
-	skb_pull(skb, ETH_HLEN);
+	if (likely(skb->len >= ETH_HLEN))
+		__skb_pull(skb, ETH_HLEN);
 	eth = eth_hdr(skb);
 
 	if (unlikely(is_multicast_ether_addr(eth->h_dest))) {
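
For readers less familiar with the two helpers, the following is a toy
userspace model of the change, not kernel code. skb_pull() is an
out-of-line function that checks the length, while __skb_pull() is an
inline helper that trusts the caller. struct toy_skb and the toy_*
functions are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14	/* Ethernet header length, same value as ETH_HLEN */

struct toy_skb {
	unsigned char *data;
	unsigned int len;
};

/* Unchecked pull, modelled on __skb_pull(): the caller guarantees that
 * len bytes are available. */
static unsigned char *toy_pull_unchecked(struct toy_skb *skb, unsigned int len)
{
	skb->len -= len;
	skb->data += len;
	return skb->data;
}

/* Checked pull, modelled on skb_pull(): refuses to pull past the end. */
static unsigned char *toy_pull(struct toy_skb *skb, unsigned int len)
{
	if (len > skb->len)
		return NULL;
	return toy_pull_unchecked(skb, len);
}

/* Same shape as the patched eth_type_trans(): one inline length test
 * guarding the unchecked helper. Returns 1 if the header was pulled. */
static int toy_strip_eth_header(struct toy_skb *skb)
{
	assert(skb != NULL);
	if (skb->len >= TOY_ETH_HLEN) {
		toy_pull_unchecked(skb, TOY_ETH_HLEN);
		return 1;
	}
	return 0;
}
```

The length check is still performed, it just happens in the caller, so the
function call disappears from the hot path while short frames stay
untouched as before.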




* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01  6:14 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272693424.2230.75.camel@edumazet-laptop>

On Saturday 01 May 2010 at 07:57 +0200, Eric Dumazet wrote:
> On Friday 30 April 2010 at 20:06 -0400, jamal wrote:
> 
> > Yes, Nehalem. 
> > RPS off is better (~700Kpps) than RPS on (~650kpps). Are you seeing the
> > same trend on the old hardware?
> > 
> 
> Of course not ! Or else RPS would be useless :(
> 
> I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> overhead for each packet...)
> 
> RPS off : 220.000 pps 
> 
> RPS on (ee mask) : 700.000 pps  (with a slightly modified tg3 driver)
> 96% of delivered packets

BTW, using ee mask, cpu4 is not used at _all_, even for the user
threads. Scheduler does a bad job IMHO.

Using fe mask, I get all packets (sent at 733311pps by my pktgen
machine), and my CPU0 even has idle time !!!

Limit seems to be around 800.000 pps
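
For reference, a small sketch that decodes the two masks, assuming "ee"
and "fe" are the hexadecimal cpu bitmaps written to the per-queue rps_cpus
file, where bit N selects cpuN. mask_has_cpu() and mask_cpu_count() are
hypothetical helper names.

```c
#include <assert.h>

/* Return 1 if bit `cpu` is set in the mask, 0 otherwise. */
static int mask_has_cpu(unsigned long mask, int cpu)
{
	assert(cpu >= 0 && cpu < (int)(sizeof(mask) * 8));
	return (int)((mask >> cpu) & 1UL);
}

/* Count how many cpus a mask selects. */
static int mask_cpu_count(unsigned long mask)
{
	int n = 0;

	for (; mask; mask >>= 1)
		n += (int)(mask & 1UL);
	return n;
}
```

Under that assumption 0xee (binary 11101110) selects six cpus and leaves
out cpu0 and cpu4, while 0xfe (binary 11111110) selects seven cpus and
leaves out only cpu0.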

------------------------------------------------------------------------------------------------------------------------
   PerfTop:    5616 irqs/sec  kernel:93.9% [1000Hz cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _______

             3492.00  6.2% __slab_free                 vmlinux
             2334.00  4.2% _raw_spin_lock              vmlinux
             2314.00  4.1% _raw_spin_lock_irqsave      vmlinux
             1807.00  3.2% ip_rcv                      vmlinux
             1605.00  2.9% schedule                    vmlinux
             1474.00  2.6% __netif_receive_skb         vmlinux
             1464.00  2.6% kfree                       vmlinux
             1405.00  2.5% ip_route_input              vmlinux
             1318.00  2.4% __copy_to_user_ll           vmlinux
             1214.00  2.2% __alloc_skb                 vmlinux
             1160.00  2.1% nf_hook_slow                vmlinux
             1020.00  1.8% eth_type_trans              vmlinux
              860.00  1.5% sched_clock_local           vmlinux
              775.00  1.4% read_tsc                    vmlinux
              773.00  1.4% ipt_do_table                vmlinux
              766.00  1.4% _raw_spin_unlock_irqrestore vmlinux
              748.00  1.3% sock_recv_ts_and_drops      vmlinux
              747.00  1.3% ia32_sysenter_target        vmlinux
              740.00  1.3% select_nohz_load_balancer   vmlinux
              644.00  1.2% __kmalloc_track_caller      vmlinux
              596.00  1.1% tg3_read32                  vmlinux
              566.00  1.0% __udp4_lib_lookup           vmlinux





^ permalink raw reply

* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Eric Dumazet @ 2010-05-01  6:00 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Bill Fink, David Miller, netdev
In-Reply-To: <AANLkTimFjDNgXdMGpVpO7Gi38ROlww1Fa7IKA1ASBKOV@mail.gmail.com>

On Friday 30 April 2010 at 22:40 -0700, Tom Herbert wrote:
> > Not being a kernel hacker, I will naively ask if the kernel tracing
> > facility could somehow be used to provide the desired info (or could
> > be modified to provide it).
> >
> 
> We did consider kernel tracing (more in the context of implementing
> RFC 4898).  In the case of trying to get per-packet timestamps, the
> cost of correlating a ktrace event with an application message is
> probably too high to make it practical.  If it weren't for the cost of
> timestamping every single skb being received, we'd probably have
> SO_TIMESTAMP turned on permanently for many connections.  For now
> we're settling for sampling a percentage of messages.

Tom, did you try to reuse the existing skb or sk tstamps?




^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01  5:57 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272672394.14499.1.camel@bigi>

On Friday 30 April 2010 at 20:06 -0400, jamal wrote:

> Yes, Nehalem. 
> RPS off is better (~700kpps) than RPS on (~650kpps). Are you seeing the
> same trend on the old hardware?
> 

Of course not ! Or else RPS would be useless :(

I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
overhead for each packet...)

RPS off : 220.000 pps 

RPS on (ee mask) : 700.000 pps  (with a slightly modified tg3 driver)
96% of delivered packets

This is on a tg3 adapter, and tg3 has a copybreak feature: small packets
are copied into an skb of the right size.

#define TG3_RX_COPY_THRESHOLD       256 -> 40 ...

We really should disable this feature for RPS workloads;
unfortunately ethtool cannot tweak this.
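
To make the copybreak idea concrete, here is a minimal userspace sketch. It
is not the tg3 code: the names rx_copybreak, struct rx_result, RX_BUF_SIZE
and RX_COPY_THRESHOLD are assumptions chosen for illustration. A frame at or
below the threshold is copied into a right-sized buffer and the large ring
buffer is reused; a larger frame is handed up as-is and the ring slot gets a
freshly allocated replacement.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RX_BUF_SIZE        1536	/* assumed size of a preallocated rx buffer */
#define RX_COPY_THRESHOLD   256	/* frames at or below this size are copied */

struct rx_result {
	unsigned char *data;	/* buffer handed up the stack */
	size_t len;		/* frame length in bytes */
	int copied;		/* 1 if copied, 0 if the ring buffer was handed up */
};

/*
 * Illustrative copybreak decision. 'ring_buf' points at the ring slot that
 * holds the driver's preallocated buffer containing the received frame.
 * Returns 0 on success, -1 if an allocation fails.
 */
static int rx_copybreak(unsigned char **ring_buf, size_t len,
			struct rx_result *res)
{
	unsigned char *fresh;

	assert(len <= RX_BUF_SIZE);
	if (len <= RX_COPY_THRESHOLD) {
		/* Small frame: copy it out, keep the big buffer in the ring. */
		fresh = malloc(len ? len : 1);
		if (!fresh)
			return -1;
		memcpy(fresh, *ring_buf, len);
		res->data = fresh;
		res->copied = 1;
	} else {
		/* Large frame: hand the buffer up, refill the ring slot. */
		fresh = malloc(RX_BUF_SIZE);
		if (!fresh)
			return -1;
		res->data = *ring_buf;
		*ring_buf = fresh;
		res->copied = 0;
	}
	res->len = len;
	return 0;
}
```

With the threshold lowered to 40, as in the #define above, only frames of at
most 40 bytes would take the copy path in this sketch.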

So the profile of cpu 0 (RPS on) looks like:

------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:99.7% [1000Hz cycles],  (all, cpu: 0)
------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ _______

              819.00 12.6% __alloc_skb            vmlinux
              592.00  9.1% eth_type_trans         vmlinux
              509.00  7.8% _raw_spin_lock         vmlinux
              475.00  7.3% __kmalloc_track_caller vmlinux
              358.00  5.5% tg3_read32             vmlinux
              345.00  5.3% __netdev_alloc_skb     vmlinux
              329.00  5.0% kmem_cache_alloc       vmlinux
              307.00  4.7% _raw_spin_lock_irqsave vmlinux
              284.00  4.4% bnx2_interrupt         vmlinux
              277.00  4.2% skb_pull               vmlinux
              248.00  3.8% tg3_poll_work          vmlinux
              202.00  3.1% __slab_alloc           vmlinux
              197.00  3.0% get_rps_cpu            vmlinux
              106.00  1.6% enqueue_to_backlog     vmlinux
               87.00  1.3% _raw_spin_lock_bh      vmlinux
               80.00  1.2% __copy_to_user_ll      vmlinux
               77.00  1.2% nommu_map_page         vmlinux
               77.00  1.2% __napi_gro_receive     vmlinux
               65.00  1.0% tg3_alloc_rx_skb       vmlinux
               60.00  0.9% skb_gro_reset_offset   vmlinux
               57.00  0.9% skb_put                vmlinux
               57.00  0.9% __slab_free            vmlinux


/*
 *  Usage: udpsnkfrk [-c] [-v] [-p baseport] nbports
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <event.h>
#include <pthread.h>	/* pthread_t, pthread_create() */

struct worker_data {
	struct event *snk_ev;
	struct event_base *base;
	struct timeval t;
	unsigned long pack_count;
	unsigned long bytes_count;
	unsigned long tout;
	int fd;			/* move to avoid hole on 64-bit */
	int pad1;	
	unsigned long _padd[99]; /* avoid false sharing */
};

void usage(int code)
{
	fprintf(stderr, "Usage: udpsink [-c] [-v] [-p baseport] nbports\n");
	exit(code);
}

void process_recv(int fd, short ev, void *arg)
{
	char buffer[4096];
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	struct worker_data *wdata = (struct worker_data *)arg;
	int lu = 0;


	if (ev == EV_TIMEOUT) {
		wdata->tout++;
		if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
			perror("cb event_add");
			return;
		}
	} else {
		do {
			/* len is a value-result argument: reset it before each call */
			len = sizeof(addr);
			lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0,
			      (struct sockaddr *)&addr, &len);
			if (lu > 0) {
				wdata->pack_count++;
				wdata->bytes_count += lu;
			}
		} while (lu > 0);
	}
}

int prep_thread(struct worker_data *wdata)
{
	wdata->t.tv_sec = 1;
	wdata->t.tv_usec = random() % 50000L;

	wdata->base = event_init();
	event_set(wdata->snk_ev, wdata->fd, EV_READ|EV_PERSIST, process_recv, wdata);
	event_base_set(wdata->base, wdata->snk_ev);
	if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
		perror("event_add");
		return -1;
	}
	return 0;
}

void *worker_func(void *arg)
{
	struct worker_data *wdata = (struct worker_data *)arg;

	/* go through long to avoid an int-to-pointer size mismatch on 64-bit */
	return (void *)(long)event_base_loop(wdata->base, 0);
}

int main(int argc, char *argv[])
{
	int c;
	int baseport = 4000;
	int nbthreads;
	struct worker_data *wdata;
	unsigned long ototal = 0;
	int concurrent = 0;
	int verbose = 0;
	int i;
	while ((c = getopt(argc, argv, "cvp:")) != -1) {
		if (c == 'p')
			baseport = atoi(optarg);
		else if (c == 'c')
			concurrent = 1;
		else if (c == 'v')
			verbose++;
		else
			usage(1);
	}
	if (optind == argc)
		usage(1);
	nbthreads = atoi(argv[optind]);
	wdata = calloc(nbthreads, sizeof(struct worker_data));
	if (!wdata) {
		perror("calloc");
		return 1;
	}

	for (i = 0; i < nbthreads; i++) {
		struct sockaddr_in addr;
		pthread_t tid;

		/* every thread needs its own event, even when sharing a socket */
		wdata[i].snk_ev = malloc(sizeof(struct event));
		if (!wdata[i].snk_ev)
			return 1;
		memset(wdata[i].snk_ev, 0, sizeof(struct event));

		if (i && concurrent) {
			wdata[i].fd = wdata[0].fd;
		} else {
			wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
			if (wdata[i].fd == -1) {
				free(wdata[i].snk_ev);
				perror("socket");
				return 1;
			}
			memset(&addr, 0, sizeof(addr));
			addr.sin_family = AF_INET;
//                      addr.sin_addr.s_addr = inet_addr(argv[optind]);
			addr.sin_port = htons(baseport + i);
			if (bind
			    (wdata[i].fd, (struct sockaddr *)&addr,
			     sizeof(addr)) < 0) {
				free(wdata[i].snk_ev);
				perror("bind");
				return 1;
			}
			/* non-blocking, so the recvfrom() loop stops on an empty queue */
			fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
		}
		if (prep_thread(wdata + i)) {
			printf("failed to prepare thread %d, exit\n", i);
			exit(1);
		}
		if (pthread_create(&tid, NULL, worker_func, wdata + i)) {
			printf("failed to create thread %d, exit\n", i);
			exit(1);
		}
	}

	for (;;) {
		unsigned long total;
		long delta;

		sleep(1);
		total = 0;
		for (i = 0; i < nbthreads; i++) {
			total += wdata[i].pack_count;
		}
		delta = total - ototal;
		if (delta) {
			printf("%lu pps (%lu", delta, total);
			if (verbose) {
				for (i = 0; i < nbthreads; i++) {
					if (wdata[i].pack_count)
						printf(" %d:%lu", i,
						       wdata[i].pack_count);
				}
			}
			printf(")\n");
		}
		ototal = total;
	}
}




^ permalink raw reply

* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Tom Herbert @ 2010-05-01  5:40 UTC (permalink / raw)
  To: Bill Fink; +Cc: David Miller, netdev
In-Reply-To: <20100501010735.dfe097bc.billfink@mindspring.com>

> Not being a kernel hacker, I will naively ask if the kernel tracing
> facility could somehow be used to provide the desired info (or could
> be modified to provide it).
>

We did consider kernel tracing (more in the context of implementing
RFC 4898).  In the case of trying to get per-packet timestamps, the
cost of correlating a ktrace event with an application message is
probably too high to make it practical.  If it weren't for the cost of
timestamping every single skb being received, we'd probably have
SO_TIMESTAMP turned on permanently for many connections.  For now
we're settling for sampling a percentage of messages.

Tom

^ permalink raw reply

* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Tom Herbert @ 2010-05-01  5:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100430.164115.257514715.davem@davemloft.net>

>> I don't see a nice way to do that; we're profiling a significant
>> percentage of millions of connections over thousands of paths as part
>> of standard operations while incurring negligible overhead.  The app
>> can easily timestamp its operations, but without some mechanism
>> for getting timestamps out of a TCP connection, the networking portion
>> of servicing requests is pretty much a black box.
>
> If other people have an opinion about this, now would be the time
> to speak up. :-)
>
The use case that motivated this patch is really the same as that of
UDP, in that the application is receiving messages that it wants to
timestamp; in the case of TCP the application extracts the frames out of
the stream.  The lack of a timestamp to discern when a message was
received over TCP is readily apparent when designing a message-based
ULP that can dynamically select which protocol to run over.

^ permalink raw reply

* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Bill Fink @ 2010-05-01  5:07 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20100430.164115.257514715.davem@davemloft.net>

On Fri, 30 Apr 2010, David Miller wrote:

> From: Tom Herbert <therbert@google.com>
> Date: Fri, 30 Apr 2010 00:58:32 -0700
> 
> >> All these new checks and branches for a feature of questionable value.
> > 
> >> If you can modify you apps to grab this information you can also probe
> >> for the information using external probing tools.
> >>
> > I don't see a nice way to do that; we're profiling a significant
> > percentage of millions of connections over thousands of paths as part
> > of standard operations while incurring negligible overhead.  The app
> > can easily timestamp its operations, but without some mechanism
> > for getting timestamps out of a TCP connection, the networking portion
> > of servicing requests is pretty much a black box.
> 
> If other people have an opinion about this, now would be the time
> to speak up. :-)

Not being a kernel hacker, I will naively ask if the kernel tracing
facility could somehow be used to provide the desired info (or could
be modified to provide it).

						-Bill

^ permalink raw reply

* RE: question re: net-2.6 and net-next-2.6 trees re: patch submission
From: Elina Pasheva @ 2010-05-01  5:01 UTC (permalink / raw)
  To: David Miller
  Cc: dbrownell@users.sourceforge.net, Rory Filer,
	netdev@vger.kernel.org
In-Reply-To: <20100430.190145.45137187.davem@davemloft.net>


> On 4/30/2010 7:05 PM David Miller wrote:

>>From: Elina Pasheva <epasheva@sierrawireless.com>
>>Date: Fri, 30 Apr 2010 17:53:14 -0700

>> If I submit a new driver to net-2.6 tree (e.g. sierra_net driver that
>> was applied to net-2.6 tree) where do I submit subsequent patches for
>> that driver - net-2.6 tree or net-next-2.6 tree?

>It depends upon the severity of the fix.

>At this stage in the game only the most serious fixes are going
>in, fixes for things that cause crashes and the like.  However,
>since this is a new driver, we might be a little bit more lenient,
>since changes to a new driver can harm fewer people.

Thank you, David.

Would you please apply
[PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
which is a very serious fix and affects the driver's functionality.
Sorry, it was my bad.

I tested this patch with USB 306.

Thanks,
Elina


^ permalink raw reply

