* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Wolfgang Grandegger @ 2010-05-01 14:09 UTC
To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100501140823.GA2779@riccoc20.at.omicron.at>
Richard Cochran wrote:
> On Fri, Apr 30, 2010 at 07:58:41PM +0200, Wolfgang Grandegger wrote:
>>> include/linux/Kbuild | 1 +
>>> include/linux/ptp_clock.h | 37 +++++
>> ptp_clock.h should probably be added to "include/linux/Kbuild".
>
> But it already is, see the two lines above. Or did you mean something
> else?
Oops, sorry for the noise.
Wolfgang.
* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-05-01 14:08 UTC
To: Wolfgang Grandegger; +Cc: netdev
In-Reply-To: <4BDB1A51.20505@grandegger.com>
On Fri, Apr 30, 2010 at 07:58:41PM +0200, Wolfgang Grandegger wrote:
> > include/linux/Kbuild | 1 +
> > include/linux/ptp_clock.h | 37 +++++
>
> ptp_clock.h should probably be added to "include/linux/Kbuild".
But it already is, see the two lines above. Or did you mean something
else?
Thanks,
Richard
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 13:49 UTC
To: Eric Dumazet
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272720125.2230.178.camel@edumazet-laptop>
On Sat, 2010-05-01 at 15:22 +0200, Eric Dumazet wrote:
> You must understand that the whole 'bench' is mostly governed by
> scheduler artifacts. The regression you mention is probably a side
> effect.
likely.
> By slowing down one part, it's possible to zap all calls to the scheduler and
> go maybe 300% faster (because consumer threads can avoid scheduling 3/4 of
> the time).
>
> Conversely, optimizing one part of the network stack might make
> threads hit an empty queue and need to call the scheduler more
> often.
It is fair to say that what i am seeing is _not_ fatal because it is rps
that is regressing; non-rps is fine. I would consider non-rps to be the
common use scenario and if that was doing badly then it is a problem.
The good news is it is getting better - likely because of some changes
made on behalf of rps ;->
With rps, one could follow some instructions on how to make it better.
I am hoping that some of the system "magic" is documented as Tom
mentioned he will.
> This is why some highly specialized programs never block/schedule and
> perform busy loops instead.
Agreed. My brain cells should learn to accept this fact ;->
cheers,
jamal
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 13:22 UTC
To: hadi
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272714966.14499.37.camel@bigi>
On Saturday 01 May 2010 at 07:56 -0400, jamal wrote:
>
> [1]i.e with this program rps was getting worse (it was much better
> before say net-next of apr14) and that non-rps has been getting better
> numbers since. The regression is real - but it is likely in another
> subsystem.
>
You must understand that the whole 'bench' is mostly governed by
scheduler artifacts. The regression you mention is probably a side
effect.
By slowing down one part, it's possible to zap all calls to the scheduler and
go maybe 300% faster (because consumer threads can avoid scheduling 3/4 of
the time).
Conversely, optimizing one part of the network stack might make
threads hit an empty queue and need to call the scheduler more
often.
This is why some highly specialized programs never block/schedule and
perform busy loops instead.
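A minimal sketch of the busy-loop approach, assuming a UDP socket and the Linux/POSIX socket API. The function names and the spin limit are hypothetical and not taken from the benchmark discussed in this thread. The receiver never sleeps in the kernel; it retries a non-blocking recv() until data shows up.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Receive `want` datagrams without ever blocking: MSG_DONTWAIT makes
 * recv() fail with EAGAIN on an empty queue, and we simply retry.
 * `max_spins` bounds the loop so the sketch cannot spin forever.
 * Returns the number of datagrams received.
 */
static int busy_poll_recv(int fd, int want, long max_spins)
{
	char buf[2048];
	int got = 0;

	while (got < want && max_spins-- > 0) {
		ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

		if (n >= 0)
			got++;
		else if (errno != EAGAIN && errno != EWOULDBLOCK)
			break;	/* real error: give up */
	}
	return got;
}

/* Loopback demo: send three datagrams to ourselves, then spin for them. */
static int busy_poll_demo(void)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int i, got, fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
		close(fd);
		return -1;
	}
	for (i = 0; i < 3; i++)
		sendto(fd, "x", 1, 0, (struct sockaddr *)&addr, sizeof(addr));
	got = busy_poll_recv(fd, 3, 100000000L);
	close(fd);
	return got;
}
```

The trade-off is the one described above: the thread burns a full CPU, but it never pays for a sleep/wakeup through the scheduler.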
* [PATCH linux-2.6.34-rc5] drivers/net/phy: micrel phy driver
From: Choi, David @ 2010-04-29 16:12 UTC
To: netdev
To whom it may concern:
From: David J. Choi <david.choi@micrel.com>
Body of the explanation: This is the first version of the PHY driver from Micrel Inc.
Signed-off-by: David J. Choi <david.choi@micrel.com>
---
--- linux-2.6.34-rc5/drivers/net/phy/micrel.c.orig 2010-04-29 08:20:51.000000000 -0700
+++ linux-2.6.34-rc5/drivers/net/phy/micrel.c 2010-04-29 08:52:37.000000000 -0700
@@ -0,0 +1,105 @@
+/*
+ * drivers/net/phy/micrel.c
+ *
+ * Driver for Micrel PHYs
+ *
+ * Author: David J. Choi
+ *
+ * Copyright (c) 2010 Micrel, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ * Support : ksz9021 , vsc8201, ks8001
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/phy.h>
+
+#define PHY_ID_KSZ9021 0x00221611
+#define PHY_ID_VSC8201 0x000FC413
+#define PHY_ID_KS8001 0x0022161A
+
+
+static int kszphy_config_init(struct phy_device *phydev)
+{
+ return 0;
+}
+
+
+static struct phy_driver ks8001_driver = {
+ .phy_id = PHY_ID_KS8001,
+ .name = "Micrel KS8001",
+ .phy_id_mask = 0x00fffff0,
+ .features = PHY_BASIC_FEATURES,
+ .flags = PHY_POLL,
+ .config_init = kszphy_config_init,
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .driver = { .owner = THIS_MODULE,},
+};
+
+static struct phy_driver vsc8201_driver = {
+ .phy_id = PHY_ID_VSC8201,
+ .name = "Micrel VSC8201",
+ .phy_id_mask = 0x00fffff0,
+ .features = PHY_BASIC_FEATURES,
+ .flags = PHY_POLL,
+ .config_init = kszphy_config_init,
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .driver = { .owner = THIS_MODULE,},
+};
+
+static struct phy_driver ksz9021_driver = {
+ .phy_id = PHY_ID_KSZ9021,
+ .phy_id_mask = 0x000fff10,
+ .name = "Micrel KSZ9021 Gigabit PHY",
+ .features = PHY_GBIT_FEATURES | SUPPORTED_Pause,
+ .flags = PHY_POLL,
+ .config_init = kszphy_config_init,
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .driver = { .owner = THIS_MODULE, },
+};
+
+static int __init ksphy_init(void)
+{
+ int ret;
+
+ ret = phy_driver_register(&ks8001_driver);
+ if (ret)
+ goto err1;
+ ret = phy_driver_register(&vsc8201_driver);
+ if (ret)
+ goto err2;
+
+ ret = phy_driver_register(&ksz9021_driver);
+ if (ret)
+ goto err3;
+ return 0;
+
+err3:
+ phy_driver_unregister(&vsc8201_driver);
+err2:
+ phy_driver_unregister(&ks8001_driver);
+err1:
+ return ret;
+}
+
+static void __exit ksphy_exit(void)
+{
+ phy_driver_unregister(&ks8001_driver);
+ phy_driver_unregister(&vsc8201_driver);
+ phy_driver_unregister(&ksz9021_driver);
+}
+
+module_init(ksphy_init);
+module_exit(ksphy_exit);
+
+MODULE_DESCRIPTION("Micrel PHY driver");
+MODULE_AUTHOR("David J. Choi");
+MODULE_LICENSE("GPL");
--- linux-2.6.34-rc5/drivers/net/phy/Kconfig.orig 2010-04-29 08:21:12.000000000 -0700
+++ linux-2.6.34-rc5/drivers/net/phy/Kconfig 2010-04-29 08:25:18.000000000 -0700
@@ -88,6 +88,11 @@ config LSI_ET1011C_PHY
---help---
Supports the LSI ET1011C PHY.
+config MICREL_PHY
+ tristate "Driver for Micrel PHYs"
+ ---help---
+ Supports the KSZ9021, VSC8201, KS8001 PHYs.
+
config FIXED_PHY
bool "Driver for MDIO Bus/PHY emulation with fixed speed/link PHYs"
depends on PHYLIB=y
--- linux-2.6.34-rc5/drivers/net/phy/Makefile.orig 2010-04-29 08:20:25.000000000 -0700
+++ linux-2.6.34-rc5/drivers/net/phy/Makefile 2010-04-29 08:31:13.000000000 -0700
@@ -20,4 +20,5 @@ obj-$(CONFIG_MDIO_BITBANG) += mdio-bitba
obj-$(CONFIG_MDIO_GPIO) += mdio-gpio.o
obj-$(CONFIG_NATIONAL_PHY) += national.o
obj-$(CONFIG_STE10XP) += ste10Xp.o
+obj-$(CONFIG_MICREL_PHY) += micrel.o
obj-$(CONFIG_MDIO_OCTEON) += mdio-octeon.o
---
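For readers checking the ID masks in this patch: as far as I know, phylib binds a driver to a device when the two IDs agree on every bit set in the driver's phy_id_mask. The sketch below restates that rule in plain C so the values can be tried in user space. The struct and function names are hypothetical stand-ins, not kernel API; only the IDs and masks are copied from the patch.

```c
#include <stdint.h>

/* IDs copied from the patch above. */
#define PHY_ID_KSZ9021	0x00221611u
#define PHY_ID_VSC8201	0x000FC413u
#define PHY_ID_KS8001	0x0022161Au

/* Hypothetical stand-in for the two struct phy_driver fields used in matching. */
struct id_match {
	uint32_t phy_id;
	uint32_t phy_id_mask;
};

/* A driver matches when both IDs agree on every bit set in the mask. */
static int id_matches(const struct id_match *drv, uint32_t dev_id)
{
	return (drv->phy_id & drv->phy_id_mask) == (dev_id & drv->phy_id_mask);
}

/* ID/mask pairs as they appear in the three phy_driver definitions. */
static const struct id_match ks8001 = { PHY_ID_KS8001, 0x00fffff0u };
static const struct id_match vsc8201 = { PHY_ID_VSC8201, 0x00fffff0u };
static const struct id_match ksz9021 = { PHY_ID_KSZ9021, 0x000fff10u };
```

Under this rule a KS8001 with a different revision nibble (for example 0x0022161F) still matches ks8001. With the values as posted, the masked KS8001 and KSZ9021 IDs also compare equal (both reduce to 0x00221610 under 0x00fffff0, and to 0x00021610 under 0x000fff10), so each of the two drivers would accept the other's ID; the masks may be worth a second look.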
* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Arnd Bergmann @ 2010-05-01 12:36 UTC
To: Scott Feldman; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <C8008CCC.2D21E%scofeldm@cisco.com>
On Friday 30 April 2010, Scott Feldman wrote:
> > ip iov set port-profile DEVICE [ base BASE-DEVICE ] name PORT-PROFILE
> > [ host_uuid HOST_UUID ]
> > [ client_name CLIENT_NAME ]
> > [ client_uuid CLIENT_UUID ]
> > ip iov set vsi { associate | pre-associate | pre-associate-rr }
> > BASE-DEVICE
> > vsi MGR:VTID:VER
> > mac LLADDR [ vlan VID ]
> > client_uuid CLIENT_UUID
> >
> > ip iov del port_profile DEVICE [ base BASE-DEVICE ]
> > ip iov del vsi BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
> > [ client_uuid CLIENT_UUID ]
> >
> > ip iov show port_profile DEVICE [ base BASE-DEVICE ]
> > ip iov show vsi BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
> > [ client_uuid CLIENT_UUID ]
> >
> > You would obviously only implement the kernel support for the port-profile
> > stuff as callbacks, because no driver does VDP in the kernel yet, but we
> > should have a common netlink header that defines both variants.
> >
> > Chris, any opinion on this interface as opposed to the combined one?
> > Either one should work, but splitting it seems cleaner to me.
>
> I haven't seen Chris's response, but it seems vger was down for a while, so
> maybe it's coming. Assuming we go for the split design, we're still talking
> about using RTM_SETLINK/RTM_GETLINK/RTM_DELLINK for these netlink msgs? Or
> are you suggesting by your cmd syntax that we return to
> RTM_SETIOV/RTM_GETIOV like in the first iovnl patch? RTM_SET/GET/DELLINK is
> probably the simpler, cleaner patch.
In either case (split or combined), I would prefer the separate IOV
commands. The reason for this is that when support is not in the kernel,
it allows a cleaner separation between what's (always) handled in the
kernel and what's (potentially) done in user space.
Arnd
* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Paul LeoNerd Evans @ 2010-05-01 12:06 UTC
To: David Miller, netdev; +Cc: therbert
In-Reply-To: <20100430.164115.257514715.davem@davemloft.net>
On Fri, Apr 30, 2010 at 04:41:15PM -0700, David Miller wrote:
> If other people have an opinion about this, now would be the time
> to speak up. :-)
I have to say I agree with David.
The "receive timestamp" for a TCP recv() call is completely meaningless.
Each byte in the stream arguably could have a set of receive timestamps,
being the timestamp of the underlying IPv4 packet containing a fragment
of a TCP segment that covered that byte. One recv() call could cover
many packets, many recv() calls could be required to consume one packet.
We just don't know from userland.
The point about IPv4 fragments in UDP is a reasonable one; that because
of IPv4 fragmentation there are still potentially multiple timestamps
that could be relevant to a single UDP recv() call. But no two recv()
calls can possibly relate to the same IPv4 fragments, so I feel this is
more defined. Plus, of all the IPv4 fragments that go into a single UDP
packet, one of them is special - the first one, the one containing the
UDP header. We could easily say "the timestamp of a UDP recv() call
shall be the time at which its header was received, even if other
fragments arrived before or after it".
We cannot make any such distinction for some window in a TCP stream. All
TCP segments are indistinct in this manner.
--
Paul "LeoNerd" Evans
leonerd@leonerd.org.uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 11:56 UTC
To: Eric Dumazet
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272714179.2230.151.camel@edumazet-laptop>
On Sat, 2010-05-01 at 13:42 +0200, Eric Dumazet wrote:
> But the whole point of epoll is to not change interest each time you get an
> event.
>
> Without EV_PERSIST, you need two more syscalls per recvfrom()
>
> epoll_wait()
> epoll_ctl(REMOVE)
> epoll_ctl(ADD)
> recvfrom()
>
> Even poll() would be faster in your case
>
> poll(one fd)
> recvfrom()
>
This is true - but my goal was/is to replicate the regression i was
seeing[1].
I will try with PERSIST next opportunity. If it gets better
then it is something that needs documentation in the doc Tom
promised ;->
> I always thought copybreak was borderline...
> It can help to reduce memory footprint (allocating 128 bytes instead of
> 2048/4096 bytes per frame), but with RPS, it would make sense to perform
> copybreak after RPS, not before.
>
> Reducing memory footprint also means fewer changes to
> udp_memory_allocated / tcp_memory_allocated (memory reclaim logic)
Indeed, something that didn't cross my mind in the rush to test - it is
one of those things that need to be mentioned in some doc somewhere.
Tom, are you listening? ;->
cheers,
jamal
[1]i.e with this program rps was getting worse (it was much better
before say net-next of apr14) and that non-rps has been getting better
numbers since. The regression is real - but it is likely in another
subsystem.
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 11:42 UTC
To: hadi
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272713014.14499.21.camel@bigi>
On Saturday 01 May 2010 at 07:23 -0400, jamal wrote:
> On Sat, 2010-05-01 at 07:57 +0200, Eric Dumazet wrote:
>
> > I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> > overhead for each packet...)
>
> That's a different test case then ;-> You can also get rid of the timer
> (I doubt it will show much difference in results) - I have it in there
> because i am trying to replicate what i saw causing the regression.
>
> > RPS off : 220.000 pps
> >
> > RPS on (ee mask) : 700.000 pps (with a slightly modified tg3 driver)
> > 96% of delivered packets
> >
>
> That's a very very huge gap. What were the numbers before you changed to
> EV_PERSIST?
But the whole point of epoll is to not change interest each time you get an
event.
Without EV_PERSIST, you need two more syscalls per recvfrom()
epoll_wait()
epoll_ctl(REMOVE)
epoll_ctl(ADD)
recvfrom()
Even poll() would be faster in your case
poll(one fd)
recvfrom()
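The persistent pattern can be sketched with raw epoll, without libevent (EV_PERSIST is the libevent flag; the sketch assumes that a plain level-triggered epoll registration is the equivalent). The fd is added to the epoll set once, so the steady-state cost is epoll_wait() plus recvfrom() per packet. Function names are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Receive up to `want` datagrams. The socket is registered with one
 * epoll_ctl(ADD) before the loop and never removed or re-added.
 * Returns the number of datagrams received, or -1 on setup failure.
 */
static int recv_persistent(int fd, int want)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	struct epoll_event out;
	char buf[2048];
	int got = 0;
	int ep = epoll_create1(0);

	if (ep < 0)
		return -1;
	if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(ep);
		return -1;
	}
	while (got < want) {
		/* 1 s timeout so the sketch cannot hang forever */
		if (epoll_wait(ep, &out, 1, 1000) <= 0)
			break;
		if (recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL) < 0)
			break;
		got++;
	}
	close(ep);
	return got;
}

/* Loopback demo: send three datagrams to ourselves and read them back. */
static int epoll_persistent_demo(void)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int i, got, fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
		close(fd);
		return -1;
	}
	for (i = 0; i < 3; i++)
		sendto(fd, "x", 1, 0, (struct sockaddr *)&addr, sizeof(addr));
	got = recv_persistent(fd, 3);
	close(fd);
	return got;
}
```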
> Note: i did not add any of your other patches for dst refcnt, sockets
> etc. Were you running with those patches in these tests? I will try the
> next opportunity i get to have latest kernel + those patches.
>
> > This is on tg3 adapter, and tg3 has copybreak feature : small packets
> > are copied into skb of the right size.
>
> Ok, so the driver tuning is also important then (and it shows in the
> profile).
I always thought copybreak was borderline...
It can help to reduce memory footprint (allocating 128 bytes instead of
2048/4096 bytes per frame), but with RPS, it would make sense to perform
copybreak after RPS, not before.
Reducing memory footprint also means fewer changes to
udp_memory_allocated / tcp_memory_allocated (memory reclaim logic)
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 11:29 UTC
To: Eric Dumazet
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272694442.2230.86.camel@edumazet-laptop>
On Sat, 2010-05-01 at 08:14 +0200, Eric Dumazet wrote:
> BTW, using ee mask, cpu4 is not used at _all_, even for the user
> threads. Scheduler does a bad job IMHO.
I have the opposite frustration ;->
I did notice it got used. My goal was to totally avoid using it, for the
simple reason that it is an SMT thread that shares the same core as cpu0.
In retrospect i should probably set irq affinity to cpu0 and 4 then.
> Using fe mask, I get all packets (sent at 733311pps by my pktgen
> machine), and my CPU0 even has idle time !!!
I will try this next time i get the chance.
cheers,
jamal
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-05-01 11:23 UTC
To: Eric Dumazet
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272693424.2230.75.camel@edumazet-laptop>
On Sat, 2010-05-01 at 07:57 +0200, Eric Dumazet wrote:
> I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> overhead for each packet...)
That's a different test case then ;-> You can also get rid of the timer
(I doubt it will show much difference in results) - I have it in there
because i am trying to replicate what i saw causing the regression.
> RPS off : 220.000 pps
>
> RPS on (ee mask) : 700.000 pps (with a slightly modified tg3 driver)
> 96% of delivered packets
>
That's a very very huge gap. What were the numbers before you changed to
EV_PERSIST?
Note: i did not add any of your other patches for dst refcnt, sockets
etc. Were you running with those patches in these tests? I will try the
next opportunity i get to have latest kernel + those patches.
> This is on tg3 adapter, and tg3 has copybreak feature : small packets
> are copied into skb of the right size.
Ok, so the driver tuning is also important then (and it shows in the
profile).
cheers,
jamal
* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-01 11:00 UTC
To: David Miller
Cc: eric.dumazet, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
arjan
In-Reply-To: <20100430.163857.180417789.davem@davemloft.net>
On Fri, Apr 30, 2010 at 04:38:57PM -0700, David Miller wrote:
> From: Andi Kleen <ak@gargoyle.fritz.box>
> Date: Thu, 29 Apr 2010 23:41:44 +0200
>
> > Use io_schedule() in network stack to tell cpuidle governor to guarantee lower latencies
> >
> > XXX: probably too aggressive, some of these sleeps are not under high load.
> >
> > Based on a bug report from Eric Dumazet.
> >
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
>
> I like this, except that we probably don't want the delayacct_blkio_*() calls
> these things do.
Yes.
It needs more work to handle the "long sleep" case, so please don't apply it yet.
Still curious if it fixes Eric's test case.
>
> Probably the rest of what these things do should remain in the io_schedule*()
> functions and the block layer can call its own versions which add in the
> delayacct_blkio_*() bits.
Good point.
>
> Or, if the delayacct stuff is useful for socket I/O too, then its interface
> names should have the "blk" stripped from them :-)
Good question. I suspect it's actually useful for some cases, but just adding
sockets might confuse some users.
-Andi
* Re: OFT - reserving CPU's for networking
From: Andi Kleen @ 2010-05-01 10:53 UTC
To: David Miller; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100430.153038.62351857.davem@davemloft.net>
> And we don't want it to, because the decision mechanisms for steering
> that we are using now are starting to get into the stateful territory and
> that's verboten for NIC offload as far as we're concerned.
Huh? I thought full TCP offload was forbidden?[1] Stateful as in NIC
(or someone else like netfilter) tracking flows is quite common and very far
from full offload. AFAIK it doesn't have nearly all the problems full
offload has.
-Andi
[1] although it seems to leak in more and more through the RDMA backdoor.
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 10:47 UTC
To: Changli Gao
Cc: hadi, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <o2u412e6f7f1005010324sfb63393fo86acdff4c97c5be3@mail.gmail.com>
On Saturday 01 May 2010 at 18:24 +0800, Changli Gao wrote:
> On Sat, May 1, 2010 at 2:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > BTW, using ee mask, cpu4 is not used at _all_, even for the user
> > threads. Scheduler does a bad job IMHO.
> >
> > Using fe mask, I get all packets (sent at 733311pps by my pktgen
> > machine), and my CPU0 even has idle time !!!
> >
> > Limit seems to be around 800.000 pps
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > PerfTop: 5616 irqs/sec kernel:93.9% [1000Hz cycles], (all, 8 CPUs)
> > ------------------------------------------------------------------------------------------------------------------------
> >
>
> Oh, cpu0 usage is about 100-(100-93.9)*8 = 51.2% (am I right?). If we
> can do weighted packet distribution, with cpu0's weight at 1 and the other
> cpus at 2, maybe we can utilize all the cpu power.
>
Nope, cpu0 was at 100% in this test, other cpus were about at 50% each.
Weighted distribution would be OK if I wanted to use cpu0 as one of the
'slave' cpus (RPS targets). But for the workload I am interested in, and for
the ability to resist DDoS, I want to keep cpu0 outside of the IP/TCP/UDP stack.
Later, inlining skb_pull() in eth_type_trans() made it possible to reach
840.000 pps.
top - 12:42:55 up 3:00, 2 users, load average: 0.44, 0.11, 0.03
Tasks: 126 total, 1 running, 125 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2%us, 16.5%sy, 0.0%ni, 46.5%id, 11.4%wa, 0.9%hi, 22.5%si, 0.0%st
Mem: 4148112k total, 211152k used, 3936960k free, 15228k buffers
Swap: 4192928k total, 0k used, 4192928k free, 121804k cached
You can see average idle of 46%
So there are probably more optimizations to do to reach maybe 1.300.000
pps ;)
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Changli Gao @ 2010-05-01 10:24 UTC
To: Eric Dumazet
Cc: hadi, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272694442.2230.86.camel@edumazet-laptop>
On Sat, May 1, 2010 at 2:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> BTW, using ee mask, cpu4 is not used at _all_, even for the user
> threads. Scheduler does a bad job IMHO.
>
> Using fe mask, I get all packets (sent at 733311pps by my pktgen
> machine), and my CPU0 even has idle time !!!
>
> Limit seems to be around 800.000 pps
>
> ------------------------------------------------------------------------------------------------------------------------
> PerfTop: 5616 irqs/sec kernel:93.9% [1000Hz cycles], (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------------------
>
Oh, cpu0 usage is about 100-(100-93.9)*8 = 51.2% (am I right?). If we
can do weighted packet distribution, with cpu0's weight at 1 and the other
cpus at 2, maybe we can utilize all the cpu power.
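A rough sketch of what such a weighted distribution could look like, assuming a per-flow hash is already available. This is a hypothetical illustration of the idea, not how RPS selects a CPU; the function name and the weight table are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pick a CPU for a flow hash so that CPU i receives a share of the hash
 * space proportional to weight[i]. Returns -1 if all weights are zero.
 */
static int pick_weighted_cpu(uint32_t hash, const unsigned int *weight,
			     size_t ncpus)
{
	unsigned int total = 0, slot;
	size_t i;

	for (i = 0; i < ncpus; i++)
		total += weight[i];
	if (total == 0)
		return -1;

	slot = hash % total;	/* position inside the cumulative weight range */
	for (i = 0; i < ncpus; i++) {
		if (slot < weight[i])
			return (int)i;
		slot -= weight[i];
	}
	return -1;	/* not reached: slot < total by construction */
}
```

With weights {1, 2, 2, 2}, seven consecutive hash values land once on cpu0 and twice on each of the other three CPUs.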
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Eric Dumazet @ 2010-05-01 8:03 UTC
To: David Miller; +Cc: hadi, xiaosuo, therbert, shemminger, netdev, eilong, bmb
In-Reply-To: <1272697367.2230.106.camel@edumazet-laptop>
On Saturday 01 May 2010 at 09:02 +0200, Eric Dumazet wrote:
> On Friday 30 April 2010 at 16:35 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 29 Apr 2010 23:01:49 +0200
> >
> > > [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
> >
> > So what's the difference between call_rcu() freeing this little waitqueue
> > struct and doing it for the entire socket?
> >
> > We'll still be doing an RCU call every socket destroy, and now we also have
> > a new memory allocation/free per connection.
> >
> > This has to show up in things like 'lat_connect' and friends, does it not?
>
> Before patch :
>
> lat_connect -N 10 127.0.0.1
> TCP/IP connection cost to 127.0.0.1: 27.8872 microseconds
>
> After :
>
> lat_connect -N 10 127.0.0.1
> TCP/IP connection cost to 127.0.0.1: 20.7681 microseconds
>
> Strange, isn't it?
>
> (special care should be taken with this bench, as it leaves many sockets
> in TIME_WAIT state, so to get consistent numbers we have to wait a while
> before restarting it)
Oops, this was with the other patch (about dst no_refcounting in input
path), sorry.
With the "sock_def_readable() and friends RCU conversion" patch I got :
lat_connect -N 10 127.0.0.1
TCP/IP connection cost to 127.0.0.1: 27.6244 microseconds
Anyway, this lat_connect seems very unreliable (lot of variance)
with linux-2.6.31, ~33 us
with linux-2.6.33, ~30 us
David, I also need this RCU thing in order to be able to group all
wakeups at the end of net_rx_action().
Plan was to use RCU, so that I don't need to increase sk_refcnt when
queueing a "wakeup" (and decrease sk_refcnt a long time after)
Previous attempt was a bit hacky,
http://patchwork.ozlabs.org/patch/24179/
I expect 2010 one will be cleaner :)
* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Eric Dumazet @ 2010-05-01 7:02 UTC
To: David Miller; +Cc: hadi, xiaosuo, therbert, shemminger, netdev, eilong, bmb
In-Reply-To: <20100430.163519.133415203.davem@davemloft.net>
On Friday 30 April 2010 at 16:35 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 29 Apr 2010 23:01:49 +0200
>
> > [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
>
> So what's the difference between call_rcu() freeing this little waitqueue
> struct and doing it for the entire socket?
>
> We'll still be doing an RCU call every socket destroy, and now we also have
> a new memory allocation/free per connection.
>
> This has to show up in things like 'lat_connect' and friends, does it not?
Before patch :
lat_connect -N 10 127.0.0.1
TCP/IP connection cost to 127.0.0.1: 27.8872 microseconds
After :
lat_connect -N 10 127.0.0.1
TCP/IP connection cost to 127.0.0.1: 20.7681 microseconds
Strange, isn't it?
(special care should be taken with this bench, as it leaves many sockets
in TIME_WAIT state, so to get consistent numbers we have to wait a while
before restarting it)
* [PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
From: Eric Dumazet @ 2010-05-01 6:42 UTC
To: David Miller; +Cc: netdev, Tom Herbert, jamal
840.000 pps instead of 800.000 pps on my 'old' machine, using RPS
Before patch, profile of CPU 0 (handling tg3 interrupts)
2167.00 13.9% __alloc_skb vmlinux
1908.00 12.3% eth_type_trans vmlinux
1125.00 7.2% __kmalloc_track_caller vmlinux
981.00 6.3% __netdev_alloc_skb vmlinux
925.00 5.9% _raw_spin_lock vmlinux
786.00 5.1% kmem_cache_alloc vmlinux
757.00 4.9% skb_pull vmlinux
698.00 4.5% tg3_read32 vmlinux
637.00 4.1% __slab_alloc vmlinux
620.00 4.0% tg3_poll_work vmlinux
576.00 3.7% get_rps_cpu vmlinux
448.00 2.9% bnx2_interrupt vmlinux
After (no more skb_pull, and eth_type_trans() not more expensive)
Predominant cost is memory allocator...
1625.00 12.4% eth_type_trans vmlinux
1468.00 11.2% __alloc_skb vmlinux
1004.00 7.6% __kmalloc_track_caller vmlinux
893.00 6.8% _raw_spin_lock vmlinux
738.00 5.6% __netdev_alloc_skb vmlinux
665.00 5.1% tg3_read32 vmlinux
656.00 5.0% kmem_cache_alloc vmlinux
655.00 5.0% __slab_alloc vmlinux
509.00 3.9% bnx2_interrupt vmlinux
483.00 3.7% tg3_poll_work vmlinux
455.00 3.5% _raw_spin_lock_irqsave vmlinux
330.00 2.5% get_rps_cpu vmlinux
286.00 2.2% nommu_map_page vmlinux
277.00 2.1% enqueue_to_backlog vmlinux
235.00 1.8% inet_gro_receive vmlinux
232.00 1.8% __copy_to_user_ll vmlinux
181.00 1.4% dev_gro_receive vmlinux
165.00 1.3% skb_gro_reset_offset vmlinux
(bnx2_interrupt is called, because irq 16 is shared on this machine on two nics...)
Thanks !
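The idea behind the patch below can be modelled in user space. skb_pull() is an out-of-line function that checks the length and then advances the data pointer; the patch does the length test at the call site and uses the inline, unchecked __skb_pull(), which saves a function call per packet. The toy_* names and struct are hypothetical, heavily simplified stand-ins for the kernel's sk_buff helpers.

```c
#include <stddef.h>

#define TOY_ETH_HLEN 14	/* Ethernet header length, same value as ETH_HLEN */

/* Hypothetical, heavily simplified stand-in for struct sk_buff. */
struct toy_buf {
	unsigned char *data;	/* start of the not-yet-consumed bytes */
	unsigned int len;	/* number of bytes available at data */
};

/* Unchecked pull: the caller guarantees len <= b->len. */
static inline unsigned char *toy_pull_unchecked(struct toy_buf *b,
						unsigned int len)
{
	b->len -= len;
	return b->data += len;
}

/* Checked pull: refuses to run past the end and returns NULL instead. */
static unsigned char *toy_pull(struct toy_buf *b, unsigned int len)
{
	return len > b->len ? NULL : toy_pull_unchecked(b, len);
}

/* Shape of the patched call site: test once inline, then pull unchecked. */
static void toy_strip_eth_header(struct toy_buf *b)
{
	if (b->len >= TOY_ETH_HLEN)
		toy_pull_unchecked(b, TOY_ETH_HLEN);
}
```

Both variants leave a too-short buffer untouched; the difference is only where the check is executed.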
[PATCH net-next-2.6] net: eth_type_trans() should inline skb_pull()
With RPS, this patch can give a 5 % boost in performance.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0c0d272..763524b 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -162,7 +162,8 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
skb->dev = dev;
skb_reset_mac_header(skb);
- skb_pull(skb, ETH_HLEN);
+ if (likely(skb->len >= ETH_HLEN))
+ __skb_pull(skb, ETH_HLEN);
eth = eth_hdr(skb);
if (unlikely(is_multicast_ether_addr(eth->h_dest))) {
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 6:14 UTC
To: hadi
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272693424.2230.75.camel@edumazet-laptop>
On Saturday 01 May 2010 at 07:57 +0200, Eric Dumazet wrote:
> On Friday 30 April 2010 at 20:06 -0400, jamal wrote:
>
> > Yes, Nehalem.
> > RPS off is better (~700 kpps) than RPS on (~650 kpps). Are you seeing the
> > same trend on the old hardware?
> >
>
> Of course not ! Or else RPS would be useless :(
>
> I changed your program a bit to use EV_PERSIST, (to avoid epoll_ctl()
> overhead for each packet...)
>
> RPS off : 220.000 pps
>
> RPS on (ee mask) : 700.000 pps (with a slightly modified tg3 driver)
> 96% of delivered packets
BTW, using ee mask, cpu4 is not used at _all_, even for the user
threads. Scheduler does a bad job IMHO.
Using fe mask, I get all packets (sent at 733311pps by my pktgen
machine), and my CPU0 even has idle time !!!
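For reference, the two masks decode as follows, assuming they are hexadecimal CPU bitmaps in which bit n stands for CPU n. The helper names are hypothetical.

```c
#include <stdio.h>

/* Return 1 if CPU `cpu` is enabled in the bitmask `mask`, else 0. */
static int mask_has_cpu(unsigned long mask, unsigned int cpu)
{
	return (int)((mask >> cpu) & 1UL);
}

/* Count the CPUs enabled in `mask`. */
static unsigned int mask_weight(unsigned long mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit */
		n++;
	return n;
}

/* Print the CPUs selected by a mask, e.g. "ee: 1 2 3 5 6 7". */
static void print_mask(unsigned long mask)
{
	unsigned int cpu;

	printf("%lx:", mask);
	for (cpu = 0; cpu < 8 * sizeof(mask); cpu++)
		if (mask_has_cpu(mask, cpu))
			printf(" %u", cpu);
	printf("\n");
}
```

So ee selects CPUs 1 2 3 5 6 7 (six targets, leaving out cpu0 and cpu4), while fe selects CPUs 1 to 7 and only leaves out cpu0.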
Limit seems to be around 800.000 pps
------------------------------------------------------------------------------------------------------------------------
PerfTop: 5616 irqs/sec kernel:93.9% [1000Hz cycles], (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________ _______
3492.00 6.2% __slab_free vmlinux
2334.00 4.2% _raw_spin_lock vmlinux
2314.00 4.1% _raw_spin_lock_irqsave vmlinux
1807.00 3.2% ip_rcv vmlinux
1605.00 2.9% schedule vmlinux
1474.00 2.6% __netif_receive_skb vmlinux
1464.00 2.6% kfree vmlinux
1405.00 2.5% ip_route_input vmlinux
1318.00 2.4% __copy_to_user_ll vmlinux
1214.00 2.2% __alloc_skb vmlinux
1160.00 2.1% nf_hook_slow vmlinux
1020.00 1.8% eth_type_trans vmlinux
860.00 1.5% sched_clock_local vmlinux
775.00 1.4% read_tsc vmlinux
773.00 1.4% ipt_do_table vmlinux
766.00 1.4% _raw_spin_unlock_irqrestore vmlinux
748.00 1.3% sock_recv_ts_and_drops vmlinux
747.00 1.3% ia32_sysenter_target vmlinux
740.00 1.3% select_nohz_load_balancer vmlinux
644.00 1.2% __kmalloc_track_caller vmlinux
596.00 1.1% tg3_read32 vmlinux
566.00 1.0% __udp4_lib_lookup vmlinux
* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Eric Dumazet @ 2010-05-01 6:00 UTC
To: Tom Herbert; +Cc: Bill Fink, David Miller, netdev
In-Reply-To: <AANLkTimFjDNgXdMGpVpO7Gi38ROlww1Fa7IKA1ASBKOV@mail.gmail.com>
On Friday 30 April 2010 at 22:40 -0700, Tom Herbert wrote:
> > Not being a kernel hacker, I will naively ask if the kernel tracing
> > facility could somehow be used to provide the desired info (or could
> > be modified to provide it).
> >
>
> We did consider kernel tracing (more in the context of implementing
> RFC 4898). In the case of trying to get per-packet timestamps, the cost of
> correlating a ktrace event with an application message is probably too
> high to make it practical. If it weren't for the cost of
> timestamp'ing every single skb being received, we'd probably have
> SO_TIMESTAMP turned on permanently for many connections. For now
> we're settling for a percentage of messages for sampling.
Tom, did you try to reuse the existing skb or sk tstamps?
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-05-01 5:57 UTC
To: hadi
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272672394.14499.1.camel@bigi>
On Friday 30 April 2010 at 20:06 -0400, jamal wrote:
> Yes, Nehalem.
> RPS off is better (~700 kpps) than RPS on (~650 kpps). Are you seeing the
> same trend on the old hardware?
>
Of course not! Otherwise RPS would be useless :(
I changed your program a bit to use EV_PERSIST (to avoid the epoll_ctl()
overhead for each packet...)
RPS off : 220,000 pps
RPS on (ee mask) : 700,000 pps (with a slightly modified tg3 driver)
96% of delivered packets
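For readers unfamiliar with the notation: the "ee" mask is presumably the hexadecimal CPU bitmap written to the receive queue's rps_cpus sysfs file. 0xee is 1110 1110 in binary, so CPUs 1, 2, 3, 5, 6 and 7 are allowed to process packets and CPUs 0 and 4 are left out. A minimal sketch of that decoding follows; the helper name rps_mask_to_cpus is hypothetical and is not part of the kernel or of the test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: decode a hexadecimal CPU bitmap such as "ee"
 * into a list of CPU numbers. Returns the number of entries written
 * to cpus[], which is at most max.
 */
size_t rps_mask_to_cpus(const char *hex, int *cpus, size_t max)
{
	unsigned long mask;
	size_t n = 0;
	int cpu;

	assert(hex != NULL && cpus != NULL);
	mask = strtoul(hex, NULL, 16);

	/* Walk the bitmap from bit 0 upwards; bit N set means CPU N. */
	for (cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1) {
		if (mask & 1UL)
			cpus[n++] = cpu;
	}
	return n;
}
```

Calling rps_mask_to_cpus("ee", cpus, 8) fills cpus[] with 1, 2, 3, 5, 6, 7 and returns 6.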
This is on a tg3 adapter, and tg3 has a copybreak feature: small packets
are copied into an skb of the right size.
#define TG3_RX_COPY_THRESHOLD 256 -> 40 ...
We really should disable this feature for RPS workloads;
unfortunately ethtool cannot tweak this.
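To make the cost concrete, here is a minimal userspace model of a copybreak receive path. It is only a sketch under stated assumptions, not the tg3 driver code: rx_copybreak and RX_COPY_THRESHOLD are hypothetical names, and malloc()/memcpy() stand in for the skb allocation and copy. The point it illustrates is that every packet at or below the threshold costs an extra allocation and copy on the CPU that services the NIC, which is presumably the work one wants to keep off that CPU when RPS is in use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RX_COPY_THRESHOLD 256	/* illustrative; same value as quoted above */

/*
 * Hypothetical model of copybreak. ring_buf is the large buffer the
 * packet was received into. Returns the buffer to hand to the stack
 * and sets *recycled to 1 when ring_buf can be reused immediately.
 */
unsigned char *rx_copybreak(unsigned char *ring_buf, size_t pkt_len,
			    int *recycled)
{
	unsigned char *small;

	assert(ring_buf != NULL && recycled != NULL);
	*recycled = 0;

	/* Large packet: pass the ring buffer up, no copy. */
	if (pkt_len > RX_COPY_THRESHOLD)
		return ring_buf;

	/* Small packet: allocate a right-sized buffer and copy into it. */
	small = malloc(pkt_len ? pkt_len : 1);
	if (!small)
		return ring_buf;	/* allocation failed: fall back to no copy */
	memcpy(small, ring_buf, pkt_len);
	*recycled = 1;
	return small;
}
```

With the threshold at 256, a 64-byte datagram takes the copy path; lowering the threshold to 40 would send the same datagram down the no-copy path.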
So the profile of CPU 0 (RPS on) looks like:
------------------------------------------------------------------------------------------------------------------------
PerfTop: 1001 irqs/sec kernel:99.7% [1000Hz cycles], (all, cpu: 0)
------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________ _______
819.00 12.6% __alloc_skb vmlinux
592.00 9.1% eth_type_trans vmlinux
509.00 7.8% _raw_spin_lock vmlinux
475.00 7.3% __kmalloc_track_caller vmlinux
358.00 5.5% tg3_read32 vmlinux
345.00 5.3% __netdev_alloc_skb vmlinux
329.00 5.0% kmem_cache_alloc vmlinux
307.00 4.7% _raw_spin_lock_irqsave vmlinux
284.00 4.4% bnx2_interrupt vmlinux
277.00 4.2% skb_pull vmlinux
248.00 3.8% tg3_poll_work vmlinux
202.00 3.1% __slab_alloc vmlinux
197.00 3.0% get_rps_cpu vmlinux
106.00 1.6% enqueue_to_backlog vmlinux
87.00 1.3% _raw_spin_lock_bh vmlinux
80.00 1.2% __copy_to_user_ll vmlinux
77.00 1.2% nommu_map_page vmlinux
77.00 1.2% __napi_gro_receive vmlinux
65.00 1.0% tg3_alloc_rx_skb vmlinux
60.00 0.9% skb_gro_reset_offset vmlinux
57.00 0.9% skb_put vmlinux
57.00 0.9% __slab_free vmlinux
/*
 * Usage: udpsnkfrk [-c] [-v] [-p baseport] nbports
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <pthread.h>
#include <event.h>

struct worker_data {
	struct event *snk_ev;
	struct event_base *base;
	struct timeval t;
	unsigned long pack_count;
	unsigned long bytes_count;
	unsigned long tout;
	int fd; /* move to avoid hole on 64-bit */
	int pad1;
	unsigned long _padd[99]; /* avoid false sharing */
};

void usage(int code)
{
	fprintf(stderr, "Usage: udpsink [-c] [-v] [-p baseport] nbports\n");
	exit(code);
}

/* libevent callback: drain the socket, or re-arm the event after a timeout */
void process_recv(int fd, short ev, void *arg)
{
	char buffer[4096];
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	struct worker_data *wdata = (struct worker_data *)arg;
	int lu = 0;

	if (ev == EV_TIMEOUT) {
		wdata->tout++;
		if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
			perror("cb event_add");
			return;
		}
	} else {
		/* non-blocking socket: read until the queue is empty */
		do {
			lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0,
				      (struct sockaddr *)&addr, &len);
			if (lu > 0) {
				wdata->pack_count++;
				wdata->bytes_count += lu;
			}
		} while (lu > 0);
	}
}

/* give each worker its own event base and a persistent read event */
int prep_thread(struct worker_data *wdata)
{
	wdata->t.tv_sec = 1;
	wdata->t.tv_usec = random() % 50000L;
	wdata->base = event_init();
	/* EV_PERSIST: the event stays registered, no epoll_ctl() per packet */
	event_set(wdata->snk_ev, wdata->fd, EV_READ|EV_PERSIST, process_recv, wdata);
	event_base_set(wdata->base, wdata->snk_ev);
	if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
		perror("event_add");
		return -1;
	}
	return 0;
}

void *worker_func(void *arg)
{
	struct worker_data *wdata = (struct worker_data *)arg;

	return (void *)(long)event_base_loop(wdata->base, 0);
}

int main(int argc, char *argv[])
{
	int c;
	int baseport = 4000;
	int nbthreads;
	struct worker_data *wdata;
	unsigned long ototal = 0;
	int concurrent = 0;
	int verbose = 0;
	int i;

	while ((c = getopt(argc, argv, "cvp:")) != -1) {
		if (c == 'p')
			baseport = atoi(optarg);
		else if (c == 'c')
			concurrent = 1;
		else if (c == 'v')
			verbose++;
		else
			usage(1);
	}
	if (optind == argc)
		usage(1);
	nbthreads = atoi(argv[optind]);
	wdata = calloc(nbthreads, sizeof(struct worker_data));
	if (!wdata) {
		perror("calloc");
		return 1;
	}
	for (i = 0; i < nbthreads; i++) {
		struct sockaddr_in addr;
		pthread_t tid;

		/* every worker needs its own event, even when sharing a socket */
		wdata[i].snk_ev = calloc(1, sizeof(struct event));
		if (!wdata[i].snk_ev)
			return 1;
		if (i && concurrent) {
			wdata[i].fd = wdata[0].fd;
		} else {
			wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
			if (wdata[i].fd == -1) {
				free(wdata[i].snk_ev);
				perror("socket");
				return 1;
			}
			memset(&addr, 0, sizeof(addr));
			addr.sin_family = AF_INET;
			// addr.sin_addr.s_addr = inet_addr(argv[optind]);
			addr.sin_port = htons(baseport + i);
			if (bind(wdata[i].fd, (struct sockaddr *)&addr,
				 sizeof(addr)) < 0) {
				free(wdata[i].snk_ev);
				perror("bind");
				return 1;
			}
			fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
		}
		if (prep_thread(wdata + i)) {
			printf("failed to allocate thread %d, exit\n", i);
			exit(1);
		}
		pthread_create(&tid, NULL, worker_func, wdata + i);
	}

	/* once per second, print the aggregate packet rate */
	for (;;) {
		unsigned long total;
		long delta;

		sleep(1);
		total = 0;
		for (i = 0; i < nbthreads; i++) {
			total += wdata[i].pack_count;
		}
		delta = total - ototal;
		if (delta) {
			printf("%ld pps (%lu", delta, total);
			if (verbose) {
				for (i = 0; i < nbthreads; i++) {
					if (wdata[i].pack_count)
						printf(" %d:%lu", i,
						       wdata[i].pack_count);
				}
			}
			printf(")\n");
		}
		ototal = total;
	}
}
^ permalink raw reply
* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Tom Herbert @ 2010-05-01 5:40 UTC (permalink / raw)
To: Bill Fink; +Cc: David Miller, netdev
In-Reply-To: <20100501010735.dfe097bc.billfink@mindspring.com>
> Not being a kernel hacker, I will naively ask if the kernel tracing
> facility could somehow be used to provide the desired info (or could
> be modified to provide it).
>
We did consider kernel tracing (more in the context of implementing
RFC 4898). When trying to get per-packet timestamps, the overhead of
correlating a ktrace event with an application message is probably too
high to make it practical. If it weren't for the cost of
timestamping every single skb being received, we'd probably have
SO_TIMESTAMP turned on permanently for many connections. For now
we're settling for sampling a percentage of messages.
Tom
^ permalink raw reply
* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Tom Herbert @ 2010-05-01 5:31 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20100430.164115.257514715.davem@davemloft.net>
>> I don't see a nice way to do that, we're profiling a significant
>> percentage of millions of connections over thousands of paths as part
>> of standard operations while incurring negligible overhead. The app
>> can easily timestamp its operations, but without some mechanism
>> for getting timestamps out of a TCP connection, the networking portion
>> of servicing requests is pretty much a black box in that.
>
> If other people have an opinion about this, now would be the time
> to speak up. :-)
>
The use case that motivated this patch is really the same as that of
UDP, in that the application is receiving messages that it wants to
timestamp; in the case of TCP the application extracts the frames out of
the stream. The lack of a timestamp to discern when a message was
received over TCP is readily apparent when designing a message-based
ULP that can dynamically select which protocol to run over.
^ permalink raw reply
* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: Bill Fink @ 2010-05-01 5:07 UTC (permalink / raw)
To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20100430.164115.257514715.davem@davemloft.net>
On Fri, 30 Apr 2010, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Fri, 30 Apr 2010 00:58:32 -0700
>
> >> All these new checks and branches for a feature of questionable value.
> >
> >> If you can modify you apps to grab this information you can also probe
> >> for the information using external probing tools.
> >>
> > I don't see a nice way to do that, we're profiling a significant
> > percentage of millions of connections over thousands of paths as part
> > of standard operations while incurring negligible overhead. The app
> > can easily timestamp its operations, but without some mechanism
> > for getting timestamps out of a TCP connection, the networking portion
> > of servicing requests is pretty much a black box in that.
>
> If other people have an opinion about this, now would be the time
> to speak up. :-)
Not being a kernel hacker, I will naively ask if the kernel tracing
facility could somehow be used to provide the desired info (or could
be modified to provide it).
-Bill
^ permalink raw reply
* RE: question re: net-2.6 and net-next-2.6 trees re: patch submission
From: Elina Pasheva @ 2010-05-01 5:01 UTC (permalink / raw)
To: David Miller
Cc: dbrownell@users.sourceforge.net, Rory Filer,
netdev@vger.kernel.org
In-Reply-To: <20100430.190145.45137187.davem@davemloft.net>
> On 4/30/2010 7:05 PM David Miller wrote:
>>From: Elina Pasheva <epasheva@sierrawireless.com>
>>Date: Fri, 30 Apr 2010 17:53:14 -0700
>> If I submit a new driver to net-2.6 tree (e.g. sierra_net driver that
>> was applied to net-2.6 tree) where do I submit subsequent patches for
>> that driver - net-2.6 tree or net-next-2.6 tree?
>It depends upon the severity of the fix.
>At this stage in the game only the most serious fixes are going
>in, fixes for things that cause crashes and the like. However,
>since this is a new driver we might be a little bit more lenient, since
>changes to a new driver can harm fewer people.
Thank you, David.
Would you please apply
[PATCH 1/1] net/usb: initiate sync sequence in sierra_net.c driver
which is a very serious fix and affects the driver's functionality.
Sorry, it was my bad.
I tested this patch with USB 306.
Thanks,
Elina
^ permalink raw reply