* DMFE Driver
From: Heiko Gerstung @ 2010-06-16 15:01 UTC (permalink / raw)
To: netdev
Hi Everybody,
Just a short heads-up: I have already found someone who is willing to
work on this and who will receive hardware from me for testing. I have
also already received a first patch from someone else. This is magnificent!
You guys really know what "customer service" is :-) !
If anybody has an idea why the ASIX driver only seems to work as a
bonding group member when it is put into promiscuous mode, I would
really appreciate hearing it!
Thanks again for all your time and assistance so far!
Regards,
Heiko
--
Heiko Gerstung
*MEINBERG Funkuhren* GmbH & Co. KG
Lange Wand 9
D-31812 Bad Pyrmont, Germany
Phone: +49 (0)5281 9309-25
Fax: +49 (0)5281 9309-30
Amtsgericht Hannover 17HRA 100322
Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg,
Andre Hartmann, Heiko Gerstung
Email: heiko.gerstung@meinberg.de
Web: www.meinberg.de
------------------------------------------------------------------------
*MEINBERG - Accurate Time Worldwide*
* [PATCH net-next-2.6] inetpeer: restore small inet_peer structures
From: Eric Dumazet @ 2010-06-16 14:52 UTC (permalink / raw)
To: David Miller; +Cc: netdev
The addition of rcu_head to struct inet_peer added 16 bytes on 64-bit arches.
That's a bit unfortunate, since the old size was exactly 64 bytes.
This can be solved by using a union between this rcu_head and four fields
that are normally used only when a refcount is taken on the inet_peer.
rcu_head is used only when refcnt == -1, right before the structure is freed.
Add an inet_peer_refcheck() function to check this assertion for a while.
We can bring back the SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.
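To illustrate the size argument, here is a minimal user-space sketch. It does not use the kernel's definitions: fake_rcu_head, peer_fields, peer_separate and peer_shared are stand-in names invented for this example, and atomic_t is approximated by a plain int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rcu_head: two pointers (16 bytes on 64-bit). */
struct fake_rcu_head {
	struct fake_rcu_head *next;
	void (*func)(struct fake_rcu_head *head);
};

/* The four fields that are only valid while a refcount is held. */
struct peer_fields {
	int rid;
	int ip_id_count;
	uint32_t tcp_ts;
	uint32_t tcp_ts_stamp;
};

/* Layout before the patch: fields and rcu_head side by side. */
struct peer_separate {
	struct peer_fields f;
	struct fake_rcu_head rcu;
};

/* Layout after the patch: fields and rcu_head share the same storage. */
struct peer_shared {
	union {
		struct peer_fields f;
		struct fake_rcu_head rcu;
	};
};

/* Bytes saved by overlapping the two members. */
static size_t peer_bytes_saved(void)
{
	assert(sizeof(struct peer_shared) <= sizeof(struct peer_separate));
	return sizeof(struct peer_separate) - sizeof(struct peer_shared);
}
```

On a typical 64-bit target peer_bytes_saved() should report 16, which matches the growth described above. The overlap is only safe because the two members are never live at the same time.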
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/inetpeer.h | 31 ++++++++++++++++++++++++++-----
net/ipv4/inetpeer.c | 4 ++--
net/ipv4/route.c | 1 +
net/ipv4/tcp_ipv4.c | 11 +++++++----
4 files changed, 36 insertions(+), 11 deletions(-)
diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 6174047..51c06af 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -22,11 +22,21 @@ struct inet_peer {
__u32 dtime; /* the time of last use of not
* referenced entries */
atomic_t refcnt;
- atomic_t rid; /* Frag reception counter */
- atomic_t ip_id_count; /* IP ID for the next packet */
- __u32 tcp_ts;
- __u32 tcp_ts_stamp;
- struct rcu_head rcu;
+ /*
+ * Once inet_peer is queued for deletion (refcnt == -1), following fields
+ * are not available: rid, ip_id_count, tcp_ts, tcp_ts_stamp
+ * We can share memory with rcu_head to keep inet_peer small
+ * (less then 64 bytes)
+ */
+ union {
+ struct {
+ atomic_t rid; /* Frag reception counter */
+ atomic_t ip_id_count; /* IP ID for the next packet */
+ __u32 tcp_ts;
+ __u32 tcp_ts_stamp;
+ };
+ struct rcu_head rcu;
+ };
};
void inet_initpeers(void) __init;
@@ -37,10 +47,21 @@ struct inet_peer *inet_getpeer(__be32 daddr, int create);
/* can be called from BH context or outside */
extern void inet_putpeer(struct inet_peer *p);
+/*
+ * temporary check to make sure we dont access rid, ip_id_count, tcp_ts,
+ * tcp_ts_stamp if no refcount is taken on inet_peer
+ */
+static inline void inet_peer_refcheck(const struct inet_peer *p)
+{
+ WARN_ON_ONCE(atomic_read(&p->refcnt) <= 0);
+}
+
+
/* can be called with or without local BH being disabled */
static inline __u16 inet_getid(struct inet_peer *p, int more)
{
more++;
+ inet_peer_refcheck(p);
return atomic_add_return(more, &p->ip_id_count) - more;
}
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 349249f..bb58aed 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -64,7 +64,7 @@
* usually under some other lock to prevent node disappearing
* dtime: unused node list lock
* v4daddr: unchangeable
- * ip_id_count: idlock
+ * ip_id_count: atomic value (no lock needed)
*/
static struct kmem_cache *peer_cachep __read_mostly;
@@ -129,7 +129,7 @@ void __init inet_initpeers(void)
peer_cachep = kmem_cache_create("inet_peer_cache",
sizeof(struct inet_peer),
- 0, SLAB_PANIC,
+ 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
NULL);
/* All the timers, started at system startup tend
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a291edb..03430de 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2881,6 +2881,7 @@ static int rt_fill_info(struct net *net,
error = rt->dst.error;
expires = rt->dst.expires ? rt->dst.expires - jiffies : 0;
if (rt->peer) {
+ inet_peer_refcheck(rt->peer);
id = atomic_read(&rt->peer->ip_id_count) & 0xffff;
if (rt->peer->tcp_ts_stamp) {
ts = rt->peer->tcp_ts;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7f9515c..2e41e6f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -204,10 +204,12 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
* TIME-WAIT * and initialize rx_opt.ts_recent from it,
* when trying new connection.
*/
- if (peer != NULL &&
- (u32)get_seconds() - peer->tcp_ts_stamp <= TCP_PAWS_MSL) {
- tp->rx_opt.ts_recent_stamp = peer->tcp_ts_stamp;
- tp->rx_opt.ts_recent = peer->tcp_ts;
+ if (peer) {
+ inet_peer_refcheck(peer);
+ if ((u32)get_seconds() - peer->tcp_ts_stamp <= TCP_PAWS_MSL) {
+ tp->rx_opt.ts_recent_stamp = peer->tcp_ts_stamp;
+ tp->rx_opt.ts_recent = peer->tcp_ts;
+ }
}
}
@@ -1351,6 +1353,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
(dst = inet_csk_route_req(sk, req)) != NULL &&
(peer = rt_get_peer((struct rtable *)dst)) != NULL &&
peer->v4daddr == saddr) {
+ inet_peer_refcheck(peer);
if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
(s32)(peer->tcp_ts - req->ts_recent) >
TCP_PAWS_WINDOW) {
* [PATCH] broadcom: Add 5241 support
From: Dmitry Eremin-Solenikov @ 2010-06-16 14:38 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Matt Carlson, Michael Chan
This patch adds the 5241 PHY ID to the broadcom module.
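For readers unfamiliar with the phy_id / phy_id_mask pair used below, a driver entry is matched by comparing the masked IDs. The helper here is a small user-space sketch of that comparison; phy_id_matches is an illustrative name, not a phylib function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative helper: a device ID matches a driver entry when the bits
 * selected by the mask are equal. With mask 0xfffffff0 the low four bits
 * (the revision nibble) are ignored.
 */
static bool phy_id_matches(uint32_t dev_id, uint32_t drv_id, uint32_t mask)
{
	assert(mask != 0); /* a zero mask would match every device */
	return (dev_id & mask) == (drv_id & mask);
}
```

With the values from this patch, phy_id_matches(0x0143bc31, 0x0143bc30, 0xfffffff0) is true, so any revision of the part binds to the same driver entry.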
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
drivers/net/phy/broadcom.c | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index f482fc4..1c12a57 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -834,6 +834,21 @@ static struct phy_driver bcmac131_driver = {
.driver = { .owner = THIS_MODULE },
};
+static struct phy_driver bcm5241_driver = {
+ .phy_id = 0x0143bc30,
+ .phy_id_mask = 0xfffffff0,
+ .name = "Broadcom BCM5241",
+ .features = PHY_BASIC_FEATURES |
+ SUPPORTED_Pause | SUPPORTED_Asym_Pause,
+ .flags = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
+ .config_init = brcm_fet_config_init,
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .ack_interrupt = brcm_fet_ack_interrupt,
+ .config_intr = brcm_fet_config_intr,
+ .driver = { .owner = THIS_MODULE },
+};
+
static int __init broadcom_init(void)
{
int ret;
@@ -868,8 +883,13 @@ static int __init broadcom_init(void)
ret = phy_driver_register(&bcmac131_driver);
if (ret)
goto out_ac131;
+ ret = phy_driver_register(&bcm5241_driver);
+ if (ret)
+ goto out_5241;
return ret;
+out_5241:
+ phy_driver_unregister(&bcmac131_driver);
out_ac131:
phy_driver_unregister(&bcm57780_driver);
out_57780:
@@ -894,6 +914,7 @@ out_5411:
static void __exit broadcom_exit(void)
{
+ phy_driver_unregister(&bcm5241_driver);
phy_driver_unregister(&bcmac131_driver);
phy_driver_unregister(&bcm57780_driver);
phy_driver_unregister(&bcm50610m_driver);
--
1.7.1
* Re: 2.6.35-rc2 kernel crashes under heavy network load
From: Eric Dumazet @ 2010-06-16 14:34 UTC (permalink / raw)
To: Lazy; +Cc: linux-kernel, netdev
In-Reply-To: <AANLkTinal4PZ3ESQMIdqd8_zmnSDTNgweczNsYdGobTS@mail.gmail.com>
On Wednesday, 16 June 2010 at 15:39 +0200, Lazy wrote:
> Our linux router crashes while rebooting under heavy network load
> (800kpps generated by pktgen on other machine).
> While running, the system seems stable.
>
> Any pointers what can I do to help resolve this issue ?
>
> the system is Dell R210 X3420, 64bit kernel, debian 5.0 Broadcom BCM5716
> same thing happens on a Dell R410 running same software Broadcom BCM5716
>
> kernel version is Linux version 2.6.35-rc2 (root@cisco3-2) (gcc
> version 4.3.2 (Debian 4.3.2-1.1) ) #2 SMP Fri Jun 11 10:22:51 CEST
> 2010
>
> general protection fault: 0000 [#1] SMP
> last sysfs file: /sys/devices/platform/dcdbas/smi_data_buf_phys_addr
> CPU 1
> Modules linked in: iTCO_wdt 8021q ipmi_poweroff ipmi_devintf ipmi_si
> ipmi_msghandler mptctl loop ioatdma dca button evdev dcdbas raid10
> raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
> async_tx raid1 raid0 linear md_mod sd_mod sg sr_mod cdrom
> ide_pci_generic ide_core usbhid mptsas ata_piix mptscsih libata
> mptbase scsi_transport_sas scsi_mod ehci_hcd uhci_hcd bnx2 fan [last
> unloaded: scsi_wait_scan]
>
> Pid: 20, comm: events/1 Not tainted 2.6.35-rc2 #2 0N051F/PowerEdge R410
> RIP: 0010:[<ffffffff81087ae9>] [<ffffffff81087ae9>] drain_array+0x29/0xc7
> RSP: 0000:ffff88012fb6ddc0 EFLAGS: 00010202
> RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0720072007200720 RSI: ffff88012fc04ec0 RDI: ffff88012fc1be00
> RBP: ffff88012fb6de00 R08: 0000000000000000 R09: ffff88012fa91618
> R10: 000000102cdbdd88 R11: 0000000000000000 R12: ffff88012fc04ec0
> R13: 0720072007200720 R14: ffff88012fc04ec0 R15: 0000000000000000
The 0720072007200720 pattern (RDX and R13 above) is the signature of a known bug.
Commit 386f40c86d6c8d5b717 (Revert "tty: fix a little bug in scrup,
vt.c") will help you.
This is probably already fixed in the current git tree.
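As background on why this value points at console code: 0x0720 is what a blank cell looks like in VGA text-mode memory (character 0x20, a space, with attribute 0x07). The following sketch checks a register value for that pattern; looks_like_vga_blanks and vga_pattern_examples are names made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VGA_BLANK_CELL 0x0720u /* attribute 0x07, character 0x20 (space) */

/* Returns true when all four 16-bit lanes of v are blank VGA text cells. */
static bool looks_like_vga_blanks(uint64_t v)
{
	for (int i = 0; i < 4; i++) {
		if (((v >> (16 * i)) & 0xffffu) != VGA_BLANK_CELL)
			return false;
	}
	return true;
}

/* The RDX/R13 value from the oops matches; an ordinary pointer does not. */
static void vga_pattern_examples(void)
{
	assert(looks_like_vga_blanks(0x0720072007200720ULL));
	assert(!looks_like_vga_blanks(0xffff88012fc04ec0ULL));
}
```

A pointer-sized register full of such cells suggests that screen-clearing code wrote past its buffer into unrelated memory.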
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Arnd Bergmann @ 2010-06-16 14:24 UTC (permalink / raw)
To: Pedro Garcia; +Cc: netdev, Eric Dumazet, Ben Hutchings, Patrick McHardy
In-Reply-To: <6b5ed8108cebb1865c85e03d3244b6ed@dondevamos.com>
On Wednesday 16 June 2010, Pedro Garcia wrote:
> Probably a definitive fix would be not to allow the definition of VLAN 0
> in 802.1q module and provide some other way to tag priority packets without
> using a subinterface (maybe in the same module or a new 8021p one). I am
> having a look at the kernel to see what happens if we load two modules for
> the same protocol.
On a related note, we will also need to support 802.1Qad provider bridges
at some point, which use yet another variation of the VLAN header (actually
two nested VLAN tags) with a different ethertype.
I need this for 802.1Qbg multi-channel VEPA (possibly also 802.1Qbh
port extenders), but I have not yet investigated how to implement this
in the VLAN module.
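To make the "two nested VLAN tags with a different ethertype" concrete, here is a small user-space sketch that walks the tag stack of a raw Ethernet frame. It assumes the 802.1ad service tag uses ethertype 0x88A8 and the 802.1Q customer tag uses 0x8100; the function names and the sample frame are invented for this example and are not part of the VLAN module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_8021Q  0x8100u /* customer tag (C-tag) */
#define ETHERTYPE_8021AD 0x88A8u /* service tag (S-tag) */

/* Read a big-endian 16-bit value. */
static uint16_t be16_at(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Skip any stacked VLAN tags and return the inner ethertype, or 0 if the
 * frame is too short. *ntags receives the number of tags that were skipped.
 */
static uint16_t inner_ethertype(const uint8_t *frame, size_t len, int *ntags)
{
	size_t off = 12; /* after destination and source MAC addresses */

	assert(frame != NULL && ntags != NULL);
	*ntags = 0;
	for (;;) {
		uint16_t type;

		if (len < off + 2)
			return 0;
		type = be16_at(frame + off);
		if (type != ETHERTYPE_8021Q && type != ETHERTYPE_8021AD)
			return type;
		off += 4; /* 2 bytes TPID + 2 bytes TCI */
		(*ntags)++;
	}
}

/* Sample frame: S-tag (VID 100) around C-tag (VID 10) around IPv4. */
static const uint8_t qinq_frame[] = {
	0, 0, 0, 0, 0, 1,  0, 0, 0, 0, 0, 2, /* dst MAC, src MAC */
	0x88, 0xA8, 0x00, 0x64,              /* S-tag, VID 100 */
	0x81, 0x00, 0x00, 0x0A,              /* C-tag, VID 10 */
	0x08, 0x00,                          /* IPv4 */
};
```

For the sample frame, inner_ethertype() returns 0x0800 and reports two tags.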
> By the way, the changelog I have to write is just the text before the
> patch?
Yes.
Arnd
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Eric Dumazet @ 2010-06-16 14:24 UTC (permalink / raw)
To: Pedro Garcia; +Cc: netdev, Ben Hutchings, Patrick McHardy
In-Reply-To: <6b5ed8108cebb1865c85e03d3244b6ed@dondevamos.com>
On Wednesday, 16 June 2010 at 15:28 +0200, Pedro Garcia wrote:
>
> In my understanding, 802.1p is a "subset" of 802.1q, and they share the
> protocol number. We can do a 802.1p module, but in the end it will end
> up reusing most of the code in 802.1q.
>
I was thinking more of a default ETH_P_8021Q rx handler (e.g.
vlan_skb_recv_minimal) with minimal handling (accept only vid=0 frames),
which would be overridden by the real 8021q handler if the module is loaded.
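The "accept only vid=0" test comes down to decoding the 16-bit tag control field. A user-space sketch of that decoding follows; the struct and function names are illustrative only and are not taken from the 8021q code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vlan_tci_fields {
	uint8_t  pcp; /* priority code point, 3 bits */
	uint8_t  dei; /* drop eligible indicator (formerly CFI), 1 bit */
	uint16_t vid; /* VLAN identifier, 12 bits */
};

/* Split a host-order TCI value into its three fields. */
static struct vlan_tci_fields vlan_tci_decode(uint16_t tci)
{
	struct vlan_tci_fields f;

	f.pcp = (uint8_t)(tci >> 13);
	f.dei = (uint8_t)((tci >> 12) & 1u);
	f.vid = (uint16_t)(tci & 0x0fffu);
	return f;
}

/* VID 0 means "no VLAN, priority information only" (an 802.1p frame). */
static bool vlan_is_priority_tag(uint16_t tci)
{
	return (tci & 0x0fffu) == 0;
}

/* 0xA000: priority 5, VID 0. 0x6064: priority 3, VID 100. */
static void vlan_tci_examples(void)
{
	assert(vlan_is_priority_tag(0xA000));
	assert(vlan_tci_decode(0xA000).pcp == 5);
	assert(!vlan_is_priority_tag(0x6064));
	assert(vlan_tci_decode(0x6064).vid == 100);
}
```

A minimal handler along these lines would pass priority-tagged frames up as untagged traffic and drop everything else.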
> In any case defining a VLAN 0 ends up usually in problems with which table
> the ARP entries get stored in. This patch solves the problem to whom
> is not using VLAN 0 explicitly, but if somebody is using VLAN 0 tagging
> it will work (whatever "work" means) as before.
>
> Probably a definitive fix would be not to allow the definition of VLAN 0
> in 802.1q module and provide some other way to tag priority packets without
> using a subinterface (maybe in the same module or a new 8021p one). I am
> having a look at the kernel to see what happens if we load two modules for
> the same protocol.
>
> By the way, the changelog I have to write is just the text before the
> patch?
Yes, you can look at any existing patch for an example, such as:
git show 6e327c11a91d190650df9aabe7d3694d4838bfa1
See also section 2 of Documentation/SubmittingPatches.
* Re: [PATCH 08/12] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-06-16 14:22 UTC (permalink / raw)
To: Grant Likely
Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
Krzysztof Halasa, Thomas Gleixner
In-Reply-To: <AANLkTin4lYghEWhjzEyARvYDgHaXdniwLbfyZ4jY0rwm@mail.gmail.com>
On Tue, Jun 15, 2010 at 11:00:10AM -0600, Grant Likely wrote:
>
> Question from an ignorant reviewer: Why a new interface instead of
> working with the existing high resolution timers infrastructure?
Short answer: Timers are only one part of the PTP API. If you offer
the PTP clock as a Linux clock source, then you could just use the
existing POSIX timers. However, we decided not to offer the PTP clock
in that way. The following excerpts from an upcoming paper explain
why:
\subsection{Basic Clock Operations}
Based on our experience with a number of commercially available
hardware clocks, we identified a set of four basic clock
operations. Besides simply setting or getting the clock's time
value, two additional operations are needed for clock control.
Once a PTP slave has an initial estimate of the offset to its
master, it typically would need to shift the clock by the offset
atomically. Also, the servo loop of a slave periodically needs to
adjust the clock frequency.
\subsection{Ancillary Clock Features}
Perhaps the most challenging design issue was deciding how to offer
a PTP clock's capabilities to the GNU/Linux operating system. As
John Eidson pointed out~\cite{eidson2006measurement}, modern
operating systems provide surprisingly little support for
programming based on absolute time. As the IEEE 1588 standard is
being applied to a wide variety of test, measurement, and control
applications, we can imagine many possible ways to use an embedded
computer equipped with a PTP hardware clock. We do not expect that
any API will be able to cover every conceivable application of this
technology. However, the design presented here does cover common
use cases based on the capabilities of currently available hardware
clocks.
The design allows user space programs to control all of a clock's
ancillary features. Programs may create one-shot or periodic
alarms, with signal delivery on expiration. Timestamps on
external events are provided via a First In, First Out (FIFO)
interface. If the clock has output signals, then their periods are
configurable from user space. Synchronization of the Linux system
time via the PPS subsystem may be enabled or disabled as desired.
\subsection{Synchronizing the Linux System Clock}
One important question that needed to be addressed was, now that we
have a precise time source, how do we synchronize the Linux kernel
to it? The Linux kernel offers a modular clock infrastructure
comprising ``clock sources'' and ``clock event devices.'' Clock
sources provide a monotonically increasing time base, and clock
event devices are used to schedule the next interrupt for various
timer events.
We considered but ultimately rejected the idea of offering the PTP
clock to the Linux kernel as a combined clock source and clock
event device. The one great advantage of this approach would have
been that it obviates the need for synchronization when the PTP
clock is selected as the system timer.
However, this approach is problematic when using certain kinds of
clock hardware. For example, physical layer (PHY) chip based clocks
can only be accessed by the relatively slow 16 bit wide MDIO
bus. Such a clock would not be suitable for providing high
resolution timers, which are now a standard Linux kernel feature.
Furthermore, we cannot even be sure that a given hardware clock
will offer any interrupt to the system at all.
Instead, we elected to use the Pulse Per Second (PPS) subsystem as
a method to optionally synchronize the Linux system time to the PTP
clock. This method is feasible even for clocks that do not offer
fast register access, such as the PHY clocks. Of course, the main
disadvantage of this approach is that the Linux system time will
not be exactly synchronized to the PTP clock time. Since PTP
clocks can be synchronized an order of magnitude better than the
typical operating system scheduling latency, we expect that this
method will still yield acceptable results for many applications.
Applications with more demanding time requirements may use the new
PTP interfaces directly when needed.
\subsection{System Calls or Character Device}
When adding new functionality to an operating system, a basic design
decision is how user space programs will call into the kernel. For
the Linux kernel, there are two candidate mechanisms, namely
system calls or a ``character device.'' In an attempt to make
the PTP clock API easy to understand, we patterned it after the
existing Network Time Protocol (NTP) and the POSIX timer APIs, as
described in Section~\ref{UserAPI}. Both of these services are
exported to the user space as system calls. However, we decided to
offer the PTP clock as a character device because extending the NTP
and POSIX interfaces seemed impractical. In addition, the
character device's \fn{read()} method provides a
convenient way to deliver time stamped events to user space
programs.
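The four basic clock operations named in the excerpt (get, set, shift by an offset, adjust frequency) can be pictured with a small software model. This is only a sketch under my own assumptions: the struct, the function names and the model_tick() helper are invented here and are not the interface proposed in the patch series.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a hardware clock (all names are illustrative). */
struct model_clock {
	int64_t time_ns;  /* current clock value, nanoseconds */
	int32_t freq_ppb; /* frequency correction, parts per billion */
};

/* 1. Get the clock's time value. */
static int64_t model_gettime(const struct model_clock *c)
{
	return c->time_ns;
}

/* 2. Set the clock's time value. */
static void model_settime(struct model_clock *c, int64_t ns)
{
	c->time_ns = ns;
}

/* 3. Shift the clock by an offset in one step (atomic in real hardware). */
static void model_adjtime(struct model_clock *c, int64_t delta_ns)
{
	c->time_ns += delta_ns;
}

/* 4. Adjust the clock frequency, as a servo loop would do periodically. */
static void model_adjfreq(struct model_clock *c, int32_t ppb)
{
	c->freq_ppb = ppb;
}

/* Advance the model by raw_ns nanoseconds of the uncorrected oscillator. */
static void model_tick(struct model_clock *c, int64_t raw_ns)
{
	assert(raw_ns >= 0);
	c->time_ns += raw_ns + raw_ns * c->freq_ppb / 1000000000LL;
}
```

A slave would typically call the shift operation once after its first offset estimate and then steer the clock with small frequency adjustments only.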
* Re: [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: Eric Dumazet @ 2010-06-16 14:14 UTC (permalink / raw)
To: hadi
Cc: Tom Hughes, netdev, akpm, David S. Miller, Stephen Hemminger,
Patrick McHardy, Tejun Heo, linux-kernel
In-Reply-To: <1276694745.3862.1.camel@bigi>
On Wednesday, 16 June 2010 at 09:25 -0400, jamal wrote:
> On Wed, 2010-06-16 at 09:24 +0100, Tom Hughes wrote:
> > The sch_teql module, which can be used to load balance over a set of
> > underlying interfaces, stopped working after 2.6.30 and has been
> > broken in all kernels since then for any underlying interface which
> > requires the addition of link level headers.
> >
> > The problem is that the transmit routine relies on being able to
> > access the destination address in the skb in order to do address
> > resolution once it has decided which underlying interface it is going
> > to transmit through.
> >
> > In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
> > default for all interfaces, which causes the destination address to be
> > released before the transmit routine for the interface is called.
> >
> > The solution is to clear that flag for teql interfaces.
> >
> > Signed-off-by: Tom Hughes <tom@compton.nu>
>
> Sounds reasonable. Lets CC Eric and get his ACK.
>
> cheers,
> jamal
>
Sure, I already acked it in a previous message five days ago (although that
was not a formal patch; Stephen forwarded a bugzilla entry):
http://permalink.gmane.org/gmane.linux.network/163688
David, could you please add the bugzilla entry to the commit?
https://bugzilla.kernel.org/show_bug.cgi?id=16183
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Thanks !
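Tom's patch itself is not quoted in this message, so the following is only a toy user-space model of the idea described above: every device starts with the flag set, and a teql-like setup routine clears it so that the dst stays attached until the transmit routine runs. The flag value, the struct and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit value; the real definition lives in the kernel headers. */
#define DEMO_IFF_XMIT_DST_RELEASE 0x400u

struct demo_net_device {
	unsigned int priv_flags;
};

/* Models the default since 2.6.31: every new device gets the flag. */
static void demo_default_setup(struct demo_net_device *dev)
{
	assert(dev != NULL);
	dev->priv_flags |= DEMO_IFF_XMIT_DST_RELEASE;
}

/* Models the fix: a teql-like device opts out and keeps the dst. */
static void demo_teql_setup(struct demo_net_device *dev)
{
	assert(dev != NULL);
	dev->priv_flags &= ~DEMO_IFF_XMIT_DST_RELEASE;
}

/* Would the core drop the dst before calling the transmit routine? */
static bool demo_dst_released_before_xmit(const struct demo_net_device *dev)
{
	return (dev->priv_flags & DEMO_IFF_XMIT_DST_RELEASE) != 0;
}
```

Only the one bit is cleared, so any other private flags on the device are left untouched.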
* ipv6: netif_carrier_(on|off) with traces afterwards
From: Einar EL Lueck @ 2010-06-16 13:29 UTC (permalink / raw)
To: netdev
Hi,
With IPv6 addresses configured and a network card doing
netif_carrier_off/on, we afterwards see some traces in fib in 2.6.34 on
S/390.
Example sequence of operations:
ip -6 addr add fd00:10:30:49:4008:ffff:35:2/80 dev eth1
ip link set eth1 up
ip link set eth1 down
# the following lines have the effect of netif_carrier_off and then
# netif_carrier_on (among other stuff)
echo 0 > /sys/bus/ccwgroup/drivers/qeth/devices/0.0.e300/online
echo 1 > /sys/bus/ccwgroup/drivers/qeth/devices/0.0.e300/online
# end of plugging
ip link set eth1 up
ip link set eth1 down
=> at this point we get the following trace:
Badness at /home/autobuild/BUILD/linux-2.6.34-20100531/net/ipv6/ip6_fib.c:1160
Modules linked in: qeth_l2 sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup
CPU: 9 Not tainted 2.6.34-43.x.20100531-s390xperformance #1
Process ip (pid: 18144, task: 000000007c304238, ksp: 00000000033af428)
Krnl PSW : 0704200180000000 000003c0018c05a4 (fib6_del+0x60/0x3f8 [ipv6])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000002 000000007b58a700 0000000000830398
           0000000000830398 0000000000000001 0000000000000000 0000000000830398
           000003c0018bba5e 00000000033af480 0000000071ff2780 000000007b58a700
           000003c0018a7000 000003c0018e4d50 00000000033af398 00000000033af2e0
Krnl Code: 000003c0018c0598: eb6ff1000004 lmg %r6,%r15,256(%r15)
000003c0018c059e: 07f4 bcr 15,%r4
000003c0018c05a0: a7f40001 brc 15,3c0018c05a2
>000003c0018c05a4: a728fffe lhi %r2,-2
000003c0018c05a8: a7f4fff3 brc 15,3c0018c058e
000003c0018c05ac: b90200aa ltgr %r10,%r10
000003c0018c05b0: a784ffed brc 8,3c0018c058a
000003c0018c05b4: e32033b00020 cg %r2,944(%r3)
Call Trace:
([<0000000000000020>] 0x20)
[<000003c0018bb9c6>] __ip6_del_rt+0x6e/0xc8 [ipv6]
[<000003c0018bba5e>] ip6_del_rt+0x3e/0x4c [ipv6]
[<000003c0018b4448>] __ipv6_ifa_notify+0x13c/0x1d4 [ipv6]
[<000003c0018b7072>] addrconf_ifdown+0x3b2/0x5f0 [ipv6]
[<000003c0018b8b2c>] addrconf_notify+0xb4/0x944 [ipv6]
[<00000000004fb6aa>] notifier_call_chain+0x5a/0xa0
[<000000000016675a>] raw_notifier_call_chain+0x2a/0x3c
[<0000000000459194>] __dev_notify_flags+0x88/0xac
[<0000000000459200>] dev_change_flags+0x48/0x70
[<00000000004664d4>] do_setlink+0x35c/0x60c
[<00000000004678fa>] rtnl_newlink+0x446/0x574
[<000000000047c89c>] netlink_rcv_skb+0xdc/0xf0
[<00000000004671a8>] rtnetlink_rcv+0x3c/0x48
[<000000000047c35e>] netlink_unicast+0x352/0x3ac
[<000000000047cebe>] netlink_sendmsg+0x22a/0x35c
[<00000000004414c4>] sock_sendmsg+0xdc/0x100
[<00000000004417ca>] SyS_sendmsg+0x182/0x2d4
[<000000000043f17c>] SyS_socketcall+0x150/0x338
[<0000000000119346>] sysc_noemu+0x10/0x16
[<0000004e1271c71a>] 0x4e1271c71a
Last Breaking-Event-Address:
[<000003c0018c05a0>] fib6_del+0x5c/0x3f8 [ipv6]
Has anybody seen effects like this before on other platforms, or does
anybody have suggestions for the root cause?
Thanks,
Einar.
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Pedro Garcia @ 2010-06-16 13:28 UTC (permalink / raw)
To: netdev; +Cc: Eric Dumazet, Ben Hutchings, Patrick McHardy
In-Reply-To: <4C18B898.4000307@trash.net>
On Wed, 16 Jun 2010 13:42:16 +0200, Patrick McHardy <kaber@trash.net>
wrote:
> Eric Dumazet wrote:
>> On Wednesday, 16 June 2010 at 10:49 +0200, Pedro Garcia wrote:
>>> Here it is again. I added the modifications in
>>> http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW
>>> accelerated incoming packets (it did not apply cleanly on the last
>>> version of
>>> the kernel, so I applied manually). Now, if the VLAN 0 is not
>>> explicitly created by the user, VLAN 0 packets will be treated as no
>>> VLAN (802.1p packets), instead of dropping them.
>>>
>>> The patch is now for two files: vlan_core (accel) and vlan_dev (non
>>> accel)
>>>
>>> I can not test on HW accelerated devices, so if someone can check it I
>>> will appreciate it (even though in the thread above it looked like it works).
>>> For non accel I tested in 2.6.26. Now the patch is for
>>> net-next-2.6, and it compiles OK, but I have to set up a test
>>> environment to check it is still OK (should, but better to test).
>>>
>>> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
>>>
>>
>> OK, the patch itself is correct.
>>
>
> Yes, looks fine to me as well.
>
>> Now, could you please send it again with a proper changelog ?
>>
>> In this changelog, please explain why patch is needed, and
>> keep lines short (< 72 chars), like the one you did in your first mail.
>>
>> I'll then add my Signed-off-by, since I wrote the accelerated part ;)
>>
>> Note : I wonder if another patch is needed, in case 8021q module is
>> _not_ loaded. We probably should accept vlan 0 frames in this case ?
>>
>
> I agree that this would be best for consistency, but that would mean
> adding more special cases to __netif_receive_skb().
In my understanding, 802.1p is a "subset" of 802.1q, and they share the
protocol number. We could write an 802.1p module, but in the end it would
reuse most of the code in 802.1q.
In any case, defining a VLAN 0 usually ends up causing problems with which
table the ARP entries get stored in. This patch solves the problem for anyone
who is not using VLAN 0 explicitly, and if somebody is using VLAN 0 tagging
it will work (whatever "work" means) as before.
Probably a definitive fix would be to disallow the definition of VLAN 0
in the 802.1q module and provide some other way to tag priority packets without
using a subinterface (maybe in the same module or a new 8021p one). I am
having a look at the kernel to see what happens if we load two modules for
the same protocol.
By the way, is the changelog I have to write just the text before the
patch?
Pedro
* Re: [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: jamal @ 2010-06-16 13:25 UTC (permalink / raw)
To: Tom Hughes
Cc: netdev, akpm, David S. Miller, Eric Dumazet, Stephen Hemminger,
Patrick McHardy, Tejun Heo, linux-kernel
In-Reply-To: <1276676668-10256-1-git-send-email-tom@compton.nu>
On Wed, 2010-06-16 at 09:24 +0100, Tom Hughes wrote:
> The sch_teql module, which can be used to load balance over a set of
> underlying interfaces, stopped working after 2.6.30 and has been
> broken in all kernels since then for any underlying interface which
> requires the addition of link level headers.
>
> The problem is that the transmit routine relies on being able to
> access the destination address in the skb in order to do address
> resolution once it has decided which underlying interface it is going
> to transmit through.
>
> In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
> default for all interfaces, which causes the destination address to be
> released before the transmit routine for the interface is called.
>
> The solution is to clear that flag for teql interfaces.
>
> Signed-off-by: Tom Hughes <tom@compton.nu>
Sounds reasonable. Let's CC Eric and get his ACK.
cheers,
jamal
* Re: rt hash table / rt hash locks question
From: Nick Piggin @ 2010-06-16 12:49 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1276691258.2632.55.camel@edumazet-laptop>
On Wed, Jun 16, 2010 at 02:27:38PM +0200, Eric Dumazet wrote:
> On Wednesday, 16 June 2010 at 20:46 +1000, Nick Piggin wrote:
> > I'm just converting this scalable dentry/inode hash table to a more
> > compact form. I was previously using a dumb spinlock per bucket,
> > but this doubles the size of the tables so isn't production quality.
> >
>
> Yes, we had this in the past (one rwlock or spinlock per hash chain),
> and it was not very good with LOCKDEP on.
Sure :) And it halves the size of your hash even with lockdep off.
> > What I've done at the moment is to use a bit_spinlock in bit 0 of each
> > list pointer of the table. Bit spinlocks are now pretty nice because
> > we can do __bit_spin_unlock() which gives non-atomic store with release
> > ordering, so it should be almost as fast as spinlock.
> >
> > But I look at rt hash and it seems you use a small hash on the side
> > for spinlocks. So I wonder, pros for each:
> >
> > - bitlocks have effectively zero storage
> yes but a mask is needed to get head pointer. Special care also must
> be taken when insert/delete a node in chain, keeping this bit set.
That is true. Overall, I don't know what would be better for straight
line cycles, all in L1 cache. Probably the spinlocks, although there
is some small overhead from loading the 2nd hash.
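For comparison, the "small hash of spinlocks on the side" scheme can be sketched in user space as follows. This is not the route cache code: the table sizes, the names and the use of C11 atomic_flag as a stand-in spinlock are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NBUCKETS 1024u /* hash table size (power of two) */
#define NLOCKS   16u   /* side table of locks (power of two, <= NBUCKETS) */

static atomic_flag bucket_locks[NLOCKS];

/* Put every lock into the unlocked state. */
static void bucket_locks_init(void)
{
	for (size_t i = 0; i < NLOCKS; i++)
		atomic_flag_clear(&bucket_locks[i]);
}

/* Many buckets share one lock: this is the second, smaller hash. */
static size_t lock_index(size_t bucket)
{
	assert(bucket < NBUCKETS);
	return bucket & (NLOCKS - 1);
}

static void bucket_lock(size_t bucket)
{
	while (atomic_flag_test_and_set_explicit(&bucket_locks[lock_index(bucket)],
						 memory_order_acquire))
		; /* spin */
}

static void bucket_unlock(size_t bucket)
{
	atomic_flag_clear_explicit(&bucket_locks[lock_index(bucket)],
				   memory_order_release);
}
```

The extra load mentioned above is the access to bucket_locks[], which lives on a different cache line than the bucket itself.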
> > - bitlocks hit the same cacheline that the hash walk hits.
> yes
> > - in RCU list, locked hash walks usually followed by hash modification,
> > bitlock should have brought in the line for exclusive.
> But we usually perform a read only lookup, _then_ take the lock, to
> perform a new lookup before insert. So at time we would take the
> bitlock, cache line is in shared state. With spinlocks, we always use
> the exclusive mode, but on a separate cache line...
Hmm, OK. This is usually true of the dcache and icache as well
actually. But you still have the same problem with spinlocks (with I
presume the common case of 0 or 1 entry in the hash) when inserting
into the table.
So we're still often avoiding one cacheline transition, and avoiding
hitting one cacheline.
> > - bitlock number of locks scales with hash size
> Yes, but concurrency is more a function of online cpus, given we use
> jhash.
Oh yeah but it has a maximum upper bound on the number of buckets in
the hash, and just having it scale nicely avoids the ifdef heuristics
in the existing code.
> > - spinlocks may be slightly better at the cacheline level (bitops
> > sometimes require explicit load which may not acquire exclusive
> > line on some archs). On x86 ll/sc architectures, this shouldn't
> > be a problem.
> Yes, you can add fairness (if ticket spinlocks variant used), but on
> route cache I really doubt it can make a difference.
Yes if the critical sections are very short and uncontended, I don't
think it's a large factor.
> > - spinlocks better debugging (could be overcome with a LOCKDEP
> > option to revert to spinlocks, but a bit ugly).
> Definitely a good thing.
>
> > - in practice, contention due to aliasing in buckets to lock mapping
> > is probably fairly minor.
> Agreed
> >
> > Net code is obviously tested and tuned well, but instinctively I would
> > have thought bitlocks are the better way to go. Any comments on this?
>
> Well, to be honest, this code is rather old, and at the time I wrote it,
> bitlocks were probably not available.
>
> You can add :
>
> - One downside of the hashed spinlocks is the X86_INTERNODE_CACHE_SHIFT
> being 12 on X86_VSMP : All locks are probably in same internode block :(
Oh yeah that's true, very special case though.
> - Another downside is all locks are currently on a single NUMA node,
> since we kmalloc() them in one contiguous chunk.
>
> So I guess it would be worth to try :)
OK, this is what I'm working with for the icache/dcache to hide the
details of masking out the low bit. But it looks like rt hash is a bit
more highly tuned (eg without pprev pointer as you don't delete items
without walking the hash). So it might not be appropriate for you.
You might be able to derive some macros to hide some of the pain though.
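Before the kernel patch, here is a stand-alone user-space sketch of the underlying idea: bit 0 of the head pointer doubles as the lock, so the lock costs no extra storage. It uses C11 atomics instead of the kernel's bit_spin_lock(), and every name in it is invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct bl_node {
	struct bl_node *next;
	int key;
};

struct bl_head {
	_Atomic uintptr_t first; /* bit 0: lock, remaining bits: pointer */
};

/* Spin until this caller is the one that flips bit 0 from 0 to 1. */
static void bl_lock(struct bl_head *h)
{
	while (atomic_fetch_or_explicit(&h->first, (uintptr_t)1,
					memory_order_acquire) & 1)
		;
}

/* Only the holder changes the word, so a plain release store suffices. */
static void bl_unlock(struct bl_head *h)
{
	uintptr_t v = atomic_load_explicit(&h->first, memory_order_relaxed);

	assert(v & 1); /* must be locked */
	atomic_store_explicit(&h->first, v & ~(uintptr_t)1,
			      memory_order_release);
}

/* Mask out the lock bit to get the real head pointer. */
static struct bl_node *bl_first(struct bl_head *h)
{
	uintptr_t v = atomic_load_explicit(&h->first, memory_order_acquire);

	return (struct bl_node *)(v & ~(uintptr_t)1);
}

/* Insert at the head; the caller holds the lock, so keep bit 0 set. */
static void bl_add_head(struct bl_head *h, struct bl_node *n)
{
	uintptr_t v = atomic_load_explicit(&h->first, memory_order_relaxed);

	assert(v & 1);
	n->next = (struct bl_node *)(v & ~(uintptr_t)1);
	atomic_store_explicit(&h->first, (uintptr_t)n | 1,
			      memory_order_release);
}
```

The scheme relies on nodes being at least 2-byte aligned, so that bit 0 of a valid pointer is always zero.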
Index: linux-2.6/include/linux/list_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/list_bl.h
@@ -0,0 +1,97 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ */
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)h->first & 1UL));
+#endif
+ h->first = (struct hlist_bl_node *)((unsigned long)n | 1UL);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *ptr)
+{
+ return !((unsigned long)ptr->first & ~1UL);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)h->first & 1UL));
+#endif
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+ struct hlist_bl_node *next = n->next;
+ struct hlist_bl_node **pprev = n->pprev;
+ *pprev = (struct hlist_bl_node *)((unsigned long)next | ((unsigned long)*pprev & 1UL));
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_for_each_entry - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member) \
+ for (pos = hlist_bl_first(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = pos->next)
+
+#endif
Index: linux-2.6/include/linux/rculist_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rculist_bl.h
@@ -0,0 +1,123 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+#ifdef __KERNEL__
+
+/*
+ * RCU-protected list version
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)h->first & 1UL));
+#endif
+ rcu_assign_pointer(h->first, (struct hlist_bl_node *)((unsigned long)n | 1UL));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)rcu_dereference(h->first) & ~1UL);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on the node returns true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so hlist_bl_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ n->pprev = NULL;
+ }
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs. Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member) \
+ for (pos = hlist_bl_first_rcu(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+ pos = rcu_dereference_raw(pos->next))
+
+#endif
+#endif
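To make the pointer tagging in the patch above easier to follow, here is a minimal userspace sketch of the same idea. It is not the kernel code: the bl_* names are made up for illustration, and bl_lock()/bl_unlock() are single-threaded stand-ins for a real bit spinlock. It only shows that the head word carries the lock in bit 0, that readers must mask it off, and that insert and delete must preserve it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace sketch only: names loosely mirror the patch, not kernel code. */
struct bl_node {
	struct bl_node *next, **pprev;
};

struct bl_head {
	struct bl_node *first;	/* bit 0 doubles as the lock bit */
};

#define BL_LOCKMASK ((uintptr_t)1)

/* Stand-ins for a real bit spinlock; valid single-threaded only. */
static void bl_lock(struct bl_head *h)
{
	h->first = (struct bl_node *)((uintptr_t)h->first | BL_LOCKMASK);
}

static void bl_unlock(struct bl_head *h)
{
	h->first = (struct bl_node *)((uintptr_t)h->first & ~BL_LOCKMASK);
}

static int bl_is_locked(const struct bl_head *h)
{
	return ((uintptr_t)h->first & BL_LOCKMASK) != 0;
}

/* Strip the lock bit before following the pointer. */
static struct bl_node *bl_first(const struct bl_head *h)
{
	return (struct bl_node *)((uintptr_t)h->first & ~BL_LOCKMASK);
}

static int bl_empty(const struct bl_head *h)
{
	return bl_first(h) == NULL;
}

/* Insert at the head; the caller holds the lock, so the bit must stay set. */
static void bl_add_head(struct bl_node *n, struct bl_head *h)
{
	struct bl_node *first = bl_first(h);

	assert(bl_is_locked(h));
	n->next = first;
	if (first)
		first->pprev = &n->next;
	n->pprev = &h->first;
	h->first = (struct bl_node *)((uintptr_t)n | BL_LOCKMASK);
}

/* Unlink n; if n was first, *pprev is the head word, so keep its lock bit. */
static void bl_del(struct bl_node *n)
{
	struct bl_node *next = n->next;
	struct bl_node **pprev = n->pprev;

	*pprev = (struct bl_node *)((uintptr_t)next |
				    ((uintptr_t)*pprev & BL_LOCKMASK));
	if (next)
		next->pprev = pprev;
	n->pprev = NULL;
}
```

The sketch relies on nodes being at least 2-byte aligned, so that bit 0 of any node address is always zero and free to carry the lock.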
^ permalink raw reply
* Re: rt hash table / rt hash locks question
From: Eric Dumazet @ 2010-06-16 12:27 UTC (permalink / raw)
To: Nick Piggin; +Cc: netdev
In-Reply-To: <20100616104633.GW6138@laptop>
On Wednesday, 16 June 2010 at 20:46 +1000, Nick Piggin wrote:
> I'm just converting this scalable dentry/inode hash table to a more
> compact form. I was previously using a dumb spinlock per bucket,
> but this doubles the size of the tables so isn't production quality.
>
Yes, we had this in the past (one rwlock or spinlock per hash chain),
and it was not very good with LOCKDEP on.
> What I've done at the moment is to use a bit_spinlock in bit 0 of each
> list pointer of the table. Bit spinlocks are now pretty nice because
> we can do __bit_spin_unlock() which gives non-atomic store with release
> ordering, so it should be almost as fast as spinlock.
>
> But I look at rt hash and it seems you use a small hash on the side
> for spinlocks. So I wonder, pros for each:
>
> - bitlocks have effectively zero storage
Yes, but a mask is needed to get the head pointer. Special care must also
be taken when inserting or deleting a node in the chain, to keep this bit set.
> - bitlocks hit the same cacheline that the hash walk hits.
yes
> - in RCU list, locked hash walks usually followed by hash modification,
> bitlock should have brought in the line for exclusive.
But we usually perform a read only lookup, _then_ take the lock, to
perform a new lookup before insert. So at the time we would take the
bitlock, the cache line is in shared state. With spinlocks, we always use
the exclusive mode, but on a separate cache line...
> - bitlock number of locks scales with hash size
Yes, but concurrency is more a function of online cpus, given we use
jhash.
> - spinlocks may be slightly better at the cacheline level (bitops
> sometimes require explicit load which may not acquire exclusive
> line on some archs). On x86 and ll/sc architectures, this shouldn't
> be a problem.
Yes, you can add fairness (if ticket spinlocks variant used), but on
route cache I really doubt it can make a difference.
> - spinlocks better debugging (could be overcome with a LOCKDEP
> option to revert to spinlocks, but a bit ugly).
Definitely a good thing.
> - in practice, contention due to aliasing in buckets to lock mapping
> is probably fairly minor.
Agreed
>
> Net code is obviously tested and tuned well, but instinctively I would
> have thought bitlocks are the better way to go. Any comments on this?
Well, to be honest, this code is rather old, and at the time I wrote it,
bitlocks were probably not available.
You can add :
- One downside of the hashed spinlocks is the X86_INTERNODE_CACHE_SHIFT
being 12 on X86_VSMP : All locks are probably in same internode block :(
- Another downside is all locks are currently on a single NUMA node,
since we kmalloc() them in one contiguous chunk.
So I guess it would be worth trying :)
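The "small hash on the side for spinlocks" discussed above can be sketched roughly as follows. This is an illustration of the general scheme, not the route cache's actual code; NBUCKETS, NLOCKS and the function names are hypothetical. It shows how several buckets map onto one lock (the aliasing Nick mentions) and how small the contiguous lock array is, which is why it tends to land in one internode block or on one NUMA node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for illustration; both must be powers of two. */
#define NBUCKETS 1024u
#define NLOCKS   64u

/* Many buckets share one lock: bucket i uses lock (i mod NLOCKS). */
static unsigned int lock_index(unsigned int bucket)
{
	assert(bucket < NBUCKETS);
	return bucket & (NLOCKS - 1);
}

/* Two buckets contend on the same lock when their low bits match. */
static int buckets_alias(unsigned int a, unsigned int b)
{
	return lock_index(a) == lock_index(b);
}

/* Extra memory for the side table, given the size of one lock. */
static size_t side_table_bytes(size_t lock_size)
{
	return (size_t)NLOCKS * lock_size;
}
```

With these example sizes, buckets 3 and 67 share a lock while buckets 3 and 4 do not, and a table of 4-byte locks occupies only 256 bytes.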
^ permalink raw reply
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Patrick McHardy @ 2010-06-16 11:42 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Pedro Garcia, netdev, Ben Hutchings
In-Reply-To: <1276679284.2632.22.camel@edumazet-laptop>
Eric Dumazet wrote:
> On Wednesday, 16 June 2010 at 10:49 +0200, Pedro Garcia wrote:
>> Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
>> the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
>>
>> The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
>>
>> I cannot test on HW accelerated devices, so if someone can check it I would appreciate it (even though in the thread above it looked like it works). For non accel I tested in 2.6.26. Now the patch is for
>> net-next-2.6, and it compiles OK, but I have to set up a test environment to check it is still OK (it should be, but better to test).
>>
>> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
>>
>
> OK, the patch itself is correct.
>
Yes, looks fine to me as well.
> Now, could you please send it again with a proper changelog ?
>
> In this changelog, please explain why patch is needed, and
> keep lines short (< 72 chars), like the one you did in your first mail.
>
> I'll then add my Signed-off-by, since I wrote the accelerated part ;)
>
> Note : I wonder if another patch is needed, in case 8021q module is
> _not_ loaded. We probably should accept vlan 0 frames in this case ?
>
I agree that this would be best for consistency, but that would mean
adding more special cases to __netif_receive_skb().
^ permalink raw reply
* DMFE Driver in current kernel
From: Heiko Gerstung @ 2010-06-16 10:52 UTC (permalink / raw)
To: netdev
Dear netdev members,
I apologize for contacting this list, I was pointed here by Tobias
Ringstroem who is listed as the current maintainer for the dmfe network
driver but he told me that this is not true anymore. I believe that
probably no one is looking at this driver anymore, and he suggested that
I should contact you guys because he cannot help me.
Here is my long-winded story: We are a manufacturer of special network
appliances used for time synchronization and in our LANTIME time server
appliances we are currently using GNU/Linux on an embedded i386 CPU
board with a Davicom network chip.
I am desperately missing an important feature in the dmfe driver (which
still lists Tobias as the current maintainer): I would want to be able to change
network speed and duplex settings without having to unload/reload the
kernel module, preferably by using ethtool.
Would anyone here be interested in adding support for this into the driver?
If yes, we would want to pay for this (and of course we would love
to see that these changes are included in the next kernel version, if
possible). If nobody on this list is interested in this job, I would
appreciate recommendations regarding where to look for someone. If I do
not manage to find anyone for this task, I will try it on my own.
Pointers to suitable documentation or hints would be appreciated!
I know that this is a pretty old chip, but it seems that it is still
used in a number of embedded systems and since we want to be able to
upgrade our existing units in the field with our new firmware, I really
would like to get this functionality into the driver.
Another point I am looking for is how to get the asix driver to play
nice when used in combination with active-backup bonding. As it seems,
the current driver does not work unless I put the bonding interface into
promiscuous mode (which I found out by accident when I used tcpdump for
debugging). If anyone has an idea how to improve this, I would be very
grateful, too.
Thanks for your time,
Heiko
^ permalink raw reply
* rt hash table / rt hash locks question
From: Nick Piggin @ 2010-06-16 10:46 UTC (permalink / raw)
To: netdev
I'm just converting this scalable dentry/inode hash table to a more
compact form. I was previously using a dumb spinlock per bucket,
but this doubles the size of the tables so isn't production quality.
What I've done at the moment is to use a bit_spinlock in bit 0 of each
list pointer of the table. Bit spinlocks are now pretty nice because
we can do __bit_spin_unlock() which gives non-atomic store with release
ordering, so it should be almost as fast as spinlock.
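As a rough illustration of the locking scheme described above, here is a userspace sketch using C11 atomics. It is an assumption-laden model, not the kernel's bit_spin_lock()/__bit_spin_unlock() implementation; the function names are invented. The point it shows is that acquiring needs an atomic read-modify-write, while releasing can be a plain load followed by a store with release ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_BIT ((uintptr_t)1)

/* Try once to take the lock held in bit 0 of *word. Returns 1 on success. */
static int bit_trylock(_Atomic uintptr_t *word)
{
	uintptr_t old = atomic_fetch_or_explicit(word, LOCK_BIT,
						 memory_order_acquire);
	return !(old & LOCK_BIT);
}

/* Spin until bit 0 is ours. */
static void bit_lock(_Atomic uintptr_t *word)
{
	while (!bit_trylock(word)) {
		/* Wait with plain loads so the line can stay shared. */
		while (atomic_load_explicit(word, memory_order_relaxed) & LOCK_BIT)
			;
	}
}

/*
 * Unlock without an atomic read-modify-write. This assumes only the lock
 * holder changes the other bits of the word, so a plain load followed by
 * a release store is enough.
 */
static void bit_unlock(_Atomic uintptr_t *word)
{
	uintptr_t v = atomic_load_explicit(word, memory_order_relaxed);

	assert(v & LOCK_BIT);
	atomic_store_explicit(word, v & ~LOCK_BIT, memory_order_release);
}
```

The remaining bits of the word are untouched by lock and unlock, which is what lets the same word hold the list head pointer.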
But I look at rt hash and it seems you use a small hash on the side
for spinlocks. So I wonder, pros for each:
- bitlocks have effectively zero storage
- bitlocks hit the same cacheline that the hash walk hits.
- in RCU list, locked hash walks usually followed by hash modification,
bitlock should have brought in the line for exclusive.
- bitlock number of locks scales with hash size
- spinlocks may be slightly better at the cacheline level (bitops
sometimes require explicit load which may not acquire exclusive
line on some archs). On x86 and ll/sc architectures, this shouldn't
be a problem.
- spinlocks better debugging (could be overcome with a LOCKDEP
option to revert to spinlocks, but a bit ugly).
- in practice, contention due to aliasing in buckets to lock mapping
is probably fairly minor.
Net code is obviously tested and tuned well, but instinctively I would
have thought bitlocks are the better way to go. Any comments on this?
Thanks,
Nick
^ permalink raw reply
* Re: BUG: unable to handle kernel paging request at 000041ed00000001
From: Arturas @ 2010-06-16 10:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <96E62960-9EF8-4F05-92DD-2D7477D0D78B@res.lt>
Your new patch doesn't show any warnings about tx queue
and everything is working as expected.
On Jun 14, 2010, at 12:27 PM, Arturas wrote:
>
> On Jun 14, 2010, at 11:31 AM, Eric Dumazet wrote:
>> But your problem is about bridge, not bonding (see trace).
> I want it for performance reason, not because of this bug.
> Bridge isn't a bottleneck for me, but bonding may be and not to me only,
> but for many people. I believe that performance gain would be more
> than 1% on cpu? :-)
>
>>
>> And 2.6.34 won't accept such changes, it's already released.
> It can be as a separate patch or I can test 2.6.35 if it would accept
> such change. I just need a stable kernel with good performance :-)
>
>>
>>> I also have another issue with NMI. On older machine with 5500 xeons i
>>> have almost no overhead with nmi_watchdog enabled, but on this it is about twice.
>>> without nmi enabled cpu peak average is 30%, and with nmi enabled i have 53%.
>>> When traffic is not passing all cpus are idling at 100%.
>>> Maybe overhead could be a little bit smaller? :-)
>>>
>>
>> I am a bit lost here, NMI has little to do with the network stack ;)
> May this be related to very recent cpu? As i understand NMI depends on CPU.
>
>>
>>
>> Could you please test another patch ?
> Applied, it's working correctly for now. If i'll get a warning i'll write you or maybe I
> shouldn't get it if a patch is correct?
>
>>
>> Before calling sk_tx_queue_set(sk, queue_index); we should check if dst
>> dev is current device.
>
^ permalink raw reply
* Re: [PATCH 12/12] ptp: Added a clock driver for the National Semiconductor PHYTER.
From: Richard Cochran @ 2010-06-16 10:05 UTC (permalink / raw)
To: Grant Likely
Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
Krzysztof Halasa
In-Reply-To: <AANLkTilqd-wxqMWDxssC-XMDJqZ8iIiSrvp2t8xARAQV@mail.gmail.com>
On Tue, Jun 15, 2010 at 12:49:13PM -0600, Grant Likely wrote:
> Won't this break things for existing DP83640 users?
Nope, the driver was only added five patches ago, and it only offers
the timestamping stuff. The standard PHY functions just call the
generic functions, so the PHY works fine even without this driver.
> > +static struct ptp_clock *dp83640_clock;
> > +DEFINE_SPINLOCK(clock_lock); /* protects the one and only dp83640_clock */
>
> Why only one? Is it not possible to have 2 of these PHYs in a system?
Yes, you can have multiple PHYs, but only one PTP clock.
If you do use multiple PHYs, then you must wire their clocks together
and adjust the PTP clock on only one of the PHYs.
Thanks for your other comments,
Richard
^ permalink raw reply
* Re: mpd client timeouts (bisected) 2.6.35-rc3
From: Christian Kujau @ 2010-06-16 10:01 UTC (permalink / raw)
To: markus@trippelsdorf.de
Cc: John Fastabend, David Miller, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, yanmin_zhang, alex.shi, tim.c.chen
In-Reply-To: <20100613205922.GA1806@arch.tripp.de>
On Sun, 13 Jun 2010 at 22:59, markus@trippelsdorf.de wrote:
> This solves the problem here. Thanks.
Not sure if this is related, but I've noticed connection timeouts and
connections going in FIN_WAIT2 state (most of them SSH tunnels) with
2.6.35. Going back to 2.6.34 or applying John's (and Eric's) patch
does seem to fix this issue.
Thanks,
Christian.
--
BOFH excuse #168:
le0: no carrier: transceiver cable problem?
^ permalink raw reply
* Re: Proposed linux kernel changes : scaling tcp/ip stack
From: Andi Kleen @ 2010-06-16 9:10 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
In-Reply-To: <97746864-ED54-4A12-AFE7-752AA6E41CDD@earthlink.net>
Mitchell Erblich <erblichs@earthlink.net> writes:
>
> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> in the generic kernel. TBD.
>
> This change should have no effect with normal available kernel mem allocs.
>
> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> our last pages for input skbufs and not for xmit allocs.
How about you instrument a kernel and measure if this really happens
frequently under reasonable loads? That is you can probably
use the existing dropped page counters in netstat
Stephen added some time ago.
Since soft irqs cannot really wait, an exhausted GFP_ATOMIC would normally
lead to dropped packets. FWIW I am not aware of any serious dropped
packets problem on normal loads.
Running a kernel with nearly zero free memory is dangerous anyway
-- pretty much any kernel service can fail arbitrarily --
if this happened frequently I suspect we would need a generic
VM solution for it.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Eric Dumazet @ 2010-06-16 9:08 UTC (permalink / raw)
To: Pedro Garcia; +Cc: netdev, Patrick McHardy, Ben Hutchings
In-Reply-To: <311b59aee7d648c6124a84b5ca06ac60@dondevamos.com>
On Wednesday, 16 June 2010 at 10:49 +0200, Pedro Garcia wrote:
> On Mon, 14 Jun 2010 21:12:52 +0200, Eric Dumazet <eric.dumazet@gmail.com>
> > Good luck for your first patch !
>
> Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
> the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
>
> The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
>
> I cannot test on HW accelerated devices, so if someone can check it I would appreciate it (even though in the thread above it looked like it works). For non accel I tested in 2.6.26. Now the patch is for
> net-next-2.6, and it compiles OK, but I have to set up a test environment to check it is still OK (it should be, but better to test).
>
> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
OK, the patch itself is correct.
Now, could you please send it again with a proper changelog ?
In this changelog, please explain why patch is needed, and
keep lines short (< 72 chars), like the one you did in your first mail.
I'll then add my Signed-off-by, since I wrote the accelerated part ;)
Note : I wonder if another patch is needed, in case 8021q module is
_not_ loaded. We probably should accept vlan 0 frames in this case ?
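For reference, the receive decision discussed in this thread can be modelled in a few lines of standalone C. This is only a sketch under stated assumptions: the enum, the function names and the have_vlan_dev flag are hypothetical stand-ins for the real device lookup, while the 12-bit VID mask and the 3-bit priority in the top bits follow the 802.1Q tag layout.

```c
#include <assert.h>
#include <stdint.h>

#define VID_MASK   0x0fffu	/* low 12 bits of the TCI: VLAN ID */
#define PRIO_SHIFT 13		/* top 3 bits of the TCI: 802.1p priority */

enum rx_action {
	RX_TO_VLAN_DEV,		/* a device exists for this VID */
	RX_TO_REAL_DEV,		/* VID 0, no device: priority-tagged frame */
	RX_DROP			/* nonzero VID with no device */
};

static uint16_t tci_vid(uint16_t tci)
{
	return tci & VID_MASK;
}

static uint8_t tci_priority(uint16_t tci)
{
	return (uint8_t)(tci >> PRIO_SHIFT);
}

/* have_vlan_dev: hypothetical flag standing in for the device lookup. */
static enum rx_action classify(uint16_t tci, int have_vlan_dev)
{
	if (have_vlan_dev)
		return RX_TO_VLAN_DEV;
	return tci_vid(tci) ? RX_DROP : RX_TO_REAL_DEV;
}
```

A frame with TCI 0xa000 (priority 5, VID 0) and no VLAN 0 device is handed to the underlying interface, whereas VID 100 without a matching device is still dropped.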
^ permalink raw reply
* Re: [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
From: Eric Dumazet @ 2010-06-16 8:56 UTC (permalink / raw)
To: David Miller; +Cc: netdev, paulmck
In-Reply-To: <20100615.214754.42801686.davem@davemloft.net>
On Tuesday, 15 June 2010 at 21:47 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 16 Jun 2010 04:45:24 +0200
>
> > [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
> >
> > Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)
> >
> > Unused inet_peer entries have a null refcnt.
> >
> > Using atomic_inc_not_zero() in rcu lookups is not going to work for
> > them, and slow path is taken.
> >
> > Fix this using -1 marker instead of 0 for deleted entries.
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Applied, thanks Eric.
Thanks
With 65537 peers and a DDOS frag attack, I now get the following profiling
results:
-----------------------------------------------------------------------------------------
PerfTop: 1024 irqs/sec kernel:100.0% exact: 0.0% [1000Hz cycles], (all, cpu: 0)
-----------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _________________________
7722.00 65.6% inet_frag_find
1355.00 11.5% ip4_frag_match
494.00 4.2% __lock_acquire
260.00 2.2% inet_getpeer
243.00 2.1% ip_route_input_common
151.00 1.3% lock_release
142.00 1.2% mark_lock
126.00 1.1% lock_acquire
104.00 0.9% __kmalloc
86.00 0.7% skb_put
Just to show what could be the next steps ;)
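As background to the quoted changelog, the "-1 marker" idea can be sketched in userspace with C11 atomics. This is an illustrative model, not the inetpeer code; the function names and REF_DEAD are made up. An unused entry keeps refcnt 0 and can still be picked up by a lockless lookup, while an entry being freed is moved from 0 to -1 so that lookups refuse it.

```c
#include <assert.h>
#include <stdatomic.h>

#define REF_DEAD (-1)	/* marker for an entry that is being freed */

/* Take a reference unless the entry is already marked dead. */
static int ref_get_unless_dead(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	while (old != REF_DEAD) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return 1;
	}
	return 0;
}

/* Drop a reference. */
static void ref_put(atomic_int *refcnt)
{
	atomic_fetch_sub(refcnt, 1);
}

/* Claim an unused entry (refcnt 0) for freeing by moving it to REF_DEAD. */
static int ref_mark_dead(atomic_int *refcnt)
{
	int expected = 0;

	return atomic_compare_exchange_strong(refcnt, &expected, REF_DEAD);
}
```

With a zero marker, a lookup could not tell an idle entry from a dying one; with -1, an idle entry at 0 can be taken on the fast path and only a dying one is rejected.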
^ permalink raw reply
* [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: Tom Hughes @ 2010-06-16 8:24 UTC (permalink / raw)
To: netdev
Cc: akpm, Tom Hughes, Jamal Hadi Salim, David S. Miller,
Stephen Hemminger, Patrick McHardy, Tejun Heo, linux-kernel
The sch_teql module, which can be used to load balance over a set of
underlying interfaces, stopped working after 2.6.30 and has been
broken in all kernels since then for any underlying interface which
requires the addition of link level headers.
The problem is that the transmit routine relies on being able to
access the destination address in the skb in order to do address
resolution once it has decided which underlying interface it is going
to transmit through.
In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
default for all interfaces, which causes the destination address to be
released before the transmit routine for the interface is called.
The solution is to clear that flag for teql interfaces.
Signed-off-by: Tom Hughes <tom@compton.nu>
---
net/sched/sch_teql.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index 3415b6c..807643b 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -449,6 +449,7 @@ static __init void teql_master_setup(struct net_device *dev)
dev->tx_queue_len = 100;
dev->flags = IFF_NOARP;
dev->hard_header_len = LL_MAX_HEADER;
+ dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
}
static LIST_HEAD(master_dev_list);
--
1.7.0.1
^ permalink raw reply related
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Pedro Garcia @ 2010-06-16 8:49 UTC (permalink / raw)
To: netdev; +Cc: Patrick McHardy, Ben Hutchings, Eric Dumazet
In-Reply-To: <1276542772.2444.13.camel@edumazet-laptop>
On Mon, 14 Jun 2010 21:12:52 +0200, Eric Dumazet <eric.dumazet@gmail.com>
wrote:
> On Monday, 14 June 2010 at 19:11 +0200, Patrick McHardy wrote:
>> Ben Hutchings wrote:
>> > On Mon, 2010-06-14 at 18:49 +0200, Pedro Garcia wrote:
>> >
>> >> On Sun, 13 Jun 2010 22:56:30 +0100, Ben Hutchings
>> >> <bhutchings@solarflare.com> wrote:
>> >>
>> >>> I have no particular opinion on this change, but you need to read and
>> >>> follow Documentation/SubmittingPatches.
>> >>>
>> >>> Ben.
>> >>>
>> >> Sorry, first kernel patch, and I did not know about it. I resubmit
>> >> with
>> >> the correct style / format:
>> >>
>> > [...]
>> >
>> > Sorry, no you haven't.
>> >
>> > - Networking changes go through David Miller's net-next-2.6 tree so you
>> > need to use that as the baseline, not 2.6.26
>> > - Patches should be applicable with -p1, not -p0 (so if you use diff,
>> > you should run it from one directory level up)
>> > - The patch was word-wrapped
>>
>> Additionally:
>>
>> - please use the proper comment style, meaning each line begins
>> with a '*'
>>
>> - the pr_debug statements may be useful for debugging, but are
>> a bit excessive for the final version
>>
>> - + /* 2010-06-13: Pedro Garcia
>>
>> We have changelogs for this, simply explaining what the code
>> does is enough.
>>
>> - Please CC the maintainer (which is me)
>> --
>
> Pedro, we have two kinds of vlan setups :
>
> accelerated and non accelerated ones.
>
> Your patch addresses non accelerated ones only, since you only touch
> vlan_skb_recv()
>
> Accelerated vlan can follow these paths :
>
> 1) NAPI devices
>
> vlan_gro_receive() -> vlan_gro_common()
>
> 2) non NAPI devices
>
> __vlan_hwaccel_rx()
>
> So you might also patch __vlan_hwaccel_rx() and vlan_gro_common()
>
> Please merge following bits to your patch submission :
>
> http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868
>
>
> Good luck for your first patch !
Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
I cannot test on HW accelerated devices, so if someone can check it I would appreciate it (even though in the thread above it looked like it works). For non accel I tested in 2.6.26. Now the patch is for
net-next-2.6, and it compiles OK, but I have to set up a test environment to check it is still OK (it should be, but better to test).
Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
--
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 50f58f5..daaca31 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -8,6 +8,9 @@
int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
u16 vlan_tci, int polling)
{
+ struct net_device *vlan_dev;
+ u16 vlan_id;
+
if (netpoll_rx(skb))
return NET_RX_DROP;
@@ -16,10 +19,14 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
- skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
+ vlan_id = vlan_tci & VLAN_VID_MASK;
+ vlan_dev = vlan_group_get_device(grp, vlan_id);
- if (!skb->dev)
- goto drop;
+ if (vlan_dev)
+ skb->dev = vlan_dev;
+ else
+ if (vlan_id)
+ goto drop;
return (polling ? netif_receive_skb(skb) : netif_rx(skb));
@@ -82,16 +89,22 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
unsigned int vlan_tci, struct sk_buff *skb)
{
struct sk_buff *p;
+ struct net_device *vlan_dev;
+ u16 vlan_id;
if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
skb->deliver_no_wcard = 1;
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
- skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
-
- if (!skb->dev)
- goto drop;
+ vlan_id = vlan_tci & VLAN_VID_MASK;
+ vlan_dev = vlan_group_get_device(grp, vlan_id);
+
+ if (vlan_dev)
+ skb->dev = vlan_dev;
+ else
+ if (vlan_id)
+ goto drop;
for (p = napi->gro_list; p; p = p->next) {
NAPI_GRO_CB(p)->same_flow =
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 5298426..65512c3 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -142,6 +142,7 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
{
struct vlan_hdr *vhdr;
struct vlan_rx_stats *rx_stats;
+ struct net_device *vlan_dev;
u16 vlan_id;
u16 vlan_tci;
@@ -157,8 +158,18 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
vlan_id = vlan_tci & VLAN_VID_MASK;
rcu_read_lock();
- skb->dev = __find_vlan_dev(dev, vlan_id);
- if (!skb->dev) {
+ vlan_dev = __find_vlan_dev(dev, vlan_id);
+
+ /* If the VLAN device is defined, we use it.
+ * If not, and the VID is 0, it is a 802.1p packet (not
+ * really a VLAN), so we will just netif_rx it later to the
+ * original interface, but with the skb->proto set to the
+ * wrapped proto: we do nothing here.
+ */
+
+ if (vlan_dev) {
+ skb->dev = vlan_dev;
+ } else if (vlan_id) {
pr_debug("%s: ERROR: No net_device for VID: %u on dev: %s\n",
__func__, vlan_id, dev->name);
goto err_unlock;
^ permalink raw reply related
* Re: [PATCH net-next-2.6] syncookies: check decoded options against sysctl settings
From: Florian Westphal @ 2010-06-16 8:03 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20100615.180947.241927235.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Sun, 13 Jun 2010 23:34:35 +0200
>
> > - if (tcp_opt->sack_ok)
> > - tcp_sack_reset(tcp_opt);
> > + if (tcp_opt->sack_ok && !sysctl_tcp_sack)
> > + return false;
> >
>
> If you remove the tcp_sack_reset() call here, who is going to
> do it?
Right, I should have mentioned that in the changelog, sorry about that.
Bottom line is that I failed to find out why it's needed.
Both call sites of this function (cookie_v4_check, cookie_v6_check)
allocate the "struct tcp_options_received" argument on the stack, zero it,
hand it to tcp_parse_options() and then call cookie_check_timestamp().
I did not find any place in tcp_parse_options that would cause
tcp_opt->num_sacks/dsack to become nonzero.
Even if it can turn nonzero, I do not see any ill effects that might
happen then. The structure is on the stack and after tcp_parse_options()
returns only a few selected members are copied to the inet_request_sock.
^ permalink raw reply