* Re: [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
From: Paul E. McKenney @ 2010-06-16 18:12 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1276656324.19249.39.camel@edumazet-laptop>
On Wed, Jun 16, 2010 at 04:45:24AM +0200, Eric Dumazet wrote:
> On Tuesday 15 June 2010 at 14:25 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 15 Jun 2010 20:23:14 +0200
> >
> > > inetpeer currently uses an AVL tree protected by an rwlock.
> > >
> > > It's possible to make most lookups use RCU
> > ...
> > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > Applied, nice work Eric.
>
> Thanks David !
>
> Re-reading patch I realize refcnt is expected to be 0 for unused entries
> (obviously), so we should use a different marker for 'about to be freed'
> ones.
>
> Thanks
>
> [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
>
> Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)
>
> Unused inet_peer entries have a null refcnt.
>
> Using atomic_inc_not_zero() in rcu lookups is not going to work for
> them, and slow path is taken.
>
> Fix this using -1 marker instead of 0 for deleted entries.
Based on this patch, looks good to me! (I don't see lookup_rcu_bh() and
friends in the trees I have at hand.)
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> net/ipv4/inetpeer.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
> index 58fbc7e..39a14ba 100644
> --- a/net/ipv4/inetpeer.c
> +++ b/net/ipv4/inetpeer.c
> @@ -187,7 +187,12 @@ static struct inet_peer *lookup_rcu_bh(__be32 daddr)
>
> while (u != peer_avl_empty) {
> if (daddr == u->v4daddr) {
> - if (unlikely(!atomic_inc_not_zero(&u->refcnt)))
> + /* Before taking a reference, check if this entry was
> + * deleted, unlink_from_pool() sets refcnt=-1 to make
> + * distinction between an unused entry (refcnt=0) and
> + * a freed one.
> + */
> + if (unlikely(!atomic_add_unless(&u->refcnt, 1, -1)))
> u = NULL;
> return u;
> }
> @@ -322,8 +327,9 @@ static void unlink_from_pool(struct inet_peer *p)
> * in cleanup() function to prevent sudden disappearing. If we can
> * atomically (because of lockless readers) take this last reference,
> * it's safe to remove the node and free it later.
> + * We use refcnt=-1 to alert lockless readers this entry is deleted.
> */
> - if (atomic_cmpxchg(&p->refcnt, 1, 0) == 1) {
> + if (atomic_cmpxchg(&p->refcnt, 1, -1) == 1) {
> struct inet_peer **stack[PEER_MAXDEPTH];
> struct inet_peer ***stackptr, ***delp;
> if (lookup(p->v4daddr, stack) != p)
>
>
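The marker scheme in this patch can be tried outside the kernel. The following is a minimal userspace sketch using C11 atomics, not the kernel code itself: add_unless() is a stand-in for the kernel's atomic_add_unless(), and peer_tryget() / peer_mark_deleted() are hypothetical names chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for atomic_add_unless(v, a, u):
 * add a to *v unless *v == u; return true if the add happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
	int cur = atomic_load(v);

	while (cur != u) {
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/* Reader side: take a reference unless the entry is marked deleted (-1).
 * An unused entry (refcnt == 0) can still be taken. */
static bool peer_tryget(atomic_int *refcnt)
{
	return add_unless(refcnt, 1, -1);
}

/* Writer side: swap the last reference (1) for the deleted marker (-1). */
static bool peer_mark_deleted(atomic_int *refcnt)
{
	int expected = 1;

	return atomic_compare_exchange_strong(refcnt, &expected, -1);
}
```

With a zero marker, peer_tryget() would fail on every unused entry; with -1 it fails only on entries that are really on their way out.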
* Re: [Bugme-new] [Bug 16216] New: wrong source addr of UDP packets when using policy routing
From: Patrick McHardy @ 2010-06-16 17:43 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon, borg
In-Reply-To: <1276709309.2632.126.camel@edumazet-laptop>
Eric Dumazet wrote:
> On Wednesday 16 June 2010 at 18:46 +0200, Patrick McHardy wrote:
>
>
>> This is known behaviour: fwmarks don't work for source address selection,
>> since before the source address is chosen you don't even have a packet
>> which could be marked.
>>
>
> We now have sk->sk_mark routing (socket based), so we might change
> sk->sk_mark with an appropriate iptables target when one packet is
> received... not very clean but worth mentioning...
>
That would still be too late. The proper way would be to have the
application set the socket mark.
* Re: [Bugme-new] [Bug 16216] New: wrong source addr of UDP packets when using policy routing
From: Eric Dumazet @ 2010-06-16 17:28 UTC (permalink / raw)
To: Patrick McHardy
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon, borg
In-Reply-To: <4C18FFDC.8060102@trash.net>
On Wednesday 16 June 2010 at 18:46 +0200, Patrick McHardy wrote:
> This is known behaviour: fwmarks don't work for source address selection,
> since before the source address is chosen you don't even have a packet
> which could be marked.
We now have sk->sk_mark routing (socket based), so we might change
sk->sk_mark with an appropriate iptables target when one packet is
received... not very clean but worth mentioning...
commit 914a9ab386a288d0f22252fc268ecbc048cdcbd5
Author: Atis Elsts <atis@mikrotik.com>
Date: Thu Oct 1 15:16:49 2009 -0700
net: Use sk_mark for routing lookup in more places
This patch against v2.6.31 adds support for route lookup using sk_mark in some
more places. The benefits from this patch are the following.
First, SO_MARK option now has effect on UDP sockets too.
Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
lookup correctly if TCP sockets with SO_MARK were used.
Signed-off-by: Atis Elsts <atis@mikrotik.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
* Re: [Bugme-new] [Bug 16216] New: wrong source addr of UDP packets when using policy routing
From: Patrick McHardy @ 2010-06-16 16:46 UTC (permalink / raw)
To: Andrew Morton; +Cc: netdev, bugzilla-daemon, bugme-daemon, borg
In-Reply-To: <20100616093328.0671254b.akpm@linux-foundation.org>
Andrew Morton wrote:
> On Tue, 15 Jun 2010 15:14:43 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
>
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=16216
>>
>> Summary: wrong source addr of UDP packets when using policy
>> routing
>> Product: Networking
>> Version: 2.5
>> Kernel Version: 2.6.24.7
>>
>
> The reporter has confirmed that this issue persists in 2.6.34.
>
>
>> Platform: All
>> OS/Version: Linux
>> Tree: Mainline
>> Status: NEW
>> Severity: normal
>> Priority: P1
>> Component: IPV4
>> AssignedTo: shemminger@linux-foundation.org
>> ReportedBy: borg@uu3.net
>> Regression: No
>>
>>
>> When policy routing is used, UDP packets have wrong source address.
>> Source addr is probably taken from looking up routing table (main) to given
>> destination instead of being set just after POSTROUTING, looking up cache.
>>
>> This is how it looks in a simple netcat test:
>> (tcpdump is run on aa.aa.47.90)
>> 16:38:02.053053 IP aa.aa.47.67.32826 > aa.aa.47.90.660: UDP, length 8
>> 16:38:05.660394 IP bb.bbb.241.62.660 > aa.aa.47.67.32826: UDP, length 8
>>
>> aa.aa.47.90 has a specific setup with 3 routing tables: main, 10, 20,
>> and all of them have a default gateway. bb.bbb.241.62 is the address of the
>> outgoing interface of the default route from the main table.
>> If a packet comes from a specific interface,
>> it is stored in an ipset, and when a packet is about to be sent out of the box
>> it is marked in mangle OUTPUT by matching the specific ipset:
>>
>> ### mangle PREROUTING ###
>> fw="iptables -t mangle -A PREROUTING"
>> $fw -i vlan0.13 -j SET --add-set gw10 src
>> $fw -i lan2 -j SET --add-set gw20 src
>>
>> ### mangle OUTPUT ###
>> fw="iptables -t mangle -A OUTPUT"
>> $fw -m set --set gw10 dst -j MARK --set-mark 10
>> $fw -m set --set gw10 dst -j ACCEPT
>> $fw -m set --set gw20 dst -j MARK --set-mark 20
>> $fw -m set --set gw20 dst -j ACCEPT
>>
>> % ip rule show
>> 32764: from all fwmark 0x14 lookup 20
>> 32765: from all fwmark 0xa lookup 10
This is known behaviour: fwmarks don't work for source address selection,
since before the source address is chosen you don't even have a packet
which could be marked.
* Re: [Bugme-new] [Bug 16216] New: wrong source addr of UDP packets when using policy routing
From: Andrew Morton @ 2010-06-16 16:33 UTC (permalink / raw)
To: netdev; +Cc: bugzilla-daemon, bugme-daemon, borg
In-Reply-To: <bug-16216-10286@https.bugzilla.kernel.org/>
On Tue, 15 Jun 2010 15:14:43 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16216
>
> Summary: wrong source addr of UDP packets when using policy
> routing
> Product: Networking
> Version: 2.5
> Kernel Version: 2.6.24.7
The reporter has confirmed that this issue persists in 2.6.34.
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: IPV4
> AssignedTo: shemminger@linux-foundation.org
> ReportedBy: borg@uu3.net
> Regression: No
>
>
> When policy routing is used, UDP packets have wrong source address.
> Source addr is probably taken from looking up routing table (main) to given
> destination instead of being set just after POSTROUTING, looking up cache.
>
> This is how it looks in a simple netcat test:
> (tcpdump is run on aa.aa.47.90)
> 16:38:02.053053 IP aa.aa.47.67.32826 > aa.aa.47.90.660: UDP, length 8
> 16:38:05.660394 IP bb.bbb.241.62.660 > aa.aa.47.67.32826: UDP, length 8
>
> aa.aa.47.90 has a specific setup with 3 routing tables: main, 10, 20,
> and all of them have a default gateway. bb.bbb.241.62 is the address of the
> outgoing interface of the default route from the main table.
> If a packet comes from a specific interface,
> it is stored in an ipset, and when a packet is about to be sent out of the box
> it is marked in mangle OUTPUT by matching the specific ipset:
>
> ### mangle PREROUTING ###
> fw="iptables -t mangle -A PREROUTING"
> $fw -i vlan0.13 -j SET --add-set gw10 src
> $fw -i lan2 -j SET --add-set gw20 src
>
> ### mangle OUTPUT ###
> fw="iptables -t mangle -A OUTPUT"
> $fw -m set --set gw10 dst -j MARK --set-mark 10
> $fw -m set --set gw10 dst -j ACCEPT
> $fw -m set --set gw20 dst -j MARK --set-mark 20
> $fw -m set --set gw20 dst -j ACCEPT
>
> % ip rule show
> 32764: from all fwmark 0x14 lookup 20
> 32765: from all fwmark 0xa lookup 10
>
> Problem was noticed for UDP packets (openvpn connections are not working).
> Other non connection oriented protocols might be affected too.
> TCP (as connection oriented protocol) works just fine.
>
* Re: [0/8] netpoll/bridge fixes
From: Paul E. McKenney @ 2010-06-16 16:01 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, herbert, shemminger, mst, frzhang, netdev, amwang,
mpm
In-Reply-To: <1276669281.19249.62.camel@edumazet-laptop>
On Wed, Jun 16, 2010 at 08:21:21AM +0200, Eric Dumazet wrote:
> On Tuesday 15 June 2010 at 22:08 -0700, Paul E. McKenney wrote:
> > On Wed, Jun 16, 2010 at 04:58:59AM +0200, Eric Dumazet wrote:
> > >
> > > Paul, could you please explain if current lockdep rules are correct, or could be relaxed ?
> > >
> > > I thought :
> > >
> > > rcu_read_lock_bh();
> > >
> > > was a shorthand to
> > >
> > > local_bh_disable();
> > > rcu_read_lock();
> >
> > In CONFIG_TREE_RCU and CONFIG_TINY_RCU, rcu_read_lock_bh() is actually
> > shorthand for only local_bh_disable(). Therefore, rcu_dereference()
> > will scream if only rcu_read_lock_bh() is held.
> >
> > However, in CONFIG_PREEMPT_TREE_RCU, rcu_read_lock_bh() is its own
> > mechanism that does local_bh_disable() but has its own set of grace
> > periods, independent of those of rcu_read_lock().
> >
> > > Why lockdep is not able to make a correct diagnostic ?
> >
> > Here is the situation I am concerned about:
> >
> > o Task 0 does rcu_read_lock(), then p=rcu_dereference_bh().
> > If we make the change you are asking for, rcu_dereference_bh()
> > is OK with this.
> >
> > o Task 0 now is preempted before finishing its RCU read-side
> > critical section.
> >
> > o Task 1 removes the data element referenced by pointer p,
> > then invokes synchronize_rcu_bh().
> >
> > o Task 0 does not block synchronize_rcu_bh(), so the grace
> > period completes.
> >
> > o Task 1 frees up the data element referenced by pointer p,
> > which might be reallocated as some other type, unmapped,
> > or whatever else.
> >
> > o Task 0 resumes, and is sadly disappointed when the data
> > element referenced by pointer p has been swept out from
> > under it.
> >
> > Or am I missing something here?
> >
>
> Nice thing with RCU is that I learn new things every day ;)
>
> Thanks Paul, I'll try to remember all the details ! ;)
;-)
But just to be clear... All but one use of RCU-bh is in networking,
so if you guys need something different from RCU-bh, let's talk!
And I learn something new about RCU every day as well. One of today's
lessons is that networking is no longer the only user of RCU-bh. ;-)
Thanx, Paul
* RE: ipv6: netif_carrier_(on|off) with traces afterwards
From: Tantilov, Emil S @ 2010-06-16 15:55 UTC (permalink / raw)
To: Einar EL Lueck; +Cc: NetDev
In-Reply-To: <OF3A036420.64516D4F-ONC1257744.00492A34-C1257744.004A240B@de.ibm.com>
>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Einar EL Lueck
>Sent: Wednesday, June 16, 2010 6:30 AM
>To: netdev@vger.kernel.org
>Subject: ipv6: netif_carrier_(on|off) with traces afterwards
>
>
>Hi,
>With IPv6 addresses configured and a network card doing
>netif_carrier_off|on we see afterwards in 2.6.34 on S/390 some traces in
>fib.
>
>Example sequence of operations:
>ip -6 addr add fd00:10:30:49:4008:ffff:35:2/80 dev eth1
>ip link set eth1 up
>ip link set eth1 down
># the following lines have as effect netif_carrier_off and then on (among
>other stuff)
>echo 0 > /sys/bus/ccwgroup/drivers/qeth/devices/0.0.e300/online
>echo 1 > /sys/bus/ccwgroup/drivers/qeth/devices/0.0.e300/online
># end of plugging
>ip link set eth1 up
>ip link set eth1 down
>=> at this point we get the following trace:
>
>Badness
>at /home/autobuild/BUILD/linux-2.6.34-20100531/net/ipv6/ip6_fib.c:1160
<snip>
>
>Has anybody seen effects like this before on other platforms, or has
>anybody suggestions for the root cause?
I had similar issues. The following patches fixed it for me:
http://marc.info/?l=linux-netdev&m=127472600330413
http://marc.info/?l=linux-netdev&m=127472599530407
Thanks,
Emil
* Re: [PATCH] broadcom: Add 5241 support
From: Ben Hutchings @ 2010-06-16 15:35 UTC (permalink / raw)
To: Dmitry Eremin-Solenikov
Cc: netdev, David S. Miller, Matt Carlson, Michael Chan
In-Reply-To: <1276699126-8168-1-git-send-email-dbaryshkov@gmail.com>
On Wed, 2010-06-16 at 18:38 +0400, Dmitry Eremin-Solenikov wrote:
> This patch adds the 5241 PHY ID to the broadcom module.
[...]
You need to add the ID and mask to the module device ID table
(broadcom_tbl) as well.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
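The ID/mask pair from the patch is matched by masking both sides before comparing. The sketch below shows that comparison in plain userspace C; struct phy_id_entry, id_table and phy_id_matches() are illustrative names, not the kernel's broadcom_tbl or its types, and only the two numeric values are taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct phy_id_entry {
	uint32_t phy_id;
	uint32_t phy_id_mask;
};

/* One entry per supported PHY; values for the BCM5241 come from the patch. */
static const struct phy_id_entry id_table[] = {
	{ 0x0143bc30, 0xfffffff0 },
};

/* True if the ID read from the PHY matches any table entry under its mask. */
static bool phy_id_matches(uint32_t id)
{
	for (size_t i = 0; i < sizeof(id_table) / sizeof(id_table[0]); i++) {
		uint32_t mask = id_table[i].phy_id_mask;

		if ((id & mask) == (id_table[i].phy_id & mask))
			return true;
	}
	return false;
}
```

The mask 0xfffffff0 ignores the low four bits, so any revision of the part matches the same entry.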
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Patrick McHardy @ 2010-06-16 15:28 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Pedro Garcia, netdev, Eric Dumazet, Ben Hutchings
In-Reply-To: <201006161624.12101.arnd@arndb.de>
Arnd Bergmann wrote:
> On Wednesday 16 June 2010, Pedro Garcia wrote:
>
>> Probably a definitive fix would be not to allow the definition of VLAN 0
>> in 802.1q module and provide some other way to tag priority packets without
>> using a subinterface (maybe in the same module or a new 8021p one). I am
>> having a look at the kernel to see what happens if we load two modules for
>> the same protocol.
>>
>
> On a related note, we will also need to support 802.1Qad provider bridges
> at some point, which use yet another variation of the VLAN header (actually
> two nested VLAN tags) with a different ethertype.
> I need this for 802.1Qbg multi-channel VEPA (possibly also 802.1Qbh
> port extenders), but I have not yet investigated how to implement this
> in the VLAN module.
>
Since we don't have any special VLAN handling in the bridging code, I
guess it comes down to optionally using a different ethertype value
(0x88a8) in the VLAN code. We probably also need some indication from
device drivers whether they are able to add these headers to avoid
trying to offload tagging in case they're not.
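To make the two tag flavours concrete, here is a small userspace sketch that reads the first tag of a frame and accepts either ethertype. It is not the kernel VLAN code; parse_vlan_tag() and struct vlan_tag are names made up for this example, and only the outer tag is examined.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TPID_8021Q  0x8100u /* ordinary VLAN tag */
#define TPID_8021AD 0x88a8u /* provider bridge (outer) tag */

struct vlan_tag {
	uint16_t tpid; /* ethertype that introduced the tag */
	uint8_t  pcp;  /* 802.1p priority, 3 bits */
	uint16_t vid;  /* VLAN ID, 12 bits; 0 means priority-tagged only */
};

/* Parse the tag that follows the two 6-byte MAC addresses.
 * Returns false if the frame is too short or carries no VLAN tag. */
static bool parse_vlan_tag(const uint8_t *frame, size_t len,
			   struct vlan_tag *out)
{
	uint16_t tpid, tci;

	if (len < 16)
		return false;
	tpid = (uint16_t)((frame[12] << 8) | frame[13]);
	tci = (uint16_t)((frame[14] << 8) | frame[15]);
	if (tpid != TPID_8021Q && tpid != TPID_8021AD)
		return false;
	out->tpid = tpid;
	out->pcp = (uint8_t)(tci >> 13);
	out->vid = tci & 0x0fff;
	return true;
}
```

A frame whose tag has vid == 0 carries only a priority, which is the case the VLAN 0 patch in this thread is about.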
* Re: 2.6.35-rc2 kernel crashes under heavy network load
From: Lazy @ 2010-06-16 15:24 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1276698862.2632.93.camel@edumazet-laptop>
2010/6/16 Eric Dumazet <eric.dumazet@gmail.com>:
> On Wednesday 16 June 2010 at 15:39 +0200, Lazy wrote:
>> Our linux router crashes while rebooting under heavy network load
>> (800kpps generated by pktgen on other machine).
>> While everything is running, the system seems stable.
>>
>> Any pointers what can I do to help resolve this issue ?
>>
>> the system is Dell R210 X3420, 64bit kernel, debian 5.0 Broadcom BCM5716
>> same thing happens on a Dell R410 running same software Broadcom BCM5716
>>
>> kernel version is Linux version 2.6.35-rc2 (root@cisco3-2) (gcc
>> version 4.3.2 (Debian 4.3.2-1.1) ) #2 SMP Fri Jun 11 10:22:51 CEST
>> 2010
>>
>> general protection fault: 0000 [#1] SMP
>> last sysfs file: /sys/devices/platform/dcdbas/smi_data_buf_phys_addr
>> CPU 1
>> Modules linked in: iTCO_wdt 8021q ipmi_poweroff ipmi_devintf ipmi_si
>> ipmi_msghandler mptctl loop ioatdma dca button evdev dcdbas raid10
>> raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
>> async_tx raid1 raid0 linear md_mod sd_mod sg sr_mod cdrom
>> ide_pci_generic ide_core usbhid mptsas ata_piix mptscsih libata
>> mptbase scsi_transport_sas scsi_mod ehci_hcd uhci_hcd bnx2 fan [last
>> unloaded: scsi_wait_scan]
>>
>> Pid: 20, comm: events/1 Not tainted 2.6.35-rc2 #2 0N051F/PowerEdge R410
>> RIP: 0010:[<ffffffff81087ae9>] [<ffffffff81087ae9>] drain_array+0x29/0xc7
>> RSP: 0000:ffff88012fb6ddc0 EFLAGS: 00010202
>> RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
>> RDX: 0720072007200720 RSI: ffff88012fc04ec0 RDI: ffff88012fc1be00
>> RBP: ffff88012fb6de00 R08: 0000000000000000 R09: ffff88012fa91618
>> R10: 000000102cdbdd88 R11: 0000000000000000 R12: ffff88012fc04ec0
>> R13: 0720072007200720 R14: ffff88012fc04ec0 R15: 0000000000000000
>
> 072007200720 pattern is the signature of a known bug.
>
>
> commit 386f40c86d6c8d5b717 (Revert "tty: fix a little bug in scrup,
> vt.c") will help you.
>
> This is probably solved in current git tree...
You are right, 2.6.35-rc3 (with this commit) works fine
thank You
--
Michal Grzedzicki
* Re: [PATCH 12/12] ptp: Added a clock driver for the National Semiconductor PHYTER.
From: Grant Likely @ 2010-06-16 15:10 UTC (permalink / raw)
To: Richard Cochran
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
Krzysztof Halasa
In-Reply-To: <20100616100539.GA3569-7KxsofuKt4IfAd9E5cN8NEzG7cXyKsk/@public.gmane.org>
On Wed, Jun 16, 2010 at 4:05 AM, Richard Cochran
<richardcochran-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Jun 15, 2010 at 12:49:13PM -0600, Grant Likely wrote:
>> Won't this break things for existing DP83640 users?
>
> Nope, the driver was only added five patches ago, and it only offers
> the timestamping stuff. The standard PHY functions just call the
> generic functions, so the PHY works fine even without this driver.
>
>> > +static struct ptp_clock *dp83640_clock;
>> > +DEFINE_SPINLOCK(clock_lock); /* protects the one and only dp83640_clock */
>>
>> Why only one? Is it not possible to have 2 of these PHYs in a system?
>
> Yes, you can have multiple PHYs, but only one PTP clock.
>
> If you do use multiple PHYs, then you must wire their clocks together
> and adjust the PTP clock on only one of the PHYs.
>
>
> Thanks for your other comments,
You're welcome. Make sure to cc: linux-kernel on your next posting.
I commented on what I could, but there is a lot of code outside my
areas of expertise. In particular the time keeping code needs to be
looked at by the maintainers in that area.
Cheers,
g.
* DMFE Driver
From: Heiko Gerstung @ 2010-06-16 15:01 UTC (permalink / raw)
To: netdev
Hi Everybody,
just a short heads up that I already found someone who is willing to
work on this and will receive hardware from me for testing it. And, I
already received a first patch from someone else, this is magnificent!
You guys really know what "customer service" is :-) !
If anybody has an idea regarding the ASIX driver and why it only seems
to work as a bonding group member when it is put into promisc, I would
really appreciate that!
Thanks again for all your time and assistance so far!
Regards,
Heiko
--
Heiko Gerstung
*MEINBERG Funkuhren* GmbH & Co. KG
Lange Wand 9
D-31812 Bad Pyrmont, Germany
Phone: +49 (0)5281 9309-25
Fax: +49 (0)5281 9309-30
Amtsgericht Hannover 17HRA 100322
Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg,
Andre Hartmann, Heiko Gerstung
Email: heiko.gerstung@meinberg.de
Web: www.meinberg.de
------------------------------------------------------------------------
*MEINBERG - Accurate Time Worldwide*
* [PATCH net-next-2.6] inetpeer: restore small inet_peer structures
From: Eric Dumazet @ 2010-06-16 14:52 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Addition of rcu_head to struct inet_peer added 16 bytes on 64-bit arches.
That's a bit unfortunate, since the old size was exactly 64 bytes.
This can be solved using a union between this rcu_head and four fields
that are normally used only when a refcount is taken on inet_peer.
rcu_head is used only when refcnt=-1, right before structure freeing.
Add an inet_peer_refcheck() function to check this assertion for a while.
We can bring back the SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/inetpeer.h | 31 ++++++++++++++++++++++++++-----
net/ipv4/inetpeer.c | 4 ++--
net/ipv4/route.c | 1 +
net/ipv4/tcp_ipv4.c | 11 +++++++----
4 files changed, 36 insertions(+), 11 deletions(-)
diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 6174047..51c06af 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -22,11 +22,21 @@ struct inet_peer {
__u32 dtime; /* the time of last use of not
* referenced entries */
atomic_t refcnt;
- atomic_t rid; /* Frag reception counter */
- atomic_t ip_id_count; /* IP ID for the next packet */
- __u32 tcp_ts;
- __u32 tcp_ts_stamp;
- struct rcu_head rcu;
+ /*
+ * Once inet_peer is queued for deletion (refcnt == -1), following fields
+ * are not available: rid, ip_id_count, tcp_ts, tcp_ts_stamp
+ * We can share memory with rcu_head to keep inet_peer small
+ * (less then 64 bytes)
+ */
+ union {
+ struct {
+ atomic_t rid; /* Frag reception counter */
+ atomic_t ip_id_count; /* IP ID for the next packet */
+ __u32 tcp_ts;
+ __u32 tcp_ts_stamp;
+ };
+ struct rcu_head rcu;
+ };
};
void inet_initpeers(void) __init;
@@ -37,10 +47,21 @@ struct inet_peer *inet_getpeer(__be32 daddr, int create);
/* can be called from BH context or outside */
extern void inet_putpeer(struct inet_peer *p);
+/*
+ * temporary check to make sure we dont access rid, ip_id_count, tcp_ts,
+ * tcp_ts_stamp if no refcount is taken on inet_peer
+ */
+static inline void inet_peer_refcheck(const struct inet_peer *p)
+{
+ WARN_ON_ONCE(atomic_read(&p->refcnt) <= 0);
+}
+
+
/* can be called with or without local BH being disabled */
static inline __u16 inet_getid(struct inet_peer *p, int more)
{
more++;
+ inet_peer_refcheck(p);
return atomic_add_return(more, &p->ip_id_count) - more;
}
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 349249f..bb58aed 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -64,7 +64,7 @@
* usually under some other lock to prevent node disappearing
* dtime: unused node list lock
* v4daddr: unchangeable
- * ip_id_count: idlock
+ * ip_id_count: atomic value (no lock needed)
*/
static struct kmem_cache *peer_cachep __read_mostly;
@@ -129,7 +129,7 @@ void __init inet_initpeers(void)
peer_cachep = kmem_cache_create("inet_peer_cache",
sizeof(struct inet_peer),
- 0, SLAB_PANIC,
+ 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
NULL);
/* All the timers, started at system startup tend
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a291edb..03430de 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2881,6 +2881,7 @@ static int rt_fill_info(struct net *net,
error = rt->dst.error;
expires = rt->dst.expires ? rt->dst.expires - jiffies : 0;
if (rt->peer) {
+ inet_peer_refcheck(rt->peer);
id = atomic_read(&rt->peer->ip_id_count) & 0xffff;
if (rt->peer->tcp_ts_stamp) {
ts = rt->peer->tcp_ts;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7f9515c..2e41e6f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -204,10 +204,12 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
* TIME-WAIT * and initialize rx_opt.ts_recent from it,
* when trying new connection.
*/
- if (peer != NULL &&
- (u32)get_seconds() - peer->tcp_ts_stamp <= TCP_PAWS_MSL) {
- tp->rx_opt.ts_recent_stamp = peer->tcp_ts_stamp;
- tp->rx_opt.ts_recent = peer->tcp_ts;
+ if (peer) {
+ inet_peer_refcheck(peer);
+ if ((u32)get_seconds() - peer->tcp_ts_stamp <= TCP_PAWS_MSL) {
+ tp->rx_opt.ts_recent_stamp = peer->tcp_ts_stamp;
+ tp->rx_opt.ts_recent = peer->tcp_ts;
+ }
}
}
@@ -1351,6 +1353,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
(dst = inet_csk_route_req(sk, req)) != NULL &&
(peer = rt_get_peer((struct rtable *)dst)) != NULL &&
peer->v4daddr == saddr) {
+ inet_peer_refcheck(peer);
if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
(s32)(peer->tcp_ts - req->ts_recent) >
TCP_PAWS_WINDOW) {
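The space saving from the union can be checked with a toy layout. The structures below are simplified stand-ins, not the real struct inet_peer: fake_rcu_head only mimics the two-pointer shape of rcu_head, and peer_separate / peer_shared are hypothetical names that keep just the refcount and the fields involved in the union.

```c
#include <stdint.h>

/* Stand-in for struct rcu_head: two pointers. */
struct fake_rcu_head {
	struct fake_rcu_head *next;
	void (*func)(struct fake_rcu_head *);
};

/* Layout before the patch: rcu_head appended after the four fields. */
struct peer_separate {
	int32_t refcnt;
	uint32_t rid, ip_id_count, tcp_ts, tcp_ts_stamp;
	struct fake_rcu_head rcu;
};

/* Layout after the patch: rcu_head overlays fields that are only
 * meaningful while a reference is held. */
struct peer_shared {
	int32_t refcnt;
	union {
		struct {
			uint32_t rid, ip_id_count, tcp_ts, tcp_ts_stamp;
		};
		struct fake_rcu_head rcu;
	};
};

/* The overlay must not cost more than the layout it replaces. */
_Static_assert(sizeof(struct peer_shared) < sizeof(struct peer_separate),
	       "union should shrink the structure");
```

The overlay is only safe because the two halves are never live at the same time, which is what inet_peer_refcheck() is meant to catch during the transition.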
* [PATCH] broadcom: Add 5241 support
From: Dmitry Eremin-Solenikov @ 2010-06-16 14:38 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Matt Carlson, Michael Chan
This patch adds the 5241 PHY ID to the broadcom module.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
drivers/net/phy/broadcom.c | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index f482fc4..1c12a57 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -834,6 +834,21 @@ static struct phy_driver bcmac131_driver = {
.driver = { .owner = THIS_MODULE },
};
+static struct phy_driver bcm5241_driver = {
+ .phy_id = 0x0143bc30,
+ .phy_id_mask = 0xfffffff0,
+ .name = "Broadcom BCM5241",
+ .features = PHY_BASIC_FEATURES |
+ SUPPORTED_Pause | SUPPORTED_Asym_Pause,
+ .flags = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
+ .config_init = brcm_fet_config_init,
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .ack_interrupt = brcm_fet_ack_interrupt,
+ .config_intr = brcm_fet_config_intr,
+ .driver = { .owner = THIS_MODULE },
+};
+
static int __init broadcom_init(void)
{
int ret;
@@ -868,8 +883,13 @@ static int __init broadcom_init(void)
ret = phy_driver_register(&bcmac131_driver);
if (ret)
goto out_ac131;
+ ret = phy_driver_register(&bcm5241_driver);
+ if (ret)
+ goto out_5241;
return ret;
+out_5241:
+ phy_driver_unregister(&bcmac131_driver);
out_ac131:
phy_driver_unregister(&bcm57780_driver);
out_57780:
@@ -894,6 +914,7 @@ out_5411:
static void __exit broadcom_exit(void)
{
+ phy_driver_unregister(&bcm5241_driver);
phy_driver_unregister(&bcmac131_driver);
phy_driver_unregister(&bcm57780_driver);
phy_driver_unregister(&bcm50610m_driver);
--
1.7.1
* Re: 2.6.35-rc2 kernel crashes under heavy network load
From: Eric Dumazet @ 2010-06-16 14:34 UTC (permalink / raw)
To: Lazy; +Cc: linux-kernel, netdev
In-Reply-To: <AANLkTinal4PZ3ESQMIdqd8_zmnSDTNgweczNsYdGobTS@mail.gmail.com>
On Wednesday 16 June 2010 at 15:39 +0200, Lazy wrote:
> Our linux router crashes while rebooting under heavy network load
> (800kpps generated by pktgen on other machine).
> While everything is running, the system seems stable.
>
> Any pointers what can I do to help resolve this issue ?
>
> the system is Dell R210 X3420, 64bit kernel, debian 5.0 Broadcom BCM5716
> same thing happens on a Dell R410 running same software Broadcom BCM5716
>
> kernel version is Linux version 2.6.35-rc2 (root@cisco3-2) (gcc
> version 4.3.2 (Debian 4.3.2-1.1) ) #2 SMP Fri Jun 11 10:22:51 CEST
> 2010
>
> general protection fault: 0000 [#1] SMP
> last sysfs file: /sys/devices/platform/dcdbas/smi_data_buf_phys_addr
> CPU 1
> Modules linked in: iTCO_wdt 8021q ipmi_poweroff ipmi_devintf ipmi_si
> ipmi_msghandler mptctl loop ioatdma dca button evdev dcdbas raid10
> raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
> async_tx raid1 raid0 linear md_mod sd_mod sg sr_mod cdrom
> ide_pci_generic ide_core usbhid mptsas ata_piix mptscsih libata
> mptbase scsi_transport_sas scsi_mod ehci_hcd uhci_hcd bnx2 fan [last
> unloaded: scsi_wait_scan]
>
> Pid: 20, comm: events/1 Not tainted 2.6.35-rc2 #2 0N051F/PowerEdge R410
> RIP: 0010:[<ffffffff81087ae9>] [<ffffffff81087ae9>] drain_array+0x29/0xc7
> RSP: 0000:ffff88012fb6ddc0 EFLAGS: 00010202
> RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0720072007200720 RSI: ffff88012fc04ec0 RDI: ffff88012fc1be00
> RBP: ffff88012fb6de00 R08: 0000000000000000 R09: ffff88012fa91618
> R10: 000000102cdbdd88 R11: 0000000000000000 R12: ffff88012fc04ec0
> R13: 0720072007200720 R14: ffff88012fc04ec0 R15: 0000000000000000
072007200720 pattern is the signature of a known bug.
commit 386f40c86d6c8d5b717 (Revert "tty: fix a little bug in scrup,
vt.c") will help you.
This is probably solved in current git tree...
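As background on why the pattern is recognizable: 0x0720 is a blank text-console cell (attribute 0x07, character 0x20, a space), so a register full of it suggests that console memory was written over unrelated data. A quick way to test a register value from an oops is sketched below; count_blank_cells() is a hypothetical helper, not something from the kernel.

```c
#include <stdint.h>

/* Count how many of the four 16-bit words in v equal 0x0720,
 * the attribute/character pair of a blank text-console cell. */
static int count_blank_cells(uint64_t v)
{
	int n = 0;

	for (int i = 0; i < 4; i++) {
		if (((v >> (16 * i)) & 0xffff) == 0x0720)
			n++;
	}
	return n;
}
```

Applied to the dump above, RDX and R13 match in all four words, while a normal kernel pointer such as RSI does not match at all.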
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Arnd Bergmann @ 2010-06-16 14:24 UTC (permalink / raw)
To: Pedro Garcia; +Cc: netdev, Eric Dumazet, Ben Hutchings, Patrick McHardy
In-Reply-To: <6b5ed8108cebb1865c85e03d3244b6ed@dondevamos.com>
On Wednesday 16 June 2010, Pedro Garcia wrote:
> Probably a definitive fix would be not to allow the definition of VLAN 0
> in 802.1q module and provide some other way to tag priority packets without
> using a subinterface (maybe in the same module or a new 8021p one). I am
> having a look at the kernel to see what happens if we load two modules for
> the same protocol.
On a related note, we will also need to support 802.1Qad provider bridges
at some point, which use yet another variation of the VLAN header (actually
two nested VLAN tags) with a different ethertype.
I need this for 802.1Qbg multi-channel VEPA (possibly also 802.1Qbh
port extenders), but I have not yet investigated how to implement this
in the VLAN module.
> By the way, the changelog I have to write is just the text before the
> patch?
Yes.
Arnd
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Eric Dumazet @ 2010-06-16 14:24 UTC (permalink / raw)
To: Pedro Garcia; +Cc: netdev, Ben Hutchings, Patrick McHardy
In-Reply-To: <6b5ed8108cebb1865c85e03d3244b6ed@dondevamos.com>
On Wednesday 16 June 2010 at 15:28 +0200, Pedro Garcia wrote:
>
> In my understanding, 802.1p is a "subset" of 802.1q, and they share the
> protocol number. We can do a 802.1p module, but in the end it will end
> up reusing most of the code in 802.1q.
>
I was more thinking of a default ETH_P_8021Q rx handler (aka
vlan_skb_recv_minimal) with minimum handling (only accept vid=0 frames),
being overridden by real 8021q handler if module loaded/present.
> In any case, defining a VLAN 0 usually ends up in problems with which table
> the ARP entries get stored in. This patch solves the problem for anyone who
> is not using VLAN 0 explicitly, but if somebody is using VLAN 0 tagging
> it will work (whatever "work" means) as before.
>
> Probably a definitive fix would be not to allow the definition of VLAN 0
> in 802.1q module and provide some other way to tag priority packets without
> using a subinterface (maybe in the same module or a new 8021p one). I am
> having a look at the kernel to see what happens if we load two modules for
> the same protocol.
>
> By the way, the changelog I have to write is just the text before the
> patch?
Yes, you can take a look at any patch around for examples, like...
git show 6e327c11a91d190650df9aabe7d3694d4838bfa1
Check Documentation/SubmittingPatches section 2)
* Re: [PATCH 08/12] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-06-16 14:22 UTC (permalink / raw)
To: Grant Likely
Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
Krzysztof Halasa, Thomas Gleixner
In-Reply-To: <AANLkTin4lYghEWhjzEyARvYDgHaXdniwLbfyZ4jY0rwm@mail.gmail.com>
On Tue, Jun 15, 2010 at 11:00:10AM -0600, Grant Likely wrote:
>
> Question from an ignorant reviewer: Why a new interface instead of
> working with the existing high resolution timers infrastructure?
Short answer: Timers are only one part of the PTP API. If you offer
the PTP clock as a Linux clock source, then you could just use the
existing POSIX timers. However, we decided not to offer the PTP clock
in that way. The following excerpts from an upcoming paper explain
why:
\subsection{Basic Clock Operations}
Based on our experience with a number of commercially available
hardware clocks, we identified a set of four basic clock
operations. Besides simply setting or getting the clock's time
value, two additional operations are needed for clock control.
Once a PTP slave has an initial estimate of the offset to its
master, it typically would need to shift the clock by the offset
atomically. Also, the servo loop of a slave periodically needs to
adjust the clock frequency.
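The four basic operations can be illustrated with a small software model in C. This is only a sketch under stated assumptions: the names `struct model_clock`, `clk_settime()`, `clk_gettime()`, `clk_adjtime()` and `clk_adjfreq()` are hypothetical and are not the interface from the patch series, the frequency adjustment is expressed in parts per billion, and the model is single-threaded, so it does not show how a real driver makes the offset shift atomic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of a hardware clock. */
struct model_clock {
    int64_t ns;  /* current reading in nanoseconds */
    int32_t ppb; /* frequency adjustment in parts per billion */
};

/* Operation 1: set the time value. */
static void clk_settime(struct model_clock *c, int64_t ns)
{
    c->ns = ns;
}

/* Operation 2: get the time value. */
static int64_t clk_gettime(const struct model_clock *c)
{
    return c->ns;
}

/* Operation 3: shift the clock by a signed offset in one step. */
static void clk_adjtime(struct model_clock *c, int64_t delta_ns)
{
    c->ns += delta_ns;
}

/* Operation 4: adjust the clock frequency (used by the servo loop). */
static void clk_adjfreq(struct model_clock *c, int32_t ppb)
{
    assert(ppb > -1000000000); /* the clock must keep running forward */
    c->ppb = ppb;
}

/* Model helper: let nominal_ns of oscillator time pass. */
static void clk_tick(struct model_clock *c, int64_t nominal_ns)
{
    c->ns += nominal_ns + nominal_ns * c->ppb / 1000000000;
}
```

A servo would typically call the offset shift once, after the first offset estimate, and then steer the clock through repeated frequency adjustments.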
\subsection{Ancillary Clock Features}
Perhaps the most challenging design issue was deciding how to offer
a PTP clock's capabilities to the GNU/Linux operating system. As
John Eidson pointed out~\cite{eidson2006measurement}, modern
operating systems provide surprisingly little support for
programming based on absolute time. As the IEEE 1588 standard is
being applied to a wide variety of test, measurement, and control
applications, we can imagine many possible ways to use an embedded
computer equipped with a PTP hardware clock. We do not expect that
any API will be able to cover every conceivable application of this
technology. However, the design presented here does cover common
use cases based on the capabilities of currently available hardware
clocks.
The design allows user space programs to control all of a clock's
ancillary features. Programs may create one-shot or periodic
alarms, with signal delivery on expiration. Timestamps on
external events are provided via a First In, First Out (FIFO)
interface. If the clock has output signals, then their periods are
configurable from user space. Synchronization of the Linux system
time via the PPS subsystem may be enabled or disabled as desired.
\subsection{Synchronizing the Linux System Clock}
One important question that needed to be addressed was, now that we
have a precise time source, how do we synchronize the Linux kernel
to it? The Linux kernel offers a modular clock infrastructure
comprising ``clock sources'' and ``clock event devices.'' Clock
sources provide a monotonically increasing time base, and clock
event devices are used to schedule the next interrupt for various
timer events.
We considered but ultimately rejected the idea of offering the PTP
clock to the Linux kernel as a combined clock source and clock
event device. The one great advantage of this approach would have
been that it obviates the need for synchronization when the PTP
clock is selected as the system timer.
However, this approach is problematic when using certain kinds of
clock hardware. For example, physical layer (PHY) chip based clocks
can only be accessed by the relatively slow 16 bit wide MDIO
bus. Such a clock would not be suitable for providing high
resolution timers, which are now a standard Linux kernel feature.
Furthermore, we cannot even be sure that a given hardware clock
will offer any interrupt to the system at all.
Instead, we elected to use the Pulse Per Second (PPS) subsystem as
a method to optionally synchronize the Linux system time to the PTP
clock. This method is feasible even for clocks that do not offer
fast register access, such as the PHY clocks. Of course, the main
disadvantage of this approach is that the Linux system time will
not be exactly synchronized to the PTP clock time. Since PTP
clocks can be synchronized an order of magnitude better than the
typical operating system scheduling latency, we expect that this
method will still yield acceptable results for many applications.
Applications with more demanding time requirements may use the new
PTP interfaces directly when needed.
\subsection{System Calls or Character Device}
When adding new functionality to an operating system, a basic design
decision is how user space programs will call into the kernel. For
the Linux kernel, two different ways come into question, namely
system calls or as a ``character device.'' In an attempt to make
the PTP clock API easy to understand, we patterned it after the
existing Network Time Protocol (NTP) and the POSIX timer APIs, as
described in Section~\ref{UserAPI}. Both of these services are
exported to the user space as system calls. However, we decided to
offer the PTP clock as a character device because extending the NTP
and POSIX interfaces seemed impractical. In addition, the
character device's \fn{read()} method provides a
convenient way to deliver time stamped events to user space
programs.
* Re: [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: Eric Dumazet @ 2010-06-16 14:14 UTC (permalink / raw)
To: hadi
Cc: Tom Hughes, netdev, akpm, David S. Miller, Stephen Hemminger,
Patrick McHardy, Tejun Heo, linux-kernel
In-Reply-To: <1276694745.3862.1.camel@bigi>
On Wednesday, 16 June 2010 at 09:25 -0400, jamal wrote:
> On Wed, 2010-06-16 at 09:24 +0100, Tom Hughes wrote:
> > The sch_teql module, which can be used to load balance over a set of
> > underlying interfaces, stopped working after 2.6.30 and has been
> > broken in all kernels since then for any underlying interface which
> > requires the addition of link level headers.
> >
> > The problem is that the transmit routine relies on being able to
> > access the destination address in the skb in order to do address
> > resolution once it has decided which underlying interface it is going
> > to transmit through.
> >
> > In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
> > default for all interfaces, which causes the destination address to be
> > released before the transmit routine for the interface is called.
> >
> > The solution is to clear that flag for teql interfaces.
> >
> > Signed-off-by: Tom Hughes <tom@compton.nu>
>
> Sounds reasonable. Let's CC Eric and get his ACK.
>
> cheers,
> jamal
>
Sure, I already acked it in a previous message 5 days ago (although it was
not a formal patch; Stephen forwarded a bugzilla entry):
http://permalink.gmane.org/gmane.linux.network/163688
David, could you please add the bugzilla entry to the commit?
https://bugzilla.kernel.org/show_bug.cgi?id=16183
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Thanks !
* ipv6: netif_carrier_(on|off) with traces afterwards
From: Einar EL Lueck @ 2010-06-16 13:29 UTC (permalink / raw)
To: netdev
Hi,
With IPv6 addresses configured and a network card doing
netif_carrier_off|on, we afterwards see some traces in fib with 2.6.34
on S/390.
Example sequence of operations:
ip -6 addr add fd00:10:30:49:4008:ffff:35:2/80 dev eth1
ip link set eth1 up
ip link set eth1 down
# the following lines have as effect netif_carrier_off and then on (among other stuff)
echo 0 > /sys/bus/ccwgroup/drivers/qeth/devices/0.0.e300/online
echo 1 > /sys/bus/ccwgroup/drivers/qeth/devices/0.0.e300/online
# end of plugging
ip link set eth1 up
ip link set eth1 down
=> at this point we get the following trace:
Badness at /home/autobuild/BUILD/linux-2.6.34-20100531/net/ipv6/ip6_fib.c:1160
Modules linked in: qeth_l2 sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup
CPU: 9 Not tainted 2.6.34-43.x.20100531-s390xperformance #1
Process ip (pid: 18144, task: 000000007c304238, ksp: 00000000033af428)
Krnl PSW : 0704200180000000 000003c0018c05a4 (fib6_del+0x60/0x3f8 [ipv6])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000002 000000007b58a700 0000000000830398
           0000000000830398 0000000000000001 0000000000000000 0000000000830398
           000003c0018bba5e 00000000033af480 0000000071ff2780 000000007b58a700
           000003c0018a7000 000003c0018e4d50 00000000033af398 00000000033af2e0
Krnl Code: 000003c0018c0598: eb6ff1000004 lmg %r6,%r15,256(%r15)
000003c0018c059e: 07f4 bcr 15,%r4
000003c0018c05a0: a7f40001 brc 15,3c0018c05a2
>000003c0018c05a4: a728fffe lhi %r2,-2
000003c0018c05a8: a7f4fff3 brc 15,3c0018c058e
000003c0018c05ac: b90200aa ltgr %r10,%r10
000003c0018c05b0: a784ffed brc 8,3c0018c058a
000003c0018c05b4: e32033b00020 cg %r2,944(%r3)
Call Trace:
([<0000000000000020>] 0x20)
[<000003c0018bb9c6>] __ip6_del_rt+0x6e/0xc8 [ipv6]
[<000003c0018bba5e>] ip6_del_rt+0x3e/0x4c [ipv6]
[<000003c0018b4448>] __ipv6_ifa_notify+0x13c/0x1d4 [ipv6]
[<000003c0018b7072>] addrconf_ifdown+0x3b2/0x5f0 [ipv6]
[<000003c0018b8b2c>] addrconf_notify+0xb4/0x944 [ipv6]
[<00000000004fb6aa>] notifier_call_chain+0x5a/0xa0
[<000000000016675a>] raw_notifier_call_chain+0x2a/0x3c
[<0000000000459194>] __dev_notify_flags+0x88/0xac
[<0000000000459200>] dev_change_flags+0x48/0x70
[<00000000004664d4>] do_setlink+0x35c/0x60c
[<00000000004678fa>] rtnl_newlink+0x446/0x574
[<000000000047c89c>] netlink_rcv_skb+0xdc/0xf0
[<00000000004671a8>] rtnetlink_rcv+0x3c/0x48
[<000000000047c35e>] netlink_unicast+0x352/0x3ac
[<000000000047cebe>] netlink_sendmsg+0x22a/0x35c
[<00000000004414c4>] sock_sendmsg+0xdc/0x100
[<00000000004417ca>] SyS_sendmsg+0x182/0x2d4
[<000000000043f17c>] SyS_socketcall+0x150/0x338
[<0000000000119346>] sysc_noemu+0x10/0x16
[<0000004e1271c71a>] 0x4e1271c71a
Last Breaking-Event-Address:
[<000003c0018c05a0>] fib6_del+0x5c/0x3f8 [ipv6]
Has anybody seen effects like this before on other platforms, or does
anybody have suggestions about the root cause?
Thanks,
Einar.
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Pedro Garcia @ 2010-06-16 13:28 UTC (permalink / raw)
To: netdev; +Cc: Eric Dumazet, Ben Hutchings, Patrick McHardy
In-Reply-To: <4C18B898.4000307@trash.net>
On Wed, 16 Jun 2010 13:42:16 +0200, Patrick McHardy <kaber@trash.net>
wrote:
> Eric Dumazet wrote:
>> On Wednesday, 16 June 2010 at 10:49 +0200, Pedro Garcia wrote:
>>> Here it is again. I added the modifications in
>>> http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW
>>> accelerated incoming packets (it did not apply cleanly on the last
>>> version of
>>> the kernel, so I applied manually). Now, if the VLAN 0 is not
>>> explicitly created by the user, VLAN 0 packets will be treated as no
>>> VLAN (802.1p packets), instead of dropping them.
>>>
>>> The patch is now for two files: vlan_core (accel) and vlan_dev (non
>>> accel)
>>>
>>> I can not test on HW accelerated devices, so if someone can check it I
>>> will appreciate it (even though in the thread above it looked like yes).
>>> For non accel I tested in 2.6.26. Now the patch is for
>>> net-next-2.6, and it compiles OK, but I have to set up a test
>>> environment to check it is still OK (should, but better to test).
>>>
>>> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
>>>
>>
>> OK, the patch itself is correct.
>>
>
> Yes, looks fine to me as well.
>
>> Now, could you please send it again with a proper changelog ?
>>
>> In this changelog, please explain why patch is needed, and
>> keep lines short (< 72 chars), like the one you did in your first mail.
>>
>> I'll then add my Signed-off-by, since I wrote the accelerated part ;)
>>
>> Note : I wonder if another patch is needed, in case 8021q module is
>> _not_ loaded. We probably should accept vlan 0 frames in this case ?
>>
>
> I agree that this would be best for consistency, but that would mean
> adding more special cases to __netif_receive_skb().
In my understanding, 802.1p is a "subset" of 802.1q, and they share the
protocol number. We can do a 802.1p module, but in the end it will end
up reusing most of the code in 802.1q.
In any case, defining a VLAN 0 usually ends up causing problems with which
table the ARP entries get stored in. This patch solves the problem for anyone
who is not using VLAN 0 explicitly, but if somebody is using VLAN 0 tagging
it will work (whatever "work" means) as before.
Probably a definitive fix would be not to allow the definition of VLAN 0
in 802.1q module and provide some other way to tag priority packets without
using a subinterface (maybe in the same module or a new 8021p one). I am
having a look at the kernel to see what happens if we load two modules for
the same protocol.
By the way, the changelog I have to write is just the text before the
patch?
Pedro
* Re: [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: jamal @ 2010-06-16 13:25 UTC (permalink / raw)
To: Tom Hughes
Cc: netdev, akpm, David S. Miller, Eric Dumazet, Stephen Hemminger,
Patrick McHardy, Tejun Heo, linux-kernel
In-Reply-To: <1276676668-10256-1-git-send-email-tom@compton.nu>
On Wed, 2010-06-16 at 09:24 +0100, Tom Hughes wrote:
> The sch_teql module, which can be used to load balance over a set of
> underlying interfaces, stopped working after 2.6.30 and has been
> broken in all kernels since then for any underlying interface which
> requires the addition of link level headers.
>
> The problem is that the transmit routine relies on being able to
> access the destination address in the skb in order to do address
> resolution once it has decided which underlying interface it is going
> to transmit through.
>
> In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
> default for all interfaces, which causes the destination address to be
> released before the transmit routine for the interface is called.
>
> The solution is to clear that flag for teql interfaces.
>
> Signed-off-by: Tom Hughes <tom@compton.nu>
Sounds reasonable. Let's CC Eric and get his ACK.
cheers,
jamal
* Re: rt hash table / rt hash locks question
From: Nick Piggin @ 2010-06-16 12:49 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1276691258.2632.55.camel@edumazet-laptop>
On Wed, Jun 16, 2010 at 02:27:38PM +0200, Eric Dumazet wrote:
> On Wednesday, 16 June 2010 at 20:46 +1000, Nick Piggin wrote:
> > I'm just converting this scalable dentry/inode hash table to a more
> > compact form. I was previously using a dumb spinlock per bucket,
> > but this doubles the size of the tables so isn't production quality.
> >
>
> Yes, we had this in the past (one rwlock or spinlock per hash chain),
> and it was not very good with LOCKDEP on.
Sure :) And it halves the size of your hash even with lockdep off.
> > What I've done at the moment is to use a bit_spinlock in bit 0 of each
> > list pointer of the table. Bit spinlocks are now pretty nice because
> > we can do __bit_spin_unlock() which gives non-atomic store with release
> > ordering, so it should be almost as fast as spinlock.
> >
> > But I look at rt hash and it seems you use a small hash on the side
> > for spinlocks. So I wonder, pros for each:
> >
> > - bitlocks have effectively zero storage
> yes but a mask is needed to get head pointer. Special care also must
> be taken when insert/delete a node in chain, keeping this bit set.
That is true. Overall, I don't know what would be better for straight
line cycles, all in L1 cache. Probably the spinlocks, although there
is some small overhead from loading the 2nd hash.
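The masking and "keep the bit set" points above can be sketched in user space with C11 atomics. This is an illustration only, not the kernel's bit_spinlock implementation: `struct bl_head`, `bl_lock()`, `bl_unlock()`, `bl_first()` and `bl_add_head()` are hypothetical names, and the sketch assumes nodes are at least 2-byte aligned so that bit 0 of a node pointer is always free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;
    int key;
};

/* Head word: pointer to the first node, with bit 0 used as the lock. */
struct bl_head {
    atomic_uintptr_t word;
};

static void bl_lock(struct bl_head *h)
{
    /* Spin until this caller is the one that flips bit 0 from 0 to 1. */
    while (atomic_fetch_or_explicit(&h->word, (uintptr_t)1,
                                    memory_order_acquire) & 1)
        ;
}

static void bl_unlock(struct bl_head *h)
{
    /*
     * Waiters only ever OR in a bit that is already set, so the holder
     * can clear it with a plain load followed by a release store.
     */
    uintptr_t w = atomic_load_explicit(&h->word, memory_order_relaxed);

    assert(w & 1);
    atomic_store_explicit(&h->word, w & ~(uintptr_t)1,
                          memory_order_release);
}

/* Readers must mask out the lock bit to get the real head pointer. */
static struct node *bl_first(struct bl_head *h)
{
    uintptr_t w = atomic_load_explicit(&h->word, memory_order_acquire);

    return (struct node *)(w & ~(uintptr_t)1);
}

/* Caller holds the lock; the new head is stored with bit 0 still set. */
static void bl_add_head(struct bl_head *h, struct node *n)
{
    assert(((uintptr_t)n & 1) == 0);
    n->next = bl_first(h);
    atomic_store_explicit(&h->word, (uintptr_t)n | 1,
                          memory_order_release);
}
```

The extra mask on every head load is the per-lookup cost being weighed here against loading a second cache line for a separate spinlock.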
> > - bitlocks hit the same cacheline that the hash walk hits.
> yes
> > - in RCU list, locked hash walks usually followed by hash modification,
> > bitlock should have brought in the line for exclusive.
> But we usually perform a read only lookup, _then_ take the lock, to
> perform a new lookup before insert. So at time we would take the
> bitlock, cache line is in shared state. With spinlocks, we always use
> the exclusive mode, but on a separate cache line...
Hmm, OK. This is usually true of the dcache and icache as well
actually. But you still have the same problem with spinlocks (with I
presume the common case of 0 or 1 entry in the hash) when inserting
into the table.
So we're still often avoiding one cacheline transition, and avoiding
hitting one cacheline.
> > - bitlock number of locks scales with hash size
> Yes, but concurrency is more a function of online cpus, given we use
> jhash.
Oh yeah but it has a maximum upper bound on the number of buckets in
the hash, and just having it scale nicely avoids the ifdef heuristics
in the existing code.
> > - spinlocks may be slightly better at the cacheline level (bitops
> > sometimes require explicit load which may not acquire exclusive
> > line on some archs). On x86 and ll/sc architectures, this shouldn't
> > be a problem.
> Yes, you can add fairness (if ticket spinlocks variant used), but on
> route cache I really doubt it can make a difference.
Yes if the critical sections are very short and uncontended, I don't
think it's a large factor.
> > - spinlocks better debugging (could be overcome with a LOCKDEP
> > option to revert to spinlocks, but a bit ugly).
> Definitely a good thing.
>
> > - in practice, contention due to aliasing in buckets to lock mapping
> > is probably fairly minor.
> Agreed
> >
> > Net code is obviously tested and tuned well, but instinctively I would
> > have thought bitlocks are the better way to go. Any comments on this?
>
> Well, to be honest, this code is rather old, and at the time I wrote it,
> bitlocks were probably not available.
>
> You can add :
>
> - One downside of the hashed spinlocks is the X86_INTERNODE_CACHE_SHIFT
> being 12 on X86_VSMP : All locks are probably in same internode block :(
Oh yeah that's true, very special case though.
> - Another downside is all locks are currently on a single NUMA node,
> since we kmalloc() them in one contiguous chunk.
>
> So I guess it would be worth to try :)
OK, this is what I'm working with for the icache/dcache to hide the
details of masking out the low bit. But it looks like rt hash is a bit
more highly tuned (eg without pprev pointer as you don't delete items
without walking the hash). So it might not be appropriate for you.
You might be able to derive some macros to hide some of the pain though.
Index: linux-2.6/include/linux/list_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/list_bl.h
@@ -0,0 +1,97 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+
+/*
+ * Special version of lists, where head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing memory footprint overhead.
+ */
+
+struct hlist_bl_head {
+ struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+ struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+ h->next = NULL;
+ h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+ return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)h->first & ~1UL);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)h->first & 1UL));
+#endif
+ h->first = (struct hlist_bl_node *)((unsigned long)n | 1UL);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *ptr)
+{
+ return !((unsigned long)ptr->first & ~1UL);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)h->first & 1UL));
+#endif
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+ struct hlist_bl_node *next = n->next;
+ struct hlist_bl_node **pprev = n->pprev;
+ *pprev = (struct hlist_bl_node *)((unsigned long)next | ((unsigned long)*pprev & 1UL));
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_for_each_entry - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member) \
+ for (pos = hlist_bl_first(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = pos->next)
+
+#endif
Index: linux-2.6/include/linux/rculist_bl.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rculist_bl.h
@@ -0,0 +1,123 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+#ifdef __KERNEL__
+
+/*
+ * RCU-protected list version
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, struct hlist_bl_node *n)
+{
+#ifdef CONFIG_DEBUG_LIST
+ BUG_ON(!((unsigned long)h->first & 1UL));
+#endif
+ rcu_assign_pointer(h->first, (struct hlist_bl_node *)((unsigned long)n | 1UL));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return (struct hlist_bl_node *)((unsigned long)rcu_dereference(h->first) & ~1UL);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on the node returns true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so hlist_bl_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+ if (!hlist_bl_unhashed(n)) {
+ __hlist_bl_del(n);
+ n->pprev = NULL;
+ }
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+ __hlist_bl_del(n);
+ n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs. Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+ struct hlist_bl_head *h)
+{
+ struct hlist_bl_node *first = hlist_bl_first(h);
+
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ n->pprev = &h->first;
+ hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_bl_node to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member) \
+ for (pos = hlist_bl_first_rcu(head); \
+ pos && \
+ ({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+ pos = rcu_dereference_raw(pos->next))
+
+#endif
+#endif
* Re: rt hash table / rt hash locks question
From: Eric Dumazet @ 2010-06-16 12:27 UTC (permalink / raw)
To: Nick Piggin; +Cc: netdev
In-Reply-To: <20100616104633.GW6138@laptop>
On Wednesday, 16 June 2010 at 20:46 +1000, Nick Piggin wrote:
> I'm just converting this scalable dentry/inode hash table to a more
> compact form. I was previously using a dumb spinlock per bucket,
> but this doubles the size of the tables so isn't production quality.
>
Yes, we had this in the past (one rwlock or spinlock per hash chain),
and it was not very good with LOCKDEP on.
> What I've done at the moment is to use a bit_spinlock in bit 0 of each
> list pointer of the table. Bit spinlocks are now pretty nice because
> we can do __bit_spin_unlock() which gives non-atomic store with release
> ordering, so it should be almost as fast as spinlock.
>
> But I look at rt hash and it seems you use a small hash on the side
> for spinlocks. So I wonder, pros for each:
>
> - bitlocks have effectively zero storage
yes but a mask is needed to get head pointer. Special care also must
be taken when insert/delete a node in chain, keeping this bit set.
> - bitlocks hit the same cacheline that the hash walk hits.
yes
> - in RCU list, locked hash walks usually followed by hash modification,
> bitlock should have brought in the line for exclusive.
But we usually perform a read only lookup, _then_ take the lock, to
perform a new lookup before insert. So at time we would take the
bitlock, cache line is in shared state. With spinlocks, we always use
the exclusive mode, but on a separate cache line...
> - bitlock number of locks scales with hash size
Yes, but concurrency is more a function of online cpus, given we use
jhash.
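For comparison, the "small hash on the side" scheme can be sketched as a fixed lock table indexed by the bucket number masked to the table size. This is a user-space illustration with C11 `atomic_flag` spinlocks; `LOCK_TABLE_SIZE`, `lock_table`, `lock_index()`, `bucket_lock()` and `bucket_unlock()` are hypothetical names, and the table size is an arbitrary assumption rather than the value used by the route cache.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOCK_TABLE_SIZE 256 /* assumed value; must be a power of two */

static atomic_flag lock_table[LOCK_TABLE_SIZE];

/* Put every lock into the unlocked state before first use. */
static void lock_table_init(void)
{
    size_t i;

    for (i = 0; i < LOCK_TABLE_SIZE; i++)
        atomic_flag_clear(&lock_table[i]);
}

/* Many hash buckets share one lock: only the low bits pick the lock. */
static size_t lock_index(size_t bucket)
{
    return bucket & (LOCK_TABLE_SIZE - 1);
}

static void bucket_lock(size_t bucket)
{
    atomic_flag *l = &lock_table[lock_index(bucket)];

    assert(lock_index(bucket) < LOCK_TABLE_SIZE);
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static void bucket_unlock(size_t bucket)
{
    atomic_flag_clear_explicit(&lock_table[lock_index(bucket)],
                               memory_order_release);
}
```

With a well-mixed hash, two CPUs collide on a lock only when their buckets share the same low bits, which is why the useful number of locks tracks the number of CPUs rather than the number of buckets.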
> - spinlocks may be slightly better at the cacheline level (bitops
> sometimes require explicit load which may not acquire exclusive
> line on some archs). On x86 and ll/sc architectures, this shouldn't
> be a problem.
Yes, you can add fairness (if ticket spinlocks variant used), but on
route cache I really doubt it can make a difference.
> - spinlocks better debugging (could be overcome with a LOCKDEP
> option to revert to spinlocks, but a bit ugly).
Definitely a good thing.
> - in practice, contention due to aliasing in buckets to lock mapping
> is probably fairly minor.
Agreed
>
> Net code is obviously tested and tuned well, but instinctively I would
> have thought bitlocks are the better way to go. Any comments on this?
Well, to be honest, this code is rather old, and at the time I wrote it,
bitlocks were probably not available.
You can add :
- One downside of the hashed spinlocks is the X86_INTERNODE_CACHE_SHIFT
being 12 on X86_VSMP : All locks are probably in same internode block :(
- Another downside is all locks are currently on a single NUMA node,
since we kmalloc() them in one contiguous chunk.
So I guess it would be worth a try :)
* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Patrick McHardy @ 2010-06-16 11:42 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Pedro Garcia, netdev, Ben Hutchings
In-Reply-To: <1276679284.2632.22.camel@edumazet-laptop>
Eric Dumazet wrote:
> On Wednesday, 16 June 2010 at 10:49 +0200, Pedro Garcia wrote:
>> Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
>> the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
>>
>> The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
>>
>> I can not test on HW accelerated devices, so if someone can check it I will appreciate it (even though in the thread above it looked like yes). For non accel I tested in 2.6.26. Now the patch is for
>> net-next-2.6, and it compiles OK, but I have to set up a test environment to check it is still OK (should, but better to test).
>>
>> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
>>
>
> OK, the patch itself is correct.
>
Yes, looks fine to me as well.
> Now, could you please send it again with a proper changelog ?
>
> In this changelog, please explain why patch is needed, and
> keep lines short (< 72 chars), like the one you did in your first mail.
>
> I'll then add my Signed-off-by, since I wrote the accelerated part ;)
>
> Note : I wonder if another patch is needed, in case 8021q module is
> _not_ loaded. We probably should accept vlan 0 frames in this case ?
>
I agree that this would be best for consistency, but that would mean
adding more special cases to __netif_receive_skb().