* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jamal Hadi Salim @ 2012-12-21 13:03 UTC (permalink / raw)
To: Yury Stankevich
Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D327CD.3050904@gmail.com>
On 12-12-20 09:59 AM, Yury Stankevich wrote:
> interesting,
>
> #tc -s filter show dev usb0 parent ffff:
Given you are adding this on ingress - the settings you have will
happen before pre-routing hook.
If you did things at egress - the setting will take effect after
post-routing. So take a closer look at those details they look
like your source of issues..
cheers,
jamal
^ permalink raw reply
* Re: perm_addr get
From: Ben Hutchings @ 2012-12-21 13:09 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev
In-Reply-To: <20121221130127.GB2040@minipsycho.orion>
On Fri, 2012-12-21 at 14:01 +0100, Jiri Pirko wrote:
> Hi all.
>
> From what I understand dev->perm_addr is set only in case the hw has
> permanent hw address somewhere written (for example EPROM).
>
> So when I query device which does not have perm_addr set I get:
>
> testt1:~$ ethtool -P team0
> Permanent address: 00:00:00:00:00:00
>
> Is this the correct behaviour? Wouldn't it be more correct if
> ethtool_get_perm_addr() fails with -ENOENT for something like that?
I don't think we should change the implementation now, as someone might
depend on it. It's trivial to distinguish this not-a-permanent-address
case. However the ethtool command output could be improved.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Yury Stankevich @ 2012-12-21 13:13 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D45E25.7050703@mojatatu.com>
21.12.2012 17:03, Jamal Hadi Salim пишет:
> On 12-12-20 09:59 AM, Yury Stankevich wrote:
>> interesting,
>>
>> #tc -s filter show dev usb0 parent ffff:
>
>
> Given you are adding this on ingress - the settings you have will
> happen before pre-routing hook.
> If you did things at egress - the setting will take effect after
> post-routing. So take a closer look at those details they look
> like your source of issues..
sure,
i use it ingress,
so, i need to use tc xt action
to get mark on the packet, before filter on ifb will run.
prerouting rule, in turn, used to test if mark was actually restored.
in practice:
1. prerouting rule - is not fired. so, no packets with mark was seen.
2. filter on ifb - do not pass traffic to flow configured.
looks like `CONNMARK --restore` is not really called.
--
Linux registered user #402966 // pub 1024D/E99AF373 <pgp.mit.edu>
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: perm_addr get
From: Jiri Pirko @ 2012-12-21 13:25 UTC (permalink / raw)
To: Ben Hutchings; +Cc: netdev
In-Reply-To: <1356095395.7055.34.camel@deadeye.wl.decadent.org.uk>
Fri, Dec 21, 2012 at 02:09:55PM CET, bhutchings@solarflare.com wrote:
>On Fri, 2012-12-21 at 14:01 +0100, Jiri Pirko wrote:
>> Hi all.
>>
>> From what I understand dev->perm_addr is set only in case the hw has
>> permanent hw address somewhere written (for example EPROM).
>>
>> So when I query device which does not have perm_addr set I get:
>>
>> testt1:~$ ethtool -P team0
>> Permanent address: 00:00:00:00:00:00
>>
>> Is this the correct behaviour? Wouldn't it be more correct if
>> ethtool_get_perm_addr() fails with -ENOENT for something like that?
>
>I don't think we should change the implementation now, as someone might
>depend on it. It's trivial to distinguish this not-a-permanent-address
>case. However the ethtool command output could be improved.
Well, not change it even if it is not correct? And by "trivial to distinguish"
you mean 00:00:00:00:00:00 ~ device has no permanent address?
But in some cases (like vxge, mac80211) it's possible to see 00:00:00:00:00:00
by ethtool -P even though the device has permanent address (set later on, after
register_netdev call).
I think that ethtool_get_perm_addr should return:
-ENOENT if dev has no perm addr
-EAGAIN if dev perm addr hasn't been obtained yet
0 (and addr) in other cases
And how exactly should be the ethtool output improved?
Thanks
Jiri
>
>Ben.
>
>--
>Ben Hutchings, Staff Engineer, Solarflare
>Not speaking for my employer; that's the marketing department's job.
>They asked us to note that Solarflare product names are trademarked.
>
^ permalink raw reply
* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jamal Hadi Salim @ 2012-12-21 13:50 UTC (permalink / raw)
To: Yury Stankevich
Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D46060.2070308@gmail.com>
On 12-12-21 08:13 AM, Yury Stankevich wrote:
> sure,
> i use it ingress,
> so, i need to use tc xt action
> to get mark on the packet, before filter on ifb will run.
Ok. So does ifb see it?
> prerouting rule, in turn, used to test if mark was actually restored.
No experience with connmark, but - in order to restore something has
to store it, correct?
> in practice:
> 1. prerouting rule - is not fired. so, no packets with mark was seen.
> 2. filter on ifb - do not pass traffic to flow configured.
> looks like `CONNMARK --restore` is not really called.
>
My suspicion is that it is not set to begin with...
cheers,
jamal
^ permalink raw reply
* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Yury Stankevich @ 2012-12-21 14:14 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D46928.9070809@mojatatu.com>
well.
let me describe whole picture i want to achieve
1. use htb/sfq on ingress.
i got a traffic, and use few u32 filters to direct it to 3 flows,
priority, interactive, and bulk.
as http normally pass to interactive flow, i want to move long donwloads
to the bulk one.
how i trying to find these downloads:
iptables -t mangle -A POSTROUTING -p tcp --dport 80 -m connmark --mark 0
-m connbytes --connbytes 204800: --connbytes-dir both --connbytes-mode
bytes -j CONNMARK --set-mark 0xa
so, http connection where more than 200K downloaded, must got a
connection mark.
since ingress traffic hits qos before netfilter,
i use xt action, to copy connection mark, to a packet.
(action xt -j CONNMARK --restore-mark )
from this moment, i expect packet must have a restored mark
after this, i can use high priority tc filter .. handle 0xa fw flowid 1:102
to direct packet with mark 0xa to 1:102 flow (bulk).
now about a problem.
1. i run http download, once i get 200K - i can see that rule in
POSTROUTING is triggered and connection mark is installed (iptables -L
-n -v mangle -- can show number of packets matched by rule)
2. i see to tc stats for my flows, and i see, that packets still going
to interactive flow, not bulk as i expect.
3. from tc -s filter show dev usb0 parent ffff:
filter protocol ip pref 49152 u32
filter protocol ip pref 49152 u32 fh 800: ht divisor 1
filter protocol ip pref 49152 u32 fh 800::800 order 2048 key ht 800 bkt
0 terminal flowid ??? (rule hit 707 success 707)
match 00000000/00000000 at 0 (success 707 )
action order 1: tablename: mangle hook: NF_IP_PRE_ROUTING
target CONNMARK restore
index 5 ref 1 bind 1 installed 394 sec used 11 sec
Action statistics:
Sent 783783 bytes 707 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: mirred (Egress Redirect to device ifb0) stolen
index 5 ref 1 bind 1 installed 394 sec used 11 sec
Action statistics:
Sent 783783 bytes 707 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
i can see that packets must reach xt action.
4. lets try to check packets mark with iptables,
if mark restored by xt action - i must be able to match it in prerouting
rule.
iptables -t mangle -A PREROUTING -m mark --mark 0xa -j NFLOG --nflog-group 1
but this rule not macthesd - so, no mark is restored by xt action.
maybe im completely wrong here, and such mode can't work for some reason ?
21.12.2012 17:50, Jamal Hadi Salim пишет:
> On 12-12-21 08:13 AM, Yury Stankevich wrote:
>
>> sure,
>> i use it ingress,
>> so, i need to use tc xt action
>> to get mark on the packet, before filter on ifb will run.
>
> Ok. So does ifb see it?
>
>> prerouting rule, in turn, used to test if mark was actually restored.
>
> No experience with connmark, but - in order to restore something has
> to store it, correct?
>
>> in practice:
>> 1. prerouting rule - is not fired. so, no packets with mark was seen.
>> 2. filter on ifb - do not pass traffic to flow configured.
>> looks like `CONNMARK --restore` is not really called.
>>
>
> My suspicion is that it is not set to begin with...
>
> cheers,
> jamal
>
--
Linux registered user #402966 // pub 1024D/E99AF373 <pgp.mit.edu>
^ permalink raw reply
* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jan Engelhardt @ 2012-12-21 14:35 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Yury Stankevich, Hasan Chowdhury, Stephen Hemminger,
netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D46928.9070809@mojatatu.com>
On Friday 2012-12-21 14:50, Jamal Hadi Salim wrote:
> On 12-12-21 08:13 AM, Yury Stankevich wrote:
>> i use it ingress,
>> so, i need to use tc xt action
>> to get mark on the packet, before filter on ifb will run.
>> prerouting rule, in turn, used to test if mark was actually restored.
>
> No experience with connmark, but - in order to restore something has
> to store it, correct?
The bigger problem here, if I see __netif_receive_skb right, is that
when ingress rules run, skb->nfct is still unset, thereby the
CONNMARK action is a no-op.
^ permalink raw reply
* Re: perm_addr get
From: Ben Hutchings @ 2012-12-21 14:40 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev
In-Reply-To: <20121221132500.GC2040@minipsycho.orion>
On Fri, 2012-12-21 at 14:25 +0100, Jiri Pirko wrote:
> Fri, Dec 21, 2012 at 02:09:55PM CET, bhutchings@solarflare.com wrote:
> >On Fri, 2012-12-21 at 14:01 +0100, Jiri Pirko wrote:
> >> Hi all.
> >>
> >> From what I understand dev->perm_addr is set only in case the hw has
> >> permanent hw address somewhere written (for example EPROM).
> >>
> >> So when I query device which does not have perm_addr set I get:
> >>
> >> testt1:~$ ethtool -P team0
> >> Permanent address: 00:00:00:00:00:00
> >>
> >> Is this the correct behaviour? Wouldn't it be more correct if
> >> ethtool_get_perm_addr() fails with -ENOENT for something like that?
> >
> >I don't think we should change the implementation now, as someone might
> >depend on it. It's trivial to distinguish this not-a-permanent-address
> >case. However the ethtool command output could be improved.
>
> Well, not change it even if it is not correct? And by "trivial to distinguish"
> you mean 00:00:00:00:00:00 ~ device has no permanent address?
> But in some cases (like vxge, mac80211) it's possible to see 00:00:00:00:00:00
> by ethtool -P even though the device has permanent address (set later on, after
> register_netdev call).
Oh well, those are just driver bugs that should be fixed.
> I think that ethtool_get_perm_addr should return:
> -ENOENT if dev has no perm addr
> -EAGAIN if dev perm addr hasn't been obtained yet
> 0 (and addr) in other cases
>
> And how exactly should be the ethtool output improved?
Print something like 'No permanent address assigned' when the address is
all-zeroes.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Eric Dumazet @ 2012-12-21 15:45 UTC (permalink / raw)
To: Jan Engelhardt
Cc: Jamal Hadi Salim, Yury Stankevich, Hasan Chowdhury,
Stephen Hemminger, netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <alpine.LNX.2.01.1212211533360.957@nerf07.vanv.qr>
On Fri, 2012-12-21 at 15:35 +0100, Jan Engelhardt wrote:
>
> The bigger problem here, if I see __netif_receive_skb right, is that
> when ingress rules run, skb->nfct is still unset, thereby the
> CONNMARK action is a no-op.
Right, ingress is performed before IP/netfilter stack.
This reminds me this might be the reason we have
skb_reset_transport_header(skb);
in __netif_receive_skb(), while its not very logical.
(Yes, sorry for being off topic, but I am referring to
http://www.spinics.net/lists/netdev/msg214662.html )
^ permalink raw reply
* Re: TUN problems (regression?)
From: Paul Moore @ 2012-12-21 16:27 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Jason Wang, netdev
In-Reply-To: <1356046697.21834.3606.camel@edumazet-glaptop>
On Thursday, December 20, 2012 03:38:17 PM Eric Dumazet wrote:
> On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
> > [CC'ing netdev in case this is a known problem I just missed ...]
> >
> > Hi Jason,
> >
> > I started doing some more testing with the multiqueue TUN changes and I
> > ran
> > into a problem when running tunctl: running it once w/o arguments works as
> > expected, but running it a second time results in failure and a
> > kmem_cache_sanity_check() failure. The problem appears to be very
> > repeatable on my test VM and happens independent of the LSM/SELinux fixup
> > patches.
> >
> > Have you seen this before?
>
> Obviously code in tun_flow_init() is wrong...
>
> static int tun_flow_init(struct tun_struct *tun)
> {
> int i;
>
> tun->flow_cache = kmem_cache_create("tun_flow_cache",
> sizeof(struct tun_flow_entry),
> 0, 0, NULL);
> if (!tun->flow_cache)
> return -ENOMEM;
> ...
> }
>
>
> I have no idea why we would need a kmem_cache per tun_struct,
> and why we even need a kmem_cache.
>
>
> I would try following patch :
>
> drivers/net/tun.c | 24 +++---------------------
> 1 file changed, 3 insertions(+), 21 deletions(-)
Thanks, that solved my problem. Also, in case you were still curious, I was
using SLUB.
--
paul moore
security and virtualization @ redhat
^ permalink raw reply
* IPv6 over Firewire
From: Stephan Gatzka @ 2012-12-21 17:03 UTC (permalink / raw)
To: netdev, linux1394-devel
Hi!
I'm currently implementing IPv6 over firewire.
To make the address mapping to the firewire link layer RFC3146 mandates
to use neighbor discovery with a certain format of the source/target
link-layer address option.
While it is not too complicated to build that up, I'm wondering how I
can reserve enough memory in the corresponding skb.
One possibility is to introduce some option padding in
ndisc_addr_option_pad() but I think that's somehow weird.
The second option I see is to set needed_tailroom in struct netdevice
for firewire net devices. That's the way I would go for.
Because I'm not really familiar with the whole network infrastructure in
Linux, a confirmation for the way to go would be nice.
Regards,
Stephan
^ permalink raw reply
* [PATCH] tuntap: dont use a private kmem_cache
From: Eric Dumazet @ 2012-12-21 17:17 UTC (permalink / raw)
To: Paul Moore, David Miller; +Cc: Jason Wang, netdev, Stephen Hemminger
In-Reply-To: <2090364.S7KStA6R4d@sifl>
From: Eric Dumazet <edumazet@google.com>
Commit 96442e42429 (tuntap: choose the txq based on rxq)
added a per tun_struct kmem_cache.
As soon as several tun_struct are used, we get an error
because two caches cannot have same name.
Use the default kmalloc()/kfree_rcu(), as it reduce code
size and doesn't have performance impact here.
Reported-by: Paul Moore <pmoore@redhat.com>
Tested-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/tun.c | 24 +++---------------------
1 file changed, 3 insertions(+), 21 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 504f7f1..fbd106e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -180,7 +180,6 @@ struct tun_struct {
int debug;
#endif
spinlock_t lock;
- struct kmem_cache *flow_cache;
struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
struct timer_list flow_gc_timer;
unsigned long ageing_time;
@@ -209,8 +208,8 @@ static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
struct hlist_head *head,
u32 rxhash, u16 queue_index)
{
- struct tun_flow_entry *e = kmem_cache_alloc(tun->flow_cache,
- GFP_ATOMIC);
+ struct tun_flow_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC);
+
if (e) {
tun_debug(KERN_INFO, tun, "create flow: hash %u index %u\n",
rxhash, queue_index);
@@ -223,19 +222,12 @@ static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
return e;
}
-static void tun_flow_free(struct rcu_head *head)
-{
- struct tun_flow_entry *e
- = container_of(head, struct tun_flow_entry, rcu);
- kmem_cache_free(e->tun->flow_cache, e);
-}
-
static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry *e)
{
tun_debug(KERN_INFO, tun, "delete flow: hash %u index %u\n",
e->rxhash, e->queue_index);
hlist_del_rcu(&e->hash_link);
- call_rcu(&e->rcu, tun_flow_free);
+ kfree_rcu(e, rcu);
}
static void tun_flow_flush(struct tun_struct *tun)
@@ -833,12 +825,6 @@ static int tun_flow_init(struct tun_struct *tun)
{
int i;
- tun->flow_cache = kmem_cache_create("tun_flow_cache",
- sizeof(struct tun_flow_entry), 0, 0,
- NULL);
- if (!tun->flow_cache)
- return -ENOMEM;
-
for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++)
INIT_HLIST_HEAD(&tun->flows[i]);
@@ -854,10 +840,6 @@ static void tun_flow_uninit(struct tun_struct *tun)
{
del_timer_sync(&tun->flow_gc_timer);
tun_flow_flush(tun);
-
- /* Wait for completion of call_rcu()'s */
- rcu_barrier();
- kmem_cache_destroy(tun->flow_cache);
}
/* Initialize net device. */
^ permalink raw reply related
* [PATCH] ipv4: arp: fix a lockdep splat in arp_solicit()
From: Eric Dumazet @ 2012-12-21 17:32 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Yan Burman, Stephen Hemminger
From: Eric Dumazet <edumazet@google.com>
Yan Burman reported following lockdep warning :
=============================================
[ INFO: possible recursive locking detected ]
3.7.0+ #24 Not tainted
---------------------------------------------
swapper/1/0 is trying to acquire lock:
(&n->lock){++--..}, at: [<ffffffff8139f56e>] __neigh_event_send
+0x2e/0x2f0
but task is already holding lock:
(&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&n->lock);
lock(&n->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by swapper/1/0:
#0: (((&n->timer))){+.-...}, at: [<ffffffff8104b350>]
call_timer_fn+0x0/0x1c0
#1: (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit
+0x1d4/0x280
#2: (rcu_read_lock_bh){.+....}, at: [<ffffffff81395400>]
dev_queue_xmit+0x0/0x5d0
#3: (rcu_read_lock_bh){.+....}, at: [<ffffffff813cb41e>]
ip_finish_output+0x13e/0x640
stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
Call Trace:
<IRQ> [<ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0
[<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
[<ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0
[<ffffffff8108d570>] __lock_acquire+0x440/0xc30
[<ffffffff813c3570>] ? inet_getpeer+0x40/0x600
[<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
[<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
[<ffffffff8108ddf5>] lock_acquire+0x95/0x140
[<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
[<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
[<ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50
[<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
[<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
[<ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270
[<ffffffff813cb62d>] ip_finish_output+0x34d/0x640
[<ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640
[<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
[<ffffffff813cb9a0>] ip_output+0x80/0xf0
[<ffffffff813ca368>] ip_local_out+0x28/0x80
[<ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan]
[<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
[<ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0
[<ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[<ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270
[<ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0
[<ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0
[<ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0
[<ffffffff813f5788>] arp_xmit+0x58/0x60
[<ffffffff813f59db>] arp_send+0x3b/0x40
[<ffffffff813f6424>] arp_solicit+0x204/0x280
[<ffffffff813a1a70>] ? neigh_add+0x310/0x310
[<ffffffff8139f515>] neigh_probe+0x45/0x70
[<ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0
[<ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0
[<ffffffff8104b350>] ? detach_if_pending+0x120/0x120
[<ffffffff8104b748>] run_timer_softirq+0x238/0x2b0
[<ffffffff813a1a70>] ? neigh_add+0x310/0x310
[<ffffffff81043e51>] __do_softirq+0x101/0x280
[<ffffffff814518cc>] call_softirq+0x1c/0x30
[<ffffffff81003b65>] do_softirq+0x85/0xc0
[<ffffffff81043a7e>] irq_exit+0x9e/0xc0
[<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
[<ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80
<EOI> [<ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0
[<ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0
[<ffffffff8100a6a9>] cpu_idle+0x89/0xe0
[<ffffffff81441127>] start_secondary+0x1b2/0x1b6
Bug is from arp_solicit(), releasing the neigh lock after arp_send()
In case of vxlan, we eventually need to write lock a neigh lock later.
Its a false positive, but we can get rid of it without lockdep
annotations.
We can instead use neigh_ha_snapshot() helper.
Reported-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
---
net/ipv4/arp.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index ce6fbdf..1169ed4 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -321,7 +321,7 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb)
static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
{
__be32 saddr = 0;
- u8 *dst_ha = NULL;
+ u8 dst_ha[MAX_ADDR_LEN];
struct net_device *dev = neigh->dev;
__be32 target = *(__be32 *)neigh->primary_key;
int probes = atomic_read(&neigh->probes);
@@ -363,9 +363,9 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
if (probes < 0) {
if (!(neigh->nud_state & NUD_VALID))
pr_debug("trying to ucast probe in NUD_INVALID\n");
- dst_ha = neigh->ha;
- read_lock_bh(&neigh->lock);
+ neigh_ha_snapshot(dst_ha, neigh, dev);
} else {
+ memset(dst_ha, 0, dev->addr_len);
probes -= neigh->parms->app_probes;
if (probes < 0) {
#ifdef CONFIG_ARPD
@@ -377,8 +377,6 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
dst_ha, dev->dev_addr, NULL);
- if (dst_ha)
- read_unlock_bh(&neigh->lock);
}
static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)
^ permalink raw reply related
* Re: IPv6 over Firewire
From: YOSHIFUJI Hideaki @ 2012-12-21 17:53 UTC (permalink / raw)
To: stephan.gatzka; +Cc: netdev, linux1394-devel, YOSHIFUJI Hideaki
In-Reply-To: <50D49659.1000101@gmail.com>
(2012年12月22日 02:03), Stephan Gatzka wrote:
> To make the address mapping to the firewire link layer RFC3146 mandates to use neighbor discovery with a certain format of the source/target link-layer address option.
>
> While it is not too complicated to build that up, I'm wondering how I can reserve enough memory in the corresponding skb.
>
> One possibility is to introduce some option padding in ndisc_addr_option_pad() but I think that's somehow weird.
>
> The second option I see is to set needed_tailroom in struct netdevice for firewire net devices. That's the way I would go for.
>
> Because I'm not really familiar with the whole network infrastructure in Linux, a confirmation for the way to go would be nice.
If you are talking about how to build NS/NA/RS/Redirect messages, you
can just use ndisc_opt_addr_space() and ndisc_fill_addr_option() here.
--yoshfuji
^ permalink raw reply
* [PATCH 0/2] Re-submit RDS patches
From: Mike Marciniszyn @ 2012-12-21 18:01 UTC (permalink / raw)
To: venkat.x.venkatsubra-QHcLZuEGTsvQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
rds-devel-N0ozoZBvEnrZJqsBc5GL+g, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
netdev-u79uwXL29TY76Z2rM5mHXA
These fixes were originally submitted in
https://oss.oracle.com/pipermail/rds-devel/2012-April/thread.html.
The first patch fixes a DOA issue with RDS on qib and should
be a stable patch as well.
The second suppresses a warning message that is misleading when
a version has been correctly determined.
These two patches were originally acked by the upstream maintainer
and never merged.
---
Mike Marciniszyn (2):
IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len
IB/rds: suppress incompatible protocol when version is known
net/rds/ib_cm.c | 11 +++++------
net/rds/ib_recv.c | 9 ++++++---
2 files changed, 11 insertions(+), 9 deletions(-)
--
Thanks!
Mike Marciniszyn
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH 1/2] IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len
From: Mike Marciniszyn @ 2012-12-21 18:01 UTC (permalink / raw)
To: venkat.x.venkatsubra; +Cc: linux-rdma, roland, rds-devel, davem, netdev
In-Reply-To: <20121221175857.23716.46975.stgit@phlsvslse11.ph.intel.com>
0b088e00 ("RDS: Use page_remainder_alloc() for recv bufs")
added uses of sg_dma_len() and sg_dma_address(). This makes
RDS DOA with the qib driver.
IB ulps should use ib_sg_dma_len() and ib_sg_dma_address
respectively since some HCAs overload ib_sg_dma* operations.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
---
net/rds/ib_recv.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 8c5bc85..8eb9501 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -339,8 +339,8 @@ static int rds_ib_recv_refill_one(struct rds_connection *conn,
sge->length = sizeof(struct rds_header);
sge = &recv->r_sge[1];
- sge->addr = sg_dma_address(&recv->r_frag->f_sg);
- sge->length = sg_dma_len(&recv->r_frag->f_sg);
+ sge->addr = ib_sg_dma_address(ic->i_cm_id->device, &recv->r_frag->f_sg);
+ sge->length = ib_sg_dma_len(ic->i_cm_id->device, &recv->r_frag->f_sg);
ret = 0;
out:
@@ -381,7 +381,10 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill)
ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
- (long) sg_dma_address(&recv->r_frag->f_sg), ret);
+ (long) ib_sg_dma_address(
+ ic->i_cm_id->device,
+ &recv->r_frag->f_sg),
+ ret);
if (ret) {
rds_ib_conn_error(conn, "recv post on "
"%pI4 returned %d, disconnecting and "
^ permalink raw reply related
* [PATCH 2/2] IB/rds: suppress incompatible protocol when version is known
From: Mike Marciniszyn @ 2012-12-21 18:01 UTC (permalink / raw)
To: venkat.x.venkatsubra; +Cc: linux-rdma, roland, rds-devel, davem, netdev
In-Reply-To: <20121221175857.23716.46975.stgit@phlsvslse11.ph.intel.com>
Add an else to only print the incompatible protocol message
when version hasn't been established.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
---
net/rds/ib_cm.c | 11 +++++------
1 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index a1e1162..31b74f5 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -434,12 +434,11 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event)
version = RDS_PROTOCOL_3_0;
while ((common >>= 1) != 0)
version++;
- }
- printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI4 using "
- "incompatible protocol version %u.%u\n",
- &dp->dp_saddr,
- dp->dp_protocol_major,
- dp->dp_protocol_minor);
+ } else
+ printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI4 using incompatible protocol version %u.%u\n",
+ &dp->dp_saddr,
+ dp->dp_protocol_major,
+ dp->dp_protocol_minor);
return version;
}
^ permalink raw reply related
* TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-21 17:58 UTC (permalink / raw)
To: netdev
Dear sir or madam,
My name is Zhiyun Qian, a recent PhD graduate from University of
Michigan. As our recent research effort, along with my colleagues, we
identified a vulnerability related to Linux. Details described in our
paper published at this year's ACM Conference on Computer and
Communications Security (CCS): Collaborative TCP Sequence Number
Inference Attack available
http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf
Keywords: TCP, sequence number inference, DelayedAckLost counter,
privilege-escalation attack
The vulnerability would allow an local malicious program to gain write
access to TCP connections of other applications. An example attack
scenario (on android) would be "an attacker uploads a seemingly benign
app to the google play, when run at the background, it can inject
malicious HTML payload into a webpage open by the browser".
The problem is caused by the common TCP stats counters (the specific
counter I found is DelayedACKLost) maintained by the kernel (but
exposed to user space). By reading and reporting such counters to an
external attacker (colluded), the aforementioned attack can be
accomplished.
It is essentially a side-channel attack (using TCP stats counters to
infer TCP sequence numbers), but it is so real and easy to carry out
that I believe it should be considered a vulnerability. From a
difference perspective, this attack can be considered as a privilege
escalation attack since it allows a local non-privileged program to
gain write access to TCP connections made by other processes.
The lines of code in kernel impacted by the attack is:
net/ipv4/tcp_input.c at line 4166 (as in the latest kernel v3.7.1)
4160 static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
4161 {
4162 struct tcp_sock *tp = tcp_sk(sk);
4163
4164 if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
4165 before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
4166 NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
4167 tcp_enter_quickack_mode(sk);
4168
4169 if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
4170 u32 end_seq = TCP_SKB_CB(skb)->end_seq;
4171
4172 if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
4173 end_seq = tp->rcv_nxt;
4174 tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq);
4175 }
4176 }
4177
4178 tcp_send_ack(sk);
4179 }
IMHO, an easy fix to the problem is to disallow unprivileged access to
counters such as DelayedAckLost. However, this solution may not be
ideal. A better way is to always enforce both acknowledgement number
and sequence number check on each incoming TCP packet instead of
checking one at a time. Currently, Linux TCP stack first only
validates the SEQ number and then subsequently the ACK number. If the
sequence number is invalid, the delayedAckLost counter will be
incremented (information about sequence number leaked already
regardless of the ACK number). To make attacker's job much harder (we
should require the attacker to guess both the valid sequence number
and ACK number at the same time). For instance, following RFC 5961:
(SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT, this would require the
attacker to send many times (on the order of 10000) more packets to
conduct the same attack.
Please feel free to ask me any questions regarding this vulnerability.
Best,
-Zhiyun
^ permalink raw reply
* Re: [RFC] IP_MAX_MTU value
From: Rick Jones @ 2012-12-21 18:19 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1356072468.21834.4805.camel@edumazet-glaptop>
On 12/20/2012 10:47 PM, Eric Dumazet wrote:
> Hi David
>
> We have the following definition in net/ipv4/route.c
>
> #define IP_MAX_MTU 0xFFF0
>
> This means that "netperf -t UDP_STREAM", using UDP messages of 65507
> bytes, are fragmented on loopback interface (while its MTU is now 65536
> and should allow those UDP messages being sent without fragments)
>
> I guess Rick chose 65507 bytes in netperf because it was related to the
> max IPv4 datagram length :
>
> 65507 + 28 = 65535
That is correct. From src/nettest_opmni.c:
/* choosing the default send size is a trifle more complicated than it
used to be as we have to account for different protocol limits */
#define UDP_LENGTH_MAX (0xFFFF - 28)
static int
choose_send_size(int lss, int protocol) {
int send_size;
if (lss > 0) {
send_size = lss_size;
/* we will assume that everyone has IPPROTO_UDP and thus avoid an
issue with Windows using an enum */
if ((protocol == IPPROTO_UDP) && (send_size > UDP_LENGTH_MAX))
send_size = UDP_LENGTH_MAX;
}
else {
send_size = 4096;
}
return send_size;
}
And I figured that while IPv6 allows even larger sizes, the likelihood
of it mattering in the then near/medium term was minimal.
> Changing IP_MAX_MTU from 0xFFF0 to 0x10000 seems safe [1], but I might
> miss something really obvious ?
If you go beyond the protocol limit of an IPv4 datagram, won't it be
necessary to start being a bit more conditional on IPv4 vs IPv6?
> It might be because in old days we reserved 16 bytes for the ethernet
> header, and we wanted to avoid kmalloc() round-up to kmalloc-131072
> slab ?
>
> If so, we certainly can limit skb->head to 32 or 64 KB and complete with
> page fragments the remaining space.
>
> Thanks
>
> [1] performance increase is ~50%
99 times out of 10 I will assert that faster is better, but do we need
another 50% for UDP over loopback with that large a message size?
happy benchmarking,
rick jones
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 18:31 UTC (permalink / raw)
To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte86miv450KnOcFRR-oEm_f=qRXarDfQkyU7T3OLqq816A@mail.gmail.com>
On Fri, 2012-12-21 at 12:58 -0500, Zhiyun Qian wrote:
> Dear sir or madam,
>
> My name is Zhiyun Qian, a recent PhD graduate from University of
> Michigan. As our recent research effort, along with my colleagues, we
> identified a vulnerability related to Linux. Details described in our
> paper published at this year's ACM Conference on Computer and
> Communications Security (CCS): Collaborative TCP Sequence Number
> Inference Attack available
> http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf
>
> Keywords: TCP, sequence number inference, DelayedAckLost counter,
> privilege-escalation attack
>
> The vulnerability would allow an local malicious program to gain write
> access to TCP connections of other applications. An example attack
> scenario (on android) would be "an attacker uploads a seemingly benign
> app to the google play, when run at the background, it can inject
> malicious HTML payload into a webpage open by the browser".
>
> The problem is caused by the common TCP stats counters (the specific
> counter I found is DelayedACKLost) maintained by the kernel (but
> exposed to user space). By reading and reporting such counters to an
> external attacker (colluded), the aforementioned attack can be
> accomplished.
>
> It is essentially a side-channel attack (using TCP stats counters to
> infer TCP sequence numbers), but it is so real and easy to carry out
> that I believe it should be considered a vulnerability. From a
> difference perspective, this attack can be considered as a privilege
> escalation attack since it allows a local non-privileged program to
> gain write access to TCP connections made by other processes.
>
> The lines of code in kernel impacted by the attack is:
> net/ipv4/tcp_input.c at line 4166 (as in the latest kernel v3.7.1)
>
> 4160 static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
> 4161 {
> 4162 struct tcp_sock *tp = tcp_sk(sk);
> 4163
> 4164 if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
> 4165 before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
> 4166 NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
> 4167 tcp_enter_quickack_mode(sk);
> 4168
> 4169 if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
> 4170 u32 end_seq = TCP_SKB_CB(skb)->end_seq;
> 4171
> 4172 if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
> 4173 end_seq = tp->rcv_nxt;
> 4174 tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq);
> 4175 }
> 4176 }
> 4177
> 4178 tcp_send_ack(sk);
> 4179 }
>
>
> IMHO, an easy fix to the problem is to disallow unprivileged access to
> counters such as DelayedAckLost. However, this solution may not be
> ideal. A better way is to always enforce both acknowledgement number
> and sequence number check on each incoming TCP packet instead of
> checking one at a time. Currently, Linux TCP stack first only
> validates the SEQ number and then subsequently the ACK number. If the
> sequence number is invalid, the delayedAckLost counter will be
> incremented (information about sequence number leaked already
> regardless of the ACK number). To make attacker's job much harder (we
> should require the attacker to guess both the valid sequence number
> and ACK number at the same time). For instance, following RFC 5961:
> (SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT, this would require the
> attacker to send many times (on the order of 10000) more packets to
> conduct the same attack.
>
> Please feel free to ask me any questions regarding this vulnerability.
I believe RFC 5961 was implemented in recent linux versions.
Is the described vulnerability still present ?
^ permalink raw reply
* Re: [PATCH] xen/netfront: improve truesize tracking
From: Rick Jones @ 2012-12-21 18:33 UTC (permalink / raw)
To: Sander Eikelenboom
Cc: Eric Dumazet, Ian Campbell, netdev@vger.kernel.org,
Konrad Rzeszutek Wilk, annie li, xen-devel@lists.xensource.com
In-Reply-To: <1609010645.20121221122100@eikelenboom.it>
I'm guessing that trusize checks matter more on the "inbound" path than
the outbound path? If that is indeed the case, then instead of, or in
addition to using the -s option to set the local (netperf side) socket
buffer size, you should use a -S option to set the remote (netserver
side) socket buffer size.
happy benchmarking,
rick jones
^ permalink raw reply
* Re: [RFC] IP_MAX_MTU value
From: Eric Dumazet @ 2012-12-21 18:34 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, netdev
In-Reply-To: <50D4A84D.1010402@hp.com>
On Fri, 2012-12-21 at 10:19 -0800, Rick Jones wrote:
> If you go beyond the protocol limit of an IPv4 datagram, won't it be
> necessary to start being a bit more conditional on IPv4 vs IPv6?
>
This IP_MAX_MTU is really an IPv4 thing (static to net/ipv4/route.c)
>
> 99 times out of 10 I will assert that faster is better, but do we need
> another 50% for UDP over loopback with that large a message size?
Well, I only didnt understand why sending 65507 UDP messages had to use
fragments. I didnt care of performance at this point, only tried to have
an reasonable explanation.
It turns out its a strange limitation.
^ permalink raw reply
* Re: IPv6 over Firewire
From: Stephan Gatzka @ 2012-12-21 18:39 UTC (permalink / raw)
To: YOSHIFUJI Hideaki; +Cc: netdev, linux1394-devel
In-Reply-To: <50D4A219.7080807@linux-ipv6.org>
> If you are talking about how to build NS/NA/RS/Redirect messages, you
> can just use ndisc_opt_addr_space() and ndisc_fill_addr_option() here.
Thanks, these functions are certainly helpful. But
ndisc_opt_addr_space() calculates the required space from dev->addr_len
and ndisc_addr_option_pad(dev->type). The latter is 0 for IEEE1394
(firewire). So the required option space just comes from dev->addr_len,
which is 8 for firewire, resulting in an option address space of 16 (2
octets).
But rfc3146 requires an option address space of 3 octets. So my main
question is if in such a situation the best is to reserve additional skb
tail room using needed_tailroom in struct netdevice. This directly
affects the memory allocated in ndisc_build_skb().
Regards,
Stephan
^ permalink raw reply
* Re: [RFC] IP_MAX_MTU value
From: Rick Jones @ 2012-12-21 18:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1356114879.21834.7709.camel@edumazet-glaptop>
On 12/21/2012 10:34 AM, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 10:19 -0800, Rick Jones wrote:
>
>> If you go beyond the protocol limit of an IPv4 datagram, won't it be
>> necessary to start being a bit more conditional on IPv4 vs IPv6?
>>
>
> This IP_MAX_MTU is really an IPv4 thing (static to net/ipv4/route.c)
OK. Doesn't this:
if (mtu > IP_MAX_MTU)
mtu = IP_MAX_MTU;
mean it should be OK to go to 0xFFFF but not 0x10000? Since 65535 is
the limit of an IPv4 datagram and so I would think would be the maximum
MTU for an IPv4 interface.
rick
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 18:52 UTC (permalink / raw)
To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <1356114663.21834.7697.camel@edumazet-glaptop>
On Fri, 2012-12-21 at 10:31 -0800, Eric Dumazet wrote:
> I believe RFC 5961 was implemented in recent linux versions.
>
> Is the described vulnerability still present ?
>
By the way, I believe Chrome browser uses private network namespaces,
and statistics are per network namespace, so it should be safe.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox