Netdev List


* Re: [PATCH 0/2] [BUG FIXES - 3.10.27] sit: More backports
From: Nicolas Dichtel @ 2014-01-30  9:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, netdev, stable, Clark Williams,
	Luis Claudio R. Goncalves, John Kacur, Willem de Bruijn
In-Reply-To: <20140129154811.27fbaf13@gandalf.local.home>

On 29/01/2014 21:48, Steven Rostedt wrote:
> On Wed, 29 Jan 2014 17:04:12 +0100
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>> Your patch series seems to be the right way to go (note that patch 1/2 does not
>> compile) but I think the fix is smaller because we don't have x-netns.
>>
>> Here is my proposal. If you agree, I will send the same patch for ip6_tunnel,
>> which has the netns leak.
>>
>
> Hold on. Seems that the kernels that were being tested in QA had more
> code than what I was testing. Clark had backported "sit: fix use after
> free of fb_tunnel_dev" and that was what was causing the
> unlist_netdevice() to be missed.
>
> When I started working on vanilla 3.10.27 as well, I first did my
> original patch (which just removes the call to
> unregister_netdevice_queue() from sit_exit_net()). I asked to have that
> added to our kernel for testing, and they told me it was already there
> via Clark's backport. Then I did the full backport as well and looked
> for the leak. I'm now thinking that the full backport is not needed as
> that was what caused the leak.
>
> According to commit 9434266f2c645d4fcf62a03a8e36ad8075e37943 "sit: fix
> use after free of fb_tunnel_dev", it states:
>
>      Bug: The fallback device is created in sit_init_net and assumed to
>      be freed in sit_exit_net. First, it is dereferenced in that
>      function, in sit_destroy_tunnels:
>
>              struct net *net = dev_net(sitn->fb_tunnel_dev);
>
>      Prior to this, rtnl_link_unregister has removed all devices that
>      match rtnl_link_ops == sit_link_ops.
>
>      Commit 205983c43700 added the line
>
>      +       sitn->fb_tunnel_dev->rtnl_link_ops = &sit_link_ops;
>
>      which causes the fallback device to match here and be freed before it
>      is last dereferenced.
>
> Commit 205983c43700 was backported to 3.10, but without commit
> 5e6700b3bf98 "sit: add support of x-netns" which was what added the
>
>    net = dev_net(sitn->fb_tunnel_dev);
>
> It looks to me like the only reason I would need to backport commit
> 5e6700b3bf98 is if I add the full backport of 9434266f2c645d4f.
>
> Seems to me that my original patch may be good enough. The one that I
> said this series obsoletes.
>
> Note, I've talked with the people that are doing the testing, and I'm
> having them revert all changes except for that one fix and rerun the
> tests again. I should know the results by tomorrow.
>
> Let me know if "sit: fix use after free of fb_tunnel_dev" still needs
> to be backported due to some other way that the fallback device can be
> dereferenced after being freed.

Steve, I think the patch I sent yesterday is the right fix. In the end, it is
a backport of Willem's patch. Note that he also acked that patch.
The first version you sent (which removes
unregister_netdevice_queue(sitn->fb_tunnel_dev, &list)) will introduce a
memory leak when the user destroys a netns.

^ permalink raw reply

* RE: Help testing for USB ethernet/xHCI regression
From: David Laight @ 2014-01-30  9:44 UTC (permalink / raw)
  To: 'Sarah Sharp', renevant@internode.on.net
  Cc: linux-usb@vger.kernel.org, Mark Lord, Greg Kroah-Hartman,
	netdev@vger.kernel.org
In-Reply-To: <20140129215408.GC5991@xanatos>

From: Sarah Sharp
> > Current issue is when plugging in the ax88179 there is lag when bringing the
> > interface up and a bunch of kernel messages:
> 
> With which kernel?

I saw similar issues testing some patches yesterday.
Both with the ax88179_178a and smsc95xx cards (connected to xhci).
My kernel claims to be 3.13.0-dsl+ but it is lying since it was
updated from Linus's tree earlier this week.
I'd not seen any similar delays until the last week or so.

The xhci controller I'm using is the Intel 'Panther Point' 8086:1e31
rev 4 (also says 8086:1e2d and 8086:1e26).

	David

^ permalink raw reply

* Re: [PATCH RFC v3 0/12] vti4: prepare namespace and interfamily support.
From: Steffen Klassert @ 2014-01-30  9:56 UTC (permalink / raw)
  To: Christophe Gouault; +Cc: netdev, Saurabh Mohan
In-Reply-To: <52E8DE2C.7060706@6wind.com>

On Wed, Jan 29, 2014 at 11:55:40AM +0100, Christophe Gouault wrote:
> 
> Hi Steffen,
> 
> I did some tests, and it seems there is no inbound policy check against
> a vti SP after the ipsec decryption:

Thanks for testing!

> 
> To confirm the problem, I added some logs in the kernel to track the
> outbound SPD lookup and inbound policy check.
> 
> I tested a ping from HostL(10.22.1.1) to HostR(10.24.1.201), that must
> be encapsulated via a vti interface (vti1, mark 1, ifindex 8) between
> IPsecVTI(10.23.1.101) and HostR(10.23.1.201).
> 
> .         10.22.1.0/24          10.23.1.0/24       10.24.1.0/24
> . (HostL) ------------(IPsecVTI)============(HostR)------------
> .   .1                   .101                 .201
> 
> Here is the trace:
> 
> (1) xfrm_lookup: oif=8 mark=0 saddr=10.22.1.1 daddr=10.24.1.201
> (2) xfrm_lookup: oif=8 mark=1 saddr=10.22.1.1 daddr=10.24.1.201
> (3) vti_rcv: found tunnel vti1
> (4) __xfrm_policy_check: dir=0 iif=0 mark=0 saddr=10.23.1.201
> daddr=10.23.1.101 skb->sp->len=0
> (5) __xfrm_policy_check: dir=2 iif=0 mark=0 saddr=10.24.1.201
> daddr=10.22.1.1 skb->sp->len=0
> 
> And the analysis:
> 
> - A first SPD lookup is done before entering vti1 in (1), seeking for
> a "global SP".
> - A second SPD lookup is done after entering vti1 in (2), with mark 1,
> seeking for a "vti SP"
> - the icmp request is encapsulated and sent to HostR
> - the esp-encrypted icmp reply is received, the packet enters vti1
> and an inbound policy check is performed on the ESP packet itself in
> (4), with mark 0, seeking for a "global SP".
> - the packet is decrypted and its mark set to 1, but no vti inbound
> policy check is done. Then the skb mark and secpath are reset by
> skb_scrub_packet (called from vti_rcv_cb).
> - Then only an inbound policy check is performed on the icmp
> reply in (5), seeking for a "global SP". It is considered a plaintext
> packet, with no mark or secpath.
> 
> => there is no check that the forward vti security policy was
> enforced.
> 

Yes, that's true and this is a real problem. If we want to support
namespace transitions with vti, we can't know if a packet is going
to be forwarded or locally received in the other namespace. This means
that we don't know if we should enforce an input or a forward policy.

All we can do here is enforce an input policy before we do the
namespace transition in the receive path. The patch below (on top
of the vti patchset) should do this.

But this has the implication that forward policies do not make
much sense in combination with vti. This is a bit contrary to
traditional xfrm processing. But on the other hand, we receive
plaintext packets from the vti device so we should not check
for any IPsec processing that happened before we received the
packets via the vti device.


---
 include/net/xfrm.h        |   10 +++++-----
 net/ipv4/ip_vti.c         |    5 ++++-
 net/ipv4/xfrm4_protocol.c |    9 ++++++---
 net/xfrm/xfrm_input.c     |    4 +++-
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b7740ce..cac9c46 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1519,7 +1519,7 @@ int xfrm4_extract_output(struct xfrm_state *x, struct sk_buff *skb);
 int xfrm4_prepare_output(struct xfrm_state *x, struct sk_buff *skb);
 int xfrm4_output(struct sk_buff *skb);
 int xfrm4_output_finish(struct sk_buff *skb);
-void xfrm4_rcv_cb(struct sk_buff *skb, u8 protocol, int err);
+int xfrm4_rcv_cb(struct sk_buff *skb, u8 protocol, int err);
 int xfrm4_protocol_register(struct xfrm4_protocol *handler, unsigned char protocol);
 int xfrm4_protocol_deregister(struct xfrm4_protocol *handler, unsigned char protocol);
 int xfrm4_tunnel_register(struct xfrm_tunnel *handler, unsigned short family);
@@ -1753,14 +1753,14 @@ static inline int xfrm_mark_put(struct sk_buff *skb, const struct xfrm_mark *m)
 	return ret;
 }
 
-static inline void xfrm_rcv_cb(struct sk_buff *skb, unsigned int family,
-			       u8 protocol, int err)
+static inline int xfrm_rcv_cb(struct sk_buff *skb, unsigned int family,
+			      u8 protocol, int err)
 {
 	switch(family) {
 	case AF_INET:
-		xfrm4_rcv_cb(skb, protocol, err);
-		break;
+		return xfrm4_rcv_cb(skb, protocol, err);
 	}
+	return 0;
 }
 
 #endif	/* _NET_XFRM_H */
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 7b1542c..5beb260 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -82,7 +82,7 @@ static int vti_rcv_cb(struct sk_buff *skb, int err)
 	struct ip_tunnel *tunnel = XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4;
 
 	if (!tunnel)
-		return -1;
+		return 1;
 
 	dev = tunnel->dev;
 
@@ -93,6 +93,9 @@ static int vti_rcv_cb(struct sk_buff *skb, int err)
 		return 0;
 	}
 
+	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
+		return -EPERM;
+
 	skb_scrub_packet(skb, !net_eq(tunnel->net, dev_net(skb->dev)));
 	skb->dev = dev;
 
diff --git a/net/ipv4/xfrm4_protocol.c b/net/ipv4/xfrm4_protocol.c
index 993ab39..5ab5527 100644
--- a/net/ipv4/xfrm4_protocol.c
+++ b/net/ipv4/xfrm4_protocol.c
@@ -46,13 +46,16 @@ static inline struct xfrm4_protocol __rcu **proto_handlers(u8 protocol)
 	     handler != NULL;				\
 	     handler = rcu_dereference(handler->next))	\
 
-void xfrm4_rcv_cb(struct sk_buff *skb, u8 protocol, int err)
+int xfrm4_rcv_cb(struct sk_buff *skb, u8 protocol, int err)
 {
+	int ret;
 	struct xfrm4_protocol *handler;
 
 	for_each_protocol_rcu(*proto_handlers(protocol), handler)
-		if (!handler->cb_handler(skb, err))
-			return;
+		if ((ret = handler->cb_handler(skb, err)) <= 0)
+			return ret;
+
+	return 0;
 }
 EXPORT_SYMBOL(xfrm4_rcv_cb);
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index fb64b4a..99e3a9e 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -263,7 +263,9 @@ resume:
 		}
 	} while (!err);
 
-	xfrm_rcv_cb(skb, family, x->type->proto, 0);
+	err = xfrm_rcv_cb(skb, family, x->type->proto, 0);
+	if (err)
+		goto drop;
 
 	nf_reset(skb);
 
-- 
1.7.9.5
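
The return-value convention that this patch gives the receive callbacks can be
sketched in userspace C. This is only a model under stated assumptions: struct
pkt, model_vti_cb(), model_other_cb() and model_rcv_cb() are hypothetical names
that stand in for the sk_buff, vti_rcv_cb() and xfrm4_rcv_cb(), and the RCU
list is replaced by a plain array. A handler returns a positive value when the
packet is not for it, 0 when it handled the packet, and a negative errno when
the packet must be dropped.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only what the model needs. */
struct pkt {
	int has_tunnel;	/* a vti tunnel was matched on receive */
	int policy_ok;	/* outcome of the inbound policy check */
};

typedef int (*cb_handler_t)(struct pkt *p, int err);

/*
 * Models vti_rcv_cb() after the patch:
 *   > 0  not our packet, let the next handler look at it
 *   0    handled
 *   < 0  drop (the inbound policy check failed)
 */
static int model_vti_cb(struct pkt *p, int err)
{
	(void)err;
	if (!p->has_tunnel)
		return 1;
	if (!p->policy_ok)
		return -EPERM;
	return 0;
}

/* A second handler that accepts everything it is offered. */
static int model_other_cb(struct pkt *p, int err)
{
	(void)p;
	(void)err;
	return 0;
}

/* Models xfrm4_rcv_cb(): stop at the first handler returning <= 0. */
static int model_rcv_cb(cb_handler_t *handlers, size_t n, struct pkt *p)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = handlers[i](p, 0);
		if (ret <= 0)
			return ret;
	}
	return 0;	/* nobody claimed the packet: not an error */
}
```

With this convention the caller only needs to test for a non-zero result, as
the hunk in xfrm_input.c does, to decide whether to drop the packet.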

^ permalink raw reply related

* RE: Help testing for USB ethernet/xHCI regression
From: David Laight @ 2014-01-30 10:03 UTC (permalink / raw)
  To: 'Sarah Sharp', Mark Lord
  Cc: Greg Kroah-Hartman, linux-usb@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20140129211817.GA5991@xanatos>

From: Sarah Sharp
> On Tue, Jan 28, 2014 at 11:30:51PM -0500, Mark Lord wrote:
> > On 14-01-28 03:30 PM, Sarah Sharp wrote:
> > ..
> > > Can you please pull this branch, which contains a 3.13 kernel with
> > > David's patch reverted, and test whether your USB ethernet device works
> > > or fails?
> >
> > Fails.  dmesg log attached.
> 
> It's funny, because there's certainly data transferred over endpoint
> 0x82, even though there were link TRBs in the middle of transfers.  Did
> the "untransferred" messages stop when the device stopped working, or
> did they continue?

What I saw was that the USB transfers continued, but the ethernet
transmits stopped.
This rather changes the packets generated by TCP at both ends.
The effect can be seen in the timestamps (etc) in the USB trace.
There are still tx and rx packets, but after a while they only
happen on ever-increasing timeouts.

That was the point where I wrote the NOP patch just to see if
it made a difference. I didn't really expect one!

I think that the LINK trb splits a 1k usb message in two.
This well and truly confuses the ethernet part of the ax88179_178a
hardware to the point where it doesn't even reset itself on the
next sub 1k message.

It might be that other targets (eg the smsc95xx) are more resilient
and only lose the single packet - which won't be immediately obvious.

	David

^ permalink raw reply

* [PATCH linux-3.10.y 1/3] sit: fix double free of fb_tunnel_dev on exit
From: Nicolas Dichtel @ 2014-01-30 10:09 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, netdev, stable, williams, lclaudio, jkacur, willemb,
	Nicolas Dichtel
In-Reply-To: <52EA1B4B.5080403@6wind.com>

This problem was fixed upstream by commit 9434266f2c64 ("sit: fix use after free
of fb_tunnel_dev").
The upstream patch depends on upstream commit 5e6700b3bf98 ("sit: add support of
x-netns"), which was not backported into 3.10 branch.

First, an explanation of the problem: when the sit module is unloaded,
sit_cleanup() is called.
rmmod sit
=> sit_cleanup()
  => rtnl_link_unregister()
    => __rtnl_kill_links()
      => for_each_netdev(net, dev) {
        if (dev->rtnl_link_ops == ops)
        	ops->dellink(dev, &list_kill);
        }
At this point, the FB device is deleted (and all sit tunnels).
  => unregister_pernet_device()
    => unregister_pernet_operations()
      => ops_exit_list()
        => sit_exit_net()
          => sit_destroy_tunnels()
          In this function, no tunnel is found.
          => unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
We delete the FB device a second time here!

Because we cannot simply remove the second deletion (sit_exit_net() must remove
the FB device when a netns is deleted), we add an rtnl dellink op which deletes
all sit devices except the FB device, and thus we can keep the explicit deletion
in sit_exit_net().

CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv6/sit.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0491264b8bfc..620d326e8fdd 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1507,6 +1507,15 @@ static const struct nla_policy ipip6_policy[IFLA_IPTUN_MAX + 1] = {
 #endif
 };
 
+static void ipip6_dellink(struct net_device *dev, struct list_head *head)
+{
+	struct net *net = dev_net(dev);
+	struct sit_net *sitn = net_generic(net, sit_net_id);
+
+	if (dev != sitn->fb_tunnel_dev)
+		unregister_netdevice_queue(dev, head);
+}
+
 static struct rtnl_link_ops sit_link_ops __read_mostly = {
 	.kind		= "sit",
 	.maxtype	= IFLA_IPTUN_MAX,
@@ -1517,6 +1526,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly = {
 	.changelink	= ipip6_changelink,
 	.get_size	= ipip6_get_size,
 	.fill_info	= ipip6_fill_info,
+	.dellink	= ipip6_dellink,
 };
 
 static struct xfrm_tunnel sit_handler __read_mostly = {
-- 
1.8.4.1
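
The double unregistration described in the commit message above can be modelled
in a few lines of userspace C. Everything here is a hypothetical sketch:
struct model_dev, struct model_net and the model_* functions are invented for
illustration and only count how often a device is queued for removal. The
model assumes that, without a .dellink op, every device of the link kind is
unregistered during rtnl_link_unregister().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a net_device: counts unregister calls. */
struct model_dev {
	int is_sit;		/* dev->rtnl_link_ops == &sit_link_ops */
	int unregistered;	/* times the device was queued for removal */
};

struct model_net {
	struct model_dev *devs;
	size_t ndevs;
	struct model_dev *fb_dev;	/* the fallback (FB) device */
};

typedef void (*dellink_t)(struct model_net *net, struct model_dev *dev);

static void model_unregister(struct model_dev *dev)
{
	dev->unregistered++;
}

/* No dedicated dellink: every matching device is removed, FB included. */
static void model_dellink_default(struct model_net *net, struct model_dev *dev)
{
	(void)net;
	model_unregister(dev);
}

/* What the added ipip6_dellink() does: leave the FB device alone. */
static void model_dellink_skip_fb(struct model_net *net, struct model_dev *dev)
{
	if (dev != net->fb_dev)
		model_unregister(dev);
}

/* rmmod path: kill all links of this kind, then run the per-netns exit. */
static void model_rmmod(struct model_net *net, dellink_t dellink)
{
	size_t i;

	for (i = 0; i < net->ndevs; i++)
		if (net->devs[i].is_sit)
			dellink(net, &net->devs[i]);

	/* sit_exit_net() always unregisters the FB device explicitly. */
	model_unregister(net->fb_dev);
}
```

Running model_rmmod() with model_dellink_default leaves the FB device with a
count of 2 (the double free), while model_dellink_skip_fb leaves every device
with a count of exactly 1.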

^ permalink raw reply related

* [PATCH linux-3.10.y 2/3] Revert "ip6tnl: fix use after free of fb_tnl_dev"
From: Nicolas Dichtel @ 2014-01-30 10:09 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, netdev, stable, williams, lclaudio, jkacur, willemb,
	Nicolas Dichtel
In-Reply-To: <1391076563-10798-1-git-send-email-nicolas.dichtel@6wind.com>

This reverts commit 22c3ec552c29cf4bd4a75566088950fe57d860c4.

This patch is not the right fix, it introduces a memory leak when a netns is
destroyed (the FB device is never deleted).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6_tunnel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 209bb4d6e188..0516ebbea80b 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1711,6 +1711,8 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 		}
 	}
 
+	t = rtnl_dereference(ip6n->tnls_wc[0]);
+	unregister_netdevice_queue(t->dev, &list);
 	unregister_netdevice_many(&list);
 }
 
-- 
1.8.4.1

^ permalink raw reply related

* [PATCH linux-3.10.y 3/3] ip6tnl: fix double free of fb_tnl_dev on exit
From: Nicolas Dichtel @ 2014-01-30 10:09 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, netdev, stable, williams, lclaudio, jkacur, willemb,
	Nicolas Dichtel
In-Reply-To: <1391076563-10798-1-git-send-email-nicolas.dichtel@6wind.com>

This problem was fixed upstream by commit 1e9f3d6f1c40 ("ip6tnl: fix use after
free of fb_tnl_dev").
The upstream patch depends on upstream commit 0bd8762824e7 ("ip6tnl: add x-netns
support"), which was not backported into 3.10 branch.

First, an explanation of the problem: when the ip6_tunnel module is unloaded,
ip6_tunnel_cleanup() is called.
rmmod ip6_tunnel
=> ip6_tunnel_cleanup()
  => rtnl_link_unregister()
    => __rtnl_kill_links()
      => for_each_netdev(net, dev) {
        if (dev->rtnl_link_ops == ops)
        	ops->dellink(dev, &list_kill);
        }
At this point, the FB device is deleted (and all ip6tnl tunnels).
  => unregister_pernet_device()
    => unregister_pernet_operations()
      => ops_exit_list()
        => ip6_tnl_exit_net()
          => ip6_tnl_destroy_tunnels()
            => t = rtnl_dereference(ip6n->tnls_wc[0]);
               unregister_netdevice_queue(t->dev, &list);
We delete the FB device a second time here!

The previous fix removed these lines, which fixed this double free. But that
patch introduced a memory leak when a netns is destroyed, because the FB device
is never deleted. By adding an rtnl dellink op which deletes all ip6tnl devices
except the FB device, we can keep this explicit removal in
ip6_tnl_destroy_tunnels().

CC: Steven Rostedt <rostedt@goodmis.org>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6_tunnel.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0516ebbea80b..f21cf476b00c 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1617,6 +1617,15 @@ static int ip6_tnl_changelink(struct net_device *dev, struct nlattr *tb[],
 	return ip6_tnl_update(t, &p);
 }
 
+static void ip6_tnl_dellink(struct net_device *dev, struct list_head *head)
+{
+	struct net *net = dev_net(dev);
+	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
+
+	if (dev != ip6n->fb_tnl_dev)
+		unregister_netdevice_queue(dev, head);
+}
+
 static size_t ip6_tnl_get_size(const struct net_device *dev)
 {
 	return
@@ -1681,6 +1690,7 @@ static struct rtnl_link_ops ip6_link_ops __read_mostly = {
 	.validate	= ip6_tnl_validate,
 	.newlink	= ip6_tnl_newlink,
 	.changelink	= ip6_tnl_changelink,
+	.dellink	= ip6_tnl_dellink,
 	.get_size	= ip6_tnl_get_size,
 	.fill_info	= ip6_tnl_fill_info,
 };
-- 
1.8.4.1

^ permalink raw reply related

* Re: Help testing for USB ethernet/xHCI regression
From: renevant @ 2014-01-30 10:46 UTC (permalink / raw)
  To: David Laight
  Cc: 'Sarah Sharp', renevant@internode.on.net,
	linux-usb@vger.kernel.org, Mark Lord, Greg Kroah-Hartman,
	netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6ACF20@AcuExch.aculab.com>

When using the ax88179 connected via the VIA-based card, the whole system gets
brought down after a while. I got this in my system log.

I'm going to take a break and see if I can narrow anything more down tomorrow. 
This log is in reverse because of the wonderful way journalctl works.

I suppose I could try using iommu=pt as a kernel boot parameter but that 
doesn't sound like a safe thing to do.


Jan 30 21:04:38 athas kernel: [<ffffffffc11281a0>] xhci_msi_irq [xhci_hcd]
Jan 30 21:04:38 athas kernel: handlers:
Jan 30 21:04:38 athas kernel:  [<ffffffffa74f9ee6>] system_call_fastpath+0x1a/0x1f
Jan 30 21:04:38 athas kernel:  [<ffffffffa715ef6a>] ? SyS_epoll_ctl+0x4fa/0xb00
Jan 30 21:04:38 athas kernel:  <EOI>  [<ffffffffa715f181>] ? SyS_epoll_ctl+0x711/0xb00
Jan 30 21:04:38 athas kernel:  [<ffffffffa74f982a>] common_interrupt+0x6a/0x6a
Jan 30 21:04:38 athas kernel:  [<ffffffffa700445a>] do_IRQ+0x4a/0xf0
Jan 30 21:04:38 athas kernel:  [<ffffffffa7004679>] handle_irq+0x19/0x30
Jan 30 21:04:38 athas kernel:  [<ffffffffa708200f>] handle_edge_irq+0x6f/0x120
Jan 30 21:04:38 athas kernel:  [<ffffffffa707f8b1>] handle_irq_event+0x31/0x50
Jan 30 21:04:38 athas kernel:  [<ffffffffa707f802>] handle_irq_event_percpu+0xc2/0x140
Jan 30 21:04:38 athas kernel:  [<ffffffffa7081b00>] note_interrupt+0xe0/0x1e0
Jan 30 21:04:38 athas kernel:  [<ffffffffa708176d>] __report_bad_irq+0x2d/0xc0
Jan 30 21:04:38 athas kernel:  <IRQ>  [<ffffffffa74efd3f>] dump_stack+0x45/0x56
Jan 30 21:04:38 athas kernel: Call Trace:
Jan 30 21:04:38 athas kernel:  0000000000000000 ffff88043edc3ed8 ffffffffa7081b00 0000000000000000
Jan 30 21:04:38 athas kernel:  ffff88043edc3e98 ffffffffa708176d ffff880425031b00 000000000000004d
Jan 30 21:04:38 athas kernel:  ffff880425031b84 ffff88043edc3e70 ffffffffa74efd3f ffff880425031b00
Jan 30 21:04:38 athas kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2201 11/22/2013
Jan 30 21:04:38 athas kernel: CPU: 7 PID: 1160 Comm: Chrome_IOThread Not tainted 3.13.0+ #13
Jan 30 21:04:38 athas kernel: irq event 77: bogus return value ffffff94
Jan 30 21:04:38 athas kernel: xhci_hcd 0000:02:00.0: Host not halted after 16000 microseconds.
Jan 30 21:04:38 athas kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.0 domain=0x0019 address=0x00000000002b2000 flags=0x0000]
Jan 30 21:04:38 athas kernel: xhci_hcd 0000:02:00.0: WARNING: Host System Error

Regards,

Will Trives

^ permalink raw reply

* Re: IGMP joins come from the wrong SA/interface
From: Steinar H. Gunderson @ 2014-01-30 10:47 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev
In-Reply-To: <20140120184025.GA19972@sesse.net>

On Mon, Jan 20, 2014 at 07:40:25PM +0100, Steinar H. Gunderson wrote:
>> I currently only remember one commit 0a7e22609067ff ("ipv4: fix
>> ineffective source address selection") which did affect multicast source
>> address selection in recent times.
> I tried 3.10.27, just to check something older. I also tried 3.10.27 with
> 0a7e22609067ff reverted, and it's still wrong.
> 
> I am thinking this might have something to do with the machine switching to
> systemd, presumably changing the order of DHCP and static addresses being
> assigned...

Anything more I can do here?

/* Steinar */
-- 
Homepage: http://www.sesse.net/

^ permalink raw reply

* [PATCH net] net/ipv4: Use proper RCU APIs for writer-side in udp_offload.c
From: Or Gerlitz @ 2014-01-30 10:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop@mellanox.com>

RCU writer side should use rcu_dereference_protected() and not
rcu_dereference(), fix that. This also removes the "suspicious RCU usage"
warning seen when running with CONFIG_PROVE_RCU.

Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
---
 net/ipv4/udp_offload.c |   14 +++++++++-----
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 2ffea6f..1bf21d4 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -109,7 +109,8 @@ int udp_add_offload(struct udp_offload *uo)
 	new_offload->offload = uo;
 
 	spin_lock(&udp_offload_lock);
-	rcu_assign_pointer(new_offload->next, rcu_dereference(*head));
+	rcu_assign_pointer(new_offload->next,
+			   rcu_dereference_protected(*head, lockdep_is_held(&udp_offload_lock)));
 	rcu_assign_pointer(*head, new_offload);
 	spin_unlock(&udp_offload_lock);
 
@@ -130,12 +131,15 @@ void udp_del_offload(struct udp_offload *uo)
 
 	spin_lock(&udp_offload_lock);
 
-	uo_priv = rcu_dereference(*head);
+	uo_priv = rcu_dereference_protected(*head,
+					    lockdep_is_held(&udp_offload_lock));
 	for (; uo_priv != NULL;
-		uo_priv = rcu_dereference(*head)) {
-
+	     uo_priv = rcu_dereference_protected(*head,
+						 lockdep_is_held(&udp_offload_lock))) {
 		if (uo_priv->offload == uo) {
-			rcu_assign_pointer(*head, rcu_dereference(uo_priv->next));
+			rcu_assign_pointer(*head,
+					   rcu_dereference_protected(uo_priv->next,
+								     lockdep_is_held(&udp_offload_lock)));
 			goto unlock;
 		}
 		head = &uo_priv->next;
-- 
1.7.1
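
The writer-side list handling that this patch touches relies on walking the
list through a pointer to the link being examined, so that removing the first
element needs no special case. A single-threaded userspace sketch of that
technique follows. It deliberately leaves out the spinlock and all RCU
primitives, and struct node, model_add() and model_del() are hypothetical
names, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int key;
	struct node *next;
};

/* Push at the head, as udp_add_offload() does under its spinlock. */
static void model_add(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/*
 * Unlink the first node whose key matches.
 * 'link' always points at the pointer that would have to be rewritten,
 * which is either the list head or the 'next' field of the previous node.
 * Returns the unlinked node, or NULL when no node matched.
 */
static struct node *model_del(struct node **head, int key)
{
	struct node **link = head;
	struct node *cur;

	for (cur = *link; cur != NULL; cur = *link) {
		if (cur->key == key) {
			*link = cur->next;
			cur->next = NULL;
			return cur;
		}
		link = &cur->next;
	}
	return NULL;
}
```

In the kernel version every read of *link on this path is a writer-side access
made with the lock held, which is why the patch switches those reads to
rcu_dereference_protected().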

^ permalink raw reply related

* [PATCH net] e100: Fix "disabling already-disabled device" warning
From: Michele Baldessari @ 2014-01-30 10:51 UTC (permalink / raw)
  To: netdev; +Cc: e1000-devel, idirectscm, David S. Miller, Michele Baldessari

In https://bugzilla.redhat.com/show_bug.cgi?id=994438 and
https://bugzilla.redhat.com/show_bug.cgi?id=970480  we
received different reports of e100 throwing the following
warning:

 [<c06a0ba5>] ? pci_disable_device+0x85/0x90
 [<c044a153>] warn_slowpath_fmt+0x33/0x40
 [<c06a0ba5>] pci_disable_device+0x85/0x90
 [<f7fdf7e0>] __e100_shutdown+0x80/0x120 [e100]
 [<c0476ca5>] ? check_preempt_curr+0x65/0x90
 [<f7fdf8d6>] e100_suspend+0x16/0x30 [e100]
 [<c06a1ebb>] pci_legacy_suspend+0x2b/0xb0
 [<c098fc0f>] ? wait_for_completion+0x1f/0xd0
 [<c06a2d50>] ? pci_pm_poweroff+0xb0/0xb0
 [<c06a2de4>] pci_pm_freeze+0x94/0xa0
 [<c0767bb7>] dpm_run_callback+0x37/0x80
 [<c076a204>] ? pm_wakeup_pending+0xc4/0x140
 [<c0767f12>] __device_suspend+0xb2/0x1f0
 [<c076806f>] async_suspend+0x1f/0x90
 [<c04706e5>] async_run_entry_fn+0x35/0x140
 [<c0478aef>] ? wake_up_process+0x1f/0x40
 [<c0464495>] process_one_work+0x115/0x370
 [<c0462645>] ? start_worker+0x25/0x30
 [<c0464dc5>] ? manage_workers.isra.27+0x1a5/0x250
 [<c0464f6e>] worker_thread+0xfe/0x330
 [<c0464e70>] ? manage_workers.isra.27+0x250/0x250
 [<c046a224>] kthread+0x94/0xa0
 [<c0997f37>] ret_from_kernel_thread+0x1b/0x28
 [<c046a190>] ? insert_kthread_work+0x30/0x30

This patch removes pci_disable_device() from __e100_shutdown().
pci_clear_master() is enough.

Signed-off-by: Michele Baldessari <michele@acksyn.org>
Tested-by: Mark Harig <idirectscm@aim.com>
---
 drivers/net/ethernet/intel/e100.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e100.c b/drivers/net/ethernet/intel/e100.c
index cbaba44..bf7a01e 100644
--- a/drivers/net/ethernet/intel/e100.c
+++ b/drivers/net/ethernet/intel/e100.c
@@ -3034,7 +3034,7 @@ static void __e100_shutdown(struct pci_dev *pdev, bool *enable_wake)
 		*enable_wake = false;
 	}
 
-	pci_disable_device(pdev);
+	pci_clear_master(pdev);
 }
 
 static int __e100_power_off(struct pci_dev *pdev, bool wake)
-- 
1.8.5.3

^ permalink raw reply related

* Re: [Patch net] net: allow setting mac address of loopback device
From: Neil Horman @ 2014-01-30 12:04 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Stephen Hemminger, Eric Dumazet, David S. Miller
In-Reply-To: <1391038731-7501-1-git-send-email-xiyou.wangcong@gmail.com>

On Wed, Jan 29, 2014 at 03:38:51PM -0800, Cong Wang wrote:
> We are trying to mirror the local traffic from lo to eth0,
> allowing setting mac address of lo to eth0 would make
> the ether addresses in these packets correct, so that
> we don't have to modify the ether header again.
> 
> Since usually no one cares about its mac address (all-zero),
> it is safe to allow those who care to set its mac address.
> 
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> 
> ---
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index c5011e0..a0ee030 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -160,6 +160,7 @@ static const struct net_device_ops loopback_ops = {
>  	.ndo_init      = loopback_dev_init,
>  	.ndo_start_xmit= loopback_xmit,
>  	.ndo_get_stats64 = loopback_get_stats64,
> +	.ndo_set_mac_address = eth_mac_addr,
>  };
>  
>  /*
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Seems reasonable
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* IPv4 / IPv6 over IPv4 IPsec tunnel: setting the DF bit
From: Simon Schneider @ 2014-01-30 12:25 UTC (permalink / raw)
  To: netdev

Hi,
for the scenarios
- IPv4 over IPv4 IPsec tunnel
- IPv6 over IPv4 IPsec tunnel

I wonder how the DF bit of the outer (encrypted) packet is set.

There are generally three options:
- DF bit always 0
- DF bit always 1
- DF bit copied from inner packet

(the last case is obviously not applicable for the IPv6 case, as the IPv6 header does not have a DF bit).
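
To make the three options concrete, here is a small userspace sketch of the
decision. It does not describe what Linux actually does. The enum, its names
and model_outer_df() are hypothetical, and the fallback used when the inner
packet is IPv6 (set DF) is an arbitrary choice made for the example. The only
fixed fact is that DF is the 0x4000 flag of the IPv4 fragment-offset field.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_IP_DF 0x4000	/* DF flag in the IPv4 frag_off field */

enum model_df_mode {		/* hypothetical names, not kernel identifiers */
	MODEL_DF_CLEAR,		/* DF bit always 0 */
	MODEL_DF_SET,		/* DF bit always 1 */
	MODEL_DF_COPY		/* DF bit copied from the inner packet */
};

/*
 * Returns the DF flag to put in the outer IPv4 header.
 * inner_is_ipv4 is 0 for an IPv6 inner packet, which has no DF bit;
 * in that case MODEL_DF_COPY falls back to DF set (assumption of this
 * sketch only).
 */
static uint16_t model_outer_df(enum model_df_mode mode, int inner_is_ipv4,
			       uint16_t inner_frag_off)
{
	switch (mode) {
	case MODEL_DF_CLEAR:
		return 0;
	case MODEL_DF_SET:
		return MODEL_IP_DF;
	case MODEL_DF_COPY:
		if (inner_is_ipv4)
			return inner_frag_off & MODEL_IP_DF;
		return MODEL_IP_DF;
	}
	return 0;
}
```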

How is this done in Linux?

When investigating, I stumbled over defines named TNL_F_DF_INHERIT / TNL_F_DF_DEFAULT.

Are these still supported?

Is it possible to configure the behavior at runtime or just at compile time?

I would appreciate very much if someone could give an overview on this!

best regards, Simon

^ permalink raw reply

* Re: Help testing for USB ethernet/xHCI regression
From: renevant @ 2014-01-30 12:46 UTC (permalink / raw)
  To: renevant
  Cc: David Laight, 'Sarah Sharp', linux-usb@vger.kernel.org,
	Mark Lord, Greg Kroah-Hartman, netdev@vger.kernel.org
In-Reply-To: <1703720.2GNpoTUMCv@athas>

via vl800 pcie card 
kernel parameter iommu=pt
ethtool -K xxx sg off
ifconfig xxx mtu 4060 up

stable so far, it's way past the point that it usually crashes.

i'll do proper testing tomorrow


iommu=pt   bah !

Regards,

Will Trives

On Thursday 30 January 2014 21:46:27 renevant@internode.on.net wrote:
> When using the ax88179 connected via the VIA-based card, the whole system
> gets brought down after a while. I got this in my system log.
> 
> I'm going to take a break and see if I can narrow anything more down
> tomorrow. This log is in reverse because of the wonderful way journalctl
> works.
> 
> I suppose I could try using iommu=pt as a kernel boot parameter but that
> doesn't sound like a safe thing to do.
> 
> 
> Jan 30 21:04:38 athas kernel: [<ffffffffc11281a0>] xhci_msi_irq [xhci_hcd]
> Jan 30 21:04:38 athas kernel: handlers:
> Jan 30 21:04:38 athas kernel:  [<ffffffffa74f9ee6>] system_call_fastpath+0x1a/0x1f
> Jan 30 21:04:38 athas kernel:  [<ffffffffa715ef6a>] ? SyS_epoll_ctl+0x4fa/0xb00
> Jan 30 21:04:38 athas kernel:  <EOI>  [<ffffffffa715f181>] ? SyS_epoll_ctl+0x711/0xb00
> Jan 30 21:04:38 athas kernel:  [<ffffffffa74f982a>] common_interrupt+0x6a/0x6a
> Jan 30 21:04:38 athas kernel:  [<ffffffffa700445a>] do_IRQ+0x4a/0xf0
> Jan 30 21:04:38 athas kernel:  [<ffffffffa7004679>] handle_irq+0x19/0x30
> Jan 30 21:04:38 athas kernel:  [<ffffffffa708200f>] handle_edge_irq+0x6f/0x120
> Jan 30 21:04:38 athas kernel:  [<ffffffffa707f8b1>] handle_irq_event+0x31/0x50
> Jan 30 21:04:38 athas kernel:  [<ffffffffa707f802>] handle_irq_event_percpu+0xc2/0x140
> Jan 30 21:04:38 athas kernel:  [<ffffffffa7081b00>] note_interrupt+0xe0/0x1e0
> Jan 30 21:04:38 athas kernel:  [<ffffffffa708176d>] __report_bad_irq+0x2d/0xc0
> Jan 30 21:04:38 athas kernel:  <IRQ>  [<ffffffffa74efd3f>] dump_stack+0x45/0x56
> Jan 30 21:04:38 athas kernel: Call Trace:
> Jan 30 21:04:38 athas kernel:  0000000000000000 ffff88043edc3ed8 ffffffffa7081b00 0000000000000000
> Jan 30 21:04:38 athas kernel:  ffff88043edc3e98 ffffffffa708176d ffff880425031b00 000000000000004d
> Jan 30 21:04:38 athas kernel:  ffff880425031b84 ffff88043edc3e70 ffffffffa74efd3f ffff880425031b00
> Jan 30 21:04:38 athas kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99FX PRO R2.0, BIOS 2201 11/22/2013
> Jan 30 21:04:38 athas kernel: CPU: 7 PID: 1160 Comm: Chrome_IOThread Not tainted 3.13.0+ #13
> Jan 30 21:04:38 athas kernel: irq event 77: bogus return value ffffff94
> Jan 30 21:04:38 athas kernel: xhci_hcd 0000:02:00.0: Host not halted after 16000 microseconds.
> Jan 30 21:04:38 athas kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.0 domain=0x0019 address=0x00000000002b2000 flags=0x0000]
> Jan 30 21:04:38 athas kernel: xhci_hcd 0000:02:00.0: WARNING: Host System Error
> 
> Regards,
> 
> Will Trives

^ permalink raw reply

* Re: [PATCH net v2 5/9] bridge: Fix the way to check if a local fdb entry can be deleted
From: Toshiaki Makita @ 2014-01-30 12:50 UTC (permalink / raw)
  To: Stephen Hemminger, vyasevic, David S . Miller; +Cc: netdev
In-Reply-To: <1387526576.3475.5.camel@ubuntu-vm-makita>

On Fri, 2013-12-20 at 17:02 +0900, Toshiaki Makita wrote:
> On Thu, 2013-12-19 at 09:39 -0800, Stephen Hemminger wrote:
> ...
> > Could we make up as set of test case scripts to validate these changes.
> > Now that FDB table can be manipulated by iproute tools, should be possible
> > to have set of cases for validation.
> 
> Thank you for your suggestion.
> Maybe is it enough to make some test of port attach/detach and confirm
> that data/control plane doesn't result in any inconsistency or odd
> situation by tcpdump/bridge commands?

Sorry for replying to an old thread.

I tested traffic and fdb entries when attaching/detaching a bridge port,
and couldn't find any problem with this patch.

Instead, I found an additional undesirable behavior without this patch:
ping to a bridge device (with arp resolution) sometimes fails while
detaching a port.
The bridge device keeps its mac address for a while after a port is
detached, but the current implementation immediately deletes the
corresponding fdb entry even though that mac address is still used by the
bridge device. I think this caused the ping failures, because frames sent
to the replied mac address can't reach the bridge device once the fdb
entry has been deleted prematurely.
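
A minimal sketch of how the presence of such a local entry could be checked
by hand. It assumes that "bridge fdb show" prints one entry per line with
the address first and marks local entries as "permanent", which is also
what the test script in this mail relies on. The helper name
has_local_entry and the sample listing are hypothetical, not captured
output.

```sh
#!/bin/sh
# Sketch: report whether an fdb listing (read from stdin) contains a
# local ("permanent") bridge entry for a given mac address.
# The sample lines are hypothetical, not captured output.

has_local_entry ()
{
	# $1: mac address; "bridge fdb show" style listing on stdin.
	# Entries flagged "self" belong to the device itself, so skip them.
	grep -v self | grep -i "^$1 " | grep -q permanent
}

SAMPLE='12:34:56:78:90:ab dev em1 master br0 permanent
aa:bb:cc:dd:ee:00 dev veth0 master br0 permanent
33:33:00:00:00:01 dev em1 self permanent'

if echo "$SAMPLE" | has_local_entry 12:34:56:78:90:ab; then
	echo "local entry present"
else
	echo "local entry missing"
fi
```

On a live system the listing would come from the real command instead of
the sample, e.g. bridge fdb show | has_local_entry "$MAC".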

I need to rearrange this patch set, but I'm going to keep this change as
is.


The test I did is:
- Attach/detach a port while sending traffic, and confirm that the
traffic changes to being [delivered to the bridge/flooded] as
expected. Also, confirm that the corresponding fdb entry is added/deleted
as expected.
- Attach/detach a port while flushing arp entry and pinging, and confirm
that ping doesn't fail.
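
The verdict rule that the script applies to the second test can be
sketched as follows: no failed ping is a pass, exactly one failed ping is
only reported as information (the mac address may change between the arp
reply and the echo request), and two or more are a validation error. The
helper name classify_ping_failures is hypothetical; the thresholds are the
ones used in validate_reply in the script.

```sh
#!/bin/sh
# Sketch of the ARP/ICMP verdict: classify one attach/detach run by the
# number of failed pings observed during it.
# classify_ping_failures is a hypothetical helper name.

classify_ping_failures ()
{
	# $1: number of failed pings during one attach/detach
	if [ "$1" -eq 0 ]; then
		echo pass
	elif [ "$1" -eq 1 ]; then
		# A single failure can be a timing artifact, not a bug
		echo info
	else
		echo error
	fi
}

for n in 0 1 3; do
	echo "$n failed ping(s): $(classify_ping_failures $n)"
done
```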

The test script I used and the results are below.

---- Test script begin ----
#!/bin/bash

# This is a test script for validating fdb and traffic consistency
# when a bridge port is attached/detached.
#
# This script does two types of tests.
# Forwarding test:
#   Confirm that traffic to mac address of PORT0 is delivered to bridge
#   when PORT0 is attached and flooded when PORT0 is detached.
#   Also measure the time window between "ip link set ... master" and the
#   actual traffic change, using the "date" command (clock_gettime) and the
#   timestamps in tcpdump.
# ARP/ICMP test:
#   Confirm that ping always succeeds with arp resolution each time
#   PORT0 gets attached/detached.
# In each test, this validates fdb entries as well, i.e. if attaching PORT0,
# we must have a local entry where its dst and address is those of PORT0,
# and if detaching PORT0, we must not have such an entry.
#
# This test script assumes the environment below.
# BR0 has three ports. Two of them (PORT1/2) are veths to namespaces.
# PORT0 is the port that gets attached/detached.
#
# +----------------------------------------------+
# | +-------------+                              |
# | |     +-------|        +-----+  +---+  +-----|
# | | NS0 |NS0_IF0|--VETH--|PORT1|--|BR0|--|PORT0|
# | |     +-------|        +-----+  +---+  +-----|
# | +-------------+                   |          |
# |                                   |          |
# | +-------------+                   |          |
# | |     +-------|        +-----+    |          |
# | | NS1 |NS1_IF0|--VETH--|PORT2|----+          |
# | |     +-------|        +-----+               |
# | +-------------+                              |
# |                             Physical machine |
# +----------------------------------------------+
#
# Test scenarios are
# Forwarding test:
#   Send traffic from NS0 to the mac address of PORT0.
#   The traffic should be delivered to BR0 when attached, and to NS1 when
#   detached.
# ARP/ICMP test:
#   Invalidate the arp entry of the mac address of PORT0 and do "ping -c 1"
#   to the ip address of BR0.
#   Ping should always succeed even if attach/detach occurs.

export LC_ALL=C

# Number of tests
ITERATIONS=10
[ -n "$1" ] && ITERATIONS=$1

# Namespace
NS0=ns0
NS1=ns1
NS0CMD="ip netns exec $NS0"

# Veth interface
VETH_TO_NS0_IF=veth0
NS0_IF=ns0-eth0
VETH_TO_NS1_IF=veth1
NS1_IF=ns1-eth0

# Bridge
BR0=br0

# Bridge interfaces
PORT0=em1
PORT1=$VETH_TO_NS0_IF
PORT2=$VETH_TO_NS1_IF

# MAC addresses
MAC0=12:34:56:78:90:ab # for PORT0, smallest among all bridge ports
NS0_MAC=aa:bb:cc:dd:ee:00
NS1_MAC=aa:bb:cc:dd:ee:01
FLOOD_MAC=aa:bb:cc:dd:ee:02

# IP addresses
BR0_IP=192.168.0.1
NS0_IP=192.168.0.2
PREFIX=24

# Pktgen parameters
PGTHREAD=/proc/net/pktgen/kpktgend_0
PGDEV=/proc/net/pktgen/$NS0_IF
PGCTRL=/proc/net/pktgen/pgctrl

# Directories to store logs
RESULT_DIR=/tmp

# Test statistics
SEND_FAILS_ATTACH=0
NO_ENTRY_ATTACH=0
WRONG_ENTRY_ATTACH=0
UNDELETED_ENTRY_ATTACH=0
CAPTURE_DROPS_ATTACH=0
NO_CAPTURE_ATTACH=0
DELAYED_FLOW_CHANGE_ATTACH=0
DROPS_UNKNOWN_REASON_ATTACH=0

SEND_FAILS_DETACH=0
NO_ENTRY_DETACH=0
WRONG_ENTRY_DETACH=0
UNDELETED_ENTRY_DETACH=0
CAPTURE_DROPS_DETACH=0
NO_CAPTURE_DETACH=0
DELAYED_FLOW_CHANGE_DETACH=0
DROPS_UNKNOWN_REASON_DETACH=0

WINDOW_SUM_ATTACH=0
WINDOW_MAX_ATTACH=0
WINDOW_MIN_ATTACH=-1

WINDOW_SUM_DETACH=0
WINDOW_MAX_DETACH=0
WINDOW_MIN_DETACH=-1

PING_FAIL_COUNT_ATTACH=0
PING_ERROR_COUNT_ATTACH=0
PING_FAIL_COUNT_DETACH=0
PING_ERROR_COUNT_DETACH=0

FORWARD_TEST_COUNT_ATTACH=0
FORWARD_TEST_COUNT_DETACH=0
REPLY_TEST_COUNT_ATTACH=0
REPLY_TEST_COUNT_DETACH=0

prepare ()
{
	modprobe pktgen

	# Namespace settings
	for i in 0 1; do
		eval NS=\$NS$i
		eval NS_IF=\$NS${i}_IF
		eval VETH_TO_NS_IF=\$VETH_TO_NS${i}_IF
		eval NS_MAC=\$NS${i}_MAC

		if ip netns | grep -q $NS; then
			ip netns exec $NS ip link show dev $NS_IF > /dev/null 2>&1 && \
			ip netns exec $NS ip link del $NS_IF
			ip netns del $NS
		fi

		ip netns add $NS

		ip link show dev $VETH_TO_NS_IF > /dev/null 2>&1 && \
		ip link del $VETH_TO_NS_IF
		ip link show dev $NS_IF > /dev/null 2>&1 && ip link del $NS_IF
		ip link add $VETH_TO_NS_IF type veth peer name $NS_IF

		ip link set $NS_IF netns $NS
		ip netns exec $NS ip link set $NS_IF up
		ip link set $VETH_TO_NS_IF address $NS_MAC 
	done

	# PORT0 address setting
	ip link set $PORT0 address $MAC0

	# Bridge settings
	ip link show dev $BR0 > /dev/null 2>&1 && ip link del $BR0
	ip link add $BR0 type bridge

	ip link set $PORT1 master $BR0 # NS0
	ip link set $PORT2 master $BR0 # NS1

	# IP address settings
	ip addr add ${BR0_IP}/$PREFIX dev $BR0
	$NS0CMD ip addr add ${NS0_IP}/$PREFIX dev $NS0_IF

	# Enable bridge and ports
	ip link set $PORT0 up
	ip link set $PORT1 up
	ip link set $PORT2 up
	ip link set $BR0 up
}

# Set a pktgen parameter
pgset ()
{
	PGPATH=$2
	if [ x"$3" == x"bg" -a x"$PGPATH" == x"$PGCTRL" ]; then
		$NS0CMD sh -c "echo $1 > $PGPATH" &
		PG_PID=$!
		ERROR=$?
	else
		$NS0CMD sh -c "echo $1 > $PGPATH"
		ERROR=$?
		RESULT=`$NS0CMD cat $PGPATH`
		if ! echo "$RESULT" | fgrep -q "Result: OK:"; then
			echo "$RESULT" | fgrep Result: 1>&2
			ERROR=1
		fi
	fi
	[ $ERROR -ne 0 ] && return 1
	return 0
}

# Send frames using pktgen
send_frames ()
{
	COUNT=$1
	DST=$2
	BG=$3

	pgset "rem_device_all" $PGTHREAD || return 1
	pgset "add_device $NS0_IF" $PGTHREAD || return 1
	
	pgset "count $COUNT" $PGDEV || return 1
	pgset "pkt_size 60" $PGDEV || return 1
	pgset "dst_mac $DST" $PGDEV || return 1
	pgset "delay 0" $PGDEV || return 1

	pgset "start" $PGCTRL $BG || return 1
	return 0
}

# Send frames and attach/detach PORT0
do_forward_test ()
{
	ATTACH=$1
	if [ x"$ATTACH" == x"attach" ]; then
		MASTER="master $BR0"
	else
		MASTER="nomaster"
	fi

	DUMP_PROG=tcpdump
	DATE=`date +%H%M%S%N`
	BR0_DUMP=${RESULT_DIR}/${BR0}_${DATE}.dump
	BR0_DUMPLOG=${RESULT_DIR}/${BR0}_${DATE}.log
	NS1_IF_DUMP=${RESULT_DIR}/${NS1_IF}_${DATE}.dump
	NS1_IF_DUMPLOG=${RESULT_DIR}/${NS1_IF}_${DATE}.log
	FILTER="udp and dst port 9"

	# Start capturing
	$DUMP_PROG -p -i $BR0 -f "$FILTER" -s 64 \
	-w $BR0_DUMP 2> $BR0_DUMPLOG &
	DUMP_PIDS=$!

	ip netns exec $NS1 $DUMP_PROG -p -i $NS1_IF -f "$FILTER" -s 64 \
	-w $NS1_IF_DUMP 2> $NS1_IF_DUMPLOG &
	DUMP_PIDS="$DUMP_PIDS $!"

	# Wait for capturing start
	BR0_CAPTURE_OK=0
	NS1_IF_CAPTURE_OK=0
	while [ $BR0_CAPTURE_OK -eq 0 -o $NS1_IF_CAPTURE_OK -eq 0 ]; do
		[ $BR0_CAPTURE_OK -eq 0 ] &&  send_frames 1000 "$NS0_MAC" fg
		[ $NS1_IF_CAPTURE_OK -eq 0 ] && send_frames 1000 "$FLOOD_MAC" fg
		sleep 0.2
		[ $BR0_CAPTURE_OK -eq 0 ] && \
		[ `tshark -n -r $BR0_DUMP 2> /dev/null | wc -l` -ne 0 ] && \
		BR0_CAPTURE_OK=1
		[ $NS1_IF_CAPTURE_OK -eq 0 ] && \
		[ `tshark -n -r $NS1_IF_DUMP 2> /dev/null | wc -l` -ne 0 ] && \
		NS1_IF_CAPTURE_OK=1
	done

	# Do test
	PG_PID=""
	if send_frames 0 "$MAC0" bg; then
		# Wait for frame sending start
		sleep 0.2

		# Change traffic destination
		ADD_DEL_IF_TIME=`date +%H%M%S%N`
		ip link set $PORT0 $MASTER

		# Wait for traffic flow change
		sleep 0.1

		SEND_FAILED=0
	else
		SEND_FAILED=1
	fi

	# Stop capturing
	kill $DUMP_PIDS $PG_PID
	wait $DUMP_PIDS $PG_PID 2> /dev/null
	DUMP_PIDS=""
	PG_PID=""

	if [ $SEND_FAILED -eq 0 ]; then
		SEND_ERROR_COUNT=`$NS0CMD tail -1 $PGDEV | \
		sed -n 's/.*errors: \([0-9]\+\)/\1/p'`
		[ "$SEND_ERROR_COUNT" -ne 0 ] && SEND_FAILED=1
	fi
}

validate_fdb ()
{
	local AT_TYPE=$1

	FDB_ENTRY=`bridge fdb show | grep -v self | grep $MAC0`
	if [ -z "$FDB_ENTRY" ]; then
		# Retrieving a particular fdb entry might fail because
		# "bridge fdb show" could consist of multiple netlink
		# recvmsg()s and fdb might change between each recvmsg().
		# Check fdb again to make sure.
		FDB_ENTRY=`bridge fdb show | grep -v self | grep $MAC0`
	fi

	if [ x"$AT_TYPE" == x"ATTACH" ]; then
		if [ -z "$FDB_ENTRY" ]; then
			NO_ENTRY_ATTACH=`expr $NO_ENTRY_ATTACH + 1`
			return 1
		fi
	
		if ! echo $FDB_ENTRY | grep -q "permanent"; then
			WRONG_ENTRY_ATTACH=`expr $WRONG_ENTRY_ATTACH + 1`
			return 1
		fi
		if ! echo $FDB_ENTRY | grep -q "$PORT0"; then
			WRONG_ENTRY_ATTACH=`expr $WRONG_ENTRY_ATTACH + 1`
			return 1
		fi
	else
		if [ -n "$FDB_ENTRY" ]; then
			if ! echo $FDB_ENTRY | grep -q "$PORT0"; then
				WRONG_ENTRY_DETACH=`expr $WRONG_ENTRY_DETACH + 1`
				return 1
			fi
			UNDELETED_ENTRY_DETACH=`expr $UNDELETED_ENTRY_DETACH + 1`
			return 1
		fi
	fi

	return 0
}

validate_forward ()
{
	if [ x"$1" == x"attach" ]; then
		DUMP_BEFORE_CHANGE=$NS1_IF_DUMP
		DUMP_AFTER_CHANGE=$BR0_DUMP
		local AT_TYPE="ATTACH"
	else
		DUMP_BEFORE_CHANGE=$BR0_DUMP
		DUMP_AFTER_CHANGE=$NS1_IF_DUMP
		local AT_TYPE="DETACH"
	fi

	if [ $SEND_FAILED -eq 1 ]; then
		eval local SEND_FAILS=\$SEND_FAILS_$AT_TYPE
		eval SEND_FAILS_$AT_TYPE=`expr $SEND_FAILS + 1`
		return 2
	fi

	validate_fdb $AT_TYPE
	local RET=$?
	[ $RET -ne 0 ] && return $RET

	# Validate captured data
	BR0_CAPTURE_DROPS=`tail -1 $BR0_DUMPLOG | awk '{print $1}'`
	NS1_IF_CAPTURE_DROPS=`tail -1 $NS1_IF_DUMPLOG | awk '{print $1}'`
	if [ "$BR0_CAPTURE_DROPS" -ne 0 -o \
	     "$NS1_IF_CAPTURE_DROPS" -ne 0 ]; then
		# Kernel failed to store captured frames due to no space
		eval local CAPTURE_DROPS=\$CAPTURE_DROPS_$AT_TYPE
		eval CAPTURE_DROPS_$AT_TYPE=`expr $CAPTURE_DROPS + 1`
		return 2
	fi

	CAPTURED_BEFORE_CHANGE=`tshark -n -r $DUMP_BEFORE_CHANGE -T fields \
	-e frame.number -Y "eth.dst eq $MAC0" 2> /dev/null | wc -l`
	CAPTURED_AFTER_CHANGE=`tshark -n -r $DUMP_AFTER_CHANGE -T fields \
	-e frame.number -Y "eth.dst eq $MAC0" 2> /dev/null | wc -l`
	if [ $CAPTURED_BEFORE_CHANGE -eq 0 ]; then
		# Couldn't capture expected traffic
		# Maybe one of 'sleep's was too short
		eval local NO_CAPTURE=\$NO_CAPTURE_$AT_TYPE
		eval NO_CAPTURE_$AT_TYPE=`expr $NO_CAPTURE + 1`
		return 2
	fi
	if [ $CAPTURED_AFTER_CHANGE -eq 0 ]; then
		# This implies too late traffic flow change
		eval local DELAYED_FLOW_CHANGE=\$DELAYED_FLOW_CHANGE_$AT_TYPE
		eval DELAYED_FLOW_CHANGE_$AT_TYPE=`expr $DELAYED_FLOW_CHANGE + 1`
		return 1
	fi

	LAST_SEQ_BEFORE_CHANGE=`tshark -n -r $DUMP_BEFORE_CHANGE -T fields \
	-e pktgen.seqnum -Y "eth.dst eq $MAC0" 2> /dev/null | tail -1`
	FIRST_SEQ_AFTER_CHANGE=`tshark -n -r $DUMP_AFTER_CHANGE -T fields \
	-e pktgen.seqnum -Y "eth.dst eq $MAC0" 2> /dev/null | head -1`
	LAST_SEQ_AFTER_CHANGE=`tshark -n -r $DUMP_AFTER_CHANGE -T fields \
	-e pktgen.seqnum -Y "eth.dst eq $MAC0" 2> /dev/null | tail -1`

	# Check drops by unknown reason
	if [ `expr $FIRST_SEQ_AFTER_CHANGE - $LAST_SEQ_BEFORE_CHANGE` -ne 1 -o \
	     "$CAPTURED_BEFORE_CHANGE" -ne "$LAST_SEQ_BEFORE_CHANGE" -o \
	     "$CAPTURED_AFTER_CHANGE" -ne \
	     `expr $LAST_SEQ_AFTER_CHANGE - $LAST_SEQ_BEFORE_CHANGE` ]; then
		eval local DROPS_UNKNOWN_REASON=\$DROPS_UNKNOWN_REASON_$AT_TYPE
		eval DROPS_UNKNOWN_REASON_$AT_TYPE=`expr $DROPS_UNKNOWN_REASON + 1`
		return 1
	fi

	# Calculate window time
	TRAFFIC_CHANGED_TIME=`tshark -n -r $DUMP_AFTER_CHANGE -T fields \
	-e frame.time -Y "eth.dst eq $MAC0" 2> /dev/null | head -1 | \
	awk '{print $4}' | sed s/[:.]//g`

	h1=`echo $ADD_DEL_IF_TIME | cut -b 1-2`
	m1=`echo $ADD_DEL_IF_TIME | cut -b 3-4`
	s1=`echo $ADD_DEL_IF_TIME | cut -b 5-6`
	us1=`echo $ADD_DEL_IF_TIME | cut -b 7-12`
	TIME1=`expr $h1 \* 3600 + $m1 \* 60 + $s1`
	TIME1=`expr $TIME1 \* 1000000 + $us1`

	h2=`echo $TRAFFIC_CHANGED_TIME | cut -b 1-2`
	[ "$h2" -lt "$h1" ] && h2=`expr $h2 + 24`
	m2=`echo $TRAFFIC_CHANGED_TIME | cut -b 3-4`
	s2=`echo $TRAFFIC_CHANGED_TIME | cut -b 5-6`
	us2=`echo $TRAFFIC_CHANGED_TIME | cut -b 7-12`
	TIME2=`expr $h2 \* 3600 + $m2 \* 60 + $s2`
	TIME2=`expr $TIME2 \* 1000000 + $us2`

	WINDOW=`expr $TIME2 - $TIME1`
	eval local WINDOW_SUM=\$WINDOW_SUM_$AT_TYPE
	eval WINDOW_SUM_$AT_TYPE=`expr $WINDOW_SUM + $WINDOW`
	eval local WINDOW_MAX=\$WINDOW_MAX_$AT_TYPE
	[ "$WINDOW" -gt "$WINDOW_MAX" ] && eval WINDOW_MAX_$AT_TYPE=$WINDOW
	eval local WINDOW_MIN=\$WINDOW_MIN_$AT_TYPE
	[ "$WINDOW_MIN" -eq -1 -o "$WINDOW" -lt "$WINDOW_MIN" ] && \
	eval WINDOW_MIN_$AT_TYPE=$WINDOW

	return 0
}

cleanup_forward_logs ()
{
	/bin/rm -f "$BR0_DUMP"
	/bin/rm -f "$BR0_DUMPLOG"
	/bin/rm -f "$NS1_IF_DUMP"
	/bin/rm -f "$NS1_IF_DUMPLOG"
}

send_probes ()
{
	: > $PROBE_RESULT
	while :; do
		# Invalidate arp cache
		$NS0CMD ip neigh del $BR0_IP dev $NS0_IF 2> /dev/null
		# Send arp request and echo request.
		# This may fail when mac address of BR0 changes
		# between arp reply and echo request.
		$NS0CMD ping -c 1 -w 1 $BR0_IP > /dev/null 2>&1 || \
		echo >> $PROBE_RESULT
	done
}

do_reply_test ()
{
	ATTACH=$1
	if [ x"$ATTACH" == x"attach" ]; then
		MASTER="master $BR0"
	else
		MASTER="nomaster"
	fi

	DATE=`date +%H%M%S%N`
	PROBE_RESULT=${RESULT_DIR}/PROBE_${DATE}.log

	send_probes &
	PROBE_PID=$!
	sleep 0.1

	# Change traffic destination
	ip link set $PORT0 $MASTER

	# Wait for traffic flow change
	sleep 1.2

	# Stop probing
	kill $PROBE_PID
	wait $PROBE_PID 2> /dev/null
	PROBE_PID=""
}

validate_reply ()
{
	if [ x"$1" == x"attach" ]; then
		local AT_TYPE="ATTACH"
	else
		local AT_TYPE="DETACH"
	fi

	validate_fdb $AT_TYPE
	local RET=$?
	[ $RET -ne 0 ] && return $RET

	# validate ping result
	local PING_FAIL=`cat $PROBE_RESULT | wc -l`
	if [ $PING_FAIL -eq 1 ]; then
		# ping may fail if mac address changes between arp reply and
		# echo request.
		eval local PING_FAIL_COUNT=\$PING_FAIL_COUNT_$AT_TYPE
		eval PING_FAIL_COUNT_$AT_TYPE=`expr $PING_FAIL_COUNT + 1`
		return 2
	elif [ $PING_FAIL -gt 1 ]; then
		# ping should not fail two or more times.
		eval local PING_ERROR_COUNT=\$PING_ERROR_COUNT_$AT_TYPE
		eval PING_ERROR_COUNT_$AT_TYPE=`expr $PING_ERROR_COUNT + 1`
		return 1
	fi

	return 0
}

cleanup_reply_logs ()
{
	/bin/rm -f "$PROBE_RESULT"
}

output_results ()
{
	echo "" 1>&2

	local AT_TYPE
	for AT_TYPE in ATTACH DETACH; do
		eval local NO_ENTRY=\$NO_ENTRY_$AT_TYPE
		eval local WRONG_ENTRY=\$WRONG_ENTRY_$AT_TYPE
		eval local UNDELETED_ENTRY=\$UNDELETED_ENTRY_$AT_TYPE
		eval local DELAYED_FLOW_CHANGE=\$DELAYED_FLOW_CHANGE_$AT_TYPE
		eval local DROPS_UNKNOWN_REASON=\$DROPS_UNKNOWN_REASON_$AT_TYPE
		eval local PING_ERROR_COUNT=\$PING_ERROR_COUNT_$AT_TYPE
		ERROR_COUNT=`expr $NO_ENTRY + $WRONG_ENTRY + $UNDELETED_ENTRY + $DELAYED_FLOW_CHANGE + $DROPS_UNKNOWN_REASON + $PING_ERROR_COUNT`

		eval local SEND_FAILS=\$SEND_FAILS_$AT_TYPE
		eval local CAPTURE_DROPS=\$CAPTURE_DROPS_$AT_TYPE
		eval local NO_CAPTURE=\$NO_CAPTURE_$AT_TYPE
		eval local PING_FAIL_COUNT=\$PING_FAIL_COUNT_$AT_TYPE
		INFO_COUNT=`expr $SEND_FAILS + $CAPTURE_DROPS + $NO_CAPTURE + $PING_FAIL_COUNT`

		eval local FORWARD_TEST_COUNT=\$FORWARD_TEST_COUNT_$AT_TYPE
		eval local REPLY_TEST_COUNT=\$REPLY_TEST_COUNT_$AT_TYPE
		local TEST_COUNT=`expr $FORWARD_TEST_COUNT + $REPLY_TEST_COUNT`

		echo "$AT_TYPE test result:"
		if [ "$ERROR_COUNT" -ne 0 ]; then
			echo "Validation failed: $ERROR_COUNT"
			if [ "$NO_ENTRY" -ne 0 ]; then
				echo -e "\tExpected fdb entry not found: $NO_ENTRY"
			fi
			if [ "$WRONG_ENTRY" -ne 0 ]; then
				echo -e "\tExpected fdb entry had wrong attribute: $WRONG_ENTRY"
			fi
			if [ "$UNDELETED_ENTRY" -ne 0 ]; then
				echo -e "\tUnexpected fdb entry found: $UNDELETED_ENTRY"
			fi
			if [ "$DELAYED_FLOW_CHANGE" -ne 0 ]; then
				echo -e "\tWindow time was too long (over 100 ms): $DELAYED_FLOW_CHANGE"
			fi
			if [ "$DROPS_UNKNOWN_REASON" -ne 0 ]; then
				echo -e "\tCouldn't capture due to unknown reason: $DROPS_UNKNOWN_REASON"
			fi
			if [ "$PING_ERROR_COUNT" -ne 0 ]; then
				echo -e "\tPing failed two or more times: $PING_ERROR_COUNT"
			fi
		else
			if [ "$TEST_COUNT" -ne 0 ]; then
				echo "All validations succeeded."
				echo "Number of valid tests: $TEST_COUNT"
				echo -e "\tForwarding tests: $FORWARD_TEST_COUNT"
				echo -e "\tARP/ICMP tests: $REPLY_TEST_COUNT"
			else
				echo "No valid test was done."
			fi
		fi

		if [ "$FORWARD_TEST_COUNT" -ne 0 ]; then
			eval local WINDOW_SUM=\$WINDOW_SUM_$AT_TYPE
			eval local WINDOW_MAX=\$WINDOW_MAX_$AT_TYPE
			eval local WINDOW_MIN=\$WINDOW_MIN_$AT_TYPE

			echo "Window time summary:"

			WINDOW_AVG=`expr $WINDOW_SUM / $FORWARD_TEST_COUNT`
			WINDOW_AVG_MSEC=`expr $WINDOW_AVG / 1000`
			WINDOW_AVG_USEC=`expr $WINDOW_AVG % 1000 | xargs printf %03d`
			echo -e "\tAverage window time: ${WINDOW_AVG_MSEC}.$WINDOW_AVG_USEC msec"

			WINDOW_MAX_MSEC=`expr $WINDOW_MAX / 1000`
			WINDOW_MAX_USEC=`expr $WINDOW_MAX % 1000 | xargs printf %03d`
			echo -e "\tMax window time: ${WINDOW_MAX_MSEC}.$WINDOW_MAX_USEC msec"

			WINDOW_MIN_MSEC=`expr $WINDOW_MIN / 1000`
			WINDOW_MIN_USEC=`expr $WINDOW_MIN % 1000 | xargs printf %03d`
			echo -e "\tMin window time: ${WINDOW_MIN_MSEC}.$WINDOW_MIN_USEC msec"
		fi

		if [ "$INFO_COUNT" -ne 0 ]; then
			echo "INFO: Some tests failed to execute commands normally."
			echo "      These are not validation failures but test environment issues."
			if [ "$SEND_FAILS" -ne 0 ]; then
				echo -e "\tFrame send fails: $SEND_FAILS"
			fi
			if [ "$CAPTURE_DROPS" -ne 0 ]; then
				echo -e "\tCouldn't capture due to buffer overflow: $CAPTURE_DROPS"
			fi
			if [ "$NO_CAPTURE" -ne 0 ]; then
				echo -e "\tCouldn't capture due to delayed pktgen start: $NO_CAPTURE"
			fi
			if [ "$PING_FAIL_COUNT" -ne 0 ]; then
				echo -e "\tPing failed (only once): $PING_FAIL_COUNT"
				echo -e "\t  Maybe this happened due to timing issue where mac address changes between arp reply and echo request."
			fi
		fi

		echo ""
	done
}

cleanup_settings ()
{
	ip link del $BR0

	local i
	for i in 0 1; do
		eval NS=\$NS$i
		eval VETH_TO_NS_IF=\$VETH_TO_NS${i}_IF

		ip link del $VETH_TO_NS_IF
		ip netns del $NS
	done
}

output_exit ()
{
	output_results
	if [ -n "$DUMP_PIDS$PG_PID$PROBE_PID" ]; then
		kill $DUMP_PIDS $PG_PID $PROBE_PID
		wait $DUMP_PIDS $PG_PID $PROBE_PID 2> /dev/null
	fi
	cleanup_forward_logs
	cleanup_reply_logs
	cleanup_settings
	exit
}

forward_test ()
{
	local j
	for j in attach detach; do
		if [ x"$j" == x"attach" ]; then
			local AT_TYPE=ATTACH
		else
			local AT_TYPE=DETACH
		fi
		do_forward_test $j
		validate_forward $j
		local RET=$?
		if [ $RET -eq 0 ]; then
			echo -n "." 1>&2
			eval local TEST_COUNT=\$FORWARD_TEST_COUNT_$AT_TYPE
			eval FORWARD_TEST_COUNT_$AT_TYPE=`expr $TEST_COUNT + 1`
		elif [ $RET -eq 1 ]; then
			echo -n "!" 1>&2
		else
			echo -n "?" 1>&2
		fi
		cleanup_forward_logs
	done
}

reply_test ()
{
	local j
	for j in attach detach; do
		if [ x"$j" == x"attach" ]; then
			local AT_TYPE=ATTACH
		else
			local AT_TYPE=DETACH
		fi
		do_reply_test $j
		validate_reply $j
		local RET=$?
		if [ $RET -eq 0 ]; then
			echo -n "." 1>&2
			eval local TEST_COUNT=\$REPLY_TEST_COUNT_$AT_TYPE
			eval REPLY_TEST_COUNT_$AT_TYPE=`expr $TEST_COUNT + 1`
		elif [ $RET -eq 1 ]; then
			echo -n "!" 1>&2
		else
			echo -n "?" 1>&2
		fi
		cleanup_reply_logs
	done
}

prepare

trap output_exit INT TERM
trap "output_results 1>&2" USR1

for i in `seq $ITERATIONS`; do
	forward_test
	reply_test
done

output_exit

---- Test script end ----
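
The "Calculate window time" step of the script can be sketched in
isolation. This is a simplified variant, not the script's exact code: it
assumes both stamps have the form HHMMSS followed by at least six
fractional digits (as produced by "date +%H%M%S%N"), uses shell arithmetic
instead of expr, and handles a midnight wrap on the total rather than on
the hour field. The helper names to_usec and window_usec are hypothetical.

```sh
#!/bin/sh
# Sketch: convert two HHMMSS<fraction> stamps to microseconds since
# midnight and print their difference (the "window time").
# to_usec and window_usec are hypothetical helper names.

to_usec ()
{
	# $1: stamp such as 210438000100000 (date +%H%M%S%N)
	h=`echo "$1" | cut -b 1-2`
	m=`echo "$1" | cut -b 3-4`
	s=`echo "$1" | cut -b 5-6`
	us=`echo "$1" | cut -b 7-12`
	# Prefixing 1 keeps values such as "08" from being read as octal
	sec=$(( (1$h - 100) * 3600 + (1$m - 100) * 60 + (1$s - 100) ))
	echo $(( sec * 1000000 + (1$us - 1000000) ))
}

window_usec ()
{
	# $1: stamp taken just before "ip link set"
	# $2: stamp of the first frame seen on the new path
	t1=`to_usec "$1"`
	t2=`to_usec "$2"`
	# The second stamp may fall just after midnight
	if [ "$t2" -lt "$t1" ]; then
		t2=$(( t2 + 86400000000 ))
	fi
	echo $(( t2 - t1 ))
}

# prints "window: 1547 usec"
echo "window: $(window_usec 210438000100000 210438001647000) usec"
```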

---- Test result begin ----

* before this patch set

ATTACH test result:
All validations succeeded.
Number of valid tests: 2000
        Forwarding tests: 1000
        ARP/ICMP tests: 1000
Window time summary:
        Average window time: 1.547 msec
        Max window time: 2.899 msec
        Min window time: 1.328 msec

DETACH test result:
All validations succeeded.
Number of valid tests: 1982
        Forwarding tests: 1000
        ARP/ICMP tests: 982
Window time summary:
        Average window time: 1.276 msec
        Max window time: 4.030 msec
        Min window time: 1.056 msec
INFO: Some tests failed to execute commands normally.
      These are not validation failures but test environment issues.
        Ping failed (only once): 18
          Maybe this happened due to timing issue where mac address changes between arp reply and echo request.


* after this patch set

ATTACH test result:
All validations succeeded.
Number of valid tests: 2000
        Forwarding tests: 1000
        ARP/ICMP tests: 1000
Window time summary:
        Average window time: 1.455 msec
        Max window time: 2.651 msec
        Min window time: 1.256 msec

DETACH test result:
All validations succeeded.
Number of valid tests: 1999
        Forwarding tests: 1000
        ARP/ICMP tests: 999
Window time summary:
        Average window time: 1.382 msec
        Max window time: 4.015 msec
        Min window time: 1.159 msec
INFO: Some tests failed to execute commands normally.
      These are not validation failures but test environment issues.
        Ping failed (only once): 1
          Maybe this happened due to timing issue where mac address changes between arp reply and echo request.

---- Test result end ----

Note: ping still failed once even after applying this patch set.
I think the mac address changed between the arp reply and the echo request.
This can happen not only on a bridge device but in any network
environment, and is not a bug.

Thanks,
Toshiaki Makita

* [PATCH] rtnetlink: return the newly created link in response to newlink
From: Tom Gundersen @ 2014-01-30 13:05 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, John Fastabend, Thomas Graf, Nicolas Dichtel,
	Vlad Yasevich, Tom Gundersen, Marcel Holtmann, David S. Miller

Userspace needs to reliably know the ifindex of the netdevs it creates,
as we cannot rely on the ifname staying unchanged.

Earlier, a simple NLMSG_ERROR would be returned, but this returns the
corresponding RTM_NEWLINK on success instead.

Signed-off-by: Tom Gundersen <teg@jklm.no>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: David S. Miller <davem@davemloft.net>
---
 net/core/rtnetlink.c | 100 ++++++++++++++++++++++++++-------------------------
 1 file changed, 52 insertions(+), 48 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index cf67144..31c1322 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1725,6 +1725,54 @@ static int rtnl_group_changelink(struct net *net, int group,
 	return 0;
 }
 
+static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr* nlh)
+{
+	struct net *net = sock_net(skb->sk);
+	struct ifinfomsg *ifm;
+	char ifname[IFNAMSIZ];
+	struct nlattr *tb[IFLA_MAX+1];
+	struct net_device *dev = NULL;
+	struct sk_buff *nskb;
+	int err;
+	u32 ext_filter_mask = 0;
+
+	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[IFLA_IFNAME])
+		nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
+
+	if (tb[IFLA_EXT_MASK])
+		ext_filter_mask = nla_get_u32(tb[IFLA_EXT_MASK]);
+
+	ifm = nlmsg_data(nlh);
+	if (ifm->ifi_index > 0)
+		dev = __dev_get_by_index(net, ifm->ifi_index);
+	else if (tb[IFLA_IFNAME])
+		dev = __dev_get_by_name(net, ifname);
+	else
+		return -EINVAL;
+
+	if (dev == NULL)
+		return -ENODEV;
+
+	nskb = nlmsg_new(if_nlmsg_size(dev, ext_filter_mask), GFP_KERNEL);
+	if (nskb == NULL)
+		return -ENOBUFS;
+
+	err = rtnl_fill_ifinfo(nskb, dev, RTM_NEWLINK, NETLINK_CB(skb).portid,
+			       nlh->nlmsg_seq, 0, 0, ext_filter_mask);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in if_nlmsg_size */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(nskb);
+	} else
+		err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid);
+
+	return err;
+}
+
 static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
 	struct net *net = sock_net(skb->sk);
@@ -1871,63 +1919,19 @@ replay:
 			goto out;
 		}
 
+		ifm->ifi_index = dev->ifindex;
+
 		err = rtnl_configure_link(dev, ifm);
 		if (err < 0)
 			unregister_netdevice(dev);
+		else
+			rtnl_getlink(skb, nlh);
 out:
 		put_net(dest_net);
 		return err;
 	}
 }
 
-static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr* nlh)
-{
-	struct net *net = sock_net(skb->sk);
-	struct ifinfomsg *ifm;
-	char ifname[IFNAMSIZ];
-	struct nlattr *tb[IFLA_MAX+1];
-	struct net_device *dev = NULL;
-	struct sk_buff *nskb;
-	int err;
-	u32 ext_filter_mask = 0;
-
-	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
-	if (err < 0)
-		return err;
-
-	if (tb[IFLA_IFNAME])
-		nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
-
-	if (tb[IFLA_EXT_MASK])
-		ext_filter_mask = nla_get_u32(tb[IFLA_EXT_MASK]);
-
-	ifm = nlmsg_data(nlh);
-	if (ifm->ifi_index > 0)
-		dev = __dev_get_by_index(net, ifm->ifi_index);
-	else if (tb[IFLA_IFNAME])
-		dev = __dev_get_by_name(net, ifname);
-	else
-		return -EINVAL;
-
-	if (dev == NULL)
-		return -ENODEV;
-
-	nskb = nlmsg_new(if_nlmsg_size(dev, ext_filter_mask), GFP_KERNEL);
-	if (nskb == NULL)
-		return -ENOBUFS;
-
-	err = rtnl_fill_ifinfo(nskb, dev, RTM_NEWLINK, NETLINK_CB(skb).portid,
-			       nlh->nlmsg_seq, 0, 0, ext_filter_mask);
-	if (err < 0) {
-		/* -EMSGSIZE implies BUG in if_nlmsg_size */
-		WARN_ON(err == -EMSGSIZE);
-		kfree_skb(nskb);
-	} else
-		err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid);
-
-	return err;
-}
-
 static u16 rtnl_calcit(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
 	struct net *net = sock_net(skb->sk);
-- 
1.8.5.3

* [PATCH] net: set default DEVTYPE for all ethernet based devices
From: Tom Gundersen @ 2014-01-30 13:20 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Stephen Hemminger, Avinash Kumar,
	Mauro Carvalho Chehab, Simon Horman, Tom Gundersen,
	Marcel Holtmann, Greg KH, Kay Sievers

In systemd's networkd and udevd, we would like to give the administrator a
simple way to filter net devices by their DEVTYPE [0][1]. Other software
such as ConnMan and NetworkManager already uses similar filtering.

Currently, plain ethernet devices have DEVTYPE=(null). This patch sets the
devtype to "ethernet" instead. This avoids the need for special-casing the
DEVTYPE=(null) case in userspace, and also avoids false positives, as there
are several other types of netdevs that also have DEVTYPE=(null).
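
As an illustration of what userspace sees, DEVTYPE shows up as a
"DEVTYPE=..." line in a net device's uevent file under sysfs, and the line
is simply absent when no device type is set. A minimal sketch that lists
the DEVTYPE of each interface follows; the helper name devtype_of is
hypothetical, and the sysfs layout /sys/class/net/<ifname>/uevent is
assumed.

```sh
#!/bin/sh
# Sketch: print the DEVTYPE of every net device by reading its uevent
# file. devtype_of is a hypothetical helper name.

devtype_of ()
{
	# $1: path to a uevent file; prints nothing if DEVTYPE is not set
	sed -n 's/^DEVTYPE=//p' "$1"
}

for f in /sys/class/net/*/uevent; do
	[ -r "$f" ] || continue
	dir=${f%/uevent}
	ifname=${dir##*/}
	type=$(devtype_of "$f")
	# Devices without a DEVTYPE line are shown as "(none)"
	echo "$ifname: ${type:-(none)}"
done
```

A filter on the result could then match on a concrete type name instead
of special-casing the missing value.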

Notice that this is done, as suggested by Marcel, in alloc_etherdev_mqs(),
and as best I can tell it will not give any false positives. I considered
doing it in ether_setup() instead as that seemed more intuitive, but that
would give a lot of false positives indeed.

[0]: <http://www.freedesktop.org/software/systemd/man/systemd-networkd.service.html#Type>
[1]: <http://www.freedesktop.org/software/systemd/man/udev.html#Type>

Signed-off-by: Tom Gundersen <teg@jklm.no>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Kay Sievers <kay@vrfy.org>
---
 net/ethernet/eth.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 8f032ba..b76dc17 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -369,6 +369,10 @@ void ether_setup(struct net_device *dev)
 }
 EXPORT_SYMBOL(ether_setup);

+static const struct device_type eth_type = {
+	.name = "ethernet",
+};
+
 /**
  * alloc_etherdev_mqs - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
@@ -387,7 +391,13 @@ EXPORT_SYMBOL(ether_setup);
 struct net_device *alloc_etherdev_mqs(int sizeof_priv, unsigned int txqs,
 				      unsigned int rxqs)
 {
-	return alloc_netdev_mqs(sizeof_priv, "eth%d", ether_setup, txqs, rxqs);
+	struct net_device* dev;
+
+	dev = alloc_netdev_mqs(sizeof_priv, "eth%d", ether_setup, txqs, rxqs);
+	if (dev)
+		dev->dev.type = &eth_type;
+
+	return dev;
 }
 EXPORT_SYMBOL(alloc_etherdev_mqs);

-- 
1.8.5.3

* Re: [PATCH 0/2] [BUG FIXES - 3.10.27] sit: More backports
From: Steven Rostedt @ 2014-01-30 13:31 UTC (permalink / raw)
  To: nicolas.dichtel
  Cc: linux-kernel, netdev, stable, Clark Williams,
	Luis Claudio R. Goncalves, John Kacur, Willem de Bruijn
In-Reply-To: <52EA1B4B.5080403@6wind.com>

On Thu, 30 Jan 2014 10:28:43 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:


> Steve, I think the patch I sent yesterday is the good fix. At the end, it's
> a backport of Willem's patch. Note that he also ack that patch.
> The first version you sent (which removes
> unregister_netdevice_queue(sitn->fb_tunnel_dev, &list)) will introduce a
> memory leak when the user destroy a netns.

Hi Nicolas,

I reverted my patches and applied and tested your patches locally and
they passed my first line testing. I'm going to have them entered into
our test suite, after removing our other patches, and see if they solve
all the bugs that we were tripping over.

I'll let you know when these are finished.

Thanks!

-- Steve

* Re: [PATCH 0/2] [BUG FIXES - 3.10.27] sit: More backports
From: Nicolas Dichtel @ 2014-01-30 13:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, netdev, stable, Clark Williams,
	Luis Claudio R. Goncalves, John Kacur, Willem de Bruijn
In-Reply-To: <20140130083103.2bc68bad@gandalf.local.home>

Le 30/01/2014 14:31, Steven Rostedt a écrit :
> On Thu, 30 Jan 2014 10:28:43 +0100
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>
>> Steve, I think the patch I sent yesterday is the good fix. At the end, it's
>> a backport of Willem's patch. Note that he also ack that patch.
>> The first version you sent (which removes
>> unregister_netdevice_queue(sitn->fb_tunnel_dev, &list)) will introduce a
>> memory leak when the user destroy a netns.
>
> Hi Nicolas,
>
> I reverted my patches and applied and tested your patches locally and
> they passed my first line testing. I'm going to have them entered into
> our test suite, after removing our other patches, and see if they solve
> all the bugs that we were tripping over.
>
> I'll let you know when these are finished.
Thank you for testing.


Regards,
Nicolas

* Re: Fwd: RFC 7112 on Implications of Oversized IPv6 Header Chains
From: Ben Hutchings @ 2014-01-30 13:56 UTC (permalink / raw)
  To: Fernando Gont; +Cc: netdev
In-Reply-To: <52E955A3.7080408@gont.com.ar>

On Wed, 2014-01-29 at 16:25 -0300, Fernando Gont wrote:
> Folks,
> 
> FYI. This one has important implications -- it allows stateless
> filtering in IPv6 (otherwise not really possible)
[...]

Still not possible unless you can trust that all hosts behind the
firewall will correctly drop overlapping fragments.

Ben.

-- 
Ben Hutchings
It is a miracle that curiosity survives formal education. - Albert Einstein

* Re: [PATCH 1/4] net: ethoc: implement basic ethtool operations
From: Ben Hutchings @ 2014-01-30 13:59 UTC (permalink / raw)
  To: Max Filippov
  Cc: netdev, linux-kernel, David S. Miller, Florian Fainelli,
	Marc Gauthier
In-Reply-To: <1391025397-14965-2-git-send-email-jcmvbkbc@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1341 bytes --]

On Wed, 2014-01-29 at 23:56 +0400, Max Filippov wrote:
> The following methods are implemented:
> - get link state (standard implementation);
> - get timestamping info (standard implementation).
> 
> Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>

Reviewed-by: Ben Hutchings <ben@decadent.org.uk>

> ---
>  drivers/net/ethernet/ethoc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
> index 5854d41..6de6352 100644
> --- a/drivers/net/ethernet/ethoc.c
> +++ b/drivers/net/ethernet/ethoc.c
> @@ -900,6 +900,11 @@ out:
>  	return NETDEV_TX_OK;
>  }
>  
> +const struct ethtool_ops ethoc_ethtool_ops = {
> +	.get_link = ethtool_op_get_link,
> +	.get_ts_info = ethtool_op_get_ts_info,
> +};
> +
>  static const struct net_device_ops ethoc_netdev_ops = {
>  	.ndo_open = ethoc_open,
>  	.ndo_stop = ethoc_stop,
> @@ -1148,6 +1153,7 @@ static int ethoc_probe(struct platform_device *pdev)
>  	netdev->netdev_ops = &ethoc_netdev_ops;
>  	netdev->watchdog_timeo = ETHOC_TIMEOUT;
>  	netdev->features |= 0;
> +	netdev->ethtool_ops = &ethoc_ethtool_ops;
>  
>  	/* setup NAPI */
>  	netif_napi_add(netdev, &priv->napi, ethoc_poll, 64);

-- 
Ben Hutchings
It is a miracle that curiosity survives formal education. - Albert Einstein

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [PATCH 3/4] net: ethoc: implement ethtool get registers
From: Ben Hutchings @ 2014-01-30 14:01 UTC (permalink / raw)
  To: Max Filippov
  Cc: netdev, linux-kernel, David S. Miller, Florian Fainelli,
	Marc Gauthier
In-Reply-To: <1391025397-14965-4-git-send-email-jcmvbkbc@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1716 bytes --]

On Wed, 2014-01-29 at 23:56 +0400, Max Filippov wrote:
> Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>

Reviewed-by: Ben Hutchings <ben@decadent.org.uk>

> ---
>  drivers/net/ethernet/ethoc.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
> index 9518023..0bf297b 100644
> --- a/drivers/net/ethernet/ethoc.c
> +++ b/drivers/net/ethernet/ethoc.c
> @@ -52,6 +52,7 @@ MODULE_PARM_DESC(buffer_size, "DMA buffer allocation size");
>  #define	ETH_HASH0	0x48
>  #define	ETH_HASH1	0x4c
>  #define	ETH_TXCTRL	0x50
> +#define	ETH_END		0x54
>  
>  /* mode register */
>  #define	MODER_RXEN	(1 <<  0) /* receive enable */
> @@ -922,9 +923,28 @@ static int ethoc_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
>  	return phy_ethtool_sset(phydev, cmd);
>  }
>  
> +static int ethoc_get_regs_len(struct net_device *netdev)
> +{
> +	return ETH_END;
> +}
> +
> +static void ethoc_get_regs(struct net_device *dev, struct ethtool_regs *regs,
> +			   void *p)
> +{
> +	struct ethoc *priv = netdev_priv(dev);
> +	u32 *regs_buff = p;
> +	unsigned i;
> +
> +	regs->version = 0;
> +	for (i = 0; i < ETH_END / sizeof(u32); ++i)
> +		regs_buff[i] = ethoc_read(priv, i * sizeof(u32));
> +}
> +
>  const struct ethtool_ops ethoc_ethtool_ops = {
>  	.get_settings = ethoc_get_settings,
>  	.set_settings = ethoc_set_settings,
> +	.get_regs_len = ethoc_get_regs_len,
> +	.get_regs = ethoc_get_regs,
>  	.get_link = ethtool_op_get_link,
>  	.get_ts_info = ethtool_op_get_ts_info,
>  };

-- 
Ben Hutchings
It is a miracle that curiosity survives formal education. - Albert Einstein

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [PATCH v2 4/4] net: ethoc: implement ethtool operations
From: Ben Hutchings @ 2014-01-30 14:04 UTC (permalink / raw)
  To: Max Filippov
  Cc: linux-xtensa@linux-xtensa.org, netdev, LKML, Chris Zankel,
	Marc Gauthier, David S. Miller, Florian Fainelli
In-Reply-To: <CAMo8Bf+9A-yh-_EBmZeJqMhUKrHN5+BPtckFj_gnZ44tfnX__w@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 994 bytes --]

On Thu, 2014-01-30 at 07:04 +0400, Max Filippov wrote:
> On Thu, Jan 30, 2014 at 5:59 AM, Ben Hutchings <ben@decadent.org.uk> wrote:
> > On Wed, 2014-01-29 at 10:00 +0400, Max Filippov wrote:
[...]
> >> +     priv->num_tx = rounddown_pow_of_two(ring->tx_pending);
> >
> > Range check?
> 
> Can more be requested than the ring->tx_max_pending that we
> indicated in get_ringparam?

Yes, the ethtool core doesn't check that for you.

> >> +     priv->num_rx = priv->num_bd - priv->num_tx;
> >> +     if (priv->num_rx > ring->rx_pending)
> >> +             priv->num_rx = ring->rx_pending;
> >
> > So the RX ring may only ever be shrunk?!  Did you mean to compare with
> > priv->num_bd instead?
> 
> First all non-TX descriptors are made RX, and if that's more than user
> requested I trim it.
[...]

OK, I get it.  But it would be clearer if you used min().

Ben.
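
The two review points in this message (range-check the request yourself, and express the RX trim with min()) can be sketched as a standalone userspace model. This is not the driver code: `split_ring()`, `struct ring_split`, `rounddown_pow2()` and `min_uint()` are hypothetical stand-ins, the limit of num_bd - 1 per ring is an assumption, and the -1 return stands in for whatever error code the driver would choose.

```c
/* Largest power of two <= n, for n >= 1 (stand-in for the kernel's
 * rounddown_pow_of_two()). */
static unsigned int rounddown_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p <= n / 2)
		p *= 2;
	return p;
}

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Hypothetical result type: how the num_bd descriptors are divided. */
struct ring_split {
	unsigned int num_tx;
	unsigned int num_rx;
};

/* Assumption: each ring needs at least one descriptor, so neither request
 * may exceed num_bd - 1. The caller's values are validated here because,
 * as noted above, the ethtool core does not do it. */
static int split_ring(unsigned int num_bd, unsigned int tx_pending,
		      unsigned int rx_pending, struct ring_split *out)
{
	if (num_bd < 2)
		return -1;
	if (tx_pending < 1 || tx_pending > num_bd - 1)
		return -1;
	if (rx_pending < 1 || rx_pending > num_bd - 1)
		return -1;

	out->num_tx = rounddown_pow2(tx_pending);
	/* All non-TX descriptors go to RX, trimmed to the requested size. */
	out->num_rx = min_uint(num_bd - out->num_tx, rx_pending);
	return 0;
}
```

With 128 descriptors, a request for 100 TX and 64 RX would round TX down to 64 and leave RX at 64, while a request for 200 RX would be rejected rather than silently clamped.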

-- 
Ben Hutchings
It is a miracle that curiosity survives formal education. - Albert Einstein

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: IGMP joins come from the wrong SA/interface
From: Hannes Frederic Sowa @ 2014-01-30 14:17 UTC (permalink / raw)
  To: Steinar H. Gunderson; +Cc: netdev
In-Reply-To: <20140130104709.GA21178@sesse.net>

On Thu, Jan 30, 2014 at 11:47:09AM +0100, Steinar H. Gunderson wrote:
> On Mon, Jan 20, 2014 at 07:40:25PM +0100, Steinar H. Gunderson wrote:
> >> I currently only remember one commit 0a7e22609067ff ("ipv4: fix
> >> ineffective source address selection") which did affect multicast source
> >> address selection in recent times.
> > I tried 3.10.27, just to check something older. I also tried 3.10.27 with
> > 0a7e22609067ff reverted, and it's still wrong.
> > 
> > I am thinking this might have something to do with the machine switching to
> > systemd, presumably changing the order of DHCP and static addresses being
> > assigned...
> 
> Anything more I can do here?

Can you give a bit more background on which multicast application you are
running on the box, and also post the output of cat /proc/net/igmp?

(For the application, an strace showing how it joins the multicast address
would be interesting.)

I guess a workaround would be to bind the join to a specific interface.

Greetings,

  Hannes
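
The workaround mentioned above, binding the join to a specific interface, could look roughly like this on Linux. It is a minimal sketch, assuming the application can be changed and uses IPv4 multicast; `join_group_on_ifname()` is a hypothetical helper, while `struct ip_mreqn`, `IP_ADD_MEMBERSHIP` and `if_nametoindex()` are the existing Linux/glibc interfaces.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Join the IPv4 multicast group `group` on the interface named `ifname`,
 * instead of letting the kernel pick the interface from the routing table.
 * Returns 0 on success, -1 on a bad address, unknown interface, or
 * setsockopt() failure. */
static int join_group_on_ifname(int fd, const char *group, const char *ifname)
{
	struct ip_mreqn mreq;

	memset(&mreq, 0, sizeof(mreq));
	if (inet_pton(AF_INET, group, &mreq.imr_multiaddr) != 1)
		return -1;

	/* A non-zero imr_ifindex pins the membership to that interface. */
	mreq.imr_ifindex = (int)if_nametoindex(ifname);
	if (mreq.imr_ifindex == 0)
		return -1;

	return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
			  &mreq, sizeof(mreq));
}
```

A caller would create a UDP socket with socket(AF_INET, SOCK_DGRAM, 0) and then call, for example, join_group_on_ifname(fd, "239.1.2.3", "eth0"); the group address and interface name here are placeholders.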


* Re: IPv4 / IPv6 over IPv4 IPsec tunnel: setting the DF bit
From: Hannes Frederic Sowa @ 2014-01-30 14:21 UTC (permalink / raw)
  To: Simon Schneider; +Cc: netdev
In-Reply-To: <trinity-bc5263ea-896d-4350-aa80-fd2895b54b3b-1391084710336@3capp-gmx-bs28>

On Thu, Jan 30, 2014 at 01:25:10PM +0100, Simon Schneider wrote:
> Hi,
> for the scenarios
> - IPv4 over IPv4 IPsec tunnel
> - IPv6 over IPv4 IPsec tunnel
> 
> I wonder how the DF bit of the outer (encrypted) packet is set.
> 
> There are generally three options:
> - DF bit always 0
> - DF bit always 1
> - DF bit copied from inner packet

There is a pmtudisc knob on ip tunnel ... to force the DF bit on outgoing
packets, but the DF bit should get copied from the inner packet up to the
tunnel header in every case.

Greetings,

  Hannes
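
The behaviour described in the reply above can be modelled in a few lines. This is only a sketch of that description for an IPv4 inner packet, not the kernel's transmit path: `outer_frag_off()` and `MODEL_IP_DF` are hypothetical names, the values are in host byte order, and `tunnel_frag_off` is assumed to carry the DF bit only when pmtudisc is enabled on the tunnel.

```c
#include <stdint.h>

/* Don't Fragment flag in the IPv4 flags/fragment-offset field
 * (host byte order in this model). */
#define MODEL_IP_DF 0x4000u

/* Outer DF is set if the tunnel forces it (pmtudisc) or if the inner
 * IPv4 packet had it set; all other flag and offset bits are dropped. */
static uint16_t outer_frag_off(uint16_t tunnel_frag_off,
			       uint16_t inner_frag_off)
{
	return (uint16_t)((tunnel_frag_off | inner_frag_off) & MODEL_IP_DF);
}
```

In this model the third option from the question (DF copied from the inner packet) is the default, and pmtudisc corresponds to "DF bit always 1"; whether an IPsec tunnel behaves the same way is not covered by the reply.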
