netdev.vger.kernel.org archive mirror
* Re: [IPv6] "sendmsg: invalid argument" to multicast group after some time
@ 2008-12-28  4:47 Eduard Guzovsky
  0 siblings, 0 replies; 25+ messages in thread
From: Eduard Guzovsky @ 2008-12-28  4:47 UTC (permalink / raw)
  To: netdev

> I even get the same error when doing a multicast ping6:
>  miredo:~# ping6 -I eth0 ff02::9
>  PING ff02::9(ff02::9) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
>  ping: sendmsg: Invalid argument

We had a similar problem in our lab network. I tracked down the source
of the "Invalid argument" error to ip6_output_finish(). Here is the
stack trace:

  -----edg ip6_output_finish: failed to find neighbour
  [<c010647a>] show_trace_log_lvl+0x1a/0x30
  [<c0106ba2>] show_trace+0x12/0x20
  [<c0106c09>] dump_stack+0x19/0x20
  [<f14ab019>] ip6_output2+0x279/0x290 [ipv6]
  [<f14ab40f>] ip6_output+0x2df/0x830 [ipv6]
  [<f14abce7>] ip6_push_pending_frames+0x247/0x420 [ipv6]
  [<f14bde2f>] udp_v6_push_pending_frames+0x13f/0x1f0 [ipv6]
  [<f14bf8fe>] udpv6_sendmsg+0x7ae/0xa60 [ipv6]
  [<c02ea254>] inet_sendmsg+0x34/0x60
  [<c0297adc>] sock_sendmsg+0xfc/0x120
  [<c029835f>] sys_sendto+0xbf/0xe0
  [<c0299a37>] sys_socketcall+0x187/0x260
  [<c0105b7b>] syscall_call+0x7/0xb
  =======================

ip6_output_finish() returns EINVAL because the route cache entry has
NULL as a "neighbour" pointer.

These invalid route cache entries are created when the IPv6 neighbour
table fills up (one potential cause is a combination of a lot of
multicast traffic ("ff02:…") and Xen hosts with interfaces in
promiscuous mode). In this case ndisc_get_neigh() returns NULL, but in
at least two places the routing code in net/ipv6/route.c ignores that
and inserts invalid entries into the cache anyway.

This is especially bad for frequently used multicast addresses. The
garbage collector does not remove them from the cache, probably
because of the frequent updates of the "__use" count. You need to
flush the cache to get rid of them.
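
The failure mode described above can be illustrated with a toy userspace
model. This is only a sketch, not kernel code: every name in it is
hypothetical, and TOY_NEIGH_MAX merely stands in for the hard limit on the
neighbour table. It shows how a route created while the table is full ends
up with a NULL next hop and then fails with -EINVAL on output.

```c
#include <errno.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code: every name here is hypothetical. */
#define TOY_NEIGH_MAX 2 /* stands in for the neighbour table hard limit */

struct toy_neigh { int in_use; };
struct toy_route { struct toy_neigh *nexthop; };

static struct toy_neigh toy_neigh_table[TOY_NEIGH_MAX];

/* Returns a free neighbour slot, or NULL once the table is full. */
static struct toy_neigh *toy_get_neigh(void)
{
    for (size_t i = 0; i < TOY_NEIGH_MAX; i++) {
        if (!toy_neigh_table[i].in_use) {
            toy_neigh_table[i].in_use = 1;
            return &toy_neigh_table[i];
        }
    }
    return NULL;
}

/* Models the problem: the route is kept even if the lookup failed. */
static void toy_route_create(struct toy_route *rt)
{
    rt->nexthop = toy_get_neigh();
}

/* Output path: a route without a neighbour cannot be used. */
static int toy_output(const struct toy_route *rt)
{
    return rt->nexthop ? 0 : -EINVAL;
}
```

In this model the first two routes work, while the third one is created
with a NULL next hop and every later toy_output() call on it fails until
the entry is thrown away.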

One way to work around the problem is to increase "gc_thresh3" for
the IPv6 neighbour table. That still leaves you open to DoS attacks.
Another way is to create permanent entries in the neighbour/routing
tables.
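
Before raising the limit it helps to know its current value. The sketch
below reads it from procfs. The path is the usual Linux location of this
sysctl; the helper name read_sysctl_long() is made up for illustration.

```c
#include <stdio.h>

/* Usual procfs location of the sysctl mentioned above (Linux only). */
#define GC_THRESH3_PATH "/proc/sys/net/ipv6/neigh/default/gc_thresh3"

/*
 * Hypothetical helper: reads one integer from a file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static long read_sysctl_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling read_sysctl_long(GC_THRESH3_PATH) returns the current hard limit.
Raising it means writing a larger number to the same file as root, and as
noted above that only postpones the overflow.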

In any case, the routing cache pollution problem has to be fixed. I
suggest the following patch. I do not know this code and would
appreciate it if the code maintainers could comment on it.

Thanks,

-Ed

--- a/net/ipv6/route.c  2008-12-26 14:56:50.000000000 -0500
+++ b/net/ipv6/route.c  2008-12-26 14:57:19.000000000 -0500
@@ -638,6 +638,11 @@

                rt->rt6i_nexthop = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway);

+                if (rt->rt6i_nexthop == NULL) {
+                    dst_free((struct dst_entry *)rt);
+                    rt = NULL;
+                }
+
        }

        return rt;
@@ -991,9 +996,18 @@
        dev_hold(dev);
        if (neigh)
                neigh_hold(neigh);
-       else
+       else {
                neigh = ndisc_get_neigh(dev, addr);

+                if (neigh == NULL) {
+                    dev_put(dev);
+                    in6_dev_put(idev);
+                    dst_free((struct dst_entry *)rt);
+                    rt = NULL;
+                    goto out;
+                }
+        }
+
        rt->rt6i_dev      = dev;
        rt->rt6i_idev     = idev;
        rt->rt6i_nexthop  = neigh;

* Re: [IPv6] "sendmsg: invalid argument" to multicast group after some time
@ 2008-12-30  7:52 David Miller
  2008-12-31 19:53 ` Eduard Guzovsky
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2008-12-30  7:52 UTC (permalink / raw)
  To: eguzovsky; +Cc: berni, dlstevens, pekkas, netdev


Eduard, thanks for your analysis and RFC patch.

I agree this is an ugly situation.

Looking over this area, the real problem is that the neighbour cache
can't do anything to apply back pressure on the routing cache when it
fills up with essentially unused multicast entries like this.

When we hit the upper limits (such as gc_thresh3) for the neighbour
cache, it tries to do things like neigh_forced_gc().

But this won't accomplish anything, since all of these IPv6 multicast
routes hold a reference on the neigh entries filling up the table, so
the forced GC won't be able to liberate them.

So you're absolutely right that the route cache pollution is the core
problem.

Looking at the IPV4 routing cache we have code which goes:

		int err = arp_bind_neighbour(&rt->u.dst);
		if (err) {
 ...
			/* Neighbour tables are full and nothing
			   can be released. Try to shrink route cache,
			   it is most likely it holds some neighbour records.
			 */

and then proceeds to try and forcefully flush some routing cache
entries.

So the real fix is that IPv6 should do something similar.
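
The idea of shrinking the route cache and retrying can be sketched as a
toy userspace model. It is not kernel code and all names are hypothetical.
Unlike a real garbage collector, toy_route_gc() simply drops every cached
route, which is enough to show why releasing routes frees neighbour slots.

```c
#include <stddef.h>

/* Toy userspace model, NOT kernel code: every name here is hypothetical. */
#define TOY_NEIGH_MAX 2
#define TOY_ROUTE_MAX 4

struct toy_route {
    int cached; /* 1 while the entry sits in the route cache */
    int neigh;  /* index of the neighbour slot this route pins */
};

static int toy_neigh_refs[TOY_NEIGH_MAX];
static struct toy_route toy_cache[TOY_ROUTE_MAX];

/* Takes a reference on a free neighbour slot; -1 if the table is full. */
static int toy_neigh_get(void)
{
    for (int i = 0; i < TOY_NEIGH_MAX; i++) {
        if (toy_neigh_refs[i] == 0) {
            toy_neigh_refs[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Drops every cached route and releases the neighbour each one pins. */
static void toy_route_gc(void)
{
    for (int i = 0; i < TOY_ROUTE_MAX; i++) {
        if (toy_cache[i].cached) {
            toy_neigh_refs[toy_cache[i].neigh] = 0;
            toy_cache[i].cached = 0;
        }
    }
}

/* On neighbour overflow, shrink the route cache once and retry. */
static struct toy_route *toy_route_create(void)
{
    int attempts = 1;
    int n;

    while ((n = toy_neigh_get()) < 0) {
        if (attempts-- <= 0)
            return NULL; /* still full after the forced GC */
        toy_route_gc();
    }
    for (int i = 0; i < TOY_ROUTE_MAX; i++) {
        if (!toy_cache[i].cached) {
            toy_cache[i].cached = 1;
            toy_cache[i].neigh = n;
            return &toy_cache[i];
        }
    }
    toy_neigh_refs[n] = 0; /* no cache slot: give the neighbour back */
    return NULL;
}
```

With two neighbour slots, the third toy_route_create() call first fails to
get a neighbour, flushes the cache, and then succeeds on the retry.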

Something like the following (untested) patch:

diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index ce532f2..1459ed3 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -155,9 +155,9 @@ static inline struct neighbour * ndisc_get_neigh(struct net_device *dev, const s
 {
 
 	if (dev)
-		return __neigh_lookup(&nd_tbl, addr, dev, 1);
+		return __neigh_lookup_errno(&nd_tbl, addr, dev);
 
-	return NULL;
+	return ERR_PTR(-ENODEV);
 }
 
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 18c486c..0db4129 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -627,6 +627,9 @@ static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort, struct in6_addr *dad
 	rt = ip6_rt_copy(ort);
 
 	if (rt) {
+		struct neighbour *neigh;
+		int attempts = !in_softirq();
+
 		if (!(rt->rt6i_flags&RTF_GATEWAY)) {
 			if (rt->rt6i_dst.plen != 128 &&
 			    ipv6_addr_equal(&rt->rt6i_dst.addr, daddr))
@@ -646,7 +649,35 @@ static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort, struct in6_addr *dad
 		}
 #endif
 
-		rt->rt6i_nexthop = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway);
+	retry:
+		neigh = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway);
+		if (IS_ERR(neigh)) {
+			struct net *net = dev_net(rt->rt6i_dev);
+			int saved_rt_min_interval =
+				net->ipv6.sysctl.ip6_rt_gc_min_interval;
+			int saved_rt_elasticity =
+				net->ipv6.sysctl.ip6_rt_gc_elasticity;
+
+			if (attempts-- > 0) {
+				net->ipv6.sysctl.ip6_rt_gc_elasticity = 1;
+				net->ipv6.sysctl.ip6_rt_gc_min_interval = 0;
+
+				ip6_dst_gc(net->ipv6.ip6_dst_ops);
+
+				net->ipv6.sysctl.ip6_rt_gc_elasticity =
+					saved_rt_elasticity;
+				net->ipv6.sysctl.ip6_rt_gc_min_interval =
+					saved_rt_min_interval;
+				goto retry;
+			}
+
+			if (net_ratelimit())
+				printk(KERN_WARNING
+				       "Neighbour table overflow.\n");
+			dst_free(&rt->u.dst);
+			return NULL;
+		}
+		rt->rt6i_nexthop = neigh;
 
 	}
 
@@ -945,8 +976,11 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	dev_hold(dev);
 	if (neigh)
 		neigh_hold(neigh);
-	else
+	else {
 		neigh = ndisc_get_neigh(dev, addr);
+		if (IS_ERR(neigh))
+			neigh = NULL;
+	}
 
 	rt->rt6i_dev	  = dev;
 	rt->rt6i_idev     = idev;
@@ -1887,6 +1921,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 {
 	struct net *net = dev_net(idev->dev);
 	struct rt6_info *rt = ip6_dst_alloc(net->ipv6.ip6_dst_ops);
+	struct neighbour *neigh;
 
 	if (rt == NULL)
 		return ERR_PTR(-ENOMEM);
@@ -1909,11 +1944,18 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 		rt->rt6i_flags |= RTF_ANYCAST;
 	else
 		rt->rt6i_flags |= RTF_LOCAL;
-	rt->rt6i_nexthop = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway);
-	if (rt->rt6i_nexthop == NULL) {
+	neigh = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway);
+	if (IS_ERR(neigh)) {
 		dst_free(&rt->u.dst);
-		return ERR_PTR(-ENOMEM);
+
+		/* We are casting this because that is the return
+		 * value type.  But a errno encoded pointer is the
+		 * same regardless of the underlying pointer type,
+		 * and that's what we are returning.  So this is OK.
+		 */
+		return (struct rt6_info *) neigh;
 	}
+	rt->rt6i_nexthop = neigh;
 
 	ipv6_addr_copy(&rt->rt6i_dst.addr, addr);
 	rt->rt6i_dst.plen = 128;
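
The cast discussed in the comment above relies on the errno-in-pointer
convention. The sketch below re-creates that idea in userspace so it can
be compiled on its own. The toy_ names are hypothetical stand-ins modelled
on the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), not the kernel
definitions themselves.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel helpers; toy_ names are hypothetical. */
#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline long toy_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error values occupy the top TOY_MAX_ERRNO addresses of the address space. */
static inline int toy_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_neigh { int id; };
struct toy_route { int id; };

/* Returns a neighbour, or an errno-encoded pointer when there is no device. */
static struct toy_neigh *toy_get_neigh(int have_dev)
{
    static struct toy_neigh n;

    return have_dev ? &n : toy_err_ptr(-ENODEV);
}

/* Passes the neighbour error through unchanged, as a different pointer type. */
static struct toy_route *toy_route_alloc(int have_dev)
{
    static struct toy_route rt;
    struct toy_neigh *neigh = toy_get_neigh(have_dev);

    if (toy_is_err(neigh))
        return (struct toy_route *)neigh;
    return &rt;
}
```

The caller of toy_route_alloc() recovers the original error code with
toy_ptr_err(), whichever pointer type carried it.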

* [IPv6] "sendmsg: invalid argument" to multicast group after some time
@ 2008-08-31 18:20 Bernhard Schmidt
  2008-09-01  5:49 ` David Stevens
  2008-09-01 13:03 ` David Stevens
  0 siblings, 2 replies; 25+ messages in thread
From: Bernhard Schmidt @ 2008-08-31 18:20 UTC (permalink / raw)
  To: netdev

Hello all,

This is about the same box as in the message from Remi an hour ago, but
most probably not related.

I'm running a Teredo (RFC 4380) relay on an i386 Xen domU with a vanilla
2.6.26 kernel and its integrated pv_ops feature. The relay function
is implemented in a userspace daemon called Miredo, which provides a tun
interface to the OS into which native IPv6 traffic for 2001::/32 is
routed. The traffic is then handled in the userspace daemon and emitted
encapsulated in IPv4/UDP. Apart from a few scalability problems, which
seem to be related to the neighbour or route cache size, it works fine.
The machine is doing around 2kpps of IPv6 traffic (which means 4kpps of
IPv4+IPv6).

As there are a couple of similar relays globally anycasted, I'm supposed
to withdraw the route from BGP if the daemon or the machine fails. To do
this I'm running ripngd from the Quagga routing suite, which announces
the Teredo prefix to my core routers using RIPng (RFC 2080). At the
kernel level, RIPng is basically periodic UDP to a link-local multicast
address, [ff02::9]:521.
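
For readers who want to probe this path without ripngd, here is a minimal
sketch of such a send using the standard sockets API. The helper names
ripng_dest() and ripng_probe() are made up for illustration, and the
one-byte payload is not a valid RIPng message, so routers will ignore it.
It only exercises the kernel send path.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fills in the destination [ff02::9]:521 scoped to one interface. */
static void ripng_dest(struct sockaddr_in6 *sa, unsigned int ifindex)
{
    memset(sa, 0, sizeof(*sa));
    sa->sin6_family = AF_INET6;
    sa->sin6_port = htons(521);
    sa->sin6_scope_id = ifindex; /* required for link-local scope */
    inet_pton(AF_INET6, "ff02::9", &sa->sin6_addr);
}

/*
 * Hypothetical probe: sends one UDP datagram to the RIPng group.
 * Returns 0 on success, or -1 with errno set by the failing call.
 */
static int ripng_probe(const char *ifname)
{
    struct sockaddr_in6 sa;
    unsigned int ifindex = if_nametoindex(ifname);
    int fd, ret = 0, saved_errno = 0;

    if (ifindex == 0)
        return -1; /* no such interface */
    fd = socket(AF_INET6, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    ripng_dest(&sa, ifindex);
    if (sendto(fd, "", 1, 0, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        saved_errno = errno;
        ret = -1;
    }
    close(fd);
    if (ret < 0)
        errno = saved_errno; /* close() may have overwritten it */
    return ret;
}
```

On an affected box, ripng_probe("eth0") would be expected to return -1
with errno set to EINVAL, matching the ping6 output shown in this message.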

Every few hours this announcement fails (no announcements reach the core
routers anymore, and they drop the routing entry after a timeout). ripngd
debugging claims that it could not send the announcement due to
"Invalid argument". There are no outgoing packets in tcpdump anymore.

I even get the same error when doing a multicast ping6:
miredo:~# ping6 -I eth0 ff02::9
PING ff02::9(ff02::9) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
ping: sendmsg: Invalid argument                                        
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.030 ms
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.018 ms (DUP!)

(fe80::216:3eff:feb9:29f5 is the box itself; it's the only one that ever
answers, although duplicated)

ping6 to other multicast addresses, even in the same scope, works fine:

miredo:~# ping6 -I eth0 ff02::2                       
PING ff02::2(ff02::2) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.057 ms
64 bytes from fe80::20c:86ff:fe9a:3819: icmp_seq=1 ttl=64 time=0.466 ms (DUP!)
64 bytes from fe80::20c:86ff:fe9a:2819: icmp_seq=1 ttl=64 time=0.476 ms (DUP!)
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=2 ttl=64 time=0.043 ms       

Now the freaky part: the multicast ping to ff02::9 (and thus the
RIPng announcements) starts to work again when I restart the miredo
daemon. This is rather unexpected, because miredo does not deal with
this address (or multicast) at all.

miredo:~# /etc/init.d/miredo stop                                           
Stopping Teredo IPv6 tunneling daemon: miredo.                              
miredo:~# /etc/init.d/miredo start
Starting Teredo IPv6 tunneling daemon: miredo.
miredo:~# ping6 -c 2 -I eth0 ff02::9
PING ff02::9(ff02::9) from fe80::216:3eff:feb9:29f5 eth0: 56 data bytes
64 bytes from fe80::216:3eff:feb9:29f5: icmp_seq=1 ttl=64 time=0.044 ms
64 bytes from fe80::2c0:9fff:fe4b:8ccf: icmp_seq=1 ttl=64 time=0.441 ms (DUP!)
64 bytes from fe80::20c:86ff:fe9a:3819: icmp_seq=1 ttl=64 time=0.458 ms (DUP!)
64 bytes from fe80::20c:86ff:fe9a:2819: icmp_seq=1 ttl=64 time=0.466 ms (DUP!)

Someone else running a miredo relay on Linux has reported the same
problem, only using ospf6d instead of ripngd:

2008/08/31 17:25:54 OSPF6: sendmsg failed: ifindex: 2: Invalid argument (22)

OSPFv3 uses link-local multicast as well (its own protocol on ff02::5).
Restarting the miredo daemon fixed the problem for him as well.

Regards,
Bernhard


Thread overview: 25+ messages
2008-12-28  4:47 [IPv6] "sendmsg: invalid argument" to multicast group after some time Eduard Guzovsky
  -- strict thread matches above, loose matches on Subject: below --
2008-12-30  7:52 David Miller
2008-12-31 19:53 ` Eduard Guzovsky
2009-01-04 23:56   ` David Miller
2008-08-31 18:20 Bernhard Schmidt
2008-09-01  5:49 ` David Stevens
2008-09-01  9:09   ` Bernhard Schmidt
2008-09-01 13:03 ` David Stevens
2008-09-01 17:01   ` Bernhard Schmidt
2008-09-01 17:05     ` Bernhard Schmidt
2008-09-01 17:57     ` Pekka Savola
2008-09-01 18:03       ` Bernhard Schmidt
2008-09-02  9:06         ` Pekka Savola
2008-09-02 13:57     ` Brian Haley
2008-09-02 15:00       ` Bernhard Schmidt
2008-09-02 15:48         ` Brian Haley
2008-09-09  0:34         ` David Stevens
2008-09-09  0:38           ` Bernhard Schmidt
2008-09-09  2:26             ` David Stevens
2008-09-09  6:52             ` Rémi Denis-Courmont
2008-09-09  7:17               ` David Stevens
2008-09-09 10:06                 ` Bernhard Schmidt
2008-09-09 15:05                   ` David Stevens
2008-09-09 17:16             ` Pekka Savola
2008-09-09 20:13               ` David Miller
