Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 3/10] Fix leaking of kernel heap addresses in net/
From: Thomas Graf @ 2010-11-12  1:20 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	Alexey Dobriyan, Jiri Pirko, Johannes Berg, Daniel Lezcano,
	Pavel 
In-Reply-To: <1289524023.5167.67.camel@dan>

On Thu, Nov 11, 2010 at 08:07:03PM -0500, Dan Rosenberg wrote:
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 1f85ef2..0ac8ff2 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -948,13 +948,26 @@ static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i)
>  	__u16 destp = 0,
>  	      srcp  = inet->inet_num;
>  
> -	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
> -		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
> -		i, src, srcp, dest, destp, sp->sk_state,
> -		sk_wmem_alloc_get(sp),
> -		sk_rmem_alloc_get(sp),
> -		0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
> -		atomic_read(&sp->sk_refcnt), sp, atomic_read(&sp->sk_drops));
> +	/* Only expose kernel addresses to privileged readers */
> +	if (capable(CAP_NET_ADMIN))
> +		seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
> +			" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
> +			i, src, srcp, dest, destp, sp->sk_state,
> +			sk_wmem_alloc_get(sp),
> +			sk_rmem_alloc_get(sp),
> +			0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
> +			atomic_read(&sp->sk_refcnt),
> +			sp, atomic_read(&sp->sk_drops));
> +	else
> +		seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
> +			" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %d %d\n",
> +			i, src, srcp, dest, destp, sp->sk_state,
> +			sk_wmem_alloc_get(sp),
> +			sk_rmem_alloc_get(sp),
> +			0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
> +			atomic_read(&sp->sk_refcnt),
> +			0, atomic_read(&sp->sk_drops));

If we really have to do this. At least don't duplicate all this code. Do
the check in the printf argument:

	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
		...
		capable(CAP_NET_ADMIN) ? sp : 0,

I would even move the decision whether to expose kernel addresses or not
to a function so we can change behavior in one place.

^ permalink raw reply

* Re: [PATCH 2/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12  1:22 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Jamal Hadi Salim, Eric W. Biederman, Alexey Dobriyan, Jiri Pirko,
	Johannes Berg, Daniel Lezcano, Pavel Emelyanov, socketcan-core
In-Reply-To: <20101111171754.0198e151@nehalam>


> 
> Printing different data based on security state seems like an ABI
> nightmare.
> 

I can't remove the data entirely, because that would seriously break the
ABI.  I deliberately kept the same format so as not to break any
userspace programs relying on consistent output - are there really
programs that would break when they read a 0 instead of an address?

-Dan


^ permalink raw reply

* Re: [PATCH 3/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12  1:24 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	Alexey Dobriyan, Jiri Pirko, Johannes Berg, Daniel Lezcano,
	Pavel 
In-Reply-To: <20101112012039.GB4683@canuck.infradead.org>


> 
> If we really have to do this. At least don't duplicate all this code. Do
> the check in the printf argument:
> 
> 	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
> 		...
> 		capable(CAP_NET_ADMIN) ? sp : 0,
> 
> I would even move the decision whether to expose kernel addresses or not
> to a function so we can change behavior in one place.

I wrote it this way because the format specifier must also be changed,
or the %p output will print "(null)", which cannot be parsed by
userspace programs expecting "(nil)" or 0.  I could include another
check inside the format specifier, but that seemed pretty ugly.  But
then again, it's ugly either way.

-Dan


^ permalink raw reply

* Re: [PATCH 3/10] Fix leaking of kernel heap addresses in net/
From: Thomas Graf @ 2010-11-12  1:44 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	Alexey Dobriyan, Jiri Pirko, Johannes Berg, Daniel Lezcano,
	Pavel 
In-Reply-To: <1289525049.5167.82.camel@dan>

On Thu, Nov 11, 2010 at 08:24:09PM -0500, Dan Rosenberg wrote:
> > If we really have to do this. At least don't duplicate all this code. Do
> > the check in the printf argument:
> > 
> > 	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
> > 		...
> > 		capable(CAP_NET_ADMIN) ? sp : 0,
> > 
> > I would even move the decision whether to expose kernel addresses or not
> > to a function so we can change behavior in one place.
> 
> I wrote it this way because the format specifier must also be changed,
> or the %p output will print "(null)", which cannot be parsed by
> userspace programs expecting "(nil)" or 0.  I could include another
> check inside the format specifier, but that seemed pretty ugly.  But
> then again, it's ugly either way.

Considering the amount of duplication you are about to do, you may
want to think about adding a new pointer format extension then. We
already have special '%p' modes for IPv6 addresse, MAC addresses and
various other pointer types. It wouldn't be hard to add one which
does not print (null).

^ permalink raw reply

* [PATCH] rtnetlink: Fix message size calculation for link messages
From: Thomas Graf @ 2010-11-12  1:47 UTC (permalink / raw)
  To: davem; +Cc: Patrick McHardy, netdev

nlmsg_total_size() calculates the length of a netlink message
including header and alignment. nla_total_size() calculates the
space an individual attribute consumes which was meant to be used
in this context.

Also, ensure to account for the attribute header for the
IFLA_INFO_XSTATS attribute as implementations of get_xstats_size()
seem to assume that we do so.

The addition of two message headers minus the missing attribute
header resulted in a calculated message size that was larger than
required. Therefore we never risked running out of skb tailroom.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
Cc: Patrick McHardy <kaber@trash.net>

Index: net-2.6/net/core/rtnetlink.c
===================================================================
--- net-2.6.orig/net/core/rtnetlink.c
+++ net-2.6/net/core/rtnetlink.c
@@ -347,16 +347,17 @@ static size_t rtnl_link_get_size(const s
 	if (!ops)
 		return 0;

-	size = nlmsg_total_size(sizeof(struct nlattr)) + /* IFLA_LINKINFO */
-	       nlmsg_total_size(strlen(ops->kind) + 1);	 /* IFLA_INFO_KIND */
+	size = nla_total_size(sizeof(struct nlattr)) + /* IFLA_LINKINFO */
+	       nla_total_size(strlen(ops->kind) + 1);  /* IFLA_INFO_KIND */

 	if (ops->get_size)
 		/* IFLA_INFO_DATA + nested data */
-		size += nlmsg_total_size(sizeof(struct nlattr)) +
+		size += nla_total_size(sizeof(struct nlattr)) +
 			ops->get_size(dev);

 	if (ops->get_xstats_size)
-		size += ops->get_xstats_size(dev);	/* IFLA_INFO_XSTATS */
+		/* IFLA_INFO_XSTATS */
+		size += nla_total_size(ops->get_xstats_size(dev));

 	return size;
 }

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: David Miller @ 2010-11-12  2:02 UTC (permalink / raw)
  To: drosenberg-PiUznwcHFHrqlBn2x/YWAg
  Cc: urs.thuermann-l29pVbxQd1IUtdQbppsyvg,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, xemul-GEFAQzZX7r8dnm+yROfE0A,
	pekkas-UjJjq++bwZ7HOG6cAo2yLw,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	jmorris-gx6/JNMH7DfYtjvyW6yDsg, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
	adobriyan-Re5JQEeQqe8AvxtiuMwx3w, sri-r/Jw6+rmf7HQT0dZR+AlfA,
	johannes.berg-ral2JQCrhuEAvxtiuMwx3w, hadi-jkUAjuhPggJWk0Htik3J/w,
	vladislav.yasevich-VXdhtT5mjnY, lizf-BthXqXjhjHXQFUHtdCDX3A,
	tj-DgEjT+Ai2ygdnm+yROfE0A,
	remi.denis-courmont-xNZwKgViW5gAvxtiuMwx3w,
	daniel.lezcano-GANU6spQydw, jpirko-H+wXaHxf7aLQT0dZR+AlfA,
	yoshfuji-VfPWfsRibaP+Ru+s062T9g, socketcan-fJ+pQTUTwRTk1uMJSBkQmQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	joe-6d6DIl74uiNBDgjK7y7TUQ, kaber-dcUjhNyLwpNeoWH0uzbU5w
In-Reply-To: <1289524801.5167.76.camel@dan>

From: Dan Rosenberg <drosenberg-PiUznwcHFHrqlBn2x/YWAg@public.gmane.org>
Date: Thu, 11 Nov 2010 20:20:01 -0500

> If not, then you have committed to leaving a permanent easy target for
> exploits.

In your opinion.

^ permalink raw reply

* Re: [PATCH] ipv4: Make rt->fl.iif tests lest obscure.
From: David Miller @ 2010-11-12  2:07 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20101111.163617.15250206.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 11 Nov 2010 16:36:17 -0800 (PST)

> From: David Miller <davem@davemloft.net>
> Date: Thu, 11 Nov 2010 16:28:14 -0800 (PST)
> 
>> diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
>> index 94a9eb1..9aad1c0 100644
>> --- a/net/decnet/dn_route.c
>> +++ b/net/decnet/dn_route.c
>> @@ -1181,7 +1181,7 @@ static int __dn_route_output_key(struct dst_entry **pprt, const struct flowi *fl
> 
> Anyone looking at this closely will notice that I need to redo these
> decnet parts.
> 
> Updated patch coming up.

Ok, this is a lot better:

--------------------
ipv4: Make rt->fl.iif tests lest obscure.

When we test rt->fl.iif against zero, we're seeing if it's
an output or an input route.

Make that explicit with some helper functions.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/dn_route.h          |   10 ++++++++++
 include/net/route.h             |   10 ++++++++++
 net/decnet/dn_route.c           |    4 ++--
 net/ipv4/icmp.c                 |    4 ++--
 net/ipv4/igmp.c                 |    2 +-
 net/ipv4/ip_gre.c               |    2 +-
 net/ipv4/ipmr.c                 |    2 +-
 net/ipv4/route.c                |   20 ++++++++++----------
 net/netfilter/ipvs/ip_vs_xmit.c |    8 +++++---
 9 files changed, 42 insertions(+), 20 deletions(-)

diff --git a/include/net/dn_route.h b/include/net/dn_route.h
index ccadab3..9b185df 100644
--- a/include/net/dn_route.h
+++ b/include/net/dn_route.h
@@ -80,6 +80,16 @@ struct dn_route {
 	unsigned rt_type;
 };
 
+static inline bool dn_is_input_route(struct dn_route *rt)
+{
+	return rt->fl.iif != 0;
+}
+
+static inline bool dn_is_output_route(struct dn_route *rt)
+{
+	return rt->fl.iif == 0;
+}
+
 extern void dn_route_init(void);
 extern void dn_route_cleanup(void);
 
diff --git a/include/net/route.h b/include/net/route.h
index cea533e..5cd46d1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -71,6 +71,16 @@ struct rtable {
 	struct inet_peer	*peer; /* long-living peer info */
 };
 
+static inline bool rt_is_input_route(struct rtable *rt)
+{
+	return rt->fl.iif != 0;
+}
+
+static inline bool rt_is_output_route(struct rtable *rt)
+{
+	return rt->fl.iif == 0;
+}
+
 struct ip_rt_acct {
 	__u32 	o_bytes;
 	__u32 	o_packets;
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 94a9eb1..474d54d 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1181,7 +1181,7 @@ static int __dn_route_output_key(struct dst_entry **pprt, const struct flowi *fl
 			if ((flp->fld_dst == rt->fl.fld_dst) &&
 			    (flp->fld_src == rt->fl.fld_src) &&
 			    (flp->mark == rt->fl.mark) &&
-			    (rt->fl.iif == 0) &&
+			    dn_is_output_route(rt) &&
 			    (rt->fl.oif == flp->oif)) {
 				dst_use(&rt->dst, jiffies);
 				rcu_read_unlock_bh();
@@ -1512,7 +1512,7 @@ static int dn_rt_fill_info(struct sk_buff *skb, u32 pid, u32 seq,
 	if (rtnl_put_cacheinfo(skb, &rt->dst, 0, 0, 0, expires,
 			       rt->dst.error) < 0)
 		goto rtattr_failure;
-	if (rt->fl.iif)
+	if (dn_is_input_route(rt))
 		RTA_PUT(skb, RTA_IIF, sizeof(int), &rt->fl.iif);
 
 	nlh->nlmsg_len = skb_tail_pointer(skb) - b;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 96bc7f9..c6e2aff 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -506,8 +506,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 		struct net_device *dev = NULL;
 
 		rcu_read_lock();
-		if (rt->fl.iif &&
-			net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr)
+		if (rt_is_input_route(rt) &&
+		    net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr)
 			dev = dev_get_by_index_rcu(net, rt->fl.iif);
 
 		if (dev)
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index c8877c6..08d0d81 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -961,7 +961,7 @@ int igmp_rcv(struct sk_buff *skb)
 	case IGMP_HOST_MEMBERSHIP_REPORT:
 	case IGMPV2_HOST_MEMBERSHIP_REPORT:
 		/* Is it our report looped back? */
-		if (skb_rtable(skb)->fl.iif == 0)
+		if (rt_is_output_route(skb_rtable(skb)))
 			break;
 		/* don't rely on MC router hearing unicast reports */
 		if (skb->pkt_type == PACKET_MULTICAST ||
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 70ff77f..cab2057 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -634,7 +634,7 @@ static int ipgre_rcv(struct sk_buff *skb)
 #ifdef CONFIG_NET_IPGRE_BROADCAST
 		if (ipv4_is_multicast(iph->daddr)) {
 			/* Looped back packet, drop it! */
-			if (skb_rtable(skb)->fl.iif == 0)
+			if (rt_is_output_route(skb_rtable(skb)))
 				goto drop;
 			tunnel->dev->stats.multicast++;
 			skb->pkt_type = PACKET_BROADCAST;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 86dd569..ef2b008 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1654,7 +1654,7 @@ static int ip_mr_forward(struct net *net, struct mr_table *mrt,
 	if (mrt->vif_table[vif].dev != skb->dev) {
 		int true_vifi;
 
-		if (skb_rtable(skb)->fl.iif == 0) {
+		if (rt_is_output_route(skb_rtable(skb))) {
 			/* It is our own packet, looped back.
 			 * Very complicated situation...
 			 *
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5955965..66610ea 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -623,7 +623,7 @@ static inline int rt_fast_clean(struct rtable *rth)
 	/* Kill broadcast/multicast entries very aggresively, if they
 	   collide in hash table with more useful entries */
 	return (rth->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) &&
-		rth->fl.iif && rth->dst.rt_next;
+		rt_is_input_route(rth) && rth->dst.rt_next;
 }
 
 static inline int rt_valuable(struct rtable *rth)
@@ -668,7 +668,7 @@ static inline u32 rt_score(struct rtable *rt)
 	if (rt_valuable(rt))
 		score |= (1<<31);
 
-	if (!rt->fl.iif ||
+	if (rt_is_output_route(rt) ||
 	    !(rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST|RTCF_LOCAL)))
 		score |= (1<<30);
 
@@ -1126,7 +1126,7 @@ restart:
 		 */
 
 		rt->dst.flags |= DST_NOCACHE;
-		if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
+		if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
 			int err = arp_bind_neighbour(&rt->dst);
 			if (err) {
 				if (net_ratelimit())
@@ -1224,7 +1224,7 @@ restart:
 	/* Try to bind route to arp only if it is output
 	   route or unicast forwarding path.
 	 */
-	if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
+	if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
 		int err = arp_bind_neighbour(&rt->dst);
 		if (err) {
 			spin_unlock_bh(rt_hash_lock_addr(hash));
@@ -1406,7 +1406,7 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 				if (rth->fl.fl4_dst != daddr ||
 				    rth->fl.fl4_src != skeys[i] ||
 				    rth->fl.oif != ikeys[k] ||
-				    rth->fl.iif != 0 ||
+				    rt_is_input_route(rth) ||
 				    rt_is_expired(rth) ||
 				    !net_eq(dev_net(rth->dst.dev), net)) {
 					rthp = &rth->dst.rt_next;
@@ -1666,7 +1666,7 @@ unsigned short ip_rt_frag_needed(struct net *net, struct iphdr *iph,
 				    rth->rt_dst != daddr ||
 				    rth->rt_src != iph->saddr ||
 				    rth->fl.oif != ikeys[k] ||
-				    rth->fl.iif != 0 ||
+				    rt_is_input_route(rth) ||
 				    dst_metric_locked(&rth->dst, RTAX_MTU) ||
 				    !net_eq(dev_net(rth->dst.dev), net) ||
 				    rt_is_expired(rth))
@@ -1770,7 +1770,7 @@ void ip_rt_get_source(u8 *addr, struct rtable *rt)
 	__be32 src;
 	struct fib_result res;
 
-	if (rt->fl.iif == 0)
+	if (rt_is_output_route(rt))
 		src = rt->rt_src;
 	else {
 		rcu_read_lock();
@@ -2669,7 +2669,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
 		rth = rcu_dereference_bh(rth->dst.rt_next)) {
 		if (rth->fl.fl4_dst == flp->fl4_dst &&
 		    rth->fl.fl4_src == flp->fl4_src &&
-		    rth->fl.iif == 0 &&
+		    rt_is_output_route(rth) &&
 		    rth->fl.oif == flp->oif &&
 		    rth->fl.mark == flp->mark &&
 		    !((rth->fl.fl4_tos ^ flp->fl4_tos) &
@@ -2824,7 +2824,7 @@ static int rt_fill_info(struct net *net,
 	if (rt->dst.tclassid)
 		NLA_PUT_U32(skb, RTA_FLOW, rt->dst.tclassid);
 #endif
-	if (rt->fl.iif)
+	if (rt_is_input_route(rt))
 		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_spec_dst);
 	else if (rt->rt_src != rt->fl.fl4_src)
 		NLA_PUT_BE32(skb, RTA_PREFSRC, rt->rt_src);
@@ -2849,7 +2849,7 @@ static int rt_fill_info(struct net *net,
 		}
 	}
 
-	if (rt->fl.iif) {
+	if (rt_is_input_route(rt)) {
 #ifdef CONFIG_IP_MROUTE
 		__be32 dst = rt->rt_dst;
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index de04ea3..10bd39c 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -169,7 +169,7 @@ __ip_vs_reroute_locally(struct sk_buff *skb)
 	struct net *net = dev_net(dev);
 	struct iphdr *iph = ip_hdr(skb);
 
-	if (rt->fl.iif) {
+	if (rt_is_input_route(rt)) {
 		unsigned long orefdst = skb->_skb_refdst;
 
 		if (ip_route_input(skb, iph->daddr, iph->saddr,
@@ -552,7 +552,8 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 #endif
 
 	/* From world but DNAT to loopback address? */
-	if (local && ipv4_is_loopback(rt->rt_dst) && skb_rtable(skb)->fl.iif) {
+	if (local && ipv4_is_loopback(rt->rt_dst) &&
+	    rt_is_input_route(skb_rtable(skb))) {
 		IP_VS_DBG_RL_PKT(1, AF_INET, pp, skb, 0, "ip_vs_nat_xmit(): "
 				 "stopping DNAT to loopback address");
 		goto tx_error_put;
@@ -1165,7 +1166,8 @@ ip_vs_icmp_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 #endif
 
 	/* From world but DNAT to loopback address? */
-	if (local && ipv4_is_loopback(rt->rt_dst) && skb_rtable(skb)->fl.iif) {
+	if (local && ipv4_is_loopback(rt->rt_dst) &&
+	    rt_is_input_route(skb_rtable(skb))) {
 		IP_VS_DBG(1, "%s(): "
 			  "stopping DNAT to loopback %pI4\n",
 			  __func__, &cp->daddr.ip);
-- 
1.7.3.2


^ permalink raw reply related

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12  2:15 UTC (permalink / raw)
  To: David Miller
  Cc: socketcan, kuznet, urs.thuermann, yoshfuji, kaber, jmorris,
	remi.denis-courmont, pekkas, sri, vladislav.yasevich, tj,
	eric.dumazet, lizf, joe, shemminger, hadi, ebiederm, adobriyan,
	jpirko, johannes.berg, daniel.lezcano, xemul, socketcan-core,
	netdev, linux-sctp, Linus Torvalds


I am willing to do whatever is necessary to make these changes acceptable.  Are you completely set on not removing these addresses from the output?

-Dan

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: David Miller @ 2010-11-12  2:29 UTC (permalink / raw)
  To: drosenberg-PiUznwcHFHrqlBn2x/YWAg
  Cc: urs.thuermann-l29pVbxQd1IUtdQbppsyvg,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, xemul-GEFAQzZX7r8dnm+yROfE0A,
	pekkas-UjJjq++bwZ7HOG6cAo2yLw,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	jmorris-gx6/JNMH7DfYtjvyW6yDsg, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
	adobriyan-Re5JQEeQqe8AvxtiuMwx3w, sri-r/Jw6+rmf7HQT0dZR+AlfA,
	johannes.berg-ral2JQCrhuEAvxtiuMwx3w, hadi-jkUAjuhPggJWk0Htik3J/w,
	vladislav.yasevich-VXdhtT5mjnY, lizf-BthXqXjhjHXQFUHtdCDX3A,
	tj-DgEjT+Ai2ygdnm+yROfE0A,
	remi.denis-courmont-xNZwKgViW5gAvxtiuMwx3w,
	daniel.lezcano-GANU6spQydw, jpirko-H+wXaHxf7aLQT0dZR+AlfA,
	yoshfuji-VfPWfsRibaP+Ru+s062T9g, socketcan-fJ+pQTUTwRTk1uMJSBkQmQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	joe-6d6DIl74uiNBDgjK7y7TUQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kaber-dcUjhNyLwpNeoWH0uzbU5w
In-Reply-To: <2129857903-1289528127-cardhu_decombobulator_blackberry.rim.net-1506931048--JnVBb1XAImjjL2gL5RxOEzYg3SYOavFBmZ6FRVpaDsI@public.gmane.org>

From: "Dan Rosenberg" <drosenberg-PiUznwcHFHrqlBn2x/YWAg@public.gmane.org>
Date: Fri, 12 Nov 2010 02:15:25 +0000

> I am willing to do whatever is necessary to make these changes
> acceptable.  Are you completely set on not removing these addresses
> from the output?

I want whatever you replace it with to be equivalent for
object tracking purposes.

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12  2:34 UTC (permalink / raw)
  To: David Miller
  Cc: socketcan, kuznet, urs.thuermann, yoshfuji, kaber, jmorris,
	remi.denis-courmont, pekkas, sri, vladislav.yasevich, tj,
	eric.dumazet, lizf, joe, shemminger, hadi, ebiederm, adobriyan,
	jpirko, johannes.berg, daniel.lezcano, xemul, socketcan-core,
	netdev, linux-sctp, torvalds
In-Reply-To: <20101111.182939.258124014.davem@davemloft.net>


> I want whatever you replace it with to be equivalent for
> object tracking purposes.

In nearly all of the cases I fixed, the socket inode is already
provided, which serves as a perfectly good unique identifier.  Would you
prefer I include that information twice?

-Dan


^ permalink raw reply

* Re: can-bcm: fix minor heap overflow
From: Simon Horman @ 2010-11-12  2:39 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: David Miller, Linux Netdev List, Dan Rosenberg, Linus Torvalds,
	Urs Thuermann, security
In-Reply-To: <4CDB1856.4040001@hartkopp.net>

On Wed, Nov 10, 2010 at 11:10:30PM +0100, Oliver Hartkopp wrote:
> On 64-bit platforms the ASCII representation of a pointer may be up to 17
> bytes long. This patch increases the length of the buffer accordingly.
> 
> http://marc.info/?l=linux-netdev&m=128872251418192&w=2
> 
> Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> 
> ---
> 
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index 08ffe9e..6faa825 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
> @@ -125,7 +125,7 @@ struct bcm_sock {
>  	struct list_head tx_ops;
>  	unsigned long dropped_usr_msgs;
>  	struct proc_dir_entry *bcm_proc_read;
> -	char procname [9]; /* pointer printed in ASCII with \0 */
> +	char procname [20]; /* pointer printed in ASCII with \0 */
>  };

If the string may be up to 17 bytes long why are you allocating 20?

>  static inline struct bcm_sock *bcm_sk(const struct sock *sk)

^ permalink raw reply

* Re: can-bcm: fix minor heap overflow
From: Dan Rosenberg @ 2010-11-12  2:43 UTC (permalink / raw)
  To: Simon Horman
  Cc: Oliver Hartkopp, David Miller, Linux Netdev List, Linus Torvalds,
	Urs Thuermann, security
In-Reply-To: <20101112023950.GA8145@verge.net.au>


> 
> If the string may be up to 17 bytes long why are you allocating 20?
> 

In Oliver's defense, this doesn't matter even a little bit.  The
structure will be allocated with kmalloc-1024 either way.

-Dan


^ permalink raw reply

* Re: can-bcm: fix minor heap overflow
From: Simon Horman @ 2010-11-12  2:48 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: Oliver Hartkopp, David Miller, Linux Netdev List, Linus Torvalds,
	Urs Thuermann, security
In-Reply-To: <1289529838.3090.209.camel@Dan>

On Thu, Nov 11, 2010 at 09:43:58PM -0500, Dan Rosenberg wrote:
> 
> > 
> > If the string may be up to 17 bytes long why are you allocating 20?
> > 
> 
> In Oliver's defense, this doesn't matter even a little bit.  The
> structure will be allocated with kmalloc-1024 either way.

I agree that its very unlikely to make any difference.
I am just curious.

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: David Miller @ 2010-11-12  2:49 UTC (permalink / raw)
  To: drosenberg-PiUznwcHFHrqlBn2x/YWAg
  Cc: urs.thuermann-l29pVbxQd1IUtdQbppsyvg,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, xemul-GEFAQzZX7r8dnm+yROfE0A,
	pekkas-UjJjq++bwZ7HOG6cAo2yLw,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	jmorris-gx6/JNMH7DfYtjvyW6yDsg, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
	adobriyan-Re5JQEeQqe8AvxtiuMwx3w, sri-r/Jw6+rmf7HQT0dZR+AlfA,
	johannes.berg-ral2JQCrhuEAvxtiuMwx3w, hadi-jkUAjuhPggJWk0Htik3J/w,
	vladislav.yasevich-VXdhtT5mjnY, lizf-BthXqXjhjHXQFUHtdCDX3A,
	tj-DgEjT+Ai2ygdnm+yROfE0A,
	remi.denis-courmont-xNZwKgViW5gAvxtiuMwx3w,
	daniel.lezcano-GANU6spQydw, jpirko-H+wXaHxf7aLQT0dZR+AlfA,
	yoshfuji-VfPWfsRibaP+Ru+s062T9g, socketcan-fJ+pQTUTwRTk1uMJSBkQmQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	joe-6d6DIl74uiNBDgjK7y7TUQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kaber-dcUjhNyLwpNeoWH0uzbU5w
In-Reply-To: <1289529269.3090.207.camel@Dan>

From: Dan Rosenberg <drosenberg-PiUznwcHFHrqlBn2x/YWAg@public.gmane.org>
Date: Thu, 11 Nov 2010 21:34:29 -0500

> 
>> I want whatever you replace it with to be equivalent for
>> object tracking purposes.
> 
> In nearly all of the cases I fixed, the socket inode is already
> provided, which serves as a perfectly good unique identifier.  Would you
> prefer I include that information twice?

The problem is that the socket inode is not available in a certain
subclass of cases, so the transformation is not equivalent.

Why not attack this at the heart of where your concern is, and hack
the %p format handling to do whatever it is you like instead of
patching code all over the tree?

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12  2:51 UTC (permalink / raw)
  To: David Miller
  Cc: socketcan, kuznet, urs.thuermann, yoshfuji, kaber, jmorris,
	remi.denis-courmont, pekkas, sri, vladislav.yasevich, tj,
	eric.dumazet, lizf, joe, shemminger, hadi, ebiederm, adobriyan,
	jpirko, johannes.berg, daniel.lezcano, xemul, socketcan-core,
	netdev, linux-sctp, torvalds
In-Reply-To: <20101111.184902.233699247.davem@davemloft.net>


> > 
> >> I want whatever you replace it with to be equivalent for
> >> object tracking purposes.
> > 
> > In nearly all of the cases I fixed, the socket inode is already
> > provided, which serves as a perfectly good unique identifier.  Would you
> > prefer I include that information twice?
> 
> The problem is that the socket inode is not available in a certain
> subclass of cases, so the transformation is not equivalent.
> 
> Why not attack this at the heart of where your concern is, and hack
> the %p format handling to do whatever it is you like instead of
> patching code all over the tree?

This has already been suggested, and I agree it is a much better
approach.  If I take this approach, and find some suitable substitute
for those cases where the socket inode is not available, will you
consider these changes?

-Dan


^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: David Miller @ 2010-11-12  2:59 UTC (permalink / raw)
  To: drosenberg-PiUznwcHFHrqlBn2x/YWAg
  Cc: urs.thuermann-l29pVbxQd1IUtdQbppsyvg,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, xemul-GEFAQzZX7r8dnm+yROfE0A,
	pekkas-UjJjq++bwZ7HOG6cAo2yLw,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	jmorris-gx6/JNMH7DfYtjvyW6yDsg, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
	adobriyan-Re5JQEeQqe8AvxtiuMwx3w, sri-r/Jw6+rmf7HQT0dZR+AlfA,
	johannes.berg-ral2JQCrhuEAvxtiuMwx3w, hadi-jkUAjuhPggJWk0Htik3J/w,
	vladislav.yasevich-VXdhtT5mjnY, lizf-BthXqXjhjHXQFUHtdCDX3A,
	tj-DgEjT+Ai2ygdnm+yROfE0A,
	remi.denis-courmont-xNZwKgViW5gAvxtiuMwx3w,
	daniel.lezcano-GANU6spQydw, jpirko-H+wXaHxf7aLQT0dZR+AlfA,
	yoshfuji-VfPWfsRibaP+Ru+s062T9g, socketcan-fJ+pQTUTwRTk1uMJSBkQmQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	joe-6d6DIl74uiNBDgjK7y7TUQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kaber-dcUjhNyLwpNeoWH0uzbU5w
In-Reply-To: <1289530264.3090.212.camel@Dan>

From: Dan Rosenberg <drosenberg-PiUznwcHFHrqlBn2x/YWAg@public.gmane.org>
Date: Thu, 11 Nov 2010 21:51:04 -0500

> This has already been suggested, and I agree it is a much better
> approach.  If I take this approach, and find some suitable substitute
> for those cases where the socket inode is not available, will you
> consider these changes?

I will consider an approach where the keys reported allow object
tracking equally as the actual pointers allow for right now.

I've said that this is my criteria about 3 times now.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Cypher Wu @ 2010-11-12  3:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1289489007.17691.1310.camel@edumazet-laptop>

On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>
> Hi
>
> CC netdev, since you ask questions about network stuff _and_ rwlock
>
>
>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> other platforms. It have a priority for write lock that when tried it
>> will block the following read lock even if read lock is hold by
>> others. Its code can be read in Linux Kernel 2.6.36 in
>> arch/tile/lib/spinlock_32.c.
>
> This seems a bug to me.
>
> read_lock() can be nested. We used such a schem in the past in iptables
> (it can re-enter itself),
> and we used instead a spinlock(), but with many discussions with lkml
> and Linus himself if I remember well.
>
It seems not a problem that read_lock() can be nested or not since
rwlock doesn't have 'owner', it's just that should we give
write_lock() a priority than read_lock() since if there have a lot
read_lock()s then they'll starve write_lock().
We should work out a well defined behavior so all the
platform-dependent raw_rwlock has to design under that principle.
>
>>
>> That different could cause a deadlock in kernel if we join/leave
>> Multicast Group simultaneous and frequently on mutlicores. IGMP
>> message is sent by
>>
>> igmp_ifc_timer_expire() -> igmpv3_send_cr() -> igmpv3_sendpack()
>>
>> in timer interrupt, igmpv3_send_cr() will generate the sk_buff for
>> IGMP message with mc_list_lock read locked and then call
>> igmpv3_sendpack() with it unlocked.
>> But if we have so many join/leave messages have to generate and it
>> can't be sent in one sk_buff then igmpv3_send_cr() -> add_grec() will
>> call igmpv3_sendpack() to send it and reallocate a new buffer. When
>> the message is sent:
>>
>> __mkroute_output() -> ip_check_mc()
>>
>> will read lock mc_list_lock again. If there is another core is try
>> write lock mc_list_lock between the two read lock, then deadlock
>> ocurred.
>>
>> The rwlock on other platforms I've check, say, PowerPC, x86, ARM, is
>> just read lock shared and write_lock mutex, so if we've hold read lock
>> the write lock will just wait, and if there have a read lock again it
>> will success.
>>
>> So, What's the criteria of rwlock design in the Linux kernel? Is that
>> read lock re-hold of IGMP a design error in Linux kernel, or the read
>> lock has to be design like that?
>>
>
> Well, we try to get rid of all rwlocks in performance critical sections.
>
> I would say, if you believe one rwlock can justify the special TILE
> behavior you tried to make, then we should instead migrate this rwlock
> to a RCU + spinlock schem (so that all arches benefit from this work,
> not only TILE)

IGMP in not very performance critical so maybe rwlock is enough?

>
>> There is a other thing, that the timer interrupt will start timer on
>> the same in_dev, should that be optimized?
>>
>
> Not sure I understand what you mean.

Since mc_list & mc_tomb is shared list in the kernel we needn't to
start multiple timers on different cores for them, we can synchronize
it on one core.

>
>> BTW: If we have so many cores, say 64, is there other things we have
>> to think about spinlock? If there have collisions ocurred, should we
>> just read the shared memory again and again, or just a very little
>> 'delay' is better? I've seen relax() is called in the implementation
>> of spinlock on TILEPro platform.
>> --
>
> Is TILE using ticket spinlocks ?
>

What ticket spinlocks means? Could you explain a little, pls:) I'm not
familiar with Linux kernel, I'm trying to get more understanding of it
since I'm working on Linux platform now.

>
>

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  6:28 UTC (permalink / raw)
  To: Cypher Wu; +Cc: Eric Dumazet, linux-kernel, netdev
In-Reply-To: <AANLkTik93QUQxSDmwd4Qj-gXQiWWPzd68JPAYAHBAsHR@mail.gmail.com>

On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>
>> Is TILE using ticket spinlocks ?
>>

Eric, yes.

>
>What ticket spinlocks means? Could you explain a little, pls:) I'm not
>familiar with Linux kernel, I'm trying to get more understanding of it
>since I'm working on Linux platform now.
>

You might want to search "ticket spinlock" with google. :)

Briefly speaking, it is introduced to solve unfairness
of the old spinlock implementation. In linux kernel, not all
arch implement this. X86 implementation is done by Nick with
asm code, while tile uses C to implement it.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  7:13 UTC (permalink / raw)
  To: Cypher Wu; +Cc: Eric Dumazet, linux-kernel, netdev
In-Reply-To: <AANLkTik93QUQxSDmwd4Qj-gXQiWWPzd68JPAYAHBAsHR@mail.gmail.com>

On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>
>> Hi
>>
>> CC netdev, since you ask questions about network stuff _and_ rwlock
>>
>>
>>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>> other platforms. It have a priority for write lock that when tried it
>>> will block the following read lock even if read lock is hold by
>>> others. Its code can be read in Linux Kernel 2.6.36 in
>>> arch/tile/lib/spinlock_32.c.
>>
>> This seems a bug to me.
>>
>> read_lock() can be nested. We used such a schem in the past in iptables
>> (it can re-enter itself),
>> and we used instead a spinlock(), but with many discussions with lkml
>> and Linus himself if I remember well.
>>
>It seems not a problem that read_lock() can be nested or not since
>rwlock doesn't have 'owner', it's just that should we give
>write_lock() a priority than read_lock() since if there have a lot
>read_lock()s then they'll starve write_lock().
>We should work out a well defined behavior so all the
>platform-dependent raw_rwlock has to design under that principle.

It is a known weakness of rwlock, it is designed like that. :)

The solution is to use RCU or seqlock, but I don't think seqlock
is proper for this case you described. So, try RCU lock.

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Eric Dumazet @ 2010-11-12  7:23 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: David Miller, socketcan, kuznet, urs.thuermann, yoshfuji, kaber,
	jmorris, remi.denis-courmont, pekkas, sri, vladislav.yasevich, tj,
	lizf, joe, shemminger, hadi, ebiederm, adobriyan, jpirko,
	johannes.berg, daniel.lezcano, xemul, socketcan-core, netdev,
	linux-sctp, torvalds
In-Reply-To: <1289529269.3090.207.camel@Dan>

Le jeudi 11 novembre 2010 à 21:34 -0500, Dan Rosenberg a écrit :
> > I want whatever you replace it with to be equivalent for
> > object tracking purposes.
> 
> In nearly all of the cases I fixed, the socket inode is already
> provided, which serves as a perfectly good unique identifier.  Would you
> prefer I include that information twice?

Oh well. Please read this answer carefuly.

Some facts to feed your next patch. I am very pleased you changed your
mind and that we keep useful information in kernel log.

1) Inode numbers are not guaranteed to be unique. Its a 32bit seq
number, and we dont check another socket inode use the same inode number
(after 2^32 allocations it can happens)

2) /proc/net/ files can deliver same "line" of information several
times, because of their implementation.

3) Because of SLAB_DESTROY_BY_RCU, same 'kernel socket pointer' can be
seen several times in /proc/net/tcp & /proc/net/udp, but really on
different "sockets"

4) Some good applications use both the socket pointer and inode number
(tuple) to filter out the [2] problem. Dont break them, please ?
Anything that might break an application must be at the very least
tunable.

In my opinion, a good thing would be :

- Use a special printf() format , aka "secure pointer", as Thomas
suggested.

- Make sure you print different opaque values for two different kernel
pointers. This is mandatory.

- Make sure the NULL pointer stay as a NULL pointer to not let the
hostile user know your secret, and to ease debugging stuff.

- Have security experts advice to chose a nice crypto function, maybe
jenkin hash. Not too slow would be nice.

static unsigned long securize_kpointers_rnd;

At boot time, stick a random value in this variable.
(Maybe make sure the 5 low order bits are 0)

unsigned long opacify_kptr(unsigned long ptr)
{
	if (ptr == 0)
		return ptr;
	if (capable(CAP_NET_ADMIN))
		return ptr;

	return some_crypto_hash(ptr, &securize_kpointers_rnd);
}

At least, use a central point, so that we can easily add/change the
logic if needed.

Please provide this patch in kernel/printk.c for initial review, then
once everybody is OK, you can send one patch for net tree.

No need to send 10 patches if we dont agree on the general principle.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Eric Dumazet @ 2010-11-12  7:27 UTC (permalink / raw)
  To: Américo Wang; +Cc: Cypher Wu, linux-kernel, netdev
In-Reply-To: <20101112071323.GB5660@cr0.nay.redhat.com>

Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
> >>
> >> Hi
> >>
> >> CC netdev, since you ask questions about network stuff _and_ rwlock
> >>
> >>
> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
> >>> other platforms. It have a priority for write lock that when tried it
> >>> will block the following read lock even if read lock is hold by
> >>> others. Its code can be read in Linux Kernel 2.6.36 in
> >>> arch/tile/lib/spinlock_32.c.
> >>
> >> This seems a bug to me.
> >>
> >> read_lock() can be nested. We used such a schem in the past in iptables
> >> (it can re-enter itself),
> >> and we used instead a spinlock(), but with many discussions with lkml
> >> and Linus himself if I remember well.
> >>
> >It seems not a problem that read_lock() can be nested or not since
> >rwlock doesn't have 'owner', it's just that should we give
> >write_lock() a priority than read_lock() since if there have a lot
> >read_lock()s then they'll starve write_lock().
> >We should work out a well defined behavior so all the
> >platform-dependent raw_rwlock has to design under that principle.
> 

AFAIK, Lockdep allows read_lock() to be nested.

> It is a known weakness of rwlock, it is designed like that. :)
> 

Agreed.

> The solution is to use RCU or seqlock, but I don't think seqlock
> is proper for this case you described. So, try RCU lock.

In the IGMP case, it should be easy for the task owning a read_lock() to
pass a parameter to the called function saying 'I already own the
read_lock(), dont try to re-acquire it'

A RCU conversion is far more complex.




^ permalink raw reply

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Patrick McHardy @ 2010-11-12  7:36 UTC (permalink / raw)
  To: Hua Zhong
  Cc: 'Eric Paris', netdev, linux-kernel, davem, kuznet, pekkas,
	jmorris, yoshfuji
In-Reply-To: <00c201cb81eb$84e18160$8ea48420$@com>

On 11.11.2010 22:58, Hua Zhong wrote:
>> Yes, I realize this is little different than if the
>> SYN was dropped in the first network device, but it is different
>> because we know what happened!  We know that connect() call failed
>> and that there isn't anything coming back.
> 
> I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).

It sends an ICMP error or TCP reset. Interpretation is up to TCP.

^ permalink raw reply

* Re: [PATCH] rtnetlink: Fix message size calculation for link messages
From: Patrick McHardy @ 2010-11-12  7:43 UTC (permalink / raw)
  To: davem, netdev
In-Reply-To: <20101112014759.GA8491@canuck.infradead.org>

On 12.11.2010 02:47, Thomas Graf wrote:
> nlmsg_total_size() calculates the length of a netlink message
> including header and alignment. nla_total_size() calculates the
> space an individual attribute consumes which was meant to be used
> in this context.
> 
> Also, ensure to account for the attribute header for the
> IFLA_INFO_XSTATS attribute as implementations of get_xstats_size()
> seem to assume that we do so.
> 
> The addition of two message headers minus the missing attribute
> header resulted in a calculated message size that was larger than
> required. Therefore we never risked running out of skb tailroom.
> 
> Signed-off-by: Thomas Graf <tgraf@infradead.org>
> Cc: Patrick McHardy <kaber@trash.net>

Looks good to me, thanks Thomas.

Acked-by: Patrick McHardy <kaber@trash.net>

^ permalink raw reply

* Re: [PATCH] netfilter: ipv6: fix overlap check for fragments
From: Shan Wei @ 2010-11-12  7:47 UTC (permalink / raw)
  To: nicolas.dichtel, netdev@vger.kernel.org, netfilter-devel,
	Patrick McHardy
  Cc: David Miller
In-Reply-To: <4CD3F0F8.5030205@cn.fujitsu.com>

Hi Patrick:
 Would you apply this to your tree. The similar patch for ipv6 has been
applied by David . 
See http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=f46421416fb6b91513fb687d6503142cd99034a5

Shan Wei wrote, at 11/05/2010 07:56 PM:
> The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
> and offset is int.
> 
> Without this patch, type conversion occurred to this expression, when
> (FRAG6_CB(prev)->offset + prev->len) is less than offset.
> 
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
> ---
>  net/ipv6/netfilter/nf_conntrack_reasm.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index 3a3f129..79d43aa 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -286,7 +286,7 @@ found:
>  
>  	/* Check for overlap with preceding fragment. */
>  	if (prev &&
> -	    (NFCT_FRAG6_CB(prev)->offset + prev->len) - offset > 0)
> +	    (NFCT_FRAG6_CB(prev)->offset + prev->len) > offset)
>  		goto discard_fq;
>  
>  	/* Look for overlap with succeeding segment. */


-- 

Best Regards
-----
Shan Wei

^ permalink raw reply

* Re: [PATCH] netfilter: ipv6: fix overlap check for fragments
From: Patrick McHardy @ 2010-11-12  7:52 UTC (permalink / raw)
  To: Shan Wei
  Cc: nicolas.dichtel, David Miller, netdev@vger.kernel.org,
	netfilter-devel
In-Reply-To: <4CD3F0F8.5030205@cn.fujitsu.com>

On 05.11.2010 12:56, Shan Wei wrote:
> The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
> and offset is int.
> 
> Without this patch, type conversion occurred to this expression, when
> (FRAG6_CB(prev)->offset + prev->len) is less than offset.

Applied, thanks Shan.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox