Netdev List


* Re: [PATCH] pch_can: fix tseg1/tseg2 setting issue
From: David Miller @ 2011-02-10  0:46 UTC (permalink / raw)
  To: tomoya-linux-ECg8zkTtlr0C6LszWs/t0g
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	toshiharu-linux-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <1297298399-6250-1-git-send-email-tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>

From: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
Date: Thu, 10 Feb 2011 09:39:59 +0900

> Previous patch "[PATCH 1/3] pch_can: fix 800k comms issue" is wrong.
> I should have modified tseg1_min not tseg2_min.
> This patch reverts tseg2_min to 1 and sets tseg1_min to 2.
> 
> Signed-off-by: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] virtio-net: add schedule check to napi_enable call in refill_work
From: Rusty Russell @ 2011-02-10  1:31 UTC (permalink / raw)
  To: virtualization; +Cc: Ken Stailey, netdev, Bruce Rogers
In-Reply-To: <132885.30568.qm@web110313.mail.gq1.yahoo.com>

On Thu, 10 Feb 2011 06:59:25 am Ken Stailey wrote:
> Justification:
> 
> Impact: Under heavy network I/O load virtio-net driver crashes making VM guest unusable.

Hmm, this went badly wrong.  I acked this patch, and it was mailed to
netdev six months ago.

Bruce's patch used spaces instead of tabs, but that should not have caused
it to be dropped.  I've taken that and ported it forwards, will repost now.

Thanks for picking this up off the floor!
Rusty.

^ permalink raw reply

* [PATCH] virtio_net: Add schedule check to napi_enable call
From: Rusty Russell @ 2011-02-10  2:02 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev, David Miller, virtualization, Ken Stailey

From: "Bruce Rogers" <brogers@novell.com>

Under harsh testing conditions, including low memory, the guest would
stop receiving packets. With this patch applied we no longer see any
problems in the driver while performing these tests for extended periods
of time.

Make sure napi is scheduled subsequent to each napi_enable.

Signed-off-by: Bruce Rogers <brogers@novell.com>
Signed-off-by: Olaf Kirch <okir@suse.de>
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/virtio_net.c |   27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -446,6 +446,20 @@ static void skb_recv_done(struct virtque
 	}
 }
 
+static void virtnet_napi_enable(struct virtnet_info *vi)
+{
+	napi_enable(&vi->napi);
+
+	/* If all buffers were filled by other side before we napi_enabled, we
+	 * won't get another interrupt, so process any outstanding packets
+	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
+	 * We synchronize against interrupts via NAPI_STATE_SCHED */
+	if (napi_schedule_prep(&vi->napi)) {
+		virtqueue_disable_cb(vi->rvq);
+		__napi_schedule(&vi->napi);
+	}
+}
+
 static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi;
@@ -454,7 +468,7 @@ static void refill_work(struct work_stru
 	vi = container_of(work, struct virtnet_info, refill.work);
 	napi_disable(&vi->napi);
 	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	napi_enable(&vi->napi);
+	virtnet_napi_enable(vi);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
@@ -638,16 +652,7 @@ static int virtnet_open(struct net_devic
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	napi_enable(&vi->napi);
-
-	/* If all buffers were filled by other side before we napi_enabled, we
-	 * won't get another interrupt, so process any outstanding packets
-	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
-	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
-		__napi_schedule(&vi->napi);
-	}
+	virtnet_napi_enable(vi);
 	return 0;
 }
 

^ permalink raw reply

* Re: [PATCH] virtio-net: add schedule check to napi_enable call in refill_work
From: Bruce Rogers @ 2011-02-10  2:44 UTC (permalink / raw)
  To: virtualization, Rusty Russell; +Cc: netdev
In-Reply-To: <201102101201.19656.rusty@rustcorp.com.au>

 >>> On 2/9/2011 at 06:31 PM, Rusty Russell <rusty@rustcorp.com.au> wrote: 
> On Thu, 10 Feb 2011 06:59:25 am Ken Stailey wrote:
>> Justification:
>> 
>> Impact: Under heavy network I/O load virtio-net driver crashes making VM 
> guest unusable.
> 
> Hmm, this went badly wrong.  I acked this patch, and it was mailed to
> netdev six months ago.
> 
> Bruce's patch used spaces instead of tabs, but that should not have caused
> it to be dropped.  I've taken that and ported it forwards, will repost now.
> 
> Thanks for picking this up off the floor!
> Rusty.

Thanks for taking care of that!

Bruce

^ permalink raw reply

* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: Xiaotian Feng @ 2011-02-10  2:54 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <20110209.140558.59676278.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

On 02/10/2011 06:05 AM, David Miller wrote:
> From: David Miller<davem@davemloft.net>
> Date: Wed, 09 Feb 2011 14:03:23 -0800 (PST)
>
>> From: Xiaotian feng<dfeng@redhat.com>
>> Date: Wed,  9 Feb 2011 18:52:49 +0800
>>
>>> From: Xiaotian Feng<dfeng@redhat.com>
>>>
>>> commit a512b92 adds sysfs entry for net device group, but
>>> before this commit, tun also uses group sysfs, so after this
>>> commit checkin, kernel warns like this:
>>>      sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
>>>
>>> Since tun has used this for years, rename sysfs under tun might
>>> break existing userspace, so rename group sysfs entry for net device
>>> group is a better choice.
>>>
>>> Signed-off-by: Xiaotian Feng<dfeng@redhat.com>
>>
>> I don't think we have much choice in this matter, so I have applied
>> this patch, thanks!
>
> Wait, you didn't even build test this patch?!?!?!?!
>
> net/core/net-sysfs.c: In function ‘format_netdev_group’:
> net/core/net-sysfs.c:298: error: ‘const struct net_device’ has no member named ‘netdev_group’
> net/core/net-sysfs.c: At top level:
> net/core/net-sysfs.c:333: error: ‘show_group’ undeclared here (not in a function)
>
> "RFC" doesn't preclude you from at least build testing patches you
> post.
>
> Sigh...
>
Sorry, my bad ... v2 patch is attached. I've built it and read/written the
renamed sysfs entry, and everything works fine now. Sorry again for my
carelessness ...

Regards
Xiaotian
>
>


[-- Attachment #2: 0001-net-rename-group-sysfs-entry-to-netdev_group.patch --]
[-- Type: text/plain, Size: 1535 bytes --]

>From 35388da8821a72a71f54cb955146a881f916eb25 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Thu, 10 Feb 2011 10:48:53 +0800
Subject: [PATCH net-next v2] net: rename group sysfs entry to netdev_group

commit a512b92 adds sysfs entry for net device group, but
before this commit, tun also uses group sysfs, so after this
commit checkin, kernel warns like this:
    sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'

Since tun has used this for years, rename sysfs under tun might
break existing userspace, so rename group sysfs entry for net device
group is a better choice.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Vlad Dogaru <ddvlad@rosedu.org>
---
 net/core/net-sysfs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2e4a393..5ceb257 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -330,7 +330,7 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
-	__ATTR(group, S_IRUGO | S_IWUSR, show_group, store_group),
+	__ATTR(netdev_group, S_IRUGO | S_IWUSR, show_group, store_group),
 	{}
 };
 
-- 
1.7.1


^ permalink raw reply related

* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: David Miller @ 2011-02-10  3:16 UTC (permalink / raw)
  To: dfeng; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <4D53536B.2010505@redhat.com>

From: Xiaotian Feng <dfeng@redhat.com>
Date: Thu, 10 Feb 2011 10:54:35 +0800

> Subject: [PATCH net-next v2] net: rename group sysfs entry to netdev_group
> 
> commit a512b92 adds sysfs entry for net device group, but
> before this commit, tun also uses group sysfs, so after this
> commit checkin, kernel warns like this:
>     sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
> 
> Since tun has used this for years, rename sysfs under tun might
> break existing userspace, so rename group sysfs entry for net device
> group is a better choice.
> 
> Signed-off-by: Xiaotian Feng <dfeng@redhat.com>

Applied.

^ permalink raw reply

* [RFC PATCH 0/5] Cache PMTU/redirects in inetpeer
From: David Miller @ 2011-02-10  6:12 UTC (permalink / raw)
  To: netdev

This is what I've been working on for the past several days.

Right now if the routing cache is turned off (by setting
rt_cache_rebuild_count to "0") several things stop working.

We never make use of any PMTU or redirect information we learn
via ICMP packets.  This is because when the routing cache is
off, we can't "find" the existing cached routes that match
the ICMP because we don't add them to the hash table.

This functionality loss is also a blocker for eliminating the
routing cache entirely.

Solve this by remembering this state in the inetpeer entries.

PMTU information now self-expires.  It gets validated when
cached routes are sanity checked via dst_ops->check().  At
expiration, the original RTAX_MTU metric value is restored.
So we don't have to invalidate the entire cached route just
because its learned PMTU value has expired.
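
A minimal userspace sketch of the self-expiring PMTU idea described above,
not the kernel code. The names peer_pmtu, peer_pmtu_learn and
route_mtu_check, and the explicit `now` tick argument, are assumptions made
for illustration; the kernel works on jiffies with the time_before() family
of helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the PMTU fields kept in an inetpeer entry. */
struct peer_pmtu {
	unsigned long expires;	/* 0 means no learned PMTU is active */
	uint32_t orig;		/* route MTU before the ICMP was seen */
	uint32_t learned;	/* MTU reported via ICMP */
};

/* Wraparound-safe "a is before b", same idea as the kernel's time_before(). */
static int tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Record a newly learned PMTU that stays valid for `lifetime` ticks. */
static void peer_pmtu_learn(struct peer_pmtu *p, uint32_t route_mtu,
			    uint32_t mtu, unsigned long now,
			    unsigned long lifetime)
{
	assert(lifetime > 0);
	if (!p->expires)
		p->orig = route_mtu;	/* remember the pre-ICMP value once */
	p->learned = mtu;
	p->expires = now + lifetime;
	if (!p->expires)		/* keep 0 reserved for "not set" */
		p->expires = 1;
}

/* Return the MTU the route should use at time `now`. */
static uint32_t route_mtu_check(struct peer_pmtu *p, uint32_t route_mtu,
				unsigned long now)
{
	if (!p->expires)
		return route_mtu;
	if (tick_before(now, p->expires))
		return p->learned < route_mtu ? p->learned : route_mtu;
	/* Expired: restore the original value, the route itself survives. */
	p->expires = 0;
	return p->orig;
}
```

The point of the sketch is that expiry only touches the MTU value; nothing
in route_mtu_check() needs to drop or rebuild the route.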

Similarly, we store redirect information in inetpeer too.
Except that currently my patches don't remember the "original"
gateway the route had, so we have to kill the route off when
we get a dst_ops->negative_advice() call on a redirected route.

This is easy to fix and I might do that soon.

These patches implement the PMTU/redirect bits in ipv4 only at the
moment, but I do have ipv6 patches I'm in the process of finishing
up.  I just wanted people to see this as soon as possible so that
I can start getting feedback.

And hey if people can test this stuff out that'd be awesome!  If
you've used these changes in an environment where you did hit PMTU
and redirects, please do let me know.

^ permalink raw reply

* [RFC PATCH 1/5] inetpeer: Abstract address representation further.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


Future changes will add caching information, and some of
these new elements will be addresses.

Since the family is implicit via the ->daddr.family member,
replicating the family in every address we store is entirely
redundant.
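
To illustrate the reasoning, here is a small standalone sketch, not taken
from the patch. The names addr_base, addr_key and peer_entry, the
FAM_V4/FAM_V6 tags and the `redirect` member are hypothetical; it only shows
how an extra address stored in an entry can rely on the family already
recorded in the key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FAM_V4 2	/* hypothetical tags; the kernel uses AF_INET/AF_INET6 */
#define FAM_V6 10

/* Raw address storage with no family tag. */
struct addr_base {
	union {
		uint32_t a4;
		uint32_t a6[4];
	};
};

/* Lookup key: one family tag plus the address it describes. */
struct addr_key {
	struct addr_base addr;
	uint16_t family;
};

/* An entry may hold further addresses that share the key's family. */
struct peer_entry {
	struct addr_key daddr;
	struct addr_base redirect;	/* family implied by daddr.family */
};

/* Number of 32-bit words that are meaningful for this entry. */
static int peer_addr_words(const struct peer_entry *p)
{
	assert(p->daddr.family == FAM_V4 || p->daddr.family == FAM_V6);
	return p->daddr.family == FAM_V4 ? 1 : 4;
}

/* Compare the stored extra address with `gw`, using the key's family. */
static int peer_redirect_equal(const struct peer_entry *p,
			       const struct addr_base *gw)
{
	return memcmp(&p->redirect, gw,
		      peer_addr_words(p) * sizeof(uint32_t)) == 0;
}
```

Each additional address costs only the 16 bytes of the union, with no
per-address family field.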

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/inetpeer.h |   16 ++++++++++------
 net/ipv4/inetpeer.c    |    6 +++---
 net/ipv4/tcp_ipv4.c    |    2 +-
 net/ipv6/tcp_ipv6.c    |    2 +-
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index ead2cb2..60e2cd8 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -15,12 +15,16 @@
 #include <net/ipv6.h>
 #include <asm/atomic.h>
 
-struct inetpeer_addr {
+struct inetpeer_addr_base {
 	union {
-		__be32		a4;
-		__be32		a6[4];
+		__be32			a4;
+		__be32			a6[4];
 	};
-	__u16	family;
+};
+
+struct inetpeer_addr {
+	struct inetpeer_addr_base	addr;
+	__u16				family;
 };
 
 struct inet_peer {
@@ -67,7 +71,7 @@ static inline struct inet_peer *inet_getpeer_v4(__be32 v4daddr, int create)
 {
 	struct inetpeer_addr daddr;
 
-	daddr.a4 = v4daddr;
+	daddr.addr.a4 = v4daddr;
 	daddr.family = AF_INET;
 	return inet_getpeer(&daddr, create);
 }
@@ -76,7 +80,7 @@ static inline struct inet_peer *inet_getpeer_v6(struct in6_addr *v6daddr, int cr
 {
 	struct inetpeer_addr daddr;
 
-	ipv6_addr_copy((struct in6_addr *)daddr.a6, v6daddr);
+	ipv6_addr_copy((struct in6_addr *)daddr.addr.a6, v6daddr);
 	daddr.family = AF_INET6;
 	return inet_getpeer(&daddr, create);
 }
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 709fbb4..4346c38 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -167,9 +167,9 @@ static int addr_compare(const struct inetpeer_addr *a,
 	int i, n = (a->family == AF_INET ? 1 : 4);
 
 	for (i = 0; i < n; i++) {
-		if (a->a6[i] == b->a6[i])
+		if (a->addr.a6[i] == b->addr.a6[i])
 			continue;
-		if (a->a6[i] < b->a6[i])
+		if (a->addr.a6[i] < b->addr.a6[i])
 			return -1;
 		return 1;
 	}
@@ -510,7 +510,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
 		p->daddr = *daddr;
 		atomic_set(&p->refcnt, 1);
 		atomic_set(&p->rid, 0);
-		atomic_set(&p->ip_id_count, secure_ip_id(daddr->a4));
+		atomic_set(&p->ip_id_count, secure_ip_id(daddr->addr.a4));
 		p->tcp_ts_stamp = 0;
 		p->metrics[RTAX_LOCK-1] = INETPEER_METRICS_NEW;
 		p->rate_tokens = 0;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 02f583b..e2b9be2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1341,7 +1341,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		    tcp_death_row.sysctl_tw_recycle &&
 		    (dst = inet_csk_route_req(sk, req)) != NULL &&
 		    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
-		    peer->daddr.a4 == saddr) {
+		    peer->daddr.addr.a4 == saddr) {
 			inet_peer_refcheck(peer);
 			if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
 			    (s32)(peer->tcp_ts - req->ts_recent) >
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 20aa95e..d6954e3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1323,7 +1323,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 		    tcp_death_row.sysctl_tw_recycle &&
 		    (dst = inet6_csk_route_req(sk, req)) != NULL &&
 		    (peer = rt6_get_peer((struct rt6_info *)dst)) != NULL &&
-		    ipv6_addr_equal((struct in6_addr *)peer->daddr.a6,
+		    ipv6_addr_equal((struct in6_addr *)peer->daddr.addr.a6,
 				    &treq->rmt_addr)) {
 			inet_peer_refcheck(peer);
 			if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
-- 
1.7.4


^ permalink raw reply related

* [RFC PATCH 2/5] inetpeer: Add redirect and PMTU discovery cached info.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


Validity of the cached PMTU information is indicated by its
expiration value being non-zero, just as per dst->expires.

The scheme we will use is that we will remember the pre-ICMP value
held in the metrics or route entry, and then at expiration time
we will restore that value.

In this way PMTU expiration does not kill off the cached route as is
done currently.

Redirect information is permanent, or at least until another redirect
is received.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/inetpeer.h |   18 +++++++++++-------
 net/ipv4/inetpeer.c    |    2 ++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 60e2cd8..e6dd8da6 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -43,13 +43,17 @@ struct inet_peer {
 	 */
 	union {
 		struct {
-			atomic_t	rid;		/* Frag reception counter */
-			atomic_t	ip_id_count;	/* IP ID for the next packet */
-			__u32		tcp_ts;
-			__u32		tcp_ts_stamp;
-			u32		metrics[RTAX_MAX];
-			u32		rate_tokens;	/* rate limiting for ICMP */
-			unsigned long	rate_last;
+			atomic_t			rid;		/* Frag reception counter */
+			atomic_t			ip_id_count;	/* IP ID for the next packet */
+			__u32				tcp_ts;
+			__u32				tcp_ts_stamp;
+			u32				metrics[RTAX_MAX];
+			u32				rate_tokens;	/* rate limiting for ICMP */
+			unsigned long			rate_last;
+			unsigned long			pmtu_expires;
+			u32				pmtu_orig;
+			u32				pmtu_learned;
+			struct inetpeer_addr_base	redirect_learned;
 		};
 		struct rcu_head         rcu;
 	};
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 4346c38..48f8d45 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -515,6 +515,8 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
 		p->metrics[RTAX_LOCK-1] = INETPEER_METRICS_NEW;
 		p->rate_tokens = 0;
 		p->rate_last = 0;
+		p->pmtu_expires = 0;
+		memset(&p->redirect_learned, 0, sizeof(p->redirect_learned));
 		INIT_LIST_HEAD(&p->unused);
 
 
-- 
1.7.4


^ permalink raw reply related

* [RFC PATCH 3/5] inet: Create a mechanism for upward inetpeer propagation into routes.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev

If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation of a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID.  When we create a cached route of any
kind, we remember the generation ID at the time of attachment.  Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event
is propagated.  If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found.  Now
that we've updated the cached route's information, we update the
route's generation ID too.
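
The generation ID scheme above can be modelled in a few lines of userspace
C. This is only a sketch under stated assumptions: the single-slot
peer_table stands in for the real inetpeer lookup, and the names route_init,
peer_learn and route_check are hypothetical rather than kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct peer {
	unsigned int pmtu;
};

struct route {
	struct peer *peer;		/* NULL until a peer is attached */
	unsigned int peer_genid;	/* generation seen at last sync */
};

static atomic_uint peer_genid;		/* global generation counter */
static struct peer *peer_table;		/* stand-in for the peer lookup */

/* Attach the route to a peer if one exists and none is attached yet. */
static void route_bind_peer(struct route *rt)
{
	if (!rt->peer)
		rt->peer = peer_table;
}

/* Route creation: remember the generation at attachment time. */
static void route_init(struct route *rt)
{
	rt->peer = NULL;
	route_bind_peer(rt);
	rt->peer_genid = atomic_load(&peer_genid);
}

/* New path information force-creates a peer and bumps the generation. */
static void peer_learn(struct peer *p)
{
	assert(p != NULL);
	peer_table = p;
	atomic_fetch_add(&peer_genid, 1);
}

/* Validation hook: returns 1 if the route had to resynchronize. */
static int route_check(struct route *rt)
{
	unsigned int cur = atomic_load(&peer_genid);

	if (rt->peer_genid == cur)
		return 0;
	route_bind_peer(rt);
	rt->peer_genid = cur;
	return 1;
}
```

The common case costs a single integer comparison per check; the lookup
only happens after something has bumped the counter.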

This clears the way for implementing PMTU and redirects directly in
the inetpeer cache.  There is absolutely no need to consult cached
route information in order to maintain this information.

At this point nothing bumps the inetpeer genids, that comes in the
later changes which handle PMTUs and redirects using inetpeers.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/ip6_fib.h |    1 +
 include/net/route.h   |    1 +
 net/ipv4/route.c      |   19 ++++++++++++++++++-
 net/ipv6/route.c      |   18 ++++++++++++++++--
 4 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 708ff7c..46a6e8a 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -108,6 +108,7 @@ struct rt6_info {
 	u32				rt6i_flags;
 	struct rt6key			rt6i_src;
 	u32				rt6i_metric;
+	u32				rt6i_peer_genid;

 	struct inet6_dev		*rt6i_idev;
 	struct inet_peer		*rt6i_peer;
diff --git a/include/net/route.h b/include/net/route.h
index e586465..bf790c1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {

 	/* Miscellaneous cached information */
 	__be32			rt_spec_dst; /* RFC1122 specific destination */
+	u32			rt_peer_genid;
 	struct inet_peer	*peer; /* long-living peer info */
 	struct fib_info		*fi; /* for client ref to shared metrics */
 };
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0455af8..0979e03 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1308,6 +1308,13 @@ skip_hashing:
 	return 0;
 }

+static atomic_t __rt_peer_genid = ATOMIC_INIT(0);
+
+static u32 rt_peer_genid(void)
+{
+	return atomic_read(&__rt_peer_genid);
+}
+
 void rt_bind_peer(struct rtable *rt, int create)
 {
 	struct inet_peer *peer;
@@ -1316,6 +1323,8 @@ void rt_bind_peer(struct rtable *rt, int create)

 	if (peer && cmpxchg(&rt->peer, NULL, peer) != NULL)
 		inet_putpeer(peer);
+	else
+		rt->rt_peer_genid = rt_peer_genid();
 }

 /*
@@ -1767,8 +1776,16 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)

 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 {
-	if (rt_is_expired((struct rtable *)dst))
+	struct rtable *rt = (struct rtable *) dst;
+
+	if (rt_is_expired(rt))
 		return NULL;
+	if (rt->rt_peer_genid != rt_peer_genid()) {
+		if (!rt->peer)
+			rt_bind_peer(rt, 0);
+
+		rt->rt_peer_genid = rt_peer_genid();
+	}
 	return dst;
 }

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 12ec83d..ad8556e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -240,6 +240,13 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	}
 }

+static atomic_t __rt6_peer_genid = ATOMIC_INIT(0);
+
+static u32 rt6_peer_genid(void)
+{
+	return atomic_read(&__rt6_peer_genid);
+}
+
 void rt6_bind_peer(struct rt6_info *rt, int create)
 {
 	struct inet_peer *peer;
@@ -247,6 +254,8 @@ void rt6_bind_peer(struct rt6_info *rt, int create)
 	peer = inet_getpeer_v6(&rt->rt6i_dst.addr, create);
 	if (peer && cmpxchg(&rt->rt6i_peer, NULL, peer) != NULL)
 		inet_putpeer(peer);
+	else
+		rt->rt6i_peer_genid = rt6_peer_genid();
 }

 static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
@@ -912,9 +921,14 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)

 	rt = (struct rt6_info *) dst;

-	if (rt->rt6i_node && (rt->rt6i_node->fn_sernum == cookie))
+	if (rt->rt6i_node && (rt->rt6i_node->fn_sernum == cookie)) {
+		if (rt->rt6i_peer_genid != rt6_peer_genid()) {
+			if (!rt->rt6i_peer)
+				rt6_bind_peer(rt, 0);
+			rt->rt6i_peer_genid = rt6_peer_genid();
+		}
 		return dst;
-
+	}
 	return NULL;
 }

-- 
1.7.4

^ permalink raw reply related

* [RFC PATCH 4/5] ipv4: Cache learned PMTU information in inetpeer.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


The general idea is that if we learn new PMTU information, we
bump the peer genid.

This triggers the dst_ops->check() code to validate and if
necessary propagate the new PMTU value into the metrics.

Learned PMTU information self-expires.

This means that it is not necessary to kill a cached route
entry just because the PMTU information is too old.

As a consequence:

1) When the path appears unreachable (dst_ops->link_failure
   or dst_ops->negative_advice) we unwind the PMTU state if
   it is out of date, instead of killing the cached route.

   A redirected route will still be invalidated in these
   situations.

2) rt_check_expire(), rt_worker_func(), et al. are no longer
   necessary at all.
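
The unwinding in point 1 relies on a compare-and-exchange so that only one
caller restores the original MTU. A rough userspace sketch using C11 atomics
follows; the struct layout and the name pmtu_unwind are assumptions for
illustration, and the kernel itself uses cmpxchg() on the inetpeer field.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the per-peer PMTU state. */
struct peer_pmtu {
	_Atomic unsigned long expires;	/* 0 means nothing to unwind */
	uint32_t orig;			/* MTU to restore */
};

/* Returns 1 if this caller cleared the state and restored *mtu. */
static int pmtu_unwind(struct peer_pmtu *p, uint32_t *mtu)
{
	unsigned long seen = atomic_load(&p->expires);

	assert(mtu != NULL);
	if (!seen)
		return 0;
	/* Claim the unwind: succeeds only if nobody changed expires. */
	if (!atomic_compare_exchange_strong(&p->expires, &seen, 0))
		return 0;
	*mtu = p->orig;
	return 1;
}
```

A second caller sees `expires == 0` (or loses the exchange) and leaves the
MTU alone.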

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |  260 ++++++++++++++++++------------------------------------
 1 files changed, 86 insertions(+), 174 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0979e03..11faf14 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -131,9 +131,6 @@ static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
 static int rt_chain_length_max __read_mostly	= 20;
 
-static struct delayed_work expires_work;
-static unsigned long expires_ljiffies;
-
 /*
  *	Interface to generic destination cache.
  */
@@ -668,7 +665,7 @@ static inline int rt_fast_clean(struct rtable *rth)
 static inline int rt_valuable(struct rtable *rth)
 {
 	return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
-		rth->dst.expires;
+		(rth->peer && rth->peer->pmtu_expires);
 }
 
 static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
@@ -679,13 +676,7 @@ static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long t
 	if (atomic_read(&rth->dst.__refcnt))
 		goto out;
 
-	ret = 1;
-	if (rth->dst.expires &&
-	    time_after_eq(jiffies, rth->dst.expires))
-		goto out;
-
 	age = jiffies - rth->dst.lastuse;
-	ret = 0;
 	if ((age <= tmo1 && !rt_fast_clean(rth)) ||
 	    (age <= tmo2 && rt_valuable(rth)))
 		goto out;
@@ -829,97 +820,6 @@ static int has_noalias(const struct rtable *head, const struct rtable *rth)
 	return ONE;
 }
 
-static void rt_check_expire(void)
-{
-	static unsigned int rover;
-	unsigned int i = rover, goal;
-	struct rtable *rth;
-	struct rtable __rcu **rthp;
-	unsigned long samples = 0;
-	unsigned long sum = 0, sum2 = 0;
-	unsigned long delta;
-	u64 mult;
-
-	delta = jiffies - expires_ljiffies;
-	expires_ljiffies = jiffies;
-	mult = ((u64)delta) << rt_hash_log;
-	if (ip_rt_gc_timeout > 1)
-		do_div(mult, ip_rt_gc_timeout);
-	goal = (unsigned int)mult;
-	if (goal > rt_hash_mask)
-		goal = rt_hash_mask + 1;
-	for (; goal > 0; goal--) {
-		unsigned long tmo = ip_rt_gc_timeout;
-		unsigned long length;
-
-		i = (i + 1) & rt_hash_mask;
-		rthp = &rt_hash_table[i].chain;
-
-		if (need_resched())
-			cond_resched();
-
-		samples++;
-
-		if (rcu_dereference_raw(*rthp) == NULL)
-			continue;
-		length = 0;
-		spin_lock_bh(rt_hash_lock_addr(i));
-		while ((rth = rcu_dereference_protected(*rthp,
-					lockdep_is_held(rt_hash_lock_addr(i)))) != NULL) {
-			prefetch(rth->dst.rt_next);
-			if (rt_is_expired(rth)) {
-				*rthp = rth->dst.rt_next;
-				rt_free(rth);
-				continue;
-			}
-			if (rth->dst.expires) {
-				/* Entry is expired even if it is in use */
-				if (time_before_eq(jiffies, rth->dst.expires)) {
-nofree:
-					tmo >>= 1;
-					rthp = &rth->dst.rt_next;
-					/*
-					 * We only count entries on
-					 * a chain with equal hash inputs once
-					 * so that entries for different QOS
-					 * levels, and other non-hash input
-					 * attributes don't unfairly skew
-					 * the length computation
-					 */
-					length += has_noalias(rt_hash_table[i].chain, rth);
-					continue;
-				}
-			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
-				goto nofree;
-
-			/* Cleanup aged off entries. */
-			*rthp = rth->dst.rt_next;
-			rt_free(rth);
-		}
-		spin_unlock_bh(rt_hash_lock_addr(i));
-		sum += length;
-		sum2 += length*length;
-	}
-	if (samples) {
-		unsigned long avg = sum / samples;
-		unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
-		rt_chain_length_max = max_t(unsigned long,
-					ip_rt_gc_elasticity,
-					(avg + 4*sd) >> FRACT_BITS);
-	}
-	rover = i;
-}
-
-/*
- * rt_worker_func() is run in process context.
- * we call rt_check_expire() to scan part of the hash table
- */
-static void rt_worker_func(struct work_struct *work)
-{
-	rt_check_expire();
-	schedule_delayed_work(&expires_work, ip_rt_gc_interval);
-}
-
 /*
  * Pertubation of rt_genid by a small quantity [1..256]
  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
@@ -1535,9 +1435,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 		if (dst->obsolete > 0) {
 			ip_rt_put(rt);
 			ret = NULL;
-		} else if ((rt->rt_flags & RTCF_REDIRECTED) ||
-			   (rt->dst.expires &&
-			    time_after_eq(jiffies, rt->dst.expires))) {
+		} else if (rt->rt_flags & RTCF_REDIRECTED) {
 			unsigned hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
 						rt->fl.oif,
 						rt_genid(dev_net(dst->dev)));
@@ -1547,6 +1445,14 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 #endif
 			rt_del(hash, rt);
 			ret = NULL;
+		} else if (rt->peer &&
+			   rt->peer->pmtu_expires &&
+			   time_after_eq(jiffies, rt->peer->pmtu_expires)) {
+			unsigned long orig = rt->peer->pmtu_expires;
+
+			if (cmpxchg(&rt->peer->pmtu_expires, orig, 0) == orig)
+				dst_metric_set(dst, RTAX_MTU,
+					       rt->peer->pmtu_orig);
 		}
 	}
 	return ret;
@@ -1697,80 +1603,78 @@ unsigned short ip_rt_frag_needed(struct net *net, struct iphdr *iph,
 				 unsigned short new_mtu,
 				 struct net_device *dev)
 {
-	int i, k;
 	unsigned short old_mtu = ntohs(iph->tot_len);
-	struct rtable *rth;
-	int  ikeys[2] = { dev->ifindex, 0 };
-	__be32  skeys[2] = { iph->saddr, 0, };
-	__be32  daddr = iph->daddr;
 	unsigned short est_mtu = 0;
+	struct inet_peer *peer;
 
-	for (k = 0; k < 2; k++) {
-		for (i = 0; i < 2; i++) {
-			unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],
-						rt_genid(net));
-
-			rcu_read_lock();
-			for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
-			     rth = rcu_dereference(rth->dst.rt_next)) {
-				unsigned short mtu = new_mtu;
+	peer = inet_getpeer_v4(iph->daddr, 1);
+	if (peer) {
+		unsigned short mtu = new_mtu;
 
-				if (rth->fl.fl4_dst != daddr ||
-				    rth->fl.fl4_src != skeys[i] ||
-				    rth->rt_dst != daddr ||
-				    rth->rt_src != iph->saddr ||
-				    rth->fl.oif != ikeys[k] ||
-				    rt_is_input_route(rth) ||
-				    dst_metric_locked(&rth->dst, RTAX_MTU) ||
-				    !net_eq(dev_net(rth->dst.dev), net) ||
-				    rt_is_expired(rth))
-					continue;
+		if (new_mtu < 68 || new_mtu >= old_mtu) {
+			/* BSD 4.2 derived systems incorrectly adjust
+			 * tot_len by the IP header length, and report
+			 * a zero MTU in the ICMP message.
+			 */
+			if (mtu == 0 &&
+			    old_mtu >= 68 + (iph->ihl << 2))
+				old_mtu -= iph->ihl << 2;
+			mtu = guess_mtu(old_mtu);
+		}
 
-				if (new_mtu < 68 || new_mtu >= old_mtu) {
+		if (mtu < ip_rt_min_pmtu)
+			mtu = ip_rt_min_pmtu;
+		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
+			est_mtu = mtu;
+			peer->pmtu_learned = mtu;
+			peer->pmtu_expires = jiffies + ip_rt_mtu_expires;
+		}
 
-					/* BSD 4.2 compatibility hack :-( */
-					if (mtu == 0 &&
-					    old_mtu >= dst_mtu(&rth->dst) &&
-					    old_mtu >= 68 + (iph->ihl << 2))
-						old_mtu -= iph->ihl << 2;
+		inet_putpeer(peer);
 
-					mtu = guess_mtu(old_mtu);
-				}
-				if (mtu <= dst_mtu(&rth->dst)) {
-					if (mtu < dst_mtu(&rth->dst)) {
-						dst_confirm(&rth->dst);
-						if (mtu < ip_rt_min_pmtu) {
-							u32 lock = dst_metric(&rth->dst,
-									      RTAX_LOCK);
-							mtu = ip_rt_min_pmtu;
-							lock |= (1 << RTAX_MTU);
-							dst_metric_set(&rth->dst, RTAX_LOCK,
-								       lock);
-						}
-						dst_metric_set(&rth->dst, RTAX_MTU, mtu);
-						dst_set_expires(&rth->dst,
-							ip_rt_mtu_expires);
-					}
-					est_mtu = mtu;
-				}
-			}
-			rcu_read_unlock();
-		}
+		atomic_inc(&__rt_peer_genid);
 	}
 	return est_mtu ? : new_mtu;
 }
 
+static void check_peer_pmtu(struct dst_entry *dst, struct inet_peer *peer)
+{
+	unsigned long expires = peer->pmtu_expires;
+
+	if (time_before(jiffies, expires)) {
+		u32 orig_dst_mtu = dst_mtu(dst);
+		if (peer->pmtu_learned < orig_dst_mtu) {
+			if (!peer->pmtu_orig)
+				peer->pmtu_orig = dst_metric_raw(dst, RTAX_MTU);
+			dst_metric_set(dst, RTAX_MTU, peer->pmtu_learned);
+		}
+	} else if (cmpxchg(&peer->pmtu_expires, expires, 0) == expires)
+		dst_metric_set(dst, RTAX_MTU, peer->pmtu_orig);
+}
+
 static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 {
-	if (dst_mtu(dst) > mtu && mtu >= 68 &&
-	    !(dst_metric_locked(dst, RTAX_MTU))) {
-		if (mtu < ip_rt_min_pmtu) {
-			u32 lock = dst_metric(dst, RTAX_LOCK);
+	struct rtable *rt = (struct rtable *) dst;
+	struct inet_peer *peer;
+
+	dst_confirm(dst);
+
+	if (!rt->peer)
+		rt_bind_peer(rt, 1);
+	peer = rt->peer;
+	if (peer) {
+		if (mtu < ip_rt_min_pmtu)
 			mtu = ip_rt_min_pmtu;
-			dst_metric_set(dst, RTAX_LOCK, lock | (1 << RTAX_MTU));
+		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
+			peer->pmtu_learned = mtu;
+			peer->pmtu_expires = jiffies + ip_rt_mtu_expires;
+
+			atomic_inc(&__rt_peer_genid);
+			rt->rt_peer_genid = rt_peer_genid();
+
+			check_peer_pmtu(dst, peer);
 		}
-		dst_metric_set(dst, RTAX_MTU, mtu);
-		dst_set_expires(dst, ip_rt_mtu_expires);
+		inet_putpeer(peer);
 	}
 }
 
@@ -1781,9 +1685,15 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 	if (rt_is_expired(rt))
 		return NULL;
 	if (rt->rt_peer_genid != rt_peer_genid()) {
+		struct inet_peer *peer;
+
 		if (!rt->peer)
 			rt_bind_peer(rt, 0);
 
+		peer = rt->peer;
+		if (peer && peer->pmtu_expires)
+			check_peer_pmtu(dst, peer);
+
 		rt->rt_peer_genid = rt_peer_genid();
 	}
 	return dst;
@@ -1812,8 +1722,14 @@ static void ipv4_link_failure(struct sk_buff *skb)
 	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0);
 
 	rt = skb_rtable(skb);
-	if (rt)
-		dst_set_expires(&rt->dst, 0);
+	if (rt &&
+	    rt->peer &&
+	    rt->peer->pmtu_expires) {
+		unsigned long orig = rt->peer->pmtu_expires;
+
+		if (cmpxchg(&rt->peer->pmtu_expires, orig, 0) == orig)
+			dst_metric_set(&rt->dst, RTAX_MTU, rt->peer->pmtu_orig);
+	}
 }
 
 static int ip_rt_bug(struct sk_buff *skb)
@@ -1911,6 +1827,9 @@ static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
 			memcpy(peer->metrics, fi->fib_metrics,
 			       sizeof(u32) * RTAX_MAX);
 		dst_init_metrics(&rt->dst, peer->metrics, false);
+
+		if (peer->pmtu_expires)
+			check_peer_pmtu(&rt->dst, peer);
 	} else {
 		if (fi->fib_metrics != (u32 *) dst_default_metrics) {
 			rt->fi = fi;
@@ -2961,7 +2880,8 @@ static int rt_fill_info(struct net *net,
 		NLA_PUT_BE32(skb, RTA_MARK, rt->fl.mark);
 
 	error = rt->dst.error;
-	expires = rt->dst.expires ? rt->dst.expires - jiffies : 0;
+	expires = (rt->peer && rt->peer->pmtu_expires) ?
+		rt->peer->pmtu_expires - jiffies : 0;
 	if (rt->peer) {
 		inet_peer_refcheck(rt->peer);
 		id = atomic_read(&rt->peer->ip_id_count) & 0xffff;
@@ -3418,14 +3338,6 @@ int __init ip_rt_init(void)
 	devinet_init();
 	ip_fib_init();
 
-	/* All the timers, started at system startup tend
-	   to synchronize. Perturb it a bit.
-	 */
-	INIT_DELAYED_WORK_DEFERRABLE(&expires_work, rt_worker_func);
-	expires_ljiffies = jiffies;
-	schedule_delayed_work(&expires_work,
-		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
-
 	if (ip_rt_proc_init())
 		printk(KERN_ERR "Unable to create route proc files\n");
 #ifdef CONFIG_XFRM
-- 
1.7.4
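
For readers following the PMTU change above, here is a minimal user-space sketch of the lazy expiry idea: the learned MTU lives in a per-destination peer entry, and a route only applies or drops it when it is revalidated. It assumes the learned value is honoured while the deadline is still in the future. The names peer_model, route_model, tick_before and check_peer_pmtu_model are hypothetical stand-ins, not kernel APIs, and C11 atomics stand in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the PMTU fields of the kernel's inet_peer. */
struct peer_model {
	_Atomic unsigned long pmtu_expires; /* 0 means "nothing learned" */
	uint32_t pmtu_learned;              /* MTU reported by the network */
	uint32_t pmtu_orig;                 /* MTU the route had before */
};

/* Hypothetical stand-in for a cached route and its MTU metric. */
struct route_model {
	uint32_t mtu;
};

/* Wrap-safe "a is before b", in the spirit of the kernel's time_before(). */
static int tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Apply the learned PMTU while it is still valid. Once the deadline has
 * passed, exactly one caller wins the compare-and-swap that clears the
 * deadline, and the route gets its saved MTU back.
 */
static void check_peer_pmtu_model(struct route_model *rt,
				  struct peer_model *peer,
				  unsigned long now)
{
	unsigned long expires;

	assert(rt != NULL && peer != NULL);

	expires = atomic_load(&peer->pmtu_expires);
	if (expires == 0)
		return; /* nothing learned for this destination */

	if (tick_before(now, expires)) {
		if (peer->pmtu_learned < rt->mtu) {
			if (!peer->pmtu_orig)
				peer->pmtu_orig = rt->mtu;
			rt->mtu = peer->pmtu_learned;
		}
	} else if (atomic_compare_exchange_strong(&peer->pmtu_expires,
						  &expires, 0UL)) {
		if (peer->pmtu_orig)
			rt->mtu = peer->pmtu_orig;
	}
}
```

In this model no timer ever walks the routes. A route with MTU 1500 that learns 1400 with a deadline at tick 700 reports 1400 when checked at tick 200, and goes back to 1500 the first time it is checked after tick 700.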



* [RFC PATCH 5/5] ipv4: Cache learned redirect information in inetpeer.
From: David Miller @ 2011-02-10  6:13 UTC (permalink / raw)
  To: netdev


Note that we do not generate the redirect netevent any longer,
because we don't create a new cached route.

Instead, once the new neighbour is bound to the cached route,
we emit a neigh update event.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |  136 +++++++++++++++++-------------------------------------
 1 files changed, 42 insertions(+), 94 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 11faf14..756f544 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1294,13 +1294,8 @@ static void rt_del(unsigned hash, struct rtable *rt)
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
 {
-	int i, k;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
-	struct rtable *rth;
-	struct rtable __rcu **rthp;
-	__be32  skeys[2] = { saddr, 0 };
-	int  ikeys[2] = { dev->ifindex, 0 };
-	struct netevent_redirect netevent;
+	struct inet_peer *peer;
 	struct net *net;
 
 	if (!in_dev)
@@ -1312,9 +1307,6 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 	    ipv4_is_zeronet(new_gw))
 		goto reject_redirect;
 
-	if (!rt_caching(net))
-		goto reject_redirect;
-
 	if (!IN_DEV_SHARED_MEDIA(in_dev)) {
 		if (!inet_addr_onlink(in_dev, new_gw, old_gw))
 			goto reject_redirect;
@@ -1325,93 +1317,13 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 			goto reject_redirect;
 	}
 
-	for (i = 0; i < 2; i++) {
-		for (k = 0; k < 2; k++) {
-			unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],
-						rt_genid(net));
-
-			rthp = &rt_hash_table[hash].chain;
-
-			while ((rth = rcu_dereference(*rthp)) != NULL) {
-				struct rtable *rt;
-
-				if (rth->fl.fl4_dst != daddr ||
-				    rth->fl.fl4_src != skeys[i] ||
-				    rth->fl.oif != ikeys[k] ||
-				    rt_is_input_route(rth) ||
-				    rt_is_expired(rth) ||
-				    !net_eq(dev_net(rth->dst.dev), net)) {
-					rthp = &rth->dst.rt_next;
-					continue;
-				}
-
-				if (rth->rt_dst != daddr ||
-				    rth->rt_src != saddr ||
-				    rth->dst.error ||
-				    rth->rt_gateway != old_gw ||
-				    rth->dst.dev != dev)
-					break;
-
-				dst_hold(&rth->dst);
-
-				rt = dst_alloc(&ipv4_dst_ops);
-				if (rt == NULL) {
-					ip_rt_put(rth);
-					return;
-				}
-
-				/* Copy all the information. */
-				*rt = *rth;
-				rt->dst.__use		= 1;
-				atomic_set(&rt->dst.__refcnt, 1);
-				rt->dst.child		= NULL;
-				if (rt->dst.dev)
-					dev_hold(rt->dst.dev);
-				rt->dst.obsolete	= -1;
-				rt->dst.lastuse	= jiffies;
-				rt->dst.path		= &rt->dst;
-				rt->dst.neighbour	= NULL;
-				rt->dst.hh		= NULL;
-#ifdef CONFIG_XFRM
-				rt->dst.xfrm		= NULL;
-#endif
-				rt->rt_genid		= rt_genid(net);
-				rt->rt_flags		|= RTCF_REDIRECTED;
-
-				/* Gateway is different ... */
-				rt->rt_gateway		= new_gw;
-
-				/* Redirect received -> path was valid */
-				dst_confirm(&rth->dst);
-
-				if (rt->peer)
-					atomic_inc(&rt->peer->refcnt);
-				if (rt->fi)
-					atomic_inc(&rt->fi->fib_clntref);
-
-				if (arp_bind_neighbour(&rt->dst) ||
-				    !(rt->dst.neighbour->nud_state &
-					    NUD_VALID)) {
-					if (rt->dst.neighbour)
-						neigh_event_send(rt->dst.neighbour, NULL);
-					ip_rt_put(rth);
-					rt_drop(rt);
-					goto do_next;
-				}
+	peer = inet_getpeer_v4(daddr, 1);
+	if (peer) {
+		peer->redirect_learned.a4 = new_gw;
 
-				netevent.old = &rth->dst;
-				netevent.new = &rt->dst;
-				call_netevent_notifiers(NETEVENT_REDIRECT,
-							&netevent);
+		inet_putpeer(peer);
 
-				rt_del(hash, rth);
-				if (!rt_intern_hash(hash, rt, &rt, NULL, rt->fl.oif))
-					ip_rt_put(rt);
-				goto do_next;
-			}
-		do_next:
-			;
-		}
+		atomic_inc(&__rt_peer_genid);
 	}
 	return;
 
@@ -1678,6 +1590,31 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 	}
 }
 
+static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
+{
+	struct rtable *rt = (struct rtable *) dst;
+	__be32 orig_gw = rt->rt_gateway;
+
+	dst_confirm(&rt->dst);
+
+	neigh_release(rt->dst.neighbour);
+	rt->dst.neighbour = NULL;
+
+	rt->rt_gateway = peer->redirect_learned.a4;
+	if (arp_bind_neighbour(&rt->dst) ||
+	    !(rt->dst.neighbour->nud_state & NUD_VALID)) {
+		if (rt->dst.neighbour)
+			neigh_event_send(rt->dst.neighbour, NULL);
+		rt->rt_gateway = orig_gw;
+		return -EAGAIN;
+	} else {
+		rt->rt_flags |= RTCF_REDIRECTED;
+		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE,
+					rt->dst.neighbour);
+	}
+	return 0;
+}
+
 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 {
 	struct rtable *rt = (struct rtable *) dst;
@@ -1694,6 +1631,12 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 		if (peer && peer->pmtu_expires)
 			check_peer_pmtu(dst, peer);
 
+		if (peer && peer->redirect_learned.a4 &&
+		    peer->redirect_learned.a4 != rt->rt_gateway) {
+			if (check_peer_redir(dst, peer))
+				return NULL;
+		}
+
 		rt->rt_peer_genid = rt_peer_genid();
 	}
 	return dst;
@@ -1830,6 +1773,11 @@ static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
 
 		if (peer->pmtu_expires)
 			check_peer_pmtu(&rt->dst, peer);
+		if (peer->redirect_learned.a4 &&
+		    peer->redirect_learned.a4 != rt->rt_gateway) {
+			rt->rt_gateway = peer->redirect_learned.a4;
+			rt->rt_flags |= RTCF_REDIRECTED;
+		}
 	} else {
 		if (fi->fib_metrics != (u32 *) dst_default_metrics) {
 			rt->fi = fi;
-- 
1.7.4
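
The redirect handling above can be pictured with a small user-space model: the learned gateway is stored in a per-destination peer entry, a global generation counter is bumped, and each cached route notices the change the next time it is checked. This is only a sketch under stated assumptions. redir_peer, cached_route, peer_genid_model, learn_redirect and revalidate_route are hypothetical names, and the neighbour rebinding and its -EAGAIN failure path from check_peer_redir() are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the global __rt_peer_genid counter. */
static atomic_int peer_genid_model;

/* Hypothetical stand-in for the redirect field of inet_peer. */
struct redir_peer {
	uint32_t redirect_learned; /* 0 means "no redirect learned" */
};

/* Hypothetical stand-in for a cached route. */
struct cached_route {
	uint32_t gateway;
	int peer_genid;            /* generation seen at the last check */
	int redirected;            /* models RTCF_REDIRECTED */
	struct redir_peer *peer;
};

/* Record the new gateway and tell every cached route to recheck. */
static void learn_redirect(struct redir_peer *peer, uint32_t new_gw)
{
	assert(peer != NULL);
	peer->redirect_learned = new_gw;
	atomic_fetch_add(&peer_genid_model, 1);
}

/*
 * Called when a cached route is about to be reused. Returns 1 if the
 * route picked up a new gateway, 0 otherwise.
 */
static int revalidate_route(struct cached_route *rt)
{
	int now = atomic_load(&peer_genid_model);
	int changed = 0;

	if (rt->peer_genid == now)
		return 0; /* nothing changed since the last check */

	if (rt->peer && rt->peer->redirect_learned &&
	    rt->peer->redirect_learned != rt->gateway) {
		rt->gateway = rt->peer->redirect_learned;
		rt->redirected = 1;
		changed = 1;
	}
	rt->peer_genid = now;
	return changed;
}
```

The point of the generation counter in this model is that the redirect handler never has to find the affected routes. It writes one field and increments one counter, and the cost of applying the change is paid by whichever route is revalidated next.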



* Re: STMMAC driver: NFS Problem on 2.6.37
From: Brian Downing @ 2011-02-09 20:58 UTC (permalink / raw)
  To: Chuck Lever
  Cc: deepaksi, Armando VISCONTI, Trond Myklebust, netdev,
	Linux NFS Mailing List, Shiraz HASHIM, Viresh KUMAR,
	Peppe CAVALLARO, amitgoel
In-Reply-To: <D031EAC4-68E9-4E56-8DB3-208B34C8DFD7@oracle.com>

On Wed, Feb 09, 2011 at 03:12:22PM -0500, Chuck Lever wrote:
> Based on your console logs, I see that the working case uses UDP to
> contact the server's mountd, but the failing case uses TCP.  You can
> try explicitly specifying "proto=udp" to force the use of UDP, to test
> this theory.

This does indeed make it work again for me, thanks!

> Meanwhile, the patch description explicitly states that the default
> mount option settings have changed.  Does it make sense to change the
> default behavior of NFSROOT mounts to use UDP again?  I don't see
> another way to make this process more reliable across NIC
> initialization.  If this is considered a regression, we can make a
> patch for 2.6.38-rc and 2.6.37.

I only use nfsroot for development, so I don't have a terribly strong
opinion.  I would point out though that the default u-boot parameters
for nfsrooting a lot of boards will no longer work at this point, so if
it's not patched to work again without specifying nfs options I think
there should at least be a note in the documentation and possibly a
"maybe try proto=udp?" console message on failure.

I assume it's not feasible either to wait until the chosen interface's
link is ready before trying to mount nfsroot, or to retry TCP-based
connections a little more aggressively (or at all)?

-bcd


* [RFC !!BONUS!! PATCH 6/5] ipv4: Delete routing cache.
From: David Miller @ 2011-02-10  6:39 UTC (permalink / raw)
  To: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---

I couldn't resist, I've been waiting years to be able to even
think about trying this.

If you want to live on the edge, and I really mean "the edge",
you want to try this patch out for sure. :-)

It builds, and that's all that I promise.

 include/net/route.h     |    1 -
 net/ipv4/fib_frontend.c |    5 -
 net/ipv4/route.c        |  891 ++---------------------------------------------
 3 files changed, 23 insertions(+), 874 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index bf790c1..fcf1b11 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -117,7 +117,6 @@ extern int		ip_rt_init(void);
 extern void		ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
 				       __be32 src, struct net_device *dev);
 extern void		rt_cache_flush(struct net *net, int how);
-extern void		rt_cache_flush_batch(struct net *net);
 extern int		__ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
 extern int		ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
 extern int		ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2a49c06..694145c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -978,11 +978,6 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		rt_cache_flush(dev_net(dev), 0);
 		break;
 	case NETDEV_UNREGISTER_BATCH:
-		/* The batch unregister is only called on the first
-		 * device in the list of devices being unregistered.
-		 * Therefore we should not pass dev_net(dev) in here.
-		 */
-		rt_cache_flush_batch(NULL);
 		break;
 	}
 	return NOTIFY_DONE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 756f544..7078b8b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int rt_chain_length_max __read_mostly	= 20;
 
 /*
  *	Interface to generic destination cache.
@@ -222,184 +221,30 @@ const __u8 ip_tos2prio[16] = {
 };
 
 
-/*
- * Route cache.
- */
-
-/* The locking scheme is rather straight forward:
- *
- * 1) Read-Copy Update protects the buckets of the central route hash.
- * 2) Only writers remove entries, and they hold the lock
- *    as they look at rtable reference counts.
- * 3) Only readers acquire references to rtable entries,
- *    they do so with atomic increments and with the
- *    lock held.
- */
-
-struct rt_hash_bucket {
-	struct rtable __rcu	*chain;
-};
-
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
-	defined(CONFIG_PROVE_LOCKING)
-/*
- * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
- * The size of this table is a power of two and depends on the number of CPUS.
- * (on lockdep we have a quite big spinlock_t, so keep the size down there)
- */
-#ifdef CONFIG_LOCKDEP
-# define RT_HASH_LOCK_SZ	256
-#else
-# if NR_CPUS >= 32
-#  define RT_HASH_LOCK_SZ	4096
-# elif NR_CPUS >= 16
-#  define RT_HASH_LOCK_SZ	2048
-# elif NR_CPUS >= 8
-#  define RT_HASH_LOCK_SZ	1024
-# elif NR_CPUS >= 4
-#  define RT_HASH_LOCK_SZ	512
-# else
-#  define RT_HASH_LOCK_SZ	256
-# endif
-#endif
-
-static spinlock_t	*rt_hash_locks;
-# define rt_hash_lock_addr(slot) &rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)]
-
-static __init void rt_hash_lock_init(void)
-{
-	int i;
-
-	rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ,
-			GFP_KERNEL);
-	if (!rt_hash_locks)
-		panic("IP: failed to allocate rt_hash_locks\n");
-
-	for (i = 0; i < RT_HASH_LOCK_SZ; i++)
-		spin_lock_init(&rt_hash_locks[i]);
-}
-#else
-# define rt_hash_lock_addr(slot) NULL
-
-static inline void rt_hash_lock_init(void)
-{
-}
-#endif
-
-static struct rt_hash_bucket 	*rt_hash_table __read_mostly;
-static unsigned			rt_hash_mask __read_mostly;
-static unsigned int		rt_hash_log  __read_mostly;
-
 static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
 #define RT_CACHE_STAT_INC(field) __this_cpu_inc(rt_cache_stat.field)
 
-static inline unsigned int rt_hash(__be32 daddr, __be32 saddr, int idx,
-				   int genid)
-{
-	return jhash_3words((__force u32)daddr, (__force u32)saddr,
-			    idx, genid)
-		& rt_hash_mask;
-}
-
 static inline int rt_genid(struct net *net)
 {
 	return atomic_read(&net->ipv4.rt_genid);
 }
 
 #ifdef CONFIG_PROC_FS
-struct rt_cache_iter_state {
-	struct seq_net_private p;
-	int bucket;
-	int genid;
-};
-
-static struct rtable *rt_cache_get_first(struct seq_file *seq)
-{
-	struct rt_cache_iter_state *st = seq->private;
-	struct rtable *r = NULL;
-
-	for (st->bucket = rt_hash_mask; st->bucket >= 0; --st->bucket) {
-		if (!rcu_dereference_raw(rt_hash_table[st->bucket].chain))
-			continue;
-		rcu_read_lock_bh();
-		r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
-		while (r) {
-			if (dev_net(r->dst.dev) == seq_file_net(seq) &&
-			    r->rt_genid == st->genid)
-				return r;
-			r = rcu_dereference_bh(r->dst.rt_next);
-		}
-		rcu_read_unlock_bh();
-	}
-	return r;
-}
-
-static struct rtable *__rt_cache_get_next(struct seq_file *seq,
-					  struct rtable *r)
-{
-	struct rt_cache_iter_state *st = seq->private;
-
-	r = rcu_dereference_bh(r->dst.rt_next);
-	while (!r) {
-		rcu_read_unlock_bh();
-		do {
-			if (--st->bucket < 0)
-				return NULL;
-		} while (!rcu_dereference_raw(rt_hash_table[st->bucket].chain));
-		rcu_read_lock_bh();
-		r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
-	}
-	return r;
-}
-
-static struct rtable *rt_cache_get_next(struct seq_file *seq,
-					struct rtable *r)
-{
-	struct rt_cache_iter_state *st = seq->private;
-	while ((r = __rt_cache_get_next(seq, r)) != NULL) {
-		if (dev_net(r->dst.dev) != seq_file_net(seq))
-			continue;
-		if (r->rt_genid == st->genid)
-			break;
-	}
-	return r;
-}
-
-static struct rtable *rt_cache_get_idx(struct seq_file *seq, loff_t pos)
-{
-	struct rtable *r = rt_cache_get_first(seq);
-
-	if (r)
-		while (pos && (r = rt_cache_get_next(seq, r)))
-			--pos;
-	return pos ? NULL : r;
-}
-
 static void *rt_cache_seq_start(struct seq_file *seq, loff_t *pos)
 {
-	struct rt_cache_iter_state *st = seq->private;
 	if (*pos)
-		return rt_cache_get_idx(seq, *pos - 1);
-	st->genid = rt_genid(seq_file_net(seq));
+		return NULL;
 	return SEQ_START_TOKEN;
 }
 
 static void *rt_cache_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
-	struct rtable *r;
-
-	if (v == SEQ_START_TOKEN)
-		r = rt_cache_get_first(seq);
-	else
-		r = rt_cache_get_next(seq, v);
 	++*pos;
-	return r;
+	return NULL;
 }
 
 static void rt_cache_seq_stop(struct seq_file *seq, void *v)
 {
-	if (v && v != SEQ_START_TOKEN)
-		rcu_read_unlock_bh();
 }
 
 static int rt_cache_seq_show(struct seq_file *seq, void *v)
@@ -409,29 +254,6 @@ static int rt_cache_seq_show(struct seq_file *seq, void *v)
 			   "Iface\tDestination\tGateway \tFlags\t\tRefCnt\tUse\t"
 			   "Metric\tSource\t\tMTU\tWindow\tIRTT\tTOS\tHHRef\t"
 			   "HHUptod\tSpecDst");
-	else {
-		struct rtable *r = v;
-		int len;
-
-		seq_printf(seq, "%s\t%08X\t%08X\t%8X\t%d\t%u\t%d\t"
-			      "%08X\t%d\t%u\t%u\t%02X\t%d\t%1d\t%08X%n",
-			r->dst.dev ? r->dst.dev->name : "*",
-			(__force u32)r->rt_dst,
-			(__force u32)r->rt_gateway,
-			r->rt_flags, atomic_read(&r->dst.__refcnt),
-			r->dst.__use, 0, (__force u32)r->rt_src,
-			dst_metric_advmss(&r->dst) + 40,
-			dst_metric(&r->dst, RTAX_WINDOW),
-			(int)((dst_metric(&r->dst, RTAX_RTT) >> 3) +
-			      dst_metric(&r->dst, RTAX_RTTVAR)),
-			r->fl.fl4_tos,
-			r->dst.hh ? atomic_read(&r->dst.hh->hh_refcnt) : -1,
-			r->dst.hh ? (r->dst.hh->hh_output ==
-				       dev_queue_xmit) : 0,
-			r->rt_spec_dst, &len);
-
-		seq_printf(seq, "%*s\n", 127 - len, "");
-	}
 	return 0;
 }
 
@@ -444,8 +266,7 @@ static const struct seq_operations rt_cache_seq_ops = {
 
 static int rt_cache_seq_open(struct inode *inode, struct file *file)
 {
-	return seq_open_net(inode, file, &rt_cache_seq_ops,
-			sizeof(struct rt_cache_iter_state));
+	return seq_open_net(inode, file, &rt_cache_seq_ops, 0);
 }
 
 static const struct file_operations rt_cache_seq_fops = {
@@ -643,184 +464,12 @@ static inline int ip_rt_proc_init(void)
 }
 #endif /* CONFIG_PROC_FS */
 
-static inline void rt_free(struct rtable *rt)
-{
-	call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline void rt_drop(struct rtable *rt)
-{
-	ip_rt_put(rt);
-	call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline int rt_fast_clean(struct rtable *rth)
-{
-	/* Kill broadcast/multicast entries very aggresively, if they
-	   collide in hash table with more useful entries */
-	return (rth->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) &&
-		rt_is_input_route(rth) && rth->dst.rt_next;
-}
-
-static inline int rt_valuable(struct rtable *rth)
-{
-	return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
-		(rth->peer && rth->peer->pmtu_expires);
-}
-
-static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
-{
-	unsigned long age;
-	int ret = 0;
-
-	if (atomic_read(&rth->dst.__refcnt))
-		goto out;
-
-	age = jiffies - rth->dst.lastuse;
-	if ((age <= tmo1 && !rt_fast_clean(rth)) ||
-	    (age <= tmo2 && rt_valuable(rth)))
-		goto out;
-	ret = 1;
-out:	return ret;
-}
-
-/* Bits of score are:
- * 31: very valuable
- * 30: not quite useless
- * 29..0: usage counter
- */
-static inline u32 rt_score(struct rtable *rt)
-{
-	u32 score = jiffies - rt->dst.lastuse;
-
-	score = ~score & ~(3<<30);
-
-	if (rt_valuable(rt))
-		score |= (1<<31);
-
-	if (rt_is_output_route(rt) ||
-	    !(rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST|RTCF_LOCAL)))
-		score |= (1<<30);
-
-	return score;
-}
-
-static inline bool rt_caching(const struct net *net)
-{
-	return net->ipv4.current_rt_cache_rebuild_count <=
-		net->ipv4.sysctl_rt_cache_rebuild_count;
-}
-
-static inline bool compare_hash_inputs(const struct flowi *fl1,
-					const struct flowi *fl2)
-{
-	return ((((__force u32)fl1->fl4_dst ^ (__force u32)fl2->fl4_dst) |
-		((__force u32)fl1->fl4_src ^ (__force u32)fl2->fl4_src) |
-		(fl1->iif ^ fl2->iif)) == 0);
-}
-
-static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
-{
-	return (((__force u32)fl1->fl4_dst ^ (__force u32)fl2->fl4_dst) |
-		((__force u32)fl1->fl4_src ^ (__force u32)fl2->fl4_src) |
-		(fl1->mark ^ fl2->mark) |
-		(*(u16 *)&fl1->fl4_tos ^ *(u16 *)&fl2->fl4_tos) |
-		(fl1->oif ^ fl2->oif) |
-		(fl1->iif ^ fl2->iif)) == 0;
-}
-
-static inline int compare_netns(struct rtable *rt1, struct rtable *rt2)
-{
-	return net_eq(dev_net(rt1->dst.dev), dev_net(rt2->dst.dev));
-}
-
 static inline int rt_is_expired(struct rtable *rth)
 {
 	return rth->rt_genid != rt_genid(dev_net(rth->dst.dev));
 }
 
 /*
- * Perform a full scan of hash table and free all entries.
- * Can be called by a softirq or a process.
- * In the later case, we want to be reschedule if necessary
- */
-static void rt_do_flush(struct net *net, int process_context)
-{
-	unsigned int i;
-	struct rtable *rth, *next;
-
-	for (i = 0; i <= rt_hash_mask; i++) {
-		struct rtable __rcu **pprev;
-		struct rtable *list;
-
-		if (process_context && need_resched())
-			cond_resched();
-		rth = rcu_dereference_raw(rt_hash_table[i].chain);
-		if (!rth)
-			continue;
-
-		spin_lock_bh(rt_hash_lock_addr(i));
-
-		list = NULL;
-		pprev = &rt_hash_table[i].chain;
-		rth = rcu_dereference_protected(*pprev,
-			lockdep_is_held(rt_hash_lock_addr(i)));
-
-		while (rth) {
-			next = rcu_dereference_protected(rth->dst.rt_next,
-				lockdep_is_held(rt_hash_lock_addr(i)));
-
-			if (!net ||
-			    net_eq(dev_net(rth->dst.dev), net)) {
-				rcu_assign_pointer(*pprev, next);
-				rcu_assign_pointer(rth->dst.rt_next, list);
-				list = rth;
-			} else {
-				pprev = &rth->dst.rt_next;
-			}
-			rth = next;
-		}
-
-		spin_unlock_bh(rt_hash_lock_addr(i));
-
-		for (; list; list = next) {
-			next = rcu_dereference_protected(list->dst.rt_next, 1);
-			rt_free(list);
-		}
-	}
-}
-
-/*
- * While freeing expired entries, we compute average chain length
- * and standard deviation, using fixed-point arithmetic.
- * This to have an estimation of rt_chain_length_max
- *  rt_chain_length_max = max(elasticity, AVG + 4*SD)
- * We use 3 bits for frational part, and 29 (or 61) for magnitude.
- */
-
-#define FRACT_BITS 3
-#define ONE (1UL << FRACT_BITS)
-
-/*
- * Given a hash chain and an item in this hash chain,
- * find if a previous entry has the same hash_inputs
- * (but differs on tos, mark or oif)
- * Returns 0 if an alias is found.
- * Returns ONE if rth has no alias before itself.
- */
-static int has_noalias(const struct rtable *head, const struct rtable *rth)
-{
-	const struct rtable *aux = head;
-
-	while (aux != rth) {
-		if (compare_hash_inputs(&aux->fl, &rth->fl))
-			return 0;
-		aux = rcu_dereference_protected(aux->dst.rt_next, 1);
-	}
-	return ONE;
-}
-
-/*
  * Pertubation of rt_genid by a small quantity [1..256]
  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
  * many times (2^24) without giving recent rt_genid.
@@ -841,366 +490,32 @@ static void rt_cache_invalidate(struct net *net)
 void rt_cache_flush(struct net *net, int delay)
 {
 	rt_cache_invalidate(net);
-	if (delay >= 0)
-		rt_do_flush(net, !in_softirq());
-}
-
-/* Flush previous cache invalidated entries from the cache */
-void rt_cache_flush_batch(struct net *net)
-{
-	rt_do_flush(net, !in_softirq());
 }
 
-static void rt_emergency_hash_rebuild(struct net *net)
-{
-	if (net_ratelimit())
-		printk(KERN_WARNING "Route hash chain too long!\n");
-	rt_cache_invalidate(net);
-}
-
-/*
-   Short description of GC goals.
-
-   We want to build algorithm, which will keep routing cache
-   at some equilibrium point, when number of aged off entries
-   is kept approximately equal to newly generated ones.
-
-   Current expiration strength is variable "expire".
-   We try to adjust it dynamically, so that if networking
-   is idle expires is large enough to keep enough of warm entries,
-   and when load increases it reduces to limit cache size.
- */
-
 static int rt_garbage_collect(struct dst_ops *ops)
 {
-	static unsigned long expire = RT_GC_TIMEOUT;
-	static unsigned long last_gc;
-	static int rover;
-	static int equilibrium;
-	struct rtable *rth;
-	struct rtable __rcu **rthp;
-	unsigned long now = jiffies;
-	int goal;
-	int entries = dst_entries_get_fast(&ipv4_dst_ops);
-
-	/*
-	 * Garbage collection is pretty expensive,
-	 * do not make it too frequently.
-	 */
-
 	RT_CACHE_STAT_INC(gc_total);
-
-	if (now - last_gc < ip_rt_gc_min_interval &&
-	    entries < ip_rt_max_size) {
-		RT_CACHE_STAT_INC(gc_ignored);
-		goto out;
-	}
-
-	entries = dst_entries_get_slow(&ipv4_dst_ops);
-	/* Calculate number of entries, which we want to expire now. */
-	goal = entries - (ip_rt_gc_elasticity << rt_hash_log);
-	if (goal <= 0) {
-		if (equilibrium < ipv4_dst_ops.gc_thresh)
-			equilibrium = ipv4_dst_ops.gc_thresh;
-		goal = entries - equilibrium;
-		if (goal > 0) {
-			equilibrium += min_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-			goal = entries - equilibrium;
-		}
-	} else {
-		/* We are in dangerous area. Try to reduce cache really
-		 * aggressively.
-		 */
-		goal = max_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-		equilibrium = entries - goal;
-	}
-
-	if (now - last_gc >= ip_rt_gc_min_interval)
-		last_gc = now;
-
-	if (goal <= 0) {
-		equilibrium += goal;
-		goto work_done;
-	}
-
-	do {
-		int i, k;
-
-		for (i = rt_hash_mask, k = rover; i >= 0; i--) {
-			unsigned long tmo = expire;
-
-			k = (k + 1) & rt_hash_mask;
-			rthp = &rt_hash_table[k].chain;
-			spin_lock_bh(rt_hash_lock_addr(k));
-			while ((rth = rcu_dereference_protected(*rthp,
-					lockdep_is_held(rt_hash_lock_addr(k)))) != NULL) {
-				if (!rt_is_expired(rth) &&
-					!rt_may_expire(rth, tmo, expire)) {
-					tmo >>= 1;
-					rthp = &rth->dst.rt_next;
-					continue;
-				}
-				*rthp = rth->dst.rt_next;
-				rt_free(rth);
-				goal--;
-			}
-			spin_unlock_bh(rt_hash_lock_addr(k));
-			if (goal <= 0)
-				break;
-		}
-		rover = k;
-
-		if (goal <= 0)
-			goto work_done;
-
-		/* Goal is not achieved. We stop process if:
-
-		   - if expire reduced to zero. Otherwise, expire is halfed.
-		   - if table is not full.
-		   - if we are called from interrupt.
-		   - jiffies check is just fallback/debug loop breaker.
-		     We will not spin here for long time in any case.
-		 */
-
-		RT_CACHE_STAT_INC(gc_goal_miss);
-
-		if (expire == 0)
-			break;
-
-		expire >>= 1;
-#if RT_CACHE_DEBUG >= 2
-		printk(KERN_DEBUG "expire>> %u %d %d %d\n", expire,
-				dst_entries_get_fast(&ipv4_dst_ops), goal, i);
-#endif
-
-		if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
-			goto out;
-	} while (!in_softirq() && time_before_eq(jiffies, now));
-
-	if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
-		goto out;
-	if (dst_entries_get_slow(&ipv4_dst_ops) < ip_rt_max_size)
-		goto out;
-	if (net_ratelimit())
-		printk(KERN_WARNING "dst cache overflow\n");
-	RT_CACHE_STAT_INC(gc_dst_overflow);
-	return 1;
-
-work_done:
-	expire += ip_rt_gc_min_interval;
-	if (expire > ip_rt_gc_timeout ||
-	    dst_entries_get_fast(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh ||
-	    dst_entries_get_slow(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh)
-		expire = ip_rt_gc_timeout;
-#if RT_CACHE_DEBUG >= 2
-	printk(KERN_DEBUG "expire++ %u %d %d %d\n", expire,
-			dst_entries_get_fast(&ipv4_dst_ops), goal, rover);
-#endif
-out:	return 0;
-}
-
-/*
- * Returns number of entries in a hash chain that have different hash_inputs
- */
-static int slow_chain_length(const struct rtable *head)
-{
-	int length = 0;
-	const struct rtable *rth = head;
-
-	while (rth) {
-		length += has_noalias(head, rth);
-		rth = rcu_dereference_protected(rth->dst.rt_next, 1);
-	}
-	return length >> FRACT_BITS;
+	return 0;
 }
 
-static int rt_intern_hash(unsigned hash, struct rtable *rt,
-			  struct rtable **rp, struct sk_buff *skb, int ifindex)
+static int rt_finalize(struct rtable *rt, struct rtable **rp, struct sk_buff *skb)
 {
-	struct rtable	*rth, *cand;
-	struct rtable __rcu **rthp, **candp;
-	unsigned long	now;
-	u32 		min_score;
-	int		chain_length;
-	int attempts = !in_softirq();
-
-restart:
-	chain_length = 0;
-	min_score = ~(u32)0;
-	cand = NULL;
-	candp = NULL;
-	now = jiffies;
-
-	if (!rt_caching(dev_net(rt->dst.dev))) {
-		/*
-		 * If we're not caching, just tell the caller we
-		 * were successful and don't touch the route.  The
-		 * caller hold the sole reference to the cache entry, and
-		 * it will be released when the caller is done with it.
-		 * If we drop it here, the callers have no way to resolve routes
-		 * when we're not caching.  Instead, just point *rp at rt, so
-		 * the caller gets a single use out of the route
-		 * Note that we do rt_free on this new route entry, so that
-		 * once its refcount hits zero, we are still able to reap it
-		 * (Thanks Alexey)
-		 * Note: To avoid expensive rcu stuff for this uncached dst,
-		 * we set DST_NOCACHE so that dst_release() can free dst without
-		 * waiting a grace period.
-		 */
-
-		rt->dst.flags |= DST_NOCACHE;
-		if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
-			int err = arp_bind_neighbour(&rt->dst);
-			if (err) {
-				if (net_ratelimit())
-					printk(KERN_WARNING
-					    "Neighbour table failure & not caching routes.\n");
-				ip_rt_put(rt);
-				return err;
-			}
-		}
-
-		goto skip_hashing;
-	}
-
-	rthp = &rt_hash_table[hash].chain;
-
-	spin_lock_bh(rt_hash_lock_addr(hash));
-	while ((rth = rcu_dereference_protected(*rthp,
-			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (rt_is_expired(rth)) {
-			*rthp = rth->dst.rt_next;
-			rt_free(rth);
-			continue;
-		}
-		if (compare_keys(&rth->fl, &rt->fl) && compare_netns(rth, rt)) {
-			/* Put it first */
-			*rthp = rth->dst.rt_next;
-			/*
-			 * Since lookup is lockfree, the deletion
-			 * must be visible to another weakly ordered CPU before
-			 * the insertion at the start of the hash chain.
-			 */
-			rcu_assign_pointer(rth->dst.rt_next,
-					   rt_hash_table[hash].chain);
-			/*
-			 * Since lookup is lockfree, the update writes
-			 * must be ordered for consistency on SMP.
-			 */
-			rcu_assign_pointer(rt_hash_table[hash].chain, rth);
-
-			dst_use(&rth->dst, now);
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			rt_drop(rt);
-			if (rp)
-				*rp = rth;
-			else
-				skb_dst_set(skb, &rth->dst);
-			return 0;
-		}
-
-		if (!atomic_read(&rth->dst.__refcnt)) {
-			u32 score = rt_score(rth);
-
-			if (score <= min_score) {
-				cand = rth;
-				candp = rthp;
-				min_score = score;
-			}
-		}
-
-		chain_length++;
-
-		rthp = &rth->dst.rt_next;
-	}
-
-	if (cand) {
-		/* ip_rt_gc_elasticity used to be average length of chain
-		 * length, when exceeded gc becomes really aggressive.
-		 *
-		 * The second limit is less certain. At the moment it allows
-		 * only 2 entries per bucket. We will see.
-		 */
-		if (chain_length > ip_rt_gc_elasticity) {
-			*candp = cand->dst.rt_next;
-			rt_free(cand);
-		}
-	} else {
-		if (chain_length > rt_chain_length_max &&
-		    slow_chain_length(rt_hash_table[hash].chain) > rt_chain_length_max) {
-			struct net *net = dev_net(rt->dst.dev);
-			int num = ++net->ipv4.current_rt_cache_rebuild_count;
-			if (!rt_caching(net)) {
-				printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
-					rt->dst.dev->name, num);
-			}
-			rt_emergency_hash_rebuild(net);
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
-					ifindex, rt_genid(net));
-			goto restart;
-		}
-	}
-
-	/* Try to bind route to arp only if it is output
-	   route or unicast forwarding path.
+	/* To avoid expensive rcu stuff for this uncached dst, we set
+	 * DST_NOCACHE so that dst_release() can free dst without
+	 * waiting a grace period.
 	 */
+	rt->dst.flags |= DST_NOCACHE;
 	if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
 		int err = arp_bind_neighbour(&rt->dst);
 		if (err) {
-			spin_unlock_bh(rt_hash_lock_addr(hash));
-
-			if (err != -ENOBUFS) {
-				rt_drop(rt);
-				return err;
-			}
-
-			/* Neighbour tables are full and nothing
-			   can be released. Try to shrink route cache,
-			   it is most likely it holds some neighbour records.
-			 */
-			if (attempts-- > 0) {
-				int saved_elasticity = ip_rt_gc_elasticity;
-				int saved_int = ip_rt_gc_min_interval;
-				ip_rt_gc_elasticity	= 1;
-				ip_rt_gc_min_interval	= 0;
-				rt_garbage_collect(&ipv4_dst_ops);
-				ip_rt_gc_min_interval	= saved_int;
-				ip_rt_gc_elasticity	= saved_elasticity;
-				goto restart;
-			}
-
 			if (net_ratelimit())
-				printk(KERN_WARNING "ipv4: Neighbour table overflow.\n");
-			rt_drop(rt);
-			return -ENOBUFS;
+				printk(KERN_WARNING
+				       "Neighbour table failure & not caching routes.\n");
+			ip_rt_put(rt);
+			return err;
 		}
 	}
 
-	rt->dst.rt_next = rt_hash_table[hash].chain;
-
-#if RT_CACHE_DEBUG >= 2
-	if (rt->dst.rt_next) {
-		struct rtable *trt;
-		printk(KERN_DEBUG "rt_cache @%02x: %pI4",
-		       hash, &rt->rt_dst);
-		for (trt = rt->dst.rt_next; trt; trt = trt->dst.rt_next)
-			printk(" . %pI4", &trt->rt_dst);
-		printk("\n");
-	}
-#endif
-	/*
-	 * Since lookup is lockfree, we must make sure
-	 * previous writes to rt are comitted to memory
-	 * before making rt visible to other CPUS.
-	 */
-	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
-
-	spin_unlock_bh(rt_hash_lock_addr(hash));
-
-skip_hashing:
 	if (rp)
 		*rp = rt;
 	else
@@ -1270,26 +585,6 @@ void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more)
 }
 EXPORT_SYMBOL(__ip_select_ident);
 
-static void rt_del(unsigned hash, struct rtable *rt)
-{
-	struct rtable __rcu **rthp;
-	struct rtable *aux;
-
-	rthp = &rt_hash_table[hash].chain;
-	spin_lock_bh(rt_hash_lock_addr(hash));
-	ip_rt_put(rt);
-	while ((aux = rcu_dereference_protected(*rthp,
-			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (aux == rt || rt_is_expired(aux)) {
-			*rthp = aux->dst.rt_next;
-			rt_free(aux);
-			continue;
-		}
-		rthp = &aux->dst.rt_next;
-	}
-	spin_unlock_bh(rt_hash_lock_addr(hash));
-}
-
 /* called in rcu_read_lock() section */
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
@@ -1348,14 +643,11 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 			ip_rt_put(rt);
 			ret = NULL;
 		} else if (rt->rt_flags & RTCF_REDIRECTED) {
-			unsigned hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
-						rt->fl.oif,
-						rt_genid(dev_net(dst->dev)));
 #if RT_CACHE_DEBUG >= 1
 			printk(KERN_DEBUG "ipv4_negative_advice: redirect to %pI4/%02x dropped\n",
-				&rt->rt_dst, rt->fl.fl4_tos);
+			       &rt->rt_dst, rt->fl.fl4_tos);
 #endif
-			rt_del(hash, rt);
+			ip_rt_put(rt);
 			ret = NULL;
 		} else if (rt->peer &&
 			   rt->peer->pmtu_expires &&
@@ -1820,7 +1112,6 @@ static void rt_set_nexthop(struct rtable *rt, struct fib_result *res, u32 itag)
 static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 				u8 tos, struct net_device *dev, int our)
 {
-	unsigned int hash;
 	struct rtable *rth;
 	__be32 spec_dst;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -1887,8 +1178,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	RT_CACHE_STAT_INC(in_slow_mc);
 
-	hash = rt_hash(daddr, saddr, dev->ifindex, rt_genid(dev_net(dev)));
-	return rt_intern_hash(hash, rth, NULL, skb, dev->ifindex);
+	return rt_finalize(rth, NULL, skb);
 
 e_nobufs:
 	return -ENOBUFS;
@@ -2035,7 +1325,6 @@ static int ip_mkroute_input(struct sk_buff *skb,
 {
 	struct rtable* rth = NULL;
 	int err;
-	unsigned hash;
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (res->fi && res->fi->fib_nhs > 1 && fl->oif == 0)
@@ -2048,9 +1337,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
 		return err;
 
 	/* put it into the cache */
-	hash = rt_hash(daddr, saddr, fl->iif,
-		       rt_genid(dev_net(rth->dst.dev)));
-	return rt_intern_hash(hash, rth, NULL, skb, fl->iif);
+	return rt_finalize(rth, NULL, skb);
 }
 
 /*
@@ -2078,7 +1365,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	unsigned	flags = 0;
 	u32		itag = 0;
 	struct rtable * rth;
-	unsigned	hash;
 	__be32		spec_dst;
 	int		err = -EINVAL;
 	struct net    * net = dev_net(dev);
@@ -2197,8 +1483,7 @@ local_input:
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
 	rth->rt_type	= res.type;
-	hash = rt_hash(daddr, saddr, fl.iif, rt_genid(net));
-	err = rt_intern_hash(hash, rth, NULL, skb, fl.iif);
+	err = rt_finalize(rth, NULL, skb);
 	goto out;
 
 no_route:
@@ -2242,47 +1527,10 @@ martian_source_keep_err:
 int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			   u8 tos, struct net_device *dev, bool noref)
 {
-	struct rtable * rth;
-	unsigned	hash;
-	int iif = dev->ifindex;
-	struct net *net;
 	int res;
 
-	net = dev_net(dev);
-
 	rcu_read_lock();
 
-	if (!rt_caching(net))
-		goto skip_cache;
-
-	tos &= IPTOS_RT_MASK;
-	hash = rt_hash(daddr, saddr, iif, rt_genid(net));
-
-	for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
-	     rth = rcu_dereference(rth->dst.rt_next)) {
-		if ((((__force u32)rth->fl.fl4_dst ^ (__force u32)daddr) |
-		     ((__force u32)rth->fl.fl4_src ^ (__force u32)saddr) |
-		     (rth->fl.iif ^ iif) |
-		     rth->fl.oif |
-		     (rth->fl.fl4_tos ^ tos)) == 0 &&
-		    rth->fl.mark == skb->mark &&
-		    net_eq(dev_net(rth->dst.dev), net) &&
-		    !rt_is_expired(rth)) {
-			if (noref) {
-				dst_use_noref(&rth->dst, jiffies);
-				skb_dst_set_noref(skb, &rth->dst);
-			} else {
-				dst_use(&rth->dst, jiffies);
-				skb_dst_set(skb, &rth->dst);
-			}
-			RT_CACHE_STAT_INC(in_hit);
-			rcu_read_unlock();
-			return 0;
-		}
-		RT_CACHE_STAT_INC(in_hlist_search);
-	}
-
-skip_cache:
 	/* Multicast recognition logic is moved from route cache to here.
 	   The problem was that too many Ethernet cards have broken/missing
 	   hardware multicast filters :-( As result the host on multicasting
@@ -2439,12 +1687,9 @@ static int ip_mkroute_output(struct rtable **rp,
 {
 	struct rtable *rth = NULL;
 	int err = __mkroute_output(&rth, res, fl, oldflp, dev_out, flags);
-	unsigned hash;
-	if (err == 0) {
-		hash = rt_hash(oldflp->fl4_dst, oldflp->fl4_src, oldflp->oif,
-			       rt_genid(dev_net(dev_out)));
-		err = rt_intern_hash(hash, rth, rp, NULL, oldflp->oif);
-	}
+
+	if (err == 0)
+		err = rt_finalize(rth, rp, NULL);
 
 	return err;
 }
@@ -2635,38 +1880,8 @@ out:	return err;
 int __ip_route_output_key(struct net *net, struct rtable **rp,
 			  const struct flowi *flp)
 {
-	unsigned int hash;
 	int res;
-	struct rtable *rth;
 
-	if (!rt_caching(net))
-		goto slow_output;
-
-	hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
-
-	rcu_read_lock_bh();
-	for (rth = rcu_dereference_bh(rt_hash_table[hash].chain); rth;
-		rth = rcu_dereference_bh(rth->dst.rt_next)) {
-		if (rth->fl.fl4_dst == flp->fl4_dst &&
-		    rth->fl.fl4_src == flp->fl4_src &&
-		    rt_is_output_route(rth) &&
-		    rth->fl.oif == flp->oif &&
-		    rth->fl.mark == flp->mark &&
-		    !((rth->fl.fl4_tos ^ flp->fl4_tos) &
-			    (IPTOS_RT_MASK | RTO_ONLINK)) &&
-		    net_eq(dev_net(rth->dst.dev), net) &&
-		    !rt_is_expired(rth)) {
-			dst_use(&rth->dst, jiffies);
-			RT_CACHE_STAT_INC(out_hit);
-			rcu_read_unlock_bh();
-			*rp = rth;
-			return 0;
-		}
-		RT_CACHE_STAT_INC(out_hlist_search);
-	}
-	rcu_read_unlock_bh();
-
-slow_output:
 	rcu_read_lock();
 	res = ip_route_output_slow(net, rp, flp);
 	rcu_read_unlock();
@@ -2966,43 +2181,6 @@ errout_free:
 
 int ip_rt_dump(struct sk_buff *skb,  struct netlink_callback *cb)
 {
-	struct rtable *rt;
-	int h, s_h;
-	int idx, s_idx;
-	struct net *net;
-
-	net = sock_net(skb->sk);
-
-	s_h = cb->args[0];
-	if (s_h < 0)
-		s_h = 0;
-	s_idx = idx = cb->args[1];
-	for (h = s_h; h <= rt_hash_mask; h++, s_idx = 0) {
-		if (!rt_hash_table[h].chain)
-			continue;
-		rcu_read_lock_bh();
-		for (rt = rcu_dereference_bh(rt_hash_table[h].chain), idx = 0; rt;
-		     rt = rcu_dereference_bh(rt->dst.rt_next), idx++) {
-			if (!net_eq(dev_net(rt->dst.dev), net) || idx < s_idx)
-				continue;
-			if (rt_is_expired(rt))
-				continue;
-			skb_dst_set_noref(skb, &rt->dst);
-			if (rt_fill_info(net, skb, NETLINK_CB(cb->skb).pid,
-					 cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-					 1, NLM_F_MULTI) <= 0) {
-				skb_dst_drop(skb);
-				rcu_read_unlock_bh();
-				goto done;
-			}
-			skb_dst_drop(skb);
-		}
-		rcu_read_unlock_bh();
-	}
-
-done:
-	cb->args[0] = h;
-	cb->args[1] = idx;
 	return skb->len;
 }
 
@@ -3235,16 +2413,6 @@ static __net_initdata struct pernet_operations rt_genid_ops = {
 struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
 #endif /* CONFIG_IP_ROUTE_CLASSID */
 
-static __initdata unsigned long rhash_entries;
-static int __init set_rhash_entries(char *str)
-{
-	if (!str)
-		return 0;
-	rhash_entries = simple_strtoul(str, &str, 0);
-	return 1;
-}
-__setup("rhash_entries=", set_rhash_entries);
-
 int __init ip_rt_init(void)
 {
 	int rc = 0;
@@ -3267,21 +2435,8 @@ int __init ip_rt_init(void)
 	if (dst_entries_init(&ipv4_dst_blackhole_ops) < 0)
 		panic("IP: failed to allocate ipv4_dst_blackhole_ops counter\n");
 
-	rt_hash_table = (struct rt_hash_bucket *)
-		alloc_large_system_hash("IP route cache",
-					sizeof(struct rt_hash_bucket),
-					rhash_entries,
-					(totalram_pages >= 128 * 1024) ?
-					15 : 17,
-					0,
-					&rt_hash_log,
-					&rt_hash_mask,
-					rhash_entries ? 0 : 512 * 1024);
-	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
-	rt_hash_lock_init();
-
-	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
-	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	ipv4_dst_ops.gc_thresh = ~0;
+	ip_rt_max_size = INT_MAX;
 
 	devinet_init();
 	ip_fib_init();
-- 
1.7.4



^ permalink raw reply related

* Re: [PATCH 4/4] m68k/atari: ARAnyM - Add support for network access
From: Geert Uytterhoeven @ 2011-02-10  8:37 UTC (permalink / raw)
  To: David Miller
  Cc: linux-m68k, linux-kernel, aranym, schmitz, pstehlik, milan.jurik,
	netdev
In-Reply-To: <20110206.111705.183051866.davem@davemloft.net>

On Sun, Feb 6, 2011 at 20:17, David Miller <davem@davemloft.net> wrote:
> From: Geert Uytterhoeven <geert@linux-m68k.org>
> Date: Sun,  6 Feb 2011 11:51:09 +0100
>
>> +     dev->trans_start = jiffies;
>
> Device drivers no longer make this operation, the generic code
> does it (see net/core/dev.c:dev_hard_start_xmit() and how it
> invokes txq_trans_update() on ->ndo_start_xmit() success).
>
> Therefore, please remove this line.

Will do.

(180 more to go in drivers/net/?)

>> +     pr_debug(DRV_NAME ": send %d bytes\n", len);
>
> For consistency with other network drivers, add an appropriate CPP
> define for "pr_fmt" and use netdev_info(), netdev_debug(), etc.
>
> In situations where a netdev pointer is not available
> (ie. pre-register_netdev()), use "dev_*()" instead.

Will fix.
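
For reference, the pr_fmt convention mentioned above can be illustrated outside
the kernel. This is only a userspace sketch: the value of DRV_NAME and the
helper format_send_msg() are hypothetical, and snprintf() stands in for the
kernel's pr_debug(). In a driver, the define would go before the #include lines
so that the printk helpers pick it up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DRV_NAME "mydrv"	/* hypothetical driver name */

/* Same shape as the kernel convention: the prefix is written once. */
#define pr_fmt(fmt) DRV_NAME ": " fmt

/* Userspace stand-in for pr_debug(): formats into buf instead of the log. */
static int format_send_msg(char *buf, size_t size, int len)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, pr_fmt("send %d bytes\n"), len);
}
```

With this in place the call site no longer repeats DRV_NAME; the prefix is
prepended by the macro at compile time through string literal concatenation.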

Thanks for reviewing!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: Vlad Dogaru @ 2011-02-10  9:11 UTC (permalink / raw)
  To: David Miller; +Cc: dfeng, netdev, eric.dumazet, therbert, ebiederm, shemminger
In-Reply-To: <20110209.140323.39178091.davem@davemloft.net>

On Wed, Feb 09, 2011 at 02:03:23PM -0800, David Miller wrote:
> From: Xiaotian feng <dfeng@redhat.com>
> Date: Wed,  9 Feb 2011 18:52:49 +0800
> 
> > From: Xiaotian Feng <dfeng@redhat.com>
> > 
> > commit a512b92 adds sysfs entry for net device group, but
> > before this commit, tun also uses group sysfs, so after this
> > commit checkin, kernel warns like this:
> >     sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
> > 
> > Since tun has used this for years, renaming the sysfs entry under tun might
> > break existing userspace, so renaming the group sysfs entry for the net
> > device group is a better choice.

I was not aware of that, sorry for breaking things :)

^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 1/1] can: c_can: Added support for Bosch C_CAN controller
From: Marc Kleine-Budde @ 2011-02-10  9:28 UTC (permalink / raw)
  To: Bhupesh SHARMA
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Wolfgang Grandegger
In-Reply-To: <D5ECB3C7A6F99444980976A8C6D896384DEE36615D-8vAmw3ZAcdzhJTuQ9jeba9BPR1lH4CV8@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 18526 bytes --]

Hello

On 02/09/2011 11:54 AM, Bhupesh SHARMA wrote:
[...]
>> The driver looks quite good, some comments inline, most of them
>> nitpicking and or style related.
>>
>> Have a look at the netif_stop_queue(). In the at91 driver there are two
>> possibilities to stop the queue. First the next tx mailbox is
>> still in use, second we have a wrap around. But your hardware is a bit
>> different. Anyways a second look doesn't harm.
> 
> As the Tx/Rx path of my driver are based on at91 driver, I have earlier
> gone through the possibility of stopping the tx queue in the two cases as
> you mentioned above :)
> 
> As per at91 specs, MRDY=0 signifies:
> "Mailbox data registers cannot be read/written by the software application"
> 
> But, per the Bosch C_CAN specs, a bit set to 1 in the Transmission Request
> Register (TxRqst) signifies that the transmission of this Message Object is
> requested and is not yet done. If you agree, we can add a check against the same here.
> Please do go through the Bosch C_CAN specs for details:
> http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/c_can/users_manual_c_can.pdf

Better safe than sorry; add the check.
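
A minimal sketch of what such a check could look like, written as plain C so it
can be tried outside the driver. It assumes (not verified against the manual
here) that bit n-1 of the combined txrqst1/txrqst2 value corresponds to message
object n; both helper names are hypothetical and not part of the posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Combine the two 16-bit TxRqst registers into one 32-bit value. */
static uint32_t txrqst_combine(uint16_t txrqst1, uint16_t txrqst2)
{
	return (uint32_t)txrqst1 | ((uint32_t)txrqst2 << 16);
}

/*
 * True if a transmission is still requested for message object objno.
 * Assumption: objects are numbered 1..32 and object n maps to bit n-1.
 */
static bool tx_obj_pending(uint32_t txrqst, int objno)
{
	assert(objno >= 1 && objno <= 32);
	return (txrqst >> (objno - 1)) & 1u;
}
```

In the xmit path the driver could then stop the queue when the next tx object
is still pending, in addition to the wrap-around case.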

> 
>>> ---
>>> Changes since V5:
>>> 1. Separated the state change and bus error handling paths.
>>> 2. Added logic to write LEC value to 0x7 from CPU to check for updates
>>>    later.
>>> 3. Corrected the ERROR_WARNING handling logic to correctly send error
>>>    frames on the bus.
>>>
>>>  drivers/net/can/Kconfig                |    2 +
>>>  drivers/net/can/Makefile               |    1 +
>>>  drivers/net/can/c_can/Kconfig          |   15 +
>>>  drivers/net/can/c_can/Makefile         |    8 +
>>>  drivers/net/can/c_can/c_can.c          |  993 ++++++++++++++++++++++++++++++++
>>>  drivers/net/can/c_can/c_can.h          |  230 ++++++++
>>>  drivers/net/can/c_can/c_can_platform.c |  207 +++++++
>>>  7 files changed, 1456 insertions(+), 0 deletions(-)
>>>  create mode 100644 drivers/net/can/c_can/Kconfig
>>>  create mode 100644 drivers/net/can/c_can/Makefile
>>>  create mode 100644 drivers/net/can/c_can/c_can.c
>>>  create mode 100644 drivers/net/can/c_can/c_can.h
>>>  create mode 100644 drivers/net/can/c_can/c_can_platform.c
>>>
>>> diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig index
>>> 5dec456..1d699e3 100644
>>> --- a/drivers/net/can/Kconfig
>>> +++ b/drivers/net/can/Kconfig
>>> @@ -115,6 +115,8 @@ source "drivers/net/can/mscan/Kconfig"
>>>
>>>  source "drivers/net/can/sja1000/Kconfig"
>>>
>>> +source "drivers/net/can/c_can/Kconfig"
>>> +
>>>  source "drivers/net/can/usb/Kconfig"
>>>
>>>  source "drivers/net/can/softing/Kconfig"
>>> diff --git a/drivers/net/can/Makefile b/drivers/net/can/Makefile
>> index
>>> 53c82a7..24ebfe8 100644
>>> --- a/drivers/net/can/Makefile
>>> +++ b/drivers/net/can/Makefile
>>> @@ -13,6 +13,7 @@ obj-y                             += softing/
>>>
>>>  obj-$(CONFIG_CAN_SJA1000)  += sja1000/
>>>  obj-$(CONFIG_CAN_MSCAN)            += mscan/
>>> +obj-$(CONFIG_CAN_C_CAN)            += c_can/
>>>  obj-$(CONFIG_CAN_AT91)             += at91_can.o
>>>  obj-$(CONFIG_CAN_TI_HECC)  += ti_hecc.o
>>>  obj-$(CONFIG_CAN_MCP251X)  += mcp251x.o
>>> diff --git a/drivers/net/can/c_can/Kconfig
>>> b/drivers/net/can/c_can/Kconfig new file mode 100644 index
>>> 0000000..ffb9773
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/Kconfig
>>> @@ -0,0 +1,15 @@
>>> +menuconfig CAN_C_CAN
>>> +   tristate "Bosch C_CAN devices"
>>> +   depends on CAN_DEV && HAS_IOMEM
>>> +
>>> +if CAN_C_CAN
>>> +
>>> +config CAN_C_CAN_PLATFORM
>>> +   tristate "Generic Platform Bus based C_CAN driver"
>>> +   ---help---
>>> +     This driver adds support for the C_CAN chips connected to
>>> +     the "platform bus" (Linux abstraction for directly to the
>>> +     processor attached devices) which can be found on various
>>> +     boards from ST Microelectronics (http://www.st.com)
>>> +     like the SPEAr1310 and SPEAr320 evaluation boards.
>>> +endif
>>> diff --git a/drivers/net/can/c_can/Makefile
>>> b/drivers/net/can/c_can/Makefile new file mode 100644 index
>>> 0000000..9273f6d
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/Makefile
>>> @@ -0,0 +1,8 @@
>>> +#
>>> +#  Makefile for the Bosch C_CAN controller drivers.
>>> +#
>>> +
>>> +obj-$(CONFIG_CAN_C_CAN) += c_can.o
>>> +obj-$(CONFIG_CAN_C_CAN_PLATFORM) += c_can_platform.o
>>> +
>>> +ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG
>>> diff --git a/drivers/net/can/c_can/c_can.c
>>> b/drivers/net/can/c_can/c_can.c new file mode 100644 index
>>> 0000000..7ef4aa9
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/c_can.c
>>> @@ -0,0 +1,993 @@
>>> +/*
>>> + * CAN bus driver for Bosch C_CAN controller
>>> + *
>>> + * Copyright (C) 2010 ST Microelectronics
>>> + * Bhupesh Sharma <bhupesh.sharma-qxv4g6HH51o@public.gmane.org>
>>> + *
>>> + * Borrowed heavily from the C_CAN driver originally written by:
>>> + * Copyright (C) 2007
>>> + * - Sascha Hauer, Marc Kleine-Budde, Pengutronix
>>> +<s.hauer-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
>>> + * - Simon Kallweit, intefo AG <simon.kallweit-+G9qxTFKJT/tRgLqZ5aouw@public.gmane.org>
>>> + *
>>> + * TX and RX NAPI implementation has been borrowed from at91 CAN
>>> +driver
>>> + * written by:
>>> + * Copyright
>>> + * (C) 2007 by Hans J. Koch <hjk-vqZO0P4V72/QD6PfKP4TzA@public.gmane.org>
>>> + * (C) 2008, 2009 by Marc Kleine-Budde <kernel-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
>>> + *
>>> + * Bosch C_CAN controller is compliant to CAN protocol version 2.0
>> part A and B.
>>> + * Bosch C_CAN user manual can be obtained from:
>>> + *
>> http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/c_can/
>>> + * users_manual_c_can.pdf
>>> + *
>>> + * This file is licensed under the terms of the GNU General Public
>>> + * License version 2. This program is licensed "as is" without any
>>> + * warranty of any kind, whether express or implied.
>>> + */
>>> +
>>> +#include <linux/kernel.h>
>>> +#include <linux/version.h>
>>> +#include <linux/module.h>
>>> +#include <linux/interrupt.h>
>>> +#include <linux/delay.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/if_arp.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/list.h>
>>> +#include <linux/delay.h>
>>> +#include <linux/io.h>
>>> +
>>> +#include <linux/can.h>
>>> +#include <linux/can/dev.h>
>>> +#include <linux/can/error.h>
>>> +
>>> +#include "c_can.h"
>>> +
>>> +static struct can_bittiming_const c_can_bittiming_const = {
>>> +   .name = KBUILD_MODNAME,
>>> +   .tseg1_min = 2,         /* Time segment 1 = prop_seg + phase_seg1
>> */
>>> +   .tseg1_max = 16,
>>> +   .tseg2_min = 1,         /* Time segment 2 = phase_seg2 */
>>> +   .tseg2_max = 8,
>>> +   .sjw_max = 4,
>>> +   .brp_min = 1,
>>> +   .brp_max = 1024,        /* 6-bit BRP field + 4-bit BRPE field*/
>>> +   .brp_inc = 1,
>>> +};
>>> +
>>> +static inline int get_tx_next_msg_obj(const struct c_can_priv *priv)
>>> +{
>>> +   return (priv->tx_next & C_CAN_NEXT_MSG_OBJ_MASK) +
>>> +                   C_CAN_MSG_OBJ_TX_FIRST;
>>> +}
>>> +
>>> +static inline int get_tx_echo_msg_obj(const struct c_can_priv *priv)
>>> +{
>>> +   return (priv->tx_echo & C_CAN_NEXT_MSG_OBJ_MASK) +
>>> +                   C_CAN_MSG_OBJ_TX_FIRST;
>>> +}
>>> +
>>> +static u32 c_can_read_reg32(struct c_can_priv *priv, void *reg)
>>> +{
>>> +   u32 val = priv->read_reg(priv, reg);
>>> +   val |= ((u32) priv->read_reg(priv, reg + 2)) << 16;
>>> +   return val;
>>> +}
>>> +
>>> +static void c_can_enable_all_interrupts(struct c_can_priv *priv,
>>> +                                           int enable)
>>> +{
>>> +   unsigned int cntrl_save = priv->read_reg(priv,
>>> +                                           &priv->regs->control);
>>> +
>>> +   if (enable)
>>> +           cntrl_save |= (CONTROL_SIE | CONTROL_EIE | CONTROL_IE);
>>> +   else
>>> +           cntrl_save &= ~(CONTROL_EIE | CONTROL_IE | CONTROL_SIE);
>>> +
>>> +   priv->write_reg(priv, &priv->regs->control, cntrl_save);
>>> +}
>>> +
>>> +static inline int c_can_check_busy_status(struct c_can_priv *priv, int iface)
>>> +{
>>> +   int count = MIN_TIMEOUT_VALUE;
>>> +
>>> +   while (count && priv->read_reg(priv,
>>> +                           &priv->regs->ifregs[iface].com_req) &
>>> +                           IF_COMR_BUSY) {
>>> +           count--;
>>> +           udelay(1);
>>> +   }
>>> +
>>> +   return count;
>>
>> it's an unusual return value...maybe return 0 on success and -EBUSY
>> otherwise?
> 
> Hmm.. this will add the checking MIN_TIMEOUT_VALUE against 0 here,

You mean check "count" against 0 here...

> instead of "c_can_object_get" and "c_can_object_put" routines.
> If you persist we can add the same in V7 though.. :)

We have a function "c_can_check_busy_status()". What does it return? The
name doesn't tell me. I think it would be more clear if you just rename
the function to "c_can_is_busy()" or "c_can_object_is_busy()".

Then you can use it like this:

if (c_can_is_busy()) {
	printk("mailbox still busy!\n");
	return -EWHATEVER;
}
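
Fleshed out a little, the idea might look like the sketch below. It is plain
userspace C, not driver code: read_reg_fn and fake_com_req() are hypothetical
stand-ins for priv->read_reg() and the IFx command request register, and
c_can_object_get_checked() is an invented name that only shows how the caller
could propagate the error. IF_COMR_BUSY and MIN_TIMEOUT_VALUE are taken from
the patch's c_can.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define IF_COMR_BUSY		(1u << 15)
#define MIN_TIMEOUT_VALUE	6

/* Hypothetical stand-in for priv->read_reg() on the com_req register. */
typedef uint16_t (*read_reg_fn)(void *ctx);

/* The name says what the return value means: true == still busy. */
static bool c_can_object_is_busy(read_reg_fn read_com_req, void *ctx)
{
	int count = MIN_TIMEOUT_VALUE;

	assert(read_com_req);
	while (count-- > 0) {
		if (!(read_com_req(ctx) & IF_COMR_BUSY))
			return false;
		/* the driver would udelay(1) here */
	}
	return true;
}

/* Caller side: turn "busy" into an errno instead of only logging it. */
static int c_can_object_get_checked(read_reg_fn read_com_req, void *ctx)
{
	if (c_can_object_is_busy(read_com_req, ctx))
		return -EBUSY;
	return 0;
}

/* Fake register: reports BUSY for the first *ctx reads, then idle. */
static uint16_t fake_com_req(void *ctx)
{
	int *busy_reads = ctx;

	if (*busy_reads > 0) {
		(*busy_reads)--;
		return IF_COMR_BUSY;
	}
	return 0;
}
```

A register that clears BUSY within the timeout gives 0; one that stays busy for
all MIN_TIMEOUT_VALUE polls gives -EBUSY, which the caller can act on.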

> 
>>> +}
>>> +
>>> +static inline void c_can_object_get(struct net_device *dev,
>>> +                                   int iface, int objno, int mask)
>>> +{
>>> +   int ret;
>>> +   struct c_can_priv *priv = netdev_priv(dev);
>>> +
>>> +   /*
>>> +    * As per specs, after writing the message object number in the
>>> +    * IF command request register the transfer b/w interface
>>> +    * register and message RAM must be complete in 6 CAN-CLK
>>> +    * period.
>>> +    */
>>> +   priv->write_reg(priv, &priv->regs->ifregs[iface].com_mask,
>>> +                   IFX_WRITE_LOW_16BIT(mask));
>>> +   priv->write_reg(priv, &priv->regs->ifregs[iface].com_req,
>>> +                   IFX_WRITE_LOW_16BIT(objno));
>>> +
>>> +   ret = c_can_check_busy_status(priv, iface);
>>> +   if (!ret)
>>> +           netdev_err(dev, "timed out in object get\n");
>>> +}

There's no error handling for the case where the object is busy...


[...]

>>> diff --git a/drivers/net/can/c_can/c_can.h
>>> b/drivers/net/can/c_can/c_can.h new file mode 100644 index
>>> 0000000..bd094e6
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/c_can.h
>>> @@ -0,0 +1,230 @@
>>> +/*
>>> + * CAN bus driver for Bosch C_CAN controller
>>> + *
>>> + * Copyright (C) 2010 ST Microelectronics
>>> + * Bhupesh Sharma <bhupesh.sharma-qxv4g6HH51o@public.gmane.org>
>>> + *
>>> + * Borrowed heavily from the C_CAN driver originally written by:
>>> + * Copyright (C) 2007
>>> + * - Sascha Hauer, Marc Kleine-Budde, Pengutronix
>>> +<s.hauer-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
>>> + * - Simon Kallweit, intefo AG <simon.kallweit-+G9qxTFKJT/tRgLqZ5aouw@public.gmane.org>
>>> + *
>>> + * Bosch C_CAN controller is compliant to CAN protocol version 2.0
>> part A and B.
>>> + * Bosch C_CAN user manual can be obtained from:
>>> + *
>> http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/c_can/
>>> + * users_manual_c_can.pdf
>>> + *
>>> + * This file is licensed under the terms of the GNU General Public
>>> + * License version 2. This program is licensed "as is" without any
>>> + * warranty of any kind, whether express or implied.
>>> + */
>>> +
>>> +#ifndef C_CAN_H
>>> +#define C_CAN_H
>>> +
>>> +/* control register */
>>> +#define CONTROL_TEST               BIT(7)
>>> +#define CONTROL_CCE                BIT(6)
>>> +#define CONTROL_DISABLE_AR BIT(5)
>>> +#define CONTROL_ENABLE_AR  (0 << 5)
>>> +#define CONTROL_EIE                BIT(3)
>>> +#define CONTROL_SIE                BIT(2)
>>> +#define CONTROL_IE         BIT(1)
>>> +#define CONTROL_INIT               BIT(0)
>>> +
>>> +/* test register */
>>> +#define TEST_RX                    BIT(7)
>>> +#define TEST_TX1           BIT(6)
>>> +#define TEST_TX2           BIT(5)
>>> +#define TEST_LBACK         BIT(4)
>>> +#define TEST_SILENT                BIT(3)
>>> +#define TEST_BASIC         BIT(2)
>>> +
>>> +/* status register */
>>> +#define STATUS_BOFF                BIT(7)
>>> +#define STATUS_EWARN               BIT(6)
>>> +#define STATUS_EPASS               BIT(5)
>>> +#define STATUS_RXOK                BIT(4)
>>> +#define STATUS_TXOK                BIT(3)
>>> +
>>> +/* error counter register */
>>> +#define ERR_CNT_TEC_MASK   0xff
>>> +#define ERR_CNT_TEC_SHIFT  0
>>> +#define ERR_CNT_REC_SHIFT  8
>>> +#define ERR_CNT_REC_MASK   (0x7f << ERR_CNT_REC_SHIFT)
>>> +#define ERR_CNT_RP_SHIFT   15
>>> +#define ERR_CNT_RP_MASK            (0x1 << ERR_CNT_RP_SHIFT)
>>> +
>>> +/* bit-timing register */
>>> +#define BTR_BRP_MASK               0x3f
>>> +#define BTR_BRP_SHIFT              0
>>> +#define BTR_SJW_SHIFT              6
>>> +#define BTR_SJW_MASK               (0x3 << BTR_SJW_SHIFT)
>>> +#define BTR_TSEG1_SHIFT            8
>>> +#define BTR_TSEG1_MASK             (0xf << BTR_TSEG1_SHIFT)
>>> +#define BTR_TSEG2_SHIFT            12
>>> +#define BTR_TSEG2_MASK             (0x7 << BTR_TSEG2_SHIFT)
>>> +
>>> +/* brp extension register */
>>> +#define BRP_EXT_BRPE_MASK  0x0f
>>> +#define BRP_EXT_BRPE_SHIFT 0
>>> +
>>> +/* IFx command request */
>>> +#define IF_COMR_BUSY               BIT(15)
>>> +
>>> +/* IFx command mask */
>>> +#define IF_COMM_WR         BIT(7)
>>> +#define IF_COMM_MASK               BIT(6)
>>> +#define IF_COMM_ARB                BIT(5)
>>> +#define IF_COMM_CONTROL            BIT(4)
>>> +#define IF_COMM_CLR_INT_PND        BIT(3)
>>> +#define IF_COMM_TXRQST             BIT(2)
>>> +#define IF_COMM_DATAA              BIT(1)
>>> +#define IF_COMM_DATAB              BIT(0)
>>> +#define IF_COMM_ALL                (IF_COMM_MASK | IF_COMM_ARB | \
>>> +                           IF_COMM_CONTROL | IF_COMM_TXRQST | \
>>> +                           IF_COMM_DATAA | IF_COMM_DATAB)
>>> +
>>> +/* IFx arbitration */
>>> +#define IF_ARB_MSGVAL              BIT(15)
>>> +#define IF_ARB_MSGXTD              BIT(14)
>>> +#define IF_ARB_TRANSMIT            BIT(13)
>>> +
>>> +/* IFx message control */
>>> +#define IF_MCONT_NEWDAT            BIT(15)
>>> +#define IF_MCONT_MSGLST            BIT(14)
>>> +#define IF_MCONT_CLR_MSGLST        (0 << 14)
>>> +#define IF_MCONT_INTPND            BIT(13)
>>> +#define IF_MCONT_UMASK             BIT(12)
>>> +#define IF_MCONT_TXIE              BIT(11)
>>> +#define IF_MCONT_RXIE              BIT(10)
>>> +#define IF_MCONT_RMTEN             BIT(9)
>>> +#define IF_MCONT_TXRQST            BIT(8)
>>> +#define IF_MCONT_EOB               BIT(7)
>>> +#define IF_MCONT_DLC_MASK  0xf
>>> +
>>> +/*
>>> + * IFx register masks:
>>> + * allow easy operation on 16-bit registers when the
>>> + * argument is 32-bit instead
>>> + */
>>> +#define IFX_WRITE_LOW_16BIT(x)     ((x) & 0xFFFF)
>>> +#define IFX_WRITE_HIGH_16BIT(x)    (((x) & 0xFFFF0000) >> 16)
>>> +
>>> +/* message object split */
>>> +#define C_CAN_NO_OF_OBJECTS        32
>>> +#define C_CAN_MSG_OBJ_RX_NUM       16
>>> +#define C_CAN_MSG_OBJ_TX_NUM       16
>>> +
>>> +#define C_CAN_MSG_OBJ_RX_FIRST     1
>>> +#define C_CAN_MSG_OBJ_RX_LAST      (C_CAN_MSG_OBJ_RX_FIRST + \
>>> +                           C_CAN_MSG_OBJ_RX_NUM - 1)
>>> +
>>> +#define C_CAN_MSG_OBJ_TX_FIRST     (C_CAN_MSG_OBJ_RX_LAST + 1)
>>> +#define C_CAN_MSG_OBJ_TX_LAST      (C_CAN_MSG_OBJ_TX_FIRST + \
>>> +                           C_CAN_MSG_OBJ_TX_NUM - 1)
>>> +
>>> +#define C_CAN_MSG_OBJ_RX_SPLIT     9
>>> +#define C_CAN_MSG_RX_LOW_LAST      (C_CAN_MSG_OBJ_RX_SPLIT - 1)
>>> +
>>> +#define C_CAN_NEXT_MSG_OBJ_MASK    (C_CAN_MSG_OBJ_TX_NUM - 1)
>>> +#define RECEIVE_OBJECT_BITS        0x0000ffff
>>> +
>>> +/* status interrupt */
>>> +#define STATUS_INTERRUPT   0x8000
>>> +
>>> +/* global interrupt masks */
>>> +#define ENABLE_ALL_INTERRUPTS      1
>>> +#define DISABLE_ALL_INTERRUPTS     0
>>> +
>>> +/* minimum timeout for checking BUSY status */
>>> +#define MIN_TIMEOUT_VALUE  6
>>> +
>>> +/* napi related */
>>> +#define C_CAN_NAPI_WEIGHT  C_CAN_MSG_OBJ_RX_NUM
>>> +
>>> +/* c_can IF registers */
>>> +struct c_can_if_regs {
>>> +   u16 com_req;
>>> +   u16 com_mask;
>>> +   u16 mask1;
>>> +   u16 mask2;
>>> +   u16 arb1;
>>> +   u16 arb2;
>>> +   u16 msg_cntrl;
>>> +   u16 data[4];
>>> +   u16 _reserved[13];
>>> +};
>>> +
>>> +/* c_can hardware registers */
>>> +struct c_can_regs {
>>> +   u16 control;
>>> +   u16 status;
>>> +   u16 err_cnt;
>>> +   u16 btr;
>>> +   u16 interrupt;
>>> +   u16 test;
>>> +   u16 brp_ext;
>>> +   u16 _reserved1;
>>> +   struct c_can_if_regs ifregs[2]; /* [0] = IF1 and [1] = IF2 */
>>> +   u16 _reserved2[8];
>>> +   u16 txrqst1;
>>> +   u16 txrqst2;
>>> +   u16 _reserved3[6];
>>> +   u16 newdat1;
>>> +   u16 newdat2;
>>> +   u16 _reserved4[6];
>>> +   u16 intpnd1;
>>> +   u16 intpnd2;
>>> +   u16 _reserved5[6];
>>> +   u16 msgval1;
>>> +   u16 msgval2;
>>> +   u16 _reserved6[6];
>>> +};
>>> +
>>> +/* c_can lec values */
>>> +enum c_can_lec_type {
>>> +   LEC_NO_ERROR = 0,
>>> +   LEC_STUFF_ERROR,
>>> +   LEC_FORM_ERROR,
>>> +   LEC_ACK_ERROR,
>>> +   LEC_BIT1_ERROR,
>>> +   LEC_BIT0_ERROR,
>>> +   LEC_CRC_ERROR,
>>> +   LEC_UNUSED,
>>> +};
>>> +
>>> +/*
>>> + * c_can error types:
>>> + * Bus errors (BUS_OFF, ERROR_WARNING, ERROR_PASSIVE) are supported
>>> + */
>>> +enum c_can_bus_error_types {
>>> +   C_CAN_NO_ERROR = 0,
>>> +   C_CAN_BUS_OFF,
>>> +   C_CAN_ERROR_WARNING,
>>> +   C_CAN_ERROR_PASSIVE,
>>> +};
>>
>> nitpick: are the defines, enums and structs needed in more than one c
>> file? If not, please move them into the c-file where they are used.
> 
> Well most of the structs/defines are useful in both `c_can.c` and
> `c_can_platform.c`, but I will explore how to separate the rest in
> the respective c-files. But initially we agreed to a *sja1000* like
> approach and hence this placement in h-file.

Yes, you're right.

But keeping things that are used in only one .c file out of the .h file is
a good rule for cleaner code :)

regards, Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 188 bytes --]

_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core

^ permalink raw reply

* [RFD][PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:14 UTC (permalink / raw)
  To: netdev; +Cc: rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm

Hi folks,

This patch implements an extension for BPF to allow filter programs to use a
data section, along with a MEMCMP instruction.

There are a few issues noted in the patch itself, which can easily be
addressed, and I would like to check whether sk_run_filter() is ever expected to
be called from a context that cannot sleep (I don't think it is).

I think the patch should probably be split into a patch to add data sections
and one adding the JMEMCMP instruction, but that can be done after some review! 

-Ian

^ permalink raw reply

* [RFD][PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:31 UTC (permalink / raw)
  To: netdev; +Cc: rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
	alban.crequy

 Documentation/networking/filter.txt |    9 ++
 drivers/isdn/i4l/isdn_ppp.c         |    2 
 drivers/net/ppp_generic.c           |    2 
 include/asm-generic/socket.h        |    2 
 include/linux/filter.h              |   17 ++++-
 include/linux/ptp_classify.h        |    2 
 net/core/filter.c                   |  115 ++++++++++++++++++++++++++++++++++--
 net/core/sock.c                     |   14 ++++
 net/core/timestamping.c             |    4 -
 net/packet/af_packet.c              |    3 

This patch adds support for adding a data section to BPF. It is intended to be
used by the JMEMCMP instruction also added in this patch.

There are some issues, mostly noted in the commit message, and I'd like to
check that sk_run_filter() does not get called from a context that cannot sleep
(I don't think so).

Comments welcome!

-Ian

^ permalink raw reply

* [PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:31 UTC (permalink / raw)
  To: netdev
  Cc: rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
	alban.crequy, Ian Molton
In-Reply-To: <1297341067-12264-1-git-send-email-ian.molton@collabora.co.uk>

This patch allows a data section to be specified for BPF.

This is made use of by a MEMCMP like instruction.

Testsuite here:
http://git.collabora.co.uk/?p=user/ian/check-bpf.git;a=summary

Issues:
* Do I need to update the headers for all arches, or just generic
* Can sk_run_filter() be called in a context where kmalloc(GFP_KERNEL) is
  not allowed (I think not)
* Data section allocated with second call to sock_kmalloc().
* Should the patch be broken into two - one to add the data uploading,
  one to add the JMEMCMP insn. ?
---
 Documentation/networking/filter.txt |    9 +++
 drivers/isdn/i4l/isdn_ppp.c         |    2 +-
 drivers/net/ppp_generic.c           |    2 +-
 include/asm-generic/socket.h        |    2 +
 include/linux/filter.h              |   17 +++++-
 include/linux/ptp_classify.h        |    2 +-
 net/core/filter.c                   |  115 +++++++++++++++++++++++++++++++++-
 net/core/sock.c                     |   14 ++++
 net/core/timestamping.c             |    4 +-
 net/packet/af_packet.c              |    3 +-
 10 files changed, 158 insertions(+), 12 deletions(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index bbf2005..d6efb5f 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -31,6 +31,15 @@ the old one and placing your new one in its place, assuming your
 filter has passed the checks, otherwise if it fails the old filter
 will remain on that socket.
 
+Linux packet filters also provide a facility to upload a data section
+for use with the JMEMCMP instruction. This is done using the 
+SO_ATTACH_FILTER_WITH_DATA parameter to setsockopt().
+
+The JMEMCMP instruction allows arbitrary comparisons between the packet
+data and the filter's data. The K, A, and X registers provide the offset
+into the data, the number of bytes to compare, and the offset into the
+packet, respectively.
+
 Examples
 ========
 
diff --git a/drivers/isdn/i4l/isdn_ppp.c b/drivers/isdn/i4l/isdn_ppp.c
index 9e8162c..1a0d513 100644
--- a/drivers/isdn/i4l/isdn_ppp.c
+++ b/drivers/isdn/i4l/isdn_ppp.c
@@ -453,7 +453,7 @@ static int get_filter(void __user *arg, struct sock_filter **p)
 	if (IS_ERR(code))
 		return PTR_ERR(code);
 
-	err = sk_chk_filter(code, uprog.len);
+	err = sk_chk_filter(code, uprog.len, NULL);
 	if (err) {
 		kfree(code);
 		return err;
diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 9f6d670..345c3ac 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -542,7 +542,7 @@ static int get_filter(void __user *arg, struct sock_filter **p)
 	if (IS_ERR(code))
 		return PTR_ERR(code);
 
-	err = sk_chk_filter(code, uprog.len);
+	err = sk_chk_filter(code, uprog.len, NULL);
 	if (err) {
 		kfree(code);
 		return err;
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..83458b9 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -64,4 +64,6 @@
 #define SO_DOMAIN		39
 
 #define SO_RXQ_OVFL             40
+
+#define SO_ATTACH_FILTER_WITH_DATA	41
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45266b7..c290e17 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -35,6 +35,13 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 	struct sock_filter __user *filter;
 };
 
+struct sock_fprog_with_data { /* Required for SO_ATTACH_FILTER_WITH_DATA. */
+	unsigned short			len;    /* Number of filter blocks */
+	unsigned short			data_len;
+	__u8 __user			*data; /* Program data section */
+	struct sock_filter __user	*filter;
+};
+
 /*
  * Instruction classes
  */
@@ -78,6 +85,7 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_JGT         0x20
 #define         BPF_JGE         0x30
 #define         BPF_JSET        0x40
+#define         BPF_JMEMCMP     0x50
 #define BPF_SRC(code)   ((code) & 0x08)
 #define         BPF_K           0x00
 #define         BPF_X           0x08
@@ -136,6 +144,8 @@ struct sk_filter
 	atomic_t		refcnt;
 	unsigned int         	len;	/* Number of filter blocks */
 	struct rcu_head		rcu;
+	u8			*data;
+	unsigned int		data_len;
 	struct sock_filter     	insns[0];
 };
 
@@ -149,10 +159,13 @@ struct sock;
 
 extern int sk_filter(struct sock *sk, struct sk_buff *skb);
 extern unsigned int sk_run_filter(const struct sk_buff *skb,
-				  const struct sock_filter *filter);
+				  const struct sock_filter *filter,
+				  const u8 *data, const unsigned int data_len);
 extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+extern int sk_attach_filter_with_data(struct sock_fprog_with_data *fprog,
+					struct sock *sk);
 extern int sk_detach_filter(struct sock *sk);
-extern int sk_chk_filter(struct sock_filter *filter, int flen);
+extern int sk_chk_filter(struct sock_filter *filter, int flen, u8 *data);
 #endif /* __KERNEL__ */
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/linux/ptp_classify.h b/include/linux/ptp_classify.h
index 943a85a..bfe8f7a 100644
--- a/include/linux/ptp_classify.h
+++ b/include/linux/ptp_classify.h
@@ -79,7 +79,7 @@
 static inline int ptp_filter_init(struct sock_filter *f, int len)
 {
 	if (OP_LDH == f[0].code)
-		return sk_chk_filter(f, len);
+		return sk_chk_filter(f, len, NULL);
 	else
 		return 0;
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index 232b187..eb5f4e2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,7 @@ enum {
 	BPF_S_JMP_JGT_X,
 	BPF_S_JMP_JSET_K,
 	BPF_S_JMP_JSET_X,
+	BPF_S_JMP_MEMCMP,
 	/* Ancillary data */
 	BPF_S_ANC_PROTOCOL,
 	BPF_S_ANC_PKTTYPE,
@@ -145,7 +146,9 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 	rcu_read_lock();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter) {
-		unsigned int pkt_len = sk_run_filter(skb, filter->insns);
+		unsigned int pkt_len = sk_run_filter(skb, filter->insns,
+							filter->data,
+							filter->data_len);
 
 		err = pkt_len ? pskb_trim(skb, pkt_len) : -EPERM;
 	}
@@ -168,7 +171,8 @@ EXPORT_SYMBOL(sk_filter);
  * flen. (We used to pass to this function the length of filter)
  */
 unsigned int sk_run_filter(const struct sk_buff *skb,
-			   const struct sock_filter *fentry)
+			   const struct sock_filter *fentry,
+			   const u8 *data, const unsigned int data_len)
 {
 	void *ptr;
 	u32 A = 0;			/* Accumulator */
@@ -268,6 +272,46 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
 		case BPF_S_JMP_JSET_X:
 			fentry += (A & X) ? fentry->jt : fentry->jf;
 			continue;
+		case BPF_S_JMP_MEMCMP: {
+			u8 *pkt_data, *tmp_data;
+
+			/* A = Comparison length.
+			 * K = Offset into the data
+			 * X = Offset into the packet
+			 */
+			if (K + A > data_len || X + A > skb->len)
+				return 0;
+
+			/* We should write a skb aware memcmp() and avoid
+			 * copying the contents of the skb
+			 */
+			tmp_data = (u8*)kmalloc(A, GFP_KERNEL);
+
+			if(!tmp_data)
+				return 0;
+
+			/* Load enough bytes to analyse already offset by K */
+			ptr = load_pointer(skb, X, A, tmp_data);
+
+			if (!ptr) {
+				kfree(tmp_data);
+				return 0;
+			}
+
+			pkt_data = (u8 *)ptr;
+
+			/* data will not be NULL here if sk_chk_filter() has
+			 * been called. Since SO_ATTACH_FILTER{,_WITH_DATA}
+			 * both call this, only broken kernel code can cause
+			 * trouble
+			 */
+			fentry += (!memcmp(data + K, pkt_data, A))
+				? fentry->jt : fentry->jf;
+
+			kfree(tmp_data);
+
+			continue;
+		}
 		case BPF_S_LD_W_ABS:
 			k = K;
 load_w:
@@ -492,7 +536,7 @@ error:
  *
  * Returns 0 if the rule set is legal or -EINVAL if not.
  */
-int sk_chk_filter(struct sock_filter *filter, int flen)
+int sk_chk_filter(struct sock_filter *filter, int flen, u8 *data)
 {
 	/*
 	 * Valid instructions are initialized to non-0.
@@ -544,6 +588,7 @@ int sk_chk_filter(struct sock_filter *filter, int flen)
 		[BPF_JMP|BPF_JGT|BPF_X]  = BPF_S_JMP_JGT_X,
 		[BPF_JMP|BPF_JSET|BPF_K] = BPF_S_JMP_JSET_K,
 		[BPF_JMP|BPF_JSET|BPF_X] = BPF_S_JMP_JSET_X,
+		[BPF_JMP|BPF_JMEMCMP|BPF_K|BPF_X|BPF_A] = BPF_S_JMP_MEMCMP,
 	};
 	int pc;
 
@@ -585,6 +630,10 @@ int sk_chk_filter(struct sock_filter *filter, int flen)
 			if (ftest->k >= (unsigned)(flen-pc-1))
 				return -EINVAL;
 			break;
+		case BPF_S_JMP_MEMCMP:
+			if (!data)
+				return -EINVAL;
+			/* Fall through */
 		case BPF_S_JMP_JEQ_K:
 		case BPF_S_JMP_JEQ_X:
 		case BPF_S_JMP_JGE_K:
@@ -672,8 +721,10 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 
 	atomic_set(&fp->refcnt, 1);
 	fp->len = fprog->len;
+	fp->data = NULL;
+	fp->data_len = 0;
 
-	err = sk_chk_filter(fp->insns, fp->len);
+	err = sk_chk_filter(fp->insns, fp->len, NULL);
 	if (err) {
 		sk_filter_uncharge(sk, fp);
 		return err;
@@ -689,6 +740,62 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_attach_filter);
 
+int sk_attach_filter_with_data(struct sock_fprog_with_data *fprog, struct sock *sk)
+{
+	struct sk_filter *fp, *old_fp;
+	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
+	unsigned int dsize = fprog->data_len;
+	int err = -ENOMEM;
+
+	/* Make sure new filter is there and in the right amounts. */
+	if (fprog->filter == NULL)
+		return -EINVAL;
+
+	fp = sock_kmalloc(sk, fsize+sizeof(*fp), GFP_KERNEL);
+
+	if (!fp)
+		goto out;
+
+	fp->data = sock_kmalloc(sk, dsize, GFP_KERNEL);
+
+	if(!fp->data)
+		goto out_free;
+
+	if (copy_from_user(fp->data, fprog->data, dsize))
+		goto out_free;
+
+	fp->data_len = dsize;
+
+	if (copy_from_user(fp->insns, fprog->filter, fsize))
+		goto out_free_data;
+
+	atomic_set(&fp->refcnt, 1);
+	fp->len = fprog->len;
+
+	err = sk_chk_filter(fp->insns, fp->len, fp->data);
+	if (err)
+		goto out_uncharge;
+
+	old_fp = rcu_dereference_protected(sk->sk_filter,
+					   sock_owned_by_user(sk));
+	rcu_assign_pointer(sk->sk_filter, fp);
+
+	if (old_fp)
+		sk_filter_uncharge(sk, old_fp);
+
+	return 0;
+
+out_uncharge:
+	sk_filter_uncharge(sk, fp);
+out_free_data:
+	sock_kfree_s(sk, fp->data, dsize);
+out_free:
+	sock_kfree_s(sk, fp, fsize+sizeof(*fp));
+out:
+	return err;
+}
+EXPORT_SYMBOL_GPL(sk_attach_filter_with_data);
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dfed79..627f731 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -714,6 +714,20 @@ set_rcvbuf:
 			ret = sk_attach_filter(&fprog, sk);
 		}
 		break;
+	
+	case SO_ATTACH_FILTER_WITH_DATA:
+		ret = -EINVAL;
+		if (optlen == sizeof(struct sock_fprog_with_data)) {
+			struct sock_fprog_with_data fprog;
+
+			ret = -EFAULT;
+			if (copy_from_user(&fprog, optval,
+						sizeof(fprog)))
+				break;
+
+			ret = sk_attach_filter_with_data(&fprog, sk);
+		}
+		break;
 
 	case SO_DETACH_FILTER:
 		ret = sk_detach_filter(sk);
diff --git a/net/core/timestamping.c b/net/core/timestamping.c
index 7e7ca37..efb9a44 100644
--- a/net/core/timestamping.c
+++ b/net/core/timestamping.c
@@ -31,7 +31,7 @@ static unsigned int classify(const struct sk_buff *skb)
 	if (likely(skb->dev &&
 		   skb->dev->phydev &&
 		   skb->dev->phydev->drv))
-		return sk_run_filter(skb, ptp_filter);
+		return sk_run_filter(skb, ptp_filter, NULL, 0);
 	else
 		return PTP_CLASS_NONE;
 }
@@ -124,5 +124,5 @@ bool skb_defer_rx_timestamp(struct sk_buff *skb)
 
 void __init skb_timestamping_init(void)
 {
-	BUG_ON(sk_chk_filter(ptp_filter, ARRAY_SIZE(ptp_filter)));
+	BUG_ON(sk_chk_filter(ptp_filter, ARRAY_SIZE(ptp_filter), NULL));
 }
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c60649e..7b8bb1b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -525,7 +525,8 @@ static inline unsigned int run_filter(const struct sk_buff *skb,
 	rcu_read_lock();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter != NULL)
-		res = sk_run_filter(skb, filter->insns);
+		res = sk_run_filter(skb, filter->insns, filter->data,
+					filter->data_len);
 	rcu_read_unlock();
 
 	return res;
-- 
1.7.2.3


^ permalink raw reply related

* Re: [RFD][PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:57 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1297340087-10963-1-git-send-email-ian.molton@collabora.co.uk>

Shit, sorry about the noise. git-send-email bit me :)

-Ian

^ permalink raw reply

* Re: [PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Eric Dumazet @ 2011-02-10 13:24 UTC (permalink / raw)
  To: Ian Molton
  Cc: netdev, rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
	alban.crequy
In-Reply-To: <1297341067-12264-2-git-send-email-ian.molton@collabora.co.uk>

On Thursday, 10 February 2011 at 12:31 +0000, Ian Molton wrote:
> This patch allows a data section to be specified for BPF.
> 
> This is made use of by a MEMCMP like instruction.
> 
> Testsuite here:
> http://git.collabora.co.uk/?p=user/ian/check-bpf.git;a=summary
> 
> Issues:
> * Do I need to update the headers for all arches, or just generic
> * Can sk_run_filter() be called in a context where kmalloc(GFP_KERNEL) is
>   not allowed (I think not)

You cannot use GFP_KERNEL in sk_run_filter() : We run in {soft}irq mode,
in input path.
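
One way to avoid allocating in that path is to compare through a small
fixed-size stack buffer, one chunk at a time. The following is only a
userspace sketch of that idea, not kernel code and not part of the patch:
`struct frag_pkt`, `pkt_copy_bits()`, `pkt_memcmp()` and `CMP_CHUNK` are
hypothetical names, and `pkt_copy_bits()` merely stands in for whatever the
kernel would use to copy out of a possibly non-linear skb (presumably
something along the lines of skb_copy_bits()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a possibly non-linear packet: two fragments. */
struct frag_pkt {
	const uint8_t *frag[2];
	size_t frag_len[2];
};

/* Copy len bytes starting at packet offset off into to.
 * Returns 0 on success, -1 if the range runs past the end of the packet.
 */
static int pkt_copy_bits(const struct frag_pkt *p, size_t off,
			 uint8_t *to, size_t len)
{
	for (int i = 0; i < 2 && len > 0; i++) {
		size_t fl = p->frag_len[i];
		size_t n;

		if (off >= fl) {	/* range starts in a later fragment */
			off -= fl;
			continue;
		}
		n = fl - off;
		if (n > len)
			n = len;
		memcpy(to, p->frag[i] + off, n);
		to += n;
		len -= n;
		off = 0;		/* continue at start of next fragment */
	}
	return len == 0 ? 0 : -1;
}

#define CMP_CHUNK 32	/* assumed chunk size; small enough for the stack */

/* Compare len bytes of the packet at offset off against data without any
 * heap allocation. Returns 1 if equal, 0 on the first differing chunk,
 * -1 if the packet ends before a difference is found.
 */
static int pkt_memcmp(const struct frag_pkt *p, size_t off,
		      const uint8_t *data, size_t len)
{
	uint8_t buf[CMP_CHUNK];

	while (len > 0) {
		size_t n = len < CMP_CHUNK ? len : CMP_CHUNK;

		assert(n <= sizeof(buf));
		if (pkt_copy_bits(p, off, buf, n) != 0)
			return -1;
		if (memcmp(buf, data, n) != 0)
			return 0;
		off += n;
		data += n;
		len -= n;
	}
	return 1;
}
```

The trade-off in this sketch is one extra copy per chunk in exchange for
never sleeping or allocating; a comparison that walks the fragments in place
would avoid the copy as well.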

> * Data section allocated with second call to sock_kmalloc().
> * Should the patch be broken into two - one to add the data uploading,
>   one to add the JMEMCMP insn. ?

May I ask why it is needed at all ?

Then, why only one JMEMCMP would be allowed in a filter ?




^ permalink raw reply

* Re: [PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 13:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
	alban.crequy
In-Reply-To: <1297344292.2493.3.camel@edumazet-laptop>

On 10/02/11 13:24, Eric Dumazet wrote:

Hi!

Thanks for reviewing! :)

>> * Can sk_run_filter() be called in a context where kmalloc(GFP_KERNEL) is
>>    not allowed (I think not)
>
> You cannot use GFP_KERNEL in sk_run_filter() : We run in {soft}irq mode,
> in input path.

Ok, that can be sorted.

>> * Data section allocated with second call to sock_kmalloc().
>> * Should the patch be broken into two - one to add the data uploading,
>>    one to add the JMEMCMP insn. ?
>
> May I ask why it is needed at all ?

So we can match strings in packet filters... I don't think I understand 
the question...
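
For illustration, the intended semantics of the instruction can be modelled
in plain userspace C. This is a sketch under assumptions, not the code from
the patch: `jmemcmp_model()` and the `jmemcmp_result` values are hypothetical
names, the packet is treated as one linear buffer, and the bounds checks are
written so that `k + a` and `x + a` cannot wrap around in 32-bit arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes for the model below. */
enum jmemcmp_result {
	JMEMCMP_REJECT = -1,	/* out of bounds: the filter would return 0 */
	JMEMCMP_FALSE  = 0,	/* bytes differ: take the "jf" branch */
	JMEMCMP_TRUE   = 1	/* bytes match:  take the "jt" branch */
};

/* Userspace model of the proposed instruction.
 * k = offset into the data section
 * a = number of bytes to compare
 * x = offset into the packet
 */
static enum jmemcmp_result jmemcmp_model(const uint8_t *data, uint32_t data_len,
					 const uint8_t *pkt, uint32_t pkt_len,
					 uint32_t k, uint32_t a, uint32_t x)
{
	if (data == NULL || pkt == NULL)
		return JMEMCMP_REJECT;

	/* Overflow-safe form of "k + a > data_len || x + a > pkt_len". */
	if (a > data_len || k > data_len - a)
		return JMEMCMP_REJECT;
	if (a > pkt_len || x > pkt_len - a)
		return JMEMCMP_REJECT;

	/* Invariant after the checks above. */
	assert((uint64_t)k + a <= data_len);

	return memcmp(data + k, pkt + x, a) == 0 ? JMEMCMP_TRUE
						 : JMEMCMP_FALSE;
}
```

With a data section holding "GET POST", a filter could set K = 4, A = 4 and
X to the packet offset of interest to test whether the packet contains
"POST" at that position.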

> Then, why only one JMEMCMP would be allowed in a filter ?

I don't think I'm restricting the filter to only one JMEMCMP? Am I
misunderstanding you?

-Ian

^ permalink raw reply
