* Re: [PATCH] pch_can: fix tseg1/tseg2 setting issue
From: David Miller @ 2011-02-10 0:46 UTC (permalink / raw)
To: tomoya-linux-ECg8zkTtlr0C6LszWs/t0g
Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
qi.wang-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA,
yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
toshiharu-linux-ECg8zkTtlr0C6LszWs/t0g,
kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
joel.clark-ral2JQCrhuEAvxtiuMwx3w, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <1297298399-6250-1-git-send-email-tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
From: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
Date: Thu, 10 Feb 2011 09:39:59 +0900
> Previous patch "[PATCH 1/3] pch_can: fix 800k comms issue" is wrong.
> I should have modified tseg1_min not tseg2_min.
> This patch reverts tseg2_min to 1 and set tseg1_min to 2.
>
> Signed-off-by: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH] virtio-net: add schedule check to napi_enable call in refill_work
From: Rusty Russell @ 2011-02-10 1:31 UTC (permalink / raw)
To: virtualization; +Cc: Ken Stailey, netdev, Bruce Rogers
In-Reply-To: <132885.30568.qm@web110313.mail.gq1.yahoo.com>
On Thu, 10 Feb 2011 06:59:25 am Ken Stailey wrote:
> Justification:
>
> Impact: Under heavy network I/O load virtio-net driver crashes making VM guest unusable.
Hmm, this went badly wrong. I acked this patch, and it was mailed to
netdev six months ago.
Bruce's patch used spaces instead of tabs, but that should not have caused
it to be dropped. I've taken that and ported it forwards, will repost now.
Thanks for picking this up off the floor!
Rusty.
^ permalink raw reply
* [PATCH] virtio_net: Add schedule check to napi_enable call
From: Rusty Russell @ 2011-02-10 2:02 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, David Miller, virtualization, Ken Stailey
From: "Bruce Rogers" <brogers@novell.com>
Under harsh testing conditions, including low memory, the guest would
stop receiving packets. With this patch applied we no longer see any
problems in the driver while performing these tests for extended periods
of time.
Make sure napi is scheduled subsequent to each napi_enable.
Signed-off-by: Bruce Rogers <brogers@novell.com>
Signed-off-by: Olaf Kirch <okir@suse.de>
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/net/virtio_net.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -446,6 +446,20 @@ static void skb_recv_done(struct virtque
}
}
+static void virtnet_napi_enable(struct virtnet_info *vi)
+{
+ napi_enable(&vi->napi);
+
+ /* If all buffers were filled by other side before we napi_enabled, we
+ * won't get another interrupt, so process any outstanding packets
+ * now. virtnet_poll wants re-enable the queue, so we disable here.
+ * We synchronize against interrupts via NAPI_STATE_SCHED */
+ if (napi_schedule_prep(&vi->napi)) {
+ virtqueue_disable_cb(vi->rvq);
+ __napi_schedule(&vi->napi);
+ }
+}
+
static void refill_work(struct work_struct *work)
{
struct virtnet_info *vi;
@@ -454,7 +468,7 @@ static void refill_work(struct work_stru
vi = container_of(work, struct virtnet_info, refill.work);
napi_disable(&vi->napi);
still_empty = !try_fill_recv(vi, GFP_KERNEL);
- napi_enable(&vi->napi);
+ virtnet_napi_enable(vi);
/* In theory, this can happen: if we don't get any buffers in
* we will *never* try to fill again. */
@@ -638,16 +652,7 @@ static int virtnet_open(struct net_devic
{
struct virtnet_info *vi = netdev_priv(dev);
- napi_enable(&vi->napi);
-
- /* If all buffers were filled by other side before we napi_enabled, we
- * won't get another interrupt, so process any outstanding packets
- * now. virtnet_poll wants re-enable the queue, so we disable here.
- * We synchronize against interrupts via NAPI_STATE_SCHED */
- if (napi_schedule_prep(&vi->napi)) {
- virtqueue_disable_cb(vi->rvq);
- __napi_schedule(&vi->napi);
- }
+ virtnet_napi_enable(vi);
return 0;
}
^ permalink raw reply
* Re: [PATCH] virtio-net: add schedule check to napi_enable call in refill_work
From: Bruce Rogers @ 2011-02-10 2:44 UTC (permalink / raw)
To: virtualization, Rusty Russell; +Cc: netdev
In-Reply-To: <201102101201.19656.rusty@rustcorp.com.au>
>>> On 2/9/2011 at 06:31 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> On Thu, 10 Feb 2011 06:59:25 am Ken Stailey wrote:
>> Justification:
>>
>> Impact: Under heavy network I/O load virtio-net driver crashes making VM
> guest unusable.
>
> Hmm, this went badly wrong. I acked this patch, and it was mailed to
> netdev six months ago.
>
> Bruce's patch used spaces instead of tabs, but that should not have caused
> it to be dropped. I've taken that and ported it forwards, will repost now.
>
> Thanks for picking this up off the floor!
> Rusty.
Thanks for taking care of that!
Bruce
^ permalink raw reply
* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: Xiaotian Feng @ 2011-02-10 2:54 UTC (permalink / raw)
To: David Miller; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <20110209.140558.59676278.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]
On 02/10/2011 06:05 AM, David Miller wrote:
> From: David Miller<davem@davemloft.net>
> Date: Wed, 09 Feb 2011 14:03:23 -0800 (PST)
>
>> From: Xiaotian feng<dfeng@redhat.com>
>> Date: Wed, 9 Feb 2011 18:52:49 +0800
>>
>>> From: Xiaotian Feng<dfeng@redhat.com>
>>>
>>> commit a512b92 adds sysfs entry for net device group, but
>>> before this commit, tun also uses group sysfs, so after this
>>> commit checkin, kernel warns like this:
>>> sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
>>>
>>> Since tun has used this for years, rename sysfs under tun might
>>> break existing userspace, so rename group sysfs entry for net device
>>> group is a better choice.
>>>
>>> Signed-off-by: Xiaotian Feng<dfeng@redhat.com>
>>
>> I don't think we have much choice in this matter, so I have applied
>> this patch, thanks!
>
> Wait, you didn't even build test this patch?!?!?!?!
>
> net/core/net-sysfs.c: In function ‘format_netdev_group’:
> net/core/net-sysfs.c:298: error: ‘const struct net_device’ has no member named ‘netdev_group’
> net/core/net-sysfs.c: At top level:
> net/core/net-sysfs.c:333: error: ‘show_group’ undeclared here (not in a function)
>
> "RFC" doesn't preclude you from at least build testing patches you
> post.
>
> Sigh...
>
Sorry, my bad ... v2 patch is attatched, I've built and r/w this renamed
sysfs, all work fine now. Sorry again about my carelessness ...
Regards
Xiaotian
>
>
[-- Attachment #2: 0001-net-rename-group-sysfs-entry-to-netdev_group.patch --]
[-- Type: text/plain, Size: 1535 bytes --]
>From 35388da8821a72a71f54cb955146a881f916eb25 Mon Sep 17 00:00:00 2001
From: Xiaotian Feng <dfeng@redhat.com>
Date: Thu, 10 Feb 2011 10:48:53 +0800
Subject: [PATCH net-next v2] net: rename group sysfs entry to netdev_group
commit a512b92 adds sysfs entry for net device group, but
before this commit, tun also uses group sysfs, so after this
commit checkin, kernel warns like this:
sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
Since tun has used this for years, rename sysfs under tun might
break existing userspace, so rename group sysfs entry for net device
group is a better choice.
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Vlad Dogaru <ddvlad@rosedu.org>
---
net/core/net-sysfs.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2e4a393..5ceb257 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -330,7 +330,7 @@ static struct device_attribute net_class_attributes[] = {
__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
store_tx_queue_len),
- __ATTR(group, S_IRUGO | S_IWUSR, show_group, store_group),
+ __ATTR(netdev_group, S_IRUGO | S_IWUSR, show_group, store_group),
{}
};
--
1.7.1
^ permalink raw reply related
* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: David Miller @ 2011-02-10 3:16 UTC (permalink / raw)
To: dfeng; +Cc: netdev, eric.dumazet, therbert, ebiederm, shemminger, ddvlad
In-Reply-To: <4D53536B.2010505@redhat.com>
From: Xiaotian Feng <dfeng@redhat.com>
Date: Thu, 10 Feb 2011 10:54:35 +0800
> Subject: [PATCH net-next v2] net: rename group sysfs entry to netdev_group
>
> commit a512b92 adds sysfs entry for net device group, but
> before this commit, tun also uses group sysfs, so after this
> commit checkin, kernel warns like this:
> sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
>
> Since tun has used this for years, rename sysfs under tun might
> break existing userspace, so rename group sysfs entry for net device
> group is a better choice.
>
> Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Applied.
^ permalink raw reply
* [RFC PATCH 0/5] Cache PMTU/redirects in inetpeer
From: David Miller @ 2011-02-10 6:12 UTC (permalink / raw)
To: netdev
This is what I've been working on for the past several days.
Right now if the routing cache is turned off (by setting
rt_cache_rebuild_count to "0") several things stop working.
We never make use of any PMTU or redirect information we learn
via ICMP packets. This is because when the routing cache is
off, we can't "find" the existing cached routes that match
the ICMP because we don't add them to the hash table.
This functionality loss is also a blocker for eliminating the
routing cache entirely.
Solve this by remembering this state in the inetpeer entries.
PMTU information now self-expires. It gets validated when
cached routes are sanity checked via dst_ops->check(). At
expiration, the original RTAX_MTU metric value is restored.
So we don't have to invalidate the entire cached route just
because it's PMTU learned value has expired.
Similarly, we store redirect information in inetpeer too.
Except that currently my patches don't remember the "original"
gateway the route had, so we have to kill the route off when
we get a dst_ops->negative_advice() call on a redirected route.
Avoid this is easy to fix and I might do that soon.
These patches implement the PMTU/redirect bits in ipv4 only at the
moment, but I do have ipv6 patches I'm in the process of finishing
up. I just wanted people to see this as soon as possible so that
I can start getting feedback.
And hey if people can test this stuff out that'd be awesome! If
you've used these changes in an environment where you did hit PMTU
and redirects, please do let me know.
^ permalink raw reply
* [RFC PATCH 1/5] inetpeer: Abstract address representation further.
From: David Miller @ 2011-02-10 6:13 UTC (permalink / raw)
To: netdev
Future changes will add caching information, and some of
these new elements will be addresses.
Since the family is implicit via the ->daddr.family member,
replicating the family in ever address we store is entirely
redundant.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/inetpeer.h | 16 ++++++++++------
net/ipv4/inetpeer.c | 6 +++---
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv6/tcp_ipv6.c | 2 +-
4 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index ead2cb2..60e2cd8 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -15,12 +15,16 @@
#include <net/ipv6.h>
#include <asm/atomic.h>
-struct inetpeer_addr {
+struct inetpeer_addr_base {
union {
- __be32 a4;
- __be32 a6[4];
+ __be32 a4;
+ __be32 a6[4];
};
- __u16 family;
+};
+
+struct inetpeer_addr {
+ struct inetpeer_addr_base addr;
+ __u16 family;
};
struct inet_peer {
@@ -67,7 +71,7 @@ static inline struct inet_peer *inet_getpeer_v4(__be32 v4daddr, int create)
{
struct inetpeer_addr daddr;
- daddr.a4 = v4daddr;
+ daddr.addr.a4 = v4daddr;
daddr.family = AF_INET;
return inet_getpeer(&daddr, create);
}
@@ -76,7 +80,7 @@ static inline struct inet_peer *inet_getpeer_v6(struct in6_addr *v6daddr, int cr
{
struct inetpeer_addr daddr;
- ipv6_addr_copy((struct in6_addr *)daddr.a6, v6daddr);
+ ipv6_addr_copy((struct in6_addr *)daddr.addr.a6, v6daddr);
daddr.family = AF_INET6;
return inet_getpeer(&daddr, create);
}
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 709fbb4..4346c38 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -167,9 +167,9 @@ static int addr_compare(const struct inetpeer_addr *a,
int i, n = (a->family == AF_INET ? 1 : 4);
for (i = 0; i < n; i++) {
- if (a->a6[i] == b->a6[i])
+ if (a->addr.a6[i] == b->addr.a6[i])
continue;
- if (a->a6[i] < b->a6[i])
+ if (a->addr.a6[i] < b->addr.a6[i])
return -1;
return 1;
}
@@ -510,7 +510,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
p->daddr = *daddr;
atomic_set(&p->refcnt, 1);
atomic_set(&p->rid, 0);
- atomic_set(&p->ip_id_count, secure_ip_id(daddr->a4));
+ atomic_set(&p->ip_id_count, secure_ip_id(daddr->addr.a4));
p->tcp_ts_stamp = 0;
p->metrics[RTAX_LOCK-1] = INETPEER_METRICS_NEW;
p->rate_tokens = 0;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 02f583b..e2b9be2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1341,7 +1341,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
tcp_death_row.sysctl_tw_recycle &&
(dst = inet_csk_route_req(sk, req)) != NULL &&
(peer = rt_get_peer((struct rtable *)dst)) != NULL &&
- peer->daddr.a4 == saddr) {
+ peer->daddr.addr.a4 == saddr) {
inet_peer_refcheck(peer);
if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
(s32)(peer->tcp_ts - req->ts_recent) >
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 20aa95e..d6954e3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1323,7 +1323,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
tcp_death_row.sysctl_tw_recycle &&
(dst = inet6_csk_route_req(sk, req)) != NULL &&
(peer = rt6_get_peer((struct rt6_info *)dst)) != NULL &&
- ipv6_addr_equal((struct in6_addr *)peer->daddr.a6,
+ ipv6_addr_equal((struct in6_addr *)peer->daddr.addr.a6,
&treq->rmt_addr)) {
inet_peer_refcheck(peer);
if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
--
1.7.4
^ permalink raw reply related
* [RFC PATCH 2/5] inetpeer: Add redirect and PMTU discovery cached info.
From: David Miller @ 2011-02-10 6:13 UTC (permalink / raw)
To: netdev
Validity of the cached PMTU information is indicated by it's
expiration value being non-zero, just as per dst->expires.
The scheme we will use is that we will remember the pre-ICMP value
held in the metrics or route entry, and then at expiration time
we will restore that value.
In this way PMTU expiration does not kill off the cached route as is
done currently.
Redirect information is permanent, or at least until another redirect
is received.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/inetpeer.h | 18 +++++++++++-------
net/ipv4/inetpeer.c | 2 ++
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 60e2cd8..e6dd8da6 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -43,13 +43,17 @@ struct inet_peer {
*/
union {
struct {
- atomic_t rid; /* Frag reception counter */
- atomic_t ip_id_count; /* IP ID for the next packet */
- __u32 tcp_ts;
- __u32 tcp_ts_stamp;
- u32 metrics[RTAX_MAX];
- u32 rate_tokens; /* rate limiting for ICMP */
- unsigned long rate_last;
+ atomic_t rid; /* Frag reception counter */
+ atomic_t ip_id_count; /* IP ID for the next packet */
+ __u32 tcp_ts;
+ __u32 tcp_ts_stamp;
+ u32 metrics[RTAX_MAX];
+ u32 rate_tokens; /* rate limiting for ICMP */
+ unsigned long rate_last;
+ unsigned long pmtu_expires;
+ u32 pmtu_orig;
+ u32 pmtu_learned;
+ struct inetpeer_addr_base redirect_learned;
};
struct rcu_head rcu;
};
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 4346c38..48f8d45 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -515,6 +515,8 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
p->metrics[RTAX_LOCK-1] = INETPEER_METRICS_NEW;
p->rate_tokens = 0;
p->rate_last = 0;
+ p->pmtu_expires = 0;
+ memset(&p->redirect_learned, 0, sizeof(p->redirect_learned));
INIT_LIST_HEAD(&p->unused);
--
1.7.4
^ permalink raw reply related
* [RFC PATCH 3/5] inet: Create a mechanism for upward inetpeer propagation into routes.
From: David Miller @ 2011-02-10 6:13 UTC (permalink / raw)
To: netdev
If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.
The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.
Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation a new inetpeer entry.
If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.
The facilities implemented here handle this problem.
First we create a generation ID. When we create a cached route of any
kind, we remember the generation ID at the time of attachment. Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.
The dst_ops->check() callback is where the knowledge of this event
is propagated. If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found. Now
that we've updated the cached route's information, we update the
route's generation ID too.
This clears the way for implementing PMTU and redirects directly in
the inetpeer cache. There is absolutely no need to consult cached
route information in order to maintain this information.
At this point nothing bumps the inetpeer genids, that comes in the
later changes which handle PMTUs and redirects using inetpeers.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/ip6_fib.h | 1 +
include/net/route.h | 1 +
net/ipv4/route.c | 19 ++++++++++++++++++-
net/ipv6/route.c | 18 ++++++++++++++++--
4 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 708ff7c..46a6e8a 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -108,6 +108,7 @@ struct rt6_info {
u32 rt6i_flags;
struct rt6key rt6i_src;
u32 rt6i_metric;
+ u32 rt6i_peer_genid;
struct inet6_dev *rt6i_idev;
struct inet_peer *rt6i_peer;
diff --git a/include/net/route.h b/include/net/route.h
index e586465..bf790c1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {
/* Miscellaneous cached information */
__be32 rt_spec_dst; /* RFC1122 specific destination */
+ u32 rt_peer_genid;
struct inet_peer *peer; /* long-living peer info */
struct fib_info *fi; /* for client ref to shared metrics */
};
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0455af8..0979e03 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1308,6 +1308,13 @@ skip_hashing:
return 0;
}
+static atomic_t __rt_peer_genid = ATOMIC_INIT(0);
+
+static u32 rt_peer_genid(void)
+{
+ return atomic_read(&__rt_peer_genid);
+}
+
void rt_bind_peer(struct rtable *rt, int create)
{
struct inet_peer *peer;
@@ -1316,6 +1323,8 @@ void rt_bind_peer(struct rtable *rt, int create)
if (peer && cmpxchg(&rt->peer, NULL, peer) != NULL)
inet_putpeer(peer);
+ else
+ rt->rt_peer_genid = rt_peer_genid();
}
/*
@@ -1767,8 +1776,16 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
{
- if (rt_is_expired((struct rtable *)dst))
+ struct rtable *rt = (struct rtable *) dst;
+
+ if (rt_is_expired(rt))
return NULL;
+ if (rt->rt_peer_genid != rt_peer_genid()) {
+ if (!rt->peer)
+ rt_bind_peer(rt, 0);
+
+ rt->rt_peer_genid = rt_peer_genid();
+ }
return dst;
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 12ec83d..ad8556e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -240,6 +240,13 @@ static void ip6_dst_destroy(struct dst_entry *dst)
}
}
+static atomic_t __rt6_peer_genid = ATOMIC_INIT(0);
+
+static u32 rt6_peer_genid(void)
+{
+ return atomic_read(&__rt6_peer_genid);
+}
+
void rt6_bind_peer(struct rt6_info *rt, int create)
{
struct inet_peer *peer;
@@ -247,6 +254,8 @@ void rt6_bind_peer(struct rt6_info *rt, int create)
peer = inet_getpeer_v6(&rt->rt6i_dst.addr, create);
if (peer && cmpxchg(&rt->rt6i_peer, NULL, peer) != NULL)
inet_putpeer(peer);
+ else
+ rt->rt6i_peer_genid = rt6_peer_genid();
}
static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
@@ -912,9 +921,14 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
rt = (struct rt6_info *) dst;
- if (rt->rt6i_node && (rt->rt6i_node->fn_sernum == cookie))
+ if (rt->rt6i_node && (rt->rt6i_node->fn_sernum == cookie)) {
+ if (rt->rt6i_peer_genid != rt6_peer_genid()) {
+ if (!rt->rt6i_peer)
+ rt6_bind_peer(rt, 0);
+ rt->rt6i_peer_genid = rt6_peer_genid();
+ }
return dst;
-
+ }
return NULL;
}
--
1.7.4
^ permalink raw reply related
* [RFC PATCH 4/5] ipv4: Cache learned PMTU information in inetpeer.
From: David Miller @ 2011-02-10 6:13 UTC (permalink / raw)
To: netdev
The general idea is that if we learn new PMTU information, we
bump the peer genid.
This triggers the dst_ops->check() code to validate and if
necessary propagate the new PMTU value into the metrics.
Learned PMTU information self-expires.
This means that it is not necessary to kill a cached route
entry just because the PMTU information is too old.
As a consequence:
1) When the path appears unreachable (dst_ops->link_failure
or dst_ops->negative_advice) we unwind the PMTU state if
it is out of date, instead of killing the cached route.
A redirected route will still be invlidated in these
situations.
2) rt_check_expire(), rt_worker_func(), et al. are no longer
necessary at all.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/ipv4/route.c | 260 ++++++++++++++++++------------------------------------
1 files changed, 86 insertions(+), 174 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0979e03..11faf14 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -131,9 +131,6 @@ static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
static int rt_chain_length_max __read_mostly = 20;
-static struct delayed_work expires_work;
-static unsigned long expires_ljiffies;
-
/*
* Interface to generic destination cache.
*/
@@ -668,7 +665,7 @@ static inline int rt_fast_clean(struct rtable *rth)
static inline int rt_valuable(struct rtable *rth)
{
return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
- rth->dst.expires;
+ (rth->peer && rth->peer->pmtu_expires);
}
static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
@@ -679,13 +676,7 @@ static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long t
if (atomic_read(&rth->dst.__refcnt))
goto out;
- ret = 1;
- if (rth->dst.expires &&
- time_after_eq(jiffies, rth->dst.expires))
- goto out;
-
age = jiffies - rth->dst.lastuse;
- ret = 0;
if ((age <= tmo1 && !rt_fast_clean(rth)) ||
(age <= tmo2 && rt_valuable(rth)))
goto out;
@@ -829,97 +820,6 @@ static int has_noalias(const struct rtable *head, const struct rtable *rth)
return ONE;
}
-static void rt_check_expire(void)
-{
- static unsigned int rover;
- unsigned int i = rover, goal;
- struct rtable *rth;
- struct rtable __rcu **rthp;
- unsigned long samples = 0;
- unsigned long sum = 0, sum2 = 0;
- unsigned long delta;
- u64 mult;
-
- delta = jiffies - expires_ljiffies;
- expires_ljiffies = jiffies;
- mult = ((u64)delta) << rt_hash_log;
- if (ip_rt_gc_timeout > 1)
- do_div(mult, ip_rt_gc_timeout);
- goal = (unsigned int)mult;
- if (goal > rt_hash_mask)
- goal = rt_hash_mask + 1;
- for (; goal > 0; goal--) {
- unsigned long tmo = ip_rt_gc_timeout;
- unsigned long length;
-
- i = (i + 1) & rt_hash_mask;
- rthp = &rt_hash_table[i].chain;
-
- if (need_resched())
- cond_resched();
-
- samples++;
-
- if (rcu_dereference_raw(*rthp) == NULL)
- continue;
- length = 0;
- spin_lock_bh(rt_hash_lock_addr(i));
- while ((rth = rcu_dereference_protected(*rthp,
- lockdep_is_held(rt_hash_lock_addr(i)))) != NULL) {
- prefetch(rth->dst.rt_next);
- if (rt_is_expired(rth)) {
- *rthp = rth->dst.rt_next;
- rt_free(rth);
- continue;
- }
- if (rth->dst.expires) {
- /* Entry is expired even if it is in use */
- if (time_before_eq(jiffies, rth->dst.expires)) {
-nofree:
- tmo >>= 1;
- rthp = &rth->dst.rt_next;
- /*
- * We only count entries on
- * a chain with equal hash inputs once
- * so that entries for different QOS
- * levels, and other non-hash input
- * attributes don't unfairly skew
- * the length computation
- */
- length += has_noalias(rt_hash_table[i].chain, rth);
- continue;
- }
- } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
- goto nofree;
-
- /* Cleanup aged off entries. */
- *rthp = rth->dst.rt_next;
- rt_free(rth);
- }
- spin_unlock_bh(rt_hash_lock_addr(i));
- sum += length;
- sum2 += length*length;
- }
- if (samples) {
- unsigned long avg = sum / samples;
- unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
- rt_chain_length_max = max_t(unsigned long,
- ip_rt_gc_elasticity,
- (avg + 4*sd) >> FRACT_BITS);
- }
- rover = i;
-}
-
-/*
- * rt_worker_func() is run in process context.
- * we call rt_check_expire() to scan part of the hash table
- */
-static void rt_worker_func(struct work_struct *work)
-{
- rt_check_expire();
- schedule_delayed_work(&expires_work, ip_rt_gc_interval);
-}
-
/*
* Pertubation of rt_genid by a small quantity [1..256]
* Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
@@ -1535,9 +1435,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
if (dst->obsolete > 0) {
ip_rt_put(rt);
ret = NULL;
- } else if ((rt->rt_flags & RTCF_REDIRECTED) ||
- (rt->dst.expires &&
- time_after_eq(jiffies, rt->dst.expires))) {
+ } else if (rt->rt_flags & RTCF_REDIRECTED) {
unsigned hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
rt->fl.oif,
rt_genid(dev_net(dst->dev)));
@@ -1547,6 +1445,14 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
#endif
rt_del(hash, rt);
ret = NULL;
+ } else if (rt->peer &&
+ rt->peer->pmtu_expires &&
+ time_after_eq(jiffies, rt->peer->pmtu_expires)) {
+ unsigned long orig = rt->peer->pmtu_expires;
+
+ if (cmpxchg(&rt->peer->pmtu_expires, orig, 0) == orig)
+ dst_metric_set(dst, RTAX_MTU,
+ rt->peer->pmtu_orig);
}
}
return ret;
@@ -1697,80 +1603,78 @@ unsigned short ip_rt_frag_needed(struct net *net, struct iphdr *iph,
unsigned short new_mtu,
struct net_device *dev)
{
- int i, k;
unsigned short old_mtu = ntohs(iph->tot_len);
- struct rtable *rth;
- int ikeys[2] = { dev->ifindex, 0 };
- __be32 skeys[2] = { iph->saddr, 0, };
- __be32 daddr = iph->daddr;
unsigned short est_mtu = 0;
+ struct inet_peer *peer;
- for (k = 0; k < 2; k++) {
- for (i = 0; i < 2; i++) {
- unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],
- rt_genid(net));
-
- rcu_read_lock();
- for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
- rth = rcu_dereference(rth->dst.rt_next)) {
- unsigned short mtu = new_mtu;
+ peer = inet_getpeer_v4(iph->daddr, 1);
+ if (peer) {
+ unsigned short mtu = new_mtu;
- if (rth->fl.fl4_dst != daddr ||
- rth->fl.fl4_src != skeys[i] ||
- rth->rt_dst != daddr ||
- rth->rt_src != iph->saddr ||
- rth->fl.oif != ikeys[k] ||
- rt_is_input_route(rth) ||
- dst_metric_locked(&rth->dst, RTAX_MTU) ||
- !net_eq(dev_net(rth->dst.dev), net) ||
- rt_is_expired(rth))
- continue;
+ if (new_mtu < 68 || new_mtu >= old_mtu) {
+ /* BSD 4.2 derived systems incorrectly adjust
+ * tot_len by the IP header length, and report
+ * a zero MTU in the ICMP message.
+ */
+ if (mtu == 0 &&
+ old_mtu >= 68 + (iph->ihl << 2))
+ old_mtu -= iph->ihl << 2;
+ mtu = guess_mtu(old_mtu);
+ }
- if (new_mtu < 68 || new_mtu >= old_mtu) {
+ if (mtu < ip_rt_min_pmtu)
+ mtu = ip_rt_min_pmtu;
+ if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
+ est_mtu = mtu;
+ peer->pmtu_learned = mtu;
+ peer->pmtu_expires = jiffies + ip_rt_mtu_expires;
+ }
- /* BSD 4.2 compatibility hack :-( */
- if (mtu == 0 &&
- old_mtu >= dst_mtu(&rth->dst) &&
- old_mtu >= 68 + (iph->ihl << 2))
- old_mtu -= iph->ihl << 2;
+ inet_putpeer(peer);
- mtu = guess_mtu(old_mtu);
- }
- if (mtu <= dst_mtu(&rth->dst)) {
- if (mtu < dst_mtu(&rth->dst)) {
- dst_confirm(&rth->dst);
- if (mtu < ip_rt_min_pmtu) {
- u32 lock = dst_metric(&rth->dst,
- RTAX_LOCK);
- mtu = ip_rt_min_pmtu;
- lock |= (1 << RTAX_MTU);
- dst_metric_set(&rth->dst, RTAX_LOCK,
- lock);
- }
- dst_metric_set(&rth->dst, RTAX_MTU, mtu);
- dst_set_expires(&rth->dst,
- ip_rt_mtu_expires);
- }
- est_mtu = mtu;
- }
- }
- rcu_read_unlock();
- }
+ atomic_inc(&__rt_peer_genid);
}
return est_mtu ? : new_mtu;
}
+static void check_peer_pmtu(struct dst_entry *dst, struct inet_peer *peer)
+{
+ unsigned long expires = peer->pmtu_expires;
+
+ if (time_before(expires, jiffies)) {
+ u32 orig_dst_mtu = dst_mtu(dst);
+ if (peer->pmtu_learned < orig_dst_mtu) {
+ if (!peer->pmtu_orig)
+ peer->pmtu_orig = dst_metric_raw(dst, RTAX_MTU);
+ dst_metric_set(dst, RTAX_MTU, peer->pmtu_learned);
+ }
+ } else if (cmpxchg(&peer->pmtu_expires, expires, 0) == expires)
+ dst_metric_set(dst, RTAX_MTU, peer->pmtu_orig);
+}
+
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
{
- if (dst_mtu(dst) > mtu && mtu >= 68 &&
- !(dst_metric_locked(dst, RTAX_MTU))) {
- if (mtu < ip_rt_min_pmtu) {
- u32 lock = dst_metric(dst, RTAX_LOCK);
+ struct rtable *rt = (struct rtable *) dst;
+ struct inet_peer *peer;
+
+ dst_confirm(dst);
+
+ if (!rt->peer)
+ rt_bind_peer(rt, 1);
+ peer = rt->peer;
+ if (peer) {
+ if (mtu < ip_rt_min_pmtu)
mtu = ip_rt_min_pmtu;
- dst_metric_set(dst, RTAX_LOCK, lock | (1 << RTAX_MTU));
+ if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
+ peer->pmtu_learned = mtu;
+ peer->pmtu_expires = jiffies + ip_rt_mtu_expires;
+
+ atomic_inc(&__rt_peer_genid);
+ rt->rt_peer_genid = rt_peer_genid();
+
+ check_peer_pmtu(dst, peer);
}
- dst_metric_set(dst, RTAX_MTU, mtu);
- dst_set_expires(dst, ip_rt_mtu_expires);
+ inet_putpeer(peer);
}
}
@@ -1781,9 +1685,15 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
if (rt_is_expired(rt))
return NULL;
if (rt->rt_peer_genid != rt_peer_genid()) {
+ struct inet_peer *peer;
+
if (!rt->peer)
rt_bind_peer(rt, 0);
+ peer = rt->peer;
+ if (peer && peer->pmtu_expires)
+ check_peer_pmtu(dst, peer);
+
rt->rt_peer_genid = rt_peer_genid();
}
return dst;
@@ -1812,8 +1722,14 @@ static void ipv4_link_failure(struct sk_buff *skb)
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0);
rt = skb_rtable(skb);
- if (rt)
- dst_set_expires(&rt->dst, 0);
+ if (rt &&
+ rt->peer &&
+ rt->peer->pmtu_expires) {
+ unsigned long orig = rt->peer->pmtu_expires;
+
+ if (cmpxchg(&rt->peer->pmtu_expires, orig, 0) == orig)
+ dst_metric_set(&rt->dst, RTAX_MTU, rt->peer->pmtu_orig);
+ }
}
static int ip_rt_bug(struct sk_buff *skb)
@@ -1911,6 +1827,9 @@ static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
memcpy(peer->metrics, fi->fib_metrics,
sizeof(u32) * RTAX_MAX);
dst_init_metrics(&rt->dst, peer->metrics, false);
+
+ if (peer->pmtu_expires)
+ check_peer_pmtu(&rt->dst, peer);
} else {
if (fi->fib_metrics != (u32 *) dst_default_metrics) {
rt->fi = fi;
@@ -2961,7 +2880,8 @@ static int rt_fill_info(struct net *net,
NLA_PUT_BE32(skb, RTA_MARK, rt->fl.mark);
error = rt->dst.error;
- expires = rt->dst.expires ? rt->dst.expires - jiffies : 0;
+ expires = (rt->peer && rt->peer->pmtu_expires) ?
+ rt->peer->pmtu_expires - jiffies : 0;
if (rt->peer) {
inet_peer_refcheck(rt->peer);
id = atomic_read(&rt->peer->ip_id_count) & 0xffff;
@@ -3418,14 +3338,6 @@ int __init ip_rt_init(void)
devinet_init();
ip_fib_init();
- /* All the timers, started at system startup tend
- to synchronize. Perturb it a bit.
- */
- INIT_DELAYED_WORK_DEFERRABLE(&expires_work, rt_worker_func);
- expires_ljiffies = jiffies;
- schedule_delayed_work(&expires_work,
- net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
-
if (ip_rt_proc_init())
printk(KERN_ERR "Unable to create route proc files\n");
#ifdef CONFIG_XFRM
--
1.7.4
^ permalink raw reply related
* [RFC PATCH 5/5] ipv4: Cache learned redirect information in inetpeer.
From: David Miller @ 2011-02-10 6:13 UTC (permalink / raw)
To: netdev
Note that we do not generate the redirect netevent any longer,
because we don't create a new cached route.
Instead, once the new neighbour is bound to the cached route,
we emit a neigh update event instead.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/ipv4/route.c | 136 +++++++++++++++++-------------------------------------
1 files changed, 42 insertions(+), 94 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 11faf14..756f544 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1294,13 +1294,8 @@ static void rt_del(unsigned hash, struct rtable *rt)
void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
__be32 saddr, struct net_device *dev)
{
- int i, k;
struct in_device *in_dev = __in_dev_get_rcu(dev);
- struct rtable *rth;
- struct rtable __rcu **rthp;
- __be32 skeys[2] = { saddr, 0 };
- int ikeys[2] = { dev->ifindex, 0 };
- struct netevent_redirect netevent;
+ struct inet_peer *peer;
struct net *net;
if (!in_dev)
@@ -1312,9 +1307,6 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
ipv4_is_zeronet(new_gw))
goto reject_redirect;
- if (!rt_caching(net))
- goto reject_redirect;
-
if (!IN_DEV_SHARED_MEDIA(in_dev)) {
if (!inet_addr_onlink(in_dev, new_gw, old_gw))
goto reject_redirect;
@@ -1325,93 +1317,13 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
goto reject_redirect;
}
- for (i = 0; i < 2; i++) {
- for (k = 0; k < 2; k++) {
- unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],
- rt_genid(net));
-
- rthp = &rt_hash_table[hash].chain;
-
- while ((rth = rcu_dereference(*rthp)) != NULL) {
- struct rtable *rt;
-
- if (rth->fl.fl4_dst != daddr ||
- rth->fl.fl4_src != skeys[i] ||
- rth->fl.oif != ikeys[k] ||
- rt_is_input_route(rth) ||
- rt_is_expired(rth) ||
- !net_eq(dev_net(rth->dst.dev), net)) {
- rthp = &rth->dst.rt_next;
- continue;
- }
-
- if (rth->rt_dst != daddr ||
- rth->rt_src != saddr ||
- rth->dst.error ||
- rth->rt_gateway != old_gw ||
- rth->dst.dev != dev)
- break;
-
- dst_hold(&rth->dst);
-
- rt = dst_alloc(&ipv4_dst_ops);
- if (rt == NULL) {
- ip_rt_put(rth);
- return;
- }
-
- /* Copy all the information. */
- *rt = *rth;
- rt->dst.__use = 1;
- atomic_set(&rt->dst.__refcnt, 1);
- rt->dst.child = NULL;
- if (rt->dst.dev)
- dev_hold(rt->dst.dev);
- rt->dst.obsolete = -1;
- rt->dst.lastuse = jiffies;
- rt->dst.path = &rt->dst;
- rt->dst.neighbour = NULL;
- rt->dst.hh = NULL;
-#ifdef CONFIG_XFRM
- rt->dst.xfrm = NULL;
-#endif
- rt->rt_genid = rt_genid(net);
- rt->rt_flags |= RTCF_REDIRECTED;
-
- /* Gateway is different ... */
- rt->rt_gateway = new_gw;
-
- /* Redirect received -> path was valid */
- dst_confirm(&rth->dst);
-
- if (rt->peer)
- atomic_inc(&rt->peer->refcnt);
- if (rt->fi)
- atomic_inc(&rt->fi->fib_clntref);
-
- if (arp_bind_neighbour(&rt->dst) ||
- !(rt->dst.neighbour->nud_state &
- NUD_VALID)) {
- if (rt->dst.neighbour)
- neigh_event_send(rt->dst.neighbour, NULL);
- ip_rt_put(rth);
- rt_drop(rt);
- goto do_next;
- }
+ peer = inet_getpeer_v4(daddr, 1);
+ if (peer) {
+ peer->redirect_learned.a4 = new_gw;
- netevent.old = &rth->dst;
- netevent.new = &rt->dst;
- call_netevent_notifiers(NETEVENT_REDIRECT,
- &netevent);
+ inet_putpeer(peer);
- rt_del(hash, rth);
- if (!rt_intern_hash(hash, rt, &rt, NULL, rt->fl.oif))
- ip_rt_put(rt);
- goto do_next;
- }
- do_next:
- ;
- }
+ atomic_inc(&__rt_peer_genid);
}
return;
@@ -1678,6 +1590,31 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
}
}
+static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
+{
+ struct rtable *rt = (struct rtable *) dst;
+ __be32 orig_gw = rt->rt_gateway;
+
+ dst_confirm(&rt->dst);
+
+ neigh_release(rt->dst.neighbour);
+ rt->dst.neighbour = NULL;
+
+ rt->rt_gateway = peer->redirect_learned.a4;
+ if (arp_bind_neighbour(&rt->dst) ||
+ !(rt->dst.neighbour->nud_state & NUD_VALID)) {
+ if (rt->dst.neighbour)
+ neigh_event_send(rt->dst.neighbour, NULL);
+ rt->rt_gateway = orig_gw;
+ return -EAGAIN;
+ } else {
+ rt->rt_flags |= RTCF_REDIRECTED;
+ call_netevent_notifiers(NETEVENT_NEIGH_UPDATE,
+ rt->dst.neighbour);
+ }
+ return 0;
+}
+
static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
{
struct rtable *rt = (struct rtable *) dst;
@@ -1694,6 +1631,12 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
if (peer && peer->pmtu_expires)
check_peer_pmtu(dst, peer);
+ if (peer && peer->redirect_learned.a4 &&
+ peer->redirect_learned.a4 != rt->rt_gateway) {
+ if (check_peer_redir(dst, peer))
+ return NULL;
+ }
+
rt->rt_peer_genid = rt_peer_genid();
}
return dst;
@@ -1830,6 +1773,11 @@ static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
if (peer->pmtu_expires)
check_peer_pmtu(&rt->dst, peer);
+ if (peer->redirect_learned.a4 &&
+ peer->redirect_learned.a4 != rt->rt_gateway) {
+ rt->rt_gateway = peer->redirect_learned.a4;
+ rt->rt_flags |= RTCF_REDIRECTED;
+ }
} else {
if (fi->fib_metrics != (u32 *) dst_default_metrics) {
rt->fi = fi;
--
1.7.4
^ permalink raw reply related
* Re: STMMAC driver: NFS Problem on 2.6.37
From: Brian Downing @ 2011-02-09 20:58 UTC (permalink / raw)
To: Chuck Lever
Cc: deepaksi, Armando VISCONTI, Trond Myklebust, netdev,
Linux NFS Mailing List, Shiraz HASHIM, Viresh KUMAR,
Peppe CAVALLARO, amitgoel
In-Reply-To: <D031EAC4-68E9-4E56-8DB3-208B34C8DFD7@oracle.com>
On Wed, Feb 09, 2011 at 03:12:22PM -0500, Chuck Lever wrote:
> Based on your console logs, I see that the working case uses UDP to
> contact the server's mountd, but the failing case uses TCP. You can
> try explicitly specifying "proto=udp" to force the use of UDP, to test
> this theory.
This does indeed make it work again for me, thanks!
> Meanwhile, the patch description explicitly states that the default
> mount option settings have changed. Does it make sense to change the
> default behavior of NFSROOT mounts to use UDP again? I don't see
> another way to make this process more reliable across NIC
> initialization. If this is considered a regression, we can make a
> patch for 2.6.38-rc and 2.6.37.
I only use nfsroot for development, so I don't have a terribly strong
opinion. I would point out though that the default u-boot parameters
for nfsrooting a lot of boards will no longer work at this point, so if
it's not patched to work again without specifying nfs options I think
there should at least be a note in the documentation and possibly a
"maybe try proto=udp?" console message on failure.
I assume it's not feasable to either wait until the chosen interface's
link is ready before trying to mount nfsroot, or retrying TCP-based
connections a little bit more aggressively/at all?
-bcd
^ permalink raw reply
* [RFC !!BONUS!! PATCH 6/5] ipv4: Delete routing cache.
From: David Miller @ 2011-02-10 6:39 UTC (permalink / raw)
To: netdev
Signed-off-by: David S. Miller <davem@davemloft.net>
---
I couldn't resist, I've been waiting years to be able to even
think about trying this.
If you want to live on the edge, and I really mean "the edge",
you want to try this patch out for sure. :-)
It builds, and that's all that I promise.
include/net/route.h | 1 -
net/ipv4/fib_frontend.c | 5 -
net/ipv4/route.c | 891 ++---------------------------------------------
3 files changed, 23 insertions(+), 874 deletions(-)
diff --git a/include/net/route.h b/include/net/route.h
index bf790c1..fcf1b11 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -117,7 +117,6 @@ extern int ip_rt_init(void);
extern void ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
__be32 src, struct net_device *dev);
extern void rt_cache_flush(struct net *net, int how);
-extern void rt_cache_flush_batch(struct net *net);
extern int __ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
extern int ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
extern int ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2a49c06..694145c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -978,11 +978,6 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
rt_cache_flush(dev_net(dev), 0);
break;
case NETDEV_UNREGISTER_BATCH:
- /* The batch unregister is only called on the first
- * device in the list of devices being unregistered.
- * Therefore we should not pass dev_net(dev) in here.
- */
- rt_cache_flush_batch(NULL);
break;
}
return NOTIFY_DONE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 756f544..7078b8b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly = 8;
static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
-static int rt_chain_length_max __read_mostly = 20;
/*
* Interface to generic destination cache.
@@ -222,184 +221,30 @@ const __u8 ip_tos2prio[16] = {
};
-/*
- * Route cache.
- */
-
-/* The locking scheme is rather straight forward:
- *
- * 1) Read-Copy Update protects the buckets of the central route hash.
- * 2) Only writers remove entries, and they hold the lock
- * as they look at rtable reference counts.
- * 3) Only readers acquire references to rtable entries,
- * they do so with atomic increments and with the
- * lock held.
- */
-
-struct rt_hash_bucket {
- struct rtable __rcu *chain;
-};
-
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
- defined(CONFIG_PROVE_LOCKING)
-/*
- * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
- * The size of this table is a power of two and depends on the number of CPUS.
- * (on lockdep we have a quite big spinlock_t, so keep the size down there)
- */
-#ifdef CONFIG_LOCKDEP
-# define RT_HASH_LOCK_SZ 256
-#else
-# if NR_CPUS >= 32
-# define RT_HASH_LOCK_SZ 4096
-# elif NR_CPUS >= 16
-# define RT_HASH_LOCK_SZ 2048
-# elif NR_CPUS >= 8
-# define RT_HASH_LOCK_SZ 1024
-# elif NR_CPUS >= 4
-# define RT_HASH_LOCK_SZ 512
-# else
-# define RT_HASH_LOCK_SZ 256
-# endif
-#endif
-
-static spinlock_t *rt_hash_locks;
-# define rt_hash_lock_addr(slot) &rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)]
-
-static __init void rt_hash_lock_init(void)
-{
- int i;
-
- rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ,
- GFP_KERNEL);
- if (!rt_hash_locks)
- panic("IP: failed to allocate rt_hash_locks\n");
-
- for (i = 0; i < RT_HASH_LOCK_SZ; i++)
- spin_lock_init(&rt_hash_locks[i]);
-}
-#else
-# define rt_hash_lock_addr(slot) NULL
-
-static inline void rt_hash_lock_init(void)
-{
-}
-#endif
-
-static struct rt_hash_bucket *rt_hash_table __read_mostly;
-static unsigned rt_hash_mask __read_mostly;
-static unsigned int rt_hash_log __read_mostly;
-
static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
#define RT_CACHE_STAT_INC(field) __this_cpu_inc(rt_cache_stat.field)
-static inline unsigned int rt_hash(__be32 daddr, __be32 saddr, int idx,
- int genid)
-{
- return jhash_3words((__force u32)daddr, (__force u32)saddr,
- idx, genid)
- & rt_hash_mask;
-}
-
static inline int rt_genid(struct net *net)
{
return atomic_read(&net->ipv4.rt_genid);
}
#ifdef CONFIG_PROC_FS
-struct rt_cache_iter_state {
- struct seq_net_private p;
- int bucket;
- int genid;
-};
-
-static struct rtable *rt_cache_get_first(struct seq_file *seq)
-{
- struct rt_cache_iter_state *st = seq->private;
- struct rtable *r = NULL;
-
- for (st->bucket = rt_hash_mask; st->bucket >= 0; --st->bucket) {
- if (!rcu_dereference_raw(rt_hash_table[st->bucket].chain))
- continue;
- rcu_read_lock_bh();
- r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
- while (r) {
- if (dev_net(r->dst.dev) == seq_file_net(seq) &&
- r->rt_genid == st->genid)
- return r;
- r = rcu_dereference_bh(r->dst.rt_next);
- }
- rcu_read_unlock_bh();
- }
- return r;
-}
-
-static struct rtable *__rt_cache_get_next(struct seq_file *seq,
- struct rtable *r)
-{
- struct rt_cache_iter_state *st = seq->private;
-
- r = rcu_dereference_bh(r->dst.rt_next);
- while (!r) {
- rcu_read_unlock_bh();
- do {
- if (--st->bucket < 0)
- return NULL;
- } while (!rcu_dereference_raw(rt_hash_table[st->bucket].chain));
- rcu_read_lock_bh();
- r = rcu_dereference_bh(rt_hash_table[st->bucket].chain);
- }
- return r;
-}
-
-static struct rtable *rt_cache_get_next(struct seq_file *seq,
- struct rtable *r)
-{
- struct rt_cache_iter_state *st = seq->private;
- while ((r = __rt_cache_get_next(seq, r)) != NULL) {
- if (dev_net(r->dst.dev) != seq_file_net(seq))
- continue;
- if (r->rt_genid == st->genid)
- break;
- }
- return r;
-}
-
-static struct rtable *rt_cache_get_idx(struct seq_file *seq, loff_t pos)
-{
- struct rtable *r = rt_cache_get_first(seq);
-
- if (r)
- while (pos && (r = rt_cache_get_next(seq, r)))
- --pos;
- return pos ? NULL : r;
-}
-
static void *rt_cache_seq_start(struct seq_file *seq, loff_t *pos)
{
- struct rt_cache_iter_state *st = seq->private;
if (*pos)
- return rt_cache_get_idx(seq, *pos - 1);
- st->genid = rt_genid(seq_file_net(seq));
+ return NULL;
return SEQ_START_TOKEN;
}
static void *rt_cache_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
- struct rtable *r;
-
- if (v == SEQ_START_TOKEN)
- r = rt_cache_get_first(seq);
- else
- r = rt_cache_get_next(seq, v);
++*pos;
- return r;
+ return NULL;
}
static void rt_cache_seq_stop(struct seq_file *seq, void *v)
{
- if (v && v != SEQ_START_TOKEN)
- rcu_read_unlock_bh();
}
static int rt_cache_seq_show(struct seq_file *seq, void *v)
@@ -409,29 +254,6 @@ static int rt_cache_seq_show(struct seq_file *seq, void *v)
"Iface\tDestination\tGateway \tFlags\t\tRefCnt\tUse\t"
"Metric\tSource\t\tMTU\tWindow\tIRTT\tTOS\tHHRef\t"
"HHUptod\tSpecDst");
- else {
- struct rtable *r = v;
- int len;
-
- seq_printf(seq, "%s\t%08X\t%08X\t%8X\t%d\t%u\t%d\t"
- "%08X\t%d\t%u\t%u\t%02X\t%d\t%1d\t%08X%n",
- r->dst.dev ? r->dst.dev->name : "*",
- (__force u32)r->rt_dst,
- (__force u32)r->rt_gateway,
- r->rt_flags, atomic_read(&r->dst.__refcnt),
- r->dst.__use, 0, (__force u32)r->rt_src,
- dst_metric_advmss(&r->dst) + 40,
- dst_metric(&r->dst, RTAX_WINDOW),
- (int)((dst_metric(&r->dst, RTAX_RTT) >> 3) +
- dst_metric(&r->dst, RTAX_RTTVAR)),
- r->fl.fl4_tos,
- r->dst.hh ? atomic_read(&r->dst.hh->hh_refcnt) : -1,
- r->dst.hh ? (r->dst.hh->hh_output ==
- dev_queue_xmit) : 0,
- r->rt_spec_dst, &len);
-
- seq_printf(seq, "%*s\n", 127 - len, "");
- }
return 0;
}
@@ -444,8 +266,7 @@ static const struct seq_operations rt_cache_seq_ops = {
static int rt_cache_seq_open(struct inode *inode, struct file *file)
{
- return seq_open_net(inode, file, &rt_cache_seq_ops,
- sizeof(struct rt_cache_iter_state));
+ return seq_open_net(inode, file, &rt_cache_seq_ops, 0);
}
static const struct file_operations rt_cache_seq_fops = {
@@ -643,184 +464,12 @@ static inline int ip_rt_proc_init(void)
}
#endif /* CONFIG_PROC_FS */
-static inline void rt_free(struct rtable *rt)
-{
- call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline void rt_drop(struct rtable *rt)
-{
- ip_rt_put(rt);
- call_rcu_bh(&rt->dst.rcu_head, dst_rcu_free);
-}
-
-static inline int rt_fast_clean(struct rtable *rth)
-{
- /* Kill broadcast/multicast entries very aggresively, if they
- collide in hash table with more useful entries */
- return (rth->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) &&
- rt_is_input_route(rth) && rth->dst.rt_next;
-}
-
-static inline int rt_valuable(struct rtable *rth)
-{
- return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
- (rth->peer && rth->peer->pmtu_expires);
-}
-
-static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
-{
- unsigned long age;
- int ret = 0;
-
- if (atomic_read(&rth->dst.__refcnt))
- goto out;
-
- age = jiffies - rth->dst.lastuse;
- if ((age <= tmo1 && !rt_fast_clean(rth)) ||
- (age <= tmo2 && rt_valuable(rth)))
- goto out;
- ret = 1;
-out: return ret;
-}
-
-/* Bits of score are:
- * 31: very valuable
- * 30: not quite useless
- * 29..0: usage counter
- */
-static inline u32 rt_score(struct rtable *rt)
-{
- u32 score = jiffies - rt->dst.lastuse;
-
- score = ~score & ~(3<<30);
-
- if (rt_valuable(rt))
- score |= (1<<31);
-
- if (rt_is_output_route(rt) ||
- !(rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST|RTCF_LOCAL)))
- score |= (1<<30);
-
- return score;
-}
-
-static inline bool rt_caching(const struct net *net)
-{
- return net->ipv4.current_rt_cache_rebuild_count <=
- net->ipv4.sysctl_rt_cache_rebuild_count;
-}
-
-static inline bool compare_hash_inputs(const struct flowi *fl1,
- const struct flowi *fl2)
-{
- return ((((__force u32)fl1->fl4_dst ^ (__force u32)fl2->fl4_dst) |
- ((__force u32)fl1->fl4_src ^ (__force u32)fl2->fl4_src) |
- (fl1->iif ^ fl2->iif)) == 0);
-}
-
-static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
-{
- return (((__force u32)fl1->fl4_dst ^ (__force u32)fl2->fl4_dst) |
- ((__force u32)fl1->fl4_src ^ (__force u32)fl2->fl4_src) |
- (fl1->mark ^ fl2->mark) |
- (*(u16 *)&fl1->fl4_tos ^ *(u16 *)&fl2->fl4_tos) |
- (fl1->oif ^ fl2->oif) |
- (fl1->iif ^ fl2->iif)) == 0;
-}
-
-static inline int compare_netns(struct rtable *rt1, struct rtable *rt2)
-{
- return net_eq(dev_net(rt1->dst.dev), dev_net(rt2->dst.dev));
-}
-
static inline int rt_is_expired(struct rtable *rth)
{
return rth->rt_genid != rt_genid(dev_net(rth->dst.dev));
}
/*
- * Perform a full scan of hash table and free all entries.
- * Can be called by a softirq or a process.
- * In the later case, we want to be reschedule if necessary
- */
-static void rt_do_flush(struct net *net, int process_context)
-{
- unsigned int i;
- struct rtable *rth, *next;
-
- for (i = 0; i <= rt_hash_mask; i++) {
- struct rtable __rcu **pprev;
- struct rtable *list;
-
- if (process_context && need_resched())
- cond_resched();
- rth = rcu_dereference_raw(rt_hash_table[i].chain);
- if (!rth)
- continue;
-
- spin_lock_bh(rt_hash_lock_addr(i));
-
- list = NULL;
- pprev = &rt_hash_table[i].chain;
- rth = rcu_dereference_protected(*pprev,
- lockdep_is_held(rt_hash_lock_addr(i)));
-
- while (rth) {
- next = rcu_dereference_protected(rth->dst.rt_next,
- lockdep_is_held(rt_hash_lock_addr(i)));
-
- if (!net ||
- net_eq(dev_net(rth->dst.dev), net)) {
- rcu_assign_pointer(*pprev, next);
- rcu_assign_pointer(rth->dst.rt_next, list);
- list = rth;
- } else {
- pprev = &rth->dst.rt_next;
- }
- rth = next;
- }
-
- spin_unlock_bh(rt_hash_lock_addr(i));
-
- for (; list; list = next) {
- next = rcu_dereference_protected(list->dst.rt_next, 1);
- rt_free(list);
- }
- }
-}
-
-/*
- * While freeing expired entries, we compute average chain length
- * and standard deviation, using fixed-point arithmetic.
- * This to have an estimation of rt_chain_length_max
- * rt_chain_length_max = max(elasticity, AVG + 4*SD)
- * We use 3 bits for frational part, and 29 (or 61) for magnitude.
- */
-
-#define FRACT_BITS 3
-#define ONE (1UL << FRACT_BITS)
-
-/*
- * Given a hash chain and an item in this hash chain,
- * find if a previous entry has the same hash_inputs
- * (but differs on tos, mark or oif)
- * Returns 0 if an alias is found.
- * Returns ONE if rth has no alias before itself.
- */
-static int has_noalias(const struct rtable *head, const struct rtable *rth)
-{
- const struct rtable *aux = head;
-
- while (aux != rth) {
- if (compare_hash_inputs(&aux->fl, &rth->fl))
- return 0;
- aux = rcu_dereference_protected(aux->dst.rt_next, 1);
- }
- return ONE;
-}
-
-/*
* Pertubation of rt_genid by a small quantity [1..256]
* Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
* many times (2^24) without giving recent rt_genid.
@@ -841,366 +490,32 @@ static void rt_cache_invalidate(struct net *net)
void rt_cache_flush(struct net *net, int delay)
{
rt_cache_invalidate(net);
- if (delay >= 0)
- rt_do_flush(net, !in_softirq());
-}
-
-/* Flush previous cache invalidated entries from the cache */
-void rt_cache_flush_batch(struct net *net)
-{
- rt_do_flush(net, !in_softirq());
}
-static void rt_emergency_hash_rebuild(struct net *net)
-{
- if (net_ratelimit())
- printk(KERN_WARNING "Route hash chain too long!\n");
- rt_cache_invalidate(net);
-}
-
-/*
- Short description of GC goals.
-
- We want to build algorithm, which will keep routing cache
- at some equilibrium point, when number of aged off entries
- is kept approximately equal to newly generated ones.
-
- Current expiration strength is variable "expire".
- We try to adjust it dynamically, so that if networking
- is idle expires is large enough to keep enough of warm entries,
- and when load increases it reduces to limit cache size.
- */
-
static int rt_garbage_collect(struct dst_ops *ops)
{
- static unsigned long expire = RT_GC_TIMEOUT;
- static unsigned long last_gc;
- static int rover;
- static int equilibrium;
- struct rtable *rth;
- struct rtable __rcu **rthp;
- unsigned long now = jiffies;
- int goal;
- int entries = dst_entries_get_fast(&ipv4_dst_ops);
-
- /*
- * Garbage collection is pretty expensive,
- * do not make it too frequently.
- */
-
RT_CACHE_STAT_INC(gc_total);
-
- if (now - last_gc < ip_rt_gc_min_interval &&
- entries < ip_rt_max_size) {
- RT_CACHE_STAT_INC(gc_ignored);
- goto out;
- }
-
- entries = dst_entries_get_slow(&ipv4_dst_ops);
- /* Calculate number of entries, which we want to expire now. */
- goal = entries - (ip_rt_gc_elasticity << rt_hash_log);
- if (goal <= 0) {
- if (equilibrium < ipv4_dst_ops.gc_thresh)
- equilibrium = ipv4_dst_ops.gc_thresh;
- goal = entries - equilibrium;
- if (goal > 0) {
- equilibrium += min_t(unsigned int, goal >> 1, rt_hash_mask + 1);
- goal = entries - equilibrium;
- }
- } else {
- /* We are in dangerous area. Try to reduce cache really
- * aggressively.
- */
- goal = max_t(unsigned int, goal >> 1, rt_hash_mask + 1);
- equilibrium = entries - goal;
- }
-
- if (now - last_gc >= ip_rt_gc_min_interval)
- last_gc = now;
-
- if (goal <= 0) {
- equilibrium += goal;
- goto work_done;
- }
-
- do {
- int i, k;
-
- for (i = rt_hash_mask, k = rover; i >= 0; i--) {
- unsigned long tmo = expire;
-
- k = (k + 1) & rt_hash_mask;
- rthp = &rt_hash_table[k].chain;
- spin_lock_bh(rt_hash_lock_addr(k));
- while ((rth = rcu_dereference_protected(*rthp,
- lockdep_is_held(rt_hash_lock_addr(k)))) != NULL) {
- if (!rt_is_expired(rth) &&
- !rt_may_expire(rth, tmo, expire)) {
- tmo >>= 1;
- rthp = &rth->dst.rt_next;
- continue;
- }
- *rthp = rth->dst.rt_next;
- rt_free(rth);
- goal--;
- }
- spin_unlock_bh(rt_hash_lock_addr(k));
- if (goal <= 0)
- break;
- }
- rover = k;
-
- if (goal <= 0)
- goto work_done;
-
- /* Goal is not achieved. We stop process if:
-
- - if expire reduced to zero. Otherwise, expire is halfed.
- - if table is not full.
- - if we are called from interrupt.
- - jiffies check is just fallback/debug loop breaker.
- We will not spin here for long time in any case.
- */
-
- RT_CACHE_STAT_INC(gc_goal_miss);
-
- if (expire == 0)
- break;
-
- expire >>= 1;
-#if RT_CACHE_DEBUG >= 2
- printk(KERN_DEBUG "expire>> %u %d %d %d\n", expire,
- dst_entries_get_fast(&ipv4_dst_ops), goal, i);
-#endif
-
- if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
- goto out;
- } while (!in_softirq() && time_before_eq(jiffies, now));
-
- if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
- goto out;
- if (dst_entries_get_slow(&ipv4_dst_ops) < ip_rt_max_size)
- goto out;
- if (net_ratelimit())
- printk(KERN_WARNING "dst cache overflow\n");
- RT_CACHE_STAT_INC(gc_dst_overflow);
- return 1;
-
-work_done:
- expire += ip_rt_gc_min_interval;
- if (expire > ip_rt_gc_timeout ||
- dst_entries_get_fast(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh ||
- dst_entries_get_slow(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh)
- expire = ip_rt_gc_timeout;
-#if RT_CACHE_DEBUG >= 2
- printk(KERN_DEBUG "expire++ %u %d %d %d\n", expire,
- dst_entries_get_fast(&ipv4_dst_ops), goal, rover);
-#endif
-out: return 0;
-}
-
-/*
- * Returns number of entries in a hash chain that have different hash_inputs
- */
-static int slow_chain_length(const struct rtable *head)
-{
- int length = 0;
- const struct rtable *rth = head;
-
- while (rth) {
- length += has_noalias(head, rth);
- rth = rcu_dereference_protected(rth->dst.rt_next, 1);
- }
- return length >> FRACT_BITS;
+ return 0;
}
-static int rt_intern_hash(unsigned hash, struct rtable *rt,
- struct rtable **rp, struct sk_buff *skb, int ifindex)
+static int rt_finalize(struct rtable *rt, struct rtable **rp, struct sk_buff *skb)
{
- struct rtable *rth, *cand;
- struct rtable __rcu **rthp, **candp;
- unsigned long now;
- u32 min_score;
- int chain_length;
- int attempts = !in_softirq();
-
-restart:
- chain_length = 0;
- min_score = ~(u32)0;
- cand = NULL;
- candp = NULL;
- now = jiffies;
-
- if (!rt_caching(dev_net(rt->dst.dev))) {
- /*
- * If we're not caching, just tell the caller we
- * were successful and don't touch the route. The
- * caller hold the sole reference to the cache entry, and
- * it will be released when the caller is done with it.
- * If we drop it here, the callers have no way to resolve routes
- * when we're not caching. Instead, just point *rp at rt, so
- * the caller gets a single use out of the route
- * Note that we do rt_free on this new route entry, so that
- * once its refcount hits zero, we are still able to reap it
- * (Thanks Alexey)
- * Note: To avoid expensive rcu stuff for this uncached dst,
- * we set DST_NOCACHE so that dst_release() can free dst without
- * waiting a grace period.
- */
-
- rt->dst.flags |= DST_NOCACHE;
- if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
- int err = arp_bind_neighbour(&rt->dst);
- if (err) {
- if (net_ratelimit())
- printk(KERN_WARNING
- "Neighbour table failure & not caching routes.\n");
- ip_rt_put(rt);
- return err;
- }
- }
-
- goto skip_hashing;
- }
-
- rthp = &rt_hash_table[hash].chain;
-
- spin_lock_bh(rt_hash_lock_addr(hash));
- while ((rth = rcu_dereference_protected(*rthp,
- lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
- if (rt_is_expired(rth)) {
- *rthp = rth->dst.rt_next;
- rt_free(rth);
- continue;
- }
- if (compare_keys(&rth->fl, &rt->fl) && compare_netns(rth, rt)) {
- /* Put it first */
- *rthp = rth->dst.rt_next;
- /*
- * Since lookup is lockfree, the deletion
- * must be visible to another weakly ordered CPU before
- * the insertion at the start of the hash chain.
- */
- rcu_assign_pointer(rth->dst.rt_next,
- rt_hash_table[hash].chain);
- /*
- * Since lookup is lockfree, the update writes
- * must be ordered for consistency on SMP.
- */
- rcu_assign_pointer(rt_hash_table[hash].chain, rth);
-
- dst_use(&rth->dst, now);
- spin_unlock_bh(rt_hash_lock_addr(hash));
-
- rt_drop(rt);
- if (rp)
- *rp = rth;
- else
- skb_dst_set(skb, &rth->dst);
- return 0;
- }
-
- if (!atomic_read(&rth->dst.__refcnt)) {
- u32 score = rt_score(rth);
-
- if (score <= min_score) {
- cand = rth;
- candp = rthp;
- min_score = score;
- }
- }
-
- chain_length++;
-
- rthp = &rth->dst.rt_next;
- }
-
- if (cand) {
- /* ip_rt_gc_elasticity used to be average length of chain
- * length, when exceeded gc becomes really aggressive.
- *
- * The second limit is less certain. At the moment it allows
- * only 2 entries per bucket. We will see.
- */
- if (chain_length > ip_rt_gc_elasticity) {
- *candp = cand->dst.rt_next;
- rt_free(cand);
- }
- } else {
- if (chain_length > rt_chain_length_max &&
- slow_chain_length(rt_hash_table[hash].chain) > rt_chain_length_max) {
- struct net *net = dev_net(rt->dst.dev);
- int num = ++net->ipv4.current_rt_cache_rebuild_count;
- if (!rt_caching(net)) {
- printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
- rt->dst.dev->name, num);
- }
- rt_emergency_hash_rebuild(net);
- spin_unlock_bh(rt_hash_lock_addr(hash));
-
- hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
- ifindex, rt_genid(net));
- goto restart;
- }
- }
-
- /* Try to bind route to arp only if it is output
- route or unicast forwarding path.
+ /* To avoid expensive rcu stuff for this uncached dst, we set
+ * DST_NOCACHE so that dst_release() can free dst without
+ * waiting a grace period.
*/
+ rt->dst.flags |= DST_NOCACHE;
if (rt->rt_type == RTN_UNICAST || rt_is_output_route(rt)) {
int err = arp_bind_neighbour(&rt->dst);
if (err) {
- spin_unlock_bh(rt_hash_lock_addr(hash));
-
- if (err != -ENOBUFS) {
- rt_drop(rt);
- return err;
- }
-
- /* Neighbour tables are full and nothing
- can be released. Try to shrink route cache,
- it is most likely it holds some neighbour records.
- */
- if (attempts-- > 0) {
- int saved_elasticity = ip_rt_gc_elasticity;
- int saved_int = ip_rt_gc_min_interval;
- ip_rt_gc_elasticity = 1;
- ip_rt_gc_min_interval = 0;
- rt_garbage_collect(&ipv4_dst_ops);
- ip_rt_gc_min_interval = saved_int;
- ip_rt_gc_elasticity = saved_elasticity;
- goto restart;
- }
-
if (net_ratelimit())
- printk(KERN_WARNING "ipv4: Neighbour table overflow.\n");
- rt_drop(rt);
- return -ENOBUFS;
+ printk(KERN_WARNING
+ "Neighbour table failure & not caching routes.\n");
+ ip_rt_put(rt);
+ return err;
}
}
- rt->dst.rt_next = rt_hash_table[hash].chain;
-
-#if RT_CACHE_DEBUG >= 2
- if (rt->dst.rt_next) {
- struct rtable *trt;
- printk(KERN_DEBUG "rt_cache @%02x: %pI4",
- hash, &rt->rt_dst);
- for (trt = rt->dst.rt_next; trt; trt = trt->dst.rt_next)
- printk(" . %pI4", &trt->rt_dst);
- printk("\n");
- }
-#endif
- /*
- * Since lookup is lockfree, we must make sure
- * previous writes to rt are comitted to memory
- * before making rt visible to other CPUS.
- */
- rcu_assign_pointer(rt_hash_table[hash].chain, rt);
-
- spin_unlock_bh(rt_hash_lock_addr(hash));
-
-skip_hashing:
if (rp)
*rp = rt;
else
@@ -1270,26 +585,6 @@ void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more)
}
EXPORT_SYMBOL(__ip_select_ident);
-static void rt_del(unsigned hash, struct rtable *rt)
-{
- struct rtable __rcu **rthp;
- struct rtable *aux;
-
- rthp = &rt_hash_table[hash].chain;
- spin_lock_bh(rt_hash_lock_addr(hash));
- ip_rt_put(rt);
- while ((aux = rcu_dereference_protected(*rthp,
- lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
- if (aux == rt || rt_is_expired(aux)) {
- *rthp = aux->dst.rt_next;
- rt_free(aux);
- continue;
- }
- rthp = &aux->dst.rt_next;
- }
- spin_unlock_bh(rt_hash_lock_addr(hash));
-}
-
/* called in rcu_read_lock() section */
void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
__be32 saddr, struct net_device *dev)
@@ -1348,14 +643,11 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
ip_rt_put(rt);
ret = NULL;
} else if (rt->rt_flags & RTCF_REDIRECTED) {
- unsigned hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
- rt->fl.oif,
- rt_genid(dev_net(dst->dev)));
#if RT_CACHE_DEBUG >= 1
printk(KERN_DEBUG "ipv4_negative_advice: redirect to %pI4/%02x dropped\n",
- &rt->rt_dst, rt->fl.fl4_tos);
+ &rt->rt_dst, rt->fl.fl4_tos);
#endif
- rt_del(hash, rt);
+ ip_rt_put(rt);
ret = NULL;
} else if (rt->peer &&
rt->peer->pmtu_expires &&
@@ -1820,7 +1112,6 @@ static void rt_set_nexthop(struct rtable *rt, struct fib_result *res, u32 itag)
static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
u8 tos, struct net_device *dev, int our)
{
- unsigned int hash;
struct rtable *rth;
__be32 spec_dst;
struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -1887,8 +1178,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
#endif
RT_CACHE_STAT_INC(in_slow_mc);
- hash = rt_hash(daddr, saddr, dev->ifindex, rt_genid(dev_net(dev)));
- return rt_intern_hash(hash, rth, NULL, skb, dev->ifindex);
+ return rt_finalize(rth, NULL, skb);
e_nobufs:
return -ENOBUFS;
@@ -2035,7 +1325,6 @@ static int ip_mkroute_input(struct sk_buff *skb,
{
struct rtable* rth = NULL;
int err;
- unsigned hash;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
if (res->fi && res->fi->fib_nhs > 1 && fl->oif == 0)
@@ -2048,9 +1337,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
return err;
/* put it into the cache */
- hash = rt_hash(daddr, saddr, fl->iif,
- rt_genid(dev_net(rth->dst.dev)));
- return rt_intern_hash(hash, rth, NULL, skb, fl->iif);
+ return rt_finalize(rth, NULL, skb);
}
/*
@@ -2078,7 +1365,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
unsigned flags = 0;
u32 itag = 0;
struct rtable * rth;
- unsigned hash;
__be32 spec_dst;
int err = -EINVAL;
struct net * net = dev_net(dev);
@@ -2197,8 +1483,7 @@ local_input:
rth->rt_flags &= ~RTCF_LOCAL;
}
rth->rt_type = res.type;
- hash = rt_hash(daddr, saddr, fl.iif, rt_genid(net));
- err = rt_intern_hash(hash, rth, NULL, skb, fl.iif);
+ err = rt_finalize(rth, NULL, skb);
goto out;
no_route:
@@ -2242,47 +1527,10 @@ martian_source_keep_err:
int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
u8 tos, struct net_device *dev, bool noref)
{
- struct rtable * rth;
- unsigned hash;
- int iif = dev->ifindex;
- struct net *net;
int res;
- net = dev_net(dev);
-
rcu_read_lock();
- if (!rt_caching(net))
- goto skip_cache;
-
- tos &= IPTOS_RT_MASK;
- hash = rt_hash(daddr, saddr, iif, rt_genid(net));
-
- for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
- rth = rcu_dereference(rth->dst.rt_next)) {
- if ((((__force u32)rth->fl.fl4_dst ^ (__force u32)daddr) |
- ((__force u32)rth->fl.fl4_src ^ (__force u32)saddr) |
- (rth->fl.iif ^ iif) |
- rth->fl.oif |
- (rth->fl.fl4_tos ^ tos)) == 0 &&
- rth->fl.mark == skb->mark &&
- net_eq(dev_net(rth->dst.dev), net) &&
- !rt_is_expired(rth)) {
- if (noref) {
- dst_use_noref(&rth->dst, jiffies);
- skb_dst_set_noref(skb, &rth->dst);
- } else {
- dst_use(&rth->dst, jiffies);
- skb_dst_set(skb, &rth->dst);
- }
- RT_CACHE_STAT_INC(in_hit);
- rcu_read_unlock();
- return 0;
- }
- RT_CACHE_STAT_INC(in_hlist_search);
- }
-
-skip_cache:
/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
hardware multicast filters :-( As result the host on multicasting
@@ -2439,12 +1687,9 @@ static int ip_mkroute_output(struct rtable **rp,
{
struct rtable *rth = NULL;
int err = __mkroute_output(&rth, res, fl, oldflp, dev_out, flags);
- unsigned hash;
- if (err == 0) {
- hash = rt_hash(oldflp->fl4_dst, oldflp->fl4_src, oldflp->oif,
- rt_genid(dev_net(dev_out)));
- err = rt_intern_hash(hash, rth, rp, NULL, oldflp->oif);
- }
+
+ if (!err == 0)
+ err = rt_finalize(rth, rp, NULL);
return err;
}
@@ -2635,38 +1880,8 @@ out: return err;
int __ip_route_output_key(struct net *net, struct rtable **rp,
const struct flowi *flp)
{
- unsigned int hash;
int res;
- struct rtable *rth;
- if (!rt_caching(net))
- goto slow_output;
-
- hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
-
- rcu_read_lock_bh();
- for (rth = rcu_dereference_bh(rt_hash_table[hash].chain); rth;
- rth = rcu_dereference_bh(rth->dst.rt_next)) {
- if (rth->fl.fl4_dst == flp->fl4_dst &&
- rth->fl.fl4_src == flp->fl4_src &&
- rt_is_output_route(rth) &&
- rth->fl.oif == flp->oif &&
- rth->fl.mark == flp->mark &&
- !((rth->fl.fl4_tos ^ flp->fl4_tos) &
- (IPTOS_RT_MASK | RTO_ONLINK)) &&
- net_eq(dev_net(rth->dst.dev), net) &&
- !rt_is_expired(rth)) {
- dst_use(&rth->dst, jiffies);
- RT_CACHE_STAT_INC(out_hit);
- rcu_read_unlock_bh();
- *rp = rth;
- return 0;
- }
- RT_CACHE_STAT_INC(out_hlist_search);
- }
- rcu_read_unlock_bh();
-
-slow_output:
rcu_read_lock();
res = ip_route_output_slow(net, rp, flp);
rcu_read_unlock();
@@ -2966,43 +2181,6 @@ errout_free:
int ip_rt_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
- struct rtable *rt;
- int h, s_h;
- int idx, s_idx;
- struct net *net;
-
- net = sock_net(skb->sk);
-
- s_h = cb->args[0];
- if (s_h < 0)
- s_h = 0;
- s_idx = idx = cb->args[1];
- for (h = s_h; h <= rt_hash_mask; h++, s_idx = 0) {
- if (!rt_hash_table[h].chain)
- continue;
- rcu_read_lock_bh();
- for (rt = rcu_dereference_bh(rt_hash_table[h].chain), idx = 0; rt;
- rt = rcu_dereference_bh(rt->dst.rt_next), idx++) {
- if (!net_eq(dev_net(rt->dst.dev), net) || idx < s_idx)
- continue;
- if (rt_is_expired(rt))
- continue;
- skb_dst_set_noref(skb, &rt->dst);
- if (rt_fill_info(net, skb, NETLINK_CB(cb->skb).pid,
- cb->nlh->nlmsg_seq, RTM_NEWROUTE,
- 1, NLM_F_MULTI) <= 0) {
- skb_dst_drop(skb);
- rcu_read_unlock_bh();
- goto done;
- }
- skb_dst_drop(skb);
- }
- rcu_read_unlock_bh();
- }
-
-done:
- cb->args[0] = h;
- cb->args[1] = idx;
return skb->len;
}
@@ -3235,16 +2413,6 @@ static __net_initdata struct pernet_operations rt_genid_ops = {
struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
#endif /* CONFIG_IP_ROUTE_CLASSID */
-static __initdata unsigned long rhash_entries;
-static int __init set_rhash_entries(char *str)
-{
- if (!str)
- return 0;
- rhash_entries = simple_strtoul(str, &str, 0);
- return 1;
-}
-__setup("rhash_entries=", set_rhash_entries);
-
int __init ip_rt_init(void)
{
int rc = 0;
@@ -3267,21 +2435,8 @@ int __init ip_rt_init(void)
if (dst_entries_init(&ipv4_dst_blackhole_ops) < 0)
panic("IP: failed to allocate ipv4_dst_blackhole_ops counter\n");
- rt_hash_table = (struct rt_hash_bucket *)
- alloc_large_system_hash("IP route cache",
- sizeof(struct rt_hash_bucket),
- rhash_entries,
- (totalram_pages >= 128 * 1024) ?
- 15 : 17,
- 0,
- &rt_hash_log,
- &rt_hash_mask,
- rhash_entries ? 0 : 512 * 1024);
- memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
- rt_hash_lock_init();
-
- ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
- ip_rt_max_size = (rt_hash_mask + 1) * 16;
+ ipv4_dst_ops.gc_thresh = ~0;
+ ip_rt_max_size = INT_MAX;
devinet_init();
ip_fib_init();
--
1.7.4
^ permalink raw reply related
* Re: [PATCH 4/4] m68k/atari: ARAnyM - Add support for network access
From: Geert Uytterhoeven @ 2011-02-10 8:37 UTC (permalink / raw)
To: David Miller
Cc: linux-m68k, linux-kernel, aranym, schmitz, pstehlik, milan.jurik,
netdev
In-Reply-To: <20110206.111705.183051866.davem@davemloft.net>
On Sun, Feb 6, 2011 at 20:17, David Miller <davem@davemloft.net> wrote:
> From: Geert Uytterhoeven <geert@linux-m68k.org>
> Date: Sun, 6 Feb 2011 11:51:09 +0100
>
>> + dev->trans_start = jiffies;
>
> Device drivers no longer make this operation, the generic code
> does it (see net/core/dev.c:dev_hard_start_xmit() and how it
> invokes txq_trans_update() on ->ndo_start_xmit() success).
>
> Therefore, please remove this line.
Will do.
(180 more to go in drivers/net/?)
>> + pr_debug(DRV_NAME ": send %d bytes\n", len);
>
> For consistency with other network drivers, add an appropriate CPP
> define for "pr_fmt" and use netdev_info(), netdev_debug(), etc.
>
> In situations where a netdev pointer is not available
> (ie. pre-register_netdev()), use "dev_*()" instead.
Will fix.
Thanks for reviewing!
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* Re: [RFC PATCH net-next] net: rename group sysfs entry to netdev_group
From: Vlad Dogaru @ 2011-02-10 9:11 UTC (permalink / raw)
To: David Miller; +Cc: dfeng, netdev, eric.dumazet, therbert, ebiederm, shemminger
In-Reply-To: <20110209.140323.39178091.davem@davemloft.net>
On Wed, Feb 09, 2011 at 02:03:23PM -0800, David Miller wrote:
> From: Xiaotian feng <dfeng@redhat.com>
> Date: Wed, 9 Feb 2011 18:52:49 +0800
>
> > From: Xiaotian Feng <dfeng@redhat.com>
> >
> > commit a512b92 adds sysfs entry for net device group, but
> > before this commit, tun also uses group sysfs, so after this
> > commit checkin, kernel warns like this:
> > sysfs: cannot create duplicate filename '/devices/virtual/net/vnet0/group'
> >
> > Since tun has used this for years, rename sysfs under tun might
> > break existing userspace, so rename group sysfs entry for net device
> > group is a better choice.
I was not aware of that, sorry for breaking things :)
^ permalink raw reply
* Re: [PATCH net-next-2.6 v6 1/1] can: c_can: Added support for Bosch C_CAN controller
From: Marc Kleine-Budde @ 2011-02-10 9:28 UTC (permalink / raw)
To: Bhupesh SHARMA
Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Wolfgang Grandegger
In-Reply-To: <D5ECB3C7A6F99444980976A8C6D896384DEE36615D-8vAmw3ZAcdzhJTuQ9jeba9BPR1lH4CV8@public.gmane.org>
[-- Attachment #1.1: Type: text/plain, Size: 18526 bytes --]
Hello
On 02/09/2011 11:54 AM, Bhupesh SHARMA wrote:
[...]
>> The driver looks quite good, some comments inline, most of them
>> nitpicking and or style related.
>>
>> Have a look at the netif_stop_queue(). In the at91 driver there are two
>> possibilities that to stop the queue. First the next tx mailbox is
>> still in use, second we have a wrap around. But your hardware is a bit
>> different. Anyways a second look doesn't harm.
>
> As the Tx/Rx path of my driver are based on at91 driver, I have earlier
> gone through the possibility of stopping the tx queue in the two cases as
> you mentioned above :)
>
> As per at91 specs, MRDY=0 signifies:
> "Mailbox data registers cannot be read/written by the software application"
>
> But, after reading the Bosch C_CAN specs Transmission Request Register(TxRqst)'s
> bits if set to 1, signify that the transmission of this Message Object is
> requested and is not yet done. If you agree we can add a check against the same here.
> Please do go through the Bosch C_CAN specs for details:
> http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/
> c_can/users_manual_c_can.pdf
Better save then sorry, add the check.
>
>>> ---
>>> Changes since V5:
>>> 1. Seperated the state change and bus error handling paths.
>>> 2. Added logic to write LEC value to 0x7 from CPU to check for
>> updates
>>> later.
>>> 3. Corrected the ERROR_WARNING handling logic to correctly send error
>>> frames on the bus.
>>>
>>> drivers/net/can/Kconfig | 2 +
>>> drivers/net/can/Makefile | 1 +
>>> drivers/net/can/c_can/Kconfig | 15 +
>>> drivers/net/can/c_can/Makefile | 8 +
>>> drivers/net/can/c_can/c_can.c | 993
>> ++++++++++++++++++++++++++++++++
>>> drivers/net/can/c_can/c_can.h | 230 ++++++++
>>> drivers/net/can/c_can/c_can_platform.c | 207 +++++++
>>> 7 files changed, 1456 insertions(+), 0 deletions(-) create mode
>>> 100644 drivers/net/can/c_can/Kconfig create mode 100644
>>> drivers/net/can/c_can/Makefile create mode 100644
>>> drivers/net/can/c_can/c_can.c create mode 100644
>>> drivers/net/can/c_can/c_can.h create mode 100644
>>> drivers/net/can/c_can/c_can_platform.c
>>>
>>> diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig index
>>> 5dec456..1d699e3 100644
>>> --- a/drivers/net/can/Kconfig
>>> +++ b/drivers/net/can/Kconfig
>>> @@ -115,6 +115,8 @@ source "drivers/net/can/mscan/Kconfig"
>>>
>>> source "drivers/net/can/sja1000/Kconfig"
>>>
>>> +source "drivers/net/can/c_can/Kconfig"
>>> +
>>> source "drivers/net/can/usb/Kconfig"
>>>
>>> source "drivers/net/can/softing/Kconfig"
>>> diff --git a/drivers/net/can/Makefile b/drivers/net/can/Makefile
>> index
>>> 53c82a7..24ebfe8 100644
>>> --- a/drivers/net/can/Makefile
>>> +++ b/drivers/net/can/Makefile
>>> @@ -13,6 +13,7 @@ obj-y += softing/
>>>
>>> obj-$(CONFIG_CAN_SJA1000) += sja1000/
>>> obj-$(CONFIG_CAN_MSCAN) += mscan/
>>> +obj-$(CONFIG_CAN_C_CAN) += c_can/
>>> obj-$(CONFIG_CAN_AT91) += at91_can.o
>>> obj-$(CONFIG_CAN_TI_HECC) += ti_hecc.o
>>> obj-$(CONFIG_CAN_MCP251X) += mcp251x.o
>>> diff --git a/drivers/net/can/c_can/Kconfig
>>> b/drivers/net/can/c_can/Kconfig new file mode 100644 index
>>> 0000000..ffb9773
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/Kconfig
>>> @@ -0,0 +1,15 @@
>>> +menuconfig CAN_C_CAN
>>> + tristate "Bosch C_CAN devices"
>>> + depends on CAN_DEV && HAS_IOMEM
>>> +
>>> +if CAN_C_CAN
>>> +
>>> +config CAN_C_CAN_PLATFORM
>>> + tristate "Generic Platform Bus based C_CAN driver"
>>> + ---help---
>>> + This driver adds support for the C_CAN chips connected to
>>> + the "platform bus" (Linux abstraction for directly to the
>>> + processor attached devices) which can be found on various
>>> + boards from ST Microelectronics (http://www.st.com)
>>> + like the SPEAr1310 and SPEAr320 evaluation boards.
>>> +endif
>>> diff --git a/drivers/net/can/c_can/Makefile
>>> b/drivers/net/can/c_can/Makefile new file mode 100644 index
>>> 0000000..9273f6d
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/Makefile
>>> @@ -0,0 +1,8 @@
>>> +#
>>> +# Makefile for the Bosch C_CAN controller drivers.
>>> +#
>>> +
>>> +obj-$(CONFIG_CAN_C_CAN) += c_can.o
>>> +obj-$(CONFIG_CAN_C_CAN_PLATFORM) += c_can_platform.o
>>> +
>>> +ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG
>>> diff --git a/drivers/net/can/c_can/c_can.c
>>> b/drivers/net/can/c_can/c_can.c new file mode 100644 index
>>> 0000000..7ef4aa9
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/c_can.c
>>> @@ -0,0 +1,993 @@
>>> +/*
>>> + * CAN bus driver for Bosch C_CAN controller
>>> + *
>>> + * Copyright (C) 2010 ST Microelectronics
>>> + * Bhupesh Sharma <bhupesh.sharma-qxv4g6HH51o@public.gmane.org>
>>> + *
>>> + * Borrowed heavily from the C_CAN driver originally written by:
>>> + * Copyright (C) 2007
>>> + * - Sascha Hauer, Marc Kleine-Budde, Pengutronix
>>> +<s.hauer-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
>>> + * - Simon Kallweit, intefo AG <simon.kallweit-+G9qxTFKJT/tRgLqZ5aouw@public.gmane.org>
>>> + *
>>> + * TX and RX NAPI implementation has been borrowed from at91 CAN
>>> +driver
>>> + * written by:
>>> + * Copyright
>>> + * (C) 2007 by Hans J. Koch <hjk-vqZO0P4V72/QD6PfKP4TzA@public.gmane.org>
>>> + * (C) 2008, 2009 by Marc Kleine-Budde <kernel-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
>>> + *
>>> + * Bosch C_CAN controller is compliant to CAN protocol version 2.0
>> part A and B.
>>> + * Bosch C_CAN user manual can be obtained from:
>>> + *
>> http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/c_can/
>>> + * users_manual_c_can.pdf
>>> + *
>>> + * This file is licensed under the terms of the GNU General Public
>>> + * License version 2. This program is licensed "as is" without any
>>> + * warranty of any kind, whether express or implied.
>>> + */
>>> +
>>> +#include <linux/kernel.h>
>>> +#include <linux/version.h>
>>> +#include <linux/module.h>
>>> +#include <linux/interrupt.h>
>>> +#include <linux/delay.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/if_arp.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/list.h>
>>> +#include <linux/delay.h>
>>> +#include <linux/io.h>
>>> +
>>> +#include <linux/can.h>
>>> +#include <linux/can/dev.h>
>>> +#include <linux/can/error.h>
>>> +
>>> +#include "c_can.h"
>>> +
>>> +static struct can_bittiming_const c_can_bittiming_const = {
>>> + .name = KBUILD_MODNAME,
>>> + .tseg1_min = 2, /* Time segment 1 = prop_seg + phase_seg1
>> */
>>> + .tseg1_max = 16,
>>> + .tseg2_min = 1, /* Time segment 2 = phase_seg2 */
>>> + .tseg2_max = 8,
>>> + .sjw_max = 4,
>>> + .brp_min = 1,
>>> + .brp_max = 1024, /* 6-bit BRP field + 4-bit BRPE field*/
>>> + .brp_inc = 1,
>>> +};
>>> +
>>> +static inline int get_tx_next_msg_obj(const struct c_can_priv *priv)
>>> +{
>>> + return (priv->tx_next & C_CAN_NEXT_MSG_OBJ_MASK) +
>>> + C_CAN_MSG_OBJ_TX_FIRST;
>>> +}
>>> +
>>> +static inline int get_tx_echo_msg_obj(const struct c_can_priv *priv)
>>> +{
>>> + return (priv->tx_echo & C_CAN_NEXT_MSG_OBJ_MASK) +
>>> + C_CAN_MSG_OBJ_TX_FIRST;
>>> +}
>>> +
>>> +static u32 c_can_read_reg32(struct c_can_priv *priv, void *reg) {
>>> + u32 val = priv->read_reg(priv, reg);
>>> + val |= ((u32) priv->read_reg(priv, reg + 2)) << 16;
>>> + return val;
>>> +}
>>> +
>>> +static void c_can_enable_all_interrupts(struct c_can_priv *priv,
>>> + int enable)
>>> +{
>>> + unsigned int cntrl_save = priv->read_reg(priv,
>>> + &priv->regs->control);
>>> +
>>> + if (enable)
>>> + cntrl_save |= (CONTROL_SIE | CONTROL_EIE | CONTROL_IE);
>>> + else
>>> + cntrl_save &= ~(CONTROL_EIE | CONTROL_IE | CONTROL_SIE);
>>> +
>>> + priv->write_reg(priv, &priv->regs->control, cntrl_save); }
>>> +
>>> +static inline int c_can_check_busy_status(struct c_can_priv *priv,
>>> +int iface) {
>>> + int count = MIN_TIMEOUT_VALUE;
>>> +
>>> + while (count && priv->read_reg(priv,
>>> + &priv->regs->ifregs[iface].com_req) &
>>> + IF_COMR_BUSY) {
>>> + count--;
>>> + udelay(1);
>>> + }
>>> +
>>> + return count;
>>
>> it's an unusual return value...maybe return 0 on success and -EBUSY
>> otherwise?
>
> Hmm.. this will add the checking MIN_TIMEOUT_VALUE against 0 here,
You mean check "count" against 0 here...
> instead of "c_can_object_get" and "c_can_object_put" routines.
> If you persist we can add the same in V7 though.. :)
We have a function "c_can_check_busy_status()". What does it return? The
name doesn't tell me. I think it would be more clear if you just rename
the function to "c_can_is_busy()" or "c_can_object_is_busy()".
The you can use it like this:
if (c_can_is_busy()) {
printk("mailbox still busy!\n");
return -EWHATEVER;
}
>
>>> +}
>>> +
>>> +static inline void c_can_object_get(struct net_device *dev,
>>> + int iface, int objno, int mask)
>>> +{
>>> + int ret;
>>> + struct c_can_priv *priv = netdev_priv(dev);
>>> +
>>> + /*
>>> + * As per specs, after writting the message object number in the
>>> + * IF command request register the transfer b/w interface
>>> + * register and message RAM must be complete in 6 CAN-CLK
>>> + * period.
>>> + */
>>> + priv->write_reg(priv, &priv->regs->ifregs[iface].com_mask,
>>> + IFX_WRITE_LOW_16BIT(mask));
>>> + priv->write_reg(priv, &priv->regs->ifregs[iface].com_req,
>>> + IFX_WRITE_LOW_16BIT(objno));
>>> +
>>> + ret = c_can_check_busy_status(priv, iface);
>>> + if (!ret)
>>> + netdev_err(dev, "timed out in object get\n"); }
There's no error handling for the object is busy case....
[...]
>>> diff --git a/drivers/net/can/c_can/c_can.h
>>> b/drivers/net/can/c_can/c_can.h new file mode 100644 index
>>> 0000000..bd094e6
>>> --- /dev/null
>>> +++ b/drivers/net/can/c_can/c_can.h
>>> @@ -0,0 +1,230 @@
>>> +/*
>>> + * CAN bus driver for Bosch C_CAN controller
>>> + *
>>> + * Copyright (C) 2010 ST Microelectronics
>>> + * Bhupesh Sharma <bhupesh.sharma-qxv4g6HH51o@public.gmane.org>
>>> + *
>>> + * Borrowed heavily from the C_CAN driver originally written by:
>>> + * Copyright (C) 2007
>>> + * - Sascha Hauer, Marc Kleine-Budde, Pengutronix
>>> +<s.hauer-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
>>> + * - Simon Kallweit, intefo AG <simon.kallweit-+G9qxTFKJT/tRgLqZ5aouw@public.gmane.org>
>>> + *
>>> + * Bosch C_CAN controller is compliant to CAN protocol version 2.0
>> part A and B.
>>> + * Bosch C_CAN user manual can be obtained from:
>>> + *
>> http://www.semiconductors.bosch.de/media/en/pdf/ipmodules_1/c_can/
>>> + * users_manual_c_can.pdf
>>> + *
>>> + * This file is licensed under the terms of the GNU General Public
>>> + * License version 2. This program is licensed "as is" without any
>>> + * warranty of any kind, whether express or implied.
>>> + */
>>> +
>>> +#ifndef C_CAN_H
>>> +#define C_CAN_H
>>> +
>>> +/* control register */
>>> +#define CONTROL_TEST BIT(7)
>>> +#define CONTROL_CCE BIT(6)
>>> +#define CONTROL_DISABLE_AR BIT(5)
>>> +#define CONTROL_ENABLE_AR (0 << 5)
>>> +#define CONTROL_EIE BIT(3)
>>> +#define CONTROL_SIE BIT(2)
>>> +#define CONTROL_IE BIT(1)
>>> +#define CONTROL_INIT BIT(0)
>>> +
>>> +/* test register */
>>> +#define TEST_RX BIT(7)
>>> +#define TEST_TX1 BIT(6)
>>> +#define TEST_TX2 BIT(5)
>>> +#define TEST_LBACK BIT(4)
>>> +#define TEST_SILENT BIT(3)
>>> +#define TEST_BASIC BIT(2)
>>> +
>>> +/* status register */
>>> +#define STATUS_BOFF BIT(7)
>>> +#define STATUS_EWARN BIT(6)
>>> +#define STATUS_EPASS BIT(5)
>>> +#define STATUS_RXOK BIT(4)
>>> +#define STATUS_TXOK BIT(3)
>>> +
>>> +/* error counter register */
>>> +#define ERR_CNT_TEC_MASK 0xff
>>> +#define ERR_CNT_TEC_SHIFT 0
>>> +#define ERR_CNT_REC_SHIFT 8
>>> +#define ERR_CNT_REC_MASK (0x7f << ERR_CNT_REC_SHIFT)
>>> +#define ERR_CNT_RP_SHIFT 15
>>> +#define ERR_CNT_RP_MASK (0x1 << ERR_CNT_RP_SHIFT)
>>> +
>>> +/* bit-timing register */
>>> +#define BTR_BRP_MASK 0x3f
>>> +#define BTR_BRP_SHIFT 0
>>> +#define BTR_SJW_SHIFT 6
>>> +#define BTR_SJW_MASK (0x3 << BTR_SJW_SHIFT)
>>> +#define BTR_TSEG1_SHIFT 8
>>> +#define BTR_TSEG1_MASK (0xf << BTR_TSEG1_SHIFT)
>>> +#define BTR_TSEG2_SHIFT 12
>>> +#define BTR_TSEG2_MASK (0x7 << BTR_TSEG2_SHIFT)
>>> +
>>> +/* brp extension register */
>>> +#define BRP_EXT_BRPE_MASK 0x0f
>>> +#define BRP_EXT_BRPE_SHIFT 0
>>> +
>>> +/* IFx command request */
>>> +#define IF_COMR_BUSY BIT(15)
>>> +
>>> +/* IFx command mask */
>>> +#define IF_COMM_WR BIT(7)
>>> +#define IF_COMM_MASK BIT(6)
>>> +#define IF_COMM_ARB BIT(5)
>>> +#define IF_COMM_CONTROL BIT(4)
>>> +#define IF_COMM_CLR_INT_PND BIT(3)
>>> +#define IF_COMM_TXRQST BIT(2)
>>> +#define IF_COMM_DATAA BIT(1)
>>> +#define IF_COMM_DATAB BIT(0)
>>> +#define IF_COMM_ALL (IF_COMM_MASK | IF_COMM_ARB | \
>>> + IF_COMM_CONTROL | IF_COMM_TXRQST | \
>>> + IF_COMM_DATAA | IF_COMM_DATAB)
>>> +
>>> +/* IFx arbitration */
>>> +#define IF_ARB_MSGVAL BIT(15)
>>> +#define IF_ARB_MSGXTD BIT(14)
>>> +#define IF_ARB_TRANSMIT BIT(13)
>>> +
>>> +/* IFx message control */
>>> +#define IF_MCONT_NEWDAT BIT(15)
>>> +#define IF_MCONT_MSGLST BIT(14)
>>> +#define IF_MCONT_CLR_MSGLST (0 << 14)
>>> +#define IF_MCONT_INTPND BIT(13)
>>> +#define IF_MCONT_UMASK BIT(12)
>>> +#define IF_MCONT_TXIE BIT(11)
>>> +#define IF_MCONT_RXIE BIT(10)
>>> +#define IF_MCONT_RMTEN BIT(9)
>>> +#define IF_MCONT_TXRQST BIT(8)
>>> +#define IF_MCONT_EOB BIT(7)
>>> +#define IF_MCONT_DLC_MASK 0xf
>>> +
>>> +/*
>>> + * IFx register masks:
>>> + * allow easy operation on 16-bit registers when the
>>> + * argument is 32-bit instead
>>> + */
>>> +#define IFX_WRITE_LOW_16BIT(x) ((x) & 0xFFFF)
>>> +#define IFX_WRITE_HIGH_16BIT(x) (((x) & 0xFFFF0000) >> 16)
>>> +
>>> +/* message object split */
>>> +#define C_CAN_NO_OF_OBJECTS 32
>>> +#define C_CAN_MSG_OBJ_RX_NUM 16
>>> +#define C_CAN_MSG_OBJ_TX_NUM 16
>>> +
>>> +#define C_CAN_MSG_OBJ_RX_FIRST 1
>>> +#define C_CAN_MSG_OBJ_RX_LAST (C_CAN_MSG_OBJ_RX_FIRST + \
>>> + C_CAN_MSG_OBJ_RX_NUM - 1)
>>> +
>>> +#define C_CAN_MSG_OBJ_TX_FIRST (C_CAN_MSG_OBJ_RX_LAST + 1)
>>> +#define C_CAN_MSG_OBJ_TX_LAST (C_CAN_MSG_OBJ_TX_FIRST + \
>>> + C_CAN_MSG_OBJ_TX_NUM - 1)
>>> +
>>> +#define C_CAN_MSG_OBJ_RX_SPLIT 9
>>> +#define C_CAN_MSG_RX_LOW_LAST (C_CAN_MSG_OBJ_RX_SPLIT - 1)
>>> +
>>> +#define C_CAN_NEXT_MSG_OBJ_MASK (C_CAN_MSG_OBJ_TX_NUM - 1)
>>> +#define RECEIVE_OBJECT_BITS 0x0000ffff
>>> +
>>> +/* status interrupt */
>>> +#define STATUS_INTERRUPT 0x8000
>>> +
>>> +/* global interrupt masks */
>>> +#define ENABLE_ALL_INTERRUPTS 1
>>> +#define DISABLE_ALL_INTERRUPTS 0
>>> +
>>> +/* minimum timeout for checking BUSY status */
>>> +#define MIN_TIMEOUT_VALUE 6
>>> +
>>> +/* napi related */
>>> +#define C_CAN_NAPI_WEIGHT C_CAN_MSG_OBJ_RX_NUM
>>> +
>>> +/* c_can IF registers */
>>> +struct c_can_if_regs {
>>> + u16 com_req;
>>> + u16 com_mask;
>>> + u16 mask1;
>>> + u16 mask2;
>>> + u16 arb1;
>>> + u16 arb2;
>>> + u16 msg_cntrl;
>>> + u16 data[4];
>>> + u16 _reserved[13];
>>> +};
>>> +
>>> +/* c_can hardware registers */
>>> +struct c_can_regs {
>>> + u16 control;
>>> + u16 status;
>>> + u16 err_cnt;
>>> + u16 btr;
>>> + u16 interrupt;
>>> + u16 test;
>>> + u16 brp_ext;
>>> + u16 _reserved1;
>>> + struct c_can_if_regs ifregs[2]; /* [0] = IF1 and [1] = IF2 */
>>> + u16 _reserved2[8];
>>> + u16 txrqst1;
>>> + u16 txrqst2;
>>> + u16 _reserved3[6];
>>> + u16 newdat1;
>>> + u16 newdat2;
>>> + u16 _reserved4[6];
>>> + u16 intpnd1;
>>> + u16 intpnd2;
>>> + u16 _reserved5[6];
>>> + u16 msgval1;
>>> + u16 msgval2;
>>> + u16 _reserved6[6];
>>> +};
>>> +
>>> +/* c_can lec values */
>>> +enum c_can_lec_type {
>>> + LEC_NO_ERROR = 0,
>>> + LEC_STUFF_ERROR,
>>> + LEC_FORM_ERROR,
>>> + LEC_ACK_ERROR,
>>> + LEC_BIT1_ERROR,
>>> + LEC_BIT0_ERROR,
>>> + LEC_CRC_ERROR,
>>> + LEC_UNUSED,
>>> +};
>>> +
>>> +/*
>>> + * c_can error types:
>>> + * Bus errors (BUS_OFF, ERROR_WARNING, ERROR_PASSIVE) are supported
>>> +*/ enum c_can_bus_error_types {
>>> + C_CAN_NO_ERROR = 0,
>>> + C_CAN_BUS_OFF,
>>> + C_CAN_ERROR_WARNING,
>>> + C_CAN_ERROR_PASSIVE,
>>> +};
>>
>> nitpick: are the defines, enums and structs needed in more than one c
>> file? If not, please move them into the c-file where they are used.
>
> Well most of the strcuts/defines are useful in both `c_can.c` and
> `c_can_platform.c`, but I will explore how to separate the rest in
> the respective c-files. But inititally we agreed to a *sja1000* like
> approach and hence this placement in h-file.
Yes, your're right.
But keeping only in one .c or .h file used things out of the .h file is
a good rule for cleaner code :)
regards, Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]
[-- Attachment #2: Type: text/plain, Size: 188 bytes --]
_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core
^ permalink raw reply
* Dear Sir/Madam,
From: Dr. Donald Wilson @ 2011-02-10 9:51 UTC (permalink / raw)
We offer at 2% interest rate. Between $1,000 USD to $100M USD. Get back if interested. Dr. Donald Wilson
^ permalink raw reply
* [RFD][PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:14 UTC (permalink / raw)
To: netdev; +Cc: rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm
Hi folks,
This patch implements an extension for BPF to allow filter programs to use a
data section, along with a MEMCMP instruction.
There are a few issues noted in the patch itself, which can easily be
addressed, and I would like to check wether sk_run_filter is ever expected to
be called from a context that cannot sleep (I dont think it is).
I think the patch should probably be split into a patch to add data sections
and one adding the JMEMCMP instruction, but that can be done after some review!
-Ian
^ permalink raw reply
* [RFD][PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:31 UTC (permalink / raw)
To: netdev; +Cc: rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
alban.crequy
Documentation/networking/filter.txt | 9 ++
drivers/isdn/i4l/isdn_ppp.c | 2
drivers/net/ppp_generic.c | 2
include/asm-generic/socket.h | 2
include/linux/filter.h | 17 ++++-
include/linux/ptp_classify.h | 2
net/core/filter.c | 115 ++++++++++++++++++++++++++++++++++--
net/core/sock.c | 14 ++++
net/core/timestamping.c | 4 -
net/packet/af_packet.c | 3
This patch adds support for adding a data section to BPF. It is intended to be
used by the JMEMCMP instruction also added in this patch.
There are some issues, mostly noted int he commit message, and I'd like to
check that sk_run_filter() does not get called from a context that cannot sleep
(I dont think so).
Comments welcome!
-Ian
^ permalink raw reply
* [PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:31 UTC (permalink / raw)
To: netdev
Cc: rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
alban.crequy, Ian Molton
In-Reply-To: <1297341067-12264-1-git-send-email-ian.molton@collabora.co.uk>
This patch allows a data section to be specified for BPF.
This is made use of by a MEMCMP like instruction.
Testsuite here:
http://git.collabora.co.uk/?p=user/ian/check-bpf.git;a=summary
Issues:
* Do I need to update the headers for all arches, or just generic
* Can sk_run_filter() be called in a context where kmalloc(GFP_KERNEL) is
not allowed (I think not)
* Data section allocated with second call to sock_kmalloc().
* Should the patch be broken into two - one to add the data uploading,
one to add the JMEMCMP insn. ?
---
Documentation/networking/filter.txt | 9 +++
drivers/isdn/i4l/isdn_ppp.c | 2 +-
drivers/net/ppp_generic.c | 2 +-
include/asm-generic/socket.h | 2 +
include/linux/filter.h | 17 +++++-
include/linux/ptp_classify.h | 2 +-
net/core/filter.c | 115 +++++++++++++++++++++++++++++++++-
net/core/sock.c | 14 ++++
net/core/timestamping.c | 4 +-
net/packet/af_packet.c | 3 +-
10 files changed, 158 insertions(+), 12 deletions(-)
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index bbf2005..d6efb5f 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -31,6 +31,15 @@ the old one and placing your new one in its place, assuming your
filter has passed the checks, otherwise if it fails the old filter
will remain on that socket.
+Linux packet filters also provide a facility to upload a data section
+for use with the JMEMCMP instruction. This is done using the
+SO_ATTACH_FILTER_WITH_DATA parameter to setsockopt().
+
+The JMEMCMP instruction allows arbitrary comparisons between the packet
+data and the filters data. The K, A, and X registers provide the offset
+into the data, the number of bytes to compare, and the offset into the
+packet, respectively.
+
Examples
========
diff --git a/drivers/isdn/i4l/isdn_ppp.c b/drivers/isdn/i4l/isdn_ppp.c
index 9e8162c..1a0d513 100644
--- a/drivers/isdn/i4l/isdn_ppp.c
+++ b/drivers/isdn/i4l/isdn_ppp.c
@@ -453,7 +453,7 @@ static int get_filter(void __user *arg, struct sock_filter **p)
if (IS_ERR(code))
return PTR_ERR(code);
- err = sk_chk_filter(code, uprog.len);
+ err = sk_chk_filter(code, uprog.len, NULL);
if (err) {
kfree(code);
return err;
diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 9f6d670..345c3ac 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -542,7 +542,7 @@ static int get_filter(void __user *arg, struct sock_filter **p)
if (IS_ERR(code))
return PTR_ERR(code);
- err = sk_chk_filter(code, uprog.len);
+ err = sk_chk_filter(code, uprog.len, NULL);
if (err) {
kfree(code);
return err;
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..83458b9 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -64,4 +64,6 @@
#define SO_DOMAIN 39
#define SO_RXQ_OVFL 40
+
+#define SO_ATTACH_FILTER_WITH_DATA 41
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45266b7..c290e17 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -35,6 +35,13 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
struct sock_filter __user *filter;
};
+struct sock_fprog_with_data { /* Required for SO_ATTACH_FILTER_WITH_DATA. */
+ unsigned short len; /* Number of filter blocks */
+ unsigned short data_len;
+ __u8 __user *data; /* Program data section */
+ struct sock_filter __user *filter;
+};
+
/*
* Instruction classes
*/
@@ -78,6 +85,7 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#define BPF_JGT 0x20
#define BPF_JGE 0x30
#define BPF_JSET 0x40
+#define BPF_JMEMCMP 0x50
#define BPF_SRC(code) ((code) & 0x08)
#define BPF_K 0x00
#define BPF_X 0x08
@@ -136,6 +144,8 @@ struct sk_filter
atomic_t refcnt;
unsigned int len; /* Number of filter blocks */
struct rcu_head rcu;
+ u8 *data;
+ unsigned int data_len;
struct sock_filter insns[0];
};
@@ -149,10 +159,13 @@ struct sock;
extern int sk_filter(struct sock *sk, struct sk_buff *skb);
extern unsigned int sk_run_filter(const struct sk_buff *skb,
- const struct sock_filter *filter);
+ const struct sock_filter *filter,
+ const u8 *data, const unsigned int data_len);
extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+extern int sk_attach_filter_with_data(struct sock_fprog_with_data *fprog,
+ struct sock *sk);
extern int sk_detach_filter(struct sock *sk);
-extern int sk_chk_filter(struct sock_filter *filter, int flen);
+extern int sk_chk_filter(struct sock_filter *filter, int flen, u8 *data);
#endif /* __KERNEL__ */
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/linux/ptp_classify.h b/include/linux/ptp_classify.h
index 943a85a..bfe8f7a 100644
--- a/include/linux/ptp_classify.h
+++ b/include/linux/ptp_classify.h
@@ -79,7 +79,7 @@
static inline int ptp_filter_init(struct sock_filter *f, int len)
{
if (OP_LDH == f[0].code)
- return sk_chk_filter(f, len);
+ return sk_chk_filter(f, len, NULL);
else
return 0;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index 232b187..eb5f4e2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,7 @@ enum {
BPF_S_JMP_JGT_X,
BPF_S_JMP_JSET_K,
BPF_S_JMP_JSET_X,
+ BPF_S_JMP_MEMCMP,
/* Ancillary data */
BPF_S_ANC_PROTOCOL,
BPF_S_ANC_PKTTYPE,
@@ -145,7 +146,9 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
rcu_read_lock();
filter = rcu_dereference(sk->sk_filter);
if (filter) {
- unsigned int pkt_len = sk_run_filter(skb, filter->insns);
+ unsigned int pkt_len = sk_run_filter(skb, filter->insns,
+ filter->data,
+ filter->data_len);
err = pkt_len ? pskb_trim(skb, pkt_len) : -EPERM;
}
@@ -168,7 +171,8 @@ EXPORT_SYMBOL(sk_filter);
* flen. (We used to pass to this function the length of filter)
*/
unsigned int sk_run_filter(const struct sk_buff *skb,
- const struct sock_filter *fentry)
+ const struct sock_filter *fentry,
+ const u8 *data, const unsigned int data_len)
{
void *ptr;
u32 A = 0; /* Accumulator */
@@ -268,6 +272,46 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
case BPF_S_JMP_JSET_X:
fentry += (A & X) ? fentry->jt : fentry->jf;
continue;
+ case BPF_S_JMP_MEMCMP: {
+ u8 *pkt_data, *tmp_data;
+
+ /* A = Comparison length.
+ * K = Offset into the data
+ * X = Offset into the packet
+ */
+ if (K + A > data_len || X + A > skb->len)
+ return 0;
+
+ /* We should write a skb aware memcmp() and avoid
+ * copying the contents of the skb
+ */
+ tmp_data = (u8*)kmalloc(A, GFP_KERNEL);
+
+ if(!tmp_data)
+ return 0;
+
+ /* Load enough bytes to analyse already offset by K */
+ ptr = load_pointer(skb, X, A, tmp_data);
+
+ if (!ptr) {
+ kfree(tmp_data);
+ return 0;
+ }
+
+ pkt_data = (u8 *)ptr;
+
+ /* data will not be NULL here if sk_chk_filter() has
+ * been called. Since SO_ATTACH_FILTER{,_WITH_DATA}
+ * both call this, only broken kernel code can cause
+ * trouble
+ */
+ fentry += (!memcmp(data + K, pkt_data, A))
+ ? fentry->jt : fentry->jf;
+
+ kfree(tmp_data);
+
+ continue;
+ }
case BPF_S_LD_W_ABS:
k = K;
load_w:
@@ -492,7 +536,7 @@ error:
*
* Returns 0 if the rule set is legal or -EINVAL if not.
*/
-int sk_chk_filter(struct sock_filter *filter, int flen)
+int sk_chk_filter(struct sock_filter *filter, int flen, u8 *data)
{
/*
* Valid instructions are initialized to non-0.
@@ -544,6 +588,7 @@ int sk_chk_filter(struct sock_filter *filter, int flen)
[BPF_JMP|BPF_JGT|BPF_X] = BPF_S_JMP_JGT_X,
[BPF_JMP|BPF_JSET|BPF_K] = BPF_S_JMP_JSET_K,
[BPF_JMP|BPF_JSET|BPF_X] = BPF_S_JMP_JSET_X,
+ [BPF_JMP|BPF_JMEMCMP|BPF_K|BPF_X|BPF_A] = BPF_S_JMP_MEMCMP,
};
int pc;
@@ -585,6 +630,10 @@ int sk_chk_filter(struct sock_filter *filter, int flen)
if (ftest->k >= (unsigned)(flen-pc-1))
return -EINVAL;
break;
+ case BPF_S_JMP_MEMCMP:
+ if (!data)
+ return -EINVAL;
+ /* Fall through */
case BPF_S_JMP_JEQ_K:
case BPF_S_JMP_JEQ_X:
case BPF_S_JMP_JGE_K:
@@ -672,8 +721,10 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
atomic_set(&fp->refcnt, 1);
fp->len = fprog->len;
+ fp->data = NULL;
+ fp->data_len = 0;
- err = sk_chk_filter(fp->insns, fp->len);
+ err = sk_chk_filter(fp->insns, fp->len, NULL);
if (err) {
sk_filter_uncharge(sk, fp);
return err;
@@ -689,6 +740,62 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_attach_filter);
+int sk_attach_filter_with_data(struct sock_fprog_with_data *fprog, struct sock *sk)
+{
+ struct sk_filter *fp, *old_fp;
+ unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
+ unsigned int dsize = fprog->data_len;
+ int err = -ENOMEM;
+
+ /* Make sure new filter is there and in the right amounts. */
+ if (fprog->filter == NULL)
+ return -EINVAL;
+
+ fp = sock_kmalloc(sk, fsize+sizeof(*fp), GFP_KERNEL);
+
+ if (!fp)
+ goto out;
+
+ fp->data = sock_kmalloc(sk, dsize, GFP_KERNEL);
+
+ if(!fp->data)
+ goto out_free;
+
+ if (copy_from_user(fp->data, fprog->data, dsize))
+ goto out_free;
+
+ fp->data_len = dsize;
+
+ if (copy_from_user(fp->insns, fprog->filter, fsize))
+ goto out_free_data;
+
+ atomic_set(&fp->refcnt, 1);
+ fp->len = fprog->len;
+
+ err = sk_chk_filter(fp->insns, fp->len, fp->data);
+ if (err)
+ goto out_uncharge;
+
+ old_fp = rcu_dereference_protected(sk->sk_filter,
+ sock_owned_by_user(sk));
+ rcu_assign_pointer(sk->sk_filter, fp);
+
+ if (old_fp)
+ sk_filter_uncharge(sk, old_fp);
+
+ return 0;
+
+out_uncharge:
+ sk_filter_uncharge(sk, fp);
+out_free_data:
+ sock_kfree_s(sk, fp->data, dsize);
+out_free:
+ sock_kfree_s(sk, fp, fsize+sizeof(*fp));
+out:
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_attach_filter_with_data);
+
int sk_detach_filter(struct sock *sk)
{
int ret = -ENOENT;
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dfed79..627f731 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -714,6 +714,20 @@ set_rcvbuf:
ret = sk_attach_filter(&fprog, sk);
}
break;
+
+ case SO_ATTACH_FILTER_WITH_DATA:
+ ret = -EINVAL;
+ if (optlen == sizeof(struct sock_fprog_with_data)) {
+ struct sock_fprog_with_data fprog;
+
+ ret = -EFAULT;
+ if (copy_from_user(&fprog, optval,
+ sizeof(fprog)))
+ break;
+
+ ret = sk_attach_filter_with_data(&fprog, sk);
+ }
+ break;
case SO_DETACH_FILTER:
ret = sk_detach_filter(sk);
diff --git a/net/core/timestamping.c b/net/core/timestamping.c
index 7e7ca37..efb9a44 100644
--- a/net/core/timestamping.c
+++ b/net/core/timestamping.c
@@ -31,7 +31,7 @@ static unsigned int classify(const struct sk_buff *skb)
if (likely(skb->dev &&
skb->dev->phydev &&
skb->dev->phydev->drv))
- return sk_run_filter(skb, ptp_filter);
+ return sk_run_filter(skb, ptp_filter, NULL, 0);
else
return PTP_CLASS_NONE;
}
@@ -124,5 +124,5 @@ bool skb_defer_rx_timestamp(struct sk_buff *skb)
void __init skb_timestamping_init(void)
{
- BUG_ON(sk_chk_filter(ptp_filter, ARRAY_SIZE(ptp_filter)));
+ BUG_ON(sk_chk_filter(ptp_filter, ARRAY_SIZE(ptp_filter), NULL));
}
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c60649e..7b8bb1b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -525,7 +525,8 @@ static inline unsigned int run_filter(const struct sk_buff *skb,
rcu_read_lock();
filter = rcu_dereference(sk->sk_filter);
if (filter != NULL)
- res = sk_run_filter(skb, filter->insns);
+ res = sk_run_filter(skb, filter->insns, filter->data,
+ filter->data_len);
rcu_read_unlock();
return res;
--
1.7.2.3
^ permalink raw reply related
* Re: [RFD][PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 12:57 UTC (permalink / raw)
To: netdev
In-Reply-To: <1297340087-10963-1-git-send-email-ian.molton@collabora.co.uk>
Shit, sorry about the noise. git-send-email bit me :)
-Ian
^ permalink raw reply
* Re: [PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Eric Dumazet @ 2011-02-10 13:24 UTC (permalink / raw)
To: Ian Molton
Cc: netdev, rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
alban.crequy
In-Reply-To: <1297341067-12264-2-git-send-email-ian.molton@collabora.co.uk>
Le jeudi 10 février 2011 à 12:31 +0000, Ian Molton a écrit :
> This patch allows a data section to be specified for BPF.
>
> This is made use of by a MEMCMP like instruction.
>
> Testsuite here:
> http://git.collabora.co.uk/?p=user/ian/check-bpf.git;a=summary
>
> Issues:
> * Do I need to update the headers for all arches, or just generic
> * Can sk_run_filter() be called in a context where kmalloc(GFP_KERNEL) is
> not allowed (I think not)
You cannot use GFP_KERNEL in sk_run_filter() : We run in {soft}irq mode,
in input path.
> * Data section allocated with second call to sock_kmalloc().
> * Should the patch be broken into two - one to add the data uploading,
> one to add the JMEMCMP insn. ?
May I ask why it is needed at all ?
Then, why only one JMEMCMP would be allowed in a filter ?
^ permalink raw reply
* Re: [PATCH] Add JMEMCMP to Berkeley Packet Filters
From: Ian Molton @ 2011-02-10 13:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, rdunlap, isdn, paulus, arnd, davem, herbert, ebiederm,
alban.crequy
In-Reply-To: <1297344292.2493.3.camel@edumazet-laptop>
On 10/02/11 13:24, Eric Dumazet wrote:
Hi!
Thanks for reviewing! :)
>> * Can sk_run_filter() be called in a context where kmalloc(GFP_KERNEL) is
>> not allowed (I think not)
>
> You cannot use GFP_KERNEL in sk_run_filter() : We run in {soft}irq mode,
> in input path.
Ok, that can be sorted.
>> * Data section allocated with second call to sock_kmalloc().
>> * Should the patch be broken into two - one to add the data uploading,
>> one to add the JMEMCMP insn. ?
>
> May I ask why it is needed at all ?
So we can match strings in packet filters... I don't think I understand
the question...
> Then, why only one JMEMCMP would be allowed in a filter ?
I dont think I'm restricting the filter to only have one JMEMCMP? Am I
misunderstanding you?
-Ian
^ permalink raw reply
* Mutual Aid (Stash of Fortune)
From: George @ 2011-02-10 12:55 UTC (permalink / raw)
Mutual Aid (Stash of Fortune)
Hi
I would hold back certain information for security reasons for now until
you have found time to visit the BBC website stated below to enable you
have an insight into what I intend sharing with you.
http://news.bbc.co.uk/2/hi/middle_east/2988455.stm
Get back to me having visited the above website with this email:
mrglenn_bradley@w.cn
Best Regards
A former member of the 3rd Infantry Division
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox