Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] cls_u32: use skb_copy_bits() to dereference data safely
From: Changli Gao @ 2010-06-01 17:47 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, netdev
In-Reply-To: <1275395667.3587.38.camel@bigi>

On Tue, Jun 1, 2010 at 8:34 PM, jamal <hadi@cyberus.ca> wrote:
> Hi Changli,
>
> On Mon, 2010-05-31 at 10:24 +0800, Changli Gao wrote:
>> use skb_copy_bits() to dereference data safely
>>
>> the original skb->data dereference isn't safe, as there isn't any skb->len or
>> skb_is_nonlinear() check.
>
> I dont see any safety issue in current code in this respect. Do you have
> a specific scenario where this would be unsafe? We inspect in 32 bit
> chunks walking the packet data and stop when there is no more packet
> data.

I added the following debug code into cls_u32.c

                for (i = n->sel.nkeys; i>0; i--, key++) {
+                       int off;
+
+                       off = key->off+(off2&key->offmask) + (ptr - skb->data);
+                       if (off + 4 > skb->len)
+                               printk("skb->len: %d, off: %d\n",
skb->len, off);

                        if
((*(__be32*)(ptr+key->off+(off2&key->offmask))^key->val)&key->mask) {
                                n = n->next;
@@ -179,8 +186,13 @@ check_terminal:

                ht = n->ht_down;
                sel = 0;
-               if (ht->divisor)
+               if (ht->divisor) {
+                       int off = ptr - skb->data + n->sel.hoff;
+                       if (off + 4 > skb->len)
+                               printk("skb->len: %d, off: %d\n", skb->len,
+                                       off);
                        sel =
ht->divisor&u32_hash_fold(*(__be32*)(ptr+n->sel.hoff),
&n->sel,n->fshift);
+               }

the tc filter command is:

tc filter add dev eth0 parent 8001:0 proto ip  u32 match u32
0x00000000 0x0000ffff at 40

Then I execute nc to connect to an UDP port:

localhost linux # nc -u 192.168.235.1 80

localhost linux #

just press enter.

I got these demsg:

[ 7525.330604] skb->len: 43, off: 54

we are trying to access the memory out of this skb.

>
>> skb_copy_bits() is used instead in this patch. And
>> when the skb isn't long enough, we terminate the function u32_classify()
>> immediately with -1.
>>
>
> That could be a very interesting optimization - but i see it as _very
> hard_ to do with current u32 given it has branching and different
> branches would have different lengths etc. You'd have to essentially do
> some math in user space as to what min length would suffice given
> a specified filter and pass that to the kernel.
>

It isn't an optimization, but an error exit. :)

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers
From: Sridhar Samudrala @ 2010-06-01 17:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michael S. Tsirkin, Oleg Nesterov, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <4C04D453.9040208@kernel.org>

On Tue, 2010-06-01 at 11:35 +0200, Tejun Heo wrote:
> Apply the cpumask and cgroup of the initializing task to the created
> vhost worker.
> 
> Based on Sridhar Samudrala's patch.  Li Zefan spotted a bug in error
> path (twice), fixed (twice).
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> ---
>  drivers/vhost/vhost.c |   34 ++++++++++++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> Index: work/drivers/vhost/vhost.c
> ===================================================================
> --- work.orig/drivers/vhost/vhost.c
> +++ work/drivers/vhost/vhost.c
> @@ -23,6 +23,7 @@
>  #include <linux/highmem.h>
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
> +#include <linux/cgroup.h>
> 
>  #include <linux/net.h>
>  #include <linux/if_packet.h>
> @@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *de
>  		    struct vhost_virtqueue *vqs, int nvqs)
>  {
>  	struct task_struct *worker;
> -	int i;
> +	cpumask_var_t mask;
> +	int i, ret = -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		goto out_free_mask;

I think this is another bug in the error path. You should simply
do a return instead of a goto here when aloc_cpu_mask fails.

Thanks
Sridhar
> 
>  	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
> -	if (IS_ERR(worker))
> -		return PTR_ERR(worker);
> +	if (IS_ERR(worker)) {
> +		ret = PTR_ERR(worker);
> +		goto out_free_mask;
> +	}
> +
> +	ret = sched_getaffinity(current->pid, mask);
> +	if (ret)
> +		goto out_stop_worker;
> +
> +	ret = sched_setaffinity(worker->pid, mask);
> +	if (ret)
> +		goto out_stop_worker;
> +
> +	ret = cgroup_attach_task_current_cg(worker);
> +	if (ret)
> +		goto out_stop_worker;
> 
>  	dev->vqs = vqs;
>  	dev->nvqs = nvqs;
> @@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *de
>  	}
> 
>  	wake_up_process(worker);	/* avoid contributing to loadavg */
> -	return 0;
> +	ret = 0;
> +	goto out_free_mask;
> +
> +out_stop_worker:
> +	kthread_stop(worker);
> +out_free_mask:
> +	free_cpumask_var(mask);
> +	return ret;
>  }
> 
>  /* Caller should have device mutex */
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: Fore 200 firmware
From: Ben Hutchings @ 2010-06-01 17:12 UTC (permalink / raw)
  To: Meelis Roos; +Cc: chas3, netdev
In-Reply-To: <alpine.SOC.1.00.1006011902490.22519@math.ut.ee>

On Tue, Jun 01, 2010 at 06:03:31PM +0100, Meelis Roos wrote:
> > >does not have to redo the detective work I am doing. Packaging it along 
> > >with linux-atm packages (atm-tools in Debian) does not seem correct, so 
> > >do we need to make a special firmware package out of it or just add to 
> > >some other firmware repo that is packaged already?
> > 
> > i dont care which happens really.  i would to think it is part of
> > atm-tools since the fore200e driver really isnt useful with the
> > atm tools.
> 
> OK, so we should push the distro maintainers of it to create a suggested 
> firmware package or just include the fore firmware unconditionally - 
> once it is in released linux-atm.
> 
> > >What's the licence of the firmware anyway - unknown likely?
> > 
> > it says the firmware is gpl.  look at fore200e_firmware_copyright
> > in one of the old kernel distros.
> 
> Well, there is no source to fulfill the GPL... but there also should be 
> no problem redistributing it.
[...]
 
The GPL says (approximately) that people are allowed to redistribute binaries
only if they also redistribute source.  If the source is not available to the
public then anyone redistributing it is relying on the assumption that the
copyright holder didn't really mean this condition and will not enforce it.
At any time the copyright holder could sue them for copyright infringement,
and they would have to rely on a defence such as laches.

This is a common problem with firmware images that were added to the kernel
long ago.  To reduce legal risk, Debian has been removing these images and
will not include them in any future releases, even in the non-free archive
section, unless this is fixed.  Please do not try to add this firmware to
Debian without getting an appropriate licence.
 
Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
                                                              - Albert Camus

^ permalink raw reply

* [PATCH net-next-2.6] net: CONFIG_NET_NS reduction
From: Eric Dumazet @ 2010-06-01 16:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Patrick McHardy

Use read_pnet() and write_pnet() to reduce number of ifdef CONFIG_NET_NS

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h            |    6 +-----
 include/net/cfg80211.h               |   15 ++-------------
 include/net/genetlink.h              |   15 ++-------------
 include/net/netfilter/nf_conntrack.h |    6 +-----
 include/net/sock.h                   |   10 ++--------
 net/ipv6/addrlabel.c                 |    6 +-----
 net/netfilter/nf_conntrack_core.c    |    8 ++------
 7 files changed, 11 insertions(+), 55 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a249161..bd6b753 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1087,11 +1087,7 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 static inline
 struct net *dev_net(const struct net_device *dev)
 {
-#ifdef CONFIG_NET_NS
-	return dev->nd_net;
-#else
-	return &init_net;
-#endif
+	return read_pnet(&dev->nd_net);
 }
 
 static inline
diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index b44a2e5..e7ebeb8 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -1330,26 +1330,15 @@ struct wiphy {
 	char priv[0] __attribute__((__aligned__(NETDEV_ALIGN)));
 };
 
-#ifdef CONFIG_NET_NS
-static inline struct net *wiphy_net(struct wiphy *wiphy)
-{
-	return wiphy->_net;
-}
-
-static inline void wiphy_net_set(struct wiphy *wiphy, struct net *net)
-{
-	wiphy->_net = net;
-}
-#else
 static inline struct net *wiphy_net(struct wiphy *wiphy)
 {
-	return &init_net;
+	return read_pnet(&wiphy->_net);
 }
 
 static inline void wiphy_net_set(struct wiphy *wiphy, struct net *net)
 {
+	write_pnet(&wiphy->_net, net);
 }
-#endif
 
 /**
  * wiphy_priv - return priv from wiphy
diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index eb551ba..f7dcd2c 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -68,26 +68,15 @@ struct genl_info {
 #endif
 };
 
-#ifdef CONFIG_NET_NS
 static inline struct net *genl_info_net(struct genl_info *info)
 {
-	return info->_net;
+	return read_pnet(&info->_net);
 }
 
 static inline void genl_info_net_set(struct genl_info *info, struct net *net)
 {
-	info->_net = net;
+	write_pnet(&info->_net, net);
 }
-#else
-static inline struct net *genl_info_net(struct genl_info *info)
-{
-	return &init_net;
-}
-
-static inline void genl_info_net_set(struct genl_info *info, struct net *net)
-{
-}
-#endif
 
 /**
  * struct genl_ops - generic netlink operations
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..bbfdd94 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -152,11 +152,7 @@ extern struct net init_net;
 
 static inline struct net *nf_ct_net(const struct nf_conn *ct)
 {
-#ifdef CONFIG_NET_NS
-	return ct->ct_net;
-#else
-	return &init_net;
-#endif
+	return read_pnet(&ct->ct_net);
 }
 
 /* Alter reply tuple (maybe alter helper). */
diff --git a/include/net/sock.h b/include/net/sock.h
index ca241ea..3461e5d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1724,19 +1724,13 @@ static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_e
 static inline
 struct net *sock_net(const struct sock *sk)
 {
-#ifdef CONFIG_NET_NS
-	return sk->sk_net;
-#else
-	return &init_net;
-#endif
+	return read_pnet(&sk->sk_net);
 }
 
 static inline
 void sock_net_set(struct sock *sk, struct net *net)
 {
-#ifdef CONFIG_NET_NS
-	sk->sk_net = net;
-#endif
+	write_pnet(&sk->sk_net, net);
 }
 
 /*
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index 8c4348c..f0e774c 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -53,11 +53,7 @@ static struct ip6addrlbl_table
 static inline
 struct net *ip6addrlbl_net(const struct ip6addrlbl_entry *lbl)
 {
-#ifdef CONFIG_NET_NS
-	return lbl->lbl_net;
-#else
-	return &init_net;
-#endif
+	return read_pnet(&lbl->lbl_net);
 }
 
 /*
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..7728898 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -619,9 +619,7 @@ struct nf_conn *nf_conntrack_alloc(struct net *net, u16 zone,
 	ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev = NULL;
 	/* Don't set timer yet: wait for confirmation */
 	setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct);
-#ifdef CONFIG_NET_NS
-	ct->ct_net = net;
-#endif
+	write_pnet(&ct->ct_net, net);
 #ifdef CONFIG_NF_CONNTRACK_ZONES
 	if (zone) {
 		struct nf_conntrack_zone *nf_ct_zone;
@@ -1363,9 +1361,7 @@ static int nf_conntrack_init_init_net(void)
 		goto err_extend;
 #endif
 	/* Set up fake conntrack: to never be deleted, not in any hashes */
-#ifdef CONFIG_NET_NS
-	nf_conntrack_untracked.ct_net = &init_net;
-#endif
+	write_pnet(&nf_conntrack_untracked.ct_net, &init_net);
 	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
 	/*  - and look it like as a confirmed connection */
 	set_bit(IPS_CONFIRMED_BIT, &nf_conntrack_untracked.status);



^ permalink raw reply related

* Re: pull request: wireless-2.6 2010-05-28
From: reinette chatre @ 2010-06-01 16:22 UTC (permalink / raw)
  To: sedat.dilek@gmail.com
  Cc: John W. Linville, davem@davemloft.net,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Zheng, Jiajia, Kolekar, Abhijeet,
	Berg, Johannes
In-Reply-To: <AANLkTil8SLseSeq1M_Xg7P4AqMy3V_0jxc7by2Xuaqnb@mail.gmail.com>

Hi Sedat,

On Sat, 2010-05-29 at 06:34 -0700, Sedat Dilek wrote:
> Two people confirmed the patch in [2] fixes:
> 1. iwlwifi-2.6 GIT master (commit f10a237c95abd6d64a3a24553bd1d3bcddd9108b)
> 2. compat-wireless (2010-05-21)
> 
> And it fixes also the above mentionned combination.

We are working on getting this patch upstream. The version tested
contains a significant amount of duplicated code as well as some cleanup
code mixed in. It is thus not appropriate for an upstream submission at
this time.

> As a suggestion:
> What about "copying" bug-reports (incl. its history) from IWL-BTS into
> linux-wireless ML?
> For example (dri-devel related) bug-reports from
> bugzilla.freedesktop.org are "copied" into dri-devel ML.

Pointing people to the bug report seems to work ok. What benefit does
the above have?

Reinette

^ permalink raw reply

* [PATCH] Refactor update of IPv6 flow destination address for srcrt (RH) option
From: Arnaud Ebalard @ 2010-06-01 16:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: YOSHIFUJI Hideaki / 吉藤英明, netdev

Hi,

While playing with associated code, I noticed there are more than a
dozen occurrences of the following in the IPv6 stack:

    if (opt && opt->srcrt) {
            struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
            ipv6_addr_copy(&final, &fl.fl6_dst);
            ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
            final_p = &final;
    }

Maybe this can be replaced with a helper (not sure about the name
though).


Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
---
 include/net/ipv6.h               |   13 +++++++++++++
 net/dccp/ipv6.c                  |   24 +++---------------------
 net/ipv6/af_inet6.c              |    7 +------
 net/ipv6/datagram.c              |   16 +++-------------
 net/ipv6/inet6_connection_sock.c |    7 +------
 net/ipv6/raw.c                   |    8 +-------
 net/ipv6/syncookies.c            |    7 +------
 net/ipv6/tcp_ipv6.c              |   21 +++------------------
 net/ipv6/udp.c                   |    9 ++-------
 9 files changed, 28 insertions(+), 84 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 2600b69..9edfa1c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -290,6 +290,19 @@ static inline void ipv6_addr_copy(struct in6_addr *a1, const struct in6_addr *a2
 	memcpy(a1, a2, sizeof(struct in6_addr));
 }
 
+static inline struct in6_addr *srcrt_dst_flow_update(struct in6_addr *final,
+						     struct in6_addr *fl6dst,
+						     const struct ipv6_txoptions *opt)
+{
+	if (opt && opt->srcrt) {
+		const struct rt0_hdr *rt0 = (struct rt0_hdr *)opt->srcrt;
+		ipv6_addr_copy(final, fl6dst);
+		ipv6_addr_copy(fl6dst, rt0->addr);
+		return final;
+	}
+	return NULL;
+}
+
 static inline void ipv6_addr_prefix(struct in6_addr *pfx, 
 				    const struct in6_addr *addr,
 				    int plen)
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 0916988..fc435ba 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -265,13 +265,7 @@ static int dccp_v6_send_response(struct sock *sk, struct request_sock *req,
 
 	opt = np->opt;
 
-	if (opt != NULL && opt->srcrt != NULL) {
-		const struct rt0_hdr *rt0 = (struct rt0_hdr *)opt->srcrt;
-
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
@@ -551,13 +545,7 @@ static struct sock *dccp_v6_request_recv_sock(struct sock *sk,
 		memset(&fl, 0, sizeof(fl));
 		fl.proto = IPPROTO_DCCP;
 		ipv6_addr_copy(&fl.fl6_dst, &ireq6->rmt_addr);
-		if (opt != NULL && opt->srcrt != NULL) {
-			const struct rt0_hdr *rt0 = (struct rt0_hdr *)opt->srcrt;
-
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 		ipv6_addr_copy(&fl.fl6_src, &ireq6->loc_addr);
 		fl.oif = sk->sk_bound_dev_if;
 		fl.fl_ip_dport = inet_rsk(req)->rmt_port;
@@ -988,13 +976,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl.fl_ip_sport = inet->inet_sport;
 	security_sk_classify_flow(sk, &fl);
 
-	if (np->opt != NULL && np->opt->srcrt != NULL) {
-		const struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e733942..d6e6491 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -665,12 +665,7 @@ int inet6_sk_rebuild_header(struct sock *sk)
 		fl.fl_ip_sport = inet->inet_sport;
 		security_sk_classify_flow(sk, &fl);
 
-		if (np->opt && np->opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 		err = ip6_dst_lookup(sk, &dst, &fl);
 		if (err) {
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 7126846..2fcba06 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -42,6 +42,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct dst_entry	*dst;
 	struct flowi		fl;
 	struct ip6_flowlabel	*flowlabel = NULL;
+	struct ipv6_txoptions   *opt;
 	int			addr_type;
 	int			err;
 
@@ -155,19 +156,8 @@ ipv4_connected:
 
 	security_sk_classify_flow(sk, &fl);
 
-	if (flowlabel) {
-		if (flowlabel->opt && flowlabel->opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) flowlabel->opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
-	} else if (np->opt && np->opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	opt = flowlabel ? flowlabel->opt : np->opt;
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 0c5e3c3..505c8de 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -199,12 +199,7 @@ int inet6_csk_xmit(struct sk_buff *skb)
 	fl.fl_ip_dport = inet->inet_dport;
 	security_sk_classify_flow(sk, &fl);
 
-	if (np->opt && np->opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 	dst = __inet6_csk_dst_check(sk, np->dst_cookie);
 
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 4a4dcbe..91211bd 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -847,13 +847,7 @@ static int rawv6_sendmsg(struct kiocb *iocb, struct sock *sk,
 	if (ipv6_addr_any(&fl.fl6_src) && !ipv6_addr_any(&np->saddr))
 		ipv6_addr_copy(&fl.fl6_src, &np->saddr);
 
-	/* merge ip6_build_xmit from ip6_output */
-	if (opt && opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	if (!fl.oif && ipv6_addr_is_multicast(&fl.fl6_dst))
 		fl.oif = np->mcast_oif;
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 34d1f06..7d27bc7 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -245,12 +245,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 		memset(&fl, 0, sizeof(fl));
 		fl.proto = IPPROTO_TCP;
 		ipv6_addr_copy(&fl.fl6_dst, &ireq6->rmt_addr);
-		if (np->opt && np->opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 		ipv6_addr_copy(&fl.fl6_src, &ireq6->loc_addr);
 		fl.oif = sk->sk_bound_dev_if;
 		fl.mark = sk->sk_mark;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2b7c3a1..2dcb536 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -250,12 +250,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl.fl_ip_dport = usin->sin6_port;
 	fl.fl_ip_sport = inet->inet_sport;
 
-	if (np->opt && np->opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 	security_sk_classify_flow(sk, &fl);
 
@@ -494,12 +489,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
 	security_req_classify_flow(req, &fl);
 
 	opt = np->opt;
-	if (opt && opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
@@ -1398,12 +1388,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		memset(&fl, 0, sizeof(fl));
 		fl.proto = IPPROTO_TCP;
 		ipv6_addr_copy(&fl.fl6_dst, &treq->rmt_addr);
-		if (opt && opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 		ipv6_addr_copy(&fl.fl6_src, &treq->loc_addr);
 		fl.oif = sk->sk_bound_dev_if;
 		fl.mark = sk->sk_mark;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 87be586..a46ad07 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1097,14 +1097,9 @@ do_udp_sendmsg:
 		ipv6_addr_copy(&fl.fl6_src, &np->saddr);
 	fl.fl_ip_sport = inet->inet_sport;
 
-	/* merge ip6_build_xmit from ip6_output */
-	if (opt && opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
+	if (final_p)
 		connected = 0;
-	}
 
 	if (!fl.oif && ipv6_addr_is_multicast(&fl.fl6_dst)) {
 		fl.oif = np->mcast_oif;
-- 
1.7.1



^ permalink raw reply related

* [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
From: Eric Dumazet @ 2010-06-01 16:20 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Netfilter Developers, netdev
In-Reply-To: <4C04E3E2.7020209@trash.net>

Le mardi 01 juin 2010 à 12:41 +0200, Patrick McHardy a écrit :

> > BTW, I notice nf_conntrack_untracked is incorrectly annotated
> > '__read_mostly'.
> > 
> > It can be written very often :(
> > 
> > Should'nt we special case it and let be really const ?
> 
> That would need quite a bit of special-casing to avoid touching
> the reference counts. So far this is completely hidden, so I'd
> say it just shouldn't be marked __read_mostly. Alternatively we
> can make "untracked" a nfctinfo state.

I tried this suggestion, (a new IP_CT_UNTRACKED ctinfo), over a per_cpu
untracked ct, but its a bit hard.

For example, I cannot find a way to change ctnetlink_conntrack_event() :

	if (ct == &nf_conntrack_untracked)
		return 0;

Maybe this piece of code is not necessary, we should not come here
anyway, or it means several packets could store events for this (shared)
ct ?

Obviously, an IPS_UNTRACKED bit would be much easier to implement.
Would it be acceptable ?


Preliminary patch with IP_CT_UNTRACKED, probably not working at all...

 include/linux/netfilter/nf_conntrack_common.h  |    3 +
 include/net/netfilter/nf_conntrack.h           |   11 +++--
 include/net/netfilter/nf_conntrack_core.h      |    2 
 net/ipv4/netfilter/nf_nat_core.c               |    4 +
 net/ipv4/netfilter/nf_nat_standalone.c         |    2 
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    4 -
 net/netfilter/nf_conntrack_core.c              |   32 +++++++++------
 net/netfilter/nf_conntrack_netlink.c           |    6 +-
 net/netfilter/xt_CT.c                          |   13 +++---
 net/netfilter/xt_NOTRACK.c                     |    4 -
 net/netfilter/xt_TEE.c                         |    8 +--
 net/netfilter/xt_cluster.c                     |    2 
 net/netfilter/xt_conntrack.c                   |    2 
 net/netfilter/xt_socket.c                      |    2 
 14 files changed, 58 insertions(+), 37 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 14e6d32..5f7c947 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -15,6 +15,9 @@ enum ip_conntrack_info {
            IP_CT_DIR_ORIGINAL); may be a retransmission. */
 	IP_CT_NEW,
 
+	/* Untracked */
+	IP_CT_UNTRACKED,
+
 	/* >= this indicates reply direction */
 	IP_CT_IS_REPLY,
 
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..884ade9 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -175,7 +175,7 @@ static inline struct nf_conn *
 nf_ct_get(const struct sk_buff *skb, enum ip_conntrack_info *ctinfo)
 {
 	*ctinfo = skb->nfctinfo;
-	return (struct nf_conn *)skb->nfct;
+	return container_of(skb->nfct, struct nf_conn, ct_general);
 }
 
 /* decrement reference count on a conntrack */
@@ -261,7 +261,7 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
-extern struct nf_conn nf_conntrack_untracked;
+DECLARE_PER_CPU(struct nf_conn, pcpu_nf_conntrack_untracked);
 
 /* Iterate over all conntracks: if iter returns true, it's deleted. */
 extern void
@@ -291,7 +291,12 @@ static inline int nf_ct_is_dying(struct nf_conn *ct)
 
 static inline int nf_ct_is_untracked(const struct sk_buff *skb)
 {
-	return (skb->nfct == &nf_conntrack_untracked.ct_general);
+	return (skb->nfctinfo == IP_CT_UNTRACKED);
+}
+
+static inline int nf_ct_is_tracked(const struct sk_buff *skb)
+{
+	return (skb->nfctinfo != IP_CT_UNTRACKED);
 }
 
 extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 3d7524f..8dd05ea 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -60,7 +60,7 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
 	int ret = NF_ACCEPT;
 
-	if (ct && ct != &nf_conntrack_untracked) {
+	if (ct && nf_ct_is_tracked(skb)) {
 		if (!nf_ct_is_confirmed(ct))
 			ret = __nf_conntrack_confirm(skb);
 		if (likely(ret == NF_ACCEPT))
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 4f8bddb..a797999 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -719,6 +719,7 @@ static int __init nf_nat_init(void)
 {
 	size_t i;
 	int ret;
+	int cpu;
 
 	need_ipv4_conntrack();
 
@@ -742,7 +743,8 @@ static int __init nf_nat_init(void)
 	spin_unlock_bh(&nf_nat_lock);
 
 	/* Initialize fake conntrack so that NAT will skip it */
-	nf_conntrack_untracked.status |= IPS_NAT_DONE_MASK;
+	for_each_possible_cpu(cpu)
+		per_cpu(pcpu_nf_conntrack_untracked,cpu).status |= IPS_NAT_DONE_MASK;
 
 	l3proto = nf_ct_l3proto_find_get((u_int16_t)AF_INET);
 
diff --git a/net/ipv4/netfilter/nf_nat_standalone.c b/net/ipv4/netfilter/nf_nat_standalone.c
index beb2581..17af2bb 100644
--- a/net/ipv4/netfilter/nf_nat_standalone.c
+++ b/net/ipv4/netfilter/nf_nat_standalone.c
@@ -98,7 +98,7 @@ nf_nat_fn(unsigned int hooknum,
 		return NF_ACCEPT;
 
 	/* Don't try to NAT if this packet is not conntracked */
-	if (ct == &nf_conntrack_untracked)
+	if (ctinfo == IP_CT_UNTRACKED)
 		return NF_ACCEPT;
 
 	nat = nfct_nat(ct);
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 9be8177..b67029c 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -208,8 +208,8 @@ icmpv6_error(struct net *net, struct nf_conn *tmpl,
 	type = icmp6h->icmp6_type - 130;
 	if (type >= 0 && type < sizeof(noct_valid_new) &&
 	    noct_valid_new[type]) {
-		skb->nfct = &nf_conntrack_untracked.ct_general;
-		skb->nfctinfo = IP_CT_NEW;
+		skb->nfct = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+		skb->nfctinfo = IP_CT_UNTRACKED;
 		nf_conntrack_get(skb->nfct);
 		return NF_ACCEPT;
 	}
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..eea5df1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked __read_mostly;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, pcpu_nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(pcpu_nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
@@ -1185,10 +1185,16 @@ static void nf_ct_release_dying_list(struct net *net)
 
 static void nf_conntrack_cleanup_init_net(void)
 {
-	/* wait until all references to nf_conntrack_untracked are dropped */
-	while (atomic_read(&nf_conntrack_untracked.ct_general.use) > 1)
+	int cpu, use;
+	for (;;) {
+		use = 0;
+		for_each_possible_cpu(cpu)
+			use += atomic_read(&per_cpu(pcpu_nf_conntrack_untracked, cpu).ct_general.use) - 1;
+		/* wait until all references to nf_conntrack_untracked are dropped */
+		if (!use)
+			break;
 		schedule();
-
+	}
 	nf_conntrack_helper_fini();
 	nf_conntrack_proto_fini();
 #ifdef CONFIG_NF_CONNTRACK_ZONES
@@ -1325,6 +1331,7 @@ static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
 	int ret;
+	int cpu;
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
@@ -1362,14 +1369,15 @@ static int nf_conntrack_init_init_net(void)
 	if (ret < 0)
 		goto err_extend;
 #endif
-	/* Set up fake conntrack: to never be deleted, not in any hashes */
-#ifdef CONFIG_NET_NS
-	nf_conntrack_untracked.ct_net = &init_net;
-#endif
-	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
-	/*  - and look it like as a confirmed connection */
-	set_bit(IPS_CONFIRMED_BIT, &nf_conntrack_untracked.status);
+	/* Set up fake conntracks: to never be deleted, not in any hashes */
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(pcpu_nf_conntrack_untracked, cpu);
 
+		write_pnet(&ct->ct_net, &init_net);
+		atomic_set(&ct->ct_general.use, 1);
+	/*  - and look it like as a confirmed connection */
+		__set_bit(IPS_CONFIRMED_BIT, &ct->status);
+	}
 	return 0;
 
 #ifdef CONFIG_NF_CONNTRACK_ZONES
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index c42ff6a..ac21514 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -479,9 +479,9 @@ ctnetlink_conntrack_event(unsigned int events, struct nf_ct_event *item)
 	unsigned int flags = 0, group;
 	int err;
 
-	/* ignore our fake conntrack entry */
-	if (ct == &nf_conntrack_untracked)
-		return 0;
+//	/* ignore our fake conntrack entry */
+//	if (ct == &nf_conntrack_untracked)
+//		return 0;
 
 	if (events & (1 << IPCT_DESTROY)) {
 		type = IPCTNL_MSG_CT_DELETE;
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 562bf32..5723f9a 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -29,9 +29,13 @@ static unsigned int xt_ct_target(struct sk_buff *skb,
 	if (skb->nfct != NULL)
 		return XT_CONTINUE;
 
+	skb->nfctinfo = IP_CT_NEW;
+	if (info->flags & XT_CT_NOTRACK) {
+		ct = &__get_cpu_var(pcpu_nf_conntrack_untracked);
+		skb->nfctinfo = IP_CT_UNTRACKED;
+	}
 	atomic_inc(&ct->ct_general.use);
 	skb->nfct = &ct->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
 
 	return XT_CONTINUE;
 }
@@ -67,8 +71,7 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par)
 		return -EINVAL;
 
 	if (info->flags & XT_CT_NOTRACK) {
-		ct = &nf_conntrack_untracked;
-		atomic_inc(&ct->ct_general.use);
+		ct = &__get_cpu_var(pcpu_nf_conntrack_untracked);
 		goto out;
 	}
 
@@ -132,14 +135,14 @@ static void xt_ct_tg_destroy(const struct xt_tgdtor_param *par)
 	struct nf_conn *ct = info->ct;
 	struct nf_conn_help *help;
 
-	if (ct != &nf_conntrack_untracked) {
+	if (!(info->flags & XT_CT_NOTRACK)) {
 		help = nfct_help(ct);
 		if (help)
 			module_put(help->helper->me);
 
 		nf_ct_l3proto_module_put(par->family);
+		nf_ct_put(info->ct);
 	}
-	nf_ct_put(info->ct);
 }
 
 static struct xt_target xt_ct_tg __read_mostly = {
diff --git a/net/netfilter/xt_NOTRACK.c b/net/netfilter/xt_NOTRACK.c
index 512b912..9547b58 100644
--- a/net/netfilter/xt_NOTRACK.c
+++ b/net/netfilter/xt_NOTRACK.c
@@ -23,8 +23,8 @@ notrack_tg(struct sk_buff *skb, const struct xt_action_param *par)
 	   If there is a real ct entry correspondig to this packet,
 	   it'll hang aroun till timing out. We don't deal with it
 	   for performance reasons. JK */
-	skb->nfct = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 
 	return XT_CONTINUE;
diff --git a/net/netfilter/xt_TEE.c b/net/netfilter/xt_TEE.c
index 859d9fd..b8e46b3 100644
--- a/net/netfilter/xt_TEE.c
+++ b/net/netfilter/xt_TEE.c
@@ -104,8 +104,8 @@ tee_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 #ifdef WITH_CONNTRACK
 	/* Avoid counting cloned packets towards the original connection. */
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct     = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 #endif
 	/*
@@ -177,8 +177,8 @@ tee_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 
 #ifdef WITH_CONNTRACK
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct     = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 #endif
 	if (par->hooknum == NF_INET_PRE_ROUTING ||
diff --git a/net/netfilter/xt_cluster.c b/net/netfilter/xt_cluster.c
index 30b95a1..b26f94d 100644
--- a/net/netfilter/xt_cluster.c
+++ b/net/netfilter/xt_cluster.c
@@ -120,7 +120,7 @@ xt_cluster_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	if (ct == NULL)
 		return false;
 
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(skb))
 		return false;
 
 	if (ct->master)
diff --git a/net/netfilter/xt_conntrack.c b/net/netfilter/xt_conntrack.c
index 39681f1..95bcfbb 100644
--- a/net/netfilter/xt_conntrack.c
+++ b/net/netfilter/xt_conntrack.c
@@ -123,7 +123,7 @@ conntrack_mt(const struct sk_buff *skb, struct xt_action_param *par,
 
 	ct = nf_ct_get(skb, &ctinfo);
 
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(skb))
 		statebit = XT_CONNTRACK_STATE_UNTRACKED;
 	else if (ct != NULL)
 		statebit = XT_CONNTRACK_STATE_BIT(ctinfo);
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 3d54c23..1f760b5 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -127,7 +127,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 	 * reply packet of an established SNAT-ted connection. */
 
 	ct = nf_ct_get(skb, &ctinfo);
-	if (ct && (ct != &nf_conntrack_untracked) &&
+	if (ct && nf_ct_is_tracked(skb) &&
 	    ((iph->protocol != IPPROTO_ICMP &&
 	      ctinfo == IP_CT_IS_REPLY + IP_CT_ESTABLISHED) ||
 	     (iph->protocol == IPPROTO_ICMP &&



^ permalink raw reply related

* RE: replace hooks in __netif_receive_skb (v4)
From: Fischer, Anna @ 2010-06-01 16:13 UTC (permalink / raw)
  To: Stephen Hemminger, Jiri Pirko, davem@davemloft.net
  Cc: netdev@vger.kernel.org, kaber@trash.net, eric.dumazet@gmail.com
In-Reply-To: <20100601082805.1c84b16d@nehalam>

> Subject: net: replace hooks in __netif_receive_skb (v4)
> 
> 
> From: Jiri Pirko <jpirko@redhat.com>
> 
> What this patch does is it removes two receive frame hooks (for bridge
> and for
> macvlan) from __netif_receive_skb. These are replaced them with a
> single
> hook for both. It only supports one hook per device because it makes no
> sense to do bridging and macvlan on the same device.
> 
> Then a network driver (of virtual netdev like macvlan or bridge) can
> register
> an rx_handler for needed net device.


I think the idea of this is really good, and it has been long required to get rid of the bridging hook and the "hack" for macvlan to get into the network stack. 

However, I wonder, if this is to be used as a generic interface, then why the restriction of only having a single hook per device? Yes, it makes sense to do this for the bridge/macvlan combination, but in general there could be other cases where you would want to allow multiple receivers per device. 


> @@ -2792,6 +2778,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
>  static int __netif_receive_skb(struct sk_buff *skb)
>  {
>  	struct packet_type *ptype, *pt_prev;
> +	rx_callback_func_t *rh;
>  	struct net_device *orig_dev;
>  	struct net_device *master;
>  	struct net_device *null_or_orig;
> @@ -2855,12 +2842,16 @@ static int __netif_receive_skb(struct sk
>  ncls:
>  #endif
> 
> -	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
> -	if (!skb)
> -		goto out;
> -	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
> -	if (!skb)
> -		goto out;
> +	/* Handle special case of bridge or macvlan */
> +	if ((rh = rcu_dereference(skb->dev->rx_handler)) != NULL) {
> +		if (pt_prev) {
> +			ret = deliver_skb(skb, pt_prev, orig_dev);

What happens with 'ret' here? It is completely ignored, isn't it? Because you assume that there is only a single receiver per device? I think it would be much better to have return codes that indicate whether a packet has been consumed by a receive handler, or whether it is supposed to be processed further. The same actually applies for the packet handlers that are processed a bit further down in netif_receive_skb() - the return code is ignored and so a handler has no control over further processing of the packet.

^ permalink raw reply

* Re: net: replace hooks in __netif_receive_skb (v4)
From: Stephen Hemminger @ 2010-06-01 16:10 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: davem, netdev, kaber, eric.dumazet
In-Reply-To: <20100601154106.GA4929@psychotron.redhat.com>

On Tue, 1 Jun 2010 17:41:07 +0200
Jiri Pirko <jpirko@redhat.com> wrote:

> Actually, I'm not happy about this (reducing to only one hook) and for two
> reasons:
> 
> 1) I think it's a good behaviour to "mask" one handler by another in case device
> is for example used for macvlan and then added to bridge. Because when it's
> again removed from the bridge, the original functionality is restored. And also,
> this would be consistent with the current behaviour.
> 
> 2) I would imagine a situation, when multiple handers are needed in cascade.
> Actually I'm working on a virtual device draft which uses two handlers, although
> in corner situation.
> 
> Regards,
> 	Jirka

I don't like it because:

1) Adding macvlan to bridge makes no sense. The bridge is already in promicious mode.

2) I don't like to see functionality added when it is not needed.

3) The extra overhead of traversing list causes more cache activity in extreme
   hot path.

Wait and add the multiple handlers when your code is included.

-- 
http://www.extremeprogramming.org/rules/early.html

^ permalink raw reply

* Re: Fore 200 firmware
From: Meelis Roos @ 2010-06-01 16:09 UTC (permalink / raw)
  To: chas3; +Cc: netdev
In-Reply-To: <201006011549.o51FnZFk028507@thirdoffive.cmf.nrl.navy.mil>

> >does not have to redo the detective work I am doing. Packaging it along 
> >with linux-atm packages (atm-tools in Debian) does not seem correct, so 
> >do we need to make a special firmware package out of it or just add to 
> >some other firmware repo that is packaged already?
> 
> i dont care which happens really.  i would to think it is part of
> atm-tools since the fore200e driver really isnt useful with the
> atm tools.

OK, so we should push the distro maintainers of it to create a suggested 
firmware package or just include the fore firmware unconditionally - 
once it is in released linux-atm.

> >What's the licence of the firmware anyway - unknown likely?
> 
> it says the firmware is gpl.  look at fore200e_firmware_copyright
> in one of the old kernel distros.

Well, there is no source to fulfill the GPL... but there also should be 
no problem redistributing it.

> >And we can probably remove drivers/atm/fore200e_pca_fw.c if we 
> >distribute the firmware out of kernel tree.
> 
> i dont think it is in the current net-next-2.6 tree?

Oops, you are right - I have updated my tree over several years and the 
old files were just not removed from disk. But theay are gone from the 
git, yes.

Thnak yo for your work on ATM - I intend to get a demo ATM network 
running here in my computer museum.

-- 
Meelis Roos (mroos@linux.ee)

^ permalink raw reply

* [PATCH] ppp: eliminate shadowed variable name
From: Stephen Hemminger @ 2010-06-01 16:05 UTC (permalink / raw)
  To: Paul Mackerras, David Miller; +Cc: netdev

Sparse complains about shadowed declaration of skb. So use other
name.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/drivers/net/ppp_generic.c	2010-06-01 08:49:58.859739636 -0700
+++ b/drivers/net/ppp_generic.c	2010-06-01 08:50:41.184870657 -0700
@@ -1927,9 +1927,9 @@ ppp_receive_mp_frame(struct ppp *ppp, st
 	/* If the queue is getting long, don't wait any longer for packets
 	   before the start of the queue. */
 	if (skb_queue_len(&ppp->mrq) >= PPP_MP_MAX_QLEN) {
-		struct sk_buff *skb = skb_peek(&ppp->mrq);
-		if (seq_before(ppp->minseq, skb->sequence))
-			ppp->minseq = skb->sequence;
+		struct sk_buff *mskb = skb_peek(&ppp->mrq);
+		if (seq_before(ppp->minseq, mskb->sequence))
+			ppp->minseq = mskb->sequence;
 	}
 
 	/* Pull completed packets off the queue and receive them. */

^ permalink raw reply

* Re: Fore 200 firmware
From: Chas Williams (CONTRACTOR) @ 2010-06-01 15:49 UTC (permalink / raw)
  To: Meelis Roos; +Cc: netdev
In-Reply-To: <alpine.SOC.1.00.1006011747570.22519@math.ut.ee>

In message <alpine.SOC.1.00.1006011747570.22519@math.ut.ee>,Meelis Roos writes:
>2.5.1 does not seem to contain the firmware.

oops -- i didnt commit it to the repo.

>does not have to redo the detective work I am doing. Packaging it along 
>with linux-atm packages (atm-tools in Debian) does not seem correct, so 
>do we need to make a special firmware package out of it or just add to 
>some other firmware repo that is packaged already?

i dont care which happens really.  i would to think it is part of
atm-tools since the fore200e driver really isnt useful with the
atm tools.

>What's the licence of the firmware anyway - unknown likely?

it says the firmware is gpl.  look at fore200e_firmware_copyright
in one of the old kernel distros.

>And we can probably remove drivers/atm/fore200e_pca_fw.c if we 
>distribute the firmware out of kernel tree.

i dont think it is in the current net-next-2.6 tree?

^ permalink raw reply

* Re: net: replace hooks in __netif_receive_skb (v4)
From: Jiri Pirko @ 2010-06-01 15:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev, kaber, eric.dumazet
In-Reply-To: <20100601082805.1c84b16d@nehalam>

Actually, I'm not happy about this (reducing to only one hook) and for two
reasons:

1) I think it's a good behaviour to "mask" one handler by another in case device
is for example used for macvlan and then added to bridge. Because when it's
again removed from the bridge, the original functionality is restored. And also,
this would be consistent with the current behaviour.

2) I would imagine a situation, when multiple handers are needed in cascade.
Actually I'm working on a virtual device draft which uses two handlers, although
in corner situation.

Regards,
	Jirka

Tue, Jun 01, 2010 at 05:28:05PM CEST, shemminger@vyatta.com wrote:
>
>From: Jiri Pirko <jpirko@redhat.com>
>
>What this patch does is it removes two receive frame hooks (for bridge and for
>macvlan) from __netif_receive_skb. These are replaced them with a single
>hook for both. It only supports one hook per device because it makes no
>sense to do bridging and macvlan on the same device.
>
>Then a network driver (of virtual netdev like macvlan or bridge) can register
>an rx_handler for needed net device.
>
>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>---
>v3->v4 
>       - only one hook per device.
>v2->v3
>	- used GPL-ed exports
>v1->v2
>	- writers are locked by rtnl_lock (removed unnecessary spinlock)
>	- using call_rcu in unregister
>	- nicer macvlan_port_create cleanup
>	- struct rx_hanler is made const in funtion parameters
>
> drivers/net/macvlan.c      |   19 ++++---
> include/linux/if_bridge.h  |    2 
> include/linux/if_macvlan.h |    4 -
> include/linux/netdevice.h  |    8 +++
> net/bridge/br.c            |    2 
> net/bridge/br_if.c         |    8 +++
> net/bridge/br_input.c      |   12 +++-
> net/bridge/br_private.h    |    3 -
> net/core/dev.c             |  119 ++++++++++++++++++++-------------------------
> 9 files changed, 94 insertions(+), 83 deletions(-)
>
>--- a/drivers/net/macvlan.c	2010-05-28 08:41:38.248169422 -0700
>+++ b/drivers/net/macvlan.c	2010-06-01 08:16:52.307206412 -0700
>@@ -145,15 +145,16 @@ static void macvlan_broadcast(struct sk_
> }
> 
> /* called under rcu_read_lock() from netif_receive_skb */
>-static struct sk_buff *macvlan_handle_frame(struct macvlan_port *port,
>-					    struct sk_buff *skb)
>+static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
> {
>+	struct macvlan_port *port;
> 	const struct ethhdr *eth = eth_hdr(skb);
> 	const struct macvlan_dev *vlan;
> 	const struct macvlan_dev *src;
> 	struct net_device *dev;
> 	unsigned int len;
> 
>+	port = rcu_dereference(skb->dev->macvlan_port);
> 	if (is_multicast_ether_addr(eth->h_dest)) {
> 		src = macvlan_hash_lookup(port, eth->h_source);
> 		if (!src)
>@@ -515,6 +516,7 @@ static int macvlan_port_create(struct ne
> {
> 	struct macvlan_port *port;
> 	unsigned int i;
>+	int err;
> 
> 	if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
> 		return -EINVAL;
>@@ -528,13 +530,21 @@ static int macvlan_port_create(struct ne
> 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
> 		INIT_HLIST_HEAD(&port->vlan_hash[i]);
> 	rcu_assign_pointer(dev->macvlan_port, port);
>-	return 0;
>+
>+	err = netdev_rx_handler_register(dev, macvlan_handle_frame);
>+	if (err) {
>+		dev->macvlan_port = NULL;
>+		kfree(port);
>+	}
>+
>+	return err;
> }
> 
> static void macvlan_port_destroy(struct net_device *dev)
> {
> 	struct macvlan_port *port = dev->macvlan_port;
> 
>+	netdev_rx_handler_unregister(dev);
> 	rcu_assign_pointer(dev->macvlan_port, NULL);
> 	synchronize_rcu();
> 	kfree(port);
>@@ -767,14 +777,12 @@ static int __init macvlan_init_module(vo
> 	int err;
> 
> 	register_netdevice_notifier(&macvlan_notifier_block);
>-	macvlan_handle_frame_hook = macvlan_handle_frame;
> 
> 	err = macvlan_link_register(&macvlan_link_ops);
> 	if (err < 0)
> 		goto err1;
> 	return 0;
> err1:
>-	macvlan_handle_frame_hook = NULL;
> 	unregister_netdevice_notifier(&macvlan_notifier_block);
> 	return err;
> }
>@@ -782,7 +790,6 @@ err1:
> static void __exit macvlan_cleanup_module(void)
> {
> 	rtnl_link_unregister(&macvlan_link_ops);
>-	macvlan_handle_frame_hook = NULL;
> 	unregister_netdevice_notifier(&macvlan_notifier_block);
> }
> 
>--- a/include/linux/if_bridge.h	2010-05-28 08:35:11.628190230 -0700
>+++ b/include/linux/if_bridge.h	2010-06-01 08:02:39.859736930 -0700
>@@ -102,8 +102,6 @@ struct __fdb_entry {
> #include <linux/netdevice.h>
> 
> extern void brioctl_set(int (*ioctl_hook)(struct net *, unsigned int, void __user *));
>-extern struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
>-					       struct sk_buff *skb);
> extern int (*br_should_route_hook)(struct sk_buff *skb);
> 
> #endif
>--- a/include/linux/if_macvlan.h	2010-05-28 08:35:11.628190230 -0700
>+++ b/include/linux/if_macvlan.h	2010-06-01 08:02:39.869738723 -0700
>@@ -84,8 +84,4 @@ extern int macvlan_link_register(struct 
> extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
> 				      struct net_device *dev);
> 
>-
>-extern struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *,
>-						    struct sk_buff *);
>-
> #endif /* _LINUX_IF_MACVLAN_H */
>--- a/include/linux/netdevice.h	2010-05-28 08:41:38.628169375 -0700
>+++ b/include/linux/netdevice.h	2010-06-01 08:12:59.680035275 -0700
>@@ -254,6 +254,7 @@ struct netdev_hw_addr_list {
> #define netdev_for_each_mc_addr(ha, dev) \
> 	netdev_hw_addr_list_for_each(ha, &(dev)->mc)
> 
>+
> struct hh_cache {
> 	struct hh_cache *hh_next;	/* Next entry			     */
> 	atomic_t	hh_refcnt;	/* number of users                   */
>@@ -381,6 +382,8 @@ enum gro_result {
> };
> typedef enum gro_result gro_result_t;
> 
>+typedef struct sk_buff *rx_callback_func_t(struct sk_buff *skb);
>+
> extern void __napi_schedule(struct napi_struct *n);
> 
> static inline int napi_disable_pending(struct napi_struct *n)
>@@ -957,6 +960,7 @@ struct net_device {
> #endif
> 
> 	struct netdev_queue	rx_queue;
>+	rx_callback_func_t	*rx_handler;
> 
> 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
> 
>@@ -1693,6 +1697,10 @@ static inline void napi_free_frags(struc
> 	napi->skb = NULL;
> }
> 
>+extern int netdev_rx_handler_register(struct net_device *dev,
>+				      rx_callback_func_t *hook);
>+extern void netdev_rx_handler_unregister(struct net_device *dev);
>+
> extern void		netif_nit_deliver(struct sk_buff *skb);
> extern int		dev_valid_name(const char *name);
> extern int		dev_ioctl(struct net *net, unsigned int cmd, void __user *);
>--- a/net/bridge/br.c	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br.c	2010-06-01 08:02:39.889741050 -0700
>@@ -63,7 +63,6 @@ static int __init br_init(void)
> 		goto err_out4;
> 
> 	brioctl_set(br_ioctl_deviceless_stub);
>-	br_handle_frame_hook = br_handle_frame;
> 
> #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
> 	br_fdb_test_addr_hook = br_fdb_test_addr;
>@@ -100,7 +99,6 @@ static void __exit br_deinit(void)
> 	br_fdb_test_addr_hook = NULL;
> #endif
> 
>-	br_handle_frame_hook = NULL;
> 	br_fdb_fini();
> }
> 
>--- a/net/bridge/br_if.c	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br_if.c	2010-06-01 08:14:30.521680943 -0700
>@@ -147,6 +147,7 @@ static void del_nbp(struct net_bridge_po
> 
> 	list_del_rcu(&p->list);
> 
>+	netdev_rx_handler_unregister(dev);
> 	rcu_assign_pointer(dev->br_port, NULL);
> 
> 	br_multicast_del_port(p);
>@@ -429,6 +430,11 @@ int br_add_if(struct net_bridge *br, str
> 		goto err2;
> 
> 	rcu_assign_pointer(dev->br_port, p);
>+
>+	err = netdev_rx_handler_register(dev, br_handle_frame);
>+	if (err)
>+		goto err3;
>+
> 	dev_disable_lro(dev);
> 
> 	list_add_rcu(&p->list, &br->port_list);
>@@ -451,6 +457,8 @@ int br_add_if(struct net_bridge *br, str
> 	br_netpoll_enable(br, dev);
> 
> 	return 0;
>+err3:
>+	dev->br_port = NULL;
> err2:
> 	br_fdb_delete_by_port(br, p, 1);
> err1:
>--- a/net/bridge/br_input.c	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br_input.c	2010-06-01 08:02:39.909740447 -0700
>@@ -131,15 +131,19 @@ static inline int is_link_local(const un
> }
> 
> /*
>- * Called via br_handle_frame_hook.
>  * Return NULL if skb is handled
>- * note: already called with rcu_read_lock (preempt_disabled)
>+ * note: already called with rcu_read_lock (preempt_disabled) from
>+ * netif_receive_skb
>  */
>-struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
>+struct sk_buff *br_handle_frame(struct sk_buff *skb)
> {
>+	struct net_bridge_port *p;
> 	const unsigned char *dest = eth_hdr(skb)->h_dest;
> 	int (*rhook)(struct sk_buff *skb);
> 
>+	if (skb->pkt_type == PACKET_LOOPBACK)
>+		return skb;
>+
> 	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
> 		goto drop;
> 
>@@ -147,6 +151,8 @@ struct sk_buff *br_handle_frame(struct n
> 	if (!skb)
> 		return NULL;
> 
>+	p = rcu_dereference(skb->dev->br_port);
>+
> 	if (unlikely(is_link_local(dest))) {
> 		/* Pause frames shouldn't be passed up by driver anyway */
> 		if (skb->protocol == htons(ETH_P_PAUSE))
>--- a/net/bridge/br_private.h	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br_private.h	2010-06-01 08:02:39.919725220 -0700
>@@ -331,8 +331,7 @@ extern void br_features_recompute(struct
> 
> /* br_input.c */
> extern int br_handle_frame_finish(struct sk_buff *skb);
>-extern struct sk_buff *br_handle_frame(struct net_bridge_port *p,
>-				       struct sk_buff *skb);
>+extern struct sk_buff *br_handle_frame(struct sk_buff *skb);
> 
> /* br_ioctl.c */
> extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
>--- a/net/core/dev.c	2010-05-28 08:41:38.678169590 -0700
>+++ b/net/core/dev.c	2010-06-01 08:16:38.563051750 -0700
>@@ -2581,70 +2581,14 @@ static inline int deliver_skb(struct sk_
> 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> }
> 
>-#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
>-
>-#if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
>+#if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
>+    (defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE))
> /* This hook is defined here for ATM LANE */
> int (*br_fdb_test_addr_hook)(struct net_device *dev,
> 			     unsigned char *addr) __read_mostly;
> EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
> #endif
> 
>-/*
>- * If bridge module is loaded call bridging hook.
>- *  returns NULL if packet was consumed.
>- */
>-struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
>-					struct sk_buff *skb) __read_mostly;
>-EXPORT_SYMBOL_GPL(br_handle_frame_hook);
>-
>-static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
>-					    struct packet_type **pt_prev, int *ret,
>-					    struct net_device *orig_dev)
>-{
>-	struct net_bridge_port *port;
>-
>-	if (skb->pkt_type == PACKET_LOOPBACK ||
>-	    (port = rcu_dereference(skb->dev->br_port)) == NULL)
>-		return skb;
>-
>-	if (*pt_prev) {
>-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
>-		*pt_prev = NULL;
>-	}
>-
>-	return br_handle_frame_hook(port, skb);
>-}
>-#else
>-#define handle_bridge(skb, pt_prev, ret, orig_dev)	(skb)
>-#endif
>-
>-#if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE)
>-struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *p,
>-					     struct sk_buff *skb) __read_mostly;
>-EXPORT_SYMBOL_GPL(macvlan_handle_frame_hook);
>-
>-static inline struct sk_buff *handle_macvlan(struct sk_buff *skb,
>-					     struct packet_type **pt_prev,
>-					     int *ret,
>-					     struct net_device *orig_dev)
>-{
>-	struct macvlan_port *port;
>-
>-	port = rcu_dereference(skb->dev->macvlan_port);
>-	if (!port)
>-		return skb;
>-
>-	if (*pt_prev) {
>-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
>-		*pt_prev = NULL;
>-	}
>-	return macvlan_handle_frame_hook(port, skb);
>-}
>-#else
>-#define handle_macvlan(skb, pt_prev, ret, orig_dev)	(skb)
>-#endif
>-
> #ifdef CONFIG_NET_CLS_ACT
> /* TODO: Maybe we should just force sch_ingress to be compiled in
>  * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
>@@ -2740,6 +2684,48 @@ void netif_nit_deliver(struct sk_buff *s
> 	rcu_read_unlock();
> }
> 
>+/**
>+ *	netdev_rx_handler_register - register receive handler
>+ *	@dev: device to register a handler for
>+ *	@rh: receive handler to register
>+ *
>+ *	Register a receive hander for a device. This handler will then be
>+ *	called from __netif_receive_skb. A negative errno code is returned
>+ *	on a failure.
>+ *
>+ *	The caller must hold the rtnl_mutex.
>+ */
>+int netdev_rx_handler_register(struct net_device *dev,
>+			       rx_callback_func_t *hook)
>+{
>+	ASSERT_RTNL();
>+
>+	if (dev->rx_handler)
>+		return -EBUSY;
>+
>+	rcu_assign_pointer(dev->rx_handler, hook);
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
>+
>+/**
>+ *	netdev_rx_handler_unregister - unregister receive handler
>+ *	@dev: device to unregister a handler from
>+ *	@rh: receive handler to unregister
>+ *
>+ *	Unregister a receive hander from a device.
>+ *
>+ *	The caller must hold the rtnl_mutex.
>+ */
>+void netdev_rx_handler_unregister(struct net_device *dev)
>+{
>+
>+	ASSERT_RTNL();
>+	dev->rx_handler = NULL;
>+}
>+EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
>+
> static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
> 					      struct net_device *master)
> {
>@@ -2792,6 +2778,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
> static int __netif_receive_skb(struct sk_buff *skb)
> {
> 	struct packet_type *ptype, *pt_prev;
>+	rx_callback_func_t *rh;
> 	struct net_device *orig_dev;
> 	struct net_device *master;
> 	struct net_device *null_or_orig;
>@@ -2855,12 +2842,16 @@ static int __netif_receive_skb(struct sk
> ncls:
> #endif
> 
>-	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
>-	if (!skb)
>-		goto out;
>-	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
>-	if (!skb)
>-		goto out;
>+	/* Handle special case of bridge or macvlan */
>+	if ((rh = rcu_dereference(skb->dev->rx_handler)) != NULL) {
>+		if (pt_prev) {
>+			ret = deliver_skb(skb, pt_prev, orig_dev);
>+			pt_prev = NULL;
>+		}
>+		skb = rh(skb);
>+		if (!skb)
>+			goto out;
>+	}
> 
> 	/*
> 	 * Make sure frames received on VLAN interfaces stacked on

^ permalink raw reply

* net: replace hooks in __netif_receive_skb (v4)
From: Stephen Hemminger @ 2010-06-01 15:28 UTC (permalink / raw)
  To: Jiri Pirko, davem; +Cc: netdev, kaber, eric.dumazet
In-Reply-To: <20100528073345.GD2823@psychotron.redhat.com>


From: Jiri Pirko <jpirko@redhat.com>

What this patch does is it removes two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced them with a single
hook for both. It only supports one hook per device because it makes no
sense to do bridging and macvlan on the same device.

Then a network driver (of virtual netdev like macvlan or bridge) can register
an rx_handler for needed net device.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
v3->v4 
       - only one hook per device.
v2->v3
	- used GPL-ed exports
v1->v2
	- writers are locked by rtnl_lock (removed unnecessary spinlock)
	- using call_rcu in unregister
	- nicer macvlan_port_create cleanup
	- struct rx_hanler is made const in funtion parameters

 drivers/net/macvlan.c      |   19 ++++---
 include/linux/if_bridge.h  |    2 
 include/linux/if_macvlan.h |    4 -
 include/linux/netdevice.h  |    8 +++
 net/bridge/br.c            |    2 
 net/bridge/br_if.c         |    8 +++
 net/bridge/br_input.c      |   12 +++-
 net/bridge/br_private.h    |    3 -
 net/core/dev.c             |  119 ++++++++++++++++++++-------------------------
 9 files changed, 94 insertions(+), 83 deletions(-)

--- a/drivers/net/macvlan.c	2010-05-28 08:41:38.248169422 -0700
+++ b/drivers/net/macvlan.c	2010-06-01 08:16:52.307206412 -0700
@@ -145,15 +145,16 @@ static void macvlan_broadcast(struct sk_
 }
 
 /* called under rcu_read_lock() from netif_receive_skb */
-static struct sk_buff *macvlan_handle_frame(struct macvlan_port *port,
-					    struct sk_buff *skb)
+static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 {
+	struct macvlan_port *port;
 	const struct ethhdr *eth = eth_hdr(skb);
 	const struct macvlan_dev *vlan;
 	const struct macvlan_dev *src;
 	struct net_device *dev;
 	unsigned int len;
 
+	port = rcu_dereference(skb->dev->macvlan_port);
 	if (is_multicast_ether_addr(eth->h_dest)) {
 		src = macvlan_hash_lookup(port, eth->h_source);
 		if (!src)
@@ -515,6 +516,7 @@ static int macvlan_port_create(struct ne
 {
 	struct macvlan_port *port;
 	unsigned int i;
+	int err;
 
 	if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
 		return -EINVAL;
@@ -528,13 +530,21 @@ static int macvlan_port_create(struct ne
 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
 		INIT_HLIST_HEAD(&port->vlan_hash[i]);
 	rcu_assign_pointer(dev->macvlan_port, port);
-	return 0;
+
+	err = netdev_rx_handler_register(dev, macvlan_handle_frame);
+	if (err) {
+		dev->macvlan_port = NULL;
+		kfree(port);
+	}
+
+	return err;
 }
 
 static void macvlan_port_destroy(struct net_device *dev)
 {
 	struct macvlan_port *port = dev->macvlan_port;
 
+	netdev_rx_handler_unregister(dev);
 	rcu_assign_pointer(dev->macvlan_port, NULL);
 	synchronize_rcu();
 	kfree(port);
@@ -767,14 +777,12 @@ static int __init macvlan_init_module(vo
 	int err;
 
 	register_netdevice_notifier(&macvlan_notifier_block);
-	macvlan_handle_frame_hook = macvlan_handle_frame;
 
 	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
 err1:
-	macvlan_handle_frame_hook = NULL;
 	unregister_netdevice_notifier(&macvlan_notifier_block);
 	return err;
 }
@@ -782,7 +790,6 @@ err1:
 static void __exit macvlan_cleanup_module(void)
 {
 	rtnl_link_unregister(&macvlan_link_ops);
-	macvlan_handle_frame_hook = NULL;
 	unregister_netdevice_notifier(&macvlan_notifier_block);
 }
 
--- a/include/linux/if_bridge.h	2010-05-28 08:35:11.628190230 -0700
+++ b/include/linux/if_bridge.h	2010-06-01 08:02:39.859736930 -0700
@@ -102,8 +102,6 @@ struct __fdb_entry {
 #include <linux/netdevice.h>
 
 extern void brioctl_set(int (*ioctl_hook)(struct net *, unsigned int, void __user *));
-extern struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
-					       struct sk_buff *skb);
 extern int (*br_should_route_hook)(struct sk_buff *skb);
 
 #endif
--- a/include/linux/if_macvlan.h	2010-05-28 08:35:11.628190230 -0700
+++ b/include/linux/if_macvlan.h	2010-06-01 08:02:39.869738723 -0700
@@ -84,8 +84,4 @@ extern int macvlan_link_register(struct 
 extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 				      struct net_device *dev);
 
-
-extern struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *,
-						    struct sk_buff *);
-
 #endif /* _LINUX_IF_MACVLAN_H */
--- a/include/linux/netdevice.h	2010-05-28 08:41:38.628169375 -0700
+++ b/include/linux/netdevice.h	2010-06-01 08:12:59.680035275 -0700
@@ -254,6 +254,7 @@ struct netdev_hw_addr_list {
 #define netdev_for_each_mc_addr(ha, dev) \
 	netdev_hw_addr_list_for_each(ha, &(dev)->mc)
 
+
 struct hh_cache {
 	struct hh_cache *hh_next;	/* Next entry			     */
 	atomic_t	hh_refcnt;	/* number of users                   */
@@ -381,6 +382,8 @@ enum gro_result {
 };
 typedef enum gro_result gro_result_t;
 
+typedef struct sk_buff *rx_callback_func_t(struct sk_buff *skb);
+
 extern void __napi_schedule(struct napi_struct *n);
 
 static inline int napi_disable_pending(struct napi_struct *n)
@@ -957,6 +960,7 @@ struct net_device {
 #endif
 
 	struct netdev_queue	rx_queue;
+	rx_callback_func_t	*rx_handler;
 
 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
 
@@ -1693,6 +1697,10 @@ static inline void napi_free_frags(struc
 	napi->skb = NULL;
 }
 
+extern int netdev_rx_handler_register(struct net_device *dev,
+				      rx_callback_func_t *hook);
+extern void netdev_rx_handler_unregister(struct net_device *dev);
+
 extern void		netif_nit_deliver(struct sk_buff *skb);
 extern int		dev_valid_name(const char *name);
 extern int		dev_ioctl(struct net *net, unsigned int cmd, void __user *);
--- a/net/bridge/br.c	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br.c	2010-06-01 08:02:39.889741050 -0700
@@ -63,7 +63,6 @@ static int __init br_init(void)
 		goto err_out4;
 
 	brioctl_set(br_ioctl_deviceless_stub);
-	br_handle_frame_hook = br_handle_frame;
 
 #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
 	br_fdb_test_addr_hook = br_fdb_test_addr;
@@ -100,7 +99,6 @@ static void __exit br_deinit(void)
 	br_fdb_test_addr_hook = NULL;
 #endif
 
-	br_handle_frame_hook = NULL;
 	br_fdb_fini();
 }
 
--- a/net/bridge/br_if.c	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br_if.c	2010-06-01 08:14:30.521680943 -0700
@@ -147,6 +147,7 @@ static void del_nbp(struct net_bridge_po
 
 	list_del_rcu(&p->list);
 
+	netdev_rx_handler_unregister(dev);
 	rcu_assign_pointer(dev->br_port, NULL);
 
 	br_multicast_del_port(p);
@@ -429,6 +430,11 @@ int br_add_if(struct net_bridge *br, str
 		goto err2;
 
 	rcu_assign_pointer(dev->br_port, p);
+
+	err = netdev_rx_handler_register(dev, br_handle_frame);
+	if (err)
+		goto err3;
+
 	dev_disable_lro(dev);
 
 	list_add_rcu(&p->list, &br->port_list);
@@ -451,6 +457,8 @@ int br_add_if(struct net_bridge *br, str
 	br_netpoll_enable(br, dev);
 
 	return 0;
+err3:
+	dev->br_port = NULL;
 err2:
 	br_fdb_delete_by_port(br, p, 1);
 err1:
--- a/net/bridge/br_input.c	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br_input.c	2010-06-01 08:02:39.909740447 -0700
@@ -131,15 +131,19 @@ static inline int is_link_local(const un
 }
 
 /*
- * Called via br_handle_frame_hook.
  * Return NULL if skb is handled
- * note: already called with rcu_read_lock (preempt_disabled)
+ * note: already called with rcu_read_lock (preempt_disabled) from
+ * netif_receive_skb
  */
-struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
+struct sk_buff *br_handle_frame(struct sk_buff *skb)
 {
+	struct net_bridge_port *p;
 	const unsigned char *dest = eth_hdr(skb)->h_dest;
 	int (*rhook)(struct sk_buff *skb);
 
+	if (skb->pkt_type == PACKET_LOOPBACK)
+		return skb;
+
 	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
 		goto drop;
 
@@ -147,6 +151,8 @@ struct sk_buff *br_handle_frame(struct n
 	if (!skb)
 		return NULL;
 
+	p = rcu_dereference(skb->dev->br_port);
+
 	if (unlikely(is_link_local(dest))) {
 		/* Pause frames shouldn't be passed up by driver anyway */
 		if (skb->protocol == htons(ETH_P_PAUSE))
--- a/net/bridge/br_private.h	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br_private.h	2010-06-01 08:02:39.919725220 -0700
@@ -331,8 +331,7 @@ extern void br_features_recompute(struct
 
 /* br_input.c */
 extern int br_handle_frame_finish(struct sk_buff *skb);
-extern struct sk_buff *br_handle_frame(struct net_bridge_port *p,
-				       struct sk_buff *skb);
+extern struct sk_buff *br_handle_frame(struct sk_buff *skb);
 
 /* br_ioctl.c */
 extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
--- a/net/core/dev.c	2010-05-28 08:41:38.678169590 -0700
+++ b/net/core/dev.c	2010-06-01 08:16:38.563051750 -0700
@@ -2581,70 +2581,14 @@ static inline int deliver_skb(struct sk_
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
-#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
-
-#if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
+#if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
+    (defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE))
 /* This hook is defined here for ATM LANE */
 int (*br_fdb_test_addr_hook)(struct net_device *dev,
 			     unsigned char *addr) __read_mostly;
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-/*
- * If bridge module is loaded call bridging hook.
- *  returns NULL if packet was consumed.
- */
-struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
-					struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(br_handle_frame_hook);
-
-static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
-					    struct packet_type **pt_prev, int *ret,
-					    struct net_device *orig_dev)
-{
-	struct net_bridge_port *port;
-
-	if (skb->pkt_type == PACKET_LOOPBACK ||
-	    (port = rcu_dereference(skb->dev->br_port)) == NULL)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	return br_handle_frame_hook(port, skb);
-}
-#else
-#define handle_bridge(skb, pt_prev, ret, orig_dev)	(skb)
-#endif
-
-#if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE)
-struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *p,
-					     struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(macvlan_handle_frame_hook);
-
-static inline struct sk_buff *handle_macvlan(struct sk_buff *skb,
-					     struct packet_type **pt_prev,
-					     int *ret,
-					     struct net_device *orig_dev)
-{
-	struct macvlan_port *port;
-
-	port = rcu_dereference(skb->dev->macvlan_port);
-	if (!port)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-	return macvlan_handle_frame_hook(port, skb);
-}
-#else
-#define handle_macvlan(skb, pt_prev, ret, orig_dev)	(skb)
-#endif
-
 #ifdef CONFIG_NET_CLS_ACT
 /* TODO: Maybe we should just force sch_ingress to be compiled in
  * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
@@ -2740,6 +2684,48 @@ void netif_nit_deliver(struct sk_buff *s
 	rcu_read_unlock();
 }
 
+/**
+ *	netdev_rx_handler_register - register receive handler
+ *	@dev: device to register a handler for
+ *	@rh: receive handler to register
+ *
+ *	Register a receive hander for a device. This handler will then be
+ *	called from __netif_receive_skb. A negative errno code is returned
+ *	on a failure.
+ *
+ *	The caller must hold the rtnl_mutex.
+ */
+int netdev_rx_handler_register(struct net_device *dev,
+			       rx_callback_func_t *hook)
+{
+	ASSERT_RTNL();
+
+	if (dev->rx_handler)
+		return -EBUSY;
+
+	rcu_assign_pointer(dev->rx_handler, hook);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
+
+/**
+ *	netdev_rx_handler_unregister - unregister receive handler
+ *	@dev: device to unregister a handler from
+ *	@rh: receive handler to unregister
+ *
+ *	Unregister a receive hander from a device.
+ *
+ *	The caller must hold the rtnl_mutex.
+ */
+void netdev_rx_handler_unregister(struct net_device *dev)
+{
+
+	ASSERT_RTNL();
+	dev->rx_handler = NULL;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
+
 static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
 					      struct net_device *master)
 {
@@ -2792,6 +2778,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
+	rx_callback_func_t *rh;
 	struct net_device *orig_dev;
 	struct net_device *master;
 	struct net_device *null_or_orig;
@@ -2855,12 +2842,16 @@ static int __netif_receive_skb(struct sk
 ncls:
 #endif
 
-	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
-	if (!skb)
-		goto out;
-	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
-	if (!skb)
-		goto out;
+	/* Handle special case of bridge or macvlan */
+	if ((rh = rcu_dereference(skb->dev->rx_handler)) != NULL) {
+		if (pt_prev) {
+			ret = deliver_skb(skb, pt_prev, orig_dev);
+			pt_prev = NULL;
+		}
+		skb = rh(skb);
+		if (!skb)
+			goto out;
+	}
 
 	/*
 	 * Make sure frames received on VLAN interfaces stacked on

^ permalink raw reply

* Re: Fore 200 firmware
From: Meelis Roos @ 2010-06-01 14:52 UTC (permalink / raw)
  To: chas3; +Cc: netdev
In-Reply-To: <201006011437.o51Ebp6f014332@thirdoffive.cmf.nrl.navy.mil>

> >So the current state is that drivers/atm contains the C dump of fore 200 
> >firmware in drivers/atm/fore200e_pca_fw.c but this firmware is not built 
> >and not installed by make install. It's also not present in 
> >linux-firware tree/packages so the fore driver can not load its firmware 
> >from anywhere. I guess this is not the intended outcome.
> 
> this firmware is in the linux-atm source tree now.  the 2.5.1 release
> should have the firmware.

2.5.1 does not seem to contain the firmware.

Even if it will be there in the next release, we should still do 
something to distribute it to the user so that any modern distro has the 
firmware available (in linux-firmware or via similar means) so the user 
does not have to redo the detective work I am doing. Packaging it along 
with linux-atm packages (atm-tools in Debian) does not seem correct, so 
do we need to make a special firmware package out of it or just add to 
some other firmware repo that is packaged already?

What's the licence of the firmware anyway - unknown likely?

And we can probably remove drivers/atm/fore200e_pca_fw.c if we 
distribute the firmware out of kernel tree.

-- 
Meelis Roos (mroos@linux.ee)

^ permalink raw reply

* Re: [PATCH] greth: Fix build after OF device conversions.
From: Grant Likely @ 2010-06-01 14:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100601.001059.241461551.davem@davemloft.net>

On Tue, Jun 1, 2010 at 1:10 AM, David Miller <davem@davemloft.net> wrote:
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>
> I've pushed this to net-2.6

Thanks David.

g.

>
>  drivers/net/greth.c |   11 +++++------
>  1 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/greth.c b/drivers/net/greth.c
> index f37a4c1..3a029d0 100644
> --- a/drivers/net/greth.c
> +++ b/drivers/net/greth.c
> @@ -1607,14 +1607,13 @@ static struct of_device_id greth_of_match[] = {
>  MODULE_DEVICE_TABLE(of, greth_of_match);
>
>  static struct of_platform_driver greth_of_driver = {
> -       .name = "grlib-greth",
> -       .match_table = greth_of_match,
> +       .driver = {
> +               .name = "grlib-greth",
> +               .owner = THIS_MODULE,
> +               .of_match_table = greth_of_match,
> +       },
>        .probe = greth_of_probe,
>        .remove = __devexit_p(greth_of_remove),
> -       .driver = {
> -                  .owner = THIS_MODULE,
> -                  .name = "grlib-greth",
> -                  },
>  };
>
>  static int __init greth_init(void)
> --
> 1.7.0.4
>
>



-- 
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.

^ permalink raw reply

* Re: Fore 200 firmware
From: Chas Williams (CONTRACTOR) @ 2010-06-01 14:37 UTC (permalink / raw)
  To: Meelis Roos; +Cc: netdev
In-Reply-To: <alpine.SOC.1.00.1005311058180.2805@math.ut.ee>

In message <alpine.SOC.1.00.1005311058180.2805@math.ut.ee>,Meelis Roos writes:
>So the current state is that drivers/atm contains the C dump of fore 200 
>firmware in drivers/atm/fore200e_pca_fw.c but this firmware is not built 
>and not installed by make install. It's also not present in 
>linux-firware tree/packages so the fore driver can not load its firmware 
>from anywhere. I guess this is not the intended outcome.

this firmware is in the linux-atm source tree now.  the 2.5.1 release
should have the firmware.

http://sourceforge.net/projects/linux-atm/files/

^ permalink raw reply

* Re: [PATCH 1/3] vhost: replace vhost_workqueue with per-vhost kthread
From: Paul E. McKenney @ 2010-06-01 14:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tejun Heo, Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100531152221.GB2987@redhat.com>

On Mon, May 31, 2010 at 06:22:21PM +0300, Michael S. Tsirkin wrote:
> On Sun, May 30, 2010 at 10:24:01PM +0200, Tejun Heo wrote:
> > Replace vhost_workqueue with per-vhost kthread.  Other than callback
> > argument change from struct work_struct * to struct vhost_poll *,
> > there's no visible change to vhost_poll_*() interface.
> 
> I would prefer a substructure vhost_work, even just to make
> the code easier to review and compare to workqueue.c.

Either way this plays out, the rcu_dereference_check() calls will need
to be adjusted to reflect the change.

							Thanx, Paul

> > This conversion is to make each vhost use a dedicated kthread so that
> > resource control via cgroup can be applied.
> > 
> > Partially based on Sridhar Samudrala's patch.
> > 
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
> > ---
> > Okay, here is three patch series to convert vhost to use per-vhost
> > kthread, add cgroup_attach_task_current_cg() and apply it to the vhost
> > kthreads.  The conversion is mostly straight forward although flush is
> > slightly tricky.
> > 
> > The problem is that I have no idea how to test this.
> 
> It's a 3 step process:
> 
> 1. 
> Install qemu-kvm under fc13, or build recent one from source,
> get it from here:
> git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
> 
> 2. install guest under it:
> qemu-img create -f qcow2 disk.qcow2 100G
> Now get some image (e.g. fedora 13 in fc13.iso)
> and install guest:
> qemu-kvm -enable-kvm -m 1G -cdrom fc13.iso -drive file=disk.qcow2
> 
> 
> 3. set up networking. I usually simply do host to guest 
> on a special subnet for testing purposes:
> 
> Set up a bridge named mstbr0:
> 
> ifconfig mstbr0 down
> brctl delbr mstbr0
> brctl addbr mstbr0
> brctl setfd mstbr0 0
> ifconfig mstbr0 11.0.0.1
> 
> cat > ifup << EOF
> #!/bin/sh -x
> /sbin/ifconfig msttap0 0.0.0.0 up
> brctl addif mstbr0 msttap0
> EOF
> 
> 
> qemu-kvm -enable-kvm -m 1G -cdrom fc13.iso -drive file=disk.qcow2
>  -net nic,model=virtio,netdev=foo -netdev
> tap,id=foo,ifname=msttap0,script=/home/mst/ifup,downscript=no,vhost=on
> 
> after you set up the guest, log into it and
> ifconfig eth0 11.0.0.2
> 
> You should now be able to ping guest to host and back.
> Use something like netperf to stress the connection.
> Close qemu with kill -9 and unload module to test flushing code.
> 
> 
> 
> > Index: work/drivers/vhost/vhost.c
> > ===================================================================
> > --- work.orig/drivers/vhost/vhost.c
> > +++ work/drivers/vhost/vhost.c
> 
> ...
> 
> > @@ -125,10 +139,50 @@ static void vhost_vq_reset(struct vhost_
> >  	vq->log_ctx = NULL;
> >  }
> > 
> > +static int vhost_poller(void *data)
> > +{
> > +	struct vhost_dev *dev = data;
> > +	struct vhost_poll *poll;
> > +
> > +repeat:
> > +	set_current_state(TASK_INTERRUPTIBLE);	/* mb paired w/ kthread_stop */
> > +
> > +	if (kthread_should_stop()) {
> > +		__set_current_state(TASK_RUNNING);
> > +		return 0;
> > +	}
> > +
> > +	poll = NULL;
> > +	spin_lock(&dev->poller_lock);
> > +	if (!list_empty(&dev->poll_list)) {
> > +		poll = list_first_entry(&dev->poll_list,
> > +					struct vhost_poll, node);
> > +		list_del_init(&poll->node);
> > +	}
> > +	spin_unlock(&dev->poller_lock);
> > +
> > +	if (poll) {
> > +		__set_current_state(TASK_RUNNING);
> > +		poll->fn(poll);
> > +		smp_wmb();	/* paired with rmb in vhost_poll_flush() */
> > +		poll->done_seq = poll->queue_seq;
> > +		wake_up_all(&poll->done);
> 
> 
> This seems to add wakeups on data path, which uses spinlocks etc.
> OTOH workqueue.c adds a special barrier
> entry which only does a wakeup when needed.
> Right?
> 
> > +	} else
> > +		schedule();
> > +
> > +	goto repeat;
> > +}
> > +
> >  long vhost_dev_init(struct vhost_dev *dev,
> >  		    struct vhost_virtqueue *vqs, int nvqs)
> >  {
> > +	struct task_struct *poller;
> >  	int i;
> > +
> > +	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
> > +	if (IS_ERR(poller))
> > +		return PTR_ERR(poller);
> > +
> >  	dev->vqs = vqs;
> >  	dev->nvqs = nvqs;
> >  	mutex_init(&dev->mutex);
> > @@ -136,6 +190,9 @@ long vhost_dev_init(struct vhost_dev *de
> >  	dev->log_file = NULL;
> >  	dev->memory = NULL;
> >  	dev->mm = NULL;
> > +	spin_lock_init(&dev->poller_lock);
> > +	INIT_LIST_HEAD(&dev->poll_list);
> > +	dev->poller = poller;
> > 
> >  	for (i = 0; i < dev->nvqs; ++i) {
> >  		dev->vqs[i].dev = dev;
> > @@ -143,8 +200,7 @@ long vhost_dev_init(struct vhost_dev *de
> >  		vhost_vq_reset(dev, dev->vqs + i);
> >  		if (dev->vqs[i].handle_kick)
> >  			vhost_poll_init(&dev->vqs[i].poll,
> > -					dev->vqs[i].handle_kick,
> > -					POLLIN);
> > +					dev->vqs[i].handle_kick, POLLIN, dev);
> >  	}
> >  	return 0;
> >  }
> > @@ -217,6 +273,8 @@ void vhost_dev_cleanup(struct vhost_dev
> >  	if (dev->mm)
> >  		mmput(dev->mm);
> >  	dev->mm = NULL;
> > +
> > +	kthread_stop(dev->poller);
> >  }
> > 
> >  static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> > @@ -1113,16 +1171,3 @@ void vhost_disable_notify(struct vhost_v
> >  		vq_err(vq, "Failed to enable notification at %p: %d\n",
> >  		       &vq->used->flags, r);
> >  }
> > -
> > -int vhost_init(void)
> > -{
> > -	vhost_workqueue = create_singlethread_workqueue("vhost");
> > -	if (!vhost_workqueue)
> > -		return -ENOMEM;
> > -	return 0;
> > -}
> > -
> > -void vhost_cleanup(void)
> > -{
> > -	destroy_workqueue(vhost_workqueue);
> 
> I note that destroy_workqueue does a flush, kthread_stop
> doesn't. Right? Sure we don't need to check nothing is in one of
> the lists? Maybe add a BUG_ON?
> 
> > -}
> > Index: work/drivers/vhost/vhost.h
> > ===================================================================
> > --- work.orig/drivers/vhost/vhost.h
> > +++ work/drivers/vhost/vhost.h
> > @@ -5,7 +5,6 @@
> >  #include <linux/vhost.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > -#include <linux/workqueue.h>
> >  #include <linux/poll.h>
> >  #include <linux/file.h>
> >  #include <linux/skbuff.h>
> > @@ -20,19 +19,26 @@ enum {
> >  	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
> >  };
> > 
> > +struct vhost_poll;
> > +typedef void (*vhost_poll_fn_t)(struct vhost_poll *poll);
> > +
> >  /* Poll a file (eventfd or socket) */
> >  /* Note: there's nothing vhost specific about this structure. */
> >  struct vhost_poll {
> > +	vhost_poll_fn_t		  fn;
> >  	poll_table                table;
> >  	wait_queue_head_t        *wqh;
> >  	wait_queue_t              wait;
> > -	/* struct which will handle all actual work. */
> > -	struct work_struct        work;
> > +	struct list_head	  node;
> > +	wait_queue_head_t	  done;
> >  	unsigned long		  mask;
> > +	struct vhost_dev	 *dev;
> > +	int			  queue_seq;
> > +	int			  done_seq;
> >  };
> > 
> > -void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> > -		     unsigned long mask);
> > +void vhost_poll_init(struct vhost_poll *poll, vhost_poll_fn_t fn,
> > +		     unsigned long mask, struct vhost_dev *dev);
> >  void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> >  void vhost_poll_stop(struct vhost_poll *poll);
> >  void vhost_poll_flush(struct vhost_poll *poll);
> > @@ -63,7 +69,7 @@ struct vhost_virtqueue {
> >  	struct vhost_poll poll;
> > 
> >  	/* The routine to call when the Guest pings us, or timeout. */
> > -	work_func_t handle_kick;
> > +	vhost_poll_fn_t handle_kick;
> > 
> >  	/* Last available index we saw. */
> >  	u16 last_avail_idx;
> > @@ -86,11 +92,11 @@ struct vhost_virtqueue {
> >  	struct iovec hdr[VHOST_NET_MAX_SG];
> >  	size_t hdr_size;
> >  	/* We use a kind of RCU to access private pointer.
> > -	 * All readers access it from workqueue, which makes it possible to
> > -	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
> > +	 * All readers access it from poller, which makes it possible to
> > +	 * flush the vhost_poll instead of synchronize_rcu. Therefore readers do
> >  	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
> > -	 * work item execution acts instead of rcu_read_lock() and the end of
> > -	 * work item execution acts instead of rcu_read_lock().
> > +	 * vhost_poll execution acts instead of rcu_read_lock() and the end of
> > +	 * vhost_poll execution acts instead of rcu_read_lock().
> >  	 * Writers use virtqueue mutex. */
> >  	void *private_data;
> >  	/* Log write descriptors */
> > @@ -110,6 +116,9 @@ struct vhost_dev {
> >  	int nvqs;
> >  	struct file *log_file;
> >  	struct eventfd_ctx *log_ctx;
> > +	spinlock_t poller_lock;
> > +	struct list_head poll_list;
> > +	struct task_struct *poller;
> >  };
> > 
> >  long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> > @@ -136,9 +145,6 @@ bool vhost_enable_notify(struct vhost_vi
> >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> >  		    unsigned int log_num, u64 len);
> > 
> > -int vhost_init(void);
> > -void vhost_cleanup(void);
> > -
> >  #define vq_err(vq, fmt, ...) do {                                  \
> >  		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> >  		if ((vq)->error_ctx)                               \
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: Kernel panic with the b44 driver in 2.6.35-rc1
From: John W. Linville @ 2010-06-01 13:33 UTC (permalink / raw)
  To: news.gmane.org; +Cc: netdev
In-Reply-To: <4C0414F3.8030409@tvcablenet.be>

On Mon, May 31, 2010 at 09:58:43PM +0200, news.gmane.org wrote:

> I have a kernel panic at startup with the b44 driver with the 2.6.35-rc1
> kernel. I have submitted a bug report (see
> https://bugzilla.kernel.org/show_bug.cgi?id=16074) but until know, I
> din't get any answer yet. Does anybody knows what's happening. I have
> tried several git-bisect run but I didn't find any conclusive result.
> The first bad-commit was different each time and if I revert it, the
> problem still occurs anyway.

This patch is already on its way to Linus -- it should fix the problem
you are seeing:

commit da1fdb02d9200ff28b6f3a380d21930335fe5429
Author: Christoph Fritz <chf.fritz@googlemail.com>
Date:   Fri May 28 10:45:59 2010 +0200

    ssb: fix NULL ptr deref when pcihost_wrapper is used
    
    Ethernet driver b44 does register ssb by it's pcihost_wrapper
    and doesn't set ssb_chipcommon. A check on this value
    introduced with commit d53cdbb94a52a920d5420ed64d986c3523a56743
    and ea2db495f92ad2cf3301623e60cb95b4062bc484 triggers:
    
    BUG: unable to handle kernel NULL pointer dereference at 00000010
    IP: [<c1266c36>] ssb_is_sprom_available+0x16/0x30
    
    Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [Regression] [2.6.35-rc1] ssb_is_sprom_available
From: John W. Linville @ 2010-06-01 13:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Maciej Rutecki, linux-kernel@vger.kernel.org, linux-pci,
	linux-usb, Rafael J. Wysocki, netdev, Michael Buesch
In-Reply-To: <20100531205300.GB32708@parisc-linux.org>

On Mon, May 31, 2010 at 02:53:00PM -0600, Matthew Wilcox wrote:
> On Mon, May 31, 2010 at 09:55:20PM +0200, Maciej Rutecki wrote:
> > Last known good: 2.6.34
> > Failing kernel: 2.6.35-rc1
> > 
> > subsystem: PCI, USB(?)
> > 
> > Kernel dies during booting on message "ssb_is_sprom_available", see picture:
> > http://www.unixy.pl/maciek/download/kernel/2.6.35-rc1/gumis/DSC_0011.JPG
> 
> Um, looks like it's something to do with the Sonics Silicon Backplane,
> not PCI, nor USB.

Fix is on its way...

commit da1fdb02d9200ff28b6f3a380d21930335fe5429
Author: Christoph Fritz <chf.fritz@googlemail.com>
Date:   Fri May 28 10:45:59 2010 +0200

    ssb: fix NULL ptr deref when pcihost_wrapper is used
    
    Ethernet driver b44 does register ssb by it's pcihost_wrapper
    and doesn't set ssb_chipcommon. A check on this value
    introduced with commit d53cdbb94a52a920d5420ed64d986c3523a56743
    and ea2db495f92ad2cf3301623e60cb95b4062bc484 triggers:
    
    BUG: unable to handle kernel NULL pointer dereference at 00000010
    IP: [<c1266c36>] ssb_is_sprom_available+0x16/0x30
    
    Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [PATCH] cls_u32: use skb_copy_bits() to dereference data safely
From: jamal @ 2010-06-01 12:34 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1275272665-19047-1-git-send-email-xiaosuo@gmail.com>

Hi Changli,

On Mon, 2010-05-31 at 10:24 +0800, Changli Gao wrote:
> use skb_copy_bits() to dereference data safely
> 
> the original skb->data dereference isn't safe, as there isn't any skb->len or
> skb_is_nonlinear() check. 

I dont see any safety issue in current code in this respect. Do you have
a specific scenario where this would be unsafe? We inspect in 32 bit
chunks walking the packet data and stop when there is no more packet
data.

> skb_copy_bits() is used instead in this patch. And
> when the skb isn't long enough, we terminate the function u32_classify()
> immediately with -1.
> 

That could be a very interesting optimization - but i see it as _very
hard_ to do with current u32 given it has branching and different
branches would have different lengths etc. You'd have to essentially do
some math in user space as to what min length would suffice given
a specified filter and pass that to the kernel. 


cheers,
jamal



^ permalink raw reply

* [PATCH 2/2] mac8390: raise error logging priority
From: Finn Thain @ 2010-06-01 12:18 UTC (permalink / raw)
  To: David Miller; +Cc: geert, joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100531.060225.68146363.davem@davemloft.net>


Log error conditions using KERN_ERR priority.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

---

This patch doesn't address the KERN_INFO messages conditional on ei_debug. 
I don't know how those can be improved to the satisfaction of all 
interested parties.

These two patches should be applied in sequence.

Index: linux-2.6.34/drivers/net/mac8390.c
===================================================================
--- linux-2.6.34.orig/drivers/net/mac8390.c	2010-06-01 20:22:23.000000000 +1000
+++ linux-2.6.34/drivers/net/mac8390.c	2010-06-01 20:22:25.000000000 +1000
@@ -556,7 +556,7 @@ static int __init mac8390_initdev(struct
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -648,7 +648,7 @@ static int mac8390_open(struct net_devic
 	__ei_open(dev);
 	err = request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev);
 	if (err)
-		pr_info("%s: unable to get IRQ %d\n", dev->name, dev->irq);
+		pr_err("%s: unable to get IRQ %d\n", dev->name, dev->irq);
 	return err;
 }
 

^ permalink raw reply

* [PATCH 1/2] mac8390: propagate error code from request_irq
From: Finn Thain @ 2010-06-01 12:18 UTC (permalink / raw)
  To: David Miller; +Cc: geert, joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100531.060225.68146363.davem@davemloft.net>


Use the request_irq() error code as the return value for mac8390_open(). 
EAGAIN doesn't make sense for Nubus slot IRQs. Only this driver can claim 
this IRQ (until the NIC is removed, which means everything is powered 
down).

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

Index: linux-2.6.34/drivers/net/mac8390.c
===================================================================
--- linux-2.6.34.orig/drivers/net/mac8390.c	2010-06-01 20:20:29.000000000 +1000
+++ linux-2.6.34/drivers/net/mac8390.c	2010-06-01 20:22:23.000000000 +1000
@@ -643,12 +643,13 @@ static int __init mac8390_initdev(struct
 
 static int mac8390_open(struct net_device *dev)
 {
+	int err;
+
 	__ei_open(dev);
-	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
-	}
-	return 0;
+	err = request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev);
+	if (err)
+		pr_info("%s: unable to get IRQ %d\n", dev->name, dev->irq);
+	return err;
 }
 
 static int mac8390_close(struct net_device *dev)

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP multi-link
From: Richard Hartmann @ 2010-06-01 11:28 UTC (permalink / raw)
  To: Ben McKeegan
  Cc: Paul Mackerras, netdev, linux-ppp, Alan Cox,
	Alexander E. Patrakov, linux-kernel
In-Reply-To: <4C04ECA3.1080905@netservers.co.uk>

On Tue, Jun 1, 2010 at 13:18, Ben McKeegan <ben@netservers.co.uk> wrote:

> This isn't really a bug fix.  Its a behavioural change to work around poor
> quality/mismatched underlying PPP channels.

Maybe not a bug in the Linux kernel itself, but certainly in the real world
that exists around Linux. Similar to how a change to a device driver that
is needed to work around broken hardware is a bug fix, imo.


RIchard

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP  multi-link
From: Ben McKeegan @ 2010-06-01 11:18 UTC (permalink / raw)
  To: Richard Hartmann
  Cc: Paul Mackerras, netdev, linux-ppp, Alan Cox,
	Alexander E. Patrakov, linux-kernel
In-Reply-To: <AANLkTil7fk_ikrSPiZnqcJSLGCrzcCrCwbeuUMY5Yu5v@mail.gmail.com>

Richard Hartmann wrote:

> As 2.6.35-rc1 is out, does this mean that we are looking at 2.6.36 at the
> earliest? Or could this make it in as it is
> 
> a) a bug fix

This isn't really a bug fix.  Its a behavioural change to work around 
poor quality/mismatched underlying PPP channels.

Regards,
Ben.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox