Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 3/4] fec: add support for Freescale i.MX25 PDK (3DS)
From: Sascha Hauer @ 2010-05-20  6:46 UTC (permalink / raw)
  To: Jean-Christophe Dubois
  Cc: netdev, linux-arm-kernel, Baruch Siach, Greg Ungerer,
	David Miller
In-Reply-To: <201005191715.52905.jcd@tribudubois.net>

On Wed, May 19, 2010 at 05:15:52PM +0200, Jean-Christophe Dubois wrote:
> le lundi 25 janvier 2010 Baruch Siach a écrit
> > Hi Greg, netdev,
> > 
> > On Wed, Dec 16, 2009 at 08:34:06AM +0200, Baruch Siach wrote:
> > > On Wed, Dec 16, 2009 at 10:13:56AM +1000, Greg Ungerer wrote:
> > > > Baruch Siach wrote:
> > > > >On Tue, Dec 15, 2009 at 09:52:24PM +1000, Greg Ungerer wrote:
> > > > >>On 12/15/2009 06:31 PM, Baruch Siach wrote:
> > > > >>>+#ifndef CONFIG_M5272
> > > > >>
> > > > >>I would suggest making this conditional on FEC_MIIGSK_ENR.
> > > > >>Although the CONFIG_M5272 is the only case here currently,
> > > > >>that may change over the years. And using this here may not
> > > > >>be obvious to the causual code reader, since the register
> > > > >>offset definitions don't explicitly key on CONFIG_M5272.
> > > > >
> > > > >OK, I'll change this conditional.
> > > > >
> > > > >Can I take this as an Ack from you?
> > > > 
> > > > With that conditional check changed, sure:
> > > > 
> > > > Acked-by:  Greg Ungerer <gerg@uclinux.org>
> > > 
> > > Thanks. The updated patch below.
> > 
> > I'm really sorry to bug on this again, but since the platform code is
> > already upstream the i.MX25 code doesn't build without this patch
> > (include/linux/fec.h missing). So, someone please pick up this patch,
> > preferably prior to .33.
> > 
> > baruch
> 
> I am just wondering if somebody is going to pick up this patch 
> (http://patchwork.ozlabs.org/patch/41235/) so that it finds its way on 
> mainline.

I'm much in favour of this patch. David?

Sascha


-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  6:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <20100520061138.GA8183@gondor.apana.org.au>

On Thu, May 20, 2010 at 04:11:38PM +1000, Herbert Xu wrote:
> On Thu, May 20, 2010 at 08:03:35AM +0200, Eric Dumazet wrote:
> > Le jeudi 20 mai 2010 à 15:46 +1000, Herbert Xu a écrit :
> > 
> > > +#ifdef CONFIG_NET_CLS_CGROUP
> > > +static inline u32 task_cls_classid(struct task_struct *p)
> > > +{
> > > +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> > > +			    struct cgroup_cls_state, css).classid;
> > > +}
> > > +#else
> > > +int net_cls_subsys_id = -1;
> > > +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> > > +
> > > +static inline u32 task_cls_classid(struct task_struct *p)
> > > +{
> > > +	int id;
> > > +	u32 classid;
> > > +
> > > +	rcu_read_lock();
> > > +	id = rcu_dereference(net_cls_subsys_id);
> > > +	if (id >= 0)
> > > +		classid = container_of(task_subsys_state(p, id),
> > > +				       struct cgroup_cls_state, css)->classid;
> > > +	rcu_read_unlock();
> > > +
> > > +	return classid;
> > > +}
> > > +#endif
> > 
> > I still dont understand why you introduce this function and not reuse it
> > in net/sched/cls_cgroup.c
> > 
> > You told me it needs a separate patch, I dont think so.
> > Just make it non static and EXPORTed
> 
> Because it doesn't have to go through this RCU crap.

I'm talking about the rcu_dereference on net_cls_subsys_id of
course, not the rest of it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  6:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <1274335415.2658.46.camel@edumazet-laptop>

On Thu, May 20, 2010 at 08:03:35AM +0200, Eric Dumazet wrote:
> Le jeudi 20 mai 2010 à 15:46 +1000, Herbert Xu a écrit :
> 
> > +#ifdef CONFIG_NET_CLS_CGROUP
> > +static inline u32 task_cls_classid(struct task_struct *p)
> > +{
> > +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> > +			    struct cgroup_cls_state, css).classid;
> > +}
> > +#else
> > +int net_cls_subsys_id = -1;
> > +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> > +
> > +static inline u32 task_cls_classid(struct task_struct *p)
> > +{
> > +	int id;
> > +	u32 classid;
> > +
> > +	rcu_read_lock();
> > +	id = rcu_dereference(net_cls_subsys_id);
> > +	if (id >= 0)
> > +		classid = container_of(task_subsys_state(p, id),
> > +				       struct cgroup_cls_state, css)->classid;
> > +	rcu_read_unlock();
> > +
> > +	return classid;
> > +}
> > +#endif
> 
> I still dont understand why you introduce this function and not reuse it
> in net/sched/cls_cgroup.c
> 
> You told me it needs a separate patch, I dont think so.
> Just make it non static and EXPORTed

Because it doesn't have to go through this RCU crap.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] vhost-net: utilize PUBLISH_USED_IDX feature
From: Avi Kivity @ 2010-05-20  6:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, Juan Quintela, Rusty Russell, Paul E. McKenney,
	Arnd Bergmann, kvm, virtualization, netdev, linux-kernel,
	alex.williamson, amit.shah
In-Reply-To: <20100519222718.GB4111@redhat.com>

On 05/20/2010 01:27 AM, Michael S. Tsirkin wrote:
> On Wed, May 19, 2010 at 08:04:51PM +0300, Avi Kivity wrote:
>    
>> On 05/18/2010 04:19 AM, Michael S. Tsirkin wrote:
>>      
>>> With PUBLISH_USED_IDX, guest tells us which used entries
>>> it has consumed. This can be used to reduce the number
>>> of interrupts: after we write a used entry, if the guest has not yet
>>> consumed the previous entry, or if the guest has already consumed the
>>> new entry, we do not need to interrupt.
>>> This imporves bandwidth by 30% under some workflows.
>>>
>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>> ---
>>>
>>> Rusty, Dave, this patch depends on the patch
>>> "virtio: put last seen used index into ring itself"
>>> which is currently destined at Rusty's tree.
>>> Rusty, if you are taking that one for 2.6.35, please
>>> take this one as well.
>>> Dave, any objections?
>>>
>>>        
>> I object: I think the index should have its own cacheline,
>>      
> The issue here is that host/guest do not know each
> other's cache line size. I guess we could just put it
> at offset 128 or something like that ... Rusty?
>    

Not so pretty, but ok.

>    
>> and that it should be documented before merging.
>>      
> I think you meant to object to the virtio patch, not this one.  This
> patch does not introduce new layout, just implements host support.
> virtio spec patch will follow: it is not part of linux tree so
> there is no patch dependency.
>    

There is a reviewer dependency.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply

* Re: [PATCH] net: fix problem in dequeuing from input_pkt_queue
From: Tom Herbert @ 2010-05-20  6:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, davem, netdev
In-Reply-To: <1274330246.2658.16.camel@edumazet-laptop>

On Wed, May 19, 2010 at 9:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 19 mai 2010 à 19:48 -0700, Tom Herbert a écrit :
>> >> It should be okay?  process_backlog only runs in softirq so bottom
>> >> halves are already disabled, and I don't think flush_backlog runs out
>> >> of an interrupt.
>> >>
>> >
>> > Oh no. It is an IRQ handler.
>> >
>> Very well, I will fix that.
>>
>> Now I'm wondering, though, what the purpose of flush_backlog is...
>> since __netif_receive_skb is called with interrupts enabled it's
>> obvious flush_backlog won't catch all the skb's that reference the
>> device go away.  Is there a reason these packets need to be flushed
>> and can't just be processed?
>
> flush_backlog is called when device is dismantled.
>
> No new packets should be generated by the device at this moment.
>
But again since __netif_receive_skb is called with interrupts disabled
there is still a hole that the device could be completely dismantled
but at least one packet from the device still will be processed.  So
it seems like that's a bug, or maybe it's okay to process packets
after flush_backlog-- if the latter case were true why throw out
perfectly good packets?  The only rationale I can think of for
flush_backlog is to eliminate skb's with references to device that has
gone away, but the mechanism does not seem sufficient to cover all
possible skb's with a reference.

> Could you please split your patch in units, I spent 20 minutes to review
> it and come to same conclusion than Changli (need to disable interrupts
> as they are currently disabled) and also :
>
> input_queue_head_incr(sd); are _not_ needed in flush_backlog()
>
I don't see why they wouldn't be needed.  queue tail is incremented
when queuing to the input_pkt_queue, queue head is incremented when
dequeuing (after skb freed or processed).  queue_tail-queue_head ==
input_pkt_queue.len+process_queue.len.  These should be invariants.

> We are in the very last moments of the life of the device, in a very
> unlikely situation (packets in flight, not already consumed by the cpu),
> we are _dropping_ packets, so OOO means nothing at this point.
>
True.  But still seems nice to handle process_queue to be consistent.

> In dev_cpu_callback(), you reverse the order of input_pkt_queue /
> process_queue.
>
> Thats fine, but should be a single patch, because I am not sure the
> input_queue_head_incr() are valid here, since we re-inject these packets
> to netif_rx(). Could you clarify this point ?
>
queue_head advances on every dequeue.  See above...

Thanks for the great comments!

Tom

> Thanks !
>
>
>

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Eric Dumazet @ 2010-05-20  6:03 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <20100520054642.GA7836@gondor.apana.org.au>

Le jeudi 20 mai 2010 à 15:46 +1000, Herbert Xu a écrit :

> +#ifdef CONFIG_NET_CLS_CGROUP
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> +			    struct cgroup_cls_state, css).classid;
> +}
> +#else
> +int net_cls_subsys_id = -1;
> +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> +
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	int id;
> +	u32 classid;
> +
> +	rcu_read_lock();
> +	id = rcu_dereference(net_cls_subsys_id);
> +	if (id >= 0)
> +		classid = container_of(task_subsys_state(p, id),
> +				       struct cgroup_cls_state, css)->classid;
> +	rcu_read_unlock();
> +
> +	return classid;
> +}
> +#endif

I still dont understand why you introduce this function and not reuse it
in net/sched/cls_cgroup.c

You told me it needs a separate patch, I dont think so.
Just make it non static and EXPORTed



> +	rcu_read_lock();
> +	classid = task_cls_state(current)->classid;
> +	rcu_read_unlock();
> +




^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  5:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <1274333779.2658.43.camel@edumazet-laptop>

On Thu, May 20, 2010 at 07:36:19AM +0200, Eric Dumazet wrote:
> 
> I am ok with any kind of clarification, if its really documented as
> such, not as indirect effects of changes.

But I did document this semantic change, quite verbosely too
at that :)

> Right now, I am not sure classification is performed by the current
> process/cpu. Our queue handling can process packets queued by other cpus
> while we own the queue (__QDISC_STATE_RUNNING,) anyway.

Classification should be done when the packet is enqueued.  So
you shouldn't really see other people's traffic.

The original requeueing would have broken this too, but that is
no longer with us :)

In my original submission I forgot to mention the case of TCP
transmission triggered by an incoming ACK, which is really the
same case as this.

So let me take this opportunity to resubmit with an updated
description:

cls_cgroup: Store classid in struct sock

Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread.  This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.

Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.

In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK.  This is currently
unclassified.

As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.

This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)

The value is set at socket creation time.  So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Now you may argue that this may not be the same as the thread that
is sending the packet.  However, we already have a precedence where
an fd is passed to a different thread, its security property is
inherited.  In this case, inheriting the classid of the thread
creating the socket is also the logical thing to do.

For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.

In order to minimise risks, I have not made this the authoritative
classid.  For now it is only used as a backup when we execute
with soft IRQs disabled.  Once we're completely happy with its
semantics we can use it as the sole classid.

Footnote: I have rearranged the error path on cls_group module
creation.  If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..63c5036 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -410,6 +410,12 @@ struct cgroup_scanner {
 	void *data;
 };
 
+struct cgroup_cls_state
+{
+	struct cgroup_subsys_state css;
+	u32 classid;
+};
+
 /*
  * Add a new file to the given cgroup directory. Should only be
  * called by subsystems from within a populate() method
@@ -515,6 +521,10 @@ struct cgroup_subsys {
 	struct module *module;
 };
 
+#ifndef CONFIG_NET_CLS_CGROUP
+extern int net_cls_subsys_id;
+#endif
+
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
 #include <linux/cgroup_subsys.h>
 #undef SUBSYS
diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..361a5dc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -307,7 +307,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	__u32			sk_mark;
-	/* XXX 4 bytes hole on 64 bit */
+	u32			sk_classid;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..70bc529 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -110,6 +110,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/cgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -217,6 +218,32 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+#ifdef CONFIG_NET_CLS_CGROUP
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, net_cls_subsys_id),
+			    struct cgroup_cls_state, css).classid;
+}
+#else
+int net_cls_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_cls_subsys_id);
+
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	int id;
+	u32 classid;
+
+	rcu_read_lock();
+	id = rcu_dereference(net_cls_subsys_id);
+	if (id >= 0)
+		classid = container_of(task_subsys_state(p, id),
+				       struct cgroup_cls_state, css)->classid;
+	rcu_read_unlock();
+
+	return classid;
+}
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
 		atomic_set(&sk->sk_wmem_alloc, 1);
+
+		if (!in_interrupt())
+			sk->sk_classid = task_cls_classid(current);
 	}
 
 	return sk;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2211803..32c1296 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -16,14 +16,10 @@
 #include <linux/errno.h>
 #include <linux/skbuff.h>
 #include <linux/cgroup.h>
+#include <linux/rcupdate.h>
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
-
-struct cgroup_cls_state
-{
-	struct cgroup_subsys_state css;
-	u32 classid;
-};
+#include <net/sock.h>
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
 					       struct cgroup *cgrp);
@@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	struct cls_cgroup_head *head = tp->root;
 	u32 classid;
 
+	rcu_read_lock();
+	classid = task_cls_state(current)->classid;
+	rcu_read_unlock();
+
 	/*
 	 * Due to the nature of the classifier it is required to ignore all
 	 * packets originating from softirq context as accessing `current'
@@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET)
-		return -1;
-
-	rcu_read_lock();
-	classid = task_cls_state(current)->classid;
-	rcu_read_unlock();
+	if (softirq_count() != SOFTIRQ_OFFSET) {
+	 	/* If there is an sk_classid we'll use that. */
+		if (!skb->sk)
+			return -1;
+		classid = skb->sk->sk_classid;
+	}
 
 	if (!classid)
 		return -1;
@@ -289,18 +289,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 
 static int __init init_cgroup_cls(void)
 {
-	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
-	if (ret)
-		return ret;
+	int ret;
+
 	ret = cgroup_load_subsys(&net_cls_subsys);
 	if (ret)
-		unregister_tcf_proto_ops(&cls_cgroup_ops);
+		goto out;
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	/* We can't use rcu_assign_pointer because this is an int. */
+	smp_wmb();
+	net_cls_subsys_id = net_cls_subsys.subsys_id;
+#endif
+
+	ret = register_tcf_proto_ops(&cls_cgroup_ops);
+	if (ret)
+		cgroup_unload_subsys(&net_cls_subsys);
+
+out:
 	return ret;
 }
 
 static void __exit exit_cgroup_cls(void)
 {
 	unregister_tcf_proto_ops(&cls_cgroup_ops);
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	net_cls_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
 	cgroup_unload_subsys(&net_cls_subsys);
 }
 
Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* will 2 cpu simultaneously  process packets which have same hash value on multiqueue nic?
From: Jon Zhou @ 2010-05-20  5:37 UTC (permalink / raw)
  To: netdev@vger.kernel.org

will 2 cpu simultaneously  process packets which have same hash value on multiqueue nic?

let 's take broadcom 57711 bnx2x_main.c as an example:

#1 packet1->queue=1
#2 packet2->queue=1

will cpu1 and cpu2 execute the function " bnx2x_rx_int" in parallel, to receive packet1 & packet2

or it just depend on smp_affinity setting?

thanks
jon

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Eric Dumazet @ 2010-05-20  5:36 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <20100520052059.GC7443@gondor.apana.org.au>

Le jeudi 20 mai 2010 à 15:20 +1000, Herbert Xu a écrit :
> On Thu, May 20, 2010 at 07:15:07AM +0200, Eric Dumazet wrote:
> > 
> > I find this very biased, sorry.
> > 
> > In fact, fd passing is just fine today, if we consider that we classify
> > packets using the identity of the process *using* the fd, not the one
> > that *created* it.
> > 
> > Now your patch changes this, to the reverse, and you justify the caching
> > effect on socket. Sorry, this must be too convoluted for me.
> 
> I'm sorry you find this convoluted, but using the sending process's
> classid is inherently broken.
> 
> Here is why: consider a TCP socket shared by two processes with
> different classids both writing data to it.  Now suppose further
> that each writes just one byte, which is then coalesced into a
> single skb.
> 
> Whose classid should we use on the resulting skb?

I am ok with any kind of clarification, if its really documented as
such, not as indirect effects of changes.

Right now, I am not sure classification is performed by the current
process/cpu. Our queue handling can process packets queued by other cpus
while we own the queue (__QDISC_STATE_RUNNING,) anyway.




^ permalink raw reply

* Re: [PATCH net-next-2.6] bonding: move slave MTU handling from sysfs V2
From: Jiri Pirko @ 2010-05-20  5:34 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, davem, bonding-devel, monis
In-Reply-To: <628.1274314061@death.nxdomain.ibm.com>

Thu, May 20, 2010 at 02:07:41AM CEST, fubar@us.ibm.com wrote:
>Jiri Pirko <jpirko@redhat.com> wrote:
>
>>V1->V2: corrected res/ret use
>>
>>For some reason, MTU handling (storing, and restoring) is taking  place in
>>bond_sysfs. The correct place for this code is in bond_enslave, bond_release.
>>So move it there.
>
>	In principle this looks ok, as do the other patches, but none of
>them apply to net-next-2.6 for me except for the "optimize
>tlb_get_least_loaded_slave" patch.  It looks like you left out a patch,
>see below.
>
>>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>>---
>> drivers/net/bonding/bond_main.c  |   15 ++++++++++++++-
>> drivers/net/bonding/bond_sysfs.c |   22 ++--------------------
>> 2 files changed, 16 insertions(+), 21 deletions(-)
>>
>>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>index 5e12462..2c3f9db 100644
>>--- a/drivers/net/bonding/bond_main.c
>>+++ b/drivers/net/bonding/bond_main.c
>>@@ -1533,6 +1533,14 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>> 	 */
>> 	new_slave->original_flags = slave_dev->flags;
>>
>>+	/* Save slave's original mtu and then set it to match the bond */
>>+	new_slave->original_mtu = slave_dev->mtu;
>>+	res = dev_set_mtu(slave_dev, bond->dev->mtu);
>>+	if (res) {
>>+		pr_debug("Error %d calling dev_set_mtu\n", res);
>>+		goto err_free;
>>+	}
>>+
>> 	/*
>> 	 * Save slave's original ("permanent") mac address for modes
>> 	 * that need it, and for restoring it upon release, and then
>>@@ -1550,7 +1558,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>> 		res = dev_set_mac_address(slave_dev, &addr);
>> 		if (res) {
>> 			pr_debug("Error %d calling set_mac_address\n", res);
>>-			goto err_free;
>>+			goto err_restore_mtu;
>> 		}
>> 	}
>>
>>@@ -1785,6 +1793,9 @@ err_restore_mac:
>> 		dev_set_mac_address(slave_dev, &addr);
>> 	}
>>
>>+err_restore_mtu:
>>+	dev_set_mtu(slave_dev, new_slave->original_mtu);
>>+
>> err_free:
>> 	kfree(new_slave);
>>
>>@@ -1969,6 +1980,8 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
>> 		dev_set_mac_address(slave_dev, &addr);
>> 	}
>>
>>+	dev_set_mtu(slave_dev, slave->original_mtu);
>>+
>> 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
>> 				   IFF_SLAVE_INACTIVE | IFF_BONDING |
>> 				   IFF_SLAVE_NEEDARP);
>>diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>>index 392e291..29a7a8a 100644
>>--- a/drivers/net/bonding/bond_sysfs.c
>>+++ b/drivers/net/bonding/bond_sysfs.c
>>@@ -220,7 +220,6 @@ static ssize_t bonding_store_slaves(struct device *d,
>> 	char command[IFNAMSIZ + 1] = { 0, };
>> 	char *ifname;
>> 	int i, res, ret = count;
>>-	u32 original_mtu;
>> 	struct slave *slave;
>> 	struct net_device *dev = NULL;
>> 	struct bonding *bond = to_bond(d);
>
>	This chunk doesn't apply to net-next-2.6 because your context
>doesn't match; it looks like you've removed the variable "found" in your
>"before" source.  On closer inspection, "found" isn't actually used
>meaningfully, so I'm guessing you removed it in a prior patch but didn't
>submit that patch.
>
>	If that's the case, could you repost the whole series, with
>sequence numbers?

I don't think that's necessary for now. The patch removing found was posted as a
first one:
http://patchwork.ozlabs.org/patch/52795/

I tried it several times. Patches are cleanly applicable in order I posted it.

Jirka
>
>	-J
>
>>@@ -281,18 +280,7 @@ static ssize_t bonding_store_slaves(struct device *d,
>> 			memcpy(bond->dev->dev_addr, dev->dev_addr,
>> 			       dev->addr_len);
>>
>>-		/* Set the slave's MTU to match the bond */
>>-		original_mtu = dev->mtu;
>>-		res = dev_set_mtu(dev, bond->dev->mtu);
>>-		if (res) {
>>-			ret = res;
>>-			goto out;
>>-		}
>>-
>> 		res = bond_enslave(bond->dev, dev);
>>-		bond_for_each_slave(bond, slave, i)
>>-			if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0)
>>-				slave->original_mtu = original_mtu;
>> 		if (res)
>> 			ret = res;
>>
>>@@ -301,23 +289,17 @@ static ssize_t bonding_store_slaves(struct device *d,
>>
>> 	if (command[0] == '-') {
>> 		dev = NULL;
>>-		original_mtu = 0;
>> 		bond_for_each_slave(bond, slave, i)
>> 			if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) {
>> 				dev = slave->dev;
>>-				original_mtu = slave->original_mtu;
>> 				break;
>> 			}
>> 		if (dev) {
>> 			pr_info("%s: Removing slave %s\n",
>> 				bond->dev->name, dev->name);
>>-				res = bond_release(bond->dev, dev);
>>-			if (res) {
>>+			res = bond_release(bond->dev, dev);
>>+			if (res)
>> 				ret = res;
>>-				goto out;
>>-			}
>>-			/* set the slave MTU to the default */
>>-			dev_set_mtu(dev, original_mtu);
>> 		} else {
>> 			pr_err("unable to remove non-existent slave %s for bond %s.\n",
>> 			       ifname, bond->dev->name);
>>-- 
>>1.6.6.1
>
>---
>	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  5:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <1274332507.2658.31.camel@edumazet-laptop>

On Thu, May 20, 2010 at 07:15:07AM +0200, Eric Dumazet wrote:
> 
> I find this very biased, sorry.
> 
> In fact, fd passing is just fine today, if we consider that we classify
> packets using the identity of the process *using* the fd, not the one
> that *created* it.
> 
> Now your patch changes this, to the reverse, and you justify the caching
> effect on socket. Sorry, this must be too convoluted for me.

I'm sorry you find this convoluted, but using the sending process's
classid is inherently broken.

Here is why: consider a TCP socket shared by two processes with
different classids both writing data to it.  Now suppose further
that each writes just one byte, which is then coalesced into a
single skb.

Whose classid should we use on the resulting skb?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Eric Dumazet @ 2010-05-20  5:15 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <20100520033446.GA6434@gondor.apana.org.au>

Le jeudi 20 mai 2010 à 13:34 +1000, Herbert Xu a écrit :

> The value is set at socket creation time.  So all sockets created
> via socket(2) automatically gains the ID of the thread creating it.
> Now you may argue that this may not be the same as the thread that
> is sending the packet.  However, we already have a precedence where
> an fd is passed to a different thread, its security property is
> inherited.  In this case, inheriting the classid of the thread
> creating the socket is also the logical thing to do.

I find this very biased, sorry.

In fact, fd passing is just fine today, if we consider that we classify
packets using the identity of the process *using* the fd, not the one
that *created* it.

Now your patch changes this, to the reverse, and you justify the caching
effect on socket. Sorry, this must be too convoluted for me.





^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  5:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <1274331255.2658.24.camel@edumazet-laptop>

On Thu, May 20, 2010 at 06:54:15AM +0200, Eric Dumazet wrote:
>
> > @@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
> >  		sock_lock_init(sk);
> >  		sock_net_set(sk, get_net(net));
> >  		atomic_set(&sk->sk_wmem_alloc, 1);
> > +
> > +		if (!in_interrupt())
> > +			sk->sk_classid = task_cls_classid(current);
> 
> It means we cache current classid of this process.
> 
> Can it change later ?

Please read my patchset description.  I explained everything there.

> > @@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
> >  	struct cls_cgroup_head *head = tp->root;
> >  	u32 classid;
> >  
> > +	rcu_read_lock();
> > +	classid = task_cls_state(current)->classid;
> > +	rcu_read_unlock();
> > +
> 
> It would be more readable to use :

Sure, however, I'm just moving the existing code around, so I
wanted it to be exactly the same.  Clean-ups should be done as
a follow-up, feel free to do this if you like.

> > @@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
> >  	 * calls by looking at the number of nested bh disable calls because
> >  	 * softirqs always disables bh.
> >  	 */
> > -	if (softirq_count() != SOFTIRQ_OFFSET)
> > -		return -1;
> > -
> > -	rcu_read_lock();
> > -	classid = task_cls_state(current)->classid;
> > -	rcu_read_unlock();
> > +	if (softirq_count() != SOFTIRQ_OFFSET) {
> 
> Why do we still need this previous test ?

You didn't read my patchset description at all, did you? :)

This patchset is trying to fix the issue at hand with the minimum
amount of changes to current behaviour.  Once we're happy with the
new semantics, we can get rid of the softirq stuff.

That will be done in a later patch to minimise the risks.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Eric Dumazet @ 2010-05-20  4:54 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev
In-Reply-To: <20100520033446.GA6434@gondor.apana.org.au>

Le jeudi 20 mai 2010 à 13:34 +1000, Herbert Xu a écrit :
> On Wed, May 19, 2010 at 08:05:22PM -0700, David Miller wrote:
> > Ok, looking forward to it.
> 
> Here we go:
> 
> cls_cgroup: Store classid in struct sock
> 
> Up until now cls_cgroup has relied on fetching the classid out of
> the current executing thread.  This runs into trouble when a packet
> processing is delayed in which case it may execute out of another
> thread's context.
> 
> Furthermore, even when a packet is not delayed we may fail to
> classify it if soft IRQs have been disabled, because this scenario
> is indistinguishable from one where a packet unrelated to the
> current thread is processed by a real soft IRQ.
> 
> As we already have a concept of packet ownership for accounting
> purposes in the skb->sk pointer, this is a natural place to store
> the classid in a persistent manner.
> 
> This patch adds the cls_cgroup classid in struct sock, filling up
> an existing hole on 64-bit :)
> 
> The value is set at socket creation time.  So all sockets created
> via socket(2) automatically gains the ID of the thread creating it.
> Now you may argue that this may not be the same as the thread that
> is sending the packet.  However, we already have a precedence where
> an fd is passed to a different thread, its security property is
> inherited.  In this case, inheriting the classid of the thread
> creating the socket is also the logical thing to do.
> 
> For sockets created on inbound connections through accept(2), we
> inherit the classid of the original listening socket through
> sk_clone, possibly preceding the actual accept(2) call.
> 
> In order to minimise risks, I have not made this the authoritative
> classid.  For now it is only used as a backup when we execute
> with soft IRQs disabled.  Once we're completely happy with its
> semantics we can use it as the sole classid.
> 
> Footnote: I have rearranged the error path on cls_group module
> creation.  If we didn't do this, then there is a window where
> someone could create a tc rule using cls_group before the cgroup
> subsystem has been registered.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8f78073..63c5036 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -410,6 +410,12 @@ struct cgroup_scanner {
>  	void *data;
>  };
>  
> +struct cgroup_cls_state
> +{
> +	struct cgroup_subsys_state css;
> +	u32 classid;
> +};
> +
>  /*
>   * Add a new file to the given cgroup directory. Should only be
>   * called by subsystems from within a populate() method
> @@ -515,6 +521,10 @@ struct cgroup_subsys {
>  	struct module *module;
>  };
>  
> +#ifndef CONFIG_NET_CLS_CGROUP
> +extern int net_cls_subsys_id;
> +#endif
> +
>  #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
>  #include <linux/cgroup_subsys.h>
>  #undef SUBSYS
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 1ad6435..361a5dc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -307,7 +307,7 @@ struct sock {
>  	void			*sk_security;
>  #endif
>  	__u32			sk_mark;
> -	/* XXX 4 bytes hole on 64 bit */
> +	u32			sk_classid;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk, int bytes);
>  	void			(*sk_write_space)(struct sock *sk);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index c5812bb..70bc529 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -110,6 +110,7 @@
>  #include <linux/tcp.h>
>  #include <linux/init.h>
>  #include <linux/highmem.h>
> +#include <linux/cgroup.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/system.h>
> @@ -217,6 +218,32 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
>  int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
>  EXPORT_SYMBOL(sysctl_optmem_max);
>  
> +#ifdef CONFIG_NET_CLS_CGROUP
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> +			    struct cgroup_cls_state, css).classid;
> +}
> +#else
> +int net_cls_subsys_id = -1;
> +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> +
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	int id;
> +	u32 classid;
> +
> +	rcu_read_lock();
> +	id = rcu_dereference(net_cls_subsys_id);
> +	if (id >= 0)
> +		classid = container_of(task_subsys_state(p, id),
> +				       struct cgroup_cls_state, css)->classid;
> +	rcu_read_unlock();
> +
> +	return classid;
> +}
> +#endif
> +
>  static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
>  {
>  	struct timeval tv;
> @@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  		sock_lock_init(sk);
>  		sock_net_set(sk, get_net(net));
>  		atomic_set(&sk->sk_wmem_alloc, 1);
> +
> +		if (!in_interrupt())
> +			sk->sk_classid = task_cls_classid(current);

It means we cache current classid of this process.

Can it change later ?


>  	}
>  
>  	return sk;
> diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
> index 2211803..32c1296 100644
> --- a/net/sched/cls_cgroup.c
> +++ b/net/sched/cls_cgroup.c
> @@ -16,14 +16,10 @@
>  #include <linux/errno.h>
>  #include <linux/skbuff.h>
>  #include <linux/cgroup.h>
> +#include <linux/rcupdate.h>
>  #include <net/rtnetlink.h>
>  #include <net/pkt_cls.h>
> -
> -struct cgroup_cls_state
> -{
> -	struct cgroup_subsys_state css;
> -	u32 classid;
> -};
> +#include <net/sock.h>
>  
>  static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
>  					       struct cgroup *cgrp);
> @@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
>  	struct cls_cgroup_head *head = tp->root;
>  	u32 classid;
>  
> +	rcu_read_lock();
> +	classid = task_cls_state(current)->classid;
> +	rcu_read_unlock();
> +

It would be more readable to use :

static inline int task_cls_state(struct task_struct *p)
{
	int classid;

	rcu_read_lock();
	classid = container_of(task_subsys_state(p, net_cls_subsys_id),
				struct cgroup_cls_state, css)->classid;
	rcu_read_unlock();
	return classid;
}

...

classid = task_cls_classid(current);

>  	/*
>  	 * Due to the nature of the classifier it is required to ignore all
>  	 * packets originating from softirq context as accessing `current'
> @@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
>  	 * calls by looking at the number of nested bh disable calls because
>  	 * softirqs always disables bh.
>  	 */
> -	if (softirq_count() != SOFTIRQ_OFFSET)
> -		return -1;
> -
> -	rcu_read_lock();
> -	classid = task_cls_state(current)->classid;
> -	rcu_read_unlock();
> +	if (softirq_count() != SOFTIRQ_OFFSET) {

Why do we still need this previous test ?

> +	 	/* If there is an sk_classid we'll use that. */
> +		if (!skb->sk)
> +			return -1;
> +		classid = skb->sk->sk_classid;
> +	}
>  
>  	if (!classid)
>  		return -1;
> @@ -289,18 +289,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
>  
>  static int __init init_cgroup_cls(void)
>  {
> -	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
> -	if (ret)
> -		return ret;
> +	int ret;
> +
>  	ret = cgroup_load_subsys(&net_cls_subsys);
>  	if (ret)
> -		unregister_tcf_proto_ops(&cls_cgroup_ops);
> +		goto out;
> +
> +#ifndef CONFIG_NET_CLS_CGROUP
> +	/* We can't use rcu_assign_pointer because this is an int. */
> +	smp_wmb();
> +	net_cls_subsys_id = net_cls_subsys.subsys_id;
> +#endif
> +
> +	ret = register_tcf_proto_ops(&cls_cgroup_ops);
> +	if (ret)
> +		cgroup_unload_subsys(&net_cls_subsys);
> +
> +out:
>  	return ret;
>  }
>  
>  static void __exit exit_cgroup_cls(void)
>  {
>  	unregister_tcf_proto_ops(&cls_cgroup_ops);
> +
> +#ifndef CONFIG_NET_CLS_CGROUP
> +	net_cls_subsys_id = -1;
> +	synchronize_rcu();
> +#endif
> +
>  	cgroup_unload_subsys(&net_cls_subsys);
>  }
>  
> 
> Thanks,



^ permalink raw reply

* Re: [PATCH] net: fix problem in dequeuing from input_pkt_queue
From: Eric Dumazet @ 2010-05-20  4:37 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Changli Gao, davem, netdev
In-Reply-To: <AANLkTimv8YJD61uADUrHoeL4xGfE8wPPUwVpwJNZPFpu@mail.gmail.com>

Le mercredi 19 mai 2010 à 19:48 -0700, Tom Herbert a écrit :
> >> It should be okay?  process_backlog only runs in softirq so bottom
> >> halves are already disabled, and I don't think flush_backlog runs out
> >> of an interrupt.
> >>
> >
> > Oh no. It is an IRQ handler.
> >
> Very well, I will fix that.
> 
> Now I'm wondering, though, what the purpose of flush_backlog is...
> since __netif_receive_skb is called with interrupts enabled it's
> obvious flush_backlog won't catch all the skb's that reference the
> device go away.  Is there a reason these packets need to be flushed
> and can't just be processed?

flush_backlog is called when device is dismantled.

No new packets should be generated by the device at this moment.

Could you please split your patch in units, I spent 20 minutes to review
it and come to same conclusion than Changli (need to disable interrupts
as they are currently disabled) and also :

input_queue_head_incr(sd); are _not_ needed in flush_backlog()

We are in the very last moments of the life of the device, in a very
unlikely situation (packets in flight, not already consumed by the cpu),
we are _dropping_ packets, so OOO means nothing at this point. 

In dev_cpu_callback(), you reverse the order of input_pkt_queue /
process_queue.

Thats fine, but should be a single patch, because I am not sure the
input_queue_head_incr() are valid here, since we re-inject these packets
to netif_rx(). Could you clarify this point ?

Thanks !

^ permalink raw reply

* Re: [PATCH] vhost-net: utilize PUBLISH_USED_IDX feature
From: Rusty Russell @ 2010-05-20  4:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, davem, Juan Quintela, Paul E. McKenney, Arnd Bergmann,
	kvm, virtualization, netdev, linux-kernel, alex.williamson,
	amit.shah
In-Reply-To: <20100519222718.GB4111@redhat.com>

On Thu, 20 May 2010 07:57:18 am Michael S. Tsirkin wrote:
> On Wed, May 19, 2010 at 08:04:51PM +0300, Avi Kivity wrote:
> > On 05/18/2010 04:19 AM, Michael S. Tsirkin wrote:
> >> With PUBLISH_USED_IDX, guest tells us which used entries
> >> it has consumed. This can be used to reduce the number
> >> of interrupts: after we write a used entry, if the guest has not yet
> >> consumed the previous entry, or if the guest has already consumed the
> >> new entry, we do not need to interrupt.
> >> This imporves bandwidth by 30% under some workflows.
> >>
> >> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> >> ---
> >>
> >> Rusty, Dave, this patch depends on the patch
> >> "virtio: put last seen used index into ring itself"
> >> which is currently destined at Rusty's tree.
> >> Rusty, if you are taking that one for 2.6.35, please
> >> take this one as well.
> >> Dave, any objections?
> >>    
> >
> > I object: I think the index should have its own cacheline,
> 
> The issue here is that host/guest do not know each
> other's cache line size. I guess we could just put it
> at offset 128 or something like that ... Rusty?

I was assuming you'd put it at the end of the padding.

I think it's a silly optimization, but Avi obviously feels strongly about
it and I respect his opinion.

Please resubmit...
Thanks,
Rusty.

^ permalink raw reply

* Re: [PATCH 2/2] net: Add classid to sk_buff.
From: David Miller @ 2010-05-20  3:47 UTC (permalink / raw)
  To: herbert; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100520033643.GA6630@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 May 2010 13:36:43 +1000

> On Wed, May 19, 2010 at 07:55:38PM -0700, David Miller wrote:
>> 
>> We make this zero cost by moving queue_mapping into an existing
>> empty __u16 slot.  Thus making a __u32 available, which we use
>> for the 'classid'.
>> 
>> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> You better hide this space before someone steals it :)

Indeed, I didn't even know we had it to begin with. :)

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: David Miller @ 2010-05-20  3:46 UTC (permalink / raw)
  To: herbert; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100520033446.GA6434@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 May 2010 13:34:46 +1000

> cls_cgroup: Store classid in struct sock

This looks great to me.

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  3:42 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100520033446.GA6434@gondor.apana.org.au>

On Thu, May 20, 2010 at 01:34:46PM +1000, Herbert Xu wrote:
>
> cls_cgroup: Store classid in struct sock
> 
> Up until now cls_cgroup has relied on fetching the classid out of
> the current executing thread.  This runs into trouble when a packet
> processing is delayed in which case it may execute out of another
> thread's context.

Oh please add this bit that I forgot about:

This also addresses the case where TCP transmits a queued packet
in response to an inbound ACK.  Currently this isn't classified
at all as we would be in softirq context.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 2/2] net: Add classid to sk_buff.
From: Herbert Xu @ 2010-05-20  3:36 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100519.195538.139536660.davem@davemloft.net>

On Wed, May 19, 2010 at 07:55:38PM -0700, David Miller wrote:
> 
> We make this zero cost by moving queue_mapping into an existing
> empty __u16 slot.  Thus making a __u32 available, which we use
> for the 'classid'.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

You better hide this space before someone steals it :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/2] ipv6: Store ndisc_nodeid in IP6CB().
From: Herbert Xu @ 2010-05-20  3:35 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100519.195536.32704411.davem@davemloft.net>

On Wed, May 19, 2010 at 07:55:36PM -0700, David Miller wrote:
> 
> There is no reason we need to use up space in the generic
> SKB area for this.  All packet input paths in ipv6 explicitly
> clear out the IP6CB() area and therefore the default value
> for ndisc_nodeid will be correct.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Looks good to me.  Thanks!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  3:34 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100519.200522.140743640.davem@davemloft.net>

On Wed, May 19, 2010 at 08:05:22PM -0700, David Miller wrote:
> Ok, looking forward to it.

Here we go:

cls_cgroup: Store classid in struct sock

Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread.  This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.

Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.

As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.

This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)

The value is set at socket creation time.  So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Now you may argue that this may not be the same as the thread that
is sending the packet.  However, we already have a precedence where
an fd is passed to a different thread, its security property is
inherited.  In this case, inheriting the classid of the thread
creating the socket is also the logical thing to do.

For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.

In order to minimise risks, I have not made this the authoritative
classid.  For now it is only used as a backup when we execute
with soft IRQs disabled.  Once we're completely happy with its
semantics we can use it as the sole classid.

Footnote: I have rearranged the error path on cls_group module
creation.  If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..63c5036 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -410,6 +410,12 @@ struct cgroup_scanner {
 	void *data;
 };
 
+struct cgroup_cls_state
+{
+	struct cgroup_subsys_state css;
+	u32 classid;
+};
+
 /*
  * Add a new file to the given cgroup directory. Should only be
  * called by subsystems from within a populate() method
@@ -515,6 +521,10 @@ struct cgroup_subsys {
 	struct module *module;
 };
 
+#ifndef CONFIG_NET_CLS_CGROUP
+extern int net_cls_subsys_id;
+#endif
+
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
 #include <linux/cgroup_subsys.h>
 #undef SUBSYS
diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..361a5dc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -307,7 +307,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	__u32			sk_mark;
-	/* XXX 4 bytes hole on 64 bit */
+	u32			sk_classid;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..70bc529 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -110,6 +110,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/cgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -217,6 +218,32 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+#ifdef CONFIG_NET_CLS_CGROUP
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, net_cls_subsys_id),
+			    struct cgroup_cls_state, css).classid;
+}
+#else
+int net_cls_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_cls_subsys_id);
+
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	int id;
+	u32 classid;
+
+	rcu_read_lock();
+	id = rcu_dereference(net_cls_subsys_id);
+	if (id >= 0)
+		classid = container_of(task_subsys_state(p, id),
+				       struct cgroup_cls_state, css)->classid;
+	rcu_read_unlock();
+
+	return classid;
+}
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
 		atomic_set(&sk->sk_wmem_alloc, 1);
+
+		if (!in_interrupt())
+			sk->sk_classid = task_cls_classid(current);
 	}
 
 	return sk;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2211803..32c1296 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -16,14 +16,10 @@
 #include <linux/errno.h>
 #include <linux/skbuff.h>
 #include <linux/cgroup.h>
+#include <linux/rcupdate.h>
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
-
-struct cgroup_cls_state
-{
-	struct cgroup_subsys_state css;
-	u32 classid;
-};
+#include <net/sock.h>
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
 					       struct cgroup *cgrp);
@@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	struct cls_cgroup_head *head = tp->root;
 	u32 classid;
 
+	rcu_read_lock();
+	classid = task_cls_state(current)->classid;
+	rcu_read_unlock();
+
 	/*
 	 * Due to the nature of the classifier it is required to ignore all
 	 * packets originating from softirq context as accessing `current'
@@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET)
-		return -1;
-
-	rcu_read_lock();
-	classid = task_cls_state(current)->classid;
-	rcu_read_unlock();
+	if (softirq_count() != SOFTIRQ_OFFSET) {
+	 	/* If there is an sk_classid we'll use that. */
+		if (!skb->sk)
+			return -1;
+		classid = skb->sk->sk_classid;
+	}
 
 	if (!classid)
 		return -1;
@@ -289,18 +289,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 
 static int __init init_cgroup_cls(void)
 {
-	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
-	if (ret)
-		return ret;
+	int ret;
+
 	ret = cgroup_load_subsys(&net_cls_subsys);
 	if (ret)
-		unregister_tcf_proto_ops(&cls_cgroup_ops);
+		goto out;
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	/* We can't use rcu_assign_pointer because this is an int. */
+	smp_wmb();
+	net_cls_subsys_id = net_cls_subsys.subsys_id;
+#endif
+
+	ret = register_tcf_proto_ops(&cls_cgroup_ops);
+	if (ret)
+		cgroup_unload_subsys(&net_cls_subsys);
+
+out:
 	return ret;
 }
 
 static void __exit exit_cgroup_cls(void)
 {
 	unregister_tcf_proto_ops(&cls_cgroup_ops);
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	net_cls_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
 	cgroup_unload_subsys(&net_cls_subsys);
 }
 

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: tun: Use netif_receive_skb instead of netif_rx
From: David Miller @ 2010-05-20  3:05 UTC (permalink / raw)
  To: herbert; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100520025741.GA6129@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 May 2010 12:57:42 +1000

> On Wed, May 19, 2010 at 07:55:33PM -0700, David Miller wrote:
>> 
>> Since it seems now inevitable to me that we'll need to store the
>> 'classid' in struct sk_buff, I've prepared two patches to do that at
>> zero cost.
> 
> Actually I'm currently working on a patch to store this in struct
> sock.  The reason I prefer struct sock is because we can then
> centralise the place where we set the classid rather than having
> it spread out through every place that creates an skb.
> 
> I'll post it soon.

Ok, looking forward to it.

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Herbert Xu @ 2010-05-20  2:57 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev
In-Reply-To: <20100519.195533.22536631.davem@davemloft.net>

On Wed, May 19, 2010 at 07:55:33PM -0700, David Miller wrote:
> 
> Since it seems now inevitable to me that we'll need to store the
> 'classid' in struct sk_buff, I've prepared two patches to do that at
> zero cost.

Actually I'm currently working on a patch to store this in struct
sock.  The reason I prefer struct sock is because we can then
centralise the place where we set the classid rather than having
it spread out through every place that creates an skb.

I'll post it soon.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH 2/2] net: Add classid to sk_buff.
From: David Miller @ 2010-05-20  2:55 UTC (permalink / raw)
  To: bmb; +Cc: tgraf, nhorman, nhorman, eric.dumazet, herbert, netdev


We make this zero cost by moving queue_mapping into an existing
empty __u16 slot.  Thus making a __u32 available, which we use
for the 'classid'.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/skbuff.h |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7c16f24..f847ec2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -367,12 +367,9 @@ struct sk_buff {
 #ifdef CONFIG_NET_CLS_ACT
 	__u16			tc_verd;	/* traffic control verdict */
 #endif
+	__u32			classid;
 #endif
 
-	__u16			queue_mapping;
-
-	/* 16 bit hole */
-
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
 #endif
@@ -385,6 +382,7 @@ struct sk_buff {
 	};
 
 	__u16			vlan_tci;
+	__u16			queue_mapping;
 
 	sk_buff_data_t		transport_header;
 	sk_buff_data_t		network_header;
-- 
1.7.0.4


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox