Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: David Miller @ 2010-04-22  7:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1271921045.7895.4763.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 22 Apr 2010 09:24:05 +0200

> Hmm... are you sure we want to call destructor for each skb ?
> 
> Should'nt we do it before initial skb is split ?

Good idea, therefore you mean something like this?

diff --git a/net/core/dev.c b/net/core/dev.c
index 3ba774b..f3c3885 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1865,6 +1865,7 @@ static int dev_gso_segment(struct sk_buff *skb)
 	int features = dev->features & ~(illegal_highdma(dev, skb) ?
 					 NETIF_F_SG : 0);
 
+	skb_orphan_try(skb);
 	segs = skb_gso_segment(skb, features);
 
 	/* Verifying header integrity only. */

^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: Eric Dumazet @ 2010-04-22  7:24 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100422.001625.200862474.davem@davemloft.net>

Le jeudi 22 avril 2010 à 00:16 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 22 Apr 2010 09:10:33 +0200
> 
> > Le mercredi 21 avril 2010 à 22:56 -0700, David Miller a écrit :
> > 
> >> Right, I've applied this, thanks.
> >> 
> >> What we should probably do instead is call and NULL out the
> >> DEV_GSO_CB() destructor.  Right?
> > 
> > Yes, probably, I'll take a look at this if you want.
> 
> It might look something like this:
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9bf1ccc..13241da 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1892,6 +1892,20 @@ static inline void skb_orphan_try(struct sk_buff *skb)
>  		skb_orphan(skb);
>  }
>  
> +/*
> + * GSO packets need to be handled specially because such packets
> + * hold the normal SKB destructor in a backup pointer.
> + */
> +static inline void skb_orphan_try_gso(struct sk_buff *skb)
> +{
> +	if (!skb_tx(skb)->flags) {
> +		if (DEV_GSO_CB(skb)->destructor)
> +			DEV_GSO_CB(skb)->destructor(skb);
> +		DEV_GSO_CB(skb)->destructor = NULL;
> +		skb->sk = NULL;
> +	}
> +}
> +
>  int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  			struct netdev_queue *txq)
>  {
> @@ -1937,6 +1951,7 @@ gso:
>  		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
>  			skb_dst_drop(nskb);
>  
> +		skb_orphan_try_gso(skb);
>  		rc = ops->ndo_start_xmit(nskb, dev);
>  		if (unlikely(rc != NETDEV_TX_OK)) {
>  			if (rc & ~NETDEV_TX_MASK)

Hmm... are you sure we want to call destructor for each skb ?

Should'nt we do it before initial skb is split ?




^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: immediate send IPI in process_backlog()
From: David Miller @ 2010-04-22  7:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, xiaosuo, netdev
In-Reply-To: <20100422.002118.107274505.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 22 Apr 2010 00:21:18 -0700 (PDT)

> Eric, irqs are enabled in process_backlog(), so I don't know how legal
> it is to invoke net_rps_action_and_irq_enable() from there.

Nevermind I mis-read your patch.

Ignore me, I'll apply this.

Thanks!

^ permalink raw reply

* Re: [PATCH v4] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22  7:21 UTC (permalink / raw)
  To: Changli Gao
  Cc: Stephen Hemminger, David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <p2i412e6f7f1004212330y6f3e2a92oa295e528afde3cd4@mail.gmail.com>

Le jeudi 22 avril 2010 à 14:30 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 2:18 PM, Stephen Hemminger
> <shemminger@vyatta.com> wrote:
> > On Thu, 22 Apr 2010 13:51:53 +0800
> > Changli Gao <xiaosuo@gmail.com> wrote:
> >
> >> +     struct sk_buff          *input_pkt_queue_head;
> >> +     struct sk_buff          **input_pkt_queue_tailp;
> >> +     unsigned int            input_pkt_queue_len;
> >> +     unsigned int            process_queue_len;
> >
> > Why is opencoding a skb queue a step forward?
> > Just keep using sk_queue routines, just not the locked variants.
> >
> 
> I want to keep the critical section of rps_lock() as small as possible
> to reduce the potential lock contention, when RPS is used.
> 

Jamal perf reports show lock contention but also cache line ping pongs.

Yet, you keep a process_queue_len shared by producers and consumer.

Producers want to read it, while consumer decrement it (dirtying its
cache line) every packet, slowing down the things.

The idea of batching is to let the consumer process its local queue with
no impact to producers.

Please remove it completely, or make the consumer zero it only at the
end of batch processing.

A cache line miss cost is about 120 cycles. Multiply it by 1 million
packet per second...

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: immediate send IPI in process_backlog()
From: David Miller @ 2010-04-22  7:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, xiaosuo, netdev
In-Reply-To: <1271883898.7895.3379.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 21 Apr 2010 23:04:58 +0200

> If some skb are queued to our backlog, we are delaying IPI sending at
> the end of net_rx_action(), increasing latencies. This defeats the
> queueing, since we want to quickly dispatch packets to the pool of
> worker cpus, then eventually deeply process our packets.
> 
> It's better to send IPI before processing our packets in upper layers,
> from process_backlog().
> 
> Change the _and_disable_irq suffix to _and_enable_irq(), since we enable
> local irq in net_rps_action(), sorry for the confusion.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Eric, irqs are enabled in process_backlog(), so I don't know how legal
it is to invoke net_rps_action_and_irq_enable() from there.

At least, if you are depending upon a later action to pick up the
pieces if the rps_ipi_list test races, you need to update the comment
above net_rps_action_and_irq_enable() since it states that it is
always invoked with IRQs disabled :-)

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: David Miller @ 2010-04-22  7:17 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiaosuo, netdev, therbert, hadi
In-Reply-To: <1271920402.7895.4732.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 22 Apr 2010 09:13:22 +0200

> No thanks, this is out of the question.
> 
> Talk to ixiacom guys, some people settle/dismantle dozens of network
> device per second, on production machines.

Yes, ifup/ifdown performance is very important these days.

Recently we've had to do a lot of work to increase scalability and
latency in this area, let's not undo that.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: David Miller @ 2010-04-22  7:16 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1271920233.7895.4723.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 22 Apr 2010 09:10:33 +0200

> Le mercredi 21 avril 2010 à 22:56 -0700, David Miller a écrit :
> 
>> Right, I've applied this, thanks.
>> 
>> What we should probably do instead is call and NULL out the
>> DEV_GSO_CB() destructor.  Right?
> 
> Yes, probably, I'll take a look at this if you want.

It might look something like this:

diff --git a/net/core/dev.c b/net/core/dev.c
index 9bf1ccc..13241da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1892,6 +1892,20 @@ static inline void skb_orphan_try(struct sk_buff *skb)
 		skb_orphan(skb);
 }
 
+/*
+ * GSO packets need to be handled specially because such packets
+ * hold the normal SKB destructor in a backup pointer.
+ */
+static inline void skb_orphan_try_gso(struct sk_buff *skb)
+{
+	if (!skb_tx(skb)->flags) {
+		if (DEV_GSO_CB(skb)->destructor)
+			DEV_GSO_CB(skb)->destructor(skb);
+		DEV_GSO_CB(skb)->destructor = NULL;
+		skb->sk = NULL;
+	}
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1937,6 +1951,7 @@ gso:
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(nskb);
 
+		skb_orphan_try_gso(skb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)

^ permalink raw reply related

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22  7:13 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev, Tom Herbert, jamal
In-Reply-To: <j2s412e6f7f1004212333se60a9083s59185ee3466313f@mail.gmail.com>

Le jeudi 22 avril 2010 à 14:33 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 7:05 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le mercredi 14 avril 2010 à 17:52 +0800, Changli Gao a écrit :
> >> batch skb dequeueing from softnet input_pkt_queue
> >>
> >> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> >> contention and irq disabling/enabling.
> >>
> >> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> >> ----
> >
> > lock contention _is_ a problem, Jamal tests can show it.
> >
> > irq disabling/enabling is not, and force to use stop_machine() killer.
> >
> 
> Although irq_disabling/enabling is not, we should do our best to make
> fast path as quickly as possible, and because stop_machine() is used
> in slow patch, I think we can afford its weight.
> 
> 

No thanks, this is out of the question.

Talk to ixiacom guys, some people settle/dismantle dozens of network
device per second, on production machines.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: Eric Dumazet @ 2010-04-22  7:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100421.225625.177238009.davem@davemloft.net>

Le mercredi 21 avril 2010 à 22:56 -0700, David Miller a écrit :

> Right, I've applied this, thanks.
> 
> What we should probably do instead is call and NULL out the
> DEV_GSO_CB() destructor.  Right?

Yes, probably, I'll take a look at this if you want.

Thanks



^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: David Miller @ 2010-04-22  7:09 UTC (permalink / raw)
  To: arnd; +Cc: scofeldm, chrisw, netdev
In-Reply-To: <201004212313.05060.arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 21 Apr 2010 23:13:04 +0200

> My preference would probably be make these a subcategory of the
> if_link, and use the existing RTM_NEWLINK/RTM_DELLINK commands.

I was going to suggest this as well.

^ permalink raw reply

* Re: [net-next PATCH 1/2] add iovnl netlink support
From: David Miller @ 2010-04-22  6:52 UTC (permalink / raw)
  To: scofeldm; +Cc: netdev, chrisw
In-Reply-To: <20100419191807.10423.84600.stgit@savbu-pc100.cisco.com>

From: Scott Feldman <scofeldm@cisco.com>
Date: Mon, 19 Apr 2010 12:18:07 -0700

> +	if (tb[IOV_ATTR_VF_IFNAME])
> +		vf_dev = dev_get_by_name(&init_net,
> +			nla_data(tb[IOV_ATTR_VF_IFNAME]));

It's probably best to check this for NULL and notify
the user with an error in that case (don't forget to
put 'dev' in that error path :-)

As things stand it looks like if we can't find vf_dev, we'll just send
NULL down to the vf_dev arg of the various operations and possibly
silently succeed.

That's not desirable, semantically.

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-22  6:51 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev
In-Reply-To: <20100421224802.GF28829@x200.localdomain>

On Thursday 22 April 2010, Chris Wright wrote:
> > 
> > ip link add link eth0 type macvlan    # for a container
> > ip link add link eth0 type macvtap    # for qemu/vhost
> > ip link add link eth0 type vf         # for device assignment
> 
> BTW, what do you mean by device assignment?

I mean giving an SR-IOV VF to the guest as a native PCI device
rather than having qemu or vhost present a virtio-net to the
guest.

> > There are obviously significant differences between these three, but
> > they also share enough of their properties to let us treat them
> > in similar ways.
> > 
> > If we integrate the iovnl client into iproute2, the sequence for setting
> > up an enic VF and associating it to the port profile could be
> > 
> > # create vf0, pass mac and vlan id to HW, no association yet
> > ip link add link eth0 name vf0 type vf mac fe:dc:ba:12:34:56 vlan 78
> 
> Just to clarify...right now, the normal SR-IOV VF is already there.
> And, or course, can have its mac addr/vlan set already.

I don't have an SR-IOV card available for testing yet. How is this
configured now?

> > # associate vf with port profile, mac address must match the one assigned
> > #  to the interface before.
> > ip iov assoc eth0 port-profile "general" host-uuid "dcf2a873-f5ee-41dd-a7ad-802a544e48c2" \
> >        mac fe:dc:ba:12:34:56
> 
> At that point you could just do s/mac fe:.*/link vf0/

My point was that this information should be irrelevant to the code doing the
association with the switch. It sort of makes sense when the receiver is enic,
but when we send the same data to lldpad, it doesn't care about the slave device
name but only about the mac address. Especially since the slave device might not
be in the root name space any more, meaning we have no way to find it.

	Arnd

^ permalink raw reply

* Re: [net-next PATCH 1/2] add iovnl netlink support
From: David Miller @ 2010-04-22  6:48 UTC (permalink / raw)
  To: scofeldm; +Cc: netdev, chrisw
In-Reply-To: <20100419191807.10423.84600.stgit@savbu-pc100.cisco.com>

From: Scott Feldman <scofeldm@cisco.com>
Date: Mon, 19 Apr 2010 12:18:07 -0700

> +#define IOVNL_PROTO_VERSION 1
> +

Please delete this in the final version, the macro isn't even used by
the code.

We don't do protocol versioning in netlink.  Instead we get the base
stuff solid from the beginning, and then if something needs fixing up
we handle this using new attributes in a way which is both backward
and forward compatible.

Thanks.

^ permalink raw reply

* Re: [PATCH] Socket filter access to hatype
From: David Miller @ 2010-04-22  6:42 UTC (permalink / raw)
  To: leonerd; +Cc: netdev
In-Reply-To: <20100421172546.GO19334@cel.leo>

From: Paul LeoNerd Evans <leonerd@leonerd.org.uk>
Date: Wed, 21 Apr 2010 18:25:46 +0100

> When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
> interfaces, there doesn't appear to be a way for the filter program to
> actually find out the underlying hardware type the packet was captured
> on, such as is reported by the sll_hatype field of the struct sockaddr_ll
> when the packet is sent up to userland.
> 
> Unless I've managed to miss a trick somewhere, this would seem to put a
> fairly fundamental blocker on actually being able to filter in such
> packets. Granted there's the SKF_OFF_NET area to inspect at the e.g. IPv4
> level, but this makes it impossible to do anything on e.g. the Ethernet
> level.
> 
> See below for a patch to add an SKF_AD_HATYPE field, up among the other
> special access fields around SKF_AD_OFF.

This looks fine but you need to submit your patch properly,
including proper "Signed-off-by: " tags etc.  see
Documentation/SubmittingPatches for details.

Please make a complete fresh new submission, and don't try to shortcut
this by just replying and adding the Signed-off-by: or anything like
that.

Thanks.

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22  6:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev, Tom Herbert, jamal
In-Reply-To: <1271891149.7895.3751.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 7:05 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 14 avril 2010 à 17:52 +0800, Changli Gao a écrit :
>> batch skb dequeueing from softnet input_pkt_queue
>>
>> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
>> contention and irq disabling/enabling.
>>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>> ----
>
> lock contention _is_ a problem, Jamal tests can show it.
>
> irq disabling/enabling is not, and force to use stop_machine() killer.
>

Although irq_disabling/enabling is not, we should do our best to make
fast path as quickly as possible, and because stop_machine() is used
in slow patch, I think we can afford its weight.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH v4] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22  6:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, jamal, Tom Herbert, Eric Dumazet, netdev
In-Reply-To: <20100421231843.4c284991@nehalam>

On Thu, Apr 22, 2010 at 2:18 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Thu, 22 Apr 2010 13:51:53 +0800
> Changli Gao <xiaosuo@gmail.com> wrote:
>
>> +     struct sk_buff          *input_pkt_queue_head;
>> +     struct sk_buff          **input_pkt_queue_tailp;
>> +     unsigned int            input_pkt_queue_len;
>> +     unsigned int            process_queue_len;
>
> Why is opencoding a skb queue a step forward?
> Just keep using sk_queue routines, just not the locked variants.
>

I want to keep the critical section of rps_lock() as small as possible
to reduce the potential lock contention, when RPS is used.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH v4] net: batch skb dequeueing from softnet input_pkt_queue
From: Stephen Hemminger @ 2010-04-22  6:18 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, jamal, Tom Herbert, Eric Dumazet, netdev
In-Reply-To: <1271915513-2966-1-git-send-email-xiaosuo@gmail.com>

On Thu, 22 Apr 2010 13:51:53 +0800
Changli Gao <xiaosuo@gmail.com> wrote:

> +	struct sk_buff		*input_pkt_queue_head;
> +	struct sk_buff		**input_pkt_queue_tailp;
> +	unsigned int		input_pkt_queue_len;
> +	unsigned int		process_queue_len;

Why is opencoding a skb queue a step forward?
Just keep using sk_queue routines, just not the locked variants.

-- 

^ permalink raw reply

* Re: [PATCH v2] rps: optimize rps_get_cpu()
From: Changli Gao @ 2010-04-22  6:10 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <20100421.224014.77255834.davem@davemloft.net>

On Thu, Apr 22, 2010 at 1:40 PM, David Miller <davem@davemloft.net> wrote:
>
> I'll buy you a cookie if you can find a multiply generated by the
> compiler for "x * 4".  It's going to use shifts and those are
> basically free.

On amd64:

                if (pskb_may_pull(skb, (ihl * 4) + 4)) {
    2794:       8d 34 9d 04 00 00 00    lea    0x4(,%rbx,4),%esi
    279b:       4c 89 ef                mov    %r13,%rdi
    279e:       e8 a5 fd ff ff          callq  2548 <pskb_may_pull>
    27a3:       85 c0                   test   %eax,%eax
    27a5:       74 28                   je     27cf <get_rps_cpu+0x169>
                        __be16 *hports = (__be16 *) (skb->data + (ihl * 4));
    27a7:       8d 04 9d 00 00 00 00    lea    0x0(,%rbx,4),%eax

the compiler uses lea instead of multiply, and it should be more
efficient, but i'm not sure. Is there a equivalent of lea on the other
architectures?

>
> Please just change one thing at a time.  It would have helped you
> here.  I was willing to apply the port dereference part of your
> change, but not necessarily the 'ihl' changes.  But because you've
> combined them, I have no choice but to reject everything.
>

Ok. I think the first version is ready to apply, it only has the port
dereference part.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* setsockopt with cmsghdr needs COMPAT support?
From: Andrew May @ 2010-04-22  4:45 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

[-- Attachment #1: Type: text/plain, Size: 895 bytes --]

I have a userspace app that is doing an IPv6 IPV6_2292PKTOPTIONS
setsockopt to add an Extension header in a mixed 64 bit/32 bit setup.

It is failing with an EINVAL because it seems the cmsghdr doesn't get
the proper fixup.
This isn't stuff I really look at much but I came up with this hack to
at least get past the error. All my userspace is 32 bits so I just
put in the "#if 1" rather than attempting a runtime check on the socket.

I am not sure if the userspace app is doing something wrong, but it
seems like this is a real problem. The "on the stack" assumption by
the fixup helper seems like it really should be reworked, but I have no
idea how it should be done. And I didn't bother to handle the the
getsockopt function.
Doing a grep I didn't find any other offenders, but I can't say for
sure.

Does anyone have any ideas on the "right" way to fix this, or point out
my flaw?

Thanks.

[-- Attachment #2: compat.patch --]
[-- Type: text/x-patch, Size: 1821 bytes --]

diff --git a/net/compat.c b/net/compat.c
index ec24d9e..1a6fcb1 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -213,6 +213,7 @@ Efault:
 		sock_kfree_s(sk, kcmsg_base, kcmlen);
 	return err;
 }
+EXPORT_SYMBOL(cmsghdr_from_user_compat_to_kern);

 int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *data)
 {
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 33f60fc..3907ce4 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -422,6 +422,11 @@ sticky_done:
 		struct msghdr msg;
 		struct flowi fl;
 		int junk;
+#if 1
+		int compat_alloc = optlen;
+#else
+		int compat_alloc = 0;
+#endif

 		fl.fl6_flowlabel = 0;
 		fl.oif = sk->sk_bound_dev_if;
@@ -436,13 +441,29 @@ sticky_done:
 		retv = -EINVAL;
 		if (optlen > 64*1024)
 			break;
-
-		opt = sock_kmalloc(sk, sizeof(*opt) + optlen, GFP_KERNEL);
+		opt = sock_kmalloc(sk, sizeof(*opt) + optlen + compat_alloc,
+				   GFP_KERNEL);
 		retv = -ENOBUFS;
 		if (opt == NULL)
 			break;

 		memset(opt, 0, sizeof(*opt));
+#if 1
+		msg.msg_controllen = optlen;
+		msg.msg_control = optval;
+		retv = cmsghdr_from_user_compat_to_kern(&msg, sk, (void*)(opt+1),
+							optlen + compat_alloc );
+
+		printk( KERN_ERR "Cmsghdr conver ret %d len from %d to %d\n",
+			retv, (int)optlen, (int)msg.msg_controllen );
+		if (retv)
+			goto done;
+		if ( msg.msg_control != (opt+1) ){
+			printk( KERN_ERR "cmsg realloc issue???" );
+			/*Screwed*/
+		}
+		opt->tot_len = sizeof(*opt) + msg.msg_controllen;
+#else
 		opt->tot_len = sizeof(*opt) + optlen;
 		retv = -EFAULT;
 		if (copy_from_user(opt+1, optval, optlen))
@@ -450,6 +471,7 @@ sticky_done:

 		msg.msg_controllen = optlen;
 		msg.msg_control = (void*)(opt+1);
+#endif

 		retv = datagram_send_ctl(net, &msg, &fl, opt, &junk, &junk);
 		if (retv)

^ permalink raw reply related

* Re: cxgb4: Make unnecessarily global functions static
From: David Miller @ 2010-04-22  6:01 UTC (permalink / raw)
  To: rdreier; +Cc: dm, netdev
In-Reply-To: <adapr1spr7e.fsf@roland-alpha.cisco.com>

From: Roland Dreier <rdreier@cisco.com>
Date: Wed, 21 Apr 2010 11:59:17 -0700

> Also put t4_write_indirect() inside "#if 0" to avoid a "defined but not
> used" compile warning.
> 
> Signed-off-by: Roland Dreier <rolandd@cisco.com>

Also applied to net-next-2.6, thanks Roland.

^ permalink raw reply

* Re: cxgb4: Use ntohs() on __be16 value instead of htons()
From: David Miller @ 2010-04-22  6:00 UTC (permalink / raw)
  To: dm; +Cc: rdreier, netdev
In-Reply-To: <4BCF4F57.4050802@chelsio.com>

From: Dimitris Michailidis <dm@chelsio.com>
Date: Wed, 21 Apr 2010 12:17:43 -0700

> On 04/21/2010 11:09 AM, Roland Dreier wrote:
>> Use the correct direction of byte-swapping function to fix a mistake
>> shown by sparse endianness checking -- c.fl0id is __be16.
>>
>> Signed-off-by: Roland Dreier<rolandd@cisco.com>
> 
> Yes, thanks.
> 
> Acked-by: Dimitris Michailidis <dm@chelsio.com>

Applied to net-next-2.6

^ permalink raw reply

* Re: [PATCH] net: ipv6 bind to device issue
From: David Miller @ 2010-04-22  5:58 UTC (permalink / raw)
  To: brian.haley
  Cc: jolsa, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet,
	netdev
In-Reply-To: <20100421.225015.137831360.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 21 Apr 2010 22:50:15 -0700 (PDT)

> Jiri please respin your patch with the argument order
> reversed so that we can make the inexpensive check before
> the expensive one.

Nevermind, I see you posted an updated version already,
which I've applied, thanks!

^ permalink raw reply

* Re: [PATCH] ethernet: print protocol in host byte order
From: David Miller @ 2010-04-22  5:57 UTC (permalink / raw)
  To: johannes; +Cc: netdev, eric.dumazet
In-Reply-To: <1271833567.3627.12.camel@jlt3.sipsolutions.net>

From: Johannes Berg <johannes@sipsolutions.net>
Date: Wed, 21 Apr 2010 09:06:07 +0200

> Eric's recent patch added __force, but this
> place would seem to require actually doing
> a byte order conversion so the printk is
> consistent across architectures.
> 
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>

Applied to net-next-2.6, thanks a lot Johannes.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: David Miller @ 2010-04-22  5:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1271830116.7895.1316.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 21 Apr 2010 08:08:36 +0200

> Le dimanche 18 avril 2010 à 02:46 -0700, David Miller a écrit :
> 
>> Looks good, applied, thanks Eric.
> 
> Hmm, looking at the GSO stuff, I believe we should not call
> skb_orphan_try() on gso skbs ?

Right, I've applied this, thanks.

What we should probably do instead is call and NULL out the
DEV_GSO_CB() destructor.  Right?

^ permalink raw reply

* [PATCH v4] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22  5:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: jamal, Tom Herbert, Eric Dumazet, netdev, Changli Gao

batch skb dequeueing from softnet input_pkt_queue

batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention when RPS is enabled. input_pkt_queue is reimplemented as a single
linked list(FIFO), and input_pkt_queue_lock is moved into RPS section, so
softnet becomes smaller when RPS is disabled than before.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |   14 +++++--
 net/core/dev.c            |   92 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 82 insertions(+), 24 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3c5ed5f..5ccb92b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1387,6 +1387,7 @@ struct softnet_data {
 	struct Qdisc		*output_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
+	struct sk_buff		*process_queue;
 
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
@@ -1396,15 +1397,22 @@ struct softnet_data {
 	struct softnet_data	*rps_ipi_next;
 	unsigned int		cpu;
 	unsigned int		input_queue_head;
+	spinlock_t		input_pkt_queue_lock;
+	/* 4 bytes hole on 64bits machine */
 #endif
-	struct sk_buff_head	input_pkt_queue;
+	struct sk_buff		*input_pkt_queue_head;
+	struct sk_buff		**input_pkt_queue_tailp;
+	unsigned int		input_pkt_queue_len;
+	unsigned int		process_queue_len;
+
 	struct napi_struct	backlog;
 };
 
-static inline void input_queue_head_incr(struct softnet_data *sd)
+static inline void input_queue_head_add(struct softnet_data *sd,
+					unsigned int len)
 {
 #ifdef CONFIG_RPS
-	sd->input_queue_head++;
+	sd->input_queue_head += len;
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index e904c47..81c7877 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -211,14 +211,14 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 static inline void rps_lock(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	spin_lock(&sd->input_pkt_queue.lock);
+	spin_lock(&sd->input_pkt_queue_lock);
 #endif
 }
 
 static inline void rps_unlock(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	spin_unlock(&sd->input_pkt_queue.lock);
+	spin_unlock(&sd->input_pkt_queue_lock);
 #endif
 }
 
@@ -2402,6 +2402,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 {
 	struct softnet_data *sd;
 	unsigned long flags;
+	unsigned int qlen;
 
 	sd = &per_cpu(softnet_data, cpu);
 
@@ -2409,12 +2410,16 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	__get_cpu_var(netdev_rx_stat).total++;
 
 	rps_lock(sd);
-	if (sd->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (sd->input_pkt_queue.qlen) {
+	qlen = sd->input_pkt_queue_len + sd->process_queue_len;
+	if (qlen <= netdev_max_backlog) {
+		if (qlen) {
 enqueue:
-			__skb_queue_tail(&sd->input_pkt_queue, skb);
+			skb->next = NULL;
+			*sd->input_pkt_queue_tailp = skb;
+			sd->input_pkt_queue_tailp = &skb->next;
+			sd->input_pkt_queue_len++;
 #ifdef CONFIG_RPS
-			*qtail = sd->input_queue_head + sd->input_pkt_queue.qlen;
+			*qtail = sd->input_queue_head + sd->input_pkt_queue_len;
 #endif
 			rps_unlock(sd);
 			local_irq_restore(flags);
@@ -2931,16 +2936,33 @@ static void flush_backlog(void *arg)
 {
 	struct net_device *dev = arg;
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
-	struct sk_buff *skb, *tmp;
+	struct sk_buff **pskb, *skb;
 
 	rps_lock(sd);
-	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp)
+	for (pskb = &sd->input_pkt_queue_head; *pskb; ) {
+		skb = *pskb;
 		if (skb->dev == dev) {
-			__skb_unlink(skb, &sd->input_pkt_queue);
+			*pskb = skb->next;
 			kfree_skb(skb);
-			input_queue_head_incr(sd);
+			input_queue_head_add(sd, 1);
+			sd->input_pkt_queue_len--;
+		} else {
+			pskb = &skb->next;
 		}
+	}
+	sd->input_pkt_queue_tailp = pskb;
 	rps_unlock(sd);
+
+	for (pskb = &sd->process_queue; *pskb; ) {
+		skb = *pskb;
+		if (skb->dev == dev) {
+			*pskb = skb->next;
+			kfree_skb(skb);
+			sd->process_queue_len--;
+		} else {
+			pskb = &skb->next;
+		}
+	}
 }
 
 static int napi_gro_complete(struct sk_buff *skb)
@@ -3249,25 +3271,39 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
 
 	napi->weight = weight_p;
+	local_irq_disable();
 	do {
 		struct sk_buff *skb;
 
-		local_irq_disable();
+		while (sd->process_queue) {
+			skb = sd->process_queue;
+			sd->process_queue = skb->next;
+			sd->process_queue_len--;
+			local_irq_enable();
+			__netif_receive_skb(skb);
+			if (++work >= quota)
+				goto out;
+			local_irq_disable();
+		}
+
 		rps_lock(sd);
-		skb = __skb_dequeue(&sd->input_pkt_queue);
-		if (!skb) {
+		if (sd->input_pkt_queue_head == NULL) {
 			__napi_complete(napi);
 			rps_unlock(sd);
 			local_irq_enable();
 			break;
 		}
-		input_queue_head_incr(sd);
-		rps_unlock(sd);
-		local_irq_enable();
 
-		__netif_receive_skb(skb);
-	} while (++work < quota);
+		sd->process_queue = sd->input_pkt_queue_head;
+		sd->process_queue_len = sd->input_pkt_queue_len;
+		input_queue_head_add(sd, sd->input_pkt_queue_len);
+		sd->input_pkt_queue_head = NULL;
+		sd->input_pkt_queue_tailp = &sd->input_pkt_queue_head;
+		sd->input_pkt_queue_len = 0;
+		rps_unlock(sd);
+	} while (1);
 
+out:
 	return work;
 }
 
@@ -5621,10 +5657,19 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
+	while ((skb = oldsd->input_pkt_queue_head)) {
+		oldsd->input_pkt_queue_head = skb->next;
+		netif_rx(skb);
+	}
+	oldsd->input_pkt_queue_tailp = &oldsd->input_pkt_queue_head;
+	input_queue_head_add(oldsd, oldsd->input_pkt_queue_len);
+	oldsd->input_pkt_queue_len = 0;
+
+	while ((skb = oldsd->process_queue)) {
+		oldsd->process_queue = skb->next;
 		netif_rx(skb);
-		input_queue_head_incr(oldsd);
 	}
+	oldsd->process_queue_len = 0;
 
 	return NOTIFY_OK;
 }
@@ -5842,11 +5887,16 @@ static int __init net_dev_init(void)
 	for_each_possible_cpu(i) {
 		struct softnet_data *sd = &per_cpu(softnet_data, i);
 
-		skb_queue_head_init(&sd->input_pkt_queue);
+		sd->input_pkt_queue_head = NULL;
+		sd->input_pkt_queue_tailp = &sd->input_pkt_queue_head;
+		sd->input_pkt_queue_len = 0;
+		sd->process_queue = NULL;
+		sd->process_queue_len = 0;
 		sd->completion_queue = NULL;
 		INIT_LIST_HEAD(&sd->poll_list);
 
 #ifdef CONFIG_RPS
+		spin_lock_init(&sd->input_pkt_queue_lock);
 		sd->csd.func = rps_trigger_softirq;
 		sd->csd.info = sd;
 		sd->csd.flags = 0;

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox