Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 4/4] rps: Inspect GRE encapsulated packets to get flow hash
From: David Miller @ 2011-05-19 19:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1305821795.3028.82.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 May 2011 18:16:35 +0200

> For sure it helps if this machine is the final host for these packets.
> 
> If I am a firewall or router [and not looking into GRE packets], maybe I
> dont want to spread all packets received on a tunnel to several cpus and
> reorder them when forwarded.

Maybe is the operative word here.

Unless you look inside of the tunnel, you may not have any entropy
at all for packet steering and that seems to be what Tom is trying
to attack here.

Also, if we are properly keying the inner-flow, reordering isn't an
issue.  Actual flows will not be reordered.

> Maybe we need to add a table, so that upper layer (GRE or IPIP tunnels)
> can instruct __skb_get_rxhash() that we want to deep inspect packets.

Keep in mind that we have essentially already established that the
goal of this code is to obtain as much hash steering entropy as
possible without causing the reordering of traffic within a
TCP/UDP/SCTP/etc. connection.

And Tom's changes are consistent with those goals.

If we want to start having knobs and ways to change this policy that
is a totally seperate discussion from whether Tom's changes are
correct and ready to be applied, which I think they essentially are.

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-19 19:39 UTC (permalink / raw)
  To: David Miller; +Cc: dcbw, netdev
In-Reply-To: <20110519.153507.578609630787612821.davem@davemloft.net>

David Miller wrote:
> From: Micha Nelissen <micha@neli.hopto.org>
> Date: Thu, 19 May 2011 21:24:42 +0200
> 
>> Even if link is "fake up" by driver and not actually up after 10 msecs,
>> things will continue to work (eventually), after a second, just like now.
> 
> But this eats one of the CONF_SEND_RETRIES attempts, which is only 6.

Right. So an alternative could be to increase this to 7, or do you
prefer leaving the 1 sec timeout untouched?

Micha

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-19 19:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1305738028.32080.66.camel@localhost.localdomain>

On Wed, 2011-05-18 at 10:00 -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote:
> > > > Yes, I agree.  I think for tcpdump, we really need to copy the
> > data
> > > > anyway, to avoid guest changing it in between.  So we do that
> and
> > then
> > > > use the copy everywhere, release the old one. Hmm? 
> > > 
> > > Yes. Old one use zerocopy, new one use copy data.
> > > 
> > > Thanks
> > > Shirley
> > 
> > No, that's wrong, as they might become different with a
> > malicious guest. As long as we copied already, lets realease
> > the data and have everyone use the copy. 
> 
> Ok, I will patch pskb_expand_head to test it out. 

I am patching skb_copy, skb_clone, pskb_copy, pskb_expand_head to
convert a zero-copy skb to a copy skb to avoid this kind of issue.

This overhead won't impact macvtap/vhost TX zero-copy normally.

Shirley

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: David Miller @ 2011-05-19 19:47 UTC (permalink / raw)
  To: micha; +Cc: dcbw, netdev
In-Reply-To: <4DD571ED.8060400@neli.hopto.org>

From: Micha Nelissen <micha@neli.hopto.org>
Date: Thu, 19 May 2011 21:39:25 +0200

> David Miller wrote:
>> From: Micha Nelissen <micha@neli.hopto.org>
>> Date: Thu, 19 May 2011 21:24:42 +0200
>> 
>>> Even if link is "fake up" by driver and not actually up after 10 msecs,
>>> things will continue to work (eventually), after a second, just like now.
>> 
>> But this eats one of the CONF_SEND_RETRIES attempts, which is only 6.
> 
> Right. So an alternative could be to increase this to 7, or do you
> prefer leaving the 1 sec timeout untouched?

I'm not so sure at the moment.

Why don't you post your patch with the "bool" thing fixed and I'll
think about it?

Thanks.

^ permalink raw reply

* net-2.6
From: David Miller @ 2011-05-19 19:55 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA


Everything in net-2.6 that did not make it into 2.6.39-final will
be:

1) Queued up for 2.6.39-stable

2) Pulled into net-next-2.6

Just FYI...
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH FINAL] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-19 19:55 UTC (permalink / raw)
  To: David Miller; +Cc: vladislav.yasevich, eric.dumazet, netdev

During the sctp_close() call, we do not use rcu primitives to
destroy the address list attached to the endpoint.  At the same
time, we do the removal of addresses from this list before
attempting to remove the socket from the port hash

As a result, it is possible for another process to find the socket
in the port hash that is in the process of being closed.  It then
proceeds to traverse the address list to find the conflict, only
to have that address list suddenly disappear without rcu() critical
section.

Fix issue by closing address list removal inside RCU critical
section.

Race can result in a kernel crash with general protection fault or
kernel NULL pointer dereference:

kernel: general protection fault: 0000 [#1] SMP
kernel: RIP: 0010:[<ffffffffa02f3dde>]  [<ffffffffa02f3dde>] sctp_bind_addr_conflict+0x64/0x82 [sctp]
kernel: Call Trace:
kernel:  [<ffffffffa02f415f>] ? sctp_get_port_local+0x17b/0x2a3 [sctp]
kernel:  [<ffffffffa02f3d45>] ? sctp_bind_addr_match+0x33/0x68 [sctp]
kernel:  [<ffffffffa02f4416>] ? sctp_do_bind+0xd3/0x141 [sctp]
kernel:  [<ffffffffa02f5030>] ? sctp_bindx_add+0x4d/0x8e [sctp]
kernel:  [<ffffffffa02f5183>] ? sctp_setsockopt_bindx+0x112/0x4a4 [sctp]
kernel:  [<ffffffff81089e82>] ? generic_file_aio_write+0x7f/0x9b
kernel:  [<ffffffffa02f763e>] ? sctp_setsockopt+0x14f/0xfee [sctp]
kernel:  [<ffffffff810c11fb>] ? do_sync_write+0xab/0xeb
kernel:  [<ffffffff810e82ab>] ? fsnotify+0x239/0x282
kernel:  [<ffffffff810c2462>] ? alloc_file+0x18/0xb1
kernel:  [<ffffffff8134a0b1>] ? compat_sys_setsockopt+0x1a5/0x1d9
kernel:  [<ffffffff8134aaf1>] ? compat_sys_socketcall+0x143/0x1a4
kernel:  [<ffffffff810467dc>] ? sysenter_dispatch+0x7/0x32

Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>

---
 net/sctp/bind_addr.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..6150ac5 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr *bp, __u16 port)
 /* Dispose of the address list. */
 static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 {
-	struct sctp_sockaddr_entry *addr;
-	struct list_head *pos, *temp;
+	struct sctp_sockaddr_entry *addr, *temp;
 
 	/* Empty the bind address list. */
-	list_for_each_safe(pos, temp, &bp->address_list) {
-		addr = list_entry(pos, struct sctp_sockaddr_entry, list);
-		list_del(pos);
-		kfree(addr);
+	list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
+		list_del_rcu(&addr->list);
+		call_rcu(&addr->rcu, sctp_local_addr_free);
 		SCTP_DBG_OBJCNT_DEC(addr);
 	}
 }


^ permalink raw reply related

* Re: [PATCH 4/4] rps: Inspect GRE encapsulated packets to get flow hash
From: Tom Herbert @ 2011-05-19 19:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <1305821795.3028.82.camel@edumazet-laptop>

Eric,

Thanks for comments.

> For sure it helps if this machine is the final host for these packets.
>
I would hope it helps routers too, assuming RPS already helps non
encapsulated traffic in a router.

> If I am a firewall or router [and not looking into GRE packets], maybe I
> dont want to spread all packets received on a tunnel to several cpus and
> reorder them when forwarded.
>
Do know a specific use case where this patch would be a bad thing?  I
don't believe there's any special ooo constraints on the packets in a
tunnel, flow ordering should still be maintained.

> 3) table could contains 'pointers' to decoding function, that would
> recompute a new rxhash function.
>
This could be done as a function in the protocol switch table, so that
all the protocol specific code could be moved out of dev.c.  I thought
about that several times but haven't convinced myself it's worth it;
even with the code to handle tunneling the get_hash function is still
pretty elegant (I believe ipip support is just two more lines by the
way).

^ permalink raw reply

* Re: [PATCH FINAL] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-19 19:58 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: David Miller, vladislav.yasevich, netdev
In-Reply-To: <4DD575A1.4080405@gmail.com>

Le jeudi 19 mai 2011 à 21:55 +0200, Jacek Luczak a écrit :
> During the sctp_close() call, we do not use rcu primitives to
> destroy the address list attached to the endpoint.  At the same
> time, we do the removal of addresses from this list before
> attempting to remove the socket from the port hash
> 
> As a result, it is possible for another process to find the socket
> in the port hash that is in the process of being closed.  It then
> proceeds to traverse the address list to find the conflict, only
> to have that address list suddenly disappear without rcu() critical
> section.
> 
> Fix issue by closing address list removal inside RCU critical
> section.
> 
> Race can result in a kernel crash with general protection fault or
> kernel NULL pointer dereference:
> 

> Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> 
> ---
>  net/sctp/bind_addr.c |   10 ++++------
>  1 files changed, 4 insertions(+), 6 deletions(-)
> 

Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>



^ permalink raw reply

* Re: [PATCH 4/4] rps: Inspect GRE encapsulated packets to get flow hash
From: Tom Herbert @ 2011-05-19 20:06 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20110519.153916.2250094502232694032.davem@davemloft.net>

> Unless you look inside of the tunnel, you may not have any entropy
> at all for packet steering and that seems to be what Tom is trying
> to attack here.
>
Entropy for RPS, but flow identifier for RFS.  Also, this is not just
a SW problem, RSS doesn't seem to do anything creative with GRE
packets... I'm hoping we'll be able get support in NICs to do this
same functionality.

^ permalink raw reply

* Re: [PATCH] drivers/net: ks8842 Fix crash on received packet when in PIO mode.
From: David Miller @ 2011-05-19 20:12 UTC (permalink / raw)
  To: dennis.aberilla; +Cc: info, netdev
In-Reply-To: <20110518225945.GB6416@dens-work>

From: Dennis Aberilla <dennis.aberilla@mimomax.com>
Date: Thu, 19 May 2011 10:59:47 +1200

> This patch fixes a kernel crash during packet reception due to not
> enough allocated bytes for the skb. This applies to the driver when
> running in PIO mode in an ISA bus setup.
> 
> Signed-off-by: Dennis Aberilla <denzzzhome@yahoo.com>

If you're trying to accomodate the fact that the loops iterate
always by 2 or 4 bytes at a time, then you will need to allocate
up to "3" bytes of slack space, not just "2".

You need to describe exactly what the precise problem is in your
commit message, or else people might find it difficult to figure
out exactly what the problem is.

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Julian Anastasov @ 2011-05-19 20:13 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: lvs-devel@vger.kernel.org, Simon Horman, Dave Jones,
	netdev@vger.kernel.org, Wensong Zhang
In-Reply-To: <201105191219.22907.hans.schillstrom@ericsson.com>


	Hello,

On Thu, 19 May 2011, Hans Schillstrom wrote:

> If we talk about the long term solution,
> applications that affects other netns should have their own data.
> ex. like the ftp, ports should be per netns
>     ipvsadm --appftp [port list]

	Yes, it is possible, if we find solution to
trigger register_ip_vs_app_inc() calls with such controls...
May be app have to export some method(s) for this to work.

> > 	While the protocols have controls that manipulate pernet
> > timeouts, the apps do not have such controls about app->timeouts.
> > May be we can remove app->timeouts to avoid confusion because
> > it was never implemented in user space. May be instead of this
> > fix we should restore the global ip_vs_app_list and all things in
> > ip_vs_app.c and ip_vs_ftp.c as before the netns changes?
> 
> Doesn't matter, just keep the patch as simple as possible
> while we discuss the long term solution.

	OK, then let's use this solution but with some changes...

> > > --- a/include/net/ip_vs.h
> > > +++ b/include/net/ip_vs.h
> > > @@ -797,7 +797,8 @@ struct netns_ipvs {
> > >  	struct list_head	rs_table[IP_VS_RTAB_SIZE];
> > >  	/* ip_vs_app */
> > >  	struct list_head	app_list;
> > > -
> > > +	/* ip_vs_ftp */
> > > +	struct ip_vs_app	*ftp_app;
> > >  	/* ip_vs_proto */
> > >  	#define IP_VS_PROTO_TAB_SIZE	32	/* must be power of 2 */
> > >  	struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
> > > diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> > > index 6b5dd6d..17afb09 100644
> > > --- a/net/netfilter/ipvs/ip_vs_ftp.c
> > > +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> > > @@ -411,25 +411,36 @@ static struct ip_vs_app ip_vs_ftp = {
> > >  static int __net_init __ip_vs_ftp_init(struct net *net)
> > >  {
> > >  	int i, ret;
> > > -	struct ip_vs_app *app = &ip_vs_ftp;
> > > +	struct ip_vs_app *app;
> > > +	struct netns_ipvs *ipvs = net_ipvs(net);
> > > +
> > > +	app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
> > > +	if (!app)
> > > +		return -ENOMEM;
> > > +	INIT_LIST_HEAD(&app->a_list);
> > > +	INIT_LIST_HEAD(&app->incs_list);
> > > +	INIT_LIST_HEAD(&app->p_list);

	No need to init a_list and p_list here. As for
INIT_LIST_HEAD(&app->incs_list), may be it is better to
be moved into register_ip_vs_app() before the mutex_lock
to avoid further problems.

> > > +	ipvs->ftp_app = app;
> > >  
> > >  	ret = register_ip_vs_app(net, app);
> > >  	if (ret)
> > > -		return ret;
> > > +		goto err_exit;
> > >  
> > >  	for (i=0; i<IP_VS_APP_MAX_PORTS; i++) {
> > >  		if (!ports[i])
> > >  			continue;
> > >  		ret = register_ip_vs_app_inc(net, app, app->protocol, ports[i]);
> > >  		if (ret)
> > > -			break;
> > > +			goto err_unreg;
> > >  		pr_info("%s: loaded support on port[%d] = %d\n",
> > >  			app->name, i, ports[i]);
> > >  	}
> > > +	return 0;
> > >  
> > > -	if (ret)
> > > -		unregister_ip_vs_app(net, app);
> > > -
> > > +err_unreg:
> > > +	unregister_ip_vs_app(net, app);
> > > +err_exit:
> > > +	kfree(ipvs->ftp_app);
> > >  	return ret;
> > >  }
> > >  /*
> > > @@ -437,9 +448,7 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
> > >   */
> > >  static void __ip_vs_ftp_exit(struct net *net)
> > >  {
> > > -	struct ip_vs_app *app = &ip_vs_ftp;
> > > -
> > > -	unregister_ip_vs_app(net, app);
> > > +	unregister_ip_vs_app(net, net_ipvs(net)->ftp_app);


	Add here kfree(ipvs->ftp_app);


> > >  }
> > >  
> > >  static struct pernet_operations ip_vs_ftp_ops = {

	If you agree with these changes then you have my ack.

Acked-by: Julian Anastasov <ja@ssi.bg>


^ permalink raw reply

* [PATCH v4] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-19 20:14 UTC (permalink / raw)
  To: netdev; +Cc: Micha Nelissen, David Miller
In-Reply-To: <1305405386-25187-1-git-send-email-micha@neli.hopto.org>

v3 -> v4: fix return boolean false instead of 0 for ic_is_init_dev

Currently the ip auto configuration has a hardcoded delay of 1 second.
When (ethernet) link takes longer to come up (e.g. more than 3 seconds),
nfs root may not be found.

Remove the hardcoded delay, and wait for carrier on at least one network
device.

Signed-off-by: Micha Nelissen <micha@neli.hopto.org>
Cc: David Miller <davem@davemloft.net>
---
 net/ipv4/ipconfig.c |   35 ++++++++++++++++++++++-------------
 1 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 2b09775..d2e0bf4 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -87,8 +87,8 @@
 #endif
 
 /* Define the friendly delay before and after opening net devices */
-#define CONF_PRE_OPEN		500	/* Before opening: 1/2 second */
-#define CONF_POST_OPEN		1	/* After opening: 1 second */
+#define CONF_POST_OPEN		10	/* After opening: 10 msecs */
+#define CONF_CARRIER_TIMEOUT	120000	/* Wait for carrier timeout */
 
 /* Define the timeout for waiting for a DHCP/BOOTP/RARP reply */
 #define CONF_OPEN_RETRIES 	2	/* (Re)open devices twice */
@@ -188,14 +188,14 @@ struct ic_device {
 static struct ic_device *ic_first_dev __initdata = NULL;/* List of open device */
 static struct net_device *ic_dev __initdata = NULL;	/* Selected device */
 
-static bool __init ic_device_match(struct net_device *dev)
+static bool __init ic_is_init_dev(struct net_device *dev)
 {
-	if (user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
+	if (dev->flags & IFF_LOOPBACK)
+		return false;
+	return user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
 	    (!(dev->flags & IFF_LOOPBACK) &&
 	     (dev->flags & (IFF_POINTOPOINT|IFF_BROADCAST)) &&
-	     strncmp(dev->name, "dummy", 5)))
-		return true;
-	return false;
+	     strncmp(dev->name, "dummy", 5));
 }
 
 static int __init ic_open_devs(void)
@@ -203,6 +203,7 @@ static int __init ic_open_devs(void)
 	struct ic_device *d, **last;
 	struct net_device *dev;
 	unsigned short oflags;
+	unsigned long start;
 
 	last = &ic_first_dev;
 	rtnl_lock();
@@ -216,9 +217,7 @@ static int __init ic_open_devs(void)
 	}
 
 	for_each_netdev(&init_net, dev) {
-		if (dev->flags & IFF_LOOPBACK)
-			continue;
-		if (ic_device_match(dev)) {
+		if (ic_is_init_dev(dev)) {
 			int able = 0;
 			if (dev->mtu >= 364)
 				able |= IC_BOOTP;
@@ -252,6 +251,17 @@ static int __init ic_open_devs(void)
 				dev->name, able, d->xid));
 		}
 	}
+
+	/* wait for a carrier on at least one device */
+	start = jiffies;
+	while (jiffies - start < msecs_to_jiffies(CONF_CARRIER_TIMEOUT)) {
+		for_each_netdev(&init_net, dev)
+			if (ic_is_init_dev(dev) && netif_carrier_ok(dev))
+				goto have_carrier;
+
+		msleep(1);
+	}
+have_carrier:
 	rtnl_unlock();
 
 	*last = NULL;
@@ -1324,14 +1334,13 @@ static int __init wait_for_devices(void)
 {
 	int i;
 
-	msleep(CONF_PRE_OPEN);
 	for (i = 0; i < DEVICE_WAIT_MAX; i++) {
 		struct net_device *dev;
 		int found = 0;
 
 		rtnl_lock();
 		for_each_netdev(&init_net, dev) {
-			if (ic_device_match(dev)) {
+			if (ic_is_init_dev(dev)) {
 				found = 1;
 				break;
 			}
@@ -1378,7 +1387,7 @@ static int __init ip_auto_config(void)
 		return err;
 
 	/* Give drivers a chance to settle */
-	ssleep(CONF_POST_OPEN);
+	msleep(CONF_POST_OPEN);
 
 	/*
 	 * If the config information is insufficient (e.g., our IP address or
-- 
1.7.4.4


^ permalink raw reply related

* Re: [PATCH] tcp: Lower the initial RTO to 1s as per draft RFC 2988bis-02.
From: David Miller @ 2011-05-19 20:16 UTC (permalink / raw)
  To: tsunanet
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, hagen, eric.dumazet,
	alexander.zimmermann, netdev, linux-kernel
In-Reply-To: <1305787669-84634-1-git-send-email-tsunanet@gmail.com>

From: Benoit Sigoure <tsunanet@gmail.com>
Date: Wed, 18 May 2011 23:47:49 -0700

> @@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	struct dst_entry *dst = __sk_dst_get(sk);
> +	/* If we had to retransmit anything during the 3WHS, use
> +	 * the initial fallback RTO as per draft RFC 2988bis-02.
> +	 */
> +	int init_rto = inet_csk(sk)->icsk_retransmits ?
> +		TCP_TIMEOUT_INIT_FALLBACK : TCP_TIMEOUT_INIT;

Please do not put comments in the middle of a set of function
local variable declarations.

Also, as mentioned already, icsk_retransmits is not where SYN
retransmissions are counted.

It is stored in the TCP minisocket ->retrans field.

^ permalink raw reply

* Re: [PATCH 4/4] rps: Inspect GRE encapsulated packets to get flow hash
From: Eric Dumazet @ 2011-05-19 20:17 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <BANLkTinZOEz7Bhc+UgcWkrGb-rtCJ-kFvg@mail.gmail.com>

Le jeudi 19 mai 2011 à 12:56 -0700, Tom Herbert a écrit :
> Eric,
> 
> Thanks for comments.
> 
> > For sure it helps if this machine is the final host for these packets.
> >
> I would hope it helps routers too, assuming RPS already helps non
> encapsulated traffic in a router.

If conntracking is active, RPS probably helps. (Not tested yet here)

> 
> > If I am a firewall or router [and not looking into GRE packets], maybe I
> > dont want to spread all packets received on a tunnel to several cpus and
> > reorder them when forwarded.
> >
> Do know a specific use case where this patch would be a bad thing?  I
> don't believe there's any special ooo constraints on the packets in a
> tunnel, flow ordering should still be maintained.
> 

I was referring of reordering packets coming and outgoing, this could
defeat an AQM (the source carefully ordered packets with SFQ or other
qdisc, and we send them in different order).

You and David thought I was speaking of possible OOO for packets of a
given flow. I'm pretty confident you did this right :)

> > 3) table could contains 'pointers' to decoding function, that would
> > recompute a new rxhash function.
> >
> This could be done as a function in the protocol switch table, so that
> all the protocol specific code could be moved out of dev.c.  I thought
> about that several times but haven't convinced myself it's worth it;
> even with the code to handle tunneling the get_hash function is still
> pretty elegant (I believe ipip support is just two more lines by the
> way).

Yes, anyway this can be done later.

I find a bit annoying that such changes are pushed right before merge
window when we have litle time, thats all.

Thanks



^ permalink raw reply

* Re: [PATCH net-next-2.6] ipv6: reduce per device ICMP mib sizes
From: David Miller @ 2011-05-19 20:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: denys, netdev
In-Reply-To: <1305803663.3028.39.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 May 2011 13:14:23 +0200

> [PATCH net-next-2.6] ipv6: reduce per device ICMP mib sizes.
> 
> ipv6 has per device ICMP SNMP counters, taking too much space because
> they use percpu storage.
> 
> needed size per device is : 
> (512+4)*sizeof(long)*number_of_possible_cpus*2 
> 
> On a 32bit kernel, 16 possible cpus, this wastes more than 64kbytes of
> memory per ipv6 enabled network device, taken in vmalloc pool.
> 
> Since ICMP messages are rare, just use shared counters (atomic_long_t)
> 
> Per network space ICMP counters are still using percpu memory, we might
> also convert them to shared counters in a future patch.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 0/4] rps: Look into tunnels to get hash
From: David Miller @ 2011-05-19 20:19 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <alpine.DEB.2.00.1105190833050.13026@pokey.mtv.corp.google.com>

From: Tom Herbert <therbert@google.com>
Date: Thu, 19 May 2011 08:38:57 -0700 (PDT)

> The patches in this series are to look into encapsulated packets
> to compute the rx hash for RPS.  Before these patches, all packets
> received on the same tunnel would wind up on the same RPS CPU-- this
> can lead to very poor loading, and make RFS ineffective on these
> packets.
> 
> This patch supports getting the rxhash out of a GRE encapsulated packet.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH FINAL] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: David Miller @ 2011-05-19 20:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: difrost.kernel, vladislav.yasevich, netdev
In-Reply-To: <1305835101.3156.6.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 May 2011 21:58:21 +0200

> Le jeudi 19 mai 2011 à 21:55 +0200, Jacek Luczak a écrit :
>> During the sctp_close() call, we do not use rcu primitives to
>> destroy the address list attached to the endpoint.  At the same
>> time, we do the removal of addresses from this list before
>> attempting to remove the socket from the port hash
>> 
>> As a result, it is possible for another process to find the socket
>> in the port hash that is in the process of being closed.  It then
>> proceeds to traverse the address list to find the conflict, only
>> to have that address list suddenly disappear without rcu() critical
>> section.
>> 
>> Fix issue by closing address list removal inside RCU critical
>> section.
>> 
>> Race can result in a kernel crash with general protection fault or
>> kernel NULL pointer dereference:
>> 
> 
>> Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
>> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>> CC: Eric Dumazet <eric.dumazet@gmail.com>
>> 
>> ---
>>  net/sctp/bind_addr.c |   10 ++++------
>>  1 files changed, 4 insertions(+), 6 deletions(-)
>> 
> 
> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
> 

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH v4] ipconfig wait for carrier
From: David Miller @ 2011-05-19 20:19 UTC (permalink / raw)
  To: micha; +Cc: netdev
In-Reply-To: <1305836046-21136-1-git-send-email-micha@neli.hopto.org>

From: Micha Nelissen <micha@neli.hopto.org>
Date: Thu, 19 May 2011 22:14:06 +0200

> v3 -> v4: fix return boolean false instead of 0 for ic_is_init_dev
> 
> Currently the ip auto configuration has a hardcoded delay of 1 second.
> When (ethernet) link takes longer to come up (e.g. more than 3 seconds),
> nfs root may not be found.
> 
> Remove the hardcoded delay, and wait for carrier on at least one network
> device.
> 
> Signed-off-by: Micha Nelissen <micha@neli.hopto.org>
> Cc: David Miller <davem@davemloft.net>

Ok, I've applied this, let's see what happens.

^ permalink raw reply

* Re: [PATCH 1/4] rps: Some minor cleanup in get_rps_cpus
From: Eric Dumazet @ 2011-05-19 20:21 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1105190827480.12796@pokey.mtv.corp.google.com>

Le jeudi 19 mai 2011 à 08:39 -0700, Tom Herbert a écrit :
> Use some variables for clarity and extensibility.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  net/core/dev.c |   12 +++++++-----
>  1 files changed, 7 insertions(+), 5 deletions(-)

Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>



^ permalink raw reply

* Re: [PATCH 1/2] net: ping: make local functions static
From: David Miller @ 2011-05-19 20:18 UTC (permalink / raw)
  To: xiaosuo; +Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1305789361-5366-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 19 May 2011 15:16:00 +0800

> As these functions are only used in this file.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/2] net: ping: fix the coding style
From: David Miller @ 2011-05-19 20:18 UTC (permalink / raw)
  To: xiaosuo; +Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1305789361-5366-2-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 19 May 2011 15:16:01 +0800

> The characters in a line should be no more than 80.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/4] rps: Add flag to skb to indicate rxhash is on L4 tuple
From: Eric Dumazet @ 2011-05-19 20:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1105190827240.12799@pokey.mtv.corp.google.com>

Le jeudi 19 mai 2011 à 08:39 -0700, Tom Herbert a écrit :
> The l4_rxhash flag was added to the skb structure to indicate
> that the rxhash value was computed over the 4 tuple for the
> packet which includes the port information in the encapsulated
> transport packet.  This is used by the stack to preserve the
> rxhash value in __skb_rx_tunnel.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/skbuff.h |    5 +++--
>  include/net/dst.h      |    9 ++++++++-
>  include/net/sock.h     |   14 +++++++++++---
>  net/core/dev.c         |   17 +++++++++++------
>  net/core/skbuff.c      |    1 +
>  net/ipv4/tcp_ipv4.c    |    4 ++--
>  net/ipv4/udp.c         |    4 ++--
>  7 files changed, 38 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 79aafbb..4fab336 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -397,6 +397,7 @@ struct sk_buff {
>  	__u8			ndisc_nodetype:2;
>  #endif
>  	__u8			ooo_okay:1;
> +	__u8			l4_rxhash:1;
>  	kmemcheck_bitfield_end(flags2);
>  
>  	/* 0/13 bit hole */
> @@ -555,11 +556,11 @@ extern unsigned int   skb_find_text(struct sk_buff *skb, unsigned int from,
>  				    unsigned int to, struct ts_config *config,
>  				    struct ts_state *state);
>  
> -extern __u32 __skb_get_rxhash(struct sk_buff *skb);
> +extern void __skb_get_rxhash(struct sk_buff *skb);
>  static inline __u32 skb_get_rxhash(struct sk_buff *skb)
>  {
>  	if (!skb->rxhash)
> -		skb->rxhash = __skb_get_rxhash(skb);
> +		__skb_get_rxhash(skb);
>  
>  	return skb->rxhash;
>  }
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 07a0402..17ff602 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -307,7 +307,14 @@ static inline void skb_dst_force(struct sk_buff *skb)
>  static inline void __skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev)
>  {
>  	skb->dev = dev;
> -	skb->rxhash = 0;
> +
> +	/*
> +	 * Clear rxhash so that we can reculate the hash for the

typo ?

> +	 * encapsulated packet, unless we have already determine the hash
> +	 * over the L4 4-tuple.
> +	 */
> +	if (!skb->l4_rxhash)
> +		skb->rxhash = 0;
>  	skb_set_queue_mapping(skb, 0);
>  	skb_dst_drop(skb);
>  	nf_reset(skb);
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f2046e4..e670c41 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -685,16 +685,24 @@ static inline void sock_rps_reset_flow(const struct sock *sk)
>  #endif
>  }
>  
> -static inline void sock_rps_save_rxhash(struct sock *sk, u32 rxhash)
> +static inline void sock_rps_save_rxhash(struct sock *sk, struct sk_buff *skb)

const struct sk_buff *skb ?

>  {
>  #ifdef CONFIG_RPS
> -	if (unlikely(sk->sk_rxhash != rxhash)) {
> +	if (unlikely(sk->sk_rxhash != skb->rxhash)) {
>  		sock_rps_reset_flow(sk);
> -		sk->sk_rxhash = rxhash;
> +		sk->sk_rxhash = skb->rxhash;
>  	}
>  #endif
>  }
>  
> +static inline void sock_rps_reset_rxhash(struct sock *sk)
> +{
> +#ifdef CONFIG_RPS
> +	sock_rps_reset_flow(sk);
> +	sk->sk_rxhash = 0;
> +#endif
> +}
> +
>  #define sk_wait_event(__sk, __timeo, __condition)			\
>  	({	int __rc;						\
>  		release_sock(__sk);					\
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a40eea9..37ddece 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2014,8 +2014,10 @@ static inline void skb_orphan_try(struct sk_buff *skb)
>  		/* skb_tx_hash() wont be able to get sk.
>  		 * We copy sk_hash into skb->rxhash
>  		 */
> -		if (!skb->rxhash)
> +		if (!skb->rxhash) {
>  			skb->rxhash = sk->sk_hash;
> +			skb->l4_rxhash = 1;

Why do you set l4_rxhash here ?

> +		}
>  		skb_orphan(skb);
>  	}
>  }
> @@ -2499,12 +2501,13 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>  
>  /*
>   * __skb_get_rxhash: calculate a flow hash based on src/dst addresses
> - * and src/dst port numbers. Returns a non-zero hash number on success
> - * and 0 on failure.
> + * and src/dst port numbers.  Sets rxhash in skb to non-zero hash value
> + * on success, zero indicates no valid hash.  Also, sets l4_rxhash in skb
> + * if hash is a canonical 4-tuple hash over transport ports.
>   */
> -__u32 __skb_get_rxhash(struct sk_buff *skb)
> +void __skb_get_rxhash(struct sk_buff *skb)
>  {
> -	int nhoff, hash = 0, poff;
> +	int nhoff, hash = 0, l4hash = 0, poff;
>  	const struct ipv6hdr *ip6;
>  	const struct iphdr *ip;
>  	u8 ip_proto;
> @@ -2554,6 +2557,7 @@ __u32 __skb_get_rxhash(struct sk_buff *skb)
>  			ports.v32 = * (__force u32 *) (skb->data + nhoff);
>  			if (ports.v16[1] < ports.v16[0])
>  				swap(ports.v16[0], ports.v16[1]);
> +			l4hash = 1;

maybe directly set skb->l4_rxhash = 1 here ?

		
>  		}
>  	}
>  
> @@ -2566,7 +2570,8 @@ __u32 __skb_get_rxhash(struct sk_buff *skb)
>  		hash = 1;
>  
>  done:
> -	return hash;
> +	skb->rxhash = hash;
> +	skb->l4_rxhash = l4hash;

and remove this line.  (gcc is pretty expensive here for a one bit
field)


Thanks



^ permalink raw reply

* Re: [PATCH 3/4] rps: Infrastructure in __skb_get_rxhash for deep inspection
From: Eric Dumazet @ 2011-05-19 20:27 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1105190828001.12835@pokey.mtv.corp.google.com>

Le jeudi 19 mai 2011 à 08:39 -0700, Tom Herbert a écrit :
> Basics for looking for ports in encapsulated packets in tunnels.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  net/core/dev.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 

Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>




^ permalink raw reply

* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.
From: tsuna @ 2011-05-19 20:30 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.zimmermann, hagen, kuznet, pekkas, jmorris, yoshfuji,
	kaber, eric.dumazet, netdev, linux-kernel
In-Reply-To: <20110519.152703.1327182804872376183.davem@davemloft.net>

On Thu, May 19, 2011 at 12:27 PM, David Miller <davem@davemloft.net> wrote:
> So using SCTP as an example of "see we do this already over here" is a
> non-starter.  Don't do it.

Fair enough.  I hope that the "there's no one-size-fits-all solution"
argument has more weight than "hey SCTP does it".  :)

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

^ permalink raw reply

* GRE RPS patch set
From: David Miller @ 2011-05-19 21:14 UTC (permalink / raw)
  To: therbert; +Cc: netdev

I reverted all of it, besides Eric's feedback, I'd like to know if
grep is not working on your computer?

net/ipv6/udp.c: In function ‘udpv6_queue_rcv_skb’:
net/ipv6/udp.c:509:3: warning: passing argument 2 of ‘sock_rps_save_rxhash’ makes pointer from integer without a cast
include/net/sock.h:688:20: note: expected ‘struct sk_buff *’ but argument is of type ‘__u32’

Updating all call sites of a function when you change it's arguments
isn't rocket science. :-)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox