Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
From: Serge E. Hallyn @ 2009-10-21 17:56 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, John Dykstra, netdev
In-Reply-To: <1256072803-3518-3-git-send-email-danms@us.ibm.com>

Quoting Dan Smith (danms@us.ibm.com):
> This patch adds basic support for C/R of open INET sockets.  I think that
> all the important bits of the TCP and ICSK socket structures is saved,
> but I think there is still some additional IPv6 stuff that needs to be
> handled.
> 
> With this patch applied, the following script can be used to demonstrate
> the functionality:
> 
>   https://lists.linux-foundation.org/pipermail/containers/2009-October/021239.html
> 
> It shows that this enables migration of a sendmail process with open
> connections from one machine to another without dropping.
> 
> We still need comments from the netdev people about what sort of sanity
> checking we need to do on the values in the ckpt_hdr_socket_inet
> structure on restart.
> 
> Note that this still doesn't address lingering sockets yet.
> 
> Changes in v2:
>  - Restore saddr, rcv_saddr, daddr, sport, and dport from the sockaddr
>    structure instead of saving them separately
>  - Fix 'sock' naming in sock_cptrst()
>  - Don't take the queue lock before skb_queue_tail() since it is
>    done for us
>  - Allow "listen only" restore behavior if RESTART_SOCK_LISTENONLY
>    flag is specified on sys_restart()
>  - Pull the implementation of the list of listening sockets back into
>    this patch
>  - Fix dangling printk
>  - Add some comments around the parent/child restore logic
> 
> Cc: netdev@vger.kernel.org
> Cc: Oren Laadan <orenl@librato.com>
> Cc: John Dykstra <jdykstra72@gmail.com>
> Signed-off-by: Dan Smith <danms@us.ibm.com>

fwiw,

Acked-by: Serge Hallyn <serue@us.ibm.com>

except

> +static int sock_inet_restore_addrs(struct inet_sock *inet,
> +				   struct ckpt_hdr_socket_inet *hh)
> +{
> +	inet->daddr = hh->raddr.sin_addr.s_addr;
> +	inet->saddr = hh->laddr.sin_addr.s_addr;
> +	inet->rcv_saddr = inet->saddr;
> +
> +	inet->dport = hh->raddr.sin_port;
> +	inet->sport = hh->laddr.sin_port;

Sorry, I think we've discussed this before but can't recall - does
setting sport here allow an unpriv user to bypass CAP_NET_BIND_SERVICE?

-serge

^ permalink raw reply

* [RFC] net,socket: introduce build_sockaddr_check helper to catch overflow at build time
From: Cyrill Gorcunov @ 2009-10-21 17:07 UTC (permalink / raw)
  To: Linux-Netdev; +Cc: David Miller

Hi,

while were sneaking thru sockets code I've got the idea that we may
check for __kernel_sockaddr_storage overflow at build time. At moment
this structure is big enough and I hardly believe it could be overflowed
ever (hmm?).

Anyway just an idea which could be stupid perhaps but I decided to
put it out. An idea is that before copy protocol specific data in
socket->ops->getname implementation the driver code may put

build_sockaddr_check(sizeof(some_struct));

and be sure it doesn't overflow the hosting unit.

Feel free to just ignore this RFC, was just an idea to share.

	-- Cyrill
---
net,socket: introduce build_sockaddr_check helper to catch overflow at build time

proto_ops->getname implies copying protocol specific data
into storage unit (particulary to __kernel_sockaddr_storage).
So when one implements new protocol he either may keep this
in mind (or may not).

Lets introduce build_sockaddr_check helper which check if
storage unit is not overfowed. Note that the check is build
time and introduce no slowdown at execution time.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 include/linux/socket.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.git/include/linux/socket.h
=====================================================================
--- linux-2.6.git.orig/include/linux/socket.h
+++ linux-2.6.git/include/linux/socket.h
@@ -24,6 +24,9 @@ struct __kernel_sockaddr_storage {
 #include <linux/types.h>		/* pid_t			*/
 #include <linux/compiler.h>		/* __user			*/

+#define build_sockaddr_check(size)	\
+	BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage)))
+
 #ifdef __KERNEL__
 # ifdef CONFIG_PROC_FS
 struct seq_file;

^ permalink raw reply

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Octavian Purdila @ 2009-10-21 16:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>

On Wednesday 21 October 2009 18:40:07 you wrote:

> >
> > I would also like to see this patch in, we are running into scalability
> > issues with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put()
>  spots than can be deleted and replaced by something more scalable. I began
>  this job but others can help me.
> 

Yes, I agree with you, there are multiple places which needs to be touched to 
allow for better scaling with regard to the number of interfaces. We do have 
patches that addresses some of these issues, but unfortunately they are based 
on 2.6.7 and some of them are quite ugly hacks :) 

However, we are in the process of switching to 2.6.31 so I hope we will be 
able to contribute on this effort.

> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 

Hmm, how would multiple tasks help here? Isn't the RTNL mutex global?

> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 
> [PATCH] net: allow netdev_wait_allrefs() to run faster
> 

Thanks, I am going to test it on our platform and send back the results.

tavi

^ permalink raw reply

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Benjamin LaHaise @ 2009-10-21 16:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>

On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.

The first part I've been tackling has been the overhead in procfs, sysctl 
and sysfs.  I've got patches for some of the issues, hacks for others, and 
should have something to post in a few days.  Getting rid of those overheads 
is enough to get to decent interface creation times for the first 20 or 30k 
interfaces.

On the interface deletion side of things, within the network code, fib_hash 
has a few linear scans that really start hurting.  trie is a bit better, 
but I haven't started digging too deeply into its flush/remove overhead yet, 
aside from noticing that trie has a linear scan if 
CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.  
fn_trie_flush() is currently the worst offender during deletion.

		-ben

^ permalink raw reply

* Re: possible circular locking dependency in ISDN PPP
From: Tilman Schmidt @ 2009-10-21 16:24 UTC (permalink / raw)
  To: Xiaotian Feng; +Cc: LKML, isdn4linux, Netdev, Karsten Keil
In-Reply-To: <7b6bb4a50910182227y1281b40bj3fcc082d32cf4496@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2152 bytes --]

Thanks for your analysis. The patch you propose does indeed prevent the
"circular locking dependency" message, with no noticeable ill effect.

I cannot say why xmit_lock was taken around isdn_net_lp_busy() in the
first place. The ISDN4Linux people would be the ones to comment on that.
If none of them objects, I propose you add a Signed-off-by line to your
patch and submit it to Karsten Keil, the ISDN maintainer, for inclusion.
You can also add a "Tested-by: Tilman Schmidt <tilman@imap.cc>" line.

Thanks,
Tilman

Am 19.10.2009 07:27 schrieb Xiaotian Feng:
> So there's a circular locking dependency.. Looking into isdn_net_get_locked_lp()
[...]
> Why do we need to hold xmit_lock while using
> isdn_net_lp_busy(nd->queue) ? Can following patch fix this bug?
> 
> ---
> diff --git a/drivers/isdn/i4l/isdn_net.h b/drivers/isdn/i4l/isdn_net.h
> index 74032d0..7511f08 100644
> --- a/drivers/isdn/i4l/isdn_net.h
> +++ b/drivers/isdn/i4l/isdn_net.h
> @@ -83,19 +83,19 @@ static __inline__ isdn_net_local *
> isdn_net_get_locked_lp(isdn_net_dev *nd)
> 
>         spin_lock_irqsave(&nd->queue_lock, flags);
>         lp = nd->queue;         /* get lp on top of queue */
> -       spin_lock(&nd->queue->xmit_lock);
>         while (isdn_net_lp_busy(nd->queue)) {
> -               spin_unlock(&nd->queue->xmit_lock);
>                 nd->queue = nd->queue->next;
>                 if (nd->queue == lp) { /* not found -- should never happen */
>                         lp = NULL;
>                         goto errout;
>                 }
> -               spin_lock(&nd->queue->xmit_lock);
>         }
>         lp = nd->queue;
>         nd->queue = nd->queue->next;
> +       spin_unlock_irqrestore(&nd->queue_lock, flags);
> +       spin_lock(&lp->xmit_lock);
>         local_bh_disable();
> +       return lp;
>  errout:
>         spin_unlock_irqrestore(&nd->queue_lock, flags);
>         return lp;
> 

-- 
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 254 bytes --]

^ permalink raw reply

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 16:09 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>

Eric Dumazet a écrit :
> Octavian Purdila a écrit :
>> On Sunday 18 October 2009 21:21:44 you wrote:
>>>> The msleep(250) should be tuned first. Then if this is really necessary
>>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>>> more. 
>>>> Just try msleep(1 or 2), it should work quite well.
>>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>>  system can handle that.
>>>
>> I would also like to see this patch in, we are running into scalability issues 
>> with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put() spots than
> can be deleted and replaced by something more scalable. I began this job
> but others can help me.
> 
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 
> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 

Here are typical timings (on current kernel, but on following example
netdev_wait_allrefs() doesnt wait at all, because my netdevice has no refs)

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 down

real    0m0.021s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.022s
user    0m0.000s
sys     0m0.001s

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.001s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.036s
user    0m0.000s
sys     0m0.001s

22 ms (or 36 ms) delay are also problematic if you want to dismantle 1.000.000 netdevices at once.

^ permalink raw reply

* [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 15:40 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <200910211539.01824.opurdila@ixiacom.com>

Octavian Purdila a écrit :
> On Sunday 18 October 2009 21:21:44 you wrote:
>>> The msleep(250) should be tuned first. Then if this is really necessary
>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>> more. 
>>> Just try msleep(1 or 2), it should work quite well.
>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>  system can handle that.
>>
> 
> I would also like to see this patch in, we are running into scalability issues 
> with creating/deleting lots of interfaces as well.

Ben patch only address interface deletion, and one part of the problem,
maybe the more visible one for the current kernel.

Adding lots of interfaces only needs several threads to run concurently.

Before applying/examining his patch I suggest identifying all dev_put() spots than
can be deleted and replaced by something more scalable. I began this job
but others can help me.

RTNL and rcu grace periods are going to hurt anyway, so you probably need
to use many tasks to be able to delete lots of interfaces in parallel.

netdev_run_todo() should also use a better algorithm to allow parallelism.

Following patch doesnt slow down dev_put() users and real scalability
problems will surface and might be addressed.

[PATCH] net: allow netdev_wait_allrefs() to run faster

netdev_wait_allrefs() waits that all references to a device vanishes.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users report that no more than 4 devices can be dismantled per second,
this is a pretty serious problem for extreme setups.

Most likely, references only wait for a rcu grace period that should come
fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..fca2e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}

-		msleep(250);
+		schedule_timeout_uninterruptible(1);

 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "

^ permalink raw reply related

* Re: [net-next-2.6 PATCH 1/3] irq: Export irq_set_affinity() for drivers
From: Ben Hutchings @ 2009-10-21 15:35 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, gospo, netdev, Peter P Waskiewicz Jr
In-Reply-To: <20091021022626.32449.73883.stgit@localhost.localdomain>

On Tue, 2009-10-20 at 19:26 -0700, Jeff Kirsher wrote:
> From: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> 
> This patch allows drivers to specify an IRQ affinity mask for
> their respective interrupt sources.  This is very useful on
> network adapters using MSI-X, where aligning network flows
> linearly to CPUs greatly improves efficiency of the network
> stack.
[...]

Since this is not networking-specific, I think it needs to go to
linux-kernel as well.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Troubles with virtio network
From: Александр Дориф @ 2009-10-21 13:52 UTC (permalink / raw)
  To: davem; +Cc: netdev

Hello! I compiled Linux kernel v 2.6.32-rc5 and got this warning:

WARNING: drivers/net/virtio_net.o(.data+0xa0): Section mismatch in reference from the variable virtio_net to the function .devexit.text:virtnet_remove()
The variable virtio_net references
the function __devexit virtnet_remove()
If the reference is valid then annotate the
variable with __exit* (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console, 

I hope it'll be corrected soon.

^ permalink raw reply

* Re: [PATCH v2 2/8] Allow tcp_parse_options to consult dst entry
From: Gilad Ben-Yossef @ 2009-10-21 14:07 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Netdev, ori
In-Reply-To: <alpine.DEB.2.00.0910211559050.5304@wel-95.cs.helsinki.fi>

Hi Ilpo,


Thanks for the feedback :-)


Ilpo Järvinen wrote:

> On Wed, 21 Oct 2009, Gilad Ben-Yossef wrote:
>
>   
>> We need tcp_parse_options to be aware of dst_entry to 
>> take into account per dst_entry TCP options settings
>>
>> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
>> Sigend-off-by: Ori Finkelman <ori@comsleep.com>
>> Sigend-off-by: Yony Amit <yony@comsleep.com>
>>
>> ---
>>  include/net/tcp.h        |    3 ++-
>>  net/ipv4/syncookies.c    |   27 ++++++++++++++-------------
>>  net/ipv4/tcp_input.c     |    9 ++++++---
>>  net/ipv4/tcp_ipv4.c      |   19 ++++++++++---------
>>  net/ipv4/tcp_minisocks.c |    7 +++++--
>>  net/ipv6/syncookies.c    |   28 +++++++++++++++-------------
>>  net/ipv6/tcp_ipv6.c      |    3 ++-
>>  7 files changed, 54 insertions(+), 42 deletions(-)
>>
>>
>>     
<snip>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 7cda24b..1cb0ec4 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1256,11 +1256,18 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>>  	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
>>  #endif
>>  
>> +	ireq = inet_rsk(req);
>> +	ireq->loc_addr = daddr;
>> +	ireq->rmt_addr = saddr;
>> +	ireq->no_srccheck = inet_sk(sk)->transparent;
>> +	ireq->opt = tcp_v4_save_options(sk, skb);
>> +
>> +	dst = inet_csk_route_req(sk, req);
>>  	tcp_clear_options(&tmp_opt);
>>  	tmp_opt.mss_clamp = 536;
>>  	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
>>  
>> -	tcp_parse_options(skb, &tmp_opt, 0);
>> +	tcp_parse_options(skb, &tmp_opt, 0, dst);
>>  
>>  	if (want_cookie && !tmp_opt.saw_tstamp)
>>  		tcp_clear_options(&tmp_opt);
>> @@ -1269,14 +1276,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>>  
>>  	tcp_openreq_init(req, &tmp_opt, skb);
>>  
>> -	ireq = inet_rsk(req);
>> -	ireq->loc_addr = daddr;
>> -	ireq->rmt_addr = saddr;
>> -	ireq->no_srccheck = inet_sk(sk)->transparent;
>> -	ireq->opt = tcp_v4_save_options(sk, skb);
>> -
>>  	if (security_inet_conn_request(sk, skb, req))
>> -		goto drop_and_free;
>> +		goto drop_and_release;
>>  
>>  	if (!want_cookie)
>>  		TCP_ECN_create_request(req, tcp_hdr(skb));
>> @@ -1301,7 +1302,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>>  		 */
>>  		if (tmp_opt.saw_tstamp &&
>>  		    tcp_death_row.sysctl_tw_recycle &&
>> -		    (dst = inet_csk_route_req(sk, req)) != NULL &&
>> +		    dst != NULL &&
>>     
>
> Why you need this NULL check this here while you trap it with BUG_ON 
> elsewhere? Does your patch perhaps create a remote DoS opportunity?
>
>
>   
Indeed, I believe you are right. Good catch.

What about this (I know the patch gets eaten by Thunderbird, sorry about 
that. This is just for explaining what I want to do):

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c

index 1cb0ec4..1d611e3 100644

--- a/net/ipv4/tcp_ipv4.c

+++ b/net/ipv4/tcp_ipv4.c

@@ -1263,6 +1263,9 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)

        ireq->opt = tcp_v4_save_options(sk, skb);

 

        dst = inet_csk_route_req(sk, req);

+       if(!dst)

+               goto drop_and_free;

+

        tcp_clear_options(&tmp_opt);

        tmp_opt.mss_clamp = 536;

        tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;

@@ -1302,7 +1305,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)

                 */

                if (tmp_opt.saw_tstamp &&

                    tcp_death_row.sysctl_tw_recycle &&

-                   dst != NULL &&

                    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&

                    peer->v4daddr == saddr) {

                        if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&



My rational is that since if the connection is formed we will need to 
send a syn/ack ( call to __tcp_v4_send_synack a couple of lines below) 
and since we can't do that  if we don't have a route, this makes sense.

If this sounds sane, I'll re-spin the patch with this as a fix.

Thanks a bunch!
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"Sorry cannot parse this, its too long to be true  :)"
	  -- Eric Dumazet on netdev mailing list


^ permalink raw reply

* Re: [PATCH 1/4] Adds random ect generation to tfrc-sp sender side
From: Ivo Calado @ 2009-10-21 13:15 UTC (permalink / raw)
  To: dccp, Gerrit Renker, Ivo Calado, netdev
In-Reply-To: <425e6efa0910210449r1a10fb2cvf7650ad470d987aa@mail.gmail.com>

On Mon, Oct 19, 2009 at 2:23 AM, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
> | Adds random ect generation to tfrc-sp sender side.
>
> I thought about this and found several reasons why it would be better to
> defer ECN Nonce sums to a later implementation.
>
>  1) At the moment the code always sets ECT(0). Even if it would
>    alternate ECT(0) and ECT(1), this would later be overwritten by ECT(0)
>    in dccp_msghdr_parse(). Ok, this could be fixed, but the real problem
>    is that the underlying machinery does not support ECN nonces, since
>
>    * ECN / DiffServ information is in two separate places of the
>      inet_sock (u8 `tos' field and u8 `tclass' field of ipv6_pinfo);
>
>    * the ECN driver sits in include/net/inet_ecn.h as
>      #define INET_ECN_xmit(sk) do { inet_sk(sk)->tos |= INET_ECN_ECT_0; } while (0)
>
>    * hence this would need to be revised and the best way to make an
>      acceptable suggestion would be a coded proof of concept that
>      changing the underlying implementation does have benefits.
>
>    On the receiver side the situation is the same. The function
>    tfrc_sp_check_ecn_sum(), introduced in Patch 2/4 of the TFRC-SP sender
>    implementation is only referenced in Patch 2/2 of the CCID-4 set, where
>    it ends, without side effect in "TODO: consider ecn sum test fail".
>
>    That is, at the moment both the sender and receiver side of the ECN Nonce
>    sum verification are placeholders which currently have no effect.
>

Okay, then the implementation would be useless now.

>
>  2) As far as I can see the ECN Nonce is an optimisation, an
>
>      "optional addition to Explicit Congestion Notification [RFC3168]
>       improving its robustness against malicious or accidental
>       concealment of marked packets [...]" (from the abstract)
>
>    Hence if at all, we would only have a benefit of adding the ECN Nonce
>    verification on top of an already verified implementation.
>

Yes, not priority at all. And you're right, no benefit.

>  3) Starting an implementation throws up further questions that need to
>    be addressed, both the basis and the extension need to be verified.
>
> I would like to suggest to implement the basis, that is CCID-4 with ECN
> (using plain ECT(0)), test with that until it works satisfactorily, and
> then continue adding measures such as the ECN Nonce verification.
>

Okay. But, when would be good to at least include random ECT
generation? When DCCP ECN code will get fixed? Is there any work on
this?

> Nothing is lost, once we are at this stage we can return to this set of
> initial patches and revise the situation based on the insights gained
> with ECT(0) experience.
>
> In summary, I would like to suggest to remove the ECN verification for
> the moment and focus on the "basic" issues first.
>
> Would you be ok with that?
>

Yes, we'll keep the ECN verification code here at our git until the
scenario is ready.

>
>
> Appendix
> --------
> | +int tfrc_sp_get_random_ect(struct tfrc_tx_li_data *li_data, u64 seqn)
> | +{
> | +     int ect;
> | +     struct tfrc_ecn_echo_sum_entry *sum;
> | +
> | +     /* TODO: implement random ect*/
> | +     ect = INET_ECN_ECT_0;
> | +
> | +     sum = kmem_cache_alloc(tfrc_ecn_echo_sum_slab, GFP_ATOMIC);
>
> For a later implementation, there should be protection against NULL, e.g.
>        if (sum == NULL) {
>           DCCP_CRIT("Problem here ...");
>           return 0;
>        }
> | +
> | +     sum->previous = li_data->ecn_sums_head;
> | +     sum->ecn_echo_sum = (sum->previous->ecn_echo_sum) ? !ect : ect;
>

Thanks, i forgot that.

> (Also for later) I wonder how to do the sums, with RFC 3168
> ECT(0) = 0x2 => !0x2 = 0
> ECT(1) = 0x1 => !0x1 = 0
>

I don't understand. Can you try to explain it? Or cite RFC section
that address it?

> From the addition table in RFC 3540, section 2,
> ECT(0) + ECT(0) = 0
> ECT(0) + ECT(1) = 1
> ECT(1) + ECT(1) = 0
>
> One way could be
>        sum->ecn_echo_sum ^= (ect == INET_ECN_ECT_1);

Ok.




-- 
Ivo Augusto Andrade Rocha Calado
MSc. Candidate
Embedded Systems and Pervasive Computing Lab - http://embedded.ufcg.edu.br
Systems and Computing Department - http://www.dsc.ufcg.edu.br
Electrical Engineering and Informatics Center - http://www.ceei.ufcg.edu.br
Federal University of Campina Grande - http://www.ufcg.edu.br

PGP: 0x03422935
Quidquid latine dictum sit, altum viditur.

^ permalink raw reply

* Re: Enable syn cookies by default
From: David Miller @ 2009-10-21 13:04 UTC (permalink / raw)
  To: olafvdspek; +Cc: netdev
In-Reply-To: <b2cc26e40910210017v3885b18dre5021c8a920f30d7@mail.gmail.com>

From: Olaf van der Spek <olafvdspek@gmail.com>
Date: Wed, 21 Oct 2009 09:17:53 +0200

> Anybody?

Would please you be patient?

In case you haven't fucking noticed, all of the major kernel
developers are in Japan at the annual kernel summit and the Japan
Linux Symposium since late last week.

So nobody has the time to look into anything requiring real
long thinking like this issue does.

^ permalink raw reply

* Re: [PATCH v2 2/8] Allow tcp_parse_options to consult dst entry
From: Ilpo Järvinen @ 2009-10-21 13:03 UTC (permalink / raw)
  To: Gilad Ben-Yossef; +Cc: Netdev, ori
In-Reply-To: <1256115421-12714-3-git-send-email-gilad@codefidence.com>

On Wed, 21 Oct 2009, Gilad Ben-Yossef wrote:

> We need tcp_parse_options to be aware of dst_entry to 
> take into account per dst_entry TCP options settings
> 
> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
> Sigend-off-by: Ori Finkelman <ori@comsleep.com>
> Sigend-off-by: Yony Amit <yony@comsleep.com>
> 
> ---
>  include/net/tcp.h        |    3 ++-
>  net/ipv4/syncookies.c    |   27 ++++++++++++++-------------
>  net/ipv4/tcp_input.c     |    9 ++++++---
>  net/ipv4/tcp_ipv4.c      |   19 ++++++++++---------
>  net/ipv4/tcp_minisocks.c |    7 +++++--
>  net/ipv6/syncookies.c    |   28 +++++++++++++++-------------
>  net/ipv6/tcp_ipv6.c      |    3 ++-
>  7 files changed, 54 insertions(+), 42 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 03a49c7..740d09b 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -409,7 +409,8 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
>  
>  extern void			tcp_parse_options(struct sk_buff *skb,
>  						  struct tcp_options_received *opt_rx,
> -						  int estab);
> +						  int estab,
> +						  struct dst_entry *dst);
>  
>  extern u8			*tcp_parse_md5sig_option(struct tcphdr *th);
>  
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index a6e0e07..4990dd4 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -276,13 +276,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
>  
>  	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
>  
> -	/* check for timestamp cookie support */
> -	memset(&tcp_opt, 0, sizeof(tcp_opt));
> -	tcp_parse_options(skb, &tcp_opt, 0);
> -
> -	if (tcp_opt.saw_tstamp)
> -		cookie_check_timestamp(&tcp_opt);
> -
>  	ret = NULL;
>  	req = inet_reqsk_alloc(&tcp_request_sock_ops); /* for safety */
>  	if (!req)
> @@ -298,12 +291,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
>  	ireq->loc_addr		= ip_hdr(skb)->daddr;
>  	ireq->rmt_addr		= ip_hdr(skb)->saddr;
>  	ireq->ecn_ok		= 0;
> -	ireq->snd_wscale	= tcp_opt.snd_wscale;
> -	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
> -	ireq->sack_ok		= tcp_opt.sack_ok;
> -	ireq->wscale_ok		= tcp_opt.wscale_ok;
> -	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
> -	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
>  
>  	/* We throwed the options of the initial SYN away, so we hope
>  	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
> @@ -351,6 +338,20 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
>  		}
>  	}
>  
> +	/* check for timestamp cookie support */
> +	memset(&tcp_opt, 0, sizeof(tcp_opt));
> +	tcp_parse_options(skb, &tcp_opt, 0, &rt->u.dst);
> +
> +	if (tcp_opt.saw_tstamp)
> +		cookie_check_timestamp(&tcp_opt);
> +
> +	ireq->snd_wscale        = tcp_opt.snd_wscale;
> +	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
> +	ireq->sack_ok           = tcp_opt.sack_ok;
> +	ireq->wscale_ok         = tcp_opt.wscale_ok;
> +	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
> +	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
> +
>  	/* Try to redo what tcp_v4_send_synack did. */
>  	req->window_clamp = tp->window_clamp ? :dst_metric(&rt->u.dst, RTAX_WINDOW);
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d86784b..d502f49 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3698,12 +3698,14 @@ old_ack:
>   * the fast version below fails.
>   */
>  void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
> -		       int estab)
> +		       int estab,  struct dst_entry *dst)
>  {
>  	unsigned char *ptr;
>  	struct tcphdr *th = tcp_hdr(skb);
>  	int length = (th->doff * 4) - sizeof(struct tcphdr);
>  
> +	BUG_ON(!estab && !dst);
> +
>  	ptr = (unsigned char *)(th + 1);
>  	opt_rx->saw_tstamp = 0;
>  
> @@ -3820,7 +3822,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
>  		if (tcp_parse_aligned_timestamp(tp, th))
>  			return 1;
>  	}
> -	tcp_parse_options(skb, &tp->rx_opt, 1);
> +	tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
>  	return 1;
>  }
>  
> @@ -5364,8 +5366,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>  	int saved_clamp = tp->rx_opt.mss_clamp;
> +	struct dst_entry *dst = __sk_dst_get(sk);
>  
> -	tcp_parse_options(skb, &tp->rx_opt, 0);
> +	tcp_parse_options(skb, &tp->rx_opt, 0, dst);
>  
>  	if (th->ack) {
>  		/* rfc793:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 7cda24b..1cb0ec4 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1256,11 +1256,18 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>  	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
>  #endif
>  
> +	ireq = inet_rsk(req);
> +	ireq->loc_addr = daddr;
> +	ireq->rmt_addr = saddr;
> +	ireq->no_srccheck = inet_sk(sk)->transparent;
> +	ireq->opt = tcp_v4_save_options(sk, skb);
> +
> +	dst = inet_csk_route_req(sk, req);
>  	tcp_clear_options(&tmp_opt);
>  	tmp_opt.mss_clamp = 536;
>  	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
>  
> -	tcp_parse_options(skb, &tmp_opt, 0);
> +	tcp_parse_options(skb, &tmp_opt, 0, dst);
>  
>  	if (want_cookie && !tmp_opt.saw_tstamp)
>  		tcp_clear_options(&tmp_opt);
> @@ -1269,14 +1276,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>  
>  	tcp_openreq_init(req, &tmp_opt, skb);
>  
> -	ireq = inet_rsk(req);
> -	ireq->loc_addr = daddr;
> -	ireq->rmt_addr = saddr;
> -	ireq->no_srccheck = inet_sk(sk)->transparent;
> -	ireq->opt = tcp_v4_save_options(sk, skb);
> -
>  	if (security_inet_conn_request(sk, skb, req))
> -		goto drop_and_free;
> +		goto drop_and_release;
>  
>  	if (!want_cookie)
>  		TCP_ECN_create_request(req, tcp_hdr(skb));
> @@ -1301,7 +1302,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>  		 */
>  		if (tmp_opt.saw_tstamp &&
>  		    tcp_death_row.sysctl_tw_recycle &&
> -		    (dst = inet_csk_route_req(sk, req)) != NULL &&
> +		    dst != NULL &&

Why you need this NULL check this here while you trap it with BUG_ON 
elsewhere? Does your patch perhaps create a remote DoS opportunity?


-- 
 i.

^ permalink raw reply

* [PATCH] bonding: fix a race condition in calls to slave MII ioctls
From: Jiri Bohac @ 2009-10-21 13:03 UTC (permalink / raw)
  To: fubar; +Cc: netdev

Hi,

In mii monitor mode, bond_check_dev_link() calls the the ioctl
handler of slave devices. It stores the ndo_do_ioctl function
pointer to a static (!) ioctl variable and later uses it to call the
handler with the IOCTL macro.

If another thread executes bond_check_dev_link() at the same time
(even with a different bond, which none of the locks prevent), a
race condition occurs. If the two racing slaves have different
drivers, this may result in one driver's ioctl handler being
called with a pointer to a net_device controlled with a different
driver, resulting in unpredictable breakage.

Unless I am overlooking something, the "static" must be a
copy'n'paste error (?).

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 69c5b15..5bfdd0c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -691,7 +691,7 @@ static int bond_check_dev_link(struct bonding *bond,
 			       struct net_device *slave_dev, int reporting)
 {
 	const struct net_device_ops *slave_ops = slave_dev->netdev_ops;
-	static int (*ioctl)(struct net_device *, struct ifreq *, int);
+	int (*ioctl)(struct net_device *, struct ifreq *, int);
 	struct ifreq ifr;
 	struct mii_ioctl_data *mii;

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply related

* Re: [PATCH net-next V2 1/3] iwmc3200top: Add Intel Wireless MultiCom  3200 top driver.
From: Marcel Holtmann @ 2009-10-21 12:52 UTC (permalink / raw)
  To: Tomas Winkler
  Cc: David Miller, linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <1ba2fa240910200453k4638dc6cm8c8911353ea85b60@mail.gmail.com>

Hi Tomas,

> >> This patch adds Intel Wireless MultiCom 3200 top driver.
> >> IWMC3200 is 4Wireless Com CHIP (GPS/BT/WiFi/WiMAX).
> >> Top driver is responsible for device initialization and firmware download.
> >> Firmware handled by top is responsible for top itself and
> >> as well as bluetooth and GPS coms. (Wifi and WiMax provide their own firmware)
> >> In addition top driver is used to retrieve firmware logs
> >> and supports other debugging features
> >>
> >> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
> >
> > Applied to net-next-2.6
> 
> Thanks Dave
> 
> Marcel
> I want to send out now the BT driver, would like the patch against
> bluetooth-next-2.6.git, then I wait till you sync or can you also pick
> it from net-next if Dave is OK with that?

send it against net-next since you are working in drivers/bluetooth/, we
will have not conflicts.

Regards

Marcel



^ permalink raw reply

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Octavian Purdila @ 2009-10-21 12:39 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, netdev, Cosmin Ratiu
In-Reply-To: <20091018182144.GC23395@kvack.org>

On Sunday 18 October 2009 21:21:44 you wrote:
> > The msleep(250) should be tuned first. Then if this is really necessary
> > to dismantle 100.000 netdevices per second, we might have to think a bit
> > more. 
> > Just try msleep(1 or 2), it should work quite well.
> 
> My goal is tearing down 100,000 interfaces in a few seconds, which really
>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>  system can handle that.
> 

I would also like to see this patch in, we are running into scalability issues 
with creating/deleting lots of interfaces as well.

Thanks,
tavi

^ permalink raw reply

* Re: [patch 0/3] KS8851 updates for -rc5
From: Ben Dooks @ 2009-10-21 10:44 UTC (permalink / raw)
  To: David Miller; +Cc: ben, Ping.Doong, netdev, Charles.Li
In-Reply-To: <20091020.191204.183462320.davem@davemloft.net>

On Tue, Oct 20, 2009 at 07:12:04PM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Tue, 20 Oct 2009 19:11:42 -0700 (PDT)

> > All applied to net-next-2.6, thanks.
> 
> Sorry, I meant 'net-2.6'

thanks.

^ permalink raw reply

* [PATCH] e1000: the power down when running ifdown command
From: Naohiro Ooiwa @ 2009-10-21  9:52 UTC (permalink / raw)
  To: netdev@vger.kernel.org, e1000-devel, svaidy, linux-kernel-owner

Hi all

I'm trying to modify e1000 driver for power saving.

The e1000 driver doesn't let the power down when running ifdown command.
So, I set to the D3hot state of a PCI device at the end of e1000_close().

With this modification, e1000 driver reduces power by ifdown
to the same level as link-down case.

I spoke it in Collaboration Summit 2009.
For details and result of power measurement,
please refer the 36-38 page of following document.
http://events.linuxfoundation.org/archive/lfcs09_ooiwa.pdf

Could you please check the my patch ?

Should I consider WOL ?
Dosen't E1000_PCI_POWER_SAVE need ?


Hi Vaidy

I was glad that you talked to me in Collaboration Summit.
I'm sorry for sending the patch so late.

I found a bug in the patch which I used in Collaboration Summit.
Sometimes, it's don't work to auto-negotiation by repeated ifup and ifdown.

I fixed it. I'd appreciate it if you could test.


Thanks you.
Naohiro Ooiwa

Signed-off-by: Naohiro Ooiwa <nooiwa@miraclelinux.com>
---
 drivers/net/e1000/e1000_main.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index bcd192c..12e1a42 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -26,6 +26,11 @@

 *******************************************************************************/

+/*
+ * define this if you want pci save power while ifdown.
+ */
+#define E1000_PCI_POWER_SAVE
+
 #include "e1000.h"
 #include <net/ip6_checksum.h>

@@ -1248,6 +1253,23 @@ static int e1000_open(struct net_device *netdev)
 	struct e1000_hw *hw = &adapter->hw;
 	int err;

+#ifdef E1000_PCI_POWER_SAVE
+	struct pci_dev *pdev = adapter->pdev;
+
+	pci_set_power_state(pdev, PCI_D0);
+	pci_restore_state(pdev);
+
+	if (adapter->need_ioport)
+		err = pci_enable_device(pdev);
+	else
+		err = pci_enable_device_mem(pdev);
+	if (err) {
+		printk(KERN_ERR "e1000: Cannot enable PCI device from power-save\n");
+		return err;
+	}
+	pci_set_master(pdev);
+#endif
+
 	/* disallow open during test */
 	if (test_bit(__E1000_TESTING, &adapter->flags))
 		return -EBUSY;
@@ -1265,6 +1287,7 @@ static int e1000_open(struct net_device *netdev)
 		goto err_setup_rx;

 	e1000_power_up_phy(adapter);
+	e1000_reset(adapter);

 	adapter->mng_vlan_id = E1000_MNG_VLAN_NONE;
 	if ((hw->mng_cookie.status &
@@ -1341,6 +1364,15 @@ static int e1000_close(struct net_device *netdev)
 		e1000_vlan_rx_kill_vid(netdev, adapter->mng_vlan_id);
 	}

+#ifdef E1000_PCI_POWER_SAVE
+#ifdef CONFIG_PM
+	pci_save_state(adapter->pdev);
+#endif
+	pci_disable_device(adapter->pdev);
+	pci_wake_from_d3(adapter->pdev, true);
+	pci_set_power_state(adapter->pdev, PCI_D3hot);
+#endif
+
 	return 0;
 }

-- 
1.5.4.1


^ permalink raw reply related

* Re: [PATCH v2 8/8] Document future removal of sysctl_tcp_* options
From: Gilad Ben-Yossef @ 2009-10-21 10:23 UTC (permalink / raw)
  To: William Allen Simpson; +Cc: netdev, ori
In-Reply-To: <4ADED6FA.2030502@gmail.com>

William Allen Simpson wrote:

> Gilad Ben-Yossef wrote:
>> +What:    sysctl_tcp_sack, sysctl_tcp_timestamps, 
>> sysctl_tcp_window_scaling,
>> +    sysctl_tcp_dsack
>
> Opposed to removing, as this is a common configuration for *BSD.  It 
> helps
> operators to have similar utility across systems.
>
> net.inet.tcp.sack.enable
>
> net.inet.tcp.timestamps
>
> net.inet.tcp.win_scale
>
> Support removing sysctl_tcp_dsack, as this appears to be Linux-only.

I have no issue with leaving those, if everyone thinks we're better off.

BTW, while we're talking about OS envy, I do believe that Windows do let
you specify on a per route basis. Not that this is really a good ground for
technical decision, but still... :-)

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"Sorry cannot parse this, its too long to be true  :)"
	  -- Eric Dumazet on netdev mailing list

^ permalink raw reply

* Re: Enable syn cookies by default
From: Olaf van der Spek @ 2009-10-21 10:10 UTC (permalink / raw)
  To: William Allen Simpson; +Cc: netdev
In-Reply-To: <4ADED186.3040300@gmail.com>

On Wed, Oct 21, 2009 at 11:16 AM, William Allen Simpson
<william.allen.simpson@gmail.com> wrote:
> Keep in mind that I'm busy trying to replace syncookies with real cookies,
> so I'm biased.  The syncookies interfere with new options; although in
> Linux, they interfere less than other systems.

How and when do they interfere?
If syn cookies are enabled and the queue isn't full, they're not used
so they don't interfere.
If the queue is full, they do interfere, but the alternative would be
no connection at all.
So I really don't see the disadvantage of enabling cookies by default.

> As Ubuntu is debian based, perhaps they can back-port the Ubuntu changes?

Actually changing the value isn't the problem, but the Debian
maintainer isn't sure it's a good idea (but he doesn't know why).

Olaf

^ permalink raw reply

* Re: [PATCH v2 1/8] Only parse time stamp TCP option in time wait sock
From: Gilad Ben-Yossef @ 2009-10-21 10:07 UTC (permalink / raw)
  To: William Allen Simpson; +Cc: netdev
In-Reply-To: <4ADED915.7000107@gmail.com>

William Allen Simpson wrote:

> Gilad Ben-Yossef wrote:
>> A time wait socket is established - we already know if time stamp
>> option is called for or not.
>>
> Not so sure about this.  A timewait sock isn't actually established,
> and new/changed options could appear.  There's all sorts of edge cases.
If you examine the specific context where tcp_parse_options is being 
called here,
the only TCP option which is of interest is the time stamp option, and 
this code path
is only being taken when we already know that the original socket  had
used the time stamp option.

So while I agree that in general you are right, I do believe that in the 
specific context
of this patch we should call tcp_parse_options with the established flag 
on and let it
know we are expecting to see a time stamp option, which is what I was 
referring to.

>
> There's also some current work to note:
>
>  http://tools.ietf.org/html/draft-ietf-tcpm-1323bis
>
>  http://tools.ietf.org/html/draft-gont-tcpm-tcp-timestamps

Very interesting, thank you.

As I noted above, my comment about
TIME WAIT sockets being "established" should really only be considered
in the context of the specific call to tcp_parse_options() and the 
"established"
parameter of that function.

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"Sorry cannot parse this, its too long to be true  :)"
	  -- Eric Dumazet on netdev mailing list

^ permalink raw reply

* Re: [PATCH v2 1/8] Only parse time stamp TCP option in time wait sock
From: William Allen Simpson @ 2009-10-21  9:49 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1256115421-12714-2-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef wrote:
> A time wait socket is established - we already know if time stamp
> option is called for or not.
> 
Not so sure about this.  A timewait sock isn't actually established,
and new/changed options could appear.  There's all sorts of edge cases.

There's also some current work to note:

  http://tools.ietf.org/html/draft-ietf-tcpm-1323bis

  http://tools.ietf.org/html/draft-gont-tcpm-tcp-timestamps

^ permalink raw reply

* Re: [PATCH v2 8/8] Document future removal of sysctl_tcp_* options
From: William Allen Simpson @ 2009-10-21  9:40 UTC (permalink / raw)
  To: Gilad Ben-Yossef; +Cc: netdev, ori
In-Reply-To: <1256115421-12714-9-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef wrote:
> +What:	sysctl_tcp_sack, sysctl_tcp_timestamps, sysctl_tcp_window_scaling,
> +	sysctl_tcp_dsack

Opposed to removing, as this is a common configuration for *BSD.  It helps
operators to have similar utility across systems.

net.inet.tcp.sack.enable

net.inet.tcp.timestamps

net.inet.tcp.win_scale

Support removing sysctl_tcp_dsack, as this appears to be Linux-only.

^ permalink raw reply

* Re: Enable syn cookies by default
From: William Allen Simpson @ 2009-10-21  9:16 UTC (permalink / raw)
  To: netdev
In-Reply-To: <b2cc26e40910210048y43bdb604pcd356376a93c41e@mail.gmail.com>

Olaf van der Spek wrote:
> On Wed, Oct 21, 2009 at 9:25 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> This is a user selectable setting. What's wrong with /etc/sysctl.conf ?
> 
> It requires user action...
> Often you notice cookies are disabled only after a service becomes unreachable.
> What's wrong with improving defaults?

I've not been a regular contributor here, so I'm not sure that my view has
much weight, but I'm *against* changing the coded default.

Keep in mind that I'm busy trying to replace syncookies with real cookies,
so I'm biased.  The syncookies interfere with new options; although in
Linux, they interfere less than other systems.

For Ubuntu, the practice is complicated.  In /etc/sysctl.conf, the text
assumes that the default is off:

# Uncomment the next line to enable TCP/IP SYN cookies
# This disables TCP Window Scaling (http://lkml.org/lkml/2008/2/5/167),
# and is not recommended.
#net.ipv4.tcp_syncookies=1

But in the default installed /etc/sysctl.d/10-network-security.conf, it
is explicitly on in any case:

# Turn on SYN-flood protections.  Starting with 2.6.26, there is no loss
# of TCP functionality/features under normal conditions.  When flood
# protections kick in under high unanswered-SYN load, the system
# should remain more stable, with a trade off of some loss of TCP
# functionality/features (e.g. TCP Window scaling).
net.ipv4.tcp_syncookies=1

As Ubuntu is debian based, perhaps they can back-port the Ubuntu changes?

> Don't forget the missing log entries.
> 
On this I agree.  I'd like the system to syslog it's under attack,
especially whenever syncookies are off.

^ permalink raw reply

* [PATCH v2 5/8] Allow disabling TCP timestamp options per route
From: Gilad Ben-Yossef @ 2009-10-21  8:56 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256115421-12714-1-git-send-email-gilad@codefidence.com>

Implement querying and acting upon the no timestamp bit in the feature 
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>

---
 include/linux/rtnetlink.h |    2 +-
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    8 ++++++--
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 9c802a6..2ab8c75 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -378,7 +378,7 @@ enum
 
 #define RTAX_FEATURE_ECN	0x00000001
 #define RTAX_FEATURE_NO_SACK	0x00000002
-#define RTAX_FEATURE_TIMESTAMP	0x00000004
+#define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 
 struct rta_session
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b14f780..d2f9742 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3755,7 +3755,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 			case TCPOPT_TIMESTAMP:
 				if ((opsize == TCPOLEN_TIMESTAMP) &&
 				    ((estab && opt_rx->tstamp_ok) ||
-				     (!estab && sysctl_tcp_timestamps))) {
+				     (!estab && sysctl_tcp_timestamps &&
+				      !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP)))) {
 					opt_rx->saw_tstamp = 1;
 					opt_rx->rcv_tsval = get_unaligned_be32(ptr);
 					opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 64db8dd..8f30c18 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -488,7 +488,9 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	opts->mss = tcp_advertise_mss(sk);
 	size += TCPOLEN_MSS_ALIGNED;
 
-	if (likely(sysctl_tcp_timestamps && *md5 == NULL)) {
+	if (likely(sysctl_tcp_timestamps &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP) &&
+		   *md5 == NULL)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = TCP_SKB_CB(skb)->when;
 		opts->tsecr = tp->rx_opt.ts_recent;
@@ -2317,7 +2319,9 @@ static void tcp_connect_init(struct sock *sk)
 	 * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
 	 */
 	tp->tcp_header_len = sizeof(struct tcphdr) +
-		(sysctl_tcp_timestamps ? TCPOLEN_TSTAMP_ALIGNED : 0);
+		(sysctl_tcp_timestamps &&
+		(!dst_feature(dst, RTAX_FEATURE_NO_TSTAMP) ?
+		  TCPOLEN_TSTAMP_ALIGNED : 0));
 
 #ifdef CONFIG_TCP_MD5SIG
 	if (tp->af_specific->md5_lookup(sk, sk) != NULL)
-- 
1.5.6.3


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox