Netdev List


* [PATCHv4 1/7] Only parse time stamp TCP option in time wait sock
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Since we only use tcp_parse_options here to check for the existence
of TCP timestamp option in the header, it is better to call with
the "established" flag on.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
---
 net/ipv4/tcp_minisocks.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 624c3c9..8c8c6e6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -100,9 +100,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	int paws_reject = 0;
 
-	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tmp_opt.tstamp_ok = 1;
+		tcp_parse_options(skb, &tmp_opt, 1);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
-- 
1.5.6.3



* Re: [net-next-2.6 PATCH v4 3/3] TCPCT part 1c: initial SYN exchange with SYNACK data
From: Eric Dumazet @ 2009-10-28 14:17 UTC
  To: William Allen Simpson; +Cc: Linux Kernel Network Developers
In-Reply-To: <4AE6E7C0.2050408@gmail.com>

William Allen Simpson a écrit :
> This is a significantly revised implementation of an earlier (year-old)
> patch that no longer applies cleanly, with permission of the original
> author (Adam Langley).  That patch was previously reviewed:
> 
>    http://thread.gmane.org/gmane.linux.network/102586
> 
> The principal difference is using a TCP option to carry the cookie nonce,
> instead of a user configured offset in the data.  This is more flexible and
> less subject to user configuration error.  Such a cookie option has been
> suggested for many years, and is also useful without SYN data, allowing
> several related concepts to use the same extension option.
> 
>    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
>    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

Sorry, this link might be interesting to you, but I found nothing in it that
explains your patches.

> 
>    "Re: what a new TCP header might look like", May 12, 1998.
>    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

Same here....

> 
> Data structures are carefully composed to require minimal additions.
> For example, the struct tcp_options_received cookie_plus variable fits
> between existing 16-bit and 8-bit variables, requiring no additional
> space (taking alignment into consideration).  There are no additions to
> tcp_request_sock, and only 1 pointer and 1 flag byte in tcp_sock.
> 
> Allocations have been rearranged to avoid requiring GFP_ATOMIC, with
> only one unavoidable exception in tcp_create_openreq_child(), where the
> tcp_sock itself is created GFP_ATOMIC.
> 
> These functions will also be used in subsequent patches that implement
> additional features.
> 
> Requires:
>   TCPCT part 1a: add request_values parameter for sending SYNACK
>   TCPCT part 1b: sysctl_tcp_cookie_size, socket option
> TCP_COOKIE_TRANSACTIONS, functions
> 
> Signed-off-by: William.Allen.Simpson@gmail.com

I tried to find an RFC or document about this stuff and failed.

Before reading implementation code, I like to have English text that describes
the new concept/design.

(BTW I found http://ttcplinux.sourceforge.net/theses/ETTCP.pdf and found it interesting,
I wonder what happened to this)



* Re: [PATCH 3/3] net: TCP thin dupack
From: Ilpo Järvinen @ 2009-10-28 14:17 UTC
  To: Andreas Petlund; +Cc: Netdev, LKML, shemminger, David Miller
In-Reply-To: <4AE7207D.8090402@simula.no>

On Tue, 27 Oct 2009, Andreas Petlund wrote:

> This patch enables fast retransmissions after one dupACK for TCP if the 
> stream is identified as thin. This will reduce latencies for thin 
> streams that are not able to trigger fast retransmissions due to high 
> packet interarrival time. This mechanism is only active if enabled by 
> iocontrol or syscontrol and the stream is identified as thin. 
> 
> 
> Signed-off-by: Andreas Petlund <apetlund@simula.no>
> ---
>  include/linux/tcp.h        |    4 +++-
>  include/net/tcp.h          |    1 +
>  net/ipv4/sysctl_net_ipv4.c |    8 ++++++++
>  net/ipv4/tcp.c             |    5 +++++
>  net/ipv4/tcp_input.c       |    8 ++++++++
>  5 files changed, 25 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index e64368d..f4a05ff 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -97,6 +97,7 @@ enum {
>  #define TCP_CONGESTION		13	/* Congestion control algorithm */
>  #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
>  #define TCP_THIN_RM_EXPB        15      /* Remove exp. backoff for thin streams*/
> +#define TCP_THIN_DUPACK         16      /* Fast retrans. after 1 dupack */
>  
>  #define TCPI_OPT_TIMESTAMPS	1
>  #define TCPI_OPT_SACK		2
> @@ -301,7 +302,8 @@ struct tcp_sock {
>  	u8	frto_counter;	/* Number of new acks after RTO */
>  	u8	nonagle;	/* Disable Nagle algorithm?             */
>  	u8      thin_rm_expb:1, /* Remove exp. backoff for thin streams */
> -		thin_undef : 7;
> +		thin_dupack : 1,/* Fast retransmit on first dupack      */
> +		thin_undef : 6;
>  
>  /* RTT measurement */
>  	u32	srtt;		/* smoothed round trip time << 3	*/
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 412c1bd..41f3a5e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -238,6 +238,7 @@ extern int sysctl_tcp_workaround_signed_windows;
>  extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_max_ssthresh;
>  extern int sysctl_tcp_force_thin_rm_expb;
> +extern int sysctl_tcp_force_thin_dupack;
>  
>  extern atomic_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 7458f37..8653867 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -721,6 +721,14 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler   = proc_dointvec
>  	},
>  	{
> +		.ctl_name       = CTL_UNNUMBERED,
> +		.procname       = "tcp_force_thin_dupack",
> +		.data           = &sysctl_tcp_force_thin_dupack,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec
> +	},
> +	{
>  		.ctl_name	= CTL_UNNUMBERED,
>  		.procname	= "udp_mem",
>  		.data		= &sysctl_udp_mem,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index b4b0931..de190db 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2139,6 +2139,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  			tp->thin_rm_expb = 1;
>  		break;
>  
> +	case TCP_THIN_DUPACK:
> +		if (val)
> +			tp->thin_dupack = 1;
> +		break;
> +
>  	case TCP_CORK:
>  		/* When set indicates to always queue non-full frames.
>  		 * Later the user clears this option and we transmit
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d86784b..b71eb89 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -89,6 +89,8 @@ int sysctl_tcp_frto __read_mostly = 2;
>  int sysctl_tcp_frto_response __read_mostly;
>  int sysctl_tcp_nometrics_save __read_mostly;
>  
> +int sysctl_tcp_force_thin_dupack __read_mostly;
> +
>  int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>  int sysctl_tcp_abc __read_mostly;
>  
> @@ -2447,6 +2449,12 @@ static int tcp_time_to_recover(struct sock *sk)
>  		return 1;
>  	}
>  
> +	/* If a thin stream is detected, retransmit after first
> +	 * received dupack */
> +	if ((tp->thin_dupack || sysctl_tcp_force_thin_dupack) &&
> +	    tcp_dupack_heurestics(tp) > 1 && tcp_stream_is_thin(tp))
> +		return 1;
> +
>  	return 0;
>  }

Have you tested it? ...I doubt this will work like you say and retransmit 
something when the window is small. ...Besides, you should have built this 
patch on top of the function rename you submitted earlier as after DaveM 
applied that this will no longer even compile...
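
For readers following the thread, the check quoted above can be modelled in
userspace roughly as follows. This is only a sketch, not the kernel code: the
struct, the function names, and the "fewer than 4 segments in flight" limit
are assumptions made for illustration, since the real tcp_stream_is_thin()
comes from an earlier patch in the series that is not quoted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of per-connection state; this is not struct tcp_sock. */
struct model_conn {
	unsigned int packets_in_flight;	/* segments sent but not yet acked */
	unsigned int dupacks;		/* duplicate ACKs seen so far */
	bool thin_dupack;		/* per-socket opt-in (models TCP_THIN_DUPACK) */
};

/* Assumed limit: with fewer than 4 segments in flight and one of them
 * lost, at most 2 dupACKs can come back, so the classic threshold of 3
 * is never reached. */
#define MODEL_THIN_LIMIT	4
#define MODEL_DUPACK_THRESHOLD	3

static bool model_stream_is_thin(const struct model_conn *c)
{
	return c->packets_in_flight < MODEL_THIN_LIMIT;
}

/* Returns true when the model would start fast retransmit. */
static bool model_time_to_recover(const struct model_conn *c, bool sysctl_force)
{
	/* Classic rule: three duplicate ACKs. */
	if (c->dupacks >= MODEL_DUPACK_THRESHOLD)
		return true;

	/* Proposed rule: a single dupACK is enough, but only when the
	 * feature is enabled and the stream is currently thin. */
	if ((c->thin_dupack || sysctl_force) &&
	    c->dupacks >= 1 && model_stream_is_thin(c))
		return true;

	return false;
}
```

The model says nothing about whether the sender actually has something it is
allowed to retransmit at that point, which is the concern raised in the reply.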

-- 
 i.


* Re: [PATCH 2/3] net: TCP thin linear timeouts
From: Ilpo Järvinen @ 2009-10-28 14:18 UTC
  To: Andreas Petlund; +Cc: Netdev, LKML, shemminger, David Miller
In-Reply-To: <4AE72079.4030504@simula.no>

On Tue, 27 Oct 2009, Andreas Petlund wrote:

> This patch will make TCP use only linear timeouts if the stream is thin.
> This will help to avoid the very high latencies that thin streams suffer
> because of exponential backoff. This mechanism is only active if enabled
> by iocontrol or syscontrol and the stream is identified as thin.
> 
> 
> Signed-off-by: Andreas Petlund <apetlund@simula.no>
> ---
>  include/linux/tcp.h        |    3 +++
>  include/net/tcp.h          |    1 +
>  net/ipv4/sysctl_net_ipv4.c |    8 ++++++++
>  net/ipv4/tcp.c             |    5 +++++
>  net/ipv4/tcp_timer.c       |   17 ++++++++++++++++-
>  5 files changed, 33 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 61723a7..e64368d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -96,6 +96,7 @@ enum {
>  #define TCP_QUICKACK		12	/* Block/reenable quick acks */
>  #define TCP_CONGESTION		13	/* Congestion control algorithm */
>  #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
> +#define TCP_THIN_RM_EXPB        15      /* Remove exp. backoff for thin streams*/
>  
>  #define TCPI_OPT_TIMESTAMPS	1
>  #define TCPI_OPT_SACK		2
> @@ -299,6 +300,8 @@ struct tcp_sock {
>  	u16	advmss;		/* Advertised MSS			*/
>  	u8	frto_counter;	/* Number of new acks after RTO */
>  	u8	nonagle;	/* Disable Nagle algorithm?             */
> +	u8      thin_rm_expb:1, /* Remove exp. backoff for thin streams */
> +		thin_undef : 7;
>  
>  /* RTT measurement */
>  	u32	srtt;		/* smoothed round trip time << 3	*/
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7c4482f..412c1bd 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -237,6 +237,7 @@ extern int sysctl_tcp_base_mss;
>  extern int sysctl_tcp_workaround_signed_windows;
>  extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_max_ssthresh;
> +extern int sysctl_tcp_force_thin_rm_expb;
>  
>  extern atomic_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 2dcf04d..7458f37 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -713,6 +713,14 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.ctl_name       = CTL_UNNUMBERED,
> +		.procname       = "tcp_force_thin_rm_expb",
> +		.data           = &sysctl_tcp_force_thin_rm_expb,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec
> +	},
> +	{
>  		.ctl_name	= CTL_UNNUMBERED,
>  		.procname	= "udp_mem",
>  		.data		= &sysctl_udp_mem,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 90b2e06..b4b0931 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2134,6 +2134,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  		}
>  		break;
>  
> +	case TCP_THIN_RM_EXPB:
> +		if (val)
> +			tp->thin_rm_expb = 1;
> +		break;
> +
>  	case TCP_CORK:
>  		/* When set indicates to always queue non-full frames.
>  		 * Later the user clears this option and we transmit
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index cdb2ca7..24d6dc3 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
>  int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
>  int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
>  int sysctl_tcp_orphan_retries __read_mostly;
> +int sysctl_tcp_force_thin_rm_expb __read_mostly;
>  
>  static void tcp_write_timer(unsigned long);
>  static void tcp_delack_timer(unsigned long);
> @@ -386,7 +387,21 @@ void tcp_retransmit_timer(struct sock *sk)
>  	icsk->icsk_retransmits++;
>  
>  out_reset_timer:
> -	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> +	if ((tp->thin_rm_expb || sysctl_tcp_force_thin_rm_expb) &&
> +	    tcp_stream_is_thin(tp) && sk->sk_state == TCP_ESTABLISHED) {
> +		/* If stream is thin, remove exponential backoff.
> +		 * Since 'icsk_backoff' is used to reset timer, set to 0
> +		 * Recalculate 'icsk_rto' as this might be increased if
> +		 * stream oscillates between thin and thick, thus the old
> +		 * value might already be too high compared to the value
> +		 * set by 'tcp_set_rto' in tcp_input.c which resets the
> +		 * rto without backoff. */
> +		icsk->icsk_backoff = 0;
> +		icsk->icsk_rto = min(((tp->srtt >> 3) + tp->rttvar), TCP_RTO_MAX);

The first part is nowadays done with __tcp_set_rto(tp).
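
As a small illustration of the arithmetic in the quoted line, here is a
userspace sketch of the base RTO computation. It assumes, as the struct
comment quoted above states, that srtt is stored shifted left by 3. The
function name and the millisecond cap are hypothetical; the kernel expresses
TCP_RTO_MAX in jiffies.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cap for this sketch, in milliseconds. */
#define MODEL_RTO_MAX_MS 120000u

/* srtt_x8: smoothed RTT stored as (value << 3), as in the quoted struct.
 * rttvar:  RTT variance term, already in the same unit as the result.
 * Returns the un-backed-off RTO, clamped to the assumed maximum. */
static uint32_t model_base_rto(uint32_t srtt_x8, uint32_t rttvar)
{
	uint32_t rto = (srtt_x8 >> 3) + rttvar;

	return rto < MODEL_RTO_MAX_MS ? rto : MODEL_RTO_MAX_MS;
}
```

With a smoothed RTT of 100 ms (stored as 800) and a variance term of 50 ms,
the sketch yields 150 ms, with no backoff multiplier applied.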

-- 
 i.


* Re: [PATCHv4 0/7] Per route TCP options support kill switches
From: Eric Dumazet @ 2009-10-28 14:22 UTC
  To: Gilad Ben-Yossef; +Cc: netdev, ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef a écrit :
> Allow selectively turning off support for specific TCP options
> on a per route basis.
> 
> One normally wants to disable SACK, DSACK, time stamp or window
> scale if one got a piece of broken networking equipment somewhere
> as a stop gap until you can bring a big enough hammer to deal with
> the broken network equipment. It doesn't make sense to "punish" the
> entire connections going through the machine to destinations not
> related to the broken equipment.
> 
> This is doubly true when one is dealing with network containers
> used to isolate several virtual domains.
> 
> Per route options implemented in free bits in the features route
> entry property, which in some cases were reserved by name for these
> options, so this does not inflate any structure.
> 
> Global sysctls for these options are still preserved and retain 
> the exact original meaning (e.g. you have to have both the global 
> sysctl turned on and not turn off the TCP option parsing in the
> specific route to have it processed).
> 
> It is not possible to turn off globally an option but turn it on
> per route, so as to not subtly change the meaning of current
> established sysctls (and this is a rare need anyway).
> 
> Tested on x86 using Qemu/KVM.
> 
> Working but crude matching patch to iproute2 sent earlier to the list.
> 
> Patchset based on original work by Ori Finkelman and Yony Amit
> from ComSleep Ltd.
> 
> The author wishes to thank Eric Dumazet, William Allen Simpson, 
> Bill Fink and Ilpo Jarvinen for their feedback.
> 
> 
> Gilad Ben-Yossef (7):
>   Only parse time stamp TCP option in time wait sock
>   Allow tcp_parse_options to consult dst entry
>   Add dst_feature to query route entry features
>   Add the no SACK route option feature
>   Allow disabling TCP timestamp options per route
>   Allow to turn off TCP window scale opt per route
>   Allow disabling of DSACK TCP option per route
> 
>  include/linux/rtnetlink.h |    6 ++++--
>  include/net/dst.h         |    8 +++++++-
>  include/net/tcp.h         |    3 ++-
>  net/ipv4/syncookies.c     |   27 ++++++++++++++-------------
>  net/ipv4/tcp_input.c      |   26 ++++++++++++++++++--------
>  net/ipv4/tcp_ipv4.c       |   21 ++++++++++++---------
>  net/ipv4/tcp_minisocks.c  |    9 ++++++---
>  net/ipv4/tcp_output.c     |   18 +++++++++++++-----
>  net/ipv6/syncookies.c     |   28 +++++++++++++++-------------
>  net/ipv6/tcp_ipv6.c       |    3 ++-
>  10 files changed, 93 insertions(+), 56 deletions(-)
> 

I am a bit lost. What exactly changed in this new version, versus v3 ?



* Re: [PATCHv4 0/7] Per route TCP options support kill switches
From: Gilad Ben-Yossef @ 2009-10-28 14:31 UTC
  To: Eric Dumazet; +Cc: netdev, ori
In-Reply-To: <4AE853A5.3060804@gmail.com>

Eric Dumazet wrote:

> Gilad Ben-Yossef a écrit :
>   
>> Allow selectively turning off support for specific TCP options
>> on a per route basis.
>>
>> One normally wants to disable SACK, DSACK, time stamp or window
>> scale if one got a piece of broken networking equipment somewhere
>> as a stop gap until you can bring a big enough hammer to deal with
>> the broken network equipment. It doesn't make sense to "punish" the
>> entire connections going through the machine to destinations not
>> related to the broken equipment.
>>
>> This is doubly true when one is dealing with network containers
>> used to isolate several virtual domains.
>>
>> Per route options implemented in free bits in the features route
>> entry property, which in some cases were reserved by name for these
>> options, so this does not inflate any structure.
>>
>> Global sysctls for these options are still preserved and retain 
>> the exact original meaning (e.g. you have to have both the global 
>> sysctl turned on and not turn off the TCP option parsing in the
>> specific route to have it processed).
>>
>> It is not possible to turn off globally an option but turn it on
>> per route, so as to not subtly change the meaning of current
>> established sysctls (and this is a rare need anyway).
>>
>> Tested on x86 using Qemu/KVM.
>>
>> Working but crude matching patch to iproute2 sent earlier to the list.
>>
>> Patchset based on original work by Ori Finkelman and Yony Amit
>> from ComSleep Ltd.
>>
>> The author wishes to thank Eric Dumazet, William Allen Simpson, 
>> Bill Fink and Ilpo Jarvinen for their feedback.
>>
>>
>> Gilad Ben-Yossef (7):
>>   Only parse time stamp TCP option in time wait sock
>>   Allow tcp_parse_options to consult dst entry
>>   Add dst_feature to query route entry features
>>   Add the no SACK route option feature
>>   Allow disabling TCP timestamp options per route
>>   Allow to turn off TCP window scale opt per route
>>   Allow disabling of DSACK TCP option per route
>>
>>  include/linux/rtnetlink.h |    6 ++++--
>>  include/net/dst.h         |    8 +++++++-
>>  include/net/tcp.h         |    3 ++-
>>  net/ipv4/syncookies.c     |   27 ++++++++++++++-------------
>>  net/ipv4/tcp_input.c      |   26 ++++++++++++++++++--------
>>  net/ipv4/tcp_ipv4.c       |   21 ++++++++++++---------
>>  net/ipv4/tcp_minisocks.c  |    9 ++++++---
>>  net/ipv4/tcp_output.c     |   18 +++++++++++++-----
>>  net/ipv6/syncookies.c     |   28 +++++++++++++++-------------
>>  net/ipv6/tcp_ipv6.c       |    3 ++-
>>  10 files changed, 93 insertions(+), 56 deletions(-)
>>
>>     
>
> I am a bit lost. What exactly changed in this new version, versus v3 ?
>
>   


Code-wise, only the first patch changed: in tcp_timewait_state_process() I
was calling tcp_parse_options() without properly initializing
the tstamp_ok field of the struct tcp_options_received
parameter passed to that function.

I failed to notice this because, being an uninitialized structure
on the stack, it can actually produce the required behavior
depending on whatever random data happens to be there.
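
To make the failure mode concrete, here is a simplified userspace model. It is
a sketch under stated assumptions, not the kernel parser: the struct, the
function names, and the exact acceptance rule are hypothetical stand-ins for
tcp_parse_options() and struct tcp_options_received.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for struct tcp_options_received. */
struct model_opts {
	bool saw_tstamp;	/* output: a timestamp option was accepted */
	bool tstamp_ok;		/* input: caller allows timestamps (estab mode) */
};

/* Assumed rule for this model: in "established" mode the timestamp is
 * accepted only if the caller set tstamp_ok beforehand; otherwise the
 * global switch decides. */
static void model_parse_options(bool pkt_has_tstamp, struct model_opts *opt,
				bool estab, bool sysctl_timestamps)
{
	opt->saw_tstamp = false;
	if (!pkt_has_tstamp)
		return;
	if ((estab && opt->tstamp_ok) || (!estab && sysctl_timestamps))
		opt->saw_tstamp = true;
}

/* Models the fixed time-wait path: tstamp_ok is set explicitly before the
 * call, so the result never depends on leftover stack contents. */
static bool model_timewait_saw_tstamp(bool pkt_has_tstamp)
{
	struct model_opts tmp_opt;

	tmp_opt.tstamp_ok = true;
	model_parse_options(pkt_has_tstamp, &tmp_opt, true, false);
	return tmp_opt.saw_tstamp;
}
```

In this model, leaving tstamp_ok unset before an "established" parse would
make the outcome depend on whatever the field happened to contain, which is
the behavior described above.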

I spotted it thanks to William A.S.'s review of my changes in that area.

Since I had the opportunity, I also edited the patch set description to
better document some of the things that seemed to bug some
of the reviewers (namely Bill F. and William A.S.).

Sorry for the confusion.

Thanks,
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"The biggest risk you can take it is to take no risk."
		-- Mark Zuckerberg and probably others



* Re: [PATCH 2/3] net: TCP thin linear timeouts
From: Ilpo Järvinen @ 2009-10-28 14:31 UTC
  To: Arnd Hannemann
  Cc: Eric Dumazet, Andreas Petlund, Netdev, LKML, shemminger,
	David Miller
In-Reply-To: <4AE83FE4.1050309@nets.rwth-aachen.de>


On Wed, 28 Oct 2009, Arnd Hannemann wrote:

> Eric Dumazet schrieb:
> > Andreas Petlund a écrit :
> >> This patch will make TCP use only linear timeouts if the stream is 
> >> thin. This will help to avoid the very high latencies that thin 
> >> stream suffer because of exponential backoff. This mechanism is only 
> >> active if enabled by iocontrol or syscontrol and the stream is 
> >> identified as thin. 

...I don't see how high latency is in any connection to stream being 
"thin" or not btw. If all ACKs are lost it usually requires silence for 
the full RTT, which affects a stream regardless of its size. ...If not all 
ACKs are lost, then the dupACK approach in the other patch should cover 
it already.

> However, addressing the proposal:
> I wonder how one can seriously suggest to just skip congestion response 
> during timeout-based loss recovery? I believe that in a heavily 
> congested scenario, this would lead to a goodput disaster... Not to 
> mention that in a heavily congested scenario, suddenly every flow will 
> become "thin", so this will even amplify the problems. Or did I miss 
> something?

Good point. I suppose such an under-provisioned network can certainly
exist. I have heard that at least some people who remove exponential back 
off apply it later, from the nth retransmission on, as very often there 
really isn't such a super heavy congestion scenario but something 
completely unrelated to congestion which causes the RTO.
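
The "linear first, exponential from the nth retransmission" idea can be
sketched in userspace as follows. This is a hypothetical illustration, not
code from any posted patch: the function name, the linear_limit parameter,
and the millisecond cap are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cap for this sketch, in milliseconds. */
#define MODEL_RTO_MAX_MS 120000u

/* base_rto:     RTO without any backoff.
 * retransmits:  number of timeouts already taken for this segment.
 * linear_limit: how many timeouts keep the RTO constant before
 *               doubling starts.
 * Returns the RTO to arm for the next timeout. */
static uint32_t model_rto_after(uint32_t base_rto, unsigned int retransmits,
				unsigned int linear_limit)
{
	uint32_t rto = base_rto < MODEL_RTO_MAX_MS ? base_rto : MODEL_RTO_MAX_MS;
	unsigned int i;

	/* Only timeouts beyond the linear phase double the RTO. */
	for (i = linear_limit; i < retransmits; i++) {
		if (rto > MODEL_RTO_MAX_MS / 2)
			return MODEL_RTO_MAX_MS;	/* clamp, avoids overflow */
		rto <<= 1;
	}
	return rto;
}
```

With a base RTO of 200 ms and linear_limit of 3, the first three timeouts
stay at 200 ms, the fourth arms 400 ms, the fifth 800 ms, and so on up to the
cap, so persistent congestion still triggers backoff eventually.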

-- 
 i.


* [PATCHv2 net-2.6] sfc: Really allow RX checksum offload to be disabled
From: Ben Hutchings @ 2009-10-28 14:34 UTC
  To: David Miller; +Cc: netdev, linux-net-drivers

We have never checked the efx_nic::rx_checksum_enabled flag everywhere
we should, and since the switch to GRO we don't check it anywhere.
It's simplest to check it in the one place where we initialise the
per-packet checksummed flag.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org
---
This version really is applicable to net-2.6 and 2.6.31.y, and to
2.6.27.y with fuzz 1.  I'm not sure whether this bug is serious enough
for a stable update but it is an obvious fix.

Ben.

 drivers/net/sfc/falcon.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sfc/falcon.c b/drivers/net/sfc/falcon.c
index c049364..e75674e 100644
--- a/drivers/net/sfc/falcon.c
+++ b/drivers/net/sfc/falcon.c
@@ -884,7 +884,9 @@ static void falcon_handle_rx_event(struct efx_channel *channel,
 		/* If packet is marked as OK and packet type is TCP/IPv4 or
 		 * UDP/IPv4, then we can rely on the hardware checksum.
 		 */
-		checksummed = RX_EV_HDR_TYPE_HAS_CHECKSUMS(rx_ev_hdr_type);
+		checksummed =
+			efx->rx_checksum_enabled &&
+			RX_EV_HDR_TYPE_HAS_CHECKSUMS(rx_ev_hdr_type);
 	} else {
 		falcon_handle_rx_not_ok(rx_queue, event, &rx_ev_pkt_ok,
 					&discard);

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* [PATCH net-next-2.6] ipv6 sit: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 14:37 UTC
  To: David S. Miller; +Cc: Linux Netdev List

Speed up module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/sit.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)


diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index b6b1626..2362a33 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1145,16 +1145,19 @@ static struct xfrm_tunnel sit_handler = {
 	.priority	=	1,
 };
 
-static void sit_destroy_tunnels(struct sit_net *sitn)
+static void sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
 {
 	int prio;
 
 	for (prio = 1; prio < 4; prio++) {
 		int h;
 		for (h = 0; h < HASH_SIZE; h++) {
-			struct ip_tunnel *t;
-			while ((t = sitn->tunnels[prio][h]) != NULL)
-				unregister_netdevice(t->dev);
+			struct ip_tunnel *t = sitn->tunnels[prio][h];
+
+			while (t != NULL) {
+				unregister_netdevice_queue(t->dev, head);
+				t = t->next;
+			}
 		}
 	}
 }
@@ -1208,11 +1211,13 @@ err_alloc:
 static void sit_exit_net(struct net *net)
 {
 	struct sit_net *sitn;
+	LIST_HEAD(list);
 
 	sitn = net_generic(net, sit_net_id);
 	rtnl_lock();
-	sit_destroy_tunnels(sitn);
-	unregister_netdevice(sitn->fb_tunnel_dev);
+	sit_destroy_tunnels(sitn, &list);
+	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
+	unregister_netdevice_many(&list);
 	rtnl_unlock();
 	kfree(sitn);
 }
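
The batching pattern used by this patch can be illustrated with a small
userspace model. This is a sketch only: the model_* names are hypothetical,
and the counter merely stands in for the expensive synchronize_rcu() wait
that the commit message says is being factorized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a network device. */
struct model_dev {
	struct model_dev *next_todo;	/* link in the pending-unregister list */
	int registered;
};

/* Counts how many times the expensive wait would have been paid. */
static unsigned int model_sync_calls;

static void model_synchronize(void)
{
	model_sync_calls++;
}

/* Unbatched path: every device pays for its own wait. */
static void model_unregister_one(struct model_dev *dev)
{
	dev->registered = 0;
	model_synchronize();
}

/* Batched path, step 1: only queue the device, no wait yet. */
static void model_unregister_queue(struct model_dev *dev,
				   struct model_dev **head)
{
	dev->next_todo = *head;
	*head = dev;
}

/* Batched path, step 2: tear down everything queued, then wait once. */
static void model_unregister_many(struct model_dev **head)
{
	struct model_dev *dev;

	if (*head == NULL)
		return;		/* nothing queued: no wait in this model */
	for (dev = *head; dev != NULL; dev = dev->next_todo)
		dev->registered = 0;
	*head = NULL;
	model_synchronize();
}
```

For N queued devices the model pays for one wait instead of N, which is the
saving the patch is after when a module with many tunnels is unloaded.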


* TG3, kvm, ipv6 & tso data corruption bug?
From: Rik van Riel @ 2009-10-28 14:46 UTC
  To: netdev; +Cc: Linux kernel Mailing List, Matt Carlson, Michael Chan, KVM list

I have been tracking down what I thought was a KVM related network
issue for a while, however it appears it could be a hardware issue.

The symptom is that data in network packets gets corrupted before
the checksum is calculated.  This means the remote host can receive
corrupted data with no way to detect it (except application-level
checksums).  Luckily ssh has such checksums, so my rsync-over-ssh
backup script discovered this issue.

On a very regular basis, I got this message from ssh:

	Corrupted MAC on input.

I have played around a bit and narrowed it down to the following:

ipv4          => no problem
ipv6 w/o tso  => no problem
ipv6 with tso => occasional data corruption

Disabling tso with ethtool -K eth0 tso off makes the problem stop.

I am running Fedora 12's 2.6.31.1-56.fc12.x86_64 kernel, with the
following hardware:

05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761 
Gigabit Ethernet PCIe (rev 10)

I do not know enough about the network layer to know whether this is
fixable in software or whether TSO offloading for ipv6 should just
be disabled on this model.

-- 
All rights reversed.


* [PATCH net-next-2.6] ip6mr: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 14:48 UTC
  To: David S. Miller; +Cc: Linux Netdev List

Speed up module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/ip6mr.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 85849b4..52e0f74 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -477,7 +477,7 @@ failure:
  *	Delete a VIF entry
  */
 
-static int mif6_delete(struct net *net, int vifi)
+static int mif6_delete(struct net *net, int vifi, struct list_head *head)
 {
 	struct mif_device *v;
 	struct net_device *dev;
@@ -519,7 +519,7 @@ static int mif6_delete(struct net *net, int vifi)
 		in6_dev->cnf.mc_forwarding--;
 
 	if (v->flags & MIFF_REGISTER)
-		unregister_netdevice(dev);
+		unregister_netdevice_queue(dev, head);
 
 	dev_put(dev);
 	return 0;
@@ -976,6 +976,7 @@ static int ip6mr_device_event(struct notifier_block *this,
 	struct net *net = dev_net(dev);
 	struct mif_device *v;
 	int ct;
+	LIST_HEAD(list);
 
 	if (event != NETDEV_UNREGISTER)
 		return NOTIFY_DONE;
@@ -983,8 +984,10 @@ static int ip6mr_device_event(struct notifier_block *this,
 	v = &net->ipv6.vif6_table[0];
 	for (ct = 0; ct < net->ipv6.maxvif; ct++, v++) {
 		if (v->dev == dev)
-			mif6_delete(net, ct);
+			mif6_delete(net, ct, &list);
 	}
+	unregister_netdevice_many(&list);
+
 	return NOTIFY_DONE;
 }
 
@@ -1188,14 +1191,16 @@ static int ip6mr_mfc_add(struct net *net, struct mf6cctl *mfc, int mrtsock)
 static void mroute_clean_tables(struct net *net)
 {
 	int i;
+	LIST_HEAD(list);
 
 	/*
 	 *	Shut down all active vif entries
 	 */
 	for (i = 0; i < net->ipv6.maxvif; i++) {
 		if (!(net->ipv6.vif6_table[i].flags & VIFF_STATIC))
-			mif6_delete(net, i);
+			mif6_delete(net, i, &list);
 	}
+	unregister_netdevice_many(&list);
 
 	/*
 	 *	Wipe the cache
@@ -1325,7 +1330,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		if (copy_from_user(&mifi, optval, sizeof(mifi_t)))
 			return -EFAULT;
 		rtnl_lock();
-		ret = mif6_delete(net, mifi);
+		ret = mif6_delete(net, mifi, NULL);
 		rtnl_unlock();
 		return ret;
 


* Re: [PATCH] udev: create empty regular files to represent net interfaces
From: dann frazier @ 2009-10-28 15:09 UTC
  To: Matt Domsch
  Cc: Kay Sievers, linux-hotplug, Narendra_K, netdev, Jordan_Hargrave,
	Charles_Rose, Ben Hutchings
In-Reply-To: <20091028130308.GA24611@auslistsprd01.us.dell.com>

On Wed, Oct 28, 2009 at 08:03:08AM -0500, Matt Domsch wrote:
> On Wed, Oct 28, 2009 at 09:23:57AM +0100, Kay Sievers wrote:
[...]
> > That all sounds very much like something which will hit us back some
> > day. I'm not sure, if udev should publish such dead text files in
> > /dev, it does not seem to fit the usual APIs/assumptions where /sys
> > and /dev match, and libudev provides access to both. It all sounds
> > more like a database for a possible netdevname library, which does not
> > need to be public in /dev, right?
> 
> Right, it doesn't need to be in /dev.  We could have udev rules that
> simply call yet another program to maintain that database, in yet
> another way.

Or have udev maintain them in a private directory (e.g.,
/var/lib/udev/netalias). Personally, I like the approach of having
udev manage them as files - it's an abstraction our users already get,
and they don't have to learn two mechanisms when aliasing disks and
nics (SYMLINK ftw). Plus there's obviously a lot of code reuse to be
had (most of my patch was moving code into a common section).

If we want to hide the file implementation - we could invent another
udev construct that basically aliases SYMLINK (e.g. NETALIAS) that
works iff the device is a netdevice. That would let us switch out
implementations in the future, but would obviously be much more
invasive.

-- 
dann frazier


* [PATCH net-next-2.6] ip6tnl: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 15:16 UTC
  To: David S. Miller; +Cc: Linux Netdev List

Speed up module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/ip6_tunnel.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 670c291..6c1b5c9 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1393,14 +1393,19 @@ static void ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 {
 	int h;
 	struct ip6_tnl *t;
+	LIST_HEAD(list);
 
 	for (h = 0; h < HASH_SIZE; h++) {
-		while ((t = ip6n->tnls_r_l[h]) != NULL)
-			unregister_netdevice(t->dev);
+		t = ip6n->tnls_r_l[h];
+		while (t != NULL) {
+			unregister_netdevice_queue(t->dev, &list);
+			t = t->next;
+		}
 	}
 
 	t = ip6n->tnls_wc[0];
-	unregister_netdevice(t->dev);
+	unregister_netdevice_queue(t->dev, &list);
+	unregister_netdevice_many(&list);
 }
 
 static int ip6_tnl_init_net(struct net *net)


* [PATCH net-next-2.6] ipmr: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 15:21 UTC
  To: David S. Miller; +Cc: Linux Netdev List

Speed up module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ipmr.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 6949745..ef4ee45 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -275,7 +275,8 @@ failure:
  *	@notify: Set to 1, if the caller is a notifier_call
  */
 
-static int vif_delete(struct net *net, int vifi, int notify)
+static int vif_delete(struct net *net, int vifi, int notify,
+		      struct list_head *head)
 {
 	struct vif_device *v;
 	struct net_device *dev;
@@ -319,7 +320,7 @@ static int vif_delete(struct net *net, int vifi, int notify)
 	}
 
 	if (v->flags&(VIFF_TUNNEL|VIFF_REGISTER) && !notify)
-		unregister_netdevice(dev);
+		unregister_netdevice_queue(dev, head);
 
 	dev_put(dev);
 	return 0;
@@ -870,14 +871,16 @@ static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
 static void mroute_clean_tables(struct net *net)
 {
 	int i;
+	LIST_HEAD(list);
 
 	/*
 	 *	Shut down all active vif entries
 	 */
 	for (i = 0; i < net->ipv4.maxvif; i++) {
 		if (!(net->ipv4.vif_table[i].flags&VIFF_STATIC))
-			vif_delete(net, i, 0);
+			vif_delete(net, i, 0, &list);
 	}
+	unregister_netdevice_many(&list);
 
 	/*
 	 *	Wipe the cache
@@ -993,7 +996,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 		if (optname == MRT_ADD_VIF) {
 			ret = vif_add(net, &vif, sk == net->ipv4.mroute_sk);
 		} else {
-			ret = vif_delete(net, vif.vifc_vifi, 0);
+			ret = vif_delete(net, vif.vifc_vifi, 0, NULL);
 		}
 		rtnl_unlock();
 		return ret;
@@ -1156,6 +1159,7 @@ static int ipmr_device_event(struct notifier_block *this, unsigned long event, v
 	struct net *net = dev_net(dev);
 	struct vif_device *v;
 	int ct;
+	LIST_HEAD(list);
 
 	if (!net_eq(dev_net(dev), net))
 		return NOTIFY_DONE;
@@ -1165,8 +1169,9 @@ static int ipmr_device_event(struct notifier_block *this, unsigned long event, v
 	v = &net->ipv4.vif_table[0];
 	for (ct = 0; ct < net->ipv4.maxvif; ct++, v++) {
 		if (v->dev == dev)
-			vif_delete(net, ct, 1);
+			vif_delete(net, ct, 1, &list);
 	}
+	unregister_netdevice_many(&list);
 	return NOTIFY_DONE;
 }
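
The NULL passed in the setsockopt path deserves a word. As far as I understand unregister_netdevice_queue(), a NULL head means "no batch in progress" and the device is unregistered immediately, while a non-NULL head only links the device onto the caller's list until unregister_netdevice_many() runs. A minimal userspace model of that optional-list convention, with hypothetical names (opt_dev, opt_unregister_queue, opt_unregister_many) and no kernel APIs:

```c
#include <stddef.h>

/* Hypothetical stand-in for a device; not the kernel's struct net_device. */
struct opt_dev {
	int registered;
	struct opt_dev *next_queued;
};

/*
 * head == NULL: no batch in progress, tear the device down right away.
 * head != NULL: defer, link the device onto the caller's pending list.
 */
static void opt_unregister_queue(struct opt_dev *dev, struct opt_dev **head)
{
	if (head == NULL) {
		dev->registered = 0;
		return;
	}
	dev->next_queued = *head;
	*head = dev;
}

/* Tear down every device that was deferred, then empty the list. */
static void opt_unregister_many(struct opt_dev **head)
{
	struct opt_dev *dev;

	for (dev = *head; dev != NULL; dev = dev->next_queued)
		dev->registered = 0;
	*head = NULL;
}
```

The convention lets one function serve both the single-device callers and the batching callers without duplicating the teardown logic.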
 

^ permalink raw reply related

* [PATCH kernel 2.6.32-rc5] [RESEND] netdev: usb: dm9601.c can drive a device not supported yet, add support for it
From: Janusz Krzysztofik @ 2009-10-28 15:34 UTC (permalink / raw)
  To: Peter Korsgaard; +Cc: netdev, David S. Miller
In-Reply-To: <87my3jnts0.fsf@macbook.be.48ers.dk>

I found that the current version of drivers/net/usb/dm9601.c can be used to
successfully drive a low-power, low-cost network adapter with USB ID
0a46:9000, based on a DM9000E chipset. As no device with this ID is yet
present in the kernel, I have created a patch that adds support for the device
to the dm9601 driver.

Created and tested against linux-2.6.32-rc5.

Signed-off-by: Janusz Krzysztofik <jkrzyszt@tis.icnet.pl>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>

---
Thursday 22 October 2009 20:54:55 Peter Korsgaard wrote:

> Acked-by: Peter Korsgaard <jacmet@sunsite.dk>

I'm not sure if there was something wrong with my initial submission, but since
the patch, unlike others, has not been applied since it was acked last week,
neither to net-2.6 nor to net-next-2.6, I decided to resend it with a slightly
modified subject, and with David on CC: this time.

Thanks,
Janusz

--- linux-2.6.32-rc5/drivers/net/usb/dm9601.c.orig	2009-10-22 20:14:00.000000000 +0200
+++ linux-2.6.32-rc5/drivers/net/usb/dm9601.c	2009-10-22 20:14:04.000000000 +0200
@@ -649,6 +649,10 @@ static const struct usb_device_id produc
 	USB_DEVICE(0x0fe6, 0x8101),	/* DM9601 USB to Fast Ethernet Adapter */
 	.driver_info = (unsigned long)&dm9601_info,
 	 },
+	{
+	 USB_DEVICE(0x0a46, 0x9000),	/* DM9000E */
+	 .driver_info = (unsigned long)&dm9601_info,
+	 },
 	{},			// END
 };
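
For readers unfamiliar with ID tables, a minimal userspace sketch of what the new entry amounts to: one more (vendor, product) pair in a table that is scanned until an empty terminator. The names (toy_usb_id, toy_dm9601_ids, toy_match) are hypothetical and the real matching is done by the USB core, not by code like this; only the two IDs visible in the hunk above are listed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table entry; not the kernel's struct usb_device_id. */
struct toy_usb_id {
	uint16_t vendor;
	uint16_t product;
	const char *note;
};

/* IDs copied from the hunk above; the entry with a NULL note ends the table. */
static const struct toy_usb_id toy_dm9601_ids[] = {
	{ 0x0fe6, 0x8101, "DM9601 USB to Fast Ethernet Adapter" },
	{ 0x0a46, 0x9000, "DM9000E" },
	{ 0, 0, NULL },
};

/* Linear scan: return the matching entry, or NULL if the ID is unknown. */
static const struct toy_usb_id *toy_match(uint16_t vendor, uint16_t product)
{
	const struct toy_usb_id *id;

	for (id = toy_dm9601_ids; id->note != NULL; id++) {
		if (id->vendor == vendor && id->product == product)
			return id;
	}
	return NULL;
}
```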

^ permalink raw reply

* [PATCH net-next-2.6] bridge: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 15:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Speed up module unloading by batching device unregistration, so that
synchronize_rcu() is called once per batch rather than once per device.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/bridge/br_if.c |   19 +++++++++----------
 1 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index b1b3b0f..2117e5b 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -154,7 +154,7 @@ static void del_nbp(struct net_bridge_port *p)
 }
 
 /* called with RTNL */
-static void del_br(struct net_bridge *br)
+static void del_br(struct net_bridge *br, struct list_head *head)
 {
 	struct net_bridge_port *p, *n;
 
@@ -165,7 +165,7 @@ static void del_br(struct net_bridge *br)
 	del_timer_sync(&br->gc_timer);
 
 	br_sysfs_delbr(br->dev);
-	unregister_netdevice(br->dev);
+	unregister_netdevice_queue(br->dev, head);
 }
 
 static struct net_device *new_bridge_dev(struct net *net, const char *name)
@@ -323,7 +323,7 @@ int br_del_bridge(struct net *net, const char *name)
 	}
 
 	else
-		del_br(netdev_priv(dev));
+		del_br(netdev_priv(dev), NULL);
 
 	rtnl_unlock();
 	return ret;
@@ -462,15 +462,14 @@ int br_del_if(struct net_bridge *br, struct net_device *dev)
 void br_net_exit(struct net *net)
 {
 	struct net_device *dev;
+	LIST_HEAD(list);
 
 	rtnl_lock();
-restart:
-	for_each_netdev(net, dev) {
-		if (dev->priv_flags & IFF_EBRIDGE) {
-			del_br(netdev_priv(dev));
-			goto restart;
-		}
-	}
+	for_each_netdev(net, dev)
+		if (dev->priv_flags & IFF_EBRIDGE)
+			del_br(netdev_priv(dev), &list);
+
+	unregister_netdevice_many(&list);
 	rtnl_unlock();
 
 }
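
Why the restart label can go away, as I read the change: the old del_br() unregistered the device on the spot, which presumably took it off the list that for_each_netdev() was walking, so the loop had to start over after each hit. Queueing uses a separate list link and leaves the walked list untouched until unregister_netdevice_many(). A minimal userspace sketch of that two-link idea, with hypothetical names (demo_dev, demo_collect_bridges, demo_flush) and a plain singly linked list in place of the kernel's list_head:

```c
#include <stddef.h>

/* Hypothetical stand-in for a device with two independent list links. */
struct demo_dev {
	int is_bridge;
	struct demo_dev *next;		/* membership in the main device list */
	struct demo_dev *unreg_next;	/* separate link for the pending batch */
};

/* Pass 1: walk the main list once; only the batch link is written. */
static struct demo_dev *demo_collect_bridges(struct demo_dev *devs)
{
	struct demo_dev *batch = NULL;
	struct demo_dev *d;

	for (d = devs; d != NULL; d = d->next) {
		if (d->is_bridge) {
			d->unreg_next = batch;
			batch = d;
		}
	}
	return batch;
}

/* Pass 2: only now unlink each batched device from the main list. */
static void demo_flush(struct demo_dev **devs, struct demo_dev *batch)
{
	struct demo_dev *b, **pp;

	for (b = batch; b != NULL; b = b->unreg_next) {
		for (pp = devs; *pp != NULL; pp = &(*pp)->next) {
			if (*pp == b) {
				*pp = b->next;
				b->next = NULL;
				break;
			}
		}
	}
}
```

Because the first pass never changes a next pointer, the iteration stays valid no matter how many bridges it finds.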

^ permalink raw reply related

* Re: [net-next-2.6 PATCH 01/23] igb: add support for seperate tx-usecs setting in ethtool
From: Stephen Hemminger @ 2009-10-28 15:42 UTC (permalink / raw)
  To: David Miller; +Cc: jeffrey.t.kirsher, netdev, gospo, alexander.h.duyck
In-Reply-To: <20091028.033922.99548522.davem@davemloft.net>

On Wed, 28 Oct 2009 03:39:22 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> +	new_rx_count = max(new_rx_count, (u32)IGB_MIN_RXD);

new_rx_count = max_t(u32, new_rx_count, IGB_MIN_RXD)
  is slightly cleaner (hides cast)
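
To make the difference concrete, a simplified userspace sketch. toy_max_t is a hypothetical stand-in that casts both operands to the named type before comparing; unlike the kernel's max_t() it evaluates its arguments more than once, and TOY_MIN_RXD is a placeholder value, not the driver's constant.

```c
#include <stdint.h>

/* Simplified stand-in: casts both sides, but evaluates arguments twice. */
#define toy_max_t(type, a, b) \
	((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* Placeholder lower bound; not taken from the igb driver. */
#define TOY_MIN_RXD 80

/* Raise the requested ring size to the lower bound if it is too small. */
static uint32_t toy_clamp_rx_count(uint32_t new_rx_count)
{
	return toy_max_t(uint32_t, new_rx_count, TOY_MIN_RXD);
}
```

The cast lives in one place, inside the macro, so the call site no longer needs its own (u32).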

-- 

^ permalink raw reply

* Re: [PATCH] net: fold network name hash (v2)
From: Stephen Hemminger @ 2009-10-28 15:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, linux-kernel, akpm, torvalds, opurdila,
	viro
In-Reply-To: <4AE7DF8E.3020607@gmail.com>

On Wed, 28 Oct 2009 07:07:10 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephen Hemminger a écrit :
> > The full_name_hash does not produce a value that is evenly distributed
> > over the lower 8 bits. This causes name hash to be unbalanced with large
> > number of names. There is a standard function to fold in upper bits
> > so use that.
> > 
> > This is independent of possible improvements to full_name_hash()
> > in future.
> 
> >  static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
> >  {
> >  	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
> > -	return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
> > +	return &net->dev_name_head[hash_long(hash, NETDEV_HASHBITS)];
> >  }
> >  
> >  static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
> 
> full_name_hash() returns an "unsigned int", which is guaranteed to be 32 bits
> 
> You should therefore use hash_32(hash, NETDEV_HASHBITS),
> not hash_long() that maps to hash_64() on 64 bit arches, which is
> slower and certainly not any better with a 32bits input.

OK, I was following precedent. Only a couple places use hash_32, most use
hash_long().

Using the upper bits does give better distribution.
With 100,000 network names:

               Time       Ratio       Max   StdDev
hash_32       0.002123     1.00       422  11.07
hash_64       0.002927     1.00       400   3.97

The time field is pretty meaningless for such a small sample.
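
For reference, a minimal userspace sketch of the two bucket computations being compared. The function names are hypothetical, the multiplier is what I believe GOLDEN_RATIO_PRIME_32 is in include/linux/hash.h (treat it as an assumption), and the 8-bit table size matches the "lower 8 bits" mentioned above. Masking discards the upper bits entirely, while the multiplicative form takes the top bits of a product, so the upper input bits also take part.

```c
#include <stdint.h>

/* Multiplier assumed to match GOLDEN_RATIO_PRIME_32 in include/linux/hash.h. */
#define TOY_GOLDEN_RATIO_PRIME_32 UINT32_C(0x9e370001)

/* Old scheme: keep only the low `bits` bits of the hash (bits in 1..31). */
static uint32_t toy_bucket_mask(uint32_t hash, unsigned int bits)
{
	return hash & ((UINT32_C(1) << bits) - 1);
}

/* hash_32()-style: multiply, then keep the top `bits` bits of the product. */
static uint32_t toy_bucket_fold(uint32_t hash, unsigned int bits)
{
	uint32_t mixed = hash * TOY_GOLDEN_RATIO_PRIME_32;

	return mixed >> (32 - bits);
}
```

Two hashes that differ only in their upper bits, such as 0x01000000 and 0x02000000, land in the same bucket with the mask but in different buckets with the multiplicative form.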

^ permalink raw reply

* Re: WAN device configuration, again...
From: Dan Williams @ 2009-10-28 16:03 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: netdev
In-Reply-To: <m3d447hcks.fsf@intrepid.localdomain>

On Wed, 2009-10-28 at 14:28 +0100, Krzysztof Halasa wrote:
> Hi,
> 
> I'm currently at final stages of "producing" two WAN drivers and there
> is one thing to solve: they have really complex options. It's no longer
> a V.35 with ca. 4 clock modes, a clock rate and few encodings etc. They
> need many options unique to each driver/board. I think I need a more
> capable interface to configure the devices than the current ioctl-based
> one.
> 
> I think of something:
> - using netlink or similar interface

If you're doing a new config interface, I'd suggest netlink like the
wireless guys did to replace WEXT with cfg80211.  Using netlink makes
your interface easily available from programs/libraries without having
to screenscrape anything.  If you want some advice on netlink API stuff,
ask Johannes Berg.

Dan

> - with potentially unlimited "payload" size (data may be transferred in
>   smaller packets)
> - the "command" and "response" should be variable-length ASCII-based,
>   instead of fixed structures. This way I don't have to duplicate all
>   option handling in userspace, only the specific driver has to know
>   about them.
> 
> Comments? Perhaps there is already an example?
> Should I use something else?
> 
> I also thought about using /sys read/write calls, but I'm not sure it's
> a good idea.

^ permalink raw reply

* RE: [PATCH] udev: create empty regular files to represent net interfaces
From: Jordan_Hargrave @ 2009-10-28 16:09 UTC (permalink / raw)
  To: dannf, Matt_Domsch
  Cc: kay.sievers, linux-hotplug, Narendra_K, netdev, Charles_Rose,
	bhutchings
In-Reply-To: <20091028150913.GB3612@ldl.fc.hp.com>

I was thinking: if we're not planning on using the chardev/kernel route, there already exists an ifindex file in /sys/class/net/XXX/ifindex.
Should udev/helper be creating links to this, or is it better to keep everything under the /dev/ tree?
Using this method would require the patch to udev to handle renaming events.

--jordan hargrave
Dell Enterprise Linux Engineering

-----Original Message-----
From: dann frazier [mailto:dannf@hp.com]
Sent: Wed 10/28/2009 10:09
To: Domsch, Matt
Cc: Kay Sievers; linux-hotplug@vger.kernel.org; K, Narendra; netdev@vger.kernel.org; Hargrave, Jordan; Rose, Charles; Ben Hutchings
Subject: Re: [PATCH] udev: create empty regular files to represent net interfaces

On Wed, Oct 28, 2009 at 08:03:08AM -0500, Matt Domsch wrote:
> On Wed, Oct 28, 2009 at 09:23:57AM +0100, Kay Sievers wrote:
[...]
> > That all sounds very much like something which will hit us back some
> > day. I'm not sure, if udev should publish such dead text files in
> > /dev, it does not seem to fit the usual APIs/assumptions where /sys
> > and /dev match, and libudev provides access to both. It all sounds
> > more like a database for a possible netdevname library, which does not
> > need to be public in /dev, right?
> 
> Right, it doesn't need to be in /dev.  We could have udev rules that
> simply call yet another program to maintain that database, in yet
> another way.

Or have udev maintain them in a private directory (e.g.,
/var/lib/udev/netalias). Personally, I like the approach of having
udev manage them as files - it's an abstraction our users already get,
and they don't have to learn two mechanisms when aliasing disks and
nics (SYMLINK ftw). Plus there's obviously a lot of code reuse to be
had (most of my patch was moving code into a common section).

If we want to hide the file implementation - we could invent another
udev construct that basically aliases SYMLINK (e.g. NETALIAS) that
works iff the device is a netdevice. That would let us switch out
implementations in the future, but would obviously be much more
invasive.

-- 
dann frazier

^ permalink raw reply

* Re: [PATCH] udev: create empty regular files to represent net interfaces
From: Greg KH @ 2009-10-28 16:11 UTC (permalink / raw)
  To: Jordan_Hargrave
  Cc: dannf, Matt_Domsch, kay.sievers, linux-hotplug, Narendra_K,
	netdev, Charles_Rose, bhutchings
In-Reply-To: <5DDAB7BA7BDB58439DD0EED0B8E9A3AE02E827AF@ausx3mpc102.aus.amer.dell.com>


A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Wed, Oct 28, 2009 at 11:09:36AM -0500, Jordan_Hargrave@Dell.com wrote:
> I was thinking, if we're not planning on using the chardev/kernel route.
> There already exists an ifindex file in /sys/class/net/XXX/ifindex.
> Should udev/helper be creating links to this, or is it better to keep
> everything under the /dev/ tree?  Using this method would require the
> patch to udev to handle renaming events.

Please never create symlinks out of /dev/ to /sys; that doesn't make
sense at all and probably violates part of the LSB somewhere...

thanks,

greg k-h

^ permalink raw reply

* Re: Oops in net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c::ipv6_confirm(), kernel 2.6.30.8
From: Patrick McHardy @ 2009-10-28 16:29 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: netfilter-devel, netdev
In-Reply-To: <200910170903.n9H93bKI012269@int-mx03.intmail.prod.int.phx2.redhat.com>

Chuck Ebbert wrote:
> general protection fault: 0000 [#1] SMP 
> last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings
> CPU 0 
> Modules linked in: tun fuse rfcomm sco bridge stp llc bnep l2cap autofs4
> w83627ehf hwmon_vid sunrpc sit tunnel4 nf_nat_sip nf_conntrack_sip nf_nat_ftp
> nf_conntrack_ftp ipt_LOG xt_owner iptable_mangle ipt_MASQUERADE iptable_nat
> nf_nat xt_limit nf_conntrack_ipv6 xt_mac ip6t_LOG ip6table_filter ip6_tables
> p4_clockmod freq_table speedstep_lib squashfs nls_utf8 dm_multipath raid1
> kvm_intel kvm uinput ipv6 ppdev snd_hda_codec_realtek snd_hda_intel
> snd_hda_codec snd_hwdep snd_pcm nouveau pcspkr i2c_i801 firewire_ohci snd_timer
> btusb firewire_core e1000 snd bluetooth drm iTCO_wdt iTCO_vendor_support
> crc_itu_t i2c_algo_bit asus_atk0110 i82975x_edac soundcore sky2 edac_core
> parport_pc i2c_core floppy hwmon snd_page_alloc parport raid456 raid6_pq
> async_xor async_memcpy async_tx xor [last unloaded: freq_table]
> Pid: 4104, comm: qemu-kvm Not tainted 2.6.30.8-64.fc11.x86_64.debug #1 System
> Product Name
> RIP: 0010:[<ffffffffa03624e1>]  [<ffffffffa03624e1>] ipv6_confirm+0xd0/0x147
> [nf_conntrack_ipv6]
> RSP: 0018:ffff880035203668  EFLAGS: 00010212
> RAX: 0000000000000030 RBX: ffff8801f90a1080 RCX: 0000000000000002
> RDX: ffffffff81783f40 RSI: 0000000000000030 RDI: ffff8801f90a1080
> RBP: ffff880035203698 R08: ffffffffa04520ee R09: ffff880035203748
> R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81783f40
> R13: 6b6b6b6b6b6b6b6b R14: 0000000000000002 R15: 0000000000000004
> FS:  00007f944e44b740(0000) GS:ffff880035200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fffbc54ef60 CR3: 000000020f8d8000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process qemu-kvm (pid: 4104, threadinfo ffff88020f89c000, task
> ffff8802104a4760)
> Stack:
>  3a00000000000246 00000000696092b1 0000000080000000 ffff8801f90a1080
>  ffffffffa04520ee ffffffff81783bd0 ffff880035203708 ffffffff8142b117
>  ffff8800352036c8 ffff880035203748 ffff880210492060 0000000000000000
> Call Trace:
>  <IRQ> <0> [<ffffffffa04520ee>] ? br_nf_dev_queue_xmit+0x0/0xa1 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffffa04520ee>] ? br_nf_dev_queue_xmit+0x0/0xa1 [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa04520ee>] ? br_nf_dev_queue_xmit+0x0/0xa1 [bridge]
>  [<ffffffffa044c29d>] ? br_dev_queue_push_xmit+0x0/0xae [bridge]
>  [<ffffffffa045263b>] nf_hook_thresh.clone.0+0x4c/0x62 [bridge]
>  [<ffffffffa0452d92>] br_nf_post_routing+0x1a8/0x1e4 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffffa044c29d>] ? br_dev_queue_push_xmit+0x0/0xae [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa044c29d>] ? br_dev_queue_push_xmit+0x0/0xae [bridge]
>  [<ffffffffa044c39d>] nf_hook_thresh.clone.0+0x52/0x68 [bridge]
>  [<ffffffffa044c3ed>] br_forward_finish+0x3a/0x62 [bridge]
>  [<ffffffffa0452aaa>] br_nf_forward_finish+0xb3/0xd2 [bridge]
>  [<ffffffffa045263b>] ? nf_hook_thresh.clone.0+0x4c/0x62 [bridge]
>  [<ffffffffa045318a>] br_nf_forward_ip+0x1af/0x1de [bridge]
>  [<ffffffffa044c3b3>] ? br_forward_finish+0x0/0x62 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffffa044c3b3>] ? br_forward_finish+0x0/0x62 [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa044c3b3>] ? br_forward_finish+0x0/0x62 [bridge]
>  [<ffffffffa044c415>] ? __br_forward+0x0/0xab [bridge]
>  [<ffffffffa044c39d>] nf_hook_thresh.clone.0+0x52/0x68 [bridge]
>  [<ffffffffa044c499>] __br_forward+0x84/0xab [bridge]
>  [<ffffffffa044c1ca>] br_flood+0x82/0xd9 [bridge]
>  [<ffffffff814086ee>] ? netif_receive_skb+0x120/0x44c
>  [<ffffffffa044c249>] br_flood_forward+0x28/0x3e [bridge]
>  [<ffffffffa044d36a>] br_handle_frame_finish+0x13a/0x167 [bridge]
>  [<ffffffffa04529da>] br_nf_pre_routing_finish_ipv6+0xb7/0xd4 [bridge]
>  [<ffffffffa045263b>] ? nf_hook_thresh.clone.0+0x4c/0x62 [bridge]
>  [<ffffffffa04534e8>] br_nf_pre_routing+0x32f/0x577 [bridge]
>  [<ffffffffa044d230>] ? br_handle_frame_finish+0x0/0x167 [bridge]
>  [<ffffffff8142b117>] nf_iterate+0x5c/0xb3
>  [<ffffffff8123bbf6>] ? kobject_put+0x54/0x6f
>  [<ffffffffa044d230>] ? br_handle_frame_finish+0x0/0x167 [bridge]
>  [<ffffffff8142b214>] nf_hook_slow+0xa6/0x136
>  [<ffffffffa044d230>] ? br_handle_frame_finish+0x0/0x167 [bridge]
>  [<ffffffffa044d21a>] nf_hook_thresh.clone.0+0x52/0x68 [bridge]
>  [<ffffffffa044d533>] br_handle_frame+0x19c/0x1d9 [bridge]
>  [<ffffffff814088fa>] netif_receive_skb+0x32c/0x44c

> Code: 2c 75 1d f6 05 1a fc 1c e2 40 74 60 f6 05 17 fc 1c e2 04 74 57 80 3d ad
> 4d 00 00 00 74 4e eb 62 44 89 f1 4c 89 e2 89 c6 48 89 df <41> ff 55 50 83 f8 01
> 75 3d 4c 8b a3 88 00 00 00 4d 85 e4 74 2c 
> RIP  [<ffffffffa03624e1>] ipv6_confirm+0xd0/0x147 [nf_conntrack_ipv6]
>  RSP <ffff880035203668>
> ---[ end trace 5dc400d9f2f8290b ]---
> 
>    c: f6 05 17 fc 1c e2 04  testb  $0x4,-0x1de303e9(%rip)
>   13: 74 57                 je     0x6c
>   15: 80 3d ad 4d 00 00 00  cmpb   $0x0,0x4dad(%rip)
>   1c: 74 4e                 je     0x6c
>   1e: eb 62                 jmp    0x82
>   20: 44 89 f1              mov    %r14d,%ecx
>   23: 4c 89 e2              mov    %r12,%rdx
>   26: 89 c6                 mov    %eax,%esi
>   28: 48 89 df              mov    %rbx,%rdi
> 
>    0: 41 ff 55 50           callq  *0x50(%r13)  <===========
>    4: 83 f8 01              cmp    $0x1,%eax
>    7: 75 3d                 jne    0x46
>    9: 4c 8b a3 88 00 00 00  mov    0x88(%rbx),%r12
>   10: 4d 85 e4              test   %r12,%r12
>   13: 74 2c                 je     0x41
> 
> R13: 6b6b6b6b6b6b6b6b  
> 
> Corresponds to:
> net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c:178:
> 
>         ret = helper->help(skb, protoff, ct, ctinfo);  

Did you unload any helper modules before this happened?
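
Background for the question, hedged since it is read off the quoted dump only: R13 is 0x6b in every byte, and 0x6b is the POISON_FREE byte that slab debugging writes into freed objects. The faulting instruction calls through 0x50(%r13), so the helper pointer appears to have been loaded from memory that was already freed. A minimal userspace sketch of the check one does by eye; the function name is hypothetical:

```c
#include <stdint.h>

/* Byte value of POISON_FREE in include/linux/poison.h. */
#define TOY_POISON_FREE 0x6b

/* Returns 1 if every byte of the 64-bit value equals the poison byte. */
static int toy_looks_like_freed_slab(uint64_t value)
{
	int i;

	for (i = 0; i < 8; i++) {
		if (((value >> (8 * i)) & 0xff) != TOY_POISON_FREE)
			return 0;
	}
	return 1;
}
```

Applied to the register dump, R13 matches the pattern while an ordinary kernel pointer such as RBX (ffff8801f90a1080) does not.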

^ permalink raw reply

* [PATCH] vmxnet3: remove duplicate #include
From: Shreyas Bhatewara @ 2009-10-28 16:30 UTC (permalink / raw)
  To: netdev; +Cc: weiyi.huang, pv-drivers


Remove duplicate header file includes from vmxnet3_int.h.

Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Bhavesh Davda <davda@vmware.com>

---

diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 3c0d70d..4450816 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -27,16 +27,11 @@
 #ifndef _VMXNET3_INT_H
 #define _VMXNET3_INT_H
 
-#include <linux/types.h>
 #include <linux/ethtool.h>
 #include <linux/delay.h>
-#include <linux/device.h>
 #include <linux/netdevice.h>
 #include <linux/pci.h>
-#include <linux/ethtool.h>
 #include <linux/compiler.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/ioport.h>

^ permalink raw reply related

* Re: TG3, kvm, ipv6 & tso data corruption bug?
From: Matt Carlson @ 2009-10-28 16:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: netdev@vger.kernel.org, Linux kernel Mailing List,
	Matthew Carlson, Michael Chan, KVM list
In-Reply-To: <4AE8595F.1080404@redhat.com>

On Wed, Oct 28, 2009 at 07:46:55AM -0700, Rik van Riel wrote:
> I have been tracking down what I thought was a KVM related network
> issue for a while, however it appears it could be a hardware issue.
> 
> The symptom is that data in network packets gets corrupted, before
> the checksum is calculated.  This means the remote host can get
> corrupted data, with no way to calculate it (except application
> level checksums).  Luckily ssh has such checksums, so my rsync over
> ssh backup script discovered this issue.
> 
> On a very regular basis, I got this message from ssh:
> 
> 	Corrupted MAC on input.
> 
> I have played around a bit and narrowed it down to the following:
> 
> ipv4          => no problem
> ipv6 w/o tso  => no problem
> ipv6 with tso => occasional data corruption
> 
> Disabling tso with ethtool -K eth0 tso off makes the problem stop.
> 
> I am running Fedora 12's 2.6.31.1-56.fc12.x86_64 kernel, with the
> following hardware:
> 
> 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761 
> Gigabit Ethernet PCIe (rev 10)
> 
> I do not know enough about the network layer to know whether this is
> fixable in software or whether TSO offloading for ipv6 should just
> be disabled on this model.

This problem sounds familiar.  There are chip bugs in this area, but as
far as I know, they should have been worked around.  Let me see if this
is indeed the same bug resurfacing.


^ permalink raw reply
