Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Eric Dumazet @ 2011-05-17  8:34 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: Benoit Sigoure, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	netdev, linux-kernel
In-Reply-To: <E5F8880D-72DF-488B-9515-60453D3EC240@comsys.rwth-aachen.de>

Le mardi 17 mai 2011 à 10:01 +0200, Alexander Zimmermann a écrit :

> 
> regardless of netdev will accept this patch or not, the
> upcoming initRTO is 1s. See
> http://tools.ietf.org/id/draft-paxson-tcpm-rfc2988bis-02.txt
> 
> The draft is IESG approved and will become an RFC soon.

Thanks Alex for this link / information.

^ permalink raw reply

* Re: [PATCH] net: Change netdev_fix_features messages loglevel
From: Michał Mirosław @ 2011-05-17  8:27 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, mst, bhutchings
In-Reply-To: <20110516234816.GA12024@gondor.apana.org.au>

On Tue, May 17, 2011 at 09:48:16AM +1000, Herbert Xu wrote:
> On Mon, May 16, 2011 at 03:14:34PM -0400, David Miller wrote:
> > From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > Date: Mon, 16 May 2011 15:17:57 +0200 (CEST)
> > 
> > > Those reduced to DEBUG can possibly be triggered by unprivileged processes
> > > and are nothing exceptional. Illegal checksum combinations can only be
> > > caused by driver bug, so promote those messages to WARN.
> > > 
> > > Since GSO without SG will now only cause DEBUG message from
> > > netdev_fix_features(), remove the workaround from register_netdevice().
> > > 
> > > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > Applied, thanks.
> I think this patch is not a good idea.  The only problem here
> is the tun driver which should just fix up the feature bits
> silently (we need to do that anyway as right now it doesn't
> verify the feature bits at all, those warnings are only trigger
> on registering the tun device).
> 
> With this patch future buggy drivers will instead trigger hard-
> to-debug warnings further down the stack in the TSO code path,
> like they have before (or silent packet drops if those warnings
> aren't there anymore).

If core TSO code really depends on checksumming features being turned on
then this dependency should be checked in netdev_fix_features. There
will be no possibility for a driver to cause bugs then.

Network core is making up for devices which can't checksum just before
calling ndo_hard_start_xmit. It's a possibility that packet to be TSOed
will be unnecessarily checksummed in dev_hard_start_xmit() if HW checksum
is off while TSO is on (I deduce this based on the code, can't test as
TSO in my test-box is busted).

Best Regards,
Michał Mirosław

^ permalink raw reply

* [PATCH] net: tuntap: Fix tun_net_fix_features()
From: Michał Mirosław @ 2011-05-17  8:19 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Michael S. Tsirkin, Ben Hutchings, Shan Wei
In-Reply-To: <BANLkTimeRLQ0FLASsLE+ZxiXOk4gHt5XFQ@mail.gmail.com>

tun->set_features are meant to limit not force the features.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 drivers/net/tun.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 74e9405..f77c6d0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -458,7 +458,7 @@ static u32 tun_net_fix_features(struct net_device *dev, u32 features)
 {
 	struct tun_struct *tun = netdev_priv(dev);

-	return (features & tun->set_features) | (features & ~TUN_USER_FEATURES);
+	return features & (tun->set_features | ~TUN_USER_FEATURES);
 }

 static const struct net_device_ops tun_netdev_ops = {
-- 
1.7.2.5

^ permalink raw reply related

* Re: tap/bridge: Dropping NETIF_F_GSO/NETIF_F_SG
From: Michał Mirosław @ 2011-05-17  8:15 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Michael S. Tsirkin, Shan Wei, Michał Mirosław, netdev,
	Ben Hutchings
In-Reply-To: <BANLkTimeRLQ0FLASsLE+ZxiXOk4gHt5XFQ@mail.gmail.com>

W dniu 17 maja 2011 10:08 użytkownik Michał Mirosław <mirqus@gmail.com> napisał:
> In the tun/tap case, by using TUNSETOFFLOAD userspace advertises that
> it is willing to receive unchecksummed/TSOed packets from kernel (it
> should use TUN_VNET_HDR for this to work correctly unless it doesn't
> forward the packets). After the conversion, those offloads can be
> disabled by using ethtool even if userspace does support them.
> Previously you were able to push offloaded packets to apllications not
> prepared for them.

And there is a bug in the logic implementing this. I'll send a fix in a minute.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: tap/bridge: Dropping NETIF_F_GSO/NETIF_F_SG
From: Michał Mirosław @ 2011-05-17  8:08 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Michael S. Tsirkin, Shan Wei, Michał Mirosław, netdev,
	Ben Hutchings
In-Reply-To: <20110516224658.GA11157@gondor.apana.org.au>

2011/5/17 Herbert Xu <herbert@gondor.apana.org.au>:
> On Mon, May 16, 2011 at 02:24:19PM +0200, Michał Mirosław wrote:
>> In this case - no. Those messages inform that driver supports more
>> feature combinations than network core and the feature set is reduced
>> because of that.
> The driver should never even claim to support SG/TSO/UFO if it
> does not support checksum.  That is the point of the warning.

Not really. The dependency of SG on checksum is in network core code
only, not in the hardware. For TSO/UFO they imply hw checksumming for
the respective protocol, but still theres no point in duplicating
features dependencies in driver code.

In the tun/tap case, by using TUNSETOFFLOAD userspace advertises that
it is willing to receive unchecksummed/TSOed packets from kernel (it
should use TUN_VNET_HDR for this to work correctly unless it doesn't
forward the packets). After the conversion, those offloads can be
disabled by using ethtool even if userspace does support them.
Previously you were able to push offloaded packets to apllications not
prepared for them.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Eric Dumazet @ 2011-05-17  8:07 UTC (permalink / raw)
  To: Benoit Sigoure
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
In-Reply-To: <1305618020-72535-2-git-send-email-tsunanet@gmail.com>

Le mardi 17 mai 2011 à 00:40 -0700, Benoit Sigoure a écrit :
> Instead of hardcoding the initial RTO to 3s and requiring
> the kernel to be recompiled to change it, expose it as a
> sysctl that can be tuned at runtime.  Leave the default
> value unchanged.
> 

I wont discuss if introducing a new sysctl is welcomed, only on patch
issues. I believe some work in IETF is done to reduce the 3sec value to
1sec anyway.

> Signed-off-by: Benoit Sigoure <tsunanet@gmail.com>
> ---
>  Documentation/networking/ip-sysctl.txt |    6 ++++++
>  include/linux/sysctl.h                 |    1 +
>  include/net/tcp.h                      |    3 ++-
>  kernel/sysctl_binary.c                 |    1 +
>  net/ipv4/syncookies.c                  |    2 +-
>  net/ipv4/sysctl_net_ipv4.c             |   11 +++++++++++
>  net/ipv4/tcp.c                         |    4 ++--
>  net/ipv4/tcp_input.c                   |    8 ++++----
>  net/ipv4/tcp_ipv4.c                    |    6 +++---
>  net/ipv4/tcp_minisocks.c               |    6 +++---
>  net/ipv4/tcp_output.c                  |    2 +-
>  net/ipv4/tcp_timer.c                   |    9 +++++----
>  net/ipv6/syncookies.c                  |    2 +-
>  net/ipv6/tcp_ipv6.c                    |    6 +++---
>  14 files changed, 44 insertions(+), 23 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index d3d653a..c381c68 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -384,6 +384,12 @@ tcp_retries2 - INTEGER
>  	RFC 1122 recommends at least 100 seconds for the timeout,
>  	which corresponds to a value of at least 8.
>  
> +tcp_initial_rto - INTEGER
> +	This value sets the initial retransmit timeout, that is how long
> +	the kernel will wait before retransmitting the initial SYN packet.
> +
> +	RFC 1122 says that this SHOULD be 3 seconds, which is the default.
> +

units ? seconds ? ms ? jiffies ? I suggest using ms as external
interface.

>  tcp_rfc1337 - BOOLEAN
>  	If set, the TCP stack behaves conforming to RFC1337. If unset,
>  	we are not conforming to RFC, but prevent TCP TIME_WAIT
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 11684d9..96a9b41 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -425,6 +425,7 @@ enum
>  	NET_TCP_ALLOWED_CONG_CONTROL=123,
>  	NET_TCP_MAX_SSTHRESH=124,
>  	NET_TCP_FRTO_RESPONSE=125,
> +        NET_IPV4_TCP_INITIAL_RTO=126,

We dont add new values here anymore, only anonymous ones.

>  };
>  
>  enum {
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index cda30ea..a2bb0f1 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -213,6 +213,7 @@ extern int sysctl_tcp_syn_retries;
>  extern int sysctl_tcp_synack_retries;
>  extern int sysctl_tcp_retries1;
>  extern int sysctl_tcp_retries2;
> +extern int sysctl_tcp_initial_rto;
>  extern int sysctl_tcp_orphan_retries;
>  extern int sysctl_tcp_syncookies;
>  extern int sysctl_tcp_retrans_collapse;
> @@ -295,7 +296,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
>  static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
>  {
>  	unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
> -	return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
> +	return time_after(jiffies, last_overflow + sysctl_tcp_initial_rto);
>  }
>  
>  extern struct proto tcp_prot;
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 3b8e028..d608d84 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -354,6 +354,7 @@ static const struct bin_table bin_net_ipv4_table[] = {
>  	{ CTL_INT,	NET_IPV4_TCP_KEEPALIVE_INTVL,		"tcp_keepalive_intvl" },
>  	{ CTL_INT,	NET_IPV4_TCP_RETRIES1,			"tcp_retries1" },
>  	{ CTL_INT,	NET_IPV4_TCP_RETRIES2,			"tcp_retries2" },
> +	{ CTL_INT,	NET_IPV4_TCP_INITIAL_RTO,		"tcp_initial_rto" },

no need here. sysctl() is deprecated.

>  	{ CTL_INT,	NET_IPV4_TCP_FIN_TIMEOUT,		"tcp_fin_timeout" },
>  	{ CTL_INT,	NET_TCP_SYNCOOKIES,			"tcp_syncookies" },
>  	{ CTL_INT,	NET_TCP_TW_RECYCLE,			"tcp_tw_recycle" },
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 8b44c6d..089bc92 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -186,7 +186,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
>   * sysctl_tcp_retries1. It's a rather complicated formula (exponential
>   * backoff) to compute at runtime so it's currently hardcoded here.
>   */
> -#define COUNTER_TRIES 4
> +#define COUNTER_TRIES (sysctl_tcp_initial_rto + 1)

Are you sure of this ?

If HZ=1000, sysctl_tcp_initial_rto is 3000

COUNTER_TRIES goes from 4 to 3004 
  
>  /*
>   * Check if a ack sequence number is a valid syncookie.
>   * Return the decoded mss if it is, or 0 if not.
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 321e6e8..24dc21d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -30,6 +30,8 @@ static int tcp_adv_win_scale_min = -31;
>  static int tcp_adv_win_scale_max = 31;
>  static int ip_ttl_min = 1;
>  static int ip_ttl_max = 255;
> +static int tcp_initial_rto_min = TCP_RTO_MIN;

warning its jiffies units here.

> +static int tcp_initial_rto_max = TCP_RTO_MAX;
>  
>  /* Update system visible IP port range */
>  static void set_local_port_range(int range[2])
> @@ -246,6 +248,15 @@ static struct ctl_table ipv4_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +        {
> +		.procname       = "tcp_initial_rto",
> +		.data           = &sysctl_tcp_initial_rto,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler	= proc_dointvec_minmax,

so unit is jiffies ? Really its not a good thing. Use ms instead.

Consider proc_dointvec_ms_jiffies(), here.

> +		.extra1		= &tcp_initial_rto_min,
> +		.extra2		= &tcp_initial_rto_max,
> +	},
>  	{
>  		.procname	= "tcp_fin_timeout",
>  		.data		= &sysctl_tcp_fin_timeout,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index b22d450..e9e7c3f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2352,7 +2352,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  	case TCP_DEFER_ACCEPT:
>  		/* Translate value in seconds to number of retransmits */
>  		icsk->icsk_accept_queue.rskq_defer_accept =
> -			secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
> +			secs_to_retrans(val, sysctl_tcp_initial_rto / HZ,

Here you assume sysctl_tcp_initial_rto is expressed in jiffies ?
Oh well...

>  					TCP_RTO_MAX / HZ);
>  		break;
>  
> @@ -2539,7 +2539,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>  		break;
>  	case TCP_DEFER_ACCEPT:
>  		val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
> -				      TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
> +				      sysctl_tcp_initial_rto / HZ, TCP_RTO_MAX / HZ);
>  		break;
>  	case TCP_WINDOW_CLAMP:
>  		val = tp->window_clamp;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index bef9f04..39f6c27 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -890,7 +890,7 @@ static void tcp_init_metrics(struct sock *sk)
>  	if (dst_metric(dst, RTAX_RTT) == 0)
>  		goto reset;
>  
> -	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
> +	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (sysctl_tcp_initial_rto << 3))

Here you assume jiffies unit again. I wonder how this was tested :(

Please fix this and chose a definitive unit.

^ permalink raw reply

* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Alexander Zimmermann @ 2011-05-17  8:01 UTC (permalink / raw)
  To: Benoit Sigoure
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
In-Reply-To: <1305618020-72535-2-git-send-email-tsunanet@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 751 bytes --]

Hi Benoit,

Am 17.05.2011 um 09:40 schrieb Benoit Sigoure:

> Instead of hardcoding the initial RTO to 3s and requiring
> the kernel to be recompiled to change it, expose it as a
> sysctl that can be tuned at runtime.  Leave the default
> value unchanged.
> 

regardless of netdev will accept this patch or not, the
upcoming initRTO is 1s. See
http://tools.ietf.org/id/draft-paxson-tcpm-rfc2988bis-02.txt

The draft is IESG approved and will become an RFC soon.

Alex

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//


[-- Attachment #2: Signierter Teil der Nachricht --]
[-- Type: application/pgp-signature, Size: 243 bytes --]

^ permalink raw reply

* Re: Null pointer dereference in icmp_send
From: Eric Dumazet @ 2011-05-17  7:51 UTC (permalink / raw)
  To: Aristide Fattori; +Cc: netdev, roberto.paleari
In-Reply-To: <BANLkTinNeK1KVv8L-+ySgG-nfhg1=eiUPg@mail.gmail.com>

Le mardi 17 mai 2011 à 09:31 +0200, Aristide Fattori a écrit :

> At a first glance, it appears we identified the same bug you already
> fixed. If you want to double check, we can send you the "long version
> report" that we submitted at first to security@kernel.org and
> david@davemloft.net.

Well, why not, please send it to me only.

Thanks



^ permalink raw reply

* [PATCH] ethtool: change spi param on ip4 to l4data
From: Sebastian Pöhn @ 2011-05-17  7:40 UTC (permalink / raw)
  To: netdev; +Cc: sebastian.poehn, bhutchings

It is confusing for users if ip4 has a spi field, which results in
l4_4_bytes filtering. Add a new option and remove spi from ip4.

Signed-off-by: Sebastian Poehn <sebastian.poehn@belden.com>

diff --git a/ethtool.8.in b/ethtool.8.in
index 42d923b..19ba190 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -284,6 +284,7 @@ ethtool \- query or control network driver and
hardware settings
 .RB [ dst\-ip \ \*(PA\ [ m \ \*(PA]]
 .BM tos
 .BM l4proto
+.BM l4data
 .BM src\-port
 .BM dst\-port
 .BM spi
@@ -683,6 +684,10 @@ match along with an optional mask.  Applies to all
IPv4 based flow-types.
 Includes the layer 4 protocol number and optional mask.  Valid only for
 flow-type ip4.
 .TP
+.BI l4data \ N \\fR\ [\\fPm \ N \\fR]\\fP
+Specify the first 4 byte of layer 4 protocol.  Valid only for
+flow-type ip4.
+.TP
 .BI src\-port \ N \\fR\ [\\fPm \ N \\fR]\\fP
 Specify the value of the source port field (applicable to TCP/UDP packets)
 in the incoming packet to match along with an optional mask.  Valid for
@@ -696,7 +701,7 @@ Valid for flow-types ip4, tcp4, udp4, and sctp4.
 .BI spi \ N \\fR\ [\\fPm \ N \\fR]\\fP
 Specify the value of the security parameter index field (applicable to
 AH/ESP packets)in the incoming packet to match along with an optional
-mask.  Valid for flow-types ip4, ah4, and esp4.
+mask.  Valid for flow-types ah4 and esp4.
 .TP
 .BI vlan\-etype \ N \\fR\ [\\fPm \ N \\fR]\\fP
 Includes the VLAN tag Ethertype and an optional mask.
diff --git a/ethtool.c b/ethtool.c
index 0b7ec05..fe12934 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -242,6 +242,7 @@ static struct option {
 		"			[ dst-ip %d.%d.%d.%d [m %d.%d.%d.%d] ]\n"
 		"			[ tos %d [m %x] ]\n"
 		"			[ l4proto %d [m %x] ]\n"
+		"			[ l4data %d [m %x] ]\n"
 		"			[ src-port %d [m %x] ]\n"
 		"			[ dst-port %d [m %x] ]\n"
 		"			[ spi %d [m %x] ]\n"
diff --git a/rxclass.c b/rxclass.c
index ee486f7..7507979 100644
--- a/rxclass.c
+++ b/rxclass.c
@@ -541,6 +541,7 @@ typedef enum {
 #define NTUPLE_FLAG_VLAN	0x100
 #define NTUPLE_FLAG_UDEF	0x200
 #define NTUPLE_FLAG_VETH	0x400
+#define NFC_FLAG_L4DATA	0x800

 struct rule_opts {
 	const char	*name;
@@ -622,7 +623,7 @@ static struct rule_opts rule_nfc_usr_ip4[] = {
 	{ "l4proto", OPT_U8, NFC_FLAG_PROTO,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.proto),
 	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.proto) },
-	{ "spi", OPT_BE32, NFC_FLAG_SPI,
+	{ "l4data", OPT_BE32, NFC_FLAG_L4DATA,
 	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.l4_4_bytes),
 	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.l4_4_bytes) },
 	{ "src-port", OPT_BE16, NFC_FLAG_SPORT,

^ permalink raw reply related

* [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Benoit Sigoure @ 2011-05-17  7:40 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber
  Cc: netdev, linux-kernel, Benoit Sigoure
In-Reply-To: <1305618020-72535-1-git-send-email-tsunanet@gmail.com>

Instead of hardcoding the initial RTO to 3s and requiring
the kernel to be recompiled to change it, expose it as a
sysctl that can be tuned at runtime.  Leave the default
value unchanged.

Signed-off-by: Benoit Sigoure <tsunanet@gmail.com>
---
 Documentation/networking/ip-sysctl.txt |    6 ++++++
 include/linux/sysctl.h                 |    1 +
 include/net/tcp.h                      |    3 ++-
 kernel/sysctl_binary.c                 |    1 +
 net/ipv4/syncookies.c                  |    2 +-
 net/ipv4/sysctl_net_ipv4.c             |   11 +++++++++++
 net/ipv4/tcp.c                         |    4 ++--
 net/ipv4/tcp_input.c                   |    8 ++++----
 net/ipv4/tcp_ipv4.c                    |    6 +++---
 net/ipv4/tcp_minisocks.c               |    6 +++---
 net/ipv4/tcp_output.c                  |    2 +-
 net/ipv4/tcp_timer.c                   |    9 +++++----
 net/ipv6/syncookies.c                  |    2 +-
 net/ipv6/tcp_ipv6.c                    |    6 +++---
 14 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d3d653a..c381c68 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -384,6 +384,12 @@ tcp_retries2 - INTEGER
 	RFC 1122 recommends at least 100 seconds for the timeout,
 	which corresponds to a value of at least 8.
 
+tcp_initial_rto - INTEGER
+	This value sets the initial retransmit timeout, that is how long
+	the kernel will wait before retransmitting the initial SYN packet.
+
+	RFC 1122 says that this SHOULD be 3 seconds, which is the default.
+
 tcp_rfc1337 - BOOLEAN
 	If set, the TCP stack behaves conforming to RFC1337. If unset,
 	we are not conforming to RFC, but prevent TCP TIME_WAIT
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 11684d9..96a9b41 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -425,6 +425,7 @@ enum
 	NET_TCP_ALLOWED_CONG_CONTROL=123,
 	NET_TCP_MAX_SSTHRESH=124,
 	NET_TCP_FRTO_RESPONSE=125,
+        NET_IPV4_TCP_INITIAL_RTO=126,
 };
 
 enum {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..a2bb0f1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -213,6 +213,7 @@ extern int sysctl_tcp_syn_retries;
 extern int sysctl_tcp_synack_retries;
 extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
+extern int sysctl_tcp_initial_rto;
 extern int sysctl_tcp_orphan_retries;
 extern int sysctl_tcp_syncookies;
 extern int sysctl_tcp_retrans_collapse;
@@ -295,7 +296,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
 static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
 {
 	unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
-	return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
+	return time_after(jiffies, last_overflow + sysctl_tcp_initial_rto);
 }
 
 extern struct proto tcp_prot;
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 3b8e028..d608d84 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -354,6 +354,7 @@ static const struct bin_table bin_net_ipv4_table[] = {
 	{ CTL_INT,	NET_IPV4_TCP_KEEPALIVE_INTVL,		"tcp_keepalive_intvl" },
 	{ CTL_INT,	NET_IPV4_TCP_RETRIES1,			"tcp_retries1" },
 	{ CTL_INT,	NET_IPV4_TCP_RETRIES2,			"tcp_retries2" },
+	{ CTL_INT,	NET_IPV4_TCP_INITIAL_RTO,		"tcp_initial_rto" },
 	{ CTL_INT,	NET_IPV4_TCP_FIN_TIMEOUT,		"tcp_fin_timeout" },
 	{ CTL_INT,	NET_TCP_SYNCOOKIES,			"tcp_syncookies" },
 	{ CTL_INT,	NET_TCP_TW_RECYCLE,			"tcp_tw_recycle" },
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8b44c6d..089bc92 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -186,7 +186,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
  * sysctl_tcp_retries1. It's a rather complicated formula (exponential
  * backoff) to compute at runtime so it's currently hardcoded here.
  */
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto + 1)
 /*
  * Check if a ack sequence number is a valid syncookie.
  * Return the decoded mss if it is, or 0 if not.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 321e6e8..24dc21d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -30,6 +30,8 @@ static int tcp_adv_win_scale_min = -31;
 static int tcp_adv_win_scale_max = 31;
 static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
+static int tcp_initial_rto_min = TCP_RTO_MIN;
+static int tcp_initial_rto_max = TCP_RTO_MAX;
 
 /* Update system visible IP port range */
 static void set_local_port_range(int range[2])
@@ -246,6 +248,15 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+        {
+		.procname       = "tcp_initial_rto",
+		.data           = &sysctl_tcp_initial_rto,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_initial_rto_min,
+		.extra2		= &tcp_initial_rto_max,
+	},
 	{
 		.procname	= "tcp_fin_timeout",
 		.data		= &sysctl_tcp_fin_timeout,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b22d450..e9e7c3f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2352,7 +2352,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	case TCP_DEFER_ACCEPT:
 		/* Translate value in seconds to number of retransmits */
 		icsk->icsk_accept_queue.rskq_defer_accept =
-			secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+			secs_to_retrans(val, sysctl_tcp_initial_rto / HZ,
 					TCP_RTO_MAX / HZ);
 		break;
 
@@ -2539,7 +2539,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		break;
 	case TCP_DEFER_ACCEPT:
 		val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
-				      TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
+				      sysctl_tcp_initial_rto / HZ, TCP_RTO_MAX / HZ);
 		break;
 	case TCP_WINDOW_CLAMP:
 		val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..39f6c27 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -890,7 +890,7 @@ static void tcp_init_metrics(struct sock *sk)
 	if (dst_metric(dst, RTAX_RTT) == 0)
 		goto reset;
 
-	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+	if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (sysctl_tcp_initial_rto << 3))
 		goto reset;
 
 	/* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +916,7 @@ static void tcp_init_metrics(struct sock *sk)
 		tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
 	}
 	tcp_set_rto(sk);
-	if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+	if (inet_csk(sk)->icsk_rto < sysctl_tcp_initial_rto && !tp->rx_opt.saw_tstamp) {
 reset:
 		/* Play conservative. If timestamps are not
 		 * supported, TCP will fail to recalculate correct
@@ -924,8 +924,8 @@ reset:
 		 */
 		if (!tp->rx_opt.saw_tstamp && tp->srtt) {
 			tp->srtt = 0;
-			tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
-			inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+			tp->mdev = tp->mdev_max = tp->rttvar = sysctl_tcp_initial_rto;
+			inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
 		}
 	}
 	tp->snd_cwnd = tcp_init_cwnd(tp, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..21920e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1383,7 +1383,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	    want_cookie)
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
 	return 0;
 
 drop_and_release:
@@ -1834,8 +1834,8 @@ static int tcp_v4_init_sock(struct sock *sk)
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
 
-	icsk->icsk_rto = TCP_TIMEOUT_INIT;
-	tp->mdev = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto = sysctl_tcp_initial_rto;
+	tp->mdev = sysctl_tcp_initial_rto;
 
 	/* So many TCP implementations out there (incorrectly) count the
 	 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 80b1f80..c63ffa0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -472,8 +472,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		tcp_init_wl(newtp, treq->rcv_isn);
 
 		newtp->srtt = 0;
-		newtp->mdev = TCP_TIMEOUT_INIT;
-		newicsk->icsk_rto = TCP_TIMEOUT_INIT;
+		newtp->mdev = sysctl_tcp_initial_rto;
+		newicsk->icsk_rto = sysctl_tcp_initial_rto;
 
 		newtp->packets_out = 0;
 		newtp->retrans_out = 0;
@@ -582,7 +582,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			 * it can be estimated (approximately)
 			 * from another data.
 			 */
-			tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans);
+			tmp_opt.ts_recent_stamp = get_seconds() - ((sysctl_tcp_initial_rto/HZ)<<req->retrans);
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 17388c7..e34b0f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2599,7 +2599,7 @@ static void tcp_connect_init(struct sock *sk)
 	tp->rcv_wup = 0;
 	tp->copied_seq = 0;
 
-	inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+	inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
 	inet_csk(sk)->icsk_retransmits = 0;
 	tcp_clear_retrans(tp);
 }
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..b9da62b 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_probes __read_mostly = TCP_KEEPALIVE_PROBES;
 int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
 int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
 int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+int sysctl_tcp_initial_rto __read_mostly = TCP_TIMEOUT_INIT;
 int sysctl_tcp_orphan_retries __read_mostly;
 int sysctl_tcp_thin_linear_timeouts __read_mostly;
 
@@ -135,8 +136,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 
 /* This function calculates a "timeout" which is equivalent to the timeout of a
  * TCP connection after "boundary" unsuccessful, exponentially backed-off
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
+ * retransmissions with an initial RTO of TCP_RTO_MIN or
+ * sysctl_tcp_initial_rto if syn_set flag is set.
  */
 static bool retransmits_timed_out(struct sock *sk,
 				  unsigned int boundary,
@@ -144,7 +145,7 @@ static bool retransmits_timed_out(struct sock *sk,
 				  bool syn_set)
 {
 	unsigned int linear_backoff_thresh, start_ts;
-	unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+	unsigned int rto_base = syn_set ? sysctl_tcp_initial_rto : TCP_RTO_MIN;
 
 	if (!inet_csk(sk)->icsk_retransmits)
 		return false;
@@ -495,7 +496,7 @@ out_unlock:
 static void tcp_synack_timer(struct sock *sk)
 {
 	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
-				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+				   sysctl_tcp_initial_rto, TCP_RTO_MAX);
 }
 
 void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req)
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 352c260..50baaec 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -45,7 +45,7 @@ static __u16 const msstab[] = {
  * sysctl_tcp_retries1. It's a rather complicated formula (exponential
  * backoff) to compute at runtime so it's currently hardcoded here.
  */
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto + 1)
 
 static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 					   struct request_sock *req,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4f49e5d..7e791e6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1349,7 +1349,7 @@ have_isn:
 	    want_cookie)
 		goto drop_and_free;
 
-	inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+	inet6_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
 	return 0;
 
 drop_and_release:
@@ -1957,8 +1957,8 @@ static int tcp_v6_init_sock(struct sock *sk)
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
 
-	icsk->icsk_rto = TCP_TIMEOUT_INIT;
-	tp->mdev = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto = sysctl_tcp_initial_rto;
+	tp->mdev = sysctl_tcp_initial_rto;
 
 	/* So many TCP implementations out there (incorrectly) count the
 	 * initial SYN frame in their delayed-ACK and congestion control
-- 
1.7.0.4

^ permalink raw reply related

* [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Benoit Sigoure @ 2011-05-17  7:40 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber; +Cc: netdev, linux-kernel

Hi,
it's not easy to change the initial RTO of TCP as right now you need to
recompile your kernel.  In order to make it easier to tune this setting,
I was wondering whether you would consider turning it into a sysctl.  I
attached a first attempt at a patch that does this -- this is my first
patch to the Linux kernel so although I've read SubmitChecklist and
SubmittingPatches, and I've run checkpatch.pl, please let me know if I'm
doing something wrong.

I am doing this because I work in a high-throughput low-latency environment
(line-rate GbE with submillisecond RTT) and some of our clients are negatively
affected by the high initial RTO when the servers are unable to accept() new
connections fast enough.  While we're working on fixing these servers and/or
giving them larger backlog queues when they listen(), being able to tune
the initial RTO at runtime would be very useful as quick workaround for the
server-side issues.

Some large Internet websites are also running with a more aggressive initial
RTO, for instance Google documented some of what they're doing here:
  http://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
While I'm not arguing to change the default value at this time, I believe
that this patch would also come in handy for those who wish to experiment
with various values in their environment.

If you're willing to consider this patch, bear in mind I only compiled it,
I didn't test it yet (not knowing whether you'd want something like that or
not).  I would also appreciate if anyone had any insight on what I did with
`COUNTER_TRIES' in `syncookies.c' as this magic constant is rather mysterious
and the comment didn't help me figure out how it had been derived.  I couldn't
find anything online and git blame didn't help me either (it pre-dates Git).

^ permalink raw reply

* Re: Null pointer dereference in icmp_send
From: Aristide Fattori @ 2011-05-17  7:31 UTC (permalink / raw)
  To: Eric Dumazet, netdev, roberto.paleari
In-Reply-To: <1305581934.9466.11.camel@edumazet-laptop>

Hi Eric,

On Mon, May 16, 2011 at 11:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 16 mai 2011 à 23:27 +0200, Eric Dumazet a écrit :
>> You forgot to tell us which linux version you used ?

Sorry about that, we verified the bug in versions 2.6.38.3, 2.6.38.4
and 2.6.38.6, but we didn't check the svn version :-)

> ...
> But if a timeout occurs, we take first queued fragment to build one ICMP
> TIME EXCEEDED message. Problem is all queued skb have weak dst pointers,
> ...

At a first glance, it appears we identified the same bug you already
fixed. If you want to double check, we can send you the "long version
report" that we submitted at first to security@kernel.org and
david@davemloft.net.

> It is scheduled for linux-2.6.38 stable tree as well

Good job! :-)

-- 
GnuPG Key on keyserver.pgp.com ID 0x25578128
http://security.dico.unimi.it/~joystick/

^ permalink raw reply

* Re: [PATCH] CDC NCM: release interfaces fix in unbind()
From: Oliver Neukum @ 2011-05-17  6:26 UTC (permalink / raw)
  To: Alexey ORISHKO
  Cc: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	gregkh-l3A5Bk7waGM@public.gmane.org
In-Reply-To: <2AC7D4AD8BA1C640B4C60C61C8E520153E3A93E8EE-8ZTw5gFVCTjVH5byLeRTJxkTb7+GphCuwzqs5ZKRSiY@public.gmane.org>

Am Dienstag, 17. Mai 2011, 02:52:34 schrieb Alexey ORISHKO:

> Some confusion was added by other driver code that handles crippled
> devices, which mix ctrl and data interfaces... Shall not be the case here.
> I see, that some clean up can be done and "xxx_claimed" fields in context
> structure can be removed, so will post a new version of patch shortly.

This code is a bit tricky in those drivers because you cannot assume a certain
order of calls to disconnect() w.r.t. the interfaces. If you claim multiple
interfaces, you need to release all other interfaces, as soon as any interface
is disconnected.

	Regards
		Oliver
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: tap/bridge: Dropping NETIF_F_GSO/NETIF_F_SG
From: Herbert Xu @ 2011-05-17  6:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, mirqus, shanwei, mirq-linux, netdev, bhutchings
In-Reply-To: <20110517054823.GB26414@redhat.com>

On Tue, May 17, 2011 at 08:48:23AM +0300, Michael S. Tsirkin wrote:
>
> You mean like bool quiet?

Yes when tun calls netdev_update_features we expect it to contain
bogus combinations so all warnings should disappear.

Everybody should still need maintain sanity.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-17  6:21 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <1305588738.3456.65.camel@localhost.localdomain>

On Mon, May 16, 2011 at 04:32:18PM -0700, Shirley Ma wrote:
> Hello Michael,
> 
> Looks like to use a new flag requires more time/work.
> I am thinking
> whether we can just use HIGHDMA flag to enable zero-copy in macvtap to
> avoid the new flag for now since mavctap uses real NICs as lower device?
> 
> Thanks
> Shirley

Problem is, in your patch there are a set of restrictions on what the
device can do with the skb that we need to enforce somehow.
Also, how do we know it's a 'real NIC' and not a software device?

-- 
MST

^ permalink raw reply

* Re: [PATCH 09/18] virtio: use avail_event index
From: Michael S. Tsirkin @ 2011-05-17  6:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87ei3zdsq2.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

On Mon, May 16, 2011 at 04:42:21PM +0930, Rusty Russell wrote:
> On Sun, 15 May 2011 16:55:41 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, May 09, 2011 at 02:03:26PM +0930, Rusty Russell wrote:
> > > On Wed, 4 May 2011 23:51:47 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > Use the new avail_event feature to reduce the number
> > > > of exits from the guest.
> > > 
> > > Figures here would be nice :)
> > 
> > You mean ASCII art in comments?
> 
> I mean benchmarks of some kind.

:)

> > 
> > > > @@ -228,6 +237,12 @@ add_head:
> > > >  	 * new available array entries. */
> > > >  	virtio_wmb();
> > > >  	vq->vring.avail->idx++;
> > > > +	/* If the driver never bothers to kick in a very long while,
> > > > +	 * avail index might wrap around. If that happens, invalidate
> > > > +	 * kicked_avail index we stored. TODO: make sure all drivers
> > > > +	 * kick at least once in 2^16 and remove this. */
> > > > +	if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
> > > > +		vq->kicked_avail_valid = true;
> > > 
> > > If they don't, they're already buggy.  Simply do:
> > >         WARN_ON(vq->vring.avail->idx == vq->kicked_avail);
> > 
> > Hmm, but does it say that somewhere?
> 
> AFAICT it's a corollary of:
> 1) You have a finite ring of size <= 2^16.
> 2) You need to kick the other side once you've done some work.

Well one can imagine a driver doing:

	while (virtqueue_get_buf()) {
		virtqueue_add_buf()
	}
	virtqueue_kick()

which looks sensible (batch kicks) but might
process any number of bufs between kicks.

If we look at drivers closely enough, I think none
of them do the equivalent of the above, but not 100% sure.


> > > > @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
> > > >  			break;
> > > >  		case VIRTIO_RING_F_USED_EVENT_IDX:
> > > >  			break;
> > > > +		case VIRTIO_RING_F_AVAIL_EVENT_IDX:
> > > > +			break;
> > > >  		default:
> > > >  			/* We don't understand this bit. */
> > > >  			clear_bit(i, vdev->features);
> > > 
> > > Does this belong in a prior patch?
> > > 
> > > Thanks,
> > > Rusty.
> > 
> > Well if we don't support the feature in the ring we should not
> > ack the feature, right?
> 
> Ah, you're right.
> 
> Thanks,
> Rusty.

^ permalink raw reply

* Re: [PATCH 06/18] virtio_ring: avail event index interface
From: Michael S. Tsirkin @ 2011-05-17  6:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87k4drduzs.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

On Mon, May 16, 2011 at 03:53:19PM +0930, Rusty Russell wrote:
> On Sun, 15 May 2011 15:47:27 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, May 09, 2011 at 01:43:15PM +0930, Rusty Russell wrote:
> > > On Wed, 4 May 2011 23:51:19 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > >  #define VIRTIO_RING_F_USED_EVENT_IDX	29
> > > > +/* The Host publishes the avail index for which it expects a kick
> > > > + * at the end of the used ring. Guest should ignore the used->flags field. */
> > > > +#define VIRTIO_RING_F_AVAIL_EVENT_IDX	32
> > > 
> > > Are you really sure we want to separate the two?  Seems a little simpler
> > > to have one bit to mean "we're publishing our threshold".  For someone
> > > implementing this from scratch, it's a little simpler.
> > > 
> > > Or are there cases where the old style makes more sense?
> > > 
> > > Thanks,
> > > Rusty.
> > 
> > Hmm, it makes debugging easier as each side can disable
> > publishing separately - I used it all the time when I saw
> > e.g. networking stuck to guess whether I need to investigate the
> > interrupt or the exit handling.
> > 
> > But I'm not hung up on this.
> > 
> > Let me know pls.
> 
> If we combine them into one, then these patches no longer depend on
> the feature bit expansion, which is worthwhile (though I'll take both).
> 
> Thanks,
> Rusty.

Yes, I know. But if we do expand feature bits anyway, for debugging
and profiling if nothing else it's useful to have them separate ...
If you take both why does the order matter?

-- 
MST

^ permalink raw reply

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-17  5:55 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305606683.10756.3.camel@localhost.localdomain>

On Mon, May 16, 2011 at 09:31:23PM -0700, Shirley Ma wrote:
> On Tue, 2011-05-17 at 00:24 +0300, Michael S. Tsirkin wrote:
> > On Mon, May 16, 2011 at 01:56:54PM -0700, Shirley Ma wrote:
> > > On Mon, 2011-05-16 at 23:45 +0300, Michael S. Tsirkin wrote:
> > > > > +/* Since we need to keep the order of used_idx as avail_idx,
> > it's
> > > > possible that
> > > > > + * DMA done not in order in lower device driver for some
> > reason. To
> > > > prevent
> > > > > + * used_idx out of order, upend_idx is used to track avail_idx
> > > > order, done_idx
> > > > > + * is used to track used_idx order. Once lower device DMA done,
> > > > then upend_idx
> > > > > + * can move to done_idx.
> > > > 
> > > > Could you clarify this please? virtio explicitly allows out of
> > order
> > > > completion of requests. Does it simplify code that we try to keep
> > > > used index updates in-order? Because if not, this is not
> > > > really a requirement.
> > > 
> > > Hello Mike,
> > > 
> > > Based on my testing, vhost_add_used() must be in order from
> > > vhost_get_vq_desc(). Otherwise, virtio_net ring seems get double
> > > freed.
> > 
> > Double-freed or you get NULL below?
> 
> More likely is NULL.
> 
> > > I didn't spend time on debugging this.
> > > 
> > > in virtqueue_get_buf
> > > 
> > >         if (unlikely(!vq->data[i])) {
> > >                 BAD_RING(vq, "id %u is not a head!\n", i);
> > >                 return NULL;
> > >         }
> > 
> > Yes but i used here is the head that we read from the
> > ring, not the ring index itself.
> >         i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id
> > we must complete any id only once, but in any order.
> 
> It is in any order of ring id, but must be in the order of "head"
> returns  from  vhost_get_vq_desc(), any clue?
> 
> 
> Thanks
> Shirley


Something in your patch that overwrites the id in vhost
and makes it put the wrong id in the used ring?

By the way, need to keep in mind that a guest can
give us the same head twice, need to make sure this
at least does not corrupt host memory.

-- 
MST

^ permalink raw reply

* Re: tap/bridge: Dropping NETIF_F_GSO/NETIF_F_SG
From: Michael S. Tsirkin @ 2011-05-17  5:48 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, mirqus, shanwei, mirq-linux, netdev, bhutchings
In-Reply-To: <20110517052401.GA15198@gondor.apana.org.au>

On Tue, May 17, 2011 at 03:24:01PM +1000, Herbert Xu wrote:
> On Tue, May 17, 2011 at 08:18:45AM +0300, Michael S. Tsirkin wrote:
> >
> > Hmm, we get the warnings about bits dropped on each set_offload
> > call:
> > 	netdev_update_features is called,
> > 	that calls netdev_fix_features
> > 
> > No?
> 
> I presume this has changed recently as the current Linus tree
> does not contain a call to netdev_update_features, it only calls
> netdev_features_change.
> 
> If so then this does ensure that the feature bits will be sane.
> 
> I still think the warnings should be retained at their previous
> level in the call paths other than set_offload.  All we have to
> do is add a parameter to netdev_update_features and co.
> 
> Cheers,

You mean like bool quiet?

-- 
MST

^ permalink raw reply

* [PATCH 2/2] netfilter: nf_ct_sip: fix SDP parsing in TCP SIP messages for some Cisco phones
From: kaber @ 2011-05-17  5:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1305610014-3056-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

Some Cisco phones do not place the Content-Length field at the end of the
SIP message. This is valid, due to a misunderstanding of the specification
the parser expects the SDP body to start directly after the Content-Length
field. Fix the parser to scan for \r\n\r\n to locate the beginning of the
SDP body.

Reported-by: Teresa Kang <teresa_kang@gemtek.com.tw>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netfilter/nf_conntrack_sip.c |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 1f81abd..c05c0dc 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -1419,6 +1419,7 @@ static int sip_help_tcp(struct sk_buff *skb, unsigned int protoff,
 	const char *dptr, *end;
 	s16 diff, tdiff = 0;
 	int ret = NF_ACCEPT;
+	bool term;
 	typeof(nf_nat_sip_seq_adjust_hook) nf_nat_sip_seq_adjust;
 
 	if (ctinfo != IP_CT_ESTABLISHED &&
@@ -1453,10 +1454,15 @@ static int sip_help_tcp(struct sk_buff *skb, unsigned int protoff,
 		if (dptr + matchoff == end)
 			break;
 
-		if (end + strlen("\r\n\r\n") > dptr + datalen)
-			break;
-		if (end[0] != '\r' || end[1] != '\n' ||
-		    end[2] != '\r' || end[3] != '\n')
+		term = false;
+		for (; end + strlen("\r\n\r\n") <= dptr + datalen; end++) {
+			if (end[0] == '\r' && end[1] == '\n' &&
+			    end[2] == '\r' && end[3] == '\n') {
+				term = true;
+				break;
+			}
+		}
+		if (!term)
 			break;
 		end += strlen("\r\n\r\n") + clen;
 
-- 
1.7.2.3


^ permalink raw reply related

* [PATCH 1/2] netfilter: nf_ct_sip: validate Content-Length in TCP SIP messages
From: kaber @ 2011-05-17  5:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1305610014-3056-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

Verify that the message length of a single SIP message, which is calculated
based on the Content-Length field contained in the SIP message, does not
exceed the packet boundaries.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netfilter/nf_conntrack_sip.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index bcf47eb..1f81abd 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -1461,6 +1461,8 @@ static int sip_help_tcp(struct sk_buff *skb, unsigned int protoff,
 		end += strlen("\r\n\r\n") + clen;
 
 		msglen = origlen = end - dptr;
+		if (msglen > datalen)
+			return NF_DROP;
 
 		ret = process_sip_msg(skb, ct, dataoff, &dptr, &msglen);
 		if (ret != NF_ACCEPT)
-- 
1.7.2.3


^ permalink raw reply related

* [PATCH 0/2] netfilter: SIP conntrack fixes
From: kaber @ 2011-05-17  5:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

Hi Dave,

following are two fixes for the SIP connection tracking helper:

- missing validation of the Content-Length field, which is used to calculate
  the end of the SDP body

- incorrect parsing of the SIP message, resulting in a failure to locate
  the SDP body when the Content-Length field is not the last member of the
  SIP message

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.git master

Thanks!

^ permalink raw reply

* Re: tap/bridge: Dropping NETIF_F_GSO/NETIF_F_SG
From: Herbert Xu @ 2011-05-17  5:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, mirqus, shanwei, mirq-linux, netdev, bhutchings
In-Reply-To: <20110517051845.GA26414@redhat.com>

On Tue, May 17, 2011 at 08:18:45AM +0300, Michael S. Tsirkin wrote:
>
> Hmm, we get the warnings about bits dropped on each set_offload
> call:
> 	netdev_update_features is called,
> 	that calls netdev_fix_features
> 
> No?

I presume this has changed recently as the current Linus tree
does not contain a call to netdev_update_features, it only calls
netdev_features_change.

If so then this does ensure that the feature bits will be sane.

I still think the warnings should be retained at their previous
level in the call paths other than set_offload.  All we have to
do is add a parameter to netdev_update_features and co.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: tap/bridge: Dropping NETIF_F_GSO/NETIF_F_SG
From: Michael S. Tsirkin @ 2011-05-17  5:18 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, mirqus, shanwei, mirq-linux, netdev, bhutchings
In-Reply-To: <20110516234538.GA11832@gondor.apana.org.au>

On Tue, May 17, 2011 at 09:45:38AM +1000, Herbert Xu wrote:
> On Mon, May 16, 2011 at 07:06:15PM -0400, David Miller wrote:
> >
> > Well the check has to exist somewhere.
> > 
> > Currently userspace can configure tun/tap into whatever set
> > of offloads it likes.
> > 
> > We're warning when the user asks for something that needs to be
> > corrected.  So the only thing you can suggest is to duplicate these
> > changes in the tun/tap driver.
> > 
> > But if we do that, and error on bad combinations instead of fixing
> > them up, we know from this discussion that existing virtualization
> > setups and tools are going to stop working.
> 
> Yeah the tun driver is simply busted.  We should never have allowed
> user-space to tweak the feature bits like this.  Instead they should
> have gone through the ethtool interface like everyone else, or at
> least use the same underlying calls as ethtool.
> 
> Actually, I think we can still do that, and apply the same rules
> as ethtool with respect to automatically turning things on/off.
> 
> AFAICS the current set_offload in tun.c does not call anything
> that verifies/fixes up the settings.  If you change the feature
> bits after registering the tun device it may never get fixed up
> at all.

Hmm, we get the warnings about bits dropped on each set_offload
call:
	netdev_update_features is called,
	that calls netdev_fix_features

No?

> Allowing an unprivileged user to tweak feature bits directly with
> no verification is just wrong.
> 
> Cheers,

But we do verify bits, and only allow the user
to tweak these ones:

#define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO|
\
                          NETIF_F_TSO6|NETIF_F_UFO)

No?

> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] 2.6.38 ENC28J60 works with half-duplex DMA
From: Davide Rizzo @ 2011-05-17  5:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110516.114855.1523183530208158354.davem@davemloft.net>

I'm sorry for that, initially It was refused by your system for format
problems (html?), then I didn't see them in the archives, so I though
some problem happened and resent it. I didn't receive it back from the
list.
Where can I see a message list to have a confirmation that it was received ?

2011/5/16 David Miller <davem@davemloft.net>:
>
> You've sent this patch 3 times, we've noticed and seen them all,
> our silence doesn't mean it hasn't been received.
>
> What new thing do you hope to accomplish on each subsequent
> submission?
>

-- 
All difficult problems have a simple solution. That is wrong. [A. Einstein]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox