Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 00/18] netfilter: IPv6 NAT
From: Pablo Neira Ayuso @ 2012-08-31  9:29 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netfilter-devel, netdev
In-Reply-To: <1345434006-16549-1-git-send-email-kaber@trash.net>

On Mon, Aug 20, 2012 at 05:39:48AM +0200, Patrick McHardy wrote:
> Following is the latest IPv6 NAT patchset, based on -rc2.
> 
> Changes since the last posting include:
> 
> - removal of fixes already merged into nf-2.6, not included upstream
>   yet though, IPv6 SIP will not work
> 
> - missing module alias for ip6t_DNAT/ip6t_SNAT added to xt_nat
> 
> - compilation fix for ip6t_MASQUERADE without ipt_MASQUERADE enabled
> 
> - proper handling of untracked packets in IPv6 fragmentation improvements
> 
> - byteorder fixes in IPv6 fragment handling
> 
> - conversion to use a mutex for NAT L3/L4 protocol registration
> 
> - missing memory barrier before publishing L4 protocol array
> 
> - proper cleanup of NATed connections after NAT L3/L4 protocol unregistration
> 
> - proper locking (RTNL) for walking network namespaces in NAT cleanup
> 
> - only free NAT L4 protocol array at shutdown instead of during L3 protocol
>   unregistation, otherwise non built-in protocols are missing after L3
>   protocol reload
> 
> - endianess annotation fixes
> 
> - missing netlink policy for IPv6 NAT attributes
> 
> I'll also push these patches to github, but it is currently acting up,
> once it works again they'll be available at
> 
> git://github.com/kaber/nf-nat-ipv6.git master

Pulled. Thanks a lot Patrick.

And I fixed the log description that Jesper mentioned. You'll have to
rebase though, sorry.

^ permalink raw reply

* Re: [BUG]  TIPC handling of -ERESTARTSYS in connect()
From: Ying Xue @ 2012-08-31  9:37 UTC (permalink / raw)
  To: Chris Friesen; +Cc: netdev, Allan Stephens, Jon Maloy
In-Reply-To: <503F84CF.208@genband.com>

[-- Attachment #1: Type: text/plain, Size: 1346 bytes --]

Hi Chris,

Although this is a known issue, still thanks for your report.

Regardless of 1.7.7 or mainline, the issue really exists.
Can you please check and verify the attached patch?

PS: the patch is based on 1.7.7 rather than mainline.

Thanks,
Ying

Chris Friesen wrote:
> Hi,
>
> I'm seeing some behaviour that looks unintentional in the TIPC connect() call.
> I'm running TIPC 1.7.7.  My userspace code (stripped of error handling) looks
> like this:
>
> 	int sd = socket (AF_TIPC, SOCK_SEQPACKET,0);
> 	connect(sd,(struct sockaddr*)&topsrv,sizeof(topsrv));
>
> where topsrv is the address of the TIPC topology server.  The thing that's weird
> is that intermittently we get a EISCONN error on the connect call.
>
> Looking at the TIPC connect() code, I think what is happening is that
> wait_event_interruptible_timeout() is being interrupted by a signal and returns
> -ERESTARTSYS.  This sets sock->state to SS_DISCONNECTING and exits.  Userspace
> sees the ERESTARTSYS and retries the syscall, then we hit the
> "sock->state != SS_UNCONNECTED" check and exit with -EISCONN.
>
> I think current mainline is susceptible to this as well.
>
> I'm not sure what the proper fix would be--can we detect coming back in that we were
> waiting for a message and just skip down to the wait_event_interruptible_timeout()
> call?
>
> Chris
>
>
>   


[-- Attachment #2: 0001-tipc-correctly-handle-ERESTARTSYS-return-value-of-co.patch --]
[-- Type: text/x-diff, Size: 3842 bytes --]

>From d1f38c0016bf9e1c24aa9c41efca857b6a302b4e Mon Sep 17 00:00:00 2001
From: Ying Xue <ying.xue@windriver.com>
Date: Fri, 31 Aug 2012 17:26:13 +0800
Subject: [PATCH] tipc: correctly handle -ERESTARTSYS return value of connect()

Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/tipc_socket.c |  109 +++++++++++++++++++++++++-----------------------
 1 files changed, 57 insertions(+), 52 deletions(-)

diff --git a/net/tipc/tipc_socket.c b/net/tipc/tipc_socket.c
index c135edf..aa5d57f 100644
--- a/net/tipc/tipc_socket.c
+++ b/net/tipc/tipc_socket.c
@@ -1456,21 +1456,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* Issue Posix-compliant error code if socket is in the wrong state */
-
-	if (sock->state == SS_LISTENING) {
-		res = -EOPNOTSUPP;
-		goto exit;
-	}
-	if (sock->state == SS_CONNECTING) {
-		res = -EALREADY;
-		goto exit;
-	}
-	if (sock->state != SS_UNCONNECTED) {
-		res = -EISCONN;
-		goto exit;
-	}
-
 	/*
 	 * Reject connection attempt using multicast address
 	 *
@@ -1483,54 +1468,74 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* Reject any messages already in receive queue (very unlikely) */
-
-	reject_rx_queue(sk);
+	switch (sock->state) {
+	case SS_UNCONNECTED:
+		/* Reject any messages already in receive queue
+		 * (very unlikely) */
+		reject_rx_queue(sk);
 
-	/* Send a 'SYN-' to destination */
+		/* Send a 'SYN-' to destination */
+		m.msg_name = dest;
+		m.msg_namelen = destlen;
+		res = send_msg(NULL, sock, &m, 0);
+		if (res < 0) {
+			goto exit;
+		}
 
-	m.msg_name = dest;
-	m.msg_namelen = destlen;
-	res = send_msg(NULL, sock, &m, 0);
-	if (res < 0) {
-		goto exit;
+		/*
+		 * Just entered SS_CONNECTING state; the only
+		 * difference is that return value in non-blocking
+		 * case is EINPROGRESS, rather than EALREADY.
+		 */
+		res = -EINPROGRESS;
+		break;
+	case SS_CONNECTING:
+		res = -EALREADY;
+		break;
+	case SS_CONNECTED:
+		res = -EISCONN;
+		break;
+	default:
+		res = -EINVAL;
+		break;
 	}
 
-	/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
-
-	timeout = tipc_sk(sk)->conn_timeout;
-	release_sock(sk);
-	res = wait_event_interruptible_timeout(*sk->sk_sleep,
-			(!skb_queue_empty(&sk->sk_receive_queue) ||
-			(sock->state != SS_CONNECTING)),
-			timeout ? (long)msecs_to_jiffies(timeout)
-				: MAX_SCHEDULE_TIMEOUT);
-	lock_sock(sk);
+	if (sock->state == SS_CONNECTING) {
+		/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
+		timeout = tipc_sk(sk)->conn_timeout;
+		release_sock(sk);
+		res = wait_event_interruptible_timeout(*sk->sk_sleep,
+				(!skb_queue_empty(&sk->sk_receive_queue) ||
+				(sock->state != SS_CONNECTING)),
+				timeout ? (long)msecs_to_jiffies(timeout)
+					: MAX_SCHEDULE_TIMEOUT);
+		lock_sock(sk);
 
-	if (res > 0) {
-		buf = skb_peek(&sk->sk_receive_queue);
-		if (buf != NULL) {
-			msg = buf_msg(buf);
-			res = auto_connect(sock, msg);
-			if (!res) {
-				if (!msg_data_sz(msg))
-					advance_rx_queue(sk);
+		if (res > 0) {
+			buf = skb_peek(&sk->sk_receive_queue);
+			if (buf != NULL) {
+				msg = buf_msg(buf);
+				res = auto_connect(sock, msg);
+				if (!res) {
+					if (!msg_data_sz(msg))
+						advance_rx_queue(sk);
+				}
+			} else {
+				if (sock->state == SS_CONNECTED) {
+					res = -EISCONN;
+				} else {
+					res = -ECONNREFUSED;
+				}
 			}
 		} else {
-			if (sock->state == SS_CONNECTED) {
-				res = -EISCONN;
+			if (res == 0) {
+				sock->state = SS_DISCONNECTING;
+				res = -ETIMEDOUT;
 			} else {
-				res = -ECONNREFUSED;
+				; /* leave "res" unchanged */
 			}
 		}
-	} else {
-		if (res == 0)
-			res = -ETIMEDOUT;
-		else
-			; /* leave "res" unchanged */
-		sock->state = SS_DISCONNECTING;
 	}
-
 exit:
 	release_sock(sk);
 	return res;
-- 
1.7.1


^ permalink raw reply related

* Re: [PATCH 2/3] ipvs: Fix faulty IPv6 extension  header handling in IPVS
From: Jesper Dangaard Brouer @ 2012-08-31 10:22 UTC (permalink / raw)
  To: Patrick McHardy, Hideaki YOSHIFUJI
  Cc: Hans Schillstrom, Hans Schillstrom, netdev, lvs-devel,
	Julian Anastasov, Simon Horman, netfilter-devel
In-Reply-To: <Pine.GSO.4.63.1208291429160.26100@stinky-local.trash.net>

On Wed, 2012-08-29 at 14:34 +0200, Patrick McHardy wrote:
> On Wed, 29 Aug 2012, Jesper Dangaard Brouer wrote:
> > On Wed, 2012-08-29 at 11:47 +0200, Hans Schillstrom wrote:
> >>>
> >>> On Mon, 2012-08-27 at 14:02 +0200, Patrick McHardy wrote:
> >>>> On Mon, 27 Aug 2012, Hans Schillstrom wrote:
> >>>>
> >>>>>>>> How about we change netfilter to set up the skb's transport header
> >>>>>>>> at an early time so we can avoid all (most of) these header scans
> >>>>>>>> in netfilter?

[...cut...]

> >>>> I guess inet6_skb_parm will be at least slightly more popular than
> >>>> adding it to the skb itself. The netfilter pointers are all used for
> >>>> optional things, so we can't really add it to any of those.
> >>>

[...cut...]

> >> Should we give it a try to put it in inet6_skb_parm
> >> and minimize what we put there ?
> >> I think it could be worth it.
> >
> > Okay, but then I do need some help and guidance, especially from
> > Patrick, think.
> >
> > First of all, where in the netfilter code, should we update the new
> > fields in inet6_skb_parm?
> 
> Good question. I think we'd need at least three spots since every one
> of these subsystems can be used indepedently from each other:
> 
> - conntrack/IPVS: PRE_ROUTING/LOCAL_OUT at lowest priority
> - ip6tables: first time packet hits ip6t_do_table()?

I've been looking at the code for ip6t_do_table() and it already calls
ipv6_find_hdr().  ip6t_do_table() calls ip6_packet_match()

And ip6_packet_match() already calls
  ipv6_find_hdr(skb, protoff, -1, &_frag_off, NULL);
but only if((ip6info->flags & IP6T_F_PROTO))

ip6t_do_table() uses the data found by
ipv6_find_hdr()/ip6_packet_match() and updates 'struct xt_action_param
acpar' (which is passed on to all netfilter modules/functions as 'par')

 protohdr = ipv6_find_hdr(skb, protoff, -1, &_frag_off, NULL)
 *fragoff = _frag_off;

 par->thoff   = protoff  /* thoff = Transport Header Offset */
 par->fragoff = fragoff  /* frag indicator and fragment offset */

The returned protocol (protohdr) is only used inside
ip6_packet_match(), thus the info on the protocol is lost.

(Side note) Saving the protocol could be useful for, the following
modules, as they call ipv6_find_hdr() once again to extract this:

 net/netfilter/xt_TPROXY.c: function tproxy_tg6_v1()
 net/netfilter/xt_socket.c: function socket_mt6_v1()

Thus, the netfilter framework already have this information available.
It just uses the 'struct xt_action_param par' to carry this
information, to its modules.

Mine and Hans's patch are basically introducing the same thing for IPVS,
only that this information is carried via 'struct ip_vs_iphdr'.

I don't know, if its worth to store this information in
inet6_skb_parm/IP6CB ?

I guess, to would make sense to store 'thoff' transport header offset,
especially for IPv6, given the extension headers.

But how many (code) users are there?
Is it only Netfilter and IPVS that want to look at the port numbers?

There also seems to a lot of users of "ipv6_skip_exthdr", which could
benefit?  But I simply don't know the IPv6 code well enough...


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer



^ permalink raw reply

* Re: route.c:645 suspicious rcu_dereference_check()
From: Eric Dumazet @ 2012-08-31 11:50 UTC (permalink / raw)
  To: David Miller; +Cc: proski, netdev
In-Reply-To: <20120830.133433.566450758375930776.davem@davemloft.net>

On Thu, 2012-08-30 at 13:34 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 28 Aug 2012 15:33:07 -0700
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > [PATCH] ipv4: must use rcu protection while calling fib_lookup
> > 
> > Following lockdep splat was reported by Pavel Roskin :
>  ...
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Reported-by: Pavel Roskin <proski@gnu.org>
> 
> Applied, thanks.
> 
> It looks like the redirect handlers might have the same problem?

Hi David

Correct me if I am wrong, but redirect handlers should all run under
rcu_read_lock() protection already.

rcu_read_lock() is done in ip_local_deliver_finish() or
ip_rt_send_redirect() for the forward path.

And above of them, we also have rcu_read_lock() done in
__netif_receive_skb()

^ permalink raw reply

* Possible issue with Mellanox be2net/port handling
From: Marcelo Ricardo Leitner @ 2012-08-31 12:25 UTC (permalink / raw)
  To: netdev; +Cc: dledford

Hi,

Commit 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=4c41b3673759d096106e68bce586f103c51d4119 
inserted changes like:

@@ -361,7 +361,7 @@ static int add_promisc_qp(struct mlx4_dev *dev, u8 port,
         int err;
         struct mlx4_priv *priv = mlx4_priv(dev);

-       s_steer = &mlx4_priv(dev)->steer[0];
+       s_steer = &mlx4_priv(dev)->steer[port - 1];

         mutex_lock(&priv->mcg_table.mutex);

But I fear we missed one part of the deal. Concept patch:

@@ -365,7 +365,7 @@ static int add_promisc_qp(struct mlx4_dev *dev, u8 port,

         mutex_lock(&priv->mcg_table.mutex);

-       if (get_promisc_qp(dev, 0, steer, qpn)) {
+       if (get_promisc_qp(dev, port - 1, steer, qpn)) {
                 err = 0;  /* Noting to do, already exists */
                 goto out_mutex;
         }

Because:

static int add_promisc_qp(struct mlx4_dev *dev, u8 port,
               enum mlx4_steer_type steer, u32 qpn)
{
...
A)  s_steer = &mlx4_priv(dev)->steer[port - 1];

     mutex_lock(&priv->mcg_table.mutex);

     if (get_promisc_qp(dev, 0, steer, qpn)) {
         err = 0;  /* Noting to do, already exists */
         goto out_mutex;
     }
...
     /* add the new qpn to list of promisc qps */
C)  list_add_tail(&pqp->list, &s_steer->promisc_qps[steer]);
...
}

static struct mlx4_promisc_qp *get_promisc_qp(struct mlx4_dev *dev, u8 
pf_num,
                           enum mlx4_steer_type steer,
                           u32 qpn)
{
B)  struct mlx4_steer *s_steer = &mlx4_priv(dev)->steer[pf_num];
     struct mlx4_promisc_qp *pqp;

     list_for_each_entry(pqp, &s_steer->promisc_qps[steer], list) {
         if (pqp->qpn == qpn)
             return pqp;
     }
     /* not found */
     return NULL;
}

As far as I can understand, we are changing a list for a port and 
checking for duplicates on the other list. Points marked as A, B and C 
for highlighting. Am I missing something? What do you think?

FWIW, this call get_promisc_qp(dev, 0, ...) happens in other places too.

Thank you,
Marcelo.

^ permalink raw reply

* Re: Possible issue with Mellanox mlx4/port handling
From: Marcelo Ricardo Leitner @ 2012-08-31 12:26 UTC (permalink / raw)
  To: netdev; +Cc: dledford
In-Reply-To: <5040AD1F.9080702@redhat.com>

Fixed subject, sorry the confusion.

On 08/31/2012 09:25 AM, Marcelo Ricardo Leitner wrote:
> Hi,
>
> Commit
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=4c41b3673759d096106e68bce586f103c51d4119
> inserted changes like:
>
> @@ -361,7 +361,7 @@ static int add_promisc_qp(struct mlx4_dev *dev, u8
> port,
> int err;
> struct mlx4_priv *priv = mlx4_priv(dev);
>
> - s_steer = &mlx4_priv(dev)->steer[0];
> + s_steer = &mlx4_priv(dev)->steer[port - 1];
>
> mutex_lock(&priv->mcg_table.mutex);
>
> But I fear we missed one part of the deal. Concept patch:
>
> @@ -365,7 +365,7 @@ static int add_promisc_qp(struct mlx4_dev *dev, u8
> port,
>
> mutex_lock(&priv->mcg_table.mutex);
>
> - if (get_promisc_qp(dev, 0, steer, qpn)) {
> + if (get_promisc_qp(dev, port - 1, steer, qpn)) {
> err = 0; /* Noting to do, already exists */
> goto out_mutex;
> }
>
> Because:
>
> static int add_promisc_qp(struct mlx4_dev *dev, u8 port,
> enum mlx4_steer_type steer, u32 qpn)
> {
> ...
> A) s_steer = &mlx4_priv(dev)->steer[port - 1];
>
> mutex_lock(&priv->mcg_table.mutex);
>
> if (get_promisc_qp(dev, 0, steer, qpn)) {
> err = 0; /* Noting to do, already exists */
> goto out_mutex;
> }
> ...
> /* add the new qpn to list of promisc qps */
> C) list_add_tail(&pqp->list, &s_steer->promisc_qps[steer]);
> ...
> }
>
> static struct mlx4_promisc_qp *get_promisc_qp(struct mlx4_dev *dev, u8
> pf_num,
> enum mlx4_steer_type steer,
> u32 qpn)
> {
> B) struct mlx4_steer *s_steer = &mlx4_priv(dev)->steer[pf_num];
> struct mlx4_promisc_qp *pqp;
>
> list_for_each_entry(pqp, &s_steer->promisc_qps[steer], list) {
> if (pqp->qpn == qpn)
> return pqp;
> }
> /* not found */
> return NULL;
> }
>
> As far as I can understand, we are changing a list for a port and
> checking for duplicates on the other list. Points marked as A, B and C
> for highlighting. Am I missing something? What do you think?
>
> FWIW, this call get_promisc_qp(dev, 0, ...) happens in other places too.
>
> Thank you,
> Marcelo.

^ permalink raw reply

* Re: [PATCH 1/1] tcp: Wrong timeout for SYN segments
From: Alexander Bergmann @ 2012-08-31 12:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, hkjerry.chu, netdev, linux-kernel
In-Reply-To: <1346414260.2591.8.camel@edumazet-glaptop>

On Fri, Aug 31, 2012 at 04:57:40AM -0700, Eric Dumazet wrote:
> On Tue, 2012-08-28 at 10:44 +0200, Carsten Wolff wrote:
> > 
> > wouldn't it be nice, if the description of the corresponding sysctls in 
> > Documentation/networking/ip-sysctl.txt could be updated too?
> > 

Thanks Carsten for pointing that out.

> 
> Carsten had a good point, any chance you can send a v3 of your patch,
> with the Documentation/networking/ip-sysctl.txt changed as well ?
> 

Hi Eric!

I've also changed the Documentation file. As usual, comments are welcome!


Alex


>From 848f34ce27f65401940ae98e0b2d395888d3986d Mon Sep 17 00:00:00 2001
From: Alexander Bergmann <alex@linlab.net>
Date: Fri, 31 Aug 2012 14:31:00 +0200
Subject: [PATCH 1/1] tcp: Increase timeout for SYN segments

Commit 9ad7c049 changed the initRTO from 3secs to 1sec in accordance to
RFC6298 (former RFC2988bis). This reduced the time till the last SYN
retransmission packet gets sent from 93secs to 31secs.

RFC1122 is stating that the retransmission should be done for at least 3
minutes, but this seems to be quite high.

  "However, the values of R1 and R2 may be different for SYN
  and data segments.  In particular, R2 for a SYN segment MUST
  be set large enough to provide retransmission of the segment
  for at least 3 minutes.  The application can close the
  connection (i.e., give up on the open attempt) sooner, of
  course."

This patch increases the value of TCP_SYN_RETRIES to the value of 6,
providing a retransmission window of 63secs.

The comments for SYN and SYNACK retries have also been updated to
describe the current settings. The same goes for the documentation file
"Documentation/networking/ip-sysctl.txt".

Signed-off-by: Alexander Bergmann <alex@linlab.net>
---
 Documentation/networking/ip-sysctl.txt |    8 ++++++--
 include/net/tcp.h                      |   18 ++++++++++++++----
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ca447b3..d64e531 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -439,7 +439,9 @@ tcp_stdurg - BOOLEAN
 tcp_synack_retries - INTEGER
 	Number of times SYNACKs for a passive TCP connection attempt will
 	be retransmitted. Should not be higher than 255. Default value
-	is 5, which corresponds to ~180seconds.
+	is 5, which corresponds to 31seconds till the last retransmission
+	with the current initial RTO of 1second. With this the final timeout
+	for a passive TCP connection will happen after 63seconds.
 
 tcp_syncookies - BOOLEAN
 	Only valid when the kernel was compiled with CONFIG_SYNCOOKIES
@@ -478,7 +480,9 @@ tcp_fastopen - INTEGER
 tcp_syn_retries - INTEGER
 	Number of times initial SYNs for an active TCP connection attempt
 	will be retransmitted. Should not be higher than 255. Default value
-	is 5, which corresponds to ~180seconds.
+	is 6, which corresponds to 63seconds till the last restransmission
+	with the current initial RTO of 1second. With this the final timeout
+	for an active TCP connection attempt will happen after 127seconds.
 
 tcp_timestamps - BOOLEAN
 	Enable timestamps as defined in RFC1323.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1f000ff..d43d6b3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -98,11 +98,21 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 				 * 15 is ~13-30min depending on RTO.
 				 */
 
-#define TCP_SYN_RETRIES	 5	/* number of times to retry active opening a
-				 * connection: ~180sec is RFC minimum	*/
+#define TCP_SYN_RETRIES	 6	/*
+				 * This is how many retries it does to active
+				 * opening a connection.
+				 * RFC1122 says the minimum retry MUST be at
+				 * least 180secs. Nevertheless this value is
+				 * corresponding to 63secs of retransmission
+				 * with the current initial RTO.
+				 */
 
-#define TCP_SYNACK_RETRIES 5	/* number of times to retry passive opening a
-				 * connection: ~180sec is RFC minimum	*/
+#define TCP_SYNACK_RETRIES 5	/* 
+				 * This is how may retries it does to passive
+				 * opening a connection. 
+				 * This is corresponding to 31secs of 
+				 * retransmission with the current initial RTO.
+				 */
 
 #define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
 				  * state, about 60 seconds	*/
-- 
1.7.8.6

^ permalink raw reply related

* Re: [PATCH] net: Avoid spinlock recursion
From: Cong Wang @ 2012-08-31 13:02 UTC (permalink / raw)
  To: Fubo Chen; +Cc: netdev@vger.kernel.org, David S. Miller
In-Reply-To: <CAJAFBLCPjjBCmO0vxUeqSKJEkDkuQGF9H9hFRxWgFH92Kpz=VA@mail.gmail.com>

On Mon, 2012-08-27 at 14:52 +0000, Fubo Chen wrote:
> Fixes a bug introduced by commit 6bdb7fe31046ac50b47e83c35cd6c6b6160a475d.

Hi,

I already sent a patch:
http://patchwork.ozlabs.org/patch/179954/

Thanks!

^ permalink raw reply

* Re: [PATCH 1/1] tcp: Wrong timeout for SYN segments
From: Eric Dumazet @ 2012-08-31 13:25 UTC (permalink / raw)
  To: Alexander Bergmann; +Cc: davem, hkjerry.chu, netdev, linux-kernel
In-Reply-To: <20120831124831.GA23279@linlab.net>

On Fri, 2012-08-31 at 14:48 +0200, Alexander Bergmann wrote:

> Hi Eric!
> 
> I've also changed the Documentation file. As usual, comments are welcome!
> 
> 
> Alex
> 
> 
> From 848f34ce27f65401940ae98e0b2d395888d3986d Mon Sep 17 00:00:00 2001
> From: Alexander Bergmann <alex@linlab.net>
> Date: Fri, 31 Aug 2012 14:31:00 +0200
> Subject: [PATCH 1/1] tcp: Increase timeout for SYN segments
> 
> Commit 9ad7c049 changed the initRTO from 3secs to 1sec in accordance to
> RFC6298 (former RFC2988bis). This reduced the time till the last SYN
> retransmission packet gets sent from 93secs to 31secs.
> 
> RFC1122 is stating that the retransmission should be done for at least 3
> minutes, but this seems to be quite high.
> 
>   "However, the values of R1 and R2 may be different for SYN
>   and data segments.  In particular, R2 for a SYN segment MUST
>   be set large enough to provide retransmission of the segment
>   for at least 3 minutes.  The application can close the
>   connection (i.e., give up on the open attempt) sooner, of
>   course."
> 
> This patch increases the value of TCP_SYN_RETRIES to the value of 6,
> providing a retransmission window of 63secs.
> 
> The comments for SYN and SYNACK retries have also been updated to
> describe the current settings. The same goes for the documentation file
> "Documentation/networking/ip-sysctl.txt".
> 
> Signed-off-by: Alexander Bergmann <alex@linlab.net>
> ---

Thanks for your patience and followup, this seems good to me !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [RFC] fq_codel : interval servo on hosts
From: Eric Dumazet @ 2012-08-31 13:50 UTC (permalink / raw)
  To: codel; +Cc: Tomas Hruby, Nandita Dukkipati, netdev
In-Reply-To: <1346396137.2586.301.camel@edumazet-glaptop>

On Thu, 2012-08-30 at 23:55 -0700, Eric Dumazet wrote:
> On locally generated TCP traffic (host), we can override the 100 ms
> interval value using the more accurate RTT estimation maintained by TCP
> stack (tp->srtt)
> 
> Datacenter workload benefits using shorter feedback (say if RTT is below
> 1 ms, we can react 100 times faster to a congestion)
> 
> Idea from Yuchung Cheng.
> 

Linux patch would be the following :

I'll do tests next week, but I am sending a raw patch right now if
anybody wants to try it.

Presumably we also want to adjust target as well.

To get more precise srtt values in the datacenter, we might avoid the
'one jiffie slack' on small values in tcp_rtt_estimator(), as we force
m to be 1 before the scaling by 8 :

if (m == 0)
	m = 1;

We only need to force the least significant bit of srtt to be set.


 net/sched/sch_fq_codel.c |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 9fc1c62..7d2fe35 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -25,6 +25,7 @@
 #include <net/pkt_sched.h>
 #include <net/flow_keys.h>
 #include <net/codel.h>
+#include <linux/tcp.h>
 
 /*	Fair Queue CoDel.
  *
@@ -59,6 +60,7 @@ struct fq_codel_sched_data {
 	u32		perturbation;	/* hash perturbation */
 	u32		quantum;	/* psched_mtu(qdisc_dev(sch)); */
 	struct codel_params cparams;
+	codel_time_t	default_interval;
 	struct codel_stats cstats;
 	u32		drop_overlimit;
 	u32		new_flow_count;
@@ -211,6 +213,14 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return NET_XMIT_SUCCESS;
 }
 
+/* Given TCP srtt evaluation, return codel interval.
+ * srtt is given in jiffies, scaled by 8.
+ */
+static codel_time_t tcp_srtt_to_codel(unsigned int srtt)
+{
+	return srtt * ((NSEC_PER_SEC >> (CODEL_SHIFT + 3)) / HZ);
+}
+
 /* This is the specific function called from codel_dequeue()
  * to dequeue a packet from queue. Note: backlog is handled in
  * codel, we dont need to reduce it here.
@@ -220,12 +230,21 @@ static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct fq_codel_flow *flow;
 	struct sk_buff *skb = NULL;
+	struct sock *sk;
 
 	flow = container_of(vars, struct fq_codel_flow, cvars);
 	if (flow->head) {
 		skb = dequeue_head(flow);
 		q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
 		sch->q.qlen--;
+		sk = skb->sk;
+		q->cparams.interval = q->default_interval;
+		if (sk && sk->sk_protocol == IPPROTO_TCP) {
+			u32 srtt = tcp_sk(sk)->srtt;
+
+			if (srtt)
+				q->cparams.interval = tcp_srtt_to_codel(srtt);
+		}
 	}
 	return skb;
 }
@@ -330,7 +349,7 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
 	if (tb[TCA_FQ_CODEL_INTERVAL]) {
 		u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
 
-		q->cparams.interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
+		q->default_interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
 	}
 
 	if (tb[TCA_FQ_CODEL_LIMIT])
@@ -441,7 +460,7 @@ static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_CODEL_LIMIT,
 			sch->limit) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_INTERVAL,
-			codel_time_to_us(q->cparams.interval)) ||
+			codel_time_to_us(q->default_interval)) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_ECN,
 			q->cparams.ecn) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_QUANTUM,

^ permalink raw reply related

* Re: [PATCH v2 1/2] 6lowpan: Make a copy of skb's delivered to 6lowpan
From: Alan Ott @ 2012-08-31 13:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-zigbee-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	David S. Miller
In-Reply-To: <1346396485.2586.307.camel@edumazet-glaptop>

On 08/31/2012 03:01 AM, Eric Dumazet wrote:
> On Wed, 2012-08-29 at 22:39 -0400, Alan Ott wrote:
>> Since lowpan_process_data() modifies the skb (by calling skb_pull()), we
>> need our own copy so that it doesn't affect the data received by other
>> protcols (in this case, af_ieee802154).
>>
>> Signed-off-by: Alan Ott <alan-yzvJWuRpmD1zbRFIqnYvSA@public.gmane.org>
>> ---
>>  net/ieee802154/6lowpan.c |    9 ++++++++-
>>  1 files changed, 8 insertions(+), 1 deletions(-)
>>
>> diff --git a/net/ieee802154/6lowpan.c b/net/ieee802154/6lowpan.c
>> index 6a09522..ce33b02 100644
>> --- a/net/ieee802154/6lowpan.c
>> +++ b/net/ieee802154/6lowpan.c
>> @@ -1133,6 +1133,8 @@ static int lowpan_validate(struct nlattr *tb[], struct nlattr *data[])
>>  static int lowpan_rcv(struct sk_buff *skb, struct net_device *dev,
>>  	struct packet_type *pt, struct net_device *orig_dev)
>>  {
>> +	struct sk_buff *local_skb;
>> +
>>  	if (!netif_running(dev))
>>  		goto drop;
>>  
>> @@ -1144,7 +1146,12 @@ static int lowpan_rcv(struct sk_buff *skb, struct net_device *dev,
>>  	case LOWPAN_DISPATCH_IPHC:	/* ipv6 datagram */
>>  	case LOWPAN_DISPATCH_FRAG1:	/* first fragment header */
>>  	case LOWPAN_DISPATCH_FRAGN:	/* next fragments headers */
>> -		lowpan_process_data(skb);
>> +		local_skb = skb_copy(skb, GFP_ATOMIC);
>> +		if (!local_skb)
>> +			goto drop;
>> +		lowpan_process_data(local_skb);
>> +
>> +		kfree_skb(skb);
>>  		break;
>>  	default:
>>  		break;
> Its not clear to me why skb_copy() is needed here.
>
> >From patch description, I would say skb_clone() would be enough (and
> faster) ?

You're probably right. I'll check it out. Thanks.

Alan.

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply

* [RFC v2] fq_codel : interval servo on hosts
From: Eric Dumazet @ 2012-08-31 13:57 UTC (permalink / raw)
  To: codel; +Cc: Tomas Hruby, Nandita Dukkipati, netdev
In-Reply-To: <1346421031.2591.34.camel@edumazet-glaptop>

On Fri, 2012-08-31 at 06:50 -0700, Eric Dumazet wrote:
> On Thu, 2012-08-30 at 23:55 -0700, Eric Dumazet wrote:
> > On locally generated TCP traffic (host), we can override the 100 ms
> > interval value using the more accurate RTT estimation maintained by TCP
> > stack (tp->srtt)
> > 
> > Datacenter workload benefits using shorter feedback (say if RTT is below
> > 1 ms, we can react 100 times faster to a congestion)
> > 
> > Idea from Yuchung Cheng.
> > 
> 
> Linux patch would be the following :
> 
> I'll do tests next week, but I am sending a raw patch right now if
> anybody wants to try it.
> 
> Presumably we also want to adjust target as well.
> 
> To get more precise srtt values in the datacenter, we might avoid the
> 'one jiffie slack' on small values in tcp_rtt_estimator(), as we force
> m to be 1 before the scaling by 8 :
> 
> if (m == 0)
> 	m = 1;
> 
> We only need to force the least significant bit of srtt to be set.
> 

Hmm, I also need to properly init default_interval after
codel_params_init(&q->cparams) :

 net/sched/sch_fq_codel.c |   24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 9fc1c62..f04ff6a 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -25,6 +25,7 @@
 #include <net/pkt_sched.h>
 #include <net/flow_keys.h>
 #include <net/codel.h>
+#include <linux/tcp.h>
 
 /*	Fair Queue CoDel.
  *
@@ -59,6 +60,7 @@ struct fq_codel_sched_data {
 	u32		perturbation;	/* hash perturbation */
 	u32		quantum;	/* psched_mtu(qdisc_dev(sch)); */
 	struct codel_params cparams;
+	codel_time_t	default_interval;
 	struct codel_stats cstats;
 	u32		drop_overlimit;
 	u32		new_flow_count;
@@ -211,6 +213,14 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return NET_XMIT_SUCCESS;
 }
 
+/* Given TCP srtt evaluation, return codel interval.
+ * srtt is given in jiffies, scaled by 8.
+ */
+static codel_time_t tcp_srtt_to_codel(unsigned int srtt)
+{
+	return srtt * ((NSEC_PER_SEC >> (CODEL_SHIFT + 3)) / HZ);
+}
+
 /* This is the specific function called from codel_dequeue()
  * to dequeue a packet from queue. Note: backlog is handled in
  * codel, we dont need to reduce it here.
@@ -220,12 +230,21 @@ static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct fq_codel_flow *flow;
 	struct sk_buff *skb = NULL;
+	struct sock *sk;
 
 	flow = container_of(vars, struct fq_codel_flow, cvars);
 	if (flow->head) {
 		skb = dequeue_head(flow);
 		q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
 		sch->q.qlen--;
+		sk = skb->sk;
+		q->cparams.interval = q->default_interval;
+		if (sk && sk->sk_protocol == IPPROTO_TCP) {
+			u32 srtt = tcp_sk(sk)->srtt;
+
+			if (srtt)
+				q->cparams.interval = tcp_srtt_to_codel(srtt);
+		}
 	}
 	return skb;
 }
@@ -330,7 +349,7 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
 	if (tb[TCA_FQ_CODEL_INTERVAL]) {
 		u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
 
-		q->cparams.interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
+		q->default_interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
 	}
 
 	if (tb[TCA_FQ_CODEL_LIMIT])
@@ -395,6 +414,7 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
 	INIT_LIST_HEAD(&q->new_flows);
 	INIT_LIST_HEAD(&q->old_flows);
 	codel_params_init(&q->cparams);
+	q->default_interval = q->cparams.interval;
 	codel_stats_init(&q->cstats);
 	q->cparams.ecn = true;
 
@@ -441,7 +461,7 @@ static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_CODEL_LIMIT,
 			sch->limit) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_INTERVAL,
-			codel_time_to_us(q->cparams.interval)) ||
+			codel_time_to_us(q->default_interval)) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_ECN,
 			q->cparams.ecn) ||
 	    nla_put_u32(skb, TCA_FQ_CODEL_QUANTUM,

^ permalink raw reply related

* [PATCH 3/6] netfilter: ctnetlink: fix error return code in init path
From: pablo @ 2012-08-31 14:03 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1346421789-3449-1-git-send-email-pablo@netfilter.org>

From: Julia Lawall <Julia.Lawall@lip6.fr>

Initialize return variable before exiting on an error path.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
(
if@p1 (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret@p1 = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}

// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_netlink.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index da4fc37..9807f32 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -2790,7 +2790,8 @@ static int __init ctnetlink_init(void)
 		goto err_unreg_subsys;
 	}
 
-	if (register_pernet_subsys(&ctnetlink_net_ops)) {
+	ret = register_pernet_subsys(&ctnetlink_net_ops);
+	if (ret < 0) {
 		pr_err("ctnetlink_init: cannot register pernet operations\n");
 		goto err_unreg_exp_subsys;
 	}
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 4/6] netfilter: nfnetlink_log: fix error return code in init path
From: pablo @ 2012-08-31 14:03 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1346421789-3449-1-git-send-email-pablo@netfilter.org>

From: Julia Lawall <Julia.Lawall@lip6.fr>

Initialize return variable before exiting on an error path.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
(
if@p1 (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret@p1 = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}

// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_log.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index 592091c1..14e2f39 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -996,8 +996,10 @@ static int __init nfnetlink_log_init(void)
 
 #ifdef CONFIG_PROC_FS
 	if (!proc_create("nfnetlink_log", 0440,
-			 proc_net_netfilter, &nful_file_ops))
+			 proc_net_netfilter, &nful_file_ops)) {
+		status = -ENOMEM;
 		goto cleanup_logger;
+	}
 #endif
 	return status;
 
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 1/6] netfilter: nfnetlink_log: fix NLA_PUT macro removal bug
From: pablo @ 2012-08-31 14:03 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1346421789-3449-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

Commit 1db20a52 (nfnetlink_log: Stop using NLA_PUT*().) incorrectly
converted a NLA_PUT_BE16 macro to nla_put_be32() in nfnetlink_log:

-               NLA_PUT_BE16(inst->skb, NFULA_HWTYPE, htons(skb->dev->type));
+               if (nla_put_be32(inst->skb, NFULA_HWTYPE, htons(skb->dev->type))

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_log.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index 169ab59..592091c1 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -480,7 +480,7 @@ __build_packet_message(struct nfulnl_instance *inst,
 	}
 
 	if (indev && skb_mac_header_was_set(skb)) {
-		if (nla_put_be32(inst->skb, NFULA_HWTYPE, htons(skb->dev->type)) ||
+		if (nla_put_be16(inst->skb, NFULA_HWTYPE, htons(skb->dev->type)) ||
 		    nla_put_be16(inst->skb, NFULA_HWLEN,
 				 htons(skb->dev->hard_header_len)) ||
 		    nla_put(inst->skb, NFULA_HWHEADER, skb->dev->hard_header_len,
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 0/6] netfilter updates for 3.6-rc
From: pablo @ 2012-08-31 14:03 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Pablo Neira Ayuso <pablo@netfilter.org>

Hi David,

The following patchset contain fixes for your net tree, they are:

* Fix wrong type for NFULA_HWTYPE attribute, this was introduced while
  removing the NLA_PUT macro, from Patrick McHardy.

* Three fixes that spot incorrect return values in the initialization
  path of several Netfilter modules, from Julia Lawall.

* Fix crash in the SIP helper if we hit EBUSY while adding the RTCP
  expectation, from myself.

* Fix racy timer handling in case conntrackd is running in reliable
  event mode, also from myself.

You can pull these changes from:

git://1984.lsi.us.es/nf master

BTW, please merge net to net-next after this so I can resolve the
conflict between the SIP helper and NAT IPv6 changes from Patrick,
which is scheduled for net-next.

Thanks!

Julia Lawall (3):
  ipvs: fix error return code
  netfilter: ctnetlink: fix error return code in init path
  netfilter: nfnetlink_log: fix error return code in init path

Pablo Neira Ayuso (2):
  netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation
  netfilter: nf_conntrack: fix racy timer handling with reliable events

Patrick McHardy (1):
  netfilter: nfnetlink_log: fix NLA_PUT macro removal bug

 include/net/netfilter/nf_conntrack_ecache.h |    1 +
 net/ipv4/netfilter/nf_nat_sip.c             |    5 ++++-
 net/netfilter/ipvs/ip_vs_ctl.c              |    4 +++-
 net/netfilter/nf_conntrack_core.c           |   16 +++++++++++-----
 net/netfilter/nf_conntrack_netlink.c        |    3 ++-
 net/netfilter/nfnetlink_log.c               |    6 ++++--
 6 files changed, 25 insertions(+), 10 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* [PATCH 2/6] ipvs: fix error return code
From: pablo @ 2012-08-31 14:03 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1346421789-3449-1-git-send-email-pablo@netfilter.org>

From: Julia Lawall <Julia.Lawall@lip6.fr>

Initialize return variable before exiting on an error path.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
(
if@p1 (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret@p1 = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}

// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipvs/ip_vs_ctl.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 72bf32a..f51013c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1171,8 +1171,10 @@ ip_vs_add_service(struct net *net, struct ip_vs_service_user_kern *u,
 		goto out_err;
 	}
 	svc->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
-	if (!svc->stats.cpustats)
+	if (!svc->stats.cpustats) {
+		ret = -ENOMEM;
 		goto out_err;
+	}
 
 	/* I'm the first user of the service */
 	atomic_set(&svc->usecnt, 0);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 5/6] netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation
From: pablo @ 2012-08-31 14:03 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1346421789-3449-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>

We're hitting bug while trying to reinsert an already existing
expectation:

kernel BUG at kernel/timer.c:895!
invalid opcode: 0000 [#1] SMP
[...]
Call Trace:
 <IRQ>
 [<ffffffffa0069563>] nf_ct_expect_related_report+0x4a0/0x57a [nf_conntrack]
 [<ffffffff812d423a>] ? in4_pton+0x72/0x131
 [<ffffffffa00ca69e>] ip_nat_sdp_media+0xeb/0x185 [nf_nat_sip]
 [<ffffffffa00b5b9b>] set_expected_rtp_rtcp+0x32d/0x39b [nf_conntrack_sip]
 [<ffffffffa00b5f15>] process_sdp+0x30c/0x3ec [nf_conntrack_sip]
 [<ffffffff8103f1eb>] ? irq_exit+0x9a/0x9c
 [<ffffffffa00ca738>] ? ip_nat_sdp_media+0x185/0x185 [nf_nat_sip]

We have to remove the RTP expectation if the RTCP expectation hits EBUSY
since we keep trying with other ports until we succeed.

Reported-by: Rafal Fitt <rafalf@aplusc.com.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/nf_nat_sip.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/nf_nat_sip.c b/net/ipv4/netfilter/nf_nat_sip.c
index 4ad9cf1..9c87cde 100644
--- a/net/ipv4/netfilter/nf_nat_sip.c
+++ b/net/ipv4/netfilter/nf_nat_sip.c
@@ -502,7 +502,10 @@ static unsigned int ip_nat_sdp_media(struct sk_buff *skb, unsigned int dataoff,
 		ret = nf_ct_expect_related(rtcp_exp);
 		if (ret == 0)
 			break;
-		else if (ret != -EBUSY) {
+		else if (ret == -EBUSY) {
+			nf_ct_unexpect_related(rtp_exp);
+			continue;
+		} else if (ret < 0) {
 			nf_ct_unexpect_related(rtp_exp);
 			port = 0;
 			break;
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 6/6] netfilter: nf_conntrack: fix racy timer handling with reliable events
From: pablo @ 2012-08-31 14:03 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1346421789-3449-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>

Existing code assumes that del_timer returns true for alive conntrack
entries. However, this is not true if reliable events are enabled.
In that case, del_timer may return true for entries that were
just inserted in the dying list. Note that packets / ctnetlink may
hold references to conntrack entries that were just inserted to such
list.

This patch fixes the issue by adding an independent timer for
event delivery. This increases the size of the ecache extension.
Still we can revisit this later and use variable size extensions
to allocate this area on demand.

Tested-by: Oliver Smith <olipro@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_ecache.h |    1 +
 net/netfilter/nf_conntrack_core.c           |   16 +++++++++++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index e1ce104..4a045cd 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -18,6 +18,7 @@ struct nf_conntrack_ecache {
 	u16 ctmask;		/* bitmask of ct events to be delivered */
 	u16 expmask;		/* bitmask of expect events to be delivered */
 	u32 pid;		/* netlink pid of destroyer */
+	struct timer_list timeout;
 };
 
 static inline struct nf_conntrack_ecache *
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index cf48755..2ceec64 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -249,12 +249,15 @@ static void death_by_event(unsigned long ul_conntrack)
 {
 	struct nf_conn *ct = (void *)ul_conntrack;
 	struct net *net = nf_ct_net(ct);
+	struct nf_conntrack_ecache *ecache = nf_ct_ecache_find(ct);
+
+	BUG_ON(ecache == NULL);
 
 	if (nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
 		/* bad luck, let's retry again */
-		ct->timeout.expires = jiffies +
+		ecache->timeout.expires = jiffies +
 			(random32() % net->ct.sysctl_events_retry_timeout);
-		add_timer(&ct->timeout);
+		add_timer(&ecache->timeout);
 		return;
 	}
 	/* we've got the event delivered, now it's dying */
@@ -268,6 +271,9 @@ static void death_by_event(unsigned long ul_conntrack)
 void nf_ct_insert_dying_list(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
+	struct nf_conntrack_ecache *ecache = nf_ct_ecache_find(ct);
+
+	BUG_ON(ecache == NULL);
 
 	/* add this conntrack to the dying list */
 	spin_lock_bh(&nf_conntrack_lock);
@@ -275,10 +281,10 @@ void nf_ct_insert_dying_list(struct nf_conn *ct)
 			     &net->ct.dying);
 	spin_unlock_bh(&nf_conntrack_lock);
 	/* set a new timer to retry event delivery */
-	setup_timer(&ct->timeout, death_by_event, (unsigned long)ct);
-	ct->timeout.expires = jiffies +
+	setup_timer(&ecache->timeout, death_by_event, (unsigned long)ct);
+	ecache->timeout.expires = jiffies +
 		(random32() % net->ct.sysctl_events_retry_timeout);
-	add_timer(&ct->timeout);
+	add_timer(&ecache->timeout);
 }
 EXPORT_SYMBOL_GPL(nf_ct_insert_dying_list);
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 1/3] tcp: TCP Fast Open Server - header & support functions
From: Eric Dumazet @ 2012-08-31 14:21 UTC (permalink / raw)
  To: H.K. Jerry Chu
  Cc: davem, ycheng, edumazet, ncardwell, sivasankar, therbert, netdev
In-Reply-To: <1346369948-1722-2-git-send-email-hkchu@google.com>

On Thu, 2012-08-30 at 16:39 -0700, H.K. Jerry Chu wrote:
> From: Jerry Chu <hkchu@google.com>
> 
> This patch adds all the necessary data structure and support
> functions to implement TFO server side. It also documents a number
> of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
> extension MIBs.
> 
> In addition, it includes the following:
> 
> 1. a new TCP_FASTOPEN socket option an application must call to
> supply a max backlog allowed in order to enable TFO on its listener.
> 
> 2. A number of key data structures:
> "fastopen_rsk" in tcp_sock - for a big socket to access its
> request_sock for retransmission and ack processing purpose. It is
> non-NULL iff 3WHS not completed.
> 
> "fastopenq" in request_sock_queue - points to a per Fast Open
> listener data structure "fastopen_queue" to keep track of qlen (# of
> outstanding Fast Open requests) and max_qlen, among other things.
> 
> "listener" in tcp_request_sock - to point to the original listener
> for book-keeping purpose, i.e., to maintain qlen against max_qlen
> as part of defense against IP spoofing attack.
> 
> 3. various data structure and functions, many in tcp_fastopen.c, to
> support server side Fast Open cookie operations, including
> /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
> 
> Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
>  Documentation/networking/ip-sysctl.txt |   29 ++++++++---
>  include/linux/snmp.h                   |    4 ++
>  include/linux/tcp.h                    |   45 ++++++++++++++++-
>  include/net/request_sock.h             |   36 ++++++++++++++
>  include/net/tcp.h                      |   46 +++++++++++++++---
>  net/ipv4/proc.c                        |    4 ++
>  net/ipv4/sysctl_net_ipv4.c             |   45 +++++++++++++++++
>  net/ipv4/tcp_fastopen.c                |   83 +++++++++++++++++++++++++++++++-
>  net/ipv4/tcp_input.c                   |    4 +-
>  9 files changed, 276 insertions(+), 20 deletions(-)

There are two very small points that can be addressed later, or in next
iteration if there is one.

static inline bool fastopen_cookie_present(struct tcp_fastopen_cookie *foc)
{
       return (foc)->len != -1;
}

should be :

static inline bool fastopen_cookie_present(const struct tcp_fastopen_cookie *foc)
{
       return foc->len != -1;
}

And we should add a BUILD_BUG_ON(TCP_FASTOPEN_KEY_LENGTH != 4*sizeof(u32));
in proc_tcp_fastopen_key().

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [BUG]  TIPC handling of -ERESTARTSYS in connect()
From: Chris Friesen @ 2012-08-31 15:09 UTC (permalink / raw)
  To: Ying Xue; +Cc: netdev, Allan Stephens, Jon Maloy
In-Reply-To: <504085D9.3010907@windriver.com>

On 08/31/2012 03:37 AM, Ying Xue wrote:
> Hi Chris,
>
> Although this is a known issue, still thanks for your report.
>
> Regardless of 1.7.7 or mainline, the issue really exists.
> Can you please check and verify the attached patch?
>
> PS: the patch is based on 1.7.7 rather than mainline.

I haven't had a chance to test it yet but from visual inspection the 
patch looks pretty good.

It looks like the case where we come in with "sock->state == 
SS_LISTENING" has changed.  Previously we would return -EOPNOTSUPP but 
now it'll be -EINVAL.  Is that intentional?

Chris

^ permalink raw reply

* Re: [BUG]  TIPC handling of -ERESTARTSYS in connect()
From: Chris Friesen @ 2012-08-31 15:18 UTC (permalink / raw)
  To: Ying Xue; +Cc: netdev, Allan Stephens, Jon Maloy
In-Reply-To: <5040D3C4.2010308@genband.com>

On 08/31/2012 09:09 AM, Chris Friesen wrote:
> On 08/31/2012 03:37 AM, Ying Xue wrote:
>> Hi Chris,
>>
>> Although this is a known issue, still thanks for your report.
>>
>> Regardless of 1.7.7 or mainline, the issue really exists.
>> Can you please check and verify the attached patch?
>>
>> PS: the patch is based on 1.7.7 rather than mainline.
>
>
> I haven't had a chance to test it yet but from visual inspection the
> patch looks pretty good.
>
> It looks like the case where we come in with "sock->state ==
> SS_LISTENING" has changed. Previously we would return -EOPNOTSUPP but
> now it'll be -EINVAL. Is that intentional?

Just noticed something else.  In the SS_CONNECTING case I don't think 
there's any point in setting "res = -EALREADY" since a bit further down 
it gets set unconditionally anyway.

Chris

^ permalink raw reply

* Re: [PATCH 0/6] netfilter updates for 3.6-rc
From: David Miller @ 2012-08-31 19:15 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1346421789-3449-1-git-send-email-pablo@netfilter.org>

From: pablo@netfilter.org
Date: Fri, 31 Aug 2012 16:03:03 +0200

> You can pull these changes from:
> 
> git://1984.lsi.us.es/nf master

Pulled.

> BTW, please merge net to net-next after this so I can resolve the
> conflict between the SIP helper and NAT IPv6 changes from Patrick,
> which is scheduled for net-next.

Done.

^ permalink raw reply

* Re: [PATCH 1/2] ipv4: Improve the scaling of the ARP cache for multicast destinations.
From: Bob Gilligan @ 2012-08-31 19:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120830.210628.365120808137655227.davem@davemloft.net>

On 8/30/12 6:06 PM, David Miller wrote:
> From: Bob Gilligan <gilligan@aristanetworks.com>
> Date: Thu, 30 Aug 2012 17:55:04 -0700
> 
>> The mapping from multicast IPv4 address to MAC address can just as
>> easily be done at the time a packet is to be sent.  With this change,
>> we maintain one ARP cache entry for each interface that has at least
>> one multicast group member.  All routes to IPv4 multicast destinations
>> via a particular interface use the same ARP cache entry.  This entry
>> does not store the MAC address to use.  Instead, packets for multicast
>> destinations go to a new output function that maps the destination
>> IPv4 multicast address into the MAC address and forms the MAC header.
> 
> Doing an ARP MC mapping on every packet is much more expensive than
> doing a copy of the hard header cache.
> 
> I do not believe the memory consumption issue you use to justify this
> change is a real issue.
> 
> If you are talking to that many multicast groups actively, you do want
> that many neighbour cache entries.  This is not different from talking
> to nearly every IP address on a local /8 subnet.  You'll have a huge
> number of neighbour table entries in that case as well.
> 
> If your the actual steady state number of active groups being spoken
> to is smaller, you can tune the neighbour cache thresholds to collect
> old less used entries more quickly.
> 
> And this today is trivial, since routes no longer hold a reference
> to neighbour entries.  Therefore any neighbour entry whatsoever can
> be immediately reclaimed at any moment.

The scaling is N-squared: the number of neighbor cache entries
required for your multicast traffic is interfaces * groups.  100
interfaces and 100 groups could generate 10,000 entries. 1,000
interfaces and 1,000 groups could generate a million entries.

But the number of groups is hard to predict: it depends on the
applications in use and the multicast traffic they generate.  So, it
is hard to come up with a "budget" for multicast entries in the
neighbor cache for a multicast router.

If you pick a gc_thresh3 that is less than your working set, you'll
end up thrashing the neighbor cache.  And calls to neigh_forced_gc()
are expensive: It performs a linear search of the entire neighbor
cache.  Also, the calls to neigh_forced_gc() due to a large number of
multicast entries will negatively impact the unicast entries sharing the
neighbor cache: it will free any unreferenced but resolved unicast
entries. Any subsequent packets for those destinations will trigger a
re-ARP.  Unnecessary re-ARPing is generally undesirable in a router.

The user who wants to avoid these problems is left with the
alternative of setting gc_thresh3 to a very large number based on a
worst case estimate of the number of unicast plus multicast entries
required.

Seems just simpler and more efficient to keep the multicast entries
out of the neighbor cache entirely.

Bob.

> 
> I'm not fond of these patches, and adding yet more special cases to
> the neighbour layer, and therefore will not apply them.
> 

^ permalink raw reply

* WESTERN UNION IMF
From: Western Union Money Transfer @ 2012-08-31 19:21 UTC (permalink / raw)


Attn Contact Western Union for your Payment of $1,500,000.00 today with
your Full Details info,Name,Country,phone,Occupation for your first MTCN of $5,000 USD
E-mail:(claimsdptwesternunion@yahoo.co.uk)
Contact Phone Number:+2348052574570
Best Regard Once More Congratulation !!
Manager Mr Green Garry
Western©Union 2012

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox