Netdev List
 help / color / mirror / Atom feed
* [PATCH 4/4] drivers/atm/idt77252.c: Remove unnecessary error check
From: Julia Lawall @ 2010-10-02 13:59 UTC (permalink / raw)
  To: Chas Williams; +Cc: kernel-janitors, linux-atm-general, netdev, linux-kernel

This code does not call deinit_card(card); in an error case, as done in
other error-handling code in the same function.  But actually, the called
function init_sram can only return 0, so there is no need for the error
check at all.

A simplified version of the sematic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
@r@
statement S1,S2,S3;
constant C1,C2,C3;
@@

*if (...)
 {... S1 return -C1;}
...
*if (...)
 {... when != S1
    return -C2;}
...
*if (...)
 {... S1 return -C3;}
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>

---
 drivers/atm/idt77252.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/atm/idt77252.c b/drivers/atm/idt77252.c
index 1679cbf..08ccde6 100644
--- a/drivers/atm/idt77252.c
+++ b/drivers/atm/idt77252.c
@@ -3410,8 +3410,7 @@ init_card(struct atm_dev *dev)
 
 	writel(readl(SAR_REG_CFG) | conf, SAR_REG_CFG);
 
-	if (init_sram(card) < 0)
-		return -1;
+	init_sram(card);
 
 /********************************************************************/
 /*  A L L O C   R A M   A N D   S E T   V A R I O U S   T H I N G S */

^ permalink raw reply related

* Re: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: Eric Dumazet @ 2010-10-02 13:22 UTC (permalink / raw)
  To: Robin Holt, Andrew Morton
  Cc: Willy Tarreau, linux-kernel, netdev, David S. Miller,
	Alexey Kuznetsov, Pekka Savola (ipv6), James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
In-Reply-To: <20101002112419.248437367@gulag1.americas.sgi.com>

Le samedi 02 octobre 2010 à 06:24 -0500, Robin Holt a écrit :
> pièce jointe document texte brut (sysctl_tcp_udp_mem_max_overflows)
> Subject: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
> 
> On a 16TB x86_64 machine, sysctl_tcp_mem[2], sysctl_udp_mem[2], and
> sysctl_sctp_mem[2] can integer overflow.  Set limit such that they are
> maximized without overflowing.
> 
> Signed-off-by: Robin Holt <holt@sgi.com>
> To: Willy Tarreau <w@1wt.eu>
> To: linux-kernel@vger.kernel.org
> To: netdev@vger.kernel.org
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
> Cc: James Morris <jmorris@namei.org>
> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
> Cc: Patrick McHardy <kaber@trash.net>
> 
> ---
> 
>  net/ipv4/tcp.c |    4 +++-
>  net/ipv4/udp.c |    4 +++-
>  2 files changed, 6 insertions(+), 2 deletions(-)

Hi Mr SixteenTB

Strange, you mention sctp in changelog but I cant see the patch.

We could apply your patch (with sctp changes) for stable.

IMHO it would be better in long term to switch all these limits from
"int" to "long", now we have atomic_long_t primitives that have same
runtime cost than atomic_t ones. I am pretty sure one of your customer
will need more than 5TB of memory for tcp/udp buffers before year 2020,
dont you think ? (some further scalability works probably needed.)

Something like this (boot tested on my dev machine, not a 16TB one, as
you guessed ;) ) :

Based on linux-2.6 for your convenience, should probably sit first in
David net-next-2.6 ...

Note : this needs the "sysctl: fix min/max handling in
__do_proc_doulongvec_minmax()" I sent to Andrew some minutes ago.
or this triggers a BUG :

"echo "378912 505216 758989782400000" >/proc/sys/net/ipv4/tcp_mem


Thanks !

[PATCH] net: avoid limits overflow

Robin Holt tried to boot a 16TB machine and found some limits were
reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]

We can switch infrastructure to use long "instead" of "int", now
atomic_long_t primitives are available for free.

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/dn.h               |    2 +-
 include/net/sock.h             |    4 ++--
 include/net/tcp.h              |    6 +++---
 include/net/udp.h              |    4 ++--
 net/core/sock.c                |   14 +++++++-------
 net/decnet/af_decnet.c         |    2 +-
 net/decnet/sysctl_net_decnet.c |    4 ++--
 net/ipv4/proc.c                |    8 ++++----
 net/ipv4/sysctl_net_ipv4.c     |    5 ++---
 net/ipv4/tcp.c                 |    4 ++--
 net/ipv4/tcp_input.c           |   11 +++++++----
 net/ipv4/udp.c                 |    4 ++--
 net/sctp/protocol.c            |    2 +-
 net/sctp/socket.c              |    4 ++--
 net/sctp/sysctl.c              |    4 ++--
 15 files changed, 40 insertions(+), 38 deletions(-)

diff --git a/include/net/dn.h b/include/net/dn.h
index e5469f7..a514a3c 100644
--- a/include/net/dn.h
+++ b/include/net/dn.h
@@ -225,7 +225,7 @@ extern int decnet_di_count;
 extern int decnet_dr_count;
 extern int decnet_no_fc_max_cwnd;
 
-extern int sysctl_decnet_mem[3];
+extern long sysctl_decnet_mem[3];
 extern int sysctl_decnet_wmem[3];
 extern int sysctl_decnet_rmem[3];
 
diff --git a/include/net/sock.h b/include/net/sock.h
index adab9dc..5d84d86 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -762,7 +762,7 @@ struct proto {
 
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
-	atomic_t		*memory_allocated;	/* Current allocated memory. */
+	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
 	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
 	/*
 	 * Pressure flag: try to collapse.
@@ -771,7 +771,7 @@ struct proto {
 	 * is strict, actions are advisory and have some latency.
 	 */
 	int			*memory_pressure;
-	int			*sysctl_mem;
+	long			*sysctl_mem;
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3e4b33e..3f05403 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -224,7 +224,7 @@ extern int sysctl_tcp_fack;
 extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_ecn;
 extern int sysctl_tcp_dsack;
-extern int sysctl_tcp_mem[3];
+extern long sysctl_tcp_mem[3];
 extern int sysctl_tcp_wmem[3];
 extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
@@ -247,7 +247,7 @@ extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
-extern atomic_t tcp_memory_allocated;
+extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
@@ -280,7 +280,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
 	}
 
 	if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	    atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
+	    atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
 		return true;
 	return false;
 }
diff --git a/include/net/udp.h b/include/net/udp.h
index a184d34..e686e01 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -105,10 +105,10 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table,
 
 extern struct proto udp_prot;
 
-extern atomic_t udp_memory_allocated;
+extern atomic_long_t udp_memory_allocated;
 
 /* sysctl variables for udp */
-extern int sysctl_udp_mem[3];
+extern long sysctl_udp_mem[3];
 extern int sysctl_udp_rmem_min;
 extern int sysctl_udp_wmem_min;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index ef30e9d..a2af957 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1646,10 +1646,10 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 {
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
-	int allocated;
+	long allocated;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	allocated = atomic_add_return(amt, prot->memory_allocated);
+	allocated = atomic_long_add_return(amt, prot->memory_allocated);
 
 	/* Under limit. */
 	if (allocated <= prot->sysctl_mem[0]) {
@@ -1707,7 +1707,7 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-	atomic_sub(amt, prot->memory_allocated);
+	atomic_long_sub(amt, prot->memory_allocated);
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -1720,12 +1720,12 @@ void __sk_mem_reclaim(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	atomic_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
+	atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
 		   prot->memory_allocated);
 	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
 
 	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
+	    (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
 		*prot->memory_pressure = 0;
 }
 EXPORT_SYMBOL(__sk_mem_reclaim);
@@ -2445,12 +2445,12 @@ static char proto_method_implemented(const void *method)
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
-	seq_printf(seq, "%-9s %4u %6d  %6d   %-3s %6u   %-3s  %-10s "
+	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
-		   proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+		   proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index d6b93d1..a76b78d 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -155,7 +155,7 @@ static const struct proto_ops dn_proto_ops;
 static DEFINE_RWLOCK(dn_hash_lock);
 static struct hlist_head dn_sk_hash[DN_SK_HASH_SIZE];
 static struct hlist_head dn_wild_sk;
-static atomic_t decnet_memory_allocated;
+static atomic_long_t decnet_memory_allocated;
 
 static int __dn_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen, int flags);
 static int __dn_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen, int flags);
diff --git a/net/decnet/sysctl_net_decnet.c b/net/decnet/sysctl_net_decnet.c
index be3eb8e..28f8b5e 100644
--- a/net/decnet/sysctl_net_decnet.c
+++ b/net/decnet/sysctl_net_decnet.c
@@ -38,7 +38,7 @@ int decnet_log_martians = 1;
 int decnet_no_fc_max_cwnd = NSP_MIN_WINDOW;
 
 /* Reasonable defaults, I hope, based on tcp's defaults */
-int sysctl_decnet_mem[3] = { 768 << 3, 1024 << 3, 1536 << 3 };
+long sysctl_decnet_mem[3] = { 768 << 3, 1024 << 3, 1536 << 3 };
 int sysctl_decnet_wmem[3] = { 4 * 1024, 16 * 1024, 128 * 1024 };
 int sysctl_decnet_rmem[3] = { 4 * 1024, 87380, 87380 * 2 };
 
@@ -324,7 +324,7 @@ static ctl_table dn_table[] = {
 		.data = &sysctl_decnet_mem,
 		.maxlen = sizeof(sysctl_decnet_mem),
 		.mode = 0644,
-		.proc_handler = proc_dointvec,
+		.proc_handler = proc_doulongvec_minmax
 	},
 	{
 		.procname = "decnet_rmem",
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 4ae1f20..1b48eb1 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -59,13 +59,13 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 	local_bh_enable();
 
 	socket_seq_show(seq);
-	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %d\n",
+	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   tcp_death_row.tw_count, sockets,
-		   atomic_read(&tcp_memory_allocated));
-	seq_printf(seq, "UDP: inuse %d mem %d\n",
+		   atomic_long_read(&tcp_memory_allocated));
+	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
-		   atomic_read(&udp_memory_allocated));
+		   atomic_long_read(&udp_memory_allocated));
 	seq_printf(seq, "UDPLITE: inuse %d\n",
 		   sock_prot_inuse_get(net, &udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n",
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d96c1da..e91911d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -398,7 +398,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &sysctl_tcp_mem,
 		.maxlen		= sizeof(sysctl_tcp_mem),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_doulongvec_minmax
 	},
 	{
 		.procname	= "tcp_wmem",
@@ -602,8 +602,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero
+		.proc_handler	= proc_doulongvec_minmax,
 	},
 	{
 		.procname	= "udp_rmem_min",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f115ea6..e88d7a0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -282,7 +282,7 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-int sysctl_tcp_mem[3] __read_mostly;
+long sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
@@ -290,7 +290,7 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
-atomic_t tcp_memory_allocated;	/* Current allocated memory. */
+atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
 EXPORT_SYMBOL(tcp_memory_allocated);
 
 /*
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b55f60f..0f56fb4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -259,8 +259,11 @@ static void tcp_fixup_sndbuf(struct sock *sk)
 	int sndmem = tcp_sk(sk)->rx_opt.mss_clamp + MAX_TCP_HEADER + 16 +
 		     sizeof(struct sk_buff);
 
-	if (sk->sk_sndbuf < 3 * sndmem)
-		sk->sk_sndbuf = min(3 * sndmem, sysctl_tcp_wmem[2]);
+	if (sk->sk_sndbuf < 3 * sndmem) {
+		sk->sk_sndbuf = 3 * sndmem;
+		if (sk->sk_sndbuf > sysctl_tcp_wmem[2])
+			sk->sk_sndbuf = sysctl_tcp_wmem[2];
+	}
 }
 
 /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
@@ -396,7 +399,7 @@ static void tcp_clamp_window(struct sock *sk)
 	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
 	    !tcp_memory_pressure &&
-	    atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
 				    sysctl_tcp_rmem[2]);
 	}
@@ -4870,7 +4873,7 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fb23c2e..ecfdf2e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -110,7 +110,7 @@
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
 
-int sysctl_udp_mem[3] __read_mostly;
+long sysctl_udp_mem[3] __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_mem);
 
 int sysctl_udp_rmem_min __read_mostly;
@@ -119,7 +119,7 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
 int sysctl_udp_wmem_min __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_wmem_min);
 
-atomic_t udp_memory_allocated;
+atomic_long_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
 #define MAX_UDP_PORTS 65536
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 5027b83..bf95400 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -90,7 +90,7 @@ static struct sctp_af *sctp_af_v6_specific;
 struct kmem_cache *sctp_chunk_cachep __read_mostly;
 struct kmem_cache *sctp_bucket_cachep __read_mostly;
 
-int sysctl_sctp_mem[3];
+long sysctl_sctp_mem[3];
 int sysctl_sctp_rmem[3];
 int sysctl_sctp_wmem[3];
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ca44917..cc03b44 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -109,12 +109,12 @@ static void sctp_sock_migrate(struct sock *, struct sock *,
 static char *sctp_hmac_alg = SCTP_COOKIE_HMAC_ALG;
 
 extern struct kmem_cache *sctp_bucket_cachep;
-extern int sysctl_sctp_mem[3];
+extern long sysctl_sctp_mem[3];
 extern int sysctl_sctp_rmem[3];
 extern int sysctl_sctp_wmem[3];
 
 static int sctp_memory_pressure;
-static atomic_t sctp_memory_allocated;
+static atomic_long_t sctp_memory_allocated;
 struct percpu_counter sctp_sockets_allocated;
 
 static void sctp_enter_memory_pressure(struct sock *sk)
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index 832590b..50cb57f 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -54,7 +54,7 @@ static int sack_timer_max = 500;
 static int addr_scope_max = 3; /* check sctp_scope_policy_t in include/net/sctp/constants.h for max entries */
 static int rwnd_scale_max = 16;
 
-extern int sysctl_sctp_mem[3];
+extern long sysctl_sctp_mem[3];
 extern int sysctl_sctp_rmem[3];
 extern int sysctl_sctp_wmem[3];
 
@@ -203,7 +203,7 @@ static ctl_table sctp_table[] = {
 		.data		= &sysctl_sctp_mem,
 		.maxlen		= sizeof(sysctl_sctp_mem),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_doulongvec_minmax
 	},
 	{
 		.procname	= "sctp_rmem",



^ permalink raw reply related

* [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Eric Dumazet @ 2010-10-02 13:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Robin Holt, Willy Tarreau, David S. Miller, netdev,
	James Morris, Hideaki YOSHIFUJI, Pekka Savola (ipv6), netdev,
	James Morris, Hideaki YOSHIFUJI, Pekka Savola (ipv6),
	Patrick McHardy, Alexey Kuznetsov

When proc_doulongvec_minmax() is used with an array of longs,
and no min/max check requested (.extra1 or .extra2 being NULL), we
dereference a NULL pointer for the second element of the array.

Noticed while doing some changes in network stack for the "16TB problem"

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 kernel/sysctl.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f88552c..4fba86d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2500,7 +2500,8 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
 				break;
 			if (neg)
 				continue;
-			if ((min && val < *min) || (max && val > *max))
+			if ((table->extra1 && val < *min) ||
+			    (table->extra2 && val > *max))
 				continue;
 			*i = val;
 		} else {



^ permalink raw reply related

* Re: [PATCH 2/2] net: Fix the condition passed to sk_wait_event()
From: Nagendra Tomar @ 2010-10-02 12:16 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, davem

This is the second part of the split up patch submitted in http://www.spinics.net/lists/netdev/msg142617.html

This fixes the condition (3rd arg) passed to sk_wait_event() in sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory() causes the following soft lockup in tcp_sendmsg() when the global tcp memory pool has exhausted. 

>>> snip <<<

localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
localhost kernel: CPU 3:
localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200]  [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
localhost kernel: 
localhost kernel: Call Trace:
localhost kernel:  [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel:  [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
localhost kernel:  [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
localhost kernel:  [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel:  [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
localhost kernel:  [vfs_write+0x185/0x190] vfs_write+0x185/0x190
localhost kernel:  [sys_write+0x50/0x90] sys_write+0x50/0x90
localhost kernel:  [system_call+0x7e/0x83] system_call+0x7e/0x83

>>> snip <<<


What is happening is, that the sk_wait_event() condition passed from
sk_stream_wait_memory() evaluates to true for the case of tcp global memory
exhaustion. This is because both sk_stream_memory_free() and vm_wait are true
which causes sk_wait_event() to *not* call schedule_timeout(). 
Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping.
This causes the caller to again try allocation, which again fails and again
calls sk_stream_wait_memory(), and so on.


Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
---
--- linux-2.6.35.7/net/core/stream.c.orig	2010-03-24 09:31:00.000000000 +0530
+++ linux-2.6.35.7/net/core/stream.c	2010-03-24 09:31:08.000000000 +0530
@@ -143,10 +143,9 @@ int sk_stream_wait_memory(struct sock *s
 
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		sk->sk_write_pending++;
-		sk_wait_event(sk, &current_timeo, !sk->sk_err &&
-						  !(sk->sk_shutdown & SEND_SHUTDOWN) &&
-						  sk_stream_memory_free(sk) &&
-						  vm_wait);
+		sk_wait_event(sk, &current_timeo, sk->sk_err ||
+						  (sk->sk_shutdown & SEND_SHUTDOWN) ||
+						  (sk_stream_memory_free(sk) && !vm_wait));
 		sk->sk_write_pending--;
 
 		if (vm_wait) {



      

^ permalink raw reply

* Re: [PATCH net-next V2] net: dynamic ingress_queue allocation
From: jamal @ 2010-10-02 12:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jarek Poplawski, David Miller, netdev
In-Reply-To: <1285941388.2641.175.camel@edumazet-laptop>

On Fri, 2010-10-01 at 15:56 +0200, Eric Dumazet wrote: 
> Le vendredi 01 octobre 2010 à 07:45 -0400, jamal a écrit :

> 
> I first thought of this, and found it would add a new dependence on
> vmlinux :
> 
> If someone wants to add NET_SCH_INGRESS module, he would need to
> recompile whole kernel and reboot.
> 
> This is probably why ing_filter() and handle_ing() are enclosed with
> CONFIG_NET_CLS_ACT, not CONFIG_NET_SCH_INGRESS.
> 
> Since struct net_dev only holds one pointer (after this patch), I
> believe its better to use same dependence.
> 

Ok - I just noticed that CONFIG_NET_SCH_INGRESS depends on 
CONFIG_NET_CLS_ACT - probably my fault too. I have a use case
for having just CONFIG_NET_SCH_INGRESS without need for
CONFIG_NET_CLS_ACT; but you are right to keep in the current
spirit it is fair to keep it as is; i wonder if it needs to be fixed.

> > 
> > > +static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
> > > +{
> > > +#ifdef CONFIG_NET_CLS_ACT
> > > +	return dev->ingress_queue;
> > > +#else
> > > +	return NULL;
> > > +#endif
> > 
> > Above, if you just returned dev->ingress_queue wouldnt it always be 
> > NULL if it was not allocated?
> > 
> 
> ingress_queue is not defined in "struct net_device *dev" if 
> !CONFIG_NET_CLS_ACT
> 
> Returning NULL here permits dead code elimination by compiler.
> 
> Then, probably nobody unset CONFIG_NET_CLS_ACT, so we can probably avoid
> this preprocessor stuff.
> 

Ok


> I see, this adds an indirection and a conditional branch, but this
> should be in cpu cache and well predicted.

Ok - good enough for me.

> I thought adding a fake "struct netdev_queue" object, with a qdisc
> pointing to noop_qdisc. But this would slow down a bit non ingress
> users ;)

nod.

> For people not using ingress, my solution removes an access to an extra
> cache line. So network latency is improved a bit when cpu caches are
> full of user land data.

Who are these lunatics running without ingress anyways?

> > 1) compile support for ingress and add/delete ingress qdisc
> 
> This worked for me, but I dont know complex setups.
> 

I think if you covered the basic case, should be equivalent
to any other case. Maybe a replace after you do an add etc.

> > 2) Dont compile support and add/delete ingress qdisc
> 
> tc gives an error (a bit like !CONFIG_NET_SCH_INGRESS)
> 
> # tc qdisc add dev eth0 ingress
> RTNETLINK answers: No such file or directory
> # tc -s -d qdisc show dev eth0
> qdisc mq 0: root 
>  Sent 636 bytes 10 pkt (dropped 0, overlimits 0 requeues 0) 
>  backlog 0b 0p requeues 0 
> 

nice

> 
> > 3) Compile ingress as a module and add/delete ingress qdisc
> > 
> Seems to work like 1)
> 

ok then, nothing else to bitch about.

Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>

> I took a look at ifb as suggested by Stephen but could not see trivial
> changes (LLTX or per-cpu stats), since central lock is needed I am
> afraid. 

I wasnt thinking ifb - but ifb (and probably other netdevs) could
benefit a tiny bit by rearranging net_stats struct rx and tx parts into
different cache lines since these are typically updated in different
paths. In kernel, rx path for ifb is guaranteed to run in one cpu at a
time but tx could be many cpus (so at least for rx path theres benefit)

> And qdisc are the same, stats updates are mostly free as we
> dirtied cache line for the lock.

ok - qdiscs are easier because a qdisc instance can be accessed by one
cpu at a time. actions could be configured to be shared by many filters;
in such a mode they could be accessed across many cpus.

cheers,
jamal


^ permalink raw reply

* Re: [PATCH 1/2] net: Fix the condition passed to sk_wait_event()
From: Nagendra Tomar @ 2010-10-02 12:08 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, davem

This is the first part of the split up patch submitted in http://www.spinics.net/lists/netdev/msg142617.html

This patch fixes the sk_wait_event() condition in the sk_stream_wait_connect() function. With this change, we correctly check for the TCPF_ESTABLISHED and TCPF_CLOSE_WAIT states and avoid potentially returning success when there is an error on the socket.

Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
---
--- linux-2.6.35.7/net/core/stream.c.orig	2010-03-24 09:30:00.000000000 +0530
+++ linux-2.6.35.7/net/core/stream.c	2010-03-24 09:30:17.000000000 +0530
@@ -73,9 +73,8 @@ int sk_stream_wait_connect(struct sock *
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		sk->sk_write_pending++;
 		done = sk_wait_event(sk, timeo_p,
-				     !sk->sk_err &&
-				     !((1 << sk->sk_state) &
-				       ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
+				     ((1 << sk->sk_state) &
+				       (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
 		finish_wait(sk_sleep(sk), &wait);
 		sk->sk_write_pending--;
 	} while (!done);

^ permalink raw reply

* checkentry function
From: Nicola Padovano @ 2010-10-02 11:59 UTC (permalink / raw)
  To: netfilter-devel, netdev

Hello there.
I've written checkentry function to check my new target, in this way:

[CHECK_ENTRY_CODE]
static bool xt_tarpit_check(const char *tablename, const void *entry,
                            const struct xt_target *target, void *targinfo,
                            unsigned int hook_mask)
{
 if (strcmp(tablename, "filter"))   {
    printk(KERN_INFO "DEBUG: the tablename (not FILTER) is %s\n",tablename);
    return false;
  }
return true;
}
[/CHECK_ENTRY_CODE]

but it doesn't work.
In fact if I do:

iptables -A INPUT -t filter -s 192.168.0.1 -p tcp -j TAR

the printk prints this message: DEBUG: the tablename (not FILTER) is: �%H �

so: in the tablename i haven't the string "filter"...what' the matter?

-- 
Nicola Padovano
e-mail: nicola.padovano@gmail.com
web: http://npadovano.altervista.org

"My only ambition is not be anything at all; it seems the most
sensible thing" (C. Bukowski)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2.6.35.7] net: Fix the condition passed to sk_wait_event()
From: Nagendra Tomar @ 2010-10-02 11:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, linux-kernel
In-Reply-To: <1286008441.2582.851.camel@edumazet-laptop>



--- On Sat, 2/10/10, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > Just wondering why you remove the test on sk->err
> ?
> > 
> > We want to break the loop If sk->sk_err is set, or
> state is ESTABLISHED
> > or CLOSE_WAIT.
> 
> Hmm, reading the code again, I can see sk_err is tested in
> the loop, so
> your code is better (sk_stream_wait_connect() returns an
> error after
> your patch, instead of returning 0)

Exactly.

> 
> Could you please split your patch in two patches ?
> 

ok, I'll send it soon.

Thanks,
Tomar




      

^ permalink raw reply

* [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: Robin Holt @ 2010-10-02 11:24 UTC (permalink / raw)
  To: Willy Tarreau, linux-kernel, netdev
  Cc: David S. Miller, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy
In-Reply-To: <20101002112405.951704198@gulag1.americas.sgi.com>

[-- Attachment #1: sysctl_tcp_udp_mem_max_overflows --]
[-- Type: text/plain, Size: 2746 bytes --]

Subject: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.

On a 16TB x86_64 machine, sysctl_tcp_mem[2], sysctl_udp_mem[2], and
sysctl_sctp_mem[2] can integer overflow.  Set limit such that they are
maximized without overflowing.

Signed-off-by: Robin Holt <holt@sgi.com>
To: Willy Tarreau <w@1wt.eu>
To: linux-kernel@vger.kernel.org
To: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>

---

 net/ipv4/tcp.c |    4 +++-
 net/ipv4/udp.c |    4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

Index: pv1010932/net/ipv4/tcp.c
===================================================================
--- pv1010932.orig/net/ipv4/tcp.c	2010-10-02 06:11:59.737449853 -0500
+++ pv1010932/net/ipv4/tcp.c	2010-10-02 06:12:35.445454593 -0500
@@ -3271,12 +3271,14 @@ void __init tcp_init(void)
 
 	/* Set the pressure threshold to be a fraction of global memory that
 	 * is up to 1/2 at 256 MB, decreasing toward zero with the amount of
-	 * memory, with a floor of 128 pages.
+	 * memory, with a floor of 128 pages, and a ceiling that prevents an
+	 * integer overflow.
 	 */
 	nr_pages = totalram_pages - totalhigh_pages;
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_tcp_mem[0] = limit / 4 * 3;
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
Index: pv1010932/net/ipv4/udp.c
===================================================================
--- pv1010932.orig/net/ipv4/udp.c	2010-10-02 06:11:59.737449853 -0500
+++ pv1010932/net/ipv4/udp.c	2010-10-02 06:12:35.453453784 -0500
@@ -2167,12 +2167,14 @@ void __init udp_init(void)
 	udp_table_init(&udp_table, "UDP");
 	/* Set the pressure threshold up by the same strategy of TCP. It is a
 	 * fraction of global memory that is up to 1/2 at 256 MB, decreasing
-	 * toward zero with the amount of memory, with a floor of 128 pages.
+	 * toward zero with the amount of memory, with a floor of 128 pages,
+	 * and a ceiling that prevents an integer overflow.
 	 */
 	nr_pages = totalram_pages - totalhigh_pages;
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_udp_mem[0] = limit / 4 * 3;
 	sysctl_udp_mem[1] = limit;
 	sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;

^ permalink raw reply

* Re: [PATCHv3 net-next-2.6 3/5] XFRM,IPv6: Add IRO src/dst address remapping XFRM types and i/o handlers
From: Herbert Xu @ 2010-10-02 10:32 UTC (permalink / raw)
  To: Arnaud Ebalard; +Cc: David Miller, eric.dumazet, yoshfuji, netdev
In-Reply-To: <87iq1k99vk.fsf@small.ssi.corp>

On Sat, Oct 02, 2010 at 12:17:35PM +0200, Arnaud Ebalard wrote:
>
> and I see no reason not to keep the lock we have on the state until the
> end of the function when the state is valid (when we break), instead of
> releasing it to get it again later. Something like the following would
> allow removing the spin_lock()/spin_unlock() calls from all mip6 input
> handlers (mip6_{destopt,rthdr,iro_src,iro_dst}_input()):

No I moved the state lock down precisely because it should not
be taken at a higher level as that breaks asynchronous IPsec
processing and the fact that it isn't needed in most places.

If your code needs it then you should take it rather than impose
it on real IPsec users.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCHv3 net-next-2.6 3/5] XFRM,IPv6: Add IRO src/dst address remapping XFRM types and i/o handlers
From: Arnaud Ebalard @ 2010-10-02 10:17 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, herbert, yoshfuji, netdev
In-Reply-To: <20100929.201618.241458205.davem@davemloft.net>

Hi,

David Miller <davem@davemloft.net> writes:

>> +static int mip6_iro_src_reject(struct xfrm_state *x, struct sk_buff *skb, struct flowi *fl)
>> +{
>> +	int err = 0;
>> +
>> +	/* XXX We may need some reject handler at some point but it is not
>> +	 * critical yet: see xfrm_secpath_reject() in net/xfrm/xfrm_policy.c
>> +	 * and aslo what mip6_destopt_reject() implements */
>> +
>> +	printk("XXX FIXME: mip6_iro_src_reject() called\n");
>
> pr_debug() or pr_err() or get rid of it altogher and use WARN_ON() or
> similar.

I will take a look at this reject handler tomorrow (implement or remove it).


>> +	spin_lock(&x->lock);
>> +	if (!ipv6_addr_equal(&iph->daddr, (struct in6_addr *)x->coaddr) &&
>> +	    !ipv6_addr_any((struct in6_addr *)x->coaddr))
>> +		err = -ENOENT;
>> +	spin_unlock(&x->lock);
>
> What are you actually protecting with this lock?  The moment you drop
> it the x->coaddr can change which changes the result you should return
> here.
>
> I suspect you either don't need the lock, or you need to lock at a higher
> level.

I basically trusted RH2 input handler code and reused it as a basis:

  static int mip6_rthdr_input(struct xfrm_state *x, struct sk_buff *skb)
  {
  	struct ipv6hdr *iph = ipv6_hdr(skb);
  	struct rt2_hdr *rt2 = (struct rt2_hdr *)skb->data;
  	int err = rt2->rt_hdr.nexthdr;
  
  	spin_lock(&x->lock);
  	if (!ipv6_addr_equal(&iph->daddr, (struct in6_addr *)x->coaddr) &&
  	    !ipv6_addr_any((struct in6_addr *)x->coaddr))
  		err = -ENOENT;
  	spin_unlock(&x->lock);
  
  	return err;
  }

*At that time*, I considered the lock useful to prevent changes on coaddr
during the two checks, i.e. to make it coherent. But I think you are
right and I see no reason for the lock not to be at a higher level:
I may have missed somthing but AFAICT, from a look at the code, there is
nothing preventing x->coaddr to  be updated (via xfrm_sa_update()) just
before or just after the checks.

I took a look at the callers for mip6 handlers and if I am not mistaken
there is *only* xfrm6_input_addr() because xfrm_input() only handles
esp, ah and ipcomp extension headers and not mip6-related ones
(i.e. only IPsec-related ones, those with a SPI). Here is a snippet of
the interesting (lock-wise) part of xfrm6_input_addr():

>	for (i = 0; i < 3; i++) {
>
>               <....snip....>
>
> 		spin_lock(&x->lock);
> 
> 		if ((!i || (x->props.flags & XFRM_STATE_WILDRECV)) &&
> 		    likely(x->km.state == XFRM_STATE_VALID) &&
> 		    !xfrm_state_check_expire(x)) {
> 			spin_unlock(&x->lock);
> 			if (x->type->input(x, skb) > 0) {
> 				/* found a valid state */
> 				break;
> 			}
> 		} else
> 			spin_unlock(&x->lock);
> 
> 		xfrm_state_put(x);
> 		x = NULL;
> 	}
> 
> 	if (!x) {
> 		XFRM_INC_STATS(net, LINUX_MIB_XFRMINNOSTATES);
> 		xfrm_audit_state_notfound_simple(skb, AF_INET6);
> 		goto drop;
> 	}
> 
> 	skb->sp->xvec[skb->sp->len++] = x;
> 
> 	spin_lock(&x->lock);
> 
> 	x->curlft.bytes += skb->len;
> 	x->curlft.packets++;
> 
> 	spin_unlock(&x->lock);

and I see no reason not to keep the lock we have on the state until the
end of the function when the state is valid (when we break), instead of
releasing it to get it again later. Something like the following would
allow removing the spin_lock()/spin_unlock() calls from all mip6 input
handlers (mip6_{destopt,rthdr,iro_src,iro_dst}_input()):

> 		spin_lock(&x->lock);
>
> 		if ((!i || (x->props.flags & XFRM_STATE_WILDRECV)) &&
> 		    likely(x->km.state == XFRM_STATE_VALID) &&
> 		    !xfrm_state_check_expire(x)) {
> 			if (x->type->input(x, skb) > 0) {
> 				/* found a valid state */
> 				break;
> 			} 
> 		}
>
> 		spin_unlock(&x->lock);
>
> 		xfrm_state_put(x);
> 		x = NULL;
> 	}
> 
> 	if (!x) {
> 		XFRM_INC_STATS(net, LINUX_MIB_XFRMINNOSTATES);
> 		xfrm_audit_state_notfound_simple(skb, AF_INET6);
> 		goto drop;
> 	}
> 
> 	skb->sp->xvec[skb->sp->len++] = x;
> 
> 	x->curlft.bytes += skb->len;
> 	x->curlft.packets++;
> 
> 	spin_unlock(&x->lock);

If this is ok, I will add a patch to the set to do that and also remove
the locks from the input handlers I introduce.

Cheers,

a+

^ permalink raw reply

* Re: [PATCH net-next V2] net: dynamic ingress_queue allocation
From: Jarek Poplawski @ 2010-10-02  9:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: hadi, David Miller, netdev
In-Reply-To: <1285941388.2641.175.camel@edumazet-laptop>

On Fri, Oct 01, 2010 at 03:56:28PM +0200, Eric Dumazet wrote:
> Le vendredi 01 octobre 2010 ?? 07:45 -0400, jamal a écrit :
> > On Fri, 2010-10-01 at 00:58 +0200, Eric Dumazet wrote:
> > > Hi Jamal
> > > 
> > > Here is the dynamic allocation I promised. I lightly tested it, could
> > > you review it please ?
> > > Thanks !
> > > 
> > > [PATCH net-next2.6] net: dynamic ingress_queue allocation
> > > 
> > > ingress being not used very much, and net_device->ingress_queue being
> > > quite a big object (128 or 256 bytes), use a dynamic allocation if
> > > needed (tc qdisc add dev eth0 ingress ...)
> > 
> > I agree with the principle that it is valuable in making it dynamic for
> > people who dont want it; but, but (like my kid would say, sniff, sniff)
> > you are making me sad saying it is not used very much ;-> It is used
> > very much in my world. My friend Jarek uses it;->

Thanks Jamal, my friend, I'm really glad! (And sad ;-)

> 
> ;)
> 
> > 
> > > +#ifdef CONFIG_NET_CLS_ACT
> > 
> > I think appropriately this should be NET_SCH_INGRESS (everywhere else as
> > well).
> > 
> 
> I first thought of this, and found it would add a new dependence on
> vmlinux :
> 
> If someone wants to add NET_SCH_INGRESS module, he would need to
> recompile whole kernel and reboot.
> 
> This is probably why ing_filter() and handle_ing() are enclosed with
> CONFIG_NET_CLS_ACT, not CONFIG_NET_SCH_INGRESS.
> 
> Since struct net_dev only holds one pointer (after this patch), I
> believe its better to use same dependence.
> 
> > 
> > > +static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
> > > +{
> > > +#ifdef CONFIG_NET_CLS_ACT
> > > +	return dev->ingress_queue;
> > > +#else
> > > +	return NULL;
> > > +#endif
> > 
> > Above, if you just returned dev->ingress_queue wouldnt it always be 
> > NULL if it was not allocated?
> > 
> 
> ingress_queue is not defined in "struct net_device *dev" if 
> !CONFIG_NET_CLS_ACT
> 
> Returning NULL here permits dead code elimination by compiler.
> 
> Then, probably nobody unset CONFIG_NET_CLS_ACT, so we can probably avoid
> this preprocessor stuff.
> 
> > 
> > > @@ -2737,7 +2734,9 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
> > >  					 struct packet_type **pt_prev,
> > >  					 int *ret, struct net_device *orig_dev)
> > >  {
> > > -	if (skb->dev->ingress_queue.qdisc == &noop_qdisc)
> > > +	struct netdev_queue *rxq = dev_ingress_queue(skb->dev);
> > > +
> > > +	if (!rxq || rxq->qdisc == &noop_qdisc)
> > >  		goto out;
> > 
> > I stared at above a little longer since this is the only fast path
> > affected; is it a few more cycles now for people who love ingress?
> 
> I see, this adds an indirection and a conditional branch, but this
> should be in cpu cache and well predicted.
> 
> I thought adding a fake "struct netdev_queue" object, with a qdisc
> pointing to noop_qdisc. But this would slow down a bit non ingress
> users ;)
> 
> For people not using ingress, my solution removes an access to an extra
> cache line. So network latency is improved a bit when cpu caches are
> full of user land data.
> 
> > 
> > > @@ -690,6 +693,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> > >  		    (new && new->flags & TCQ_F_INGRESS)) {
> > >  			num_q = 1;
> > >  			ingress = 1;
> > > +			if (!dev_ingress_queue(dev))
> > > +				return -ENOENT;
> > >  		}
> > >  
> > 
> > The above looks clever but worries me because it changes the old flow.
> > If you have time,  the following tests will alleviate my fears
> > 
> > 1) compile support for ingress and add/delete ingress qdisc
> 
> This worked for me, but I dont know complex setups.
> 
> > 2) Dont compile support and add/delete ingress qdisc
> 
> tc gives an error (a bit like !CONFIG_NET_SCH_INGRESS)
> 
> # tc qdisc add dev eth0 ingress
> RTNETLINK answers: No such file or directory
> # tc -s -d qdisc show dev eth0
> qdisc mq 0: root 
>  Sent 636 bytes 10 pkt (dropped 0, overlimits 0 requeues 0) 
>  backlog 0b 0p requeues 0 
> 
> 
> > 3) Compile ingress as a module and add/delete ingress qdisc
> > 
> > 
> 
> Seems to work like 1)
> 
> > Other than that excellent work Eric. And you can add my
> > Acked/reviewed-by etc.
> > 
> > BTW, did i say i like your per-cpu stats stuff? It applies nicely to
> > qdiscs, actions etc ;->
> 
> I took a look at ifb as suggested by Stephen but could not see trivial
> changes (LLTX or per-cpu stats), since central lock is needed I am
> afraid. And qdisc are the same, stats updates are mostly free as we
> dirtied cache line for the lock.
> 
> Thanks Jamal !
> 
> Here is the V2, with two #ifdef removed.
> 
> 
> [PATCH net-next V2] net: dynamic ingress_queue allocation
> 
> ingress being not used very much, and net_device->ingress_queue being
> quite a big object (128 or 256 bytes), use a dynamic allocation if
> needed (tc qdisc add dev eth0 ingress ...)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/netdevice.h |   11 ++++++--
>  net/core/dev.c            |   48 +++++++++++++++++++++++++++---------
>  net/sched/sch_api.c       |   40 ++++++++++++++++++++----------
>  net/sched/sch_generic.c   |   36 +++++++++++++++------------
>  4 files changed, 92 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ceed347..4f86009 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -986,8 +986,7 @@ struct net_device {
>  	rx_handler_func_t	*rx_handler;
>  	void			*rx_handler_data;
>  
> -	struct netdev_queue	ingress_queue; /* use two cache lines */
> -
> +	struct netdev_queue	*ingress_queue;
>  /*
>   * Cache lines mostly used on transmit path
>   */
> @@ -1115,6 +1114,14 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
>  		f(dev, &dev->_tx[i], arg);
>  }
>  
> +
> +static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
> +{
> +	return dev->ingress_queue;
> +}
> +
> +extern struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
> +
>  /*
>   * Net namespace inlines
>   */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a313bab..e3bb8c9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2702,11 +2702,10 @@ EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>   * the ingress scheduler, you just cant add policies on ingress.
>   *
>   */
> -static int ing_filter(struct sk_buff *skb)
> +static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
>  {
>  	struct net_device *dev = skb->dev;
>  	u32 ttl = G_TC_RTTL(skb->tc_verd);
> -	struct netdev_queue *rxq;
>  	int result = TC_ACT_OK;
>  	struct Qdisc *q;
>  
> @@ -2720,8 +2719,6 @@ static int ing_filter(struct sk_buff *skb)
>  	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
>  	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
>  
> -	rxq = &dev->ingress_queue;
> -
>  	q = rxq->qdisc;
>  	if (q != &noop_qdisc) {
>  		spin_lock(qdisc_lock(q));
> @@ -2737,7 +2734,9 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
>  					 struct packet_type **pt_prev,
>  					 int *ret, struct net_device *orig_dev)
>  {
> -	if (skb->dev->ingress_queue.qdisc == &noop_qdisc)
> +	struct netdev_queue *rxq = dev_ingress_queue(skb->dev);
> +
> +	if (!rxq || rxq->qdisc == &noop_qdisc)
>  		goto out;
>  
>  	if (*pt_prev) {
> @@ -2745,7 +2744,7 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
>  		*pt_prev = NULL;
>  	}
>  
> -	switch (ing_filter(skb)) {
> +	switch (ing_filter(skb, rxq)) {
>  	case TC_ACT_SHOT:
>  	case TC_ACT_STOLEN:
>  		kfree_skb(skb);
> @@ -4932,15 +4931,17 @@ static void __netdev_init_queue_locks_one(struct net_device *dev,
>  					  struct netdev_queue *dev_queue,
>  					  void *_unused)
>  {
> -	spin_lock_init(&dev_queue->_xmit_lock);
> -	netdev_set_xmit_lockdep_class(&dev_queue->_xmit_lock, dev->type);
> -	dev_queue->xmit_lock_owner = -1;
> +	if (dev_queue) {
> +		spin_lock_init(&dev_queue->_xmit_lock);
> +		netdev_set_xmit_lockdep_class(&dev_queue->_xmit_lock, dev->type);
> +		dev_queue->xmit_lock_owner = -1;
> +	}
>  }
>  
>  static void netdev_init_queue_locks(struct net_device *dev)
>  {
>  	netdev_for_each_tx_queue(dev, __netdev_init_queue_locks_one, NULL);
> -	__netdev_init_queue_locks_one(dev, &dev->ingress_queue, NULL);
> +	__netdev_init_queue_locks_one(dev, dev_ingress_queue(dev), NULL);

Is dev_ingress_queue(dev) not NULL anytime here?

>  }
>  
>  unsigned long netdev_fix_features(unsigned long features, const char *name)
> @@ -5447,16 +5448,37 @@ static void netdev_init_one_queue(struct net_device *dev,
>  				  struct netdev_queue *queue,
>  				  void *_unused)
>  {
> -	queue->dev = dev;
> +	if (queue)
> +		queue->dev = dev;
>  }
>  
>  static void netdev_init_queues(struct net_device *dev)
>  {
> -	netdev_init_one_queue(dev, &dev->ingress_queue, NULL);
> +	netdev_init_one_queue(dev, dev_ingress_queue(dev), NULL);

Is dev_ingress_queue(dev) not NULL anytime here?

>  	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
>  	spin_lock_init(&dev->tx_global_lock);
>  }
>  
> +struct netdev_queue *dev_ingress_queue_create(struct net_device *dev)
> +{
> +	struct netdev_queue *queue = dev_ingress_queue(dev);
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	if (queue)
> +		return queue;
> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> +	if (!queue)
> +		return NULL;
> +	netdev_init_one_queue(dev, queue, NULL);
> +	__netdev_init_queue_locks_one(dev, queue, NULL);
> +	queue->qdisc = &noop_qdisc;
> +	queue->qdisc_sleeping = &noop_qdisc;
> +	smp_wmb();

Why don't we need smp_rmb() in handle_ing()?

> +	dev->ingress_queue = queue;
> +#endif
> +	return queue;
> +}
> +
>  /**
>   *	alloc_netdev_mq - allocate network device
>   *	@sizeof_priv:	size of private data to allocate space for
> @@ -5559,6 +5581,8 @@ void free_netdev(struct net_device *dev)
>  
>  	kfree(dev->_tx);
>  
> +	kfree(dev_ingress_queue(dev));
> +
>  	/* Flush device addresses */
>  	dev_addr_flush(dev);
>  
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index b802078..8635110 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -240,7 +240,10 @@ struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
>  	if (q)
>  		goto out;
>  
> -	q = qdisc_match_from_root(dev->ingress_queue.qdisc_sleeping, handle);
> +	if (!dev_ingress_queue(dev))
> +		goto out;
> +	q = qdisc_match_from_root(dev_ingress_queue(dev)->qdisc_sleeping,
> +				  handle);

I'd prefer:
 +	if (dev_ingress_queue(dev))
 +		q = qdisc_match_from_root(dev_ingress_queue(dev)->qdisc_sleeping,

>  out:
>  	return q;
>  }
> @@ -690,6 +693,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		    (new && new->flags & TCQ_F_INGRESS)) {
>  			num_q = 1;
>  			ingress = 1;
> +			if (!dev_ingress_queue(dev))
> +				return -ENOENT;

Is this test really needed here?

>  		}
>  
>  		if (dev->flags & IFF_UP)
> @@ -701,7 +706,7 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		}
>  
>  		for (i = 0; i < num_q; i++) {
> -			struct netdev_queue *dev_queue = &dev->ingress_queue;
> +			struct netdev_queue *dev_queue = dev_ingress_queue(dev);
>  
>  			if (!ingress)
>  				dev_queue = netdev_get_tx_queue(dev, i);
> @@ -979,7 +984,8 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
>  					return -ENOENT;
>  				q = qdisc_leaf(p, clid);
>  			} else { /* ingress */
> -				q = dev->ingress_queue.qdisc_sleeping;
> +				if (dev_ingress_queue(dev))
> +					q = dev_ingress_queue(dev)->qdisc_sleeping;
>  			}
>  		} else {
>  			q = dev->qdisc;
> @@ -1044,7 +1050,8 @@ replay:
>  					return -ENOENT;
>  				q = qdisc_leaf(p, clid);
>  			} else { /*ingress */
> -				q = dev->ingress_queue.qdisc_sleeping;
> +				if (dev_ingress_queue_create(dev))
> +					q = dev_ingress_queue(dev)->qdisc_sleeping;

I wonder if doing dev_ingress_queue_create() just before qdisc_create()
(and the test here) isn't more readable.

>  			}
>  		} else {
>  			q = dev->qdisc;
> @@ -1123,11 +1130,14 @@ replay:
>  create_n_graft:
>  	if (!(n->nlmsg_flags&NLM_F_CREATE))
>  		return -ENOENT;
> -	if (clid == TC_H_INGRESS)
> -		q = qdisc_create(dev, &dev->ingress_queue, p,
> -				 tcm->tcm_parent, tcm->tcm_parent,
> -				 tca, &err);
> -	else {
> +	if (clid == TC_H_INGRESS) {
> +		if (dev_ingress_queue(dev))
> +			q = qdisc_create(dev, dev_ingress_queue(dev), p,
> +					 tcm->tcm_parent, tcm->tcm_parent,
> +					 tca, &err);
> +		else
> +			err = -ENOENT;
> +	} else {
>  		struct netdev_queue *dev_queue;
>  
>  		if (p && p->ops->cl_ops && p->ops->cl_ops->select_queue)
> @@ -1304,8 +1314,10 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct netlink_callback *cb)
>  		if (tc_dump_qdisc_root(dev->qdisc, skb, cb, &q_idx, s_q_idx) < 0)
>  			goto done;
>  
> -		dev_queue = &dev->ingress_queue;
> -		if (tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb, &q_idx, s_q_idx) < 0)
> +		dev_queue = dev_ingress_queue(dev);
> +		if (dev_queue &&
> +		    tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb,
> +				       &q_idx, s_q_idx) < 0)
>  			goto done;
>  
>  cont:
> @@ -1595,8 +1607,10 @@ static int tc_dump_tclass(struct sk_buff *skb, struct netlink_callback *cb)
>  	if (tc_dump_tclass_root(dev->qdisc, skb, tcm, cb, &t, s_t) < 0)
>  		goto done;
>  
> -	dev_queue = &dev->ingress_queue;
> -	if (tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb, &t, s_t) < 0)
> +	dev_queue = dev_ingress_queue(dev);
> +	if (dev_queue &&
> +	    tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb,
> +				&t, s_t) < 0)
>  		goto done;
>  
>  done:
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 545278a..c42dec5 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -721,16 +721,18 @@ static void transition_one_qdisc(struct net_device *dev,
>  				 struct netdev_queue *dev_queue,
>  				 void *_need_watchdog)
>  {
> -	struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
> -	int *need_watchdog_p = _need_watchdog;
> +	if (dev_queue) {
> +		struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
> +		int *need_watchdog_p = _need_watchdog;
>  
> -	if (!(new_qdisc->flags & TCQ_F_BUILTIN))
> -		clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
> +		if (!(new_qdisc->flags & TCQ_F_BUILTIN))
> +			clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
>  
> -	rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
> -	if (need_watchdog_p && new_qdisc != &noqueue_qdisc) {
> -		dev_queue->trans_start = 0;
> -		*need_watchdog_p = 1;
> +		rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
> +		if (need_watchdog_p && new_qdisc != &noqueue_qdisc) {
> +			dev_queue->trans_start = 0;
> +			*need_watchdog_p = 1;
> +		}
>  	}
>  }
>  
> @@ -753,7 +755,7 @@ void dev_activate(struct net_device *dev)
>  
>  	need_watchdog = 0;
>  	netdev_for_each_tx_queue(dev, transition_one_qdisc, &need_watchdog);
> -	transition_one_qdisc(dev, &dev->ingress_queue, NULL);
> +	transition_one_qdisc(dev, dev_ingress_queue(dev), NULL);

I'd prefer here and similarly later:

 +	if (dev_ingress_queue(dev))
 +		transition_one_qdisc(dev, dev_ingress_queue(dev), NULL);

to show NULL dev_queue is only legal in this one case.

Cheers,
Jarek P.

>  
>  	if (need_watchdog) {
>  		dev->trans_start = jiffies;
> @@ -768,7 +770,7 @@ static void dev_deactivate_queue(struct net_device *dev,
>  	struct Qdisc *qdisc_default = _qdisc_default;
>  	struct Qdisc *qdisc;
>  
> -	qdisc = dev_queue->qdisc;
> +	qdisc = dev_queue ? dev_queue->qdisc : NULL;
>  	if (qdisc) {
>  		spin_lock_bh(qdisc_lock(qdisc));
>  
> @@ -812,7 +814,7 @@ static bool some_qdisc_is_busy(struct net_device *dev)
>  void dev_deactivate(struct net_device *dev)
>  {
>  	netdev_for_each_tx_queue(dev, dev_deactivate_queue, &noop_qdisc);
> -	dev_deactivate_queue(dev, &dev->ingress_queue, &noop_qdisc);
> +	dev_deactivate_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
>  
>  	dev_watchdog_down(dev);
>  
> @@ -830,15 +832,17 @@ static void dev_init_scheduler_queue(struct net_device *dev,
>  {
>  	struct Qdisc *qdisc = _qdisc;
>  
> -	dev_queue->qdisc = qdisc;
> -	dev_queue->qdisc_sleeping = qdisc;
> +	if (dev_queue) {
> +		dev_queue->qdisc = qdisc;
> +		dev_queue->qdisc_sleeping = qdisc;
> +	}
>  }
>  
>  void dev_init_scheduler(struct net_device *dev)
>  {
>  	dev->qdisc = &noop_qdisc;
>  	netdev_for_each_tx_queue(dev, dev_init_scheduler_queue, &noop_qdisc);
> -	dev_init_scheduler_queue(dev, &dev->ingress_queue, &noop_qdisc);
> +	dev_init_scheduler_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
>  
>  	setup_timer(&dev->watchdog_timer, dev_watchdog, (unsigned long)dev);
>  }
> @@ -847,7 +851,7 @@ static void shutdown_scheduler_queue(struct net_device *dev,
>  				     struct netdev_queue *dev_queue,
>  				     void *_qdisc_default)
>  {
> -	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
> +	struct Qdisc *qdisc = dev_queue ? dev_queue->qdisc_sleeping : NULL;
>  	struct Qdisc *qdisc_default = _qdisc_default;
>  
>  	if (qdisc) {
> @@ -861,7 +865,7 @@ static void shutdown_scheduler_queue(struct net_device *dev,
>  void dev_shutdown(struct net_device *dev)
>  {
>  	netdev_for_each_tx_queue(dev, shutdown_scheduler_queue, &noop_qdisc);
> -	shutdown_scheduler_queue(dev, &dev->ingress_queue, &noop_qdisc);
> +	shutdown_scheduler_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
>  	qdisc_destroy(dev->qdisc);
>  	dev->qdisc = &noop_qdisc;
>  
> 
> 

^ permalink raw reply

* Re: [patch v3 00/12] IPVS: SIP Persistence Engine
From: Julian Anastasov @ 2010-10-02  9:00 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, netfilter, netfilter-devel, Jan Engelhardt,
	Stephen Hemminger, Wensong Zhang, Patrick McHardy
In-Reply-To: <20101002023056.536217499@akiko.akashicho.tokyo.vergenet.net>


 	Hello,

On Sat, 2 Oct 2010, Simon Horman wrote:

> This patch series adds load-balancing of UDP SIP based on Call-ID to
> IPVS as well as a frame-work for extending IPVS to handle alternate
> persistence requirements.
>
> REVISIONS
>
> This is v3 of the patch series with addresses serveral problems
> raised by Julian Anastasov on the lvs-devel mailing list.

 	No other obvious problems in v3 1-12. So,
Acked-by: Julian Anastasov <ja@ssi.bg>

 	May be next days I'll test my changes on top of PE v3

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Payment.
From: Western Union® @ 2010-10-02  8:49 UTC (permalink / raw)





You have earned $1,000,000.00 USD from Western Union. To receive your
funds, send your details: Full Names, Age, Country, Phone Number to:
(funds@westernu-online.com)


^ permalink raw reply

* Re: [PATCH 2.6.35.7] net: Fix the condition passed to sk_wait_event()
From: Eric Dumazet @ 2010-10-02  8:34 UTC (permalink / raw)
  To: Nagendra Tomar; +Cc: netdev, davem, linux-kernel
In-Reply-To: <1286008051.2582.846.camel@edumazet-laptop>

Le samedi 02 octobre 2010 à 10:27 +0200, Eric Dumazet a écrit :

> Just wondering why you remove the test on sk->err ?
> 
> We want to break the loop If sk->sk_err is set, or state is ESTABLISHED
> or CLOSE_WAIT.

Hmm, reading the code again, I can see sk_err is tested in the loop, so
your code is better (sk_stream_wait_connect() returns an error after
your patch, instead of returning 0)

Could you please split your patch in two patches ?

The sk_stream_wait_connect() problems comes from commit
c1cbe4b7ad0bc4b1d9 ([NET]: Avoid atomic xchg() for non-error case)





^ permalink raw reply

* Re: [PATCH 2.6.35.7] net: Fix the condition passed to sk_wait_event()
From: Eric Dumazet @ 2010-10-02  8:27 UTC (permalink / raw)
  To: Nagendra Tomar; +Cc: netdev, davem, linux-kernel
In-Reply-To: <667106.4951.qm@web53706.mail.re2.yahoo.com>

Le samedi 02 octobre 2010 à 01:22 -0700, Nagendra Tomar a écrit :
> Resending ...
> 
> 
> The condition (3rd arg) passed to sk_wait_event() in sk_stream_wait_memory() and sk_stream_wait_connect() are incorrect.
> The incorrect check in sk_stream_wait_memory() causes the following soft lockup in tcp_sendmsg() when the global tcp memory pool has exhausted. The check in sk_stream_wait_connect() was found by code audit.
>    
> 
> >>> snip <<<
> 
> localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
> localhost kernel: CPU 3:
> localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200]  [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
> localhost kernel: Call Trace:
> localhost kernel:  [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
> localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
> localhost kernel:  [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
> localhost kernel:  [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
> localhost kernel:  [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
> localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
> localhost kernel:  [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
> localhost kernel:  [vfs_write+0x185/0x190] vfs_write+0x185/0x190
> localhost kernel:  [sys_write+0x50/0x90] sys_write+0x50/0x90
> localhost kernel:  [system_call+0x7e/0x83] system_call+0x7e/0x83
> 
> >>> snip <<<
> 
> What is happening is, that the sk_wait_event() condition passed from
> sk_stream_wait_memory() evaluates to true for the case of tcp global memory
> exhaustion. This is because both sk_stream_memory_free() and vm_wait are true which causes sk_wait_event() to *not* call schedule_timeout(). 
> Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping.
> This causes the caller to again try allocation, which again fails and again
> calls sk_stream_wait_memory(), and so on.
> 
> 

Hi Nagendra


> Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
> ---
> --- linux-2.6.35.7/net/core/stream.c.orig	2010-03-23 23:46:45.000000000 +0530
> +++ linux-2.6.35.7/net/core/stream.c	2010-03-24 00:21:09.000000000 +0530
> @@ -73,9 +73,8 @@ int sk_stream_wait_connect(struct sock *
>  		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>  		sk->sk_write_pending++;
>  		done = sk_wait_event(sk, timeo_p,
> -				     !sk->sk_err &&
> -				     !((1 << sk->sk_state) &
> -				       ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
> +				     ((1 << sk->sk_state) &
> +				       (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));

Just wondering why you remove the test on sk->err ?

We want to break the loop If sk->sk_err is set, or state is ESTABLISHED
or CLOSE_WAIT.

>  		finish_wait(sk_sleep(sk), &wait);
>  		sk->sk_write_pending--;
>  	} while (!done);
> @@ -144,10 +143,9 @@ int sk_stream_wait_memory(struct sock *s
>  
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  		sk->sk_write_pending++;
> -		sk_wait_event(sk, &current_timeo, !sk->sk_err &&
> -						  !(sk->sk_shutdown & SEND_SHUTDOWN) &&
> -						  sk_stream_memory_free(sk) &&
> -						  vm_wait);
> +		sk_wait_event(sk, &current_timeo, sk->sk_err ||
> +						  (sk->sk_shutdown & SEND_SHUTDOWN) ||
> +						  (sk_stream_memory_free(sk) && !vm_wait));
>  		sk->sk_write_pending--;
>  
>  		if (vm_wait) {
> 
> 
> 

Thanks !

^ permalink raw reply

* Re: [PATCH 2.6.35.7] net: Fix the condition passed to sk_wait_event()
From: Nagendra Tomar @ 2010-10-02  8:22 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-kernel

Resending ...


The condition (3rd arg) passed to sk_wait_event() in sk_stream_wait_memory() and sk_stream_wait_connect() are incorrect.
The incorrect check in sk_stream_wait_memory() causes the following soft lockup in tcp_sendmsg() when the global tcp memory pool has exhausted. The check in sk_stream_wait_connect() was found by code audit.
   

>>> snip <<<

localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
localhost kernel: CPU 3:
localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200]  [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
localhost kernel: Call Trace:
localhost kernel:  [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel:  [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
localhost kernel:  [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
localhost kernel:  [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel:  [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
localhost kernel:  [vfs_write+0x185/0x190] vfs_write+0x185/0x190
localhost kernel:  [sys_write+0x50/0x90] sys_write+0x50/0x90
localhost kernel:  [system_call+0x7e/0x83] system_call+0x7e/0x83

>>> snip <<<

What is happening is, that the sk_wait_event() condition passed from
sk_stream_wait_memory() evaluates to true for the case of tcp global memory
exhaustion. This is because both sk_stream_memory_free() and vm_wait are true which causes sk_wait_event() to *not* call schedule_timeout(). 
Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping.
This causes the caller to again try allocation, which again fails and again
calls sk_stream_wait_memory(), and so on.


Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
---
--- linux-2.6.35.7/net/core/stream.c.orig	2010-03-23 23:46:45.000000000 +0530
+++ linux-2.6.35.7/net/core/stream.c	2010-03-24 00:21:09.000000000 +0530
@@ -73,9 +73,8 @@ int sk_stream_wait_connect(struct sock *
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		sk->sk_write_pending++;
 		done = sk_wait_event(sk, timeo_p,
-				     !sk->sk_err &&
-				     !((1 << sk->sk_state) &
-				       ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
+				     ((1 << sk->sk_state) &
+				       (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
 		finish_wait(sk_sleep(sk), &wait);
 		sk->sk_write_pending--;
 	} while (!done);
@@ -144,10 +143,9 @@ int sk_stream_wait_memory(struct sock *s
 
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		sk->sk_write_pending++;
-		sk_wait_event(sk, &current_timeo, !sk->sk_err &&
-						  !(sk->sk_shutdown & SEND_SHUTDOWN) &&
-						  sk_stream_memory_free(sk) &&
-						  vm_wait);
+		sk_wait_event(sk, &current_timeo, sk->sk_err ||
+						  (sk->sk_shutdown & SEND_SHUTDOWN) ||
+						  (sk_stream_memory_free(sk) && !vm_wait));
 		sk->sk_write_pending--;
 
 		if (vm_wait) {




      

^ permalink raw reply

* Re: problem with flowi structure
From: Nicola Padovano @ 2010-10-02  8:08 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Eric Dumazet, AIJAZ BAIG, netfilter-devel, netdev
In-Reply-To: <alpine.LNX.2.01.1010020146440.26371@obet.zrqbmnf.qr>

i.e. like a wildcard?

On Sat, Oct 2, 2010 at 1:46 AM, Jan Engelhardt <jengelh@medozas.de> wrote:
>
> On Friday 2010-09-17 20:25, Nicola Padovano wrote:
>>
>>[CODE]
>>if (hooknumber == NF_INET_LOCAL_IN) fl.nl_u.ip4_u.saddr = niph->saddr;
>>  //niph is the pointer to ip header of the packet to send
>>if (hooknumber == NF_INET_FORWARD) fl.nl_u.ip4_u.saddr = 0;
>>[/CODE]
>>
>>so, i don't understand why saddr = 0 when the hooknumber is NF_INET_FORWARD....
>>
>>this is the real problem.
>
> 0 is automatic address selection.
>



-- 
Nicola Padovano
e-mail: nicola.padovano@gmail.com
web: http://npadovano.altervista.org

"My only ambition is not be anything at all; it seems the most
sensible thing" (C. Bukowski)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [patch v2 03/12] [PATCH 03/12] IPVS: compact ip_vs_sched_persist()
From: Simon Horman @ 2010-10-02  8:08 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: lvs-devel, netdev, netfilter, netfilter-devel, Jan Engelhardt,
	Stephen Hemminger, Wensong Zhang, Patrick McHardy
In-Reply-To: <alpine.LFD.2.00.1010021048260.1814@ja.ssi.bg>

On Sat, Oct 02, 2010 at 10:56:19AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Sat, 2 Oct 2010, Simon Horman wrote:
> 
> >>	Here dport:
> >>
> >>>+	dport = dest->port;
> >>
> >>	should be:
> >>
> >>	dport = ports[1];
> >>	if (dport == svc->port && dest->port)
> >>		dport = dest->port;
> >
> >Thanks, fixed.
> 
> 	I'm still wondering, may be it needs separate patch
> but we do not support NAT to different dest->port in the
> case for fwmark. May be the above logic can be changed to
> support it. By this way web to different VIPs and VPORTs
> in a single virtual service (fwmark) can use single NAT
> real server for name-based virtual hosting. But such change
> can create compatibility problems for setups that used
> different vports for the fwmark service and still expect
> it in that way (vport to same dport).

Hi Julian,

I think that this sounds line a new flavour of fwmark virtual services to me.
Perhaps yet another flag is in order?

To be clear, what you have in mind is essentially to nat *:* (as matched by
a fwmark) to x:y, where as at this time *:y may be natted to x:y.


^ permalink raw reply

* Re: [patch v2 03/12] [PATCH 03/12] IPVS: compact ip_vs_sched_persist()
From: Julian Anastasov @ 2010-10-02  7:56 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, netfilter, netfilter-devel, Jan Engelhardt,
	Stephen Hemminger, Wensong Zhang, Patrick McHardy
In-Reply-To: <20101002022050.GA16593@verge.net.au>


 	Hello,

On Sat, 2 Oct 2010, Simon Horman wrote:

>> 	Here dport:
>>
>>> +	dport = dest->port;
>>
>> 	should be:
>>
>> 	dport = ports[1];
>> 	if (dport == svc->port && dest->port)
>> 		dport = dest->port;
>
> Thanks, fixed.

 	I'm still wondering, may be it needs separate patch
but we do not support NAT to different dest->port in the
case for fwmark. May be the above logic can be changed to
support it. By this way web to different VIPs and VPORTs
in a single virtual service (fwmark) can use single NAT
real server for name-based virtual hosting. But such change
can create compatibility problems for setups that used
different vports for the fwmark service and still expect
it in that way (vport to same dport).

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Hello Friend
From: Mr. Peter Lee @ 2010-10-01 23:33 UTC (permalink / raw)



Dear Friend,

I am Mr. Peter Lee a South Korean, happily married with children, and  
i am a Director of Hang Seng Bank Ltd, in charge of the International  
Remittance department. I have a confidential business suggestion for  
you. I will need you to assist me in executing a business project from  
Hong Kong to your country.

It involves the transfer of a large sum of money. Everything  
concerning this transaction shall be legally done without hitch.  
Please endeavour to reply me if interested. Here is my private  
Email:(peterlee333@aol.com)

Thanks,
Best Regard,
Mr. Peter Lee





----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.



^ permalink raw reply

* [PATCH net-next] nf_nat: make find/put static
From: Stephen Hemminger @ 2010-10-02  5:17 UTC (permalink / raw)
  To: Patrick McHardy, David Miller; +Cc: netfilter-devel, netdev

The functions nf_nat_proto_find_get and nf_nat_proto_put are
only used internally in nf_nat_core. This might break some out
of tree NAT module.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/include/net/netfilter/nf_nat_protocol.h	2010-10-02 09:01:09.153396885 +0900
+++ b/include/net/netfilter/nf_nat_protocol.h	2010-10-02 09:01:55.340910304 +0900
@@ -45,9 +45,6 @@ struct nf_nat_protocol {
 extern int nf_nat_protocol_register(const struct nf_nat_protocol *proto);
 extern void nf_nat_protocol_unregister(const struct nf_nat_protocol *proto);
 
-extern const struct nf_nat_protocol *nf_nat_proto_find_get(u_int8_t protocol);
-extern void nf_nat_proto_put(const struct nf_nat_protocol *proto);
-
 /* Built-in protocols. */
 extern const struct nf_nat_protocol nf_nat_protocol_tcp;
 extern const struct nf_nat_protocol nf_nat_protocol_udp;
--- a/net/ipv4/netfilter/nf_nat_core.c	2010-10-02 09:01:09.183396507 +0900
+++ b/net/ipv4/netfilter/nf_nat_core.c	2010-10-02 09:01:55.340910304 +0900
@@ -47,7 +47,7 @@ __nf_nat_proto_find(u_int8_t protonum)
 	return rcu_dereference(nf_nat_protos[protonum]);
 }
 
-const struct nf_nat_protocol *
+static const struct nf_nat_protocol *
 nf_nat_proto_find_get(u_int8_t protonum)
 {
 	const struct nf_nat_protocol *p;
@@ -60,14 +60,12 @@ nf_nat_proto_find_get(u_int8_t protonum)
 
 	return p;
 }
-EXPORT_SYMBOL_GPL(nf_nat_proto_find_get);
 
-void
+static void
 nf_nat_proto_put(const struct nf_nat_protocol *p)
 {
 	module_put(p->me);
 }
-EXPORT_SYMBOL_GPL(nf_nat_proto_put);
 
 /* We keep an extra hash for each conntrack, for fast searching. */
 static inline unsigned int


^ permalink raw reply

* Re: [PATCH 2.6.35.7] net: Fix the condition passed to sk_wait_event()
From: David Miller @ 2010-10-02  4:27 UTC (permalink / raw)
  To: tomer_iisc; +Cc: netdev, linux-kernel
In-Reply-To: <563428.39597.qm@web53703.mail.re2.yahoo.com>


Your patch has been corrupted by your email client, it changed
tab characters into spaces.

Please correct this and fully resubmit your patch.

Thank you.

^ permalink raw reply

* [PATCH 2.6.35.7] net: Fix the condition passed to sk_wait_event()
From: Nagendra Tomar @ 2010-10-02  2:55 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, davem



The condition (3rd arg) passed to sk_wait_event() in sk_stream_wait_memory() and 
sk_stream_wait_connect() are incorrect.
The incorrect check in sk_stream_wait_memory() causes the following soft lockup 
in tcp_sendmsg() when the global tcp memory pool has exhausted. The check in 
sk_stream_wait_connect() was found by code audit.
   
>>> snip <<<
localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
localhost kernel: CPU 3:
localhost kernel: 
localhost kernel: Call Trace:
localhost kernel:  [sk_stream_wait_memory+0x1b1/0x200] 
sk_stream_wait_memory+0x1b1/0x200
localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel:  [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
localhost kernel:  [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
localhost kernel:  [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel:  [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
localhost kernel:  [vfs_write+0x185/0x190] vfs_write+0x185/0x190
localhost kernel:  [sys_write+0x50/0x90] sys_write+0x50/0x90
localhost kernel:  [system_call+0x7e/0x83] system_call+0x7e/0x83
>>> snip <<<

What is happening is, that the sk_wait_event() condition passed from
sk_stream_wait_memory() evaluates to true for the case of tcp global memory
exhaustion. This is because both sk_stream_memory_free() and vm_wait are true
which causes sk_wait_event() to *not* call schedule_timeout(). 
Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping.
This causes the caller to again try allocation, which again fails and again
calls sk_stream_wait_memory(), and so on.

Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
---
--- linux-2.6.35.7/net/core/stream.c.orig 2010-03-23 23:46:45.000000000 +0530
+++ linux-2.6.35.7/net/core/stream.c 2010-03-24 00:21:09.000000000 +0530
@@ -73,9 +73,8 @@ int sk_stream_wait_connect(struct sock *
   prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
   sk->sk_write_pending++;
   done = sk_wait_event(sk, timeo_p,
-         !sk->sk_err &&
-         !((1 << sk->sk_state) &
-           ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
+         ((1 << sk->sk_state) &
+           (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
   finish_wait(sk_sleep(sk), &wait);
   sk->sk_write_pending--;
  } while (!done);
@@ -144,10 +143,9 @@ int sk_stream_wait_memory(struct sock *s
 
   set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
   sk->sk_write_pending++;
-  sk_wait_event(sk, &current_timeo, !sk->sk_err &&
-        !(sk->sk_shutdown & SEND_SHUTDOWN) &&
-        sk_stream_memory_free(sk) &&
-        vm_wait);
+  sk_wait_event(sk, &current_timeo, sk->sk_err ||
+        (sk->sk_shutdown & SEND_SHUTDOWN) ||
+        (sk_stream_memory_free(sk) && !vm_wait));
   sk->sk_write_pending--;
 
   if (vm_wait) {



      

^ permalink raw reply

* [patch v3 12/12] IPVS: sip persistence engine
From: Simon Horman @ 2010-10-02  2:31 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101002023056.536217499@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0012-IPVS-sip-persistence-engine.patch --]
[-- Type: text/plain, Size: 7182 bytes --]

Add the SIP callid as a key for persistence.

This allows multiple connections from the same IP address to be
differentiated on the basis of the callid.

When used in conjunction with the persistence mask, it allows connections
from different  IP addresses to be aggregated on the basis of the callid.

It is envisaged that a persistence mask of 0.0.0.0 will be a useful
setting.  That is, ignore the source IP address when checking for
persistence.

It is envisaged that this option will be used in conjunction with
one-packet scheduling.

This only works with UDP and cannot be made to work with TCP
within the current framework.

Signed-off-by: Simon Horman <horms@verge.net.au>

---

v1
* Use buf[] instead of poiter arithmetic in ip_vs_dbg_callid()
  As suggested by Jan Engelhardt

v2
* Use GFP_ATOMIC for allocations inside of ip_vs_sip_fill_param()
  which is called in an atomic context. This resolves the
  "scheduling while atomic" problem.
* As noted by Julian Anastasov RFC 3261 section 8.1.1.4 says
  "Call-IDs are case-sensitive and are simply compared byte-by-byte",
  so may be memcmp should be used instead of strnicmp() in
  ip_vs_sip_ct_match().
* Spelling fix in comment: persistance -> persistence
* Trivial rediff

v3
* Trivial whitespace fixes
* Update for addition of inverse parameter to ip_vs_conn_hashkey_param()

Index: lvs-test-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/Kconfig	2010-10-02 10:49:37.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/Kconfig	2010-10-02 10:50:16.000000000 +0900
@@ -256,4 +256,11 @@ config	IP_VS_NFCT
 	  connection state to be exported to the Netfilter framework
 	  for filtering purposes.
 
+config	IP_VS_PE_SIP
+	tristate "SIP persistence engine"
+        depends on IP_VS_PROTO_UDP
+	depends on NF_CONNTRACK_SIP
+	---help---
+	  Allow persistence based on the SIP Call-ID
+
 endif # IP_VS
Index: lvs-test-2.6/net/netfilter/ipvs/Makefile
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/Makefile	2010-10-02 10:50:16.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/Makefile	2010-10-02 10:50:16.000000000 +0900
@@ -35,3 +35,6 @@ obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
+
+# IPVS connection template retrievers
+obj-$(CONFIG_IP_VS_PE_SIP) += ip_vs_pe_sip.o
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_pe_sip.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_pe_sip.c	2010-10-02 10:52:08.000000000 +0900
@@ -0,0 +1,167 @@
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <linux/netfilter/nf_conntrack_sip.h>
+
+static const char *ip_vs_dbg_callid(char *buf, size_t buf_len,
+				    const char *callid, size_t callid_len,
+				    int *idx)
+{
+	size_t len = min(min(callid_len, (size_t)64), buf_len - *idx - 1);
+	memcpy(buf + *idx, callid, len);
+	buf[*idx+len] = '\0';
+	*idx += len + 1;
+	return buf + *idx - len;
+}
+
+#define IP_VS_DEBUG_CALLID(callid, len)					\
+	ip_vs_dbg_callid(ip_vs_dbg_buf, sizeof(ip_vs_dbg_buf),		\
+			 callid, len, &ip_vs_dbg_idx)
+
+static int get_callid(const char *dptr, unsigned int dataoff,
+		      unsigned int datalen,
+		      unsigned int *matchoff, unsigned int *matchlen)
+{
+	/* Find callid */
+	while (1) {
+		int ret = ct_sip_get_header(NULL, dptr, dataoff, datalen,
+					    SIP_HDR_CALL_ID, matchoff,
+					    matchlen);
+		if (ret > 0)
+			break;
+		if (!ret)
+			return 0;
+		dataoff += *matchoff;
+	}
+
+	/* Empty callid is useless */
+	if (!*matchlen)
+		return -EINVAL;
+
+	/* Too large is useless */
+	if (*matchlen > IP_VS_PEDATA_MAXLEN)
+		return -EINVAL;
+
+	/* SIP headers are always followed by a line terminator */
+	if (*matchoff + *matchlen == datalen)
+		return -EINVAL;
+
+	/* RFC 2543 allows lines to be terminated with CR, LF or CRLF,
+	 * RFC 3261 allows only CRLF, we support both. */
+	if (*(dptr + *matchoff + *matchlen) != '\r' &&
+	    *(dptr + *matchoff + *matchlen) != '\n')
+		return -EINVAL;
+
+	IP_VS_DBG_BUF(9, "SIP callid %s (%d bytes)\n",
+		      IP_VS_DEBUG_CALLID(dptr + *matchoff, *matchlen),
+		      *matchlen);
+	return 0;
+}
+
+static int
+ip_vs_sip_fill_param(struct ip_vs_conn_param *p, struct sk_buff *skb)
+{
+	struct ip_vs_iphdr iph;
+	unsigned int dataoff, datalen, matchoff, matchlen;
+	const char *dptr;
+
+	ip_vs_fill_iphdr(p->af, skb_network_header(skb), &iph);
+
+	/* Only useful with UDP */
+	if (iph.protocol != IPPROTO_UDP)
+		return -EINVAL;
+
+	/* No Data ? */
+	dataoff = iph.len + sizeof(struct udphdr);
+	if (dataoff >= skb->len)
+		return -EINVAL;
+
+	dptr = skb->data + dataoff;
+	datalen = skb->len - dataoff;
+
+	if (get_callid(dptr, dataoff, datalen, &matchoff, &matchlen))
+		return -EINVAL;
+
+	p->pe_data = kmalloc(matchlen, GFP_ATOMIC);
+	if (!p->pe_data)
+		return -ENOMEM;
+
+	/* N.B: pe_data is only set on success,
+	 * this allows fallback to the default persistence logic on failure
+	 */
+	memcpy(p->pe_data, dptr + matchoff, matchlen);
+	p->pe_data_len = matchlen;
+
+	return 0;
+}
+
+static bool ip_vs_sip_ct_match(const struct ip_vs_conn_param *p,
+				  struct ip_vs_conn *ct)
+
+{
+	bool ret = 0;
+
+	if (ct->af == p->af &&
+	    ip_vs_addr_equal(p->af, p->caddr, &ct->caddr) &&
+	    /* protocol should only be IPPROTO_IP if
+	     * d_addr is a fwmark */
+	    ip_vs_addr_equal(p->protocol == IPPROTO_IP ? AF_UNSPEC : p->af,
+			     p->vaddr, &ct->vaddr) &&
+	    ct->vport == p->vport &&
+	    ct->flags & IP_VS_CONN_F_TEMPLATE &&
+	    ct->protocol == p->protocol &&
+	    ct->pe_data && ct->pe_data_len == p->pe_data_len &&
+	    !memcmp(ct->pe_data, p->pe_data, p->pe_data_len))
+		ret = 1;
+
+	IP_VS_DBG_BUF(9, "SIP template match %s %s->%s:%d %s\n",
+		      ip_vs_proto_name(p->protocol),
+		      IP_VS_DEBUG_CALLID(p->pe_data, p->pe_data_len),
+		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
+		      ret ? "hit" : "not hit");
+
+	return ret;
+}
+
+static u32 ip_vs_sip_hashkey_raw(const struct ip_vs_conn_param *p,
+				 u32 initval, bool inverse)
+{
+	return jhash(p->pe_data, p->pe_data_len, initval);
+}
+
+static int ip_vs_sip_show_pe_data(const struct ip_vs_conn *cp, char *buf)
+{
+	memcpy(buf, cp->pe_data, cp->pe_data_len);
+	return cp->pe_data_len;
+}
+
+static struct ip_vs_pe ip_vs_sip_pe =
+{
+	.name =			"sip",
+	.refcnt =		ATOMIC_INIT(0),
+	.module =		THIS_MODULE,
+	.n_list =		LIST_HEAD_INIT(ip_vs_sip_pe.n_list),
+	.fill_param =		ip_vs_sip_fill_param,
+	.ct_match =		ip_vs_sip_ct_match,
+	.hashkey_raw =		ip_vs_sip_hashkey_raw,
+	.show_pe_data =		ip_vs_sip_show_pe_data,
+};
+
+static int __init ip_vs_sip_init(void)
+{
+	return register_ip_vs_pe(&ip_vs_sip_pe);
+}
+
+static void __exit ip_vs_sip_cleanup(void)
+{
+	unregister_ip_vs_pe(&ip_vs_sip_pe);
+}
+
+module_init(ip_vs_sip_init);
+module_exit(ip_vs_sip_cleanup);
+MODULE_LICENSE("GPL");


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox