Netdev List
 help / color / mirror / Atom feed
* Re: Re: TSO and IPoIB performance degradation
From: Roland Dreier @ 2006-03-08  1:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-kernel, openib-general, shemminger
In-Reply-To: <20060307.172336.107863253.davem@davemloft.net>

    David> I wish you had started the thread by mentioning this
    David> specific patch, we wasted an enormous amount of precious
    David> developer time speculating and asking for arbitrary tests
    David> to be run in order to narrow down the problem, yet you knew
    David> the specific change that introduced the performance
    David> regression already...

Sorry, you're right.  I was a little confused because I had a memory of
Michael's original email (http://lkml.org/lkml/2006/3/6/150) quoting a
changelog entry, but looking back at the message, it was quoting
something completely different and misleading.

I think the most interesting email in the old thread is
http://openib.org/pipermail/openib-general/2005-October/012482.html
which shows that reverting 314324121 (the "stretch ACK performance
killer" fix) gives ~400 Mbit/sec in extra IPoIB performance.

 - R.

^ permalink raw reply

* [patch 0/4] net: percpufy frequently used vars on struct proto
From: Ravikiran G Thirumalai @ 2006-03-08  1:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai

Following patchset converts struct proto.memory_allocated to use batching
per-cpu counters, struct proto.sockets_allocated to use per-cpu counters and
changes the proto.inuse per-cpu variable to use alloc_percpu instead of the
NR_CPUS x cacheline size padding. 

We observed 5% improvement in apache bench requests per second with this 
patchset on a multi NIC 8 way IBM x460 box.

(This was posted earlier
http://marc.theaimsgroup.com/?l=linux-kernel&m=113830220408812&w=2 )

Can this go into -mm please?

Thanks,
Kiran

^ permalink raw reply

* [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
From: Ravikiran G Thirumalai @ 2006-03-08  1:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308015808.GA9062@localhost.localdomain>

Add percpu_counter_mod_bh for using these counters safely from
both softirq and process context.

Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off by: Ravikiran G Thirumalai <kiran@scalex86.org>
Signed-off by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc5mm3/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/linux/percpu_counter.h	2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/linux/percpu_counter.h	2006-03-07 15:09:21.000000000 -0800
@@ -11,6 +11,7 @@
 #include <linux/smp.h>
 #include <linux/threads.h>
 #include <linux/percpu.h>
+#include <linux/interrupt.h>
 
 #ifdef CONFIG_SMP
 
@@ -110,4 +111,11 @@ static inline void percpu_counter_dec(st
 	percpu_counter_mod(fbc, -1);
 }
 
+static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+	local_bh_disable();
+	percpu_counter_mod(fbc, amount);
+	local_bh_enable();
+}
+
 #endif /* _LINUX_PERCPU_COUNTER_H */

^ permalink raw reply

* [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
From: Ravikiran G Thirumalai @ 2006-03-08  2:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308015808.GA9062@localhost.localdomain>

Change struct proto->memory_allocated to a batching per-CPU counter 
(percpu_counter) from an atomic_t.  A batching counter is better than a 
plain per-CPU counter as this field is read often.

Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-07 15:11:48.000000000 -0800
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/percpu_counter.h>
 
 #include <linux/filter.h>
 
@@ -541,8 +542,9 @@ struct proto {
 
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
-	atomic_t		*memory_allocated;	/* Current allocated memory. */
+	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
 	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+
 	/*
 	 * Pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-07 15:11:48.000000000 -0800
@@ -225,7 +225,7 @@ extern int sysctl_tcp_abc;
 extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-07 15:11:48.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
-	seq_printf(seq, "%-9s %4u %6d  %6d   %-3s %6u   %-3s  %-10s "
+	seq_printf(seq, "%-9s %4u %6d  %6lu   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
-		   proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+		   proto->memory_allocated != NULL ?
+				percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-07 15:11:48.000000000 -0800
@@ -196,11 +196,11 @@ EXPORT_SYMBOL(sk_stream_error);
 void __sk_stream_mem_reclaim(struct sock *sk)
 {
 	if (sk->sk_forward_alloc >= SK_STREAM_MEM_QUANTUM) {
-		atomic_sub(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM,
-			   sk->sk_prot->memory_allocated);
+		percpu_counter_mod_bh(sk->sk_prot->memory_allocated,
+				-(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM));
 		sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
 		if (*sk->sk_prot->memory_pressure &&
-		    (atomic_read(sk->sk_prot->memory_allocated) <
+		    (percpu_counter_read(sk->sk_prot->memory_allocated) <
 		     sk->sk_prot->sysctl_mem[0]))
 			*sk->sk_prot->memory_pressure = 0;
 	}
@@ -213,23 +213,26 @@ int sk_stream_mem_schedule(struct sock *
 	int amt = sk_stream_pages(size);
 
 	sk->sk_forward_alloc += amt * SK_STREAM_MEM_QUANTUM;
-	atomic_add(amt, sk->sk_prot->memory_allocated);
+	percpu_counter_mod_bh(sk->sk_prot->memory_allocated, amt);
 
 	/* Under limit. */
-	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
+	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
+			sk->sk_prot->sysctl_mem[0]) {
 		if (*sk->sk_prot->memory_pressure)
 			*sk->sk_prot->memory_pressure = 0;
 		return 1;
 	}
 
 	/* Over hard limit. */
-	if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
+	if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+			sk->sk_prot->sysctl_mem[2]) {
 		sk->sk_prot->enter_memory_pressure();
 		goto suppress_allocation;
 	}
 
 	/* Under pressure. */
-	if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[1])
+	if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+			sk->sk_prot->sysctl_mem[1])
 		sk->sk_prot->enter_memory_pressure();
 
 	if (kind) {
@@ -259,7 +262,7 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_STREAM_MEM_QUANTUM;
-	atomic_sub(amt, sk->sk_prot->memory_allocated);
+	percpu_counter_mod_bh(sk->sk_prot->memory_allocated, -amt);
 	return 0;
 }
 
Index: linux-2.6.16-rc5mm3/net/decnet/af_decnet.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/decnet/af_decnet.c	2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/decnet/af_decnet.c	2006-03-07 15:09:22.000000000 -0800
@@ -154,7 +154,7 @@ static const struct proto_ops dn_proto_o
 static DEFINE_RWLOCK(dn_hash_lock);
 static struct hlist_head dn_sk_hash[DN_SK_HASH_SIZE];
 static struct hlist_head dn_wild_sk;
-static atomic_t decnet_memory_allocated;
+static struct percpu_counter decnet_memory_allocated;
 
 static int __dn_setsockopt(struct socket *sock, int level, int optname, char __user *optval, int optlen, int flags);
 static int __dn_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen, int flags);
@@ -2383,7 +2383,8 @@ static int __init decnet_init(void)
 	rc = proto_register(&dn_proto, 1);
 	if (rc != 0)
 		goto out;
-
+	
+	percpu_counter_init(&decnet_memory_allocated);
 	dn_neigh_init();
 	dn_dev_init();
 	dn_route_init();
@@ -2424,6 +2425,7 @@ static void __exit decnet_exit(void)
 	proc_net_remove("decnet");
 
 	proto_unregister(&dn_proto);
+	percpu_counter_destroy(&decnet_memory_allocated);
 }
 module_exit(decnet_exit);
 #endif
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c	2006-03-07 15:07:44.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c	2006-03-07 15:11:48.000000000 -0800
@@ -61,10 +61,10 @@ static int fold_prot_inuse(struct proto 
 static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	socket_seq_show(seq);
-	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %d\n",
+	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
 		   fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
 		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
-		   atomic_read(&tcp_memory_allocated));
+		   percpu_counter_read_positive(&tcp_memory_allocated));
 	seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
 	seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
 	seq_printf(seq,  "FRAG: inuse %d memory %d\n", ip_frag_nqueues,
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c	2006-03-07 15:07:44.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c	2006-03-07 15:11:48.000000000 -0800
@@ -283,7 +283,7 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
-atomic_t tcp_memory_allocated;	/* Current allocated memory. */
+struct percpu_counter tcp_memory_allocated;	/* Current allocated memory. */
 atomic_t tcp_sockets_allocated;	/* Current number of TCP sockets. */
 
 EXPORT_SYMBOL(tcp_memory_allocated);
@@ -1593,7 +1593,7 @@ adjudge_to_death:
 		sk_stream_mem_reclaim(sk);
 		if (atomic_read(sk->sk_prot->orphan_count) > sysctl_tcp_max_orphans ||
 		    (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-		     atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+		     percpu_counter_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
 			if (net_ratelimit())
 				printk(KERN_INFO "TCP: too many of orphaned "
 				       "sockets\n");
@@ -2127,6 +2127,8 @@ void __init tcp_init(void)
 	       "(established %d bind %d)\n",
 	       tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
 
+	percpu_counter_init(&tcp_memory_allocated);
+
 	tcp_register_congestion_control(&tcp_reno);
 }
 
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_input.c	2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c	2006-03-07 15:09:22.000000000 -0800
@@ -333,7 +333,7 @@ static void tcp_clamp_window(struct sock
 	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
 	    !tcp_memory_pressure &&
-	    atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    percpu_counter_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
 				    sysctl_tcp_rmem[2]);
 	}
@@ -3555,7 +3555,7 @@ static int tcp_should_expand_sndbuf(stru
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (percpu_counter_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_timer.c	2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c	2006-03-07 15:09:22.000000000 -0800
@@ -80,7 +80,7 @@ static int tcp_out_of_resources(struct s
 
 	if (orphans >= sysctl_tcp_max_orphans ||
 	    (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	     atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+	     percpu_counter_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
 		if (net_ratelimit())
 			printk(KERN_INFO "Out of socket memory\n");
 

^ permalink raw reply

* [patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated
From: Ravikiran G Thirumalai @ 2006-03-08  2:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308015808.GA9062@localhost.localdomain>

Change the atomic_t sockets_allocated member of struct proto to a 
per-cpu counter.

Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-07 15:09:52.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
 	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
-	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+	int                     *sockets_allocated;     /* Current number of sockets (percpu counter). */
 
 	/*
 	 * Pressure flag: try to collapse.
@@ -579,6 +579,24 @@ struct proto {
 	} stats[NR_CPUS];
 };
 
+static inline int read_sockets_allocated(struct proto *prot)
+{
+	int total = 0;
+	int cpu;
+	for_each_cpu(cpu)
+		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+	return total;
+}
+
+static inline void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+	put_cpu();
+}
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);
 
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-07 15:09:52.000000000 -0800
@@ -226,7 +226,7 @@ extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
 extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
 /*
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-07 15:09:52.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock 
 		newsk->sk_sleep	 = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			atomic_inc(newsk->sk_prot->sockets_allocated);
+			inc_sockets_allocated(newsk->sk_prot->sockets_allocated); 
 	}
 out:
 	return newsk;
@@ -1620,7 +1620,7 @@ static void proto_seq_printf(struct seq_
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
-		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+		   proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
 		   proto->memory_allocated != NULL ?
 				percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-07 15:09:52.000000000 -0800
@@ -242,7 +242,7 @@ int sk_stream_mem_schedule(struct sock *
 		return 1;
 
 	if (!*sk->sk_prot->memory_pressure ||
-	    sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+	    sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
 				sk_stream_pages(sk->sk_wmem_queued +
 						atomic_read(&sk->sk_rmem_alloc) +
 						sk->sk_forward_alloc))
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c	2006-03-07 15:09:52.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
 		   fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
-		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+		   tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
 		   percpu_counter_read_positive(&tcp_memory_allocated));
 	seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
 	seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c	2006-03-07 15:09:52.000000000 -0800
@@ -284,7 +284,7 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 struct percpu_counter tcp_memory_allocated;	/* Current allocated memory. */
-atomic_t tcp_sockets_allocated;	/* Current number of TCP sockets. */
+int *tcp_sockets_allocated;   /* Current number of TCP sockets. */
 
 EXPORT_SYMBOL(tcp_memory_allocated);
 EXPORT_SYMBOL(tcp_sockets_allocated);
@@ -2055,6 +2055,12 @@ void __init tcp_init(void)
 	if (!tcp_hashinfo.bind_bucket_cachep)
 		panic("tcp_init: Cannot alloc tcp_bind_bucket cache.");
 
+	tcp_sockets_allocated = alloc_percpu(int);
+	if (!tcp_sockets_allocated)
+		panic("tcp_init: Cannot alloc tcp_sockets_allocated");
+
+	tcp_prot.sockets_allocated = tcp_sockets_allocated;
+
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
 	 *
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_ipv4.c	2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c	2006-03-07 15:09:52.000000000 -0800
@@ -1273,7 +1273,7 @@ static int tcp_v4_init_sock(struct sock 
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
-	atomic_inc(&tcp_sockets_allocated);
+	inc_sockets_allocated(tcp_sockets_allocated);
 
 	return 0;
 }
@@ -1307,7 +1307,7 @@ int tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
-	atomic_dec(&tcp_sockets_allocated);
+	dec_sockets_allocated(tcp_sockets_allocated);
 
 	return 0;
 }
@@ -1815,7 +1815,6 @@ struct proto tcp_prot = {
 	.unhash			= tcp_unhash,
 	.get_port		= tcp_v4_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
 	.orphan_count		= &tcp_orphan_count,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
Index: linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv6/tcp_ipv6.c	2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c	2006-03-07 15:11:43.000000000 -0800
@@ -1375,7 +1375,7 @@ static int tcp_v6_init_sock(struct sock 
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
-	atomic_inc(&tcp_sockets_allocated);
+	inc_sockets_allocated(tcp_sockets_allocated);
 
 	return 0;
 }
@@ -1573,7 +1573,6 @@ struct proto tcpv6_prot = {
 	.unhash			= tcp_unhash,
 	.get_port		= tcp_v6_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
 	.orphan_count		= &tcp_orphan_count,
@@ -1605,6 +1604,7 @@ static struct inet_protosw tcpv6_protosw
 
 void __init tcpv6_init(void)
 {
+	tcpv6_prot.sockets_allocated = tcp_sockets_allocated;
 	/* register inet6 protocol */
 	if (inet6_add_protocol(&tcpv6_protocol, IPPROTO_TCP) < 0)
 		printk(KERN_ERR "tcpv6_init: Could not register protocol\n");

^ permalink raw reply

* [patch 4/4] net: percpufy frequently used vars -- proto.inuse
From: Ravikiran G Thirumalai @ 2006-03-08  2:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308015808.GA9062@localhost.localdomain>

This patch uses alloc_percpu to allocate per-CPU memory for the proto->inuse
field.  The inuse field is currently per-CPU as in NR_CPUS * cacheline size -- 
a big bloat on arches with large cachelines.  Also marks some frequently
used protos read mostly.

Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc2/include/net/sock.h
===================================================================
--- linux-2.6.16-rc2.orig/include/net/sock.h	2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/include/net/sock.h	2006-02-09 15:01:52.000000000 -0800
@@ -573,10 +573,7 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
 	atomic_t		socks;
 #endif
-	struct {
-		int inuse;
-		u8  __pad[SMP_CACHE_BYTES - sizeof(int)];
-	} stats[NR_CPUS];
+	int *inuse;
 };
 
 static inline int read_sockets_allocated(struct proto *prot)
@@ -628,12 +625,12 @@ static inline void sk_refcnt_debug_relea
 /* Called with local bh disabled */
 static __inline__ void sock_prot_inc_use(struct proto *prot)
 {
-	prot->stats[smp_processor_id()].inuse++;
+	(*per_cpu_ptr(prot->inuse, smp_processor_id())) += 1;
 }
 
 static __inline__ void sock_prot_dec_use(struct proto *prot)
 {
-	prot->stats[smp_processor_id()].inuse--;
+	(*per_cpu_ptr(prot->inuse, smp_processor_id())) -= 1;
 }
 
 /* With per-bucket locks this operation is not-atomic, so that
Index: linux-2.6.16-rc2/net/core/sock.c
===================================================================
--- linux-2.6.16-rc2.orig/net/core/sock.c	2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/net/core/sock.c	2006-02-09 15:01:52.000000000 -0800
@@ -1508,12 +1508,21 @@ int proto_register(struct proto *prot, i
 		}
 	}
 
+	prot->inuse = alloc_percpu(int);
+	if (prot->inuse == NULL) {
+		if (alloc_slab)
+			goto out_free_timewait_sock_slab_name_cache;
+		else
+			goto out;
+	}
 	write_lock(&proto_list_lock);
 	list_add(&prot->node, &proto_list);
 	write_unlock(&proto_list_lock);
 	rc = 0;
 out:
 	return rc;
+out_free_timewait_sock_slab_name_cache:
+	kmem_cache_destroy(prot->twsk_prot->twsk_slab);
 out_free_timewait_sock_slab_name:
 	kfree(timewait_sock_slab_name);
 out_free_request_sock_slab:
@@ -1537,6 +1546,11 @@ void proto_unregister(struct proto *prot
 	list_del(&prot->node);
 	write_unlock(&proto_list_lock);
 
+	if (prot->inuse != NULL) {
+		free_percpu(prot->inuse);
+		prot->inuse = NULL;
+	}
+
 	if (prot->slab != NULL) {
 		kmem_cache_destroy(prot->slab);
 		prot->slab = NULL;
Index: linux-2.6.16-rc2/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/proc.c	2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/proc.c	2006-02-09 15:01:52.000000000 -0800
@@ -50,7 +50,7 @@ static int fold_prot_inuse(struct proto 
 	int cpu;
 
 	for_each_cpu(cpu)
-		res += proto->stats[cpu].inuse;
+		res += (*per_cpu_ptr(proto->inuse, cpu));
 
 	return res;
 }
Index: linux-2.6.16-rc2/net/ipv4/raw.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/raw.c	2006-02-07 23:14:04.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/raw.c	2006-02-09 15:01:52.000000000 -0800
@@ -718,7 +718,7 @@ static int raw_ioctl(struct sock *sk, in
 	}
 }
 
-struct proto raw_prot = {
+struct proto raw_prot __read_mostly = {
 	.name =		"RAW",
 	.owner =	THIS_MODULE,
 	.close =	raw_close,
Index: linux-2.6.16-rc2/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/tcp_ipv4.c	2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/tcp_ipv4.c	2006-02-09 15:01:52.000000000 -0800
@@ -1794,7 +1794,7 @@ void tcp4_proc_exit(void)
 }
 #endif /* CONFIG_PROC_FS */
 
-struct proto tcp_prot = {
+struct proto tcp_prot __read_mostly = {
 	.name			= "TCP",
 	.owner			= THIS_MODULE,
 	.close			= tcp_close,
Index: linux-2.6.16-rc2/net/ipv4/udp.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/udp.c	2006-02-07 23:14:04.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/udp.c	2006-02-09 15:01:52.000000000 -0800
@@ -1340,7 +1340,7 @@ unsigned int udp_poll(struct file *file,
 	
 }
 
-struct proto udp_prot = {
+struct proto udp_prot __read_mostly = {
  	.name =		"UDP",
 	.owner =	THIS_MODULE,
 	.close =	udp_close,
Index: linux-2.6.16-rc2/net/ipv6/proc.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv6/proc.c	2006-02-07 23:19:10.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv6/proc.c	2006-02-09 15:01:52.000000000 -0800
@@ -39,7 +39,7 @@ static int fold_prot_inuse(struct proto 
 	int cpu;
 
 	for_each_cpu(cpu)
-		res += proto->stats[cpu].inuse;
+		res += (*per_cpu_ptr(proto->inuse, cpu));
 
 	return res;
 }

^ permalink raw reply

* Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
From: Andrew Morton @ 2006-03-08  2:13 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308015934.GB9062@localhost.localdomain>

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
>  +{
>  +	local_bh_disable();
>  +	percpu_counter_mod(fbc, amount);
>  +	local_bh_enable();
>  +}
>  +

percpu_counter_mod() does preempt_disable(), which is redundant in this
context.  So just do fbc->count += amount; here.

^ permalink raw reply

* Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
From: Andrew Morton @ 2006-03-08  2:14 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308020118.GC9062@localhost.localdomain>

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> -	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
>  +	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
>  +			sk->sk_prot->sysctl_mem[0]) {

Bear in mind that percpu_counter_read[_positive] can be inaccurate on large
CPU counts.

It might be worth running percpu_counter_sum() to get the exact count if we
think we're about to cause something to fail.

^ permalink raw reply

* Re: [patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated
From: Andrew Morton @ 2006-03-08  2:16 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308020227.GD9062@localhost.localdomain>

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> --- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-07 15:09:22.000000000 -0800
>  +++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-07 15:09:52.000000000 -0800
>  @@ -543,7 +543,7 @@ struct proto {
>   	/* Memory pressure */
>   	void			(*enter_memory_pressure)(void);
>   	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
>  -	atomic_t		*sockets_allocated;	/* Current number of sockets. */
>  +	int                     *sockets_allocated;     /* Current number of sockets (percpu counter). */
>   
>   	/*
>   	 * Pressure flag: try to collapse.
>  @@ -579,6 +579,24 @@ struct proto {
>   	} stats[NR_CPUS];
>   };
>   
>  +static inline int read_sockets_allocated(struct proto *prot)
>  +{
>  +	int total = 0;
>  +	int cpu;
>  +	for_each_cpu(cpu)
>  +		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
>  +	return total;
>  +}

This is likely too big to be inlined, plus sock.h doesn't include enough
headers to reliably compile this code.

>  +static inline void mod_sockets_allocated(int *sockets_allocated, int count)
>  +{
>  +	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
>  +	put_cpu();
>  +}
>  +

Ditto.

^ permalink raw reply

* Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
From: Ravikiran G Thirumalai @ 2006-03-08  3:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060307181422.3e279ca1.akpm@osdl.org>

On Tue, Mar 07, 2006 at 06:14:22PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >
> > -	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
> >  +	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
> >  +			sk->sk_prot->sysctl_mem[0]) {
> 
> Bear in mind that percpu_counter_read[_positive] can be inaccurate on large
> CPU counts.
> 
> It might be worth running percpu_counter_sum() to get the exact count if we
> think we're about to cause something to fail.

The problem is percpu_counter_sum has to read all the cpus cachelines.  If
we have to use percpu_counter_sum everywhere, then might as well use plain
per-cpu counters instead of  batching ones no?
 
sysctl_mem[0] is about 196K  and on a 16 cpu box variance is 512 bytes, which 
is OK with just percpu_counter_read I hope.  Maybe, on very large cpu counts, 
we should just change the FBC_BATCH so that variance does not go quadratic.
Something like 32.  So that variance is 32 * NR_CPUS in that case, instead
of (NR_CPUS * NR_CPUS * 2) currently.  Comments?

^ permalink raw reply

* Re: de2104x: interrupts before interrupt handler is registered
From: Martin Michlmayr @ 2006-03-08  3:22 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev, linux-kernel
In-Reply-To: <20060308001556.GA9362@electric-eye.fr.zoreil.com>

* Francois Romieu <romieu@fr.zoreil.com> [2006-03-08 01:15]:
> netdev watchdog events appear in the dmesg of the patched driver.
> The driver survived it. So I'd say that the patch does its job.
> 
> OTOH, if you ever saw the unpatched driver survive this event, yell
> now.

No, I've never seen the unpatched driver survive.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply

* Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
From: Andrew Morton @ 2006-03-08  3:22 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060308030803.GF9062@localhost.localdomain>

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> On Tue, Mar 07, 2006 at 06:14:22PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> > >
> > > -	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
> > >  +	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
> > >  +			sk->sk_prot->sysctl_mem[0]) {
> > 
> > Bear in mind that percpu_counter_read[_positive] can be inaccurate on large
> > CPU counts.
> > 
> > It might be worth running percpu_counter_sum() to get the exact count if we
> > think we're about to cause something to fail.
> 
> The problem is percpu_counter_sum has to read all the cpus cachelines.  If
> we have to use percpu_counter_sum everywhere, then might as well use plain
> per-cpu counters instead of  batching ones no?

I didn't say "use it everywhere" ;)

Just in places like this:

	if (percpu_counter_read(something) > something_else)
		make_an_application_fail();

in that case it's worth running percpu_counter_sum().  And bear in mind
that once we've done that, the following percpu_counter_read()s become
accurate, so we won't run the expensive percpu_counter_sum() again
for a while.  Unless we're really close to or over the limit, in which case
blowing a few cycles is relatively unimportant.

All that should be captured in library code (per_cpu_counter_exceeds(ptr,
threshold), for example) rather than open-coded everywhere.

> sysctl_mem[0] is about 196K  and on a 16 cpu box variance is 512 bytes, which 
> is OK with just percpu_counter_read I hope.

You mean a 16 CPU box with NR_CPUS=16 as well...

>  Maybe, on very large cpu counts, 
> we should just change the FBC_BATCH so that variance does not go quadratic.
> Something like 32.  So that variance is 32 * NR_CPUS in that case, instead
> of (NR_CPUS * NR_CPUS * 2) currently.  Comments?

Sure, we need to make that happen.  But it got all mixed up with the
spinlock removal and it does need quite some thought and testing and
documentation to help developers to choose the right settings and
appropriate selection of defaults, etc.

^ permalink raw reply

* [2.6 patch] drivers/net/wireless/ipw2200.c: make ipw_qos_current_mode() static
From: Adrian Bunk @ 2006-03-08 12:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: yi.zhu, jketreno, linux-kernel, linville, netdev

This patch makes the needlessly global function ipw_qos_current_mode() 
static.


Signed-off-by: Adrian Bunk <bunk@stusta.de>

---

This patch was already sent on:
- 4 Mar 2006

--- linux-2.6.16-rc5-mm2-full/drivers/net/wireless/ipw2200.c.old	2006-03-03 17:49:37.000000000 +0100
+++ linux-2.6.16-rc5-mm2-full/drivers/net/wireless/ipw2200.c	2006-03-03 17:50:00.000000000 +0100
@@ -6566,7 +6566,7 @@
 * get the modulation type of the current network or
 * the card current mode
 */
-u8 ipw_qos_current_mode(struct ipw_priv * priv)
+static u8 ipw_qos_current_mode(struct ipw_priv * priv)
 {
 	u8 mode = 0;
 

^ permalink raw reply

* Re: Re: TSO and IPoIB performance degradation
From: Michael S. Tsirkin @ 2006-03-08 12:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, rdreier, linux-kernel, openib-general, shemminger
In-Reply-To: <20060307.172336.107863253.davem@davemloft.net>

Quoting r. David S. Miller <davem@davemloft.net>:
> Subject: Re: Re: TSO and IPoIB performance degradation
> 
> From: Roland Dreier <rdreier@cisco.com>
> Date: Tue, 07 Mar 2006 17:17:30 -0800
> 
> > The reason TSO comes up is that reverting the patch described below
> > helps (or helped at some point at least) IPoIB throughput quite a bit.
> 
> I wish you had started the thread by mentioning this specific patch

Er, since you mention it, the first message in thread did include this link:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc
and I even pasted the patch description there, but oh well.

Now that Roland helped us clear it all up, and now that it has been clarified
that reverting this patch gives us back most of the performance, is the answer
to my question the same?

What I was trying to figure out was, how can we re-enable the trick without
hurting TSO? Could a solution be to simply look at the frame size, and call
tcp_send_delayed_ack if the frame size is small?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

^ permalink raw reply

* [PATCH][RFC] Notifier mechanism to notify RDMA devices of network events
From: Tom Tucker @ 2006-03-08 16:38 UTC (permalink / raw)
  To: netdev, linux-kernel, openib-general


This patch implements a mechanism that allows interested clients to 
register for notification of certain network events. The intended use
is to allow RDMA(OpenIB) devices to be notified of neighbour updates, 
ICMP redirects, path MTU changes, and route changes. 

The reason these devices need update events is because they typically 
cache this information in hardware and need to be notified when
this information has been updated.

This approach is one of many possibilities and may be preferred because
it uses an existing notification mechanism that has precedent in the 
stack. 

This code does not yet implement path MTU change because the number of 
places in which this value is updated is large and if this mechanism seems
reasonable, it would be probably be best to funnel these updates through a 
single function.


diff -u -r -x '.*' --new-file linux-2.6.14.5/include/net/netevent.h linux-2.6.14.5.tom/include/net/netevent.h
--- linux-2.6.14.5/include/net/netevent.h	1969-12-31 18:00:00.000000000 -0600
+++ linux-2.6.14.5.tom/include/net/netevent.h	2006-01-16 13:42:09.000000000 -0600
@@ -0,0 +1,41 @@
+#ifndef _NET_EVENT_H
+#define _NET_EVENT_H
+
+/*
+ *	Generic netevent notifiers
+ *
+ *	Authors:
+ *      Tom Tucker              <tom@opengridcomputing.com>
+ *
+ * 	Changes:
+ */
+
+#ifdef __KERNEL__
+
+#include <net/dst.h>
+
+struct netevent_redirect {
+	struct dst_entry *old;
+	struct dst_entry *new;
+};
+
+struct netevent_route_change {
+        int event;
+        struct fib_info *fib_info;
+};
+
+enum netevent_notif_type {
+	NETEVENT_NEIGH_UPDATE = 1, /* arg is * struct neighbour */
+	NETEVENT_ROUTE_UPDATE,	   /* arg is * netevent_route_change */
+	NETEVENT_PMTU_UPDATE,
+	NETEVENT_REDIRECT,	   /* arg is * struct netevent_redirect */
+};
+
+extern int register_netevent_notifier(struct notifier_block *nb);
+extern int unregister_netevent_notifier(struct notifier_block *nb);
+extern int call_netevent_notifiers(unsigned long val, void *v);
+
+#endif
+#endif
+
+
diff -u -r -x '.*' --new-file linux-2.6.14.5/net/core/Makefile linux-2.6.14.5.tom/net/core/Makefile
--- linux-2.6.14.5/net/core/Makefile	2005-12-26 18:26:33.000000000 -0600
+++ linux-2.6.14.5.tom/net/core/Makefile	2006-01-16 13:41:42.000000000 -0600
@@ -7,7 +7,7 @@
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
-obj-y		     += dev.o ethtool.o dev_mcast.o dst.o \
+obj-y		     += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o
 
 obj-$(CONFIG_XFRM) += flow.o
diff -u -r -x '.*' --new-file linux-2.6.14.5/net/core/neighbour.c linux-2.6.14.5.tom/net/core/neighbour.c
--- linux-2.6.14.5/net/core/neighbour.c	2005-12-26 18:26:33.000000000 -0600
+++ linux-2.6.14.5.tom/net/core/neighbour.c	2006-01-16 13:41:42.000000000 -0600
@@ -30,9 +30,11 @@
 #include <net/neighbour.h>
 #include <net/dst.h>
 #include <net/sock.h>
+#include <net/netevent.h>
 #include <linux/rtnetlink.h>
 #include <linux/random.h>
 #include <linux/string.h>
+#include <linux/notifier.h>
 
 #define NEIGH_DEBUG 1
 
@@ -756,6 +758,7 @@
 			NEIGH_PRINTK2("neigh %p is suspected.\n", neigh);
 			neigh->nud_state = NUD_STALE;
 			neigh_suspect(neigh);
+			call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
 		}
 	} else if (state & NUD_DELAY) {
 		if (time_before_eq(now, 
@@ -763,6 +766,7 @@
 			NEIGH_PRINTK2("neigh %p is now reachable.\n", neigh);
 			neigh->nud_state = NUD_REACHABLE;
 			neigh_connect(neigh);
+			call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
 			next = neigh->confirmed + neigh->parms->reachable_time;
 		} else {
 			NEIGH_PRINTK2("neigh %p is probed.\n", neigh);
@@ -781,6 +785,7 @@
 
 		neigh->nud_state = NUD_FAILED;
 		notify = 1;
+		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
 		NEIGH_CACHE_STAT_INC(neigh->tbl, res_failed);
 		NEIGH_PRINTK2("neigh %p is failed.\n", neigh);
 
@@ -1051,6 +1056,9 @@
 			(neigh->flags | NTF_ROUTER) :
 			(neigh->flags & ~NTF_ROUTER);
 	}
+
+	call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
+
 	write_unlock_bh(&neigh->lock);
 #ifdef CONFIG_ARPD
 	if (notify && neigh->parms->app_probes)
diff -u -r -x '.*' --new-file linux-2.6.14.5/net/core/netevent.c linux-2.6.14.5.tom/net/core/netevent.c
--- linux-2.6.14.5/net/core/netevent.c	1969-12-31 18:00:00.000000000 -0600
+++ linux-2.6.14.5.tom/net/core/netevent.c	2006-01-16 13:50:25.000000000 -0600
@@ -0,0 +1,69 @@
+/*
+ *	Network event notifiers
+ *
+ *	Authors:
+ *      Tom Tucker             <tom@opengridcomputing.com>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ *
+ *	Fixes:
+ */
+
+#include <linux/rtnetlink.h>
+#include <linux/notifier.h>
+
+static struct notifier_block *netevent_notif_chain;
+
+/**
+ *	register_netevent_notifier - register a netevent notifier block
+ *	@nb: notifier
+ *
+ *	Register a notifier to be called when a netevent occurs.
+ *	The notifier passed is linked into the kernel structures and must
+ *	not be reused until it has been unregistered. A negative errno code
+ *	is returned on a failure.
+ */
+int register_netevent_notifier(struct notifier_block *nb)
+{
+	int err;
+
+	rtnl_lock();
+	err = notifier_chain_register(&netevent_notif_chain, nb);
+	rtnl_unlock();
+	return err;
+}
+
+/**
+ *	netevent_unregister_notifier - unregister a netevent notifier block
+ *	@nb: notifier
+ *
+ *	Unregister a notifier previously registered by
+ *	register_neigh_notifier(). The notifier is unlinked into the
+ *	kernel structures and may then be reused. A negative errno code
+ *	is returned on a failure.
+ */
+
+int unregister_netevent_notifier(struct notifier_block *nb)
+{
+	return notifier_chain_unregister(&netevent_notif_chain, nb);
+}
+
+/**
+ *	call_netevent_notifiers - call all netevent notifier blocks
+ *      @val: value passed unmodified to notifier function
+ *      @v:   pointer passed unmodified to notifier function
+ *
+ *	Call all neighbour notifier blocks.  Parameters and return value
+ *	are as for notifier_call_chain().
+ */
+
+int call_netevent_notifiers(unsigned long val, void *v)
+{
+	return notifier_call_chain(&netevent_notif_chain, val, v);
+}
+
+EXPORT_SYMBOL(register_netevent_notifier);
+EXPORT_SYMBOL(unregister_netevent_notifier);
diff -u -r -x '.*' --new-file linux-2.6.14.5/net/ipv4/fib_semantics.c linux-2.6.14.5.tom/net/ipv4/fib_semantics.c
--- linux-2.6.14.5/net/ipv4/fib_semantics.c	2005-12-26 18:26:33.000000000 -0600
+++ linux-2.6.14.5.tom/net/ipv4/fib_semantics.c	2006-01-16 13:41:42.000000000 -0600
@@ -43,6 +43,7 @@
 #include <net/sock.h>
 #include <net/ip_fib.h>
 #include <net/ip_mp_alg.h>
+#include <net/netevent.h>
 
 #include "fib_lookup.h"
 
@@ -276,9 +277,15 @@
 	       struct nlmsghdr *n, struct netlink_skb_parms *req)
 {
 	struct sk_buff *skb;
+	struct netevent_route_change rev;
+
 	u32 pid = req ? req->pid : n->nlmsg_pid;
 	int size = NLMSG_SPACE(sizeof(struct rtmsg)+256);
 
+	rev.event = event;
+	rev.fib_info = fa->fa_info;
+	call_netevent_notifiers(NETEVENT_ROUTE_UPDATE, &rev);
+
 	skb = alloc_skb(size, GFP_KERNEL);
 	if (!skb)
 		return;
diff -u -r -x '.*' --new-file linux-2.6.14.5/net/ipv4/route.c linux-2.6.14.5.tom/net/ipv4/route.c
--- linux-2.6.14.5/net/ipv4/route.c	2005-12-26 18:26:33.000000000 -0600
+++ linux-2.6.14.5.tom/net/ipv4/route.c	2006-01-16 13:41:42.000000000 -0600
@@ -103,6 +103,7 @@
 #include <net/icmp.h>
 #include <net/xfrm.h>
 #include <net/ip_mp_alg.h>
+#include <net/netevent.h>
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
@@ -1118,6 +1119,7 @@
 	struct rtable *rth, **rthp;
 	u32  skeys[2] = { saddr, 0 };
 	int  ikeys[2] = { dev->ifindex, 0 };
+	struct netevent_redirect netevent;
 
 	tos &= IPTOS_RT_MASK;
 
@@ -1213,6 +1215,10 @@
 					rt_drop(rt);
 					goto do_next;
 				}
+				
+				netevent.old = &rth->u.dst;
+				netevent.new = &rt->u.dst;
+				call_netevent_notifiers(NETEVENT_REDIRECT, &netevent);
 
 				rt_del(hash, rth);
 				if (!rt_intern_hash(hash, rt, &rt))

^ permalink raw reply

* [PATCH] compat. ifconf: fix limits
From: Randy.Dunlap @ 2006-03-08 17:16 UTC (permalink / raw)
  To: netdev, linux-fsdevel; +Cc: Alexandra.Kossovsky, ak, akpm, torvalds

From: Randy Dunlap <rdunlap@xenotime.net>

A recent change to compat. dev_ifconf() in fs/compat_ioctl.c
causes ifconf data to be truncated 1 entry too early when copying it
to userspace.  The correct amount of data (length) is returned,
but the final entry is empty (zero, not filled in).
The for-loop 'i' check should use <= to allow the final struct
ifreq32 to be copied.  I also used the ifconf-corruption program
in kernel bugzilla #4746 to make sure that this change does not
re-introduce the corruption.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
---
 fs/compat_ioctl.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

--- linux-2616-rc5.orig/fs/compat_ioctl.c
+++ linux-2616-rc5/fs/compat_ioctl.c
@@ -446,7 +446,7 @@ static int dev_ifconf(unsigned int fd, u
 	ifr = ifc.ifc_req;
 	ifr32 = compat_ptr(ifc32.ifcbuf);
 	for (i = 0, j = 0;
-             i + sizeof (struct ifreq32) < ifc32.ifc_len && j < ifc.ifc_len;
+             i + sizeof (struct ifreq32) <= ifc32.ifc_len && j < ifc.ifc_len;
 	     i += sizeof (struct ifreq32), j += sizeof (struct ifreq)) {
 		if (copy_in_user(ifr32, ifr, sizeof (struct ifreq32)))
 			return -EFAULT;


---

^ permalink raw reply

* Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
From: Ravikiran G Thirumalai @ 2006-03-08 20:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060307181301.4dd6aa96.akpm@osdl.org>

On Tue, Mar 07, 2006 at 06:13:01PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >
> > +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
> >  +{
> >  +	local_bh_disable();
> >  +	percpu_counter_mod(fbc, amount);
> >  +	local_bh_enable();
> >  +}
> >  +
> 
> percpu_counter_mod() does preempt_disable(), which is redundant in this
> context.  So just do fbc->count += amount; here.

OK.  Here is the revised patch.


Add percpu_counter_mod_bh for using these counters safely from
both softirq and process context.

Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off by: Ravikiran G Thirumalai <kiran@scalex86.org>
Signed-off by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc5mm3/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/linux/percpu_counter.h	2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/linux/percpu_counter.h	2006-03-08 10:53:13.000000000 -0800
@@ -11,6 +11,7 @@
 #include <linux/smp.h>
 #include <linux/threads.h>
 #include <linux/percpu.h>
+#include <linux/interrupt.h>
 
 #ifdef CONFIG_SMP
 
@@ -40,6 +41,7 @@ static inline void percpu_counter_destro
 
 void percpu_counter_mod(struct percpu_counter *fbc, long amount);
 long percpu_counter_sum(struct percpu_counter *fbc);
+void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount);
 
 static inline long percpu_counter_read(struct percpu_counter *fbc)
 {
@@ -98,6 +100,12 @@ static inline long percpu_counter_sum(st
 	return percpu_counter_read_positive(fbc);
 }
 
+static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+	local_bh_disable();
+	fbc->count += amount;
+	local_bh_enable();
+}
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
@@ -110,4 +118,5 @@ static inline void percpu_counter_dec(st
 	percpu_counter_mod(fbc, -1);
 }
 
+
 #endif /* _LINUX_PERCPU_COUNTER_H */
Index: linux-2.6.16-rc5mm3/mm/swap.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/mm/swap.c	2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/mm/swap.c	2006-03-08 12:11:19.000000000 -0800
@@ -521,11 +521,11 @@ static int cpu_swap_callback(struct noti
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_SMP
-void percpu_counter_mod(struct percpu_counter *fbc, long amount)
+static void __percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
 	long count;
 	long *pcount;
-	int cpu = get_cpu();
+	int cpu = smp_processor_id();
 
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
@@ -537,9 +537,21 @@ void percpu_counter_mod(struct percpu_co
 	} else {
 		*pcount = count;
 	}
-	put_cpu();
 }
-EXPORT_SYMBOL(percpu_counter_mod);
+
+void percpu_counter_mod(struct percpu_counter *fbc, long amount)
+{
+	preempt_disable();
+	__percpu_counter_mod(fbc, amount);
+	preempt_enable();
+}
+
+void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+	local_bh_disable();
+	__percpu_counter_mod(fbc, amount);
+	local_bh_enable();
+}
 
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
@@ -559,6 +571,9 @@ long percpu_counter_sum(struct percpu_co
 	spin_unlock(&fbc->lock);
 	return ret < 0 ? 0 : ret;
 }
+
+EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(percpu_counter_mod_bh);
 EXPORT_SYMBOL(percpu_counter_sum);
 #endif
 

^ permalink raw reply

* Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
From: Benjamin LaHaise @ 2006-03-08 20:36 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, davem, netdev, shai
In-Reply-To: <20060308202656.GA4493@localhost.localdomain>

On Wed, Mar 08, 2006 at 12:26:56PM -0800, Ravikiran G Thirumalai wrote:
> +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
> +{
> +	local_bh_disable();
> +	fbc->count += amount;
> +	local_bh_enable();
> +}

Please use local_t instead, then you don't have to do the local_bh_disable() 
and enable() and the whole thing collapses down into 1 instruction on x86.

		-ben

^ permalink raw reply

* Re: TSO and IPoIB performance degradation
From: David S. Miller @ 2006-03-08 20:53 UTC (permalink / raw)
  To: mst; +Cc: rdreier, netdev, linux-kernel, openib-general, shemminger
In-Reply-To: <20060308125311.GE17618@mellanox.co.il>

From: "Michael S. Tsirkin" <mst@mellanox.co.il>
Date: Wed, 8 Mar 2006 14:53:11 +0200

> What I was trying to figure out was, how can we re-enable the trick without
> hurting TSO? Could a solution be to simply look at the frame size, and call
> tcp_send_delayed_ack if the frame size is small?

The problem is that this patch helps performance when the
receiver is CPU limited.

The old code would delay ACKs forever if the CPU of the
receiver was slow, because we'd wait for all received
packets to be copied into userspace before spitting out
the ACK.  This would allow the pipe to empty, since the
sender is waiting for ACKs in order to send more into
the pipe, and once the ACK did go out it would cause the
sender to emit an enormous burst of data.  Both of these
behaviors are highly frowned upon for a TCP stack.

I'll try to look at this some more later today.

^ permalink raw reply

* Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
From: Ravikiran G Thirumalai @ 2006-03-08 20:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060307192234.7efb1213.akpm@osdl.org>

On Tue, Mar 07, 2006 at 07:22:34PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> > 
> > The problem is percpu_counter_sum has to read all the cpus cachelines.  If
> > we have to use percpu_counter_sum everywhere, then might as well use plain
> > per-cpu counters instead of  batching ones no?
> 
> I didn't say "use it everywhere" ;)

:) No you did not. Sorry.  When the counter shows large varience it will affect
all places where it is read and it wasn't clear to me where to use
percpu_counter_sum vs percpu_counter_read in the context of the patch.  
But yes, using percpu_counter_sum at the critical points makes things better.
Thanks for pointing it out.  Here is an attempt on those lines. 

> 
> >  Maybe, on very large cpu counts, 
> > we should just change the FBC_BATCH so that variance does not go quadratic.
> > Something like 32.  So that variance is 32 * NR_CPUS in that case, instead
> > of (NR_CPUS * NR_CPUS * 2) currently.  Comments?
> 
> Sure, we need to make that happen.  But it got all mixed up with the
> spinlock removal and it does need quite some thought and testing and
> documentation to help developers to choose the right settings and
> appropriate selection of defaults, etc.

I was thinking on the lines of something like the following as a simple fix
to the quadratic variance. But yes, it needs some experimentation before hand..

#if NR_CPUS >= 16
#define FBC_BATCH       (32)
#else
#define FBC_BATCH       (NR_CPUS*4)
#endif

Thanks,
Kiran


Change struct proto->memory_allocated to a batching per-CPU counter 
(percpu_counter) from an atomic_t.  A batching counter is better than a 
plain per-CPU counter as this field is read often. 

Changes since last post: Use the more accurate percpu_counter_sum() at 
critical places.

Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-08 11:03:19.000000000 -0800
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/percpu_counter.h>
 
 #include <linux/filter.h>
 
@@ -541,8 +542,9 @@ struct proto {
 
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
-	atomic_t		*memory_allocated;	/* Current allocated memory. */
+	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
 	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+
 	/*
 	 * Pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-08 11:03:19.000000000 -0800
@@ -225,7 +225,7 @@ extern int sysctl_tcp_abc;
 extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-08 11:03:19.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
-	seq_printf(seq, "%-9s %4u %6d  %6d   %-3s %6u   %-3s  %-10s "
+	seq_printf(seq, "%-9s %4u %6d  %6lu   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
-		   proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+		   proto->memory_allocated != NULL ?
+				percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-08 11:08:28.000000000 -0800
@@ -196,11 +196,11 @@ EXPORT_SYMBOL(sk_stream_error);
 void __sk_stream_mem_reclaim(struct sock *sk)
 {
 	if (sk->sk_forward_alloc >= SK_STREAM_MEM_QUANTUM) {
-		atomic_sub(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM,
-			   sk->sk_prot->memory_allocated);
+		percpu_counter_mod_bh(sk->sk_prot->memory_allocated,
+				-(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM));
 		sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
 		if (*sk->sk_prot->memory_pressure &&
-		    (atomic_read(sk->sk_prot->memory_allocated) <
+		    (percpu_counter_read(sk->sk_prot->memory_allocated) <
 		     sk->sk_prot->sysctl_mem[0]))
 			*sk->sk_prot->memory_pressure = 0;
 	}
@@ -213,23 +213,26 @@ int sk_stream_mem_schedule(struct sock *
 	int amt = sk_stream_pages(size);
 
 	sk->sk_forward_alloc += amt * SK_STREAM_MEM_QUANTUM;
-	atomic_add(amt, sk->sk_prot->memory_allocated);
+	percpu_counter_mod_bh(sk->sk_prot->memory_allocated, amt);
 
 	/* Under limit. */
-	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
+	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
+			sk->sk_prot->sysctl_mem[0]) {
 		if (*sk->sk_prot->memory_pressure)
 			*sk->sk_prot->memory_pressure = 0;
 		return 1;
 	}
 
 	/* Over hard limit. */
-	if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
+	if (percpu_counter_sum(sk->sk_prot->memory_allocated) >
+			sk->sk_prot->sysctl_mem[2]) {
 		sk->sk_prot->enter_memory_pressure();
 		goto suppress_allocation;
 	}
 
 	/* Under pressure. */
-	if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[1])
+	if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+			sk->sk_prot->sysctl_mem[1])
 		sk->sk_prot->enter_memory_pressure();
 
 	if (kind) {
@@ -259,7 +262,7 @@ suppress_allocation:
 
 	/* Alas. Undo changes. */
 	sk->sk_forward_alloc -= amt * SK_STREAM_MEM_QUANTUM;
-	atomic_sub(amt, sk->sk_prot->memory_allocated);
+	percpu_counter_mod_bh(sk->sk_prot->memory_allocated, -amt);
 	return 0;
 }
 
Index: linux-2.6.16-rc5mm3/net/decnet/af_decnet.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/decnet/af_decnet.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/decnet/af_decnet.c	2006-03-08 11:03:19.000000000 -0800
@@ -154,7 +154,7 @@ static const struct proto_ops dn_proto_o
 static DEFINE_RWLOCK(dn_hash_lock);
 static struct hlist_head dn_sk_hash[DN_SK_HASH_SIZE];
 static struct hlist_head dn_wild_sk;
-static atomic_t decnet_memory_allocated;
+static struct percpu_counter decnet_memory_allocated;
 
 static int __dn_setsockopt(struct socket *sock, int level, int optname, char __user *optval, int optlen, int flags);
 static int __dn_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen, int flags);
@@ -2383,7 +2383,8 @@ static int __init decnet_init(void)
 	rc = proto_register(&dn_proto, 1);
 	if (rc != 0)
 		goto out;
-
+	
+	percpu_counter_init(&decnet_memory_allocated);
 	dn_neigh_init();
 	dn_dev_init();
 	dn_route_init();
@@ -2424,6 +2425,7 @@ static void __exit decnet_exit(void)
 	proc_net_remove("decnet");
 
 	proto_unregister(&dn_proto);
+	percpu_counter_destroy(&decnet_memory_allocated);
 }
 module_exit(decnet_exit);
 #endif
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c	2006-03-08 11:03:19.000000000 -0800
@@ -61,10 +61,10 @@ static int fold_prot_inuse(struct proto 
 static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
 	socket_seq_show(seq);
-	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %d\n",
+	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
 		   fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
 		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
-		   atomic_read(&tcp_memory_allocated));
+		   percpu_counter_read_positive(&tcp_memory_allocated));
 	seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
 	seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
 	seq_printf(seq,  "FRAG: inuse %d memory %d\n", ip_frag_nqueues,
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c	2006-03-08 11:16:33.000000000 -0800
@@ -283,7 +283,7 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
-atomic_t tcp_memory_allocated;	/* Current allocated memory. */
+struct percpu_counter tcp_memory_allocated;	/* Current allocated memory. */
 atomic_t tcp_sockets_allocated;	/* Current number of TCP sockets. */
 
 EXPORT_SYMBOL(tcp_memory_allocated);
@@ -1593,7 +1593,7 @@ adjudge_to_death:
 		sk_stream_mem_reclaim(sk);
 		if (atomic_read(sk->sk_prot->orphan_count) > sysctl_tcp_max_orphans ||
 		    (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-		     atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+		     percpu_counter_sum(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
 			if (net_ratelimit())
 				printk(KERN_INFO "TCP: too many of orphaned "
 				       "sockets\n");
@@ -2127,6 +2127,8 @@ void __init tcp_init(void)
 	       "(established %d bind %d)\n",
 	       tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
 
+	percpu_counter_init(&tcp_memory_allocated);
+
 	tcp_register_congestion_control(&tcp_reno);
 }
 
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_input.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c	2006-03-08 11:03:19.000000000 -0800
@@ -333,7 +333,7 @@ static void tcp_clamp_window(struct sock
 	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
 	    !tcp_memory_pressure &&
-	    atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+	    percpu_counter_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
 				    sysctl_tcp_rmem[2]);
 	}
@@ -3555,7 +3555,7 @@ static int tcp_should_expand_sndbuf(stru
 		return 0;
 
 	/* If we are under soft global TCP memory pressure, do not expand.  */
-	if (atomic_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+	if (percpu_counter_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
 		return 0;
 
 	/* If we filled the congestion window, do not expand.  */
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_timer.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c	2006-03-08 11:39:05.000000000 -0800
@@ -80,7 +80,7 @@ static int tcp_out_of_resources(struct s
 
 	if (orphans >= sysctl_tcp_max_orphans ||
 	    (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
-	     atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+	     percpu_counter_sum(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
 		if (net_ratelimit())
 			printk(KERN_INFO "Out of socket memory\n");
 

^ permalink raw reply

* Re: [patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated
From: Ravikiran G Thirumalai @ 2006-03-08 20:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, netdev, shai
In-Reply-To: <20060307181602.77030e2a.akpm@osdl.org>

On Tue, Mar 07, 2006 at 06:16:02PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >   
> >  +static inline int read_sockets_allocated(struct proto *prot)
> >  +{
> >  +	int total = 0;
> >  +	int cpu;
> >  +	for_each_cpu(cpu)
> >  +		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
> >  +	return total;
> >  +}
> 
> This is likely too big to be inlined, plus sock.h doesn't include enough
> headers to reliably compile this code.
> 
> >  +static inline void mod_sockets_allocated(int *sockets_allocated, int count)
> >  +{
> >  +	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
> >  +	put_cpu();
> >  +}
> >  +
> 
> Ditto.

OK, here is a revised patch.


Change the atomic_t sockets_allocated member of struct proto to a 
per-cpu counter.

Signed-off-by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-08 12:06:52.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
 	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
-	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+	int                     *sockets_allocated;     /* Current number of sockets (percpu counter). */
 
 	/*
 	 * Pressure flag: try to collapse.
@@ -579,6 +579,12 @@ struct proto {
 	} stats[NR_CPUS];
 };
 
+extern int read_sockets_allocated(struct proto *prot);
+extern void mod_sockets_allocated(int *sockets_allocated, int count);
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);
 
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-08 11:39:40.000000000 -0800
@@ -226,7 +226,7 @@ extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
 extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
 /*
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-08 12:09:19.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock 
 		newsk->sk_sleep	 = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			atomic_inc(newsk->sk_prot->sockets_allocated);
+			inc_sockets_allocated(newsk->sk_prot->sockets_allocated); 
 	}
 out:
 	return newsk;
@@ -1451,6 +1451,25 @@ void sk_common_release(struct sock *sk)
 
 EXPORT_SYMBOL(sk_common_release);
 
+int read_sockets_allocated(struct proto *prot)
+{
+	int total = 0;
+	int cpu;
+	for_each_cpu(cpu)
+		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+	return total;
+}
+
+EXPORT_SYMBOL(read_sockets_allocated);
+
+void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+	put_cpu();
+}
+
+EXPORT_SYMBOL(mod_sockets_allocated);
+
 static DEFINE_RWLOCK(proto_list_lock);
 static LIST_HEAD(proto_list);
 
@@ -1620,7 +1639,7 @@ static void proto_seq_printf(struct seq_
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
-		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+		   proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
 		   proto->memory_allocated != NULL ?
 				percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-08 11:08:28.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-08 11:39:40.000000000 -0800
@@ -242,7 +242,7 @@ int sk_stream_mem_schedule(struct sock *
 		return 1;
 
 	if (!*sk->sk_prot->memory_pressure ||
-	    sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+	    sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
 				sk_stream_pages(sk->sk_wmem_queued +
 						atomic_read(&sk->sk_rmem_alloc) +
 						sk->sk_forward_alloc))
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c	2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c	2006-03-08 11:39:40.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
 		   fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
-		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+		   tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
 		   percpu_counter_read_positive(&tcp_memory_allocated));
 	seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
 	seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c	2006-03-08 11:16:33.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c	2006-03-08 11:39:40.000000000 -0800
@@ -284,7 +284,7 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 struct percpu_counter tcp_memory_allocated;	/* Current allocated memory. */
-atomic_t tcp_sockets_allocated;	/* Current number of TCP sockets. */
+int *tcp_sockets_allocated;   /* Current number of TCP sockets. */
 
 EXPORT_SYMBOL(tcp_memory_allocated);
 EXPORT_SYMBOL(tcp_sockets_allocated);
@@ -2055,6 +2055,12 @@ void __init tcp_init(void)
 	if (!tcp_hashinfo.bind_bucket_cachep)
 		panic("tcp_init: Cannot alloc tcp_bind_bucket cache.");
 
+	tcp_sockets_allocated = alloc_percpu(int);
+	if (!tcp_sockets_allocated)
+		panic("tcp_init: Cannot alloc tcp_sockets_allocated");
+
+	tcp_prot.sockets_allocated = tcp_sockets_allocated;
+
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
 	 *
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_ipv4.c	2006-03-08 10:36:03.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c	2006-03-08 11:39:40.000000000 -0800
@@ -1273,7 +1273,7 @@ static int tcp_v4_init_sock(struct sock 
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
-	atomic_inc(&tcp_sockets_allocated);
+	inc_sockets_allocated(tcp_sockets_allocated);
 
 	return 0;
 }
@@ -1307,7 +1307,7 @@ int tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
-	atomic_dec(&tcp_sockets_allocated);
+	dec_sockets_allocated(tcp_sockets_allocated);
 
 	return 0;
 }
@@ -1815,7 +1815,6 @@ struct proto tcp_prot = {
 	.unhash			= tcp_unhash,
 	.get_port		= tcp_v4_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
 	.orphan_count		= &tcp_orphan_count,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
Index: linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv6/tcp_ipv6.c	2006-03-08 10:36:03.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c	2006-03-08 11:39:40.000000000 -0800
@@ -1375,7 +1375,7 @@ static int tcp_v6_init_sock(struct sock 
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
-	atomic_inc(&tcp_sockets_allocated);
+	inc_sockets_allocated(tcp_sockets_allocated);
 
 	return 0;
 }
@@ -1573,7 +1573,6 @@ struct proto tcpv6_prot = {
 	.unhash			= tcp_unhash,
 	.get_port		= tcp_v6_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
-	.sockets_allocated	= &tcp_sockets_allocated,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
 	.orphan_count		= &tcp_orphan_count,
@@ -1605,6 +1604,7 @@ static struct inet_protosw tcpv6_protosw
 
 void __init tcpv6_init(void)
 {
+	tcpv6_prot.sockets_allocated = tcp_sockets_allocated;
 	/* register inet6 protocol */
 	if (inet6_add_protocol(&tcpv6_protocol, IPPROTO_TCP) < 0)
 		printk(KERN_ERR "tcpv6_init: Could not register protocol\n");

^ permalink raw reply

* Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
From: Ravikiran G Thirumalai @ 2006-03-08 21:07 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, linux-kernel, davem, netdev, shai
In-Reply-To: <20060308203642.GZ5410@kvack.org>

On Wed, Mar 08, 2006 at 03:36:42PM -0500, Benjamin LaHaise wrote:
> On Wed, Mar 08, 2006 at 12:26:56PM -0800, Ravikiran G Thirumalai wrote:
> > +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
> > +{
> > +	local_bh_disable();
> > +	fbc->count += amount;
> > +	local_bh_enable();
> > +}
> 
> Please use local_t instead, then you don't have to do the local_bh_disable() 
> and enable() and the whole thing collapses down into 1 instruction on x86.

But on non x86, local_bh_disable() is gonna be cheaper than a cli/atomic op no?
(Even if they were switched over to do local_irq_save() and
local_irq_restore() from atomic_t's that is).

And if we use local_t, we will add the overhead for the non bh 
percpu_counter_mod for non x86 arches.

Thanks,
Kiran

^ permalink raw reply

* Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
From: Benjamin LaHaise @ 2006-03-08 21:17 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Andrew Morton, linux-kernel, davem, netdev, shai
In-Reply-To: <20060308210726.GD4493@localhost.localdomain>

On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote:
> But on non x86, local_bh_disable() is gonna be cheaper than a cli/atomic op no?
> (Even if they were switched over to do local_irq_save() and
> local_irq_restore() from atomic_t's that is).

It's still more expensive than local_t.

> And if we use local_t, we will add the overhead for the non bh 
> percpu_counter_mod for non x86 arches.

Last time I checked, all the major architectures had efficient local_t 
implementations.  Most of the RISC CPUs are able to do a load / store 
conditional implementation that is the same cost (since memory barriers 
tend to be explicite on powerpc).  So why not use it?

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply

* Re: Re: [Patch 7/7] Generic netlink interface (delay accounting)
From: Shailabh Nagar @ 2006-03-08 21:56 UTC (permalink / raw)
  To: hadi; +Cc: netdev, linux-kernel, lse-tech
In-Reply-To: <1141742282.5171.55.camel@localhost.localdomain>

jamal wrote:

>On Mon, 2006-06-03 at 12:00 -0500, Shailabh Nagar wrote:
>
>  
>
>>My design was to have the listener get both responses (what I call 
>>replies in the code) as well as events (data sent on exit of pid)
>>
>>    
>>
>
>I think i may not be doing justice explaining this, so let me be more
>elaborate so we can be in sync.
>Here is the classical way of doing things:
>
>- Assume several apps in user space and a target in the kernel (this
>could be reversed or combined in many ways, but the sake of
>simplicity/clarity make the above assumption).
>- suppose we have five user space apps A, B, C, D, E; these processes
>would typically do one of the following class of activities:
>
>a) configure (ADD/NEW/DEL etc). This is issued towards the kernel to
>set/create/delete/flush some scalar attribute or vector. These sorts of
>commands are synchronous. i.e you issue them, you expect a response
>(which may indicate success/failure etc). The response is unicast; the
>effect of what they affected may cause an event which may be multicast.
>
>b) query(GET). This is issued towards the kernel to query state of
>configured items. These class of commands are also synchronous. There
>are special cases of the query which dump everything in the target -
>literally called "dumps". The response is unicast.
>
>c) events. These are _asynchronous_ messages issued by the kernel to
>indicate some happening in the kernel. The event may be caused by #a
>above or any other activity in the kernel. Events are multicast.
>To receive them you have to register for the multicast group. You do so
>via sockets. You can register to many multicast group.
>
>For clarity again assume we have a multicast group where announcements
>of pids exiting is seen and C and D are registered to such a multicast
>group.
>Suppose process A exits. That would fall under #c above. C and D will be
>notified.
>Suppose B configures something in the kernel that forces the kernel to
>have process E exit and that such an operation is successful. B will get
>acknowledgement it succeeded (unicast). C and D will get notified
>(multicast). 
>Suppose C issued a GET to find details about a specific pid, then only C
>will get that info back (message is unicast).
>[A response message to a GET is typically designed to be the same as an
>ADD message i.e one should be able to take exactly the same message,
>change one or two things and shove back into the kernel to configure].
>Suppose D issued a GET with dump flag, then D will get the details of
>all pids (message is unicast).
>
>Is this clear? Is there more than the above you need?
>  
>
Thanks for the clarification of the usage model. While our needs are 
certainly much less complex,
it is useful to know the range of options.

>There are no hard rules on what you need to be multicasting and as an
>example you could send periodic(aka time based) samples from the kernel
>on a multicast channel and that would be received by all. It did seem
>odd that you want to have a semi-promiscous mode where a response to a
>GET is multicast. If that is still what you want to achieve, then you
>should.
>  
>
Ok, we'll probably switch to the classical mode you suggest since the 
"efficient processing"
rationale for choosing to operate in the semi-promiscous mode earlier 
can be overcome by
writing a multi-threaded userspace utility.

>>However, we could switch to the model you suggest and use a 
>>multithreaded send/receive userspace utility.
>>
>>    
>>
>
>This is more of the classical way of doing things. 
>
>
>  
>
>>>There is a recent netlink addition to make sure
>>>that events dont get sent if no listeners exist.
>>>genetlink needs to be extended. For now assume such a thing exists.
>>> 
>>>
>>>      
>>>
>>Ok. Will this addition work for both unicast and multicast modes ?
>>
>>    
>>
>
>If you never open a connection to the kernel, nothing will be generated
>towards user space. 
>There are other techniques to rate limit event generation as well (one
>such technique is a nagle-like algorithm used by xfrm).
>
>  
>
>>Will this be necessary ? Isn't genl_rcv_msg() going to return a -EOPNOTSUPP
>>automatically for us since we've not registered the command ?
>> 
>>    
>>
>
>Yes, please in your doc feedback remind me of this,
>
>  
>
>>>Also if you can provide feedback whether the doc i sent was any use
>>>and what wasnt clear etc.
>>> 
>>>
>>>      
>>>
>>Will do.
>>
>>    
>>
>
>also take a look at the excellent documentation Thomas Graf has put in
>the kernel for all the utilities for manipulating netlink messages and
>tell me if that should also be put in this doc (It is listed as a TODO).
>
>
>  
>
Ok.

Thanks,
Shailabh

>cheers,
>jamal
>
>  
>



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642

^ permalink raw reply

* Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
From: Ravikiran G Thirumalai @ 2006-03-08 22:25 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrew Morton, linux-kernel, davem, netdev, shai
In-Reply-To: <20060308211733.GA5410@kvack.org>

On Wed, Mar 08, 2006 at 04:17:33PM -0500, Benjamin LaHaise wrote:
> On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote:
> 
> Last time I checked, all the major architectures had efficient local_t 
> implementations.  Most of the RISC CPUs are able to do a load / store 
> conditional implementation that is the same cost (since memory barriers 
> tend to be explicite on powerpc).  So why not use it?

Then, for the batched percpu_counters, we could gain by using local_t only for 
the UP case. But we will have to have a new local_long_t implementation 
for that.  Do you think just one use case of local_long_t warrants for a new
set of apis?

Kiran

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox