Re: [patch 0/6] mm: alloc_percpu and bigrefs

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [patch 0/6] mm: alloc_percpu and  bigrefs
       [not found] <20050923062529.GA4209@localhost.localdomain>
@ 2005-09-23  7:10 ` Andrew Morton
  2005-09-23  7:17   ` David S. Miller
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andrew Morton @ 2005-09-23  7:10 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: davem, rusty, dipankar, linux-kernel

(Added linux-kernel)

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> Hi Andrew,
> This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice
> refcount.  This patchset improves tbench lo performance by 6% on a 8 way IBM
> x445.

I think we'd need more comprehensive benchmarks than this before adding
large amounts of complex core code.

We'd also need to know how much of any performance improvement was due to
alloc_percpu versus bigrefs, please.

Bigrefs look reasonably sane to me, but a whole new memory allocator is a
big deal.  Given that alloc_percpu() is already numa-aware, is that extra
cross-node fetch and pointer hop really worth all that new code?  The new
version will have to do a tlb load (including a cross-node fetch)
approximately as often as the old version will get a CPU cache miss on the
percpu array, maybe?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23  7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton
@ 2005-09-23  7:17   ` David S. Miller
  2005-09-23  8:11     ` Rusty Russell
  2005-09-23  9:03     ` Eric Dumazet
  2005-09-23 18:40   ` Ravikiran G Thirumalai
  2005-09-23 19:16   ` Christoph Lameter
  2 siblings, 2 replies; 11+ messages in thread
From: David S. Miller @ 2005-09-23  7:17 UTC (permalink / raw)
  To: akpm; +Cc: kiran, rusty, dipankar, linux-kernel

From: Andrew Morton <akpm@osdl.org>
Date: Fri, 23 Sep 2005 00:10:13 -0700

> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >
> > Hi Andrew,
> > This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice
> > refcount.  This patchset improves tbench lo performance by 6% on a 8 way IBM
> > x445.
> 
> I think we'd need more comprehensive benchmarks than this before adding
> large amounts of complex core code.

I agree.  tbench over loopback is about as far from real life
as it gets.

I'm still against expanding these networking datastructures with
bigrefs just for this stuff.  Some people have per-cpu and per-node on
the brain, and it's starting to bloat things up a little bit too much.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23  7:17   ` David S. Miller
@ 2005-09-23  8:11     ` Rusty Russell
  2005-09-23 11:36       ` Dipankar Sarma
  2005-09-23  9:03     ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Rusty Russell @ 2005-09-23  8:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, kiran, dipankar, linux-kernel

On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote:
> I'm still against expanding these networking datastructures with
> bigrefs just for this stuff.  Some people have per-cpu and per-node on
> the brain, and it's starting to bloat things up a little bit too much.

I think for net devices it actually makes sense; most of the time we are
not trying to remove them, so the refcounting is simply overhead.  We
also don't alloc and free them very often.  The size issue is not really
an issue since we only map for each CPU, and even better: if a bigref
allocation can't get per-cpu data it just degrades beautifully into a
test and an atomic.

Now, that said, I wanted (and wrote, way back when) a far simpler
allocator which only worked for GFP_KERNEL and used the same
__per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones.  Maybe
not as "complete" as this one, but maybe less offensive.

Rusty.
-- 
A bad analogy is like a leaky screwdriver -- Richard Braakman

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23  8:11     ` Rusty Russell
@ 2005-09-23 11:36       ` Dipankar Sarma
  2005-09-23 18:48         ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 11+ messages in thread
From: Dipankar Sarma @ 2005-09-23 11:36 UTC (permalink / raw)
  To: Rusty Russell; +Cc: David S. Miller, akpm, kiran, linux-kernel

On Fri, Sep 23, 2005 at 06:11:30PM +1000, Rusty Russell wrote:
> On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote:
> > I'm still against expanding these networking datastructures with
> > bigrefs just for this stuff.  Some people have per-cpu and per-node on
> > the brain, and it's starting to bloat things up a little bit too much.
> 
> I think for net devices it actually makes sense; most of the time we are
> not trying to remove them, so the refcounting is simply overhead.  We
> also don't alloc and free them very often.  The size issue is not really
> an issue since we only map for each CPU, and even better: if a bigref
> allocation can't get per-cpu data it just degrades beautifully into a
> test and an atomic.

I agree, given that it is for a much smaller number of refcounted
objects.

> 
> Now, that said, I wanted (and wrote, way back when) a far simpler
> allocator which only worked for GFP_KERNEL and used the same
> __per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones.  Maybe
> not as "complete" as this one, but maybe less offensive.

The GFP_ATOMIC support stuff is needed only for dst entries. However
using per-cpu refcounters in such objects like dentries and dst entries
are problematic and that is why I hadn't tried changing those.
Some of the earlier versions of the allocator were simpler and I think
we need to roll this back and do some more analysis. No GFP_ATOMIC
support, no early use. I haven't got around to look at this 
for a while, but I will try.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23 11:36       ` Dipankar Sarma
@ 2005-09-23 18:48         ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 11+ messages in thread
From: Ravikiran G Thirumalai @ 2005-09-23 18:48 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Rusty Russell, David S. Miller, akpm, linux-kernel

On Fri, Sep 23, 2005 at 05:06:36PM +0530, Dipankar Sarma wrote:
> On Fri, Sep 23, 2005 at 06:11:30PM +1000, Rusty Russell wrote:
> > On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote:
> > Now, that said, I wanted (and wrote, way back when) a far simpler
> > allocator which only worked for GFP_KERNEL and used the same
> > __per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones.  Maybe
> > not as "complete" as this one, but maybe less offensive.
> 
> The GFP_ATOMIC support stuff is needed only for dst entries. However
> using per-cpu refcounters in such objects like dentries and dst entries
> are problematic and that is why I hadn't tried changing those.
> Some of the earlier versions of the allocator were simpler and I think
> we need to roll this back and do some more analysis. No GFP_ATOMIC
> support, no early use. I haven't got around to look at this 

The patches I have submitted this time does not have GFP_ATOMIC support. The
early use enablement stuff are in seperate patches  (patch 5 and 6).  The
patchset would work without these two patches too.

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23  7:17   ` David S. Miller
  2005-09-23  8:11     ` Rusty Russell
@ 2005-09-23  9:03     ` Eric Dumazet
  2005-09-23  9:35       ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2005-09-23  9:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, kiran, rusty, dipankar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

David S. Miller a écrit :
> From: Andrew Morton <akpm@osdl.org>
> Date: Fri, 23 Sep 2005 00:10:13 -0700
> 
> 
>>Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>>
>>>Hi Andrew,
>>>This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice
>>>refcount.  This patchset improves tbench lo performance by 6% on a 8 way IBM
>>>x445.
>>
>>I think we'd need more comprehensive benchmarks than this before adding
>>large amounts of complex core code.
> 
> 
> I agree.  tbench over loopback is about as far from real life
> as it gets.
> 
> I'm still against expanding these networking datastructures with
> bigrefs just for this stuff.  Some people have per-cpu and per-node on
> the brain, and it's starting to bloat things up a little bit too much.

Hi David

I agree with you about bigref might be a bit of over-engineering for netdevice 
refcount.

I sent one month ago a patch to reorder 'struct net_device' to at least make 
it SMP friendly, with no per-cpu or per-node additional memory but got no comment.

It's important to place mostly read parts together, so that a cache lines can 
be shared by CPUS. It's important to separate hot fields in three different 
cache lines to that a cpu can work on the transmit while another is receiving 
frames without paying cache line ping pongs.

Please (re)consider this patch since it really reduce CPU load and/or memory 
bus trafic between CPUS.

I ask this question :
  - Is the following comment found in netdevice.h still true ?

/*
  * This is the first field of the "visible" part of this structure
  * (i.e. as seen by users in the "Space.c" file).  It is the name
  * the interface.
  */

If it is still true, then I suspect name_hlist cannot be moved like I did.


Thank you


- [NET] : Reorder some hot fields of struct net_device and place them on 
separate cache lines in SMP to lower memory bouncing between multiple CPU 
accessing the device.

     - One part is mostly used on receive path (including eth_type_trans())
      (poll_list, poll, quota, weight, last_rx, dev_addr, broadcast)

     - One part is mostly used on queue transmit path (qdisc)
      (queue_lock, qdisc, qdisc_sleeping, qdisc_list, tx_queue_len)

     - One part is mostly used on xmit path (device)
      (xmit_lock, xmit_lock_owner, priv, hard_start_xmit, trans_start)

'features' is placed outside of these hot points, in a location that may be 
shared by all cpus (because mostly read)

Minor but maybe not possible because of ABI :
	name_hlist is moved close to name[IFNAMSIZ] to speedup __dev_get_by_name()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: reorder_netdevice --]
[-- Type: text/plain, Size: 6222 bytes --]

--- linux-2.6.14-rc2/include/linux/netdevice.h	2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/include/linux/netdevice.h	2005-09-23 10:52:43.000000000 +0200
@@ -265,6 +265,8 @@
 	 * the interface.
 	 */
 	char			name[IFNAMSIZ];
+	/* device name hash chain */
+	struct hlist_node	name_hlist;
 
 	/*
 	 *	I/O specific fields
@@ -292,6 +294,21 @@
 
 	/* ------- Fields preinitialized in Space.c finish here ------- */
 
+	/* Net device features */
+	unsigned long		features;
+#define NETIF_F_SG		1	/* Scatter/gather IO. */
+#define NETIF_F_IP_CSUM		2	/* Can checksum only TCP/UDP over IPv4. */
+#define NETIF_F_NO_CSUM		4	/* Does not require checksum. F.e. loopack. */
+#define NETIF_F_HW_CSUM		8	/* Can checksum all the packets. */
+#define NETIF_F_HIGHDMA		32	/* Can DMA to high memory. */
+#define NETIF_F_FRAGLIST	64	/* Scatter/gather IO. */
+#define NETIF_F_HW_VLAN_TX	128	/* Transmit VLAN hw acceleration */
+#define NETIF_F_HW_VLAN_RX	256	/* Receive VLAN hw acceleration */
+#define NETIF_F_HW_VLAN_FILTER	512	/* Receive filtering on VLAN */
+#define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
+#define NETIF_F_TSO		2048	/* Can offload TCP/IP segmentation */
+#define NETIF_F_LLTX		4096	/* LockLess TX */
+
 	struct net_device	*next_sched;
 
 	/* Interface index. Unique device identifier	*/
@@ -316,9 +333,6 @@
 	 * will (read: may be cleaned up at will).
 	 */
 
-	/* These may be needed for future network-power-down code. */
-	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
-	unsigned long		last_rx;	/* Time of last Rx	*/
 
 	unsigned short		flags;	/* interface flags (a la BSD)	*/
 	unsigned short		gflags;
@@ -328,15 +342,12 @@
 	unsigned		mtu;	/* interface MTU value		*/
 	unsigned short		type;	/* interface hardware type	*/
 	unsigned short		hard_header_len;	/* hardware hdr length	*/
-	void			*priv;	/* pointer to private data	*/
 
 	struct net_device	*master; /* Pointer to master device of a group,
 					  * which this device is member of.
 					  */
 
 	/* Interface address info. */
-	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
-	unsigned char		dev_addr[MAX_ADDR_LEN];	/* hw address	*/
 	unsigned char		perm_addr[MAX_ADDR_LEN]; /* permanent hw address */
 	unsigned char		addr_len;	/* hardware address length	*/
 	unsigned short          dev_id;		/* for shared network cards */
@@ -346,8 +357,6 @@
 	int			promiscuity;
 	int			allmulti;
 
-	int			watchdog_timeo;
-	struct timer_list	watchdog_timer;
 
 	/* Protocol specific pointers */
 	
@@ -358,32 +367,62 @@
 	void			*ec_ptr;	/* Econet specific data	*/
 	void			*ax25_ptr;	/* AX.25 specific data */
 
-	struct list_head	poll_list;	/* Link to poll list	*/
+/*
+ * Cache line mostly used on receive path (including eth_type_trans())
+ */
+	struct list_head	poll_list ____cacheline_aligned_in_smp;
+					/* Link to poll list	*/
+
+	int			(*poll) (struct net_device *dev, int *quota);
 	int			quota;
 	int			weight;
+	unsigned long		last_rx;	/* Time of last Rx	*/
+	/* Interface address info used in eth_type_trans() */
+	unsigned char		dev_addr[MAX_ADDR_LEN];	/* hw address, (before bcast 
+							because most packets are unicast) */
+
+	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
 
+/*
+ * Cache line mostly used on queue transmit path (qdisc)
+ */
+	/* device queue lock */
+	spinlock_t		queue_lock ____cacheline_aligned_in_smp;
 	struct Qdisc		*qdisc;
 	struct Qdisc		*qdisc_sleeping;
-	struct Qdisc		*qdisc_ingress;
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
+	struct Qdisc		*qdisc_ingress;
+
+/*
+ * One part is mostly used on xmit path (device)
+ */
 	/* hard_start_xmit synchronizer */
-	spinlock_t		xmit_lock;
+	spinlock_t		xmit_lock ____cacheline_aligned_in_smp;
 	/* cpu id of processor entered to hard_start_xmit or -1,
 	   if nobody entered there.
 	 */
 	int			xmit_lock_owner;
-	/* device queue lock */
-	spinlock_t		queue_lock;
+	void			*priv;	/* pointer to private data	*/
+	int			(*hard_start_xmit) (struct sk_buff *skb,
+						    struct net_device *dev);
+	/* These may be needed for future network-power-down code. */
+	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
+
+	int			watchdog_timeo; /* used by dev_watchdog() */
+	struct timer_list	watchdog_timer;
+
+/*
+ * refcnt is a very hot point, so align it on SMP
+ */
 	/* Number of references to this device */
-	atomic_t		refcnt;
+	atomic_t		refcnt ____cacheline_aligned_in_smp;
+
 	/* delayed register/unregister */
 	struct list_head	todo_list;
-	/* device name hash chain */
-	struct hlist_node	name_hlist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -396,21 +435,6 @@
 	       NETREG_RELEASED,		/* called free_netdev */
 	} reg_state;
 
-	/* Net device features */
-	unsigned long		features;
-#define NETIF_F_SG		1	/* Scatter/gather IO. */
-#define NETIF_F_IP_CSUM		2	/* Can checksum only TCP/UDP over IPv4. */
-#define NETIF_F_NO_CSUM		4	/* Does not require checksum. F.e. loopack. */
-#define NETIF_F_HW_CSUM		8	/* Can checksum all the packets. */
-#define NETIF_F_HIGHDMA		32	/* Can DMA to high memory. */
-#define NETIF_F_FRAGLIST	64	/* Scatter/gather IO. */
-#define NETIF_F_HW_VLAN_TX	128	/* Transmit VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_RX	256	/* Receive VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_FILTER	512	/* Receive filtering on VLAN */
-#define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
-#define NETIF_F_TSO		2048	/* Can offload TCP/IP segmentation */
-#define NETIF_F_LLTX		4096	/* LockLess TX */
-
 	/* Called after device is detached from network. */
 	void			(*uninit)(struct net_device *dev);
 	/* Called after last user reference disappears. */
@@ -419,10 +443,7 @@
 	/* Pointers to interface service routines.	*/
 	int			(*open)(struct net_device *dev);
 	int			(*stop)(struct net_device *dev);
-	int			(*hard_start_xmit) (struct sk_buff *skb,
-						    struct net_device *dev);
 #define HAVE_NETDEV_POLL
-	int			(*poll) (struct net_device *dev, int *quota);
 	int			(*hard_header) (struct sk_buff *skb,
 						struct net_device *dev,
 						unsigned short type,

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23  9:03     ` Eric Dumazet
@ 2005-09-23  9:35       ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2005-09-23  9:35 UTC (permalink / raw)
  To: akpm; +Cc: David S. Miller, kiran, rusty, dipankar, linux-kernel

Andrew Morton a écrit :
 > Eric Dumazet <dada1@cosmosbay.com> wrote:
 >
 >> Please (re)consider this patch since it really reduce CPU load and/or memory
 >> bus trafic between CPUS.
 >
 >
 > Did you actually measure this?   If so, you should publish the results.
 >
 >

Hi Andrew

Well yes I did.

Here is part of the mail sent on netdev one month ago :



Hi David

I played and I have very good results with the following patches.

# tc -s -d qdisc show dev eth0 ; sleep 10 ; tc -s -d qdisc show dev eth0
qdisc pfifo_fast 0: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  Sent 440173135511 bytes 3211378293 pkt (dropped 240817, overlimits 0 
requeues 27028035)
  backlog 0b 0p requeues 27028035
qdisc pfifo_fast 0: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  Sent 440216655667 bytes 3211730812 pkt (dropped 240904, overlimits 0 
requeues 27031668)
  backlog 0b 0p requeues 27031668

(So about 360 requeues per second, much better than before (12000 / second))

oprofile results give
0.6257  %    qdisc_restart  (instead of 2.6452 %)


thread is archived here :

http://marc.theaimsgroup.com/?l=linux-netdev&m=112521684415443&w=2

Thank you

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and  bigrefs
  2005-09-23  7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton
  2005-09-23  7:17   ` David S. Miller
@ 2005-09-23 18:40   ` Ravikiran G Thirumalai
  2005-09-23 18:50     ` David S. Miller
  2005-09-23 19:16   ` Christoph Lameter
  2 siblings, 1 reply; 11+ messages in thread
From: Ravikiran G Thirumalai @ 2005-09-23 18:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davem, rusty, dipankar, linux-kernel

On Fri, Sep 23, 2005 at 12:10:13AM -0700, Andrew Morton wrote:
> 
> (Added linux-kernel)
> 
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >
> > Hi Andrew,
> > This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice
> > refcount.  This patchset improves tbench lo performance by 6% on a 8 way IBM
> > x445.
> 
> I think we'd need more comprehensive benchmarks than this before adding
> large amounts of complex core code.
> 
> We'd also need to know how much of any performance improvement was due to
> alloc_percpu versus bigrefs, please.
> 
> Bigrefs look reasonably sane to me, but a whole new memory allocator is a
> big deal.  Given that alloc_percpu() is already numa-aware, is that extra
> cross-node fetch and pointer hop really worth all that new code?  The new
> version will have to do a tlb load (including a cross-node fetch)
> approximately as often as the old version will get a CPU cache miss on the
> percpu array, maybe?
> 
> 

The extra pointer dereference saving with the re-implementation saved 10 %
with the dst changes.  With just the net_device refcount, there was a
difference of 2 % with the existing implementation.  So 1 extra dereference
does matter.

That said, the re-implementation does have other advantages;
1. It avoids the memory bloat caused by the existing implementation as it does 
not pad all objects to cacheline size like the existing implementation
does,  This helps small objects like bigrefs.  Additionally, it results in
better cacheline utilization now as many per-cpu objects  share the same
cacheline.
2. It is hotplug friendly.  In case you want to offline a node, you can
easily free the per-cpu memory corresponding to that node.  You cannot do
that with the current scheme.
3. It can work early.  We did some experimentation with struct
vfsmount.mnt_count refcounter (It is per-mount point so it is not too many
objects), and the old alloc_percpu refused to work early.  Yes, now I guess
there are workarounds to make cpu_possible_mask bits set for NR_CPUS early,
but wouldn't that result in memory being allocated for cpus which can
never be present?
4. With the re-implementation, it might be possible to make often used 
structures like cpu_pda truly node local with alloc_percpu..

As for the benchmarks, tbench on lo was used indicatively.  lo performance
does matter for quite a few benchmarks. There are apps which do use lo 
extensively.  The dst and netdevice changes were made after profiling such 
real wold apps.  Agreed, per-cpufication of objects which can go up in 
size is not the right approach on hindsight, but netdevice.refcount is not 
one of those.  I can try running a standard mpi benchmark or some
other indicative benchmark if that would help?

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23 18:40   ` Ravikiran G Thirumalai
@ 2005-09-23 18:50     ` David S. Miller
  2005-09-23 23:30       ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 11+ messages in thread
From: David S. Miller @ 2005-09-23 18:50 UTC (permalink / raw)
  To: kiran; +Cc: akpm, rusty, dipankar, linux-kernel

From: Ravikiran G Thirumalai <kiran@scalex86.org>
Date: Fri, 23 Sep 2005 11:40:52 -0700

> As for the benchmarks, tbench on lo was used indicatively.  lo performance
> does matter for quite a few benchmarks. There are apps which do use lo 
> extensively.  The dst and netdevice changes were made after profiling such 
> real wold apps.  Agreed, per-cpufication of objects which can go up in 
> size is not the right approach on hindsight, but netdevice.refcount is not 
> one of those.  I can try running a standard mpi benchmark or some
> other indicative benchmark if that would help?

I worry about real life sites, such as a big web server, that will by
definition have hundreds of thousands of routing cache (and thus
'dst') entries active.

The memory usage will increase, and that's particularly bad in this
kind of case because unlike the 'lo' benchmarks you won't have nodes
and cpus fighting over access to the same routes.  In such a load
the bigrefs are just wasted memory and aren't actually needed.

I really would like to encourage a move away from this fascination
with optimizating the loopback interface performance on enormous
systems, yes even if it is hit hard by the benchmarks.  It just
means the benchmarks are wrong, not that we should optimize for
them.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and bigrefs
  2005-09-23 18:50     ` David S. Miller
@ 2005-09-23 23:30       ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 11+ messages in thread
From: Ravikiran G Thirumalai @ 2005-09-23 23:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, rusty, dipankar, linux-kernel

On Fri, Sep 23, 2005 at 11:50:58AM -0700, David S. Miller wrote:
> From: Ravikiran G Thirumalai <kiran@scalex86.org>
> Date: Fri, 23 Sep 2005 11:40:52 -0700
> 
> 
> I worry about real life sites, such as a big web server, that will by
> definition have hundreds of thousands of routing cache (and thus
> 'dst') entries active.
> 
> The memory usage will increase, and that's particularly bad in this
> kind of case because unlike the 'lo' benchmarks you won't have nodes
> and cpus fighting over access to the same routes.  In such a load
> the bigrefs are just wasted memory and aren't actually needed.

Point taken. That is the reason I have excluded the dst patches in this
patchset.  The problem with dst.__refcount stays and we should probably look
for some other approach rather than thinking per-cpu/per-node counters for 
this. 
But the patch in question now is net_device refcount. Surely that doesn't
affect webservers with many dst entries.

> 
> I really would like to encourage a move away from this fascination
> with optimizating the loopback interface performance on enormous
> systems, yes even if it is hit hard by the benchmarks.  It just
> means the benchmarks are wrong, not that we should optimize for
> them.

Benchmarks could be wrong, but we don't control what people run. 
And there are apps which use lo (for whatever reason) :(. 

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [patch 0/6] mm: alloc_percpu and  bigrefs
  2005-09-23  7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton
  2005-09-23  7:17   ` David S. Miller
  2005-09-23 18:40   ` Ravikiran G Thirumalai
@ 2005-09-23 19:16   ` Christoph Lameter
  2 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2005-09-23 19:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ravikiran G Thirumalai, davem, rusty, dipankar, linux-kernel

On Fri, 23 Sep 2005, Andrew Morton wrote:

> big deal.  Given that alloc_percpu() is already numa-aware, is that extra
> cross-node fetch and pointer hop really worth all that new code?  The new
> version will have to do a tlb load (including a cross-node fetch)
> approximately as often as the old version will get a CPU cache miss on the
> percpu array, maybe?

The current alloc_percpu() is problematic because it has to allocate 
entries for all cpus even those who are not online yet. There is no way to 
track alloc_percpu entries and therefore no possibility of adding an entry 
for a processor if one comes online or for removing one when a processor 
goes away.

An additional complication is the allocation of per cpu entries for 
processors whose memory node are not online. The current 
implementation will fall back on a unspecific allocation but this means 
that the placement of the per cpu entries will not be on the node of the 
processor!

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2005-09-23 23:30 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20050923062529.GA4209@localhost.localdomain>
2005-09-23  7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton
2005-09-23  7:17   ` David S. Miller
2005-09-23  8:11     ` Rusty Russell
2005-09-23 11:36       ` Dipankar Sarma
2005-09-23 18:48         ` Ravikiran G Thirumalai
2005-09-23  9:03     ` Eric Dumazet
2005-09-23  9:35       ` Eric Dumazet
2005-09-23 18:40   ` Ravikiran G Thirumalai
2005-09-23 18:50     ` David S. Miller
2005-09-23 23:30       ` Ravikiran G Thirumalai
2005-09-23 19:16   ` Christoph Lameter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.