* Re: [patch 0/6] mm: alloc_percpu and bigrefs [not found] <20050923062529.GA4209@localhost.localdomain> @ 2005-09-23 7:10 ` Andrew Morton 2005-09-23 7:17 ` David S. Miller ` (2 more replies) 0 siblings, 3 replies; 11+ messages in thread From: Andrew Morton @ 2005-09-23 7:10 UTC (permalink / raw) To: Ravikiran G Thirumalai; +Cc: davem, rusty, dipankar, linux-kernel (Added linux-kernel) Ravikiran G Thirumalai <kiran@scalex86.org> wrote: > > Hi Andrew, > This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice > refcount. This patchset improves tbench lo performance by 6% on a 8 way IBM > x445. I think we'd need more comprehensive benchmarks than this before adding large amounts of complex core code. We'd also need to know how much of any performance improvement was due to alloc_percpu versus bigrefs, please. Bigrefs look reasonably sane to me, but a whole new memory allocator is a big deal. Given that alloc_percpu() is already numa-aware, is that extra cross-node fetch and pointer hop really worth all that new code? The new version will have to do a tlb load (including a cross-node fetch) approximately as often as the old version will get a CPU cache miss on the percpu array, maybe? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton @ 2005-09-23 7:17 ` David S. Miller 2005-09-23 8:11 ` Rusty Russell 2005-09-23 9:03 ` Eric Dumazet 2005-09-23 18:40 ` Ravikiran G Thirumalai 2005-09-23 19:16 ` Christoph Lameter 2 siblings, 2 replies; 11+ messages in thread From: David S. Miller @ 2005-09-23 7:17 UTC (permalink / raw) To: akpm; +Cc: kiran, rusty, dipankar, linux-kernel From: Andrew Morton <akpm@osdl.org> Date: Fri, 23 Sep 2005 00:10:13 -0700 > Ravikiran G Thirumalai <kiran@scalex86.org> wrote: > > > > Hi Andrew, > > This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice > > refcount. This patchset improves tbench lo performance by 6% on a 8 way IBM > > x445. > > I think we'd need more comprehensive benchmarks than this before adding > large amounts of complex core code. I agree. tbench over loopback is about as far from real life as it gets. I'm still against expanding these networking datastructures with bigrefs just for this stuff. Some people have per-cpu and per-node on the brain, and it's starting to bloat things up a little bit too much. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 7:17 ` David S. Miller @ 2005-09-23 8:11 ` Rusty Russell 2005-09-23 11:36 ` Dipankar Sarma 2005-09-23 9:03 ` Eric Dumazet 1 sibling, 1 reply; 11+ messages in thread From: Rusty Russell @ 2005-09-23 8:11 UTC (permalink / raw) To: David S. Miller; +Cc: akpm, kiran, dipankar, linux-kernel On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote: > I'm still against expanding these networking datastructures with > bigrefs just for this stuff. Some people have per-cpu and per-node on > the brain, and it's starting to bloat things up a little bit too much. I think for net devices it actually makes sense; most of the time we are not trying to remove them, so the refcounting is simply overhead. We also don't alloc and free them very often. The size issue is not really an issue since we only map for each CPU, and even better: if a bigref allocation can't get per-cpu data it just degrades beautifully into a test and an atomic. Now, that said, I wanted (and wrote, way back when) a far simpler allocator which only worked for GFP_KERNEL and used the same __per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones. Maybe not as "complete" as this one, but maybe less offensive. Rusty. -- A bad analogy is like a leaky screwdriver -- Richard Braakman ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 8:11 ` Rusty Russell @ 2005-09-23 11:36 ` Dipankar Sarma 2005-09-23 18:48 ` Ravikiran G Thirumalai 0 siblings, 1 reply; 11+ messages in thread From: Dipankar Sarma @ 2005-09-23 11:36 UTC (permalink / raw) To: Rusty Russell; +Cc: David S. Miller, akpm, kiran, linux-kernel On Fri, Sep 23, 2005 at 06:11:30PM +1000, Rusty Russell wrote: > On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote: > > I'm still against expanding these networking datastructures with > > bigrefs just for this stuff. Some people have per-cpu and per-node on > > the brain, and it's starting to bloat things up a little bit too much. > > I think for net devices it actually makes sense; most of the time we are > not trying to remove them, so the refcounting is simply overhead. We > also don't alloc and free them very often. The size issue is not really > an issue since we only map for each CPU, and even better: if a bigref > allocation can't get per-cpu data it just degrades beautifully into a > test and an atomic. I agree, given that it is for a much smaller number of refcounted objects. > > Now, that said, I wanted (and wrote, way back when) a far simpler > allocator which only worked for GFP_KERNEL and used the same > __per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones. Maybe > not as "complete" as this one, but maybe less offensive. The GFP_ATOMIC support stuff is needed only for dst entries. However using per-cpu refcounters in such objects like dentries and dst entries are problematic and that is why I hadn't tried changing those. Some of the earlier versions of the allocator were simpler and I think we need to roll this back and do some more analysis. No GFP_ATOMIC support, no early use. I haven't got around to look at this for a while, but I will try. Thanks Dipankar ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 11:36 ` Dipankar Sarma @ 2005-09-23 18:48 ` Ravikiran G Thirumalai 0 siblings, 0 replies; 11+ messages in thread From: Ravikiran G Thirumalai @ 2005-09-23 18:48 UTC (permalink / raw) To: Dipankar Sarma; +Cc: Rusty Russell, David S. Miller, akpm, linux-kernel On Fri, Sep 23, 2005 at 05:06:36PM +0530, Dipankar Sarma wrote: > On Fri, Sep 23, 2005 at 06:11:30PM +1000, Rusty Russell wrote: > > On Fri, 2005-09-23 at 00:17 -0700, David S. Miller wrote: > > Now, that said, I wanted (and wrote, way back when) a far simpler > > allocator which only worked for GFP_KERNEL and used the same > > __per_cpu_offset[] to fixup dynamic per-cpu ptrs as static ones. Maybe > > not as "complete" as this one, but maybe less offensive. > > The GFP_ATOMIC support stuff is needed only for dst entries. However > using per-cpu refcounters in such objects like dentries and dst entries > are problematic and that is why I hadn't tried changing those. > Some of the earlier versions of the allocator were simpler and I think > we need to roll this back and do some more analysis. No GFP_ATOMIC > support, no early use. I haven't got around to look at this The patches I have submitted this time does not have GFP_ATOMIC support. The early use enablement stuff are in seperate patches (patch 5 and 6). The patchset would work without these two patches too. Thanks, Kiran ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 7:17 ` David S. Miller 2005-09-23 8:11 ` Rusty Russell @ 2005-09-23 9:03 ` Eric Dumazet 2005-09-23 9:35 ` Eric Dumazet 1 sibling, 1 reply; 11+ messages in thread From: Eric Dumazet @ 2005-09-23 9:03 UTC (permalink / raw) To: David S. Miller; +Cc: akpm, kiran, rusty, dipankar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2612 bytes --] David S. Miller a écrit : > From: Andrew Morton <akpm@osdl.org> > Date: Fri, 23 Sep 2005 00:10:13 -0700 > > >>Ravikiran G Thirumalai <kiran@scalex86.org> wrote: >> >>>Hi Andrew, >>>This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice >>>refcount. This patchset improves tbench lo performance by 6% on a 8 way IBM >>>x445. >> >>I think we'd need more comprehensive benchmarks than this before adding >>large amounts of complex core code. > > > I agree. tbench over loopback is about as far from real life > as it gets. > > I'm still against expanding these networking datastructures with > bigrefs just for this stuff. Some people have per-cpu and per-node on > the brain, and it's starting to bloat things up a little bit too much. Hi David I agree with you about bigref might be a bit of over-engineering for netdevice refcount. I sent one month ago a patch to reorder 'struct net_device' to at least make it SMP friendly, with no per-cpu or per-node additional memory but got no comment. It's important to place mostly read parts together, so that a cache lines can be shared by CPUS. It's important to separate hot fields in three different cache lines to that a cpu can work on the transmit while another is receiving frames without paying cache line ping pongs. Please (re)consider this patch since it really reduce CPU load and/or memory bus trafic between CPUS. I ask this question : - Is the following comment found in netdevice.h still true ? /* * This is the first field of the "visible" part of this structure * (i.e. as seen by users in the "Space.c" file). It is the name * the interface. */ If it is still true, then I suspect name_hlist cannot be moved like I did. Thank you - [NET] : Reorder some hot fields of struct net_device and place them on separate cache lines in SMP to lower memory bouncing between multiple CPU accessing the device. - One part is mostly used on receive path (including eth_type_trans()) (poll_list, poll, quota, weight, last_rx, dev_addr, broadcast) - One part is mostly used on queue transmit path (qdisc) (queue_lock, qdisc, qdisc_sleeping, qdisc_list, tx_queue_len) - One part is mostly used on xmit path (device) (xmit_lock, xmit_lock_owner, priv, hard_start_xmit, trans_start) 'features' is placed outside of these hot points, in a location that may be shared by all cpus (because mostly read) Minor but maybe not possible because of ABI : name_hlist is moved close to name[IFNAMSIZ] to speedup __dev_get_by_name() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: reorder_netdevice --] [-- Type: text/plain, Size: 6222 bytes --] --- linux-2.6.14-rc2/include/linux/netdevice.h 2005-09-20 05:00:41.000000000 +0200 +++ linux-2.6.14-rc2-ed/include/linux/netdevice.h 2005-09-23 10:52:43.000000000 +0200 @@ -265,6 +265,8 @@ * the interface. */ char name[IFNAMSIZ]; + /* device name hash chain */ + struct hlist_node name_hlist; /* * I/O specific fields @@ -292,6 +294,21 @@ /* ------- Fields preinitialized in Space.c finish here ------- */ + /* Net device features */ + unsigned long features; +#define NETIF_F_SG 1 /* Scatter/gather IO. */ +#define NETIF_F_IP_CSUM 2 /* Can checksum only TCP/UDP over IPv4. */ +#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */ +#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */ +#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */ +#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */ +#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */ +#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */ +#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */ +#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ +#define NETIF_F_TSO 2048 /* Can offload TCP/IP segmentation */ +#define NETIF_F_LLTX 4096 /* LockLess TX */ + struct net_device *next_sched; /* Interface index. Unique device identifier */ @@ -316,9 +333,6 @@ * will (read: may be cleaned up at will). */ - /* These may be needed for future network-power-down code. */ - unsigned long trans_start; /* Time (in jiffies) of last Tx */ - unsigned long last_rx; /* Time of last Rx */ unsigned short flags; /* interface flags (a la BSD) */ unsigned short gflags; @@ -328,15 +342,12 @@ unsigned mtu; /* interface MTU value */ unsigned short type; /* interface hardware type */ unsigned short hard_header_len; /* hardware hdr length */ - void *priv; /* pointer to private data */ struct net_device *master; /* Pointer to master device of a group, * which this device is member of. */ /* Interface address info. */ - unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */ - unsigned char dev_addr[MAX_ADDR_LEN]; /* hw address */ unsigned char perm_addr[MAX_ADDR_LEN]; /* permanent hw address */ unsigned char addr_len; /* hardware address length */ unsigned short dev_id; /* for shared network cards */ @@ -346,8 +357,6 @@ int promiscuity; int allmulti; - int watchdog_timeo; - struct timer_list watchdog_timer; /* Protocol specific pointers */ @@ -358,32 +367,62 @@ void *ec_ptr; /* Econet specific data */ void *ax25_ptr; /* AX.25 specific data */ - struct list_head poll_list; /* Link to poll list */ +/* + * Cache line mostly used on receive path (including eth_type_trans()) + */ + struct list_head poll_list ____cacheline_aligned_in_smp; + /* Link to poll list */ + + int (*poll) (struct net_device *dev, int *quota); int quota; int weight; + unsigned long last_rx; /* Time of last Rx */ + /* Interface address info used in eth_type_trans() */ + unsigned char dev_addr[MAX_ADDR_LEN]; /* hw address, (before bcast + because most packets are unicast) */ + + unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */ +/* + * Cache line mostly used on queue transmit path (qdisc) + */ + /* device queue lock */ + spinlock_t queue_lock ____cacheline_aligned_in_smp; struct Qdisc *qdisc; struct Qdisc *qdisc_sleeping; - struct Qdisc *qdisc_ingress; struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ /* ingress path synchronizer */ spinlock_t ingress_lock; + struct Qdisc *qdisc_ingress; + +/* + * One part is mostly used on xmit path (device) + */ /* hard_start_xmit synchronizer */ - spinlock_t xmit_lock; + spinlock_t xmit_lock ____cacheline_aligned_in_smp; /* cpu id of processor entered to hard_start_xmit or -1, if nobody entered there. */ int xmit_lock_owner; - /* device queue lock */ - spinlock_t queue_lock; + void *priv; /* pointer to private data */ + int (*hard_start_xmit) (struct sk_buff *skb, + struct net_device *dev); + /* These may be needed for future network-power-down code. */ + unsigned long trans_start; /* Time (in jiffies) of last Tx */ + + int watchdog_timeo; /* used by dev_watchdog() */ + struct timer_list watchdog_timer; + +/* + * refcnt is a very hot point, so align it on SMP + */ /* Number of references to this device */ - atomic_t refcnt; + atomic_t refcnt ____cacheline_aligned_in_smp; + /* delayed register/unregister */ struct list_head todo_list; - /* device name hash chain */ - struct hlist_node name_hlist; /* device index hash chain */ struct hlist_node index_hlist; @@ -396,21 +435,6 @@ NETREG_RELEASED, /* called free_netdev */ } reg_state; - /* Net device features */ - unsigned long features; -#define NETIF_F_SG 1 /* Scatter/gather IO. */ -#define NETIF_F_IP_CSUM 2 /* Can checksum only TCP/UDP over IPv4. */ -#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */ -#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */ -#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */ -#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */ -#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */ -#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */ -#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */ -#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ -#define NETIF_F_TSO 2048 /* Can offload TCP/IP segmentation */ -#define NETIF_F_LLTX 4096 /* LockLess TX */ - /* Called after device is detached from network. */ void (*uninit)(struct net_device *dev); /* Called after last user reference disappears. */ @@ -419,10 +443,7 @@ /* Pointers to interface service routines. */ int (*open)(struct net_device *dev); int (*stop)(struct net_device *dev); - int (*hard_start_xmit) (struct sk_buff *skb, - struct net_device *dev); #define HAVE_NETDEV_POLL - int (*poll) (struct net_device *dev, int *quota); int (*hard_header) (struct sk_buff *skb, struct net_device *dev, unsigned short type, ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 9:03 ` Eric Dumazet @ 2005-09-23 9:35 ` Eric Dumazet 0 siblings, 0 replies; 11+ messages in thread From: Eric Dumazet @ 2005-09-23 9:35 UTC (permalink / raw) To: akpm; +Cc: David S. Miller, kiran, rusty, dipankar, linux-kernel Andrew Morton a écrit : > Eric Dumazet <dada1@cosmosbay.com> wrote: > >> Please (re)consider this patch since it really reduce CPU load and/or memory >> bus trafic between CPUS. > > > Did you actually measure this? If so, you should publish the results. > > Hi Andrew Well yes I did. Here is part of the mail sent on netdev one month ago : Hi David I played and I have very good results with the following patches. # tc -s -d qdisc show dev eth0 ; sleep 10 ; tc -s -d qdisc show dev eth0 qdisc pfifo_fast 0: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 440173135511 bytes 3211378293 pkt (dropped 240817, overlimits 0 requeues 27028035) backlog 0b 0p requeues 27028035 qdisc pfifo_fast 0: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 440216655667 bytes 3211730812 pkt (dropped 240904, overlimits 0 requeues 27031668) backlog 0b 0p requeues 27031668 (So about 360 requeues per second, much better than before (12000 / second)) oprofile results give 0.6257 % qdisc_restart (instead of 2.6452 %) thread is archived here : http://marc.theaimsgroup.com/?l=linux-netdev&m=112521684415443&w=2 Thank you Eric ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton 2005-09-23 7:17 ` David S. Miller @ 2005-09-23 18:40 ` Ravikiran G Thirumalai 2005-09-23 18:50 ` David S. Miller 2005-09-23 19:16 ` Christoph Lameter 2 siblings, 1 reply; 11+ messages in thread From: Ravikiran G Thirumalai @ 2005-09-23 18:40 UTC (permalink / raw) To: Andrew Morton; +Cc: davem, rusty, dipankar, linux-kernel On Fri, Sep 23, 2005 at 12:10:13AM -0700, Andrew Morton wrote: > > (Added linux-kernel) > > Ravikiran G Thirumalai <kiran@scalex86.org> wrote: > > > > Hi Andrew, > > This patchset contains alloc_percpu + bigrefs + bigrefs for netdevice > > refcount. This patchset improves tbench lo performance by 6% on a 8 way IBM > > x445. > > I think we'd need more comprehensive benchmarks than this before adding > large amounts of complex core code. > > We'd also need to know how much of any performance improvement was due to > alloc_percpu versus bigrefs, please. > > Bigrefs look reasonably sane to me, but a whole new memory allocator is a > big deal. Given that alloc_percpu() is already numa-aware, is that extra > cross-node fetch and pointer hop really worth all that new code? The new > version will have to do a tlb load (including a cross-node fetch) > approximately as often as the old version will get a CPU cache miss on the > percpu array, maybe? > > The extra pointer dereference saving with the re-implementation saved 10 % with the dst changes. With just the net_device refcount, there was a difference of 2 % with the existing implementation. So 1 extra dereference does matter. That said, the re-implementation does have other advantages; 1. It avoids the memory bloat caused by the existing implementation as it does not pad all objects to cacheline size like the existing implementation does, This helps small objects like bigrefs. Additionally, it results in better cacheline utilization now as many per-cpu objects share the same cacheline. 2. It is hotplug friendly. In case you want to offline a node, you can easily free the per-cpu memory corresponding to that node. You cannot do that with the current scheme. 3. It can work early. We did some experimentation with struct vfsmount.mnt_count refcounter (It is per-mount point so it is not too many objects), and the old alloc_percpu refused to work early. Yes, now I guess there are workarounds to make cpu_possible_mask bits set for NR_CPUS early, but wouldn't that result in memory being allocated for cpus which can never be present? 4. With the re-implementation, it might be possible to make often used structures like cpu_pda truly node local with alloc_percpu.. As for the benchmarks, tbench on lo was used indicatively. lo performance does matter for quite a few benchmarks. There are apps which do use lo extensively. The dst and netdevice changes were made after profiling such real wold apps. Agreed, per-cpufication of objects which can go up in size is not the right approach on hindsight, but netdevice.refcount is not one of those. I can try running a standard mpi benchmark or some other indicative benchmark if that would help? Thanks, Kiran ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 18:40 ` Ravikiran G Thirumalai @ 2005-09-23 18:50 ` David S. Miller 2005-09-23 23:30 ` Ravikiran G Thirumalai 0 siblings, 1 reply; 11+ messages in thread From: David S. Miller @ 2005-09-23 18:50 UTC (permalink / raw) To: kiran; +Cc: akpm, rusty, dipankar, linux-kernel From: Ravikiran G Thirumalai <kiran@scalex86.org> Date: Fri, 23 Sep 2005 11:40:52 -0700 > As for the benchmarks, tbench on lo was used indicatively. lo performance > does matter for quite a few benchmarks. There are apps which do use lo > extensively. The dst and netdevice changes were made after profiling such > real wold apps. Agreed, per-cpufication of objects which can go up in > size is not the right approach on hindsight, but netdevice.refcount is not > one of those. I can try running a standard mpi benchmark or some > other indicative benchmark if that would help? I worry about real life sites, such as a big web server, that will by definition have hundreds of thousands of routing cache (and thus 'dst') entries active. The memory usage will increase, and that's particularly bad in this kind of case because unlike the 'lo' benchmarks you won't have nodes and cpus fighting over access to the same routes. In such a load the bigrefs are just wasted memory and aren't actually needed. I really would like to encourage a move away from this fascination with optimizating the loopback interface performance on enormous systems, yes even if it is hit hard by the benchmarks. It just means the benchmarks are wrong, not that we should optimize for them. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 18:50 ` David S. Miller @ 2005-09-23 23:30 ` Ravikiran G Thirumalai 0 siblings, 0 replies; 11+ messages in thread From: Ravikiran G Thirumalai @ 2005-09-23 23:30 UTC (permalink / raw) To: David S. Miller; +Cc: akpm, rusty, dipankar, linux-kernel On Fri, Sep 23, 2005 at 11:50:58AM -0700, David S. Miller wrote: > From: Ravikiran G Thirumalai <kiran@scalex86.org> > Date: Fri, 23 Sep 2005 11:40:52 -0700 > > > I worry about real life sites, such as a big web server, that will by > definition have hundreds of thousands of routing cache (and thus > 'dst') entries active. > > The memory usage will increase, and that's particularly bad in this > kind of case because unlike the 'lo' benchmarks you won't have nodes > and cpus fighting over access to the same routes. In such a load > the bigrefs are just wasted memory and aren't actually needed. Point taken. That is the reason I have excluded the dst patches in this patchset. The problem with dst.__refcount stays and we should probably look for some other approach rather than thinking per-cpu/per-node counters for this. But the patch in question now is net_device refcount. Surely that doesn't affect webservers with many dst entries. > > I really would like to encourage a move away from this fascination > with optimizating the loopback interface performance on enormous > systems, yes even if it is hit hard by the benchmarks. It just > means the benchmarks are wrong, not that we should optimize for > them. Benchmarks could be wrong, but we don't control what people run. And there are apps which use lo (for whatever reason) :(. Thanks, Kiran ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch 0/6] mm: alloc_percpu and bigrefs 2005-09-23 7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton 2005-09-23 7:17 ` David S. Miller 2005-09-23 18:40 ` Ravikiran G Thirumalai @ 2005-09-23 19:16 ` Christoph Lameter 2 siblings, 0 replies; 11+ messages in thread From: Christoph Lameter @ 2005-09-23 19:16 UTC (permalink / raw) To: Andrew Morton Cc: Ravikiran G Thirumalai, davem, rusty, dipankar, linux-kernel On Fri, 23 Sep 2005, Andrew Morton wrote: > big deal. Given that alloc_percpu() is already numa-aware, is that extra > cross-node fetch and pointer hop really worth all that new code? The new > version will have to do a tlb load (including a cross-node fetch) > approximately as often as the old version will get a CPU cache miss on the > percpu array, maybe? The current alloc_percpu() is problematic because it has to allocate entries for all cpus even those who are not online yet. There is no way to track alloc_percpu entries and therefore no possibility of adding an entry for a processor if one comes online or for removing one when a processor goes away. An additional complication is the allocation of per cpu entries for processors whose memory node are not online. The current implementation will fall back on a unspecific allocation but this means that the placement of the per cpu entries will not be on the node of the processor! ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2005-09-23 23:30 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20050923062529.GA4209@localhost.localdomain>
2005-09-23 7:10 ` [patch 0/6] mm: alloc_percpu and bigrefs Andrew Morton
2005-09-23 7:17 ` David S. Miller
2005-09-23 8:11 ` Rusty Russell
2005-09-23 11:36 ` Dipankar Sarma
2005-09-23 18:48 ` Ravikiran G Thirumalai
2005-09-23 9:03 ` Eric Dumazet
2005-09-23 9:35 ` Eric Dumazet
2005-09-23 18:40 ` Ravikiran G Thirumalai
2005-09-23 18:50 ` David S. Miller
2005-09-23 23:30 ` Ravikiran G Thirumalai
2005-09-23 19:16 ` Christoph Lameter
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.