* [RFC] Dynamic percpu data allocator
@ 2002-05-23 13:08 Dipankar Sarma
2002-05-24 4:37 ` BALBIR SINGH
0 siblings, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-23 13:08 UTC (permalink / raw)
To: linux-kernel; +Cc: Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]
If static percpu area is around, can dynamic percpu data allocator
be far behind ;-)
As a part of scalable kernel primitives work for higher-end SMP
and NUMA architectures, we have been seeing increasing need
for per-cpu data in various key areas. Rusty's percpu area
work has added a way in 2.5 kernels to maintain static per-cpu
data. Inspired by that work, I have implemented a dynamic per-cpu
data allocator. Currently it is useful to us for -
1. Per-cpu data in dynamically allocated structures.
2. Per-cpu statistics and reference counters.
3. Per-cpu data in drivers/modules.
4. Scalable locking primitives like local spin only locks
(or even big reader locks).
Included in this mail is a document that describes the allocator.
I would really appreciate it if people commented on it. I am
particularly interested in the eek-value of the interfaces,
especially the bit about keeping the type information in
a dummy variable in a union.
The actual patch will follow soon, unless someone convinces
me quickly that there is a saner way to do this.
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
[-- Attachment #2: percpu_data.txt --]
[-- Type: text/plain, Size: 4302 bytes --]
Per-CPU Data Allocator
----------------------
Interfaces
----------
The interfaces for the per-cpu data allocator are similar to Rusty's static
per-CPU data interfaces. One clear goal was to make sure that they
add no overhead in UP kernels, so for UP kernels they
reduce to ordinary variables. The basic interfaces are these -
1. percpu_data_declare(type,var)
2. percpu_data_alloc(var)
3. percpu_data(var,cpu)
4. this_percpu_data(var)
5. percpu_data_free(var)
For example, we can declare the following structure -
struct yy {
int a;
percpu_data_declare(int, b);
int c;
int d;
};
We can allocate memory for the per-cpu int like this -
struct yy y;
if (percpu_data_alloc(y.b)) {
/* Failed */
}
To use it -
cpu = smp_processor_id();
percpu_data(y.b, cpu)++;
or
this_percpu_data(y.b)++;
To free the per-CPU data
percpu_data_free(y.b);
The data declaration interface is a bit unnatural, but I can't think of
anything better that would let me preserve the type information for the
original variable so that appropriate typecasting can be done for other
interfaces. percpu_data_declare(type,var) expands to -
union {
percpu_data_t *percpu;
typeof(type) realtype;
} var
percpu_data_t maintains the pointers necessary to lookup the real
percpu data.
typedef struct {
void *blkaddrs[NR_CPUS];
struct percpu_data_blk *blkp;
} percpu_data_t;
The type information is used (using typeof()) to
typecast data accesses.
#define percpu_data(var,cpu) \
(*((typeof(var.realtype) *)var.percpu->blkaddrs[cpu]))
Using a pointer to percpu_data_t adds
an overhead of an additional memory reference while accessing
percpu data. This can be avoided by embedding the percpu_data_t
structure, but since percpu_data_t has an NR_CPUS array, it changes
structure sizes very radically. It is a tradeoff, we could go either
way.
Potential Uses
--------------
1. Scalable counters - they already use a scaled down version of the
allocator inside. Per-cpu counters can reduce the overhead of cacheline
bouncing.
2. Big reader lock - this need not be statically allocated anymore.
3. Per-CPU data in modules - Rusty's static per-cpu scheme
doesn't work in modules, at least I haven't seen a way to do this.
4. Scalable locks - per-cpu data is commonly used in scalable locks
like MCS locks.
Allocator
---------
The current approach is that unless there is interest in a dynamic
percpu data allocator, there is no point in spending too much time
writing a sophisticated one.
Allocation Policy
-----------------
1. If the allocation request size is a factor of SMP_CACHE_BYTES,
then it will be interleaved to avoid fragmentation as much as possible.
If the request size is a multiple of SMP_CACHE_BYTES, fragmentation
will still be avoided. Anything else will result in fragmentation.
The current allocator doesn't make any attempt to use the fragmented
portion; in a sense it is like padding to the cache line boundary.
2. A simple binary search tree is used to maintain the memory
objects (could be blocks of them) of different sizes. For
interleaving, the objects are maintained in blocks and a freelist
mechanism similar to the slab allocator's is used to allocate
objects from within the blocks. Each block is allocated from a kmem_cache.
If there is sufficient interest in the per-cpu data allocator,
then I will revisit the allocator and see if fragmentation can
be reduced for non-multiples/non-factors of SMP_CACHE_BYTES.
For non-factor allocations, the residual part of the cache line
can be maintained and a best-factor-fit algorithm can be used
to allocate from it. This makes an assumption that kernel allocation
requests are likely to contain repetitive patterns of similar sizes.
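For concreteness, the interleaving arithmetic for factor-of-cache-line sizes can be sketched as follows. The one-cache-line-per-CPU block layout and the constants are assumptions for illustration, not the actual allocator code:

```c
#include <stddef.h>

#define SMP_CACHE_BYTES 64	/* illustrative cache line size */
#define NR_CPUS 4

/* In a block that dedicates one cache line to each CPU, the copy of
 * object `slot` (of size `size`, where size is a factor of
 * SMP_CACHE_BYTES) for CPU `cpu` lives at this byte offset from the
 * start of the block. */
static size_t interleaved_offset(size_t size, int cpu, int slot)
{
	return (size_t)cpu * SMP_CACHE_BYTES + (size_t)slot * size;
}
```

Because size divides SMP_CACHE_BYTES, SMP_CACHE_BYTES / size objects pack into each line with no residue; a non-factor size would leave a fragmented tail after the last slot.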
Alignment Issues
----------------
The current alignment strategy is this -
1. Minimum allocation size is sizeof(int).
2. Each block is aligned to cache line boundary and size of any
object allocated within the block is either a factor of the block
size or equal to the block size. However, I am not sure if this
guarantees proper alignment on all architectures. We need to
investigate this some more.
I am sure there is a 69-bit transputer architecture somewhere that
breaks this allocator ;-)
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [RFC] Dynamic percpu data allocator
2002-05-23 13:08 [RFC] Dynamic percpu data allocator Dipankar Sarma
@ 2002-05-24 4:37 ` BALBIR SINGH
2002-05-24 6:13 ` Dipankar Sarma
0 siblings, 1 reply; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-24 4:37 UTC (permalink / raw)
To: dipankar, linux-kernel; +Cc: Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 2966 bytes --]
Hello, Dipankar,
I would prefer to use the existing slab allocator for this.
I am not sure if I understand your requirements for the per-cpu
allocator correctly, please correct me if I do not.
What I would like to see
1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
be able to tell for which caches we want per-cpu slabs. This
way we can make even kmalloc per-cpu, since most of the kernel
would use and dispose of memory before it migrates across CPUs.
I think this would be useful, but again I have no data to back it up.
2. I hate the use of NR_CPUS. If I compiled an SMP kernel on a two
CPU machine, I still end up with support for 32 CPUs. What I would
like to see is that in new kernel code, we treat equivalent
classes of CPUs as belonging to the same CPU. For example, given
void *blkaddrs[NR_CPUS];
while searching, instead of doing
blkaddrs[smp_processor_id()], if the slot for smp_processor_id() is full,
we should look through
for (i = 0; i < NR_CPUS; i++) {
    look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()] (or whatever)
    if successful, break
}
On a two CPU system, 1, 3, 5 ... belong to the same equivalent class, so we
might as well utilize them. Even with a per-cpu pool, threads could use the
slots in the per-cpu equivalent classes in parallel (I have a very rough
idea about this).
Does any of this make sense,
Balbir
[-- Attachment #2: Wipro_Disclaimer.txt --]
[-- Type: text/plain, Size: 490 bytes --]
**************************Disclaimer************************************
Information contained in this E-MAIL being proprietary to Wipro Limited is
'privileged' and 'confidential' and intended for use only by the individual
or entity to which it is addressed. You are notified that any use, copying
or dissemination of the information contained in the E-MAIL in any manner
whatsoever is strictly prohibited.
***************************************************************************
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] Dynamic percpu data allocator
2002-05-24 4:37 ` BALBIR SINGH
@ 2002-05-24 6:13 ` Dipankar Sarma
2002-05-24 8:38 ` [Lse-tech] " BALBIR SINGH
0 siblings, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-24 6:13 UTC (permalink / raw)
To: BALBIR SINGH; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
On Fri, May 24, 2002 at 10:07:59AM +0530, BALBIR SINGH wrote:
> Hello, Dipankar,
>
> I would prefer to use the existing slab allocator for this.
> I am not sure if I understand your requirements for the per-cpu
> allocator correctly, please correct me if I do not.
>
> What I would like to see
>
> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
> be able to tell for which caches we want per-cpu slabs. This
> way we can make even kmalloc per-cpu. Since most of the kernel
> would use and dispose memory before they migrate across cpus.
> I think this would be useful, but again no data to back it up.
Allocating cpu-local memory is a different issue altogether.
Eventually, for NUMA support, we will have to do such allocations
in a way that supports choosing memory closest to a group of CPUs.
The per-cpu data allocator allocates one copy for *each* CPU.
It uses the slab allocator underneath. Eventually, when/if we have
per-cpu/numa-node slab allocation, the per-cpu data allocator
can allocate every CPU's copy from memory closest to it.
I suppose you worked on DYNIX/ptx ? Think of this as dynamic
plocal.
>
> 2. I hate the use of NR_CPUS. If I compiled an SMP kernel on a two
> CPU machine, I still end up with support for 32 CPUs. What I would
If you don't like it, just define NR_CPUS to 2 and recompile.
> like to see is that in new kernel code, we should use treat equivalent
> classes of CPUs as belonging to the same CPU. For example
>
> void *blkaddrs[NR_CPUS];
>
> while searching, instead of doing
>
> blkaddrs[smp_processor_id()], if the slot for smp_processor_id() is full,
> we should look through
>
> for (i = 0; i < NR_CPUS; i++) {
> look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()(or
> whatever)]
> if successful break
> }
How will it work? You could be accessing memory beyond blkaddrs[].
I use NR_CPUS for allocations because if I don't, supporting CPU
hotplug will be a nightmare. Resizing so many data structures is
not an option. I believe Rusty's and/or the cpu hotplug work is
adding a for_each_cpu() macro to walk the CPUs that takes
care of everything including sparse CPU numbers. Until then,
I would use for (i = 0; i < smp_num_cpus; i++).
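That interim loop might look like this in outline, with smp_num_cpus stubbed as a plain variable standing in for the kernel symbol, and a per-CPU counter sum as the illustrative workload:

```c
#define NR_CPUS 32		/* storage sized for the worst case */

static int smp_num_cpus = 2;	/* stand-in for the kernel variable */
static long counters[NR_CPUS];	/* one counter copy per possible CPU */

/* Walk only the CPUs actually present; the array stays NR_CPUS-sized so
 * CPU hotplug never forces a resize. */
static long sum_counters(void)
{
	long sum = 0;
	int i;

	for (i = 0; i < smp_num_cpus; i++)
		sum += counters[i];
	return sum;
}
```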
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 6:13 ` Dipankar Sarma
@ 2002-05-24 8:38 ` BALBIR SINGH
2002-05-24 9:13 ` Dipankar Sarma
2002-05-24 14:38 ` Martin J. Bligh
0 siblings, 2 replies; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-24 8:38 UTC (permalink / raw)
To: dipankar; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 3192 bytes --]
|> Hello, Dipankar,
|>
|> I would prefer to use the existing slab allocator for this.
|> I am not sure if I understand your requirements for the per-cpu
|> allocator correctly, please correct me if I do not.
|>
|> What I would like to see
|>
|> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
|> be able to tell for which caches we want per-cpu slabs. This
|> way we can make even kmalloc per-cpu. Since most of the kernel
|> would use and dispose memory before they migrate across cpus.
|> I think this would be useful, but again no data to back it up.
|
|Allocating cpu-local memory is a different issue altogether.
|Eventually for NUMA support, we will have to do such allocations
|that supports choosing memory closest to a group of CPUs.
|
|The per-cpu data allocator allocates one copy for *each* CPU.
|It uses the slab allocator underneath. Eventually, when/if we have
|per-cpu/numa-node slab allocation, the per-cpu data allocator
|can allocate every CPU's copy from memory closest to it.
|
|I suppose you worked on DYNIX/ptx ? Think of this as dynamic
|plocal.
Sure, I understand what you are talking about now. That makes a lot
of sense; I will go through your document once more.
I was thinking of the two combined (allocating CPU local memory
for certain data structs also includes allocating one copy per CPU).
Is there a reason to delay the implementation of CPU local memory,
or are we waiting for NUMA guys to do it? Is it not useful in an
SMP system to allocate CPU local memory?
|
|>
|> 2. I hate the use of NR_CPUS. If I compiled an SMP kernel on a two
|> CPU machine, I still end up with support for 32 CPUs. What I would
|
|If you don't like it, just define NR_CPUS to 2 and recompile.
|
That does make sense, but I would like to keep my headers in sync, so that
all patches apply cleanly.
|
|> like to see is that in new kernel code, we should use treat equivalent
|> classes of CPUs as belonging to the same CPU. For example
|>
|> void *blkaddrs[NR_CPUS];
|>
|> while searching, instead of doing
|>
|> blkaddrs[smp_processor_id()], if the slot for
|smp_processor_id() is full,
|> we should look through
|>
|> for (i = 0; i < NR_CPUS; i++) {
|> look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()(or
|> whatever)]
|> if successful break
|> }
|
|How will it work ? You could be accessing memory beyond blkaddrs[].
|
Sorry, that could happen; it should be
blkaddrs[(smp_processor_id() + i) % smp_number_of_cpus()].
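In outline, the corrected wrap-around probe could look like this (names are illustrative; a free slot is modelled as a NULL entry):

```c
#include <stddef.h>

/* Probe each slot starting from this CPU's own, wrapping modulo the
 * number of online CPUs so the index never runs past the array. */
static int find_free_slot(void *blkaddrs[], int ncpus, int this_cpu)
{
	int i;

	for (i = 0; i < ncpus; i++) {
		int idx = (this_cpu + i) % ncpus;
		if (blkaddrs[idx] == NULL)
			return idx;	/* free slot found */
	}
	return -1;			/* all slots in use */
}
```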
|I use NR_CPUS for allocations because if I don't, supporting CPU
|hotplug will be a nightmare. Resizing so many data structures is
|not an option. I believe Rusty and/or cpu hotplug work is
|adding a for_each_cpu() macro to walk the CPUs that take
|care of everything including sparse CPU numbers. Until then,
|I would use for (i = 0; i < smp_num_cpus; i++).
I think that does make a whole lot of sense; I was not thinking
about hot-plugging CPUs (dumb me).
|
|Thanks
|--
|Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
|
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 8:38 ` [Lse-tech] " BALBIR SINGH
@ 2002-05-24 9:13 ` Dipankar Sarma
2002-05-24 11:59 ` BALBIR SINGH
2002-05-24 14:38 ` Martin J. Bligh
1 sibling, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-24 9:13 UTC (permalink / raw)
To: BALBIR SINGH; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
On Fri, May 24, 2002 at 02:08:50PM +0530, BALBIR SINGH wrote:
>
> Sure, I understand what you are talking about now. That makes a lot
> of sense, I will go through your document once more and read it.
> I was thinking of the two combined (allocating CPU local memory
> for certain data structs also includes allocating one copy per CPU).
> Is there a reason to delay the implementation of CPU local memory,
> or are we waiting for NUMA guys to do it? Is it not useful in an
> SMP system to allocate CPU local memory?
In an SMP system, the entire memory is equidistant from the CPUs.
So, any memory that is exclusively accessed by one CPU only
is CPU-local. On a NUMA machine, however, that isn't true, so
you need special schemes.
The thing about the one-copy-per-cpu allocator that I describe is that
it interleaves per-cpu data to save on space. That is, if you
allocate per-cpu ints i1 and i2, they will be laid out in memory like this -
      CPU #0          CPU #1

    ---------        ---------    Start of cache line
       i1               i1
       i2               i2
        .                .
        .                .
    ---------        ---------    End of cache line
The per-cpu copies of i1 and i2 for CPU #0 and CPU #1 are allocated from
different cache lines of memory, but copy of i1 and i2 for CPU #0 are
in the same cache line. This interleaving saves space by avoiding
the need to pad small data structures to cache line sizes.
This is essentially how the static per-cpu data area in the 2.5 kernel
is laid out in memory. Since the copies for CPU #0 and CPU #1 of
the same variable are on different cache lines, code that accesses
"this" CPU's copy will not cause cache line
bouncing. On an SMP machine, I can allocate the cache lines
for different CPUs, where the interleaved data structures are
laid out, using the slab allocator. On a NUMA machine, however,
I would want to make sure that the cache line allocated for this
purpose for CPU #N is as close as possible to CPU #N.
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 9:13 ` Dipankar Sarma
@ 2002-05-24 11:59 ` BALBIR SINGH
0 siblings, 0 replies; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-24 11:59 UTC (permalink / raw)
To: dipankar; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 2624 bytes --]
Thanks; when I said CPU local memory on SMP, I meant the CPU cache.
Sorry for the confusion. I think your dynamic allocator makes
sense.
Balbir
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 8:38 ` [Lse-tech] " BALBIR SINGH
2002-05-24 9:13 ` Dipankar Sarma
@ 2002-05-24 14:38 ` Martin J. Bligh
1 sibling, 0 replies; 15+ messages in thread
From: Martin J. Bligh @ 2002-05-24 14:38 UTC (permalink / raw)
To: BALBIR SINGH, dipankar
Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
> Is there a reason to delay the implementation of CPU local memory,
> or are we waiting for NUMA guys to do it? Is it not useful in an
> SMP system to allocate CPU local memory?
That should be pretty easy to do now, if I understand what you're
asking for ... when you allocate the area for cpu N, just do
alloc_pages_node (cpu_to_nid(smp_processor_id())) or something
similar.
M.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-05-30 13:56 Mala Anand
2002-05-30 17:55 ` Dipankar Sarma
0 siblings, 1 reply; 15+ messages in thread
From: Mala Anand @ 2002-05-30 13:56 UTC (permalink / raw)
To: dipankar
Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
Paul McKenney, Rusty Russell
From: dipankar@beaverton.ibm.com (sent by lse-tech-admin@lists.sourceforge.net)
To: BALBIR SINGH <balbir.singh@wipro.com>
Cc: linux-kernel@vger.kernel.org, Rusty Russell <rusty@rustcorp.com.au>,
    Paul McKenney/Beaverton/IBM@IBMUS, lse-tech@lists.sourceforge.net
Subject: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
Date: 05/24/02 01:13 AM
Please respond to dipankar
>Allocating cpu-local memory is a different issue altogether.
>Eventually for NUMA support, we will have to do such allocations
>that supports choosing memory closest to a group of CPUs.
>The per-cpu data allocator allocates one copy for *each* CPU.
>It uses the slab allocator underneath. Eventually, when/if we have
>per-cpu/numa-node slab allocation, the per-cpu data allocator
>can allocate every CPU's copy from memory closest to it.
Does this mean that memory allocation will happen on "each" CPU?
Does the slab allocator allocate the memory on each CPU? Your per-cpu
data allocator sounds like the hot-list skbs in the TCP/IP stack,
in the sense that it is one level above the slab allocator and the list is
kept per CPU. If the slab allocator is fixed for per-cpu, do you still
need this per-cpu data allocator?
_____________________________________________
Regards,
Mala
Mala Anand
E-mail:manand@us.ibm.com
Linux Technology Center - Performance
Phone:838-8088; Tie-line:678-8088
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-30 13:56 Mala Anand
@ 2002-05-30 17:55 ` Dipankar Sarma
2002-05-31 7:57 ` BALBIR SINGH
0 siblings, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-30 17:55 UTC (permalink / raw)
To: Mala Anand
Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
Paul McKenney, Rusty Russell
On Thu, May 30, 2002 at 08:56:36AM -0500, Mala Anand wrote:
> >The per-cpu data allocator allocates one copy for *each* CPU.
> >It uses the slab allocator underneath. Eventually, when/if we have
> >per-cpu/numa-node slab allocation, the per-cpu data allocator
> >can allocate every CPU's copy from memory closest to it.
>
> Does this mean that memory allocation will happen in "each" CPU?
> Do slab allocator allocate the memory in each cpu? Your per-cpu
> data allocator sounds like the hot list skbs that are in the tcpip stack
> in the sense it is one level above the slab allocator and the list is
> kept per cpu. If slab allocator is fixed for per cpu, do you still
> need this per-cpu data allocator?
Actually I don't know for sure what plans are afoot to fix the slab allocator
for per-cpu. One plan I heard about was allocating from per-cpu pools
rather than per-cpu copies. My requirements are similar to
those of the hot-list skbs. I want to do this -
int *ctrp1, *ctrp2;
ctrp1 = kmalloc_percpu(sizeof(*ctrp1), GFP_ATOMIC);
if (ctrp1 == NULL) {
/* recover */
}
ctrp2 = kmalloc_percpu(sizeof(*ctrp2), GFP_ATOMIC);
if (ctrp2 == NULL) {
/* recover */
}
(*per_cpu_ptr(ctrp1, smp_processor_id()))++;
(*this_cpu_ptr(ctrp2))++;
Now I can do this by making ctrp1/ctrp2 point into an array
of NR_CPUS pointers and kmalloc()ing memory for each CPU's copy of the
int. This is simple and will work.
void **ptrs = kmalloc(sizeof(*ptrs) * NR_CPUS, flags);

if (!ptrs)
	return NULL;
for (i = 0; i < NR_CPUS; i++) {
	ptrs[i] = kmalloc(size, flags);
	if (!ptrs[i])
		goto unwind_oom;
}
return ptrs;
unwind_oom:
while (--i >= 0)
	kfree(ptrs[i]);
kfree(ptrs);
return NULL;
However, I would like to use kmalloc_percpu() for allocating very
small objects - typically integer counters or small structures
to be used as per-cpu counters for things like dst entries and dentries.
kmalloc() will waste the rest of the cache line for such small objects.
The alternative is to use a layer of code to interleave small objects
and save on space.
      CPU #0          CPU #1

    ---------        ---------    Start of cache line
     *ctrp1           *ctrp1
     *ctrp2           *ctrp2
        .                .
        .                .
    ---------        ---------    End of cache line
I have an allocator that interleaves objects like this if they can be fitted
into a size that is a factor of SMP_CACHE_BYTES.
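A back-of-the-envelope comparison of the two options, assuming a 128-byte cache line and 4-byte counters (figures illustrative):

```c
#define SMP_CACHE_BYTES 128	/* illustrative cache line size */

/* Bytes left unused per CPU when each small object is padded out to a
 * full cache line of its own (the plain kmalloc route). */
static unsigned int padded_waste(unsigned int objsize)
{
	return SMP_CACHE_BYTES - objsize;
}

/* Objects the interleaving allocator packs into one cache line when
 * objsize is a factor of SMP_CACHE_BYTES. */
static unsigned int packed_per_line(unsigned int objsize)
{
	return SMP_CACHE_BYTES / objsize;
}
```

So a 4-byte per-CPU counter wastes 124 bytes per CPU when padded out, while interleaving fits 32 such counters into the same line.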
I hope someone can tell me that I don't even have to do this. Otherwise
I will go ahead and do my thing.
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-30 17:55 ` Dipankar Sarma
@ 2002-05-31 7:57 ` BALBIR SINGH
2002-05-31 8:40 ` Dipankar Sarma
0 siblings, 1 reply; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-31 7:57 UTC (permalink / raw)
To: dipankar, 'Mala Anand'
Cc: linux-kernel, lse-tech-admin, 'Paul McKenney',
'Rusty Russell'
[-- Attachment #1: Type: text/plain, Size: 2920 bytes --]
|Actually I don't know for sure what plans are afoot to fix the
|slab allocator
|for per-cpu. One plan I heard about was allocating from per-cpu pools
|rather than per-cpu copies. My requirements are similar to
|the hot list skbs. I want to do this -
|
| int *ctrp1, *ctrp2;
|
| ctrp1 = kmalloc_percpu(sizeof(*ctrp1), GFP_ATOMIC);
| if (ctrp1 == NULL) {
| /* recover */
| }
| ctrp2 = kmalloc_percpu(sizeof(*ctrp2), GFP_ATOMIC);
| if (ctrp2 == NULL) {
| /* recover */
| }
|
| (*per_cpu_ptr(ctrp1, smp_processor_id()))++;
| (*this_cpu_ptr(ctrp2))++;
|
|Now I can allocate by making ctrp1/ctrp2 point to an array
|of NR_CPUS and kmalloc() memory for each CPU's copy of the
|int. This is simple and will work.
|
| void **ptrs = kmalloc(sizeof(*ptrs) * NR_CPUS, flags);
|
| if (!ptrs) return NULL;
| for (i = 0; i < NR_CPUS; i++) {
| ptrs[i] = kmalloc(size, flags);
| if (!ptrs[i])
| goto unwind_oom;
| }
|
|
|However I would like to use kmalloc_percpu() for allocating very
|small objects - typlically integer counters or small structures
|to be used as per-cpu counters for things like dst entries and
|dentries.
|kmalloc will waste the rest of the cache line for such small objects.
|The alternative is to use a layer of code to interleave small objects
|and save on space.
|
|
| CPU #0 CPU#1
|
| --------- --------- Start of cache line
| *ctrp1 *ctrp1
| *ctrp2 *ctrp2
|
| . .
| . .
| . .
| . .
| . .
|
| --------- ---------- End of cache line
Won't this result in a lot of false sharing? If any of the CPUs
tried to access any of the counters, the entire cache line would be
moved from the current CPU to that CPU. Isn't this a very bad thing, or
am I missing something? Do all your counters fit into one cache line?
For some time now, I have been thinking of implementing/supporting
PMEs (Performance Monitoring Events and Counters), so that we can
get real values (at least on x86) as compared to our guesses about
cacheline bouncing, etc. Do you know if somebody is already doing
this?
Regards,
Balbir
|
|I have an allocator that interleaves objects like this if they
|can be fitted
|into size that is a factor of SMP_CACHE_BYTES.
|
|I hope someone can tell me that I don't even have to do this. Otherwise
|I will go ahead and do my thing.
|
|Thanks
|--
|Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-31 7:57 ` BALBIR SINGH
@ 2002-05-31 8:40 ` Dipankar Sarma
0 siblings, 0 replies; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-31 8:40 UTC (permalink / raw)
To: BALBIR SINGH
Cc: 'Mala Anand', linux-kernel, 'Paul McKenney',
'Rusty Russell'
On Fri, May 31, 2002 at 01:27:44PM +0530, BALBIR SINGH wrote:
> |
> |
> | CPU #0 CPU#1
> |
> | --------- --------- Start of cache line
> | *ctrp1 *ctrp1
> | *ctrp2 *ctrp2
> |
> | . .
> | . .
> | . .
> | . .
> | . .
> |
> | --------- ---------- End of cache line
>
>
> Won't this result in a lot of false sharing? If any of the CPUs
> tried to access any of the counters, the entire cache line would be
> moved from the current CPU to that CPU. Isn't this a very bad thing,
> or am I missing something? Do all your counters fit into one cache line?
Yes, it could result in false sharing. You could probably avoid
that by imposing classes of allocation - say STRICTLY_LOCAL and
ALMOST_LOCAL - so that strictly local objects are not penalized
by occasionally accessed non-local objects. If your code frequently
accesses other CPUs' copies of the object, then you should not be using
this per-cpu allocator in the first place; it would be meaningless.
>
> For some time now, I have been thinking of implementing/supporting
> PMEs (Performance Monitoring Events and Counters), so that we can
> get real values (at least on x86) as compared to our guesses about
> cacheline bouncing, etc. Do you know if somebody is already doing
> this?
You can use SGI kernprof to measure PMCs. See the SGI oss
website for details. You can count L2_LINES_IN event to
get a measure of cache line bouncing.
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-03 19:12 Mala Anand
2002-06-03 19:48 ` Dipankar Sarma
0 siblings, 1 reply; 15+ messages in thread
From: Mala Anand @ 2002-06-03 19:12 UTC (permalink / raw)
To: dipankar
Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
Paul McKenney, Rusty Russell
On Thu, May 30, 2002 at 08:56:36AM -0500, Mala Anand wrote:
>>
>> dipankar@beaverton.ibm.com
>> To: BALBIR SINGH <balbir.singh@wipro.com>
>>
>>The per-cpu data allocator allocates one copy for *each* CPU.
>>It uses the slab allocator underneath. Eventually, when/if we have
>>per-cpu/numa-node slab allocation, the per-cpu data allocator
>>can allocate every CPU's copy from memory closest to it.
>>
>> Does this mean that memory allocation will happen on *each* CPU?
>> Does the slab allocator allocate the memory on each CPU? Your per-cpu
>> data allocator sounds like the hot-list skbs in the TCP/IP stack,
>> in the sense that it is one level above the slab allocator and the
>> list is kept per CPU. If the slab allocator is fixed for per-cpu,
>> do you still need this per-cpu data allocator?
>Actually I don't know for sure what plans are afoot to fix the slab
>allocator for per-cpu. One plan I heard about was allocating from
>per-cpu pools rather than per-cpu copies. My requirements are similar
>to the hot-list skbs. I want to do this -
I looked at the slab code; a per-cpu slab is already implemented by
Manfred Spraul. Look at cpu_data[NR_CPUS] in the kmem_cache_s structure.
Regards,
Mala
Mala Anand
E-mail:manand@us.ibm.com
Linux Technology Center - Performance
Phone:838-8088; Tie-line:678-8088
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-06-03 19:12 Mala Anand
@ 2002-06-03 19:48 ` Dipankar Sarma
0 siblings, 0 replies; 15+ messages in thread
From: Dipankar Sarma @ 2002-06-03 19:48 UTC (permalink / raw)
To: Mala Anand
Cc: BALBIR SINGH, linux-kernel, lse-tech, Paul McKenney,
Rusty Russell
On Mon, Jun 03, 2002 at 02:12:29PM -0500, Mala Anand wrote:
> I looked at the slab code, per cpu slab is already implemented by Manfred
> Spraul.
> Look at cpu_data[NR_CPUS] in kmem_cache_s structure.
>
Sorry, I should have been clearer about what I wanted.
Yes, kmem_cache_alloc() allocates one object from "this" CPU's slabs. What
I want is a kmalloc_percpu() that allocates one copy for every
CPU in the system. Think of this as dynamically allocating an
array of NR_CPUS objects, with the objects residing on different cache lines.
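The intended semantics can be modeled in a few lines of userspace C. This is a sketch under the assumption that each copy merely needs to start a whole cache line apart; `kmalloc_percpu()` and `per_cpu_copy()` here are illustrative stand-ins, not the real patch.

```c
/* Userspace model of the kmalloc_percpu() semantics described above:
 * one call yields NR_CPUS copies of the object, each rounded up to a
 * whole cache line so that no two CPUs' copies share a line.
 * Contrast kmem_cache_alloc(), which returns a single object from
 * the calling CPU's slab. */
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS         4
#define SMP_CACHE_BYTES 64

/* Round an object size up to a whole number of cache lines. */
static size_t percpu_stride(size_t size)
{
    return (size + SMP_CACHE_BYTES - 1) & ~(size_t)(SMP_CACHE_BYTES - 1);
}

/* Allocate one zeroed copy per CPU, each on its own cache line(s). */
static void *kmalloc_percpu(size_t size)
{
    return calloc(NR_CPUS, percpu_stride(size));
}

/* CPU 'cpu''s copy of an object returned by kmalloc_percpu(). */
static void *per_cpu_copy(void *base, size_t size, int cpu)
{
    return (char *)base + (size_t)cpu * percpu_stride(size);
}
```

Padding every copy to a full line is the simple version; the interleaving allocator discussed earlier in the thread exists precisely to avoid this waste for small objects.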
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-04 12:05 Mala Anand
0 siblings, 0 replies; 15+ messages in thread
From: Mala Anand @ 2002-06-04 12:05 UTC (permalink / raw)
To: dipankar; +Cc: BALBIR SINGH, linux-kernel, Paul McKenney,
'Rusty Russell'
>> For some time now, I have been thinking of implementing/supporting
>> PMEs (Performance Monitoring Events and Counters), so that we can
>> get real values (at least on x86) as compared to our guesses about
>> cacheline bouncing, etc. Do you know if somebody is already doing
>> this?
>You can use SGI kernprof to measure PMCs. See the SGI oss
>website for details. You can count L2_LINES_IN event to
>get a measure of cache line bouncing.
I have profiled L2_LINES_OUT on the netperf tcp_stream workload.
The following is the profile from a 100Mb Ethernet tcp_stream
4-adapter test, baseline 2.4.17 kernel:
poll_idle [c0105280]: 121743
csum_partial_copy_generic [c0277f60]: 27951
schedule [c0114190]: 24853
do_softirq [c011b9e0]: 9130
mod_timer [c011eb10]: 6997
tcp_v4_rcv [c0258b70]: 6449
speedo_interrupt [c01a83e0]: 6262
__wake_up [c0114720]: 6143
tcp_recvmsg [c0249930]: 5199
USER [c0124a70]: 5081
speedo_start_xmit [c01a7fc0]: 4349
tcp_rcv_established [c0250c90]: 3724
tcp_data_wait [c02496d0]: 3610
speedo_rx [c01a8900]: 3358
handle_IRQ_event [c0108a60]: 3339
__kfree_skb [c0230520]: 2716
net_rx_action [c0233f10]: 2510
mcount [c02784e0]: 2261
ip_route_input [c023ee40]: 2028
ip_rcv [c0241180]: 1912
ip_queue_xmit [c0243950]: 1886
tcp_transmit_skb [c0252400]: 1877
__switch_to [c01059d0]: 1522
tcp_prequeue_process [c0249860]: 1461
skb_copy_and_csum_datagram_iovec [c0232590]: 1393
netif_rx [c0233ba0]: 1360
sock_recvmsg [c022cfc0]: 1358
eth_type_trans [c023a260]: 1350
tcp_event_data_recv [c024c320]: 1344
ip_output [c0243810]: 1333
speedo_tx_buffer_gc [c01a81d0]: 1239
tcp_v4_do_rcv [c0258a50]: 1169
kmalloc [c012dda0]: 1154
tcp_copy_to_iovec [c0250ae0]: 1138
fput [c0136d30]: 1124
sys_recvfrom [c022df40]: 1105
speedo_refill_rx_buf [c01a86d0]: 1045
kfree [c012df90]: 1015
dev_queue_xmit [c0233860]: 952
alloc_skb [c02301e0]: 936
do_gettimeofday [c010c520]: 919
system_call [c01070d8]: 918
fget [c0136e30]: 901
sys_socketcall [c022e610]: 887
skb_release_data [c0230420]: 882
ip_local_deliver [c0241020]: 808
skb_copy_and_csum_datagram [c02322a0]: 805
csum_partial [c0277e78]: 781
cleanup_rbuf [c02495d0]: 736
kfree_skbmem [c02304b0]: 696
sock_wfree [c022f380]: 600
inet_recvmsg [c0264720]: 596
speedo_refill_rx_buffers [c01a88b0]: 568
qdisc_restart [c023a4e0]: 568
__generic_copy_from_user [c02781d0]: 501
do_check_pgt_cache [c0112de0]: 494
sys_recv [c022e020]: 488
remove_wait_queue [c0115930]: 487
check_pgt_cache [c0124b30]: 483
add_wait_queue [c01158b0]: 430
__generic_copy_to_user [c0278180]: 429
tcp_send_delayed_ack [c0254e30]: 405
tcp_v4_checksum_init [c0258930]: 391
cpu_idle [c01052b0]: 386
sockfd_lookup [c022cd70]: 370
pfifo_fast_enqueue [c023a940]: 350
schedule_timeout [c0114060]: 314
pfifo_fast_dequeue [c023a9c0]: 304
To eliminate the cache-line bouncing, I applied IRQ and process
affinity. The L2_LINES_OUT profile with affinity:
poll_idle [c0105280]: 72241
csum_partial_copy_generic [c0289500]: 13838
schedule [c0114190]: 9036
speedo_interrupt [c01b9980]: 5066
do_softirq [c011b9e0]: 3922
USER [c0124c80]: 2573
tcp_recvmsg [c025aed0]: 2154
__wake_up [c0114720]: 1779
speedo_start_xmit [c01b9560]: 1654
mod_timer [c011eb10]: 1551
tcp_rcv_established [c0262230]: 1336
mcount [c0289a80]: 1298
tcp_transmit_skb [c02639a0]: 984
__switch_to [c01059d0]: 927
do_gettimeofday [c010c520]: 876
sys_socketcall [c023fbb0]: 872
ip_rcv [c0252720]: 872
ip_route_input [c02503e0]: 868
ip_queue_xmit [c0254ef0]: 805
system_call [c01070d8]: 772
tcp_data_wait [c025ac70]: 748
__kfree_skb [c0241ac0]: 640
tcp_v4_rcv [c026a110]: 629
do_check_pgt_cache [c0112de0]: 584
ip_output [c0254db0]: 575
net_rx_action [c02454b0]: 565
kfree [c012e1a0]: 556
fput [c0136f40]: 524
csum_partial [c0289418]: 514
handle_IRQ_event [c0108a60]: 507
sock_recvmsg [c023e560]: 479
cleanup_rbuf [c025ab70]: 430
skb_copy_and_csum_datagram [c0243840]: 428
dev_queue_xmit [c0244e00]: 409
sys_recvfrom [c023f4e0]: 404
kfree_skbmem [c0241a50]: 391
speedo_tx_buffer_gc [c01b9770]: 388
skb_copy_and_csum_datagram_iovec [c0243b30]: 383
kmalloc [c012dfb0]: 379
ip_local_deliver [c02525c0]: 362
netif_rx [c0245140]: 361
tcp_copy_to_iovec [c0262080]: 356
tcp_event_data_recv [c025d8c0]: 334
tcp_prequeue_process [c025ae00]: 319
fget [c0137040]: 300
tcp_v4_do_rcv [c0269ff0]: 285
add_wait_queue [c01158b0]: 267
skb_release_data [c02419c0]: 249
alloc_skb [c0241780]: 249
schedule_timeout [c0114060]: 238
remove_wait_queue [c0115930]: 233
sockfd_lookup [c023e310]: 232
__generic_copy_from_user [c0289770]: 224
sys_recv [c023f5c0]: 216
Regards,
Mala
Mala Anand
E-mail:manand@us.ibm.com
Linux Technology Center - Performance
Phone:838-8088; Tie-line:678-8088
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-04 21:11 Paul McKenney
0 siblings, 0 replies; 15+ messages in thread
From: Paul McKenney @ 2002-06-04 21:11 UTC (permalink / raw)
To: Mala Anand; +Cc: BALBIR SINGH, dipankar, linux-kernel, 'Rusty Russell'
For whatever it is worth, here are the functions ranked in
decreasing order of savings due to the IRQ and process
affinity. Column 2 is profile ticks with affinity, column 3
is without, and column 4 is the difference.
Thanx, Paul
poll_idle 72241 121743 49502
schedule 9036 24853 15817
csum_partial_copy_generic 13838 27951 14113
tcp_v4_rcv 629 6449 5820
mod_timer 1551 6997 5446
do_softirq 3922 9130 5208
__wake_up 1779 6143 4364
speedo_rx 0 3358 3358
tcp_recvmsg 2154 5199 3045
tcp_data_wait 748 3610 2862
handle_IRQ_event 507 3339 2832
speedo_start_xmit 1654 4349 2695
USER 2573 5081 2508
tcp_rcv_established 1336 3724 2388
__kfree_skb 640 2716 2076
net_rx_action 565 2510 1945
eth_type_trans 0 1350 1350
speedo_interrupt 5066 6262 1196
ip_route_input 868 2028 1160
tcp_prequeue_process 319 1461 1142
ip_queue_xmit 805 1886 1081
speedo_refill_rx_buf 0 1045 1045
ip_rcv 872 1912 1040
tcp_event_data_recv 334 1344 1010
skb_copy_and_csum_datagram_iovec 383 1393 1010
netif_rx 361 1360 999
mcount 1298 2261 963
tcp_transmit_skb 984 1877 893
tcp_v4_do_rcv 285 1169 884
sock_recvmsg 479 1358 879
speedo_tx_buffer_gc 388 1239 851
tcp_copy_to_iovec 356 1138 782
kmalloc 379 1154 775
ip_output 575 1333 758
sys_recvfrom 404 1105 701
alloc_skb 249 936 687
skb_release_data 249 882 633
fget 300 901 601
fput 524 1124 600
sock_wfree 0 600 600
inet_recvmsg 0 596 596
__switch_to 927 1522 595
qdisc_restart 0 568 568
speedo_refill_rx_buffers 0 568 568
dev_queue_xmit 409 952 543
check_pgt_cache 0 483 483
kfree 556 1015 459
ip_local_deliver 362 808 446
__generic_copy_to_user 0 429 429
tcp_send_delayed_ack 0 405 405
tcp_v4_checksum_init 0 391 391
cpu_idle 0 386 386
skb_copy_and_csum_datagram 428 805 377
pfifo_fast_enqueue 0 350 350
cleanup_rbuf 430 736 306
kfree_skbmem 391 696 305
pfifo_fast_dequeue 0 304 304
__generic_copy_from_user 224 501 277
sys_recv 216 488 272
csum_partial 514 781 267
remove_wait_queue 233 487 254
add_wait_queue 267 430 163
system_call 772 918 146
sockfd_lookup 232 370 138
schedule_timeout 238 314 76
do_gettimeofday 876 919 43
sys_socketcall 872 887 15
do_check_pgt_cache 584 494 -90
end of thread, other threads:[~2002-06-04 21:11 UTC | newest]
Thread overview: 15+ messages
2002-05-23 13:08 [RFC] Dynamic percpu data allocator Dipankar Sarma
2002-05-24 4:37 ` BALBIR SINGH
2002-05-24 6:13 ` Dipankar Sarma
2002-05-24 8:38 ` [Lse-tech] " BALBIR SINGH
2002-05-24 9:13 ` Dipankar Sarma
2002-05-24 11:59 ` BALBIR SINGH
2002-05-24 14:38 ` Martin J. Bligh
-- strict thread matches above, loose matches on Subject: below --
2002-05-30 13:56 Mala Anand
2002-05-30 17:55 ` Dipankar Sarma
2002-05-31 7:57 ` BALBIR SINGH
2002-05-31 8:40 ` Dipankar Sarma
2002-06-03 19:12 Mala Anand
2002-06-03 19:48 ` Dipankar Sarma
2002-06-04 12:05 Mala Anand
2002-06-04 21:11 Paul McKenney