public inbox for linux-kernel@vger.kernel.org
* [RFC] Dynamic percpu data allocator
@ 2002-05-23 13:08 Dipankar Sarma
  2002-05-24  4:37 ` BALBIR SINGH
  0 siblings, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-23 13:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rusty Russell, Paul McKenney, lse-tech

[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]

If the static percpu area is around, can a dynamic percpu data allocator
be far behind ;-)

As part of the scalable kernel primitives work for higher-end SMP
and NUMA architectures, we have been seeing an increasing need
for per-cpu data in various key areas. Rusty's percpu area
work has added a way in 2.5 kernels to maintain static per-cpu
data. Inspired by that work, I have implemented a dynamic per-cpu
data allocator. Currently it is useful to us for -

1. Per-cpu data in dynamically allocated structures.
2. Per-cpu statistics and reference counters.
3. Per-cpu data in drivers/modules.
4. Scalable locking primitives like local-spin-only locks
   (or even big reader locks).

Included in this mail is a document that describes the allocator.
I would really appreciate it if people commented on it. I am
particularly interested in the eek-value of the interfaces,
especially the bit about keeping the type information in
a dummy variable in a union.

The actual patch will follow soon, unless someone convinces
me quickly that there is a saner way to do this.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

[-- Attachment #2: percpu_data.txt --]
[-- Type: text/plain, Size: 4302 bytes --]

                        Per-CPU Data Allocator
                        ----------------------

Interfaces
----------

The interfaces for the per-cpu data allocator are similar to Rusty's static
per-CPU data interfaces. One clear goal was to make sure that they
add no overhead in UP kernels, so for UP kernels they
reduce to ordinary variables. The basic interfaces are these -

1. percpu_data_declare(type,var)
2. percpu_data_alloc(var)
3. percpu_data(var,cpu) 
4. this_percpu_data(var)
5. percpu_data_free(var)

For example, we can declare the following structure -

	struct yy {
		int a;
		percpu_data_declare(int, b);
		int c;
		int d;
	};

We can allocate memory for percpu int like this -

	struct yy y;

	if (percpu_data_alloc(y.b)) {
		/* Failed */
	}

To use it -
	
	cpu = smp_processor_id();
	percpu_data(y.b, cpu)++;

	or

	this_percpu_data(y.b)++;

To free the per-CPU data

	percpu_data_free(y.b);

The data declaration interface is a bit unnatural, but I can't think of
anything better that would let me preserve the type information for the
original variable so that appropriate typecasting can be done for other
interfaces. percpu_data_declare(type,var) expands to -

	union {
		percpu_data_t *percpu;
		typeof(type) realtype;
	} var

percpu_data_t maintains the pointers necessary to lookup the real
percpu data. 

	typedef struct {
		void *blkaddrs[NR_CPUS];
		struct percpu_data_blk *blkp;
	} percpu_data_t;

The type information is used (using typeof()) to
typecast data accesses. 

	#define percpu_data(var,cpu) \
			(*((typeof(var.realtype) *)var.percpu->blkaddrs[cpu]))

Using a pointer to percpu_data_t adds
an overhead of an additional memory reference while accessing
percpu data. This can be avoided by embedding the percpu_data_t
structure, but since percpu_data_t has an NR_CPUS array, it changes
structure sizes very radically. It is a tradeoff, we could go either
way.



Potential Uses
--------------

1. Scalable counters - they already use a scaled down version of the
allocator inside. Per-cpu counters can reduce the overhead of cacheline
bouncing.

2. Big reader lock - this need not be statically allocated anymore.

3. Per-CPU data in modules - Rusty's static per-cpu scheme
doesn't work in modules, at least I haven't seen a way to do it.

4. Scalable locks - per-cpu data is commonly used in scalable locks
like MCS locks.



Allocator
---------

The current approach is that unless there is interest in a dynamic
percpu data allocator, there is no point in spending too much time
writing a sophisticated one.

Allocation Policy
-----------------

1. If the allocation request size is a factor of SMP_CACHE_BYTES,
then it will be interleaved to avoid fragmentation as much as possible.
If the request size is a multiple of SMP_CACHE_BYTES, fragmentation
will also be avoided. Anything else results in fragmentation.
The current allocator makes no attempt to use the fragmented
portion; in that sense it is like padding to the cache line boundary.

2. A simple binary search tree is used to maintain the memory
objects (possibly blocks of them) of different sizes. For
interleaving, the objects are maintained in blocks, and a freelist
mechanism similar to the slab allocator's is used to allocate
objects from within a block. Each block is allocated from a kmem_cache.

If there is sufficient interest in the per-cpu data allocator,
then I will revisit the allocator and see if fragmentation can
be reduced for non-multiples/non-factors of SMP_CACHE_BYTES.

For non-factor allocations, the residual part of the cache line
can be tracked and a best-factor-fit algorithm used to allocate
from it. This makes the assumption that kernel allocation
requests are likely to contain repetitive patterns of similar sizes.

Alignment Issues
----------------

The current alignment strategy is this -

1. Minimum allocation size is sizeof(int).
2. Each block is aligned to cache line boundary and size of any
   object allocated within the block is either a factor of the block
   size or equal to the block size. However, I am not sure if this
   guarantees proper alignment on all architectures. We need to
   investigate this some more.

I am sure there is a 69-bit transputer architecture somewhere that
breaks this allocator ;-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC] Dynamic percpu data allocator
  2002-05-23 13:08 [RFC] Dynamic percpu data allocator Dipankar Sarma
@ 2002-05-24  4:37 ` BALBIR SINGH
  2002-05-24  6:13   ` Dipankar Sarma
  0 siblings, 1 reply; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-24  4:37 UTC (permalink / raw)
  To: dipankar, linux-kernel; +Cc: Rusty Russell, Paul McKenney, lse-tech

[-- Attachment #1: Type: text/plain, Size: 2966 bytes --]

Hello, Dipankar,

I would prefer to use the existing slab allocator for this.
I am not sure if I understand your requirements for the per-cpu
allocator correctly, please correct me if I do not.

What I would like to see

1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
   be able to tell for which caches we want per-cpu slabs. This
   way we can make even kmalloc per-cpu, since most kernel code
   would use and dispose of memory before migrating across CPUs.
   I think this would be useful, but again I have no data to back it up.

2. I hate the use of NR_CPUS. If I compile an SMP kernel on a two
   CPU machine, I still end up with support for 32 CPUs. What I would
   like to see is that in new kernel code we treat all CPUs in an
   equivalence class as the same CPU. For example

   void *blkaddrs[NR_CPUS];

   while searching, instead of doing

   blkaddrs[smp_processor_id()], if the slot for smp_processor_id() is full,
   we should look through

   for (i = 0; i < NR_CPUS; i++) {
     look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus() (or whatever)]
     if successful break
   }

On a two CPU system 1, 3, 5 ... belong to the same equivalence class. So we
might as well utilize them. Even with a per-cpu pool, threads could use the
slots in the per-cpu equivalence classes in parallel (I have a very rough
idea about this).

Does any of this make sense,
Balbir


|-----Original Message-----
|From: linux-kernel-owner@vger.kernel.org
|[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Dipankar Sarma
|Sent: Thursday, May 23, 2002 6:39 PM
|To: linux-kernel@vger.kernel.org
|Cc: Rusty Russell; Paul McKenney; lse-tech@lists.sourceforge.net
|Subject: [RFC] Dynamic percpu data allocator
|
|
|If static percpu area is around, can dynamic percpu data allocator
|be far behind ;-)
|
|As a part of scalable kernel primitives work for higher-end SMP
|and NUMA architectures, we have been seeing increasing need
|for per-cpu data in various key areas. Rusty's percpu area
|work has added a way in 2.5 kernels to maintain static per-cpu
|data. Inspired by that work, I have implemented a dynamic per-cpu
|data allocator. Currently it is useful to us for -
|
|1. Per-cpu data in dynamically allocated structures.
|2. per-cpu statistics and reference counters
|3. Per-cpu data in drivers/modules.
|4. Scalable locking primitives like local spin only locks
|   (or even big reader locks).
|
|Included in this mail is a document that describes the allocator.
|I would really appreciate it if people commented on it. I am
|particularly interested in the eek-value of the interfaces,
|especially the bit about keeping the type information in
|a dummy variable in a union.
|
|The actual patch will follow soon, unless someone convinces
|me quickly that there is a saner way to do this.
|
|Thanks
|--
|Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
|


[-- Attachment #2: Wipro_Disclaimer.txt --]
[-- Type: text/plain, Size: 490 bytes --]

**************************Disclaimer************************************

Information contained in this E-MAIL being proprietary to Wipro Limited is 
'privileged' and 'confidential' and intended for use only by the individual
 or entity to which it is addressed. You are notified that any use, copying 
or dissemination of the information contained in the E-MAIL in any manner 
whatsoever is strictly prohibited.

***************************************************************************

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Dynamic percpu data allocator
  2002-05-24  4:37 ` BALBIR SINGH
@ 2002-05-24  6:13   ` Dipankar Sarma
  2002-05-24  8:38     ` [Lse-tech] " BALBIR SINGH
  0 siblings, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-24  6:13 UTC (permalink / raw)
  To: BALBIR SINGH; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech

On Fri, May 24, 2002 at 10:07:59AM +0530, BALBIR SINGH wrote:
> Hello, Dipankar,
> 
> I would prefer to use the existing slab allocator for this.
> I am not sure if I understand your requirements for the per-cpu
> allocator correctly, please correct me if I do not.
> 
> What I would like to see
> 
> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
>    be able to tell for which caches we want per-cpu slabs. This
>    way we can make even kmalloc per-cpu. Since most of the kernel
>    would use and dispose memory before they migrate across cpus.
>    I think this would be useful, but again no data to back it up.

Allocating cpu-local memory is a different issue altogether.
Eventually, for NUMA support, we will have to do such allocations
in a way that supports choosing memory closest to a group of CPUs.

The per-cpu data allocator allocates one copy for *each* CPU.
It uses the slab allocator underneath. Eventually, when/if we have
per-cpu/numa-node slab allocation, the per-cpu data allocator
can allocate every CPU's copy from memory closest to it.

I suppose you worked on DYNIX/ptx ? Think of this as dynamic
plocal. 

> 
> 2. I hate the use of NR_CPUS. If I compiled an SMP kernel on a two
>    CPU machine, I still end up with support for 32 CPUs. What I would

If you don't like it, just define NR_CPUS to 2 and recompile.


>    like to see is that in new kernel code, we should use treat equivalent
>    classes of CPUs as belonging to the same CPU. For example
> 
>    void *blkaddrs[NR_CPUS];
> 
>    while searching, instead of doing
> 
>    blkaddrs[smp_processor_id()], if the slot for smp_processor_id() is full,
>    we should look through
> 
>    for (i = 0; i < NR_CPUS; i++) {
>      look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()(or
> whatever)]
>      if successful break
>    }

How will it work ? You could be accessing memory beyond blkaddrs[].

I use NR_CPUS for allocations because if I don't, supporting CPU
hotplug will be a nightmare. Resizing so many data structures is
not an option. I believe Rusty's and/or the cpu hotplug work is
adding a for_each_cpu() macro to walk the CPUs that takes
care of everything including sparse CPU numbers. Until then,
I would use for (i = 0; i < smp_num_cpus; i++).

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-05-24  6:13   ` Dipankar Sarma
@ 2002-05-24  8:38     ` BALBIR SINGH
  2002-05-24  9:13       ` Dipankar Sarma
  2002-05-24 14:38       ` Martin J. Bligh
  0 siblings, 2 replies; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-24  8:38 UTC (permalink / raw)
  To: dipankar; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech

[-- Attachment #1: Type: text/plain, Size: 3192 bytes --]


|> Hello, Dipankar,
|>
|> I would prefer to use the existing slab allocator for this.
|> I am not sure if I understand your requirements for the per-cpu
|> allocator correctly, please correct me if I do not.
|>
|> What I would like to see
|>
|> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
|>    be able to tell for which caches we want per-cpu slabs. This
|>    way we can make even kmalloc per-cpu. Since most of the kernel
|>    would use and dispose memory before they migrate across cpus.
|>    I think this would be useful, but again no data to back it up.
|
|Allocating cpu-local memory is a different issue altogether.
|Eventually for NUMA support, we will have to do such allocations
|that supports choosing memory closest to a group of CPUs.
|
|The per-cpu data allocator allocates one copy for *each* CPU.
|It uses the slab allocator underneath. Eventually, when/if we have
|per-cpu/numa-node slab allocation, the per-cpu data allocator
|can allocate every CPU's copy from memory closest to it.
|
|I suppose you worked on DYNIX/ptx ? Think of this as dynamic
|plocal.


Sure, I understand what you are talking about now. That makes a lot
of sense, I will go through your document once more and read it.
I was thinking of the two combined (allocating CPU local memory
for certain data structs also includes allocating one copy per CPU).
Is there a reason to delay the implementation of CPU local memory,
or are we waiting for NUMA guys to do it? Is it not useful in an
SMP system to allocate CPU local memory?


|
|>
|> 2. I hate the use of NR_CPUS. If I compiled an SMP kernel on a two
|>    CPU machine, I still end up with support for 32 CPUs. What I would
|
|If you don't like it, just define NR_CPUS to 2 and recompile.
|

That does make sense, but I would like to keep my headers in sync, so that
all patches apply cleanly.

|
|>    like to see is that in new kernel code, we should use treat equivalent
|>    classes of CPUs as belonging to the same CPU. For example
|>
|>    void *blkaddrs[NR_CPUS];
|>
|>    while searching, instead of doing
|>
|>    blkaddrs[smp_processor_id()], if the slot for
|smp_processor_id() is full,
|>    we should look through
|>
|>    for (i = 0; i < NR_CPUS; i++) {
|>      look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()(or
|> whatever)]
|>      if successful break
|>    }
|
|How will it work ? You could be accessing memory beyond blkaddrs[].
|

Sorry, that could happen; it should be (smp_processor_id() + i) %
smp_number_of_cpus().


|I use NR_CPUS for allocations because if I don't, supporting CPU
|hotplug will be a nightmare. Resizing so many data structures is
|not an option. I believe Rusty and/or cpu hotplug work is
|adding a for_each_cpu() macro to walk the CPUs that take
|care of everything including sparse CPU numbers. Until then,
|I would use for (i = 0; i < smp_num_cpus; i++).

I think that does make a whole lot of sense; I was not thinking
about hotplugging CPUs (dumb me).


|
|Thanks
|--
|Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-05-24  8:38     ` [Lse-tech] " BALBIR SINGH
@ 2002-05-24  9:13       ` Dipankar Sarma
  2002-05-24 11:59         ` BALBIR SINGH
  2002-05-24 14:38       ` Martin J. Bligh
  1 sibling, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-24  9:13 UTC (permalink / raw)
  To: BALBIR SINGH; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech

On Fri, May 24, 2002 at 02:08:50PM +0530, BALBIR SINGH wrote:
> 
> Sure, I understand what you are talking about now. That makes a lot
> of sense, I will go through your document once more and read it.
> I was thinking of the two combined (allocating CPU local memory
> for certain data structs also includes allocating one copy per CPU).
> Is there a reason to delay the implementation of CPU local memory,
> or are we waiting for NUMA guys to do it? Is it not useful in an
> SMP system to allocate CPU local memory?

In an SMP system, the entire memory is equidistant from the CPUs.
So, any memory that is exclusively accessed by one CPU only
is CPU-local. On a NUMA machine, however, that isn't true, so
you need special schemes.

The thing about the one-copy-per-cpu allocator that I describe is that
it interleaves per-cpu data to save space. That is, if you
allocate per-cpu ints i1 and i2, they will be laid out in memory like this -

   CPU #0          CPU#1

 ---------       ---------         Start of cache line
   i1              i1
   i2              i2 

   .               .
   .               .
   .               .
   .               .
   .               .

 ---------       ----------        End of cache line

The per-cpu copies of i1 and i2 for CPU #0 and CPU #1 are allocated from
different cache lines of memory, but the copies of i1 and i2 for CPU #0 are
in the same cache line. This interleaving saves space by avoiding
the need to pad small data structures to cache line sizes.
This is essentially how the static per-cpu data area in the 2.5 kernel
is laid out in memory. Since the copies of the same variable for
CPU #0 and CPU #1 are on different cache lines, code that accesses
"this" CPU's copy will not cause cache line bouncing. On an SMP
machine, I can allocate the cache lines for the different CPUs, where
the interleaved data structures are laid out, using the slab allocator.
On a NUMA machine, however, I would want to make sure that the cache
line allocated for this purpose for CPU #N is as close as possible to
CPU #N.


Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-05-24  9:13       ` Dipankar Sarma
@ 2002-05-24 11:59         ` BALBIR SINGH
  0 siblings, 0 replies; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-24 11:59 UTC (permalink / raw)
  To: dipankar; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech

[-- Attachment #1: Type: text/plain, Size: 2624 bytes --]

Thanks. When I said CPU local memory on SMP, I meant the CPU cache.
Sorry for the confusion. I think your dynamic allocator makes
sense.

Balbir

|
|On Fri, May 24, 2002 at 02:08:50PM +0530, BALBIR SINGH wrote:
|> 
|> Sure, I understand what you are talking about now. That makes a lot
|> of sense, I will go through your document once more and read it.
|> I was thinking of the two combined (allocating CPU local memory
|> for certain data structs also includes allocating one copy per CPU).
|> Is there a reason to delay the implementation of CPU local memory,
|> or are we waiting for NUMA guys to do it? Is it not useful in an
|> SMP system to allocate CPU local memory?
|
|In an SMP system, the entire memory is equidistant from the CPUs.
|So, any memory that is exclusively accessed by once cpu only
|is CPU-local. On a NUMA machine however that isn't true, so
|you need special schemes.
|
|The thing about one-copy-per-cpu allocator that I describe is that
|it interleaves per-cpu data to save on space. That is if you
|allocate per-cpu ints i1, i2, it will be laid out in memory like this -
|
|   CPU #0          CPU#1
|
| ---------       ---------         Start of cache line
|   i1              i1
|   i2              i2 
|
|   .               .
|   .               .
|   .               .
|   .               .
|   .               .
|
| ---------       ----------        End of cache line
|
|The per-cpu copies of i1 and i2 for CPU #0 and CPU #1 are allocated from 
|different cache lines of memory, but copy of i1 and i2 for CPU #0 are
|in the same cache line. This interleaving saves space by avoiding
|the need to pad small data structures to cache line sizes.
|This essentially how the static per-cpu data area in 2.5 kernel
|is laid out in memory. Since copies for CPU #0 and CPU #1 for
|the same variable are on different cache lines, assuming that
|code that accesses "this" CPU's copy will not result in cache line
|bouncing. On an SMP machine, I can allocate the cache lines
|for different CPUs, where the interleaved data structures are
|laid out, using the slab allocator. On a NUMA machine however,
|I would want to make sure that cache line allocated for this
|purpose for CPU #N is closest possible to CPU #N.
|
|
|Thanks
|-- 
|Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-05-24  8:38     ` [Lse-tech] " BALBIR SINGH
  2002-05-24  9:13       ` Dipankar Sarma
@ 2002-05-24 14:38       ` Martin J. Bligh
  1 sibling, 0 replies; 15+ messages in thread
From: Martin J. Bligh @ 2002-05-24 14:38 UTC (permalink / raw)
  To: BALBIR SINGH, dipankar
  Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech

> Is there a reason to delay the implementation of CPU local memory,
> or are we waiting for NUMA guys to do it? Is it not useful in an
> SMP system to allocate CPU local memory?

That should be pretty easy to do now, if I understand what you're
asking for ... when you allocate the area for cpu N, just do
alloc_pages_node (cpu_to_nid(smp_processor_id())) or something
similar.

M.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-05-30 13:56 Mala Anand
  2002-05-30 17:55 ` Dipankar Sarma
  0 siblings, 1 reply; 15+ messages in thread
From: Mala Anand @ 2002-05-30 13:56 UTC (permalink / raw)
  To: dipankar
  Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
	Paul McKenney, Rusty Russell

                                                                                                                                               
From: dipankar@beaverton.ibm.com (sent by lse-tech-admin@lists.sourceforge.net)
To: BALBIR SINGH <balbir.singh@wipro.com>
Cc: linux-kernel@vger.kernel.org, Rusty Russell <rusty@rustcorp.com.au>,
    Paul McKenney/Beaverton/IBM@IBMUS, lse-tech@lists.sourceforge.net
Subject: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
Date: 05/24/02 01:13 AM
Please respond to dipankar

>On Fri, May 24, 2002 at 10:07:59AM +0530, BALBIR SINGH wrote:
>> Hello, Dipankar,
>>
>> I would prefer to use the existing slab allocator for this.
>> I am not sure if I understand your requirements for the per-cpu
>> allocator correctly, please correct me if I do not.
>>
>> What I would like to see
>>
>> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
>>    be able to tell for which caches we want per-cpu slabs. This
>>    way we can make even kmalloc per-cpu. Since most of the kernel
>>    would use and dispose memory before they migrate across cpus.
>>    I think this would be useful, but again no data to back it up.

>Allocating cpu-local memory is a different issue altogether.
>Eventually for NUMA support, we will have to do such allocations
>that supports choosing memory closest to a group of CPUs.

>The per-cpu data allocator allocates one copy for *each* CPU.
>It uses the slab allocator underneath. Eventually, when/if we have
>per-cpu/numa-node slab allocation, the per-cpu data allocator
>can allocate every CPU's copy from memory closest to it.

Does this mean that memory allocation will happen on "each" CPU?
Does the slab allocator allocate the memory on each CPU? Your per-cpu
data allocator sounds like the hot-list skbs in the TCP/IP stack,
in the sense that it is one level above the slab allocator and the list is
kept per CPU. If the slab allocator is fixed for per-cpu, do you still
need this per-cpu data allocator?

_____________________________________________
Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-05-30 13:56 Mala Anand
@ 2002-05-30 17:55 ` Dipankar Sarma
  2002-05-31  7:57   ` BALBIR SINGH
  0 siblings, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-30 17:55 UTC (permalink / raw)
  To: Mala Anand
  Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
	Paul McKenney, Rusty Russell

On Thu, May 30, 2002 at 08:56:36AM -0500, Mala Anand wrote:
> 
> >The per-cpu data allocator allocates one copy for *each* CPU.
> >It uses the slab allocator underneath. Eventually, when/if we have
> >per-cpu/numa-node slab allocation, the per-cpu data allocator
> >can allocate every CPU's copy from memory closest to it.
> 
> Does this mean that memory allocation will happen in "each" CPU?
> Do slab allocator allocate the memory in each cpu? Your per-cpu
> data allocator sounds like the hot list skbs that are in the tcpip stack
> in the sense it is one level above the slab allocator and the list is
> kept per cpu.  If slab allocator is fixed for per cpu, do you still
> need this per-cpu data allocator?

Actually, I don't know for sure what plans are afoot to fix the slab allocator
for per-cpu use. One plan I heard about was allocating from per-cpu pools
rather than keeping per-cpu copies. My requirements are similar to
the hot-list skbs'. I want to do this -

	int *ctrp1, *ctrp2;
	
	ctrp1 = kmalloc_percpu(sizeof(*ctrp1), GFP_ATOMIC);
	if (ctrp1 == NULL) {
		/* recover */
	}
	ctrp2 = kmalloc_percpu(sizeof(*ctrp2), GFP_ATOMIC);
	if (ctrp2 == NULL) {
		/* recover */
	}

	(*per_cpu_ptr(ctrp1, smp_processor_id()))++;
	(*this_cpu_ptr(ctrp2))++;

Now I can allocate by making ctrp1/ctrp2 point to an array
of NR_CPUS pointers and kmalloc()ing memory for each CPU's copy of the
int. This is simple and will work.

	void **ptrs = kmalloc(sizeof(*ptrs) * NR_CPUS, flags);

	if (!ptrs)
		return NULL;
	for (i = 0; i < NR_CPUS; i++) {
		ptrs[i] = kmalloc(size, flags);
		if (!ptrs[i])
			goto unwind_oom;
	}
	return ptrs;

unwind_oom:
	while (--i >= 0)
		kfree(ptrs[i]);
	kfree(ptrs);
	return NULL;


However, I would like to use kmalloc_percpu() for allocating very
small objects - typically integer counters or small structures
to be used as per-cpu counters for things like dst entries and dentries.
kmalloc() will waste the rest of the cache line for such small objects.
The alternative is to use a layer of code to interleave small objects
and save on space.


   CPU #0          CPU#1

 ---------       ---------         Start of cache line
   *ctrp1         *ctrp1
   *ctrp2         *ctrp2

   .               .
   .               .
   .               .
   .               .
   .               .

 ---------       ----------        End of cache line

I have an allocator that interleaves objects like this if they can be fitted
into a size that is a factor of SMP_CACHE_BYTES.

I hope someone can tell me that I don't even have to do this. Otherwise
I will go ahead and do my thing.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-05-30 17:55 ` Dipankar Sarma
@ 2002-05-31  7:57   ` BALBIR SINGH
  2002-05-31  8:40     ` Dipankar Sarma
  0 siblings, 1 reply; 15+ messages in thread
From: BALBIR SINGH @ 2002-05-31  7:57 UTC (permalink / raw)
  To: dipankar, 'Mala Anand'
  Cc: linux-kernel, lse-tech-admin, 'Paul McKenney',
	'Rusty Russell'

[-- Attachment #1: Type: text/plain, Size: 2920 bytes --]

|Actually I don't know for sure what plans are afoot to fix the slab
|allocator for per-cpu. One plan I heard about was allocating from
|per-cpu pools rather than per-cpu copies. My requirements are similar
|to the hot list skbs. I want to do this -
|
|	int *ctrp1, *ctrp2;
|	
|	ctrp1 = kmalloc_percpu(sizeof(*ctrp1), GFP_ATOMIC);
|	if (ctrp1 == NULL) {
|		/* recover */
|	}
|	ctrp2 = kmalloc_percpu(sizeof(*ctrp2), GFP_ATOMIC);
|	if (ctrp2 == NULL) {
|		/* recover */
|	}
|
|	(*per_cpu_ptr(ctrp1, smp_processor_id()))++;
|	(*this_cpu_ptr(ctrp2))++;
|
|Now I can allocate this by making ctrp1/ctrp2 point to an array of
|NR_CPUS pointers and kmalloc()ing memory for each CPU's copy of the
|int. This is simple and will work.
|
|	void **ptrs = kmalloc(sizeof(*ptrs) * NR_CPUS, flags);
|
|	if (!ptrs) return NULL;
|	for (i = 0; i < NR_CPUS; i++) {
|	      ptrs[i] = kmalloc(size, flags);
|	      if (!ptrs[i])
|		      goto unwind_oom;
|	}
|
|
|However I would like to use kmalloc_percpu() for allocating very
|small objects - typically integer counters or small structures to be
|used as per-cpu counters for things like dst entries and dentries.
|kmalloc() will waste the rest of the cache line for such small
|objects. The alternative is to use a layer of code that interleaves
|small objects and saves space.
|
|
|   CPU #0          CPU#1
|
| ---------       ---------         Start of cache line
|   *ctrp1         *ctrp1
|   *ctrp2         *ctrp2
|
|   .               .
|   .               .
|   .               .
|   .               .
|   .               .
|
| ---------       ----------        End of cache line


Won't this result in a lot of false sharing? If any of the CPUs
tries to access any of the counters, the entire cache line will be
moved from the current CPU to that one. Isn't this a very bad thing,
or am I missing something? Do all your counters fit into one cache
line?

For some time now, I have been thinking of implementing/supporting
PMEs (Performance Monitoring Events and Counters), so that we can
get real values (at least on x86) rather than guesses about
cacheline bouncing, etc. Do you know if somebody is already doing
this?

Regards,
Balbir

|
|I have an allocator that interleaves objects like this, provided
|their size is a factor of SMP_CACHE_BYTES.
|
|I hope someone can tell me that I don't even have to do this. Otherwise
|I will go ahead and do my thing.
|
|Thanks
|-- 
|Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-05-31  7:57   ` BALBIR SINGH
@ 2002-05-31  8:40     ` Dipankar Sarma
  0 siblings, 0 replies; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-31  8:40 UTC (permalink / raw)
  To: BALBIR SINGH
  Cc: 'Mala Anand', linux-kernel, 'Paul McKenney',
	'Rusty Russell'

On Fri, May 31, 2002 at 01:27:44PM +0530, BALBIR SINGH wrote:
> |
> |
> |   CPU #0          CPU#1
> |
> | ---------       ---------         Start of cache line
> |   *ctrp1         *ctrp1
> |   *ctrp2         *ctrp2
> |
> |   .               .
> |   .               .
> |   .               .
> |   .               .
> |   .               .
> |
> | ---------       ----------        End of cache line
> 
> 
> Won't this result in a lot of false sharing, if any of the CPUs
> tried to access any of the counters, the entire cache line would be
> moved from the current CPU to that CPU. Isn't this a very bad thing or
> am I missing something? Do all your counters fit into one cache line.

Yes, it could result in false sharing. You could probably avoid
that by imposing classes of allocation - say STRICTLY_LOCAL and
ALMOST_LOCAL - so that strictly local objects are not penalized
by occasionally non-local ones. If your code frequently accesses
other CPUs' copies of the object, then you should not be using this
per-cpu allocator in the first place; it would be meaningless.

> 
> For sometime now, I have been thinking of implementing/supporting
> PME's (Peformance Monitoring Events and Counters), so that we can
> get real values (atleast on x86) as compared to our guesses about
> cacheline bouncing, etc. Do you know if somebody is already doing
> this?

You can use SGI kernprof to measure PMCs. See the SGI OSS
website for details. You can count the L2_LINES_IN event to
get a measure of cache line bouncing.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-03 19:12 Mala Anand
  2002-06-03 19:48 ` Dipankar Sarma
  0 siblings, 1 reply; 15+ messages in thread
From: Mala Anand @ 2002-06-03 19:12 UTC (permalink / raw)
  To: dipankar
  Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
	Paul McKenney, Rusty Russell

On Thu, May 30, 2002 at 08:56:36AM -0500, Mala Anand wrote:
>> >The per-cpu data allocator allocates one copy for *each* CPU.
>> >It uses the slab allocator underneath. Eventually, when/if we have
>> >per-cpu/numa-node slab allocation, the per-cpu data allocator
>> >can allocate every CPU's copy from memory closest to it.
>>
>> Does this mean that memory allocation will happen on "each" CPU?
>> Does the slab allocator allocate the memory on each CPU? Your
>> per-cpu data allocator sounds like the hot-list skbs in the TCP/IP
>> stack, in the sense that it is one level above the slab allocator
>> and the list is kept per CPU. If the slab allocator is fixed for
>> per-cpu, do you still need this per-cpu data allocator?

>Actually I don't know for sure what plans are afoot to fix the slab
>allocator for per-cpu. One plan I heard about was allocating from
>per-cpu pools rather than per-cpu copies. My requirements are similar
>to the hot list skbs. I want to do this -

I looked at the slab code; a per-cpu slab is already implemented by
Manfred Spraul. Look at cpu_data[NR_CPUS] in the kmem_cache_s
structure.


Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
  2002-06-03 19:12 Mala Anand
@ 2002-06-03 19:48 ` Dipankar Sarma
  0 siblings, 0 replies; 15+ messages in thread
From: Dipankar Sarma @ 2002-06-03 19:48 UTC (permalink / raw)
  To: Mala Anand
  Cc: BALBIR SINGH, linux-kernel, lse-tech, Paul McKenney,
	Rusty Russell

On Mon, Jun 03, 2002 at 02:12:29PM -0500, Mala Anand wrote:
> I looked at the slab code, per cpu slab is already implemented by Manfred
> Spraul.
> Look at cpu_data[NR_CPUS] in kmem_cache_s structure.
> 

Sorry, I should have been clearer about what I wanted.
Yes, kmem_cache_alloc() allocates one object from "this" CPU's slabs.
What I want is a kmalloc_percpu() that allocates one copy for every
CPU in the system. Think of this as dynamically allocating an array
of NR_CPUS objects, with the objects residing on different cachelines.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-04 12:05 Mala Anand
  0 siblings, 0 replies; 15+ messages in thread
From: Mala Anand @ 2002-06-04 12:05 UTC (permalink / raw)
  To: dipankar; +Cc: BALBIR SINGH, linux-kernel, Paul McKenney,
	'Rusty Russell'




>> For sometime now, I have been thinking of implementing/supporting
>> PME's (Peformance Monitoring Events and Counters), so that we can
>> get real values (atleast on x86) as compared to our guesses about
>> cacheline bouncing, etc. Do you know if somebody is already doing
>> this?

>You can use SGI kernprof to measure PMCs. See the SGI oss
>website for details. You can count L2_LINES_IN event to
>get a measure of cache line bouncing.

I have profiled L2_LINES_OUT on the netperf TCP_STREAM workload.
The following is the profile from a 100Mb Ethernet TCP_STREAM
4-adapter test on a baseline 2.4.17 kernel:

poll_idle [c0105280]: 121743
csum_partial_copy_generic [c0277f60]: 27951
schedule [c0114190]: 24853
do_softirq [c011b9e0]: 9130
mod_timer [c011eb10]: 6997
tcp_v4_rcv [c0258b70]: 6449
speedo_interrupt [c01a83e0]: 6262
__wake_up [c0114720]: 6143
tcp_recvmsg [c0249930]: 5199
USER [c0124a70]: 5081
speedo_start_xmit [c01a7fc0]: 4349
tcp_rcv_established [c0250c90]: 3724
tcp_data_wait [c02496d0]: 3610
speedo_rx [c01a8900]: 3358
handle_IRQ_event [c0108a60]: 3339
__kfree_skb [c0230520]: 2716
net_rx_action [c0233f10]: 2510
mcount [c02784e0]: 2261
ip_route_input [c023ee40]: 2028
ip_rcv [c0241180]: 1912
ip_queue_xmit [c0243950]: 1886
tcp_transmit_skb [c0252400]: 1877
__switch_to [c01059d0]: 1522
tcp_prequeue_process [c0249860]: 1461
skb_copy_and_csum_datagram_iovec [c0232590]: 1393
netif_rx [c0233ba0]: 1360
sock_recvmsg [c022cfc0]: 1358
eth_type_trans [c023a260]: 1350
tcp_event_data_recv [c024c320]: 1344
ip_output [c0243810]: 1333
speedo_tx_buffer_gc [c01a81d0]: 1239
tcp_v4_do_rcv [c0258a50]: 1169
kmalloc [c012dda0]: 1154
tcp_copy_to_iovec [c0250ae0]: 1138
fput [c0136d30]: 1124
sys_recvfrom [c022df40]: 1105
speedo_refill_rx_buf [c01a86d0]: 1045
kfree [c012df90]: 1015
dev_queue_xmit [c0233860]: 952
alloc_skb [c02301e0]: 936
do_gettimeofday [c010c520]: 919
system_call [c01070d8]: 918
fget [c0136e30]: 901
sys_socketcall [c022e610]: 887
skb_release_data [c0230420]: 882
ip_local_deliver [c0241020]: 808
skb_copy_and_csum_datagram [c02322a0]: 805
csum_partial [c0277e78]: 781
cleanup_rbuf [c02495d0]: 736
kfree_skbmem [c02304b0]: 696
sock_wfree [c022f380]: 600
inet_recvmsg [c0264720]: 596
speedo_refill_rx_buffers [c01a88b0]: 568
qdisc_restart [c023a4e0]: 568
__generic_copy_from_user [c02781d0]: 501
do_check_pgt_cache [c0112de0]: 494
sys_recv [c022e020]: 488
remove_wait_queue [c0115930]: 487
check_pgt_cache [c0124b30]: 483
add_wait_queue [c01158b0]: 430
__generic_copy_to_user [c0278180]: 429
tcp_send_delayed_ack [c0254e30]: 405
tcp_v4_checksum_init [c0258930]: 391
cpu_idle [c01052b0]: 386
sockfd_lookup [c022cd70]: 370
pfifo_fast_enqueue [c023a940]: 350
schedule_timeout [c0114060]: 314
pfifo_fast_dequeue [c023a9c0]: 304


To eliminate the cache-line bouncing, I applied IRQ and process
affinity. The L2_LINES_OUT profile with affinity:

poll_idle [c0105280]: 72241
csum_partial_copy_generic [c0289500]: 13838
schedule [c0114190]: 9036
speedo_interrupt [c01b9980]: 5066
do_softirq [c011b9e0]: 3922
USER [c0124c80]: 2573
tcp_recvmsg [c025aed0]: 2154
__wake_up [c0114720]: 1779
speedo_start_xmit [c01b9560]: 1654
mod_timer [c011eb10]: 1551
tcp_rcv_established [c0262230]: 1336
mcount [c0289a80]: 1298
tcp_transmit_skb [c02639a0]: 984
__switch_to [c01059d0]: 927
do_gettimeofday [c010c520]: 876
sys_socketcall [c023fbb0]: 872
ip_rcv [c0252720]: 872
ip_route_input [c02503e0]: 868
ip_queue_xmit [c0254ef0]: 805
system_call [c01070d8]: 772
tcp_data_wait [c025ac70]: 748
__kfree_skb [c0241ac0]: 640
tcp_v4_rcv [c026a110]: 629
do_check_pgt_cache [c0112de0]: 584
ip_output [c0254db0]: 575
net_rx_action [c02454b0]: 565
kfree [c012e1a0]: 556
fput [c0136f40]: 524
csum_partial [c0289418]: 514
handle_IRQ_event [c0108a60]: 507
sock_recvmsg [c023e560]: 479
cleanup_rbuf [c025ab70]: 430
skb_copy_and_csum_datagram [c0243840]: 428
dev_queue_xmit [c0244e00]: 409
sys_recvfrom [c023f4e0]: 404
kfree_skbmem [c0241a50]: 391
speedo_tx_buffer_gc [c01b9770]: 388
skb_copy_and_csum_datagram_iovec [c0243b30]: 383
kmalloc [c012dfb0]: 379
ip_local_deliver [c02525c0]: 362
netif_rx [c0245140]: 361
tcp_copy_to_iovec [c0262080]: 356
tcp_event_data_recv [c025d8c0]: 334
tcp_prequeue_process [c025ae00]: 319
fget [c0137040]: 300
tcp_v4_do_rcv [c0269ff0]: 285
add_wait_queue [c01158b0]: 267
skb_release_data [c02419c0]: 249
alloc_skb [c0241780]: 249
schedule_timeout [c0114060]: 238
remove_wait_queue [c0115930]: 233
sockfd_lookup [c023e310]: 232
__generic_copy_from_user [c0289770]: 224
sys_recv [c023f5c0]: 216


Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-04 21:11 Paul McKenney
  0 siblings, 0 replies; 15+ messages in thread
From: Paul McKenney @ 2002-06-04 21:11 UTC (permalink / raw)
  To: Mala Anand; +Cc: BALBIR SINGH, dipankar, linux-kernel, 'Rusty Russell'


For whatever it is worth, here are the functions ranked in
decreasing order of savings due to the IRQ and process
affinity.  Column 2 is profile ticks with affinity, column 3
is without, and column 4 is the difference.

                              Thanx, Paul

                       poll_idle   72241 121743 49502
                        schedule    9036  24853 15817
       csum_partial_copy_generic   13838  27951 14113
                      tcp_v4_rcv     629   6449  5820
                       mod_timer    1551   6997  5446
                      do_softirq    3922   9130  5208
                       __wake_up    1779   6143  4364
                       speedo_rx       0   3358  3358
                     tcp_recvmsg    2154   5199  3045
                   tcp_data_wait     748   3610  2862
                handle_IRQ_event     507   3339  2832
               speedo_start_xmit    1654   4349  2695
                            USER    2573   5081  2508
             tcp_rcv_established    1336   3724  2388
                     __kfree_skb     640   2716  2076
                   net_rx_action     565   2510  1945
                  eth_type_trans       0   1350  1350
                speedo_interrupt    5066   6262  1196
                  ip_route_input     868   2028  1160
            tcp_prequeue_process     319   1461  1142
                   ip_queue_xmit     805   1886  1081
            speedo_refill_rx_buf       0   1045  1045
                          ip_rcv     872   1912  1040
             tcp_event_data_recv     334   1344  1010
skb_copy_and_csum_datagram_iovec     383   1393  1010
                        netif_rx     361   1360   999
                          mcount    1298   2261   963
                tcp_transmit_skb     984   1877   893
                   tcp_v4_do_rcv     285   1169   884
                    sock_recvmsg     479   1358   879
             speedo_tx_buffer_gc     388   1239   851
               tcp_copy_to_iovec     356   1138   782
                         kmalloc     379   1154   775
                       ip_output     575   1333   758
                    sys_recvfrom     404   1105   701
                       alloc_skb     249    936   687
                skb_release_data     249    882   633
                            fget     300    901   601
                            fput     524   1124   600
                      sock_wfree       0    600   600
                    inet_recvmsg       0    596   596
                     __switch_to     927   1522   595
                   qdisc_restart       0    568   568
        speedo_refill_rx_buffers       0    568   568
                  dev_queue_xmit     409    952   543
                 check_pgt_cache       0    483   483
                           kfree     556   1015   459
                ip_local_deliver     362    808   446
          __generic_copy_to_user       0    429   429
            tcp_send_delayed_ack       0    405   405
            tcp_v4_checksum_init       0    391   391
                        cpu_idle       0    386   386
      skb_copy_and_csum_datagram     428    805   377
              pfifo_fast_enqueue       0    350   350
                    cleanup_rbuf     430    736   306
                    kfree_skbmem     391    696   305
              pfifo_fast_dequeue       0    304   304
        __generic_copy_from_user     224    501   277
                        sys_recv     216    488   272
                    csum_partial     514    781   267
               remove_wait_queue     233    487   254
                  add_wait_queue     267    430   163
                     system_call     772    918   146
                   sockfd_lookup     232    370   138
                schedule_timeout     238    314    76
                 do_gettimeofday     876    919    43
                  sys_socketcall     872    887    15
              do_check_pgt_cache     584    494   -90

>>> For sometime now, I have been thinking of implementing/supporting
>>> PME's (Peformance Monitoring Events and Counters), so that we can
>>> get real values (atleast on x86) as compared to our guesses about
>>> cacheline bouncing, etc. Do you know if somebody is already doing
>>> this?
>
>>You can use SGI kernprof to measure PMCs. See the SGI oss
>>website for details. You can count L2_LINES_IN event to
>>get a measure of cache line bouncing.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2002-06-04 21:11 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-05-23 13:08 [RFC] Dynamic percpu data allocator Dipankar Sarma
2002-05-24  4:37 ` BALBIR SINGH
2002-05-24  6:13   ` Dipankar Sarma
2002-05-24  8:38     ` [Lse-tech] " BALBIR SINGH
2002-05-24  9:13       ` Dipankar Sarma
2002-05-24 11:59         ` BALBIR SINGH
2002-05-24 14:38       ` Martin J. Bligh
  -- strict thread matches above, loose matches on Subject: below --
2002-05-30 13:56 Mala Anand
2002-05-30 17:55 ` Dipankar Sarma
2002-05-31  7:57   ` BALBIR SINGH
2002-05-31  8:40     ` Dipankar Sarma
2002-06-03 19:12 Mala Anand
2002-06-03 19:48 ` Dipankar Sarma
2002-06-04 12:05 Mala Anand
2002-06-04 21:11 Paul McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox