* [RFC] Dynamic percpu data allocator
@ 2002-05-23 13:08 Dipankar Sarma
2002-05-24 4:37 ` BALBIR SINGH
0 siblings, 1 reply; 7+ messages in thread
From: Dipankar Sarma @ 2002-05-23 13:08 UTC (permalink / raw)
To: linux-kernel; +Cc: Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]
If static percpu area is around, can dynamic percpu data allocator
be far behind ;-)
As a part of scalable kernel primitives work for higher-end SMP
and NUMA architectures, we have been seeing increasing need
for per-cpu data in various key areas. Rusty's percpu area
work has added a way in 2.5 kernels to maintain static per-cpu
data. Inspired by that work, I have implemented a dynamic per-cpu
data allocator. Currently it is useful to us for -
1. Per-cpu data in dynamically allocated structures.
2. per-cpu statistics and reference counters
3. Per-cpu data in drivers/modules.
4. Scalable locking primitives like local spin only locks
(or even big reader locks).
Included in this mail is a document that describes the allocator.
I would really appreciate it if people commented on it. I am
particularly interested in the eek-value of the interfaces,
especially the bit about keeping the type information in
a dummy variable in a union.
The actual patch will follow soon, unless someone convinces
me quickly that there is a saner way to do this.
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
[-- Attachment #2: percpu_data.txt --]
[-- Type: text/plain, Size: 4302 bytes --]
Per-CPU Data Allocator
----------------------
Interfaces
----------
The interfaces for the per-cpu data allocator are similar to Rusty's static
per-CPU data interfaces. One clear goal was to make sure that they
add no overhead in UP kernels, so for UP kernels they
reduce to ordinary variables. The basic interfaces are these -
1. percpu_data_declare(type,var)
2. percpu_data_alloc(var)
3. percpu_data(var,cpu)
4. this_percpu_data(var)
5. percpu_data_free(var)
For example, we can declare the following structure -
struct yy {
        int a;
        percpu_data_declare(int, b);
        int c;
        int d;
};
We can allocate memory for a per-cpu int like this -

        struct yy y;

        if (percpu_data_alloc(y.b)) {
                /* Failed */
        }
To use it -

        cpu = smp_processor_id();
        percpu_data(y.b, cpu)++;

or

        this_percpu_data(y.b)++;

To free the per-CPU data -

        percpu_data_free(y.b);
The data declaration interface is a bit unnatural, but I can't think of
anything better that would let me preserve the type information for the
original variable so that appropriate typecasting can be done for other
interfaces. percpu_data_declare(type,var) expands to -
union {
        percpu_data_t *percpu;
        typeof(type) realtype;
} var

percpu_data_t maintains the pointers necessary to look up the real
per-cpu data -

typedef struct {
        void *blkaddrs[NR_CPUS];
        struct percpu_data_blk *blkp;
} percpu_data_t;
The type information is used (via typeof()) to typecast data
accesses -

#define percpu_data(var, cpu) \
        (*((typeof(var.realtype) *)var.percpu->blkaddrs[cpu]))
Using a pointer to percpu_data_t adds
an overhead of an additional memory reference while accessing
percpu data. This can be avoided by embedding the percpu_data_t
structure, but since percpu_data_t has an NR_CPUS array, it changes
structure sizes very radically. It is a tradeoff; we could go either
way.
Potential Uses
--------------
1. Scalable counters - they already use a scaled down version of the
allocator inside. Per-cpu counters can reduce the overhead of cacheline
bouncing.
2. Big reader lock - this need not be statically allocated anymore.
3. Per-CPU data in modules - Rusty's static per-cpu scheme doesn't
work in modules; at least I haven't seen a way to do this.
4. Scalable locks - per-cpu data is commonly used in scalable locks
like MCS locks.
Allocator
---------
The current approach is that unless there is interest in a dynamic
percpu data allocator, there is no point in spending too much time
in writing a sophisticated one.
Allocation Policy
-----------------
1. If the allocation request size is a factor of SMP_CACHE_BYTES,
then it will be interleaved to avoid fragmentation as much as possible.
If the request size is a multiple of SMP_CACHE_BYTES, fragmentation
will still be avoided. Anything else will result in fragmentation.
The current allocator doesn't make any attempt to use the fragmented
portion, in a sense it is like padding to cache line boundary.
2. A simple binary search tree is used to maintain the memory
objects (could be blocks of them) of different sizes. For
interleaving, the objects are maintained in blocks, and a freelist
mechanism similar to the slab allocator's is used to allocate
objects from within a block. Each block is allocated from a kmem_cache.
If there is sufficient interest in the per-cpu data allocator,
then I will revisit the allocator and see if fragmentation can
be reduced for non-multiples/non-factors of SMP_CACHE_BYTES.
For non-factor allocations, the residual part of the cache line
can be maintained and a best-factor-fit algorithm can be used
to allocate from it. This makes an assumption that kernel allocation
requests are likely to contain repetitive patterns of similar sizes.
Alignment Issues
----------------
The current alignment strategy is this -
1. Minimum allocation size is sizeof(int).
2. Each block is aligned to cache line boundary and size of any
object allocated within the block is either a factor of the block
size or equal to the block size. However, I am not sure if this
guarantees proper alignment on all architectures. We need to
investigate some more about this.
I am sure there is a 69-bit transputer architecture somewhere that
breaks this allocator ;-)
* RE: [RFC] Dynamic percpu data allocator
2002-05-23 13:08 [RFC] Dynamic percpu data allocator Dipankar Sarma
@ 2002-05-24 4:37 ` BALBIR SINGH
2002-05-24 6:13 ` Dipankar Sarma
0 siblings, 1 reply; 7+ messages in thread
From: BALBIR SINGH @ 2002-05-24 4:37 UTC (permalink / raw)
To: dipankar, linux-kernel; +Cc: Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 2966 bytes --]
Hello, Dipankar,
I would prefer to use the existing slab allocator for this.
I am not sure if I understand your requirements for the per-cpu
allocator correctly; please correct me if I do not.
What I would like to see
1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
be able to tell for which caches we want per-cpu slabs. This
way we can make even kmalloc per-cpu, since most of the kernel
would use and dispose of memory before it migrates across CPUs.
I think this would be useful, but again I have no data to back it up.
2. I hate the use of NR_CPUS. If I compile an SMP kernel on a two
CPU machine, I still end up with support for 32 CPUs. What I would
like to see is that in new kernel code we treat equivalence
classes of CPUs as belonging to the same CPU. For example, with

        void *blkaddrs[NR_CPUS];

while searching, instead of just doing blkaddrs[smp_processor_id()],
if the slot for smp_processor_id() is full, we should look through

        for (i = 0; i < NR_CPUS; i++) {
                look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()
                                   (or whatever)]
                if successful break
        }
On a two CPU system, CPUs 1, 3, 5 ... belong to the same equivalence
class, so we might as well utilize them. Even with a per-cpu pool,
threads could use the slots in the per-cpu equivalence classes in
parallel (I have a very rough idea about this).
Does any of this make sense,
Balbir
|-----Original Message-----
|From: linux-kernel-owner@vger.kernel.org
|[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Dipankar Sarma
|Sent: Thursday, May 23, 2002 6:39 PM
|To: linux-kernel@vger.kernel.org
|Cc: Rusty Russell; Paul McKenney; lse-tech@lists.sourceforge.net
|Subject: [RFC] Dynamic percpu data allocator
|
|
|If static percpu area is around, can dynamic percpu data allocator
|be far behind ;-)
|
|As a part of scalable kernel primitives work for higher-end SMP
|and NUMA architectures, we have been seeing increasing need
|for per-cpu data in various key areas. Rusty's percpu area
|work has added a way in 2.5 kernels to maintain static per-cpu
|data. Inspired by that work, I have implemented a dynamic per-cpu
|data allocator. Currently it is useful to us for -
|
|1. Per-cpu data in dynamically allocated structures.
|2. per-cpu statistics and reference counters
|3. Per-cpu data in drivers/modules.
|4. Scalable locking primitives like local spin only locks
| (or even big reader locks).
|
|Included in this mail is a document that describes the allocator.
|I would really appreciate it if people commented on it. I am
|particularly interested in the eek-value of the interfaces,
|especially the bit about keeping the type information in
|a dummy variable in a union.
|
|The actual patch will follow soon, unless someone convinces
|me quickly that there is a saner way to do this.
|
|Thanks
|--
|Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
|
[-- Attachment #2: Wipro_Disclaimer.txt --]
[-- Type: text/plain, Size: 490 bytes --]
**************************Disclaimer************************************
Information contained in this E-MAIL being proprietary to Wipro Limited is
'privileged' and 'confidential' and intended for use only by the individual
or entity to which it is addressed. You are notified that any use, copying
or dissemination of the information contained in the E-MAIL in any manner
whatsoever is strictly prohibited.
***************************************************************************
* Re: [RFC] Dynamic percpu data allocator
2002-05-24 4:37 ` BALBIR SINGH
@ 2002-05-24 6:13 ` Dipankar Sarma
2002-05-24 8:38 ` [Lse-tech] " BALBIR SINGH
0 siblings, 1 reply; 7+ messages in thread
From: Dipankar Sarma @ 2002-05-24 6:13 UTC (permalink / raw)
To: BALBIR SINGH; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
On Fri, May 24, 2002 at 10:07:59AM +0530, BALBIR SINGH wrote:
> Hello, Dipankar,
>
> I would prefer to use the existing slab allocator for this.
> I am not sure if I understand your requirements for the per-cpu
> allocator correctly, please correct me if I do not.
>
> What I would like to see
>
> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
> be able to tell for which caches we want per-cpu slabs. This
> way we can make even kmalloc per-cpu, since most of the kernel
> would use and dispose of memory before it migrates across CPUs.
> I think this would be useful, but again I have no data to back it up.
Allocating cpu-local memory is a different issue altogether.
Eventually for NUMA support, we will have to do such allocations
that support choosing memory closest to a group of CPUs.
The per-cpu data allocator allocates one copy for *each* CPU.
It uses the slab allocator underneath. Eventually, when/if we have
per-cpu/numa-node slab allocation, the per-cpu data allocator
can allocate every CPU's copy from memory closest to it.
I suppose you worked on DYNIX/ptx? Think of this as dynamic
plocal.
>
> 2. I hate the use of NR_CPUS. If I compiled an SMP kernel on a two
> CPU machine, I still end up with support for 32 CPUs. What I would
If you don't like it, just define NR_CPUS to 2 and recompile.
> like to see is that in new kernel code we treat equivalence
> classes of CPUs as belonging to the same CPU. For example
>
> void *blkaddrs[NR_CPUS];
>
> while searching, instead of doing
>
> blkaddrs[smp_processor_id()], if the slot for smp_processor_id() is full,
> we should look through
>
> for (i = 0; i < NR_CPUS; i++) {
> look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()(or
> whatever)]
> if successful break
> }
How will it work? You could be accessing memory beyond blkaddrs[].
I use NR_CPUS for allocations because if I don't, supporting CPU
hotplug will be a nightmare. Resizing so many data structures is
not an option. I believe Rusty and/or the cpu hotplug work is
adding a for_each_cpu() macro to walk the CPUs that takes
care of everything including sparse CPU numbers. Until then,
I would use for (i = 0; i < smp_num_cpus; i++).
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 6:13 ` Dipankar Sarma
@ 2002-05-24 8:38 ` BALBIR SINGH
2002-05-24 9:13 ` Dipankar Sarma
2002-05-24 14:38 ` Martin J. Bligh
0 siblings, 2 replies; 7+ messages in thread
From: BALBIR SINGH @ 2002-05-24 8:38 UTC (permalink / raw)
To: dipankar; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 3192 bytes --]
|> Hello, Dipankar,
|>
|> I would prefer to use the existing slab allocator for this.
|> I am not sure if I understand your requirements for the per-cpu
|> allocator correctly, please correct me if I do not.
|>
|> What I would like to see
|>
|> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
|> be able to tell for which caches we want per-cpu slabs. This
|> way we can make even kmalloc per-cpu. Since most of the kernel
|> would use and dispose memory before they migrate across cpus.
|> I think this would be useful, but again no data to back it up.
|
|Allocating cpu-local memory is a different issue altogether.
|Eventually for NUMA support, we will have to do such allocations
|that support choosing memory closest to a group of CPUs.
|
|The per-cpu data allocator allocates one copy for *each* CPU.
|It uses the slab allocator underneath. Eventually, when/if we have
|per-cpu/numa-node slab allocation, the per-cpu data allocator
|can allocate every CPU's copy from memory closest to it.
|
|I suppose you worked on DYNIX/ptx ? Think of this as dynamic
|plocal.
Sure, I understand what you are talking about now. That makes a lot
of sense; I will go through your document once more.
I was thinking of the two combined (allocating CPU local memory
for certain data structs also includes allocating one copy per CPU).
Is there a reason to delay the implementation of CPU local memory,
or are we waiting for NUMA guys to do it? Is it not useful in an
SMP system to allocate CPU local memory?
|
|>
|> 2. I hate the use of NR_CPUS. If I compiled an SMP kernel on a two
|> CPU machine, I still end up with support for 32 CPUs. What I would
|
|If you don't like it, just define NR_CPUS to 2 and recompile.
|
That does make sense, but I would like to keep my headers in sync, so that
all patches apply cleanly.
|
|> like to see is that in new kernel code we treat equivalence
|> classes of CPUs as belonging to the same CPU. For example
|>
|> void *blkaddrs[NR_CPUS];
|>
|> while searching, instead of doing
|>
|> blkaddrs[smp_processor_id()], if the slot for
|smp_processor_id() is full,
|> we should look through
|>
|> for (i = 0; i < NR_CPUS; i++) {
|> look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus()(or
|> whatever)]
|> if successful break
|> }
|
|How will it work? You could be accessing memory beyond blkaddrs[].
|
Sorry, that could happen; it should be
(smp_processor_id() + i) % smp_number_of_cpus().
|I use NR_CPUS for allocations because if I don't, supporting CPU
|hotplug will be a nightmare. Resizing so many data structures is
|not an option. I believe Rusty and/or cpu hotplug work is
|adding a for_each_cpu() macro to walk the CPUs that take
|care of everything including sparse CPU numbers. Until then,
|I would use for (i = 0; i < smp_num_cpus; i++).
I think that does make a whole lot of sense; I was not thinking
about hotplugging CPUs (dumb me).
|
|Thanks
|--
|Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
|
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 8:38 ` [Lse-tech] " BALBIR SINGH
@ 2002-05-24 9:13 ` Dipankar Sarma
2002-05-24 11:59 ` BALBIR SINGH
2002-05-24 14:38 ` Martin J. Bligh
1 sibling, 1 reply; 7+ messages in thread
From: Dipankar Sarma @ 2002-05-24 9:13 UTC (permalink / raw)
To: BALBIR SINGH; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
On Fri, May 24, 2002 at 02:08:50PM +0530, BALBIR SINGH wrote:
>
> Sure, I understand what you are talking about now. That makes a lot
> of sense, I will go through your document once more and read it.
> I was thinking of the two combined (allocating CPU local memory
> for certain data structs also includes allocating one copy per CPU).
> Is there a reason to delay the implementation of CPU local memory,
> or are we waiting for NUMA guys to do it? Is it not useful in an
> SMP system to allocate CPU local memory?
In an SMP system, the entire memory is equidistant from the CPUs.
So, any memory that is exclusively accessed by one CPU only
is CPU-local. On a NUMA machine, however, that isn't true, so
you need special schemes.
The thing about the one-copy-per-cpu allocator that I describe is that
it interleaves per-cpu data to save space. That is, if you
allocate per-cpu ints i1 and i2, they will be laid out in memory like this -

        CPU #0            CPU #1

        ---------         ---------    Start of cache line
        i1                i1
        i2                i2
        .                 .
        .                 .
        .                 .
        ---------         ---------    End of cache line

The per-cpu copies of i1 and i2 for CPU #0 and CPU #1 are allocated from
different cache lines of memory, but the copies of i1 and i2 for CPU #0 are
in the same cache line. This interleaving saves space by avoiding
the need to pad small data structures to cache line sizes.
This is essentially how the static per-cpu data area in the 2.5 kernel
is laid out in memory. Since the copies for CPU #0 and CPU #1 of
the same variable are on different cache lines, code that accesses
"this" CPU's copy will not cause cache line bouncing. On an SMP
machine, I can allocate the cache lines for different CPUs, where
the interleaved data structures are laid out, using the slab
allocator. On a NUMA machine, however, I would want to make sure
that the cache line allocated for this purpose for CPU #N is as
close as possible to CPU #N.
Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 9:13 ` Dipankar Sarma
@ 2002-05-24 11:59 ` BALBIR SINGH
0 siblings, 0 replies; 7+ messages in thread
From: BALBIR SINGH @ 2002-05-24 11:59 UTC (permalink / raw)
To: dipankar; +Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
[-- Attachment #1: Type: text/plain, Size: 2624 bytes --]
Thanks - when I said CPU local memory on SMP, I meant the CPU cache.
Sorry for the confusion. I think your dynamic allocator makes
sense.
Balbir
|
|On Fri, May 24, 2002 at 02:08:50PM +0530, BALBIR SINGH wrote:
|>
|> Sure, I understand what you are talking about now. That makes a lot
|> of sense, I will go through your document once more and read it.
|> I was thinking of the two combined (allocating CPU local memory
|> for certain data structs also includes allocating one copy per CPU).
|> Is there a reason to delay the implementation of CPU local memory,
|> or are we waiting for NUMA guys to do it? Is it not useful in an
|> SMP system to allocate CPU local memory?
|
|In an SMP system, the entire memory is equidistant from the CPUs.
|So, any memory that is exclusively accessed by one cpu only
|is CPU-local. On a NUMA machine however that isn't true, so
|you need special schemes.
|
|The thing about one-copy-per-cpu allocator that I describe is that
|it interleaves per-cpu data to save on space. That is if you
|allocate per-cpu ints i1, i2, it will be laid out in memory like this -
|
|        CPU #0            CPU #1
|
|        ---------         ---------    Start of cache line
|        i1                i1
|        i2                i2
|        .                 .
|        .                 .
|        .                 .
|        ---------         ---------    End of cache line
|
|The per-cpu copies of i1 and i2 for CPU #0 and CPU #1 are allocated from
|different cache lines of memory, but copy of i1 and i2 for CPU #0 are
|in the same cache line. This interleaving saves space by avoiding
|the need to pad small data structures to cache line sizes.
|This is essentially how the static per-cpu data area in 2.5 kernel
|is laid out in memory. Since copies for CPU #0 and CPU #1 for
|the same variable are on different cache lines, assuming that
|code that accesses "this" CPU's copy will not result in cache line
|bouncing. On an SMP machine, I can allocate the cache lines
|for different CPUs, where the interleaved data structures are
|laid out, using the slab allocator. On a NUMA machine however,
|I would want to make sure that cache line allocated for this
|purpose for CPU #N is closest possible to CPU #N.
|
|
|Thanks
|--
|Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
* RE: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
2002-05-24 8:38 ` [Lse-tech] " BALBIR SINGH
2002-05-24 9:13 ` Dipankar Sarma
@ 2002-05-24 14:38 ` Martin J. Bligh
1 sibling, 0 replies; 7+ messages in thread
From: Martin J. Bligh @ 2002-05-24 14:38 UTC (permalink / raw)
To: BALBIR SINGH, dipankar
Cc: linux-kernel, Rusty Russell, Paul McKenney, lse-tech
> Is there a reason to delay the implementation of CPU local memory,
> or are we waiting for NUMA guys to do it? Is it not useful in an
> SMP system to allocate CPU local memory?
That should be pretty easy to do now, if I understand what you're
asking for ... when you allocate the area for cpu N, just do
alloc_pages_node(cpu_to_nid(smp_processor_id())) or something
similar.
M.