* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig
From: Andi Kleen @ 2006-08-08 5:14 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andrew Morton, Eric W. Biederman, Randy.Dunlap,
Protasevich, Natalie, linux-kernel, linux-arch
On Tuesday 08 August 2006 07:09, Paul Mackerras wrote:
> Andrew Morton writes:
[adding linux-arch; talking about doing extensible per-cpu areas
by prereserving virtual space and then filling it up later as needed]
> > > Drawback would be some more TLB misses.
> >
> > yup. On some (important) architectures - I'm not sure which architectures
> > do the bigpage-for-kernel trick.
>
> I looked at optimizing the per-cpu data accessors on PowerPC and only
> ever saw fractions of a percent change in overall performance, which
> says to me that we don't actually use per-cpu data all that much. So
> unless you make per-cpu data really really slow, I doubt that we'll
> see any significant performance difference.
The main problem is that we would need a "vmalloc reserve first; allocate pages
later" interface. On x86 it would be easy by just splitting up vmalloc/vmap a bit
again. Does anybody else see problems with implementing that on any
other architecture?
This wouldn't be truly demand paged, just pages initialized on allocation.
-Andi
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig
From: Martin Schwidefsky @ 2006-08-08 8:17 UTC (permalink / raw)
To: Andi Kleen
Cc: Paul Mackerras, Andrew Morton, Eric W. Biederman, Randy.Dunlap,
Protasevich, Natalie, linux-kernel, linux-arch
On Tue, 2006-08-08 at 07:14 +0200, Andi Kleen wrote:
> > > > Drawback would be some more TLB misses.
> > >
> > > yup. On some (important) architectures - I'm not sure which architectures
> > > do the bigpage-for-kernel trick.
> >
> > I looked at optimizing the per-cpu data accessors on PowerPC and only
> > ever saw fractions of a percent change in overall performance, which
> > says to me that we don't actually use per-cpu data all that much. So
> > unless you make per-cpu data really really slow, I doubt that we'll
> > see any significant performance difference.
>
> The main problem is that we would need a "vmalloc reserve first; allocate pages
> later" interface. On x86 it would be easy by just splitting up vmalloc/vmap a bit
> again. Does anybody else see problems with implementing that on any
> other architecture?
"vmalloc reserve first; allocate pages later" would be a really nice
feature. We could use this on s390 to implement the virtual mem_map
array spanning the whole 64 bit address range (with holes in it). To
make it perfect a "deallocate pages; keep vmalloc reserve" should be
added, then we could free parts of the mem_map array again on hot memory
remove.
I don't see a problem for s390.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig
From: Luck, Tony @ 2006-08-09 17:58 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Andi Kleen, Paul Mackerras, Andrew Morton, Eric W. Biederman,
Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch
On Tue, Aug 08, 2006 at 10:17:53AM +0200, Martin Schwidefsky wrote:
> "vmalloc reserve first; allocate pages later" would be a really nice
> feature. We could use this on s390 to implement the virtual mem_map
> array spanning the whole 64 bit address range (with holes in it). To
> make it perfect a "deallocate pages; keep vmalloc reserve" should be
> added, then we could free parts of the mem_map array again on hot memory
> remove.
IA-64 already has some arch-specific code to allocate a sparse
virtual memory map ... having generic code to do so would be
nice, but I foresee some chicken-and-egg problems in getting enough
of the vmalloc/vmap framework up and running before mem_map[] has
been allocated.
That, and the hotplug memory folks don't like the virtual mem_map
code and have spurned it in favour of SPARSE.
-Tony
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig
From: Dave Hansen @ 2006-08-09 18:25 UTC (permalink / raw)
To: Luck, Tony
Cc: Martin Schwidefsky, Andi Kleen, Paul Mackerras, Andrew Morton,
Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie,
linux-kernel, linux-arch, Andy Whitcroft
On Wed, 2006-08-09 at 10:58 -0700, Luck, Tony wrote:
> On Tue, Aug 08, 2006 at 10:17:53AM +0200, Martin Schwidefsky wrote:
> > "vmalloc reserve first; allocate pages later" would be a really nice
> > feature. We could use this on s390 to implement the virtual mem_map
> > array spanning the whole 64 bit address range (with holes in it). To
> > make it perfect a "deallocate pages; keep vmalloc reserve" should be
> > added, then we could free parts of the mem_map array again on hot memory
> > remove.
Martin,
We can already do this partial freeing today with sparsemem and memory
hot-remove. It would be a shame to have to do another implementation
for each and every architecture that wants to do it.
For the very sparse 64-bit address spaces, I would be really interested
to see an alternate pfn_to_section_nr() that relies on something other
than a direct correlation between physical address and section number.
Instead of:
#define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
We could do:
static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return some_hash(pfn) % NR_OF_SECTION_SLOTS;
}
This would, of course, still have limits on how _many_ sections can be
populated. But it would decouple the allowable physical address ranges
from the number of populated sections.
Of course, it isn't quite that simple. You need to make sure that the
sparse code is free of all connections between section number and
physical address, as well as handle things like hash collisions. We'd
probably also need to store the _actual_ physical address somewhere
because we can't derive it from the section number any more.
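A toy model of such a hashed section table (all names hypothetical; linear probing stands in for whatever collision handling real code would use, and the stored start_pfn is the "actual physical address somewhere"):

```c
#include <assert.h>

#define PFN_SECTION_SHIFT	8	/* 256 pages per section (example) */
#define NR_OF_SECTION_SLOTS	1024

struct toy_section {
	unsigned long start_pfn;	/* kept because the slot number no
					 * longer encodes the address */
	int in_use;
};

static struct toy_section sections[NR_OF_SECTION_SLOTS];

static unsigned long some_hash(unsigned long pfn)
{
	return (pfn >> PFN_SECTION_SHIFT) * 2654435761UL; /* Knuth-style */
}

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return some_hash(pfn) % NR_OF_SECTION_SLOTS;
}

/* Insert the section containing pfn; linear probing on collision. */
static long section_insert(unsigned long pfn)
{
	unsigned long start = pfn & ~((1UL << PFN_SECTION_SHIFT) - 1);
	unsigned long nr = pfn_to_section_nr(pfn);
	int i;

	for (i = 0; i < NR_OF_SECTION_SLOTS; i++) {
		unsigned long slot = (nr + i) % NR_OF_SECTION_SLOTS;
		if (!sections[slot].in_use) {
			sections[slot].in_use = 1;
			sections[slot].start_pfn = start;
			return slot;
		}
	}
	return -1;	/* table full: the limit on populated sections */
}

/* Look up the slot for pfn, resolving collisions via start_pfn. */
static long section_lookup(unsigned long pfn)
{
	unsigned long start = pfn & ~((1UL << PFN_SECTION_SHIFT) - 1);
	unsigned long nr = pfn_to_section_nr(pfn);
	int i;

	for (i = 0; i < NR_OF_SECTION_SLOTS; i++) {
		unsigned long slot = (nr + i) % NR_OF_SECTION_SLOTS;
		if (!sections[slot].in_use)
			return -1;
		if (sections[slot].start_pfn == start)
			return slot;
	}
	return -1;
}
```

The extra hash and probe on every lookup is exactly the cost Martin points at below in the thread: a direct shift is one instruction, this is not.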
But, Andy and I have talked about this kind of thing from the beginning
of sparsemem, so I hope the code is amenable to change like this.
-- Dave
P.S. With sparsemem extreme, I think you can cover an entire 64 bits of
address space with a 4GB top-level table. If one more level of tables
was added, we'd be down to (I think) an 8MB table. So, that might be an
option, too.
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig
From: Martin Schwidefsky @ 2006-08-10 12:55 UTC (permalink / raw)
To: Dave Hansen
Cc: Luck, Tony, Andi Kleen, Paul Mackerras, Andrew Morton,
Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie,
linux-kernel, linux-arch, Andy Whitcroft
On Wed, 2006-08-09 at 11:25 -0700, Dave Hansen wrote:
> Instead of:
>
> #define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
>
> We could do:
>
> static inline unsigned long pfn_to_section_nr(unsigned long pfn)
> {
> 	return some_hash(pfn) % NR_OF_SECTION_SLOTS;
> }
>
> This would, of course, still have limits on how _many_ sections can be
> populated. But it would decouple the allowable physical address ranges
> from the number of populated sections.
>
> Of course, it isn't quite that simple. You need to make sure that the
> sparse code is free of all connections between section number and
> physical address, as well as handle things like hash collisions. We'd
> probably also need to store the _actual_ physical address somewhere
> because we can't derive it from the section number any more.
You have to deal with the hash collisions somehow, for example with a
list of pages that have the same hash. And you have to calculate the
hash value. Both hurt performance.
> P.S. With sparsemem extreme, I think you can cover an entire 64 bits of
> address space with a 4GB top-level table. If one more level of tables
> was added, we'd be down to (I think) an 8MB table. So, that might be an
> option, too.
On s390 we have to prepare for the situation of an address space that
has a chunk of memory at the low end and another chunk with bit 2^63
set. So the mem_map array needs to cover the whole 64-bit address range.
For sparsemem, we can choose the size of the mem_map sections and how
many indirection levels the lookup table should have. Some examples:
1) flat mem_map array: 2^52 entries, 56 bytes each.
2) mem_map sections with 256 entries / 14KB for each section,
1 indirection level, 2^44 indirection pointers, 128TB overhead
3) mem_map sections with 256 entries / 14KB for each section,
2 indirection levels, 2^22 indirection pointers for each level,
32MB for each indirection array, minimum 64MB overhead
4) mem_map sections with 256 entries / 14KB for each section,
3 indirection levels, 2^15/2^15/2^14 indirection pointers,
256K/256K/128K indirection arrays, minimum 640K overhead
5) mem_map sections with 1024 entries / 56KB for each section,
3 indirection levels, 2^14/2^14/2^14 indirection pointers,
128K/128K/128K indirection arrays, minimum 384KB overhead
Two levels of indirection result in a large memory overhead. With
three levels the memory overhead is okay, but each lookup has to walk
three indirections, which adds CPU cycles to every access of the
mem_map array.
The alternative of a flat mem_map array in vmalloc space is much more
attractive. The size of the array is 2^52 * 56 bytes, about 1.3% of the
virtual address space. The access pattern doesn't change: an array
element gets accessed, and the translation gets cached automatically by
the hardware. Simple, straightforward, no additional overhead. Only the
setup of the kernel page tables for the mem_map vmalloc area needs some
thought.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig
From: Andy Whitcroft @ 2006-08-10 14:40 UTC (permalink / raw)
To: schwidefsky
Cc: Dave Hansen, Luck, Tony, Andi Kleen, Paul Mackerras,
Andrew Morton, Eric W. Biederman, Randy.Dunlap,
Protasevich, Natalie, linux-kernel, linux-arch
Martin Schwidefsky wrote:
> On Wed, 2006-08-09 at 11:25 -0700, Dave Hansen wrote:
>> Instead of:
>>
>> #define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
>>
>> We could do:
>>
>> static inline unsigned long pfn_to_section_nr(unsigned long pfn)
>> {
>> 	return some_hash(pfn) % NR_OF_SECTION_SLOTS;
>> }
>>
>> This would, of course, still have limits on how _many_ sections can be
>> populated. But it would decouple the allowable physical address ranges
>> from the number of populated sections.
>>
>> Of course, it isn't quite that simple. You need to make sure that the
>> sparse code is free of all connections between section number and
>> physical address, as well as handle things like hash collisions. We'd
>> probably also need to store the _actual_ physical address somewhere
>> because we can't derive it from the section number any more.
>
> You have to deal with the hash collisions somehow, for example with a
> list of pages that have the same hash. And you have to calculate the
> hash value. Both hurt performance.
>
>> P.S. With sparsemem extreme, I think you can cover an entire 64 bits of
>> address space with a 4GB top-level table. If one more level of tables
>> was added, we'd be down to (I think) an 8MB table. So, that might be an
>> option, too.
>
> On s390 we have to prepare for the situation of an address space that
> has a chunk of memory at the low end and another chunk with bit 2^63
> set. So the mem_map array needs to cover the whole 64-bit address range.
> For sparsemem, we can choose the size of the mem_map sections and how
> many indirection levels the lookup table should have. Some examples:
>
> 1) flat mem_map array: 2^52 entries, 56 bytes each.
> 2) mem_map sections with 256 entries / 14KB for each section,
> 1 indirection level, 2^44 indirection pointers, 128TB overhead
> 3) mem_map sections with 256 entries / 14KB for each section,
> 2 indirection levels, 2^22 indirection pointers for each level,
> 32MB for each indirection array, minimum 64MB overhead
> 4) mem_map sections with 256 entries / 14KB for each section,
> 3 indirection levels, 2^15/2^15/2^14 indirection pointers,
> 256K/256K/128K indirection arrays, minimum 640K overhead
> 5) mem_map sections with 1024 entries / 56KB for each section,
> 3 indirection levels, 2^14/2^14/2^14 indirection pointers,
> 128K/128K/128K indirection arrays, minimum 384KB overhead
>
> Two levels of indirection result in a large memory overhead. With
> three levels the memory overhead is okay, but each lookup has to walk
> three indirections, which adds CPU cycles to every access of the
> mem_map array.
>
> The alternative of a flat mem_map array in vmalloc space is much more
> attractive. The size of the array is 2^52 * 56 bytes, about 1.3% of the
> virtual address space. The access pattern doesn't change: an array
> element gets accessed, and the translation gets cached automatically by
> the hardware. Simple, straightforward, no additional overhead. Only the
> setup of the kernel page tables for the mem_map vmalloc area needs some
> thought.
>
Well, you could do something more fun with the top of the address. You
don't need to keep the bytes in the same order, for instance. If this is
really a fair-sized chunk at the bottom and one at the top, then taking
the address and swapping the bytes like:
ABCDEFGH => BCDAEFGH
would be a pretty trivial bit of register wibbling (i.e. very quick), but
would probably mean a single flat, smaller sparsemem table would cover
all likely areas.
-apw
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig
From: Martin Schwidefsky @ 2006-08-10 14:53 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Dave Hansen, Luck, Tony, Andi Kleen, Paul Mackerras,
Andrew Morton, Eric W. Biederman, Randy.Dunlap,
Protasevich, Natalie, linux-kernel, linux-arch
On Thu, 2006-08-10 at 15:40 +0100, Andy Whitcroft wrote:
> Well, you could do something more fun with the top of the address. You
> don't need to keep the bytes in the same order, for instance. If this
> is really a fair-sized chunk at the bottom and one at the top, then
> taking the address and swapping the bytes like:
>
> ABCDEFGH => BCDAEFGH
>
> would be a pretty trivial bit of register wibbling (i.e. very quick),
> but would probably mean a single flat, smaller sparsemem table would
> cover all likely areas.
Not if you don't know where the objects will be mapped...
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.