* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig

From: Andi Kleen @ 2006-08-08  5:14 UTC
To: Paul Mackerras
Cc: Andrew Morton, Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch

On Tuesday 08 August 2006 07:09, Paul Mackerras wrote:
> Andrew Morton writes:

[adding linux-arch; talking about doing extensible per-cpu areas by
 pre-reserving virtual space and then later filling it up as needed]

> > > Drawback would be some more TLB misses.
> >
> > yup.  On some (important) architectures - I'm not sure which architectures
> > do the bigpage-for-kernel trick.
>
> I looked at optimizing the per-cpu data accessors on PowerPC and only
> ever saw fractions of a percent change in overall performance, which
> says to me that we don't actually use per-cpu data all that much.  So
> unless you make per-cpu data really really slow, I doubt that we'll
> see any significant performance difference.

The main problem is that we would need a "vmalloc reserve first; allocate
pages later" interface.  On x86 it would be easy by just splitting up
vmalloc/vmap a bit again.  Does anybody else see problems with implementing
that on any other architecture?

This wouldn't be truly demand paged, just pages initialized on allocation.

-Andi
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig

From: Martin Schwidefsky @ 2006-08-08  8:17 UTC
To: Andi Kleen
Cc: Paul Mackerras, Andrew Morton, Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch

On Tue, 2006-08-08 at 07:14 +0200, Andi Kleen wrote:
> > > > Drawback would be some more TLB misses.
> > >
> > > yup.  On some (important) architectures - I'm not sure which architectures
> > > do the bigpage-for-kernel trick.
> >
> > I looked at optimizing the per-cpu data accessors on PowerPC and only
> > ever saw fractions of a percent change in overall performance, which
> > says to me that we don't actually use per-cpu data all that much.  So
> > unless you make per-cpu data really really slow, I doubt that we'll
> > see any significant performance difference.
>
> The main problem is that we would need a "vmalloc reserve first; allocate
> pages later" interface.  On x86 it would be easy by just splitting up
> vmalloc/vmap a bit again.  Does anybody else see problems with
> implementing that on any other architecture?

"vmalloc reserve first; allocate pages later" would be a really nice
feature.  We could use it on s390 to implement a virtual mem_map array
spanning the whole 64-bit address range (with holes in it).  To make it
perfect, a "deallocate pages; keep vmalloc reserve" counterpart should be
added; then we could free parts of the mem_map array again on memory
hot-remove.

I don't see a problem for s390.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig

From: Luck, Tony @ 2006-08-09 17:58 UTC
To: Martin Schwidefsky
Cc: Andi Kleen, Paul Mackerras, Andrew Morton, Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch

On Tue, Aug 08, 2006 at 10:17:53AM +0200, Martin Schwidefsky wrote:
> "vmalloc reserve first; allocate pages later" would be a really nice
> feature.  We could use this on s390 to implement the virtual mem_map
> array spanning the whole 64-bit address range (with holes in it).  To
> make it perfect a "deallocate pages; keep vmalloc reserve" should be
> added, then we could free parts of the mem_map array again on memory
> hot-remove.

IA-64 already has some arch-specific code to allocate a sparse virtual
memory map ... having generic code to do so would be nice, but I foresee
some chicken-and-egg problems in getting enough of the vmalloc/vmap
framework up and running before mem_map[] has been allocated.

That, and the hotplug memory folks don't like the virtual mem_map code
and have spurned it in favour of SPARSEMEM.

-Tony
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig

From: Dave Hansen @ 2006-08-09 18:25 UTC
To: Luck, Tony
Cc: Martin Schwidefsky, Andi Kleen, Paul Mackerras, Andrew Morton, Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch, Andy Whitcroft

On Wed, 2006-08-09 at 10:58 -0700, Luck, Tony wrote:
> On Tue, Aug 08, 2006 at 10:17:53AM +0200, Martin Schwidefsky wrote:
> > "vmalloc reserve first; allocate pages later" would be a really nice
> > feature.  We could use this on s390 to implement the virtual mem_map
> > array spanning the whole 64-bit address range (with holes in it).  To
> > make it perfect a "deallocate pages; keep vmalloc reserve" should be
> > added, then we could free parts of the mem_map array again on memory
> > hot-remove.

Martin,

We can already do this partial freeing today with sparsemem and memory
hot-remove.  It would be a shame to have to do another implementation for
each and every architecture that wants it.

For the very sparse 64-bit address spaces, I would be really interested
to see an alternate pfn_to_section_nr() that relies on something other
than a direct correlation between physical address and section number.

Instead of:

	#define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)

we could do:

	static inline unsigned long pfn_to_section_nr(unsigned long pfn)
	{
		return some_hash(pfn) % NR_OF_SECTION_SLOTS;
	}

This would, of course, still have limits on how _many_ sections can be
populated.  But, it would decouple the number of populated sections from
what the actual physical address ranges can be.

Of course, it isn't quite that simple.  You need to make sure that the
sparse code is free of all connections between section number and
physical address, as well as handle things like hash collisions.  We'd
probably also need to store the _actual_ physical address somewhere,
because we can't derive it from the section number any more.

But, Andy and I have talked about this kind of thing from the beginning
of sparsemem, so I hope the code is amenable to a change like this.

-- Dave

P.S.  With sparsemem extreme, I think you can cover an entire 64 bits of
address space with a 4GB top-level table.  If one more level of tables
were added, we'd be down to (I think) an 8MB table.  So, that might be an
option, too.
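[Editor's note: Dave's hashed lookup can be made concrete. The sketch below uses a fixed slot table, linear probing for hash collisions, and stores the section's base pfn in each slot, since, as he notes, the physical address can no longer be derived from the slot index. All names and constants are illustrative, not the real sparsemem API.]

```c
/* Hashed section lookup sketch: slot found by hashing the section number,
 * collisions resolved by linear probing, real base pfn kept in the slot. */
#include <assert.h>
#include <stdint.h>

#define PFN_SECTION_SHIFT 16	/* 2^16 pages per section, for example   */
#define NR_SECTION_SLOTS 1024	/* fixed table; population must stay sparse */

struct mem_section_slot {
	uint64_t base_pfn;	/* which section actually lives here */
	int used;
};

static struct mem_section_slot slots[NR_SECTION_SLOTS];

static unsigned long hash_section(uint64_t secnr)
{
	/* Fibonacci hashing: multiply, keep the top 10 bits (0..1023). */
	return (unsigned long)((secnr * 0x9e3779b97f4a7c15ULL) >> 54);
}

/* Find the slot for a pfn, inserting it if 'insert' is set.
 * Returns the slot index, or -1 if absent (or the table is full). */
static long pfn_to_section_slot(uint64_t pfn, int insert)
{
	uint64_t secnr = pfn >> PFN_SECTION_SHIFT;
	unsigned long i = hash_section(secnr) % NR_SECTION_SLOTS;

	for (int probe = 0; probe < NR_SECTION_SLOTS; probe++) {
		struct mem_section_slot *s = &slots[i];

		if (s->used && (s->base_pfn >> PFN_SECTION_SHIFT) == secnr)
			return (long)i;
		if (!s->used) {
			if (!insert)
				return -1;
			s->used = 1;
			s->base_pfn = secnr << PFN_SECTION_SHIFT;
			return (long)i;
		}
		i = (i + 1) % NR_SECTION_SLOTS;	/* collision: probe onward */
	}
	return -1;
}
```

This illustrates both costs Martin raises in his reply: every lookup pays for the hash multiply, and a collision turns the O(1) array index into a probe walk.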
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig

From: Martin Schwidefsky @ 2006-08-10 12:55 UTC
To: Dave Hansen
Cc: Luck, Tony, Andi Kleen, Paul Mackerras, Andrew Morton, Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch, Andy Whitcroft

On Wed, 2006-08-09 at 11:25 -0700, Dave Hansen wrote:
> Instead of:
>
> 	#define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
>
> we could do:
>
> 	static inline unsigned long pfn_to_section_nr(unsigned long pfn)
> 	{
> 		return some_hash(pfn) % NR_OF_SECTION_SLOTS;
> 	}
>
> This would, of course, still have limits on how _many_ sections can be
> populated.  But, it would decouple the number of populated sections from
> what the actual physical address ranges can be.
>
> Of course, it isn't quite that simple.  You need to make sure that the
> sparse code is free of all connections between section number and
> physical address, as well as handle things like hash collisions.  We'd
> probably also need to store the _actual_ physical address somewhere,
> because we can't derive it from the section number any more.

You have to deal with the hash collisions somehow, for example with a
list of pages that have the same hash.  And you have to calculate the
hash value.  Both hurt performance.

> P.S.  With sparsemem extreme, I think you can cover an entire 64 bits of
> address space with a 4GB top-level table.  If one more level of tables
> were added, we'd be down to (I think) an 8MB table.  So, that might be an
> option, too.

On s390 we have to prepare for an address space that has a chunk of
memory at the low end and another chunk with bit 2^63 set, so the
mem_map array needs to cover the whole 64-bit address range.

For sparsemem, we can choose the size of the mem_map sections and how
many indirection levels the lookup table should have.  Some examples:

1) flat mem_map array: 2^52 entries, 56 bytes each
2) mem_map sections with 256 entries / 14KB each, 1 indirection level,
   2^44 indirection pointers, 128TB overhead
3) mem_map sections with 256 entries / 14KB each, 2 indirection levels,
   2^22 indirection pointers per level, 32MB per indirection array,
   minimum 64MB overhead
4) mem_map sections with 256 entries / 14KB each, 3 indirection levels,
   2^15/2^15/2^14 indirection pointers, 256K/256K/128K indirection
   arrays, minimum 640K overhead
5) mem_map sections with 1024 entries / 56KB each, 3 indirection levels,
   2^14/2^14/2^14 indirection pointers, 128K/128K/128K indirection
   arrays, minimum 384K overhead

Two levels of indirection result in a large memory overhead.  For three
levels of indirection the memory overhead is fine, but each lookup has
to walk three indirections, which adds CPU cycles to every access of the
mem_map array.

The alternative of a flat mem_map array in vmalloc space is much more
attractive.  The size of the array is 2^52 * 56 bytes, about 1.3% of the
virtual address space.  The access doesn't change: an array gets indexed,
and the accesses get cached automatically by the hardware.
Simple, straightforward, no additional overhead.  Only the setup of the
kernel page tables for the mem_map vmalloc area needs some thought.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
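[Editor's note: Martin's overhead figures follow from simple pointer-size arithmetic. The sketch below recomputes them, assuming 8-byte pointers and the 56-byte struct page used in the thread; minimum overhead counts one populated indirection array per level.]

```c
/* Recompute the indirection-table overheads from Martin's examples. */
#include <assert.h>
#include <stdint.h>

#define PTR_SIZE	 8ULL	/* assumed 64-bit pointer size        */
#define PAGE_STRUCT	56ULL	/* sizeof(struct page) as per thread  */

/* Bytes used by one indirection array holding 2^bits pointers. */
static uint64_t table_bytes(unsigned bits)
{
	return (1ULL << bits) * PTR_SIZE;
}

/* Minimum overhead of a multi-level table: one array per level. */
static uint64_t min_overhead(const unsigned *bits, int levels)
{
	uint64_t sum = 0;

	for (int i = 0; i < levels; i++)
		sum += table_bytes(bits[i]);
	return sum;
}
```

The same arithmetic gives the flat-array figure: 2^52 entries of 56 bytes against a 2^64-byte virtual address space is 56/4096, roughly 1.3%.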
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig

From: Andy Whitcroft @ 2006-08-10 14:40 UTC
To: schwidefsky
Cc: Dave Hansen, Luck, Tony, Andi Kleen, Paul Mackerras, Andrew Morton, Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch

Martin Schwidefsky wrote:
> On Wed, 2006-08-09 at 11:25 -0700, Dave Hansen wrote:
>> [hashed pfn_to_section_nr() proposal trimmed]
>
> You have to deal with the hash collisions somehow, for example with a
> list of pages that have the same hash.  And you have to calculate the
> hash value.  Both hurt performance.
>
> [...]
>
> On s390 we have to prepare for the situation of an address space that
> has a chunk of memory at the low end and another chunk with bit 2^63
> set.  So the mem_map array needs to cover the whole 64-bit address range.
>
> [...]
>
> The alternative of a flat mem_map array in vmalloc space is much more
> attractive.

Well, you could do something more fun with the top of the address.  You
don't need to keep the bytes in the same order, for instance.  If this
really is a fair-sized chunk at the bottom and one at the top, then
taking the address and swapping the bytes like:

	ABCDEFGH => BCDAEFGH

would be a pretty trivial bit of register wibbling (i.e. very quick),
but would probably mean a single, smaller, flat sparsemem table would
cover all likely areas.

-apw
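[Editor's note: Andy's byte swap amounts to rotating the top 32 bits of the address left by one byte. Under his assumption of one chunk near zero and one chunk just above bit 63, both then end up with small top bytes, so a single smaller flat table covers both. A sketch:]

```c
/* Fold ABCDEFGH into BCDAEFGH: rotate the high 32 bits left by 8,
 * so the top byte (which carries bit 63) drops to a low position. */
#include <assert.h>
#include <stdint.h>

static uint64_t fold_addr(uint64_t a)
{
	uint32_t hi = (uint32_t)(a >> 32);

	hi = (hi << 8) | (hi >> 24);		/* ABCD -> BCDA */
	return ((uint64_t)hi << 32) | (uint32_t)a;
}
```

As Martin's reply points out, this only works if you know in advance that the populated regions really do sit at those two ends of the address space; for arbitrary placements the fold buys nothing.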
* Re: [PATCH] x86_64: Make NR_IRQS configurable in Kconfig

From: Martin Schwidefsky @ 2006-08-10 14:53 UTC
To: Andy Whitcroft
Cc: Dave Hansen, Luck, Tony, Andi Kleen, Paul Mackerras, Andrew Morton, Eric W. Biederman, Randy.Dunlap, Protasevich, Natalie, linux-kernel, linux-arch

On Thu, 2006-08-10 at 15:40 +0100, Andy Whitcroft wrote:
> Well, you could do something more fun with the top of the address.  You
> don't need to keep the bytes in the same order, for instance.  If this
> really is a fair-sized chunk at the bottom and one at the top, then
> taking the address and swapping the bytes like:
>
> 	ABCDEFGH => BCDAEFGH
>
> would be a pretty trivial bit of register wibbling (i.e. very quick),
> but would probably mean a single, smaller, flat sparsemem table would
> cover all likely areas.

Not if you don't know where the objects will be mapped.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
Thread overview: 7+ messages
[not found] <m1irl4ftya.fsf@ebiederm.dsl.xmission.com>
[not found] ` <20060807194159.f7c741b5.akpm@osdl.org>
[not found] ` <17624.7310.856480.704542@cargo.ozlabs.ibm.com>
2006-08-08 5:14 ` [PATCH] x86_64: Make NR_IRQS configurable in Kconfig Andi Kleen
2006-08-08 8:17 ` Martin Schwidefsky
2006-08-09 17:58 ` Luck, Tony
2006-08-09 18:25 ` Dave Hansen
2006-08-10 12:55 ` Martin Schwidefsky
2006-08-10 14:40 ` Andy Whitcroft
2006-08-10 14:53 ` Martin Schwidefsky