* speed difference between using hard-linked and modular drives? @ 2001-11-08 16:01 Roy Sigurd Karlsbakk 2001-11-08 17:02 ` Ingo Molnar 2001-11-08 17:53 ` Robert Love 0 siblings, 2 replies; 45+ messages in thread From: Roy Sigurd Karlsbakk @ 2001-11-08 16:01 UTC (permalink / raw) To: linux-kernel hi Are there any speed difference between hard-linked device drivers and their modular counterparts? roy -- Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 16:01 speed difference between using hard-linked and modular drives? Roy Sigurd Karlsbakk @ 2001-11-08 17:02 ` Ingo Molnar 2001-11-08 17:37 ` Ingo Molnar 2001-11-08 23:59 ` Anton Blanchard 2001-11-08 17:53 ` Robert Love 1 sibling, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2001-11-08 17:02 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: linux-kernel On Thu, 8 Nov 2001, Roy Sigurd Karlsbakk wrote: > Are there any speed difference between hard-linked device drivers and > their modular counterparts? minimal. a few instructions per IO. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 17:02 ` Ingo Molnar @ 2001-11-08 17:37 ` Ingo Molnar 2001-11-08 23:59 ` Anton Blanchard 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2001-11-08 17:37 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: linux-kernel On Thu, 8 Nov 2001, Ingo Molnar wrote: > > Are there any speed difference between hard-linked device drivers and > > their modular counterparts? > > minimal. a few instructions per IO. Arjan pointed out that there is also the cost of TLB misses due to vmalloc()-ing module libraries, which can be as high as a 5% slowdown. we should fix this by trying to allocate continuous physical memory if possible, and fall back to vmalloc() only if this allocation fails. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
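The fallback Ingo describes can be sketched as a userspace model (all names here are illustrative, not the 2.4 API; in the kernel the two paths would be `__get_free_pages()` and `vmalloc()`):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of "try contiguous first": alloc_contig() stands in
 * for __get_free_pages(), which fails above the buddy allocator's
 * order limit, and alloc_mapped() stands in for vmalloc(), which
 * succeeds for large sizes but costs TLB misses on every access. */
static void *alloc_contig(size_t size, size_t max_contig)
{
    if (size > max_contig)      /* too big for one contiguous chunk */
        return NULL;
    return malloc(size);
}

static void *alloc_mapped(size_t size)
{
    return malloc(size);
}

/* Try physically contiguous memory first; fall back only on failure. */
static void *module_alloc(size_t size, size_t max_contig, int *used_fallback)
{
    void *p = alloc_contig(size, max_contig);
    *used_fallback = (p == NULL);
    return p ? p : alloc_mapped(size);
}
```

The point of the ordering is that the fast path costs nothing when contiguous memory is available, and behavior is unchanged (just slower) when it is not.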
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 17:02 ` Ingo Molnar 2001-11-08 17:37 ` Ingo Molnar @ 2001-11-08 23:59 ` Anton Blanchard 2001-11-09 5:11 ` Keith Owens 1 sibling, 1 reply; 45+ messages in thread From: Anton Blanchard @ 2001-11-08 23:59 UTC (permalink / raw) To: Ingo Molnar; +Cc: Roy Sigurd Karlsbakk, linux-kernel > > Are there any speed difference between hard-linked device drivers and > > their modular counterparts? > > minimal. a few instructions per IO. Its worse on some architectures that need to pass through a trampoline when going between kernel and module (eg ppc). Its even worse on ppc64 at the moment because we have a local TOC per module which needs to be saved and restored. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 23:59 ` Anton Blanchard @ 2001-11-09 5:11 ` Keith Owens 2001-11-10 3:35 ` Anton Blanchard 0 siblings, 1 reply; 45+ messages in thread From: Keith Owens @ 2001-11-09 5:11 UTC (permalink / raw) To: Anton Blanchard; +Cc: linux-kernel On Fri, 9 Nov 2001 10:59:21 +1100, Anton Blanchard <anton@samba.org> wrote: > >> > Are there any speed difference between hard-linked device drivers and >> > their modular counterparts? > >Its worse on some architectures that need to pass through a trampoline >when going between kernel and module (eg ppc). Its even worse on ppc64 >at the moment because we have a local TOC per module which needs to be >saved and restored. Is that TOC save and restore just for module code or does it apply to all calls through function pointers? On IA64, R1 (global data pointer) must be saved and restored on all calls through function pointers, even if both the caller and callee are in the kernel. You might know that this is a kernel to kernel call but gcc does not so it has to assume the worst. This is not a module problem, it affects all indirect function calls. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 5:11 ` Keith Owens @ 2001-11-10 3:35 ` Anton Blanchard 2001-11-10 7:26 ` Keith Owens 0 siblings, 1 reply; 45+ messages in thread From: Anton Blanchard @ 2001-11-10 3:35 UTC (permalink / raw) To: Keith Owens; +Cc: linux-kernel Hi, > Is that TOC save and restore just for module code or does it apply to > all calls through function pointers? > > On IA64, R1 (global data pointer) must be saved and restored on all > calls through function pointers, even if both the caller and callee are > in the kernel. You might know that this is a kernel to kernel call but > gcc does not so it has to assume the worst. This is not a module > problem, it affects all indirect function calls. Yep all indirect function calls require save and reload of the TOC (which is r2): std r2,40(r1) mtctr r0 ld r2,8(r9) bctrl # function call When calling a function in the kernel from within the kernel (eg printk), we dont have to save and reload the TOC: 000014ec bl .printk 000014f0 nop Alan Modra tells me the linker does the fixup of nop -> r2 reload. So in this case it isnt needed. However when we do the same printk from a module, the nop is replaced with an r2 reload: 000014ec bl 0x2f168 # call trampoline 000014f0 ld r2,40(r1) And because we have to load the new TOC for the call to printk, it is done in a small trampoline. (r12 is a pointer to the function descriptor for printk which contains 3 values, 1. the function address, 2. the TOC, ignore the 3rd) 0002f168 ld r12,-32456(r2) 0002f16c std r2,40(r1) 0002f170 ld r0,0(r12) 0002f174 ld r2,8(r12) 0002f178 mtctr r0 0002f17c bctr # call printk So the trampoline and r2 restore is the overhead Im talking about :) btw the trampoline is also required because of the limited range of relative branches on ppc. So ppc32 also has an overhead except it is smaller because it doesnt need the TOC juggling. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 3:35 ` Anton Blanchard @ 2001-11-10 7:26 ` Keith Owens 0 siblings, 0 replies; 45+ messages in thread From: Keith Owens @ 2001-11-10 7:26 UTC (permalink / raw) To: Anton Blanchard; +Cc: linux-kernel On Sat, 10 Nov 2001 14:35:58 +1100, Anton Blanchard <anton@samba.org> wrote: >Yep all indirect function calls require save and reload of the TOC >(which is r2): > >When calling a function in the kernel from within the kernel (eg printk), >we dont have to save and reload the TOC: Same on IA64, indirect function calls have to save R1, load R1 for the target function from the function descriptor, call the function, restore R1. Incidentally, that makes a function descriptor on IA64 _two_ words; you cannot save an IA64 function pointer in a long or even a void * variable. >Alan Modra tells me the linker does the fixup of nop -> r2 reload. So >in this case it isnt needed. IA64 kernels are compiled with -mconstant-gp which tells gcc that direct calls do not require R1 save/reload, gcc does not even generate a nop. However indirect function calls from one part of the kernel to another still require save and reload code, gcc cannot tell if the call is local or not. >However when we do the same printk from a module, the nop is replaced >with an r2 reload: Same on IA64, calls from a module into the kernel require R1 save and reload, even if the call is direct. So there is some code overhead when making direct function calls from modules to kernel on IA64, that overhead disappears when code is linked into the kernel. Indirect function calls always have the overhead, whether in kernel or in module. ^ permalink raw reply [flat|nested] 45+ messages in thread
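The descriptor scheme Anton and Keith describe can be modelled in plain C. This is only a sketch of the concept, not the real ppc64/IA64 ABI structures: a call target is a pair (entry address, data pointer), and the caller must install the callee's data pointer before the call.

```c
#include <assert.h>

/* Model of a ppc64 TOC / IA64 gp style function descriptor: two
 * words, so a plain void * cannot hold a full function "pointer". */
struct func_desc {
    int (*entry)(void *gp, int arg);  /* function entry point */
    void *gp;                         /* per-object TOC/global pointer */
};

static int add_base(void *gp, int arg)
{
    /* the callee finds its globals through the gp it was handed */
    return *(int *)gp + arg;
}

/* The caller loads the callee's gp from the descriptor before the
 * call (and the real ABI also saves/restores its own) -- that extra
 * load/store traffic is the indirect-call overhead in the thread. */
static int call_indirect(const struct func_desc *fd, int arg)
{
    return fd->entry(fd->gp, arg);
}
```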
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 16:01 speed difference between using hard-linked and modular drives? Roy Sigurd Karlsbakk 2001-11-08 17:02 ` Ingo Molnar @ 2001-11-08 17:53 ` Robert Love 1 sibling, 0 replies; 45+ messages in thread From: Robert Love @ 2001-11-08 17:53 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: linux-kernel On Thu, 2001-11-08 at 11:01, Roy Sigurd Karlsbakk wrote: > Are there any speed difference between hard-linked device drivers and > their modular counterparts? On top of what Ingo said, there is also a slightly larger (very slight) memory footprint due to some of the module code that isn't included in in-kernel components. For example, the __exit functions aren't needed if the driver is not a module. Robert Love ^ permalink raw reply [flat|nested] 45+ messages in thread
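Robert's point about `__exit` can be sketched as follows. This is a simplified userspace model: the real 2.4 kernel tags exit routines with a section attribute and the linker script discards that section for built-in code, rather than using `__attribute__((unused))`.

```c
#include <assert.h>

/* Model of why built-in drivers are slightly smaller: a module can be
 * unloaded, so its cleanup routine must be kept; a built-in driver's
 * cleanup can never run, so the code is a candidate for discarding. */
#ifdef MODULE
#define __exit_model                              /* kept: rmmod needs it */
static const int exit_code_kept = 1;
#else
#define __exit_model __attribute__((unused))      /* dead when built in */
static const int exit_code_kept = 0;
#endif

static void __exit_model mydrv_cleanup(void)
{
    /* release driver resources; unreachable when the driver is built in */
}
```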
* Re: speed difference between using hard-linked and modular drives? [not found] ` <Pine.LNX.4.33.0111081836080.15975-100000@localhost.localdomain.suse.lists.linux.kernel> @ 2001-11-08 23:00 ` Andi Kleen 2001-11-09 0:05 ` Anton Blanchard 2001-11-09 3:12 ` Rusty Russell 0 siblings, 2 replies; 45+ messages in thread From: Andi Kleen @ 2001-11-08 23:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar <mingo@elte.hu> writes: > > we should fix this by trying to allocate continuous physical memory if > possible, and fall back to vmalloc() only if this allocation fails. Check -aa. A patch to do that has been in there for some time now. -Andi P.S.: It makes a measurable difference with some Oracle benchmarks with the Qlogic driver. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 23:00 ` Andi Kleen @ 2001-11-09 0:05 ` Anton Blanchard 2001-11-09 5:45 ` Andi Kleen 2001-11-09 3:12 ` Rusty Russell 1 sibling, 1 reply; 45+ messages in thread From: Anton Blanchard @ 2001-11-09 0:05 UTC (permalink / raw) To: Andi Kleen; +Cc: Ingo Molnar, linux-kernel > > we should fix this by trying to allocate continuous physical memory if > > possible, and fall back to vmalloc() only if this allocation fails. > > Check -aa. A patch to do that has been in there for some time now. We also need a way to satisfy very large allocations for the hashes (eg the pagecache hash). On a 32G machine we get awful performance on the pagecache hash because we can only get an order 9 allocation out of get_free_pages: http://samba.org/~anton/linux/pagecache/pagecache_before.png When switching to vmalloc the hash is large enough to be useful: http://samba.org/~anton/linux/pagecache/pagecache_after.png As pointed out by Davem and Ingo we should try and avoid vmalloc here due to tlb trashing. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
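The size problem Anton hits is easy to see with a little arithmetic (a sketch with a toy page size; the order-9 limit matches the 2.4 buddy allocator he mentions):

```c
#include <assert.h>

#define PAGE_SIZE_MODEL 4096UL
#define MAX_GFP_ORDER 9     /* largest allocation get_free_pages() gives */

/* Smallest allocation order (power-of-two pages) covering `bytes`. */
static int order_for(unsigned long bytes)
{
    int order = 0;
    unsigned long size = PAGE_SIZE_MODEL;

    while (size < bytes) {
        size <<= 1;
        order++;
    }
    return order;
}
```

On a 32GB box there are 2^23 4K pages; one 8-byte bucket per page is a 64MB table, which needs order 14, far past what `get_free_pages()` will hand out in one piece. That is why the hash ends up undersized without a `vmalloc()` (or boot-time) escape hatch.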
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 0:05 ` Anton Blanchard @ 2001-11-09 5:45 ` Andi Kleen 2001-11-09 6:04 ` David S. Miller 0 siblings, 1 reply; 45+ messages in thread From: Andi Kleen @ 2001-11-09 5:45 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andi Kleen, Ingo Molnar, linux-kernel On Fri, Nov 09, 2001 at 11:05:32AM +1100, Anton Blanchard wrote: > We also need a way to satisfy very large allocations for the hashes (eg > the pagecache hash). On a 32G machine we get awful performance on the > pagecache hash because we can only get an order 9 allocation out of > get_free_pages: > > http://samba.org/~anton/linux/pagecache/pagecache_before.png > > When switching to vmalloc the hash is large enough to be useful: > > http://samba.org/~anton/linux/pagecache/pagecache_after.png > > As pointed out by Davem and Ingo we should try and avoid vmalloc here > due to tlb trashing. Sounds like you need a better hash function instead. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 5:45 ` Andi Kleen @ 2001-11-09 6:04 ` David S. Miller 2001-11-09 6:39 ` Andi Kleen 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 6:04 UTC (permalink / raw) To: ak; +Cc: anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 06:45:40 +0100 Sounds like you need a better hash function instead. Andi, please think about the problem before jumping to conclusions. N_PAGES / N_CHAINS > 1 in his situation. A better hash function cannot help. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
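Davem's objection is a pigeonhole argument, which can be stated as arithmetic: with P pages hashed into C chains, even a perfect hash leaves chains of at least ceil(P/C), so when P/C > 1 no hash function can shorten the walk — only more buckets can.

```c
#include <assert.h>

/* Best case (perfectly even hash): the shortest possible longest
 * chain when `pages` entries are spread over `chains` buckets. */
static unsigned long best_case_chain(unsigned long pages,
                                     unsigned long chains)
{
    return (pages + chains - 1) / chains;   /* ceil(P / C) */
}
```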
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:04 ` David S. Miller @ 2001-11-09 6:39 ` Andi Kleen 2001-11-09 6:54 ` Andrew Morton ` (3 more replies) 0 siblings, 4 replies; 45+ messages in thread From: Andi Kleen @ 2001-11-09 6:39 UTC (permalink / raw) To: David S. Miller; +Cc: ak, anton, mingo, linux-kernel On Thu, Nov 08, 2001 at 10:04:44PM -0800, David S. Miller wrote: > From: Andi Kleen <ak@suse.de> > Date: Fri, 9 Nov 2001 06:45:40 +0100 > > Sounds like you need a better hash function instead. > > Andi, please think about the problem before jumping to conclusions. > N_PAGES / N_CHAINS > 1 in his situation. A better hash function > cannot help. I'm assuming that walking on average 5-10 pages on a lookup is not too big a deal, especially when you use prefetch for the list walk. It is a tradeoff between a big hash table thrashing your cache and a smaller hash table that can be cached but has on average >1 entries per bucket. At some point the smaller hash table wins, assuming the hash function is evenly distributed. It would only get bad if the average chain length became much bigger. Before jumping to real conclusions it would be interesting to gather some statistics on Anton's machine, but I suspect he just has a very unevenly populated table. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen @ 2001-11-09 6:54 ` Andrew Morton 2001-11-09 7:17 ` David S. Miller 2001-11-09 7:14 ` David S. Miller ` (2 subsequent siblings) 3 siblings, 1 reply; 45+ messages in thread From: Andrew Morton @ 2001-11-09 6:54 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, anton, mingo, linux-kernel Andi Kleen wrote: > > On Thu, Nov 08, 2001 at 10:04:44PM -0800, David S. Miller wrote: > > From: Andi Kleen <ak@suse.de> > > Date: Fri, 9 Nov 2001 06:45:40 +0100 > > > > Sounds like you need a better hash function instead. > > > > Andi, please think about the problem before jumping to conclusions. > > N_PAGES / N_CHAINS > 1 in his situation. A better hash function > > cannot help. > > I'm assuming that walking on average 5-10 pages on a lookup is not too big a > deal, especially when you use prefetch for the list walk. It is a tradeoff > between a big hash table thrashing your cache and a smaller hash table that > can be cached but has on average >1 entries/buckets. At some point the the > smaller hash table wins, assuming the hash function is evenly distributed. > > It would only get bad if the average chain length would become much bigger. > > Before jumping to real conclusions it would be interesting to gather > some statistics on Anton's machine, but I suspect he just has an very > unevenly populated table. I played with that earlier in the year. Shrinking the hash table by a factor of eight made no measurable difference to anything on a Pentium II. The hash distribution was all over the place though. Lots of buckets with 1-2 pages, lots with 12-13. - ^ permalink raw reply [flat|nested] 45+ messages in thread
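The kind of measurement Andrew describes is straightforward to reproduce: hash a set of keys and histogram the chain lengths. A sketch, using a Knuth-style multiplicative hash as a stand-in for the 2.4 page-cache hash (the constant and sizes are toy values, not the kernel's):

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 64

/* Multiplicative hash stand-in for page_hash(). */
static unsigned bucket_of(unsigned long key)
{
    return (unsigned)((key * 2654435761UL) >> 16) % NBUCKETS;
}

/* Count how many keys land in each bucket -- the per-bucket counts
 * are exactly the chain lengths a lookup would have to walk. */
static void chain_histogram(const unsigned long *keys, int n,
                            int counts[NBUCKETS])
{
    memset(counts, 0, NBUCKETS * sizeof(int));
    for (int i = 0; i < n; i++)
        counts[bucket_of(keys[i])]++;
}
```

A skewed histogram (lots of 1-2 chains next to lots of 12-13 chains, as Andrew saw) indicts the hash function; a uniformly deep one indicts the table size.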
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:54 ` Andrew Morton @ 2001-11-09 7:17 ` David S. Miller 2001-11-09 7:16 ` Andrew Morton 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:17 UTC (permalink / raw) To: akpm; +Cc: ak, anton, mingo, linux-kernel From: Andrew Morton <akpm@zip.com.au> Date: Thu, 08 Nov 2001 22:54:30 -0800 I played with that earlier in the year. Shrinking the hash table by a factor of eight made no measurable difference to anything on a Pentium II. The hash distribution was all over the place though. Lots of buckets with 1-2 pages, lots with 12-13. What is the distribution when you don't shrink the hash table? Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:17 ` David S. Miller @ 2001-11-09 7:16 ` Andrew Morton 2001-11-09 7:24 ` David S. Miller 2001-11-09 8:21 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Andrew Morton @ 2001-11-09 7:16 UTC (permalink / raw) To: David S. Miller; +Cc: ak, anton, mingo, linux-kernel "David S. Miller" wrote: > > From: Andrew Morton <akpm@zip.com.au> > Date: Thu, 08 Nov 2001 22:54:30 -0800 > > I played with that earlier in the year. Shrinking the hash table > by a factor of eight made no measurable difference to anything on > a Pentium II. The hash distribution was all over the place though. > Lots of buckets with 1-2 pages, lots with 12-13. > > What is the distribution when you don't shrink the hash > table? > Well on my setup, there are more hash buckets than there are pages in the system. So - basically empty. If memory serves me, never more than two pages in a bucket. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` Andrew Morton @ 2001-11-09 7:24 ` David S. Miller 2001-11-09 8:21 ` Ingo Molnar 1 sibling, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:24 UTC (permalink / raw) To: akpm; +Cc: ak, anton, mingo, linux-kernel From: Andrew Morton <akpm@zip.com.au> Date: Thu, 08 Nov 2001 23:16:08 -0800 Well on my setup, there are more hash buckets than there are pages in the system. So - basically empty. If memory serves me, never more than two pages in a bucket. Ok, this is what I expected. The function is tuned for having N_HASH_CHAINS being roughly equal to N_PAGES. If you want to experiment with smaller hash tables, there are some hacks in the FreeBSD sources that choose a different "salt" per inode. You xor the salt into the hash for each page on that inode. Something like this... Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
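The FreeBSD-style trick Davem mentions can be sketched like this. The mixing function and names are illustrative, not the 2.4 code; the idea is only that a per-inode salt xor'ed into the hash scatters pages of different files even when their page indices collide.

```c
#include <assert.h>

/* Per-inode salted page hash: pages at the same index in two
 * different files land in different buckets whenever the salts
 * differ within the bucket mask. */
static unsigned long page_bucket(unsigned long inode_salt,
                                 unsigned long index,
                                 unsigned long nbuckets)
{
    unsigned long h = index ^ (index >> 10);   /* mix the page index */
    return (h ^ inode_salt) & (nbuckets - 1);  /* nbuckets: power of 2 */
}
```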
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` Andrew Morton 2001-11-09 7:24 ` David S. Miller @ 2001-11-09 8:21 ` Ingo Molnar 2001-11-09 7:35 ` Andrew Morton 1 sibling, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2001-11-09 8:21 UTC (permalink / raw) To: Andrew Morton; +Cc: David S. Miller, ak, anton, linux-kernel On Thu, 8 Nov 2001, Andrew Morton wrote: > Well on my setup, there are more hash buckets than there are pages in > the system. So - basically empty. If memory serves me, never more > than two pages in a bucket. how much RAM and how many buckets are there on your system? Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 8:21 ` Ingo Molnar @ 2001-11-09 7:35 ` Andrew Morton 2001-11-09 7:44 ` David S. Miller 0 siblings, 1 reply; 45+ messages in thread From: Andrew Morton @ 2001-11-09 7:35 UTC (permalink / raw) To: mingo; +Cc: David S. Miller, ak, anton, linux-kernel Ingo Molnar wrote: > > On Thu, 8 Nov 2001, Andrew Morton wrote: > > > Well on my setup, there are more hash buckets than there are pages in > > the system. So - basically empty. If memory serves me, never more > > than two pages in a bucket. > > how much RAM and how many buckets are there on your system? > urgh. It was ages ago. I shouldn't have stuck my head up ;) I guess it was 256 megs: Kernel command line: ... mem=256m Page-cache hash table entries: 65536 (order: 6, 262144 bytes) And that's one entry per page, yes? I ended up concluding that a) The hash is sucky and b) Except for certain specialised workloads, a lookup is usually associated with a big memory copy, so none of it matters and c) given b), the page cache hashtable is on the wrong side of the size/space tradeoff :) - ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:35 ` Andrew Morton @ 2001-11-09 7:44 ` David S. Miller 0 siblings, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:44 UTC (permalink / raw) To: akpm; +Cc: mingo, ak, anton, linux-kernel From: Andrew Morton <akpm@zip.com.au> Date: Thu, 08 Nov 2001 23:35:04 -0800 b) Except for certain specialised workloads, a lookup is usually associated with a big memory copy, so none of it matters and I disagree, cache pollution always matters. Especially, if the cpu does memcpy's using cache-bypass-on-miss. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen 2001-11-09 6:54 ` Andrew Morton @ 2001-11-09 7:14 ` David S. Miller 2001-11-09 7:16 ` David S. Miller 2001-11-10 4:56 ` Anton Blanchard 3 siblings, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:14 UTC (permalink / raw) To: ak; +Cc: anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 07:39:46 +0100 Before jumping to real conclusions it would be interesting to gather some statistics on Anton's machine, but I suspect he just has an very unevenly populated table. N_PAGES / N_HASHCHAINS was on the order of 9, and the hash chains were evenly distributed. He posted URLs to graphs of the hash table chain lengths. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen 2001-11-09 6:54 ` Andrew Morton 2001-11-09 7:14 ` David S. Miller @ 2001-11-09 7:16 ` David S. Miller 2001-11-09 12:59 ` Alan Cox 2001-11-10 5:20 ` Anton Blanchard 2001-11-10 4:56 ` Anton Blanchard 3 siblings, 2 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:16 UTC (permalink / raw) To: ak; +Cc: anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 07:39:46 +0100 I'm assuming that walking on average 5-10 pages on a lookup is not too big a deal, especially when you use prefetch for the list walk. Oh no, not this again... It _IS_ a big deal. Fetching _ONE_ hash chain cache line is always going to be cheaper than fetching _FIVE_ to _TEN_ page struct cache lines while walking the list. Even if prefetch would kill all of this overhead (sorry, it won't), it is _DUMB_ and _STUPID_ to bring those _FIVE_ to _TEN_ cache lines into the processor just to lookup _ONE_ page. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` David S. Miller @ 2001-11-09 12:59 ` Alan Cox 2001-11-09 12:54 ` David S. Miller 2001-11-10 5:20 ` Anton Blanchard 1 sibling, 1 reply; 45+ messages in thread From: Alan Cox @ 2001-11-09 12:59 UTC (permalink / raw) To: David S. Miller; +Cc: ak, anton, mingo, linux-kernel > Oh no, not this again... > > It _IS_ a big deal. Fetching _ONE_ hash chain cache line > is always going to be cheaper than fetching _FIVE_ to _TEN_ > page struct cache lines while walking the list. Big picture time. What costs more - the odd five cache line hit or swapping 200Kbytes/second on and off disk? That's obviously workload dependent. Perhaps at some point we need to accept there is a memory/speed tradeoff throughout the kernel and we need a CONFIG option for it - especially for the handheld world. I don't want to do lots of I/O on an ipaq, I don't need big tcp hashes, and I'd rather take a small performance hit. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 12:59 ` Alan Cox @ 2001-11-09 12:54 ` David S. Miller 2001-11-09 13:15 ` Philip Dodd 2001-11-09 13:17 ` Andi Kleen 0 siblings, 2 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 12:54 UTC (permalink / raw) To: alan; +Cc: ak, anton, mingo, linux-kernel From: Alan Cox <alan@lxorguk.ukuu.org.uk> Date: Fri, 9 Nov 2001 12:59:09 +0000 (GMT) we need a CONFIG option for it I think a boot time commandline option is more appropriate for something like this. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 12:54 ` David S. Miller @ 2001-11-09 13:15 ` Philip Dodd 2001-11-09 13:26 ` David S. Miller 2001-11-09 13:17 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: Philip Dodd @ 2001-11-09 13:15 UTC (permalink / raw) To: alan, David S. Miller; +Cc: ak, anton, mingo, linux-kernel > > we need a CONFIG option for it > > I think a boot time commandline option is more appropriate > for something like this. In the light of what was said about embedded systems, I'm not really sure a boot time option really is the way to go... Just a thought. Philip DODD Sales Engineer SIVA Les Fjords - Immeuble Narvik 19 Avenue de Norvège Z.A. de Courtaboeuf 1 91953 LES ULIS CEDEX http://www.siva.fr ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:15 ` Philip Dodd @ 2001-11-09 13:26 ` David S. Miller 2001-11-09 20:45 ` Mike Fedyk 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 13:26 UTC (permalink / raw) To: smpcomputing; +Cc: alan, ak, anton, mingo, linux-kernel From: "Philip Dodd" <smpcomputing@free.fr> Date: Fri, 9 Nov 2001 14:15:32 +0100 > I think a boot time commandline option is more appropriate > for something like this. In the light of what was said about embedded systems, I'm not really sure a boot time option really is the way to go... All the hash tables in question are allocated dynamically, we size them at boot time, the memory is not consumed until the kernel begins executing. So a boottime option would be just fine. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
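The boot-time knob Davem has in mind can be modelled in userspace. The parameter name is hypothetical; in the 2.4 kernel this would be a `__setup()` handler consuming a command-line option before the table is allocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static unsigned long hash_entries_override;  /* 0 = use computed default */

/* Parse one boot argument; anything else is ignored. */
static void parse_boot_arg(const char *arg)
{
    if (strncmp(arg, "hashentries=", 12) == 0)
        hash_entries_override = strtoul(arg + 12, NULL, 0);
}

/* Size the table: the override wins when given, e.g. on a small
 * machine; otherwise use the value computed from available memory. */
static unsigned long hash_entries(unsigned long computed_default)
{
    return hash_entries_override ? hash_entries_override
                                 : computed_default;
}
```

Because the tables are allocated dynamically at boot, this shrinks memory use with no code-size or runtime cost, which is the advantage over a CONFIG option.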
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:26 ` David S. Miller @ 2001-11-09 20:45 ` Mike Fedyk 0 siblings, 0 replies; 45+ messages in thread From: Mike Fedyk @ 2001-11-09 20:45 UTC (permalink / raw) To: David S. Miller; +Cc: smpcomputing, alan, ak, anton, mingo, linux-kernel On Fri, Nov 09, 2001 at 05:26:50AM -0800, David S. Miller wrote: > From: "Philip Dodd" <smpcomputing@free.fr> > Date: Fri, 9 Nov 2001 14:15:32 +0100 > > > I think a boot time commandline option is more appropriate > > for something like this. > > In the light of what was said about embedded systems, I'm not really sure a > boot time option really is the way to go... > > All the hash tables in question are allocated dynamically, > we size them at boot time, the memory is not consumed until > the kernel begins executing. So a boottime option would be > just fine. How much is this code going to affect the kernel image size? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 12:54 ` David S. Miller 2001-11-09 13:15 ` Philip Dodd @ 2001-11-09 13:17 ` Andi Kleen 2001-11-09 13:25 ` David S. Miller 1 sibling, 1 reply; 45+ messages in thread From: Andi Kleen @ 2001-11-09 13:17 UTC (permalink / raw) To: David S. Miller; +Cc: alan, ak, anton, mingo, linux-kernel On Fri, Nov 09, 2001 at 04:54:55AM -0800, David S. Miller wrote: > From: Alan Cox <alan@lxorguk.ukuu.org.uk> > Date: Fri, 9 Nov 2001 12:59:09 +0000 (GMT) > > we need a CONFIG option for it > > I think a boot time commandline option is more appropriate > for something like this. Fine if you don't mind an indirect function call pointer somewhere in the TCP hash path. I'm thinking about adding one that removes the separate time wait table. It is not needed for desktops because they should have little or no time-wait sockets. also it should throttle the hash table sizing aggressively; e.g. 256-512 buckets should be more than enough for a client. BTW I noticed that 1/4 of the big hash table is not used on SMP. The time wait buckets share the locks of the lower half, so the spinlocks in the upper half are never used. What would you think about splitting the table and not putting spinlocks in the time-wait range? -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:17 ` Andi Kleen @ 2001-11-09 13:25 ` David S. Miller 2001-11-09 13:39 ` Andi Kleen 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 13:25 UTC (permalink / raw) To: ak; +Cc: alan, anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 14:17:55 +0100 Fine if you don't mind an indirect function call pointer somewhere in the TCP hash path. The hashes are sized at boot time, we can just reduce the size when the boot time option says "small machine" or whatever. Why in the world do we need indirection function call pointers in TCP to handle that? Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:25 ` David S. Miller @ 2001-11-09 13:39 ` Andi Kleen 2001-11-09 13:41 ` David S. Miller 0 siblings, 1 reply; 45+ messages in thread From: Andi Kleen @ 2001-11-09 13:39 UTC (permalink / raw) To: David S. Miller; +Cc: ak, alan, anton, mingo, linux-kernel On Fri, Nov 09, 2001 at 05:25:54AM -0800, David S. Miller wrote: > Why in the world do we need indirection function call pointers > in TCP to handle that? To handle the case of not having a separate TIME-WAIT table (sorry for being unclear). Or alternatively several conditionals. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:39 ` Andi Kleen @ 2001-11-09 13:41 ` David S. Miller 0 siblings, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 13:41 UTC (permalink / raw) To: ak; +Cc: alan, anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 14:39:30 +0100 On Fri, Nov 09, 2001 at 05:25:54AM -0800, David S. Miller wrote: > Why in the world do we need indirection function call pointers > in TCP to handle that? To handle the case of not having a separate TIME-WAIT table (sorry for being unclear). Or alternatively several conditionals. The TIME-WAIT half of the hash table is most useful on clients actually. I mean, just double the amount you "downsize" the TCP established hash table if it bothers you that much. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` David S. Miller 2001-11-09 12:59 ` Alan Cox @ 2001-11-10 5:20 ` Anton Blanchard 1 sibling, 0 replies; 45+ messages in thread From: Anton Blanchard @ 2001-11-10 5:20 UTC (permalink / raw) To: David S. Miller; +Cc: ak, mingo, linux-kernel Hi, > It _IS_ a big deal. Fetching _ONE_ hash chain cache line > is always going to be cheaper than fetching _FIVE_ to _TEN_ > page struct cache lines while walking the list. Exactly, the reason I found the pagecache hash was too small was because __find_page_nolock was one of the worst offenders when doing zero copy web serving of a large dataset. > Even if prefetch would kill all of this overhead (sorry, it won't), it > is _DUMB_ and _STUPID_ to bring those _FIVE_ to _TEN_ cache lines into > the processor just to lookup _ONE_ page. Yes, you can't expect prefetch to help you when you use the data 10 instructions after you issue the prefetch (i.e. walking the hash chain). Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen ` (2 preceding siblings ...) 2001-11-09 7:16 ` David S. Miller @ 2001-11-10 4:56 ` Anton Blanchard 2001-11-10 5:09 ` Andi Kleen 2001-11-10 13:29 ` David S. Miller 3 siblings, 2 replies; 45+ messages in thread From: Anton Blanchard @ 2001-11-10 4:56 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, mingo, linux-kernel Hi, > I'm assuming that walking on average 5-10 pages on a lookup is not too big a > deal, especially when you use prefetch for the list walk. It is a tradeoff > between a big hash table thrashing your cache and a smaller hash table that > can be cached but has on average >1 entry/bucket. At some point the > smaller hash table wins, assuming the hash function is evenly distributed. > > It would only get bad if the average chain length would become much bigger. > > Before jumping to real conclusions it would be interesting to gather > some statistics on Anton's machine, but I suspect he just has a very > unevenly populated table. You can find the raw data here: http://samba.org/~anton/linux/pagecache/pagecache_data_gfp.gz http://samba.org/~anton/linux/pagecache/pagecache_data_vmalloc.gz You can see the average depth of the get_free_page hash is way too deep. I agree there are a lot of pagecache pages (17GB in the gfp test and 21GB in the vmalloc test), but we have to make use of the 32GB of RAM :) I did some experimentation with prefetch and I don't think it will gain you anything here. We need to issue the prefetch many cycles before using the data, which we cannot do when walking the chain. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 4:56 ` Anton Blanchard @ 2001-11-10 5:09 ` Andi Kleen 2001-11-10 13:29 ` David S. Miller 1 sibling, 0 replies; 45+ messages in thread From: Andi Kleen @ 2001-11-10 5:09 UTC (permalink / raw) To: Anton Blanchard; +Cc: linux-kernel > You can see the average depth of the get_free_page hash is way too deep. > I agree there are a lot of pagecache pages (17GB in the gfp test and 21GB > in the vmalloc test), but we have to make use of the 32GB of RAM :) Thanks for the information. I guess the fix for your case would then be to use the bootmem allocator for allocating the page hash table. It should have no problems with very large contiguous tables, assuming you have the (physically contiguous) memory. Another possibility would be to switch to some tree/skiplist, but that's probably too radical and may have other problems on smaller boxes. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 4:56 ` Anton Blanchard 2001-11-10 5:09 ` Andi Kleen @ 2001-11-10 13:29 ` David S. Miller 2001-11-10 13:44 ` David S. Miller 2001-11-10 13:52 ` David S. Miller 1 sibling, 2 replies; 45+ messages in thread From: David S. Miller @ 2001-11-10 13:29 UTC (permalink / raw) To: anton; +Cc: ak, mingo, linux-kernel From: Anton Blanchard <anton@samba.org> Date: Sat, 10 Nov 2001 15:56:03 +1100 You can see the average depth of the get_free_page hash is way too deep. I agree there are a lot of pagecache pages (17GB in the gfp test and 21GB in the vmalloc test), but we have to make use of the 32GB of RAM :) Anton, are you bored? :-) If so, could you test out the patch below on your ppc64 box? It does the "page hash table via bootmem" thing. It is against 2.4.15-pre2. The ppc64-specific bits you'll need to do, but they should be very straightforward. It also fixes a really stupid bug in the bootmem allocator. If the bootmem area starts at an unaligned address, the "align" argument to the bootmem allocator isn't honored. 
--- ./arch/alpha/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/alpha/mm/init.c Sat Nov 10 01:49:56 2001 @@ -23,6 +23,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/uaccess.h> @@ -360,6 +361,7 @@ mem_init(void) { max_mapnr = num_physpages = max_low_pfn; + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); --- ./arch/alpha/mm/numa.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/alpha/mm/numa.c Sat Nov 10 01:52:27 2001 @@ -15,6 +15,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/hwrpb.h> #include <asm/pgalloc.h> @@ -359,8 +360,13 @@ extern char _text, _etext, _data, _edata; extern char __init_begin, __init_end; extern unsigned long totalram_pages; - unsigned long nid, i; + unsigned long nid, i, num_free_bootmem_pages; mem_map_t * lmem_map; + + num_free_bootmem_pages = 0; + for (nid = 0; nid < numnodes; nid++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(num_free_bootmem_pages); high_memory = (void *) __va(max_mapnr <<PAGE_SHIFT); --- ./arch/arm/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/arm/mm/init.c Sat Nov 10 01:52:34 2001 @@ -23,6 +23,7 @@ #include <linux/init.h> #include <linux/bootmem.h> #include <linux/blk.h> +#include <linux/pagemap.h> #include <asm/segment.h> #include <asm/mach-types.h> @@ -594,6 +595,7 @@ void __init mem_init(void) { unsigned int codepages, datapages, initpages; + unsigned long num_free_bootmem_pages; int i, node; codepages = &_etext - &_text; @@ -608,6 +610,11 @@ */ if (meminfo.nr_banks != 1) create_memmap_holes(&meminfo); + + num_free_bootmem_pages = 0; + for (node = 0; node < numnodes; node++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(node)); + page_cache_init(num_free_bootmem_pages); /* this will put all unused low memory onto the freelists */ for (node = 
0; node < numnodes; node++) { --- ./arch/i386/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/i386/mm/init.c Sat Nov 10 01:53:43 2001 @@ -455,6 +455,8 @@ #endif high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); --- ./arch/m68k/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/m68k/mm/init.c Sat Nov 10 01:54:47 2001 @@ -20,6 +20,7 @@ #ifdef CONFIG_BLK_DEV_RAM #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/setup.h> #include <asm/uaccess.h> @@ -135,6 +136,8 @@ if (MACH_IS_ATARI) atari_stram_mem_init_hook(); #endif + + page_cache_init(count_free_bootmem()); /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); --- ./arch/mips/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/mips/mm/init.c Sat Nov 10 01:55:09 2001 @@ -28,6 +28,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -203,6 +204,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. 
*/ --- ./arch/ppc/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/ppc/mm/init.c Sat Nov 10 01:57:34 2001 @@ -34,6 +34,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> /* for initrd_* */ #endif +#include <linux/pagemap.h> #include <asm/pgalloc.h> #include <asm/prom.h> @@ -462,6 +463,8 @@ high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); num_physpages = max_mapnr; /* RAM is assumed contiguous */ + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); --- ./arch/sparc/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/sparc/mm/init.c Sat Nov 10 01:59:48 2001 @@ -25,6 +25,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -434,6 +435,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(max_low_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); #ifdef DEBUG_BOOTMEM prom_printf("mem_init: Calling free_all_bootmem().\n"); --- ./arch/sparc64/mm/init.c.~1~ Fri Nov 9 18:42:08 2001 +++ ./arch/sparc64/mm/init.c Sat Nov 10 02:00:23 2001 @@ -16,6 +16,7 @@ #include <linux/blk.h> #include <linux/swap.h> #include <linux/swapctl.h> +#include <linux/pagemap.h> #include <asm/head.h> #include <asm/system.h> @@ -1584,6 +1585,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(last_valid_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); num_physpages = free_all_bootmem() - 1; --- ./arch/sh/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/sh/mm/init.c Sat Nov 10 01:59:56 2001 @@ -26,6 +26,7 @@ #endif #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/processor.h> #include <asm/system.h> @@ -139,6 +140,7 @@ void __init mem_init(void) { extern unsigned long empty_zero_page[1024]; + unsigned long num_free_bootmem_pages; int codesize, reservedpages, datasize, initsize; int tmp; @@ -148,6 +150,12 @@ /* clear the zero-page 
*/ memset(empty_zero_page, 0, PAGE_SIZE); __flush_wback_region(empty_zero_page, PAGE_SIZE); + + num_free_bootmem_pages = count_free_bootmem_node(NODE_DATA(0)); +#ifdef CONFIG_DISCONTIGMEM + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(1)); +#endif + page_cache_init(num_free_bootmem_pages); /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem_node(NODE_DATA(0)); --- ./arch/s390/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/s390/mm/init.c Sat Nov 10 01:57:56 2001 @@ -186,6 +186,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); --- ./arch/ia64/mm/init.c.~1~ Fri Nov 9 19:08:02 2001 +++ ./arch/ia64/mm/init.c Sat Nov 10 01:54:20 2001 @@ -13,6 +13,7 @@ #include <linux/reboot.h> #include <linux/slab.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/bitops.h> #include <asm/dma.h> @@ -406,6 +407,8 @@ max_mapnr = max_low_pfn; high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); --- ./arch/mips64/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/mips64/mm/init.c Sat Nov 10 01:55:30 2001 @@ -25,6 +25,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -396,6 +397,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. 
*/ --- ./arch/mips64/sgi-ip27/ip27-memory.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/mips64/sgi-ip27/ip27-memory.c Sat Nov 10 02:02:33 2001 @@ -15,6 +15,7 @@ #include <linux/mm.h> #include <linux/bootmem.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/page.h> #include <asm/bootinfo.h> @@ -277,6 +278,11 @@ num_physpages = numpages; /* memory already sized by szmem */ max_mapnr = pagenr; /* already found during paging_init */ high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + tmp = 0; + for (nid = 0; nid < numnodes; nid++) + tmp += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(tmp); for (nid = 0; nid < numnodes; nid++) { --- ./arch/parisc/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/parisc/mm/init.c Sat Nov 10 01:57:11 2001 @@ -17,6 +17,7 @@ #include <linux/pci.h> /* for hppa_dma_ops and pcxl_dma_ops */ #include <linux/swap.h> #include <linux/unistd.h> +#include <linux/pagemap.h> #include <asm/pgalloc.h> @@ -48,6 +49,8 @@ { max_mapnr = num_physpages = max_low_pfn; high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); printk("Memory: %luk available\n", totalram_pages << (PAGE_SHIFT-10)); --- ./arch/cris/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/cris/mm/init.c Sat Nov 10 01:53:10 2001 @@ -95,6 +95,7 @@ #include <linux/swap.h> #include <linux/smp.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -366,6 +367,8 @@ max_mapnr = num_physpages = max_low_pfn - min_low_pfn; + page_cache_init(count_free_bootmem()); + /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); --- ./arch/s390x/mm/init.c.~1~ Fri Nov 9 19:08:02 2001 +++ ./arch/s390x/mm/init.c Sat Nov 10 01:58:14 2001 @@ -198,6 +198,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ 
totalram_pages += free_all_bootmem(); --- ./include/linux/bootmem.h.~1~ Fri Nov 9 19:35:08 2001 +++ ./include/linux/bootmem.h Sat Nov 10 02:33:45 2001 @@ -43,11 +43,13 @@ #define alloc_bootmem_low_pages(x) \ __alloc_bootmem((x), PAGE_SIZE, 0) extern unsigned long __init free_all_bootmem (void); +extern unsigned long __init count_free_bootmem (void); extern unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn); extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size); extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size); extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat); +extern unsigned long __init count_free_bootmem_node (pg_data_t *pgdat); extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal); #define alloc_bootmem_node(pgdat, x) \ __alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS)) --- ./init/main.c.~1~ Fri Nov 9 19:08:11 2001 +++ ./init/main.c Sat Nov 10 04:58:16 2001 @@ -597,7 +597,6 @@ proc_caches_init(); vfs_caches_init(mempages); buffer_init(mempages); - page_cache_init(mempages); #if defined(CONFIG_ARCH_S390) ccwcache_init(); #endif --- ./mm/filemap.c.~1~ Fri Nov 9 19:08:11 2001 +++ ./mm/filemap.c Sat Nov 10 05:15:16 2001 @@ -24,6 +24,7 @@ #include <linux/mm.h> #include <linux/iobuf.h> #include <linux/compiler.h> +#include <linux/bootmem.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -2929,28 +2930,48 @@ goto unlock; } +/* This is called from the arch specific mem_init routine. + * It is done right before free_all_bootmem (or NUMA equivalent). + * + * The mempages arg is the number of pages free_all_bootmem is + * going to liberate, or a close approximation. + * + * We have to use bootmem because on huge systems (ie. 
16GB ram) + * get_free_pages cannot give us a large enough allocation. + */ void __init page_cache_init(unsigned long mempages) { - unsigned long htable_size, order; + unsigned long htable_size, real_size; htable_size = mempages; htable_size *= sizeof(struct page *); - for(order = 0; (PAGE_SIZE << order) < htable_size; order++) + + for (real_size = 1UL; real_size < htable_size; real_size <<= 1UL) ; do { - unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *); + unsigned long tmp = (real_size / sizeof(struct page *)); + unsigned long align; page_hash_bits = 0; while((tmp >>= 1UL) != 0UL) page_hash_bits++; + + align = real_size; + if (align > (4UL * 1024UL * 1024UL)) + align = (4UL * 1024UL * 1024UL); + + page_hash_table = __alloc_bootmem(real_size, align, + __pa(MAX_DMA_ADDRESS)); + + /* Perhaps the alignment was too strict. */ + if (page_hash_table == NULL) + page_hash_table = alloc_bootmem(real_size); + } while (page_hash_table == NULL && + (real_size >>= 1UL) >= PAGE_SIZE); - page_hash_table = (struct page **) - __get_free_pages(GFP_ATOMIC, order); - } while(page_hash_table == NULL && --order > 0); - - printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n", - (1 << page_hash_bits), order, (PAGE_SIZE << order)); + printk("Page-cache hash table entries: %d (%ld bytes)\n", + (1 << page_hash_bits), real_size); if (!page_hash_table) panic("Failed to allocate page hash table\n"); memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *)); ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 13:29 ` David S. Miller @ 2001-11-10 13:44 ` David S. Miller 2001-11-10 13:52 ` David S. Miller 1 sibling, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-10 13:44 UTC (permalink / raw) To: anton; +Cc: ak, mingo, linux-kernel From: "David S. Miller" <davem@redhat.com> Date: Sat, 10 Nov 2001 05:29:17 -0800 (PST) Anton, are you bored? :-) If so, could you test out the patch below on your ppc64 box? It does the "page hash table via bootmem" thing. It is against 2.4.15-pre2 Erm, ignore this patch, it was incomplete, I'll diff it up properly. Sorry... Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 13:29 ` David S. Miller 2001-11-10 13:44 ` David S. Miller @ 2001-11-10 13:52 ` David S. Miller 1 sibling, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-10 13:52 UTC (permalink / raw) To: anton; +Cc: ak, mingo, linux-kernel Ok, this should be a working patch, try this one :-) diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/init.c linux/arch/alpha/mm/init.c --- vanilla/linux/arch/alpha/mm/init.c Thu Sep 20 20:02:03 2001 +++ linux/arch/alpha/mm/init.c Sat Nov 10 01:49:56 2001 @@ -23,6 +23,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/uaccess.h> @@ -360,6 +361,7 @@ mem_init(void) { max_mapnr = num_physpages = max_low_pfn; + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/numa.c linux/arch/alpha/mm/numa.c --- vanilla/linux/arch/alpha/mm/numa.c Sun Aug 12 10:38:48 2001 +++ linux/arch/alpha/mm/numa.c Sat Nov 10 01:52:27 2001 @@ -15,6 +15,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/hwrpb.h> #include <asm/pgalloc.h> @@ -359,8 +360,13 @@ extern char _text, _etext, _data, _edata; extern char __init_begin, __init_end; extern unsigned long totalram_pages; - unsigned long nid, i; + unsigned long nid, i, num_free_bootmem_pages; mem_map_t * lmem_map; + + num_free_bootmem_pages = 0; + for (nid = 0; nid < numnodes; nid++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(num_free_bootmem_pages); high_memory = (void *) __va(max_mapnr <<PAGE_SHIFT); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/arm/mm/init.c linux/arch/arm/mm/init.c --- 
vanilla/linux/arch/arm/mm/init.c Thu Oct 11 09:04:57 2001 +++ linux/arch/arm/mm/init.c Sat Nov 10 01:52:34 2001 @@ -23,6 +23,7 @@ #include <linux/init.h> #include <linux/bootmem.h> #include <linux/blk.h> +#include <linux/pagemap.h> #include <asm/segment.h> #include <asm/mach-types.h> @@ -594,6 +595,7 @@ void __init mem_init(void) { unsigned int codepages, datapages, initpages; + unsigned long num_free_bootmem_pages; int i, node; codepages = &_etext - &_text; @@ -608,6 +610,11 @@ */ if (meminfo.nr_banks != 1) create_memmap_holes(&meminfo); + + num_free_bootmem_pages = 0; + for (node = 0; node < numnodes; node++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(node)); + page_cache_init(num_free_bootmem_pages); /* this will put all unused low memory onto the freelists */ for (node = 0; node < numnodes; node++) { diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/cris/mm/init.c linux/arch/cris/mm/init.c --- vanilla/linux/arch/cris/mm/init.c Thu Jul 26 15:10:06 2001 +++ linux/arch/cris/mm/init.c Sat Nov 10 01:53:10 2001 @@ -95,6 +95,7 @@ #include <linux/swap.h> #include <linux/smp.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -366,6 +367,8 @@ max_mapnr = num_physpages = max_low_pfn - min_low_pfn; + page_cache_init(count_free_bootmem()); + /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/i386/mm/init.c linux/arch/i386/mm/init.c --- vanilla/linux/arch/i386/mm/init.c Thu Sep 20 19:59:20 2001 +++ linux/arch/i386/mm/init.c Sat Nov 10 01:53:43 2001 @@ -455,6 +455,8 @@ #endif high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ia64/mm/init.c 
linux/arch/ia64/mm/init.c --- vanilla/linux/arch/ia64/mm/init.c Fri Nov 9 18:39:51 2001 +++ linux/arch/ia64/mm/init.c Sat Nov 10 01:54:20 2001 @@ -13,6 +13,7 @@ #include <linux/reboot.h> #include <linux/slab.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/bitops.h> #include <asm/dma.h> @@ -406,6 +407,8 @@ max_mapnr = max_low_pfn; high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/m68k/mm/init.c linux/arch/m68k/mm/init.c --- vanilla/linux/arch/m68k/mm/init.c Thu Sep 20 20:02:03 2001 +++ linux/arch/m68k/mm/init.c Sat Nov 10 01:54:47 2001 @@ -20,6 +20,7 @@ #ifdef CONFIG_BLK_DEV_RAM #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/setup.h> #include <asm/uaccess.h> @@ -135,6 +136,8 @@ if (MACH_IS_ATARI) atari_stram_mem_init_hook(); #endif + + page_cache_init(count_free_bootmem()); /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips/mm/init.c linux/arch/mips/mm/init.c --- vanilla/linux/arch/mips/mm/init.c Wed Jul 4 11:50:39 2001 +++ linux/arch/mips/mm/init.c Sat Nov 10 01:55:09 2001 @@ -28,6 +28,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -203,6 +204,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. 
*/ diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/mm/init.c linux/arch/mips64/mm/init.c --- vanilla/linux/arch/mips64/mm/init.c Wed Jul 4 11:50:39 2001 +++ linux/arch/mips64/mm/init.c Sat Nov 10 01:55:30 2001 @@ -25,6 +25,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -396,6 +397,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. */ diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c linux/arch/mips64/sgi-ip27/ip27-memory.c --- vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c Sun Sep 9 10:43:02 2001 +++ linux/arch/mips64/sgi-ip27/ip27-memory.c Sat Nov 10 02:02:33 2001 @@ -15,6 +15,7 @@ #include <linux/mm.h> #include <linux/bootmem.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/page.h> #include <asm/bootinfo.h> @@ -277,6 +278,11 @@ num_physpages = numpages; /* memory already sized by szmem */ max_mapnr = pagenr; /* already found during paging_init */ high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + tmp = 0; + for (nid = 0; nid < numnodes; nid++) + tmp += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(tmp); for (nid = 0; nid < numnodes; nid++) { diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/parisc/mm/init.c linux/arch/parisc/mm/init.c --- vanilla/linux/arch/parisc/mm/init.c Tue Dec 5 12:29:39 2000 +++ linux/arch/parisc/mm/init.c Sat Nov 10 01:57:11 2001 @@ -17,6 +17,7 @@ #include <linux/pci.h> /* for hppa_dma_ops and pcxl_dma_ops */ #include <linux/swap.h> #include <linux/unistd.h> +#include <linux/pagemap.h> #include <asm/pgalloc.h> @@ -48,6 +49,8 @@ { max_mapnr = num_physpages = max_low_pfn; 
high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); printk("Memory: %luk available\n", totalram_pages << (PAGE_SHIFT-10)); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ppc/mm/init.c linux/arch/ppc/mm/init.c --- vanilla/linux/arch/ppc/mm/init.c Tue Oct 2 09:12:44 2001 +++ linux/arch/ppc/mm/init.c Sat Nov 10 01:57:34 2001 @@ -34,6 +34,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> /* for initrd_* */ #endif +#include <linux/pagemap.h> #include <asm/pgalloc.h> #include <asm/prom.h> @@ -462,6 +463,8 @@ high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); num_physpages = max_mapnr; /* RAM is assumed contiguous */ + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390/mm/init.c linux/arch/s390/mm/init.c --- vanilla/linux/arch/s390/mm/init.c Thu Oct 11 09:04:57 2001 +++ linux/arch/s390/mm/init.c Sat Nov 10 01:57:56 2001 @@ -186,6 +186,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390x/mm/init.c linux/arch/s390x/mm/init.c --- vanilla/linux/arch/s390x/mm/init.c Fri Nov 9 18:39:51 2001 +++ linux/arch/s390x/mm/init.c Sat Nov 10 01:58:14 2001 @@ -198,6 +198,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sh/mm/init.c linux/arch/sh/mm/init.c --- vanilla/linux/arch/sh/mm/init.c Mon Oct 15 13:36:48 2001 +++ linux/arch/sh/mm/init.c Sat Nov 10 01:59:56 2001 @@ -26,6 
+26,7 @@ #endif #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/processor.h> #include <asm/system.h> @@ -139,6 +140,7 @@ void __init mem_init(void) { extern unsigned long empty_zero_page[1024]; + unsigned long num_free_bootmem_pages; int codesize, reservedpages, datasize, initsize; int tmp; @@ -148,6 +150,12 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); __flush_wback_region(empty_zero_page, PAGE_SIZE); + + num_free_bootmem_pages = count_free_bootmem_node(NODE_DATA(0)); +#ifdef CONFIG_DISCONTIGMEM + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(1)); +#endif + page_cache_init(num_free_bootmem_pages); /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem_node(NODE_DATA(0)); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc/mm/init.c linux/arch/sparc/mm/init.c --- vanilla/linux/arch/sparc/mm/init.c Mon Oct 1 09:19:56 2001 +++ linux/arch/sparc/mm/init.c Sat Nov 10 05:30:31 2001 @@ -1,4 +1,4 @@ -/* $Id: init.c,v 1.100 2001/09/21 22:51:47 davem Exp $ +/* $Id: init.c,v 1.101 2001/11/10 13:30:31 davem Exp $ * linux/arch/sparc/mm/init.c * * Copyright (C) 1995 David S. 
Miller (davem@caip.rutgers.edu) @@ -25,6 +25,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -434,6 +435,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(max_low_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); #ifdef DEBUG_BOOTMEM prom_printf("mem_init: Calling free_all_bootmem().\n"); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc64/mm/init.c linux/arch/sparc64/mm/init.c --- vanilla/linux/arch/sparc64/mm/init.c Tue Oct 30 15:08:11 2001 +++ linux/arch/sparc64/mm/init.c Sat Nov 10 05:30:31 2001 @@ -1,4 +1,4 @@ -/* $Id: init.c,v 1.199 2001/10/25 18:48:03 davem Exp $ +/* $Id: init.c,v 1.201 2001/11/10 13:30:31 davem Exp $ * arch/sparc64/mm/init.c * * Copyright (C) 1996-1999 David S. Miller (davem@caip.rutgers.edu) @@ -16,6 +16,7 @@ #include <linux/blk.h> #include <linux/swap.h> #include <linux/swapctl.h> +#include <linux/pagemap.h> #include <asm/head.h> #include <asm/system.h> @@ -1400,7 +1401,7 @@ if (second_alias_page) spitfire_flush_dtlb_nucleus_page(second_alias_page); - flush_tlb_all(); + __flush_tlb_all(); { unsigned long zones_size[MAX_NR_ZONES]; @@ -1584,6 +1585,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(last_valid_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); num_physpages = free_all_bootmem() - 1; diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/include/linux/bootmem.h linux/include/linux/bootmem.h --- vanilla/linux/include/linux/bootmem.h Mon Nov 5 12:43:18 2001 +++ linux/include/linux/bootmem.h Sat Nov 10 02:33:45 2001 @@ -43,11 +43,13 @@ #define alloc_bootmem_low_pages(x) \ __alloc_bootmem((x), PAGE_SIZE, 0) extern unsigned long __init free_all_bootmem (void); +extern unsigned long __init count_free_bootmem (void); extern unsigned long __init init_bootmem_node (pg_data_t 
 *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn);
 extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size);
 extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size);
 extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat);
+extern unsigned long __init count_free_bootmem_node (pg_data_t *pgdat);
 extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
 #define alloc_bootmem_node(pgdat, x) \
 	__alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/init/main.c linux/init/main.c
--- vanilla/linux/init/main.c	Fri Nov  9 18:40:00 2001
+++ linux/init/main.c	Sat Nov 10 04:58:16 2001
@@ -597,7 +597,6 @@
 	proc_caches_init();
 	vfs_caches_init(mempages);
 	buffer_init(mempages);
-	page_cache_init(mempages);
 #if defined(CONFIG_ARCH_S390)
 	ccwcache_init();
 #endif
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/bootmem.c linux/mm/bootmem.c
--- vanilla/linux/mm/bootmem.c	Tue Sep 18 14:10:43 2001
+++ linux/mm/bootmem.c	Sat Nov 10 05:18:53 2001
@@ -154,6 +154,9 @@
 	if (align & (align-1))
 		BUG();

+	offset = (bdata->node_boot_start & (align - 1));
+	offset >>= PAGE_SHIFT;
+
 	/*
 	 * We try to allocate bootmem pages above 'goal'
 	 * first, then we try to allocate lower pages.
@@ -165,6 +168,7 @@
 		preferred = 0;

 	preferred = ((preferred + align - 1) & ~(align - 1)) >> PAGE_SHIFT;
+	preferred += offset;
 	areasize = (size+PAGE_SIZE-1)/PAGE_SIZE;
 	incr = align >> PAGE_SHIFT ? : 1;
@@ -184,7 +188,7 @@
 fail_block:;
 	}
 	if (preferred) {
-		preferred = 0;
+		preferred = offset;
 		goto restart_scan;
 	}
 	return NULL;
@@ -272,6 +276,28 @@
 	return total;
 }

+static unsigned long __init count_free_bootmem_core(pg_data_t *pgdat)
+{
+	bootmem_data_t *bdata = pgdat->bdata;
+	unsigned long i, idx, total;
+
+	if (!bdata->node_bootmem_map)
+		BUG();
+
+	total = 0;
+	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
+	for (i = 0; i < idx; i++) {
+		if (!test_bit(i, bdata->node_bootmem_map))
+			total++;
+	}
+
+	/*
+	 * Count the allocator bitmap itself.
+	 */
+	total += ((bdata->node_low_pfn-(bdata->node_boot_start >> PAGE_SHIFT))/8 + PAGE_SIZE-1)/PAGE_SIZE;
+
+	return total;
+}
+
 unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn)
 {
 	return(init_bootmem_core(pgdat, freepfn, startpfn, endpfn));
@@ -292,6 +318,11 @@
 	return(free_all_bootmem_core(pgdat));
 }

+unsigned long __init count_free_bootmem_node (pg_data_t *pgdat)
+{
+	return(count_free_bootmem_core(pgdat));
+}
+
 unsigned long __init init_bootmem (unsigned long start, unsigned long pages)
 {
 	max_low_pfn = pages;
@@ -312,6 +343,11 @@
 unsigned long __init free_all_bootmem (void)
 {
 	return(free_all_bootmem_core(&contig_page_data));
+}
+
+unsigned long __init count_free_bootmem (void)
+{
+	return(count_free_bootmem_core(&contig_page_data));
 }

 void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/filemap.c linux/mm/filemap.c
--- vanilla/linux/mm/filemap.c	Fri Nov  9 18:40:00 2001
+++ linux/mm/filemap.c	Sat Nov 10 05:15:16 2001
@@ -24,6 +24,7 @@
 #include <linux/mm.h>
 #include <linux/iobuf.h>
 #include <linux/compiler.h>
+#include <linux/bootmem.h>

 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -2929,28 +2930,48 @@
 	goto unlock;
 }

+/* This is called from the arch specific mem_init routine.
+ * It is done right before free_all_bootmem (or NUMA equivalent).
+ *
+ * The mempages arg is the number of pages free_all_bootmem is
+ * going to liberate, or a close approximation.
+ *
+ * We have to use bootmem because on huge systems (ie. 16GB ram)
+ * get_free_pages cannot give us a large enough allocation.
+ */
 void __init page_cache_init(unsigned long mempages)
 {
-	unsigned long htable_size, order;
+	unsigned long htable_size, real_size;

 	htable_size = mempages;
 	htable_size *= sizeof(struct page *);
-	for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
+
+	for (real_size = 1UL; real_size < htable_size; real_size <<= 1UL)
 		;

 	do {
-		unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
+		unsigned long tmp = (real_size / sizeof(struct page *));
+		unsigned long align;

 		page_hash_bits = 0;
 		while((tmp >>= 1UL) != 0UL)
 			page_hash_bits++;
+
+		align = real_size;
+		if (align > (4UL * 1024UL * 1024UL))
+			align = (4UL * 1024UL * 1024UL);
+
+		page_hash_table = __alloc_bootmem(real_size, align,
+						  __pa(MAX_DMA_ADDRESS));
+
+		/* Perhaps the alignment was too strict. */
+		if (page_hash_table == NULL)
+			page_hash_table = alloc_bootmem(real_size);
+	} while (page_hash_table == NULL &&
+		 (real_size >>= 1UL) >= PAGE_SIZE);

-		page_hash_table = (struct page **)
-			__get_free_pages(GFP_ATOMIC, order);
-	} while(page_hash_table == NULL && --order > 0);
-
-	printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
-	       (1 << page_hash_bits), order, (PAGE_SIZE << order));
+	printk("Page-cache hash table entries: %d (%ld bytes)\n",
+	       (1 << page_hash_bits), real_size);

 	if (!page_hash_table)
 		panic("Failed to allocate page hash table\n");

 	memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));

^ permalink raw reply	[flat|nested] 45+ messages in thread
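For readers skimming the patch, the sizing strategy in the new page_cache_init() can be modelled outside the kernel. The following is a hypothetical user-space sketch (fake_alloc_bootmem(), pick_htable_size(), mem_available, and the fixed 4096-byte page size are inventions for illustration, not kernel APIs): round the requested table size up to a power of two, cap the alignment request at 4MB, and halve the size after each failed allocation until it would fall below one page.

```c
#define PAGE_SIZE 4096UL
#define MAX_HASH_ALIGN (4UL * 1024UL * 1024UL)	/* 4MB alignment cap from the patch */

/* Hypothetical stand-in for bootmem: pretend any request larger than
 * mem_available bytes fails, so the shrink-and-retry loop can be
 * exercised in user space. */
static unsigned long mem_available;

static int fake_alloc_bootmem(unsigned long size, unsigned long align)
{
	(void)align;	/* the real __alloc_bootmem() honours this; our model ignores it */
	return size <= mem_available;
}

/* The size the loop in the patched page_cache_init() would settle on:
 * the smallest power of two >= wanted that the allocator can satisfy,
 * or 0 once we would drop below one page. */
unsigned long pick_htable_size(unsigned long wanted, unsigned long mem)
{
	unsigned long real_size, align;

	mem_available = mem;
	/* Round the wanted size up to a power of two, as the patch does. */
	for (real_size = 1UL; real_size < wanted; real_size <<= 1UL)
		;
	do {
		align = real_size;
		if (align > MAX_HASH_ALIGN)
			align = MAX_HASH_ALIGN;
		/* The patch tries an aligned allocation first, then a
		 * relaxed alloc_bootmem() fallback; one attempt suffices
		 * in this model. */
		if (fake_alloc_bootmem(real_size, align))
			return real_size;
	} while ((real_size >>= 1UL) >= PAGE_SIZE);
	return 0;
}
```

The point of the fallbacks is graceful degradation: a huge machine gets a large, physically aligned table, while a constrained one still boots with whatever smaller power of two fits.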
* Re: speed difference between using hard-linked and modular drives?
  2001-11-08 23:00             ` Andi Kleen
  2001-11-09  0:05               ` Anton Blanchard
@ 2001-11-09  3:12               ` Rusty Russell
  2001-11-09  5:59                 ` Andi Kleen
  2001-11-09 11:16                 ` Helge Hafting
  1 sibling, 2 replies; 45+ messages in thread
From: Rusty Russell @ 2001-11-09  3:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel

On 09 Nov 2001 00:00:19 +0100
Andi Kleen <ak@suse.de> wrote:
> Ingo Molnar <mingo@elte.hu> writes:
> >
> > we should fix this by trying to allocate continuous physical memory if
> > possible, and fall back to vmalloc() only if this allocation fails.
>
> Check -aa.  A patch to do that has been in there for some time now.
>
> -Andi
>
> P.S.: It makes a measurable difference with some Oracle benchmarks with
> the Qlogic driver.

Modules have lots of little disadvantages that add up.  The speed penalty
on various platforms is one, the load/unload race complexity is another.

There's a widespread "modules are free!" mentality: they're not, and we
can add complexity trying to make them "free", but it might be wiser to
realize that dynamic adding and deleting from a running kernel is a
problem on par with a pageable kernel, and may not be the greatest thing
since sliced bread.

Rusty.
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-09  3:12               ` Rusty Russell
@ 2001-11-09  5:59                 ` Andi Kleen
  2001-11-09 11:16                 ` Helge Hafting
  1 sibling, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2001-11-09  5:59 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Andi Kleen, mingo, linux-kernel

On Fri, Nov 09, 2001 at 02:12:15PM +1100, Rusty Russell wrote:
> Modules have lots of little disadvantages that add up.  The speed penalty
> on various platforms is one, the load/unload race complexity is another.

At least for the speed penalty due to TLB thrashing: I would not really
blame modules in this case; it is just an application crying for large
pages support.

-Andi
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-09  3:12               ` Rusty Russell
  2001-11-09  5:59                 ` Andi Kleen
@ 2001-11-09 11:16                 ` Helge Hafting
  2001-11-12  9:59                   ` Rusty Russell
  1 sibling, 1 reply; 45+ messages in thread
From: Helge Hafting @ 2001-11-09 11:16 UTC (permalink / raw)
  To: Rusty Russell, linux-kernel

Rusty Russell wrote:
> Modules have lots of little disadvantages that add up.  The speed penalty
> on various platforms is one, the load/unload race complexity is another.

Races can be fixed.  (Isn't that one of the things considered for 2.5?)
The speed penalty on various platforms is there to stay, so you simply
have to weigh that against having more swappable RAM.

I use the following rules of thumb:

1. Modules only for seldom-used devices.  A module for the mouse is no
   use if you do all your work in X.  There's simply no gain from a
   module that never unloads.  A seldom-used fs may be modular though.
   I rarely use cd's, so isofs is a module on my machine.

2. No modules for high-speed stuff like harddisks and network; that's
   where you might feel the slowdown.  Low-speed stuff like the floppy
   and cdrom drivers is modular though.

Helge Hafting
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-09 11:16                 ` Helge Hafting
@ 2001-11-12  9:59                   ` Rusty Russell
  2001-11-12 23:23                     ` David S. Miller
  0 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2001-11-12  9:59 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel

On Fri, 09 Nov 2001 12:16:49 +0100
Helge Hafting <helgehaf@idb.hist.no> wrote:
> Rusty Russell wrote:
>
> > Modules have lots of little disadvantages that add up.  The speed penalty
> > on various platforms is one, the load/unload race complexity is another.
>
> Races can be fixed.  (Isn't that one of the things considered for 2.5?)

We get more problems if we go preemptible (some seem to think that
preemption is "free").  And some races can be fixed by paying more of a
speed penalty (atomic_inc & atomic_dec_and_test for every packet,
anyone?).

Hope that clarifies,
Rusty.
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-12  9:59                   ` Rusty Russell
@ 2001-11-12 23:23                     ` David S. Miller
  2001-11-12 23:14                       ` Rusty Russell
  0 siblings, 1 reply; 45+ messages in thread
From: David S. Miller @ 2001-11-12 23:23 UTC (permalink / raw)
  To: rusty; +Cc: helgehaf, linux-kernel

   From: Rusty Russell <rusty@rustcorp.com.au>
   Date: Mon, 12 Nov 2001 20:59:05 +1100

   (atomic_inc & atomic_dec_and_test for every packet, anyone?).

We already do pay that price, in skb_release_data() :-)
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-12 23:23                     ` David S. Miller
@ 2001-11-12 23:14                       ` Rusty Russell
  2001-11-13  1:30                         ` Mike Fedyk
  0 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2001-11-12 23:14 UTC (permalink / raw)
  To: David S. Miller; +Cc: helgehaf, linux-kernel

In message <20011112.152304.39155908.davem@redhat.com> you write:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Mon, 12 Nov 2001 20:59:05 +1100
>
>    (atomic_inc & atomic_dec_and_test for every packet, anyone?).
>
> We already do pay that price, in skb_release_data() :-)

Sorry, I wasn't clear!  skb_release_data() does an atomic op on the
skb data region, which is almost certainly on the same CPU.  This is
an atomic op on a global counter for the module, which almost
certainly isn't.

For something which (statistically speaking) never happens (module
unload).

Ouch,
Rusty.
--
Premature optmztion is rt of all evl. --DK
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-12 23:14                       ` Rusty Russell
@ 2001-11-13  1:30                         ` Mike Fedyk
  2001-11-13  1:15                           ` David Lang
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Fedyk @ 2001-11-13  1:30 UTC (permalink / raw)
  To: Rusty Russell; +Cc: David S. Miller, helgehaf, linux-kernel

On Tue, Nov 13, 2001 at 10:14:22AM +1100, Rusty Russell wrote:
> In message <20011112.152304.39155908.davem@redhat.com> you write:
> > From: Rusty Russell <rusty@rustcorp.com.au>
> > Date: Mon, 12 Nov 2001 20:59:05 +1100
> >
> >    (atomic_inc & atomic_dec_and_test for every packet, anyone?).
> >
> > We already do pay that price, in skb_release_data() :-)
>
> Sorry, I wasn't clear!  skb_release_data() does an atomic op on the
> skb data region, which is almost certainly on the same CPU.  This is
> an atomic op on a global counter for the module, which almost
> certainly isn't.
>
> For something which (statistically speaking) never happens (module
> unload).

Is this in the fast path or slow path?

If it only happens on (un)load, then there isn't any cost until it's needed...

Mike
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-13  1:30                         ` Mike Fedyk
@ 2001-11-13  1:15                           ` David Lang
  0 siblings, 0 replies; 45+ messages in thread
From: David Lang @ 2001-11-13  1:15 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Rusty Russell, David S. Miller, helgehaf, linux-kernel

Mike, the point is that the module count inc/dec would need to be done
for every packet so that when you go to unload you can check the usage
value.  So the check is done in the slow path, but the inc/dec is done
in the fast path.

David Lang

On Mon, 12 Nov 2001, Mike Fedyk wrote:

> Date: Mon, 12 Nov 2001 17:30:14 -0800
> From: Mike Fedyk <mfedyk@matchmail.com>
> To: Rusty Russell <rusty@rustcorp.com.au>
> Cc: David S. Miller <davem@redhat.com>, helgehaf@idb.hist.no,
>     linux-kernel@vger.kernel.org
> Subject: Re: speed difference between using hard-linked and modular
>     drives?
>
> On Tue, Nov 13, 2001 at 10:14:22AM +1100, Rusty Russell wrote:
> > In message <20011112.152304.39155908.davem@redhat.com> you write:
> > > From: Rusty Russell <rusty@rustcorp.com.au>
> > > Date: Mon, 12 Nov 2001 20:59:05 +1100
> > >
> > >    (atomic_inc & atomic_dec_and_test for every packet, anyone?).
> > >
> > > We already do pay that price, in skb_release_data() :-)
> >
> > Sorry, I wasn't clear!  skb_release_data() does an atomic op on the
> > skb data region, which is almost certainly on the same CPU.  This is
> > an atomic op on a global counter for the module, which almost
> > certainly isn't.
> >
> > For something which (statistically speaking) never happens (module
> > unload).
>
> Is this in the fast path or slow path?
>
> If it only happens on (un)load, then there isn't any cost until it's needed...
>
> Mike
>
^ permalink raw reply	[flat|nested] 45+ messages in thread
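David's point — two atomic operations on a shared counter in the per-packet fast path, checked only at unload time — can be illustrated with a small model. This is a hypothetical user-space sketch in C11 (the function names are inventions, and atomic_fetch_add/atomic_fetch_sub merely stand in for the kernel's atomic_inc and atomic_dec_and_test; the real 2.4 module count machinery differs):

```c
#include <stdatomic.h>

/* Model of a module use count: every packet the driver handles bumps a
 * single counter shared across CPUs on entry and drops it on exit,
 * purely so that a rare unload can observe when the module is idle. */
static atomic_long use_count;

/* Fast path, per packet: the cost being complained about. */
static void packet_enter(void)
{
	atomic_fetch_add(&use_count, 1);	/* analogue of atomic_inc() */
}

/* Returns nonzero if this caller was the last user, mirroring
 * atomic_dec_and_test() (fetch_sub returns the value before the drop). */
static int packet_exit(void)
{
	return atomic_fetch_sub(&use_count, 1) == 1;
}

/* Slow path, almost never run: unloading is safe only when no packet
 * is in flight. */
static int can_unload(void)
{
	return atomic_load(&use_count) == 0;
}
```

The counter is global, so on SMP the cache line holding it bounces between CPUs on every packet, which is exactly why Rusty contrasts it with the skb's own refcount that usually stays CPU-local.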
end of thread, other threads:[~2001-11-13 1:40 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-11-08 16:01 speed difference between using hard-linked and modular drives? Roy Sigurd Karlsbakk
2001-11-08 17:02 ` Ingo Molnar
2001-11-08 17:37 ` Ingo Molnar
2001-11-08 23:59 ` Anton Blanchard
2001-11-09 5:11 ` Keith Owens
2001-11-10 3:35 ` Anton Blanchard
2001-11-10 7:26 ` Keith Owens
2001-11-08 17:53 ` Robert Love
[not found] <Pine.LNX.4.33.0111081802380.15975-100000@localhost.localdomain.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.33.0111081836080.15975-100000@localhost.localdomain.suse.lists.linux.kernel>
2001-11-08 23:00 ` Andi Kleen
2001-11-09 0:05 ` Anton Blanchard
2001-11-09 5:45 ` Andi Kleen
2001-11-09 6:04 ` David S. Miller
2001-11-09 6:39 ` Andi Kleen
2001-11-09 6:54 ` Andrew Morton
2001-11-09 7:17 ` David S. Miller
2001-11-09 7:16 ` Andrew Morton
2001-11-09 7:24 ` David S. Miller
2001-11-09 8:21 ` Ingo Molnar
2001-11-09 7:35 ` Andrew Morton
2001-11-09 7:44 ` David S. Miller
2001-11-09 7:14 ` David S. Miller
2001-11-09 7:16 ` David S. Miller
2001-11-09 12:59 ` Alan Cox
2001-11-09 12:54 ` David S. Miller
2001-11-09 13:15 ` Philip Dodd
2001-11-09 13:26 ` David S. Miller
2001-11-09 20:45 ` Mike Fedyk
2001-11-09 13:17 ` Andi Kleen
2001-11-09 13:25 ` David S. Miller
2001-11-09 13:39 ` Andi Kleen
2001-11-09 13:41 ` David S. Miller
2001-11-10 5:20 ` Anton Blanchard
2001-11-10 4:56 ` Anton Blanchard
2001-11-10 5:09 ` Andi Kleen
2001-11-10 13:29 ` David S. Miller
2001-11-10 13:44 ` David S. Miller
2001-11-10 13:52 ` David S. Miller
2001-11-09 3:12 ` Rusty Russell
2001-11-09 5:59 ` Andi Kleen
2001-11-09 11:16 ` Helge Hafting
2001-11-12 9:59 ` Rusty Russell
2001-11-12 23:23 ` David S. Miller
2001-11-12 23:14 ` Rusty Russell
2001-11-13 1:30 ` Mike Fedyk
2001-11-13 1:15 ` David Lang