* speed difference between using hard-linked and modular drives? @ 2001-11-08 16:01 Roy Sigurd Karlsbakk 2001-11-08 17:02 ` Ingo Molnar 2001-11-08 17:53 ` Robert Love 0 siblings, 2 replies; 45+ messages in thread From: Roy Sigurd Karlsbakk @ 2001-11-08 16:01 UTC (permalink / raw) To: linux-kernel hi Are there any speed difference between hard-linked device drivers and their modular counterparts? roy -- Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA Computers are like air conditioners. They stop working when you open Windows. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 16:01 speed difference between using hard-linked and modular drives? Roy Sigurd Karlsbakk @ 2001-11-08 17:02 ` Ingo Molnar 2001-11-08 17:37 ` Ingo Molnar 2001-11-08 23:59 ` Anton Blanchard 2001-11-08 17:53 ` Robert Love 1 sibling, 2 replies; 45+ messages in thread From: Ingo Molnar @ 2001-11-08 17:02 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: linux-kernel On Thu, 8 Nov 2001, Roy Sigurd Karlsbakk wrote: > Are there any speed difference between hard-linked device drivers and > their modular counterparts? minimal. a few instructions per IO. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 17:02 ` Ingo Molnar @ 2001-11-08 17:37 ` Ingo Molnar 2001-11-08 23:59 ` Anton Blanchard 1 sibling, 0 replies; 45+ messages in thread From: Ingo Molnar @ 2001-11-08 17:37 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: linux-kernel On Thu, 8 Nov 2001, Ingo Molnar wrote: > > Are there any speed difference between hard-linked device drivers and > > their modular counterparts? > > minimal. a few instructions per IO. Arjan pointed out that there is also the cost of TLB misses due to vmalloc()-ing module libraries, which can be as high as a 5% slowdown. we should fix this by trying to allocate continuous physical memory if possible, and fall back to vmalloc() only if this allocation fails. Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
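The fallback Ingo describes can be sketched as a userspace model (all names here are illustrative, not the 2.4 API; in the kernel the two paths would be `__get_free_pages()` and `vmalloc()`):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of "try contiguous first": alloc_contig() stands in
 * for __get_free_pages(), which fails above the buddy allocator's
 * order limit, and alloc_mapped() stands in for vmalloc(), which
 * succeeds for large sizes but costs TLB misses on every access. */
static void *alloc_contig(size_t size, size_t max_contig)
{
    if (size > max_contig)      /* too big for one contiguous chunk */
        return NULL;
    return malloc(size);
}

static void *alloc_mapped(size_t size)
{
    return malloc(size);
}

/* Try physically contiguous memory first; fall back only on failure. */
static void *module_alloc(size_t size, size_t max_contig, int *used_fallback)
{
    void *p = alloc_contig(size, max_contig);
    *used_fallback = (p == NULL);
    return p ? p : alloc_mapped(size);
}
```

The point of the ordering is that the fast path costs nothing when contiguous memory is available, and behavior is unchanged (just slower) when it is not.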
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 17:02 ` Ingo Molnar 2001-11-08 17:37 ` Ingo Molnar @ 2001-11-08 23:59 ` Anton Blanchard 2001-11-09 5:11 ` Keith Owens 1 sibling, 1 reply; 45+ messages in thread From: Anton Blanchard @ 2001-11-08 23:59 UTC (permalink / raw) To: Ingo Molnar; +Cc: Roy Sigurd Karlsbakk, linux-kernel > > Are there any speed difference between hard-linked device drivers and > > their modular counterparts? > > minimal. a few instructions per IO. Its worse on some architectures that need to pass through a trampoline when going between kernel and module (eg ppc). Its even worse on ppc64 at the moment because we have a local TOC per module which needs to be saved and restored. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 23:59 ` Anton Blanchard @ 2001-11-09 5:11 ` Keith Owens 2001-11-10 3:35 ` Anton Blanchard 0 siblings, 1 reply; 45+ messages in thread From: Keith Owens @ 2001-11-09 5:11 UTC (permalink / raw) To: Anton Blanchard; +Cc: linux-kernel On Fri, 9 Nov 2001 10:59:21 +1100, Anton Blanchard <anton@samba.org> wrote: > >> > Are there any speed difference between hard-linked device drivers and >> > their modular counterparts? > >Its worse on some architectures that need to pass through a trampoline >when going between kernel and module (eg ppc). Its even worse on ppc64 >at the moment because we have a local TOC per module which needs to be >saved and restored. Is that TOC save and restore just for module code or does it apply to all calls through function pointers? On IA64, R1 (global data pointer) must be saved and restored on all calls through function pointers, even if both the caller and callee are in the kernel. You might know that this is a kernel to kernel call but gcc does not so it has to assume the worst. This is not a module problem, it affects all indirect function calls. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 5:11 ` Keith Owens @ 2001-11-10 3:35 ` Anton Blanchard 2001-11-10 7:26 ` Keith Owens 0 siblings, 1 reply; 45+ messages in thread From: Anton Blanchard @ 2001-11-10 3:35 UTC (permalink / raw) To: Keith Owens; +Cc: linux-kernel Hi, > Is that TOC save and restore just for module code or does it apply to > all calls through function pointers? > > On IA64, R1 (global data pointer) must be saved and restored on all > calls through function pointers, even if both the caller and callee are > in the kernel. You might know that this is a kernel to kernel call but > gcc does not so it has to assume the worst. This is not a module > problem, it affects all indirect function calls. Yep all indirect function calls require save and reload of the TOC (which is r2): std r2,40(r1) mtctr r0 ld r2,8(r9) bctrl # function call When calling a function in the kernel from within the kernel (eg printk), we dont have to save and reload the TOC: 000014ec bl .printk 000014f0 nop Alan Modra tells me the linker does the fixup of nop -> r2 reload. So in this case it isnt needed. However when we do the same printk from a module, the nop is replaced with an r2 reload: 000014ec bl 0x2f168 # call trampoline 000014f0 ld r2,40(r1) And because we have to load the new TOC for the call to printk, it is done in a small trampoline. (r12 is a pointer to the function descriptor for printk which contains 3 values, 1. the function address, 2. the TOC, ignore the 3rd) 0002f168 ld r12,-32456(r2) 0002f16c std r2,40(r1) 0002f170 ld r0,0(r12) 0002f174 ld r2,8(r12) 0002f178 mtctr r0 0002f17c bctr # call printk So the trampoline and r2 restore is the overhead Im talking about :) btw the trampoline is also required because of the limited range of relative branches on ppc. So ppc32 also has an overhead except it is smaller because it doesnt need the TOC juggling. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 3:35 ` Anton Blanchard @ 2001-11-10 7:26 ` Keith Owens 0 siblings, 0 replies; 45+ messages in thread From: Keith Owens @ 2001-11-10 7:26 UTC (permalink / raw) To: Anton Blanchard; +Cc: linux-kernel On Sat, 10 Nov 2001 14:35:58 +1100, Anton Blanchard <anton@samba.org> wrote: >Yep all indirect function calls require save and reload of the TOC >(which is r2): > >When calling a function in the kernel from within the kernel (eg printk), >we dont have to save and reload the TOC: Same on IA64, indirect function calls have to save R1, load R1 for the target function from the function descriptor, call the function, restore R1. Incidentally, that makes a function descriptor on IA64 _two_ words; you cannot save an IA64 function pointer in a long or even a void * variable. >Alan Modra tells me the linker does the fixup of nop -> r2 reload. So >in this case it isnt needed. IA64 kernels are compiled with -mconstant-gp which tells gcc that direct calls do not require R1 save/reload, gcc does not even generate a nop. However indirect function calls from one part of the kernel to another still require save and reload code, gcc cannot tell if the call is local or not. >However when we do the same printk from a module, the nop is replaced >with an r2 reload: Same on IA64, calls from a module into the kernel require R1 save and reload, even if the call is direct. So there is some code overhead when making direct function calls from modules to kernel on IA64, that overhead disappears when code is linked into the kernel. Indirect function calls always have the overhead, whether in kernel or in module. ^ permalink raw reply [flat|nested] 45+ messages in thread
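The descriptor scheme Anton and Keith describe can be modelled in plain C. This is only a sketch of the concept, not the real ppc64/IA64 ABI structures: a call target is a pair (entry address, data pointer), and the caller must install the callee's data pointer before the call.

```c
#include <assert.h>

/* Model of a ppc64 TOC / IA64 gp style function descriptor: two
 * words, so a plain void * cannot hold a full function "pointer". */
struct func_desc {
    int (*entry)(void *gp, int arg);  /* function entry point */
    void *gp;                         /* per-object TOC/global pointer */
};

static int add_base(void *gp, int arg)
{
    /* the callee finds its globals through the gp it was handed */
    return *(int *)gp + arg;
}

/* The caller loads the callee's gp from the descriptor before the
 * call (and the real ABI also saves/restores its own) -- that extra
 * load/store traffic is the indirect-call overhead in the thread. */
static int call_indirect(const struct func_desc *fd, int arg)
{
    return fd->entry(fd->gp, arg);
}
```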
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 16:01 speed difference between using hard-linked and modular drives? Roy Sigurd Karlsbakk 2001-11-08 17:02 ` Ingo Molnar @ 2001-11-08 17:53 ` Robert Love 1 sibling, 0 replies; 45+ messages in thread From: Robert Love @ 2001-11-08 17:53 UTC (permalink / raw) To: Roy Sigurd Karlsbakk; +Cc: linux-kernel On Thu, 2001-11-08 at 11:01, Roy Sigurd Karlsbakk wrote: > Are there any speed difference between hard-linked device drivers and > their modular counterparts? On top of what Ingo said, there is also a slightly larger (very slight) memory footprint due to some of the module code that isn't included in in-kernel components. For example, the __exit functions aren't needed if the driver is not a module. Robert Love ^ permalink raw reply [flat|nested] 45+ messages in thread
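Robert's point about `__exit` can be sketched as follows. This is a simplified userspace model: the real 2.4 kernel tags exit routines with a section attribute and the linker script discards that section for built-in code, rather than using `__attribute__((unused))`.

```c
#include <assert.h>

/* Model of why built-in drivers are slightly smaller: a module can be
 * unloaded, so its cleanup routine must be kept; a built-in driver's
 * cleanup can never run, so the code is a candidate for discarding. */
#ifdef MODULE
#define __exit_model                              /* kept: rmmod needs it */
static const int exit_code_kept = 1;
#else
#define __exit_model __attribute__((unused))      /* dead when built in */
static const int exit_code_kept = 0;
#endif

static void __exit_model mydrv_cleanup(void)
{
    /* release driver resources; unreachable when the driver is built in */
}
```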
* Re: speed difference between using hard-linked and modular drives? [not found] ` <Pine.LNX.4.33.0111081836080.15975-100000@localhost.localdomain.suse.lists.linux.kernel> @ 2001-11-08 23:00 ` Andi Kleen 2001-11-09 0:05 ` Anton Blanchard 2001-11-09 3:12 ` Rusty Russell 0 siblings, 2 replies; 45+ messages in thread From: Andi Kleen @ 2001-11-08 23:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar <mingo@elte.hu> writes: > > we should fix this by trying to allocate continuous physical memory if > possible, and fall back to vmalloc() only if this allocation fails. Check -aa. A patch to do that has been in there for some time now. -Andi P.S.: It makes a measurable difference with some Oracle benchmarks with the Qlogic driver. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-08 23:00 ` Andi Kleen @ 2001-11-09 0:05 ` Anton Blanchard 2001-11-09 5:45 ` Andi Kleen 2001-11-09 3:12 ` Rusty Russell 1 sibling, 1 reply; 45+ messages in thread From: Anton Blanchard @ 2001-11-09 0:05 UTC (permalink / raw) To: Andi Kleen; +Cc: Ingo Molnar, linux-kernel > > we should fix this by trying to allocate continuous physical memory if > > possible, and fall back to vmalloc() only if this allocation fails. > > Check -aa. A patch to do that has been in there for some time now. We also need a way to satisfy very large allocations for the hashes (eg the pagecache hash). On a 32G machine we get awful performance on the pagecache hash because we can only get an order 9 allocation out of get_free_pages: http://samba.org/~anton/linux/pagecache/pagecache_before.png When switching to vmalloc the hash is large enough to be useful: http://samba.org/~anton/linux/pagecache/pagecache_after.png As pointed out by Davem and Ingo we should try and avoid vmalloc here due to tlb trashing. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
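The size problem Anton hits is easy to see with a little arithmetic (a sketch with a toy page size; the order-9 limit matches the 2.4 buddy allocator he mentions):

```c
#include <assert.h>

#define PAGE_SIZE_MODEL 4096UL
#define MAX_GFP_ORDER 9     /* largest allocation get_free_pages() gives */

/* Smallest allocation order (power-of-two pages) covering `bytes`. */
static int order_for(unsigned long bytes)
{
    int order = 0;
    unsigned long size = PAGE_SIZE_MODEL;

    while (size < bytes) {
        size <<= 1;
        order++;
    }
    return order;
}
```

On a 32GB box there are 2^23 4K pages; one 8-byte bucket per page is a 64MB table, which needs order 14, far past what `get_free_pages()` will hand out in one piece. That is why the hash ends up undersized without a `vmalloc()` (or boot-time) escape hatch.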
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 0:05 ` Anton Blanchard @ 2001-11-09 5:45 ` Andi Kleen 2001-11-09 6:04 ` David S. Miller 0 siblings, 1 reply; 45+ messages in thread From: Andi Kleen @ 2001-11-09 5:45 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andi Kleen, Ingo Molnar, linux-kernel On Fri, Nov 09, 2001 at 11:05:32AM +1100, Anton Blanchard wrote: > We also need a way to satisfy very large allocations for the hashes (eg > the pagecache hash). On a 32G machine we get awful performance on the > pagecache hash because we can only get an order 9 allocation out of > get_free_pages: > > http://samba.org/~anton/linux/pagecache/pagecache_before.png > > When switching to vmalloc the hash is large enough to be useful: > > http://samba.org/~anton/linux/pagecache/pagecache_after.png > > As pointed out by Davem and Ingo we should try and avoid vmalloc here > due to tlb trashing. Sounds like you need a better hash function instead. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 5:45 ` Andi Kleen @ 2001-11-09 6:04 ` David S. Miller 2001-11-09 6:39 ` Andi Kleen 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 6:04 UTC (permalink / raw) To: ak; +Cc: anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 06:45:40 +0100 Sounds like you need a better hash function instead. Andi, please think about the problem before jumping to conclusions. N_PAGES / N_CHAINS > 1 in his situation. A better hash function cannot help. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
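Davem's objection is a pigeonhole argument, which can be stated as arithmetic: with P pages hashed into C chains, even a perfect hash leaves chains of at least ceil(P/C), so when P/C > 1 no hash function can shorten the walk — only more buckets can.

```c
#include <assert.h>

/* Best case (perfectly even hash): the shortest possible longest
 * chain when `pages` entries are spread over `chains` buckets. */
static unsigned long best_case_chain(unsigned long pages,
                                     unsigned long chains)
{
    return (pages + chains - 1) / chains;   /* ceil(P / C) */
}
```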
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:04 ` David S. Miller @ 2001-11-09 6:39 ` Andi Kleen 2001-11-09 6:54 ` Andrew Morton ` (3 more replies) 0 siblings, 4 replies; 45+ messages in thread From: Andi Kleen @ 2001-11-09 6:39 UTC (permalink / raw) To: David S. Miller; +Cc: ak, anton, mingo, linux-kernel On Thu, Nov 08, 2001 at 10:04:44PM -0800, David S. Miller wrote: > From: Andi Kleen <ak@suse.de> > Date: Fri, 9 Nov 2001 06:45:40 +0100 > > Sounds like you need a better hash function instead. > > Andi, please think about the problem before jumping to conclusions. > N_PAGES / N_CHAINS > 1 in his situation. A better hash function > cannot help. I'm assuming that walking on average 5-10 pages on a lookup is not too big a deal, especially when you use prefetch for the list walk. It is a tradeoff between a big hash table thrashing your cache and a smaller hash table that can be cached but has on average >1 entries per bucket. At some point the smaller hash table wins, assuming the hash function is evenly distributed. It would only get bad if the average chain length became much bigger. Before jumping to real conclusions it would be interesting to gather some statistics on Anton's machine, but I suspect he just has a very unevenly populated table. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen @ 2001-11-09 6:54 ` Andrew Morton 2001-11-09 7:17 ` David S. Miller 2001-11-09 7:14 ` David S. Miller ` (2 subsequent siblings) 3 siblings, 1 reply; 45+ messages in thread From: Andrew Morton @ 2001-11-09 6:54 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, anton, mingo, linux-kernel Andi Kleen wrote: > > On Thu, Nov 08, 2001 at 10:04:44PM -0800, David S. Miller wrote: > > From: Andi Kleen <ak@suse.de> > > Date: Fri, 9 Nov 2001 06:45:40 +0100 > > > > Sounds like you need a better hash function instead. > > > > Andi, please think about the problem before jumping to conclusions. > > N_PAGES / N_CHAINS > 1 in his situation. A better hash function > > cannot help. > > I'm assuming that walking on average 5-10 pages on a lookup is not too big a > deal, especially when you use prefetch for the list walk. It is a tradeoff > between a big hash table thrashing your cache and a smaller hash table that > can be cached but has on average >1 entries/buckets. At some point the the > smaller hash table wins, assuming the hash function is evenly distributed. > > It would only get bad if the average chain length would become much bigger. > > Before jumping to real conclusions it would be interesting to gather > some statistics on Anton's machine, but I suspect he just has an very > unevenly populated table. I played with that earlier in the year. Shrinking the hash table by a factor of eight made no measurable difference to anything on a Pentium II. The hash distribution was all over the place though. Lots of buckets with 1-2 pages, lots with 12-13. - ^ permalink raw reply [flat|nested] 45+ messages in thread
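The kind of measurement Andrew describes is straightforward to reproduce: hash a set of keys and histogram the chain lengths. A sketch, using a Knuth-style multiplicative hash as a stand-in for the 2.4 page-cache hash (the constant and sizes are toy values, not the kernel's):

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 64

/* Multiplicative hash stand-in for page_hash(). */
static unsigned bucket_of(unsigned long key)
{
    return (unsigned)((key * 2654435761UL) >> 16) % NBUCKETS;
}

/* Count how many keys land in each bucket -- the per-bucket counts
 * are exactly the chain lengths a lookup would have to walk. */
static void chain_histogram(const unsigned long *keys, int n,
                            int counts[NBUCKETS])
{
    memset(counts, 0, NBUCKETS * sizeof(int));
    for (int i = 0; i < n; i++)
        counts[bucket_of(keys[i])]++;
}
```

A skewed histogram (lots of 1-2 chains next to lots of 12-13 chains, as Andrew saw) indicts the hash function; a uniformly deep one indicts the table size.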
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:54 ` Andrew Morton @ 2001-11-09 7:17 ` David S. Miller 2001-11-09 7:16 ` Andrew Morton 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:17 UTC (permalink / raw) To: akpm; +Cc: ak, anton, mingo, linux-kernel From: Andrew Morton <akpm@zip.com.au> Date: Thu, 08 Nov 2001 22:54:30 -0800 I played with that earlier in the year. Shrinking the hash table by a factor of eight made no measurable difference to anything on a Pentium II. The hash distribution was all over the place though. Lots of buckets with 1-2 pages, lots with 12-13. What is the distribution when you don't shrink the hash table? Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:17 ` David S. Miller @ 2001-11-09 7:16 ` Andrew Morton 2001-11-09 7:24 ` David S. Miller 2001-11-09 8:21 ` Ingo Molnar 0 siblings, 2 replies; 45+ messages in thread From: Andrew Morton @ 2001-11-09 7:16 UTC (permalink / raw) To: David S. Miller; +Cc: ak, anton, mingo, linux-kernel "David S. Miller" wrote: > > From: Andrew Morton <akpm@zip.com.au> > Date: Thu, 08 Nov 2001 22:54:30 -0800 > > I played with that earlier in the year. Shrinking the hash table > by a factor of eight made no measurable difference to anything on > a Pentium II. The hash distribution was all over the place though. > Lots of buckets with 1-2 pages, lots with 12-13. > > What is the distribution when you don't shrink the hash > table? > Well on my setup, there are more hash buckets than there are pages in the system. So - basically empty. If memory serves me, never more than two pages in a bucket. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` Andrew Morton @ 2001-11-09 7:24 ` David S. Miller 2001-11-09 8:21 ` Ingo Molnar 1 sibling, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:24 UTC (permalink / raw) To: akpm; +Cc: ak, anton, mingo, linux-kernel From: Andrew Morton <akpm@zip.com.au> Date: Thu, 08 Nov 2001 23:16:08 -0800 Well on my setup, there are more hash buckets than there are pages in the system. So - basically empty. If memory serves me, never more than two pages in a bucket. Ok, this is what I expected. The function is tuned for having N_HASH_CHAINS being roughly equal to N_PAGES. If you want to experiment with smaller hash tables, there are some hacks in the FreeBSD sources that choose a different "salt" per inode. You xor the salt into the hash for each page on that inode. Something like this... Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
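The FreeBSD-style trick Davem mentions can be sketched like this. The mixing function and names are illustrative, not the 2.4 code; the idea is only that a per-inode salt xor'ed into the hash scatters pages of different files even when their page indices collide.

```c
#include <assert.h>

/* Per-inode salted page hash: pages at the same index in two
 * different files land in different buckets whenever the salts
 * differ within the bucket mask. */
static unsigned long page_bucket(unsigned long inode_salt,
                                 unsigned long index,
                                 unsigned long nbuckets)
{
    unsigned long h = index ^ (index >> 10);   /* mix the page index */
    return (h ^ inode_salt) & (nbuckets - 1);  /* nbuckets: power of 2 */
}
```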
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` Andrew Morton 2001-11-09 7:24 ` David S. Miller @ 2001-11-09 8:21 ` Ingo Molnar 2001-11-09 7:35 ` Andrew Morton 1 sibling, 1 reply; 45+ messages in thread From: Ingo Molnar @ 2001-11-09 8:21 UTC (permalink / raw) To: Andrew Morton; +Cc: David S. Miller, ak, anton, linux-kernel On Thu, 8 Nov 2001, Andrew Morton wrote: > Well on my setup, there are more hash buckets than there are pages in > the system. So - basically empty. If memory serves me, never more > than two pages in a bucket. how much RAM and how many buckets are there on your system? Ingo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 8:21 ` Ingo Molnar @ 2001-11-09 7:35 ` Andrew Morton 2001-11-09 7:44 ` David S. Miller 0 siblings, 1 reply; 45+ messages in thread From: Andrew Morton @ 2001-11-09 7:35 UTC (permalink / raw) To: mingo; +Cc: David S. Miller, ak, anton, linux-kernel Ingo Molnar wrote: > > On Thu, 8 Nov 2001, Andrew Morton wrote: > > > Well on my setup, there are more hash buckets than there are pages in > > the system. So - basically empty. If memory serves me, never more > > than two pages in a bucket. > > how much RAM and how many buckets are there on your system? > urgh. It was ages ago. I shouldn't have stuck my head up ;) I guess it was 256 megs: Kernel command line: ... mem=256m Page-cache hash table entries: 65536 (order: 6, 262144 bytes) And that's one entry per page, yes? I ended up concluding that a) The hash is sucky and b) Except for certain specialised workloads, a lookup is usually associated with a big memory copy, so none of it matters and c) given b), the page cache hashtable is on the wrong side of the size/space tradeoff :) - ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:35 ` Andrew Morton @ 2001-11-09 7:44 ` David S. Miller 0 siblings, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:44 UTC (permalink / raw) To: akpm; +Cc: mingo, ak, anton, linux-kernel From: Andrew Morton <akpm@zip.com.au> Date: Thu, 08 Nov 2001 23:35:04 -0800 b) Except for certain specialised workloads, a lookup is usually associated with a big memory copy, so none of it matters and I disagree, cache pollution always matters. Especially, if the cpu does memcpy's using cache-bypass-on-miss. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen 2001-11-09 6:54 ` Andrew Morton @ 2001-11-09 7:14 ` David S. Miller 2001-11-09 7:16 ` David S. Miller 2001-11-10 4:56 ` Anton Blanchard 3 siblings, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:14 UTC (permalink / raw) To: ak; +Cc: anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 07:39:46 +0100 Before jumping to real conclusions it would be interesting to gather some statistics on Anton's machine, but I suspect he just has an very unevenly populated table. N_PAGES / N_HASHCHAINS was on the order of 9, and the hash chains were evenly distributed. He posted URLs to graphs of the hash table chain lengths. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen 2001-11-09 6:54 ` Andrew Morton 2001-11-09 7:14 ` David S. Miller @ 2001-11-09 7:16 ` David S. Miller 2001-11-09 12:59 ` Alan Cox 2001-11-10 5:20 ` Anton Blanchard 2001-11-10 4:56 ` Anton Blanchard 3 siblings, 2 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 7:16 UTC (permalink / raw) To: ak; +Cc: anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 07:39:46 +0100 I'm assuming that walking on average 5-10 pages on a lookup is not too big a deal, especially when you use prefetch for the list walk. Oh no, not this again... It _IS_ a big deal. Fetching _ONE_ hash chain cache line is always going to be cheaper than fetching _FIVE_ to _TEN_ page struct cache lines while walking the list. Even if prefetch would kill all of this overhead (sorry, it won't), it is _DUMB_ and _STUPID_ to bring those _FIVE_ to _TEN_ cache lines into the processor just to lookup _ONE_ page. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` David S. Miller @ 2001-11-09 12:59 ` Alan Cox 2001-11-09 12:54 ` David S. Miller 2001-11-10 5:20 ` Anton Blanchard 1 sibling, 1 reply; 45+ messages in thread From: Alan Cox @ 2001-11-09 12:59 UTC (permalink / raw) To: David S. Miller; +Cc: ak, anton, mingo, linux-kernel > Oh no, not this again... > > It _IS_ a big deal. Fetching _ONE_ hash chain cache line > is always going to be cheaper than fetching _FIVE_ to _TEN_ > page struct cache lines while walking the list. Big picture time. What costs more - the odd five cache line hit or swapping 200Kbytes/second on and off disk? That's obviously workload dependent. Perhaps at some point we need to accept there is a memory/speed tradeoff throughout the kernel and we need a CONFIG option for it - especially for the handheld world. I don't want to do lots of I/O on an ipaq, I don't need big tcp hashes, and I'd rather take a small performance hit. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 12:59 ` Alan Cox @ 2001-11-09 12:54 ` David S. Miller 2001-11-09 13:15 ` Philip Dodd 2001-11-09 13:17 ` Andi Kleen 0 siblings, 2 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 12:54 UTC (permalink / raw) To: alan; +Cc: ak, anton, mingo, linux-kernel From: Alan Cox <alan@lxorguk.ukuu.org.uk> Date: Fri, 9 Nov 2001 12:59:09 +0000 (GMT) we need a CONFIG option for it I think a boot time commandline option is more appropriate for something like this. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 12:54 ` David S. Miller @ 2001-11-09 13:15 ` Philip Dodd 2001-11-09 13:26 ` David S. Miller 2001-11-09 13:17 ` Andi Kleen 1 sibling, 1 reply; 45+ messages in thread From: Philip Dodd @ 2001-11-09 13:15 UTC (permalink / raw) To: alan, David S. Miller; +Cc: ak, anton, mingo, linux-kernel > > we need a CONFIG option for it > > I think a boot time commandline option is more appropriate > for something like this. In the light of what was said about embedded systems, I'm not really sure a boot time option really is the way to go... Just a thought. Philip DODD Sales Engineer SIVA Les Fjords - Immeuble Narvik 19 Avenue de Norvège Z.A. de Courtaboeuf 1 91953 LES ULIS CEDEX http://www.siva.fr ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:15 ` Philip Dodd @ 2001-11-09 13:26 ` David S. Miller 2001-11-09 20:45 ` Mike Fedyk 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 13:26 UTC (permalink / raw) To: smpcomputing; +Cc: alan, ak, anton, mingo, linux-kernel From: "Philip Dodd" <smpcomputing@free.fr> Date: Fri, 9 Nov 2001 14:15:32 +0100 > I think a boot time commandline option is more appropriate > for something like this. In the light of what was said about embedded systems, I'm not really sure a boot time option really is the way to go... All the hash tables in question are allocated dynamically, we size them at boot time, the memory is not consumed until the kernel begins executing. So a boottime option would be just fine. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
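The boot-time knob Davem has in mind can be modelled in userspace. The parameter name is hypothetical; in the 2.4 kernel this would be a `__setup()` handler consuming a command-line option before the table is allocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static unsigned long hash_entries_override;  /* 0 = use computed default */

/* Parse one boot argument; anything else is ignored. */
static void parse_boot_arg(const char *arg)
{
    if (strncmp(arg, "hashentries=", 12) == 0)
        hash_entries_override = strtoul(arg + 12, NULL, 0);
}

/* Size the table: the override wins when given, e.g. on a small
 * machine; otherwise use the value computed from available memory. */
static unsigned long hash_entries(unsigned long computed_default)
{
    return hash_entries_override ? hash_entries_override
                                 : computed_default;
}
```

Because the tables are allocated dynamically at boot, this shrinks memory use with no code-size or runtime cost, which is the advantage over a CONFIG option.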
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:26 ` David S. Miller @ 2001-11-09 20:45 ` Mike Fedyk 0 siblings, 0 replies; 45+ messages in thread From: Mike Fedyk @ 2001-11-09 20:45 UTC (permalink / raw) To: David S. Miller; +Cc: smpcomputing, alan, ak, anton, mingo, linux-kernel On Fri, Nov 09, 2001 at 05:26:50AM -0800, David S. Miller wrote: > From: "Philip Dodd" <smpcomputing@free.fr> > Date: Fri, 9 Nov 2001 14:15:32 +0100 > > > I think a boot time commandline option is more appropriate > > for something like this. > > In the light of what was said about embedded systems, I'm not really sure a > boot time option really is the way to go... > > All the hash tables in question are allocated dynamically, > we size them at boot time, the memory is not consumed until > the kernel begins executing. So a boottime option would be > just fine. How much is this code going to affect the kernel image size? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 12:54 ` David S. Miller 2001-11-09 13:15 ` Philip Dodd @ 2001-11-09 13:17 ` Andi Kleen 2001-11-09 13:25 ` David S. Miller 1 sibling, 1 reply; 45+ messages in thread From: Andi Kleen @ 2001-11-09 13:17 UTC (permalink / raw) To: David S. Miller; +Cc: alan, ak, anton, mingo, linux-kernel On Fri, Nov 09, 2001 at 04:54:55AM -0800, David S. Miller wrote: > From: Alan Cox <alan@lxorguk.ukuu.org.uk> > Date: Fri, 9 Nov 2001 12:59:09 +0000 (GMT) > > we need a CONFIG option for it > > I think a boot time commandline option is more appropriate > for something like this. Fine if you don't mind an indirect function call pointer somewhere in the TCP hash path. I'm thinking about adding one that removes the separate time wait table. It is not needed for desktops because they should have little or no time-wait sockets. also it should throttle the hash table sizing aggressively; e.g. 256-512 buckets should be more than enough for a client. BTW I noticed that 1/4 of the big hash table is not used on SMP. The time wait buckets share the locks of the lower half, so the spinlocks in the upper half are never used. What would you think about splitting the table and not putting spinlocks in the time-wait range? -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:17 ` Andi Kleen @ 2001-11-09 13:25 ` David S. Miller 2001-11-09 13:39 ` Andi Kleen 0 siblings, 1 reply; 45+ messages in thread From: David S. Miller @ 2001-11-09 13:25 UTC (permalink / raw) To: ak; +Cc: alan, anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 14:17:55 +0100 Fine if you don't mind an indirect function call pointer somewhere in the TCP hash path. The hashes are sized at boot time, we can just reduce the size when the boot time option says "small machine" or whatever. Why in the world do we need indirection function call pointers in TCP to handle that? Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:25 ` David S. Miller @ 2001-11-09 13:39 ` Andi Kleen 2001-11-09 13:41 ` David S. Miller 0 siblings, 1 reply; 45+ messages in thread From: Andi Kleen @ 2001-11-09 13:39 UTC (permalink / raw) To: David S. Miller; +Cc: ak, alan, anton, mingo, linux-kernel On Fri, Nov 09, 2001 at 05:25:54AM -0800, David S. Miller wrote: > Why in the world do we need indirection function call pointers > in TCP to handle that? To handle the case of not having a separate TIME-WAIT table (sorry for being unclear). Or alternatively several conditionals. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 13:39 ` Andi Kleen @ 2001-11-09 13:41 ` David S. Miller 0 siblings, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-09 13:41 UTC (permalink / raw) To: ak; +Cc: alan, anton, mingo, linux-kernel From: Andi Kleen <ak@suse.de> Date: Fri, 9 Nov 2001 14:39:30 +0100 On Fri, Nov 09, 2001 at 05:25:54AM -0800, David S. Miller wrote: > Why in the world do we need indirection function call pointers > in TCP to handle that? To handle the case of not having a separate TIME-WAIT table (sorry for being unclear). Or alternatively several conditionals. The TIME-WAIT half of the hash table is most useful on clients actually. I mean, just double the amount you "downsize" the TCP established hash table if it bothers you that much. Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 7:16 ` David S. Miller 2001-11-09 12:59 ` Alan Cox @ 2001-11-10 5:20 ` Anton Blanchard 1 sibling, 0 replies; 45+ messages in thread From: Anton Blanchard @ 2001-11-10 5:20 UTC (permalink / raw) To: David S. Miller; +Cc: ak, mingo, linux-kernel Hi, > It _IS_ a big deal. Fetching _ONE_ hash chain cache line > is always going to be cheaper than fetching _FIVE_ to _TEN_ > page struct cache lines while walking the list. Exactly, the reason I found the pagecache hash was too small was because __find_page_nolock was one of the worst offenders when doing zero copy web serving of a large dataset. > Even if prefetch would kill all of this overhead (sorry, it won't), it > is _DUMB_ and _STUPID_ to bring those _FIVE_ to _TEN_ cache lines into > the processor just to lookup _ONE_ page. Yes, you can't expect prefetch to help you when you use the data 10 instructions after you issue the prefetch (i.e. walking the hash chain). Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-09 6:39 ` Andi Kleen ` (2 preceding siblings ...) 2001-11-09 7:16 ` David S. Miller @ 2001-11-10 4:56 ` Anton Blanchard 2001-11-10 5:09 ` Andi Kleen 2001-11-10 13:29 ` David S. Miller 3 siblings, 2 replies; 45+ messages in thread From: Anton Blanchard @ 2001-11-10 4:56 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, mingo, linux-kernel Hi, > I'm assuming that walking on average 5-10 pages on a lookup is not too big a > deal, especially when you use prefetch for the list walk. It is a tradeoff > between a big hash table thrashing your cache and a smaller hash table that > can be cached but has on average >1 entry/bucket. At some point the > smaller hash table wins, assuming the hash function is evenly distributed. > > It would only get bad if the average chain length would become much bigger. > > Before jumping to real conclusions it would be interesting to gather > some statistics on Anton's machine, but I suspect he just has a very > unevenly populated table. You can find the raw data here: http://samba.org/~anton/linux/pagecache/pagecache_data_gfp.gz http://samba.org/~anton/linux/pagecache/pagecache_data_vmalloc.gz You can see the average depth of the get_free_page hash is way too deep. I agree there are a lot of pagecache pages (17GB in the gfp test and 21GB in the vmalloc test), but we have to make use of the 32GB of RAM :) I did some experimentation with prefetch and I don't think it will gain you anything here. We need to issue the prefetch many cycles before using the data, which we cannot do when walking the chain. Anton ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 4:56 ` Anton Blanchard @ 2001-11-10 5:09 ` Andi Kleen 2001-11-10 13:29 ` David S. Miller 1 sibling, 0 replies; 45+ messages in thread From: Andi Kleen @ 2001-11-10 5:09 UTC (permalink / raw) To: Anton Blanchard; +Cc: linux-kernel > You can see the average depth of the get_free_page hash is way too deep. > I agree there are a lot of pagecache pages (17GB in the gfp test and 21GB > in the vmalloc test), but we have to make use of the 32GB of RAM :) Thanks for the information. I guess the fix for your case would then be to use the bootmem allocator for allocating the page hash table. It should have no problems with very large contiguous tables, assuming you have the (physically contiguous) memory. Another possibility would be to switch to some tree/skiplist, but that's probably too radical and may have other problems on smaller boxes. -Andi ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 4:56 ` Anton Blanchard 2001-11-10 5:09 ` Andi Kleen @ 2001-11-10 13:29 ` David S. Miller 2001-11-10 13:44 ` David S. Miller 2001-11-10 13:52 ` David S. Miller 1 sibling, 2 replies; 45+ messages in thread From: David S. Miller @ 2001-11-10 13:29 UTC (permalink / raw) To: anton; +Cc: ak, mingo, linux-kernel From: Anton Blanchard <anton@samba.org> Date: Sat, 10 Nov 2001 15:56:03 +1100 You can see the average depth of the get_free_page hash is way too deep. I agree there are a lot of pagecache pages (17GB in the gfp test and 21GB in the vmalloc test), but we have to make use of the 32GB of RAM :) Anton, are you bored? :-) If so, could you test out the patch below on your ppc64 box? It does the "page hash table via bootmem" thing. It is against 2.4.15-pre2. The ppc64-specific bits you'll need to do, but they should be very straightforward. It also fixes a really stupid bug in the bootmem allocator. If the bootmem area starts at an unaligned address, the "align" argument to the bootmem allocator isn't honored. 
--- ./arch/alpha/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/alpha/mm/init.c Sat Nov 10 01:49:56 2001 @@ -23,6 +23,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/uaccess.h> @@ -360,6 +361,7 @@ mem_init(void) { max_mapnr = num_physpages = max_low_pfn; + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); --- ./arch/alpha/mm/numa.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/alpha/mm/numa.c Sat Nov 10 01:52:27 2001 @@ -15,6 +15,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/hwrpb.h> #include <asm/pgalloc.h> @@ -359,8 +360,13 @@ extern char _text, _etext, _data, _edata; extern char __init_begin, __init_end; extern unsigned long totalram_pages; - unsigned long nid, i; + unsigned long nid, i, num_free_bootmem_pages; mem_map_t * lmem_map; + + num_free_bootmem_pages = 0; + for (nid = 0; nid < numnodes; nid++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(num_free_bootmem_pages); high_memory = (void *) __va(max_mapnr <<PAGE_SHIFT); --- ./arch/arm/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/arm/mm/init.c Sat Nov 10 01:52:34 2001 @@ -23,6 +23,7 @@ #include <linux/init.h> #include <linux/bootmem.h> #include <linux/blk.h> +#include <linux/pagemap.h> #include <asm/segment.h> #include <asm/mach-types.h> @@ -594,6 +595,7 @@ void __init mem_init(void) { unsigned int codepages, datapages, initpages; + unsigned long num_free_bootmem_pages; int i, node; codepages = &_etext - &_text; @@ -608,6 +610,11 @@ */ if (meminfo.nr_banks != 1) create_memmap_holes(&meminfo); + + num_free_bootmem_pages = 0; + for (node = 0; node < numnodes; node++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(node)); + page_cache_init(num_free_bootmem_pages); /* this will put all unused low memory onto the freelists */ for (node = 
0; node < numnodes; node++) { --- ./arch/i386/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/i386/mm/init.c Sat Nov 10 01:53:43 2001 @@ -455,6 +455,8 @@ #endif high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); --- ./arch/m68k/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/m68k/mm/init.c Sat Nov 10 01:54:47 2001 @@ -20,6 +20,7 @@ #ifdef CONFIG_BLK_DEV_RAM #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/setup.h> #include <asm/uaccess.h> @@ -135,6 +136,8 @@ if (MACH_IS_ATARI) atari_stram_mem_init_hook(); #endif + + page_cache_init(count_free_bootmem()); /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); --- ./arch/mips/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/mips/mm/init.c Sat Nov 10 01:55:09 2001 @@ -28,6 +28,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -203,6 +204,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. 
*/ --- ./arch/ppc/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/ppc/mm/init.c Sat Nov 10 01:57:34 2001 @@ -34,6 +34,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> /* for initrd_* */ #endif +#include <linux/pagemap.h> #include <asm/pgalloc.h> #include <asm/prom.h> @@ -462,6 +463,8 @@ high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); num_physpages = max_mapnr; /* RAM is assumed contiguous */ + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); --- ./arch/sparc/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/sparc/mm/init.c Sat Nov 10 01:59:48 2001 @@ -25,6 +25,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -434,6 +435,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(max_low_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); #ifdef DEBUG_BOOTMEM prom_printf("mem_init: Calling free_all_bootmem().\n"); --- ./arch/sparc64/mm/init.c.~1~ Fri Nov 9 18:42:08 2001 +++ ./arch/sparc64/mm/init.c Sat Nov 10 02:00:23 2001 @@ -16,6 +16,7 @@ #include <linux/blk.h> #include <linux/swap.h> #include <linux/swapctl.h> +#include <linux/pagemap.h> #include <asm/head.h> #include <asm/system.h> @@ -1584,6 +1585,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(last_valid_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); num_physpages = free_all_bootmem() - 1; --- ./arch/sh/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/sh/mm/init.c Sat Nov 10 01:59:56 2001 @@ -26,6 +26,7 @@ #endif #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/processor.h> #include <asm/system.h> @@ -139,6 +140,7 @@ void __init mem_init(void) { extern unsigned long empty_zero_page[1024]; + unsigned long num_free_bootmem_pages; int codesize, reservedpages, datasize, initsize; int tmp; @@ -148,6 +150,12 @@ /* clear the zero-page 
*/ memset(empty_zero_page, 0, PAGE_SIZE); __flush_wback_region(empty_zero_page, PAGE_SIZE); + + num_free_bootmem_pages = count_free_bootmem_node(NODE_DATA(0)); +#ifdef CONFIG_DISCONTIGMEM + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(1)); +#endif + page_cache_init(num_free_bootmem_pages); /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem_node(NODE_DATA(0)); --- ./arch/s390/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/s390/mm/init.c Sat Nov 10 01:57:56 2001 @@ -186,6 +186,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); --- ./arch/ia64/mm/init.c.~1~ Fri Nov 9 19:08:02 2001 +++ ./arch/ia64/mm/init.c Sat Nov 10 01:54:20 2001 @@ -13,6 +13,7 @@ #include <linux/reboot.h> #include <linux/slab.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/bitops.h> #include <asm/dma.h> @@ -406,6 +407,8 @@ max_mapnr = max_low_pfn; high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); --- ./arch/mips64/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/mips64/mm/init.c Sat Nov 10 01:55:30 2001 @@ -25,6 +25,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -396,6 +397,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. 
*/ --- ./arch/mips64/sgi-ip27/ip27-memory.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/mips64/sgi-ip27/ip27-memory.c Sat Nov 10 02:02:33 2001 @@ -15,6 +15,7 @@ #include <linux/mm.h> #include <linux/bootmem.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/page.h> #include <asm/bootinfo.h> @@ -277,6 +278,11 @@ num_physpages = numpages; /* memory already sized by szmem */ max_mapnr = pagenr; /* already found during paging_init */ high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + tmp = 0; + for (nid = 0; nid < numnodes; nid++) + tmp += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(tmp); for (nid = 0; nid < numnodes; nid++) { --- ./arch/parisc/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/parisc/mm/init.c Sat Nov 10 01:57:11 2001 @@ -17,6 +17,7 @@ #include <linux/pci.h> /* for hppa_dma_ops and pcxl_dma_ops */ #include <linux/swap.h> #include <linux/unistd.h> +#include <linux/pagemap.h> #include <asm/pgalloc.h> @@ -48,6 +49,8 @@ { max_mapnr = num_physpages = max_low_pfn; high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); printk("Memory: %luk available\n", totalram_pages << (PAGE_SHIFT-10)); --- ./arch/cris/mm/init.c.~1~ Sun Oct 21 02:47:53 2001 +++ ./arch/cris/mm/init.c Sat Nov 10 01:53:10 2001 @@ -95,6 +95,7 @@ #include <linux/swap.h> #include <linux/smp.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -366,6 +367,8 @@ max_mapnr = num_physpages = max_low_pfn - min_low_pfn; + page_cache_init(count_free_bootmem()); + /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); --- ./arch/s390x/mm/init.c.~1~ Fri Nov 9 19:08:02 2001 +++ ./arch/s390x/mm/init.c Sat Nov 10 01:58:14 2001 @@ -198,6 +198,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ 
totalram_pages += free_all_bootmem(); --- ./include/linux/bootmem.h.~1~ Fri Nov 9 19:35:08 2001 +++ ./include/linux/bootmem.h Sat Nov 10 02:33:45 2001 @@ -43,11 +43,13 @@ #define alloc_bootmem_low_pages(x) \ __alloc_bootmem((x), PAGE_SIZE, 0) extern unsigned long __init free_all_bootmem (void); +extern unsigned long __init count_free_bootmem (void); extern unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn); extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size); extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size); extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat); +extern unsigned long __init count_free_bootmem_node (pg_data_t *pgdat); extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal); #define alloc_bootmem_node(pgdat, x) \ __alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS)) --- ./init/main.c.~1~ Fri Nov 9 19:08:11 2001 +++ ./init/main.c Sat Nov 10 04:58:16 2001 @@ -597,7 +597,6 @@ proc_caches_init(); vfs_caches_init(mempages); buffer_init(mempages); - page_cache_init(mempages); #if defined(CONFIG_ARCH_S390) ccwcache_init(); #endif --- ./mm/filemap.c.~1~ Fri Nov 9 19:08:11 2001 +++ ./mm/filemap.c Sat Nov 10 05:15:16 2001 @@ -24,6 +24,7 @@ #include <linux/mm.h> #include <linux/iobuf.h> #include <linux/compiler.h> +#include <linux/bootmem.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -2929,28 +2930,48 @@ goto unlock; } +/* This is called from the arch specific mem_init routine. + * It is done right before free_all_bootmem (or NUMA equivalent). + * + * The mempages arg is the number of pages free_all_bootmem is + * going to liberate, or a close approximation. + * + * We have to use bootmem because on huge systems (ie. 
16GB ram) + * get_free_pages cannot give us a large enough allocation. + */ void __init page_cache_init(unsigned long mempages) { - unsigned long htable_size, order; + unsigned long htable_size, real_size; htable_size = mempages; htable_size *= sizeof(struct page *); - for(order = 0; (PAGE_SIZE << order) < htable_size; order++) + + for (real_size = 1UL; real_size < htable_size; real_size <<= 1UL) ; do { - unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *); + unsigned long tmp = (real_size / sizeof(struct page *)); + unsigned long align; page_hash_bits = 0; while((tmp >>= 1UL) != 0UL) page_hash_bits++; + + align = real_size; + if (align > (4UL * 1024UL * 1024UL)) + align = (4UL * 1024UL * 1024UL); + + page_hash_table = __alloc_bootmem(real_size, align, + __pa(MAX_DMA_ADDRESS)); + + /* Perhaps the alignment was too strict. */ + if (page_hash_table == NULL) + page_hash_table = alloc_bootmem(real_size); + } while (page_hash_table == NULL && + (real_size >>= 1UL) >= PAGE_SIZE); - page_hash_table = (struct page **) - __get_free_pages(GFP_ATOMIC, order); - } while(page_hash_table == NULL && --order > 0); - - printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n", - (1 << page_hash_bits), order, (PAGE_SIZE << order)); + printk("Page-cache hash table entries: %d (%ld bytes)\n", + (1 << page_hash_bits), real_size); if (!page_hash_table) panic("Failed to allocate page hash table\n"); memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *)); ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 13:29 ` David S. Miller @ 2001-11-10 13:44 ` David S. Miller 2001-11-10 13:52 ` David S. Miller 1 sibling, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-10 13:44 UTC (permalink / raw) To: anton; +Cc: ak, mingo, linux-kernel From: "David S. Miller" <davem@redhat.com> Date: Sat, 10 Nov 2001 05:29:17 -0800 (PST) Anton, are you bored? :-) If so, could you test out the patch below on your ppc64 box? It does the "page hash table via bootmem" thing. It is against 2.4.15-pre2 Erm, ignore this patch, it was incomplete, I'll diff it up properly. Sorry... Franks a lot, David S. Miller davem@redhat.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives? 2001-11-10 13:29 ` David S. Miller 2001-11-10 13:44 ` David S. Miller @ 2001-11-10 13:52 ` David S. Miller 1 sibling, 0 replies; 45+ messages in thread From: David S. Miller @ 2001-11-10 13:52 UTC (permalink / raw) To: anton; +Cc: ak, mingo, linux-kernel Ok, this should be a working patch, try this one :-) diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/init.c linux/arch/alpha/mm/init.c --- vanilla/linux/arch/alpha/mm/init.c Thu Sep 20 20:02:03 2001 +++ linux/arch/alpha/mm/init.c Sat Nov 10 01:49:56 2001 @@ -23,6 +23,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/uaccess.h> @@ -360,6 +361,7 @@ mem_init(void) { max_mapnr = num_physpages = max_low_pfn; + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/numa.c linux/arch/alpha/mm/numa.c --- vanilla/linux/arch/alpha/mm/numa.c Sun Aug 12 10:38:48 2001 +++ linux/arch/alpha/mm/numa.c Sat Nov 10 01:52:27 2001 @@ -15,6 +15,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/hwrpb.h> #include <asm/pgalloc.h> @@ -359,8 +360,13 @@ extern char _text, _etext, _data, _edata; extern char __init_begin, __init_end; extern unsigned long totalram_pages; - unsigned long nid, i; + unsigned long nid, i, num_free_bootmem_pages; mem_map_t * lmem_map; + + num_free_bootmem_pages = 0; + for (nid = 0; nid < numnodes; nid++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(num_free_bootmem_pages); high_memory = (void *) __va(max_mapnr <<PAGE_SHIFT); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/arm/mm/init.c linux/arch/arm/mm/init.c --- 
vanilla/linux/arch/arm/mm/init.c Thu Oct 11 09:04:57 2001 +++ linux/arch/arm/mm/init.c Sat Nov 10 01:52:34 2001 @@ -23,6 +23,7 @@ #include <linux/init.h> #include <linux/bootmem.h> #include <linux/blk.h> +#include <linux/pagemap.h> #include <asm/segment.h> #include <asm/mach-types.h> @@ -594,6 +595,7 @@ void __init mem_init(void) { unsigned int codepages, datapages, initpages; + unsigned long num_free_bootmem_pages; int i, node; codepages = &_etext - &_text; @@ -608,6 +610,11 @@ */ if (meminfo.nr_banks != 1) create_memmap_holes(&meminfo); + + num_free_bootmem_pages = 0; + for (node = 0; node < numnodes; node++) + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(node)); + page_cache_init(num_free_bootmem_pages); /* this will put all unused low memory onto the freelists */ for (node = 0; node < numnodes; node++) { diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/cris/mm/init.c linux/arch/cris/mm/init.c --- vanilla/linux/arch/cris/mm/init.c Thu Jul 26 15:10:06 2001 +++ linux/arch/cris/mm/init.c Sat Nov 10 01:53:10 2001 @@ -95,6 +95,7 @@ #include <linux/swap.h> #include <linux/smp.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -366,6 +367,8 @@ max_mapnr = num_physpages = max_low_pfn - min_low_pfn; + page_cache_init(count_free_bootmem()); + /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/i386/mm/init.c linux/arch/i386/mm/init.c --- vanilla/linux/arch/i386/mm/init.c Thu Sep 20 19:59:20 2001 +++ linux/arch/i386/mm/init.c Sat Nov 10 01:53:43 2001 @@ -455,6 +455,8 @@ #endif high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ia64/mm/init.c 
linux/arch/ia64/mm/init.c --- vanilla/linux/arch/ia64/mm/init.c Fri Nov 9 18:39:51 2001 +++ linux/arch/ia64/mm/init.c Sat Nov 10 01:54:20 2001 @@ -13,6 +13,7 @@ #include <linux/reboot.h> #include <linux/slab.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/bitops.h> #include <asm/dma.h> @@ -406,6 +407,8 @@ max_mapnr = max_low_pfn; high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/m68k/mm/init.c linux/arch/m68k/mm/init.c --- vanilla/linux/arch/m68k/mm/init.c Thu Sep 20 20:02:03 2001 +++ linux/arch/m68k/mm/init.c Sat Nov 10 01:54:47 2001 @@ -20,6 +20,7 @@ #ifdef CONFIG_BLK_DEV_RAM #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/setup.h> #include <asm/uaccess.h> @@ -135,6 +136,8 @@ if (MACH_IS_ATARI) atari_stram_mem_init_hook(); #endif + + page_cache_init(count_free_bootmem()); /* this will put all memory onto the freelists */ totalram_pages = free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips/mm/init.c linux/arch/mips/mm/init.c --- vanilla/linux/arch/mips/mm/init.c Wed Jul 4 11:50:39 2001 +++ linux/arch/mips/mm/init.c Sat Nov 10 01:55:09 2001 @@ -28,6 +28,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -203,6 +204,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. 
*/ diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/mm/init.c linux/arch/mips64/mm/init.c --- vanilla/linux/arch/mips64/mm/init.c Wed Jul 4 11:50:39 2001 +++ linux/arch/mips64/mm/init.c Sat Nov 10 01:55:30 2001 @@ -25,6 +25,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> #endif +#include <linux/pagemap.h> #include <asm/bootinfo.h> #include <asm/cachectl.h> @@ -396,6 +397,8 @@ max_mapnr = num_physpages = max_low_pfn; high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. */ diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c linux/arch/mips64/sgi-ip27/ip27-memory.c --- vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c Sun Sep 9 10:43:02 2001 +++ linux/arch/mips64/sgi-ip27/ip27-memory.c Sat Nov 10 02:02:33 2001 @@ -15,6 +15,7 @@ #include <linux/mm.h> #include <linux/bootmem.h> #include <linux/swap.h> +#include <linux/pagemap.h> #include <asm/page.h> #include <asm/bootinfo.h> @@ -277,6 +278,11 @@ num_physpages = numpages; /* memory already sized by szmem */ max_mapnr = pagenr; /* already found during paging_init */ high_memory = (void *) __va(max_mapnr << PAGE_SHIFT); + + tmp = 0; + for (nid = 0; nid < numnodes; nid++) + tmp += count_free_bootmem_node(NODE_DATA(nid)); + page_cache_init(tmp); for (nid = 0; nid < numnodes; nid++) { diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/parisc/mm/init.c linux/arch/parisc/mm/init.c --- vanilla/linux/arch/parisc/mm/init.c Tue Dec 5 12:29:39 2000 +++ linux/arch/parisc/mm/init.c Sat Nov 10 01:57:11 2001 @@ -17,6 +17,7 @@ #include <linux/pci.h> /* for hppa_dma_ops and pcxl_dma_ops */ #include <linux/swap.h> #include <linux/unistd.h> +#include <linux/pagemap.h> #include <asm/pgalloc.h> @@ -48,6 +49,8 @@ { max_mapnr = num_physpages = max_low_pfn; 
high_memory = __va(max_low_pfn * PAGE_SIZE); + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); printk("Memory: %luk available\n", totalram_pages << (PAGE_SHIFT-10)); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ppc/mm/init.c linux/arch/ppc/mm/init.c --- vanilla/linux/arch/ppc/mm/init.c Tue Oct 2 09:12:44 2001 +++ linux/arch/ppc/mm/init.c Sat Nov 10 01:57:34 2001 @@ -34,6 +34,7 @@ #ifdef CONFIG_BLK_DEV_INITRD #include <linux/blk.h> /* for initrd_* */ #endif +#include <linux/pagemap.h> #include <asm/pgalloc.h> #include <asm/prom.h> @@ -462,6 +463,8 @@ high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); num_physpages = max_mapnr; /* RAM is assumed contiguous */ + + page_cache_init(count_free_bootmem()); totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390/mm/init.c linux/arch/s390/mm/init.c --- vanilla/linux/arch/s390/mm/init.c Thu Oct 11 09:04:57 2001 +++ linux/arch/s390/mm/init.c Sat Nov 10 01:57:56 2001 @@ -186,6 +186,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390x/mm/init.c linux/arch/s390x/mm/init.c --- vanilla/linux/arch/s390x/mm/init.c Fri Nov 9 18:39:51 2001 +++ linux/arch/s390x/mm/init.c Sat Nov 10 01:58:14 2001 @@ -198,6 +198,8 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); + page_cache_init(count_free_bootmem()); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sh/mm/init.c linux/arch/sh/mm/init.c --- vanilla/linux/arch/sh/mm/init.c Mon Oct 15 13:36:48 2001 +++ linux/arch/sh/mm/init.c Sat Nov 10 01:59:56 2001 @@ -26,6 
+26,7 @@ #endif #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/processor.h> #include <asm/system.h> @@ -139,6 +140,7 @@ void __init mem_init(void) { extern unsigned long empty_zero_page[1024]; + unsigned long num_free_bootmem_pages; int codesize, reservedpages, datasize, initsize; int tmp; @@ -148,6 +150,12 @@ /* clear the zero-page */ memset(empty_zero_page, 0, PAGE_SIZE); __flush_wback_region(empty_zero_page, PAGE_SIZE); + + num_free_bootmem_pages = count_free_bootmem_node(NODE_DATA(0)); +#ifdef CONFIG_DISCONTIGMEM + num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(1)); +#endif + page_cache_init(num_free_bootmem_pages); /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem_node(NODE_DATA(0)); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc/mm/init.c linux/arch/sparc/mm/init.c --- vanilla/linux/arch/sparc/mm/init.c Mon Oct 1 09:19:56 2001 +++ linux/arch/sparc/mm/init.c Sat Nov 10 05:30:31 2001 @@ -1,4 +1,4 @@ -/* $Id: init.c,v 1.100 2001/09/21 22:51:47 davem Exp $ +/* $Id: init.c,v 1.101 2001/11/10 13:30:31 davem Exp $ * linux/arch/sparc/mm/init.c * * Copyright (C) 1995 David S. 
Miller (davem@caip.rutgers.edu) @@ -25,6 +25,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/bootmem.h> +#include <linux/pagemap.h> #include <asm/system.h> #include <asm/segment.h> @@ -434,6 +435,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(max_low_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); #ifdef DEBUG_BOOTMEM prom_printf("mem_init: Calling free_all_bootmem().\n"); diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc64/mm/init.c linux/arch/sparc64/mm/init.c --- vanilla/linux/arch/sparc64/mm/init.c Tue Oct 30 15:08:11 2001 +++ linux/arch/sparc64/mm/init.c Sat Nov 10 05:30:31 2001 @@ -1,4 +1,4 @@ -/* $Id: init.c,v 1.199 2001/10/25 18:48:03 davem Exp $ +/* $Id: init.c,v 1.201 2001/11/10 13:30:31 davem Exp $ * arch/sparc64/mm/init.c * * Copyright (C) 1996-1999 David S. Miller (davem@caip.rutgers.edu) @@ -16,6 +16,7 @@ #include <linux/blk.h> #include <linux/swap.h> #include <linux/swapctl.h> +#include <linux/pagemap.h> #include <asm/head.h> #include <asm/system.h> @@ -1400,7 +1401,7 @@ if (second_alias_page) spitfire_flush_dtlb_nucleus_page(second_alias_page); - flush_tlb_all(); + __flush_tlb_all(); { unsigned long zones_size[MAX_NR_ZONES]; @@ -1584,6 +1585,8 @@ max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT); high_memory = __va(last_valid_pfn << PAGE_SHIFT); + + page_cache_init(count_free_bootmem()); num_physpages = free_all_bootmem() - 1; diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/include/linux/bootmem.h linux/include/linux/bootmem.h --- vanilla/linux/include/linux/bootmem.h Mon Nov 5 12:43:18 2001 +++ linux/include/linux/bootmem.h Sat Nov 10 02:33:45 2001 @@ -43,11 +43,13 @@ #define alloc_bootmem_low_pages(x) \ __alloc_bootmem((x), PAGE_SIZE, 0) extern unsigned long __init free_all_bootmem (void); +extern unsigned long __init count_free_bootmem (void); extern unsigned long __init init_bootmem_node (pg_data_t 
 *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn);
 extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size);
 extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size);
 extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat);
+extern unsigned long __init count_free_bootmem_node (pg_data_t *pgdat);
 extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
 #define alloc_bootmem_node(pgdat, x) \
 	__alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/init/main.c linux/init/main.c
--- vanilla/linux/init/main.c	Fri Nov  9 18:40:00 2001
+++ linux/init/main.c	Sat Nov 10 04:58:16 2001
@@ -597,7 +597,6 @@
 	proc_caches_init();
 	vfs_caches_init(mempages);
 	buffer_init(mempages);
-	page_cache_init(mempages);
 #if defined(CONFIG_ARCH_S390)
 	ccwcache_init();
 #endif
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/bootmem.c linux/mm/bootmem.c
--- vanilla/linux/mm/bootmem.c	Tue Sep 18 14:10:43 2001
+++ linux/mm/bootmem.c	Sat Nov 10 05:18:53 2001
@@ -154,6 +154,9 @@
 	if (align & (align-1))
 		BUG();

+	offset = (bdata->node_boot_start & (align - 1));
+	offset >>= PAGE_SHIFT;
+
 	/*
 	 * We try to allocate bootmem pages above 'goal'
 	 * first, then we try to allocate lower pages.
@@ -165,6 +168,7 @@
 		preferred = 0;

 	preferred = ((preferred + align - 1) & ~(align - 1)) >> PAGE_SHIFT;
+	preferred += offset;
 	areasize = (size+PAGE_SIZE-1)/PAGE_SIZE;
 	incr = align >> PAGE_SHIFT ? : 1;
@@ -184,7 +188,7 @@
 fail_block:;
 	}
 	if (preferred) {
-		preferred = 0;
+		preferred = offset;
 		goto restart_scan;
 	}
 	return NULL;
@@ -272,6 +276,28 @@
 	return total;
 }

+static unsigned long __init count_free_bootmem_core(pg_data_t *pgdat)
+{
+	bootmem_data_t *bdata = pgdat->bdata;
+	unsigned long i, idx, total;
+
+	if (!bdata->node_bootmem_map)
+		BUG();
+
+	total = 0;
+	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
+	for (i = 0; i < idx; i++) {
+		if (!test_bit(i, bdata->node_bootmem_map))
+			total++;
+	}
+
+	/*
+	 * Count the allocator bitmap itself.
+	 */
+	total += ((bdata->node_low_pfn-(bdata->node_boot_start >> PAGE_SHIFT))/8 + PAGE_SIZE-1)/PAGE_SIZE;
+
+	return total;
+}
+
 unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn)
 {
 	return(init_bootmem_core(pgdat, freepfn, startpfn, endpfn));
@@ -292,6 +318,11 @@
 	return(free_all_bootmem_core(pgdat));
 }

+unsigned long __init count_free_bootmem_node (pg_data_t *pgdat)
+{
+	return(count_free_bootmem_core(pgdat));
+}
+
 unsigned long __init init_bootmem (unsigned long start, unsigned long pages)
 {
 	max_low_pfn = pages;
@@ -312,6 +343,11 @@
 unsigned long __init free_all_bootmem (void)
 {
 	return(free_all_bootmem_core(&contig_page_data));
+}
+
+unsigned long __init count_free_bootmem (void)
+{
+	return(count_free_bootmem_core(&contig_page_data));
 }

 void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/filemap.c linux/mm/filemap.c
--- vanilla/linux/mm/filemap.c	Fri Nov  9 18:40:00 2001
+++ linux/mm/filemap.c	Sat Nov 10 05:15:16 2001
@@ -24,6 +24,7 @@
 #include <linux/mm.h>
 #include <linux/iobuf.h>
 #include <linux/compiler.h>
+#include <linux/bootmem.h>

 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -2929,28 +2930,48 @@
 	goto unlock;
 }

+/* This is called from the arch specific mem_init routine.
+ * It is done right before free_all_bootmem (or NUMA equivalent).
+ *
+ * The mempages arg is the number of pages free_all_bootmem is
+ * going to liberate, or a close approximation.
+ *
+ * We have to use bootmem because on huge systems (ie. 16GB ram)
+ * get_free_pages cannot give us a large enough allocation.
+ */
 void __init page_cache_init(unsigned long mempages)
 {
-	unsigned long htable_size, order;
+	unsigned long htable_size, real_size;

 	htable_size = mempages;
 	htable_size *= sizeof(struct page *);
-	for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
+
+	for (real_size = 1UL; real_size < htable_size; real_size <<= 1UL)
 		;

 	do {
-		unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
+		unsigned long tmp = (real_size / sizeof(struct page *));
+		unsigned long align;

 		page_hash_bits = 0;
 		while((tmp >>= 1UL) != 0UL)
 			page_hash_bits++;
+
+		align = real_size;
+		if (align > (4UL * 1024UL * 1024UL))
+			align = (4UL * 1024UL * 1024UL);
+
+		page_hash_table = __alloc_bootmem(real_size, align,
+						  __pa(MAX_DMA_ADDRESS));
+
+		/* Perhaps the alignment was too strict. */
+		if (page_hash_table == NULL)
+			page_hash_table = alloc_bootmem(real_size);
+	} while (page_hash_table == NULL &&
+		 (real_size >>= 1UL) >= PAGE_SIZE);

-		page_hash_table = (struct page **)
-			__get_free_pages(GFP_ATOMIC, order);
-	} while(page_hash_table == NULL && --order > 0);
-
-	printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
-	       (1 << page_hash_bits), order, (PAGE_SIZE << order));
+	printk("Page-cache hash table entries: %d (%ld bytes)\n",
+	       (1 << page_hash_bits), real_size);

 	if (!page_hash_table)
 		panic("Failed to allocate page hash table\n");

 	memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));

^ permalink raw reply	[flat|nested] 45+ messages in thread
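For readers skimming the patch, the sizing strategy in the new page_cache_init() can be modelled outside the kernel. The following is a hypothetical user-space sketch (fake_alloc_bootmem(), pick_htable_size(), mem_available, and the fixed 4096-byte page size are inventions for illustration, not kernel APIs): round the requested table size up to a power of two, cap the alignment request at 4MB, and halve the size after each failed allocation until it would fall below one page.

```c
#define PAGE_SIZE 4096UL
#define MAX_HASH_ALIGN (4UL * 1024UL * 1024UL)	/* 4MB alignment cap from the patch */

/* Hypothetical stand-in for bootmem: pretend any request larger than
 * mem_available bytes fails, so the shrink-and-retry loop can be
 * exercised in user space. */
static unsigned long mem_available;

static int fake_alloc_bootmem(unsigned long size, unsigned long align)
{
	(void)align;	/* the real __alloc_bootmem() honours this; our model ignores it */
	return size <= mem_available;
}

/* The size the loop in the patched page_cache_init() would settle on:
 * the smallest power of two >= wanted that the allocator can satisfy,
 * or 0 once we would drop below one page. */
unsigned long pick_htable_size(unsigned long wanted, unsigned long mem)
{
	unsigned long real_size, align;

	mem_available = mem;
	/* Round the wanted size up to a power of two, as the patch does. */
	for (real_size = 1UL; real_size < wanted; real_size <<= 1UL)
		;
	do {
		align = real_size;
		if (align > MAX_HASH_ALIGN)
			align = MAX_HASH_ALIGN;
		/* The patch tries an aligned allocation first, then a
		 * relaxed alloc_bootmem() fallback; one attempt suffices
		 * in this model. */
		if (fake_alloc_bootmem(real_size, align))
			return real_size;
	} while ((real_size >>= 1UL) >= PAGE_SIZE);
	return 0;
}
```

The point of the fallbacks is graceful degradation: a huge machine gets a large, physically aligned table, while a constrained one still boots with whatever smaller power of two fits.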
* Re: speed difference between using hard-linked and modular drives?
  2001-11-08 23:00             ` Andi Kleen
  2001-11-09  0:05               ` Anton Blanchard
@ 2001-11-09  3:12               ` Rusty Russell
  2001-11-09  5:59                 ` Andi Kleen
  2001-11-09 11:16                 ` Helge Hafting
  1 sibling, 2 replies; 45+ messages in thread
From: Rusty Russell @ 2001-11-09  3:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel

On 09 Nov 2001 00:00:19 +0100
Andi Kleen <ak@suse.de> wrote:
> Ingo Molnar <mingo@elte.hu> writes:
> >
> > we should fix this by trying to allocate continuous physical memory if
> > possible, and fall back to vmalloc() only if this allocation fails.
>
> Check -aa.  A patch to do that has been in there for some time now.
>
> -Andi
>
> P.S.: It makes a measurable difference with some Oracle benchmarks with
> the Qlogic driver.

Modules have lots of little disadvantages that add up.  The speed penalty
on various platforms is one, the load/unload race complexity is another.

There's a widespread "modules are free!" mentality: they're not, and we
can add complexity trying to make them "free", but it might be wiser to
realize that dynamic adding and deleting from a running kernel is a
problem on par with a pageable kernel, and may not be the greatest thing
since sliced bread.

Rusty.
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-09  3:12               ` Rusty Russell
@ 2001-11-09  5:59                 ` Andi Kleen
  2001-11-09 11:16                 ` Helge Hafting
  1 sibling, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2001-11-09  5:59 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Andi Kleen, mingo, linux-kernel

On Fri, Nov 09, 2001 at 02:12:15PM +1100, Rusty Russell wrote:
> Modules have lots of little disadvantages that add up.  The speed penalty
> on various platforms is one, the load/unload race complexity is another.

At least for the speed penalty due to TLB thrashing: I would not really
blame modules in this case; it is just an application crying for large
pages support.

-Andi
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-09  3:12               ` Rusty Russell
  2001-11-09  5:59                 ` Andi Kleen
@ 2001-11-09 11:16                 ` Helge Hafting
  2001-11-12  9:59                   ` Rusty Russell
  1 sibling, 1 reply; 45+ messages in thread
From: Helge Hafting @ 2001-11-09 11:16 UTC (permalink / raw)
  To: Rusty Russell, linux-kernel

Rusty Russell wrote:
> Modules have lots of little disadvantages that add up.  The speed penalty
> on various platforms is one, the load/unload race complexity is another.

Races can be fixed.  (Isn't that one of the things considered for 2.5?)
The speed penalty on various platforms is there to stay, so you simply
have to weigh that against having more swappable RAM.

I use the following rules of thumb:

1. Modules only for seldom-used devices.  A module for the mouse is no
   use if you do all your work in X.  There's simply no gain from a
   module that never unloads.  A seldom-used fs may be modular though.
   I rarely use cd's, so isofs is a module on my machine.

2. No modules for high-speed stuff like harddisks and network; that's
   where you might feel the slowdown.  Low-speed stuff like the floppy
   and cdrom drivers is modular though.

Helge Hafting
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-09 11:16                 ` Helge Hafting
@ 2001-11-12  9:59                   ` Rusty Russell
  2001-11-12 23:23                     ` David S. Miller
  0 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2001-11-12  9:59 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel

On Fri, 09 Nov 2001 12:16:49 +0100
Helge Hafting <helgehaf@idb.hist.no> wrote:
> Rusty Russell wrote:
>
> > Modules have lots of little disadvantages that add up.  The speed penalty
> > on various platforms is one, the load/unload race complexity is another.
>
> Races can be fixed.  (Isn't that one of the things considered for 2.5?)

We get more problems if we go preemptible (some seem to think that
preemption is "free").  And some races can be fixed by paying more of a
speed penalty (atomic_inc & atomic_dec_and_test for every packet,
anyone?).

Hope that clarifies,
Rusty.
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-12  9:59                   ` Rusty Russell
@ 2001-11-12 23:23                     ` David S. Miller
  2001-11-12 23:14                       ` Rusty Russell
  0 siblings, 1 reply; 45+ messages in thread
From: David S. Miller @ 2001-11-12 23:23 UTC (permalink / raw)
  To: rusty; +Cc: helgehaf, linux-kernel

   From: Rusty Russell <rusty@rustcorp.com.au>
   Date: Mon, 12 Nov 2001 20:59:05 +1100

   (atomic_inc & atomic_dec_and_test for every packet, anyone?).

We already do pay that price, in skb_release_data() :-)
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-12 23:23                     ` David S. Miller
@ 2001-11-12 23:14                       ` Rusty Russell
  2001-11-13  1:30                         ` Mike Fedyk
  0 siblings, 1 reply; 45+ messages in thread
From: Rusty Russell @ 2001-11-12 23:14 UTC (permalink / raw)
  To: David S. Miller; +Cc: helgehaf, linux-kernel

In message <20011112.152304.39155908.davem@redhat.com> you write:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Mon, 12 Nov 2001 20:59:05 +1100
>
>    (atomic_inc & atomic_dec_and_test for every packet, anyone?).
>
> We already do pay that price, in skb_release_data() :-)

Sorry, I wasn't clear!  skb_release_data() does an atomic op on the
skb data region, which is almost certainly on the same CPU.  This is
an atomic op on a global counter for the module, which almost
certainly isn't.

For something which (statistically speaking) never happens (module
unload).

Ouch,
Rusty.
--
Premature optmztion is rt of all evl. --DK
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-12 23:14                       ` Rusty Russell
@ 2001-11-13  1:30                         ` Mike Fedyk
  2001-11-13  1:15                           ` David Lang
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Fedyk @ 2001-11-13  1:30 UTC (permalink / raw)
  To: Rusty Russell; +Cc: David S. Miller, helgehaf, linux-kernel

On Tue, Nov 13, 2001 at 10:14:22AM +1100, Rusty Russell wrote:
> In message <20011112.152304.39155908.davem@redhat.com> you write:
> > From: Rusty Russell <rusty@rustcorp.com.au>
> > Date: Mon, 12 Nov 2001 20:59:05 +1100
> >
> >    (atomic_inc & atomic_dec_and_test for every packet, anyone?).
> >
> > We already do pay that price, in skb_release_data() :-)
>
> Sorry, I wasn't clear!  skb_release_data() does an atomic op on the
> skb data region, which is almost certainly on the same CPU.  This is
> an atomic op on a global counter for the module, which almost
> certainly isn't.
>
> For something which (statistically speaking) never happens (module
> unload).

Is this in the fast path or slow path?

If it only happens on (un)load, then there isn't any cost until it's needed...

Mike
^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: speed difference between using hard-linked and modular drives?
  2001-11-13  1:30                         ` Mike Fedyk
@ 2001-11-13  1:15                           ` David Lang
  0 siblings, 0 replies; 45+ messages in thread
From: David Lang @ 2001-11-13  1:15 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Rusty Russell, David S. Miller, helgehaf, linux-kernel

Mike, the point is that the module count inc/dec would need to be done
for every packet so that when you go to unload you can check the usage
value.  So the check is done in the slow path, but the inc/dec is done
in the fast path.

David Lang

On Mon, 12 Nov 2001, Mike Fedyk wrote:

> Date: Mon, 12 Nov 2001 17:30:14 -0800
> From: Mike Fedyk <mfedyk@matchmail.com>
> To: Rusty Russell <rusty@rustcorp.com.au>
> Cc: David S. Miller <davem@redhat.com>, helgehaf@idb.hist.no,
>     linux-kernel@vger.kernel.org
> Subject: Re: speed difference between using hard-linked and modular
>     drives?
>
> On Tue, Nov 13, 2001 at 10:14:22AM +1100, Rusty Russell wrote:
> > In message <20011112.152304.39155908.davem@redhat.com> you write:
> > > From: Rusty Russell <rusty@rustcorp.com.au>
> > > Date: Mon, 12 Nov 2001 20:59:05 +1100
> > >
> > >    (atomic_inc & atomic_dec_and_test for every packet, anyone?).
> > >
> > > We already do pay that price, in skb_release_data() :-)
> >
> > Sorry, I wasn't clear!  skb_release_data() does an atomic op on the
> > skb data region, which is almost certainly on the same CPU.  This is
> > an atomic op on a global counter for the module, which almost
> > certainly isn't.
> >
> > For something which (statistically speaking) never happens (module
> > unload).
>
> Is this in the fast path or slow path?
>
> If it only happens on (un)load, then there isn't any cost until it's needed...
>
> Mike
>
^ permalink raw reply	[flat|nested] 45+ messages in thread
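David's point — two atomic operations on a shared counter in the per-packet fast path, checked only at unload time — can be illustrated with a small model. This is a hypothetical user-space sketch in C11 (the function names are inventions, and atomic_fetch_add/atomic_fetch_sub merely stand in for the kernel's atomic_inc and atomic_dec_and_test; the real 2.4 module count machinery differs):

```c
#include <stdatomic.h>

/* Model of a module use count: every packet the driver handles bumps a
 * single counter shared across CPUs on entry and drops it on exit,
 * purely so that a rare unload can observe when the module is idle. */
static atomic_long use_count;

/* Fast path, per packet: the cost being complained about. */
static void packet_enter(void)
{
	atomic_fetch_add(&use_count, 1);	/* analogue of atomic_inc() */
}

/* Returns nonzero if this caller was the last user, mirroring
 * atomic_dec_and_test() (fetch_sub returns the value before the drop). */
static int packet_exit(void)
{
	return atomic_fetch_sub(&use_count, 1) == 1;
}

/* Slow path, almost never run: unloading is safe only when no packet
 * is in flight. */
static int can_unload(void)
{
	return atomic_load(&use_count) == 0;
}
```

The counter is global, so on SMP the cache line holding it bounces between CPUs on every packet, which is exactly why Rusty contrasts it with the skb's own refcount that usually stays CPU-local.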
end of thread, other threads:[~2001-11-13 1:40 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-11-08 16:01 speed difference between using hard-linked and modular drives? Roy Sigurd Karlsbakk
2001-11-08 17:02 ` Ingo Molnar
2001-11-08 17:37 ` Ingo Molnar
2001-11-08 23:59 ` Anton Blanchard
2001-11-09 5:11 ` Keith Owens
2001-11-10 3:35 ` Anton Blanchard
2001-11-10 7:26 ` Keith Owens
2001-11-08 17:53 ` Robert Love
[not found] <Pine.LNX.4.33.0111081802380.15975-100000@localhost.localdomain.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.33.0111081836080.15975-100000@localhost.localdomain.suse.lists.linux.kernel>
2001-11-08 23:00 ` Andi Kleen
2001-11-09 0:05 ` Anton Blanchard
2001-11-09 5:45 ` Andi Kleen
2001-11-09 6:04 ` David S. Miller
2001-11-09 6:39 ` Andi Kleen
2001-11-09 6:54 ` Andrew Morton
2001-11-09 7:17 ` David S. Miller
2001-11-09 7:16 ` Andrew Morton
2001-11-09 7:24 ` David S. Miller
2001-11-09 8:21 ` Ingo Molnar
2001-11-09 7:35 ` Andrew Morton
2001-11-09 7:44 ` David S. Miller
2001-11-09 7:14 ` David S. Miller
2001-11-09 7:16 ` David S. Miller
2001-11-09 12:59 ` Alan Cox
2001-11-09 12:54 ` David S. Miller
2001-11-09 13:15 ` Philip Dodd
2001-11-09 13:26 ` David S. Miller
2001-11-09 20:45 ` Mike Fedyk
2001-11-09 13:17 ` Andi Kleen
2001-11-09 13:25 ` David S. Miller
2001-11-09 13:39 ` Andi Kleen
2001-11-09 13:41 ` David S. Miller
2001-11-10 5:20 ` Anton Blanchard
2001-11-10 4:56 ` Anton Blanchard
2001-11-10 5:09 ` Andi Kleen
2001-11-10 13:29 ` David S. Miller
2001-11-10 13:44 ` David S. Miller
2001-11-10 13:52 ` David S. Miller
2001-11-09 3:12 ` Rusty Russell
2001-11-09 5:59 ` Andi Kleen
2001-11-09 11:16 ` Helge Hafting
2001-11-12 9:59 ` Rusty Russell
2001-11-12 23:23 ` David S. Miller
2001-11-12 23:14 ` Rusty Russell
2001-11-13 1:30 ` Mike Fedyk
2001-11-13 1:15 ` David Lang