* RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
@ 2007-07-06 22:26 Andrea Arcangeli
2007-07-06 23:33 ` Dave Hansen
From: Andrea Arcangeli @ 2007-07-06 22:26 UTC (permalink / raw)
To: linux-kernel
Hello,
for the hack week at opensuse (see http://idea.opensuse.org/) I've
been working on a new feature called CONFIG_PAGE_SHIFT.
In the last few days, while reading the topics for the VM summit, I
answered that I disliked the dependency on defrag for reliable I/O and
mentioned that I had an alternative design that could already run
static binaries. Some discussion came up and I got requests to
disclose my work sooner rather than later. Frankly I wanted to make a
bit more progress before posting it, because I'm still unsure whether
this is a totally good idea, and the large part of the userland page
fault frequency reduction isn't implemented yet. OTOH I sure like it
more than everything else I've seen so far in this space, and I don't
want to hide it further after explicit requests from other vm
developers to disclose it ;).
Some background:
The x86 and amd64 architectures only support a fixed 4k base page
size. The smaller the page size, the less memory is wasted in
partially used pages, but the slower the overall performance. The
next available hardware page size is 2M, which is generally too big
for general purpose applications, and the x86 ABI requires the mmap
offset parameter to work with 4k granularity (the amd64 ABI fixes that
problem, but apps have been written for the x86 ABI, so we'd rather
keep supporting the 4k file offset granularity in mmap if we want to
be sure not to break backwards compatibility with userland, especially
for the 32bit compatibility mode).
While there's nothing we can do in software to alleviate the
_hardware_ related overhead of the 4k page size (like tlb caching and
the frequency of hardware pagetable walks), the 4k page size ends up
hurting many purely software things too.
The xfs developers for example want to enlarge their filesystem
blocksize (the filesystem blocksize has a tradeoff similar to the
PAGE_SIZE: the larger the blocksize the faster the filesystem, but
more disk space is potentially wasted). They also want the “normal”,
efficient writeback pagecache behavior when using a writable fs on top
of a dvd-ram with a hard blocksize of 64k. But they can't on
x86/amd64, because the PAGE_SIZE is still 4k and the whole linux
kernel can't handle a blocksize larger than PAGE_SIZE.
What they miss is that the problem with the 4k PAGE_SIZE isn't just
the maximum blocksize we can support (i.e. a dvd with a 64k hard
blocksize): the _whole_ kernel (not only the storage/fs subsystems) is
slower because of the 4k thing. This starts with the page faults in a
memcpy(), which are double the number they would be with an 8k page
size, and extends to all the memory allocations (including
slub/slab/blob/whatever), which are 2, 4, or 8 fold the ones that
would happen with an 8k/16k/32k page size.
So my whole idea is to once and for all decouple the size of the
pte entry (4k on x86/amd64) from the page allocator granularity. The
HARD_PAGE_SIZE will still be 4k, while the common code PAGE_SIZE will
be variable and configurable at compile time with CONFIG_PAGE_SHIFT.
I feel this needs to happen at some point in the linux VM, since once
it's done I can't imagine any server running with a 4k page size anymore.
Rule number 1: the moment you need to rely on order > 0 allocations
for critical things like basic buffered I/O, you must make everything
an order > 0 allocation and just boost the PAGE_SIZE. Only vm_pgoff
and other pte manipulations will still be indexed at the
HARD_PAGE_SIZE; all common code won't notice. The backwards
compatibility is provided by tracking
vm_pgoff+((addr & ~PAGE_MASK) >> HARD_PAGE_SHIFT), see
hardpfn_offset_to_index. Thanks to anon-vma this should work for
anon+mremap too; I still need to figure out some bits there, but I
don't see anything fundamentally different (the whole point of
anon-vma is to reduce the differences in that area and to allow doing
on anonymous memory anything we can do on pagecache). The pagecache
side already apparently works; it still needs a restart of the
pagetable walking loop over the PAGE_SHIFT-HARD_PAGE_SHIFT bits,
bounded by vm_start >> HARD_PAGE_SHIFT and vm_end >> HARD_PAGE_SHIFT,
in order to reduce the page fault rate. The pte unmapping may be
severely broken too.
Once finished, this should allow for a totally backwards compatible
design without any aliasing in the pagecache (only the pte won't be
naturally aligned, but that's ok: aliasing at the virtual level is a
fundamental property of the VM and it always happens).
This whole issue is really a pure tradeoff between memory consumption
and I/O and CPU performance (and for the dvd-ram and xfs also a way to
use a larger hard blocksize), so being able to benchmark is the first
priority; if there's no significant benchmark gain this whole thing
may be a failure. I'm not talking about the I/O bound side, where the
performance boost is guaranteed (exactly the same as with the variable
order page cache).
64k is probably the ideal value for CONFIG_PAGE_SHIFT on db servers:
8 times faster at allocating ram, yet without huge ram waste, and with
an especially optimal I/O size for ide (and better for scsi too of
course).
Comparison with the “variable order page cache”: that approach tries
to keep the page allocator at 4k and changes the pagecache layer to do
order > 0 allocations. The major showstopper with their design is that
there's no way they can reliably defrag kernel memory as long as any
driver is still allowed to run alloc_page(). Worst of all, the
defragmenter will waste lots of cpu if it has a hard time defragging,
so it's not a straightforward tradeoff, and it has corner cases whose
underperformance will be hard to evaluate because they normally won't
trigger (even if in the best case the I/O performance will be
good). Even worse, if it eventually fails to defrag (no guarantees can
be made unless certain areas of memory are marked non generic), I/O
reliability will be decreased. So it would need at least a fallback to
order 0 to be really reliable. And despite all the above downsides it
provides no advantage except being able to access devices with a
hard/soft blocksize larger than 4k (it only tackles the I/O
performance side; if anything it hurts CPU performance). My design
solves their troubles (I/O performance) and at the same time boosts
the performance of everyone else too. It does require compiling a
kernel with a special CONFIG_PAGE_SHIFT, but then you also have to
specially create xfs with a >4k blocksize, so it seems a minor issue
(especially for the 1024 cpu systems ;), and in theory
CONFIG_PAGE_SHIFT could also become a boot time parameter if we're ok
with wasting quite some cycles at runtime.
The original idea of having a software page size larger than the
hardware page size originated at SUSE, from myself and Andi Kleen,
while helping AMD design their amd64 cpu. IIRC the conclusion was not
to worry too much about the 4k page size being too small, because we
could make a soft page size if the time came (or even a 2M PAGE_SIZE
kernel). It's just that at the time we thought we had to break
backwards compatibility (hence the ABI change in amd64 no longer
requiring 4k mmap offset alignment), but I hope my current
improved/refined idea for the hack week, of handling
not-naturally-aligned pages using the vm_pgoff indexed at
HARD_PAGE_SIZE plus the few bits of virtual address between PAGE_SHIFT
and HARD_PAGE_SHIFT, will not need to break anything anymore.
The following simple bench seems to run fine on one real machine and
on kvm (a friend of mine has failed so far to run it on his hardware
though, so perhaps some driver triggers some remaining bugs) when
booted as init=/tmp/bench-static after “cp -a /dev/hda /tmp/”.
#include <stdio.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>

#define BUFSIZE (100*1024*1024)

int main(void)
{
	struct timeval before, after;
	int fd = open("/tmp/hda", O_RDONLY);
	unsigned long usec;
	char *c = malloc(BUFSIZE);

	assert(c);
	assert(fd > 2);
	for (;;) {
		gettimeofday(&before, NULL);
		if (read(fd, c, BUFSIZE) != BUFSIZE)
			printf("error\n");
		gettimeofday(&after, NULL);
		lseek(fd, 0, SEEK_SET);
		usec = (after.tv_sec - before.tv_sec)*1000000;
		usec += after.tv_usec - before.tv_usec;
		printf("%lu usec\n", usec);
	}
}
CONFIG_PAGE_SHIFT = 12 (default):
109770 usec
109673 usec
CONFIG_PAGE_SHIFT = 13 (8k page size)
108738 usec
108667 usec
Numbers are totally repeatable. Because I was too lazy to adapt the
anonymous memory page faults so far, the page coloring is guaranteed
to be the worst possible in the least significant bit of the page
color, but once I stop wasting gigantic amounts of ram on anon memory
and reduce the page fault rate by 2, 4, 8, 16, etc. times, anonymous
memory will be automatically page-colored (for the first time;
actually not a perfect coloring, but a better coloring for sure, and
the larger the PAGE_SHIFT the better the coloring), so after that
there shouldn't be slowdowns anymore even at very large PAGE_SIZEs
like 64k and over.
The max PAGE_SIZE supported is 8M, but implementation details in
pgattr will likely prevent booting (right now even compiling) with a
page size over 2M (easy to fix, but going over 2M wasn't a short term
worry). Clearly once we reach those large PAGE_SIZEs, it'll also be
possible to use the pmd to map the 'struct page' with a large tlb if
it has been mapped naturally aligned in the virtual address space.
If you want to help or have a look, here is the patch:
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.22-rc7/hard-page-size
I'm tracking it with the hg mq extension so far, but I can change that
if it helps.
Thanks.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Dave Hansen @ 2007-07-06 23:33 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
On Sat, 2007-07-07 at 00:26 +0200, Andrea Arcangeli wrote:
> for the hack week at opensuse (see http://idea.opensuse.org/) I've
> been working on a new feature called CONFIG_PAGE_SHIFT.
...
> If you want to help/look here the patch:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.22-rc7/hard-page-size
>
> I'm tracking it with hg mq extension so far, but I can change if it
> helps.
The patch looks really interesting, it's just a little hard to parse
with all of the s/4096/PAGE_SIZE/ bits around. Those cleanups, along
with the s/PAGE_SIZE/HARD_PAGE_SIZE/ parts, would be great in a
separated-out patch so that the really juicy bits (like the pte
handling), where the new logic is, stand out better.
I think it would help readability to have something like:
#define PAGES_PER_HARD_PAGE (1<<(PAGE_SHIFT-HARD_PAGE_SHIFT))
which would look like this:
- if (unlikely(!pfn_valid(pfn))) {
+ if (unlikely(!pfn_valid(pfn * PAGES_PER_HARD_PAGE))) {
Instead of having hardpfn_t, would it be more useful to tag the types
with sparse? That's probably something that other interested parties
could work on.
-- Dave
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Andrea Arcangeli @ 2007-07-06 23:52 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-kernel
On Fri, Jul 06, 2007 at 04:33:21PM -0700, Dave Hansen wrote:
> The patch looks really interesting, it's just a little hard to parse
> with all of the s/4096/PAGE_SIZE/ bits around. Those cleanups, along
> with the s/PAGE_SIZE/HARD_PAGE_SIZE/ parts would be great in a
> separated-out patch so that the really juicy bits (like the pte
> handling) where the new logic is stand out better.
Agreed.
> I think it would help readability to have something like:
>
> #define PAGES_PER_HARD_PAGE (1<<(PAGE_SHIFT-HARD_PAGE_SHIFT))
Indeed.
>
> which would look like this:
>
> - if (unlikely(!pfn_valid(pfn))) {
> + if (unlikely(!pfn_valid(pfn * PAGES_PER_HARD_PAGE))) {
I normally prefer to shift left/right rather than to multiply/divide,
so feel free to suggest another define name for just
PAGE_SHIFT-HARD_PAGE_SHIFT; then you can #define PAGES_PER_HARD_PAGE
(1<<definename).
> Instead of having hardpfn_t, would it be more useful to tag the types
> with sparse? That's probably something that other interested parties
> could work on.
Ouch, hardpfn_t so far is unused ;). I initially wanted to try to make
things more type safe, but then it didn't work out very well so I
deferred it.
BTW, in a parallel thread (the thread where I was advised to post
this), Rik rightfully mentioned that Bill once also tried to get this
working, and basically asked about the differences. I don't know
exactly what Bill did, I only remember well the major reason he did
it. Below I add some more comments on Bill's work, taken from my
answer to Rik:
---------------
Right, I almost forgot he also tried enlarging the PAGE_SIZE at some
point, back then it was for the 32bit systems with 64G of ram, to
reduce the mem_map array, something my patch achieves too btw.
I thought his approach was of the old type, not backwards compatible,
the one we also considered for amd64, and I seem to remember he was
trying to solve the backwards compatibility issue without much
success.
But really I'm unsure how Bill could have achieved anything backwards
compatible back then without anon-vma... anon-vma is the enabler. I
remember he worked on enlarging the PAGE_SIZE back then, but I don't
recall him exposing HARD_PAGE_SIZE to the common code either (actually
I never saw his code so I can't be sure of this). Even if he had pte
chains back then, reaching the pte wasn't enough, and I doubt he could
unwalk the pagetable tree from pte up to pmd up to pgd/mm, up to the
vma, to read the vm_pgoff that btw was meaningless back then for the
anon vmas ;).
Things are very complex, but I think it's possible by doing proper
math on vm_pgoff, vm_start/vm_end and the address; with just those 4
things we should have enough info to know which parts of each page to
map in which pte, and that's all we need to solve it. A second
mprotect of 4k over the same 8k page will get two vmas queued in the
same anon-vma. So we check both vmas, and by looking at vm_pgoff (in
hardpage units) + (((address-vm_start) & ~PAGE_MASK) >>
HARD_PAGE_SHIFT) we should be able to tell if the ptes behind the vma
need to be updated and if the second vma can be merged back.
The idea to make it work is to synchronously map all the ptes for all
indexes covered by each page, as long as they're in the range
vm_start >> HARD_PAGE_SHIFT to vm_end >> HARD_PAGE_SHIFT. We should
treat a page fault like a multiple page fault. Then when you mprotect
or mremap, you already know which ptes are mapped and which you need
to unmap/update by looking at the start/end hard-page indexes, and you
also have to always check all vmas that could possibly map a page, if
the page crosses the vm_start/vm_end boundary.
Definitely not easy, but I hope feasible, because I couldn't think of
a case where we can't figure out which part of the page to map in
which pte. I wish I had it implemented before posting, because then I
would be 100% sure it was feasible ;).
Now if somebody here can think of a case where we can't know where to
map which part of the page in which pte, then *that* would be very
interesting and it could save some wasted development effort. Unless
this happens, I guess I can keep trying to make it work, hopefully now
with some help.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Badari Pulavarty @ 2007-07-07 1:36 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: lkml
On Sat, 2007-07-07 at 00:26 +0200, Andrea Arcangeli wrote:
> Hello,
>
..
>
> If you want to help/look here the patch:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.22-rc7/hard-page-size
>
Very interesting patch set. I really would like to see support for it.
I would like to play with it, so please keep the patchset up to date.
Here is a small nit fix:
Thanks,
Badari
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.22-rc7/mm/migrate.c
===================================================================
--- linux-2.6.22-rc7.orig/mm/migrate.c 2007-07-01 12:54:24.000000000 -0700
+++ linux-2.6.22-rc7/mm/migrate.c 2007-07-06 19:58:43.000000000 -0700
@@ -169,7 +169,7 @@ static void remove_migration_pte(struct
goto out;
get_page(new);
- pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
+ pte = pte_mkold(mk_pte(new, addr, vma->vm_page_prot));
if (is_write_migration_entry(entry))
pte = pte_mkwrite(pte);
set_pte_at(mm, addr, ptep, pte);
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Badari Pulavarty @ 2007-07-07 1:47 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: lkml
On Sat, 2007-07-07 at 00:26 +0200, Andrea Arcangeli wrote:
..
> The following simple bench seems to run fine on one real hardware and
> on kvm (a friend of mine failed so far to run it on his hardware
> though, so perhaps some driver triggers some remaining bugs) when
> booted as init=/tmp/bench-static after “cp -a /dev/hda /tmp/”.
Hmm.. I didn't have any luck booting my machine with the patchset
(with 8k pagesize) :(
It fails to find the partition table on my hard drive.
Thanks,
Badari
....
AMD8111: IDE controller at PCI slot 0000:00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
AMD8111: 0000:00:07.1 (rev 03) UDMA133 controller
ide0: BM-DMA at 0x1020-0x1027, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0x1028-0x102f, BIOS settings: hdc:DMA, hdd:pio
hda: IC35L080AVVA07-0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: TOSHIBA DVD-ROM SD-M1612, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hda: max request size: 128KiB
hda: 160836480 sectors (82348 MB) w/1863KiB Cache, CHS=65535/16/63, UDMA
(100)
hda: cache flushes supported
hda: unknown partition table <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
hdc: ATAPI 48X DVD-ROM drive, 512kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
ide-floppy driver 0.99.newide
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
input: AT Translated Set 2 keyboard as /class/input/input0
input: PC Speaker as /class/input/input1
input: PS/2 Generic Mouse as /class/input/input2
TCP cubic registered
NET: Registered protocol family 1
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
VFS: Cannot open root device "hda2" or unknown-block(3,2)
Please append a correct "root=" boot option; here are the available
partitions:
0300 80418240 hda driver: ide-disk
1600 4194302 hdc driver: ide-cdrom
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-
block(3,2)
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Paul Mackerras @ 2007-07-07 7:01 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
Andrea Arcangeli writes:
> So my whole idea is to once and for all decouple the size of the
> pte entry (4k on x86/amd64) from the page allocator granularity. The
> HARD_PAGE_SIZE will still be 4k, while the common code PAGE_SIZE will
> be variable and configurable at compile time with CONFIG_PAGE_SHIFT.
How does the page cache work with your scheme? For example if I have
1000 1kB files cached in the page cache, and 16k PAGE_SIZE, does that
use up 4M, or 16M?
Paul.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Andrea Arcangeli @ 2007-07-07 10:12 UTC (permalink / raw)
To: Badari Pulavarty; +Cc: lkml
On Fri, Jul 06, 2007 at 06:47:01PM -0700, Badari Pulavarty wrote:
> Hmm.. I didn't have any luck booting my machine with the patchset
> (with 8k pagesize) :(
>
> It fails to find the partition table on my hard drive.
I'm afraid I can't reproduce it :( Best would be to trace that code
and see what's being read from memory.
Please also send me your .config privately, so I may have more luck
reproducing it ;)
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Andrea Arcangeli @ 2007-07-07 10:25 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linux-kernel
On Sat, Jul 07, 2007 at 05:01:57PM +1000, Paul Mackerras wrote:
> Andrea Arcangeli writes:
>
> > So my whole idea is to once and for all decouple the size of the
> > pte entry (4k on x86/amd64) from the page allocator granularity. The
> > HARD_PAGE_SIZE will still be 4k, while the common code PAGE_SIZE will
> > be variable and configurable at compile time with CONFIG_PAGE_SHIFT.
>
> How does the page cache work with your scheme? For example if I have
> 1000 1kB files cached in the page cache, and 16k PAGE_SIZE, does that
> use up 4M, or 16M?
It uses 16M of course. Like I said before:
This whole issue is really a pure tradeoff between memory consumption
and I/O and CPU performance (and for the dvd-ram and xfs also a way to
use a larger hard blocksize).
CONFIG_PAGE_SHIFT allows you to ship a "monster" kernel for db usage
with hundreds of gigs of ram, with a 64k page size and 64k blocksize,
getting all the advantages. We of course must make sure that
CONFIG_PAGE_SHIFT=12 doesn't introduce any slowdown.
Then we mere mortals will enjoy running with an 8k page size too, with
our 2-4G of ram. I used an 8k page size on an alpha workstation back
in 2000 and I didn't feel any substantial ram waste; I had about 2G of
ram. Ok, now the kernel is larger, but even git learnt to use packs ;)
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Jan Engelhardt @ 2007-07-07 18:53 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
On Jul 7 2007 00:26, Andrea Arcangeli wrote:
>Subject: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
I wonder what happens if the soft page size gets set to 2048 bytes :)
Jan
--
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Rik van Riel @ 2007-07-07 20:34 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Andrea Arcangeli, linux-kernel
Jan Engelhardt wrote:
> On Jul 7 2007 00:26, Andrea Arcangeli wrote:
>> Subject: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
>
> I wonder what happens if the soft page size gets set to 2048 bytes :)
That won't work, because the smallest granularity the x86
MMU supports is 4kB.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Andrea Arcangeli @ 2007-07-08 9:52 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: linux-kernel
On Sat, Jul 07, 2007 at 08:53:49PM +0200, Jan Engelhardt wrote:
>
> On Jul 7 2007 00:26, Andrea Arcangeli wrote:
> >Subject: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
>
> I wonder what happens if the soft page size gets set to 2048 bytes :)
Well, the minimum allowed shift is 12, so you can't set it to
2048. But if you're curious to see what happens, you can try removing
the lower limit and forcing it to 11.. ;)
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: David Chinner @ 2007-07-08 23:20 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
On Sat, Jul 07, 2007 at 12:26:51AM +0200, Andrea Arcangeli wrote:
> The xfs developers for example want to enlarge their filesystem
> blocksize (the filesystem blocksize has a tradeoff similar to the
> PAGE_SIZE, the larger the faster the filesystem but more disk space is
> potentially wasted),
I think you've misunderstood why large block sizes are important to
XFS. The major benefits to XFS of a larger block size have almost
nothing to do with data layout or in-memory indexing - they come from
the metadata btrees getting much broader, so we can search much
larger spaces using the same number of seeks. It's metadata
scalability that I'm concerned about here, not file data.
IOWs, larger pages in the page cache are not directly related to
improving data I/O performance of the filesystem, but to allow us
to greatly improve metadata scalability of the filesystem by
allowing us to increase the fundamental block size of the filesystem.
This, in turn, improves the data I/O scalability of the filesystem.
And given that XFS has different metadata block sizes (even on 4k
block size filesystems), it would be really handy to be able to
allocate different sized large pages to match all those different
block sizes so we could avoid having to play vmap() games....
> they also want to use the “normal” writeback
> pagecache efficient behavior when using a writable fs on top of a
> dvd-ram with an hardblocksize of 64k.
In this case "they" != "XFS developers" - you're lumping several
different groups of people that want large pages for I/O into one
group.
This is where simply increasing the page size falls down - if you
want to use large block size on your DVD drive (i.e. every desktop
machine out there) you need to use (say) a 64k page size which is
less than ideal for caching the kernel trees that you are currently
compiling.
e.g. I was recently asked what the downsides of moving from a 16k
page to a 64k page size would be - the back-of-the-envelope
calculations I did for a cached kernel tree showed its footprint
increased from about 300MB to ~1.2GB of RAM, because 80% of the files
in the kernel tree I looked at were smaller than 16k and all that
happened is we wasted much more memory on those files. That's not
what you want for your desktop, yet we would like 32-64k pages for
the DVD drives.
The point that seems to be ignored is that this is not a "one size
fits all" type of problem. This is why the variable page cache may
be a better solution if the fragmentation issues can be solved.
They've been solved before, so I don't see why they can't be solved
again.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
From: Andrea Arcangeli @ 2007-07-10 10:11 UTC (permalink / raw)
To: David Chinner; +Cc: linux-kernel
On Mon, Jul 09, 2007 at 09:20:31AM +1000, David Chinner wrote:
> I think you've misunderstood why large block sizes are important to
> XFS. The major benefits to XFS of larger block size have almost
> nothing to do with data layout or in memory indexing - it comes from
> metadata btree's getting much broader and so we can search much
> larger spaces using the same number of seeks. It's metadata
> scalability that I'm concerned about here, not file data.
I didn't misunderstand. But the reason you can't use a blocksize
larger than 4k is that the PAGE_SIZE is 4k; CONFIG_PAGE_SHIFT raises
the PAGE_SIZE to 8k or more, so you can then enlarge the filesystem
blocksize too.
> IOWs, larger pages in the page cache are not directly related to
> improving data I/O performance of the filesystem, but to allow us
Of course, for I/O performance the CPU cost is mostly irrelevant,
especially with slow storage.
> to greatly improve metadata scalability of the filesystem by
> allowing us to increase the fundamental block size of the filesystem.
> This, in turn, improves the data I/O scalability of the filesystem.
Yes, I'm aware of this, and my patch allows it too in the same way,
but the fundamental difference is that it should help your I/O layout
optimizations with a larger blocksize while at the same time making
the _whole_ kernel faster. And it won't even waste more pagecache than
a variable order page size would (both CONFIG_PAGE_SHIFT and the
variable order page size will waste some pagecache compared to a 4k
page size), so both had better be used for workloads manipulating
large files.
> And given that XFS has different metadata block sizes (even on 4k
> block size filesystems), it would be really handy to be able to
> allocate different sized large pages to match all those different
> block sizes so we could avoid having to play vmap() games....
That should be possible the same way with both designs.
> In this case "they" != "XFS developers" - you're lumping several
> different groups of ppl that want large pages for I/O into one
> group.
Sorry.
> This is where simply increasing the page size falls down - if you
> want to use large block size on your DVD drive (i.e. every desktop
> machine out there) you need to use (say) a 64k page size which is
> less than ideal for caching the kernel trees that you are currently
> compiling.
Totally agreed, your approach would be much better for dvd on the
desktop. If only I could trust it to be reliable (I guess I'd rather
stick to growisofs).
But for your _own_ usage, the big box with lots of ram where a 4k
blocksize is a blocker, my design should be much better, because it'll
give you many more advantages on the CPU side too (the only downside
is the higher complexity in the pte manipulations).
Consider: even if you ended up mounting xfs with a 64k blocksize on a
kernel with a 16k PAGE_SIZE, that's still a smaller fragmentation
risk than using a 64k blocksize on a kernel with a 4k PAGE_SIZE; the
risk of failing defrag when the undefragmentable alloc_page() returns
4k is much higher than when it returns a 16k page. The CPU cost of
defrag itself will be diminished by a factor of 4 too.
> e.g. I was recently asked what the downsides of moving from a 16k
> page to a 64k page size would be - the back-of-the-envelope
> calculations I did for a cached kernel tree showed its footprint
> increased from about 300MB to ~1.2GB of RAM because 80% of the files
> in the kernel tree I looked at were smaller than 16k and all that
> happened is we wasted much more memory on those files. That's not
> what you want for your desktop, yet we would like 32-64k pages for
> the DVD drives.
The equivalent waste will happen on disk if you raise the blocksize to
64k. The same waste will happen as well if you mounted the filesystem
with the cached kernel tree using a variable order page size of 64k.
I guess for maximizing cache usage during kernel development the ideal
PAGE_SIZE would be smaller than 4k...
> The point that seems to be ignored is that this is not a "one size
> fits all" type of problem. This is why the variable page cache may
> be a better solution if the fragmentation issues can be solved.
> They've been solved before, so I don't see why they can't be solved
> again.
You guys need to explain to me how you solved the defrag issue if you
can't defrag the return value of alloc_page(GFP_KERNEL) =
4k. Furthermore you never seem to account for the CPU cost of defrag on
big systems that may need to memcpy a lot of ram. My design doesn't
need proofs, it never requires memcpy, and it'll just always run as
fast as right after boot. Boosting the PAGE_SIZE is a more black and
white and predictable thing, so I've no doubt I prefer it.
BTW, I asked Hugh to look into Bill's and Hugh's old patch to see if
there's some goodness we can copy to solve things like the underlying
overlapping anon page after writeprotect faults over
MAP_PRIVATE. Perhaps there's a better way than looking at the nearby pte
for a pte pointing to PG_anon or a swap entry, which is my current
idea. This is assuming their old patches were really using a similar
design to mine (btw, back then there was no PG_anon but I guess
checking page->mapping for null would have been enough to tell it was
an anon page).
Hugh also reminded me that at KS some years ago their old patch
boosting the PAGE_SIZE was dismissed because it looked unnecessary;
the major reason for wanting it back then was the mem_map_t array
size, and that's not an issue anymore on 64bit archs. But back then,
nobody proposed to boost the pagecache to order > 0 allocations, so
this is one reason why _now_ it's different. It's really your variable
order page size, and the defrag efforts that can't mathematically
guarantee defrag, that triggered my interest in CONFIG_PAGE_SHIFT.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-10 10:11 ` Andrea Arcangeli
@ 2007-07-12 0:12 ` David Chinner
2007-07-12 11:14 ` Andrea Arcangeli
0 siblings, 1 reply; 34+ messages in thread
From: David Chinner @ 2007-07-12 0:12 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David Chinner, linux-kernel
On Tue, Jul 10, 2007 at 12:11:48PM +0200, Andrea Arcangeli wrote:
> On Mon, Jul 09, 2007 at 09:20:31AM +1000, David Chinner wrote:
> > I think you've misunderstood why large block sizes are important to
> > XFS. The major benefits to XFS of larger block size have almost
> > nothing to do with data layout or in memory indexing - it comes from
> > > metadata btrees getting much broader and so we can search much
> > larger spaces using the same number of seeks. It's metadata
> > scalability that I'm concerned about here, not file data.
>
> I didn't misunderstand. But the reason you can't use a larger
> blocksize than 4k is because the PAGE_SIZE is 4k, and
> CONFIG_PAGE_SHIFT raises the PAGE_SIZE to 8k or more, so you can then
> enlarge the filesystem blocksize too.
Sure, but now we waste more memory on small files....
> > to greatly improve metadata scalability of the filesystem by
> > allowing us to increase the fundamental block size of the filesystem.
> > This, in turn, improves the data I/O scalability of the filesystem.
>
> Yes, I'm aware of this, and my patch allows it in the same way, but the
> fundamental difference is that it should help your I/O layout
> optimizations with larger blocksize, while at the same time making the
> _whole_ kernel faster. And it won't even waste more pagecache than a
> variable order page size would (both CONFIG_PAGE_SHIFT and variable
> order page size will waste some pagecache compared to a 4k page
> size). So both are better used for workloads manipulating large files.
The difference is that we can use different blocksizes even
within the one filesystem for small and large files with a
variable page cache. We can't do that with a fixed page size.
> > And given that XFS has different metadata block sizes (even on 4k
> > block size filesystems), it would be really handy to be able to
> > allocate different sized large pages to match all those different
> > block sizes so we could avoid having to play vmap() games....
>
> That should be possible the same way with both designs.
Not really. If I want an inode cache, it always needs to be 8k based.
If I want a directory cache, it needs to be one of 4k, 8k, 16k, 32k or 64k,
in the same filesystem. The data block size is different again from the
directory block size and the inode block size.
This is where the variable page cache wins hands down. I don't need
to care what page size someone built their kernel with, the file
system can be moved between different page size kernels and *still work*.
> But for your _own_ usage, the big box with lots of ram and where a
> blocksize of 4k is a blocker, my design should be much better because
> it'll give you many more advantages on the CPU side too (the only
> downside is the higher complexity in the pte manipulations).
FWIW, I don't really care all that much about huge HPC machines. Most
of the systems I deal with are 4-8 socket machines with tens to hundreds of
TB of disk. i.e. small CPU count, relatively small memory (64-128GB RAM)
but really large storage subsystems.
I need really large filesystems that contain both small and large files to
work more efficiently on small boxes where we can't throw endless amounts of
RAM and CPUs at the problem. Hence things like 64k page size are just not an
option because of the wastage that it entails.
> Think: even if you ended up mounting xfs with a 64k blocksize on a
> kernel with a 16k PAGE_SIZE, that's still going to be a smaller
> fragmentation risk than using a 64k blocksize on a kernel with a 4k
> PAGE_SIZE; the risk of failing defrag because of alloc_page() = 4k is
> much higher than if the undefragmentable alloc_page returns a 16k
> page. The CPU cost of defrag itself will be diminished by a factor of
> 4 too.
>
> > e.g. I was recently asked what the downsides of moving from a 16k
> > page to a 64k page size would be - the back-of-the-envelope
> > calculations I did for a cached kernel tree showed its footprint
> > increased from about 300MB to ~1.2GB of RAM because 80% of the files
> > in the kernel tree I looked at were smaller than 16k and all that
> > happened is we wasted much more memory on those files. That's not
> > what you want for your desktop, yet we would like 32-64k pages for
> > the DVD drives.
>
> The equivalent waste will happen on disk if you raise the blocksize to
> 64k. The same waste will happen as well if you mounted the filesystem
> with the cached kernel tree using a variable order page size of 64k.
See, that's where variable page cache is so good - I don't need to
move everything to 64k block size. We can *easily* do variable
data block size in the filesystem because it's all extent based,
so this really isn't an issue for us on disk. Just changing the
base page size doesn't give us the flexibility we need in memory
to do this....
> > The point that seems to be ignored is that this is not a "one size
> > fits all" type of problem. This is why the variable page cache may
> > be a better solution if the fragmentation issues can be solved.
> > They've been solved before, so I don't see why they can't be solved
> > again.
>
> You guys need to explain to me how you solved the defrag issue if you
> can't defrag the return value of alloc_page(GFP_KERNEL) = 4k.
Me? I'm just a filesystems weenie, not a vm guru. I don't care about
academic mathematical proof for a defrag algorithm - I just want
something that works. It's the "something that works" that has been
done before....
i.e. I'm not wedded to large pages in the page cache - what I
really, really want is an efficient variable order page cache that
doesn't have any vmap overhead. I don't really care how it is
implemented, but changing the base page size doesn't meet the
"efficiency" or "flexibility" requirement I have.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-12 0:12 ` David Chinner
@ 2007-07-12 11:14 ` Andrea Arcangeli
2007-07-12 14:44 ` David Chinner
0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-12 11:14 UTC (permalink / raw)
To: David Chinner; +Cc: linux-kernel
On Thu, Jul 12, 2007 at 10:12:56AM +1000, David Chinner wrote:
> I need really large filesystems that contain both small and large files to
> work more efficiently on small boxes where we can't throw endless amounts of
> RAM and CPUs at the problem. Hence things like 64k page size are just not an
> option because of the wastage that it entails.
I didn't know you were allocating 4k pages for the small files and 64k
for the large ones in the same fs. That sounds like quite a bit of
overkill. So it seems all you really need is to reduce the length of
the sg list? Otherwise you could do the above fine without order > 0
+ pte changes and memcpy in the defrag code. Given the amount of cpu
you throw at the problem of deciding 4k or 64k pages and the defrag,
and all the complexity involved in handling mixed page-cache sizes per
inode, I doubt the cpu saving of the larger page size matters much to
you. Probably the main thing you can measure is your storage subsystem
being too slow if the DMA isn't physically contiguous, hence the need
for those larger pages when you do I/O on the big files.
I still think you should run those systems with PAGE_SIZE 64k even if
it'll waste you more memory on the small files.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-12 11:14 ` Andrea Arcangeli
@ 2007-07-12 14:44 ` David Chinner
2007-07-12 16:31 ` Andrea Arcangeli
0 siblings, 1 reply; 34+ messages in thread
From: David Chinner @ 2007-07-12 14:44 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David Chinner, linux-kernel
On Thu, Jul 12, 2007 at 01:14:36PM +0200, Andrea Arcangeli wrote:
> On Thu, Jul 12, 2007 at 10:12:56AM +1000, David Chinner wrote:
> > I need really large filesystems that contain both small and large files to
> > work more efficiently on small boxes where we can't throw endless amounts of
> > RAM and CPUs at the problem. Hence things like 64k page size are just not an
> > option because of the wastage that it entails.
>
> I didn't know you were allocating 4k pages for the small files and 64k
> for the large ones in the same fs. That sounds like quite a bit of
> overkill.
We already have rudimentary multi-block size support via the
per-inode extent size hint, but we still cache based on the
filesystem block size ('coz we can't increase it).
All I want is to be able to change the index granularity in the page
cache with minimal impact and everything in XFS falls almost
straight out in a pretty optimal manner.
> I still think you should run those systems with PAGE_SIZE 64k even if
> it'll waste you more memory on the small files.
That's crap. Just because a machine has lots of memory does not
make it OK to waste lots of memory.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-12 14:44 ` David Chinner
@ 2007-07-12 16:31 ` Andrea Arcangeli
2007-07-12 16:34 ` Dave Hansen
0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-12 16:31 UTC (permalink / raw)
To: David Chinner; +Cc: linux-kernel
On Fri, Jul 13, 2007 at 12:44:49AM +1000, David Chinner wrote:
> That's crap. Just because a machine has lots of memory does not
> make it OK to waste lots of memory.
It's not just wasted, it lowers overhead all over the place. Yes, the
benefit of wasting less pagecache may largely outweigh the benefit of
having a larger page size, but if you've a lot of memory perhaps your
working set already fits in the cache, or perhaps you don't fit in the
cache regardless of the page size.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-12 16:31 ` Andrea Arcangeli
@ 2007-07-12 16:34 ` Dave Hansen
2007-07-13 7:13 ` David Chinner
0 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2007-07-12 16:34 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David Chinner, linux-kernel, David Kleikamp
On Thu, 2007-07-12 at 18:31 +0200, Andrea Arcangeli wrote:
> On Fri, Jul 13, 2007 at 12:44:49AM +1000, David Chinner wrote:
> > That's crap. Just because a machine has lots of memory does not
> > make it OK to waste lots of memory.
>
> It's not just wasted, it lowers overhead all over the place. Yes, the
> benefit of wasting less pagecache may largely outweigh the benefit of
> having a larger page size, but if you've a lot of memory perhaps your
> working set already fits in the cache, or perhaps you don't fit in the
> cache regardless of the page size.
Have you guys seen Shaggy's page cache tails?
http://kernel.org/pub/linux/kernel/people/shaggy/OLS-2006/kleikamp.pdf
We've had the same memory waste issue on ppc64 with 64k hardware
pages.
-- Dave
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-06 22:26 RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE) Andrea Arcangeli
` (5 preceding siblings ...)
2007-07-08 23:20 ` David Chinner
@ 2007-07-12 17:53 ` Matt Mackall
2007-07-13 1:06 ` Andrea Arcangeli
6 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2007-07-12 17:53 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
On Sat, Jul 07, 2007 at 12:26:51AM +0200, Andrea Arcangeli wrote:
> The original idea of having a software page size larger than a
> hardware page size, originated at SUSE by myself and Andi Kleen while
> helping AMD to design their amd64 cpu,
Original? This was done on VAXen and in Mach ages ago.
On Linux, there've already been two implementations, one by Hugh
Dickins and an expanded version by Bill Irwin (presented at OLS in
2003).
Bill's patch was notable for going to heroic efforts to maintain
binary compatibility, basically separating the userspace notion of the
ABI's page size from the kernel's. How does your version fare here?
--
Mathematics is the supreme nostalgia of our time.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-12 17:53 ` Matt Mackall
@ 2007-07-13 1:06 ` Andrea Arcangeli
0 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-13 1:06 UTC (permalink / raw)
To: Matt Mackall; +Cc: linux-kernel
On Thu, Jul 12, 2007 at 12:53:09PM -0500, Matt Mackall wrote:
> On Sat, Jul 07, 2007 at 12:26:51AM +0200, Andrea Arcangeli wrote:
> > The original idea of having a software page size larger than a
> > hardware page size, originated at SUSE by myself and Andi Kleen while
> > helping AMD to design their amd64 cpu,
>
> Original? This was done on VAXen and in Mach ages ago.
I know nothing about that ancient stuff, but I'll have to trust you on
this.
> On Linux, there've already been two implementations, one by Hugh
> Dickins and an expanded version by Bill Irwin (presented at OLS in
> 2003).
>
> Bill's patch was notable for going to heroic efforts to maintain
> binary compatibility, basically separating the userspace notion of the
> ABI's page size from the kernel's. How does your version fare here?
The events I referred to happened well before Bill's effort.
I admit I also started having some doubt about the correctness of my
above statement when I read Hugh's patch dated Jul 2001 in the last
few days, I'm now uncertain when Hugh's effort started, probably many
months before he published his code.
Overall I definitely shot myself in the foot ;), because 1) I don't
actually care that much about the attribution of the idea (I wrote it
only as a side note), and 2) we're in open source intellectual property
destruction land anyway so it doesn't matter who had the
idea. Apologies.
The way I felt while writing that side note was that the time had
come to do what we had planned a long time ago; I simply didn't care
too much if VAXen or other ancient OSes had it working before, sorry.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-12 16:34 ` Dave Hansen
@ 2007-07-13 7:13 ` David Chinner
2007-07-13 14:08 ` Dave Kleikamp
2007-07-13 14:31 ` Andrea Arcangeli
0 siblings, 2 replies; 34+ messages in thread
From: David Chinner @ 2007-07-13 7:13 UTC (permalink / raw)
To: Dave Hansen; +Cc: Andrea Arcangeli, David Chinner, linux-kernel, David Kleikamp
On Thu, Jul 12, 2007 at 09:34:57AM -0700, Dave Hansen wrote:
> On Thu, 2007-07-12 at 18:31 +0200, Andrea Arcangeli wrote:
> > On Fri, Jul 13, 2007 at 12:44:49AM +1000, David Chinner wrote:
> > > That's crap. Just because a machine has lots of memory does not
> > > make it OK to waste lots of memory.
> >
> > It's not just wasted, it lowers overhead all over the place. Yes, the
> > benefit of wasting less pagecache may largely outweigh the benefit of
> > having a larger page size, but if you've a lot of memory perhaps your
> > working set already fits in the cache, or perhaps you don't fit in the
> > cache regardless of the page size.
>
> Have you guys seen Shaggy's page cache tails?
>
> http://kernel.org/pub/linux/kernel/people/shaggy/OLS-2006/kleikamp.pdf
>
> We've had the same memory waste issue on ppc64 with 64k hardware
> pages.
Sure. Fundamentally, though, I think it is the wrong approach to
take - it's a workaround for a big negative side effect of
increasing page size. It introduces lots of complexity and
difficult-to-test corner cases; judging by the tail packing problems
reiser3 has had over the years, it has the potential to be a
never-ending source of data corruption bugs.
I think that fine granularity and aggregation for efficiency of
scale is a better model to use than increasing the base page size.
With PPC, you can handle different page sizes in the hardware (like
MIPS) and the use of 64k base page size is an obvious workaround to
the problem of not being able to use multiple page sizes within the
OS.
Adding a workaround (tail packing) to address the negative side
effects of another workaround (64k base page size) ignores the basic
problem that has led to both these things being done: Linux does not
support multiple page sizes natively.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-13 7:13 ` David Chinner
@ 2007-07-13 14:08 ` Dave Kleikamp
2007-07-13 14:31 ` Andrea Arcangeli
1 sibling, 0 replies; 34+ messages in thread
From: Dave Kleikamp @ 2007-07-13 14:08 UTC (permalink / raw)
To: David Chinner; +Cc: haveblue, Andrea Arcangeli, linux-kernel
On Fri, 2007-07-13 at 17:13 +1000, David Chinner wrote:
> On Thu, Jul 12, 2007 at 09:34:57AM -0700, Dave Hansen wrote:
> > On Thu, 2007-07-12 at 18:31 +0200, Andrea Arcangeli wrote:
> > > On Fri, Jul 13, 2007 at 12:44:49AM +1000, David Chinner wrote:
> > > > That's crap. Just because a machine has lots of memory does not
> > > > make it OK to waste lots of memory.
> > >
> > > It's not just wasted, it lowers overhead all over the place. Yes, the
> > > benefit of wasting less pagecache may largely outweigh the benefit of
> > > having a larger page size, but if you've a lot of memory perhaps your
> > > working set already fits in the cache, or perhaps you don't fit in the
> > > cache regardless of the page size.
> >
> > Have you guys seen Shaggy's page cache tails?
> >
> > http://kernel.org/pub/linux/kernel/people/shaggy/OLS-2006/kleikamp.pdf
> >
> > We've had the same memory waste issue on ppc64 with 64k hardware
> > pages.
>
> Sure. Fundamentally, though, I think it is the wrong approach to
> take - it's a workaround for a big negative side effect of
> increasing page size. It introduces lots of complexity and
> difficult-to-test corner cases; judging by the tail packing problems
> reiser3 has had over the years, it has the potential to be a
> never-ending source of data corruption bugs.
Yeah, I'm not real happy right now with the complexity of my patches. I
had some hope that Christoph's variable page cache cleanups would
simplify some of it, but that doesn't really help. I'm working on it
though.
> I think that fine granularity and aggregation for efficiency of
> scale is a better model to use than increasing the base page size.
> With PPC, you can handle different page sizes in the hardware (like
> MIPS) and the use of 64k base page size is an obvious workaround to
> the problem of not being able to use multiple page sizes within the
> OS.
>
> Adding a workaround (tail packing) to address the negative side
> effects of another workaround (64k base page size) ignores the basic
> problem that has led to both these things being done: Linux does not
> support multiple page sizes natively.....
I'd much prefer having support for multiple page sizes. I have to admit
that I don't know the VM well enough to weigh in on that debate.
Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-13 7:13 ` David Chinner
2007-07-13 14:08 ` Dave Kleikamp
@ 2007-07-13 14:31 ` Andrea Arcangeli
2007-07-16 0:27 ` David Chinner
1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-13 14:31 UTC (permalink / raw)
To: David Chinner; +Cc: Dave Hansen, linux-kernel, David Kleikamp
On Fri, Jul 13, 2007 at 05:13:08PM +1000, David Chinner wrote:
> Sure. Fundamentally, though, I think it is the wrong approach to
> take - it's a workaround for a big negative side effect of
> increasing page size. It introduces lots of complexity and
> difficult-to-test corner cases; judging by the tail packing problems
> reiser3 has had over the years, it has the potential to be a
> never-ending source of data corruption bugs.
>
> I think that fine granularity and aggregation for efficiency of
> scale is a better model to use than increasing the base page size.
> With PPC, you can handle different page sizes in the hardware (like
> MIPS) and the use of 64k base page size is an obvious workaround to
> the problem of not being able to use multiple page sizes within the
> OS.
I think you're being too fs centric. Moving only the pagecache to a
large order is enough for you but it isn't enough for me, I'd like all
allocations to be faster, and I'd like to reduce the page fault
rate. The CONFIG_PAGE_SHIFT isn't just about I/O. It's just that
CONFIG_PAGE_SHIFT will give you the I/O side for free too.
Also keep in mind mixing multiple page sizes for different inodes has
the potential to screw the aging algorithms in the reclaim code. Just
to give an example, during truly random I/O over all bits of hot cache
in pagecache, a 64k page has 16 times more probability of being marked
young than a 4k page.
The tail packing of pagecache could very well be worth it. It should
cost nothing for the large files.
> Adding a workaround (tail packing) to address the negative side
> effects of another workaround (64k base page size) ignores the basic
> problem that has led to both these things being done: Linux does not
> support multiple page sizes natively.....
I understand you mean multiple page size in pagecache, but I see it as
a feature to keep the fast paths as fast as they can be.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-13 14:31 ` Andrea Arcangeli
@ 2007-07-16 0:27 ` David Chinner
0 siblings, 0 replies; 34+ messages in thread
From: David Chinner @ 2007-07-16 0:27 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: David Chinner, Dave Hansen, linux-kernel, David Kleikamp
On Fri, Jul 13, 2007 at 04:31:09PM +0200, Andrea Arcangeli wrote:
> On Fri, Jul 13, 2007 at 05:13:08PM +1000, David Chinner wrote:
> > Sure. Fundamentally, though, I think it is the wrong approach to
> > take - it's a workaround for a big negative side effect of
> > increasing page size. It introduces lots of complexity and
> > difficult-to-test corner cases; judging by the tail packing problems
> > reiser3 has had over the years, it has the potential to be a
> > never-ending source of data corruption bugs.
> >
> > I think that fine granularity and aggregation for efficiency of
> > scale is a better model to use than increasing the base page size.
> > With PPC, you can handle different page sizes in the hardware (like
> > MIPS) and the use of 64k base page size is an obvious workaround to
> > the problem of not being able to use multiple page sizes within the
> > OS.
>
> I think you're being too fs centric. Moving only the pagecache to a
> large order is enough for you but it isn't enough for me, I'd like all
> allocations to be faster, and I'd like to reduce the page fault
> rate.
Right, and that is done on other operating systems by supporting
multiple hardware page sizes and telling the relevant applications to
use larger pages (e.g. via cpuset configuration).
> The CONFIG_PAGE_SHIFT isn't just about I/O. It's just that
> CONFIG_PAGE_SHIFT will give you the I/O side for free too.
It's not for free, and that's one of the points I've been trying
to make.
> Also keep in mind mixing multiple page sizes for different inodes has
> the potential to screw the aging algorithms in the reclaim code. Just
> to give an example, during truly random I/O over all bits of hot cache
> in pagecache, a 64k page has 16 times more probability of being marked
> young than a 4k page.
Sure, but if a page is being hit repeatedly - regardless of its
size - then you want to keep it around....
> The tail packing of pagecache could very well be worth it. It should
> cost nothing for the large files.
As I've said before - I'm not just concerned with large files - I'm
also concerned about large numbers of files (hundreds of millions to
billions in a filesystem) and the scalability issues involved with
them. IOWs, I'm looking at metadata scalability as much as data
scalability.
It's flexibility that I need from the VM, not pure VM efficiency.
Shifting the base page size is not an efficient solution to the
different aspects of filesystem scalability. We've got to deal with
both ends of the spectrum simultaneously on the one machine in the
same filesystem and it's only going to get worse in the future.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-06 23:52 ` Andrea Arcangeli
@ 2007-07-17 17:47 ` William Lee Irwin III
2007-07-17 19:33 ` Andrea Arcangeli
0 siblings, 1 reply; 34+ messages in thread
From: William Lee Irwin III @ 2007-07-17 17:47 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Dave Hansen, linux-kernel
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> BTW, in a parallel thread (the thread where I've been suggested to
> post this), Rik rightfully mentioned Bill once also tried to get this
> working and basically asked for the differences. I don't know exactly
> what Bill did, I only remember well the major reason he did it. Below
> I add some more comment on the Bill, taken from my answer to Rik:
I got it working. It merely bitrotted faster than I could maintain it.
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> Right, I almost forgot he also tried enlarging the PAGE_SIZE at some
> point, back then it was for the 32bit systems with 64G of ram, to
> reduce the mem_map array, something my patch achieves too btw.
It was done for the occasion of the first publicly-announced boot of
Linux on a 64GB x86-32 machine.
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> I thought his approach was of the old type, not backwards compatible,
> the one we also thought for amd64, and I seem to remember he was
> trying to solve the backwards compatibility issue without much
> success.
It was not of the old type. It followed Hugh's strategy, which made
it fully backward-compatible. The only deficits in terms of success
were performance, maintenance, and attracting any sort of audience.
The only tester besides myself was literally Zwane.
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> But really I'm unsure how Bill could achieve anything backwards
> compatible back then without anon-vma... anon-vma is the enabler. I
> remember he worked on enlarging the PAGE_SIZE back then, but I don't
> recall him exposing HARD_PAGE_SIZE to the common code either (actually
> I never seen his code so I can't be sure of this). Even if he had pte
> chains back then, reaching the pte wasn't enough and I doubt he could
> unwalk the pagetable tree from pte up to pmd up to pgd/mm, up to vma
> to read the vm_pgoff that btw was meaningless back then for the anon
> vmas ;).
It was exposed to the common code as MMUPAGE_SIZE. Significant pte
vectoring code in the core was involved, as well as partial page
distribution policies, mmap()/mprotect() et al handling splitting
across physical page boundaries, and the like. When done wrong,
applications such as /sbin/init didn't run. It was all there, though
Hugh's earlier implementation was far superior.
pte_chains didn't make things anywhere near as awkward as highpte.
pte_chains didn't really care so much how large an area a struct
page tracked. highpte OTOH needed more effort, though I don't recall
specifically why anymore.
My long-dead code should be at:
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/pgcl/
dmesg's from 64GB x86-32 machines are also in that directory, dating
from March 2003.
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> Things are very complex, but I think it's possible by doing proper
> math on vm_pgoff, vm_start/vm_end and address, just with that 4 things
> we should have enough info to know which parts of each page to map in
> which pte, and that's all we need to solve it. At the second mprotect
> of 4k over the same 8k page will get two vmas queued in the same
> anon-vma. So we check both vmas and looking at the vm_pgoff(hardpage
> units)+(((address-vm_start)&~PAGE_MASK)>>HARD_PAGE_SHIFT we should be
> able to tell if the ptes behind the vma need to be updated and if the
> second vma can be merged back.
> The idea to make it work is to synchronously map all the ptes for all
> indexes covered by each page as long as they're in the range
> vm_start>>HARD_PAGE_SHIFT to vm_end >> HARD_PAGE_SHIFT. We should
> threat a page fault like a multiple page fault. Then when you mprotect
> or mremap you already know which ptes are mapped and that you need to
> unmap/update by looking the start/end hard-page-indexes, and you also
> have to always check all vmas that could possibly map that page, if
> the page cross the vm_start/vm_end boundary.
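The index arithmetic in the quoted paragraph can be sketched in C as follows; the function name and the 16k-soft-page/4k-hard-page split are hypothetical, chosen only to make the quoted formula concrete:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes: a 16k software PAGE_SIZE built from 4k hardware pages. */
#define HARD_PAGE_SHIFT 12UL
#define SOFT_PAGE_SHIFT 14UL
#define SOFT_PAGE_MASK  (~((1UL << SOFT_PAGE_SHIFT) - 1))

/*
 * Which hard-page slice of the software page should back the pte at
 * 'address'?  This is the quoted expression
 *   vm_pgoff + (((address - vm_start) & ~PAGE_MASK) >> HARD_PAGE_SHIFT)
 * with vm_pgoff in hard-page units: the offset inside the software
 * page, counted in 4k steps, biased by the vma's starting offset.
 */
static unsigned long subpage_index(unsigned long vm_pgoff,
                                   unsigned long vm_start,
                                   unsigned long address)
{
	return vm_pgoff +
	       (((address - vm_start) & ~SOFT_PAGE_MASK) >> HARD_PAGE_SHIFT);
}
```

With vm_start = 0x10000 and vm_pgoff = 1, the pte for address 0x13000 would map slice index 4 of the backing hard pages.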
Hugh had this all worked out in 2001. I explored some alternatives in
the design space, but they didn't perform as well as the original.
It's best to refer to his original patch for reference as it's far
cleaner, though in principle one should be able to find machines where
the late 2.5.x patches I did will run. It was never exposed to a very
broad variety of systems, so I can't vouch for much beyond NUMA-Q and
ThinkPad and whatever Zwane booted it on.
On Sat, Jul 07, 2007 at 01:52:28AM +0200, Andrea Arcangeli wrote:
> Easy definitely not, but feasible I hope yes because I couldn't think
> of a case where we can't figure out which part of the page to map in
> which pte. I wish I had it implemented before posting because then I
> would be 100% sure it was feasible ;).
> Now if somebody here can think of a case where we can't know where to
> map which part of the page in which pte, then *that* would be very
> interesting and it could save some wasted development effort. Unless
> this happens, I guess I can keep trying to make it work, hopefully now
> with some help.
You may rest assured that it's technically feasible. It's been done.
The larger obstacles to all this are nontechnical.
-- wli
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-17 17:47 ` William Lee Irwin III
@ 2007-07-17 19:33 ` Andrea Arcangeli
2007-07-18 13:32 ` William Lee Irwin III
0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-17 19:33 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Dave Hansen, linux-kernel
On Tue, Jul 17, 2007 at 10:47:37AM -0700, William Lee Irwin III wrote:
> You may rest assured that it's technically feasible. It's been done.
> The larger obstacles to all this are nontechnical.
Back then there was no variable order page size proposal, no slub,
generally nothing of that kind.
I think these days it's worth getting it working again and solving the
technical obstacles one more time. Then we should plug into it a
pagecache logic to handle small files. That means if the soft page
size is 64k, we should kmalloc 32k of pagecache if the file is < 64k
but >= 32k, or kmalloc 16k if the file is < 32k but >= 16k, etc...
Down to 32 bytes, if we memcpy the 32 bytes away to a 64k page and
disable the logic the moment somebody attempts to mmap the "kmalloced"
pagecache (which I think is a lot simpler than trying to mmap a
kmalloced 4k naturally aligned object into userland). I wouldn't call
it tail packing; it's more a fine-granular pagecache with the already
available kmalloc granularities. That will maximize pagecache
utilization with the read syscall for hg/git compared to current 2.6.22,
plus memory will be allocated faster in 64k chunks etc... Ideally it
should be possible to disable the finer-granular-kmalloc-pagecache on
the big irons with lots of memory that only work with big files.
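A minimal sketch of that size-class selection, reading the proposal as "allocate the smallest available kmalloc size that holds the whole file" (the names and exact bucket boundaries are assumptions, not code from the proposal):

```c
#include <assert.h>
#include <stddef.h>

#define SOFT_PAGE_SIZE (64UL * 1024)  /* assumed 64k software page */
#define MIN_FRAG_SIZE  32UL           /* smallest kmalloc granularity */

/*
 * Pick the smallest power-of-two kmalloc size that can hold a file of
 * 'isize' bytes, from 32 bytes up to half a soft page.  Returns 0 when
 * the file needs a real 64k pagecache page instead of a fragment.
 */
static size_t pagecache_frag_size(size_t isize)
{
	size_t sz = MIN_FRAG_SIZE;

	while (sz < isize)
		sz <<= 1;
	return sz < SOFT_PAGE_SIZE ? sz : 0;
}
```

So a 20000 byte file would sit in a 32k kmalloc fragment, while anything over 32k would fall through to a full 64k page.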
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-17 19:33 ` Andrea Arcangeli
@ 2007-07-18 13:32 ` William Lee Irwin III
2007-07-18 16:34 ` Rene Herman
2007-07-24 19:44 ` Andrea Arcangeli
0 siblings, 2 replies; 34+ messages in thread
From: William Lee Irwin III @ 2007-07-18 13:32 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Dave Hansen, linux-kernel
On Tue, Jul 17, 2007 at 10:47:37AM -0700, William Lee Irwin III wrote:
>> You may rest assured that it's technically feasible. It's been done.
>> The larger obstacles to all this are nontechnical.
On Tue, Jul 17, 2007 at 09:33:08PM +0200, Andrea Arcangeli wrote:
> Back then there was no variable order page size proposal, no slub,
> generally nothing of that kind.
> I think these days it worth to get it working again and solve the
> technical obstacles once more time. Then we should plug into it a
> pagecache logic to handle small files. That means if the soft page
> size is 64k, we should kmalloc 32k of pagecache if the file is < 64k
> but >= 32k, or kmalloc 16k if the file is < 32k but >= 16k, etc...
Actually I'd worked on what was called MPSS (Multiple Page Size Support)
before I ever started on pgcl. Some large portion of the pgcl proposal
as I presented it internally was to reduce the order of large page
allocations and provide a promotion and demotion mechanism enabling
different processes to have different sized translations for the same
large page, and hence no out-of-context pagetable/TLB updates during
promotion and demotion, essentially by making the TLB translation to
page relation M:N. ISTR describing this in a KS presentation for which
IIRC you were present. But that's neither here nor there.
On Tue, Jul 17, 2007 at 09:33:08PM +0200, Andrea Arcangeli wrote:
> Down to 32bytes if we memcpy the 32bytes away to a 64k page, and we
> disable the logic the moment somebody attempts to mmap the "kmalloced"
> pagecache (which I think it's a lot simpler than trying to mmap a
> kmalloced 4k naturally aligned object into userland). I wouldn't call
> it tail packing, it's more a fine-granular pagecache with the already
> available kmalloc granularities. That will maximize pagecache
> utilization with read syscall for hg/git compared to current 2.6.22
> plus memory will be allocated faster in 64k chunks etc... Ideally it
> should be possible to disable the finer-granular-kmalloc-pagecache on
> the big irons with lots of memory and only working with big files.
In any event, that is a sound strategy for mitigating internal
fragmentation of pagecache, though internal fragmentation of anonymous
memory has more severe consequences and is less easily mitigated.
-- wli
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-18 13:32 ` William Lee Irwin III
@ 2007-07-18 16:34 ` Rene Herman
2007-07-18 23:50 ` Andrea Arcangeli
2007-07-24 19:44 ` Andrea Arcangeli
1 sibling, 1 reply; 34+ messages in thread
From: Rene Herman @ 2007-07-18 16:34 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Andrea Arcangeli, Dave Hansen, linux-kernel, Dave Kleikamp
On 07/18/2007 03:32 PM, William Lee Irwin III wrote:
> On Tue, Jul 17, 2007 at 09:33:08PM +0200, Andrea Arcangeli wrote:
>> kmalloced 4k naturally aligned object into userland). I wouldn't call
>> it tail packing, it's more a fine-granular pagecache with the already
>> available kmalloc granularities. That will maximize pagecache
>> utilization with read syscall for hg/git compared to current 2.6.22
>> plus memory will be allocated faster in 64k chunks etc... Ideally it
>> should be possible to disable the finer-granular-kmalloc-pagecache on
>> the big irons with lots of memory and only working with big files.
>
> In any event, that is a sound strategy for mitigating internal
> fragmentation of pagecache, though internal fragmentation of anonymous
> memory has more severe consequences and is less easily mitigated.
I suppose low/highmem is an issue on x86? I was reading the tail packing
paper Dave posted a link to earlier:
http://kernel.org/pub/linux/kernel/people/shaggy/OLS-2006/kleikamp.pdf
It says that highmem is not an issue due to no such thing as highmem even
existing on the machines with support for larger hard pagesizes, but this
wouldn't hold for soft pages. Sort of went "damn" in an x86 context upon
reading that.
Rene.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-18 16:34 ` Rene Herman
@ 2007-07-18 23:50 ` Andrea Arcangeli
2007-07-19 0:53 ` Rene Herman
0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-18 23:50 UTC (permalink / raw)
To: Rene Herman
Cc: William Lee Irwin III, Dave Hansen, linux-kernel, Dave Kleikamp
On Wed, Jul 18, 2007 at 06:34:20PM +0200, Rene Herman wrote:
> It says that highmem is not an issue due to no such thing as highmem even
> existing on the machines with support for larger hard pagesizes, but this
> wouldn't hold for soft pages. Sort of went "damn" in an x86 context upon
> reading that.
Correct, but I'm not really sure it's worth worrying about x86
missing this; furthermore, it would still be possible to enable it on
the very low-end x86 systems (with the regular 4k page size) that may
care about using every last byte of RAM as cache for tiny files. To me
using kmalloc for this looks quite ideal.
Thanks.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-18 23:50 ` Andrea Arcangeli
@ 2007-07-19 0:53 ` Rene Herman
0 siblings, 0 replies; 34+ messages in thread
From: Rene Herman @ 2007-07-19 0:53 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: William Lee Irwin III, Dave Hansen, linux-kernel, Dave Kleikamp
On 07/19/2007 01:50 AM, Andrea Arcangeli wrote:
> On Wed, Jul 18, 2007 at 06:34:20PM +0200, Rene Herman wrote:
>> It says that highmem is not an issue due to no such thing as highmem even
>> existing on the machines with support for larger hard pagesizes, but this
>> wouldn't hold for soft pages. Sort of went "damn" in an x86 context upon
>> reading that.
>
> Correct, but I'm not really sure if it worth worrying about x86
> missing this
Larger softpages would nicely solve the "1-page stacks are sometimes small"
issue with 4KSTACKS on x86 that was discussed in another thread just now but
without tail packing, the pagecache slack would be too high a price to pay
given that loads that would actually benefit from it most definitely have
moved to 64-bit (although I'd certainly still want to try 8K as well, and
filesystems with larger blocksizes could be nice as well).
> furthermore it would still be possible to enable it on the very x86 low
> end (with regular 4k page size) that may worry to use up to the last byte
> of ram as cache for tiny files.
But, yes, that's true, and I wonder if !HIGHMEM x86 will in fact be "very
low end" for long considering x86-64 is now _really_ here. Many people who
want enough memory to need highmem have probably already made the switch,
and in the embedded world, 896M (or 1G, or 2G with an adjusted split) is
still decidedly non-low end. Yet a PVR, say, could love 64K pages for VM and
disk...
> To me using kmalloc for this looks quite ideal.
Certainly simplest...
Rene.
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-18 13:32 ` William Lee Irwin III
2007-07-18 16:34 ` Rene Herman
@ 2007-07-24 19:44 ` Andrea Arcangeli
2007-07-25 3:20 ` William Lee Irwin III
1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-24 19:44 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Dave Hansen, linux-kernel
On Wed, Jul 18, 2007 at 06:32:22AM -0700, William Lee Irwin III wrote:
> Actually I'd worked on what was called MPSS (Multiple Page Size Support)
> before I ever started on pgcl. Some large portion of the pgcl proposal
> as I presented it internally was to reduce the order of large page
> allocations and provide a promotion and demotion mechanism enabling
> different processes to have different sized translations for the same
> large page, and hence no out-of-context pagetable/TLB updates during
> promotion and demotion, essentially by making the TLB translation to
> page relation M:N. ISTR describing this in a KS presentation for which
> IIRC you were present. But that's neither here nor there.
Well, the whole difference between you back then and SGI now is that
your stuff wasn't being pushed very hard to be merged (it was proposed,
but IIRC more as a research topic, with the large PAGE_SIZE also falling
into that same research area). Now see the emails from the SGI fs folks
about variable order page size: they badly want it merged instead.
My whole point is that the moment variable order page size isn't pure
research anymore like MPSS, CONFIG_PAGE_SHIFT isn't research anymore
either, just as tail packing in the pagecache with kmalloc also isn't
research anymore.
About the fs deciding the size of the pagecache granularity: I totally
dislike that design. There's no reason why the fs should control that;
whatever clever algorithm decides which pagecache granularity to use
should live outside fs/xfs. I like the pagecache layer to be in charge
of everything. The fs should stay a simple remapper between logical
inode offset and physical disk offset. That can take raid or other
stuff into account; that's still a logical->raid->physical translation,
but the high-level "brainy" intelligence of deciding which
granularity the pagecache should use would better live in the
pagecache/vfs layer to benefit everyone. And anyway I prefer to keep
the PAGE_SIZE big and allocate fragments for small files with kmalloc,
down to 32 byte granularity, and memcpy them away if you mmap the
file. After the first time we move from a kmalloc fragment to real
PAGE_SIZE pagecache, we set a bitflag in the inode somewhere to be
sure we never use the kmalloc fragment anymore later, even if the page
is evicted from pagecache (inodes may well live longer than pagecache,
so a bitflag is going to be worth it).
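The sticky bitflag idea can be sketched like this (the struct and flag names are hypothetical, not actual kernel identifiers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SOFT_PAGE_SIZE (64UL * 1024)  /* assumed 64k software page */
#define I_NO_FRAG      (1UL << 0)     /* "never use fragments again" */

struct inode_sketch {
	unsigned long flags;
	size_t        i_size;
};

/* Fragments are allowed only for small files that were never promoted. */
static bool can_use_frag(const struct inode_sketch *inode)
{
	return inode->i_size < SOFT_PAGE_SIZE && !(inode->flags & I_NO_FRAG);
}

/*
 * On the first mmap the fragment would be memcpy'd into a real
 * PAGE_SIZE page and the flag set.  The flag lives in the inode, not
 * the page, so it survives pagecache eviction: inodes may well
 * outlive the pagecache pages they back.
 */
static void promote_to_page(struct inode_sketch *inode)
{
	inode->flags |= I_NO_FRAG;
}
```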
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-24 19:44 ` Andrea Arcangeli
@ 2007-07-25 3:20 ` William Lee Irwin III
2007-07-25 14:39 ` Andrea Arcangeli
0 siblings, 1 reply; 34+ messages in thread
From: William Lee Irwin III @ 2007-07-25 3:20 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Dave Hansen, linux-kernel
On Wed, Jul 18, 2007 at 06:32:22AM -0700, William Lee Irwin III wrote:
>> Actually I'd worked on what was called MPSS (Multiple Page Size Support)
>> before I ever started on pgcl. Some large portion of the pgcl proposal
>> as I presented it internally was to reduce the order of large page
>> allocations and provide a promotion and demotion mechanism enabling
>> different processes to have different sized translations for the same
>> large page, and hence no out-of-context pagetable/TLB updates during
>> promotion and demotion, essentially by making the TLB translation to
>> page relation M:N. ISTR describing this in a KS presentation for which
>> IIRC you were present. But that's neither here nor there.
On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> Well the whole difference between you back then and SGI now, is that
> your stuff wasn't being pushed to be merged very hard (it was proposed
> but IIRC more as research topic, like the large PAGE_SIZE also fallen
> into that same research area). See now the emails from SGI fs folks
> about variable order page size, they want it merged badly instead.
Neither were research topics, but I'm tired of correcting the history
of my failures. I've got enough ongoing failures as things stand.
On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> My whole point is that the single moment the variable order page size
> isn't pure research anymore like MPSS, the CONFIG_PAGE_SHIFT isn't
> research anymore either, like the tail packing in pagecache with
> kmalloc also isn't research anymore.
There was never any research involved in the page clustering per se.
It was supposed to be a generally advantageous thing that Linus had
at least once explicitly approved of that just so happened to relieve
mem_map[] pressure on 64GB i386, the side effect intended to attract
corporate patronage.
That last fact was not only demonstrable, it was used in the first
ever public demonstration of a 64GB i386 machine running Linux, which
I personally carried out.
Beyond active hindrances and lacks of cooperation, a "competing
solution" with distro backing appeared that removed the last vestige
of corporate patronage from the project. It ended up bitrotting
faster than I could singlehandedly do all the maintenance, testing,
and coding work on it while also trying to get anything else done.
MPSS was not as well-developed at the time the hugetlb "solution"
killed it, but is not terribly dissimilar in how it came into
being, developed, and then died, apart from less active hindrance.
The one and only aspect in which any research was involved was a
proposal, never accepted or pursued, to investigate how larger
base page sizes implemented via page clustering mitigated external
fragmentation for the purposes of MPSS and also how certain
techniques borrowed from page clustering could reduce the frequency
of and performance penalties associated with demotion in MPSS. The
proposal has never been publicly circulated, though some of its content
was described in the KS presentation as "future directions" or similar.
On Tue, Jul 24, 2007 at 09:44:18PM +0200, Andrea Arcangeli wrote:
> About the fs deciding the size of the pagecache granularity I totally
> dislike that design, there's no reason why the fs should control that,
[...]
This is all valid commentary, though I don't have any particular
response to it.
In any event, I've never been involved in a research project, though
I would've liked to have been. The emphasis in all cases was enabling
specific functionality in production, using techniques whose viability
had furthermore already been demonstrated elsewhere, by others.
In both instances, insurmountable nontechnical obstacles were present,
which remain in place and effectively limit the scale and scope of any
sort of project I can personally lead with any sort of likelihood of
mainline acceptance.
Where I am limited, you are not. Good luck to you.
-- wli
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-25 3:20 ` William Lee Irwin III
@ 2007-07-25 14:39 ` Andrea Arcangeli
2007-07-25 17:56 ` William Lee Irwin III
0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2007-07-25 14:39 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Dave Hansen, linux-kernel
On Tue, Jul 24, 2007 at 08:20:11PM -0700, William Lee Irwin III wrote:
> In any event, I've never been involved in a research project, though
I didn't mean it was supposed to be a research project at some
university. But IIRC it was funded by what is defined as R&D in the
income statement of a public company. That's why I called it that, given
it wasn't incorporated into mainline or forked trees, and it eventually
bitrotted. I didn't mean to disqualify the effort by calling it that.
In fact I'm just saying it is valid now more than ever before,
given the current directions that are being pushed for mainline.
> In both instances, insurmountable nontechnical obstacles were present,
> which remain in place and effectively limit the scale and scope of any
> sort of project I can personally lead with any sort of likelihood of
> mainline acceptance.
>
> Where I am limited, you are not. Good luck to you.
I'm not as sure as you are; I'm not even invited to KS this year, but I
guess that's fair enough punishment for me, given I also spent some
time on other activities that in the long run I hope will become
profitable (this remains to be seen though; there's an Italian saying
that "who wants too much will get nothing" ;). But I'll be at the VM
summit, which to me is probably more important than KS, and I hope to
have some discussion about this stuff there; hope you're there
too. Anyway there's no reason why you shouldn't contribute to
CONFIG_PAGE_SHIFT if you want. I don't really care if it's me doing
it, or you, or Hugh. I stepped in first because of the great idea of
the Hack Week, and second because I care that Linux goes in directions
that benefit everyone, not just a single filesystem running on top of
some scatter-gather-crippled storage that slows to a crawl if
the sg entries are small (which is something CONFIG_PAGE_SHIFT will
address just fine too, while giving other advantages at the same time).
I'm also not against the defrag efforts, but I simply want to reduce
as much as possible the code that requires order > 0 allocations for
strict performance reasons. defrag is far from a free operation; it
even requires memcopies of the bulk data payload, or swapouts.
For the kernel stack btw, when alloc_pages(order=1) fails vmalloc
should be used and 4k stacks can be dropped. Nobody does dma from the
stack anymore these days IIRC (it doesn't work in all archs anyway).
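The suggested fallback could look roughly like this userspace sketch, with malloc standing in for vmalloc() and a stub simulating an order-1 allocation failure; none of these names are real kernel APIs:

```c
#include <assert.h>
#include <stdlib.h>

#define THREAD_SIZE (8UL * 1024)  /* order-1: two contiguous 4k pages */

/* Userspace stand-ins: real code would call alloc_pages()/vmalloc(). */
static void *alloc_pages_order1(void) { return NULL; /* simulated failure */ }
static void *vmalloc_stub(size_t size) { return malloc(size); }

/*
 * Try the physically contiguous order-1 allocation first; when memory
 * is too fragmented, fall back to a merely virtually contiguous stack,
 * so there is never a need to shrink stacks to a single 4k page.
 */
static void *alloc_thread_stack(int *vmalloced)
{
	void *stack = alloc_pages_order1();

	*vmalloced = 0;
	if (!stack) {
		stack = vmalloc_stub(THREAD_SIZE);
		*vmalloced = 1;
	}
	return stack;
}
```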
* Re: RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE)
2007-07-25 14:39 ` Andrea Arcangeli
@ 2007-07-25 17:56 ` William Lee Irwin III
0 siblings, 0 replies; 34+ messages in thread
From: William Lee Irwin III @ 2007-07-25 17:56 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Dave Hansen, linux-kernel
On Wed, Jul 25, 2007 at 04:39:04PM +0200, Andrea Arcangeli wrote:
> For the kernel stack btw, when alloc_pages(order=1) fails vmalloc
> should be used and 4k stacks can be dropped. Nobody does dma from the
> stack anymore these days IIRC (it doesn't work in all archs anyway).
I have recent code for that circulating, albeit intended for debugging
purposes. There's nothing particularly debug-oriented about it, though,
apart from the fact a guard page is automatically set up by vmalloc()
and that the use of vmalloc() is unconditional.
As for the rest, I'm sure there could be a lively conversation, but
not consensus, so I'll let it go.
-- wli
end of thread [~2007-07-25 17:54 UTC | newest]
Thread overview: 34+ messages
2007-07-06 22:26 RFC: CONFIG_PAGE_SHIFT (aka software PAGE_SIZE) Andrea Arcangeli
2007-07-06 23:33 ` Dave Hansen
2007-07-06 23:52 ` Andrea Arcangeli
2007-07-17 17:47 ` William Lee Irwin III
2007-07-17 19:33 ` Andrea Arcangeli
2007-07-18 13:32 ` William Lee Irwin III
2007-07-18 16:34 ` Rene Herman
2007-07-18 23:50 ` Andrea Arcangeli
2007-07-19 0:53 ` Rene Herman
2007-07-24 19:44 ` Andrea Arcangeli
2007-07-25 3:20 ` William Lee Irwin III
2007-07-25 14:39 ` Andrea Arcangeli
2007-07-25 17:56 ` William Lee Irwin III
2007-07-07 1:36 ` Badari Pulavarty
2007-07-07 1:47 ` Badari Pulavarty
2007-07-07 10:12 ` Andrea Arcangeli
2007-07-07 7:01 ` Paul Mackerras
2007-07-07 10:25 ` Andrea Arcangeli
2007-07-07 18:53 ` Jan Engelhardt
2007-07-07 20:34 ` Rik van Riel
2007-07-08 9:52 ` Andrea Arcangeli
2007-07-08 23:20 ` David Chinner
2007-07-10 10:11 ` Andrea Arcangeli
2007-07-12 0:12 ` David Chinner
2007-07-12 11:14 ` Andrea Arcangeli
2007-07-12 14:44 ` David Chinner
2007-07-12 16:31 ` Andrea Arcangeli
2007-07-12 16:34 ` Dave Hansen
2007-07-13 7:13 ` David Chinner
2007-07-13 14:08 ` Dave Kleikamp
2007-07-13 14:31 ` Andrea Arcangeli
2007-07-16 0:27 ` David Chinner
2007-07-12 17:53 ` Matt Mackall
2007-07-13 1:06 ` Andrea Arcangeli