* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
@ 2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:09 ` Linus Torvalds
` (2 more replies)
2007-04-04 16:32 ` Eric Dumazet
` (4 subsequent siblings)
5 siblings, 3 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 15:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt,
Linux Kernel Mailing List
On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> Anyway, I'm not against this, but I can see somebody actually *wanting*
> the ZERO page in some cases. I've used the fact for TLB testing, for
> example, by just doing a big malloc(), and knowing that the kernel will
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
not any *physical* cache effects. Virtually indexed caches will still show
> effects of it, of course, but I haven't cared).
Ok, those cases wanting the same zero page could be fairly easily
converted to an mmap over /dev/zero (without having to run 4k large
mmap syscalls or nonlinear).
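A minimal sketch of that conversion (illustrative only, not from the thread: `map_zeros` is a made-up helper, and whether the untouched pages really share the kernel's zero page depends on how MAP_PRIVATE /dev/zero is implemented on a given kernel):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of zeros with a single mmap() call; returns NULL on
 * failure.  On kernels with a zero page, every untouched page of a
 * MAP_PRIVATE /dev/zero mapping is backed by the one shared ZERO_PAGE
 * until a write faults in a private copy. */
static char *map_zeros(size_t len)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *p;

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);		/* the mapping outlives the descriptor */
	return p == MAP_FAILED ? NULL : p;
}
```

The point is that one syscall covers the whole region, instead of one 4k mmap per page or nonlinear tricks.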
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:48 ` Andrea Arcangeli
@ 2007-04-04 16:09 ` Linus Torvalds
2007-04-04 16:23 ` Andrea Arcangeli
2007-04-04 16:10 ` Hugh Dickins
2007-04-04 22:07 ` Valdis.Kletnieks
2 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2007-04-04 16:09 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt,
Linux Kernel Mailing List
On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
>
> Ok, those cases wanting the same zero page could be fairly easily
> converted to an mmap over /dev/zero (without having to run 4k large
> mmap syscalls or nonlinear).
You're missing the point. What if it's something like oracle that has been
tuned for Linux using this? Or even an open-source app that is just used
by big places and they see performance problems but it's not obvious *why*.
We "know" why, because we're discussing this point. But two months from
now, when some random company complains to SuSE/RH/whatever that their app
runs 5% slower or uses 200% more swap, who is going to realize what caused
it?
THAT is the problem with patches like this. I'm not against it, but you
can't just dismiss it with "we can fix the app". We *cannot* fix the app
if we don't even realize what caused the problem..
Linus
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 16:09 ` Linus Torvalds
@ 2007-04-04 16:23 ` Andrea Arcangeli
0 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt,
Linux Kernel Mailing List
On Wed, Apr 04, 2007 at 09:09:28AM -0700, Linus Torvalds wrote:
> You're missing the point. What if it's something like oracle that has been
> tuned for Linux using this? Or even an open-source app that is just used
> by big places and they see performance problems but it's not obvious *why*.
>
> We "know" why, because we're discussing this point. But two months from
> now, when some random company complains to SuSE/RH/whatever that their app
> runs 5% slower or uses 200% more swap, who is going to realize what caused
> it?
No, I'm not missing the point. I was the first to say here that such
code has been there forever, and in turn I'm worried about apps
depending on it for all the wrong reasons. I even went as far as
asking for a counter so the waste doesn't go unnoticed, and last but
not least, that's why I'm not discussing this as an internal SUSE fix
for the scalability issue, but only as a mainline patch for -mm.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:09 ` Linus Torvalds
@ 2007-04-04 16:10 ` Hugh Dickins
2007-04-04 16:31 ` Andrea Arcangeli
2007-04-04 22:07 ` Valdis.Kletnieks
2 siblings, 1 reply; 40+ messages in thread
From: Hugh Dickins @ 2007-04-04 16:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Nick Piggin, Andrew Morton,
Linux Memory Management List, tee, holt,
Linux Kernel Mailing List
On Wed, 4 Apr 2007, Andrea Arcangeli wrote:
> On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
> > Anyway, I'm not against this, but I can see somebody actually *wanting*
> > the ZERO page in some cases. I've used the fact for TLB testing, for
> > example, by just doing a big malloc(), and knowing that the kernel will
> > re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> > not any *physical* cache effects. Virtually indexed caches will still show
> > effects of it, of course, but I haven't cared).
>
> Ok, those cases wanting the same zero page could be fairly easily
> converted to an mmap over /dev/zero
No, MAP_SHARED mmap of /dev/zero uses shmem, which allocates distinct
pages for this (because in general tmpfs doesn't know if a readonly
file will be written to later on), and MAP_PRIVATE mmap of /dev/zero
uses the zeromap stuff which we were hoping to eliminate too
(though not in Nick's initial patch).
Looks like a job for /dev/same_page_over_and_over_again.
> (without having to run 4k large mmap syscalls or nonlinear).
You scared me, I made no sense of that at first: ah yes,
repeatedly mmap'ing the same page can be done those ways.
Hugh
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 16:10 ` Hugh Dickins
@ 2007-04-04 16:31 ` Andrea Arcangeli
0 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2007-04-04 16:31 UTC (permalink / raw)
To: Hugh Dickins
Cc: Linus Torvalds, Nick Piggin, Andrew Morton,
Linux Memory Management List, tee, holt,
Linux Kernel Mailing List
On Wed, Apr 04, 2007 at 05:10:37PM +0100, Hugh Dickins wrote:
> file will be written to later on), and MAP_PRIVATE mmap of /dev/zero
Obviously I meant MAP_PRIVATE of /dev/zero, since it's the only one
backed by the zero page.
> uses the zeromap stuff which we were hoping to eliminate too
> (though not in Nick's initial patch).
I didn't realize you wanted to eliminate it too.
> Looks like a job for /dev/same_page_over_and_over_again.
>
> > (without having to run 4k large mmap syscalls or nonlinear).
>
> You scared me, I made no sense of that at first: ah yes,
> repeatedly mmap'ing the same page can be done those ways.
Yep, which is probably why we don't need the
/dev/same_page_over_and_over_again for that.
Overall, the worry about TLB benchmarking apps being broken in their
measurements sounds very minor compared to the risk of wasting tons of
RAM and running out of memory. If there were no risk of bad breakage we
wouldn't need to discuss this.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:09 ` Linus Torvalds
2007-04-04 16:10 ` Hugh Dickins
@ 2007-04-04 22:07 ` Valdis.Kletnieks
2 siblings, 0 replies; 40+ messages in thread
From: Valdis.Kletnieks @ 2007-04-04 22:07 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt,
Linux Kernel Mailing List
On Wed, 04 Apr 2007 17:48:39 +0200, Andrea Arcangeli said:
> Ok, those cases wanting the same zero page could be fairly easily
> converted to an mmap over /dev/zero (without having to run 4k large
> mmap syscalls or nonlinear).
"D'oh!" -- H. Simpson.
Ignore my previous note. :)
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
2007-04-04 15:48 ` Andrea Arcangeli
@ 2007-04-04 16:32 ` Eric Dumazet
2007-04-04 17:02 ` Linus Torvalds
2007-04-04 19:15 ` Andrew Morton
` (3 subsequent siblings)
5 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2007-04-04 16:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
On Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Anyway, I'm not against this, but I can see somebody actually *wanting*
> the ZERO page in some cases. I've used the fact for TLB testing, for
> example, by just doing a big malloc(), and knowing that the kernel will
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> not any *physical* cache effects. Virtually indexed caches will still show
> effects of it, of course, but I haven't cared).
>
> That's an example of an app that actually cares about the page allocation
> (or, in this case, the lack there-of). Not an important one, but maybe
> there are important ones that care?
I don't know if this small program is of any interest.
But results on an Intel Pentium-M are interesting, in particular 2) & 3).
If a page is first allocated as the zero page and then COWed to a full rw page, it is more expensive.
(2660 cycles instead of 2300)
Is there an app somewhere that depends on 2) being ultra-fast but can then tolerate *slow* future write accesses?
$ ./page_bench >RES; cat RES
1) pagefault to bring a rw page:
Poke (addr=0x804c000): 2360 cycles
1) pagefault to bring a rw page:
Poke (addr=0x804d000): 2368 cycles
1) pagefault to bring a rw page:
Poke (addr=0x804e000): 2120 cycles
2) pagefault to bring a zero page, readonly
Peek(addr=0x804f000): ->0 891 cycles
3) pagefault to make this page rw
Poke (addr=0x804f000): 2660 cycles
1) pagefault to bring a rw page:
Poke (addr=0x8050000): 2099 cycles
1) pagefault to bring a rw page:
Poke (addr=0x8051000): 2062 cycles
4) memset 4096 bytes to 0x55:
Poke_full (addr=0x804f000, len=4096): 2719 cycles
5) fill the whole table
Poke_full (addr=0x804c000, len=4194304): 6563661 cycles
6) fill again whole table (no more faults, but cpu cache too small)
Poke_full (addr=0x804c000, len=4194304): 5188925 cycles
7.1) faulting a mmap zone, read access
Peek(addr=0xb7f8a000): ->0 40453 cycles
8.1) faulting a mmap zone, write access
Poke (addr=0xb7f89000): 10599 cycles
7.2) faulting a mmap zone, read access
Peek(addr=0xb7f88000): ->0 8167 cycles
8.3) faulting a mmap zone, write access
Poke (addr=0xb7f87000): 5701 cycles
$ cat page_bench.c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <sys/mman.h>
#include <string.h>
#ifdef __x86_64__
#define rdtscll(val) do { \
unsigned int __a,__d; \
asm volatile("rdtsc" : "=a" (__a), "=d" (__d)); \
(val) = ((unsigned long)__a) | (((unsigned long)__d)<<32); \
} while(0)
#elif defined(__i386__)
#define rdtscll(val) \
__asm__ __volatile__("rdtsc" : "=A" (val))
#endif
int var;
int *addr1, *addr2, *addr3, *addr4;
void map_many_vmas(unsigned int nb)
{
size_t sz = getpagesize();
int ui;
for (ui = 0 ; ui < nb ; ui++) {
void *p = mmap(NULL, sz,
(ui == 0) ? PROT_READ : PROT_READ|PROT_WRITE,
(ui & 1) ? MAP_PRIVATE|MAP_ANONYMOUS : MAP_ANONYMOUS|MAP_SHARED, -1, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "Only %u mappings could be set\n", ui);
break;
}
if (!addr1) addr1 = (int *)p;
else if (!addr2) addr2 = (int *)p;
else if (!addr3) addr3 = (int *)p;
else if (!addr4) addr4 = (int *)p;
}
}
void show_maps()
{
char buffer[4096];
int fd, lu;
fd = open("/proc/self/maps", O_RDONLY);
if (fd != -1) {
while ((lu = read(fd, buffer, sizeof(buffer))) > 0)
write(2, buffer, lu);
close(fd);
}
}
void poke_int(void *addr, int val)
{
unsigned long long start, end;
long delta;
rdtscll(start);
*(int *)addr = val;
rdtscll(end);
delta = (end - start);
printf("Poke (addr=%p): %ld cycles\n", addr, delta);
}
void poke_full(void *addr, int val, int len)
{
unsigned long long start, end;
long delta;
rdtscll(start);
memset(addr, val, len);
rdtscll(end);
delta = (end - start);
printf("Poke_full (addr=%p, len=%d): %ld cycles\n", addr, len, delta);
}
int peek_int(void *addr)
{
unsigned long long start, end;
long delta;
int val;
rdtscll(start);
val = *(int *)addr;
rdtscll(end);
delta = (end - start);
printf("Peek(addr=%p): ->%d %ld cycles\n", addr, val, delta);
return val;
}
int big_table[1024*1024] __attribute__((aligned(4096)));
void usage(int code)
{
fprintf(stderr, "Usage : page_bench [-m mappings]\n");
exit(code);
}
int main(int argc, char *argv[])
{
unsigned int nb_mappings = 200;
int c;
while ((c = getopt(argc, argv, "Vm:")) != EOF) {
if (c == 'm')
nb_mappings = atoi(optarg);
else if (c == 'V')
usage(0);
}
if (nb_mappings < 4)
nb_mappings = 4;
map_many_vmas(nb_mappings);
// show_maps();
printf("1) pagefault to bring a rw page:\n") ;
poke_int(&big_table[0], 10);
printf("1) pagefault to bring a rw page:\n") ;
poke_int(&big_table[1024], 10);
printf("1) pagefault to bring a rw page:\n") ;
poke_int(&big_table[2048], 10);
printf("2) pagefault to bring a zero page, readonly\n");
peek_int(&big_table[3*1024]);
printf("3) pagefault to make this page rw\n");
poke_int(&big_table[3*1024], 10);
printf("1) pagefault to bring a rw page:\n") ;
poke_int(&big_table[4*1024], 10);
printf("1) pagefault to bring a rw page:\n") ;
poke_int(&big_table[5*1024], 10);
printf("4) memset 4096 bytes to 0x55:\n");
poke_full(&big_table[3*1024], 0x55, 4096);
printf("5) fill the whole table\n");
poke_full(big_table, 1, sizeof(big_table));
printf("6) fill again whole table (no more faults, but cpu cache too small)\n");
poke_full(big_table, 1, sizeof(big_table));
printf("7.1) faulting a mmap zone, read access\n");
peek_int(addr1);
printf("8.1) faulting a mmap zone, write access\n");
poke_int(addr2, 10);
printf("7.2) faulting a mmap zone, read access\n");
peek_int(addr3);
printf("8.3) faulting a mmap zone, write access\n");
poke_int(addr4, 10);
return 0;
}
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 16:32 ` Eric Dumazet
@ 2007-04-04 17:02 ` Linus Torvalds
0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2007-04-04 17:02 UTC (permalink / raw)
To: Eric Dumazet
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
On Wed, 4 Apr 2007, Eric Dumazet wrote:
>
> But results on an Intel Pentium-M are interesting, in particular 2) & 3)
>
> If a page is first allocated as the zero page and then COWed to a full rw page, it is more expensive.
> (2660 cycles instead of 2300)
Yes, you have an extra TLB flush there at a minimum (if the page didn't
exist at all before, you don't have to flush).
That said, the big cost tends to be the clearing of the page. Which is why
the "bring in zero page" is so much faster than anything else - it's the
only case that doesn't need to clear the page.
So you should basically think of your numbers like this:
- roughly 900 cycles is the cost of the page fault and all the
"basic software" side in the kernel
- roughly 1400 cycles to actually do the "memset" to clear the page (and
no, that's *not* the cost of memory accesses per se - it's very likely
already in the L2 cache or similar, we just need to clear it and if
it wasn't marked exclusive need to do a bus cycle to invalidate it on
any other CPU's).
with small variations depending on the prior state of the cache in
particular (for example, the TLB flush cost, but also: when you do
> 4) memset 4096 bytes to 0x55:
> Poke_full (addr=0x804f000, len=4096): 2719 cycles
This only adds ~600 cycles to memset the same 4kB that cost ~1400 cycles
before, but that's *probably* largely because it was now already dirty in
the L2 and possibly the L1, so it's quite possible that this is really
just a cache effect, because now it's entirely exclusive in the caches so
you don't need to do any probing on the bus at all).
Also note: in the end, page faults are usually fairly unusual. You do them
once, and then use the page a lot after that. That's not *always* true, of
course. Some malloc()/free() patterns of big areas that are not used for
long will easily cause constant mmap/munmap, and a lot of page faults.
The worst effect of page faults tends to be for short-lived stuff. Notably
things like "system()" that executes a shell just to execute something
else. Almost *everything* in that path is basically "use once, then throw
away", and page fault latency is interesting.
So this is one case where it might be interesting to look at what lmbench
reports for the "fork/exit", "fork/exec" and "shell exec" numbers before
and after.
Linus
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
2007-04-04 15:48 ` Andrea Arcangeli
2007-04-04 16:32 ` Eric Dumazet
@ 2007-04-04 19:15 ` Andrew Morton
2007-04-04 20:11 ` David Miller
` (2 subsequent siblings)
5 siblings, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2007-04-04 19:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Linux Memory Management List, tee,
holt, Andrea Arcangeli, Linux Kernel Mailing List
On Wed, 4 Apr 2007 08:35:30 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Does anybody do any performance testing on -mm?
http://test.kernel.org/perf/index.html has pretty graphs of lots of kernel versions
for a few benchmarks. I'm not aware of any other organised effort along those
lines.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
` (2 preceding siblings ...)
2007-04-04 19:15 ` Andrew Morton
@ 2007-04-04 20:11 ` David Miller
2007-04-04 20:50 ` Andrew Morton
` (2 more replies)
2007-04-04 22:05 ` Valdis.Kletnieks
2007-04-05 4:47 ` Nick Piggin
5 siblings, 3 replies; 40+ messages in thread
From: David Miller @ 2007-04-04 20:11 UTC (permalink / raw)
To: torvalds; +Cc: npiggin, hugh, akpm, linux-mm, tee, holt, andrea, linux-kernel
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
> Anyway, I'm not against this, but I can see somebody actually *wanting*
> the ZERO page in some cases. I've used the fact for TLB testing, for
> example, by just doing a big malloc(), and knowing that the kernel will
> re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> not any *physical* cache effects. Virtually indexed caches will still show
> effects of it, of course, but I haven't cared).
>
> That's an example of an app that actually cares about the page allocation
> (or, in this case, the lack there-of). Not an important one, but maybe
> there are important ones that care?
If we're going to consider this seriously, there is a case I know of.
Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
is an instructive comment:
/* Do not bother with the expensive D-cache flush if it
* is merely the zero page. The 'bigcore' testcase in GDB
* causes this case to run millions of times.
*/
if (page == ZERO_PAGE(0))
return;
basically what the GDB test case does is mmap() an enormous anonymous
area, not touch it, then dump core.
As I understand the patch being considered to remove ZERO_PAGE(), this
kind of core dump will cause a lot of pages to be allocated, probably
eating up a lot of system time as well as memory.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 20:11 ` David Miller
@ 2007-04-04 20:50 ` Andrew Morton
2007-04-05 2:03 ` Nick Piggin
2007-04-05 5:23 ` Andrea Arcangeli
2 siblings, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2007-04-04 20:50 UTC (permalink / raw)
To: David Miller
Cc: torvalds, npiggin, hugh, linux-mm, tee, holt, andrea,
linux-kernel
On Wed, 04 Apr 2007 13:11:11 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.
Point.
Also, what effect will the proposed changes have upon rss reporting,
and upon the numbers in /proc/pid/[s]maps?
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 20:11 ` David Miller
2007-04-04 20:50 ` Andrew Morton
@ 2007-04-05 2:03 ` Nick Piggin
2007-04-05 5:23 ` Andrea Arcangeli
2 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2007-04-05 2:03 UTC (permalink / raw)
To: David Miller
Cc: torvalds, hugh, akpm, linux-mm, tee, holt, andrea, linux-kernel
On Wed, Apr 04, 2007 at 01:11:11PM -0700, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed, 4 Apr 2007 08:35:30 -0700 (PDT)
>
> > Anyway, I'm not against this, but I can see somebody actually *wanting*
> > the ZERO page in some cases. I've used the fact for TLB testing, for
> > example, by just doing a big malloc(), and knowing that the kernel will
> > re-use the ZERO_PAGE so that I don't get any cache effects (well, at least
> > > not any *physical* cache effects. Virtually indexed caches will still show
> > effects of it, of course, but I haven't cared).
> >
> > That's an example of an app that actually cares about the page allocation
> > (or, in this case, the lack there-of). Not an important one, but maybe
> > there are important ones that care?
>
> If we're going to consider this seriously, there is a case I know of.
> Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
> is an instructive comment:
>
> /* Do not bother with the expensive D-cache flush if it
> * is merely the zero page. The 'bigcore' testcase in GDB
> * causes this case to run millions of times.
> */
> if (page == ZERO_PAGE(0))
> return;
>
> basically what the GDB test case does is mmap() an enormous anonymous
> area, not touch it, then dump core.
>
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.
Yeah. Well, it is trivial to leave ZERO_PAGE in get_user_pages;
however, in the longer run it would be nice to get rid of ZERO_PAGE
completely, so we need an alternative.
I've been working on a patch for core dumping that can detect unfaulted
anonymous memory and skip it without doing the ZERO_PAGE comparison.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 20:11 ` David Miller
2007-04-04 20:50 ` Andrew Morton
2007-04-05 2:03 ` Nick Piggin
@ 2007-04-05 5:23 ` Andrea Arcangeli
2 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2007-04-05 5:23 UTC (permalink / raw)
To: David Miller
Cc: torvalds, npiggin, hugh, akpm, linux-mm, tee, holt, linux-kernel
On Wed, Apr 04, 2007 at 01:11:11PM -0700, David S. Miller wrote:
> If we're going to consider this seriously, there is a case I know of.
> Look at flush_dcache_page()'s test for ZERO_PAGE() on sparc64, there
> is an instructive comment:
>
> /* Do not bother with the expensive D-cache flush if it
> * is merely the zero page. The 'bigcore' testcase in GDB
> * causes this case to run millions of times.
> */
> if (page == ZERO_PAGE(0))
> return;
>
> basically what the GDB test case does is mmap() an enormous anonymous
> area, not touch it, then dump core.
>
> As I understand the patch being considered to remove ZERO_PAGE(), this
> kind of core dump will cause a lot of pages to be allocated, probably
> eating up a lot of system time as well as memory.
Well, if we leave the zero page in because there may be too many apps
to optimize, we still have to fix the zero page handling. Current code
is far from ideal. Currently the zero page scales worse than
no-zero-page: at the very least, all the page count/mapcount
increases/decreases at every map-in/zap must be dropped from memory.c,
otherwise two totally unrelated gdbs running at the same time (or gdb
at the same time as fortran, or two unrelated fortran apps) will badly
thrash on the zero page reference counting.
Besides the backwards compatibility argument with gdb or similar apps,
I doubt the zero page is a really worthwhile optimization, and I guess
we'd be better off if it had never existed.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
` (3 preceding siblings ...)
2007-04-04 20:11 ` David Miller
@ 2007-04-04 22:05 ` Valdis.Kletnieks
2007-04-05 0:27 ` Linus Torvalds
2007-04-05 4:47 ` Nick Piggin
5 siblings, 1 reply; 40+ messages in thread
From: Valdis.Kletnieks @ 2007-04-04 22:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
On Wed, 04 Apr 2007 08:35:30 PDT, Linus Torvalds said:
> Although I don't know how much -mm will do for it. There is certainly not
> going to be any correctness problems, afaik, just *performance* problems.
> Does anybody do any performance testing on -mm?
I have to admit I don't do anything more definite than "wow, this goes oink"...
> That's an example of an app that actually cares about the page allocation
> (or, in this case, the lack there-of). Not an important one, but maybe
> there are important ones that care?
I'd not be surprised if there's sparse-matrix code out there that wants to
malloc a *huge* array (like a 1025x1025 array of numbers) that then only
actually *writes* to several hundred locations, and relies on the fact that
all the untouched pages read back all-zeros. Of course, said code is probably
buggy because it doesn't zero the whole thing because you don't usually know
if some other function already scribbled on that heap page.
This would probably be more interesting if we had a userspace API for
"Give me a metric buttload of zero page frames" that malloc() and friends
could leverage.....
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 22:05 ` Valdis.Kletnieks
@ 2007-04-05 0:27 ` Linus Torvalds
2007-04-05 1:25 ` Valdis.Kletnieks
2007-04-05 2:30 ` Nick Piggin
0 siblings, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2007-04-05 0:27 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
On Wed, 4 Apr 2007, Valdis.Kletnieks@vt.edu wrote:
>
> I'd not be surprised if there's sparse-matrix code out there that wants to
> malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> actually *writes* to several hundred locations, and relies on the fact that
> all the untouched pages read back all-zeros.
Good point. In fact, it doesn't need to be a malloc() - I remember people
doing this with Fortran programs and just having an absolutely incredibly
big BSS (with traditional Fortran, dynamic memory allocations are just not
done).
> Of course, said code is probably buggy because it doesn't zero the whole
> thing because you don't usually know if some other function already
> scribbled on that heap page.
Sure you do. If glibc used mmap() or brk(), it *knows* the new data is
zero. So if you use calloc(), for example, it's entirely possible that
a good libc wouldn't waste time zeroing it.
The same is true of BSS. You never clear the BSS with a memset, you just
know it starts out zeroed.
Linus
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 0:27 ` Linus Torvalds
@ 2007-04-05 1:25 ` Valdis.Kletnieks
2007-04-05 2:30 ` Nick Piggin
1 sibling, 0 replies; 40+ messages in thread
From: Valdis.Kletnieks @ 2007-04-05 1:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
On Wed, 04 Apr 2007 17:27:31 PDT, Linus Torvalds said:
> Sure you do. If glibc used mmap() or brk(), it *knows* the new data is
> zero. So if you use calloc(), for example, it's entirely possible that
> a good libc wouldn't waste time zeroing it.
Right. However, the *user* code usually has no idea about the previous
history - so if it uses malloc(), it should be doing something like:
ptr = malloc(my_size*sizeof(whatever));
memset(ptr, 0, my_size*sizeof(whatever));
So malloc does something clever to guarantee that it's zero, and then userspace
undoes the cleverness because it has no easy way to *know* that cleverness
happened.
Admittedly, calloc() *can* get away with being clever. I know we have some
glibc experts lurking here - any of them want to comment on how smart calloc()
actually is, or how smart it can become without needing major changes to
the rest of malloc() and friends?
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 0:27 ` Linus Torvalds
2007-04-05 1:25 ` Valdis.Kletnieks
@ 2007-04-05 2:30 ` Nick Piggin
2007-04-05 5:37 ` William Lee Irwin III
1 sibling, 1 reply; 40+ messages in thread
From: Nick Piggin @ 2007-04-05 2:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Valdis.Kletnieks, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
On Wed, Apr 04, 2007 at 05:27:31PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 4 Apr 2007, Valdis.Kletnieks@vt.edu wrote:
> >
> > I'd not be surprised if there's sparse-matrix code out there that wants to
> > malloc a *huge* array (like a 1025x1025 array of numbers) that then only
> > actually *writes* to several hundred locations, and relies on the fact that
> > all the untouched pages read back all-zeros.
>
> Good point. In fact, it doesn't need to be a malloc() - I remember people
> doing this with Fortran programs and just having an absolutely incredibly
> big BSS (with traditional Fortran, dynamic memory allocations are just not
> done).
Sparse matrices are one thing I worry about. I don't know enough about
HPC code to know whether they will be a problem. I know there exist
data structures to optimise sparse matrix storage...
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 2:30 ` Nick Piggin
@ 2007-04-05 5:37 ` William Lee Irwin III
2007-04-05 17:23 ` Valdis.Kletnieks
0 siblings, 1 reply; 40+ messages in thread
From: William Lee Irwin III @ 2007-04-05 5:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Valdis.Kletnieks, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
On Wed, Apr 04, 2007 at 05:27:31PM -0700, Linus Torvalds wrote:
>> Good point. In fact, it doesn't need to be a malloc() - I remember people
>> doing this with Fortran programs and just having an absolutely incredibly
>> big BSS (with traditional Fortran, dynamic memory allocations are just not
>> done).
On Thu, Apr 05, 2007 at 04:30:26AM +0200, Nick Piggin wrote:
> Sparse matrices are one thing I worry about. I don't know enough about
> HPC code to know whether they will be a problem. I know there exist
> data structures to optimise sparse matrix storage...
\begin{admission-against-interest}
Sparse matrix code goes to extreme lengths to avoid ever looking at
substantial numbers of zero floating point matrix and vector entries.
In extreme cases, hashing and various sorts of heavyweight data
structures are used to represent highly irregular structures. At various
times the matrix is not even explicitly formed. Most typical are cases
like band diagonal matrices where storage is allocated only for the
nonzero diagonals. The entire purpose of sparse algorithms is to avoid
examining or even allocating zeros.
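A minimal sketch of the band-diagonal storage mentioned above, for the simplest tridiagonal case (names are made up for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* A tridiagonal NxN matrix stored as three length-N arrays instead of
 * N*N entries; everything outside the band is an implicit, never-
 * allocated zero. */
struct tridiag {
	int n;
	double *lo;	/* sub-diagonal:   lo[j] = a[j+1][j] */
	double *d;	/* main diagonal:  d[i]  = a[i][i]   */
	double *up;	/* super-diagonal: up[i] = a[i][i+1] */
};

static struct tridiag *td_new(int n)
{
	struct tridiag *m = malloc(sizeof(*m));

	m->n = n;
	m->lo = calloc(n, sizeof(double));
	m->d = calloc(n, sizeof(double));
	m->up = calloc(n, sizeof(double));
	return m;
}

/* Read a[i][j]; off-band entries are zero without ever being stored. */
static double td_get(const struct tridiag *m, int i, int j)
{
	if (i == j)
		return m->d[i];
	if (i == j + 1)
		return m->lo[j];
	if (j == i + 1)
		return m->up[i];
	return 0.0;
}
```

Storage drops from N*N to 3*N doubles, which is exactly the "avoid allocating zeros" point.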
The actual phenomenon of concern here is dense matrix code with sparse
matrix inputs. The matrices will typically not be vast but may span a few
MB of RAM (1024x1024 entries is 1M*sizeof(double) = 8MB; various dense matrix
algorithms target ca. 300x300). Most of the time this will arise from
the use of dense matrix code as black box solvers called as a library
by programs not terribly concerned about efficiency until something
gets explosively inefficient (and maybe not even then), or otherwise
numerically naive programs. This, however, is arguably the majority of
the usage cases by end-user invocations, so beware, though not too much.
I'd be more concerned about large hashtables sparsely used for the
purposes of adjacency detection and other cases where large time vs.
space tradeoffs are made for probabilistic reasons involving
collisions.
\end{admission-against-interest}
-- wli
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-05 5:37 ` William Lee Irwin III
@ 2007-04-05 17:23 ` Valdis.Kletnieks
0 siblings, 0 replies; 40+ messages in thread
From: Valdis.Kletnieks @ 2007-04-05 17:23 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Nick Piggin, Linus Torvalds, Hugh Dickins, Andrew Morton,
Linux Memory Management List, tee, holt, Andrea Arcangeli,
Linux Kernel Mailing List
[-- Attachment #1: Type: text/plain, Size: 1580 bytes --]
On Wed, 04 Apr 2007 22:37:29 PDT, William Lee Irwin III said:
> The actual phenomenon of concern here is dense matrix code with sparse
> matrix inputs. The matrices will typically not be vast but may span 1MB
> or so of RAM (1024x1024 is 1M*sizeof(double), and various dense matrix
> algorithms target ca. 300x300). Most of the time this will arise from
> the use of dense matrix code as black box solvers called as a library
> by programs not terribly concerned about efficiency until something
> gets explosively inefficient (and maybe not even then), or otherwise
> numerically naive programs. This, however, is arguably the majority of
> the usage cases by end-user invocations, so beware, though not too much.
Amen, brother! :)
At least in my environment, the vast majority of matrix code is actually run by
graduate students under the direction of whatever professor is the Principal
Investigator on the grant. As a rule, you can expect the grad student to know
about rounding errors and convergence issues and similar program *correctness*
factors. But it's the rare one that has much interest in program *efficiency*.
If it takes 2 days to run, that's 2 days they can go get another few pages of
thesis written while they wait. :)
The code that gets on our SystemX (a top-50 supercomputer still) is usually
well-tweaked for efficiency. However, that's just one system - there are on the
order of several hundred smaller compute clusters and boxen and SGI-en on
campus where "protect the system from cargo-cult programming by grad students"
is a valid kernel goal. ;)
[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [rfc] no ZERO_PAGE?
2007-04-04 15:35 ` Linus Torvalds
` (4 preceding siblings ...)
2007-04-04 22:05 ` Valdis.Kletnieks
@ 2007-04-05 4:47 ` Nick Piggin
5 siblings, 0 replies; 40+ messages in thread
From: Nick Piggin @ 2007-04-05 4:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, tee,
holt, Andrea Arcangeli, Linux Kernel Mailing List
On Wed, Apr 04, 2007 at 08:35:30AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 4 Apr 2007, Nick Piggin wrote:
> >
> > Shall I do a more complete patchset and ask Andrew to give it a
> > run in -mm?
>
> Do this trivial one first. See how it fares.
OK.
> Although I don't know how much -mm will do for it. There is certainly not
> going to be any correctness problems, afaik, just *performance* problems.
> Does anybody do any performance testing on -mm?
>
> That said, talking about correctness/performance problems:
>
> > + page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > + if (likely(!pte_none(*page_table))) {
> > inc_mm_counter(mm, anon_rss);
> > lru_cache_add_active(page);
> > page_add_new_anon_rmap(page, vma, address);
>
> Isn't that test the wrong way around?
>
> Shouldn't it be
>
> if (likely(pte_none(*page_table))) {
>
> without any logical negation? Was this patch tested?
Yeah, untested of course. I'm having problems booting my normal test box,
so the main point of the patch was to generate some discussion (which
worked! ;)).
Thanks,
Nick
^ permalink raw reply [flat|nested] 40+ messages in thread