public inbox for linux-kernel@vger.kernel.org
* Maybe a VM bug in 2.4.18-18 from RH 8.0?
@ 2002-12-06  0:13 Norman Gaywood
  2002-12-06  1:00 ` Andrew Morton
  2002-12-06  1:08 ` Andrea Arcangeli
  0 siblings, 2 replies; 49+ messages in thread
From: Norman Gaywood @ 2002-12-06  0:13 UTC (permalink / raw)
  To: linux-kernel

I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18.

The system is a 4 processor, 16GB memory Dell PE6600 running RH8.0 +
errata. More details at the end of this message.

By doing a large copy I can trigger this problem in about 30-40 minutes. At
the end of that time, kswapd will start to get a larger % of CPU and
the system load will be around 2-3. The system will feel sluggish at an
interactive shell and it will take several seconds before a command like
top would start to display. If I let it go for another 30 minutes the
system is unusable, where it could take 10 minutes or more to do simple
commands. If I let it go for several hours after that, the following
messages can appear on the console depending on the type of copy:

ENOMEM in journal_get_undo_access_Rsmp_df5dec49, retrying.

or

ENOMEM in do_get_write_access, retrying.

The problem can be triggered by almost any type of copy command. In
particular, this command can trigger it:

   tar cf /dev/tape .

for . large enough. Unfortunately this was how I was intending to back
up the system.

"Large enough" is several gigabytes. It also seems to depend on how much
memory is in use, in particular how much is used by cache. Also in
the equation is the number of files. Copying one big file does not seem
to trigger the problem. I initially discovered the problem when doing an
rsync copy of the user home directories over a network.

Can it be stopped? Yes. On the linux-poweredge@dell.com mailing list,
Stephan Wonczak suggested that I should put the system under some memory
pressure while doing the copy. The program he supplied simply allocated
about 750 megabytes. I tried running it at 10 second
intervals while doing a copy but it did not help. Since the system has
16 Gig of memory, I tried to give it some real memory pressure and ran
7 processes that used 1.8G each like this:

#!/bin/sh
SLEEP=600
COUNT=20

while [ $COUNT != 0 ]
do
   COUNT=`expr $COUNT - 1`
   date
   # 2000 by 1_000_000 seems to be a 1.8G process
   perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
   perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
   perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
   perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
   perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
   perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
   perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }'
   sleep $SLEEP
done

This brought the cache down to about 3-4 Gig used after it ran. With this
running, the system performed the copy with no problems! No doubt there
is a happy medium between these two extremes.
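The seven identical perl lines invite a parameterized version; here is a
sketch in the same spirit (the helper names and the eval-based spawning
are mine, not from the original script, and the right hog count and size
will vary per machine):

```shell
# hog_cmd CHUNKS - print a perl one-liner that holds ~CHUNKS megabytes
# (each chunk is "x" x 1_000_000, roughly 1 MB, as in the script above)
hog_cmd() {
    printf "perl -e '\$i=%d; while (\$i--) { \$a[\$i] = \"x\" x 1_000_000 }'" "$1"
}

# spawn_hogs N CHUNKS - run N hogs in parallel and wait for them all;
# e.g. spawn_hogs 7 2000 approximates the 7 x 1.8G processes above
spawn_hogs() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        eval "$(hog_cmd "$2") &"
        i=$((i + 1))
    done
    wait    # let every hog finish before returning
}
```

Run under a loop with a sleep, as above, while the copy is going.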

There is a suggestion that I may not see this problem when the system is
under real load. Since I am only setting up the system at the moment there
are no users giving the system something to do. The copy is the only real
work during these tests. I find it difficult to say "she'll be right"
(as we do in Aus) and throw the system into production hoping that it
will just work.

So what do I do now? I have what I believe is a trigger for a VM problem
in a widely used version of Linux. Anyone have some patches for me to
try that won't take me too far from the RH 8.0 base system?

Here are the system details:

PE6600 running RH 8.0 with latest errata. Note that I have upgraded to
kernel 2.4.18-19.7.tg3.120bigmem which I understand to be the latest
RH8 errata kernel + patches to stop the tg3 hanging problem. This came
from http://people.redhat.com/jgarzik/tg3/. I have also tried the latest
RH errata kernel using the bcm5700 driver and it has the same problem.

HW includes:
Adaptec AIC-7892 SCSI BIOS v25704
3 Adaptec SCSI Card 39160 BIOS v2.57.2S2
8 HITACHI DK32DJ-72MC 160 drives
2 Quantum ATLAS10K3-73-SCA 160 drives

uname -a
Linux alan.une.edu.au 2.4.18-19.7.tg3.120bigmem #1 SMP Mon Nov 25 15:15:29 EST 2002 i686 i686 i386 GNU/Linux

cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  16671522816 444915712 16226607104        0 136830976 56520704
Swap: 34365202432        0 34365202432
MemTotal:     16280784 kB
MemFree:      15846296 kB
MemShared:           0 kB
Buffers:        133624 kB
Cached:          55196 kB
SwapCached:          0 kB
Active:         249984 kB
Inact_dirty:     18088 kB
Inact_clean:       480 kB
Inact_target:    53708 kB
HighTotal:    15597504 kB
HighFree:     15434932 kB
LowTotal:       683280 kB
LowFree:        411364 kB
SwapTotal:    33559768 kB
SwapFree:     33559768 kB
Committed_AS:   177044 kB

df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md2               8254136   2825112   5009736  37% /
/dev/md0                101018     25627     70175  27% /boot
/dev/md6             211671024  88323536 112595200  44% /home
/dev/md1              16515968   1785024  13891956  12% /opt
none                   8140392         0   8140392   0% /dev/shm
/dev/md4               4126976    149944   3767392   4% /tmp
/dev/md3              16515968    168172  15508808   2% /var
/dev/md5               8522932   1596520   6493468  20% /var/spool/mail
/dev/sdh1             70557052     32832  66940124   1% /.automount/alan/disks/alan/h1
/dev/sdi1             70557052  22856784  44116172  35% /.automount/alan/disks/alan/i1
/dev/sdj1             70557052  13619440  53353516  21% /.automount/alan/disks/alan/j1

df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/md2             1048576  167838  880738   17% /
/dev/md0               26104      59   26045    1% /boot
/dev/md6             26886144 1941926 24944218    8% /home
/dev/md1             2101152   49285 2051867    3% /opt
none                 2035098       1 2035097    1% /dev/shm
/dev/md4              524288      26  524262    1% /tmp
/dev/md3             2101152    4877 2096275    1% /var
/dev/md5             1082720    2535 1080185    1% /var/spool/mail
/dev/sdh1            8962048      12 8962036    1% /.automount/alan/disks/alan/h1
/dev/sdi1            8962048  712400 8249648    8% /.automount/alan/disks/alan/i1
/dev/sdj1            8962048   10497 8951551    1% /.automount/alan/disks/alan/j1

-- 
Norman Gaywood -- School of Mathematical and Computer Sciences
University of New England, Armidale, NSW 2351, Australia
norm@turing.une.edu.au     http://turing.une.edu.au/~norm
Phone: +61 2 6773 2412     Fax: +61 2 6773 3312

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
       [not found] <mailman.1039133948.27411.linux-kernel2news@redhat.com>
@ 2002-12-06  0:35 ` Pete Zaitcev
  2002-12-06  1:27   ` Norman Gaywood
  0 siblings, 1 reply; 49+ messages in thread
From: Pete Zaitcev @ 2002-12-06  0:35 UTC (permalink / raw)
  To: Norman Gaywood; +Cc: linux-kernel

> I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18

> By doing a large copy I can trigger this problem in about 30-40 minutes. At
> the end of that time, kswapd will start to get a larger % of CPU and
> the system load will be around 2-3. The system will feel sluggish at an
> interactive shell and it will take several seconds before a command like
> top would start to display. [...]

Check your /proc/slabinfo, just in case, to rule out a leak.
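A low-tech way to act on this is to capture a timestamped snapshot now and
diff it against one taken once the box turns sluggish. A sketch (the helper
name is mine; the source file is parameterized because reading
/proc/slabinfo may require root):

```shell
# slab_snap [SRC] [OUT] - save a timestamped copy of SRC (default
# /proc/slabinfo) into OUT and print OUT's name
slab_snap() {
    src=${1:-/proc/slabinfo}
    out=${2:-slabinfo.$(date +%s)}
    { date; cat "$src"; } > "$out" && echo "$out"
}

# later, once the machine is sluggish:
#   diff slabinfo.GOOD slabinfo.BAD | less
```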

> cat /proc/meminfo
>         total:    used:    free:  shared: buffers:  cached:
> Mem:  16671522816 444915712 16226607104        0 136830976 56520704
> Swap: 34365202432        0 34365202432
> MemTotal:     16280784 kB
> MemFree:      15846296 kB
> MemShared:           0 kB
> Buffers:        133624 kB
> Cached:          55196 kB
> SwapCached:          0 kB
> Active:         249984 kB
> Inact_dirty:     18088 kB
> Inact_clean:       480 kB
> Inact_target:    53708 kB
> HighTotal:    15597504 kB
> HighFree:     15434932 kB
> LowTotal:       683280 kB
> LowFree:        411364 kB
> SwapTotal:    33559768 kB
> SwapFree:     33559768 kB
> Committed_AS:   177044 kB

This is not interesting. Get it _after_ the box becomes sluggish.

Remember, the 2.4.18 stream in RH does not have its own VM, distinct
from Marcelo+Riel. So, you can come to linux-kernel for advice,
but first, get it all reproduced with Marcelo's tree with
Riel's patches all the same.

-- Pete

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  0:13 Maybe a VM bug in 2.4.18-18 from RH 8.0? Norman Gaywood
@ 2002-12-06  1:00 ` Andrew Morton
  2002-12-06  1:17   ` Andrea Arcangeli
  2002-12-06  1:08 ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2002-12-06  1:00 UTC (permalink / raw)
  To: Norman Gaywood; +Cc: linux-kernel

Norman Gaywood wrote:
> 
> I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18
> 
> 16GB
> ...
>    tar cf /dev/tape .
> 

This machine will die due to buffer_heads which are attached
to highmem pagecache, and due to inodes which are pinned by
highmem pagecache.

> ...
> while [ `expr $COUNT - 1` != 0 ]
> do
>    date
>    # 2000 by 1_000_000 seems to be a 1.8G process
>    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
> ...

This will evict the highmem pagecache.  That frees the buffer_heads
and unpins the inodes.
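If this explanation is right, LowFree in /proc/meminfo should shrink as the
buffer_head slab grows. A small probe for that, with file paths
parameterized (the helper name is mine; it assumes the 2.4-era /proc
formats shown elsewhere in this thread):

```shell
# low_watch [MEMINFO] [SLABINFO] - one-line summary of low-zone free
# memory versus the buffer_head slab object count
low_watch() {
    meminfo=${1:-/proc/meminfo}
    slabinfo=${2:-/proc/slabinfo}
    lowfree=$(awk '/^LowFree:/ { print $2 }' "$meminfo")
    bh=$(awk '$1 == "buffer_head" { print $2 }' "$slabinfo")
    echo "LowFree=${lowfree}kB buffer_head=${bh}"
}
# e.g. run "low_watch; sleep 60; low_watch" while the tar is going
```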

> So what do I do now?

I guess talk to Red Hat.  These are well-known problems and there
should be fixes for them in a "bigmem" kernel.

Otherwise, the -aa kernels have patches to address these problems.
One option would be to roll your own kernel, based on a kernel.org
kernel and a matching patch from
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/

> ...
> Anyone have some patches for me to
> try that won't take me too far from the RH 8.0 base system.

Hard.  The relevant patches are:

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/05_vm_16_active_free_zone_bhs-1
and
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/10_inode-highmem-2

The first one will not come vaguely close to applying to an
RH 2.4.18 kernel.

The second one may well apply, and will probably fix the problem.
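Whether that patch applies can be checked without touching the tree, using
patch's dry-run mode. A small generic helper (the function name is mine):

```shell
# try_patch FILE - apply FILE with -p1 only if a dry run succeeds,
# so a partially-applying patch never dirties the tree
try_patch() {
    patch -p1 --dry-run < "$1" && patch -p1 < "$1"
}

# e.g., inside the unpacked RH kernel source tree:
#   try_patch /path/to/10_inode-highmem-2
```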

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  0:13 Maybe a VM bug in 2.4.18-18 from RH 8.0? Norman Gaywood
  2002-12-06  1:00 ` Andrew Morton
@ 2002-12-06  1:08 ` Andrea Arcangeli
  1 sibling, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06  1:08 UTC (permalink / raw)
  To: Norman Gaywood; +Cc: linux-kernel

On Fri, Dec 06, 2002 at 11:13:26AM +1100, Norman Gaywood wrote:
> I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18
> 
> The system is a 4 processor, 16GB memory Dell PE6600 running RH8.0 +
> errata. More details at the end of this message.

Thanks to lots of feedback from users in the last months I have fixed all
known VM bugs to date that can be reproduced on those big machines.
They're all included in my tree and in the current UL/SuSE releases.
Over time I should have posted all of them to the kernel list in one way
or another.  The most critical ones are now pending for merging in
2.4.21pre. So in the meantime you may want to try to reproduce this on top
of 2.4.20aa1 or the UL kernel, and (unless your problem is a tape driver
bug ;) I'm pretty sure it will fix the problems on your big machine.

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1.gz
	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/

Hope this helps,

Andrea

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  1:00 ` Andrew Morton
@ 2002-12-06  1:17   ` Andrea Arcangeli
  2002-12-06  1:34     ` Andrew Morton
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06  1:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 05:00:15PM -0800, Andrew Morton wrote:
> Hard.  The relevant patches are:
> 
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/05_vm_16_active_free_zone_bhs-1
> and
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/10_inode-highmem-2

yep, those are the two I had in mind when I said they're pending for
2.4.21pre inclusion. He may still suffer other known problems besides
the above two critical highmem fixes (for example if
lower_zone_reserve_ratio is not applied; there's no other fix around
it IMHO, that's a generic OS problem, not only for Linux, and that was my
only sensible solution to fix it; the approach in mainline is way too
weak to make a real difference), though whatever other problem remains
would probably need something more complicated than a tar to reproduce.

Andrea

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  0:35 ` Pete Zaitcev
@ 2002-12-06  1:27   ` Norman Gaywood
  2002-12-06 12:48     ` Rik van Riel
  0 siblings, 1 reply; 49+ messages in thread
From: Norman Gaywood @ 2002-12-06  1:27 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: linux-kernel

On Thu, Dec 05, 2002 at 07:35:49PM -0500, Pete Zaitcev wrote:
> > I think I have a trigger for a VM bug in the RH kernel-bigmem-2.4.18-18
> 
> > By doing a large copy I can trigger this problem in about 30-40 minutes. At
> > the end of that time, kswapd will start to get a larger % of CPU and
> > the system load will be around 2-3. The system will feel sluggish at an
> > interactive shell and it will take several seconds before a command like
> > top would start to display. [...]
> 
> Check your /proc/slabinfo, just in case, to rule out a leak.

Here is a /proc/slabinfo diff of a good system and a very sluggish one:

1c1
< Mon Nov 25 17:13:04 EST 2002
---
> Mon Nov 25 22:35:58 EST 2002
6c6
< nfs_inode_cache        6      6    640    1    1    1 :  124   62
---
> nfs_inode_cache        1      6    640    1    1    1 :  124   62
8,11c8,11
< ip_fib_hash          224    224     32    2    2    1 :  252  126
< journal_head        3101  36113     48   69  469    1 :  252  126
< revoke_table         250    250     12    1    1    1 :  252  126
< revoke_record        672    672     32    6    6    1 :  252  126
---
> ip_fib_hash           10    224     32    2    2    1 :  252  126
> journal_head          12    154     48    2    2    1 :  252  126
> revoke_table           7    250     12    1    1    1 :  252  126
> revoke_record          0      0     32    0    0    1 :  252  126
14,20c14,20
< tcp_tw_bucket        210    210    128    7    7    1 :  252  126
< tcp_bind_bucket      896    896     32    8    8    1 :  252  126
< tcp_open_request     180    180    128    6    6    1 :  252  126
< inet_peer_cache        0      0     64    0    0    1 :  252  126
< ip_dst_cache         105    105    256    7    7    1 :  252  126
< arp_cache             90     90    128    3    3    1 :  252  126
< blkdev_requests    16548  17430    128  561  581    1 :  252  126
---
> tcp_tw_bucket          0      0    128    0    0    1 :  252  126
> tcp_bind_bucket       28    784     32    7    7    1 :  252  126
> tcp_open_request       0      0    128    0    0    1 :  252  126
> inet_peer_cache        1     58     64    1    1    1 :  252  126
> ip_dst_cache          40    105    256    7    7    1 :  252  126
> arp_cache              4     90    128    3    3    1 :  252  126
> blkdev_requests    16384  16410    128  547  547    1 :  252  126
22c22
< file_lock_cache      328    328     92    8    8    1 :  252  126
---
> file_lock_cache        3     82     92    2    2    1 :  252  126
24,27c24,27
< uid_cache            672    672     32    6    6    1 :  252  126
< skbuff_head_cache   1107   2745    256   77  183    1 :  252  126
< sock                 270    270   1280   90   90    1 :   60   30
< sigqueue             870    870    132   30   30    1 :  252  126
---
> uid_cache              9    448     32    4    4    1 :  252  126
> skbuff_head_cache    816   1110    256   74   74    1 :  252  126
> sock                  81    129   1280   43   43    1 :   60   30
> sigqueue              29     29    132    1    1    1 :  252  126
29,33c29,33
< cdev_cache           498   2262     64   12   39    1 :  252  126
< bdev_cache           290    290     64    5    5    1 :  252  126
< mnt_cache            232    232     64    4    4    1 :  252  126
< inode_cache       543337 553490    512 79070 79070    1 :  124   62
< dentry_cache      373336 554430    128 18481 18481    1 :  252  126
---
> cdev_cache            16    290     64    5    5    1 :  252  126
> bdev_cache            27    174     64    3    3    1 :  252  126
> mnt_cache             19    174     64    3    3    1 :  252  126
> inode_cache       305071 305081    512 43583 43583    1 :  124   62
> dentry_cache         418   2430    128   81   81    1 :  252  126
35,43c35,43
< filp                 930    930    128   31   31    1 :  252  126
< names_cache           48     48   4096   48   48    1 :   60   30
< buffer_head       831810 831810    128 27727 27727    1 :  252  126
< mm_struct            510    510    256   34   34    1 :  252  126
< vm_area_struct      4488   4740    128  158  158    1 :  252  126
< fs_cache             696    696     64   12   12    1 :  252  126
< files_cache          469    469    512   67   67    1 :  124   62
< signal_act           388    418   1408   38   38    4 :   60   30
< pae_pgd              696    696     64   12   12    1 :  252  126
---
> filp                1041   1230    128   41   41    1 :  252  126
> names_cache            7      8   4096    7    8    1 :   60   30
> buffer_head       3431966 3432150    128 114405 114405    1 :  252  126
> mm_struct            198    315    256   21   21    1 :  252  126
> vm_area_struct      5905   5970    128  199  199    1 :  252  126
> fs_cache             204    464     64    8    8    1 :  252  126
> files_cache          204    217    512   31   31    1 :  124   62
> signal_act           246    286   1408   26   26    4 :   60   30
> pae_pgd              198    638     64   11   11    1 :  252  126
51c51
< size-16384            16     24  16384   16   24    4 :    0    0
---
> size-16384            20     20  16384   20   20    4 :    0    0
53c53
< size-8192              5     11   8192    5   11    2 :    0    0
---
> size-8192              9      9   8192    9    9    2 :    0    0
55c55
< size-4096            287    407   4096  287  407    1 :   60   30
---
> size-4096             56     56   4096   56   56    1 :   60   30
57c57
< size-2048            426    666   2048  213  333    1 :   60   30
---
> size-2048            281    314   2048  157  157    1 :   60   30
59c59
< size-1024           1024   1272   1024  256  318    1 :  124   62
---
> size-1024            659    712   1024  178  178    1 :  124   62
61c61
< size-512            3398   3584    512  445  448    1 :  124   62
---
> size-512            2782   2856    512  357  357    1 :  124   62
63c63
< size-256             777   1155    256   67   77    1 :  252  126
---
> size-256             101    255    256   17   17    1 :  252  126
65c65
< size-128            4836  19200    128  244  640    1 :  252  126
---
> size-128            2757   3750    128  125  125    1 :  252  126
67c67
< size-64             8958  20550    128  356  685    1 :  252  126
---
> size-64              178    510    128   17   17    1 :  252  126
69c69
< size-32            23262  43674     64  433  753    1 :  252  126
---
> size-32              711   1218     64   21   21    1 :  252  126


> > cat /proc/meminfo
> This is not interesting. Get it _after_ the box becomes sluggish.

I don't have one of those, but here is top output from a sluggish system:

  3:51pm  up 43 min,  3 users,  load average: 1.69, 1.28, 0.92
109 processes: 108 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user,  0.3% system,  0.0% nice, 99.2% idle
CPU1 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
CPU2 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
CPU3 states:  0.0% user,  1.4% system,  0.0% nice, 98.0% idle
CPU4 states:  0.0% user, 58.2% system,  0.0% nice, 41.2% idle
CPU5 states:  0.0% user, 96.4% system,  0.0% nice,  3.0% idle
CPU6 states:  0.0% user,  0.5% system,  0.0% nice, 99.0% idle
CPU7 states:  0.0% user,  0.3% system,  0.0% nice, 99.2% idle
Mem:  16280784K av, 15747124K used,  533660K free,       0K shrd, 20952K buff
Swap: 33559768K av,       0K used, 33559768K free 15037240K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   19 root      25   0     0    0     0 SW   96.7  0.0   1:52 kswapd
 1173 root      21   0 10592  10M   424 D    58.2  0.0   3:30 cp
  202 root      15   0     0    0     0 DW    1.9  0.0   0:04 kjournald
  205 root      15   0     0    0     0 DW    0.9  0.0   0:10 kjournald
   21 root      15   0     0    0     0 SW    0.5  0.0   0:01 kupdated
 1121 root      16   0  1056 1056   836 R     0.5  0.0   0:09 top
    1 root      15   0   476  476   424 S     0.0  0.0   0:04 init
    2 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU0
    3 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU1
    4 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU2
    5 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU3
    6 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU4
    7 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU5
    8 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU6
    9 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU7
   10 root      15   0     0    0     0 SW    0.0  0.0   0:00 keventd

> Remember, the 2.4.18 stream in RH does not have its own VM, distinct
> from Marcelo+Riel. So, you can come to linux-kernel for advice,
> but first, get it all reproduced with Marcelo's tree with
> Riel's patches all the same.

Yep, I understand that. I just thought this might be of interest
however. It's pretty hard to find a place to talk about this problem
with someone who might know something! I've got a service request in
with RH; no answer yet, but it's only been 1.5 days.

While I've been writing this it looks like Andrew Morton and Andrea
Arcangeli have given me some great answers and have declared this a
"well known problem".  Looks like I've got something to try.

-- 
Norman Gaywood -- School of Mathematical and Computer Sciences
University of New England, Armidale, NSW 2351, Australia
norm@turing.une.edu.au     http://turing.une.edu.au/~norm
Phone: +61 2 6773 2412     Fax: +61 2 6773 3312

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  1:17   ` Andrea Arcangeli
@ 2002-12-06  1:34     ` Andrew Morton
  2002-12-06  1:44       ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2002-12-06  1:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Norman Gaywood, linux-kernel

Andrea Arcangeli wrote:
> 
> ...
> He may still suffer other known problems besides
> the above two critical highmem fixes (for example if
> lower_zone_reserve_ratio is not applied and there's no other fix around
> it IMHO, that's a generic OS problem not only for linux, and that was my
> only sensible solution to fix it, the approach in mainline is way too
> weak to make a real difference)

argh.  I hate that one ;)  Giving away 100 megabytes of memory
hurts.

I've never been able to find the workload which makes this
necessary.  Can you please describe an "exploit" against 
2.4.20 which demonstrates the need for this?

Thanks.

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  1:34     ` Andrew Morton
@ 2002-12-06  1:44       ` Andrea Arcangeli
  2002-12-06  2:15         ` William Lee Irwin III
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06  1:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 05:34:34PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > ...
> > He may still suffer other known problems besides
> > the above two critical highmem fixes (for example if
> > lower_zone_reserve_ratio is not applied and there's no other fix around
> > it IMHO, that's a generic OS problem not only for linux, and that was my
> > only sensible solution to fix it, the approach in mainline is way too
> > weak to make a real difference)
> 
> argh.  I hate that one ;)  Giving away 100 megabytes of memory
> hurts.

100M hurts on a 4G box? No-way ;)

it hurts when such 100M of normal zone are mlocked
by a highmem-capable user and you can't allocate one more inode but
you still have 3G of highmem free (google is doing this, they even drop
a check so they can mlock > half of the ram).

Or it hurts when you can't allocate an inode because such 100M are in
pagetables on a 64G box and you still have 60G free of highmem.

> I've never been able to find the workload which makes this
> necessary.  Can you please describe an "exploit" against 

ask google...

> 2.4.20 which demonstrates the need for this?

even simpler, swapoff -a and malloc and have fun! ;) (again ask google,
they run w/o swap for obvious good reasons)

Or if you have enough time, wait for those 100M to be filled by pagetables
on a 64G box.

Andrea

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  1:44       ` Andrea Arcangeli
@ 2002-12-06  2:15         ` William Lee Irwin III
  2002-12-06  2:28           ` Andrea Arcangeli
  2002-12-06 10:36           ` Arjan van de Ven
  0 siblings, 2 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06  2:15 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Fri, Dec 06, 2002 at 02:44:29AM +0100, Andrea Arcangeli wrote:
> Or it hurts when you can't allocate an inode because such 100M are in
> pagetables on a 64G box and you still have 60G free of highmem.

This is the zone vs. zone watermark stuff that penalizes or blocks
allocations made with a given GFP mask from being satisfied by
fallback. This is largely old news wrt. the various kinds of inability
to pressure those ZONE_NORMAL (maybe also ZONE_DMA) consumers.

Admission control for fallback is valuable, sure. I suspect the
question akpm raised is about memory utilization. My own issues are
centered around allocations targeted directly at ZONE_NORMAL,
which fallback prevention does not address, so the watermark patch
is not something I'm personally very concerned about.

64GB isn't getting any testing that I know of; I'd hold off until
someone's actually stood up and confessed to attempting to boot
Linux on such a beast. Or until I get some more RAM. =)


Bill

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  2:15         ` William Lee Irwin III
@ 2002-12-06  2:28           ` Andrea Arcangeli
  2002-12-06  2:41             ` William Lee Irwin III
  2002-12-06 10:36           ` Arjan van de Ven
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06  2:28 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Norman Gaywood,
	linux-kernel

On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote:
> On Fri, Dec 06, 2002 at 02:44:29AM +0100, Andrea Arcangeli wrote:
> > Or it hurts when you can't allocate an inode because such 100M are in
> > pagetables on a 64G box and you still have 60G free of highmem.
> 
> This is the zone vs. zone watermark stuff that penalizes/fails
> allocations made with a given GFP mask from being satisfied by
> fallback. This is largely old news wrt. various kinds of inability
> to pressure those ZONE_NORMAL (maybe also ZONE_DMA) consumers.
> 
> Admission control for fallback is valuable, sure. I suspect the
> question akpm raised is about memory utilization. My own issues are
> centered around allocations targeted directly at ZONE_NORMAL,
> which fallback prevention does not address, so the watermark patch
> is not something I'm personally very concerned about.

you must be very concerned about it too.

If you don't have the fallback prevention, all your efforts around the
allocations targeted directly at zone normal will be completely worthless.

Either that or you want to drop ZONE_NORMAL entirely, because it means
nothing uses zone-normal dynamically anymore (ZONE_NORMAL seen as a
place that is directly mapped, not necessarily always 32bit dma
capable).

> 64GB isn't getting any testing that I know of; I'd hold off until
> someone's actually stood up and confessed to attempting to boot
> Linux on such a beast. Or until I get some more RAM. =)

64GB is an example, a good example for this thing, but a 16G machine or
a 4G machine can run into the very same issues. As said, just swapoff -a
and malloc(1G), and such 1G is all in ZONE_NORMAL before you can allocate
enough inodes for your workload. Or alloc 1G of pagetables by setting
everything protnone, and such 1G of pagetables goes in zone-normal
because the highmem is filled by cache. Choose whatever is your
preferred example of a real life bug fixed by the lowmem-reservation patch,
which is absolutely necessary to run stable on a big box with normal zone
and highmem (not only a 64G box).

The only place where you must not be concerned about these fixes are the
64bit archs.

Andrea

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  2:28           ` Andrea Arcangeli
@ 2002-12-06  2:41             ` William Lee Irwin III
  2002-12-06  5:25               ` Andrew Morton
  2002-12-06 22:28               ` Andrea Arcangeli
  0 siblings, 2 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06  2:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote:
>> Admission control for fallback is valuable, sure. I suspect the
>> question akpm raised is about memory utilization. My own issues are
>> centered around allocations targeted directly at ZONE_NORMAL,
>> which fallback prevention does not address, so the watermark patch
>> is not something I'm personally very concerned about.

On Fri, Dec 06, 2002 at 03:28:53AM +0100, Andrea Arcangeli wrote:
> you must be very concerned about it too.
> If you don't have the fallback prevention all your efforts around the
> allocations targeted directly at zone normal will be completely worthless.
> Either that or you want to drop ZONE_NORMAL entirely because it means
> nothing uses zone-normal dynamically anymore (ZONE_NORMAL seen as a
> place that is directly mapped, not necessarily always 32bit dma
> capable).

Yes, it's necessary; no, I've never directly encountered the issue it
fixes. Sorry about the miscommunication there.


On Thu, Dec 05, 2002 at 06:15:59PM -0800, William Lee Irwin III wrote:
>> 64GB isn't getting any testing that I know of; I'd hold off until
>> someone's actually stood up and confessed to attempting to boot
>> Linux on such a beast. Or until I get some more RAM. =)

On Fri, Dec 06, 2002 at 03:28:53AM +0100, Andrea Arcangeli wrote:
> 64GB is an example, a good example for this thing, but a 16G machine or
> a 4G machine can run into the very same issues. As said just swapoff -a
> and malloc(1G) and such 1G is all ZONE_NORMAL before you could allocate
> enough inodes for your workload. Or alloc 1G of pagetables by setting
> everything protnone, and sugh 1G of pagetables goes in zone-normal
> because the highmem is filled by cache. Choose whatever is your
> preferred example of real life bug fixed by the lowmem-reservation patch
> that is absolutely necessary to run stable on a big box with normal zone
> and highmem (not only a 64G box).
> The only place where you must not be concerned about these fixes are the
> 64bit archs.

64GB on 32-bit is in the territory where it's dead, either literally,
performance-wise, or by virtue of dropping hardware on the floor (as
it's basically no longer 64GB) due to deeper design limitations.

No idea why there's not more support behind or interest in page
clustering. It's an optimization (not required) for 64-bit/saner arches.


Bill


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  2:41             ` William Lee Irwin III
@ 2002-12-06  5:25               ` Andrew Morton
  2002-12-06  5:48                 ` Andrea Arcangeli
  2002-12-06  6:00                 ` William Lee Irwin III
  2002-12-06 22:28               ` Andrea Arcangeli
  1 sibling, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2002-12-06  5:25 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel

William Lee Irwin III wrote:
> 
> Yes, it's necessary; no, I've never directly encountered the issue it
> fixes. Sorry about the miscommunication there.

The google thing.

The basic problem is in allowing allocations which _could_ use
highmem to use the normal zone as anon memory or pagecache.

Because the app could mlock that memory.   So for a simple
demonstration:

- mem=2G
- read a 1.2G file
- malloc 800M, now mlock it.

Those 800M will be in ZONE_NORMAL, simply because that was where the
free memory was.  And you're dead, even though you've only mlocked
800M.  The same thing happens if you have lots of anon memory in the
normal zone and there is no swapspace available.

Linus's approach was to raise the ZONE_NORMAL pages_min limit for
allocations which _could_ use highmem.  So a GFP_HIGHUSER allocation
has a pages_min limit of (say) 4M when considering the normal zone,
but a GFP_KERNEL allocation has a limit of 2M.

Andrea's patch does the same thing, via a separate table.   He has
set the threshold much higher (100M on a 4G box).   AFAICT, the
algorithms are identical - I was planning on just adding a multiplier
to set Linus's ratio - it is currently hardwired to "1".   Search for 
"mysterious" in mm/page_alloc.c ;)

It's not clear to me why -aa defaults to 100 megs when the problem
only occurs with no swap or when the app is using mlock.  The default
multiplier (of variable local_min) should be zero.  Swapless machines
or heavy mlock users can crank it up.

But mlocking 700M on a 4G box would kill it as well.  The google
application, IIRC, mlocks 1G on a 2G machine.  Daniel put them
onto the 2G+2G split and all was well.

Anyway, thanks.   I'll take another look at Andrea's implementation.

Now, regarding mlock(mmap(open(/dev/hda1))) ;)


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  5:25               ` Andrew Morton
@ 2002-12-06  5:48                 ` Andrea Arcangeli
  2002-12-06  6:14                   ` William Lee Irwin III
  2002-12-06  6:55                   ` Andrew Morton
  2002-12-06  6:00                 ` William Lee Irwin III
  1 sibling, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06  5:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 09:25:15PM -0800, Andrew Morton wrote:
> William Lee Irwin III wrote:
> > 
> > Yes, it's necessary; no, I've never directly encountered the issue it
> > fixes. Sorry about the miscommunication there.
> 
> The google thing.
> 
> The basic problem is in allowing allocations which _could_ use
> highmem to use the normal zone as anon memory or pagecache.
> 
> Because the app could mlock that memory.   So for a simple
> demonstration:
> 
> - mem=2G
> - read a 1.2G file
> - malloc 800M, now mlock it.
> 
> Those 800M will be in ZONE_NORMAL, simply because that was where the
> free memory was.  And you're dead, even though you've only mlocked
> 800M.  The same thing happens if you have lots of anon memory in the
> normal zone and there is no swapspace available.
> 
> Linus's approach was to raise the ZONE_NORMAL pages_min limit for
> allocations which _could_ use highmem.  So a GFP_HIGHUSER allocation
> has a pages_min limit of (say) 4M when considering the normal zone,
> but a GFP_KERNEL allocation has a limit of 2M.
> 
> Andrea's patch does the same thing, via a separate table.   He has
> set the threshold much higher (100M on a 4G box).   AFAICT, the
> algorithms are identical - I was planning on just adding a multiplier
> to set Linus's ratio - it is currently hardwired to "1".   Search for 
> "mysterious" in mm/page_alloc.c ;)
> 
> It's not clear to me why -aa defaults to 100 megs when the problem
> only occurs with no swap or when the app is using mlock.  The default
> multiplier (of variable local_min) should be zero.  Swapless machines
> or heavy mlock users can crank it up.
> 
> But mlocking 700M on a 4G box would kill it as well.  The google
> application, IIRC, mlocks 1G on a 2G machine.  Daniel put them
> onto the 2G+2G split and all was well.
> 
> Anyway, thanks.   I'll take another look at Andrea's implementation.

You should, because it seems you didn't realize how my code works. The
algorithm is autotuned at boot and depends on the zone sizes, and it
applies to the dma zone too with respect to the normal zone; the highmem
case is just one of the cases that the fix for the general problem
resolves, and you're totally wrong saying that mlocking 700M on a 4G box
could kill it. I call it the per-classzone point-of-view watermark. If
you are capable of highmem (mlock users are) you must leave 100M or 10M
or 10G free in the normal zone (depending on the watermark setting tuned
at boot, calculated as a function of the zone sizes), so it
doesn't matter whether you mlock 700M or 700G, it can't kill it. The
split doesn't matter at all. 2.5 misses this important fix too, btw.
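
A minimal sketch of the per-classzone reservation being described, with a made-up ratio and hypothetical names (the real patch computes its watermarks from the actual zone sizes at boot):

```c
#include <assert.h>

/* Sketch: each lower zone gets a reserve, scaled to its size, that
 * highmem-capable allocations may never consume.  No amount of
 * highmem-capable (e.g. mlocked) memory can then exhaust ZONE_NORMAL,
 * because the fallback simply stops at the reserve. */

#define RESERVE_RATIO 32   /* assumed: reserve 1/32 of the lower zone */

static long lowmem_reserve(long zone_size_pages)
{
    return zone_size_pages / RESERVE_RATIO;
}

/* A highmem-capable allocation falling back into the normal zone must
 * leave the reserve intact, no matter how much was already allocated. */
static int fallback_allowed(long normal_free_pages, long normal_size_pages)
{
    return normal_free_pages > lowmem_reserve(normal_size_pages);
}
```

On a 4G box with a ~896M normal zone (229376 4k pages), this assumed ratio would reserve 7168 pages (~28M); mlocking 700M or 700G makes no difference to the outcome.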

If you ignore this bugfix people will notice, and there's no other way
to fix it completely (unless you want to drop zone-normal and
zone-dma entirely; actually zone-dma matters much less because even
though it exists, basically nobody uses it).

> 
> Now, regarding mlock(mmap(open(/dev/hda1))) ;)


Andrea


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  5:25               ` Andrew Morton
  2002-12-06  5:48                 ` Andrea Arcangeli
@ 2002-12-06  6:00                 ` William Lee Irwin III
  1 sibling, 0 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06  6:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel

William Lee Irwin III wrote:
>> Yes, it's necessary; no, I've never directly encountered the issue it
>> fixes. Sorry about the miscommunication there.

On Thu, Dec 05, 2002 at 09:25:15PM -0800, Andrew Morton wrote:
> Linus's approach was to raise the ZONE_NORMAL pages_min limit for
> allocations which _could_ use highmem.  So a GFP_HIGHUSER allocation
> has a pages_min limit of (say) 4M when considering the normal zone,
> but a GFP_KERNEL allocation has a limit of 2M.
> Andrea's patch does the same thing, via a separate table.   He has
> set the threshold much higher (100M on a 4G box).   AFAICT, the
> algorithms are identical - I was planning on just adding a multiplier
> to set Linus's ratio - it is currently hardwired to "1".   Search for 
> "mysterious" in mm/page_alloc.c ;)

There's no mystery here aside from a couple of magic numbers and a
not-very-well-explained admission control policy.

Tweaking magic numbers a la 2.4.x-aa until more infrastructure is
available (2.7) sounds good to me.

Thanks,
Bill


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  5:48                 ` Andrea Arcangeli
@ 2002-12-06  6:14                   ` William Lee Irwin III
  2002-12-06  6:55                   ` Andrew Morton
  1 sibling, 0 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06  6:14 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Fri, Dec 06, 2002 at 06:48:04AM +0100, Andrea Arcangeli wrote:
> You should, because it seems you didn't realize how my code works. The
> algorithm is autotuned at boot and depends on the zone sizes, and it
> applies to the dma zone too with respect to the normal zone; the highmem
> case is just one of the cases that the fix for the general problem
> resolves, and you're totally wrong saying that mlocking 700M on a 4G box
> could kill it. I call it the per-classzone point-of-view watermark. If
> you are capable of highmem (mlock users are) you must leave 100M or 10M
> or 10G free in the normal zone (depending on the watermark setting tuned
> at boot, calculated as a function of the zone sizes), so it
> doesn't matter whether you mlock 700M or 700G, it can't kill it. The
> split doesn't matter at all. 2.5 misses this important fix too, btw.
> If you ignore this bugfix people will notice, and there's no other way
> to fix it completely (unless you want to drop zone-normal and
> zone-dma entirely; actually zone-dma matters much less because even
> though it exists, basically nobody uses it).

This problem is not universal; pure GFP_KERNEL allocations are the main
problem here. The fix is necessary for anti-google bits but not a
panacea for all workloads. The issue here is basically forkbombs (i.e.
databases) with potentially high cross-process sharing.

Bill


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  5:48                 ` Andrea Arcangeli
  2002-12-06  6:14                   ` William Lee Irwin III
@ 2002-12-06  6:55                   ` Andrew Morton
  2002-12-06  7:14                     ` GrandMasterLee
  2002-12-06 14:57                     ` Andrea Arcangeli
  1 sibling, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2002-12-06  6:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel

Andrea Arcangeli wrote:
> 
> the
> algorithm is autotuned at boot and depends on the zone sizes, and it
> applies to the dma zone too with respect to the normal zone, the highmem
> case is just one of the cases that the fix for the general problem
> resolves,

Linus's incremental min will protect ZONE_DMA in the same manner.

> and you're totally wrong saying that mlocking 700m on a 4G box
> could kill it.

It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
I can't immediately think of anything apart from vma's which will
make it fall over, but it will run like crap.

> 2.5 misses this important fix too btw.

It does not appear to be an important fix at all.  There have been
zero reports of it on any mailing list which I read since the google
days.

Yes, it needs to be addressed.  But it is not worth taking 100 megabytes
of pagecache away from everyone.  That is just a matter of choosing
the default value.

2.5 has much bigger problems than this - radix_tree nodes and pte_chains
in particular.


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  6:55                   ` Andrew Morton
@ 2002-12-06  7:14                     ` GrandMasterLee
  2002-12-06  7:25                       ` Andrew Morton
  2002-12-06 14:57                     ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: GrandMasterLee @ 2002-12-06  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood,
	linux-kernel

On Fri, 2002-12-06 at 00:55, Andrew Morton wrote:
> Andrea Arcangeli wrote:
[...]
> > and you're totally wrong saying that mlocking 700m on a 4G box
> > could kill it.
> 
> It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> I can't immediately think of anything apart from vma's which will
> make it fall over, but it will run like crap.


Just curious, but how long would it take a system with 8GB RAM, using a
4G or 64G kernel, to fall over? One thing I've noticed is that 2.4.19aa2
runs great on a box with 8GB when I don't allocate all that much, but
seems to run into issues after a large DB has been running on it for
several days (i.e. the system gets generally a little slower, less
responsive, and in some cases crashes after 7 days).

Yes, I know, it sounds like a memory leak in something, but aside from
patching Oracle from 8.1.7.4 (the DBAs can't find any new patches ATM), I've
tried everything except changing my kernel.

Could this be similar behaviour?

--The GrandMaster


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  7:14                     ` GrandMasterLee
@ 2002-12-06  7:25                       ` Andrew Morton
  2002-12-06  7:34                         ` GrandMasterLee
  0 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2002-12-06  7:25 UTC (permalink / raw)
  To: GrandMasterLee
  Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood,
	linux-kernel

GrandMasterLee wrote:
> 
> On Fri, 2002-12-06 at 00:55, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> [...]
> > > and you're totally wrong saying that mlocking 700m on a 4G box
> > > could kill it.
> >
> > It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> > I can't immediately think of anything apart from vma's which will
> > make it fall over, but it will run like crap.
> 
> Just curious, but how long would it take a system with 8GB RAM, using 4G
> or 64G kernel to fall over?

A few seconds if you ran the wrong thing.  Never if you ran something
else.

> One thing I've noticed is that 2.4.19aa2
> runs great on a box with 8GB when I don't allocate all that much, but
> seems to run into issues after a large DB has been running on it for
> several days (i.e. the system gets generally a little slower, less
> responsive, and in some cases crashes after 7 days).

"crashes"?  kernel, or application?   What additional info is
available?
 
> Yes, I know, sounds like a memory leak in something, but aside from
> patching Oracle from 8.1.7.4(dba's can't find any new patches ATM), I've
> tried everything except changing my kernel.
> 
> Could this be similar behaviour?

No, it's something else.  Possibly a leak, possibly vma structures.

You should wait until the machine is sluggish, then capture
the output of:

	vmstat 1
	cat /proc/meminfo
	cat /proc/slabinfo
	ps aux


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  7:25                       ` Andrew Morton
@ 2002-12-06  7:34                         ` GrandMasterLee
  2002-12-06  7:51                           ` Andrew Morton
  0 siblings, 1 reply; 49+ messages in thread
From: GrandMasterLee @ 2002-12-06  7:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood,
	linux-kernel

On Fri, 2002-12-06 at 01:25, Andrew Morton wrote:
> GrandMasterLee wrote:
> > 
[...]
> > Just curious, but how long would it take a system with 8GB RAM, using 4G
> > or 64G kernel to fall over?
> 
> A few seconds if you ran the wrong thing.  Never if you ran something
> else.
> 
> > One thing I've noticed is that 2.4.19aa2
> > runs great on a box with 8GB when I don't allocate all that much, but
> > seems to run into issues after a large DB has been running on it for
> > several days (i.e. the system gets generally a little slower, less
> > responsive, and in some cases crashes after 7 days).
> 
> "crashes"?  kernel, or application?   What additional info is
> available?

Machine will panic. I've actually captured some and sent them to this
list, but I've been told that my stack was corrupt. Problem is, ATM, I
can't find a memory problem. Memtest86 locks up on test 4 (as in, the
machine needs hard booting), no matter whether it's 8GB or 4GB of RAM
installed, and no matter whether *known good* RAM is being tested. So I
don't think it's that per se.

> > Yes, I know, sounds like a memory leak in something, but aside from
> > patching Oracle from 8.1.7.4(dba's can't find any new patches ATM), I've
> > tried everything except changing my kernel.
> > 
> > Could this be similar behaviour?
> 
> No, it's something else.  Possibly a leak, possibly vma structures.

Could that yield a corrupt stack?

> You should wait until the machine is sluggish, then capture
> the output of:
> 
> 	vmstat 1
> 	cat /proc/meminfo
> 	cat /proc/slabinfo
> 	ps aux

I shall gather the information sometime 12/06/2002. TIA

--The GrandMaster


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  7:34                         ` GrandMasterLee
@ 2002-12-06  7:51                           ` Andrew Morton
  2002-12-06 11:37                             ` Christoph Hellwig
  2002-12-06 16:19                             ` GrandMasterLee
  0 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2002-12-06  7:51 UTC (permalink / raw)
  To: GrandMasterLee
  Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood,
	linux-kernel

GrandMasterLee wrote:
> 
> ...
> > "crashes"?  kernel, or application?   What additional info is
> > available?
> 
> Machine will panic. I've actually captured some and sent them to this
> list, but I've been told that my stack was corrupt.

OK.  In your second oops trace the `swapper' process had used 5k of its
8k kernel stack processing an XFS IO completion interrupt.  And I don't
think `swapper' uses much stack of its own.

If some other process happens to be using 3k of stack when the same 
interrupt hits it, it's game over.

So at a guess, I'd say you're being hit by excessive stack use in
the XFS filesystem.  I think the XFS team have done some work on that
recently so an upgrade may help.

Or it may be something completely different ;)


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  2:15         ` William Lee Irwin III
  2002-12-06  2:28           ` Andrea Arcangeli
@ 2002-12-06 10:36           ` Arjan van de Ven
  2002-12-06 14:23             ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Arjan van de Ven @ 2002-12-06 10:36 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrea Arcangeli, Andrew Morton, Norman Gaywood, linux-kernel


> 64GB isn't getting any testing that I know of; I'd hold off until
> someone's actually stood up and confessed to attempting to boot
> Linux on such a beast. Or until I get some more RAM. =)

United Linux at least has tested this, according to
http://www.unitedlinux.com/en/press/pr111902.html:

  "Hardware functionality is exploited through advanced features such as
  large memory support for up to 64 GB of RAM"

so I'm sure Andrea's VM deals with it gracefully


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  7:51                           ` Andrew Morton
@ 2002-12-06 11:37                             ` Christoph Hellwig
  2002-12-06 16:19                             ` GrandMasterLee
  1 sibling, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2002-12-06 11:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: GrandMasterLee, Andrea Arcangeli, William Lee Irwin III,
	Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 11:51:10PM -0800, Andrew Morton wrote:
> So at a guess, I'd say you're being hit by excessive stack use in
> the XFS filesystem.  I think the XFS team have done some work on that
> recently so an upgrade may help.

Yes, XFS 1.1 used a lot of stack.  XFS 1.2pre (and the stuff in 2.5)
uses much less.  He's also using the qla2xxx drivers that aren't exactly
stack-friendly either.



* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  1:27   ` Norman Gaywood
@ 2002-12-06 12:48     ` Rik van Riel
  0 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2002-12-06 12:48 UTC (permalink / raw)
  To: Norman Gaywood; +Cc: Pete Zaitcev, linux-kernel

On Fri, 6 Dec 2002, Norman Gaywood wrote:
> On Thu, Dec 05, 2002 at 07:35:49PM -0500, Pete Zaitcev wrote:

> > Check your /proc/slabinfo, just in case, to rule out a leak.
>
> Here is a /proc/slabinfo diff of a good system and a very sluggish one:

> > inode_cache       305071 305081    512 43583 43583    1 :  124   62
> > buffer_head       3431966 3432150    128 114405 114405    1 :  252  126

Guess what?  Roughly 150 MB in inode cache and 420 MB in buffer heads,
or about 570 MB of zone_normal eaten by just these two items.
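
As a sanity check, the memory pinned by a slab cache can be estimated from the slabinfo counts as object count times object size (a rough sketch; real slabs add per-page overhead, so the true figure is slightly higher):

```c
#include <assert.h>

/* Lower bound on memory held by a slab cache:
 * (number of objects) * (object size). */
static long long slab_bytes(long long objs, long long obj_size)
{
    return objs * obj_size;
}

/* From the slabinfo diff quoted above:
 *   inode_cache: 305071 objects * 512 bytes  -> ~149 MiB
 *   buffer_head: 3431966 objects * 128 bytes -> ~419 MiB */
```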

Looks like the RH kernel needs Stephen Tweedie's patch to
reclaim the buffer heads once IO is done ;)

regards,

Rik
-- 
A: No.
Q: Should I include quotations after my reply?
http://www.surriel.com/		http://guru.conectiva.com/


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 10:36           ` Arjan van de Ven
@ 2002-12-06 14:23             ` William Lee Irwin III
  2002-12-06 15:12               ` William Lee Irwin III
  0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06 14:23 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrea Arcangeli, Andrew Morton, Norman Gaywood, linux-kernel

At some point in the past, I wrote:
>> 64GB isn't getting any testing that I know of; I'd hold off until
>> someone's actually stood up and confessed to attempting to boot
>> Linux on such a beast. Or until I get some more RAM. =)

On Fri, Dec 06, 2002 at 11:36:15AM +0100, Arjan van de Ven wrote:
> United Linux at least has tested this according to
> http://www.unitedlinux.com/en/press/pr111902.html
> Hardware functionality is exploited through advanced features such as
> large memory support for up to 64 GB of RAM
> so I'm sure Andrea's VM deals with it gracefully

I'm not convinced of grace even if I were to take it from this that it
was directly tested, which seems doubtful given the nature of the page.
The page reads more like a statement that CONFIG_HIGHMEM64G is an option.

And besides, the report is useless unless it's got actual technical
content and descriptions reported by a kernel hacker.


Bill


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  6:55                   ` Andrew Morton
  2002-12-06  7:14                     ` GrandMasterLee
@ 2002-12-06 14:57                     ` Andrea Arcangeli
  2002-12-06 15:12                       ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 14:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 10:55:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > the
> > algorithm is autotuned at boot and depends on the zone sizes, and it
> > applies to the dma zone too with respect to the normal zone, the highmem
> > case is just one of the cases that the fix for the general problem
> > resolves,
> 
> Linus's incremental min will protect ZONE_DMA in the same manner.

of how many bytes?

> 
> > and you're totally wrong saying that mlocking 700m on a 4G box
> > could kill it.
> 
> It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> I can't immediately think of anything apart from vma's which will
> make it fall over, but it will run like crap.

You're missing the whole point. The vmas are zone-normal users. You're
saying that you can run out of ZONE_NORMAL if you run
alloc_page(GFP_KERNEL) some hundred thousand times. Yeah, that's not
big news.

I'm saying you *can't* run out of zone-normal due to highmem
allocations, i.e. if you run alloc_pages(GFP_HIGHMEM), period.

that's a completely different thing.

I thought you understood what the problem is; I'm not sure why you say
you can run out of zone-normal by running alloc_page(GFP_KERNEL) 100000
times. That has *nothing* to do with the bug we're discussing here: if
you don't want to run out of zone-normal after 100000 GFP_KERNEL page
allocations, you can only drop zone-normal.

The bug we're discussing here is that without my fix you will run out of
zone-normal even though you haven't started allocating zone-normal yet,
and even though you still have 60G free in the highmem zone. This is
what the patch prevents, nothing more, nothing less.

And it's not so much specific to google; they were just unlucky in
triggering it. As said, just allocate plenty of pagetables (they are
highmem-capable in my tree and 2.5) or swapoff -a, and you'll run into
the very same scenario that needs my fix, in any normal workload that
allocates more than a few hundred mbytes of ram.

And this is definitely a generic problem, not even specific to Linux;
it's an OS-wide design problem in balancing different zones that have
overlapping but not equivalent capabilities. It even applies to zone-dma
with respect to zone-normal and zone-highmem, and there's no other fix
around it at the moment.

Mainline fixes it in a very weak way: it reserves only a few megs, and
that's not nearly enough if you need to allocate more than one more
inode etc. The lowmem reservation must allow the machine to run
interesting workloads for the whole uptime, not merely defer the failure
by a few seconds. A few megs aren't nearly enough.

If interesting workloads need a huge zone-normal, just reserve more of it
at boot and they will work. If all of zone-normal isn't enough, you fall
into a totally different problem, namely the existence of zone-normal in
the first place, and that has nothing to do with this bug; you can fix
that other problem only by dropping zone-normal (of course if you do
that you will in turn fix this problem too, but the problems are
different).

The only alternate fix is to be able to migrate pagetables (1st level
only, pte) and all the other highmem-capable allocations at runtime
(pagecache, shared memory etc.), which is clearly not possible in 2.5
and 2.4.

Once that is possible/implemented, my fix can go away and you can
simply migrate the highmem-capable allocations from zone-normal to
highmem. That would be the only alternate, and also dynamic/superior,
fix, but it's not feasible at the moment, at the very least not in 2.4.
It would also have some performance implications; I'm sure lots of
people would prefer to throw away 500M of ram on a 32G machine rather
than risk spending the cpu time in memcopies, so it would not be *that*
superior; it would be inferior in some ways.

Reserving 500M of ram on a 32G machine doesn't really matter at all, so
the current fix is certainly the best thing we can do for 2.4, and for
2.5 too unless you want to implement highmem migration for all highmem
capable kernel objects (which would work fine too).

Also, your possible multiplier via sysctl remains much inferior to
my fix, which is able to cleanly enforce classzone-point-of-view
watermarks (not fixed watermarks); you would need to change the
multiplier depending on the zone size and on the zone itself to make
it equivalent. So yes, you could implement it equivalently, but it would
be much less clean and readable than my current code (and harder to
tune with a kernel parameter at boot, as my current fix allows).

> > 2.5 misses this important fix too btw.
> 
> It does not appear to be an important fix at all.  There have been

Well, if you ignore it people can use my tree; I personally need that
fix for myself on big boxes, so I'm going to retain it in one form or
another (the form in mainline is too weak, as said, and just adding a
multiplier would not be equivalent, as said above).

> 2.5 has much bigger problems than this - radix_tree nodes and pte_chains
> in particular.

I'm not saying there aren't bigger problems in 2.5, but I don't classify
this one as minor; in fact it was a showstopper for a long time in
2.4 (one of the last ones) until I fixed it, and it is still a problem
because the 2.4 fix is way too weak (a few megs aren't enough to
guarantee that big workloads succeed).

Andrea


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 14:57                     ` Andrea Arcangeli
@ 2002-12-06 15:12                       ` William Lee Irwin III
  2002-12-06 23:32                         ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06 15:12 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Fri, Dec 06, 2002 at 03:57:19PM +0100, Andrea Arcangeli wrote:
> The only alternate fix is to be able to migrate pagetables (1st level
> only, pte) and all the other highmem-capable allocations at runtime
> (pagecache, shared memory etc.), which is clearly not possible in 2.5
> and 2.4.

Actually it should not be difficult for 2.5, though it's not done now.
Shared pagetables would complicate the implementation slightly. I've
gotten 100% backlash from my proposals in this area, so I'm not
touching it at all out of aggravation or whatever.


Bill


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 14:23             ` William Lee Irwin III
@ 2002-12-06 15:12               ` William Lee Irwin III
  2002-12-06 22:34                 ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06 15:12 UTC (permalink / raw)
  To: Arjan van de Ven, Andrea Arcangeli, Andrew Morton, Norman Gaywood,
	linux-kernel

On Fri, Dec 06, 2002 at 11:36:15AM +0100, Arjan van de Ven wrote:
>> United Linux at least has tested this according to
>> http://www.unitedlinux.com/en/press/pr111902.html
>> Hardware functionality is exploited through advanced features such as
>> large memory support for up to 64 GB of RAM
>> so I'm sure Andrea's VM deals with it gracefully

On Fri, Dec 06, 2002 at 06:23:02AM -0800, William Lee Irwin III wrote:
> I'm not convinced of grace even if I were to take it from this that it
> were directly tested, which seems doubtful given the nature of the page.
> This page sounds more like CONFIG_HIGHMEM64G is an option.
> And besides, the report is useless unless it's got actual technical
> content and descriptions reported by an kernel hacker.

Well, since I've not seen recent attempts at the Right Way To Do It (TM),
there's also a remote possibility of someone changing the user/kernel
split just to get a bloated mem_map to fit. Many of the smaller apps,
e.g. /bin/sh etc. are indifferent to the ABI violation.


Bill


* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  7:51                           ` Andrew Morton
  2002-12-06 11:37                             ` Christoph Hellwig
@ 2002-12-06 16:19                             ` GrandMasterLee
  1 sibling, 0 replies; 49+ messages in thread
From: GrandMasterLee @ 2002-12-06 16:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, William Lee Irwin III, Norman Gaywood,
	linux-kernel

On Fri, 2002-12-06 at 01:51, Andrew Morton wrote:
> GrandMasterLee wrote:
> > 
> > ...
> > > "crashes"?  kernel, or application?   What additional info is
> > > available?
> > 
> > Machine will panic. I've actually captured some and sent them to this
> > list, but I've been told that my stack was corrupt.
> 
> OK.  In your second oops trace the `swapper' process had used 5k of its
> 8k kernel stack processing an XFS IO completion interrupt.  And I don't
> think `swapper' uses much stack of its own.

The second oops is the *best* one IMO. I got it after just over 7 days
of uptime (7 days 6 hours or something). I've still been testing the
crud out of this kernel on similar hardware, and can't reproduce it. I'd
love to know a method for reproducing this for my beta environment.

> If some other process happens to be using 3k of stack when the same 
> interrupt hits it, it's game over.
> 
> So at a guess, I'd say you're being hit by excessive stack use in
> the XFS filesystem.  I think the XFS team have done some work on that
> recently so an upgrade may help.

Since we run ~1TB dbs on the systems, with a LOT of IO, and Qlogic
drivers, I think that's the culprit. Will swapper use less stack in more
recent kernels? (XFS will be updated as part of a plan for the new year
I'm putting together. Till then, it's a reboot every 7 days.)


> Or it may be something completely different ;)

I hope not. :)

--The GrandMaster


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06  2:41             ` William Lee Irwin III
  2002-12-06  5:25               ` Andrew Morton
@ 2002-12-06 22:28               ` Andrea Arcangeli
  2002-12-06 23:21                 ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 22:28 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Norman Gaywood,
	linux-kernel

On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote:
> No idea why there's not more support behind or interest in page
> clustering. It's an optimization (not required) for 64-bit/saner arches.

softpagesize sounds like a good idea to try for archs with a page size
< 8k indeed, modulo a few places where the 4k page size is part of the
userspace ABI. For that reason, on x86-64 Andi recently suggested
changing the ABI to assume a bigger page size, and I suggested assuming
2M rather than something smaller as originally proposed. That way we
waste some more virtual space (not an issue on 64bit) and some cache
color (not a big deal either, those caches are multiway associative even
if not fully associative), so eventually in theory we could even switch
the page size to 2M ;)

however, don't confuse softpagesize with PAGE_CACHE_SIZE (the latter I
think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE is
a broken idea (I'm talking about PAGE_CACHE_SIZE at large, not about
the implementation, which may even be fine with Hugh's patch applied).

PAGE_CACHE_SIZE will never work well due to the fragmentation problems
it introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and
experimenting with a soft PAGE_SIZE that is a multiple of the hardware
PAGE_SIZE. That means the allocator's minimal granularity would be 8k.
On x86 that breaks the ABI a bit; on x86-64 the softpagesize would break
only the 32bit compatibility mode ABI a little, so it would be even less
severe. And I think the softpagesize should be a config option, so it
can be experimented with without breaking the default config even on x86.

the soft PAGE_SIZE will also decrease the page fault rate by an order of
magnitude: the number of ptes will be the same, but we'll cluster the pte
refills, all served from the same I/O anyway (readahead usually loads
the next pages too). So it's a quite obvious design optimization to
experiment with (maybe for 2.7?).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 15:12               ` William Lee Irwin III
@ 2002-12-06 22:34                 ` Andrea Arcangeli
  2002-12-07 18:27                   ` Eric W. Biederman
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 22:34 UTC (permalink / raw)
  To: William Lee Irwin III, Arjan van de Ven, Andrew Morton,
	Norman Gaywood, linux-kernel

On Fri, Dec 06, 2002 at 07:12:38AM -0800, William Lee Irwin III wrote:
> split just to get a bloated mem_map to fit. Many of the smaller apps,
> e.g. /bin/sh etc. are indifferent to the ABI violation.

the problem of the split is that it would reduce the address space
available to userspace, which is quite critical on big machines (one of
the big advantages of 64bit, and one that can't be fixed on 32bit), but
I wouldn't classify it as an ABI violation. In fact, the little I can
remember about the 2.0 kernels [I almost never read that code] is that
they had a separate address space and a tlb flush while entering/exiting
the kernel, so I can bet the user stack in 2.0 was put at 4G, not at 3G.
2.2 had to put it at 3G because then the address space was shared, with
the obvious performance advantages. So while I didn't read any ABI, I
deduce you can't say the ABI got broken if the stack is put at 2G or 1G
or 3.5G, or at 4G again with x86-64 (of course x86-64 can give the full
4G to userspace because the kernel runs in the negative part of the
[64bit] address space, as 2.0 could too).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 22:28               ` Andrea Arcangeli
@ 2002-12-06 23:21                 ` William Lee Irwin III
  2002-12-06 23:50                   ` Andrea Arcangeli
  2002-12-07  0:01                   ` Andrew Morton
  0 siblings, 2 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06 23:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote:
>> No idea why there's not more support behind or interest in page
>> clustering. It's an optimization (not required) for 64-bit/saner arches.

On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> softpagesize sounds a good idea to try for archs with a page size < 8k
> indeed, modulo a few places where the 4k pagesize is part of the
> userspace abi, for that reason on x86-64 Andi recently suggested to
> changed the abi to assume a bigger page size and I suggested to assume
> it to be 2M and not a smaller thing as originally suggested, that way we
> waste some more virtual space (not an issue on 64bit) and some cache
> color (not a big deal either, those caches are multiway associative even
> if not fully associative), so eventually in theory we could even switch
> the page size to 2M ;)

The patch I'm talking about introduces a distinction between the size
of an area mapped by a PTE or TLB entry (MMUPAGE_SIZE) and the kernel's
internal allocation unit (PAGE_SIZE), and does (AFAICT) properly
vectored PTE operations in the VM to support the system's native page
size, and does a whole kernel audit of drivers/ and fs/ PAGE_SIZE usage
so that the distinction between PAGE_SIZE and MMUPAGE_SIZE is understood.


On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> however don't mistake softpagesize with the PAGE_CACHE_SIZE (the latter
> I think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE
> is a broken idea (i'm talking about the PAGE_CACHE_SIZE at large, not
> about the implementation that may even be fine with Hugh's patch
> applied).

PAGE_CACHE_SIZE is mostly an fs thing, so there's not much danger of
confusion, at least not in my mind.


On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> PAGE_CACHE_SIZE will never work well due the fragmentation problems it
> introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and to
> experiment with a soft PAGE_SIZE, multiple of the hardware PAGE_SIZE.
> That means the allocator minimal granularity will return 8k. on x86 that
> breaks a bit the ABI. on x86-64 the softpagesize would breaks only the 32bit
> compatibilty mode abi a little so it would be even less severe. And I
> think the softpagesize should be a config option so it can be
> experimented without breaking the default config even on x86.

Hmm, from the appearances of the patch (my ability to test the patch
is severely hampered by its age) it should actually maintain hardware
pagesize mmap() granularity, ABI compatibility, etc.


On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> the soft PAGE_SIZE will also decrease of an order of magnitude the page
> fault rate, the number of pte will be the same but we'll cluster the pte
> refills all served from the same I/O anyways (readhaead usually loads
> the next pages too anyways). So it's a kind of quite obvious design
> optimization to experiment with (maybe for 2.7?).

Sounds like the right timing for me.

A 16KB or 64KB kernel allocation unit would then annihilate
sizeof(mem_map) concerns on 3/1 splits. 720MB -> 180MB or 45MB.
Or on my home machine (768MB PC) 6MB -> 1.5MB or 384KB, which
is a substantial reduction in cache footprint and outright
memory footprint.

I think this is a perfect example of how the increased awareness of
space consumption that highmem gives us helps us optimize all boxen.


Thanks,
Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 15:12                       ` William Lee Irwin III
@ 2002-12-06 23:32                         ` Andrea Arcangeli
  2002-12-06 23:45                           ` William Lee Irwin III
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 23:32 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Norman Gaywood,
	linux-kernel

On Fri, Dec 06, 2002 at 07:12:20AM -0800, William Lee Irwin III wrote:
> On Fri, Dec 06, 2002 at 03:57:19PM +0100, Andrea Arcangeli wrote:
> > The only alternate fix is to be able to migrate pagetables (1st level
> > only, pte) and all the other highmem capable allocations at runtime
> > (pagecache, shared memory etc..). Which is clearly not possible in 2.5
> > and 2.4.
> 
> Actually it should not be difficult for 2.5, though it's not done now.

"difficult" is a relative word; nothing is difficult and everything is
difficult, depending on how you feel about it.

but note that even with rmap you don't know the pmd that points to the
pte that you want to relocate, and for the anon pages you miss the
information about the mm and the virtual address where those pages are
allocated, so basically rmap is useless for doing it: you need to do the
pagetable walking a la swap_out. In turn, it's not easier at all in 2.5
than it could have been in 2.4 (of course this is a 2.5 thing only; I
just want to say that if it's not difficult in 2.5 it wasn't difficult
in 2.4 either).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 23:32                         ` Andrea Arcangeli
@ 2002-12-06 23:45                           ` William Lee Irwin III
  2002-12-06 23:57                             ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-06 23:45 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Sat, Dec 07, 2002 at 12:32:43AM +0100, Andrea Arcangeli wrote:
> but note that even with rmap you don't know the pmd that points to the
> pte that you want to relocate and for the anon pages you miss
> information about mm and virtual address where those pages are
> allocated, so basically rmap is useless for doing it, you need to do the
> pagetable walking ala swap_out, in turn it's not easier at all in 2.5
> than it could been in 2.4 (but of course this is a 2.5 thing only, I
> just want to say that if it's not difficult in 2.5 it wasn't difficult
> in 2.4 either).

Actually, we do. From include/asm-generic/rmap.h:

static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
{
#ifdef BROKEN_PPC_PTE_ALLOC_ONE
	/* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
	extern int mem_init_done;

	if (!mem_init_done)
		return;
#endif
	page->mapping = (void *)mm;
	page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
	inc_page_state(nr_page_table_pages);
}

So pagetable pages are tagged with the right information, and in
principle could even be tagged here with the pmd in page->private.

These fields are actually required for use by try_to_unmap_one(),
and something similar could be done for a try_to_move_one(). This
information remains intact with shared pagetables, and is generalized
so that the PTE page is tagged with a list of mm's (the mm_chain),
and in that case no unique pmd could be directly stored in the page,
but it could just as easily be derived from the mm's in the mm_chain.

But there's no denying it would involve a substantial amount of work.


Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 23:21                 ` William Lee Irwin III
@ 2002-12-06 23:50                   ` Andrea Arcangeli
  2002-12-07  0:30                     ` William Lee Irwin III
  2002-12-07  0:01                   ` Andrew Morton
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 23:50 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Norman Gaywood,
	linux-kernel

On Fri, Dec 06, 2002 at 03:21:25PM -0800, William Lee Irwin III wrote:
> On Thu, Dec 05, 2002 at 06:41:40PM -0800, William Lee Irwin III wrote:
> >> No idea why there's not more support behind or interest in page
> >> clustering. It's an optimization (not required) for 64-bit/saner arches.
> 
> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > softpagesize sounds a good idea to try for archs with a page size < 8k
> > indeed, modulo a few places where the 4k pagesize is part of the
> > userspace abi, for that reason on x86-64 Andi recently suggested to
> > changed the abi to assume a bigger page size and I suggested to assume
> > it to be 2M and not a smaller thing as originally suggested, that way we
> > waste some more virtual space (not an issue on 64bit) and some cache
> > color (not a big deal either, those caches are multiway associative even
> > if not fully associative), so eventually in theory we could even switch
> > the page size to 2M ;)
> 
> The patch I'm talking about introduces a distinction between the size
> of an area mapped by a PTE or TLB entry (MMUPAGE_SIZE) and the kernel's
> internal allocation unit (PAGE_SIZE), and does (AFAICT) properly
> vectored PTE operations in the VM to support the system's native page
> size, and does a whole kernel audit of drivers/ and fs/ PAGE_SIZE usage
> so that the distinction between PAGE_SIZE and MMUPAGE_SIZE is understood.

My point is that making any distinction will lead to inevitable
fragmentation of memory.

Going to a higher kernel-wide PAGE_SIZE and avoiding the distinction
will even fix the 8k fragmentation issue with the kernel stack ;) Not to
mention that it would allow more workloads to use all the ram of the
32bit 64G boxes.

> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > however don't mistake softpagesize with the PAGE_CACHE_SIZE (the latter
> > I think was completed some time ago by Hugh). I think PAGE_CACHE_SIZE
> > is a broken idea (i'm talking about the PAGE_CACHE_SIZE at large, not
> > about the implementation that may even be fine with Hugh's patch
> > applied).
> 
> PAGE_CACHE_SIZE is mostly an fs thing, so there's not much danger of
> confusion, at least not in my mind.

ok, I thought MMUPAGE_SIZE and PAGE_CACHE_SIZE were related, but of
course they don't need to be.

> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > PAGE_CACHE_SIZE will never work well due the fragmentation problems it
> > introduces. So I definitely vote for dropping PAGE_CACHE_SIZE and to
> > experiment with a soft PAGE_SIZE, multiple of the hardware PAGE_SIZE.
> > That means the allocator minimal granularity will return 8k. on x86 that
> > breaks a bit the ABI. on x86-64 the softpagesize would breaks only the 32bit
> > compatibilty mode abi a little so it would be even less severe. And I
> > think the softpagesize should be a config option so it can be
> > experimented without breaking the default config even on x86.
> 
> Hmm, from the appearances of the patch (my ability to test the patch
> is severely hampered by its age) it should actually maintain hardware
> pagesize mmap() granularity, ABI compatibility, etc.

If it only implements the MMUPAGE_SIZE, yes, it can.

You break the ABI as soon as you change the kernel-wide PAGE_SIZE; it
is allowed only for 64bit binaries running on an x86-64 kernel. The
32bit binaries running in compatibility mode, as said, would suffer a
bit, but most things should run, and we can make hacks like using anon
mappings if the files are small, just for the sake of running some
32bit app (like we use anon mappings for a.out binaries needing 1k
offsets today).

That said, even MMUPAGE_SIZE alone would be useful, but I'd prefer the
kernel-wide PAGE_SIZE to be increased (with the disadvantage of breaking
the ABI, but it would be a config option; even the 2G/3.5G/1G split has
a chance of breaking some app, despite the fact that I wouldn't classify
it as an ABI violation, for the reason explained in one of my last
emails).

> On Fri, Dec 06, 2002 at 11:28:52PM +0100, Andrea Arcangeli wrote:
> > the soft PAGE_SIZE will also decrease of an order of magnitude the page
> > fault rate, the number of pte will be the same but we'll cluster the pte
> > refills all served from the same I/O anyways (readhaead usually loads
> > the next pages too anyways). So it's a kind of quite obvious design
> > optimization to experiment with (maybe for 2.7?).
> 
> Sounds like the right timing for me.
> 
> A 16KB or 64KB kernel allocation unit would then annihilate
> sizeof(mem_map) concerns on 3/1 splits. 720MB -> 180MB or 45MB.
>
> Or on my home machine (768MB PC) 6MB -> 1.5MB or 384KB, which
> is a substantial reduction in cache footprint and outright
> memory footprint.

Yep.

> 
> I think this is a perfect example of how the increased awareness of
> space consumption highmem gives us helps us optimize all boxen.

Funnily, in this case it has a chance of helping some 64bit boxes too ;).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 23:45                           ` William Lee Irwin III
@ 2002-12-06 23:57                             ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-06 23:57 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Norman Gaywood,
	linux-kernel

On Fri, Dec 06, 2002 at 03:45:24PM -0800, William Lee Irwin III wrote:
> On Sat, Dec 07, 2002 at 12:32:43AM +0100, Andrea Arcangeli wrote:
> > but note that even with rmap you don't know the pmd that points to the
> > pte that you want to relocate and for the anon pages you miss
> > information about mm and virtual address where those pages are
> > allocated, so basically rmap is useless for doing it, you need to do the
> > pagetable walking ala swap_out, in turn it's not easier at all in 2.5
> > than it could been in 2.4 (but of course this is a 2.5 thing only, I
> > just want to say that if it's not difficult in 2.5 it wasn't difficult
> > in 2.4 either).
> 
> Actually, we do. From include/asm-generic/rmap.h:
> 
> static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
> {
> #ifdef BROKEN_PPC_PTE_ALLOC_ONE
> 	/* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
> 	extern int mem_init_done;
> 
> 	if (!mem_init_done)
> 		return;
> #endif
> 	page->mapping = (void *)mm;
> 	page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
> 	inc_page_state(nr_page_table_pages);
> }
> 
> So pagetable pages are tagged with the right information, and in
> principle could even be tagged here with the pmd in page->private.

sorry, I hadn't noticed the overloading of page->mapping to store the
mm. But yes, I should have realized that you had to, because otherwise
you wouldn't know how to flush the tlb ;) so without the mm and address,
rmap would be useless. So via the address and the mapping you can walk
the pagetables and reach it with lower complexity than without rmap.
Still, doing the pagetable walk wouldn't be a huge increase in code
complexity, but it would increase the computational complexity of the
algorithm.

> These fields are actually required for use by try_to_unmap_one(),
> and something similar could be done for a try_to_move_one(). This
> information remains intact with shared pagetables, and is generalized
> so that the PTE page is tagged with a list of mm's (the mm_chain),
> and in that case no unique pmd could be directly stored in the page,
> but it could just as easily be derived from the mm's in the mm_chain.
> 
> But there's no denying it would involve a substantial amount of work.
> 
> 
> Bill


Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 23:21                 ` William Lee Irwin III
  2002-12-06 23:50                   ` Andrea Arcangeli
@ 2002-12-07  0:01                   ` Andrew Morton
  2002-12-07  0:21                     ` William Lee Irwin III
                                       ` (3 more replies)
  1 sibling, 4 replies; 49+ messages in thread
From: Andrew Morton @ 2002-12-07  0:01 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel

William Lee Irwin III wrote:
> 
> ...
> A 16KB or 64KB kernel allocation unit would then annihilate

You want to be careful about this:

	CPU: L1 I cache: 16K, L1 D cache: 16K

Because instantiating a 16k page into user pagetables in
one hit means that it must all be zeroed.  With these large
pagesizes that means that the application is likely to get
100% L1 misses against the new page, whereas it currently
gets 100% hits.

I'd expect this performance dropoff to occur when going from 8k
to 16k.  By the time you get to 32k it would be quite bad.

One way to address this could be to find a way of making the
pages present, but still cause a fault on first access.  Then
have a special-case fastpath in the fault handler to really wipe
the page just before it is used.  I don't know how though - maybe
_PAGE_USER?

get_user_pages() would need attention too - you don't want to
allow the user to perform O_DIRECT writes of uninitialised
pages to their files...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  0:01                   ` Andrew Morton
@ 2002-12-07  0:21                     ` William Lee Irwin III
  2002-12-07  0:30                       ` Andrew Morton
  2002-12-07  2:19                       ` Alan Cox
  2002-12-07  0:22                     ` Andrea Arcangeli
                                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-07  0:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel

> William Lee Irwin III wrote:
> > 
> > ...
> > A 16KB or 64KB kernel allocation unit would then annihilate
> 
On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> You want to be careful about this:
> 	CPU: L1 I cache: 16K, L1 D cache: 16K
> Because instantiating a 16k page into user pagetables in
> one hit means that it must all be zeroed.  With these large
> pagesizes that means that the application is likely to get
> 100% L1 misses against the new page, whereas it currently
> gets 100% hits.

16K is reasonable; after that one might as well go all the way.
About the only way to cope is amortizing it by caching zeroed pages,
and that has other downsides.


Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  0:01                   ` Andrew Morton
  2002-12-07  0:21                     ` William Lee Irwin III
@ 2002-12-07  0:22                     ` Andrea Arcangeli
  2002-12-07  0:35                       ` Andrew Morton
  2002-12-07  0:46                     ` William Lee Irwin III
  2002-12-07 10:55                     ` Arjan van de Ven
  3 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-07  0:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel

On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> William Lee Irwin III wrote:
> > 
> > ...
> > A 16KB or 64KB kernel allocation unit would then annihilate
> 
> You want to be careful about this:
> 
> 	CPU: L1 I cache: 16K, L1 D cache: 16K
> 
> Because instantiating a 16k page into user pagetables in
> one hit means that it must all be zeroed.  With these large
> pagesizes that means that the application is likely to get
> 100% L1 misses against the new page, whereas it currently
> gets 100% hits.
> 
> I'd expect this performance dropoff to occur when going from 8k
> to 16k.  By the time you get to 32k it would be quite bad.
> 
> One way to address this could be to find a way of making the
> pages present, but still cause a fault on first access.  Then
> have a special-case fastpath in the fault handler to really wipe
> the page just before it is used.  I don't know how though - maybe
> _PAGE_USER?

I think taking the page fault itself is the biggest overhead; that's
what would be nice to avoid on every second virtually consecutive page.
If we have to take the page fault on every page anyway, we could just
as well do the rest of the work, which should not be that big compared
to the overhead of entering/exiting the kernel and preparing to handle
the fault.

> 
> get_user_pages() would need attention too - you don't want to
> allow the user to perform O_DIRECT writes of uninitialised
> pages to their files...


Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 23:50                   ` Andrea Arcangeli
@ 2002-12-07  0:30                     ` William Lee Irwin III
  0 siblings, 0 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-07  0:30 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Norman Gaywood, linux-kernel

On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote:
> My point is that making any distinction will lead to inevitable
> fragmentation of memory.

It's mostly userspace; the kernel is usually (hello, drivers/!) cautious
and uses slab.c's anti-internal-fragmentation techniques for most structs.


At some point in the past, I wrote:
>> Hmm, from the appearances of the patch (my ability to test the patch
>> is severely hampered by its age) it should actually maintain hardware
>> pagesize mmap() granularity, ABI compatibility, etc.

On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote:
> If it only implements the MMUPAGE_SIZE, yes, it can.
> You break the ABI as soon as you change the kernel wide PAGE_SIZE. it is
> allowed only on 64bit binaries running on a x86-64 kernel.  The 32bit
> binaries running in compatibility mode as said would suffer a bit, but
> most things should run and we can make hacks like using anon mappings if
> the files are small just for the sake of running some app 32bit (like we
> use anon mappings for a.out binaries needing 1k offsets today).

I'm not sure what to make of this. The distinction and PTE vectoring
API (AFAICT) allows PTE's to map sub-PAGE_SIZE-sized (MMUPAGE_SIZE to
be exact) regions. Someone start screaming if I misread the patch.


On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote:
> Said that even the MMUPAGE_SIZE alone would be useful, but I'd prefer if
> the kernel wide PAGE_SIZE would be increased (with the disavantage of
> breaking the ABI, but it would be a config option, even the 2G/3.5G/1G
> split has the chance of breaking some app despite I wouldn't classify it
> as an ABI violation for the reason explained in one of the last emails).

Userspace is required to have >= 3GB of virtualspace, according to the
SVR4 i386 ABI spec. But we don't follow that strictly anyway.


At some point in the past, I wrote:
>> I think this is a perfect example of how the increased awareness of
>> space consumption highmem gives us helps us optimize all boxen.

On Sat, Dec 07, 2002 at 12:50:32AM +0100, Andrea Arcangeli wrote:
> In this case funnily it has a chance to help some 64bit boxes too ;).

I've heard the sizeof(mem_map) footprint is worse on 64-bit because
while PAGE_SIZE remains the same, pointers double in size. This would
help a bit there, too.


Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  0:21                     ` William Lee Irwin III
@ 2002-12-07  0:30                       ` Andrew Morton
  2002-12-07  2:19                       ` Alan Cox
  1 sibling, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2002-12-07  0:30 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel

William Lee Irwin III wrote:
> 
> > William Lee Irwin III wrote:
> > >
> > > ...
> > > A 16KB or 64KB kernel allocation unit would then annihilate
> >
> On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> > You want to be careful about this:
> >       CPU: L1 I cache: 16K, L1 D cache: 16K
> > Because instantiating a 16k page into user pagetables in
> > one hit means that it must all be zeroed.  With these large
> > pagesizes that means that the application is likely to get
> > 100% L1 misses against the new page, whereas it currently
> > gets 100% hits.
> 
> 16K is reasonable; after that one might as well go all the way.

16k will cause serious slowdowns.

> About the only way to cope is amortizing it by cacheing zeroed pages,
> and that has other downsides.

So will that.  You've seen the kernbench profiles...

You will need to find a way to clear the page just before it
is used.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  0:22                     ` Andrea Arcangeli
@ 2002-12-07  0:35                       ` Andrew Morton
  0 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2002-12-07  0:35 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: William Lee Irwin III, Norman Gaywood, linux-kernel

Andrea Arcangeli wrote:
> 
> > One way to address this could be to find a way of making the
> > pages present, but still cause a fault on first access.  Then
> > have a special-case fastpath in the fault handler to really wipe
> > the page just before it is used.  I don't know how though - maybe
> > _PAGE_USER?
> 
> I think taking the page fault itself is the biggest overhead that would
> be nice to avoid on every second virtually consecutive page, if we've to
> take the page fault on every page we could as well do the rest of the
> work that should not that big compared to the overhead of
> entering/exiting kernel and preparing to handle the fault.

Yes, 8k at a time would probably be OK.  Say, L1-size/2.

I expect that anything bigger would cause 2x or worse slowdowns of a
range of apps.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  0:01                   ` Andrew Morton
  2002-12-07  0:21                     ` William Lee Irwin III
  2002-12-07  0:22                     ` Andrea Arcangeli
@ 2002-12-07  0:46                     ` William Lee Irwin III
  2002-12-07 10:55                     ` Arjan van de Ven
  3 siblings, 0 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-07  0:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Norman Gaywood, linux-kernel

On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> One way to address this could be to find a way of making the
> pages present, but still cause a fault on first access.  Then
> have a special-case fastpath in the fault handler to really wipe
> the page just before it is used.  I don't know how though - maybe
> _PAGE_USER?

All of the problems there have to do with accounting for which pieces
of the page are zeroed. The PTEs map the same-size areas (MMUPAGE_SIZE
stays 4KB)... So after a partial zero we end up with a struct page
pointing at MMUPAGE_COUNT mmupages, a PTE pointing at the one that's
been zeroed, and not a whole lot of flag bits left to keep track of
which pieces are initialized. How about a single PG_zero flag, with a
map of which bits of the thing are already zeroed in page->private?
(basically the swapcache can be considered the owning fs, and it then
 uses page->private for those shenanigans).


On Fri, Dec 06, 2002 at 04:01:24PM -0800, Andrew Morton wrote:
> get_user_pages() would need attention too - you don't want to
> allow the user to perform O_DIRECT writes of uninitialised
> pages to their files...

Well, I'm not sure how that would happen. fs I/O should deal with
kernel PAGE_SIZE-sized units, so we're dealing with anonymous memory
only. An O_DIRECT write would only find the part of the page mapped by
a PTE, which must have been pre-zeroed prior to being mapped. Reads
seem to be in equally good shape. Perhaps it's more a case of "this is
yet another thing to audit when dealing with it"; I'll admit that the
audit needed for this thing is somewhat large.


Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  2:19                       ` Alan Cox
@ 2002-12-07  1:46                         ` William Lee Irwin III
  2002-12-07  1:56                           ` Andrea Arcangeli
  2002-12-07  2:31                           ` Alan Cox
  0 siblings, 2 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-07  1:46 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood,
	Linux Kernel Mailing List

On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote:
>> 16K is reasonable; after that one might as well go all the way.
>> About the only way to cope is amortizing it by cacheing zeroed pages,
>> and that has other downsides.

On Sat, Dec 07, 2002 at 02:19:49AM +0000, Alan Cox wrote:
> Some of the lower-end CPUs only have about 12-16K of L1. I don't think
> that's a big problem since those aren't going to be highmem or large
> memory users

It's an arch parameter, so they'd probably just
#define MMUPAGE_SIZE PAGE_SIZE
Hugh's original patch did that for all non-i386 arches.

Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  1:46                         ` William Lee Irwin III
@ 2002-12-07  1:56                           ` Andrea Arcangeli
  2002-12-07  2:31                           ` Alan Cox
  1 sibling, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2002-12-07  1:56 UTC (permalink / raw)
  To: William Lee Irwin III, Alan Cox, Andrew Morton, Norman Gaywood,
	Linux Kernel Mailing List

On Fri, Dec 06, 2002 at 05:46:43PM -0800, William Lee Irwin III wrote:
> On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote:
> >> 16K is reasonable; after that one might as well go all the way.
> >> About the only way to cope is amortizing it by cacheing zeroed pages,
> >> and that has other downsides.
> 
> On Sat, Dec 07, 2002 at 02:19:49AM +0000, Alan Cox wrote:
> > Some of the lower-end CPUs only have about 12-16K of L1. I don't think
> > that's a big problem since those aren't going to be highmem or large
> > memory users
> 
> It's an arch parameter, so they'd probably just
> #define MMUPAGE_SIZE PAGE_SIZE
> Hugh's original patch did that for all non-i386 arches.

I would say the most important thing to evaluate, even before the CPU
and cache size, is the amount of RAM in the machine. The major downside
of going to 8k is the loss of granularity in the paging: a small
machine may not want to page in the next page too unless it's been
explicitly touched by the program, both to use the little available RAM
as well as possible and to keep the most fine-grained information about
the working set in the pagetables. The breakpoint depends on the
workload; it would probably make sense to keep all boxes with <= 64M at
4k, or something along those lines.

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  2:31                           ` Alan Cox
@ 2002-12-07  2:09                             ` William Lee Irwin III
  0 siblings, 0 replies; 49+ messages in thread
From: William Lee Irwin III @ 2002-12-07  2:09 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood,
	Linux Kernel Mailing List

On Sat, 2002-12-07 at 01:46, William Lee Irwin III wrote:
>> It's an arch parameter, so they'd probably just
>> #define MMUPAGE_SIZE PAGE_SIZE
>> Hugh's original patch did that for all non-i386 arches.

On Sat, Dec 07, 2002 at 02:31:37AM +0000, Alan Cox wrote:
> These are low-end x86 - but we could do this based on
> 	<= i586
> 	i586
> 	i686+

It's relatively flexible as to the choice of PAGE_SIZE (it's
MMUPAGE_SIZE that's defined by hardware); about the only constraints
are that jacking it up where PAGE_SIZE spans pmd's breaks the core
vectoring API, PAGE_SIZE >= MMUPAGE_SIZE, both are powers of 2, the
vectors (which are of size MMUPAGE_COUNT*sizeof(pte_t *)) are stack-
allocated, and arch code has to understand small bits of it.

It sounds like we could pick sane defaults based on CPU revision.



Bill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  0:21                     ` William Lee Irwin III
  2002-12-07  0:30                       ` Andrew Morton
@ 2002-12-07  2:19                       ` Alan Cox
  2002-12-07  1:46                         ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Alan Cox @ 2002-12-07  2:19 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood,
	Linux Kernel Mailing List

On Sat, 2002-12-07 at 00:21, William Lee Irwin III wrote:
> 16K is reasonable; after that one might as well go all the way.
> About the only way to cope is amortizing it by cacheing zeroed pages,
> and that has other downsides.

Some of the lower-end CPUs only have about 12-16K of L1. I don't think
that's a big problem since those aren't going to be highmem or large
memory users


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  1:46                         ` William Lee Irwin III
  2002-12-07  1:56                           ` Andrea Arcangeli
@ 2002-12-07  2:31                           ` Alan Cox
  2002-12-07  2:09                             ` William Lee Irwin III
  1 sibling, 1 reply; 49+ messages in thread
From: Alan Cox @ 2002-12-07  2:31 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Andrea Arcangeli, Norman Gaywood,
	Linux Kernel Mailing List

On Sat, 2002-12-07 at 01:46, William Lee Irwin III wrote:
> It's an arch parameter, so they'd probably just
> #define MMUPAGE_SIZE PAGE_SIZE
> Hugh's original patch did that for all non-i386 arches.

These are low-end x86 - but we could do this based on

	<= i586
	i586
	i686+


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-07  0:01                   ` Andrew Morton
                                       ` (2 preceding siblings ...)
  2002-12-07  0:46                     ` William Lee Irwin III
@ 2002-12-07 10:55                     ` Arjan van de Ven
  3 siblings, 0 replies; 49+ messages in thread
From: Arjan van de Ven @ 2002-12-07 10:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Sat, 2002-12-07 at 01:01, Andrew Morton wrote:
> William Lee Irwin III wrote:
> > 
> > ...
> > A 16KB or 64KB kernel allocation unit would then annihilate
> 
> You want to be careful about this:
> 
> 	CPU: L1 I cache: 16K, L1 D cache: 16K
> 
> Because instantiating a 16k page into user pagetables in
> one hit means that it must all be zeroed.  With these large
> pagesizes that means that the application is likely to get
> 100% L1 misses against the new page, whereas it currently
> gets 100% hits.

If you really want, you can cheat that 100% statistic into something
much lower by zeroing the page from back to front (even based on the
exact faulting address, because you know THAT one will get used) and/or
zeroing the second half while bypassing the cache. At least it's 50%
hits then ;)

Still not 100%, and I still agree that the 8KB number is much nicer for
16KB L1 cache machines....

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
  2002-12-06 22:34                 ` Andrea Arcangeli
@ 2002-12-07 18:27                   ` Eric W. Biederman
  0 siblings, 0 replies; 49+ messages in thread
From: Eric W. Biederman @ 2002-12-07 18:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Arjan van de Ven, Andrew Morton,
	Norman Gaywood, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Fri, Dec 06, 2002 at 07:12:38AM -0800, William Lee Irwin III wrote:
> > split just to get a bloated mem_map to fit. Many of the smaller apps,
> > e.g. /bin/sh etc. are indifferent to the ABI violation.
> 
> the problem of the split is that it would reduce the address space
> available to userspace, which is quite critical on big machines (one
> of the big advantages of 64bit that can't be fixed on 32bit), but I
> wouldn't classify it as an ABI violation. In fact, the little I can
> remember about the 2.0 kernels [I almost never read that code] is that
> they had a shared address space and a tlb flush while entering/exiting
> the kernel, so I can bet the user stack in 2.0 was put at 4G, not at
> 3G. 2.2 had to put it at 3G because then the address space was shared,
> with the obvious performance advantages. So while I didn't read any
> ABI, I deduce you can't say the ABI got broken if the stack is put at
> 2G or 1G or 3.5G, or at 4G again with x86-64 (of course x86-64 can
> give the full 4G to userspace because the kernel runs in the negative
> part of the [64bit] address space, as 2.0 could too).

As I remember it, 2.0 used the 3/1 split; the difference was that the
segments had different base register values, so the kernel thought it
was running at 0. %fs, which retained a base address of 0, was used
when access to user space was desired.

Eric

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2002-12-07 18:21 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-12-06  0:13 Maybe a VM bug in 2.4.18-18 from RH 8.0? Norman Gaywood
2002-12-06  1:00 ` Andrew Morton
2002-12-06  1:17   ` Andrea Arcangeli
2002-12-06  1:34     ` Andrew Morton
2002-12-06  1:44       ` Andrea Arcangeli
2002-12-06  2:15         ` William Lee Irwin III
2002-12-06  2:28           ` Andrea Arcangeli
2002-12-06  2:41             ` William Lee Irwin III
2002-12-06  5:25               ` Andrew Morton
2002-12-06  5:48                 ` Andrea Arcangeli
2002-12-06  6:14                   ` William Lee Irwin III
2002-12-06  6:55                   ` Andrew Morton
2002-12-06  7:14                     ` GrandMasterLee
2002-12-06  7:25                       ` Andrew Morton
2002-12-06  7:34                         ` GrandMasterLee
2002-12-06  7:51                           ` Andrew Morton
2002-12-06 11:37                             ` Christoph Hellwig
2002-12-06 16:19                             ` GrandMasterLee
2002-12-06 14:57                     ` Andrea Arcangeli
2002-12-06 15:12                       ` William Lee Irwin III
2002-12-06 23:32                         ` Andrea Arcangeli
2002-12-06 23:45                           ` William Lee Irwin III
2002-12-06 23:57                             ` Andrea Arcangeli
2002-12-06  6:00                 ` William Lee Irwin III
2002-12-06 22:28               ` Andrea Arcangeli
2002-12-06 23:21                 ` William Lee Irwin III
2002-12-06 23:50                   ` Andrea Arcangeli
2002-12-07  0:30                     ` William Lee Irwin III
2002-12-07  0:01                   ` Andrew Morton
2002-12-07  0:21                     ` William Lee Irwin III
2002-12-07  0:30                       ` Andrew Morton
2002-12-07  2:19                       ` Alan Cox
2002-12-07  1:46                         ` William Lee Irwin III
2002-12-07  1:56                           ` Andrea Arcangeli
2002-12-07  2:31                           ` Alan Cox
2002-12-07  2:09                             ` William Lee Irwin III
2002-12-07  0:22                     ` Andrea Arcangeli
2002-12-07  0:35                       ` Andrew Morton
2002-12-07  0:46                     ` William Lee Irwin III
2002-12-07 10:55                     ` Arjan van de Ven
2002-12-06 10:36           ` Arjan van de Ven
2002-12-06 14:23             ` William Lee Irwin III
2002-12-06 15:12               ` William Lee Irwin III
2002-12-06 22:34                 ` Andrea Arcangeli
2002-12-07 18:27                   ` Eric W. Biederman
2002-12-06  1:08 ` Andrea Arcangeli
     [not found] <mailman.1039133948.27411.linux-kernel2news@redhat.com>
2002-12-06  0:35 ` Pete Zaitcev
2002-12-06  1:27   ` Norman Gaywood
2002-12-06 12:48     ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox