public inbox for linux-kernel@vger.kernel.org
* ~500 megs cached yet 2.6.5 goes into swap hell
@ 2004-04-28 21:27 Brett E.
  2004-04-29  0:01 ` Andrew Morton
  2004-04-29  0:04 ` Brett E.
  0 siblings, 2 replies; 128+ messages in thread
From: Brett E. @ 2004-04-28 21:27 UTC (permalink / raw)
  To: linux-kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 362 bytes --]

Same thing happens on 2.4.18.

I attached sar, slabinfo and /proc/meminfo data on the 2.6.5 machine.  I 
reproduce this behavior by simply untarring a 260meg file on a 
production server; the machine becomes sluggish as it swaps to disk.  Is 
there a way to limit the cache so this machine, which has 1 gigabyte of 
memory, doesn't dip into swap?

Thanks,

Brett

[-- Attachment #2: attach.1 --]
[-- Type: text/plain, Size: 15171 bytes --]

06:18:52 PM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
06:18:53 PM     55332   1238644     95.72     14660    497888    450740     79364     14.97      9692
06:18:54 PM     55268   1238708     95.73     14660    497888    450740     79364     14.97      9692
06:18:55 PM     40060   1253916     96.90     14860    512920    450740     79364     14.97      9692
06:18:57 PM      6120   1287856     99.53     15340    546644    450740     79364     14.97      9692
06:18:59 PM      6632   1287344     99.49     15864    550880    450740     79364     14.97      9692
06:19:00 PM      6440   1287536     99.50     16020    552628    450740     79364     14.97      9692
06:19:02 PM      7648   1286328     99.41     15980    548452    450740     79364     14.97      9692
06:19:03 PM      6504   1287472     99.50     16008    548832    450740     79364     14.97      9692
06:19:04 PM      7592   1286384     99.41     15980    530160    450740     79364     14.97      9692
06:19:05 PM      6192   1287784     99.52     15716    499008    450740     79364     14.97      9692
06:19:06 PM      6544   1287432     99.49     15732    494640    450740     79364     14.97      9692
06:19:07 PM      7104   1286872     99.45     15768    488756    450740     79364     14.97      9692
06:19:08 PM      7592   1286384     99.41     15844    488680    450740     79364     14.97      9692
06:19:10 PM      7416   1286560     99.43     15936    479136    450740     79364     14.97      9692
06:19:13 PM      7024   1286952     99.46     15912    467808    450744     79360     14.97      9688
06:19:14 PM      7096   1286880     99.45     15664    427736    450744     79360     14.97      9684
06:19:15 PM      7240   1286736     99.44     15604    415692    450744     79360     14.97      9684
06:19:16 PM      6712   1287264     99.48     15616    414524    450744     79360     14.97      9684
06:19:18 PM      6200   1287776     99.52     15652    409660    450744     79360     14.97      9684
06:19:19 PM     10600   1283376     99.18     15724    407004    450744     79360     14.97      9684


06:18:52 PM  pgpgin/s pgpgout/s   fault/s  majflt/s
06:18:53 PM      0.00    712.00   1236.00      0.00
06:18:54 PM     12.12      8.08   1067.68      0.00
06:18:55 PM   7497.03     11.88   2844.55      0.00
06:18:57 PM  10626.00    310.00   1422.50      0.00
06:18:59 PM  11758.00    196.00    346.50      0.00
06:19:00 PM   7828.00    608.00    136.00      0.00
06:19:02 PM    145.27   1136.32   1108.96      0.00
06:19:03 PM    905.05  13822.22    663.64      0.00
06:19:04 PM    689.11   2384.16   9437.62      0.00
06:19:05 PM    499.01   9572.28  13467.33      0.00
06:19:06 PM   3444.00   1340.00   1825.00      0.00
06:19:07 PM   7720.00   2032.00   3034.00      0.00
06:19:08 PM   5420.00   1304.00    688.00      0.00
06:19:10 PM   4045.77   4304.48   2188.56      0.00
06:19:13 PM   1079.07   5528.68   2046.90      0.00
06:19:14 PM    696.00    920.00  15650.00      0.00
06:19:15 PM   1478.79   1187.88   5046.46      0.00
06:19:16 PM   1000.00   2752.94    539.22      0.00


meminfo:

MemTotal:      1293976 kB
MemFree:          8320 kB
Buffers:         13396 kB
Cached:         436428 kB
SwapCached:       9516 kB
Active:         810472 kB
Inactive:       346816 kB
HighTotal:      393216 kB
HighFree:         1152 kB
LowTotal:       900760 kB
LowFree:          7168 kB
SwapTotal:      530104 kB
SwapFree:       450796 kB
Dirty:           33704 kB
Writeback:       10268 kB
Mapped:         710732 kB
Slab:           115240 kB
Committed_AS:   942592 kB
PageTables:       4612 kB
VmallocTotal:   114680 kB
VmallocUsed:       560 kB
VmallocChunk:   114120 kB



slabinfo - version: 2.0
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <batchcount> <limit> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
rpc_buffers            8      8   2048    2    1 : tunables   24   12    8 : slabdata      4      4      0
rpc_tasks              8     15    256   15    1 : tunables  120   60    8 : slabdata      1      1      0
rpc_inode_cache       12     14    512    7    1 : tunables   54   27    8 : slabdata      2      2      0
unix_sock            192    203    512    7    1 : tunables   54   27    8 : slabdata     29     29      0
ip_conntrack        9926  14860    384   10    1 : tunables   54   27    8 : slabdata   1486   1486    216
tcp_tw_bucket       2028   6450    128   30    1 : tunables  120   60    8 : slabdata    215    215    384
tcp_bind_bucket      207    800     16  200    1 : tunables  120   60    8 : slabdata      4      4     16
tcp_open_request     113    290     64   58    1 : tunables  120   60    8 : slabdata      5      5      3
inet_peer_cache        2     58     64   58    1 : tunables  120   60    8 : slabdata      1      1      0
ip_fib_hash           18    200     16  200    1 : tunables  120   60    8 : slabdata      1      1      0
ip_dst_cache       23046  23145    256   15    1 : tunables  120   60    8 : slabdata   1543   1543      0
arp_cache             11     30    256   15    1 : tunables  120   60    8 : slabdata      2      2      0
raw4_sock              0      0    512    7    1 : tunables   54   27    8 : slabdata      0      0      0
udp_sock              10     21    512    7    1 : tunables   54   27    8 : slabdata      3      3      0
tcp_sock             248    408   1024    4    1 : tunables   54   27    8 : slabdata    102    102      0
flow_cache             0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
udf_inode_cache        0      0    512    7    1 : tunables   54   27    8 : slabdata      0      0      0
nfs_write_data        36     42    512    7    1 : tunables   54   27    8 : slabdata      6      6      0
nfs_read_data         32     35    512    7    1 : tunables   54   27    8 : slabdata      5      5      0
nfs_inode_cache       15     24    640    6    1 : tunables   54   27    8 : slabdata      4      4      0
nfs_page               0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
isofs_inode_cache      0      0    384   10    1 : tunables   54   27    8 : slabdata      0      0      0
fat_inode_cache        0      0    512    7    1 : tunables   54   27    8 : slabdata      0      0      0
ext2_inode_cache    7294   7294    512    7    1 : tunables   54   27    8 : slabdata   1042   1042      0
journal_handle         0      0     28  123    1 : tunables  120   60    8 : slabdata      0      0      0
journal_head           0      0     48   77    1 : tunables  120   60    8 : slabdata      0      0      0
revoke_table           0      0     12  250    1 : tunables  120   60    8 : slabdata      0      0      0
revoke_record          0      0     16  200    1 : tunables  120   60    8 : slabdata      0      0      0
ext3_inode_cache       0      0    512    7    1 : tunables   54   27    8 : slabdata      0      0      0
ext3_xattr             0      0     48   77    1 : tunables  120   60    8 : slabdata      0      0      0
eventpoll_pwq          0      0     36   99    1 : tunables  120   60    8 : slabdata      0      0      0
eventpoll_epi          0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
kioctx                 0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
kiocb                  0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
dnotify_cache          0      0     20  166    1 : tunables  120   60    8 : slabdata      0      0      0
file_lock_cache        9     40     96   40    1 : tunables  120   60    8 : slabdata      1      1      0
fasync_cache           0      0     16  200    1 : tunables  120   60    8 : slabdata      0      0      0
shmem_inode_cache      3      7    512    7    1 : tunables   54   27    8 : slabdata      1      1      0
posix_timers_cache      0      0     88   43    1 : tunables  120   60    8 : slabdata      0      0      0
uid_cache              5    112     32  112    1 : tunables  120   60    8 : slabdata      1      1      0
sgpool-128            32     32   2048    2    1 : tunables   24   12    8 : slabdata     16     16      0
sgpool-64             32     32   1024    4    1 : tunables   54   27    8 : slabdata      8      8      0
sgpool-32             32     32    512    8    1 : tunables   54   27    8 : slabdata      4      4      0
sgpool-16             32     45    256   15    1 : tunables  120   60    8 : slabdata      3      3      0
sgpool-8              32     60    128   30    1 : tunables  120   60    8 : slabdata      2      2      0
deadline_drq           0      0     52   71    1 : tunables  120   60    8 : slabdata      0      0      0
as_arq               296    348     64   58    1 : tunables  120   60    8 : slabdata      6      6     60
blkdev_requests      312    312    160   24    1 : tunables  120   60    8 : slabdata     13     13     60
biovec-BIO_MAX_PAGES    256    256   3072    2    2 : tunables   24   12    8 : slabdata    128    128      0
biovec-128           256    260   1536    5    2 : tunables   24   12    8 : slabdata     52     52      0
biovec-64            629    640    768    5    1 : tunables   54   27    8 : slabdata    128    128     38
biovec-16            315    315    256   15    1 : tunables  120   60    8 : slabdata     21     21      0
biovec-4             348    348     64   58    1 : tunables  120   60    8 : slabdata      6      6      0
biovec-1             520    600     16  200    1 : tunables  120   60    8 : slabdata      3      3     60
bio                  870    870     64   58    1 : tunables  120   60    8 : slabdata     15     15    180
sock_inode_cache     573    910    512    7    1 : tunables   54   27    8 : slabdata    130    130      0
skbuff_head_cache    296    870    256   15    1 : tunables  120   60    8 : slabdata     58     58     30
sock                   4     10    384   10    1 : tunables   54   27    8 : slabdata      1      1      0
proc_inode_cache    1417   1530    384   10    1 : tunables   54   27    8 : slabdata    153    153      0
sigqueue             130    130    144   26    1 : tunables  120   60    8 : slabdata      5      5      0
radix_tree_node     7117   8955    260   15    1 : tunables   54   27    8 : slabdata    597    597    189
bdev_cache             6      7    512    7    1 : tunables   54   27    8 : slabdata      1      1      0
mnt_cache             20     58     64   58    1 : tunables  120   60    8 : slabdata      1      1      0
inode_cache          566    580    384   10    1 : tunables   54   27    8 : slabdata     58     58      0
dentry_cache      167775 176055    256   15    1 : tunables  120   60    8 : slabdata  11737  11737      0
filp                2057   2790    256   15    1 : tunables  120   60    8 : slabdata    186    186      0
names_cache           25     25   4096    1    1 : tunables   24   12    8 : slabdata     25     25      0
idr_layer_cache        3     28    136   28    1 : tunables  120   60    8 : slabdata      1      1      0
buffer_head        35463  50481     52   71    1 : tunables  120   60    8 : slabdata    711    711      0
mm_struct            331    360    640    6    1 : tunables   54   27    8 : slabdata     60     60      0
vm_area_struct     10667  12586     64   58    1 : tunables  120   60    8 : slabdata    217    217      0
fs_cache             331    464     64   58    1 : tunables  120   60    8 : slabdata      8      8      0
files_cache          346    371    512    7    1 : tunables   54   27    8 : slabdata     53     53      0
signal_cache         447    696     64   58    1 : tunables  120   60    8 : slabdata     12     12      0
sighand_cache        345    380   1408    5    2 : tunables   24   12    8 : slabdata     76     76      0
task_struct          434    450   1456    5    2 : tunables   24   12    8 : slabdata     90     90      0
pte_chain         139628 145500    128   30    1 : tunables  120   60    8 : slabdata   4850   4850      0
pgd                  330    330   4096    1    1 : tunables   24   12    8 : slabdata    330    330      0
size-131072(DMA)       0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-131072            0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-65536(DMA)        0      0  65536    1   16 : tunables    8    4    0 : slabdata      0      0      0
size-65536             0      0  65536    1   16 : tunables    8    4    0 : slabdata      0      0      0
size-32768(DMA)        0      0  32768    1    8 : tunables    8    4    0 : slabdata      0      0      0
size-32768             0      0  32768    1    8 : tunables    8    4    0 : slabdata      0      0      0
size-16384(DMA)        0      0  16384    1    4 : tunables    8    4    0 : slabdata      0      0      0
size-16384             1      1  16384    1    4 : tunables    8    4    0 : slabdata      1      1      0
size-8192(DMA)         0      0   8192    1    2 : tunables    8    4    0 : slabdata      0      0      0
size-8192            446    446   8192    1    2 : tunables    8    4    0 : slabdata    446    446      0
size-4096(DMA)         0      0   4096    1    1 : tunables   24   12    8 : slabdata      0      0      0
size-4096             65     66   4096    1    1 : tunables   24   12    8 : slabdata     65     66      0
size-2048(DMA)         0      0   2048    2    1 : tunables   24   12    8 : slabdata      0      0      0
size-2048            245    294   2048    2    1 : tunables   24   12    8 : slabdata    147    147      4
size-1024(DMA)         0      0   1024    4    1 : tunables   54   27    8 : slabdata      0      0      0
size-1024            109    128   1024    4    1 : tunables   54   27    8 : slabdata     32     32      0
size-512(DMA)          0      0    512    8    1 : tunables   54   27    8 : slabdata      0      0      0
size-512             268    488    512    8    1 : tunables   54   27    8 : slabdata     61     61      0
size-256(DMA)          0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
size-256             424    465    256   15    1 : tunables  120   60    8 : slabdata     31     31      0
size-128(DMA)          0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
size-128            2387   3090    128   30    1 : tunables  120   60    8 : slabdata    103    103      0
size-64(DMA)           0      0     64   58    1 : tunables  120   60    8 : slabdata      0      0      0
size-64              334    406     64   58    1 : tunables  120   60    8 : slabdata      7      7      0
size-32(DMA)           0      0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
size-32              744    784     32  112    1 : tunables  120   60    8 : slabdata      7      7      0
kmem_cache           104    104    148   26    1 : tunables  120   60    8 : slabdata      4      4      0

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-28 21:27 ~500 megs cached yet 2.6.5 goes into swap hell Brett E.
@ 2004-04-29  0:01 ` Andrew Morton
  2004-04-29  0:10   ` Jeff Garzik
  2004-04-29  0:44   ` Brett E.
  2004-04-29  0:04 ` Brett E.
  1 sibling, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  0:01 UTC (permalink / raw)
  To: brettspamacct; +Cc: linux-kernel

"Brett E." <brettspamacct@fastclick.com> wrote:
>
> I attached sar, slabinfo and /proc/meminfo data on the 2.6.5 machine.  I 
> reproduce this behavior by simply untarring a 260meg file on a 
> production server; the machine becomes sluggish as it swaps to disk.

I see no swapout from the info which you sent.

A `vmstat 1' trace would be more useful.

> Is there a way to limit the cache so this machine, which has 1 gigabyte of 
> memory, doesn't dip into swap?

Decrease /proc/sys/vm/swappiness?

Swapout is good.  It frees up unused memory.  I run my desktop machines at
swappiness=100.



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-28 21:27 ~500 megs cached yet 2.6.5 goes into swap hell Brett E.
  2004-04-29  0:01 ` Andrew Morton
@ 2004-04-29  0:04 ` Brett E.
  2004-04-29  0:13   ` Jeff Garzik
  2004-04-29 13:51   ` Horst von Brand
  1 sibling, 2 replies; 128+ messages in thread
From: Brett E. @ 2004-04-29  0:04 UTC (permalink / raw)
  To: brettspamacct; +Cc: linux-kernel mailing list

Brett E. wrote:

> Same thing happens on 2.4.18.
> 
> I attached sar, slabinfo and /proc/meminfo data on the 2.6.5 machine.  I 
> reproduce this behavior by simply untarring a 260meg file on a 
> production server; the machine becomes sluggish as it swaps to disk. Is 
> there a way to limit the cache so this machine, which has 1 gigabyte of 
> memory, doesn't dip into swap?
> 
> Thanks,
> 
> Brett
> 

I created a hack which allocates memory, forcing the cache down, then 
exits, freeing the malloc'ed memory. This brings free memory up by 
400 megs and brings the cache down to close to 0; of course the cache 
grows right back afterwards. It would be nice to cap the cache data 
structures in the kernel, but I've been posting about this since 
September to no avail, so my expectations are pretty low.

Here's the code:

#include <stdlib.h>

#define ALLOC_SIZE (1024*1024)
#define NUM_ALLOC 400

int main(void) {
     char* ptr;
     int i,j;

     for(i=0;i<NUM_ALLOC;i++) {
         ptr = malloc(ALLOC_SIZE);
         if (!ptr)
             break;
         /* touch the allocation so the pages are actually faulted in */
         for(j=0;j<ALLOC_SIZE;j+=512) {
                 ptr[j]=0;
         }
     }

     return 0;
}


...
Maybe I can make it a hack of all hacks and have it parse out "Cached" 
from /proc/meminfo and allocate that many bytes.




* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:01 ` Andrew Morton
@ 2004-04-29  0:10   ` Jeff Garzik
  2004-04-29  0:21     ` Nick Piggin
  2004-04-29  0:49     ` Brett E.
  2004-04-29  0:44   ` Brett E.
  1 sibling, 2 replies; 128+ messages in thread
From: Jeff Garzik @ 2004-04-29  0:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: brettspamacct, linux-kernel

Andrew Morton wrote:
> Swapout is good.  It frees up unused memory.  I run my desktop machines at
> swappiness=100.


The definition of "unused" is quite subjective and app-dependent...

I've seen reports with increasing frequency about the swappiness of the 
2.6.x kernels, from people who were already annoyed at the swappiness of 
2.4.x kernels :)

Favorite pathological (and quite common) examples are the various 4am 
cron jobs that scan your entire filesystem.  Running that process 
overnight on a quiet machine practically guarantees a huge burst of 
disk activity, with unwanted results:
1) Inode and page caches are blown away
2) A lot of your desktop apps are swapped out

Additionally, a (IMO valid) maxim of sysadmins has been "a properly 
configured server doesn't swap".  There should be no reason why this 
maxim becomes invalid over time.  When Linux starts to swap out apps the 
sysadmin knows will be useful in an hour, or six hours, or a day just 
because it needs a bit more file cache, I get worried.

There IMO should be some way to balance the amount of anon-vma's such 
that the sysadmin can say "stop taking 70% of my box's memory for 
disposable cache, use it instead for apps you would otherwise swap out, 
you memory-hungry kernel you."

	Jeff





* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:04 ` Brett E.
@ 2004-04-29  0:13   ` Jeff Garzik
  2004-04-29  0:43     ` Nick Piggin
  2004-04-29 13:51   ` Horst von Brand
  1 sibling, 1 reply; 128+ messages in thread
From: Jeff Garzik @ 2004-04-29  0:13 UTC (permalink / raw)
  To: brettspamacct; +Cc: linux-kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 695 bytes --]

Brett E. wrote:
> exits, freeing up the malloc'ed memory. This brings free memory up by 
> 400 megs and brings the cache down to close to 0, of course the cache 

Yeah, I have something similar (attached).  Run it like

	fillmem <number-of-megabytes>


> grows right afterwards. It would be nice to cap the cache datastructures 
> in the kernel but I've been posting about this since September to no 
> avail so my expectations are pretty low.

This is a frequent request...  although I disagree with a hard cap on 
the cache, I think the request (and similar ones) should hopefully 
indicate to the VM gurus that the kernel likes cache better than anon 
VMAs that must be swapped out.

	Jeff



[-- Attachment #2: fillmem.c --]
[-- Type: text/plain, Size: 707 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

#define MEGS 140
#define MEG (1024 * 1024)

int main (int argc, char *argv[])
{
	void **data;
	int i, r;
	size_t megs = MEGS;

	if ((argc >= 2) && (atoi(argv[1]) > 0))
		megs = atoi(argv[1]);

	data = malloc (megs * sizeof (void*));
	if (!data) abort();

	memset (data, 0, megs * sizeof (void*));

	srand(time(NULL));

	for (i = 0; i < megs; i++) {
		data[i] = malloc(MEG);
		if (!data[i]) abort();
		memset (data[i], i, MEG);
		printf("malloc/memset %03d/%03zu\n", i+1, megs);
	}
	for (i = megs - 1; i >= 0; i--) {
		r = rand() % 200;
		memset (data[i], r, MEG);
		printf("memset #2 %03d/%03zu = %d\n", i+1, megs, r);
	}
	printf("done\n");
	return 0;
}


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:10   ` Jeff Garzik
@ 2004-04-29  0:21     ` Nick Piggin
  2004-04-29  0:50       ` Wakko Warner
                         ` (2 more replies)
  2004-04-29  0:49     ` Brett E.
  1 sibling, 3 replies; 128+ messages in thread
From: Nick Piggin @ 2004-04-29  0:21 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, brettspamacct, linux-kernel

Jeff Garzik wrote:
> Andrew Morton wrote:
> 
>> Swapout is good.  It frees up unused memory.  I run my desktop 
>> machines at
>> swappiness=100.
> 
> 
> 
> The definition of "unused" is quite subjective and app-dependent...
> 
> I've seen reports with increasing frequency about the swappiness of the 
> 2.6.x kernels, from people who were already annoyed at the swappiness of 
> 2.4.x kernels :)
> 
> Favorite pathological (and quite common) examples are the various 4am 
> cron jobs that scan your entire filesystem.  Running that process 
> overnight on a quiet machine practically guarantees a huge burst of 
> disk activity, with unwanted results:
> 1) Inode and page caches are blown away
> 2) A lot of your desktop apps are swapped out
> 
> Additionally, a (IMO valid) maxim of sysadmins has been "a properly 
> configured server doesn't swap".  There should be no reason why this 
> maxim becomes invalid over time.  When Linux starts to swap out apps the 
> sysadmin knows will be useful in an hour, or six hours, or a day just 
> because it needs a bit more file cache, I get worried.
> 

I don't know. What if you have some huge application that only
runs once per day for 10 minutes? Do you want it to be consuming
100MB of your memory for the other 23 hours and 50 minutes for
no good reason?

Anyway, I have a small set of VM patches which attempt to improve
this sort of behaviour if anyone is brave enough to try them.
Against -mm kernels only I'm afraid (the objrmap work causes some
porting difficulty).


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:13   ` Jeff Garzik
@ 2004-04-29  0:43     ` Nick Piggin
  0 siblings, 0 replies; 128+ messages in thread
From: Nick Piggin @ 2004-04-29  0:43 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: brettspamacct, linux-kernel mailing list

Jeff Garzik wrote:
> Brett E. wrote:
> 
>> exits, freeing up the malloc'ed memory. This brings free memory up by 
>> 400 megs and brings the cache down to close to 0, of course the cache 
> 
> 
> Yeah, I have something similar (attached).  Run it like
> 
>     fillmem <number-of-megabytes>
> 
> 
>> grows right afterwards. It would be nice to cap the cache 
>> datastructures in the kernel but I've been posting about this since 
>> September to no avail so my expectations are pretty low.
> 
> 
> This is a frequent request...  although I disagree with a hard cap on 
> the cache, I think the request (and similar ones) should hopefully 
> indicate to the VM gurus that the kernel likes cache better than anon 
> VMAs that must be swapped out.
> 

For 2.6.6-rc2-mm2:
http://www.kerneltrap.org/~npiggin/vm-rollup.patch.gz

/proc/sys/vm/mapped_page_cost - indicate which *you* like better ;)


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:01 ` Andrew Morton
  2004-04-29  0:10   ` Jeff Garzik
@ 2004-04-29  0:44   ` Brett E.
  2004-04-29  1:13     ` Andrew Morton
  1 sibling, 1 reply; 128+ messages in thread
From: Brett E. @ 2004-04-29  0:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]

First of all, thanks for your replies and helpful information.

Andrew Morton wrote:

> "Brett E." <brettspamacct@fastclick.com> wrote:
> 
>>I attached sar, slabinfo and /proc/meminfo data on the 2.6.5 machine.  I 
>>reproduce this behavior by simply untarring a 260meg file on a 
>>production server; the machine becomes sluggish as it swaps to disk.
> 
> 
> I see no swapout from the info which you sent.

pgpgout/s gives the total number of blocks paged out to disk per second; 
it peaks at 13,000 and hovers around 3,000 per the attachment.

> 
> A `vmstat 1' trace would be more useful.
Ok, attached (ran this with swappiness set to 0, then 100). In both 
cases sar showed high paging in/out.

> 
> 
>>Is there a way to limit the cache so this machine, which has 1 gigabyte of 
>>memory, doesn't dip into swap?
> 
> 
> Decrease /proc/sys/vm/swappiness?
> 
> Swapout is good.  It frees up unused memory.  I run my desktop machines at
> swappiness=100.
> 

Swapping out is good, but when it's coupled with swapping in, as is the 
case on my side, it creates a thrashing situation: we swap out to disk 
pages which are still being used, then immediately swap those pages 
back in, and so on. This creates lots of disk I/O which competes with 
the userland processes, slowing the system down to a crawl. I don't 
understand why it swaps in the first place when 400-500 megs are taken 
up by cache data structures.

The usage pattern, by the way, is a server which continuously hits a 
database and reads files, so I don't know what "swappiness" should be 
set to exactly.  Every hour or so it wants to untar tarballs, and by 
then the cache is large. From there, the system swaps in and out more 
while the cache decreases. Basically, it should do what I believe 
Solaris does: simply reclaim cache and not swap.  Capping the cache 
would be good too, but the best solution IMO is to reclaim the cache 
on an as-needed basis before thinking about swapping.



[-- Attachment #2: attach.2 --]
[-- Type: text/plain, Size: 20862 bytes --]

swappiness of 0:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 11 168396   9260  16032 508496    1    2    84   120  142   132 37  7 48  8
 0  8 168396   6636  16056 510580    0    0  1320    76 1334  1639 13  3  0 84
 0  9 168396   6432  16068 510092    0    0   836    60 1242  1124 13  2  0 85
17  9 168396   7200  16084 508580    0    0  1248   148 1318  1351 11  3  7 80
 0 10 168396  14488  16104 507064    0    0  1364   904 1488  1977 20  4  0 76
 8  8 168396  11992  16116 508752    0    0  1124    88 1304  1345 11  3  0 86
16  8 168396  10008  16140 510768    0    0  1392   172 1434  1970 21  4  0 74
 3 11 168396  13592  16152 512524    0    0  1072  1364 1625  2544 32  6  2 60
 0  9 168396   6560  16252 519632    0    0  5644   380 1431  2073 19  6  0 76
 0 10 168396   6576  16320 519564    0    0  3840  1208 1259  1013  9  4  0 88
 0  6 168396   7040  16420 519260    0    0  5408   356 1311  1281 11  4  0 85
 0 10 168396   8640  16432 517616    0    0  2020   116 1496  2268 26  6  0 68
 0 10 168396   8384  16528 516704    0    0  4972  4124 1526  2278 30  8  4 60
 0  7 168396   7744  16528 517248    0    0   552    16 1267  1012  8  2  1 89
 0  7 168396   6528  16532 517788    0    0   304 12024 1175   174  1  1  0 98
 0  8 168396   7488  16552 515728    0    0  1408  7376 1111   173  1  1  0 98
12  8 168396   8824  16556 514024    0    0  1956  2724 1301  1582 19  5  0 76
12 14 168396   6944  16504 492860    0    0  1524     0 1637  2458 71 12  0 17
 0  7 168396   7072  16596 491272    0    0  5624     0 1296  1168 14  5  0 82
 0  7 168396   6936  16708 488712    0    0  7520  1496 1287  1737 17  7  0 76
 0  6 168396   6328  16756 488324    0    0  4712  2000 1234   786  7  4  0 89
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 12 168396   6448  13400 485600   96    0  8896 29480 8401  6544 11  3  0 86
 4 13 168396  10016  12760 487056    0    0  1600     0 1475  2089 21  5  0 74
10  7 168396   6944  12764 486304    0    0  1620  1028 1509  2366 37  6  0 57
 0 11 168396   6760  12780 493088    0    0  3880  2240 1455  2034 22  5  0 72
 0  6 168396   6240  12796 493480    0    0  8064  2032 1248  1167  9  5  0 87
 0  8 168396   7200  12820 492300    0    0  7336  2416 1387  1738 21  7  0 73
 1 11 168396   7968  12820 491144    0    0  3628   784 1551  2490 27  8  0 65
 0  4 168396   6544  12844 499144    0    0  8948   584 1318  1310 12  6  0 83
 5  7 168396   8856  12856 496820    0    0  4536  5792 1288  1336 11  4  0 85
 0  6 168396   7640  12856 498112    0    0   920  1352 1180   545  4  1  0 96
 0  6 168396   7512  12860 497836    0    0  2053 10676 1152   133  0  1  0 98
 0  6 168396   8056  12864 496676    0    0  1664     0 1079   138  1  1  0 99
 1  7 168396   6400  12896 495964    0    0  7665  1649 1171   465  4  4  0 93
 0  8 168396   7920  12908 495340    0    0   956   672 1410  1851 22  6  0 73
 8 13 168396   6936  12908 477456    0    0   684  3299 1643  2473 54 11  0 34
 0 15 168396   6584  12912 470516    0    0   832  1340 1376  1703 32  5  0 63
 0 18 168396  11184  12912 471128    0    0   620     0 1541  2548 35  6  0 60
 0  8 168396  11084  12912 472012    0    0   904  3384 1459  1771 29  6  0 65
 1  9 168396   6844  12924 473088    0    0  1064     4 1390  1694 24  3  0 73
 1 12 168396   7248  12932 469408    0    0   960   132 1562  2412 43  6  0 51
 5  9 168396  13228  12932 468996    4    0  1160   128 1628  2732 38  7  0 55
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 11 168396  11628  12932 469948    0    0   880  1388 1619  2407 29  6  0 66
 0  9 168396  16500  12936 471372    0    0  1068  2200 1385  1464 14  3  0 84
 1  4 168396  14132  12964 473724    0    0  1712  5432 1516  2168 31  5  0 65
 5 12 168396  19000  12972 475824    0    0  1360   160 1468  1918 30  5  0 66
 0  9 168396  16720  12976 477724    0    0  1376   128 1599  2943 36  7  0 56
 0  8 168396  20260  12976 480920    0    0  1884   224 1785  2982 38  8  0 54
 5 12 168396   7368  13000 493136    0    0 10312   648 1372  1743 18  8  1 74
 1 12 168396   7480  13008 492312    0    0  2848  1628 1532  2428 26  7  0 67
 0 11 168396   7992  13016 498696    0    0  3484  3368 1546  2521 28  7  0 64
 1 14 168396   6760  13028 499772    0   20  5536   812 1534  2733 32  7  0 62
 0 17 168396   6120  13048 507504    0    0  7944  6596 1539  2217 26  8  0 66
 0  5 168396   6424  13048 507164    0    0   564  4352 1229   332  4  1  0 95
 0 14 168396   7128  13048 506552    0    0  1576  3280 1374  2009 22  4  0 74
 0 10 168396  12536  13048 508184    0    0  1008     0 1283  1488 19  3  0 79
 8 10 168392  10104  13056 510256   32    0  1460   192 1628  2668 30  8  0 62
 1  6 168392  14904  13076 512960   64    0  1736     0 1696  3413 43 11  0 46
 3  4 168392  19384  13088 515804    0    0  1708     0 1483  2337 27  6  0 67
 7  5 168392  14328  13096 520624    0    0  2760   280 1659  2531 36  6  0 58
 0  5 168392  16124  13108 527004    0    0  3416   276 1319  1308 15  6  0 80
 0  5 168392   9724  13124 533448    0    0  3284  3840 1231   739 10  2  0 88
18  9 168392   9532  13124 533516    0    0    64  3108 1314   473  6  2  0 92
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 14 168392   8636  13124 534264    0    0   808     0 1511  2083 24  5  0 72
 1 10 168392   8156  13124 535012    0    0   716  6360 1459  2052 21  4  1 76
 0  5 168392   7388  13124 535624    0    0   628  2892 1449  1890 23  4  9 64
14  4 168392  14204  13128 536576   64    0   828  2056 1534  2161 31  5  3 61
 0  4 168392  13820  13132 536980    0    0   356  3821 1490  2082 29  7  2 61
 1  2 168392  20700  13136 537656    0    0   636  1568 1589  2516 31  6  5 58
 1  6 168392  19708  13144 538600    0    0  1024  2732 1558  2504 39  7  6 49
 6  3 168392  26688  13144 539144    0    0   520  5372 1624  2589 40  7  0 53
 0  4 168392  34036  13148 539820    0    0   692  1568 1602  3144 37  7  6 50
 2  3 168392  33300  13152 540564    0    0   732  5212 1624  2892 30  5 16 48
 0  3 168392  40276  13160 541508    0    0   960  4188 1655  2542 37  7  6 52
 1  2 168392  40340  13160 542052    0    0   464  2056 1719  3330 42  8  2 49
 3  2 168392  47252  13164 542460   64    0   572   260 1728  3321 43  8  6 42
 5  4 168392  54880  13168 542936   64    0   484   232 1725  3352 51  8  9 31




MemTotal:      1293976 kB
MemFree:         77964 kB
Buffers:         15568 kB
Cached:         525740 kB
SwapCached:      38596 kB
Active:         677728 kB
Inactive:       442556 kB
HighTotal:      393216 kB
HighFree:          768 kB
LowTotal:       900760 kB
LowFree:         77196 kB
SwapTotal:      530104 kB
SwapFree:       365860 kB
Dirty:           50036 kB
Writeback:       14036 kB
Mapped:         570488 kB
Slab:            82860 kB
Committed_AS:   853200 kB
PageTables:       4400 kB
VmallocTotal:   114680 kB
VmallocUsed:       560 kB
VmallocChunk:   114120 kB

swappiness of 100:


procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1 168304 198344  13736 463148    1    2    85   121  144   137 37  7 48  9
 1  0 168304 198200  13736 463216    0    0    44   372 1830  3379 48  8 28 17
 0  0 168304 198120  13736 463284    0    0    32     8 1680  3064 41  7 51  1
 2  1 168304 145512  14952 514768    0    0    68   864 1586  2377 33 25 34  7
 0  4 168304 135336  15176 524744    0    0   248  1092 1410   657 10  6  0 85
 0  5 168304 135144  15176 524744    0    0     0   956 1315   178  1  1  0 98
 0  5 168304 133544  15188 525616    0    0   412 10604 1267   795 14  3  0 84
 0  4 168304 112112  15212 527564    0    0  1012  4768 1630  2875 74 12  0 14
 0  4 168304  84432  15504 545836    0    0  6120 12484 1627  2396 54 15  0 32
 0  6 168304  57496  16040 568352    0    0   180 32512 2432  1297 16  7  0 77
 1  5 168304  36696  16452 585688    0    0     8  8120 1338   259  7  8  0 84
10  6 168304  23752  16712 595220    0    0    64 13304 1317  1029 19  7  0 73
 8  8 168304   7248  16936 589440    0    0   284  5380 1607  2234 63 15  0 22
 0 17 168304   6192  16944 587324    0    0   932  5328 1381  1636 17  4  0 80
 0 15 168304   7400  16940 581888    0    0   600   752 1389  1500 23  4  0 74
 3 11 168304   7400  16944 581136    0    0   688    16 1289   877  7  1  0 92
 5 23 168304   7304  16944 578892    0    0   888   606 1287  1189  9  3  0 89
 2 10 168304   7784  16940 571756    0    0   780  1136 1436  1855 30  4  0 67
 0  7 168304   6184  16940 568016    0    0   816   104 1458  1757 23  5  1 71
 0  7 168304   6752  16940 556184    0    0   732   188 1441  1726 34  7  0 59
 0 11 168304   6488  16940 555368    0    0   652   344 1285  1361 11  2  0 87
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1 15 168304   6640  16996 553952    0    0  3316   960 1357  1578 11  4  0 85
 1 15 168304   6512  17072 551876    0    0  6064  3804 1289  1132 14  5  0 82
 1 14 168304   6896  17212 542792    0    0  6032  3176 1333  1496 26  8  0 66
 0 11 168296   6504  17220 527728    0    0  3508  4204 1323  1326 38  8  0 54
 1  7 168296   7376  17292 526992    0    0  3476  1120 1236   909  8  3  0 88
 1 12 168292   6992  17388 520728    0    0  4268  4776 1354  1409 25  5  0 70
 0 13 168292   6544  17388 521068    0    0   424    48 1213   914  6  1  0 93
 2 16 168292   6536  17388 519028    0    0   756     4 1308  1582 18  3  0 78
 0  7 168292   6456  17388 516784    0    0   668  5016 1308  1289 13  2  0 85
 0  7 168292  11168  17388 516036    0    0   644   112 1340  1366 14  3  0 84
 0 13 168292   8992  17392 516712    0    0   696  2068 1358  1713 17  3  0 80
 5 11 168292   6896  17392 513176    0    0   616    92 1431  1792 35  5  0 61
 0  6 168292   7280  17416 512488    0    0  1212  2396 1431  2187 20  3  0 77
 0 10 168292   7216  17436 512260    0    0  1472    44 1372  1647 14  4  1 82
 0  9 168292  12576  17444 512580    0    0  1044    84 1252  1067 10  2  0 87
 0 11 168292  10912  17464 514192    0    0  1136  2640 1304  1353 12  3  0 86
 1  4 168292   9312  17472 515816    0    0  1144   720 1244  1212 10  2  0 89
 1  9 168292  13744  17488 517976    0    0  1276    96 1417  1750 15  4  0 82
 0 11 168292  11760  17508 519656    0    0  1228  1732 1362  2056 17  4  0 79
 0  9 168292   9712  17524 521204    0    0  1116   832 1338  1286 12  3  0 86
 1 10 168292   7072  17644 524056    0    0  5600   160 1514  2443 24  8  0 68
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 10 168292   8036  17672 530312    0    0  6048   184 1319  1684 16  5  0 79
 1 10 168292   7392  17648 531064    0    0  7080   132 1340  1502 11  5  0 83
 1 13 168292   6840  17644 531620    0    0  6372   360 1379  1858 15  6  0 80
 1  7 168292   7288  17652 528416    0    0   924     5 1317   690 11  2  0 87
 0  5 168292   6648  17664 528472    0    0   136  9961 1293   279  1  1  0 98
 0  6 168292   8056  17664 526024    0    0     4  6297 1154   107  0  1  0 99
 0 14 168292   7416  17680 528388    0    0  1480  2327 1476  2175 24  6  0 70
 2 10 168292   6576  17604 504120    0    0   884  3724 1349  1494 56  9  0 35
 2  7 168292  10016  17604 504052    0    0   920  1052 1358  1564 25  4  0 71
 1  8 168292  12064  17608 506564    0    0  1604  1108 1397  1779 17  4  0 79
 0 12 168292   6656  17604 509764    0    0  3904  2488 1413  1661 19  5  0 76
 1 15 168292   6848  17516 505296    0    0  5196  2052 1445  2156 29  6  0 65
 0 11 168292   8256  17520 507672    0    0  4284   780 1300  1566 19  5  1 75
 1 13 168292   7792  17504 508572    0    0  6808  2303 1364  1465 13  5  0 83
 0  7 168292   6896  17516 508780    0    0  4516  3256 1315  1473 12  4  0 84
 0  6 168292   8160  17524 507604    0    8  2373   448 1196   558  3  2  0 96
 0  4 168292   7904  17532 507664    0    0    52 10924 1234   244  3  1  0 96
 1 19 168292   6640  17532 508140    0    0  1740  7580 1250  1175 13  4  0 83
 0 11 168292   6704  17532 508344    0    0  1104    12 1258  1166 11  2  0 87
 1  6 168292   6256  17536 502296    0    0  1724   404 1489  1894 34  7  0 58
 0 10 168292   8552  17536 496788    0    0  1824  1548 1561  2694 38  8  3 52
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  5 168292  12036  17544 499432    0    0  1724   148 1515  2175 28  6  0 66
 7  6 168292   6380  17544 501608    0    0  1412   128 1291  1381 22  3  0 74
 9  8 168292  11580  17540 502428    0    0  1540   148 1436  1863 22  6  1 71
 0 14 168292   9276  17540 504264    0    0  1228   660 1616  2405 30  6  0 65
 0 10 168292   6280  17556 507220    0    0  7260  1948 1355  1679 16  6  0 78
 0 12 168292   7176  17564 506164    0    0  7224  2040 1294  1406 10  6  0 84
 0 10 168292   6984  17572 506168    0    0  6608  2340 1316  1511 11  5  0 84
 0 17 168292   6108  17584 506836    0    8  6928  4196 1386  1543 13  5  0 82
 0  9 168292   6712  17584 506020    0    0   592    12 1188   694  7  1  0 91
 0  6 168292   6200  17584 506292    0    0   312  4420 1231   360  1  1  0 98
 1  7 168292   6456  17584 502996   32    0   612  8952 1464  1424 17  4  0 79
 0  6 168296   3200  17580 496572    0    4   636   500 1592  2571 49 10  0 41
 0 17 168304   3124  17580 494836    0    8   868     8 1505  2131 32  6  0 62
 0  7 168404   8364  17580 492808    0  220   628  4264 1406  1908 24  4  0 73
 0 12 168620   5804  17580 493008    0  340  1108  1312 1517  2096 30  5  0 66
 0 10 168620   9884  17584 495112    0    0  1416     0 1424  2218 24  5  0 71
 0 12 168620   7324  17588 497624    0    0  1620     0 1441  1633 16  4  0 80
 0 14 168620  10580  17588 499460    0    0  1228   144 1383  1637 25  5  0 71
 0  8 168632   6784  17608 503888    0  124  4056   124 1432  1655 18  5  0 76
 0 11 169212   7348  17628 503444    0  676  3748   864 1383  1622 19  4  5 71
 4 13 169212   5236  17628 506012   16    0  1808  3896 1364  1903 20  4  4 73
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  6 169296   3444  17640 507780    0   88  1208  1364 1580  2414 33  7  0 60
 0 17 173120  14424  17624 504260    0 3832   956  3836 1383  1716 17  3  0 80
 0  9 174236  13336  17624 505592    0 1116  1056  2672 1298  1119 10  3  0 88
 0 16 174236   5464  17640 513396    0    0  4254  3100 1367  1253 10  4  0 86
 0 12 174732   5176  17624 514112    0  496   776   568 1354  1686 16  4  0 80
 0 17 178196   7008  17624 508172    0 3464   696  3480 1366  1636 23  5  0 73
 0 14 180832  10424  17620 498148    0 2640   664  2640 1307  1274 24  4  0 72
 0  6 180832   9448  17620 498912   52    0   784    12 1316  1169  9  2  2 86
 0 13 180832   8360  17620 499584    8    0   680   176 1405  1717 17  4  4 75
 5 11 180832   9192  17620 500196    0    0   680   128 1579  2259 34  6  8 51
 1  9 180832   8104  17624 501076    0    0   820   340 1355  1412 12  2 13 74
16  7 180832   7016  17632 502020    0    0   972   256 1364  1607 20  3  0 77
 0  8 180832  12280  17632 502600   32    0   636  1930 1394  1472 15  3  0 81
 2  5 180832  11256  17632 503484    0    0   812  3356 1420  1983 23  4  0 73
 0  9 180832  18128  17632 503824    0    0   372  1712 1420  1865 25  5  0 71
 0  7 180832  17152  17632 504708    0    0   848  1456 1327  1609 18  3  2 77
 4 11 180832  24376  17640 505312    0    0   628  3552 1470  1961 21  3 20 55
 0  2 180832  23544  17640 505924    0    0   612  1560 1447  1918 21  4  7 69
 0 15 180832  22584  17648 506732    0    0   740  1376 1313  1216 15  3 14 67
 0  5 180832  22072  17652 507204    0    0   572  1476 1770  3492 46  9  1 44
 1 12 180832  20984  17652 508292    0    0  1040  3232 1741  3186 50  9  5 37
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 8  5 180832  27736  17652 509108    0    0   828  3726 1631  2891 51 11  0 37
 4  6 180832  26936  17652 509856    0    0   700  5672 1702  2875 39  9 13 39
 0  3 180832  33568  17652 510672    0    0   880   232 1677  2889 42  7  2 49
 0 16 180832  32992  17652 511624    0    0   948   136 1559  2325 29  6  5 59
13  9 180832  40240  17652 512236    0    0   604   176 1403  1779 17  3  3 76
 1  9 180832  39248  17652 513732    0    0  1476   144 1705  3237 41  9  3 48
 3  4 180832  46724  17652 514276    0    0   504   272 1721  3376 39  8  0 53
 3  7 180832  45940  17652 515092    0    0   828   840 1753  3187 43  9  6 42
 4  8 180832  53268  17652 515704    0    0   560  3424 1545  2224 29  6 12 53
 2  2 180832  60276  17652 516384    0    0   664   356 1698  3135 37  7  8 48
 5  6 180832  59508  17652 517124    8    0   772   880 1562  2492 28  6  9 58
 0  4 180776  67384  17652 517520    0    0   420  3280 1595  3019 36  8  9 48
 1  7 180776  66968  17652 517928    0    0   336  1568 1558  2338 28  5 22 45
 0  8 180776  66456  17652 518336    0    0   400   332 1501  2040 28  6  5 61
 0  5 180776  73240  17652 519280    8    0   980    48 1604  2806 32  8 12 50
 0  8 180776  72664  17652 519960    0    0   688  5048 1667  2822 35  6 16 42
 1  2 180716  79924  17656 520492    0    0   504     8 1638  2522 37  7 22 35
 3  0 180716  79124  17660 520896    0    0   424     8 1683  3526 48  8 27 16
 0  0 180604  86020  17664 521592   20    0   696    16 1783  3452 56 11 20 12
 1  1 180604  93412  17664 522340    0    0   720     8 1796  3540 52 10 24 15
 5  0 180604  92516  17664 523020    0    0   708   240 1799  3012 47  7 18 27
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 8  0 180604  99556  17664 523632    0    0   580     8 1811  3788 51 10 25 13
 0  3 180604 107132  17664 524176    0    0   528     8 1798  3622 61 10 13 16
14  1 180604 106524  17668 524716    0    0   572    16 1780  3195 48 11  9 32
 0  0 180604 105884  17668 525300   96    0   644     8 1931  4016 66 12 16  5
10 12 180604 113244  17668 525744   32    0   544   651 1733  3057 40  8 17 36
 5  0 180604 109020  17672 526592   32    0   868     8 1778  3294 49  9  7 34
 8  1 180604 104764  17672 527680    0    0  1012     8 1835  3832 65 11  5 19
 1  1 180604 110460  17672 528428    0    0   736    16 1772  3721 56 10 24 11
 0  1 180604 109692  17672 529172    4    0   752     8 1791  3388 51  9 28 11
 1  3 180604 108988  17676 529844    4    0   708   632 1860  3637 54  9  9 28
 0  1 180604 108732  17676 530116    0    0   252    16 1770  2861 46  8 16 32
 4  0 180604 108220  17684 530652    0    0   460     8 1650  3194 41  8 36 14
 5  0 180604 107772  17684 531128    0    0   512    16 1625  3173 40  7 41 11
 3  1 180556 114956  17684 531612    0    0   504     8 1750  3649 51 10 29 11
13  0 180556 122300  17684 532088    0    0   452   724 1835  3344 50  9 14 27
 6  1 180556 121836  17684 532564    0    0   496    16 1782  4114 53 10 30  7
 1  0 180556 121324  17684 533176    0    0   524     8 1686  3157 43  8 40 10
 1  0 180556 128844  17684 533516    0    0   384    16 1720  3460 51  8 31 11
 3  0 180556 128652  17684 533712    8    0   216     8 1740  3888 45  8 38  8
13  4 180556 136076  17684 534324    0    0   576   808 1794  3053 43  8 19 30
 1  0 180556 135756  17684 534596    0    0   288    16 1728  2986 42  8 25 25
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0 180556 135308  17684 535072    0    0   456     8 1715  3644 42  7 41  9



MemTotal:      1293976 kB
MemFree:          7528 kB
Buffers:         16940 kB
Cached:         581684 kB
SwapCached:      32552 kB
Active:         852360 kB
Inactive:       342852 kB
HighTotal:      393216 kB
HighFree:          768 kB
LowTotal:       900760 kB
LowFree:          6760 kB
SwapTotal:      530104 kB
SwapFree:       361800 kB
Dirty:           48908 kB
Writeback:       12996 kB
Mapped:         594156 kB
Slab:            78336 kB
Committed_AS:   883164 kB
PageTables:       4452 kB
VmallocTotal:   114680 kB
VmallocUsed:       560 kB
VmallocChunk:   114120 kB
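
The two /proc/meminfo snapshots in this thread (the earlier one and this
swappiness=100 one) differ mainly in MemFree, Cached, and Active.  A small
parser makes the comparison mechanical; this is an illustrative editor's
sketch, not something posted in the thread:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:   value kB' lines into a dict of kB."""
    info = {}
    for line in text.strip().splitlines():
        key, rest = line.split(':', 1)
        info[key.strip()] = int(rest.split()[0])
    return info

# Values taken from the two snapshots above
before = parse_meminfo("MemFree:   77964 kB\nCached:   525740 kB")
after  = parse_meminfo("MemFree:    7528 kB\nCached:   581684 kB")
delta = {k: after[k] - before[k] for k in before}
print(delta)  # {'MemFree': -70436, 'Cached': 55944}
```

The diff shows what the thread is arguing about: free memory dropped by
~70 MB while page cache grew by ~55 MB.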

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:10   ` Jeff Garzik
  2004-04-29  0:21     ` Nick Piggin
@ 2004-04-29  0:49     ` Brett E.
  2004-04-29  1:00       ` Andrew Morton
                         ` (2 more replies)
  1 sibling, 3 replies; 128+ messages in thread
From: Brett E. @ 2004-04-29  0:49 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel

Jeff Garzik wrote:

> Andrew Morton wrote:
> 
>> Swapout is good.  It frees up unused memory.  I run my desktop 
>> machines at
>> swappiness=100.
> 
> 
> 
> The definition of "unused" is quite subjective and app-dependent...
> 
> I've seen reports with increasing frequency about the swappiness of the 
> 2.6.x kernels, from people who were already annoyed at the swappiness of 
> 2.4.x kernels :)
> 
> Favorite pathological (and quite common) examples are the various 4am 
> cron jobs that scan your entire filesystem.  Running that process 
> overnight on a quiet machine practically guarantees a huge burst of 
> disk activity, with unwanted results:
> 1) Inode and page caches are blown away
> 2) A lot of your desktop apps are swapped out
> 
> Additionally, a (IMO valid) maxim of sysadmins has been "a properly 
> configured server doesn't swap".  There should be no reason why this 
> maxim becomes invalid over time.  When Linux starts to swap out apps the 
> sysadmin knows will be useful in an hour, or six hours, or a day just 
> because it needs a bit more file cache, I get worried.
> 
> There IMO should be some way to balance the amount of anon-vma's such 
> that the sysadmin can say "stop taking 70% of my box's memory for 
> disposable cache, use it instead for apps you would otherwise swap out, 
> you memory-hungry kernel you."
> 
>     Jeff

Or how about "Use ALL the cache you want Mr. Kernel.  But when I want 
more physical memory pages, just reap cache pages and only swap out when 
the cache is down to a certain size(configurable, say 100megs or 
something)."
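
Brett's proposal amounts to a reclaim policy with a cache floor.  A toy
sketch of the idea (purely illustrative pseudocode; the real decision lives
in the kernel's page reclaim code, and the 100 MB floor is just his example
figure):

```python
CACHE_FLOOR_KB = 100 * 1024  # the configurable floor from the suggestion

def reclaim(needed_pages, cache_kb, page_kb=4):
    """Decide how many pages come from cache vs. swap under the proposal."""
    # Cache above the floor is fair game before any swapout happens
    reclaimable = max(0, (cache_kb - CACHE_FLOOR_KB) // page_kb)
    from_cache = min(needed_pages, reclaimable)
    to_swap = needed_pages - from_cache  # swap only once cache hits the floor
    return from_cache, to_swap

print(reclaim(1000, 500 * 1024))  # plenty of cache: (1000, 0)
print(reclaim(1000, 100 * 1024))  # cache at the floor: (0, 1000)
```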




* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:21     ` Nick Piggin
@ 2004-04-29  0:50       ` Wakko Warner
  2004-04-29  0:53         ` Jeff Garzik
                           ` (2 more replies)
  2004-04-29  0:58       ` Marc Singer
  2004-04-29 20:01       ` Horst von Brand
  2 siblings, 3 replies; 128+ messages in thread
From: Wakko Warner @ 2004-04-29  0:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel

> I don't know. What if you have some huge application that only
> runs once per day for 10 minutes? Do you want it to be consuming
> 100MB of your memory for the other 23 hours and 50 minutes for
> no good reason?

I keep soffice open all the time.  The box in question has 512MB of RAM.
This is one app that, even though I use it infrequently, I would prefer
never be swapped out.  When I want to use it, I *WANT* it now (i.e. not
waiting for it to come back from swap).

This is just my opinion.  I personally feel that cache should use available
memory, not already-used memory (swapping apps out for more cache).

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:50       ` Wakko Warner
@ 2004-04-29  0:53         ` Jeff Garzik
  2004-04-29  0:54         ` Nick Piggin
  2004-04-29 21:45         ` Denis Vlasenko
  2 siblings, 0 replies; 128+ messages in thread
From: Jeff Garzik @ 2004-04-29  0:53 UTC (permalink / raw)
  To: Wakko Warner; +Cc: Nick Piggin, linux-kernel

Wakko Warner wrote:
> This is just my opinion.  I personally feel that cache should use available
> memory, not already-used memory (swapping apps out for more cache).


Strongly agreed, though there are pathological cases that prevent this 
from being something that's easy to implement on a global basis.

	Jeff





* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:50       ` Wakko Warner
  2004-04-29  0:53         ` Jeff Garzik
@ 2004-04-29  0:54         ` Nick Piggin
  2004-04-29  1:51           ` Tim Connors
  2004-04-29 21:45         ` Denis Vlasenko
  2 siblings, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-29  0:54 UTC (permalink / raw)
  To: Wakko Warner; +Cc: linux-kernel

Wakko Warner wrote:
>>I don't know. What if you have some huge application that only
>>runs once per day for 10 minutes? Do you want it to be consuming
>>100MB of your memory for the other 23 hours and 50 minutes for
>>no good reason?
> 
> 
> I keep soffice open all the time.  The box in question has 512MB of RAM.
> This is one app that, even though I use it infrequently, I would prefer
> never be swapped out.  When I want to use it, I *WANT* it now (i.e. not
> waiting for it to come back from swap).
> 
> This is just my opinion.  I personally feel that cache should use available
> memory, not already-used memory (swapping apps out for more cache).
> 

On the other hand, suppose that with soffice resident the entire
time, you don't have enough memory to cache an entire kernel tree
(or video you are editing, or whatever).

Now your find | xargs grep keeps taking 30s every time you run
it, or your video is un-editable...


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:21     ` Nick Piggin
  2004-04-29  0:50       ` Wakko Warner
@ 2004-04-29  0:58       ` Marc Singer
  2004-04-29  3:48         ` Nick Piggin
  2004-04-29 20:01       ` Horst von Brand
  2 siblings, 1 reply; 128+ messages in thread
From: Marc Singer @ 2004-04-29  0:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel

On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
> Anyway, I have a small set of VM patches which attempt to improve
> this sort of behaviour if anyone is brave enough to try them.
> Against -mm kernels only I'm afraid (the objrmap work causes some
> porting difficulty).

Is this the same patch you wanted me to try?  

  Remember, the embedded system where NFS IO was pushing my
  application out of memory.  Setting swappiness to zero was a
  temporary fix.




* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:49     ` Brett E.
@ 2004-04-29  1:00       ` Andrew Morton
  2004-04-29  1:24         ` Jeff Garzik
                           ` (2 more replies)
  2004-04-29  1:41       ` Tim Connors
  2004-04-29  9:43       ` Helge Hafting
  2 siblings, 3 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  1:00 UTC (permalink / raw)
  To: brettspamacct; +Cc: jgarzik, linux-kernel

"Brett E." <brettspamacct@fastclick.com> wrote:
>
>  Or how about "Use ALL the cache you want Mr. Kernel.  But when I want 
>  more physical memory pages, just reap cache pages and only swap out when 
>  the cache is down to a certain size(configurable, say 100megs or 
>  something)."

Have you tried decreasing /proc/sys/vm/swappiness?  That's what it is for.

My point is that decreasing the tendency of the kernel to swap stuff out is
wrong.  You really don't want hundreds of megabytes of BloatyApp's
untouched memory floating about in the machine.  Get it out on the disk,
use the memory for something useful.
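
For anyone reading along: swappiness is a runtime tunable on 2.6 kernels.
A minimal tuning sketch, assuming root and a kernel with the knob:

```shell
# Inspect the current value (the 2.6 default is 60; 0 = avoid swapping
# anonymous pages where possible, 100 = swap them out readily)
cat /proc/sys/vm/swappiness

# Lower it for a cache-heavy workload (needs root)
echo 20 > /proc/sys/vm/swappiness

# Equivalently, via sysctl(8):
sysctl -w vm.swappiness=20
```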



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:44   ` Brett E.
@ 2004-04-29  1:13     ` Andrew Morton
  2004-04-29  1:29       ` Brett E.
  0 siblings, 1 reply; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  1:13 UTC (permalink / raw)
  To: brettspamacct; +Cc: linux-kernel

"Brett E." <brettspamacct@fastclick.com> wrote:
>
> > I see no swapout from the info which you sent.
> 
>  pgpgout/s gives the total number of blocks paged out to disk per second, 
>  it peaks at 13,000 and hovers around 3,000 per the attachment.

Nope.  pgpgout is simply writes to disk, of all types.

swapout is accounted for under pswpout and your vmstat trace shows a little
bit of (healthy) swapout with swappiness=100 and negligible swapout with
swappiness=0.  In both cases, negligible swapin.  That's all just fine.

>  Swapping out is good, but when that's coupled with swapping in as is the 
>  case on my side, it creates a thrashing situation where we swap out to 
>  disk pages which are being used, we then immediately swap those pages 
>  back in, etc etc..

Look at your "si" column in vmstat.  It's practically all zeroes.

>  The usage pattern by the way is on a server which continuously hits a 
>  database and reads files so I don't know what "swappiness" should be set 
>  to exactly.  Every hour or so it wants to untar tarballs and by then the 
>  cache is large. From here, the system swaps in and out more while cache 
>  decreases. Basically, it should do what I believe Solaris does... simply 
>  reclaim cache and not swap.  Capping cache would be good too but the 
>  best solution IMO is to simply reclaim the cache on an as-needed basis 
>  before thinking about swapping.

swappiness=100: swaps a lot.  swappiness=0: doesn't swap much.

With a funny workload like that you might choose to set swappiness to 0
just around the hourly tar operation, but as the machine seems to not be
swapping there doesn't seem to be a need.
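
Andrew's check ("look at your si column") can be applied to the long traces
above mechanically.  A throwaway helper (editor's illustration, not from the
thread) that sums the si/so columns of `vmstat 1` output:

```python
def swap_activity(vmstat_lines):
    """Sum the si/so columns of `vmstat 1` output, skipping header lines.

    Column order in 2.6-era vmstat:
      r b swpd free buff cache si so bi bo in cs us sy id wa
    """
    si_total = so_total = 0
    for line in vmstat_lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # "procs ..." and "r b ..." header lines start non-numeric
        si_total += int(fields[6])
        so_total += int(fields[7])
    return si_total, so_total

# A few (truncated) lines from the trace above
sample = [
    "procs -----------memory---------- ---swap-- -----io----",
    " r  b   swpd   free   buff  cache   si   so    bi    bo",
    " 0  6 168396   6328  16756 488324    0    0  4712  2000",
    " 8 10 168392  10104  13056 510256   32    0  1460   192",
]
print(swap_activity(sample))  # (32, 0): a little swap-in, no swap-out
```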


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:00       ` Andrew Morton
@ 2004-04-29  1:24         ` Jeff Garzik
  2004-04-29  1:40           ` Andrew Morton
  2004-04-29  1:30         ` Paul Mackerras
  2004-04-29  1:46         ` Rik van Riel
  2 siblings, 1 reply; 128+ messages in thread
From: Jeff Garzik @ 2004-04-29  1:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: brettspamacct, linux-kernel

Andrew Morton wrote:
> "Brett E." <brettspamacct@fastclick.com> wrote:
> 
>> Or how about "Use ALL the cache you want Mr. Kernel.  But when I want 
>> more physical memory pages, just reap cache pages and only swap out when 
>> the cache is down to a certain size(configurable, say 100megs or 
>> something)."
> 
> 
> Have you tried decreasing /proc/sys/vm/swappiness?  That's what it is for.
> 
> My point is that decreasing the tendency of the kernel to swap stuff out is
> wrong.  You really don't want hundreds of megabytes of BloatyApp's
> untouched memory floating about in the machine.  Get it out on the disk,
> use the memory for something useful.

Well, if it's truly untouched, then it never needs to be allocated a 
page or swapped out at all... just accounted for (overcommit on/off, 
etc. here)

But I assume you are not talking about that, but instead talking about 
_rarely_ used pages, that were filled with some amount of data at some 
point in time.  These are at the heart of the thread (or my point, at 
least) -- BloatyApp may be Oracle with a huge cache of its own, for 
which swapping out may be a huge mistake.  Or Mozilla.  After some 
amount of disk IO on my 512MB machine, Mozilla would be swapped out... 
when I had only been typing an email minutes before.

BloatyApp?  Yes.  Should it have been swapped out?  Absolutely not.  The
'SIZE' in top was only 160M and there were no other major apps running.

Applications are increasingly playing second fiddle to cache ;-(

Regardless of /proc/sys/vm/swappiness, I think it's a valid concern of 
sysadmins who request "hard cache limit", because they are seeing 
pathological behavior such that apps get swapped out when cache is over 
50% of all available memory.

	Jeff




* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:13     ` Andrew Morton
@ 2004-04-29  1:29       ` Brett E.
  2004-04-29 18:05         ` Brett E.
  0 siblings, 1 reply; 128+ messages in thread
From: Brett E. @ 2004-04-29  1:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:

> "Brett E." <brettspamacct@fastclick.com> wrote:
> 
>>>I see no swapout from the info which you sent.
>>
>> pgpgout/s gives the total number of blocks paged out to disk per second, 
>> it peaks at 13,000 and hovers around 3,000 per the attachment.
> 
> 
> Nope.  pgpgout is simply writes to disk, of all types.
That is what is confusing me.  From the sar man page:

pgpgin/s
     Total number of kilobytes the system paged in from disk per second.

pgpgout/s
     Total number of kilobytes the system paged out to disk per second.



> 
> swapout is accounted for under pswpout and your vmstat trace shows a little
> bit of (healthy) swapout with swappiness=100 and negligible swapout with
> swappiness=0.  In both cases, negligible swapin.  That's all just fine.
> 
> 
>> Swapping out is good, but when that's coupled with swapping in as is the 
>> case on my side, it creates a thrashing situation where we swap out to 
>> disk pages which are being used, we then immediately swap those pages 
>> back in, etc etc..
> 
> 
> Look at your "si" column in vmstat.  It's practically all zeroes.
> 
> 
>> The usage pattern by the way is on a server which continuously hits a 
>> database and reads files so I don't know what "swappiness" should be set 
>> to exactly.  Every hour or so it wants to untar tarballs and by then the 
>> cache is large. From here, the system swaps in and out more while cache 
>> decreases. Basically, it should do what I believe Solaris does... simply 
>> reclaim cache and not swap.  Capping cache would be good too but the 
>> best solution IMO is to simply reclaim the cache on an as-needed basis 
>> before thinking about swapping.
> 
> 
> swappiness=100: swaps a lot.  swappiness=0: doesn't swap much.
> 
> With a funny workload like that you might choose to set swappiness to 0
> just around the hourly tar operation, but as the machine seems to not be
> swapping there doesn't seem to be a need.

Yeah, it wouldn't help if paging isn't the problem.  I'd like more
clarification on sar before I rule out paging as the culprit.




* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:00       ` Andrew Morton
  2004-04-29  1:24         ` Jeff Garzik
@ 2004-04-29  1:30         ` Paul Mackerras
  2004-04-29  1:31           ` Paul Mackerras
  2004-04-29  1:53           ` Andrew Morton
  2004-04-29  1:46         ` Rik van Riel
  2 siblings, 2 replies; 128+ messages in thread
From: Paul Mackerras @ 2004-04-29  1:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: brettspamacct, jgarzik, linux-kernel

Andrew Morton writes:

> My point is that decreasing the tendency of the kernel to swap stuff out is
> wrong.  You really don't want hundreds of megabytes of BloatyApp's
> untouched memory floating about in the machine.  Get it out on the disk,
> use the memory for something useful.

What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
gigabyte or so of data over to another machine, it then takes several
seconds to change focus from one window to another.  I can see it
slowly redraw the window title bars.  It looks like the window manager
is getting swapped/paged out.

This machine has 2.5GB of ram, so I really don't see why it would need
to swap at all.  There should be plenty of page cache pages that are
clean and not in use by any process that could be discarded.  It seems
like as soon as there is any memory shortage at all it picks on the
window manager and chucks out all its pages. :(

Paul.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:30         ` Paul Mackerras
@ 2004-04-29  1:31           ` Paul Mackerras
  2004-04-29  1:53           ` Andrew Morton
  1 sibling, 0 replies; 128+ messages in thread
From: Paul Mackerras @ 2004-04-29  1:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: brettspamacct, jgarzik, linux-kernel

I wrote:

> What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
> gigabyte or so of data over to another machine, it then takes several
> seconds to change focus from one window to another.  I can see it
> slowly redraw the window title bars.  It looks like the window manager
> is getting swapped/paged out.

I meant to add that this is with swappiness = 60.

Paul.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:24         ` Jeff Garzik
@ 2004-04-29  1:40           ` Andrew Morton
  2004-04-29  1:47             ` Rik van Riel
                               ` (2 more replies)
  0 siblings, 3 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  1:40 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: brettspamacct, linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> Andrew Morton wrote:
> > "Brett E." <brettspamacct@fastclick.com> wrote:
> > 
> >> Or how about "Use ALL the cache you want Mr. Kernel.  But when I want 
> >> more physical memory pages, just reap cache pages and only swap out when 
> >> the cache is down to a certain size(configurable, say 100megs or 
> >> something)."
> > 
> > 
> > Have you tried decreasing /proc/sys/vm/swappiness?  That's what it is for.
> > 
> > My point is that decreasing the tendency of the kernel to swap stuff out is
> > wrong.  You really don't want hundreds of megabytes of BloatyApp's
> > untouched memory floating about in the machine.  Get it out on the disk,
> > use the memory for something useful.
> 
> Well, if it's truly untouched, then it never needs to be allocated a 
> page or swapped out at all... just accounted for (overcommit on/off, 
> etc. here)
> 
> But I assume you are not talking about that, but instead talking about 
> _rarely_ used pages, that were filled with some amount of data at some 
> point in time.

Of course.  My fairly modest desktop here stabilises at about 300 megs
swapped out, with negligible swapin.  That's all just crap which apps
aren't using any more.  Getting that memory out on disk, relatively freely
is an important optimisation.

>  These are at the heart of the thread (or my point, at 
> least) -- BloatyApp may be Oracle with a huge cache of its own, for 
> which swapping out may be a huge mistake.  Or Mozilla.  After some 
> amount of disk IO on my 512MB machine, Mozilla would be swapped out... 
> when I had only been typing an email minutes before.

OK, so it takes four seconds to swap mozilla back in, and you noticed it.

Did you notice that those three kernel builds you just did ran in twenty
seconds less time because they had more cache available?  Nope.

> Regardless of /proc/sys/vm/swappiness, I think it's a valid concern of 
> sysadmins who request "hard cache limit", because they are seeing 
> pathological behavior such that apps get swapped out when cache is over 
> 50% of all available memory.

We should be sceptical of this.  If they can provide *numbers* then fine. 
Otherwise, the subjective "oh gee, that took a long time" seat-of-the-pants
stuff does not impress.  If they want to feel better about it then sure,
set swappiness to zero and live with less cache for the things which need
it...

Let me point out that the kernel right now, with default swappiness very
much tends to reclaim cache rather than swapping stuff out.  The
top-of-thread report was incorrect, due to a misreading of kernel
instrumentation.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re:  ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:49     ` Brett E.
  2004-04-29  1:00       ` Andrew Morton
@ 2004-04-29  1:41       ` Tim Connors
  2004-04-29  9:43       ` Helge Hafting
  2 siblings, 0 replies; 128+ messages in thread
From: Tim Connors @ 2004-04-29  1:41 UTC (permalink / raw)
  To: Jeff Garzik, Andrew Morton, linux-kernel

"Brett E." <brettspamacct@fastclick.com> said on Wed, 28 Apr 2004 17:49:43 -0700:
> Or how about "Use ALL the cache you want Mr. Kernel.  But when I want 
> more physical memory pages, just reap cache pages and only swap out when 
> the cache is down to a certain size(configurable, say 100megs or 
> something)."

Oh how dearly I would love that...

I have a huge app that operates on a large file (but both are a bit
smaller than available memory, by maybe a hundred or two megs - enough
to keep the entire working set in RAM, anyway). I create these
large files over and over (on another host, so cache does absolutely
no good whatsoever, since it is a single streaming read), but don't
delete the old ones, so they all remain in cache. So when I close one
copy of the app, and open up a new one on a different file, when it
comes time to allocate those several hundred megs, it rather blows
away my mozilla or my X session(! -- since I need it to display the
results) or my window manager, and keeps growing that cache.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
When you are chewing on life's gristle, don't grumble - give a whistle!
This'll help things turn out for the best
Always look on the bright side of life

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:00       ` Andrew Morton
  2004-04-29  1:24         ` Jeff Garzik
  2004-04-29  1:30         ` Paul Mackerras
@ 2004-04-29  1:46         ` Rik van Riel
  2004-04-29  1:57           ` Andrew Morton
  2 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2004-04-29  1:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: brettspamacct, jgarzik, linux-kernel

On Wed, 28 Apr 2004, Andrew Morton wrote:

> You really don't want hundreds of megabytes of BloatyApp's untouched
> memory floating about in the machine.

But people do.  The point here is LATENCY: when a user comes
back from lunch and continues typing in OpenOffice, his system
should behave just like he left it.

Making the user have very bad interactivity for the first
minute or so is a Bad Thing, even if the computer did run
more efficiently while the user wasn't around to notice...

IMHO, the VM on a desktop system really should be optimised to
have the best interactive behaviour, meaning decent latency
when switching applications.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:40           ` Andrew Morton
@ 2004-04-29  1:47             ` Rik van Riel
  2004-04-29 18:14               ` Adam Kropelin
  2004-04-29  2:19             ` Tim Connors
  2004-04-29 16:24             ` Martin J. Bligh
  2 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2004-04-29  1:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jeff Garzik, brettspamacct, linux-kernel

On Wed, 28 Apr 2004, Andrew Morton wrote:

> OK, so it takes four seconds to swap mozilla back in, and you noticed it.
> 
> Did you notice that those three kernel builds you just did ran in twenty
> seconds less time because they had more cache available?  Nope.

That's exactly why desktops should be optimised to give
the best performance where the user notices it most...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re:  ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:54         ` Nick Piggin
@ 2004-04-29  1:51           ` Tim Connors
  0 siblings, 0 replies; 128+ messages in thread
From: Tim Connors @ 2004-04-29  1:51 UTC (permalink / raw)
  To: linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> said on Thu, 29 Apr 2004 10:54:36 +1000:
> Wakko Warner wrote:
> >>I don't know. What if you have some huge application that only
> >>runs once per day for 10 minutes? Do you want it to be consuming
> >>100MB of your memory for the other 23 hours and 50 minutes for
> >>no good reason?
> > 
> > 
> > I keep soffice open all the time.  The box in question has 512mb of ram. 
> > This is one app that, even though I use it infrequently, I would prefer
> > never be swapped out.  Mainly, when I want to use it, I *WANT* it now (i.e. not
> > waiting for it to come back from swap)
> > 
> > This is just my opinion.  I personally feel that cache should use available
> > memory, not already used memory (swapping apps out for more cache).
> > 
> 
> On the other hand, suppose that with soffice resident the entire
> time, you don't have enough memory to cache an entire kernel tree
> (or video you are editing, or whatever).

For the kernel example, I only ever compile once before rebooting[1]
:)

This I think is the kind of thing that a kernel will never
automatically detect. This *must* be in the hands of the
administrator, who will know what they are doing (hopefully).

[1] I have never had enough memory on machines that I use to compile
kernels, to cache an entire tree anyway -- I'd much rather mozilla use
it than a cache which will never be reused

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
[transporting bed]... across several suburbs and a large salt water
harbour. Well, they thoughtfully bridged the harbour in the 1930s, so
the problem was actually transporting it across several suburbs and a
long single span bridge.          -- Hipatia

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:30         ` Paul Mackerras
  2004-04-29  1:31           ` Paul Mackerras
@ 2004-04-29  1:53           ` Andrew Morton
  2004-04-29  2:40             ` Andrew Morton
  2004-04-29  3:57             ` Nick Piggin
  1 sibling, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  1:53 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: brettspamacct, jgarzik, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> Andrew Morton writes:
> 
> > My point is that decreasing the tendency of the kernel to swap stuff out is
> > wrong.  You really don't want hundreds of megabytes of BloatyApp's
> > untouched memory floating about in the machine.  Get it out on the disk,
> > use the memory for something useful.
> 
> What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
> gigabyte or so of data over to another machine, it then takes several
> seconds to change focus from one window to another.  I can see it
> slowly redraw the window title bars.  It looks like the window manager
> is getting swapped/paged out.
> 
> This machine has 2.5GB of ram, so I really don't see why it would need
> to swap at all.  There should be plenty of page cache pages that are
> clean and not in use by any process that could be discarded.  It seems
> like as soon as there is any memory shortage at all it picks on the
> window manager and chucks out all its pages. :(
> 

I suspect rsync is taking two passes across the source files for its
checksumming thing.  If so, this will defeat the pagecache use-once logic. 
The kernel sees the second touch of the pages and assumes that there will
be a third touch.

I use scp ;)
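
[Editorial note: for readers unfamiliar with the use-once logic referred
to above -- new pagecache pages land on the inactive list and are only
promoted to the active list on a second touch, so a single streaming
pass stays cheap to reclaim.  The sketch below is a much-simplified toy
model of that two-list behaviour, not the kernel's implementation.]

```python
from collections import OrderedDict

class TwoListCache:
    """Toy model of the 2.6-era active/inactive pagecache lists.

    First touch puts a page at the tail of the inactive list; a second
    touch promotes it to the active list.  Reclaim evicts from the head
    of the inactive list first, so use-once streamed pages go cheaply,
    while twice-touched pages (e.g. an rsync checksum pass followed by
    a transfer pass) accumulate on the active list and displace others.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()   # page -> None, in LRU order
        self.active = OrderedDict()

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:     # second touch: promote
            del self.inactive[page]
            self.active[page] = None
        else:                           # first touch: inactive list
            self.inactive[page] = None
            self._reclaim()

    def _reclaim(self):
        while len(self.inactive) + len(self.active) > self.capacity:
            if self.inactive:
                self.inactive.popitem(last=False)  # evict oldest inactive
            else:
                self.active.popitem(last=False)
```

Streaming many pages once through a small cache never grows the active
list; touching the same pages twice fills it, which is exactly the
pattern that makes a two-pass rsync defeat the use-once heuristic.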

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:46         ` Rik van Riel
@ 2004-04-29  1:57           ` Andrew Morton
  2004-04-29  2:29             ` Marc Singer
  2004-04-29  2:41             ` Rik van Riel
  0 siblings, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  1:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: brettspamacct, jgarzik, linux-kernel

Rik van Riel <riel@redhat.com> wrote:
>
>  IMHO, the VM on a desktop system really should be optimised to
>  have the best interactive behaviour, meaning decent latency
>  when switching applications.

I'm gonna stick my fingers in my ears and sing "la la la" until people tell
me "I set swappiness to zero and it didn't do what I wanted it to do".


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re:  ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:40           ` Andrew Morton
  2004-04-29  1:47             ` Rik van Riel
@ 2004-04-29  2:19             ` Tim Connors
  2004-04-29 16:24             ` Martin J. Bligh
  2 siblings, 0 replies; 128+ messages in thread
From: Tim Connors @ 2004-04-29  2:19 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: brettspamacct, linux-kernel

Andrew Morton <akpm@osdl.org> said on Wed, 28 Apr 2004 18:40:08 -0700:
> Jeff Garzik <jgarzik@pobox.com> wrote:
> >  These are at the heart of the thread (or my point, at 
> > least) -- BloatyApp may be Oracle with a huge cache of its own, for 
> > which swapping out may be a huge mistake.  Or Mozilla.  After some 
> > amount of disk IO on my 512MB machine, Mozilla would be swapped out... 
> > when I had only been typing an email minutes before.
> 
> OK, so it takes four seconds to swap mozilla back in, and you noticed it.

Actually, about 20-30 seconds on all of my boxes (no, I have no idea
why so slow even on the P4 I have here - swapping has always seemed
overly slow on this machine, and yes, DMA is turned on) with a ~100MB
mozilla image (plus the parts of X that get swapped out and need to be
swapped in before the user sees any effect - X takes up about ~100MB
res memory typically here, since I tend to have so many apps with
cached pixmaps open and in current use).

> Did you notice that those three kernel builds you just did ran in twenty
> seconds less time because they had more cache available?  Nope.

Nope, because I never run 3 builds before rebooting - I do however run
a lot of software that only ever reads a file once (the file was
written on another host on the cluster, so the caching done at write
time is of no benefit to us here).

This is something that should be up to the admin, because the kernel
*cannot* know what I want. And I don't think /proc/.../swappiness is
enough to define what we want.

> > Regardless of /proc/sys/vm/swappiness, I think it's a valid concern of 
> > sysadmins who request "hard cache limit", because they are seeing 
> > pathological behavior such that apps get swapped out when cache is over 
> > 50% of all available memory.
> 
> We should be sceptical of this.  If they can provide *numbers* then fine. 
> Otherwise, the subjective "oh gee, that took a long time" seat-of-the-pants
> stuff does not impress.  If they want to feel better about it then sure,
> set swappiness to zero and live with less cache for the things which need
> it...

OK - I'll try to get around to giving you a vmstat 1 and maybe top
output, and timing things next time I run one of these big
visualisation jobs (it'd be very nice if this was all backported to
2.4, since this is what we are mostly using here -- I think I can find
a 2.6 machine though)...

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
My code is giving me mixed signals. SIGSEGV then SIGILL then SIGBUS. -- me

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:57           ` Andrew Morton
@ 2004-04-29  2:29             ` Marc Singer
  2004-04-29  2:35               ` Andrew Morton
  2004-04-29  2:41             ` Rik van Riel
  1 sibling, 1 reply; 128+ messages in thread
From: Marc Singer @ 2004-04-29  2:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, brettspamacct, jgarzik, linux-kernel

On Wed, Apr 28, 2004 at 06:57:20PM -0700, Andrew Morton wrote:
> Rik van Riel <riel@redhat.com> wrote:
> >
> >  IMHO, the VM on a desktop system really should be optimised to
> >  have the best interactive behaviour, meaning decent latency
> >  when switching applications.
> 
> I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> me "I set swappiness to zero and it didn't do what I wanted it to do".

It does, but it's a bit too coarse of a solution.  It just means that
the page cache always loses.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:29             ` Marc Singer
@ 2004-04-29  2:35               ` Andrew Morton
  2004-04-29  3:10                 ` Marc Singer
  0 siblings, 1 reply; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  2:35 UTC (permalink / raw)
  To: Marc Singer; +Cc: riel, brettspamacct, jgarzik, linux-kernel

Marc Singer <elf@buici.com> wrote:
>
> On Wed, Apr 28, 2004 at 06:57:20PM -0700, Andrew Morton wrote:
> > Rik van Riel <riel@redhat.com> wrote:
> > >
> > >  IMHO, the VM on a desktop system really should be optimised to
> > >  have the best interactive behaviour, meaning decent latency
> > >  when switching applications.
> > 
> > I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> > me "I set swappiness to zero and it didn't do what I wanted it to do".
> 
> It does, but it's a bit too coarse of a solution.  It just means that
> the page cache always loses.

That's what people have been asking for.  What are you suggesting should
happen instead?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:53           ` Andrew Morton
@ 2004-04-29  2:40             ` Andrew Morton
  2004-04-29  2:58               ` Paul Mackerras
  2004-04-29 16:50               ` Martin J. Bligh
  2004-04-29  3:57             ` Nick Piggin
  1 sibling, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  2:40 UTC (permalink / raw)
  To: paulus, brettspamacct, jgarzik, linux-kernel

Andrew Morton <akpm@osdl.org> wrote:
>
> Paul Mackerras <paulus@samba.org> wrote:
>  >
> ...
>  > What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
>  > gigabyte or so of data over to another machine, it then takes several
>  > seconds to change focus from one window to another.  I can see it
>  > slowly redraw the window title bars.  It looks like the window manager
>  > is getting swapped/paged out.
>  > 
>  > This machine has 2.5GB of ram, so I really don't see why it would need
>  > to swap at all.  There should be plenty of page cache pages that are
>  > clean and not in use by any process that could be discarded.  It seems
>  > like as soon as there is any memory shortage at all it picks on the
>  > window manager and chucks out all its pages. :(
>  > 
> 
>  I suspect rsync is taking two passes across the source files for its
>  checksumming thing.  If so, this will defeat the pagecache use-once logic. 
>  The kernel sees the second touch of the pages and assumes that there will
>  be a third touch.

OK, a bit of fiddling does indicate that if a file is present on both
client and server, and is modified on the client, the rsync client will
indeed touch the pagecache pages twice.  Does this describe the files which
you're copying at all?

One thing you could do is to run `watch -n1 cat /proc/meminfo'.  Cause lots
of memory to be freed up then do the copy.  Monitor the size of the active
and inactive lists.  If the active list is growing then we know that rsync
is touching pages twice.

That would be an unfortunate special-case.
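
[Editorial note: the manual `watch -n1 cat /proc/meminfo` check
suggested above can also be scripted.  The sketch below is hypothetical
and assumes the 2.6-era /proc/meminfo field names "Active" and
"Inactive"; growth in Active across the copy indicates pages are being
touched twice and promoted off the inactive list.]

```python
import re

def meminfo_fields(text, names=("Active", "Inactive")):
    """Extract selected fields (in kB) from /proc/meminfo-style text."""
    out = {}
    for name in names:
        m = re.search(r"^%s:\s+(\d+) kB" % re.escape(name), text, re.M)
        if m:
            out[name] = int(m.group(1))
    return out

def active_growth(before, after):
    """kB growth of the active list between two meminfo snapshots.

    A large positive number during an rsync run suggests the pages are
    being touched twice (second touch promotes inactive -> active).
    """
    return meminfo_fields(after)["Active"] - meminfo_fields(before)["Active"]
```

Snapshot /proc/meminfo before and after the copy and compare; if Active
grows by roughly the size of the transferred data, the use-once logic
is being defeated.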

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:57           ` Andrew Morton
  2004-04-29  2:29             ` Marc Singer
@ 2004-04-29  2:41             ` Rik van Riel
  2004-04-29  2:43               ` Andrew Morton
  1 sibling, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2004-04-29  2:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: brettspamacct, jgarzik, linux-kernel

On Wed, 28 Apr 2004, Andrew Morton wrote:
> Rik van Riel <riel@redhat.com> wrote:
> >
> >  IMHO, the VM on a desktop system really should be optimised to
> >  have the best interactive behaviour, meaning decent latency
> >  when switching applications.
> 
> I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> me "I set swappiness to zero and it didn't do what I wanted it to do".

Agreed, you shouldn't be the one to fix this problem.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:41             ` Rik van Riel
@ 2004-04-29  2:43               ` Andrew Morton
  0 siblings, 0 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  2:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: brettspamacct, jgarzik, linux-kernel

Rik van Riel <riel@redhat.com> wrote:
>
> On Wed, 28 Apr 2004, Andrew Morton wrote:
> > Rik van Riel <riel@redhat.com> wrote:
> > >
> > >  IMHO, the VM on a desktop system really should be optimised to
> > >  have the best interactive behaviour, meaning decent latency
> > >  when switching applications.
> > 
> > I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> > me "I set swappiness to zero and it didn't do what I wanted it to do".
> 
> Agreed, you shouldn't be the one to fix this problem.
> 

What problem?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:40             ` Andrew Morton
@ 2004-04-29  2:58               ` Paul Mackerras
  2004-04-29  3:09                 ` Andrew Morton
                                   ` (2 more replies)
  2004-04-29 16:50               ` Martin J. Bligh
  1 sibling, 3 replies; 128+ messages in thread
From: Paul Mackerras @ 2004-04-29  2:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: brettspamacct, jgarzik, linux-kernel

Andrew Morton writes:

> OK, a bit of fiddling does indicate that if a file is present on both
> client and server, and is modified on the client, the rsync client will
> indeed touch the pagecache pages twice.  Does this describe the files which
> you're copying at all?

The client/server thing is a bit misleading, what matters is the
direction of the transfer.  In the case I saw this morning, the G5 was
the sender.  In any case I was using the -W switch, which tells it not
to use the rsync algorithm but just transfer the whole file.  So I
believe that rsync on the G5 side was just reading the file through
once.

I have also noticed similar behaviour after doing a bk pull on a
kernel tree.

The really strange thing is that the behaviour seems to get worse the
more RAM you have.  I haven't noticed any problem at all on my laptop
with 768MB, only on the G5, which has 2.5GB.  (The laptop is still on
2.6.2-rc3 though, so I will try a newer kernel on it.)

Regards,
Paul.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:58               ` Paul Mackerras
@ 2004-04-29  3:09                 ` Andrew Morton
  2004-04-29  3:14                 ` William Lee Irwin III
  2004-04-29  6:12                 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  3:09 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: brettspamacct, jgarzik, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> Andrew Morton writes:
> 
> > OK, a bit of fiddling does indicate that if a file is present on both
> > client and server, and is modified on the client, the rsync client will
> > indeed touch the pagecache pages twice.  Does this describe the files which
> > you're copying at all?
> 
> The client/server thing is a bit misleading, what matters is the
> direction of the transfer.  In the case I saw this morning, the G5 was
> the sender.  In any case I was using the -W switch, which tells it not
> to use the rsync algorithm but just transfer the whole file.  So I
> believe that rsync on the G5 side was just reading the file through
> once.
> 
> I have also noticed similar behaviour after doing a bk pull on a
> kernel tree.
> 
> The really strange thing is that the behaviour seems to get worse the
> more RAM you have.  I haven't noticed any problem at all on my laptop
> with 768MB, only on the G5, which has 2.5GB.  (The laptop is still on
> 2.6.2-rc3 though, so I will try a newer kernel on it.)
> 

Is the laptop x86 or ppc?  IIRC there were problems with the pte-referenced
handling on ppc?  Or was it ppc64?  It shouldn't make any difference in
this case I guess.

To investigate this sort of thing you're better off using just a local `dd'
to ascertain the pattern which is causing the problem.  Keep things simple.

What happens if you do a 4G writeout with dd?  Is there any swapout?  There
shouldn't be much at all.  If the big dd indeed does not cause swapout,
then what is different about rsync?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:35               ` Andrew Morton
@ 2004-04-29  3:10                 ` Marc Singer
  2004-04-29  3:19                   ` Andrew Morton
  2004-04-29  8:02                   ` Wichert Akkerman
  0 siblings, 2 replies; 128+ messages in thread
From: Marc Singer @ 2004-04-29  3:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: riel, brettspamacct, jgarzik, linux-kernel

On Wed, Apr 28, 2004 at 07:35:41PM -0700, Andrew Morton wrote:
> Marc Singer <elf@buici.com> wrote:
> >
> > On Wed, Apr 28, 2004 at 06:57:20PM -0700, Andrew Morton wrote:
> > > Rik van Riel <riel@redhat.com> wrote:
> > > >
> > > >  IMHO, the VM on a desktop system really should be optimised to
> > > >  have the best interactive behaviour, meaning decent latency
> > > >  when switching applications.
> > > 
> > > I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> > > me "I set swappiness to zero and it didn't do what I wanted it to do".
> > 
> > It does, but it's a bit too coarse of a solution.  It just means that
> > the page cache always loses.
> 
> That's what people have been asking for.  What are you suggesting should
> happen instead?

I'm thinking that the problem is that the page cache is greedier than
most people expect.  For example, if I could hold the page cache to be
under a specific size, then I could do some performance measurements.
E.g. compile a kernel with a 768MB page cache, then 512MB, 256MB and 128MB.  On a
machine with loads of RAM, where's the optimal page cache size?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:58               ` Paul Mackerras
  2004-04-29  3:09                 ` Andrew Morton
@ 2004-04-29  3:14                 ` William Lee Irwin III
  2004-04-29  6:12                 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 128+ messages in thread
From: William Lee Irwin III @ 2004-04-29  3:14 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Andrew Morton, brettspamacct, jgarzik, linux-kernel

On Thu, Apr 29, 2004 at 12:58:13PM +1000, Paul Mackerras wrote:
> The client/server thing is a bit misleading, what matters is the
> direction of the transfer.  In the case I saw this morning, the G5 was
> the sender.  In any case I was using the -W switch, which tells it not
> to use the rsync algorithm but just transfer the whole file.  So I
> believe that rsync on the G5 side was just reading the file through
> once.
> I have also noticed similar behaviour after doing a bk pull on a
> kernel tree.
> The really strange thing is that the behaviour seems to get worse the
> more RAM you have.  I haven't noticed any problem at all on my laptop
> with 768MB, only on the G5, which has 2.5GB.  (The laptop is still on
> 2.6.2-rc3 though, so I will try a newer kernel on it.)

Looks like you've got a system with an issue. Any chance you could send
logs from an instrumented test run?

Thanks.


-- wli

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  3:10                 ` Marc Singer
@ 2004-04-29  3:19                   ` Andrew Morton
  2004-04-29  4:13                     ` Marc Singer
  2004-04-29 16:51                     ` Andy Isaacson
  2004-04-29  8:02                   ` Wichert Akkerman
  1 sibling, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  3:19 UTC (permalink / raw)
  To: Marc Singer; +Cc: riel, brettspamacct, jgarzik, linux-kernel

Marc Singer <elf@buici.com> wrote:
>
> > That's what people have been asking for.  What are you suggesting should
> > happen instead?
> 
> I'm thinking that the problem is that the page cache is greedier than
> most people expect.  For example, if I could hold the page cache to be
> under a specific size, then I could do some performance measurements.
> E.g. compile a kernel with a 768MB page cache, then 512MB, 256MB and 128MB.  On a
> machine with loads of RAM, where's the optimal page cache size?

Nope, there's no point in leaving free memory floating about when the
kernel can and will reclaim clean pagecache on demand.

What you discuss above is just an implementation detail.  Forget it.  What
are the requirements?  Thus far I've seen

a) updatedb causes cache reclaim

b) updatedb causes swapout

c) prefer that openoffice/mozilla not get paged out when there's heavy
   pagecache demand.

For a) we don't really have a solution.  Some have been proposed but they
could have serious downsides.

For b) and c) we can tune the pageout-vs-cache reclaim tendency with
/proc/sys/vm/swappiness, only nobody seems to know that.

What else is there?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:58       ` Marc Singer
@ 2004-04-29  3:48         ` Nick Piggin
  2004-04-29  4:20           ` Marc Singer
  0 siblings, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-29  3:48 UTC (permalink / raw)
  To: Marc Singer
  Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel,
	Russell King

Marc Singer wrote:
> On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
> 
>>Anyway, I have a small set of VM patches which attempt to improve
>>this sort of behaviour if anyone is brave enough to try them.
>>Against -mm kernels only I'm afraid (the objrmap work causes some
>>porting difficulty).
> 
> 
> Is this the same patch you wanted me to try?  
> 
>   Remember, the embedded system where NFS IO was pushing my
>   application out of memory.  Setting swappiness to zero was a
>   temporary fix.
> 
> 

Yes this is the same patch I wanted you to try. Yes I
remember your problem!

Didn't anyone come up with a patch for you to test the
stale PTE theory? If so, what were the results?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:53           ` Andrew Morton
  2004-04-29  2:40             ` Andrew Morton
@ 2004-04-29  3:57             ` Nick Piggin
  2004-04-29 14:29               ` Rik van Riel
  1 sibling, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-29  3:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Mackerras, brettspamacct, jgarzik, linux-kernel

Andrew Morton wrote:
> Paul Mackerras <paulus@samba.org> wrote:
> 
>>Andrew Morton writes:
>>
>>
>>>My point is that decreasing the tendency of the kernel to swap stuff out is
>>>wrong.  You really don't want hundreds of megabytes of BloatyApp's
>>>untouched memory floating about in the machine.  Get it out on the disk,
>>>use the memory for something useful.
>>
>>What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
>>gigabyte or so of data over to another machine, it then takes several
>>seconds to change focus from one window to another.  I can see it
>>slowly redraw the window title bars.  It looks like the window manager
>>is getting swapped/paged out.
>>
>>This machine has 2.5GB of ram, so I really don't see why it would need
>>to swap at all.  There should be plenty of page cache pages that are
>>clean and not in use by any process that could be discarded.  It seems
>>like as soon as there is any memory shortage at all it picks on the
>>window manager and chucks out all its pages. :(
>>
> 
> 
> I suspect rsync is taking two passes across the source files for its
> checksumming thing.  If so, this will defeat the pagecache use-once logic. 
> The kernel sees the second touch of the pages and assumes that there will
> be a third touch.
> 

I'm not very impressed with the pagecache use-once logic, and I
have a patch to remove it completely and treat non-mapped touches
(IMO) more sanely.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  3:19                   ` Andrew Morton
@ 2004-04-29  4:13                     ` Marc Singer
  2004-04-29  4:33                       ` Andrew Morton
  2004-04-29 16:51                     ` Andy Isaacson
  1 sibling, 1 reply; 128+ messages in thread
From: Marc Singer @ 2004-04-29  4:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: riel, brettspamacct, jgarzik, linux-kernel

On Wed, Apr 28, 2004 at 08:19:24PM -0700, Andrew Morton wrote:
> Marc Singer <elf@buici.com> wrote:
> >
> > > That's what people have been asking for.  What are you suggesting should
> > > happen instead?
> > 
> > I'm thinking that the problem is that the page cache is greedier than
> > most people expect.  For example, if I could hold the page cache to be
> > under a specific size, then I could do some performance measurements.
> > E.g., compile a kernel with a 768K page cache, 512K, 256K and 128K.  On a
> > machine with loads of RAM, where's the optimal page cache size?
> 
> Nope, there's no point in leaving free memory floating about when the
> kernel can and will reclaim clean pagecache on demand.

It could work differently from that.  For example, if we had 500M
total, we map 200M, then we do 400M of IO.  Perhaps we'd like to be
able to say that a 400M page cache is too big.  The problem isn't
about reclaiming pagecache; it's about the cost of swapping pages back
in.  The page cache can tend to favor swapping mapped pages over
reclaiming its own pages that are less likely to be used.  Of course,
it doesn't know that... which is the rub.

If I thought I had a method for doing this, I'd write code to try it
out.

> What you discuss above is just an implementation detail.  Forget it.  What
> are the requirements?  Thus far I've seen

The requirement is that we'd like to see pages aged more gracefully.
A mapped page that is used continuously for ten minutes and then left
to idle for 10 minutes is more valuable than an IO page that was read
once and then not used for ten minutes.  As the mapped page ages, its
value decays.

> a) updatedb causes cache reclaim
> 
> b) updatedb causes swapout
> 
> c) prefer that openoffice/mozilla not get paged out when there's heavy
>    pagecache demand.
> 
> For a) we don't really have a solution.  Some have been proposed but they
> could have serious downsides.
> 
> For b) and c) we can tune the pageout-vs-cache reclaim tendency with
> /proc/sys/vm/swappiness, only nobody seems to know that.

I've read the source for where swappiness comes into play.  Yet I
cannot make a statement about what it means.  Can you?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  3:48         ` Nick Piggin
@ 2004-04-29  4:20           ` Marc Singer
  2004-04-29  4:26             ` Nick Piggin
                               ` (2 more replies)
  0 siblings, 3 replies; 128+ messages in thread
From: Marc Singer @ 2004-04-29  4:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel,
	Russell King

On Thu, Apr 29, 2004 at 01:48:02PM +1000, Nick Piggin wrote:
> Marc Singer wrote:
> >On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
> >
> >>Anyway, I have a small set of VM patches which attempt to improve
> >>this sort of behaviour if anyone is brave enough to try them.
> >>Against -mm kernels only I'm afraid (the objrmap work causes some
> >>porting difficulty).
> >
> >
> >Is this the same patch you wanted me to try?  
> >
> >  Remember, the embedded system where NFS IO was pushing my
> >  application out of memory.  Setting swappiness to zero was a
> >  temporary fix.
> >
> >
> 
> Yes this is the same patch I wanted you to try. Yes I
> remember your problem!
> 
> Didn't anyone come up with a patch for you to test the
> stale PTE theory? If so, what were the results?

Russell King is working on a lot of things for the MMU code in ARM.
I'm waiting to see where he ends up.  I believe he's planning on
removing the lazy PTE release logic.

I hacked at it for some time.  And I'm convinced that I correctly
forced the TLBs to be flushed.  Still, I was never able to get the
system to behave.

Now, I just read a comment you or WLI made about the page cache
use-once logic.  I wonder if that's the real culprit?  As I wrote to
Andrew Morton, the kernel seems to be assigning an awful lot of value
to page cache pages that are used once (or twice?).  I know that it
would be expensive to perform an HTG aging algorithm where the head of
the LRU list is really LRU.  Does your patch pursue this line of
thought?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  4:20           ` Marc Singer
@ 2004-04-29  4:26             ` Nick Piggin
  2004-04-29 14:49               ` Marc Singer
  2004-04-29  6:38             ` William Lee Irwin III
  2004-04-29  7:36             ` Russell King
  2 siblings, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-29  4:26 UTC (permalink / raw)
  To: Marc Singer
  Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel,
	Russell King

Marc Singer wrote:
> On Thu, Apr 29, 2004 at 01:48:02PM +1000, Nick Piggin wrote:
> 
>>Marc Singer wrote:
>>
>>>On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
>>>
>>>
>>>>Anyway, I have a small set of VM patches which attempt to improve
>>>>this sort of behaviour if anyone is brave enough to try them.
>>>>Against -mm kernels only I'm afraid (the objrmap work causes some
>>>>porting difficulty).
>>>
>>>
>>>Is this the same patch you wanted me to try?  
>>>
>>> Remember, the embedded system where NFS IO was pushing my
>>> application out of memory.  Setting swappiness to zero was a
>>> temporary fix.
>>>
>>>
>>
>>Yes this is the same patch I wanted you to try. Yes I
>>remember your problem!
>>
>>Didn't anyone come up with a patch for you to test the
>>stale PTE theory? If so, what were the results?
> 
> 
> Russell King is working on a lot of things for the MMU code in ARM.
> I'm waiting to see where he ends up.  I believe he's planning on
> removing the lazy PTE release logic.
> 
> I hacked at it for some time.  And I'm convinced that I correctly
> forced the TLBs to be flushed.  Still, I was never able to get the
> system to behave.
> 
> Now, I just read a comment you or WLI made about the page cache
> use-once logic.  I wonder if that's the real culprit?  As I wrote to
> Andrew Morton, the kernel seems to be assigning an awful lot of value
> to page cache pages that are used once (or twice?).  I know that it
> would be expensive to perform an HTG aging algorithm where the head of
> the LRU list is really LRU.  Does your patch pursue this line of
> thought?
> 

Yes, it includes something which should help that, along with
the "split active lists" that I mentioned might help your
problem when WLI first came up with the change to the
swappiness calculation.

It would be great if you had time to give my patch a run.
It hasn't been widely stress tested yet though, so no
production systems, of course!

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  4:13                     ` Marc Singer
@ 2004-04-29  4:33                       ` Andrew Morton
  2004-04-29 14:45                         ` Marc Singer
  0 siblings, 1 reply; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  4:33 UTC (permalink / raw)
  To: Marc Singer; +Cc: riel, brettspamacct, jgarzik, linux-kernel

Marc Singer <elf@buici.com> wrote:
>
> It could work differently from that.  For example, if we had 500M
> total, we map 200M, then we do 400M of IO.  Perhaps we'd like to be
> able to say that a 400M page cache is too big.

Try it - you'll find that the system will leave all of your 200M of mapped
memory in place.  You'll be left with 300M of pagecache from that I/O
activity.  There may be a small amount of unmapping activity if the I/O is
a write, or if the system has a small highmem zone.  Maybe.

Beware that both ARM and NFS seem to be doing odd things, so try it on a
PC+disk first ;)

>  The problem isn't
> about reclaiming pagecache; it's about the cost of swapping pages back
> in.  The page cache can tend to favor swapping mapped pages over
> reclaiming its own pages that are less likely to be used.  Of course,
> it doesn't know that... which is the rub.

No, the system will only start to unmap pages if reclaim of unmapped
pagecache is getting into difficulty.  The threshold of "getting into
difficulty" is controlled by /proc/sys/vm/swappiness.

> The requirement is that we'd like to see pages aged more gracefully.
> A mapped page that is used continuously for ten minutes and then left
> to idle for 10 minutes is more valuable than an IO page that was read
> once and then not used for ten minutes.  As the mapped page ages, its
> value decays.

Yes, remembering aging info over that period of time is hard.  We only have
six levels of aging: referenced+active, unreferenced+active,
referenced+inactive, unreferenced+inactive, plus position-on-lru*2.

> I've read the source for where swappiness comes into play.  Yet I
> cannot make a statement about what it means.  Can you?

It controls the level of page reclaim distress at which we decide to start
reclaiming mapped pages.

We prefer to reclaim pagecache, but we have to start swapping at *some*
level of reclaim failure.  swappiness sets that level, in rather vague
units.

It might make sense to recast swappiness in terms of
pages_reclaimed/pages_scanned, which is the real metric of page reclaim
distress.  But that would only affect the meaning of the actual number - it
wouldn't change the tunable's effect on the system.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:58               ` Paul Mackerras
  2004-04-29  3:09                 ` Andrew Morton
  2004-04-29  3:14                 ` William Lee Irwin III
@ 2004-04-29  6:12                 ` Benjamin Herrenschmidt
  2004-04-29  6:22                   ` Andrew Morton
  2004-04-29  6:31                   ` William Lee Irwin III
  2 siblings, 2 replies; 128+ messages in thread
From: Benjamin Herrenschmidt @ 2004-04-29  6:12 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andrew Morton, brettspamacct, Jeff Garzik, Linux Kernel list


> The really strange thing is that the behaviour seems to get worse the
> more RAM you have.  I haven't noticed any problem at all on my laptop
> with 768MB, only on the G5, which has 2.5GB.  (The laptop is still on
> 2.6.2-rc3 though, so I will try a newer kernel on it.)

Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
that the accounting doesn't work properly.

Ben.



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  6:12                 ` Benjamin Herrenschmidt
@ 2004-04-29  6:22                   ` Andrew Morton
  2004-04-29  6:25                     ` Benjamin Herrenschmidt
  2004-04-29  6:31                   ` William Lee Irwin III
  1 sibling, 1 reply; 128+ messages in thread
From: Andrew Morton @ 2004-04-29  6:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulus, brettspamacct, jgarzik, linux-kernel

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> 
> > The really strange thing is that the behaviour seems to get worse the
> > more RAM you have.  I haven't noticed any problem at all on my laptop
> > with 768MB, only on the G5, which has 2.5GB.  (The laptop is still on
> > 2.6.2-rc3 though, so I will try a newer kernel on it.)
> 
> Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
> that the accounting doesn't work properly.

heh.  It should have zone->spanned_pages - zone->present_pages = 2G.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  6:22                   ` Andrew Morton
@ 2004-04-29  6:25                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 128+ messages in thread
From: Benjamin Herrenschmidt @ 2004-04-29  6:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Mackerras, brettspamacct, Jeff Garzik, Linux Kernel list

On Thu, 2004-04-29 at 16:22, Andrew Morton wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >
> > 
> > > The really strange thing is that the behaviour seems to get worse the
> > > more RAM you have.  I haven't noticed any problem at all on my laptop
> > > with 768MB, only on the G5, which has 2.5GB.  (The laptop is still on
> > > 2.6.2-rc3 though, so I will try a newer kernel on it.)
> > 
> > Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
> > that the accounting doesn't work properly.
> 
> heh.  It should have zone->spanned_pages - zone->present_pages = 2G.

That should be fine; I'll check later, as I can't reboot mine right now.

I'm initializing the zone with free_area_init_node() and I _am_ passing
the hole size.  Paul, also check whether you have NUMA enabled in .config;
it changes the way zones are initialized, and I may have gotten that case
wrong.

Ben.



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  6:12                 ` Benjamin Herrenschmidt
  2004-04-29  6:22                   ` Andrew Morton
@ 2004-04-29  6:31                   ` William Lee Irwin III
  1 sibling, 0 replies; 128+ messages in thread
From: William Lee Irwin III @ 2004-04-29  6:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul Mackerras, Andrew Morton, brettspamacct, Jeff Garzik,
	Linux Kernel list

At some point in the past, I wrote:
>> The really strange thing is that the behaviour seems to get worse the
>> more RAM you have.  I haven't noticed any problem at all on my laptop
>> with 768MB, only on the G5, which has 2.5GB.  (The laptop is still on
>> 2.6.2-rc3 though, so I will try a newer kernel on it.)

On Thu, Apr 29, 2004 at 04:12:38PM +1000, Benjamin Herrenschmidt wrote:
> Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
> that the accounting doesn't work properly.

Hmm, ->present_pages vs. ->spanned_pages distinction(s) should cover
this, or should have at one point. How are those being set at the moment?


-- wli

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  4:20           ` Marc Singer
  2004-04-29  4:26             ` Nick Piggin
@ 2004-04-29  6:38             ` William Lee Irwin III
  2004-04-29  7:36             ` Russell King
  2 siblings, 0 replies; 128+ messages in thread
From: William Lee Irwin III @ 2004-04-29  6:38 UTC (permalink / raw)
  To: Marc Singer
  Cc: Nick Piggin, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel, Russell King

On Wed, Apr 28, 2004 at 09:20:47PM -0700, Marc Singer wrote:
> Now, I just read a comment you or WLI made about the page cache
> use-once logic.  I wonder if that's the real culprit?  As I wrote to
> Andrew Morton, the kernel seems to be assigning an awful lot of value
> to page cache pages that are used once (or twice?).  I know that it
> would be expensive to perform an HTG aging algorithm where the head of
> the LRU list is really LRU.  Does your patch pursue this line of
> thought?

I don't recall ever having seen an actual pure LRU patch.

The physical scanning infrastructure should be enough to implement most
global replacement algorithms with. It's always good to compare
alternatives. Also, we should have an implementation of random
replacement just as a control case to verify we do better than random.


-- wli

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  4:20           ` Marc Singer
  2004-04-29  4:26             ` Nick Piggin
  2004-04-29  6:38             ` William Lee Irwin III
@ 2004-04-29  7:36             ` Russell King
  2004-04-29 10:44               ` Nick Piggin
  2 siblings, 1 reply; 128+ messages in thread
From: Russell King @ 2004-04-29  7:36 UTC (permalink / raw)
  To: Marc Singer
  Cc: Nick Piggin, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel

On Wed, Apr 28, 2004 at 09:20:47PM -0700, Marc Singer wrote:
> Russell King is working on a lot of things for the MMU code in ARM.
> I'm waiting to see where he ends up.  I believe he's planning on
> removing the lazy PTE release logic.

Essentially it came to a grinding halt due to the sheer size of the
task of sorting out the crappy includes, which is far too large for a
stable kernel.

I may go back to the original problem and sort it a different way,
but for the time being, I'm occupied in other areas.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  3:10                 ` Marc Singer
  2004-04-29  3:19                   ` Andrew Morton
@ 2004-04-29  8:02                   ` Wichert Akkerman
  2004-04-29 14:25                     ` Marcelo Tosatti
  1 sibling, 1 reply; 128+ messages in thread
From: Wichert Akkerman @ 2004-04-29  8:02 UTC (permalink / raw)
  To: linux-kernel

Previously Marc Singer wrote:
> I'm thinking that the problem is that the page cache is greedier than
> most people expect.  For example, if I could hold the page cache to be
> under a specific size, then I could do some performance measurements.

It is actually greedy enough that when my nightly cron starts I suddenly
see apache and pdns_recursor being killed consistently every day. 

Wichert.

-- 
Wichert Akkerman <wichert@wiggy.net>    It is simple to make things.
http://www.wiggy.net/                   It is hard to make things simple.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:49     ` Brett E.
  2004-04-29  1:00       ` Andrew Morton
  2004-04-29  1:41       ` Tim Connors
@ 2004-04-29  9:43       ` Helge Hafting
  2004-04-29 14:48         ` Marc Singer
  2 siblings, 1 reply; 128+ messages in thread
From: Helge Hafting @ 2004-04-29  9:43 UTC (permalink / raw)
  To: brettspamacct; +Cc: linux-kernel

Brett E. wrote:
[...]
> Or how about "Use ALL the cache you want Mr. Kernel.  But when I want 
> more physical memory pages, just reap cache pages and only swap out when 
> the cache is down to a certain size(configurable, say 100megs or 
> something)."

Problem: reaping cache is equivalent to swapping in some cases.
The cache isn't merely "files read & written".
It is also all your executable code.  Code is no different from
files being read at all.  Dumping too much cache will dump the
code you're executing, and then it has to be reloaded from disk.


Helge Hafting


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  7:36             ` Russell King
@ 2004-04-29 10:44               ` Nick Piggin
  2004-04-29 11:04                 ` Russell King
  0 siblings, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-29 10:44 UTC (permalink / raw)
  To: Russell King
  Cc: Marc Singer, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel

Russell King wrote:
> On Wed, Apr 28, 2004 at 09:20:47PM -0700, Marc Singer wrote:
> 
>>Russell King is working on a lot of things for the MMU code in ARM.
>>I'm waiting to see where he ends up.  I believe he's planning on
>>removing the lazy PTE release logic.
> 
> 
> Essentially it came to a grinding halt due to the shere size of the
> task of sorting out the crappy includes, which is far to large for a
> stable kernel.
> 
> I may go back to the original problem and sort it a different way,
> but for the time being, I'm occupied in other areas.
> 

Anyway, Marc said he tried flushing the TLB and that didn't
solve his problem.

The problem might be the one identified in the thread:
2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 10:44               ` Nick Piggin
@ 2004-04-29 11:04                 ` Russell King
  2004-04-29 14:52                   ` Marc Singer
  0 siblings, 1 reply; 128+ messages in thread
From: Russell King @ 2004-04-29 11:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Marc Singer, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel

On Thu, Apr 29, 2004 at 08:44:25PM +1000, Nick Piggin wrote:
> Anyway, Marc said he tried flushing the TLB and that didn't
> solve his problem.

Nevertheless, when you have a TLB with ASIDs, there will be even less
pressure to flush these entries from the TLB, so in effect we might
as well save the expense of implementing the page aging in the first
place.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:04 ` Brett E.
  2004-04-29  0:13   ` Jeff Garzik
@ 2004-04-29 13:51   ` Horst von Brand
  2004-04-29 18:32     ` Brett E.
  1 sibling, 1 reply; 128+ messages in thread
From: Horst von Brand @ 2004-04-29 13:51 UTC (permalink / raw)
  To: brettspamacct; +Cc: Linux Kernel Mailing List

"Brett E." <brettspamacct@fastclick.com> said:

[...]

> I created a hack which allocates memory causing cache to go down, then 
> exits, freeing up the malloc'ed memory. This brings free memory up by 
> 400 megs and brings the cache down to close to 0, of course the cache 
grows right afterwards. It would be nice to cap the cache data structures 
> in the kernel but I've been posting about this since September to no 
> avail so my expectations are pretty low.

Because it is complete nonsense. Keeping stuff around in RAM in case it
is needed again, as long as RAM is not needed for anything else, is a major
win. That is what cache is.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  8:02                   ` Wichert Akkerman
@ 2004-04-29 14:25                     ` Marcelo Tosatti
  2004-04-29 14:27                       ` Wichert Akkerman
  0 siblings, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2004-04-29 14:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: wichert

On Thu, Apr 29, 2004 at 10:02:19AM +0200, Wichert Akkerman wrote:
> Previously Marc Singer wrote:
> > I'm thinking that the problem is that the page cache is greedier than
> > most people expect.  For example, if I could hold the page cache to be
> > under a specific size, then I could do some performance measurements.
> 
> It is actually greedy enough that when my nightly cron starts I suddenly
> see apache and pdns_recursor being killed consistently every day. 

Which kernel is that? 

They are getting killed because there is no more swap available.
Otherwise it's a bug.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 14:25                     ` Marcelo Tosatti
@ 2004-04-29 14:27                       ` Wichert Akkerman
  0 siblings, 0 replies; 128+ messages in thread
From: Wichert Akkerman @ 2004-04-29 14:27 UTC (permalink / raw)
  To: linux-kernel

Previously Marcelo Tosatti wrote:
> Which kernel is that? 

That machine is running 2.6.4 at the moment.

> They are getting killed because there is no more swap available.
> Otherwise it's a bug.

It actually killed a bunch of processes a minute ago and right now has
120mb of swap free and 104mb used for cache. 

Wichert.

-- 
Wichert Akkerman <wichert@wiggy.net>    It is simple to make things.
http://www.wiggy.net/                   It is hard to make things simple.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  3:57             ` Nick Piggin
@ 2004-04-29 14:29               ` Rik van Riel
  2004-04-30  3:00                 ` Nick Piggin
  0 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2004-04-29 14:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Paul Mackerras, brettspamacct, jgarzik,
	linux-kernel

On Thu, 29 Apr 2004, Nick Piggin wrote:

> I'm not very impressed with the pagecache use-once logic, and I
> have a patch to remove it completely and treat non-mapped touches
> (IMO) more sanely.

The basic idea of use-once isn't bad (search for LIRS and
ARC page replacement); however, the Linux implementation
doesn't have any of the checks and balances that the
researched replacement algorithms have...

However, adding the checks and balances required for LIRS,
ARC and CAR(S) isn't easy since it requires keeping track of
a number of recently evicted pages.  That could be quite a 
bit of infrastructure, though it might be well worth it.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  4:33                       ` Andrew Morton
@ 2004-04-29 14:45                         ` Marc Singer
  0 siblings, 0 replies; 128+ messages in thread
From: Marc Singer @ 2004-04-29 14:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: riel, brettspamacct, jgarzik, linux-kernel

On Wed, Apr 28, 2004 at 09:33:59PM -0700, Andrew Morton wrote:
> Marc Singer <elf@buici.com> wrote:
> >
> > It could work differently from that.  For example, if we had 500M
> > total, we map 200M, then we do 400M of IO.  Perhaps we'd like to be
> > able to say that a 400M page cache is too big.
> 
> Try it - you'll find that the system will leave all of your 200M of mapped
> memory in place.  You'll be left with 300M of pagecache from that I/O
> activity.  There may be a small amount of unmapping activity if the I/O is
> a write, or if the system has a small highmem zone.  Maybe.

Are you sure?  Isn't that what the other posters are whinging about?
They do lots of IO and then they have to wait for the system to page
Mozilla back in.

> Beware that both ARM and NFS seem to be doing odd things, so try it on a
> PC+disk first ;)

Yeah, I know that there is still something odd in ARM-land.  I assume
that the other posters are using IA32. 

> No, the system will only start to unmap pages if reclaim of unmapped
> pagecache is getting into difficulty.  The threshold of "getting into
> difficulty" is controlled by /proc/sys/vm/swappiness.

What constitutes 'difficulty'?  Perhaps this is rhetorical. 

> > I've read the source for where swappiness comes into play.  Yet I
> > cannot make a statement about what it means.  Can you?
> 
> It controls the level of page reclaim distress at which we decide to start
> reclaiming mapped pages.
> 
> We prefer to reclaim pagecache, but we have to start swapping at *some*
> level of reclaim failure.  swappiness sets that level, in rather vague
> units.

I'm not sure I see why we have to swap.  If half of memory is mapped,
and the user is using those pages with some frequency, perhaps we
should never reclaim mapped pages.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  9:43       ` Helge Hafting
@ 2004-04-29 14:48         ` Marc Singer
  0 siblings, 0 replies; 128+ messages in thread
From: Marc Singer @ 2004-04-29 14:48 UTC (permalink / raw)
  To: Helge Hafting; +Cc: brettspamacct, linux-kernel

On Thu, Apr 29, 2004 at 11:43:25AM +0200, Helge Hafting wrote:
> Brett E. wrote:
> [...]
> >Or how about "Use ALL the cache you want Mr. Kernel.  But when I want 
> >more physical memory pages, just reap cache pages and only swap out when 
> >the cache is down to a certain size(configurable, say 100megs or 
> >something)."
> 
> Problem: reaping cache is equivalent to swapping in some cases.
> The cache isn't merely "files read & written".
> It is also all your executable code.  Code is no different from
> files being read at all.  Dumping too much cache will dump the
> code you're executing, and then it has to be reloaded from disk.

Hmm.  I was under the impression that mapped pages were code and
unmapped pages were IO page cache.  Are you suggesting that code is
duplicated?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  4:26             ` Nick Piggin
@ 2004-04-29 14:49               ` Marc Singer
  2004-04-30  4:08                 ` Nick Piggin
  0 siblings, 1 reply; 128+ messages in thread
From: Marc Singer @ 2004-04-29 14:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel,
	Russell King

On Thu, Apr 29, 2004 at 02:26:17PM +1000, Nick Piggin wrote:
> Yes it includes something which should help that. Along with
> the "split active lists" that I mentioned might help your
> problem when WLI first came up with the change to the
> swappiness calculation for your problem.
> 
> It would be great if you had time to give my patch a run.
> It hasn't been widely stress tested yet though, so no
> production systems, of course!

As I said, I'm game to have a go.  The trouble is that it doesn't
apply.  My development kernel has an RMK patch applied that seems to
conflict with the MM patch on which you depend.



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 11:04                 ` Russell King
@ 2004-04-29 14:52                   ` Marc Singer
  0 siblings, 0 replies; 128+ messages in thread
From: Marc Singer @ 2004-04-29 14:52 UTC (permalink / raw)
  To: Nick Piggin, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel

On Thu, Apr 29, 2004 at 12:04:19PM +0100, Russell King wrote:
> On Thu, Apr 29, 2004 at 08:44:25PM +1000, Nick Piggin wrote:
> > Anyway, Marc said he tried flushing the tlb and that didn't
> > solve his problem.
> 
> Nevertheless, when you have a TLB with ASIDs, there will be even less
> pressure to flush these entries from the TLB, so in effect we might
> as well save the expense of implementing the page aging in the first
> place.

Uh, oh. My FLA translator just broke.  Whatsa ASID?



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:40           ` Andrew Morton
  2004-04-29  1:47             ` Rik van Riel
  2004-04-29  2:19             ` Tim Connors
@ 2004-04-29 16:24             ` Martin J. Bligh
  2004-04-29 16:36               ` Chris Friesen
  2 siblings, 1 reply; 128+ messages in thread
From: Martin J. Bligh @ 2004-04-29 16:24 UTC (permalink / raw)
  To: Andrew Morton, Jeff Garzik; +Cc: brettspamacct, linux-kernel

>>  These are at the heart of the thread (or my point, at 
>> least) -- BloatyApp may be Oracle with a huge cache of its own, for 
>> which swapping out may be a huge mistake.  Or Mozilla.  After some 
>> amount of disk IO on my 512MB machine, Mozilla would be swapped out... 
>> when I had only been typing an email minutes before.
> 
> OK, so it takes four seconds to swap mozilla back in, and you noticed it.
> 
> Did you notice that those three kernel builds you just did ran in twenty
> seconds less time because they had more cache available?  Nope.

The latency for interactive stuff is definitely more noticeable though, and
thus arguably more important. Perhaps we should be tying the scheduler in
more tightly with the VM - we've already decided there which apps are 
"interactive" and thus need low latency ... shouldn't we be giving a boost
to their RAM pages as well, and favour keeping those paged in over other
pages (whether other apps, or cache) logically? It's all latency still ...

M.



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 16:24             ` Martin J. Bligh
@ 2004-04-29 16:36               ` Chris Friesen
  2004-04-29 16:56                 ` Martin J. Bligh
  0 siblings, 1 reply; 128+ messages in thread
From: Chris Friesen @ 2004-04-29 16:36 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Jeff Garzik, brettspamacct, linux-kernel

Martin J. Bligh wrote:

> The latency for interactive stuff is definitely more noticeable though, and
> thus arguably more important. Perhaps we should be tying the scheduler in
> more tightly with the VM - we've already decided there which apps are 
> "interactive" and thus need low latency ... shouldn't we be giving a boost
> to their RAM pages as well, and favour keeping those paged in over other
> pages (whether other apps, or cache) logically? It's all latency still ...

I like this idea.  Maybe make it more general though--tasks with high
scheduler priority also get more of a memory priority boost.  This will
factor in the static priority as well as the interactivity bonus.

Chris


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  2:40             ` Andrew Morton
  2004-04-29  2:58               ` Paul Mackerras
@ 2004-04-29 16:50               ` Martin J. Bligh
  1 sibling, 0 replies; 128+ messages in thread
From: Martin J. Bligh @ 2004-04-29 16:50 UTC (permalink / raw)
  To: Andrew Morton, paulus, brettspamacct, jgarzik, linux-kernel

>>  I suspect rsync is taking two passes across the source files for its
>>  checksumming thing.  If so, this will defeat the pagecache use-once logic. 
>>  The kernel sees the second touch of the pages and assumes that there will
>>  be a third touch.
> 
> OK, a bit of fiddling does indicate that if a file is present on both
> client and server, and is modified on the client, the rsync client will
> indeed touch the pagecache pages twice.  Does this describe the files which
> you're copying at all?
> 
> One thing you could do is to run `watch -n1 cat /proc/meminfo'.  Cause lots
> of memory to be freed up then do the copy.  Monitor the size of the active
> and inactive lists.  If the active list is growing then we know that rsync
> is touching pages twice.
> 
> That would be an unfortunate special-case.

Personally, I think that the use-twice logic is a bit of a hack that mostly
works. If we moved to a method where we kept an eye on which pages are 
associated with which address_space (for mapped pages) or which process
(for anonymous pages) we'd have a much better shot at stopping any one
process / file from monopolizing the whole of system memory. 

We'd also be able to favour memory for files that are still open over ones 
that have been closed, and recognize linear access scan patterns per file,
and reclaim more aggressively from the overscanned areas, and favour higher
prio tasks over lower prio ones (including, but not limited to interactive).

Global LRU (even with the tweaks it has in Linux) doesn't seem optimal.
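As an aside, the `watch -n1 cat /proc/meminfo` check suggested above can
be narrowed to just the two list sizes.  A minimal sketch (the helper
name is mine; it parses the 2.6 /proc/meminfo format):

```shell
#!/bin/sh
# Print the active/inactive list sizes (in kB) from meminfo-format
# input, so growth of the active list can be watched during a copy.
active_inactive() {
    awk '/^Active:/   { a = $2 }
         /^Inactive:/ { i = $2 }
         END { printf "active=%s inactive=%s\n", a, i }'
}

# e.g.  watch -n1 'active_inactive < /proc/meminfo'
# If "active" keeps growing during the copy, pages are being touched
# twice and promoted past the use-once logic.
```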

M.



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  3:19                   ` Andrew Morton
  2004-04-29  4:13                     ` Marc Singer
@ 2004-04-29 16:51                     ` Andy Isaacson
  2004-04-29 20:42                       ` Andrew Morton
  2004-04-30  0:14                       ` Lincoln Dale
  1 sibling, 2 replies; 128+ messages in thread
From: Andy Isaacson @ 2004-04-29 16:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marc Singer, riel, brettspamacct, jgarzik, linux-kernel

On Wed, Apr 28, 2004 at 08:19:24PM -0700, Andrew Morton wrote:
> What you discuss above is just an implementation detail.  Forget it.  What
> are the requirements?  Thus far I've seen
> 
> a) updatedb causes cache reclaim
> 
> b) updatedb causes swapout
> 
> c) prefer that openoffice/mozilla not get paged out when there's heavy
>    pagecache demand.
> 
> For a) we don't really have a solution.  Some have been proposed but they
> could have serious downsides.
> 
> For b) and c) we can tune the pageout-vs-cache reclaim tendency with
> /proc/sys/vm/swappiness, only nobody seems to know that.
> 
> What else is there?

What I want is for purely sequential workloads which far exceed cache
size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
thrashing my entire desktop out of memory.  I DON'T CARE if the tar
completed in 45 minutes rather than 80.  (It wouldn't, anyways, because
it only needs about 5 MB of cache to get every bit of the speedup it was
going to get.)  But the additional latency when I un-xlock in the
morning is annoying, and there is no benefit.

For a more useful example, ideally I *should not be able to tell* that
"dd if=/hde1 of=/hdf1" is running. [1]  There is *no* benefit to caching
more than about 2 pages, under this workload.  But with current kernels,
IME, that workload results in a gargantuan buffer cache and lots of
swapout of apps I was using 3 minutes ago.  I've taken to walking away
for some coffee, coming back when it's done, and "sudo swapoff
/dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
trying to use bloaty apps.

[1] obviously I'll see some slowdown due to interrupts and PCI
    bandwidth; that's not what I'm railing against, here.

-andy


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 16:36               ` Chris Friesen
@ 2004-04-29 16:56                 ` Martin J. Bligh
  0 siblings, 0 replies; 128+ messages in thread
From: Martin J. Bligh @ 2004-04-29 16:56 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Andrew Morton, Jeff Garzik, brettspamacct, linux-kernel

>> The latency for interactive stuff is definitely more noticeable though, and
>> thus arguably more important. Perhaps we should be tying the scheduler in
>> more tightly with the VM - we've already decided there which apps are 
>> "interactive" and thus need low latency ... shouldn't we be giving a boost
>> to their RAM pages as well, and favour keeping those paged in over other
>> pages (whether other apps, or cache) logically? It's all latency still ...
> 
> I like this idea.  Maybe make it more general though--tasks with high
> scheduler priority also get more of a memory priority boost.  This will
> factor in the static priority as well as the interactivity bonus.

Yeah, see also my other mail in that thread - if we moved to file-object
(address_space) and task anon (mm) based tracking, it should be much
easier.  Also fits in nicely with Hugh's anon_mm code.

M.
 


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:29       ` Brett E.
@ 2004-04-29 18:05         ` Brett E.
  2004-04-29 18:32           ` William Lee Irwin III
  0 siblings, 1 reply; 128+ messages in thread
From: Brett E. @ 2004-04-29 18:05 UTC (permalink / raw)
  To: brettspamacct; +Cc: Andrew Morton, linux-kernel

Brett E. wrote:

> Andrew Morton wrote:
> 
>> "Brett E." <brettspamacct@fastclick.com> wrote:
>>
>>>> I see no swapout from the info which you sent.
>>>
>>>
>>> pgpgout/s gives the total number of blocks paged out to disk per 
>>> second, it peaks at 13,000 and hovers around 3,000 per the attachment.
>>
>>
>>
>> Nope.  pgpgout is simply writes to disk, of all types.
> 
> That is what is confusing me.. From the sar man page:
> 
> pgpgin/s
>     Total number of kilobytes the system paged in from disk per second.
> 
> pgpgout/s
>     Total number of kilobytes the system paged out to disk per second.
> 
> 
Anyone know what I should believe?  Sar's pgpgin/s and pgpgout/s tell me 
  that it is paging in/out from/to disk.  Yet pswpin/s and pswpout/s are 
both 0.  Swapping and paging are the same thing I believe. pgpgin/out 
refer to paging, pswpin/out refer to swapping.  So I for one am confused.

I guess I could dig through the source but I figured someone might have 
encountered this discrepancy in the past.




* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  1:47             ` Rik van Riel
@ 2004-04-29 18:14               ` Adam Kropelin
  2004-04-30  3:17                 ` Tim Connors
  0 siblings, 1 reply; 128+ messages in thread
From: Adam Kropelin @ 2004-04-29 18:14 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, Jeff Garzik, brettspamacct, linux-kernel

On Wed, Apr 28, 2004 at 09:47:45PM -0400, Rik van Riel wrote:
> On Wed, 28 Apr 2004, Andrew Morton wrote:
> 
> > OK, so it takes four seconds to swap mozilla back in, and you noticed it.
> > 
> > Did you notice that those three kernel builds you just did ran in twenty
> > seconds less time because they had more cache available?  Nope.
> 
> That's exactly why desktops should be optimised to give
> the best performance where the user notices it most...

Agreed. Looking at it from the standpoint of relative change, the time
to bring the mozilla window to the foreground is increased by orders of
magnitude while the kernel builds improve by a (relatively) small
percent. Humans easily notice change in orders of magnitude and such
changes can feel painful. Benchmarks notice $SMALLNUM percent long
before a human will, especially if s/he has left the room because the
job was going to take 10 minutes anyway. The 30 seconds saved off the
compile run just isn't worth it sometimes if its side-effect is to
disrupt the user's workflow.

The 'swappiness' tunable may well give enough control over the situation
to suit all sorts of users. If nothing else, this thread has raised
awareness that such a tunable exists and can be played with to influence
the kernel's decision-making. Distros, too, should give consideration to
appropriate default settings to serve their intended users.

--Adam



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 13:51   ` Horst von Brand
@ 2004-04-29 18:32     ` Brett E.
  0 siblings, 0 replies; 128+ messages in thread
From: Brett E. @ 2004-04-29 18:32 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Linux Kernel Mailing List

Horst von Brand wrote:

> "Brett E." <brettspamacct@fastclick.com> said:
> 
> [...]
> 
> 
>>I created a hack which allocates memory causing cache to go down, then 
>>exits, freeing up the malloc'ed memory. This brings free memory up by 
>>400 megs and brings the cache down to close to 0, of course the cache 
>>grows right afterwards. It would be nice to cap the cache datastructures 
>>in the kernel but I've been posting about this since September to no 
>>avail so my expectations are pretty low.
> 
> 
> Because it is complete nonsense. Keeping stuff around in RAM in case it
> is needed again, as long as RAM is not needed for anything else, is a major
> win. That is what cache is.
The key phrase in your post is "as long as RAM is not needed for 
anything else."  My assertion is that this is not the case: the kernel 
seems to favor cache over pages that are actively in use.  Sar shows 
heavy paging to and from disk even though 500 megs are reported in 
cache.  Sustained paging in and out implies that pages are being paged 
out only to be needed again shortly afterwards.  Yet sar also reports 
no swapping, hence the need to figure out why there is a discrepancy 
before continuing.




* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 18:05         ` Brett E.
@ 2004-04-29 18:32           ` William Lee Irwin III
  2004-04-29 20:47             ` Brett E.
  0 siblings, 1 reply; 128+ messages in thread
From: William Lee Irwin III @ 2004-04-29 18:32 UTC (permalink / raw)
  To: Brett E.; +Cc: Andrew Morton, linux-kernel

On Thu, Apr 29, 2004 at 11:05:42AM -0700, Brett E. wrote:
> Anyone know what I should believe?  Sar's pgpgin/s and pgpgout/s tell me 
>  that it is paging in/out from/to disk.  Yet pswpin/s and pswpout/s are 
> both 0.  Swapping and paging are the same thing I believe. pgpgin/out 
> refer to paging, pswpin/out refer to swapping.  So I for one am confused.
> I guess I could dig through the source but I figured someone might have 
> encountered this discrepancy in the past.

Both are to be believed. They merely describe different things.

Pagein/pageout are counts of VM-initiated IO, regardless of whether this
IO is done on filesystem-backed pages or swap-backed pages. Pagein and
pageout are used more generally to describe VM-initiated IO and don't
exclusively refer to swap IO, but also include IO to filesystems to/from
filesystem-backed memory.

Swapin/swapout are counts of swap IO only, and are considered to apply
only to IO done to swap files/devices to/from swap-backed anonymous memory.

Pagein/pageout are both proper and necessary to have. In fact, you were
requesting that filesystem IO be done preferentially to swap IO, and the
pagein/pageout indicators showing IO while swapin/swapout indicators show
none mean you are getting exactly what you asked for.
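The two families of counters can also be read directly from /proc/vmstat
on 2.6.  A minimal sketch (the helper name is mine) that sums only the
swap-specific counters:

```shell
#!/bin/sh
# Sum the swap-specific IO event counters from /proc/vmstat-format
# input.  pgpgin/pgpgout are deliberately ignored here, since they
# also count ordinary filesystem IO.
swap_io_total() {
    awk '$1 == "pswpin" || $1 == "pswpout" { n += $2 } END { print n + 0 }'
}

# e.g.  swap_io_total < /proc/vmstat
# A result of 0 means no swap IO at all, regardless of how large
# pgpgin/pgpgout are.
```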


-- wli


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:21     ` Nick Piggin
  2004-04-29  0:50       ` Wakko Warner
  2004-04-29  0:58       ` Marc Singer
@ 2004-04-29 20:01       ` Horst von Brand
  2004-04-29 20:18         ` Martin J. Bligh
                           ` (4 more replies)
  2 siblings, 5 replies; 128+ messages in thread
From: Horst von Brand @ 2004-04-29 20:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> said:

[...]

> I don't know. What if you have some huge application that only
> runs once per day for 10 minutes? Do you want it to be consuming
> 100MB of your memory for the other 23 hours and 50 minutes for
> no good reason?

How on earth is the kernel supposed to know that for this one particular
job you don't care if it takes 3 hours instead of 10 minutes, just because
you don't want to spare enough preciousss RAM?
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:01       ` Horst von Brand
@ 2004-04-29 20:18         ` Martin J. Bligh
  2004-04-29 20:33         ` David B. Stevens
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 128+ messages in thread
From: Martin J. Bligh @ 2004-04-29 20:18 UTC (permalink / raw)
  To: Horst von Brand, Nick Piggin
  Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel

--On Thursday, April 29, 2004 16:01:11 -0400 Horst von Brand <vonbrand@inf.utfsm.cl> wrote:

> Nick Piggin <nickpiggin@yahoo.com.au> said:
> 
> [...]
> 
>> I don't know. What if you have some huge application that only
>> runs once per day for 10 minutes? Do you want it to be consuming
>> 100MB of your memory for the other 23 hours and 50 minutes for
>> no good reason?
> 
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?

Nice value is the obvious interface for such information.

M.



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:01       ` Horst von Brand
  2004-04-29 20:18         ` Martin J. Bligh
@ 2004-04-29 20:33         ` David B. Stevens
  2004-04-29 22:42           ` Steve Youngs
  2004-04-29 20:36         ` Paul Jackson
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 128+ messages in thread
From: David B. Stevens @ 2004-04-29 20:33 UTC (permalink / raw)
  To: Horst von Brand
  Cc: Nick Piggin, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel

Horst von Brand wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> said:
> 
> [...]
> 
> 
>>I don't know. What if you have some huge application that only
>>runs once per day for 10 minutes? Do you want it to be consuming
>>100MB of your memory for the other 23 hours and 50 minutes for
>>no good reason?
> 
> 
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?

Maybe the kernel should be told by the apps exactly what they require 
in the way of memory, and perhaps how the memory they are given should 
be sliced up.

This would not be the first time that applications had to specify such 
information.

That was what REGION= and other such parameters were all about in other 
operating systems.

Then the kernel would have free use of what was left until the next app 
started etc ....

Cheers,
   Dave


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:01       ` Horst von Brand
  2004-04-29 20:18         ` Martin J. Bligh
  2004-04-29 20:33         ` David B. Stevens
@ 2004-04-29 20:36         ` Paul Jackson
  2004-04-29 21:19           ` Andrew Morton
  2004-04-29 21:38           ` Timothy Miller
  2004-04-30  5:15         ` Nick Piggin
  2004-04-30  6:20         ` Tim Connors
  4 siblings, 2 replies; 128+ messages in thread
From: Paul Jackson @ 2004-04-29 20:36 UTC (permalink / raw)
  To: Horst von Brand; +Cc: nickpiggin, jgarzik, akpm, brettspamacct, linux-kernel

> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes,

I'd pay ten bucks (yeah, I'm a cheapskate) for an option that I could
twiddle that would mark my nightly updatedb and backup jobs as ones to
use reduced memory footprint (both for file caching and backing user
virtual address space), even if it took much longer.

So, rather than protest in mock outrage that it's impossible for the
kernel to know this, instead answer the question as stated in all
seriousness ... well ... how _could_ the kernel know, and what _could_
the kernel do if it knew.  What mechanism(s) would be needed so that
the kernel could restrict a job's memory usage?

Heh - indeed perhaps the answer is closer than I realize.  For SGI's big
NUMA boxes, managing memory placement is sufficiently critical that we
are inventing or encouraging ways (such as Andi Kleen's numa stuff) to
control memory placement per node per job.  Perhaps this needs to be
extended to portions of a node (this job can only use 1 Gb of the memory
on that 2 Gb node) and to other memory uses (file cache, not just user
space memory).

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 16:51                     ` Andy Isaacson
@ 2004-04-29 20:42                       ` Andrew Morton
  2004-04-29 22:27                         ` Andy Isaacson
  2004-04-30  0:14                       ` Lincoln Dale
  1 sibling, 1 reply; 128+ messages in thread
From: Andrew Morton @ 2004-04-29 20:42 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: elf, riel, brettspamacct, jgarzik, linux-kernel

Andy Isaacson <adi@hexapodia.org> wrote:
>
> What I want is for purely sequential workloads which far exceed cache
> size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
> thrashing my entire desktop out of memory.  I DON'T CARE if the tar
> completed in 45 minutes rather than 80.  (It wouldn't, anyways, because
> it only needs about 5 MB of cache to get every bit of the speedup it was
> going to get.)  But the additional latency when I un-xlock in the
> morning is annoying, and there is no benefit.

What kernel version are you using?  If 2.6, what value of
/proc/sys/vm/swappiness?

> For a more useful example, ideally I *should not be able to tell* that
> "dd if=/hde1 of=/hdf1" is running.

I just did a 4GB `dd if=/dev/sda of=/x bs=1M' on a 1GB 2.6.6-rc2-mm2
swappiness=85 machine here and there was no swapout at all.

Probably your machine has less memory.  But without real, hard details
nothing can be done.

> There is *no* benefit to cacheing
> more than about 2 pages, under this workload.

Sure, we could do better things with the large streaming files, although
the risk of accidentally screwing up particular workloads is high.

But the use-once logic which we have in there at present does handle these
cases quite well.

>  But with current kernels,
> IME, that workload results in a gargantuan buffer cache and lots of
> swapout of apps I was using 3 minutes ago.  I've taken to walking away
> for some coffee, coming back when it's done, and "sudo swapoff
> /dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
> trying to use bloaty apps.

What kernel, what system specs, what swappiness setting?
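For anyone answering, the details asked for above can be gathered in one
go.  A small sketch (the helper name is mine; the swappiness file only
exists on 2.6 kernels, so it falls back to "n/a" elsewhere):

```shell
#!/bin/sh
# Collect the details asked for above: kernel version and the
# current swappiness setting (2.6 only; prints n/a on 2.4).
report_vm_tuning() {
    printf 'kernel:     %s\n' "$(uname -r)"
    printf 'swappiness: %s\n' \
        "$(cat /proc/sys/vm/swappiness 2>/dev/null || echo n/a)"
}

report_vm_tuning
```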


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 18:32           ` William Lee Irwin III
@ 2004-04-29 20:47             ` Brett E.
  0 siblings, 0 replies; 128+ messages in thread
From: Brett E. @ 2004-04-29 20:47 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, linux-kernel

William Lee Irwin III wrote:

> On Thu, Apr 29, 2004 at 11:05:42AM -0700, Brett E. wrote:
> 
>>Anyone know what I should believe?  Sar's pgpgin/s and pgpgout/s tell me 
>> that it is paging in/out from/to disk.  Yet pswpin/s and pswpout/s are 
>>both 0.  Swapping and paging are the same thing I believe. pgpgin/out 
>>refer to paging, pswpin/out refer to swapping.  So I for one am confused.
>>I guess I could dig through the source but I figured someone might have 
>>encountered this discrepancy in the past.
> 
> 
> Both are to be believed. They merely describe different things.
> 
> Pagein/pageout are counts of VM-initiated IO, regardless of whether this
> IO is done on filesystem-backed pages or swap-backed pages. Pagein and
> pageout are used more generally to describe VM-initiated IO and don't
> exclusively refer to swap IO, but also include IO to filesystems to/from
> filesystem-backed memory.
> 
> Swapin/swapout are counts of swap IO only, and are considered to apply
> only to IO done to swap files/devices to/from swap-backed anonymous memory.
> 
> Pagein/pageout are both proper and necessary to have. In fact, you were
> requesting that filesystem IO be done preferentially to swap IO, and the
> pagein/pageout indicators showing IO while swapin/swapout indicators show
> none mean you are getting exactly what you asked for.
> 
> 
Thanks, I think it's clear now.  In layman's terms, pgpgin/out cover 
all paging IO (including file cache activity), while pswpin/out count 
only swap activity.



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:36         ` Paul Jackson
@ 2004-04-29 21:19           ` Andrew Morton
  2004-04-29 21:34             ` Paul Jackson
  2004-05-06 13:08             ` Pavel Machek
  2004-04-29 21:38           ` Timothy Miller
  1 sibling, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29 21:19 UTC (permalink / raw)
  To: Paul Jackson; +Cc: vonbrand, nickpiggin, jgarzik, brettspamacct, linux-kernel

Paul Jackson <pj@sgi.com> wrote:
>
> > How on earth is the kernel supposed to know that for this one particular
> > job you don't care if it takes 3 hours instead of 10 minutes,
> 
> I'd pay ten bucks (yeah, I'm a cheapskate) for an option that I could
> twiddle that would mark my nightly updatedb and backup jobs as ones to
> use reduced memory footprint (both for file caching and backing user
> virtual address space), even if it took much longer.
> 
> So, rather than protest in mock outrage that it's impossible for the
> kernel to know this, instead answer the question as stated in all
> seriousness ... well ... how _could_ the kernel know, and what _could_
> the kernel do if it knew.  What mechanism(s) would be needed so that
> the kernel could restrict a jobs memory usage?

Two things:

a) a knob to say "only reclaim pagecache".  We have that now.

b) a knob to say "reclaim vfs caches harder".  That's simply a matter of boosting
   the return value from shrink_dcache_memory() and perhaps shrink_icache_memory().

It's not quite what you're after, but it's close.
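As a sketch of how a) looks from userspace, and how b) later surfaced as
a tunable (vm.vfs_cache_pressure appeared in later 2.6 kernels; the
values below are illustrative, not recommendations, and the writes
require root):

```shell
# a) strongly prefer reclaiming pagecache over swapping out
#    anonymous/mapped pages
echo 0 > /proc/sys/vm/swappiness

# b) reclaim the dentry/inode (vfs) caches more aggressively; values
#    above the default of 100 shrink them harder (later 2.6 kernels)
echo 200 > /proc/sys/vm/vfs_cache_pressure
```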


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 21:19           ` Andrew Morton
@ 2004-04-29 21:34             ` Paul Jackson
  2004-04-29 21:57               ` Andrew Morton
  2004-05-06 13:08             ` Pavel Machek
  1 sibling, 1 reply; 128+ messages in thread
From: Paul Jackson @ 2004-04-29 21:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: vonbrand, nickpiggin, jgarzik, brettspamacct, linux-kernel

Andrew wrote:
> Two things:
> a) a knob to say "only reclaim pagecache".  We have that now.
> b) a knob to say "reclaim vfs caches harder" ...

Are these knobs system-wide in effect, or per job?
I am presuming system-wide.

When I'm working late, I want my updatedb/backup jobs
to scrunch themselves into a corner, even as my builds
and gui desktop continue to fly and suck up RAM.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:36         ` Paul Jackson
  2004-04-29 21:19           ` Andrew Morton
@ 2004-04-29 21:38           ` Timothy Miller
  2004-04-29 21:47             ` Paul Jackson
  1 sibling, 1 reply; 128+ messages in thread
From: Timothy Miller @ 2004-04-29 21:38 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Horst von Brand, nickpiggin, jgarzik, akpm, brettspamacct,
	linux-kernel



Paul Jackson wrote:

> Heh - indeed perhaps the answer is closer than I realize.  For SGI's big
> NUMA boxes, managing memory placement is sufficiently critical that we
> are inventing or encouraging ways (such as Andi Kleen's numa stuff) to
> control memory placement per node per job.  Perhaps this needs to be
> extended to portions of a node (this job can only use 1 Gb of the memory
> on that 2 Gb node) and to other memory uses (file cache, not just user
> space memory).
> 

Is updatedb run with a nice level greater than zero?

Perhaps nice level could influence how much a process is allowed to 
affect page cache.



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29  0:50       ` Wakko Warner
  2004-04-29  0:53         ` Jeff Garzik
  2004-04-29  0:54         ` Nick Piggin
@ 2004-04-29 21:45         ` Denis Vlasenko
  2 siblings, 0 replies; 128+ messages in thread
From: Denis Vlasenko @ 2004-04-29 21:45 UTC (permalink / raw)
  To: Wakko Warner, Nick Piggin; +Cc: linux-kernel

On Thursday 29 April 2004 03:50, Wakko Warner wrote:
> > I don't know. What if you have some huge application that only
> > runs once per day for 10 minutes? Do you want it to be consuming
> > 100MB of your memory for the other 23 hours and 50 minutes for
> > no good reason?
>
> I keep soffice open all the time.  The box in question has 512mb of ram.
> This is one app that, even though I use it infrequently, I would prefer
> never be swapped out.  Mainly, when I want to use it, I *WANT* it now
> (i.e. not waiting for it to come back from swap).

I'm afraid a part of the problem is that there are apps which are
way too bloated. Fighting bloat is thankless and hard, so almost
everybody simply throws RAM at the problem. Well. Having thrown
lotsa RAM at the problem, it may feel 'better' until you realize you
need not only RAM but *also* disk bandwidth to move bloat from disk
to RAM and back.

Come on, let's admit it.  The proper fix to the 'I want OpenOffice to be
responsive' problem is to make it several times smaller.
Everything else is more or less a workaround.

It's a pity size optimizations are not too popular even
on lkml.

> This is just my oppinion.  I personally feel that cache should use
> available memory, not already used memory (swapping apps out for more
> cache).
--
vda



* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 21:38           ` Timothy Miller
@ 2004-04-29 21:47             ` Paul Jackson
  2004-04-29 22:18               ` Timothy Miller
  0 siblings, 1 reply; 128+ messages in thread
From: Paul Jackson @ 2004-04-29 21:47 UTC (permalink / raw)
  To: Timothy Miller
  Cc: vonbrand, nickpiggin, jgarzik, akpm, brettspamacct, linux-kernel

Timothy wrote:
> Perhaps nice level could influence how much a process is allowed to 
> affect page cache.

I'm from the school that says 'nice' applies to scheduling priority,
not memory usage.

I'd expect a different knob, a per-task inherited value as is 'nice',
to control memory usage.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 21:34             ` Paul Jackson
@ 2004-04-29 21:57               ` Andrew Morton
  2004-04-29 22:18                 ` Paul Jackson
  2004-04-30  0:04                 ` Andy Isaacson
  0 siblings, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29 21:57 UTC (permalink / raw)
  To: Paul Jackson; +Cc: vonbrand, nickpiggin, jgarzik, brettspamacct, linux-kernel

Paul Jackson <pj@sgi.com> wrote:
>
> Andrew wrote:
> > Two things:
> > a) a knob to say "only reclaim pagecache".  We have that now.
> > b) a knob to say "reclaim vfs caches harder" ...
> 
> Are these knobs system-wide in effect, or per job?
> I am presuming system-wide.

yup, system-wide.

> When I'm working late, I want my updatedb/backup jobs
> to scrunch themselves into a corner, even as my builds
> and gui desktop continue to fly and suck up RAM.

Sure.  That's not purely a cacheing thing though. Even if the background
activity was clamped to just a few megs of cache you'll find that the
seek activity is a killer and needs a limitation mechanism, although the
anticipatory scheduler helps a lot here.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 21:47             ` Paul Jackson
@ 2004-04-29 22:18               ` Timothy Miller
  2004-04-29 22:46                 ` Paul Jackson
  2004-04-30  3:37                 ` Tim Connors
  0 siblings, 2 replies; 128+ messages in thread
From: Timothy Miller @ 2004-04-29 22:18 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vonbrand, nickpiggin, jgarzik, akpm, brettspamacct, linux-kernel



Paul Jackson wrote:
> Timothy wrote:
> 
>>Perhaps nice level could influence how much a process is allowed to 
>>affect page cache.
> 
> 
> I'm from the school that says 'nice' applies to scheduling priority,
> not memory usage.
> 
> I'd expect a different knob, a per-task inherited value as is 'nice',
> to control memory usage.
> 


Linux kernel developers seem to be of the mind that you cannot trust 
what applications tell you about themselves, so it's better to use 
heuristics to GUESS how to schedule something, rather than to add YET 
ANOTHER property to it.

Nick, Con, Ingo, and others have done an impressive job of taking the 
guess/heuristic approach to scheduling.  I don't see why that can't be 
taken further.

Also, there seems to be strong resistance to adding a property to 
something which is not easily accessible through existing UNIX tools. 
"nice" and "renice" commands have been around forever.  Adding another 
control requires new commands, new libc functions, changes to "top", etc.

Besides, when would you want to have a sched-nice of -20 and an io-nice 
of 20, or a sched-nice of 20 and an io-nice of -20?  Things like that 
would make no sense.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 21:57               ` Andrew Morton
@ 2004-04-29 22:18                 ` Paul Jackson
  2004-04-30  0:04                 ` Andy Isaacson
  1 sibling, 0 replies; 128+ messages in thread
From: Paul Jackson @ 2004-04-29 22:18 UTC (permalink / raw)
  To: Andrew Morton, Shailabh Nagar
  Cc: vonbrand, nickpiggin, jgarzik, brettspamacct, linux-kernel

Andrew wrote:
> Even if the background activity was clamped to just a few megs
> of cache you'll find that the seek activity is a killer, and
> needs a limitation mechanism.

True - the seek activity is another critical resource that would need to
be throttled to keep updatedb/backup from interfering with my late
night labours.

Let's see, that's:
 1) cpu scheduling ticks
 2) memory for virtual address backing store
 3) memory for file related caching
 4) disk arm motion

Hmmm ... actually not so much a numa-placement extension, but rather a
CKRM opportunity.

CKRM focuses on measuring and restraining how much of specified critical
resources a task is using; numa placement focuses on which cpus or memory
nodes are allowed to be used at all.

See further the CKRM thread of Shailabh Nagar, also running on lkml
today.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:42                       ` Andrew Morton
@ 2004-04-29 22:27                         ` Andy Isaacson
  2004-04-29 23:19                           ` Andrew Morton
  0 siblings, 1 reply; 128+ messages in thread
From: Andy Isaacson @ 2004-04-29 22:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: elf, riel, brettspamacct, jgarzik, linux-kernel

On Thu, Apr 29, 2004 at 01:42:22PM -0700, Andrew Morton wrote:
> Andy Isaacson <adi@hexapodia.org> wrote:
> > What I want is for purely sequential workloads which far exceed cache
> > size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
> > thrashing my entire desktop out of memory.  I DON'T CARE if the tar
> > completed in 45 minutes rather than 80.  (It wouldn't, anyways, because
> > it only needs about 5 MB of cache to get every bit of the speedup it was
> > going to get.)  But the additional latency when I un-xlock in the
> > morning is annoying, and there is no benefit.
> 
> What kernel version are you using?  If 2.6, what value of
> /proc/sys/vm/swappiness?

2.4.various, including 2.4.25 and 2.4.26.  I haven't taken the 2.6
plunge yet.  Running on various x86 including
 - dual PIII 666 MHz 512 MB
 - SpeedStep PIII 700 MHz 128 MB
 - Athlon XP 2GHz 512 MB

> > For a more useful example, ideally I *should not be able to tell* that
> > "dd if=/hde1 of=/hdf1" is running.
> 
> I just did a 4GB `dd if=/dev/sda of=/x bs=1M' on a 1GB 2.6.6-rc2-mm2
> swappiness=85 machine here and there was no swapout at all.
> 
> Probably your machine has less memory.  But without real, hard details
> nothing can be done.

I'm pleased to hear that 2.6 is apparently better behaved.  In your
test, what was the impact on the file cache?  It's a big improvement to
not be paging out to swap, but it's also important that sequential IO
not evict my cached build tree.

An interesting test would be to time a compilation of a source file with
a large number of includes.  For example, building
linux-2.4.25/kernel/sysctl.c on my Athlon XP 2GHz, 512MB, 2.4.25 takes
2.8 seconds with (fairly) cold cache.  (I didn't reboot, but I did take
fairly extreme measures to force stuff out.)  It takes 0.54 seconds with
warm caches.  After doing 1GB of sequential IO (wc -w /tmp/bigfile) I'm
back up to 2.08 seconds.

> > There is *no* benefit to cacheing
> > more than about 2 pages, under this workload.
> 
> Sure, we could do better things with the large streaming files, although
> the risk of accidentally screwing up particular workloads is high.

Yeah, I agree.  For example, I've occasionally used cat(1) or wc(1) to
prefetch files that I knew I was going to be accessing randomly; with my
hypothetical "sequential IO doesn't cause cacheing" it would be much
harder to do effective manual prefetching.

> But the use-once logic which we have in there at present does handle these
> cases quite well.

Where is the use-once logic available?  Is it in mainstream 2.6 or only
in some development branches?  I've not upgraded from 2.4, mostly because
I didn't see many benefits evident in the discussions, but improved
paging logic would be nice.

> >  But with current kernels,
> > IME, that workload results in a gargantuan buffer cache and lots of
> > swapout of apps I was using 3 minutes ago.  I've taken to walking away
> > for some coffee, coming back when it's done, and "sudo swapoff
> > /dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
> > trying to use bloaty apps.
> 
> What kernel, what system specs, what swappiness setting?

2.4.25, Athlon XP 2 GHz, 512MB.  I suppose you're not terribly
interested in 2.4.  I'll see if I can reasonably upgrade, if you can
tell me what I should upgrade to for the good stuff.

-andy

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:33         ` David B. Stevens
@ 2004-04-29 22:42           ` Steve Youngs
  0 siblings, 0 replies; 128+ messages in thread
From: Steve Youngs @ 2004-04-29 22:42 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Horst von Brand, Nick Piggin, Jeff Garzik, Andrew Morton,
	brettspamacct, David B. Stevens

* David B Stevens <dsteven3@maine.rr.com> writes:

  > Maybe the kernel should be told by the apps exactly what they
  > require in the way of memory

So what happens when Mr BloatyApp says: "Yo, Mr Kernel, gimme all ya
got baby!"

-- 
|---<Steve Youngs>---------------<GnuPG KeyID: A94B3003>---|
|              Ashes to ashes, dust to dust.               |
|      The proof of the pudding, is under the crust.       |
|----------------------------------<steve@youngs.au.com>---|

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 22:18               ` Timothy Miller
@ 2004-04-29 22:46                 ` Paul Jackson
  2004-04-29 23:08                   ` Timothy Miller
  2004-04-30  3:37                 ` Tim Connors
  1 sibling, 1 reply; 128+ messages in thread
From: Paul Jackson @ 2004-04-29 22:46 UTC (permalink / raw)
  To: Timothy Miller
  Cc: vonbrand, nickpiggin, jgarzik, akpm, brettspamacct, linux-kernel

Timothy wrote:
> Linux kernel developers seem to be of the mind that you cannot trust 
> what applications tell you about themselves, so it's better to use 
> heuristics to GUESS how to schedule something, rather than to add YET 
> ANOTHER property to it.

Both are needed.  The thing has to work pretty well, for most people,
most of the time, without human intervention.

And there needs to be knobs to optimize performance.  Even with no
conscious end-user administration, a knob on the cron job that runs
updatedb, setup by the distribution packager, could have wide spread
impact on the responsiveness of a system, when the user sits down with
the first cup of coffee to scan the morning headlines and incoming
email er eh spam.

As to whether it's two nice calls, or one with dual effect, let's not
confuse the kernel API with the one seen by the user.  The kernel should
provide a minimum spanning set of orthogonal mechanisms, and not be
second-guessing whether the user is out of their ever-loving mind to be
asking for a hot-cpu, cold-io job.

In other words, I wouldn't agree with your take that it's a matter of
not trusting the application, better to GUESS.  Rather I would say that
there is a preference, and a good one at that, not to use an excessive
number of knobs as a cop-out to avoid working hard to get the widest
practical range of cases to behave reasonably without intervention, and
a preference to keep what knobs there are short, sweet and
minimally interacting.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 22:46                 ` Paul Jackson
@ 2004-04-29 23:08                   ` Timothy Miller
  2004-04-30 12:31                     ` Bart Samwel
  0 siblings, 1 reply; 128+ messages in thread
From: Timothy Miller @ 2004-04-29 23:08 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vonbrand, nickpiggin, jgarzik, akpm, brettspamacct, linux-kernel



Paul Jackson wrote:

> 
> In other words, I wouldn't agree with your take that it's a matter of
> not trusting the application, better to GUESS.  

Okay.

> Rather I would say that
> there is a preference, and a good one at that, to not use an excessive
> number of knobs as a cop-out to avoid working hard to get the widest
> practical range of cases to behave reasonably, without intervention, and
> a preference to keep what knobs that are there short, sweet and
> minimally interacting.
> 

Agreed.  And this is why I suggested not adding another knob but rather 
going with the existing nice value.

Mind you, this shouldn't necessarily be done without some kind of 
experimentation.  Put two knobs in the kernel and try varying them relative 
to each other to see what sorts of jobs, if any, would benefit from a 
disparity between cpu-nice and io-nice.  If there IS a significant 
difference, then add the extra knob.  If there isn't, then don't.

Another possibility would be to have one knob that controls cpu-nice, 
and another knob that controls io-nice minus cpu-nice, so if you REALLY 
want to make them different, you can, but typically, they are set to be 
the same.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 22:27                         ` Andy Isaacson
@ 2004-04-29 23:19                           ` Andrew Morton
  0 siblings, 0 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-29 23:19 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: elf, riel, brettspamacct, jgarzik, linux-kernel

Andy Isaacson <adi@hexapodia.org> wrote:
>
> > What kernel version are you using?  If 2.6, what value of
> > /proc/sys/vm/swappiness?
> 
> 2.4.various, including 2.4.25 and 2.4.26.  I haven't taken the 2.6
> plunge yet.

OK.  Please try 2.6 and let us know how it changes things.

> > I just did a 4GB `dd if=/dev/sda of=/x bs=1M' on a 1GB 2.6.6-rc2-mm2
> > swappiness=85 machine here and there was no swapout at all.
> > 
> > Probably your machine has less memory.  But without real, hard details
> > nothing can be done.
> 
> I'm pleased to hear that 2.6 is apparently better behaved.  In your
> test, what was the impact on the file cache?

It will have munched everything else.

>  It's a big improvement to
> not be paging out to swap, but it's also important that sequential IO
> not evict my cached build tree.

Yup.  We don't have any large-streaming-file heuristics in there.

It is the case that if you have recently accessed a file *twice* then its
pages will be preferred over the large streaming file.  But if you've
accessed the valuable file only once, the streaming I/O will evict it.

> > > There is *no* benefit to cacheing
> > > more than about 2 pages, under this workload.
> > 
> > Sure, we could do better things with the large streaming files, although
> > the risk of accidentally screwing up particular workloads is high.
> 
> Yeah, I agree.  For example, I've occasionally used cat(1) or wc(1) to
> prefetch files that I knew I was going to be accessing randomly; with my
> hypothetical "sequential IO doesn't cause cacheing" it would be much
> harder to do effective manual prefetching.

We have a new syscall in 2.6 (fadvise) with which an app can provide hints
about its access patterns and its desired cache usage.  So, for example,
tar or rsync or whatever could (if told to do so by the user) deliberately
throw away the pagecache after having accessed the file.  But that does
require application modifications.  They're pretty simple though:

+	if (user said to throw away the cache)
+		posix_fadvise64(fd, 0, -1, POSIX_FADV_DONTNEED);
	close(fd);
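For illustration only, here is a self-contained userspace sketch of that hunk.  The close_dropping_cache() helper and the file name are invented for the example; posix_fadvise() itself is the real syscall wrapper, and note it returns an error number directly rather than setting errno:

```c
/* Hedged sketch: drop a file's cached pages on request, then close.
 * The hint is advisory; a failure is reported but otherwise ignored. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int close_dropping_cache(int fd, int drop_cache)
{
	if (drop_cache) {
		/* offset 0, length 0 means "from start to end of file" */
		int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		if (err)
			fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
	}
	return close(fd);
}
```

A tar-like tool would call this instead of a bare close(fd) once it has finished reading or writing each archive member.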

> > But the use-once logic which we have in there at present does handle these
> > cases quite well.
> 
> Where is the use-once logic available?  Is it in mainstream 2.6 or only
> in some development branches?  I've not upgraded from 2.4 mostly because
> I didn't see much benefits evident in the discussions, but improved
> paging logic would be nice.

It's in 2.4 also.  I think 2.4 tends to do the wrong thing because dirty
pages easily make it to the tail of the VM LRU's, which eventually causes
the VM to go off and hunt down mapped pages instead.  2.6 takes more care
to prevent dirty pages from hitting the tail of the LRU.

> > >  But with current kernels,
> > > IME, that workload results in a gargantuan buffer cache and lots of
> > > swapout of apps I was using 3 minutes ago.  I've taken to walking away
> > > for some coffee, coming back when it's done, and "sudo swapoff
> > > /dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
> > > trying to use bloaty apps.
> > 
> > What kernel, what system specs, what swappiness setting?
> 
> 2.4.25, Athlon XP 2 GHz, 512MB.  I suppose you're not terribly
> interested in 2.4.  I'll see if I can reasonably upgrade, if you can
> tell me what I should upgrade to for the good stuff.

2.6.6-rc3 would be suitable.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 21:57               ` Andrew Morton
  2004-04-29 22:18                 ` Paul Jackson
@ 2004-04-30  0:04                 ` Andy Isaacson
  2004-04-30  0:32                   ` Andrew Morton
  1 sibling, 1 reply; 128+ messages in thread
From: Andy Isaacson @ 2004-04-30  0:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, vonbrand, nickpiggin, jgarzik, brettspamacct,
	linux-kernel

On Thu, Apr 29, 2004 at 02:57:25PM -0700, Andrew Morton wrote:
> > When I'm working late, I want my updatedb/backup jobs
> > to scrunch themselves into a corner, even as my builds
> > and gui desktop continue to fly and suck up RAM.
> 
> Sure.  That's not purely a cacheing thing though. Even if the background
> activity was clamped to just a few megs of cache you'll find that the
> seek activity is a killer, and needs a limitation mechanism.  Although the
> anticipatory scheduler helps here a lot.

I grant that in the updatedb case (or the backup case), the seeks are
going to suck and they're inherently on the same spindle as the user's
data, so there's no fixing it (short of a real "IO nice").

But in a related case, I have a background daemon that does a lot of IO
(mostly sequential, one page at a time read/modify/write of a multi-GB
file) to a filesystem on a separate spindle from my main filesystems.
I'd like to use a similar mechanism to say "don't let this program eat
my pagecache" that will let the daemon crunch away without severely
impacting my desktop work.

-andy

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 16:51                     ` Andy Isaacson
  2004-04-29 20:42                       ` Andrew Morton
@ 2004-04-30  0:14                       ` Lincoln Dale
  1 sibling, 0 replies; 128+ messages in thread
From: Lincoln Dale @ 2004-04-30  0:14 UTC (permalink / raw)
  To: Andy Isaacson
  Cc: Andrew Morton, Marc Singer, riel, brettspamacct, jgarzik,
	linux-kernel

At 02:51 AM 30/04/2004, Andy Isaacson wrote:
>What I want is for purely sequential workloads which far exceed cache
>size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
>thrashing my entire desktop out of memory.  I DON'T CARE if the tar
>completed in 45 minutes rather than 80.  (It wouldn't, anyways, because
>it only needs about 5 MB of cache to get every bit of the speedup it was
>going to get.)  But the additional latency when I un-xlock in the
>morning is annoying, and there is no benefit.
>
>For a more useful example, ideally I *should not be able to tell* that
>"dd if=/hde1 of=/hdf1" is running. [1]  There is *no* benefit to cacheing
>more than about 2 pages, under this workload.  But with current kernels,
>IME, that workload results in a gargantuan buffer cache and lots of
>swapout of apps I was using 3 minutes ago.  I've taken to walking away
>for some coffee, coming back when it's done, and "sudo swapoff
>/dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
>trying to use bloaty apps.

the mechanism already exists; teach tar/dd and any other app whose data you 
don't want polluting the page-cache to use O_DIRECT.
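As a hedged illustration of the idea (the read_uncached() helper is invented for this example; O_DIRECT and its alignment requirements are real):

```c
/* Sketch: read a file with O_DIRECT so the data bypasses the page
 * cache.  O_DIRECT requires the user buffer, file offset and transfer
 * size to be suitably aligned (commonly 512 bytes or the filesystem
 * block size), so callers should allocate with posix_memalign().
 * Falls back to an ordinary read if the filesystem refuses O_DIRECT
 * (tmpfs, for example, may not support it). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static ssize_t read_uncached(const char *path, void *dst, size_t len)
{
	ssize_t n = -1;
	int fd = open(path, O_RDONLY | O_DIRECT);

	if (fd >= 0) {
		n = read(fd, dst, len);
		close(fd);
	}
	if (n < 0) {
		/* O_DIRECT unsupported or alignment rejected: plain read */
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;
		n = read(fd, dst, len);
		close(fd);
	}
	return n;
}
```

(dd already exposes this via iflag=direct/oflag=direct in recent coreutils.)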

i suspect updatedb is a different case as it's probably filling the system 
with dcache/inode entries.


cheers,

lincoln.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  0:04                 ` Andy Isaacson
@ 2004-04-30  0:32                   ` Andrew Morton
  2004-04-30  0:54                     ` Paul Jackson
  2004-04-30  7:52                     ` Jeff Garzik
  0 siblings, 2 replies; 128+ messages in thread
From: Andrew Morton @ 2004-04-30  0:32 UTC (permalink / raw)
  To: Andy Isaacson
  Cc: pj, vonbrand, nickpiggin, jgarzik, brettspamacct, linux-kernel

Andy Isaacson <adi@hexapodia.org> wrote:
>
>  But in a related case, I have a background daemon that does a lot of IO
>  (mostly sequential, one page at a time read/modify/write of a multi-GB
>  file) to a filesystem on a separate spindle from my main filesystems.
>  I'd like to use a similar mechanism to say "don't let this program eat
>  my pagecache" that will let the daemon crunch away without severely
>  impacting my desktop work.

fadvise(POSIX_FADV_DONTNEED) is ideal for this.  Run it once per megabyte
or so.
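A minimal sketch of that pattern (the stream_without_caching() helper and the one-megabyte window are assumptions for the example, not anything in the kernel):

```c
/* Hedged sketch: consume a file sequentially, asking the kernel to
 * drop each megabyte of its cached pages as soon as it has been read,
 * so the scan never accumulates pagecache.  The hints are advisory. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define WINDOW (1024 * 1024)	/* drop cache once per megabyte */

/* Returns total bytes read, or -1 on a read error. */
static long long stream_without_caching(int fd)
{
	char buf[64 * 1024];
	long long total = 0;
	off_t dropped = 0;
	ssize_t n;

	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		total += n;
		if (total - dropped >= WINDOW) {
			/* drop just the window we have consumed; a
			 * length of 0 would mean "to end of file" */
			posix_fadvise(fd, dropped, total - dropped,
				      POSIX_FADV_DONTNEED);
			dropped = total;
		}
	}
	/* drop whatever remains cached at EOF */
	posix_fadvise(fd, dropped, 0, POSIX_FADV_DONTNEED);
	return n < 0 ? -1 : total;
}
```

The daemon's read/modify/write loop could do the same over the region it has finished with.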

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  0:32                   ` Andrew Morton
@ 2004-04-30  0:54                     ` Paul Jackson
  2004-04-30  5:38                       ` Andy Isaacson
  2004-04-30  7:52                     ` Jeff Garzik
  1 sibling, 1 reply; 128+ messages in thread
From: Paul Jackson @ 2004-04-30  0:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: adi, vonbrand, nickpiggin, jgarzik, brettspamacct, linux-kernel

Andrew wrote:
> fadvise(POSIX_FADV_DONTNEED) is ideal for this.

Perhaps ... perhaps not.

Just as the knobs "only reclaim pagecache" and "reclaim vfs caches
harder" had too big a scope (system-wide), using fadvise might have too
small a scope (currently cached pages of current task only).

If his background daemon is some shell script, say, that uses 'cat' to
generate the i/o to the other spindle, then he probably wants to be
marking that daemon job "don't let this entire job eat my pagecache",
not rebuilding a hacked up cat command with added POSIX_FADV_DONTNEED
calls every megabyte.

CKRM to the rescue ... ??

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 14:29               ` Rik van Riel
@ 2004-04-30  3:00                 ` Nick Piggin
  2004-04-30 12:50                   ` Rik van Riel
  0 siblings, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-30  3:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Paul Mackerras, brettspamacct, jgarzik,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1060 bytes --]

Rik van Riel wrote:
> On Thu, 29 Apr 2004, Nick Piggin wrote:
> 
> 
>>I'm not very impressed with the pagecache use-once logic, and I
>>have a patch to remove it completely and treat non-mapped touches
>>(IMO) more sanely.
> 
> 
> The basic idea of use-once isn't bad (search for LIRS and
> ARC page replacement), however the Linux implementation
> doesn't have any of the checks and balances that the
> researched replacement algorithms have...
> 
> However, adding the checks and balancing required for LIRS,
> ARC and CAR(S) isn't easy since it requires keeping track of
> a number of recently evicted pages.  That could be quite a 
> bit of infrastructure, though it might be well worth it.
> 

No, use once logic is good in theory I think. Unfortunately
our implementation is quite fragile IMO (although it seems
to have been "good enough").

This is what I'm currently doing (on top of a couple of other
patches, but you get the idea). I should be able to transform
it into a proper use-once logic if I pick up Nikita's inactive
list second chance bit.


[-- Attachment #2: vm-dropbehind.patch --]
[-- Type: text/x-patch, Size: 3622 bytes --]


Changes mark_page_accessed to only set the PageAccessed bit, and
not move pages around the LRUs. This means we don't have to take
the lru_lock, and it also makes page ageing and scanning consistient
and all handled in mm/vmscan.c


 include/linux/buffer_head.h            |    0 
 linux-2.6-npiggin/include/linux/swap.h |    5 ++-
 linux-2.6-npiggin/mm/filemap.c         |    6 ----
 linux-2.6-npiggin/mm/swap.c            |   45 ---------------------------------
 4 files changed, 4 insertions(+), 52 deletions(-)

diff -puN mm/filemap.c~vm-dropbehind mm/filemap.c
--- linux-2.6/mm/filemap.c~vm-dropbehind	2004-04-29 17:31:38.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c	2004-04-29 17:31:38.000000000 +1000
@@ -663,11 +663,7 @@ page_ok:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
-		/*
-		 * Mark the page accessed if we read the beginning.
-		 */
-		if (!offset)
-			mark_page_accessed(page);
+		mark_page_accessed(page);
 
 		/*
 		 * Ok, we have the page, and it's up-to-date, so
diff -puN mm/swap.c~vm-dropbehind mm/swap.c
--- linux-2.6/mm/swap.c~vm-dropbehind	2004-04-29 17:31:38.000000000 +1000
+++ linux-2.6-npiggin/mm/swap.c	2004-04-29 17:31:38.000000000 +1000
@@ -100,51 +100,6 @@ int rotate_reclaimable_page(struct page 
 	return 0;
 }
 
-/*
- * FIXME: speed this up?
- */
-void fastcall activate_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-
-	spin_lock_irq(&zone->lru_lock);
-	if (PageLRU(page)
-		&& !PageActiveMapped(page) && !PageActiveUnmapped(page)) {
-
-		del_page_from_inactive_list(zone, page);
-
-		if (page_mapped(page)) {
-			SetPageActiveMapped(page);
-			add_page_to_active_mapped_list(zone, page);
-		} else {
-			SetPageActiveUnmapped(page);
-			add_page_to_active_unmapped_list(zone, page);
-		}
-		inc_page_state(pgactivate);
-	}
-	spin_unlock_irq(&zone->lru_lock);
-}
-
-/*
- * Mark a page as having seen activity.
- *
- * inactive,unreferenced	->	inactive,referenced
- * inactive,referenced		->	active,unreferenced
- * active,unreferenced		->	active,referenced
- */
-void fastcall mark_page_accessed(struct page *page)
-{
-	if (!PageActiveMapped(page) && !PageActiveUnmapped(page)
-			&& PageReferenced(page) && PageLRU(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-	} else if (!PageReferenced(page)) {
-		SetPageReferenced(page);
-	}
-}
-
-EXPORT_SYMBOL(mark_page_accessed);
-
 /**
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
diff -puN include/linux/swap.h~vm-dropbehind include/linux/swap.h
--- linux-2.6/include/linux/swap.h~vm-dropbehind	2004-04-29 17:31:38.000000000 +1000
+++ linux-2.6-npiggin/include/linux/swap.h	2004-04-30 12:55:02.000000000 +1000
@@ -165,12 +165,13 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
-extern void FASTCALL(activate_page(struct page *));
-extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int rotate_reclaimable_page(struct page *page);
 extern void swap_setup(void);
 
+/* Mark a page as having seen activity. */
+#define mark_page_accessed(page)	SetPageReferenced(page)
+
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone **, unsigned int, unsigned int);
 extern int shrink_all_memory(int);
diff -puN include/linux/mm_inline.h~vm-dropbehind include/linux/mm_inline.h
diff -puN mm/memory.c~vm-dropbehind mm/memory.c
diff -puN mm/shmem.c~vm-dropbehind mm/shmem.c
diff -puN include/linux/buffer_head.h~vm-dropbehind include/linux/buffer_head.h

_

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re:  ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 18:14               ` Adam Kropelin
@ 2004-04-30  3:17                 ` Tim Connors
  0 siblings, 0 replies; 128+ messages in thread
From: Tim Connors @ 2004-04-30  3:17 UTC (permalink / raw)
  Cc: Rik van Riel, Andrew Morton, Jeff Garzik, brettspamacct,
	linux-kernel

Adam Kropelin <akropel1@rochester.rr.com> said on Thu, 29 Apr 2004 14:14:13 -0400:
> On Wed, Apr 28, 2004 at 09:47:45PM -0400, Rik van Riel wrote:
> > On Wed, 28 Apr 2004, Andrew Morton wrote:
> > 
> > > OK, so it takes four seconds to swap mozilla back in, and you noticed it.
> > > 
> > > Did you notice that those three kernel builds you just did ran in twenty
> > > seconds less time because they had more cache available?  Nope.
> > 
> > That's exactly why desktops should be optimised to give
> > the best performance where the user notices it most...
...
> The 'swappiness' tunable may well give enough control over the situation
> to suit all sorts of users. If nothing else, this thread has raised
> awareness that such a tunable exists and can be played with to influence
> the kernel's decision-making. Distros, too, should give consideration to
> appropriate default settings to serve their intended users.

Actually, I decided to investigate how 2.4 compares (we're still stuck
on 2.4)

According to this:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0210.1/0011.html 

2.6 with swappiness of 0% is the same as 2.4.19 - I assume 2.4.19's VM is
the same as 2.4.26 (given the feature freeze).

I have always been completely unimpressed with the 2.4 VM - before and
after the big change in ~2.4.10. It has *always* preferred to use
cache in preference to a recently used application.

So will this still apply to 2.6 with swappiness of 0%?

I might try to get my sysadmin to put on 2.6, because 2.4 is quite
unusable for some of the work I do.  If I need mozilla at the same time
as my visualisation software - which allocates a good 3/4 of RAM after
reading a file of about that size, still leaving enough for mozilla and
X combined - mozilla and parts of X still get swapped out, and the cache
is wasted, since I only ever read the file once, and it is written on
another host.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
"32-bit patch for a 16-bit GUI shell running on top of an
8-bit operating system written for a 4-bit processor by a
2-bit company who cannot stand 1 bit of competition."

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re:  ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 22:18               ` Timothy Miller
  2004-04-29 22:46                 ` Paul Jackson
@ 2004-04-30  3:37                 ` Tim Connors
  1 sibling, 0 replies; 128+ messages in thread
From: Tim Connors @ 2004-04-30  3:37 UTC (permalink / raw)
  Cc: Paul Jackson, vonbrand, nickpiggin, jgarzik, akpm, brettspamacct,
	linux-kernel

Timothy Miller <miller@techsource.com> said on Thu, 29 Apr 2004 18:18:06 -0400:
> Paul Jackson wrote:
> > Timothy wrote:
> > 
> >>Perhaps nice level could influence how much a process is allowed to 
> >>affect page cache.
> > 
> > 
> > I'm from the school that says 'nice' applies to scheduling priority,
> > not memory usage.
> > 
> > I'd expect a different knob, a per-task inherited value as is 'nice',
> > to control memory usage.
> 
> Linux kernel developers seem to be of the mind that you cannot trust 
> what applications tell you about themselves, so it's better to use 
> heuristics to GUESS how to schedule something, rather than to add YET 
> ANOTHER property to it.

Why is that?

On the desktop system/workstation, which is what we are talking about
here -- we want the desktop system in particular to be responsive --
the user wouldn't try to do anything malicious, so why not trust the
applications? openoffice and mozilla and my visualisation software are
going to know what they want out of the kernel (possibly with
safeguards such that they only tell the kernel what they want if the
kernel happens to be in some tested range, perhaps), but the kernel sure
as hell won't know what my custom-built application wants via
heuristics, because I am doing something that no-one else is, and so
my exact workloads haven't been experienced or designed for.

On a server, you can have a /proc file to tell the kernel to ignore
everything an application tells you, or to ignore/believe applications
with uids in ranges xx--yy.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Beware of Programmers who carry screwdrivers.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 14:49               ` Marc Singer
@ 2004-04-30  4:08                 ` Nick Piggin
  2004-04-30 22:31                   ` Marc Singer
  0 siblings, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-30  4:08 UTC (permalink / raw)
  To: Marc Singer
  Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel,
	Russell King

Marc Singer wrote:
> On Thu, Apr 29, 2004 at 02:26:17PM +1000, Nick Piggin wrote:
> 
>>Yes it includes something which should help that. Along with
>>the "split active lists" that I mentioned might help your
>>problem when WLI first came up with the change to the
>>swappiness calculation for your problem.
>>
>>It would be great if you had time to give my patch a run.
>>It hasn't been widely stress tested yet though, so no
>>production systems, of course!
> 
> 
> As I said, I'm game to have a go.  The trouble was that it doesn't
> apply.  My development kernel has an RMK patch applied that seems to
> conflict with the MM patch on which you depend.
> 

You would probably be better off trying a simpler change
first actually:

in mm/vmscan.c, shrink_list(), change:

if (res == WRITEPAGE_ACTIVATE) {
	ClearPageReclaim(page);
	goto activate_locked;
}

to

if (res == WRITEPAGE_ACTIVATE) {
	ClearPageReclaim(page);
	goto keep_locked;
}

I think it is not the correct solution, but it should narrow
down your problem. Let us know how it goes.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:01       ` Horst von Brand
                           ` (2 preceding siblings ...)
  2004-04-29 20:36         ` Paul Jackson
@ 2004-04-30  5:15         ` Nick Piggin
  2004-04-30  6:20         ` Tim Connors
  4 siblings, 0 replies; 128+ messages in thread
From: Nick Piggin @ 2004-04-30  5:15 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel

Horst von Brand wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> said:
> 
> [...]
> 
> 
>>I don't know. What if you have some huge application that only
>>runs once per day for 10 minutes? Do you want it to be consuming
>>100MB of your memory for the other 23 hours and 50 minutes for
>>no good reason?
> 
> 
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?


It doesn't know that.

But if you restrict this guy's working set to a tiny amount
and just allow it to thrash away, then if nothing else, all
that wasted disk IO will slow all your other stuff down too.

However that is something we can allow you to tune, via RSS
limits. I am maintaining Rik's patch for that and will send
it on when rmap optimisation work is more finalised.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  0:54                     ` Paul Jackson
@ 2004-04-30  5:38                       ` Andy Isaacson
  2004-04-30  6:00                         ` Nick Piggin
  0 siblings, 1 reply; 128+ messages in thread
From: Andy Isaacson @ 2004-04-30  5:38 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, vonbrand, nickpiggin, jgarzik, brettspamacct,
	linux-kernel

On Thu, Apr 29, 2004 at 05:54:42PM -0700, Paul Jackson wrote:
> Andrew wrote:
> > fadvise(POSIX_FADV_DONTNEED) is ideal for this.
> 
> Perhaps ... perhaps not.
> 
> Just as the knobs "only reclaim pagecache" and "reclaim vfs caches
> harder" had too big a scope (system-wide), using fadvise might have too
> small a scope (currently cached pages of current task only).
> 
> If his background daemon is some shell script, say, that uses 'cat' to
> generate the i/o to the other spindle, then he probably wants to be
> marking that daemon job "don't let this entire job eat my pagecache",
> not rebuilding a hacked up cat command with added POSIX_FADV_DONTNEED
> calls every megabyte.

Well, in this case it's bespoke C code so adding the fadvise isn't
terribly difficult.  (The structure of the code doesn't lend itself to
"do this every 10 MB" but I'm sure I can hack something up.)

It would be nicer if the kernel would do the right thing without needing
to have its hand held, but the fadvise will solve my immediate need.
(Assuming it works on 2.4.)

-andy

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  5:38                       ` Andy Isaacson
@ 2004-04-30  6:00                         ` Nick Piggin
  0 siblings, 0 replies; 128+ messages in thread
From: Nick Piggin @ 2004-04-30  6:00 UTC (permalink / raw)
  To: Andy Isaacson
  Cc: Paul Jackson, Andrew Morton, vonbrand, jgarzik, brettspamacct,
	linux-kernel

Andy Isaacson wrote:
> On Thu, Apr 29, 2004 at 05:54:42PM -0700, Paul Jackson wrote:
> 
>>Andrew wrote:
>>
>>>fadvise(POSIX_FADV_DONTNEED) is ideal for this.
>>
>>Perhaps ... perhaps not.
>>
>>Just as the knobs "only reclaim pagecache" and "reclaim vfs caches
>>harder" had too big a scope (system-wide), using fadvise might have too
>>small a scope (currently cached pages of current task only).
>>
>>If his background daemon is some shell script, say, that uses 'cat' to
>>generate the i/o to the other spindle, then he probably wants to be
>>marking that daemon job "don't let this entire job eat my pagecache",
>>not rebuilding a hacked up cat command with added POSIX_FADV_DONTNEED
>>calls every megabyte.
> 
> 
> Well, in this case it's bespoke C code so adding the fadvise isn't
> terribly difficult.  (The structure of the code doesn't lend itself to
> "do this every 10 MB" but I'm sure I can hack something up.)
> 
> It would be nicer if the kernel would do the right thing without needing
> to have its hand held, but the fadvise will solve my immediate need.
> (Assuming it works on 2.4.)

Right for one thing will always be wrong for another.
If you want some specific behaviour then you might
have to hold hands. That is just the way it goes.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re:  ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 20:01       ` Horst von Brand
                           ` (3 preceding siblings ...)
  2004-04-30  5:15         ` Nick Piggin
@ 2004-04-30  6:20         ` Tim Connors
  2004-04-30  6:34           ` Nick Piggin
  4 siblings, 1 reply; 128+ messages in thread
From: Tim Connors @ 2004-04-30  6:20 UTC (permalink / raw)
  To: Horst von Brand
  Cc: Nick Piggin, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel

Horst von Brand <vonbrand@inf.utfsm.cl> said on Thu, 29 Apr 2004 16:01:11 -0400:
> Nick Piggin <nickpiggin@yahoo.com.au> said:
> 
> [...]
> 
> > I don't know. What if you have some huge application that only
> > runs once per day for 10 minutes? Do you want it to be consuming
> > 100MB of your memory for the other 23 hours and 50 minutes for
> > no good reason?
> 
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?

Note that we are not talking about having insufficient memory. In my
case (2.4 kernel - ie, 2.6 with swappiness 0%) there is more than
enough memory to contain all my working set - it's only because cache
is too eager to claim memory that is otherwise in use that
non-optimalities occur.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
If I'd known computer science was going to be like this, I'd never have
given up being a rock 'n' roll star.                -- G. Hirst

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  6:20         ` Tim Connors
@ 2004-04-30  6:34           ` Nick Piggin
  2004-04-30  7:05             ` Tim Connors
  0 siblings, 1 reply; 128+ messages in thread
From: Nick Piggin @ 2004-04-30  6:34 UTC (permalink / raw)
  To: Tim Connors
  Cc: Horst von Brand, Jeff Garzik, Andrew Morton, brettspamacct,
	linux-kernel

Tim Connors wrote:
> Horst von Brand <vonbrand@inf.utfsm.cl> said on Thu, 29 Apr 2004 16:01:11 -0400:
> 
>>Nick Piggin <nickpiggin@yahoo.com.au> said:
>>
>>[...]
>>
>>
>>>I don't know. What if you have some huge application that only
>>>runs once per day for 10 minutes? Do you want it to be consuming
>>>100MB of your memory for the other 23 hours and 50 minutes for
>>>no good reason?
>>
>>How on earth is the kernel supposed to know that for this one particular
>>job you don't care if it takes 3 hours instead of 10 minutes, just because
>>you don't want to spare enough preciousss RAM?
> 
> 
> Note that we are not talking about having insufficient memory. In my
>>case (2.4 kernel - ie, 2.6 with swappiness 0%) there is more than
> enough memory to contain all my working set - it's only because cache
> is too eager to claim memory that is otherwise in use that
> non-optimalities occur.
> 

Well, it depends on what you mean by working set.

In our memory manager, there is a point where often used
"file cache" (ie. unmapped cache) is considered preferable
to unused or little used "application memory" (mapped
memory).

There will be a point where even the most swap-phobic desktop
users will want to start swapping.

I missed the description of your exact problem... was it in
this thread somewhere? Testing 2.6 would be appreciated if
possible too.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  6:34           ` Nick Piggin
@ 2004-04-30  7:05             ` Tim Connors
  2004-04-30  7:15               ` Nick Piggin
  2004-04-30  9:18               ` Re[2]: " vda
  0 siblings, 2 replies; 128+ messages in thread
From: Tim Connors @ 2004-04-30  7:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Horst von Brand, Jeff Garzik, Andrew Morton, brettspamacct,
	Linux Kernel Mailing List

On Fri, 30 Apr 2004, Nick Piggin wrote:

> Tim Connors wrote:
> > Horst von Brand <vonbrand@inf.utfsm.cl> said on Thu, 29 Apr 2004 16:01:11 -0400:
> >
> >>Nick Piggin <nickpiggin@yahoo.com.au> said:
> >>
> >>[...]
> >>
> >>
> >>>I don't know. What if you have some huge application that only
> >>>runs once per day for 10 minutes? Do you want it to be consuming
> >>>100MB of your memory for the other 23 hours and 50 minutes for
> >>>no good reason?
> >>
> >>How on earth is the kernel supposed to know that for this one particular
> >>job you don't care if it takes 3 hours instead of 10 minutes, just because
> >>you don't want to spare enough preciousss RAM?
> >
> >
> > Note that we are not talking about having insufficient memory. In my
> > case (2.4 kernel - ie, 2.6 with swappiness 0%) there is more than
> > enough memory to contain all my working set - it's only because cache
> > is too eager to claim memory that is otherwise in use that
> > non-optimalities occur.
> >
>
> Well depends on what you mean by working set.
>
> In our memory manager, there is a point where often used
> "file cache" (ie. unmapped cache) is considered preferable
> to unused or little used "application memory" (mapped
> memory).

Sure - and indeed I have current swap usage (now that I am not doing
anything) of 300MB - that's good because I am not using whatever's in
there.

> I missed the description of your exact problem... was it in
> this thread somewhere? Testing 2.6 would be appreciated if
> possible too.

http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1033.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1394.html

In short: I have 512MB RAM. The files I am reading are read over NFS,
created remotely. I read them once, and then discard them (either delete
them, or keep them around, but don't read them again -- obviously, in the
latter case, they have an opportunity to pollute the cache for a long
time). If I do read them twice, it's only been a few times that I have
noticed any speedup the second time around, even for the smaller files (if
they're small enough not to cause a problem with swapping vital bits of
software out, then they are small enough that extra reads are hardly
noticeable anyway - given the damn fast raid disks behind NFS we have).

For one of the file types, these can be several hundred megs, and are read
by an astronomical package - I have no idea how they are read, but because
it is FITS, maybe it has to go back and read the header several times, but
I doubt the image data is read more than once. Come display time, memory
usage in this example is roughly the size of the FITS file - several
hundred megs. I don't recall how big X is, in such a situation, but the
sum of them both, plus recently used apps like mozilla, is below RAM size.
Watching top, I can see, during the read, mozilla rsize memory usage go
down rapidly - about as rapidly as cache usage going up.

Parts of X and the window manager also get swapped out, so when I move to
another virtual page, I get to watch fvwm redraw the screen - this is not
too painful though, because only a few megs need be swapped back in
(although the HD is seeking all over the place as things thrash about, so
it does still take non-negligible amount of time). Mozilla takes about 30
seconds to swap back in (~50-100MB - again, lots of thrashing from the
HD). I don't recall once mozilla is swapped back in, whether cache usage
has dropped again, or whether the visualisation software loses its pages
instead.

I'll try to test 2.6 here (half the battle is convincing the sysadmins
this is a worthwhile pursuit) - I use that at home quite successfully, but
I don't do big files or visualisation there.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
'It's amazing I won. I was running against peace, prosperity and incumbency.'
  -- George W. Bush. June 14, 2001, to Swedish PM Goran Perrson,
     unaware that a live television camera was still rolling.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  7:05             ` Tim Connors
@ 2004-04-30  7:15               ` Nick Piggin
  2004-04-30  9:18               ` Re[2]: " vda
  1 sibling, 0 replies; 128+ messages in thread
From: Nick Piggin @ 2004-04-30  7:15 UTC (permalink / raw)
  To: Tim Connors
  Cc: Horst von Brand, Jeff Garzik, Andrew Morton, brettspamacct,
	Linux Kernel Mailing List

Tim Connors wrote:
> On Fri, 30 Apr 2004, Nick Piggin wrote:
> 
>>In our memory manager, there is a point where often used
>>"file cache" (ie. unmapped cache) is considered preferable
>>to unused or little used "application memory" (mapped
>>memory).
> 
> 
> Sure - and indeed I have current swap usage (now that I am not doing
> anything) of 300MB - that's good because I am not using whatever's in
> there.
> 
> 
>>I missed the description of your exact problem... was it in
>>this thread somewhere? Testing 2.6 would be appreciated if
>>possible too.
> 
> 
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1033.html
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1394.html
> 
> In short: I have 512MB RAM. The files I am reading are read over NFS,

Ah, thanks for the description.

2.6 has a problem with NFS filesystems that would cause symptoms
like yours. I'm not sure whether 2.4 has something similar or not.
You can probably expect a fix for 2.6.6 but I'm not sure if there
is a patch that has been agreed upon yet.

In short, there probably isn't much point testing 2.6 right now.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  0:32                   ` Andrew Morton
  2004-04-30  0:54                     ` Paul Jackson
@ 2004-04-30  7:52                     ` Jeff Garzik
  2004-04-30  8:02                       ` Andrew Morton
  1 sibling, 1 reply; 128+ messages in thread
From: Jeff Garzik @ 2004-04-30  7:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Isaacson, pj, vonbrand, nickpiggin, brettspamacct,
	linux-kernel

Andrew Morton wrote:
> Andy Isaacson <adi@hexapodia.org> wrote:
> 
>> But in a related case, I have a background daemon that does a lot of IO
>> (mostly sequential, one page at a time read/modify/write of a multi-GB
>> file) to a filesystem on a separate spindle from my main filesystems.
>> I'd like to use a similar mechanism to say "don't let this program eat
>> my pagecache" that will let the daemon crunch away without severely
>> impacting my desktop work.
> 
> 
> fadvise(POSIX_FADV_DONTNEED) is ideal for this.  Run it once per megabyte
> or so.


Sweet.  I'm so happy you added posix_fadvise (way back when), and even 
happier to hear this.

Does our fadvise support len==0 ("I mean the whole file")?  That's 
defined in POSIX, and would allow a compliant app to simply 
POSIX_FADV_DONTNEED once at the beginning.

	Jeff




^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  7:52                     ` Jeff Garzik
@ 2004-04-30  8:02                       ` Andrew Morton
  2004-04-30  8:09                         ` Jeff Garzik
  0 siblings, 1 reply; 128+ messages in thread
From: Andrew Morton @ 2004-04-30  8:02 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: adi, pj, vonbrand, nickpiggin, brettspamacct, linux-kernel

Jeff Garzik <jgarzik@pobox.com> wrote:
>
> > fadvise(POSIX_FADV_DONTNEED) is ideal for this.  Run it once per megabyte
>  > or so.
> 
> 
>  Sweet.  I'm so happy you added posix_fadvise (way back when), and even 
>  happier to hear this.

There are a number of other goodies we could add to it, as linux
extensions.

>  Does our fadvise support len==0 ("I mean the whole file")?  That's 
>  defined in POSIX, and would allow a compliant app to simply 
>  POSIX_FADV_DONTNEED once at the beginning.

Well I'll be darned.

--- 25/mm/fadvise.c~fadvise-len-fix	2004-04-30 00:58:00.437598504 -0700
+++ 25-akpm/mm/fadvise.c	2004-04-30 00:59:03.237051536 -0700
@@ -38,6 +38,9 @@ asmlinkage long sys_fadvise64_64(int fd,
 		goto out;
 	}
 
+	if (len == 0)		/* 0 == "all data following offset" */
+		len = -1;
+
 	bdi = mapping->backing_dev_info;
 
 	switch (advice) {

_


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  8:02                       ` Andrew Morton
@ 2004-04-30  8:09                         ` Jeff Garzik
  0 siblings, 0 replies; 128+ messages in thread
From: Jeff Garzik @ 2004-04-30  8:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: adi, pj, vonbrand, nickpiggin, brettspamacct, linux-kernel

Andrew Morton wrote:
> Jeff Garzik <jgarzik@pobox.com> wrote:
>> Does our fadvise support len==0 ("I mean the whole file")?  That's 
>> defined in POSIX, and would allow a compliant app to simply 
>> POSIX_FADV_DONTNEED once at the beginning.
> 
> 
> Well I'll be darned.


FWIW the specific language is "If len is zero, all data following offset 
is specified."

(for others, you probably already have this somewhere)
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

top level SuSv3:
http://www.opengroup.org/onlinepubs/007904975/toc.htm

	Jeff




^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re[2]: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  7:05             ` Tim Connors
  2004-04-30  7:15               ` Nick Piggin
@ 2004-04-30  9:18               ` vda
  2004-04-30  9:33                 ` Arjan van de Ven
  1 sibling, 1 reply; 128+ messages in thread
From: vda @ 2004-04-30  9:18 UTC (permalink / raw)
  To: Tim Connors
  Cc: Nick Piggin, Horst von Brand, Jeff Garzik, Andrew Morton,
	brettspamacct, Linux Kernel Mailing List

Hello Tim,

Friday, April 30, 2004, 10:05:19 AM, you wrote:
TC> Parts of X and the window manager also get swapped out, so when I move to
TC> another virtual page, I get to watch fvwm redraw the screen - this is not
TC> too painful though, because only a few megs need be swapped back in
TC> (although the HD is seeking all over the place as things thrash about, so
TC> it does still take non-negligible amount of time). Mozilla takes about 30
TC> seconds to swap back in (~50-100MB - again, lots of thrashing from the

I don't want to say that you're seeing optimal behavior,
just a different angle of view: why in hell should a browser
have such a ridiculously large RSS? Why does it try to keep
so much stuff in RAM?

Multimedia content (jpegs etc) is typically cached in
filesystem, so Mozilla polluted pagecache with it when
it saved JPEGs to the cache *and* then it keeps 'em in RAM
too, which doubles RAM usage. Most probably more: there
are severe internal fragmentation problems after you use
such a large application for several hours straight.
Why not reread a JPEG whenever you need it? If you are
using Mozilla right now, it will be in the pagecache.
When you are away, the cache will be discarded - no need
to page out Mozilla pages with JPEG content, because
there aren't Mozilla pages with JPEG content!
RSS is smaller, there is less internal fragmentation, everyone's
happy.

(I don't specifically target Mozilla; it has shown some
improvement recently. Replace with your favorite
monstrosity.)

Kernel folks probably can improve kernel behavior.
Next version of $BloatyApp will happily "use"
gained performance and improved RAM management
as an excuse for even less optimal code.

It's a vicious circle.
--
vda



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: Re[2]: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  9:18               ` Re[2]: " vda
@ 2004-04-30  9:33                 ` Arjan van de Ven
  2004-04-30 11:33                   ` Denis Vlasenko
  2004-04-30 16:19                   ` Timothy Miller
  0 siblings, 2 replies; 128+ messages in thread
From: Arjan van de Ven @ 2004-04-30  9:33 UTC (permalink / raw)
  To: vda
  Cc: Tim Connors, Nick Piggin, Horst von Brand, Jeff Garzik,
	Andrew Morton, brettspamacct, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 289 bytes --]


> Multimedia content (jpegs etc) is typically cached in
> filesystem, so Mozilla polluted pagecache with it when
> it saved JPEGs to the cache *and* then it keeps 'em in RAM
> too, which doubles RAM usage. 

well if mozilla just mmap's the jpegs there is no double caching .....


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  9:33                 ` Arjan van de Ven
@ 2004-04-30 11:33                   ` Denis Vlasenko
  2004-04-30 16:19                   ` Timothy Miller
  1 sibling, 0 replies; 128+ messages in thread
From: Denis Vlasenko @ 2004-04-30 11:33 UTC (permalink / raw)
  To: arjanv
  Cc: Tim Connors, Nick Piggin, Horst von Brand, Jeff Garzik,
	Andrew Morton, brettspamacct, Linux Kernel Mailing List

On Friday 30 April 2004 12:33, Arjan van de Ven wrote:
> > Multimedia content (jpegs etc) is typically cached in
> > filesystem, so Mozilla polluted pagecache with it when
> > it saved JPEGs to the cache *and* then it keeps 'em in RAM
> > too, which doubles RAM usage.
>
> well if mozilla just mmap's the jpegs there is no double caching .....

I may be wrong, but Mozilla keeps the unpacked bitmap in malloc() space.
The point is, $BloatyApp will keep bloating up while you
are working on improving the kernel. I guess it's very clear which
process is easier. You cannot win that race.

This is OpenOffice on idle 128Mb RAM, 1000MHz Duron machine with KDE,
Mozilla and KMail running:

# time swriter;time swriter

real    0m33.906s
user    0m10.163s
sys     0m0.705s

real    0m24.025s
user    0m10.069s
sys     0m0.546s

I closed the window as soon as it appeared.

Freshly started swriter in top:
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 2081 root      15   0 93980  41M 80300 S     1,3 34,0   0:09   0 soffice.bin

93 megs. 10 seconds of 1GHz CPU time taken...
--
vda


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 23:08                   ` Timothy Miller
@ 2004-04-30 12:31                     ` Bart Samwel
  2004-04-30 15:35                       ` Clay Haapala
  2004-04-30 22:11                       ` Paul Jackson
  0 siblings, 2 replies; 128+ messages in thread
From: Bart Samwel @ 2004-04-30 12:31 UTC (permalink / raw)
  To: Timothy Miller
  Cc: Paul Jackson, vonbrand, nickpiggin, jgarzik, Andrew Morton,
	brettspamacct, linux-kernel

On Fri, 2004-04-30 at 01:08, Timothy Miller wrote:
> Agreed.  And this is why I suggested not adding another knob but rather 
> going with the existing nice value.
> 
> Mind you, this shouldn't necessarily be done without some kind of 
> experimentation.  Put two knobs in the kernel and try varying them  to 
> each other to see what sorts of jobs, if any, would benefit in a 
> disparity between cpu-nice and io-nice.  If there IS a significant 
> difference, then add the extra knob.  If there isn't, then don't.

Thought experiment: what would happen when you set the hypothetical
cpu-nice and io-nice knobs very differently?

* cpu-nice 20, io-nice -20: Read I/O will finish immediately, but then
the process will have to wait for ages to get a CPU slice to process the
data, so why would you want to read it so quickly? The process can do as
much write I/O as it wants, but why is it not okay to take ages to write
the data if it's okay to take ages to produce it?

* cpu-nice -20, io-nice 20: Read I/O will take ages, but once the data
gets there, the processor is immediately taken to process the data as
fast as possible. If it was okay to take ages to read the data, why
would you want to process it as soon as you can? It makes some sense for
write I/O though: produce data as fast as the other processes will allow
you to write it. But if you're going to hog the CPU completely, why give
other processes the chance to do a lot of I/O while they don't get the
CPU time to submit any I/O? Going for a smaller difference makes more
sense.

As far as I can tell, giving the knobs very different values doesn't
make much sense. The same arguments go for medium-sized differences. And
if we're going to give the knobs only *slightly* different values, we
might as well make them the same. If we really need cpu-nice = 0 and
io-nice = 3 somewhere, then I think that's a sign of a kernel problem,
where the kernel's various nice-knobs aren't calibrated correctly to
result in the same amount of "niceness" when they're set to the same
value. And cpu-nice = io-nice = 3 would probably have about the same
effect.


BTW, if there *are* going to be more knobs, I suggest adding
"memory-nice" as well. :) If you set memory-nice to 20, then the process
will not kick out much memory from other processes (it will require more
I/O -- but that can be throttled using io-nice). If you set memory-nice
to -20, then the process will kick out the memory of all other processes
if it needs to.

--Bart

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  3:00                 ` Nick Piggin
@ 2004-04-30 12:50                   ` Rik van Riel
  2004-04-30 13:07                     ` Nick Piggin
  2004-04-30 13:18                     ` Nikita Danilov
  0 siblings, 2 replies; 128+ messages in thread
From: Rik van Riel @ 2004-04-30 12:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Paul Mackerras, brettspamacct, jgarzik,
	linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1066 bytes --]

On Fri, 30 Apr 2004, Nick Piggin wrote:
> Rik van Riel wrote:

> > The basic idea of use-once isn't bad (search for LIRS and
> > ARC page replacement), however the Linux implementation
> > doesn't have any of the checks and balances that the
> > researched replacement algorithms have...

> No, use once logic is good in theory I think. Unfortunately
> our implementation is quite fragile IMO (although it seems
> to have been "good enough").

Hey, that's what I said ;))))

> This is what I'm currently doing (on top of a couple of other
> patches, but you get the idea). I should be able to transform
> it into a proper use-once logic if I pick up Nikita's inactive
> list second chance bit.

Ummm nope, there just isn't enough info to keep things
as balanced as ARC/LIRS/CAR(T) can do.  No good way to
auto-tune the sizes of the active and inactive lists.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: TEXT/X-PATCH; NAME="vm-dropbehind.patch", Size: 3622 bytes --]


Changes mark_page_accessed to only set the PageAccessed bit, and
not move pages around the LRUs. This means we don't have to take
the lru_lock, and it also makes page ageing and scanning consistent
and all handled in mm/vmscan.c


 include/linux/buffer_head.h            |    0 
 linux-2.6-npiggin/include/linux/swap.h |    5 ++-
 linux-2.6-npiggin/mm/filemap.c         |    6 ----
 linux-2.6-npiggin/mm/swap.c            |   45 ---------------------------------
 4 files changed, 4 insertions(+), 52 deletions(-)

diff -puN mm/filemap.c~vm-dropbehind mm/filemap.c
--- linux-2.6/mm/filemap.c~vm-dropbehind	2004-04-29 17:31:38.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c	2004-04-29 17:31:38.000000000 +1000
@@ -663,11 +663,7 @@ page_ok:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
-		/*
-		 * Mark the page accessed if we read the beginning.
-		 */
-		if (!offset)
-			mark_page_accessed(page);
+		mark_page_accessed(page);
 
 		/*
 		 * Ok, we have the page, and it's up-to-date, so
diff -puN mm/swap.c~vm-dropbehind mm/swap.c
--- linux-2.6/mm/swap.c~vm-dropbehind	2004-04-29 17:31:38.000000000 +1000
+++ linux-2.6-npiggin/mm/swap.c	2004-04-29 17:31:38.000000000 +1000
@@ -100,51 +100,6 @@ int rotate_reclaimable_page(struct page 
 	return 0;
 }
 
-/*
- * FIXME: speed this up?
- */
-void fastcall activate_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-
-	spin_lock_irq(&zone->lru_lock);
-	if (PageLRU(page)
-		&& !PageActiveMapped(page) && !PageActiveUnmapped(page)) {
-
-		del_page_from_inactive_list(zone, page);
-
-		if (page_mapped(page)) {
-			SetPageActiveMapped(page);
-			add_page_to_active_mapped_list(zone, page);
-		} else {
-			SetPageActiveUnmapped(page);
-			add_page_to_active_unmapped_list(zone, page);
-		}
-		inc_page_state(pgactivate);
-	}
-	spin_unlock_irq(&zone->lru_lock);
-}
-
-/*
- * Mark a page as having seen activity.
- *
- * inactive,unreferenced	->	inactive,referenced
- * inactive,referenced		->	active,unreferenced
- * active,unreferenced		->	active,referenced
- */
-void fastcall mark_page_accessed(struct page *page)
-{
-	if (!PageActiveMapped(page) && !PageActiveUnmapped(page)
-			&& PageReferenced(page) && PageLRU(page)) {
-		activate_page(page);
-		ClearPageReferenced(page);
-	} else if (!PageReferenced(page)) {
-		SetPageReferenced(page);
-	}
-}
-
-EXPORT_SYMBOL(mark_page_accessed);
-
 /**
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
diff -puN include/linux/swap.h~vm-dropbehind include/linux/swap.h
--- linux-2.6/include/linux/swap.h~vm-dropbehind	2004-04-29 17:31:38.000000000 +1000
+++ linux-2.6-npiggin/include/linux/swap.h	2004-04-30 12:55:02.000000000 +1000
@@ -165,12 +165,13 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
-extern void FASTCALL(activate_page(struct page *));
-extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
 extern int rotate_reclaimable_page(struct page *page);
 extern void swap_setup(void);
 
+/* Mark a page as having seen activity. */
+#define mark_page_accessed(page)	SetPageReferenced(page)
+
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(struct zone **, unsigned int, unsigned int);
 extern int shrink_all_memory(int);
diff -puN include/linux/mm_inline.h~vm-dropbehind include/linux/mm_inline.h
diff -puN mm/memory.c~vm-dropbehind mm/memory.c
diff -puN mm/shmem.c~vm-dropbehind mm/shmem.c
diff -puN include/linux/buffer_head.h~vm-dropbehind include/linux/buffer_head.h

_

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30 12:50                   ` Rik van Riel
@ 2004-04-30 13:07                     ` Nick Piggin
  2004-04-30 13:18                     ` Nikita Danilov
  1 sibling, 0 replies; 128+ messages in thread
From: Nick Piggin @ 2004-04-30 13:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Paul Mackerras, brettspamacct, jgarzik,
	linux-kernel

Rik van Riel wrote:
> On Fri, 30 Apr 2004, Nick Piggin wrote:
> 
>> Rik van Riel wrote:
> 
> 
>> > The basic idea of use-once isn't bad (search for LIRS and
>> > ARC page replacement), however the Linux implementation
>> > doesn't have any of the checks and balances that the
>> > researched replacement algorithms have...
> 
> 
>> No, use once logic is good in theory I think. Unfortunately
>> our implementation is quite fragile IMO (although it seems
>> to have been "good enough").
> 
> 
> Hey, that's what I said ;))))
> 

Yes. I just thought you might have understood me to mean
that use-once is no good at all.

>> This is what I'm currently doing (on top of a couple of other
>> patches, but you get the idea). I should be able to transform
>> it into a proper use-once logic if I pick up Nikita's inactive
>> list second chance bit.
> 
> 
> Ummm nope, there just isn't enough info to keep things
> as balanced as ARC/LIRS/CAR(T) can do.  No good way to
> auto-tune the sizes of the active and inactive lists.
> 

I think perhaps it might be possible. I don't want to
discourage you from looking into more interesting replacement
schemes though. I don't doubt that our basic replacement
can often be suboptimal ;)

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30 12:50                   ` Rik van Riel
  2004-04-30 13:07                     ` Nick Piggin
@ 2004-04-30 13:18                     ` Nikita Danilov
  2004-04-30 13:39                       ` Nick Piggin
  1 sibling, 1 reply; 128+ messages in thread
From: Nikita Danilov @ 2004-04-30 13:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Andrew Morton, Paul Mackerras, brettspamacct,
	jgarzik, linux-kernel

Rik van Riel writes:
 > On Fri, 30 Apr 2004, Nick Piggin wrote:
 > > Rik van Riel wrote:
 > 
 > > > The basic idea of use-once isn't bad (search for LIRS and
 > > > ARC page replacement), however the Linux implementation
 > > > doesn't have any of the checks and balances that the
 > > > researched replacement algorithms have...
 > 
 > > No, use once logic is good in theory I think. Unfortunately
 > > our implementation is quite fragile IMO (although it seems
 > > to have been "good enough").
 > 
 > Hey, that's what I said ;))))
 > 
 > > This is what I'm currently doing (on top of a couple of other
 > > patches, but you get the idea). I should be able to transform
 > > it into a proper use-once logic if I pick up Nikita's inactive
 > > list second chance bit.
 > 
 > Ummm nope, there just isn't enough info to keep things
 > as balanced as ARC/LIRS/CAR(T) can do.  No good way to
 > auto-tune the sizes of the active and inactive lists.

While keeping "history" for non-resident pages is very good from many
points of view (it provides infrastructure for local replacement and
working-set tuning, for example) and is the right long-term direction,
the current scanner can still be improved somewhat in the meantime.

Here are results that I obtained some time ago. Test is to concurrently
clone (bk) and build (make -jN) kernel source in M directories.

For N = M = 11, TIMEFORMAT='%3R %3S %3U'

                                        REAL    SYS      USER
"stock"                               3818.320 568.999 4358.460
transfer-dirty-on-refill              3368.690 569.066 4377.845
check-PageSwapCache-after-add-to-swap 3237.632 576.208 4381.248
dont-unmap-on-pageout                 3207.522 566.539 4374.504
async-writepage                       3115.338 562.702 4325.212

(check-PageSwapCache-after-add-to-swap was added to mainline since then.)

These patches weren't updated for some time. Last version is at
ftp://ftp.namesys.com/pub/misc-patches/unsupported/extra/2004.03.25-2.6.5-rc2

[from Nick Piggin's patch]
 > 
 > Changes mark_page_accessed to only set the PageAccessed bit, and
 > not move pages around the LRUs. This means we don't have to take
 > the lru_lock, and it also makes page ageing and scanning consistent
 > and all handled in mm/vmscan.c

By the way, batch-mark_page_accessed patch at the URL above also tries
to reduce lock contention in mark_page_accessed(), but through more
standard approach of batching target pages in per-cpu pvec.

Nikita.


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30 13:18                     ` Nikita Danilov
@ 2004-04-30 13:39                       ` Nick Piggin
  0 siblings, 0 replies; 128+ messages in thread
From: Nick Piggin @ 2004-04-30 13:39 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Rik van Riel, Andrew Morton, Paul Mackerras, brettspamacct,
	jgarzik, linux-kernel

Nikita Danilov wrote:

> Here are results that I obtained some time ago. Test is to concurrently
> clone (bk) and build (make -jN) kernel source in M directories.
> 
> For N = M = 11, TIMEFORMAT='%3R %3S %3U'
> 
>                                         REAL    SYS      USER
> "stock"                               3818.320 568.999 4358.460
> transfer-dirty-on-refill              3368.690 569.066 4377.845
> check-PageSwapCache-after-add-to-swap 3237.632 576.208 4381.248
> dont-unmap-on-pageout                 3207.522 566.539 4374.504
> async-writepage                       3115.338 562.702 4325.212
> 

I like your transfer-dirty-on-refill change. It is definitely
worthwhile to mark a page as dirty when it drops off the active
list in order to hopefully get it written before it reaches the
tail of the inactive list.

> (check-PageSwapCache-after-add-to-swap was added to mainline since then.)
> 
> These patches weren't updated for some time. Last version is at
> ftp://ftp.namesys.com/pub/misc-patches/unsupported/extra/2004.03.25-2.6.5-rc2
> 
> [from Nick Piggin's patch]
>  > 
>  > Changes mark_page_accessed to only set the PageAccessed bit, and
>  > not move pages around the LRUs. This means we don't have to take
>  > the lru_lock, and it also makes page ageing and scanning consistent
>  > and all handled in mm/vmscan.c
> 
> By the way, batch-mark_page_accessed patch at the URL above also tries
> to reduce lock contention in mark_page_accessed(), but through more
> standard approach of batching target pages in per-cpu pvec.
> 

This is a good patch too if mark_page_accessed is required to
take the lock (which it currently is, of course).

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30 12:31                     ` Bart Samwel
@ 2004-04-30 15:35                       ` Clay Haapala
  2004-04-30 15:44                         ` Bart Samwel
  2004-04-30 22:11                       ` Paul Jackson
  1 sibling, 1 reply; 128+ messages in thread
From: Clay Haapala @ 2004-04-30 15:35 UTC (permalink / raw)
  To: bart
  Cc: Timothy Miller, Paul Jackson, vonbrand, nickpiggin, jgarzik,
	Andrew Morton, brettspamacct, linux-kernel

On Fri, 30 Apr 2004, Bart Samwel uttered the following:
> 
> Thought experiment: what would happen when you set the hypothetical
> cpu-nice and io-nice knobs very differently?
> 
Dunno why, but this talk of knobs makes me think of the "effects-mix"
knob on my bass amp that controls how much effects-loop signal is
mixed with the "dry" guitar signal.

Getting back to kernel talk, we have a "swappiness" knob, right?
Should there be, or is there already, a way to dynamically vary the
effect of swappiness [within a range], based on some monitored system
characteristics such as keyboard/mouse (HID) input or some other
identifiable profile?  Perhaps this is similar to nice/fairness logic
in the process schedulers.

Using HID as a profile, if I'm up late working on a paper in OO and
using a browser like Mozilla when updatedb fires up, the fact that
recent keyboard/mouse input has been seen would modify
swappiness down.

However, if I've fallen asleep in my chair for an hour when updatedb
fires up, no recent input events will have been detected, and updatedb
gets the high range of swappiness effect.  If I happen to wake up in
the middle of it, I just have to accept it'll take time to wake my
apps up, but at least they will get progressively more responsive as I
use 'em.

I use the term "profile" because I wouldn't want to have just HID
events be the trigger.  If a machine's main use is database or
web-serving, perhaps the appropriate events to monitor would be, say,
traffic on specified TCP ports or network interfaces.

The amount of extra work should be no more than what goes on with
entropy generation, I would think.
-- 
Clay Haapala (chaapala@cisco.com) Cisco Systems SRBU +1 763-398-1056
   6450 Wedgwood Rd, Suite 130 Maple Grove MN 55311 PGP: C89240AD
  "Oh, *that* Physics Prize.  Well, I just substituted 'stupidity' for
      'dark matter' in the equations, and it all came together."

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30 15:35                       ` Clay Haapala
@ 2004-04-30 15:44                         ` Bart Samwel
  0 siblings, 0 replies; 128+ messages in thread
From: Bart Samwel @ 2004-04-30 15:44 UTC (permalink / raw)
  To: Clay Haapala
  Cc: Timothy Miller, Paul Jackson, vonbrand, nickpiggin, jgarzik,
	Andrew Morton, brettspamacct, linux-kernel

On Fri, 2004-04-30 at 17:35, Clay Haapala wrote:
> On Fri, 30 Apr 2004, Bart Samwel uttered the following:
> > 
> > Thought experiment: what would happen when you set the hypothetical
> > cpu-nice and io-nice knobs very differently?
> > 
> Dunno why, but this talk of knobs makes me think of the "effects-mix"
> knob on my bass amp that controls how much effects-loop signal is
> mixed with the "dry" guitar signal.
> 
> Getting back to kernel talk, we have a "swappiness" knob, right?
> Should there be, or is there already, a way to dynamically vary the
> effect of swappiness [within a range], based on some monitored system
> characteristics such as keyboard/mouse (HID) input or some other
> identifiable profile?  Perhaps this is similar to nice/fairness logic
> in the process schedulers.

This kind of thing is exactly what has been avoided by using
interactivity boosts, and taking that into account in an "io-nice" value
as well should solve that. Other profiles might be interesting though.

Interactive tasks have a tendency to be interactive for a short while,
and then be noninteractive for a long time. I'm thinking that it might
be worthwhile to do something with that, i.e. to keep a bonus for "past
interactivity" on some pages based on the fact that they were originally
loaded by a still-existing process that was once/is marked as
interactive.

--Bart

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  9:33                 ` Arjan van de Ven
  2004-04-30 11:33                   ` Denis Vlasenko
@ 2004-04-30 16:19                   ` Timothy Miller
  1 sibling, 0 replies; 128+ messages in thread
From: Timothy Miller @ 2004-04-30 16:19 UTC (permalink / raw)
  To: arjanv
  Cc: vda, Tim Connors, Nick Piggin, Horst von Brand, Jeff Garzik,
	Andrew Morton, brettspamacct, Linux Kernel Mailing List



Arjan van de Ven wrote:
>>Multimedia content (jpegs etc) is typically cached in
>>filesystem, so Mozilla polluted pagecache with it when
>>it saved JPEGs to the cache *and* then it keeps 'em in RAM
>>too, which doubles RAM usage. 
> 
> 
> well if mozilla just mmap's the jpegs there is no double caching .....
> 


What is cached in memory?  The original JPEG or the decoded raw image?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30 12:31                     ` Bart Samwel
  2004-04-30 15:35                       ` Clay Haapala
@ 2004-04-30 22:11                       ` Paul Jackson
  1 sibling, 0 replies; 128+ messages in thread
From: Paul Jackson @ 2004-04-30 22:11 UTC (permalink / raw)
  To: bart
  Cc: miller, vonbrand, nickpiggin, jgarzik, akpm, brettspamacct,
	linux-kernel

> Thought experiment: what would happen when you set the hypothetical
> cpu-nice and io-nice knobs very differently?

If there was one, single implementation hook in the kernel where making
some decision depend on a user setting cleanly adapted both i/o and cpu
priority, then yes, your thought experiment would recommend that this
one hook was sufficient, and lead to a single user knob to control it.

But in this case, there are obviously two implementation hooks - the
classic one in the scheduler that affects cpu usage, and another off in
some i/o code that affects i/o usage.

So then the question comes - do we have one knob over this that is
ganged to both hooks, or do we have two knobs, one per hook.

Ganging these two hooks together, to control them in synchrony to a
single user setting, is a policy choice.  It's saying that we don't
think you will ever want to run them out of sync, so as a matter of
policy, we are ganging them together.

I prefer to avoid nonessential policy in the kernel.  Best to simply
expose each independent kernel facility, 1-to-1, to the user.  Let
them decide when and if these two settings should be ganged.

I find gratuitous (not needed for system reliability) policy in the
kernel to be a greater negative than another system call.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-30  4:08                 ` Nick Piggin
@ 2004-04-30 22:31                   ` Marc Singer
  0 siblings, 0 replies; 128+ messages in thread
From: Marc Singer @ 2004-04-30 22:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeff Garzik, Andrew Morton, brettspamacct, linux-kernel,
	Russell King

On Fri, Apr 30, 2004 at 02:08:03PM +1000, Nick Piggin wrote:
> You would probably be better off trying a simpler change
> first actually:
> 
> in mm/vmscan.c, shrink_list(), change:
> 
> if (res == WRITEPAGE_ACTIVATE) {
> 	ClearPageReclaim(page);
> 	goto activate_locked;
> }
> 
> to
> 
> if (res == WRITEPAGE_ACTIVATE) {
> 	ClearPageReclaim(page);
> 	goto keep_locked;
> }
> 
> I think it is not the correct solution, but should narrow
> down your problem. Let us know how it goes.

OK, thanks.  It might be a few days before I can get to this.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-04-29 21:19           ` Andrew Morton
  2004-04-29 21:34             ` Paul Jackson
@ 2004-05-06 13:08             ` Pavel Machek
  2004-05-07 15:53               ` Hugh Dickins
  1 sibling, 1 reply; 128+ messages in thread
From: Pavel Machek @ 2004-05-06 13:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Jackson, vonbrand, nickpiggin, jgarzik, brettspamacct,
	linux-kernel

Hi!

> > > How on earth is the kernel supposed to know that for this one particular
> > > job you don't care if it takes 3 hours instead of 10 minutes,
> > 
> > I'd pay ten bucks (yeah, I'm a cheapskate) for an option that I could
> > twiddle that would mark my nightly updatedb and backup jobs as ones to
> > use reduced memory footprint (both for file caching and backing user
> > virtual address space), even if it took much longer.
> > 
> > So, rather than protest in mock outrage that it's impossible for the
> > kernel to know this, instead answer the question as stated in all
> > seriousness ... well ... how _could_ the kernel know, and what _could_
> > the kernel do if it knew.  What mechanism(s) would be needed so that
> > the kernel could restrict a jobs memory usage?
> 
> Two things:
> 
> a) a knob to say "only reclaim pagecache".  We have that now.
> 
> b) a knob to say "reclaim vfs caches harder".  That's simply a matter of boosting
>    the return value from shrink_dcache_memory() and perhaps shrink_icache_memory().
> 
> It's not quite what you're after, but it's close.

Perhaps what we really want is "swap_back_in" script? That way you
could do "updatedb; swap_back_in" in cron and be happy.

								Pavel
-- 
When do you have heart between your knees?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-05-06 13:08             ` Pavel Machek
@ 2004-05-07 15:53               ` Hugh Dickins
  2004-05-07 16:57                 ` Pavel Machek
  0 siblings, 1 reply; 128+ messages in thread
From: Hugh Dickins @ 2004-05-07 15:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, Paul Jackson, vonbrand, nickpiggin, jgarzik,
	brettspamacct, linux-kernel

On Thu, 6 May 2004, Pavel Machek wrote:
> 
> Perhaps what we really want is "swap_back_in" script? That way you
> could do "updatedb; swap_back_in" in cron and be happy.

swapoff -a; swapon -a

Hugh


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-05-07 15:53               ` Hugh Dickins
@ 2004-05-07 16:57                 ` Pavel Machek
  2004-05-07 17:30                   ` Timothy Miller
  2004-05-12 17:52                   ` Rob Landley
  0 siblings, 2 replies; 128+ messages in thread
From: Pavel Machek @ 2004-05-07 16:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Paul Jackson, vonbrand, nickpiggin, jgarzik,
	brettspamacct, linux-kernel

Hi!

> > Perhaps what we really want is "swap_back_in" script? That way you
> > could do "updatedb; swap_back_in" in cron and be happy.
> 
> swapoff -a; swapon -a

Good point... it will not bring back executable pages, though.

								Pavel
-- 
Horseback riding is like software...
...vgf orggre jura vgf serr.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-05-07 16:57                 ` Pavel Machek
@ 2004-05-07 17:30                   ` Timothy Miller
  2004-05-07 17:43                     ` Hugh Dickins
  2004-05-07 17:48                     ` Mark Frazer
  2004-05-12 17:52                   ` Rob Landley
  1 sibling, 2 replies; 128+ messages in thread
From: Timothy Miller @ 2004-05-07 17:30 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Hugh Dickins, Andrew Morton, Paul Jackson, vonbrand, nickpiggin,
	jgarzik, brettspamacct, linux-kernel



Pavel Machek wrote:
> Hi!
> 
> 
>>>Perhaps what we really want is "swap_back_in" script? That way you
>>>could do "updatedb; swap_back_in" in cron and be happy.
>>
>>swapoff -a; swapon -a
> 
> 
> Good point... it will not bring back executable pages, though.
> 
> 								Pavel

Wouldn't this also be a problem if you are using more memory than you 
have physical RAM?


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-05-07 17:30                   ` Timothy Miller
@ 2004-05-07 17:43                     ` Hugh Dickins
  2004-05-07 17:48                     ` Mark Frazer
  1 sibling, 0 replies; 128+ messages in thread
From: Hugh Dickins @ 2004-05-07 17:43 UTC (permalink / raw)
  To: Timothy Miller
  Cc: Pavel Machek, Andrew Morton, Paul Jackson, vonbrand, nickpiggin,
	jgarzik, brettspamacct, linux-kernel

On Fri, 7 May 2004, Timothy Miller wrote:
> > 
> >>>Perhaps what we really want is "swap_back_in" script? That way you
> >>>could do "updatedb; swap_back_in" in cron and be happy.
> >>
> >>swapoff -a; swapon -a
> 
> Wouldn't this also be a problem if you are using more memory than you 
> have physical RAM?

On 2.4 it certainly would be a problem (hang with others OOM-killed).

On 2.6 it shouldn't be a problem: the swapoff may fail upfront if
there's way too little memory, or it may get itself OOM-killed if
it runs out on the way, but it ought not to upset other tasks.

But of course, Pavel is right that it does nothing for file backed.

Hugh


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-05-07 17:30                   ` Timothy Miller
  2004-05-07 17:43                     ` Hugh Dickins
@ 2004-05-07 17:48                     ` Mark Frazer
  1 sibling, 0 replies; 128+ messages in thread
From: Mark Frazer @ 2004-05-07 17:48 UTC (permalink / raw)
  To: Timothy Miller
  Cc: Pavel Machek, Hugh Dickins, Andrew Morton, Paul Jackson, vonbrand,
	nickpiggin, jgarzik, brettspamacct, linux-kernel

Timothy Miller <miller@techsource.com> [04/05/07 13:26]:
> >>>Perhaps what we really want is "swap_back_in" script? That way you
> >>>could do "updatedb; swap_back_in" in cron and be happy.
> >>
> >>swapoff -a; swapon -a
> >
> >
> >Good point... it will not bring back executable pages, though.
> >
> >								Pavel
> 
> Wouldn't this also be a problem if you are using more memory than you 
> have physical RAM?

#!/bin/bash
# Swap in use, from the "Swap: total used free" line: total - free.
swapused=$(( $(sed -n -e 's/ \+/-/g' -e '/^Swap:/p' /proc/meminfo | cut -d'-' -f2,4) ))
# Reclaimable page cache: buffers + cached from the "Mem:" line.
bufsused=$(( $(sed -n -e 's/ \+/+/g' -e '/^Mem:/p' /proc/meminfo | cut -d'+' -f6,7) ))

# Only cycle swap when buffers+cache exceed swap used by at least 10%,
# i.e. when there is (probably) room to pull everything back in.
if [ $bufsused -gt $(( 11 * swapused / 10 )) ]
then	swapoff -a; swapon -a
fi

or something like that
-- 
How can I live my life if I can't tell good from evil? - Fry

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-05-07 16:57                 ` Pavel Machek
  2004-05-07 17:30                   ` Timothy Miller
@ 2004-05-12 17:52                   ` Rob Landley
  2004-05-17 20:16                     ` Hugh Dickins
  1 sibling, 1 reply; 128+ messages in thread
From: Rob Landley @ 2004-05-12 17:52 UTC (permalink / raw)
  To: Pavel Machek, Hugh Dickins
  Cc: Andrew Morton, Paul Jackson, vonbrand, nickpiggin, jgarzik,
	brettspamacct, linux-kernel

On Friday 07 May 2004 11:57, Pavel Machek wrote:
> Hi!
>
> > > Perhaps what we really want is "swap_back_in" script? That way you
> > > could do "updatedb; swap_back_in" in cron and be happy.
> >
> > swapoff -a; swapon -a
>
> Good point... it will not bring back executable pages, though.
>
> 								Pavel

What would the above do if there wasn't enough memory to swap everything back 
in?  (Presumably, the swapoff would fail?)

Rob

-- 
www.linucon.org: Linux Expo and Science Fiction Convention
October 8-10, 2004 in Austin Texas.  (I'm the con chair.)



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: ~500 megs cached yet 2.6.5 goes into swap hell
  2004-05-12 17:52                   ` Rob Landley
@ 2004-05-17 20:16                     ` Hugh Dickins
  0 siblings, 0 replies; 128+ messages in thread
From: Hugh Dickins @ 2004-05-17 20:16 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, Andrew Morton, Paul Jackson, vonbrand, nickpiggin,
	jgarzik, brettspamacct, linux-kernel

On Wed, 12 May 2004, Rob Landley wrote:
> On Friday 07 May 2004 11:57, Pavel Machek wrote:
> > Hi!
> >
> > > > Perhaps what we really want is "swap_back_in" script? That way you
> > > > could do "updatedb; swap_back_in" in cron and be happy.
> > >
> > > swapoff -a; swapon -a
> >
> > Good point... it will not bring back executable pages, though.
> >
> > 								Pavel
> 
> What would the above do if there wasn't enough memory to swap everything back 
> in?  (Presumably, the swapoff would fail?)

Repeating my earlier reply to a similar question...

On 2.4 it certainly would be a problem (hang with others OOM-killed).

On 2.6 it shouldn't be a problem: the swapoff may fail upfront if
there's way too little memory, or it may get itself OOM-killed if
it runs out on the way, but it ought not to upset other tasks.

But of course, Pavel is right that it does nothing for file backed.

Hugh


^ permalink raw reply	[flat|nested] 128+ messages in thread

end of thread, other threads:[~2004-05-17 20:16 UTC | newest]

Thread overview: 128+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-04-28 21:27 ~500 megs cached yet 2.6.5 goes into swap hell Brett E.
2004-04-29  0:01 ` Andrew Morton
2004-04-29  0:10   ` Jeff Garzik
2004-04-29  0:21     ` Nick Piggin
2004-04-29  0:50       ` Wakko Warner
2004-04-29  0:53         ` Jeff Garzik
2004-04-29  0:54         ` Nick Piggin
2004-04-29  1:51           ` Tim Connors
2004-04-29 21:45         ` Denis Vlasenko
2004-04-29  0:58       ` Marc Singer
2004-04-29  3:48         ` Nick Piggin
2004-04-29  4:20           ` Marc Singer
2004-04-29  4:26             ` Nick Piggin
2004-04-29 14:49               ` Marc Singer
2004-04-30  4:08                 ` Nick Piggin
2004-04-30 22:31                   ` Marc Singer
2004-04-29  6:38             ` William Lee Irwin III
2004-04-29  7:36             ` Russell King
2004-04-29 10:44               ` Nick Piggin
2004-04-29 11:04                 ` Russell King
2004-04-29 14:52                   ` Marc Singer
2004-04-29 20:01       ` Horst von Brand
2004-04-29 20:18         ` Martin J. Bligh
2004-04-29 20:33         ` David B. Stevens
2004-04-29 22:42           ` Steve Youngs
2004-04-29 20:36         ` Paul Jackson
2004-04-29 21:19           ` Andrew Morton
2004-04-29 21:34             ` Paul Jackson
2004-04-29 21:57               ` Andrew Morton
2004-04-29 22:18                 ` Paul Jackson
2004-04-30  0:04                 ` Andy Isaacson
2004-04-30  0:32                   ` Andrew Morton
2004-04-30  0:54                     ` Paul Jackson
2004-04-30  5:38                       ` Andy Isaacson
2004-04-30  6:00                         ` Nick Piggin
2004-04-30  7:52                     ` Jeff Garzik
2004-04-30  8:02                       ` Andrew Morton
2004-04-30  8:09                         ` Jeff Garzik
2004-05-06 13:08             ` Pavel Machek
2004-05-07 15:53               ` Hugh Dickins
2004-05-07 16:57                 ` Pavel Machek
2004-05-07 17:30                   ` Timothy Miller
2004-05-07 17:43                     ` Hugh Dickins
2004-05-07 17:48                     ` Mark Frazer
2004-05-12 17:52                   ` Rob Landley
2004-05-17 20:16                     ` Hugh Dickins
2004-04-29 21:38           ` Timothy Miller
2004-04-29 21:47             ` Paul Jackson
2004-04-29 22:18               ` Timothy Miller
2004-04-29 22:46                 ` Paul Jackson
2004-04-29 23:08                   ` Timothy Miller
2004-04-30 12:31                     ` Bart Samwel
2004-04-30 15:35                       ` Clay Haapala
2004-04-30 15:44                         ` Bart Samwel
2004-04-30 22:11                       ` Paul Jackson
2004-04-30  3:37                 ` Tim Connors
2004-04-30  5:15         ` Nick Piggin
2004-04-30  6:20         ` Tim Connors
2004-04-30  6:34           ` Nick Piggin
2004-04-30  7:05             ` Tim Connors
2004-04-30  7:15               ` Nick Piggin
2004-04-30  9:18               ` Re[2]: " vda
2004-04-30  9:33                 ` Arjan van de Ven
2004-04-30 11:33                   ` Denis Vlasenko
2004-04-30 16:19                   ` Timothy Miller
2004-04-29  0:49     ` Brett E.
2004-04-29  1:00       ` Andrew Morton
2004-04-29  1:24         ` Jeff Garzik
2004-04-29  1:40           ` Andrew Morton
2004-04-29  1:47             ` Rik van Riel
2004-04-29 18:14               ` Adam Kropelin
2004-04-30  3:17                 ` Tim Connors
2004-04-29  2:19             ` Tim Connors
2004-04-29 16:24             ` Martin J. Bligh
2004-04-29 16:36               ` Chris Friesen
2004-04-29 16:56                 ` Martin J. Bligh
2004-04-29  1:30         ` Paul Mackerras
2004-04-29  1:31           ` Paul Mackerras
2004-04-29  1:53           ` Andrew Morton
2004-04-29  2:40             ` Andrew Morton
2004-04-29  2:58               ` Paul Mackerras
2004-04-29  3:09                 ` Andrew Morton
2004-04-29  3:14                 ` William Lee Irwin III
2004-04-29  6:12                 ` Benjamin Herrenschmidt
2004-04-29  6:22                   ` Andrew Morton
2004-04-29  6:25                     ` Benjamin Herrenschmidt
2004-04-29  6:31                   ` William Lee Irwin III
2004-04-29 16:50               ` Martin J. Bligh
2004-04-29  3:57             ` Nick Piggin
2004-04-29 14:29               ` Rik van Riel
2004-04-30  3:00                 ` Nick Piggin
2004-04-30 12:50                   ` Rik van Riel
2004-04-30 13:07                     ` Nick Piggin
2004-04-30 13:18                     ` Nikita Danilov
2004-04-30 13:39                       ` Nick Piggin
2004-04-29  1:46         ` Rik van Riel
2004-04-29  1:57           ` Andrew Morton
2004-04-29  2:29             ` Marc Singer
2004-04-29  2:35               ` Andrew Morton
2004-04-29  3:10                 ` Marc Singer
2004-04-29  3:19                   ` Andrew Morton
2004-04-29  4:13                     ` Marc Singer
2004-04-29  4:33                       ` Andrew Morton
2004-04-29 14:45                         ` Marc Singer
2004-04-29 16:51                     ` Andy Isaacson
2004-04-29 20:42                       ` Andrew Morton
2004-04-29 22:27                         ` Andy Isaacson
2004-04-29 23:19                           ` Andrew Morton
2004-04-30  0:14                       ` Lincoln Dale
2004-04-29  8:02                   ` Wichert Akkerman
2004-04-29 14:25                     ` Marcelo Tosatti
2004-04-29 14:27                       ` Wichert Akkerman
2004-04-29  2:41             ` Rik van Riel
2004-04-29  2:43               ` Andrew Morton
2004-04-29  1:41       ` Tim Connors
2004-04-29  9:43       ` Helge Hafting
2004-04-29 14:48         ` Marc Singer
2004-04-29  0:44   ` Brett E.
2004-04-29  1:13     ` Andrew Morton
2004-04-29  1:29       ` Brett E.
2004-04-29 18:05         ` Brett E.
2004-04-29 18:32           ` William Lee Irwin III
2004-04-29 20:47             ` Brett E.
2004-04-29  0:04 ` Brett E.
2004-04-29  0:13   ` Jeff Garzik
2004-04-29  0:43     ` Nick Piggin
2004-04-29 13:51   ` Horst von Brand
2004-04-29 18:32     ` Brett E.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox