public inbox for linux-kernel@vger.kernel.org
* VM: qsbench
@ 2001-10-31 12:12 Lorenzo Allegrucci
  2001-10-31 12:23 ` Jeff Garzik
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Lorenzo Allegrucci @ 2001-10-31 12:12 UTC (permalink / raw)
  To: linux-kernel


Three runs for each kernel, kswapd CPU time appended.

Linux-2.4.13-ac4:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.800u 3.470s 3:04.15 40.3%    0+0k 0+0io 13916pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.530u 3.930s 3:13.90 38.9%    0+0k 0+0io 14101pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.260u 3.640s 3:03.54 40.8%    0+0k 0+0io 13047pf+0w
0:08 kswapd

Linux-2.4.13:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.260u 2.150s 2:20.68 52.1%    0+0k 0+0io 20173pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.020u 2.050s 2:18.78 52.6%    0+0k 0+0io 20353pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.810u 2.080s 2:19.50 52.2%    0+0k 0+0io 20413pf+0w
0:06 kswapd

Linux-2.4.14-pre3:
N/A, this kernel cannot run qsbench. Livelock.

Linux-2.4.14-pre4:
Not tested.

Linux-2.4.14-pre5:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.340u 3.450s 2:13.62 55.2%    0+0k 0+0io 16829pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.590u 2.940s 2:15.48 54.2%    0+0k 0+0io 17182pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.140u 3.480s 2:14.66 54.6%    0+0k 0+0io 17122pf+0w
0:01 kswapd

kswapd CPU time is a record ;)


Linux-2.4.14-pre6:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
Out of Memory: Killed process 224 (qsbench).
69.890u 3.430s 2:12.48 55.3%    0+0k 0+0io 16374pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
Out of Memory: Killed process 226 (qsbench).
69.550u 2.990s 2:11.31 55.2%    0+0k 0+0io 15374pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
Out of Memory: Killed process 228 (qsbench).
69.480u 3.100s 2:13.33 54.4%    0+0k 0+0io 15950pf+0w
0:01 kswapd

This is interesting, -pre6 killed qsbench _just_ before qsbench exited.
Unreliable results.

Linux-2.4.14-pre3aa1:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
72.180u 2.200s 2:19.59 53.2%    0+0k 0+0io 19568pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.510u 2.230s 2:18.74 53.1%    0+0k 0+0io 19585pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.500u 2.510s 2:19.29 53.1%    0+0k 0+0io 19606pf+0w
0:04 kswapd

Linux-2.4.14-pre3aa2:
Not tested.

Linux-2.4.14-pre3aa3:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.790u 2.280s 2:17.57 53.8%    0+0k 0+0io 19138pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.190u 2.040s 2:16.95 53.4%    0+0k 0+0io 19306pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
72.000u 2.120s 2:16.80 54.1%    0+0k 0+0io 19231pf+0w
0:03 kswapd

Linux-2.4.14-pre3aa4:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.270u 2.210s 2:16.43 53.8%    0+0k 0+0io 19067pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.110u 2.180s 2:16.52 53.6%    0+0k 0+0io 19095pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.320u 2.290s 2:16.32 53.9%    0+0k 0+0io 19162pf+0w
0:03 kswapd

Linux-2.4.14-pre5aa1:
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.580u 2.430s 2:16.36 53.5%    0+0k 0+0io 19024pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.070u 2.180s 2:15.97 53.8%    0+0k 0+0io 19110pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.280u 2.160s 2:16.61 53.7%    0+0k 0+0io 19185pf+0w
0:03 kswapd



-- 
Lorenzo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: VM: qsbench
  2001-10-31 12:12 VM: qsbench Lorenzo Allegrucci
@ 2001-10-31 12:23 ` Jeff Garzik
  2001-10-31 15:00 ` new OOM heuristic failure (was: Re: VM: qsbench) Rik van Riel
  2001-10-31 17:55 ` VM: qsbench Lorenzo Allegrucci
  2 siblings, 0 replies; 23+ messages in thread
From: Jeff Garzik @ 2001-10-31 12:23 UTC (permalink / raw)
  To: Lorenzo Allegrucci; +Cc: linux-kernel

Lorenzo Allegrucci wrote:
> Linux-2.4.14-pre6:
> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> Out of Memory: Killed process 224 (qsbench).
> 69.890u 3.430s 2:12.48 55.3%    0+0k 0+0io 16374pf+0w
> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> Out of Memory: Killed process 226 (qsbench).
> 69.550u 2.990s 2:11.31 55.2%    0+0k 0+0io 15374pf+0w
> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> Out of Memory: Killed process 228 (qsbench).
> 69.480u 3.100s 2:13.33 54.4%    0+0k 0+0io 15950pf+0w
> 0:01 kswapd
> 
> This is interesting, -pre6 killed qsbench _just_ before qsbench exited.
> Unreliable results.

Can you give us some idea of the memory usage of this application?  Your
amount of RAM and swap?

	Jeff


-- 
Jeff Garzik      | Only so many songs can be sung
Building 1024    | with two lips, two lungs, and one tongue.
MandrakeSoft     |         - nomeansno



* new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 12:12 VM: qsbench Lorenzo Allegrucci
  2001-10-31 12:23 ` Jeff Garzik
@ 2001-10-31 15:00 ` Rik van Riel
  2001-10-31 15:52   ` Linus Torvalds
  2001-10-31 17:55   ` Lorenzo Allegrucci
  2001-10-31 17:55 ` VM: qsbench Lorenzo Allegrucci
  2 siblings, 2 replies; 23+ messages in thread
From: Rik van Riel @ 2001-10-31 15:00 UTC (permalink / raw)
  To: Lorenzo Allegrucci; +Cc: linux-kernel, Linus Torvalds

On Wed, 31 Oct 2001, Lorenzo Allegrucci wrote:

Linus, it seems Lorenzo's test program gets killed due
to the new out_of_memory() heuristic ...

> Linux-2.4.14-pre6:
> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> Out of Memory: Killed process 224 (qsbench).
> 69.890u 3.430s 2:12.48 55.3%    0+0k 0+0io 16374pf+0w
> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> Out of Memory: Killed process 226 (qsbench).
> 69.550u 2.990s 2:11.31 55.2%    0+0k 0+0io 15374pf+0w
> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> Out of Memory: Killed process 228 (qsbench).
> 69.480u 3.100s 2:13.33 54.4%    0+0k 0+0io 15950pf+0w
> 0:01 kswapd
>
> This is interesting, -pre6 killed qsbench _just_ before qsbench exited.
> Unreliable results.

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/



* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 15:00 ` new OOM heuristic failure (was: Re: VM: qsbench) Rik van Riel
@ 2001-10-31 15:52   ` Linus Torvalds
  2001-10-31 16:04     ` Rik van Riel
  2001-10-31 17:55   ` Lorenzo Allegrucci
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2001-10-31 15:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Lorenzo Allegrucci, linux-kernel


On Wed, 31 Oct 2001, Rik van Riel wrote:
>
> Linus, it seems Lorenzo's test program gets killed due
> to the new out_of_memory() heuristic ...

Hmm.. The oom killer really only gets invoked if we're really down to zero
swapspace (that's the _only_ non-rate-based heuristic in the whole thing).

Lorenzo, can you do a "vmstat 1" and show the output of it during the
interesting part of the test (ie around the kill)?

I could probably argue that the machine really _is_ out of memory at this
point: no swap, and it obviously has to work very hard to free any pages.
Read the "out_of_memory()" code (which is _really_ simple), with the
realization that it only gets called when "try_to_free_pages()" fails and
I think you'll agree.

That said, it may be "try_to_free_pages()" itself that just gives up way
too easily - it simply didn't matter before, because all callers just
looped around and asked for more memory if it failed. So the code could
still trigger too easily not because the oom() logic itself is all that
bad, but simply because it makes the assumption that try_to_free_pages()
only fails in bad situations.

		Linus



* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 15:52   ` Linus Torvalds
@ 2001-10-31 16:04     ` Rik van Riel
  2001-10-31 17:42       ` Stephan von Krawczynski
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2001-10-31 16:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Lorenzo Allegrucci, linux-kernel

On Wed, 31 Oct 2001, Linus Torvalds wrote:

> I could probably argue that the machine really _is_ out of memory at this
> point: no swap, and it obviously has to work very hard to free any pages.
> Read the "out_of_memory()" code (which is _really_ simple), with the
> realization that it only gets called when "try_to_free_pages()" fails and
> I think you'll agree.

Absolutely agreed, an earlier out_of_memory() is probably a good
thing for most systems.   The only "but" is that Lorenzo's test
program runs fine with other kernels, but you could argue that
it's a corner case anyway...

Rik
-- 
DMCA, SSSCA, W3C?  Who cares?  http://thefreeworld.net/

http://www.surriel.com/		http://distro.conectiva.com/



* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 16:04     ` Rik van Riel
@ 2001-10-31 17:42       ` Stephan von Krawczynski
  2001-10-31 18:22         ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Stephan von Krawczynski @ 2001-10-31 17:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: torvalds, lenstra, linux-kernel

On Wed, 31 Oct 2001 14:04:45 -0200 (BRST) Rik van Riel <riel@conectiva.com.br>
wrote:

> On Wed, 31 Oct 2001, Linus Torvalds wrote:
> 
> > I could probably argue that the machine really _is_ out of memory at this
> > point: no swap, and it obviously has to work very hard to free any pages.
> > Read the "out_of_memory()" code (which is _really_ simple), with the
> > realization that it only gets called when "try_to_free_pages()" fails and
> > I think you'll agree.
> 
> Absolutely agreed, an earlier out_of_memory() is probably a good
> thing for most systems.   The only "but" is that Lorenzo's test
> program runs fine with other kernels, but you could argue that
> it's a corner case anyway...

I took a deep look into this code and wonder how this benchmark manages to get
killed. If I read it right, this would imply that shrink_cache has run a
hundred times through the _complete_ inactive_list without finding any
freeable pages, with one exception that I came across:

        int max_mapped = nr_pages*10;
...
page_mapped:
                        if (--max_mapped >= 0)
                                continue;

                        /*
                         * Alert! We've found too many mapped pages on the
                         * inactive list, so we start swapping out now!
                         */
                        spin_unlock(&pagemap_lru_lock);
                        swap_out(priority, gfp_mask, classzone);
                        return nr_pages;

Is it possible that this exits shrink_cache too early?
I don't know how much memory Lorenzo has, but even a single pass through
several hundred MB of inactive list takes a notable time on my system;
a hundred passes could take far more than 70 s. But without a complete
run, you cannot claim to really be OOM.
Does it make sense to stop shrink_cache after having detected only
4k * 32 * 10 = 1280 k of mapped memory on an inactive list of possibly
several hundred MB in size?

Regards,
Stephan



* Re: VM: qsbench
  2001-10-31 12:12 VM: qsbench Lorenzo Allegrucci
  2001-10-31 12:23 ` Jeff Garzik
  2001-10-31 15:00 ` new OOM heuristic failure (was: Re: VM: qsbench) Rik van Riel
@ 2001-10-31 17:55 ` Lorenzo Allegrucci
  2 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Allegrucci @ 2001-10-31 17:55 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

At 07.23 31/10/01 -0500, Jeff Garzik wrote:
>Lorenzo Allegrucci wrote:
>> Linux-2.4.14-pre6:
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> Out of Memory: Killed process 224 (qsbench).
>> 69.890u 3.430s 2:12.48 55.3%    0+0k 0+0io 16374pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> Out of Memory: Killed process 226 (qsbench).
>> 69.550u 2.990s 2:11.31 55.2%    0+0k 0+0io 15374pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> Out of Memory: Killed process 228 (qsbench).
>> 69.480u 3.100s 2:13.33 54.4%    0+0k 0+0io 15950pf+0w
>> 0:01 kswapd
>> 
>> This is interesting, -pre6 killed qsbench _just_ before qsbench exited.
>> Unreliable results.
>
>Can you give us some idea of the memory usage of this application?  Your
>amount of RAM and swap?

256M of RAM + 200M of swap, qsbench allocates about 343M.


-- 
Lorenzo


* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 15:00 ` new OOM heuristic failure (was: Re: VM: qsbench) Rik van Riel
  2001-10-31 15:52   ` Linus Torvalds
@ 2001-10-31 17:55   ` Lorenzo Allegrucci
  2001-10-31 18:06     ` Linus Torvalds
                       ` (3 more replies)
  1 sibling, 4 replies; 23+ messages in thread
From: Lorenzo Allegrucci @ 2001-10-31 17:55 UTC (permalink / raw)
  To: Linus Torvalds, Rik van Riel; +Cc: linux-kernel

At 07.52 31/10/01 -0800, Linus Torvalds wrote:
>
>On Wed, 31 Oct 2001, Rik van Riel wrote:
>>
>> Linus, it seems Lorenzo's test program gets killed due
>> to the new out_of_memory() heuristic ...
>
>Hmm.. The oom killer really only gets invoced if we're really down to zero
>swapspace (that's the _only_ non-rate-based heuristic in the whole thing).
>
>Lorenzo, can you do a "vmstat 1" and show the output of it during the
>interesting part of the test (ie around the kill).

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0 139908   3588     68   3588   0   0     0     0  101     3 100   0   0
 1  0  0 139908   3588     68   3588   0   0     0     0  101     7 100   0   0
 1  0  0 139908   3588     68   3588   0   0     0     0  101     5 100   0   0
 0  1  0 139336   1568     68   3588 2524 696  2524   696  192   180  57   2  41
 1  0  0 140296   2996     68   3588 3776 4208  3776  4208  287   304   3   3  94
 1  0  0 139968   2708     68   3588 288   0   288     0  110    21  96   0   4
 1  0  0 139968   2708     68   3588   0   0     0     0  101     5 100   0   0
 1  0  0 139968   2708     68   3588   0   0     0     0  101     5 100   0   0
 1  0  0 139968   2708     68   3588   0   0     0     0  101     5  99   1   0
 1  0  0 139968   2708     68   3588   0   0     0     0  101     3 100   0   0
 1  0  0 139968   2708     68   3588   0   0     0     0  101     3 100   0   0
 1  0  0 139968   2708     68   3588   0   0     0    12  104     9 100   0   0
 0  1  0 144064   1620     64   3588 7256 6880  7256  6880  395   517  28   5  67
 1  0  0 146168   2952     60   3584 5780 6720  5780  6720  396   401   0   8  92
 0  1  0 151672   3580     64   3584 12744 10076 12748 10076  579   870   3   7  90
 0  1  0 165496   1620     64   3388 14684 4108 14684  4108  629  1131  11   6  83
 1  0  0 177912   1592     64   1624 4544 14196  4544 14200  377   355   5   2  93
 0  1  0 182392   1548     60   1624 14648 8064 14648  8064  633   935  11  11  78
 0  1  1 195320   2692     64   1624 14156 9600 14160  9600  605   943   3   8  89
 1  0  0 195512   3516     64    400 5312 8376  5312  8376  378   374   2   8  90
 1  0  1 195512   1664     64    400 22256   0 22256     0  797  1419  18   8  74
 1  0  0 195512   1544     60    400 23520   0 23520     4  837  1540  13   7  80
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0 195512   1660     60    400 23292   0 23292     0  832  1546  10  10  80
 0  0  0   5384 250420     76    784 2212  24  2672    24  201   208   1   7  92
 0  0  0   5384 250420     76    784   0   0     0     0  101     3   0   0 100
 0  0  0   5384 250416     76    788   0   0     0     0  101     3   0   0 100
 0  0  0   5384 250400     92    788   0   0    16     0  105    15   0   0 100
 0  0  0   5384 250400     92    788   0   0     0     0  101     3   0   0 100
 0  0  0   5384 250400     92    788   0   0     0     0  101     7   0   0 100

Until swpd is "139968" everything is fine and I have about 60M of
free swap (I have 256M RAM + 200M of swap and qsbench uses about 343M).
From that point Linux starts swapping without any apparent reason (?)
because qsbench allocates its memory just once at the beginning.
I guess Linux starts swapping when qsbench scans sequentially the
whole array to check for errors after sorting, in the final stage.
I wonder why..

Linux-2.4.13:
 1  0  0 109864   3820     64    396   0   0     0     0  101     3 100   0   0
 1  0  0 109864   3816     68    396   0   0     4     0  107    23 100   0   0
 1  0  0 109864   3816     68    396   0   0     0     0  101     5  98   2   0
 1  0  0 109864   3816     68    396   0   0     0     0  101     3 100   0   0
 1  0  0 109864   3816     68    396   0   0     0     0  101     3 100   0   0
 1  0  0 109864   3816     68    396   0   0     0     0  102     5 100   0   0
 0  1  0 112156   3224     64    508 2676 2048  2888  2052  235   239  68   1  31
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  1  0 121372   3416     64    508 8896 9216  8896  9216  519   686   4   5  91
 1  0  0 130460   3340     64    508 9420 9216  9420  9216  559   737   5   2  93
 0  1  0 139932   3168     64    508 9644 9472  9644  9472  547   717   5   4  91
 0  1  1 149532   3488     64    508 9356 9572  9356  9576  550   725   2   5  93
 1  0  1 158308   4008     64    500 8484 8736  8492  8744  502   655   5  15  80
 0  1  0 166244   3724     64    500 8204 8004  8204  8004  452   601   4  13  83
 0  1  0 175716   4092     64    500 9104 9344  9104  9344  525   681   8   4  88
 0  1  0 185188   4076     64    500 9356 9344  9356  9344  545   690   7   8  85
 0  1  1 192100   3624     64    500 7544 7040  7544  7040  444   548   2  13  85
 1  0  0 195512   3972     64    348 11260 3924 11264  3928  521   767   8  23  69
 1  0  0 195512   4184     64    348 16812   0 16812     0  632  1074  14  25  61
 0  1  0 195512   4164     64    364 19828   0 19856     0  722  1251   9  21  70
 1  0  0 195512   3880     64    364 19740   0 19740     0  721  1240  10  16  74
 1  0  0 195512   3752     64    396 20676   0 20736     0  752  1307  13  21  66
 1  0  0 195512   3096     64    372 16260   4 16264     8  617  1040  11  23  66
 1  0  0 195512   3344     68    372 7548   0  7560     0  346   493  51   5  44
 0  0  0   5948 250640     80    800 328   0   768     0  132    64  29   4  67
 0  0  0   5948 250640     80    800   0   0     0     0  101     3   0   0 100
 0  0  0   5948 250640     80    800   0   0     0     0  104    10   0   0 100
 0  0  0   5948 250640     80    800   0   0     0     0  119    44   0   0 100
 0  0  0   5948 250640     80    800   0   0     0     0  104    11   0   0 100
 0  0  0   5948 250640     80    800   0   0     0     0  130    61   0   0 100

Same behaviour.

Linux-2.4.14-pre5:
 1  0  0 142648   3268     80   3784   0   0     0     0  101     5  99   1   0
 1  0  0 142648   3268     80   3784   0   0     0     0  101     9 100   0   0
 1  0  0 142648   3268     80   3784   0   0     0     0  101     3 100   0   0
 1  0  0 142648   3268     80   3784   0   0     0     0  101     3 100   0   0
 1  0  0 142648   3268     80   3784   0   0     0     0  101     3 100   0   0
 1  0  0 142648   3268     80   3784   0   0     0     0  101     5  99   1   0
 0  1  0 143404   3624     80   3784 5380 2108  5380  2116  298   346  61   2  37
 0  1  0 148324   1632     76   3780 9452 7808  9452  7808  480   601   4   7  89
 1  0  0 153572   3412     72   3780 11492 6044 11492  6044  560   737   6   4  90
 1  0  0 165604   1584     72   2860 13952 7972 13952  7972  615   889  10  10  80
 1  1  0 175076   1624     72   1624 5232 13536  5232 13536  390   339   4   6  90
 0  1  0 181604   1540     76   1624 13360 7924 13364  7924  593   852  12   4  84
 1  0  0 194276   2812     76   1624 12696 7704 12696  7704  575   804   8   6  86
 1  0  0 195512   1640     76    556 7624 11412  7624 11412  449   488   4   5  91
 1  0  0 195512   1572     72    496 21768  52 21768    56  784  1367  14   9  77
 1  1  0 195512   1580     72    496 23196   0 23196     0  827  1460  14  10  76
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  1  0 195512   1608     76    496 19208   0 19212     0  704  1220  15   8  77
 1  0  0 195512   1728     76    496 15040   0 15040     4  572   946  48   6  47
 1  0  0 195512   1612     72    496 21664   0 21664     0  782  1363  19  11  70
 0  1  0   5144 250652     84    564 12120   0 12196     0  495   790  30   9  61
 0  0  0   4984 250236     84    748 368   0   552     0  122    44   0   0 100
 0  0  0   4984 250228     92    748   0   0     8     0  106    20   0   0 100
 0  0  0   4984 250228     92    748   0   0     0     0  102     5   0   0 100
 0  0  0   4984 250228     92    748   0   0     0     0  105    12   0   0 100
 0  0  0   4984 250228     92    748   0   0     0     0  102     5   0   0 100
 0  0  0   4984 250228     92    748   0   0     0     0  101     3   0   0 100
 0  0  0   4984 250228     92    748   0   0     0     0  102    11   0   0 100
 0  1  0   4984 250196     92    748  32   0    32     0  112    26   0   0 100

Same behaviour, but 2.4.13 uses less swap space.
Both kernels above seem to fall into OOM conditions, but they don't kill
qsbench.

>I could probably argue that the machine really _is_ out of memory at this
>point: no swap, and it obviously has to work very hard to free any pages.
>Read the "out_of_memory()" code (which is _really_ simple), with the
>realization that it only gets called when "try_to_free_pages()" fails and
>I think you'll agree.
>
>That said, it may be "try_to_free_pages()" itself that just gives up way
>too easily - it simply didn't matter before, because all callers just
>looped around and asked for more memory if it failed. So the code could
>still trigger too easily not because the oom() logic itself is all that
>bad, but simply because it makes the assumption that try_to_free_pages()
>only fails in bad situations.
>
>		Linus


-- 
Lorenzo


* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 17:55   ` Lorenzo Allegrucci
@ 2001-10-31 18:06     ` Linus Torvalds
  2001-10-31 21:31     ` Lorenzo Allegrucci
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2001-10-31 18:06 UTC (permalink / raw)
  To: Lorenzo Allegrucci; +Cc: Rik van Riel, linux-kernel


On Wed, 31 Oct 2001, Lorenzo Allegrucci wrote:
>
> Until swpd is "139968" everything is fine and I have about 60M of
> free swap (I have 256M RAM + 200M of swap and qsbench uses about 343M).

Ok, that's the problem. The swap-free-on-swap-in logic got removed; try
this simple patch, and I bet it ends up working ok for you.

You should see better performance with a bigger swapspace, though. Linux
would prefer to keep the swap cache allocated as long as possible, and not
drop the pages just because swap is smaller than the working set.

(Ie the best setup is not when "RAM + SWAP > working set", but when you
have "SWAP > working set").

Can you re-do the numbers with this one on top of pre6?

Thanks,

		Linus

-----
diff -u --recursive pre6/linux/mm/memory.c linux/mm/memory.c
--- pre6/linux/mm/memory.c	Wed Oct 31 10:04:11 2001
+++ linux/mm/memory.c	Wed Oct 31 10:02:33 2001
@@ -1158,6 +1158,8 @@
 	pte = mk_pte(page, vma->vm_page_prot);

 	swap_free(entry);
+	if (vm_swap_full())
+		remove_exclusive_swap_page(page);

 	flush_page_to_ram(page);
 	flush_icache_page(vma, page);



* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 17:42       ` Stephan von Krawczynski
@ 2001-10-31 18:22         ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2001-10-31 18:22 UTC (permalink / raw)
  To: linux-kernel

In article <20011031184256.6e541e43.skraw@ithnet.com>,
Stephan von Krawczynski  <skraw@ithnet.com> wrote:
>
>I took a deep look into this code and wonder how this benchmark manages to get
>killed. If I read that right this would imply that shrink_cache has run a
>hundred times through the _complete_ inactive_list finding no free-able pages,
>with one exception that I read across:

That's a red herring. The real reason it is killed is that the machine
really _is_ out of memory, but that, in turn, is because the swap space
is totally filled up - with pages we have in memory in the swap cache.

The swap cache is wonderful for many thing, but Linux has historically
had swap as "additional" memory, and the swap cache really really wants
to have backing store for the _whole_ working set, not just for the
pages we have to get rid of.

Thus the two-line patch elsewhere in this thread, which says "ok, if
we're low on swap space, let's start decimating the swap cache entries
for stuff we have in memory". 

		Linus


* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
       [not found] <Pine.LNX.3.96.1011031133645.448B-100000@gollum.norang.ca>
@ 2001-10-31 19:46 ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2001-10-31 19:46 UTC (permalink / raw)
  To: Bernt Hansen; +Cc: Kernel Mailing List, Lorenzo Allegrucci


[ Cc'd to linux-kernel just in case other people are wondering ]

On Wed, 31 Oct 2001, Bernt Hansen wrote:
>
> Do I need to rebuild my systems with my swap partitions >= my physical
> memory size for the 2.4.x kernels?  All of my systems have total swap
> space less than their physical memory size and are running 2.4.13 kernels.

No. With the two-liner patch on linux-kernel, your old setup should work
as-is.

And performance will be fine, _except_ if you regularly actually have your
swap usage up in the 75%+ range. But if you do work that typically puts a
lot of pressure on swap, and you find that you almost always end up using
clearly more than half your swapspace, that implies that you should
consider perhaps reconfiguring so that you have a bigger swap partition.

When I pointed out the performance problems to Lorenzo, I specifically
meant only that one load that he is testing - the fact that the load fills
up the swap device implies that for _that_ load, performance could be
improved by making sure he has enough swap to cover it.

I bet Lorenzo doesn't even come _close_ to 80% full swap under normal
usage, so he probably wouldn't see any performance impact normally. It's
just that when you report VM benchmarks, maybe you want to try to improve
the numbers..

[ It's equally valid to say that Lorenzo's numbers are _especially_
  interesting exactly because they also test the behaviour when we need to
  start pruning the swap cache, though. So I'm in no way trying to
  criticise his benchmark - I think the qsort benchmark is actually one of
  the more valid VM patterns we have ever had as a benchmark, and I
  really like how it mixes random accesses with non-random ones ]

So don't worry.

		Linus



* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 17:55   ` Lorenzo Allegrucci
  2001-10-31 18:06     ` Linus Torvalds
@ 2001-10-31 21:31     ` Lorenzo Allegrucci
  2001-11-02 13:00     ` Stephan von Krawczynski
  2001-11-02 17:36     ` Lorenzo Allegrucci
  3 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Allegrucci @ 2001-10-31 21:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, linux-kernel

At 10.06 31/10/01 -0800, Linus Torvalds wrote:
>
>On Wed, 31 Oct 2001, Lorenzo Allegrucci wrote:
>>
>> Until swpd is "139968" everything is fine and I have about 60M of
>> free swap (I have 256M RAM + 200M of swap and qsbench uses about 343M).
>
>Ok, that's the problem. The swap free on swap-in logic got removed, try
>this simple patch, and I bet it ends up working ok for you
>
>You should see better performance with a bigger swapspace, though. Linux
>would prefer to keep the swap cache allocated as long as possible, and not
>drop the pages just because swap is smaller than the working set.
>
>(Ie the best setup is not when "RAM + SWAP > working set", but when you
>have "SWAP > working set").
>
>Can you re-do the numbers with this one on top of pre6?
>
>Thanks,
>
>		Linus
>
>-----
>diff -u --recursive pre6/linux/mm/memory.c linux/mm/memory.c
>--- pre6/linux/mm/memory.c	Wed Oct 31 10:04:11 2001
>+++ linux/mm/memory.c	Wed Oct 31 10:02:33 2001
>@@ -1158,6 +1158,8 @@
> 	pte = mk_pte(page, vma->vm_page_prot);
>
> 	swap_free(entry);
>+	if (vm_swap_full())
>+		remove_exclusive_swap_page(page);
>
> 	flush_page_to_ram(page);
> 	flush_icache_page(vma, page);

Linus,

your patch seems to help one case out of three.
(even though I don't have any statistically meaningful data)

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
Out of Memory: Killed process 225 (qsbench).
69.500u 3.200s 2:11.23 55.3%    0+0k 0+0io 15297pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
Out of Memory: Killed process 228 (qsbench).
69.720u 3.190s 2:12.23 55.1%    0+0k 0+0io 15561pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.250u 3.470s 2:15.88 54.2%    0+0k 0+0io 17170pf+0w

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0 136320   3644     72    284   0   0     0     0  101     5 100   0   0
 1  0  0 136320   3644     72    284   0   0     0     0  101     3 100   0   0
 1  0  0 136320   3644     72    284   0   0     0     0  101     3 100   0   0
 1  0  0 136320   3644     72    284   0   0     0     0  101     9 100   0   0
 0  1  0 133140   2608     72    284 3344 768  3344   768  215   215  47   2  51
 0  1  0 132276   1608     72    284 3552 6376  3552  6376  280   227   2   3  95
 1  0  0 128648   3240     72    284 768 3656   768  3660  162    54  57   1  42
 1  0  0 128648   3240     72    284   0   0     0     0  101     3 100   0   0
 1  0  0 128648   3240     72    284   0   0     0     4  102     9 100   0   0
 1  0  0 128648   3240     72    284   0   0     0     0  101     3 100   0   0
 1  0  0 128648   3240     72    284   0   0     0     0  101     3 100   0   0
 1  0  0 128648   3240     72    284   0   0     0     0  101     3 100   0   0
 1  0  0 128648   3240     72    284   0   0     0     0  101     3 100   0   0
 1  0  0 129672   3316     68    284 4328 2860  4328  2860  265   282  62   2  36
 1  0  0 137992   1644     68    280 19216 3172 19216  3172  743  1227   7   5  88
 0  1  0 153096   3648     68    280 3072 17788  3072 17788  353   218   2   6  92
 0  1  1 160136   1660     68    280 15240 4740 15240  4740  647   963  16  10  74
 1  0  0 177288   1588     68    280 5868 14220  5868 14220  422   393   0   7  93
 0  1  0 188680   1620     68    280 8144 11904  8144 11904  473   544   4   5  91
 0  1  0 192136   1552     68    280 17136 5860 17136  5860  689  1081   8   9  83
 1  0  0 195512   2948     68    280 7672 9008  7672  9008  476   512   2   8  90
 1  0  0 195512   1556     68    280 21688 356 21688   356  786  1375  11   8  81
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0 195512   1608     68    276 22352   0 22352     0  801  1422  10  17  73
 1  0  0 195512   1588     68    276 22748   0 22748     0  812  1431  14  12  74
 1  0  0 195512   1560     68    276 12768   0 12768     0  502   809  55   4  41
 1  0  0 195512   1552     68    280 23012   0 23012     0  823  1446  11   6  83
 0  1  0   4696 250440     80    632 9048   0  9412     4  409   609  27   7  66
 0  0  0   4564 250284     84    752  32   0   156     0  108    17   0   0 100
 0  0  0   4564 250280     88    752   0   0     4     0  106    18   0   0 100
 0  0  0   4564 250280     88    752   0   0     0     0  101     3   0   0 100
 0  0  0   4564 250280     88    752   0   0     0     0  109    21   0   0 100
 0  0  0   4564 250280     88    752   0   0     0     0  121    44   0   0 100
 0  0  0   4564 250280     88    752   0   0     0     0  101     3   0   0 100

Then I repeated the test with a bigger swap partition (400M):
qsbench's working set is about 343M, so now SWAP > working set.

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.770u 3.630s 2:14.21 55.4%    0+0k 0+0io 16545pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.720u 3.370s 2:16.66 54.2%    0+0k 0+0io 17444pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.050u 3.380s 2:15.05 54.3%    0+0k 0+0io 17045pf+0w

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0 124040   3652     68   3428   0   0     0     0  101     7 100   0   0
 1  0  0 122096   1640     68   3428 2880 1656  2880  1656  208   184  51   1  48
 1  0  0 129972   1896     68   3428 2328 1836  2328  1836  195   292  45   3  52
 1  0  0 130740   2000     68   3428   0 168     0   168  106     6 100   0   0
 1  0  0 131508   2104     68   3428   0 252     0   252  101     8 100   0   0
 1  0  0 132660   2340     68   3428   0 336     0   336  107    10 100   0   0
 1  0  0 133428   2460     68   3428   0 208     0   208  101     6 100   0   0
 1  0  0 134196   2560     68   3428   0 212     0   212  105     6 100   0   0
 1  0  0 134196   2560     68   3428   0   0     0     0  101     5 100   0   0
 0  1  1 138932   1664     68   3428 1856 2052  1856  2052  178   156  83   0  17
 0  1  0 145076   1612     68   3428 6900 9956  6900  9964  451   532   3   5  92
 1  0  0 149044   3648     68   3424 3232 9556  3232  9556  333   259   2   5  93
 1  0  0 154036   1580     64   3424 13816 4736 13816  4736  635   951   6   4  90
 0  1  0 171444   1648     64   2404 14328 6544 14328  6544  620  1155   5  13  82
 0  1  0 182580   1648     64   1584 6180 21916  6180 21912  438   422   1   7  92
 0  1  0 184500   1628     64   1584 13800 3980 13800  3984  602   878  11   5  84
 1  0  0 196532   1624     64   1584 10876 7576 10876  7576  522   707   6   5  89
 0  1  0 210612   1540     64   1584 8992 13760  8992 13760  492   592   5   9  86
 0  1  0 214452   2412     64   1584 12928 10176 12928 10176  593   817  11   4  85
 1  0  0 225460   1632     64   1584 11704 8380 11704  8380  564   766   5   8  87
 1  0  0 230976   1592     64   1224 8012 10008  8012 10008  465   525   2   6  92
 1  0  0 233340   1556     80    288 17748 888 17764   888  674  1136   7  12  81
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0 233340   3524     64    284 20276 2392 20276  2392  771  1315  10   6  84
 1  0  0 233340   3632     64    284 14948   0 14948     0  569   957  44   5  51
 1  0  0 233340   1556     64    284 24448   0 24448     0  865  1575  12   8  80
 0  1  0 240920   1580     68    288 18208 4656 18212  4656  717  1186   7   5  88
 1  0  0 240920   2704     68    288 16672 2924 16672  2928  656  1069  25  11  64
 0  0  0   4536 250948     84    760 4384   0  4872     0  270   340   3   8  89
 0  0  0   4536 250948     84    760   0   0     0     0  101     3   0   0 100
 0  0  0   4536 250948     84    760   0   0     0     0  101     3   0   0 100
 0  0  0   4536 250948     84    760   0   0     0     0  101     7   0   0 100



-- 
Lorenzo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
       [not found] ` <3.0.6.32.20011101214957.01feaa70@pop.tiscalinet.it>
@ 2001-11-01 21:59   ` Lorenzo Allegrucci
  2001-11-01 23:35     ` Stephan von Krawczynski
  0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Allegrucci @ 2001-11-01 21:59 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, Linus Torvalds, Andrea Arcangeli

At 22.08 01/11/01 +0100, you wrote:
>> At 15.44 01/11/01 +0100, Stephan von Krawczynski wrote:
>> >On Wed, 31 Oct 2001 22:31:40 +0100 Lorenzo Allegrucci
>> ><lenstra@tiscalinet.it> wrote:
>> >
>> >> Linus,
>> >>
>> >> your patch seems to help one case out of three.
>> >> (even though I have not any meaningful statistical data)
>> >
>> >Hm, I will not say that I expected that :-), he knows by far more
>> >than me. But can you try my patch below in addition or comparison
>> >to Linus'? Give me a hint what happens.
>>
>> Well, your patch works but it hurts performance :(
>>
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 71.500u 1.790s 2:29.18 49.1%    0+0k 0+0io 18498pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 71.460u 1.990s 2:26.87 50.0%    0+0k 0+0io 18257pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 71.220u 2.200s 2:26.82 50.0%    0+0k 0+0io 18326pf+0w
>> 0:55 kswapd
>>
>> Linux-2.4.14-pre5:
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 70.340u 3.450s 2:13.62 55.2%    0+0k 0+0io 16829pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 70.590u 2.940s 2:15.48 54.2%    0+0k 0+0io 17182pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 70.140u 3.480s 2:14.66 54.6%    0+0k 0+0io 17122pf+0w
>> 0:01 kswapd
>
>Hello Lorenzo,
>
>to be honest: I expected that. The patch according to my knowledge
>fixes a "definition hole" in the shrink_cache algorithm. I tend to say
>it is the right thing to do it this way, but I am sure it is not as
>fast as immediate exit to swap. It would be interesting to know if it
>does hurt performance in a not-near-oom environment. I'd say Andrea or
>Linus might know that, or you can try, of course :-)

400M of swap now (from 200M), Linux-2.4.14-pre6 + your vmscan-patch:

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.320u 2.260s 2:28.92 49.4%    0+0k 0+0io 18755pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.330u 2.120s 2:28.40 49.4%    0+0k 0+0io 18838pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.880u 2.100s 2:28.31 49.8%    0+0k 0+0io 18646pf+0w
0:56 kswapd

qsbench vsize is just 343M, definitely a not-near-oom environment :)

>Anyway may I beg you to post my patch and your answer to the list,
>because I currently cannot do it (I am not in office right now, but on
>a web-terminal somewhere in the outbacks ;-). I have neither the patch
>at hand nor am I able to attach it with this mailer...
>
>Thanks,
>Stephan

your vmscan-patch:

--- linux-orig/mm/vmscan.c	Wed Oct 31 12:32:11 2001
+++ linux/mm/vmscan.c	Thu Nov  1 15:38:13 2001
@@ -469,16 +469,10 @@
 			spin_unlock(&pagecache_lock);
 			UnlockPage(page);
 page_mapped:
-			if (--max_mapped >= 0)
-				continue;
+			if (max_mapped > 0)
+				max_mapped--;
+			continue;
 
-			/*
-			 * Alert! We've found too many mapped pages on the
-			 * inactive list, so we start swapping out now!
-			 */
-			spin_unlock(&pagemap_lru_lock);
-			swap_out(priority, gfp_mask, classzone);
-			return nr_pages;
 		}
 
 		/*
@@ -514,6 +508,14 @@
 		break;
 	}
 	spin_unlock(&pagemap_lru_lock);
+
+	/*
+	 * Alert! We've found too many mapped pages on the
+	 * inactive list, so we start swapping out - delayed!
+	 * -skraw
+	 */
+	if (max_mapped==0)
+		swap_out(priority, gfp_mask, classzone);
 
 	return nr_pages;
 }



-- 
Lorenzo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-11-01 21:59   ` Lorenzo Allegrucci
@ 2001-11-01 23:35     ` Stephan von Krawczynski
  2001-11-02  0:37       ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Stephan von Krawczynski @ 2001-11-01 23:35 UTC (permalink / raw)
  To: Lorenzo Allegrucci; +Cc: linux-kernel, Linus Torvalds, Andrea Arcangeli

> At 22.08 01/11/01 +0100, you wrote:
> >> Well, your patch works but it hurts performance :(
> >>
> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> >> 71.500u 1.790s 2:29.18 49.1%    0+0k 0+0io 18498pf+0w
> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> >> 71.460u 1.990s 2:26.87 50.0%    0+0k 0+0io 18257pf+0w
> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> >> 71.220u 2.200s 2:26.82 50.0%    0+0k 0+0io 18326pf+0w
> >> 0:55 kswapd
> >>
> >> Linux-2.4.14-pre5:
> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> >> 70.340u 3.450s 2:13.62 55.2%    0+0k 0+0io 16829pf+0w
> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> >> 70.590u 2.940s 2:15.48 54.2%    0+0k 0+0io 17182pf+0w
> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
> >> 70.140u 3.480s 2:14.66 54.6%    0+0k 0+0io 17122pf+0w
> >> 0:01 kswapd
> >
> >Hello Lorenzo,
> >
> >to be honest: I expected that. The patch according to my knowledge
> >fixes a "definition hole" in the shrink_cache algorithm. I tend to say
> >it is the right thing to do it this way, but I am sure it is not as
> >fast as immediate exit to swap. It would be interesting to know if it
> >does hurt performance in a not-near-oom environment. I'd say Andrea or
> >Linus might know that, or you can try, of course :-)

To clarify this one a bit:
shrink_cache is thought to do what it says: it is given a number of
pages it should somehow manage to free by shrinking the cache. What my
patch does is go after the _whole_ list to fulfill that. One cannot
really say that this is the wrong thing to do, I guess. If it takes
time to _find_ free pages with shrink_cache, then probably the idea to
use it was wrong in the first place (which is not the fault of the
function itself). Or the number of free pages to find is too high, or
(as a last but, I guess, unrealistic approach) the swap_out eats the
time and shouldn't be called when nr_pages (the return value) is equal
to zero. This last one could be checked (hint hint, Lorenzo ;-) by
simply modifying

	if (max_mapped==0)

to

	if (max_mapped==0 && nr_pages>0)

at the end of shrink_cache.
Thinking again about this, it really sounds like the right choice,
because there is no need to swap when we have fulfilled the requested
number of free pages.

You should try.

Thank you for your patience, Lorenzo.

Regards,
Stephan

PS: just fishing for lobster, Linus ;-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-11-01 23:35     ` Stephan von Krawczynski
@ 2001-11-02  0:37       ` Linus Torvalds
  2001-11-02  2:17         ` Stephan von Krawczynski
  2001-11-02  2:30         ` Stephan von Krawczynski
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2001-11-02  0:37 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Lorenzo Allegrucci, linux-kernel, Andrea Arcangeli


On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
>
> To clarify this one a bit:
> shrink_cache is thought to do what it says, it is given a number of
> pages it should somehow manage to free by shrinking the cache. What my
> patch does is go after the _whole_ list to fulfill that.

I would suggest a slight modification: make "max_mapped" grow as the
priority goes up.

Right now max_mapped is fixed at "nr_pages*10".

You could have something like

	max_mapped = nr_pages * 60 / priority;

instead, which might also alleviate the problem with not even bothering to
scan much of the inactive list simply because 99% of all pages are mapped.

That way you don't waste time on looking at the rest of the inactive list
until you _need_ to.

		Linus


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-11-02  0:37       ` Linus Torvalds
@ 2001-11-02  2:17         ` Stephan von Krawczynski
  2001-11-02  2:21           ` Linus Torvalds
  2001-11-02  2:30         ` Stephan von Krawczynski
  1 sibling, 1 reply; 23+ messages in thread
From: Stephan von Krawczynski @ 2001-11-02  2:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Lorenzo Allegrucci, linux-kernel, Andrea Arcangeli

>
> On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
> >
> > To clarify this one a bit:
> > shrink_cache is thought to do what it says, it is given a number of
> > pages it should somehow manage to free by shrinking the cache. What
> > my patch does is go after the _whole_ list to fulfill that.
>
> I would suggest a slight modification: make "max_mapped" grow as the
> priority goes up.
>
> Right now max_mapped is fixed at "nr_pages*10".
>
> You could have something like
>
> 	max_mapped = nr_pages * 60 / priority;
>
> instead, which might also alleviate the problem with not even
> bothering to scan much of the inactive list simply because 99% of all
> pages are mapped.
>
> That way you don't waste time on looking at the rest of the inactive
> list until you _need_ to.

Wait a minute: there is something illogical in this approach.
Basically, by making max_mapped bigger you say that the "early exit"
from shrink_cache shouldn't be that early. But if you _know_ that
nearly all pages are mapped, then why don't you just go to swap_out
right away without even walking through the list? In the end you will
go to swap_out anyway (simply because of the high percentage of mapped
pages), which makes the scanning somehow superfluous. Making it
priority-dependent sounds like you want to swap_out earlier the
_lower_ the memory pressure is. In the end it sounds just like a hack
to hold up the early exit against every logic (but not against some
benchmark, of course).
It doesn't sound like the right thing.
Is the inactive list somehow sorted currently? If not, could it be
implicitly sorted to match this criterion (not-mapped versus mapped),
so that shrink_cache finds the not-mapped pages first (with a chance
to fulfill the nr_pages request)? If that isn't fulfilled and it hits
the first mapped page, it can go to swap_out right away, because more
scanning doesn't make sense and can only end in swap_out anyway.

I am no fan of complete list scanning, but if you are looking for
something you have to scan until you find it.

Regards,
Stephan

PS: I am still no pro in this area, so I try to go after the global
picture and find the right direction...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-11-02  2:17         ` Stephan von Krawczynski
@ 2001-11-02  2:21           ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2001-11-02  2:21 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Lorenzo Allegrucci, linux-kernel, Andrea Arcangeli


On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
>
> Wait a minute: there is something illogical in this approach:
> Basically you say by making max_mapped bigger that the "early exit"
> from shrink_cache shouldn't be that early. But if you _know_ that
> merely all pages are mapped, then why don't you just go to swap_out
> right away without even walking through the list, because in the end,
> you will go to swap_out anyway (simply because of the high percentage
> of mapped pages). That makes scanning somehow superfluous.

Well, no.

There are two things: sure, we know we have tons of mapped pages, and
we obviously will have done the "swap_out()" for the first iteration
(and probably the second and third ones too).

But at some point you have to say "Ok, _this_ process has done its due
work to clean up the VM pressure, and now this process needs to get on
with its life and stop caring about other people's bad memory usage".

Remember: everybody who calls "swap_out()" will free several pages from
the page tables. And everybody starts off with a low priority (ie 6).
So if we're truly 99% mapped, then every single allocator will start
off doing swap_out(), but at some point they obviously need to do other
things too (ie they need to get to the point in the inactive queue
where those swapped-out pages are now, and try to write them out to
disk).

Imagine an inactive queue that is a million entries long. That's 4GB
worth of RAM, sure, but there are lots of machines like that. If we
only allow shrink_cache() to look at 320 pages at a time, we'll never
get a life of our own.

(Yeah, sure, if you have all that 4GB on the inactive list, and it's all
mapped, you're going to spend some time cleaning it up _regardless_ of
what you do. That's life.)

		Linus


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-11-02  0:37       ` Linus Torvalds
  2001-11-02  2:17         ` Stephan von Krawczynski
@ 2001-11-02  2:30         ` Stephan von Krawczynski
  2001-11-02  2:55           ` Stephan von Krawczynski
  1 sibling, 1 reply; 23+ messages in thread
From: Stephan von Krawczynski @ 2001-11-02  2:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Lorenzo Allegrucci, linux-kernel, Andrea Arcangeli

>
> On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
> >
> > To clarify this one a bit:
> > shrink_cache is thought to do what it says, it is given a number of
> > pages it should somehow manage to free by shrinking the cache. What
> > my patch does is go after the _whole_ list to fulfill that.
>
> I would suggest a slight modification: make "max_mapped" grow as the
> priority goes up.
>
> Right now max_mapped is fixed at "nr_pages*10".
>
> You could have something like
>
> 	max_mapped = nr_pages * 60 / priority;
>
> instead, which might also alleviate the problem with not even
> bothering to scan much of the inactive list simply because 99% of all
> pages are mapped.
>
> That way you don't waste time on looking at the rest of the inactive
> list until you _need_ to.

Ok. I re-checked the code and found out this approach cannot stand:
the list scan _is_ already exited early when priority is low:

        int max_scan = nr_inactive_pages / priority;

        while (--max_scan >= 0 && (entry = inactive_list.prev) != &inactive_list) {

It will not make much sense to do it again in max_mapped.

On the other hand I am also very sure that refining:

        if (max_mapped==0)
                swap_out(priority, gfp_mask, classzone);

        return nr_pages;

in the end to:

        if (max_mapped==0 && nr_pages>0)
                swap_out(priority, gfp_mask, classzone);

        return nr_pages;

is a good thing. We don't need swap_out if we gained all the pages
requested, no matter whether we _could_ do it or not.

Is there some performance difference in this approach, Lorenzo? I
guess there should be.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
@ 2001-11-02  2:37 Ed Tomlinson
  2001-11-02  3:01 ` Stephan von Krawczynski
  0 siblings, 1 reply; 23+ messages in thread
From: Ed Tomlinson @ 2001-11-02  2:37 UTC (permalink / raw)
  To: linux-kernel

Hi,

shrink_caches can end up lying.  shrink_dcache_memory and friends do not
tell shrink_caches how many pages they free, so nr_pages can be bogus...
Is it worth fixing?  The simplest (harmlessly racy and not too pretty)
code follows.  It would also not be hard to change the shrink_ calls to
return the number of pages shrunk, but this would hit more code...

Comments?

Ed Tomlinson

--- linux/mm/vmscan.c.orig	Wed Oct 31 14:11:33 2001
+++ linux/mm/vmscan.c	Wed Oct 31 14:51:58 2001
@@ -552,6 +552,7 @@
 static int shrink_caches(zone_t * classzone, int priority, unsigned int gfp_mask, int nr_pages)
 {
 	int chunk_size = nr_pages;
+	int nr_shrunk;
 	unsigned long ratio;
 
 	nr_pages -= kmem_cache_reap(gfp_mask);
@@ -567,11 +568,21 @@
 	if (nr_pages <= 0)
 		return 0;
 
+	nr_shrunk = nr_free_pages();
+
 	shrink_dcache_memory(priority, gfp_mask);
 	shrink_icache_memory(priority, gfp_mask);
 #ifdef CONFIG_QUOTA
 	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
 #endif
+
+	/* racy - calculate how many pages we got from the shrinks */
+	nr_shrunk = nr_free_pages() - nr_shrunk;
+	if (nr_shrunk > 0) {
+		nr_pages -= nr_shrunk;
+		if (nr_pages <= 0)
+			return 0;
+	}
 
 	return nr_pages;
 }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-11-02  2:30         ` Stephan von Krawczynski
@ 2001-11-02  2:55           ` Stephan von Krawczynski
  0 siblings, 0 replies; 23+ messages in thread
From: Stephan von Krawczynski @ 2001-11-02  2:55 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Lorenzo Allegrucci, linux-kernel, Andrea Arcangeli

> Ok. I re-checked the code and found out this approach cannot stand. 
                                                                      
> the list scan _is_ already exited early when priority is low:       
                                                                      
                                                                      
Sorry for the follow-up to my own mail, but another thing comes to mind:

swap_out is currently in no way priority-dependent, but it could be
(the parameter is there).  How about swapping out more pages in tighter
memory situations?  The basic idea is that when there is a rising need
for memory, it cannot be wrong to do a bit more than under normal
circumstances.  One could achieve this simply by changing

        int counter, nr_pages = SWAP_CLUSTER_MAX;

to

        int counter, nr_pages = SWAP_CLUSTER_MAX * DEF_PRIORITY / priority;

in swap_out.
The idea is to reduce the overhead of finding out whether swapping is
needed by simply swapping more every time we have already gone "the long
way to knowing".

Regards,
Stephan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-11-02  2:37 Ed Tomlinson
@ 2001-11-02  3:01 ` Stephan von Krawczynski
  0 siblings, 0 replies; 23+ messages in thread
From: Stephan von Krawczynski @ 2001-11-02  3:01 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-kernel

> Hi,
>
> shrink_caches can end up lying.  shrink_dcache_memory and friends do not tell
> shrink_caches how many pages they free, so nr_pages can be bogus...  Is it worth
> fixing?  The simplest (harmlessly racy, and not too pretty) code follows.  It
> would also not be hard to change the shrink_ calls to return the number of pages
> shrunk, but that would touch more code...
>
> Comments?

I believe the idea of having a more precise nr_pages value can make a
difference.  We are trying to estimate whether swapping is needed, which
is pretty expensive.  If we can avoid it by knowing more accurately what
is really going on (without _too_ much cost), we can only win.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 17:55   ` Lorenzo Allegrucci
  2001-10-31 18:06     ` Linus Torvalds
  2001-10-31 21:31     ` Lorenzo Allegrucci
@ 2001-11-02 13:00     ` Stephan von Krawczynski
  2001-11-02 17:36     ` Lorenzo Allegrucci
  3 siblings, 0 replies; 23+ messages in thread
From: Stephan von Krawczynski @ 2001-11-02 13:00 UTC (permalink / raw)
  To: Lorenzo Allegrucci; +Cc: torvalds, riel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]

Hello Lorenzo,

please find attached next vmscan.c patch which sums up the delayed swap_out
(first patch), the fix for not swapping when nr_pages is reached, and (new) the
idea to swap more pages in one call to swap_out if priority gets higher.

I have not the slightest idea what all this does to performance.  In particular,
the "more" swap_out code is pure trial and error.  Can you do some testing,
please?

Thanks,
Stephan

[-- Attachment #2: vmscan-patch2 --]
[-- Type: application/octet-stream, Size: 1511 bytes --]

--- linux-orig/mm/vmscan.c	Thu Nov  1 15:33:58 2001
+++ linux/mm/vmscan.c	Fri Nov  2 13:50:31 2001
@@ -290,7 +290,7 @@
 static int FASTCALL(swap_out(unsigned int priority, unsigned int gfp_mask, zone_t * classzone));
 static int swap_out(unsigned int priority, unsigned int gfp_mask, zone_t * classzone)
 {
-	int counter, nr_pages = SWAP_CLUSTER_MAX;
+	int counter, nr_pages = SWAP_CLUSTER_MAX * DEF_PRIORITY / priority;
 	struct mm_struct *mm;
 
 	counter = mmlist_nr;
@@ -334,7 +334,7 @@
 {
 	struct list_head * entry;
 	int max_scan = nr_inactive_pages / priority;
-	int max_mapped = nr_pages*10;
+	int max_mapped = SWAP_CLUSTER_MAX * DEF_PRIORITY / priority;
 
 	spin_lock(&pagemap_lru_lock);
 	while (--max_scan >= 0 && (entry = inactive_list.prev) != &inactive_list) {
@@ -469,16 +469,10 @@
 			spin_unlock(&pagecache_lock);
 			UnlockPage(page);
 page_mapped:
-			if (--max_mapped >= 0)
-				continue;
+			if (max_mapped > 0)
+				max_mapped--;
+			continue;
 
-			/*
-			 * Alert! We've found too many mapped pages on the
-			 * inactive list, so we start swapping out now!
-			 */
-			spin_unlock(&pagemap_lru_lock);
-			swap_out(priority, gfp_mask, classzone);
-			return nr_pages;
 		}
 
 		/*
@@ -514,6 +508,14 @@
 		break;
 	}
 	spin_unlock(&pagemap_lru_lock);
+
+	/*
+	 * Alert! We've found too many mapped pages on the
+	 * inactive list, so we start swapping out - delayed!
+	 * -skraw
+	 */
+	if (max_mapped == 0 && nr_pages > 0)
+		swap_out(priority, gfp_mask, classzone);
 
 	return nr_pages;
 }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new OOM heuristic failure  (was: Re: VM: qsbench)
  2001-10-31 17:55   ` Lorenzo Allegrucci
                       ` (2 preceding siblings ...)
  2001-11-02 13:00     ` Stephan von Krawczynski
@ 2001-11-02 17:36     ` Lorenzo Allegrucci
  3 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Allegrucci @ 2001-11-02 17:36 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: torvalds, riel, linux-kernel

At 14.00 02/11/01 +0100, Stephan von Krawczynski wrote:
>Hello Lorenzo,
>
>please find attached next vmscan.c patch which sums up the delayed swap_out
>(first patch), the fix for not swapping when nr_pages is reached, and (new) the
>idea to swap more pages in one call to swap_out if priority gets higher.
>
>I have not the slightest idea what all this does to performance.  In particular,
>the "more" swap_out code is pure trial and error.  Can you do some testing,
>please?

vmscan-patch2 looks slightly slower than vmscan-patch:

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.800u 2.210s 2:27.96 49.3%    0+0k 0+0io 18551pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.600u 2.150s 2:28.49 48.9%    0+0k 0+0io 18728pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.690u 2.080s 2:28.77 48.9%    0+0k 0+0io 18753pf+0w
1:03 kswapd

Same test with 400M of swap:

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
72.180u 2.110s 2:31.37 49.0%    0+0k 0+0io 18696pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.400u 2.200s 2:31.04 48.0%    0+0k 0+0io 18940pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.950u 2.210s 2:32.35 48.0%    0+0k 0+0io 19115pf+0w
1:02 kswapd

kswapd still takes many cycles.


-- 
Lorenzo

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2001-11-02 17:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-10-31 12:12 VM: qsbench Lorenzo Allegrucci
2001-10-31 12:23 ` Jeff Garzik
2001-10-31 15:00 ` new OOM heuristic failure (was: Re: VM: qsbench) Rik van Riel
2001-10-31 15:52   ` Linus Torvalds
2001-10-31 16:04     ` Rik van Riel
2001-10-31 17:42       ` Stephan von Krawczynski
2001-10-31 18:22         ` Linus Torvalds
2001-10-31 17:55   ` Lorenzo Allegrucci
2001-10-31 18:06     ` Linus Torvalds
2001-10-31 21:31     ` Lorenzo Allegrucci
2001-11-02 13:00     ` Stephan von Krawczynski
2001-11-02 17:36     ` Lorenzo Allegrucci
2001-10-31 17:55 ` VM: qsbench Lorenzo Allegrucci
     [not found] <Pine.LNX.3.96.1011031133645.448B-100000@gollum.norang.ca>
2001-10-31 19:46 ` new OOM heuristic failure (was: Re: VM: qsbench) Linus Torvalds
     [not found] <200111012108.WAA28044@webserver.ithnet.com>
     [not found] ` <3.0.6.32.20011101214957.01feaa70@pop.tiscalinet.it>
2001-11-01 21:59   ` Lorenzo Allegrucci
2001-11-01 23:35     ` Stephan von Krawczynski
2001-11-02  0:37       ` Linus Torvalds
2001-11-02  2:17         ` Stephan von Krawczynski
2001-11-02  2:21           ` Linus Torvalds
2001-11-02  2:30         ` Stephan von Krawczynski
2001-11-02  2:55           ` Stephan von Krawczynski
  -- strict thread matches above, loose matches on Subject: below --
2001-11-02  2:37 Ed Tomlinson
2001-11-02  3:01 ` Stephan von Krawczynski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox