* Subtle MM bug
@ 2001-01-07 20:59 ` Zlatko Calusic
0 siblings, 0 replies; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-07 20:59 UTC (permalink / raw)
To: linux-kernel, linux-mm
I'm trying to get more familiar with the MM code in 2.4.0, as can be
seen from the many questions I have on the subject. I discovered nasty
MM behaviour under even moderate load (2.2 didn't have this trouble).
Things go berserk if you have one big process whose working set is
around your physical memory size. Typical memory hoggers are good
enough to trigger the bad behaviour. Final effect is that physical
memory gets extremely flooded with the swap cache pages and at the
same time the system absorbs ridiculous amount of the swap space.
xmem is, as usual, very good at detecting this, and you just need to
press Alt-SysRq-M to see that most of the memory (e.g. 90%) is
populated with swap cache pages.
For instance on my 192MB configuration, firing up the hogmem program
which allocates let's say 170MB of memory and dirties it leads to
215MB of swap used. vmstat 1 shows that the page cache size is
constantly growing (in fact, it is the swap cache that grows) during the
second pass of the hogmem program.
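The workload is easy to approximate. What follows is a minimal sketch of a
hogmem-style program, not the hogmem source itself; the names dirty_pages()
and hog() are made up for illustration, and the page size is passed in
rather than queried from the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Write one byte into every page of buf so that each page becomes dirty.
 * Returns the number of pages touched. The pointer is volatile so the
 * compiler cannot drop the stores. (Hypothetical helper, not hogmem code.)
 */
size_t dirty_pages(volatile unsigned char *buf, size_t len, size_t page_size,
                   unsigned char value)
{
    size_t touched = 0;

    assert(buf != NULL && page_size > 0);
    for (size_t off = 0; off < len; off += page_size) {
        buf[off] = value;
        touched++;
    }
    return touched;
}

/*
 * Allocate len bytes and dirty them `passes` times. Returns the total
 * number of page writes, or 0 if the allocation failed.
 */
size_t hog(size_t len, size_t page_size, unsigned passes)
{
    unsigned char *buf = malloc(len);
    size_t total = 0;

    if (buf == NULL)
        return 0;
    for (unsigned p = 0; p < passes; p++)
        total += dirty_pages(buf, len, page_size, (unsigned char)(p + 1));
    free(buf);
    return total;
}
```

Assuming 4096-byte pages, a call like hog(170UL << 20, 4096, 2) corresponds
to the 170MB case described here; the swap cache growth shows up during the
second pass.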
...
   procs                    memory        swap         io     system         cpu
 r  b  w   swpd   free buff  cache    si    so   bi    bo   in    cs  us  sy  id
 0  1  1 131488   1592  400  62384  4172  5188 1092  1298  353  1447   2   4  94
 0  1  1 136584   1592  400  67428  5860  4104 1465  1034  322  1327   3   3  93
 0  1  1 141668   1592  388  72536  5504  4420 1376  1106  323  1423   1   3  95
 0  1  1 146724   1592  380  77592  5996  4236 1499  1060  335  1096   2   3  94
 0  1  1 151876   1600  320  82764  6264  3712 1566   936  327  1226   3   4  93
 0  1  1 157016   1600  320  87908  5284  4268 1321  1068  315  1248   1   2  96
 1  0  0 157016   1600  308  87792  1836  5168  459  1293  281  1324   3   3  94
 0  1  0 162204   1600  304  92892  7784  5236 1946  1315  385  1353   3   5  92
 0  1  0 167216   1600  304  97780  3496  5016  874  1256  301  1222   0   2  97
 0  1  1 177904   1608  284 108276  5160  5168 1290  1300  330  1453   1   4  94
 0  1  2 182008   1588  288 112264  4936  3344 1268   838  293   801   2   3  95
 0  2  1 183620   1588  260 114012  3064  1756  830   445  290   846   0  15  85
 0  2  2 185384   1596  180 115864  2320  2620  635   658  285   722   1  29  70
 0  3  2 187528   1592  220 117892  2488  2224  657   557  273   754   3  30  67
 0  4  1 190512   1592  236 120772  2524  3012  725   760  343  1080   1  14  85
 0  4  1 195780   1592  240 125868  2336  5316  613  1331  381  1624   2   2  96
 1  0  1 200992   1592  248 131052  2080  2176  623   552  234  1044   3  23  74
 0  1  0 200996   1592  252 130948  2208  3048  580   762  256  1065  10  10  80
 0  1  1 206240   1592  252 136076  2988  5252  760  1314  309  1406   7   4   8
 0  2  1 211408   1592  256 141080  5424  5180 1389  1303  395  1885   3   5  91
 0  2  0 214744   1592  264 144280  4756  3328 1223   834  327  1211   1   5  95
 1  0  0 214868   1592  244 144468  4344  5148 1087  1295  303  1189  11   2  86
 0  1  1 214900   1592  248 144496  4360  3244 1098   812  318  1467   7   4  89
 0  1  1 214916   1592  248 144520  4280  3452 1070   865  336  1602   3   3  94
 0  1  1 214964   1592  248 144580  4972  4184 1243  1054  368  1620   3   5  92
 0  2  2 214956   1592  272 144548  3700  4544 1081  1142  665  2952   1   1  98
 0  1  0 214992   1592  272 144588  1220  5088  305  1274  282  1363   1   4  95
 0  1  1 215012   1592  272 144600  3640  4420  910  1106  325  1579   3   2   9
Any thoughts on this?
--
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-07 20:59 ` Zlatko Calusic
@ 2001-01-07 21:37 ` Rik van Riel
-1 siblings, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-07 21:37 UTC (permalink / raw)
To: Zlatko Calusic; +Cc: linux-kernel, linux-mm
On 7 Jan 2001, Zlatko Calusic wrote:
> Things go berserk if you have one big process whose working set
> is around your physical memory size.
"go berserk" in what way? Does the system cause lots of extra
swap IO and does it make the system thrash where 2.2 didn't
even touch the disk?
> Final effect is that physical memory gets extremely flooded with
> the swap cache pages and at the same time the system absorbs
> ridiculous amount of the swap space.
This is mostly because Linux 2.4 keeps dirty pages in the
swap cache. Under Linux 2.2 a page would be deleted from the
swap cache when a program writes to it, but in Linux 2.4 it
can stay in the swap cache.
Oh, and don't forget that pages in the swap cache can also
be resident in the process, so it's not like the swap cache
is "eating into" the process' RSS ;)
> For instance on my 192MB configuration, firing up the hogmem
> program which allocates let's say 170MB of memory and dirties it
> leads to 215MB of swap used.
So that's 170MB of swap space for hogmem and 45MB for
the other things in the system (daemons, X, ...).
Sounds pretty ok, except maybe for the fact that now
Linux allocates (not uses!) a lot more swap space than
before and some people may need to add some swap space
to their system ...
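To make the 2.2 versus 2.4 difference concrete, here is a toy model in C. It
is only a sketch of the policy described above, not kernel code: struct
toy_page, the two policy names and slots_after_cycle() are all invented for
illustration. Every page is swapped out once, faulted back in, and then
written to; the function counts how many swap slots are still allocated
afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum swap_policy {
    DROP_SLOT_ON_WRITE,  /* 2.2-like: a write removes the page from the swap cache */
    KEEP_SLOT_ON_WRITE   /* 2.4-like: the page may stay in the swap cache */
};

struct toy_page {
    bool resident;       /* page is currently in RAM */
    bool has_swap_slot;  /* a swap slot is allocated for it */
};

/* The page is swapped out: it gets a slot and leaves RAM. */
static void toy_page_out(struct toy_page *pg)
{
    pg->has_swap_slot = true;
    pg->resident = false;
}

/* The page is faulted back in: resident again, slot retained. */
static void toy_page_in(struct toy_page *pg)
{
    assert(pg->has_swap_slot);
    pg->resident = true;
}

/* The program writes to a resident page. */
static void toy_page_write(struct toy_page *pg, enum swap_policy policy)
{
    assert(pg->resident);
    if (policy == DROP_SLOT_ON_WRITE)
        pg->has_swap_slot = false;
}

/* Run one out/in/write cycle over all pages; return the slots still held. */
size_t slots_after_cycle(struct toy_page *pages, size_t n,
                         enum swap_policy policy)
{
    size_t slots = 0;

    for (size_t i = 0; i < n; i++) {
        toy_page_out(&pages[i]);
        toy_page_in(&pages[i]);
        toy_page_write(&pages[i], policy);
        if (pages[i].has_swap_slot)
            slots++;
    }
    return slots;
}
```

Under the 2.4-like policy every page ends up both resident and holding a
swap slot, which is the "allocated, not used" swap space mentioned above.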
Now if 2.4 has worse _performance_ than 2.2 due to one
reason or another, that I'd like to hear about ;)
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-07 21:37 ` Rik van Riel
@ 2001-01-07 22:33 ` Zlatko Calusic
-1 siblings, 0 replies; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-07 22:33 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm
Rik van Riel <riel@conectiva.com.br> writes:
> On 7 Jan 2001, Zlatko Calusic wrote:
>
> > Things go berserk if you have one big process whose working set
> > is around your physical memory size.
>
> "go berserk" in what way? Does the system cause lots of extra
> swap IO and does it make the system thrash where 2.2 didn't
> even touch the disk?
>
Well, I think yes. I'll do some testing on the 2.2 before I can tell
you for sure, but definitely the system is behaving badly where I
think it should not.
> > Final effect is that physical memory gets extremely flooded with
> > the swap cache pages and at the same time the system absorbs
> > ridiculous amount of the swap space.
>
> This is mostly because Linux 2.4 keeps dirty pages in the
> swap cache. Under Linux 2.2 a page would be deleted from the
> swap cache when a program writes to it, but in Linux 2.4 it
> can stay in the swap cache.
>
OK, I can buy that.
> Oh, and don't forget that pages in the swap cache can also
> be resident in the process, so it's not like the swap cache
> is "eating into" the process' RSS ;)
>
So far so good... A little bit weird but not alarming per se.
> > For instance on my 192MB configuration, firing up the hogmem
> > program which allocates let's say 170MB of memory and dirties it
> > leads to 215MB of swap used.
>
> So that's 170MB of swap space for hogmem and 45MB for
> the other things in the system (daemons, X, ...).
>
Yes, that's it. So it looks like all of my processes are on the
swap. That can't be good. I mean, even Solaris (known to eat swap
space like there's no tomorrow :)) would probably be more polite.
> Sounds pretty ok, except maybe for the fact that now
> Linux allocates (not uses!) a lot more swap space than
> before and some people may need to add some swap space
> to their system ...
>
Yes, I would say really a lot more. Big difference.
Also, I don't see a difference between allocated and used swap space on
Linux. Could you elaborate on that?
>
> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
>
I'll get back to you later with more data. Time to boot 2.2. :)
--
Zlatko
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
@ 2001-01-08 5:29 Wayne Whitney
2001-01-08 5:42 ` Andi Kleen
2001-01-08 17:16 ` Rik van Riel
0 siblings, 2 replies; 128+ messages in thread
From: Wayne Whitney @ 2001-01-08 5:29 UTC (permalink / raw)
To: linux-kernel; +Cc: William A. Stein
On Sunday, January 7, 2001, Rik van Riel <riel@conectiva.com.br> wrote:
> Now if 2.4 has worse _performance_ than 2.2 due to one reason or
> another, that I'd like to hear about ;)
Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
and as it is the usual workload on my little cluster of 3 machines, they
are all running 2.2.19pre:
The application does some mathematical computations (modular symbols) using a
package called MAGMA; at times this requires very large matrices. The
RSS can get up to 870MB; for some reason a MAGMA process under Linux
thinks it has run out of memory at 870MB, regardless of the actual
memory/swap in the machine. MAGMA is single-threaded.
The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
There is no problem with just one MAGMA process, it just hits that 870MB
barrier and gracefully exits. But if I do the following test, I notice
very different behaviour under 2.2 and 2.4: while running 'top d 1' I
simultaneously launch two instances of a job which actually requires more
than 870MB of memory to complete. So each instance will slowly grow in
RSS until it gets killed by OOM or hits that 870MB limit.
Under 2.2, everything proceeds smoothly: before physical RAM is exhausted,
top updates every second, and the jobs have all the CPU. When swapping
kicks in, top updates every 1-2 seconds and lists most of the CPU as
'system' (kswapd), but I perceive not much loss of interactivity.
Eventually the 1GB of virtual memory is exhausted, the OOM killer kills
one of the MAGMAs, and the other runs till it hits the 870MB barrier and
exits.
But under 2.4, interactivity suffers as soon as physical RAM is exhausted.
Top only updates every 2-10 seconds, the load average hits 3-4, and top
reports the CPUs are 90% idle. Eventually, the OOM killer kicks in and
all returns to normal. For practical purposes, the machine is unusable
while swapping like this.
I have heard 'vmstat' mentioned here, so below is the output of a 'vmstat
1' concomitant with the test above (top and the two MAGMA jobs). I would
be more than happy to provide any other relevant information about this.
I read the LKML via an archive that updates once a day, so please cc: me
if you would like a speedier response. I wish I knew of a newsgroup
interface to the LKML, then I could read it more often :-).
Cheers,
Wayne
   procs                    memory        swap         io     system         cpu
 r  b  w   swpd   free buff  cache    si    so   bi    bo   in    cs  us  sy  id
 0  0  0  49180 447840  840  54104   269   969   84   244   76   236  10   4  86
 1  0  0  49180 443276  852  55972     0     0  470     0  163   150  15   2  83
 2  0  0  49180 440060  852  56292     0     0   80     0  115    60  93   1   6
 2  0  0  49180 438236  856  56292     0     0    1     0  107    53  99   1   0
 2  0  0  49180 429468  856  56392     0     0   25     0  109    16  99   0   0
 2  0  0  49180 421296  856  56392     0     0    0     0  104    13  98   2   0
 2  0  0  49180 421132  856  56392     0     0    0     0  108    53 100   0   0
 2  0  0  49180 421128  856  56392     0     0    0     0  108    47 100   0   0
 2  0  0  49180 397520  856  56392     0     0    0     1  107    49  96   4   0
 2  0  0  49180 364860  856  56392     0     0    0     0  106    47  95   5   0
 2  0  0  49180 332244  856  56392     0     0    0     0  106    49  95   5   0
 2  0  0  49180 299660  856  56392     0     0    0     0  106    54  92   8   0
 2  0  0  49180 267076  856  56392     0     0    0     0  109    56  95   5   0
 2  0  0  49180 234632  856  56392     0     0    0     0  110    57  94   6   0
 2  0  0  49180 202096  872  56448    32     0   18     0  117    70  95   5   0
 2  0  0  49180 169544  872  56448     0     0    0     0  103    13  96   4   0
 2  0  0  49180 137108  872  56448     0     0    0     0  107    49  93   7   0
 2  0  0  49180 104600  872  56448     0     0    0     0  107    51  94   6   0
 2  0  0  49180  72368  872  56448     0     0    0    52  136    54  93   7   0
 2  0  0  49180  39964  872  56448     0     0    0     0  110    59  92   8   0
 2  0  2   7296   1576   96  13072     0   720    0   184  130   465  74  22   4
   procs                    memory        swap         io     system         cpu
 r  b  w   swpd   free buff  cache    si    so   bi    bo   in    cs  us  sy  id
 1  2  2  53620   1564  116  23512  1012 31876  565  7969  883  3802   1   8  92
 2  1  2  68800   1560   96  20128    68 15396   17  3850  291  2775   1   7  92
 3  0  1  99484   1556   96  26096    84 29552   21  7388  594  3832   1   4  95
 1  3  2 114708   1560  104  32528   284 14696  161  3674  374  3125   0   4  96
 1  4  2 175484   1560  124  31112   360 63000  237 15753 1404 14952   1   5  94
 1  2  2 205900   1560   96  32748    12 30080    3  7520  606  8356   1   5  94
 2  1  2 221156   1560   96  17848   412 14256  103  3564  308  8450   1  10  89
 1  2  2 222128   1564   96  12736     0 16100    7  4025  346  1010   0   5  95
 1  2  2 236580   1560  108  15220   276 13988   97  3497  347  4102   0   7  92
 2  1  2 267488   1560  104  32044   260 17376   69  4346  405  1265   0   7  93
 3  1  1 282756   1560   96  29380    16 15304    4  3827  335  4359   1   7  92
 2  1  2 282756   1580   96  11460    92 14948   23  3737  332  4120   1   5  94
 2  1  2 313496   1560  100  30476   200 15484   54  3871  318  2359   0   9  90
 2  1  2 313496   1560  100  14148     0 13076    1  3270  246  5165   1   8  91
 3  1  1 344564   1572   96  23892    16 18444   11  4613  419  1555   0   7  93
 2  1  2 375020   1560   96  25400   172 26988   43  6747  556  2910   1   7  93
 1  2  2 375020   1968   96  22760     8 17136    2  4284  378   787   0   2  98
 2  1  2 406056   1568   96  20432   212 17320   53  4330  393  2704   1  10  89
 3  0  3 421316   1560   96  25056    72 14416   18  3604  281  1731   0   5  94
 1  3  0 452120   1544  100  21216   240 31480  116  7870  715  2681   1   6  94
 2  2  2 467488   1588  108  27248   440 15056  123  3765  385  2206   0   5  94
   procs                    memory        swap         io     system         cpu
 r  b  w   swpd   free buff  cache    si    so   bi    bo   in    cs  us  sy  id
 2  1  0 467488   1564  136  13352    88 15376   49  3844  368  2913   1   4  95
 3  0  1 482864   1560   96  15256   128 15384   32  3846  296   986   1   7  92
 3  0  1 497920   1560   96  14144     0 12636    0  3159  245  2302   1   9  90
 3  1  1 529844   1540   96  18632   940 33340  569  8336 1104  1366   1  10  88
 0  1  0 269856 205944  148  21772  2628     0 1196     2  267   313   0   3  97
 0  1  0 269856 182736  156  33180 11180     0 2854     0  309   451   6   3  91
 0  1  0 269856 158668  156  44696 11516     0 2879     0  314   462  12   4  83
 0  1  0 269856 131928  156  57588 12892     0 3223     0  312   466   8   4  88
 0  1  0 269856 105176  156  70448 12864     0 3216     0  332   506  12   3  85
 0  1  0 269856  79056  156  82644 12196     0 3049     0  456   602  10   6  83
 1  1  0 269856  46948  156  96900 14252     0 3563     0  359   518  21   7  72
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-08 5:29 Wayne Whitney
@ 2001-01-08 5:42 ` Andi Kleen
2001-01-08 6:04 ` Linus Torvalds
2001-01-08 17:16 ` Rik van Riel
1 sibling, 1 reply; 128+ messages in thread
From: Andi Kleen @ 2001-01-08 5:42 UTC (permalink / raw)
To: Wayne Whitney; +Cc: linux-kernel, William A. Stein
On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> The application does some mathematical computations (modular symbols) using a
> package called MAGMA; at times this requires very large matrices. The
> RSS can get up to 870MB; for some reason a MAGMA process under Linux
> thinks it has run out of memory at 870MB, regardless of the actual
> memory/swap in the machine. MAGMA is single-threaded.
I think it's caused by the way malloc maps its memory.
Newer glibc should work a bit better by falling back to mmap even for smaller
allocations (older does it only for very big ones)
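The fallback idea can be sketched in userspace. The code below is not glibc's
malloc and makes no claim about its internals; grab_memory() is a
hypothetical helper that first tries to grow the heap with sbrk() and, if
that fails, asks for an anonymous mapping with mmap() instead.

```c
#define _DEFAULT_SOURCE  /* for sbrk() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Get len bytes from the kernel: try to grow the heap with sbrk() first
 * and fall back to an anonymous mmap() if that fails. *used_mmap reports
 * which path was taken. Returns NULL if both fail.
 * (Illustrative only; this is not how glibc structures its allocator.)
 */
void *grab_memory(size_t len, int *used_mmap)
{
    void *p;

    assert(used_mmap != NULL);
    *used_mmap = 0;

    /* sbrk() takes a signed increment, so guard against overflow. */
    if (len <= (size_t)INTPTR_MAX) {
        p = sbrk((intptr_t)len);
        if (p != (void *)-1)
            return p;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *used_mmap = 1;
    return p;
}
```

The point of the fallback is that a heap which can no longer grow
contiguously does not have to mean "out of memory" as long as there is
address space left elsewhere.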
-Andi
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-08 5:42 ` Andi Kleen
@ 2001-01-08 6:04 ` Linus Torvalds
2001-01-08 17:44 ` Rik van Riel
0 siblings, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 6:04 UTC (permalink / raw)
To: linux-kernel
In article <20010108064225.B29026@gruyere.muc.suse.de>,
Andi Kleen <ak@suse.de> wrote:
>On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
>> The application does some mathematical computations (modular symbols) using a
>> package called MAGMA; at times this requires very large matrices. The
>> RSS can get up to 870MB; for some reason a MAGMA process under Linux
>> thinks it has run out of memory at 870MB, regardless of the actual
>> memory/swap in the machine. MAGMA is single-threaded.
>
>I think it's caused by the way malloc maps its memory.
>Newer glibc should work a bit better by falling back to mmap even for smaller
>allocations (older does it only for very big ones)
That doesn't resolve the "2.4.x behaves badly" thing, though.
I've seen that one myself, and it seems to be simply due to the fact
that we're usually so good at getting memory from page_launder() that we
never bother to try to swap stuff out. And when we _do_ start swapping
stuff out it just moves to the dirty list, and page_launder() will take
care of it.
So far so good. The problem appears to be that we don't swap stuff out
smoothly: we start doing the VM scanning, but when we get enough dirty
pages, we'll let it be, and go back to page_launder() again. Which means
that we don't walk through the whole VM space, we just do some "spot
cleaning".
Linus
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
[not found] <200101080602.WAA02132@pizda.ninka.net>
@ 2001-01-08 6:42 ` Linus Torvalds
2001-01-08 13:11 ` Marcelo Tosatti
` (2 more replies)
0 siblings, 3 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 6:42 UTC (permalink / raw)
To: David S. Miller; +Cc: Rik van Riel, Marcelo Tosatti, linux-mm
[ MM people Cc'd, because while I have a plan, I don't have enough time to
actually put that plan in action. And maybe somebody can shoot down my
brilliant plan. ]
On Sun, 7 Jan 2001, David S. Miller wrote:
>
> BTW, this reminds me. Now that you keep track of the "all mm's" list
> thingy, you can also keep track of "nr_mms" in the system and do that
> little:
>
> 	for (i = 0; i < (nr_mms >> priority); i++)
> 		pagetable_scan();
>
> thing you were talking about last week.
This is the whole reason for making that list in the first place.
Even more subtle: see the comment in kernel/fork.c about keeping the list
of mm's in order. What I _really_ want to do is something like
	void swap_out(void)
	{
		int i;

		for (i = 0; i < (nr_mms >> priority); i++) {
			struct list_head *p;

			spin_lock(&mmlist_lock);
			p = initmm.mmlist.next;
			if (p != &initmm.mmlist) {
				struct mm_struct *mm = list_entry(p, struct mm_struct, mmlist);

				/* Move it to the back of the queue */
				list_del(p);
				__list_add(p, initmm.mmlist.prev, &initmm.mmlist);
				/* Pin the mm so it can't go away once we drop the lock */
				atomic_inc(&mm->mm_users);
				spin_unlock(&mmlist_lock);
				swap_out_mm(mm);
				mmput(mm);	/* drop the reference taken above */
				continue;
			}
			/* empty mm-list - shouldn't really happen except during bootup */
			spin_unlock(&mmlist_lock);
			break;
		}
	}
and just get rid of all the logic to try to "find the best mm". It's bogus
anyway: we should get perfectly fair access patterns by just doing
everything in round-robin, and each "swap_out_mm(mm)" would just try to
walk some fixed percentage of the RSS size (say, something like
count = (mm->rss >> 4)
and be done with it.
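For anyone who wants to play with the scheduling order outside the kernel,
here is a small userspace model of the round-robin walk. It is a sketch
only: struct toy_mm, struct toy_mmlist and rr_swap_out() are invented names,
an array index stands in for the linked list, and "scanning" is reduced to
adding rss >> 4 to a counter.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm {
    unsigned long rss;      /* resident pages */
    unsigned long scanned;  /* pages examined so far */
};

struct toy_mmlist {
    struct toy_mm *mms;  /* ring of address spaces */
    size_t nr_mms;
    size_t head;         /* index of the next mm to visit */
};

/* Scan budget for one visit: a fixed 1/16 of the RSS. */
static unsigned long rr_scan_budget(const struct toy_mm *mm)
{
    return mm->rss >> 4;
}

/*
 * Visit (nr_mms >> priority) address spaces in round-robin order.
 * Advancing head plays the role of moving the visited mm to the back
 * of the queue. Returns the number of mm's visited.
 */
size_t rr_swap_out(struct toy_mmlist *list, unsigned priority)
{
    size_t visits, i;

    assert(list != NULL);
    assert(priority < sizeof(size_t) * 8);
    if (list->nr_mms == 0)
        return 0;

    visits = list->nr_mms >> priority;
    for (i = 0; i < visits; i++) {
        struct toy_mm *mm = &list->mms[list->head];

        list->head = (list->head + 1) % list->nr_mms;
        mm->scanned += rr_scan_budget(mm);
    }
    return visits;
}
```

Because the position in the ring persists between calls, a low-priority
call that only visits a few mm's still makes progress through the whole
list over time.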
Then, with something like the above, we just try to make sure that we scan
the whole virtual memory space every once in a while. Make the "every once
in a while" be some simple heuristic like "try to keep the active list to
less than 50% of all memory". So "try_to_free_memory()" would just start
off with something like
	/*
	 * Too many active pages? That implies that we don't have enough
	 * of a working set for page_launder() to do a good job. Start by
	 * walking the VM space..
	 */
	if (nr_active_pages > (total_pages >> 1))
		swap_out();

	/*
	 * This is where we actually free memory
	 */
	page_launder(..);
and we'd be all done. (And that "max 50% of all pages should be active"
number was taken out of my ass. AND the above will work really badly if
there is no swap-space, so it needs tweaking - think of it not as a hard
algorithm, but more as a "this is where I think we need to go").
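Reading the stated goal (keep the active list below 50% of all memory) as a
comparison against half of the total page count, the trigger can be written
as a tiny predicate. This is a sketch of the heuristic only; the name
should_scan_vm() is invented here, and the 50% figure is just as arbitrary
as the text says.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True when more than half of all pages sit on the active list, i.e.
 * when the VM space should be walked before laundering pages.
 * (Hypothetical helper; the threshold is the arbitrary 50% from the text.)
 */
bool should_scan_vm(unsigned long nr_active_pages, unsigned long total_pages)
{
    assert(nr_active_pages <= total_pages);
    return nr_active_pages > (total_pages >> 1);
}
```

With 100 pages in total, 50 active pages do not trigger a scan and 51 do.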
Advantage: it automatically does the right thing: if the reason for the
memory pressure is that we have lots of pages mapped, it will scan the VM
lists. If the reason is that we just have tons of pages cached, it won't
even bother to age the page tables.
Right now we have this cockamamy scheme to try to balance off the lists
against each other, and then at fairly random points we'll get to
"swap_out()" if we haven't found anything nice on the other lists. That's
just not the way to get nice MM behaviour.
I'll bet you $5 USD that the above approach will (a) work fairly and
(b) give much smoother behavior with a much more understandable swap-out
policy.
Of course, I've been wrong before. But I'd like somebody to take a look.
Anybody?
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-08 6:42 ` Subtle MM bug Linus Torvalds
@ 2001-01-08 13:11 ` Marcelo Tosatti
2001-01-08 16:42 ` Rik van Riel
2001-01-08 17:43 ` Linus Torvalds
2001-01-08 13:57 ` Stephen C. Tweedie
2001-01-08 16:45 ` Rik van Riel
2 siblings, 2 replies; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-08 13:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David S. Miller, Rik van Riel, linux-mm
On Sun, 7 Jan 2001, Linus Torvalds wrote:
> and just get rid of all the logic to try to "find the best mm". It's bogus
> anyway: we should get perfectly fair access patterns by just doing
> everything in round-robin, and each "swap_out_mm(mm)" would just try to
> walk some fixed percentage of the RSS size (say, something like
>
> count = (mm->rss >> 4)
>
> and be done with it.
I have the impression that a fixed percentage of the RSS will be a problem
when you have a memory hog (or hogs) running.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-08 6:42 ` Subtle MM bug Linus Torvalds
2001-01-08 13:11 ` Marcelo Tosatti
@ 2001-01-08 13:57 ` Stephen C. Tweedie
2001-01-08 17:29 ` Linus Torvalds
2001-01-08 16:45 ` Rik van Riel
2 siblings, 1 reply; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-08 13:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David S. Miller, Rik van Riel, Marcelo Tosatti, linux-mm
Hi,
On Sun, Jan 07, 2001 at 10:42:11PM -0800, Linus Torvalds wrote:
>
> and just get rid of all the logic to try to "find the best mm". It's bogus
> anyway: we should get perfectly fair access patterns by just doing
> everything in round-robin
Definitely.
> Then, with something like the above, we just try to make sure that we scan
> the whole virtual memory space every once in a while. Make the "every once
> in a while" be some simple heuristic like "try to keep the active list to
> less than 50% of all memory".
... which will produce an enormous storm of soft page faults for
workloads involving mmapping large amounts of data or where we have
a lot of space devoted to anonymous pages, such as static
computational workloads.
The idea of an inactive list target is sound, but it needs to be based
on memory pressure: we don't need anything like 50% if we aren't under
any pressure, so compute-bound workloads with large data sets can
achieve stability.
--Stephen
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-08 13:11 ` Marcelo Tosatti
@ 2001-01-08 16:42 ` Rik van Riel
2001-01-08 17:43 ` Linus Torvalds
1 sibling, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-08 16:42 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Linus Torvalds, David S. Miller, linux-mm
On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
> On Sun, 7 Jan 2001, Linus Torvalds wrote:
>
> > and just get rid of all the logic to try to "find the best mm". It's bogus
> > anyway: we should get perfectly fair access patterns by just doing
> > everything in round-robin, and each "swap_out_mm(mm)" would just try to
> > walk some fixed percentage of the RSS size (say, something like
> >
> > count = (mm->rss >> 4)
> >
> > and be done with it.
>
> I have the impression that a fixed percentage of the RSS will be
> a problem when you have a memory hog (or hogs) running.
My RSS ulimit enforcing patches solve this problem in a
very simple way.
If a process is exceeding its RSS limit, we scan ALL pages
from the process. Otherwise, we scan the normal percentage.
Furthermore, I have put a default soft RSS limit of half
of physical memory in the system. This means that when you
have one big runaway process, kswapd will be more aggressive
against that process than against others. The fact that it
is a soft limit, OTOH, means that the process can use all
the available memory if there is no memory pressure in the
system...
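The scanning policy described above can be sketched in a few lines of C.
This is only an illustrative model, not the code from the RSS ulimit
patches: the function names and parameters are hypothetical, all values
are page counts, and the 1/16 fraction is simply borrowed from the
"mm->rss >> 4" suggestion quoted earlier.

```c
#include <assert.h>

/* Hypothetical default soft RSS limit: half of physical memory. */
unsigned long default_soft_rss_limit(unsigned long total_pages)
{
	return total_pages / 2;
}

/*
 * How many of a process's resident pages one scanning pass walks.
 * Over the limit: every resident page. Otherwise: the normal
 * fraction (1/16 of the RSS, as in "mm->rss >> 4").
 */
unsigned long pages_to_scan(unsigned long rss, unsigned long rss_limit)
{
	assert(rss_limit > 0);
	if (rss > rss_limit)
		return rss;
	return rss >> 4;
}
```

Because scanning only happens when the system is reclaiming memory, a
process above the limit is not penalised while memory is plentiful,
which is what makes the limit "soft" in this model.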
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-08 6:42 ` Subtle MM bug Linus Torvalds
2001-01-08 13:11 ` Marcelo Tosatti
2001-01-08 13:57 ` Stephen C. Tweedie
@ 2001-01-08 16:45 ` Rik van Riel
2001-01-08 17:50 ` Linus Torvalds
2 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2001-01-08 16:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David S. Miller, Marcelo Tosatti, linux-mm
On Sun, 7 Jan 2001, Linus Torvalds wrote:
> 	/*
> 	 * Too many active pages? That implies that we don't have enough
> 	 * of a working set for page_launder() to do a good job. Start by
> 	 * walking the VM space..
> 	 */
> 	if ((nr_active_pages >> 1) > total_pages)
> 		swap_out();
>
> 	/*
> 	 * This is where we actually free memory
> 	 */
> 	page_launder(..);
Ahhh, but this is NOT the balancing problem we're trying to
pin down in 2.4...
The (possible) problem is in the balancing between swap_out()
and refill_inactive_scan().
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-08 5:29 Wayne Whitney
2001-01-08 5:42 ` Andi Kleen
@ 2001-01-08 17:16 ` Rik van Riel
2001-01-08 17:58 ` Linus Torvalds
2001-01-08 21:30 ` Wayne Whitney
1 sibling, 2 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-08 17:16 UTC (permalink / raw)
To: Wayne Whitney; +Cc: linux-kernel, Linus Torvalds, William A. Stein
On Sun, 7 Jan 2001, Wayne Whitney wrote:
> Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
How does 2.4 perform when you add an extra GB of swap ?
2.4 keeps dirty pages in the swap cache, so you will need
more swap to run the same programs...
Linus: is this something we want to keep or should we give
the user the option to run in a mode where swap space is
freed when we swap in something non-shared ?
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: Subtle MM bug
2001-01-08 13:57 ` Stephen C. Tweedie
@ 2001-01-08 17:29 ` Linus Torvalds
2001-01-08 18:10 ` Stephen C. Tweedie
0 siblings, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:29 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: David S. Miller, Rik van Riel, Marcelo Tosatti, linux-mm
On Mon, 8 Jan 2001, Stephen C. Tweedie wrote:
>
> > Then, with something like the above, we just try to make sure that we scan
> > the whole virtual memory space every once in a while. Make the "every once
> > in a while" be some simple heuristic like "try to keep the active list to
> > less than 50% of all memory".
>
> ... which will produce an enormous storm of soft page faults for
> workloads involving mmapping large amounts of data or where we have
> a lot of space devoted to anonymous pages, such as static
> computational workloads.
I don't think you'll find that in practice.
It would obviously trigger only on low-memory code _anyway_ (we don't even
get into "try_to_free_pages()" unless there is memory pressure), so I
think you're _completely_ off the mark here.
Remember: the thing doesn't require that < 50% of memory is in the page
tables. It only says: if 50% or more of memory is in the page tables, we
will always scan the page tables first when we try to find free pages.
If you have a well-behaving application that doesn't even have memory
pressure, but fills up >50% of memory in its VM, nothing will actually
happen in the steady state. It can have 99% of available memory, and not a
single soft page fault.
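A minimal sketch of that argument, assuming a hypothetical counter of
mapped pages and a flag saying whether the reclaim path has been entered
at all (neither name comes from the kernel source):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the 50% rule. The check is only reached when
 * the system is already trying to free pages, so an application that
 * maps 99% of memory without causing pressure is never scanned.
 * mapped_pages and total_pages are page counts.
 */
bool reclaim_scans_page_tables_first(bool memory_pressure,
				     unsigned long mapped_pages,
				     unsigned long total_pages)
{
	assert(mapped_pages <= total_pages);
	if (!memory_pressure)
		return false;
	/* Compare without dividing so odd totals are not rounded down. */
	return 2 * mapped_pages >= total_pages;
}
```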
But think about what happens if you now start up another application? And
think about what SHOULD happen. The 50% rule is perfectly fine: if we're
starting to swap, we're better off taking soft page faults that give us a
better LRU than letting the MM scrub the same pages over and over because
it effectively only sees a subset of the total pages (with the mapped
pages being "invisible").
The fact is, that we absolutely _have_ to do the VM scan in order for the
inactive lists to be at all representative of the state of affairs. If we
just rely on page_launder() and refill_inactive() as the #1 way to get
free pages, we will never consider anything but the pages that are already
on the lists.
Stephen: have you tried the behaviour of a working set that is dirty in
the VM's and slightly larger than available ram? Not pretty. We do
_really_ well on many loads, but this one we do badly on. And from what
I've been able to see so far, it's because we're just too damn good at
waiting on page_launder() and doing refill_inactive_scan().
There's another advantage to the 50% rule: if we are under memory
pressure, and somebody is dirtying pages in its VM (which is otherwise an
"invisible" event to the kernel), the 50% rule is much more likely to mean
that we actually _see_ the dirtying, and can slow it down.
Linus
* Re: Subtle MM bug
2001-01-08 13:11 ` Marcelo Tosatti
2001-01-08 16:42 ` Rik van Riel
@ 2001-01-08 17:43 ` Linus Torvalds
1 sibling, 0 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:43 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: David S. Miller, Rik van Riel, linux-mm
On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
>
> On Sun, 7 Jan 2001, Linus Torvalds wrote:
>
> > and just get rid of all the logic to try to "find the best mm". It's bogus
> > anyway: we should get perfectly fair access patterns by just doing
> > everything in round-robin, and each "swap_out_mm(mm)" would just try to
> > walk some fixed percentage of the RSS size (say, something like
> >
> > count = (mm->rss >> 4)
> >
> > and be done with it.
>
> I have the impression that a fixed percentage of the RSS will be a problem
> when you have a memory hog (or hogs) running.
Nothing but testing can prove it, but I don't think that's really an
issue.
Remember: we're not actually swapping stuff out any more in VM scanning.
We're just saying "we're low on memory, let's evict the page tables so
that we _could_ swap stuff out if necessary". We're going to have to evict
_something_, and walking the page tables really gives us a lot better
knowledge of WHAT to evict.
The cost of scanning the VM is (a) the cost of scanning itself and (b) the
cost of soft-faults and CPU TLB invalidate cross-calls for the scanning.
Both of which might be noticeable - but I have this fairly strong feeling
that neither of them is big enough to offset the cost of paging out the
wrong page. Which we definitely do now - I've got some simple
test-programs that have a VM footprint that is not _that_ much more than
the available memory, and they _really_ show problems.
(The "lots of dirty pages" case is not the common case under most loads,
so the fact that 2.4.0 has some performance problems with it was not a
show-stopper for me - during my testing with low memory most loads were
very nice indeed).
Linus
* Re: Subtle MM bug
2001-01-08 6:04 ` Linus Torvalds
@ 2001-01-08 17:44 ` Rik van Riel
2001-01-08 18:02 ` Linus Torvalds
0 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2001-01-08 17:44 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
On 7 Jan 2001, Linus Torvalds wrote:
> That doesn't resolve the "2.4.x behaves badly" thing, though.
>
> I've seen that one myself, and it seems to be simply due to the
> fact that we're usually so good at getting memory from
> page_launder() that we never bother to try to swap stuff out.
> And when we _do_ start swapping stuff out it just moves to the
> dirty list, and page_launder() will take care of it.
>
> So far so good. The problem appears to be that we don't swap
> stuff out smoothly: we start doing the VM scanning, but when we
> get enough dirty pages, we'll let it be, and go back to
> page_launder() again. Which means that we don't walk through the
> whole VM space, we just do some "spot cleaning".
You are right in that we need to refill the inactive list
before calling page_launder(), but we'll also need a few
other modifications:
1. adopt the latest FreeBSD tactic in page_launder()
   - mark dirty pages we see but don't flush
   - in the first loop, flush up to maxlaunder of the
     already seen dirty pages
   - in the second loop, flush as many pages as we
     need to refill the free&inactive_clean list
2. go back to having a _static_ free target, at
   max(freepages.high, SUM(zone->pages_high)) ... this
   means free_shortage() will never be very big
3. keep track of how many pages we need to free in
   page_launder() and subtract one from the target
   when we submit a page for IO ... no need to flush
   20MB of dirty pages when we only need 1MB of pages
   cleaned
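Items 2 and 3 above are plain arithmetic and can be modelled outside the
kernel. The sketch below is not taken from the patches mentioned in this
mail; the function names are hypothetical and every quantity is a page
count.

```c
#include <assert.h>
#include <stddef.h>

/* Item 2: static free target = max(freepages.high, SUM(zone->pages_high)). */
unsigned long static_free_target(unsigned long freepages_high,
				 const unsigned long *zone_pages_high,
				 size_t nr_zones)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < nr_zones; i++)
		sum += zone_pages_high[i];
	return sum > freepages_high ? sum : freepages_high;
}

/*
 * Item 3: stop submitting dirty pages for IO once the cleaning target
 * is met. Returns how many pages were submitted.
 */
unsigned long pages_submitted_for_io(unsigned long dirty_pages,
				     unsigned long target)
{
	unsigned long submitted = 0;

	while (target > 0 && submitted < dirty_pages) {
		submitted++;	/* submit one page for IO ... */
		target--;	/* ... and subtract one from the target */
	}
	return submitted;
}
```

With 4 kB pages, 20MB of dirty pages is 5120 pages and a 1MB target is
256 pages, so this model submits 256 pages and leaves the rest alone.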
I have these things in my local tree and it seems to smooth
out the load quite well for a very large haskell run and for
the fillmem program from Juan Quintela's memtest suite.
When combined with your idea of refilling the freelist _first_,
we should be able to get the VM quite a bit smoother under loads
with lots of dirty pages.
I will work on this while travelling to and being in Australia.
Expect a clean patch to fix this problem once the 2.4 bugfix-only
period is over.
Other people on this list are invited to apply the VM patches from
my home page and give them a good beating. I want to be able to
submit a well-tested, known-good patch to Linus once 2.4 is out of
the bugfix-only period...
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-08 16:45 ` Rik van Riel
@ 2001-01-08 17:50 ` Linus Torvalds
2001-01-08 18:21 ` Rik van Riel
0 siblings, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:50 UTC (permalink / raw)
To: Rik van Riel; +Cc: David S. Miller, Marcelo Tosatti, linux-mm
On Mon, 8 Jan 2001, Rik van Riel wrote:
> On Sun, 7 Jan 2001, Linus Torvalds wrote:
>
> > 	/*
> > 	 * Too many active pages? That implies that we don't have enough
> > 	 * of a working set for page_launder() to do a good job. Start by
> > 	 * walking the VM space..
> > 	 */
> > 	if ((nr_active_pages >> 1) > total_pages)
> > 		swap_out();
> >
> > 	/*
> > 	 * This is where we actually free memory
> > 	 */
> > 	page_launder(..);
>
> Ahhh, but this is NOT the balancing problem we're trying to
> pin down in 2.4...
>
> The (possible) problem is in the balancing between swap_out()
> and refill_inactive_scan().
That _is_ the problem the above will fix. Don't read "page_launder()"
there: it's more meant to be "this is the old code that does
page_launder() etc.."
Trust me. Try my code. It will work.
Linus
* Re: Subtle MM bug
2001-01-08 17:16 ` Rik van Riel
@ 2001-01-08 17:58 ` Linus Torvalds
2001-01-08 23:41 ` Zlatko Calusic
2001-01-08 21:30 ` Wayne Whitney
1 sibling, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:58 UTC (permalink / raw)
To: Rik van Riel; +Cc: Wayne Whitney, linux-kernel, William A. Stein
On Mon, 8 Jan 2001, Rik van Riel wrote:
> On Sun, 7 Jan 2001, Wayne Whitney wrote:
>
> > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
>
> > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
>
> How does 2.4 perform when you add an extra GB of swap ?
>
> 2.4 keeps dirty pages in the swap cache, so you will need
> more swap to run the same programs...
>
> Linus: is this something we want to keep or should we give
> the user the option to run in a mode where swap space is
> freed when we swap in something non-shared ?
I'd prefer just documenting it and keeping it. I'd hate to have two fairly
different modes of behaviour. It's always been the suggested "twice the
amount of RAM", although there's historically been the "Linux doesn't
really need that much" that we just killed with 2.4.x.
If you have 512MB of RAM, you can probably afford another 40GB or so of
harddisk. They are disgustingly cheap these days.
Linus
* Re: Subtle MM bug
2001-01-08 17:44 ` Rik van Riel
@ 2001-01-08 18:02 ` Linus Torvalds
0 siblings, 0 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 18:02 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel
On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> You are right in that we need to refill the inactive list
> before calling page_launder(), but we'll also need a few
> other modifications:
NONE of your three additions do _anything_ to help us at all if we don't
even see the dirty bit because the page is on the active list and the
dirty bit is in somebody's VM space.
I agree that they look ok, but they are all complicating the code. I
propose getting rid of complications, and getting rid of the precarious
"when do we actually scan the VM tables" balancing issue.
Quite frankly, I'd rather see somebody try the vmscan stuff FIRST. Your
suggestions look fine, but apart from the "let dirty pages go twice
through the list" they look like tweaks that would need re-tweaking after
the balancing stuff is ripped out.
Linus
* Re: Subtle MM bug
2001-01-08 17:29 ` Linus Torvalds
@ 2001-01-08 18:10 ` Stephen C. Tweedie
2001-01-08 21:52 ` Marcelo Tosatti
0 siblings, 1 reply; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-08 18:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel,
Marcelo Tosatti, linux-mm
On Mon, Jan 08, 2001 at 09:29:15AM -0800, Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Stephen C. Tweedie wrote:
> If you have a well-behaving application that doesn't even have memory
> pressure, but fills up >50% of memory in its VM, nothing will actually
> happen in the steady state. It can have 99% of available memory, and not a
> single soft page fault.
Agreed, but that's not how I read your statement about scanning the VM
regularly. The problem happens if you are working happily with enough
free memory and you suddenly need a large amount of allocation: having
some relatively up-to-date page age information may give you a _much_
better idea of what to page out.
Rik was going to experiment with this --- Rik, do you have any hard
numbers for the benefit of maintaining a background page aging task?
> But think about what happens if you now start up another application? And
> think about what SHOULD happen. The 50% rule is perfectly fine:
Right, I interpreted your 50% as a steady-state limit.
> Stephen: have you tried the behaviour of a working set that is dirty in
> the VM's and slightly larger than available ram? Not pretty.
Yes, and this is something that Marcelo's swap clustering code ought
to be ideal for.
> _really_ well on many loads, but this one we do badly on. And from what
> I've been able to see so far, it's because we're just too damn good at
> waiting on page_launder() and doing refill_inactive_scan().
do_try_to_free_pages() is trying to
	/*
	 * If needed, we move pages from the active list
	 * to the inactive list. We also "eat" pages from
	 * the inode and dentry cache whenever we do this.
	 */
	if (free_shortage() || inactive_shortage()) {
		shrink_dcache_memory(6, gfp_mask);
		shrink_icache_memory(6, gfp_mask);
		ret += refill_inactive(gfp_mask, user);
	} else {
So we're refilling the inactive list regardless of its current size
whenever free_shortage() is true. In the situation you describe,
there's no point refilling the inactive list too far beyond the
ability of the swapper to launder it, regardless of whether
free_shortage() is set.
refill_inactive contains exactly the opposite logic: it breaks out if
		/*
		 * If we either have enough free memory, or if
		 * page_launder() will be able to make enough
		 * free memory, then stop.
		 */
		if (!inactive_shortage() || !free_shortage())
			goto done;
but that still means that we're doing unnecessary inactive list
refilling whenever free_shortage() is true: this test only occurs
after we've tried at least one swap_out(). We're calling
refill_inactive if either condition is true, but we're staying inside
it only if both conditions are true.
Shouldn't we really just be making the refill_inactive() here depend
on inactive_shortage() alone, not free_shortage()? By refilling the
inactive list too aggressively we actually end up discarding aging
information which might be of use to us.
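The mismatch between the two tests can be written out as boolean
functions. This is only a model of the conditions quoted above, with
hypothetical function names; the two flags stand for non-zero results
from free_shortage() and inactive_shortage().

```c
#include <assert.h>
#include <stdbool.h>

/* Entry test in do_try_to_free_pages(): either shortage is enough. */
bool enters_refill(bool free_short, bool inactive_short)
{
	return free_short || inactive_short;
}

/*
 * Continuation test inside refill_inactive(): the loop breaks out if
 * (!inactive_shortage() || !free_shortage()), so it keeps going only
 * while both shortages hold.
 */
bool keeps_refilling(bool free_short, bool inactive_short)
{
	return !(!inactive_short || !free_short);
}

/* Proposed entry test: depend on inactive_shortage() alone. */
bool enters_refill_proposed(bool free_short, bool inactive_short)
{
	(void)free_short;	/* deliberately ignored */
	return inactive_short;
}
```

The problematic case is free_short true and inactive_short false: the
current entry test lets refill_inactive() run at least one pass even
though the inactive list is already large enough, while the proposed
test skips it.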
Rik, any thoughts? This looks as if it's destroying any hope of
maintaining the intended inactive_shortage() targets.
--Stephen
* Re: Subtle MM bug
2001-01-08 17:50 ` Linus Torvalds
@ 2001-01-08 18:21 ` Rik van Riel
2001-01-08 18:38 ` Linus Torvalds
0 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2001-01-08 18:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David S. Miller, Marcelo Tosatti, linux-mm
On Mon, 8 Jan 2001, Linus Torvalds wrote:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
> > On Sun, 7 Jan 2001, Linus Torvalds wrote:
> >
> > > 	/*
> > > 	 * Too many active pages? That implies that we don't have enough
> > > 	 * of a working set for page_launder() to do a good job. Start by
> > > 	 * walking the VM space..
> > > 	 */
> > > 	if ((nr_active_pages >> 1) > total_pages)
> > > 		swap_out();
> That _is_ the problem the above will fix. Don't read
> "page_launder()" there: it's more meant to be "this is the old
> code that does page_launder() etc.."
>
> Trust me. Try my code. It will work.
Except for the small detail that pages inside the processes
are often not on the active list ;)
But I agree with your idea that we really should make sure
we have enough pages available to choose from when swapping
stuff out.
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-08 18:21 ` Rik van Riel
@ 2001-01-08 18:38 ` Linus Torvalds
0 siblings, 0 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-08 18:38 UTC (permalink / raw)
To: Rik van Riel; +Cc: David S. Miller, Marcelo Tosatti, linux-mm
On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> > That _is_ the problem the above will fix. Don't read
> > "page_launder()" there: it's more meant to be "this is the old
> > code that does page_launder() etc.."
> >
> > Trust me. Try my code. It will work.
>
> Except for the small detail that pages inside the processes
> are often not on the active list ;)
Yes, you're right - we don't have a good counter to test right now.
That's actually fairly nasty. We can't even use the "reverse" test,
because while we can make it do something like
if (nr_inactive + nr_inactive_dirty < X %)
that won't pick up on things like the dentry and inode caches, so that
would be wrong too.
We would really need to count the number of mapped anonymous pages to get
this right. Damn. That makes it harder than I thought.
(Hmm.. Increment counter in "do_anonymous_page()" and "do_wp_page()".
Decrement in "add_to_swap_cache()". Decrement in "free_pte()" for the
!page->mapping case. Test. Find the places I forgot. Maybe it's not that
bad, after all).
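As a rough user-space model of the bookkeeping sketched above: the
counter and helper names below are hypothetical, and a real kernel
version would also need locking or atomic updates, which this sketch
leaves out.

```c
#include <assert.h>

/* Hypothetical count of mapped anonymous pages. */
static long nr_mapped_anon;

/* Would be called from do_anonymous_page() and do_wp_page(). */
void anon_page_mapped(void)
{
	nr_mapped_anon++;
}

/*
 * Would be called from add_to_swap_cache(), and from free_pte()
 * in the !page->mapping case.
 */
void anon_page_unmapped(void)
{
	assert(nr_mapped_anon > 0);
	nr_mapped_anon--;
}

/* Current value, for use in a percentage test against total memory. */
long mapped_anon_pages(void)
{
	return nr_mapped_anon;
}
```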
Linus
* Re: Subtle MM bug
@ 2001-01-08 20:39 Szabolcs Szakacsits
2001-01-08 21:56 ` Wayne Whitney
2001-01-08 22:00 ` Wayne Whitney
0 siblings, 2 replies; 128+ messages in thread
From: Szabolcs Szakacsits @ 2001-01-08 20:39 UTC (permalink / raw)
To: linux-kernel; +Cc: Andi Kleen, Wayne Whitney
Andi Kleen <ak@suse.de> wrote:
> On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> > package called MAGMA; at times this requires very large matrices. The
> > RSS can get up to 870MB; for some reason a MAGMA process under linux
> > thinks it has run out of memory at 870MB, regardless of the actual
> > memory/swap in the machine. MAGMA is single-threaded.
> I think it's caused by the way malloc maps its memory.
> Newer glibc should work a bit better by falling back to mmap even
> for smaller allocations (older does it only for very big ones)
AFAIK newer glibc = CVS glibc but the malloc() tune parameters
work via environment variables for the current stable ones as well,
e.g. to overcome the above "out of memory" one could do,
% export MALLOC_MMAP_MAX_=1000000
% export MALLOC_MMAP_THRESHOLD_=0
% magma
By default, on 32-bit Linux, the current stable glibc malloc uses brk
in the range 0x08??????-0x40000000, plus at most 128 mmaps
(MALLOC_MMAP_MAX_) for requests larger than 128 kB
(MALLOC_MMAP_THRESHOLD_). If MAGMA mallocs memory in chunks smaller
than 128 kB, then the above out-of-memory behaviour is expected.
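For a program that can be rebuilt, the same tuning can be requested from
inside the process with glibc's mallopt(). The sketch below assumes
glibc; the wrapper name prefer_mmap_for_malloc() is made up for
illustration, and the values simply mirror the environment variables
shown above.

```c
#include <assert.h>
#include <malloc.h>

/*
 * Ask glibc malloc to satisfy requests with mmap() rather than brk().
 * mallopt() returns 1 on success and 0 on error; this wrapper returns
 * 1 only if both settings were accepted.
 */
int prefer_mmap_for_malloc(void)
{
	/* Counterpart of MALLOC_MMAP_THRESHOLD_=0 */
	if (!mallopt(M_MMAP_THRESHOLD, 0))
		return 0;
	/* Counterpart of MALLOC_MMAP_MAX_=1000000 */
	return mallopt(M_MMAP_MAX, 1000000) != 0;
}
```

The call would have to happen early, before the large allocations are
made.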
Szaka
* Re: Subtle MM bug
2001-01-08 17:16 ` Rik van Riel
2001-01-08 17:58 ` Linus Torvalds
@ 2001-01-08 21:30 ` Wayne Whitney
1 sibling, 0 replies; 128+ messages in thread
From: Wayne Whitney @ 2001-01-08 21:30 UTC (permalink / raw)
To: Rik van Riel; +Cc: LKML, Linus Torvalds, William A. Stein
On Mon, 8 Jan 2001, Rik van Riel wrote:
> How does 2.4 perform when you add an extra GB of swap ?
OK, some more data:
First, I tried booting 2.4.0 with "nosmp" to see if the behavior I observe
is SMP related. It isn't: there was no difference under 2.4.0 between
512MB/512MB/1CPU and 512MB/512MB/2CPUs.
Second, I tried going to 2GB of swap with 2.4.0, so 512MB/2GB/2CPUs.
Again, there is no difference: as soon as swapping begins with two MAGMA
processes, interactivity suffers. I notice that while swapping in this
situation, the HD light is blinking only intermittently.
I also tried logging in to a fourth VT during this second test, and it got
nowhere. In fact, this stopped the top updates completely and the HD
light also stopped. After 30 seconds of nothing (all I could do is switch
VT's), I gave up and sent a ^Z to one MAGMA process; this eventually was
received, and the system immediately recovered.
Perhaps there is some sort of I/O starvation triggered by two swapping
processes?
Again, under 2.2.19pre6, the exact same tests yield hardly any loss of
interactivity, I can log in fine (a little slowly) during the top / two
MAGMA process test. And once swapping begins, the HD light is continually
lit.
Again, I'd be happy to do any additional tests, provide more info about my
machine, etc.
Cheers,
Wayne
* Re: Subtle MM bug
2001-01-08 18:10 ` Stephen C. Tweedie
@ 2001-01-08 21:52 ` Marcelo Tosatti
2001-01-09 0:28 ` Linus Torvalds
0 siblings, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-08 21:52 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linus Torvalds, David S. Miller, Rik van Riel, linux-mm
On Mon, 8 Jan 2001, Stephen C. Tweedie wrote:
> > _really_ well on many loads, but this one we do badly on. And from what
> > I've been able to see so far, it's because we're just too damn good at
> > waiting on page_launder() and doing refill_inactive_scan().
>
> do_try_to_free_pages() is trying to
>
> 	/*
> 	 * If needed, we move pages from the active list
> 	 * to the inactive list. We also "eat" pages from
> 	 * the inode and dentry cache whenever we do this.
> 	 */
> 	if (free_shortage() || inactive_shortage()) {
> 		shrink_dcache_memory(6, gfp_mask);
> 		shrink_icache_memory(6, gfp_mask);
> 		ret += refill_inactive(gfp_mask, user);
> 	} else {
>
> So we're refilling the inactive list regardless of its current size
> whenever free_shortage() is true. In the situation you describe,
> there's no point refilling the inactive list too far beyond the
> ability of the swapper to launder it, regardless of whether
> free_shortage() is set.
Agreed.
After some fights, Rik and I agreed on doing a per-zone inactive shortage
check in inactive_shortage().
This allows us to check _only_ for inactive_shortage() before calling
refill_inactive().
>
> refill_inactive contains exactly the opposite logic: it breaks out if
>
> 		/*
> 		 * If we either have enough free memory, or if
> 		 * page_launder() will be able to make enough
> 		 * free memory, then stop.
> 		 */
> 		if (!inactive_shortage() || !free_shortage())
> 			goto done;
>
> but that still means that we're doing unnecessary inactive list
> refilling whenever free_shortage() is true: this test only occurs
> after we've tried at least one swap_out(). We're calling
> refill_inactive if either condition is true, but we're staying inside
> it only if both conditions are true.
>
> Shouldn't we really just be making the refill_inactive() here depend
> on inactive_shortage() alone, not free_shortage()? By refilling the
> inactive list too aggressively we actually end up discarding aging
> information which might be of use to us.
Yes.
I've removed the free_shortage() check from refill_inactive() in the patch.
Comments are welcome.
--- linux.orig/mm/vmscan.c Thu Jan 4 02:45:26 2001
+++ linux/mm/vmscan.c Mon Jan 8 20:43:59 2001
@@ -808,6 +808,9 @@
 int inactive_shortage(void)
 {
 	int shortage = 0;
+	pg_data_t *pgdat = pgdat_list;
+
+	/* Is the inactive dirty list too small? */
 
 	shortage += freepages.high;
 	shortage += inactive_target;
@@ -818,7 +821,27 @@
 	if (shortage > 0)
 		return shortage;
 
-	return 0;
+	/* If not, do we have enough per-zone pages on the inactive list? */
+
+	shortage = 0;
+
+	do {
+		int i;
+		for(i = 0; i < MAX_NR_ZONES; i++) {
+			int zone_shortage;
+			zone_t *zone = pgdat->node_zones+ i;
+
+			zone_shortage = zone->pages_high;
+			zone_shortage -= zone->inactive_dirty_pages;
+			zone_shortage -= zone->inactive_clean_pages;
+			zone_shortage -= zone->free_pages;
+			if (zone_shortage > 0)
+				shortage += zone_shortage;
+		}
+		pgdat = pgdat->node_next;
+	} while (pgdat);
+
+	return shortage;
 }
 
 /*
@@ -861,12 +884,13 @@
 		}
 
 		/*
-		 * don't be too light against the d/i cache since
-		 * refill_inactive() almost never fail when there's
-		 * really plenty of memory free.
+		 * Only free memory from i/d caches if we
+		 * are low on memory.
 		 */
-		shrink_dcache_memory(priority, gfp_mask);
-		shrink_icache_memory(priority, gfp_mask);
+		if(free_shortage()) {
+			shrink_dcache_memory(priority, gfp_mask);
+			shrink_icache_memory(priority, gfp_mask);
+		}
 
 		/*
 		 * Then, try to page stuff out..
@@ -878,11 +902,10 @@
 		}
 
 		/*
-		 * If we either have enough free memory, or if
-		 * page_launder() will be able to make enough
+		 * If page_launder() will be able to make enough
 		 * free memory, then stop.
 		 */
-		if (!inactive_shortage() || !free_shortage())
+		if (!inactive_shortage())
 			goto done;
 
 		/*
@@ -922,14 +945,20 @@
 
 	/*
 	 * If needed, we move pages from the active list
-	 * to the inactive list. We also "eat" pages from
-	 * the inode and dentry cache whenever we do this.
+	 * to the inactive list.
+	 */
+	if (inactive_shortage())
+		ret += refill_inactive(gfp_mask, user);
+
+	/*
+	 * Delete pages from the inode and dentry cache
+	 * if memory is low.
 	 */
-	if (free_shortage() || inactive_shortage()) {
+	if (free_shortage()) {
 		shrink_dcache_memory(6, gfp_mask);
 		shrink_icache_memory(6, gfp_mask);
-		ret += refill_inactive(gfp_mask, user);
-	} else {
+	} else {
+
 		/*
 		 * Reclaim unused slab cache memory.
 		 */
* Re: Subtle MM bug
2001-01-08 20:39 Szabolcs Szakacsits
@ 2001-01-08 21:56 ` Wayne Whitney
2001-01-08 23:22 ` Wayne Whitney
2001-01-08 22:00 ` Wayne Whitney
1 sibling, 1 reply; 128+ messages in thread
From: Wayne Whitney @ 2001-01-08 21:56 UTC (permalink / raw)
To: Szabolcs Szakacsits; +Cc: LKML
On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:
> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> via environment variables for the current stable ones as well,
Hmm, this must have been introduced in libc6? Unfortunately, I don't have
the source code to MAGMA, and the binary I have is statically linked. It
does not contain the names of the environment variables you mentioned.
I'll arrange a binary linked against glibc2.2, and then your suggestion
will hopefully do the trick. Thanks for your kind help!
Cheers,
Wayne
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: Subtle MM bug
2001-01-08 20:39 Szabolcs Szakacsits
2001-01-08 21:56 ` Wayne Whitney
@ 2001-01-08 22:00 ` Wayne Whitney
2001-01-08 22:15 ` Andrea Arcangeli
1 sibling, 1 reply; 128+ messages in thread
From: Wayne Whitney @ 2001-01-08 22:00 UTC (permalink / raw)
To: Szabolcs Szakacsits; +Cc: LKML, Andi Kleen
On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:
> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> via environment variables for the current stable ones as well, e.g. to
> overcome the above "out of memory" one could do,
>
> % export MALLOC_MMAP_MAX_=1000000
> % export MALLOC_MMAP_THRESHOLD_=0
> % magma
As I just mentioned, I haven't been able to test this yet due to my
current binary being linked against an older libc which doesn't seem to
have these parameters. But here's one other data point, I just thought
I'd ask if this jives with your theory: if I configure the linux kernel
to be able to use 2GB of RAM, then the 870MB limit becomes much lower, to
230MB.
Cheers, Wayne
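For a program that can be rebuilt, the same tuning can be requested from inside the process instead of through the environment. The following is a minimal sketch, assuming glibc's mallopt() interface from <malloc.h>; the function name prefer_mmap_allocations() is made up for illustration, and the values simply mirror the two environment variables quoted above. It has no effect on a binary that does not use glibc's malloc.

```c
#include <malloc.h>   /* glibc-specific: mallopt(), M_MMAP_MAX, M_MMAP_THRESHOLD */
#include <stdlib.h>

/*
 * Ask glibc's malloc to serve requests with mmap() instead of growing
 * the brk heap. Hypothetical helper name; returns 1 if both settings
 * were accepted and 0 otherwise.
 */
int prefer_mmap_allocations(void)
{
	/* Allow up to 1,000,000 mmap()ed chunks (mirrors MALLOC_MMAP_MAX_). */
	if (mallopt(M_MMAP_MAX, 1000000) != 1)
		return 0;
	/* Use mmap() for every request size (mirrors MALLOC_MMAP_THRESHOLD_=0). */
	if (mallopt(M_MMAP_THRESHOLD, 0) != 1)
		return 0;
	return 1;
}
```

The call would have to happen early in main(), before the large allocations are made.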
* Re: Subtle MM bug
2001-01-08 22:00 ` Wayne Whitney
@ 2001-01-08 22:15 ` Andrea Arcangeli
0 siblings, 0 replies; 128+ messages in thread
From: Andrea Arcangeli @ 2001-01-08 22:15 UTC (permalink / raw)
To: Wayne Whitney; +Cc: Szabolcs Szakacsits, LKML, Andi Kleen
On Mon, Jan 08, 2001 at 02:00:19PM -0800, Wayne Whitney wrote:
> I'd ask if this jives with your theory: if I configure the linux kernel
> to be able to use 2GB of RAM, then the 870MB limit becomes much lower, to
> 230MB.
That's because the virtual address space for userspace tasks gets reduced
from 3G to 2G to give the kernel an additional gigabyte of direct mapping.
The other limit you hit (at around 800MB) is also partly due to the
userspace virtual address space being too small.
You can use this hack by me to allow the tasks to grow up to 3.5G per task on
IA32 on 2.4.0 (equivalent hack exists for 2.2.19pre6aa1 with bigmem, btw it
makes sense also without bigmem if you have lots of swap, that's all about
virtual memory not physical RAM). However it doesn't work with PAE enabled
yet.
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test11-pre5/per-process-3.5G-IA32-no-PAE-1
If you run your program on any 64bit architecture (in 64bit userspace mode)
supported by linux, you won't run into those per-process address space limits.
Andrea
* Re: Subtle MM bug
2001-01-08 21:56 ` Wayne Whitney
@ 2001-01-08 23:22 ` Wayne Whitney
2001-01-08 23:30 ` Andrea Arcangeli
0 siblings, 1 reply; 128+ messages in thread
From: Wayne Whitney @ 2001-01-08 23:22 UTC (permalink / raw)
To: Szabolcs Szakacsits; +Cc: LKML, Andi Kleen
On Mon, 8 Jan 2001, Wayne Whitney wrote:
> On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:
>
> > AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> > via environment variables for the current stable ones as well,
>
> I'll arrange a binary linked against glibc2.2, and then your suggestion
> will hopefully do the trick. Thanks for your kind help!
OK, I now have a binary dynamically linked against /lib/libc.so.6,
(according to ldd), and that points to glibc-2.1.92. And I tried setting
the environment variables you suggested, I checked that they are set and
checked that they appear in /lib/libc.so.6. But the behaviour is
unchanged: MAGMA still hits this barrier at 830M (not 870M, that was a
typo).
I guess I conclude that either (1) MAGMA does not use libc's malloc
(checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
variables but has not yet implemented the tuning (I'll try glibc-2.2) or
(3) this is not the problem.
I'll look at Andrea's hack as well. Thanks for everybody's help!
Cheers, Wayne
* Re: Subtle MM bug
2001-01-08 23:22 ` Wayne Whitney
@ 2001-01-08 23:30 ` Andrea Arcangeli
2001-01-09 0:37 ` Linus Torvalds
0 siblings, 1 reply; 128+ messages in thread
From: Andrea Arcangeli @ 2001-01-08 23:30 UTC (permalink / raw)
To: Wayne Whitney; +Cc: Szabolcs Szakacsits, LKML, Andi Kleen
On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
> I guess I conclude that either (1) MAGMA does not use libc's malloc
> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
> (3) this is not the problem.
You should monitor the program with strace while it fails (last few syscalls).
You can breakpoint at exit() and run `cat /proc/pid/maps` to show us the vma
layout of the task. Then we'll see why it's failing. With CONFIG_1G in 2.2.x
or 2.4.x (configuration option doesn't matter) you should at least reach
something like 1.5G.
Andrea
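When attaching a debugger is inconvenient, a program that can be modified can dump its own mappings at the point of failure. This is a minimal sketch, assuming a Linux /proc filesystem; dump_own_maps() is a made-up helper name and not something from MAGMA or the kernel.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the calling process's memory map (/proc/self/maps) to `out`.
 * Returns the number of mappings printed, or -1 if the file cannot
 * be opened.
 */
int dump_own_maps(FILE *out)
{
	FILE *maps = fopen("/proc/self/maps", "r");
	char line[512];
	int count = 0;

	if (!maps)
		return -1;
	while (fgets(line, sizeof line, maps)) {
		fputs(line, out);
		/* A long line may arrive in pieces; count it once, at its newline. */
		if (strchr(line, '\n'))
			count++;
	}
	fclose(maps);
	return count;
}
```

Calling dump_own_maps(stderr) right after the failing allocation shows where the heap ends and which mapping sits directly above it.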
* Re: Subtle MM bug
2001-01-08 17:58 ` Linus Torvalds
@ 2001-01-08 23:41 ` Zlatko Calusic
2001-01-09 2:58 ` Linus Torvalds
2001-01-09 6:20 ` Eric W. Biederman
0 siblings, 2 replies; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-08 23:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Rik van Riel, linux-kernel
Linus Torvalds <torvalds@transmeta.com> writes:
> On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> > On Sun, 7 Jan 2001, Wayne Whitney wrote:
> >
> > > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> >
> > > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
> >
> > How does 2.4 perform when you add an extra GB of swap ?
> >
> > 2.4 keeps dirty pages in the swap cache, so you will need
> > more swap to run the same programs...
> >
> > Linus: is this something we want to keep or should we give
> > the user the option to run in a mode where swap space is
> > freed when we swap in something non-shared ?
>
> I'd prefer just documenting it and keeping it. I'd hate to have two fairly
> different modes of behaviour. It's always been the suggested "twice the
> amount of RAM", although there's historically been the "Linux doesn't
> really need that much" that we just killed with 2.4.x.
>
> If you have 512MB of RAM, you can probably afford another 40GB or so of
> harddisk. They are disgustingly cheap these days.
>
Yes, but a lot more data on the swap also means degraded performance,
because the disk head has to seek around in the much bigger area. Are
you sure this is all OK?
--
Zlatko
* Re: Subtle MM bug
2001-01-09 0:28 ` Linus Torvalds
@ 2001-01-08 23:49 ` Marcelo Tosatti
2001-01-09 3:12 ` Linus Torvalds
0 siblings, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-08 23:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Mon, 8 Jan 2001, Linus Torvalds wrote:
>
> On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
> >
> > I've removed the free_shortage() of refill_inactive() in the patch.
> >
> > Comments are welcome.
>
> One comment: why does refill_inactive() do the shrink_dcache_memory() at
> all? Why not just remove that?
>
> do_try_to_free_pages() will do that, and that's where it makes more sense
> (shrinking the dcache/icache has absolutely nothing to do with the
> inactive list).
Right. kmem_cache_reap() should not be there either.
> Also, we should probably remove the "made_progress" and "count--" from the
> swap_out() case, as swap_out() hasn't actually caused pages to be free'd
> in a long time..
Indeed.
Are you lazy enough to ask me to regenerate the patch, or can you do it
yourself? :)
* Re: Subtle MM bug
2001-01-08 21:52 ` Marcelo Tosatti
@ 2001-01-09 0:28 ` Linus Torvalds
2001-01-08 23:49 ` Marcelo Tosatti
0 siblings, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 0:28 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
>
> I've removed the free_shortage() of refill_inactive() in the patch.
>
> Comments are welcome.
One comment: why does refill_inactive() do the shrink_dcache_memory() at
all? Why not just remove that?
do_try_to_free_pages() will do that, and that's where it makes more sense
(shrinking the dcache/icache has absolutely nothing to do with the
inactive list).
Historical code?
Also, we should probably remove the "made_progress" and "count--" from the
swap_out() case, as swap_out() hasn't actually caused pages to be free'd
in a long time..
Linus
* Re: Subtle MM bug
2001-01-08 23:30 ` Andrea Arcangeli
@ 2001-01-09 0:37 ` Linus Torvalds
0 siblings, 0 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 0:37 UTC (permalink / raw)
To: linux-kernel
In article <20010109003002.L27646@athlon.random>,
Andrea Arcangeli <andrea@suse.de> wrote:
>On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
>> I guess I conclude that either (1) MAGMA does not use libc's malloc
>> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
>> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
>> (3) this is not the problem.
>
>You should monitor the program with strace while it fails (last few syscalls).
>You can breakpoint at exit() and run `cat /proc/pid/maps` to show us the vma
>layout of the task. Then we'll see why it's failing. With CONFIG_1G in 2.2.x
>or 2.4.x (configuration option doesn't matter) you should at least reach
>something like 1.5G.
It might be doing its own memory management with brk() directly - some
older UNIX programs will do that (for various reasons - it can be faster
than malloc() etc if you know your access patterns, for example).
If you do that, and you have shared libraries, you'll get a failure
around the point Wayne sees it.
But your suggestion to check with strace is a good one.
Linus
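A rough illustration of why a brk()-based allocator stops short: the break can only grow upward until it runs into the first mmap()ed region. The constants below are assumptions about the usual i386 layout with the default 3G/1G split (executable loaded near 0x08048000, mmap area starting at one third of the user address space); they are not taken from this thread, and brk_room_bytes() and brk_alloc() are made-up names.

```c
#define _DEFAULT_SOURCE 1   /* for sbrk() under strict -std= modes */
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Assumed i386 layout, default 3G/1G split (illustrative values only). */
#define ASSUMED_TASK_SIZE UINT64_C(0xC0000000)
#define ASSUMED_ELF_BASE  UINT64_C(0x08048000)

/* Bytes between the executable's load address and the assumed mmap base. */
uint64_t brk_room_bytes(void)
{
	uint64_t mmap_base = ASSUMED_TASK_SIZE / 3;   /* 0x40000000 */

	return mmap_base - ASSUMED_ELF_BASE;
}

/*
 * Bump allocator on top of sbrk(): returns the old break, or NULL once
 * the break cannot be moved any further.
 */
void *brk_alloc(size_t bytes)
{
	void *old = sbrk((intptr_t)bytes);

	return old == (void *)-1 ? NULL : old;
}
```

Under these assumptions the room works out to a little under 896MB, from which the program's own text, data and bss still have to be subtracted. That would be in the neighbourhood of the barrier Wayne reports, but it is only a plausibility check, not a diagnosis.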
* Re: Subtle MM bug
2001-01-07 21:37 ` Rik van Riel
@ 2001-01-09 2:01 ` Zlatko Calusic
-1 siblings, 0 replies; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-09 2:01 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm
Rik van Riel <riel@conectiva.com.br> writes:
> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
>
Oh, well, it seems that I was wrong. :)
First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
192MB machine)
kernel | swap usage | speed
-------------------------------
2.2.17 | 48 MB | 11.8 MB/s
-------------------------------
2.4.0 | 206 MB | 11.1 MB/s
-------------------------------
So 2.2 is only marginally faster. Also it can be seen that 2.4 uses 4
times more swap space. If Linus says it's ok... :)
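The source of hogmem is not shown in this thread. For readers who want to repeat the first test, here is a minimal sketch of a comparable load generator, assuming it only needs to allocate a buffer and dirty every page a number of times; hog_memory() is a made-up name and the throughput figure is not guaranteed to be comparable with the numbers in the table above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Allocate `mbytes` MB and dirty every page of it `passes` times.
 * Returns the rate in MB/s, or -1.0 if the allocation fails.
 */
double hog_memory(size_t mbytes, int passes)
{
	size_t len = mbytes << 20;
	long page = sysconf(_SC_PAGESIZE);
	volatile unsigned char *buf = malloc(len);
	struct timespec t0, t1;
	double secs;

	if (!buf || page <= 0) {
		free((void *)buf);
		return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int pass = 0; pass < passes; pass++)
		for (size_t off = 0; off < len; off += (size_t)page)
			buf[off] = (unsigned char)(pass + 1);   /* one store dirties the page */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	free((void *)buf);

	secs = (double)(t1.tv_sec - t0.tv_sec)
	     + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (secs <= 0.0)
		secs = 1e-9;   /* avoid dividing by zero on very small runs */
	return (double)mbytes * (double)passes / secs;
}
```

hog_memory(180, 5) would correspond to the "hogmem 180 5" run; swap usage has to be read separately, for example from vmstat.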
Second test: kernel compile make -j32 (empirically this puts the VM
under load, but not excessively!)
2.2.17 -> make -j32 392.49s user 47.87s system 168% cpu 4:21.13 total
2.4.0 -> make -j32 389.59s user 31.29s system 182% cpu 3:50.24 total
Now, is this great news or what, 2.4.0 is definitely faster.
--
Zlatko
* Re: Subtle MM bug
2001-01-08 23:41 ` Zlatko Calusic
@ 2001-01-09 2:58 ` Linus Torvalds
2001-01-09 6:20 ` Eric W. Biederman
1 sibling, 0 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 2:58 UTC (permalink / raw)
To: Zlatko Calusic; +Cc: Rik van Riel, linux-kernel
On 9 Jan 2001, Zlatko Calusic wrote:
>
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?
Yes and no.
I'm not _sure_, obviously.
However, one thing I _am_ sure of is that the sticky page-cache simplifies
some things enormously, and make some things possible that simply weren't
possible before.
Linus
* Re: Subtle MM bug
2001-01-08 23:49 ` Marcelo Tosatti
@ 2001-01-09 3:12 ` Linus Torvalds
2001-01-09 20:33 ` Marcelo Tosatti
2001-01-17 4:54 ` Rik van Riel
0 siblings, 2 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 3:12 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Mon, 8 Jan 2001, Marcelo Tosatti wrote:
>
> Are you lazy enough to ask me to regenerate the patch, or can you do it
> yourself? :)
Try out 2.4.1-pre1 in testing.
It does three things:
- gets rid of the complex "best mm" logic and replaces it with the
round-robin thing as discussed. I have this suspicion that we
eventually want to make this based on fault rates etc in an effort to
more aggressively control big RSS processes, but I also suspect that
this is tied in to the RSS limiting patches, so this will simmer
for a while.
- it cleans up the unnecessary dcache/icache shrink that is already done
more properly elsewhere.
- it cleans up and simplifies the MM "priority" thing. In fact, right now
only one priority is ever used, and I suspect strongly that all the
"made_progress" logic was really there because that's how we want to do
it (and just having one priority made "made_progress" unnecessary).
(It also has some non-VM patches, of course, but for this discussion the
VM ones are the only interesting ones).
As far as I can tell, the non-priority version is every bit as good as the
one that counts down priorities, and if nobody can argue against it I'll
just remove the priority argument altogether at some point. Right now it
still exists, it just doesn't change.
That kmem_cache_reap() thing still looks completely bogus, but I didn't
touch it. It looks _so_ bogus that there must be some reason for doing it
that ass-backwards way. Why should anybody do a kmem_cache_reap()
when we're _not_ short of free pages? That code just makes me very
confused, so I'm not touching it.
Linus
* Re: Subtle MM bug
2001-01-08 23:41 ` Zlatko Calusic
2001-01-09 2:58 ` Linus Torvalds
@ 2001-01-09 6:20 ` Eric W. Biederman
2001-01-09 7:27 ` Linus Torvalds
1 sibling, 1 reply; 128+ messages in thread
From: Eric W. Biederman @ 2001-01-09 6:20 UTC (permalink / raw)
To: zlatko; +Cc: Linus Torvalds, Rik van Riel, linux-kernel
Zlatko Calusic <zlatko@iskon.hr> writes:
>
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?
I don't think we have more data on the swap, just more data has an
allocated home on the swap. With the earlier allocation we should
(I haven't verified) allocate contiguous chunks of memory contiguously
on the swap. And reusing the same swap pages helps out with this.
Eric
* Re: Subtle MM bug
2001-01-09 6:20 ` Eric W. Biederman
@ 2001-01-09 7:27 ` Linus Torvalds
2001-01-09 11:38 ` Eric W. Biederman
2001-01-09 12:29 ` Zlatko Calusic
0 siblings, 2 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 7:27 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: zlatko, Rik van Riel, linux-kernel
On 8 Jan 2001, Eric W. Biederman wrote:
> Zlatko Calusic <zlatko@iskon.hr> writes:
> >
> > Yes, but a lot more data on the swap also means degraded performance,
> > because the disk head has to seek around in the much bigger area. Are
> > you sure this is all OK?
>
> I don't think we have more data on the swap, just more data has an
> allocated home on the swap.
I think Zlatko's point is that because of the extra allocations, we will
have worse locality (more seeks etc).
Clearly we should not actually do any more actual IO. But the sticky
allocation _might_ make the IO we do be more spread out.
To offset that, I think the sticky allocation makes us much better able to
handle things like clustering etc more intelligently, which is why I think
it's very much worth it. But let's not close our eyes to potential
downsides.
Linus
* Re: Subtle MM bug
2001-01-09 7:27 ` Linus Torvalds
@ 2001-01-09 11:38 ` Eric W. Biederman
2001-01-09 12:29 ` Zlatko Calusic
1 sibling, 0 replies; 128+ messages in thread
From: Eric W. Biederman @ 2001-01-09 11:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: zlatko, Rik van Riel, linux-kernel
Linus Torvalds <torvalds@transmeta.com> writes:
> On 8 Jan 2001, Eric W. Biederman wrote:
>
> > Zlatko Calusic <zlatko@iskon.hr> writes:
> > >
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> >
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
>
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc).
>
> Clearly we should not actually do any more actual IO. But the sticky
> allocation _might_ make the IO we do be more spread out.
The tradeoff when implemented correctly is that writes will tend to be
more spread out and reads should be better clustered together.
> To offset that, I think the sticky allocation makes us much better able to
> handle things like clustering etc more intelligently, which is why I think
> it's very much worth it. But let's not close our eyes to potential
> downsides.
Certainly, keeping our eyes open is a good thing.
But it has been apparent for a long time that by doing allocation as
we were doing it, that when it came to heavy swapping we were taking a
performance hit. So I'm relieved that we are now being more aggressive.
From the sounds of it what we are currently doing actually sucks worse
for some heavy loads. But it still feels like the right direction.
It's been my impression that work loads where we are actively swapping
are a lot different from work loads where we really don't swap. To
the extent that it might make sense to make the actively swapping case
a config option to get our attention in the code. It would be nice
to have a linux kernel for once that handles heavy swapping (below
the level of thrashing) gracefully. :)
Eric
* Re: Subtle MM bug
2001-01-09 7:27 ` Linus Torvalds
2001-01-09 11:38 ` Eric W. Biederman
@ 2001-01-09 12:29 ` Zlatko Calusic
2001-01-09 18:47 ` Linus Torvalds
1 sibling, 1 reply; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-09 12:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Eric W. Biederman, Rik van Riel, linux-kernel
Linus Torvalds <torvalds@transmeta.com> writes:
> On 8 Jan 2001, Eric W. Biederman wrote:
>
> > Zlatko Calusic <zlatko@iskon.hr> writes:
> > >
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> >
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
>
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc).
Yes that was my concern.
But in the end I'm not sure. I made two simple tests and haven't found
any problems with 2.4.0 mm logic (as opposed to 2.2.17). In fact, the new
kernel was faster in the more interesting (make -j32) test.
Also I have found that the new kernel allocates 4 times more swap space
under some circumstances. That may or may not be alarming, it remains
to be seen.
--
Zlatko
* Re: Subtle MM bug
2001-01-09 12:29 ` Zlatko Calusic
@ 2001-01-09 18:47 ` Linus Torvalds
2001-01-09 19:09 ` Daniel Phillips
` (2 more replies)
0 siblings, 3 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 18:47 UTC (permalink / raw)
To: Zlatko Calusic; +Cc: Eric W. Biederman, Rik van Riel, linux-kernel
On 9 Jan 2001, Zlatko Calusic wrote:
>
> But in the end I'm not sure. I made two simple tests and haven't found
> any problems with 2.4.0 mm logic (as opposed to 2.2.17). In fact, the new
> kernel was faster in the more interesting (make -j32) test.
I personally think 2.4.x is going to be as fast or faster at just about
anything. We do have some MM issues still to hash out, and tuning to do,
but I'm absolutely convinced that 2.4.x is going to be a _lot_ easier to
tune than 2.2.x ever was. The "scan the page tables without doing any IO"
thing just makes the 2.4.x memory management several orders of magnitude
more flexible than 2.2.x ever was.
(This is why I worked so hard at getting the PageDirty semantics right in
the last two months or so - and why I released 2.4.0 when I did. Getting
PageDirty right was the big step to make all of the VM stuff possible in
the first place. Even if it probably looked a bit foolhardy to change the
semantics of "writepage()" quite radically just before 2.4 was released).
> Also I have found that the new kernel allocates 4 times more swap space
> under some circumstances. That may or may not be alarming, it remains
> to be seen.
Yes. The new VM will allocate the swap space a _lot_ more aggressively.
Many of those allocations will not necessarily ever actually be used, but
the fact that we _have_ allocated backing store for a page is what allows
us to drop it from the VM page tables, so that it can be processed by
page_launder().
And this _is_ a downside, there's no question about it. There's the worry
about the potential loss of locality, but there's also the fact that you
effectively need a bigger swap partition with 2.4.x - never mind that
large portions of the allocations may never be used. You still need the
disk space for good VM behaviour.
There are always trade-offs, I think the 2.4.x tradeoff is a good one.
Linus
* Re: Subtle MM bug
2001-01-09 18:47 ` Linus Torvalds
@ 2001-01-09 19:09 ` Daniel Phillips
2001-01-09 19:29 ` Trond Myklebust
` (2 more replies)
2001-01-09 19:53 ` Simon Kirby
2001-01-10 1:45 ` David Woodhouse
2 siblings, 3 replies; 128+ messages in thread
From: Daniel Phillips @ 2001-01-09 19:09 UTC (permalink / raw)
To: Linus Torvalds, linux-kernel
Linus Torvalds wrote:
> (This is why I worked so hard at getting the PageDirty semantics right in
> the last two months or so - and why I released 2.4.0 when I did. Getting
> PageDirty right was the big step to make all of the VM stuff possible in
> the first place. Even if it probably looked a bit foolhardy to change the
> semantics of "writepage()" quite radically just before 2.4 was released).
On the topic of writepage, it's not symmetric with readpage at the
moment - it still takes (struct file *). Is this in the cleanup
pipeline? It looks like nfs_readpage already ignores the struct file *,
but maybe some other net filesystems are still depending on it.
--
Daniel
* Re: Subtle MM bug
2001-01-09 19:09 ` Daniel Phillips
@ 2001-01-09 19:29 ` Trond Myklebust
2001-01-10 17:32 ` Andi Kleen
2001-01-09 19:37 ` Linus Torvalds
2001-01-17 8:46 ` Rik van Riel
2 siblings, 1 reply; 128+ messages in thread
From: Trond Myklebust @ 2001-01-09 19:29 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel
>>>>> " " == Daniel Phillips <phillips@innominate.de> writes:
> Linus Torvalds wrote:
>> (This is why I worked so hard at getting the PageDirty
>> semantics right in the last two months or so - and why I
>> released 2.4.0 when I did. Getting PageDirty right was the big
>> step to make all of the VM stuff possible in the first
>> place. Even if it probably looked a bit foolhardy to change the
>> semantics of "writepage()" quite radically just before 2.4 was
>> released).
> On the topic of writepage, it's not symmetric with readpage at
> the moment - it still takes (struct file *). Is this in the
> cleanup pipeline? It looks like nfs_readpage already ignores
> the struct file *, but maybe some other net filesystems are
> still depending on it.
NO! We definitely want to pass the struct file down to nfs_readpage()
when it's available.
Al has mentioned that he wants us to move towards a *BSD-like system
of credentials (i.e. struct ucred) that could be used here, but that's
in the far future. In the meantime, we cache RPC credentials in the
struct file...
Cheers,
Trond
* Re: Subtle MM bug
2001-01-09 19:09 ` Daniel Phillips
2001-01-09 19:29 ` Trond Myklebust
@ 2001-01-09 19:37 ` Linus Torvalds
2001-01-17 8:46 ` Rik van Riel
2 siblings, 0 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 19:37 UTC (permalink / raw)
To: linux-kernel
In article <3A5B61F7.FB0E79C1@innominate.de>,
Daniel Phillips <phillips@innominate.de> wrote:
>Linus Torvalds wrote:
>> (This is why I worked so hard at getting the PageDirty semantics right in
>> the last two months or so - and why I released 2.4.0 when I did. Getting
>> PageDirty right was the big step to make all of the VM stuff possible in
>> the first place. Even if it probably looked a bit foolhardy to change the
>> semantics of "writepage()" quite radically just before 2.4 was released).
>
>On the topic of writepage, it's not symmetric with readpage at the
>moment - it still takes (struct file *). Is this in the cleanup
>pipeline? It looks like nfs_readpage already ignores the struct file *,
>but maybe some other net filesystems are still depending on it.
readpage() is always a synchronous operation, and is actually much more
closely linked to "prepare_write()"/"commit_write()" than to writepage,
despite the naming similarities.
So no, the two are not symmetric, and they really shouldn't be.
"readpage()" is for reading a page into the page cache, and is always
synchronous with the reader (even prefetching is "synchronous" in the
sense that it's done by the reader: it's asynchronous in the sense that
we don't wait for the results, but the _calling_ of readpage() is
synchronous, if you see what I mean).
Similarly, prepare_write() and commit_write() are synchronous to the
writer (again, we do not wait for the writes to have actually
_happened_, but we call the functions synchronously and they can choose
to let the actual IO happen asynchronously - the VM doesn't care about
that small detail).
So "readpage()" and "prepare_write()/commit_write()" are pairs. They
are different simply because reading is assumed to be a cacheable and
prefetchable operation (think regular CPU caches), while writing
obviously has to give a much stricter "write _these_ bytes, not the
whole cache line".
In contrast, writepage() is a completely different animal. It's
basically a cache eviction notice, and happens asynchronously to any
operations that actually fill or dirty the cache. So despite the name,
it really as an operation has absolutely nothing in common with
readpage(), other than the fact that it is supposed to obviously do the
IO associated with the name.
Writepage has a friend in "sync_page()", which is another asynchronous
call-back that basically says "we want you to start your IO _now_". It's
similar to "writepage()" in that it's a kind of cache state
notification: while writepage() notifies that the cached page wants to
be evicted, "sync_page()" notifies that the cached page is waited upon
by somebody else and that we want to speed up any background IO on it.
You'll notice that writepage()/sync_page() share one calling
convention, while readpage()/prepare_write()/commit_write() share
another.
The one operation that _really_ stands out is "bmap()". It has
absolutely no calling convention at all, and is not symmetric with
anything. Pretty ugly, but easily supported.
Linus
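The two groups of calling conventions described above can be sketched in
a few lines of C. This is only an illustration: the struct name
aops_sketch, the two-field struct page and the demo_* stubs are made up
for the example, and the real 2.4 definition of the address space
operations should be checked in include/linux/fs.h.

```c
#include <stddef.h>

struct file;                              /* opaque, only passed by pointer */
struct page { int uptodate; int dirty; }; /* hypothetical two-field stand-in */

/* Hypothetical table that groups the operations by calling convention. */
struct aops_sketch {
	/* Called synchronously by the reader or writer: they get the file. */
	int (*readpage)(struct file *, struct page *);
	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
	/* Cache state notifications from the VM: only the page is known. */
	int (*writepage)(struct page *);
	int (*sync_page)(struct page *);
};

static int demo_readpage(struct file *f, struct page *p)
{
	(void)f;
	p->uptodate = 1;        /* page contents are now valid */
	return 0;
}

static int demo_prepare_write(struct file *f, struct page *p,
			      unsigned from, unsigned to)
{
	(void)f; (void)p;
	return from <= to ? 0 : -1;   /* reject an inverted byte range */
}

static int demo_commit_write(struct file *f, struct page *p,
			     unsigned from, unsigned to)
{
	(void)f; (void)from; (void)to;
	p->dirty = 1;           /* the IO itself may happen later */
	return 0;
}

static int demo_writepage(struct page *p)
{
	p->dirty = 0;           /* "eviction notice": push the page out */
	return 0;
}

static int demo_sync_page(struct page *p)
{
	(void)p;                /* "somebody is waiting, start IO now" */
	return 0;
}

static const struct aops_sketch demo_aops = {
	.readpage      = demo_readpage,
	.prepare_write = demo_prepare_write,
	.commit_write  = demo_commit_write,
	.writepage     = demo_writepage,
	.sync_page     = demo_sync_page,
};
```

A caller on the read or write path would pass its struct file along,
while the VM side would call demo_aops.writepage(&pg) with nothing but
the page, which is why the latter cannot depend on per-open state.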
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 18:47 ` Linus Torvalds
2001-01-09 19:09 ` Daniel Phillips
@ 2001-01-09 19:53 ` Simon Kirby
2001-01-09 20:08 ` Linus Torvalds
2001-01-09 20:10 ` Zlatko Calusic
2001-01-10 1:45 ` David Woodhouse
2 siblings, 2 replies; 128+ messages in thread
From: Simon Kirby @ 2001-01-09 19:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel
On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
>
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.
Hmm, perhaps you could clarify...
For boxes that rarely ever use swap with 2.2, will they now need more
swap space on 2.4 to perform well, or just boxes which don't have enough
RAM to handle everything nicely?
I've always been tending to make swap partitions smaller lately, as it
helps in the case where we have to wait for a runaway process to eat up
all of the swap space before it gets killed. Making the swap size
smaller speeds up the time it takes for this to happen, albeit something
which isn't supposed to happen anyway.
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ sim@stormix.com ][ sim@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 19:53 ` Simon Kirby
@ 2001-01-09 20:08 ` Linus Torvalds
2001-01-09 20:10 ` Zlatko Calusic
1 sibling, 0 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 20:08 UTC (permalink / raw)
To: Simon Kirby; +Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel
On Tue, 9 Jan 2001, Simon Kirby wrote:
>
> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
>
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> >
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
>
> Hmm, perhaps you could clarify...
>
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?
If you don't have any swap, or if you run out of swap, the major
difference between 2.2.x and 2.4.x is probably going to be the oom
handling: I suspect that 2.4.x might be more likely to kill things off
sooner (but it tries to be graceful about which processes to kill).
Not having any swap is going to be a performance issue for both 2.2.x and
2.4.x - Linux likes to push inactive dirty pages out to swap where they
can lie around without bothering anybody, even if there is no _major_
memory crunch going on.
If you do have swap, but it's smaller than your available physical RAM, I
suspect that the Linux-2.4 swap pre-allocate may cause that kind of
performance degradation earlier than 2.2.x would have. Another way of
putting this: in 2.2.x you could use a fairly small swap partition to pick
up some of the slack, and in 2.4.x a really small swap-partition doesn't
really buy you much of anything.
> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed. Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.
Yes, that kind of swap size tuning will still work in 2.4.x, but the sizes
you tune for would be different, I'm afraid. If you have, say, 128MB of
RAM, and you used to make a smallish partition of 64MB for "slop" in
2.2.x, I really suspect that you might like to increase it to 128MB or
196MB.
Of course, if you really only used your swap for "slop", I don't think
you'll necessarily notice the difference.
NOTE! The above guide-lines are pure guesses. The machines I use have had
big swap-partitions or none at all, so I think we'll just have to wait and
see.
Linus
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 19:53 ` Simon Kirby
2001-01-09 20:08 ` Linus Torvalds
@ 2001-01-09 20:10 ` Zlatko Calusic
1 sibling, 0 replies; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-09 20:10 UTC (permalink / raw)
To: Simon Kirby; +Cc: Linus Torvalds, Eric W. Biederman, Rik van Riel, linux-kernel
Simon Kirby <sim@stormix.com> writes:
> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
>
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> >
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
>
> Hmm, perhaps you could clarify...
>
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?
>
Just boxes that were already short on memory (swapped a lot) will need
more swap, empirically up to 4 times as much. If you already had
enough memory then things will stay almost the same for you.
But anyway, after some testing I've done recently, I would no longer
recommend a swap partition smaller than 2 x RAM size.
> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed. Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.
>
Well, if you continue with that practice now you will be even more
successful in killing such processes, I would say. :)
--
Zlatko
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 3:12 ` Linus Torvalds
@ 2001-01-09 20:33 ` Marcelo Tosatti
2001-01-09 22:44 ` Linus Torvalds
2001-01-17 4:54 ` Rik van Riel
1 sibling, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-09 20:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Mon, 8 Jan 2001, Linus Torvalds wrote:
> Try out 2.4.1-pre1 in testing.
The "while (!inactive_shortage())" should be "while (inactive_shortage())"
as Benjamin noted on lk.
The second problem is that background scanning is being done
unconditionally, and it should not. You end up getting all pages with the
same age if the system is idle. Look at this example (2.4.1-pre1):
MemTotal: 900148 kB
MemFree: 145060 kB
Cached: 725624 kB
Active: 3972 kB
Inact_dirty: 722940 kB
Inact_clean: 0 kB
Inact_target: 188 kB
> That kmem_cache_reap() thing still looks completely bogus, but I didn't
> touch it. It looks _so_ bogus that there must be some reason for doing it
> that ass-backwards way. Why should anybody do a kmem_cache_reap()
> when we're _not_ short of free pages? That code just makes me very
> confused, so I'm not touching it.
This patch removes kmem_cache_reap() from refill_inactive() and moves it
to inside the free_shortage() check in do_try_to_free_pages().
It also changes the "while (!inactive_shortage())" mistake.
Comments?
diff -Nur linux.orig/include/linux/fs.h linux/include/linux/fs.h
--- linux.orig/include/linux/fs.h Tue Jan 9 19:32:51 2001
+++ linux/include/linux/fs.h Tue Jan 9 20:07:32 2001
@@ -985,7 +985,7 @@
extern int fs_may_remount_ro(struct super_block *);
-extern int try_to_free_buffers(struct page *, int);
+extern void try_to_free_buffers(struct page *, int);
extern void refile_buffer(struct buffer_head * buf);
#define BUF_CLEAN 0
diff -Nur linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h Tue Jan 9 19:32:51 2001
+++ linux/include/linux/swap.h Tue Jan 9 20:07:38 2001
@@ -108,7 +108,7 @@
extern int free_shortage(void);
extern int inactive_shortage(void);
extern void wakeup_kswapd(int);
-extern int try_to_free_pages(unsigned int gfp_mask);
+extern void try_to_free_pages(unsigned int gfp_mask);
/* linux/mm/page_io.c */
extern void rw_swap_page(int, struct page *, int);
diff -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c Tue Jan 9 19:35:41 2001
+++ linux/mm/vmscan.c Tue Jan 9 20:06:01 2001
@@ -825,9 +825,6 @@
count = (1 << page_cluster);
start_count = count;
- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(gfp_mask);
-
priority = 6;
do {
if (current->need_resched) {
@@ -842,16 +839,14 @@
/* If refill_inactive_scan failed, try to page stuff out.. */
swap_out(priority, gfp_mask);
- } while (!inactive_shortage());
+ } while (inactive_shortage());
done:
return (count < start_count);
}
-static int do_try_to_free_pages(unsigned int gfp_mask, int user)
+static void do_try_to_free_pages(unsigned int gfp_mask, int user)
{
- int ret = 0;
-
/*
* If we're low on free pages, move pages from the
* inactive_dirty list to the inactive_clean list.
@@ -862,32 +857,24 @@
*/
if (free_shortage() || nr_inactive_dirty_pages > nr_free_pages() +
nr_inactive_clean_pages())
- ret += page_launder(gfp_mask, user);
+ page_launder(gfp_mask, user);
/*
* If needed, we move pages from the active list
* to the inactive list.
*/
if (inactive_shortage())
- ret += refill_inactive(gfp_mask, user);
+ refill_inactive(gfp_mask, user);
/*
- * Delete pages from the inode and dentry cache
- * if memory is low.
+ * Delete pages from the inode and dentry cache and
+ * reclaim unused slab cache if memory is low.
*/
if (free_shortage()) {
shrink_dcache_memory(6, gfp_mask);
shrink_icache_memory(6, gfp_mask);
- } else {
-
- /*
- * Reclaim unused slab cache memory.
- */
kmem_cache_reap(gfp_mask);
- ret = 1;
}
-
- return ret;
}
DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);
@@ -1029,17 +1016,13 @@
* memory but are unable to sleep on kswapd because
* they might be holding some IO locks ...
*/
-int try_to_free_pages(unsigned int gfp_mask)
+void try_to_free_pages(unsigned int gfp_mask)
{
- int ret = 1;
-
if (gfp_mask & __GFP_WAIT) {
current->flags |= PF_MEMALLOC;
- ret = do_try_to_free_pages(gfp_mask, 1);
+ do_try_to_free_pages(gfp_mask, 1);
current->flags &= ~PF_MEMALLOC;
}
-
- return ret;
}
DECLARE_WAIT_QUEUE_HEAD(kreclaimd_wait);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 22:44 ` Linus Torvalds
@ 2001-01-09 21:33 ` Marcelo Tosatti
2001-01-09 22:11 ` Yet another bogus piece of do_try_to_free_pages() Marcelo Tosatti
2001-01-09 23:58 ` Subtle MM bug Linus Torvalds
0 siblings, 2 replies; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-09 21:33 UTC (permalink / raw)
To: Stephen C. Tweedie, Linus Torvalds
Cc: David S. Miller, Rik van Riel, linux-mm
On Tue, 9 Jan 2001, Linus Torvalds wrote:
>
>
> On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> >
> > The "while (!inactive_shortage())" should be "while (inactive_shortage())"
> > as Benjamin noted on lk.
>
> Yes. Also, it does need something to make sure that it doesn't end up
> being an endless loop.
Ok, I'll send another patch which fixes this later today.
> > The second problem is that background scanning is being done
> > unconditionally, and it should not. You end up getting all pages with the
> > same age if the system is idle. Look at this example (2.4.1-pre1):
>
> I agree. However, I think that we do want to do some background scanning
> to push out dirty pages in the background, kind of like bdflush. It just
> shouldn't age the pages (and thus not move them to the inactive list).
Actually it must age the pages, but aging should not be unconditional.
Stephen has some thoughts on this. Stephen?
^ permalink raw reply [flat|nested] 128+ messages in thread
* Yet another bogus piece of do_try_to_free_pages()
2001-01-09 21:33 ` Marcelo Tosatti
@ 2001-01-09 22:11 ` Marcelo Tosatti
2001-01-10 0:06 ` Linus Torvalds
2001-01-09 23:58 ` Subtle MM bug Linus Torvalds
1 sibling, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-09 22:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm
Hi,
Look at this piece of code from kswapd:
/* If needed, try to free some memory. */
if (inactive_shortage() || free_shortage()) {
int wait = 0;
/* Do we need to do some synchronous flushing? */
if (waitqueue_active(&kswapd_done))
wait = 1;
do_try_to_free_pages(GFP_KSWAPD, wait);
}
The problem is that do_try_to_free_pages uses the "wait" argument when
calling page_launder() (where the parameter is used to indicate if we want
to do sync or async IO) _and_ when calling refill_inactive(), where this
parameter is used to indicate if it's being called from a normal process or
from kswapd:
* OTOH, if we're a user process (and not kswapd), we
* really care about latency. In that case we don't try
* to free too many pages.
*/
static int refill_inactive(unsigned int gfp_mask, int user)
{
int priority, count, start_count;
count = inactive_shortage() + free_shortage();
if (user)
count = (1 << page_cluster);
start_count = count;
This is probably quite nasty in practice (low memory conditions) because
if we have waiters on kswapd, we want to free more memory.
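
The effect can be reproduced outside the kernel with a small C sketch.
Everything here is a stand-in: refill_target() copies only the count
selection from the refill_inactive() fragment quoted above, the value of
page_cluster is an assumption, and kswapd_target_separated() is just one
hypothetical way of keeping the two meanings apart, not a proposed patch.

```c
/* Hypothetical stand-in for the kernel's page_cluster tunable. */
static int page_cluster = 4;

/* Mirrors the count selection at the top of the quoted refill_inactive(). */
static int refill_target(int inactive_short, int free_short, int user)
{
	int count = inactive_short + free_short;

	if (user)
		count = 1 << page_cluster;   /* latency-friendly small batch */
	return count;
}

/* The quoted kswapd path: "wait" is reused as the "user" argument. */
static int kswapd_target_conflated(int inactive_short, int free_short,
				   int have_waiters)
{
	int wait = have_waiters ? 1 : 0;

	return refill_target(inactive_short, free_short, wait);
}

/* One possible separation: kswapd is never treated as a user process. */
static int kswapd_target_separated(int inactive_short, int free_short,
				   int have_waiters)
{
	(void)have_waiters;   /* would only pick sync vs. async IO elsewhere */
	return refill_target(inactive_short, free_short, 0);
}
```

With a shortage of 1000 inactive and 200 free pages, the conflated
version aims for 1200 pages while nobody is waiting, but drops to
1 << page_cluster as soon as a process sleeps on kswapd, which is the
opposite of what the waiters need.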
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 23:58 ` Subtle MM bug Linus Torvalds
@ 2001-01-09 22:21 ` Marcelo Tosatti
2001-01-10 0:23 ` Linus Torvalds
0 siblings, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-09 22:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Tue, 9 Jan 2001, Linus Torvalds wrote:
>
> On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> >
> > > > The second problem is that background scanning is being done
> > > > unconditionally, and it should not. You end up getting all pages with the
> > > > same age if the system is idle. Look at this example (2.4.1-pre1):
> > >
> > > I agree. However, I think that we do want to do some background scanning
> > > to push out dirty pages in the background, kind of like bdflush. It just
> > > shouldn't age the pages (and thus not move them to the inactive list).
> >
> > Actually it must age the pages, but aging should not be unconditional.
>
> No, I'm saying that "the background scanning" should not do the page
> aging.
If you age pages only when there is memory pressure/low memory, you'll
have less knowledge about which pages were unused/used pages over time.
> Obviously "refill_inactive()" needs to do the page aging. I'm just not at
> all convinced that "background scanning" == "refill_inactive()".
This is the background scanning I'm referring to (in kswapd):
/*
* Do some (very minimal) background scanning. This
* will scan all pages on the active list once
* every minute. This clears old referenced bits
* and moves unused pages to the inactive list.
*/
refill_inactive_scan(6, 0);
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 20:33 ` Marcelo Tosatti
@ 2001-01-09 22:44 ` Linus Torvalds
2001-01-09 21:33 ` Marcelo Tosatti
0 siblings, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 22:44 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
>
> The "while (!inactive_shortage())" should be "while (inactive_shortage())"
> as Benjamin noted on lk.
Yes. Also, it does need something to make sure that it doesn't end up
being an endless loop.
Now, the oom_killer() thing should make sure it's not endless, but the
fact is that kswapd() (who calls the oom-killer) also calls the very same
do_try_to_free_pages(), so we really do have to make sure that it doesn't
loop forever trying to find a page.
The priority countdown used to handle this, and while I disagree with the
_other_ uses of the priority (it used to make the freeing action
"chunkier" by walking bigger pieces of the VM or the active lists), I
think we need to rename "priority" to "maxtry", and use that to give up
gracefully when we truly do run out of memory.
(I _suspect_ that the oom killer would be invoked before this happens in
practice, and refill_inactive_scan() would find _something_ to make
slight progress on all the time, but the fact is that we shouldn't have
those kinds of assumptions in the VM code).
This would make the return value (that you removed in this patch) still a
valid thing. So I don't think it should go away.
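
A minimal sketch of such a bounded loop, assuming a toy model rather
than the real vmscan.c code: struct fake_vm, reclaim_once() and
refill_bounded() are hypothetical names, and the "pages moved per pass"
figure simply stands in for whatever refill_inactive_scan() and
swap_out() would achieve. The point is only that "maxtry" bounds the
number of passes and that the return value still reports progress.

```c
/* Toy model of the VM state; all names here are hypothetical. */
struct fake_vm {
	int shortage;   /* pages still missing from the inactive list */
	int per_pass;   /* pages one reclaim pass manages to move */
};

/* One reclaim pass: moves at most per_pass pages, returns how many. */
static int reclaim_once(struct fake_vm *vm)
{
	int moved = vm->per_pass < vm->shortage ? vm->per_pass : vm->shortage;

	vm->shortage -= moved;
	return moved;
}

/*
 * Keep going while there is a shortage, but give up after maxtry
 * passes. Returns 1 if any progress was made, 0 otherwise.
 */
static int refill_bounded(struct fake_vm *vm, int maxtry)
{
	int progress = 0;

	while (vm->shortage > 0 && maxtry-- > 0)
		progress += reclaim_once(vm);

	return progress > 0;
}
```

If no pass ever moves a page, the loop still terminates after maxtry
passes and returns 0, so the caller can tell "gave up" apart from
"freed something" without relying on the oom killer to break the loop.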
> The second problem is that background scanning is being done
> unconditionally, and it should not. You end up getting all pages with the
> same age if the system is idle. Look at this example (2.4.1-pre1):
I agree. However, I think that we do want to do some background scanning
to push out dirty pages in the background, kind of like bdflush. It just
shouldn't age the pages (and thus not move them to the inactive list).
Linus
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 21:33 ` Marcelo Tosatti
2001-01-09 22:11 ` Yet another bogus piece of do_try_to_free_pages() Marcelo Tosatti
@ 2001-01-09 23:58 ` Linus Torvalds
2001-01-09 22:21 ` Marcelo Tosatti
1 sibling, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-09 23:58 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
>
> > > The second problem is that background scanning is being done
> > > unconditionally, and it should not. You end up getting all pages with the
> > > same age if the system is idle. Look at this example (2.4.1-pre1):
> >
> > I agree. However, I think that we do want to do some background scanning
> > to push out dirty pages in the background, kind of like bdflush. It just
> > shouldn't age the pages (and thus not move them to the inactive list).
>
> Actually it must age the pages, but aging should not be unconditional.
No, I'm saying that "the background scanning" should not do the page
aging.
Obviously "refill_inactive()" needs to do the page aging. I'm just not at
all convinced that "background scanning" == "refill_inactive()".
Linus
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-09 22:11 ` Yet another bogus piece of do_try_to_free_pages() Marcelo Tosatti
@ 2001-01-10 0:06 ` Linus Torvalds
2001-01-10 6:39 ` Marcelo Tosatti
2001-01-17 6:52 ` Rik van Riel
0 siblings, 2 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-10 0:06 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-mm
On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
>
> The problem is that do_try_to_free_pages uses the "wait" argument when
> calling page_launder() (where the parameter is used to indicate if we want
> to do sync or async IO) _and_ when calling refill_inactive(), where this
> parameter is used to indicate if it's being called from a normal process or
> from kswapd:
Yes. Bogus.
I suspect that the proper fix is something more along the lines of what we
did to bdflush: get rid of the notion of waiting synchronously from
bdflush, and instead do the work yourself.
Doing the same to kswapd would imply getting rid of that kswapd_wait
thing, and instead of having people wait on it, they would do
"page_launder(gfp_mask, 1)" themselves (and we _do_ want them to wait,
because that ends up being rate-limiting especially on the applications
that do a lot of memory allocation - which are the applications that end
up being the problem in the first place).
Linus
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 0:23 ` Linus Torvalds
@ 2001-01-10 0:12 ` Marcelo Tosatti
2001-01-10 11:29 ` Stephen C. Tweedie
2001-01-11 3:30 ` Marcelo Tosatti
1 sibling, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-10 0:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Tue, 9 Jan 2001, Linus Torvalds wrote:
> Hmm.. Fair enough. However, if you don't have VM pressure, you're also not
> going to look at the page tables, so you are not going to get any use
> information from them, either.
Are you sure that potentially unmapping pte's and swapping out its pages
in the background scanning is ok?
I mean, what kind of swap behaviour will we have if we do it?
> The aging should really be done at roughly the same rate as the "mark
> active", wouldn't you say? If you mark things active without aging, pages
> end up all being marked as "new". And if you age without marking things
> active, they all end up being "old". Neither is good. What you really want
> to have is aging that happens at the same rate as reference marking.
> So one "conditional aging" algorithm might just be something as simple as
>
> - every time you mark something referenced, you increment a counter
> - every time you want to age something, you check whether the counter is
> positive first (and decrement it if you age something)
Seems to be a nice solution.
I'll send you the previously promised patch and then I'll send the
background scanning one as soon as we (or I?) figure out the previous
question about background pte scanning.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 22:21 ` Marcelo Tosatti
@ 2001-01-10 0:23 ` Linus Torvalds
2001-01-10 0:12 ` Marcelo Tosatti
2001-01-11 3:30 ` Marcelo Tosatti
0 siblings, 2 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-10 0:23 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> >
> > No, I'm saying that "the background scanning" should not do the page
> > aging.
>
> If you age pages only when there is memory pressure/low memory, you'll
> have less knowledge about which pages were unused/used pages over time.
Hmm.. Fair enough. However, if you don't have VM pressure, you're also not
going to look at the page tables, so you are not going to get any use
information from them, either.
The aging should really be done at roughly the same rate as the "mark
active", wouldn't you say? If you mark things active without aging, pages
end up all being marked as "new". And if you age without marking things
active, they all end up being "old". Neither is good. What you really want
to have is aging that happens at the same rate as reference marking.
So one "conditional aging" algorithm might just be something as simple as
- every time you mark something referenced, you increment a counter
- every time you want to age something, you check whether the counter is
positive first (and decrement it if you age something)
Linus
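
The two-step "conditional aging" rule above might look roughly like the
following C sketch. It is not kernel code: struct aging_state,
struct fake_page, note_referenced() and try_age() are hypothetical
names, halving the age is an assumption about what "aging" does to a
page, and the sketch ignores locking and SMP entirely.

```c
/* Hypothetical global aging budget. */
struct aging_state {
	long credits;   /* references seen that have not been "spent" yet */
};

/* Hypothetical page with nothing but an age. */
struct fake_page {
	int age;
};

/* Step 1: every time something is marked referenced, bump the counter. */
static void note_referenced(struct aging_state *s)
{
	s->credits++;
}

/*
 * Step 2: age a page only if the counter is positive, and decrement
 * the counter when we do. Returns 1 if the page was aged, 0 if not.
 */
static int try_age(struct aging_state *s, struct fake_page *p)
{
	if (s->credits <= 0)
		return 0;       /* idle system: leave the ages alone */

	s->credits--;
	p->age /= 2;            /* assumed aging step: halve the age */
	return 1;
}
```

On an idle system nothing calls note_referenced(), so try_age() keeps
returning 0 and the page ages stay distinct instead of all decaying to
the same value; under load, aging proceeds at roughly the rate at which
references are being recorded.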
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 18:47 ` Linus Torvalds
2001-01-09 19:09 ` Daniel Phillips
2001-01-09 19:53 ` Simon Kirby
@ 2001-01-10 1:45 ` David Woodhouse
2001-01-10 2:26 ` Andrea Arcangeli
2001-01-10 6:57 ` Linus Torvalds
2 siblings, 2 replies; 128+ messages in thread
From: David Woodhouse @ 2001-01-10 1:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel
On Tue, 9 Jan 2001, Linus Torvalds wrote:
> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
>
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.
How does this affect embedded systems with no swap space at all?
--
dwmw2
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 1:45 ` David Woodhouse
@ 2001-01-10 2:26 ` Andrea Arcangeli
2001-01-10 6:57 ` Linus Torvalds
1 sibling, 0 replies; 128+ messages in thread
From: Andrea Arcangeli @ 2001-01-10 2:26 UTC (permalink / raw)
To: David Woodhouse
Cc: Linus Torvalds, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
linux-kernel
On Wed, Jan 10, 2001 at 01:45:47AM +0000, David Woodhouse wrote:
> How does this affect embedded systems with no swap space at all?
If there's no swap the swap-cache dirty-sticky issue can't arise.
Andrea
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-10 0:06 ` Linus Torvalds
@ 2001-01-10 6:39 ` Marcelo Tosatti
2001-01-10 22:19 ` Roger Larsson
2001-01-11 0:11 ` Zlatko Calusic
2001-01-17 6:52 ` Rik van Riel
1 sibling, 2 replies; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-10 6:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm
On Tue, 9 Jan 2001, Linus Torvalds wrote:
> I suspect that the proper fix is something more along the lines of what we
> did to bdflush: get rid of the notion of waiting synchronously from
> bdflush, and instead do the work yourself.
Agreed.
Without blocking on sync IO, kswapd can keep aging pages and moving
them to the inactive lists.
The following patch changes some stuff we've discussed before (the
kmem_cache_reap and maxtry thingies) and it also removes the kswapd
sleeping scheme.
I haven't tested it yet, though I'll do it tomorrow.
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h Wed Jan 10 02:17:59 2001
+++ linux/include/linux/swap.h Wed Jan 10 05:52:02 2001
@@ -107,7 +107,7 @@
extern int page_launder(int, int);
extern int free_shortage(void);
extern int inactive_shortage(void);
-extern void wakeup_kswapd(int);
+extern void wakeup_kswapd(void);
extern int try_to_free_pages(unsigned int gfp_mask);
/* linux/mm/page_io.c */
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/filemap.c linux/mm/filemap.c
--- linux.orig/mm/filemap.c Wed Jan 10 02:17:59 2001
+++ linux/mm/filemap.c Wed Jan 10 05:54:56 2001
@@ -306,7 +306,7 @@
*/
age_page_up(page);
if (inactive_shortage() > inactive_target / 2 && free_shortage())
- wakeup_kswapd(0);
+ wakeup_kswapd();
not_found:
return page;
}
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/page_alloc.c linux/mm/page_alloc.c
--- linux.orig/mm/page_alloc.c Wed Jan 10 02:17:59 2001
+++ linux/mm/page_alloc.c Wed Jan 10 06:04:05 2001
@@ -16,6 +16,7 @@
#include <linux/interrupt.h>
#include <linux/pagemap.h>
#include <linux/bootmem.h>
+#include <linux/slab.h>
int nr_swap_pages;
int nr_active_pages;
@@ -303,7 +304,7 @@
* an inactive page shortage, wake up kswapd.
*/
if (inactive_shortage() > inactive_target / 2 && free_shortage())
- wakeup_kswapd(0);
+ wakeup_kswapd();
/*
* If we are about to get low on free pages and cleaning
* the inactive_dirty pages would fix the situation,
@@ -379,7 +380,7 @@
* - if we don't have __GFP_IO set, kswapd may be
* able to free some memory we can't free ourselves
*/
- wakeup_kswapd(0);
+ wakeup_kswapd();
if (gfp_mask & __GFP_WAIT) {
__set_current_state(TASK_RUNNING);
current->policy |= SCHED_YIELD;
@@ -404,7 +405,7 @@
* - we're doing a higher-order allocation
* --> move pages to the free list until we succeed
* - we're /really/ tight on memory
- * --> wait on the kswapd waitqueue until memory is freed
+ * --> try to free pages ourselves with page_launder
*/
if (!(current->flags & PF_MEMALLOC)) {
/*
@@ -443,36 +444,23 @@
/*
* When we arrive here, we are really tight on memory.
*
- * We wake up kswapd and sleep until kswapd wakes us
- * up again. After that we loop back to the start.
- *
- * We have to do this because something else might eat
- * the memory kswapd frees for us and we need to be
- * reliable. Note that we don't loop back for higher
- * order allocations since it is possible that kswapd
- * simply cannot free a large enough contiguous area
- * of memory *ever*.
- */
- if ((gfp_mask & (__GFP_WAIT|__GFP_IO)) == (__GFP_WAIT|__GFP_IO)) {
- wakeup_kswapd(1);
- memory_pressure++;
- if (!order)
- goto try_again;
- /*
- * If __GFP_IO isn't set, we can't wait on kswapd because
- * kswapd just might need some IO locks /we/ are holding ...
- *
- * SUBTLE: The scheduling point above makes sure that
- * kswapd does get the chance to free memory we can't
- * free ourselves...
+ * We try to free pages ourselves by:
+ * - shrinking the i/d caches.
+ * - reclaiming unused memory from the slab caches.
+ * - swapping/syncing pages to disk (done by page_launder)
+ * - moving clean pages from the inactive dirty list to
+ * the inactive clean list. (done by page_launder)
*/
- } else if (gfp_mask & __GFP_WAIT) {
- try_to_free_pages(gfp_mask);
- memory_pressure++;
+ if (gfp_mask & __GFP_WAIT) {
+ shrink_icache_memory(6, gfp_mask);
+ shrink_dcache_memory(6, gfp_mask);
+ kmem_cache_reap(gfp_mask);
+
+ page_launder(gfp_mask, 1);
+
if (!order)
goto try_again;
}
-
}
/*
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/slab.c linux/mm/slab.c
--- linux.orig/mm/slab.c Wed Jan 10 02:17:59 2001
+++ linux/mm/slab.c Wed Jan 10 06:01:27 2001
@@ -1702,7 +1702,7 @@
* kmem_cache_reap - Reclaim memory from caches.
* @gfp_mask: the type of memory required.
*
- * Called from try_to_free_page().
+ * Called from do_try_to_free_pages() and __alloc_pages()
*/
void kmem_cache_reap (int gfp_mask)
{
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c Wed Jan 10 02:17:59 2001
+++ linux/mm/vmscan.c Wed Jan 10 05:57:45 2001
@@ -156,20 +156,6 @@
return 0;
}
-/*
- * A new implementation of swap_out(). We do not swap complete processes,
- * but only a small number of blocks, before we continue with the next
- * process. The number of blocks actually swapped is determined on the
- * number of page faults, that this process actually had in the last time,
- * so we won't swap heavily used processes all the time ...
- *
- * Note: the priority argument is a hint on much CPU to waste with the
- * swap block search, not a hint, of how much blocks to swap with
- * each process.
- *
- * (C) 1993 Kai Petzke, wpp@marie.physik.tu-berlin.de
- */
-
static inline int swap_out_pmd(struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end)
{
pte_t * pte;
@@ -818,17 +804,14 @@
*/
static int refill_inactive(unsigned int gfp_mask, int user)
{
- int priority, count, start_count;
+ int priority, count, start_count, maxtry;
count = inactive_shortage() + free_shortage();
if (user)
count = (1 << page_cluster);
start_count = count;
- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(gfp_mask);
-
- priority = 6;
+ maxtry = priority = 6;
do {
if (current->need_resched) {
__set_current_state(TASK_RUNNING);
@@ -842,7 +825,10 @@
/* If refill_inactive_scan failed, try to page stuff out.. */
swap_out(priority, gfp_mask);
- } while (!inactive_shortage());
+
+ if(--maxtry <= 0)
+ return 0;
+ } while (inactive_shortage());
done:
return (count < start_count);
@@ -872,20 +858,14 @@
ret += refill_inactive(gfp_mask, user);
/*
- * Delete pages from the inode and dentry cache
- * if memory is low.
+ * Delete pages from the inode and dentry caches and
+ * reclaim unused slab cache if memory is low.
*/
if (free_shortage()) {
shrink_dcache_memory(6, gfp_mask);
shrink_icache_memory(6, gfp_mask);
- } else {
-
- /*
- * Reclaim unused slab cache memory.
- */
kmem_cache_reap(gfp_mask);
- ret = 1;
- }
+ }
return ret;
}
@@ -938,13 +918,8 @@
static int recalc = 0;
/* If needed, try to free some memory. */
- if (inactive_shortage() || free_shortage()) {
- int wait = 0;
- /* Do we need to do some synchronous flushing? */
- if (waitqueue_active(&kswapd_done))
- wait = 1;
- do_try_to_free_pages(GFP_KSWAPD, wait);
- }
+ if (inactive_shortage() || free_shortage())
+ do_try_to_free_pages(GFP_KSWAPD, 0);
/*
* Do some (very minimal) background scanning. This
@@ -960,11 +935,6 @@
recalculate_vm_stats();
}
- /*
- * Wake up everybody waiting for free memory
- * and unplug the disk queue.
- */
- wake_up_all(&kswapd_done);
run_task_queue(&tq_disk);
/*
@@ -995,33 +965,10 @@
}
}
-void wakeup_kswapd(int block)
+void wakeup_kswapd(void)
{
- DECLARE_WAITQUEUE(wait, current);
-
- if (current == kswapd_task)
- return;
-
- if (!block) {
- if (waitqueue_active(&kswapd_wait))
- wake_up(&kswapd_wait);
- return;
- }
-
- /*
- * Kswapd could wake us up before we get a chance
- * to sleep, so we have to be very careful here to
- * prevent SMP races...
- */
- __set_current_state(TASK_UNINTERRUPTIBLE);
- add_wait_queue(&kswapd_done, &wait);
-
- if (waitqueue_active(&kswapd_wait))
- wake_up(&kswapd_wait);
- schedule();
-
- remove_wait_queue(&kswapd_done, &wait);
- __set_current_state(TASK_RUNNING);
+ if (current != kswapd_task)
+ wake_up_process(kswapd_task);
}
/*
@@ -1046,7 +993,7 @@
/*
* Kreclaimd will move pages from the inactive_clean list to the
* free list, in order to keep atomic allocations possible under
- * all circumstances. Even when kswapd is blocked on IO.
+ * all circumstances.
*/
int kreclaimd(void *unused)
{
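
The refill_inactive() hunk above does two things: it changes the loop condition so that scanning continues while a shortage remains, and it bounds the number of passes with maxtry. A minimal user-space sketch of that control flow follows. The struct vm_model type, its fields and the return convention are assumptions made for illustration; the real function returns whether any progress was made.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's view of memory pressure. */
struct vm_model {
    int inactive_shortage;  /* pages still missing from the inactive lists */
    int pages_per_pass;     /* pages one scan pass manages to deactivate */
};

/*
 * Model of the patched loop: keep scanning while a shortage remains,
 * but give up after max_tries passes so the caller cannot spin forever.
 * Returns 1 if the shortage was cleared, 0 if we gave up first.
 */
int refill_inactive_model(struct vm_model *vm, int max_tries)
{
    assert(max_tries > 0);
    do {
        /* Stands in for one refill_inactive_scan()/swap_out() pass. */
        vm->inactive_shortage -= vm->pages_per_pass;
        if (vm->inactive_shortage < 0)
            vm->inactive_shortage = 0;

        /* Bounded retry, mirroring "if (--maxtry <= 0)" in the patch. */
        if (--max_tries <= 0)
            return vm->inactive_shortage == 0;
    } while (vm->inactive_shortage > 0);

    return 1;
}
```

With pages_per_pass set to 0 the model still terminates after max_tries passes, which is the property the maxtry bound is there to provide.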
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 1:45 ` David Woodhouse
2001-01-10 2:26 ` Andrea Arcangeli
@ 2001-01-10 6:57 ` Linus Torvalds
2001-01-10 11:46 ` David Woodhouse
1 sibling, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-10 6:57 UTC (permalink / raw)
To: David Woodhouse
Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel
On Wed, 10 Jan 2001, David Woodhouse wrote:
>
> How does this affect embedded systems with no swap space at all?
The no-swap behaviour should actually be pretty much identical, simply
because both 2.2 and 2.4 will do the same thing: just skip dirty pages in
the page tables because they cannot do anything about them.
That said, the _other_ VM differences in 2.4.x may obviously make a
difference, just not the sticky swap cache one..
Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 0:12 ` Marcelo Tosatti
@ 2001-01-10 11:29 ` Stephen C. Tweedie
0 siblings, 0 replies; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-10 11:29 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Linus Torvalds, Stephen C. Tweedie, David S. Miller, Rik van Riel,
linux-mm
Hi,
On Tue, Jan 09, 2001 at 10:12:45PM -0200, Marcelo Tosatti wrote:
> On Tue, 9 Jan 2001, Linus Torvalds wrote:
>
> > Hmm.. Fair enough. However, if you don't have VM pressure, you're also not
> > going to look at the page tables, so you are not going to get any use
> > information from them, either.
>
> Are you sure that potentially unmapping pte's and swapping out its pages
> in the background scanning is ok?
Why not? We're only going to be aging things slowly in the absence of
memory pressure, and if a page hasn't been used between two
widely-separated passes then inactivating the page isn't likely to
have much impact: it's only a soft-fault to get it back.
> > The aging should really be done at roughly the same rate as the "mark
> > active", wouldn't you say? If you mark things active without aging, pages
> > end up all being marked as "new". And if you age without marking things
> > active, they all end up being "old". Neither is good. What you really want
> > to have is aging that happens at the same rate as reference marking.
> > So one "conditional aging" algorithm might just be something as simple as
> >
> > - every time you mark something referenced, you increment a counter
> > - every time you want to age something, you check whether the counter is
> > positive first (and decrement it if you age something)
>
> Seems to be a nice solution.
This is _exactly_ what I proposed to Rik last time we talked about
it, and it seems to be the right balance between maintaining up-to-date
information when data is being accessed, and maintaining old state
when it isn't. You need to decay the counter appropriately, though.
--Stephen
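
The "conditional aging" idea quoted above, together with the decay mentioned in the reply, can be sketched roughly as follows. The struct age_credit type, the function names and the choice of halving as the decay step are all assumptions made for illustration; the thread does not specify how the counter should decay.

```c
#include <assert.h>

/* Hypothetical credit counter tying page aging to reference marking. */
struct age_credit {
    unsigned int credit;  /* references seen but not yet spent on aging */
};

/* Call whenever a page is marked referenced. */
void note_reference(struct age_credit *ac)
{
    assert(ac);
    ac->credit++;
}

/* Returns 1 and spends one credit if aging is allowed, otherwise 0. */
int may_age(struct age_credit *ac)
{
    assert(ac);
    if (ac->credit == 0)
        return 0;
    ac->credit--;
    return 1;
}

/*
 * Periodic decay (assumed here: halve the counter) so that an old burst
 * of references does not license aging long after activity has stopped.
 */
void decay_credit(struct age_credit *ac)
{
    assert(ac);
    ac->credit /= 2;
}
```

An aging pass would call may_age() before aging each page and stop once it returns 0, so aging proceeds at roughly the rate at which pages are being marked referenced.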
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 6:57 ` Linus Torvalds
@ 2001-01-10 11:46 ` David Woodhouse
2001-01-10 14:56 ` Andrea Arcangeli
2001-01-10 17:03 ` Linus Torvalds
0 siblings, 2 replies; 128+ messages in thread
From: David Woodhouse @ 2001-01-10 11:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel
torvalds@transmeta.com said:
> The no-swap behaviour should actually be pretty much identical,
> simply because both 2.2 and 2.4 will do the same thing: just skip
> dirty pages in the page tables because they cannot do anything about
> them.
So the VM code spends a fair amount of time scanning lists of pages which
it really can't do anything about?
Would it be possible to put such pages on a different list, so that the VM
code doesn't have to keep skipping them?
(forgive me if I'm displaying my utter ignorance of the VM code here)
--
dwmw2
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 11:46 ` David Woodhouse
@ 2001-01-10 14:56 ` Andrea Arcangeli
2001-01-10 17:46 ` Eric W. Biederman
2001-01-10 17:03 ` Linus Torvalds
1 sibling, 1 reply; 128+ messages in thread
From: Andrea Arcangeli @ 2001-01-10 14:56 UTC (permalink / raw)
To: David Woodhouse
Cc: Linus Torvalds, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
linux-kernel
On Wed, Jan 10, 2001 at 11:46:03AM +0000, David Woodhouse wrote:
> So the VM code spends a fair amount of time scanning lists of pages which
> it really can't do anything about?
Yes.
> Would it be possible to put such pages on a different list, so that the VM
Currently to unmap the other pages we have to waste time on those unfreeable
pages as well.
Once I or another developer finishes with the reverse lookup from page to
pte-chain (an implementation from DaveM already exists) we'll be able to put them
in a separate lru, but it's certainly not a 2.4.1-pre2 thing.
Andrea
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 11:46 ` David Woodhouse
2001-01-10 14:56 ` Andrea Arcangeli
@ 2001-01-10 17:03 ` Linus Torvalds
2001-01-11 14:36 ` Jim Gettys
1 sibling, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-10 17:03 UTC (permalink / raw)
To: David Woodhouse
Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel
On Wed, 10 Jan 2001, David Woodhouse wrote:
>
> torvalds@transmeta.com said:
> > The no-swap behaviour should actually be pretty much identical,
> > simply because both 2.2 and 2.4 will do the same thing: just skip
> > dirty pages in the page tables because they cannot do anything about
> > them.
>
> So the VM code spends a fair amount of time scanning lists of pages which
> it really can't do anything about?
It can do _tons_ of stuff.
Remember, on platforms like this, one of the reasons for being low on
memory is things like running X and netscape: maybe you have 64MB of RAM
and you don't think you need a swap device, and you want to have a web
browser.
The fact that we cannot touch _dirty_ pages doesn't mean that there's
nothing to do: instead of running out of memory we can at least make the
machine usable by dropping the text pages and the page cache..
> Would it be possible to put such pages on a different list, so that the VM
> code doesn't have to keep skipping them?
If we don't have any swapspace, the dirty pages will not be on any lists:
they will never have exited the page tables, and they will just be dirty
anonymous, unlisted pages.
We'll still scan the page tables (and see them there), but we have to do
that to find the clean and unreferenced pages - we don't have separate
"dirty page tables" and "clean page tables" ;)
Linus
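
The scan described above can be illustrated with a toy model. This is only a sketch: struct toy_pte and its flags are hypothetical and are not the kernel's pte bits, and every dirty page is treated as anonymous (so it cannot be written anywhere without swap).

```c
#include <assert.h>
#include <stddef.h>

/* Toy page-table entry; the fields are illustrative only. */
struct toy_pte {
    int present;     /* page is mapped */
    int dirty;       /* page has been written to */
    int referenced;  /* page was touched since the last scan */
};

/*
 * Walk the toy page table once, as a no-swap system would:
 *  - dirty pages are skipped, since there is nowhere to write them;
 *  - referenced pages only lose their referenced bit (they get older);
 *  - clean, unreferenced pages are dropped.
 * Returns the number of pages dropped.
 */
size_t scan_without_swap(struct toy_pte *table, size_t n)
{
    size_t freed = 0;

    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct toy_pte *pte = &table[i];

        if (!pte->present)
            continue;
        if (pte->dirty)
            continue;            /* cannot do anything about it */
        if (pte->referenced) {
            pte->referenced = 0; /* give it another pass to prove itself */
            continue;
        }
        pte->present = 0;        /* clean and idle: drop it */
        freed++;
    }
    return freed;
}
```

The dirty entries are still visited on every pass, which is the cost being discussed: the walk has to cover the same table to find the clean, unreferenced entries.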
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-09 19:29 ` Trond Myklebust
@ 2001-01-10 17:32 ` Andi Kleen
2001-01-10 19:31 ` Alan Cox
0 siblings, 1 reply; 128+ messages in thread
From: Andi Kleen @ 2001-01-10 17:32 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Daniel Phillips, Linus Torvalds, linux-kernel
On Tue, Jan 09, 2001 at 08:29:02PM +0100, Trond Myklebust wrote:
> Al has mentioned that he wants us to move towards a *BSD-like system
> of credentials (i.e. struct ucred) that could be used here, but that's
> in the far future. In the meantime, we cache RPC credentials in the
> struct file...
struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
credentials between threads, but still keeping system calls atomic in
relation to credential changes)
-Andi (who doesn't want to know how many security holes are in linux ported
programs using threads and set*id() because of that)
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 14:56 ` Andrea Arcangeli
@ 2001-01-10 17:46 ` Eric W. Biederman
2001-01-10 18:33 ` Andrea Arcangeli
2001-01-10 19:03 ` Linus Torvalds
0 siblings, 2 replies; 128+ messages in thread
From: Eric W. Biederman @ 2001-01-10 17:46 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: David Woodhouse, Linus Torvalds, Zlatko Calusic,
Eric W. Biederman, Rik van Riel, linux-kernel
Andrea Arcangeli <andrea@suse.de> writes:
> On Wed, Jan 10, 2001 at 11:46:03AM +0000, David Woodhouse wrote:
> > So the VM code spends a fair amount of time scanning lists of pages which
> > it really can't do anything about?
>
> Yes.
>
> > Would it be possible to put such pages on a different list, so that the VM
>
> Currently to unmap the other pages we have to waste time on those unfreeable
> pages as well.
>
> Once I or another developer finishes with the reverse lookup from page to
> pte-chain (an implementation from DaveM already exists) we'll be able to put them
> in a separate lru, but it's certainly not a 2.4.1-pre2 thing.
Why do we even want to do reverse page tables?
It seems everyone is assuming this is a good thing and except for being
a touch more flexible I don't see what this buys us (besides more locked memory).
My impression with the MM stuff is that everyone except linux is
trying hard to clone BSD instead of thinking through the issues
ourselves.
And because of the extra overhead this doesn't look to be a win on a
heavily loaded box with no swap. And probably only glibc mmaped.
Eric
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 17:46 ` Eric W. Biederman
@ 2001-01-10 18:33 ` Andrea Arcangeli
2001-01-17 14:26 ` Rik van Riel
2001-01-10 19:03 ` Linus Torvalds
1 sibling, 1 reply; 128+ messages in thread
From: Andrea Arcangeli @ 2001-01-10 18:33 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Woodhouse, Linus Torvalds, Zlatko Calusic, Rik van Riel,
linux-kernel
On Wed, Jan 10, 2001 at 10:46:07AM -0700, Eric W. Biederman wrote:
> Why do we even want to do reverse page tables?
> It seems everyone is assuming this is a good thing and except for being
I'm not assuming it's a good thing, but I believe it's something to try.
> My impression with the MM stuff is that everyone except linux is
> trying hard to clone BSD instead of thinking through the issues
> ourselves.
I wasn't even thinking about BSD and I always thought about the issues myself,
no panic ;).
> And because of the extra overhead this doesn't look to be a win on a
> heavily loaded box with no swap. And probably only glibc mmaped.
It can make sense also without swap. We could drop clean pages from the lru
directly that way without wasting time on pages that we don't have a chance to
free (incidentally it's exactly the optimization requested by David W. for
embedded systems). Note that I'm not convinced that it would be worthwhile to
separate the anonymous and shm pages from the other mapped pages but in theory
we could do that.
I didn't mean that it is certainly the right way to go, but with reverse
lookup we could do very ""interesting"" things and I think it's worthwhile to
research and benchmark what happens (note also that depending on the
implementation very different things can happen at runtime)
Andrea
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 17:46 ` Eric W. Biederman
2001-01-10 18:33 ` Andrea Arcangeli
@ 2001-01-10 19:03 ` Linus Torvalds
2001-01-10 19:27 ` David S. Miller
` (2 more replies)
1 sibling, 3 replies; 128+ messages in thread
From: Linus Torvalds @ 2001-01-10 19:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Andrea Arcangeli, David Woodhouse, Zlatko Calusic, Rik van Riel,
linux-kernel
On 10 Jan 2001, Eric W. Biederman wrote:
> Andrea Arcangeli <andrea@suse.de> writes:
> >
> > Once I or another developer finishes with the reverse lookup from page to
> > pte-chain (an implementation from DaveM already exists) we'll be able to put them
> > in a separate lru, but it's certainly not a 2.4.1-pre2 thing.
>
> Why do we even want to do reverse page tables?
We don't.
But it does come up every once in a while, and it will probably continue
to do so.
I looked at it a year or two ago myself, and came to the conclusion that I
don't want to blow up our page table size by a factor of three or more, so
I'm not personally interested any more. Maybe somebody else comes up with
a better way to do it, or with a really compelling reason to.
"Feel free to try" is definitely the open source motto.
Linus
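
The message does not say how the "factor of three or more" was arrived at. One plausible accounting, assuming a 32-bit machine where a pte and a pointer are 4 bytes each and a hypothetical chain node holding two pointers per mapping, is sketched below. The type names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 32-bit layout: a pte is 4 bytes. */
typedef uint32_t toy_pte_t;

/* Hypothetical pte-chain node, one per mapping of a page. */
struct toy_pte_chain {
    uint32_t next;  /* stands in for a 4-byte pointer to the next node */
    uint32_t ptep;  /* stands in for a 4-byte pointer to the pte */
};

/* Bytes of mapping metadata per mapped page, with or without a chain node. */
unsigned int bytes_per_mapping(int with_reverse_map)
{
    unsigned int bytes = sizeof(toy_pte_t);

    assert(with_reverse_map == 0 || with_reverse_map == 1);
    if (with_reverse_map)
        bytes += sizeof(struct toy_pte_chain);
    return bytes;
}
```

Under these assumptions the per-mapping cost goes from 4 bytes to 12 bytes, a factor of three, before counting any allocator overhead for the chain nodes.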
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:03 ` Linus Torvalds
@ 2001-01-10 19:27 ` David S. Miller
2001-01-10 19:36 ` Alan Cox
2001-01-17 14:28 ` Rik van Riel
2 siblings, 0 replies; 128+ messages in thread
From: David S. Miller @ 2001-01-10 19:27 UTC (permalink / raw)
To: torvalds; +Cc: ebiederm, andrea, dwmw2, zlatko, riel, linux-kernel
Date: Wed, 10 Jan 2001 11:03:21 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
"Feel free to try" is definitely the open source motto.
I basically came to the conclusion that it sucks when I
gave it a go.
In my scheme I tried to save space by using very small descriptors to
keep track of anonymous areas in processes. This was essentially a
vma->vm_anon pointer that kept track of the pages for you.
After trying to fight this for a few days I determined that this
doesn't work at all because of how COW dups the pages around on you.
Also it was a devil to work out anonymous pages created due to writes
to private mmaps of a file, as soon as one of these were made for the
first time on a vma you had to cook up one of the anon descriptors.
Yeah, I got the anon descriptor down to 2 pointers and an atomic
counter, but it didn't work so this achievement was worthless :-)
There are a few approaches that work, but they tend to take up too
much space to be worth considering, as Linus mentioned.
Later,
David S. Miller
davem@redhat.com
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 17:32 ` Andi Kleen
@ 2001-01-10 19:31 ` Alan Cox
2001-01-10 19:33 ` Andi Kleen
2001-01-10 20:11 ` Linus Torvalds
0 siblings, 2 replies; 128+ messages in thread
From: Alan Cox @ 2001-01-10 19:31 UTC (permalink / raw)
To: Andi Kleen; +Cc: Trond Myklebust, Daniel Phillips, Linus Torvalds, linux-kernel
> struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
> credentials between threads, but still keeping system calls atomic in
> relation to credential changes)
That is extremely undesirable behaviour. setuid() changes for pthreads crud
should be done by the library emulation layer. Many people have very real
and very good reasons for running multiple parallel ids. Just try writing
a threaded ftp daemon (non anonymous) without that, or an nfs server
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:31 ` Alan Cox
@ 2001-01-10 19:33 ` Andi Kleen
2001-01-10 19:40 ` Alan Cox
2001-01-10 20:11 ` Linus Torvalds
1 sibling, 1 reply; 128+ messages in thread
From: Andi Kleen @ 2001-01-10 19:33 UTC (permalink / raw)
To: Alan Cox
Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
linux-kernel
On Wed, Jan 10, 2001 at 07:31:52PM +0000, Alan Cox wrote:
> > struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
> > credentials between threads, but still keeping system calls atomic in
> > relation to credential changes)
>
> That is extremely undesirable behaviour. setuid() changes for pthreads crud
> should be done by the library emulation layer. Many people have very real
> and very good reasons for running multiple parallel ids. Just try writing
> a threaded ftp daemon (non anonymous) without that, or an nfs server
Of course not by default, it would be a new clone flag (with default to on in
linuxthreads though, to not cause security holes in ported programs like today)
-Andi
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:03 ` Linus Torvalds
2001-01-10 19:27 ` David S. Miller
@ 2001-01-10 19:36 ` Alan Cox
2001-01-10 23:56 ` David Weinehall
2001-01-17 14:28 ` Rik van Riel
2 siblings, 1 reply; 128+ messages in thread
From: Alan Cox @ 2001-01-10 19:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Andrea Arcangeli, David Woodhouse,
Zlatko Calusic, Rik van Riel, linux-kernel
> I looked at it a year or two ago myself, and came to the conclusion that I
> don't want to blow up our page table size by a factor of three or more, so
> I'm not personally interested any more. Maybe somebody else comes up with
> a better way to do it, or with a really compelling reason to.
There is only one reason I know for reverse page tables. That is ARM2/ARM3
support - which is still not fully merged because of this issue
The MMU on these systems is a CAM, and the mmu table is thus backwards to
convention. (It also means you can notionally map two physical addresses to
one virtual but that's undefined in the implementation ;))
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:33 ` Andi Kleen
@ 2001-01-10 19:40 ` Alan Cox
2001-01-10 19:43 ` Andi Kleen
0 siblings, 1 reply; 128+ messages in thread
From: Alan Cox @ 2001-01-10 19:40 UTC (permalink / raw)
To: Andi Kleen
Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
Linus Torvalds, linux-kernel
> Of course not by default, it would be a new clone flag (with default to on in
> linuxthreads though, to not cause security holes in ported programs like today)
I've seen exactly nil cases where there are any security holes in apps caused
by that pthreads API non-adherence. There are also far too many overheads
imposed by implementing something in kernel space that is nearly useless,
not needed for any application 99.9999% of users (possibly 100%) have and can
be done just as well in the pthreads library glue - where it will only be
a penalty to pthread using apps.
Making everyone suffer for a bad standard corner case is bad. Especially when
the 'security hole' is pure FUD
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:40 ` Alan Cox
@ 2001-01-10 19:43 ` Andi Kleen
2001-01-10 19:48 ` Alan Cox
0 siblings, 1 reply; 128+ messages in thread
From: Andi Kleen @ 2001-01-10 19:43 UTC (permalink / raw)
To: Alan Cox
Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
linux-kernel
On Wed, Jan 10, 2001 at 07:40:49PM +0000, Alan Cox wrote:
> > Of course not by default, it would be a new clone flag (with default to on in
> > linuxthreads though, to not cause security holes in ported programs like today)
>
> I've seen exactly nil cases where there are any security holes in apps caused
> by that pthreads API non-adherence. There are also far too many overheads
> imposed by implementing something in kernel space that is nearly useless,
> not needed for any application 99.9999% of users (possibly 100%) have and can
> be done just as well in the pthreads library glue - where it will only be
> a penalty to pthread using apps.
I have not seen a good way to implement it in user space yet.
> Making everyone suffer for a bad standard corner case is bad. Especially when
> the 'security hole' is pure FUD
>
As the thread started, it's not only needed for pthreads, but also for NFS
and setuid (actually NFS already implements it privately), and probably other network
file systems too. So it's far from being only a "bad standard corner case".
-Andi
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:43 ` Andi Kleen
@ 2001-01-10 19:48 ` Alan Cox
2001-01-10 19:48 ` Andi Kleen
2001-01-11 9:51 ` Trond Myklebust
0 siblings, 2 replies; 128+ messages in thread
From: Alan Cox @ 2001-01-10 19:48 UTC (permalink / raw)
To: Andi Kleen
Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
Linus Torvalds, linux-kernel
> As the thread started, it's not only needed for pthreads, but also for NFS
> and setuid (actually NFS already implements it privately), and probably other network
> file systems too. So it's far from being only a "bad standard corner case".
I wonder how Linux 2.2 worked, that doesn't have them. Now if it's a clean way
of sorting out a pile of other things and it does pthreads as a side effect
I've no problem, but arguing for it because of a tiny pthreads corner case
is coming from the wrong end
Alan
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:48 ` Alan Cox
@ 2001-01-10 19:48 ` Andi Kleen
2001-01-11 9:51 ` Trond Myklebust
1 sibling, 0 replies; 128+ messages in thread
From: Andi Kleen @ 2001-01-10 19:48 UTC (permalink / raw)
To: Alan Cox
Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
linux-kernel
On Wed, Jan 10, 2001 at 07:48:04PM +0000, Alan Cox wrote:
> > As the thread started, it's not only needed for pthreads, but also for NFS
> > and setuid (actually NFS already implements it privately), and probably other network
> > file systems too. So it's far from being only a "bad standard corner case".
>
> I wonder how Linux 2.2 worked, that doesn't have them. Now if it's a clean way
> of sorting out a pile of other things and it does pthreads as a side effect
Linux 2.2 setuid in nfs never worked quite like traditional Unix, and there
were lots of reports because users were regularly rediscovering it.
I think the nfs patches merged in 2.2.18 fixed it (?)
> I've no problem, but arguing for it because of a tiny pthreads corner case
> is coming from the wrong end
I'm not so sure the thread corner case is that tiny.
-Andi
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
@ 2001-01-10 19:57 Chris Wing
0 siblings, 0 replies; 128+ messages in thread
From: Chris Wing @ 2001-01-10 19:57 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel
Alan:
> I've seen exactly nil cases where there are any security holes in apps caused
> by that pthreads API non-adherence.
I don't know of any exploitable bugs that were found in it, but the identd
server included in Red Hat 6.1 (pidentd 3.0.10) unintentionally ran as
root instead of nobody because its programmer used pthreads and assumed
that setuid() would affect all threads.
I pointed this out to the author and Red Hat, and it was fixed in
pidentd 3.0.11 and Red Hat 6.2.
-Chris Wing
wingc@engin.umich.edu
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:31 ` Alan Cox
2001-01-10 19:33 ` Andi Kleen
@ 2001-01-10 20:11 ` Linus Torvalds
2001-01-11 12:56 ` Stephen C. Tweedie
1 sibling, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-10 20:11 UTC (permalink / raw)
To: Alan Cox; +Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, linux-kernel
On Wed, 10 Jan 2001, Alan Cox wrote:
>
> That is extremely undesirable behaviour. setuid() changes for pthreads crud
> should be done by the library emulation layer. Many people have very real
> and very good reasons for running multiple parallel ids. Just try writing
> a threaded ftp daemon (non anonymous) without that, or an nfs server
I absolutely think that "one thread, one ID" is the way to go.
That said, we can easily support the notion of CLONE_CRED if we absolutely
have to (and sane people just shouldn't use it), so if somebody wants to
work on this for 2.5.x...
Linus
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-10 6:39 ` Marcelo Tosatti
@ 2001-01-10 22:19 ` Roger Larsson
2001-01-11 0:11 ` Zlatko Calusic
1 sibling, 0 replies; 128+ messages in thread
From: Roger Larsson @ 2001-01-10 22:19 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-mm
On Wednesday 10 January 2001 07:39, Marcelo Tosatti wrote:
> On Tue, 9 Jan 2001, Linus Torvalds wrote:
> > I suspect that the proper fix is something more along the lines of what
> > we did to bdflush: get rid of the notion of waiting synchronously from
> > bdflush, and instead do the work yourself.
>
> Agreed.
>
> Without blocking on sync IO, kswapd can keep aging pages and moving
> them to the inactive lists.
>
> The following patch changes some stuff we've discussed before (the
> kmem_cache_reap and maxtry thingies) and it also removes the kswapd
> sleeping scheme.
>
> I haven't tested it yet, though I'll do it tomorrow.
>
I have it running...
It gave me the highest dbench 16 result I have seen [I recently began
running against a faster disk...]
On my PPro 180 with 96 MB RAM [best of 3]:
write, copy, read and diff use plain bash commands with 150 or 300 MB of data
[streaming];
only one run of dbench (it takes too much time)
[CLIENTS goes via a symbolic link to the other disk - not perfect but...]
kernel            write   copy   read   diff   dbench
2.4.0              10.6   10.9   14.1    8.3     10.2
2.4.1-pre1+neg     10.1   10.9   14.0    8.2     10.0
2.4.1-pre1+this    11.5   10.6   14.4    8.2     10.8
for comparison:
2.2.18             10.6    9.7   12.8    7.2      7.7
The only really strange thing common to all the 2.4 kernels is
konqueror's brk usage resulting in SIGSEGV. Reported earlier to linux-kernel.
select(16, [3 4 6 7 9 10 12 13 14 15], NULL, NULL, {0, 0}) = 2 (in [7 13],
left {0, 0})
read(13, " 4_ a_", 10) = 10
read(13, "\0\0\0\0", 4) = 4
read(7, "\2\1\0\2.\1\0\0", 8) = 8
read(7, "\1\0\0\0", 4) = 4
read(7, "\0\0\0\17konqueror-3415\0\0\0\0\vkonqueror"..., 302) = 302
brk(0x84f8000) = 0x84f8000
brk(0x84fd000) = 0x84fd000
brk(0x8502000) = 0x8502000
brk(0x8507000) = 0x8507000
brk(0x850c000) = 0x850c000
brk(0x8511000) = 0x8511000
brk(0x8516000) = 0x8516000
brk(0x851b000) = 0x851b000
brk(0x8520000) = 0x8520000
[...]
brk(0xd02d000) = 0xd02d000
brk(0xd02f000) = 0xd02f000
brk(0xd031000) = 0xd02f000
brk(0xd031000) = 0xd02f000
brk(0xd031000) = 0xd02f000
brk(0xd031000) = 0xd02f000
brk(0xd031000) = 0xd02f000
brk(0xd031000) = 0xd02f000
--- SIGSEGV (Segmentation fault) ---
--- SIGSEGV (Segmentation fault) ---
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++
--
Home page:
http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:36 ` Alan Cox
@ 2001-01-10 23:56 ` David Weinehall
2001-01-11 0:24 ` Alan Cox
2001-01-12 5:56 ` Ralf Baechle
0 siblings, 2 replies; 128+ messages in thread
From: David Weinehall @ 2001-01-10 23:56 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel
On Wed, Jan 10, 2001 at 07:36:43PM +0000, Alan Cox wrote:
> > I looked at it a year or two ago myself, and came to the conclusion that I
> > don't want to blow up our page table size by a factor of three or more, so
> > I'm not personally interested any more. Maybe somebody else comes up with
> > a better way to do it, or with a really compelling reason to.
>
> There is only one reason I know for reverse page tables. That is ARM2/ARM3
> support - which is still not fully merged because of this issue
>
> The MMU on these systems is a CAM, and the mmu table is thus backwards to
> convention. (It also means you can notionally map two physical addresses to
> one virtual but that's undefined in the implementation ;))
Are there any other (not yet supported) platforms with similar (or other
unrelated, but hard to support because of the current architecture of
the kernel) problems?
(No, I have no secret trumps up my sleeve, I'm just curious.)
/David
_ _
// David Weinehall <tao@acc.umu.se> /> Northern lights wander \\
// Project MCA Linux hacker // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-10 6:39 ` Marcelo Tosatti
2001-01-10 22:19 ` Roger Larsson
@ 2001-01-11 0:11 ` Zlatko Calusic
2001-01-17 6:58 ` Rik van Riel
1 sibling, 1 reply; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-11 0:11 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Linus Torvalds, linux-mm
Marcelo Tosatti <marcelo@conectiva.com.br> writes:
> On Tue, 9 Jan 2001, Linus Torvalds wrote:
>
> > I suspect that the proper fix is something more along the lines of what we
> > did to bdflush: get rid of the notion of waiting synchronously from
> > bdflush, and instead do the work yourself.
>
> Agreed.
>
> Without blocking on sync IO, kswapd can keep aging pages and moving
> them to the inactive lists.
>
> The following patch changes some stuff we've discussed before (the
> kmem_cache_reap and maxtry thingies) and it also removes the kswapd
> sleeping scheme.
>
> I haven't tested it yet, though I'll do it tomorrow.
>
I have tested it for you and results are great. On some tests I got
20% to 30% better results which is amazing. I'll do some more tests
but I would vote for this to get in immediately. Yes, it's *so* good.
Great work Marcelo!
--
Zlatko
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 23:56 ` David Weinehall
@ 2001-01-11 0:24 ` Alan Cox
2001-01-12 5:56 ` Ralf Baechle
1 sibling, 0 replies; 128+ messages in thread
From: Alan Cox @ 2001-01-11 0:24 UTC (permalink / raw)
To: David Weinehall
Cc: Alan Cox, Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel
> > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > convention. (It also means you can notionally map two physical addresses to
> > one virtual but that's undefined in the implementation ;))
>
> Are there any other (not yet supported) platforms with similar (or other
> unrelated, but hard to support because of the current architecture of
> the kernel) problems?
I believe it's uniquely deranged. There are people who have asked for reverse
tables for other purposes (eg cache flush handling) but their mmu is the normal
way around.
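
For readers unfamiliar with the "backwards" layout quoted above, here is a minimal user-space sketch of a table indexed by physical frame rather than by virtual page. It is only an illustration of the idea: the names (inverted_table, MODEL_NFRAMES and so on) are made up for this example and do not come from any ARM or kernel source, and a real CAM performs the search in hardware rather than with a loop.

```c
#include <assert.h>

/* Hypothetical model parameters, chosen only for illustration. */
#define MODEL_NFRAMES  8
#define MODEL_UNMAPPED (-1L)

/* One entry per physical frame, holding the virtual page mapped there. */
struct inverted_table {
    long vpage_of_frame[MODEL_NFRAMES];
};

/* Mark every physical frame as unmapped. */
static void inverted_init(struct inverted_table *t)
{
    for (int f = 0; f < MODEL_NFRAMES; f++)
        t->vpage_of_frame[f] = MODEL_UNMAPPED;
}

/* Record that virtual page `vpage` lives in physical frame `frame`. */
static void inverted_map(struct inverted_table *t, int frame, long vpage)
{
    assert(frame >= 0 && frame < MODEL_NFRAMES);
    t->vpage_of_frame[frame] = vpage;
}

/*
 * Translate a virtual page by searching every frame entry.
 * Returns the frame number, or -1 if the page is not mapped.
 */
static int inverted_lookup(const struct inverted_table *t, long vpage)
{
    for (int f = 0; f < MODEL_NFRAMES; f++)
        if (t->vpage_of_frame[f] == vpage)
            return f;
    return -1;
}
```

Nothing in this layout stops two frames from claiming the same virtual page; the sketch simply returns the first match, whereas the hardware behaviour in that case is described above as undefined.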
Alan
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 0:23 ` Linus Torvalds
2001-01-10 0:12 ` Marcelo Tosatti
@ 2001-01-11 3:30 ` Marcelo Tosatti
2001-01-11 9:42 ` Stephen C. Tweedie
1 sibling, 1 reply; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-11 3:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, David S. Miller, Rik van Riel, linux-mm
On Tue, 9 Jan 2001, Linus Torvalds wrote:
> So one "conditional aging" algorithm might just be something as simple as
I've done a very simple conditional aging patch (I don't think new
functions to scan the active list and the pte's are necessary).
kswapd does not obey the counter perfectly: if the counter reaches 0 in
the middle of a swap_out() call that was started while the counter was
still > 0, that call keeps scanning.
But since swap_out() only scans a small part of an mm, I don't think
the "non perfect" scanning is a big issue.
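
The counter scheme can be modelled in user space as follows. This is only a sketch of the idea under simplifying assumptions, not the patch itself: aging_model, model_activate and model_scan are hypothetical names, and the budget field plays the role of bg_page_aging.

```c
#include <assert.h>

/* Hypothetical user-space model; this is not kernel code. */
struct aging_model {
    int budget;   /* plays the role of bg_page_aging */
};

/* A page was activated: background aging earns one more credit. */
static void model_activate(struct aging_model *m)
{
    m->budget++;
}

/*
 * Try to age `npages` pages down in one scan.
 * A background scan stops as soon as the budget is exhausted;
 * a scan under memory pressure ages every page regardless.
 * Returns the number of pages actually aged down.
 */
static int model_scan(struct aging_model *m, int npages, int background)
{
    int aged = 0;

    assert(npages >= 0);
    for (int i = 0; i < npages; i++) {
        if (m->budget > 0)
            m->budget--;        /* spend one credit */
        else if (background)
            break;              /* no credit left: stop background aging */
        aged++;
    }
    return aged;
}
```

With three activations, a background scan over ten pages ages only three of them, while a scan under memory pressure still ages all ten.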
Comments?
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h Thu Jan 11 00:27:46 2001
+++ linux/include/linux/swap.h Thu Jan 11 02:45:04 2001
@@ -101,6 +101,8 @@
extern void swap_setup(void);
/* linux/mm/vmscan.c */
+extern int bg_page_aging;
+
extern struct page * reclaim_page(zone_t *);
extern wait_queue_head_t kswapd_wait;
extern wait_queue_head_t kreclaimd_wait;
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/swap.c linux/mm/swap.c
--- linux.orig/mm/swap.c Thu Jan 11 00:27:45 2001
+++ linux/mm/swap.c Thu Jan 11 02:12:01 2001
@@ -214,6 +214,8 @@
/* Make sure the page gets a fair chance at staying active. */
if (page->age < PAGE_AGE_START)
page->age = PAGE_AGE_START;
+
+ bg_page_aging++;
}
void activate_page(struct page * page)
diff --exclude-from=/home/marcelo/exclude -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c Thu Jan 11 00:27:45 2001
+++ linux/mm/vmscan.c Thu Jan 11 02:53:40 2001
@@ -24,6 +24,8 @@
#include <asm/pgalloc.h>
+int bg_page_aging = 0;
+
/*
* The swap-out functions return 1 if they successfully
* threw something out, and we got a free page. It returns
@@ -60,9 +62,12 @@
age_page_up(page);
goto out_failed;
}
- if (!onlist)
+ if (!onlist) {
/* The page is still mapped, so it can't be freeable... */
+ if(bg_page_aging)
+ bg_page_aging--;
age_page_down_ageonly(page);
+ }
/*
* If the page is in active use by us, or if the page
@@ -650,11 +655,12 @@
* This function will scan a portion of the active list to find
* unused pages, those pages will then be moved to the inactive list.
*/
-int refill_inactive_scan(unsigned int priority, int oneshot)
+int refill_inactive_scan(unsigned int priority, int background)
{
struct list_head * page_lru;
struct page * page;
- int maxscan, page_active = 0;
+ int maxscan, page_active;
+ int deactivate = 1;
int ret = 0;
/* Take the lock while messing with the list... */
@@ -674,8 +680,21 @@
/* Do aging on the pages. */
if (PageTestandClearReferenced(page)) {
age_page_up_nolock(page);
- page_active = 1;
- } else {
+ } else if (deactivate) {
+
+ /*
+ * We're aging down a page.
+ * Decrement the counter if it has not reached zero
+ * yet. If it has reached zero and we are doing a background
+ * scan, stop deactivating pages.
+ */
+ if (bg_page_aging)
+ bg_page_aging--;
+ else if (background) {
+ deactivate = 0;
+ continue;
+ }
+
age_page_down_ageonly(page);
/*
* Since we don't hold a reference on the page
@@ -691,8 +710,6 @@
(page->buffers ? 2 : 1)) {
deactivate_page_nolock(page);
page_active = 0;
- } else {
- page_active = 1;
}
}
/*
@@ -705,7 +722,8 @@
list_add(page_lru, &active_list);
} else {
ret = 1;
- if (oneshot)
+ /* Stop scanning if we're not doing background scan */
+ if (!background)
break;
}
}
@@ -818,7 +836,7 @@
schedule();
}
- while (refill_inactive_scan(priority, 1)) {
+ while (refill_inactive_scan(priority, 0)) {
if (--count <= 0)
goto done;
}
@@ -921,13 +939,19 @@
if (inactive_shortage() || free_shortage())
do_try_to_free_pages(GFP_KSWAPD, 0);
+
+ /* Do some (very minimal) background scanning. */
+
/*
- * Do some (very minimal) background scanning. This
- * will scan all pages on the active list once
+ * This will scan all pages on the active list once
* every minute. This clears old referenced bits
* and moves unused pages to the inactive list.
*/
- refill_inactive_scan(6, 0);
+ refill_inactive_scan(6, 1);
+
+ /* This will scan the pte's. */
+ if(bg_page_aging)
+ swap_out(6, 0);
/* Once a second, recalculate some VM stats. */
if (time_after(jiffies, recalc + HZ)) {
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-11 3:30 ` Marcelo Tosatti
@ 2001-01-11 9:42 ` Stephen C. Tweedie
2001-01-11 15:24 ` Marcelo Tosatti
0 siblings, 1 reply; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 9:42 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Linus Torvalds, Stephen C. Tweedie, David S. Miller, Rik van Riel,
linux-mm
Hi,
On Thu, Jan 11, 2001 at 01:30:18AM -0200, Marcelo Tosatti wrote:
>
> On Tue, 9 Jan 2001, Linus Torvalds wrote:
>
> > So one "conditional aging" algorithm might just be something as simple as
>
> I've done a very easy conditional aging patch (I don't think doing new
> functions to scan the active list and the pte's is necessary)
You still need to decay the bg_page_aging counter a little somewhere,
otherwise if you've been running a long-lived workload which keeps
most of memory recently activated, you'll build up such a large
counter that going idle will still age everything to zero.
This might be as simple as clamping the value of the counter to some
arbitrary maximum value such as num_physpages.
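
A saturating increment is enough to express the clamp. The sketch below is a user-space illustration only: MODEL_NUM_PHYSPAGES is a made-up stand-in for num_physpages, and the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's num_physpages. */
#define MODEL_NUM_PHYSPAGES 1000L

/* Saturating increment: the credit never exceeds the page count. */
static long credit_inc_clamped(long credit)
{
    assert(credit >= 0);
    return credit < MODEL_NUM_PHYSPAGES ? credit + 1 : credit;
}

/* Apply `n` activations in a row and return the resulting credit. */
static long credit_after_activations(long credit, long n)
{
    while (n-- > 0)
        credit = credit_inc_clamped(credit);
    return credit;
}
```

However long the busy period lasts, the credit left over when the machine goes idle is then bounded by one pass over physical memory.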
Cheers,
Stephen
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 19:48 ` Alan Cox
2001-01-10 19:48 ` Andi Kleen
@ 2001-01-11 9:51 ` Trond Myklebust
1 sibling, 0 replies; 128+ messages in thread
From: Trond Myklebust @ 2001-01-11 9:51 UTC (permalink / raw)
To: Alan Cox; +Cc: Andi Kleen, Daniel Phillips, Linus Torvalds, linux-kernel
>>>>> " " == Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>> As the thread started it's not only needed for pthreads,
>> but also for NFS and setuid (actually NFS already implements it
>> privately), and probably other network file systems too. So
>> it's far from being only a "bad standard corner case".
> I wonder how Linux 2.2 worked, that doesn't have them. Now if
> its a clean way of sorting out a pile of other things and it
> does pthreads as a side effect I've no problem, but arguing for
> it because of a tiny pthreads corner case is coming from the
> wrong end
How about this then:
Sure NFS can work without ucreds, but there are limitations. For
instance the MVFS folks recently complained. They're trying to keep
mmap consistency between their own filesystem layer and the underlying
storage filesystem using i_mapping (a la CODAfs). The problem then is
that the vma will be using the wrong 'struct file' to call the
underlying storage.
This sort of problem would indeed disappear if we have a generic
credential stored in the struct file as we could make the VFS pass the
credential directly to readpage (and writepage?) rather than passing
the whole struct file.
If you use the same credentials in the task structure, then there are
other advantages even to NFS itself.
You may for example want to attach an ACL cache at some point in time
(to avoid the messiness of calling NFSv3/v4 permissions routines at
each and every file lookup). Ditto for strong RPC authentication
schemes that require an upcall to some userspace daemon.
That said, we'd first have to find a way to reconcile fsuid/fsgid with
the BSD model in some way: I'd rather not have 2 'ucred's per task (1
for threads + 1 for filesystems).
Cheers,
Trond
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 20:11 ` Linus Torvalds
@ 2001-01-11 12:56 ` Stephen C. Tweedie
2001-01-11 13:10 ` Andi Kleen
` (3 more replies)
0 siblings, 4 replies; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 12:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
linux-kernel, Stephen Tweedie
Hi,
On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
>
> That said, we can easily support the notion of CLONE_CRED if we absolutely
> have to (and sane people just shouldn't use it), so if somebody wants to
> work on this for 2.5.x...
But is it really worth the pain? I'd hate to have to audit the entire
VFS to make sure that it works if another thread changes our
credentials in the middle of a syscall, so we either end up having to
lock the credentials over every VFS syscall, or take a copy of the
credentials and pass it through every VFS internal call that we make.
--Stephen
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-11 12:56 ` Stephen C. Tweedie
@ 2001-01-11 13:10 ` Andi Kleen
2001-01-11 13:12 ` Trond Myklebust
` (2 subsequent siblings)
3 siblings, 0 replies; 128+ messages in thread
From: Andi Kleen @ 2001-01-11 13:10 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
Daniel Phillips, linux-kernel
On Thu, Jan 11, 2001 at 12:56:04PM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> >
> > That said, we can easily support the notion of CLONE_CRED if we absolutely
> > have to (and sane people just shouldn't use it), so if somebody wants to
> > work on this for 2.5.x...
>
> But is it really worth the pain? I'd hate to have to audit the entire
> VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.
That is what NFS does already, it would just move into generic VFS then.
(NFS copies)
-Andi
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-11 12:56 ` Stephen C. Tweedie
2001-01-11 13:10 ` Andi Kleen
@ 2001-01-11 13:12 ` Trond Myklebust
2001-01-11 14:13 ` Stephen C. Tweedie
2001-01-11 16:50 ` Albert D. Cahalan
2001-01-11 19:01 ` Alexander Viro
3 siblings, 1 reply; 128+ messages in thread
From: Trond Myklebust @ 2001-01-11 13:12 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linus Torvalds, Alan Cox, Andi Kleen, Daniel Phillips,
linux-kernel
>>>>> " " == Stephen C Tweedie <sct@redhat.com> writes:
> Hi, On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds
> wrote:
>>
>> That said, we can easily support the notion of CLONE_CRED if we
>> absolutely have to (and sane people just shouldn't use it), so
>> if somebody wants to work on this for 2.5.x...
> But is it really worth the pain? I'd hate to have to audit the
> entire VFS to make sure that it works if another thread changes
> our credentials in the middle of a syscall, so we either end up
> having to lock the credentials over every VFS syscall, or take
> a copy of the credentials and pass it through every VFS
> internal call that we make.
What's wrong with copy-on-write style semantics? IOW, anyone who
wants to change the credentials needs to make a private copy of the
existing structure first.
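
As a rough illustration of what such semantics could look like, here is a single-threaded user-space sketch of a reference-counted credential record that is copied before it is modified. All names (cred_model, cred_get, cred_set_uid and so on) are hypothetical and not taken from any kernel tree, and a real implementation would need atomic reference counts.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified credential record; not a kernel structure. */
struct cred_model {
    int refcount;
    unsigned int uid;
    unsigned int gid;
};

/* Allocate a record with a single reference; returns NULL on failure. */
static struct cred_model *cred_new(unsigned int uid, unsigned int gid)
{
    struct cred_model *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    c->refcount = 1;
    c->uid = uid;
    c->gid = gid;
    return c;
}

/* Take an additional reference (e.g. for a struct file or a syscall). */
static struct cred_model *cred_get(struct cred_model *c)
{
    c->refcount++;
    return c;
}

/* Drop a reference and free the record when the last one goes away. */
static void cred_put(struct cred_model *c)
{
    assert(c->refcount > 0);
    if (--c->refcount == 0)
        free(c);
}

/*
 * Change the uid seen through *slot. If the record is shared, *slot is
 * first replaced by a private copy, so other holders keep the old values.
 * Returns 0 on success, -1 if the copy cannot be allocated.
 */
static int cred_set_uid(struct cred_model **slot, unsigned int uid)
{
    struct cred_model *c = *slot;

    if (c->refcount > 1) {
        struct cred_model *copy = cred_new(c->uid, c->gid);

        if (!copy)
            return -1;
        cred_put(c);            /* give up our reference to the shared record */
        *slot = copy;
        c = copy;
    }
    c->uid = uid;
    return 0;
}
```

Anything that took its own reference before the change keeps seeing the old uid and gid for as long as it holds that reference.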
Cheers,
Trond
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-11 13:12 ` Trond Myklebust
@ 2001-01-11 14:13 ` Stephen C. Tweedie
2001-01-11 19:03 ` Alexander Viro
0 siblings, 1 reply; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 14:13 UTC (permalink / raw)
To: Trond Myklebust
Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Andi Kleen,
Daniel Phillips, linux-kernel
Hi,
On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
>
> What's wrong with copy-on-write style semantics? IOW, anyone who
> wants to change the credentials needs to make a private copy of the
> existing structure first.
Because COW only solves the problem if each task is only changing its
own, local, private copy of the credentials. Posix threads demand
that one thread changing credentials also affects all the other
threads immediately, and making your own local private copy won't help
you to change the other tasks' credentials safely.
--Stephen
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-10 17:03 ` Linus Torvalds
@ 2001-01-11 14:36 ` Jim Gettys
0 siblings, 0 replies; 128+ messages in thread
From: Jim Gettys @ 2001-01-11 14:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Woodhouse, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
linux-kernel
> Sender: linux-kernel-owner@vger.kernel.org
> From: Linus Torvalds <torvalds@transmeta.com>
> Date: Wed, 10 Jan 2001 09:03:03 -0800 (PST)
> To: David Woodhouse <dwmw2@infradead.org>
> Cc: Zlatko Calusic <zlatko@iskon.hr>,
> "Eric W. Biederman" <ebiederm@xmission.com>,
> Rik van Riel <riel@conectiva.com.br>, linux-kernel@vger.kernel.org
> Subject: Re: Subtle MM bug
> -----
> On Wed, 10 Jan 2001, David Woodhouse wrote:
>
> >
> > torvalds@transmeta.com said:
> > > The no-swap behaviour should actually be pretty much identical,
> > > simply because both 2.2 and 2.4 will do the same thing: just skip
> > > dirty pages in the page tables because they cannot do anything about
> > > them.
> >
> > So the VM code spends a fair amount of time scanning lists of pages which
> > it really can't do anything about?
>
> It can do _tons_ of stuff.
>
> Remember, on platforms like this, one of the reasons for being low on
> memory is things like running X and netscape: maybe you have 64MB of RAM
> and you don't think you need a swap device, and you want to have a web
> browser.
>
> The fact that we cannot touch _dirty_ pages doesn't mean that there's
> nothing to do: instead of running out of memory we can at least make the
> machine usable by dropping the text pages and the page cache..
>
And pushing out old text pages is a very good idea on most embedded systems.
Getting the pages back is a (relatively) cheap operation: no disk seeks,
some joules spent on decompression (if on CRAMFS or other compressed file
system).
There is an interesting question on such devices as to whether you are
better off dropping text pages or pages out of the page cache first,
or to what degree...
- Jim
--
Jim Gettys
Technology and Corporate Development
Compaq Computer Corporation
jg@pa.dec.com
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-11 9:42 ` Stephen C. Tweedie
@ 2001-01-11 15:24 ` Marcelo Tosatti
0 siblings, 0 replies; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-11 15:24 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linus Torvalds, David S. Miller, Rik van Riel, linux-mm
On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:
> This might be as simple as clamping the value of the counter to some
> arbitrary maximum value such as num_physpages.
OK, I've taken this suggestion and used it to limit the counter.
I've also changed some of Linus's changes to swap_out() in pre2 (related to
page aging).
I've noted quite nice performance improvements with the pte scanning
(which moves the dirty pte bits to the pages) on dbench: from 7MB/sec to
9.5MB/sec. (128MB, 48 threads)
The pte scanning will be a big win for databases with heavy IO, I suppose.
The following patch is against 2.4.1pre2.
Comments?
diff -Nur --exclude-from=exclude linux.orig/mm/swap.c linux/mm/swap.c
--- linux.orig/mm/swap.c Thu Jan 11 11:13:37 2001
+++ linux/mm/swap.c Thu Jan 11 14:38:09 2001
@@ -200,17 +200,22 @@
{
if (PageInactiveDirty(page)) {
del_page_from_inactive_dirty_list(page);
- add_page_to_active_list(page);
} else if (PageInactiveClean(page)) {
del_page_from_inactive_clean_list(page);
- add_page_to_active_list(page);
} else {
/*
* The page was not on any list, so we take care
* not to do anything.
*/
+ goto inc_age;
}
+ add_page_to_active_list(page);
+
+ if(bg_page_aging < num_physpages)
+ bg_page_aging++;
+
+inc_age:
/* Make sure the page gets a fair chance at staying active. */
if (page->age < PAGE_AGE_START)
page->age = PAGE_AGE_START;
diff -Nur --exclude-from=exclude linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c Thu Jan 11 11:13:37 2001
+++ linux/mm/vmscan.c Thu Jan 11 14:52:04 2001
@@ -24,17 +24,8 @@
#include <asm/pgalloc.h>
-/*
- * The swap-out functions return 1 if they successfully
- * threw something out, and we got a free page. It returns
- * zero if it couldn't do anything, and any other value
- * indicates it decreased rss, but the page was shared.
- *
- * NOTE! If it sleeps, it *must* return 1 to make sure we
- * don't continue with the swap-out. Otherwise we may be
- * using a process that no longer actually exists (it might
- * have died while we slept).
- */
+int bg_page_aging = 0;
+
static void try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page)
{
pte_t pte;
@@ -42,12 +33,18 @@
/* Don't look at this pte if it's been accessed recently. */
if (ptep_test_and_clear_young(page_table)) {
- page->age += PAGE_AGE_ADV;
- if (page->age > PAGE_AGE_MAX)
- page->age = PAGE_AGE_MAX;
+ age_page_up(page);
return;
+ } else {
+ age_page_down_ageonly(page);
+ if (bg_page_aging)
+ bg_page_aging--;
}
+ /* Unmap only old pages */
+ if (page->age > 0)
+ return;
+
if (TryLockPage(page))
return;
@@ -268,7 +265,7 @@
return nr < SWAP_MIN ? SWAP_MIN : nr;
}
-static int swap_out(unsigned int priority, int gfp_mask)
+static int swap_out(unsigned int priority, int background)
{
int counter;
int retval = 0;
@@ -300,6 +297,13 @@
/* Walk about 6% of the address space each time */
retval |= swap_out_mm(mm, swap_amount(mm));
mmput(mm);
+ /*
+ * In the case of background aging, stop
+ * the scan when we aged the necessary amount
+ * of pages.
+ */
+ if (background && !bg_page_aging)
+ break;
} while (--counter >= 0);
return retval;
@@ -630,22 +634,24 @@
/**
* refill_inactive_scan - scan the active list and find pages to deactivate
* @priority: the priority at which to scan
- * @oneshot: exit after deactivating one page
+ * @background: slightly different behaviour for background scanning
*
* This function will scan a portion of the active list to find
* unused pages, those pages will then be moved to the inactive list.
*/
-int refill_inactive_scan(unsigned int priority, int oneshot)
+int refill_inactive_scan(unsigned int priority, int background)
{
struct list_head * page_lru;
struct page * page;
- int maxscan, page_active = 0;
+ int maxscan;
int ret = 0;
+ int deactivate = 1;
/* Take the lock while messing with the list... */
spin_lock(&pagemap_lru_lock);
maxscan = nr_active_pages >> priority;
while (maxscan-- > 0 && (page_lru = active_list.prev) != &active_list) {
+ int page_active = 0;
page = list_entry(page_lru, struct page, lru);
/* Wrong page on list?! (list corruption, should not happen) */
@@ -660,9 +666,19 @@
if (PageTestandClearReferenced(page)) {
age_page_up_nolock(page);
page_active = 1;
- } else {
+ } else if (deactivate) {
age_page_down_ageonly(page);
/*
+ * We're aging down a page. Decrement the counter if it
+ * has not reached zero yet. If it has reached zero and we are doing a background scan, stop deactivating pages.
+ */
+ if (bg_page_aging)
+ bg_page_aging--;
+ else if (background) {
+ deactivate = 0;
+ continue;
+ }
+ /*
* Since we don't hold a reference on the page
* ourselves, we have to do our test a bit more
* strict then deactivate_page(). This is needed
@@ -676,21 +692,20 @@
(page->buffers ? 2 : 1)) {
deactivate_page_nolock(page);
page_active = 0;
- } else {
- page_active = 1;
}
}
/*
* If the page is still on the active list, move it
* to the other end of the list. Otherwise it was
- * deactivated by age_page_down and we exit successfully.
+ * deactivated by deactivate_page_nolock and we exit
+ * successfully.
*/
if (page_active || PageActive(page)) {
list_del(page_lru);
list_add(page_lru, &active_list);
} else {
ret = 1;
- if (oneshot)
+ if (!background)
break;
}
}
@@ -804,13 +819,13 @@
schedule();
}
- while (refill_inactive_scan(DEF_PRIORITY, 1)) {
+ while (refill_inactive_scan(DEF_PRIORITY, 0)) {
if (--count <= 0)
goto done;
}
/* If refill_inactive_scan failed, try to page stuff out.. */
- swap_out(DEF_PRIORITY, gfp_mask);
+ swap_out(DEF_PRIORITY, 0);
if (--maxtry <= 0)
return 0;
@@ -914,7 +929,11 @@
* every minute. This clears old referenced bits
* and moves unused pages to the inactive list.
*/
- refill_inactive_scan(DEF_PRIORITY, 0);
+ refill_inactive_scan(DEF_PRIORITY, 1);
+
+ /* Walk the pte's and age them. */
+ if (bg_page_aging)
+ swap_out(DEF_PRIORITY, 1);
/* Once a second, recalculate some VM stats. */
if (time_after(jiffies, recalc + HZ)) {
diff -Nur --exclude-from=exclude linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h Thu Jan 11 11:13:38 2001
+++ linux/include/linux/swap.h Thu Jan 11 14:54:57 2001
@@ -101,6 +101,7 @@
extern void swap_setup(void);
/* linux/mm/vmscan.c */
+extern int bg_page_aging;
extern struct page * reclaim_page(zone_t *);
extern wait_queue_head_t kswapd_wait;
extern wait_queue_head_t kreclaimd_wait;
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-11 12:56 ` Stephen C. Tweedie
2001-01-11 13:10 ` Andi Kleen
2001-01-11 13:12 ` Trond Myklebust
@ 2001-01-11 16:50 ` Albert D. Cahalan
2001-01-11 17:35 ` Stephen C. Tweedie
2001-01-11 19:01 ` Alexander Viro
3 siblings, 1 reply; 128+ messages in thread
From: Albert D. Cahalan @ 2001-01-11 16:50 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
Daniel Phillips, linux-kernel, Stephen Tweedie
Stephen C. Tweedie writes:
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
>> That said, we can easily support the notion of CLONE_CRED if
>> we absolutely have to (and sane people just shouldn't use it),
>> so if somebody wants to work on this for 2.5.x...
>
> But is it really worth the pain? I'd hate to have to audit the
> entire VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.
1. each thread has a copy, and doesn't need to lock it
2. threads are commanded to change their own copy
Credentials could be changed on syscall exit. It is a bit like
doing signals I think, with less overhead than making userspace
muck around with signal handlers and synchronization crud.
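
To make the two points concrete, here is a small single-threaded model of a deferred credential change that takes effect on the syscall exit path. It is a sketch under my own assumptions: thread_model, request_uid_change and syscall_exit_hook are made-up names, and no such hook exists in the kernel being discussed.

```c
#include <assert.h>

/* Hypothetical per-thread state; not a kernel structure. */
struct thread_model {
    unsigned int uid;          /* credential used by the running syscall */
    unsigned int pending_uid;  /* value requested by another thread */
    int has_pending;           /* nonzero if a change is waiting */
};

/* Another thread asks `t` to change its uid; nothing changes yet. */
static void request_uid_change(struct thread_model *t, unsigned int uid)
{
    t->pending_uid = uid;
    t->has_pending = 1;
}

/* Run on the (hypothetical) syscall exit path: apply a pending request. */
static void syscall_exit_hook(struct thread_model *t)
{
    if (t->has_pending) {
        t->uid = t->pending_uid;
        t->has_pending = 0;
    }
}
```

The uid a syscall sees stays stable for the whole call, because the change is only applied once the hook runs.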
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Subtle MM bug
2001-01-11 16:50 ` Albert D. Cahalan
@ 2001-01-11 17:35 ` Stephen C. Tweedie
2001-01-11 19:38 ` Albert D. Cahalan
0 siblings, 1 reply; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 17:35 UTC (permalink / raw)
To: Albert D. Cahalan
Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Andi Kleen,
Trond Myklebust, Daniel Phillips, linux-kernel
Hi,
On Thu, Jan 11, 2001 at 11:50:21AM -0500, Albert D. Cahalan wrote:
> Stephen C. Tweedie writes:
> >
> > But is it really worth the pain? I'd hate to have to audit the
> > entire VFS to make sure that it works if another thread changes our
> > credentials in the middle of a syscall, so we either end up having to
> > lock the credentials over every VFS syscall, or take a copy of the
> > credentials and pass it through every VFS internal call that we make.
>
> 1. each thread has a copy, and doesn't need to lock it
We already have that...
> 2. threads are commanded to change their own copy
We already do that: that's how the current pthreads works.
> Credentials could be changed on syscall exit. It is a bit like
> doing signals I think, with less overhead than making userspace
> muck around with signal handlers and synchronization crud.
Yuck. Far better to send a signal than to pollute the syscall exit
path. And what about syscalls which block indefinitely? We _want_
the signal so that they get woken up to do the credentials change.
--Stephen
* Re: Subtle MM bug
2001-01-11 12:56 ` Stephen C. Tweedie
` (2 preceding siblings ...)
2001-01-11 16:50 ` Albert D. Cahalan
@ 2001-01-11 19:01 ` Alexander Viro
3 siblings, 0 replies; 128+ messages in thread
From: Alexander Viro @ 2001-01-11 19:01 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
Daniel Phillips, linux-kernel
On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> >
> > That said, we can easily support the notion of CLONE_CRED if we absolutely
> > have to (and sane people just shouldn't use it), so if somebody wants to
> > work on this for 2.5.x...
>
> But is it really worth the pain? I'd hate to have to audit the entire
> VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.
COW. Pthreads are simply irrelevant here - if you want set*id in one
thread to change the credentials of the rest you can do it in libpthreads.
* Re: Subtle MM bug
2001-01-11 14:13 ` Stephen C. Tweedie
@ 2001-01-11 19:03 ` Alexander Viro
2001-01-11 19:47 ` Stephen C. Tweedie
0 siblings, 1 reply; 128+ messages in thread
From: Alexander Viro @ 2001-01-11 19:03 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Trond Myklebust, Linus Torvalds, Alan Cox, Andi Kleen,
Daniel Phillips, linux-kernel
On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> >
> > What's wrong with copy-on-write style semantics? IOW, anyone who
> > wants to change the credentials needs to make a private copy of the
> > existing structure first.
>
> Because COW only solves the problem if each task is only changing its
> own, local, private copy of the credentials. Posix threads demand
> that one thread changing credentials also affects all the other
> threads immediately, and making your own local private copy won't help
> you to change the other tasks' credentials safely.
And how is that different from the current situation?
* Re: Subtle MM bug
2001-01-11 17:35 ` Stephen C. Tweedie
@ 2001-01-11 19:38 ` Albert D. Cahalan
0 siblings, 0 replies; 128+ messages in thread
From: Albert D. Cahalan @ 2001-01-11 19:38 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Albert D. Cahalan, Stephen C. Tweedie, Linus Torvalds, Alan Cox,
Andi Kleen, Trond Myklebust, Daniel Phillips, linux-kernel
Stephen C. Tweedie writes:
> On Thu, Jan 11, 2001 at 11:50:21AM -0500, Albert D. Cahalan wrote:
>> Stephen C. Tweedie writes:
>>> But is it really worth the pain? I'd hate to have to audit the
>>> entire VFS to make sure that it works if another thread changes our
>>> credentials in the middle of a syscall, so we either end up having to
>>> lock the credentials over every VFS syscall, or take a copy of the
>>> credentials and pass it through every VFS internal call that we make.
>>
>> 1. each thread has a copy, and doesn't need to lock it
>
> We already have that...
>
>> 2. threads are commanded to change their own copy
>
> We already do that: that's how the current pthreads works.
I thought it was unimplemented. Even so, it is at least one
extra round trip to/from the kernel. (I'd guess trips>1)
>> Credentials could be changed on syscall exit. It is a bit like
>> doing signals I think, with less overhead than making userspace
>> muck around with signal handlers and synchronization crud.
>
> Yuck. Far better to send a signal than to pollute the syscall exit
> path. And what about syscalls which block indefinitely? We _want_
> the signal so that they get woken up to do the credentials change.
The syscall exit path itself need not be polluted. Changes to
recalc_sigpending and do_signal would get the job done.
For the former, either add an extra word of kernel-internal
signal data or just check a simple flag. For do_signal, maybe
add an extra "if(foo)" at the top of the main loop. (that would
depend on what was done to recalc_sigpending)
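A toy model of the idea in the two paragraphs above, assuming the "simple flag" variant: a pending credential change makes recalc_sigpending report work to do, and do_signal applies it at the top of its loop. This is not kernel code; the `Task` fields and `request_cred_change` are hypothetical names introduced only for illustration.

```python
# Toy model only. The function names mirror those mentioned in the message,
# but every field and behaviour below is an assumption made for illustration.

class Task:
    def __init__(self, uid):
        self.uid = uid
        self.pending_signals = []   # queued signal numbers
        self.pending_uid = None     # credential change requested by another thread
        self.sigpending = False     # flag checked on the way back to user space
        self.handled = []           # signals delivered so far


def recalc_sigpending(task):
    # "Just check a simple flag": a pending credential change counts as work.
    task.sigpending = bool(task.pending_signals) or task.pending_uid is not None


def request_cred_change(task, new_uid):
    # Hypothetical helper: another thread asks this task to update its own copy.
    task.pending_uid = new_uid
    recalc_sigpending(task)


def do_signal(task):
    while task.sigpending:
        if task.pending_uid is not None:      # the extra "if(foo)" at loop top
            task.uid = task.pending_uid
            task.pending_uid = None
        elif task.pending_signals:
            task.handled.append(task.pending_signals.pop(0))
        recalc_sigpending(task)


t = Task(uid=1000)
t.pending_signals.append(10)
recalc_sigpending(t)
request_cred_change(t, 0)
do_signal(t)
print(t.uid, t.handled, t.sigpending)
```

The model does not cover a thread blocked indefinitely in a syscall, which is the objection raised in the quoted text: something still has to wake that thread before the loop can run.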
I suppose the goodness or badness of this depends partly on how
much you are willing to pay for pthreads that are fast and correct.
People around here seem to like burying their heads in hope that
pthreads will just go away, while app developers stubbornly try to
use the API.
* Re: Subtle MM bug
2001-01-11 19:03 ` Alexander Viro
@ 2001-01-11 19:47 ` Stephen C. Tweedie
2001-01-11 19:57 ` Alexander Viro
0 siblings, 1 reply; 128+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 19:47 UTC (permalink / raw)
To: Alexander Viro
Cc: Stephen C. Tweedie, Trond Myklebust, Linus Torvalds, Alan Cox,
Andi Kleen, Daniel Phillips, linux-kernel
Hi,
On Thu, Jan 11, 2001 at 02:03:48PM -0500, Alexander Viro wrote:
> On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:
>
> > On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> > >
> > > What's wrong with copy-on-write style semantics? IOW, anyone who
> > > wants to change the credentials needs to make a private copy of the
> > > existing structure first.
> >
> > Because COW only solves the problem if each task is only changing its
> > own, local, private copy of the credentials. Posix threads demand
> > that one thread changing credentials also affects all the other
> > threads immediately, and making your own local private copy won't help
> > you to change the other tasks' credentials safely.
>
> And how is that different from the current situation?
It's not, which is the point I was making: COW doesn't actually solve
the pthreads problem. Far better to do it in user space.
--Stephen
* Re: Subtle MM bug
2001-01-11 19:47 ` Stephen C. Tweedie
@ 2001-01-11 19:57 ` Alexander Viro
0 siblings, 0 replies; 128+ messages in thread
From: Alexander Viro @ 2001-01-11 19:57 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Trond Myklebust, Linus Torvalds, Alan Cox, Andi Kleen,
Daniel Phillips, linux-kernel
On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:
> > And how is that different from the current situation?
>
> It's not, which is the point I was making: COW doesn't actually solve
> the pthreads problem. Far better to do it in user space.
Oh, certainly. We need COW for completely unrelated reasons - suppose
you open() a file and then change your *ID. You definitely want credentials
on the opened file to stay unchanged.
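A minimal sketch of the open()-then-set*id case, assuming credentials are modelled as an immutable object that an open file keeps a reference to. `Cred`, `OpenFile`, `Task` and the file path are hypothetical names for illustration, not kernel structures.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cred:
    """Immutable credential set; changing it means building a new one."""
    uid: int
    gid: int


class OpenFile:
    def __init__(self, path, cred):
        self.path = path
        self.cred = cred            # shared reference, never modified in place


class Task:
    def __init__(self, cred):
        self.cred = cred

    def open(self, path):
        # The file captures the credentials in force at open() time.
        return OpenFile(path, self.cred)

    def setuid(self, uid):
        # Copy-on-write: make a private copy instead of touching the shared one.
        self.cred = replace(self.cred, uid=uid)


task = Task(Cred(uid=0, gid=0))
f = task.open("/hypothetical/path")
task.setuid(1000)
print(f.cred.uid, task.cred.uid)
```

After the setuid() call the task holds a new `Cred`, while the file still refers to the one it was opened with.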
Pthreads are a non-issue as far as I'm concerned. I'd rather avoid mixing
them with the credentials cache. BTW, what about *BSD implementations? Do
they change creds of all threads upon set*id(2)?
* Re: Subtle MM bug
2001-01-10 23:56 ` David Weinehall
2001-01-11 0:24 ` Alan Cox
@ 2001-01-12 5:56 ` Ralf Baechle
2001-01-12 16:10 ` Eric W. Biederman
1 sibling, 1 reply; 128+ messages in thread
From: Ralf Baechle @ 2001-01-12 5:56 UTC (permalink / raw)
To: David Weinehall
Cc: Alan Cox, Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel
On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote:
> > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > convention. (It also means you can notionally map two physical addresses to
> > one virtual but thats undefined in the implementation ;))
>
> Are there any other (not yet supported) platforms with similar (or other
> unrelated, but hard to support because of the current architecture of
> the kernel) problems?
>
> (No, I have no secret trumps up my sleeve, I'm just curious.)
Having reverse mappings is the least sucky way to handle virtual aliases
in certain types of MIPS caches.
Ralf
--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
* Re: Subtle MM bug
2001-01-12 5:56 ` Ralf Baechle
@ 2001-01-12 16:10 ` Eric W. Biederman
2001-01-12 21:11 ` Russell King
2001-01-15 2:53 ` Ralf Baechle
0 siblings, 2 replies; 128+ messages in thread
From: Eric W. Biederman @ 2001-01-12 16:10 UTC (permalink / raw)
To: Ralf Baechle
Cc: David Weinehall, Alan Cox, Linus Torvalds, Andrea Arcangeli,
David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel
Ralf Baechle <ralf@conectiva.com.br> writes:
> On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote:
>
> > > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > > convention. (It also means you can notionally map two physical addresses to
> > > one virtual but thats undefined in the implementation ;))
> >
> > Are there any other (not yet supported) platforms with similar (or other
> > unrelated, but hard to support because of the current architecture of
> > the kernel) problems?
> >
> > (No, I have no secret trumps up my sleeve, I'm just curious.)
>
> Having reverse mappings is the least sucky way to handle virtual aliases
> in certain types of MIPS caches.
Hmm. I would think that increasing the logical page size in the kernel would
be the trivial way to handle virtual aliases. (i.e.) with a large enough page
size you can't actually have a virtual alias.
You could also play some games by allocating pages only with the proper
high bits. These games might also be useful on architectures with L2 caches
that have more significant physical bits than PAGE_SHIFT bits.
But how does a reverse mapping help to handle virtual aliases? What are those
caches doing? The only model in my head is having a virtually indexed cache
where you have more index bits than PAGE_SHIFT bits.
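A minimal sketch of the arithmetic behind that model, assuming a virtually indexed, physically tagged cache. The geometry (32 KB, 2-way) and the helper names are illustrative assumptions, not taken from any particular MIPS part.

```python
# Sketch: when can a virtually indexed cache hold two copies of one physical
# line? All sizes and names here are illustrative assumptions.

def num_colors(cache_size, ways, page_size):
    """Number of page 'colors': distinct index ranges a page can land in."""
    way_size = cache_size // ways           # bytes covered by the index bits
    return max(1, way_size // page_size)    # 1: index bits fit in the page offset


def color_of(vaddr, cache_size, ways, page_size):
    """Color of the page containing vaddr (index bits above the page offset)."""
    return (vaddr // page_size) % num_colors(cache_size, ways, page_size)


def can_alias(vaddr_a, vaddr_b, cache_size, ways, page_size):
    """Two virtual mappings of one physical page alias if their colors differ."""
    return (color_of(vaddr_a, cache_size, ways, page_size)
            != color_of(vaddr_b, cache_size, ways, page_size))


CACHE, WAYS = 32 * 1024, 2                  # assumed geometry: 16 KB per way

# With 4 KB pages there are 4 colors, so two mappings can disagree.
print(num_colors(CACHE, WAYS, 4096))
print(can_alias(0x10000, 0x21000, CACHE, WAYS, 4096))

# With a 16 KB logical page the index bits fit inside the page offset.
print(num_colors(CACHE, WAYS, 16384))
print(can_alias(0x10000, 0x24000, CACHE, WAYS, 16384))
```

Under these assumptions the color count drops to 1 once the logical page size reaches the size of one cache way, which is the "large enough page size" case from the first paragraph of this message.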
Eric
* Re: Subtle MM bug
2001-01-12 16:10 ` Eric W. Biederman
@ 2001-01-12 21:11 ` Russell King
2001-01-15 2:56 ` Ralf Baechle
2001-01-15 2:53 ` Ralf Baechle
1 sibling, 1 reply; 128+ messages in thread
From: Russell King @ 2001-01-12 21:11 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Ralf Baechle, riel, Andrea Arcangeli, linux-kernel
Eric W. Biederman writes:
> Hmm. I would think that increasing the logical page size in the kernel
> would be the trivial way to handle virtual aliases. (i.e.) with a large
> enough page size you can't actually have a virtual alias.
There are types of caches out there that no matter how large the page size,
you will always have alias issues. These are ones where the cache lines
are indexed independent of virtual address (and therefore can have funny
cache line replacement algorithms).
And yes, you guessed which processor has it. ;)
(Sorry the CC list got trimmed, elm ate some of it. I'm sure most of the
people who were on it were on lkml anyway)
Russell King (rmk@arm.linux.org.uk)
http://www.arm.linux.org.uk/personal/aboutme.html
THE developer of ARM Linux
* Re: Subtle MM bug
2001-01-12 16:10 ` Eric W. Biederman
2001-01-12 21:11 ` Russell King
@ 2001-01-15 2:53 ` Ralf Baechle
1 sibling, 0 replies; 128+ messages in thread
From: Ralf Baechle @ 2001-01-15 2:53 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Weinehall, Alan Cox, Linus Torvalds, Andrea Arcangeli,
David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel
On Fri, Jan 12, 2001 at 09:10:54AM -0700, Eric W. Biederman wrote:
> > Having reverse mappings is the least sucky way to handle virtual aliases
> > in certain types of MIPS caches.
>
> Hmm. I would think that increasing the logical page size in the kernel would
> be the trivial way to handle virtual aliases. (i.e.) with a large enough page
> size you can't actually have a virtual alias.
That's a possible solution; I'm not clear how bad the overhead would be.
Right now a virtual alias is a relatively rare event, and we don't want the
common case of no virtual alias to pay a high price. Or do we?
> You could also play some games by allocating pages only with the proper
> high bits. These games might also be useful on architectures with L2 caches
> that have more significant physical bits than PAGE_SHIFT bits.
An alternative but less efficient solution. I tried to implement it; I soon
ran into problems with running out of larger pages, as I had to split order-2
pages into 4 order-0 pages to implement this; the fragmentation was _really_
bad.
> But how does a reverse mapping help to handle virtual aliases? What are those
> caches doing?
You leave only mappings of one color accessible. All other mappings are made
inaccessible in the page table, so accessing them will result in a TLB fault.
The TLB fault handler then flushes the active mappings, makes them
inaccessible by clearing the MIPS hw dirty / accessible bits, then makes the
mapping of the new color accessible in the page table. This is already
possible right now, but doing the necessary reverse mappings can be rather
inefficient as is.
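A small simulation of the scheme just described, assuming one physical page with several virtual mappings and a per-page reverse map keyed by color. The class, its fields, and the flush counter are hypothetical stand-ins; a real handler would flush cache lines and edit page table entries instead.

```python
# Simulation only: names are hypothetical, and the cache flush is just counted.

class PhysPage:
    def __init__(self, num_colors, page_size=4096):
        self.num_colors = num_colors
        self.page_size = page_size
        self.by_color = {}        # reverse map: color -> set of (task, vaddr)
        self.valid = set()        # mappings currently accessible in page tables
        self.active_color = None  # the one color allowed to be accessible
        self.flushes = 0          # stand-in for real cache flush operations

    def color_of(self, vaddr):
        return (vaddr // self.page_size) % self.num_colors

    def add_mapping(self, task, vaddr):
        self.by_color.setdefault(self.color_of(vaddr), set()).add((task, vaddr))

    def access(self, task, vaddr):
        mapping = (task, vaddr)
        color = self.color_of(vaddr)
        if mapping not in self.by_color.get(color, set()):
            raise KeyError("no such mapping")
        if mapping in self.valid:
            return "hit"
        # TLB fault path: the mapping exists but is marked inaccessible.
        if self.active_color not in (None, color):
            self.flushes += 1                                # flush old color
            self.valid -= self.by_color[self.active_color]   # revoke via reverse map
        self.active_color = color
        self.valid.add(mapping)
        return "fault"


page = PhysPage(num_colors=4)
page.add_mapping("A", 0x10000)   # color 0
page.add_mapping("B", 0x21000)   # color 1
results = [page.access("A", 0x10000), page.access("A", 0x10000),
           page.access("B", 0x21000), page.access("A", 0x10000)]
print(results, page.flushes)
```

The revoke step is where the reverse map matters: without it, finding every mapping of the old color means searching the page tables of every task that might map the page.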
> The only model in my head is having a virtually indexed cache where you
> have more index bits than PAGE_SHIFT bits.
Which is exactly what many MIPS implementations are suffering from. At
least they're tagged with the physical address, so no flushes on context
switch are necessary.
Ralf
--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
* Re: Subtle MM bug
2001-01-12 21:11 ` Russell King
@ 2001-01-15 2:56 ` Ralf Baechle
2001-01-15 6:59 ` Eric W. Biederman
0 siblings, 1 reply; 128+ messages in thread
From: Ralf Baechle @ 2001-01-15 2:56 UTC (permalink / raw)
To: Russell King; +Cc: Eric W. Biederman, riel, Andrea Arcangeli, linux-kernel
On Fri, Jan 12, 2001 at 09:11:43PM +0000, Russell King wrote:
> Eric W. Biederman writes:
> > Hmm. I would think that increasing the logical page size in the kernel
> > would be the trivial way to handle virtual aliases. (i.e.) with a large
> > enough page size you can't actually have a virtual alias.
>
> There are types of caches out there that no matter how large the page size,
> you will always have alias issues. These are ones where the cache lines
> are indexed independent of virtual address (and therefore can have funny
> cache line replacement algorithms).
>
> And yes, you guessed which processor has it. ;)
I recently spoke with some CPU architecture researcher at some university
about cache architectures; I suspect in the near future we'll see more
funny cache indexing and replacement algorithms ...
Ralf
--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
* Re: Subtle MM bug
2001-01-15 2:56 ` Ralf Baechle
@ 2001-01-15 6:59 ` Eric W. Biederman
0 siblings, 0 replies; 128+ messages in thread
From: Eric W. Biederman @ 2001-01-15 6:59 UTC (permalink / raw)
To: Ralf Baechle; +Cc: Russell King, riel, Andrea Arcangeli, linux-kernel
Ralf Baechle <ralf@uni-koblenz.de> writes:
> On Fri, Jan 12, 2001 at 09:11:43PM +0000, Russell King wrote:
>
> > Eric W. Biederman writes:
> > > Hmm. I would think that increasing the logical page size in the kernel
> > > would be the trivial way to handle virtual aliases. (i.e.) with a large
> > > enough page size you can't actually have a virtual alias.
> >
> > There are types of caches out there that no matter how large the page size,
> > you will always have alias issues. These are ones where the cache lines
> > are indexed independent of virtual address (and therefore can have funny
> > cache line replacement algorithms).
> >
> > And yes, you guessed which processor has it. ;)
Odd. Does this affect correctness?
> I recently spoke with some CPU architecture researcher at some university
> about cache architectures; I suspect in the near future we'll see more
> funny cache indexing and replacement algorithms ...
But I doubt many of those will run incorrectly, rather than just less
efficiently, if the OS doesn't help you avoid aliases.
Eric
* Re: Subtle MM bug
2001-01-09 2:01 ` Zlatko Calusic
@ 2001-01-17 4:48 ` Rik van Riel
-1 siblings, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-17 4:48 UTC (permalink / raw)
To: Zlatko Calusic; +Cc: linux-kernel, linux-mm
On 9 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > Now if 2.4 has worse _performance_ than 2.2 due to one
> > reason or another, that I'd like to hear about ;)
> >
>
> Oh, well, it seems that I was wrong. :)
>
> First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
> 192MB machine)
>
> kernel | swap usage | speed
> -------------------------------
> 2.2.17 | 48 MB | 11.8 MB/s
> -------------------------------
> 2.4.0 | 206 MB | 11.1 MB/s
> -------------------------------
>
> So 2.2 is only marginally faster. Also it can be seen that 2.4
> uses 4 times more swap space. If Linus says it's ok... :)
I have been working on some changes to page_launder() which
might just fix this problem. Quick and dirty patches are on
my home page and I'll try to clean things up and make something
correct & clean later today or tomorrow ;)
> Second test: kernel compile make -j32 (empirically this puts the
> VM under load, but not excessively!)
>
> 2.2.17 -> make -j32 392.49s user 47.87s system 168% cpu 4:21.13 total
> 2.4.0 -> make -j32 389.59s user 31.29s system 182% cpu 3:50.24 total
>
> Now, is this great news or what, 2.4.0 is definitely faster.
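The quoted figures can be checked with a little arithmetic. The sketch below uses only the numbers from the two tests above; the variable names are made up for illustration.

```python
# Worked arithmetic on the figures quoted above; nothing here is new data.

def to_seconds(clock):
    """Convert an 'M:SS.ss' wall-clock string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


# First test: hogmem 180 5 on a 192MB machine
swap_ratio = 206 / 48                        # 2.4.0 swap use relative to 2.2.17
speed_gap_pct = (11.8 - 11.1) / 11.1 * 100   # how much faster 2.2.17 ran

# Second test: make -j32
wall_22 = to_seconds("4:21.13")
wall_24 = to_seconds("3:50.24")
wall_saving_pct = (wall_22 - wall_24) / wall_22 * 100
sys_saving_pct = (47.87 - 31.29) / 47.87 * 100

# The "% cpu" column is (user + system) / wall-clock time.
cpu_22 = (392.49 + 47.87) / wall_22 * 100
cpu_24 = (389.59 + 31.29) / wall_24 * 100

print(f"swap ratio:        {swap_ratio:.1f}x")
print(f"hogmem speed gap:  {speed_gap_pct:.1f}%")
print(f"wall-clock saving: {wall_saving_pct:.1f}%")
print(f"system-time drop:  {sys_saving_pct:.1f}%")
print(f"cpu: 2.2.17 {cpu_22:.1f}%, 2.4.0 {cpu_24:.1f}%")
```

So "4 times more swap" is about 4.3x, the hogmem speed difference is about 6%, and the 2.4.0 compile finishes in roughly 12% less wall-clock time, with most of the gain coming from lower system time.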
One problem is that these tasks may be waiting on kswapd when
kswapd might not get scheduled in on time. On the one hand this
will mean lower load and less thrashing, on the other hand it
means more IO wait.
This is another area where we may be able to improve some things.
(btw, according to Alan the 2.4 kernel is the first one to break
the 1.2 kernel compiling speed record on an 8MB machine he has ;))
cheers,
Rik (stuck in Australia at a conference)
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-09 3:12 ` Linus Torvalds
2001-01-09 20:33 ` Marcelo Tosatti
@ 2001-01-17 4:54 ` Rik van Riel
1 sibling, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-17 4:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Marcelo Tosatti, Stephen C. Tweedie, David S. Miller, linux-mm
On Mon, 8 Jan 2001, Linus Torvalds wrote:
> - gets rid of the complex "best mm" logic and replaces it with the
> round-robin thing as discussed.
This could help IO clustering as well, which should be good
whenever we want to swap the data back in ;)
> - it cleans up and simplifies the MM "priority" thing. In fact, right now
> only one priority is ever used,
Sounds great.
In the week that I've been offline I have been working on
page_launder and doing a few other improvements to the VM.
Once I get the time to clean everything up I think we can
take 2.4 to a slightly better performance level without
having to change anything big.
regards,
Rik (at linux.conf.au)
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-17 6:58 ` Rik van Riel
@ 2001-01-17 6:07 ` Marcelo Tosatti
2001-01-17 19:04 ` Zlatko Calusic
1 sibling, 0 replies; 128+ messages in thread
From: Marcelo Tosatti @ 2001-01-17 6:07 UTC (permalink / raw)
To: Rik van Riel; +Cc: Zlatko Calusic, Linus Torvalds, linux-mm
On Wed, 17 Jan 2001, Rik van Riel wrote:
> On 11 Jan 2001, Zlatko Calusic wrote:
>
> > I have tested it for you and results are great. On some tests I got
> > 20% to 30% better results which is amazing. I'll do some more tests
> > but I would vote for this to get in immediately. Yes, it's *so* good.
>
> Don't be so rash.
>
> The patch hasn't been tested very thoroughly, otherwise
> people would have noticed the problem that PG_MEMALLOC
> isn't set around the page freeing code, possibly leading
> to deadlocks, triple faults and other nasties.
Look at 2.4.1pre8.
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-10 0:06 ` Linus Torvalds
2001-01-10 6:39 ` Marcelo Tosatti
@ 2001-01-17 6:52 ` Rik van Riel
1 sibling, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-17 6:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm
On Tue, 9 Jan 2001, Linus Torvalds wrote:
> On Tue, 9 Jan 2001, Marcelo Tosatti wrote:
> >
> > The problem is that do_try_to_free_pages uses the "wait" argument when
> > calling page_launder() (where the parameter is used to indicate if we want
> > to do sync or async IO) _and_ used to call refill_inactive(), where this
> > parameter is used to indicate if it's being called from a normal process or
> > from kswapd:
>
> Yes. Bogus.
>
> I suspect that the proper fix is something more along the lines
> of what we did to bdflush: get rid of the notion of waiting
> synchronously from bdflush, and instead do the work yourself.
Agreed. I've been working on this a bit in the last week and
have achieved some interesting results.
The main thing I found is that it is *not* trivial to do this,
because we can end up with multiple instances of e.g. page_launder()
running at the same time, and we will want to balance them against
each other in some way to prevent them from flushing too many pages
at once.
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-11 0:11 ` Zlatko Calusic
@ 2001-01-17 6:58 ` Rik van Riel
2001-01-17 6:07 ` Marcelo Tosatti
2001-01-17 19:04 ` Zlatko Calusic
0 siblings, 2 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-17 6:58 UTC (permalink / raw)
To: Zlatko Calusic; +Cc: Marcelo Tosatti, Linus Torvalds, linux-mm
On 11 Jan 2001, Zlatko Calusic wrote:
> I have tested it for you and results are great. On some tests I got
> 20% to 30% better results which is amazing. I'll do some more tests
> but I would vote for this to get in immediately. Yes, it's *so* good.
Don't be so rash.
The patch hasn't been tested very thoroughly, otherwise
people would have noticed the problem that PG_MEMALLOC
isn't set around the page freeing code, possibly leading
to deadlocks, triple faults and other nasties.
(and yes, I'm sure there will be somebody able to trigger
this bug)
Remember that we - officially - still are in the 2.4 BUGFIX
period, it's time to be careful with the code now and we should
IMHO not randomly introduce new bugs in the name of performance.
Performance enhancements are perfectly fine, of course, but IMHO
not after they've been posted 2 hours ago and haven't been
reviewed and stresstested yet.
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-09 19:09 ` Daniel Phillips
2001-01-09 19:29 ` Trond Myklebust
2001-01-09 19:37 ` Linus Torvalds
@ 2001-01-17 8:46 ` Rik van Riel
2001-01-25 22:51 ` Daniel Phillips
2 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2001-01-17 8:46 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel
On Tue, 9 Jan 2001, Daniel Phillips wrote:
> Linus Torvalds wrote:
> > (This is why I worked so hard at getting the PageDirty semantics right in
> > the last two months or so - and why I released 2.4.0 when I did. Getting
> > PageDirty right was the big step to make all of the VM stuff possible in
> > the first place. Even if it probably looked a bit foolhardy to change the
> > semantics of "writepage()" quite radically just before 2.4 was released).
>
> On the topic of writepage, it's not symmetric with readpage at
> the moment - it still takes (struct file *). Is this in the
> cleanup pipeline? It looks like nfs_readpage already ignores
> the struct file *, but maybe some other net filesystems are
> still depending on it.
writepage() and readpage() will never be symmetric...
readpage()
        program can't continue until data is there
        reading in larger clusters eats (wastes?) more memory
        done when we think a process needs data

writepage()
        called after the process has written data and moved on
        writing larger clusters has no influence on memory use
        often done to free up memory
Since readpage() needs to tune readahead behaviour, we will
always want to give it some information (eg. in the file *)
so it can do the extra things it needs to do.
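A rough sketch of why the read side wants per-file state while the write side does not, assuming a simple doubling readahead window. The `File` class, `MAX_WINDOW`, and the window policy are assumptions made for illustration and do not describe the actual 2.4 readahead code.

```python
# Illustration only: the policy and all names here are assumptions.

MAX_WINDOW = 8                      # assumed cap on pages read per request


class File:
    def __init__(self):
        self.last_index = None      # last page index read through this file
        self.ra_window = 1          # pages to read for the next request


def readpage(file, index):
    """Return the page indices to read for a request at `index`."""
    if file.last_index is not None and index == file.last_index + 1:
        file.ra_window = min(file.ra_window * 2, MAX_WINDOW)   # sequential: grow
    else:
        file.ra_window = 1                                     # random: reset
    pages = list(range(index, index + file.ra_window))
    file.last_index = pages[-1]
    return pages


def writepage(index):
    """Write-back needs no per-file access history in this sketch."""
    return [index]


f = File()
print(readpage(f, 0), readpage(f, 1), readpage(f, 3), readpage(f, 100))
print(writepage(7))
```

The read path updates the access history on every call, so it needs somewhere to keep it; the write path here gets by with just the page.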
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-10 18:33 ` Andrea Arcangeli
@ 2001-01-17 14:26 ` Rik van Riel
0 siblings, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-17 14:26 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Eric W. Biederman, David Woodhouse, Linus Torvalds,
Zlatko Calusic, linux-kernel
On Wed, 10 Jan 2001, Andrea Arcangeli wrote:
> On Wed, Jan 10, 2001 at 10:46:07AM -0700, Eric W. Biederman wrote:
> > My impression with the MM stuff is that everyone except linux is
> > trying hard to clone BSD instead of thinking through the issues
> > ourselves.
>
> I wasn't even thinking about BSD and I always thought about the
> issues myself, no panic ;).
Andrea, if you have the time, please do check out the
FreeBSD and NetBSD VM code.
The FreeBSD code has the original Mach overengineered
abstraction layer, but an absolutely kickass page
replacement strategy.
The NetBSD code has cleaned up the abstraction layer
into something nice and lower overhead, but has a much
simpler (probably lower-performance) page replacement.
It would be cool if some of the Linux hackers could take
the time and look at this code to see if there are some
good ideas we might want to have in Linux.
It might just be the case that we DON'T want to reinvent
the wheel (that others have made into a nice round shape
with 15 years of trial, error and redesigning).
(though I know some people prefer reinventing wheels ;))
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-10 19:03 ` Linus Torvalds
2001-01-10 19:27 ` David S. Miller
2001-01-10 19:36 ` Alan Cox
@ 2001-01-17 14:28 ` Rik van Riel
2001-01-18 1:23 ` Linus Torvalds
2 siblings, 1 reply; 128+ messages in thread
From: Rik van Riel @ 2001-01-17 14:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Andrea Arcangeli, David Woodhouse,
Zlatko Calusic, linux-kernel
On Wed, 10 Jan 2001, Linus Torvalds wrote:
> I looked at it a year or two ago myself, and came to the
> conclusion that I don't want to blow up our page table size by a
> factor of three or more, so I'm not personally interested any
> more. Maybe somebody else comes up with a better way to do it,
> or with a really compelling reason to.
OTOH, it _would_ get rid of all the balancing issues in one
blow. And it would fix the aliasing issues and possibly the
memory fragmentation problem too.
And using something like Davem's lower-overhead reverse
mapping layer, we might just be able to pull off all (or most)
of the advantages with lower overhead ;)
[this is something I will be looking into for 2.5]
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-17 4:48 ` Rik van Riel
@ 2001-01-17 18:53 ` Zlatko Calusic
-1 siblings, 0 replies; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-17 18:53 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm
Rik van Riel <riel@conectiva.com.br> writes:
> > Second test: kernel compile make -j32 (empirically this puts the
> > VM under load, but not excessively!)
> >
> > 2.2.17 -> make -j32 392.49s user 47.87s system 168% cpu 4:21.13 total
> > 2.4.0 -> make -j32 389.59s user 31.29s system 182% cpu 3:50.24 total
> >
> > Now, is this great news or what, 2.4.0 is definitely faster.
>
> One problem is that these tasks may be waiting on kswapd when
> kswapd might not get scheduled in on time. On the one hand this
> will mean lower load and less thrashing, on the other hand it
> means more IO wait.
>
Hm, if all tasks are waiting for memory, what is stopping kswapd from
running? :)
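As a cross-check of the make -j32 figures quoted above, the CPU utilisation and the relative savings can be recomputed from the user, system and wall-clock times. This is plain arithmetic on the quoted numbers; the helper names are made up for illustration:

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert the m:ss.ss wall-clock format to seconds."""
    return minutes * 60 + seconds


# Figures copied from the two runs quoted above.
runs = {
    "2.2.17": {"user": 392.49, "system": 47.87, "wall": to_seconds(4, 21.13)},
    "2.4.0": {"user": 389.59, "system": 31.29, "wall": to_seconds(3, 50.24)},
}


def cpu_percent(run: dict) -> float:
    """CPU utilisation: (user + system) / wall, as a percentage."""
    return 100.0 * (run["user"] + run["system"]) / run["wall"]


def saving(old: float, new: float) -> float:
    """Relative reduction going from old to new, as a percentage."""
    return 100.0 * (old - new) / old


for name, run in runs.items():
    print(f"{name}: {cpu_percent(run):.1f}% cpu")
print(f"wall time saved:   {saving(runs['2.2.17']['wall'], runs['2.4.0']['wall']):.1f}%")
print(f"system time saved: {saving(runs['2.2.17']['system'], runs['2.4.0']['system']):.1f}%")
```

The utilisation figures should land close to the 168% and 182% reported, with wall-clock time down by roughly 12% and system time by roughly a third.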
--
Zlatko
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-17 6:58 ` Rik van Riel
2001-01-17 6:07 ` Marcelo Tosatti
@ 2001-01-17 19:04 ` Zlatko Calusic
2001-01-17 19:22 ` Ingo Molnar
1 sibling, 1 reply; 128+ messages in thread
From: Zlatko Calusic @ 2001-01-17 19:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: Marcelo Tosatti, Linus Torvalds, linux-mm
Rik van Riel <riel@conectiva.com.br> writes:
> On 11 Jan 2001, Zlatko Calusic wrote:
>
> > I have tested it for you and results are great. On some tests I got
> > 20% to 30% better results which is amazing. I'll do some more tests
> > but I would vote for this to get in immediately. Yes, it's *so* good.
>
> Don't be so rash.
>
> The patch hasn't been tested very thoroughly, otherwise
> people would have noticed the problem that PF_MEMALLOC
> isn't set around the page freeing code, possibly leading
> to deadlocks, triple faults and other nasties.
>
Oh, believe me I tested that patch very thoroughly with lots of
utilities, and it worked very very well. I don't remember that it
fiddled anywhere with the PF_MEMALLOC flag.
But, anyway, it's in the kernel now so I can delete
/boot/vmlinuz-marcelo which was my performance etalon, it was so
good. :)
> (and yes, I'm sure there will be somebody able to trigger
> this bug)
>
> Remember that we - officially - still are in the 2.4 BUGFIX
> period, it's time to be careful with the code now and we should
> IMHO not randomly introduce new bugs in the name of performance.
>
Yeah, right! And Linus has just included reiserfs in a prepatch.
> Performance enhancements are perfectly fine, of course, but IMHO
> not after they've been posted 2 hours ago and haven't been
> reviewed and stresstested yet.
>
They have been tested well enough.
--
Zlatko
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-17 19:04 ` Zlatko Calusic
@ 2001-01-17 19:22 ` Ingo Molnar
2001-01-18 0:55 ` Rik van Riel
0 siblings, 1 reply; 128+ messages in thread
From: Ingo Molnar @ 2001-01-17 19:22 UTC (permalink / raw)
To: Zlatko Calusic; +Cc: Rik van Riel, Marcelo Tosatti, Linus Torvalds, linux-mm
On 17 Jan 2001, Zlatko Calusic wrote:
> Oh, believe me I tested that patch very thoroughly with lots of
> utilities, and it worked very very well. I don't remember that it
> fiddled anywhere with the PF_MEMALLOC flag.
yep, same result here, Marcelo's patch is plain *wonderful*. Combined with
the block-IO changes, -pre8 is really behaving spectacularly under high
VM or pagecache load.
Ingo
* Re: Yet another bogus piece of do_try_to_free_pages()
2001-01-17 19:22 ` Ingo Molnar
@ 2001-01-18 0:55 ` Rik van Riel
0 siblings, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-18 0:55 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Zlatko Calusic, Marcelo Tosatti, Linus Torvalds, linux-mm
On Wed, 17 Jan 2001, Ingo Molnar wrote:
> On 17 Jan 2001, Zlatko Calusic wrote:
>
> > Oh, believe me I tested that patch very thoroughly with lots of
> > utilities, and it worked very very well. I don't remember that it
> > fiddled anywhere with the PF_MEMALLOC flag.
>
> yep, same result here, Marcelo's patch is plain *wonderful*.
> Combined with the block-IO changes, -pre8 is really behaving
> spectacularly under high VM or pagecache load.
Oh, I'm not doubting that. I just got suspicious when Linus
got asked to put it in the kernel after Zlatko tested it for
a few hours ... and when I spotted a lack of flags|=PF_MEMALLOC
around the thing.
(but from what marcelo told me, it got fixed in -pre8)
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-17 14:28 ` Rik van Riel
@ 2001-01-18 1:23 ` Linus Torvalds
2001-01-18 11:48 ` Rik van Riel
0 siblings, 1 reply; 128+ messages in thread
From: Linus Torvalds @ 2001-01-18 1:23 UTC (permalink / raw)
To: linux-kernel
In article <Pine.LNX.4.31.0101180126240.31432-100000@localhost.localdomain>,
Rik van Riel <riel@conectiva.com.br> wrote:
>On Wed, 10 Jan 2001, Linus Torvalds wrote:
>
>> I looked at it a year or two ago myself, and came to the
>> conclusion that I don't want to blow up our page table size by a
>> factor of three or more, so I'm not personally interested any
>> more. Maybe somebody else comes up with a better way to do it,
>> or with a really compelling reason to.
>
>OTOH, it _would_ get rid of all the balancing issues in one
>blow. And it would fix the aliasing issues and possibly the
>memory fragmentation problem too.
I totally disagree.
It might help fragmentation, but it has absolutely _no_ impact on
balancing. See my comments about not seeing the "accessed" bit until way
too late with a "find by physical" approach.
You simply _cannot_ use "find by physical" for balancing, unless you're
willing to pay the price of doing software accessed bits even on
hardware that does it for you in the page tables. Which is a price MUCH
too high to pay, I suspect.
The current vmscanning is the way to go. Getting PageDirty was a big
step for it, because it is needed so that we can drop pages without
having to do IO like we historically did. I doubt find-by-physical will
help AT ALL wrt balancing.
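To make the cost being argued about concrete, here is a toy model (not kernel code) of the two ways to collect accessed bits. The PTE and Frame classes, the rmap list and both scan functions are hypothetical names. It is only one reading of the argument: a virtual scan reads the bits straight from the page tables, while a find-by-physical scan needs some per-frame reverse map to reach them at all:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PTE:
    frame: int               # physical frame this entry maps
    accessed: bool = False   # set by the "hardware" whenever the page is touched


@dataclass
class Frame:
    age: int = 0
    # Hypothetical reverse map: every PTE that maps this frame.
    # This is the extra per-mapping memory a find-by-physical scheme pays for.
    rmap: List[PTE] = field(default_factory=list)


def touch(pte: PTE) -> None:
    """Model a memory access through this mapping."""
    pte.accessed = True


def scan_virtual(page_tables: List[List[PTE]], frames: Dict[int, Frame]) -> None:
    """Walk every page table and fold accessed bits into the frame ages."""
    for table in page_tables:
        for pte in table:
            if pte.accessed:
                pte.accessed = False
                frames[pte.frame].age += 1


def referenced_by_physical(frame: Frame) -> bool:
    """Start from a frame; only works because the frame carries a reverse map."""
    hit = False
    for pte in frame.rmap:
        if pte.accessed:
            pte.accessed = False
            hit = True
    return hit
```

Without the rmap list there is no path from a Frame back to the bits the hardware sets, which is the choice between extra page table memory and software accessed bits discussed above.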
Linus
* Re: Subtle MM bug
2001-01-17 18:53 ` Zlatko Calusic
@ 2001-01-18 1:32 ` Rik van Riel
-1 siblings, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-18 1:32 UTC (permalink / raw)
To: Zlatko Calusic; +Cc: linux-kernel, linux-mm
On 17 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > > Second test: kernel compile make -j32 (empirically this puts the
> > > VM under load, but not excessively!)
> > >
> > > 2.2.17 -> make -j32 392.49s user 47.87s system 168% cpu 4:21.13 total
> > > 2.4.0 -> make -j32 389.59s user 31.29s system 182% cpu 3:50.24 total
> > >
> > > Now, is this great news or what, 2.4.0 is definitely faster.
> >
> > One problem is that these tasks may be waiting on kswapd when
> > kswapd might not get scheduled in on time. On the one hand this
> > will mean lower load and less thrashing, on the other hand it
> > means more IO wait.
>
> Hm, if all tasks are waiting for memory, what is stopping kswapd
> from running? :)
Suppose you have 8 high-priority tasks waiting on kswapd
and one lower-priority (but still higher than kswapd)
process running and preventing kswapd from doing its work.
Oh .. and also preventing the higher-priority tasks from
being woken up and continuing...
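The scenario can be reproduced in a toy simulation, assuming a strict-priority scheduler (a simplification; the real 2.4 scheduler also hands out timeslices, so starvation there is a matter of delay rather than an absolute). Task, simulate() and scenario() are illustrative names:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    priority: int                         # larger value = scheduled first
    work_left: int                        # CPU ticks still needed
    blocked_on: Optional["Task"] = None   # task whose completion we wait for


def simulate(tasks: List[Task], ticks: int) -> List[str]:
    """Strict-priority scheduling: each tick, run the best runnable task."""
    trace = []
    for _ in range(ticks):
        ready = [t for t in tasks if t.blocked_on is None and t.work_left > 0]
        if not ready:
            break
        current = max(ready, key=lambda t: t.priority)
        current.work_left -= 1
        trace.append(current.name)
        if current.work_left == 0:
            # Wake everything that was waiting for this task to finish.
            for t in tasks:
                if t.blocked_on is current:
                    t.blocked_on = None
    return trace


def scenario(hog_ticks: int) -> List[Task]:
    """Eight important waiters, one mid-priority CPU hog, low-priority kswapd."""
    kswapd = Task("kswapd", priority=1, work_left=3)
    hog = Task("hog", priority=5, work_left=hog_ticks)
    waiters = [Task(f"waiter{i}", priority=10, work_left=1, blocked_on=kswapd)
               for i in range(8)]
    return [kswapd, hog] + waiters
```

With a long-running hog, simulate(scenario(hog_ticks=100), 20) schedules nothing but the hog: kswapd never gets a tick, so none of the eight waiters is ever woken.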
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-18 1:23 ` Linus Torvalds
@ 2001-01-18 11:48 ` Rik van Riel
0 siblings, 0 replies; 128+ messages in thread
From: Rik van Riel @ 2001-01-18 11:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
On 17 Jan 2001, Linus Torvalds wrote:
> Rik van Riel <riel@conectiva.com.br> wrote:
> >On Wed, 10 Jan 2001, Linus Torvalds wrote:
> >
> >> I looked at it a year or two ago myself, and came to the
> >> conclusion that I don't want to blow up our page table size by a
> >> factor of three or more, so I'm not personally interested any
> >> more. Maybe somebody else comes up with a better way to do it,
> >> or with a really compelling reason to.
> >
> >OTOH, it _would_ get rid of all the balancing issues in one
> >blow. And it would fix the aliasing issues and possibly the
> >memory fragmentation problem too.
>
> I totally disagree.
I still haven't seen anything that might get us a
"universally correct" balancing between swap_out()
and refill_inactive_scan().
We either scan both categories at the same relative
rate, which gives mapped pages an advantage because
they may get unmapped later than the unmapped pages
get deactivated.
Alternatively, you do the scanning between these two
at different rates, which gives an advantage to one
or the other.
(or am I overlooking something stupid here?)
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: Subtle MM bug
2001-01-17 8:46 ` Rik van Riel
@ 2001-01-25 22:51 ` Daniel Phillips
0 siblings, 0 replies; 128+ messages in thread
From: Daniel Phillips @ 2001-01-25 22:51 UTC (permalink / raw)
To: Rik van Riel, linux-kernel
Rik van Riel wrote:
>
> On Tue, 9 Jan 2001, Daniel Phillips wrote:
> > Linus Torvalds wrote:
> > > (This is why I worked so hard at getting the PageDirty semantics right in
> > > the last two months or so - and why I released 2.4.0 when I did. Getting
> > > PageDirty right was the big step to make all of the VM stuff possible in
> > > the first place. Even if it probably looked a bit foolhardy to change the
> > > semantics of "writepage()" quite radically just before 2.4 was released).
> >
> > On the topic of writepage, it's not symmetric with readpage at
> > the moment - it still takes (struct file *). Is this in the
> > cleanup pipeline? It looks like nfs_readpage already ignores
> > the struct file *, but maybe some other net filesystems are
> > still depending on it.
>
> writepage() and readpage() will never be symmetric...
>
> readpage()
>     program can't continue until data is there
>     reading in larger clusters eats (wastes?) more memory
>     done when we think a process needs data
>
> writepage()
>     called after the process has written data and moved on
>     writing larger clusters has no influence on memory use
>     often done to free up memory
>
> Since readpage() needs to tune readahead behaviour, we will
> always want to give it some information (eg. in the file *)
> so it can do the extra things it needs to do.
Which extra information did you have in mind?
--
Daniel
* Re: Subtle MM bug
2001-01-18 1:32 ` Rik van Riel
(?)
@ 2001-04-17 19:37 ` H. Peter Anvin
-1 siblings, 0 replies; 128+ messages in thread
From: H. Peter Anvin @ 2001-04-17 19:37 UTC (permalink / raw)
To: linux-kernel
Followup to: <Pine.LNX.4.31.0101181230020.31432-100000@localhost.localdomain>
By author: Rik van Riel <riel@conectiva.com.br>
In newsgroup: linux.dev.kernel
>
> Suppose you have 8 high-priority tasks waiting on kswapd
> and one lower-priority (but still higher than kswapd)
> process running and preventing kswapd from doing its work.
> Oh .. and also preventing the higher-priority tasks from
> being woken up and continuing...
>
Classic priority inversion. In this particular case it seems like it
should be unusually simple to apply priority inheritance, though (the
general case is complicated by the fact that the dependency matrix
usually isn't readily available.)
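A rough sketch of what inheritance would mean here, assuming the wait-for relation is known as a simple mapping (which is what makes the kswapd case easy). effective_priority() and pick_next() are hypothetical helpers, not an existing kernel interface, and the priorities are made-up numbers:

```python
from typing import Dict, List


def effective_priority(task: str, base: Dict[str, int],
                       waiting_on: Dict[str, str]) -> int:
    """A task runs at the highest priority of itself and anything waiting on it.

    Assumes the wait-for relation has no cycles.
    """
    waiters = [t for t, target in waiting_on.items() if target == task]
    inherited = [effective_priority(w, base, waiting_on) for w in waiters]
    return max([base[task]] + inherited)


def pick_next(runnable: List[str], base: Dict[str, int],
              waiting_on: Dict[str, str], inherit: bool) -> str:
    """Choose the next task to run, with or without priority inheritance."""
    if inherit:
        return max(runnable, key=lambda t: effective_priority(t, base, waiting_on))
    return max(runnable, key=lambda t: base[t])


# Larger number = more important.
base = {"kswapd": 1, "hog": 5}
base.update({f"waiter{i}": 10 for i in range(8)})
waiting_on = {f"waiter{i}": "kswapd" for i in range(8)}   # the dependency "matrix"
runnable = ["kswapd", "hog"]

print(pick_next(runnable, base, waiting_on, inherit=False))
print(pick_next(runnable, base, waiting_on, inherit=True))
```

Without inheritance the hog wins on base priority; with it, kswapd borrows the priority of its waiters and runs ahead of the hog.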
-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt