Subtle MM bug

public inbox for linux-kernel@vger.kernel.org

* Subtle MM bug
@ 2001-01-07 20:59 Zlatko Calusic
  2001-01-07 21:37 ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Zlatko Calusic @ 2001-01-07 20:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm

I'm trying to get more familiar with the MM code in 2.4.0, as can be
seen from lots of questions I have on the subject. I discovered nasty
mm behaviour under even moderate load (2.2 didn't have troubles).

Things go berserk if you have one big process whose working set is
around your physical memory size. Typical memory hoggers are good
enough to trigger the bad behaviour. Final effect is that physical
memory gets extremely flooded with the swap cache pages and at the
same time the system absorbs ridiculous amount of the swap space.
xmem is as usual very good at detecting this and you just need to
press Alt-SysRq-M to see that most of the memory (e.g. 90%) is
populated with the swap cache pages.

For instance on my 192MB configuration, firing up the hogmem program
which allocates let's say 170MB of memory and dirties it leads to
215MB of swap used. vmstat 1 shows that the pagecache size is
constantly growing - that is swapcache enlarging in fact - during the
second pass of the hogmem program.

...
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd  free buff  cache   si   so    bi    bo   in    cs  us  sy  id
 0  1  1 131488  1592  400  62384 4172 5188  1092  1298  353  1447   2   4  94
 0  1  1 136584  1592  400  67428 5860 4104  1465  1034  322  1327   3   3  93
 0  1  1 141668  1592  388  72536 5504 4420  1376  1106  323  1423   1   3  95
 0  1  1 146724  1592  380  77592 5996 4236  1499  1060  335  1096   2   3  94
 0  1  1 151876  1600  320  82764 6264 3712  1566   936  327  1226   3   4  93
 0  1  1 157016  1600  320  87908 5284 4268  1321  1068  315  1248   1   2  96
 1  0  0 157016  1600  308  87792 1836 5168   459  1293  281  1324   3   3  94
 0  1  0 162204  1600  304  92892 7784 5236  1946  1315  385  1353   3   5  92
 0  1  0 167216  1600  304  97780 3496 5016   874  1256  301  1222   0   2  97
 0  1  1 177904  1608  284 108276 5160 5168  1290  1300  330  1453   1   4  94
 0  1  2 182008  1588  288 112264 4936 3344  1268   838  293   801   2   3  95
 0  2  1 183620  1588  260 114012 3064 1756   830   445  290   846   0  15  85
 0  2  2 185384  1596  180 115864 2320 2620   635   658  285   722   1  29  70
 0  3  2 187528  1592  220 117892 2488 2224   657   557  273   754   3  30  67
 0  4  1 190512  1592  236 120772 2524 3012   725   760  343  1080   1  14  85
 0  4  1 195780  1592  240 125868 2336 5316   613  1331  381  1624   2   2  96
 1  0  1 200992  1592  248 131052 2080 2176   623   552  234  1044   3  23  74
 0  1  0 200996  1592  252 130948 2208 3048   580   762  256  1065  10  10  80
 0  1  1 206240  1592  252 136076 2988 5252   760  1314  309  1406   7   4  8
 0  2  1 211408  1592  256 141080 5424 5180  1389  1303  395  1885   3   5  91
 0  2  0 214744  1592  264 144280 4756 3328  1223   834  327  1211   1   5  95
 1  0  0 214868  1592  244 144468 4344 5148  1087  1295  303  1189  11   2  86
 0  1  1 214900  1592  248 144496 4360 3244  1098   812  318  1467   7   4  89
 0  1  1 214916  1592  248 144520 4280 3452  1070   865  336  1602   3   3  94
 0  1  1 214964  1592  248 144580 4972 4184  1243  1054  368  1620   3   5  92
 0  2  2 214956  1592  272 144548 3700 4544  1081  1142  665  2952   1   1  98
 0  1  0 214992  1592  272 144588 1220 5088   305  1274  282  1363   1   4  95
 0  1  1 215012  1592  272 144600 3640 4420   910  1106  325  1579   3   2  9
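
One way to check that the growing page cache is really swap cache is to compare how much the swpd and cache columns grow over the trace. A minimal Python sketch, assuming the column order shown in the vmstat header (swpd is the 4th field, cache the 7th). The helper name growth() is made up for illustration, and the two rows are the first and last rows of the trace above:

```python
def growth(first_row, last_row):
    """Return (swpd growth, cache growth) in kB between two vmstat rows."""
    a = first_row.split()
    b = last_row.split()
    # Column order assumed from the header: r b w swpd free buff cache ...
    swpd = int(b[3]) - int(a[3])
    cache = int(b[6]) - int(a[6])
    return swpd, cache


first = "0  1  1 131488  1592  400  62384 4172 5188  1092  1298  353  1447   2   4  94"
last = "0  1  1 215012  1592  272 144600 3640 4420   910  1106  325  1579"

swpd_kb, cache_kb = growth(first, last)
print(swpd_kb, cache_kb)  # 83524 82216
```

Swap use and cache grow by nearly the same amount (roughly 80MB each) while free memory stays flat, which is consistent with the new cache pages being swap cache pages.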

Any thoughts on this?
-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Re: Subtle MM bug
  2001-01-07 20:59 Zlatko Calusic
@ 2001-01-07 21:37 ` Rik van Riel
  2001-01-07 22:33   ` Zlatko Calusic
  2001-01-09  2:01   ` Zlatko Calusic
  0 siblings, 2 replies; 88+ messages in thread
From: Rik van Riel @ 2001-01-07 21:37 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 7 Jan 2001, Zlatko Calusic wrote:

> Things go berserk if you have one big process whose working set
> is around your physical memory size.

"go berserk" in what way?  Does the system cause lots of extra
swap IO and does it make the system thrash where 2.2 didn't
even touch the disk ?

> Final effect is that physical memory gets extremely flooded with
> the swap cache pages and at the same time the system absorbs
> ridiculous amount of the swap space.

This is mostly because Linux 2.4 keeps dirty pages in the
swap cache. Under Linux 2.2 a page would be deleted from the
swap cache when a program writes to it, but in Linux 2.4 it
can stay in the swap cache.
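
The difference between the two policies can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel code: the function name, the FIFO eviction order, and the figure of 100 units of RAM left for the process are all made up for illustration. It counts how many swap slots stay allocated when a process repeatedly writes to a working set larger than RAM:

```python
from collections import deque


def swap_slots_in_use(pages, ram_pages, passes, keep_slot_on_write):
    """Toy model: one process writes to `pages` pages in order, `passes` times.

    keep_slot_on_write=False mimics the 2.2 behaviour described above
    (writing to a page drops its swap cache entry, freeing the slot);
    keep_slot_on_write=True mimics the 2.4 behaviour (the slot stays).
    Eviction is plain FIFO, which is an assumption of this sketch.
    """
    in_ram = deque()   # resident pages, oldest first
    resident = set()
    has_slot = set()   # pages that currently own a swap slot

    for _ in range(passes):
        for page in range(pages):
            if page not in resident:
                if len(in_ram) == ram_pages:
                    victim = in_ram.popleft()
                    resident.discard(victim)
                    has_slot.add(victim)      # swapped out: needs a slot
                in_ram.append(page)
                resident.add(page)
            # The process dirties the page it just touched.
            if not keep_slot_on_write:
                has_slot.discard(page)
    return len(has_slot)


# Units are arbitrary "MB-sized pages": 170 for the hog, 100 left for it in RAM.
print(swap_slots_in_use(170, 100, 2, keep_slot_on_write=False))  # 70
print(swap_slots_in_use(170, 100, 2, keep_slot_on_write=True))   # 170
```

In the 2.2-like mode only the non-resident part of the working set holds swap slots; in the 2.4-like mode every page that was ever swapped out keeps its slot, so allocated swap approaches the full size of the process.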

Oh, and don't forget that pages in the swap cache can also
be resident in the process, so it's not like the swap cache
is "eating into" the process' RSS ;)

> For instance on my 192MB configuration, firing up the hogmem
> program which allocates let's say 170MB of memory and dirties it
> leads to 215MB of swap used.

So that's 170MB of swap space for hogmem and 45MB for
the other things in the system (daemons, X, ...).

Sounds pretty ok, except maybe for the fact that now
Linux allocates (not uses!) a lot more swap space than
before and some people may need to add some swap space
to their system ...

Now if 2.4 has worse _performance_ than 2.2 due to one
reason or another, that I'd like to hear about ;)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-07 21:37 ` Rik van Riel
@ 2001-01-07 22:33   ` Zlatko Calusic
  2001-01-09  2:01   ` Zlatko Calusic
  1 sibling, 0 replies; 88+ messages in thread
From: Zlatko Calusic @ 2001-01-07 22:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
> 
> > Things go berserk if you have one big process whose working set
> > is around your physical memory size.
> 
> "go berserk" in what way?  Does the system cause lots of extra
> swap IO and does it make the system thrash where 2.2 didn't
> even touch the disk ?
>

Well, I think yes. I'll do some testing on the 2.2 before I can tell
you for sure, but definitely the system is behaving badly where I
think it should not.

> > Final effect is that physical memory gets extremely flooded with
> > the swap cache pages and at the same time the system absorbs
> > ridiculous amount of the swap space.
> 
> This is mostly because Linux 2.4 keeps dirty pages in the
> swap cache. Under Linux 2.2 a page would be deleted from the
> swap cache when a program writes to it, but in Linux 2.4 it
> can stay in the swap cache.
>

OK, I can buy that.

> Oh, and don't forget that pages in the swap cache can also
> be resident in the process, so it's not like the swap cache
> is "eating into" the process' RSS ;)
>

So far so good... A little bit weird but not alarming per se.

> > For instance on my 192MB configuration, firing up the hogmem
> > program which allocates let's say 170MB of memory and dirties it
> > leads to 215MB of swap used.
> 
> So that's 170MB of swap space for hogmem and 45MB for
> the other things in the system (daemons, X, ...).
>

Yes, that's it. So it looks like all of my processes are on the
swap. That can't be good. I mean, even Solaris (known to eat swap
space like there's no tomorrow :)) would probably be more polite.

> Sounds pretty ok, except maybe for the fact that now
> Linux allocates (not uses!) a lot more swap space than
> before and some people may need to add some swap space
> to their system ...
>

Yes, I would say really a lot more. Big difference.

Also, I don't see a difference between allocated and used swap space on
Linux. Could you elaborate on that?

> 
> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

I'll get back to you later with more data. Time to boot 2.2. :)
-- 
Zlatko

* Re: Subtle MM bug
@ 2001-01-08  5:29 Wayne Whitney
  2001-01-08  5:42 ` Andi Kleen
  2001-01-08 17:16 ` Rik van Riel
  0 siblings, 2 replies; 88+ messages in thread
From: Wayne Whitney @ 2001-01-08  5:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: William A. Stein


On Sunday, January 7, 2001, Rik van Riel <riel@conectiva.com.br> wrote:

> Now if 2.4 has worse _performance_ than 2.2 due to one reason or
> another, that I'd like to hear about ;)

Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
and as it is the usual workload on my little cluster of 3 machines, they
are all running 2.2.19pre:

The application is some mathematics computations (modular symbols) using a
package called MAGMA;  at times this requires very large matrices.  The
RSS can get up to 870MB; for some reason a MAGMA process under linux
thinks it has run out of memory at 870MB, regardless of the actual
memory/swap in the machine.  MAGMA is single-threaded.

The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
There is no problem with just one MAGMA process, it just hits that 870MB
barrier and gracefully exits.  But if I do the following test, I notice
very different behaviour under 2.2 and 2.4:  while running 'top d 1' I
simultaneously launch two instances of a job which actually requires more
than 870MB of memory to complete.  So each instance will slowly grow in
RSS until it gets killed by OOM or hits that 870MB limit.

Under 2.2, everything proceeds smoothly: before physical RAM is exhausted,
top updates every second, and the jobs have all the CPU.  When swapping
kicks in, top updates every 1-2 seconds and lists most of the CPU as
'system' (kswapd), but I perceive not much loss of interactivity.
Eventually the 1GB of virtual memory is exhausted, the OOM killer kills
one of the MAGMA's, and the other runs till it hits the 870MB barrier and
exits.

But under 2.4, interactivity suffers as soon as physical RAM is exhausted.
Top only updates every 2-10 seconds, the load average hits 3-4, and top
reports the CPUs are 90% idle.  Eventually, the OOM killer kicks in and
all returns to normal.  For practical purposes, the machine is unusable
while swapping like this.

I have heard 'vmstat' mentioned here, so below is the output of a 'vmstat
1' concomitant with the test above (top and the two MAGMA jobs).  I would
be more than happy to provide any other relevant information about this.

I read the LKML via an archive that updates once a day, so please cc: me
if you would like a speedier response.  I wish I knew of a newsgroup
interface to the LKML, then I could read it more often :-).

Cheers,
Wayne


   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0  49180 447840    840  54104 269 969    84   244   76   236  10   4  86
 1  0  0  49180 443276    852  55972   0   0   470     0  163   150  15   2  83
 2  0  0  49180 440060    852  56292   0   0    80     0  115    60  93   1   6
 2  0  0  49180 438236    856  56292   0   0     1     0  107    53  99   1   0
 2  0  0  49180 429468    856  56392   0   0    25     0  109    16  99   0   0
 2  0  0  49180 421296    856  56392   0   0     0     0  104    13  98   2   0
 2  0  0  49180 421132    856  56392   0   0     0     0  108    53 100   0   0
 2  0  0  49180 421128    856  56392   0   0     0     0  108    47 100   0   0
 2  0  0  49180 397520    856  56392   0   0     0     1  107    49  96   4   0
 2  0  0  49180 364860    856  56392   0   0     0     0  106    47  95   5   0
 2  0  0  49180 332244    856  56392   0   0     0     0  106    49  95   5   0
 2  0  0  49180 299660    856  56392   0   0     0     0  106    54  92   8   0
 2  0  0  49180 267076    856  56392   0   0     0     0  109    56  95   5   0
 2  0  0  49180 234632    856  56392   0   0     0     0  110    57  94   6   0
 2  0  0  49180 202096    872  56448  32   0    18     0  117    70  95   5   0
 2  0  0  49180 169544    872  56448   0   0     0     0  103    13  96   4   0
 2  0  0  49180 137108    872  56448   0   0     0     0  107    49  93   7   0
 2  0  0  49180 104600    872  56448   0   0     0     0  107    51  94   6   0
 2  0  0  49180  72368    872  56448   0   0     0    52  136    54  93   7   0
 2  0  0  49180  39964    872  56448   0   0     0     0  110    59  92   8   0
 2  0  2   7296   1576     96  13072   0 720     0   184  130   465  74  22   4
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  2  2  53620   1564    116  23512 1012 31876   565  7969  883  3802   1   8  92
 2  1  2  68800   1560     96  20128  68 15396    17  3850  291  2775   1   7  92
 3  0  1  99484   1556     96  26096  84 29552    21  7388  594  3832   1   4  95
 1  3  2 114708   1560    104  32528 284 14696   161  3674  374  3125   0   4  96
 1  4  2 175484   1560    124  31112 360 63000   237 15753 1404 14952   1   5  94
 1  2  2 205900   1560     96  32748  12 30080     3  7520  606  8356   1   5  94
 2  1  2 221156   1560     96  17848 412 14256   103  3564  308  8450   1  10  89
 1  2  2 222128   1564     96  12736   0 16100     7  4025  346  1010   0   5  95
 1  2  2 236580   1560    108  15220 276 13988    97  3497  347  4102   0   7  92
 2  1  2 267488   1560    104  32044 260 17376    69  4346  405  1265   0   7  93
 3  1  1 282756   1560     96  29380  16 15304     4  3827  335  4359   1   7  92
 2  1  2 282756   1580     96  11460  92 14948    23  3737  332  4120   1   5  94
 2  1  2 313496   1560    100  30476 200 15484    54  3871  318  2359   0   9  90
 2  1  2 313496   1560    100  14148   0 13076     1  3270  246  5165   1   8  91
 3  1  1 344564   1572     96  23892  16 18444    11  4613  419  1555   0   7  93
 2  1  2 375020   1560     96  25400 172 26988    43  6747  556  2910   1   7  93
 1  2  2 375020   1968     96  22760   8 17136     2  4284  378   787   0   2  98
 2  1  2 406056   1568     96  20432 212 17320    53  4330  393  2704   1  10  89
 3  0  3 421316   1560     96  25056  72 14416    18  3604  281  1731   0   5  94
 1  3  0 452120   1544    100  21216 240 31480   116  7870  715  2681   1   6  94
 2  2  2 467488   1588    108  27248 440 15056   123  3765  385  2206   0   5  94
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  1  0 467488   1564    136  13352  88 15376    49  3844  368  2913   1   4  95
 3  0  1 482864   1560     96  15256 128 15384    32  3846  296   986   1   7  92
 3  0  1 497920   1560     96  14144   0 12636     0  3159  245  2302   1   9  90
 3  1  1 529844   1540     96  18632 940 33340   569  8336 1104  1366   1  10  88
 0  1  0 269856 205944    148  21772 2628   0  1196     2  267   313   0   3  97
 0  1  0 269856 182736    156  33180 11180   0  2854     0  309   451   6   3  91
 0  1  0 269856 158668    156  44696 11516   0  2879     0  314   462  12   4  83
 0  1  0 269856 131928    156  57588 12892   0  3223     0  312   466   8   4  88
 0  1  0 269856 105176    156  70448 12864   0  3216     0  332   506  12   3  85
 0  1  0 269856  79056    156  82644 12196   0  3049     0  456   602  10   6  83
 1  1  0 269856  46948    156  96900 14252   0  3563     0  359   518  21   7  72


* Re: Subtle MM bug
  2001-01-08  5:29 Wayne Whitney
@ 2001-01-08  5:42 ` Andi Kleen
  2001-01-08  6:04   ` Linus Torvalds
  2001-01-08 17:16 ` Rik van Riel
  1 sibling, 1 reply; 88+ messages in thread
From: Andi Kleen @ 2001-01-08  5:42 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: linux-kernel, William A. Stein

On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> The application is some mathematics computations (modular symbols) using a
> package called MAGMA;  at times this requires very large matrices.  The
> RSS can get up to 870MB; for some reason a MAGMA process under linux
> thinks it has run out of memory at 870MB, regardless of the actual
> memory/swap in the machine.  MAGMA is single-threaded.

I think it's caused by the way malloc maps its memory. 
Newer glibc should work a bit better by falling back to mmap even for smaller
allocations (older does it only for very big ones) 



-Andi


* Re: Subtle MM bug
  2001-01-08  5:42 ` Andi Kleen
@ 2001-01-08  6:04   ` Linus Torvalds
  2001-01-08 17:44     ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2001-01-08  6:04 UTC (permalink / raw)
  To: linux-kernel

In article <20010108064225.B29026@gruyere.muc.suse.de>,
Andi Kleen  <ak@suse.de> wrote:
>On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
>> The application is some mathematics computations (modular symbols) using a
>> package called MAGMA;  at times this requires very large matrices.  The
>> RSS can get up to 870MB; for some reason a MAGMA process under linux
>> thinks it has run out of memory at 870MB, regardless of the actual
>> memory/swap in the machine.  MAGMA is single-threaded.
>
>I think it's caused by the way malloc maps its memory. 
>Newer glibc should work a bit better by falling back to mmap even for smaller
>allocations (older does it only for very big ones) 

That doesn't resolve the "2.4.x behaves badly" thing, though.

I've seen that one myself, and it seems to be simply due to the fact
that we're usually so good at getting memory from page_launder() that we
never bother to try to swap stuff out. And when we _do_ start swapping
stuff out it just moves to the dirty list, and page_launder() will take
care of it.

So far so good. The problem appears to be that we don't swap stuff out
smoothly: we start doing the VM scanning, but when we get enough dirty
pages, we'll let it be, and go back to page_launder() again. Which means
that we don't walk through the whole VM space, we just do some "spot
cleaning".

		Linus 

* Re: Subtle MM bug
  2001-01-08  5:29 Wayne Whitney
  2001-01-08  5:42 ` Andi Kleen
@ 2001-01-08 17:16 ` Rik van Riel
  2001-01-08 17:58   ` Linus Torvalds
  2001-01-08 21:30   ` Wayne Whitney
  1 sibling, 2 replies; 88+ messages in thread
From: Rik van Riel @ 2001-01-08 17:16 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: linux-kernel, Linus Torvalds, William A. Stein

On Sun, 7 Jan 2001, Wayne Whitney wrote:

> Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,

> The typical machine is a dual Intel box with 512MB RAM and 512MB swap.

How does 2.4 perform when you add an extra GB of swap ?

2.4 keeps dirty pages in the swap cache, so you will need
more swap to run the same programs...

Linus: is this something we want to keep or should we give
the user the option to run in a mode where swap space is
freed when we swap in something non-shared ?

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-08  6:04   ` Linus Torvalds
@ 2001-01-08 17:44     ` Rik van Riel
  2001-01-08 18:02       ` Linus Torvalds
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2001-01-08 17:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On 7 Jan 2001, Linus Torvalds wrote:

> That doesn't resolve the "2.4.x behaves badly" thing, though.
> 
> I've seen that one myself, and it seems to be simply due to the
> fact that we're usually so good at getting memory from
> page_launder() that we never bother to try to swap stuff out.
> And when we _do_ start swapping stuff out it just moves to the
> dirty list, and page_launder() will take care of it.
> 
> So far so good. The problem appears to be that we don't swap
> stuff out smoothly: we start doing the VM scanning, but when we
> get enough dirty pages, we'll let it be, and go back to
> page_launder() again. Which means that we don't walk through the
> whole VM space, we just do some "spot cleaning".

You are right in that we need to refill the inactive list
before calling page_launder(), but we'll also need a few
other modifications:

1. adopt the latest FreeBSD tactic in page_launder()
	- mark dirty pages we see but don't flush
	- in the first loop, flush up to maxlaunder of the
	  already seen dirty pages
	- in the second loop, flush as many pages as we
	  need to refill the free&inactive_clean list

2. go back to having a _static_ free target, at
   max(freepages.high, SUM(zone->pages_high)) ... this
   means free_shortage() will never be very big

3. keep track of how many pages we need to free in
   page_launder() and subtract one from the target
   when we submit a page for IO ... no need to flush
   20MB of dirty pages when we only need 1MB of pages
   cleaned
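
The idea in point 3 can be sketched as a loop that carries a shrinking target. This is a simplified Python model, not the kernel's page_launder(): the function name, the boolean list standing in for the inactive list, the 4kB page size, and the choice to count reclaimed clean pages against the same target are all assumptions of the sketch:

```python
PAGE_KB = 4  # assumed page size (i386)


def launder(inactive_dirty, pages_needed):
    """Walk the inactive list, stopping once enough pages are on their way.

    `inactive_dirty` is a list of booleans (True = page is dirty).
    Returns (pages freed directly, pages submitted for IO).
    """
    freed = flushed = 0
    target = pages_needed
    for dirty in inactive_dirty:
        if target <= 0:
            break              # enough pages freed or in flight: stop here
        if dirty:
            flushed += 1       # submit the page for IO ...
        else:
            freed += 1         # ... or reclaim a clean page right away
        target -= 1            # either way, one less page to find
    return freed, flushed


dirty_list = [True] * (20 * 1024 // PAGE_KB)   # 20MB of dirty pages
need = 1 * 1024 // PAGE_KB                     # we only need 1MB
print(launder(dirty_list, need))               # (0, 256)
```

With the target tracked this way, only 256 of the 5120 dirty pages are submitted for IO instead of the whole list.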

I have these things in my local tree and it seems to smooth
out the load quite well for a very large haskell run and for
the fillmem program from Juan Quintela's memtest suite.

When combined with your idea of refilling the freelist _first_,
we should be able to get the VM quite a bit smoother under loads
with lots of dirty pages.

I will work on this while travelling to and being in Australia.
Expect a clean patch to fix this problem once the 2.4 bugfix-only
period is over.

Other people on this list are invited to apply the VM patches from
my home page and give them a good beating. I want to be able to
submit a well-tested, known-good patch to Linus once 2.4 is out of
the bugfix-only period...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-08 17:16 ` Rik van Riel
@ 2001-01-08 17:58   ` Linus Torvalds
  2001-01-08 23:41     ` Zlatko Calusic
  2001-01-08 21:30   ` Wayne Whitney
  1 sibling, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2001-01-08 17:58 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Wayne Whitney, linux-kernel, William A. Stein



On Mon, 8 Jan 2001, Rik van Riel wrote:

> On Sun, 7 Jan 2001, Wayne Whitney wrote:
> 
> > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> 
> > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
> 
> How does 2.4 perform when you add an extra GB of swap ?
> 
> 2.4 keeps dirty pages in the swap cache, so you will need
> more swap to run the same programs...
> 
> Linus: is this something we want to keep or should we give
> the user the option to run in a mode where swap space is
> freed when we swap in something non-shared ?

I'd prefer just documenting it and keeping it. I'd hate to have two fairly
different modes of behaviour. It's always been the suggested "twice the
amount of RAM", although there's historically been the "Linux doesn't
really need that much" that we just killed with 2.4.x.

If you have 512MB of RAM, you can probably afford another 40GB or so of
harddisk. They are disgustingly cheap these days.

		Linus


* Re: Subtle MM bug
  2001-01-08 17:44     ` Rik van Riel
@ 2001-01-08 18:02       ` Linus Torvalds
  0 siblings, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-08 18:02 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Mon, 8 Jan 2001, Rik van Riel wrote:
> 
> You are right in that we need to refill the inactive list
> before calling page_launder(), but we'll also need a few
> other modifications:

NONE of your three additions do _anything_ to help us at all if we don't
even see the dirty bit because the page is on the active list and the
dirty bit is in somebody's VM space.

I agree that they look ok, but they are all complicating the code. I
propose getting rid of complications, and getting rid of the precarious
"when do we actually scan the VM tables" balancing issue.

Quite frankly, I'd rather see somebody try the vmscan stuff FIRST. Your
suggestions look fine, but apart from the "let dirty pages go twice
through the list" they look like tweaks that would need re-tweaking after
the balancing stuff is ripped out.

		Linus


* Re: Subtle MM bug
@ 2001-01-08 20:39 Szabolcs Szakacsits
  2001-01-08 21:56 ` Wayne Whitney
  2001-01-08 22:00 ` Subtle MM bug Wayne Whitney
  0 siblings, 2 replies; 88+ messages in thread
From: Szabolcs Szakacsits @ 2001-01-08 20:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, Wayne Whitney

Andi Kleen <ak@suse.de> wrote:
> On Sun, Jan 07, 2001 at 09:29:29PM -0800, Wayne Whitney wrote:
> > package called MAGMA; at times this requires very large matrices. The
> > RSS can get up to 870MB; for some reason a MAGMA process under linux
> > thinks it has run out of memory at 870MB, regardless of the actual
> > memory/swap in the machine. MAGMA is single-threaded.
> I think it's caused by the way malloc maps its memory.
> Newer glibc should work a bit better by falling back to mmap even
> for smaller allocations (older does it only for very big ones)

AFAIK newer glibc = CVS glibc but the malloc() tune parameters
work via environment variables for the current stable ones as well,
e.g. to overcome the above "out of memory" one could do,
% export MALLOC_MMAP_MAX_=1000000
% export MALLOC_MMAP_THRESHOLD_=0
% magma

By default, on 32-bit Linux, the current stable glibc malloc uses brk
between 0x08?????? and 0x40000000, and uses mmap (for at most 128
mappings, MALLOC_MMAP_MAX_) only if the requested chunk is greater than
128 kB (MALLOC_MMAP_THRESHOLD_). If MAGMA mallocs memory in chunks
smaller than 128 kB, then the above out-of-memory behaviour is expected.
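
The arithmetic behind this can be worked through in a few lines. A rough Python sketch, assuming the lowest possible heap start of 0x08000000 (the text only gives 0x08??????, so the result is an upper bound) and modelling the mmap rule exactly as worded above rather than copying glibc's actual code; the function names are hypothetical:

```python
MiB = 1024 * 1024

# Assumptions: the heap start is only known as 0x08?????? from the text,
# so 0x08000000 is used as the lowest possible value (giving an upper bound).
HEAP_START_LOWEST = 0x08000000
BRK_CEILING = 0x40000000

MMAP_THRESHOLD = 128 * 1024   # default MALLOC_MMAP_THRESHOLD_ per the text
MMAP_MAX = 128                # default MALLOC_MMAP_MAX_ per the text


def max_brk_heap_mib():
    """Upper bound on memory reachable through brk alone."""
    return (BRK_CEILING - HEAP_START_LOWEST) // MiB


def uses_mmap(request_bytes, live_mmaps, threshold=MMAP_THRESHOLD, max_mmaps=MMAP_MAX):
    """Model of the rule described above, not glibc's actual code."""
    return request_bytes > threshold and live_mmaps < max_mmaps


print(max_brk_heap_mib())                      # 896
print(uses_mmap(64 * 1024, 0))                 # False: small chunk, goes to brk
print(uses_mmap(64 * 1024, 0, threshold=0, max_mmaps=1000000))  # True
```

The 896MB upper bound on the brk heap is close to the 870MB limit reported for MAGMA, and with the two environment variables set as in the example above, even small requests become eligible for mmap.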

	Szaka


* Re: Subtle MM bug
  2001-01-08 17:16 ` Rik van Riel
  2001-01-08 17:58   ` Linus Torvalds
@ 2001-01-08 21:30   ` Wayne Whitney
  1 sibling, 0 replies; 88+ messages in thread
From: Wayne Whitney @ 2001-01-08 21:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: LKML, Linus Torvalds, William A. Stein

On Mon, 8 Jan 2001, Rik van Riel wrote:

> How does 2.4 perform when you add an extra GB of swap ?

OK, some more data:

First, I tried booting 2.4.0 with "nosmp" to see if the behavior I observe
is SMP related.  It isn't, there was no difference under 2.4.0 between
512MB/512MB/1CPU and 512MB/512MB/2CPUs.

Second, I tried going to 2GB of swap with 2.4.0, so 512MB/2GB/2CPUs.
Again, there is no difference:  as soon as swapping begins with two MAGMA
processes, interactivity suffers.  I notice that while swapping in this
situation, the HD light is blinking only intermittently.

I also tried logging in to a fourth VT during this second test, and it got
nowhere.  In fact, this stopped the top updates completely and the HD
light also stopped.  After 30 seconds of nothing (all I could do is switch
VT's), I gave up and sent a ^Z to one MAGMA process; this eventually was
received, and the system immediately recovered.

Perhaps there is some sort of I/O starvation triggered by two swapping
processes?

Again, under 2.2.19pre6, the exact same tests yield hardly any loss of
interactivity, I can log in fine (a little slowly) during the top / two
MAGMA process test.  And once swapping begins, the HD light is continually
lit.

Again, I'd be happy to do any additional tests, provide more info about my
machine, etc.

Cheers,
Wayne


* Re: Subtle MM bug
  2001-01-08 20:39 Subtle MM bug Szabolcs Szakacsits
@ 2001-01-08 21:56 ` Wayne Whitney
  2001-01-08 23:22   ` Wayne Whitney
  2001-01-08 22:00 ` Subtle MM bug Wayne Whitney
  1 sibling, 1 reply; 88+ messages in thread
From: Wayne Whitney @ 2001-01-08 21:56 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML

On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:

> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> via environment variables for the current stable ones as well,

Hmm, this must have been introduced in libc6?  Unfortunately, I don't have
the source code to MAGMA, and the binary I have is statically linked.  It
does not contain the names of the environment variables you mentioned.

I'll arrange a binary linked against glibc2.2, and then your suggestion
will hopefully do the trick.  Thanks for your kind help!

Cheers,
Wayne


* Re: Subtle MM bug
  2001-01-08 20:39 Subtle MM bug Szabolcs Szakacsits
  2001-01-08 21:56 ` Wayne Whitney
@ 2001-01-08 22:00 ` Wayne Whitney
  2001-01-08 22:15   ` Andrea Arcangeli
  1 sibling, 1 reply; 88+ messages in thread
From: Wayne Whitney @ 2001-01-08 22:00 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML, Andi Kleen

On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:

> AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> via environment variables for the current stable ones as well, e.g. to
> overcome the above "out of memory" one could do,
>
> % export MALLOC_MMAP_MAX_=1000000
> % export MALLOC_MMAP_THRESHOLD_=0
> % magma

As I just mentioned, I haven't been able to test this yet because my
current binary is linked against an older libc which doesn't seem to
have these parameters.  But here's one other data point; I thought I'd
ask whether it jibes with your theory:  if I configure the Linux kernel
to be able to use 2GB of RAM, then the 870MB limit becomes much lower,
230MB.

Cheers, Wayne


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-08 22:00 ` Subtle MM bug Wayne Whitney
@ 2001-01-08 22:15   ` Andrea Arcangeli
  0 siblings, 0 replies; 88+ messages in thread
From: Andrea Arcangeli @ 2001-01-08 22:15 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: Szabolcs Szakacsits, LKML, Andi Kleen

On Mon, Jan 08, 2001 at 02:00:19PM -0800, Wayne Whitney wrote:
> I'd ask if this jibes with your theory:  if I configure the linux kernel
> to be able to use 2GB of RAM, then the 870MB limit becomes much lower, to
> 230MB.

That's because the virtual address space for userspace tasks gets reduced
from 3G to 2G to give the kernel an additional gigabyte of direct mapping.

The other limit you hit (at around 800MB) is also partly due to the
userspace virtual address space being too small.

You can use this hack of mine to let tasks grow up to 3.5G each on
IA32 on 2.4.0 (an equivalent hack exists for 2.2.19pre6aa1 with bigmem; it
also makes sense without bigmem if you have lots of swap, since this is all
about virtual memory, not physical RAM).  However, it doesn't work with PAE
enabled yet.

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test11-pre5/per-process-3.5G-IA32-no-PAE-1

If you run your program on any 64bit architecture (in 64bit userspace mode)
supported by linux, you won't run into those per-process address space limits.

Andrea

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-08 21:56 ` Wayne Whitney
@ 2001-01-08 23:22   ` Wayne Whitney
  2001-01-08 23:30     ` Andrea Arcangeli
  0 siblings, 1 reply; 88+ messages in thread
From: Wayne Whitney @ 2001-01-08 23:22 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML, Andi Kleen

On Mon, 8 Jan 2001, Wayne Whitney wrote:

> On Mon, 8 Jan 2001, Szabolcs Szakacsits wrote:
>
> > AFAIK newer glibc = CVS glibc but the malloc() tune parameters work
> > via environment variables for the current stable ones as well,
>
> I'll arrange a binary linked against glibc2.2, and then your suggestion
> will hopefully do the trick.  Thanks for your kind help!

OK, I now have a binary dynamically linked against /lib/libc.so.6
(according to ldd), which points to glibc-2.1.92.  I set the environment
variables you suggested, checked that they are set, and checked that
their names appear in /lib/libc.so.6.  But the behaviour is
unchanged:  MAGMA still hits this barrier at 830M (not 870M, that was a
typo).

I guess I conclude that either (1) MAGMA does not use libc's malloc
(checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
variables but has not yet implemented the tuning (I'll try glibc-2.2) or
(3) this is not the problem.

I'll look at Andrea's hack as well.  Thanks for everybody's help!

Cheers, Wayne


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:22   ` Wayne Whitney
@ 2001-01-08 23:30     ` Andrea Arcangeli
  2001-01-09  0:37       ` Linus Torvalds
  2001-01-09  3:01       ` Subtle MM bug (really 830MB barrier question) Wayne Whitney
  0 siblings, 2 replies; 88+ messages in thread
From: Andrea Arcangeli @ 2001-01-08 23:30 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: Szabolcs Szakacsits, LKML, Andi Kleen

On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
> I guess I conclude that either (1) MAGMA does not use libc's malloc
> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
> (3) this is not the problem.

You should monitor the program with strace while it fails (the last few
syscalls).  You can set a breakpoint at exit() and run `cat /proc/pid/maps`
to show us the vma layout of the task.  Then we'll see why it's failing.
With CONFIG_1G in 2.2.x or 2.4.x (the configuration option doesn't matter)
you should at least reach something like 1.5G.

Andrea

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-08 17:58   ` Linus Torvalds
@ 2001-01-08 23:41     ` Zlatko Calusic
  2001-01-09  2:58       ` Linus Torvalds
  2001-01-09  6:20       ` Eric W. Biederman
  0 siblings, 2 replies; 88+ messages in thread
From: Zlatko Calusic @ 2001-01-08 23:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On Mon, 8 Jan 2001, Rik van Riel wrote:
> 
> > On Sun, 7 Jan 2001, Wayne Whitney wrote:
> > 
> > > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> > 
> > > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
> > 
> > How does 2.4 perform when you add an extra GB of swap ?
> > 
> > 2.4 keeps dirty pages in the swap cache, so you will need
> > more swap to run the same programs...
> > 
> > Linus: is this something we want to keep or should we give
> > the user the option to run in a mode where swap space is
> > freed when we swap in something non-shared ?
> 
> I'd prefer just documenting it and keeping it. I'd hate to have two fairly
> different modes of behaviour. It's always been the suggested "twice the
> amount of RAM", although there's historically been the "Linux doesn't
> really need that much" that we just killed with 2.4.x.
> 
> If you have 512MB of RAM, you can probably afford another 40GB or so of
> harddisk. They are disgustingly cheap these days.
> 

Yes, but a lot more data on the swap also means degraded performance,
because the disk head has to seek around in the much bigger area. Are
you sure this is all OK?
-- 
Zlatko

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:30     ` Andrea Arcangeli
@ 2001-01-09  0:37       ` Linus Torvalds
  2001-01-09  3:01       ` Subtle MM bug (really 830MB barrier question) Wayne Whitney
  1 sibling, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-09  0:37 UTC (permalink / raw)
  To: linux-kernel

In article <20010109003002.L27646@athlon.random>,
Andrea Arcangeli  <andrea@suse.de> wrote:
>On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:
>> I guess I conclude that either (1) MAGMA does not use libc's malloc
>> (checking on this, I doubt it) or (2) glibc-2.1.92 knows of these
>> variables but has not yet implemented the tuning (I'll try glibc-2.2) or
>> (3) this is not the problem.
>
>You should monitor the program with strace while it fails (last few syscalls).
>You can breakpoint at exit() and run `cat /proc/pid/maps` to show us the vma
>layout of the task. Then we'll see why it's failing.  With CONFIG_1G in 2.2.x
>or 2.4.x (configuration option doesn't matter) you should at least reach
>something like 1.5G.

It might be doing its own memory management with brk() directly - some
older UNIX programs will do that (for various reasons - it can be faster
than malloc() etc if you know your access patterns, for example).

If you do that, and you have shared libraries, you'll get a failure
around the point Wayne sees it. 

But your suggestion to check with strace is a good one.

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-07 21:37 ` Rik van Riel
  2001-01-07 22:33   ` Zlatko Calusic
@ 2001-01-09  2:01   ` Zlatko Calusic
  2001-01-17  4:48     ` Rik van Riel
  1 sibling, 1 reply; 88+ messages in thread
From: Zlatko Calusic @ 2001-01-09  2:01 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

Oh, well, it seems that I was wrong. :)


First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
192MB machine)

kernel | swap usage | speed
-------------------------------
2.2.17 |  48 MB     | 11.8 MB/s
-------------------------------
2.4.0  | 206 MB     | 11.1 MB/s
-------------------------------

So 2.2 is only marginally faster. Also it can be seen that 2.4 uses 4
times more swap space. If Linus says it's ok... :)


Second test: kernel compile make -j32 (empirically this puts the VM
under load, but not excessively!)

2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total

Now, is this great news or what?  2.4.0 is definitely faster.
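
For a rough sense of scale, the figures from the two tests above can be
turned into ratios.  The short Python sketch below only restates those
measurements; the variable names are illustrative.

```python
# Figures copied from the two tests above; nothing here is newly measured.
swap_mb = {"2.2.17": 48, "2.4.0": 206}
hogmem_mb_per_s = {"2.2.17": 11.8, "2.4.0": 11.1}
# Wall-clock time of "make -j32" in seconds (4:21.13 and 3:50.24).
make_seconds = {"2.2.17": 4 * 60 + 21.13, "2.4.0": 3 * 60 + 50.24}

swap_ratio = swap_mb["2.4.0"] / swap_mb["2.2.17"]
hogmem_slowdown = 1 - hogmem_mb_per_s["2.4.0"] / hogmem_mb_per_s["2.2.17"]
make_speedup = 1 - make_seconds["2.4.0"] / make_seconds["2.2.17"]

print(f"swap used by 2.4.0 relative to 2.2.17: {swap_ratio:.1f}x")
print(f"hogmem throughput drop on 2.4.0: {hogmem_slowdown:.1%}")
print(f"make -j32 wall-clock reduction on 2.4.0: {make_speedup:.1%}")
```

So "4 times more swap" is about 4.3x, the hogmem gap is around 6%, and
the compile finishes roughly 12% sooner.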

-- 
Zlatko

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:41     ` Zlatko Calusic
@ 2001-01-09  2:58       ` Linus Torvalds
  2001-01-09  6:20       ` Eric W. Biederman
  1 sibling, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-09  2:58 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Rik van Riel, linux-kernel



On 9 Jan 2001, Zlatko Calusic wrote:
> 
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?

Yes and no.

I'm not _sure_, obviously.

However, one thing I _am_ sure of is that the sticky page-cache simplifies
some things enormously, and makes some things possible that simply weren't
possible before. 

		Linus


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug (really 830MB barrier question)
  2001-01-08 23:30     ` Andrea Arcangeli
  2001-01-09  0:37       ` Linus Torvalds
@ 2001-01-09  3:01       ` Wayne Whitney
  2001-01-09 20:06         ` Szabolcs Szakacsits
  1 sibling, 1 reply; 88+ messages in thread
From: Wayne Whitney @ 2001-01-09  3:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Szabolcs Szakacsits, LKML, Andi Kleen, William A. Stein


On Mon, Jan 08, 2001 at 03:22:44PM -0800, Wayne Whitney wrote:

> I guess I conclude that either (1) MAGMA does not use libc's malloc
> (checking on this, I doubt it)

I'm still a bit unclear on this one.  I now have two executables,
magma.exe and magma.exe.dyn (ignore the .exe).  magma.exe is statically
linked, and magma.exe.dyn is dynamically linked against libc.so.6.  But
the binaries are the same size!  Well, 13.7MB and 13.4MB, respectively.

'strings magma.exe' does not mention MALLOC_MMAP_*, so I conclude it is
statically linked against an older libc (libc.so.5?).  Is it possible that
magma.exe.dyn is statically linked against libc.so.5 and dynamically
linked against libc.so.6, so that the older malloc is getting used?  This
would explain the size similarity.

> or (2) glibc-2.1.92 knows of these variables but has not yet
> implemented the tuning (I'll try glibc-2.2) or

I see the same behavior with glibc-2.2 as with glibc-2.1.92.

On Tue, 9 Jan 2001, Andrea Arcangeli wrote:

> You should monitor the program with strace while it fails (last few
> syscalls).

At the very end of this message is the (slightly edited) strace of the
following test magma session:

  Magma V2.7-4      Mon Jan  8 2001 21:27:34 on modular  [Seed = 3551764170]
  Type ? for help.  Type <Ctrl>-D to quit.
  > M1:=MatrixAlgebra(Rationals(),10000)!1;
  > M2:=MatrixAlgebra(Rationals(),10000)!1;
  > M3:=MatrixAlgebra(Rationals(),10000)!1;

  System error: Out of memory.
  All virtual memory has been exhausted so Magma cannot perform this
  statement.

Here I try three times to allocate a 10000x10000 matrix of a 32bit data
type, so each matrix takes up 4e8 bytes.
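
As a quick check of that size estimate, the sketch below compares
10000 x 10000 four-byte entries with how far the program break moved in
the two successful allocations.  The break addresses are copied from the
strace output at the end of this message; the assumption that each matrix
is stored as one flat array of 4-byte entries is only a guess at MAGMA's
layout.

```python
MATRIX_DIM = 10000
ENTRY_BYTES = 4  # assumed: one 32-bit value per matrix entry
matrix_bytes = MATRIX_DIM * MATRIX_DIM * ENTRY_BYTES

# (break before, break after) for the M1 and M2 allocations, from the strace.
brk_steps = [
    (0x0bcffc68, 0x23a77268),
    (0x23a77268, 0x3b7ef668),
]

growth = [after - before for before, after in brk_steps]
for label, grown in zip(("M1", "M2"), growth):
    print(f"{label}: break grew by {grown} bytes "
          f"({grown - matrix_bytes:+d} relative to {matrix_bytes})")
```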

My limited understanding of the strace output is that magma.exe allocates
memory by calling brk() to move the end of its data segment, and
brk() returns the new end of the data segment (on complete success, this
is the address requested), but that eventually this sequence fails with:

brk(0x53567c68)                         = 0x3b807d68
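
The pattern to look for in such a trace is a brk() call whose return
value differs from the address requested: the raw Linux system call hands
back the current, unchanged break when it cannot grow the heap.  A minimal
Python sketch that picks those calls out of strace text follows; the
sample lines are copied from the trace at the end of this message, and the
function name is made up for illustration.

```python
import re

# Matches strace lines such as "brk(0x53567c68)   = 0x3b807d68".
BRK_LINE = re.compile(r"brk\((0x[0-9a-f]+|0)\)\s*=\s*(0x[0-9a-f]+)")

def failed_brk_calls(strace_text):
    """Return (requested, returned) pairs for brk() calls that did not succeed."""
    failures = []
    for line in strace_text.splitlines():
        match = BRK_LINE.search(line)
        if match is None:
            continue
        requested = int(match.group(1), 16)
        returned = int(match.group(2), 16)
        # brk(0) only queries the current break, so it is never a failure.
        if requested != 0 and returned != requested:
            failures.append((requested, returned))
    return failures

sample = """\
brk(0)                                  = 0xbce7460
brk(0x23a77268)                         = 0x23a77268
brk(0x3b7ef668)                         = 0x3b7ef668
brk(0x53567c68)                         = 0x3b807d68
"""

for requested, returned in failed_brk_calls(sample):
    print(f"brk({requested:#x}) refused, break stays at {returned:#x}")
```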

> You can breakpoint at exit() and run `cat /proc/pid/maps` to show us
> the vma layout of the task.

I'm not sure how to set a breakpoint; I didn't see anything in the strace
man page about it handling this.  Do I need to use gdb?  I tried 'rbreak
exit' and 'rbreak _exit' with gdb, and those didn't work.

But I did check /proc/pid/maps each time I got MAGMA's > prompt.  Here is
the output the first time (before allocating any matrices):

08048000-08b5c000 r-xp 00000000 03:05 1130923    /tmp/newmagma/magma.exe.dyn
08b5c000-08cc9000 rw-p 00b13000 03:05 1130923    /tmp/newmagma/magma.exe.dyn
08cc9000-0bd00000 rwxp 00000000 00:00 0
40000000-40016000 r-xp 00000000 03:05 393301     /lib/ld-2.2.so
40016000-40017000 rw-p 00015000 03:05 393301     /lib/ld-2.2.so
40017000-40018000 rwxp 00000000 00:00 0
40018000-40019000 rw-p 00000000 00:00 0
40024000-40043000 r-xp 00000000 03:05 393307     /lib/libm-2.2.so
40043000-40044000 rw-p 0001e000 03:05 393307     /lib/libm-2.2.so
40044000-40164000 r-xp 00000000 03:05 393304     /lib/libc-2.2.so
40164000-4016a000 rw-p 0011f000 03:05 393304     /lib/libc-2.2.so
4016a000-4016e000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
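
One way to read a listing like this is to look at the unmapped gaps
between consecutive regions, since a heap grown with brk() can only extend
into the gap directly above it.  The Python sketch below parses the
listing above in its /proc/pid/maps format and prints each gap; the helper
names are illustrative.

```python
MIB = 1024 * 1024

maps_text = """\
08048000-08b5c000 r-xp 00000000 03:05 1130923    /tmp/newmagma/magma.exe.dyn
08b5c000-08cc9000 rw-p 00b13000 03:05 1130923    /tmp/newmagma/magma.exe.dyn
08cc9000-0bd00000 rwxp 00000000 00:00 0
40000000-40016000 r-xp 00000000 03:05 393301     /lib/ld-2.2.so
40016000-40017000 rw-p 00015000 03:05 393301     /lib/ld-2.2.so
40017000-40018000 rwxp 00000000 00:00 0
40018000-40019000 rw-p 00000000 00:00 0
40024000-40043000 r-xp 00000000 03:05 393307     /lib/libm-2.2.so
40043000-40044000 rw-p 0001e000 03:05 393307     /lib/libm-2.2.so
40044000-40164000 r-xp 00000000 03:05 393304     /lib/libc-2.2.so
40164000-4016a000 rw-p 0011f000 03:05 393304     /lib/libc-2.2.so
4016a000-4016e000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
"""

def parse_ranges(text):
    """Return (start, end) address pairs, one per non-empty maps line."""
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end = line.split()[0].split("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges

def gaps(ranges):
    """Return (gap_start, gap_size) for every hole between adjacent ranges."""
    holes = []
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        if next_start > prev_end:
            holes.append((prev_end, next_start - prev_end))
    return holes

for start, size in gaps(parse_ranges(maps_text)):
    print(f"gap at {start:#010x}: {size / MIB:.1f} MiB")
```

On this listing the gap directly above the heap is 835 MiB, which is
close to the 830MB barrier, while a much larger hole of about 2 GiB sits
between the libraries and the stack.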

Now, subsequent to each memory allocation, only the second number in the
third line changes.  It becomes 23a78000, then 3b7f0000, and finally
3b808000 (after the failed allocation).
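
Those values are consistent with the brk() addresses in the strace output
rounded up to the next page boundary.  A small Python check, assuming the
usual 4096-byte page size on IA32:

```python
PAGE_SIZE = 4096  # assumed: standard IA32 page size

def page_align_up(addr):
    """Round an address up to the next multiple of PAGE_SIZE."""
    return (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)

# brk() address from the strace -> heap end shown in /proc/pid/maps.
observed = {
    0x0bcffc68: 0x0bd00000,  # before any matrix is allocated
    0x23a77268: 0x23a78000,  # after M1
    0x3b7ef668: 0x3b7f0000,  # after M2
    0x3b807d68: 0x3b808000,  # after the failed M3 allocation
}

for brk_addr, maps_end in observed.items():
    print(f"{brk_addr:#010x} -> {page_align_up(brk_addr):#010x} "
          f"(maps shows {maps_end:#010x})")
```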

Sorry this is a bit long, I wanted to include the full strace output in
case it would allow one to divine what memory allocation scheme this
program is using.  Did I mention that a different mathematics package,
pari (for which I have the source), does not see this 830MB limit?  It
will happily allocate more memory (I haven't checked whether it hits a
limit around 1.5GB).

Thanks again for all the responses, they are quite helpful and educational
and heart-warming!!

Cheers,
Wayne

execve("/tmp/newmagma/magma.exe.dyn", ["/tmp/newmagma/magma.exe.dyn"], [/* 27 vars */]) = 0
uname({sys="Linux", node="modular", ...}) = 0
brk(0)                                  = 0xbce7460
open("/etc/ld.so.preload", O_RDONLY)    = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 4
fstat64(4, {st_mode=S_IFREG|0644, st_size=45881, ...}) = 0
old_mmap(NULL, 45881, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40018000
close(4)                                = 0
open("/lib/libm.so.6", O_RDONLY)        = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\20J\0\000"..., 1024) = 1024
fstat64(4, {st_mode=S_IFREG|0755, st_size=503435, ...}) = 0
old_mmap(NULL, 128760, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x40024000
mprotect(0x40043000, 1784, PROT_NONE)   = 0
old_mmap(0x40043000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x1e000) = 0x40043000
close(4)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0\301\1"..., 1024) = 1024
fstat64(4, {st_mode=S_IFREG|0755, st_size=4851725, ...}) = 0
old_mmap(NULL, 1217864, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x40044000
mprotect(0x40164000, 38216, PROT_NONE)  = 0
old_mmap(0x40164000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x11f000) = 0x40164000
old_mmap(0x4016a000, 13640, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4016a000
close(4)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0\301\1"..., 1024) = 1024
fstat64(4, {st_mode=S_IFREG|0755, st_size=4851725, ...}) = 0
close(4)                                = 0
munmap(0x40018000, 45881)               = 0
getpid()                                = 13699
brk(0)                                  = 0xbce7460
brk(0xbce7468)                          = 0xbce7468
brk(0xbce7568)                          = 0xbce7568
brk(0xbcffc68)                          = 0xbcffc68
time([979006103])                       = 979006103
open("/etc/localtime", O_RDONLY)        = 4
fstat64(4, {st_mode=S_IFREG|0644, st_size=1267, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
read(4, "TZif\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\4\0"..., 4096) = 1267
close(4)                                = 0
munmap(0x40018000, 4096)                = 0
ioctl(0, TIOCGWINSZ, {ws_row=40, ws_col=120, ws_xpixel=1200, ws_ypixel=800}) = 0
rt_sigaction(SIGWINCH, {0x89ecb4c, [WINCH], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
ioctl(0, TIOCGWINSZ, {ws_row=40, ws_col=120, ws_xpixel=1200, ws_ypixel=800}) = 0
rt_sigaction(SIGWINCH, {0x89ecb4c, [WINCH], SA_RESTART|0x4000000}, {0x89ecb4c, [WINCH], SA_RESTART|0x4000000}, 8) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 29238685
rt_sigaction(SIGTSTP, {0x89ec6a8, [TSTP], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
getpid()                                = 13699
time(NULL)                              = 979006103
uname({sys="Linux", node="modular", ...}) = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "Magma V2.7-4      Mon Jan  8 200"..., 75) = 75
write(1, "Type ? for help.  Type <Ctrl>-D "..., 41) = 41
rt_sigaction(SIGSEGV, {0x857cde8, [SEGV], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGINT, {0x857cbf0, [INT], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {0x857ccd0, [QUIT], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGILL, {0x857cde8, [ILL], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGFPE, {0x857cde8, [FPE], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
ioctl(0, TCGETA, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCSETA, {B38400 opost isig -icanon -echo ...}) = 0
write(0, "> ", 2)                       = 2

[ I type M1:=MatrixAlgebra(Rationals(),10000)!1; ]

write(0, "\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10"..., 83) = 83
rt_sigaction(SIGALRM, {0x89eb828, [ALRM], SA_RESTART|0x4000000}, {SIG_DFL}, 8) = 0
alarm(1)                                = 0
gettimeofday({979006159, 644571}, {300, 0}) = 0
brk(0x23a77268)                         = 0x23a77268
--- SIGALRM (Alarm clock) ---
ioctl(0, TCSETA, {B38400 opost isig icanon echo ...}) = 0
sigreturn()                             = ? (mask now [])
gettimeofday({979006164, 153555}, {300, 0}) = 0
ioctl(0, TCGETA, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCSETA, {B38400 opost isig -icanon -echo ...}) = 0
write(0, "> ", 2)                       = 2

[ I type M2:=MatrixAlgebra(Rationals(),10000)!1; ]

write(0, "\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10"..., 83) = 83
rt_sigaction(SIGALRM, {0x89eb828, [ALRM], SA_RESTART|0x4000000}, {0x89eb828, [ALRM], SA_RESTART|0x4000000}, 8) = 0
alarm(1)                                = 0
brk(0x23a8f968)                         = 0x23a8f968
gettimeofday({979006171, 207167}, {300, 0}) = 0
--- SIGALRM (Alarm clock) ---
ioctl(0, TCSETA, {B38400 opost isig icanon echo ...}) = 0
sigreturn()                             = ? (mask now [])
brk(0x23a77268)                         = 0x23a77268
brk(0x3b7ef668)                         = 0x3b7ef668
gettimeofday({979006178, 520082}, {300, 0}) = 0
ioctl(0, TCGETA, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCSETA, {B38400 opost isig -icanon -echo ...}) = 0
write(0, "> ", 2)                       = 2

[ I type M3:=MatrixAlgebra(Rationals(),10000)!1; ]

write(0, "\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10"..., 83) = 83
rt_sigaction(SIGALRM, {0x89eb828, [ALRM], SA_RESTART|0x4000000}, {0x89eb828, [ALRM], SA_RESTART|0x4000000}, 8) = 0
alarm(1)                                = 0
brk(0x3b807d68)                         = 0x3b807d68
gettimeofday({979006180, 783755}, {300, 0}) = 0
--- SIGALRM (Alarm clock) ---
ioctl(0, TCSETA, {B38400 opost isig icanon echo ...}) = 0
sigreturn()                             = ? (mask now [])
brk(0x53567c68)                         = 0x3b807d68
brk(0x53567c68)                         = 0x3b807d68
gettimeofday({979006186, 446393}, {300, 0}) = 0
write(1, "\n", 1)                       = 1
write(1, "System error: Out of memory.\n", 29) = 29
write(1, "All virtual memory has been exha"..., 78) = 78
rt_sigaction(SIGSEGV, {0x857cde8, [SEGV], SA_RESTART|0x4000000}, {0x857cde8, [SEGV], SA_RESTART|0x4000000}, 8) = 0
rt_sigaction(SIGINT, {0x857cbf0, [INT], SA_RESTART|0x4000000}, {0x857cbf0, [INT], SA_RESTART|0x4000000}, 8) = 0
rt_sigaction(SIGQUIT, {0x857ccd0, [QUIT], SA_RESTART|0x4000000}, {0x857ccd0, [QUIT], SA_RESTART|0x4000000}, 8) = 0
rt_sigaction(SIGILL, {0x857cde8, [ILL], SA_RESTART|0x4000000}, {0x857cde8, [ILL], SA_RESTART|0x4000000}, 8) = 0
rt_sigaction(SIGFPE, {0x857cde8, [FPE], SA_RESTART|0x4000000}, {0x857cde8, [FPE], SA_RESTART|0x4000000}, 8) = 0
ioctl(0, TCGETA, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCSETA, {B38400 opost isig -icanon -echo ...}) = 0
write(0, "> ", 2)                       = 2
read(0, "\4", 1)                        = 1
write(0, "\n", 1)                       = 1
ioctl(0, TCSETA, {B38400 opost isig icanon echo ...}) = 0
close(0)                                = 0
times({tms_utime=1455, tms_stime=294, tms_cutime=0, tms_cstime=0}) = 29247102
write(1, "\n", 1)                       = 1
write(1, "Total time: 17.479 seconds\n", 27) = 27
munmap(0x40018000, 4096)                = 0
_exit(0)                                = ?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-08 23:41     ` Zlatko Calusic
  2001-01-09  2:58       ` Linus Torvalds
@ 2001-01-09  6:20       ` Eric W. Biederman
  2001-01-09  7:27         ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2001-01-09  6:20 UTC (permalink / raw)
  To: zlatko; +Cc: Linus Torvalds, Rik van Riel, linux-kernel

Zlatko Calusic <zlatko@iskon.hr> writes:

> 
> Yes, but a lot more data on the swap also means degraded performance,
> because the disk head has to seek around in the much bigger area. Are
> you sure this is all OK?

I don't think we have more data on the swap, just more data has an
allocated home on the swap.  With the earlier allocation we should
(I haven't verified) allocate contiguous chunks of memory contiguously
on the swap.   And reusing the same swap pages helps out with this.

Eric

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-09  6:20       ` Eric W. Biederman
@ 2001-01-09  7:27         ` Linus Torvalds
  2001-01-09 11:38           ` Eric W. Biederman
  2001-01-09 12:29           ` Zlatko Calusic
  0 siblings, 2 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-09  7:27 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: zlatko, Rik van Riel, linux-kernel

On 8 Jan 2001, Eric W. Biederman wrote:

> Zlatko Calusic <zlatko@iskon.hr> writes:
> > 
> > Yes, but a lot more data on the swap also means degraded performance,
> > because the disk head has to seek around in the much bigger area. Are
> > you sure this is all OK?
> 
> I don't think we have more data on the swap, just more data has an
> allocated home on the swap.

I think Zlatko's point is that because of the extra allocations, we will
have worse locality (more seeks etc). 

Clearly we should not actually do any more actual IO. But the sticky
allocation _might_ make the IO we do be more spread out.

To offset that, I think the sticky allocation makes us much better able to
handle things like clustering etc more intelligently, which is why I think
it's very much worth it.  But let's not close our eyes to potential
downsides.

		Linus


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-09  7:27         ` Linus Torvalds
@ 2001-01-09 11:38           ` Eric W. Biederman
  2001-01-09 12:29           ` Zlatko Calusic
  1 sibling, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2001-01-09 11:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: zlatko, Rik van Riel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On 8 Jan 2001, Eric W. Biederman wrote:
> 
> > Zlatko Calusic <zlatko@iskon.hr> writes:
> > > 
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> > 
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
> 
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc). 
> 
> Clearly we should not actually do any more actual IO. But the sticky
> allocation _might_ make the IO we do be more spread out.

The tradeoff when implemented correctly is that writes will tend to be
more spread out and reads should be better clustered together. 

> To offset that, I think the sticky allocation makes us much better able to
> handle things like clustering etc more intelligently, which is why I think
> it's very much worth it.  But let's not close our eyes to potential
> downsides.

Certainly, keeping our eyes open is a good thing.

But it has been apparent for a long time that, with allocation done the
way we were doing it, we took a performance hit when it came to heavy
swapping.  So I'm relieved that we are now being more aggressive.

From the sound of it, what we are currently doing actually sucks worse
for some heavy loads.  But it still feels like the right direction.

It's been my impression that workloads where we are actively swapping
are a lot different from workloads where we really don't swap, to
the extent that it might make sense to make the actively swapping case
a config option to get our attention in the code.  It would be nice
to have a Linux kernel for once that handles heavy swapping (below
the level of thrashing) gracefully. :)

Eric

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-09  7:27         ` Linus Torvalds
  2001-01-09 11:38           ` Eric W. Biederman
@ 2001-01-09 12:29           ` Zlatko Calusic
  2001-01-09 18:47             ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Zlatko Calusic @ 2001-01-09 12:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Eric W. Biederman, Rik van Riel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On 8 Jan 2001, Eric W. Biederman wrote:
> 
> > Zlatko Calusic <zlatko@iskon.hr> writes:
> > > 
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> > 
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
> 
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc).

Yes that was my concern.

But in the end I'm not sure. I ran two simple tests and haven't found
any problems with the 2.4.0 mm logic (as opposed to 2.2.17). In fact, the
new kernel was faster in the more interesting (make -j32) test.

I have also found that the new kernel allocates 4 times more swap space
under some circumstances. That may or may not be alarming; it remains
to be seen.

-- 
Zlatko

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-09 12:29           ` Zlatko Calusic
@ 2001-01-09 18:47             ` Linus Torvalds
  2001-01-09 19:09               ` Daniel Phillips
                                 ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-09 18:47 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Eric W. Biederman, Rik van Riel, linux-kernel

On 9 Jan 2001, Zlatko Calusic wrote:
> 
> But in the end I'm not sure. I made two simple tests and haven't found
> any problems with 2.4.0 mm logic (opposed to 2.2.17). In fact, the new
> kernel was faster in the more interesting (make -j32) test.

I personally think 2.4.x is going to be as fast or faster at just about
anything. We do have some MM issues still to hash out, and tuning to do,
but I'm absolutely convinced that 2.4.x is going to be a _lot_ easier to
tune than 2.2.x ever was. The "scan the page tables without doing any IO"
thing just makes the 2.4.x memory management several orders of magnitude
more flexible than 2.2.x ever was.

(This is why I worked so hard at getting the PageDirty semantics right in
the last two months or so - and why I released 2.4.0 when I did. Getting
PageDirty right was the big step to make all of the VM stuff possible in
the first place. Even if it probably looked a bit foolhardy to change the
semantics of "writepage()" quite radically just before 2.4 was released).

> Also I have found that new kernel allocates 4 times more swap space
> under some circumstances. That may or may not be alarming, it remains
> to be seen.

Yes. The new VM will allocate the swap space a _lot_ more aggressively.
Many of those allocations will not necessarily ever actually be used, but
the fact that we _have_ allocated backing store for a page is what allows
us to drop it from the VM page tables, so that it can be processed by
page_launder().

And this _is_ a downside, there's no question about it. There's the worry
about the potential loss of locality, but there's also the fact that you
effectively need a bigger swap partition with 2.4.x - never mind that
large portions of the allocations may never be used. You still need the
disk space for good VM behaviour.

There are always trade-offs; I think the 2.4.x trade-off is a good one.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-09 18:47             ` Linus Torvalds
@ 2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:29                 ` Trond Myklebust
                                   ` (2 more replies)
  2001-01-09 19:53               ` Simon Kirby
  2001-01-10  1:45               ` David Woodhouse
  2 siblings, 3 replies; 88+ messages in thread
From: Daniel Phillips @ 2001-01-09 19:09 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> (This is why I worked so hard at getting the PageDirty semantics right in
> the last two months or so - and why I released 2.4.0 when I did. Getting
> PageDirty right was the big step to make all of the VM stuff possible in
> the first place. Even if it probably looked a bit foolhardy to change the
> semantics of "writepage()" quite radically just before 2.4 was released).

On the topic of writepage, it's not symmetric with readpage at the
moment - it still takes (struct file *).  Is this in the cleanup
pipeline?  It looks like nfs_readpage already ignores the struct file *,
but maybe some other net filesystems are still depending on it.

--
Daniel

* Re: Subtle MM bug
  2001-01-09 19:09               ` Daniel Phillips
@ 2001-01-09 19:29                 ` Trond Myklebust
  2001-01-10 17:32                   ` Andi Kleen
  2001-01-09 19:37                 ` Linus Torvalds
  2001-01-17  8:46                 ` Rik van Riel
  2 siblings, 1 reply; 88+ messages in thread
From: Trond Myklebust @ 2001-01-09 19:29 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

>>>>> " " == Daniel Phillips <phillips@innominate.de> writes:

     > Linus Torvalds wrote:
    >> (This is why I worked so hard at getting the PageDirty
    >> semantics right in the last two months or so - and why I
    >> released 2.4.0 when I did. Getting PageDirty right was the big
    >> step to make all of the VM stuff possible in the first
    >> place. Even if it probably looked a bit foolhardy to change the
    >> semantics of "writepage()" quite radically just before 2.4 was
    >> released).

     > On the topic of writepage, it's not symmetric with readpage at
     > the moment - it still takes (struct file *).  Is this in the
     > cleanup pipeline?  It looks like nfs_readpage already ignores
     > the struct file *, but maybe some other net filesystems are
     > still depending on it.

NO! We definitely want to pass the struct file down to nfs_readpage()
when it's available.

Al has mentioned that he wants us to move towards a *BSD-like system
of credentials (i.e. struct ucred) that could be used here, but that's
in the far future. In the meantime, we cache RPC credentials in the
struct file...

Cheers,
  Trond

* Re: Subtle MM bug
  2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:29                 ` Trond Myklebust
@ 2001-01-09 19:37                 ` Linus Torvalds
  2001-01-17  8:46                 ` Rik van Riel
  2 siblings, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-09 19:37 UTC (permalink / raw)
  To: linux-kernel

In article <3A5B61F7.FB0E79C1@innominate.de>,
Daniel Phillips  <phillips@innominate.de> wrote:
>Linus Torvalds wrote:
>> (This is why I worked so hard at getting the PageDirty semantics right in
>> the last two months or so - and why I released 2.4.0 when I did. Getting
>> PageDirty right was the big step to make all of the VM stuff possible in
>> the first place. Even if it probably looked a bit foolhardy to change the
>> semantics of "writepage()" quite radically just before 2.4 was released).
>
>On the topic of writepage, it's not symmetric with readpage at the
>moment - it still takes (struct file *).  Is this in the cleanup
>pipeline?  It looks like nfs_readpage already ignores the struct file *,
>but maybe some other net filesystems are still depending on it.

readpage() is always a synchronous operation, and is actually much more
closely linked to "prepare_write()"/"commit_write()" than to writepage,
despite the naming similarities.

So no, the two are not symmetric, and they really shouldn't be. 

"readpage()" is for reading a page into the page cache, and is always
synchronous with the reader (even prefetching is "synchronous" in the
sense that it's done by the reader: it's asynchronous in the sense that
we don't wait for the results, but the _calling_ of readpage() is
synchronous, if you see what I mean).

Similarly, prepare_write() and commit_write() are synchronous to the
writer (again, we do not wait for the writes to have actually
_happened_, but we call the functions synchronously and they can choose
to let the actual IO happen asynchronously - the VM doesn't care about
that small detail). 

So "readpage()" and "prepare_write()/commit_write()" are pairs.  They
are different simply because reading is assumed to be a cacheable and
prefetchable operation (think regular CPU caches), while writing
obviously has to give a much stricter "write _these_ bytes, not the
whole cache line". 

In contrast, writepage() is a completely different animal. It's
basically a cache eviction notice, and happens asynchronously to any
operations that actually fill or dirty the cache. So despite the name,
as an operation it really has absolutely nothing in common with
readpage(), other than the fact that it is obviously supposed to do the
IO associated with the name.

Writepage has a friend in "sync_page()", which is another asynchronous
call-back that basically says "we want you to start your IO _now_". It's
similar to "writepage()" in that it's a kind of cache state
notification: while writepage() notifies that the cached page wants to
be evicted, "sync_page()" notifies that the cached page is waited upon
by somebody else and that we want to speed up any background IO on it.

You'll notice that writepage()/sync_page() share one calling
convention, while readpage()/prepare_write()/commit_write() share
another.

The one operation that _really_ stands out is "bmap()".  It has
absolutely no calling convention at all, and is not symmetric with
anything. Pretty ugly, but easily supported.
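
The grouping described above can be sketched as a small userspace toy in C. The struct and function names (toy_aops, toy_page, toy_file and friends) are invented here, and the split of arguments is an assumption drawn from the description in this message; the real declarations live in the kernel headers and should be checked there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace toy, not kernel code: every type here is a simplified
 * stand-in invented for this sketch. */
struct toy_file { int cred; };  /* caller context, e.g. credentials */

struct toy_page {
	bool uptodate;    /* data has been read into the cache */
	bool dirty;       /* cache holds data not yet written back */
	bool io_started;  /* IO on this page has been kicked off */
};

struct toy_aops {
	/* Called synchronously by the reader or writer. */
	int (*readpage)(struct toy_file *, struct toy_page *);
	int (*prepare_write)(struct toy_file *, struct toy_page *,
			     unsigned from, unsigned to);
	int (*commit_write)(struct toy_file *, struct toy_page *,
			    unsigned from, unsigned to);
	/* Cache-state notifications: no calling reader or writer. */
	int (*writepage)(struct toy_page *);
	int (*sync_page)(struct toy_page *);
};

int toy_readpage(struct toy_file *f, struct toy_page *p)
{
	(void)f;
	p->uptodate = true;   /* fill the cache on behalf of the reader */
	return 0;
}

int toy_prepare_write(struct toy_file *f, struct toy_page *p,
		      unsigned from, unsigned to)
{
	(void)f;
	(void)p;
	assert(from <= to);   /* the writer names an exact byte range */
	return 0;
}

int toy_commit_write(struct toy_file *f, struct toy_page *p,
		     unsigned from, unsigned to)
{
	(void)f;
	assert(from <= to);
	p->dirty = true;      /* bytes are in the cache; IO may come later */
	return 0;
}

int toy_writepage(struct toy_page *p)
{
	if (p->dirty) {       /* eviction notice: push the data out */
		p->io_started = true;
		p->dirty = false;
	}
	return 0;
}

int toy_sync_page(struct toy_page *p)
{
	p->io_started = true; /* somebody is waiting: start IO now */
	return 0;
}

const struct toy_aops toy_ops = {
	.readpage      = toy_readpage,
	.prepare_write = toy_prepare_write,
	.commit_write  = toy_commit_write,
	.writepage     = toy_writepage,
	.sync_page     = toy_sync_page,
};
```

The first three hooks take a caller context because somebody is actually calling them; the last two take only the page because they are triggered by the state of the cache rather than by a reader or writer.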

			Linus

* Re: Subtle MM bug
  2001-01-09 18:47             ` Linus Torvalds
  2001-01-09 19:09               ` Daniel Phillips
@ 2001-01-09 19:53               ` Simon Kirby
  2001-01-09 20:08                 ` Linus Torvalds
  2001-01-09 20:10                 ` Zlatko Calusic
  2001-01-10  1:45               ` David Woodhouse
  2 siblings, 2 replies; 88+ messages in thread
From: Simon Kirby @ 2001-01-09 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel

On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:

> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
> 
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.

Hmm, perhaps you could clarify...

For boxes that rarely ever use swap with 2.2, will they now need more
swap space on 2.4 to perform well, or just boxes which don't have enough
RAM to handle everything nicely?

I've always been tending to make swap partitions smaller lately, as it
helps in the case where we have to wait for a runaway process to eat up
all of the swap space before it gets killed.  Making the swap size
smaller speeds up the time it takes for this to happen, albeit something
which isn't supposed to happen anyway.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[       sim@stormix.com       ][       sim@netnation.com        ]
[ Opinions expressed are not necessarily those of my employers. ]

* Re: Subtle MM bug (really 830MB barrier question)
  2001-01-09  3:01       ` Subtle MM bug (really 830MB barrier question) Wayne Whitney
@ 2001-01-09 20:06         ` Szabolcs Szakacsits
  2001-01-09 23:45           ` Wayne Whitney
                             ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Szabolcs Szakacsits @ 2001-01-09 20:06 UTC (permalink / raw)
  To: Wayne Whitney; +Cc: LKML, William A. Stein, Dan Maas


On Tue, 9 Jan 2001, Dan Maas wrote:

> OK it's fairly obvious what's happening here. Your program is using
> its own allocator, which relies solely on brk() to obtain more
> memory.
[... good explanation here ...]
> Here's your short answer: ask the authors of your program to either
> 1) replace their custom allocator with regular malloc() or 2) enhance
> their custom allocator to use mmap. (or, buy some 64-bit hardware =)...)

3) ask the kernel developers to get rid of this limitation, which draws
both "brk hits the fixed start address of mmapped areas" complaints and,
the other way around, "mmapped areas should start at a lower address"
complaints. E.g. Solaris has a heap that grows up, an mmap area that
grows down, and a fixed-size stack at the top.

Wayne, the patch below should fix your barrier problem [1 GB physical
memory configuration]. I have used it only with 2.2 kernels. Your app should
complain about being out of memory at around 2.7 GB (0xb0000000-0x08??????),
but note that only 256 MB (0xc0000000-0xb0000000) is left for shared
libraries and mmapped areas.
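
The address arithmetic in the paragraph above can be checked with a few lines of C. This is only a sketch: brk_room() and mmap_room() are made-up helper names, the top of user space is taken as 0xc0000000 from the text, and the stack at the top is ignored. The heap start is a parameter because the message leaves its low digits unspecified.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TASK_SIZE 0xc0000000u  /* top of user space, from the text above */

/* Room for the brk() heap: from the heap start up to the mmap base. */
uint32_t brk_room(uint32_t heap_start, uint32_t unmapped_base)
{
	assert(heap_start <= unmapped_base);
	return unmapped_base - heap_start;
}

/* Room for mmap(): from the mmap base up to the top of user space.
 * This ignores the stack, which also lives near the top. */
uint32_t mmap_room(uint32_t unmapped_base)
{
	assert(unmapped_base <= TOY_TASK_SIZE);
	return TOY_TASK_SIZE - unmapped_base;
}
```

For example, mmap_room(0xb0000000u) is 0x10000000 bytes, the 256 MB mentioned above, while the unpatched base of TOY_TASK_SIZE / 3 (0x40000000) leaves 2 GB for mmap and correspondingly less for brk(). With a purely hypothetical heap start of 0x08000000, brk_room() for the patched base is 0xa8000000 bytes.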

Good luck,

	Szaka

--- linux-2.2.18/include/asm-i386/processor.h  Thu Dec 14 08:20:17 2000
+++ linux/include/asm-i386/processor.h	Tue Jan  9 17:50:49 2001
@@ -166,7 +166,7 @@
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE	(TASK_SIZE / 3)
+#define TASK_UNMAPPED_BASE	0xb0000000

 /*
  * Size of io_bitmap in longwords: 32 is ports 0-0x3ff.



* Re: Subtle MM bug
  2001-01-09 19:53               ` Simon Kirby
@ 2001-01-09 20:08                 ` Linus Torvalds
  2001-01-09 20:10                 ` Zlatko Calusic
  1 sibling, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-09 20:08 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel

On Tue, 9 Jan 2001, Simon Kirby wrote:
>
> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
> 
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> > 
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
> 
> Hmm, perhaps you could clarify...
> 
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?

If you don't have any swap, or if you run out of swap, the major
difference between 2.2.x and 2.4.x is probably going to be the oom
handling: I suspect that 2.4.x might be more likely to kill things off
sooner (but it tries to be graceful about which processes to kill).

Not having any swap is going to be a performance issue for both 2.2.x and
2.4.x - Linux likes to push inactive dirty pages out to swap where they
can lie around without bothering anybody, even if there is no _major_
memory crunch going on.

If you do have swap, but it's smaller than your available physical RAM, I
suspect that the Linux-2.4 swap pre-allocate may cause that kind of
performance degradation earlier than 2.2.x would have. Another way of
putting this: in 2.2.x you could use a fairly small swap partition to pick
up some of the slack, and in 2.4.x a really small swap-partition doesn't
really buy you much of anything.

> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed.  Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.

Yes, that kind of swap size tuning will still work in 2.4.x, but the sizes
you tune for would be different, I'm afraid. If you have, say, 128MB of
RAM, and you used to make a smallish partition of 64MB for "slop" in
2.2.x, I really suspect that you might like to increase it to 128MB or
196MB.

Of course, if you really only used your swap for "slop", I don't think
you'll necessarily notice the difference.

NOTE! The above guide-lines are pure guesses. The machines I use have had
big swap-partitions or none at all, so I think we'll just have to wait and
see.

			Linus


* Re: Subtle MM bug
  2001-01-09 19:53               ` Simon Kirby
  2001-01-09 20:08                 ` Linus Torvalds
@ 2001-01-09 20:10                 ` Zlatko Calusic
  1 sibling, 0 replies; 88+ messages in thread
From: Zlatko Calusic @ 2001-01-09 20:10 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Linus Torvalds, Eric W. Biederman, Rik van Riel, linux-kernel

Simon Kirby <sim@stormix.com> writes:

> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
> 
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> > 
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
> 
> Hmm, perhaps you could clarify...
> 
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?
>

Just boxes that were already short on memory (swapped a lot) will need
more swap, empirically up to 4 times as much. If you already had
enough memory, then things will stay almost the same for you.

But anyway, after some testing I've done recently, I would now
recommend that nobody have a swap partition smaller than 2 x RAM size.

> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed.  Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.
> 

Well, if you continue with that practice now you will be even more
successful in killing such processes, I would say. :)
-- 
Zlatko

* Re: Subtle MM bug (really 830MB barrier question)
  2001-01-09 20:06         ` Szabolcs Szakacsits
@ 2001-01-09 23:45           ` Wayne Whitney
  2001-01-11  0:03           ` Wayne Whitney
  2001-01-11  2:46           ` [2.4.0 pre-PATCH] 830MB barrier (was: Subtle MM bug) Wayne Whitney
  2 siblings, 0 replies; 88+ messages in thread
From: Wayne Whitney @ 2001-01-09 23:45 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML, William A. Stein, Dan Maas

On Tue, 9 Jan 2001, Szabolcs Szakacsits wrote:

> Wayne, the patch below should fix your barrier problem [1 GB physical
> memory configuration].

First, I just wanted to thank you and everyone else (Linus, Andrea, Dan
Maas, Rik and others) who have responded to my emails.  You guys are
wonderful!

On Tue, 9 Jan 2001, Dan Maas wrote:

> OK it's fairly obvious what's happening here. Your program is using
> its own allocator, which relies solely on brk() to obtain more memory.

OK, the statically linked binary I have produces a simpler /proc/pid/maps,
here it is (before I actually try to create any big objects in memory):

08048000-08afb000 r-xp 00000000 03:07 64318      /usr/local/magma-2.7/magma.exe
08afb000-08c3e000 rw-p 00ab2000 03:07 64318      /usr/local/magma-2.7/magma.exe
08c3e000-0bc54000 rwxp 00000000 00:00 0
40000000-40001000 rw-p 00000000 00:00 0
bfffd000-c0000000 rwxp ffffe000 00:00 0

If I understand correctly, the first two lines are the executable
(although I don't know why it shows up twice), the third line is the heap
for this program, the fourth line is where mmap stuff starts and the fifth
line is the boundary between the process address space and the kernel
address space.
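
To put numbers on a listing like the one above, the leading fields of each line can be parsed with sscanf(). This is a minimal sketch; toy_region, parse_maps_line() and region_size() are names made up for the example, and only the "start-end perms" prefix of a line is read.

```c
#include <assert.h>
#include <stdio.h>

struct toy_region {
	unsigned long start;
	unsigned long end;
	char perms[5];       /* e.g. "rwxp" plus the terminating NUL */
};

/* Parse the leading "start-end perms" fields of one /proc/<pid>/maps
 * line. Returns 0 on success, -1 if the line does not match. */
int parse_maps_line(const char *line, struct toy_region *r)
{
	assert(line != NULL && r != NULL);
	if (sscanf(line, "%lx-%lx %4s", &r->start, &r->end, r->perms) != 3)
		return -1;
	return r->end >= r->start ? 0 : -1;
}

/* Size of the region in bytes. */
unsigned long region_size(const struct toy_region *r)
{
	return r->end - r->start;
}
```

Applied to the third line of the listing, region_size() gives 0x03016000 bytes (about 48MB), and 0x40000000 minus the region's end is 0x343ac000 bytes, roughly 835MB of room before the heap would reach the mapping at 40000000.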

First question:  for this statically linked binary, nothing is really
being mmap'ed, is there any way that I can arrange, for this process only,
to get rid of the fourth line?  This would be the ideal solution.

Szabolcs's suggestion (and Mark Hahn's privately, as well) of modifying
TASK_UNMAPPED_BASE does work for me.  Unfortunately, on the same machine
I'd like to both run programs that use brk() allocation and that use
mmap() allocation, so the best I can do is change TASK_UNMAPPED_BASE to
1.5GB from 1GB; this allows a bit under 1.5GB for brk() and 1.5GB for
mmap().

Second question:  if I understand correctly, the start of the kernel
process space is controlled by PAGE_OFFSET, and under CONFIG_NOHIGHMEM,
the kernel maps all the physical RAM into its address space.  I guess
128MB of this space is used for the 2.4.0 kernel itself, so that the
maximum physical RAM under 2.4.0 with PAGE_OFFSET set to 3GB is 896MB.
One of my machines has 1024MB of RAM, so can I just decrease PAGE_OFFSET
by 128MB to be able to use all of it?
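
The arithmetic behind this question can be written down as a small C helper. It is a sketch under the assumptions stated in the paragraph above (a 32-bit address space, a kernel window running from PAGE_OFFSET to 4GB, and 128MB of that window set aside); toy_max_lowmem_mb() is a made-up name, and the sketch says nothing about whether a kernel built with a different PAGE_OFFSET actually works.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024u * 1024u)

/* Directly mappable RAM in MB for a 32-bit address space, assuming the
 * kernel's window runs from page_offset to 4GB and reserve_mb of it is
 * set aside (128MB in the question above). */
uint32_t toy_max_lowmem_mb(uint32_t page_offset, uint32_t reserve_mb)
{
	uint64_t window = (UINT64_C(1) << 32) - page_offset;
	uint32_t window_mb = (uint32_t)(window / MB);

	assert(window_mb >= reserve_mb);
	return window_mb - reserve_mb;
}
```

With page_offset at 0xc0000000 this yields 896, and with it lowered by 128MB to 0xb8000000 it yields 1024, so the numbers in the question are at least self-consistent.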

Thanks again,
Wayne


* Re: Subtle MM bug
  2001-01-09 18:47             ` Linus Torvalds
  2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:53               ` Simon Kirby
@ 2001-01-10  1:45               ` David Woodhouse
  2001-01-10  2:26                 ` Andrea Arcangeli
  2001-01-10  6:57                 ` Linus Torvalds
  2 siblings, 2 replies; 88+ messages in thread
From: David Woodhouse @ 2001-01-10  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
>
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.

How does this affect embedded systems with no swap space at all?

-- 
dwmw2



* Re: Subtle MM bug
  2001-01-10  1:45               ` David Woodhouse
@ 2001-01-10  2:26                 ` Andrea Arcangeli
  2001-01-10  6:57                 ` Linus Torvalds
  1 sibling, 0 replies; 88+ messages in thread
From: Andrea Arcangeli @ 2001-01-10  2:26 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
	linux-kernel

On Wed, Jan 10, 2001 at 01:45:47AM +0000, David Woodhouse wrote:
> How does this affect embedded systems with no swap space at all?

If there's no swap the swap-cache dirty-sticky issue can't arise.

Andrea

* Re: Subtle MM bug
  2001-01-10  1:45               ` David Woodhouse
  2001-01-10  2:26                 ` Andrea Arcangeli
@ 2001-01-10  6:57                 ` Linus Torvalds
  2001-01-10 11:46                   ` David Woodhouse
  1 sibling, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2001-01-10  6:57 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel



On Wed, 10 Jan 2001, David Woodhouse wrote:
> 
> How does this affect embedded systems with no swap space at all?

The no-swap behaviour should actually be pretty much identical, simply
because both 2.2 and 2.4 will do the same thing: just skip dirty pages in
the page tables because they cannot do anything about them.

That said, the _other_ VM differences in 2.4.x may obviously make a
difference, just not the sticky swap cache one..

		Linus


* Re: Subtle MM bug
  2001-01-10  6:57                 ` Linus Torvalds
@ 2001-01-10 11:46                   ` David Woodhouse
  2001-01-10 14:56                     ` Andrea Arcangeli
  2001-01-10 17:03                     ` Linus Torvalds
  0 siblings, 2 replies; 88+ messages in thread
From: David Woodhouse @ 2001-01-10 11:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel


torvalds@transmeta.com said:
>  The no-swap behaviour should actually be pretty much identical,
> simply because both 2.2 and 2.4 will do the same thing: just skip
> dirty pages in the page tables because they cannot do anything about
> them. 

So the VM code spends a fair amount of time scanning lists of pages which 
it really can't do anything about?

Would it be possible to put such pages on different list, so that the VM
code doesn't have to keep skipping them?

(forgive me if I'm displaying my utter ignorance of the VM code here)

--
dwmw2



* Re: Subtle MM bug
  2001-01-10 11:46                   ` David Woodhouse
@ 2001-01-10 14:56                     ` Andrea Arcangeli
  2001-01-10 17:46                       ` Eric W. Biederman
  2001-01-10 17:03                     ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Andrea Arcangeli @ 2001-01-10 14:56 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
	linux-kernel

On Wed, Jan 10, 2001 at 11:46:03AM +0000, David Woodhouse wrote:
> So the VM code spends a fair amount of time scanning lists of pages which 
> it really can't do anything about?

Yes.

> Would it be possible to put such pages on different list, so that the VM

Currently to unmap the other pages we have to waste time on those unfreeable
pages as well.

Once I or another developer finishes the reverse lookup from page to
pte-chain (an implementation from DaveM already exists) we'll be able to put them
in a separate lru, but it's certainly not a 2.4.1-pre2 thing.
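
For readers unfamiliar with the idea, a reverse lookup from page to pte-chain can be pictured roughly as follows. This is a hypothetical userspace sketch in C, not DaveM's implementation and not kernel code; toy_page, toy_pte_chain, toy_rmap_add() and toy_rmap_unmap_all() are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration only. */
typedef unsigned long toy_pte_t;

struct toy_pte_chain {
	toy_pte_t *pte;               /* one page-table entry mapping the page */
	struct toy_pte_chain *next;
};

struct toy_page {
	struct toy_pte_chain *chain;  /* every pte that maps this page */
};

/* Record that *pte now maps page. Returns 0, or -1 if out of memory. */
int toy_rmap_add(struct toy_page *page, toy_pte_t *pte)
{
	struct toy_pte_chain *node;

	assert(page != NULL && pte != NULL);
	node = malloc(sizeof(*node));
	if (node == NULL)
		return -1;
	node->pte = pte;
	node->next = page->chain;
	page->chain = node;
	return 0;
}

/* Unmap the page everywhere by walking its chain instead of scanning
 * page tables. Returns how many ptes were cleared. */
size_t toy_rmap_unmap_all(struct toy_page *page)
{
	size_t n = 0;

	assert(page != NULL);
	while (page->chain != NULL) {
		struct toy_pte_chain *node = page->chain;

		*node->pte = 0;
		page->chain = node->next;
		free(node);
		n++;
	}
	return n;
}
```

The point of the structure is that work starts from the page, so pages that cannot be freed never need to be visited; the price is one chain node per mapping.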

Andrea

* Re: Subtle MM bug
  2001-01-10 11:46                   ` David Woodhouse
  2001-01-10 14:56                     ` Andrea Arcangeli
@ 2001-01-10 17:03                     ` Linus Torvalds
  2001-01-11 14:36                       ` Jim Gettys
  1 sibling, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2001-01-10 17:03 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Zlatko Calusic, Eric W. Biederman, Rik van Riel, linux-kernel

On Wed, 10 Jan 2001, David Woodhouse wrote:

> 
> torvalds@transmeta.com said:
> >  The no-swap behaviour should actually be pretty much identical,
> > simply because both 2.2 and 2.4 will do the same thing: just skip
> > dirty pages in the page tables because they cannot do anything about
> > them. 
> 
> So the VM code spends a fair amount of time scanning lists of pages which 
> it really can't do anything about?

It can do _tons_ of stuff.

Remember, on platforms like this, one of the reasons for being low on
memory is things like running X and netscape: maybe you have 64MB of RAM
and you don't think you need a swap device, and you want to have a web
browser.

The fact that we cannot touch _dirty_ pages doesn't mean that there's
nothing to do: instead of running out of memory we can at least make the
machine usable by dropping the text pages and the page cache..

> Would it be possible to put such pages on different list, so that the VM
> code doesn't have to keep skipping them?

If we don't have any swapspace, the dirty pages will not be on any lists:
they will never have exited the page tables, and they will just be dirty
anonymous, unlisted pages.

We'll still scan the page tables (and see them there), but we have to do
that to find the clean and unreferenced pages - we don't have separate
"dirty page tables" and "clean page tables" ;)

		Linus
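
The page-table scan described in this message can be modelled as a toy in C. This is a sketch under simplifying assumptions, not the kernel's scanning code: toy_pte and toy_scan() are invented names, and each entry carries only the three bits the discussion mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_pte {
	bool present;     /* entry maps a page */
	bool dirty;       /* page was written to */
	bool referenced;  /* page was used since the last scan */
};

/* One pass over a toy page table. Returns the number of entries
 * unmapped in this pass. */
size_t toy_scan(struct toy_pte *pt, size_t n, bool have_swap)
{
	size_t unmapped = 0;

	assert(pt != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct toy_pte *pte = &pt[i];

		if (!pte->present)
			continue;
		if (pte->referenced) {
			pte->referenced = false;  /* recently used: age it, keep it */
			continue;
		}
		if (pte->dirty && !have_swap)
			continue;                 /* nowhere to put it: skip */
		pte->present = false;  /* clean: dropped; dirty: handed to swap */
		unmapped++;
	}
	return unmapped;
}
```

Without swap the dirty entries are stepped over on every pass, but the same walk is still what finds the clean, unreferenced entries that can be dropped.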


* Re: Subtle MM bug
  2001-01-09 19:29                 ` Trond Myklebust
@ 2001-01-10 17:32                   ` Andi Kleen
  2001-01-10 19:31                     ` Alan Cox
  0 siblings, 1 reply; 88+ messages in thread
From: Andi Kleen @ 2001-01-10 17:32 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Daniel Phillips, Linus Torvalds, linux-kernel

On Tue, Jan 09, 2001 at 08:29:02PM +0100, Trond Myklebust wrote:
> Al has mentioned that he wants us to move towards a *BSD-like system
> of credentials (i.e. struct ucred) that could be used here, but that's
> in the far future. In the meantime, we cache RPC credentials in the
> struct file...

struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
credentials between threads, but still keeping system calls atomic in
relation to credential changes) 


-Andi (who doesn't want to know how many security holes are in linux ported
programs using threads and set*id() because of that) 


* Re: Subtle MM bug
  2001-01-10 14:56                     ` Andrea Arcangeli
@ 2001-01-10 17:46                       ` Eric W. Biederman
  2001-01-10 18:33                         ` Andrea Arcangeli
  2001-01-10 19:03                         ` Linus Torvalds
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2001-01-10 17:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Woodhouse, Linus Torvalds, Zlatko Calusic,
	Eric W. Biederman, Rik van Riel, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Wed, Jan 10, 2001 at 11:46:03AM +0000, David Woodhouse wrote:
> > So the VM code spends a fair amount of time scanning lists of pages which 
> > it really can't do anything about?
> 
> Yes.
> 
> > Would it be possible to put such pages on different list, so that the VM
> 
> Currently to unmap the other pages we have to waste time on those unfreeable
> pages as well.
> 
> Once I or another developer finishes the reverse lookup from page to
> pte-chain (an implementation from DaveM already exists) we'll be able to put them
> in a separate lru, but it's certainly not a 2.4.1-pre2 thing.

Why do we even want to do reverse page tables?
It seems everyone is assuming this is a good thing and except for being
a touch more flexible I don't see what this buys us (besides more locked memory).

My impression with the MM stuff is that everyone except linux is
trying hard to clone BSD instead of thinking through the issues
ourselves.

And because of the extra overhead this doesn't look to be a win on a
heavily loaded box with no swap.  And probably only glibc mmaped.

Eric

* Re: Subtle MM bug
  2001-01-10 17:46                       ` Eric W. Biederman
@ 2001-01-10 18:33                         ` Andrea Arcangeli
  2001-01-17 14:26                           ` Rik van Riel
  2001-01-10 19:03                         ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Andrea Arcangeli @ 2001-01-10 18:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Woodhouse, Linus Torvalds, Zlatko Calusic, Rik van Riel,
	linux-kernel

On Wed, Jan 10, 2001 at 10:46:07AM -0700, Eric W. Biederman wrote:
> Why do we even want to do reverse page tables?
> It seems everyone is assuming this is a good thing and except for being

I'm not assuming it's a good thing, but I believe it's something to try.

> My impression with the MM stuff is that everyone except linux is
> trying hard to clone BSD instead of thinking through the issues
> ourselves.

I wasn't even thinking about BSD and I always thought about the issues myself,
no panic ;).

> And because of the extra overhead this doesn't look to be a win on a
> heavily loaded box with no swap.  And probably only glibc mmaped.

It can make sense also without swap. We could drop clean pages from the lru
directly that way without wasting time on pages that we don't have a chance to
free (incidentally it's exactly the optimization requested by David W. for
embedded systems).  Note that I'm not convinced that it would be worthwhile to
separate the anonymous and shm pages from the other mapped pages but in theory
we could do that.

I didn't mean that it is certainly the right way to go, but with reverse
lookup we could do very ""interesting"" things and I think it's worthwhile to
research and benchmark what happens (note also that depending on the
implementation very different things can happen at runtime)

Andrea

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 17:46                       ` Eric W. Biederman
  2001-01-10 18:33                         ` Andrea Arcangeli
@ 2001-01-10 19:03                         ` Linus Torvalds
  2001-01-10 19:27                           ` David S. Miller
                                             ` (2 more replies)
  1 sibling, 3 replies; 88+ messages in thread
From: Linus Torvalds @ 2001-01-10 19:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrea Arcangeli, David Woodhouse, Zlatko Calusic, Rik van Riel,
	linux-kernel



On 10 Jan 2001, Eric W. Biederman wrote:

> Andrea Arcangeli <andrea@suse.de> writes:
> > 
> > Once I or other developer finishes with the reverse lookup from page to
> > pte-chain (an implementation from DaveM just exists) we'll be able to put them
> > in a separate lru, but it's certainly not a 2.4.1-pre2 thing.
> 
> Why do we even want to do reverse page tables?

We don't.

But it does come up every once in a while, and it will probably continue
to do so.

I looked at it a year or two ago myself, and came to the conclusion that I
don't want to blow up our page table size by a factor of three or more, so
I'm not personally interested any more. Maybe somebody else comes up with
a better way to do it, or with a really compelling reason to.
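To put a rough number on the "factor of three": the sketch below assumes a 32-bit i386 PTE of 4 bytes and a hypothetical reverse-mapping chain entry holding two 32-bit pointers per mapping. The struct `pte_chain32` and both helper names are made up for illustration; they are not taken from any implementation discussed in this thread, and real designs may cost more or less.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a 32-bit i386 PTE: one 4-byte word per mapped page. */
typedef uint32_t pte32_t;

/*
 * Hypothetical reverse-mapping chain entry, one per PTE.  Pointers are
 * modelled as 32-bit values so the sizes match i386 on any build host.
 * This layout is an assumption, not anyone's actual patch.
 */
struct pte_chain32 {
	uint32_t next;	/* next mapping of the same physical page */
	uint32_t ptep;	/* address of the PTE that maps the page */
};

/* Bookkeeping bytes per mapping: forward PTE plus one reverse entry. */
static size_t bytes_per_mapping(void)
{
	return sizeof(pte32_t) + sizeof(struct pte_chain32);
}

/* Growth relative to forward page tables alone. */
static size_t growth_factor(void)
{
	assert(bytes_per_mapping() % sizeof(pte32_t) == 0);
	return bytes_per_mapping() / sizeof(pte32_t);
}
```

Under these assumptions each mapping costs 12 bytes instead of 4, which matches the lower end of the estimate above; any extra field per entry pushes the factor higher.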

"Feel free to try" is definitely the open source motto.

		Linus


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:03                         ` Linus Torvalds
@ 2001-01-10 19:27                           ` David S. Miller
  2001-01-10 19:36                           ` Alan Cox
  2001-01-17 14:28                           ` Rik van Riel
  2 siblings, 0 replies; 88+ messages in thread
From: David S. Miller @ 2001-01-10 19:27 UTC (permalink / raw)
  To: torvalds; +Cc: ebiederm, andrea, dwmw2, zlatko, riel, linux-kernel

   Date: 	Wed, 10 Jan 2001 11:03:21 -0800 (PST)
   From: Linus Torvalds <torvalds@transmeta.com>

   "Feel free to try" is definitely the open source motto.

I basically came to the conclusion that it sucks when I
gave it a go.

In my scheme I tried to save space by using very small descriptors to
keep track of anonymous areas in processes.  This was essentially a
vma->vm_anon pointer that kept track of the pages for you.

After trying to fight this for a few days I determined that this
doesn't work at all because of how COW dups the pages around on you.
Also it was a devil to work out anonymous pages created due to writes
to private mmaps of a file, as soon as one of these were made for the
first time on a vma you had to cook up one of the anon descriptors.

Yeah, I got the anon descriptor down to 2 pointers and an atomic
counter, but it didn't work so this achievement was worthless :-)

There are a few approaches that work, but they tend to take up too
much space to be worth considering, as Linus mentioned.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 17:32                   ` Andi Kleen
@ 2001-01-10 19:31                     ` Alan Cox
  2001-01-10 19:33                       ` Andi Kleen
  2001-01-10 20:11                       ` Linus Torvalds
  0 siblings, 2 replies; 88+ messages in thread
From: Alan Cox @ 2001-01-10 19:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Trond Myklebust, Daniel Phillips, Linus Torvalds, linux-kernel

> struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
> credentials between threads, but still keeping system calls atomic in
> relation to credential changes) 

That is extremely undesirable behaviour. setuid() changes for pthreads crud
should be done by the library emulation layer. Many people have very real
and very good reasons for running multiple parallel ids. Just try writing
a threaded ftp daemon (non anonymous) without that, or an nfs server


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:31                     ` Alan Cox
@ 2001-01-10 19:33                       ` Andi Kleen
  2001-01-10 19:40                         ` Alan Cox
  2001-01-10 20:11                       ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Andi Kleen @ 2001-01-10 19:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
	linux-kernel

On Wed, Jan 10, 2001 at 07:31:52PM +0000, Alan Cox wrote:
> > struct ucred is also needed to get LinuxThreads POSIX compliant (sharing
> > credentials between threads, but still keeping system calls atomic in
> > relation to credential changes) 
> 
> That is extremely undesirable behaviour. setuid() changes for pthreads crud
> should be done by the library emulation layer. Many people have very real
> and very good reasons for running multiple parallel ids. Just try writing
> a threaded ftp daemon (non anonymous) without that, or an nfs server

Of course not by default, it would be a new clone flag (with default to on in
linuxthreads though, to not cause security holes in ported programs like today) 


-Andi

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:03                         ` Linus Torvalds
  2001-01-10 19:27                           ` David S. Miller
@ 2001-01-10 19:36                           ` Alan Cox
  2001-01-10 23:56                             ` David Weinehall
  2001-01-17 14:28                           ` Rik van Riel
  2 siblings, 1 reply; 88+ messages in thread
From: Alan Cox @ 2001-01-10 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Andrea Arcangeli, David Woodhouse,
	Zlatko Calusic, Rik van Riel, linux-kernel

> I looked at it a year or two ago myself, and came to the conclusion that I
> don't want to blow up our page table size by a factor of three or more, so
> I'm not personally interested any more. Maybe somebody else comes up with
> a better way to do it, or with a really compelling reason to.

There is only one reason I know for reverse page tables. That is ARM2/ARM3 
support - which is still not fully merged because of this issue

The MMU on these systems is a CAM, and the mmu table is thus backwards to
convention. (It also means you can notionally map two physical addresses to
one virtual but that's undefined in the implementation ;))



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:33                       ` Andi Kleen
@ 2001-01-10 19:40                         ` Alan Cox
  2001-01-10 19:43                           ` Andi Kleen
  0 siblings, 1 reply; 88+ messages in thread
From: Alan Cox @ 2001-01-10 19:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
	Linus Torvalds, linux-kernel

> Of course not by default, it would be a new clone flag (with default to on in
> linuxthreads though, to not cause security holes in ported programs like today) 

I've seen exactly nil cases where there are any security holes in apps caused
by that pthreads API non-adherence. There are also far too many overheads
imposed by implementing something in kernel space that is nearly useless,
not needed for any application 99.9999% of users (possibly 100%) have and can
be done just as well in the pthreads library glue - where it will only be
a penalty to pthread using apps.

Making everyone suffer for a bad standard corner case is bad. Especially when
the 'security hole' is pure FUD


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:40                         ` Alan Cox
@ 2001-01-10 19:43                           ` Andi Kleen
  2001-01-10 19:48                             ` Alan Cox
  0 siblings, 1 reply; 88+ messages in thread
From: Andi Kleen @ 2001-01-10 19:43 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
	linux-kernel

On Wed, Jan 10, 2001 at 07:40:49PM +0000, Alan Cox wrote:
> > Of course not by default, it would be a new clone flag (with default to on in
> > linuxthreads though, to not cause security holes in ported programs like today) 
> 
> I've seen exactly nil cases where there are any security holes in apps caused
> by that pthreads API non-adherence. There are also far too many overheads
> imposed by implementing something in kernel space that is nearly useless,
> not needed for any application 99.9999% of users (possibly 100%) have and can
> be done just as well in the pthreads library glue - where it will only be
> a penalty to pthread using apps.

I have not seen a good way to implement it in user space yet.

> Making everyone suffer for a bad standard corner case is bad. Especially when
> the 'security hole' is pure FUD
>
As the thread started it's not only needed for pthreads, but also for NFS
and setuid (actually NFS already implements it privately), and probably other network
file systems too.  So it's far from being only a "bad standard corner case". 


-Andi

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:43                           ` Andi Kleen
@ 2001-01-10 19:48                             ` Alan Cox
  2001-01-10 19:48                               ` Andi Kleen
  2001-01-11  9:51                               ` Trond Myklebust
  0 siblings, 2 replies; 88+ messages in thread
From: Alan Cox @ 2001-01-10 19:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
	Linus Torvalds, linux-kernel

> As the thread started it's not only needed for pthreads, but also for NFS
> and setuid (actually NFS already implements it privately), and probably other network
> file systems too.  So it's far from being only a "bad standard corner case". 

I wonder how Linux 2.2 worked, that doesn't have them. Now if it's a clean way
of sorting out a pile of other things and it does pthreads as a side effect
I've no problem, but arguing for it because of a tiny pthreads corner case
is coming from the wrong end

Alan


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:48                             ` Alan Cox
@ 2001-01-10 19:48                               ` Andi Kleen
  2001-01-11  9:51                               ` Trond Myklebust
  1 sibling, 0 replies; 88+ messages in thread
From: Andi Kleen @ 2001-01-10 19:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, Linus Torvalds,
	linux-kernel

On Wed, Jan 10, 2001 at 07:48:04PM +0000, Alan Cox wrote:
> > As the thread started it's not only needed for pthreads, but also for NFS
> > and setuid (actually NFS already implements it privately), and probably other network
> > file systems too.  So it's far from being only a "bad standard corner case". 
> 
> I wonder how Linux 2.2 worked, that doesn't have them. Now if it's a clean way
> of sorting out a pile of other things and it does pthreads as a side effect

Linux 2.2 setuid in nfs never worked quite like traditional Unix, and there
were lots of reports because users were regularly rediscovering it.

I think the nfs patches merged in 2.2.18 fixed it (?) 

> I've no problem, but arguing for it because of a tiny pthreads corner case
> is coming from the wrong end

I'm not so sure the thread corner case is that tiny. 

-Andi


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
@ 2001-01-10 19:57 Chris Wing
  0 siblings, 0 replies; 88+ messages in thread
From: Chris Wing @ 2001-01-10 19:57 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Alan:

> I've seen exactly nil cases where there are any security holes in apps caused
> by that pthreads API non-adherence.

I don't know of any exploitable bugs that were found in it, but the identd
server included in Red Hat 6.1 (pidentd 3.0.10) unintentionally ran as
root instead of nobody because its programmer used pthreads and assumed
that setuid() would affect all threads.

I pointed this out to the author and Red Hat, and it was fixed in
pidentd 3.0.11 and Red Hat 6.2.

-Chris Wing
wingc@engin.umich.edu


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:31                     ` Alan Cox
  2001-01-10 19:33                       ` Andi Kleen
@ 2001-01-10 20:11                       ` Linus Torvalds
  2001-01-11 12:56                         ` Stephen C. Tweedie
  1 sibling, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2001-01-10 20:11 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, Trond Myklebust, Daniel Phillips, linux-kernel



On Wed, 10 Jan 2001, Alan Cox wrote:
> 
> That is extremely undesirable behaviour. setuid() changes for pthreads crud
> should be done by the library emulation layer. Many people have very real
> and very good reasons for running multiple parallel ids. Just try writing
> a threaded ftp daemon (non anonymous) without that, or an nfs server

I absolutely think that "one thread, one ID" is the way to go.

That said, we can easily support the notion of CLONE_CRED if we absolutely
have to (and sane people just shouldn't use it), so if somebody wants to
work on this for 2.5.x...

		Linus


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:36                           ` Alan Cox
@ 2001-01-10 23:56                             ` David Weinehall
  2001-01-11  0:24                               ` Alan Cox
  2001-01-12  5:56                               ` Ralf Baechle
  0 siblings, 2 replies; 88+ messages in thread
From: David Weinehall @ 2001-01-10 23:56 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

On Wed, Jan 10, 2001 at 07:36:43PM +0000, Alan Cox wrote:
> > I looked at it a year or two ago myself, and came to the conclusion that I
> > don't want to blow up our page table size by a factor of three or more, so
> > I'm not personally interested any more. Maybe somebody else comes up with
> > a better way to do it, or with a really compelling reason to.
> 
> There is only one reason I know for reverse page tables. That is ARM2/ARM3 
> support - which is still not fully merged because of this issue
> 
> The MMU on these systems is a CAM, and the mmu table is thus backwards to
> convention. (It also means you can notionally map two physical addresses to
> one virtual but that's undefined in the implementation ;))

Are there any other (not yet supported) platforms with similar (or other
unrelated, but hard to support because of the current architecture of
the kernel) problems?

(No, I have no secret trumps up my sleeve, I'm just curious.)


/David
  _                                                                 _
 // David Weinehall <tao@acc.umu.se> /> Northern lights wander      \\
//  Project MCA Linux hacker        //  Dance across the winter sky //
\>  http://www.acc.umu.se/~tao/    </   Full colour fire           </

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug (really 830MB barrier question)
  2001-01-09 20:06         ` Szabolcs Szakacsits
  2001-01-09 23:45           ` Wayne Whitney
@ 2001-01-11  0:03           ` Wayne Whitney
  2001-01-11  2:46           ` [2.4.0 pre-PATCH] 830MB barrier (was: Subtle MM bug) Wayne Whitney
  2 siblings, 0 replies; 88+ messages in thread
From: Wayne Whitney @ 2001-01-11  0:03 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML, William A. Stein, Dan Maas

On Tue, 9 Jan 2001, Szabolcs Szakacsits wrote:

> 3) ask kernel developers to get rid of this "brk hits the fixed start
> address of mmapped areas" or the other way around complaints "mmapped
> area should start at lower address" limitation. E.g. Solaris does
> growing up heap, growing down mmap and fixed size stack at the top.

OK, despite knowing nothing of the kernel internals, I looked at doing
this myself :-)

I notice that TASK_UNMAPPED_BASE is only used in get_unmapped_area() in
mm/mmap.c, which is encouraging.  Moreover, get_unmapped_area() is only
called once in mm/mmap.c and once in mm/mremap.c.  So I think I would only
have to change get_unmapped_area() to get the desired effect, and this
change should not affect anything else.

If no address is specified, get_unmapped_area() currently chooses the
first (large enough, unused) region above TASK_UNMAPPED_BASE.  I guess I
would just have to define something like TASK_UNMAPPED_CEILING and arrange
for get_unmapped_area() to allocate the first region below
TASK_UNMAPPED_CEILING.  And I guess TASK_UNMAPPED_CEILING should equal
TASK_SIZE - maximum stack size.  What is the maximum stack size?  I
couldn't quite figure this out myself.

Am I missing something, or should choosing an appropriate value for
TASK_UNMAPPED_CEILING and changing get_unmapped_area() be sufficient?

Cheers,
Wayne


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 23:56                             ` David Weinehall
@ 2001-01-11  0:24                               ` Alan Cox
  2001-01-12  5:56                               ` Ralf Baechle
  1 sibling, 0 replies; 88+ messages in thread
From: Alan Cox @ 2001-01-11  0:24 UTC (permalink / raw)
  To: David Weinehall
  Cc: Alan Cox, Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

> > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > convention. (It also means you can notionally map two physical addresses to
> > one virtual but that's undefined in the implementation ;))
> 
> Are there any other (not yet supported) platforms with similar (or other
> unrelated, but hard to support because of the current architecture of
> the kernel) problems?

I believe it's uniquely deranged. There are people who have asked for reverse
tables for other purposes (eg cache flush handling) but their mmu is the normal
way around.

Alan


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [2.4.0 pre-PATCH] 830MB barrier (was: Subtle MM bug)
  2001-01-09 20:06         ` Szabolcs Szakacsits
  2001-01-09 23:45           ` Wayne Whitney
  2001-01-11  0:03           ` Wayne Whitney
@ 2001-01-11  2:46           ` Wayne Whitney
  2 siblings, 0 replies; 88+ messages in thread
From: Wayne Whitney @ 2001-01-11  2:46 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: LKML, William A. Stein, Dan Maas

On Tue, 9 Jan 2001, Szabolcs Szakacsits wrote:

> 3) ask kernel developers to get rid of this "brk hits the fixed start
> address of mmapped areas" or the other way around complaints "mmapped
> area should start at lower address" limitation. E.g. Solaris does
> growing up heap, growing down mmap and fixed size stack at the top.

OK, I attempted this myself, it boots fine and seems to work!

The basic idea is to define TASK_UNMAPPED_CEILING in
include/asm-i386/processor.h, and then modify get_unmapped_area() in
mm/mmap.c to search downwards from TASK_UNMAPPED_CEILING, rather than
upwards from TASK_UNMAPPED_BASE. But this is my first ever linux kernel
patch (included below), so it has several rough edges:

(1) I don't have any idea what happens on architectures other than i386,
so if TASK_UNMAPPED_CEILING is undefined, get_unmapped_area() behaves as
before.  I used #ifdef to do this, I don't know if that is considered
proper style.

(2) I have no idea how much room to allow for the stack on i386, so I
arbitrarily picked 128MB.  Is this way overkill?

(3) The (original) search upwards version of get_unmapped_area only calls
find_vma once, and then it uses vmm->vm_next in the main loop to get the
next vm_area_struct.  My search downwards version calls find_vma once
every loop, as I don't understand the connectivity of the vma_structs, but
find_vma is documented.

Is the overhead of calling find_vma every time a problem?  If so, is there
a better way of doing this without changing the vm_area_struct structure
to be a doubly linked list?

(4) Lastly, I terminate the downwards search at TASK_UNMAPPED_BASE, but
this is just wrong, as part of the point is that we should allow mmap's to
go all the way down to the top of the heap.  What is the correct
termination condition?  If the heap is just another vm_area_struct, then I
would think it would be checking against where the executable is loaded,
is there a kernel macro for this?

Any comments would be extremely welcome!!  As is, it serves my purposes
(allowing a non mmap'ing program to brk() up to TASK_SIZE), but I'd be
happy to "do it right".

Cheers,
Wayne

diff -ru -x *.o -x .config linux-2.4.0-pizza/include/asm-i386/processor.h linux-2.4.0-hack/include/asm-i386/processor.h
--- linux-2.4.0-pizza/include/asm-i386/processor.h	Sat Jan  6 16:46:21 2001
+++ linux-2.4.0-hack/include/asm-i386/processor.h	Wed Jan 10 17:22:30 2001
@@ -260,10 +260,20 @@
  */
 #define TASK_SIZE	(PAGE_OFFSET)

-/* This decides where the kernel will search for a free chunk of vm
- * space during mmap's.
+/*
+ * When looking for a free chunk of vm space during mmap's, the kernel
+ * will search upwards from TASK_UNMAPPED_BASE, unless
+ * TASK_UNMAPPED_CEILING is defined, in which case it will search
+ * downwards from that address.
  */
 #define TASK_UNMAPPED_BASE	(TASK_SIZE / 3)
+
+/*
+ * We need to allow room for the stack to grow downward from TASK_SIZE,
+ * I really have no idea how large it can get, so I arbitrarily picked
+ * 128MB.
+ */
+#define TASK_UNMAPPED_CEILING   (TASK_SIZE - 128 * 1024 * 1024)

 /*
  * Size of io_bitmap in longwords: 32 is ports 0-0x3ff.
diff -ru -x *.o -x .config linux-2.4.0-pizza/mm/mmap.c linux-2.4.0-hack/mm/mmap.c
--- linux-2.4.0-pizza/mm/mmap.c	Sat Dec 30 12:35:19 2000
+++ linux-2.4.0-hack/mm/mmap.c	Wed Jan 10 17:24:08 2001
@@ -382,6 +382,22 @@

 	if (len > TASK_SIZE)
 		return 0;
+#ifdef TASK_UNMAPPED_CEILING
+	if (!addr)
+		addr = TASK_UNMAPPED_CEILING - len;
+
+	do {
+		/* align addr _downwards_; PAGE_ALIGN aligns it upwards */
+		addr = addr&PAGE_MASK;
+		vmm = find_vma(current->mm,addr);
+		/* At this point:  (!vmm || addr < vmm->vm_end). */
+		if (!vmm || addr + len <= vmm->vm_start)
+			return addr;
+		addr = vmm->vm_start - len;
+	} while (addr >= TASK_UNMAPPED_BASE);
+
+	return 0;
+#else
 	if (!addr)
 		addr = TASK_UNMAPPED_BASE;
 	addr = PAGE_ALIGN(addr);
@@ -394,6 +410,7 @@
 			return addr;
 		addr = vmm->vm_end;
 	}
+#endif
 }
 #endif

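The downward search in the patch above can be tried outside the kernel. What follows is a minimal user-space sketch, not kernel code: `struct sim_vma`, `sim_find_vma()` and `sim_get_unmapped_area_down()` are hypothetical stand-ins for `vm_area_struct`, `find_vma()` and `get_unmapped_area()`, and the area list is assumed to be a sorted array of non-overlapping ranges. One thing the sketch does differently from the patch is guard the `vm_start - len` step: with unsigned arithmetic that subtraction can wrap around when a mapping starts below `len`, and the wrapped value would still pass an `addr >= floor` check.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_PAGE_SIZE 4096UL
#define SIM_PAGE_MASK (~(SIM_PAGE_SIZE - 1))

/* Hypothetical stand-in for vm_area_struct: the half-open range [start, end). */
struct sim_vma {
	unsigned long start, end;
};

/* Like find_vma(): first area with addr < end, or NULL.  v[] is sorted. */
static const struct sim_vma *sim_find_vma(const struct sim_vma *v, size_t n,
					  unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < v[i].end)
			return &v[i];
	return NULL;
}

/*
 * Search downwards from 'ceiling' for 'len' free bytes, never going
 * below 'floor'.  Returns the chosen address, or 0 on failure.
 */
static unsigned long sim_get_unmapped_area_down(const struct sim_vma *v,
						size_t n, unsigned long floor,
						unsigned long ceiling,
						unsigned long len)
{
	unsigned long addr;

	if (len == 0 || ceiling < floor || len > ceiling - floor)
		return 0;

	/* Align downwards so the candidate range stays below the ceiling. */
	addr = (ceiling - len) & SIM_PAGE_MASK;
	while (addr >= floor) {
		const struct sim_vma *vma = sim_find_vma(v, n, addr);

		/* Every area below 'vma' ends at or before addr. */
		if (!vma || addr + len <= vma->start)
			return addr;
		/* Guard the unsigned subtraction against wrap-around. */
		if (vma->start < len || vma->start - len < floor)
			return 0;
		addr = (vma->start - len) & SIM_PAGE_MASK;
	}
	return 0;
}
```

With an empty address space the function returns `ceiling - len`; with one mapping sitting directly under the ceiling it returns the slot immediately below that mapping. Like the patch, it looks up the area on every iteration, so it does not address question (3) about the cost of repeated `find_vma()` calls.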

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 19:48                             ` Alan Cox
  2001-01-10 19:48                               ` Andi Kleen
@ 2001-01-11  9:51                               ` Trond Myklebust
  1 sibling, 0 replies; 88+ messages in thread
From: Trond Myklebust @ 2001-01-11  9:51 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, Daniel Phillips, Linus Torvalds, linux-kernel

>>>>> " " == Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

    >> As the thread started it's not only needed for pthreads,
    >> but also for NFS and setuid (actually NFS already implements it
    >> privately), and probably other network file systems too.  So
    >> it's far from being only a "bad standard corner case".

     > I wonder how Linux 2.2 worked, that doesn't have them. Now if
     > it's a clean way of sorting out a pile of other things and it
     > does pthreads as a side effect I've no problem, but arguing for
     > it because of a tiny pthreads corner case is coming from the
     > wrong end

How about this then:

Sure NFS can work without ucreds, but there are limitations.  For
instance the MVFS folks recently complained. They're trying to keep
mmap consistency between their own filesystem layer and the underlying
storage filesystem using i_mapping (a la CODAfs). The problem then is
that the vma will be using the wrong 'struct file' to call the
underlying storage.

This sort of problem would indeed disappear if we have a generic
credential stored in the struct file as we could make the VFS pass the
credential directly to readpage (and writepage?) rather than passing
the whole struct file.

If you use the same credentials in the task structure, then there are
other advantages even to NFS itself.
You may for example want to attach an ACL cache at some point in time
(to avoid the messiness of calling NFSv3/v4 permissions routines at
each and every file lookup). Ditto for strong RPC authentication
schemes that require an upcall to some userspace daemon.

That said, we'd first have to find a way to reconcile fsuid/fsgid with
the BSD model in some way: I'd rather not have 2 'ucred's per task (1
for threads + 1 for filesystems).

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 20:11                       ` Linus Torvalds
@ 2001-01-11 12:56                         ` Stephen C. Tweedie
  2001-01-11 13:10                           ` Andi Kleen
                                             ` (3 more replies)
  0 siblings, 4 replies; 88+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 12:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Andi Kleen, Trond Myklebust, Daniel Phillips,
	linux-kernel, Stephen Tweedie

Hi,

On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> 
> That said, we can easily support the notion of CLONE_CRED if we absolutely
> have to (and sane people just shouldn't use it), so if somebody wants to
> work on this for 2.5.x...

But is it really worth the pain?  I'd hate to have to audit the entire
VFS to make sure that it works if another thread changes our
credentials in the middle of a syscall, so we either end up having to
lock the credentials over every VFS syscall, or take a copy of the
credentials and pass it through every VFS internal call that we make.

--Stephen

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 12:56                         ` Stephen C. Tweedie
@ 2001-01-11 13:10                           ` Andi Kleen
  2001-01-11 13:12                           ` Trond Myklebust
                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 88+ messages in thread
From: Andi Kleen @ 2001-01-11 13:10 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
	Daniel Phillips, linux-kernel

On Thu, Jan 11, 2001 at 12:56:04PM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> > 
> > That said, we can easily support the notion of CLONE_CRED if we absolutely
> > have to (and sane people just shouldn't use it), so if somebody wants to
> > work on this for 2.5.x...
> 
> But is it really worth the pain?  I'd hate to have to audit the entire
> VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.

That is what NFS does already, it would just move into generic VFS then.
(NFS copies) 


-Andi

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 12:56                         ` Stephen C. Tweedie
  2001-01-11 13:10                           ` Andi Kleen
@ 2001-01-11 13:12                           ` Trond Myklebust
  2001-01-11 14:13                             ` Stephen C. Tweedie
  2001-01-11 16:50                           ` Albert D. Cahalan
  2001-01-11 19:01                           ` Alexander Viro
  3 siblings, 1 reply; 88+ messages in thread
From: Trond Myklebust @ 2001-01-11 13:12 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Daniel Phillips,
	linux-kernel

>>>>> " " == Stephen C Tweedie <sct@redhat.com> writes:

     > Hi, On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds
     > wrote:
    >>
    >> That said, we can easily support the notion of CLONE_CRED if we
    >> absolutely have to (and sane people just shouldn't use it), so
    >> if somebody wants to work on this for 2.5.x...

     > But is it really worth the pain?  I'd hate to have to audit the
     > entire VFS to make sure that it works if another thread changes
     > our credentials in the middle of a syscall, so we either end up
     > having to lock the credentials over every VFS syscall, or take
     > a copy of the credentials and pass it through every VFS
     > internal call that we make.

 What's wrong with copy-on-write style semantics? IOW, anyone who
wants to change the credentials needs to make a private copy of the
existing structure first.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 13:12                           ` Trond Myklebust
@ 2001-01-11 14:13                             ` Stephen C. Tweedie
  2001-01-11 19:03                               ` Alexander Viro
  0 siblings, 1 reply; 88+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 14:13 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Andi Kleen,
	Daniel Phillips, linux-kernel

Hi,

On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> 
>  What's wrong with copy-on-write style semantics? IOW, anyone who
> wants to change the credentials needs to make a private copy of the
> existing structure first.

Because COW only solves the problem if each task is only changing its
own, local, private copy of the credentials.  Posix threads demand
that one thread changing credentials also affects all the other
threads immediately, and making your own local private copy won't help
you to change the other tasks' credentials safely.

--Stephen

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 17:03                     ` Linus Torvalds
@ 2001-01-11 14:36                       ` Jim Gettys
  0 siblings, 0 replies; 88+ messages in thread
From: Jim Gettys @ 2001-01-11 14:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Zlatko Calusic, Eric W. Biederman, Rik van Riel,
	linux-kernel


> Sender: linux-kernel-owner@vger.kernel.org
> From: Linus Torvalds <torvalds@transmeta.com>
> Date: 	Wed, 10 Jan 2001 09:03:03 -0800 (PST)
> To: David Woodhouse <dwmw2@infradead.org>
> Cc: Zlatko Calusic <zlatko@iskon.hr>,
>         "Eric W. Biederman" <ebiederm@xmission.com>,
>         Rik van Riel <riel@conectiva.com.br>, linux-kernel@vger.kernel.org
> Subject: Re: Subtle MM bug
> -----
> On Wed, 10 Jan 2001, David Woodhouse wrote:
> 
> >
> > torvalds@transmeta.com said:
> > >  The no-swap behaviour should actually be pretty much identical,
> > > simply because both 2.2 and 2.4 will do the same thing: just skip
> > > dirty pages in the page tables because they cannot do anything about
> > > them.
> >
> > So the VM code spends a fair amount of time scanning lists of pages which
> > it really can't do anything about?
> 
> It can do _tons_ of stuff.
> 
> Remember, on platforms like this, one of the reasons for being low on
> memory is things like running X and netscape: maybe you have 64MB of RAM
> and you don't think you need a swap device, and you want to have a web
> browser.
> 
> The fact that we cannot touch _dirty_ pages doesn't mean that there's
> nothing to do: instead of running out of memory we can at least make the
> machine usable by dropping the text pages and the page cache..
> 

And pushing out old text pages is a very good idea on most embedded systems.
Getting the pages back is a (relatively) cheap operation: no disk seeks,
some joules spent on decompression (if on CRAMFS or other compressed file
system).

There is an interesting question on such devices as to whether you are
better off dropping text pages or pages out of the page cache first,
or to what degree... 
				- Jim

--
Jim Gettys
Technology and Corporate Development
Compaq Computer Corporation
jg@pa.dec.com


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 12:56                         ` Stephen C. Tweedie
  2001-01-11 13:10                           ` Andi Kleen
  2001-01-11 13:12                           ` Trond Myklebust
@ 2001-01-11 16:50                           ` Albert D. Cahalan
  2001-01-11 17:35                             ` Stephen C. Tweedie
  2001-01-11 19:01                           ` Alexander Viro
  3 siblings, 1 reply; 88+ messages in thread
From: Albert D. Cahalan @ 2001-01-11 16:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
	Daniel Phillips, linux-kernel, Stephen Tweedie

Stephen C. Tweedie writes:
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:

>> That said, we can easily support the notion of CLONE_CRED if
>> we absolutely have to (and sane people just shouldn't use it),
>> so if somebody wants to work on this for 2.5.x...
>
> But is it really worth the pain?  I'd hate to have to audit the
> entire VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.

1. each thread has a copy, and doesn't need to lock it
2. threads are commanded to change their own copy

Credentials could be changed on syscall exit. It is a bit like
doing signals I think, with less overhead than making userspace
muck around with signal handlers and synchronization crud.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 16:50                           ` Albert D. Cahalan
@ 2001-01-11 17:35                             ` Stephen C. Tweedie
  2001-01-11 19:38                               ` Albert D. Cahalan
  0 siblings, 1 reply; 88+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 17:35 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Andi Kleen,
	Trond Myklebust, Daniel Phillips, linux-kernel

Hi,

On Thu, Jan 11, 2001 at 11:50:21AM -0500, Albert D. Cahalan wrote:
> Stephen C. Tweedie writes:
> >
> > But is it really worth the pain?  I'd hate to have to audit the
> > entire VFS to make sure that it works if another thread changes our
> > credentials in the middle of a syscall, so we either end up having to
> > lock the credentials over every VFS syscall, or take a copy of the
> > credentials and pass it through every VFS internal call that we make.
> 
> 1. each thread has a copy, and doesn't need to lock it

We already have that...

> 2. threads are commanded to change their own copy

We already do that: that's how the current pthreads works.
 
> Credentials could be changed on syscall exit. It is a bit like
> doing signals I think, with less overhead than making userspace
> muck around with signal handlers and synchronization crud.

Yuck.  Far better to send a signal than to pollute the syscall exit
path.  And what about syscalls which block indefinitely?  We _want_
the signal so that they get woken up to do the credentials change.

--Stephen


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 12:56                         ` Stephen C. Tweedie
                                             ` (2 preceding siblings ...)
  2001-01-11 16:50                           ` Albert D. Cahalan
@ 2001-01-11 19:01                           ` Alexander Viro
  3 siblings, 0 replies; 88+ messages in thread
From: Alexander Viro @ 2001-01-11 19:01 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Trond Myklebust,
	Daniel Phillips, linux-kernel



On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:

> Hi,
> 
> On Wed, Jan 10, 2001 at 12:11:16PM -0800, Linus Torvalds wrote:
> > 
> > That said, we can easily support the notion of CLONE_CRED if we absolutely
> > have to (and sane people just shouldn't use it), so if somebody wants to
> > work on this for 2.5.x...
> 
> But is it really worth the pain?  I'd hate to have to audit the entire
> VFS to make sure that it works if another thread changes our
> credentials in the middle of a syscall, so we either end up having to
> lock the credentials over every VFS syscall, or take a copy of the
> credentials and pass it through every VFS internal call that we make.

COW. Pthreads are simply irrelevant here - if you want set*id in one
thread to change the credentials of the rest you can do it in libpthreads.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 14:13                             ` Stephen C. Tweedie
@ 2001-01-11 19:03                               ` Alexander Viro
  2001-01-11 19:47                                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 88+ messages in thread
From: Alexander Viro @ 2001-01-11 19:03 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Trond Myklebust, Linus Torvalds, Alan Cox, Andi Kleen,
	Daniel Phillips, linux-kernel



On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:

> Hi,
> 
> On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> > 
> >  What's wrong with copy-on-write style semantics? IOW, anyone who
> > wants to change the credentials needs to make a private copy of the
> > existing structure first.
> 
> Because COW only solves the problem if each task is only changing its
> own, local, private copy of the credentials.  Posix threads demand
> that one thread changing credentials also affects all the other
> threads immediately, and making your own local private copy won't help
> you to change the other tasks' credentials safely.

And how is that different from the current situation?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 17:35                             ` Stephen C. Tweedie
@ 2001-01-11 19:38                               ` Albert D. Cahalan
  0 siblings, 0 replies; 88+ messages in thread
From: Albert D. Cahalan @ 2001-01-11 19:38 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Albert D. Cahalan, Stephen C. Tweedie, Linus Torvalds, Alan Cox,
	Andi Kleen, Trond Myklebust, Daniel Phillips, linux-kernel

Stephen C. Tweedie writes:
> On Thu, Jan 11, 2001 at 11:50:21AM -0500, Albert D. Cahalan wrote:
>> Stephen C. Tweedie writes:

>>> But is it really worth the pain?  I'd hate to have to audit the
>>> entire VFS to make sure that it works if another thread changes our
>>> credentials in the middle of a syscall, so we either end up having to
>>> lock the credentials over every VFS syscall, or take a copy of the
>>> credentials and pass it through every VFS internal call that we make.
>>
>> 1. each thread has a copy, and doesn't need to lock it
>
> We already have that...
>
>> 2. threads are commanded to change their own copy
>
> We already do that: that's how the current pthreads works.

I thought it was unimplemented. Even so, it is at least one
extra round trip to/from the kernel. (I'd guess trips>1)

>> Credentials could be changed on syscall exit. It is a bit like
>> doing signals I think, with less overhead than making userspace
>> muck around with signal handlers and synchronization crud.
>
> Yuck.  Far better to send a signal than to pollute the syscall exit
> path.  And what about syscalls which block indefinitely?  We _want_
> the signal so that they get woken up to do the credentials change.

The syscall exit path itself need not be polluted. Changes to
recalc_sigpending and do_signal would get the job done.
For the former, either add an extra word of kernel-internal
signal data or just check a simple flag. For do_signal, maybe
add an extra "if(foo)" at the top of the main loop. (that would
depend on what was done to recalc_sigpending)
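
The flag-based variant described above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct thread_sketch`, `request_uid_change` and `apply_pending_cred_change` are hypothetical names, not kernel interfaces, and the sketch is single-threaded, so it ignores the locking and wakeup questions raised elsewhere in this thread.

```c
#include <assert.h>

/* Hypothetical per-thread state; not an existing kernel structure. */
struct thread_sketch {
    unsigned uid;             /* this thread's private copy of the credential */
    int cred_change_pending;  /* the "simple flag" checked on the signal path */
    unsigned pending_uid;     /* value to adopt when the flag is seen */
};

/* Ask every thread in the group to adopt new_uid at its next check. */
static void request_uid_change(struct thread_sketch *threads, int n,
                               unsigned new_uid)
{
    for (int i = 0; i < n; i++) {
        threads[i].pending_uid = new_uid;
        threads[i].cred_change_pending = 1;
    }
}

/*
 * Stand-in for the extra "if (foo)" at the top of a do_signal-like loop.
 * Returns 1 if a pending change was applied, 0 if there was nothing to do.
 */
static int apply_pending_cred_change(struct thread_sketch *t)
{
    if (!t->cred_change_pending)
        return 0;
    t->uid = t->pending_uid;
    t->cred_change_pending = 0;
    return 1;
}
```

Each thread only ever writes its own copy, so the copy itself needs no lock; the cost moves to making sure every thread reaches the check.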

I suppose the goodness or badness of this depends partly on how
much you are willing to pay for pthreads that are fast and correct.
People around here seem to like burying their heads in hope that
pthreads will just go away, while app developers stubbornly try to
use the API.



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 19:03                               ` Alexander Viro
@ 2001-01-11 19:47                                 ` Stephen C. Tweedie
  2001-01-11 19:57                                   ` Alexander Viro
  0 siblings, 1 reply; 88+ messages in thread
From: Stephen C. Tweedie @ 2001-01-11 19:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen C. Tweedie, Trond Myklebust, Linus Torvalds, Alan Cox,
	Andi Kleen, Daniel Phillips, linux-kernel

Hi,

On Thu, Jan 11, 2001 at 02:03:48PM -0500, Alexander Viro wrote:
> On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:
> 
> > On Thu, Jan 11, 2001 at 02:12:05PM +0100, Trond Myklebust wrote:
> > > 
> > >  What's wrong with copy-on-write style semantics? IOW, anyone who
> > > wants to change the credentials needs to make a private copy of the
> > > existing structure first.
> > 
> > Because COW only solves the problem if each task is only changing its
> > own, local, private copy of the credentials.  Posix threads demand
> > that one thread changing credentials also affects all the other
> > threads immediately, and making your own local private copy won't help
> > you to change the other tasks' credentials safely.
> 
> And how is that different from the current situation?

It's not, which is the point I was making: COW doesn't actually solve
the pthreads problem.  Far better to do it in user space.

--Stephen

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-11 19:47                                 ` Stephen C. Tweedie
@ 2001-01-11 19:57                                   ` Alexander Viro
  0 siblings, 0 replies; 88+ messages in thread
From: Alexander Viro @ 2001-01-11 19:57 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Trond Myklebust, Linus Torvalds, Alan Cox, Andi Kleen,
	Daniel Phillips, linux-kernel

On Thu, 11 Jan 2001, Stephen C. Tweedie wrote:

> > And how is that different from the current situation?
> 
> It's not, which is the point I was making: COW doesn't actually solve
> the pthreads problem.  Far better to do it in user space.

Oh, certainly. We need COW for completely unrelated reasons - suppose
you open() a file and then change your *ID. You definitely want credentials
on the opened file to stay unchanged.
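
A minimal sketch of that copy-on-write behaviour, assuming a hypothetical refcounted `struct cred_sketch` (the structure and the `cred_*` helpers are illustrative names, not existing kernel code). The opened file holds its own reference, so a later ID change by the task goes to a fresh copy:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted credential set; not an existing kernel structure. */
struct cred_sketch {
    int refcount;
    unsigned int uid;
    unsigned int gid;
};

static struct cred_sketch *cred_alloc(unsigned int uid, unsigned int gid)
{
    struct cred_sketch *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    c->refcount = 1;
    c->uid = uid;
    c->gid = gid;
    return c;
}

/* Take an extra reference, e.g. when open() stores the creds in a file. */
static struct cred_sketch *cred_get(struct cred_sketch *c)
{
    c->refcount++;
    return c;
}

/* Drop a reference and free the structure when the last user is gone. */
static void cred_put(struct cred_sketch *c)
{
    if (--c->refcount == 0)
        free(c);
}

/*
 * Change the uid copy-on-write style. Returns the credential set the caller
 * should use from now on, or NULL on allocation failure (in which case the
 * caller still holds its old reference).
 */
static struct cred_sketch *cred_set_uid(struct cred_sketch *c, unsigned int uid)
{
    struct cred_sketch *copy;

    if (c->refcount == 1) {     /* sole owner: modify in place */
        c->uid = uid;
        return c;
    }
    copy = cred_alloc(uid, c->gid);
    if (!copy)
        return NULL;
    cred_put(c);                /* other holders keep the old values */
    return copy;
}
```

With a task and an open file sharing one set, `cred_set_uid()` on the task's reference leaves the file's view of the uid untouched.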

Pthreads are a non-issue as far as I'm concerned. I'd rather avoid mixing
them with credentials' cache. BTW, what about *BSD implementations? Do
they change creds of all threads upon set*id(2)?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 23:56                             ` David Weinehall
  2001-01-11  0:24                               ` Alan Cox
@ 2001-01-12  5:56                               ` Ralf Baechle
  2001-01-12 16:10                                 ` Eric W. Biederman
  1 sibling, 1 reply; 88+ messages in thread
From: Ralf Baechle @ 2001-01-12  5:56 UTC (permalink / raw)
  To: David Weinehall
  Cc: Alan Cox, Linus Torvalds, Eric W. Biederman, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote:

> > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > convention. (It also means you can notionally map two physical addresses to
> > one virtual but that's undefined in the implementation ;))
> 
> Are there any other (not yet supported) platforms with similar (or other
> unrelated, but hard to support because of the current architecture of
> the kernel) problems?
> 
> (No, I have no secret trumps up my sleeve, I'm just curious.)

Having reverse mappings is the least sucky way to handle virtual aliases
of certain types of MIPS caches.

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-12  5:56                               ` Ralf Baechle
@ 2001-01-12 16:10                                 ` Eric W. Biederman
  2001-01-12 21:11                                   ` Russell King
  2001-01-15  2:53                                   ` Ralf Baechle
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2001-01-12 16:10 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: David Weinehall, Alan Cox, Linus Torvalds, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

Ralf Baechle <ralf@conectiva.com.br> writes:

> On Thu, Jan 11, 2001 at 12:56:57AM +0100, David Weinehall wrote:
> 
> > > The MMU on these systems is a CAM, and the mmu table is thus backwards to
> > > convention. (It also means you can notionally map two physical addresses to
> > > one virtual but that's undefined in the implementation ;))
> > 
> > Are there any other (not yet supported) platforms with similar (or other
> > unrelated, but hard to support because of the current architecture of
> > the kernel) problems?
> > 
> > (No, I have no secret trumps up my sleeve, I'm just curious.)
> 
> Having reverse mappings is the least sucky way to handle virtual aliases
> of certain types of MIPS caches.

Hmm.  I would think that increasing the logical page size in the kernel would
be the trivial way to handle virtual aliases.  (i.e.) with a large enough page
size you can't actually have a virtual alias.

You could also play some games with simply allocating pages only with the
proper high bits.   These games might also be useful on architectures for L2
caches which have more significant physical bits than PAGE_SHIFT bits.

But how does a reverse mapping help to handle virtual aliases?  What are those
caches doing?  The only model in my head is having a virtually indexed cache
where you have more index bits than PAGE_SHIFT bits.
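
To make that model concrete, here is a rough arithmetic sketch. It assumes a virtually indexed cache with power-of-two sizes; the helper names are hypothetical and describe no particular CPU. One way of the cache spans `cache_bytes / ways` bytes, and the number of page "colors" is that span divided by the page size. Two virtual mappings of the same physical page can only alias when their colors differ, and once the page size reaches the way size there is only one color left.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes covered by the index bits: total cache size divided by associativity. */
static uint64_t way_size(uint64_t cache_bytes, unsigned ways)
{
    return cache_bytes / ways;
}

/* Number of distinct page colors; 1 means virtual aliases cannot occur. */
static uint64_t num_colors(uint64_t cache_bytes, unsigned ways,
                           uint64_t page_bytes)
{
    uint64_t w = way_size(cache_bytes, ways);
    return w > page_bytes ? w / page_bytes : 1;
}

/* Color of the page containing vaddr: the index bits above the page offset. */
static uint64_t page_color(uint64_t vaddr, uint64_t cache_bytes, unsigned ways,
                           uint64_t page_bytes)
{
    return (vaddr / page_bytes) % num_colors(cache_bytes, ways, page_bytes);
}

/* Nonzero if two virtual mappings of one physical page would alias. */
static int mappings_alias(uint64_t va1, uint64_t va2, uint64_t cache_bytes,
                          unsigned ways, uint64_t page_bytes)
{
    return page_color(va1, cache_bytes, ways, page_bytes) !=
           page_color(va2, cache_bytes, ways, page_bytes);
}
```

For a 16 KB direct-mapped cache with 4 KB pages this gives four colors; raising the logical page size to 16 KB collapses them to one, which is the "large enough page size" case.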

Eric


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-12 16:10                                 ` Eric W. Biederman
@ 2001-01-12 21:11                                   ` Russell King
  2001-01-15  2:56                                     ` Ralf Baechle
  2001-01-15  2:53                                   ` Ralf Baechle
  1 sibling, 1 reply; 88+ messages in thread
From: Russell King @ 2001-01-12 21:11 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ralf Baechle, riel, Andrea Arcangeli, linux-kernel

Eric W. Biederman writes:
> Hmm.  I would think that increasing the logical page size in the kernel
> would be the trivial way to handle virtual aliases.  (i.e.) with a large
> enough page size you can't actually have a virtual alias.

There are types of caches out there that no matter how large the page size,
you will always have alias issues.  These are ones where the cache lines
are indexed independent of virtual address (and therefore can have funny
cache line replacement algorithms).

And yes, you guessed which processor has it. ;)

(Sorry the CC list got trimmed, elm ate some of it.  I'm sure most of the
people who were on it were on lkml anyway)
   _____
  |_____| ------------------------------------------------- ---+---+-
  |   |         Russell King        rmk@arm.linux.org.uk      --- ---
  | | | | http://www.arm.linux.org.uk/personal/aboutme.html   /  /  |
  | +-+-+                                                     --- -+-
  /   |               THE developer of ARM Linux              |+| /|\
 /  | | |                                                     ---  |
    +-+-+ -------------------------------------------------  /\\\  |

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-12 16:10                                 ` Eric W. Biederman
  2001-01-12 21:11                                   ` Russell King
@ 2001-01-15  2:53                                   ` Ralf Baechle
  1 sibling, 0 replies; 88+ messages in thread
From: Ralf Baechle @ 2001-01-15  2:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Weinehall, Alan Cox, Linus Torvalds, Andrea Arcangeli,
	David Woodhouse, Zlatko Calusic, Rik van Riel, linux-kernel

On Fri, Jan 12, 2001 at 09:10:54AM -0700, Eric W. Biederman wrote:

> > Having reverse mappings is the least sucky way to handle virtual aliases
> > of certain types of MIPS caches.
> 
> Hmm.  I would think that increasing the logical page size in the kernel would
> be the trivial way to handle virtual aliases.  (i.e.) with a large enough page
> size you can't actually have a virtual alias.

That's a possible solution; I'm not clear how bad the overhead would be.
Right now a virtual alias is a relatively rare event and we don't want the
common case of no virtual alias to pay a high price.  Or?

> You could also play some games with simply allocating pages only with the
> proper high bits.   These games might also be useful on architectures
> for L2 caches which have more significant physical bits than PAGE_SHIFT bits.

An alternative but less efficient solution.  I tried to implement it; I ran
into problems with running out of larger pages soon as I had to split order 2
pages into 4 order 0 pages to implement this; the fragmentation was _really_
bad.

> But how does a reverse mapping help to handle virtual aliases?  What are those
> caches doing?

You leave only mappings of one color accessible.  All other mappings are made
inaccessible in the page table, so accessing them will result in a TLB fault.
The TLB fault handler then flushes the active mappings, makes them
inaccessible by clearing the MIPS hw dirty / accessible bits, then makes the
mapping of the new color accessible in the page table.  This is already
possible right now but doing the necessary reverse mappings can be rather
inefficient as is.

> The only model in my head is having a virtually indexed cache where you
> have more index bits than PAGE_SHIFT bits.

Which is exactly what many MIPS implementations are suffering from.  At
least they're tagged with the physical address, so no flushes on context
switch necessary.
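
The one-color-at-a-time scheme can be modelled roughly as below. This is a user-space sketch under assumptions: `struct page_sketch`, its fixed-size `maps` array (standing in for a real reverse mapping) and `touch_mapping` are hypothetical names, and the "flush" is only a counter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MAPPINGS 8

/* One virtual mapping of a physical page (hypothetical model). */
struct mapping_sketch {
    unsigned color;   /* cache color of the virtual address */
    int accessible;   /* nonzero if the page table entry is valid */
};

/* A physical page plus a stand-in for its reverse mapping. */
struct page_sketch {
    struct mapping_sketch maps[MAX_MAPPINGS];
    size_t nmaps;
    unsigned flushes; /* counts simulated cache flushes */
};

/*
 * Simulate an access through mapping idx. Returns 0 if the mapping was
 * already accessible, 1 if the access "faulted" and the active color was
 * switched to the color of this mapping.
 */
static int touch_mapping(struct page_sketch *pg, size_t idx)
{
    unsigned color;
    int had_active = 0;
    size_t i;

    if (pg->maps[idx].accessible)
        return 0;

    color = pg->maps[idx].color;
    /* Walk the reverse mapping: only mappings of the new color stay valid. */
    for (i = 0; i < pg->nmaps; i++) {
        if (pg->maps[i].accessible)
            had_active = 1;
        pg->maps[i].accessible = (pg->maps[i].color == color);
    }
    if (had_active)
        pg->flushes++; /* the old color's cache lines must be flushed */
    return 1;
}
```

The walk over `maps` is the part that needs an efficient reverse mapping: every color switch has to find all mappings of the page.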

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-12 21:11                                   ` Russell King
@ 2001-01-15  2:56                                     ` Ralf Baechle
  2001-01-15  6:59                                       ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Ralf Baechle @ 2001-01-15  2:56 UTC (permalink / raw)
  To: Russell King; +Cc: Eric W. Biederman, riel, Andrea Arcangeli, linux-kernel

On Fri, Jan 12, 2001 at 09:11:43PM +0000, Russell King wrote:

> Eric W. Biederman writes:
> > Hmm.  I would think that increasing the logical page size in the kernel
> > would be the trivial way to handle virtual aliases.  (i.e.) with a large
> > enough page size you can't actually have a virtual alias.
> 
> There are types of caches out there that no matter how large the page size,
> you will always have alias issues.  These are ones where the cache lines
> are indexed independent of virtual address (and therefore can have funny
> cache line replacement algorithms).
> 
> And yes, you guessed which processor has it. ;)

I recently spoke with some CPU architecture researcher at some university
about cache architectures; I suspect in the near future we'll see more
funny cache indexing and replacement algorithms ...

  Ralf

--
"Embrace, Enhance, Eliminate" - it worked for the pope, it'll work for Bill.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-15  2:56                                     ` Ralf Baechle
@ 2001-01-15  6:59                                       ` Eric W. Biederman
  0 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2001-01-15  6:59 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Russell King, riel, Andrea Arcangeli, linux-kernel

Ralf Baechle <ralf@uni-koblenz.de> writes:

> On Fri, Jan 12, 2001 at 09:11:43PM +0000, Russell King wrote:
> 
> > Eric W. Biederman writes:
> > > Hmm.  I would think that increasing the logical page size in the kernel
> > > would be the trivial way to handle virtual aliases.  (i.e.) with a large
> > > enough page size you can't actually have a virtual alias.
> > 
> > There are types of caches out there that no matter how large the page size,
> > you will always have alias issues.  These are ones where the cache lines
> > are indexed independent of virtual address (and therefore can have funny
> > cache line replacement algorithms).
> > 
> > And yes, you guessed which processor has it. ;)

Odd.  Does this affect correctness?

> I recently spoke with some CPU architecture researcher at some university
> about cache architectures; I suspect in the near future we'll see more
> funny cache indexing and replacement algorithms ...

But I doubt many of those will run incorrectly, rather than just less
efficiently, if the OS doesn't help you avoid aliases.


Eric

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-09  2:01   ` Zlatko Calusic
@ 2001-01-17  4:48     ` Rik van Riel
  2001-01-17 18:53       ` Zlatko Calusic
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2001-01-17  4:48 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 9 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > Now if 2.4 has worse _performance_ than 2.2 due to one
> > reason or another, that I'd like to hear about ;)
> >
>
> Oh, well, it seems that I was wrong. :)
>
> First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
> 192MB machine)
>
> kernel | swap usage | speed
> -------------------------------
> 2.2.17 |  48 MB     | 11.8 MB/s
> -------------------------------
> 2.4.0  | 206 MB     | 11.1 MB/s
> -------------------------------
>
> So 2.2 is only marginally faster. Also it can be seen that 2.4
> uses 4 times more swap space. If Linus says it's ok... :)

I have been working on some changes to page_launder() which
might just fix this problem. Quick and dirty patches are on
my home page and I'll try to clean things up and make something
correct & clean later today or tomorrow ;)

> Second test: kernel compile make -j32 (empirically this puts the
> VM under load, but not excessively!)
>
> 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
>
> Now, is this great news or what, 2.4.0 is definitely faster.

One problem is that these tasks may be waiting on kswapd when
kswapd might not get scheduled in on time. On the one hand this
will mean lower load and less thrashing, on the other hand it
means more IO wait.

This is another area where we may be able to improve some things.

(btw, according to Alan the 2.4 kernel is the first one to break
the 1.2 kernel compiling speed record on an 8MB machine he has ;))

cheers,

Rik  (stuck in Australia at a conference)
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-09 19:09               ` Daniel Phillips
  2001-01-09 19:29                 ` Trond Myklebust
  2001-01-09 19:37                 ` Linus Torvalds
@ 2001-01-17  8:46                 ` Rik van Riel
  2001-01-25 22:51                   ` Daniel Phillips
  2 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2001-01-17  8:46 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

On Tue, 9 Jan 2001, Daniel Phillips wrote:
> Linus Torvalds wrote:
> > (This is why I worked so hard at getting the PageDirty semantics right in
> > the last two months or so - and why I released 2.4.0 when I did. Getting
> > PageDirty right was the big step to make all of the VM stuff possible in
> > the first place. Even if it probably looked a bit foolhardy to change the
> > semantics of "writepage()" quite radically just before 2.4 was released).
>
> On the topic of writepage, it's not symmetric with readpage at
> the moment - it still takes (struct file *).  Is this in the
> cleanup pipeline?  It looks like nfs_readpage already ignores
> the struct file *, but maybe some other net filesystems are
> still depending on it.

writepage() and readpage() will never be symmetric...

readpage()
	program can't continue until data is there
	reading in larger clusters eats (wastes?) more memory
	done when we think a process needs data

writepage()
	called after the process has written data and moved on
	writing larger clusters has no influence on memory use
	often done to free up memory

Since readpage() needs to tune readahead behaviour, we will
always want to give it some information (eg. in the file *)
so it can do the extra things it needs to do.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Subtle MM bug
  2001-01-10 18:33                         ` Andrea Arcangeli
@ 2001-01-17 14:26                           ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2001-01-17 14:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, David Woodhouse, Linus Torvalds,
	Zlatko Calusic, linux-kernel

On Wed, 10 Jan 2001, Andrea Arcangeli wrote:
> On Wed, Jan 10, 2001 at 10:46:07AM -0700, Eric W. Biederman wrote:

> > My impression with the MM stuff is that everyone except linux is
> > trying hard to clone BSD instead of thinking through the issues
> > ourselves.
>
> I wasn't even thinking about BSD and I always thought about the
> issues myself, no panic ;).

Andrea, if you have the time, please do check out the
FreeBSD and NetBSD VM code.

The FreeBSD code has the original Mach overengineered
abstraction layer, but an absolutely kickass page
replacement strategy.

The NetBSD code has cleaned up the abstraction layer
into something nice and lower overhead, but has a lot
simpler (probably lower performance) page replacement.

It would be cool if some of the Linux hackers could take
the time and look at this code to see if there are some
good ideas we might want to have in Linux.

It might just be the case that we DON'T want to reinvent
the wheel (that others have made into a nice round shape
with 15 years of trial, error and redesigning).

(though I know some people prefer reinventing wheels ;))

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-10 19:03                         ` Linus Torvalds
  2001-01-10 19:27                           ` David S. Miller
  2001-01-10 19:36                           ` Alan Cox
@ 2001-01-17 14:28                           ` Rik van Riel
  2001-01-18  1:23                             ` Linus Torvalds
  2 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2001-01-17 14:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Andrea Arcangeli, David Woodhouse,
	Zlatko Calusic, linux-kernel

On Wed, 10 Jan 2001, Linus Torvalds wrote:

> I looked at it a year or two ago myself, and came to the
> conclusion that I don't want to blow up our page table size by a
> factor of three or more, so I'm not personally interested any
> more. Maybe somebody else comes up with a better way to do it,
> or with a really compelling reason to.

OTOH, it _would_ get rid of all the balancing issues in one
blow. And it would fix the aliasing issues and possibly the
memory fragmentation problem too.

And using something like Davem's lower-overhead reverse
mapping layer, we might just be able to pull off all (or most)
of the advantages with lower overhead ;)

[this is something I will be looking into for 2.5]

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-17  4:48     ` Rik van Riel
@ 2001-01-17 18:53       ` Zlatko Calusic
  2001-01-18  1:32         ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Zlatko Calusic @ 2001-01-17 18:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> > Second test: kernel compile make -j32 (empirically this puts the
> > VM under load, but not excessively!)
> >
> > 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> > 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
> >
> > Now, is this great news or what, 2.4.0 is definitely faster.
> 
> One problem is that these tasks may be waiting on kswapd when
> kswapd might not get scheduled in on time. On the one hand this
> will mean lower load and less thrashing, on the other hand it
> means more IO wait.
> 

Hm, if all tasks are waiting for memory, what is stopping kswapd from
running? :)
-- 
Zlatko

* Re: Subtle MM bug
  2001-01-17 14:28                           ` Rik van Riel
@ 2001-01-18  1:23                             ` Linus Torvalds
  2001-01-18 11:48                               ` Rik van Riel
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2001-01-18  1:23 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.31.0101180126240.31432-100000@localhost.localdomain>,
Rik van Riel  <riel@conectiva.com.br> wrote:
>On Wed, 10 Jan 2001, Linus Torvalds wrote:
>
>> I looked at it a year or two ago myself, and came to the
>> conclusion that I don't want to blow up our page table size by a
>> factor of three or more, so I'm not personally interested any
>> more. Maybe somebody else comes up with a better way to do it,
>> or with a really compelling reason to.
>
>OTOH, it _would_ get rid of all the balancing issues in one
>blow. And it would fix the aliasing issues and possibly the
>memory fragmentation problem too.

I totally disagree.

It might help fragmentation, but it has absolutely _no_ impact on
balancing. See my comments about not seeing the "accessed" bit until way
too late with a "find by physical" approach.

You simply _cannot_ use "find by physical" for balancing, unless you're
willing to pay the price of doing software accessed bits even on
hardware that does it for you in the page tables.  Which is a price MUCH
too high to pay, I suspect. 

The current vmscanning is the way to go.  Getting PageDirty was a big
step for it, because it is needed so that we can drop pages without
having to do IO like we historically did.  I doubt find-by-physical will
help AT ALL wrt balancing. 
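
As an illustration of the scan-by-virtual side of this argument (not
code from the thread or from the kernel), here is a minimal sketch of
a page-table walk that reads and clears the hardware accessed bit to
age pages. PTE_ACCESSED uses the x86 bit position; the aging constants,
struct page_info and scan_ptes() are hypothetical, and entry i is
assumed to map page i.

```c
#include <stddef.h>

#define PTE_PRESENT	0x01u
#define PTE_ACCESSED	0x20u	/* x86 accessed bit, for illustration */
#define AGE_ADVANCE	3	/* illustrative aging constants */
#define AGE_MAX		64

struct page_info {
	unsigned int age;	/* higher means more recently used */
};

/* Walk 'n' page-table entries; entry i is assumed to map page i. */
void scan_ptes(unsigned int *ptes, struct page_info *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!(ptes[i] & PTE_PRESENT))
			continue;
		if (ptes[i] & PTE_ACCESSED) {
			/* The hardware set the bit: clear it and age the page up. */
			ptes[i] &= ~PTE_ACCESSED;
			pages[i].age += AGE_ADVANCE;
			if (pages[i].age > AGE_MAX)
				pages[i].age = AGE_MAX;
		} else {
			/* Not touched since the last scan: age the page down. */
			pages[i].age /= 2;
		}
	}
}
```

A scan that starts from the physical page has no entry in hand to test,
so it needs some way to reach every mapping of that page before it can
see the bit at all.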

		Linus

* Re: Subtle MM bug
  2001-01-17 18:53       ` Zlatko Calusic
@ 2001-01-18  1:32         ` Rik van Riel
  2001-04-17 19:37           ` H. Peter Anvin
  0 siblings, 1 reply; 88+ messages in thread
From: Rik van Riel @ 2001-01-18  1:32 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-kernel, linux-mm

On 17 Jan 2001, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > > Second test: kernel compile make -j32 (empirically this puts the
> > > VM under load, but not excessively!)
> > >
> > > 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> > > 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
> > >
> > > Now, is this great news or what, 2.4.0 is definitely faster.
> >
> > One problem is that these tasks may be waiting on kswapd when
> > kswapd might not get scheduled in on time. On the one hand this
> > will mean lower load and less thrashing, on the other hand it
> > means more IO wait.
>
> Hm, if all tasks are waiting for memory, what is stopping kswapd
> from running? :)

Suppose you have 8 high-priority tasks waiting on kswapd
and one lower-priority (but still higher than kswapd)
process running and preventing kswapd from doing its work.
Oh .. and also preventing the higher-priority tasks from
being woken up and continuing...


Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-18  1:23                             ` Linus Torvalds
@ 2001-01-18 11:48                               ` Rik van Riel
  0 siblings, 0 replies; 88+ messages in thread
From: Rik van Riel @ 2001-01-18 11:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On 17 Jan 2001, Linus Torvalds wrote:
> Rik van Riel  <riel@conectiva.com.br> wrote:
> >On Wed, 10 Jan 2001, Linus Torvalds wrote:
> >
> >> I looked at it a year or two ago myself, and came to the
> >> conclusion that I don't want to blow up our page table size by a
> >> factor of three or more, so I'm not personally interested any
> >> more. Maybe somebody else comes up with a better way to do it,
> >> or with a really compelling reason to.
> >
> >OTOH, it _would_ get rid of all the balancing issues in one
> >blow. And it would fix the aliasing issues and possibly the
> >memory fragmentation problem too.
>
> I totally disagree.

I still haven't seen anything that might get us a
"universally correct" balancing between swap_out()
and refill_inactive_scan().

We either scan both categories at the same relative
rate, which gives mapped pages an advantage because
they may get unmapped later than the unmapped pages
get deactivated.

Alternatively, you do the scanning between these two
at different rates, which gives an advantage to one
or the other.

(or am I overlooking something stupid here?)
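
To make the two alternatives concrete, here is a minimal sketch (not
kernel code) that splits a scan budget between mapped and unmapped
pages. Equal weights give both pools the same relative scan rate;
unequal weights favour one pool. The function split_scan(), struct
scan_split and the weight parameters are hypothetical.

```c
/* Pages to scan from each pool in one pass. */
struct scan_split {
	unsigned long mapped;	/* swap_out()-style scanning */
	unsigned long unmapped;	/* refill_inactive_scan()-style scanning */
};

/*
 * Divide 'budget' in proportion to pool size times weight.
 * Equal weights mean the same relative rate for both pools.
 */
struct scan_split split_scan(unsigned long budget,
			     unsigned long nr_mapped,
			     unsigned long nr_unmapped,
			     unsigned int mapped_weight,
			     unsigned int unmapped_weight)
{
	struct scan_split s = { 0, 0 };
	unsigned long long wm = (unsigned long long)nr_mapped * mapped_weight;
	unsigned long long wu = (unsigned long long)nr_unmapped * unmapped_weight;

	if (wm + wu == 0)
		return s;	/* nothing to scan */

	s.mapped = (unsigned long)((unsigned long long)budget * wm / (wm + wu));
	s.unmapped = budget - s.mapped;
	return s;
}
```

Neither choice of weights addresses the delay between a mapped page
being unmapped and an unmapped page being deactivated, which is the
asymmetry described above.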

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: Subtle MM bug
  2001-01-17  8:46                 ` Rik van Riel
@ 2001-01-25 22:51                   ` Daniel Phillips
  0 siblings, 0 replies; 88+ messages in thread
From: Daniel Phillips @ 2001-01-25 22:51 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel

Rik van Riel wrote:
> 
> On Tue, 9 Jan 2001, Daniel Phillips wrote:
> > Linus Torvalds wrote:
> > > (This is why I worked so hard at getting the PageDirty semantics right in
> > > the last two months or so - and why I released 2.4.0 when I did. Getting
> > > PageDirty right was the big step to make all of the VM stuff possible in
> > > the first place. Even if it probably looked a bit foolhardy to change the
> > > semantics of "writepage()" quite radically just before 2.4 was released).
> >
> > On the topic of writepage, it's not symmetric with readpage at
> > the moment - it still takes (struct file *).  Is this in the
> > cleanup pipeline?  It looks like nfs_readpage already ignores
> > the struct file *, but maybe some other net filesystems are
> > still depending on it.
> 
> writepage() and readpage() will never be symmetric...
> 
> readpage()
>         program can't continue until data is there
>         reading in larger clusters eats (wastes?) more memory
>         done when we think a process needs data
> 
> writepage()
>         called after the process has written data and moved on
>         writing larger clusters has no influence on memory use
>         often done to free up memory
> 
> Since readpage() needs to tune readahead behaviour, we will
> always want to give it some information (eg. in the file *)
> so it can do the extra things it needs to do.

Which extra information did you have in mind?

--
Daniel

* Re: Subtle MM bug
  2001-01-18  1:32         ` Rik van Riel
@ 2001-04-17 19:37           ` H. Peter Anvin
  0 siblings, 0 replies; 88+ messages in thread
From: H. Peter Anvin @ 2001-04-17 19:37 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.31.0101181230020.31432-100000@localhost.localdomain>
By author:    Rik van Riel <riel@conectiva.com.br>
In newsgroup: linux.dev.kernel
> 
> Suppose you have 8 high-priority tasks waiting on kswapd
> and one lower-priority (but still higher than kswapd)
> process running and preventing kswapd from doing its work.
> Oh .. and also preventing the higher-priority tasks from
> being woken up and continuing...
> 

Classic priority inversion.  In this particular case it seems like it
should be unusually simple to apply priority inheritance, though (the
general case is complicated by the fact that the dependency matrix
usually isn't readily available.)
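
As an illustration (not code from the kernel), here is a minimal sketch
of one level of priority inheritance for the scenario quoted above: a
task's effective priority is raised to that of the most urgent task
blocked on it. The task model, effective_prio() and pick_next() are
hypothetical, higher numbers are assumed to mean higher priority, and
chains of dependencies are not followed.

```c
#include <stddef.h>

#define MAX_WAITERS 8

/* Hypothetical task model; a higher 'prio' is more urgent. */
struct task {
	const char *name;
	int prio;				/* base priority */
	int runnable;				/* 0 while blocked */
	const struct task *waiters[MAX_WAITERS];	/* tasks blocked on this one */
	size_t nr_waiters;
};

/* Base priority, raised to the highest priority among direct waiters. */
int effective_prio(const struct task *t)
{
	int p = t->prio;

	for (size_t i = 0; i < t->nr_waiters; i++)
		if (t->waiters[i]->prio > p)
			p = t->waiters[i]->prio;
	return p;
}

/* Pick the runnable task with the highest (optionally inherited) priority. */
const struct task *pick_next(const struct task *const *tasks, size_t n,
			     int inherit)
{
	const struct task *best = NULL;
	int best_prio = 0;

	for (size_t i = 0; i < n; i++) {
		const struct task *t = tasks[i];
		int p;

		if (!t->runnable)
			continue;
		p = inherit ? effective_prio(t) : t->prio;
		if (!best || p > best_prio) {
			best = t;
			best_prio = p;
		}
	}
	return best;
}
```

With a blocked waiter at priority 10, a running hog at 5 and kswapd at
1 with the waiter on its list, pick_next() chooses the hog without
inheritance and kswapd with it.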

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt


end of thread, other threads: [~2001-04-17 19:40 UTC]

Thread overview: 88+ messages
2001-01-08 20:39 Subtle MM bug Szabolcs Szakacsits
2001-01-08 21:56 ` Wayne Whitney
2001-01-08 23:22   ` Wayne Whitney
2001-01-08 23:30     ` Andrea Arcangeli
2001-01-09  0:37       ` Linus Torvalds
2001-01-09  3:01       ` Subtle MM bug (really 830MB barrier question) Wayne Whitney
2001-01-09 20:06         ` Szabolcs Szakacsits
2001-01-09 23:45           ` Wayne Whitney
2001-01-11  0:03           ` Wayne Whitney
2001-01-11  2:46           ` [2.4.0 pre-PATCH] 830MB barrier (was: Subtle MM bug) Wayne Whitney
2001-01-08 22:00 ` Subtle MM bug Wayne Whitney
2001-01-08 22:15   ` Andrea Arcangeli
  -- strict thread matches above, loose matches on Subject: below --
2001-01-10 19:57 Chris Wing
2001-01-08  5:29 Wayne Whitney
2001-01-08  5:42 ` Andi Kleen
2001-01-08  6:04   ` Linus Torvalds
2001-01-08 17:44     ` Rik van Riel
2001-01-08 18:02       ` Linus Torvalds
2001-01-08 17:16 ` Rik van Riel
2001-01-08 17:58   ` Linus Torvalds
2001-01-08 23:41     ` Zlatko Calusic
2001-01-09  2:58       ` Linus Torvalds
2001-01-09  6:20       ` Eric W. Biederman
2001-01-09  7:27         ` Linus Torvalds
2001-01-09 11:38           ` Eric W. Biederman
2001-01-09 12:29           ` Zlatko Calusic
2001-01-09 18:47             ` Linus Torvalds
2001-01-09 19:09               ` Daniel Phillips
2001-01-09 19:29                 ` Trond Myklebust
2001-01-10 17:32                   ` Andi Kleen
2001-01-10 19:31                     ` Alan Cox
2001-01-10 19:33                       ` Andi Kleen
2001-01-10 19:40                         ` Alan Cox
2001-01-10 19:43                           ` Andi Kleen
2001-01-10 19:48                             ` Alan Cox
2001-01-10 19:48                               ` Andi Kleen
2001-01-11  9:51                               ` Trond Myklebust
2001-01-10 20:11                       ` Linus Torvalds
2001-01-11 12:56                         ` Stephen C. Tweedie
2001-01-11 13:10                           ` Andi Kleen
2001-01-11 13:12                           ` Trond Myklebust
2001-01-11 14:13                             ` Stephen C. Tweedie
2001-01-11 19:03                               ` Alexander Viro
2001-01-11 19:47                                 ` Stephen C. Tweedie
2001-01-11 19:57                                   ` Alexander Viro
2001-01-11 16:50                           ` Albert D. Cahalan
2001-01-11 17:35                             ` Stephen C. Tweedie
2001-01-11 19:38                               ` Albert D. Cahalan
2001-01-11 19:01                           ` Alexander Viro
2001-01-09 19:37                 ` Linus Torvalds
2001-01-17  8:46                 ` Rik van Riel
2001-01-25 22:51                   ` Daniel Phillips
2001-01-09 19:53               ` Simon Kirby
2001-01-09 20:08                 ` Linus Torvalds
2001-01-09 20:10                 ` Zlatko Calusic
2001-01-10  1:45               ` David Woodhouse
2001-01-10  2:26                 ` Andrea Arcangeli
2001-01-10  6:57                 ` Linus Torvalds
2001-01-10 11:46                   ` David Woodhouse
2001-01-10 14:56                     ` Andrea Arcangeli
2001-01-10 17:46                       ` Eric W. Biederman
2001-01-10 18:33                         ` Andrea Arcangeli
2001-01-17 14:26                           ` Rik van Riel
2001-01-10 19:03                         ` Linus Torvalds
2001-01-10 19:27                           ` David S. Miller
2001-01-10 19:36                           ` Alan Cox
2001-01-10 23:56                             ` David Weinehall
2001-01-11  0:24                               ` Alan Cox
2001-01-12  5:56                               ` Ralf Baechle
2001-01-12 16:10                                 ` Eric W. Biederman
2001-01-12 21:11                                   ` Russell King
2001-01-15  2:56                                     ` Ralf Baechle
2001-01-15  6:59                                       ` Eric W. Biederman
2001-01-15  2:53                                   ` Ralf Baechle
2001-01-17 14:28                           ` Rik van Riel
2001-01-18  1:23                             ` Linus Torvalds
2001-01-18 11:48                               ` Rik van Riel
2001-01-10 17:03                     ` Linus Torvalds
2001-01-11 14:36                       ` Jim Gettys
2001-01-08 21:30   ` Wayne Whitney
2001-01-07 20:59 Zlatko Calusic
2001-01-07 21:37 ` Rik van Riel
2001-01-07 22:33   ` Zlatko Calusic
2001-01-09  2:01   ` Zlatko Calusic
2001-01-17  4:48     ` Rik van Riel
2001-01-17 18:53       ` Zlatko Calusic
2001-01-18  1:32         ` Rik van Riel
2001-04-17 19:37           ` H. Peter Anvin
