* 2.4.19pre1aa1
@ 2002-02-28  2:57 rwhron
  0 siblings, 0 replies; 77+ messages in thread
From: rwhron @ 2002-02-28 2:57 UTC (permalink / raw)
To: linux-kernel

Extended changelog at:
http://home.earthlink.net/~rwhron/kernel/andrea/2.4.19pre1aa1.html

--
Randy Hron

^ permalink raw reply	[flat|nested] 77+ messages in thread
* 2.4.19pre1aa1
From: Andrea Arcangeli @ 2002-02-27 12:50 UTC (permalink / raw)
To: linux-kernel

I would like to have feedback about this VM update; if nobody can find
any serious issue I'll try to push vm-28 into mainline during 2.4.19pre.
Please test oom conditions as well. Thanks!

URL:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1.gz
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/

Only in 2.4.18rc4aa1: 00_block-highmem-all-18b-3.gz
Only in 2.4.19pre1aa1: 00_block-highmem-all-18b-4.gz

	Fix leftover setting.

Only in 2.4.18rc4aa1: 00_hpfs-oops-1
Only in 2.4.18rc4aa1: 30_get_request-starvation-1
Only in 2.4.18rc4aa1: 00_init-blk-freelist-1

	Now in mainline.

Only in 2.4.19pre1aa1: 00_lcall_trace-1

	Call gate entry point speciality.

Only in 2.4.18rc4aa1: 00_prepare-write-fixes-1
Only in 2.4.19pre1aa1: 00_prepare-write-fixes-2

	Avoid false positives (agreed Andrew?).

Only in 2.4.18rc4aa1: 10_rawio-vary-io-2
Only in 2.4.19pre1aa1: 10_rawio-vary-io-3

	Rediffed.

Only in 2.4.18rc4aa1: 10_vm-27
Only in 2.4.19pre1aa1: 10_vm-28

	Further updates. As soon as I get confirmation that this does well
	in all the benchmarks, I think it should go into mainline.

Only in 2.4.18rc4aa1: 70_xfs-1.gz
Only in 2.4.19pre1aa1: 70_xfs-2.gz

	Drop PG_launder; it never really existed in -aa. wait_IO does a
	better job (not only for dirty bh submitted by the VM), and
	wait_IO is supported by xfs as well.

Andrea
* Re: 2.4.19pre1aa1
From: Bill Davidsen @ 2002-02-28 22:11 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel

On Wed, 27 Feb 2002, Andrea Arcangeli wrote:

> I would like to have feedback about this VM update; if nobody can find
> any serious issue I'll try to push vm-28 into mainline during 2.4.19pre.
> Please test oom conditions as well.

I have enjoyed using your -aa patches (and run child first) for some
time, and Rik's rmap patches as well. However, I find that for some
machines your stuff works clearly better, particularly larger memory
machines, and for some rmap is clearly more responsive, particularly for
small machines under heavy memory pressure.

The point is that choice is good, and having two solutions to address
various machines is a good thing, even if the convenience isn't all that
great. That being said, I fear that if your solution gets pushed into
mainline it will preempt other solutions. And my testing tells me that
there is no one solution here, even with all the tuning in your VM, using
the hints you gave me.

I would rather see both systems continue to be available until there is a
clear winner (i.e. no common cases where one is clearly worse than the
other), or until they somehow merge, or even become config options (I
don't really favor that). I suggested that the VM would be nice as a
module, but it doesn't seem possible.

If others share the thought that it's too early for a preemptive choice,
please speak up. And if everyone feels that this is good I will not beat
a dead horse on this one. I assume you meant "serious issues" with
failures, rather than semi-political timing and choice issues.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: 2.4.19pre1aa1
From: Mike Fedyk @ 2002-03-01 1:30 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Andrea Arcangeli, linux-kernel

On Thu, Feb 28, 2002 at 05:11:25PM -0500, Bill Davidsen wrote:
> I have enjoyed using your -aa patches (and run child first) for some
> time, and Rik's rmap patches as well. However, I find that for some
> machines your stuff works clearly better, particularly larger memory
> machines, and for some rmap is clearly more responsive, particularly for
> small machines under heavy memory pressure.
>
> The point is that choice is good, and having two solutions to address
> various machines is a good thing, even if the convenience isn't all that
> great. That being said, I fear that if your solution gets pushed into
> mainline it will preempt other solutions. And my testing tells me that
> there is no one solution here, even with all the tuning in your VM,
> using the hints you gave me.

The problem here is that currently the mainline kernel makes some bad
decisions in the VM, and -aa is the solution in this case. When -aa is
merged, you will still have both solutions: one in mainline, one as a
patch (rmap).

Linus has already changed the VM once in 2.4, and I don't really see
another large VM change (rmap in 2.4) happening again.

Rmap looks promising for a 2.5 merge after several issues are overcome
(pte-highmem, etc).

Mike
* Re: 2.4.19pre1aa1
From: Bill Davidsen @ 2002-03-01 3:26 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Andrea Arcangeli, linux-kernel

On Thu, 28 Feb 2002, Mike Fedyk wrote:

> The problem here is that currently the mainline kernel makes some bad
> decisions in the VM, and -aa is the solution in this case. When -aa is
> merged, you will still have both solutions: one in mainline, one as a
> patch (rmap).
>
> Linus has already changed the VM once in 2.4, and I don't really see
> another large VM change (rmap in 2.4) happening again.
>
> Rmap looks promising for a 2.5 merge after several issues are overcome
> (pte-highmem, etc).

I do understand what happens in the VM currently... And as noted, I run
both -aa kernels and rmap on different machines. But -aa runs better on
large machines and rmap better on small machines with memory pressure (my
experience), so blessing one and making the other "only a patch" troubles
me somewhat. I hate to say "compete" as VM solutions, but they both solve
the same problem, with more success in one field or another.

If either is adopted, the pressure will be off to improve in the areas
where one or the other is weak. Once the decision is made, that won't
happen. And if rmap is a large VM change, what then is Andrea's code?
Large isn't just the size of the patch, it is to some extent the size of
the behaviour change.

For me it makes little difference; I like to play with kernels, and I'm
hoping for the source which needs only numbers in /proc/sys to tune,
rather than patches. But there are a lot more small machines (which I
feel are better served by rmap) than large. I would like to leave the
jury out a little longer on this.

I was looking for opinions; thank you for sharing yours!
--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: 2.4.19pre1aa1
From: Mike Fedyk @ 2002-03-01 3:46 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-kernel

On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> experience), so blessing one and making the other "only a patch"
> troubles me somewhat. I hate to say "compete" as VM solutions, but they
> both solve the same problem, with more success in one field or another.
>
> If either is adopted, the pressure will be off to improve in the areas
> where one or the other is weak. Once the decision is made, that won't
> happen.

I sincerely doubt that Rik will slow down at all when parts of -aa are in
the mainline kernel. There is 2.5 to work toward, and 2.4 isn't a lost
cause...

Also, one has already been blessed, way back in 2.4.10-pre11 by Linus. I
don't see any chance of rmap getting into 2.4 before 2.4.27+. Marcelo has
said he wants to see rmap in production in -ac for a while before he
thinks about merging rmap, and that's good IMHO.

> And if rmap is a large VM change, what then is Andrea's code?
> Large isn't just the size of the patch, it is to some extent the size
> of the behavior change.

True, and by that token, rmap would be the larger change in behavior (not
swapping on disk accesses, etc ;).

> For me it makes little difference; I like to play with kernels, and I'm
> hoping for the source which needs only numbers in /proc/sys to tune,
> rather than patches. But there are a lot more small machines (which I
> feel are better served by rmap) than large. I would like to leave the
> jury out a little longer on this.

Look at it another way: by forcing Andrea to send it in as small chunks
with descriptions, we may finally get a documented -aa VM. ;) So, let's
watch and see that happen.
I don't see anyone benefiting from having *both* of the VM enhancements
as external patches.

> I was looking for opinions; thank you for sharing yours!

You will certainly find that here. ;)
* Re: 2.4.19pre1aa1
From: Rik van Riel @ 2002-03-01 12:51 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Bill Davidsen, linux-kernel

On Thu, 28 Feb 2002, Mike Fedyk wrote:

> Look at it another way: by forcing Andrea to send it in as small
> chunks with descriptions, we may finally get a documented -aa VM. ;)
> So, let's watch and see that happen.

That would be the preferred way. There must be some good stuff hidden in
-aa, but it won't turn into maintainable code just by merging stuff into
the kernel.

It'll turn into maintainable code by having it merged in small,
documented pieces.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
  -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/
* Re: 2.4.19pre1aa1
From: Mike Fedyk @ 2002-03-01 18:37 UTC (permalink / raw)
To: Rik van Riel; +Cc: Bill Davidsen, linux-kernel

On Fri, Mar 01, 2002 at 09:51:54AM -0300, Rik van Riel wrote:
> That would be the preferred way. There must be some good stuff
> hidden in -aa, but it won't turn into maintainable code just by
> merging stuff into the kernel.
>
> It'll turn into maintainable code by having it merged in small,
> documented pieces.

Let me see... "small chunks with descriptions". I think we're saying the
same thing. What we need is to have those descriptions in the patches
sent to Marcelo; that way the docs are in the code...

Mike
* Re: 2.4.19pre1aa1
From: Marco Colombo @ 2002-03-01 10:17 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-kernel

On Thu, 28 Feb 2002, Bill Davidsen wrote:

> I do understand what happens in the VM currently... And as noted, I run
> both -aa kernels and rmap on different machines. But -aa runs better on
> large machines and rmap better on small machines with memory pressure
> (my experience), so blessing one and making the other "only a patch"
> troubles me somewhat. I hate to say "compete" as VM solutions, but they
> both solve the same problem, with more success in one field or another.

2.4 VM is Andrea's. There's no competition. I see the current -aa VM
patches just as maintenance, which is performed outside the mainline for
good reasons. As soon as Andrea is satisfied with testing, -aa will be
integrated into Marcelo's 2.4. This is just part of the maintenance and
evolution of a VM which admittedly was quite "young" when it was
included.

OTOH, Red Hat 2.4 kernels are still based on Rik's, AFAIK. I bet they'll
be running 2.4-rmap sooner or later.
Red Hat has a long history of running kernels with non-standard features
(RAID 0.90 comes to mind). So maybe there *is* competition, but on the
vendor side only. I do hope vanilla 2.4 VM will be -aa forever (but I'll
be running RH-provided kernels most of the time - I like them).

.TM.
* Re: 2.4.19pre1aa1
From: Alan Cox @ 2002-03-01 11:37 UTC (permalink / raw)
To: Marco Colombo; +Cc: Bill Davidsen, linux-kernel

> OTOH, Red Hat 2.4 kernels are still based on Rik's, AFAIK. I bet they'll

The RH 2.4.7-9 kernels are based on the stuff Rik wanted to try in 2.4,
which Linus played with, used once, and then ignored chunks of. Think of
it as 2.4.Rik VM, but not rmap.

For the future we'll evaluate all sorts of options for our customers to
see what is best to deliver - that's our job.

Alan
* Re: 2.4.19pre1aa1
From: Andrea Arcangeli @ 2002-03-02 2:06 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Mike Fedyk, linux-kernel

On Thu, Feb 28, 2002 at 10:26:48PM -0500, Bill Davidsen wrote:
> rather than patches. But there are a lot more small machines (which I
> feel are better served by rmap) than large. I would like to leave the
> jury out

I think there's quite some confusion going on from the rmap users; let's
clarify the facts.

The rmap design in the VM is all about decreasing the complexity of
swap_out on huge boxes (so it's all about saving CPU), by slowing down a
lot of fast common paths like page faults, and by paying with some memory
too. See the lmbench numbers posted by Randy after applying rmap to see
what I mean.

On a very low-memory machine the rmap design shouldn't really make a
noticeable difference: the smaller the amount of mapped VM, the less
difference rmap can make, period. So I wouldn't really worry about the
low-memory machines.
I guess what makes the difference for you (the responsiveness part) is
things like read-latency2, included at least in some variants of the rmap
patch, but those are completely orthogonal to the VM. They're included in
the rmap patch just incidentally; the rmap patch isn't just about the
rmap design, it's lots of other stuff too. Please don't mistake this for
blame. I would prefer it kept separated so people wouldn't be confused
into thinking rmap gives the responsiveness on the low-memory boxes, but
I'm also not perfect at maintaining patches sometimes; see vm-28, it does
more than just one thing, even if they're at least all VM-related things.

Note that I'm listening to the rmap design too, and Rik's implementation
should be better than the last one I saw from Dave last year, but I
really am not going to slow down page faults and other paths just to save
CPU during heavy swapout in 2.4; all my machines are mostly idle during
heavy swapout/pageout anyway.

For 2.5 it would be easy to integrate just the rmap design from Rik's
patch on top of my vm-28; as far as the design is concerned it's
orthogonal to all the other changes I'm doing, but the very visible
lmbench slowdowns for lots of the important common paths haven't made it
appealing to me yet (first somebody has to show me the total wastage of
CPU during swapout with my current patch applied - I mean the last column
on the right of vmstat).

So in short, you may want to try 2.4.19pre1 + vm-28 + read-latency2 (or
even more simply 2.4.19pre1aa1 + read-latency2) and see if it makes the
system as responsive as rmap for you on the low-memory boxes. Let us know
if it helps, thanks!

IMHO vm-28 should somehow be included into mainline ASAP (before 2.4.19
is released); then again IMHO we can forget about the 2.4 VM and it will
be definitely finished.

Andrea
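Andrea's complexity argument (swap_out cost growing with total mapped VM
rather than with the pages actually being reclaimed) can be shown with a
toy model. This is not kernel code; the constants and the function name
are purely illustrative stand-ins:

```c
#include <assert.h>

/*
 * Toy model (not kernel code) of the complexity trade-off under
 * discussion.  To unmap one physical page, a 2.4-style virtual scan
 * walks every pte in every address space, so the work grows with the
 * total mapped VM.  A reverse map would visit only the ptes that
 * actually map the page.
 */

#define NR_MM  4      /* number of address spaces in the model */
#define NR_PTE 1024   /* ptes per address space */

static int pte[NR_MM][NR_PTE];   /* pte[i][j] = mapped frame, -1 if none */

/* Returns the number of ptes examined while unmapping all mappings of pfn. */
static int virtual_scan_unmap(int pfn)
{
    int scanned = 0;
    for (int i = 0; i < NR_MM; i++)
        for (int j = 0; j < NR_PTE; j++) {
            scanned++;
            if (pte[i][j] == pfn)
                pte[i][j] = -1;      /* unmap this mapping */
        }
    return scanned;
}
```

Even with only two mappings of the victim page, the scan examines all
NR_MM * NR_PTE ptes, where a reverse-map walk would examine two. The
saving only materializes under real memory pressure, which matches
Andrea's point that his mostly-idle machines never see the cost.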
* Re: 2.4.19pre1aa1
From: Alan Cox @ 2002-03-02 2:28 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Bill Davidsen, Mike Fedyk, linux-kernel

> On a very low-memory machine the rmap design shouldn't really make a
> noticeable difference: the smaller the amount of mapped VM, the less
> difference rmap can make, period.

It makes a big big difference on a low-memory box. Try running xfce on a
24Mb box with the base 2.4.18, 2.4.18 + rmap12f and 2.4.18 + aa. That's a
case where aa definitely loses, and without other I/O patches being
applied. It's an X11-based workload with a -lot- of shared pages. Both
rmap and aa materially outperform 2.4.18 base on this workload (and
2.4.17 blew up with out-of-memory errors).

> IMHO vm-28 should somehow be included into mainline ASAP (before 2.4.19
> is released); then again IMHO we can forget about the 2.4 VM and it
> will be definitely finished.

With luck 8) VM is never finished 8(

Alan
* Re: 2.4.19pre1aa1
From: Andrea Arcangeli @ 2002-03-02 3:30 UTC (permalink / raw)
To: Alan Cox; +Cc: Bill Davidsen, Mike Fedyk, linux-kernel

On Sat, Mar 02, 2002 at 02:28:20AM +0000, Alan Cox wrote:
> > On a very low-memory machine the rmap design shouldn't really make a
> > noticeable difference: the smaller the amount of mapped VM, the less
> > difference rmap can make, period.
>
> It makes a big big difference on a low-memory box. Try running xfce on
> a 24Mb box with the base 2.4.18, 2.4.18 + rmap12f and 2.4.18 + aa.
> That's a case where aa definitely loses, and without other I/O patches
> being applied.

Hmm, to fully evaluate this I'd need access to the exact two kernel
source tarballs that you compared (a diff against a known vanilla kernel
tree would be fine), and to know how you measured the difference between
them while xfce was running (nominal
performance/responsiveness/whatever?).

Andrea
* Re: 2.4.19pre1aa1
From: Daniel Phillips @ 2002-03-03 21:38 UTC (permalink / raw)
To: Andrea Arcangeli, Bill Davidsen; +Cc: Mike Fedyk, linux-kernel

On March 2, 2002 03:06 am, Andrea Arcangeli wrote:
> I think there's quite some confusion going on from the rmap users;
> let's clarify the facts.
>
> The rmap design in the VM is all about decreasing the complexity of
> swap_out on huge boxes (so it's all about saving CPU), by slowing down
> a lot of fast common paths like page faults, and by paying with some
> memory too. See the lmbench numbers posted by Randy after applying rmap
> to see what I mean.

Do you know any reason why rmap must slow down the page fault fast path,
or are you just thinking about Rik's current implementation? Yes, rmap
has to add a pte_chain entry there, but it can be a direct pointer in the
unshared case, and the spinlock looks like it can be avoided in the
common case as well.

--
Daniel
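Daniel's direct-pointer idea can be sketched in outline: a page mapped by
exactly one pte stores that pte pointer directly, and a chain is only
allocated once the page becomes shared. This is a hypothetical userspace
illustration of the technique, not code from any rmap patch; all names
(struct pte_chain, page_add_rmap, the direct flag) are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

struct pte { unsigned long val; };   /* stand-in for a page table entry */

struct pte_chain {
    struct pte *ptep;
    struct pte_chain *next;
};

struct page {
    int direct;                  /* 1: union holds a single pte pointer */
    union {
        struct pte *ptep;        /* unshared case: no allocation needed */
        struct pte_chain *chain; /* shared case: list of mapping ptes */
    } u;
};

/* Called when a pte is made to point at page (the page-fault path). */
static void page_add_rmap(struct page *page, struct pte *ptep)
{
    if (page->direct && !page->u.ptep) {
        page->u.ptep = ptep;     /* first mapping: stays direct */
        return;
    }
    if (page->direct) {          /* second mapping: degrade to a chain */
        struct pte_chain *first = malloc(sizeof(*first));
        first->ptep = page->u.ptep;
        first->next = NULL;
        page->direct = 0;
        page->u.chain = first;
    }
    struct pte_chain *elem = malloc(sizeof(*elem));
    elem->ptep = ptep;           /* prepend the new mapping */
    elem->next = page->u.chain;
    page->u.chain = elem;
}
```

In this scheme the unshared fast path costs one pointer store and no
allocation, which is the crux of Daniel's question; Andrea's reply below
disputes how common the unshared case actually is.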
* Re: 2.4.19pre1aa1
From: Andrea Arcangeli @ 2002-03-04 0:49 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Bill Davidsen, Mike Fedyk, linux-kernel

On Sun, Mar 03, 2002 at 10:38:34PM +0100, Daniel Phillips wrote:
> Do you know any reason why rmap must slow down the page fault fast
> path, or are you just thinking about Rik's current implementation?
> Yes, rmap has to add a pte_chain entry there, but it can be a direct
> pointer in the unshared case, and the spinlock looks like it can be
> avoided in the common case as well.

Unshared isn't the very common case (shm and file mappings like
executables are all going to be shared, not unshared). So unless you
first share all the pagetables as well (as Ben once suggested years ago),
it's not going to be a direct pointer in the very common case. And
there's no guarantee you can share the pagetable (even assuming the
kernel supports that to the maximum possible degree, across execve and at
random mmaps too) if you map those pages at different virtual addresses.

Andrea
* Re: 2.4.19pre1aa1
From: Daniel Phillips @ 2002-03-04 1:46 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Bill Davidsen, Mike Fedyk, linux-kernel

On March 4, 2002 01:49 am, Andrea Arcangeli wrote:
> Unshared isn't the very common case (shm and file mappings like
> executables are all going to be shared, not unshared).

As soon as you have shared pages you start to benefit from rmap's ability
to unmap in one step, so the cost of creating the link is recovered by
not having to scan two page tables to unmap it. In theory. Do you see a
hole in that?
> So unless you first share all the pagetables as well (as Ben once
> suggested years ago), it's not going to be a direct pointer in the very
> common case. And there's no guarantee you can share the pagetable (even
> assuming the kernel supports that to the maximum possible degree,
> across execve and at random mmaps too) if you map those pages at
> different virtual addresses.

The virtual alignment just needs to be the same modulo 4 MB. There are
other requirements as well, but being able to share seems to be the
common case.

--
Daniel
* Re: 2.4.19pre1aa1
From: Andrea Arcangeli @ 2002-03-04 2:25 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Bill Davidsen, Mike Fedyk, linux-kernel

On Mon, Mar 04, 2002 at 02:46:22AM +0100, Daniel Phillips wrote:
> As soon as you have shared pages you start to benefit from rmap's
> ability to unmap in one step, so the cost of creating the link is
> recovered by not

We'd benefit with unshared pages too.
BTW, for the map-shared mappings we just collect the rmap information; we
need it for vmtruncate, but it's not laid out for efficient browsing,
it's only meant to make vmtruncate work.

> having to scan two page tables to unmap it. In theory. Do you see a
> hole in that?

Just the fact that you never need the reverse lookup during lots of
important production usages (the first that comes to mind is when you
have enough RAM to do your job: all number crunching/fileserving, and
most servers are set up that way). This is the whole point. Note that
this has nothing to do with the "cache" part; this is only about the
pageout/swapout stage, and only a few servers really need heavy swapout.
The background swapout that keeps unused services from sitting in RAM
forever doesn't care whether the design is rmap or not.

And in the other case (heavy swapout/pageouts, as in some hard DBMS
usage, simulations, and laptops or legacy desktops) we would mostly save
CPU and reduce complexity, but I really don't see system load during
heavy pageouts/swapouts yet, so I don't see an obvious need to save CPU
there either. Probably the main difference visible in numbers would in
fact be following a perfect LRU, but really, giving mapped pages a higher
chance is beneficial.

Another bit of the current design, the round-robin cycling over the whole
VM clearing the accessed bitflag and activating physical pages if needed,
can also be seen as a feature in some ways. It is much better at
providing a kind of "clock-based" aging of the accessed-bit information,
while an rmap-aware LRU pass wouldn't really be fair to all the virtual
pages the same way we are now.

> So unless you first share all the pagetables as well (as Ben once
> suggested years ago), it's not going to be a direct pointer in the very
> common case.
> And there's no guarantee you can share the pagetable (even assuming the
> kernel supports that to the maximum possible degree, across execve and
> at random mmaps too) if you map those pages at different virtual
> addresses.
>
> The virtual alignment just needs to be the same modulo 4 MB. There are
> other requirements as well, but being able to share seems to be the
> common case.

Yep, on x86 without PAE. With PAE enabled (or an x86-64 kernel) it needs
to be the same layout of physical pages in a naturally aligned 2M chunk.
I trust that will match often in theory, but still, tracking it down over
execve and on random mmaps doesn't look that easy; I think for tracking
that down we'd really need the rmap information for everything (not just
map-shared as right now). And doing all the checks and walking the
reverse maps won't be zero cost either, but I can see the benefit of the
full pte sharing (starting from CPU cache utilization across TLB
flushes).

In fact rmap may be more useful for things like enabling the full
pagetable sharing you're suggesting above, rather than for replacing the
swap_out round-robin cycle over the VM; so it might end up used only for
MM internals rather than for VM internals.

Andrea
* Re: 2.4.19pre1aa1
From: Daniel Phillips @ 2002-03-04 3:22 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Bill Davidsen, Mike Fedyk, linux-kernel

On March 4, 2002 03:25 am, Andrea Arcangeli wrote:
> We'd benefit with unshared pages too.
>
> BTW, for the map-shared mappings we just collect the rmap information;
> we need it for vmtruncate, but it's not laid out for efficient
> browsing, it's only meant to make vmtruncate work.

Sorry, transmission error, what did you mean?

> > having to scan two page tables to unmap it. In theory. Do you see a
> > hole in that?
>
> Just the fact that you never need the reverse lookup during lots of
> important production usages (the first that comes to mind is when you
> have enough RAM to do your job: all number crunching/fileserving, and
> most servers are set up that way). This is the whole point. Note that
> this has nothing to do with the "cache" part; this is only about the
> pageout/swapout stage, and only a few servers really need heavy
> swapout.

You always have to unmap the page at some point, so you win back the cost
of creating the pte_chain there, hopefully. You could argue that paying
the cost up front makes latency a little worse. You might have trouble
measuring that though.

> Another bit of the current design, the round-robin cycling over the
> whole VM clearing the accessed bitflag and activating physical pages if
> needed, can also be seen as a feature in some ways.
It is > much better at providing a kind of "clock based" aging to the accessed > bit information, while an rmap-aware lru pass wouldn't really be fair > to all the virtual pages the way we are now. You get a perfectly good clock by scanning the lru list. It's not totally fair because a page newly promoted from the cold end to the hot end of the list will get scanned again after a much shorter delta-t, but it's hard to see why that's bad. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 2:25 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 3:22 ` 2.4.19pre1aa1 Daniel Phillips @ 2002-03-04 12:41 ` Rik van Riel 2002-03-04 14:05 ` 2.4.19pre1aa1 Andrea Arcangeli 1 sibling, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-03-04 12:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > having to scan two page tables to unmap it. In theory. Do you see a hole > > in that? > > Just the fact you never need the reverse lookup during lots of > important production usages (first that comes to mind is when you have > enough ram to do your job, all number crunching/fileserving, and most > servers are set up that way). This is the whole point. Note that this > has nothing to do with the "cache" part, this is only about the > pageout/swapout stage, only a few servers really need heavy swapout. Ahhh, but it's not necessarily about making this common case better. It's about making sure Linux doesn't die horribly in some worst cases. The case of "system has more than enough memory" won't suffer with -rmap anyway since the amount of activity in the VM part of the system will be relatively low. > And on the other case (heavy swapout/pageouts like in some hard DBMS > usage, simulations and laptops or legacy desktops) we would mostly save > CPU and reduce complexity, but I really don't see system load during > heavy pageouts/swapouts yet, so I don't see an obvious need to save cpu > there either. The thing here is that -rmap is able to easily balance the reclaiming of cache with the swapout of anonymous pages. Even though you tried to get rid of the magic numbers in the old VM when you introduced your changes, you're already back up to 4 magic numbers for the cache/swapout balancing. This is not your fault, being difficult to balance is just a fundamental property of the partially physical, partially virtual scanning.
> In fact, maybe rmap will be more useful for things like enabling the full > pagetable sharing you're suggesting above rather than for replacing the > swap_out round-robin cycle over the VM, so it might be used only for MM > internals rather than for VM internals. Sharing is quite a can of worms, it might be easier to just use 4MB (or 2MB) pages for database shared memory segments and VMAs where programs want large pages. That will get rid of both the page tables (and associated locking) and the alignment constraints. regards, Rik -- "Linux holds advantages over the single-vendor commercial OS" -- Microsoft's "Competing with Linux" document http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 12:41 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-04 14:05 ` Andrea Arcangeli 2002-03-04 14:23 ` 2.4.19pre1aa1 Rik van Riel 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-04 14:05 UTC (permalink / raw) To: Rik van Riel; +Cc: Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 09:41:40AM -0300, Rik van Riel wrote: > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > > > having to scan two page tables to unmap it. In theory. Do you see a hole > > > in that? > > > > Just the fact you never need the reverse lookup during lots of > > important production usages (first that comes to mind is when you have > > enough ram to do your job, all number crunching/fileserving, and most > > servers are set up that way). This is the whole point. Note that this > > has nothing to do with the "cache" part, this is only about the > > pageout/swapout stage, only a few servers really need heavy swapout. > > Ahhh, but it's not necessarily about making this common case > better. It's about making sure Linux doesn't die horribly in > some worst cases. rmap is only about making pageout/swapout activities more efficient, there's no stability issue to solve as far as I can tell. > The case of "system has more than enough memory" won't suffer > with -rmap anyway since the amount of activity in the VM part > of the system will be relatively low. I don't see anything significant to save in that area. During heavy paging the system load is something like 1/2% of the cpu. > > And on the other case (heavy swapout/pageouts like in some hard DBMS > > usage, simulations and laptops or legacy desktops) we would mostly save > > CPU and reduce complexity, but I really don't see system load during > > heavy pageouts/swapouts yet, so I don't see an obvious need to save cpu > > there either. > > The thing here is that -rmap is able to easily balance the > reclaiming of cache with the swapout of anonymous pages.
> > Even though you tried to get rid of the magic numbers in > the old VM when you introduced your changes, you're already > back up to 4 magic numbers for the cache/swapout balancing. > > This is not your fault, being difficult to balance is just > a fundamental property of the partially physical, partially > virtual scanning. Those numbers also control how aggressive the swap_out pass is. That is partly a feature, I think. Do you plan to unmap and put anonymous pages into the swapcache when you reach them in the inactive lru, even though you may have 99% of ram in freeable cache? I think you'll still need some number/heuristic to know when the lru pass should start aggressively unmapping and paging out stuff. So I believe this issue about the "number thing" is quite unrelated to the complexity reduction of the paging algorithm with the removal of the swap_out pass. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 14:05 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 14:23 ` Rik van Riel 2002-03-04 16:10 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 16:59 ` 2.4.19pre1aa1 Martin J. Bligh 0 siblings, 2 replies; 77+ messages in thread From: Rik van Riel @ 2002-03-04 14:23 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > On Mon, Mar 04, 2002 at 09:41:40AM -0300, Rik van Riel wrote: > > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > > > > has nothing to do with the "cache" part, this is only about the > > > > pageout/swapout stage, only a few servers really need heavy swapout. > > > > > > Ahhh, but it's not necessarily about making this common case > > > better. It's about making sure Linux doesn't die horribly in > > > some worst cases. > > > > rmap is only about making pageout/swapout activities more efficient, > > there's no stability issue to solve as far as I can tell. > > Not stability per se, but you have to admit the VM tends to behave badly when there's a shortage in just one memory zone. I believe NUMA will only make this situation worse. It helps a lot when the VM can just free pages from those zones where it has a memory shortage and skip scanning the others. > > The case of "system has more than enough memory" won't suffer > > with -rmap anyway since the amount of activity in the VM part > > of the system will be relatively low. > > I don't see anything significant to save in that area. During heavy > paging the system load is something like 1/2% of the cpu. During heavy paging you don't really care about how much system time the VM takes (within reasonable limits, of course), instead you care about how well the VM chooses which pages to swap out and which pages to keep in RAM.
> > > And on the other case (heavy swapout/pageouts like in some hard DBMS > > > usage, simulations and laptops or legacy desktops) we would mostly save > > > CPU and reduce complexity, but I really don't see system load during > > > heavy pageouts/swapouts yet, so I don't see an obvious need to save cpu > > > there either. > > > > The thing here is that -rmap is able to easily balance the > > reclaiming of cache with the swapout of anonymous pages. > > > > Even though you tried to get rid of the magic numbers in > > the old VM when you introduced your changes, you're already > > back up to 4 magic numbers for the cache/swapout balancing. > > > > This is not your fault, being difficult to balance is just > > a fundamental property of the partially physical, partially > > virtual scanning. > > Those numbers also control how aggressive the swap_out pass is. That is > partly a feature, I think. Do you plan to unmap and put anonymous pages > into the swapcache when you reach them in the inactive lru, even though you > may have 99% of ram in freeable cache? I think you'll still need some > number/heuristic to know when the lru pass should start aggressively > unmapping and paging out stuff. So I believe this issue about the "number > thing" is quite unrelated to the complexity reduction of the paging > algorithm with the removal of the swap_out pass. It's harder to balance a combined virtual/physical scanning VM than it is to balance a pure physical scanning VM. I do have some tunables planned for -rmap, but those will be more along the lines of a switch called "defer_swapout" which the user can switch on or off. No need for the user to know how the VM works internally, the VM has enough info to work out the details by itself. regards, Rik -- "Linux holds advantages over the single-vendor commercial OS" -- Microsoft's "Competing with Linux" document http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 14:23 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-04 16:10 ` Andrea Arcangeli 2002-03-04 16:28 ` 2.4.19pre1aa1 Rik van Riel 2002-03-04 16:59 ` 2.4.19pre1aa1 Martin J. Bligh 1 sibling, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-04 16:10 UTC (permalink / raw) To: Rik van Riel; +Cc: Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 11:23:57AM -0300, Rik van Riel wrote: > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > On Mon, Mar 04, 2002 at 09:41:40AM -0300, Rik van Riel wrote: > > > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > > > > > > has nothing to do with the "cache" part, this is only about the > > > > pageout/swapout stage, only a few servers really need heavy swapout. > > > > > > Ahhh, but it's not necessarily about making this common case > > > better. It's about making sure Linux doesn't die horribly in > > > some worst cases. > > > > rmap is only about making pageout/swapout activities more efficient, > > there's no stability issue to solve as far as I can tell. > > Not stability per se, but you have to admit the VM tends to > behave badly when there's a shortage in just one memory zone. I don't think it behaves badly, and I don't see how rmap can help there except for saving some cpu. The major O(N) complexity when working on the lower zones is in passing over/ignoring the pages in the higher zones and that's at the page layer, so rmap will make no difference there and the complexity will remain O(N) (where N is the number of higher-zone pages). > It helps a lot when the VM can just free pages from those > zones where it has a memory shortage and skip scanning the > others. I'm not scanning the other/unrelated pagetables just now. > > > The case of "system has more than enough memory" won't suffer > > > with -rmap anyway since the amount of activity in the VM part > > > of the system will be relatively low. > > > > I don't see anything significant to save in that area.
During heavy > > paging the system load is something like 1/2% of the cpu. > > During heavy paging you don't really care about how much system > time the VM takes (within reasonable limits, of course), instead yes, this is why rmap isn't making a noticeable difference in the heavy swap case either. > you care about how well the VM chooses which pages to swap out > and which pages to keep in RAM. and for that the aging fair scan for the accessed bitflag has a chance to be better than the unfair accessed bit handling in rmap, which can lead to incorrectly evaluating the accessed-virtual-age of the pages. Also treating mapped pages in a special manner is beneficial. > > > > And on the other case (heavy swapout/pageouts like in some hard DBMS > > > > usage, simulations and laptops or legacy desktops) we would mostly save > > > > CPU and reduce complexity, but I really don't see system load during > > > > heavy pageouts/swapouts yet, so I don't see an obvious need to save cpu > > > > there either. > > > > > > The thing here is that -rmap is able to easily balance the > > > reclaiming of cache with the swapout of anonymous pages. > > > > > > Even though you tried to get rid of the magic numbers in > > > the old VM when you introduced your changes, you're already > > > back up to 4 magic numbers for the cache/swapout balancing. > > > > > > This is not your fault, being difficult to balance is just > > > a fundamental property of the partially physical, partially > > > virtual scanning. > > > > Those numbers also control how aggressive the swap_out pass is. That is > > partly a feature, I think. Do you plan to unmap and put anonymous pages > > into the swapcache when you reach them in the inactive lru, even though you > > may have 99% of ram in freeable cache? I think you'll still need some > > number/heuristic to know when the lru pass should start aggressively > > unmapping and paging out stuff.
So I believe this issue about the "number > > thing" is quite unrelated to the complexity reduction of the paging > > algorithm with the removal of the swap_out pass. > > It's harder to balance a combined virtual/physical scanning VM > than it is to balance a pure physical scanning VM. Depends, you may have to do similar things to balance rmap. The point again is "when to start unmapping stuff". Once you tune right and choose "ok, go ahead and unmap" at the right time, it basically doesn't matter if you do that by calling swap_out or if you try to unmap the current page in the lru. Plus swap_out will be fair, and mapped pages will automatically get a longer lifetime than unmapped pages like plain fs cache, both of which sound like positives. Plus rmap hurts the common fast paths, i.e. when no heavy swapout is needed, like in most servers out there. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 16:10 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 16:28 ` Rik van Riel 0 siblings, 0 replies; 77+ messages in thread From: Rik van Riel @ 2002-03-04 16:28 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > you care about how well the VM chooses which pages to swap out > > and which pages to keep in RAM. > > and for that the aging fair scan for the accessed bitflag has a chance to > be better than the unfair accessed bit handling in rmap, which can lead to > incorrectly evaluating the accessed-virtual-age of the pages. Ummm, what do you mean by this? > Also treating mapped pages in a special manner is beneficial. Note that -rmap already does this. regards, Rik -- "Linux holds advantages over the single-vendor commercial OS" -- Microsoft's "Competing with Linux" document http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 14:23 ` 2.4.19pre1aa1 Rik van Riel 2002-03-04 16:10 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 16:59 ` Martin J. Bligh 2002-03-04 18:18 ` 2.4.19pre1aa1 Stephan von Krawczynski 2002-03-04 18:19 ` 2.4.19pre1aa1 Andrea Arcangeli 1 sibling, 2 replies; 77+ messages in thread From: Martin J. Bligh @ 2002-03-04 16:59 UTC (permalink / raw) To: Rik van Riel, Andrea Arcangeli Cc: Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel > Not stability per se, but you have to admit the VM tends to > behave badly when there's a shortage in just one memory zone. > I believe NUMA will only make this situation worse. rmap would seem to buy us (at least) two major things for NUMA: 1) We can balance between zones more easily by "swapping out" pages to another zone. 2) We can do local per-node scanning - no need to bounce information to and fro across the interconnect just to see what's worth swapping out. I suspect that the performance of NUMA under memory pressure without the rmap stuff will be truly horrific, as we descend into a cache-thrashing page transfer war. I can't see any way to fix this without some sort of rmap - any other suggestions as to how this might be done? Thanks, Martin. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 16:59 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-04 18:18 ` Stephan von Krawczynski 2002-03-04 18:41 ` 2.4.19pre1aa1 Stephan von Krawczynski ` (2 more replies) 2002-03-04 18:19 ` 2.4.19pre1aa1 Andrea Arcangeli 1 sibling, 3 replies; 77+ messages in thread From: Stephan von Krawczynski @ 2002-03-04 18:18 UTC (permalink / raw) To: Martin J. Bligh; +Cc: riel, andrea, phillips, davidsen, mfedyk, linux-kernel On Mon, 04 Mar 2002 08:59:10 -0800 "Martin J. Bligh" <Martin.Bligh@us.ibm.com> wrote: > 2) We can do local per-node scanning - no need to bounce > information to and fro across the interconnect just to see what's > worth swapping out. Well, you can achieve this by "attaching" the nodes' local memory (zone) to its cpu and let the vm work preferably only on these attached zones (regarding the list scanning and the like). This way you have no interconnect traffic generated. But this is in no way related to rmap. > I suspect that the performance of NUMA under memory pressure > without the rmap stuff will be truly horrific, as we descend into > a cache-thrashing page transfer war. I guess you are right for the current implementation, but I doubt rmap will be a _real_ solution to your problem. > I can't see any way to fix this without some sort of rmap - any > other suggestions as to how this might be done? As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP, SMP and NUMA (maybe even for cluster). UP=every zone is one or more preferred zone(s) SMP=every zone is one or more preferred zone(s) NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary. cluster=every cpu has one or more preferred zone(s), but cannot walk the whole zone-list. Preference is implemented as a simple list of cpu-ids attached to every memory zone. This is for being able to see the whole picture.
Every cpu has a private list of (preferred) zones which is used by vm for the scanning jobs (swap et al). This way there is no need to touch the interconnect. If you are really in a bad situation you can always go back to the global list and do whatever is needed. This sounds pretty scalable and runtime-configurable. And not related to rmap... Beat me, Stephan PS: Drop clusters from the discussion, I know this would become weird. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:18 ` 2.4.19pre1aa1 Stephan von Krawczynski @ 2002-03-04 18:41 ` Stephan von Krawczynski 2002-03-04 18:46 ` 2.4.19pre1aa1 Martin J. Bligh 2002-03-04 21:37 ` 2.4.19pre1aa1 Rik van Riel 2 siblings, 0 replies; 77+ messages in thread From: Stephan von Krawczynski @ 2002-03-04 18:41 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Martin.Bligh, riel, andrea, phillips, davidsen, mfedyk, linux-kernel On Mon, 4 Mar 2002 19:18:04 +0100 Stephan von Krawczynski <skraw@ithnet.com> wrote: > As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster). > UP=every zone is one or more preferred zone(s) correct: UP=all zones are preferred zones for the single CPU > SMP=every zone is one or more preferred zone(s) correct: SMP=all zones are preferred zones for all CPUs Regards, Stephan ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:18 ` 2.4.19pre1aa1 Stephan von Krawczynski 2002-03-04 18:41 ` 2.4.19pre1aa1 Stephan von Krawczynski @ 2002-03-04 18:46 ` Martin J. Bligh 2002-03-04 22:06 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 21:37 ` 2.4.19pre1aa1 Rik van Riel 2 siblings, 1 reply; 77+ messages in thread From: Martin J. Bligh @ 2002-03-04 18:46 UTC (permalink / raw) To: Stephan von Krawczynski Cc: riel, andrea, phillips, davidsen, mfedyk, linux-kernel >> 2) We can do local per-node scanning - no need to bounce >> information to and fro across the interconnect just to see what's >> worth swapping out. > > Well, you can achieve this by "attaching" the nodes' local memory > (zone) to its cpu and let the vm work preferably only on these attached > zones (regarding the list scanning and the like). This way you have no > interconnect traffic generated. But this is in no way related to rmap. > >> I can't see any way to fix this without some sort of rmap - any >> other suggestions as to how this might be done? > > As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster). > UP=every zone is one or more preferred zone(s) > SMP=every zone is one or more preferred zone(s) > NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary. > > Preference is implemented as simple list of cpu-ids attached to every > memory zone. This is for being able to see the whole picture. Every > cpu has a private list of (preferred) zones which is used by vm for the > scanning jobs (swap et al). This way there is no need to touch interconnection. > If you are really in a bad situation you can alway go back to the global > list and do whatever is needed. As I understand the current code (ie this may be totally wrong ;-) ) I think we already pretty much have what you're suggesting. 
There's one (or more) zone per node chained off the pgdata_t, and during memory allocation we try to scan through the zones attached to the local node first. The problem seems to me to be that the way we currently do swap-out scanning is virtual, not physical, and thus cannot be per zone => per node. Am I totally missing your point here? Thanks, Martin. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:46 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-04 22:06 ` Andrea Arcangeli 2002-03-04 23:03 ` 2.4.19pre1aa1 Samuel Ortiz ` (2 more replies) 0 siblings, 3 replies; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-04 22:06 UTC (permalink / raw) To: Martin J. Bligh Cc: Stephan von Krawczynski, riel, phillips, davidsen, mfedyk, linux-kernel On Mon, Mar 04, 2002 at 10:46:54AM -0800, Martin J. Bligh wrote: > >> 2) We can do local per-node scanning - no need to bounce > >> information to and fro across the interconnect just to see what's > >> worth swapping out. > > > > Well, you can achieve this by "attaching" the nodes' local memory > > (zone) to its cpu and let the vm work preferably only on these attached > > zones (regarding the list scanning and the like). This way you have no > > interconnect traffic generated. But this is in no way related to rmap. > > > >> I can't see any way to fix this without some sort of rmap - any > >> other suggestions as to how this might be done? > > > > As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster). > > UP=every zone is one or more preferred zone(s) > > SMP=every zone is one or more preferred zone(s) > > NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary. > > > > Preference is implemented as simple list of cpu-ids attached to every > > memory zone. This is for being able to see the whole picture. Every > > cpu has a private list of (preferred) zones which is used by vm for the > > scanning jobs (swap et al). This way there is no need to touch interconnection. > > If you are really in a bad situation you can alway go back to the global > > list and do whatever is needed. > > As I understand the current code (ie this may be totally wrong ;-) ) I think > we already pretty much have what you're suggesting. 
There's one (or more) > zone per node chained off the pgdata_t, and during memory allocation we > try to scan through the zones attached to the local node first. The problem yes, also make sure to keep this patch from SGI applied, it's very important to avoid memory balancing if there's still free memory in the other zones: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1 It should apply cleanly on top of my vm-28. > seems to me to be that the way we currently do swap-out scanning is virtual, > not physical, and thus cannot be per zone => per node. actually if you do process bindings the ptes should all be allocated local to the node if numa is enabled, and if there's no binding, no matter if you have rmap or not, the ptes can be spread across the whole system (just like the physical pages in the inactive/active lrus, because they're not per-node). Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 22:06 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 23:03 ` Samuel Ortiz 2002-03-05 11:23 ` 2.4.19pre1aa1 Stephan von Krawczynski 2002-03-05 0:12 ` 2.4.19pre1aa1 Rik van Riel 2002-03-05 6:21 ` 2.4.19pre1aa1 Martin J. Bligh 2 siblings, 1 reply; 77+ messages in thread From: Samuel Ortiz @ 2002-03-04 23:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Stephan von Krawczynski, riel, phillips, davidsen, mfedyk, linux-kernel On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > On Mon, Mar 04, 2002 at 10:46:54AM -0800, Martin J. Bligh wrote: > > >> 2) We can do local per-node scanning - no need to bounce > > >> information to and fro across the interconnect just to see what's > > >> worth swapping out. > > > > > > Well, you can achieve this by "attaching" the nodes' local memory > > > (zone) to its cpu and let the vm work preferably only on these attached > > > zones (regarding the list scanning and the like). This way you have no > > > interconnect traffic generated. But this is in no way related to rmap. > > > > > >> I can't see any way to fix this without some sort of rmap - any > > >> other suggestions as to how this might be done? > > > > > > As stated above: try to bring in per-node zones that are preferred by their cpu. This can work equally well for UP,SMP and NUMA (maybe even for cluster). > > > UP=every zone is one or more preferred zone(s) > > > SMP=every zone is one or more preferred zone(s) > > > NUMA=every cpu has one or more preferred zone(s), but can walk the whole zone-list if necessary. > > > > > > Preference is implemented as simple list of cpu-ids attached to every > > > memory zone. This is for being able to see the whole picture. Every > > > cpu has a private list of (preferred) zones which is used by vm for the > > > scanning jobs (swap et al). This way there is no need to touch interconnection. 
> > > If you are really in a bad situation you can always go back to the global > > > list and do whatever is needed. > > > > As I understand the current code (ie this may be totally wrong ;-) ) I think > > we already pretty much have what you're suggesting. There's one (or more) > > zone per node chained off the pgdata_t, and during memory allocation we > > try to scan through the zones attached to the local node first. The problem > > yes, also make sure to keep this patch from SGI applied, it's very > important to avoid memory balancing if there's still free memory in the > other zones: > > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1 This patch is included (in a slightly different form) in the 2.4.17 discontig patch (http://sourceforge.net/projects/discontig). But Martin may need another patch for this one to apply. With the current implementation of __alloc_pages, we have 2 problems: 1) A node is not emptied before moving to the following node 2) If none of the zones on a node have more free pages than min (defined as min += z->pages_low), we start looking on the following node, instead of trying harder on the same node. I have a patch that tries to fix these problems. Of course this patch makes sense only with either the discontig patch or the SGI patch Andrea mentioned applied. I'd appreciate your feedback on this piece of code.
This patch is against 2.4.19-pre2:

--- linux-2.4.19-pre2/mm/page_alloc.c	Mon Mar  4 14:35:27 2002
+++ linux-2.4.19-pre2-sam/mm/page_alloc.c	Mon Mar  4 14:38:53 2002
@@ -339,68 +339,110 @@
  */
 struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
 {
-	unsigned long min;
-	zone_t **zone, * classzone;
+	unsigned long min_low, min_min;
+	zone_t **zone, **current_zone, * classzone, *z;
 	struct page * page;
 	int freed;
-
+	struct pglist_data* current_node;
+
 	zone = zonelist->zones;
-	classzone = *zone;
-	min = 1UL << order;
-	for (;;) {
-		zone_t *z = *(zone++);
-		if (!z)
+	z = *zone;
+	for(;;){
+		/*
+		 * This loops scans all the zones
+		 */
+		min_low = 1UL << order;
+		current_node = z->zone_pgdat;
+		current_zone = zone;
+		classzone = z;
+		do{
+			/*
+			 * This loops scans all the zones of
+			 * the current node.
+			 */
+			min_low += z->pages_low;
+			if (z->free_pages > min_low) {
+				page = rmqueue(z, order);
+				if (page)
+					return page;
+			}
+			z = *(++zone);
+		}while(z && (z->zone_pgdat == current_node));
+		/*
+		 * The node is low on memory.
+		 * If this is the last node, then the
+		 * swap daemon is awaken.
+		 */
+
+		classzone->need_balance = 1;
+		mb();
+		if (!z && waitqueue_active(&kswapd_wait))
+			wake_up_interruptible(&kswapd_wait);
+
+		min_min = 1UL << order;
+
+		/*
+		 * We want to try again in the current node.
+		 */
+		zone = current_zone;
+		z = *zone;
+		do{
+			unsigned long local_min;
+			local_min = z->pages_min;
+			if (!(gfp_mask & __GFP_WAIT))
+				local_min >>= 2;
+			min_min += local_min;
+			if (z->free_pages > min_min) {
+				page = rmqueue(z, order);
+				if (page)
+					return page;
+			}
+			z = *(++zone);
+		}while(z && (z->zone_pgdat == current_node));
+
+		/*
+		 * If we are on the last node, and the current
+		 * process has not the correct flags, then it is
+		 * not allowed to empty the machine.
+		 */
+		if(!z && !(current->flags & (PF_MEMALLOC | PF_MEMDIE)))
 			break;
-		min += z->pages_low;
-		if (z->free_pages > min) {
+		zone = current_zone;
+		z = *zone;
+		do{
 			page = rmqueue(z, order);
 			if (page)
 				return page;
-		}
-	}
-
-	classzone->need_balance = 1;
-	mb();
-	if (waitqueue_active(&kswapd_wait))
-		wake_up_interruptible(&kswapd_wait);
-
-	zone = zonelist->zones;
-	min = 1UL << order;
-	for (;;) {
-		unsigned long local_min;
-		zone_t *z = *(zone++);
-		if (!z)
+			z = *(++zone);
+		}while(z && (z->zone_pgdat == current_node));
+
+		if(!z)
 			break;
-
-		local_min = z->pages_min;
-		if (!(gfp_mask & __GFP_WAIT))
-			local_min >>= 2;
-		min += local_min;
-		if (z->free_pages > min) {
-			page = rmqueue(z, order);
-			if (page)
-				return page;
-		}
 	}
-
-	/* here we're in the low on memory slow path */
-
+
 rebalance:
+	/*
+	 * We were not able to find enough memory.
+	 * Since the swap daemon has been waken up,
+	 * we might be able to find some pages.
+	 * If not, we need to balance the entire memory.
+	 */
+	classzone = *zonelist->zones;
 	if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
 		zone = zonelist->zones;
 		for (;;) {
 			zone_t *z = *(zone++);
 			if (!z)
 				break;
-
+
 			page = rmqueue(z, order);
 			if (page)
 				return page;
 		}
 		return NULL;
 	}
-
+
 	/* Atomic allocations - we can't balance anything */
 	if (!(gfp_mask & __GFP_WAIT))
 		return NULL;
@@ -410,14 +452,14 @@
 		return page;
 
 	zone = zonelist->zones;
-	min = 1UL << order;
+	min_min = 1UL << order;
 	for (;;) {
 		zone_t *z = *(zone++);
 		if (!z)
 			break;
-		min += z->pages_min;
-		if (z->free_pages > min) {
+		min_min += z->pages_min;
+		if (z->free_pages > min_min) {
 			page = rmqueue(z, order);
 			if (page)
 				return page;

^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 23:03 ` 2.4.19pre1aa1 Samuel Ortiz @ 2002-03-05 11:23 ` Stephan von Krawczynski 2002-03-05 17:35 ` 2.4.19pre1aa1 Samuel Ortiz 0 siblings, 1 reply; 77+ messages in thread From: Stephan von Krawczynski @ 2002-03-05 11:23 UTC (permalink / raw) To: Samuel Ortiz Cc: andrea, Martin.Bligh, riel, phillips, davidsen, mfedyk, linux-kernel On Mon, 4 Mar 2002 15:03:19 -0800 (PST) Samuel Ortiz <sortiz@dbear.engr.sgi.com> wrote: > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > yes, also make sure to keep this patch from SGI applied, it's very > > important to avoid memory balancing if there's still free memory in the > > other zones: > > > > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1 > This patch is included (in a slightly different form) in the 2.4.17 > discontig patch (http://sourceforge.net/projects/discontig). > But martin may need another patch to apply. With the current > implementation of __alloc_pages, we have 2 problems : > 1) A node is not emptied before moving to the following node > 2) If none of the zones on a node have more freepages than min(defined as > min+= z->pages_low), we start looking on the following node, instead of > trying harder on the same node. Forgive my ignorance, but aren't these two problems completely identical in a UP or even SMP setup? I mean what is the negative drawback in your proposed solution, if there simply is no other node? If it is not harmful to the "standard" setups it may as well be included in the mainline, or not? Regards, Stephan ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 11:23 ` 2.4.19pre1aa1 Stephan von Krawczynski @ 2002-03-05 17:35 ` Samuel Ortiz 0 siblings, 0 replies; 77+ messages in thread From: Samuel Ortiz @ 2002-03-05 17:35 UTC (permalink / raw) To: Stephan von Krawczynski Cc: andrea, Martin.Bligh, riel, phillips, davidsen, mfedyk, linux-kernel On Tue, 5 Mar 2002, Stephan von Krawczynski wrote: > On Mon, 4 Mar 2002 15:03:19 -0800 (PST) > Samuel Ortiz <sortiz@dbear.engr.sgi.com> wrote: > > > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > > yes, also make sure to keep this patch from SGI applied, it's very > > > important to avoid memory balancing if there's still free memory in the > > > other zones: > > > > > > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre1aa1/20_numa-mm-1 > > This patch is included (in a slightly different form) in the 2.4.17 > > discontig patch (http://sourceforge.net/projects/discontig). > > But martin may need another patch to apply. With the current > > implementation of __alloc_pages, we have 2 problems : > > 1) A node is not emptied before moving to the following node > > 2) If none of the zones on a node have more freepages than min (defined as > > min += z->pages_low), we start looking on the following node, instead of > > trying harder on the same node. > > Forgive my ignorance, but aren't these two problems completely identical in a > UP or even SMP setup? I mean what is the negative drawback in your proposed > solution, if there simply is no other node? If it is not harmful to the > "standard" setups it may as well be included in the mainline, or not? You're right. It is not harmful to the standard UMA boxes. However, the current __alloc_pages does just what it is supposed to do on those boxes. That's why very few people have been bothered by this bug. I was just waiting for Andrea or Rik's feedback before trying to push it to Marcelo. Maybe they'll find some time to review the patch soon... Cheers, Samuel.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 22:06 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 23:03 ` 2.4.19pre1aa1 Samuel Ortiz @ 2002-03-05 0:12 ` Rik van Riel 2002-03-05 6:21 ` 2.4.19pre1aa1 Martin J. Bligh 2 siblings, 0 replies; 77+ messages in thread From: Rik van Riel @ 2002-03-05 0:12 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Stephan von Krawczynski, phillips, davidsen, mfedyk, linux-kernel On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > On Mon, Mar 04, 2002 at 10:46:54AM -0800, Martin J. Bligh wrote: > > seems to me to be that the way we do current swap-out scanning is virtual, > > not physical, and thus cannot be per zone => per node. > > actually if you do process bindings the pte should be all allocated > local to the node if numa is enabled, and if there's no binding, no > matter if you have rmap or not, the ptes can be spread across the whole > system (just like the physical pages in the inactive/active lrus, > because they're not per-node). Think shared pages. With -rmap you'll scan all the page table entries mapping the pages on the current node, regardless of which node the page tables live on. Without -rmap you'll need to scan all page table entries in the system, not just the ones mapping pages on the current node. regards, Rik -- "Linux holds advantages over the single-vendor commercial OS" -- Microsoft's "Competing with Linux" document http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 22:06 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 23:03 ` 2.4.19pre1aa1 Samuel Ortiz 2002-03-05 0:12 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 6:21 ` Martin J. Bligh 2 siblings, 0 replies; 77+ messages in thread From: Martin J. Bligh @ 2002-03-05 6:21 UTC (permalink / raw) To: Andrea Arcangeli, Martin J. Bligh Cc: Stephan von Krawczynski, riel, phillips, linux-kernel >> seems to me to be that the way we do current swap-out scanning is >> virtual, not physical, and thus cannot be per zone => per node. > > actually if you do process bindings the pte should be all allocated > local to the node if numa is enabled, and if there's no binding, no > matter if you have rmap or not, the ptes can be spread across the whole > system (just like the physical pages in the inactive/active lrus, > because they're not per-node). Why does it matter if the ptes are spread across the system? I get the feeling I'm missing some magic trick here ... In reality we're not going to hard-bind every process, though we'll try to keep most of the allocations local. Imagine I have eight nodes (0..7), each with one zone (0..7). I need to free memory from zone 5 ... with the virtual scan, it seems to me that all I can do is blunder through the whole process list looking for something that happens to have pages on zone 5 that aren't being used much? Is this not expensive? Won't I end up with a whole bunch of cross-node mem transfers? M. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:18 ` 2.4.19pre1aa1 Stephan von Krawczynski 2002-03-04 18:41 ` 2.4.19pre1aa1 Stephan von Krawczynski 2002-03-04 18:46 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-04 21:37 ` Rik van Riel 2 siblings, 0 replies; 77+ messages in thread From: Rik van Riel @ 2002-03-04 21:37 UTC (permalink / raw) To: Stephan von Krawczynski Cc: Martin J. Bligh, andrea, phillips, davidsen, mfedyk, linux-kernel On Mon, 4 Mar 2002, Stephan von Krawczynski wrote: > On Mon, 04 Mar 2002 08:59:10 -0800 > "Martin J. Bligh" <Martin.Bligh@us.ibm.com> wrote: > > > 2) We can do local per-node scanning - no need to bounce > > information to and fro across the interconnect just to see what's > > worth swapping out. > > Well, you can achieve this by "attaching" the nodes' local memory (zone) > to its cpu and let the vm work preferably only on these attached zones > (regarding the list scanning and the like). This way you have no > interconnect traffic generated. But this is in no way related to rmap. But it is. Without -rmap you don't know which processes from which nodes could have mapped memory on your node, so you end up scanning the page tables of all processes on all nodes. regards, Rik -- Will hack the VM for food. http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 16:59 ` 2.4.19pre1aa1 Martin J. Bligh 2002-03-04 18:18 ` 2.4.19pre1aa1 Stephan von Krawczynski @ 2002-03-04 18:19 ` Andrea Arcangeli 2002-03-04 18:56 ` 2.4.19pre1aa1 Martin J. Bligh 2002-03-04 21:36 ` 2.4.19pre1aa1 Rik van Riel 1 sibling, 2 replies; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-04 18:19 UTC (permalink / raw) To: Martin J. Bligh Cc: Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 08:59:10AM -0800, Martin J. Bligh wrote: > > Not stability per se, but you have to admit the VM tends to > > behave badly when there's a shortage in just one memory zone. > > I believe NUMA will only make this situation worse. > > rmap would seem to buy us (at least) two major things for NUMA: > > 1) We can balance between zones easier by "swapping out" > pages to another zone. Yes, operations like "now migrate and bind this task to a certain cpu/mem pair" pretty much needs rmap or it will get the same complexity of swapout, that may be very very slow with lots of vm address space mapped. But this has nothing to do with the swap_out pass we were talking about previously. I just considered those cases (like also supporting pagetable sharing at the maximum possible levels also across random mmaps/mremaps/mprotect/mlocks/execve), and this is why I said rmap may be more useful for mm internals, rather than replacing the swap_out pass (hmm in this case the migration of the pagecache may be considered more a vm thing too though). > 2) We can do local per-node scanning - no need to bounce > information to and fro across the interconnect just to see what's > worth swapping out. 
the lru lists are global at the moment, so for the normal swapout activity rmap won't allow you to do what you mention above (furthermore rmap gives you only the pointer to the pte chain, but there's no guarantee the pte is in the same node as the physical page, even assuming we'll have per-node inactive/active lists, so you'll fall into the bouncing scenario anyways rmap or not, only the cpu usage will be lower and as a side effect you'll bounce less, but you're not avoiding the interconnect overhead with the per-node scanning). That said, I definitely agree the potential pageout/swapout scalability with rmap may be better on a very huge system with several hundred gigabytes of ram (even though the accessed-bit aging will be less fair etc..). So yes, I also of course agree that there will be benefits in killing the swap_out loop on some currently-corner-case hardware, and maybe long term, if we'll ever need to pageout heavily on a 256G ram box, it may be the only sane way to do that really no matter if it's numa or not (I think on a 256G box it will be only a matter of paging out the dirty shared mappings and dropping the clean mappings, I don't see any need to swapout there, but still to do the pageout efficiently on such a kind of machine we'll need rmap). Also note that on the modern numa (the thing I mostly care about) in a misc load (like a desktop), without special usages (like user bindings), striping virtual pages and pagecache over all the nodes will be better than restricting one task to use only the bandwidth of one bank of ram, so decreasing significantly the potential bandwidth of the global machine. Interconnects are much faster than what ram will ever provide, it's not the legacy dinosaur numa. I understand old hardware with a huge penalty while crossing the interconnects has different needs though. They're both called cc-numa but they're completely different beasts.
So I don't worry much about walking on ptes on remote nodes, it may in fact be faster than walking on ptes of the same node, and usually the dinosaurs have so much ram that they will hardly need to swapout heavily. On similar lines the alpha cpus (despite I'd put them in the "new numa" class) don't even provide the accessed bit in the pte, you can only use minor page faults to learn that. The numa point for the new hardware is that if we have N nodes, and we have apps loading each node at 100% and using 100% of the mem bandwidth from each node in a local manner without passing through the interconnects (like we can do with cpu bindings and the migration+bind API), then the performance will be better than if we stripe globally, and this is why the OS needs to be aware of numa, to optimize those cases, so if you've a certain workload you can get 100% of the performance out of the hardware; but on a misc load without a dedicated design for the machine it is running on (so if we're not able to use all the 4 nodes fully in a local manner) striping will be better (so you'll get 2/3 of the performance out of the hardware, rather than 1/4 of it because you're only using 1/4 of the global bandwidth). The same goes for shm and pagecache, page (or cacheline) striping is better there too. Note: I invented the above numbers to make the example clearer, they've no relation to any existing real hardware at all. > I suspect that the performance of NUMA under memory pressure > without the rmap stuff will be truly horrific, as we descend into > a cache-thrashing page transfer war. Depends on what kind of numa systems I think. I worry more about the complexity with lots of ram.
As said above, on a 64bit 512G system with hundreds of gigabytes of vm globally mapped at the same time, paging out hard because of some terabyte mapping marked dirty during page faults will quite certainly need rmap to pageout such dirty mappings efficiently, really no matter if it's cc-numa or not, it's mostly a complexity problem. I really don't see it as a 2.4 need :). I never said no-way rmap in 2.5. Maybe I won't agree on the implementation, but on the design I can agree: if we'll ever need to get the above workloads fast and pagecache migration for the numa bindings, we'll definitely need rmap for all kinds of user pages, not just for MAP_SHARED pages, like we have just now in 2.4 and in all previous kernels (I hope this also answers Daniel's question, otherwise please ask again). So I appreciate the work done on rmap, but I currently don't see it as a 2.4 item. > I can't see any way to fix this without some sort of rmap - any > other suggestions as to how this might be done? > > Thanks, > > Martin. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:19 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 18:56 ` Martin J. Bligh 2002-03-04 22:25 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 22:38 ` 2.4.19pre1aa1 Daniel Phillips 2002-03-04 21:36 ` 2.4.19pre1aa1 Rik van Riel 1 sibling, 2 replies; 77+ messages in thread From: Martin J. Bligh @ 2002-03-04 18:56 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel >> 1) We can balance between zones easier by "swapping out" >> pages to another zone. > > Yes, operations like "now migrate and bind this task to a certain > cpu/mem pair" pretty much needs rmap or it will get the same complexity > of swapout, that may be very very slow with lots of vm address space > mapped. But this has nothing to do with the swap_out pass we were > talking about previously. If we're out of memory on one node, and have free memory on another, during the swap-out pass it would be quicker to transfer the page to another node, ie "swap out the page to another zone" rather than swap it out to disk. This is what I mean by the above comment (though you're right, it helps with the more esoteric case of deliberate page migration too), though I probably phrased it badly enough to make it incomprehensible ;-) I guess this could help with non-NUMA architectures too - if ZONE_NORMAL is full, and ZONE_HIGHMEM has free pages, it would be nice to be able to scan ZONE_NORMAL, and transfer pages to ZONE_HIGHMEM. In reality, I suspect this won't be so useful, as there shouldn't be HIGHMEM capable page data sitting in ZONE_NORMAL unless ZONE_HIGHMEM had been full at some point in the past? And I'm not sure if we keep a bit to say where the page could have been allocated from or not ? M. PS. The rest of your email re: striping twisted my brain out of shape - I'll have to think about it some more. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:56 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-04 22:25 ` Andrea Arcangeli 2002-03-04 23:09 ` 2.4.19pre1aa1 Gerrit Huizenga 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-04 22:25 UTC (permalink / raw) To: Martin J. Bligh Cc: Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 10:56:11AM -0800, Martin J. Bligh wrote: > >> 1) We can balance between zones easier by "swapping out" > >> pages to another zone. > > > > Yes, operations like "now migrate and bind this task to a certain > > cpu/mem pair" pretty much needs rmap or it will get the same complexity > > of swapout, that may be very very slow with lots of vm address space > > mapped. But this has nothing to do with the swap_out pass we were > > talking about previously. > > If we're out of memory on one node, and have free memory on another, > during the swap-out pass it would be quicker to transfer the page to > another node, ie "swap out the page to another zone" rather than swap > it out to disk. This is what I mean by the above comment (though you're I think unless we're sure we need to split the system in parts and so there's some explicit cpu binding (like in the example I made above), it isn't worth doing migrations just because one zone is low on memory: the migration has a cost, and without bindings the scheduler is free to reschedule the task away in the next timeslice anyways, and then it's better to keep it there for cpu cache locality reasons. So I believe it's better to make sure to use all available ram in all nodes instead of doing migrations when the local node is low on mem. But this again depends on the kind of numa system, I'm considering the new numas, not the old ones with the huge penalty on the remote memory.
> right, it helps with the more esoteric case of deliberate page migration too), > though I probably phrased it badly enough to make it incomprehensible ;-) > > I guess this could help with non-NUMA architectures too - if ZONE_NORMAL > is full, and ZONE_HIGHMEM has free pages, it would be nice to be able > to scan ZONE_NORMAL, and transfer pages to ZONE_HIGHMEM. In > reality, I suspect this won't be so useful, as there shouldn't be HIGHMEM > capable page data sitting in ZONE_NORMAL unless ZONE_HIGHMEM > had been full at some point in the past? And I'm not sure if we keep a bit Exactly, this is what the per-zone point-of-view watermarks do in my tree, and this is why even if we're not able to migrate all the highmem-capable pages from lowmem to highmem (like anon memory when there's no swap, or mlocked memory) we still don't run into imbalances. Btw, to migrate anon memory without swap we wouldn't really be forced to use rmap: we could just use anonymous swapcache and then we could migrate the swapcache atomically with the pagecache_lock acquired, just like we would do with rmap. But I think the main problem of migration is "when" we should trigger it. Currently we don't need to answer this question, and the watermarks make sure we have enough lowmem resources so that migration is never mandatory. When I did the watermarks fix I also considered the migration through anon swapcache, but it wasn't a black-and-white thing; the watermarks are the better solution, for 2.4 at least I think :). > to say where the page could have been allocated from or not ? > > M. > > PS. The rest of your email re: striping twisted my brain out of shape - I'll > have to think about it some more. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 22:25 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 23:09 ` Gerrit Huizenga 2002-03-05 0:19 ` 2.4.19pre1aa1 Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Gerrit Huizenga @ 2002-03-04 23:09 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel In message <20020304232544.P20606@dualathlon.random>, Andrea Arcangeli writes: > it's better to make sure to use all available ram in all nodes instead > of doing migrations when the local node is low on mem. But this again > depends on the kind of numa system, I'm considering the new numas, not > the old ones with the huge penalty on the remote memory. Andrea, don't forget that the "old" NUMAs will soon be the "new" NUMAs again. The internal bus and clock speeds are still quite likely to increase faster than the speeds of most interconnects. And even quite a few "big SMP" machines today are really somewhat NUMA-like with a 2-to-1 remote-to-local memory latency (e.g. the Corollary interconnect used on a lot of >4-way IA32 boxes is not as fast as the two local busses). So, designing for the "new" NUMAs is fine if your code goes into production this year. But if it is going into production in two to three years, you might want to be thinking about some greater memory latency ratios for the upcoming hardware configurations... gerrit ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 23:09 ` 2.4.19pre1aa1 Gerrit Huizenga @ 2002-03-05 0:19 ` Andrea Arcangeli 2002-03-05 2:00 ` 2.4.19pre1aa1 Gerrit Huizenga 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 0:19 UTC (permalink / raw) To: Gerrit Huizenga Cc: Martin J. Bligh, Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 03:09:51PM -0800, Gerrit Huizenga wrote: > > In message <20020304232544.P20606@dualathlon.random>, Andrea Arcangeli writes: > > it's better to make sure to use all available ram in all nodes instead > > of doing migrations when the local node is low on mem. But this again > > depends on the kind of numa system, I'm considering the new numas, not > > the old ones with the huge penalty on the remote memory. > > Andrea, don't forget that the "old" NUMAs will soon be the "new" NUMAs > again. The internal bus and clock speeds are still quite likely to > increase faster than the speeds of most interconnects. And even quite For various reasons I think we'll never go back to "old" NUMA in the long run. > a few "big SMP" machines today are really somewhat NUMA-like with a > 2-to-1 remote-to-local memory latency (e.g. the Corollary interconnect > used on a lot of >4-way IA32 boxes is not as fast as the two local > busses). there's a reason for that. > So, designing for the "new" NUMAs is fine if your code goes into > production this year. But if it is going into production in two to > three years, you might want to be thinking about some greater memory > latency ratios for the upcoming hardware configurations... Disagree, but don't get me wrong, I'm not really suggesting to design for new numa only. I think linux should support both equally well, so some heuristics, like in the scheduler, will be mostly the same, but they will need different heuristics in some other place.
For example the "less frequently used ram migration instead of taking advantage of free memory in the other nodes first" should fall in the old numa category. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 0:19 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 2:00 ` Gerrit Huizenga 0 siblings, 0 replies; 77+ messages in thread From: Gerrit Huizenga @ 2002-03-05 2:00 UTC (permalink / raw) To: Andrea Arcangeli Cc: Gerrit Huizenga, Martin J. Bligh, Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel In message <20020305011907.V20606@dualathlon.random>, Andrea Arcangeli writes: > On Mon, Mar 04, 2002 at 03:09:51PM -0800, Gerrit Huizenga wrote: > > > > In message <20020304232544.P20606@dualathlon.random>, Andrea Arcangeli writes: > > > it's better to make sure to use all available ram in all nodes instead > > > of doing migrations when the local node is low on mem. But this again > > > depends on the kind of numa system, I'm considering the new numas, not > > > the old ones with the huge penalty on the remote memory. > > > > Andrea, don't forget that the "old" NUMAs will soon be the "new" NUMAs > > again. The internal bus and clock speeds are still quite likely to > > increase faster than the speeds of most interconnects. And even quite > > For various reasons I think we'll never go back to "old" NUMA in the > long run. Do those reasons involve new advances in physics? How close can you put, say, 4 CPUs? How physically close together can you put, say 64 CPUs? How fast can you arbitrate sharing/cache coherency on an interconnect? How fast does, say, Intel, increase the clock rate of a processor? How fast does the bus rate for the same chip increase? How fast does the interconnect speed increase? How fast is the L1 cache? L2? L3? L4? Basically, the trend seems to be hierarchies of latency and bandwidth, and the more loads arbitrating in a given level of the hierarchy, the greater the latency. In part, the physics and the cost of technologies seem to force a hierarchical approach.
I'm not sure why you think Physics won't dictate a return to the previous differences in latency, especially since several vendors are already working in that space... > > a few "big SMP" machines today are really somewhat NUMA-like with a > > 2-to-1 remote-to-local memory latency (e.g. the Corollary interconnect > > used on a lot of >4-way IA32 boxes is not as fast as the two local > > busses). > > there's a reason for that. > > > So, designing for the "new" NUMAs is fine if your code goes into > > production this year. But if it is going into production in two to > > three years, you might want to be thinking about some greater memory > > latency ratios for the upcoming hardware configurations... > > Disagree, but don't get me wrong, I'm not really suggesting to design > for new numa only. I think linux should support both equally well, so > some heuristics, like in the scheduler, will be mostly the same, but they > will need different heuristics in some other place. For example the > "less frequently used ram migration instead of taking advantage of free > memory in the other nodes first" should fall in the old numa category. This is where I think some of the topology representation work will help (lse and the sourceforge large system foundry). Various systems will have various types of hierarchies in memory access, latency and bandwidth. I agree that heuristics may need to be tuned per arch type, but look well at the history of hardware development and be aware that a past trend has been that local and remote bus speeds and memory access latencies have tended to stair step - with local busses stepping up much more quickly and interconnect stepping up much more slowly. And with some architectures using three and four levels of hierarchy, the differences between local and really, really remote will typically increase over a five year (or so) window. gerrit ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:56 ` 2.4.19pre1aa1 Martin J. Bligh 2002-03-04 22:25 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 22:38 ` Daniel Phillips 1 sibling, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-03-04 22:38 UTC (permalink / raw) To: Martin J. Bligh, Andrea Arcangeli Cc: Rik van Riel, Bill Davidsen, Mike Fedyk, linux-kernel On March 4, 2002 07:56 pm, Martin J. Bligh wrote: > >> 1) We can balance between zones easier by "swapping out" > >> pages to another zone. > > > > Yes, operations like "now migrate and bind this task to a certain > > cpu/mem pair" pretty much needs rmap or it will get the same complexity > > of swapout, that may be very very slow with lots of vm address space > > mapped. But this has nothing to do with the swap_out pass we were > > talking about previously. > > If we're out of memory on one node, and have free memory on another, > during the swap-out pass it would be quicker to transfer the page to > another node, ie "swap out the page to another zone" rather than swap > it out to disk. This is what I mean by the above comment (though you're > right, it helps with the more esoteric case of deliberate page migration too), > though I probably phrased it badly enough to make it incomprehensible ;-) > > I guess this could help with non-NUMA architectures too - if ZONE_NORMAL > is full, and ZONE_HIGHMEM has free pages, it would be nice to be able > to scan ZONE_NORMAL, and transfer pages to ZONE_HIGHMEM. In > reality, I suspect this won't be so useful, as there shouldn't be HIGHMEM > capable page data sitting in ZONE_NORMAL unless ZONE_HIGHMEM > had been full at some point in the past? That's the normal case when the cache is loaded up. > And I'm not sure if we keep a bit > to say where the page could have been allocated from or not ? No, we don't record the gfp_mask or, in the case of discontigmem, the zonelist. Perhaps this information could be recovered from the mapping, or lack of it.
I don't know how you'd deduce that a page was required to be in zone_dma, for example, without specifically remembering that at page_alloc time. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 18:19 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 18:56 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-04 21:36 ` Rik van Riel 2002-03-04 23:01 ` 2.4.19pre1aa1 Andrea Arcangeli 1 sibling, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-03-04 21:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > 2) We can do local per-node scanning - no need to bounce > > information to and fro across the interconnect just to see what's > > worth swapping out. > > the lru lists are global at the moment, so for the normal swapout > activity rmap won't allow you to do what you mention above Actually, the lru lists are per zone and have been for a while. The thing which was lacking up to now is a pagecache_lru_lock per zone, because this clashes with truncate(). Arjan came up with a creative solution to fix this problem and I'll integrate it into -rmap soon... > (furthermore rmap gives you only the pointer to the pte chain, but there's > no guarantee the pte is in the same node as the physical page, even > assuming we'll have per-node inactive/active lists, so you'll fall into > the bouncing scenario anyways rmap or not, only the cpu usage will be > lower and as side effect you'll bounce less, but you're not avoiding the > interconnect overhead with the per-node scanning). Well, if we need to free memory from node A, we will need to do that anyway. If we don't scan the page tables from node B, maybe we'll never be able to free memory from node A. The only thing -rmap does is make sure we only scan the page tables belonging to the physical pages in node A, instead of having to scan the page tables of all processes in all nodes.
> Also note that on the modern numa (the thing I mostly care about) in > misc load (like a desktop), without special usages (like user bindings), > striping virtual pages and pagecache over all the nodes will be better > than restricting one task to use only the bandwidth of one bank of ram, > so decreasing significantly the potential bandwidth of the global > machine. This is an interesting point and suggests we want to start the zone fallback chains from different places for each CPU; this both balances the allocation and can avoid the CPUs looking at "each other's" zone and bouncing cachelines around. > depends on what kind of numa systems I think. I worry more about the > complexity with lots of ram. As said above on a 64bit 512G system > with hundreds of gigabytes of vm globally mapped at the same time, paging > out hard because of some terabyte mapping marked dirty during page > faults, will quite certainly need rmap to pageout such dirty mappings > efficiently, really no matter if it's cc-numa or not, it's mostly a > complexity problem. Indeed. > I really don't see it as a 2.4 need :). I never said no-way rmap in 2.5. > Maybe I won't agree on the implementation, I'd appreciate it if you could look at the implementation and look for areas to optimise. However, note that I don't believe -rmap is already at the stage where optimisation is appropriate. Or rather, now is the time for macro optimisations, not for micro optimisations. regards, Rik -- Will hack the VM for food. http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 21:36 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-04 23:01 ` Andrea Arcangeli 2002-03-04 23:11 ` 2.4.19pre1aa1 Rik van Riel 2002-03-05 5:38 ` 2.4.19pre1aa1 Martin J. Bligh 0 siblings, 2 replies; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-04 23:01 UTC (permalink / raw) To: Rik van Riel Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 06:36:47PM -0300, Rik van Riel wrote: > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > > > 2) We can do local per-node scanning - no need to bounce > > > information to and fro across the interconnect just to see what's > > > worth swapping out. > > > > the lru lists are global at the moment, so for the normal swapout > > activitiy rmap won't allow you to do what you mention above > > Actually, the lru lists are per zone and have been for a while. They're not in my tree, and for very good reasons; Ben made that mistake the first time at some point during 2.3. The per-zone information has a big downside: all normal machines (say with 64M of ram or 2G of ram), where the theoretical O(N) complexity is perfectly fine for lowmem dma/normal allocations, will be hurt badly by the per-zone lrus. You're the one saying that the system load is very low and that it's better to make more accurate page replacement decisions. I think they may be worthwhile on a hundred gigabyte machine only, but the whole point is that in such a box you'll have only one zone anyway, so per-zone in that case will match per-node :). So I think they should be at least per-node in 2.5 to make 99% of the userbase happy. And again, whether they should be global or per-node depends on what kind of numa it is, so it would probably be much better to have them per-node or global depending on a compile-time configuration #define. > The thing which was lacking up to now is a pagecache_lru_lock > per zone, because this clashes with truncate(). 
Arjan came up > with a creative solution to fix this problem and I'll integrate > it into -rmap soon... Making it a per-lru spinlock is a natural scalability optimization, but anyway pagemap_lru_lock isn't a very critical spinlock. Before worrying about pagemap_lru_lock I'd worry about the pagecache_lock, I think (even the pagecache_lock doesn't matter much in most usages). Of course it also depends on the workload, but the important workloads will hit the pagecache_lock first. > > (furthmore rmap gives you only the pointer to the pte chain, but there's > > no guarantee the pte is in the same node as the physical page, even > > assuming we'll have per-node inactive/active list, so you'll fall into > > the bouncing scenario anyways rmap or not, only the cpu usage will be > > lower and as side effect you'll bounce less, but you're not avoiding the > > interconnet overhead with the per-node scanning). > > Well, if we need to free memory from node A, we will need to > do that anyway. If we don't scan the page tables from node B, > maybe we'll never be able to free memory from node A. > > The only thing -rmap does is make sure we only scan the page > tables belonging to the physical pages in node A, instead of > having to scan the page tables of all processes in all nodes. Correct. And as said, this is a scalability optimization: the more ptes you have, the more you want to skip the ones belonging to pages in node B, or you may end up wasting too much system time on a 512G system etc... > I'd appreciate it if you could look at the implementation and > look for areas to optimise. However, note that I don't believe I didn't have time to look too much into it yet (I did only a short review so far), but I will certainly do that when I have more time, looking at it with a 2.5 long-term perspective. I didn't like too much that you resurrected some of the old code that I don't think pays off. 
I would have preferred it if you had put rmap on top of my vm patch without reintroducing the older logic. I still don't see the need for inactive_dirty, and I don't like the fact that you dropped classzone and put in the unreliable "plenty stuff" that reintroduces design bugs that will lead kswapd to go crazy again. But ok, I don't worry too much about that; the rmap bits that maintain the additional information are orthogonal to the other changes, and that's the interesting part of the patch after all. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
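The global-vs-per-zone LRU question at the heart of this exchange can be sketched in a few lines of C. This is only an illustration of the data-structure shape being argued about, not the actual 2.4 -aa or -rmap code; all names here are invented for the example:

```c
/* Toy model of per-zone LRU lists (illustrative only; not kernel code).
 * With one inactive list per zone, reclaim aimed at a lowmem zone never
 * has to walk pages that belong to other zones -- the isolation Rik wants
 * for big boxes, and the extra bookkeeping Andrea objects to for small ones. */
#include <assert.h>
#include <stddef.h>

struct page {
    struct page *lru_next;
    int zone_id;
};

struct zone_lru {
    struct page *inactive_head;   /* reclaim candidates for this zone only */
    unsigned long nr_inactive;
};

static void lru_add_inactive(struct zone_lru *lru, struct page *p)
{
    p->lru_next = lru->inactive_head;
    lru->inactive_head = p;
    lru->nr_inactive++;
}

/* A reclaim scan touches only the target zone's list. */
static unsigned long scan_zone(const struct zone_lru *lru)
{
    unsigned long scanned = 0;
    for (struct page *p = lru->inactive_head; p; p = p->lru_next)
        scanned++;
    return scanned;
}
```

With a single global list, the same scan would have to skip every page belonging to the wrong zone before finding pages eligible for a dma/normal allocation, which is the O(N) cost Andrea worries about.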
* Re: 2.4.19pre1aa1 2002-03-04 23:01 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-04 23:11 ` Rik van Riel 2002-03-04 23:52 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-05 5:38 ` 2.4.19pre1aa1 Martin J. Bligh 1 sibling, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-03-04 23:11 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > On Mon, Mar 04, 2002 at 06:36:47PM -0300, Rik van Riel wrote: > > On Mon, 4 Mar 2002, Andrea Arcangeli wrote: > > > > > > 2) We can do local per-node scanning - no need to bounce > > > > information to and fro across the interconnect just to see what's > > > > worth swapping out. > > > > > > the lru lists are global at the moment, so for the normal swapout > > > activitiy rmap won't allow you to do what you mention above > > > > Actually, the lru lists are per zone and have been for a while. > > They're not in my tree Yeah, but you shouldn't judge rmap by what's in your tree ;)) Balancing is quite simple, too. > > The thing which was lacking up to now is a pagecache_lru_lock > > per zone, because this clashes with truncate(). Arjan came up > > with a creative solution to fix this problem and I'll integrate > > it into -rmap soon... > > making it a per-lru spinlock is natural scalability optimization, but > anyways pagemap_lru_lock isn't a very critical spinlock. That's what I used to think, too. The folks at IBM showed me I was wrong and the pagemap_lru_lock is critical. > > I'd appreciate it if you could look at the implementation and > > look for areas to optimise. However, note that I don't believe > > I didn't had time to look too much into that yet (I had only a short > review so far), but I will certainly do that in some more time, looking > at it with a 2.5 long term prospective. I didn't liked too much that you > resurrected some of the old code that I don't think pays off. 
I would > preferred if you had rmap on top of my vm patch without reintroducing > the older logics. I still don't see the need of inactive_dirty and the > fact you dropped classzone and put the unreliable "plenty stuff" that > reintroduces design bugs that will lead kswapd go crazy again. But ok, I > don't worry too much about that, the rmap bits that maintains the > additional information are orthogonal with the other changes and that's > the interesting part of the patch after all. OK, lets try to put classzone on top of a Hammer "NUMA" system. You'll have one CPU starting to allocate from zone A, falling back to zone B and then further down. Another CPU starts allocating at zone B, falling back to A and then further down. How would you express this in classzone ? I've looked at it for quite a while and haven't found a clean way to get this situation right with classzone, which is why I have removed it. As for kswapd going crazy, that is nicely fixed by having per zone lru lists... ;) regards, Rik -- "Linux holds advantages over the single-vendor commercial OS" -- Microsoft's "Competing with Linux" document http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
Check l-k and see how many kswapd-crazy reports there have been since classzone was introduced into the kernel, and incidentally we just saw a new kswapd report for the rmap patch without swap (it's hard to trigger without swap, I know; with swap such behaviour will happen trivially, because without swap every single page of anonymous ram becomes unpageable just like the kernel data, but the very same kswapd-crazy problem would happen if swap was there too, it would only take more time to reproduce, like in the 2.4.x series with x < 10). It's the same problem you told me about at the kernel summit, remember? classzone has the advantage of being very low cost, and it also increases the fairness of the allocations, compared to a system where you may end up working for others rather than for yourself like with the "plenty" stuff. It doesn't only fix kswapd. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
Check l-k and see how many kswapd-crazy reports there are been since classzone is been introduced into the kernel, and incidentally we just seen new kswapd report for the rmap patch without swap (it's hard to trigger I know without swap, with swap such behaviour will happen trivially because without swap every single page of anonymous ram will become unpageable just like the kernel data, but the very same kswapd-crazy problem would happen if swap was there too, it would only take more time to reproduce like in the 2.4.x series with x < 10). it's the same problem you told me at the kernel summit, remember? classzone has the advantage of being very low cost and it also increases the fairness of the allocations, compared to a system where you may end working for others rather than for yourself like with the "plenty" stuff. It not only fixes kswapd. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 23:52 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 0:01 ` Rik van Riel 2002-03-05 1:05 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-05 8:35 ` 2.4.19pre1aa1 arjan 1 sibling, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-03-05 0:01 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > On Mon, Mar 04, 2002 at 08:11:21PM -0300, Rik van Riel wrote: > > You'll have one CPU starting to allocate from zone A, falling > > back to zone B and then further down. > > what is zone A/B, I guess you mean node A/B etc.. Zones are called > NORMAL/DMA/HIGHMEM so I'm confused. OK, now think about a NUMA-with-small-n system like AMD Hammer. One of the CPUs will want to allocate from HIGHMEM zone A while another CPU will start allocating at HIGHMEM zone B. Of course, with memory access time between the "nodes" being not too different you'll want to fall back to the "other" HIGHMEM zone before falling back to the (single) NORMAL and DMA zones. This could be expressed as: "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA How would you express this situation in classzone ? > > As for kswapd going crazy, that is nicely fixed by having > > per zone lru lists... ;) > > I don't see how per-zone lru lists are related to the kswapd deadlock. > as soon as the ZONE_DMA will be filled with filedescriptors or with > pagetables (or whatever non pageable/shrinkable kernel datastructure you > prefer) kswapd will go mad without classzone, period. So why would kswapd not go mad _with_ classzone ? I bet the workaround for that problem has very little to do with classzones... regards, Rik -- "Linux holds advantages over the single-vendor commercial OS" -- Microsoft's "Competing with Linux" document http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 0:01 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 1:05 ` Andrea Arcangeli 2002-03-05 1:26 ` 2.4.19pre1aa1 Rik van Riel 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 1:05 UTC (permalink / raw) To: Rik van Riel Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: > This could be expressed as: > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA Highmem? Let's assume you speak about "normal" and "dma" only, of course. And that's not always the right zonelist layout. If an allocation asks for ram from a certain node, like during the ram bindings, we should use the current layout of the numa zonelist. If node A is the preferred one, then we should allocate from node A first; other logic (see the point-of-view watermarks in my tree) will make sure you fall back into node B if we risk becoming unbalanced across the zones. However, the layout you mentioned above is sometimes the right layout; for example, for allocations with no "preference" on the node to allocate from, your layout would make perfect sense. But at the moment we lack an API to choose whether the node allocation should be strict or not. That said, see below for how to implement the zonelist layout you suggested on top of the current vm (regardless of whether it's the best generic layout or not). > > How would you express this situation in classzone ? Check my tree, the 20_numa-mm-1 patch: to implement your above layout, you need to make a 10-line change to build_zonelists so that it fills the zonelist array with normal B before dma (and the other way around for the normal classzone zonelist on node B). The memory balancing in my tree will just do the right thing after that; check the memclass based on zone_idx (that was needed for the old numa too, in fact). In short it fits beautifully into it. 
> So why would kswapd not go mad _with_ classzone ? Because nobody asks for GFP_DMA and nobody cares about the state of the DMA classzone. And if somebody does, it is right that kswapd tries to make some progress, but if nobody asks there's no good reason to waste CPU. The scsi pool being allocated from DMA is not a problem; that never happens at runtime. If it happens before production, kswapd will stop in a few seconds after a failed try. > I bet the workaround for that problem has very little > to do with classzones... That is not a workaround; the memory balancing knows what classzone it has to work on, and so it doesn't fall into the senseless trap of trying to free a classzone that nobody cares about. My current VM code is very advanced about knowing every detail; it's not a guess like "let's look at which zones have plenty of memory". Likewise, it supports NUMA layouts like the one you mentioned above just fine (even if you want to add highmem or any other zones). Note that this is all unrelated to rmap; we can just put rmap on top of my VM bits without any problem, as that's completely orthogonal to the other bits. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
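A rough idea of the kind of build_zonelists() change Andrea describes (filling each node's zonelist with the remote normal zones before falling back to dma) might look like this. This is a hypothetical reconstruction, not the code from the 20_numa-mm-1 patch; the two-node layout and all names are assumed for illustration:

```c
/* Hypothetical two-node layout: one normal zone per node, one shared
 * dma zone.  Not the real 20_numa-mm-1 code. */
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 2

struct zone { int node; const char *name; };

static struct zone dma      = { 0, "dma" };
static struct zone normal_a = { 0, "normal" };
static struct zone normal_b = { 1, "normal" };
static struct zone *normals[MAX_NODES] = { &normal_a, &normal_b };

/* Fill `list` for node n: local normal first, then the remote normal
 * zones, and dma only as the last resort.  NULL-terminated. */
static void build_zonelist(int n, struct zone **list)
{
    int k = 0;
    list[k++] = normals[n];                  /* local node first */
    for (int i = 0; i < MAX_NODES; i++)
        if (i != n)
            list[k++] = normals[i];          /* remote nodes next */
    list[k++] = &dma;                        /* dma last resort */
    list[k] = NULL;
}
```

The allocator then simply walks the list in order, so no extra logic is needed at allocation time; the policy lives entirely in how the array is filled at boot.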
* Re: 2.4.19pre1aa1 2002-03-05 1:05 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 1:26 ` Rik van Riel 2002-03-05 1:40 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-05 3:05 ` 2.4.19pre1aa1 Bill Davidsen 0 siblings, 2 replies; 77+ messages in thread From: Rik van Riel @ 2002-03-05 1:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: > > This could be expressed as: > > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA > > Highmem? Let's assume you speak about "normal" and "dma" only of course. > > And that's not always the right zonelist layout. If an allocation asks for > ram from a certain node, like during the ram bindings, we should use the > current layout of the numa zonelist. If node A is the preferred, than we > should allocate from node A first, You're forgetting about the fact that this NUMA box only has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple HIGHMEM zones... This makes the fallback pattern somewhat more complex. regards, Rik -- "Linux holds advantages over the single-vendor commercial OS" -- Microsoft's "Competing with Linux" document http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 1:26 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 1:40 ` Andrea Arcangeli 2002-03-05 1:55 ` 2.4.19pre1aa1 Martin J. Bligh 2002-03-05 12:22 ` 2.4.19pre1aa1 Rik van Riel 2002-03-05 3:05 ` 2.4.19pre1aa1 Bill Davidsen 1 sibling, 2 replies; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 1:40 UTC (permalink / raw) To: Rik van Riel Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote: > On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: > > > This could be expressed as: > > > > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA > > > > Highmem? Let's assume you speak about "normal" and "dma" only of course. > > > > And that's not always the right zonelist layout. If an allocation asks for > > ram from a certain node, like during the ram bindings, we should use the > > current layout of the numa zonelist. If node A is the preferred, than we > > should allocate from node A first, > > You're forgetting about the fact that this NUMA box only The example you made doesn't have highmem at all. > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple > HIGHMEM zones... It has multiple normal zones and only one dma zone. I'm not forgetting that. > This makes the fallback pattern somewhat more complex. It's not more complex than the current way, it's just different, and it's not strict, but it's the best one for allocations that don't "prefer" memory from a certain node. OTOH we don't have an API to define 'weak' or 'strict' allocation behaviour, so the default had better be the 'strict' one like in oldnuma. In fact in the future we may also want a way to define a "very strict" allocation, meaning it won't fall back into the other nodes at all, even if there's plenty of memory free on them. 
An API needs to be built with some bitflag specifying the "strength" of the numa affinity required. Your layout provides the 'weakest' approach, which is perfectly fine for some kinds of non-numa-aware allocations, just as "very strict" will be necessary for the relocation bindings (if we cannot relocate into the right node there's no point in relocating into another node; let's ignore complex topologies for now :). Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
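The "strength" API Andrea sketches in words could take roughly this shape. The flag names and the pick_node() helper are invented for illustration; no such interface existed in the kernel at the time:

```c
/* Hypothetical NUMA-affinity strength flags (invented names). */
#include <assert.h>

#define NUMA_WEAK        0   /* prefer local, spill freely to any node   */
#define NUMA_STRICT      1   /* exhaust the preferred node, then spill   */
#define NUMA_VERY_STRICT 2   /* never spill: fail if preferred is full   */

static int pick_node(int preferred, const unsigned long *free,
                     int nr_nodes, int strength)
{
    if (free[preferred] > 0)
        return preferred;          /* all strengths try the local ram first */
    if (strength == NUMA_VERY_STRICT)
        return -1;                 /* e.g. relocation bindings: fail instead */
    for (int i = 0; i < nr_nodes; i++)
        if (i != preferred && free[i] > 0)
            return i;              /* WEAK/STRICT fall back to other nodes */
    return -1;
}
```

A real implementation would also distinguish weak from strict in *when* it spills (weak spilling before the preferred node is fully drained, to balance bandwidth); the sketch collapses that distinction to stay short.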
* Re: 2.4.19pre1aa1 2002-03-05 1:40 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 1:55 ` Martin J. Bligh 2002-03-05 5:16 ` 2.4.19pre1aa1 Samuel Ortiz 2002-03-05 12:22 ` 2.4.19pre1aa1 Rik van Riel 1 sibling, 1 reply; 77+ messages in thread From: Martin J. Bligh @ 2002-03-05 1:55 UTC (permalink / raw) To: Andrea Arcangeli, Rik van Riel, Matt Dobson Cc: Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel > it's not more complex than the current way, it's just different and it's > not strict, but it's the best one for allocations that doesn't "prefer" > memory from a certain node, but OTOH we don't have an API to define > 'waek' or 'strict' allocation bheaviour so the default would better be > the 'strict' one like in oldnuma. Infact in the future we may want to > have also a way to define a "very strict" allocation, that means it > won't fallback into the other nodes at all, even if there's plenty of > memory free on them. An API needs to be built with some bitflag > specifying the "strength" of the numa affinity required. Your layout > provides the 'weakest' approch, that is perfectly fine for some kind of > non-numa-aware allocations, just like "very strict" will be necessary > for the relocation bindings (if we cannot relocate in the right node > there's no point to relocate in another node, let's ingore complex > topologies for now :). Actually, we (IBM) do have a simple API to do this that Matt Dobson has been working on that's nearing readiness (& publication). I've been coding up a patch to _alloc_pages today that has both a strict and non-strict binding in it. It first goes through your "preferred" set of nodes (defined on a per-process basis), then again looking for any node that you've not strictly banned from the list - I hope that's sufficient for what you're discussing? 
I'll try to publish my part tomorrow, definitely this week - it'll be easy to see how it works in conjunction with the API, though the rest of the API might take a little longer to arrive .... Martin. ^ permalink raw reply [flat|nested] 77+ messages in thread
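Martin's two-pass policy (the "preferred" set of nodes first, then any node not strictly banned) can be sketched with per-process node bitmasks. This is a hypothetical reconstruction of the idea; his actual _alloc_pages patch and the IBM API may well differ:

```c
/* Two-pass node selection over per-process node masks (invented names;
 * not Martin's actual patch).  Pass 1: only nodes in `preferred`.
 * Pass 2: any node not in `banned`. */
#include <assert.h>

static int alloc_node(unsigned long preferred, unsigned long banned,
                      const unsigned long *free, int nr_nodes)
{
    for (int i = 0; i < nr_nodes; i++)
        if (((preferred >> i) & 1UL) && free[i] > 0)
            return i;                /* pass 1: preferred nodes */
    for (int i = 0; i < nr_nodes; i++)
        if (!((banned >> i) & 1UL) && free[i] > 0)
            return i;                /* pass 2: anything not banned */
    return -1;                       /* nothing allowed had free pages */
}
```

In the flag vocabulary of the preceding messages, an empty banned mask gives weak behaviour, while banning every non-preferred node gives the very-strict case.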
* Re: 2.4.19pre1aa1 2002-03-05 1:55 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-05 5:16 ` Samuel Ortiz 2002-03-05 5:47 ` 2.4.19pre1aa1 Martin J. Bligh 0 siblings, 1 reply; 77+ messages in thread From: Samuel Ortiz @ 2002-03-05 5:16 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrea Arcangeli, Rik van Riel, Matt Dobson, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, 4 Mar 2002, Martin J. Bligh wrote: > > it's not more complex than the current way, it's just different and it's > > not strict, but it's the best one for allocations that doesn't "prefer" > > memory from a certain node, but OTOH we don't have an API to define > > 'waek' or 'strict' allocation bheaviour so the default would better be > > the 'strict' one like in oldnuma. Infact in the future we may want to > > have also a way to define a "very strict" allocation, that means it > > won't fallback into the other nodes at all, even if there's plenty of > > memory free on them. An API needs to be built with some bitflag > > specifying the "strength" of the numa affinity required. Your layout > > provides the 'weakest' approch, that is perfectly fine for some kind of > > non-numa-aware allocations, just like "very strict" will be necessary > > for the relocation bindings (if we cannot relocate in the right node > > there's no point to relocate in another node, let's ingore complex > > topologies for now :). > > Actually, we (IBM) do have a simple API to do this that Matt Dobson > has been working on that's nearing readiness (& publication). I've > been coding up a patch to _alloc_pages today that has both a strict > and non-strict binding in it. It first goes through your "preferred" set of > nodes (defined on a per-process basis), then again looking for any > node that you've not strictly banned from the list - I hope that's > sufficient for what you're discussing? 
I'll try to publish my part tommorow, > definitely this week - it'll be easy to see how it works in conjunction with > the API, though the rest of the API might be a little longer before arrival .... SGI's CpuMemSets is supposed to do that as well. We are now able to bind a process to a set of memories, and soon we will be able to specify how strict the allocation can be. Right now, if a process is allowed to allocate memory from node 0, 2, and 3, it won't look outside of this set. The memory set granularity is smaller though, because it depends on the process, and the cpu (and thus the node) this process is running on. The CpuMemSets have been tested and are available on the Linux Scalability Effort sourceforge page, if you want to give it a try... Samuel. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 5:16 ` 2.4.19pre1aa1 Samuel Ortiz @ 2002-03-05 5:47 ` Martin J. Bligh 2002-03-05 6:33 ` 2.4.19pre1aa1 Samuel Ortiz 0 siblings, 1 reply; 77+ messages in thread From: Martin J. Bligh @ 2002-03-05 5:47 UTC (permalink / raw) To: Samuel Ortiz, Martin J. Bligh Cc: Andrea Arcangeli, Rik van Riel, Matt Dobson, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel > SGI's CpuMemSets is supposed to do that as well. We are now able to bind a > process to a set of memories, and soon we will be able to specify how > strict the allocation can be. Right now, if a process is allowed to > allocate memory from node 0, 2, and 3, it won't look outside of this set. > The memory set granularity is smaller though, because it depends on the > process, and the cpu (and thus the node) this process is running on. > The CpuMemSets have been tested and are available on the Linux Scalability > Effort sourceforge page, if you want to give it a try... The problem with CpuMemSets is that it's mind-bogglingly complex - I think we need something simpler ... at least to start with. M. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 5:47 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-05 6:33 ` Samuel Ortiz 0 siblings, 0 replies; 77+ messages in thread From: Samuel Ortiz @ 2002-03-05 6:33 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrea Arcangeli, Rik van Riel, Matt Dobson, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Mon, 4 Mar 2002, Martin J. Bligh wrote: > > SGI's CpuMemSets is supposed to do that as well. We are now able to bind a > > process to a set of memories, and soon we will be able to specify how > > strict the allocation can be. Right now, if a process is allowed to > > allocate memory from node 0, 2, and 3, it won't look outside of this set. > > The memory set granularity is smaller though, because it depends on the > > process, and the cpu (and thus the node) this process is running on. > > The CpuMemSets have been tested and are available on the Linux Scalability > > Effort sourceforge page, if you want to give it a try... > > The problem with CpuMemSets is that it's mind-bogglingly > complex - I think we need something simpler ... at least > to start with. Yes, I agree with the fact that it is complex. Right now, you need to get a good understanding of them in order for them to be useful. However, I think this is the price to pay for something that covers a large range of cases, from the simplest ones to very complex ones. The simpler implementation you are talking about will be useless as soon as you need to cover more complex cases. A good thing would be to define an API on top of CpuMemSets to allow interested people to use them quickly for those simple cases. Samuel. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 1:40 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-05 1:55 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-05 12:22 ` Rik van Riel 2002-03-05 15:01 ` 2.4.19pre1aa1 Andrea Arcangeli [not found] ` <Pine.LNX.4.44L.0203050921510.1413-100000@duckman.distro.conecti va> 1 sibling, 2 replies; 77+ messages in thread From: Rik van Riel @ 2002-03-05 12:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote: > > On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: > > > > This could be expressed as: > > > > > > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA > > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA > the example you made doesn't have highmem at all. > > > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple > > HIGHMEM zones... > > it has multiple zone normal and only one zone dma. I'm not forgetting > that. Your reality doesn't seem to correspond well with NUMA-Q reality. Rik -- Will hack the VM for food. http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 12:22 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 15:01 ` Andrea Arcangeli [not found] ` <Pine.LNX.4.44L.0203050921510.1413-100000@duckman.distro.conecti va> 1 sibling, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 15:01 UTC (permalink / raw) To: Rik van Riel Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Tue, Mar 05, 2002 at 09:22:25AM -0300, Rik van Riel wrote: > On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > > On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote: > > > On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > > > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: > > > > > This could be expressed as: > > > > > > > > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA > > > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA > > > the example you made doesn't have highmem at all. > > > > > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple > > > HIGHMEM zones... > > > > it has multiple zone normal and only one zone dma. I'm not forgetting > > that. > > Your reality doesn't seem to correspond well with NUMA-Q > reality. I'm not sure I understand your point; the current code should be fine for all the classic numas, and for the case you were making too. Anyway, whatever is wrong for NUMA-Q, it's not a problem introduced with the classzone design, because that's completely orthogonal to whatever numa heuristics are in the allocator and memory balancing. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
[parent not found: <Pine.LNX.4.44L.0203050921510.1413-100000@duckman.distro.conecti va>]
* Re: 2.4.19pre1aa1 [not found] ` <Pine.LNX.4.44L.0203050921510.1413-100000@duckman.distro.conecti va> @ 2002-03-05 15:29 ` Martin J. Bligh 2002-03-05 15:43 ` 2.4.19pre1aa1 Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Martin J. Bligh @ 2002-03-05 15:29 UTC (permalink / raw) To: Rik van Riel, Andrea Arcangeli Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel --On Tuesday, March 05, 2002 9:22 AM -0300 Rik van Riel <riel@conectiva.com.br> wrote: > On Tue, 5 Mar 2002, Andrea Arcangeli wrote: >> On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote: >> > On Tue, 5 Mar 2002, Andrea Arcangeli wrote: >> > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: >> > > > This could be expressed as: >> > > > >> > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA >> > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA > >> the example you made doesn't have highmem at all. >> >> > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple >> > HIGHMEM zones... >> >> it has multiple zone normal and only one zone dma. I'm not forgetting >> that. > > Your reality doesn't seem to correspond well with NUMA-Q > reality. I think the difference is that he has a 64 bit vaddr space, and I don't ;-) Thus all mem to him is ZONE_NORMAL (not sure why he still has a ZONE_DMA, unless he reused it for the 4Gb boundary). Andrea, is my assumption correct? On a 32 bit arch (eg ia32) everything above 896Mb (by default) is ZONE_HIGHMEM. Thus if I have > 896Mb in the first node, I will have one ZONE_NORMAL in node 0, and a ZONE_HIGHMEM in every node. If I have < 896Mb in the first node, then I have a ZONE_NORMAL in every node up to and including the 896 breakpoint, and a ZONE_HIGHMEM in every node from the breakpoint up (including the breakpoint node). Thus the number of zones = number of nodes + 1. M. ^ permalink raw reply [flat|nested] 77+ messages in thread
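Martin's arithmetic (a ZONE_NORMAL in every node up to the 896Mb boundary, a ZONE_HIGHMEM in every node from the boundary up, with the boundary node contributing both, so NORMAL + HIGHMEM zones = nodes + 1) can be checked with a small helper. This is only an illustration; the node sizes in the checks are made up:

```c
/* Count NORMAL + HIGHMEM zones for a 32-bit layout where everything
 * above `boundary` MB is HIGHMEM (the single ZONE_DMA is not counted
 * here).  Illustrative helper, not kernel code. */
#include <assert.h>

static int count_zones(const unsigned long *node_mb, int nr_nodes,
                       unsigned long boundary)
{
    unsigned long base = 0;
    int zones = 0;
    for (int i = 0; i < nr_nodes; i++) {
        unsigned long top = base + node_mb[i];
        if (base < boundary)
            zones++;            /* this node contributes a NORMAL zone */
        if (top > boundary)
            zones++;            /* this node contributes a HIGHMEM zone */
        base = top;
    }
    return zones;
}
```

Whether the boundary falls inside the first node or a later one, exactly one node is counted twice, which is where the "+ 1" comes from.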
* Re: 2.4.19pre1aa1 2002-03-05 15:29 ` 2.4.19pre1aa1 Martin J. Bligh @ 2002-03-05 15:43 ` Andrea Arcangeli 0 siblings, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 15:43 UTC (permalink / raw) To: Martin J. Bligh Cc: Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel On Tue, Mar 05, 2002 at 07:29:11AM -0800, Martin J. Bligh wrote: > --On Tuesday, March 05, 2002 9:22 AM -0300 Rik van Riel > <riel@conectiva.com.br> wrote: > > >On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > >>On Mon, Mar 04, 2002 at 10:26:30PM -0300, Rik van Riel wrote: > >>> On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > >>> > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: > >>> > > This could be expressed as: > >>> > > > >>> > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA > >>> > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA > > > >>the example you made doesn't have highmem at all. > >> > >>> has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple > >>> HIGHMEM zones... > >> > >>it has multiple zone normal and only one zone dma. I'm not forgetting > >>that. > > > >Your reality doesn't seem to correspond well with NUMA-Q > >reality. > > I think the difference is that he has a 64 bit vaddr space, > and I don't ;-) Thus all mem to him is ZONE_NORMAL (not sure > why he still has a ZONE_DMA, unless he reused it for the 4Gb > boundary). Andrea, is my assumtpion correct? Correct, but the current code from SGI should be just fine for NUMA-Q too; if you have highmem, your zonelist will automatically be set up accordingly. I don't see problems there. > > On a 32 bit arch (eg ia32) everything above 896Mb (by default) > is ZONE_HIGHMEM. Thus if I have > 896Mb in the first node, > I will have one ZONE_NORMAL in node 0, and a ZONE_HIGHMEM > in every node. 
If I have < 896Mb in the first node, then > I have a ZONE_NORMAL in every node up to and including the > 896 breakpoint, and a ZONE_HIGHMEM in every node from the > breakpoint up (including the breakpoint node). Thus the number > of zones = number of nodes + 1. > > M. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 1:26 ` 2.4.19pre1aa1 Rik van Riel 2002-03-05 1:40 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 3:05 ` Bill Davidsen 1 sibling, 0 replies; 77+ messages in thread From: Bill Davidsen @ 2002-03-05 3:05 UTC (permalink / raw) To: Rik van Riel; +Cc: Linux Kernel Mailing List On Mon, 4 Mar 2002, Rik van Riel wrote: > On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > > On Mon, Mar 04, 2002 at 09:01:31PM -0300, Rik van Riel wrote: > > > This could be expressed as: > > > > > > "node A" HIGHMEM A -> HIGHMEM B -> NORMAL -> DMA > > > "node B" HIGHMEM B -> HIGHMEM A -> NORMAL -> DMA > > > > Highmem? Let's assume you speak about "normal" and "dma" only of course. > > > > And that's not always the right zonelist layout. If an allocation asks for > > ram from a certain node, like during the ram bindings, we should use the > > current layout of the numa zonelist. If node A is the preferred, than we > > should allocate from node A first, > > You're forgetting about the fact that this NUMA box only > has 1 ZONE_NORMAL and 1 ZONE_DMA while it has multiple > HIGHMEM zones... > > This makes the fallback pattern somewhat more complex. Both HIGHMEM (on CPU) and NUMA nodes remind me somewhat of the days when "band switched" memory was supposed to be the answer to limited addressing space. The trick was to have things in the right place and not eat up the capacity with moving data. I think you're right that the problem is not as simple as several posters have suggested. I'm afraid someone will have to do some clever adaptive work here; the speed of the interconnects to memory will change as both evolve. And it's easier to make a node smaller than to put them closer together. I'm awaiting the IBM paper(s) on this, I don't find my Hypercube and PVM experience to fit well anymore :-( -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 23:52 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-05 0:01 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 8:35 ` arjan 2002-03-05 12:41 ` 2.4.19pre1aa1 Rik van Riel 2002-03-05 14:55 ` 2.4.19pre1aa1 Andrea Arcangeli 1 sibling, 2 replies; 77+ messages in thread From: arjan @ 2002-03-05 8:35 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel In article <20020305005215.U20606@dualathlon.random> you wrote: > I don't see how per-zone lru lists are related to the kswapd deadlock. > as soon as the ZONE_DMA will be filled with filedescriptors or with > pagetables (or whatever non pageable/shrinkable kernel datastructure you > prefer) kswapd will go mad without classzone, period. So does it with class zone on a scsi system.... ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 8:35 ` 2.4.19pre1aa1 arjan @ 2002-03-05 12:41 ` Rik van Riel 2002-03-05 15:10 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-06 0:09 ` 2.4.19pre1aa1 Daniel Phillips 2002-03-05 14:55 ` 2.4.19pre1aa1 Andrea Arcangeli 1 sibling, 2 replies; 77+ messages in thread From: Rik van Riel @ 2002-03-05 12:41 UTC (permalink / raw) To: arjan; +Cc: Andrea Arcangeli, linux-kernel On Tue, 5 Mar 2002 arjan@fenrus.demon.nl wrote: > In article <20020305005215.U20606@dualathlon.random> you wrote: > > > I don't see how per-zone lru lists are related to the kswapd deadlock. > > as soon as the ZONE_DMA will be filled with filedescriptors or with > > pagetables (or whatever non pageable/shrinkable kernel datastructure you > > prefer) kswapd will go mad without classzone, period. > > So does it with class zone on a scsi system.... Furthermore, there is another problem which is present in both 2.4 vanilla, -aa and -rmap. Suppose that (1) we are low on memory in ZONE_NORMAL and (2) we have enough free memory in ZONE_HIGHMEM and (3) the memory in ZONE_NORMAL is for a large part taken by buffer heads belonging to pages in ZONE_HIGHMEM. In that case, none of the VMs will bother freeing the buffer heads associated with the highmem pages and kswapd will have to work hard trying to free something else in ZONE_NORMAL. Now before you say this is a strange theoretical situation, I've seen it here when using highmem emulation. Low memory was limited to 30 MB (16 MB ZONE_DMA, 14 MB ZONE_NORMAL) and the rest of the machine was HIGHMEM. Buffer heads were taking up 8 MB of low memory, dcache and inode cache were a good second with 2 MB and 5 MB respectively. How to efficiently fix this case? I wouldn't know right now... However, I guess we might want to come up with a fix because it's a quite embarrassing scenario ;) regards, Rik -- Will hack the VM for food. http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 12:41 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 15:10 ` Andrea Arcangeli 2002-03-05 16:57 ` 2.4.19pre1aa1 Rik van Riel 2002-03-06 0:09 ` 2.4.19pre1aa1 Daniel Phillips 1 sibling, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 15:10 UTC (permalink / raw) To: Rik van Riel; +Cc: arjan, linux-kernel On Tue, Mar 05, 2002 at 09:41:56AM -0300, Rik van Riel wrote: > On Tue, 5 Mar 2002 arjan@fenrus.demon.nl wrote: > > In article <20020305005215.U20606@dualathlon.random> you wrote: > > > > > I don't see how per-zone lru lists are related to the kswapd deadlock. > > > as soon as the ZONE_DMA will be filled with filedescriptors or with > > > pagetables (or whatever non pageable/shrinkable kernel datastructure you > > > prefer) kswapd will go mad without classzone, period. > > > > So does it with class zone on a scsi system.... > > Furthermore, there is another problem which is present in > both 2.4 vanilla, -aa and -rmap. Please check the code. scsi_resize_dma_pool is called when you insmod a module. It doesn't really matter if kswapd runs for 2 seconds during insmod. And anyways if there were some buggy code allocating dma in a flood by mistake on a high end machine, then I could fix it completely by tracking down when somebody freed dma pages over some watermark, but that would add additional accounting that I don't feel is needed, simply because if you don't need the DMA zone you shouldn't use GFP_DMA. I feel fixing scsi is the right thing if anything (but again, I don't see any flood allocation during production with scsi). > Suppose that (1) we are low on memory in ZONE_NORMAL and > (2) we have enough free memory in ZONE_HIGHMEM and (3) the > memory in ZONE_NORMAL is for a large part taken by buffer > heads belonging to pages in ZONE_HIGHMEM.
> > In that case, none of the VMs will bother freeing the buffer > heads associated with the highmem pages and kswapd will have wrong, classzone will do that, both for NORMAL and HIGHMEM allocations. You won't free the buffer heads only if you do DMA allocations and by luck there are no buffer heads in the DMA zone, otherwise it will free the bh during DMA allocations too. remember highmem classzone means all the ram in the machine, not just the highmem zone. > to work hard trying to free something else in ZONE_NORMAL. > > Now before you say this is a strange theoretical situation, > I've seen it here when using highmem emulation. Low memory > was limited to 30 MB (16 MB ZONE_DMA, 14 MB ZONE_NORMAL) > and the rest of the machine was HIGHMEM. Buffer heads were > taking up 8 MB of low memory, dcache and inode cache were a > good second with 2 MB and 5 MB respectively. > > > How to efficiently fix this case ? I wouldn't know right now... I don't see anything to fix, that should be just handled flawlessly. > However, I guess we might want to come up with a fix because it's > a quite embarassing scenario ;) > > regards, > > Rik > -- > Will hack the VM for food. > > http://www.surriel.com/ http://distro.conectiva.com/ Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 15:10 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 16:57 ` Rik van Riel 2002-03-05 18:26 ` 2.4.19pre1aa1 Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Rik van Riel @ 2002-03-05 16:57 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: arjan, linux-kernel On Tue, 5 Mar 2002, Andrea Arcangeli wrote: > > Suppose that (1) we are low on memory in ZONE_NORMAL and > > (2) we have enough free memory in ZONE_HIGHMEM and (3) the > > memory in ZONE_NORMAL is for a large part taken by buffer > > heads belonging to pages in ZONE_HIGHMEM. > > > > In that case, none of the VMs will bother freeing the buffer > > heads associated with the highmem pages and kswapd will have > > wrong, classzone will do that, both for NORMAL and HIGHMEM allocations. Let me explain it to you again: 1) ZONE_NORMAL + ZONE_DMA is low on free memory 2) the memory is taken by buffer heads, these buffer heads belong to pagecache pages that live in highmem 3) the highmem zone has enough free memory As you probably know, shrink_caches() has the following line of code to make sure it won't try to free highmem pages: if (!memclass(page->zone, classzone)) continue; Of course, this line of code also means it will not take away the buffer heads from highmem pages, so the ZONE_NORMAL and ZONE_DMA memory USED BY THE BUFFER HEADS will not be freed. regards, Rik -- Will hack the VM for food. http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 16:57 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 18:26 ` Andrea Arcangeli 2002-03-05 18:30 ` 2.4.19pre1aa1 Arjan van de Ven 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 18:26 UTC (permalink / raw) To: Rik van Riel; +Cc: arjan, linux-kernel On Tue, Mar 05, 2002 at 01:57:13PM -0300, Rik van Riel wrote: > Let me explain it to you again: > > 1) ZONE_NORMAL + ZONE_DMA is low on free memory > > 2) the memory is taken by buffer heads, these > buffer heads belong to pagecache pages that > live in highmem > > 3) the highmem zone has enough free memory > > > As you probably know, shrink_caches() has the following line > of code to make sure it won't try to free highmem pages: > > if (!memclass(page->zone, classzone)) > continue; > > Of course, this line of code also means it will not take > away the buffer heads from highmem pages, so the ZONE_NORMAL > and ZONE_DMA memory USED BY THE BUFFER HEADS will not be > freed. I'm very sorry for not understanding your previous email, many thanks for explaining this again since I understood perfectly this time :). Right you are. I don't see this as a showstopper, but I think it would be nice to do something about it in 2.4 too. I think the best fix is to define a memclass_related() that checks the page->buffers to see if there's any lowmem bh queued on top of such page. The check cannot be embedded into memclass itself; we need this additional check because we shouldn't consider the freeing of a lowmem bh a "classzone normal progress", and furthermore we should only get into the path of the bh-freeing, not the path of the page-freeing, to avoid throwing away highmem pagecache due to a lowmem shortage. And if we freed something significant but we think we failed (because we cannot account the memclass_related as a progress), the .high watermark will let us go ahead with the allocation later in page_alloc.c.
The above again fits beautifully into the classzone logic and it makes 100% sure not to waste a single page of highmem due to a lowmem shortage. It's nearly impossible that classzone collides with anything good because classzone is the natural thing to do. btw, I think you've the very same problem with the "plenty" logic, the highmem zone will look as "plenty of ram free" and you won't balance it, despite you should because otherwise the bh wouldn't be released. The memclass_related will just make the VM accurate on those bh, and later it can be extended to other metadata too if necessary, so we'll always do the right thing. Another approach would be to add the pages backing the bh into the lru too, but then we'd need to mess with the slab and new bitflags, new methods and so I don't think it's the best solution. The only good reason for putting new kind of entries in the lru would be to age them too the same way as the other pages, but we don't need that with the bh (they're just in, and we mostly care only about the page age, not the bh age). Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 18:26 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 18:30 ` Arjan van de Ven 2002-03-05 19:12 ` 2.4.19pre1aa1 Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Arjan van de Ven @ 2002-03-05 18:30 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rik van Riel, linux-kernel On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote: > Another approch would be to add the pages backing the bh into the lru > too, but then we'd need to mess with the slab and new bitflags, new > methods and so I don't think it's the best solution. The only good > reason for putting new kind of entries in the lru would be to age them > too the same way as the other pages, but we don't need that with the bh > (they're just in, and we mostly care only about the page age, not the bh > age). For 2.5 I kind of like this idea. There is one issue though: to make this work really well we'd probably need a ->prepareforfreepage() or similar page op (which for page cache pages can be equal to writepage() ) which the vm can use to prepare this page for freeing. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 18:30 ` 2.4.19pre1aa1 Arjan van de Ven @ 2002-03-05 19:12 ` Andrew Morton 2002-03-05 23:03 ` 2.4.19pre1aa1 Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-03-05 19:12 UTC (permalink / raw) To: Arjan van de Ven; +Cc: Andrea Arcangeli, Rik van Riel, linux-kernel Arjan van de Ven wrote: > > On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote: > > > Another approch would be to add the pages backing the bh into the lru > > too, but then we'd need to mess with the slab and new bitflags, new > > methods and so I don't think it's the best solution. The only good > > reason for putting new kind of entries in the lru would be to age them > > too the same way as the other pages, but we don't need that with the bh > > (they're just in, and we mostly care only about the page age, not the bh > > age). > > For 2.5 I kind of like this idea. There is one issue though: to make > this work really well we'd probably need a ->prepareforfreepage() > or similar page op (which for page cache pages can be equal to writepage() > ) which the vm can use to prepare this page for freeing. If we stop using buffer_heads for pagecache I/O, we don't have this problem. I'm showing a 20% reduction in CPU load for large reads. Which is a *lot*, given that read load is dominated by copy_to_user. 2.5 is significantly less efficient than 2.4 at this time. Some of that seems to be due to worsened I-cache footprint, and a lot of it is due to the way buffer_heads now have a BIO wrapper layer. Take a look at submit_bh(). The writing is on the wall, guys. - ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 19:12 ` 2.4.19pre1aa1 Andrew Morton @ 2002-03-05 23:03 ` Andrea Arcangeli 2002-03-05 23:05 ` 2.4.19pre1aa1 Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 23:03 UTC (permalink / raw) To: Andrew Morton; +Cc: Arjan van de Ven, Rik van Riel, linux-kernel On Tue, Mar 05, 2002 at 11:12:46AM -0800, Andrew Morton wrote: > Arjan van de Ven wrote: > > > > On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote: > > > > > Another approch would be to add the pages backing the bh into the lru > > > too, but then we'd need to mess with the slab and new bitflags, new > > > methods and so I don't think it's the best solution. The only good > > > reason for putting new kind of entries in the lru would be to age them > > > too the same way as the other pages, but we don't need that with the bh > > > (they're just in, and we mostly care only about the page age, not the bh > > > age). > > > > For 2.5 I kind of like this idea. There is one issue though: to make > > this work really well we'd probably need a ->prepareforfreepage() > > or similar page op (which for page cache pages can be equal to writepage() > > ) which the vm can use to prepare this page for freeing. > > If we stop using buffer_heads for pagecache I/O, we don't have this problem. > > I'm showing a 20% reduction in CPU load for large reads. Which is a *lot*, > given that read load is dominated by copy_to_user. > > 2.5 is significantly less efficient than 2.4 at this time. Some of that > seems to be due to worsened I-cache footprint, and a lot of it is due > to the way buffer_heads now have a BIO wrapper layer. Indeed, at the moment bio is making the thing more expensive in CPU terms, even if OTOH it makes rawio fly. > Take a look at submit_bh(). The writing is on the wall, guys. > > - Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 23:03 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 23:05 ` Andrea Arcangeli 2002-03-05 23:24 ` 2.4.19pre1aa1 Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 23:05 UTC (permalink / raw) To: Andrew Morton; +Cc: Arjan van de Ven, Rik van Riel, linux-kernel On Wed, Mar 06, 2002 at 12:03:14AM +0100, Andrea Arcangeli wrote: > On Tue, Mar 05, 2002 at 11:12:46AM -0800, Andrew Morton wrote: > > Arjan van de Ven wrote: > > > > > > On Tue, Mar 05, 2002 at 07:26:04PM +0100, Andrea Arcangeli wrote: > > > > > > > Another approch would be to add the pages backing the bh into the lru > > > > too, but then we'd need to mess with the slab and new bitflags, new > > > > methods and so I don't think it's the best solution. The only good > > > > reason for putting new kind of entries in the lru would be to age them > > > > too the same way as the other pages, but we don't need that with the bh > > > > (they're just in, and we mostly care only about the page age, not the bh > > > > age). > > > > > > For 2.5 I kind of like this idea. There is one issue though: to make > > > this work really well we'd probably need a ->prepareforfreepage() > > > or similar page op (which for page cache pages can be equal to writepage() > > > ) which the vm can use to prepare this page for freeing. > > > > If we stop using buffer_heads for pagecache I/O, we don't have this problem. > > > > I'm showing a 20% reduction in CPU load for large reads. Which is a *lot*, > > given that read load is dominated by copy_to_user. BTW, I noticed one of my last emails was a private reply so I'll answer here too for the buffer_head pagecache I/O part: Having persistence on the physical I/O information is a good thing, so you don't need to resolve logical to physical block at every I/O, and bio has a cost to setup too.
The information we carry on the bh isn't superfluous, it's needed for the I/O so even if you don't use the buffer_head you will still need some other memory to hold such information, or alternatively you need to call get_block (and serialize in the fs) at every I/O even if you've plenty of ram free. So I don't think the current setup is that stupid, current bh only sucks for the rawio and that's fixed by bio. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 23:05 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 23:24 ` Andrew Morton 2002-03-05 23:37 ` 2.4.19pre1aa1 Andrea Arcangeli 0 siblings, 1 reply; 77+ messages in thread From: Andrew Morton @ 2002-03-05 23:24 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Arjan van de Ven, Rik van Riel, linux-kernel Andrea Arcangeli wrote: > > BTW, I noticed one of my last my email was a private reply so I'll > answer here too for the buffer_head pagecache I/O part: Heh. Me too. > Having persistence on the physical I/O information is a good thing, so > you don't need to resolve logical to physical block at every I/O and bio > has a cost to setup too. The information we carry on the bh isn't > superflous, it's needed for the I/O so even if you don't use the > buffer_head you will still need some other memory to hold such > information, or alternatively you need to call get_block (and serialize > in the fs) at every I/O even if you've plenty of ram free. So I don't > think the current setup is that stupid, current bh only sucks for the > rawio and that's fixed by bio. The small benefit of caching the get_block result in the buffers just isn't worth it. At present, a one-megabyte write to disk requires the allocation and freeing and manipulation and locking of 256 buffer_heads and 256 BIOs. lru_list_lock, hash_table_lock, icache/dcache thrashing, etc, etc. It's an *enormous* amount of work. I'm doing the same amount of work with as few as two (yes, 2) BIOs. This is not something theoretical. I have numbers, and code. 20% speedup on a 2-way with a workload which is dominated by copy_*_user. It'll be more significant on larger machines, on machines with higher core/main memory speed ratios, on machines with higher I/O bandwidth. (OK, that bit was theoretical). - ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 23:24 ` 2.4.19pre1aa1 Andrew Morton @ 2002-03-05 23:37 ` Andrea Arcangeli 2002-03-05 23:51 ` 2.4.19pre1aa1 Andrew Morton 0 siblings, 1 reply; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 23:37 UTC (permalink / raw) To: Andrew Morton; +Cc: Arjan van de Ven, Rik van Riel, linux-kernel On Tue, Mar 05, 2002 at 03:24:49PM -0800, Andrew Morton wrote: > Andrea Arcangeli wrote: > > > > BTW, I noticed one of my last my email was a private reply so I'll > > answer here too for the buffer_head pagecache I/O part: > > Heh. Me too. > > > Having persistence on the physical I/O information is a good thing, so > > you don't need to resolve logical to physical block at every I/O and bio > > has a cost to setup too. The information we carry on the bh isn't > > superflous, it's needed for the I/O so even if you don't use the > > buffer_head you will still need some other memory to hold such > > information, or alternatively you need to call get_block (and serialize > > in the fs) at every I/O even if you've plenty of ram free. So I don't > > think the current setup is that stupid, current bh only sucks for the > > rawio and that's fixed by bio. > > The small benefit of caching the get_block result in the buffers > just isn't worth it. > > At present, a one-megabyte write to disk requires the allocation > and freeing and manipulation and locking of 256 buffer_heads and > 256 BIOs. lru_list_lock, hash_table_lock, icache/dcache > thrashing, etc, etc. It's an *enormous* amount of work. > > I'm doing the same amount of work with as few as two (yes, 2) BIOs. > > This is not something theoretical. I have numbers, and code. > 20% speedup on a 2-way with a workload which is dominated > by copy_*_user. It'll be more significant on larger machines, > on machines with higher core/main memory speed ratios, on > machines with higher I/O bandwidth. (OK, that bit was theoretical). 
then let's cut and paste this part as well :) depends what you're doing, if you do `cp /dev/zero .` and the fs is lucky enough to have free contiguous space I definitely can see the improvement of highlevel merging, but that's not always what you're doing with the fs, for example that's not the case for kernel compiles and small files where you'll be always fragmented and where the bio will at max hold 4k and you keep rewriting into cache. The times you enter get_block you enter a fs lock, rather than staying at the per-page lock; it's not additional locking, the bh on the pagecache doesn't need any additional locking. So for a kernel compile the current situation is an obvious advantage in performance and scalability (fs code definitely doesn't scale at the moment). But ok, globally it will probably be better to drop the bh since we have to work on the bio anyways somehow and so at the very least we don't want to be slowed down from the bio logic in the physically contiguous pagecache flood case. I just meant the bh isn't totally pointless and it could be shrunk as Arjan said in a private email. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 23:37 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-05 23:51 ` Andrew Morton 0 siblings, 0 replies; 77+ messages in thread From: Andrew Morton @ 2002-03-05 23:51 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Arjan van de Ven, Rik van Riel, linux-kernel Andrea Arcangeli wrote: > > > depends what you're doing, if you do `cp /dev/zero .` and the fs is > lucky enough to have free contigous space I definitely can see the > improvement of highlevel merging, but that's not always what you're > doing with the fs, for example that's not the case for kernel compiles > and small files where you'll be always fragmented and where the bio will > at max hold 4k and you keep rewriting into cache. Cache effects. We touch the buffers at prepare_write. We touch them again at commit_write(). And at writeout time. And at page reclaim time. I think it's this general white-noise cost which is causing the funny profiles which I'm seeing. (For example, with no-buffers, the cost of the IDE driver setup and interrupt handler has nosedived). > The times you enter > get_block you enter in a fs lock, rather than staying at the per-page > lock, it's not additional locking, the bh on the pagecahce doesn't need > any additional locking. For writes, we have the lru list insertion, and the hashtable lock (twice). > So for a kernel compile the current situation is > an obvious advantage in performance and scalability (fs code definitely > doesn't scale at the moment). mm.. Delayed allocation means that the short-lived files never get a disk mapping at all. And yes, if all files are 100% fragmented then the BIO aggregation doesn't help as much. > But ok, globally it will be probably better to drop the bh since we have > to work on the bio anyways somehow and so at the very least we don't > want to be slowed down from the bio logic in the physically contigous > pagecache flood case. 
> > I just meant the bh isn't totally pointless and it could be shrunk as > Arjan said in a private email. bh represents a disk block. It's a wrapper around a section of the block device's pagecache pages. We'll always need a representation of disk blocks. For filesystem metadata. - ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 12:41 ` 2.4.19pre1aa1 Rik van Riel 2002-03-05 15:10 ` 2.4.19pre1aa1 Andrea Arcangeli @ 2002-03-06 0:09 ` Daniel Phillips 1 sibling, 0 replies; 77+ messages in thread From: Daniel Phillips @ 2002-03-06 0:09 UTC (permalink / raw) To: Rik van Riel, arjan; +Cc: Andrea Arcangeli, linux-kernel On March 5, 2002 01:41 pm, Rik van Riel wrote: > On Tue, 5 Mar 2002 arjan@fenrus.demon.nl wrote: > > In article <20020305005215.U20606@dualathlon.random> you wrote: > > > > > I don't see how per-zone lru lists are related to the kswapd deadlock. > > > as soon as the ZONE_DMA will be filled with filedescriptors or with > > > pagetables (or whatever non pageable/shrinkable kernel datastructure you > > > prefer) kswapd will go mad without classzone, period. > > > > So does it with class zone on a scsi system.... > > Furthermore, there is another problem which is present in > both 2.4 vanilla, -aa and -rmap. > > Suppose that (1) we are low on memory in ZONE_NORMAL and > (2) we have enough free memory in ZONE_HIGHMEM and (3) the > memory in ZONE_NORMAL is for a large part taken by buffer > heads belonging to pages in ZONE_HIGHMEM. > > In that case, none of the VMs will bother freeing the buffer > heads associated with the highmem pages and kswapd will have > to work hard trying to free something else in ZONE_NORMAL. > > Now before you say this is a strange theoretical situation, > I've seen it here when using highmem emulation. Low memory > was limited to 30 MB (16 MB ZONE_DMA, 14 MB ZONE_NORMAL) > and the rest of the machine was HIGHMEM. Buffer heads were > taking up 8 MB of low memory, dcache and inode cache were a > good second with 2 MB and 5 MB respectively. > > > How to efficiently fix this case ? I wouldn't know right now... > However, I guess we might want to come up with a fix because it's > a quite embarassing scenario ;) There's the short term fix - hack the vm - and the long term fix: get rid of buffers. 
A buffer does three jobs at the moment: 1) cache the physical block number 2) io handle for a file block 3) data handle for a file block, including locking The physical block number could be moved either into the struct page - which is undesirable since it wastes space for pages that don't have physical blocks - or my preferred solution, move it into the page cache radix tree. For (2) we have a whole flock of solutions on the way. I guess bio does the job quite nicely as Andrew Morton demonstrated last week. For (3), my idea is to generalize the size of the object referred to by struct page so that it can match the filesystem block size. This is still in the research stage, and there are a few issues I'm looking at, but the more I look the more practical it seems. How nice it would be to get rid of the page->buffers->page tangle, for one thing. -- Daniel ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-05 8:35 ` 2.4.19pre1aa1 arjan 2002-03-05 12:41 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 14:55 ` Andrea Arcangeli 1 sibling, 0 replies; 77+ messages in thread From: Andrea Arcangeli @ 2002-03-05 14:55 UTC (permalink / raw) To: arjan; +Cc: linux-kernel On Tue, Mar 05, 2002 at 08:35:51AM +0000, arjan@fenrus.demon.nl wrote: > In article <20020305005215.U20606@dualathlon.random> you wrote: > > > I don't see how per-zone lru lists are related to the kswapd deadlock. > > as soon as the ZONE_DMA will be filled with filedescriptors or with > > pagetables (or whatever non pageable/shrinkable kernel datastructure you > > prefer) kswapd will go mad without classzone, period. > > So does it with class zone on a scsi system.... as said in another message such pool isn't refilled in a flood. Andrea ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1 2002-03-04 23:01 ` 2.4.19pre1aa1 Andrea Arcangeli 2002-03-04 23:11 ` 2.4.19pre1aa1 Rik van Riel @ 2002-03-05 5:38 ` Martin J. Bligh 2002-03-05 6:45 ` 2.4.19pre1aa1 David Lang 1 sibling, 1 reply; 77+ messages in thread From: Martin J. Bligh @ 2002-03-05 5:38 UTC (permalink / raw) To: Andrea Arcangeli, Rik van Riel Cc: Martin J. Bligh, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel > They're not in my tree and for very good reasons, Ben did such mistake > the first time at some point during 2.3. You've a big downside with the > per-zone information, all normal machines (like with 64M of ram or 2G of > ram) where theorical O(N) complexity is perfectly fine for lowmem > dma/normal allocations, will get hurted very much by the per-node lrus. I'm not sure why it has to be a big impact for the "common desktop" machine - they should only have one zone anyway. ZONE_DMA should shrivel up and die in a lonely corner. Yeah, OK, keep it as a back-compatibility option for those museum pieces that need it, but personally I'd make ISA DMA support a config option defaulting to off ... maybe it's possible to do dynamically (just stick no pages in it, though I suspect it's too late by the time we know). Hardly any common desktop will need HIGHMEM support, and those that do will probably get enough kickback from per-zone things to pay for the cost. To me, per-node would probably be about as good, but I don't think per-zone is as bad as you think. > making it a per-lru spinlock is natural scalability optimization, > but anyways pagemap_lru_lock isn't a very critical spinlock. see my other email - it's worse in rmap. M. ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1
  2002-03-05  5:38 ` 2.4.19pre1aa1 Martin J. Bligh
@ 2002-03-05  6:45 ` David Lang
  0 siblings, 0 replies; 77+ messages in thread
From: David Lang @ 2002-03-05 6:45 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Andrea Arcangeli, Rik van Riel, Daniel Phillips, Bill Davidsen, Mike Fedyk, linux-kernel

1G x86 machines are becoming fairly common, and they either need to
waste RAM or turn on highmem.

David Lang

On Mon, 4 Mar 2002, Martin J. Bligh wrote:

> > They're not in my tree, and for very good reasons; Ben made that
> > mistake the first time at some point during 2.3. There's a big
> > downside to the per-zone information: all normal machines (with,
> > say, 64M or 2G of RAM), where the theoretical O(N) complexity is
> > perfectly fine for lowmem dma/normal allocations, will be hurt
> > badly by the per-node LRUs.
>
> I'm not sure why it has to be a big impact for the "common desktop"
> machine - they should only have one zone anyway. ZONE_DMA should
> shrivel up and die in a lonely corner. Yeah, OK, keep it as a
> backward-compatibility option for those museum pieces that need it,
> but personally I'd make ISA DMA support a config option defaulting
> to off ... maybe it's possible to do it dynamically (just put no
> pages in it, though I suspect it's too late by the time we know).
> Hardly any common desktop will need HIGHMEM support, and those
> that do will probably get enough kickback from per-zone things to
> pay for the cost.
>
> To me, per-node would probably be about as good, but I don't think
> per-zone is as bad as you think.
>
> > making it a per-lru spinlock is a natural scalability optimization,
> > but anyway pagemap_lru_lock isn't a very critical spinlock.
>
> see my other email - it's worse in rmap.
>
> M.
^ permalink raw reply	[flat|nested] 77+ messages in thread
[parent not found: <200203021958.g22JwKq08818@Port.imtp.ilyichevsk.odessa.ua>]
* Re: 2.4.19pre1aa1
  [not found] ` <200203021958.g22JwKq08818@Port.imtp.ilyichevsk.odessa.ua>
@ 2002-03-02 20:47 ` Andrea Arcangeli
  2002-03-02 20:58 ` 2.4.19pre1aa1 Robert Love
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2002-03-02 20:47 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: Bill Davidsen, Mike Fedyk, linux-kernel, Rik van Riel

On Sat, Mar 02, 2002 at 09:57:49PM -0200, Denis Vlasenko wrote:
> If rmap is really better than the current VM, it will be merged into
> the head development branch (2.5). There is no anti-rmap conspiracy :-)

Indeed.

Andrea

^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1
  2002-03-02 20:47 ` 2.4.19pre1aa1 Andrea Arcangeli
@ 2002-03-02 20:58 ` Robert Love
  2002-03-05 22:16 ` 2.4.19pre1aa1 Bill Davidsen
  0 siblings, 1 reply; 77+ messages in thread
From: Robert Love @ 2002-03-02 20:58 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Denis Vlasenko, Bill Davidsen, Mike Fedyk, linux-kernel, Rik van Riel

On Sat, 2002-03-02 at 15:47, Andrea Arcangeli wrote:
> On Sat, Mar 02, 2002 at 09:57:49PM -0200, Denis Vlasenko wrote:
> > If rmap is really better than the current VM, it will be merged into
> > the head development branch (2.5). There is no anti-rmap conspiracy :-)
>
> Indeed.

Of note: I don't think anyone "loses" whether one VM is merged or not.
A reverse-mapping VM is a significant redesign of our current VM
approach, and if it proves better, yes, I suspect (and hope) it will be
merged into 2.5. But that doesn't mean the 2.4 VM is worse, per se.

Robert Love

^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: 2.4.19pre1aa1
  2002-03-02 20:58 ` 2.4.19pre1aa1 Robert Love
@ 2002-03-05 22:16 ` Bill Davidsen
  0 siblings, 0 replies; 77+ messages in thread
From: Bill Davidsen @ 2002-03-05 22:16 UTC (permalink / raw)
To: Robert Love; +Cc: Andrea Arcangeli, Linux Kernel Mailing List, Rik van Riel

On 2 Mar 2002, Robert Love wrote:

> On Sat, 2002-03-02 at 15:47, Andrea Arcangeli wrote:
> > On Sat, Mar 02, 2002 at 09:57:49PM -0200, Denis Vlasenko wrote:
> > > If rmap is really better than the current VM, it will be merged
> > > into the head development branch (2.5). There is no anti-rmap
> > > conspiracy :-)
> >
> > Indeed.
>
> Of note: I don't think anyone "loses" whether one VM is merged or not.
> A reverse-mapping VM is a significant redesign of our current VM
> approach, and if it proves better, yes, I suspect (and hope) it will
> be merged into 2.5.

As noted, I do use both flavors of VM. But in practical terms the delay
getting the "performance" changes - rmap, preempt, the scheduler - into
a stable kernel will be 18-24 months by my guess: 12-18 months to 2.6,
and six months before Linus opens 2.7 and lets things gel. So to the
extent that people who would be using those kernels get less
performance, or less responsiveness, I guess they are the only ones who
lose.

Feel free to tell me it won't be that long, or that 2.5 will be stable
enough for production use, but be prepared to have people post the
release dates from 1.2 to 2.0, 2.0 to 2.2, and 2.2 to 2.4, and just
laugh about stability. There are a lot of neat new things in 2.5, and
they will take a relatively long time to become stable.

No one wants to limit the development of 2.5; at least the posts I read
are in favor of more change rather than less. In any case, I agree
there are no "losers" in that sense.

-- 
bill davidsen <davidsen@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 77+ messages in thread
end of thread, other threads:[~2002-03-06 0:14 UTC | newest]
Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
2002-02-28 2:57 2.4.19pre1aa1 rwhron
-- strict thread matches above, loose matches on Subject: below --
2002-02-27 12:50 2.4.19pre1aa1 Andrea Arcangeli
2002-02-28 22:11 ` 2.4.19pre1aa1 Bill Davidsen
2002-03-01 1:30 ` 2.4.19pre1aa1 Mike Fedyk
2002-03-01 3:26 ` 2.4.19pre1aa1 Bill Davidsen
2002-03-01 3:46 ` 2.4.19pre1aa1 Mike Fedyk
2002-03-01 12:51 ` 2.4.19pre1aa1 Rik van Riel
2002-03-01 18:37 ` 2.4.19pre1aa1 Mike Fedyk
2002-03-01 10:17 ` 2.4.19pre1aa1 Marco Colombo
2002-03-01 11:37 ` 2.4.19pre1aa1 Alan Cox
2002-03-02 2:06 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-02 2:28 ` 2.4.19pre1aa1 Alan Cox
2002-03-02 3:30 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-03 21:38 ` 2.4.19pre1aa1 Daniel Phillips
2002-03-04 0:49 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 1:46 ` 2.4.19pre1aa1 Daniel Phillips
2002-03-04 2:25 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 3:22 ` 2.4.19pre1aa1 Daniel Phillips
2002-03-04 12:41 ` 2.4.19pre1aa1 Rik van Riel
2002-03-04 14:05 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 14:23 ` 2.4.19pre1aa1 Rik van Riel
2002-03-04 16:10 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 16:28 ` 2.4.19pre1aa1 Rik van Riel
2002-03-04 16:59 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-04 18:18 ` 2.4.19pre1aa1 Stephan von Krawczynski
2002-03-04 18:41 ` 2.4.19pre1aa1 Stephan von Krawczynski
2002-03-04 18:46 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-04 22:06 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 23:03 ` 2.4.19pre1aa1 Samuel Ortiz
2002-03-05 11:23 ` 2.4.19pre1aa1 Stephan von Krawczynski
2002-03-05 17:35 ` 2.4.19pre1aa1 Samuel Ortiz
2002-03-05 0:12 ` 2.4.19pre1aa1 Rik van Riel
2002-03-05 6:21 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-04 21:37 ` 2.4.19pre1aa1 Rik van Riel
2002-03-04 18:19 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 18:56 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-04 22:25 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 23:09 ` 2.4.19pre1aa1 Gerrit Huizenga
2002-03-05 0:19 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 2:00 ` 2.4.19pre1aa1 Gerrit Huizenga
2002-03-04 22:38 ` 2.4.19pre1aa1 Daniel Phillips
2002-03-04 21:36 ` 2.4.19pre1aa1 Rik van Riel
2002-03-04 23:01 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-04 23:11 ` 2.4.19pre1aa1 Rik van Riel
2002-03-04 23:52 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 0:01 ` 2.4.19pre1aa1 Rik van Riel
2002-03-05 1:05 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 1:26 ` 2.4.19pre1aa1 Rik van Riel
2002-03-05 1:40 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 1:55 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-05 5:16 ` 2.4.19pre1aa1 Samuel Ortiz
2002-03-05 5:47 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-05 6:33 ` 2.4.19pre1aa1 Samuel Ortiz
2002-03-05 12:22 ` 2.4.19pre1aa1 Rik van Riel
2002-03-05 15:01 ` 2.4.19pre1aa1 Andrea Arcangeli
[not found] ` <Pine.LNX.4.44L.0203050921510.1413-100000@duckman.distro.conectiva>
2002-03-05 15:29 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-05 15:43 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 3:05 ` 2.4.19pre1aa1 Bill Davidsen
2002-03-05 8:35 ` 2.4.19pre1aa1 arjan
2002-03-05 12:41 ` 2.4.19pre1aa1 Rik van Riel
2002-03-05 15:10 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 16:57 ` 2.4.19pre1aa1 Rik van Riel
2002-03-05 18:26 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 18:30 ` 2.4.19pre1aa1 Arjan van de Ven
2002-03-05 19:12 ` 2.4.19pre1aa1 Andrew Morton
2002-03-05 23:03 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 23:05 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 23:24 ` 2.4.19pre1aa1 Andrew Morton
2002-03-05 23:37 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 23:51 ` 2.4.19pre1aa1 Andrew Morton
2002-03-06 0:09 ` 2.4.19pre1aa1 Daniel Phillips
2002-03-05 14:55 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-05 5:38 ` 2.4.19pre1aa1 Martin J. Bligh
2002-03-05 6:45 ` 2.4.19pre1aa1 David Lang
[not found] ` <200203021958.g22JwKq08818@Port.imtp.ilyichevsk.odessa.ua>
2002-03-02 20:47 ` 2.4.19pre1aa1 Andrea Arcangeli
2002-03-02 20:58 ` 2.4.19pre1aa1 Robert Love
2002-03-05 22:16 ` 2.4.19pre1aa1 Bill Davidsen